The Powerful Capabilities of an NVIDIA Blackwell Server

2026-06-25

Key Takeaways

An NVIDIA Blackwell server is built around the B200 GPU, a dual-die chip with 208 billion transistors, 192GB of HBM3e memory, and a second-generation Transformer Engine that runs native FP4 math.
A single B200 delivers up to 20 petaFLOPS of FP4 AI performance, roughly 8TB/s of memory bandwidth, and connects to its neighbours over fifth-generation NVLink at 1.8TB/s.
Blackwell ships in three main shapes: an 8-GPU HGX/DGX B200 server, the GB200 Grace Blackwell Superchip, and the rack-scale GB200 NVL72.
The GB200 NVL72 wires 72 B200 GPUs and 36 Grace CPUs into one liquid-cooled rack that behaves like a single GPU, hitting 1.44 exaFLOPS of FP4 and 130TB/s of all-to-all NVLink bandwidth.
NVIDIA rates the NVL72 at 30x faster real-time inference for trillion-parameter models and 25x more performance per watt than air-cooled H100 systems.
The Blackwell Ultra refresh (B300 and GB300 NVL72) raises memory to 288GB per GPU, delivers roughly 1.5x the dense FP4 compute, and doubles attention-layer speed for reasoning and agentic workloads.
A full Blackwell rack draws about 120kW and weighs around 1.36 tonnes, forcing liquid cooling and a redesign of data center power and floor planning.
Blackwell is the current production platform; the Rubin generation is expected to reach cloud providers in the second half of 2026.

NVIDIA Blackwell server. Image credit: NVIDIA

An NVIDIA Blackwell server is a GPU system built on NVIDIA’s Blackwell architecture, and its power comes from three things working together: enormous on-chip memory, native low-precision math, and an interconnect fast enough to make many GPUs act as one. At its core sits the B200, a graphics processor with 208 billion transistors, 192GB of HBM3e memory, and a second-generation Transformer Engine that introduces 4-bit floating point (FP4). A single B200 produces up to 20 petaFLOPS of FP4 compute and moves data at roughly 8TB/s, more than double the 3.35TB/s of the Hopper-generation H100.

What turns those chips into a Blackwell server is how they are wired. Eight B200 GPUs form a standard HGX or DGX node, while the flagship GB200 NVL72 packs 72 of them with 36 Grace CPUs into one liquid-cooled rack connected by fifth-generation NVLink. That rack reaches 1.44 exaFLOPS of FP4 performance and 130TB/s of internal bandwidth, letting it train and serve trillion-parameter models that no single server could hold. The short version: Blackwell does not just run bigger models faster — it changes the unit of computing from the GPU to the entire rack.

What is an NVIDIA Blackwell server?

The Blackwell architecture, named after mathematician David Blackwell, debuted at NVIDIA’s GTC 2024 conference as the successor to Hopper. A Blackwell server is any system that uses its GPUs, but the term covers a wide range of hardware, from an 8-GPU air-coolable box to a 72-GPU liquid-cooled cabinet that weighs as much as a small car.

Three building blocks define the platform. The B200 is the GPU. The Grace CPU is a 72-core Arm processor that handles data preparation, orchestration, and model logic. The GB200 Grace Blackwell Superchip fuses two B200 GPUs and one Grace CPU on a single module, linked by a 900GB/s coherent connection called NVLink-C2C. Stack those pieces in different quantities and you get the server tiers customers actually buy.

The B200 GPU: the engine inside every Blackwell server

The B200 is a dual-die design. Because a single piece of silicon can only grow so large, NVIDIA fabricates two reticle-sized dies on TSMC’s custom 4NP process and joins them with a 10TB/s link, so the pair behaves as one unified GPU. That trick yields 208 billion transistors and a single pool of 192GB HBM3e memory (cloud platforms typically expose 180GB), fed by eight memory stacks across an 8,192-bit bus.

The headline upgrade is the second-generation Transformer Engine. It adds FP4 and FP6 to the existing FP8 and FP16 formats, and it switches precision automatically based on what each layer of a model needs. FP4 halves the memory a model occupies compared with FP8, so a B200 can hold a far larger model or run many more requests at once. Native FP4 is what drives Blackwell’s inference numbers: MLPerf results have shown a B200 serving Llama 2 70B at roughly 17,500 tokens per second, against about 3,000 for an H100.
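To make the idea of 4-bit floating point concrete, here is a minimal sketch of block-scaled rounding to a 4-bit grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and a single scale factor per block; the function names are hypothetical, and this is not a description of how the Transformer Engine is implemented in hardware.

```python
# Illustrative only: simulate rounding to a 4-bit float grid in plain Python.
# Assumption: E2M1 layout, whose non-negative representable values are below.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Scale a block so its largest magnitude maps to 6.0, then round
    every element to the nearest E2M1 value. Returns (scale, codes)."""
    peak = max(abs(v) for v in values)
    scale = peak / E2M1_VALUES[-1] if peak > 0 else 1.0
    codes = []
    for v in values:
        magnitude = abs(v) / scale
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - magnitude))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and 4-bit codes."""
    return [scale * c for c in codes]


weights = [0.12, -0.03, 0.48, -0.31]  # hypothetical weight values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)

print("codes:", codes)
print("restored:", [round(r, 3) for r in restored])
```

Each value is stored as one of only sixteen codes plus a shared scale, which is where the memory saving comes from. The rounding error visible in the restored values is the price, and it is why precision is chosen per layer rather than applied uniformly.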

Every B200 also carries a dedicated reliability engine that watches voltage, temperature, and memory errors and can move work off a failing component before it crashes — a feature that matters when thousands of GPUs run flat out for weeks. The trade-off for all this inference muscle is power: the B200 runs at a 1,000W thermal envelope, a 43% jump over the 700W H100.

B200 specification	Detail
Process and design	TSMC 4NP, dual-die, 208 billion transistors
Memory	192GB HBM3e (180GB usable on many clouds)
Memory bandwidth	Up to 8TB/s
Peak FP4 compute	Up to 20 petaFLOPS (with sparsity)
Precision formats	FP4, FP6, FP8, FP16/BF16, TF32, FP64
Interconnect	NVLink 5 at 1.8TB/s per GPU
Power (TDP)	1,000W

The Grace Blackwell Superchip and unified memory

Pairing GPUs with a CPU over the normal PCIe bus creates a bottleneck, because PCIe Gen5 tops out near 128GB/s. NVIDIA sidesteps that with NVLink-C2C, a 900GB/s coherent link between the Grace CPU and its Blackwell GPUs — about seven times the PCIe rate. Because the connection is cache-coherent, the CPU and GPU share a single memory address space and skip the costly copy operations that older architectures require.
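The bandwidth gap can be turned into transfer times with a rough calculation. The sketch below uses the two peak figures quoted above and a hypothetical 40GB payload; real transfers would be slower on both links because of protocol overhead, which is ignored here.

```python
# Peak link rates quoted in the text, in GB/s.
PCIE_GEN5_GB_PER_S = 128
NVLINK_C2C_GB_PER_S = 900


def transfer_seconds(payload_gb, bandwidth_gb_per_s):
    """Idealised time to move a payload at peak bandwidth."""
    return payload_gb / bandwidth_gb_per_s


payload_gb = 40  # hypothetical payload size
pcie_s = transfer_seconds(payload_gb, PCIE_GEN5_GB_PER_S)
c2c_s = transfer_seconds(payload_gb, NVLINK_C2C_GB_PER_S)
speedup = pcie_s / c2c_s

print(f"PCIe Gen5:  {pcie_s * 1000:.1f} ms")
print(f"NVLink-C2C: {c2c_s * 1000:.1f} ms")
print(f"ratio:      {speedup:.1f}x")
```

The ratio is independent of the payload size, so the same factor of about seven applies to any copy between CPU and GPU memory.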

The Grace CPU itself uses 72 Arm Neoverse V2 cores and up to 480GB of LPDDR5X memory, delivering roughly 500GB/s of CPU memory bandwidth at about twice the energy efficiency of mainstream server processors. In a Blackwell server, Grace handles tokenization, data preprocessing, and model orchestration while the GPUs concentrate on math.

From server to rack: HGX, DGX B200 and the GB200 NVL72

Most enterprises meet Blackwell as an 8-GPU node. The HGX B200 is the baseboard system integrators drop into their own chassis; the DGX B200 is NVIDIA’s fully built version. An 8-GPU node delivers about 144 petaFLOPS of FP4 (about 18 petaFLOPS per GPU, a little below the 20 petaFLOPS peak) and 1.4TB of fast memory, with NVLink running at 14.4TB/s inside the box. That is plenty to train a 7-billion-parameter model in days rather than weeks, or to serve quantized 70B models at high throughput.

The GB200 NVL72 is a different category of machine. It assembles 36 Grace Blackwell Superchips across 18 compute trays, joined by 9 NVLink switch trays, into a single rack of 72 GPUs and 36 CPUs. The result is 13.4TB of unified GPU memory, 1.44 exaFLOPS of FP4 compute, and an internal NVLink fabric moving 130TB/s. NVIDIA describes the rack as an exascale computer that acts like one massive GPU, and it claims 30x faster real-time inference on trillion-parameter language models versus Hopper, with 10x gains specifically on mixture-of-experts architectures. Decompression engines built into the silicon also speed database joins by about 18x against CPUs.

Specification	HGX / DGX B200 (8 GPU)	GB200 NVL72 (72 GPU)
Blackwell GPUs	8	72
Grace CPUs	0	36
FP4 Tensor Core	144 petaFLOPS	1,440 petaFLOPS (1.44 exaFLOPS)
Fast memory	Up to 1.4TB	Up to 13.4TB unified
NVLink bandwidth	14.4TB/s	130TB/s
Cooling	Air or liquid	Liquid only
Indicative price	~$400K–$500K (DGX B200)	~$2M–$3M per rack
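The node and rack totals in the table can be checked against the per-GPU figures quoted earlier. This is a back-of-envelope sketch that assumes peak numbers simply add up across GPUs; the helper name is hypothetical.

```python
# Per-GPU figures quoted earlier in the article.
FP4_PFLOPS_PER_GPU = 20      # peak FP4, with sparsity
NVLINK_TB_PER_S_PER_GPU = 1.8  # fifth-generation NVLink


def aggregate(gpu_count, per_gpu):
    """Scale a per-GPU figure linearly to a node or a rack."""
    return gpu_count * per_gpu


rack_pflops = aggregate(72, FP4_PFLOPS_PER_GPU)
rack_nvlink = aggregate(72, NVLINK_TB_PER_S_PER_GPU)
node_nvlink = aggregate(8, NVLINK_TB_PER_S_PER_GPU)

# The 8-GPU node is listed at 144 petaFLOPS, which implies a lower
# per-GPU figure than the 20 petaFLOPS peak.
node_pflops_per_gpu = 144 / 8

print(f"rack FP4:         {rack_pflops / 1000:.2f} exaFLOPS")
print(f"rack NVLink:      {rack_nvlink:.1f} TB/s")
print(f"node NVLink:      {node_nvlink:.1f} TB/s")
print(f"node FP4 per GPU: {node_pflops_per_gpu:.0f} petaFLOPS")
```

The rack NVLink total comes out at 129.6TB/s, which is rounded to 130TB/s in the table.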

Why FP4 and the Transformer Engine matter for inference cost

The most practical capability of a Blackwell server is cheaper inference. FP4 is not simply faster FP8 — it changes the economics of serving a model. A model that needs 80GB at FP8 fits in roughly 40GB at FP4, freeing the rest of the 192GB for batching and KV-cache. Inference providers running on thin margins care about this directly: a workload that costs around $0.50 per million tokens on an H100 at FP8 can drop toward $0.10–$0.15 on a B200 at FP4. That difference decides which models are economical to run at all.
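The memory and cost figures above reduce to simple arithmetic. The sketch below assumes a hypothetical 80-billion-parameter model (which fills 80GB at one byte per parameter) and uses the per-million-token prices quoted in the paragraph. Those prices are indicative, not measured.

```python
GPU_MEMORY_GB = 192
BYTES_PER_PARAM = {"fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(params_billion, precision):
    """Weights only: 1 billion parameters at 1 byte each is 1GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def cost_reduction(before, after):
    """Fractional drop in cost per million tokens."""
    return 1 - after / before


params_billion = 80  # hypothetical model size
fp8_gb = weight_footprint_gb(params_billion, "fp8")
fp4_gb = weight_footprint_gb(params_billion, "fp4")
headroom_gb = GPU_MEMORY_GB - fp4_gb  # left for batching and KV-cache

# Indicative prices from the text: $0.50 on H100 FP8, $0.10 to $0.15 on B200 FP4.
saving_low = cost_reduction(0.50, 0.15)
saving_high = cost_reduction(0.50, 0.10)

print(f"FP8 weights: {fp8_gb:.0f}GB, FP4 weights: {fp4_gb:.0f}GB")
print(f"headroom at FP4: {headroom_gb:.0f}GB")
print(f"cost reduction: {saving_low:.0%} to {saving_high:.0%}")
```

On these assumptions the same GPU keeps 152GB free after loading the model, and the cost per token falls by roughly 70 to 80 percent.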

Blackwell pays for this gain by trimming high-precision FP64 throughput. For AI training and inference, 64-bit math is rarely needed, so NVIDIA reallocated that silicon budget toward the lower-precision formats that dominate modern workloads. Teams running classic scientific simulations that depend on FP64 may see less benefit, which is why many keep Hopper-class GPUs for those jobs.

NVLink fabric: making 72 GPUs behave like one

In a conventional cluster, GPUs inside a server talk over fast NVLink, but GPUs in separate servers fall back to InfiniBand, which is roughly ten to twenty times slower. That two-tier hierarchy caps how large a model can grow before communication overhead dominates. The GB200 NVL72 removes the second tier inside the rack. Every one of its 72 GPUs reaches every other GPU at full NVLink speed through a non-blocking switch fabric, so the rack presents itself to software as one enormous accelerator.

The payoff shows up in how the rack uses its memory. On a trillion-parameter model in FP4, the weights occupy about 500GB, roughly the capacity of three GPUs, which leaves the rest of the rack free for KV-cache, activations, and concurrent reasoning passes. Attention computation, expert routing, and cache access all happen across the fabric without leaving the rack.
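A quick sizing calculation shows how little of the rack the weights themselves consume. The sketch assumes 0.5 bytes per parameter at FP4, 192GB per GPU, and the 13.4TB rack total quoted earlier. It counts weights only and ignores how a real deployment would shard them across GPUs.

```python
import math

GPU_MEMORY_GB = 192
RACK_MEMORY_GB = 13_400  # 13.4TB, as quoted for the GB200 NVL72
FP4_BYTES_PER_PARAM = 0.5


def fp4_weights_gb(param_count):
    """Weight footprint in decimal GB at 4 bits per parameter."""
    return param_count * FP4_BYTES_PER_PARAM / 1e9


weights_gb = fp4_weights_gb(1e12)  # one trillion parameters
gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)
rack_share = weights_gb / RACK_MEMORY_GB

print(f"weights: {weights_gb:.0f}GB")
print(f"GPUs needed to hold them: {gpus_for_weights}")
print(f"share of rack memory: {rack_share:.1%}")
```

Under these assumptions the weights take under 4 percent of the rack's memory, so almost all of it remains available for cache and activations.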

Power, cooling, and the liquid-cooled reality

A Blackwell rack is data center infrastructure, not a server you slide onto a shelf. The GB200 NVL72 draws about 120kW and weighs roughly 1.36 tonnes. A traditional enterprise rack was designed for 10 to 15kW, so a single Blackwell cabinet packs close to ten times that density. Air cannot remove heat at this concentration, which makes direct liquid cooling mandatory.
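The density comparison can be worked through with the figures above. The sketch assumes each of the 72 GPUs runs at the 1,000W envelope quoted earlier, which is a simplification; the remaining power goes to CPUs, switches, and other rack components.

```python
RACK_POWER_KW = 120
LEGACY_RACK_KW = (10, 15)  # traditional enterprise rack range
GPU_COUNT = 72
GPU_TDP_W = 1000           # assumed per-GPU draw, from the B200 figure

# How many traditional racks' worth of power one NVL72 draws.
density_low = RACK_POWER_KW / LEGACY_RACK_KW[1]
density_high = RACK_POWER_KW / LEGACY_RACK_KW[0]

# Portion of the rack budget consumed by the GPUs alone.
gpu_power_kw = GPU_COUNT * GPU_TDP_W / 1000
gpu_share = gpu_power_kw / RACK_POWER_KW

print(f"density: {density_low:.0f}x to {density_high:.0f}x a traditional rack")
print(f"GPUs alone: {gpu_power_kw:.0f}kW ({gpu_share:.0%} of the rack)")
```

The result is 8x to 12x depending on which end of the legacy range is used, which brackets the "close to ten times" figure.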

That density is also the source of Blackwell’s efficiency claim. Liquid cooling lets NVIDIA pack GPUs tightly and keep clocks high, and the company states the GB200 delivers 25x more performance at the same power as air-cooled H100 infrastructure while cutting water use. The catch is facility-level: deploying these racks means rebuilding power distribution, cooling loops, and floor loading around the hardware. As one assessment of premium AI builds put it, get one layer wrong and the most expensive GPUs in the building sit idle.

Blackwell Ultra: the B300 and GB300 NVL72

NVIDIA refreshed the platform with Blackwell Ultra, introduced at GTC in March 2025 and aimed squarely at AI reasoning. The B300 GPU keeps the dual-die approach but stacks 12-high HBM3e to reach 288GB per GPU, 50% more memory than the B200. It adds roughly 1.5x more dense FP4 compute and doubles attention-layer acceleration, the part of inference that now dominates cost for long-context and reasoning models. The trade-off is a higher 1,400W envelope that requires liquid cooling in every form factor.

At rack scale, the GB300 NVL72 combines 72 B300 GPUs and 36 Grace CPUs to deliver about 20.7TB of unified HBM3e and 1.1 exaFLOPS of dense FP4. NVIDIA pairs it with ConnectX-8 SuperNICs that provide 800Gb/s of networking per GPU and Quantum-X800 InfiniBand or Spectrum-X Ethernet for scaling across racks. The company markets up to 50x higher AI factory output than Hopper-based systems for reasoning workloads. The 8-GPU DGX B300 brings the same architecture into a more familiar node, rated at 144 petaFLOPS for FP4 inference and 72 petaFLOPS for FP8 training.
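The GB300 NVL72 totals follow from the per-GPU B300 figures. This sketch assumes 288GB and about 15 petaFLOPS of dense FP4 per GPU, as stated in this section, and that both scale linearly across 72 GPUs.

```python
GPUS_PER_RACK = 72
B300_HBM_GB = 288
B300_DENSE_FP4_PFLOPS = 15
B200_HBM_GB = 192

rack_hbm_tb = GPUS_PER_RACK * B300_HBM_GB / 1000
rack_dense_exaflops = GPUS_PER_RACK * B300_DENSE_FP4_PFLOPS / 1000
memory_gain = B300_HBM_GB / B200_HBM_GB

print(f"rack HBM3e:     {rack_hbm_tb:.1f}TB")
print(f"rack dense FP4: {rack_dense_exaflops:.2f} exaFLOPS")
print(f"memory per GPU: {memory_gain:.1f}x the B200")
```

The dense FP4 total works out to 1.08 exaFLOPS, which the article rounds to 1.1.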

Attribute	B200 (Blackwell)	B300 (Blackwell Ultra)
HBM3e memory per GPU	192GB	288GB
Dense FP4 compute	~10 petaFLOPS	~15 petaFLOPS
Attention acceleration	Baseline	2x B200
Power (TDP)	1,000W	1,400W
Rack system	GB200 NVL72, 1.44 exaFLOPS FP4 (with sparsity)	GB300 NVL72, 1.1 exaFLOPS dense FP4

What a Blackwell server actually does well

The platform earns its cost on a specific set of jobs. It trains and serves trillion-parameter language models that need their weights, cache, and activations in one fast memory space. It runs reasoning and agentic models, where the workload spikes unpredictably as the model verifies, searches, and refines answers across many steps. It handles mixture-of-experts models efficiently, since the wide memory holds many experts at once. Blackwell Ultra also unlocks near real-time video generation from world models, where NVIDIA cites a 30x speedup over Hopper, plus the database acceleration that comes from the decompression engines.

For smaller work, the math flips. A 7B-to-34B model that fits comfortably in 80GB sees little advantage from a 72-GPU NVLink domain, and an H100 or H200 often offers better cost per FLOP. The 72-GPU rack only pays off when a workload genuinely needs that much connected memory and compute at once.

Where Blackwell stands now and what comes next

Blackwell is the platform powering the current build-out of large AI data centers. Microsoft Azure, CoreWeave, Oracle Cloud, and Google Cloud all offer GB200 and GB300 systems, and operators including Lambda and xAI are racking thousands of these GPUs for frontier training and inference. A single rack now provides more compute than many countries had across their entire GPU fleet three years ago.

The next step is already named. NVIDIA’s Rubin architecture, led by the R100 chip and the Vera CPU, is expected to reach cloud providers in the second half of 2026, with a Rubin Ultra NVL576 rack projected to deliver roughly 15 exaFLOPS of FP4 inference further out. For now, a Blackwell server represents the most capable AI hardware in general production: a machine engineered from chip to chiller, where memory, interconnect, power, and cooling are specified as one system rather than assembled from parts.

Sources: NVIDIA GB200 NVL72, NVIDIA GB300 NVL72, NVIDIA DGX B300, NVIDIA Technical Blog, Spheron (B200), Spheron (GB200 NVL72), Civo, ServerSimply, Introl, Radiant, Jonathan Hui

Written by Alius Noreika