How the H200 Is Built for Memory-Bound AI Workloads


In this article

This article explains why memory bandwidth matters as much as VRAM capacity for large model workloads, where the H200 NVL fits in private AI infrastructure, and how it differs from both the RTX PRO 6000 (RP6000) and hyperscaler GPU rental for teams running 70B+ models at production scale.

Two numbers define the H200: 141GB of HBM3e and 4.8 TB/s of memory bandwidth. The first is roughly 76% more capacity than the H100’s 80GB. The second is about 43% more bandwidth than the H100 SXM5 and roughly 2.7 times the 1.79 TB/s of the RTX PRO 6000’s GDDR7. Those aren’t incremental improvements. They change which workloads fit on a single GPU and how fast memory-bound operations run. For teams where the constraint isn’t raw compute but the speed at which model weights can be streamed through the GPU, the H200 is where the economics shift.
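The ratios above can be checked with a few lines of arithmetic. This is a minimal sketch using only the capacity and bandwidth figures quoted in this article; the helper name `pct_more` is illustrative.

```python
# Figures quoted in this article (GB of memory, TB/s of memory bandwidth).
H200_GB, H100_GB = 141, 80
H200_TBS, H100_SXM5_TBS, RTX_PRO_6000_TBS = 4.8, 3.35, 1.79


def pct_more(new: float, old: float) -> float:
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1) * 100


capacity_gain = pct_more(H200_GB, H100_GB)            # H200 vs H100 capacity
bandwidth_gain = pct_more(H200_TBS, H100_SXM5_TBS)    # H200 vs H100 SXM5 bandwidth
vs_rtx_pro_6000 = H200_TBS / RTX_PRO_6000_TBS         # H200 vs RTX PRO 6000 bandwidth

print(f"Capacity gain over H100:       {capacity_gain:.0f}%")
print(f"Bandwidth gain over H100 SXM5: {bandwidth_gain:.0f}%")
print(f"Bandwidth vs RTX PRO 6000:     {vs_rtx_pro_6000:.1f}x")
```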

What the H200 Actually Is

The H200 is not a new GPU architecture. It uses the same Hopper GH100 die as the H100, with the upgrade entirely in the memory subsystem. Where the H100 ships with 80GB of HBM3, the H200 ships with 141GB of HBM3e, making it NVIDIA’s first GPU to use the faster HBM3e standard. Memory bandwidth increases from 3.35 TB/s on the H100 SXM5 to 4.8 TB/s on the H200.

FP8 compute throughput is identical between H100 and H200, since the Tensor Core architecture is the same. The H200 earns its performance gains on memory-bound workloads, not on compute-bound ones. For tasks where the GPU spends most of its time waiting for data to arrive from memory rather than executing computation, the bandwidth increase is the relevant number.


SXM5 vs NVL: Why the Variant Matters

The H200 ships in two variants with meaningfully different characteristics.

H200 SXM5 is the high-density module used in HGX H200 and DGX H200 nodes. It runs at 700W TDP, typically requires liquid cooling, and connects to other GPUs via NVLink 4.0 at 900 GB/s bidirectional bandwidth per GPU through NVIDIA’s NVSwitch fabric. In an 8-GPU HGX node, NVSwitch creates a full-mesh topology where every GPU can communicate with every other GPU at full bandwidth simultaneously. This is the standard for large-scale distributed training across 70B+ parameter models at full precision.

H200 NVL is the PCIe Gen 5 add-in card variant. It runs at 600W, can be air-cooled in well-ventilated racks, and fits standard PCIe server hardware without an SXM baseboard. NVL cards that are not joined by an optional NVLink bridge communicate over PCIe 5.0, which provides substantially lower GPU-to-GPU bandwidth than the NVSwitch fabric.
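To put the interconnect gap in rough numbers, the sketch below compares the 900 GB/s NVLink figure quoted above with PCIe 5.0 x16. The PCIe figure is an assumption not stated in this article: about 64 GB/s per direction, or 128 GB/s bidirectional, at theoretical peak. Real transfers achieve less on both links, so treat the result as an order-of-magnitude estimate.

```python
NVLINK_BIDIR_GB_S = 900.0      # NVLink 4.0 per GPU, as quoted in this article
PCIE5_X16_BIDIR_GB_S = 128.0   # ASSUMPTION: theoretical PCIe 5.0 x16, ~64 GB/s each way


def transfer_ms(payload_gb: float, link_gb_s: float) -> float:
    """Ideal time in milliseconds to move `payload_gb` over a link (no latency, no protocol overhead)."""
    return payload_gb / link_gb_s * 1000


bandwidth_ratio = NVLINK_BIDIR_GB_S / PCIE5_X16_BIDIR_GB_S

payload_gb = 1.0  # hypothetical payload size, chosen only for illustration
print(f"NVLink / PCIe 5.0 peak bandwidth: {bandwidth_ratio:.1f}x")
print(f"1 GB over NVLink:   {transfer_ms(payload_gb, NVLINK_BIDIR_GB_S):.2f} ms")
print(f"1 GB over PCIe 5.0: {transfer_ms(payload_gb, PCIE5_X16_BIDIR_GB_S):.2f} ms")
```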

OpenMetal’s H200 server uses the NVL PCIe variant, with 1 or 2 GPUs per server. This is the right configuration for single- and dual-GPU inference and fine-tuning workloads on dedicated hardware. For organizations running 8-GPU distributed training clusters at scale, the SXM5 in HGX nodes is the datacenter-grade platform and is a different deployment model.

Why Memory Bandwidth Determines Inference Throughput

For LLM inference, every generated token requires a forward pass through the model. That forward pass streams the model weights through the GPU. The rate at which tokens can be generated is bounded by how fast those weights can move from GPU memory into the compute units, which means memory bandwidth is the throughput ceiling for memory-bound inference.

Using the Hugging Face rule of thumb (approximately 2 bytes per parameter at FP16), a 70B parameter model requires approximately 140GB of VRAM just for weights. That fits on a single H200, but it consumes nearly all of the card’s 141GB and leaves only about 1GB for KV cache and activations. An H100 at 80GB cannot hold a 70B model at FP16 at all without multi-GPU tensor parallelism or quantization.
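The sizing rule can be written as a small helper. This is a sketch of the rule of thumb only: it counts weights and nothing else, and the function and dictionary names are illustrative. The memory capacities are the ones quoted in this article.

```python
# Per-GPU memory capacities quoted in this article, in GB.
GPU_MEMORY_GB = {"H100": 80, "RTX PRO 6000": 96, "H200": 141}


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only footprint: 1e9 params x bytes per param, expressed in GB (1e9 bytes)."""
    return params_billions * bytes_per_param


def headroom_gb(params_billions: float, bytes_per_param: float, gpu: str) -> float:
    """Memory left for KV cache and activations; negative means the weights do not fit."""
    return GPU_MEMORY_GB[gpu] - weight_footprint_gb(params_billions, bytes_per_param)


for precision, bytes_per_param in (("FP16", 2), ("FP8", 1)):
    for gpu in GPU_MEMORY_GB:
        left = headroom_gb(70, bytes_per_param, gpu)
        verdict = "fits" if left >= 0 else "does not fit"
        print(f"70B at {precision} on {gpu}: {verdict} ({left:+.0f} GB headroom)")
```

A positive headroom only means the weights fit. Whether the remaining memory is enough depends on the KV cache and batch sizes the workload needs.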

At FP8 (1 byte per parameter), a 70B model occupies approximately 70GB, fitting an H100 with some KV cache headroom, or an H200 with substantial headroom for KV cache and large batch processing. The bandwidth difference matters here: at 4.8 TB/s, the H200 streams the same 70GB of weights about 1.4x faster than an H100 does at 3.35 TB/s, which cuts the memory-bound portion of each forward pass by roughly 30%. For high-throughput serving, the H200 completes each forward pass faster even when both GPUs can technically hold the model.
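A back-of-the-envelope version of this comparison is sketched below. It assumes every weight is read from memory once per forward pass at peak bandwidth, and it ignores KV cache reads, kernel overhead, and batching, so the numbers are ceilings rather than predictions. The function names are illustrative.

```python
def weight_stream_ms(weights_gb: float, bandwidth_tb_s: float) -> float:
    """Ideal time to read all weights once. GB divided by TB/s comes out in milliseconds."""
    return weights_gb / bandwidth_tb_s


def decode_ceiling_tokens_per_s(weights_gb: float, bandwidth_tb_s: float) -> float:
    """Upper bound on single-sequence decode speed if each token needs one full weight read."""
    return 1000 / weight_stream_ms(weights_gb, bandwidth_tb_s)


WEIGHTS_GB = 70  # 70B parameters at FP8, 1 byte per parameter

h200_ms = weight_stream_ms(WEIGHTS_GB, 4.8)
h100_ms = weight_stream_ms(WEIGHTS_GB, 3.35)
speedup = h100_ms / h200_ms

print(f"H200: {h200_ms:.1f} ms per pass, ceiling {decode_ceiling_tokens_per_s(WEIGHTS_GB, 4.8):.0f} tokens/s")
print(f"H100: {h100_ms:.1f} ms per pass, ceiling {decode_ceiling_tokens_per_s(WEIGHTS_GB, 3.35):.0f} tokens/s")
print(f"Speedup on the memory-bound portion: {speedup:.2f}x")
```

Batching raises aggregate throughput well above the single-sequence ceiling because one weight read serves every sequence in the batch, but the ratio between the two GPUs on the memory-bound portion stays the same.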

For 100B+ parameter models at FP16, even the H200’s 141GB isn’t sufficient for a single GPU, and multi-GPU tensor parallelism is required regardless of hardware. At that scale, SXM5 with NVSwitch is the appropriate platform.

The 141GB Floor for Large Models

The capacity difference matters most at the boundary between what fits and what doesn’t. A 70B model at FP16 occupies approximately 140GB. That barely fits in a single H200. It doesn’t fit in an H100 or an RTX PRO 6000 at FP16. It fits in a dual-RP6000 configuration (192GB combined GDDR7), but without NVLink, inter-GPU tensor parallelism runs over PCIe, adding communication overhead.

For teams running 70B models at FP16 for maximum output quality, the H200 is the only single-GPU option in this comparison. For teams running 70B at FP8 and prioritizing throughput at a lower cost point, the RTX PRO 6000 is a strong alternative.

The two-GPU H200 configuration (282GB combined HBM3e across two discrete GPUs) opens up the range of models above 70B at full precision, including the growing class of Mixture-of-Experts models where all expert weights must remain resident in GPU memory. As with any multi-GPU setup on this platform, the two GPUs are not pooled over NVLink, so models spanning both cards run tensor or pipeline parallelism over PCIe, adding communication overhead.

What OpenMetal’s H200 Server Includes

The H200 server is built on the same Intel Xeon 6530P dual-socket platform as the RP6000 and XL v5:

1 or 2x NVIDIA H200 NVL (141GB HBM3e per GPU, 4.8 TB/s memory bandwidth)
2x Intel Xeon 6530P (64 cores / 128 threads total)
1TB DDR5-6400 (up to 2TB)
Up to 8x NVMe drives (single 8-bay group) using Micron 7500 MAX
20 Gbps private network / 10 Gbps public network

The 64-core/128-thread CPU and DDR5-6400 memory matter for inference serving. Modern frameworks like vLLM and SGLang rely heavily on CPU-side preprocessing: tokenization, batching, KV cache management, and scheduling. The CPU needs to stay ahead of the GPU to avoid starving the accelerator. 64 cores (128 threads) backed by DDR5-6400 memory provide that headroom across a wide range of serving throughput targets.

The 20 Gbps private network enables multi-node inference configurations where tensor parallelism runs across multiple servers. For teams scaling past a single H200 server, the latency and bandwidth of the private network between nodes determine how much communication overhead distributed inference adds.

Best-Fit Workloads

70B and larger model inference at FP16

The H200 is the only single-GPU option for serving 70B models at full FP16 precision. Teams running production endpoints on Llama 3 70B, Qwen 72B, or similar models who prioritize output quality over cost-per-token should evaluate the H200 over quantized alternatives.

Memory-bandwidth-bound serving

For multi-user inference endpoints where throughput is the primary concern, the H200’s 4.8 TB/s bandwidth means each forward pass completes faster than on lower-bandwidth GPUs. For high-concurrency serving of large models, bandwidth scales throughput more directly than additional VRAM headroom.

Large-context inference

KV cache grows with context length. At very long contexts (100K+ tokens), KV cache memory requirements become significant alongside model weights. The H200’s 141GB provides more KV cache headroom at large model sizes than any alternative single-GPU option.
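The growth can be estimated with the standard per-token formula: two tensors (keys and values) per layer, each of size KV heads times head dimension. The sketch below assumes a Llama 3 70B style configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache. These shapes are assumptions for illustration and are not taken from this article; other models and cache precisions will differ.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> float:
    """Total KV cache in GB (1e9 bytes) for `tokens` cached tokens across all sequences."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value) / 1e9


def max_cached_tokens(headroom_gb: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """How many tokens of KV cache fit in the memory left after weights."""
    return headroom_gb * 10**9 // kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)


# ASSUMPTION: Llama 3 70B style shapes with an FP16 KV cache.
SHAPE = dict(layers=80, kv_heads=8, head_dim=128)

print(f"KV cache per token:       {kv_bytes_per_token(**SHAPE) / 1024:.0f} KiB")
print(f"KV cache at 100K tokens:  {kv_cache_gb(100_000, **SHAPE):.1f} GB")
print(f"Tokens in 71 GB headroom: {max_cached_tokens(71, **SHAPE):,}")  # 70B at FP8 on an H200
print(f"Tokens in 1 GB headroom:  {max_cached_tokens(1, **SHAPE):,}")   # 70B at FP16 on an H200
```

Under these assumptions a single 100K-token context needs roughly 33GB of cache, which is why the headroom left after weights matters as much as whether the weights fit.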

QLoRA and fine-tuning at 70B scale

QLoRA fine-tuning of 70B models runs within the H200’s 141GB budget with room for optimizer state overhead. For teams adapting frontier-scale models to domain-specific data, the H200 provides the memory budget needed without multi-GPU tensor parallelism complexity.

HPC and scientific computing

The H200 datasheet cites up to 110x faster time to results on memory-bound HPC workloads compared to CPU-only systems. NVIDIA measures this against a dual x86 CPU baseline (dual Sapphire Rapids) on the MILC benchmark. Memory-intensive scientific simulations, molecular dynamics, and numerical methods that are bandwidth-bound benefit from HBM3e for the same reason LLM inference does: their throughput is limited by how fast data moves from memory, not by raw compute.

Where the RP6000 Makes More Sense

The H200 is the right choice when 141GB HBM3e and 4.8 TB/s bandwidth are the reasons you’re buying. For workloads that fit within 96GB GDDR7, the RP6000 delivers comparable or better cost efficiency per token on standard inference workloads and is better suited to:

Inference and fine-tuning on 30B-70B models at FP8 or Q4
Mixed AI and visual computing pipelines (RT Cores, NVENC video encoding)
Teams where per-token cost efficiency at the 30B-70B range is the primary goal

The H200 earns its place when the model size exceeds 96GB at target precision, when bandwidth is the measured throughput ceiling, or when a team is running workloads that specifically require the HBM3e memory architecture.

Getting Started

H200 servers are reserved through OpenMetal’s Configure / Reserve flow. GPU hardware at this tier carries lead times. Get a written quote (valid for 30 days) and plan the reservation against your actual deployment timeline.

For teams validating whether the H200 configuration fits their workload before committing, OpenMetal’s PoC program provides a structured evaluation path with engineer-to-engineer support.

Frequently Asked Questions

What is the difference between the H200 SXM5 and H200 NVL?

Both variants use the same Hopper GH100 die and 141GB HBM3e memory subsystem. The SXM5 mounts on the baseboard in HGX H200 nodes, runs at 700W, typically requires liquid cooling, and connects GPUs at 900 GB/s bidirectional via NVSwitch for full-mesh all-to-all communication. The NVL is a PCIe Gen 5 add-in card, runs at 600W, can be air-cooled, and fits standard server hardware. For configurations of 1-4 GPUs per server in standard rack hardware, NVL is the appropriate variant. For 8-GPU distributed training nodes with full NVSwitch fabric, SXM5 is the datacenter standard.

Why does memory bandwidth matter for LLM inference?

Each token generated during inference requires a forward pass through the model, which streams model weights from GPU memory into the compute units. The rate at which those weights can be moved determines inference throughput for memory-bound workloads. Higher memory bandwidth means each forward pass completes faster, which translates directly to higher tokens per second at a given batch size.

Can a 70B model run on a single H200?

Yes, at FP16. Using the standard rule from Hugging Face’s LLM optimization documentation, a 70B model requires approximately 140GB at FP16, which fits within the H200’s 141GB with minimal headroom for KV cache. At FP8, the same model occupies approximately 70GB, leaving roughly 71GB for KV cache. The H200 is the only single-GPU option for 70B inference at FP16.

How does the H200 compare to the RTX PRO 6000 for inference?

The H200 NVL has 141GB HBM3e at 4.8 TB/s. The RTX PRO 6000 has 96GB GDDR7 at 1.79 TB/s. The H200 is better suited for models above 96GB at target precision and for workloads where memory bandwidth is the throughput bottleneck. The RP6000 is better suited for 30B-70B inference at FP8 or Q4 quantization, offers better cost efficiency at that model size range, and supports mixed AI-visual computing workloads via its RT Cores and video encoding engines.

What is NVLink and does the H200 NVL support it?

NVLink is NVIDIA’s high-speed GPU-to-GPU interconnect. H200 SXM5 uses NVLink 4.0 at 900 GB/s bidirectional per GPU through NVIDIA’s NVSwitch fabric. OpenMetal’s dual H200 NVL GPUs are discrete. They are not pooled into a single 282GB address space, and they are not joined by an NVLink bridge. The H200 NVL card can accept an optional 2-way or 4-way NVLink bridge in principle, but the bridge is a short physical connector that spans only two to four adjacent slots. In OpenMetal’s dual-GPU chassis the two cards sit at opposite ends of the board, so a bridge cannot reach across them. The GPUs communicate over PCIe 5.0 instead. PCIe 5.0 supports peer-to-peer DMA between the cards, which covers inference and serving workloads that do not depend on a shared memory pool.

Is the H200 available on-demand at OpenMetal?

H200 servers are reserved through OpenMetal’s Configure / Reserve flow. Written quotes are valid for 30 days. GPU hardware at this tier has lead times, so plan the reservation against your deployment timeline rather than assuming immediate availability.
