In this article
This article explains why memory bandwidth matters as much as VRAM capacity for large model workloads, where the H200 NVL fits in private AI infrastructure, and how it differs from both the RTX PRO 6000 (RP6000) and hyperscaler GPU rental for teams running 70B+ models at production scale.
Two numbers define the H200: 141GB of HBM3e and 4.8 TB/s of memory bandwidth. The first is roughly 76% more capacity than the H100’s 80GB. The second is about 43% more bandwidth than the H100 SXM5’s 3.35 TB/s and about 2.7 times the 1.79 TB/s of the RTX PRO 6000’s GDDR7. Those aren’t incremental improvements. They change which workloads fit on a single GPU and how fast memory-bound operations run. For teams where the constraint isn’t raw compute but the speed at which model weights can be streamed through the GPU, the H200 is where the economics shift.
What the H200 Actually Is
The H200 is not a new GPU architecture. It uses the same Hopper GH100 die as the H100, with the upgrade entirely in the memory subsystem. Where the H100 ships with 80GB of HBM3, the H200 ships with 141GB of HBM3e, making it NVIDIA’s first GPU to use the faster HBM3e standard. Memory bandwidth increases from 3.35 TB/s on the H100 SXM5 to 4.8 TB/s on the H200.
FP8 compute throughput is identical between H100 and H200, since the Tensor Core architecture is the same. The H200 earns its performance gains on memory-bound workloads, not on compute-bound ones. For tasks where the GPU spends most of its time waiting for data to arrive from memory rather than executing computation, the bandwidth increase is the relevant number.
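The ratios quoted in this article can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `SPECS` table and the `ratio` helper are hypothetical names, and the figures are simply the ones stated in this article.

```python
# Figures as quoted in this article: VRAM in GB, memory bandwidth in TB/s.
SPECS = {
    "H100 SXM5": {"vram_gb": 80, "bandwidth_tbs": 3.35},
    "H200": {"vram_gb": 141, "bandwidth_tbs": 4.8},
    "RTX PRO 6000": {"vram_gb": 96, "bandwidth_tbs": 1.79},
}


def ratio(metric: str, gpu: str, baseline: str) -> float:
    """Return how many times larger `metric` is on `gpu` than on `baseline`."""
    return SPECS[gpu][metric] / SPECS[baseline][metric]


if __name__ == "__main__":
    for metric, baseline in [
        ("vram_gb", "H100 SXM5"),
        ("bandwidth_tbs", "H100 SXM5"),
        ("bandwidth_tbs", "RTX PRO 6000"),
    ]:
        print(f"H200 {metric} vs {baseline}: {ratio(metric, 'H200', baseline):.2f}x")
```

Because FP8 compute is identical on the H100 and H200, these two memory ratios are the whole difference between the parts.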

SXM5 vs NVL: Why the Variant Matters
The H200 ships in two variants with meaningfully different characteristics.
H200 SXM5 is the high-density module used in HGX H200 and DGX H200 nodes. It runs at 700W TDP, typically requires liquid cooling, and connects to other GPUs via NVLink 4.0 at 900 GB/s bidirectional bandwidth per GPU through NVIDIA’s NVSwitch fabric. In an 8-GPU HGX node, NVSwitch creates a full-mesh topology where every GPU can communicate with every other GPU at full bandwidth simultaneously. This is the standard for large-scale distributed training across 70B+ parameter models at full precision.
H200 NVL is the PCIe Gen 5 add-in card variant. It runs at 600W, can be air-cooled in well-ventilated racks, and fits standard PCIe server hardware without a custom NVLink Switch System board. NVL cards can be joined with NVLink bridges in groups of 2 to 4 GPUs. GPU-to-GPU traffic that isn’t bridged runs over PCIe 5.0, which provides substantially lower bandwidth than the NVSwitch fabric.
OpenMetal’s H200 server uses the NVL PCIe variant, with 1 or 2 GPUs per server. This is the right configuration for single- and dual-GPU inference and fine-tuning workloads on dedicated hardware. For organizations running 8-GPU distributed training clusters at scale, the SXM5 in HGX nodes is the datacenter-grade platform and is a different deployment model.
Why Memory Bandwidth Determines Inference Throughput
For LLM inference, every generated token requires a forward pass through the model. That forward pass streams the model weights through the GPU. The rate at which tokens can be generated is bounded by how fast those weights can move from GPU memory into the compute units, which means memory bandwidth is the throughput ceiling for memory-bound inference.
Using the Hugging Face rule of thumb (approximately 2 bytes per parameter at FP16), a 70B parameter model requires approximately 140GB of VRAM just for weights. That fills a single H200’s 141GB nearly completely, but it fits on one card. An H100 at 80GB cannot hold a 70B model at FP16 at all without multi-GPU tensor parallelism or quantization.
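The bytes-per-parameter rule of thumb is simple enough to sketch directly. This is a minimal estimate of weights only; the function name is hypothetical, and the 0.5 bytes per parameter used for Q4 is an assumption that ignores quantization metadata.

```python
# Approximate bytes per parameter at each precision.
# The "q4" value is an assumption (4 bits, no scale/zero-point overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q4": 0.5}


def weight_footprint_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Covers model weights only: no KV cache, activations, or runtime overhead.
    """
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"70B at {precision}: {weight_footprint_gb(70, precision):.0f} GB")
```

Real deployments need additional memory beyond this figure, so treat the result as a floor rather than a sizing target.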
At FP8 (1 byte per parameter), a 70B model occupies approximately 70GB, fitting an H100 with some KV cache headroom, or an H200 with substantial headroom for KV cache and large batch processing. The bandwidth difference matters here: streaming 70GB of weights once at 4.8 TB/s takes roughly 14.6 ms, versus roughly 20.9 ms at the H100 SXM5’s 3.35 TB/s, which is the same 1.43x bandwidth ratio. For high-throughput serving, the H200 completes each memory-bound forward pass faster even when both GPUs can technically hold the model.
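The bandwidth ceiling can be sketched as a lower bound on the time to read every weight once. The helpers below are hypothetical and assume a fully memory-bound decode step at peak datasheet bandwidth, which real serving stacks do not reach; they ignore KV cache reads, kernel overhead, and batching.

```python
def stream_time_ms(weight_gb: float, bandwidth_tbs: float) -> float:
    """Lower bound, in milliseconds, on reading all weights once from memory."""
    seconds = (weight_gb * 1e9) / (bandwidth_tbs * 1e12)
    return seconds * 1000.0


def single_stream_token_ceiling(weight_gb: float, bandwidth_tbs: float) -> float:
    """Upper bound on tokens/s for one sequence if every token reads all weights."""
    return 1000.0 / stream_time_ms(weight_gb, bandwidth_tbs)


if __name__ == "__main__":
    for name, bandwidth in [("H200", 4.8), ("H100 SXM5", 3.35)]:
        ms = stream_time_ms(70, bandwidth)
        ceiling = single_stream_token_ceiling(70, bandwidth)
        print(f"{name}: {ms:.1f} ms per pass, at most {ceiling:.1f} tokens/s per stream")
```

Batching raises aggregate throughput well above this single-stream bound, because one pass over the weights then produces a token for every sequence in the batch.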
For 100B+ parameter models at FP16, even the H200’s 141GB isn’t sufficient for a single GPU, and multi-GPU tensor parallelism is required regardless of hardware. At that scale, SXM5 with NVSwitch is the appropriate platform.
The 141GB Floor for Large Models
The capacity difference matters most at the boundary between what fits and what doesn’t. A 70B model at FP16 occupies approximately 140GB. That barely fits in a single H200. It doesn’t fit in an H100 or an RTX PRO 6000 at FP16. It fits in a dual-RP6000 configuration (192GB combined GDDR7), but without NVLink, inter-GPU tensor parallelism runs over PCIe, adding communication overhead.
For teams running 70B models at FP16 for maximum output quality, the H200 is the only single-GPU option among the GPUs compared here. For teams running 70B at FP8 and prioritizing throughput at a lower cost point, the RTX PRO 6000 is a strong alternative.
The two-GPU H200 configuration (282GB total HBM3e) opens up the range of models above 70B at full precision, including the growing class of Mixture-of-Experts models where all expert weights must remain resident in GPU memory simultaneously.
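The fit-or-not boundary described in this section can be sketched as a capacity check. The configuration table uses the capacities quoted in this article; the function name and the `reserve_gb` parameter (headroom set aside for KV cache and runtime overhead) are hypothetical, and multi-GPU entries assume the model is sharded across both cards.

```python
from typing import List

# Total VRAM per configuration in GB, as quoted in this article.
CONFIGS_GB = {
    "1x H100": 80,
    "1x RTX PRO 6000": 96,
    "1x H200": 141,
    "2x RTX PRO 6000": 192,
    "2x H200": 282,
}


def configs_that_fit(weight_gb: float, reserve_gb: float = 0.0) -> List[str]:
    """Return configurations whose total VRAM covers weights plus a reserve."""
    needed = weight_gb + reserve_gb
    return [name for name, vram in CONFIGS_GB.items() if vram >= needed]


if __name__ == "__main__":
    print("70B FP16, weights only:", configs_that_fit(140))
    print("70B FP16, 10 GB reserve:", configs_that_fit(140, reserve_gb=10))
    print("70B FP8, weights only:", configs_that_fit(70))
```

Setting a nonzero reserve shows how thin the single-H200 margin is at FP16: a 140GB model leaves about 1GB before any KV cache is allocated.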
What OpenMetal’s H200 Server Includes
The H200 server is built on the same Intel Xeon 6530P dual-socket platform as the RP6000 and XL v5:
- 1 or 2x NVIDIA H200 NVL (141GB HBM3e per GPU, 4.8 TB/s memory bandwidth)
- 2x Intel Xeon 6530P (64 cores / 128 threads total)
- 1TB DDR5-6400 (up to 2TB)
- Up to 24x NVMe drives using Micron 7500 MAX
- 40 Gbps private network / 10 Gbps public network
The 64-core, 128-thread CPU complement and DDR5-6400 memory matter for inference serving. Modern frameworks like vLLM and SGLang rely heavily on CPU-side preprocessing: tokenization, batching, KV cache management, and scheduling. The CPU needs to stay ahead of the GPU to avoid starving the accelerator. 64 cores backed by DDR5-6400 do that across a wide range of serving throughput targets.
The 40 Gbps private network enables multi-node inference configurations where tensor parallelism runs across multiple servers. For teams scaling past a single H200, low-latency high-bandwidth private networking between nodes matters for the communication overhead in distributed inference.
Best-Fit Workloads
70B and larger model inference at FP16
Among the GPUs compared here, the H200 is the only single-GPU option for serving 70B models at full FP16 precision. Teams running production endpoints on Llama 3 70B, Qwen 72B, or similar models who prioritize output quality over cost-per-token should evaluate the H200 over quantized alternatives.
Memory-bandwidth-bound serving
For multi-user inference endpoints where throughput is the primary concern, the H200’s 4.8 TB/s bandwidth means each forward pass completes faster than on lower-bandwidth GPUs. For high-concurrency serving of large models, bandwidth scales throughput more directly than additional VRAM headroom.
Large-context inference
KV cache grows linearly with context length and with the number of concurrent sequences. At very long contexts (100K+ tokens), KV cache memory requirements become significant alongside model weights. The H200’s 141GB provides more KV cache headroom at large model sizes than the H100 or the RTX PRO 6000.
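A rough KV cache estimate makes the headroom question concrete. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). The default layer count, KV head count, and head dimension are assumptions chosen to match the published Llama 3 70B configuration; substitute your own model's values. The function names are hypothetical.

```python
def kv_bytes_per_token(
    n_layers: int = 80,       # assumed: Llama 3 70B
    n_kv_heads: int = 8,      # assumed: grouped-query attention with 8 KV heads
    head_dim: int = 128,      # assumed
    bytes_per_value: int = 2, # FP16 cache entries
) -> int:
    """Bytes of KV cache one token occupies (factor 2 covers keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, batch_size: int = 1, **model) -> float:
    """Approximate KV cache size in GB for a batch of equal-length sequences."""
    return kv_bytes_per_token(**model) * context_tokens * batch_size / 1e9


def max_cached_tokens(free_gb: float, **model) -> int:
    """Total tokens (across all sequences) that fit in `free_gb` of spare VRAM."""
    return int(free_gb * 1e9) // kv_bytes_per_token(**model)


if __name__ == "__main__":
    print(f"100K-token context: {kv_cache_gb(100_000):.1f} GB of KV cache")
    print(f"Tokens that fit in 71 GB: {max_cached_tokens(71):,}")
```

Under these assumptions, the 71GB left over after loading a 70B model at FP8 is the budget shared by every concurrent sequence, so long contexts and high concurrency trade off directly against each other.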
QLoRA and fine-tuning at 70B scale
QLoRA fine-tuning of 70B models runs within the H200’s 141GB budget with room for optimizer state overhead. For teams adapting frontier-scale models to domain-specific data, the H200 provides the memory budget needed without multi-GPU tensor parallelism complexity.
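A back-of-envelope budget shows why QLoRA at 70B fits comfortably. This is a sketch under stated assumptions, not a sizing tool: it assumes 4-bit base weights (0.5 bytes per parameter), 16-bit adapter weights and gradients, and an Adam-style optimizer holding two 32-bit moments per trainable parameter. The 200M adapter parameter count in the usage example is hypothetical and depends on LoRA rank and target modules. Activations, quantization constants, and framework overhead are excluded.

```python
def qlora_memory_gb(
    base_params_billions: float,
    adapter_params_millions: float,
    base_bytes_per_param: float = 0.5,       # assumed: 4-bit frozen base weights
    adapter_bytes_per_param: float = 2.0,    # assumed: 16-bit adapter weights
    grad_bytes_per_param: float = 2.0,       # assumed: 16-bit adapter gradients
    optimizer_bytes_per_param: float = 8.0,  # assumed: two FP32 Adam moments
) -> dict:
    """Estimate frozen-base and trainable-adapter memory for QLoRA, in GB."""
    base_gb = base_params_billions * base_bytes_per_param
    per_trainable = (
        adapter_bytes_per_param + grad_bytes_per_param + optimizer_bytes_per_param
    )
    adapter_gb = adapter_params_millions * 1e6 * per_trainable / 1e9
    return {
        "base_gb": base_gb,
        "adapter_gb": adapter_gb,
        "total_gb": base_gb + adapter_gb,
    }


if __name__ == "__main__":
    estimate = qlora_memory_gb(70, adapter_params_millions=200)
    for key, value in estimate.items():
        print(f"{key}: {value:.1f}")
```

The remaining budget on a 141GB card goes to activations, which scale with batch size and sequence length and usually dominate what is left.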
HPC and scientific computing
NVIDIA’s H200 datasheet cites up to 110x faster time to results than CPUs in select HPC benchmarks. Memory-intensive scientific simulations, molecular dynamics, and numerical methods that are bandwidth-bound benefit from HBM3e in the same way LLM workloads do.
Where the RP6000 Makes More Sense
The H200 is the right choice when 141GB HBM3e and 4.8 TB/s bandwidth are the reasons you’re buying. For workloads that fit within 96GB GDDR7, the RP6000 delivers comparable or better cost efficiency per token on standard inference workloads and is better suited to:
- Inference and fine-tuning on 30B-70B models at FP8 or Q4
- Mixed AI and visual computing pipelines (RT Cores, NVENC video encoding)
- Teams where per-token cost efficiency at the 30B-70B range is the primary goal
The H200 earns its place when the model size exceeds 96GB at target precision, when bandwidth is the measured throughput ceiling, or when a team is running workloads that specifically require the HBM3e memory architecture.
Getting Started
H200 servers are reserved through OpenMetal’s Configure / Reserve flow. GPU hardware at this tier carries lead times. Get a written quote (valid for 30 days) and plan the reservation against your actual deployment timeline.
For teams validating whether the H200 configuration fits their workload before committing, OpenMetal’s PoC program provides a structured evaluation path with engineer-to-engineer support.
Frequently Asked Questions
What is the difference between the H200 SXM5 and H200 NVL?
Both variants use the same Hopper GH100 die and 141GB HBM3e memory subsystem. The SXM5 mounts on the NVLink Switch System board in HGX H200 nodes, runs at 700W, typically requires liquid cooling, and connects GPUs at 900 GB/s bidirectional via NVSwitch for full-mesh all-to-all communication. The NVL is a PCIe Gen 5 add-in card, runs at 600W, can be air-cooled, and fits standard server hardware. For multi-GPU configurations at 1-4 GPUs per server in standard rack hardware, NVL is the appropriate variant. For 8-GPU distributed training nodes with full NVSwitch fabric, SXM5 is the datacenter standard.
Why does memory bandwidth matter for LLM inference?
Each token generated during inference requires a forward pass through the model, which streams model weights from GPU memory into the compute units. The rate at which those weights can be moved determines inference throughput for memory-bound workloads. Higher memory bandwidth means each forward pass completes faster, which translates directly to higher tokens per second at a given batch size.
Can a 70B model run on a single H200?
Yes, at FP16, though only just. Using the standard rule from Hugging Face’s LLM optimization documentation, a 70B model requires approximately 140GB at FP16, which fits within the H200’s 141GB with minimal headroom for KV cache. At FP8, the same model occupies approximately 70GB, leaving roughly 71GB for KV cache. Among the GPUs compared here, the H200 is the only single-GPU option for 70B inference at FP16.
How does the H200 compare to the RTX PRO 6000 for inference?
The H200 NVL has 141GB HBM3e at 4.8 TB/s. The RTX PRO 6000 has 96GB GDDR7 at 1.79 TB/s. The H200 is better suited for models above 96GB at target precision and for workloads where memory bandwidth is the throughput bottleneck. The RP6000 is better suited for 30B-70B inference at FP8 or Q4 quantization, offers better cost efficiency at that model size range, and supports mixed AI-visual computing workloads via its RT Cores and video encoding engines.
What is NVLink and does the H200 NVL support it?
NVLink is NVIDIA’s high-speed GPU-to-GPU interconnect. H200 SXM5 uses NVLink 4.0 at 900 GB/s bidirectional per GPU through NVIDIA’s NVSwitch fabric. H200 NVL PCIe supports NVLink bridges for 2-4 GPU configurations, but at lower bandwidth than SXM5’s NVSwitch fabric. For 8-GPU distributed training nodes requiring full-mesh GPU communication, SXM5 in HGX configurations is the appropriate platform.
Is the H200 available on-demand at OpenMetal?
H200 servers are reserved through OpenMetal’s Configure / Reserve flow. Written quotes are valid for 30 days. GPU hardware at this tier has lead times, so plan the reservation against your deployment timeline rather than assuming immediate availability.