In this article

This article explains why memory bandwidth matters as much as VRAM capacity for large model workloads, where the H200 NVL fits in private AI infrastructure, and how it differs from both the RP6000 and hyperscaler GPU rental for teams running 70B+ models at production scale.


Two numbers define the H200: 141GB of HBM3e and 4.8 TB/s of memory bandwidth. The first is roughly 76% more capacity than the H100’s 80GB. The second is about 43% more bandwidth than the H100 SXM5 and roughly 2.7 times the bandwidth of the RTX PRO 6000’s GDDR7. Those aren’t incremental improvements. They change which workloads fit on a single GPU and how fast memory-bound operations run. For teams where the constraint isn’t raw compute but the speed at which model weights can be streamed through the GPU, the H200 is where the economics shift.
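The comparisons above are simple ratios of the spec figures quoted in this article. A minimal sketch of the arithmetic, using only those figures (the helper name `percent_more` is illustrative):

```python
# Spec figures as quoted in this article (decimal GB and TB/s).
H200_GB, H100_GB = 141, 80
H200_TBPS, H100_SXM5_TBPS, RTX_PRO_6000_TBPS = 4.8, 3.35, 1.79


def percent_more(new: float, old: float) -> float:
    """Return how much larger `new` is than `old`, as a percentage."""
    return (new / old - 1.0) * 100.0


capacity_gain = percent_more(H200_GB, H100_GB)             # H200 vs H100 capacity
bandwidth_gain = percent_more(H200_TBPS, H100_SXM5_TBPS)   # H200 vs H100 SXM5 bandwidth
ratio_vs_rtx = H200_TBPS / RTX_PRO_6000_TBPS               # H200 vs RTX PRO 6000 bandwidth

print(f"Capacity vs H100: +{capacity_gain:.0f}%")
print(f"Bandwidth vs H100 SXM5: +{bandwidth_gain:.0f}%")
print(f"Bandwidth vs RTX PRO 6000: {ratio_vs_rtx:.1f}x")
```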

What the H200 Actually Is

The H200 is not a new GPU architecture. It uses the same Hopper GH100 die as the H100, with the upgrade entirely in the memory subsystem. Where the H100 ships with 80GB of HBM3, the H200 ships with 141GB of HBM3e, making it NVIDIA’s first GPU to use the faster HBM3e standard. Memory bandwidth increases from 3.35 TB/s on the H100 SXM5 to 4.8 TB/s on the H200.

FP8 compute throughput is identical between H100 and H200, since the Tensor Core architecture is the same. The H200 earns its performance gains on memory-bound workloads, not on compute-bound ones. For tasks where the GPU spends most of its time waiting for data to arrive from memory rather than executing computation, the bandwidth increase is the relevant number.

SXM5 vs NVL: Why the Variant Matters

The H200 ships in two variants with meaningfully different characteristics.

H200 SXM5 is the high-density module used in HGX H200 and DGX H200 nodes. It runs at 700W TDP, typically requires liquid cooling, and connects to other GPUs via NVLink 4.0 at 900 GB/s bidirectional bandwidth per GPU through NVIDIA’s NVSwitch fabric. In an 8-GPU HGX node, NVSwitch creates a full-mesh topology where every GPU can communicate with every other GPU at full bandwidth simultaneously. This is the standard for large-scale distributed training across 70B+ parameter models at full precision.

H200 NVL is the PCIe Gen 5 add-in card variant. It runs at 600W, can be air-cooled in well-ventilated racks, and fits standard PCIe server hardware without a custom NVLink Switch System board. Multi-GPU NVL configurations connect over PCIe 5.0, or over NVLink bridges in groups of 2 to 4 cards where bridges are fitted. Neither provides the full-mesh, all-to-all bandwidth of the NVSwitch fabric.
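To get a feel for the interconnect gap, the sketch below estimates how long it takes to move a fixed payload between two GPUs at peak link rate. The NVLink figure is the 900 GB/s bidirectional number quoted above, halved for one direction. The PCIe 5.0 x16 figure (about 64 GB/s per direction) is a nominal peak and an assumption here, and the 1 GB payload is purely hypothetical. Real transfers land below these peaks.

```python
# Peak one-direction link rates in GB/s.
NVLINK_GBPS_ONE_WAY = 900 / 2     # from the 900 GB/s bidirectional figure in the article
PCIE5_X16_GBPS_ONE_WAY = 64       # assumption: nominal PCIe 5.0 x16 peak, one direction


def transfer_ms(payload_gb: float, link_gbps: float) -> float:
    """Best-case time in milliseconds to move `payload_gb` over a link."""
    return payload_gb / link_gbps * 1000.0


payload_gb = 1.0  # hypothetical payload exchanged between two GPUs
nvlink_ms = transfer_ms(payload_gb, NVLINK_GBPS_ONE_WAY)
pcie_ms = transfer_ms(payload_gb, PCIE5_X16_GBPS_ONE_WAY)

print(f"NVLink: {nvlink_ms:.2f} ms, PCIe 5.0 x16: {pcie_ms:.2f} ms")
print(f"PCIe takes about {pcie_ms / nvlink_ms:.1f}x longer at peak rates")
```

For single-GPU serving this gap does not come into play. It matters once tensor parallelism splits a model across cards and every layer exchanges activations.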

OpenMetal’s H200 server uses the NVL PCIe variant, with 1 or 2 GPUs per server. This is the right configuration for single- and dual-GPU inference and fine-tuning workloads on dedicated hardware. For organizations running 8-GPU distributed training clusters at scale, the SXM5 in HGX nodes is the datacenter-grade platform and is a different deployment model.

Why Memory Bandwidth Determines Inference Throughput

For LLM inference, every generated token requires a forward pass through the model. That forward pass streams the model weights through the GPU. The rate at which tokens can be generated is bounded by how fast those weights can move from GPU memory into the compute units, which means memory bandwidth is the throughput ceiling for memory-bound inference.
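That ceiling can be approximated with one division. The sketch below assumes batch size 1, that every weight is read once per generated token, and that KV cache reads and kernel overheads are ignored, so it gives an upper bound rather than a prediction. The function name is illustrative.

```python
def decode_ceiling_tokens_per_s(weights_gb: float, bandwidth_tbps: float) -> float:
    """Upper bound on single-stream decode rate for a memory-bound model.

    Assumes all weights are streamed from GPU memory once per token.
    """
    bandwidth_gbps = bandwidth_tbps * 1000.0  # TB/s -> GB/s
    return bandwidth_gbps / weights_gb


# Bandwidth figures quoted in this article, with 70GB of weights (70B at FP8).
for name, tbps in [("H200", 4.8), ("H100 SXM5", 3.35), ("RTX PRO 6000", 1.79)]:
    ceiling = decode_ceiling_tokens_per_s(70.0, tbps)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s per stream")
```

Batching raises aggregate throughput because one pass over the weights serves many requests, but the per-pass time is still set by bandwidth.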

Using the Hugging Face rule of thumb (approximately 2 bytes per parameter at FP16), a 70B parameter model requires approximately 140GB of VRAM just for weights. That fills a single H200’s 141GB nearly completely, but it fits on one card. An H100 at 80GB cannot hold a 70B model at FP16 at all without multi-GPU tensor parallelism or quantization.
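The rule of thumb is easy to apply to other sizes and precisions. A minimal sketch, where the 0.5 bytes per parameter for 4-bit quantization is an assumption that ignores quantization scales and other metadata:

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "fp8": 1.0,
    "q4": 0.5,  # assumption: 4-bit weights, ignoring scale/zero-point overhead
}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (weights only, no KV cache)."""
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("fp16", "fp8", "q4"):
    print(f"70B at {precision}: ~{weight_gb(70, precision):.0f}GB")
```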

At FP8 (1 byte per parameter), a 70B model occupies approximately 70GB, fitting an H100 with about 10GB of KV cache headroom, or an H200 with roughly 71GB of headroom for KV cache and large batch processing. The bandwidth difference matters here: streaming 70GB of weights takes roughly 15 ms at the H200’s 4.8 TB/s versus roughly 21 ms at the H100 SXM5’s 3.35 TB/s, which tracks the 1.43x bandwidth ratio. For high-throughput serving, the H200 completes each forward pass faster even when both GPUs can technically hold the model.

For 100B+ parameter models at FP16, even the H200’s 141GB isn’t sufficient for a single GPU, and multi-GPU tensor parallelism is required regardless of hardware. At that scale, SXM5 with NVSwitch is the appropriate platform.

The 141GB Floor for Large Models

The capacity difference matters most at the boundary between what fits and what doesn’t. A 70B model at FP16 occupies approximately 140GB. That barely fits in a single H200. It doesn’t fit in an H100 or an RTX PRO 6000 at FP16. It fits in a dual-RP6000 configuration (192GB combined GDDR7), but without NVLink, inter-GPU tensor parallelism runs over PCIe, adding communication overhead.
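The fit boundary can be checked mechanically. The sketch below uses the capacities quoted in this article and a hypothetical `reserve_gb` parameter for KV cache and runtime overhead, which shows how thin the single-H200 margin is at FP16.

```python
# VRAM per configuration in GB, as quoted in this article.
CONFIGS_GB = {
    "H100": 80,
    "RTX PRO 6000": 96,
    "H200": 141,
    "2x RTX PRO 6000": 192,
    "2x H200": 282,
}


def configs_that_fit(weights_gb: float, reserve_gb: float = 0.0) -> list[str]:
    """Return configurations whose total VRAM covers weights plus a reserve.

    `reserve_gb` is a hypothetical allowance for KV cache and runtime overhead.
    """
    needed = weights_gb + reserve_gb
    return [name for name, vram in CONFIGS_GB.items() if vram >= needed]


print(configs_that_fit(140))                  # 70B at FP16, weights only
print(configs_that_fit(140, reserve_gb=10))   # same model with a 10GB reserve
```

With weights alone, a single H200 qualifies. Adding even a modest reserve pushes a 70B FP16 deployment to a dual-GPU configuration, which is worth testing against real context lengths before committing.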

For teams running 70B models at FP16 for maximum output quality, the H200 is the only single-GPU option among the GPUs compared here. For teams running 70B at FP8 and prioritizing throughput at a lower cost point, the RTX PRO 6000 is a strong alternative.

The two-GPU H200 configuration (282GB total HBM3e) opens up the range of models above 70B at full precision, including the growing class of Mixture-of-Experts models where all expert weights must remain resident in GPU memory simultaneously.

What OpenMetal’s H200 Server Includes

The H200 server is built on the same Intel Xeon 6530P dual-socket platform as the RP6000 and XL v5:

  • 1 or 2x NVIDIA H200 NVL (141GB HBM3e per GPU, 4.8 TB/s memory bandwidth)
  • 2x Intel Xeon 6530P (64 cores / 128 threads total)
  • 1TB DDR5-6400 (up to 2TB)
  • Up to 24x NVMe drives using Micron 7500 MAX
  • 40 Gbps private network / 10 Gbps public network

The 64 CPU cores (128 threads) and DDR5-6400 memory matter for inference serving. Modern frameworks like vLLM and SGLang rely heavily on CPU-side preprocessing: tokenization, batching, KV cache management, and scheduling. The CPU needs to stay ahead of the GPU to avoid starving the accelerator. Sixty-four cores backed by DDR5-6400 do that across a wide range of serving throughput targets.

The 40 Gbps private network enables multi-node inference configurations where tensor parallelism runs across multiple servers. For teams scaling past a single H200, low-latency high-bandwidth private networking between nodes matters for the communication overhead in distributed inference.

Best-Fit Workloads

70B and larger model inference at FP16

The H200 is the only single-GPU option for serving 70B models at full FP16 precision. Teams running production endpoints on Llama 3 70B, Qwen 72B, or similar models who prioritize output quality over cost-per-token should evaluate the H200 over quantized alternatives.

Memory-bandwidth-bound serving

For multi-user inference endpoints where throughput is the primary concern, the H200’s 4.8 TB/s bandwidth means each forward pass completes faster than on lower-bandwidth GPUs. For high-concurrency serving of large models, bandwidth scales throughput more directly than additional VRAM headroom.

Large-context inference

KV cache grows with context length. At very long contexts (100K+ tokens), KV cache memory requirements become significant alongside model weights. The H200’s 141GB provides more KV cache headroom at large model sizes than any alternative single-GPU option.
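KV cache size follows from the model’s attention shape: two tensors (keys and values) per layer, per KV head, per token. The sketch below uses defaults that match the published Llama 3 70B configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and FP16 cache entries. Treat those defaults as assumptions and substitute the values for your own model.

```python
def kv_cache_gb(
    tokens: int,
    layers: int = 80,          # assumption: Llama 3 70B layer count
    kv_heads: int = 8,         # assumption: grouped-query attention KV heads
    head_dim: int = 128,       # assumption: per-head dimension
    bytes_per_value: int = 2,  # FP16 cache entries
) -> float:
    """Approximate KV cache size in GB for `tokens` total cached tokens."""
    # Factor of 2 covers the separate key and value tensors.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


for context in (8_000, 32_000, 100_000):
    print(f"{context:>7} tokens: ~{kv_cache_gb(context):.1f}GB of KV cache")
```

Under these assumptions a single 100K-token sequence needs roughly 33GB of cache, and concurrent sequences multiply that figure.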

QLoRA and fine-tuning at 70B scale

QLoRA fine-tuning of 70B models runs within the H200’s 141GB budget with room for optimizer state overhead. For teams adapting frontier-scale models to domain-specific data, the H200 provides the memory budget needed without multi-GPU tensor parallelism complexity.
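A rough budget illustrates why this fits. The sketch below assumes 4-bit base weights at about 0.5 bytes per parameter and about 12 bytes per trainable adapter parameter (16-bit weight, 16-bit gradient, two 32-bit Adam states). Both figures, and the 0.2B trainable-parameter count, are illustrative assumptions. Activation memory, which depends on batch size and sequence length, is not included.

```python
def qlora_budget_gb(
    params_billions: float,
    trainable_billions: float,
    vram_gb: float = 141.0,
) -> dict[str, float]:
    """Rough QLoRA memory budget in GB, excluding activations."""
    base_gb = params_billions * 0.5          # assumption: 4-bit frozen base weights
    adapter_gb = trainable_billions * 12.0   # assumption: weight + grad + Adam states
    return {
        "base_weights_gb": base_gb,
        "adapter_training_gb": adapter_gb,
        "remaining_gb": vram_gb - base_gb - adapter_gb,
    }


budget = qlora_budget_gb(params_billions=70, trainable_billions=0.2)
for key, value in budget.items():
    print(f"{key}: {value:.1f}")
```

The remaining budget is what activations, longer sequences, and larger batches draw from, so measured peak memory on a real run is the number to plan around.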

HPC and scientific computing

NVIDIA’s H200 datasheet cites up to 110x faster time to results than CPUs on select HPC applications. Memory-intensive scientific simulations, molecular dynamics, and numerical methods that are bandwidth-bound benefit from HBM3e for the same reason LLM inference does.

Where the RP6000 Makes More Sense

The H200 is the right choice when 141GB HBM3e and 4.8 TB/s bandwidth are the reasons you’re buying. For workloads that fit within 96GB GDDR7, the RP6000 delivers comparable or better cost efficiency per token on standard inference workloads and is better suited to:

  • Inference and fine-tuning on 30B-70B models at FP8 or Q4
  • Mixed AI and visual computing pipelines (RT Cores, NVENC video encoding)
  • Teams where per-token cost efficiency at the 30B-70B range is the primary goal

The H200 earns its place when the model size exceeds 96GB at target precision, when bandwidth is the measured throughput ceiling, or when a team is running workloads that specifically require the HBM3e memory architecture.

Getting Started

H200 servers are reserved through OpenMetal’s Configure / Reserve flow. GPU hardware at this tier carries lead times. Get a written quote (valid for 30 days) and plan the reservation against your actual deployment timeline.

For teams validating whether the H200 configuration fits their workload before committing, OpenMetal’s PoC program provides a structured evaluation path with engineer-to-engineer support.

Frequently Asked Questions

What is the difference between the H200 SXM5 and H200 NVL?

Both variants use the same Hopper GH100 die and 141GB HBM3e memory subsystem. The SXM5 mounts on the NVLink Switch System board in HGX H200 nodes, runs at 700W, typically requires liquid cooling, and connects GPUs at 900 GB/s bidirectional via NVSwitch for full-mesh all-to-all communication. The NVL is a PCIe Gen 5 add-in card, runs at 600W, can be air-cooled, and fits standard server hardware. For configurations of 1-4 GPUs per server in standard rack hardware, NVL is the appropriate variant. For 8-GPU distributed training nodes with full NVSwitch fabric, SXM5 is the datacenter standard.

Why does memory bandwidth matter for LLM inference?

Each token generated during inference requires a forward pass through the model, which streams model weights from GPU memory into the compute units. The rate at which those weights can be moved determines inference throughput for memory-bound workloads. Higher memory bandwidth means each forward pass completes faster, which translates directly to higher tokens per second at a given batch size.

Can a 70B model run on a single H200?

Yes, at FP16. Using the standard rule from Hugging Face’s LLM optimization documentation, a 70B model requires approximately 140GB at FP16, which fits within the H200’s 141GB with minimal headroom for KV cache. At FP8, the same model occupies approximately 70GB, leaving roughly 71GB for KV cache. The H200 is the only single-GPU option for 70B inference at FP16.

How does the H200 compare to the RTX PRO 6000 for inference?

The H200 NVL has 141GB HBM3e at 4.8 TB/s. The RTX PRO 6000 has 96GB GDDR7 at 1.79 TB/s. The H200 is better suited for models above 96GB at target precision and for workloads where memory bandwidth is the throughput bottleneck. The RP6000 is better suited for 30B-70B inference at FP8 or Q4 quantization, offers better cost efficiency at that model size range, and supports mixed AI-visual computing workloads via its RT Cores and video encoding engines.

What is NVLink and does the H200 NVL support it?

NVLink is NVIDIA’s high-speed GPU-to-GPU interconnect. H200 SXM5 uses NVLink 4.0 at 900 GB/s bidirectional per GPU through NVIDIA’s NVSwitch fabric. H200 NVL PCIe supports NVLink bridges for 2-4 GPU configurations, but at lower bandwidth than SXM5’s NVSwitch fabric. For 8-GPU distributed training nodes requiring full-mesh GPU communication, SXM5 in HGX configurations is the appropriate platform.

Is the H200 available on-demand at OpenMetal?

H200 servers are reserved through OpenMetal’s Configure / Reserve flow. Written quotes are valid for 30 days. GPU hardware at this tier has lead times, so plan the reservation against your deployment timeline rather than assuming immediate availability.

How the H200 Is Built for Memory-Bound AI Workloads

Jun 24, 2026

The H200 is a memory upgrade on the Hopper architecture, not a new compute platform. This article covers why bandwidth matters as much as VRAM capacity, where the 141GB floor changes what fits on a single GPU, and how the NVL PCIe variant differs from the SXM5 for dedicated private infrastructure.
