Running a 70B LLM on a Single OpenMetal H200

Q: Can I run a 70B parameter LLM on a single OpenMetal H200?

Yes, a single OpenMetal H200 runs a 70B-parameter model in 16-bit precision, because its 141GB of HBM3e holds the roughly 140GB of weights on one card without tensor-parallel sharding.
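The sizing behind that answer can be checked with a few lines of arithmetic. This is a minimal sketch that counts weights only, in decimal gigabytes; the helper name `weight_memory_gb` is illustrative, and it ignores the KV cache, activations, and runtime overhead that a real serving stack also needs.

```python
# Weights-only memory estimate (decimal GB). Illustrative helper, not a vendor tool.
H200_HBM_GB = 141  # HBM3e capacity stated above


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Return the gigabytes needed to hold the model weights alone."""
    return params_billion * 1e9 * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(70, 2)  # 16-bit: 2 bytes per parameter
fp8_gb = weight_memory_gb(70, 1)   # FP8: 1 byte per parameter

print(f"16-bit weights: {fp16_gb:.0f} GB, left over: {H200_HBM_GB - fp16_gb:.0f} GB")
print(f"FP8 weights:    {fp8_gb:.0f} GB, left over: {H200_HBM_GB - fp8_gb:.0f} GB")
```

At 16-bit the weights leave about 1GB of the card free, which is why the precision choice matters for anything beyond short contexts.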

Fitting a 70B model on one GPU avoids splitting it across two cards, which removes the inter-GPU communication overhead and deployment complexity of tensor parallelism. For inference serving, one H200 can replace two H100s for these model sizes, and the 4.8 TB/s HBM3e bandwidth keeps tokens-per-second high on memory-bound generation.
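The link between bandwidth and tokens-per-second can be approximated with a simple roofline-style estimate: in memory-bound decoding, each generated token requires reading the full set of weights once, so bandwidth divided by weight size gives a rough ceiling for a single sequence. The function below is a sketch under that assumption only; it ignores KV cache reads and kernel efficiency, and real throughput will be lower per sequence and higher in aggregate once requests are batched.

```python
# Rough upper bound on single-sequence decode speed when generation is memory-bound.
# Assumption: every token requires one full pass over the weights in HBM.
HBM_BANDWIDTH_BYTES_PER_S = 4.8e12  # 4.8 TB/s, as stated above


def decode_ceiling_tokens_per_s(weight_bytes: float, bandwidth: float) -> float:
    """Return bandwidth / bytes-read-per-token, an idealized ceiling."""
    return bandwidth / weight_bytes


fp16_ceiling = decode_ceiling_tokens_per_s(140e9, HBM_BANDWIDTH_BYTES_PER_S)
fp8_ceiling = decode_ceiling_tokens_per_s(70e9, HBM_BANDWIDTH_BYTES_PER_S)

print(f"16-bit ceiling: {fp16_ceiling:.1f} tokens/s per sequence")
print(f"FP8 ceiling:    {fp8_ceiling:.1f} tokens/s per sequence")
```

Because batched requests share each pass over the weights, aggregate throughput scales well beyond this per-sequence figure until compute or KV cache capacity becomes the limit.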

Headroom at 16-bit is tight: the weights consume most of the 141GB, leaving little room for the KV cache, so very long context windows or large batch sizes may still benefit from FP8 quantization or a second card. With lower-precision serving, or LoRA/QLoRA fine-tuning over a quantized base model, even larger models fit comfortably, and a two-card H200 server provides 282GB in aggregate across two discrete cards (not pooled) for bigger jobs sharded in software.
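To see why context length and batch size push toward FP8 or a second card, it helps to estimate the KV cache. The sketch below assumes a Llama-style 70B layout (80 layers, 8 key/value heads with grouped-query attention, head dimension 128) and a 16-bit cache; those architecture numbers are assumptions for illustration and are not taken from this article, so substitute the values for your model. It also ignores activations and framework overhead.

```python
# KV cache sizing sketch. Architecture values are ASSUMED (Llama-style 70B layout).
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2  # 16-bit cache entries


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_gb: float, per_token_bytes: int) -> int:
    """Total tokens (summed over all sequences in a batch) that fit in free memory."""
    return int(free_gb * 1e9 // per_token_bytes)


per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, KV_BYTES_PER_VALUE)
tokens_fp16_weights = max_cached_tokens(141 - 140, per_token)  # 16-bit weights
tokens_fp8_weights = max_cached_tokens(141 - 70, per_token)    # FP8 weights

print(f"KV cache per token: {per_token / 1e6:.2f} MB")
print(f"Tokens that fit with 16-bit weights: {tokens_fp16_weights}")
print(f"Tokens that fit with FP8 weights:    {tokens_fp8_weights}")
```

Under these assumptions, 16-bit weights leave room for only a few thousand cached tokens in total, while FP8 weights free enough memory for long contexts or many concurrent requests on the same card.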

OpenMetal delivers the H200 as a single-tenant bare metal server with full root access, so you choose the serving stack (NVIDIA NIM, vLLM, or TensorRT-LLM) and run it on fixed monthly pricing with included egress.
