In this article
This article walks through how to architect a confidential AI inference pipeline on dedicated bare metal using Intel TDX on OpenMetal’s XL v5 servers. We cover why public cloud confidential VMs create problems for this workload, what the XL v5 hardware gives you out of the box, how to provision and manage Trust Domains directly on bare metal, deploy vLLM inside them, handle model storage, and satisfy the attestation requirements that matter to auditors and compliance teams.
If you’re running inference on sensitive data such as patient records, financial transactions, or personal identifiers, the hardware isolation story matters as much as the software stack. Encryption in transit and at rest gets you partway there. Hardware-level isolation of the inference process itself is a different class of protection, and it’s what Intel TDX is designed to provide.
The problem is finding infrastructure that actually delivers it without introducing a new set of tradeoffs. Public cloud confidential VMs exist, but they come with constraints that are hard to work around at scale. Bare metal TDX on dedicated hardware is the cleaner solution, and with the OpenMetal XL v5, it’s available without any upgrade path or special ordering process.
Here’s how to build the stack.
Why Public Cloud Confidential VMs Fall Short for Inference Workloads
Public cloud providers offer TDX-based confidential VM options. Azure Confidential Computing is the most mature. On paper, this looks like the easy path. In practice, a few things get complicated quickly.
Shared tenancy and attestation trust chains. TDX’s attestation model proves that a specific Trust Domain is running on a specific hardware configuration in a specific state. On a dedicated bare metal server, that chain is clean. On a shared cloud host, the attestation you receive reflects the hypervisor layer the cloud provider controls, not a physical server you have any direct relationship with. For regulated workloads where auditors want proof of hardware-level isolation, that distinction matters. “We ran this in a confidential VM on a multi-tenant cloud host” is a harder story to tell than “we ran this on dedicated hardware with a clean attestation chain”.
Egress costs for model artifacts. Large language models are large. Moving a 70B parameter model between cloud storage and a confidential VM repeatedly gets expensive fast. Cloud egress pricing is metered and billed per GB. On dedicated bare metal with local NVMe and unmetered private bandwidth between servers, this is not a line item.
Inference latency variability. Confidential VMs on public cloud are still subject to noisy neighbor effects at the host level. Inference workloads are latency-sensitive. Running on a dedicated physical server with no other tenants removes that variable entirely.
Cost predictability. Cloud confidential VM pricing layers compute costs on top of an already expensive category of infrastructure. For teams running inference continuously rather than in bursts, fixed-cost dedicated hardware almost always wins on TCO past a certain utilization threshold.
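The utilization threshold is simple arithmetic. The sketch below is illustrative only: both prices are placeholders, not quotes from any provider, and it ignores egress and storage charges, which would move the break-even point further in favor of fixed-cost hardware.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def breakeven_hours(fixed_monthly_cost: float, cloud_hourly_rate: float) -> float:
    """Hours of use per month above which the fixed-cost server is cheaper."""
    if cloud_hourly_rate <= 0:
        raise ValueError("cloud_hourly_rate must be positive")
    return fixed_monthly_cost / cloud_hourly_rate


# Placeholder prices for illustration; substitute real quotes.
hours = breakeven_hours(fixed_monthly_cost=3000.0, cloud_hourly_rate=10.0)
utilization = hours / HOURS_PER_MONTH
print(f"break-even at {hours:.0f} h/month ({utilization:.0%} utilization)")
```

An inference service that runs continuously sits at 100% utilization by this measure, so it is above the threshold for any pair of prices where the monthly fixed cost is lower than 730 hours of on-demand usage.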
What the XL v5 Gives You Out of the Box
The XL v5 runs on Intel’s Granite Rapids architecture (Intel 3 process node) and ships TDX-active at the BIOS level. There is no RAM upgrade required, no special configuration order, and no waiting for a custom build.
CPU: 2x Intel Xeon 6530P, 32 cores / 64 threads per socket (64 cores / 128 threads total). Granite Rapids is the first generation where TDX is production-stable across the full feature set.
RAM: 1TB DDR5-6400. Every integrated memory controller (IMC) channel on both sockets is populated with one DIMM per channel (1DPC), which is the actual Intel hardware requirement for TDX activation: all Slot 0 positions on every IMC channel must be populated for both CPUs. The XL v5 satisfies this at base configuration, which is why TDX is ready without any upgrade. The 1TB figure is a consequence of how OpenMetal stocks 64GB RDIMMs across 16 slots (16 × 64GB = 1,024GB), not an arbitrary threshold.
Why does 1TB matter operationally? Three reasons:
TDX reserves per-page metadata before any Trust Domain launches. On a 1TB node this overhead is roughly 4GB, significant but manageable. At lower RAM totals, the overhead consumes a larger share of available memory before workloads start.
Each Trust Domain carries its own control structures, encryption metadata, and integrity-tracking overhead. Running multiple Trust Domains simultaneously is the normal production pattern for an inference service, and this overhead stacks. 1TB gives you the headroom to run several TDs without memory pressure.
NUMA locality under TDX: with TDX enabled, remote memory accesses across sockets are encrypted over UPI, adding latency. 1TB across two sockets gives 512GB per socket of local memory to work with. This matters for inference workloads that keep large model weights resident in memory.
Storage: 4x 6.4TB Micron 7500 MAX NVMe (25.6TB total). This is enough to keep multiple large model checkpoints on local storage. A 70B parameter model in float16 is roughly 140GB, so you can comfortably cache several models locally. The 7500 MAX is an enterprise-grade drive rated for sustained read/write throughput under continuous load.
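The 140GB figure follows directly from parameter count times bytes per parameter. A small sketch of that arithmetic, using decimal units and counting weights only (no tokenizer files, optimizer state, or filesystem overhead):

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weights_gb(params_billion: float, dtype: str = "float16") -> float:
    """Approximate weight size in decimal GB for a model of the given size."""
    # (params_billion * 1e9 params) * (bytes per param) / (1e9 bytes per GB)
    return float(params_billion * BYTES_PER_PARAM[dtype])


def models_that_fit(storage_tb: float, params_billion: float,
                    dtype: str = "float16") -> int:
    """Upper bound on how many such checkpoints fit in raw storage."""
    return int(storage_tb * 1000 // weights_gb(params_billion, dtype))


print(weights_gb(70))             # 70B parameters in float16
print(models_that_fit(25.6, 70))  # against 25.6TB of raw NVMe
```

The second number is an upper bound on raw capacity. Any RAID layout, filesystem overhead, or space reserved for logs and datasets reduces it.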
Networking: Dual 10 Gbps NICs with 40 Gbps aggregate private bandwidth, unmetered between servers. Public bandwidth is 6 Gbps. Each customer gets dedicated VLANs at the infrastructure level.
Compliance posture: OpenMetal operates as a Business Associate under HIPAA and maintains a security and privacy framework grounded in ISO 27001:2022. This includes documented policies and procedures, incident response and breach notification processes, and periodic HIPAA/HITECH security assessments. For healthcare inference workloads, deploying on infrastructure with this compliance posture, combined with hardware-level attestation, meaningfully strengthens the technical safeguard story for auditors.
Architecting the Inference Stack on TDX Bare Metal
Trust Domain Provisioning on Bare Metal
OpenMetal currently supports Intel TDX on bare metal only. OpenStack Nova does not yet have native support for managing TDX Trust Domains upstream, so TD provisioning happens at the bare metal level directly rather than through Nova.
What this means in practice: you provision your XL v5 bare metal server with TDX active at the BIOS level, then manage the TD lifecycle using Intel’s TDX tooling directly on the host.
At a high level, the bare metal TDX flow looks like this:
- Deploy on an XL v5 bare metal server with TDX active at BIOS
- Use Intel TDX tooling to launch a Trust Domain on the host
- The TD boots with hardware-isolated, memory-encrypted execution
- The TD generates a TD Report that can be used for remote attestation
Your inference workload, the model weights loaded into memory, and the intermediate activations during inference are all encrypted and isolated from anything else on the host, including the host OS.
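Before launching or trusting a TD, it can help to confirm that TDX is actually active on both sides. The sketch below is a minimal check and rests on two assumptions that depend on kernel version and distribution: that the host exposes a `tdx` parameter under the `kvm_intel` module in sysfs, and that the guest kernel's TDX driver creates `/dev/tdx_guest`. Verify both paths against your own kernel before relying on them.

```python
from pathlib import Path

# Assumed paths; they vary with kernel version and distribution.
HOST_TDX_PARAM = Path("/sys/module/kvm_intel/parameters/tdx")
GUEST_TDX_DEVICE = Path("/dev/tdx_guest")


def host_tdx_enabled(param: Path = HOST_TDX_PARAM) -> bool:
    """True if the KVM module parameter reports TDX as enabled on the host."""
    try:
        return param.read_text().strip() in {"Y", "1"}
    except OSError:
        # Parameter missing or unreadable: treat as not enabled.
        return False


def inside_trust_domain(device: Path = GUEST_TDX_DEVICE) -> bool:
    """True if the TDX guest device node is present (run this inside the VM)."""
    return device.exists()


print("host TDX enabled:", host_tdx_enabled())
print("running inside a TD:", inside_trust_domain())
```

A check like this is only a convenience for operators. It is not evidence for a third party; that is what the attestation flow is for.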
Deploying vLLM Inside the Trust Domain
vLLM is the right framework for this workload. It handles continuous batching and paged attention for KV cache management, and it includes a CPU backend suited to CPU-only, memory-heavy hardware like this. For models in the 7B–70B range running on a 1TB, 128-thread system, vLLM’s memory management is a better fit than running raw HuggingFace Transformers inference.
Inside the Trust Domain, vLLM deployment is straightforward. The TD looks like a standard Linux VM to the application layer. TDX encryption is transparent to the software stack. You install vLLM, point it at your model artifacts, and it serves the OpenAI-compatible API endpoint as normal.
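From the client side, nothing about the request changes because the server runs inside a TD. A minimal sketch of a client for the OpenAI-compatible completions endpoint, using only the standard library; the address and model name in the usage comment are hypothetical, and the port should match however you start the server:

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["choices"][0]["text"]


# Hypothetical usage against a TD on the private network:
# print(complete("http://10.0.0.5:8000", "my-model", "Summarize this note: ..."))
```

Plain HTTP is shown only to keep the sketch short. For sensitive data, terminate TLS inside the TD so that requests are not readable on the host.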
A few configuration points worth noting for this specific hardware:
Thread and memory allocation. The XL v5 has 128 threads across two NUMA nodes. vLLM’s tensor parallelism should be set to align with NUMA boundaries where possible to minimize cross-socket memory traffic. With TDX enabled and cross-socket UPI encrypted, keeping model shards within a single NUMA node is worth the extra configuration effort for latency-sensitive deployments.
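One way to keep a serving process on a single socket is to pin it to that NUMA node's CPUs before it starts allocating memory. The sketch below reads the node's CPU list from sysfs and sets the process affinity. It is Linux-only, and it controls CPU placement rather than memory placement: memory tends to follow under the default first-touch policy, but a tool such as `numactl` with `--cpunodebind` and `--membind` is the stricter option.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a sysfs cpulist such as '0-3,8-11' into individual CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        cpus.extend(range(int(low), int(high or low) + 1))
    return cpus


def node_cpus(node: int) -> list[int]:
    """CPU ids that belong to the given NUMA node."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def pin_to_node(node: int) -> None:
    """Restrict the current process to the CPUs of one NUMA node."""
    os.sched_setaffinity(0, node_cpus(node))


print(parse_cpulist("0-3,8-11"))
```

Calling `pin_to_node(0)` at the top of a launcher script, before the model is loaded, keeps the worker threads on socket 0. A second instance pinned to node 1 can then serve a different model or replica.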
KV cache sizing. With 1TB of RAM and TDX overhead accounted for, you have significant headroom for KV cache. vLLM’s memory utilization settings should be tuned based on your model size and expected concurrency. Consult the current vLLM documentation for specific flags, as these have evolved across releases.
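A rough way to size the cache is to compute the bytes stored per token and divide the memory you are willing to give it. The sketch below uses the standard per-token formula for a transformer KV cache. The model dimensions are illustrative values for a 70B-class model with grouped-query attention and are an assumption, as is the 200GiB budget; take the real values from your model's configuration file.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache consumed by one token of context."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(budget_gib: int, per_token_bytes: int) -> int:
    """Total tokens (summed over all concurrent sequences) that fit in the budget."""
    return budget_gib * 2**30 // per_token_bytes


# Assumed dimensions for illustration: 80 layers, 8 KV heads, head size 128, fp16.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)                          # 327680 bytes, i.e. 320 KiB per token
print(max_cached_tokens(200, per_token))  # 655360 tokens in a 200 GiB budget
```

Dividing the token total by your typical context length gives a first estimate of how many concurrent sequences the cache can hold.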
Note: The vLLM configuration guidance in this section describes the general approach and relevant architectural considerations for running inference on TDX bare metal. Specific flag names and defaults should be verified against current vLLM release documentation before deploying.
Model serving isolation. If you’re serving multiple models or multiple tenants from the same physical host, run each in a separate Trust Domain rather than multiplexing inside a single TD. The per-TD overhead is real but manageable, and the isolation guarantee is the point of the whole architecture.
Memory Sizing for Multi-TD Workloads
Running more than one Trust Domain simultaneously is the normal production pattern for a real inference service. You may want redundancy, you may be serving different models, or you may be separating workloads by data classification level.
The overhead math for multi-TD on the XL v5:
- PAMT metadata: approximately 4GB at 1TB total RAM, reserved once for the whole host regardless of how many TDs are active
- Per-TD metadata overhead: varies by TD size, but plan for several GB per active Trust Domain
- Practical headroom for workloads: roughly 900-950GB depending on number of active TDs
For a deployment running three concurrent TDs (one per model tier, for example), you have more than enough headroom. The architecture starts to feel constrained only if you’re trying to run many small TDs with large model footprints simultaneously, which is not the typical inference pattern.
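The budget above can be written down as a small calculation. In this sketch the total RAM and PAMT figures come from the numbers above, while the per-TD overhead and the memory held back for the host OS are assumptions chosen for illustration; measure both on your own system.

```python
def workload_headroom_gb(total_gb: float, pamt_gb: float,
                         per_td_overhead_gb: float, active_tds: int,
                         host_reserved_gb: float = 0.0) -> float:
    """Memory left for TD workloads after fixed and per-TD overheads."""
    return (total_gb - pamt_gb - host_reserved_gb
            - per_td_overhead_gb * active_tds)


# 1024 GB total and ~4 GB PAMT from the text; 8 GB per TD and
# 64 GB for the host OS are assumed values, not measurements.
headroom = workload_headroom_gb(
    total_gb=1024, pamt_gb=4, per_td_overhead_gb=8,
    active_tds=3, host_reserved_gb=64,
)
print(headroom)      # memory available across all TDs, in GB
print(headroom / 3)  # even split per TD, in GB
```

With these assumed inputs the result lands inside the 900-950GB range quoted above, and an even three-way split leaves each TD enough room for a 70B float16 model plus a sizeable KV cache.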
Local NVMe vs. Ceph for Model Storage
The XL v5’s 25.6TB of local NVMe is the right starting point for most inference deployments. Model artifacts are read-heavy, read repeatedly, and benefit from low-latency local access. Keeping your model weights on local NVMe means no network round trip on model load, no dependency on external storage availability, and predictable latency.
The case for adding a Ceph cluster comes when you hit one of these situations:
Multiple inference nodes serving the same models. If you’re running several XL v5 nodes for availability or scale, maintaining synchronized model artifacts across local storage on each node gets unwieldy. A Ceph cluster shared across nodes gives you a single source of truth. OpenMetal offers dedicated Ceph storage clusters that integrate directly with the platform and connect via the same private network fabric.
Model versioning at scale. If you’re running frequent model updates across multiple nodes, Ceph’s object storage interface makes version management cleaner than syncing files across local drives.
Storage capacity beyond 25.6TB. The XL v5’s local NVMe is generous, but if your model library grows beyond what fits locally, Ceph gives you the expansion path.
For a single-node deployment or a small cluster with stable model artifacts, stick with local NVMe. It’s faster and simpler.
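If you do stay on local NVMe across a few nodes, a checksum manifest is a cheap way to confirm that every node holds identical model files. A minimal sketch, assuming each model lives in its own directory; the directory layout and the idea of comparing against a reference manifest are illustrative choices, not an OpenMetal or vLLM feature:

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints never load fully into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to model_dir) to its SHA-256 digest."""
    return {
        str(path.relative_to(model_dir)): sha256_file(path)
        for path in sorted(model_dir.rglob("*"))
        if path.is_file()
    }


def drift(reference: dict[str, str], local: dict[str, str]) -> list[str]:
    """Names of files that are missing, extra, or different on the local node."""
    return sorted(
        name for name in reference.keys() | local.keys()
        if reference.get(name) != local.get(name)
    )
```

Build the manifest once on the node that received the vetted model, distribute it with the deployment, and have each node run `drift(reference, build_manifest(model_dir))` before serving. An empty list means the copies match.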
The Attestation Flow
This is the section that determines whether your architecture is actually defensible to auditors, not just technically sound.
TD Quote Generation
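Once a Trust Domain is running, it can request a TD Report that contains measurements of the TD’s initial state: the software loaded into it, its configuration, and the security version of the platform it’s running on. The report is produced by the CPU and the TDX module rather than by guest software, and it is integrity-protected with a MAC key that only the CPU on that platform can use. As a result, a TD Report on its own can only be checked locally, on the same machine.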
To make this report verifiable by someone outside the TD, it needs to be converted into a TD Quote. The Quoting Enclave (part of the TDX attestation infrastructure) takes the TD Report and signs it with a key that chains back to Intel’s attestation certificates. The result is a TD Quote: a portable, verifiable attestation of what is running on what hardware.
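From inside the guest, recent Linux kernels expose quote generation through the configfs-tsm report interface: create a directory, write 64 bytes of caller-chosen data to `inblob`, and read the quote back from `outblob`. The sketch below assumes that interface is available and mounted at the path shown, and that a quote provider is configured for the TD. The entry name `inference0` is hypothetical. Binding a verifier-supplied nonce into the 64 bytes is what lets the verifier tell a fresh quote from a replayed one.

```python
import hashlib
from pathlib import Path

# Assumed location of the configfs-tsm report interface inside the guest.
TSM_REPORT_ROOT = Path("/sys/kernel/config/tsm/report")


def make_report_data(nonce: bytes, payload: bytes = b"") -> bytes:
    """Derive 64 bytes of report data as SHA-512(nonce || payload)."""
    return hashlib.sha512(nonce + payload).digest()


def fetch_quote(report_data: bytes, name: str = "inference0",
                root: Path = TSM_REPORT_ROOT) -> bytes:
    """Request a quote that embeds report_data; run as root inside the TD."""
    if len(report_data) != 64:
        raise ValueError("TDX report data must be exactly 64 bytes")
    entry = root / name
    entry.mkdir()  # creating the directory allocates a report request
    try:
        (entry / "inblob").write_bytes(report_data)
        return (entry / "outblob").read_bytes()
    finally:
        entry.rmdir()  # release the request once the quote has been read


# Hypothetical usage with a nonce received from the verifier:
# quote = fetch_quote(make_report_data(nonce=b"verifier-nonce"))
```

The `payload` argument can carry something the verifier should also trust, such as the hash of the TLS public key the inference endpoint presents.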
DCAP Verification
Intel’s Data Center Attestation Primitives (DCAP) is the verification stack. A relying party (your auditor, your compliance system, or a remote attestation service) takes the TD Quote and verifies:
- That the signature chains back to a valid Intel attestation certificate for a Granite Rapids processor
- That the hardware is in a known-good state (no security version number rollbacks, no debug mode enabled)
- That the TD measurements match what you expect, confirming the software inside the TD hasn’t been tampered with
The output of DCAP verification is a cryptographic proof that a specific, unmodified software stack is running inside a hardware-isolated Trust Domain on a specific class of hardware. That proof can be logged, timestamped, and presented to auditors.
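The last check in that list, comparing measurements against expected values, is policy code you write yourself. A minimal sketch, assuming a DCAP verification library has already validated the signature chain and handed back the measurement registers as bytes. The register names follow TDX terminology (MRTD for the build-time measurement, RTMRs for runtime measurements), but the dictionary shape and the values are illustrative:

```python
import hmac


def mismatched_measurements(actual: dict[str, bytes],
                            expected: dict[str, bytes]) -> list[str]:
    """Return the names of expected registers that are missing or differ."""
    mismatches = []
    for name, want in expected.items():
        got = actual.get(name, b"")
        # compare_digest avoids leaking how many leading bytes matched.
        if not hmac.compare_digest(got, want):
            mismatches.append(name)
    return mismatches


# Illustrative 48-byte values; real ones come from your reference build.
expected = {"mrtd": bytes.fromhex("11" * 48), "rtmr0": bytes.fromhex("22" * 48)}
reported = {"mrtd": bytes.fromhex("11" * 48), "rtmr0": bytes.fromhex("33" * 48)}
print(mismatched_measurements(reported, expected))
```

An empty result means the TD matches the reference build. Anything else should fail the attestation, and the mismatched register names are worth recording in the audit log.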
What This Proves to Auditors and Compliance Teams
For HIPAA technical safeguard requirements, hardware attestation addresses the “access control” and “audit control” requirements in a way that software controls alone cannot. The attestation record demonstrates:
- PHI processed inside the inference pipeline was isolated at the hardware level during processing
- The software handling the PHI was in a known, verified state
- The hardware configuration providing the isolation is documented and verifiable
OpenMetal operates as a Business Associate under HIPAA, with security practices grounded in ISO 27001:2022 and documented incident response procedures including breach notification. That infrastructure-layer compliance posture, combined with hardware attestation from the TD Quote, gives your compliance team concrete artifacts for the technical safeguard section that go beyond what contractual controls alone can provide.
For SOC 2 Type II audits, attestation logs support the Security and Confidentiality trust services criteria, specifically around logical and physical access controls and encryption of sensitive data during processing. OpenMetal is currently pursuing SOC 2 certification; you can request SOC report information directly through the SOC Report Request Form.
Networking and Isolation
The network architecture around your confidential inference nodes matters as much as the TD isolation itself. Hardware attestation of the compute layer is weakened if the network path to and from the TD is poorly controlled.
Infrastructure-Level Isolation
At the infrastructure layer, OpenMetal assigns dedicated VLANs to each customer. These operate at the physical network level between servers. Your XL v5 nodes are not sharing a VLAN with any other customer’s infrastructure. Each server includes dual 10 Gbps NICs, giving you 40 Gbps of aggregate private bandwidth within your environment, unmetered.
Above the VLAN layer, network segmentation and access control are managed by you using whatever tooling fits your stack. Common patterns for sensitive inference workloads include host-based firewalls to restrict which processes can receive inbound connections, separate network interfaces or VLANs for management traffic vs. inference traffic, and encrypted tunnels for any data moving between your inference nodes and external systems.
External Connectivity
If your inference pipeline needs to receive data from external systems (a hospital EHR, a financial data feed, a remote processing environment), the connection should be encrypted end to end before data reaches the bare metal host. OpenMetal provides the public IP space and dedicated VLAN infrastructure. The encryption layer for external connections is implemented at the application or OS level on your servers using standard tooling such as WireGuard or IPsec.
FAQ
Does OpenMetal manage the TDX configuration, or do I control it?
TDX is active at the BIOS level on the XL v5. TD lifecycle management (provisioning, attestation, termination) is handled using Intel’s TDX tooling directly on the bare metal host. OpenMetal provides the hardware configuration. The TD workloads are yours to manage.
Can I run non-TDX workloads alongside TDX workloads on the same cluster?
Yes. TDX is a per-Trust-Domain feature. Standard processes and confidential TDs can coexist on the same bare metal host. For compliance purposes you would typically want to keep sensitive inference workloads on dedicated hardware, but the platform supports mixed use.
What models does this architecture support?
Any model that runs on vLLM. For CPU inference on the XL v5, the practical range is roughly 7B to 70B parameters depending on precision and your concurrency requirements. Larger models can be accommodated with Ceph-backed storage and appropriate memory configuration.
Is attestation verification something I do, or does OpenMetal do it?
Verification is performed by your systems or by a remote attestation service you choose. OpenMetal provides the hardware that generates valid TD Quotes. Verification using Intel’s DCAP libraries is a client-side operation. If you need help setting up the verification pipeline, OpenMetal’s engineer-to-engineer support team can assist.
How does this architecture handle model updates without breaking attestation?
TD measurements are taken at launch time. When you update a model and restart the TD, new measurements are generated for the new state. Your attestation logs will show both the old and new states. This is expected behavior: attestation records a version history, not a single permanent state.
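One way to keep that version history in a form auditors can check is a hash-chained log, where each record commits to the one before it. This is a sketch of the idea rather than a prescribed format; the event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _record_hash(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(log: list[dict], event: dict) -> dict:
    """Append an attestation event, chaining it to the previous record."""
    prev = log[-1]["hash"] if log else GENESIS
    record = {"prev": prev, "event": event, "hash": _record_hash(prev, event)}
    log.append(record)
    return record


def chain_is_intact(log: list[dict]) -> bool:
    """Recompute every hash; any edited or removed record breaks the chain."""
    prev = GENESIS
    for record in log:
        if record["prev"] != prev:
            return False
        if record["hash"] != _record_hash(prev, record["event"]):
            return False
        prev = record["hash"]
    return True


# Hypothetical events: one per TD launch, before and after a model update.
log: list[dict] = []
append_record(log, {"model": "model-v1", "mrtd": "aa" * 48, "verified": True})
append_record(log, {"model": "model-v2", "mrtd": "bb" * 48, "verified": True})
print(chain_is_intact(log))
```

Because each hash covers the previous one, silently rewriting an old entry invalidates every record after it, which is the property that makes the history useful as evidence.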
What does HIPAA compliance at the infrastructure level actually cover?
OpenMetal’s HIPAA compliance covers the infrastructure layer: physical security of data centers, access controls, and encryption capabilities. OpenMetal acts as a Business Associate, with documented policies, incident response procedures, and periodic HIPAA/HITECH security assessments. Your application-level HIPAA obligations (workforce training, policies, breach notification procedures) remain your responsibility. The infrastructure compliance removes a significant portion of the technical safeguard checklist.
OpenMetal’s XL v5 servers are available in Ashburn, VA. If you’re evaluating this architecture for a sensitive inference workload, contact the OpenMetal team for a technical conversation or to request access to the hardware specifications and compliance documentation.