In this article
This article walks through how to architect a confidential AI inference pipeline on dedicated bare metal using Intel TDX on OpenMetal’s XL v5 servers. We cover why public cloud confidential VMs create problems for this workload, what the XL v5 hardware gives you out of the box, how to set up Trust Domains in OpenStack, deploy vLLM inside them, handle model storage, and satisfy the attestation requirements that matter to auditors and compliance teams.
If you’re running inference on sensitive data such as patient records, financial transactions, or personal identifiers, the hardware isolation story matters as much as the software stack. Encryption in transit and at rest gets you partway there. Hardware-level isolation of the inference process itself is a different class of protection, and it’s what Intel TDX is designed to provide.
The problem is finding infrastructure that actually delivers it without introducing a new set of tradeoffs. Public cloud confidential VMs exist, but they come with constraints that are hard to work around at scale. Bare metal TDX on dedicated hardware is the cleaner solution, and with the OpenMetal XL v5, it’s available without any upgrade path or special ordering process.
Here’s how to build the stack.
Why Public Cloud Confidential VMs Fall Short for Inference Workloads
Public cloud providers offer TDX-based confidential VM options. Azure Confidential Computing is the most mature. On paper, this looks like the easy path. In practice, a few things get complicated quickly.
Shared tenancy and attestation trust chains. TDX’s attestation model proves that a specific Trust Domain is running on a specific hardware configuration in a specific state. On a dedicated bare metal server, that chain is clean. On a shared cloud host, the attestation you receive reflects the hypervisor layer the cloud provider controls, not a physical server you have any direct relationship with. For regulated workloads where auditors want proof of hardware-level isolation, that distinction matters. “We ran this in a confidential VM on a multi-tenant cloud host” is a harder story to tell than “we ran this on dedicated hardware with a clean attestation chain”.
Egress costs for model artifacts. Large language models are large. Moving a 70B parameter model between cloud storage and a confidential VM repeatedly gets expensive fast. Cloud egress pricing is metered and billed per GB. On dedicated bare metal with local NVMe and unmetered private bandwidth between servers, this is not a line item.
Inference latency variability. Confidential VMs on public cloud are still subject to noisy neighbor effects at the host level. Inference workloads are latency-sensitive. Running on a dedicated physical server with no other tenants removes that variable entirely.
Cost predictability. Cloud confidential VM pricing layers compute costs on top of an already expensive category of infrastructure. For teams running inference continuously rather than in bursts, fixed-cost dedicated hardware almost always wins on TCO past a certain utilization threshold.
What the XL v5 Gives You Out of the Box
The XL v5 runs on Intel’s Granite Rapids architecture (Intel 3 process node) and ships TDX-active at the BIOS level. There is no RAM upgrade required, no special configuration order, and no waiting for a custom build.
CPU: 2x Intel Xeon 6530P, 32 cores / 64 threads per socket (64 cores / 128 threads total). Granite Rapids is the first generation where TDX is production-stable across the full feature set.
RAM: 1TB DDR5-6400. All IMC channels are populated at 1DPC across both sockets, which is the actual Intel hardware requirement for TDX activation. Intel TDX requires all Slot 0 positions on every IMC channel to be populated for both CPUs. The XL v5 satisfies this at base configuration, which is why TDX is ready without any upgrade. The 1TB figure is a consequence of how OpenMetal stocks 64GB RDIMMs across 16 slots, not an arbitrary threshold.
Why does 1TB matter operationally? Three reasons:
- TDX reserves per-page metadata (the Physical Address Metadata Table, or PAMT) before any Trust Domain launches. On a 1TB node this overhead is roughly 4GB, significant but manageable. On nodes with less RAM, the combined TDX overheads leave less absolute headroom before workloads start.
- Each confidential VM carries its own TDX control structures, Secure EPT pages, and integrity-tracking overhead. Running multiple Trust Domains simultaneously is the normal production pattern for an inference service, and this overhead stacks. 1TB gives you the headroom to run several TDs without memory pressure.
- NUMA locality under TDX: with TDX enabled, remote memory accesses across sockets are encrypted over UPI, adding latency. 1TB across two sockets gives 512GB per socket of local memory to work with. This matters for inference workloads that keep large model weights resident in memory.
Storage: 4x 6.4TB Micron 7500 MAX NVMe (25.6TB total). This is enough to keep multiple large model checkpoints on local storage. A 70B parameter model in float16 is roughly 140GB, so you can comfortably cache several models locally. The 7500 MAX is an enterprise-grade drive rated for sustained read/write throughput under continuous load.
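The sizing arithmetic above can be checked with a few lines of Python. This is a rough sketch that counts weight bytes only (no KV cache, tokenizer files, or filesystem overhead) and uses decimal units; the function names are illustrative.

```python
# Rough weight-size arithmetic: parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weights_gb(params_billions: float, dtype: str = "float16") -> float:
    """Approximate size of the model weights in decimal GB."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

def checkpoints_that_fit(storage_tb: float, params_billions: float,
                         dtype: str = "float16") -> int:
    """How many copies of a checkpoint fit in the given raw capacity."""
    return int(storage_tb * 1000 // weights_gb(params_billions, dtype))

print(weights_gb(70))                  # 140.0
print(checkpoints_that_fit(25.6, 70))  # 182
```

The second figure is an upper bound on raw capacity; any redundancy layout across the four drives would reduce it.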
Networking: Dual 10 Gbps NICs with 40 Gbps aggregate private bandwidth, unmetered between servers. Public bandwidth is 6 Gbps. Each customer gets dedicated VLANs at the infrastructure level.
Compliance posture: OpenMetal operates as a Business Associate under HIPAA and maintains a security and privacy framework grounded in ISO 27001:2022. This includes documented policies and procedures, incident response and breach notification processes, and periodic HIPAA/HITECH security assessments. For healthcare inference workloads, deploying on infrastructure with this compliance posture, combined with hardware-level attestation, meaningfully strengthens the technical safeguard story for auditors.
Architecting the Inference Stack Inside a Trust Domain
Trust Domain Provisioning on OpenStack
OpenMetal private clouds run OpenStack, and Nova handles the compute layer. Launching a TDX Trust Domain is done at the VM provisioning level. You specify a TDX-capable flavor and OpenStack handles the hardware configuration.
At a high level, the provisioning flow looks like this:
- Create a Nova flavor with TDX hardware properties enabled
- Launch a VM instance against that flavor on an XL v5 host
- The VM boots inside a hardware-isolated Trust Domain with memory encryption active
- The TD generates a TD Report that can be used for remote attestation
TDX operates at the VM level. Your inference workload, the model weights loaded into memory, and the intermediate activations during inference are all encrypted and isolated from anything else on the host, including the hypervisor.
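Once the VM is up, it is worth confirming from inside the guest that it really booted as a Trust Domain. The sketch below assumes a Linux guest with TDX guest support, where the kernel reports a tdx_guest CPU flag in /proc/cpuinfo and exposes a /dev/tdx_guest device. Both depend on kernel version and configuration, so treat this as a starting point and verify it against your guest image.

```python
import os

def has_tdx_guest_flag(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in cpuinfo lists tdx_guest."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            if "tdx_guest" in flags.split():
                return True
    return False

def running_in_trust_domain(cpuinfo_path: str = "/proc/cpuinfo",
                            device_path: str = "/dev/tdx_guest") -> bool:
    """Best-effort check: CPU flag present, or guest device node exists."""
    try:
        with open(cpuinfo_path, encoding="ascii", errors="replace") as fh:
            if has_tdx_guest_flag(fh.read()):
                return True
    except OSError:
        pass  # cpuinfo unavailable (non-Linux or restricted environment)
    return os.path.exists(device_path)

print("TDX guest detected:", running_in_trust_domain())
```

A local check like this is only a sanity test. It does not replace remote attestation, because a compromised host could present a fake guest environment.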
Within the OpenStack environment, you can also create VPCs through OpenStack Projects. Each VPC gets its own logically isolated virtual network space, custom IP ranges, firewall rules, and security groups. Your confidential inference nodes can sit in a fully isolated network segment with controlled ingress and egress, which matters for regulated data pipelines where you need to demonstrate network-level isolation alongside hardware isolation.
Deploying vLLM Inside the Trust Domain
vLLM is the right framework for this workload. It handles continuous batching and paged attention for KV cache management, and its CPU backend supports inference on CPU- and memory-heavy hardware like this. For models in the 7B–70B range running on a 1TB, 128-thread system, vLLM’s memory management is a better fit than running raw Hugging Face Transformers inference.
Inside the Trust Domain, vLLM deployment is straightforward. The TD looks like a standard Linux VM to the application layer. TDX encryption is transparent to the software stack. You install vLLM, point it at your model artifacts, and it serves the OpenAI-compatible API endpoint as normal.
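Because the endpoint is OpenAI-compatible, a client only needs to POST JSON to it. The sketch below uses the standard library and assumes a vLLM server reachable at http://127.0.0.1:8000 serving a model registered under the placeholder name my-model. Both values are assumptions to replace with your own.

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first completion's text."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Example (requires a running server; endpoint and model are hypothetical):
#   print(complete("http://127.0.0.1:8000", "my-model", "Hello"))
```

In the segmented network layout described later, this client would run in the data processing tier rather than on a public network.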
A few configuration points worth noting for this specific hardware:
Thread and memory allocation. The XL v5 has 128 threads across two NUMA nodes. vLLM’s tensor parallelism should be set to align with NUMA boundaries where possible to minimize cross-socket memory traffic. With TDX enabled and cross-socket UPI encrypted, keeping model shards within a single NUMA node is worth the extra configuration effort for latency-sensitive deployments.
KV cache sizing. With 1TB of RAM and TDX overhead accounted for, you have significant headroom for KV cache. vLLM’s memory utilization settings should be tuned based on your model size and expected concurrency. Consult the current vLLM documentation for specific flags, as these have evolved across releases.
Note: The vLLM configuration guidance in this section describes the general approach and relevant architectural considerations for running inference on TDX bare metal. Specific flag names and defaults should be verified against current vLLM release documentation before deploying.
Model serving isolation. If you’re serving multiple models or multiple tenants from the same physical host, run each in a separate Trust Domain rather than multiplexing inside a single TD. The per-TD overhead is real but manageable, and the isolation guarantee is the point of the whole architecture.
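For the thread and memory allocation point above, one way to keep a serving process on a single socket is to pin it to that NUMA node's CPUs before the model loads. The sketch below assumes a Linux guest that exposes NUMA topology under /sys/devices/system/node (whether a TD sees the host topology depends on how the VM is configured) and uses os.sched_setaffinity, which is Linux-only. It pins CPU scheduling only; memory placement would still need a tool such as numactl.

```python
import os

def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist string such as '0-31,64-95' into CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def node_cpus(node: int) -> set[int]:
    """Read the CPUs that belong to one NUMA node from sysfs."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path, encoding="ascii") as fh:
        return parse_cpulist(fh.read())

def pin_to_node(node: int) -> set[int]:
    """Restrict the current process (pid 0 = self) to one node's CPUs."""
    cpus = node_cpus(node)
    os.sched_setaffinity(0, cpus)
    return cpus

# Example (Linux only): call pin_to_node(0) before starting the server
# so that worker threads created afterwards inherit the affinity mask.
```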
Memory Sizing for Multi-TD Workloads
Running more than one Trust Domain simultaneously is the normal production pattern for a real inference service. You may want redundancy, you may be serving different models, or you may be separating workloads by data classification level.
The overhead math for multi-TD on the XL v5:
- PAMT metadata: approximately 4GB at 1TB total RAM, reserved once at the host level regardless of how many TDs are active
- Per-TD overhead (control structures and Secure EPT pages): varies by TD size, but plan for several GB per active Trust Domain
- Practical headroom for workloads: roughly 900-950GB depending on number of active TDs
For a deployment running three concurrent TDs (one per model tier, for example), you have more than enough headroom. The architecture starts to feel constrained only if you’re trying to run many TDs that each hold a large model footprint at the same time, which is not the typical inference pattern.
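The overhead math above can be turned into a quick planning check. In this sketch the 4GB PAMT figure comes from the list above, while the per-TD overhead (4GB) and the host reserve for the hypervisor and OpenStack services (64GB) are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HostBudget:
    total_gb: float = 1024.0         # 16 x 64GB RDIMMs
    pamt_gb: float = 4.0             # approximate figure from the text
    per_td_overhead_gb: float = 4.0  # assumption: "several GB" per TD
    host_reserve_gb: float = 64.0    # assumption: hypervisor + services

    def usable_gb(self, active_tds: int) -> float:
        """Memory left for TD workloads after fixed and per-TD overhead."""
        overhead = (self.pamt_gb + self.host_reserve_gb
                    + self.per_td_overhead_gb * active_tds)
        return self.total_gb - overhead

    def fits(self, td_sizes_gb: list[float]) -> bool:
        """True if the requested TD sizes fit within the usable memory."""
        return sum(td_sizes_gb) <= self.usable_gb(len(td_sizes_gb))

budget = HostBudget()
print(budget.usable_gb(3))           # 944.0
print(budget.fits([300, 300, 300]))  # True
print(budget.fits([320, 320, 320]))  # False
```

Replacing the assumed values with numbers measured on your own host is the point of keeping them as named fields.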
Local NVMe vs. Ceph for Model Storage
The XL v5’s 25.6TB of local NVMe is the right starting point for most inference deployments. Model artifacts are read-heavy, read repeatedly, and benefit from low-latency local access. Keeping your model weights on local NVMe means no network round trip on model load, no dependency on external storage availability, and predictable latency.
The case for adding a Ceph cluster comes when you hit one of these situations:
Multiple inference nodes serving the same models. If you’re running several XL v5 nodes for availability or scale, maintaining synchronized model artifacts across local storage on each node gets unwieldy. A Ceph cluster shared across nodes gives you a single source of truth. OpenMetal offers dedicated Ceph storage clusters that integrate directly with the platform and connect via the same private network fabric.
Model versioning at scale. If you’re running frequent model updates across multiple nodes, Ceph’s object storage semantics (exposed through a Swift-compatible API served by the Ceph RADOS Gateway) make version management cleaner than syncing files across local drives.
Storage capacity beyond 25.6TB. The XL v5’s local NVMe is generous, but if your model library grows beyond what fits locally, Ceph gives you the expansion path.
For a single-node deployment or a small cluster with stable model artifacts, stick with local NVMe. It’s faster and simpler.
The Attestation Flow
This is the section that determines whether your architecture is actually defensible to auditors, not just technically sound.
TD Quote Generation
When a Trust Domain launches, it can generate a TD Report that contains measurements of the TD’s initial state: the software loaded into it, the configuration, and the hardware it’s running on. This report is integrity-protected by a MAC that the CPU computes with a key available only to the hardware, so on its own it can be verified only on the same platform.
To make this report verifiable by someone outside the TD, it needs to be converted into a TD Quote. The Quoting Enclave (part of the TDX attestation infrastructure) takes the TD Report and signs it with a key that chains back to Intel’s attestation certificates. The result is a TD Quote: a portable, verifiable attestation of what is running on what hardware.
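On recent Linux guests, one way to request a quote is the kernel's configfs-tsm report interface, where writing 64 bytes of report data to an inblob file and reading outblob returns the quote. The sketch below assumes that interface is available at /sys/kernel/config/tsm/report and that the process runs as root; the entry name and the nonce handling are illustrative. Binding a verifier-supplied nonce into the report data is what ties a quote to a specific attestation request.

```python
import hashlib
import os

TSM_REPORT_DIR = "/sys/kernel/config/tsm/report"  # assumes configfs-tsm

def make_report_data(nonce: bytes, context: bytes = b"") -> bytes:
    """Derive the 64-byte report data field from a verifier nonce."""
    return hashlib.sha512(nonce + context).digest()

def request_quote(report_data: bytes, name: str = "inference0") -> bytes:
    """Create a report entry, submit report data, and read the quote."""
    if len(report_data) != 64:
        raise ValueError("report data must be exactly 64 bytes")
    entry = os.path.join(TSM_REPORT_DIR, name)
    os.mkdir(entry)  # creating the directory allocates a report request
    try:
        with open(os.path.join(entry, "inblob"), "wb") as fh:
            fh.write(report_data)
        with open(os.path.join(entry, "outblob"), "rb") as fh:
            return fh.read()
    finally:
        os.rmdir(entry)

# Example (inside a TD, as root):
#   quote = request_quote(make_report_data(os.urandom(32)))
```

Whether outblob contains a full quote depends on the quote generation service being reachable from the guest, so check the result against your platform's attestation setup.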
DCAP Verification
Intel’s Data Center Attestation Primitives (DCAP) is the verification stack. A relying party (your auditor, your compliance system, or a remote attestation service) takes the TD Quote and verifies:
- That the signature chains back to a valid Intel attestation certificate for a Granite Rapids processor
- That the hardware is in a known-good state (no security version number rollbacks, no debug mode enabled)
- That the TD measurements match what you expect, confirming the software inside the TD hasn’t been tampered with
The output of DCAP verification is a cryptographic proof that a specific, unmodified software stack is running inside a hardware-isolated Trust Domain on a specific class of hardware. That proof can be logged, timestamped, and presented to auditors.
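Signature and certificate-chain verification should be left to Intel's DCAP quote verification library. What remains on your side is a policy decision over the verified fields. The sketch below shows that step only; the VerifiedQuote structure, its field names, and the accepted status string are hypothetical stand-ins for whatever your verification tooling returns.

```python
import hmac
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifiedQuote:
    """Hypothetical result of a successful DCAP signature check."""
    mrtd: bytes          # measurement of the initial TD contents
    debug_enabled: bool  # True if the TD was launched in debug mode
    tcb_status: str      # platform TCB status reported by the verifier

def evaluate_policy(quote: VerifiedQuote,
                    allowed_mrtds: set[bytes],
                    accepted_tcb: frozenset[str] = frozenset({"UpToDate"}),
                    ) -> list[str]:
    """Return a list of policy violations; an empty list means accept."""
    problems = []
    if quote.debug_enabled:
        problems.append("TD is running in debug mode")
    if quote.tcb_status not in accepted_tcb:
        problems.append(f"TCB status not accepted: {quote.tcb_status}")
    # compare_digest avoids leaking the match position through timing
    if not any(hmac.compare_digest(quote.mrtd, m) for m in allowed_mrtds):
        problems.append("MRTD does not match any expected measurement")
    return problems
```

Keeping a set of allowed measurements, rather than a single value, lets you accept both the outgoing and incoming image during a planned update.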
What This Proves to Auditors and Compliance Teams
For HIPAA technical safeguard requirements, hardware attestation supports the “access control” and “audit controls” standards in a way that software controls alone cannot. The attestation record demonstrates:
- PHI processed inside the inference pipeline was isolated at the hardware level during processing
- The software handling the PHI was in a known, verified state
- The hardware configuration providing the isolation is documented and verifiable
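One way to make the attestation record usable as audit evidence is to append each verification result to a log whose entries are chained by hash, so later tampering is detectable. The sketch below is an illustrative design, not something prescribed by TDX or DCAP, and the field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional

GENESIS = "0" * 64  # placeholder hash for the first entry

def _digest(body: dict) -> str:
    """Hash a canonical JSON encoding of the entry body."""
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def make_entry(prev_hash: str, quote: bytes, accepted: bool,
               when: Optional[datetime] = None) -> dict:
    """Build one log entry that commits to the previous entry's hash."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "timestamp": when.isoformat(),
        "quote_sha256": hashlib.sha256(quote).hexdigest(),
        "accepted": accepted,
        "prev_hash": prev_hash,
    }
    entry["entry_hash"] = _digest(entry)
    return entry

def verify_chain(entries: list[dict]) -> bool:
    """Recompute every hash and check each link to its predecessor."""
    prev = GENESIS
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

Storing the raw quote alongside its hash lets an auditor re-run verification later instead of trusting the logged verdict.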
OpenMetal operates as a Business Associate under HIPAA, with security practices grounded in ISO 27001:2022 and documented incident response procedures including breach notification. That infrastructure-layer compliance posture, combined with hardware attestation from the TD Quote, gives your compliance team concrete artifacts for the technical safeguard section that go beyond what contractual controls alone can provide.
For SOC 2 Type II audits, attestation logs support the Security and Confidentiality trust services criteria, specifically around logical and physical access controls and protection of sensitive data during processing. OpenMetal is currently pursuing SOC 2 certification; you can request SOC report information directly through the SOC Report Request Form.
Networking and Isolation
The network architecture around your confidential inference nodes matters as much as the TD isolation itself. Hardware attestation of the compute layer is weakened if the network path to and from the TD is poorly controlled.
Infrastructure-Level Isolation
At the infrastructure layer, OpenMetal assigns dedicated VLANs to each customer. These operate at the physical network level between servers. Your XL v5 nodes are not sharing a VLAN with any other customer’s infrastructure. This is distinct from the virtual networking layer: it’s isolation at the level of the network hardware itself.
VPC Isolation for Inference Nodes
Within your OpenStack environment, you can create VPCs through OpenStack Projects. Each VPC gets its own VXLAN overlay network running within your customer VLAN. For a confidential inference deployment, a clean architecture isolates your inference nodes in their own VPC, separate from your data ingestion pipeline and your output consumers.
Typical segmentation:
- Inference VPC: Contains the TDX VMs running vLLM. No direct external access. Receives sanitized inputs from the processing layer, returns inference outputs.
- Data processing VPC: Handles input preprocessing and output post-processing. Has controlled connectivity to the inference VPC via security group rules.
- Management VPC: Handles access to OpenMetal Central and OpenStack Horizon. Separate from data paths.
OpenStack’s security groups let you define which ports and protocols are permitted between VPCs. The virtual router and NAT functionality handle connectivity where needed without exposing inference nodes directly.
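The segmentation above translates into a small set of ingress rules. The sketch below expresses one as a plain dictionary using the field names of the Neutron security group rule API (direction, ethertype, protocol, port_range_min, port_range_max, remote_ip_prefix) and checks a flow against it locally. The CIDR range and port 8000 are hypothetical values for illustration.

```python
import ipaddress

PROCESSING_CIDR = "10.20.0.0/24"  # hypothetical data processing subnet

def inference_ingress_rule(port: int = 8000) -> dict:
    """Allow TCP to the inference API only from the processing subnet."""
    return {
        "direction": "ingress",
        "ethertype": "IPv4",
        "protocol": "tcp",
        "port_range_min": port,
        "port_range_max": port,
        "remote_ip_prefix": PROCESSING_CIDR,
    }

def is_allowed(rules: list[dict], src_ip: str, port: int,
               protocol: str = "tcp") -> bool:
    """Default deny: a flow passes only if some ingress rule matches."""
    src = ipaddress.ip_address(src_ip)
    for rule in rules:
        if rule["direction"] != "ingress" or rule["protocol"] != protocol:
            continue
        in_range = rule["port_range_min"] <= port <= rule["port_range_max"]
        in_cidr = src in ipaddress.ip_network(rule["remote_ip_prefix"])
        if in_range and in_cidr:
            return True
    return False

rules = [inference_ingress_rule()]
print(is_allowed(rules, "10.20.0.15", 8000))  # True
print(is_allowed(rules, "10.30.0.15", 8000))  # False
print(is_allowed(rules, "10.20.0.15", 22))    # False
```

Keeping the rules as data makes it easy to review them in version control and to test the intended policy before applying it to the inference security group.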
VPN for External Connections
If your inference pipeline needs to receive data from external systems (a hospital EHR system, a financial data feed, a remote data processing environment), VPN-as-a-Service in OpenStack provides encrypted tunnels for those connections. This keeps sensitive data encrypted in transit from the source all the way to your inference pipeline, not just inside the OpenMetal environment.
FAQ
Does OpenMetal manage the TDX configuration, or do I control it?
TDX is active at the BIOS level on the XL v5. The Trust Domain lifecycle (provisioning, attestation, termination) is managed through OpenStack Nova, which you control. OpenMetal provides the hardware configuration and the OpenStack environment. The TD workloads are yours to manage.
Can I run non-TDX workloads alongside TDX workloads on the same cluster?
Yes. TDX is a per-VM feature in OpenStack. Standard VMs and confidential VMs can coexist on the same cluster. You would typically segment them by network VPC for compliance reasons, but the hardware supports mixed deployments.
What models does this architecture support?
Any model that runs on vLLM. For CPU inference on the XL v5, the practical range is roughly 7B to 70B parameters depending on precision and your concurrency requirements. Larger models can be accommodated with Ceph-backed storage and appropriate memory configuration.
Is attestation verification something I do, or does OpenMetal do it?
Attestation verification is performed by your systems or by a remote attestation service. OpenMetal provides the hardware that generates valid TD Quotes. Verification using Intel’s DCAP libraries is a client-side operation. If you need help setting up the verification pipeline, OpenMetal’s engineer-to-engineer support team can assist.
How does this architecture handle model updates without breaking attestation?
TD measurements are taken at launch time. When you update a model and restart the TD, new measurements are generated for the new state. Your attestation logs will show both the old and new states. This is expected behavior: attestation records a version history, not a single permanent state.
What does HIPAA compliance at the infrastructure level actually cover?
OpenMetal’s HIPAA compliance covers the infrastructure layer: physical security of data centers, access controls, and encryption capabilities. OpenMetal acts as a Business Associate, with documented policies, incident response procedures, and periodic HIPAA/HITECH security assessments. Your application-level HIPAA obligations (workforce training, policies, breach notification procedures) remain your responsibility. The infrastructure compliance removes a significant portion of the technical safeguard checklist.
OpenMetal’s XL v5 servers are available in Ashburn, VA. If you’re evaluating this architecture for a sensitive inference workload, contact the OpenMetal team for a technical conversation or to request access to the hardware specifications and compliance documentation.