OpenMetal's v5 Hardware and Ceph: Where Intentional Design Meets Distributed Storage


Interested in our v5 hardware?

v5 hardware is currently available for deployment in our Ashburn, VA data center and can be ordered directly through our Bare Metal and Private Cloud catalogs. For deployments in other data center locations or for custom configurations, please contact our team.

Ceph behaves the way the hardware beneath its OSDs lets it behave. The OpenMetal v5 line is built so that “the way it behaves” is predictable.

Ceph earned a reputation in the early 2010s for being slow, operationally heavy, and unpredictable under load. Most of that reputation was earned on hardware Ceph was never matched to: spinning disks fronting a thin SSD journal, OSDs sharing a boot drive with the OS, recovery storms saturating a single bonded link, CRUSH maps weighted across nodes that were never the same two boxes twice. The software was doing exactly what it was told. The hardware underneath it was the variable nobody controlled.

OpenMetal’s v5 servers are an argument that the variable is controllable. Every decision in the chassis, from the choice of OSD media to the separation of the boot pool to the PCIe lane budget, is made with the distributed-storage layer in mind. The result is a cluster whose latency floor and recovery behavior sit closer to a flash array than to a commodity Ceph build, without surrendering horizontal scale-out or the native OpenStack control plane. What follows is the set of v5 hardware decisions that produce that behavior, and the distributed-storage consequence of each.

Figure: How v5 hardware shapes the Ceph topology. Three identical hyper-converged nodes with per-pool data placement. Each NVMe drive is one OSD; the boot pool is isolated; objects replicate 3x across three identical nodes whose node-level failure domain is what the cluster survives.

Key Takeaways

Each NVMe drive maps to one Ceph OSD, so the drive’s QoS is the cluster’s QoS. The Micron 7500 MAX’s sub-1ms latency floor at six nines is what keeps Cinder block volume latency flat under mixed tenant load, rather than degrading into a long tail no tuning can recover.
Boot and data storage are physically separate. Dedicated 960 GB RAID 1 boot drives keep OS I/O off the OSD pool, removing a common source of p99 jitter in tenant volumes before any Ceph tunable is touched.

Endurance is sized for Ceph’s write amplification, not a single drive’s. At 3 DWPD and roughly 35 PBW, the 7500 MAX absorbs replication, BlueStore, and rebalance overhead without becoming the component that triggers a refresh.
Identical hyper-converged nodes make recovery predictable. Uniform CRUSH weights spread a rebalance evenly across the cluster, and node-level failure domains survive the loss of a full node with no data loss at the default 3x replication.

Replication is a per-pool dial, not a cluster-wide setting. 3x for the default block pool, 2x for regenerable data, and 4+2 erasure coding (around 67 percent usable) for object and archival tiers, all on the same physical OSDs.

The OSD is a hardware decision before it’s a software one

On an OpenMetal Hosted Private Cloud, each NVMe data drive maps to a single Ceph OSD. There is no abstraction softening that relationship, so the drive’s characteristics are the OSD’s characteristics. The v5 line standardizes on the Micron 7500 MAX in 6.4 TB capacity: 232-layer 3D TLC NAND over PCIe Gen4 x4, rated at 1,100,000 random read IOPS, 7,000 MB/s sequential read, 70 microsecond typical read latency, and, most relevantly for a storage daemon, a sub-1ms guarantee at six nines of QoS for 4K random reads up to QD128.

That QoS line is the one that matters for distributed storage. Ceph’s tail latency is set by the slowest replica acknowledgement on every write and by the OSD that serves each read. A drive that is fast on average but ragged at p99.9999 turns into a long tail that no amount of cluster tuning erases. Six-nines QoS on the underlying media is what lets Cinder block volumes hold a flat latency profile under mixed tenant load.
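A minimal sketch of why the per-drive tail compounds in a replicated pool. It assumes a write completes only once every replica has acknowledged and that each replica misses its latency bound independently with the same probability; real clusters only approximate both. The function name is illustrative, and the six-nines figure (specified for 4K random reads) is reused here purely as an example miss rate.

```python
def slow_write_probability(p_slow_per_drive: float, replicas: int = 3) -> float:
    """Probability that at least one replica acknowledgement is slow.

    Assumes independent replicas with an identical per-operation miss rate.
    """
    return 1.0 - (1.0 - p_slow_per_drive) ** replicas


# Six nines of QoS leaves a 1e-6 chance that one drive misses the bound.
six_nines = slow_write_probability(1e-6)
# A drive that only holds the bound at p99 misses it 1e-2 of the time.
two_nines = slow_write_probability(1e-2)

print(f"six-nines drive: {six_nines:.2e} of 3x writes wait on a slow replica")
print(f"p99-only drive:  {two_nines:.2%} of 3x writes wait on a slow replica")
```

Under these assumptions the six-nines drive leaves about three slow writes per million, while a drive that only holds its bound at p99 leaves roughly 3 percent of writes waiting on a slow replica.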

Endurance is the second deliberate choice. Ceph multiplies host writes before they ever reach NAND: 3x replication triples them, BlueStore’s checksums and RocksDB compaction add more, and recovery rewrites whole placement groups during a rebalance. An OSD drive sees write amplification a single-purpose drive never does. The 7500 MAX is a mixed-use class drive at 3 DWPD and roughly 35 PBW of endurance, sized so the OSD outlives the refresh cycle rather than becoming the thing that triggers it.
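The endurance figure can be sanity-checked with a short calculation. This is a sketch under stated assumptions: the five-year rating period and the 1.5x allowance for BlueStore, compaction, and recovery overhead are not given in the text and are illustrative only, as are the function names.

```python
CAPACITY_TB = 6.4     # per-drive capacity, from the text
DWPD = 3              # rated drive writes per day, from the text
RATING_YEARS = 5      # assumption: the text does not state the rating period


def rated_endurance_pb(capacity_tb: float, dwpd: float, years: float) -> float:
    """Total rated writes in petabytes over the rating period."""
    return capacity_tb * dwpd * 365 * years / 1000


def client_write_budget_tb_per_day(
    capacity_tb: float,
    dwpd: float,
    osd_count: int,
    replicas: int,
    extra_amplification: float,
) -> float:
    """Client writes per day the pool absorbs while staying at rated DWPD."""
    nand_budget_tb = capacity_tb * dwpd * osd_count
    return nand_budget_tb / (replicas * extra_amplification)


endurance_pb = rated_endurance_pb(CAPACITY_TB, DWPD, RATING_YEARS)
# Six OSDs (3 nodes x 2 drives), 3x replication, and a hypothetical 1.5x for
# BlueStore metadata, compaction, and recovery traffic combined.
budget_tb = client_write_budget_tb_per_day(CAPACITY_TB, DWPD, 6, 3, 1.5)

print(f"rated endurance: {endurance_pb:.2f} PB per drive")
print(f"client write budget: {budget_tb:.1f} TB/day across the pool")
```

With a five-year rating period the per-drive figure comes out at 35.04 PB, in line with the roughly 35 PBW quoted above. The client write budget shows how replication and overhead divide the raw NAND allowance before tenants see it.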

And because the entire pool is NVMe, there is no separate journal or WAL tier to provision, fail, or bottleneck. BlueStore’s write-ahead log and its RocksDB metadata live on the same fast flash as the data. The split-media designs that produced Ceph’s worst latency cliffs (a fast journal in front of slow bulk storage) simply do not exist here. Every operation is end-to-end NVMe on Gen4 silicon.

Boot and data isolation: the unglamorous decision that matters most

Every v5 server ships with two dedicated 960 GB boot SSDs in RAID 1, completely separate from the data NVMe that becomes the OSD pool. It is the least glamorous line on the spec sheet and arguably the one that does the most for storage predictability.

When the operating system shares a drive with an OSD, every package update, log rotation, monitoring agent flush, and journald write competes with client I/O for the same queue depth. On a storage node that contention surfaces as p99 jitter in tenant volumes that correlates with nothing the tenant did. Isolating the OS onto its own RAID 1 pair removes an entire class of latency noise at the hardware level, before any Ceph tunable is touched. It is a decision OpenMetal carried forward across every current-generation server precisely because it is invisible when it works and miserable when it is missing.

The lane budget and the path to the drive

Granite Rapids gives each v5 socket 88 lanes of PCIe 5.0, 176 across the two-socket node. That budget is spent feeding the NVMe drives and the LACP-bonded NICs at full link width, with headroom left over, rather than fanning a starved lane count out through a PCIe switch. An OSD that has to share oversubscribed lanes with three other drives is an OSD whose throughput ceiling moves depending on its neighbors. Giving each drive a clean path keeps per-OSD performance a property of the drive, not of the slot it landed in.
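A rough lane-budget tally makes the headroom concrete. The lane count per socket, the x4 drive width, and the ten-bay Large v5 chassis come from the text; the NIC count and slot width are assumptions for illustration. Actual board routing decides which lanes reach which slot, so this only bounds the total.

```python
LANES_PER_SOCKET = 88   # from the text
SOCKETS = 2             # from the text
LANES_PER_NVME = 4      # the data drives are PCIe x4, from the text
DRIVE_BAYS = 10         # Large v5 bay count, from the text
NIC_COUNT = 2           # assumption: two NICs in the LACP bond
LANES_PER_NIC = 16      # assumption: each NIC sits in an x16 slot


def lane_budget(drive_bays: int, nic_count: int) -> tuple[int, int, int]:
    """Return (total, used, free) lanes with every bay and NIC at full width."""
    total = LANES_PER_SOCKET * SOCKETS
    used = drive_bays * LANES_PER_NVME + nic_count * LANES_PER_NIC
    return total, used, total - used


total, used, free = lane_budget(DRIVE_BAYS, NIC_COUNT)
print(f"total lanes: {total}, used with all bays populated: {used}, free: {free}")
```

Even with every bay populated and both hypothetical NICs at x16, the tally uses 72 of 176 lanes, which is the sense in which no drive needs to sit behind an oversubscribed switch.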

The chassis backs this with expansion room. The Large v5 moved from six drive bays to ten, leaving eight open above the base configuration. Ceph scales capacity by adding OSDs, and in-chassis headroom means an operator can densify an existing node before incurring the coordination cost of adding a whole node to the CRUSH map. It is a deliberate widening of the cheapest scaling path.

The failure domain is the node, and the node is identical to its neighbors

OpenMetal builds Hosted Private Cloud on hyper-converged cloud cores: three identical servers, each carrying control plane, compute, and OSDs together. That symmetry is doing quiet work for Ceph. CRUSH distributes data by weight, and uniform nodes mean uniform weights, which means a rebalance moves data evenly instead of hammering one lopsided host. Recovery time becomes a number you can predict rather than discover.

The failure domain is node-level, and the default replicated pool writes three copies across three distinct nodes. The arithmetic follows directly from the hardware. A 3-node Large v5 cluster carries two 6.4 TB OSDs per node, 38.4 TB raw, yielding roughly 12.8 TB usable at 3x before Ceph overhead and rebalance headroom are subtracted. A 3-node XL v5 carries four OSDs per node, 76.8 TB raw and roughly 25.6 TB usable at the same factor. In both cases the cluster survives the loss of an entire node with no data loss, because the failure domain was drawn at the boundary the hardware actually fails at.
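The capacity figures above are raw capacity divided by the replica count, which a few lines reproduce. The function names are illustrative, and the sketch deliberately omits any full-ratio or rebalance headroom.

```python
OSD_TB = 6.4  # per-OSD capacity, from the text


def cluster_capacity_tb(nodes: int, osds_per_node: int, replicas: int = 3):
    """Return (raw, usable) capacity in TB for a replicated pool."""
    raw = nodes * osds_per_node * OSD_TB
    return raw, raw / replicas


def copies_after_node_loss(replicas: int, nodes_lost: int = 1) -> int:
    """Copies left when each replica lives on a distinct node."""
    return max(replicas - nodes_lost, 0)


for name, osds in (("Large v5", 2), ("XL v5", 4)):
    raw, usable = cluster_capacity_tb(nodes=3, osds_per_node=osds)
    print(f"{name}: {raw:.1f} TB raw, {usable:.1f} TB usable at 3x")

print(f"copies remaining after one node fails: {copies_after_node_loss(3)}")
```

Because each of the three copies sits on a different node, losing one node still leaves two copies of every object, which is why the cluster survives the failure without data loss.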

Replication is a per-pool dial, not a doctrine

The 3x default exists because it matches what general-purpose VM workloads expect: Cinder block volumes and Nova ephemeral disks live on the replicated pool and inherit its latency profile. But replication is set per pool, and the architect owns the dial.

Data that can be regenerated (build artifacts, caches, checkpointable compute state) does not need three copies; a 2x pool consumes a third less raw capacity than 3x for workloads where durability is cheap to rebuild. Capacity-bound tiers go the other direction: a 4+2 erasure-coded pool fronting the RADOS Gateway for S3-compatible object storage or archival data pushes usable capacity toward 67 percent of raw, trading some write latency and CPU for space. The common production layout runs all three at once: 3x for the default block pool, 2x or erasure coding for object and cold tiers, all on the same physical OSDs. The hardware does not care which policy a placement group follows, which is exactly the point: the storage behavior is a software choice made on top of hardware that imposes no penalty of its own.
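The three policies differ only in how much raw capacity each stored byte consumes. A minimal sketch of that arithmetic follows; the function names are illustrative, and the 38.4 TB raw figure is the 3-node Large v5 example, used as if a single policy owned the whole pool.

```python
from fractions import Fraction


def replicated_efficiency(size: int) -> Fraction:
    """Usable fraction of raw capacity for a pool keeping `size` copies."""
    return Fraction(1, size)


def erasure_coded_efficiency(k: int, m: int) -> Fraction:
    """Usable fraction for an erasure-coded pool with k data and m coding chunks."""
    return Fraction(k, k + m)


RAW_TB = 38.4  # 3-node Large v5 raw capacity, from the text

policies = {
    "3x replicated": replicated_efficiency(3),
    "2x replicated": replicated_efficiency(2),
    "4+2 erasure coded": erasure_coded_efficiency(4, 2),
}

for name, efficiency in policies.items():
    usable_tb = RAW_TB * float(efficiency)
    print(f"{name}: {float(efficiency):.1%} usable, {usable_tb:.1f} TB of {RAW_TB} TB raw")
```

Moving from 3x to 2x cuts raw consumption per stored byte from three units to two, and 4+2 erasure coding cuts it to one and a half.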

What sits above the OSDs matters too

Hyper-convergence means the OSDs share each node with tenant VMs and the OpenStack control plane, so the rest of the box has to have headroom to spare. Ceph OSDs are not idle processes: they checksum every object, run erasure-code math on EC pools, and hold BlueStore cache in memory. The v5 node answers with 32 cores and 64 threads, 144 MB of L3 cache, and 512 GB of DDR5-6400 delivering roughly 819 GB/s of aggregate bandwidth, with AVX-512 and QuickAssist available to offload the crypto and compression paths. Because the hardware is single-tenant and not oversubscribed, the cycles an OSD needs during a recovery storm are cycles that are actually there, not cycles borrowed from a neighbor on a shared host.
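The aggregate bandwidth figure follows from the DDR5 transfer rate once a channel count is fixed. The text does not state the channel count; eight channels per socket is an assumption chosen because it reproduces the quoted number on a two-socket node.

```python
TRANSFER_RATE_MT_S = 6400   # DDR5-6400, from the text
BYTES_PER_TRANSFER = 8      # 64 data bits per memory channel
CHANNELS_PER_SOCKET = 8     # assumption: not stated in the text
SOCKETS = 2                 # from the text


def peak_memory_bandwidth_gb_s(
    rate_mt_s: int, bytes_per_transfer: int, channels: int, sockets: int
) -> float:
    """Theoretical peak bandwidth in GB/s across every populated channel."""
    return rate_mt_s * bytes_per_transfer * channels * sockets / 1000


peak = peak_memory_bandwidth_gb_s(
    TRANSFER_RATE_MT_S, BYTES_PER_TRANSFER, CHANNELS_PER_SOCKET, SOCKETS
)
print(f"theoretical peak: {peak:.1f} GB/s")
```

This is a theoretical ceiling with every channel populated; sustained bandwidth under a mixed OSD and VM workload will be lower.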

Scaling is a rebalance you schedule, not a forklift you dread

Growth follows the same intentional grain. Adding a node to an existing cluster takes roughly twenty minutes; Ceph rebalances automatically, the OpenStack control plane stays online, and tenant VMs keep running through the redistribution. When storage is the bottleneck rather than compute, dedicated Storage Server nodes grow the pool without adding cores. Either path is a planned, non-disruptive operation rather than a maintenance window, because the cluster was designed to absorb new OSDs as a routine event.
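How much data a rebalance has to move can be estimated ahead of time. The sketch below uses an idealized model: with identical nodes, an added node should end up holding its proportional share of the data, so that share is the minimum that must migrate. Real CRUSH placement can move somewhat more. The function names and the 20 TB stored figure are hypothetical.

```python
def rebalance_fraction(existing_nodes: int, added_nodes: int = 1) -> float:
    """Ideal fraction of stored data that migrates to the new, identical nodes."""
    return added_nodes / (existing_nodes + added_nodes)


def data_moved_tb(stored_raw_tb: float, existing_nodes: int, added_nodes: int = 1) -> float:
    """Ideal volume of replica data that lands on the new nodes, in TB."""
    return stored_raw_tb * rebalance_fraction(existing_nodes, added_nodes)


# Hypothetical example: 20 TB of replica data on a 3-node cluster, adding one node.
fraction = rebalance_fraction(existing_nodes=3)
moved = data_moved_tb(stored_raw_tb=20.0, existing_nodes=3)
print(f"fraction moved: {fraction:.0%}, about {moved:.1f} TB")
```

Going from three nodes to four moves about a quarter of the stored data in this model, and the share shrinks as the cluster grows, which is part of why later expansions are less disruptive than the first.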

“Moving our Netherlands region onto OpenMetal completely changed how we operate. We went from a room full of aging leased hardware to a handful of modern NVMe-backed servers that are faster, denser, and far more cost-efficient. The best part was how easy the transition felt. Our cloud stack didn’t need to change at all. It just worked.” – Vanessa Vasile, Director of Infrastructure, RamNode

The thesis, restated

Distributed storage inherits the discipline of the hardware it runs on. Ceph on mixed media, shared boot drives, and mismatched nodes is the Ceph people complain about. Ceph on all-NVMe OSDs with six-nines QoS, an isolated boot pool, a clean lane budget, identical failure domains, and headroom in CPU and memory is a different system wearing the same name. OpenMetal’s v5 line is the second one on purpose. The predictability operators want from storage is not a tuning artifact discovered after deployment; it is engineered into the box before the first OSD comes up.

Latency floors, failure domains, and per-pool replication trade-offs are easier to trust once you have watched them hold against your own traffic. OpenMetal Proof of Concept clusters are available for validating Ceph behavior, replication policy, and recovery timing on real v5 hardware before you commit to a production deployment. Bring a representative I/O profile and an OpenMetal architect will help you size the OSD pool, set per-pool replication and erasure-coding policy, and model the scale-out path for where your capacity is heading.
