In this article
- Building the Foundation: Hardware and Architecture for Peak Performance
- Unlocking Performance: System and Ceph-Level Tuning
- Real-World Results: Deconstructing a 1 TiB/s Ceph Cluster
- The Economics of Speed: TCO and Cost-Effectiveness
- The Road Ahead: Ceph Crimson and the Future of NVMe Storage
- Wrapping Up and Key Takeaways
- Frequently Asked Questions (FAQs)
- Get Started on an OpenStack- and Ceph-Powered Hosted Private Cloud
Moving to all-NVMe storage completely changes what Ceph can do and who it competes with. For years, Ceph built its name on being able to scale up massively and affordably using regular hard disk drives (HDDs). This made it a go-to for huge object storage needs. But the physical slowness of spinning disks always put a cap on its performance, keeping it a step behind pricey, specialized systems when it came to workloads that needed instant responses.
NVMe (Non-Volatile Memory Express) blows that cap right off.
When you combine Ceph’s smart, distributed software with storage that can handle millions of operations per second with near-zero delay, you get a solution that can go head-to-head with high-end, specialized systems. This opens the door for Ceph to run demanding applications like large-scale databases, AI/ML training, and high-performance computing in private and hybrid clouds.
But getting that speed isn’t as simple as just swapping out your old hard drives for new NVMe ones. The raw speed of the hardware is so fast that it finds new bottlenecks in other parts of your system. As OpenMetal’s own tests and documents show, the very software that makes Ceph so stable and scalable can, without proper tuning, leave you with as little as one-tenth of the performance your new drives are capable of. Getting a high-performance all-NVMe Ceph cluster running is really an engineering challenge: you have to hunt down and eliminate bottlenecks in your CPU, network, and software settings to get the full power of your flash storage.
Performance Potential
When you get the architecture and tuning right, the performance of an all-NVMe Ceph cluster is truly impressive. Benchmarks on recent Ceph releases show numbers that put it squarely in the high-performance storage league.
For instance, a detailed test on a 10-node cluster with 60 NVMe drives running Ceph Reef produced some stunning results. The cluster hit around 4.4 million random read IOPS (I/O operations per second) and a sustained 71 GB/s of throughput for large sequential reads. For writes, it managed about 800,000 random write IOPS. Because Ceph makes three copies of data by default for safety, that’s actually 2.4 million write operations happening across the cluster’s drives. And for applications that need immediate responses, like databases, the cluster kept its average write latency under 0.5 milliseconds.
It’s important to know these aren’t “out-of-the-box” numbers. They came from a specific, balanced hardware setup—in this case, Dell servers with AMD EPYC CPUs and Samsung NVMe drives—and a lot of careful tuning that we’ll break down here. These benchmarks show what’s possible when you build your cluster with a clear plan.
Metric | Achievement (Ceph Reef Benchmark) |
---|---|
Random Read IOPS (4K) | ~4.4 Million |
Random Write IOPS (4K) | ~800,000 (2.4M replicated) |
Large Read Throughput (4MB) | ~71 GB/s |
Large Write Throughput (4MB) | ~25 GB/s (75 GB/s replicated) |
Average Write Latency (4K Sync) | < 0.5 ms |
Source: Ceph Reef RBD Performance Benchmark
The big difference between read and write performance comes down to how Ceph works. A read is simple: Ceph just grabs data from one primary storage daemon (OSD). A write is more involved. With 3x replication, Ceph has to send the data to a primary OSD, which then copies it to two other OSDs. It has to get confirmation from all of them before the write is marked as complete. This, plus the work Ceph’s BlueStore backend does to manage metadata, makes writing data naturally slower than reading it. The benchmarks show this clearly, with reads being over five times faster for random operations and nearly three times faster for large files.
This tells you a lot about what these clusters are good for. All-NVMe Ceph is a fantastic fit for read-heavy jobs like data analytics, content delivery networks (CDNs), and AI inference. It’s still very fast for write-heavy tasks like databases or logging, but you need to have realistic expectations and tune the cluster specifically to make writes as efficient as possible. The goal isn’t just to buy fast hardware, but to build a complete, high-performance storage system from the ground up.
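If you want to sanity-check this read/write asymmetry on your own cluster, the rados CLI ships a simple built-in benchmark. The sketch below is a minimal example against a throwaway pool; the pool name, PG count, and thread count are placeholder assumptions, and the numbers it produces are a quick smoke test rather than a substitute for a full fio/RBD benchmark methodology.

```bash
# Create a disposable pool for benchmarking (name and PG count are placeholders).
ceph osd pool create benchpool 128 128
ceph osd pool application enable benchpool rbd

# 30-second 4 KiB write test, keeping the objects so they can be read back.
rados bench -p benchpool 30 write -b 4096 -t 64 --no-cleanup

# Random read test against the objects written above.
rados bench -p benchpool 30 rand -t 64

# Remove the benchmark objects and the pool (deletion requires mon_allow_pool_delete=true).
rados -p benchpool cleanup
ceph osd pool delete benchpool benchpool --yes-i-really-really-mean-it
```

Because rados bench talks to the cluster directly, it skips the RBD and client layers, which makes it handy for telling whether a write-versus-read gap comes from Ceph itself or from the layers above it.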
Building the Foundation: Hardware and Architecture for Peak Performance
In a classic Ceph cluster with hard drives, the bottleneck is almost always the drives themselves. With an all-NVMe cluster, it’s the other way around. The NVMe drives are so fast that they put all the performance pressure on the other parts of the server: the CPU, the memory, and the network. This means you have to design your cluster with a balanced approach, choosing every component to support the massive I/O your drives can push.
CPU: The New Engine of Storage
With NVMe, the CPU isn’t just running the operating system anymore; it’s actively driving your storage. Every I/O operation needs CPU cycles. Ceph’s algorithms have to figure out where to place data, erasure coding requires math to create parity, checksums are calculated to keep data safe, and the BlueStore backend uses a database called RocksDB that needs CPU power for metadata.
Testing shows that your speed scales directly with how many CPU threads you can give to each NVMe drive. With the Ceph Reef release, a single OSD doesn’t get much faster after about 14-16 CPU threads. This means you need CPUs with a lot of cores to handle multiple OSDs on one server. The record-breaking 1 TiB/s cluster built by Clyso used powerful 48-core AMD EPYC 9454P CPUs in each of its 68 nodes to get the parallel processing power needed to run hundreds of NVMe OSDs at once (more about that soon). Trying to save money on the CPU in an all-NVMe build is a sure way to end up with a slow, bottlenecked cluster.
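As a rough back-of-the-envelope check, the threads-per-OSD guidance translates directly into a per-node core count. The numbers below are illustrative assumptions, not a vendor sizing formula:

```bash
# Illustrative CPU budget: threads a node would need for its OSDs to scale freely.
OSDS_PER_NODE=10      # e.g., ten NVMe drives with one OSD each
THREADS_PER_OSD=14    # Reef OSDs stop scaling much past ~14-16 threads
SMT_PER_CORE=2        # hardware threads per physical core with SMT enabled

THREADS_NEEDED=$((OSDS_PER_NODE * THREADS_PER_OSD))
CORES_NEEDED=$(( (THREADS_NEEDED + SMT_PER_CORE - 1) / SMT_PER_CORE ))

echo "OSD thread budget per node: ${THREADS_NEEDED}"   # 140 in this example
echo "Physical cores to cover it: ${CORES_NEEDED}"     # 70 in this example
```

Real builds often land below that ceiling (the 48-core nodes in the 1 TiB/s cluster give roughly 9-10 threads per OSD across ten drives), but the arithmetic makes it obvious why a low-core-count CPU starves a dense all-NVMe node.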
Drives: Enterprise Grade is Non-Negotiable
One of the biggest and most costly mistakes you can make when building an all-NVMe Ceph cluster is using consumer-grade NVMe drives. They might look cheaper upfront, but they’re missing two things that are absolutely essential for enterprise storage: write endurance and power-loss protection (PLP).
Enterprise workloads, especially on a system like Ceph that’s always doing background work, can be very write-heavy. Enterprise SSDs are built for this and have a rating called Drive Writes Per Day (DWPD), which tells you how many times you can write to the entire drive every day for its 5-year warranty period. Read-focused enterprise drives usually have a DWPD under 1, mixed-use drives are rated from 1 to 3, and write-focused models can go above 3.
Even more important is power-loss protection. Enterprise drives have capacitors on board that, in a power outage, provide enough juice to save any data in the drive’s temporary cache to the permanent flash storage. Without this, you can lose data that was in the middle of being written, leading to silent data corruption. This is a deal-breaker for Ceph’s BlueStore, which depends on the integrity of its logs and metadata. As one user on Reddit found out after trying consumer drives, the result was “mediocre” speed and latency that went “through the roof” with any real write load.
Networking: The Cluster’s Backbone
The network is the nervous system of a Ceph cluster. It handles everything from client traffic and data replication to rebalancing and, most importantly, recovery after a failure. A 10Gbps (10GbE) network might be the bare minimum for production, but it’s not nearly enough for an all-NVMe cluster. Benchmarks have shown that a single client reading large files can completely saturate a 10GbE link by itself.
For any serious all-NVMe setup, you need a 25GbE, 40GbE, or even a 100GbE network. The cluster that hit 1 TiB/s used two 100GbE network cards in every single node, which shows you the kind of bandwidth you need at the top end. A slow network becomes your biggest problem during recovery. When a drive or server fails, Ceph starts copying its data to other places in the cluster to get back to full redundancy. This floods the network. A slow network makes this recovery window much longer, leaving your cluster in a risky, degraded state and increasing the chance of another failure causing data loss.
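Before blaming Ceph for slow client traffic or a crawling recovery, it is worth confirming what the network is actually delivering. A minimal sketch; the interface name is a placeholder and the exact output fields vary by distribution and driver:

```bash
# Confirm what the NIC really negotiated: a 100GbE card on a marginal optic can link at far less.
ethtool eth0 | grep -E 'Speed|Duplex'

# Confirm the MTU in effect on the cluster-facing interface.
ip link show eth0 | grep -o 'mtu [0-9]*'

# Ceph's view of its public/cluster network split (these may also live in ceph.conf).
ceph config dump | grep -E 'public_network|cluster_network'

# During a failure or rebalance, watch recovery throughput in the cluster status output.
watch -n 5 ceph -s
```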
OpenMetal’s Validated Hardware
Picking and testing a balanced set of components can be a headache. This is where a platform like OpenMetal really helps. OpenMetal offers hardware configurations that are already vetted and ready for tough OpenStack and Ceph workloads. Server options like the “Large” and “XL” configurations give you a solid foundation with high-core-count CPUs (16-32 cores), plenty of RAM (up to 1TB), and support for multiple NVMe drives per server. For storage-heavy needs, servers like the Large v3 mix high-capacity HDDs with options for NVMe flash to speed things up.
By offering these pre-validated setups, OpenMetal takes the complexity of hardware selection off your plate. You can deploy on a platform where you know the components are balanced and ready for a high-performance private cloud, helping you avoid the common mistakes of a DIY approach that can lead to a slow and unstable system. Our team has been working with OpenStack, Ceph, and other open source systems for many years, so you know you’re in good hands deploying these systems with us.
Your hardware choices set the performance limit for your cluster. No amount of software tuning can fix an underpowered CPU, a slow network, or unreliable drives. The fact that top-tier benchmarks consistently use high-core-count CPUs and 100GbE networking sends a clear message: top performance requires top hardware. This needs to be part of your budget planning. Trying to save money on the CPU or network to buy more NVMe capacity will leave you with an unbalanced system that won’t meet your expectations.
There’s also a strategic choice to make about your cluster’s physical design. Modern servers can hold ten or more NVMe drives, which creates a new problem to think about: the “blast radius” of a server failure. Densely packed nodes are cheaper, but if one of those servers fails, it triggers a massive data recovery event. It’s been estimated that recovering from a lost 60-drive server could take up to two months, leaving your cluster at high risk the whole time. This forces you to choose between fewer, denser nodes (cheaper upfront, but higher risk) and more, less-dense nodes (more expensive, but lower risk and faster recovery). This is a decision that goes beyond just performance and gets to the heart of your organization’s resilience plan.
Unlocking Performance: System and Ceph-Level Tuning
Starting with a balanced, enterprise-grade hardware foundation is the first step. But to get from the hardware’s potential to the actual performance your applications see, you need to do some deliberate tuning. These tweaks, which range from the server’s BIOS to Ceph’s configuration files, are what make the difference between a working cluster and a high-performance one. Often, the biggest wins come from changes you make outside of Ceph itself.
System-Level Tuning: The Low-Hanging Fruit
Before you even start a Ceph process, you can unlock a lot of performance at the system’s lowest levels. Two of the most effective changes are turning off features designed to save power and manage devices, as they add delays that are bad for a high-performance storage system.
Disabling CPU Power Saving (C-States)
Modern CPUs use power-saving modes called C-states to use less energy when they’re idle. When a CPU core isn’t doing anything, the operating system can put it into a “sleep” state (C1, C2, etc.), turning off parts like its clock and caches to save power. This is great for desktops, but it’s a performance killer for storage systems that need to be fast all the time.
The process of a core “waking up” from a deep C-state isn’t instant; it adds a small but critical delay. For a storage system like Ceph that needs to respond to requests in microseconds, this wake-up delay can make performance inconsistent and jumpy. The team that built the 1 TiB/s Ceph cluster found that just turning off C-states in the server BIOS gave them a 10-20% performance boost right away. For any all-NVMe Ceph setup, setting the CPU to “performance” mode and disabling C-states is a must-do first step.
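The cleanest place to disable C-states is the BIOS/UEFI, but the same effect can be approximated and verified from Linux. A hedged sketch; the tools and kernel parameters below are standard, but check your platform documentation before relying on them:

```bash
# See which idle states the CPUs currently expose.
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

# Keep cores out of deep idle states and pin the frequency governor to performance.
cpupower idle-set -D 0                   # disable idle states with wake-up latency above 0 us
cpupower frequency-set -g performance

# For a persistent setting, add kernel parameters to /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0 processor.max_cstate=1"
# then rebuild the bootloader config:
grub2-mkconfig -o /boot/grub2/grub.cfg   # or: update-grub (Debian/Ubuntu)
```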
Disabling IOMMU (VT-d/AMD-Vi)
The Input/Output Memory Management Unit (IOMMU) is a hardware feature that creates a virtual memory layer for your devices. It allows you to do things like pass a device directly to a virtual machine and protects your system’s memory from faulty devices.
But this protection has a cost. The IOMMU has to translate the memory address for every I/O operation, which adds a little bit of overhead and delay to each request. In a trusted, dedicated storage environment, this translation layer is an unnecessary performance hit. During their investigation, the Clyso team found that a lot of CPU time was being wasted on IOMMU-related operations. Turning off IOMMU at the kernel level gave them a “substantial performance boost” by getting rid of this overhead. While it’s not for every environment, it’s a key optimization for bare metal, performance-focused Ceph clusters.
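Here is a minimal sketch of how that change is typically applied on a bare metal node. The parameter spellings are standard kernel options; which one you use depends on Intel versus AMD and on whether you want the IOMMU fully off or left on in passthrough mode, and none of this is appropriate on hosts that need PCI passthrough to VMs:

```bash
# Check whether the IOMMU is currently active.
dmesg | grep -iE 'iommu|DMAR|AMD-Vi'

# Add the relevant parameter to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.:
#   intel_iommu=off    # Intel, fully disabled
#   amd_iommu=off      # AMD, fully disabled
#   iommu=pt           # keep it on, but skip translation for host-owned devices

# Rebuild the bootloader config and reboot for the change to take effect.
grub2-mkconfig -o /boot/grub2/grub.cfg   # or: update-grub (Debian/Ubuntu)
```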
Ceph-Specific Configuration: The OSD-per-Drive Dilemma
One of the biggest tuning decisions you’ll make inside Ceph is how many OSD daemons to run on each NVMe drive. In the past, with slower flash, running multiple OSDs per drive was a good way to get more performance out of it. With today’s super-fast NVMe drives and modern Ceph, the choice is more complicated. It’s a direct trade-off between raw speed and consistent latency.
Testing with Ceph Reef shows two main strategies:
- One OSD per NVMe: This is the simplest setup. It usually gives you the best speed for writing large, sequential files and is easier on your CPU. This is the recommended default if your cluster doesn’t have a ton of extra CPU power or if your main job is streaming large files, like for video storage or backups.
- Two OSDs per NVMe: This setup needs more CPU and memory, but it has a big advantage for applications that are sensitive to latency. Tests show that running two OSDs per drive can cut your 99th percentile tail latency in half. It also lets you get more IOPS in very CPU-heavy environments (like if you have 16 or more CPU threads for each NVMe drive). This makes it a better choice for things like virtual desktops (VDI) or databases, where a predictable, fast response time is more important than getting the absolute maximum throughput.
There’s no right or wrong answer here. It’s about matching your cluster’s setup to what your main applications need. You have to understand that optimizing for average latency is different from optimizing for tail latency, which is what your users actually feel.
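If the two-OSDs-per-drive layout fits your workload (and you have the CPU headroom for it), both common deployment paths expose the choice directly. A sketch with placeholder host patterns and device paths, assuming a cephadm-managed cluster for the first option:

```bash
# Option 1: cephadm/orchestrator OSD service spec; osds_per_device is a drive-group field.
cat <<'EOF' > osd-spec.yaml
service_type: osd
service_id: nvme_two_osds
placement:
  host_pattern: 'storage-*'   # placeholder host pattern
spec:
  data_devices:
    rotational: 0             # select only flash devices
  osds_per_device: 2
EOF
ceph orch apply -i osd-spec.yaml

# Option 2: manual provisioning with ceph-volume on a single host (device paths are examples).
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1
```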
Essential ceph.conf and BlueStore Settings
Besides the OSD layout, a few other Ceph settings are key for performance:
- Memory Allocation: The BlueStore backend uses a lot of memory for caching metadata. The osd_memory_target setting in your ceph.conf file should be at least 8GB per OSD to make sure it has enough memory to work without having to constantly read from the slower drives. If you have plenty of RAM, making this number bigger can improve performance even more.
- RocksDB Compilation: A tricky issue found during high-performance testing was that the standard Ceph packages for some Linux distributions weren’t compiled with the best settings for RocksDB. This led to slow database operations and throttled write performance. For top-tier performance, making sure Ceph is built with an optimized RocksDB is a must.
- Network Configuration: On your cluster’s internal network, enabling jumbo frames by setting the MTU (Maximum Transmission Unit) to 9000 is a standard best practice. This lets you send more data in each network packet, which reduces the number of packets that need to be processed and lowers CPU use.
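On a running cluster, these settings are usually applied through the centralized config database rather than by hand-editing ceph.conf on every node. A short sketch covering the two most common items from the list above; the interface name and peer address are placeholders:

```bash
# Give each OSD an 8 GiB BlueStore memory target (the value is in bytes).
ceph config set osd osd_memory_target 8589934592
ceph config get osd.0 osd_memory_target   # confirm what a specific OSD picked up

# Jumbo frames on the cluster-facing interface (persist this in your network configuration).
ip link set dev eth0 mtu 9000
ping -M do -s 8972 <peer-cluster-ip>      # 8972 + 28 bytes of headers = 9000; verifies the path end to end
```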
Here’s a quick checklist of these essential tuning settings to help you get the most out of your all-NVMe Ceph cluster.
Layer | Parameter | Recommended Value | Reason |
---|---|---|---|
BIOS/UEFI | CPU C-States | Disabled | Gets rid of CPU wake-up delays, which is key for microsecond-level responses. |
Kernel Boot Parameter | IOMMU (VT-d/AMD-Vi) | Disabled or passthrough (intel_iommu=off / amd_iommu=off, or iommu.passthrough=1) | Reduces I/O address translation overhead in trusted, bare-metal setups. |
Network Interface | MTU (Cluster Network) | 9000 (Jumbo Frames) | Reduces packet processing overhead and lowers CPU usage. |
Ceph OSD Layout | OSDs per NVMe Drive | 1 (for throughput) or 2 (for tail latency) | Balances raw speed against consistent latency based on your workload. |
Ceph ceph.conf | osd_memory_target | 8GB minimum per OSD | Makes sure BlueStore has enough memory for its metadata cache. |
Real-World Results: Deconstructing a 1 TiB/s Ceph Cluster
Benchmarks and tuning guides are a great start, but the real test is seeing how these ideas work in the real world. In early 2024, a team from the Ceph consultancy Clyso that we mentioned earlier took on a project that pushed the limits of what people thought was possible with Ceph. By systematically applying the tuning principles we’ve discussed, they built and optimized an all-NVMe cluster that broke the 1 tebibyte-per-second (TiB/s) speed barrier, giving us a powerful case study in high-performance storage.
The Challenge and the Hardware
The goal was a big one: move an existing HDD-based cluster to a new, 10-petabyte all-NVMe setup without any downtime. The hardware they used was impressive. The final setup had 68 Dell PowerEdge R6615 servers. Each server was a beast, with a 48-core AMD EPYC 9454P CPU, 192GiB of DDR5 RAM, two 100GbE Mellanox network cards, and ten 15.36TB Dell enterprise NVMe drives. This hardware was chosen specifically for its high memory bandwidth, huge number of CPU cores, and extreme network speed, creating a balanced platform that could keep up with the fast drives.
From Disappointment to Breakthrough
Even with all that powerful hardware, the first tests were a huge letdown. The performance was much lower than they expected, and even worse, it was inconsistent. It would get slower over time and only get better after a full reboot. This led to a deep investigation that uncovered several tricky, system-level bottlenecks. The fixes they found prove that the tuning strategies we’ve talked about really work:
- Problem: Latency Spikes. The first issue was inconsistent latency that was killing performance. Solution: Disabling C-States. The team correctly guessed that CPU power-saving states were causing wake-up delays. Turning off C-states in the BIOS gave them an immediate 10-20% performance boost by making sure the CPUs were always active and ready.
- Problem: Kernel-Level Bottlenecks. Even with C-states off, performance still wasn’t where it should be. A deeper look with kernel-level tools showed that a lot of time was being wasted on kernel operations. Solution: Disabling IOMMU. The team traced this issue to the IOMMU. By turning it off with a kernel boot setting, they got rid of the address translation overhead for every I/O operation, which resulted in another “substantial performance boost,” especially as they added more NVMe drives.
- Problem: Poor Write Performance. The last major problem was surprisingly slow 4K random write performance. Solution: Fixing RocksDB Compilation. The investigation found that the standard Ceph packages they were using hadn’t been compiled with the best settings for RocksDB, the database that powers BlueStore. By recompiling Ceph with the right settings, RocksDB’s internal cleanup time dropped, and 4K random write performance basically doubled.
This journey shows a critical point: getting top-tier performance isn’t a simple “install and go” job. It takes data-driven problem-solving, deep system knowledge, and the willingness to look beyond Ceph’s own settings to the operating system and hardware underneath.
The Results: Pushing Ceph to its Limits
With these fixes in place, the cluster’s true potential was finally unlocked. The results set a new public record for Ceph performance:
- 3x Replication Performance: In its fastest configuration using 3x replication, the 63-node, 630-OSD cluster achieved an incredible 1025 GiB/s (just over 1 TiB/s) of large read speed. For small random reads, it hit 25.5 million 4K IOPS. These numbers show that a well-tuned, large-scale Ceph cluster can perform on the same level as the most advanced proprietary storage systems.
- Erasure Coding Performance: The team then reconfigured the cluster to use a 6+2 erasure coding (EC) profile, which was the setup the client planned to use to get the most storage efficiency. Even with the extra work of EC, the cluster still delivered amazing performance: over 500 GiB/s for reads and nearly 400 GiB/s for writes.
This case study gives us a valuable real-world look at the performance trade-offs between replication and erasure coding on an all-NVMe platform. While 3x replication gives you the absolute best read performance, erasure coding gives you huge storage efficiency with still-massive speed, especially for writes.
The performance of EC on NVMe is completely different from how it behaves on HDDs. On spinning disks, the random I/O hit from EC is often too much to handle. But on NVMe, the drives are so fast that the CPU cost of calculating parity and the network cost of talking to multiple nodes become the new bottlenecks.
The Clyso benchmark found something that might seem backward: EC writes were much faster than replicated writes at this scale. This is because an EC write only sends small data and parity chunks to K+M nodes in parallel, while a replicated write has to push the full object three times (once to the primary OSD, which then forwards full copies to two replicas), tripling the network and drive traffic for every write. On the other hand, EC reads were slower than replicated reads (500 GiB/s vs. 1 TiB/s). An EC read has to collect chunks from multiple (K) OSDs and have the primary OSD reassemble the original object, which creates more network traffic and coordination work than a simple read from a single OSD in a replicated pool. This is a critical design choice for anyone building high-performance, all-NVMe Ceph clusters with erasure coding.
The Economics of Speed: TCO and Cost-Effectiveness
The biggest hurdle to adopting all-NVMe storage has always been the cost. If you only look at the upfront price tag, measured in dollars per gigabyte ($/GB), an all-flash solution can seem way too expensive compared to traditional HDD systems. But that’s not the whole story. A full Total Cost of Ownership (TCO) analysis, which includes operational costs like power, cooling, rack space, and management time, tells a very different story. For performance-hungry workloads, all-NVMe Ceph can be a very smart financial choice.
Erasure Coding as The Great Equalizer
The most important feature for making an all-NVMe Ceph cluster affordable is erasure coding (EC). In a standard setup, Ceph uses 3x replication, which means it stores three copies of every piece of data. This means you have a 200% storage overhead; to store 1PB of data, you need 3PB of raw disk space.
Erasure coding gives you the same level of data protection, or even better, with much less overhead. A common k=4, m=2 (or 4+2) EC profile, for example, splits data into four “data chunks” and creates two “parity chunks.” This setup can survive the failure of any two OSDs, just like 3x replication, but it only needs 1.5PB of raw storage to give you 1PB of usable space—that’s a 50% overhead instead of 200%. This basically cuts your raw storage hardware cost in half. OpenMetal even provides a public erasure coding calculator to help you figure out these trade-offs between usable space, fault tolerance, and cost.
On HDD clusters, the performance hit from erasure coding’s random I/O patterns often means it’s only used for cold archives. But on all-NVMe clusters, the baseline performance is so high that EC becomes a great and cost-effective option for a much wider range of workloads, including warm and even hot data.
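Creating a pool like that takes only a couple of commands. A sketch with placeholder profile and pool names, assuming you have at least k+m (here, six) hosts so each chunk can land on a different host:

```bash
# Define a 4+2 erasure coding profile: 4 data chunks + 2 parity chunks,
# with no two chunks of the same object placed on the same host.
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host

# Create an EC pool using that profile (PG counts are placeholders; size them for your cluster).
ceph osd pool create ec-nvme-pool 256 256 erasure ec-4-2

# RBD and CephFS data on EC pools additionally need overwrites enabled.
ceph osd pool set ec-nvme-pool allow_ec_overwrites true

# Overhead check: raw bytes per usable byte is (k+m)/k.
#   4+2 profile    -> 6/4 = 1.5x raw  (50% overhead)
#   3x replication -> 3.0x raw        (200% overhead)
```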
Power, Cooling, and Density Savings
The TCO benefits of all-NVMe go way beyond just the storage drives. Flash storage is much more efficient when it comes to power, cooling, and space.
- Power Consumption: An active 3.5-inch enterprise HDD consumes between 6 and 10 watts, while an active enterprise NVMe SSD typically consumes between 3 and 8 watts. The difference is even bigger when they’re idle, where an HDD can draw 4-8W while an NVMe drive can drop to as low as 0.5-2W. A study by Micron comparing an all-NVMe Ceph cluster to a hybrid HDD+NVMe cluster found the all-NVMe setup was way more efficient, with 46 times lower power use per GBps of write speed and 16 times lower power per GBps of read speed.
- Density and Footprint: Because they pack so much performance into a small space, you need far fewer all-NVMe nodes to get the same performance as an HDD-based cluster. The same Micron study found that it would take 161 HDD-based nodes to match the write performance of just 6 all-NVMe nodes.
The Endurance Question: Managing Long-Term Costs
A valid long-term cost to consider for any flash system is that the drives have a limited write endurance. This is a manageable problem if you plan for it. Enterprise NVMe drives are warrantied for a certain amount of writing, measured in Drive Writes Per Day (DWPD) over a typical five-year period.
The key to managing this cost is to match the drive’s endurance rating to how write-heavy your workload is. It’s a common myth that all enterprise applications need expensive, high-endurance drives. In reality, you can choose drives based on your workload:
- Read-intensive workloads (like analytics or archives) can use cheaper, high-capacity “read-intensive” drives with a DWPD of 1 or less.
- Mixed-use workloads (like general virtual machine storage or web apps) are fine with mainstream “mixed-use” drives with a DWPD between 1 and 3.
- Write-intensive workloads (like database logs or high-frequency data collection) are the only ones that usually need premium “write-intensive” drives with a DWPD greater than 3.
By looking at your application’s I/O patterns and picking the right class of enterprise NVMe drive, you can avoid overspending on endurance you don’t need, which optimizes both your initial cost and your long-term replacement costs.
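It also pays to track actual wear once the cluster is in production so your DWPD assumptions can be checked against reality. A minimal sketch using nvme-cli; the device path is a placeholder and field names can vary slightly between models and tool versions:

```bash
# Per-drive wear indicators from the NVMe SMART log.
nvme smart-log /dev/nvme0n1 | grep -iE 'percentage_used|data_units_written|media_errors'

# Rough lifetime-write math: the spec counts data units in 512,000-byte increments, so
#   total bytes written = data_units_written x 512,000
# Divide by drive capacity and days in service to compare against the rated DWPD.
```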
This table gives a high-level TCO comparison for a hypothetical 1PB usable storage cluster, showing how the higher initial cost of NVMe hardware is balanced out by big operational savings.
Factor | Hybrid HDD Cluster (Example) | All-NVMe Cluster (Example) |
---|---|---|
Data Protection Scheme | 3x Replication | Erasure Coding (EC 4+2) |
Raw Storage Required | 3 PB | 1.5 PB |
# of Nodes (Performance-Equivalent) | ~100+ | ~10 |
Rack Units Occupied | High (e.g., 20U+) | Low (e.g., 4U) |
Power Consumption (Storage System) | High (e.g., ~15-20 kW) | Low (e.g., ~3-5 kW) |
Initial Hardware Cost (CapEx) | Lower | Higher |
3-Year TCO (Hardware + Power/Space OpEx) | Higher | Lower |
Note: Values are for illustration and depend on specific hardware and power costs. Source: Micron
In the end, the main economic advantage of all-NVMe Ceph isn’t a lower cost-per-gigabyte, but a much lower cost-per-IOPS and cost-per-watt. When you start measuring based on performance and operational efficiency instead of just capacity, the financial case for all-flash becomes very strong. It also makes your data center simpler. Hybrid arrays rely on complex caching and tiering to move data between fast flash and slow HDDs, which adds management work and can lead to unpredictable performance. An all-flash setup gets rid of this complexity, creating a single, high-performance tier that’s easier to manage and gives you more predictable application response times.
The Road Ahead: Ceph Crimson and the Future of NVMe Storage
While today’s all-NVMe Ceph clusters, built on the solid BlueStore backend, deliver great performance, the Ceph community knows that the platform’s original design wasn’t made for the microsecond-latency world of modern storage. To really unlock the full potential of next-gen hardware, a major redesign is in the works. This project, called Crimson, is the future of Ceph and shows the community’s commitment to staying on the cutting edge of storage tech.
The Problem: Why Classic Ceph Can’t Keep Up
Ceph was created back when spinning disks, which could handle a few hundred IOPS, were the norm. In that world, the CPU had plenty of time to manage each I/O operation. As storage got faster, that time has shrunk dramatically. A modern NVMe drive can do over a million IOPS, which leaves the CPU with only a few thousand cycles to handle each request.
The classic Ceph OSD, with its multi-threaded design, wasn’t built for this. Its internal architecture can lead to performance-sapping context switches, threads fighting over locks, and other overhead that eats up precious CPU cycles and adds latency. This is the main reason for the performance gap between what the raw hardware can do and what Ceph delivers out of the box, as noted in OpenMetal’s documentation. While a lot of tuning can help, the architecture itself puts a limit on how efficient it can ultimately be.
The Solution: Crimson and SeaStore
The Crimson project is the Ceph community’s way of saying that to break through this limit, a new architecture is needed. It’s a complete rewrite of the Ceph OSD from the ground up, not just an update. Built on Seastar, an advanced C++ framework for high-performance applications, Crimson is engineered with a different set of rules.
Its main goal is to cut down on CPU overhead and latency by using a thread-per-core, shared-nothing design. In this model, work is split up among CPU cores to avoid threads having to communicate and lock each other, which lets a single I/O operation be processed on a single core without interruption as much as possible.
Along with this new OSD comes SeaStore, a new storage backend designed specifically for fast, random-access media like NVMe. SeaStore will eventually replace BlueStore as the default backend for Crimson, offering a more streamlined and efficient data path that isn’t held back by the legacy needs of supporting HDDs.
Current Status and Realistic Expectations
Developing a new storage engine is a huge job. For perspective, BlueStore itself took over two years to go from an idea to being production-ready. Crimson and SeaStore are a project of similar, if not greater, complexity.
Right now, Crimson is available as a technology preview in recent Ceph releases like Squid and in commercial versions like IBM Storage Ceph. This means it’s not feature-complete yet—for example, full support for erasure coding is still being worked on—and it’s not recommended for production use. Early performance tests that pair the new Crimson OSD with the stable BlueStore backend show that its performance is already as good as or slightly better than the classic OSD. However, the real, game-changing performance improvements are expected to come when the native SeaStore backend is fully mature.
This gives organizations a clear and practical roadmap. For the near future, production deployments of high-performance Ceph will continue to be based on the proven, stable, and very fast classic OSD with BlueStore. This is the architecture that powers today’s fastest clusters, including the 1 TiB/s system. Businesses can confidently invest in solutions like those offered by OpenMetal, knowing they’re deploying a mature technology while also having a clear path forward with Crimson that promises to unlock even more performance in the years to come. The message is: “stable and powerful now, with an even faster future.”
Wrapping Up and Key Takeaways
Pairing all-NVMe storage with Ceph is a major step forward for the open source storage platform, turning it from a scale-out workhorse into a high-performance machine that can compete with specialized proprietary systems. The potential is huge, with benchmarks showing multi-million IOPS and speeds of over 1 TiB/s. However, as this article has shown, that kind of performance doesn’t just happen out of the box; it’s the result of a deliberate and well-planned engineering effort.
Here are some takeaways for you to consider:
- Top Performance is Conditional: Getting the best performance from all-NVMe Ceph depends on a balanced architecture and careful tuning. The raw speed of NVMe drives moves the bottleneck to other parts of the system, making high-core-count CPUs and high-bandwidth (25/100Gbps) networking essential for success.
- System-Level Tuning is Key: Some of the biggest performance wins come from outside of Ceph’s own settings. Turning off CPU power-saving features (C-states) and, in trusted setups, the IOMMU are critical steps to cut down on latency and unlock your hardware’s true potential.
- Economics are Changed by Efficiency: While the upfront hardware cost of NVMe is higher than HDD, a TCO analysis tells a different story. Using erasure coding on NVMe makes it financially competitive, and the huge gains in performance density lead to major savings in power, cooling, and rack space, often resulting in a lower overall TCO for performance-heavy workloads.
- Architecture Must Fit the Workload: There’s no single “best” setup. You have to make smart trade-offs, like choosing between one OSD per drive for speed or two for better tail latency, and picking between 3x replication for the best read performance or erasure coding for storage efficiency and write speed.
- The Path Forward is Clear: Today’s production-ready, BlueStore-based clusters are mature, stable, and can handle the toughest enterprise workloads. The ongoing development of the next-generation Crimson OSD and SeaStore backend ensures that Ceph has a clear roadmap to take even better advantage of future storage technologies.
Deploying a high-performance all-NVMe Ceph cluster is a complex job that requires deep knowledge of storage, networking, and systems administration. Common mistakes like using consumer-grade hardware, under-sizing the network, or skipping system-level tuning can easily leave you with a frustratingly slow system.
For organizations that want to use the power of all-NVMe Ceph without the steep learning curve and risk, partnering with an expert provider is the best way to succeed. OpenMetal’s on-demand private clouds offer a pre-tuned, production-ready environment, letting customers get the benefits of this powerful technology right away.
Frequently Asked Questions (FAQs)
How do all-NVMe Ceph clusters compare to traditional HDD-based systems?
All-NVMe Ceph clusters are in a completely different league than HDD-based systems. The difference isn’t small; it’s orders of magnitude. A well-tuned all-NVMe cluster can deliver millions of IOPS and tens of gigabytes per second of speed, while an HDD cluster is measured in thousands of IOPS and hundreds of megabytes per second. Latency drops from milliseconds on HDDs to microseconds on NVMe.
On the cost side, the upfront price-per-gigabyte of NVMe hardware is higher than HDDs. However, for any workload where performance matters, the Total Cost of Ownership (TCO) for an all-NVMe cluster is often much lower. This is because of huge efficiency gains in power, cooling, and rack space. For example, one study found it would take over 100 HDD-based nodes to match the write performance of just six all-NVMe nodes, which leads to massive operational savings that make up for the higher initial hardware cost.
What are the most critical tuning steps for an all-NVMe cluster?
While you can tweak many settings, three tuning steps are the most important for unlocking performance in an all-NVMe Ceph cluster:
- Disable CPU C-States: This is a BIOS/UEFI setting. Turning off power-saving C-states stops the CPU from going into sleep modes, which gets rid of the “wake-up” delay that hurts low-latency storage. This one change can give you a 10-20% performance boost.
- Disable IOMMU: In a trusted bare metal setup, the IOMMU (VT-d/AMD-Vi) adds an unnecessary address translation step to every I/O. Turning it off with a kernel boot setting can give you a substantial performance increase by reducing CPU work.
- Optimize OSDs-per-Drive Strategy: Deciding to run one or two OSDs per NVMe drive is a key trade-off. Use one OSD per drive for the best large-write speed and CPU efficiency. Use two OSDs per drive to drastically reduce 99th percentile tail latency, which is critical for apps that need consistent, predictable response times.
How does network setup affect performance and recovery speed of all-NVMe Ceph clusters?
The network is the lifeblood of a Ceph cluster, and its importance in an all-NVMe setup can’t be overstated. A fast, low-latency network (25Gbps minimum, 100Gbps recommended for larger clusters) is essential for two reasons:
- Peak Performance: The speed of modern NVMe drives can easily overwhelm slower networks. A 10Gbps network, for example, can be completely used up by a single client, creating a hard limit on performance no matter how fast your drives are.
- Recovery Speed: This is the most critical part. When a drive or server fails, Ceph has to copy terabytes of data across the network to get back to full redundancy. The speed of this recovery is directly tied to your available network bandwidth. A slow network makes the recovery process much longer, leaving your cluster in a degraded and vulnerable state and increasing the risk of data loss from another failure. A faster network means a more resilient and reliable cluster.
When should I use erasure coding versus 3x replication on an all-NVMe cluster?
The choice between erasure coding (EC) and 3x replication on an all-NVMe cluster is a strategic one based on your workload and cost goals.
- Use 3x Replication for:
- Maximum Read Performance: If your main goal is the absolute highest read IOPS and lowest read latency, 3x replication is the way to go. This makes it perfect for latency-sensitive workloads like block storage for high-transaction databases. The 1 TiB/s benchmark was hit using 3x replication.
- Simplicity: Replication is less work for the CPU than erasure coding.
- Use Erasure Coding (e.g., 4+2, 8+3) for:
- Storage Efficiency and Cost Savings: EC dramatically reduces storage overhead, which is the key to making large all-NVMe clusters affordable.
- High Write Throughput: It might seem backward, but at scale, EC can deliver higher write speeds than replication because writes are spread across many OSDs at the same time.
- Object Storage and Archives: It’s a great fit for large-scale object storage or file storage where capacity and cost-efficiency are the top priorities and slightly higher read latency is okay.
What is the most common mistake when deploying Ceph on NVMe?
The most common and damaging mistake is using consumer-grade hardware. This mistake shows up in two main ways:
- Consumer NVMe Drives: Using drives made for desktops or laptops is a recipe for disaster. These drives are missing essential enterprise features like power-loss protection capacitors (which risks data corruption) and have very low write endurance (DWPD), which causes them to wear out and fail early under Ceph’s constant I/O load.
- Under-provisioned Networking: Deploying an all-NVMe cluster on a 1GbE or even a 10GbE network. This creates an immediate and major performance bottleneck that stops the cluster from getting any of the benefits of the fast storage and critically slows down recovery operations, putting the cluster’s resilience at risk.
These hardware choices create fundamental limits and reliability risks that no amount of software tuning can fix. A successful all-NVMe Ceph deployment starts with a foundation of the right, enterprise-grade hardware.
Read More on the OpenMetal Blog