In this article
- Related Video: Ceph Storage Cluster Beginner’s Guide and Demo
- How Ceph Replication Works
- Ceph’s Consistency Model: Strong Consistency
- Balancing Performance, Durability, and Client Behavior
- Configuration Guide: Replication, Performance, and Client Caching
- Wrapping Up: Ceph Replication and Consistency Model
- Get Started on an OpenStack- and Ceph-Powered Hosted Private Cloud
Ceph is a distributed storage system designed for scalability, high availability, and data durability. It manages data across clusters using robust replication strategies and a well-defined consistency model. Here’s a quick overview of what we’ll get into in this article:
- Replication: Ceph ensures data redundancy, fault tolerance, and performance by replicating data across multiple storage devices (OSDs).
- Consistency Model: Ceph’s core storage layer (RADOS) provides Strong Consistency for acknowledged operations, guaranteeing that reads reflect the latest successfully written data.
- Key Components: RADOS (core storage engine), CRUSH (data placement algorithm), OSDs (storage devices handling data), and Monitors (cluster state tracking).
Ceph’s architecture provides reliability and performance, allowing tuning based on workload needs. This article explains how Ceph achieves this through its replication strategies, write processes, consistency guarantees, and failure recovery mechanisms.
Ceph Storage Cluster Beginner’s Guide and Demo
How Ceph Replication Works
Ceph’s replication system is built on RADOS (Reliable Autonomic Distributed Object Store), which controls data distribution and consistency across the storage cluster. Below, we break down how RADOS operates, how CRUSH handles data placement, and the roles of primary and replica OSDs.
RADOS and Placement Groups
RADOS organizes data into Placement Groups (PGs) to manage the distribution of objects across the cluster’s Object Storage Devices (OSDs). Each PG contains a set of objects, and RADOS ensures that data within a PG is replicated across different OSDs according to the pool’s configured replication factor (size).
Component | Function | Purpose |
---|---|---|
Placement Groups | Organize object distribution | Determine where data replicas are stored and managed |
OSDs | Store data blocks | Host the actual primary and replica copies of data |
Monitor Nodes | Track cluster state and PG maps | Maintain the overall cluster map and ensure consensus on its state |
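To see how these pieces map onto a running cluster, the standard CLI can list pools, show PG counts, and reveal which OSDs hold a given object. A quick sketch, assuming a placeholder pool named `mypool` and object named `myobject`:

```bash
# List pools and the placement group count for one of them
ceph osd lspools
ceph osd pool get mypool pg_num

# Show which PG an object maps to and which OSDs hold its replicas;
# the first OSD in the acting set is the Primary
ceph osd map mypool myobject

# Summarize PG states across the cluster
ceph pg stat
```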
CRUSH Data Placement
CRUSH (Controlled Replication Under Scalable Hashing) is Ceph’s algorithm for intelligently placing data replicas across the cluster without relying on a central lookup table. This decentralized approach allows for efficient scaling and resilient data distribution. CRUSH calculates where data should be stored based on the cluster map, placement rules, and the object’s name.
CRUSH evaluates a few different factors to determine placement:
- Failure Domains: Ensures replicas are stored in separate physical locations (e.g., different hosts, racks, rows) to minimize the risk of data loss from correlated failures.
- Device Weights: Considers the storage capacity of OSDs to ensure a balanced distribution of data across the cluster.
- Hierarchy Rules: Uses the defined cluster topology (e.g., data centers, racks, hosts) to optimize replica placement for resilience and potentially performance.
These considerations help maintain a balanced and fault-tolerant replication strategy across the cluster.
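These placement decisions are driven by CRUSH rules, which can be inspected or extended from the CLI. A minimal sketch, assuming the stock rule is named `replicated_rule` (its usual name on recent releases) and using hypothetical names `rack-spread` and `mypool`:

```bash
# View the topology (hosts, racks, etc.) and device weights CRUSH works from
ceph osd crush tree

# List and inspect existing CRUSH rules
ceph osd crush rule ls
ceph osd crush rule dump replicated_rule

# Create a replicated rule that spreads copies across separate racks
ceph osd crush rule create-replicated rack-spread default rack

# Point an existing pool at the new rule
ceph osd pool set mypool crush_rule rack-spread
```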
Primary and Replica OSDs
Within each Placement Group, Ceph assigns one OSD as the Primary. The Primary OSD coordinates all write operations for that PG and ensures consistency among its replicas.
Here’s how write operations typically work:
- The client sends the write request to the Primary OSD for the relevant PG.
- The Primary OSD validates the request, assigns a sequence number, and sends the write operation to its Replica OSDs.
- Replica OSDs store the data and send acknowledgments back to the Primary.
- The Primary confirms the write back to the client only after receiving acknowledgments from the required number of OSDs (based on the `min_size` setting), guaranteeing consistency and durability.
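To observe this coordination on a live cluster, the commands below identify the Primary OSD for a PG and show how many acknowledgments the pool requires before confirming a write. A brief sketch, using a hypothetical PG ID `1.2f` and placeholder pool name `mypool`:

```bash
# Show the up and acting OSD sets for a PG; the first OSD listed in the
# acting set is the Primary that coordinates writes for that PG
ceph pg map 1.2f

# Detailed peering and replication state for the same PG
ceph pg 1.2f query

# Number of copies that must acknowledge a write before the client is answered
ceph osd pool get mypool min_size
```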
RADOS continuously monitors the health of OSDs via heartbeat mechanisms. If a Primary OSD fails or network issues occur, the Monitor nodes orchestrate promoting a suitable Replica OSD to become the new Primary for the affected PGs. This self-healing mechanism maintains data availability even during disruptions.
Ceph’s Consistency Model: Strong Consistency
Ceph’s core storage layer, RADOS, provides strong consistency for data operations within the cluster. This means that once Ceph confirms a write operation is complete, subsequent reads initiated after that confirmation are guaranteed to return the newly written data.
How Ceph Ensures Strong Consistency
Ceph uses its primary-copy replication mechanism for writes within each Placement Group (PG) to achieve strong consistency:
- Write Initiation: A client sends a write request to the Primary OSD for the target PG.
- Replication: The Primary OSD writes the data locally and simultaneously sends the write operation to all Replica OSDs in the PG’s current acting set.
- Acknowledgment: Each Replica OSD processes the write, persists it to its storage, and sends an acknowledgment back to the Primary OSD.
- Client Confirmation: The Primary OSD waits until it receives acknowledgments from a minimum number of OSDs (including itself). This minimum is defined by the pool setting `min_size`. Once `min_size` copies are safely persisted, the Primary OSD acknowledges the write completion back to the client.
Key Factors
- `min_size` Setting: This pool parameter is critical for both durability and availability during OSD failures. It dictates how many copies of the data must be successfully written before a write is considered committed and acknowledged to the client. For example, with a pool `size` of 3 (total replicas) and `min_size` of 2, Ceph can tolerate one OSD failure within that PG’s acting set while still accepting writes. If fewer than `min_size` OSDs are available, the PG becomes degraded, and write operations will be blocked until enough OSDs recover, preserving consistency.
- Read Operations: Reads are typically coordinated by the Primary OSD to ensure the most up-to-date data is returned. While optimizations might allow reading from replicas in some scenarios, the write protocol ensures that acknowledged data is consistent across the required number of replicas.
This model supports data durability and strong consistency, making Ceph a good fit for workloads requiring high data integrity like databases, virtual machine block storage (RBD), and shared file systems (CephFS).
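The pool settings behind this behavior, and any PGs currently running below their target replica count, can be checked with commands like the following (the pool name `mypool` is a placeholder):

```bash
# Replication factor and write-acknowledgment threshold for a pool
ceph osd pool get mypool size
ceph osd pool get mypool min_size

# PGs with fewer copies than "size"; writes block if a PG drops below "min_size"
ceph pg ls undersized
ceph health detail
```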
Read-Your-Writes Property
A direct consequence of Ceph’s strong consistency model is the “read-your-writes” property. Because writes are confirmed only after being successfully replicated to `min_size` OSDs, a client performing a read after receiving a successful write acknowledgment is guaranteed to see the data it just wrote.
Note on Client-Side Caching and Higher-Level Services
While Ceph’s core RADOS provides strong consistency for acknowledged writes, the consistency perceived by the application can sometimes be affected by client-side caching mechanisms (like RBD caching, discussed later).
Additionally, higher-level services built on Ceph, like the RADOS Gateway (RGW) for S3/Swift object storage, might have features (e.g., multi-site replication, bucket index updates) that exhibit eventual consistency characteristics by design, but this is distinct from the core RADOS replication model.
Balancing Performance, Durability, and Client Behavior
Ceph manages performance and data consistency using smart techniques. The core consistency remains strong, but configurations related to client interaction and recovery can be tuned.
Primary Copy Method
As mentioned, Ceph uses a “primary copy” method where a designated primary OSD coordinates operations for specific data placement groups. This setup helps maintain proper write sequencing, centralizes coordination to ensure consistency, and avoids conflicting updates.
Write Confirmation Steps
Ceph’s write operations follow a clear process designed for reliability:
- Initial Write Reception: The primary OSD receives and validates the write request.
- Replica Distribution: The primary OSD sends the write to replicas simultaneously.
- Acknowledgment Collection: The primary OSD collects acknowledgments from replica OSDs. A write is only acknowledged back to the client once the `min_size` requirement for the pool is met (i.e., the write is persisted on at least `min_size` OSDs). This step enforces strong consistency within the cluster for committed writes.
This process works hand-in-hand with Ceph’s recovery features to maintain data reliability.
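For a closer look at this pipeline, each OSD exposes an admin socket that can dump in-flight and recently completed operations, including time spent waiting on replica acknowledgments. A rough sketch, assuming an example OSD ID of `osd.0` and that the commands run on the host where that OSD lives:

```bash
# Writes currently in progress, including those waiting on replica acks
ceph daemon osd.0 dump_ops_in_flight

# Recently completed operations with per-stage timestamps
ceph daemon osd.0 dump_historic_ops
```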
Failure Recovery
Ceph is designed to detect and recover from failures without compromising data integrity:
- Continuous Monitoring: OSD health is tracked using heartbeat signals exchanged between OSDs and Monitors.
- Automatic Failover: If an OSD fails, Ceph automatically marks it down. PGs with data on that OSD enter a degraded state. If a Primary OSD fails, a Replica is promoted. CRUSH then calculates new locations for the data replicas that were on the failed OSD, and the system initiates “backfill” or “recovery” to restore the desired replication level (`size`) on available OSDs.
- Consistency Safeguards: Strict write sequencing and peering processes during recovery prevent data corruption and ensure PGs return to a consistent state.
These measures ensure that Ceph remains reliable and consistent even when hardware failures or network issues occur.
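Recovery activity can be watched, and automatic rebalancing briefly paused during planned maintenance, with standard commands such as these:

```bash
# Overall cluster state, including recovery/backfill progress
ceph -s

# OSD up/down and in/out status across the failure-domain hierarchy
ceph osd tree

# PGs currently being repaired
ceph pg ls recovering
ceph pg ls backfilling

# Prevent rebalancing during a planned reboot, then re-enable it afterwards
ceph osd set noout
ceph osd unset noout
```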
Configuration Guide: Replication, Performance, and Client Caching
Tuning Ceph involves understanding key parameters related to replication, operational behavior, and client-side interactions like caching.
Setting Replication Levels
Ceph’s replication factor (`size`) boosts data durability and read availability. The `min_size` parameter controls the minimum number of replicas needed for writes to succeed, ensuring consistency during temporary failures. It’s important to tune these based on your specific needs.
Replication Factor (`size`) | Use Case | Impact |
---|---|---|
size=2 | Development environments, non-critical data | Lower storage overhead, but only tolerates 1 OSD failure before data is inaccessible. Not recommended for production. |
size=3 (Default) | Production workloads, general-purpose storage | Good balance: tolerates 1 OSD failure while allowing writes (if `min_size`=2), tolerates 2 OSD failures before potential data loss. |
size=4+ | Mission-critical systems, regulatory compliance | Higher storage usage, maximum reliability, tolerates more concurrent failures. |
You can adjust these per pool using:
ceph osd pool set <pool_name> size <num_replicas>
ceph osd pool set <pool_name> min_size <min_replicas_for_write>
It’s important to note that `min_size` should always be less than or equal to `size`. A common practice is `min_size = size - 1` to allow writes during a single OSD failure within a PG’s acting set. Setting `min_size = 1` is highly discouraged as it compromises durability guarantees.
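To confirm what each pool is actually configured with, and to illustrate the common 3/2 arrangement, something like the following can be used (the pool name `mypool` is a placeholder):

```bash
# Show size, min_size, and CRUSH rule for every pool
ceph osd pool ls detail

# Example: a 3-replica pool that keeps accepting writes with one failed OSD
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2
```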
Operational Tuning Parameters
These settings influence cluster stability, failure detection speed, and recovery performance:
- `osd_client_op_timeout`: How long the client waits for an operation acknowledgment.
- `osd_heartbeat_grace`: How long before an OSD is considered down if heartbeats aren’t received.
- `osd_recovery_max_active`: Controls how many concurrent recovery operations run per OSD.
- `osd_recovery_op_priority`: Sets the priority of recovery operations versus client I/O.
Adjusting these requires careful consideration of the trade-offs between faster failure detection/recovery and potential instability from false positives or excessive recovery load.
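On recent releases these values can be read and changed through the centralized configuration store; older-style runtime injection on running daemons is also available. A hedged sketch with conservative example values:

```bash
# Read the current value for the OSD daemons
ceph config get osd osd_recovery_max_active

# Persist new values cluster-wide (de-prioritizing recovery relative to client I/O)
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1

# Older-style runtime injection on running OSDs
ceph tell osd.* injectargs '--osd_recovery_max_active=1'
```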
Tuning Client Performance with RBD Caching
For RADOS Block Devices (RBD), Ceph offers client-side caching options. These significantly impact performance perceived by the client application but also affect the application’s view of consistency and durability, especially regarding uncommitted writes. These settings do not change Ceph’s underlying strong consistency guarantee for data committed to the RADOS cluster.
Caching Modes:
- No Caching (`rbd_cache = false`): All reads/writes go to the cluster. Writes are confirmed only after `min_size` commitment in RADOS. This is the safest mode and ensures the client sees the cluster-committed state.
- Write-Through Caching (`rbd_cache = true`, `rbd_cache_writethrough_until_flush = true`): Reads may hit the cache. Writes are sent to the cluster, and the client application waits for cluster (`min_size`) acknowledgment. Improves read performance for cached data while maintaining strong client-perceived write safety.
- Write-Back Caching (`rbd_cache = true`, `rbd_cache_writethrough_until_flush = false`, `rbd_cache_max_dirty > 0`): Reads may hit the cache. Writes are acknowledged to the application immediately after entering the client cache, and data is written to RADOS asynchronously. Offers the highest client-perceived write performance but risks data loss if the client crashes before data is flushed to the cluster, providing weaker consistency/durability from the application’s perspective until the flush completes.
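In recent releases (Nautilus and later), these cache options can also be overridden per pool or per image at runtime rather than only in the client’s `ceph.conf`. A small sketch, assuming a hypothetical image `mypool/vm-disk-01`:

```bash
# Disable the client cache for one image only
rbd config image set mypool/vm-disk-01 rbd_cache false

# Show the overrides currently in effect for that image
rbd config image list mypool/vm-disk-01
```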
Example Scenarios and Configuration Snippets
Database Storage (Prioritize Safety): Use No Caching or Write-Through.
# Recommended: disable client-side caching
rbd_cache = false

# Or explicit write-through:
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_max_dirty = 0
High-Performance VMs / Scratch Storage (Prioritize Client Speed – Use with Caution): Write-Back caching might be considered, provided the risks are understood.
rbd_cache = true
rbd_cache_writethrough_until_flush = false
# 64MB cache size example
rbd_cache_size = 67108864
# 48MB dirty limit example
rbd_cache_max_dirty = 48000000
# 32MB target dirty example
rbd_cache_target_dirty = 32000000
Always monitor cluster health (`ceph health detail`) and performance (`ceph osd pool stats`, client metrics) after making adjustments. Ensure caching strategies align with application data safety requirements.
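A minimal check after changing any of these settings might look like this:

```bash
# Overall health, including degraded, blocked, or inconsistent PGs
ceph health detail

# Per-pool client and recovery I/O rates
ceph osd pool stats

# Capacity usage per pool
ceph df
```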
Wrapping Up: Ceph Replication and Consistency Model
Key Points Review
Ceph provides a powerful, flexible, scalable, and cost-effective storage solution. Its main features include:
- Replication: Ceph uses replication (`size`) across failure domains (via CRUSH) for durability and availability.
- Consistency: Core RADOS ensures Strong Consistency for writes acknowledged by the cluster (based on `min_size`).
- Performance Tuning: Client-side caching (RBD) and operational parameters can be tuned, requiring careful consideration of performance versus client-perceived consistency and safety trade-offs.
What’s Next for Ceph
Ceph’s development continues to focus on improving performance, data reliability, and usability. Ongoing refinements to the replication and recovery paths aim to speed up operations while preserving data integrity, a meaningful benefit for large-scale distributed systems, and many of these updates are already proving their worth in real-world deployments.
OpenMetal’s Use of Ceph
We’ve integrated Ceph into our hosted private cloud built on OpenStack to achieve greater reliability and efficiency, delivering up to 3.5x better performance compared to traditional public cloud setups. Check out how one of our clients used it to save money and gain more control over their infrastructure.
Read More on the OpenMetal Blog