In this article
- Related Video: Ceph Storage Cluster Beginner’s Guide and Demo
- How Ceph Replication Works
- Ceph’s Consistency Model: Strong Consistency
- Balancing Performance, Durability, and Client Behavior
- Configuration Guide: Replication, Performance, and Client Caching
- Wrapping Up: Ceph Replication and Consistency Model
- Get Started on an OpenStack- and Ceph-Powered Hosted Private Cloud
Ceph is a distributed storage system designed for scalability, high availability, and data durability. It manages data across clusters using robust replication strategies and a well-defined consistency model. Here’s a quick overview of what we’ll get into in this article:
- Replication: Ceph ensures data redundancy, fault tolerance, and performance by replicating data across multiple storage devices (OSDs).
- Consistency Model: Ceph’s core storage layer (RADOS) provides Strong Consistency for acknowledged operations, guaranteeing that reads reflect the latest successfully written data.
- Key Components: RADOS (core storage engine), CRUSH (data placement algorithm), OSDs (storage devices handling data), and Monitors (cluster state tracking).
Ceph’s architecture provides reliability and performance, allowing tuning based on workload needs. This article explains how Ceph achieves this through its replication strategies, write processes, consistency guarantees, and failure recovery mechanisms.
Ceph Storage Cluster Beginner’s Guide and Demo
How Ceph Replication Works
Ceph’s replication system is built on RADOS (Reliable Autonomic Distributed Object Store), which controls data distribution and consistency across the storage cluster. Below, we break down how RADOS operates, how CRUSH handles data placement, and the roles of primary and replica OSDs.
RADOS and Placement Groups
RADOS organizes data into Placement Groups (PGs) to manage the distribution of objects across the cluster’s Object Storage Devices (OSDs). Each PG contains a set of objects, and RADOS ensures that data within a PG is replicated across different OSDs according to the pool’s configured replication factor (size).
Component | Function | Purpose |
---|---|---|
Placement Groups | Organize object distribution | Determine where data replicas are stored and managed |
OSDs | Store data blocks | Host the actual primary and replica copies of data |
Monitor Nodes | Track cluster state and PG maps | Maintain the overall cluster map and ensure consensus on its state |
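To see how these pieces map onto a running cluster, the standard CLI can list pools, show PG counts, and reveal which OSDs hold a given object. A quick sketch, assuming a placeholder pool named `mypool` and object named `myobject`:

```bash
# List pools and the placement group count for one of them
ceph osd lspools
ceph osd pool get mypool pg_num

# Show which PG an object maps to and which OSDs hold its replicas;
# the first OSD in the acting set is the Primary
ceph osd map mypool myobject

# Summarize PG states across the cluster
ceph pg stat
```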
CRUSH Data Placement
CRUSH (Controlled Replication Under Scalable Hashing) is Ceph’s algorithm for intelligently placing data replicas across the cluster without relying on a central lookup table. This decentralized approach allows for efficient scaling and resilient data distribution. CRUSH calculates where data should be stored based on the cluster map, placement rules, and the object’s name.
CRUSH evaluates a few different factors to determine placement:
- Failure Domains: Ensures replicas are stored in separate physical locations (e.g., different hosts, racks, rows) to minimize the risk of data loss from correlated failures.
- Device Weights: Considers the storage capacity of OSDs to ensure a balanced distribution of data across the cluster.
- Hierarchy Rules: Uses the defined cluster topology (e.g., data centers, racks, hosts) to optimize replica placement for resilience and potentially performance.
These considerations help maintain a balanced and fault-tolerant replication strategy across the cluster.
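These placement decisions are driven by CRUSH rules, which can be inspected or extended from the CLI. A minimal sketch, assuming the stock rule is named `replicated_rule` (its usual name on recent releases) and using hypothetical names `rack-spread` and `mypool`:

```bash
# View the topology (hosts, racks, etc.) and device weights CRUSH works from
ceph osd crush tree

# List and inspect existing CRUSH rules
ceph osd crush rule ls
ceph osd crush rule dump replicated_rule

# Create a replicated rule that spreads copies across separate racks
ceph osd crush rule create-replicated rack-spread default rack

# Point an existing pool at the new rule
ceph osd pool set mypool crush_rule rack-spread
```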
Primary and Replica OSDs
Within each Placement Group, Ceph assigns one OSD as the Primary. The Primary OSD coordinates all write operations for that PG and ensures consistency among its replicas.
Here’s how write operations typically work:
- The client sends the write request to the Primary OSD for the relevant PG.
- The Primary OSD validates the request, assigns a sequence number, and sends the write operation to its Replica OSDs.
- Replica OSDs store the data and send acknowledgments back to the Primary.
- The Primary confirms the write back to the client only after receiving acknowledgments from the required number of OSDs (based on the `min_size` setting), guaranteeing consistency and durability.
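To observe this coordination on a live cluster, the commands below identify the Primary OSD for a PG and show how many acknowledgments the pool requires before confirming a write. A brief sketch, using a hypothetical PG ID `1.2f` and placeholder pool name `mypool`:

```bash
# Show the up and acting OSD sets for a PG; the first OSD listed in the
# acting set is the Primary that coordinates writes for that PG
ceph pg map 1.2f

# Detailed peering and replication state for the same PG
ceph pg 1.2f query

# Number of copies that must acknowledge a write before the client is answered
ceph osd pool get mypool min_size
```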
RADOS continuously monitors the health of OSDs via heartbeat mechanisms. If a Primary OSD fails or network issues occur, the Monitor nodes orchestrate promoting a suitable Replica OSD to become the new Primary for the affected PGs. This self-healing mechanism maintains data availability even during disruptions.
Ceph’s Consistency Model: Strong Consistency
Ceph’s core storage layer, RADOS, provides strong consistency for data operations within the cluster. This means that once Ceph confirms a write operation is complete, subsequent reads initiated after that confirmation are guaranteed to return the newly written data.
How Ceph Ensures Strong Consistency
Ceph uses its primary-copy replication mechanism for writes within each Placement Group (PG) to achieve strong consistency:
- Write Initiation: A client sends a write request to the Primary OSD for the target PG.
- Replication: The Primary OSD writes the data locally and simultaneously sends the write operation to all Replica OSDs in the PG’s current acting set.
- Acknowledgment: Each Replica OSD processes the write, persists it to its storage, and sends an acknowledgment back to the Primary OSD.
- Client Confirmation: The Primary OSD waits until it receives acknowledgments from a minimum number of OSDs (including itself). This minimum is defined by the pool setting `min_size`. Once `min_size` copies are safely persisted, the Primary OSD acknowledges the write completion back to the client.
Key Factors
- `min_size` Setting: This pool parameter is critical for both durability and availability during OSD failures. It dictates how many copies of the data must be successfully written before a write is considered committed and acknowledged to the client. For example, with a pool `size` of 3 (total replicas) and `min_size` of 2, Ceph can tolerate one OSD failure within that PG’s acting set while still accepting writes. If fewer than `min_size` OSDs are available, the PG becomes degraded, and write operations will be blocked until enough OSDs recover, preserving consistency.
- Read Operations: Reads are typically coordinated by the Primary OSD to ensure the most up-to-date data is returned. While optimizations might allow reading from replicas in some scenarios, the write protocol ensures that acknowledged data is consistent across the required number of replicas.
This model supports data durability and strong consistency, making Ceph a good fit for workloads requiring high data integrity like databases, virtual machine block storage (RBD), and shared file systems (CephFS).
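The pool settings behind this behavior, and any PGs currently running below their target replica count, can be checked with commands like the following (the pool name `mypool` is a placeholder):

```bash
# Replication factor and write-acknowledgment threshold for a pool
ceph osd pool get mypool size
ceph osd pool get mypool min_size

# PGs with fewer copies than "size"; writes block if a PG drops below "min_size"
ceph pg ls undersized
ceph health detail
```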
Read-Your-Writes Property
A direct consequence of Ceph’s strong consistency model is the “read-your-writes” property. Because writes are confirmed only after being successfully replicated to `min_size` OSDs, a client performing a read after receiving a successful write acknowledgment is guaranteed to see the data it just wrote.
Note on Client-Side Caching and Higher-Level Services
While Ceph’s core RADOS provides strong consistency for acknowledged writes, the consistency perceived by the application can sometimes be affected by client-side caching mechanisms (like RBD caching, discussed later).
Additionally, higher-level services built on Ceph, like the RADOS Gateway (RGW) for S3/Swift object storage, might have features (e.g., multi-site replication, bucket index updates) that exhibit eventual consistency characteristics by design, but this is distinct from the core RADOS replication model.
Balancing Performance, Durability, and Client Behavior
Ceph manages performance and data consistency using smart techniques. The core consistency remains strong, but configurations related to client interaction and recovery can be tuned.
Primary Copy Method
As mentioned, Ceph uses a “primary copy” method where a designated primary OSD coordinates operations for specific data placement groups. This setup helps maintain proper write sequencing, centralizes coordination to ensure consistency, and avoids conflicting updates.
Write Confirmation Steps
Ceph’s write operations follow a clear process designed for reliability:
- Initial Write Reception: The primary OSD receives and validates the write request.
- Replica Distribution: The primary OSD sends the write to replicas simultaneously.
- Acknowledgment Collection: The primary OSD collects acknowledgments from replica OSDs. A write is only acknowledged back to the client once the `min_size` requirement for the pool is met (i.e., the write is persisted on at least `min_size` OSDs). This step enforces strong consistency within the cluster for committed writes.
This process works hand-in-hand with Ceph’s recovery features to maintain data reliability.
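For a closer look at this pipeline, each OSD exposes an admin socket that can dump in-flight and recently completed operations, including time spent waiting on replica acknowledgments. A rough sketch, assuming an example OSD ID of `osd.0` and that the commands run on the host where that OSD lives:

```bash
# Writes currently in progress, including those waiting on replica acks
ceph daemon osd.0 dump_ops_in_flight

# Recently completed operations with per-stage timestamps
ceph daemon osd.0 dump_historic_ops
```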
Failure Recovery
Ceph is designed to detect and recover from failures without compromising data integrity:
- Continuous Monitoring: OSD health is tracked using heartbeat signals exchanged between OSDs and Monitors.
- Automatic Failover: If an OSD fails, Ceph automatically marks it down. PGs with data on that OSD enter a degraded state. If a Primary OSD fails, a Replica is promoted. CRUSH then calculates new locations for the data replicas that were on the failed OSD, and the system initiates “backfill” or “recovery” to restore the desired replication level (`size`) on available OSDs.
- Consistency Safeguards: Strict write sequencing and peering processes during recovery prevent data corruption and ensure PGs return to a consistent state.
These measures ensure that Ceph remains reliable and consistent even when hardware failures or network issues occur.
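Recovery activity can be watched, and automatic rebalancing briefly paused during planned maintenance, with standard commands such as these:

```bash
# Overall cluster state, including recovery/backfill progress
ceph -s

# OSD up/down and in/out status across the failure-domain hierarchy
ceph osd tree

# PGs currently being repaired
ceph pg ls recovering
ceph pg ls backfilling

# Prevent rebalancing during a planned reboot, then re-enable it afterwards
ceph osd set noout
ceph osd unset noout
```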
Configuration Guide: Replication, Performance, and Client Caching
Tuning Ceph involves understanding key parameters related to replication, operational behavior, and client-side interactions like caching.
Setting Replication Levels
Ceph’s replication factor (`size`) boosts data durability and read availability. The `min_size` parameter controls the minimum number of replicas needed for writes to succeed, ensuring consistency during temporary failures. It’s important to tune these based on your specific needs.
Replication Factor (`size`) | Use Case | Impact |
---|---|---|
size=2 | Development environments, non-critical data | Lower storage overhead, but only tolerates 1 OSD failure before data is inaccessible. Not recommended for production. |
size=3 (Default) | Production workloads, general-purpose storage | Good balance: tolerates 1 OSD failure while allowing writes (if `min_size`=2), tolerates 2 OSD failures before potential data loss. |
size=4+ | Mission-critical systems, regulatory compliance | Higher storage usage, maximum reliability, tolerates more concurrent failures. |
You can adjust these per pool using:
ceph osd pool set <pool_name> size <num_replicas>
ceph osd pool set <pool_name> min_size <min_replicas_for_write>
It’s important to note that `min_size` should always be less than or equal to `size`. A common practice is `min_size = size - 1` to allow writes during a single OSD failure within a PG’s acting set. Setting `min_size = 1` is highly discouraged as it compromises durability guarantees.
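To confirm what each pool is actually configured with, and to illustrate the common 3/2 arrangement, something like the following can be used (the pool name `mypool` is a placeholder):

```bash
# Show size, min_size, and CRUSH rule for every pool
ceph osd pool ls detail

# Example: a 3-replica pool that keeps accepting writes with one failed OSD
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2
```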
Operational Tuning Parameters
These settings influence cluster stability, failure detection speed, and recovery performance:
- `osd_client_op_timeout`: How long the client waits for an operation acknowledgment.
- `osd_heartbeat_grace`: How long before an OSD is considered down if heartbeats aren’t received.
- `osd_recovery_max_active`: Controls how many concurrent recovery operations run per OSD.
- `osd_recovery_op_priority`: Sets the priority of recovery operations versus client I/O.
Adjusting these requires careful consideration of the trade-offs between faster failure detection/recovery and potential instability from false positives or excessive recovery load.
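On recent releases these values can be read and changed through the centralized configuration store; older-style runtime injection on running daemons is also available. A hedged sketch with conservative example values:

```bash
# Read the current value for the OSD daemons
ceph config get osd osd_recovery_max_active

# Persist new values cluster-wide (de-prioritizing recovery relative to client I/O)
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1

# Older-style runtime injection on running OSDs
ceph tell osd.* injectargs '--osd_recovery_max_active=1'
```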
Tuning Client Performance with RBD Caching
For RADOS Block Devices (RBD), Ceph offers client-side caching options. These significantly impact performance perceived by the client application but also affect the application’s view of consistency and durability, especially regarding uncommitted writes. These settings do not change Ceph’s underlying strong consistency guarantee for data committed to the RADOS cluster.
Caching Modes:
- No Caching (`rbd_cache = false`): All reads/writes go to the cluster. Writes are confirmed only after `min_size` commitment in RADOS. This is the safest mode and ensures the client sees the cluster-committed state.
- Write-Through Caching (`rbd_cache = true`, `rbd_cache_writethrough_until_flush = true`): Reads may hit the cache. Writes are sent to the cluster, and the client application waits for cluster (`min_size`) acknowledgment. Improves read performance for cached data while maintaining strong client-perceived write safety.
- Write-Back Caching (`rbd_cache = true`, `rbd_cache_writethrough_until_flush = false`, `rbd_cache_max_dirty > 0`): Reads may hit the cache. Writes are acknowledged to the application immediately after entering the client cache, and data is written to RADOS asynchronously. Offers the highest client-perceived write performance but risks data loss if the client crashes before data is flushed to the cluster, providing weaker consistency/durability from the application’s perspective until the flush completes.
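In recent releases (Nautilus and later), these cache options can also be overridden per pool or per image at runtime rather than only in the client’s `ceph.conf`. A small sketch, assuming a hypothetical image `mypool/vm-disk-01`:

```bash
# Disable the client cache for one image only
rbd config image set mypool/vm-disk-01 rbd_cache false

# Show the overrides currently in effect for that image
rbd config image list mypool/vm-disk-01
```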
Example Scenarios and Configuration Snippets
Database Storage (Prioritize Safety): Use No Caching or Write-Through.
# Recommended: disable client-side caching
rbd_cache = false

# Or explicit write-through:
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_max_dirty = 0
High-Performance VMs / Scratch Storage (Prioritize Client Speed – Use with Caution): Write-Back caching might be considered, provided the risks are understood.
rbd_cache = true
rbd_cache_writethrough_until_flush = false
# 64MB cache size example
rbd_cache_size = 67108864
# 48MB dirty limit example
rbd_cache_max_dirty = 48000000
# 32MB target dirty example
rbd_cache_target_dirty = 32000000
Always monitor cluster health (`ceph health detail`) and performance (`ceph osd pool stats`, client metrics) after making adjustments. Ensure caching strategies align with application data safety requirements.
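A minimal check after changing any of these settings might look like this:

```bash
# Overall health, including degraded, blocked, or inconsistent PGs
ceph health detail

# Per-pool client and recovery I/O rates
ceph osd pool stats

# Capacity usage per pool
ceph df
```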
Wrapping Up: Ceph Replication and Consistency Model
Key Points Review
Ceph provides a powerful, flexible, scalable, and cost-effective storage solution. Its main features include:
- Replication: Ceph uses replication (`size`) across failure domains (via CRUSH) for durability and availability.
- Consistency: Core RADOS ensures Strong Consistency for writes acknowledged by the cluster (based on `min_size`).
- Performance Tuning: Client-side caching (RBD) and operational parameters can be tuned, requiring careful consideration of performance versus client-perceived consistency and safety trade-offs.
What’s Next for Ceph
Ceph’s development continues to focus on improving performance, data reliability, and usability. Ongoing refinements to the replication and recovery paths aim to speed up operations while preserving data integrity, a meaningful benefit for large-scale distributed systems, and many of these updates are already proving their worth in real-world deployments.
OpenMetal’s Use of Ceph
We’ve integrated Ceph into our hosted private cloud built on OpenStack to achieve greater reliability and efficiency, delivering up to 3.5x better performance compared to traditional public cloud setups. Check out how one of our clients used it to save money and gain more control over their infrastructure.
Read More on the OpenMetal Blog