Hosting a Powerful ClickHouse Deployment with a Mix of Bare Metal, OpenStack, and Ceph

This case study explores a real-world architecture of a high-performance ClickHouse cluster, showcasing a combination of bare metal, OpenStack Cloud, and Ceph Object Storage.

The challenge? Ingesting and analyzing huge streams of “hot” real-time security event data while controlling the costs of an ever-growing historical set of “cool” but critical data.

The Architecture: A Hybrid Approach

Cybersecurity ClickHouse Cluster Full Diagram

The solution is built on a hybrid architecture that combines the strengths of bare metal servers, large-scale S3-compatible object storage, and an OpenStack-powered private cloud. This allowed the architect of this ClickHouse cluster to leverage ClickHouse’s native tiered storage: the ultra-high I/O of the bare metal NVMe serves hot data, while ClickHouse connects directly to the S3-compatible object gateway, in the same VLANs, for cool data. Let’s explore this more below.
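
As a simplified illustration of that native functionality (a sketch only, not the customer’s configuration; the endpoint, bucket, and credentials are placeholders), a ClickHouse storage policy can pair the local NVMe with a Ceph RGW bucket:

    <!-- /etc/clickhouse-server/config.d/storage.xml (illustrative sketch) -->
    <clickhouse>
        <storage_configuration>
            <disks>
                <!-- The built-in "default" disk already points at the local NVMe path. -->
                <s3_cool>
                    <type>s3</type>
                    <!-- Placeholder endpoint: a bucket behind the Ceph RGW on the same VLAN -->
                    <endpoint>https://rgw.example.internal/clickhouse-cool/</endpoint>
                    <access_key_id>PLACEHOLDER_KEY</access_key_id>
                    <secret_access_key>PLACEHOLDER_SECRET</secret_access_key>
                </s3_cool>
            </disks>
            <policies>
                <hot_cool>
                    <volumes>
                        <!-- The first volume receives new inserts: hot data on NVMe -->
                        <hot>
                            <disk>default</disk>
                        </hot>
                        <cool>
                            <disk>s3_cool</disk>
                        </cool>
                    </volumes>
                </hot_cool>
            </policies>
        </storage_configuration>
    </clickhouse>

A table opts into the policy with SETTINGS storage_policy = 'hot_cool' and can age parts off NVMe automatically with a clause like TTL event_time + INTERVAL 30 DAY TO VOLUME 'cool'.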

Cluster 1: Bare Metal ClickHouse Cluster

The servers are interconnected with a 20Gbps private network for fast data exchange and replication within the ClickHouse cluster. Optionally, this can be increased to 40Gbps.

Each of the six “XL v4” servers contains:

  • CPU: 2x Intel Xeon Gold 6530 (64 cores, 128 threads, 2.1/4.0 GHz)
  • Memory: 1024GB DDR5 4800MHz
  • OS Storage: 2x 960GB Micron Pro, RAID 1
  • Working Storage: 7x 6.4TB high-performance Micron 7450 NVMe MAX – see specs
  • Operating System: Ubuntu 24.04

The Engine – Bare Metal for Raw Power

The heavy lifting of data processing is handled by a cluster of six bare metal servers. These servers host both the Kafka ingestion system and ClickHouse. The local “working drives” are managed directly by ClickHouse as the “hot” layer.
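
A common ClickHouse-native pattern for this kind of pipeline looks like the sketch below (the broker, topic, and column names are invented for illustration and are not the customer’s schema):

    -- A Kafka engine table consumes the stream, and a materialized view
    -- pushes each consumed batch into a MergeTree table on the local NVMe.
    CREATE TABLE security_events_queue
    (
        event_time DateTime,
        source_ip  String,
        event_type String
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'localhost:9092',
             kafka_topic_list = 'security-events',
             kafka_group_name = 'clickhouse-ingest',
             kafka_format = 'JSONEachRow';

    CREATE TABLE security_events
    (
        event_time DateTime,
        source_ip  String,
        event_type String
    )
    ENGINE = MergeTree
    ORDER BY (event_time, source_ip);

    CREATE MATERIALIZED VIEW security_events_mv TO security_events AS
    SELECT * FROM security_events_queue;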

The sharding and replication configuration is part of this customer’s secret sauce, so we won’t cover it specifically; instead, we supply our general guidance. What we can provide for customers is access to our big data database engineers and introductions to other architects, including the CTO who designed this deployment. We hope you consider joining OpenMetal.

The guidance below is based on our experience and on recommendations issued by the ClickHouse team. Special thanks to this video; check it out for a great introduction to replication and sharding.
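
As a general-pattern sketch only (the cluster name, database, and schema are placeholders, not the customer’s topology), replication and sharding in ClickHouse typically combine a ReplicatedMergeTree table with a Distributed table on top:

    -- {shard} and {replica} are expanded from each server's macros config.
    CREATE TABLE events_local ON CLUSTER main
    (
        event_time DateTime,
        source_ip  String,
        event_type String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    ORDER BY (event_time, source_ip);

    -- Queries against events_all fan out across all shards of cluster "main".
    CREATE TABLE events_all ON CLUSTER main AS events_local
    ENGINE = Distributed(main, default, events_local, rand());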


Cluster 2: The Cool Vault – Ceph for Cost-Effective Object Storage

For long-term data retention and cost-effective storage of “cool” data, the solution incorporates a Ceph-based object storage cluster. The OpenMetal Storage Cluster offers out-of-the-box compatibility with ClickHouse. In addition, specific tuning, including the choice of erasure coding redundancy, is supported to align with your performance versus budget requirements; a sketch of a 2/1 profile follows the capacity figures below.

This cluster consists of three “Storage Large v3 18T” servers:

  • CPU: 2x Intel Xeon Silver 4314 (32 cores, 64 threads, 2.4/3.4 GHz)
  • Memory: 256GB DDR4 3200MHz
  • Storage: 12x 18TB HDD (216TB total) + 4x 4TB NVMe SSD (16TB total)

It can provide up to:

  • 330TiB available (at 85% hardware utilization) at 2/1 erasure coding (66.67% efficient)
  • 275TiB available (at 85% hardware utilization) at Replica 2 (50% efficient)
  • 184TiB available (at 85% hardware utilization) at Replica 3 (33.33% efficient)
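
For illustration, the 2/1 erasure coding option above (two data chunks plus one parity chunk) could be defined on the Ceph side roughly as follows; the profile and pool names are examples, not the deployed configuration:

    # k=2 data chunks + m=1 parity chunk, ~66.67% efficient, spread across hosts
    ceph osd erasure-code-profile set ec-2-1 k=2 m=1 crush-failure-domain=host
    ceph osd pool create default.rgw.buckets.data erasure ec-2-1
    ceph osd pool application enable default.rgw.buckets.data rgw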

Cluster 2: Ceph Object Storage Cluster

Ceph provides an S3-compatible object storage gateway called RGW. The service scales horizontally and is independent of the storage layer, which allows Storage Large v3 servers to be added easily to grow the cluster’s capacity. If an additional radosgw instance is needed beyond the default two, it is added and assigned to one of the new servers.
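
On a cephadm-managed cluster, scaling out RGW is a single command; the service name and hostnames below are placeholders:

    # Grow the RGW service from two to three daemons, placing the new one
    # on the freshly added Storage Large v3 node.
    ceph orch apply rgw objectgw --placement="3 storage-1 storage-2 storage-3"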

The servers are also connected with a 20Gbps uplink for fast data exchange between cluster members. Optionally, this can be increased to 40Gbps.

Cluster 3: OpenStack Private Cloud

NOTE: It is recommended to run three ZooKeeper instances; this is the appropriate design for an HA cluster. It is also recommended to use “anti-affinity” in your OpenStack cloud to force the VMs onto separate hardware nodes, as shown in the sketch below. This feature is typically not available on public clouds and is a benefit of running a private cloud.
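
A minimal sketch of that anti-affinity setup with the OpenStack CLI (the group, image, flavor, and network names are placeholders):

    # Create an anti-affinity server group, then launch each ZooKeeper VM into it
    # so the scheduler places the VMs on separate hypervisors.
    openstack server group create --policy anti-affinity zk-group
    openstack server create --image ubuntu-24.04 --flavor m1.medium \
      --network private --hint group=<zk-group-uuid> zookeeper-1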

Cluster 3: The Foundation – OpenStack for Management and Coordination

At the core of the deployment is a small (but mighty!) OpenStack private cloud. Our “Small Cloud Core” is built on three hyperconverged servers. Each server is equipped with:

  • CPU: 1x Intel Xeon D-2141I (8 cores, 16 threads, 2.20/3.00 GHz)
  • Memory: 128GB DDR4 2933MHz
  • Storage: 3.2TB NVMe SSD

This OpenStack cluster hosts a set of VMs running the HA ZooKeeper service that supports the bare metal ClickHouse cluster. This is a general-purpose private cloud carrying many workloads, so it was efficient to simply keep ZooKeeper here.
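
On the ClickHouse side, each bare metal node simply points at the three ZooKeeper VMs; a sketch with placeholder hostnames:

    <!-- /etc/clickhouse-server/config.d/zookeeper.xml (illustrative sketch) -->
    <clickhouse>
        <zookeeper>
            <node><host>zookeeper-1.internal</host><port>2181</port></node>
            <node><host>zookeeper-2.internal</host><port>2181</port></node>
            <node><host>zookeeper-3.internal</host><port>2181</port></node>
        </zookeeper>
    </clickhouse>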

It is interconnected with the bare metal ClickHouse servers in Cluster 1 through a private VLAN for secure and high-bandwidth communication.

Why This Architecture Works

This three-tiered architecture is designed for optimal ClickHouse performance and scalability:

  • Bare Metal for ClickHouse: Running ClickHouse directly on bare metal eliminates virtualization overhead and lets the database fully tap into all available hardware resources. This is critical for achieving the extremely low latency required for real-time security analysis.
  • OpenStack for Control and Flexibility: The OpenStack private cloud provides a flexible and manageable environment for supporting services like ZooKeeper. It also hosts additional services like load balancers and supporting applications, all while remaining connected to the bare metal servers.
  • Ceph for Cost-Effective Scalability: Ceph’s object storage provides a scalable and cost-effective way to store large volumes of historical data. This lets the cybersecurity firm meet compliance requirements and perform long-term trend analysis.
  • Hybrid Network: Merged VLANs allow the virtual servers to talk directly to the bare metal servers for high-bandwidth, low-latency communication.

Real-World Results


This deployment is not a theoretical exercise! It’s a live production system currently supporting a major cybersecurity firm. With this hybrid approach of bare metal, OpenStack, and Ceph, OpenMetal is a great fit to power demanding big data solutions like ClickHouse. 

This architecture delivers a powerful platform for real-time analytics, providing insights for the client’s foundational security operations. Being able to mix, match, and connect the right infrastructure for each area ensures that our customer can easily deploy this powerful solution while keeping costs relatively low and performance high.

Interested in ClickHouse on OpenMetal but not sure where to start? Check out our quick start installation guide for ClickHouse on OpenMetal.

Does This Resonate With Your Business Needs?

Contact our cloud team to find out how OpenMetal can support your company’s goals and become your partner in success.