Handling continuous streams of data is a common requirement for modern applications, from IoT device monitoring to real-time analytics. A data ingestion pipeline is the foundation that collects and channels this data for processing and storage. As data volumes grow, building a pipeline that can manage high throughput without failure becomes a complex engineering challenge. Apache Kafka is the industry standard for building these real-time data pipelines, but its performance is directly tied to the underlying infrastructure.
This guide walks you through building a high-throughput data ingestion pipeline using Apache Kafka on an OpenMetal private cloud. You will learn how the combination of Kafka’s distributed streaming platform and a dedicated private cloud environment gives you the performance, control, and scalability needed for the most demanding data workloads.
Why a Private Cloud is the Right Foundation for Kafka
When you deploy a high-throughput system like Kafka, the choice of infrastructure has major consequences. In multi-tenant public clouds, you contend with resource contention from “noisy neighbors,” opaque infrastructure abstractions, and networking limitations that can bottleneck performance. For a system where latency and throughput are paramount, these are not acceptable trade-offs.
An OpenMetal private cloud provides a dedicated environment built on bare metal hardware. This approach gives you full root access and control over your servers. You can tune the operating system, networking stack, and storage configurations to match Kafka’s specific needs, a level of control that is not possible in a shared cloud environment. Because the platform is built with open source tools like OpenStack, you avoid vendor lock-in and retain the flexibility to build your stack with the best tools for the job.
This dedicated hardware and direct control mean you can construct a highly customized and secure pipeline. As seen in the real-life customer architecture below, Kafka can be deployed as the central ingestion point within a larger big data ecosystem, feeding data into processing engines like Apache Spark and long-term storage solutions like a Ceph-backed data lake. The entire system runs on a set of horizontally scalable servers that you control.
This architecture is designed for growth. When your data ingestion requirements increase, you can add more bare metal nodes to your Kafka cluster and expand your private cloud’s core resources, ensuring your pipeline can scale with demand.
Core Kafka Concepts
Before building the pipeline, it’s helpful to understand the main components of a Kafka system. According to the official Apache Kafka documentation, the system is built around a few key concepts:
- Producer: An application that sends streams of records to one or more Kafka topics.
- Consumer: An application that subscribes to one or more topics and processes the stream of records produced to them.
- Broker: A Kafka server that stores data. A Kafka cluster is composed of one or more brokers. Brokers manage the replication and partitioning of data.
- Topic: A category or feed name to which records are published. Topics in Kafka are always multi-subscriber.
- Partition: Topics are split into ordered, immutable sequences of records known as partitions. Distributing a topic across multiple partitions allows for parallelism, as consumers can read from different partitions simultaneously.
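To make that parallelism concrete, here is a small illustration using the console consumer that ships with Kafka. The iot-data topic and the analytics group name refer to the example cluster built later in this guide, so treat it as a sketch rather than a step to run now.
# Two consumers started with the same --group value split a topic's partitions between them
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic iot-data --group analytics
# Running the same command in a second terminal triggers a rebalance:
# each instance is then assigned a subset of the partitions and reads them in parallel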
How to Build and Configure the Pipeline
Building a data pipeline with Kafka involves setting up the brokers, configuring them for your workload, creating topics, and developing producers and consumers to send and receive data. A well-architected Kafka pipeline can decouple data sources from data sinks, providing a durable buffer that handles backpressure and allows different parts of your system to evolve independently.
Step 1: Provision Your OpenMetal Cloud
Begin by deploying your on-demand private cloud from the OpenMetal portal. For a production Kafka cluster, you will want a minimum of three bare metal servers to act as brokers. This allows for a replication factor of three, providing a good balance of durability and performance. You can choose server configurations that match your expected workload, with options for CPU cores, RAM, and NVMe storage.
Step 2: Install Kafka
Once your servers are provisioned, SSH into each one and install Java, which is a prerequisite for running Kafka.
# Update package lists and install OpenJDK
sudo apt update
sudo apt install default-jdk -y
Next, download and extract Apache Kafka. This guide uses the 3.7.0 release; you can find the most recent version on the official website.
# Download and extract Kafka
wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
tar -xzf kafka_2.13-3.7.0.tgz
cd kafka_2.13-3.7.0
Repeat these installation steps on each server that will act as a broker in your cluster.
Step 3: Configure Brokers for High Throughput
This is where running on a private cloud offers major advantages. You can tune Kafka’s configuration to take full advantage of the dedicated hardware. Navigate to the config/ directory and edit the server.properties file on each broker.
Here are a few key parameters to adjust for high throughput:
- Identify Each Broker: Assign a unique ID to each broker and point every broker at the same cluster coordination service so the three servers join a single cluster. With the broker.id-style (ZooKeeper) configuration shown here, that means all brokers share one zookeeper.connect string referencing your ZooKeeper ensemble; a KRaft-mode deployment uses node.id and controller.quorum.voters instead.
# In server.properties on Broker 1
broker.id=1
# In server.properties on Broker 2
broker.id=2
# In server.properties on Broker 3
broker.id=3
# On all brokers, list the same ZooKeeper ensemble
zookeeper.connect=zk1-ip:2181,zk2-ip:2181,zk3-ip:2181
- Tune Network Threads: Adjust the number of threads the broker uses for handling network requests. A good starting point is to match the number of CPU cores on your bare metal server.
# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=8
- Tune I/O Threads: This sets the number of threads the broker uses for processing requests, including disk I/O. This value should also be at least the number of cores to handle disk writes efficiently.
# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8
- Adjust Buffer Sizes: With dedicated network hardware, you can increase the socket buffer sizes to handle larger amounts of in-flight data. If you raise them well beyond the defaults, also raise the kernel’s net.core.rmem_max and net.core.wmem_max limits, which cap the values a socket can request.
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400
- Log Configuration: Define the directory for log data. On OpenMetal, you can mount a dedicated high-performance Ceph block device at this path to separate Kafka’s I/O from the operating system disk; a sketch of preparing such a mount follows this list.
# A comma separated list of directories where Kafka will store its log data
log.dirs=/var/lib/kafka/data
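As a minimal sketch of that mount, assuming the Ceph-backed volume is attached to the broker as /dev/vdb (the actual device name will vary with your environment):
# Format the dedicated volume and mount it at Kafka's log directory
sudo mkfs.xfs /dev/vdb
sudo mkdir -p /var/lib/kafka/data
sudo mount /dev/vdb /var/lib/kafka/data
sudo chown -R "$USER":"$USER" /var/lib/kafka/data
# Persist the mount across reboots; noatime avoids access-time writes on an append-heavy workload
echo '/dev/vdb /var/lib/kafka/data xfs defaults,noatime 0 0' | sudo tee -a /etc/fstab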
After configuring each broker, you can start the Kafka service on each server.
# Start the Kafka server in the background
bin/kafka-server-start.sh -daemon config/server.properties
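Before moving on, it is worth confirming that each broker came up cleanly. From any broker, a quick check against the default listener on port 9092 looks like this:
# Confirm the broker responds and reports its supported API versions
bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092
# List topics; a fresh cluster should return an empty list
bin/kafka-topics.sh --list --bootstrap-server localhost:9092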
Step 4: Create a Topic
Now, from one of the brokers, create a topic for your data ingestion. When creating a topic for high throughput, the number of partitions is a primary consideration. More partitions allow for greater parallelism, but also add overhead. A good rule of thumb is to start with one to two partitions per core per broker.
This command creates a topic named iot-data with 12 partitions and a replication factor of 3, meaning every message is copied to three different brokers for durability.
bin/kafka-topics.sh --create \
--bootstrap-server localhost:9092 \
--replication-factor 3 \
--partitions 12 \
--topic iot-data
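You can confirm the layout with the describe command, which prints each partition along with its leader and replica brokers:
# Inspect the partition and replica assignment for the new topic
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic iot-data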
Step 5: Test with a Producer and Consumer
Finally, you can test your pipeline. Use the command-line producer and consumer scripts included with Kafka to validate that the cluster is working correctly.
Open two terminals. In the first, start a producer that sends messages to your iot-data topic.
# Start a producer
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic iot-data
>{"device_id": "temp-sensor-01", "reading": 70.1}
>{"device_id": "temp-sensor-02", "reading": 71.5}
In the second terminal, start a consumer to read the messages from the beginning.
# Start a consumer
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic iot-data --from-beginning
{"device_id": "temp-sensor-01", "reading": 70.1}
{"device_id": "temp-sensor-02", "reading": 71.5}
You have now created a basic, yet scalable, data ingestion pipeline. For production use, you would replace the console tools with applications using one of Kafka’s client libraries, such as kafka-python or the Java client.
Scaling Your Pipeline for Future Growth
The architecture you have built is ready to scale. As your data volume increases, you have two primary scaling directions:
- Add More Partitions: You can increase the partition count for a topic to further parallelize processing.
- Add More Brokers: With OpenMetal, you can easily provision additional bare metal servers and add them to your Kafka cluster. After adding the new broker to the configuration, you can use Kafka’s tools to reassign partitions and rebalance the load across the expanded cluster. Both scaling operations are sketched below.
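Here is a rough sketch of what those two operations look like with Kafka’s stock tooling; the file names topics-to-move.json and reassignment.json, and the new broker ID 4, are placeholders for illustration.
# Grow the iot-data topic from 12 to 24 partitions
bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --topic iot-data --partitions 24
# topics-to-move.json contains: {"version":1,"topics":[{"topic":"iot-data"}]}
# Generate a candidate plan that spreads partitions across brokers 1-4, including the new node
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,3,4" --generate
# Save the proposed plan to reassignment.json, then apply it and monitor progress
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify
Keep in mind that adding partitions changes how keyed messages map to partitions, so it pays to plan partition counts with some headroom up front.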
As shown in the reference architecture above, this Kafka pipeline serves as the entry point to a much larger data platform. The data ingested can be consumed by Apache Spark for real-time ETL, stored in a Delta Lake format on a distributed Ceph object storage cluster, and ultimately made available for analytics and machine learning applications.
By building your high-throughput data ingestion pipeline on a private cloud, you create a foundation that provides the necessary performance and control for today’s needs and the flexibility to grow with your data tomorrow. Feel free to contact us if you’d like help building your big data pipeline. We’ve done this for multiple customers and through our expertise can drastically shorten the time from development to production, saving you time and money!