Explaining Kafka
Understanding Basic Kafka Terminology
Kafka topics are categories used to organize messages. Each topic has a name that is unique across the entire Kafka cluster. Producers write messages to a topic, while consumers read messages from it. Within a topic, there are multiple partitions, each with its own offsets. When a producer sends a message to a topic, the message is appended to the end of a partition log and assigned a new offset. When consumers subscribe to a topic, they pull messages from its partitions in the order they are stored. Each consumer maintains its current offset, which indicates the last processed message in each partition.
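To make offsets concrete, here is a minimal producer sketch using the plain kafka-clients API. The broker address and the topic name books are assumptions for illustration: each send is appended to one partition, and the broker returns the partition and offset it assigned.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class OffsetDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The broker appends the message to one partition of the topic
            // and reports back the partition and the offset it was assigned.
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("books", "hello"))
                    .get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}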

What is ZooKeeper?
ZooKeeper is used for managing and coordinating distributed systems. It stores metadata about the Kafka cluster, such as the list of active brokers and their availability. ZooKeeper also maintains metadata about topics, such as the number of partitions and replicas for each topic, as well as partition assignments. This includes determining which broker acts as the leader for each partition and which replicas are designated for failover, ensuring high availability and fault tolerance.
What about brokers?
Brokers are key components in Kafka, responsible for handling the storage, distribution, and retrieval of data in Kafka topics. They store messages on disk in partitions that belong to topics. Each partition is a log file containing an ordered sequence of messages. Brokers also replicate partitions across other brokers for fault tolerance, ensuring data availability even if one broker fails.
Kafka brokers receive messages from producers and append them to the appropriate partition in the respective topic. Once the messages are successfully written, brokers will provide an acknowledgement to the producer.
Kafka brokers provide consumers with messages stored in partitions. They handle offset tracking and allow consumers to pull data at their own pace.
Each partition in a topic is assigned a leader broker, which handles all read and write requests for that partition. Other brokers store replicas of the partition, ensuring fault tolerance. In the event of a leader failure, a follower replica is promoted to the leader role. In a Kafka cluster, partitions are distributed across brokers to balance the workload. This enables Kafka to scale horizontally by adding more brokers to handle increased throughput. Producers and consumers are redirected to the appropriate broker based on partition assignments.
Creating additional brokers in AWS EC2 involves deploying more EC2 instances, installing and configuring Kafka on them, and integrating these instances into the existing cluster.
How Does Kafka Ensure Parallel Processing?
Kafka uses a group ID, a unique string that identifies a group of consumers working together to consume messages from a topic. Each consumer informs the Kafka broker of the group it belongs to, and the broker ensures that partitions are evenly distributed among the consumers within the group. The broker continuously monitors and redistributes partitions as needed to maintain this balance. This mechanism enables parallel message processing, allowing multiple consumers in the same group to collaboratively handle messages from one or more topics. It provides scalability, fault tolerance, and efficient parallel processing in a distributed Kafka environment.
Kafka dynamically adjusts partition assignments as consumers join or leave the group, redistributing partitions to maintain balance and fault tolerance. This capability allows the system to scale horizontally by simply adding more consumers to the group, making it highly efficient for distributed processing in a Kafka environment.
To further enhance parallelism, you can increase the concurrency level within the same consumer group. By specifying a concurrency level in the @KafkaListener annotation, Kafka creates multiple listener instances under the same group. These instances collaboratively process messages, distributing the workload among themselves to complete tasks faster and improve throughput.
@KafkaListener(topics = "books", groupId = "book-notification-consumer", concurrency = "2")
public void bookNotificationConsumer(BookEvent event) {
    logger.info("Books event received for notification => {}", event);
}

If we need to consume the same messages multiple times and apply distinct processing logic for each listener, we can configure the @KafkaListener annotations with distinct group IDs. In the example below, we have a Kafka topic called books, and we need to handle events from this topic in two ways: full-text search indexing and price indexing. By assigning distinct group IDs, we ensure that each consumer group processes the same messages independently, applying its respective logic without interfering with the other.
@KafkaListener(topics = "books", groupId = "books-content-search")
public void bookContentSearchConsumer(BookEvent event) {
    logger.info("Books event received for full-text search indexing => {}", event);
}

@KafkaListener(topics = "books", groupId = "books-price-index")
public void bookPriceIndexerConsumer(BookEvent event) {
    logger.info("Books event received for price indexing => {}", event);
}

How Does Kafka Ensure Reliability?
Acknowledgements (Producer & Consumer)
Kafka uses acknowledgements at both the producer and consumer level to ensure reliability and control delivery guarantees. On the producer side, acknowledgements determine when a message is considered successfully written:
acks=0 means fire-and-forget with no guarantee. acks=1 means the leader acknowledges after writing the message. acks=all means all in-sync replicas acknowledge the message, providing the strongest durability.
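As a sketch, the producer-side setting looks like this with the plain kafka-clients API (the broker address is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// acks=all: wait for every in-sync replica before the write is acknowledged.
props.put(ProducerConfig.ACKS_CONFIG, "all");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);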
On the consumer side, acknowledgements are handled through offset commits. A consumer acknowledges a message by committing its offset after processing. If a consumer crashes before committing, the message will be re-delivered. If it commits before processing and then crashes, the message may be lost. This tradeoff directly determines the delivery semantics of the system.
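Committing only after processing gives at-least-once behavior. A minimal sketch with the plain consumer API, where the topic, group ID, and process method are assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
props.put(ConsumerConfig.GROUP_ID_CONFIG, "book-notification-consumer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("books"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical processing step
    }
    // Committing after processing: a crash before this line means redelivery
    // (at-least-once); committing before processing would risk message loss.
    consumer.commitSync();
}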
Preventing Duplicate Processing
Kafka prevents multiple consumers from processing the same message by using partitions and consumer groups. Each partition is assigned to only one consumer within a group at any given time. This ensures that messages in a partition are processed by a single consumer, avoiding duplicate work within the group. In contrast, systems like RabbitMQ rely on channels and explicit acknowledgements to ensure that only one consumer processes a message at a time.
Consumer Failures & Reprocessing
If a consumer crashes after processing a message but before committing its offset, Kafka will re-deliver the message. This can lead to duplicate processing and is the reason why understanding delivery guarantees is important.
Delivery Guarantees
At least once
Every message is guaranteed to be delivered at least one time. However, duplicates may occur, so consumers must be idempotent.
For example, setting a user's profile picture twice is harmless, but subtracting 50 from an account balance is not. Instead, systems should perform idempotent operations such as setting a final value (e.g. update balance to 54) rather than applying incremental changes.
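One common pattern is to track the IDs of processed events so that a redelivery becomes a no-op. Sketched here with an in-memory set for brevity; a real system would persist the IDs in a database or cache, and the eventId field is an assumption:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical idempotent handler: eventId is assumed to be a unique
// identifier carried on each message.
private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

public void handle(String eventId, BookEvent event) {
    // add() returns false if the ID was already present, so a
    // redelivered message is simply skipped.
    if (!processedEventIds.add(eventId)) {
        return;
    }
    applyEvent(event); // hypothetical business logic
}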
At most once
Messages are delivered at most once, meaning they may be lost but will never be duplicated. This is typically used for non-critical workloads such as metrics or analytics, where losing some data is acceptable.
Exactly once
Exactly-once delivery is extremely difficult to achieve in distributed systems. Kafka provides idempotent producers and transactional APIs to approximate this guarantee, but it requires careful system design.
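As a sketch, enabling the idempotent producer and the transactional API looks like this (the transactional.id value, topic, key, and value are assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // dedupe broker-side retries
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "books-tx-1"); // assumed ID

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("books", "key", "value"));
    producer.commitTransaction(); // all-or-nothing across the sends above
} catch (Exception e) {
    producer.abortTransaction();
}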
In practice, at-least-once delivery is the most commonly used approach in production systems.
Durability & Fault Tolerance
Kafka improves reliability by acting as a buffer between services. If a downstream service goes offline, Kafka retains messages and allows processing to resume once the service recovers, ensuring that no work is lost.
Partitioning in Kafka
What is Partitioning?
Partitioning is the process of splitting a Kafka topic into multiple independent sub-queues called partitions. This allows multiple consumers to process data in parallel, significantly improving throughput and scalability.
Parallelism and Consumer Groups
Within a consumer group, partitions are distributed among consumers so that each partition is processed by only one consumer at a time. This enables parallel processing while ensuring that messages within a single partition are not processed concurrently by multiple consumers.
Importance of Partition Key
The partition key determines which partition a message is sent to. It is similar to choosing a shard key in distributed databases. Messages with the same partition key are always routed to the same partition. This is important because Kafka guarantees ordering only within a partition.
Ordering Guarantees
Kafka guarantees that messages within a partition are processed in order. However, there is no ordering guarantee across different partitions.
For example, if a user deposits 100 and then withdraws 50, these operations must be processed in sequence. If they are sent to different partitions, the withdrawal could be processed first and incorrectly rejected.
By using a consistent partition key such as accountId, both messages will be routed to the same partition, preserving the correct order of operations.
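With Spring Kafka's KafkaTemplate, passing the key as the second argument achieves exactly this. The topic name transactions and the event class are assumptions for illustration:

// kafkaTemplate is a Spring-managed KafkaTemplate<String, TransactionEvent>;
// the topic name and event class are hypothetical.
public void publish(TransactionEvent event) {
    // Using accountId as the message key: Kafka hashes the key to pick the
    // partition, so all events for one account land in one partition, in order.
    kafkaTemplate.send("transactions", event.getAccountId(), event);
}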
Tradeoff: Ordering vs Distribution
A good partition key should also distribute load evenly across partitions. However, the key that ensures correct ordering (e.g. accountId) may not always provide the best distribution. This creates a tradeoff between maintaining strict ordering and achieving optimal load balancing. Choosing the right partition key requires careful consideration of both factors.
Backpressure
Backpressure is a mechanism used to handle situations where producers generate messages faster than consumers can process them. If left unchecked, this mismatch causes the queue to grow indefinitely, leading to increased latency, memory pressure, and potential system failure.
Instead of allowing the system to become overwhelmed, backpressure introduces control by slowing down producers or limiting incoming traffic. This ensures that the system remains stable and operates within its processing capacity.
Common backpressure strategies include throttling producers, buffering messages up to a limit, and rejecting or dropping excess requests when the system is under heavy load. In distributed systems, this often involves signaling mechanisms where consumers inform producers to reduce their rate of message production.
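In Kafka specifically, a consumer can apply backpressure by pausing fetching while it drains in-flight work, then resuming. A minimal sketch with the plain consumer API, where the bounded queue and the thresholds are assumptions:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// consumer is a configured KafkaConsumer<String, String>; workQueue is a
// hypothetical bounded queue feeding worker threads.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        workQueue.offer(record);
    }
    if (workQueue.size() > 1_000) {
        consumer.pause(consumer.assignment());  // stop fetching, keep the session alive
    } else if (workQueue.size() < 100) {
        consumer.resume(consumer.paused());     // backlog drained, fetch again
    }
}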
Dead Letter Queue
A Dead Letter Queue (DLQ) is used to handle messages that cannot be successfully processed by consumers. Failures can occur for various reasons, such as corrupted data, invalid message formats, or downstream service errors.
Instead of retrying indefinitely, systems typically configure a maximum retry count. If a message continues to fail after reaching this limit, it is moved to the dead letter queue to prevent it from blocking or slowing down normal message processing.
The DLQ acts as a separate queue or partition that stores failed messages for later inspection. Engineers can analyze these messages to identify root causes, fix issues, and optionally reprocess them once the problem has been resolved.
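In Spring Kafka this pattern is available out of the box: a DefaultErrorHandler combined with a DeadLetterPublishingRecoverer retries a fixed number of times and then publishes the failed record to a dead-letter topic. A sketch, assuming the KafkaTemplate bean is wired elsewhere and the retry counts are illustrative:

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    // After retries are exhausted, the record is published to the
    // "<topic>.DLT" dead-letter topic (e.g. books.DLT) for later inspection.
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
    // Retry up to 3 times, waiting 1 second between attempts.
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
}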
Replication
Kafka ensures durability and fault tolerance by persisting messages to disk and replicating them across multiple brokers. Each partition can have multiple replicas, distributed across different nodes in the cluster.
One replica is designated as the leader, which handles all reads and writes, while the others act as followers that continuously sync data from the leader. If a broker fails, one of the in-sync replicas is automatically promoted to leader, ensuring minimal disruption and no data loss (depending on configuration).
In addition, Kafka stores messages on disk for a configurable retention period (e.g., 1 day or more), regardless of whether they have been consumed. This allows consumers to replay historical messages, making it a powerful mechanism for recovery, debugging, and rebuilding state.
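Both properties are set when the topic is created. A sketch using the AdminClient, where the broker address, partition and replica counts, and retention value are all assumptions:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed

try (AdminClient admin = AdminClient.create(props)) {
    // 3 partitions, each replicated to 3 brokers (1 leader + 2 followers).
    NewTopic topic = new NewTopic("books", 3, (short) 3)
            // Keep messages for 7 days whether or not they are consumed.
            .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
    admin.createTopics(List.of(topic)).all().get();
}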