Apache Kafka is a distributed event streaming platform used for building high-performance data pipelines, streaming analytics, and real-time applications. It is designed to be fast, scalable, durable, and fault-tolerant.
Key Concepts in Kafka
1. Broker
- A Kafka broker is a server that stores data and serves clients (producers and consumers).
- A Kafka cluster is made up of multiple brokers.
- Each broker handles storage and coordination for assigned topic partitions.
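As a quick, minimal sketch using the standard Kafka Java client, the Admin API can list the brokers that make up a cluster. The bootstrap address localhost:9092 below is a placeholder:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; point this at any broker in your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Each Node returned here is one broker in the cluster.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("broker id=%d host=%s:%d%n", node.id(), node.host(), node.port());
            }
        }
    }
}
```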
2. Topic
- A topic is a named stream of records.
- Producers write data to topics, and consumers read data from topics.
- Topics can have multiple partitions for scalability.
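Topics are typically created by an operator or through the Admin API. The sketch below is illustrative only; the topic name "orders", the partition count, and the replication factor are example values:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Example topic: "orders" with 3 partitions, replication factor 2.
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```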
3. Partition
- Topics are split into partitions to enable parallelism.
- Each partition is an ordered, immutable sequence of records.
- Partitions allow multiple consumers to read from a topic concurrently.
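One way to see partition-level ordering is to attach a consumer directly to a single partition. This is a minimal sketch; the topic "orders", partition 0, and the broker address are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadOnePartition {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to partition 0 of "orders"; records arrive in offset order.
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```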
4. Offset
- Every record in a partition has a unique offset, representing its position in the partition.
- Consumers use offsets to keep track of which messages have been consumed.
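A minimal sketch of offset tracking with the Java client, assuming a placeholder topic "orders" and group id "offset-demo": each record carries its offset, and commitSync stores the group's position so processing can resume from there later.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TrackOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");             // placeholder group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually below
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // The offset is the record's position within its partition.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
            consumer.commitSync(); // store the consumed position for this group
        }
    }
}
```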
5. Producer
- A producer is a client that publishes data to Kafka topics.
- It can choose the partition explicitly or let Kafka pick one based on the record key (records with the same key always go to the same partition).
- Producers can control message durability via acknowledgment settings (e.g., acks=all for maximum reliability).
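A bare-bones producer sketch follows; the topic, key, and broker address are placeholders. The record key determines the partition, and acks=all asks Kafka to wait for all in-sync replicas before confirming the write:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42", an example value) determines the partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order created");
            RecordMetadata metadata = producer.send(record).get();
            System.out.printf("wrote to partition %d at offset %d%n",
                    metadata.partition(), metadata.offset());
        }
    }
}
```

Calling get() on the returned future blocks until the broker acknowledges the write; production code would usually send asynchronously and handle the result in a callback.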
6. Consumer
- A consumer reads data from one or more partitions of a topic.
- Consumers can be grouped into consumer groups for parallel data processing.
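A typical poll loop looks like the sketch below; the topic name, group id, and broker address are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // placeholder group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```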
7. Consumer Group
- Consumers in the same group share work: each partition is consumed by only one consumer in the group.
- Provides load balancing and fault tolerance.
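Running several copies of the consumer above with the same group.id splits the topic's partitions among them, and the Admin API can show which member currently owns which partition. The group id "order-processors" in this sketch is a placeholder:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.clients.admin.MemberDescription;
import org.apache.kafka.common.TopicPartition;

public class InspectGroup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(Collections.singletonList("order-processors"))
                    .describedGroups().get("order-processors").get();
            // Each member owns a disjoint set of partitions.
            for (MemberDescription member : group.members()) {
                for (TopicPartition tp : member.assignment().topicPartitions()) {
                    System.out.printf("%s owns %s-%d%n", member.consumerId(), tp.topic(), tp.partition());
                }
            }
        }
    }
}
```

If a member leaves or a new one joins, the group rebalances and partitions are reassigned among the remaining consumers.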
8. Replication
- Kafka replicates partitions across brokers for high availability.
- Each partition has:
  - One leader (handles reads and writes).
  - One or more followers (replicate the leader's data).
- Ensures fault tolerance if a broker fails.
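The leader/follower layout of a topic can be inspected through the Admin API. This sketch assumes a reasonably recent clients library (allTopicNames() was added in 3.1; older clients use all()) and a placeholder topic "orders":

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription topic = admin
                    .describeTopics(Collections.singletonList("orders"))
                    .allTopicNames().get()   // use .all() on pre-3.1 client versions
                    .get("orders");
            for (TopicPartitionInfo partition : topic.partitions()) {
                // Leader handles reads/writes; replicas and in-sync replicas (ISR) follow it.
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        partition.partition(), partition.leader(), partition.replicas(), partition.isr());
            }
        }
    }
}
```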
9. ZooKeeper
- Kafka uses ZooKeeper (pre-2.8) for cluster coordination: leader election, configuration, and metadata storage.
- Newer Kafka versions (2.8+) can run in KRaft mode, eliminating the need for ZooKeeper.
10. Log
- Each partition is stored as an append-only log on disk (a sequence of segment files).
- Logs are persistent, and messages can be replayed from a specific offset.
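Because the log is persistent, a consumer can rewind and re-read it. The sketch below seeks to offset 100 (an arbitrary example position) in partition 0 of a placeholder topic "orders":

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition));
            // Rewind to offset 100 (example position) and re-read everything after it.
            consumer.seek(partition, 100L);
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```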
11. Acknowledgments (acks)
- Producer settings control how many brokers must acknowledge a write:
  - acks=0: no acknowledgment.
  - acks=1: only the partition leader acknowledges.
  - acks=all: all in-sync replicas acknowledge (most reliable).
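The acks setting is plain producer configuration. The sketch below uses acks=all together with a send callback so failures surface even though the send is asynchronous; the topic and broker address are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Durability knob: "0" (fire and forget), "1" (leader only), "all" (all in-sync replicas).
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    System.err.println("write failed: " + exception.getMessage());
                } else {
                    System.out.printf("acknowledged at partition %d, offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // closing the producer flushes outstanding sends, so the callback fires
    }
}
```

Lower settings (acks=0 or acks=1) reduce latency but risk losing records if the leader fails before followers have copied them.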
Conclusion
Apache Kafka simplifies real-time data processing through its publish-subscribe, log-based architecture. Its core components (topics, partitions, offsets, and consumer groups) provide the foundation for scalable, fault-tolerant data pipelines.