What is Kafka
Apache Kafka is a distributed publish-subscribe messaging system used for collecting and delivering high volumes of data with low latency, similar to a traditional message broker.
In today's world, real-time information is continuously generated by applications in business, social media, and many other domains, and it needs easy, reliable ways to be routed quickly to many kinds of receivers. For example, when you use your mobile banking application to pay a merchant or transfer money to a friend, several screen flows are involved: logging in, opening the payments section, selecting a transfer, choosing a payee, entering the amount and account number, and submitting the transaction. If a failure occurs where the money is deducted from your account but never reaches your friend's account, you may call customer care to raise a complaint. The customer care representative will have complete data about your actions, starting from the login. This is possible because, in the background, each action you perform in the mobile application publishes its details to the relevant receivers.
With a large number of such users, an enormous volume of action details is generated continuously. If the application that consumes these messages goes down, the action details produced in the meantime would simply be lost. A mechanism is therefore required that connects producers and consumers seamlessly and avoids any loss of data.
To meet these demands, engineers at LinkedIn developed Kafka, a publish-subscribe based, fault-tolerant messaging system, and open-sourced it in 2011; it became a top-level Apache project in 2012. Kafka is written in Scala and Java: the broker was originally implemented in Scala, the original Scala clients were later replaced by Java clients, and newer components are developed primarily in Java.
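To make the producer side of that flow concrete, here is a minimal sketch of a Java producer, using the standard kafka-clients library, that publishes one such action detail. The broker address localhost:9092, the topic name user-actions, and the key and value shown are assumptions made for illustration, not details from the example above.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserActionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Illustrative topic, key, and value: the key (a user id) keeps one user's
            // events on the same partition, so their order is preserved.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("user-actions", "user-42", "payment-submitted");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Published to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records before returning
    }
}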
Benefits of Kafka
Reliability: Kafka's distributed architecture, topic partitioning, and data replication ensure high reliability. Messages are replicated across multiple brokers, making the system resilient to failures. Even if one broker goes down, the data can still be served from other replicas, providing uninterrupted service.
Scalability: Kafka's design allows it to scale horizontally by adding more brokers to the cluster as the data load increases. This makes it easy to handle large amounts of data and growing workloads without significant performance degradation.
Durability: Kafka provides disk-based data retention: messages are persisted on disk according to retention policies set per topic. This ensures durability, and data remains available for consumption even if consumers experience issues or fall behind; the topic-creation sketch after this list shows how retention and replication can be configured per topic.
High Performance: Kafka's architecture, coupled with its ability to handle large-scale data processing and real-time streaming, makes it a high-performance messaging system. It can process and deliver messages with low latency, making it suitable for use cases requiring real-time data.
Publish-Subscribe Model: Kafka follows a publish-subscribe model, allowing multiple consumers to subscribe to topics and receive data in parallel. This decouples data producers from consumers, providing flexibility and scalability in data distribution; the consumer sketch after this list illustrates the subscribe side.
Real-time Stream Processing: Kafka's ability to handle real-time data streams enables various applications, such as real-time analytics, monitoring, and event-driven architectures.
Extensibility: Kafka has a rich ecosystem and integrates well with other Big Data tools and frameworks, making it a versatile component in a data processing pipeline.
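To show how the partitioning, replication, and retention mentioned under Reliability, Scalability, and Durability fit together, here is a rough sketch that creates a topic with Kafka's Java AdminClient. The topic name, the partition count of 3, the replication factor of 2, and the seven-day retention are assumed values chosen for illustration.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateUserActionsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address for the example.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions let consumption scale out across consumers; a replication
            // factor of 2 keeps a copy of every partition on a second broker.
            NewTopic topic = new NewTopic("user-actions", 3, (short) 2)
                    // Per-topic retention override: keep messages for 7 days (in ms).
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("Created topic user-actions");
        }
    }
}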
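And for the subscribe side of the publish-subscribe model, a minimal consumer sketch: it joins an illustrative consumer group, subscribes to the same assumed user-actions topic, and reads records as they arrive (or catches up on retained history after a restart).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class UserActionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address and an illustrative consumer group name.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-care-dashboard");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // When the group has no committed offset yet, start from the oldest retained message.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("user-actions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("user=%s action=%s partition=%d offset=%d%n",
                            record.key(), record.value(), record.partition(), record.offset());
                }
            }
        }
    }
}

Because each consumer group tracks its own offsets, a second group, for example a fraud-monitoring service, could subscribe to the same topic and read the full stream independently, which is exactly the decoupling of producers and consumers described above.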