

Apache Kafka


In today’s world, data is the main ingredient of internet applications and typically encompasses the following:

  • Page visits and clicks
  • User activities
  • Events corresponding to logins
  • Social networking activities such as likes, shares and comments
  • Application-specific metrics (e.g., logs, page load time, performance)

This data can be used to run real-time analytics that serve various purposes, some of which are:


  • Delivering advertisements
  • Tracking abnormal user behaviors
  • Displaying search results ranked by relevance
  • Showing recommendations based on previous activities

Problem: Collecting all of this data is not easy, since it is generated from various sources in different formats.

Solution: One way to solve this problem is to use a messaging system. Messaging systems provide seamless integration between distributed applications with the help of messages.

[Figure: Kafka at LinkedIn]


About

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data, enabling you to pass messages from one endpoint to another. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is fast, agile, scalable, and distributed by design.



Publish-Subscribe Messaging System

In a publish-subscribe system, messages are persisted in a topic. Unlike a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics.
A real-life example is Dish TV, which publishes different channels such as sports, movies, and music; anyone can subscribe to their own set of channels and receive content whenever their subscribed channels have it available.

Topics and Logs

The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log.



Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
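
To make offsets concrete, here is a minimal sketch using Kafka's Java consumer client. The broker address localhost:9092, the group id, and the topic name page-visits are illustrative assumptions; the point is that every record carries its partition and offset alongside its key, value, and timestamp.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "offset-demo");             // assumed group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-visits")); // assumed topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Every record is uniquely identified by (topic, partition, offset).
                    System.out.printf("partition=%d offset=%d key=%s value=%s timestamp=%d%n",
                            r.partition(), r.offset(), r.key(), r.value(), r.timestamp());
                }
            }
        }
    }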

Key Points


  • Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss. 
  • Kafka is built on top of the ZooKeeper synchronization service.
  • Kafka is very fast and is designed for zero downtime and zero data loss.



Need for Kafka

Kafka can handle a large number of diverse consumers. It is also very fast, capable of roughly 2 million writes per second. Kafka persists all data to disk, which essentially means that all writes go to the page cache of the OS (RAM). This makes it very efficient to transfer data from the page cache to a network socket.

Apache Kafka Use Cases

[Figure: Apache Kafka use cases]


Kafka Architecture


Consider a topic configured with three partitions. Partition 1 has two offsets, 0 and 1. Partition 2 has four offsets, 0 through 3. Partition 3 has one offset, 0. The id of a replica is the same as the id of the server that hosts it.
If the replication factor of the topic is set to 3, Kafka creates 3 identical replicas of each partition and places them in the cluster to make them available for all its operations. To balance load in the cluster, each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.
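
As a sketch of how such a topic could be created programmatically, the Java AdminClient call below requests 3 partitions with a replication factor of 3. The topic name user-activity and the broker address are assumptions, and a replication factor of 3 requires at least 3 brokers in the cluster.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 3: each partition gets 3 replicas,
                // which Kafka spreads across the brokers of the cluster.
                NewTopic topic = new NewTopic("user-activity", 3, (short) 3); // assumed name
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }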
The main components of Kafka are described below.

1. Topics
   A stream of messages belonging to a particular category is called a topic; data is stored in topics. Topics are split into partitions, and for each topic Kafka keeps a minimum of one partition. Each partition contains messages in an immutable ordered sequence and is implemented as a set of segment files of equal size.

2. Partition
   Topics may have many partitions, so a topic can handle an arbitrary amount of data.

3. Partition offset
   Each message within a partition has a unique sequential id called its offset.

4. Replicas of a partition
   Replicas are backups of a partition. Replicas never read or write data themselves; they exist to prevent data loss.

5. Brokers
   • Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. If there are N partitions in a topic and N brokers, each broker will have one partition.
   • If there are N partitions in a topic and more than N brokers (N + M), the first N brokers will have one partition each and the remaining M brokers will not have any partition for that particular topic.
   • If there are N partitions in a topic and fewer than N brokers (N - M), each broker will have one or more partitions shared among them. This scenario is not recommended due to unequal load distribution among the brokers.

6. Kafka Cluster
   A Kafka deployment with more than one broker is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.

7. Producers
   Producers publish messages to one or more Kafka topics and send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file; in effect, the message is appended to a partition. Producers can also send messages to a partition of their choice.

8. Consumers
   Consumers read data from brokers. They subscribe to one or more topics and consume published messages by pulling data from the brokers.

9. Leader
   The leader is the node responsible for all reads and writes for a given partition. Every partition has one server acting as its leader.

10. Follower
    A node that follows the leader's instructions is called a follower. If the leader fails, one of the followers automatically becomes the new leader. A follower acts like a normal consumer: it pulls messages and updates its own data store.


Cluster Architecture of Apache Kafka




Role of Zookeeper

ZooKeeper serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers share information via a ZooKeeper cluster, and Kafka stores basic metadata in ZooKeeper, such as information about topics, brokers, and consumer offsets (queue readers).
Since all the critical information is stored in ZooKeeper, which normally replicates this data across its ensemble, the failure of a Kafka broker or a ZooKeeper node does not affect the state of the Kafka cluster. Kafka restores the state once ZooKeeper restarts, which gives Kafka zero downtime.
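
As an aside, the broker metadata that Kafka registers in ZooKeeper can be inspected with the zookeeper-shell tool that ships with Kafka. Assuming ZooKeeper listens on localhost:2181, launching the shell and listing the znode /brokers/ids shows the registered broker ids:

    bin/zookeeper-shell.sh localhost:2181
    ls /brokers/ids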

Broker

A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper for maintaining their cluster state. Kafka broker leader election is also handled via ZooKeeper.

Producers

Producers push data to brokers. When a new broker is started, all producers discover it and automatically begin sending messages to it. A Kafka producer does not need to wait for acknowledgements from the broker and can send messages as fast as the broker can handle them.
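
Here is a minimal producer sketch in Java, assuming a broker at localhost:9092 and a hypothetical topic page-visits. The acks=0 setting corresponds to the fire-and-forget behavior described above; with acks=1 or acks=all the producer instead waits for broker acknowledgement.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("acks", "0"); // fire-and-forget: don't wait for broker acknowledgement
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key always land in the same partition.
                producer.send(new ProducerRecord<>("page-visits", "user-42", "clicked /home"));
            } // close() flushes any buffered messages before exiting
        }
    }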

Consumers

Since Kafka brokers are stateless, the consumer has to keep track of how many messages have been consumed, using the partition offset.

The key difference between stateful and stateless applications is that stateless applications don’t “store” data, whereas stateful applications require backing storage. Stateful applications such as the Cassandra, MongoDB, and MySQL databases all require some type of persistent storage that will survive service restarts.


Two Types of Workflow in Kafka

1. Publish-subscribe messaging workflow
2. Queue messaging workflow

Workflow of Pub-Sub Messaging

Following is the step-wise workflow of Pub-Sub Messaging:

  • Producers send messages to a topic at regular intervals.
  • The Kafka broker stores all messages in the partitions configured for that particular topic, ensuring the messages are shared equally between partitions. If the producer sends two messages and there are two partitions, Kafka stores one message in the first partition and the second message in the second partition.
  • The consumer subscribes to a specific topic.
  • Once the consumer subscribes to a topic, Kafka provides the current offset of the topic to the consumer and also saves the offset in the ZooKeeper ensemble.
  • The consumer requests new messages from Kafka at regular intervals (e.g., every 100 ms).
  • Once Kafka receives messages from producers, it forwards these messages to the consumers.
  • The consumer receives each message and processes it.
  • Once the messages are processed, the consumer sends an acknowledgement to the Kafka broker.
  • Once Kafka receives an acknowledgement, it changes the offset to the new value and updates it in ZooKeeper. Since offsets are maintained in ZooKeeper, the consumer can read the next message correctly even during server outages.
  • The above flow repeats until the consumer stops requesting messages.
  • The consumer also has the option to rewind or skip to any desired offset of a topic at any time and read all the subsequent messages (see the sketch after this list).
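
The acknowledgement, offset-update, and rewind steps above map roughly onto the Java consumer API as in the sketch below; the broker address, group id, and topic name are assumptions. Note that modern Kafka clients commit offsets to an internal Kafka topic rather than to ZooKeeper, but the flow is the same.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class WorkflowDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "pub-sub-demo");            // assumed group id
            props.put("enable.auto.commit", "false");         // we acknowledge manually
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-visits")); // assumed topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // hypothetical processing step
                }
                consumer.commitSync(); // the "acknowledgement": advances the stored offset

                // Rewind: re-read partition 0 from the beginning (offset 0).
                // Assumes partition 0 was assigned to this consumer by the earlier poll;
                // subsequent polls would then return all messages from offset 0 onward.
                consumer.seek(new TopicPartition("page-visits", 0), 0);
            }
        }

        private static void process(ConsumerRecord<String, String> r) {
            System.out.println(r.value());
        }
    }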



Types of Configuration

1. Single-node, single-broker configuration
2. Single-node, multiple-broker configuration (an illustrative broker config follows)
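
For the multiple-broker case, each broker on the node typically runs from its own copy of config/server.properties with a distinct id, port, and log directory; the file name and values below are illustrative assumptions:

    # config/server-1.properties (hypothetical copy for a second broker)
    broker.id=1
    listeners=PLAINTEXT://localhost:9093
    log.dirs=/tmp/kafka-logs-1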


Kafka has four core APIs:



  1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams (a short sketch follows this list).
  4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
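
As an illustration of item 3, a minimal Streams API sketch that reads an input stream, upper-cases each value, and writes the result to an output topic might look like this; the application id and topic names are assumptions:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class StreamsDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-demo"); // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Consume an input stream, transform each record, produce an output stream.
            KStream<String, String> input = builder.stream("input-topic");   // assumed topic
            input.mapValues(v -> v.toUpperCase()).to("output-topic");        // assumed topic
            new KafkaStreams(builder.build(), props).start();
        }
    }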




Basic Topic Operations


  1. Create Topic: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic topic-name
  2. Modify Topic: bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic_name --partitions count
  3. Delete Topic: bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic topic_name
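
Note that these commands use the older --zookeeper flag. In recent Kafka releases (2.2 and later), kafka-topics.sh can talk to the brokers directly via --bootstrap-server, and in Kafka 3.0+ the --zookeeper flag has been removed entirely; for example:

    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 1 --topic topic-name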

