Skip to main content

Kafka Integration with Spark

              Apache Kafka - Integration With Spark



About Spark

Spark Streaming API enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, etc., 
and can be processed using complex algorithms such as high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.



Integration with Spark

Kafka is a potential messaging and integration platform for Spark streaming. Kafka act as the central hub for real-time streams of data and is processed using complex algorithms in Spark Streaming. Once the data is processed, Spark Streaming could be publishing results into yet another Kafka topic or store in HDFS, databases or dashboards. 
The following diagram depicts the conceptual flow.




Now, let's go through Kafka-Spark API’s in detail.
1. SparkConf API
2. StreamingContext API
3. KafkaUtils API


SparkConf API

SparkConf class has the following methods -

a. set(string key, string value) 
b. remove(string key)
c. setAppName(string name)
d. get(string key)



StreamingContext API


  • master - cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
  • appName - a name for your job, to display on the cluster web UI
  • batchDuration - the time interval at which streaming data will be divided into batches



KafkaUtils API


  • KafkaUtils API is used to connect the Kafka cluster to Spark streaming. This API has the significant method createStream 
  • ssc - StreamingContext object.
  • zkQuorum - Zookeeper quorum.
  • groupId - The group id for this consumer.
  • topics - return a map of topics to consume.
  • storageLevel - Storage level to use for storing the received objects.



Pages written

Comments