Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-up
DESCRIPTION
On September 17, 2014, at the NJ Big Data Palooza MeetUp, Kishore Veleti, Big Data Engineer at Knowledgent, presented on Stream Processing with Big Data using Apache Kafka. This presentation includes the content he covered during the event, including an overview of Kafka terminology and processes.
TRANSCRIPT
©2014 Knowledgent Group Inc. All Rights Reserved
Stream Processing with Big Data
Learn Apache Kafka
Kishore Veleti
Big Data Engineer
About Me
• Big Data Engineer at Knowledgent
• Background in enterprise application development using the Hadoop stack, Java, and PHP
• Worked on healthcare, banking, and social media applications
• Passionate about sharing knowledge
Tutorial
We will discuss:
• What is Apache Kafka?
• Apache Kafka Terminology
• Apache Kafka – about Topic & Partition
• Apache Kafka hands-on
What is Apache Kafka?
• Apache Kafka is a publish-subscribe messaging system implemented as a distributed commit log
• It is written in Scala and Java
• Built by LinkedIn to process activity stream data from its website
What is Apache Kafka?
• Messages are delivered to subscribers in near real time
• A message can have many subscribers
• Kafka persists messages to disk
• Messages are retained for a configurable time period
• Subscribers (clients) track the state of their own reads
• Messages are easy to replay
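The retention and replay properties above can be illustrated with a toy in-memory model. This is only a sketch and not Kafka's API: `RetainedLog`, `Subscriber`, and `retention_seconds` are hypothetical names chosen for the example, and timestamps are passed in by hand so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy stand-in for a single Kafka log; not the real Kafka API."""
    retention_seconds: float
    base_offset: int = 0                          # offset of the oldest retained message
    entries: list = field(default_factory=list)   # (timestamp, message) pairs

    def append(self, message, now):
        self.entries.append((now, message))
        return self.base_offset + len(self.entries) - 1

    def expire(self, now):
        # Drop messages older than the retention period, whether or not they were read.
        while self.entries and now - self.entries[0][0] > self.retention_seconds:
            self.entries.pop(0)
            self.base_offset += 1

    def read_from(self, offset):
        start = max(offset, self.base_offset) - self.base_offset
        return [message for _, message in self.entries[start:]]


@dataclass
class Subscriber:
    """Each subscriber stores its own read position; the log does not track it."""
    log: RetainedLog
    offset: int = 0

    def poll(self):
        start = max(self.offset, self.log.base_offset)
        messages = self.log.read_from(start)
        self.offset = start + len(messages)
        return messages

    def rewind(self, offset):
        # Replaying is just moving the stored position backwards.
        self.offset = offset


log = RetainedLog(retention_seconds=60)
for second, event in enumerate(["login", "view", "click"]):
    log.append(event, now=second)

a, b = Subscriber(log), Subscriber(log)
first_read = a.poll()      # subscriber a reads everything
a.rewind(1)
replayed = a.poll()        # a replays from offset 1
b_read = b.poll()          # b is unaffected by a's reads

log.expire(now=61.5)       # messages written at t=0 and t=1 are past retention
after_expiry = log.read_from(0)
```

Because the read position lives with the subscriber, replaying costs nothing on the log side, and adding a second subscriber does not change what the first one sees.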
Apache Kafka Terminology
• Message: A unit of data to send
• Topic: Kafka maintains messages in categories called “topics”
• Partition: A logical division of a topic
• Producer: An API to publish messages to a Kafka topic
• Broker: A server in the Kafka cluster
• Cluster: A Kafka cluster comprises one or more brokers
• Consumer: An API to consume published messages and process them further
• Replication: Kafka replicates the log for each partition across servers
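To make the relationship between brokers, partitions, and replication concrete, here is a small sketch that spreads partition replicas across brokers. The round-robin placement and the `assign_replicas` function are assumptions made for illustration; Kafka's real assignment logic is more involved.

```python
def assign_replicas(broker_ids, num_partitions, replication_factor):
    """Return {partition: [broker ids holding a copy]} using simple round-robin.

    Hypothetical helper for illustration; not part of any Kafka client.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        # Place each copy on a different broker, starting one broker further
        # along for every partition.
        assignment[partition] = [
            broker_ids[(partition + i) % len(broker_ids)]
            for i in range(replication_factor)
        ]
    return assignment


cluster = [0, 1, 2]   # a cluster of three brokers
layout = assign_replicas(cluster, num_partitions=4, replication_factor=2)
for partition, brokers in layout.items():
    print(f"partition {partition} -> brokers {brokers}")
```

With two copies of every partition on different brokers, losing a single broker still leaves one copy of each partition available.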
Apache Kafka Terminology & Big Picture
[Diagram: Message, Topic, Partition, Producer, Broker, Consumer]
At a high level, producers send messages over the network to the Kafka cluster.
The Kafka cluster in turn serves them to consumers.
Apache Kafka Terminology & Big Picture
Let’s do a hands-on exercise with Kafka using what we have learned so far.
Apache Kafka: About Topic and Partition
For each topic, Kafka maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to.
Each message in a partition is assigned a sequential ID number called the offset.
[Diagram: writes are appended to Partition 1, Partition 2, and Partition 3]
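The partition log described on this slide can be sketched as an append-only list in which the offset is simply the position of a message. The `Partition` and `Topic` classes and the topic name `page-views` are hypothetical and exist only for this example; they are not Kafka classes.

```python
class Partition:
    """Append-only sequence of messages; the offset is the message's position."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # New messages always go at the end and receive the next offset.
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        # Reading never modifies the log; there is no update or delete method.
        return self._messages[offset:offset + max_messages]


class Topic:
    """A named topic split into a fixed number of partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]


topic = Topic("page-views", num_partitions=3)
offsets = [topic.partitions[0].append(m) for m in ("a", "b", "c")]
other = topic.partitions[1].append("x")
tail = topic.partitions[0].read(1)
print(offsets, other, tail)
```

Offsets are sequential within one partition only: partition 0 and partition 1 each start counting from 0, so ordering is defined per partition rather than across the whole topic.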
Apache Kafka: About Topic and Partition
In Kafka, a Producer is an API to publish messages to a topic.
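A producer also has to decide which partition of the topic each message goes to. The sketch below assumes two common strategies that the slide does not cover: hashing a message key, and rotating across partitions when there is no key. `ToyProducer` is a hypothetical class and CRC32 is used only to get a stable hash; real Kafka clients use their own partitioners.

```python
import zlib


class ToyProducer:
    """Illustrative producer over in-memory partitions; not a Kafka client."""

    def __init__(self, topics):
        self.topics = topics   # topic name -> list of partitions (each a list)
        self._next = 0         # round-robin counter for messages without a key

    def send(self, topic, value, key=None):
        partitions = self.topics[topic]
        if key is None:
            index = self._next % len(partitions)
            self._next += 1
        else:
            # The same key always hashes to the same partition.
            index = zlib.crc32(key.encode("utf-8")) % len(partitions)
        partitions[index].append(value)
        return index, len(partitions[index]) - 1   # (partition, offset)


topics = {"clicks": [[], [], []]}
producer = ToyProducer(topics)

first = producer.send("clicks", "home", key="user-42")
second = producer.send("clicks", "cart", key="user-42")
keyless_a = producer.send("clicks", "ping")
keyless_b = producer.send("clicks", "pong")
```

Keyed messages from the same user land in one partition, so their relative order is preserved there; keyless messages are spread across partitions.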
Apache Kafka: About Topic and Partition
In Kafka, a Consumer is an API to consume messages from topics.
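On the consuming side, the essential loop is: read a batch from each partition, process it, and remember the next offset to read. The sketch below is a simplified model under that assumption; `ToyConsumer`, `positions`, and `max_per_partition` are hypothetical names and do not correspond to a real Kafka client.

```python
class ToyConsumer:
    """Illustrative consumer over in-memory partitions; not a Kafka client."""

    def __init__(self, partitions, committed=None):
        self.partitions = partitions             # list of lists, one per partition
        self.positions = dict(committed or {})   # partition index -> next offset to read

    def poll(self, max_per_partition=2):
        records = []
        for index, log in enumerate(self.partitions):
            start = self.positions.get(index, 0)
            batch = log[start:start + max_per_partition]
            # Each record carries its partition and offset alongside the value.
            records.extend((index, start + i, value) for i, value in enumerate(batch))
            self.positions[index] = start + len(batch)
        return records


partitions = [["a0", "a1", "a2"], ["b0"]]

consumer = ToyConsumer(partitions)
first = consumer.poll()
saved = dict(consumer.positions)      # persist positions, e.g. before a restart

resumed = ToyConsumer(partitions, committed=saved)
rest = resumed.poll()                 # continues where the first consumer stopped
```

Saving the positions outside the consumer is what lets a restarted consumer pick up where the previous one stopped instead of starting again from offset 0.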
Apache Kafka Terminology & Big Picture
Let’s do a hands-on exercise with Kafka using what we have learned so far.
Apache Kafka Use Cases
• Trading systems: identifying risk in real time
• Change data capture: capturing changed data into a data lake environment
• Online gaming: identifying the top scorers of a game
Thank you!
Questions?