How Apache Kafka is Transforming Hadoop, Spark and Storm


TRANSCRIPT

Page 1: How Apache Kafka is transforming Hadoop, Spark and Storm


How Apache Kafka is transforming Hadoop, Spark & Storm

Page 2: How Apache Kafka is transforming Hadoop, Spark and Storm


What will you learn today?

Million Dollar Question! Why do we need Kafka?

What is Kafka?

Kafka Architecture

Kafka with Hadoop

Kafka with Spark

Kafka with Storm

Companies using Kafka

Demo on Kafka Messaging Service…

Page 3: How Apache Kafka is transforming Hadoop, Spark and Storm

Million Dollar Question! Why do we need Kafka?

Page 4: How Apache Kafka is transforming Hadoop, Spark and Storm


Why is Kafka preferred over traditional brokers like JMS and AMQP?

Why Kafka Cluster?

Page 5: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Producer Performance with Other Systems

Page 6: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Consumer Performance with Other Systems

Page 7: How Apache Kafka is transforming Hadoop, Spark and Storm


Salient Features of Kafka

High Throughput: Supports millions of messages on modest hardware

Scalability: A highly scalable distributed system that can be expanded with no downtime

Replication: Messages can be replicated across the cluster, which supports multiple subscribers and rebalances consumers in case of failure

Durability: Messages can be persisted to disk, which can later be used for batch consumption

Stream Processing: Kafka can be used with real-time streaming frameworks such as Spark and Storm

Data Loss: With the proper configuration, Kafka can ensure zero data loss

Page 8: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Advantages

• With Kafka, we can easily handle hundreds of thousands of messages per second
• The cluster can be expanded with no downtime, making Kafka highly scalable
• Messages are replicated, which provides reliability and durability
• Fault tolerant
• Scalable

Page 9: How Apache Kafka is transforming Hadoop, Spark and Storm

What is Kafka?

Page 10: How Apache Kafka is transforming Hadoop, Spark and Storm


What is Kafka?

• A distributed publish-subscribe messaging system
• Developed at LinkedIn Corporation
• Provides a solution to handle all activity stream data
• Fully supported in the Hadoop platform
• Partitions real-time consumption across a cluster of machines
• Provides a mechanism for parallel load into Hadoop

Page 11: How Apache Kafka is transforming Hadoop, Spark and Storm


Apache Kafka – Overview

[Diagram: Frontends and proxies publish tracking events to Kafka; background services act as producers and consumers; downstream, the data flows into Hadoop and the data warehouse (DWH).]

Page 12: How Apache Kafka is transforming Hadoop, Spark and Storm

Kafka Architecture

Page 13: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Architecture

[Diagram: Producers (front ends, services, proxies, adapters and others) publish to a cluster of Kafka brokers coordinated by ZooKeeper; consumers (real-time applications, NoSQL stores, Hadoop, data warehouses and others) read from the brokers.]

Page 14: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Core Components

The table below lists the core concepts of Kafka:

Topic: A category or feed to which messages are published

Producer: Publishes messages to a Kafka topic

Consumer: Subscribes to and consumes messages from a Kafka topic

Broker: Handles hundreds of megabytes of reads and writes per second

Page 15: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Topic

• A user-defined category to which messages are published
• For each topic, a partition log is maintained
• Each partition contains an ordered, immutable sequence of messages, where each message is assigned a sequential ID number called the offset
• Writes to a partition are generally sequential, thereby reducing the number of hard disk seeks
• Reads from a partition can be random (see the topic-creation sketch below)
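As a small illustration of topics and partitions, the sketch below creates a three-partition topic with the third-party kafka-python client. The broker address, topic name and partition count are assumptions for illustration, not details taken from the slides.

```python
# Minimal sketch: create a topic with 3 partitions using kafka-python
# (pip install kafka-python). Broker address and topic name are assumptions.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 3 partitions let three consumers in one group read in parallel;
# replication_factor=1 is only suitable for a single-broker test setup
admin.create_topics([
    NewTopic(name="page-views", num_partitions=3, replication_factor=1)
])
admin.close()
```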

Page 16: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Producers

• Applications that publish messages to a topic in the Kafka cluster
• Can be of any kind, such as front ends, streaming applications, etc.
• While writing messages, it is also possible to attach a key to the message
• Messages with the same key arrive in the same partition
• The producer does not wait for an acknowledgement from the Kafka cluster
• Publishes messages as fast as the brokers in the cluster can handle (a minimal producer sketch follows)

[Diagram: multiple producers writing to a Kafka cluster]
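To make the producer behaviour above concrete, here is a minimal sketch using the third-party kafka-python client; the broker address, topic and keys are illustrative assumptions. Because a key is attached to each message, all messages with the same key land in the same partition.

```python
# Minimal sketch: publish keyed messages with kafka-python.
# Broker address, topic and keys are assumptions for illustration.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for user in ["alice", "bob", "alice"]:
    # Messages with the same key ("alice") are routed to the same partition
    producer.send("page-views",
                  key=user.encode("utf-8"),
                  value=b"clicked_home_page")

# send() is asynchronous; flush() blocks until buffered messages are delivered
producer.flush()
producer.close()
```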

Page 17: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Consumers

• Applications that subscribe to and consume messages from the brokers in a Kafka cluster
• Can be of any kind, such as real-time consumers, NoSQL consumers, etc.
• While consuming messages from a topic, a consumer group can be configured with multiple consumers
• Each consumer in a consumer group reads messages from a unique subset of partitions in each topic it subscribes to
• Messages with the same key arrive at the same consumer
• Supports both queuing and publish-subscribe semantics
• Consumers have to keep track of the number of messages they have consumed (a minimal consumer sketch follows)

[Diagram: multiple consumers reading from a Kafka cluster]
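The sketch below shows a consumer joining a consumer group, again with the kafka-python client; the topic, group id and broker address are illustrative assumptions. Running several copies of this script with the same group id splits the partitions among them (queuing semantics), while different group ids give each consumer its own copy of every message (publish-subscribe).

```python
# Minimal sketch: consume messages as part of a consumer group with kafka-python.
# Topic, group id and broker address are assumptions for illustration.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics-group",    # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # start from the oldest retained message
)

for record in consumer:
    # Each record carries its partition and offset, so progress can be tracked
    print(record.partition, record.offset, record.key, record.value)
```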

Page 18: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Brokers

• Each server in the cluster is called a broker
• Handles hundreds of MBs of writes from producers and reads from consumers
• Retains all published messages, irrespective of whether they have been consumed or not
• Retention is configured for 'n' days
• Published messages are available for consumption for the configured 'n' days and are discarded thereafter (a sample retention setting follows)
• Works like a queue if consumer instances belong to the same consumer group, otherwise works like publish-subscribe
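As a concrete example of the retention behaviour described above, the broker-side property log.retention.hours in server.properties controls how long messages are kept; the seven-day value shown here is an illustrative assumption, not a figure from the slides.

```properties
# server.properties (Kafka broker configuration)
# Keep published messages for 7 days; older log segments are then discarded
log.retention.hours=168
```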

Page 19: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka Producer-Broker-Consumer

Page 20: How Apache Kafka is transforming Hadoop, Spark and Storm


How Kafka can be used with Hadoop

Page 21: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka with Hadoop using Camus

• Camus is LinkedIn's Kafka -> HDFS pipeline
• It is a MapReduce job
• It distributes data loads out of Kafka
• At LinkedIn, it processes tens of billions of messages per day
• All of the work is done by one single Hadoop job

Courtesy: Confluent

Page 22: How Apache Kafka is transforming Hadoop, Spark and Storm


How Kafka can be used with Spark

Page 23: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka With Spark Streaming

• Generally, in Kafka, messages are stored in multiple partitions
• If messages are stored in 'n' partitions, reading them in parallel makes things faster
• Parallel reads can be achieved effectively with Spark Streaming
• Parallelism of reads is achieved by integrating Spark's KafkaInputDStream with Kafka's high-level consumer API (a minimal sketch follows)
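The slides reference Spark's KafkaInputDStream backed by the high-level consumer API; below is a minimal sketch of that receiver-based integration using the Spark 1.x/2.x spark-streaming-kafka package in Python. The ZooKeeper address, consumer group, topic name and thread count are illustrative assumptions.

```python
# Minimal sketch: receiver-based Kafka integration with Spark Streaming
# (spark-streaming-kafka, Spark 1.x/2.x). Addresses and names are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# One receiver-based KafkaInputDStream; {"page-views": 2} asks for 2 consumer threads
stream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-group", {"page-views": 2})

# Each record is a (key, value) pair; count words in the message values
counts = (stream.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Creating several such streams (one per topic partition) and unioning them is the usual way to scale the receiver-based approach across partitions.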

Page 24: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka With Spark Streaming

[Diagram: applications publish events to Kafka, which feeds them into a streaming engine.]

Page 25: How Apache Kafka is transforming Hadoop, Spark and Storm


How Kafka can be used with Storm

Page 26: How Apache Kafka is transforming Hadoop, Spark and Storm


Kafka With Storm

Page 27: How Apache Kafka is transforming Hadoop, Spark and Storm


Companies Using Kafka

Page 28: How Apache Kafka is transforming Hadoop, Spark and Storm


Get Certified in Apache Kafka from Edureka

Edureka's Real-Time Analytics with Apache Kafka course:
• Carefully designed to provide the knowledge and skills to become a successful Kafka Big Data Developer
• Helps you master the concepts of the Kafka cluster, producers and consumers, the Kafka API, and Kafka integration with Hadoop, Storm and Spark
• Encompasses fundamental concepts such as the Kafka cluster and the Kafka API, up to advanced topics such as Kafka integration with Hadoop, Storm, Spark, Maven, etc.
• Online Live Courses: 15 hours
• Assignments: 25 hours
• Project: 20 hours
• Lifetime Access + 24 x 7 Support

Go to www.edureka.co/apache-kafka

Batch starts from 10th October (Weekend Batch)

Page 29: How Apache Kafka is transforming Hadoop, Spark and Storm

Thank You

Questions/Queries/Feedback/Survey

Recording and presentation will be made available to you within 24 hours