Introduction to Apache Kafka
![Page 1: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/1.jpg)
INTRO TO KAFKA
Jim Plush, Director of Cloud Engineering, CrowdStrike.com
Twitter: @jimplush
![Page 2: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/2.jpg)
ABOUT ME
Jim Plush, Director of Cloud Engineering @ CrowdStrike.com
Architect of distributed cloud services for catching bad guys
Previously Director of Engineering at gravity.com
personalization service, ingesting clickstream from Yahoo!, New York Times, WSJ, etc…
wrote most of the ETL workflow
![Page 3: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/3.jpg)
ABOUT CROWDSTRIKE
“Big Data” Security Company
Near term focus on targeted, state sponsored attacks and attribution
A single customer can generate 2.2 TB of machine data per day, which we process in our cloud
Horizontally scalable, distributed infrastructure
Uses goodies like Kafka, Cassandra, Elasticsearch, Hadoop, Scala, Go
![Page 4: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/4.jpg)
“Some people, when confronted with a problem, think ‘I know, I'll use a message queue.’ Now they have two problems.”
–Said everyone, always
![Page 5: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/5.jpg)
APACHE KAFKA
It’s not so much a queue as an activity stream system
Gains stability and speed at the cost of consumer complexity
It’s scalable by nature
Supports data replication
You can rewind time
It’s fast!
Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
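The "rewind time" and constant-time points above can be illustrated with a toy append-only log. This is only a sketch: `PartitionLog` and its methods are hypothetical names, and an in-memory Python list stands in for Kafka's on-disk structures, which this does not attempt to reproduce.

```python
class PartitionLog:
    """Toy in-memory stand-in for one append-only partition log (hypothetical)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message to the end of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset. Reading deletes nothing."""
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
for event in ["click:home", "click:cart", "click:checkout"]:
    log.append(event)

first_pass = log.read(0)  # consume from the beginning
replay = log.read(1)      # "rewind" to offset 1 and read the same data again
```

Because reads do not remove messages, a consumer can go back to any earlier offset and reprocess from there.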
![Page 6: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/6.jpg)
APACHE KAFKA - CONS
Consumer Complexity
Not “Rack Aware” replication
Lack of tooling/monitoring
Still a pre-1.0 release
Operationally, it’s more manual than desired
Requires ZooKeeper
![Page 7: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/7.jpg)
BASIC CONCEPTS
Topics - logical namespace for data (clickstream, app logs)
Partition - physical separation of data to allow for horizontal scalability
Consumer Groups/Offsets - where your consumer group last checkpointed in the stream
Replica - allows partitions to be replicated across nodes for availability; only one replica is the active leader
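A rough sketch of how these concepts fit together: each consumer group keeps its own checkpoint per topic and partition, so two groups can read the same data independently. `OffsetStore` and `consume` are hypothetical names for illustration, not a Kafka client API, and a plain list stands in for a partition.

```python
from collections import defaultdict


class OffsetStore:
    """Hypothetical checkpoint store keyed by (group, topic, partition)."""

    def __init__(self):
        self._offsets = defaultdict(int)  # unseen keys start at offset 0

    def commit(self, group, topic, partition, next_offset):
        self._offsets[(group, topic, partition)] = next_offset

    def fetch(self, group, topic, partition):
        return self._offsets[(group, topic, partition)]


def consume(log, store, group, topic, partition, batch_size=2):
    """Read one batch from the group's last checkpoint, then advance the checkpoint."""
    start = store.fetch(group, topic, partition)
    batch = log[start:start + batch_size]
    store.commit(group, topic, partition, start + len(batch))
    return batch


clickstream_p0 = ["e0", "e1", "e2", "e3", "e4"]  # one partition of a "clickstream" topic
store = OffsetStore()

a1 = consume(clickstream_p0, store, "analytics", "clickstream", 0)
a2 = consume(clickstream_p0, store, "analytics", "clickstream", 0)
b1 = consume(clickstream_p0, store, "billing", "clickstream", 0)  # separate group, own offset
```

The second group starts from the beginning of the partition because its checkpoint is tracked separately from the first group's.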
![Page 8: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/8.jpg)
![Page 9: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/9.jpg)
![Page 10: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/10.jpg)
![Page 11: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/11.jpg)
USE CASES
First point of data ingestion; provides back pressure to downstream systems
Provide a data firehose for clients (with seeks)
Friendly to Blue/Green deployment architectures
Mirroring test data easily
Data Center log aggregation
![Page 12: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/12.jpg)
Seamless Integration with Storm
![Page 13: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/13.jpg)
Data Center Aggregation
![Page 14: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/14.jpg)
Serving a Firehose
[Diagram: Producer, Data Stream, API Server, Customer A, Customer B]
![Page 15: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/15.jpg)
Data Affinity w/ Key Partitioning
[Diagram: Producer, Data Stream P0, Data Stream P1, Consumer A, Consumer B, UserIds 0-100, UserIds 101-200]
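One way to sketch the key-partitioning idea in the diagram: route each message by its key so that all data for a given user lands on the same partition, and therefore the same consumer. Both functions below are hypothetical. The range boundaries mirror the UserIds labels on the slide, mapping 0-100 to partition 0 and 101-200 to partition 1 is an assumption, and the CRC32 hash is a stand-in rather than the hash any Kafka client actually uses.

```python
import zlib


def range_partition(user_id, upper_bounds=(100, 200)):
    """Assumed mapping: ids up to 100 go to partition 0, ids 101-200 to partition 1."""
    for partition, upper in enumerate(upper_bounds):
        if user_id <= upper:
            return partition
    raise ValueError(f"user_id {user_id} is outside the configured ranges")


def hash_partition(key, num_partitions):
    """Stable hash of the key modulo the partition count (CRC32 chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


p_low = range_partition(42)    # falls in the first range
p_high = range_partition(150)  # falls in the second range
same_key_twice = (hash_partition("user-42", 2), hash_partition("user-42", 2))
```

Either scheme gives data affinity: the same key always maps to the same partition as long as the partition count and ranges stay fixed.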
![Page 16: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/16.jpg)
Blue/Green Deployment
[Diagram: Producer, ActiveTopic, InactiveTopic, Blue Consumer, ZooKeeper, Controller]
![Page 17: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/17.jpg)
Blue/Green Deployment
[Diagram: Producer, ActiveTopic, InactiveTopic, Blue Consumer, Green Consumer, ZooKeeper, Controller]
![Page 18: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/18.jpg)
Blue/Green Deployment
[Diagram: Producer, ActiveTopic, InactiveTopic, Blue Consumer, Green Consumer, ZooKeeper, Controller, User: 555]
![Page 19: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/19.jpg)
Blue/Green Deployment
[Diagram: Producer, ActiveTopic, InactiveTopic, Blue Consumer, Green Consumer, Controller, User: 555, ZooKeeper]
![Page 20: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/20.jpg)
SCALING OUT
1 partition = 1 consumer
1 partition needs to fit on a single machine
Partitions = the scalability of your system from the producer and consumer side
For high scale apps you will probably start out with 100 partitions
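To make "1 partition = 1 consumer" concrete, here is a small sketch of spreading partitions across the consumers in one group. The round-robin rule and the `assign_partitions` name are assumptions for illustration; this is not Kafka's actual rebalancing algorithm.

```python
def assign_partitions(partitions, consumers):
    """Give each partition exactly one owner, cycling through the consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2"]

one = assign_partitions(partitions, ["A"])                  # one consumer reads everything
three = assign_partitions(partitions, ["A", "B", "C"])      # one partition each
four = assign_partitions(partitions, ["A", "B", "C", "D"])  # the extra consumer sits idle
```

With three partitions, a fourth consumer has nothing to read, which is why the partition count sets the ceiling on consumer parallelism.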
![Page 21: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/21.jpg)
[Diagram: Producer, P0, P1, P2, Consumer A]
![Page 22: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/22.jpg)
[Diagram: Producer, P0, P1, P2, Consumer A, Consumer B, Consumer C]
![Page 24: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/24.jpg)
![Page 25: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/25.jpg)
![Page 26: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/26.jpg)
ZOOKEEPER
http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html
![Page 27: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/27.jpg)
WE’RE [email protected]
@jimplush
crowdstrike.com/about-us/careers
![Page 28: Introduction to Apache Kafka](https://reader035.vdocuments.site/reader035/viewer/2022081421/587c97a61a28abfa5e8b660d/html5/thumbnails/28.jpg)
[Diagram: Producer A, Producer B, ClickStream, Partition 1, Partition 2, Consumer A, ZooKeeper, Partition Offsets, Commit Offset]