Apache Kafka - Martin Podval
Apache Kafka
@MartinPodval, hpsv.cz
What is Apache Kafka?
● Messaging system
● Distributed
● Persistent and replicable
● Very fast (low latency) and scalable
● Simple but highly configurable
● By LinkedIn, open sourced under apache.org
Data Streaming
New kind of data ...
● User or application data (events) streams
● Monitoring - App, System
● App Logging
● High volume
Data Streaming Cont’d
… you want to process
● Using various components
● Into a target form
● Map, reduce, shuffle
● Real time or batch
HP Service Virtualization Use Cases
Processing of clients' message streams
Real-time performance modeling
Log aggregation
How To Solve It?
Producers and Consumers
● Distributed
● Decoupled
● Configurable
● Dynamic
Kafka Cluster
Brokers
● = Instances, Nodes
● Topics
● Partitions
● Replicas
ZooKeeper (ZK)
● Coordination
Kafka Topics
Commit Log
● Immutable
● Ordered
● Sequential Offset
Kafka Topics Cont’d
Partitioned
Independently:
● Stored
● Produced
● Consumed
⇒ Scalable
Replicated
● On partition basis
● Different brokers
⇒ Fault Tolerant
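Partitions can be stored, produced and consumed independently only if each message maps to exactly one partition. A minimal sketch of that mapping follows, assuming messages carry a string key. `ToyPartitioner` and `partitionFor` are hypothetical names, and the hash is plain `String.hashCode()`, a stand-in rather than the hash Kafka's own partitioner uses.

```java
/** Toy key-to-partition mapping; a stand-in, not Kafka's partitioner. */
final class ToyPartitioner {

    /** Maps a key to a partition index in [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition.
        System.out.println(partitionFor("client-42", 3));
        System.out.println(partitionFor("client-42", 3));
    }
}
```

Because the mapping is deterministic, all messages with one key stay in one partition and therefore keep their relative order.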
What Can I Do?
producer.write(topic_id, message);
consumer.read(topic_id, offset);
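The two calls above can be mimicked with a toy in-memory log to show what "immutable, ordered, sequential offset" means in practice. This is only a sketch: `ToyCommitLog`, `append` and `read` are hypothetical names, one instance stands for a single partition, and nothing here touches disk or the network.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy in-memory commit log for one partition; not Kafka's implementation. */
final class ToyCommitLog {
    private final List<String> entries = new ArrayList<>();

    /** Appends a message and returns the sequential offset assigned to it. */
    synchronized long append(String message) {
        entries.add(message);
        return entries.size() - 1;
    }

    /** Returns up to maxMessages starting at offset; empty if offset is at or past the end. */
    synchronized List<String> read(long offset, int maxMessages) {
        if (offset < 0 || maxMessages < 0) {
            throw new IllegalArgumentException("offset and maxMessages must be non-negative");
        }
        if (offset >= entries.size()) {
            return Collections.emptyList();
        }
        int from = (int) offset;
        int to = (int) Math.min((long) entries.size(), offset + (long) maxMessages);
        // Copy the slice so callers cannot modify the log through the returned list.
        return new ArrayList<>(entries.subList(from, to));
    }

    public static void main(String[] args) {
        ToyCommitLog log = new ToyCommitLog();
        log.append("first");
        log.append("second");
        // Reading never removes anything, so the same offset can be read again.
        System.out.println(log.read(0, 10));
        System.out.println(log.read(1, 10));
    }
}
```

Entries are only ever appended, so a consumer that remembers an offset can re-read from it at any time.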
I Want To Produce
● Java/Scala client
● address of one or more brokers
● choose a topic to produce to
● highly configurable and tunable:
  ○ partitioner
  ○ number of acks (async = 0, master = 1, replicas = 1+?)
  ○ batching, buffer size, timeouts, retries, ...
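The tunables listed above usually end up in a plain properties object handed to the client. The sketch below uses only `java.util.Properties`. The key names are those of the Java producer client as far as I know them (the older Scala producer uses different names), so check them against the Kafka version in use. `ProducerSettingsSketch`, `build` and every value are illustrative assumptions, not recommendations.

```java
import java.util.Properties;

/** Builds an illustrative producer configuration; all values are assumptions. */
final class ProducerSettingsSketch {

    static Properties build(String brokers, String acks) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", brokers); // one or more broker addresses
        p.setProperty("acks", acks);                 // "0" = async, "1" = master only, "all" = in-sync replicas
        p.setProperty("retries", "3");               // resend attempts after a transient failure
        p.setProperty("batch.size", "16384");        // bytes collected per partition before a send
        p.setProperty("linger.ms", "5");             // how long to wait for a batch to fill
        p.setProperty("buffer.memory", "33554432");  // total bytes the producer may buffer
        return p;
    }

    public static void main(String[] args) {
        // Broker host names below are placeholders.
        build("broker1:9092,broker2:9092", "1").list(System.out);
    }
}
```

Larger batches and a longer linger time trade latency for throughput, while the `acks` setting trades latency for durability.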
I Want To Consume
High Level API
● Groups abstraction
  ○ To All, To One
  ○ To Some
● Stream API
● Stores positions to support fault tolerance
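One way to read "To All, To One, To Some": every group sees every message, while inside a group each partition is handled by a single member. Putting each consumer in its own group gives broadcast, putting all of them in one group gives a queue, and mixing the two gives "to some". The sketch below spreads partitions across one group's members round-robin. `ToyGroupAssignment` and `assign` are hypothetical, and round-robin is a stand-in for whatever strategy the client really applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy partition assignment within one consumer group; not Kafka's algorithm. */
final class ToyGroupAssignment {

    /** Spreads partitions 0..numPartitions-1 across the members round-robin. Member names must be unique. */
    static Map<String, List<Integer>> assign(List<String> members, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        if (members.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = members.get(partition % members.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One group with two members: each partition has exactly one owner.
        System.out.println(assign(Arrays.asList("c1", "c2"), 3));
        // A second, single-member group would own all three partitions itself.
        System.out.println(assign(Arrays.asList("audit"), 3));
    }
}
```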
I Want To Consume Cont’d
Low Level
● Java/Scala client
● Find the leader for a topic partition
● Calculate an offset
● Fetch messages
  ○ Re-consume if needed
I Want To Consume Cont’d
Delivery Semantics:
● At most once
● At least once
● Exactly once
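The first two semantics differ only in whether the consumer saves its position before or after handling a message. The simulation below is a sketch under simplifying assumptions: a single consumer, one crash that happens between the two steps, and a restart from the last saved position. `DeliverySketch` and `run` are hypothetical names. Exactly once is not shown; it needs extra machinery, for example idempotent processing or storing the position together with the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simulates one consumer crash to contrast at-most-once and at-least-once. */
final class DeliverySketch {

    /**
     * Consumes the log, crashing once at index crashAt between the commit and
     * processing steps, then restarting from the last committed position.
     * Returns the messages that were actually processed.
     */
    static List<String> run(List<String> log, int crashAt, boolean commitFirst) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        int pos = 0;
        boolean crashed = false;
        while (pos < log.size()) {
            boolean crashNow = !crashed && pos == crashAt;
            if (commitFirst) {
                committed = pos + 1;            // save position first
                if (crashNow) { crashed = true; pos = committed; continue; }
                processed.add(log.get(pos));    // then process
            } else {
                processed.add(log.get(pos));    // process first
                if (crashNow) { crashed = true; pos = committed; continue; }
                committed = pos + 1;            // then save position
            }
            pos++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        System.out.println(run(log, 1, true));  // at most once: "b" is lost
        System.out.println(run(log, 1, false)); // at least once: "b" is seen twice
    }
}
```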
Kafka Internals - Disks
Avoid:
● GC
● Random disk access
Kafka Internals - Disks Cont’d
Disks are fast ...
… when properly used
● sequential access - read ahead, write behind
● rely on operating system
  ○ avoid heap, materialization and GC
● it’s more like file copy over network
It’s easy … with immutable topics
Kafka Internals - Replication
“In Sync” Replicas
● Replication factor on partition basis
● One leader + 0..n replicas
● Replicas are consumers
  ○ “In Sync” if they are not “too far” behind a leader
  ○ Batch sync
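"Not too far behind" can be pictured as a lag threshold on offsets. In the sketch below, `ToyIsr`, `inSync` and `maxLag` are hypothetical: `maxLag` merely stands for "too far", measured in messages. The broker's actual criterion is configurable and version-dependent, so this is an illustration of the idea, not of Kafka's check.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-sync check based on message lag; the threshold is an assumption. */
final class ToyIsr {

    /** Returns the followers whose offset is at most maxLag messages behind the leader. */
    static Set<String> inSync(long leaderEndOffset, Map<String, Long> followerOffsets, long maxLag) {
        Set<String> inSync = new TreeSet<>();
        for (Map.Entry<String, Long> follower : followerOffsets.entrySet()) {
            long lag = leaderEndOffset - follower.getValue();
            if (lag <= maxLag) {
                inSync.add(follower.getKey());
            }
        }
        return inSync;
    }

    public static void main(String[] args) {
        Map<String, Long> followers = new LinkedHashMap<>();
        followers.put("broker-2", 995L);  // 5 messages behind
        followers.put("broker-3", 400L);  // 600 messages behind
        System.out.println(inSync(1000L, followers, 100L));
    }
}
```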
Kafka Internals - Replication Cont’d
Tunable Trade-Offs
● Producer’s write method:
  ○ Not blocked, async
  ○ Waits for master ACK
  ○ Waits for all in-sync replicas
● Consumer pulls only committed messages
● Server’s minimum in-sync replicas
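The producer-side and server-side knobs interact: the server's minimum only matters when the producer waits for all in-sync replicas. The sketch below encodes that reading. `ToyAckPolicy`, `acksAwaited` and `accepted` are hypothetical names, the in-sync count is assumed to include the leader, and the rule is a simplification of broker behaviour as I understand it.

```java
/** Toy model of how the ack mode and the minimum in-sync setting interact. */
final class ToyAckPolicy {

    enum Acks { NONE, LEADER, ALL_IN_SYNC }

    /** Number of acknowledgements a write waits for (inSyncCount includes the leader). */
    static int acksAwaited(Acks mode, int inSyncCount) {
        switch (mode) {
            case NONE:   return 0;           // not blocked, async
            case LEADER: return 1;           // waits for master ACK
            default:     return inSyncCount; // waits for all in-sync replicas
        }
    }

    /** A write is refused only in ALL_IN_SYNC mode when too few replicas are in sync. */
    static boolean accepted(Acks mode, int inSyncCount, int minInSync) {
        return mode != Acks.ALL_IN_SYNC || inSyncCount >= minInSync;
    }

    public static void main(String[] args) {
        // One replica has fallen out of sync: 2 of 3 remain, the minimum is 2.
        System.out.println(acksAwaited(Acks.ALL_IN_SYNC, 2));
        System.out.println(accepted(Acks.ALL_IN_SYNC, 2, 2));
        // Only the leader is left: the strict mode refuses, the others carry on.
        System.out.println(accepted(Acks.ALL_IN_SYNC, 1, 2));
        System.out.println(accepted(Acks.LEADER, 1, 2));
    }
}
```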
Performance
“Incredible”
Scales with:
● clients count, message size
● number of replicas, partitions or topics
Depends on network and disk throughput
Performance Cont’d
Our testing
● 3 nodes, master + 2 replicas
● 500 000 msg/s (100-byte messages)
● 400 Mbit/s - 1.2 Gbit/s network throughput
● end-to-end latency 2-3 ms
@see http://bit.ly/1FsIR9a
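The message rate and the throughput range are consistent with each other under one plausible reading, which the slides do not confirm: the lower bound is the payload traffic for a single copy, and the upper bound is the same traffic for three copies (master + 2 replicas). Protocol overhead is ignored.

```latex
\begin{align*}
500\,000\ \text{msg/s} \times 100\ \text{B/msg} &= 5 \times 10^{7}\ \text{B/s} \\
5 \times 10^{7}\ \text{B/s} \times 8\ \text{bit/B} &= 4 \times 10^{8}\ \text{bit/s} = 400\ \text{Mbit/s} \\
3 \times 400\ \text{Mbit/s} &= 1.2\ \text{Gbit/s}
\end{align*}
```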
Ease of Use
● No installation, just run a Java/Scala program
● Streams in files & dirs
● Transparent ZooKeeper
● Ecosystem
Cons
● Beta version
● Dependency on ZooKeeper
● The way it is written in Scala
● No easy way to remove messages
Questions?