Apache Kafka

@MartinPodval, hpsv.cz

What is Apache Kafka?

● Messaging system
● Distributed
● Persistent and replicable
● Very fast (low latency) and scalable
● Simple but highly configurable
● By LinkedIn, open sourced under apache.org

Data Streaming

New kind of data ...
● User or application data (event) streams
● Monitoring - app, system
● App logging
● High volume

Data Streaming Cont’d

… you want to process
● Using various components
● Into a target form
● Map, reduce, shuffle
● Real time or batch

HP Service Virtualization Use Cases

Processing of client message streams

Real-time performance modeling

Logs aggregation

How To Solve It?

Producers and Consumers
● Distributed
● Decoupled
● Configurable
● Dynamic

Kafka Cluster

Brokers
● = Instances, nodes
● Topics
● Partitions
● Replicas

ZooKeeper (ZK)
● Coordination

Kafka Topics

Commit Log
● Immutable
● Ordered
● Sequential offset
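The commit-log idea can be sketched as a toy in-memory model. This is not Kafka's API: the class `CommitLogSketch` and its methods are hypothetical names, and a real partition lives on disk rather than in a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of one partition as an append-only log (hypothetical, not Kafka code).
final class CommitLogSketch {
    private final List<String> records = new ArrayList<>();

    // Appends a message at the end and returns its sequential offset (0, 1, 2, ...).
    long append(String message) {
        records.add(message);
        return records.size() - 1;
    }

    // Returns up to maxCount messages starting at offset.
    // Reading never removes anything, so the same offset can be read again.
    List<String> read(long offset, int maxCount) {
        if (offset < 0 || offset > records.size() || maxCount < 0) {
            throw new IllegalArgumentException("bad offset or count");
        }
        int from = (int) offset;
        int to = (int) Math.min((long) records.size(), offset + (long) maxCount);
        return Collections.unmodifiableList(new ArrayList<>(records.subList(from, to)));
    }

    public static void main(String[] args) {
        CommitLogSketch log = new CommitLogSketch();
        log.append("first");
        log.append("second");
        System.out.println(log.read(0, 10));
    }
}
```

Existing entries are never rewritten, and a reader only needs to remember one number: the next offset to read.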

Kafka Topics Cont’d

Partitioned
Independently:
● Stored
● Produced
● Consumed

⇒ Scalable
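One common way to spread messages over partitions is to hash a message key. The sketch below only illustrates the idea: `KeyPartitionerSketch` is a hypothetical name and the hash function is an assumption, not the one a real Kafka client uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative key-hash partitioner (hypothetical; the hash choice is an assumption).
final class KeyPartitionerSketch {
    // Maps a key to a partition index in [0, numPartitions).
    // The same key always lands in the same partition, which keeps per-key ordering.
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions); // floorMod keeps the result non-negative
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("user-42", 4));
    }
}
```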

Replicated
● On a per-partition basis
● On different brokers

⇒ Fault Tolerant

What Can I Do?

producer.write(topic_id, message);

consumer.read(topic_id, offset);

I Want To Produce

● Java/Scala client
● Address of one or more brokers
● Choose a topic to produce to
● Highly configurable and tunable:
  ○ partitioner
  ○ number of acks (0 = do not wait, 1 = leader only, all = all in-sync replicas)
  ○ batching, buffer size, timeouts, retries, ...
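The tuning knobs above could be collected in a plain `java.util.Properties` object. The key names below assume the Java producer client's configuration; the values and the broker addresses are illustrative placeholders, not recommendations from the talk.

```java
import java.util.Properties;

// Builds an illustrative producer configuration (values are assumptions).
final class ProducerConfigSketch {
    static Properties producerProperties(String brokers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers); // one or more brokers to bootstrap from
        props.setProperty("acks", "all");                // wait for all in-sync replicas
        props.setProperty("retries", "3");               // retry transient send failures
        props.setProperty("batch.size", "16384");        // bytes per batch, per partition
        props.setProperty("linger.ms", "5");             // wait briefly so batches can fill
        props.setProperty("buffer.memory", "33554432");  // total bytes for buffering unsent records
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses.
        producerProperties("broker1:9092,broker2:9092").list(System.out);
    }
}
```

Lowering `acks` to `1` or `0` trades durability for latency; larger batches trade latency for throughput.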

I Want To Consume

High Level API
● Groups abstraction
  ○ To all, to one
  ○ To some
● Stream API
● Stores positions to support fault tolerance
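The groups abstraction can be pictured as dividing a topic's partitions among the members of one group. The round-robin rule and the names below are assumptions for illustration; the real client's assignment strategy may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of sharing partitions inside ONE consumer group (hypothetical names).
final class GroupAssignmentSketch {
    // Spreads partitions 0..numPartitions-1 over the group members, round-robin.
    static Map<String, List<Integer>> assign(List<String> members, int numPartitions) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("group has no members");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(members.get(p % members.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // "To one": two consumers in the same group split the partitions.
        System.out.println(assign(List.of("c1", "c2"), 4));
        // "To all": a consumer alone in its own group sees every partition.
        System.out.println(assign(List.of("c3"), 4));
    }
}
```

Each group is assigned independently, so putting every consumer in one group behaves like a queue, one group per consumer behaves like publish-subscribe, and a mix gives "to some".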

I Want To Consume Cont’d

Low Level
● Java/Scala client
● Find the leader for a topic partition
● Calculate an offset
● Fetch messages
  ○ Re-consume if needed

I Want To Consume Cont’d

Delivery semantics:
● At most once
● At least once
● Exactly once
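The difference between the first two semantics comes down to whether the consumer stores its position before or after processing a message. The simulation below is a toy under stated assumptions: one crash, a restart from the last stored position, and hypothetical names throughout.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates one consumer crash and restart (hypothetical model, not Kafka code).
final class DeliverySemanticsSketch {
    // commitFirst = true:  position is stored BEFORE processing (at most once).
    // commitFirst = false: position is stored AFTER processing (at least once).
    // The consumer crashes once, while handling the record at index crashAt.
    static List<String> consume(List<String> log, int crashAt, boolean commitFirst) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        boolean crashed = false;
        int pos = 0;
        while (pos < log.size()) {
            boolean crashNow = (pos == crashAt && !crashed);
            if (commitFirst) {
                committed = pos + 1;              // store position first
                if (crashNow) {                   // crash before processing: record is lost
                    crashed = true;
                    pos = committed;
                    continue;
                }
                processed.add(log.get(pos));
            } else {
                processed.add(log.get(pos));      // process first
                if (crashNow) {                   // crash before storing: record is replayed
                    crashed = true;
                    pos = committed;
                    continue;
                }
                committed = pos + 1;
            }
            pos++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = List.of("a", "b", "c");
        System.out.println(consume(log, 1, true));
        System.out.println(consume(log, 1, false));
    }
}
```

Exactly once needs more than either ordering, for example idempotent processing or storing the position atomically together with the result.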

Kafka Internals - Disks

Avoid:
● GC
● Random disk access

Kafka Internals - Disks Cont’d

Disks are fast ...

… when properly used
● Sequential access - read ahead, write behind
● Rely on the operating system
  ○ avoid heap, materialization and GC
● It’s more like a file copy over the network

It’s easy … with immutable topics

Kafka Internals - Replication

“In Sync” Replicas
● Replication factor on a per-partition basis
● One leader + 0..n replicas
● Replicas are consumers
  ○ “In Sync” if they are not “too far” behind the leader
  ○ Batch sync

Kafka Internals - Replication Cont’d

Tunable Trade-Offs
● Producer’s write method:
  ○ Not blocked, async
  ○ Waits for the leader’s ACK
  ○ Waits for all in-sync replicas
● Consumer pulls only committed messages
● Server’s minimum in-sync replicas
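How the producer's ack level and the server's minimum in-sync replica count interact can be sketched as a small decision function. This is a simplified model with hypothetical names; it assumes the in-sync count includes the leader and that the minimum only applies when the producer waits for all in-sync replicas.

```java
// Simplified model of when a write is accepted (hypothetical, not broker code).
final class AckPolicySketch {
    enum Acks { NONE, LEADER, ALL_IN_SYNC }

    // inSyncReplicas counts the leader itself.
    static boolean accepted(Acks acks, int inSyncReplicas, int minInSyncReplicas) {
        switch (acks) {
            case NONE:
            case LEADER:
                // Only a live leader is required; the minimum is not enforced here.
                return inSyncReplicas >= 1;
            case ALL_IN_SYNC:
                // The write is rejected if too few replicas are in sync.
                return inSyncReplicas >= minInSyncReplicas;
            default:
                throw new AssertionError(acks);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepted(Acks.ALL_IN_SYNC, 1, 2));
        System.out.println(accepted(Acks.LEADER, 1, 2));
    }
}
```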

Performance

“Incredible”

Scales with:
● Client count, message size
● Number of replicas, partitions or topics

Depends on network and disk throughput

Performance Cont’d

Our testing
● 3 nodes, leader + 2 replicas
● 500,000 msg/s (100-byte messages)
● 400 Mbit/s to 1.2 Gbit/s network throughput
● End-to-end latency 2-3 ms

@see http://bit.ly/1FsIR9a
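As a sanity check on the figures above, the payload bandwidth follows directly from message rate and size. The helper name is hypothetical, and reading the upper bound as one incoming copy plus two replica copies is an assumption, not something the measurements state.

```java
// Back-of-the-envelope bandwidth arithmetic (hypothetical helper).
final class ThroughputSketch {
    // Payload bandwidth in megabits per second (1 Mbit = 1,000,000 bits).
    static double megabitsPerSecond(long messagesPerSecond, int bytesPerMessage) {
        return messagesPerSecond * (double) bytesPerMessage * 8 / 1_000_000.0;
    }

    public static void main(String[] args) {
        double payload = megabitsPerSecond(500_000L, 100);
        System.out.println(payload);      // producer traffic alone
        System.out.println(payload * 3);  // assumed: producer traffic plus two replica copies
    }
}
```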


Ease of Use

● No installation, just run a Java/Scala program
● Streams in files & dirs
● Transparent ZooKeeper
● Ecosystem

Cons

● Beta version
● Dependency on ZooKeeper
● The way it is written in Scala
● No easy way to remove messages

Questions?
