![Page 1: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/1.jpg)
When it absolutely, positively,
has to be thereReliability Guarantees
in Apache Kafka
@jeffholoman @gwenshap
Gwen ShapiraConfluent
Jeff HolomanCloudera
![Page 2: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/2.jpg)
Kafka High Throughput Low Latency Scalable Centralized Real-time
![Page 3: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/3.jpg)
“If data is the lifeblood of high technology, Apache Kafka is the
circulatory system”
--Todd PalinoKafka SRE @ LinkedIn
![Page 4: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/4.jpg)
If Kafka is a critical piece of our pipeline Can we be 100% sure that our data will get there? Can we lose messages? How do we verify? Who’s fault is it?
![Page 5: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/5.jpg)
Distributed Systems Things Fail Systems are designed to
tolerate failure
We must expect failures and design our code and configure our systems to handle them
![Page 6: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/6.jpg)
Network
Broker MachineClient Machine
Data Flow
Kafka ClientBroker
O/S Socket Buffer
NIC
NIC
Page Cache
Disk
Application Thread
O/S Socket Buffer
async
callback
✗
✗✗
✗
✗
✗✗✗ data
ack / exception
![Page 7: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/7.jpg)
Client Machine
Kafka Client
O/S Socket Buffer
NIC
Application Thread
✗
✗✗Broker Machine
Broker
NIC
Page Cache
Disk
O/S Socket Buffer
miss
✗
✗
✗
✗Network
Data Flow
✗
data
offsets
ZKKafka✗
![Page 8: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/8.jpg)
Replication is your friend Kafka protects against failures by replicating data The unit of replication is the partition One replica is designated as the Leader Follower replicas fetch data from the leader The leader holds the list of “in-sync” replicas
![Page 9: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/9.jpg)
Replication and ISRs
0
1
2
0
1
2
0
1
2
Producer
Broker 100
Broker 101
Broker 102
Topic:Partitions
:Replicas:
my_topic33
Partition:
Leader:ISR:
1101
100,102
Partition:
Leader:ISR:
2102
101,100
Partition:
Leader:ISR:
0100
101,102
![Page 10: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/10.jpg)
ISR 2 things make a replica in-sync
- Lag behind leader- replica.lag.time.max.ms – replica that didn’t fetch or is behind - replica.lag.max.messages – will go away in 0.9
- Connection to Zookeeper
![Page 11: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/11.jpg)
Terminology Acked
- Producers will not retry sending. - Depends on producer setting
Committed- Consumers can read. - Only when message got to all ISR.
replica.lag.time.max.ms - how long can a dead replica prevent
consumers from reading?
![Page 12: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/12.jpg)
Replication Acks = all
- only waits for in-sync replicas to reply.
Replica 3100
Replica 2100
Replica 1100
Time
![Page 13: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/13.jpg)
Replication
Replica 2100
101
Replica 1100
101
Time
Replica 3 stopped replicating for some reason
Acked in acks = all
“committed”
Acked in acks = 1
but not “committed”
![Page 14: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/14.jpg)
Replication
Replica 2100
101
Replica 1100
101
Time
One replica drops out of ISR, or goes offline All messages are now acked and committed
![Page 15: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/15.jpg)
Replication
Replica 1100
101102103104Time
2nd Replica drops out, or is offline
![Page 16: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/16.jpg)
Replication
Time
Now we’re in trouble
✗
![Page 17: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/17.jpg)
Replication
Replica 3100
Replica 2100
101
Time
All those are “acked” and “committed”
![Page 18: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/18.jpg)
So what to do Disable Unclean Leader Election
- unclean.leader.election.enable = false Set replication factor
- default.replication.factor = 3 Set minimum ISRs
- min.insync.replicas = 2
![Page 19: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/19.jpg)
Warning min.insync.replicas is applied at the topic-level. Must alter the topic configuration manually if created before the server level change Must manually alter the topic < 0.9.0 (KAFKA-2114)
![Page 20: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/20.jpg)
Replication Replication = 3 Min ISR = 2
Replica 3100
Replica 2100
Replica 1100
Time
![Page 21: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/21.jpg)
Replication
Replica 2100
101
Replica 1100
101
Time
One replica drops out of ISR, or goes offline
![Page 22: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/22.jpg)
Replication
Replica 1100
101102103104
Time
2nd Replica fails out, or is out of sync
Buffers in
Producer
![Page 23: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/23.jpg)
![Page 24: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/24.jpg)
Producer Internals Producer sends batches of messages to a buffer
M3Application
ThreadApplication
ThreadApplication
Threadsend()
M2 M1 M0Batch 3Batch 2Batch 1
Fail? response
retry
Update Future
callback
drain
Metadata orException
![Page 25: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/25.jpg)
Basics Durability can be configured with the producer configuration request.required.acks- 0 The message is written to the network (buffer)- 1 The message is written to the leader- all The producer gets an ack after all ISRs receive the data; the message is
committed
Make sure producer doesn’t just throws messages away!- block.on.buffer.full = true
![Page 26: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/26.jpg)
All calls are non-blocking async 2 Options for checking for failures:
- Immediately block for response: send().get()- Do followup work in Callback, close producer after error threshold
- Be careful about buffering these failures. Future work? KAFKA-1955- Don’t forget to close the producer! producer.close() will block until in-flight txns
complete retries (producer config) defaults to 0 message.send.max.retries (server config) defaults to 3 In flight requests could lead to message re-ordering
![Page 27: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/27.jpg)
![Page 28: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/28.jpg)
Consumer Two choices for Consumer API
- Simple Consumer- High Level Consumer
![Page 29: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/29.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
![Page 30: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/30.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Commit?
![Page 31: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/31.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Commit?
![Page 32: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/32.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Auto-commit
enabled
✗Commit
![Page 33: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/33.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Auto-commit
enabled
✗
![Page 34: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/34.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Auto-commit
enabled Consumer
Picks up here
![Page 35: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/35.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Commit
![Page 36: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/36.jpg)
Consumer OffsetsP0 P2 P3 P4 P5 P6
Consumer
Thread 1 Thread 2 Thread 3 Thread 4
Commit
Offset commits
for all threads
![Page 37: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/37.jpg)
P0 P2 P3 P4 P5 P6
Consumer 1
Consumer 2
Consumer 3
Consumer 4
Consumer Offsets
Auto-commit
DISABLED
Commit
![Page 38: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/38.jpg)
Consumer Recommendations Set autocommit.enable = false Manually commit offsets after the message data is processed / persisted
consumer.commitOffsets(); Run each consumer in it’s own thread
![Page 39: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/39.jpg)
New Consumer! No Zookeeper! At all! Rebalance listener Commit:
- Commit- Commit async- Commit( offset)
Seek(offset)
![Page 40: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/40.jpg)
Exactly Once Semantics At most once is easy At least once is not bad either – commit after 100% sure data is safe Exactly once is tricky
- Commit data and offsets in one transaction- Idempotent producer
![Page 41: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/41.jpg)
Monitoring for Data Loss Monitor for producer errors – watch the retry numbers Monitor consumer lag – MaxLag or via offsets Standard schema:
- Each message should contain timestamp and originating service and host Each producer can report message counts and offsets to a special topic “Monitoring consumer” reports message counts to another special topic “Important consumers” also report message counts Reconcile the results
![Page 42: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f9a99a760da3da068b7060/html5/thumbnails/42.jpg)
Be Safe, Not Sorry Acks = all Block.on.buffer.full = true Retries = MAX_INT ( Max.inflight.requests.per.connect = 1 ) Producer.close() Replication-factor >= 3 Min.insync.replicas = 2 Unclean.leader.election = false Auto.offset.commit = false Commit after processing Monitor!