© 2016 24/7 CUSTOMER, INC.
Apache Kafka Bay Area September Meetup - 24/7 CUSTOMER, INC.
Our Kafka journey to 0.10
Engineering Manager - Big Data Platform
Suneet Grover
About [24]7
Today’s engagement is not driving successful moments
Q&A
IVR
Smart Customer Engagement
• Data-Driven: Reflecting All Available Data
• Predictive: Real-time Decisions
• Omni-channel: Across Digital & Voice
• Personalized: User Experience
Video of [24]7 in action available at http://player.vimeo.com/video/85280070
They moved from
Channel-centric engagement
• Reacting to consumer behavior
• Disconnected, fragmented channels
• Too many failed experiences
to
Intent-driven engagement
• Anticipate consumer intent
• Holistic experience across channels
• Delivering the right moments
[24]7 by the numbers
• 1.2b smart speech calls/year
• 127m virtual agent inquiries/year
• 30m agent chats/year
• 341m web visitors/month
• 5000+ digital chat agents (#1 WW)
• 70+ data scientists (most in industry)
• 100+ patents
• 300+ software engineers & designers
Agenda
• Kafka at [24]7
• Challenges and Learnings
• Transparency and Resiliency
• Upgrade path
• Configurations that worked for us
• Design for multiple data centers
• Our Kafka wish list
Kafka at [24]7
• Intent Prediction
• Data Analytics
• Business Intelligence
Our Kafka Timeline
2013: Kafka 0.7
• Broker partitions
• Less visibility
2014: Kafka 0.7 & 0.8
• Sticky partitions
• Range based MMs
• Migration procs
• Bugs
Apr 2016: Kafka 0.8.2.2
• Non-sticky partitions
• Round Robin MM
• Easier to manage
• Fewer issues
Aug 2016: Kafka 0.10.0.1
• Easy upgrade
• Better client APIs
• Stable so far
Few months ago …
Our setup
[Diagram: DC1 and DC2 each run a Kafka 0.7 cluster and a Kafka 0.8 cluster. Diagram labels: Topics X, Topics All - X, Mirroring, Migration.]
Challenges with Kafka 0.8
• Broker partition stickiness
  • Cannot move clusters
  • No elasticity
• ZK load and latencies
• Range based mirror-maker algorithm
• Stale topics deletion
Experience with Network issues
• DNS issue causing runtime issue at ZKClient
• Connectivity issues leading to controller re-elections
• Conflict errors in mirror-makers
• Socket leaks leading to open file descriptors
Experience with Kafka 0.8 and upgrade
• Mismatch in Kafka vs zookeeper state
  • Producers could see certain partitions but consumers couldn’t
  • We added the same partitions back to the cluster
• Leader-Replica-ISR mismatch
  • We did the controller broker restart
• Broker not allowed into cluster
  • Controller task queue went into invalid state - KAFKA-2300
• Repeated Kafka controller switching
  • Data Loss due to fewer replicas
Learnings
• **It works to delete the “/controller” node from zookeeper
• Always do clean shutdown and restart of brokers
• Some issues are not always visible as errors or warnings
• Run ZK on SSD
Upgrade Path
Path we took
[Diagram: upgrade options starting from Kafka 0.8. Diagram labels: In-place 0.8.2.2, New cluster 0.8.2.2, In-place 0.9, New cluster 0.9.]
Our Upgrade to 0.8.2.2
• Shut down the 0.7 pipeline
• Tried in-place upgrade from 0.8.0 to 0.8.2.2
• Succeeded by moving to a separate 0.8.2.2 cluster
• Added a lot more monitors for resiliency
Upgrade to Kafka 0.10.0.1
• Separated mirror makers from brokers
• Only the brokers upgraded to 0.10.0.1
• In-place upgrade worked very well
• Found an issue with the mirror-maker 0.10.0.1
• Yet to change the message format, upgrade clients etc.
Configurations that worked for us
Broker configurations
• default.replication.factor = 3
• num.partitions = 2
• delete.topic.enable = true
• auto.leader.rebalance.enable = true
• controlled.shutdown.enable = true
• queued.max.requests = 1000
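One way to keep a broker fleet aligned with the values above is to diff each broker's properties file against them. The following is a minimal sketch, not something shown in the talk: the helper names `parse_properties` and `config_drift` are hypothetical, and the parser only handles plain `key=value` lines.

```python
# Values copied from the "Broker configurations" slide.
RECOMMENDED = {
    "default.replication.factor": "3",
    "num.partitions": "2",
    "delete.topic.enable": "true",
    "auto.leader.rebalance.enable": "true",
    "controlled.shutdown.enable": "true",
    "queued.max.requests": "1000",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def config_drift(text, recommended=RECOMMENDED):
    """Return {key: (actual, expected)} for keys that differ or are missing."""
    actual = parse_properties(text)
    return {
        key: (actual.get(key), expected)
        for key, expected in recommended.items()
        if actual.get(key) != expected
    }
```

Calling `config_drift(open(path).read())` for each broker's properties file returns an empty dict when the broker matches the slide; a missing key shows up with an actual value of `None`.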
Upgrade specific configurations
• inter.broker.protocol.version = 0.10.0.1
• message.format.version = 0.8.2.2
Transparency and Resiliency
Metrics flow
[Diagram: components shown are Kafka Broker, Metrics Reporter, Kafka MM, JMXTrans, Zookeeper, Graphite, Grafana, Host level Metrics & Alerts, Lag monitor, ELK.]
Essential Broker Metrics
• Disk, CPU and throughput utilization
• Ingress, egress volume per broker and topic
• Active controller count
• Offline partitions
• Under replicated partitions
• Partitions per broker
• Log flush rate
Basic Alerts
• Disk, CPU utilization
• Open file handles
• Controller count
• Controller re-elections
• Under replicated partitions
• Offline partitions
• Stuck pending commands in zookeeper
• Conflicts in mirror-makers
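A few of these alerts reduce to simple rules over a per-broker metrics snapshot. The sketch below is illustrative only. It assumes the snapshot is a dict keyed by broker id and uses the broker JMX metric names `ActiveControllerCount`, `OfflinePartitionsCount` and `UnderReplicatedPartitions`; the `open_fd_ratio` field and its 0.8 threshold are hypothetical host-level inputs, not values from the talk.

```python
def evaluate_alerts(brokers, fd_ratio_limit=0.8):
    """Return alert strings for a snapshot shaped as {broker_id: {metric: value}}."""
    alerts = []

    # Exactly one broker in the cluster should report itself as controller.
    controllers = sum(m.get("ActiveControllerCount", 0) for m in brokers.values())
    if controllers != 1:
        alerts.append("controller count is %d, expected exactly 1" % controllers)

    for broker_id, m in sorted(brokers.items()):
        if m.get("OfflinePartitionsCount", 0) > 0:
            alerts.append("broker %s reports offline partitions" % broker_id)
        if m.get("UnderReplicatedPartitions", 0) > 0:
            alerts.append("broker %s has under replicated partitions" % broker_id)
        # open_fd_ratio is a hypothetical host metric: open handles / ulimit.
        if m.get("open_fd_ratio", 0.0) > fd_ratio_limit:
            alerts.append("broker %s is close to its open file handle limit" % broker_id)
    return alerts
```

Summing the controller count across brokers catches both failure modes with one rule: zero means no controller was elected, more than one means a split view of the cluster.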
JMXTrans
• Push mirror-maker metrics to graphite
  • Throughput per topic, per thread, per instance etc.
  • WaitOnTake, WaitOnPut
• Push zookeeper metrics to graphite
  • Latency, quorum, connections etc.
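For readers unfamiliar with what "pushing to graphite" involves on the wire: Graphite's plaintext listener accepts one `path value timestamp` line per metric, by default on TCP port 2003. The sketch below illustrates that format in Python; it is not JMXTrans itself, and the metric prefix and host name are made-up examples.

```python
import socket
import time


def graphite_lines(prefix, metrics, timestamp=None):
    """Format metrics as Graphite plaintext lines: '<path> <value> <epoch seconds>'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return [
        "%s.%s %s %d" % (prefix, name, value, ts)
        for name, value in sorted(metrics.items())
    ]


def send_to_graphite(lines, host, port=2003):
    """Send already formatted lines to a Graphite plaintext listener over TCP."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

For example, `send_to_graphite(graphite_lines("kafka.mm.dc1", {"WaitOnTake": 12, "WaitOnPut": 3}), "graphite.example.internal")` would publish two series under the hypothetical `kafka.mm.dc1` prefix.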
Data Lag Monitoring
• Measures the event level time delay
• Monitors data latencies per cluster, per topic, per partition
• Latencies between multiple steps in Kafka pipeline
• Optimize and configure sampling ratio
• Supports multiple message formats json, avro etc.
• Alerts based on pre-defined thresholds
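The talk does not show the lag monitor's implementation. As a rough sketch of the idea, event level lag can be computed as the observation time minus a timestamp carried inside the message, sampled at a configurable ratio and tracked per topic and partition. Everything named here is an assumption: the `LagMonitor` class, the `eventTime` field (epoch milliseconds) and the JSON-only decoding.

```python
import json
import random


def event_lag_ms(raw_message, observed_at_ms, ts_field="eventTime"):
    """Lag of one JSON event: observation time minus the embedded event time."""
    event = json.loads(raw_message)
    return observed_at_ms - int(event[ts_field])


class LagMonitor:
    def __init__(self, threshold_ms, sample_ratio=1.0, rng=None):
        self.threshold_ms = threshold_ms
        self.sample_ratio = sample_ratio
        self.rng = rng or random.Random()
        self.max_lag = {}  # (topic, partition) -> worst lag seen, in ms

    def observe(self, topic, partition, raw_message, observed_at_ms):
        """Record the lag of a message if it falls inside the sample."""
        if self.rng.random() >= self.sample_ratio:
            return  # skipped by sampling
        lag = event_lag_ms(raw_message, observed_at_ms)
        key = (topic, partition)
        self.max_lag[key] = max(self.max_lag.get(key, 0), lag)

    def breaches(self):
        """Topic-partitions whose worst observed lag exceeds the threshold."""
        return sorted(k for k, lag in self.max_lag.items() if lag > self.threshold_ms)
```

Running one monitor per pipeline step (for example before and after mirroring) and comparing their readings gives the step-to-step latency mentioned above. Lowering `sample_ratio` trades accuracy for less decoding work.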
Indicative Broker Metrics
• Request Metrics
  • Local Time
  • Remote Time
  • Queue Time
• Request Handler Idle Percent
• Network Processor Idle Percent
Now some demo
Design for Multiple Data Centers
Range Based Mirror Makers
[Chart: Skewed Partition Assignment. Num Partitions per consumer (Consumer 1 to Consumer 4) on a log scale from 1 to 1000; data labels: 1000, 181, 14, 5.]
Round Robin Mirror Makers
[Chart: Uniform Partition Assignment. Num Partitions per consumer (Consumer 1 to Consumer 4) on a linear scale from 0 to 350.]
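The difference between the two charts can be reproduced with a small simulation. This is a simplified model of the two strategies, not Kafka's assignor code: range assignment splits each topic's partitions separately, so the first consumers collect the remainder of every topic, while round robin deals all topic-partitions out in turn. The topic names and counts are invented for illustration.

```python
from itertools import cycle


def range_assign(topics, consumers):
    """Per topic, give each consumer a contiguous share; earlier consumers get the remainder."""
    counts = {c: 0 for c in consumers}
    for num_partitions in topics.values():
        per_consumer, extra = divmod(num_partitions, len(consumers))
        for i, consumer in enumerate(consumers):
            counts[consumer] += per_consumer + (1 if i < extra else 0)
    return counts


def round_robin_assign(topics, consumers):
    """Deal every topic-partition out to the consumers in turn."""
    counts = {c: 0 for c in consumers}
    ring = cycle(consumers)
    for topic in sorted(topics):
        for _ in range(topics[topic]):
            counts[next(ring)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical workload: many single-partition topics plus one larger topic.
    topics = {"t%03d" % i: 1 for i in range(100)}
    topics["big"] = 6
    consumers = ["c1", "c2", "c3", "c4"]
    print("range      ", range_assign(topics, consumers))
    print("round robin", round_robin_assign(topics, consumers))
```

With many topics that have fewer partitions than there are consumers, the range model piles almost everything onto the first consumer, which matches the skewed shape of the first chart.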
Mirror-maker fine tuning
• Round Robin works better than Range based in most cases
• Spread out the topics in multiple MM consumer groups
  • If you have a few large volume topics
• Negative regex works with whitelist parameter
• Doesn’t help to have too many MM consumer threads
• Tune socket buffer size (doesn’t apply unless OS allows)
  • MM - socket.receive.buffer.bytes = 1048576
  • Broker - socket.send.buffer.bytes = 1048576
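The "negative regex" point most likely refers to a negative lookahead, which lets a whitelist pattern exclude topics. The sketch below tests such a pattern offline using Python's `re` module, whose lookahead syntax matches Java's for this case. The pattern and topic names are hypothetical, and the sketch assumes the pattern is matched against the whole topic name; verify against your mirror-maker version before relying on it.

```python
import re

# Hypothetical whitelist: mirror everything except topics starting with test_ or tmp_.
WHITELIST = r"(?!test_|tmp_).*"


def mirrored(topics, pattern=WHITELIST):
    """Return the topics a whole-name match against the pattern would select."""
    rx = re.compile(pattern)
    return [t for t in topics if rx.fullmatch(t)]


if __name__ == "__main__":
    print(mirrored(["clicks", "test_clicks", "tmp_x", "orders"]))
```

Checking a pattern this way before deploying it avoids accidentally mirroring nothing, or everything, across data centers.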
Kafka to our components is like arteries to a body
• Critical to our data pipeline
• Carries data reliably across DCs
• Easy to manage and operate
• Good monitoring capabilities
Our Kafka wish list
It would be great to have
• Partition assignment based on volume in brokers and MMs
• Blacklisting and whitelisting capabilities in mirror-makers
• Rolling restarts of the brokers
• Auto cleaning stale topics and partitions
• Catching uneven topics with skewed data spread – bad producers
Q & A