Streaming Kafka Search Utility for Mozilla's Bagheera
TRANSCRIPT
[Slide 1]
Streaming Kafka Search Utility for Bagheera
Varunkumar Manohar, Metrics Engineering Intern, Summer 2013. San Francisco Commons, 20th August.
[Slide 2]
- Apache Kafka
- Why use Kafka?
- Mozilla's Bagheera System
- Search Utility
- Practical Usage
- Other Projects
[Slide 3]
Apache Kafka

A high-throughput distributed messaging system. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
[Slide 4]
Centralized data pipeline

Producer 1, Producer 2, Producer 3 → centralized persistent data pipeline (Apache Kafka) → Consumer 1, Consumer 2, Consumer 3

- Since it is persistent, consumers can lag behind.
- Producers and consumers do not know each other.
- Consumer maintenance is easy.
[Slide 5]
High Throughput

- Partitioning of data allows production, consumption, and brokering to be handled by clusters of machines; scaling horizontally is easy.
- Messages are batched and sent in large chunks at once.
- The filesystem page cache is used, delaying the flush to disk.
- Shrinkage of data (compression).
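The batching point above can be sketched in a few lines. This is an illustrative helper only; `Batcher` and `batchSize` are hypothetical names, not part of the Kafka API, but the idea is the same: group records into fixed-size chunks so each network request carries many messages at once.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of message batching: instead of sending one record per request,
// group records into fixed-size chunks and send each chunk at once.
// `Batcher` is an illustrative name, not a Kafka class.
public class Batcher {
    public static <T> List<List<T>> batch(List<T> messages, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < messages.size(); i += batchSize) {
            // Copy each sublist so the chunk is independent of the source list.
            batches.add(new ArrayList<>(
                messages.subList(i, Math.min(i + batchSize, messages.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(batch(List.of(1, 2, 3, 4, 5), 2)); // [[1, 2], [3, 4], [5]]
    }
}
```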
[Slide 6]
Metrics

Diagram: real-time data flow for the Metrics topic (in production). Four Kafka nodes (KafkaNode 1 to KafkaNode 4), each holding partitions 1 through 4 of the Kafka commit log.
[Slide 7]
Each partition = a commit log. At offset 0 we have message_37, which can be, for example, a JSON document.
[Slide 8]
Underlying principle

- Use a persistent log as a messaging system.
- This parallels the concept of a commit log.
- An append-only commit log keeps track of the incoming messages.
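The principle can be sketched as a tiny in-memory log. This is illustrative only (real Kafka partitions are persisted segment files on disk), but it shows why consumers can lag behind: a slow consumer just reads from an older offset and nothing is lost.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory sketch of an append-only commit log: each message
// gets a monotonically increasing offset, and consumers read from any
// offset they choose. `CommitLog` is a hypothetical illustration.
public class CommitLog {
    private final List<String> entries = new ArrayList<>();

    // Append a message and return the offset assigned to it.
    public long append(String message) {
        entries.add(message);
        return entries.size() - 1;
    }

    // Read all messages from a given offset onward; a lagging consumer
    // simply starts from an earlier offset.
    public List<String> readFrom(long offset) {
        return new ArrayList<>(entries.subList((int) offset, entries.size()));
    }
}
```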
[Slide 9]
Mozilla’s Bagheera System
[Slide 10]
Some real-time numbers! (per minute)

Chart: message rate per minute, peaking at 8.7K messages per minute on week 31.
[Slide 11]
Some questions!

- Can we be more granular in finding out the counts?
- Can I get the count of messages that were pushed 3 days back?
- Can I get the count of messages between Sunday and Tuesday?
- Can I get the total messages that came in 3 days back and belong to updatechannel='release'?
- Can I get the count of messages that came in from the UK two days ago?
[Slide 12]
We could get into Hadoop or HBase and scan the data. But Hadoop/HBase is a massive data store, and crunching that much data in real time is not at all efficient.

Can we instead search the Kafka queue, which retains a fair amount of data as per its retention policy? Yes! You can query only the data retained in the Kafka logs, and our queries typically fall within those bounds.
[Slide 13]
Yes! We can do this more efficiently by using the Kafka offsets and the data associated with each offset.

- The data we store has a timestamp (the time of insertion into the queue); check the timestamp to know whether a message fits our filter conditions.
- We can selectively export the data we have retrieved.
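The timestamp check above amounts to a time-window filter over the retained messages. A minimal sketch, assuming a hypothetical `TimedMessage` type carrying the insertion time:

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the filter step: every stored message carries its insertion
// timestamp, so "messages pushed 3 days back" is a time-window check.
// `TimedMessage` and `TimeFilter` are illustrative names.
public class TimeFilter {
    public record TimedMessage(Instant insertedAt, String payload) {}

    // Keep only messages whose insertion time falls in [from, to).
    public static List<TimedMessage> between(List<TimedMessage> msgs,
                                             Instant from, Instant to) {
        return msgs.stream()
                   .filter(m -> !m.insertedAt().isBefore(from)
                             && m.insertedAt().isBefore(to))
                   .collect(Collectors.toList());
    }
}
```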
[Slide 14]
Concurrent execution across partitions:

```java
for (int i = 0; i < totalPartitions; i++) {
    // Create a callable and submit it to the executor to trigger an execution.
    Callable<Long> callable = new KafkaQueueOps(
        brokerNodes.get(brokerIndex), topics.get(topicIndex), i, noDays);
    final ListenableFuture<Long> future = pool.submit(callable);
    computationResults.add(future);
}
ListenableFuture<List<Long>> successfulResults =
    Futures.successfulAsList(computationResults);
```
[Slide 15]
```java
// sparseLst is a sparse list of offsets only (log-segment boundaries).
long[] sparseLst = consumer.getOffsetsBefore(topicName, partitionNumber,
                                             -1, Integer.MAX_VALUE);
long checkpoint = -1;
for (int i = 1; i < sparseLst.length; i++) {
    // Fetch the message at sparseLst[i] and de-serialize it with Google
    // protocol buffers; timestampAt(...) stands in for that fetch +
    // de-serialization step, yielding the message's insertion time.
    if (timestampAt(sparseLst[i]) <= timeRange) {
        checkpoint = sparseLst[i];
        break;
    }
}
// Start fetching the data from the checkpoint, skipping through every
// offset until the precise offset value is obtained.
```
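The two phases above (coarse checkpoint from the sparse segment offsets, then a message-by-message scan for the precise offset) can be sketched over in-memory arrays. All names here are illustrative, not the Kafka API:

```java
// Sketch of the two-phase offset lookup used by the search utility.
// `OffsetSearch`, and the arrays standing in for segments and messages,
// are hypothetical illustrations of the technique.
public class OffsetSearch {
    // Phase 1: sparse offsets are ordered newest-first (as with the old
    // getOffsetsBefore-style call); pick the first segment whose first
    // message is old enough, i.e. timestamp <= targetTime.
    public static long checkpoint(long[] sparseOffsets,
                                  long[] segmentTimestamps, long targetTime) {
        for (int i = 0; i < sparseOffsets.length; i++) {
            if (segmentTimestamps[i] <= targetTime) {
                return sparseOffsets[i];
            }
        }
        // Everything is newer than the target: fall back to the oldest segment.
        return sparseOffsets.length > 0 ? sparseOffsets[sparseOffsets.length - 1] : 0;
    }

    // Phase 2: linear scan from the checkpoint for the first message
    // at or after the target time; its offset is the precise answer.
    public static long preciseOffset(long checkpoint,
                                     long[] messageTimestamps, long targetTime) {
        for (long off = checkpoint; off < messageTimestamps.length; off++) {
            if (messageTimestamps[(int) off] >= targetTime) {
                return off;
            }
        }
        return messageTimestamps.length;
    }
}
```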
[Slide 16]
State of consumers in Zookeeper
Kafka Broker Node
Kafka Broker Node
Kafka Broker Node
Zookeeper
/consumers/group1/offsets/topic1/0-2 : 119914
/consumers/group1/offsets/topic1/0-1 : 127994
/consumers/group1/offsets/topic1/0-0 : 130760
[Slide 17]
Consumers read the state of their consumption from ZooKeeper.

What if we could change the offset values to something we want them to be? We could go back in time and gracefully make the consumer start reading from that point: we are setting the seek cursor on a distributed log so that the consumers can read from there.
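The seek-cursor idea can be sketched with a map standing in for the ZooKeeper znodes shown on the previous slide: overwriting a stored offset makes the consumer resume from the new position on its next fetch. `OffsetStore` is a hypothetical illustration, not the ZooKeeper client API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of offset rewinding: ZooKeeper keeps one znode per
// consumer-group/topic/partition holding the last consumed offset.
// The map below stands in for ZooKeeper storage.
public class OffsetStore {
    private final Map<String, Long> znodes = new HashMap<>();

    // Normal consumption: commit the latest consumed offset.
    public void commit(String path, long offset) { znodes.put(path, offset); }

    // "Go back in time": set the seek cursor to an earlier offset.
    public void rewind(String path, long offset) { znodes.put(path, offset); }

    // The consumer's next fetch starts from the stored offset.
    public long nextFetchOffset(String path) { return znodes.getOrDefault(path, 0L); }
}
```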
[Slide 18]
Do Not Track Dashboard
[Slide 19]
Hive data processing for the DNT dashboards

JDBC application → Thrift service → Driver (compiler, executor) → Metastore
[Slide 20]
- Threads execute several Hive queries, which in turn start MapReduce jobs.
- The processed data is converted into JSON.
- All the older JSON records and the newly processed JSON records are merged suitably.
- The JSON data is used by web APIs for data binding.
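The merge step can be sketched with records keyed by (date, region): the fresh Hive output overwrites or extends the previously generated records, and a sorted map keeps the result ordered. This is an illustrative stand-in; the real pipeline merges JSON documents, and `DashboardMerge` is a hypothetical name.

```java
import java.util.TreeMap;

// Sketch of merging older dashboard records with newly processed ones.
// Keys encode (date, region); values stand in for the JSON payloads.
public class DashboardMerge {
    public static TreeMap<String, String> merge(TreeMap<String, String> existing,
                                                TreeMap<String, String> fresh) {
        TreeMap<String, String> merged = new TreeMap<>(existing);
        merged.putAll(fresh); // newly processed records win on key collisions
        return merged;        // TreeMap keeps keys sorted, i.e. merge & sort
    }
}
```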
[Slide 21]
Sample Hive output rows (date, country code, ratio values):

2013-04-01  AR  0.11265908876536693  0.12200304892132859
2013-04-01  AS  0.159090909090…

JSON conversion, combined with the existing JSON data, then merge & sort.
[Slide 22]
Thank you !
Daniel Einspanjer, Anurag, Harsha, Mark Reid