Using Druid for interactive count-distinct queries at scale @ NMC
Yakir Buskilla + Itai Yaffe, Nielsen
USING DRUID FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE
Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)
NMC high-level architecture
The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements
The need
Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices: 1.25 GB/day
○ 10B devices * 80K attributes: 100 TB/day
● Approximate
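The bit-vector sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bit per device, or per device and attribute pair, per day. The function name is made up for illustration.

```python
# Back-of-the-envelope check of the bit-vector sizes on the slide.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).

DEVICES = 10_000_000_000   # 10B devices
ATTRIBUTES = 80_000        # 80K attributes
BITS_PER_BYTE = 8


def bit_vector_bytes(devices: int, attributes: int = 1) -> float:
    """Bytes needed to store one bit per (device, attribute) pair."""
    return devices * attributes / BITS_PER_BYTE


single_vector_gb = bit_vector_bytes(DEVICES) / 10**9
all_attributes_tb = bit_vector_bytes(DEVICES, ATTRIBUTES) / 10**12

print(f"one bit per device: {single_vector_gb:.2f} GB/day")
print(f"one bit per device per attribute: {all_attributes_tb:.0f} TB/day")
```

A single vector is manageable, but multiplying it by every attribute is what pushes the daily volume out of reach and motivates the approximate approach.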
Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data, 10 hours
■ Affects query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index
What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)
ThetaSketch
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
[Figures: sets X and Y]
● ThetaSketch mathematical framework - generalization of KMV
KMV intuition
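The KMV idea can be sketched in a few lines: hash every item to a roughly uniform value between 0 and 1, keep only the k smallest distinct hashes, and estimate the cardinality as (k - 1) divided by the k-th smallest hash. The code below is a toy illustration under those assumptions, not the DataSketches implementation. All function names are hypothetical, and for brevity it hashes the whole input before keeping the k smallest values, where a real sketch would keep only k values while streaming.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a roughly uniform float in the unit interval (64-bit hash)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(items, k: int) -> list:
    """Return the k smallest distinct hash values, sorted ascending."""
    # Simplification: hash everything first, then keep the k smallest.
    return heapq.nsmallest(k, {hash01(x) for x in items})


def kmv_estimate(sketch: list, k: int) -> float:
    """Estimate the number of distinct items behind a KMV sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct hashes: just count them
    return (k - 1) / sketch[-1]    # sketch[-1] is the k-th smallest hash


def kmv_union(a: list, b: list, k: int) -> list:
    """Sketch of the union: the k smallest hashes seen in either sketch."""
    return heapq.nsmallest(k, set(a) | set(b))


def kmv_intersection_estimate(a: list, b: list, k: int) -> float:
    """Estimate |X and Y| as (Jaccard estimate) * (union estimate)."""
    union = kmv_union(a, b, k)
    if not union:
        return 0.0
    in_both = set(a) & set(b)
    jaccard = sum(1 for h in union if h in in_both) / len(union)
    return jaccard * kmv_estimate(union, k)


K = 4096
x = kmv_sketch((f"device-{i}" for i in range(60_000)), K)
y = kmv_sketch((f"device-{i}" for i in range(40_000, 100_000)), K)

print("X:", round(kmv_estimate(x, K)))                        # true value: 60,000
print("X union Y:", round(kmv_estimate(kmv_union(x, y, K), K)))  # true value: 100,000
print("X intersect Y:", round(kmv_intersection_estimate(x, y, K)))  # true value: 20,000
```

Repeated items hash to the same value, so duplicates in the stream never change the sketch. The same property is what makes union and intersection work on the sketches alone, without going back to the raw data.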
ThetaSketch error
Number of Std Dev      1        2
Confidence Interval    68.27%   95.45%
16,384                 0.78%    1.56%
32,768                 0.55%    1.10%
65,536                 0.39%    0.78%
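The error figures above are consistent with a relative standard error of roughly 1/sqrt(k). That reading assumes the first column of the table is the sketch size k (number of retained entries), and the formula is an approximation used here for illustration rather than something the slide states. A short check:

```python
import math


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error of a sketch with k entries (assumed 1/sqrt(k))."""
    return 1.0 / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    # One standard deviation covers ~68.27% of estimates, two cover ~95.45%.
    print(f"{k:>6,}  1 std dev: {rse:.2%}  2 std dev: {2 * rse:.2%}")
```

Under this approximation, quadrupling the sketch size halves the error, which is why the 65,536 row shows half the error of the 16,384 row.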
DRUID
“Very fast highly scalable columnar data-store”
Roll-up
Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02
ThetaSketchAggregator
Timestamp    Attribute   Count Distinct
2016-11-15   11111       2
2016-11-15   22222       2
2016-11-15   33333       1
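The roll-up above can be mimicked with plain Python: group the raw rows by (timestamp, attribute) and keep one aggregate per group. The sketch below uses an exact set of device IDs as a stand-in for the sketch that Druid would store at ingestion time, and it treats the attribute of the last raw row as 33333, matching the rolled-up table. The function name is hypothetical.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, attribute, device ID).
rows = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(events):
    """Collapse raw events into one distinct-device count per (timestamp, attribute)."""
    devices = defaultdict(set)  # exact set here; Druid would keep a sketch instead
    for timestamp, attribute, device_id in events:
        devices[(timestamp, attribute)].add(device_id)
    return {key: len(ids) for key, ids in devices.items()}


for (timestamp, attribute), count in sorted(roll_up(rows).items()):
    print(timestamp, attribute, count)
```

Five raw rows become three stored rows. With a sketch in place of the set, the size of each stored row stays bounded no matter how many devices it summarizes.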
Druid architecture
How do we use Druid
Guidelines and pitfalls
● Setup is not easy
● Monitoring your system
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases
Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
2016-11-15   Porsche Intent   XXXXXX
...          ...              ...
Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
...          ...              ...      ...
● Query optimization
○ Combine multiple queries into single query
○ Use filters
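One way to combine several count-distinct questions into a single request is to wrap each sketch aggregation in a filtered aggregator and intersect the results in a post-aggregation. The sketch below builds such a native Druid query as a Python dictionary. The field layout follows the Druid DataSketches extension as far as I recall it, so it is worth checking against the documentation for your Druid version. The datasource name, the `attribute` dimension, the `device_sketch` metric, and the helper function are all hypothetical.

```python
import json


def theta_for(name: str, dimension: str, value: str) -> dict:
    """Theta sketch aggregation restricted to rows where dimension == value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        "aggregator": {
            "type": "thetaSketch",
            "name": name,
            "fieldName": "device_sketch",  # hypothetical sketch metric
        },
    }


query = {
    "queryType": "timeseries",
    "dataSource": "devices_by_attribute",  # hypothetical datasource
    "granularity": "all",
    "intervals": ["2016-11-15/2016-11-16"],
    # Top-level filter: only scan rows for the attributes we care about.
    "filter": {"type": "in", "dimension": "attribute", "values": ["11111", "22222"]},
    # One request computes both per-attribute sketches...
    "aggregations": [
        theta_for("attr_11111", "attribute", "11111"),
        theta_for("attr_22222", "attribute", "22222"),
    ],
    # ...and their intersection, instead of issuing separate queries.
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_in_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "devices_in_both_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "attr_11111"},
                    {"type": "fieldAccess", "fieldName": "attr_22222"},
                ],
            },
        }
    ],
}

print(json.dumps(query, indent=2))
```

The printed JSON is what would be posted to the Druid broker. The top-level filter keeps the scan narrow, while the filtered aggregators let one pass over the data answer both per-attribute questions and the intersection.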
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced storage used by 10x
● Community
Summary
DRUID           ES
10TB/day        250GB/day
4 Hours/day     10 Hours/day
15GB/day        2.5TB (total)
280ms-350ms     500ms-6000ms
$55K/month      $80K/month
THANK YOU!