Using Druid for Interactive Count-Distinct Queries at Scale @ NMC

Yakir Buskilla + Itai Yaffe, Nielsen

USING DRUID FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE

Introduction

Yakir Buskilla

● Software Architect

● Focusing on Big Data and Machine Learning problems

Itai Yaffe

● Big Data Infrastructure Developer

● Dealing with Big Data challenges for the last 5 years

Nielsen Marketing Cloud (NMC)

● eXelate was acquired by Nielsen 2 years ago

● A leader in the Ad Tech and Marketing Tech industry

● What do we do?

○ Data as a Service (DaaS)

○ Software as a Service (SaaS)

NMC high-level architecture

The need

● Nielsen Marketing Cloud business question

○ How many unique devices have we encountered:

■ over a given date range

■ for a given set of attributes (segments, regions, etc.)

● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements

The need

● Store everything

● Store only 1 bit per device

○ 10B devices: 1.25 GB/day

○ 10B devices * 80K attributes: 100 TB/day

● Approximate
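The bit-vector figures follow from simple arithmetic, which the short sketch below reproduces. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bit vector per attribute per day; the constant and function names are illustrative only.

```python
# Back-of-the-envelope check of the bit-vector storage estimate.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).

DEVICES = 10_000_000_000   # 10B devices, one bit each
ATTRIBUTES = 80_000        # 80K attributes, one bit vector per attribute


def bit_vector_bytes(num_devices: int) -> float:
    """Bytes needed to store one bit per device."""
    return num_devices / 8


per_day_gb = bit_vector_bytes(DEVICES) / 10**9
per_day_all_attributes_tb = bit_vector_bytes(DEVICES) * ATTRIBUTES / 10**12

print(f"one bit vector per day:           {per_day_gb} GB")
print(f"one bit vector per attribute/day: {per_day_all_attributes_tb} TB")
```

Even at one bit per device, keeping a vector for every attribute is what pushes the daily volume into the 100 TB range, which is the motivation for the approximate approach.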

Possible solutions

Naive

Bit Vector

Approx.

Our journey

● Elasticsearch

○ Indexing data

■ 250 GB of daily data, 10 hours

■ Affects query time

○ Querying

■ Low concurrency

■ Scans on all the shards of the corresponding index

What we tried

● Preprocessing

● Statistical algorithms (e.g., HyperLogLog)

● K Minimum Values (KMV)

● Estimate set cardinality

● Supports set-theoretic operations

● ThetaSketch mathematical framework - generalization of KMV

ThetaSketch

KMV intuition
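The KMV idea can be sketched in a few lines: hash every element to a roughly uniform value in the unit interval, keep only the k smallest distinct hashes, and estimate the cardinality from the k-th smallest one. The code below is a minimal illustration and not the ThetaSketch implementation. The use of MD5, the class and function names, and the (k - 1) / (k-th minimum) estimator are choices made for this example. Only union is shown, out of the set-theoretic operations mentioned above.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Hash an item to a roughly uniform float in the unit interval (MD5 is illustrative)."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []      # negated hashes, so -_heap[0] is the largest kept value
        self._kept = set()   # the same values, used to detect repeats

    def add(self, item: str) -> None:
        self.add_hash(hash01(item))

    def add_hash(self, h: float) -> None:
        if h in self._kept:
            return                                   # repeated element: no effect
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # New value is among the k smallest: evict the current largest.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))            # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


def union(a: KMVSketch, b: KMVSketch) -> KMVSketch:
    """Sketch of the union: the k smallest hashes across both sketches."""
    merged = KMVSketch(min(a.k, b.k))
    for h in a._kept | b._kept:
        merged.add_hash(h)
    return merged
```

A possible usage: feed each device ID into `add`, call `estimate()` for the approximate distinct count, and combine per-day or per-attribute sketches with `union` instead of rescanning the raw data.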

Number of Std Dev 1 2

Confidence Interval 68.27% 95.45%

16,384 0.78% 1.56%

32,768 0.55% 1.10%

65,536 0.39% 0.78%

ThetaSketch error
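The values in the error table are consistent with a relative standard error of roughly 1/sqrt(k), where k is the sketch size in the first column, doubled for two standard deviations. The snippet below recomputes the table under that assumption. It is an approximation for illustration, and the exact bounds reported by a sketch library may differ slightly.

```python
import math


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error of a k-entry sketch: 1 / sqrt(k)."""
    return 1.0 / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    print(f"{k:>6,}  1 std dev: {rse:.2%}  2 std dev: {2 * rse:.2%}")
```

The practical consequence is that quadrupling the sketch size only halves the error, so sketch size is a direct trade-off between accuracy and storage.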

“Very fast highly scalable columnar data-store”

DRUID

Roll-up

ThetaSketchAggregator

Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02

Timestamp    Attribute   Count Distinct
2016-11-15   11111       2
2016-11-15   22222       2
2016-11-15   33333       1
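The roll-up in this example can be mimicked in plain Python: group the raw rows by (timestamp, attribute) and collapse the device IDs of each group into a single aggregate. The sketch below uses exact Python sets as a stand-in for the ThetaSketch aggregate, so it shows only the grouping logic and not how Druid stores the data. The last attribute is taken as 33333, as in the rolled-up rows.

```python
from collections import defaultdict

# Raw events: (timestamp, attribute, device ID), as in the example rows.
RAW_EVENTS = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(events):
    """Group by (timestamp, attribute) and count distinct device IDs per group."""
    groups = defaultdict(set)   # an exact set stands in for a sketch here
    for timestamp, attribute, device_id in events:
        groups[(timestamp, attribute)].add(device_id)
    return {key: len(devices) for key, devices in sorted(groups.items())}


for (timestamp, attribute), count in roll_up(RAW_EVENTS).items():
    print(timestamp, attribute, count)
```

With a sketch in place of the set, each rolled-up row has a bounded size no matter how many devices it summarizes, which is what makes the stored data so much smaller than the raw input.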

Druid architecture

How do we use Druid

Guidelines and pitfalls

● Setup is not easy


● Monitoring your system


● Data modeling

○ Reduce the number of intersections

○ Different datasources for different use cases

Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
2016-11-15   Porsche Intent   XXXXXX
...          ...              ...

Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
...          ...              ...      ...


● Query optimization

○ Combine multiple queries into single query

○ Use filters
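As one way to picture these two guidelines, the sketch below builds a single Druid native query that computes several per-attribute distinct counts at once using filtered aggregators, plus their intersection as a post-aggregation. The overall shape follows Druid's native JSON queries with the DataSketches extension as generally documented. The datasource name `device_attributes`, the dimension `attribute`, the metric `device_sketch`, and the helper name are hypothetical, and the field names should be checked against the documentation for the Druid version in use.

```python
import json


def build_count_distinct_query(datasource, interval, attributes,
                               sketch_metric="device_sketch"):
    """Build one timeseries query covering several attributes (names are hypothetical)."""
    # One filtered thetaSketch aggregator per attribute, instead of one query each.
    aggregations = [
        {
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "attribute", "value": attr},
            "aggregator": {
                "type": "thetaSketch",
                "name": f"attr_{attr}",
                "fieldName": sketch_metric,
            },
        }
        for attr in attributes
    ]
    # Estimate of the devices that appear under every requested attribute.
    intersection = {
        "type": "thetaSketchEstimate",
        "name": "all_attributes",
        "field": {
            "type": "thetaSketchSetOp",
            "name": "all_attributes_sketch",
            "func": "INTERSECT",
            "fields": [
                {"type": "fieldAccess", "fieldName": f"attr_{attr}"}
                for attr in attributes
            ],
        },
    }
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        # Top-level filter so only rows for the requested attributes are scanned.
        "filter": {"type": "in", "dimension": "attribute", "values": list(attributes)},
        "aggregations": aggregations,
        "postAggregations": [intersection],
    }


query = build_count_distinct_query(
    "device_attributes", "2016-11-15/2016-11-16", ["11111", "22222"]
)
print(json.dumps(query, indent=2))
```

The resulting JSON would be posted to the Druid broker as a single request. The top-level filter narrows the rows scanned, while the filtered aggregators let one pass produce every per-attribute sketch.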


● Batch Ingestion

○ EMR Tuning

■ 140-nodes cluster

● 85% spot instances => ~80% cost reduction

○ Druid input file format - Parquet vs CSV

■ Reduced indexing time by 4x

■ Reduced storage used by 10x


● Community

Summary

Druid            ES
10TB/day         250GB/day
4 hours/day      10 hours/day
15GB/day         2.5TB (total)
280ms-350ms      500ms-6000ms
$55K/month       $80K/month

THANK YOU!
