using druid for interactive count distinct queries at scale @ nmc

Yakir Buskilla + Itai Yaffe Nielsen USING DRUID FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE

Upload: ido-shilon

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Using druid  for interactive count distinct queries at scale @ nmc

Yakir Buskilla + Itai Yaffe, Nielsen

USING DRUID FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE

Page 2: Using druid  for interactive count distinct queries at scale @ nmc

Introduction

Yakir Buskilla

● Software Architect

● Focusing on Big Data and Machine Learning problems

Itai Yaffe

● Big Data Infrastructure Developer

● Dealing with Big Data challenges for the last 5 years

Page 3: Using druid  for interactive count distinct queries at scale @ nmc

Nielsen Marketing Cloud (NMC)

● eXelate was acquired by Nielsen 2 years ago

● A leader in the Ad Tech and Marketing Tech industry

● What do we do?

○ Data as a Service (DaaS)

○ Software as a Service (SaaS)

Page 4: Using druid  for interactive count distinct queries at scale @ nmc

NMC high-level architecture

Page 5: Using druid  for interactive count distinct queries at scale @ nmc

The need

● Nielsen Marketing Cloud business question

○ How many unique devices have we encountered:

■ over a given date range

■ for a given set of attributes (segments, regions, etc.)

● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements

Page 6: Using druid  for interactive count distinct queries at scale @ nmc

The need

Page 7: Using druid  for interactive count distinct queries at scale @ nmc

The need

Page 8: Using druid  for interactive count distinct queries at scale @ nmc

Possible solutions

● Naive: store everything

● Bit vector: store only 1 bit per device

○ 10B devices: 1.25 GB/day

○ 10B devices * 80K attributes: 100 TB/day

● Approx.: approximate
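
The bit-vector figures above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and one bit vector per attribute per day, which the slide does not state explicitly.

```python
# Storage needed by the bit-vector approach, using the numbers from the slide.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 TB = 1e3 GB).
DEVICES = 10_000_000_000   # 10B devices
ATTRIBUTES = 80_000        # 80K attributes

bytes_per_day = DEVICES / 8                      # 1 bit per device
gb_per_day = bytes_per_day / 1e9                 # one bit vector per day
tb_per_day_all_attributes = gb_per_day * ATTRIBUTES / 1e3

print(f"{gb_per_day:.2f} GB/day for a single bit vector")
print(f"{tb_per_day_all_attributes:.0f} TB/day for one bit vector per attribute")
```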

Page 9: Using druid  for interactive count distinct queries at scale @ nmc

Our journey

● Elasticsearch

○ Indexing data

■ 250 GB of daily data, 10 hours

■ Affects query time

○ Querying

■ Low concurrency

■ Scans on all the shards of the corresponding index

Page 10: Using druid  for interactive count distinct queries at scale @ nmc

What we tried

● Preprocessing

● Statistical algorithms (e.g., HyperLogLog)

Page 11: Using druid  for interactive count distinct queries at scale @ nmc

ThetaSketch

● K Minimum Values (KMV)

● Estimate set cardinality

● Supports set-theoretic operations

● ThetaSketch mathematical framework - generalization of KMV

(Figures: Venn diagrams of two sets, X and Y)

Page 12: Using druid  for interactive count distinct queries at scale @ nmc

KMV intuition

Page 13: Using druid  for interactive count distinct queries at scale @ nmc

ThetaSketch error

Sketch size (k)    1 std dev (68.27% confidence)    2 std dev (95.45% confidence)
16,384             0.78%                            1.56%
32,768             0.55%                            1.10%
65,536             0.39%                            0.78%
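
The error figures above are consistent with a relative standard error of roughly 1/sqrt(k), where k is the sketch size. The slide does not state this formula, so treat the following as an assumption that happens to reproduce the listed numbers; `relative_error_pct` is a name made up for this example.

```python
import math


def relative_error_pct(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error in percent, assuming RSE ~ 1 / sqrt(k)."""
    return 100.0 * num_std_dev / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    one_sigma = relative_error_pct(k, 1)
    two_sigma = relative_error_pct(k, 2)
    print(f"{k:>6,}  {one_sigma:.2f}%  {two_sigma:.2f}%")
```

Doubling the accuracy requires roughly four times the sketch size, which is the trade-off to keep in mind when choosing k.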

Page 14: Using druid  for interactive count distinct queries at scale @ nmc

DRUID

“Very fast highly scalable columnar data-store”

Page 15: Using druid  for interactive count distinct queries at scale @ nmc

Roll-up

Raw data:

Timestamp     Attribute    Device ID
2016-11-15    11111        3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15    22222        3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15    11111        5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15    22222        5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15    33333        5dd59f9bd068f802a7c6dd832bf60d02

After roll-up with ThetaSketchAggregator:

Timestamp     Attribute    Count Distinct
2016-11-15    11111        2
2016-11-15    22222        2
2016-11-15    33333        1
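
The roll-up above can be mimicked in plain Python to make the grouping explicit. This sketch uses exact sets in place of theta sketches, so it shows only the shape of the aggregation (one row per timestamp and attribute), not how Druid stores it; `roll_up` is a name chosen for the example.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, attribute, device ID).
rows = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(events):
    """Collapse raw events into one distinct-device count per (timestamp, attribute)."""
    devices = defaultdict(set)   # an exact set stands in for a theta sketch
    for timestamp, attribute, device_id in events:
        devices[(timestamp, attribute)].add(device_id)
    return {key: len(ids) for key, ids in devices.items()}


rolled = roll_up(rows)
for (timestamp, attribute), count in sorted(rolled.items()):
    print(timestamp, attribute, count)
```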

Page 16: Using druid  for interactive count distinct queries at scale @ nmc

Druid architecture

Page 17: Using druid  for interactive count distinct queries at scale @ nmc

How do we use Druid

Page 18: Using druid  for interactive count distinct queries at scale @ nmc

Guidelines and pitfalls

● Setup is not easy

Page 19: Using druid  for interactive count distinct queries at scale @ nmc

Guidelines and pitfalls

● Monitoring your system

Page 20: Using druid  for interactive count distinct queries at scale @ nmc

Guidelines and pitfalls

● Data modeling

○ Reduce the number of intersections

○ Different datasources for different use cases

Datasource without a Region dimension:

Timestamp     Attribute         Count Distinct
2016-11-15    US                XXXXXX
2016-11-15    Porsche Intent    XXXXXX
...           ...               ...

Datasource with a Region dimension:

Timestamp     Attribute         Region    Count Distinct
2016-11-15    Porsche Intent    US        XXXXXX
...           ...               ...       ...
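
The two datasource shapes can be compared with a small example. In the first, the region is just another attribute, so "Porsche Intent in the US" needs an intersection at query time; in the second, region is its own dimension and the same answer is a direct lookup. The device IDs and the "Other" attribute are hypothetical, and exact sets stand in for sketches.

```python
from collections import defaultdict

# Hypothetical events: (attribute, region, device ID).
events = [
    ("Porsche Intent", "US", "d1"),
    ("Porsche Intent", "US", "d2"),
    ("Porsche Intent", "DE", "d3"),
    ("Other", "US", "d4"),
]

by_attribute = defaultdict(set)             # region stored as just another attribute
by_attribute_and_region = defaultdict(set)  # region stored as its own dimension

for attribute, region, device in events:
    by_attribute[attribute].add(device)
    by_attribute[region].add(device)
    by_attribute_and_region[(attribute, region)].add(device)

# First shape: two per-attribute sets must be intersected at query time.
query_time = len(by_attribute["Porsche Intent"] & by_attribute["US"])

# Second shape: the combination is precomputed, so no intersection is needed.
precombined = len(by_attribute_and_region[("Porsche Intent", "US")])

print(query_time, precombined)
```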

Page 21: Using druid  for interactive count distinct queries at scale @ nmc

Guidelines and pitfalls

● Query optimization

○ Combine multiple queries into a single query

○ Use filters
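
One way to combine several count-distinct questions into a single request is to send one Druid native timeseries query that carries several filtered thetaSketch aggregators plus a post-aggregation for their intersection. The sketch below only builds the query as a Python dictionary. The datasource name (`device_sketches`), the dimension name (`attribute`), and the metric name (`device_sketch`) are hypothetical, and it assumes the DataSketches extension is loaded on the cluster.

```python
import json


def filtered_theta(attribute_value: str) -> dict:
    """A thetaSketch aggregator restricted to rows with the given attribute value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": attribute_value},
        "aggregator": {
            "type": "thetaSketch",
            "name": f"sketch_{attribute_value}",
            "fieldName": "device_sketch",   # hypothetical metric column
        },
    }


attributes = ["11111", "22222"]

query = {
    "queryType": "timeseries",
    "dataSource": "device_sketches",        # hypothetical datasource name
    "granularity": "all",
    "intervals": ["2016-11-15/2016-11-16"],
    # Top-level filter: only scan rows for the attributes we actually need.
    "filter": {"type": "in", "dimension": "attribute", "values": attributes},
    "aggregations": [filtered_theta(a) for a in attributes],
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_with_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "both_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": f"sketch_{a}"} for a in attributes
                ],
            },
        }
    ],
}

print(json.dumps(query, indent=2))
```

The resulting JSON would be posted to the Broker as one request, instead of issuing one query per attribute and intersecting the results on the client.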

Page 22: Using druid  for interactive count distinct queries at scale @ nmc

Guidelines and pitfalls

● Batch Ingestion

○ EMR Tuning

■ 140-node cluster

● 85% spot instances => ~80% cost reduction

○ Druid input file format - Parquet vs CSV

■ Reduced indexing time by 4x

■ Reduced storage used by 10x

Page 23: Using druid  for interactive count distinct queries at scale @ nmc

Guidelines and pitfalls

● Community

Page 24: Using druid  for interactive count distinct queries at scale @ nmc

Summary

                  ES               DRUID
Ingested data     250GB/day        10TB/day
Indexing time     10 Hours/day     4 Hours/day
Stored data       2.5TB (total)    15GB/day
Query latency     500ms-6000ms     280ms-350ms
Cost              $80K/month       $55K/month

Page 25: Using druid  for interactive count distinct queries at scale @ nmc

THANK YOU!