scalable real-time analytics using druid

Scalable Real-time Analytics using Druid, Nishant Bangarwa and Slim Bouguerra, Hadoop Summit, June 2016

Page 1: Scalable Real-time analytics using Druid

Scalable Real-time Analytics using Druid
Nishant Bangarwa and Slim Bouguerra
Hadoop Summit
June 2016

Page 2: Scalable Real-time analytics using Druid

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

History and Motivation

Druid Architecture

Druid vs Big Data

Page 3: Scalable Real-time analytics using Druid

History

Development started at Metamarkets in 2011

Initial use case – power ad-tech analytics product

Open sourced in late 2012
– GPL licensed initially
– Switched to Apache V2 in early 2015

150+ contributors today

Page 4: Scalable Real-time analytics using Druid

Motivation

Business Intelligence / OLAP use cases that need interactive real-time visualizations on complex data streams, e.g.
– Real-time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics

Page 5: Scalable Real-time analytics using Druid

Solutions Evaluated

RDBMS (Postgres, MySQL)
– Star schema with aggregate tables
– Slow performance at large scale (up to 20 sec page load times)
– Query caching helped, arbitrary queries still slow

Key/value stores (HBase, Cassandra, BigTable)
– Pre-aggregate all dimensional combinations
– Fast queries were achieved
– Precomputation scales exponentially
– Takes time to precompute (up to 9 hrs with 14 dimensions)
– Not cost effective
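To make "precomputation scales exponentially" concrete, here is a small sketch that counts how many distinct dimension groupings a pre-aggregation job would have to materialize. It only counts groupings (subsets of dimensions), not rows; the function name is illustrative and not taken from the talk.

```python
from itertools import combinations

def count_groupings(num_dimensions: int) -> int:
    """Count every subset of dimensions that would need its own aggregate."""
    return sum(
        1
        for size in range(num_dimensions + 1)
        for _ in combinations(range(num_dimensions), size)
    )

# Each extra dimension doubles the number of groupings (2 ** n in total).
for n in (4, 8, 14):
    print(n, "dimensions ->", count_groupings(n), "groupings")
```

With 14 dimensions this already yields 2^14 = 16384 groupings, before multiplying by the cardinality of each dimension.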

Page 6: Scalable Real-time analytics using Druid

What is Druid ?

Column-oriented distributed datastore
Sub-second query times
Real-time streaming ingestion
Arbitrary slicing and dicing of data
Automatic data summarization
Approximate algorithms (HyperLogLog, theta)
Scalable to petabytes of data
Highly available

Page 7: Scalable Real-time analytics using Druid

Companies Using Druid

Page 8: Scalable Real-time analytics using Druid

Key Features of Druid

Page 9: Scalable Real-time analytics using Druid

Scalability

Ability to handle
– petabytes of data
– billions of events/day

Largest Druid cluster
– 50 trillion+ events
– 50 PB+ of raw data
– Over 500 TB of compressed queryable data
– Ingestion rate over 500,000 events/sec
– 10-100K events/sec/core

Page 10: Scalable Real-time analytics using Druid

Fast Response Time

Critical for interactive user experience
Average query times ~500 ms
90th percentile under 1 sec
99th percentile under 10 sec
Handles thousands of concurrent queries

Page 11: Scalable Real-time analytics using Druid

Arbitrary slicing n’ dicing

Ability to support arbitrary filtering, splitting and aggregation of data
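As an illustration of filtering, splitting and aggregating in a single request, the sketch below builds a Druid native groupBy query as JSON. The datasource name `ad_impressions` and the column names are assumptions made for this example; such a query would typically be POSTed to a broker's `/druid/v2` endpoint.

```python
import json

# Hypothetical datasource and column names, chosen for illustration only.
query = {
    "queryType": "groupBy",
    "dataSource": "ad_impressions",
    "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"],
    "granularity": "hour",
    # Filter: keep only rows where gender == "Female".
    "filter": {"type": "selector", "dimension": "gender", "value": "Female"},
    # Split: one result row per domain (per hour).
    "dimensions": ["domain"],
    # Aggregate: sum the rolled-up metric columns.
    "aggregations": [
        {"type": "longSum", "name": "impressions", "fieldName": "impressions"},
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
    ],
}

print(json.dumps(query, indent=2))
```

Changing the filter, the dimension list or the aggregations needs no new precomputation; only the query body changes.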

Page 12: Scalable Real-time analytics using Druid

Queries on Immediate Data

Immediate insights into current data

Ability to query data as soon as it is ingested

Recent data more important than old data

Page 13: Scalable Real-time analytics using Druid

Highly Available

Data replication across nodes
Shared-nothing architecture
No single point of failure

Page 14: Scalable Real-time analytics using Druid

Rolling upgrades without downtime

Maintain backwards compatibility
Each node can be upgraded independently
Easy to run experiments
No downtime

Page 15: Scalable Real-time analytics using Druid

Druid Architecture

Page 16: Scalable Real-time analytics using Druid

Early Druid Architecture

[Diagram labels: Hadoop, Batch Data, Historical Node (x3), Broker Node, Queries]

Page 19: Scalable Real-time analytics using Druid

Historical Nodes

Shared-nothing architecture
Main workhorses of a Druid cluster
Load immutable, read-optimized segments
Respond to queries
Use memory-mapped files to load segments

Page 20: Scalable Real-time analytics using Druid

Broker Nodes

Keeps track of segment announcements in the cluster
Scatters queries across historical and realtime nodes
Merges results from different query nodes
(Distributed) caching layer
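The scatter and merge steps can be pictured with a toy model. This is not Druid code: the node names, the in-memory rows and the thread pool are all assumptions standing in for real network calls to historical and realtime nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node data: each node holds rows of (domain, clicks).
NODES = {
    "historical-1": [("bieber.com", 2), ("ultra.com", 1)],
    "historical-2": [("ultra.com", 3)],
    "realtime-1": [("bieber.com", 1)],
}

def query_node(rows):
    """Partial aggregate computed locally on one node."""
    partial = Counter()
    for domain, clicks in rows:
        partial[domain] += clicks
    return partial

def broker_query(nodes):
    """Scatter the query to every node, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_node, nodes.values()))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)

result = broker_query(NODES)
print(result)
```

The merge is cheap because each node returns an already aggregated partial result rather than raw rows.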

Page 21: Scalable Real-time analytics using Druid

Coordinator Nodes

Assigns segments to historical nodes
Interval-based cost function to distribute segments
Makes sure query load is uniform across historical nodes
Handles replication of data
Configurable rules to load/drop data
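The slide does not spell out the cost function, so the following is only a toy stand-in for the idea: segments whose time intervals are close together are likely to be queried together, so placing them on the same node is penalized. The decay constant, the hour-based intervals and the node names are assumptions for illustration, not Druid's actual formula.

```python
from math import exp

def pair_cost(a, b, decay_hours=24.0):
    """Toy cost: 1.0 for touching/overlapping intervals, decaying with the gap."""
    (a_start, a_end), (b_start, b_end) = a, b
    gap = max(0.0, max(a_start, b_start) - min(a_end, b_end))
    return exp(-gap / decay_hours)

def assign(segment, nodes):
    """Place the segment on the node where its total co-location cost is lowest."""
    def total_cost(name):
        return sum(pair_cost(segment, other) for other in nodes[name])
    best = min(sorted(nodes), key=total_cost)
    nodes[best].append(segment)
    return best

# Intervals are (start_hour, end_hour); node contents are hypothetical.
nodes = {"h1": [(0, 1), (1, 2)], "h2": [(100, 101)]}
chosen = assign((2, 3), nodes)
print(chosen)
```

Here the new segment sits right next to the segments already on `h1`, so it is sent to `h2`, which spreads a query over hours 0 to 3 across both nodes.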

Page 22: Scalable Real-time analytics using Druid

Current Druid Architecture

[Diagram labels: Streaming Data, ETL (Samza, Kafka, Storm, Spark etc.), Realtime Node (x2), Hand off, Hadoop, Batch Data, Historical Node (x3), Broker Node, Queries]

Page 23: Scalable Real-time analytics using Druid

Realtime Nodes

Ability to ingest streams of data
Both push- and pull-based ingestion
Stores data in a write-optimized structure
Periodically converts the write-optimized structure to read-optimized segments
Events are queryable as soon as they are ingested
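A rough mental model of the write-optimized / read-optimized split, using plain Python structures. The class and method names are hypothetical; a real realtime node uses an in-memory index and on-disk columnar segments rather than dicts and tuples.

```python
from collections import defaultdict

class ToyRealtimeNode:
    def __init__(self):
        self.buffer = defaultdict(int)  # write-optimized: mutable hash map
        self.segments = []              # read-optimized: immutable sorted tuples

    def ingest(self, domain, clicks):
        """Accept one event; it is visible to queries immediately."""
        self.buffer[domain] += clicks

    def persist(self):
        """Freeze the buffer into a sorted, immutable segment and start afresh."""
        if self.buffer:
            self.segments.append(tuple(sorted(self.buffer.items())))
            self.buffer = defaultdict(int)

    def query(self, domain):
        """Answer from persisted segments plus the live buffer."""
        total = self.buffer.get(domain, 0)
        for segment in self.segments:
            total += dict(segment).get(domain, 0)
        return total

node = ToyRealtimeNode()
node.ingest("ultra.com", 1)
before_persist = node.query("ultra.com")  # served from the live buffer
node.persist()
node.ingest("ultra.com", 2)
after_persist = node.query("ultra.com")   # segment + live buffer
print(before_persist, after_persist)
```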

Page 24: Scalable Real-time analytics using Druid

Druid vs Big Data

Page 25: Scalable Real-time analytics using Druid

Page 26: Scalable Real-time analytics using Druid

Question: number of unique users in the last minute?

Pre-compute aggregates for every possible set of dimensions.
ETL pipeline with thousands of stages to compute aggregates.
Load into a complex stack and layers of databases.
Repeat every hour/day/week/year.

Will not scale for:
Billions of users.
Billions of events per hour.
Retaining years' worth of data.

Page 27: Scalable Real-time analytics using Druid

Summarize data you must !!

Any solution ?

Page 28: Scalable Real-time analytics using Druid

Summarization

Row is an ad impression.
Clicked == 1 is an actual click.
Summarization of the hour?

timestamp             domain      user        gender  clicked
2011-01-01T00:01:35Z  bieber.com  4312345532  Female  1
2011-01-01T00:03:03Z  bieber.com  3484920241  Female  0
2011-01-01T00:04:51Z  ultra.com   9530174728  Male    1
2011-01-01T00:05:33Z  ultra.com   4098310573  Male    1
2011-01-01T00:05:53Z  ultra.com   5832057930  Female  0
2011-01-01T00:06:17Z  ultra.com   5789283478  Female  1
2011-01-01T00:23:15Z  bieber.com  4730093842  Female  0
2011-01-01T00:38:51Z  ultra.com   3909846810  Male    1
2011-01-01T00:49:33Z  bieber.com  4930097162  Female  1
2011-01-01T00:49:53Z  ultra.com   0381837193  Female  0

Example courtesy of Eric Tschetter, used with his permission

simple, just add up numbers

Page 29: Scalable Real-time analytics using Druid

Summarization – Hourly, Simple

timestamp             domain      user        gender  clicked
2011-01-01T00:01:35Z  bieber.com  4312345532  Female  1
2011-01-01T00:03:03Z  bieber.com  3484920241  Female  0
2011-01-01T00:04:51Z  ultra.com   9530174728  Male    1
2011-01-01T00:05:33Z  ultra.com   4098310573  Male    1
2011-01-01T00:05:53Z  ultra.com   5832057930  Female  0
2011-01-01T00:06:17Z  ultra.com   5789283478  Female  1
2011-01-01T00:23:15Z  bieber.com  4730093842  Female  0
2011-01-01T00:38:51Z  ultra.com   3909846810  Male    1
2011-01-01T00:49:33Z  bieber.com  4930097162  Female  1
2011-01-01T00:49:53Z  ultra.com   0381837193  Female  0

timestamp             impressions  clicks
2011-01-01T00:00:00Z  10           6

We cannot query by domain, user or gender!

Page 30: Scalable Real-time analytics using Druid

Summarization – Hourly, Gender + Domain

timestamp             domain      user        gender  clicked
2011-01-01T00:01:35Z  bieber.com  4312345532  Female  1
2011-01-01T00:03:03Z  bieber.com  3484920241  Female  0
2011-01-01T00:04:51Z  ultra.com   9530174728  Male    1
2011-01-01T00:05:33Z  ultra.com   4098310573  Male    1
2011-01-01T00:05:53Z  ultra.com   5832057930  Female  0
2011-01-01T00:06:17Z  ultra.com   5789283478  Female  1
2011-01-01T00:23:15Z  bieber.com  4730093842  Female  0
2011-01-01T00:38:51Z  ultra.com   9530174728  Male    1
2011-01-01T00:49:33Z  bieber.com  4930097162  Female  1
2011-01-01T00:49:53Z  ultra.com   0381837193  Female  0

timestamp             domain      gender  impressions  clicks
2011-01-01T00:00:00Z  bieber.com  Female  4            2
2011-01-01T00:00:00Z  ultra.com   Female  3            1
2011-01-01T00:00:00Z  ultra.com   Male    3            3

(+) Number of rows per hour is bounded by the cardinality of (domain x gender).
(-) Query granularity cannot be finer than one hour.
(-) Cannot answer the number of unique users!
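The roll-up on these slides can be reproduced with a few lines of Python. This is a sketch over the ten raw rows from the slide, not Druid's ingestion code; `rollup` and `FIELDS` are names introduced here for illustration.

```python
from collections import defaultdict

FIELDS = ("timestamp", "domain", "user", "gender", "clicked")
ROWS = [
    ("2011-01-01T00:01:35Z", "bieber.com", "4312345532", "Female", 1),
    ("2011-01-01T00:03:03Z", "bieber.com", "3484920241", "Female", 0),
    ("2011-01-01T00:04:51Z", "ultra.com", "9530174728", "Male", 1),
    ("2011-01-01T00:05:33Z", "ultra.com", "4098310573", "Male", 1),
    ("2011-01-01T00:05:53Z", "ultra.com", "5832057930", "Female", 0),
    ("2011-01-01T00:06:17Z", "ultra.com", "5789283478", "Female", 1),
    ("2011-01-01T00:23:15Z", "bieber.com", "4730093842", "Female", 0),
    ("2011-01-01T00:38:51Z", "ultra.com", "9530174728", "Male", 1),
    ("2011-01-01T00:49:33Z", "bieber.com", "4930097162", "Female", 1),
    ("2011-01-01T00:49:53Z", "ultra.com", "0381837193", "Female", 0),
]

def rollup(rows, dimensions):
    """Truncate timestamps to the hour and sum metrics per dimension combination."""
    totals = defaultdict(lambda: [0, 0])  # key -> [impressions, clicks]
    for row in rows:
        record = dict(zip(FIELDS, row))
        hour = record["timestamp"][:13] + ":00:00Z"
        key = (hour,) + tuple(record[d] for d in dimensions)
        totals[key][0] += 1
        totals[key][1] += record["clicked"]
    return {key: tuple(value) for key, value in totals.items()}

# No dimensions: one row for the whole hour.
print(rollup(ROWS, []))
# Keep domain and gender: at most |domain| x |gender| rows per hour.
for key, value in sorted(rollup(ROWS, ["domain", "gender"]).items()):
    print(key, value)
```

Passing an empty dimension list gives the "Hourly, Simple" summary; passing `["domain", "gender"]` gives the three-row table. The `user` column is dropped in both cases, which is why unique users can no longer be counted.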

Page 31: Scalable Real-time analytics using Druid

Summarization, compute unique

timestamp             domain      gender  impressions  clicks
2011-01-01T00:00:00Z  bieber.com  Female  4            2
2011-01-01T00:00:00Z  ultra.com   Female  3            1
2011-01-01T00:00:00Z  ultra.com   Male    3            3

timestamp      domain      gender  impressions  clicks  unique
2011-01-01T00  bieber.com  Female  4            2       [4312345532, 3484920241, 4730093842, 4930097162]
2011-01-01T00  ultra.com   Female  3            1       [5832057930, 5789283478, 0381837193]
2011-01-01T00  ultra.com   Male    3            3       [9530174728, 4098310573]

“unique” grows linearly!
Billion-entry sets per row!
Cannot apply the push-down aggregate and merge approach.

Page 32: Scalable Real-time analytics using Druid

Sketching groundwork

timestamp      domain      gender  impressions  clicks  uniques
2011-01-01T00  bieber.com  Female  4            2       [4312345532, 3484920241, 4730093842, 4930097162]
2011-01-01T00  ultra.com   Female  3            1       [5832057930, 5789283478, 0381837193]
2011-01-01T00  ultra.com   Male    3            3       [9530174728, 4098310573]

timestamp      domain      gender  impressions  clicks  uniques
2011-01-01T00  bieber.com  Female  4            2       <sub-linear-data-structure>
2011-01-01T00  ultra.com   Female  3            1       <sub-linear-data-structure>
2011-01-01T00  ultra.com   Male    3            3       <sub-linear-data-structure>

Requirements
– Streamable.
– Mergeable at query time.
– Approximates the number of unique users with predictable error.
– Limited memory, independent of the data size.
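One data structure that meets these four requirements is a K-Minimum-Values (KMV) sketch. The version below is a minimal teaching sketch, not the DataSketches implementation: the class name, the use of SHA-1 and the default `k` are assumptions made for illustration.

```python
import hashlib
import heapq

class KMVSketch:
    """Keeps only the k smallest hash values ever seen (bounded memory)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # same values, for duplicate detection

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # float in [0, 1)

    def _add_hash(self, h):
        if h in self._members:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def add(self, item):
        """Streamable: one item at a time, O(log k) work."""
        self._add_hash(self._hash(item))

    def merge(self, other):
        """Mergeable: the union keeps the k smallest hashes of both inputs."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))      # exact while under capacity
        return (self.k - 1) / -self._heap[0]   # (k - 1) / k-th smallest hash

# Two hypothetical rows whose user sets overlap by 10,000 users.
row_a, row_b = KMVSketch(), KMVSketch()
for user in range(0, 20000):
    row_a.add(user)
for user in range(10000, 30000):
    row_b.add(user)
print(round(row_a.merge(row_b).estimate()))  # an approximation of 30000
```

Each sketch holds at most `k` hashes no matter how many users it has seen, and merging two sketches at query time gives the same result as sketching the union of the raw user ids.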

Page 33: Scalable Real-time analytics using Druid

Theta Sketches KMV: Open sourced by Yahoo! [datasketches.github.io]

Predictable approximation error can be traded off against sketch size
– k = 4096 corresponds to an RSE of +/- 3.2% with 95% confidence.
– k = 16K corresponds to an RSE of +/- 1.6% with 95% confidence.

Limited memory footprint, independent of data size
– k = 4096 -> 32768 bytes.
– k = 16384 -> 131072 bytes.

Mergeable at query time.
– “merge rate of about 14.5 million sketches per second per processor thread” [http://datasketches.github.io/docs/Theta/ThetaMergeSpeed.html].

Intersection can be computed at query time.
Duplication insensitive.
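The figures on this slide can be approximated with simple arithmetic. The sketch below assumes a relative standard error of roughly 1/sqrt(k), two standard deviations for about 95% confidence, and 8 bytes per retained 64-bit hash; these are assumptions that come close to the slide's numbers, not formulas quoted from the DataSketches documentation.

```python
from math import sqrt

def theta_sketch_budget(k, bytes_per_hash=8, sigmas=2):
    """Rough error bound and memory for a sketch that retains k 64-bit hashes."""
    rse = 1 / sqrt(k)  # assumed relative standard error, about 1/sqrt(k)
    return sigmas * rse, k * bytes_per_hash

for k in (4096, 16384):
    error, size = theta_sketch_budget(k)
    print(f"k={k}: about +/-{error:.2%} at ~95% confidence, {size} bytes")
```

Quadrupling `k` halves the error bound and quadruples the memory, which is the trade-off the slide describes.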

https://speakerdeck.com/druidio/approximate-algorithms-and-sketches-in-druid

Page 34: Scalable Real-time analytics using Druid

Druid success story !

Replaced 5,000 HBase cluster serving six petabytes of metrics to power Flurry mobile analytics alone. [infoworld.com]

Tracking more than 2 billion mobile devices [Flurry SDK @ MDC 2016].
Real-time ingestion of 20 billion events per day [Flurry SDK @ MDC 2016].
Sub-second query latency.
Query the last 15 seconds.

Page 35: Scalable Real-time analytics using Druid

“Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance.”

― Kurt Vonnegut,

Page 36: Scalable Real-time analytics using Druid

Monitoring

Monitoring (system level)
– CPU
– MEM
– Network IO
– …

Alerting
– Logged exceptions
– System metrics thresholds

Exploratory debugging and performance tuning?

Page 37: Scalable Real-time analytics using Druid

Exploratory Debugging / Performance Tuning, Hard !!

Guess Why ?

Distributed application running on multiple machines with different configurations, across multiple data centers…

Page 38: Scalable Real-time analytics using Druid

Exploratory Debugging / Performance Tuning, Hard !!

Guess Why ?

Cannot run benchmarks in the production environment.

Page 39: Scalable Real-time analytics using Druid

Exploratory Debugging / Performance Tuning, Hard !!

Guess Why ?

Cannot reproduce the production load pattern.

Page 40: Scalable Real-time analytics using Druid

Exploratory Debugging / Performance Tuning, Hard !!

Guess Why ?

Hard to obtain insights from log files.
– Need to be interactive and real-time.
– Need to be able to arbitrarily slice and dice the benchmark results.

Page 41: Scalable Real-time analytics using Druid

Druid internal metrics

Event types
– Periodic events
– Query-related events
– Ingestion-related events

Anatomy of events

{"timestamp": "2016-05-01T10:14:00", "metric": "query/time", "service": "druid/broker", "value": "234", "type": "groupBy", "id": "12374095094", …}

Unbounded cardinality for dimensions like query id.
Very high throughput of emitted events.
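One way to tame the unbounded-cardinality `id` dimension is to drop it and roll events up per minute. The sketch below assumes events shaped like the one above; only the first event comes from the slide, the other two are made up for illustration, and `rollup_by_minute` is a hypothetical helper.

```python
import json
from collections import defaultdict

RAW_EVENTS = [
    # From the slide (without the elided fields).
    '{"timestamp": "2016-05-01T10:14:00", "metric": "query/time", '
    '"service": "druid/broker", "value": "234", "type": "groupBy", "id": "12374095094"}',
    # Hypothetical additional events.
    '{"timestamp": "2016-05-01T10:14:30", "metric": "query/time", '
    '"service": "druid/broker", "value": "166", "type": "groupBy", "id": "12374095095"}',
    '{"timestamp": "2016-05-01T10:15:10", "metric": "query/time", '
    '"service": "druid/broker", "value": "90", "type": "timeseries", "id": "12374095096"}',
]

def rollup_by_minute(raw_events):
    """Aggregate per (minute, service, metric, type); the query id is dropped."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for line in raw_events:
        event = json.loads(line)
        minute = event["timestamp"][:16]  # "YYYY-MM-DDTHH:MM"
        key = (minute, event["service"], event["metric"], event["type"])
        totals[key]["count"] += 1
        totals[key]["sum"] += float(event["value"])
    return dict(totals)

for key, stats in sorted(rollup_by_minute(RAW_EVENTS).items()):
    print(key, stats)
```

After the roll-up, the number of stored rows is bounded by the number of (minute, service, metric, type) combinations rather than by the number of queries.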

Page 42: Scalable Real-time analytics using Druid

Metrics Cluster Architecture

Each real-time node can handle 20k events/sec at a granularity of one minute.

[Diagram labels: http, http VIP, Collector nodes, Real time, Historical, Brokers, Query rewrite, Scatter/Gather]

Page 43: Scalable Real-time analytics using Druid

Summary

Scalability
– Horizontal scalability.
– Columnar storage, indexing and compression.
– Multi-tenancy.

Real-time
– Ingestion latency < seconds.
– Query latency < seconds.

Arbitrary slice and dice big data like ninja
– No more pre-canned drill downs.
– Query with more fine-grained granularity.

High availability and rolling deployment capabilities
– Less costly to run.
– Very active open source community.

Page 44: Scalable Real-time analytics using Druid

Thank you ! Questions ?

Page 45: Scalable Real-time analytics using Druid

Druid as a Platform

[Diagram: Druid at the center, surrounded by]
Batch Ingestion (Hadoop, Spark, …)
Streaming Ingestion (Storm, Samza, Spark-Streaming, Kafka, …)
Web Services (Fili)
Visualizations (Pivot, Grafana, Caravel)
Machine Learning (SciPy, R, ScalaNLP)