Case Study: Real-time Analytics With Druid

Salil Kalia, Tech Lead, TO THE NEW Digital

Post on 16-Apr-2017

Page 1: Case Study: Realtime Analytics with Druid

Case Study: Real-time Analytics With Druid

Salil Kalia, Tech Lead, TO THE NEW Digital

Page 2: Case Study: Realtime Analytics with Druid

About Presenter

• Over 10 years in the software industry

• Working with TO THE NEW Digital since 2009

• Mainly uses the Java/Groovy/Grails ecosystems for development

• Working in the digital marketing domain for the last few years

• Cassandra certified trainer

• Loves traveling and exploring new places

Page 3: Case Study: Realtime Analytics with Druid

Agenda

Understanding the use-case

• Ad workflow

• Our use case

Experiments with technologies

• Redis

• Cassandra

Introduction to Druid

• Architecture

• Druid in production

• Demo

Page 4: Case Study: Realtime Analytics with Druid

Understanding the use-case

Page 5: Case Study: Realtime Analytics with Druid

Understanding The Ad Workflow

[Diagram. Components: USER, PUBLISHER SERVER, AD EXCHANGE, AD AGENCY-1, AD AGENCY-2, AD AGENCY-3. Arrow labels: Web Page Request, Ad Request, Ad-Content.]

Page 6: Case Study: Realtime Analytics with Druid

Examples From Our Use Case

• How many times has a video been viewed?

• How many times has a video been viewed in a particular time-span?

• How many times has a video been viewed in a particular time-span at a particular site?

• How many times has a video been viewed in a particular time-span at a particular site in a particular country?

• How many times has a video been viewed in a particular time-span at a particular site in a particular country on a particular device?

Page 7: Case Study: Realtime Analytics with Druid

Video Events For The Analysis

• LOAD

• START

• PLAYING

• VIEW

• STOP / PAUSE

• FINISH

Page 8: Case Study: Realtime Analytics with Druid

Event Data (Sample)

TIMESTAMP              Ad    Site       Advertiser   Event    Action
2011-01-01T01:01:27Z   123   abc.com    Brand X      Player   Load
2011-01-01T01:01:33Z   234   abcd.com   Brand Y      Player   Load
2011-01-01T01:01:40Z   123   abc.com    Brand X      Player   Start
2011-01-01T01:01:45Z   123   abc.com    Brand X      Player   Playing
2011-01-01T01:01:50Z   123   abc.com    Brand Y      Player   Playing
2011-01-01T01:01:51Z   123   abc.com    Brand X      Player   Stop
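
The use-case questions (count an event, then narrow by time-span, site and so on) can be illustrated over the sample rows with a small in-memory sketch. This is only an illustration of the query shape, not the production system; the function name `count_events` and the column names are assumptions made for the example.

```python
from datetime import datetime, timezone

# Sample rows from the slide: (timestamp, ad, site, advertiser, event, action).
COLUMNS = ("timestamp", "ad", "site", "advertiser", "event", "action")
EVENTS = [
    ("2011-01-01T01:01:27Z", "123", "abc.com", "Brand X", "Player", "Load"),
    ("2011-01-01T01:01:33Z", "234", "abcd.com", "Brand Y", "Player", "Load"),
    ("2011-01-01T01:01:40Z", "123", "abc.com", "Brand X", "Player", "Start"),
    ("2011-01-01T01:01:45Z", "123", "abc.com", "Brand X", "Player", "Playing"),
    ("2011-01-01T01:01:50Z", "123", "abc.com", "Brand Y", "Player", "Playing"),
    ("2011-01-01T01:01:51Z", "123", "abc.com", "Brand X", "Player", "Stop"),
]


def parse_ts(text):
    """Parse an ISO-8601 UTC timestamp such as 2011-01-01T01:01:27Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def count_events(events, action, start=None, end=None, **dims):
    """Count rows with the given action, an optional [start, end) window,
    and optional equality filters on any other column (site=..., ad=...)."""
    total = 0
    for row in events:
        record = dict(zip(COLUMNS, row))
        ts = parse_ts(record["timestamp"])
        if record["action"] != action:
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:
            continue
        if any(record[name] != value for name, value in dims.items()):
            continue
        total += 1
    return total


# Each extra keyword argument narrows the count by one more dimension.
loads = count_events(EVENTS, "Load")
playing_on_site = count_events(EVENTS, "Playing", site="abc.com")
playing_brand_x = count_events(EVENTS, "Playing", site="abc.com", advertiser="Brand X")
```

A linear scan like this obviously does not scale to the data volumes discussed later; it only makes the "filter by more and more dimensions" pattern concrete.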

Page 9: Case Study: Realtime Analytics with Druid

What Is Analytics?

Processing the HISTORICAL data to:

• Understand potential trends

• Analyze the effects of certain decisions or events

• Evaluate the performance of a system

• Make better business decisions

Page 10: Case Study: Realtime Analytics with Druid

What Is Real-time Analytics?

Page 11: Case Study: Realtime Analytics with Druid

Why (We Need) Real-time Analytics?

• Understand the real-time performance

• Control the velocity

• Avoid over serving

• Avoid under serving

• Control the targeting
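
As a rough illustration of "over serving" and "under serving", the sketch below compares the impressions delivered so far against a linear pace towards a campaign target. The linear-pacing rule, the 10% tolerance and the name `pacing_status` are assumptions chosen for the example; the slides do not describe the actual pacing logic.

```python
def pacing_status(delivered, target, elapsed_fraction, tolerance=0.1):
    """Classify delivery against a linear pace.

    delivered        -- impressions served so far (from real-time counts)
    target           -- impressions the campaign should serve in total
    elapsed_fraction -- share of the campaign period already elapsed (0.0 to 1.0)
    tolerance        -- allowed relative deviation from the expected pace
    """
    expected = target * elapsed_fraction
    if delivered > expected * (1 + tolerance):
        return "over-serving"
    if delivered < expected * (1 - tolerance):
        return "under-serving"
    return "on-pace"


# Halfway through a campaign with a target of 1000 impressions:
status_fast = pacing_status(600, 1000, 0.5)
status_slow = pacing_status(400, 1000, 0.5)
status_ok = pacing_status(520, 1000, 0.5)
```

The point is only that such a check needs fresh counts: if `delivered` lags by hours, the correction comes too late.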

Page 12: Case Study: Realtime Analytics with Druid

Recap – Things We Understood

• How the ad-tech works (in general)

• Our use-case

• Different video player events

• We are expecting a huge amount of data coming at a very high velocity.

Page 13: Case Study: Realtime Analytics with Druid

Experiments with technologies

Page 14: Case Study: Realtime Analytics with Druid

Why We Picked Redis

• Great buzz in the market

• Highly scalable

• Easy to set up, configure and use

• We were not very clear about our use-case

Page 15: Case Study: Realtime Analytics with Druid

Realizations From Redis

• Not a good fit to deal with time-series (big) data

• Persistence is another issue – we can’t afford to lose data

• There was a huge variety of keys all over the place

• Complexity in the (application side) code started increasing
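
One plausible way the "huge variety of keys" arises is when every combination of dimensions and time bucket gets its own counter key in a key-value store. The sketch below only generates the key names for a single event; the key scheme, the dimension list and the name `counter_keys` are hypothetical and are not taken from the presenter's code.

```python
from itertools import combinations

# Hypothetical dimensions, mirroring the use-case questions (site, country, device).
DIMENSIONS = ("site", "country", "device")


def counter_keys(event):
    """Return every counter key one event would touch under this
    hypothetical scheme: one key per subset of dimensions, per hour."""
    hour = event["timestamp"][:13]  # e.g. "2011-01-01T01"
    keys = []
    for size in range(len(DIMENSIONS) + 1):
        for combo in combinations(DIMENSIONS, size):
            parts = [
                f"video:{event['video']}",
                f"action:{event['action']}",
                f"hour:{hour}",
            ]
            parts += [f"{name}:{event[name]}" for name in combo]
            keys.append("|".join(parts))
    return keys


event = {
    "timestamp": "2011-01-01T01:01:27Z",
    "video": "123",
    "action": "VIEW",
    "site": "abc.com",
    "country": "IN",
    "device": "mobile",
}
keys = counter_keys(event)
```

With three dimensions, a single event already touches 2^3 = 8 keys per time bucket, and each additional dimension doubles that, which is consistent with the growing application-side complexity noted above.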

Page 16: Case Study: Realtime Analytics with Druid

Working With Cassandra

• Very good support for the time-series data

• Extremely good for writing the data at a very high speed

• Very easy to scale horizontally

• Supports aggregations through Counters

Page 17: Case Study: Realtime Analytics with Druid

Writing into Cassandra

[Diagram. Components: AD PLAYER, ANALYTICS SERVER, CASSANDRA.]

Page 18: Case Study: Realtime Analytics with Druid

Reading from Cassandra

[Diagram. Components: CAMPAIGN MANAGER, ANALYTICS SERVER, CASSANDRA.]

Page 19: Case Study: Realtime Analytics with Druid

What didn’t work with Cassandra

• Inconsistent results

• Unreliable counters

• No ad-hoc queries support

• Nodes were crashing very frequently

Page 20: Case Study: Realtime Analytics with Druid

Crossroads – What next?

• Third-party tools on top of Cassandra for better consistency

• DataStax Enterprise edition

• Taking a deeper dive into Cassandra to reconfigure the whole architecture and setup

• Switching to a different technology

Page 21: Case Study: Realtime Analytics with Druid

Understanding Druid

Page 22: Case Study: Realtime Analytics with Druid

About Druid (http://druid.io)

• An open-source analytics data store

• Supports streaming data ingestion

• Flexible filters for ad-hoc queries

• Fast aggregations – sub-second queries

• Distributed, shared-nothing architecture

• Easily scalable
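
To make "flexible filters" and "fast aggregations" concrete, here is a minimal sketch that builds a Druid native `timeseries` query body for one of the use-case questions. The data source name `video_events`, the dimension names and the ingested `count` metric are assumptions for illustration; only the general query structure (`queryType`, `dataSource`, `granularity`, `intervals`, `filter`, `aggregations`) follows Druid's native JSON query format.

```python
import json


def build_view_query(video_id, start, end, **dimensions):
    """Build a Druid native timeseries query counting VIEW events for one
    video in the interval start/end, optionally narrowed by more dimensions."""
    filters = [
        {"type": "selector", "dimension": "action", "value": "VIEW"},
        {"type": "selector", "dimension": "ad", "value": video_id},
    ]
    for name, value in sorted(dimensions.items()):
        filters.append({"type": "selector", "dimension": name, "value": value})
    return {
        "queryType": "timeseries",
        "dataSource": "video_events",  # hypothetical data source name
        "granularity": "all",
        "intervals": [f"{start}/{end}"],
        "filter": {"type": "and", "fields": filters},
        # Assumes a "count" metric was defined at ingestion time.
        "aggregations": [{"type": "longSum", "name": "views", "fieldName": "count"}],
    }


query = build_view_query(
    "123",
    "2011-01-01T00:00:00Z",
    "2011-01-02T00:00:00Z",
    site="abc.com",
    country="IN",
)
body = json.dumps(query, indent=2)
```

A body like this is typically sent as an HTTP POST to a broker node. Adding another dimension to the question only adds one more entry to the filter list, which is the ad-hoc flexibility that was missing in the earlier experiments.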

Page 23: Case Study: Realtime Analytics with Druid

Setting Up Druid In Production

[Diagram. Components: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, CASSANDRA.]

Page 24: Case Study: Realtime Analytics with Druid

Druid’s Reliability Check

[Diagram. Components: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, RAW FILE CONSUMER, RAW FILES (three instances), and a "Job To Test Druid’s Integrity".]
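
A reliability check of this kind can be sketched as a job that recounts events from the raw files and compares the totals with what Druid reports. The raw line format (`timestamp,ad,site,action`), the hourly bucketing and the function names below are assumptions for illustration; the slides do not show the actual job.

```python
from collections import Counter


def count_raw_events(lines):
    """Count events per (hour, action) from raw lines of the hypothetical
    form 'timestamp,ad,site,action'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        timestamp, _ad, _site, action = line.split(",")
        counts[(timestamp[:13], action)] += 1
    return counts


def find_mismatches(raw_counts, druid_counts):
    """Return {key: (raw, druid)} for every key whose counts differ."""
    mismatches = {}
    for key in set(raw_counts) | set(druid_counts):
        raw = raw_counts.get(key, 0)
        druid = druid_counts.get(key, 0)
        if raw != druid:
            mismatches[key] = (raw, druid)
    return mismatches


raw_lines = [
    "2011-01-01T01:01:27Z,123,abc.com,LOAD",
    "2011-01-01T01:01:33Z,234,abcd.com,LOAD",
    "2011-01-01T01:01:40Z,123,abc.com,START",
]
# Stand-in for counts fetched from Druid for the same hour.
druid_counts = {("2011-01-01T01", "LOAD"): 2, ("2011-01-01T01", "START"): 0}
mismatches = find_mismatches(count_raw_events(raw_lines), druid_counts)
```

An empty result means the two independent paths (raw files and Druid) agree for the checked period; any entry points at the hour and event type to investigate.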

Page 25: Case Study: Realtime Analytics with Druid

A Quick Demo

Page 26: Case Study: Realtime Analytics with Druid
Page 27: Case Study: Realtime Analytics with Druid

Druid Architecture

[Architecture diagram. Druid nodes: REALTIME NODES, COORDINATOR NODES, HISTORICAL NODES, BROKER NODES. External dependencies: DEEP STORAGE, ZOOKEEPER, MySQL. Inputs: Streaming Data, Client Queries. Arrow legend: Queries, MetaData, Data/Segments.]

Page 28: Case Study: Realtime Analytics with Druid

Druid Data Ingestion

[Architecture diagram. Druid nodes: REALTIME NODES, COORDINATOR NODES, HISTORICAL NODES, BROKER NODES. External dependencies: DEEP STORAGE, ZOOKEEPER, MySQL. Inputs: Streaming Data, Client Queries. Arrow legend: Queries, MetaData, Data/Segments.]

Page 29: Case Study: Realtime Analytics with Druid

Druid Data Ingestion (Our System)

[Diagram. Components: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID Real-time Node.]

Page 30: Case Study: Realtime Analytics with Druid

Druid Data Retrieval

[Architecture diagram. Druid nodes: REALTIME NODES, COORDINATOR NODES, HISTORICAL NODES, BROKER NODES. External dependencies: DEEP STORAGE, ZOOKEEPER, MySQL. Inputs: Streaming Data, Client Queries. Arrow legend: Queries, MetaData, Data/Segments.]

Page 31: Case Study: Realtime Analytics with Druid

Coordinator Nodes

[Architecture diagram. Druid nodes: REALTIME NODES, COORDINATOR NODES, HISTORICAL NODES, BROKER NODES. External dependencies: DEEP STORAGE, ZOOKEEPER, MySQL. Inputs: Streaming Data, Client Queries. Arrow legend: Queries, MetaData, Data/Segments.]

Page 32: Case Study: Realtime Analytics with Druid

Druid Data Segment Propagation

[Architecture diagram. Druid nodes: REALTIME NODES, COORDINATOR NODES, HISTORICAL NODES. External dependencies: DEEP STORAGE, ZOOKEEPER, MySQL. Input: Streaming Data. Arrow legend: Queries, MetaData, Data/Segments.]

Page 33: Case Study: Realtime Analytics with Druid

Our Production Stats

•Over 200 million events per day – ingested into Druid cluster

•4 boxes with 8 cores, 64GB RAM, 1TB SSD

•2 coordinator nodes (only one master)

•2 real-time nodes

•4 historical nodes (on each box)
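
For a sense of scale, the daily figure can be converted into an average ingestion rate. The arithmetic below treats 200 million as a lower bound ("over 200 million") and assumes, purely for illustration, that traffic is spread evenly over the day and across the two real-time nodes; real traffic will have peaks well above the average.

```python
EVENTS_PER_DAY = 200_000_000      # "over 200 million events per day"
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400
REALTIME_NODES = 2

# Average rate across the whole cluster.
avg_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Average rate per real-time node, assuming an even split (an assumption).
avg_per_realtime_node = avg_per_second / REALTIME_NODES

print(f"{avg_per_second:.0f} events/s overall, "
      f"{avg_per_realtime_node:.0f} events/s per real-time node")
```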

Page 34: Case Study: Realtime Analytics with Druid

Companies Using Druid

Page 35: Case Study: Realtime Analytics with Druid

Questions?

Page 36: Case Study: Realtime Analytics with Druid