Big Data Real Time Analytics - A Facebook Case Study


Upload: nati-shalom

Post on 20-Aug-2015


TRANSCRIPT

Page 1: Big Data Real Time Analytics - A Facebook Case Study

SINGLE PLATFORM. COMPLETE SCALABILITY.

Real Time Analytics for Big Data: Lessons from Facebook

Page 2: Big Data Real Time Analytics - A Facebook Case Study

The Real Time Boom..

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Google Real Time Web Analytics

Google Real Time Search

Facebook Real Time Social Analytics

Twitter paid tweet analytics

SaaS Real Time User Tracking

New Real Time Analytics Startups..

Page 3: Big Data Real Time Analytics - A Facebook Case Study

Analytics @ Twitter

Page 4: Big Data Real Time Analytics - A Facebook Case Study

Note the Time dimension

Page 5: Big Data Real Time Analytics - A Facebook Case Study

The data resolution & processing models

Page 6: Big Data Real Time Analytics - A Facebook Case Study

Traditional analytics applications

• Scale-up database
  – Use a traditional SQL database
  – Use stored procedures for event-driven reports
  – Use flash memory disks to reduce disk I/O
  – Use read-only replicas to scale out read queries

• Limitations
  – Doesn’t scale on write
  – Extremely expensive (HW + SW)


Page 7: Big Data Real Time Analytics - A Facebook Case Study

CEP – Complex Event Processing

• Process the data as it comes
• Maintain a window of the data in memory

• Pros
  – Extremely low latency
  – Relatively low cost

• Cons
  – Hard to scale (mostly limited to scale-up)
  – Not agile: queries must be pre-generated
  – Fairly complex
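
To make the "window of data in memory" idea concrete, here is a minimal sketch of a sliding-window event counter. The class name `SlidingWindowCounter` and its methods are hypothetical and are not taken from any CEP product; the sketch also assumes events arrive in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window counter; assumes timestamps never decrease. */
class SlidingWindowCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records one event and returns how many events fall inside the window. */
    int onEvent(long timestampMillis) {
        timestamps.addLast(timestampMillis);
        evictOlderThan(timestampMillis - windowMillis);
        return timestamps.size();
    }

    /** Drops events at or before the cutoff, so the window is (now - window, now]. */
    private void evictOlderThan(long cutoffMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= cutoffMillis) {
            timestamps.removeFirst();
        }
    }

    public static void main(String[] args) {
        SlidingWindowCounter counter = new SlidingWindowCounter(1_000);
        System.out.println(counter.onEvent(0));
        System.out.println(counter.onEvent(500));
        System.out.println(counter.onEvent(1_200));
    }
}
```

Because only the events inside the window are kept, memory use is bounded by the event rate times the window length, which is one reason this model is low-latency but tied to a single machine's memory.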


Page 8: Big Data Real Time Analytics - A Facebook Case Study

In Memory Data Grid

• Distributed in-memory database
• Scale out

• Pros
  – Scales on write/read
  – Fits both the event-driven (CEP-style) and ad-hoc query models

• Cons
  – Cost of memory vs. disk
  – Memory capacity is limited
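
One way to picture how a data grid scales out on writes is hash partitioning: each key is routed to exactly one partition, so adding partitions spreads the write load. The sketch below is an assumption-laden illustration, not any product's API; `PartitionedStore` is a hypothetical name, each map stands in for one grid node, and the class is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical hash-partitioned store; each map stands in for one grid node. */
class PartitionedStore {
    private final List<Map<String, Long>> partitions = new ArrayList<>();

    PartitionedStore(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    /** Routes a key to a partition; floorMod keeps the index non-negative. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void put(String key, long value) {
        partitions.get(partitionFor(key)).put(key, value);
    }

    /** Returns the stored value, or null if the key is absent. */
    Long get(String key) {
        return partitions.get(partitionFor(key)).get(key);
    }

    int partitionSize(int index) {
        return partitions.get(index).size();
    }
}
```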


Page 9: Big Data Real Time Analytics - A Facebook Case Study

NoSQL

• Use a distributed database
  – HBase, Cassandra, MongoDB

• Pros
  – Scales on write/read
  – Elastic

• Cons
  – Read latency
  – Consistency tradeoffs are hard
  – Maturity: fairly young technology


Page 10: Big Data Real Time Analytics - A Facebook Case Study

Hadoop MapReduce

• Distributed batch processing

• Pros
  – Designed to process massive amounts of data
  – Mature
  – Low cost

• Cons
  – Not real-time


Page 11: Big Data Real Time Analytics - A Facebook Case Study

Hadoop Map/Reduce – Reality check..


Page 12: Big Data Real Time Analytics - A Facebook Case Study

So what’s the bottom line?


Page 13: Big Data Real Time Analytics - A Facebook Case Study

Facebook Real-time Analytics System


Page 14: Big Data Real Time Analytics - A Facebook Case Study

Goals

• Show why plugins are valuable.
  – What value is your business deriving from them?

• Make the data more actionable.
  – Help users take action to make their content more valuable.
  – How many people see a plugin, how many take action on it, and how many are converted to traffic back to your site.

• Make the data more timely.
  – Went from a 48-hour turnaround to 30 seconds.
  – Multiple points of failure were removed to meet this goal.

• Handle massive load
  – 20 billion events per day (200,000 events per second)
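
As a quick sanity check on the load figure, the short sketch below converts a daily event volume into an average per-second rate. The class `EventRate` is hypothetical. For 20 billion events per day the average works out to about 231,000 per second, so the slide's 200,000 reads as a round figure; the slide does not say whether it refers to average or peak load.

```java
/** Hypothetical helper for converting a daily event volume to an average rate. */
class EventRate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Average events per second for a daily volume (integer division, rounds down). */
    static long averagePerSecond(long eventsPerDay) {
        return eventsPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // 20 billion events per day, as stated on the Goals slide.
        System.out.println(averagePerSecond(20_000_000_000L));
    }
}
```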


Page 15: Big Data Real Time Analytics - A Facebook Case Study

The actual analytics..

• Like button analytics

• Comments box analytics


Page 16: Big Data Real Time Analytics - A Facebook Case Study

Technology Evaluation

• MySQL DB counters
• In-memory counters
• MapReduce
• Cassandra
• HBase


Page 17: Big Data Real Time Analytics - A Facebook Case Study

The solution..

[Architecture diagram. Components: Facebook servers writing logs, Scribe, PTail, Puma, HBase, HDFS. Labels: Real Time, Long Term, Batch 1.5 sec.]

10,000 write/sec per server
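
The "Batch 1.5 sec" label suggests that events are aggregated in memory and written to the datastore in batches rather than one by one. The sketch below illustrates that general idea only; it is not Puma's actual code. `BatchingCounter`, its constructor parameters, and the `sink` callback (which stands in for the datastore write) are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical batching aggregator; illustrates the idea only, not Puma's code. */
class BatchingCounter {
    private final long flushIntervalMillis;
    private final Consumer<Map<String, Long>> sink; // stands in for the datastore write
    private Map<String, Long> pending = new HashMap<>();
    private long lastFlushMillis;

    BatchingCounter(long flushIntervalMillis, long startMillis,
                    Consumer<Map<String, Long>> sink) {
        this.flushIntervalMillis = flushIntervalMillis;
        this.lastFlushMillis = startMillis;
        this.sink = sink;
    }

    /** Adds one event for the key; flushes once the interval has elapsed. */
    void increment(String key, long nowMillis) {
        pending.merge(key, 1L, Long::sum);
        if (nowMillis - lastFlushMillis >= flushIntervalMillis) {
            flush(nowMillis);
        }
    }

    /** Hands the accumulated counts to the sink as one batch and starts a new one. */
    void flush(long nowMillis) {
        if (!pending.isEmpty()) {
            sink.accept(pending);
            pending = new HashMap<>();
        }
        lastFlushMillis = nowMillis;
    }
}
```

With a 1,500 ms interval, many increments to the same key collapse into a single write per batch, which is how batching trades a small delay for far fewer datastore operations.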

Page 18: Big Data Real Time Analytics - A Facebook Case Study

Checking the assumptions..


Page 19: Big Data Real Time Analytics - A Facebook Case Study

Facebook Analytics.Next..

• What if..
  – We can rely on memory as a reliable store?
  – We can’t decide on a particular NoSQL database?
  – We need to package the solution as a product?

Page 20: Big Data Real Time Analytics - A Facebook Case Study

Step 1: Use memory..

• Instead of treating memory as a cache, why not treat it as a primary data store?
  – Facebook keeps 80% of its data in memory (Stanford research)
  – RAM is 100-1000x faster than disk (random seek)
    • Disk: 5-10 ms
    • RAM: ~0.001 ms

[Diagram labels: Events; Facebook servers; Memory Grid; Data Grid (several instances)]

Page 21: Big Data Real Time Analytics - A Facebook Case Study

Step 1: Use memory..

• Reliability is achieved through redundancy and replication

• One Data. Any API

[Diagram labels: Events; Facebook servers; Any API; Data Grid]
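
The claim that reliability comes from redundancy and replication can be sketched with a primary/backup pair: every write goes to both copies, so losing the primary loses no data. This is a single-process illustration under stated assumptions, not how any particular grid implements it; `ReplicatedPartition` and `failPrimary` are hypothetical names, and a real grid keeps the copies on separate machines.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical primary/backup pair; a real grid replicates across machines. */
class ReplicatedPartition {
    private Map<String, Long> primary = new HashMap<>();
    private Map<String, Long> backup = new HashMap<>();

    /** Writes to the primary and synchronously copies the entry to the backup. */
    void put(String key, long value) {
        primary.put(key, value);
        backup.put(key, value);
    }

    /** Reads are served from the primary copy. */
    Long get(String key) {
        return primary.get(key);
    }

    /** Simulates losing the primary: promote the backup, then rebuild a new backup. */
    void failPrimary() {
        primary = backup;
        backup = new HashMap<>(primary);
    }
}
```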

Page 22: Big Data Real Time Analytics - A Facebook Case Study

Step 2 – Collocate

[Diagram labels: Events; Facebook servers; Processing Grid; Data Grid (several instances)]

• Putting the code together with the data.

Page 23: Big Data Real Time Analytics - A Facebook Case Study

Step 2 – Collocate

[Diagram labels: Events; Facebook servers; Processing Grid; Data Grid (several instances)]

• Putting the code together with the data.

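    import org.openspaces.events.EventDriven;
    import org.openspaces.events.EventTemplate;
    import org.openspaces.events.adapter.SpaceDataEvent;
    import org.openspaces.events.polling.Polling;

    // Polling listener that runs next to the data it processes.
    // Data is the application's own space class (not shown on the slide).
    @EventDriven
    @Polling
    public class SimpleListener {

        // Template: match only Data objects that are not yet processed.
        @EventTemplate
        Data unprocessedData() {
            Data template = new Data();
            template.setProcessed(false);
            return template;
        }

        // Called for each matching object.
        @SpaceDataEvent
        public Data eventListener(Data event) {
            // process Data here
            event.setProcessed(true); // assumption: mark done so it is not matched again
            return event;
        }
    }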

Page 24: Big Data Real Time Analytics - A Facebook Case Study

Step 3 – Write behind to SQL/NoSQL

[Diagram labels: Events; Facebook servers; Processing Grid; Data Grid (several instances); Write Behind; Open Long Term persistency]
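
Write-behind means the in-memory copy is updated immediately and the database is updated later, off the request path. The following is a minimal sketch of that pattern under stated assumptions: `WriteBehindStore` is a hypothetical class, the "database" is just a map, and `flush()` is called explicitly where a real system would run it asynchronously.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical write-behind store; the "database" is just a map here. */
class WriteBehindStore {
    private final Map<String, Long> memory = new HashMap<>();
    private final Set<String> dirtyKeys = new LinkedHashSet<>();
    private final Map<String, Long> database; // stands in for SQL/NoSQL persistence

    WriteBehindStore(Map<String, Long> database) {
        this.database = database;
    }

    /** Updates memory immediately and only records that the key needs persisting. */
    void put(String key, long value) {
        memory.put(key, value);
        dirtyKeys.add(key);
    }

    Long get(String key) {
        return memory.get(key);
    }

    /** Persists the latest value of each dirty key; returns how many were written. */
    int flush() {
        int written = 0;
        for (String key : dirtyKeys) {
            database.put(key, memory.get(key));
            written++;
        }
        dirtyKeys.clear();
        return written;
    }
}
```

Repeated updates to the same key between flushes result in a single database write, which reduces load on the long-term store.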

Page 25: Big Data Real Time Analytics - A Facebook Case Study

Economic Data Scaling

• Combine memory and disk
  – Memory costs 100x to 1000x less than disk at high data access rates (Stanford research)
  – Disk costs less for high capacity at lower access rates
  – Solution:
    • Memory: short-term data
    • Disk: long-term data
  – Only ~16 GB is required to store the log in memory (500b messages at 10k/h), at a cost of ~$32 per month per server.
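
The sizing figure can be checked with simple arithmetic. The slide's "10k/h" is ambiguous; one reading that lands close to the quoted ~16G is 500-byte messages arriving at 10,000 per second and retained for one hour. That reading is an assumption, as are the names `LogMemoryEstimate`, `bytesRequired`, and `toGiB` in the sketch below.

```java
/** Hypothetical sizing helper; the input figures are one reading of the slide. */
class LogMemoryEstimate {
    /** Bytes needed to hold messages arriving at a steady rate for a retention period. */
    static long bytesRequired(long messageBytes, long messagesPerSecond, long retentionSeconds) {
        return messageBytes * messagesPerSecond * retentionSeconds;
    }

    /** Converts bytes to gibibytes (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Assumption: 500-byte messages, 10,000 per second, one hour retained.
        long bytes = bytesRequired(500, 10_000, 3_600);
        System.out.println(bytes + " bytes, about " + toGiB(bytes) + " GiB");
    }
}
```

Under that assumption the result is 18,000,000,000 bytes, or roughly 16.8 GiB, which is in line with the slide's estimate.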

[Chart labels: Memory, Disk]

Page 26: Big Data Real Time Analytics - A Facebook Case Study

Economic Scaling

• Automation: reduce operational cost
• Elastic scaling: reduce over-provisioning cost
• Cloud portability (JClouds): choose the right cloud for the job
• Cloud bursting: scavenge extra capacity when needed


Page 27: Big Data Real Time Analytics - A Facebook Case Study

Putting it all together

[Architecture diagram: Event Sources; Analytic Application; Write behind; Generate Patterns]

In Memory Data Grid and RT Processing Grid
• Light event processing
• Map-reduce
• Event driven
• Execute code with data
• Transactional
• Secured
• Elastic

NoSQL DB
• Low cost storage
• Write/read scalability
• Dynamic scaling
• Raw data and aggregated data

Page 28: Big Data Real Time Analytics - A Facebook Case Study

Putting it all together


    RScript script = new StaticScript("groovy", "println 'hi'; return 0");
    Query q = em.createNativeQuery("execute ?");
    q.setParameter(1, script);
    Integer result = (Integer) q.getSingleResult();

Real Time Map/Reduce
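
Real-time map/reduce over a data grid can be pictured as scatter/gather: each partition computes a partial result over its own in-memory data, and the partials are then combined. The sketch below shows that shape on plain Java collections. It is not the GigaSpaces API; `PartitionMapReduce` and its methods are hypothetical, and a parallel stream stands in for running the map step on each grid node.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical scatter/gather over in-memory partitions; not the GigaSpaces API. */
class PartitionMapReduce {
    /** Map step: each partition sums its own counters locally. */
    static long localSum(Map<String, Long> partition) {
        return partition.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Reduce step: combine the per-partition partial sums into one total. */
    static long totalAcrossPartitions(List<Map<String, Long>> partitions) {
        return partitions.parallelStream()
                .mapToLong(PartitionMapReduce::localSum)
                .sum();
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partitions = List.of(
                Map.of("/a", 3L, "/b", 4L),
                Map.of("/c", 5L));
        System.out.println(totalAcrossPartitions(partitions)); // prints 12
    }
}
```

Only the small partial sums cross partition boundaries, which is the point of executing code where the data lives.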

Page 29: Big Data Real Time Analytics - A Facebook Case Study

5x better performance per server!

• Hardware
  – Linux
  – HP DL380 G6 servers, each with:
    • 2 Intel quad-core Xeon X5560 processors (2.8 GHz Nehalem)
    • 32 GB RAM (4 GB per core)
    • 6 x 146 GB 15K RPM SAS disks
    • Red Hat 5.2

[Benchmark diagram: Event injector (up to 128 threads); GigaSpaces / other messaging server; App services (up to 128 threads). Chart compares "Other" and "Giga".]

50,000 write/sec per server

Page 30: Big Data Real Time Analytics - A Facebook Case Study

Live demo

Inter Day Activity (Real Time)

Monthly Trend Analysis

Page 31: Big Data Real Time Analytics - A Facebook Case Study

5 Big Data Predictions


Page 32: Big Data Real Time Analytics - A Facebook Case Study

Summary

Big Data Development Made Simple
Focus on your business logic; use a Big Data platform to deal with scalability, performance, continuous availability, and so on.

It's Open: Use Any Stack, Avoid Lock-in
Any database (RDBMS or NoSQL); any cloud; use common APIs and frameworks.

All While Minimizing Cost
Use memory and disk for optimum cost/performance.
Built-in automation and management reduce operational costs.
Elasticity reduces over-provisioning cost.

Page 33: Big Data Real Time Analytics - A Facebook Case Study

Further reading..


Page 34: Big Data Real Time Analytics - A Facebook Case Study

Thank YOU!

@natishalom
http://blog.gigaspaces.com