Big Data Real Time Analytics: A Facebook Case Study
TRANSCRIPT
SINGLE PLATFORM. COMPLETE SCALABILITY.
Real Time Analytics for Big Data: Lessons from Facebook
The Real Time Boom..
® Copyright 2011 GigaSpaces Ltd. All Rights Reserved
Google Real Time Web Analytics
Google Real Time Search
Facebook Real Time Social Analytics
Twitter paid tweet analytics
SaaS Real Time User Tracking
New Real Time Analytics Startups..
Analytics @ Twitter
Note the Time dimension
The data resolution & processing models
Traditional analytics applications
• Scale-up database
– Use a traditional SQL database
– Use stored procedures for event-driven reports
– Use flash memory disks to reduce disk I/O
– Use read-only replicas to scale out read queries
• Limitations
– Doesn't scale on write
– Extremely expensive (HW + SW)
CEP – Complex Event Processing
• Process the data as it arrives
• Maintain a window of the data in-memory
• Pros:
– Extremely low latency
– Relatively low cost
• Cons:
– Hard to scale (mostly limited to scale-up)
– Not agile: queries must be pre-generated
– Fairly complex
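The windowing idea above can be sketched in a few lines. This is a minimal illustration, not any particular CEP product's API: each incoming event is processed immediately, and only events inside a fixed time window are kept in memory. The pre-defined "query" here is simply the count of events in the window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal CEP-style sketch: process each event as it arrives and keep
// only a bounded time window of recent events in memory.
public class SlidingWindowCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Called for every incoming event; evicts events older than the window.
    public void onEvent(long eventTimeMillis) {
        timestamps.addLast(eventTimeMillis);
        while (!timestamps.isEmpty()
                && eventTimeMillis - timestamps.peekFirst() > windowMillis) {
            timestamps.removeFirst();
        }
    }

    // The pre-generated query: how many events fell inside the last window.
    public int countInWindow() {
        return timestamps.size();
    }

    public static void main(String[] args) {
        SlidingWindowCounter counter = new SlidingWindowCounter(1000);
        counter.onEvent(0);
        counter.onEvent(700);
        counter.onEvent(1600); // evicts the event at t=0 (outside the 1s window)
        System.out.println(counter.countInWindow()); // 2
    }
}
```

Note how the "not agile" con shows up: changing the question (say, a per-key count) means changing the code, not just issuing a new query.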
In Memory Data Grid
• Distributed in-memory database
• Scale-out
• Pros:
– Scales on write and read
– Fits both the event-driven (CEP-style) and ad-hoc query models
• Cons:
– Cost of memory vs. disk
– Memory capacity is limited
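The reason an in-memory data grid scales on write is partitioning: each key is routed by hash to one of N partitions, so load spreads across servers instead of funneling into a single database. The sketch below is hypothetical (it is not the GigaSpaces API); the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hash-based partition routing in a data grid.
// Each partition would live on its own server; here they are just maps.
public class GridRouter {
    private final List<Map<String, Object>> partitions = new ArrayList<>();

    public GridRouter(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Deterministic routing: the same key always lands on the same partition.
    static int partitionOf(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public void put(String key, Object value) {
        partitions.get(partitionOf(key, partitions.size())).put(key, value);
    }

    public Object get(String key) {
        return partitions.get(partitionOf(key, partitions.size())).get(key);
    }
}
```

Because different keys hash to different partitions, adding partitions adds both write and read capacity, which is exactly the "scales on write/read" pro listed above.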
NoSQL
• Use a distributed database
– HBase, Cassandra, MongoDB
• Pros:
– Scales on write and read
– Elastic
• Cons:
– Read latency
– Consistency tradeoffs are hard
– Maturity: fairly young technology
Hadoop MapReduce
• Distributed batch processing
• Pros:
– Designed to process massive amounts of data
– Mature
– Low cost
• Cons:
– Not real-time
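The batch model can be illustrated with a toy word-count-style aggregation. A real Hadoop job runs the same shape across many machines (map each record to a key, then reduce by key); the point of this single-process sketch is the latency con: nothing is visible until the whole batch finishes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy batch aggregation: count events by type over a complete input set.
// Hadoop distributes the map and reduce phases; the shape is the same.
public class BatchCount {
    static Map<String, Long> countByKey(List<String> events) {
        return events.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("like", "comment", "like");
        // Prints the per-key counts, e.g. like=2, comment=1 (map order may vary)
        System.out.println(countByKey(events));
    }
}
```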
Hadoop Map/Reduce – Reality check..
So what’s the bottom line?
Facebook Real-Time Analytics System
Goals
• Show why plugins are valuable.
– What value is your business deriving from them?
• Make the data more actionable.
– Help users take action to make their content more valuable.
– How many people see a plugin, how many take action on it, and how many convert to traffic back on your site.
• Make the data more timely.
– Went from a 48-hour turnaround to 30 seconds.
– Multiple points of failure were removed to meet this goal.
• Handle massive load.
– 20 billion events per day (about 200,000 events per second)
The actual analytics..
• Like button analytics
• Comments box analytics
Technology Evaluation
• MySQL DB counters
• In-memory counters
• MapReduce
• Cassandra
• HBase
The solution..

[Diagram: log events from Facebook are collected by Scribe into log files; PTail tails the logs and Puma aggregates them in ~1.5-second batches into HBase for real-time serving, while the same logs flow into HDFS for long-term batch processing.]

• 10,000 writes/sec per server
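Putting the two numbers together gives a back-of-the-envelope cluster size: total write rate divided by per-server write capacity is the minimum server count, before any headroom or replication factor (which a production deployment would presumably add).

```java
// Sizing arithmetic from the figures above: ~200,000 events/sec total,
// 10,000 writes/sec per server.
public class SizingMath {
    static long minServers(long totalWritesPerSec, long writesPerSecPerServer) {
        // Ceiling division: round up to cover the full load.
        return (totalWritesPerSec + writesPerSecPerServer - 1) / writesPerSecPerServer;
    }

    public static void main(String[] args) {
        System.out.println(minServers(200_000, 10_000)); // 20
    }
}
```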
Checking the assumptions..
Facebook Analytics.Next..
• What if..
– we could rely on memory as a reliable store?
– we can't decide on a particular NoSQL database?
– we need to package the solution as a product?
Step 1: Use memory..
• Instead of treating memory as a cache, why not treat it as the primary data store?
– Facebook keeps 80% of its data in memory (Stanford research)
– RAM is 100-1000x faster than disk for random access
• Disk random seek: 5-10 ms
• RAM access: ~0.001 ms
[Diagram: events flowing into a memory grid of replicated Data Grid partitions]
Step 1: Use memory..
• Reliability is achieved through redundancy and replication
• One Data. Any API
[Diagram: events arriving through any API into the replicated Data Grid]
Step 2 – Collocate
[Diagram: events flowing into a Processing Grid collocated with the Data Grid partitions]
• Putting the code together with the data.
@EventDriven @Polling
public class SimpleListener {

    @EventTemplate
    Data unprocessedData() {
        Data template = new Data();
        template.setProcessed(false);
        return template;
    }

    @SpaceDataEvent
    public Data eventListener(Data event) {
        // process Data here
        return event;
    }
}
Step 3 – Write behind to SQL/NoSQL
[Diagram: events flow into the Processing Grid and Data Grid; updates are persisted asynchronously (write-behind) to an open long-term store.]
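The write-behind pattern named above can be sketched as follows. This is a hypothetical illustration, not the GigaSpaces API: writes complete against the in-memory grid immediately, while persistence to the long-term store (SQL or NoSQL) happens later, off the hot path.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical write-behind sketch: the grid is the system of record for
// the fast path; a queue tracks keys still pending persistence.
public class WriteBehindStore {
    private final Map<String, String> grid = new ConcurrentHashMap<>();
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<String, String> longTermStore; // stands in for SQL/NoSQL

    public WriteBehindStore(Map<String, String> longTermStore) {
        this.longTermStore = longTermStore;
    }

    // Fast path: the caller pays only for the in-memory write.
    public void put(String key, String value) {
        grid.put(key, value);
        pending.add(key);
    }

    // Reads are served straight from memory.
    public String get(String key) {
        return grid.get(key);
    }

    // In a real grid a background thread runs this periodically; it is
    // called explicitly here to keep the sketch deterministic.
    public void flush() {
        String key;
        while ((key = pending.poll()) != null) {
            longTermStore.put(key, grid.get(key)); // slow disk write, off the hot path
        }
    }

    public static void main(String[] args) {
        Map<String, String> disk = new HashMap<>();
        WriteBehindStore store = new WriteBehindStore(disk);
        store.put("likes:page42", "17");
        System.out.println(store.get("likes:page42")); // 17, before anything hit "disk"
        store.flush();
        System.out.println(disk.get("likes:page42")); // 17, now persisted
    }
}
```

The tradeoff is durability lag: a crash between `put` and `flush` loses the unpersisted updates unless the grid itself is replicated, which is why the slides pair write-behind with in-memory redundancy.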
Economic Data Scaling
• Combine memory and disk:
– Memory is 100-1000x cheaper than disk for high data-access rates (Stanford research)
– Disk is cheaper for high capacity at lower access rates
– Solution: memory for short-term data, disk for long-term data
– Only ~16 GB is required to store the log in memory (500b messages at 10k/h), at a cost of ~$32/month per server
Economic Scaling
• Automation: reduce operational cost
• Elastic scaling: reduce over-provisioning cost
• Cloud portability (JClouds): choose the right cloud for the job
• Cloud bursting: scavenge extra capacity when needed
Putting it all together
• Event sources feed the analytic application
• In-Memory Data Grid + RT Processing Grid:
– Light event processing
– Map/reduce
– Event driven
– Execute code with the data
– Transactional
– Secured
– Elastic
• Write-behind to a NoSQL DB:
– Low-cost storage
– Write/read scalability
– Dynamic scaling
– Raw data and aggregated data
• Generate patterns
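The "map/reduce" item in the processing grid above refers to scatter/gather over the in-memory partitions. The sketch below is hypothetical (not the GigaSpaces API): the query is sent to every partition, each partition aggregates its own slice in memory, and the partial results are merged on the caller side.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical real-time map/reduce sketch over in-memory partitions:
// map runs where the data lives, reduce merges the partial results.
public class ScatterGather {
    // Map phase: each partition counts its own events locally.
    static Map<String, Long> mapPhase(List<String> partitionEvents) {
        Map<String, Long> partial = new HashMap<>();
        for (String e : partitionEvents) {
            partial.merge(e, 1L, Long::sum);
        }
        return partial;
    }

    // Reduce phase: merge the per-partition partial counts.
    static Map<String, Long> reducePhase(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two partitions, each holding part of the event stream.
        Map<String, Long> p1 = mapPhase(Arrays.asList("like", "like", "comment"));
        Map<String, Long> p2 = mapPhase(Arrays.asList("like", "comment"));
        Map<String, Long> total = reducePhase(Arrays.asList(p1, p2));
        System.out.println(total.get("like")); // 3
    }
}
```

Unlike the Hadoop batch job shown earlier, the data here is already resident in memory on each partition, so the whole scatter/gather round trip can complete in milliseconds.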
RScript script = new StaticScript("groovy", "println hi; return 0");
Query q = em.createNativeQuery("execute ?");
q.setParameter(1, script);
Integer result = q.getSingleResult();
Real Time Map/Reduce
5x better performance per server!
• Hardware:
– HP DL380 G6 servers, each with:
• 2 Intel quad-core Xeon X5560 processors (2.8 GHz Nehalem)
• 32 GB RAM (4 GB per core)
• 6 x 146 GB 15K RPM SAS disks
– Linux (Red Hat 5.2)
[Benchmark setup: an event injector with up to 128 threads feeds either GigaSpaces or another messaging server, with app services running up to 128 threads on each stack.]

• 50,000 writes/sec per server
Live demo
Intra-Day Activity (Real Time)
Monthly Trend Analysis
5 Big Data Predictions
Summary
Big Data development made simple: focus on your business logic, and use a Big Data platform to deal with scalability, performance, and continuous availability.
It's open: use any stack and avoid lock-in. Any database (RDBMS or NoSQL), any cloud; use common APIs and frameworks.
All while minimizing cost: use memory and disk for the optimum cost/performance. Built-in automation and management reduce operational costs; elasticity reduces over-provisioning cost.
Further reading..
Thank YOU!
@natishalom
http://blog.gigaspaces.com