Big Data and NoSQL in Real Time
DESCRIPTION
Explains the challenge of real-time analytics in big data and NoSQL applications, using Facebook and Twitter as examples.
TRANSCRIPT
Big Data and NoSQL in REAL TIME: Facebook and Twitter Examples
Ron Zavner
Agenda
• Our real time world…
• Flavors of Big Data
• Facebook messaging and real time analytics system
• Twitter analytics system
• Winning architecture?
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
What is Real Time?
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
The Two Vs of Big Data
• Velocity
• Volume
The Flavors of Big Data Analytics
• Counting
• Correlating
• Research
Analytics – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics: countries and cities, gender, age groups, device types, …
Analytics – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
Analytics – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
It’s All about Timing
• Event driven / stream processing: high resolution (every tweet gets counted)
• Ad-hoc querying: medium resolution
• Long-running batch jobs (ETL, map/reduce): low resolution (trends & patterns)
This is what we’re here to discuss
FACEBOOK REAL-TIME ANALYTICS SYSTEM
Store 135+ Billion Messages A Month
The actual analytics…
Like button analytics
Comments box analytics
Goals
• Show why plugins are valuable
• Make the data more actionable
• Make the data more timely
• Remove points of failure
• Handle massive load: 200K events per second
Technology Evaluation
• MySQL DB Counters
• In-Memory Counters
• MapReduce
• Cassandra
• HBase
The solution…
[Architecture diagram. Components: Facebook logs, Scribe, HDFS, PTail, Puma, HBase. Labels: Real Time, Long Term, Batch 1.5 Sec, 10,000 writes/sec per server.]
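The "Batch 1.5 Sec" label suggests that events are aggregated in memory and written out periodically rather than one write per event. The following is a minimal Python sketch of that idea, not Facebook's actual code: the class name `BatchingCounter`, the `sink` callable, and the injectable `clock` are all hypothetical names introduced for illustration, and the 1.5 second default is taken from the slide.

```python
import time
from collections import Counter


class BatchingCounter:
    """Accumulate increments in memory and hand them to a sink in batches."""

    def __init__(self, sink, flush_interval=1.5, clock=time.monotonic):
        self.sink = sink                      # hypothetical callable that persists one batch
        self.flush_interval = flush_interval  # seconds between flushes (1.5 on the slide)
        self.clock = clock                    # injectable so the timing can be tested
        self.pending = Counter()
        self.last_flush = clock()

    def increment(self, key, amount=1):
        self.pending[key] += amount
        # Flush only when the interval has elapsed, so many events
        # for the same key collapse into a single write.
        if self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink(dict(self.pending))
            self.pending.clear()
        self.last_flush = self.clock()


if __name__ == "__main__":
    store = Counter()  # stands in for the durable store in this sketch
    counter = BatchingCounter(sink=store.update)
    for url in ["a.com", "b.com", "a.com"]:
        counter.increment(("like", url))
    counter.flush()    # force out whatever is still buffered
    print(store)
```

The trade-off in this sketch is that anything still buffered when a process dies is lost, so a shorter interval means less data at risk but more writes.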
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
TWITTER REAL-TIME ANALYTICS SYSTEM
Twitter Reach – Here’s One Use Case
Let’s start with some statistics ….
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
It takes a week for users to
send 1 billion Tweets.
On average,
140 million tweets get sent every day.
The highest throughput to date is
6,939 tweets/sec.
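To see how these figures relate, the short Python calculation below converts the daily average into a per-second rate and compares it with the quoted peak. It uses only the numbers on the slides. The average comes to about 1,620 tweets/sec, so the peak is roughly four times the average, which is the headroom a real-time system would need to absorb.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000   # from the slide: average tweets per day
peak_per_second = 6_939        # from the slide: highest throughput to date

average_per_second = tweets_per_day / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second
days_to_one_billion = 1_000_000_000 / tweets_per_day

print(f"average rate:    {average_per_second:.0f} tweets/sec")
print(f"peak / average:  {peak_to_average:.1f}x")
print(f"days to 1 billion tweets: {days_to_one_billion:.1f}")
```

The last figure agrees with the earlier statement that users send 1 billion tweets in about a week.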
460,000 new accounts
are created daily.
5% of the users generate
75% of the content.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
Challenge – Word Count
[Diagram: Tweets → Count? → Word:Count]
• Hottest topics
• URL mentions
• etc.
Word Count - Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: need to process in near real time
• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant
Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
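The tokenizer, filterer, and counter stages named on this slide can be sketched as a chain of Python generators, where each stage consumes the events the previous one emits. This is an illustrative single-process sketch, not the code from the referenced repository: the function names, the tokenizing regex, and the `STOP_WORDS` set are all assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real filter would be much larger.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "in", "rt"}

# Match URLs first so they stay whole, then #hashtags, @mentions and plain words.
TOKEN_PATTERN = re.compile(r"https?://\S+|[#@]?\w+")


def tokenize(raw_tweets):
    """Raw tweet text -> list of tokens (URLs keep their original case)."""
    for tweet in raw_tweets:
        tokens = TOKEN_PATTERN.findall(tweet)
        yield [t if t.startswith("http") else t.lower() for t in tokens]


def filter_tokens(tokenized):
    """Drop stop words and single-character tokens."""
    for tokens in tokenized:
        yield [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def count(filtered, counters=None):
    """Aggregate one counter per word."""
    counters = Counter() if counters is None else counters
    for tokens in filtered:
        counters.update(tokens)
    return counters


if __name__ == "__main__":
    tweets = ["Storm is great", "The storm is coming #storm"]
    print(count(filter_tokens(tokenize(tweets))).most_common(3))
```

Because each stage only depends on the events it receives, the stages could be placed on different machines with a queue between them, which is the point of the event-driven layout.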
Sharding (Partitioning)
Tokenizer 1 Filterer 1
Tokenizer 2 Filterer 2
Tokenizer 3 Filterer 3
Tokenizer n Filterer n
Counter Updater 1
Counter Updater 2
Counter Updater 3
Counter Updater n
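One common way to spread the counter updaters across n partitions is to hash each word to a shard, so that every occurrence of a given word lands on the same updater. The Python sketch below assumes that approach. The slides do not say which partitioning function is used, and `shard_for` and `ShardedCounters` are hypothetical names.

```python
import zlib
from collections import Counter


def shard_for(word, num_shards):
    """Map a word to a shard index.

    crc32 is used because it is stable across processes, unlike the
    built-in hash() for strings, which is randomized per interpreter run.
    """
    return zlib.crc32(word.encode("utf-8")) % num_shards


class ShardedCounters:
    """n independent counters; each word is owned by exactly one of them."""

    def __init__(self, num_shards):
        self.shards = [Counter() for _ in range(num_shards)]

    def update(self, words):
        for word in words:
            self.shards[shard_for(word, len(self.shards))][word] += 1

    def count(self, word):
        return self.shards[shard_for(word, len(self.shards))][word]


if __name__ == "__main__":
    counters = ShardedCounters(num_shards=4)
    counters.update(["storm", "hbase", "storm"])
    print(counters.count("storm"))
```

Since no two shards ever hold the same word, the shards never need to coordinate, which is what lets the system scale by adding partitions. Changing the number of shards in this simple modulo scheme would remap most words, so a production system would need a rebalancing strategy.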
Computing Reach with Event Streams
Twitter Storm
Storm Overview
Storm Cluster
Streaming word count with Storm
Storm Limitation
[Diagram labels: Spouts, Bolt, Topologies]
• Storage
• Data Persistency
• Querying
Winner is… Storm & in-memory data grids
• Event driven / flow
• Reliable
• Storage
• Data Persistency
• Querying
References
Facebook messages: http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Facebook real-time analytics: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Learn and fork the code on GitHub: https://github.com/Gigaspaces/rt-analytics
Detailed blog post: http://bit.ly/gs-bigdata-analytics
Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
Twitter Storm: http://bit.ly/twitter-storm