Analytics for the Real-Time Web
Maxim Grinev, Maria Grineva and Martin Hentschel
Outline
•Real-time Web and its new requirements
•Our system: Triggy
• Short overview of Cassandra
• Triggy internal mechanisms
•Similar systems:
• Yahoo!’s S4
• Google’s Percolator
•Applications
Real-Time Web
•Web 2.0 + mobile devices = Real-Time Web
•People share what they are doing right now, discuss breaking news on Twitter, and share their current locations on Foursquare...
Analytics for the Real-Time Web: new requirements
•MapReduce - state-of-the-art for processing of Web 2.0 data
•Batch processing (MapReduce) is too slow now
•New requirements:
• real-time processing: aggregate values incrementally, as new data arrives
• database-intensive: aggregate values are stored in a database that is constantly being updated
Our System: Triggy
• Based on Cassandra, a distributed key-value store
• Provides a programming model similar to MapReduce, adapted to push-style processing
• Extends Cassandra with
• push-style procedures - to immediately propagate the data to computations;
• serialization - to ensure serialized access to aggregate values (counters)
• Easily scalable
Cassandra Overview: Data Model
• Data Model: key-value
• Extends basic key-value with 2 levels of nesting
• Super column - used when the second level of nesting is present
• Column family ~ table;
• key-value pair ~ record
• Keys are stored ordered
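One rough way to picture this data model is as nested maps. The sketch below uses plain Python dicts rather than Cassandra's API; the column family contents, the key format, and the helper `scan` are all made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Column family ~ table: maps a row key to a record (hypothetical data).
users = {
    "user:17": {"name": "alice", "lang": "en"},   # columns: one level of nesting
    "user:42": {"name": "bob", "lang": "de"},
}

# With a second level of nesting, each entry under a key is a super column
# that itself holds columns.
user_stats = {
    "user:17": {
        "2011-03-01": {"tweets": 5, "words": 61},  # super column
        "2011-03-02": {"tweets": 2, "words": 19},
    },
}

def scan(column_family, start, end):
    """Return (key, record) pairs with start <= key <= end, in key order."""
    keys = sorted(column_family)          # ordered keys make range scans simple
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, end)
    return [(k, column_family[k]) for k in keys[lo:hi]]
```

Calling `scan(users, "user:00", "user:20")` returns only the `user:17` record, which is the kind of range lookup that ordered keys make possible.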
Cassandra Overview: Incremental Scalability
• Incremental scalability requires a mechanism to dynamically partition data over the nodes
•Data partitioned by key using consistent hashing
• Advantage of consistent hashing: departure or arrival of a node affects only its immediate neighbors, other nodes remain unaffected
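The neighbor-only effect of consistent hashing can be checked with a small sketch. This is a minimal ring with one position per node and MD5 as an arbitrary hash; the class `HashRing`, the node names, and the key format are illustrative assumptions, not Cassandra's partitioner.

```python
import hashlib
from bisect import bisect_left, insort

class HashRing:
    """Minimal consistent-hash ring: one position per node, no virtual nodes."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _position(value):
        # Map a string to a point on the ring (MD5 is an arbitrary choice here).
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        insort(self._ring, (self._position(node), node))

    def remove(self, node):
        self._ring.remove((self._position(node), node))

    def owner(self, key):
        # A key belongs to the first node at or after its position, wrapping around.
        idx = bisect_left(self._ring, (self._position(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")  # a new node arrives
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every key that changed owner went to the new node, and all of them
# came from a single old node (the new node's neighbor on the ring).
assert all(after[k] == "node-d" for k in moved)
assert len({before[k] for k in moved}) <= 1
```

Keys owned by the other nodes are untouched, so only one existing node has to hand over data when the cluster grows.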
Cassandra Overview: Log-Structured Storage
•Optimized for write-intensive workloads (log-structured storage)
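To illustrate why log-structured storage favors writes, here is a toy sketch of the general idea: every write is an append plus an in-memory update, and full in-memory tables are frozen into immutable sorted segments. The class `LogStructuredStore` and its `memtable_limit` parameter are hypothetical and far simpler than what Cassandra actually does.

```python
class LogStructuredStore:
    """Toy log-structured store: writes append, reads check newest data first."""

    def __init__(self, memtable_limit=2):
        self.log = []          # append-only write log (would live on disk)
        self.memtable = {}     # recent writes, held in memory
        self.segments = []     # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))   # sequential append, no in-place update
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted segment that is never modified.
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # the newest segment wins
            if key in segment:
                return segment[key]
        return None
```

A write never has to find and rewrite an old value; the cost is pushed to reads, which may need to look through several segments.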
Triggy: Programming Model
•Modified MapReduce to support push-style processing
•Modified only reduce function: reduce*
• reduce* incrementally applies a new input value to an already existing aggregate value
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> (k2, v3)
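The difference between the batch Reduce and reduce* can be sketched as follows. The function names, the use of a sum as the aggregate, and the initial aggregate of 0 are assumptions made for illustration.

```python
def batch_reduce(key, values):
    """Classic Reduce(k2, list(v2)) -> (k2, v3): needs all values up front."""
    return key, sum(values)

def reduce_star(key, aggregate, new_value):
    """Incremental variant: fold one new value into the stored aggregate."""
    return key, aggregate + new_value

values = [3, 5, 7]

# Push style: values arrive one at a time and the aggregate is updated each time.
aggregate = 0  # assumed initial aggregate for a key that has no value yet
for value in values:
    _, aggregate = reduce_star("k", aggregate, value)

# For a sum, the incremental result matches the batch result.
assert ("k", aggregate) == batch_reduce("k", values)
```

The aggregate is always up to date after each input, instead of becoming available only once a whole batch has been processed.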
Triggy: Execution of Maps and Reduces
• We extend each node with a queue and worker threads (which execute Map and Reduce tasks buffered in the queue)
• Map tasks can be executed in parallel at any node in the system and do not require serialization because they do not share any data
• Execution of reduce* tasks has to be serialized for the same key to guarantee correct results:
• We make use of Cassandra’s partitioning strategy: equal keys are routed to the same node
• Serialization within a node: locks on keys that are being processed right now
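The within-node part of this scheme (a task queue, worker threads, and a lock per key) might look roughly like the sketch below. It simulates a single node in one process; `run_node`, the sum-based reduce*, and the sentinel-based shutdown are illustrative choices, not Triggy's code.

```python
import queue
import threading

def run_node(items, num_workers=4):
    """Process (key, value) reduce* tasks on one simulated node."""
    tasks = queue.Queue()              # buffer of pending tasks
    aggregates = {}                    # stands in for the node's slice of the database
    key_locks = {}                     # key -> lock, for keys seen so far
    registry_lock = threading.Lock()   # guards key_locks itself

    def lock_for(key):
        with registry_lock:
            return key_locks.setdefault(key, threading.Lock())

    def worker():
        while True:
            task = tasks.get()
            if task is None:           # sentinel: no more work
                return
            key, value = task
            with lock_for(key):        # only one reduce* per key at a time
                aggregates[key] = aggregates.get(key, 0) + value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:                  # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return aggregates

result = run_node([(f"user:{i % 10}", 1) for i in range(10_000)])
```

Tasks for different keys run in parallel, while the read-modify-write on any single key is never interleaved, so no increment is lost.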
Triggy: Fault Tolerance
•No fault tolerance guarantees for execution of map/reduce* tasks: intermediate data in the queue can be lost
•Not critical for analytical applications
Triggy: Scalability
•Computation and data storage are tightly coupled: by moving the data, you move the computation, which makes the system easy to scale
•A new node is placed next to the most loaded node, and part of that node's data is transferred to the new node
Experiments
• Generated workload: tweets with user ids (1 .. 100000) with uniform distribution
• The load generator issues as many requests as the system with N nodes can handle
• Application: count the number of words posted by each user
  Map: tweet => (user_id, number_of_words_in_tweet)
  Reduce: (user_id, number_of_words_total, number_of_words_in_tweet) => (user_id, number_of_words_total)
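A single-process sketch of this workload and application could look like the following. The vocabulary, tweet lengths, seed, and the in-memory `totals` dict standing in for the database are assumptions; only the user id range and the Map and Reduce signatures come from the slide.

```python
import random

NUM_USERS = 100_000
VOCAB = ["real", "time", "web", "triggy", "cassandra"]  # made-up vocabulary

def generate_tweets(n, seed=0):
    """Yield n synthetic (user_id, text) tweets with uniformly drawn user ids."""
    rng = random.Random(seed)
    for _ in range(n):
        user_id = rng.randint(1, NUM_USERS)   # inclusive on both ends
        text = " ".join(rng.choices(VOCAB, k=rng.randint(1, 20)))
        yield user_id, text

def map_tweet(user_id, text):
    """Map: tweet => (user_id, number_of_words_in_tweet)."""
    return user_id, len(text.split())

def reduce_star(user_id, number_of_words_total, number_of_words_in_tweet):
    """Reduce: add the new tweet's word count to the user's running total."""
    return user_id, number_of_words_total + number_of_words_in_tweet

def run(tweets):
    totals = {}  # stands in for the counters stored in the database
    for user_id, text in tweets:
        key, words = map_tweet(user_id, text)
        _, totals[key] = reduce_star(key, totals.get(key, 0), words)
    return totals

tweets = list(generate_tweets(1000))
totals = run(tweets)
```

The sum of all per-user totals equals the total number of words generated, which gives a simple sanity check on the pipeline.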
Similar Systems: Yahoo!’s S4
•Distributed stream processing engine:
• Programming interface: Processing Elements written in Java
• Data routed between Processing Elements by key
• No database. All processing in memory keeping the window of the input data at each Processing Element.
•Used to estimate Click-Through-Rate using user’s behavior within a time window
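For contrast with the database-backed aggregates above, a window-based in-memory estimate can be sketched like this. It is not S4's API: the class `WindowedCTR`, its method names, and the assumption that timestamps arrive in non-decreasing order are all illustrative.

```python
from collections import defaultdict, deque

class WindowedCTR:
    """Per-key click-through rate over a sliding time window, kept in memory."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)   # key -> deque of (timestamp, clicked)

    def _evict(self, events, now):
        # Drop events that have fallen out of the window.
        while events and events[0][0] <= now - self.window:
            events.popleft()

    def observe(self, key, timestamp, clicked):
        events = self.events[key]
        events.append((timestamp, clicked))
        self._evict(events, timestamp)

    def ctr(self, key, now):
        events = self.events[key]
        self._evict(events, now)
        if not events:
            return 0.0
        return sum(1 for _, clicked in events if clicked) / len(events)
```

Because only the window is kept, memory stays bounded per key, but anything older than the window is forgotten and nothing survives a restart.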
Similar Systems: Google’s Percolator
•Percolator is for incremental data processing: based on BigTable
•BigTable - a distributed key-value store:
• the same data model as in Cassandra
• the same log-structured storage
• BigTable - a distributed system with a master; Cassandra - peer-to-peer
•Used at Google to incrementally update the web search index (instead of MapReduce)
Percolator’s Push-style Processing (Observers)
• Percolator extends BigTable with
• distributed ACID transactions:
• multi-version mechanism with snapshot isolation semantics
• two-phase commit
• observers (similar to database triggers for push-style processing)
• Advantage: no data loss - for each inserted document, the associated observers will be executed
• Overhead: distributed transactions and durable storage of intermediate data
Applications
•Tracking millions of parameters individually (browser cookies, URLs ...)
•Incremental computation of analytical values allows real-time reaction to events
•Monitoring without a time window, or with a window of any size, for each parameter
Real-Time Advertising (Short Overview)
• Real-Time bidding:
• Sites track your browsing behavior via cookies and sell it to advertising services
• Web publishers offer up display inventory to advertising services
• No fixed CPM, instead: each ad impression is sold to the highest bidder
• Retargeting (remarketing)
• Advertisers can do remarketing after the following events: (1) the user visited the advertiser's site and left (assuming the site is within the Google content network); (2) the user visited the site, added products to the shopping cart, and then left; (3) the user went through the purchase process but stopped partway.
Using Social Network Profiles to Enhance Advertising
• Watching for signs of purchase intent among Twitter users
Application: Social Media Optimization for news sites
•A/B testing for headlines of news stories
•Optimization of front page to attract more clicks
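Headline A/B testing fits the incremental model because each impression or click only bumps a counter. A minimal sketch, assuming a hypothetical `HeadlineTest` class and ignoring statistical significance entirely:

```python
class HeadlineTest:
    """Incrementally tracked impressions and clicks per headline variant."""

    def __init__(self):
        self.impressions = {}
        self.clicks = {}

    def record_impression(self, headline):
        self.impressions[headline] = self.impressions.get(headline, 0) + 1

    def record_click(self, headline):
        self.clicks[headline] = self.clicks.get(headline, 0) + 1

    def ctr(self, headline):
        shown = self.impressions.get(headline, 0)
        return self.clicks.get(headline, 0) / shown if shown else 0.0

    def best(self):
        # Variant with the highest click-through rate so far (None if no data).
        return max(self.impressions, key=self.ctr, default=None)
```

Since the counters are current after every event, the leading headline can be read at any moment rather than after a batch job finishes.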
Real-Time News Recommendations
•TwitterTim.es:
• using social graph to recommend news
• now - batch rebuilding every 2 hours
• goal - a newspaper that updates in real time
• Google News:
• recommendation via collaborative filtering based on users' clicks
• new stories and clicks are constantly coming
• now - batch processing using MapReduce
Other Applications
•Recommendations on location check-ins: Foursquare, Facebook Places...
•Social Games: monitoring events from millions of users in real-time, react in real-time
Questions?