Analytics for the Real-Time Web Maxim Grinev, Maria Grineva and Martin Hentschel

Upload: mariagrineva

Post on 22-Apr-2015


Page 1: Analytics for the Real-Time Web

Analytics for the Real-Time Web

Maxim Grinev, Maria Grineva and Martin Hentschel

Page 2: Analytics for the Real-Time Web

Outline

• Real-time Web and its new requirements

•Our system: Triggy

• Short overview of Cassandra

• Triggy internal mechanisms

•Similar systems:

• Yahoo!’s S4

• Google’s Percolator

•Applications

Page 3: Analytics for the Real-Time Web

Real-Time Web

•Web 2.0 + mobile devices = Real-Time Web

•People share what they do now, discuss breaking news on Twitter, share their current locations on Foursquare...

Page 4: Analytics for the Real-Time Web

Analytics for the Real-Time Web: new requirements

•MapReduce - state-of-the-art for processing of Web 2.0 data

•Batch processing (MapReduce) is too slow now

•New requirements:

• real-time processing: aggregate values incrementally, as new data arrives

• database-intensive: aggregate values are stored in a database that is constantly being updated

Page 5: Analytics for the Real-Time Web

Our System: Triggy

• Based on Cassandra, a distributed key-value store

• Provides programming model similar to MapReduce, adapted to push-style processing

• Extends Cassandra with

• push-style procedures - to immediately propagate the data to computations;

• serialization - to ensure serialized access to aggregate values (counters)

• Easily scalable

Page 6: Analytics for the Real-Time Web

Cassandra Overview: Data Model

• Data Model: key-value

• Extends basic key-value with 2 levels of nesting

• Super column - used when the second level of nesting is present

• Column family ~ table;

• key-value pair ~ record

• Keys are stored ordered

Page 7: Analytics for the Real-Time Web

Cassandra Overview: Incremental Scalability

• Incremental scalability requires a mechanism to dynamically partition data over the nodes

•Data partitioned by key using consistent hashing

• Advantage of consistent hashing: the departure or arrival of a node affects only its immediate neighbors; all other nodes remain unaffected
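
The partitioning idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cassandra's partitioner: the `Ring` class, the `token` helper, the node names, and the use of MD5 with a single token per node are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring with one token per node."""

    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add(self, node: str) -> None:
        # Insert the new node at its position on the ring.
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key: str) -> str:
        # The owner is the first node clockwise from the key, wrapping around.
        positions = [t for t, _ in self._tokens]
        i = bisect.bisect_left(positions, token(key))
        return self._tokens[i % len(self._tokens)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")
after = {k: ring.owner(k) for k in keys}

# Keys whose owner changed when node-d arrived.
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-d` and was previously held by a single neighbor; keys owned by the other nodes keep their owner.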

Page 8: Analytics for the Real-Time Web

Cassandra Overview: Log-Structured Storage

• Optimized for write-intensive workloads (log-structured storage)
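
The slide does not describe the storage engine itself, so the following is only a toy sketch of the general log-structured idea: writes are appended and buffered in memory, then flushed as immutable sorted segments rather than updated in place. The class name `LogStructuredStore`, the `memtable_limit` parameter, and the list-based segments are assumptions for illustration, not Cassandra's implementation.

```python
class LogStructuredStore:
    """Toy log-structured key-value store (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.log = []            # append-only record of every write
        self.memtable = {}       # recent writes, held in memory
        self.segments = []       # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write is a sequential append plus an in-memory update;
        # nothing already flushed is modified in place.
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the in-memory table as a sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads check the newest data first: memtable, then newest segment.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None


store = LogStructuredStore(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # memtable is full, so it is flushed to a segment
store.put("a", 3)   # newer value for "a" shadows the flushed one
```

The write path never seeks to an old record, which is what makes this layout attractive for write-heavy workloads; the cost is shifted to reads, which may have to consult several segments.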

Page 9: Analytics for the Real-Time Web

Triggy: Programming Model

•Modified MapReduce to support push-style processing

•Modified only reduce function: reduce*

• reduce* incrementally applies a new input value to an already existing aggregate value

Map(k1, v1) -> list(k2, v2)

Reduce(k2, list(v2)) -> (k2, v3)
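
The transcript does not preserve the signature of reduce*, so the sketch below assumes a shape that matches the description above: reduce* takes a key, the aggregate stored so far, and one new value, and returns the updated aggregate. The function names and the word-frequency example are hypothetical. The sketch compares a batch Reduce with the same computation done push-style.

```python
from collections import defaultdict


def map_fn(k1, v1):
    """Map(k1, v1) -> list(k2, v2): emit one (word, 1) pair per word."""
    return [(word, 1) for word in v1.split()]


def reduce_batch(k2, values):
    """Classic Reduce(k2, list(v2)) -> (k2, v3): needs all values at once."""
    return k2, sum(values)


def reduce_star(k2, aggregate, v2):
    """Assumed reduce* shape: fold one new value into the stored aggregate."""
    return k2, aggregate + v2


documents = {"d1": "real time web", "d2": "real time analytics"}

# Batch style: collect every intermediate pair, then reduce per key.
grouped = defaultdict(list)
for k1, v1 in documents.items():
    for k2, v2 in map_fn(k1, v1):
        grouped[k2].append(v2)
batch = dict(reduce_batch(k2, values) for k2, values in grouped.items())

# Push style: apply each pair to the aggregate as soon as it is produced.
aggregates = {}
for k1, v1 in documents.items():
    for k2, v2 in map_fn(k1, v1):
        _, aggregates[k2] = reduce_star(k2, aggregates.get(k2, 0), v2)
```

Both paths end with the same counts, but the push-style path has an up-to-date aggregate after every input instead of only at the end of a batch.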

Page 10: Analytics for the Real-Time Web

Triggy: Programming Model

Page 11: Analytics for the Real-Time Web
Page 12: Analytics for the Real-Time Web

Triggy: Execution of Maps and Reduces

• We extend each node with a queue and worker threads, which execute the Map and Reduce tasks buffered in the queue

• Map tasks can be executed in parallel at any node in the system and do not require serialization because they do not share any data

• Execution of reduce* tasks has to be serialized for the same key to guarantee correct results:

• We make use of Cassandra’s partitioning strategy: equal keys are routed to the same node

• Serialization within a node: locks on keys that are being processed right now
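
The per-node mechanism described above (a task queue, worker threads, and locks on the keys currently being processed) might look roughly like the following. This is a sketch under assumptions, not Triggy's code: the `Node` class, its method names, and the use of a sum as the reduce* step are hypothetical, and the routing of equal keys to the same node is taken as already done.

```python
import queue
import threading


class Node:
    """Hypothetical node: a task queue, worker threads, and per-key locks."""

    def __init__(self, n_workers=4):
        self.aggregates = {}                     # stands in for stored aggregates
        self._locks = {}                         # one lock per key
        self._locks_guard = threading.Lock()     # protects the lock table
        self._tasks = queue.Queue()
        self._threads = [threading.Thread(target=self._work)
                         for _ in range(n_workers)]
        for thread in self._threads:
            thread.start()

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def submit(self, key, value):
        self._tasks.put((key, value))

    def _work(self):
        while True:
            task = self._tasks.get()
            if task is None:                     # sentinel: stop this worker
                return
            key, value = task
            # reduce* for the same key is serialized by the key's lock;
            # tasks for different keys can run in parallel.
            with self._lock_for(key):
                self.aggregates[key] = self.aggregates.get(key, 0) + value

    def shutdown(self):
        for _ in self._threads:
            self._tasks.put(None)
        for thread in self._threads:
            thread.join()


node = Node(n_workers=4)
for i in range(10_000):
    node.submit(f"user:{i % 10}", 1)
node.shutdown()
```

Without the per-key lock, two workers could read the same old aggregate and one update would be lost; with it, each of the ten keys ends with exactly 1000.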

Page 13: Analytics for the Real-Time Web

Triggy: Fault Tolerance

• No fault tolerance guarantees for the execution of map/reduce* tasks: intermediate data in the queue can be lost

•Not critical for analytical applications

Page 14: Analytics for the Real-Time Web

Triggy: Scalability

• Computation and data storage are tightly coupled: moving the data also moves the computation, which makes the system easy to scale

• A new node is placed next to the most loaded node, and part of that node's data is transferred to it

Page 15: Analytics for the Real-Time Web

Experiments

• Generated workload: tweets with user ids (1 .. 100000) drawn from a uniform distribution

• The load generator issues as many requests as the system with N nodes can handle

• Application: count the number of words posted by each user

Map: tweet => (user_id, number_of_words_in_tweet)

Reduce: (user_id, number_of_words_total, number_of_words_in_tweet) => (user_id, number_of_words_total)
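
The benchmark application can be written out directly from the Map and Reduce signatures on this slide. The sketch below is not the authors' benchmark code: the tweet dictionary layout, the tiny vocabulary, the fixed random seed, and the in-memory `totals` dictionary (in place of the database) are all assumptions made so the example is self-contained.

```python
import random


def generate_tweets(n, rng, max_user_id=100_000):
    """Hypothetical load generator: user ids uniform in 1..max_user_id."""
    vocabulary = ["real", "time", "web", "analytics", "triggy"]
    for _ in range(n):
        words = rng.choices(vocabulary, k=rng.randint(1, 10))
        yield {"user_id": rng.randint(1, max_user_id), "text": " ".join(words)}


def map_tweet(tweet):
    """Map: tweet => (user_id, number_of_words_in_tweet)."""
    return tweet["user_id"], len(tweet["text"].split())


def reduce_star(user_id, number_of_words_total, number_of_words_in_tweet):
    """Reduce: fold one tweet's word count into the user's running total."""
    return user_id, number_of_words_total + number_of_words_in_tweet


rng = random.Random(0)
tweets = list(generate_tweets(1000, rng))

totals = {}   # stands in for the aggregate values kept in the database
for tweet in tweets:
    user_id, n_words = map_tweet(tweet)
    _, totals[user_id] = reduce_star(user_id, totals.get(user_id, 0), n_words)
```

Each tweet updates exactly one counter, so the per-user totals are current after every request rather than after a batch run.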

Page 16: Analytics for the Real-Time Web

Similar Systems: Yahoo!’s S4

•Distributed stream processing engine:

• Programming interface: Processing Elements written in Java

• Data routed between Processing Elements by key

• No database. All processing in memory keeping the window of the input data at each Processing Element.

•Used to estimate Click-Through-Rate using user’s behavior within a time window
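
To make the contrast with a database-backed design concrete, here is a rough Python sketch of a keyed, in-memory processing element that keeps a time window of events and estimates click-through rate from it. It does not use S4's actual Java interface; the `CtrElement` class, the `route` function, the event kinds, and the 60-second window are hypothetical.

```python
from collections import deque


class CtrElement:
    """Hypothetical per-key processing element holding a window in memory."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, kind); kind is "impression" or "click"

    def process(self, timestamp, kind):
        self.events.append((timestamp, kind))
        # Drop events that have fallen out of the time window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def ctr(self):
        impressions = sum(1 for _, kind in self.events if kind == "impression")
        clicks = sum(1 for _, kind in self.events if kind == "click")
        return clicks / impressions if impressions else 0.0


elements = {}   # key -> processing element; events are routed by key


def route(key, timestamp, kind):
    if key not in elements:
        elements[key] = CtrElement(window_seconds=60)
    elements[key].process(timestamp, kind)


route("ad-1", 0, "impression")
route("ad-1", 10, "impression")
route("ad-1", 20, "click")
ctr_early = elements["ad-1"].ctr()    # 1 click over 2 impressions

route("ad-1", 100, "impression")      # everything before t=40 is dropped
ctr_late = elements["ad-1"].ctr()     # 0 clicks over 1 impression
```

All state lives in the element's window, so an estimate can only reflect events that are still inside it; nothing older is retained anywhere.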

Page 17: Analytics for the Real-Time Web

Similar Systems: Google’s Percolator

•Percolator is for incremental data processing: based on BigTable

•BigTable - a distributed key-value store:

• the same data model as in Cassandra

• the same log-structured storage

• BigTable - a distributed system with a master; Cassandra - peer-to-peer

•Used in Google for incremental update of Web Search Index (instead of MapReduce)

Page 18: Analytics for the Real-Time Web

Percolator’s Push-style Processing (Observers)

• Percolator extends BigTable with

• distributed ACID transactions:

• multi-version mechanism with snapshot isolation semantics

• two-phase commit

• observers (similar to database triggers for push-style processing)

• Advantage: no data loss - for each inserted document, the associated observers will be executed

• Overhead: distributed transactions and durable storage of intermediate data
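
The observer idea (a write to a column triggers the procedures registered on it, which may in turn write other columns) can be illustrated with a small in-memory table. This sketch deliberately leaves out the distributed transactions and durable storage listed above, and it is not Percolator's API: the `Table` class, the column names, and both observer functions are hypothetical.

```python
class Table:
    """Toy table whose column writes trigger registered observers."""

    def __init__(self):
        self.rows = {}
        self.observers = {}   # column name -> list of callbacks

    def observe(self, column, callback):
        self.observers.setdefault(column, []).append(callback)

    def write(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value
        # Push-style processing: run every observer of the written column.
        for callback in self.observers.get(column, []):
            callback(self, row, value)


def extract_links(table, row, html):
    """Observer on 'raw': derive the link list from the document text."""
    links = [word[len("href="):] for word in html.split()
             if word.startswith("href=")]
    table.write(row, "links", links)


def count_links(table, row, links):
    """Observer on 'links': keep a derived count up to date."""
    table.write(row, "link_count", len(links))


table = Table()
table.observe("raw", extract_links)
table.observe("links", count_links)

# Inserting a document cascades through both observers.
table.write("doc-1", "raw", "see href=a.html and href=b.html")
```

In this toy version a crash between two observer calls would leave derived columns stale, which is the gap that transactions and durable intermediate data are meant to close.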

Page 19: Analytics for the Real-Time Web

Applications

•Tracking millions of parameters individually (browser cookies, URLs ...)

• Incremental computation of analytical values allows real-time reaction to events

• Monitoring each parameter without a time window, or with a window of any size

Page 20: Analytics for the Real-Time Web

Real-Time Advertising (Short Overview)

• Real-Time bidding:

• Sites track your browsing behavior via cookies and sell it to advertising services

• Web publishers offer up display inventory to advertising services

• No fixed CPM, instead: each ad impression is sold to the highest bidder

• Retargeting (remarketing)

• Advertisers can retarget a user after the following events: (1) the user visited the advertiser's site and left (assuming the site is within the Google content network); (2) the user visited the site, added products to the shopping cart, and then left; (3) the user went through the purchase process but stopped partway.

Page 21: Analytics for the Real-Time Web

Using Social Network Profiles to Enhance Advertising

• Watching for purchase intent (readiness to buy) among Twitter users

Page 22: Analytics for the Real-Time Web

Application: Social Media Optimization for news sites

• A/B testing for headlines of news stories

•Optimization of front page to attract more clicks
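
Headline A/B testing maps naturally onto incrementally maintained counters. The slide gives no implementation details, so the following is only one possible sketch: the `record` and `best_headline` functions, the event names, and the choice of click-through rate as the comparison metric are assumptions.

```python
from collections import defaultdict

# (story_id, headline) -> counters, updated as each event arrives.
stats = defaultdict(lambda: {"impressions": 0, "clicks": 0})


def record(story_id, headline, event):
    """Fold one event ('impressions' or 'clicks') into the running counters."""
    stats[(story_id, headline)][event] += 1


def best_headline(story_id):
    """Return the headline with the highest click-through rate so far."""
    rates = {
        headline: counters["clicks"] / counters["impressions"]
        for (sid, headline), counters in stats.items()
        if sid == story_id and counters["impressions"] > 0
    }
    return max(rates, key=rates.get) if rates else None


for _ in range(4):
    record("story-1", "Headline A", "impressions")
    record("story-1", "Headline B", "impressions")
record("story-1", "Headline A", "clicks")
record("story-1", "Headline B", "clicks")
record("story-1", "Headline B", "clicks")

winner = best_headline("story-1")
```

Because the counters are always current, the front page can switch to the better-performing headline as soon as the difference appears, without waiting for a batch job.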

Page 23: Analytics for the Real-Time Web

Real-Time News Recommendations

•TwitterTim.es:

• using social graph to recommend news

• now - batch rebuilding every 2 hours

• goal - a newspaper that updates in real time

• Google News:

• recommendation via collaborative filtering based on users' clicks

• new stories and clicks arrive constantly

• now - batch processing using MapReduce

Page 24: Analytics for the Real-Time Web

Other Applications

• Recommendations on location check-ins: Foursquare, Facebook Places...

•Social Games: monitoring events from millions of users in real-time, react in real-time

Page 25: Analytics for the Real-Time Web

Questions?