Page 1

Search Engine Architecture

12. Stream Processing

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Noted slides adapted from Lin et al.'s Big Data Infrastructure, UMD Spring 2015, with cosmetic changes.

Page 2

Today's Agenda

• Basics of stream processing
• Sampling and hashing
• Architectures for stream processing

Page 3

What is a data stream?

• Sequence of items:
  • Structured (e.g., tuples)
  • Ordered (implicitly or timestamped)
  • Arriving continuously at high volumes
• Not possible to store entirely
• Sometimes not possible to even examine all items

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 4

What to do with data streams?

• Network traffic monitoring
• Datacenter telemetry monitoring
• Sensor network monitoring
• Credit card fraud detection
• Stock market analysis
• Online mining of click streams
• Monitoring social media streams

Page 5

What's the scale? Packet data streams

• Single 2 Gb/sec link; say avg. packet size is 50 bytes
  • Number of packets/sec = 5 million
  • Time per packet = 0.2 microseconds
• If we capture only header information per packet (source/destination IP, time, no. of bytes, etc.): at least 10 bytes
  • 50 MB per second
  • 4+ TB per day
  • Per link!

What if you wanted to do deep-packet inspection?

Source: Minos Garofalakis, Berkeley CS 286
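The back-of-envelope numbers above can be checked in a few lines; a sketch, using the slide's assumed values (2 Gb/sec link, 50-byte packets, 10-byte captured headers), not measurements:

```java
// Sanity check of the packet-stream arithmetic on the slide above.
public class PacketMath {
    // Packets per second on a link, given the link rate and average packet size.
    static double packetsPerSec(double linkBitsPerSec, double packetBytes) {
        return linkBitsPerSec / (packetBytes * 8);
    }

    // Bytes of captured header data per day at a given packet rate.
    static double headerBytesPerDay(double packetsPerSec, double headerBytes) {
        return packetsPerSec * headerBytes * 86_400;   // seconds per day
    }

    public static void main(String[] args) {
        double pps = packetsPerSec(2e9, 50);            // 5 million packets/sec
        double perPacketMicros = 1e6 / pps;             // 0.2 microseconds per packet
        double perDay = headerBytesPerDay(pps, 10);     // ~4.3 TB of headers per day
        System.out.printf("%.0f pkt/s, %.2f us/pkt, %.2f TB/day%n",
                pps, perPacketMicros, perDay / 1e12);
    }
}
```

With these inputs the "4+ TB per day" figure works out to 4.32 TB of header data per link.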

Page 6

[Diagram: a converged IP/MPLS network linking the PSTN, DSL/cable networks, enterprise networks, and a BGP peer through routers R1, R2, R3, monitored from a Network Operations Center (NOC)]

• Top-k query: what are the top (most frequent) 1000 (source, dest) pairs seen by R1 over the last month?
• SQL join query:
    SELECT COUNT (R1.source, R1.dest)
    FROM R1, R2
    WHERE R1.source = R2.source
• Set-expression query: how many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?

• Off-line analysis in a DBMS: data access is slow, expensive

Source: Minos Garofalakis, Berkeley CS 286

Page 7

Common Architecture

• Data stream management system (DSMS) at observation points
  • Voluminous streams in, reduced streams out
• Database management system (DBMS)
  • Outputs of DSMS can be treated as data feeds to databases

[Diagram: data streams enter DSMSs, which answer standing queries and emit reduced streams as data feeds into a DBMS, which answers ad hoc queries]

Source: Peter Bonz

Page 8

DBMS vs. DSMS

                   DBMS                    DSMS
Model:             persistent relations    (mostly) transient relations
Relation:          tuple set/bag           tuple sequence
Data update:       modifications           appends
Query:             transient               persistent
Query answer:      exact                   approximate
Query evaluation:  arbitrary               one pass
Query plan:        fixed                   adaptive

Source: Peter Bonz

Page 9

What makes it hard?

• Intrinsic challenges:
  • Volume
  • Velocity
  • Limited storage
  • Strict latency requirements
  • Out-of-order delivery
• System challenges:
  • Load balancing
  • Unreliable message delivery
  • Fault-tolerance
  • Consistency semantics (lossy, exactly once, at least once, etc.)

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 10

What exactly do you do?

• "Standard" relational operations:
  • Select
  • Project
  • Transform (i.e., apply custom UDF)
  • Group by
  • Join
  • Aggregations

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 11

Issues of Semantics

• Group by… aggregate
  • When do you stop grouping and start aggregating?
• Joining a stream and a static source
  • Simple lookup
• Joining two streams
  • How long do you wait for the join key in the other stream?
• Joining two streams, group by and aggregation
  • When do you stop joining?

What's the solution?

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 12

Windows

• Mechanism for extracting finite relations from an infinite stream
• Windows restrict processing scope:
  • Windows based on ordering attributes (e.g., time)
  • Windows based on item (record) counts
  • Windows based on explicit markers (e.g., punctuations)
  • Variants (e.g., some semantic partitioning constraint)

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 13

Windows on Ordering Attributes

• Assumes the existence of an attribute that defines the order of stream elements (e.g., time)
• Let T be the window size in units of the ordering attribute
• Sliding window: each window [ti, ti'] spans ti' − ti = T, and successive windows overlap
• Tumbling window: successive windows are disjoint, with boundaries ti+1 − ti = T

Source: Peter Bonz
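The tumbling case can be sketched in a few lines: with window size T, the item at timestamp t belongs to window floor(t / T), which covers [iT, (i+1)T) so that consecutive boundaries differ by exactly T. The names below (TumblingWindows, assign) are illustrative, not from any particular framework:

```java
import java.util.*;

// Sketch: assign timestamped items to tumbling windows of size T.
public class TumblingWindows {
    // Window index for timestamp t is floor(t / T); windows are disjoint.
    static Map<Long, List<String>> assign(long windowSizeT, long[] ts, String[] items) {
        Map<Long, List<String>> windows = new TreeMap<>();
        for (int i = 0; i < ts.length; i++) {
            long w = ts[i] / windowSizeT;   // tumbling: boundaries T apart, no overlap
            windows.computeIfAbsent(w, k -> new ArrayList<>()).add(items[i]);
        }
        return windows;
    }

    public static void main(String[] args) {
        long[] ts = {1, 4, 5, 9, 10};
        String[] items = {"a", "b", "c", "d", "e"};
        // With T = 5: window 0 = [0,5) -> a,b ; window 1 = [5,10) -> c,d ; window 2 -> e
        System.out.println(assign(5, ts, items));
    }
}
```

A sliding window would instead assign each item to every window whose [ti, ti + T] span covers it, so items appear in multiple windows.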

Page 14

Windows on Counts

• Window of size N elements (sliding, tumbling) over the stream
• Challenges:
  • Problematic with non-unique timestamps: non-deterministic output
  • Unpredictable window size (and storage requirements)

Source: Peter Bonz

Page 15

Windows from "Punctuations"

• Application-inserted "end-of-processing"
  • Example: stream of actions… "end of user session"
• Properties:
  • Advantage: application-controlled semantics
  • Disadvantage: unpredictable window size (too large or too small)

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 16

Common Techniques

Source: Wikipedia (Forge)

Page 17

"Hello World" Stream Processing

• Problem: count the frequency of items in the stream
• Why?
  • Take some action when frequency exceeds a threshold
  • Data mining: raw counts → co-occurring counts → association rules

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 18

The Raw Stream…

[Diagram: a continuous stream of incoming items]

Source: Peter Bonz

Page 19

Divide Into Windows…

[Diagram: the stream divided into window 1, window 2, window 3]

Source: Peter Bonz

Page 20

First Window

Source: Peter Bonz

Page 21

Second Window

[Diagram: frequency counts from the first window are combined (+) with the counts of the next window to produce updated frequency counts]

Source: Peter Bonz

Page 22

Window Counting

• What's the issue?
• What's the solution?

Lessons learned? Solutions are approximate (or lossy).

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 23

General Strategies

• Sampling
• Hashing

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 24

Reservoir Sampling

• Task: select s elements from a stream of size N with uniform probability
  • N can be very, very large
  • We might not even know what N is! (infinite stream)
• Solution: reservoir sampling
  • Store the first s elements
  • For the k-th element thereafter, keep it with probability s/k (randomly discarding an existing element)
• Example: s = 10
  • Keep the first 10 elements
  • 11th element: keep with probability 10/11
  • 12th element: keep with probability 10/12
  • …

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
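The procedure above translates almost line for line into code; a minimal sketch, with illustrative names (Reservoir, sample) and the stream simulated as the integers 1..n:

```java
import java.util.*;

// Reservoir sampling: store the first s elements, then keep the k-th element
// (1-based) with probability s/k, evicting a uniformly random resident.
public class Reservoir {
    static long[] sample(int s, long n, Random rnd) {
        long[] reservoir = new long[s];
        for (long k = 1; k <= n; k++) {                    // stream elements 1..n
            if (k <= s) {
                reservoir[(int) (k - 1)] = k;              // store first s elements
            } else if (rnd.nextDouble() < (double) s / k) { // keep with probability s/k
                reservoir[rnd.nextInt(s)] = k;             // discard a random existing element
            }
        }
        return reservoir;                                  // uniform sample of size s
    }

    public static void main(String[] args) {
        // One pass over a million-element "stream", constant memory (s slots).
        System.out.println(Arrays.toString(sample(10, 1_000_000, new Random(42))));
    }
}
```

Note that only s slots of memory are ever used, regardless of how large (or unbounded) the stream is; the sample is valid at every point during the pass, which is what makes this usable on infinite streams.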

Page 25

Reservoir Sampling: How does it work?

• Example: s = 10
  • Keep the first 10 elements
  • 11th element: keep with probability 10/11
    • If we decide to keep it: it is sampled uniformly by definition
    • Probability an existing item is discarded: 10/11 × 1/10 = 1/11
    • Probability an existing item survives: 10/11
• General case: at the (k + 1)-th element
  • Probability of having selected each item up until now is s/k
  • Probability an existing element is replaced: s/(k + 1) × 1/s = 1/(k + 1)
  • Probability an existing element is not replaced: k/(k + 1)
  • Probability each element survives to the (k + 1)-th round: (s/k) × k/(k + 1) = s/(k + 1)

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 26

Hashing for Three Common Tasks

• Cardinality estimation: what's the cardinality of set S?
  • e.g., how many unique visitors to this page?
  • Exact: HashSet → Approximate: HLL counter
• Set membership: is x a member of set S?
  • e.g., has this user seen this ad before?
  • Exact: HashSet → Approximate: Bloom filter
• Frequency estimation: how many times have we observed x?
  • e.g., how many queries has this user issued?
  • Exact: HashMap → Approximate: CMS

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 27

HyperLogLog Counter

• Task: cardinality estimation of a set
  • size() → number of unique elements in the set
• Observation: hash each item and examine the hash code
  • On expectation, 1/2 of the hash codes will start with 1
  • On expectation, 1/4 of the hash codes will start with 01
  • On expectation, 1/8 of the hash codes will start with 001
  • On expectation, 1/16 of the hash codes will start with 0001
  • …

How do we take advantage of this observation?

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
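One crude way to exploit the observation, as a sketch: track the longest run r of leading zeros seen across all hashed items and estimate the cardinality as roughly 2^(r+1) (seeing r leading zeros is a 2^-(r+1) event, so it suggests that many distinct hashes). Real HyperLogLog keeps many such registers, bucketed by a hash prefix, and averages them to tame the very high variance of this single-register version. The mixing hash below (a splitmix64-style finalizer) is an assumed choice:

```java
// Single-register cardinality sketch in the spirit of the slide's observation.
// Duplicates hash identically, so they never change the estimate.
public class CardinalitySketch {
    // splitmix64 finalizer: spreads input bits so leading zeros look random.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    static long estimate(long[] items) {
        int r = 0;                       // longest run of leading zeros observed
        for (long item : items) {
            r = Math.max(r, Long.numberOfLeadingZeros(mix(item)));
        }
        return 1L << (r + 1);            // ~2^(r+1); one register => high variance
    }

    public static void main(String[] args) {
        long[] items = new long[100_000];
        for (int i = 0; i < items.length; i++) items[i] = i % 10_000; // 10k distinct
        System.out.println(estimate(items)); // right order of magnitude, at best
    }
}
```

The key property is that the state is a single small integer per register, no matter how many items (or duplicates) stream past.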

Page 28

Bloom Filters

• Task: keep track of set membership
  • put(x) → insert x into the set
  • contains(x) → yes if x is a member of the set
• Components:
  • m-bit bit vector
  • k hash functions: h1 … hk

0 0 0 0 0 0 0 0 0 0 0 0

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 29

Bloom Filters: put

0 0 0 0 0 0 0 0 0 0 0 0

put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 30

Bloom Filters: put

0 1 0 0 1 0 0 0 0 0 1 0

put(x) sets bits 2, 5, and 11.

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 31

Bloom Filters: contains

0 1 0 0 1 0 0 0 0 0 1 0

contains(x): h1(x) = 2, h2(x) = 5, h3(x) = 11

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 32

Bloom Filters: contains

0 1 0 0 1 0 0 0 0 0 1 0

contains(x): h1(x) = 2, h2(x) = 5, h3(x) = 11
A[h1(x)] AND A[h2(x)] AND A[h3(x)] = YES

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 33

Bloom Filters: contains

0 1 0 0 1 0 0 0 0 0 1 0

contains(y): h1(y) = 2, h2(y) = 6, h3(y) = 9

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 34

Bloom Filters: contains

0 1 0 0 1 0 0 0 0 0 1 0

contains(y): h1(y) = 2, h2(y) = 6, h3(y) = 9
A[h1(y)] AND A[h2(y)] AND A[h3(y)] = NO

What's going on here?

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 35

Bloom Filters

• Error properties of contains(x):
  • False positives possible
  • No false negatives
• Usage:
  • Constraints: capacity, error probability
  • Tunable parameters: size of bit vector m, number of hash functions k

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
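A minimal sketch of the structure just described: an m-bit vector plus k hash functions, here simulated by seeding one mixing hash k ways (one possible construction, assumed for the sketch):

```java
import java.util.BitSet;

// Bloom filter sketch: put() sets k bit positions; contains() is YES only if
// all k positions are set. Any 0 bit proves absence (no false negatives);
// all-1s may be a collision of other items' bits (false positives possible).
public class Bloom {
    private final BitSet bits;
    private final int m, k;

    Bloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // i-th "hash function": mix the object's hashCode with the seed i.
    private int h(int i, Object x) {
        long v = x.hashCode() + i * 0x9E3779B97F4A7C15L;
        v = (v ^ (v >>> 33)) * 0xFF51AFD7ED558CCDL;
        v ^= (v >>> 33);
        return (int) Math.floorMod(v, (long) m);
    }

    void put(Object x) {
        for (int i = 0; i < k; i++) bits.set(h(i, x));
    }

    boolean contains(Object x) {
        for (int i = 0; i < k; i++)
            if (!bits.get(h(i, x))) return false;   // a 0 bit => definitely absent
        return true;                                // all 1s => probably present
    }

    public static void main(String[] args) {
        Bloom f = new Bloom(1 << 16, 3);
        f.put("x");
        // "x" is guaranteed present; "y" is almost certainly reported absent.
        System.out.println(f.contains("x") + " " + f.contains("y"));
    }
}
```

Note the one-sided error: an inserted item is always found, while a never-inserted item is wrongly reported present only if all k of its bits happen to collide with bits set by other items.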

Page 36

Count-Min Sketches

• Task: frequency estimation
  • put(x) → increment count of x by one
  • get(x) → returns the frequency of x
• Components:
  • k hash functions: h1 … hk
  • k × m array of counters (k rows of m counters; shown below with k = 4, m = 12)

0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 37

Count-Min Sketches: put

0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0

put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 38

Count-Min Sketches: put

0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0

put(x) increments one counter in each row.

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 39

Count-Min Sketches: put

0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0

put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 40

Count-Min Sketches: put

0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0

State after a second put(x).

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 41

Count-Min Sketches: put

0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0

put(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 42

Count-Min Sketches: put

0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0

State after put(y): note the collision with x in row 2 (counter 5 is now 3).

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 43

Count-Min Sketches: get

0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0

get(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 44

Count-Min Sketches: get

0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0

get(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4
MIN(A1[h1(x)], A2[h2(x)], A3[h3(x)], A4[h4(x)]) = MIN(2, 3, 2, 2) = 2

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 45

Count-Min Sketches: get

0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0

get(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 46

Count-Min Sketches: get

0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0

get(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2
MIN(A1[h1(y)], A2[h2(y)], A3[h3(y)], A4[h4(y)]) = MIN(1, 3, 1, 1) = 1

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 47

Count-Min Sketches

• Error properties:
  • Reasonable estimation of heavy hitters
  • Frequent over-estimation of the tail
• Usage:
  • Constraints: number of distinct events, distribution of events, error bounds
  • Tunable parameters: number of counters m, number of hash functions k, size of counters

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
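A minimal sketch matching the walkthrough on the preceding pages: k rows of m counters, one hash function per row; put() increments one counter per row and get() takes the minimum. The row-seeded hash scheme is an assumption, as in the Bloom filter sketch:

```java
// Count-min sketch: collisions can only inflate counters, so taking the
// minimum over rows over-estimates but never under-estimates a frequency.
public class CountMin {
    private final long[][] counts;
    private final int m, k;

    CountMin(int m, int k) { this.m = m; this.k = k; this.counts = new long[k][m]; }

    // Hash for a given row: mix the hashCode with the row index as a seed.
    private int h(int row, Object x) {
        long v = x.hashCode() + row * 0x9E3779B97F4A7C15L;
        v = (v ^ (v >>> 33)) * 0xFF51AFD7ED558CCDL;
        v ^= (v >>> 33);
        return (int) Math.floorMod(v, (long) m);
    }

    void put(Object x) {
        for (int row = 0; row < k; row++) counts[row][h(row, x)]++;
    }

    long get(Object x) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < k; row++) min = Math.min(min, counts[row][h(row, x)]);
        return min;   // least-inflated counter = best available estimate
    }

    public static void main(String[] args) {
        CountMin cms = new CountMin(1 << 12, 4);
        cms.put("x"); cms.put("x"); cms.put("y");
        System.out.println(cms.get("x"));  // at least 2; exactly 2 unless "y" collides in every row
    }
}
```

This is why the slide says the tail gets over-estimated: a rare item's counters sit on top of whatever heavy hitters happened to hash to the same cells.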

Page 48

Stream Processing Architectures

Source: Wikipedia (River)

Page 49

Storm

• Open-source real-time distributed stream processing system
  • Started at BackType
  • BackType acquired by Twitter in 2011
  • Now an Apache project
• Storm aspires to be the Hadoop of real-time processing!

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 50

Storm Topologies

• A Storm topology = a "job"
  • Once started, runs continuously until killed
• A Storm topology is a computation graph
  • The graph contains nodes and edges
  • Nodes hold processing logic (i.e., a transformation over their input)
  • Directed edges indicate communication between nodes

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 51

Streams, Spouts, and Bolts

• Streams
  • The basic collection abstraction: an unbounded sequence of tuples
  • Streams are transformed by the processing elements of a topology
• Spouts
  • Stream generators
  • May propagate a single stream to multiple consumers
• Bolts
  • Subscribe to streams
  • Stream transformers
  • Process incoming streams and produce new ones

[Diagram: spouts emit streams that flow through a graph of bolts, each bolt producing new streams]

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 52

Stream Groupings

• Bolts are executed by multiple workers in parallel
• When a bolt emits a tuple, where should it go?
• Stream groupings:
  • Shuffle grouping: round-robin
  • Fields grouping: based on a data value

[Diagram: two spouts feeding three bolts]

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 53

Storm: Example

// instantiate a new topology
TopologyBuilder builder = new TopologyBuilder();
// set up a new spout with five tasks
builder.setSpout("spout", new RandomSentenceSpout(), 5);
// the sentence splitter bolt with eight tasks
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("spout");   // shuffle grouping for the output
// word counter with twelve tasks
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));   // fields grouping

// new configuration
Config conf = new Config();
// set the number of workers for the topology; the 5 + 8 + 12 = 25 tasks
// will be allocated round-robin to the three workers, each task
// running as a separate thread
conf.setNumWorkers(3);
// submit the topology to the cluster
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Page 54

Spark Streaming

Discretized stream processing: run a streaming computation as a series of very small, deterministic batch jobs.


(a) Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux or DPC [33, 5] to ensure that replicas of each node see records in the same order (e.g., when they have multiple parent nodes).

(b) D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.

Figure 1: Comparison of traditional record-at-a-time stream processing (a) with discretized streams (b).

… the state of operators on a lost, or slow, node. Previous systems use one of two schemes, replication and upstream backup [19], which offer a sharp tradeoff between cost and recovery time.

In replication, which is common in database systems, there are two copies of the processing graph, and input records are sent to both. However, simply replicating the nodes is not enough; the system also needs to run a synchronization protocol, such as Flux [33] or Borealis's DPC [5], to ensure that the two copies of each operator see messages from upstream parents in the same order. For example, an operator that outputs the union of two parent streams (the sequence of all records received on either one) needs to see the parent streams in the same order to produce the same output stream, so the two copies of this operator need to coordinate. Replication is thus costly, though it recovers quickly from failures.

In upstream backup, each node retains a copy of the messages it sent since some checkpoint. When a node fails, a standby machine takes over its role, and the parents replay messages to this standby to rebuild its state. This approach thus incurs high recovery times, because a single node must recompute the lost state by running data through the serial stateful operator code. TimeStream [32] and MapReduce Online [10] use this model. Popular message queueing systems, like Storm [36], also use this approach, but typically only provide "at-least-once" delivery for messages, relying on the user's code to handle state recovery.[2]

More importantly, neither replication nor upstream backup handle stragglers. If a node runs slowly in the replication model, the whole system is affected because of the synchronization required to have the replicas receive messages in the same order. In upstream backup, the only way to mitigate a straggler is to treat it as a failure, which requires going through the slow state recovery process mentioned above, and is heavy-handed for a problem that may be transient.[3] Thus, while traditional streaming approaches work well at smaller scales, they face substantial problems in a large commodity cluster.

3 Discretized Streams (D-Streams)

D-Streams avoid the problems with traditional stream processing by structuring computations as a set of short, stateless, deterministic tasks instead of continuous, stateful operators. They then store the state in memory across tasks as fault-tolerant data structures (RDDs) that can be recomputed deterministically. Decomposing computations into short tasks exposes dependencies at a fine granularity and allows powerful recovery techniques like parallel recovery and speculation. Beyond fault tolerance, the D-Stream model gives other benefits, such as powerful unification with batch processing.

3.1 Computation Model

We treat a streaming computation as a series of deterministic batch computations on small time intervals. The data received in each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce and groupBy, to produce new datasets representing either program outputs or intermediate state. In the former case, the results may be pushed to an external system…

[2] Storm's Trident layer [25] automatically keeps state in a replicated database instead, writing updates in batches. This is expensive, as all updates must be replicated transactionally across the network.

[3] Note that a speculative execution approach as in batch systems would be challenging to apply here because the operator code assumes that it is fed inputs serially, so even a backup copy of an operator would need to spend a long time recovering from its last checkpoint.
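The computation model just described can be sketched as bucketing records by interval and running one deterministic batch operation per completed interval. The class and method names below (MicroBatch, processBatch, run) are illustrative assumptions, not the Spark Streaming API:

```java
import java.util.*;

// Toy version of the D-Stream computation model: records are bucketed by time
// interval, and each completed interval's immutable batch is processed by a
// deterministic operation -- here, a word count per interval.
public class MicroBatch {
    // One deterministic batch computation per interval.
    static Map<String, Long> processBatch(List<String> batch) {
        Map<String, Long> counts = new TreeMap<>();
        for (String w : batch) counts.merge(w, 1L, Long::sum);
        return counts;
    }

    // Bucket (timestamp, word) records into intervals of length T, then process
    // each interval independently; determinism means a failure re-runs only the
    // lost batches instead of replaying the whole stream.
    static Map<Long, Map<String, Long>> run(long T, long[] ts, String[] words) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < ts.length; i++)
            batches.computeIfAbsent(ts[i] / T, k -> new ArrayList<>()).add(words[i]);
        Map<Long, Map<String, Long>> out = new TreeMap<>();
        batches.forEach((interval, batch) -> out.put(interval, processBatch(batch)));
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {0, 1, 5, 6, 7};
        String[] ws = {"a", "a", "a", "b", "b"};
        System.out.println(run(5, ts, ws));  // {0={a=2}, 1={a=1, b=2}}
    }
}
```

The contrast with the continuous-operator model above is that no long-lived mutable operator state exists: each interval's output is a pure function of that interval's (reliably stored) input, which is what makes parallel recovery and speculation possible.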

425

(a) Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux or DPC [33, 5] to ensure that replicas of each node see records in the same order (e.g., when they have multiple parent nodes).

(b) D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.

Figure 1: Comparison of traditional record-at-a-time stream processing (a) with discretized streams (b).

the state of operators on a lost, or slow, node. Previous systems use one of two schemes, replication and upstream backup [19], which offer a sharp tradeoff between cost and recovery time.

In replication, which is common in database systems, there are two copies of the processing graph, and input records are sent to both. However, simply replicating the nodes is not enough; the system also needs to run a synchronization protocol, such as Flux [33] or Borealis’s DPC [5], to ensure that the two copies of each operator see messages from upstream parents in the same order. For example, an operator that outputs the union of two parent streams (the sequence of all records received on either one) needs to see the parent streams in the same order to produce the same output stream, so the two copies of this operator need to coordinate. Replication is thus costly, though it recovers quickly from failures.
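The order-sensitivity behind this synchronization requirement can be seen in a toy Python sketch (hypothetical code, not Flux or DPC): a union operator that emits records in arrival order produces divergent outputs when two replicas see the same parent records in different interleavings.

```python
# Toy demo of why replicated operators must agree on arrival order.
# A "union" operator simply emits parent records in the order they arrive.

def union_operator(arrivals):
    """Emit records from the parents in arrival order.
    arrivals: list of (parent_id, record) pairs."""
    return [record for _, record in arrivals]

# The same records from parents A and B, interleaved differently:
replica1 = union_operator([("A", "a1"), ("B", "b1"), ("A", "a2")])
replica2 = union_operator([("B", "b1"), ("A", "a1"), ("A", "a2")])

# Without a synchronization protocol, the replicas' outputs diverge.
assert replica1 != replica2
```

Hence the two copies of such an operator must coordinate on message order, which is what makes replication costly.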

In upstream backup, each node retains a copy of the messages it sent since some checkpoint. When a node fails, a standby machine takes over its role, and the parents replay messages to this standby to rebuild its state. This approach thus incurs high recovery times, because a single node must recompute the lost state by running data through the serial stateful operator code. TimeStream [32] and MapReduce Online [10] use this model. Popular message queueing systems, like Storm [36], also use this approach, but typically only provide “at-least-once” delivery for messages, relying on the user’s code to handle state recovery.2
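A minimal sketch of this recovery path, with hypothetical names (`CountingOperator`, `sent_since_checkpoint`): the standby restores the last checkpoint and replays the parent's buffered output serially through the same stateful operator code, which is why recovery is slow.

```python
# Sketch of upstream-backup recovery (hypothetical, simplified).

class CountingOperator:
    """Stateful operator: counts records per key."""
    def __init__(self, state=None):
        self.state = dict(state or {})

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

checkpoint = {"x": 2}                    # state captured at last checkpoint
sent_since_checkpoint = ["x", "y", "x"]  # parent's buffered output since then

# Recovery: the standby starts from the checkpoint and replays the buffer
# one record at a time -- a single serial node rebuilds all the lost state.
standby = CountingOperator(checkpoint)
for msg in sent_since_checkpoint:
    standby.process(msg)

assert standby.state == {"x": 4, "y": 1}
```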

More importantly, neither replication nor upstream backup handle stragglers. If a node runs slowly in the replication model, the whole system is affected because

2 Storm’s Trident layer [25] automatically keeps state in a replicated database instead, writing updates in batches. This is expensive, as all updates must be replicated transactionally across the network.

of the synchronization required to have the replicas receive messages in the same order. In upstream backup, the only way to mitigate a straggler is to treat it as a failure, which requires going through the slow state recovery process mentioned above, and is heavy-handed for a problem that may be transient.3 Thus, while traditional streaming approaches work well at smaller scales, they face substantial problems in a large commodity cluster.

3 Discretized Streams (D-Streams)

D-Streams avoid the problems with traditional stream processing by structuring computations as a set of short, stateless, deterministic tasks instead of continuous, stateful operators. They then store the state in memory across tasks as fault-tolerant data structures (RDDs) that can be recomputed deterministically. Decomposing computations into short tasks exposes dependencies at a fine granularity and allows powerful recovery techniques like parallel recovery and speculation. Beyond fault tolerance, the D-Stream model gives other benefits, such as powerful unification with batch processing.

3.1 Computation Model

We treat a streaming computation as a series of deterministic batch computations on small time intervals. The data received in each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce and groupBy, to produce new datasets representing either program outputs or intermediate state. In the former case, the results may be pushed to an external system in a distributed manner. In the latter case, the intermediate state is stored as resilient distributed datasets (RDDs) [42], a fast storage abstraction that avoids replication by using lineage for recovery, as we shall explain. This state dataset may then be processed along with the next batch of input data to produce a new dataset of updated intermediate states. Figure 1(b) shows our model.

3 Note that a speculative execution approach as in batch systems would be challenging to apply here because the operator code assumes that it is fed inputs serially, so even a backup copy of an operator would need to spend a long time recovering from its last checkpoint.

Source: Zaharia et al. (SOSP 2013)

Spark and Spark Streaming

Figure 2: High-level overview of the Spark Streaming system. Spark Streaming divides input data streams into batches and stores them in Spark’s memory. It then executes a streaming application by generating Spark jobs to process the batches.

We implemented our system, Spark Streaming, based on this model. We used Spark [42] as our batch processing engine for each batch of data. Figure 2 shows a high-level sketch of the computation model in the context of Spark Streaming. This is explained in more detail later.

In our API, users define programs by manipulating objects called discretized streams (D-Streams). A D-Stream is a sequence of immutable, partitioned datasets (RDDs) that can be acted on by deterministic transformations. These transformations yield new D-Streams, and may create intermediate state in the form of RDDs.

We illustrate the idea with a Spark Streaming program that computes a running count of view events by URL. Spark Streaming exposes D-Streams through a functional API similar to LINQ [41, 2] in the Scala programming language.4 The code for our program is:

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)

This code creates a D-Stream called pageViews by reading an event stream over HTTP, and groups these into 1-second intervals. It then transforms the event stream to get a new D-Stream of (URL, 1) pairs called ones, and performs a running count of these with a stateful runningReduce transformation. The arguments to map and runningReduce are Scala function literals.
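The semantics of this program can be simulated in a few lines of Python (a sketch of the model, not the Spark API): each interval's batch of events is mapped to (url, 1) pairs and merged into the counts carried over from the previous interval, yielding one output dataset per interval.

```python
# Simulation of the running view-count program's semantics (not Spark).

def running_counts(batches):
    counts = {}                              # state carried across intervals
    for events in batches:                   # one batch per time interval
        ones = [(url, 1) for url in events]  # map: event -> (url, 1)
        for url, n in ones:                  # runningReduce with prior state
            counts[url] = counts.get(url, 0) + n
        yield dict(counts)                   # the interval's output dataset

batches = [["a.com", "b.com"], ["a.com"]]    # two 1-second batches
outputs = list(running_counts(batches))

assert outputs == [{"a.com": 1, "b.com": 1}, {"a.com": 2, "b.com": 1}]
```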

4 Other interfaces, such as streaming SQL, would also be possible.

Figure 3: Lineage graph for RDDs in the view count program. Each oval is an RDD, with partitions shown as circles. Each sequence of RDDs is a D-Stream.

To execute this program, Spark Streaming will receive the data stream, divide it into one second batches and store them in Spark’s memory as RDDs (see Figure 2). Additionally, it will invoke RDD transformations like map and reduce to process the RDDs. To execute these transformations, Spark will first launch map tasks to process the events and generate the url-one pairs. Then it will launch reduce tasks that take both the results of the maps and the results of the previous interval’s reduces, stored in an RDD. These tasks will produce a new RDD with the updated counts. Each D-Stream in the program thus turns into a sequence of RDDs.
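One interval's execution can be sketched as follows (plain Python with hypothetical hash partitioning, not Spark's internals): map tasks emit url-one pairs, and each reduce task combines its share of those pairs with the matching partition of the previous interval's state.

```python
# Sketch of one interval's map/reduce execution over partitioned state.

def map_task(events):
    return [(url, 1) for url in events]          # emit url-one pairs

def reduce_task(pairs, prev_partition):
    out = dict(prev_partition)                   # previous interval's counts
    for url, n in pairs:
        out[url] = out.get(url, 0) + n
    return out

def run_interval(events, prev_state, n_partitions=2):
    pairs = map_task(events)
    new_state = []
    for p in range(n_partitions):
        # each reduce task gets the pairs whose key hashes to its partition
        shard = [kv for kv in pairs if hash(kv[0]) % n_partitions == p]
        new_state.append(reduce_task(shard, prev_state[p]))
    return new_state  # the interval's new state RDD, one dict per partition

state = [{}, {}]
state = run_interval(["a.com", "b.com", "a.com"], state)
total = {k: v for part in state for k, v in part.items()}
assert total == {"a.com": 2, "b.com": 1}
```

Because the reduce tasks read the previous interval's state as an ordinary input dataset, the state D-Stream is itself just a sequence of RDDs, one per interval.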

Finally, to recover from faults and stragglers, both D-Streams and RDDs track their lineage, that is, the graph of deterministic operations used to build them [42]. Spark tracks this information at the level of partitions within each distributed dataset, as shown in Figure 3. When a node fails, it recomputes the RDD partitions that were on it by re-running the tasks that built them from the original input data stored reliably in the cluster. The system also periodically checkpoints state RDDs (e.g., by asynchronously replicating every tenth RDD)5 to prevent infinite recomputation, but this does not need to happen for all data, because recovery is often fast: the lost partitions can be recomputed in parallel on separate nodes. In a similar way, if a node straggles, we can speculatively execute copies of its tasks on other nodes [11], because they will produce the same result.
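Lineage-based recovery can be illustrated with a toy example (not Spark's implementation): because the task that built each partition is deterministic and its input is stored reliably, a lost partition is restored by simply re-running that task, and independent partitions could be recomputed in parallel on different nodes.

```python
# Toy illustration of lineage-based recovery of a lost partition.

def build_partition(input_block):
    # deterministic task recorded in the lineage: count events per URL
    counts = {}
    for url in input_block:
        counts[url] = counts.get(url, 0) + 1
    return counts

inputs = [["a.com", "a.com"], ["b.com"]]           # stored reliably
partitions = [build_partition(b) for b in inputs]  # the RDD's partitions

partitions[1] = None                 # node holding partition 1 fails
# Recovery: re-run the deterministic task from the reliable input,
# instead of keeping a replica or replaying through a serial operator.
partitions[1] = build_partition(inputs[1])

assert partitions == [{"a.com": 2}, {"b.com": 1}]
```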

We note that the parallelism usable for recovery in D-Streams is higher than in upstream backup, even if one ran multiple operators per node. D-Streams expose parallelism across both partitions of an operator and time:

1. Much like batch systems run multiple tasks per node, each timestep of a transformation may create multiple RDD partitions per node (e.g., 1000 RDD partitions on a 100-core cluster). When the node fails, we can recompute its partitions in parallel on others.

5 Since RDDs are immutable, checkpointing does not block the job.

