TRANSCRIPT
Lecture 18
CS 677: Big Data
Data Sketches
§ In many cases, data is produced much faster than we can analyze it
§ Batch processing systems like MapReduce let us do analysis offline, after the fact
§ Good: studying long-term trends
§ Bad: reacting quickly…
§ Health monitoring, rerouting traffic, etc.
§ We can use a stream processing system, but what happens when that can’t handle the workload?
Streaming Big Data
11/10/20 CS 677: Big Data 2
§ Rather than storing/processing everything, we can build data sketches of the datasets
§ Some information is thrown away…
§ …but we can store a wider breadth of information.
§ These approaches have memory and processing benefits
§ Also well-suited to IoT devices, low-powered cloud instances, etc.
Sketching
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Principal component analysis
§ Converts possibly correlated variables into a set of linearly uncorrelated variables
§ Ordered by largest variance
Dimensionality Reduction: PCA
§ In some datasets, features can be predicted by one or more other features with high accuracy
§ Awesome! Easy predictions!
§ However, consider the case where we are building a multiple regression model
§ Noise in the dataset will cause increased variance of one or more regression coefficients
§ Solution: remove these features
Multicollinearity
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Used often for compression
§ Think: JPEG
§ We can throw away some of the data and then fit wavelets that will help us reconstruct it
§ Related concept: quantization/discretization
§ Put readings/observations into bins
§ Now instead of maintaining [1.1, 1.2, 1.15] in memory, we can say our [1.1 – 1.2] bin contains 3 elements
§ It’s possible to maintain the distributions within…
Wavelets
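The binning idea above can be sketched in a few lines. This is a hypothetical illustration: the class name, the map-based layout, and the choice of bin width are all assumptions, not anything prescribed by the lecture.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of quantization: count readings per bin instead of storing them. */
public class BinSketch {
    private final double binWidth;                       // assumed fixed-width bins
    private final Map<Long, Integer> counts = new TreeMap<>();

    public BinSketch(double binWidth) { this.binWidth = binWidth; }

    /** Map a reading to its bin and bump that bin's counter. */
    public void put(double reading) {
        long bin = (long) Math.floor(reading / binWidth);
        counts.merge(bin, 1, Integer::sum);
    }

    /** How many readings fell in the bin containing this value? */
    public int countFor(double value) {
        return counts.getOrDefault((long) Math.floor(value / binWidth), 0);
    }
}
```

With a bin width of 0.1, a cluster of nearby readings collapses into a single (bin, count) pair, which is the memory win the slide describes.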
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Let’s assume we have a data feed from NOAA that gives us real-time climate information
§ We want to provide some basic statistics about the weather
§ Highs, lows, averages, etc.
§ If we store the readings in an array (or similar structure) then we can easily find the values we need
§ This will consume a lot of memory (or disk space)
Scenario
§ Being the clever big data people we are, we realize that providing the highs/lows doesn’t actually require us to store the entire data stream
§ We can just check whether the new value we’ve seen is larger/smaller than what is recorded
§ Great! But now we also want to know what the average temperature is…
Optimization
Gathering More Statistics
§ To improve the expressiveness of our weather reports, we also want to gather:
§ Total number of data points
§ Average, variance, and standard deviation
§ These statistics provide a high-level overview of the data distributions
§ For instance, if we can assume a normal distribution then this tells us a lot about the data
Online Statistics Collection
§ Since new records are constantly streaming into the system, recalculating statistics each time is inefficient
§ We also operate in a distributed world: what if multiple nodes in our cluster are receiving data points at the same time?
§ Ok, so our requirements are:
§ Must support unbounded data streams
§ Must be accurate
§ Must be mergeable across distributed nodes
§ Most of us know how to calculate variance (and therefore standard deviation) with two passes:
§ Given samples x1 … xN:
1. Compute their mean
2. Compute squared differences from the mean
§ This is bad, though, because, well… it’s two passes
§ That’s slow!
§ It also might not be feasible if the data stream is unbounded
Average, Variance, etc.
§ To calculate the variance online, we can accumulate the sums of xi and xi²:
§ sum += xi
§ sumsq += xi²
§ Then the mean and variance will be:
§ x̄ = sum / N
§ σ² = (sumsq − N · x̄²) / (N − 1)
§ Except this is bad because it suffers from massive cancellation
The Streaming Version
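The cancellation problem is easy to trigger: when readings sit far from zero, sumsq and N·x̄² are both enormous and nearly equal, so their difference loses almost every significant digit. A small demonstration (the sample values are made up specifically to expose the problem):

```java
/** Demonstrates massive cancellation in the one-pass sum/sum-of-squares formula. */
public class CancellationDemo {
    public static void main(String[] args) {
        // Three readings with a large offset; the true variance is exactly 1.0.
        double[] samples = {1e9 + 1, 1e9 + 2, 1e9 + 3};

        // One-pass approach: accumulate sum and sum of squares.
        double sum = 0, sumsq = 0;
        for (double x : samples) { sum += x; sumsq += x * x; }
        double mean = sum / samples.length;
        double onePassVar = (sumsq - samples.length * mean * mean) / (samples.length - 1);

        // Two-pass reference: compute the mean, then the squared differences.
        double ss = 0;
        for (double x : samples) ss += (x - mean) * (x - mean);
        double twoPassVar = ss / (samples.length - 1);

        // sumsq and N*mean^2 are both ~3e18; their tiny true difference is
        // below double's precision at that magnitude, so it rounds away.
        System.out.println("one-pass variance: " + onePassVar);
        System.out.println("two-pass variance: " + twoPassVar);
    }
}
```

The two-pass version recovers the correct variance, while the one-pass formula's answer is wildly off at this magnitude.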
Welford’s Method
§ Allows statistics about a dataset to be updated incrementally
§ Computation is performed in a single pass (each data point is inspected once)
§ Each new record incurs a small calculation cost, but avoids re-calculating statistics for the entire dataset
§ Takes about 1 microsecond (0.000001 second) on commodity CPUs
§ No issues with massive cancellation
§ Mergeable
§ We’ll maintain:
§ the number of observations, n
§ the running mean, x̄
§ the sum of squares of differences from the current mean, Sn
§ As a recurrence relation:
x̄n = x̄n−1 + (xn − x̄n−1) / n
Sn = Sn−1 + (xn − x̄n−1)(xn − x̄n)
Welford Implementation
§ We already are tracking the running mean, so now we just need to calculate the variance
§ For n ≥ 2, the variance is:
σ² = Sn / (n − 1)
§ …which, of course, means we can get the standard deviation as well:
σ = sqrt(Sn / (n – 1))
Variance and Std. Dev
/** Add a new sample to the running statistics. */
public void put(double sample) {
    n++;
    double delta = sample - this.mean;
    this.mean = this.mean + delta / n;
    this.Sn = this.Sn + delta * (sample - this.mean);
    min = Math.min(this.min, sample);
    max = Math.max(this.max, sample);
}
Or, in Code
§ We can use this information to perform t-tests, check the probability of values given a distribution, and more
§ Another big benefit: these statistics can be merged
§ Collect data points on each machine in our cluster, merge them back together!
§ Works well with streaming systems and MapReduce
Additional Statistics
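One way the merge can work is the standard parallel-variance formula, which combines two (n, x̄, Sn) summaries exactly. The class shape below is a sketch, not the course's actual code; only n, mean, and Sn are tracked, with min/max omitted for brevity:

```java
/** Sketch of mergeable Welford-style running statistics. */
public class RunningStats {
    private long n = 0;
    private double mean = 0, Sn = 0;

    /** Welford update, as on the previous slide. */
    public void put(double sample) {
        n++;
        double delta = sample - mean;
        mean += delta / n;
        Sn += delta * (sample - mean);
    }

    /** Fold another node's summary into this one (parallel-variance formula). */
    public void merge(RunningStats other) {
        if (other.n == 0) return;
        double delta = other.mean - mean;
        long total = n + other.n;
        mean += delta * other.n / total;
        Sn += other.Sn + delta * delta * ((double) n * other.n) / total;
        n = total;
    }

    public double mean()     { return mean; }
    public double variance() { return Sn / (n - 1); }   // valid for n >= 2
}
```

Each cluster node keeps its own instance; merging the summaries gives exactly the statistics a single pass over all the data would have produced.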
§ We end up maintaining:
§ Min
§ Max
§ And:
§ Count (n)
§ Mean
§ Sn
§ In Java, we’re looking at around 50 bytes or so
Memory Impact
§ This approach works well for inspecting a single feature such as temperature
§ We can also maintain 2D online statistics:
§ put(temperature, humidity)
§ Here we also maintain the sum of products of differences from the means across the two features
§ Keep a 1D instance of each feature plus this information (just ~8 more bytes)
Pushing it Further
2D Statistics
§ Maintaining the 2D relationships between variables gives us:
§ Correlations
§ Slope and intercept for linear regression
§ Calculation of statistical significance
§ These take milliseconds to compute and consume minimal memory
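The 2D update can be sketched as a Welford-style co-moment accumulator. The class and method names are illustrative assumptions; the point is that correlation, slope, and intercept all fall out of the same handful of running values:

```java
/** Sketch of a 2D online summary: running means plus (co-)moments. */
public class Stats2D {
    private long n = 0;
    private double meanX = 0, meanY = 0;
    private double Sx = 0, Sy = 0;   // sums of squared deviations per feature
    private double Cxy = 0;          // sum of co-deviations across the features

    public void put(double x, double y) {
        n++;
        double dx = x - meanX;       // deviation from the *old* mean of x
        meanX += dx / n;
        double dy = y - meanY;
        meanY += dy / n;
        Sx += dx * (x - meanX);      // Welford update, per feature
        Sy += dy * (y - meanY);
        Cxy += dx * (y - meanY);     // old-x deviation times new-y deviation
    }

    public double correlation() { return Cxy / Math.sqrt(Sx * Sy); }
    public double slope()       { return Cxy / Sx; }            // regression y ~ x
    public double intercept()   { return meanY - slope() * meanX; }
}
```

A single pass over (temperature, humidity) pairs is enough to report their correlation and a linear fit, with only a few doubles of state.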
2D Summary Matrix
§ After creating our 2D summaries, we can put them in a summary matrix
§ Each feature combination ends up being represented twice:
§ Temperature → humidity
§ Humidity → temperature
§ Additionally, each 1D instance contains duplicate information:
§ Number of samples seen
§ Mean value
Optimized 2D Summaries
§ Instead of maintaining an entire statistics matrix, place summary instances in a triangular matrix
§ Further, remove all duplicate data from the 1D instances
§ This creates a new summary structure for any number of dimensions while reducing memory consumption by about 40%!
Data Structure (15 features)
[Figure: triangular summary matrix for 15 features; each cell holds the cross-feature sum of squares, and each feature keeps its number of observations, mean, sum of squares, min, and max]
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Ok, we can produce feature summaries… but what if we have a lot of elements in a stream and want to know which happen the most frequently?
§ “Heavy hitters” – most popular videos, websites, network users, etc.
§ Tracking this in a small amount of space is a challenging problem
Counting Events
§ Let’s say we want to know the top 10 YouTube videos based on URL clicks
§ Initialize an array with size 10 and store (URL, count) pairs in the array
§ Next, start reading items from the data stream…
Frequent Items Sketch (FIS) [1/2]
§ When an element (URL in our case) comes in:
§ If it’s in the sketch already, increment its count
§ If it’s not in the sketch, but there’s a free space in the array, insert it in the empty slot
§ If it’s not in the sketch and there’s no room for it in the array, decrement all counts
§ If any count drops to 0, remove it from the array, which will free up a slot
§ To use: sort the array and report the counts. You may also want to maintain a counter of all elements seen (to report frequencies as percentages)
Frequent Items Sketch (FIS) [2/2]
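The steps above are essentially the Misra–Gries algorithm. Here is a minimal map-based sketch of it; the slide describes an array of (URL, count) pairs, and this map layout, the class name, and the String keys are implementation choices for illustration:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Sketch of the Frequent Items algorithm (Misra-Gries). */
public class FrequentItems {
    private final int capacity;                       // number of tracked slots
    private final Map<String, Integer> counts = new HashMap<>();

    public FrequentItems(int capacity) { this.capacity = capacity; }

    public void put(String item) {
        if (counts.containsKey(item)) {
            counts.merge(item, 1, Integer::sum);      // already tracked: increment
        } else if (counts.size() < capacity) {
            counts.put(item, 1);                      // free slot: insert
        } else {
            // No room: decrement every counter, evicting any that hit zero.
            Iterator<Map.Entry<String, Integer>> it = counts.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Estimated count; never higher than the item's true count. */
    public int estimate(String item) { return counts.getOrDefault(item, 0); }
}
```

Any item whose true frequency exceeds N/(capacity + 1) is guaranteed to survive in the sketch, which is why it reliably surfaces heavy hitters.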
§ FIS is very effective at determining the top N heavy hitters, but we may need frequencies for ALL events
§ Time to revisit our old friend, Bloom filters!
§ Well… counting Bloom filters
§ In the count-min sketch (CM sketch), build a 2D array of w columns and d rows
§ Each of the d rows is associated with a hash function, and each of its w columns contains a counter
Count-Min Sketch [1/2]
§ As items are inserted, they are hashed once for each row in the 2D array
§ Next, column counters are incremented
§ hash % w = index of the column to increment
§ …and, finally, the overall count of items is incremented
§ To query the frequency of any event, simply hash it and take the minimum count you find (count-min!)
Count-Min Sketch [2/2]
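The update and query steps can be sketched compactly. The seeded hash below is a simple stand-in for the d independent hash functions the sketch assumes, and the class shape is illustrative:

```java
/** Sketch of a count-min sketch: d rows of w counters, one hash per row. */
public class CountMinSketch {
    private final int w, d;
    private final long[][] cells;
    private long total = 0;   // overall number of items inserted

    public CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.cells = new long[d][w];
    }

    /** Simple seeded hash standing in for row r's hash function. */
    private int index(String item, int r) {
        int h = item.hashCode() * (31 * r + 17) + r;
        return Math.floorMod(h, w);   // hash % w picks the column
    }

    public void put(String item) {
        for (int r = 0; r < d; r++) cells[r][index(item, r)]++;
        total++;
    }

    /** Estimate = minimum counter across rows; never below the true count. */
    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < d; r++) min = Math.min(min, cells[r][index(item, r)]);
        return min;
    }

    public long total() { return total; }
}
```

The whole structure is just d × w counters plus a total, regardless of how many distinct items flow through the stream.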
§ Why the minimum value? Well, as we know, Bloom filters are susceptible to collisions
§ By choosing the smallest value, we minimize how much we overestimate the frequency
§ (a counter may be inflated by a large margin if we’ve had many collisions in a particular column)
§ In fact, our true frequency will always be smaller than or equal to our estimate, with predictable errors
§ (just like with regular Bloom filters)
Count Min Intuition
§ As with Welford’s method, you can merge instances of the CM sketch
§ Assumes the same array dimensions (and the same hash functions)
§ Very useful for parallel/distributed applications (say, maintaining frequencies across several Spark Workers…)
Merging Sketches
§ Reduce dimensions any time you can
§ Don’t go with the brute force algorithms for summarizing, counting, etc.!
§ Always find out if a streaming algorithm is available
Wrapping Up