TRANSCRIPT
Lecture 18
CS 677: Big Data
Data Sketches
§ In many cases, data is produced much faster than we can analyze it
§ Batch processing systems like MapReduce let us do analysis offline, after the fact
§ Good: studying long-term trends
§ Bad: reacting quickly…
§ Health monitoring, rerouting traffic, etc.
§ We can use a stream processing system, but what happens when that can’t handle the workload?
Streaming Big Data
11/10/20 CS 677: Big Data 2
§ Rather than storing/processing everything, we can build data sketches of the datasets
§ Some information is thrown away…
§ …but we can store a wider breadth of information.
§ These approaches have memory and processing benefits
§ Also well-suited to IoT devices, low-powered cloud instances, etc.
Sketching
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Principal component analysis
§ Converts possibly correlated variables into a set of linearly uncorrelated variables
§ Ordered by largest variance
Dimensionality Reduction: PCA
§ In some datasets, features can be predicted by one or more other features with high accuracy
§ Awesome! Easy predictions!
§ However, consider the case where we are building a multiple regression model
§ Noise in the dataset will cause increased variance of one or more regression coefficients
§ Solution: remove these features
Multicollinearity
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Used often for compression
§ Think: JPEG
§ We can throw away some of the data and then fit wavelets that will help us reconstruct it
§ Related concept: quantization/discretization
§ Put readings/observations into bins
§ Now instead of maintaining [1.1, 1.2, 1.15] in memory, we can say our [1.1 – 1.2] bin contains 3 elements
§ It’s possible to maintain the distributions within…
Wavelets
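The binning idea above can be sketched in a few lines. This is a hypothetical illustration: the class name, the map-based layout, and the choice of bin width are all assumptions, not anything prescribed by the lecture.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of quantization: count readings per bin instead of storing them. */
public class BinSketch {
    private final double binWidth;                       // assumed fixed-width bins
    private final Map<Long, Integer> counts = new TreeMap<>();

    public BinSketch(double binWidth) { this.binWidth = binWidth; }

    /** Map a reading to its bin and bump that bin's counter. */
    public void put(double reading) {
        long bin = (long) Math.floor(reading / binWidth);
        counts.merge(bin, 1, Integer::sum);
    }

    /** How many readings fell in the bin containing this value? */
    public int countFor(double value) {
        return counts.getOrDefault((long) Math.floor(value / binWidth), 0);
    }
}
```

With a bin width of 0.1, a cluster of nearby readings collapses into a single (bin, count) pair, which is the memory win the slide describes.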
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Let’s assume we have a data feed from NOAA that gives us real-time climate information
§ We want to provide some basic statistics about the weather
§ Highs, lows, averages, etc.
§ If we store the readings in an array (or similar structure) then we can easily find the values we need
§ This will consume a lot of memory (or disk space)
Scenario
§ Being the clever big data people we are, we realize that providing the highs/lows doesn’t actually require us to store the entire data stream
§ We can just check whether the new value we’ve seen is larger/smaller than what is recorded
§ Great! But now we also want to know what the average temperature is…
Optimization
Gathering More Statistics
§ To improve the expressiveness of our weather reports, we also want to gather:
§ Total number of data points
§ Average, variance, and standard deviation
§ These statistics provide a high-level overview of the data distributions
§ For instance, if we can assume a normal distribution then this tells us a lot about the data
Online Statistics Collection
§ Since new records are constantly streaming into the system, recalculating statistics each time is inefficient
§ We also operate in a distributed world: what if multiple nodes in our cluster are receiving data points at the same time?
§ Ok, so our requirements are:
§ Must support unbounded data streams
§ Must be accurate
§ Must be mergeable across distributed nodes
§ Most of us know how to calculate variance (and therefore standard deviation) with two passes:
§ Given samples x1 … xN:
1. Compute their mean
2. Compute squared differences from the mean
§ This is bad, though, because, well… it’s two passes
§ That’s slow!
§ It also might not be feasible if the data stream is unbounded
Average, Variance, etc.
§ To calculate the variance online, we can accumulate the sums of xi and xi²:
§ sum += xi
§ sumsq += xi²
§ Then the mean and variance will be:
§ x̄ = sum / N
§ σ² = (sumsq − N · x̄²) / (N − 1)
§ Except this is bad because it suffers from massive cancellation
The Streaming Version
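The cancellation problem is easy to trigger: when readings sit far from zero, sumsq and N·x̄² are both enormous and nearly equal, so their difference loses almost every significant digit. A small demonstration (the sample values are made up specifically to expose the problem):

```java
/** Demonstrates massive cancellation in the one-pass sum/sum-of-squares formula. */
public class CancellationDemo {
    public static void main(String[] args) {
        // Three readings with a large offset; the true variance is exactly 1.0.
        double[] samples = {1e9 + 1, 1e9 + 2, 1e9 + 3};

        // One-pass approach: accumulate sum and sum of squares.
        double sum = 0, sumsq = 0;
        for (double x : samples) { sum += x; sumsq += x * x; }
        double mean = sum / samples.length;
        double onePassVar = (sumsq - samples.length * mean * mean) / (samples.length - 1);

        // Two-pass reference: compute the mean, then the squared differences.
        double ss = 0;
        for (double x : samples) ss += (x - mean) * (x - mean);
        double twoPassVar = ss / (samples.length - 1);

        // sumsq and N*mean^2 are both ~3e18; their tiny true difference is
        // below double's precision at that magnitude, so it rounds away.
        System.out.println("one-pass variance: " + onePassVar);
        System.out.println("two-pass variance: " + twoPassVar);
    }
}
```

The two-pass version recovers the correct variance, while the one-pass formula's answer is wildly off at this magnitude.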
Welford’s Method
§ Allows statistics about a dataset to be updated incrementally
§ Computation is performed in a single pass (each data point is inspected once)
§ Each new record incurs a small calculation cost, but avoids re-calculating statistics for the entire dataset
§ Takes about 1 microsecond (0.000001 second) on commodity CPUs
§ No issues with massive cancellation
§ Mergeable
§ We’ll maintain:
§ the number of observations, n
§ the running mean, x̄
§ the sum of squares of differences from the current mean, Sn
§ As a recurrence relation:
x̄n = x̄n−1 + (xn − x̄n−1) / n
Sn = Sn−1 + (xn − x̄n−1)(xn − x̄n)
Welford Implementation
§ We already are tracking the running mean, so now we just need to calculate the variance
§ For n ≥ 2, the variance is:
σ² = Sn / (n − 1)
§ …which, of course, means we can get the standard deviation as well:
σ = sqrt(Sn / (n – 1))
Variance and Std. Dev
/** Add a new sample to the running statistics. */
public void put(double sample) {
    n++;
    double delta = sample - this.mean;
    this.mean = this.mean + delta / n;
    this.Sn = this.Sn + delta * (sample - this.mean);
    min = Math.min(this.min, sample);
    max = Math.max(this.max, sample);
}
Or, in Code
§ We can use this information to perform t-tests, check the probability of values given a distribution, and more
§ Another big benefit: these statistics can be merged
§ Collect data points on each machine in our cluster, merge them back together!
§ Works well with streaming systems and MapReduce
Additional Statistics
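One way the merge can work is the standard parallel-variance formula, which combines two (n, x̄, Sn) summaries exactly. The class shape below is a sketch, not the course's actual code; only n, mean, and Sn are tracked, with min/max omitted for brevity:

```java
/** Sketch of mergeable Welford-style running statistics. */
public class RunningStats {
    private long n = 0;
    private double mean = 0, Sn = 0;

    /** Welford update, as on the previous slide. */
    public void put(double sample) {
        n++;
        double delta = sample - mean;
        mean += delta / n;
        Sn += delta * (sample - mean);
    }

    /** Fold another node's summary into this one (parallel-variance formula). */
    public void merge(RunningStats other) {
        if (other.n == 0) return;
        double delta = other.mean - mean;
        long total = n + other.n;
        mean += delta * other.n / total;
        Sn += other.Sn + delta * delta * ((double) n * other.n) / total;
        n = total;
    }

    public double mean()     { return mean; }
    public double variance() { return Sn / (n - 1); }   // valid for n >= 2
}
```

Each cluster node keeps its own instance; merging the summaries gives exactly the statistics a single pass over all the data would have produced.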
§ We end up maintaining:
§ Min
§ Max
§ And:
§ Count (n)
§ Mean
§ Sn
§ In Java, we’re looking at around 50 bytes or so
Memory Impact
§ This approach works well for inspecting a single feature such as temperature
§ We can also maintain 2D online statistics:
§ put(temperature, humidity)
§ Here we also maintain the sum of products of differences from the means across the two features
§ Keep a 1D instance of each feature plus this information (just ~8 more bytes)
Pushing it Further
2D Statistics
§ Maintaining the 2D relationships between variables gives us:
§ Correlations
§ Slope and intercept for linear regression
§ Calculation of statistical significance
§ These take milliseconds to compute and consume minimal memory
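The 2D update can be sketched as a Welford-style co-moment accumulator. The class and method names are illustrative assumptions; the point is that correlation, slope, and intercept all fall out of the same handful of running values:

```java
/** Sketch of a 2D online summary: running means plus (co-)moments. */
public class Stats2D {
    private long n = 0;
    private double meanX = 0, meanY = 0;
    private double Sx = 0, Sy = 0;   // sums of squared deviations per feature
    private double Cxy = 0;          // sum of co-deviations across the features

    public void put(double x, double y) {
        n++;
        double dx = x - meanX;       // deviation from the *old* mean of x
        meanX += dx / n;
        double dy = y - meanY;
        meanY += dy / n;
        Sx += dx * (x - meanX);      // Welford update, per feature
        Sy += dy * (y - meanY);
        Cxy += dx * (y - meanY);     // old-x deviation times new-y deviation
    }

    public double correlation() { return Cxy / Math.sqrt(Sx * Sy); }
    public double slope()       { return Cxy / Sx; }            // regression y ~ x
    public double intercept()   { return meanY - slope() * meanX; }
}
```

A single pass over (temperature, humidity) pairs is enough to report their correlation and a linear fit, with only a few doubles of state.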
2D Summary Matrix
§ After creating our 2D summaries, we can put them in a summary matrix
§ Each feature combination ends up being represented twice:
§ Temperature → humidity
§ Humidity → temperature
§ Additionally, each 1D instance contains duplicate information:
§ Number of samples seen
§ Mean value
Optimized 2D Summaries
§ Instead of maintaining an entire statistics matrix, place summary instances in a triangular matrix
§ Further, remove all duplicate data from the 1D instances
§ This creates a new summary structure for any number of dimensions while reducing memory consumption by about 40%!
Data Structure (15 features)
[Figure: triangular summary matrix for 15 features; each cell holds the cross-feature sum of squares, and each feature keeps its number of observations, mean, sum of squares, min, and max]
§ Dimensionality reduction
§ Wavelets
§ Summary Statistics
§ Frequent Items
Sketches
§ Ok, we can produce feature summaries… but what if we have a lot of elements in a stream and want to know which happen the most frequently?
§ “Heavy hitters” – most popular videos, websites, network users, etc.
§ Tracking this in a small amount of space is a challenging problem
Counting Events
§ Let’s say we want to know the top 10 YouTube videos based on URL clicks
§ Initialize an array with size 10 and store (URL, count) pairs in the array
§ Next, start reading items from the data stream…
Frequent Items Sketch (FIS) [1/2]
§ When an element (URL in our case) comes in:
§ If it’s in the sketch already, increment its count
§ If it’s not in the sketch, but there’s a free space in the array, insert it in the empty slot
§ If it’s not in the sketch and there’s no room for it in the array, decrement all counts
§ If any count drops to 0, remove it from the array, which will free up a slot
§ To use: sort the array and report the counts. You may also want to maintain a counter of all elements seen (to report frequencies as percentages)
Frequent Items Sketch (FIS) [2/2]
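The steps above are essentially the Misra–Gries algorithm. Here is a minimal map-based sketch of it; the slide describes an array of (URL, count) pairs, and this map layout, the class name, and the String keys are implementation choices for illustration:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Sketch of the Frequent Items algorithm (Misra-Gries). */
public class FrequentItems {
    private final int capacity;                       // number of tracked slots
    private final Map<String, Integer> counts = new HashMap<>();

    public FrequentItems(int capacity) { this.capacity = capacity; }

    public void put(String item) {
        if (counts.containsKey(item)) {
            counts.merge(item, 1, Integer::sum);      // already tracked: increment
        } else if (counts.size() < capacity) {
            counts.put(item, 1);                      // free slot: insert
        } else {
            // No room: decrement every counter, evicting any that hit zero.
            Iterator<Map.Entry<String, Integer>> it = counts.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Estimated count; never higher than the item's true count. */
    public int estimate(String item) { return counts.getOrDefault(item, 0); }
}
```

Any item whose true frequency exceeds N/(capacity + 1) is guaranteed to survive in the sketch, which is why it reliably surfaces heavy hitters.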
§ FIS is very effective at determining the top N heavy hitters, but we may need frequencies for ALL events
§ Time to revisit our old friend, Bloom filters!
§ Well… counting Bloom filters
§ In the count-min sketch (CM sketch), build a 2D array of w columns and d rows
§ Each of the d rows is associated with a hash function, and each of its w columns contains a counter
Count-Min Sketch [1/2]
§ As items are inserted, they are hashed once for each row in the 2D array
§ Next, column counters are incremented
§ hash % w = index of the column to increment
§ …and, finally, the overall count of items is incremented
§ To query the frequency of any event, simply hash it and take the minimum count you find (count-min!)
Count-Min Sketch [2/2]
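The update and query steps can be sketched compactly. The seeded hash below is a simple stand-in for the d independent hash functions the sketch assumes, and the class shape is illustrative:

```java
/** Sketch of a count-min sketch: d rows of w counters, one hash per row. */
public class CountMinSketch {
    private final int w, d;
    private final long[][] cells;
    private long total = 0;   // overall number of items inserted

    public CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.cells = new long[d][w];
    }

    /** Simple seeded hash standing in for row r's hash function. */
    private int index(String item, int r) {
        int h = item.hashCode() * (31 * r + 17) + r;
        return Math.floorMod(h, w);   // hash % w picks the column
    }

    public void put(String item) {
        for (int r = 0; r < d; r++) cells[r][index(item, r)]++;
        total++;
    }

    /** Estimate = minimum counter across rows; never below the true count. */
    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < d; r++) min = Math.min(min, cells[r][index(item, r)]);
        return min;
    }

    public long total() { return total; }
}
```

The whole structure is just d × w counters plus a total, regardless of how many distinct items flow through the stream.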
§ Why the minimum value? Well, as we know, Bloom filters are susceptible to collisions
§ By choosing the smallest value, we minimize how much we overestimate the frequency
§ (a counter may be inflated by a large margin if we’ve had many collisions in a particular column)
§ In fact, our true frequency will always be smaller than or equal to our estimate, with predictable errors
§ (just like with regular Bloom filters)
Count Min Intuition
§ As with Welford’s method, you can merge instances of the CM sketch
§ Assumes the same array dimensions (and the same hash functions)
§ Very useful for parallel/distributed applications (say, maintaining frequencies across several Spark Workers…)
Merging Sketches
§ Reduce dimensions any time you can
§ Don’t go with the brute force algorithms for summarizing, counting, etc.!
§ Always find out if a streaming algorithm is available
Wrapping Up