What the #(&*$ is Big Data?
A Holistic View of Data and Algorithms
Alice Zheng, GraphLab
Strata Conference, Santa Clara
February, 2014
Background
• Machine Learning
  • Enable machines to understand the world
  • Play with data
• GraphLab
  • Unleash data science!
  • Enable non-ML experts to play with data
• This talk: a look at Big Data and Machine Learning from a tool builder’s perspective
Strata Conf, Feb 2014 2
DATA
What is Data?
• Data is an extension of ourselves
  • Pictures, texts, messages, logs
  • Sensors and devices
  • Measurements and experiments
• Data is organic; it is wild and messy
• Data proliferates
Producers of Big Data
• Tech industry
  • Google, Microsoft, Facebook, Amazon, Twitter, …
• Consumer/Retail
  • Walmart, Target, Amazon, Netflix, …
• Telecom
  • Verizon, AT&T, Telefonica, …
• Finance
  • Thomson Reuters, Dow Jones, …
• Health care and monitoring
  • Personal health metrics, health care records, …
• Science
  • Genome research, high energy physics, astronomy, NASA, …
• Etc.
• 1.11 billion active users [March 2013]
• 665 million daily users on average [March 2013]
• Daily data amount: [Aug 2012]
  • 500+ TB data
  • 2.5 billion pieces of content
  • 2.7 billion “Like” actions
  • 300 million photos
  • Scans 105 TB data every ½ hour
• 100+ PB data stored on a single Hadoop cluster [Aug 2012]
Data Sources: [Yahoo! news] [TechCrunch]
System Event Logs
ETW (Event Tracing for Windows)
• Logs of kernel and application events
• Up to 100K events per second
• Binary log size: ~200 MB every 2-5 minutes
• 20-50 TB/year from one machine
• ~50 PB/year from 1000 machines
Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
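The yearly totals above follow from the per-file size and the logging interval. A quick arithmetic check, assuming decimal units (1 TB = 1,000,000 MB) and a machine that logs continuously all year; neither assumption is stated on the slide:

```python
# Back-of-the-envelope check of the ETW log volumes quoted above.
# Assumptions (not from the slide): decimal units and year-round logging.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes
LOG_SIZE_MB = 200                 # ~200 MB per binary log file


def tb_per_year(minutes_per_log: float) -> float:
    """Yearly volume in TB if one 200 MB log is written every `minutes_per_log` minutes."""
    logs_per_year = MINUTES_PER_YEAR / minutes_per_log
    return logs_per_year * LOG_SIZE_MB / 1_000_000  # MB -> TB


low = tb_per_year(5)   # one log every 5 minutes
high = tb_per_year(2)  # one log every 2 minutes

print(f"one machine:   {low:.1f} to {high:.1f} TB/year")
# 1000 machines produce 1000x as many TB, i.e. the same figures read as PB.
print(f"1000 machines: {low:.1f} to {high:.1f} PB/year")
```

Under these assumptions one machine produces roughly 21 to 53 TB per year, which is in line with the 20-50 TB/year and ~50 PB/year figures on the slide.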
A Picture of Big Data
[Figure: bubble chart of data sets. Axes: Total Size / Year (GB, TB, PB, EB) and Structure. Data sets shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs. Categories: Science, Tech, Other. Size of bubble = size of a single record (log scale).]
TAKING THE LEAP
[Figure: from Data to Insight]
ALGORITHMS
The Way to Insight
• What do people do with Big Data?
• Myriad algorithms for myriad tasks
• Two disparate examples
  • What movies would Bob like? Discovering recommendations from a crowd
  • Why is my machine so slow? Diagnosing systems using event logs
Algorithm Example 1: A Recommender System
What Movies Would Bob Like?
• Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like?
• Given movie selections of many users, make recommendations for individuals
User-Movie Interaction Matrix
[Figure: user-movie interaction matrix. Rows (users): Bob, Anna, David, Ethan. Columns (movies): Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive.]
Finding Similar Movies
• Jaccard similarity between a pair of movies
• If every user who watched one of the two movies also watched the other, then the two movies must be very similar.
Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3
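For reference, the Jaccard similarity of two movies is the number of users who watched both, divided by the number of users who watched at least one: |A ∩ B| / |A ∪ B|. The sketch below illustrates the computation. The viewer sets are hypothetical, since the entries of the interaction matrix are not preserved in this transcript; they are chosen only so that the result matches the 1/3 shown above.

```python
from fractions import Fraction


def jaccard(a: set, b: set) -> Fraction:
    """Jaccard similarity |a & b| / |a | b| of two sets of users."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets have similarity 0.
        return Fraction(0)
    return Fraction(len(a & b), len(union))


# Hypothetical viewer sets (not taken from the slide).
viewers = {
    "Silver Linings Playbook": {"Bob", "David"},
    "Hunger Games": {"David", "Ethan"},
}

sim = jaccard(viewers["Silver Linings Playbook"], viewers["Hunger Games"])
print(sim)  # 1/3: one shared viewer out of three distinct viewers
```

Exact fractions are used here only so the output reads like the slide; floating point works equally well.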
Movie Similarity Matrix
                         Silver Linings Playbook  Hunger Games  Twin Peaks  Iron Man 3  Mulholland Drive
Silver Linings Playbook  1                        1/3           2/3         0           1/3
Hunger Games             1/3                      1             1/4         0           1/3
Twin Peaks               2/3                      1/4           1           0           2/3
Iron Man 3               0                        0             0           1           0
Mulholland Drive         1/3                      1/3           2/3         0           1
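Making New Recommendations
    recs = []
    for movie in user.preferences:
        # k most similar movies to this one (the slide leaves k unspecified)
        new_movies = Sim[movie, :].topk()
        # flatten into one list so the final sort ranks individual movies
        recs.extend(new_movies)
    recs.sort()
• Equivalently, take the vector-matrix product
  • vector = the user’s preferences
  • matrix = movie similarity matrix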
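The vector-matrix formulation can be sketched in plain Python using the similarity values from the matrix slide. The preference vector encodes Bob's two movies from the earlier slide. Excluding movies the user has already watched and returning the top 2 are assumptions of this sketch, not something the slides specify.

```python
from fractions import Fraction as F

MOVIES = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Movie similarity matrix, copied from the slide (rows and columns follow MOVIES).
SIM = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]


def recommend(prefs, sim, titles, k=2):
    """Score every movie as prefs . sim, then return the k best unseen titles."""
    # Vector-matrix product: scores[j] = sum_i prefs[i] * sim[i][j]
    scores = [sum(p * row[j] for p, row in zip(prefs, sim))
              for j in range(len(titles))]
    # Assumption of this sketch: never recommend what the user already watched.
    unseen = [(scores[j], titles[j]) for j in range(len(titles)) if prefs[j] == 0]
    unseen.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in unseen[:k]]


# Bob watched "Silver Linings Playbook" and "Twin Peaks".
bob = [1, 0, 1, 0, 0]
print(recommend(bob, SIM, MOVIES))  # ['Mulholland Drive', 'Hunger Games']
```

With a 0/1 preference vector the product simply adds up the similarity rows of the movies the user watched, which is what the loop on the slide does one movie at a time.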
Key Ideas
• During training: compute item-item similarity matrix
• Making recommendations: take vector-matrix product
Algorithm Example 2: Diagnosing a slow computer
Why is My Machine So Slow?
• Slow machines are frustrating!
• Diagnose slowness via event logs
ETW (Event Tracing for Windows)
• Fine-grained event tracing
• Up to 100,000 events per second
[Figure: excerpt of sample ETW log]
Diagnosing Slowness
• Start from slow thread
• Walk backwards to construct wait graph
[Figure: wait graph along a time axis, with labels Firefox, Network Stack, TCP/IP packet, Search Indexer, File Lock, Anti-Virus Checker, File Lock]
Key Algorithm Ideas
• The insight is a wait graph
• Constructing the graph involves repeated queries into a large set of events
• Iterate:
  • What was the current thread waiting on?
  • Go to the source of the wait
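The iteration can be sketched as a backward walk over wait events. Everything below is hypothetical: the `WaitEvent` record layout, the timestamps, and the order in which the threads from the figure wait on each other are invented for illustration and are not the ETW schema. The sketch follows a single wait per thread, so it produces a chain rather than a full wait graph.

```python
from typing import NamedTuple


class WaitEvent(NamedTuple):
    """Hypothetical, simplified wait record (not the real ETW format)."""
    time: int      # when the wait was observed
    waiter: str    # thread that was blocked
    resource: str  # what it was waiting on
    holder: str    # thread that the wait was caused by


# Hypothetical events; the names are borrowed from the wait-graph figure.
EVENTS = [
    WaitEvent(10, "Search Indexer", "File Lock", "Anti-Virus Checker"),
    WaitEvent(20, "Network Stack", "File Lock", "Search Indexer"),
    WaitEvent(30, "Firefox", "TCP/IP packet", "Network Stack"),
]


def build_wait_chain(events, slow_thread, start_time):
    """Walk backwards from the slow thread, one wait at a time."""
    chain = []
    seen = set()
    current, before = slow_thread, start_time
    while current not in seen:  # guard against cyclic waits
        seen.add(current)
        # Query: what was the current thread waiting on, at or before `before`?
        waits = [e for e in events if e.waiter == current and e.time <= before]
        if not waits:
            break  # reached a thread that was not waiting on anything
        last = max(waits, key=lambda e: e.time)
        chain.append((last.waiter, last.resource, last.holder))
        # Go to the source of the wait.
        current, before = last.holder, last.time
    return chain


for waiter, resource, holder in build_wait_chain(EVENTS, "Firefox", 30):
    print(f"{waiter} waited on {resource} from {holder}")
```

Each step scans the whole event list, which is the "repeated queries into a large set of events" cost that the slide points to.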
What links these algorithms and data?
DATA STRUCTURES – THE BRIDGE
Between Data and Algorithms
• Data structures
  • Organized data
  • Optimized for certain computations
  • The key to efficient analysis
• Algorithms prefer certain data structures
• Raw data is amenable to certain data structures
[Figure: Data, Data Structures, Algorithms, with links labeled Amenable and Preference]
The Disconnect
• Machine Learning research is largely disconnected from implementation
• Some recent advances in large-scale ML are rediscovering known data structures
• Next-gen ML tools need well-tailored data structures
[Figure: Machine Learning (statistics, optimization, linear algebra, …) and Data Structures (lists, trees, tables, graphs, …)]
Two Useful Data Structures
• Flat tables
• Graphs
Data Structure 1: Flat Table
Flat Tables
• Rows and columns
  • Rows = records
  • Columns can be typed
• A lot of raw data looks like flat tables!
Example 1: User-Item interaction data
User     Item                     Rating  Time
Alice    Breaking Bad, Season 1   3       …
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5
…
Example 2: Event log data
Timestamp  Name          PID   CPU  Stack                                              …
447590409  audiodg.exe   1848  1    ntkrnlpa.exe!KeSetEvent; ntkrnlpa.exe!WaitForLock
447590411  csrss.exe     460   0    …
447590415  iexplore.exe  2478  1    kernel64.exe!WaitForMultipleObjects
…
Variations of Flat Tables
• Query vs. computation
• Random access (in-memory) vs. sequential access (on-disk)
• Column vs. row-wise representation
• Indexed or not
• Distributed or not
• Key-value stores (hash tables)
Data Structure 1.5: Indexed Flat Table
Example of Indexed Flat Table
User     Item                     Rating
Alice    Breaking Bad, Season 1   3
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5
…
Query: What items did Bob rate?
Index of “Bob” points to rows 3 and 6
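A minimal sketch of such an index, using the rows from the table above. The index is a plain dictionary from user name to row numbers; the function name `build_index` and the 1-based row numbering (chosen to match the slide) are choices made for this illustration.

```python
from collections import defaultdict

# Rows of the flat table shown above: (User, Item, Rating).
ROWS = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]


def build_index(rows, column=0):
    """Map each value in `column` to the (1-based) numbers of the rows holding it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows, start=1):
        index[row[column]].append(row_number)
    return index


user_index = build_index(ROWS)

# Query: what items did Bob rate? One dictionary lookup instead of a full scan.
bob_rows = user_index["Bob"]
bob_items = [ROWS[n - 1][1] for n in bob_rows]
print(bob_rows)   # [3, 6]
print(bob_items)  # ['Silver Linings Playbook', 'Twin Peaks']
```

Building the index costs one pass over the table; after that, each per-user query touches only the matching rows.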
Back to the Recommender
• Training: compute a matrix
• Recommending: vector-matrix product
• Raw data: user-item interaction log
  • Load in as flat table
  • Build index (user-item matrix)
  • Iterate through the users to train
ML on Flat Tables
• Anything where data is represented as feature vectors
• Computations operate on rows
  • Stochastic gradient descent
  • K-means clustering
• … or columns
  • Decision tree family
Data Structure 2: Graph
Example
[Figure: example graph with vertices Anna, Diana, Charlie, Frank, Tina, Bob, Sam]
Implementation 1: Edge List
• A simple flat table!
• Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)
User     Item
Alice    Breaking Bad, Season 1
Charlie  Twilight
Bob      Silver Linings Playbook
Frank    American Hustle
Tina     Plan 9 From Outer Space
Bob      Twin Peaks
Diana    Dr. Strangelove
…
Implementation 2: Edge List + Vertex List
• Two flat tables
• Pre-computed join on VertexID

Vertex list:
VertexID  Name                     Age  Genre
1         Alice                    50
2         Charlie                  26
3         Bob                      33
…
100001    Silver Linings Playbook       Romance
100002    Iron Man 3                    Action
100003    Twin Peaks                    Thriller

Edge list:
SrcVertex  DstVertex
1          389944
2          136782
3          100001
4          572639
5          200835
3          100003
…
Graph Operations
• get_neighbors():
  1. Query indexed flat table
  2. Join with vertex table on VertexID or Name

User  Movie                    Rating
Bob   Silver Linings Playbook  4
Bob   Twin Peaks               2

VertexID  Name                     Age  Genre
3         Bob                      33
100001    Silver Linings Playbook       Romance
100003    Twin Peaks                    Thriller
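The two steps of get_neighbors() can be sketched over an edge list and a vertex list like the ones shown earlier. The dictionary-based tables and the `src_index` name are illustrative choices, not GraphLab's implementation, and this version only follows edges outward from the source vertex.

```python
from collections import defaultdict

# Vertex table keyed by VertexID (only the rows visible on the slides).
VERTICES = {
    1: {"Name": "Alice", "Age": 50},
    2: {"Name": "Charlie", "Age": 26},
    3: {"Name": "Bob", "Age": 33},
    100001: {"Name": "Silver Linings Playbook", "Genre": "Romance"},
    100002: {"Name": "Iron Man 3", "Genre": "Action"},
    100003: {"Name": "Twin Peaks", "Genre": "Thriller"},
}

# Edge table: (SrcVertex, DstVertex), as listed on the slide.
EDGES = [(1, 389944), (2, 136782), (3, 100001),
         (4, 572639), (5, 200835), (3, 100003)]

# Index on SrcVertex: source vertex -> positions of its rows in EDGES.
src_index = defaultdict(list)
for position, (src, _dst) in enumerate(EDGES):
    src_index[src].append(position)


def get_neighbors(vertex_id):
    """Return the out-neighbors of `vertex_id` together with their attributes."""
    neighbors = []
    # Step 1: query the indexed edge table.
    for position in src_index.get(vertex_id, []):
        dst = EDGES[position][1]
        # Step 2: join with the vertex table on VertexID.
        # Vertices elided from the slide have no attributes in this sketch.
        attributes = VERTICES.get(dst, {})
        neighbors.append({"VertexID": dst, **attributes})
    return neighbors


for neighbor in get_neighbors(3):  # Bob
    print(neighbor)
```

For Bob (VertexID 3) this yields the vertex rows for "Silver Linings Playbook" and "Twin Peaks", matching the joined tables above.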
Graph Operations
• get_subgraph():
  • get_neighbors(), instantiate new table with subset of rows of old tables
• Find edges/vertices with attribute = x
  • Filter old tables
• Hypergraph: edges span more than 2 vertices
  • Just add more columns to the edge table
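Finding edges with a given attribute value and instantiating a subgraph both reduce to filtering the tables. A minimal sketch, reusing the user-item ratings from the earlier example as an edge table; the function names and the choice to derive the vertex set from the surviving edges are assumptions of this illustration.

```python
# Edge table with one attribute column (Rating), from the earlier example.
EDGE_TABLE = [
    {"User": "Alice", "Item": "Breaking Bad, Season 1", "Rating": 3},
    {"User": "Charlie", "Item": "Twilight", "Rating": 2},
    {"User": "Bob", "Item": "Silver Linings Playbook", "Rating": 4},
    {"User": "Frank", "Item": "American Hustle", "Rating": 2},
    {"User": "Tina", "Item": "Plan 9 From Outer Space", "Rating": 4},
    {"User": "Bob", "Item": "Twin Peaks", "Rating": 2},
    {"User": "Diana", "Item": "Dr. Strangelove", "Rating": 5},
]


def filter_edges(edges, attribute, value):
    """Find edges with attribute = value by filtering the old table."""
    return [edge for edge in edges if edge.get(attribute) == value]


def get_subgraph(edges, attribute, value):
    """New edge table plus the vertices that the surviving edges touch."""
    sub_edges = filter_edges(edges, attribute, value)
    sub_vertices = sorted({e["User"] for e in sub_edges} |
                          {e["Item"] for e in sub_edges})
    return sub_vertices, sub_edges


vertices, edges = get_subgraph(EDGE_TABLE, "Rating", 4)
print(vertices)
print(edges)
```

A hyperedge would fit the same scheme by adding further vertex columns to each row of the edge table.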
Back to Syslog Mining
• Wait graph construction = search and filter
• Iterate:
  • get_neighbors()
  • filter on edge and vertex attribute to find culprits
• Sequential process
• Underlying event graph is enormous
• SLOW
ML on Graphs
• Graphical models (Bayes nets)
  • Belief propagation
  • Gibbs sampling
• Random walk on Markov chains
  • PageRank
• Some algos are implementable on either
  • Matrix factorization
Graphs vs. Tables
[Figure: Tables and Graphs]
• Closely related
  • Graphs can be implemented on top of tables
• … yet different
  • What key operations to optimize
  • How much to pre-compute
    • Indexes
    • Joins
    • Filters
Popular Implementations
Flat Tables
[Figure: flat-table implementations placed on two axes, Random Access (In Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch). Implementations shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame.]
Graphs
[Figure: graph implementations placed on two axes, Random Access (In Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch). Implementations shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph.]
Conclusions
• Fast and scalable analysis hinges upon efficient data structures
  • Match the algo to the data structure
  • Morph raw data into the data structure
Raw Data → Data Structure → Algorithm → Insight
Advertising
• GraphLab Tutorial this afternoon!
• “Large Scale Machine Learning Cookbook Using GraphLab”
• Ballroom G, 1:30pm-5pm