What the Bleep Is Big Data? A Holistic View of Data and Algorithms
What the #(&*$ is Big Data? A Holistic View of Data and Algorithms. Alice Zheng, GraphLab. Strata Conference, Santa Clara, February 2014.

Upload: alice-zheng

Post on 18-Jan-2017


TRANSCRIPT

Page 1: What the Bleep is Big Data? A Holistic View of Data and Algorithms

What the #(&*$ is Big Data? A Holistic View of Data and Algorithms

Alice Zheng, GraphLab
Strata Conference, Santa Clara
February 2014

Page 2: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Background
• Machine Learning
  • Enable machines to understand the world
  • Play with data
• GraphLab
  • Unleash data science!
  • Enable non-ML experts to play with data
• This talk: a look at Big Data and Machine Learning from a tool builder’s perspective

Strata Conf, Feb 2014 2

Page 3: What the Bleep is Big Data? A Holistic View of Data and Algorithms

DATA


Page 4: What the Bleep is Big Data? A Holistic View of Data and Algorithms

What is Data?
• Data is an extension of ourselves
  • Pictures, texts, messages, logs
  • Sensors and devices
  • Measurements and experiments
• Data is organic; it is wild and messy
• Data proliferates

Page 5: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Producers of Big Data
• Tech industry
  • Google, Microsoft, Facebook, Amazon, Twitter, …
• Consumer/Retail
  • Walmart, Target, Amazon, Netflix, …
• Telecom
  • Verizon, AT&T, Telefonica, …
• Finance
  • Thomson Reuters, Dow Jones, …
• Health care and monitoring
  • Personal health metrics, health care records, …
• Science
  • Genome research, high energy physics, astronomy, NASA, …
• Etc.

Page 6: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Facebook
• 1.11 billion active users [March 2013]
• 665 million daily users on average [March 2013]
• Daily data amount: [Aug 2012]
  • 500+ TB data
  • 2.5 billion pieces of content
  • 2.7 billion “Like” actions
  • 300 million photos
  • Scans 105 TB data every ½ hour
• 100+ PB data stored on a single Hadoop cluster [Aug 2012]

Data Sources: [Yahoo! news] [TechCrunch]

Page 7: What the Bleep is Big Data? A Holistic View of Data and Algorithms

System Event Logs
ETW (Event Tracing for Windows)
• Logs of kernel and application events
• Up to 100K events per second
• Binary log size: ~200 MB every 2-5 minutes
• 20-50 TB/year from one machine
• ~50 PB/year from 1000 machines

Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
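The yearly totals follow from the per-log figures by simple arithmetic. The sketch below reproduces them, assuming decimal units (1 TB = 10^6 MB) and a machine that traces continuously all year; both assumptions are mine, not the slide's. It suggests that the "~50 PB/year" figure for 1000 machines corresponds to the upper end of the range.

```python
# Back-of-the-envelope check of the ETW volumes quoted on the slide.
# Assumptions (not from the slide): decimal units, tracing runs all year.
MB_PER_LOG = 200
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def tb_per_year(minutes_per_log: float) -> float:
    """Yearly volume in TB if one 200 MB log is written every N minutes."""
    logs_per_year = MINUTES_PER_YEAR / minutes_per_log
    return MB_PER_LOG * logs_per_year / 10**6


low = tb_per_year(5)   # one log every 5 minutes
high = tb_per_year(2)  # one log every 2 minutes

print(f"one machine:   {low:.1f} to {high:.1f} TB/year")
# 1000 machines produce 1000x as many TB, i.e. the same number in PB.
print(f"1000 machines: {low:.1f} to {high:.1f} PB/year")
```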

Page 8: What the Bleep is Big Data? A Holistic View of Data and Algorithms

A Picture of Big Data
[Figure: bubble chart. Axes: Total Size / Year (GB, TB, PB, EB) and Structure. Size of bubble = size of a single record (log scale). Categories: Science, Tech, Other. Data sets shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs, Facebook, Twitter.]

Page 9: What the Bleep is Big Data? A Holistic View of Data and Algorithms

TAKING THE LEAP
[Figure labels: Data, Insight]

Page 10: What the Bleep is Big Data? A Holistic View of Data and Algorithms

ALGORITHMS


Page 11: What the Bleep is Big Data? A Holistic View of Data and Algorithms

The Way to Insight
• What do people do with Big Data?
• Myriad algorithms for myriad tasks
• Two disparate examples
  • What movies would Bob like? – discovering recommendations from a crowd
  • Why is my machine so slow? – diagnosing systems using event logs

Page 12: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Algorithm Example 1: A Recommender System


Page 13: What the Bleep is Big Data? A Holistic View of Data and Algorithms

What Movies Would Bob Like?
• Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like?
• Given movie selections of many users, make recommendations for individuals

Page 14: What the Bleep is Big Data? A Holistic View of Data and Algorithms

User-Movie Interaction Matrix
[Table: rows are users (Bob, Anna, David, Ethan); columns are movies (Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive). Cell entries are not recoverable from the transcript.]

Page 15: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Finding Similar Movies
• Jaccard similarity between a pair of movies
• If every user who watched one or the other movie ends up watching both, then the two movies must be very similar.
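Jaccard similarity is the number of users who watched both movies divided by the number who watched at least one. A minimal sketch follows. The viewing sets are hypothetical, because the cells of the interaction matrix did not survive in this transcript. They are one assignment that is consistent with Bob having watched “Silver Linings Playbook” and “Twin Peaks” and with the values on the Movie Similarity Matrix slide (Page 20). Returning 0 for two movies that nobody watched is also an assumption.

```python
from fractions import Fraction

# Hypothetical viewing data: movie -> set of users who watched it.
# Chosen to agree with the similarity values shown on Page 20.
WATCHED = {
    "Silver Linings Playbook": {"Bob", "Anna"},
    "Hunger Games": {"Anna", "Ethan"},
    "Twin Peaks": {"Bob", "Anna", "David"},
    "Iron Man 3": set(),
    "Mulholland Drive": {"Anna", "David"},
}


def jaccard(a: set, b: set) -> Fraction:
    """|a & b| / |a | b|; defined here as 0 when both sets are empty."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a & b), len(union))


sim = jaccard(WATCHED["Silver Linings Playbook"], WATCHED["Hunger Games"])
print(sim)  # 1/3: one shared viewer out of three viewers in total
```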

Page 16: What the Bleep is Big Data? A Holistic View of Data and Algorithms

User-Movie Interaction Matrix
[Same matrix as on Page 14.]
Sim(“Silver Linings Playbook”, “Hunger Games”) = ?


Page 19: What the Bleep is Big Data? A Holistic View of Data and Algorithms

User-Movie Interaction Matrix
[Same matrix as on Page 14.]
Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3

Page 20: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Movie Similarity Matrix

                         Silver Linings Playbook  Hunger Games  Twin Peaks  Iron Man 3  Mulholland Drive
Silver Linings Playbook  1                        1/3           2/3         0           1/3
Hunger Games             1/3                      1             1/4         0           1/3
Twin Peaks               2/3                      1/4           1           0           2/3
Iron Man 3               0                        0             0           1           0
Mulholland Drive         1/3                      1/3           2/3         0           1

Page 21: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Making New Recommendations

    recs = []
    for movie in user.preferences:
        new_movies = Sim[movie, :].topk()
        recs.append(new_movies)
    recs.sort()

• Equivalently, take the vector-matrix product
  • vector = the user’s preferences
  • matrix = movie similarity matrix
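The vector-matrix formulation can be sketched in plain Python. The similarity values are those from the Movie Similarity Matrix slide. The function name `recommend`, the cutoff `k`, and the choice to skip movies the user has already watched are illustrative assumptions rather than anything the talk specifies.

```python
from fractions import Fraction as F

MOVIES = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Movie similarity matrix as given on the "Movie Similarity Matrix" slide.
SIM = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]


def recommend(watched, k=2):
    """Score every movie as (preference vector) x (similarity matrix)."""
    prefs = [1 if m in watched else 0 for m in MOVIES]
    scores = [sum(p * SIM[i][j] for i, p in enumerate(prefs))
              for j in range(len(MOVIES))]
    # Assumption: do not recommend what the user has already watched.
    unseen = [(s, m) for s, m in zip(scores, MOVIES) if m not in watched]
    unseen.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in unseen[:k]]


bob = {"Silver Linings Playbook", "Twin Peaks"}
print(recommend(bob))  # ['Mulholland Drive', 'Hunger Games']
```

For Bob the scores are 1 for “Mulholland Drive” (1/3 + 2/3), 7/12 for “Hunger Games” (1/3 + 1/4), and 0 for “Iron Man 3”.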

Page 22: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Key Ideas
• During training: compute item-item similarity matrix
• Making recommendations: take vector-matrix product

Page 23: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Algorithm Example 2: Diagnosing a Slow Computer

Page 24: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Why is My Machine So Slow?
• Slow machines are frustrating!
• Diagnose slowness via event logs

Page 25: What the Bleep is Big Data? A Holistic View of Data and Algorithms

ETW – Event Tracing for Windows
• Fine-grained event tracing
• Up to 100,000 events per second
[Figure: excerpt of sample ETW log]

Page 26: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Diagnosing Slowness
• Start from slow thread
• Walk backwards to construct wait graph
[Figure: wait graph along a time axis. Nodes: Firefox, Network Stack, Search Indexer, Anti-Virus Checker. Edge labels: TCP/IP packet, File Lock, File Lock.]

Page 27: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Key Algorithm Ideas
• The insight is a wait graph
• Constructing the graph involves repeated queries into a large set of events
• Iterate:
  • What was the current thread waiting on?
  • Go to the source of the wait
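The iteration can be sketched as a backwards walk over wait events. Everything concrete here is an assumption made for illustration: the record fields (`waiter`, `resource`, `holder`), the particular chain of waits (which borrows node names from the wait-graph figure but not its actual edges), and the simplification that each thread waits on only one thing, so the result is a chain rather than a general graph. Real ETW events are far richer.

```python
# Hypothetical, simplified wait events: `waiter` was blocked on `resource`,
# which was held or served by `holder`. This schema is an assumption.
WAIT_EVENTS = [
    {"waiter": "Firefox", "resource": "TCP/IP packet", "holder": "Network Stack"},
    {"waiter": "Network Stack", "resource": "File Lock", "holder": "Search Indexer"},
    {"waiter": "Search Indexer", "resource": "File Lock", "holder": "Anti-Virus Checker"},
]


def build_wait_chain(events, slow_thread):
    """Walk backwards from the slow thread, one wait at a time."""
    by_waiter = {e["waiter"]: e for e in events}  # index for repeated queries
    chain, current, seen = [], slow_thread, set()
    while current in by_waiter and current not in seen:
        seen.add(current)               # guard against cycles
        event = by_waiter[current]      # what was the current thread waiting on?
        chain.append((current, event["resource"], event["holder"]))
        current = event["holder"]       # go to the source of the wait
    return chain


for waiter, resource, holder in build_wait_chain(WAIT_EVENTS, "Firefox"):
    print(f"{waiter} waited on {resource} from {holder}")
```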

Page 28: What the Bleep is Big Data? A Holistic View of Data and Algorithms

What links these algorithms and data?


Page 29: What the Bleep is Big Data? A Holistic View of Data and Algorithms

DATA STRUCTURES – THE BRIDGE


Page 30: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Between Data and Algorithms
• Data structures
  • Organized data
  • Optimized for certain computations
  • The key to efficient analysis
• Algorithms prefer certain data structures
• Raw data is amenable to certain data structures
[Diagram: Data, Data Structures, Algorithms. Data is linked to Data Structures by “Amenable”; Algorithms are linked to Data Structures by “Preference”.]

Page 31: What the Bleep is Big Data? A Holistic View of Data and Algorithms

The Disconnect
• Machine Learning research – largely disconnected from implementation
• Some recent advances in large-scale ML are rediscovering known data structures
• Next-gen ML tools need well-tailored data structures
[Diagram: Machine Learning (statistics, optimization, linear algebra, …) and Data Structures (lists, trees, tables, graphs, …)]

Page 32: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Two Useful Data Structures
• Flat tables
• Graphs

Page 33: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Data Structure 1: Flat Table


Page 34: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Flat Tables
• Rows and columns
  • Rows = records
  • Columns can be typed
• A lot of raw data looks like flat tables!

Page 35: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example 1: User-Item interaction data

User     Item                     Rating  Time
Alice    Breaking Bad, Season 1   3       …
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5

Page 36: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example 2: Event log data

Timestamp  Name          PID   CPU  Stack  …
447590409  audiodg.exe   1848  1    ntkrnlpa.exe!KeSetEvent, ntkrnlpa.exe!WaitForLock
447590411  csrss.exe     460   0    …
447590415  iexplore.exe  2478  1    kernel64.exe!WaitForMultipleObjects

Page 37: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Variations of Flat Tables
• Query vs. computation
• Random access (in-memory) vs. sequential access (on-disk)
• Column vs. row-wise representation
• Indexed or not
• Distributed or not
• Key-value stores (hash tables)

Page 38: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Data Structure 1.5: Indexed Flat Table


Page 39: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example of Indexed Flat Table

User     Item                     Rating
Alice    Breaking Bad, Season 1   3
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5

Page 40: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example of Indexed Flat Table
[Same table as on Page 39, now with an Index.]
Query: What items did Bob rate?

Page 41: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example of Indexed Flat Table

Row  User     Item                     Rating
1    Alice    Breaking Bad, Season 1   3
2    Charlie  Twilight                 2
3    Bob      Silver Linings Playbook  4
4    Frank    American Hustle          2
5    Tina     Plan 9 From Outer Space  4
6    Bob      Twin Peaks               2
7    Diana    Dr. Strangelove          5

Index
Query: What items did Bob rate?
Index of “Bob” points to rows 3 and 6
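One way to picture the index is as a map from each user to the row numbers where that user appears. The sketch below uses the table rows from the slide; the dict-of-lists representation and the `build_index` helper are illustrative assumptions, not a description of any particular tool's index.

```python
from collections import defaultdict

# Rows of the flat table from the slide: (user, item, rating).
ROWS = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]


def build_index(rows, column=0):
    """Map each value in `column` to the 1-based row numbers that hold it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows, start=1):
        index[row[column]].append(row_number)
    return index


user_index = build_index(ROWS)
bob_rows = user_index["Bob"]                    # [3, 6]
bob_items = [ROWS[n - 1][1] for n in bob_rows]  # items Bob rated
print(bob_rows, bob_items)
```

Building the index takes one pass over the table; after that, answering the query touches only Bob's two rows instead of scanning all seven.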

Page 42: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Back to the Recommender
• Training: compute a matrix
• Recommending: vector-matrix product
• Raw data: user-item interaction log
  • Load in as flat table
  • Build index (user-item matrix)
  • Iterate through the users to train

Page 43: What the Bleep is Big Data? A Holistic View of Data and Algorithms

ML on Flat Tables
• Anything where data is represented as feature vectors
• Computations operate on rows
  • Stochastic gradient descent
  • K-means clustering
• … or columns
  • Decision tree family

Page 44: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Data Structure 2: Graph


Page 45: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example
[Figure: graph with vertices Anna, Diana, Charlie, Frank, Tina, Bob, Sam]

Page 46: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Implementation 1: Edge List
• A simple flat table!
• Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)

User     Item
Alice    Breaking Bad, Season 1
Charlie  Twilight
Bob      Silver Linings Playbook
Frank    American Hustle
Tina     Plan 9 From Outer Space
Bob      Twin Peaks
Diana    Dr. Strangelove

Page 47: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Implementation 2: Edge List + Vertex List
• Two flat tables
• Pre-computed join on VertexID

VertexID  Name                     Age  Genre
1         Alice                    50
2         Charlie                  26
3         Bob                      33
100001    Silver Linings Playbook       Romance
100002    Iron Man 3                    Action
100003    Twin Peaks                    Thriller

SrcVertex  DstVertex
1          389944
2          136782
3          100001
4          572639
5          200835
3          100003

Page 48: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Graph Operations
• get_neighbors():
  1. Query indexed flat table

Page 49: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Example of Indexed Flat Table
[Same table and index as on Page 41.]
Query: What items did Bob rate?
Index of “Bob” points to rows 3 and 6

Page 50: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Graph Operations
• get_neighbors():
  1. Query indexed flat table
  2. Join with vertex table on VertexID or Name

User  Movie                    Rating
Bob   Silver Linings Playbook  4
Bob   Twin Peaks               2

VertexID  Name                     Age  Genre
3         Bob                      33
100001    Silver Linings Playbook       Romance
100003    Twin Peaks                    Thriller
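The two steps can be sketched with plain Python containers standing in for the tables. The rows come from the slide. The function body, the dict-based index, and the join on Name (rather than VertexID) are illustrative assumptions, not GraphLab's implementation.

```python
# Edge table (user, movie, rating) and vertex table keyed by Name,
# using the rows shown on the slide.
EDGES = [
    ("Bob", "Silver Linings Playbook", 4),
    ("Bob", "Twin Peaks", 2),
]
VERTICES = {
    "Bob": {"VertexID": 3, "Age": 33},
    "Silver Linings Playbook": {"VertexID": 100001, "Genre": "Romance"},
    "Twin Peaks": {"VertexID": 100003, "Genre": "Thriller"},
}


def build_edge_index(edges):
    """Map each source vertex to the positions of its edges."""
    index = {}
    for position, (src, _dst, _rating) in enumerate(edges):
        index.setdefault(src, []).append(position)
    return index


EDGE_INDEX = build_edge_index(EDGES)


def get_neighbors(name):
    """Step 1: index lookup on the edge table. Step 2: join on Name."""
    neighbors = []
    for position in EDGE_INDEX.get(name, []):
        _src, dst, rating = EDGES[position]
        neighbors.append({"Name": dst, "Rating": rating, **VERTICES[dst]})
    return neighbors


for neighbor in get_neighbors("Bob"):
    print(neighbor)
```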

Page 51: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Graph Operations
• get_subgraph():
  • get_neighbors(), instantiate new table with subset of rows of old tables
• Find edges/vertices with attribute = x
  • Filter old tables
• Hypergraph – edges span more than 2 vertices
  • Just add more columns to the edge table
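Extracting a subgraph by attribute amounts to filtering the edge table and collecting the vertices that remain. A small sketch, using the user-item rows from the earlier flat-table example; the helper names and the "rating of 4 or more" predicate are hypothetical.

```python
# Edge table with one attribute column: (user, item, rating).
EDGES = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]


def get_subgraph(edges, predicate):
    """Filter the old edge table; return (vertices, edges) of what is left."""
    kept = [edge for edge in edges if predicate(edge)]
    vertices = sorted({v for src, dst, _attr in kept for v in (src, dst)})
    return vertices, kept


# Hypothetical filter: keep only edges with a rating of 4 or more.
vertices, kept = get_subgraph(EDGES, lambda edge: edge[2] >= 4)
print(kept)
print(vertices)
```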

Page 52: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Back to Syslog Mining
• Wait graph construction = search and filter
• Iterate:
  • get_neighbors()
  • filter on edge and vertex attribute to find culprits
• Sequential process
• Underlying event graph is enormous
• SLOW

Page 53: What the Bleep is Big Data? A Holistic View of Data and Algorithms

ML on Graphs
• Graphical models (Bayes nets)
  • Belief propagation
  • Gibbs sampling
• Random walk on Markov chains
  • PageRank
• Some algos are implementable on either
  • Matrix factorization

Page 54: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Graphs vs. Tables
[Figure labels: Tables, Graphs]

Page 55: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Graphs vs. Tables
• Closely related
  • Graphs can be implemented on top of tables
• … yet different
  • What key operations to optimize
  • How much to pre-compute
    • Indexes
    • Joins
    • Filters

Page 56: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Popular Implementations


Page 57: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Flat Tables
[Figure: implementations placed along two axes, Random Access (In Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch). Implementations shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame.]

Page 58: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Graphs
[Figure: implementations placed along two axes, Random Access (In-Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch). Implementations shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph.]

Page 59: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Conclusions
• Fast and scalable analysis hinges upon efficient data structures
  • Match the algo to the data structure
  • Morph raw data into the data structure
[Diagram: Raw Data → Data Structure → Algorithm → Insight]

Page 60: What the Bleep is Big Data? A Holistic View of Data and Algorithms

Advertising
• GraphLab Tutorial this afternoon!
  • “Large Scale Machine Learning Cookbook Using GraphLab”
  • Ballroom G, 1:30pm to 5pm