why exploring big data is hard - danyel fisher

WHY EXPLORING BIG DATA IS HARD (& WHAT WE CAN DO ABOUT IT) DANYEL FISHER, MICROSOFT RESEARCH

Upload: seattle-daml-meetup

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Why Exploring Big Data is Hard - Danyel Fisher

WHY EXPLORING BIG DATA IS HARD (& WHAT WE CAN DO ABOUT IT)

DANYEL FISHER, MICROSOFT RESEARCH

Page 2: Why Exploring Big Data is Hard - Danyel Fisher
Page 3: Why Exploring Big Data is Hard - Danyel Fisher

/tiles/r02123002133111.png

Page 4: Why Exploring Big Data is Hard - Danyel Fisher
Page 5: Why Exploring Big Data is Hard - Danyel Fisher
Page 6: Why Exploring Big Data is Hard - Danyel Fisher

One of the most popular spots in the world.

Based on a table with a few billion rows

Page 7: Why Exploring Big Data is Hard - Danyel Fisher
Page 8: Why Exploring Big Data is Hard - Danyel Fisher

Can you distinguish American users from international?

Page 9: Why Exploring Big Data is Hard - Danyel Fisher

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

By hand!

SLOW!

NETWORK!
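The stages above can be read as a small data flow. The following is a minimal sketch for a single histogram, assuming a toy in-memory list of records; the function name, the `latency_ms` and `country` fields, and the pixel scale are hypothetical illustrations, not anything shown in the talk.

```python
from collections import Counter

def histogram_pipeline(rows, dimension, keep, n_buckets, width_px=200):
    """Toy walk through the pipeline stages for a single histogram."""
    # Relevant dimensions + filter data
    values = [row[dimension] for row in rows if keep(row)]
    if not values:
        return []
    # Choose bucket bounds: equal-width buckets between min and max
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width if all values match
    # Aggregate data: count the values that fall in each bucket
    counts = Counter(min(int((v - lo) / width), n_buckets - 1) for v in values)
    # Create shapes + assign scales: one bar per bucket, scaled to width_px
    tallest = max(counts.values())
    return [
        {"bucket": b,
         "count": counts.get(b, 0),
         "bar_px": round(width_px * counts.get(b, 0) / tallest)}
        for b in range(n_buckets)
    ]

# Hypothetical records; rendering to screen is left to whatever draws the bars.
rows = [{"country": "US", "latency_ms": v} for v in (0, 10, 20, 30, 40)]
rows.append({"country": "FR", "latency_ms": 1000})
bars = histogram_pipeline(rows, "latency_ms", lambda r: r["country"] == "US", 2)
```

On a few billion rows, every stage before "create shapes" touches all of the data, which is where the "SLOW!" and "NETWORK!" annotations bite.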

Page 10: Why Exploring Big Data is Hard - Danyel Fisher

Defining “Big” Volume

“…200,000 magnetic tape reels which represent over 900 billion characters of data”

1975

Page 11: Why Exploring Big Data is Hard - Danyel Fisher

“the size of the dataset is part of the problem”

Page 12: Why Exploring Big Data is Hard - Danyel Fisher

Why is Big Data different?

REPRESENTATION

What visualizations are suitable for big data?

INTERACTION

What do we need to do to make that visualization useful for interaction?

Page 13: Why Exploring Big Data is Hard - Danyel Fisher

And it’s costly!

Big data has the potential to cost unlimited amounts of money

A query on 100 cores for an hour costs 100 core-hours … and an analyst-hour.

Massive savings for doing less, or early termination
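The arithmetic behind this slide can be sketched in a few lines. The hourly rates below are hypothetical placeholders chosen only to show the shape of the calculation; the talk gives no prices.

```python
def query_cost(cores, hours, core_hour_rate, analyst_hour_rate):
    """Machine cost plus the cost of the analyst waiting for the answer."""
    core_hours = cores * hours
    return core_hours * core_hour_rate + hours * analyst_hour_rate

# Hypothetical rates: 0.05 per core-hour, 60 per analyst-hour.
full = query_cost(cores=100, hours=1.0, core_hour_rate=0.05, analyst_hour_rate=60.0)
early = query_cost(cores=100, hours=0.1, core_hour_rate=0.05, analyst_hour_rate=60.0)
```

Stopping the same query after a tenth of the time cuts both terms by the same factor, which is the saving that early termination offers.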

Page 14: Why Exploring Big Data is Hard - Danyel Fisher

A Note on Infrastructure

Page 15: Why Exploring Big Data is Hard - Danyel Fisher

You Won’t Plot Every Point…

Screen space to draw each data point [10^6 points]

Every data point in memory [10^9 bytes]

Store all the data points [10^12 bytes]

Page 16: Why Exploring Big Data is Hard - Danyel Fisher

… Even If You Tried

Scatterplot of x vs. y (at least one pixel per point)

Network Diagram

Parallel Coordinates (individual lines)

Page 17: Why Exploring Big Data is Hard - Danyel Fisher

Aggregation

What is the aggregation equivalent of a bar graph?

What is an aggregated line chart, or a scatterplot?

N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May 2010.

Page 18: Why Exploring Big Data is Hard - Danyel Fisher

Some things aggregate well

Page 19: Why Exploring Big Data is Hard - Danyel Fisher

[Figure: two charts with y-axes from 0 to 200 and yearly date ticks from 3/1/1986 to 3/1/2014]

Daily values

Monthly aggregate: min and max
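A time series is one of the things that aggregates well: a month of daily values collapses into a single (min, max) band. This is a minimal sketch assuming the data arrives as `(date, value)` pairs; the function name and sample values are illustrative only.

```python
from datetime import date

def monthly_min_max(daily):
    """Collapse (date, value) pairs into {(year, month): (min, max)}."""
    bands = {}
    for day, value in daily:
        key = (day.year, day.month)
        lo, hi = bands.get(key, (value, value))
        bands[key] = (min(lo, value), max(hi, value))
    return bands

# Made-up daily values, used only to exercise the function.
bands = monthly_min_max([
    (date(1986, 3, 1), 5),
    (date(1986, 3, 15), 9),
    (date(1986, 4, 2), 7),
])
```

Each month then needs only two numbers on screen, however many daily readings it contains.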

Page 20: Why Exploring Big Data is Hard - Danyel Fisher

Multiple dimensions

Liu, Jiang, Heer: imMens (2013)

Page 21: Why Exploring Big Data is Hard - Danyel Fisher

Wattenberg: PivotGraph (2005)

Page 22: Why Exploring Big Data is Hard - Danyel Fisher
Page 23: Why Exploring Big Data is Hard - Danyel Fisher

Treemaps (mostly)

Page 24: Why Exploring Big Data is Hard - Danyel Fisher

“Generalized Histograms”

Select buckets on data

then

Examine points, placing them into buckets

then

Create shapes based on buckets

Hadley Wickham: "Bin, Summarize, Smooth: A Framework for Visualizing Large Data"
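The three steps above apply to more than one-dimensional histograms. As a minimal sketch, here is the same recipe for an aggregated scatterplot: choose a grid of buckets, place each point into a cell, and leave one count per cell to be drawn as a shaded rectangle. The function name and grid layout are assumptions made for illustration.

```python
def bin_2d(points, x_range, y_range, nx, ny):
    """Count (x, y) points per grid cell: an aggregated scatterplot."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # outside the chosen bucket bounds
        # Points on the upper edge are clamped into the last bucket.
        col = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        row = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        grid[row][col] += 1
    return grid

grid = bin_2d([(0, 0), (0.9, 0.9), (1, 1), (0.1, 0.8), (5, 5)],
              x_range=(0, 1), y_range=(0, 1), nx=2, ny=2)
```

The output size depends only on `nx * ny`, not on the number of points, which is what makes the result drawable.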

Page 25: Why Exploring Big Data is Hard - Danyel Fisher

Big Data Exploration

EXPLORATION

Learn about the dataset

Explore multiple hypotheses

Manipulate data freely

May be discarded after completion

Rapid iteration

Examples: some of Tableau, Power View, ggplot, etc.

PRESENTATION

Communicate a specific view

Constrain interaction

Visual style important

Examples: visual dashboards, data storytelling

Page 26: Why Exploring Big Data is Hard - Danyel Fisher
Page 27: Why Exploring Big Data is Hard - Danyel Fisher

The Story of Walt

the hypothetical histogram

Page 28: Why Exploring Big Data is Hard - Danyel Fisher

The Story of Walt

ASSUMPTION

The dataset is too big to fit into memory

ASSUMPTION

Every query takes a full minute

Page 29: Why Exploring Big Data is Hard - Danyel Fisher

Creating Walt

(Min, Max)

Bucket all points

Total time: 2 minutes
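Under the two assumptions (data too big for memory, a full minute per query), Walt takes two passes: one to find the min and max, one to bucket every point. This is a minimal sketch assuming the data can be re-read as chunks; `read_chunks` is a hypothetical callable standing in for whatever scans the real store.

```python
def two_pass_histogram(read_chunks, n_buckets):
    """read_chunks() must return a fresh iterator over chunks of numbers."""
    # Pass 1: find (min, max) without holding the data in memory.
    lo, hi = float("inf"), float("-inf")
    for chunk in read_chunks():
        for v in chunk:
            lo, hi = min(lo, v), max(hi, v)
    # Pass 2: bucket all points into equal-width buckets.
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width if all values match
    counts = [0] * n_buckets
    for chunk in read_chunks():
        for v in chunk:
            counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return lo, hi, counts

# Two small in-memory chunks stand in for a scan of the full dataset.
lo, hi, counts = two_pass_histogram(lambda: iter([[1, 2, 3], [4, 5, 10]]), 3)
```

If each pass is one full-minute query, the two loops account for the two minutes on the slide.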

Page 30: Why Exploring Big Data is Hard - Danyel Fisher

Interact With Walt

CHANGE BUCKET COUNT

One pass.

Re-bucket every point

Or maybe we were clever…

CROSS-FILTER WALT WITH ANOTHER HISTOGRAM

One pass.

Check filter on every point

Or maybe we were clever…
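One possible form of "clever" for the bucket-count case, offered here as an assumption rather than as the approach from the talk, is to bucket once at a fine resolution and then merge adjacent fine buckets whenever the user asks for fewer bars. Merging touches only the cached counts, not the data.

```python
def rebucket(fine_counts, factor):
    """Merge each run of `factor` adjacent fine buckets into one coarse bucket."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [sum(fine_counts[i:i + factor])
            for i in range(0, len(fine_counts), factor)]

fine = [1, 2, 3, 4, 5, 6, 7, 8]   # counts cached from a single pass
coarse = rebucket(fine, 2)        # half as many bars, no new pass
```

The limitation is that only bucket sizes that are multiples of the fine width are available, and the last coarse bucket is narrower when the factor does not divide the fine bucket count evenly.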

Page 31: Why Exploring Big Data is Hard - Danyel Fisher

How clever do we have to be?

Which operations are worth pre-caching?

◦ Change number of buckets, or their size

◦ Zoom in on a single bar

◦ Filter out some data

◦ Cross-filter into other visualizations

◦ Cross-filter from other visualizations

◦ Show sample rows from the histogram

OLAP!

Page 32: Why Exploring Big Data is Hard - Danyel Fisher

The Moral of Walt’s Story

Decide which operations we will support rapidly … and which we’ll tolerate being slow

Page 33: Why Exploring Big Data is Hard - Danyel Fisher

Solution Space

◦ Work Offline

◦ Index

OLAP: Pentaho

imMens, Nanocubes

◦ Restrict Data

◦ Sample (or Stream)

◦ Divide & Conquer

◦ Multiple passes across the data in parallel

Limited exploration!

Page 34: Why Exploring Big Data is Hard - Danyel Fisher

Trade accuracy for latency

Time

100%

Online

Traditional

Image adapted from Hellerstein

Page 35: Why Exploring Big Data is Hard - Danyel Fisher

Computing Confidence Bounds

bounds ~ variance / samples
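The bounds tighten as the sample count grows and widen with the variance. For a sample mean, the standard normal-approximation interval has half-width z * sqrt(variance / samples). This is a minimal sketch of a progressive estimate built on that formula, assuming the rows arrive in random order; the function name and reporting interval are illustrative, and the talk may compute its bounds differently.

```python
import math
import random

def progressive_mean(stream, z=1.96, report_every=1000):
    """Yield (n, mean, half_width) as more samples arrive (Welford's method)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)      # running sum of squared deviations
        if n % report_every == 0 and n > 1:
            variance = m2 / (n - 1)   # sample variance so far
            yield n, mean, z * math.sqrt(variance / n)

# Synthetic data: 10,000 draws from a normal distribution (mean 50, sd 10).
random.seed(0)
data = [random.gauss(50, 10) for _ in range(10000)]
estimates = list(progressive_mean(data))
```

An analyst can stop consuming the generator as soon as the half-width is small enough for the decision at hand.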

Page 36: Why Exploring Big Data is Hard - Danyel Fisher

The Progressive Pitch

“Trust Me, I'm Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster”

Page 37: Why Exploring Big Data is Hard - Danyel Fisher

CHI 2012

Page 38: Why Exploring Big Data is Hard - Danyel Fisher

What We Learned

Users made lots of mistakes

…carried out lots of queries

…and cut them off early

Users were fearless about exploration

Most numbers are rough

Randomness in databases is a pain

Page 39: Why Exploring Big Data is Hard - Danyel Fisher
Page 40: Why Exploring Big Data is Hard - Danyel Fisher

Supporting Streams

RESERVOIR SAMPLE

Keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir
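A common way to maintain such a sample in one pass is the classic "Algorithm R". The sketch below assumes an in-process Python iterable as the stream; the function name is illustrative, and the talk does not say which variant it has in mind.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir

random.seed(1)
sample = reservoir_sample(range(1000), 10)
```

After `size` items have streamed past, every item has had the same k/size chance of being in the reservoir, and memory use never exceeds k items.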

EQUI-DEPTH HISTOGRAM

Good one-pass algorithms exist

… but we have no idea how to visualize them

Page 41: Why Exploring Big Data is Hard - Danyel Fisher

Incremental changes the rules

Categorical: add categories on the fly

Numerical: changing bounds

Any color map or scale can change

Page 42: Why Exploring Big Data is Hard - Danyel Fisher

SAMPLING: You’ll never know it all

TASKS

Find extreme

Compare bars

Bar to constant

Bar to range

Order (top-K)

Page 43: Why Exploring Big Data is Hard - Danyel Fisher

SAMPLING: Probabilistic Views

“Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”

Design Goals

Easy to interpret

Consistency across tasks

Spatial Stability

Minimize Visual Noise (overhead)

Page 44: Why Exploring Big Data is Hard - Danyel Fisher

“Is Bar A > Bar B”
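With sampled data the question becomes probabilistic: how likely is it that the true value behind bar A exceeds the one behind bar B? This is a minimal sketch using a normal approximation on the two sample means. That approximation is an assumption chosen for brevity and is not necessarily the method used in the paper cited earlier.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def prob_a_greater(sample_a, sample_b):
    """Approximate P(true mean of A > true mean of B) from two samples.

    Each sample needs at least two values so a variance can be computed.
    """
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    diff = mean(sample_a) - mean(sample_b)
    if se == 0:
        # No spread in either sample: the comparison is decided by diff alone.
        return 1.0 if diff > 0 else 0.0 if diff < 0 else 0.5
    return NormalDist().cdf(diff / se)

a = [10, 11, 12, 13, 14]
b = [1, 2, 3, 4, 5]
p = prob_a_greater(a, b)
```

A view can then show this probability directly rather than asking the reader to compare two overlapping error bars by eye.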

Page 45: Why Exploring Big Data is Hard - Danyel Fisher

Compare to constant

Page 46: Why Exploring Big Data is Hard - Danyel Fisher

Other Tasks

Find extreme

Compare to Range

Page 47: Why Exploring Big Data is Hard - Danyel Fisher

A Tentative Framework

Page 48: Why Exploring Big Data is Hard - Danyel Fisher

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

CACHE or INDEX

NETWORK!

SAMPLE

Place These!

Page 49: Why Exploring Big Data is Hard - Danyel Fisher

Using the Framework

HOTMAP

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

CACHE

NETWORK!

CACHE

D3

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

NETWORK!

CACHE

Page 50: Why Exploring Big Data is Hard - Danyel Fisher

Using the Framework

SERVER-SIDE RENDER

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

NETWORK!

D3

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

NETWORK!

Page 51: Why Exploring Big Data is Hard - Danyel Fisher

Using the Framework

OLAP & PRE-INDEX

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

SAMPLE ACTION

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

CACHE

NETWORK!

NETWORK!

SAMPLE

OLAP: Pentaho, Mondrian

imMens, NanoCubes

Page 52: Why Exploring Big Data is Hard - Danyel Fisher

Cross-Disciplinarity

This isn’t the way SQL, or Hadoop, works today

Infovis needs to be very integrated with the back-end

New skills, new training

Close collaboration across fields

Page 53: Why Exploring Big Data is Hard - Danyel Fisher

Let’s Build Cool Stuff!

@fisherdanyel

http://research.microsoft.com/bigdataux