WHY EXPLORING BIG DATA IS HARD (& WHAT WE CAN DO ABOUT IT)
DANYEL FISHER, MICROSOFT RESEARCH
TRANSCRIPT
One of the most popular spots in the world.
Based on a table with a few billion rows
Can you distinguish American users from international?
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
By hand!
SLOW!
NETWORK!
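As a rough sketch, the pipeline above can be written as a chain of functions. Every function and field name here is illustrative, not from the talk, and a real system would stream these steps rather than materialize lists:

```python
# A minimal sketch of the pipeline above; names and toy data are illustrative.

def select_dimensions(rows, dims):
    """Keep only the relevant dimensions of each row."""
    return [{d: r[d] for d in dims} for r in rows]

def filter_data(rows, predicate):
    return [r for r in rows if predicate(r)]

def choose_bucket_bounds(rows, dim, n_buckets):
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    width = (hi - lo) / n_buckets or 1   # guard against a zero-width range
    return [lo + i * width for i in range(n_buckets + 1)]

def aggregate(rows, dim, bounds):
    """Count rows per bucket (the 'aggregate data' step)."""
    width = bounds[1] - bounds[0]
    counts = [0] * (len(bounds) - 1)
    for r in rows:
        i = min(int((r[dim] - bounds[0]) / width), len(counts) - 1)
        counts[i] += 1
    return counts

rows = select_dimensions(
    [{"latency": x, "junk": 0} for x in [1, 2, 2, 3, 8, 9, 9, 10]],
    ["latency"])
rows = filter_data(rows, lambda r: r["latency"] > 1)
bounds = choose_bucket_bounds(rows, "latency", 2)
counts = aggregate(rows, "latency", bounds)
print(counts)  # -> [3, 4]
```

From here, "create shapes," "assign scales," and "render" belong to the charting layer; the talk's point is where the slow steps and the network sit in this chain.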
Defining “Big”: Volume
“…200,000 magnetic tape reels which represent over 900 billion characters of data” (1975)
“the size of the dataset is part of the problem”
Why is Big Data different?
REPRESENTATION
What visualizations are suitable for big data?
INTERACTION
What do we need to do to make that visualization useful for interaction?
And it’s costly!
Big data has the potential to cost unlimited amounts of money
A query on 100 cores for an hour costs 100 core-hours … and an analyst-hour.
Massive savings for doing less, or early termination
A Note on Infrastructure
You Won’t Plot Every Point…
Screen space to draw each data point [10^6 points]
Every data point in memory [10^9 bytes]
Store all the data points [10^12 bytes]
… Even If You Tried
Scatterplot (at least one pixel per point)
Network Diagram
Parallel Coordinates (individual lines)
Aggregation
What is the aggregation equivalent of a bar graph?
What is an aggregated line chart, or a scatterplot?
N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May 2010.
Some things aggregate well
[Chart: “Daily values,” 3/13/1986–3/13/2014, y-axis 0–200, overlaid with “Monthly aggregate: min and max”]
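The daily-to-monthly reduction in that chart can be sketched as follows. The data here is synthetic, and grouping on a (year, month) key is just one way to do it:

```python
from itertools import groupby
from datetime import date, timedelta

# Synthetic stand-in for the chart's daily series (values in 0..199).
start = date(1986, 3, 13)
daily = [(start + timedelta(days=i), (i * 37) % 200) for i in range(120)]

# Reduce daily values to a per-month (min, max) envelope -- the
# min/max band the aggregated chart draws instead of every day.
monthly = {}
for month, days in groupby(daily, key=lambda d: (d[0].year, d[0].month)):
    vals = [v for _, v in days]
    monthly[month] = (min(vals), max(vals))

for (y, m), (lo, hi) in sorted(monthly.items()):
    print(f"{y}-{m:02d}: min={lo} max={hi}")
```

Note that `groupby` relies on the daily series already being in date order, which a time series usually is.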
Multiple dimensions
Liu, Jiang, Heer: imMens (2013)
Wattenberg: PivotGraph (2005)
Treemaps (mostly)
“Generalized Histograms”
Select buckets on data → examine points, placing them into buckets → create shapes based on buckets
Hadley Wickham: “Bin, Summarize, Smooth: A Framework for Visualizing Large Data”
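The bin–summarize idea fits in a few lines. A minimal sketch, assuming fixed-width bins and a pluggable summary function:

```python
from collections import defaultdict

def bin_summarize(points, bin_width, summarize=len):
    """Generalized histogram: assign points to fixed-width bins,
    then compute one summary value (the 'shape') per bin."""
    bins = defaultdict(list)
    for x in points:
        bins[int(x // bin_width)].append(x)
    return {b * bin_width: summarize(vs) for b, vs in sorted(bins.items())}

data = [0.5, 1.2, 1.9, 2.1, 2.2, 7.8]
print(bin_summarize(data, bin_width=2))       # counts per bin
print(bin_summarize(data, 2, summarize=sum))  # sums per bin
```

Swapping `summarize` for `min`/`max`, a mean, or a quantile gives the aggregated line charts and scatterplots asked about above.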
Big Data Exploration
EXPLORATION
Learn about the dataset
Explore multiple hypotheses
Manipulate data freely
May be discarded after completion
Rapid iteration
Examples: some of Tableau, Power View, ggplot, etc.
PRESENTATION
Communicate a specific view
Constrain interaction
Visual style important
Examples: visual dashboards, data storytelling
The Story of Walt: the hypothetical histogram
The Story of Walt
ASSUMPTION
The dataset is too big to fit into memory
ASSUMPTION
Every query takes a full minute
Creating Walt
One pass to find (min, max); one pass to bucket all points
Total time: 2 minutes
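Under the slide's assumptions, where each scan of the table costs a full minute, Walt is exactly two scans: one for the bounds, one to bucket. A sketch, where `scan` is a stand-in for a pass over the billion-row table:

```python
def build_walt(scan, n_buckets):
    """Build a histogram in two full passes over the data.
    `scan()` yields every value; each call is one (one-minute) pass."""
    # Pass 1: find the bounds.
    lo = hi = None
    for x in scan():
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
    # Pass 2: bucket every point against those bounds.
    counts = [0] * n_buckets
    width = (hi - lo) / n_buckets or 1
    for x in scan():
        counts[min(int((x - lo) / width), n_buckets - 1)] += 1
    return lo, hi, counts

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(build_walt(lambda: iter(data), 4))  # -> (1, 9, [3, 2, 2, 1])
```

Any interaction that invalidates the bounds or the bucket assignment forces at least one more scan, which is the point of the next slides.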
Interact With Walt: CHANGE BUCKET COUNT
One pass.
Re-bucket every point
Or maybe we were clever…
CROSS-FILTER WALT WITH ANOTHER HISTOGRAM
One pass.
Check filter on every point
Or maybe we were clever…
How clever do we have to be? Which operations are worth pre-caching?
◦ Change number of buckets, or their size
◦ Zoom in on a single bar
◦ Filter out some data
◦ Cross-filter into other visualizations
◦ Cross-filter from other visualizations
◦ Show sample rows from the histogram
OLAP!
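The OLAP move is to pre-aggregate along the dimensions you expect to filter on, so a cross-filter becomes a lookup instead of a one-minute scan. A toy data cube; the dimensions (country, browser) are made-up examples:

```python
from collections import Counter
from itertools import product

# Toy rows: (country, browser). A real cube would be built offline.
rows = [("US", "Edge"), ("US", "Chrome"), ("DE", "Chrome"),
        ("US", "Edge"), ("FR", "Edge")]

# Pre-aggregate counts for every combination of dimension values,
# using None as the "all values" wildcard for each dimension.
cube = Counter()
for country, browser in rows:
    for c, b in product((country, None), (browser, None)):
        cube[(c, b)] += 1

print(cube[(None, None)])    # -> 5, total rows
print(cube[("US", None)])    # -> 3, filter on country only
print(cube[("US", "Edge")])  # -> 2, cross-filtered count, no scan needed
```

The cost is that the cube grows with the product of dimension cardinalities, which is exactly the trade-off behind systems like Nanocubes.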
The Moral of Walt’s Story: decide which operations we’ll support rapidly … and which we’ll tolerate being slow
Solution Space
◦ Work Offline
◦ Index (OLAP: Pentaho; imMens, Nanocubes)
◦ Restrict Data
◦ Sample (or Stream)
◦ Divide & Conquer (multiple passes across the data in parallel)
Limited exploration!
Trade accuracy for latency
[Chart: accuracy over time. Online aggregation approaches 100% incrementally; a traditional query returns 100% only at completion. Image adapted from Hellerstein]
Computing Confidence Bounds
bounds ∝ √(variance / samples)
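A minimal sketch of that formula: the half-width of a (here 95%) confidence interval on a running mean is z·√(variance/n), so the bounds tighten as more samples arrive. The synthetic data is illustrative:

```python
import math
import random

def confidence_bounds(sample, z=1.96):
    """Confidence interval for the mean of a sample:
    mean ± z * sqrt(variance / n)."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half = z * math.sqrt(variance / n)
    return mean - half, mean + half

random.seed(0)
sample = [random.gauss(50, 10) for _ in range(400)]
lo, hi = confidence_bounds(sample)
print(f"{lo:.1f} .. {hi:.1f}")  # bounds shrink as the sample grows
```

Quadrupling the sample only halves the bounds, which is why progressive systems show useful answers early: the first few percent of the data does most of the work.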
The Progressive Pitch
“Trust Me, I'm Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster” (CHI 2012)
What We Learned
Users made lots of mistakes
…carried out lots of queries
…and cut them off early
Users were fearless about exploration
Most numbers are rough
Randomness in databases is a pain
Supporting Streams: RESERVOIR SAMPLE
Keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir
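A standard way to maintain such a reservoir in one pass is Vitter's Algorithm R; a minimal sketch:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep k elements such that each of the n elements seen so far
    has a k/n chance of being in the reservoir (Algorithm R)."""
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform index in [0, n)
            if j < k:
                reservoir[j] = x         # replace with probability k/n
    return reservoir

random.seed(42)
print(reservoir_sample(range(1_000_000), k=5))
```

One pass, O(k) memory, and the sample stays valid at every instant, which is what makes it usable for progressive visualization.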
EQUI-DEPTH HISTOGRAM
Good one-pass algorithms exist
… but we have no idea how to visualize them
Incremental changes the rules
Categorical: add categories on the fly
Numerical: changing bounds
Any color map or scale can change
SAMPLING: You’ll never know it all
TASKS
Find extreme
Compare bars
Bar to constant
Bar to range
Order (top-K)
SAMPLING: Probabilistic Views
“Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”
Design Goals
Easy to interpret
Consistency across tasks
Spatial Stability
Minimize Visual Noise (overhead)
“Is Bar A > Bar B?”
Compare to constant
Other Tasks
Find extreme; Compare to range
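One generic way to answer tasks like "Is Bar A > Bar B?" from a sample is to bootstrap the comparison into a probability. This is an illustrative sketch, not the paper's technique:

```python
import random

def prob_a_greater_b(sample_a, sample_b, trials=2000, rng=random):
    """Bootstrap estimate of P(mean(A) > mean(B)) given only samples."""
    wins = 0
    for _ in range(trials):
        # Resample each bar's data with replacement and compare means.
        mean_a = sum(rng.choices(sample_a, k=len(sample_a))) / len(sample_a)
        mean_b = sum(rng.choices(sample_b, k=len(sample_b))) / len(sample_b)
        wins += mean_a > mean_b
    return wins / trials

random.seed(1)
a = [random.gauss(10, 2) for _ in range(50)]
b = [random.gauss(9, 2) for _ in range(50)]
print(prob_a_greater_b(a, b))  # near 1 when A is clearly larger
```

A probability like this is what a probabilistic view can encode directly, instead of asking the user to eyeball overlapping error bars.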
A Tentative Framework
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
CACHE or INDEX
NETWORK!
SAMPLE
Place These!
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
Using the Framework: HOTMAP
CACHE
NETWORK!
CACHE
D3
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
NETWORK!
CACHE
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
Using the Framework: SERVER-SIDE RENDER
NETWORK!
D3
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
NETWORK!
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
Using the Framework: OLAP & PRE-INDEX
SAMPLE ACTION
Raw data → Relevant dimensions → Filter data → Choose bucket bounds → Aggregate data → Create shapes → Assign scales to shapes → Render to screen
CACHE
NETWORK!
NETWORK!
SAMPLE
OLAP: Pentaho, Mondrian; imMens, Nanocubes
Cross-Disciplinarity
This isn’t the way SQL (or Hadoop) works today
Infovis needs to be tightly integrated with the back-end
New skills, new training
Close collaboration across fields
Let’s Build Cool Stuff!
@fisherdanyel
http://research.microsoft.com/bigdataux