Interactive Tools for Data Transformation & Visualization
Jeffrey Heer Stanford University
How much data (bytes) did we produce in 2010?
2010: 1,200 exabytes
Gantz et al, 2008, 2010
10x increase over 5 years
cabspotting.org
Records of Human Activity
Wikipedia History Flow (IBM)
The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, … because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.
Hal Varian, Google’s Chief Economist
The McKinsey Quarterly, Jan 2009
Acquisition
Cleaning
Integration
Analysis
Visualization
Presentation
Dissemination
The Elephant in the Room
Data Wrangling (n):
A process of iterative data exploration and transformation that enables analysis.
The goal of wrangling is to make data useful:
  Map data to a form readable by downstream tools (database, stats, visualization, …)
  Identify, document, and (where possible) address data quality issues.
DataWrangler
with Sean Kandel, Andreas Paepcke & Joe Hellerstein
Interactive Data Table
Suggested Transforms
Transform History
Data Quality Meter
Data Wrangler
Declarative data transformation language
  Tuple mapping – split, merge, extract, delete
  Lookups and joins – e.g., FIPS code to US state
  Reshaping – e.g., cross-tabulation
  Sorting, aggregation, etc.
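The transform categories listed here can be illustrated with a small sketch in plain JavaScript. This is not Wrangler's actual language or API: the operation names (`split`, `extract`, `deleteRows`, `lookup`), the row format, and the sample data are all hypothetical. The sketch only shows the general idea that a wrangling script can be a list of transform descriptions that an interpreter applies in order.

```javascript
// Hypothetical sketch: a wrangling script as a list of declarative steps.
// Rows are plain objects; every transform returns a new array of rows.
const transforms = {
  // Split one column into several named columns on a delimiter.
  split: ({ column, on, into }) => rows =>
    rows.map(row => {
      const parts = String(row[column]).split(on);
      const out = { ...row };
      delete out[column];
      into.forEach((name, i) => { out[name] = (parts[i] || "").trim(); });
      return out;
    }),
  // Extract the first regex match from a column into a new column.
  extract: ({ column, pattern, into }) => rows =>
    rows.map(row => {
      const m = String(row[column]).match(pattern);
      return { ...row, [into]: m ? m[0] : null };
    }),
  // Delete the rows that satisfy a predicate.
  deleteRows: ({ where }) => rows => rows.filter(row => !where(row)),
  // Look up a code in a table, e.g. FIPS code to US state.
  lookup: ({ column, table, into }) => rows =>
    rows.map(row => {
      const key = row[column];
      const found = Object.prototype.hasOwnProperty.call(table, key);
      return { ...row, [into]: found ? table[key] : null };
    }),
};

// Apply each step of the script in order.
function runScript(script, rows) {
  return script.reduce(
    (current, step) => transforms[step.op](step)(current),
    rows
  );
}

// Tiny sample lookup table (two real FIPS state codes).
const fipsToState = { "06": "California", "36": "New York" };

const script = [
  { op: "split", column: "raw", on: ",", into: ["fips", "note"] },
  { op: "deleteRows", where: row => row.fips === "" },
  { op: "extract", column: "note", pattern: /\d{4}/, into: "year" },
  { op: "lookup", column: "fips", table: fipsToState, into: "state" },
];

// Made-up input rows for illustration.
const cleaned = runScript(script, [
  { raw: "06, reported 2004" },
  { raw: ", missing code" },
  { raw: "36, reported 2005" },
]);
console.log(cleaned);
```

Because the script is data rather than code, it could in principle be inspected, edited, or replayed on another file, which is the property a transform history relies on.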
Informed by prior work in databases, namely Potter’s Wheel & SchemaSQL
Data Wrangler
Declarative data transformation language
+ Mixed-initiative interface for data transforms
  Select data elements of interest
  Suggest applicable transforms
  Enable rapid preview and refinement
Comparative Evaluation
Compared Wrangler performance to Excel with 3 data cleaning tasks on small data sets.
Median completion time with Wrangler was at least twice as fast in all tasks.
Skilled Excel users benefit proportionately!
Acquisition
Cleaning
Integration
Analysis
Visualization
Presentation
Dissemination
How do people create visualizations?
Chart Typology
  Pick from a stock of templates
  Easy-to-use but limited expressiveness
  Prohibits novel designs, new data types
Component Architecture
  Permits more combinatorial possibilities
  Novel views require new operators, which requires software engineering.
Chart Typologies: Excel, Many Eyes, Google Charts
Visual Analysis Languages: Tableau VizQL, ggplot2, HiVE
?
Component Model Architectures: Improvise, Prefuse, Flare
Graphics APIs: OpenGL, Java2D, GDI+, Processing
[Diagram axes: Efficiency, Expressiveness]
Today's first task is not to invent wholly new [graphical] techniques, though these are needed. Rather we need most vitally to recognize and reorganize the essentials of old techniques, to make easy their assembly in new ways, and to modify their external appearances to fit the new opportunities.
J. W. Tukey, The Future of Data Analysis, 1962.
Protovis: A Declarative Language for Visualization
A graphic is a composition of data-representative marks.
with Mike Bostock & Vadim Ogievetsky
Area Bar Dot Image
Line Label Rule Wedge
Protovis
Create customized visualizations using a declarative specification language.
Protovis (http://protovis.org) – Declarative Visualization Specification
// Protovis shorthand: function(d) expr means function(d) { return expr; }
var vis = new pv.Panel();
vis.add(pv.Bar)
    .data([1, 1.2, 1.7, 1.5, .7])
    .bottom(10)
    .width(20)
    .height(function(d) d * 70)
    .left(function() this.index * 25 + 20);
vis.render();
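To make the bar specification concrete, the following plain JavaScript sketch shows the idea behind it: each property is either a constant or a function, and it is evaluated once per datum to produce the geometry of one bar. This illustrates the concept only and is not Protovis's implementation. `evaluateMarks` and its argument format are hypothetical, and the index is passed as a second argument instead of through `this.index`.

```javascript
// Hypothetical helper: evaluate a set of mark properties for every datum.
// A property value is either a constant or a function of (datum, index).
function evaluateMarks(data, properties) {
  return data.map((d, index) => {
    const mark = {};
    for (const [name, value] of Object.entries(properties)) {
      mark[name] = typeof value === "function" ? value(d, index) : value;
    }
    return mark;
  });
}

// Same data and property definitions as the bar chart specification.
const bars = evaluateMarks([1, 1.2, 1.7, 1.5, 0.7], {
  bottom: 10,
  width: 20,
  height: d => d * 70,
  left: (d, index) => index * 25 + 20,
});

// One object per bar, each with bottom, width, height, and left.
console.log(bars);
```

The specification says what each bar looks like as a function of the data; how and when those functions are evaluated is left to the toolkit.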
vis.add(pv.Rule)
    .data([0, -10, -20, -30])
    .top(function(d) 300 - 2*d - 0.5)
    .left(200)
    .right(150)
    .lineWidth(1)
    .strokeStyle("#ccc")
  .anchor("right").add(pv.Label)
    .font("italic 10px Georgia")
    .text(function(d) d+"°")
    .textBaseline("center");
vis.add(pv.Line)
    .data(napoleon.temp)
    .left(lon)
    .top(tmp)
    .strokeStyle("#0")
  .add(pv.Label)
    .top(function(d) 5 + tmp(d))
    .text(function(d) d.temp+"° "+d.date.substr(0,6))
    .textBaseline("top")
    .font("italic 10px Georgia");
var army = pv.nest(napoleon.army, "dir", "group");
var vis = new pv.Panel();
var lines = vis.add(pv.Panel)
    .data(army);
lines.add(pv.Line)
    .data(function() army[this.idx])
    .left(lon)
    .top(lat)
    .size(function(d) d.size/8000)
    .strokeStyle(function() color[army[paneIndex][0].dir]);
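The `pv.nest` call in this example appears to group the army records by direction and then by group, so that each group can be drawn as its own line. A minimal sketch of that kind of nested grouping in plain JavaScript follows. `nestBy` is a hypothetical helper, not the Protovis function, and its output shape (nested objects keyed by value) is an assumption. The sample records are made up; only the field names `dir`, `group`, and `size` come from the slides.

```javascript
// Hypothetical helper: group rows by a sequence of keys, one level per key.
// Returns nested plain objects whose leaves are arrays of rows.
function nestBy(rows, ...keys) {
  if (keys.length === 0) return rows;
  const [key, ...rest] = keys;
  const groups = new Map();
  for (const row of rows) {
    const k = row[key];
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(row);
  }
  const nested = {};
  for (const [k, members] of groups) {
    nested[k] = nestBy(members, ...rest);
  }
  return nested;
}

// Made-up records for illustration only.
const sampleArmy = [
  { dir: "A", group: 1, size: 1000 },
  { dir: "A", group: 1, size: 900 },
  { dir: "A", group: 2, size: 300 },
  { dir: "R", group: 1, size: 500 },
];

const nested = nestBy(sampleArmy, "dir", "group");
console.log(Object.keys(nested));   // the distinct directions
console.log(nested.A[1].length);    // number of records in direction A, group 1
```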
vis.add(pv.Label)
    .data(napoleon.cities)
    .left(lon)
    .top(lat)
    .text(function(d) d.city)
    .font("italic 10px Georgia")
    .textAlign("center")
    .textBaseline("middle");
Bach’s Prelude #1 in C Major | Jieun Oh
Obesity Map | Vadim Ogievetsky
Dymaxion Maps | Vadim Ogievetsky
FlickrSeason | Ken-Ichi Ueda
Exploiting Declarative Specification
Protovis has led to faster designs, less code
  Job Voyager: 5x less code, 10x less dev time
  Over 40,000 downloads and widely in use
Multiple implementations: JavaScript & Java
Behind-the-scenes optimization & parallelization
  20x scalability over prior systems (in Java)
d3.js Data-Driven Documents
by Mike Bostock
Acquisition
Cleaning
Integration
Analysis
Visualization
Presentation
Dissemination
sense.us
A Web Application for Collaborative Visualization of Demographic Data
with Fernanda Viégas and Martin Wattenberg [CHI 07]
The great postmaster scourge of 1910?
Or just a bug in the data?
Voyagers and Voyeurs
Complementary faces of analysis
Voyager – focus on visualized data
  Active engagement with the data
  Serendipitous comment discovery
Voyeur – focus on comment listings
  Investigate others’ explorations
  Find people and topics of interest
  Catalyze new explorations
Many-Eyes
Content Analysis of Comments
Feature prevalence from content analysis (min Cohen’s κ = .74)
High co-occurrence of Observation, Question, and Hypothesis
[Chart: percentage of comments (axis 0 to 80) in each category, by service (Sense.us, Many-Eyes). Categories: Observation, Question, Hypothesis, Data Integrity, Linking, Socializing, System Design, Testing, Tips, To-Do, Affirmation]
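The agreement statistic cited for this content analysis, Cohen's κ, compares the observed agreement between two coders with the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). A minimal sketch of the computation for a single feature (for example, whether a comment contains a Question) follows. The function name and the sample labels are hypothetical and are not data from the study.

```javascript
// Cohen's kappa for two coders who labeled the same items.
// coderA and coderB are equal-length arrays of category labels.
function cohensKappa(coderA, coderB) {
  if (coderA.length !== coderB.length || coderA.length === 0) {
    throw new Error("Both coders must label the same non-empty set of items");
  }
  const n = coderA.length;

  // Observed agreement: fraction of items given the same label.
  let agree = 0;
  for (let i = 0; i < n; i++) {
    if (coderA[i] === coderB[i]) agree++;
  }
  const observed = agree / n;

  // Chance agreement: sum over categories of the product of
  // each coder's marginal proportion for that category.
  const categories = new Set([...coderA, ...coderB]);
  let expected = 0;
  for (const c of categories) {
    const pA = coderA.filter(x => x === c).length / n;
    const pB = coderB.filter(x => x === c).length / n;
    expected += pA * pB;
  }

  // Kappa is undefined when chance agreement is 1 (both coders used one
  // identical category for everything); returning 1 is a convention here.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}

// Made-up labels: 1 = feature present, 0 = feature absent.
const coderA = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0];
const coderB = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1];
const kappa = cohensKappa(coderA, coderB);
console.log(kappa.toFixed(2));
```

In this made-up example the coders agree on 8 of 10 items and chance agreement is 0.5, so κ is about 0.6. The "min" in the slide indicates that .74 was the lowest κ across the coded features.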
Content Analysis of Comments
16% of sense.us comments and 10% of Many-Eyes comments reference data integrity issues.
[Chart: percentage of comments (axis 0 to 80) in each category, by service (Sense.us, Many-Eyes). Categories: Observation, Question, Hypothesis, Data Integrity, Linking, Socializing, System Design, Testing, Tips, To-Do, Affirmation]
Acquisition
Cleaning
Integration
Analysis
Visualization
Presentation
Dissemination
Students & Collaborators
Mike Bostock, Jason Chuang, Sean Kandel, Diana MacLean, Vadim Ogievetsky
Joe Hellerstein, Andreas Paepcke
Fernanda Viégas, Martin Wattenberg
Interactive Tools for Data Transformation & Visualization
Jeffrey Heer http://vis.stanford.edu