TRANSCRIPT
eScience Data Management
Bill Howe, PhD, eScience Institute
It’s not just size that matters, it’s what you can do with it
3/12/09 Bill Howe, eScience Institute 2
from eScience Rollout, 11/5/08 (photo caption: “me”)
My Background
BS, Industrial and Systems Engineering, GA Tech, 1999
Big 3 consulting with Deloitte, 1999-2000 (residual guilt from call centers of consultants burning $50k/day)
Independent consulting, 2000-2001: Microsoft, Siebel, Schlumberger, Verizon
PhD, Computer Science, Portland State University, 2006 (via OGI). Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier
Postdoc and Data Architect, 2006-2008, NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)
All Science is becoming eScience
Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, independent of hypotheses)
But: acquisition now outpaces analysis
Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: automated PCR, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics
The long tail is getting fatter:
notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
[Figure: “The Long Tail” -- data inventory vs. ordinal position]
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
[Figure labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), ocean modelers, seismologists, microbiologists, <spreadsheet users>]
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
Heterogeneity also drives costs
[Figure: # of bytes vs. # of data types]
CERN (~15PB/year; particle interactions)
LSST (~100PB; images, objects)
PanSTARRS (~40PB; images, objects, trajectories)
OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
SDSS (~100TB; images, objects)
Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)
Facets of Data Management
Query Languages, Storage Management, Web Services, Visualization, Workflow, Data Integration, Knowledge Extraction, Crawlers, Access Methods, Data Mining, Distributed Programming Models, Provenance
complexity-hiding interfaces
The DB maxim: push computation to the data
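As a concrete sketch of pushing computation to the data, here is a minimal example using Python’s built-in sqlite3 (the table, columns, and values are invented for illustration): the aggregation runs inside the database engine, so only one summary row per group crosses the interface, instead of every raw row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE casts (station TEXT, depth REAL, salinity REAL)")
conn.executemany("INSERT INTO casts VALUES (?, ?, ?)",
                 [("A", 1.0, 30.0), ("A", 2.0, 31.0), ("B", 1.0, 29.0)])

# Push computation to the data: AVG and GROUP BY execute in the engine;
# the application receives one row per station, not the full table.
rows = conn.execute(
    "SELECT station, AVG(salinity) FROM casts GROUP BY station ORDER BY station"
).fetchall()
print(rows)  # [('A', 30.5), ('B', 29.0)]
```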
Example: Relational Databases
At IBM Almaden in the 1960s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].
The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak”
physical data independence
logical data independence
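A small illustration of logical data independence (schema and names invented for this sketch): applications query a view, so the base table can be reorganized or extended without breaking them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Base table: the "physical" schema, free to evolve.
conn.execute("CREATE TABLE obs_v2 (station TEXT, temp_c REAL, qc_flag INTEGER)")
conn.execute("INSERT INTO obs_v2 VALUES ('A', 11.5, 0)")

# Logical data independence: applications see only the view 'obs';
# the qc_flag column and any future columns stay hidden behind it.
conn.execute("CREATE VIEW obs AS "
             "SELECT station, temp_c FROM obs_v2 WHERE qc_flag = 0")
print(conn.execute("SELECT * FROM obs").fetchall())  # [('A', 11.5)]
```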
Medium-Scale Data Management Toolbox
Relational Databases
Scientific Workflow Systems
Science “Mashups”
“Dataspace” systems
The “hammer” of data management
[Howe, Freire, Silva, et al. 2008]
[Howe, Green-Fishback, Maier, 2009]
[Howe, Maier, Rayner, Rucker 2008]
Large-Scale Data Management Toolbox
Amazon S3: RDBMS-like features in the cloud. Note: cost effectiveness unclear for large datasets
Dryad: parallel programming via relational algebra plus type safety, monitoring, debugging (Michael Isard, Microsoft Research)
MapReduce: parallel programming using functional programming abstractions (Google)
Howe, Freire, Silva: 2009 NSF CluE Award
Connolly, Gardner: 2009 NSF CluE Award
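The functional abstraction behind MapReduce can be sketched in a few lines of plain Python, using the canonical word-count example. This illustrates the programming model only -- the real systems add distribution, fault tolerance, and disk-based shuffling.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit a (key, value) pair for each word
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: fold each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["salinity flux", "salinity gradient"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts)  # {'salinity': 2, 'flux': 1, 'gradient': 1}
```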
Current Activities
Consulting: Armbrust Lab (next slide)
Research: MapReduce for Oceanographic Simulations (+ visualization and workflow)
Consulting: Armbrust Lab
Initial Goal: Corral and inventory all relevant data
  SOLiD sequencer: potentially 0.5 TB/day, flat files
  Metadata: small relational DB + Rails/Django web app
  Data products: visualizations, intermediate results
  Ad hoc scripts and programs (key idea: these are data too)
Initial Goal: Amplify programmer effort
  Change is constant: no “one size fits all” solution; ad hoc development is the norm
  Strategy: Teach biologists to “fish” (David Schruth’s R course)
  Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems
Scientific Workflow Systems
Value proposition: More time on science, less time on code
How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency
Provenance, automatic task-parallelism, visual programming, caching, domain-specific toolkits
Many examples from eScience and DB communities: Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more
Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008.
http://www.microsoft.com/mscorp/tc/trident.mspx
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
Bill Howe @ CMOP computes salt flux using GridFields
Erik Anderson @ Utah adds vector streamlines and adjusts opacity
Bill Howe @ CMOP adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)
Strategy at Armbrust Lab
1. Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings
2. “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake
3. “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut.
Informed by two of Jim Gray’s Laws of Data Engineering: start with “20 queries”; go from “working to working”
NSF Award: Cluster Exploratory (CluE)
Partnership between NSF, IBM, Google
Data-intensive computing: “I/O farm” -- massive queries, not massive simulations; “in ferro” experiments
To “cloud-enable” GridFields and VisTrails
Goal: 10+-year climatologies at interactive speeds -- requires turning over up to 25TB in under 5 seconds
Provenance, reproducibility, visualization: VisTrails
Connect rich desktop experience to cloud query engine
Co-PIs from University of Utah: Claudio Silva and Juliana Freire
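Back-of-envelope arithmetic makes the “25TB in under 5 seconds” goal concrete. Assuming a per-disk sequential read rate of roughly 100 MB/s (an illustrative round figure, not from the talk), scanning the data cold would need on the order of tens of thousands of spindles -- which is why a large cluster plus aggressive precomputation is required.

```python
# What does "turn over up to 25TB in under 5s" imply for I/O bandwidth?
data_tb = 25                  # dataset size from the slide
budget_s = 5                  # interactive-latency budget
required_tb_per_s = data_tb / budget_s          # aggregate scan bandwidth

disk_mb_per_s = 100           # assumed per-disk sequential read (illustrative)
disks_needed = required_tb_per_s * 1_000_000 / disk_mb_per_s  # TB/s -> MB/s
print(required_tb_per_s, int(disks_needed))  # 5.0 50000
```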
Amdahl’s Laws
Gene Amdahl (1965): Laws for a balanced system
i. Parallelism: max speedup is (S+P)/S, for serial time S and parallelizable time P
ii. One bit of IO/sec per instruction/sec (BW)
iii. One byte of memory per one instruction/sec (MEM)
iv. One IO per 50,000 instructions (IO)
Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006)
For a Blue Gene, BW = 0.001 and MEM = 0.12.
For the JHU cluster, BW = 0.5 and MEM = 1.04.
source: Alex Szalay, keynote, eScience 2008
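Law (i) in formula form: with a serial fraction S of the work, the speedup on n processors is 1/(S + (1-S)/n), bounded above by 1/S no matter how many processors are added. A quick sketch:

```python
def amdahl_speedup(serial_frac, n):
    """Speedup on n processors when a fraction serial_frac of the
    work cannot be parallelized (Amdahl's parallelism law)."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

print(amdahl_speedup(0.05, 10))    # ~6.90
print(amdahl_speedup(0.05, 1000))  # ~19.6, approaching the 1/0.05 = 20 ceiling
```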
Climatology
Average Surface Salinity by Month, Columbia River Plume, 1999-2006
[Figure: monthly-average surface salinity maps (psu); panels labeled Feb and May; geographic labels: Columbia River, Washington, Oregon; animation]
[Figure: grid of numbered salinity panels (1-30, plus panel (b)); color scale 23-31 psu]
Epilogue
We’re here to help!
SIG Wiki: https://sig.washington.edu/itsigs/SIG_eScience
eScience Blog: http://escience.washington.edu/blog/
eScience website: http://www.washington.edu/uwtech/escience.html
eScience requirements are Fractal
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
[Diagram: “eScience” at the intersection of High-Performance Computing, Data Management, Consulting, Online Collaboration Tools, and CS Research]
It’s what you can do with it
Relational database: SQL, plus UDTs and UDFs as needed
FASTA databases: alignments, rarefaction curves, phylogenetic trees, filtering
MapReduce: roll your own
Dryad: relational algebra available; you can still roll your own if needed
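The “UDFs as needed” escape hatch can be sketched with Python’s sqlite3, which lets a Python function be registered and then called from SQL (the function, table, and values here are invented for illustration):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (station TEXT, salinity REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("A", 30.0), ("B", 0.0)])

# Register a user-defined function so queries can apply custom logic
# inside the engine -- here a toy log transform.
conn.create_function("log1p", 1, math.log1p)
rows = conn.execute(
    "SELECT station, log1p(salinity) FROM samples ORDER BY station"
).fetchall()
print(rows)  # [('A', 3.4339...), ('B', 0.0)]
```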
A data deluge in all fields
Acquisition eventually outpaces analysis
Astronomy: SDSS, now LSST and PanSTARRS
Biology: PCR, SOLiD sequencing
Oceanography: high-resolution models, cheap sensors
Marine Microbiology: flow cytometer
Empirical X → Analytical X → Computational X → X-informatics
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
[Diagram: “eScience Research” at the intersection of High-Performance Computing, Data Management, Consulting, Online Collaboration, Community Building, and Technology Transfer]
Query Languages
Organize and encapsulate access methods
Raise the level of abstraction beyond general-purpose languages (GPLs)
Identify and exploit opportunities for algebraic optimization
What is algebraic optimization? Consider the expression x/z + y/z. Since x/z + y/z = (x + y)/z, and the latter involves only one division operation, the rewritten form is less expensive.
Tables -- SQL; XML -- XQuery, XPath; RDF -- SPARQL; Streams -- StreamSQL, CQL; Meshes (e.g., finite element simulations) -- GridFields
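The division example as runnable code: both forms agree on the result, but the rewritten one performs half the divisions -- the same kind of rule a query optimizer applies to relational algebra expressions.

```python
def naive(x, y, z):
    return x / z + y / z   # two divisions

def rewritten(x, y, z):
    return (x + y) / z     # one division, same value

# The algebraic rewrite preserves the result while reducing work.
assert abs(naive(3.0, 5.0, 2.0) - rewritten(3.0, 5.0, 2.0)) < 1e-12
print(rewritten(3.0, 5.0, 2.0))  # 4.0
```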
Example: Relational Databases (In Codd we Trust…)
At IBM Almaden in the 1960s and 70s, Codd worked out a formal basis for working with tabular data [1].
The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp. 377-387, 1970.
The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.
Gray’s Laws of Data Engineering
Jim Gray: scientific computing is revolving around data
Need scale-out solution for analysis -- take the analysis to the data!
Start with “20 queries”
Go from “working to working”
DISSC: Data Intensive Scalable Scientific Computing
slide source: Alex Szalay, keynote, eScience 2008
Data Management