TRANSCRIPT
eScience Data Management
Bill Howe, PhD, eScience Institute
It’s not just size that matters, it’s what you can do with it
3/12/09 Bill Howe, eScience Institute 2
from eScience Rollout, 11/5/08 (photo caption: “me”)
My Background
BS, Industrial and Systems Engineering, GA Tech, 1999
Big 3 consulting with Deloitte, 1999-2000 (residual guilt from call centers of consultants burning $50k/day)
Independent consulting, 2000-2001: Microsoft, Siebel, Schlumberger, Verizon
PhD, Computer Science, Portland State University, 2006 (via OGI). Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier
Postdoc and Data Architect, 2006-2008, NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)
All Science is becoming eScience
Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, independent of hypotheses)
But: acquisition now outpaces analysis
Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: automated PCR, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics
The long tail is getting fatter:
notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
[Figure: “The Long Tail” -- data inventory vs. ordinal position]
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
[Figure labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), ocean modelers, seismologists, microbiologists, <spreadsheet users>]
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
Heterogeneity also drives costs
[Figure: # of bytes vs. # of data types]
CERN (~15PB/year; particle interactions)
LSST (~100PB; images, objects)
PanSTARRS (~40PB; images, objects, trajectories)
OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
SDSS (~100TB; images, objects)
Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)
Facets of Data Management
Query Languages, Storage Management, Web Services, Visualization, Workflow, Data Integration, Knowledge Extraction, Crawlers, Access Methods, Data Mining, Distributed Programming Models, Provenance
complexity-hiding interfaces
The DB maxim: push computation to the data
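As a concrete sketch of pushing computation to the data, here is a minimal example using Python’s built-in sqlite3 (the table, columns, and values are invented for illustration): the aggregation runs inside the database engine, so only one summary row per group crosses the interface, instead of every raw row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE casts (station TEXT, depth REAL, salinity REAL)")
conn.executemany("INSERT INTO casts VALUES (?, ?, ?)",
                 [("A", 1.0, 30.0), ("A", 2.0, 31.0), ("B", 1.0, 29.0)])

# Push computation to the data: AVG and GROUP BY execute in the engine;
# the application receives one row per station, not the full table.
rows = conn.execute(
    "SELECT station, AVG(salinity) FROM casts GROUP BY station ORDER BY station"
).fetchall()
print(rows)  # [('A', 30.5), ('B', 29.0)]
```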
Example: Relational Databases
At IBM Almaden in the 1960s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].
The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak”
physical data independence
logical data independence
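A small illustration of logical data independence (schema and names invented for this sketch): applications query a view, so the base table can be reorganized or extended without breaking them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Base table: the "physical" schema, free to evolve.
conn.execute("CREATE TABLE obs_v2 (station TEXT, temp_c REAL, qc_flag INTEGER)")
conn.execute("INSERT INTO obs_v2 VALUES ('A', 11.5, 0)")

# Logical data independence: applications see only the view 'obs';
# the qc_flag column and any future columns stay hidden behind it.
conn.execute("CREATE VIEW obs AS "
             "SELECT station, temp_c FROM obs_v2 WHERE qc_flag = 0")
print(conn.execute("SELECT * FROM obs").fetchall())  # [('A', 11.5)]
```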
Medium-Scale Data Management Toolbox
Relational Databases
Scientific Workflow Systems
Science “Mashups”
“Dataspace” systems
The “hammer” of data management
[Howe, Freire, Silva, et al. 2008]
[Howe, Green-Fishback, Maier, 2009]
[Howe, Maier, Rayner, Rucker 2008]
Large-Scale Data Management Toolbox
Amazon S3: RDBMS-like features in the cloud. Note: cost effectiveness unclear for large datasets
Dryad: parallel programming via relational algebra plus type safety, monitoring, debugging (Michael Isard, Microsoft Research)
MapReduce: parallel programming using functional programming abstractions (Google)
Howe, Freire, Silva: 2009 NSF CluE Award
Connolly, Gardner: 2009 NSF CluE Award
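The functional abstraction behind MapReduce can be sketched in a few lines of plain Python, using the canonical word-count example. This illustrates the programming model only -- the real systems add distribution, fault tolerance, and disk-based shuffling.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit a (key, value) pair for each word
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: fold each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["salinity flux", "salinity gradient"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts)  # {'salinity': 2, 'flux': 1, 'gradient': 1}
```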
Current Activities
Consulting: Armbrust Lab (next slide)
Research: MapReduce for Oceanographic Simulations (+ visualization and workflow)
Consulting: Armbrust Lab
Initial Goal: Corral and inventory all relevant data
  SOLiD sequencer: potentially 0.5 TB/day, flat files
  Metadata: small relational DB + Rails/Django web app
  Data products: visualizations, intermediate results
  Ad hoc scripts and programs (key idea: these are data too)
Initial Goal: Amplify programmer effort
  Change is constant: no “one size fits all” solution; ad hoc development is the norm
  Strategy: Teach biologists to “fish” (David Schruth’s R course)
  Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems
Scientific Workflow Systems
Value proposition: More time on science, less time on code
How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency
Provenance, automatic task-parallelism, visual programming, caching, domain-specific toolkits
Many examples from eScience and DB communities: Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more
Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008.
http://www.microsoft.com/mscorp/tc/trident.mspx
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
Bill Howe @ CMOP computes salt flux using GridFields
Erik Anderson @ Utah adds vector streamlines and adjusts opacity
Bill Howe @ CMOP adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)
Strategy at Armbrust Lab
1. Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings
2. “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake
3. “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut.
Informed by two of Jim Gray’s Laws of Data Engineering: start with “20 queries”; go from “working to working”
NSF Award: Cluster Exploratory (CluE)
Partnership between NSF, IBM, Google
Data-intensive computing: “I/O farm” -- massive queries, not massive simulations; “in ferro” experiments
To “cloud-enable” GridFields and VisTrails
Goal: 10+-year climatologies at interactive speeds -- requires turning over up to 25TB in under 5 seconds
Provenance, reproducibility, visualization: VisTrails
Connect rich desktop experience to cloud query engine
Co-PIs from University of Utah: Claudio Silva and Juliana Freire
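Back-of-envelope arithmetic makes the “25TB in under 5 seconds” goal concrete. Assuming a per-disk sequential read rate of roughly 100 MB/s (an illustrative round figure, not from the talk), scanning the data cold would need on the order of tens of thousands of spindles -- which is why a large cluster plus aggressive precomputation is required.

```python
# What does "turn over up to 25TB in under 5s" imply for I/O bandwidth?
data_tb = 25                  # dataset size from the slide
budget_s = 5                  # interactive-latency budget
required_tb_per_s = data_tb / budget_s          # aggregate scan bandwidth

disk_mb_per_s = 100           # assumed per-disk sequential read (illustrative)
disks_needed = required_tb_per_s * 1_000_000 / disk_mb_per_s  # TB/s -> MB/s
print(required_tb_per_s, int(disks_needed))  # 5.0 50000
```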
Amdahl’s Laws
Gene Amdahl (1965): Laws for a balanced system
i. Parallelism: max speedup is (S+P)/S, for serial time S and parallelizable time P
ii. One bit of IO/sec per instruction/sec (BW)
iii. One byte of memory per one instruction/sec (MEM)
iv. One IO per 50,000 instructions (IO)
Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006)
For a Blue Gene, BW = 0.001 and MEM = 0.12.
For the JHU cluster, BW = 0.5 and MEM = 1.04.
source: Alex Szalay, keynote, eScience 2008
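Law (i) in formula form: with a serial fraction S of the work, the speedup on n processors is 1/(S + (1-S)/n), bounded above by 1/S no matter how many processors are added. A quick sketch:

```python
def amdahl_speedup(serial_frac, n):
    """Speedup on n processors when a fraction serial_frac of the
    work cannot be parallelized (Amdahl's parallelism law)."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

print(amdahl_speedup(0.05, 10))    # ~6.90
print(amdahl_speedup(0.05, 1000))  # ~19.6, approaching the 1/0.05 = 20 ceiling
```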
Climatology
Average Surface Salinity by Month, Columbia River Plume, 1999-2006
[Figure: monthly-average surface salinity maps (psu); panels labeled Feb and May; geographic labels: Columbia River, Washington, Oregon; animation]
[Figure: grid of numbered salinity panels (1-30, plus panel (b)); color scale 23-31 psu]
Epilogue
We’re here to help!
SIG Wiki: https://sig.washington.edu/itsigs/SIG_eScience
eScience Blog: http://escience.washington.edu/blog/
eScience website: http://www.washington.edu/uwtech/escience.html
eScience requirements are Fractal
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
[Diagram: “eScience” at the intersection of High-Performance Computing, Data Management, Consulting, Online Collaboration Tools, and CS Research]
It’s what you can do with it
Relational database: SQL, plus UDTs and UDFs as needed
FASTA databases: alignments, rarefaction curves, phylogenetic trees, filtering
MapReduce: roll your own
Dryad: relational algebra available; you can still roll your own if needed
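The “UDFs as needed” escape hatch can be sketched with Python’s sqlite3, which lets a Python function be registered and then called from SQL (the function, table, and values here are invented for illustration):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (station TEXT, salinity REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("A", 30.0), ("B", 0.0)])

# Register a user-defined function so queries can apply custom logic
# inside the engine -- here a toy log transform.
conn.create_function("log1p", 1, math.log1p)
rows = conn.execute(
    "SELECT station, log1p(salinity) FROM samples ORDER BY station"
).fetchall()
print(rows)  # [('A', 3.4339...), ('B', 0.0)]
```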
A data deluge in all fields
Acquisition eventually outpaces analysis
Astronomy: SDSS, now LSST and PanSTARRS
Biology: PCR, SOLiD sequencing
Oceanography: high-resolution models, cheap sensors
Marine Microbiology: flow cytometer
Empirical X → Analytical X → Computational X → X-informatics
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
[Diagram: “eScience Research” at the intersection of High-Performance Computing, Data Management, Consulting, Online Collaboration, Community Building, and Technology Transfer]
Query Languages
Organize and encapsulate access methods
Raise the level of abstraction beyond general-purpose languages (GPLs)
Identify and exploit opportunities for algebraic optimization
What is algebraic optimization? Consider the expression x/z + y/z. Since x/z + y/z = (x + y)/z, and the latter involves only one division operation, the rewritten form is less expensive.
Tables -- SQL; XML -- XQuery, XPath; RDF -- SPARQL; Streams -- StreamSQL, CQL; Meshes (e.g., finite element simulations) -- GridFields
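The division example as runnable code: both forms agree on the result, but the rewritten one performs half the divisions -- the same kind of rule a query optimizer applies to relational algebra expressions.

```python
def naive(x, y, z):
    return x / z + y / z   # two divisions

def rewritten(x, y, z):
    return (x + y) / z     # one division, same value

# The algebraic rewrite preserves the result while reducing work.
assert abs(naive(3.0, 5.0, 2.0) - rewritten(3.0, 5.0, 2.0)) < 1e-12
print(rewritten(3.0, 5.0, 2.0))  # 4.0
```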
Example: Relational Databases (In Codd we Trust…)
At IBM Almaden in the 1960s and 70s, Codd worked out a formal basis for working with tabular data [1].
The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp. 377-387, 1970.
The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.
Gray’s Laws of Data Engineering
Jim Gray: scientific computing is revolving around data
Need scale-out solution for analysis -- take the analysis to the data!
Start with “20 queries”
Go from “working to working”
DISSC: Data Intensive Scalable Scientific Computing
slide source: Alex Szalay, keynote, eScience 2008
Data Management