The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration

The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration

Summer Grid 2004, UT Brownsville South Padre Island Center

24 June 2004

Mike Wilde, Argonne National Laboratory

Mathematics and Computer Science Division

Summer Grid 2004 www.griphyn.org/chimera 24 June, UTB/SPI

GriPhyN: Grid Physics Network Mission

Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation

Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.

GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together

Acknowledgements: Virtual Data is a Large Team Effort

The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao

The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi

Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams

Virtual Data Scenario

[Figure: workflow graph. simulate produces file1 and file2; reformat produces file3, file4, and file5 from file2; conv produces file6 from file2; summarize produces file7 from file6; psearch produces file8.]

On-demand data generation

Update workflow following changes

Manage workflow

Explain provenance, e.g. for file8:

psearch –t 10 –i file3 file4 file5 –o file8
summarize –t 10 –i file6 –o file7
reformat –f fz –i file2 –o file3 file4 file5
conv –l esd –o aod –i file2 –o file6
simulate –t 10 –o file1 file2

Virtual Data Describes Analysis Workflow

The recorded virtual data “recipe” here is:

– Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2

– Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate

[Figure: the scenario workflow graph, with file8 marked as the requested dataset.]

Virtual Data Describes Analysis Workflow

To re-create file8: Step 1

– simulate > file1, file2

Virtual Data Describes Analysis Workflow

To re-create file8: Step 2

– files 3, 4, 5, 6 derived from file2

– reformat > file3, file4, file5

– conv > file6

Virtual Data Describes Analysis Workflow

To re-create file8: Step 3

– file7 depends on file6

– summarize > file7

Virtual Data Describes Analysis Workflow

To re-create file8: Final step

– file8 depends on files 1, 3, 4, 5, 7

– psearch < file1, file3, file4, file5, file7 > file8
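The four steps above amount to walking the recorded recipe backwards from the requested file and running each program once its inputs exist. A minimal Python sketch of that idea follows. It assumes the recipe is held in two plain dictionaries; the names `PRODUCED_BY`, `INPUTS`, and `plan` are illustrative and are not part of Chimera.

```python
# Recipe as recorded on the slides: the program that produces each file,
# and the files each program reads. The dictionary layout is an
# assumption made for this illustration only.
PRODUCED_BY = {
    "file1": "simulate", "file2": "simulate",
    "file3": "reformat", "file4": "reformat", "file5": "reformat",
    "file6": "conv",
    "file7": "summarize",
    "file8": "psearch",
}
INPUTS = {
    "simulate": [],
    "reformat": ["file2"],
    "conv": ["file2"],
    "summarize": ["file6"],
    "psearch": ["file1", "file3", "file4", "file5", "file7"],
}


def plan(target, existing=frozenset()):
    """Return the programs to run, in order, to (re)create `target`.

    Files named in `existing` are treated as already materialized, so
    the programs that produce them are not scheduled on their account.
    """
    order = []

    def visit(filename):
        if filename in existing:
            return
        program = PRODUCED_BY[filename]
        if program in order:
            return  # already scheduled by an earlier file
        for dependency in INPUTS[program]:
            visit(dependency)
        order.append(program)

    visit(target)
    return order


steps = plan("file8")
print(steps)
```

Passing `existing={"file1", "file2"}` drops `simulate` from the plan, which is the on-demand behaviour the scenario describes: only the missing parts of the graph are regenerated.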

Grid3 – The Laboratory

Supported by the National Science Foundation and the Department of Energy.

VDL: Virtual Data Language Describes Data Transformations

Transformation

– Abstract template of program invocation

– Similar to "function definition"

Derivation

– “Function call” to a Transformation

– Store past and future:

> A record of how data products were generated

> A recipe of how data products can be generated

Invocation

– Record of a Derivation execution

These XML documents reside in a “virtual data catalog” (VDC), a relational database
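The relationship between the three record types can be sketched with a few Python dataclasses. This is only a mental model under stated assumptions: the field names (`inputs`, `outputs`, `bindings`, `exit_code`, `wall_seconds`) are invented for illustration and do not reflect the actual VDC schema; `tr1` and `x1` borrow their names from the VDL example later in the deck.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Transformation:
    """Abstract template of a program invocation (a "function definition")."""
    name: str
    inputs: Tuple[str, ...]   # formal input arguments
    outputs: Tuple[str, ...]  # formal output arguments


@dataclass
class Derivation:
    """A "function call": a transformation bound to logical file names."""
    name: str
    transformation: Transformation
    bindings: Dict[str, str]  # formal argument -> logical file name

    def is_complete(self) -> bool:
        """True when every formal argument has been bound to a file."""
        formals = set(self.transformation.inputs) | set(self.transformation.outputs)
        return formals == set(self.bindings)


@dataclass
class Invocation:
    """Record of one execution of a derivation."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


tr1 = Transformation("tr1", inputs=("a1",), outputs=("a2",))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0, wall_seconds=1.5)  # made-up values
print(x1.is_complete(), run.derivation.transformation.name)
```

A derivation with no invocation yet is a recipe for data that can be generated; once an invocation points at it, the same record documents how the data was generated.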

VDL Describes Workflow via Data Dependencies

TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }

TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }

DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});

DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

[Figure: file1 → x1 → file2 → x2 → file3]
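No ordering between x1 and x2 is stated anywhere in the VDL; it follows from the fact that x2 reads the file that x1 writes. A small Python sketch of that inference, assuming each derivation has already been reduced to its input and output file lists (the dictionary layout and the `job_dependencies` name are hypothetical):

```python
# Logical files read and written by each derivation, mirroring x1 and x2.
derivations = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}


def job_dependencies(derivations):
    """Map each derivation to the derivations whose outputs it consumes."""
    # First pass: remember which derivation produces each file.
    producer = {}
    for name, files in derivations.items():
        for filename in files["out"]:
            producer[filename] = name
    # Second pass: an input with a known producer creates an edge.
    deps = {}
    for name, files in derivations.items():
        deps[name] = sorted({producer[f] for f in files["in"] if f in producer})
    return deps


deps = job_dependencies(derivations)
print(deps)
```

file1 has no producer in this catalog, so x1 has no predecessors and file1 is treated as an external input.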

Workflow example

Graph structure

– Fan-in

– Fan-out

– "left" and "right" can run in parallel

Needs external input file

– Located via replica catalog

Data file dependencies

– Form graph structure

[Figure: preprocess feeds two findrange jobs, which both feed analyze]

Complete VDL workflow

Generate appropriate derivations

DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );

DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );

DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );

DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
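The four derivations form a diamond: "left" and "right" both wait on "top" and can run side by side, while "bottom" waits on both. The sketch below groups the jobs into stages that could run in parallel. It uses Python's standard `graphlib` module (Python 3.9 or later) purely as an illustration; it is not how Chimera, Pegasus, or DAGMan schedule work, and the predecessor map is written out by hand from the file names above.

```python
from graphlib import TopologicalSorter

# Predecessors of each derivation, read off the in/out files:
# left and right read f.b1/f.b2 (written by top); bottom reads f.c1/f.c2.
graph = {
    "top": set(),
    "left": {"top"},
    "right": {"top"},
    "bottom": {"left", "right"},
}


def parallel_stages(graph):
    """Group jobs into stages; jobs within one stage have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage finished
    return stages


stages = parallel_stages(graph)
print(stages)
```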

Compound Transformations Enable Functional Abstractions

Compound TR encapsulates an entire sub-graph:

TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}

Derivation scripts

Representation of virtual data provenance:

DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );

DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );

...

DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
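Scripts like these are typically generated rather than typed. In the three derivations shown, each one consumes six consecutive file names numbered in upper-case hexadecimal, so derivation n starts at (n - 1) × 6. The Python sketch below reproduces that naming under this inferred pattern. The function name `diamond_dv` is hypothetical, and `p1` and `p2` are passed in as strings because the slide does not say how they were chosen.

```python
def diamond_dv(index, p1, p2):
    """Render the index-th (1-based) diamond derivation as a VDL statement.

    Assumes six consecutive hex-numbered logical files per derivation,
    as inferred from d1, d2, and d70 on the slide.
    """
    base = (index - 1) * 6
    fa, fb2, fb1, fc2, fc1, fd = (f"f.{base + i:05X}" for i in range(6))
    return (
        f'DV d{index}->diamond( fd=@{{out:"{fd}"}}, '
        f'fc1=@{{io:"{fc1}"}}, fc2=@{{io:"{fc2}"}}, '
        f'fb1=@{{io:"{fb1}"}}, fb2=@{{io:"{fb2}"}}, '
        f'fa=@{{io:"{fa}"}}, p2="{p2}", p1="{p1}" );'
    )


first = diamond_dv(1, p1="0", p2="100")
last = diamond_dv(70, p1="18", p2="800")
print(first)
print(last)
```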

Invocation Provenance

Completion status and resource usage

Attributes of executable transformation

Attributes of input and output files
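To make the three kinds of invocation attributes concrete, here is a hedged Python sketch that runs a command and records comparable information. It is not the Chimera invocation record format: the field names, the use of SHA-256, and the `run_and_record` helper are all assumptions for illustration, and `argv[0]` is assumed to be a path to the executable.

```python
import hashlib
import os
import subprocess
import sys
import time


def describe(path):
    """Size and content hash of a file (illustrative attributes only)."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return {"path": path, "bytes": os.path.getsize(path), "sha256": digest}


def run_and_record(argv, input_files=(), output_files=()):
    """Run argv and return a dictionary describing the invocation."""
    started = time.perf_counter()
    completed = subprocess.run(argv, capture_output=True)
    elapsed = time.perf_counter() - started
    return {
        "argv": list(argv),
        "exit_code": completed.returncode,   # completion status
        "wall_seconds": elapsed,             # a simple resource-usage figure
        "executable": describe(argv[0]),     # attributes of the executable
        "inputs": [describe(p) for p in input_files],
        "outputs": [describe(p) for p in output_files if os.path.exists(p)],
    }


# Example: record a trivial run of the current Python interpreter.
record = run_and_record([sys.executable, "-c", "print('hello')"])
print(record["exit_code"], record["executable"]["bytes"] > 0)
```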

Executing VDL Workflows

[Figure: an abstract workflow is passed to a planner (local planner, global planner “Pegasus”, or “jit” planner (research)), which draws on Grid info to produce a concrete DAG for DAGMan / Condor-G.]

GriPhyN-iVDGL Applications to date

ATLAS, BTeV, CMS – HEP event simulation

Argonne Computational Biology – sequence comparison and result capture

LIGO – Pulsar search

Sloan Digital Sky Survey – cluster finding; near-earth object search planned

Quarknet – science education – cosmic rays, HEP analysis

Genome Analysis Database Update

[Figure: GADU GServer. Automatic workflows are created per user request or project and run on Jazz/ANL, Grid3, and UofWisc resources using Chimera, Condor, and Globus. End users (public “hit and run” users, registered groups, and collaborators) reach the server through an interface. Side label: data flow and storage at various levels.]

Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS

Described in GGF10 workshop paper.

Virtual Data Example: Galaxy Cluster Search

[Figure: Sloan data is processed by a workflow DAG to produce the galaxy cluster size distribution, plotted as number of clusters (1 to 100000) against number of galaxies (1 to 100) on logarithmic axes.]

Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper

Cluster Search Workflow Graph and Execution Trace

[Figure: workflow jobs vs time]

Virtual Data Application: High Energy Physics Data Analysis

[Figure: graph of derived datasets, each node labeled with the parameters that define it:]

mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000

mass = 200, decay = WW, stability = 1, event = 8

mass = 200, decay = WW, stability = 1, plot = 1

mass = 200, decay = WW, plot = 1

mass = 200, decay = WW, event = 8

mass = 200, decay = WW, stability = 1

mass = 200, decay = WW, stability = 3

mass = 200

mass = 200, decay = WW

mass = 200, decay = ZZ

mass = 200, decay = bb

mass = 200, plot = 1

mass = 200, event = 8

Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper

Using Virtual Data for Science Education

The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education

It's an experiment to give students the means to:

– discover and apply datasets, algorithms, and data analysis methods

– collaborate by developing new ones and sharing results and observations

– learn data analysis methods that will ready and excite them for a scientific career

And in later steps, we may actually use the Grid!

Quarknet Virtual Data Project

[Figure: three sites (Central High School, Reston, Virginia; Yale / Middletown High Collaboration, Hartford, Connecticut; Foothills High School, Great Falls, Montana), each with student/teacher teams, a cosmic ray detector, and locally collected data, connect through standard Web access to the Quarknet Virtual Data Portal. The portal holds student data, algorithms, results, notes, and communications, and sits on the Virtual Data Toolkit and Virtual Data Catalog.]

Student teacher teams sharing data, methods, programs, and knowledge

Enabling collaboration-intensive science discovery with virtual data tools and methods

Detector Performance Study

Example: BTeV Event Simulation

Support for Search and Discovery

Goal: make it as easy to use as Google

More advanced capabilities lie below the surface (as with Google)

Understand the structure and meaning of the datasets and their fields.

Advanced search, using SQL-like queries

Find both DATA and TRANSFORMATIONS

Create datasets from queries

Perform calculations on datasets, filtering results to look for patterns
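As a rough picture of what an SQL-like metadata search over both data and transformations could look like, the sketch below stores key/value annotations in an in-memory SQLite table and intersects the matches for each criterion. The table layout, the object names (`run1.events`, `run2.events`, `plotMass`), and the `find` helper are hypothetical; only the attribute names (mass, decay, plot) come from the high energy physics slide earlier in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation (object TEXT, kind TEXT, key TEXT, value TEXT)"
)
# Invented sample annotations on two datasets and one transformation.
rows = [
    ("run1.events", "data", "mass", "200"),
    ("run1.events", "data", "decay", "WW"),
    ("run2.events", "data", "mass", "200"),
    ("run2.events", "data", "decay", "ZZ"),
    ("plotMass", "transformation", "plot", "1"),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", rows)


def find(kind, **criteria):
    """Return names of objects of `kind` that match every key=value pair."""
    query = "SELECT object FROM annotation WHERE kind = ? AND key = ? AND value = ?"
    matches = None
    for key, value in criteria.items():
        found = {row[0] for row in conn.execute(query, (kind, key, str(value)))}
        matches = found if matches is None else matches & found
    return sorted(matches or [])


ww = find("data", mass=200, decay="WW")
plots = find("transformation", plot=1)
print(ww, plots)
```

Because datasets and transformations share one annotation table, the same query shape finds either kind of object, which is the "find both DATA and TRANSFORMATIONS" point in the list above.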

Search by Metadata

Deriving a new dataset

…to find mass of “z” particle:

Workflow for missing energy calculations

Virtual Provenance: list of derivations and files

<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>…
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>

<!--list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>
….(excerpted for display)

Virtual Provenance in XML: control flow graph

<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/>…
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/>…
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>… <parent ref="ID000013"/>… </child>…

(excerpted for display…)
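The child/parent elements are enough to rebuild the job graph. The Python sketch below reads them with the standard `xml.etree.ElementTree` module. The sample is a small well-formed fragment in the same shape as the excerpt, using only job IDs that are visible on the slide; the wrapping `<dag>` element and the `read_parents` helper are assumptions added so the snippet parses on its own, and the real document's root element may differ.

```python
import xml.etree.ElementTree as ET

# Well-formed sample in the shape of the excerpt. Elided parts of the
# slide are NOT reconstructed here; ID000005 may have had more parents.
SAMPLE = """
<dag>
  <child ref="ID000003"><parent ref="ID000002"/></child>
  <child ref="ID000004"><parent ref="ID000003"/></child>
  <child ref="ID000005">
    <parent ref="ID000004"/>
    <parent ref="ID000001"/>
  </child>
</dag>
"""


def read_parents(xml_text):
    """Map each child job ID to the list of job IDs it must wait for."""
    root = ET.fromstring(xml_text)
    return {
        child.get("ref"): [parent.get("ref") for parent in child.findall("parent")]
        for child in root.findall("child")
    }


parents = read_parents(SAMPLE)
print(parents)
```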

And writing the results up in a “poster”

Poster describing analysis

Using active data from Web Services

Levels of Interaction

“Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values.

“Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks

“Code” – write new transforms in a variety of languages and data models

Observations

A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity

Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation

The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder

Vision for Provenance in the Large

Universal knowledge management and production systems

Vendors integrate the provenance tracking protocol into data processing products

Ability to run anywhere “in the Grid”

Virtual Data Grid Vision

[Figure: architecture diagram. Actors: production manager, researcher, Grid operations, science review. Activities: planning, discovery, composition, simulation, analysis, derivation, sharing. Computing Grid: workflow planner, request planner, workflow executor (DAGMan), request executor (Condor-G, GRAM), request predictor (Prophesy), Grid monitor. Data Grid: storage elements, replica location service, data transport, storage resource management, virtual data catalogs, virtual data index. Data sources: raw data from a detector, simulation data.]

Planned Dataset Model

File

Set of files

Relational query or spreadsheet range

XML Element, e.g. <FORM> <Title…> </FORM>

Set of files with relational index

Object closure

New user-defined dataset type:

Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao

Planned Dataset Type Model

Representational

FileDataset
  File
  FileSet
    MultiFileSet
    TarFileSet

Logical

EventCollection
  RawEventSet
  SimulatedEventSet
    MonteCarloSimulation
    DiscreteEventSimulation

(Nonleaf Types are Superclasses)
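Since nonleaf types are superclasses, a transformation declared against a general type can accept any of its subtypes. The Python sketch below mirrors the type names from the slide as empty classes. The nesting is an assumption read from the order of the names (for example, that MultiFileSet and TarFileSet sit under FileSet), and the `accepts` helper is hypothetical; the slide itself describes a planned, speculative model.

```python
# Representational dimension (assumed nesting)
class FileDataset: ...
class File(FileDataset): ...
class FileSet(FileDataset): ...
class MultiFileSet(FileSet): ...
class TarFileSet(FileSet): ...

# Logical dimension (assumed nesting)
class EventCollection: ...
class RawEventSet(EventCollection): ...
class SimulatedEventSet(EventCollection): ...
class MonteCarloSimulation(SimulatedEventSet): ...
class DiscreteEventSimulation(SimulatedEventSet): ...


def accepts(formal_type, actual_type):
    """True if a dataset of actual_type may be passed where formal_type is declared."""
    return issubclass(actual_type, formal_type)


print(accepts(FileSet, TarFileSet))                     # subtype of the formal type
print(accepts(SimulatedEventSet, RawEventSet))          # sibling type, not a subtype
```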

Provenance Server Plans

OGSA-based Grid services

– Discovery, security, resource management

Supports code and data discovery and workflow management

Object names (TR, DS, TY, DV, IV) can be used as global cross-server links

Derivations can reference remote transformations and datasets

Structured object namespaces & object-level access control enable large VO collaboration

Generalize transforms to describe service calls, database queries and language interpreters

Provenance Hyperlinks

[Figure: a Collaboration VDS, a Group VDS, and two Personal VDS servers, each holding transformations (TR), derivations (DV), and datasets (DS) that link to objects on the other servers.]

Indexing Servers to Support Discovery

[Figure: Collaboration, Group, and Personal VDS servers holding TR, DV, and DS objects, with personal indexes, a group index, a collaboration-level index, and a collaboration-wide index layered above them.]

For Information and Software

Virtual Data System

– www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software

Grids and Grid Software

– www.ivdgl.org/grid2003 - Using Grid3

– www.griphyn.org/vdt - Virtual Data Toolkit

– www.globus.org - The Globus Toolkit

– www.cs.wisc.edu/condor - The Condor Project

– www.ppdg.net - Particle Physics Data Grid

Acknowledgements

GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation

The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM
