“Practical” Provenance Using SciFlo Workflows

Brian Wilson, Gerald Manipon, and Hook Hua
Jet Propulsion Laboratory

Wilson et al. AGU Talk, Dec. 17, 2008

Wilson et al. AGU Talk, Dec. 17, 2008; ESIP Federation Talk, July 8, 2009

Outline

• Decade-Scale Climate Science Using Workflows
• Review of SciFlo: Scientific DataFlow Engine
• Uses of Provenance
• Auto-provenance from dataflow execution engine
– processor versions, annotated results, etc.
• “Practical Provenance”
– commit to workflow
– get basic provenance for free
– reproducible workflows
– use web graph to store resources

Large-Scale, Distributed Data Fusion

• Find Level-2 datasets across multiple data centers
– Space/time granule query for multiple EOS (“A-Train”) instruments: AIRS, AMSR-E, AMSU, MODIS, Cloudsat, GPS
• Co-locate retrievals using space/time metadata
– Instantaneous “matchups” in space & time
• Read the data
– Temperature, water vapor, quality flags, cloud properties (HDF)
• Understand the data
– Units, quality control (non-trivial !!), etc.
• Publish merged products
– Water vapor climatology, stratified by Cloudsat cloud classes
• Publish multi-sensor “fused” products
– Determine instrument biases, understand by stratifying
– Fuse L2 data on a common grid
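
The co-location step above can be illustrated with a minimal Python sketch. It is not SciFlo's matchup operator: the `Retrieval` record, the `matchups` function, and the 100 km / 1 hour thresholds are all assumptions chosen for illustration. It pairs retrievals from two instruments when they are close in both space (great-circle distance) and time.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Retrieval:
    """Hypothetical per-retrieval metadata record (not a SciFlo type)."""
    instrument: str
    lat: float     # degrees north
    lon: float     # degrees east
    time_s: float  # seconds since an arbitrary common epoch


def great_circle_km(a, b):
    """Haversine distance between two retrievals, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def matchups(left, right, max_km=100.0, max_dt_s=3600.0):
    """Return (l, r) pairs that fall within both the time and distance windows."""
    return [
        (l, r)
        for l in left
        for r in right
        if abs(l.time_s - r.time_s) <= max_dt_s and great_circle_km(l, r) <= max_km
    ]


airs = [Retrieval("AIRS", 10.0, 20.0, 0.0)]
gps = [
    Retrieval("GPS", 10.2, 20.1, 1800.0),  # near in space and time
    Retrieval("GPS", 40.0, 20.0, 100.0),   # too far away
    Retrieval("GPS", 10.0, 20.0, 7200.0),  # too late
]
pairs = matchups(airs, gps)
```

The brute-force pairwise loop keeps the sketch short; a production matchup over years of granules would need spatial or temporal indexing.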

AIRS/GPS Temperature & Water Vapor Comparison Plots

AIRS / GPS Matchups

SciFlo Workflow Engine

• Automate large-scale, multi-instrument science processing by authoring a dataflow document that specifies a tree of executable operators/services.
• VizFlow Visual Authoring Tool (AJAX GUI in browser)
• Distributed Dataflow Execution Engine (in Python)
• Data Grid: Move data “granules” to the operators using FTP, HTTP, or OpenDAP URLs.
• Compute Grid: Move operators (executables) to the data.
• Built-in reusable operators provided for many tasks such as subsetting, co-registration, regridding, data fusion, etc.
• Custom operators easily plugged in by scientists.
• Leverage convergence of Web Services (SOAP) with Grid Services (Globus toolkit v4).
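
As a rough illustration of what "executing a dataflow" means, the sketch below runs a set of named operators in dependency order. It does not reproduce SciFlo's document format or engine: the `steps` dictionary layout, the `run_dataflow` function, and the toy operators are assumptions made for this example, and the standard-library `graphlib` module (Python 3.9+) stands in for the engine's scheduler.

```python
from graphlib import TopologicalSorter


def run_dataflow(steps, inputs):
    """Execute operators in dependency order.

    steps:  {step_name: (callable, [names of upstream steps or workflow inputs])}
    inputs: {input_name: value} supplied by the user.
    Returns every input and step output keyed by name.
    """
    # Only step-to-step edges matter for ordering; workflow inputs are already available.
    graph = {
        name: [dep for dep in deps if dep in steps]
        for name, (_, deps) in steps.items()
    }
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        values[name] = func(*(values[dep] for dep in deps))
    return values


# Toy operators standing in for subsetting, regridding, and fusion.
steps = {
    "fuse": (lambda a, b: list(zip(a, b)), ["subset", "regrid"]),
    "regrid": (lambda s: [round(x) for x in s], ["subset"]),
    "subset": (lambda g: [x for x in g if x >= 0], ["granule"]),
}
values = run_dataflow(steps, {"granule": [-1.0, 0.4, 1.6]})
```

Because the order is derived from the declared inputs rather than from the order in which steps are listed, independent branches could in principle be dispatched in parallel or to remote nodes.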

Service/Operator Orchestration

• Each SciFlo processing step is one of:
– Template for XML (or string) generation
– REST (HTTP GET) call: e.g. WMS/WCS, DAP URLs
– SOAP service call: “have WSDL, will call”
– XPath 2.0 transformation for XML mediation
– XQuery 1.0 query/transformation
– Command-line script or executable
– Python method call
– Scientist’s custom IDL or MATLAB script
– Other (What do you need?)
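
One way to picture this list is as a dispatcher keyed on step type. The sketch below covers only three of the step types (template, Python method call, command line) and is purely illustrative: the `run_step` function and the dictionary keys `kind`, `text`, `call`, `args`, and `argv` are hypothetical and are not taken from the SciFlo schema.

```python
import importlib
import subprocess
from string import Template


def run_step(step, bindings):
    """Run one processing step described by a plain dictionary (hypothetical layout)."""
    kind = step["kind"]
    if kind == "template":
        # Fill $name placeholders to generate XML or any other string.
        return Template(step["text"]).substitute(bindings)
    if kind == "python":
        # "package.module.function" is resolved and called with the named bindings.
        module_name, func_name = step["call"].rsplit(".", 1)
        func = getattr(importlib.import_module(module_name), func_name)
        return func(*[bindings[arg] for arg in step["args"]])
    if kind == "command":
        # Each argument may contain $name placeholders; stdout is the step's output.
        argv = [Template(arg).substitute(bindings) for arg in step["argv"]]
        return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
    raise ValueError(f"unsupported step kind: {kind}")


query_xml = run_step(
    {"kind": "template", "text": "<query start='$start'/>"},
    {"start": "2008-12-17"},
)
distance = run_step(
    {"kind": "python", "call": "math.hypot", "args": ["x", "y"]},
    {"x": 3, "y": 4},
)
```

REST, SOAP, XPath, and XQuery steps would add further branches; they are omitted here to keep the sketch free of third-party dependencies.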

GPS-AIRS Matchup & Temp. Profile Comparison

VizFlow Flowchart

• Connect a series of services and operators into a dataflow
• Drag services/operators from menu, and drop onto the canvas
• Lay out the flowchart by moving nodes
• Connect the input/output ports by drawing lines
• User guided by matching up port names and types

SciFlo Applications

• AMAPS = Aerosol Modeling And Processing System
– Amy Braverman, ACCESS PI; Joyce Penner, U. Michigan
– Compare Aerosol Optical Depth (AOD) from MODIS, MISR, & AERONET to IMPACT model
• AQUA = Automated Query & Access
– One-year ACCESS ECHO grant (Brian Wilson, PI)
– Automated, repeatable access to 5-year EOS datasets for large-scale data mining
• MEASUREs Project – Eric Fetzer, PI
– Publish a temperature & water vapor climatology stratified by cloud scene (CloudSat classes) using A-Train data (AIRS, AMSR-E, AMSU, MODIS, MLS)
• Climate Virtual Observatory
– Examine the biases of temperature retrievals from AIRS, AMSU, MLS by comparisons to GPS occultations
– Stratify biases by geophysical conditions, cloud scene, etc.; study decade-scale trends.

Uses of Provenance

• Debugging production (instrument drift, algorithm bugs)
– What data granule caused a production failure?
– What executable version yielded “dubious” products?
• Traceable science
– What data observed a climate event/anomaly?
– What data contributed to analysis of a climate trend?
• Reproducible science
– Re-generate the science analysis years later
– Allow peer reviewers to reproduce and vary the science analysis by executing the workflows

Provenance in Production Systems

• Two Approaches:
• Instrument the Production Scripts
– Call out to ‘provenance capture’ API to record metadata
– Could be web service calls
– Intrusive
– Only retain what you explicitly save
• Use Formal Workflow for Production
– Annotated workflow document contains provenance
– Versions of operators
– Web service endpoints
– Intermediate & final results, or pointers to them
– Use links to limit combinatorial explosion
– Workflow points to entire provenance, if URIs are permanent
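
The contrast between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical: `capture` stands in for an unspecified 'provenance capture' API, and `run_workflow` is a toy linear engine, not SciFlo. In the first approach the science code must call the API itself; in the second the engine records each step's inputs and outputs without the operators knowing.

```python
provenance_log = []


def capture(event, **metadata):
    """Hypothetical 'provenance capture' API; could equally be a web service call."""
    provenance_log.append({"event": event, **metadata})


def instrumented_subset(granule_url, values):
    """Approach 1: the production script is edited to record its own provenance."""
    capture("input", url=granule_url)
    kept = [v for v in values if v >= 0]
    capture("output", count=len(kept))  # only what is explicitly saved is retained
    return kept


def run_workflow(steps, data):
    """Approach 2: the engine annotates every step; operators stay untouched."""
    annotated = []
    for name, func in steps:
        result = func(data)
        annotated.append({"step": name, "input": data, "output": result})
        data = result
    return data, annotated


kept = instrumented_subset("http://example.org/granule-1", [-1, 2, 3])
total, annotated = run_workflow(
    [("subset", lambda v: [x for x in v if x >= 0]), ("total", sum)],
    [-1, 2, 3],
)
```

In the second form the operators (`lambda`, `sum`) contain no provenance code at all, which is the sense in which the workflow route is non-intrusive.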

Provenance Graph

• Chain of provenance is a directed graph linking inputs, processors with configuration, & computed outputs
• Saving the graph (replicas of resources)
– Bullet proof
– But unnecessary duplication
– Combinatorial explosion
• Saving the graph (links to resources)
– Provenance graph is on the web
– But links rot, so more fragile
• Importance of permanent names (simplifies problem)
– Permanent names under-used on the web
– URLs can be permanent, just policy
– Provenance system can guarantee permanence
– Also could migrate to another system, e.g. XRI
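
A link-based provenance graph can be sketched as a mapping from each resource's permanent name to the names it was derived from. The sketch below is an assumption-laden illustration: the `lineage` function and the `example.org` URIs are invented, and a real system would dereference the links over HTTP rather than look them up in a dictionary.

```python
def lineage(graph, product):
    """Return every resource reachable from `product` by following 'derived from' links.

    graph: {resource_uri: [uris of the inputs, operators, or configs it was derived from]}
    """
    seen = set()
    stack = [product]
    while stack:
        uri = stack.pop()
        for parent in graph.get(uri, []):
            if parent not in seen:  # shared ancestors are visited once
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical permanent names; only links are stored, not replicas.
graph = {
    "http://example.org/products/climatology-v1": [
        "http://example.org/results/matchups-42",
        "http://example.org/operators/regrid/1.3",
    ],
    "http://example.org/results/matchups-42": [
        "http://example.org/granules/airs-001",
        "http://example.org/granules/gps-007",
        "http://example.org/operators/matchup/2.0",
    ],
}
ancestors = lineage(graph, "http://example.org/products/climatology-v1")
```

Storing links keeps the graph small, but the traversal only works while every name still resolves, which is why the slide stresses permanent names.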

Example: Subsetting Workflow

MISR Granule Subsetter

Annotated SciFlo Document

Auto-generated SciFlo Input Form

Input widgets

Processing Step #1: Call to versioned web service

Processing Step #2: Call to method in Python module

Version provenance:
– Execution engine adds version annotation
– Or here code bundle is versioned
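
How an engine might attach a version annotation to a Python method call can be sketched as follows. This is a guess at the general idea, not SciFlo's mechanism: `call_with_version` and the annotation keys are hypothetical, and reading a module-level `__version__` attribute is only a common convention that many modules do not follow (the sketch records `None` in that case).

```python
import importlib
import sys


def call_with_version(call, *args):
    """Call 'package.module.function' and return (result, version annotation)."""
    module_name, func_name = call.rsplit(".", 1)
    module = importlib.import_module(module_name)
    result = getattr(module, func_name)(*args)
    annotation = {
        "call": call,
        # Convention only: many modules expose __version__, many do not.
        "module_version": getattr(module, "__version__", None),
        "python_version": sys.version.split()[0],
    }
    return result, annotation
```

When the operator module carries no version of its own, the alternative named on the slide applies: version the whole code bundle and record that identifier instead.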

Annotated Results Section

Results document exact granules used

Annotated Results (2)

Intermediate results from each processing step

Provenance Features

• Web Services & Operators versioned
– Versioned URIs or annotations
– Can snapshot code bundles, or better a virtual image of OS with installed operators
• Save (small) intermediate results
– Particularly, list of input data granules returned from query
– Can still reproduce workflow even if service/op unavailable
– SciFlo user controls what is saved
• Provenance is contained in annotated SciFlo doc. and resources it links to (URIs):
– Provenance is immediately returned to user with results
– Distributed provenance implicit in links between workflows
– Resource replicas can be bundled according to user policy
– Database not necessary, but can be used for archiving/query
– Provenance graph can be transformed to OPM/XML for interoperability
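
The point about saved intermediate results can be made concrete with a small replay sketch. The names are hypothetical (`run_or_replay`, the step names, the granule file names), and for brevity the sketch saves every step's output, whereas the slide says the SciFlo user chooses what is saved.

```python
def run_or_replay(steps, saved=None):
    """Run a linear workflow, reusing any step output already present in `saved`.

    steps: [(step_name, callable taking the previous step's output)]
    saved: {step_name: output} recovered from an earlier annotated run.
    """
    results = dict(saved or {})
    data = None
    for name, func in steps:
        if name in results:
            data = results[name]   # replay: the service or operator is never called
        else:
            data = func(data)
            results[name] = data   # keep the intermediate result for future replays
    return data, results


def unavailable_query(_):
    """Stands in for a granule query service that has since been retired."""
    raise RuntimeError("query service unavailable")


steps = [("query", unavailable_query), ("count", len)]
saved = {"query": ["granule-001.hdf", "granule-002.hdf"]}
final, results = run_or_replay(steps, saved)
```

Here the workflow still completes because the granule list from the original run was kept with the annotated results, even though the query step can no longer execute.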

“Practical” Provenance

• Commit to Workflow (many other benefits)
– Declarative production streams
– Auto-parallel execution
– Workflow can be distributed using web services (multi-sensor, multi-data center science)
• Get Basic Provenance For Free
– No need to instrument production systems
– Provenance graph implicit in SciFlo document & REST URLs
– Web graph of permanent resources
– By policy, can archive resource replicas (with URL redirection)
• Web Services Era
– Distributed provenance is important
– SciFlo or OPM docs. link to each other
– Trace full graph by following REST URLs
– Reproducible workflows