

Page 1: Matthew B. Jones - State of the Salmonstateofthesalmon.org/pdfs/SalDAWG PDFs/2009/Jones-kepler.pdfKepler: A scientific workflow support tool Matthew B. Jones National Center for Ecological

Kepler: A scientific workflow support tool

Matthew B. Jones

National Center for Ecological Analysis and Synthesis (NCEAS) University of California Santa Barbara

November 5, 2009

Page 2

Data analysis and modeling

• Access to data is just the beginning

• Analysis and Modeling are critical to synthesis

–  communication about analytical processes
–  analytical transparency

• Transform and integrate data from many sources

Page 3

Analyses and models

•  Wide variety of analyses used in ecology and conservation science
   –  Statistical analyses and trends
   –  Rule-based models
   –  Dynamic models (e.g., continuous time)
   –  Individual-based models (agent-based)
   –  many others

•  Implemented in many frameworks
   –  implementations are black boxes
   –  learning curves can be steep
   –  difficult to couple models

Page 4

Analysis/Modeling Challenges

•  Manual process to work with multiple analytical systems

•  Data are discovered outside of tools and imported manually

•  Difficult to understand models at a glance

•  Difficult to revise analyses except in scripted systems

•  No accepted way to publish models to share with colleagues

•  Very little re-use of components – many re-inventions

•  Difficult to use multiple computers for one analysis/model
   –  Only a few experts use grid computing

Page 5

Models as ‘scientific workflows’

•  Current analytical practices are difficult to manage

•  Model the steps used by researchers during analysis
   –  Graphical model of flow of data among processing steps

•  Each step often occurs in different software
   –  Matlab, R, SAS, C/C++, Fortran, Swarm, ...
   –  Each component can ‘wrap’ external systems, presenting a unified view

•  Refer to these graphs as ‘Scientific Workflows’

[Figure: workflow diagram with steps labeled Data, Clean, Analyze/Model, Graph]
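
The idea that a component can 'wrap' an external system can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kepler's implementation: `make_external_step` is a hypothetical helper, and a second Python process stands in for the external tool so that the example runs without R or Matlab installed.

```python
import subprocess
import sys


def make_external_step(command):
    """Return a step that pipes text through an external command.

    `command` is an argument list such as ["Rscript", "clean.R"]
    (hypothetical); the caller sees the same text-in, text-out
    interface whatever language sits behind it.
    """
    def step(text):
        result = subprocess.run(
            command, input=text, capture_output=True, text=True, check=True
        )
        return result.stdout
    return step


# Stand-in for an external tool: a child Python process that upper-cases
# whatever it reads on standard input.
upper = make_external_step(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"]
)

print(upper("site,temp"))
```

Because every wrapped tool exposes the same calling convention, steps written in different systems can be chained without the caller knowing which system runs each one.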

Page 6

Scientific workflows

•  What are scientific workflows?
   –  Graphical model of data flow among processing steps
   –  Inputs and Outputs of components are precisely defined
   –  Components are modular and reusable
   –  Flow of data controlled by a separate execution model
   –  Support for hierarchical models

[Figure: example workflow graph with components labeled A, B, C, A’, D, E, F, annotated Source (e.g., data), Processor (e.g., regression), Sink (e.g., display)]
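
A minimal Python sketch of these properties, assuming a very simple design: the `Workflow` class and `run_sequential` function are hypothetical names, not Kepler APIs. The graph only records components and their connections; a separate function decides how and when each component fires, which is the sense in which the execution model is kept apart from the graph.

```python
from graphlib import TopologicalSorter


class Workflow:
    """Components plus the connections between them (hypothetical class)."""

    def __init__(self):
        self.components = {}   # name -> function of the upstream outputs
        self.upstream = {}     # name -> list of upstream component names

    def add(self, name, func, inputs=()):
        self.components[name] = func
        self.upstream[name] = list(inputs)


def run_sequential(workflow):
    """One possible execution model: fire each component once,
    in an order that respects the data dependencies."""
    results = {}
    for name in TopologicalSorter(workflow.upstream).static_order():
        args = [results[u] for u in workflow.upstream[name]]
        results[name] = workflow.components[name](*args)
    return results


wf = Workflow()
wf.add("source", lambda: [3.0, 1.0, 2.0])                  # source, e.g. data
wf.add("processor", lambda xs: sorted(xs), ["source"])     # processing step
wf.add("sink", lambda xs: f"values: {xs}", ["processor"])  # sink, e.g. display

results = run_sequential(wf)
print(results["sink"])
```

A different execution model (for example one that streams values or runs components in parallel) could be swapped in by replacing `run_sequential` without touching the graph definition.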

Page 7

Kepler scientific workflow system

Data source from repository

res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)

R processing script
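
The R script on this slide fits a straight line of BARO against T_AIR with `lm` and plots it. For readers who do not use R, the same least-squares fit can be sketched in plain Python. The data values below are invented for illustration and are not from the slide; `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic air temperature and barometric pressure readings (assumed).
t_air = [10.0, 12.0, 14.0, 16.0, 18.0]
baro = [1015.0, 1014.0, 1013.0, 1012.0, 1011.0]

intercept, slope = fit_line(t_air, baro)
print(f"BARO = {intercept:.1f} + {slope:.2f} * T_AIR")
```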

Page 8

A Simple Kepler Workflow

Component Tab

Workflow Run Manager

Searchable Component List

Page 9

Component Documentation

Page 10

Direct Data Access

398 hits for ‘ADCP’ located in search

Drag to workflow area to create datasource

Search for metadata term (“ADCP”)

Page 11

Lotka-Volterra Model
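
For reference, the classic Lotka-Volterra predator-prey model is dx/dt = αx − βxy and dy/dt = δxy − γy, where x is prey and y is predators. The sketch below integrates it in Python with a fixed-step forward Euler scheme. The parameter values, initial populations, and step size are assumptions chosen for illustration, and the slide does not say which integrator the Kepler workflow uses.

```python
def lotka_volterra(prey, predator, alpha, beta, delta, gamma):
    """Right-hand side of the Lotka-Volterra equations."""
    d_prey = alpha * prey - beta * prey * predator
    d_predator = delta * prey * predator - gamma * predator
    return d_prey, d_predator


def simulate(prey, predator, params, dt=0.001, steps=10000):
    """Forward Euler integration; returns a list of (prey, predator) pairs."""
    trajectory = [(prey, predator)]
    for _ in range(steps):
        d_prey, d_predator = lotka_volterra(prey, predator, *params)
        prey += dt * d_prey
        predator += dt * d_predator
        trajectory.append((prey, predator))
    return trajectory


# alpha, beta, delta, gamma: illustrative values only.
params = (1.0, 0.1, 0.075, 1.5)
trajectory = simulate(10.0, 5.0, params)
print(len(trajectory), trajectory[-1])
```

Forward Euler is the simplest choice but drifts over long runs; a higher-order method would be preferable for anything beyond a demonstration.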

Page 12

Gene sequences via web services

Gene sequence returned in XML format

Web service executes remotely (e.g., in Japan)

This entire workflow can be wrapped as a re-usable component so that the details of extracting sequence data are hidden unless needed.

Extracted sequence can be returned for further processing
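
The extraction step described on this slide can be sketched in Python as follows. The XML layout, element names, and accession value are hypothetical, since the slide does not show the service's actual response format; a real service defines its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; not the format of any particular service.
RESPONSE = """<result>
  <entry accession="EXAMPLE1">
    <sequence>atgc ggta
    ccat</sequence>
  </entry>
</result>"""


def extract_sequence(xml_text):
    """Pull the sequence text out of the response and normalise it."""
    root = ET.fromstring(xml_text)
    node = root.find("./entry/sequence")
    if node is None or node.text is None:
        raise ValueError("no <sequence> element in response")
    # Remove whitespace and line breaks, then upper-case the bases.
    return "".join(node.text.split()).upper()


sequence = extract_sequence(RESPONSE)
print(sequence)
```

Hiding this parsing inside one function mirrors the point made on the slide: downstream steps receive a clean sequence and need not know how it was retrieved.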

Page 13

Slide: B. Ludaescher T. Fricke: ORB work in Kepler

ORB

Page 14

Kilo Nalu Workflow

Streaming Data from observatory DataTurbine Server

Graphs and derived data can be archived and displayed

now <- Sys.time()
Epoch <- now - as.numeric(now)
timeval <- Epoch + timestamps
posixtmedian = median(timeval)
mediantime = as.numeric(posixtmedian)
meantemp = mean(data)

Support application scripts in R, Matlab, etc.

Modular components, easily saved and shared
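
The R fragment on this slide turns epoch-second timestamps into date-times, takes their median, and averages the data values. A rough Python equivalent is sketched below. The function name `summarize_window` and the sample readings are assumptions for illustration; the actual Kilo Nalu data are not shown in the slide.

```python
from datetime import datetime, timezone
from statistics import mean, median


def summarize_window(timestamps, data):
    """Return the median time (as seconds and as a UTC datetime)
    and the mean of the data values for one window of a stream."""
    median_seconds = median(timestamps)
    median_time = datetime.fromtimestamp(median_seconds, tz=timezone.utc)
    return median_seconds, median_time, mean(data)


# Invented sample: three readings one minute apart.
timestamps = [1257379200, 1257379260, 1257379320]
temps = [24.1, 24.3, 24.5]

median_seconds, median_time, mean_temp = summarize_window(timestamps, temps)
print(median_seconds, median_time.isoformat(), round(mean_temp, 2))
```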

Page 15

Composite actors aid comprehension

Page 16

Composite actors aid comprehension

Scientific workflows use hierarchy to hide complexity:

•  Top level workflows can be a conceptual representation of the science process that is easy to comprehend at a glance

•  Drilling down into sub-workflows reveals increasing levels of detail

•  Composing models using hierarchy promotes the development of re-usable components that can be shared with other scientists
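
The hierarchy described above can be illustrated with ordinary function composition in Python. This is a sketch of the idea only, with hypothetical names (`compose`, `clean`, `analyze`), and not how Kepler represents composite actors internally.

```python
def compose(*steps):
    """Wrap a sequence of single-input steps as one reusable step."""
    def composite(value):
        for step in steps:
            value = step(value)
        return value
    return composite


# Detail level: small, single-purpose steps.
def drop_missing(values):
    return [v for v in values if v is not None]


def to_celsius(values):
    return [(v - 32.0) * 5.0 / 9.0 for v in values]


def average(values):
    return sum(values) / len(values)


# Sub-workflow: its internal steps are hidden behind one name.
clean = compose(drop_missing, to_celsius)

# Top level: reads like the conceptual description of the analysis.
analyze = compose(clean, average)

result = analyze([32.0, None, 212.0])
print(result)
```

The top-level definition stays readable at a glance, while `clean` can be opened up for detail or reused in another analysis.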

Page 17

Run Management & Sharing

Page 18

Report Designer

Page 19

Sharing Widely

•  Web interface [built upon Metacat]
   –  Publish archives
      •  workflows
      •  executions
   –  Manage
      •  Schedule workflow executions
      •  View completed execution results
      •  View rendered reports

Page 20

Advantages of Scientific Workflows

–  Mix analytical systems
   •  Matlab, R, C code, other executables, ...

–  Understand models
   •  visually depict how the analysis works
   •  provide the ability to rapidly review and revise an analysis

–  Parameterize models
   •  provide direct data access to archives and streams

–  Share models
   •  allow sharing of analytical procedures
   •  document precise versions of data and models used

–  Provide provenance information
   •  provenance is critical to science
   •  workflows are metadata about scientific process
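
One way to picture what "document precise versions of data and models used" could mean in practice is a small provenance record written alongside each run. The Python sketch below is an assumption-laden illustration: the record fields, the function names, and the workflow name and version are all hypothetical and do not describe Kepler's own provenance format.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Content hash that changes whenever the input bytes change."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(workflow_name, workflow_version, inputs, parameters):
    """Build a record of what was run, on which data, with which settings.

    `inputs` maps a name to the raw bytes of that input.
    """
    return {
        "workflow": workflow_name,
        "version": workflow_version,
        "inputs": {name: fingerprint(blob)
                   for name, blob in sorted(inputs.items())},
        "parameters": dict(parameters),
    }


record = provenance_record(
    "temperature-summary",                   # hypothetical workflow name
    "1.2",                                   # hypothetical version label
    {"obs.csv": b"time,temp\n0,24.1\n"},     # invented input data
    {"window_seconds": 60},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing hashes instead of file names means two runs can be compared even if the input files were later renamed or moved.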

Page 21

Semantics in scientific workflows

•  Components and their ports typically have:
   –  Explicit ‘structural type’
      •  e.g., int, float, string, {double}
   –  Implicit semantic type
      •  Not sure whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values

[Figure: components A and B with ports typed int[], int[], and string[], and semantic labels Rainfall, BodySize, ForkLength]
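
The gap between structural and semantic types can be made concrete with a short Python sketch. The `Port` class and `can_connect` rule are hypothetical, and the exact-match comparison of semantic labels is a deliberate simplification; a fuller scheme might recognise related concepts instead of requiring identical labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    name: str
    structural_type: str                  # e.g. "int[]"
    semantic_type: Optional[str] = None   # e.g. "Rainfall"; None if unannotated


def can_connect(output: Port, input_: Port) -> bool:
    """Allow a connection only if the structural types match and,
    where both ports are annotated, the semantic types match too."""
    if output.structural_type != input_.structural_type:
        return False
    if output.semantic_type is None or input_.semantic_type is None:
        return True  # structurally fine, but the meaning is unchecked
    return output.semantic_type == input_.semantic_type


rainfall_out = Port("out", "int[]", "Rainfall")
body_size_in = Port("in", "int[]", "BodySize")
untyped_in = Port("in", "int[]")

print(can_connect(rainfall_out, body_size_in))  # same structure, different meaning
print(can_connect(rainfall_out, untyped_in))    # no annotation to compare
```

With structural types alone, both connections would be accepted; the semantic annotation is what lets a tool flag the rainfall-to-body-size link as suspicious.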

Page 22

Questions?

•  http://www.nceas.ucsb.edu/ecoinfo/
•  http://knb.ecoinformatics.org/
•  http://kepler-project.org/

Page 23

Acknowledgments

•  This material is based upon work supported by:

•  The Andrew W. Mellon Foundation

•  The National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676.

•  The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

•  Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis, USGS

•  Kepler contributors: SEEK, REAP, Ptolemy II, SDM/SciDAC