Matthew B. Jones - State of the Salmon (stateofthesalmon.org/pdfs/saldawg...)
TRANSCRIPT
Kepler: A scientific workflow support tool
Matthew B. Jones
National Center for Ecological Analysis and Synthesis (NCEAS), University of California, Santa Barbara
November 5, 2009
Data analysis and modeling
• Access to data is just the beginning
• Analysis and Modeling are critical to synthesis
– communication about analytical processes
– analytical transparency
• Transform and integrate data from many sources
Analyses and models
• Wide variety of analyses used in ecology and conservation science
– Statistical analyses and trends
– Rule-based models
– Dynamic models (e.g., continuous time)
– Individual-based models (agent-based)
– many others
• Implemented in many frameworks
– implementations are black boxes
– learning curves can be steep
– difficult to couple models
Analysis/Modeling Challenges
• Manual process to work with multiple analytical systems
• Data are discovered outside of tools and imported manually
• Difficult to understand models at a glance
• Difficult to revise analyses except in scripted systems
• No accepted way to publish models to share with colleagues
• Very little re-use of components – many re-inventions
• Difficult to use multiple computers for one analysis/model
– Only a few experts use grid computing
• Current analytical practices are difficult to manage
• Model the steps used by researchers during analysis
– Graphical model of flow of data among processing steps
• Each step often occurs in different software
– Matlab, R, SAS, C/C++, Fortran, Swarm, ...
– Each component can ‘wrap’ external systems, presenting a unified view
• Refer to these graphs as ‘Scientific Workflows’
Models as ‘scientific workflows’
[Diagram: a workflow graph with steps Data → Graph → Clean → Analyze/Model; tokens (A, B, C) flow from a Source (e.g., data) to a Sink (e.g., display)]
Scientific workflows
• What are scientific workflows?
– Graphical model of data flow among processing steps
– Inputs and Outputs of components are precisely defined
– Components are modular and reusable
– Flow of data controlled by a separate execution model
– Support for hierarchical models
[Diagram: tokens (A’, B, D, E, F) flowing through a Processor (e.g., regression)]
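The idea above — modular components with defined inputs and outputs, connected into a graph through which data flows — can be sketched as follows. This is a minimal illustration, not Kepler's actual API; all class and function names are hypothetical:

```python
# Minimal sketch of a dataflow "scientific workflow": components with
# named input/output ports, connected in order and run as a pipeline.
# All names here are hypothetical; this is not Kepler's actual API.

class Component:
    def __init__(self, name, func, inputs, outputs):
        self.name = name          # component label, e.g. "Clean"
        self.func = func          # the processing step it wraps
        self.inputs = inputs      # named input ports
        self.outputs = outputs    # named output ports

def run_pipeline(components, data):
    """Push a data token through a linear chain of components."""
    for c in components:
        data = c.func(data)
    return data

# A tiny Source -> Clean -> Analyze chain, ending in a display "sink".
source  = Component("Source",  lambda _: [3.0, 1.0, None, 2.0], [], ["out"])
clean   = Component("Clean",   lambda xs: [x for x in xs if x is not None], ["in"], ["out"])
analyze = Component("Analyze", lambda xs: sum(xs) / len(xs), ["in"], ["out"])

result = run_pipeline([source, clean, analyze], None)
print(result)  # mean of the cleaned values: 2.0
```

Because each step only touches its ports, any component in the chain can be swapped out (e.g., a different cleaning rule) without changing the others — the reusability the slide describes.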
Kepler scientific workflow system
Data source from repository
res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)
R processing script
A Simple Kepler Workflow
[Screenshot: Kepler GUI, showing the component tab, workflow run manager, searchable component list, component documentation, and direct data access]
[Screenshot: searching for the metadata term “ADCP” returns 398 hits; a result is dragged to the workflow area to create a data source]
Lotka-Volterra Model
Gene sequences via web services
• Web service executes remotely (e.g., in Japan)
• Gene sequence returned in XML format
• Extracted sequence can be returned for further processing
• This entire workflow can be wrapped as a re-usable component so that the details of extracting sequence data are hidden unless needed.
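The pattern on this slide — a remote service returns a gene sequence wrapped in XML, and a downstream step extracts the raw sequence — can be sketched as below. The XML layout and element names are hypothetical, and a canned response stands in for the remote web-service call so the example is self-contained:

```python
# Sketch: extract a gene sequence from a web service's XML response.
# The XML structure and element names below are invented for illustration.
import xml.etree.ElementTree as ET

def extract_sequence(xml_text):
    """Pull the raw sequence string out of an XML response."""
    root = ET.fromstring(xml_text)
    elem = root.find(".//sequence")   # hypothetical element name
    return elem.text.strip()

# In Kepler the response would arrive from a remote web-service actor;
# here a canned response stands in for the remote call.
response = """
<result>
  <gene id="ex-1">
    <sequence>ATGGCCATTGTAATGGGCCGC</sequence>
  </gene>
</result>
"""

print(extract_sequence(response))  # ATGGCCATTGTAATGGGCCGC
```

Wrapping the call-and-extract pair as one composite component, as the slide suggests, would hide these parsing details from the top-level workflow.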
Slide: B. Ludaescher; T. Fricke: ORB work in Kepler
Kilo Nalu Workflow
Streaming data from an observatory DataTurbine server
Graphs and derived data can be archived and displayed
now <- Sys.time()
Epoch <- now - as.numeric(now)
timeval <- Epoch + timestamps
posixtmedian <- median(timeval)
mediantime <- as.numeric(posixtmedian)
meantemp <- mean(data)
Support application scripts in R, Matlab, etc.
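The embedded R script on this slide finds the median timestamp and mean temperature of a streamed window of samples. A Python analogue of the same computation (illustrative only — the sample values are invented, and this is not the workflow's actual code):

```python
# Python analogue of the slide's R snippet: given a window of streamed
# samples, find the median sample time and the mean temperature.
# The sample values below are invented for illustration.
from datetime import datetime, timedelta
from statistics import mean, median

base = datetime(2009, 11, 5, 12, 0, 0)
timestamps = [base + timedelta(seconds=s) for s in (0, 10, 20, 30, 40)]
temps = [24.1, 24.3, 24.2, 24.4, 24.5]

# For an even-sized window median() would need to average two datetimes,
# which is not supported, so take the median of epoch offsets instead,
# just as the R code converts to numeric seconds before median().
epoch = datetime(1970, 1, 1)
median_secs = median((t - epoch).total_seconds() for t in timestamps)
median_time = epoch + timedelta(seconds=median_secs)
mean_temp = mean(temps)

print(median_time, mean_temp)
```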
Modular components, easily saved and shared
Composite actors aid comprehension
Scientific workflows use hierarchy to hide complexity:
• Top level workflows can be a conceptual representation of the science process that is easy to comprehend at a glance
• Drilling down into sub-workflows reveals increasing levels of detail
• Composing models using hierarchy promotes the development of re-usable components that can be shared with other scientists
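The composite-actor idea above — a sub-workflow wrapped so it looks like a single step at the top level — can be sketched as follows (a hypothetical illustration, not Kepler's mechanism):

```python
# Sketch of a "composite actor": a sub-workflow wrapped so it appears
# as one reusable step in the enclosing workflow. Names are hypothetical.

def make_composite(steps):
    """Wrap a list of processing functions as one reusable function."""
    def composite(data):
        for step in steps:
            data = step(data)
        return data
    return composite

# Sub-workflow detail: drop missing values, then rescale to [0, 1].
drop_missing = lambda xs: [x for x in xs if x is not None]
rescale      = lambda xs: [x / max(xs) for x in xs]

# At the top level the detail is hidden behind one component,
# which can be saved and shared like any other.
preprocess = make_composite([drop_missing, rescale])
print(preprocess([2.0, None, 4.0]))  # [0.5, 1.0]
```

Drilling into `preprocess` reveals the two sub-steps; composing this way is what promotes the re-usable, shareable components the slide describes.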
Run Management & Sharing
Report Designer
Sharing Widely
• Web interface [built upon Metacat]
– Publish archives
• workflows
• executions
– Manage
• Schedule workflow executions
• View completed execution results
• View rendered reports
Advantages of Scientific Workflows
• Mix analytical systems
– Matlab, R, C code, other executables, ...
• Understand models
– visually depict how the analysis works
– provide the ability to rapidly review and revise an analysis
• Parameterize models
– provide direct data access to archives and streams
• Share models
– allow sharing of analytical procedures
– document precise versions of data and models used
• Provide provenance information
– provenance is critical to science
– workflows are metadata about scientific process
Semantics in scientific workflows
• Components and their ports typically have:
– Explicit ‘structural type’
• e.g., int, float, string, {double}
– Implicit semantic type
• Not sure whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values
[Diagram: components A and B connected by int[] and string[] ports, annotated with semantic types Rainfall, BodySize, ForkLength]
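The distinction above can be made concrete: two ports can share a structural type (say, an array of floats) while carrying different semantic types, so a purely structural check would happily wire them together. A minimal sketch, with an invented `Port` class and semantic labels:

```python
# Sketch: a structural type check alone cannot tell 'Rainfall' data from
# 'BodySize' data, since both may be float arrays. Checking an attached
# semantic type catches the mismatch. All names here are hypothetical.

class Port:
    def __init__(self, structural, semantic):
        self.structural = structural  # e.g. "float[]"
        self.semantic = semantic      # e.g. "Rainfall"

def can_connect(out_port, in_port):
    """Require both the structural AND the semantic types to match."""
    return (out_port.structural == in_port.structural
            and out_port.semantic == in_port.semantic)

rain_out = Port("float[]", "Rainfall")
rain_in  = Port("float[]", "Rainfall")
size_in  = Port("float[]", "BodySize")

print(can_connect(rain_out, rain_in))  # True
print(can_connect(rain_out, size_in))  # False: same shape, wrong meaning
```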
Questions?
• http://www.nceas.ucsb.edu/ecoinfo/ • http://knb.ecoinformatics.org/ • http://kepler-project.org/
Acknowledgments
• This material is based upon work supported by:
• The Andrew W. Mellon Foundation
• The National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676.
• The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
• Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis, USGS
• Kepler contributors: SEEK, REAP, Ptolemy II, SDM/SciDAC