SEEK meeting, UCSB, 10/22-26/2003
Towards Scientific Workflows Based on Dataflow Process Networks
(or from Ptolemy to Kepler)
Bertram Ludäscher
San Diego Supercomputer Center
[email protected]@SDSC.edu
A Note on the Style of the following Slides
Due to lack of time, most of the following slides will be “by reference” only ;-)
– …Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world… [The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]
Acknowledgements
• NSF, NIH, DOE
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/
Example: Promoter Identification Workflow (PIW) (simplified)
From: SciDAC/SDM project and collaboration w/ Matt Coleman (LLNL)
Conceptual Workflow (Promoter Identification Workflow PIW)
[Figure: workflow diagram. Box labels, in extraction order: Compute clusters (min. distance); Select gene-set (cluster-level); For each gene: Retrieve transcription factors; Arrange transcription factors; For each promoter: Compute subsequence labels; With all promoter models: Compute joint promoter model; Retrieve matching cDNA; Retrieve genomic sequence; Extract promoter region (begin, end); Create consensus sequence; Align promoters]
Details of the Functional MRI (Magnetic Resonance Imaging) Analysis Workflow (Jeffrey Grethe)
1. Collect data (K-space images in Fourier space) from MR scanner while subject performs a specific task
2. Reconstruct K-space data to image data (this requires scanner parameters for the reconstruction)
3. Now have anatomical and functional data
4. Pre-process the functional data
   1. Correct for differences in slice acquisition (each slice in a volume is collected at a slightly different time), so that all slices seem to be acquired at the same time
   2. Now correct for subject motion (head movement in scanner) by realigning all functional images
5. Register the functional images with the anatomical image; all images are now in the same space (all aligned with one another)
6. Move all subjects into template space through non-linear spatial normalization. There exist atlas templates (made from many subjects) that one can normalize to so that all subjects are in the same space, allowing for direct comparison across subjects.
7. DATA VERIFICATION: check if all these procedures worked. If not, go back and try again (possibly tweaking some parameters for the routines or re-doing some of it by hand).
8. Move on to statistics. First we do single-subject statistics: in addition to the images, information about the experimental paradigm is required. The results can be overlaid onto an anatomical image to create visual displays of brain activation during a particular task.
9. Can also combine statistical data from multiple subjects and do a group/population analysis and display these results.
Interactive nature of these workflows is critical (data verification): can these steps be automated or semi-automated?
Need metadata from collection equipment and experimental design!
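The verify-and-retry loop in step 7 can be sketched as a small higher-order function. This is only an illustration in Haskell: `firstVerified` and `realign` are hypothetical names, and the "residual motion" arithmetic is a made-up stand-in for a real processing routine and its quality check.

```haskell
-- Try a processing step with successive parameter settings and return the
-- first setting whose result passes the verification check.
firstVerified :: (p -> a -> b) -> (b -> Bool) -> [p] -> a -> Maybe (p, b)
firstVerified step ok params input =
  case [(p, out) | p <- params, let out = step p input, ok out] of
    hit : _ -> Just hit
    []      -> Nothing

-- Hypothetical stand-in for a realignment routine: a larger smoothing
-- parameter leaves less residual motion.
realign :: Double -> Double -> Double
realign smoothing motion = motion / smoothing

main :: IO ()
main = print (firstVerified realign (< 1.0) [1, 2, 4] 3.0)  -- Just (4.0,0.75)
```

A `Nothing` result corresponds to the case where no automatic parameter setting passes verification and the user has to step in by hand.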
GARP Invasive Species Pipeline
[Figure: pipeline diagram. Actors: EcoGrid Query, Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, User, Generate Metadata, Archive To EcoGrid. Sources: registered EcoGrid databases. Data products: (a) species presence & absence points (native range; invasion area); (b) environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample, test sample; (e) GARP rule set; (f) native range prediction map, invasion area prediction map; (g) model quality parameter; (h) selected prediction maps. Further labels: +A1, +A2, +A3]
From: NSF SEEK (Deana Pennington et al)
Scientific Workflows: Some Findings
• More dataflow than workflow
  – but some branching, looping, merging, …
  – not: documents/objects undergoing modifications
  – instead: dataset-out = analysis(dataset-in)
• Need for “collection/functional-style programming” (FP)
  – Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (compute/transform alternations)
• Need for rich user interaction / steering:
  – pause & resume
  – select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
  – data provenance (“virtual data”; cf. several ITR and e-Science projects)
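As a rough illustration of the collection/functional-style operations listed above (iteration, filtering, composition, zip), here is a minimal Haskell sketch. The `Sample` type and its values are invented for the example and do not come from any of the projects mentioned.

```haskell
-- Hypothetical records standing in for a dataset flowing through a workflow.
type Sample = (String, Double)   -- (site name, measurement)

samples :: [Sample]
samples = [("A", 2.5), ("B", -1.0), ("C", 4.0)]

-- Functional composition of three list operations:
-- project the measurement, filter out negatives, then rescale each item.
cleaned :: [Double]
cleaned = (map (* 2) . filter (>= 0) . map snd) samples

-- zip pairs up two collections element by element.
labelled :: [(Int, Double)]
labelled = zip [1 ..] cleaned

main :: IO ()
main = mapM_ print labelled
```

Each step has the shape dataset-out = analysis(dataset-in): no object is modified in place.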
(Analytical) Pipelines … (Scientific) Workflows
• Spectrum of languages & formalisms:
  – Pipelines (a la Unix)
  – Dataflow languages:
    • Synchronous dataflow networks (SDF)
    • Kahn’s process networks (PN)
  – “Web page-flow”:
    • Active XML, WebML, …
    • Hesitating-weak-alternating-tree-automata-ML
    • …
  – (Business) Workflows:
    • WfMC’s XPDL, WSFL, BPEL4WS, …
Business Workflows
• Business Workflows
  – show their office automation ancestry
  – documents and “work-tasks” are passed
  – no data streaming, data-intensive pipelines
  – lots of standards to choose from: WfMC, BPML, BPEL4WS, XPDL, …
  – but often no clear semantics for constructs as simple as this:
Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
The ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/
More on Scientific WF vs Business WF
• Business WF
  – Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout
  – Complex control flow, task-oriented
  – Transactions w/o rollback (ticket: reserved → purchased)
  – …
• SWF
  – data-in and data-out of an analysis step are not the same object!
  – dataflow, data-oriented (cf. AVS/Express, Khoros, …)
  – re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
  – data integration & semantic mediation as part of SWF framework!
  – …
SWF vs Distributed Computing
• Distributed Computing (e.g. a la Condor(-G))
  – Batch oriented
  – Transparent distributed computing (“remote Unix/Java”; standard/Java universes in Condor)
  – HPC resource allocation & scheduling
• SWF
  – Often highly interactive for decision making/steering of the WF and visualization (data analysis)
  – Transparent data access (Grid) and integration (database mediation & semantic extensions)
  – Desktop metaphor (“microworkflow”!?); often (but not always!) light-weight web service invocation
Ptolemy-II
• Recommendations following:
  – must read
  – must see (now: snippets following; watch for new ways to compress slides ;-)
  – must try
• Bottom line:
  – a sophisticated system to do “simple” things (dataflows) as well as highly complex things (hybrid models) (compare to your favorite standard/approach/system)
Dataflow Process Networks and Ptolemy-II
see! try! read!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
In our (SEEK) terminology: Think of it as “Workflow
Execution Model++”
Our SEEK/ SciDAC/Kepler extensions here!
Some Glimpses on the PT-II Execution Models (“Domains”)
Kahn Process Networks (PN)
• Concurrent processes communicating through one-way FIFO channels with unbounded capacity
• A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
• increasing chain of sets of sequences → outputs may not increase!
• Consider increasing chains (wrt. prefix ordering “<”) of streams
• PN is continuous if lub(Xs) exists for all increasing chains Xs and
  – F(lub(Xs)) = lub(F(Xs))
• Continuous implies monotonic:
  – if Xs < Ys then F(Xs) < F(Ys)
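The monotonicity condition can be checked on finite examples. This is a minimal Haskell sketch that models a stream as a finite list and prefix ordering as `isPrefixOf`. The names `runningSum`, `total` and `monotonicOn` are illustrative assumptions, and the check covers only the prefixes of one given input, so it is a test, not a proof.

```haskell
import Data.List (inits, isPrefixOf, tails)

-- A stream, modelled as a finite list of tokens for testing purposes.
type Stream a = [a]

-- Emits one output token per input token: more input only extends the output.
runningSum :: Stream Int -> Stream Int
runningSum = scanl1 (+)

-- Needs the whole input before it can emit anything: not monotonic.
total :: Stream Int -> Stream Int
total xs = [sum xs]

-- For every pair of prefixes p < q of xs, check that f p is a prefix of f q.
monotonicOn :: Eq b => (Stream a -> Stream b) -> Stream a -> Bool
monotonicOn f xs =
  and [f p `isPrefixOf` f q | (p : qs) <- tails (inits xs), q <- qs]

main :: IO ()
main = do
  print (monotonicOn runningSum [1, 2, 3, 4])  -- True
  print (monotonicOn total [1, 2, 3, 4])       -- False
```

`total` fails because its output for a short input, e.g. `[0]` for the empty stream, is not a prefix of its output for a longer input.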
Process Networks (cont’d)
• PN in essence: simultaneous relations between sequences
• Network of functional processes can be described by a mapping X = F(X, I)
  – X denotes all the sequences in the network (inputs I + outputs)
• X that forms a solution is a fixed point
• Continuity implies exactly one “minimal” fixed point
  – minimal in the sense of prefix ordering for any inputs I
  – execution of the network: given I, find the minimal fixed point (works because of the monotonic property)
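A lazy functional language makes the fixed-point reading of X = F(X, I) directly executable. The sketch below is illustrative only: the one-actor network (an adder fed back through a unit delay) and the names `step`, `solve` and `approximations` are assumptions chosen for the example.

```haskell
import Data.Function (fix)

-- One network equation X = F(X, I): an adder whose output is fed back
-- through a unit delay (the initial token 0).
step :: [Int] -> [Int] -> [Int]
step input x = zipWith (+) input (0 : x)

-- Least fixed point, obtained directly through lazy recursion.
solve :: [Int] -> [Int]
solve input = fix (step input)

-- Iterating F from the empty stream: each approximation is a prefix of the
-- next one, i.e. an increasing chain in the prefix ordering.
approximations :: [Int] -> [[Int]]
approximations input = iterate (step input) []

main :: IO ()
main = do
  print (take 5 (solve [1 ..]))              -- [1,3,6,10,15]
  print (take 4 (approximations [1, 2, 3]))  -- [[],[1],[1,3],[1,3,6]]
```

The second output shows the chain of approximations growing toward the minimal fixed point. The first shows that lazy evaluation reaches the same stream without materializing the intermediate steps.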
Synchronous Data Flow Networks (SDF)
• Special case of PN
• Ptolemy-II SDF overview
  – SDF supports efficient execution of dataflow graphs that lack control structures
  – with control structures: Process Networks (PN)
  – requires that the rates on the ports of all actors be known beforehand
  – rates do not change during execution
  – in systems with feedback, delays, which are represented by initial tokens on relations, must be explicitly noted
• SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
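How known rates allow a schedule to be fixed before execution can be illustrated with the balance equations for a linear chain of actors. This Haskell sketch is a simplification under stated assumptions: `Edge` and `repetitions` are hypothetical names, all rates are assumed positive, and it handles only chains, not general graphs or feedback delays.

```haskell
import Data.Ratio (denominator, numerator, (%))

-- One edge of a linear actor chain: tokens produced per upstream firing
-- and tokens consumed per downstream firing (both assumed positive).
data Edge = Edge { produced :: Integer, consumed :: Integer }

-- Smallest positive firing counts r such that, on every edge,
-- r_upstream * produced == r_downstream * consumed.
repetitions :: [Edge] -> [Integer]
repetitions edges = [numerator (r * fromInteger scale) | r <- rates]
  where
    -- Relative firing rates, with the first actor normalized to 1.
    rates :: [Rational]
    rates = scanl (\r (Edge p c) -> r * (p % c)) 1 edges
    -- Clear all denominators to obtain integer firing counts.
    scale = foldr (lcm . denominator) 1 rates

main :: IO ()
main = print (repetitions [Edge 2 3, Edge 1 2])  -- [3,2,1]
```

For the chain A (2 tokens out) to B (3 in, 1 out) to C (2 in), one balanced iteration fires A three times, B twice and C once. After that, every channel is back to its initial token count.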
Extended Kahn-MacQueen Process Networks
• A process is considered active from its creation until its termination
• An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked) or when waiting for a queued topology change request to be processed (mutation-blocked)
• A deadlock is when all the active processes are blocked
  – real deadlock: all the processes are blocked on a read
  – artificial deadlock: all processes are blocked, at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity amongst all the receivers on which a process is blocked on a write. This breaks the deadlock.
  – If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
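The deadlock rules above can be written down as a small decision function. This is a sketch in Haskell, not Ptolemy-II code: all type and function names are invented, mutation-blocked processes are left out, and doubling the capacity is an assumed growth rule (the slide does not say by how much the capacity is increased).

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

data Receiver = Receiver { name :: String, capacity :: Int }
  deriving (Eq, Show)

data ProcessState
  = Running
  | ReadBlocked
  | WriteBlocked Receiver        -- the receiver whose queue is full
  deriving (Eq, Show)

data Verdict
  = NoDeadlock
  | RealDeadlock
  | Grow Receiver                -- receiver with its new, larger capacity
  | QueueLimitExceeded Receiver  -- corresponds to throwing the exception
  deriving (Eq, Show)

resolve :: Int -> [ProcessState] -> Verdict
resolve maxCapacity states
  | null states || any (== Running) states = NoDeadlock
  | null full                              = RealDeadlock
  | capacity grown > maxCapacity           = QueueLimitExceeded smallest
  | otherwise                              = Grow grown
  where
    full     = [r | WriteBlocked r <- states]
    smallest = minimumBy (comparing capacity) full
    grown    = smallest { capacity = 2 * capacity smallest }  -- assumed rule

main :: IO ()
main = do
  let a = Receiver "a" 1
      b = Receiver "b" 4
  print (resolve 8 [ReadBlocked, WriteBlocked b, WriteBlocked a])
  print (resolve 1 [WriteBlocked a])
  print (resolve 8 [ReadBlocked, ReadBlocked])
```

The three calls exercise the artificial-deadlock case (grow the smallest full queue), the capacity-limit case, and the real-deadlock case in turn.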
Towards SciMod/SDMSWE/Kepler/…
(my vote is for ‘Kepler’…)
Scientific Workflows = Dataflow Process Networks + X
• Kepler = current Ptolemy-II plus X, where X = …
  – Extended type system (structural & semantic extensions)
  – Collection programming extensions (declarative/FP)
  – Rich user interactions/workflow steering
  – Rich data transformations (compute/transform alternations)
  – (Eco-)Grid extensions:
    • Actors as web/grid services
    • 3rd party data transfer, high-throughput data streaming
    • Data and service repositories, discovery
  – Data provenance
    • (semi-)automatic meta-data creation
  – What else???
• … minus upcoming Ptolemy-II extensions!
  – The slower we are, the less we have to do ourselves ;-)
Extended Type System (here: OWL Semantic Types)
SemType m1 ::
  Observation & itemMeasured.AbundanceCount & hasContext.appliesTo.LifeStageProperty
  → DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty
Substructure association:
XML raw-data =(X)Query=> object model =link=> OWL ontology
Actor Repositories (here: a commercial tool)
See why we said user-definable (or auto-generated) actor libraries?
Collection Programming (some lessons from SciDAC/SSDBM demo)
Promoter Identification Workflow in Ptolemy-II (SSDBM’03)
[Figure: screenshot of the workflow, with annotations:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit (appears twice)
• hand-crafted Web-service actor
• Complex backward control-flow
• No data transformations available
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                -- start with some gene-id
d1 = genBankG d0            -- get its gene sequence from GenBank
d2 = blast d1               -- BLAST to get a list of potential promoters
d3 = map genBankP d2        -- get list of promoter sequences
d4 = map promoterRegion d3  -- compute list of promoter regions and ...
d5 = map transfac d4        -- ... get transcription factor binding sites
d6 = zip d2 d4              -- create list of pairs promoter-id/region
d7 = map gpr2str d6         -- pretty print into a list of strings
d8 = concat d7              -- concat into a single "file"
d9 = putStr d8              -- output that file
Simplified Process Network PIW
• Back to purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure annotations: map(f)-style iterators; Powerful type checking; Generic, declarative “programming” constructs; Generic data transformation actors; Forward-only, abstractable sub-workflow piw(GeneId)]
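The step from piw(GeneId) to PIW := map(piw) can be made concrete with a small runnable sketch. The service functions below reuse the names from the FP specification, but their bodies are hypothetical stubs that only build strings. The newtype constructors other than `Gid`, and the name `piwAll`, are likewise assumptions made for the example.

```haskell
newtype GeneId         = Gid String         deriving (Eq, Show)
newtype GeneSeq        = GeneSeq String     deriving (Eq, Show)
newtype PromoterId     = Pid String         deriving (Eq, Show)
newtype PromoterSeq    = PromoterSeq String deriving (Eq, Show)
newtype PromoterRegion = Region String      deriving (Eq, Show)

-- Stub services: hypothetical stand-ins for the remote calls.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "/p1"), Pid (s ++ "/p2")]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = Region (take 12 s)

-- Forward-only sub-workflow for a single gene.
piw :: GeneId -> [(PromoterId, PromoterRegion)]
piw gene = zip promoters regions
  where
    promoters = blast (genBankG gene)
    regions   = map (promoterRegion . genBankP) promoters

-- Lifting the sub-workflow to a collection of genes is just map.
piwAll :: [GeneId] -> [[(PromoterId, PromoterRegion)]]
piwAll = map piw

main :: IO ()
main = mapM_ print (piwAll [Gid "7", Gid "8"])
```

Because `piw` has no backward control flow, each gene can be processed independently, which is where the "free concurrent execution" comes from.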
Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Details:
  – Technical report & PIW specification in Haskell
[Figure annotations: map(f o g) instead of map(f) o map(g); Combination of map and zip]
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
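The rewrite rule map(f o g) = map(f) o map(g) can be tried out on a toy example. In this Haskell sketch, `normalize` and `score` are hypothetical per-item steps. The second pair of functions shows one possible map/zip combination, which is an assumption about what the slide's annotation refers to.

```haskell
-- Two hypothetical per-item analysis steps.
normalize :: Int -> Int
normalize x = x - 10

score :: Int -> Int
score x = x * x

-- Unoptimized: two traversals with an intermediate list in between.
twoPasses :: [Int] -> [Int]
twoPasses = map score . map normalize

-- Rewritten: a single traversal applying the composed function.
onePass :: [Int] -> [Int]
onePass = map (score . normalize)

-- A map/zip combination and its single-traversal form.
paired, pairedFused :: [Int] -> [(Int, Int)]
paired xs      = zip xs (map score xs)
pairedFused xs = map (\x -> (x, score x)) xs

main :: IO ()
main = do
  let input = [8 .. 13]
  print (twoPasses input)                    -- [4,1,0,1,4,9]
  print (twoPasses input == onePass input)   -- True
  print (paired input == pairedFused input)  -- True
```

The rewriting is safe only because both steps are pure functions, which is why it needs a referentially transparent process.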
Optimization by Declarative Rewriting II
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
• Rewritings require that data transformation semantics is known
• e.g., Haskell-like for FP and SQL (XQuery)-like for (XML) database querying
Data Transformation Actors: Chaining together web services is easy … (NOT)
MAP: Data Massaging a la Blue-Titan/Perl
Data Transformation Actors: Our Approach (proposal)
• Manual
  – XQuery, XSLT, Perl, Python, … transformation actor (development)
• (Semi-)automatic
  – Semantic-type guided transformation generation (research)
• Also: Web Service Composition is …
  – … a hot topic
  – … a reincarnation of many “old” ideas (e.g., AI-style planning born-again; functional composition; query composition; …)
  – … a separate topic
User Interaction
• Browser Actor demo … (Ilkay)
F I N (addtl. material follows)
FYI: Flow-based programming has been re-discovered/re-invented several times:
– Flow-based Programming, http://www.jpaulmorrison.com/fbp/index.shtm