Thomas Steinke
Zuse Institute Berlin (ZIB) <www.zib.de>
Activities of the COST D37 GridChem Computational Chemistry Workflow Group
EGEE'07 Conference
Budapest
01.10.2007
2
Partners in the CCWF Working Group
• Berlin
• Manno
• Erlangen
• London
• Sevilla
• Zürich
• Cambridge
• København
Thomas Steinke, Tim Clark (DE)
Hans-Peter Lüthi, Martin Brändle (CH)
Peter Murray-Rust, Henry Rzepa (UK)
Antonio Márquez (ES)
Kurt Mikkelsen (DK)
- CSCS (Manno, CH)
- ZIB (Berlin, DE)
3
“Traditional” Workflow in Computational Chemistry
Workflows have a long tradition in the CC domain.
start
→ knowledge base (DB search)
→ automated/manually edited molecular structures
→ molecular simulations: method / program A, method / program B, …
→ properties
→ primary visualization / quality control
→ analysis / archival / DB storage
→ new insights?
in the 80’s – 90’s
4
Databases: Computational protocol (T. Clark, 1998)
Complete protocol runs automatically with less than 0.5% failure rate.
Cleanup → 2D-to-3D conversion → VAMP optimization → Calculate properties
~3,000 compounds per processor day (3 GHz Xeon)
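A back-of-the-envelope reading of these figures, as a sketch only. The throughput and failure rate are the slide's numbers; the library size of 250,000 compounds is a hypothetical value chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def protocol_budget(n_compounds, per_processor_day=3000, failure_rate=0.005):
    """Estimate the cost of running the protocol on n_compounds.

    The defaults are the slide's figures (~3,000 compounds per processor
    day, less than 0.5 % failures), so the failure count is an upper bound.
    """
    seconds_per_compound = SECONDS_PER_DAY / per_processor_day
    processor_days = n_compounds / per_processor_day
    max_failures = n_compounds * failure_rate
    return seconds_per_compound, processor_days, max_failures


# Hypothetical library size, not a number from the talk.
sec, days, failures = protocol_budget(250_000)
print(f"{sec:.1f} s per compound, {days:.1f} processor days, "
      f"fewer than {failures:.0f} failed compounds")
```

The point of the estimate: at this throughput a full database pass is a matter of processor days, so the bottleneck moves from compute time to automation and failure handling.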
B. Beck, A. Horn, J. E. Carpenter, and T. Clark, "Enhanced 3D-Databases: A Fully Electrostatic Database of AM1-Optimized Structures", J. Chem. Inf. Comput. Sci. 1998, 38, 1214-1217.
source: Tim Clark, Uni Erlangen
5
Distributed Computing Environment in the 90’s
QM packages
6
Distributed Computing Environment in the 90’s
Example: UniChem
distributed environment for quantum-chemical simulations
Cray Research Inc. 1991-(2004)
7
CCWF Chemical Illustrator Applications
Molecular design of functionalised enzynes
Hans-Peter Lüthi, Martin Brändle, Zürich; Peter Murray-Rust, Cambridge; Henry Rzepa, London
Quantum chemical based QSAR/QSPR
Tim Clark, Erlangen; Jon Essex, Southampton
High-order dynamic and static electrostatic molecular properties
Kurt Mikkelsen, Copenhagen
Computational heterogeneous catalysis
Antonio M. Márquez Cruz, Javier Fdez. Sanz, Sevilla
8
Molecular Design Workflow (Enzyne Design)
Steps:
Generation and archiving of data
Extraction: XPath queries
Statistical analysis
Diagram components: DB, QC Input, QC Output, Input, Output, Parser, Statistical Analysis, XML, XPath Query, XSLT, QC Application
source: Hans-Peter Lüthi, ETH Zürich
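The extraction step can be sketched with Python's standard library, which supports a subset of XPath. This is an illustration under assumptions, not the group's implementation: the XML element and attribute names and all numerical values below are invented and follow no particular schema.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical parser output; names and values are invented for this sketch.
QC_OUTPUT_XML = """
<results>
  <molecule id="m1">
    <property name="total_energy" units="hartree">-76.0267</property>
    <property name="dipole" units="debye">1.85</property>
  </molecule>
  <molecule id="m2">
    <property name="total_energy" units="hartree">-40.2140</property>
    <property name="dipole" units="debye">0.00</property>
  </molecule>
  <molecule id="m3">
    <property name="dipole" units="debye">1.47</property>
  </molecule>
</results>
"""


def extract_property(xml_text, name):
    """Return {molecule id: value} for one named property."""
    root = ET.fromstring(xml_text)
    values = {}
    for molecule in root.findall("./molecule"):
        # Attribute predicates are part of ElementTree's XPath subset.
        node = molecule.find(f"./property[@name='{name}']")
        if node is not None:
            values[molecule.get("id")] = float(node.text)
    return values


dipoles = extract_property(QC_OUTPUT_XML, "dipole")
energies = extract_property(QC_OUTPUT_XML, "total_energy")
print(dipoles)
print("mean dipole / debye:", statistics.mean(list(dipoles.values())))
```

Molecules that lack the requested property (here `m3` has no energy) are skipped rather than treated as errors, which keeps the statistical step robust against incomplete runs.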
9
Quantum Chemical Based QSAR and QSPR
2D-Database
→ 2D-to-3D (conformations, tautomers): generate structures, conformations and protonation states
→ VAMP: semiempirical MO geometry optimization and electron density
→ ParaSurf: generate isodensity surfaces, spherical-harmonic fits and local properties
→ QSPR: apply models
Virtual Screening
ADME/Tox.
Pharmacokinetics
Molecular Info
Materials Design
Multiscale Modeling
Property Optimization
source: Tim Clark, Uni Erlangen
10
Properties: Free Energies of Hydration
[Scatter plot: calculated vs. experimental Gsolv(H2O), both axes in kcal mol-1, range -14 to 4]
N = 362
MUE = 0.85 kcal mol-1
RMSD = 1.09 kcal mol-1
r2 = 0.88
q2 = 0.83
source: Tim Clark, Uni Erlangen
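For reference, the error measures quoted on this slide can be computed as follows. This is a sketch under assumptions: the slide does not define r2, so it is taken here as the squared Pearson correlation; q2 (cross-validated) is not computed because it requires refitting the model; the five data points are made up and are not the N = 362 data set.

```python
import math


def regression_stats(experimental, calculated):
    """Return (MUE, RMSD, r2) for paired experimental/calculated values."""
    n = len(experimental)
    errors = [c - e for e, c in zip(experimental, calculated)]
    mue = sum(abs(d) for d in errors) / n               # mean unsigned error
    rmsd = math.sqrt(sum(d * d for d in errors) / n)    # root-mean-square deviation
    mean_e = sum(experimental) / n
    mean_c = sum(calculated) / n
    cov = sum((e - mean_e) * (c - mean_c)
              for e, c in zip(experimental, calculated))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    r2 = cov * cov / (var_e * var_c)                    # squared Pearson r (assumption)
    return mue, rmsd, r2


# Made-up values in kcal/mol, for illustration only.
experimental = [-6.0, -4.0, -2.0, 0.0, 2.0]
calculated = [-5.0, -4.5, -2.0, 0.5, 1.0]
mue, rmsd, r2 = regression_stats(experimental, calculated)
print(f"MUE = {mue:.2f}, RMSD = {rmsd:.2f}, r2 = {r2:.2f}")
```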
11
Computing the NCI database (P. Murray-Rust, ’05)
MOPAC PM5
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
Workflow built with Taverna
12
Times to run jobs
[Plot: job run time / s (axis 0 to 120,000) against (n basis functions)^4 (axis 0 to 1.E+09)]
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
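Plotting run time against the fourth power of the number of basis functions suggests a model t ≈ k·n^4. A minimal sketch of fitting and using such a model follows; the timings are hypothetical and were generated to lie exactly on the curve, so they are not data from the plot.

```python
def fit_quartic_scaling(n_basis, times_s):
    """Least-squares fit of t = k * n**4 through the origin; returns k."""
    x = [n ** 4 for n in n_basis]
    return sum(xi * ti for xi, ti in zip(x, times_s)) / sum(xi * xi for xi in x)


def predict_time(k, n):
    """Predicted run time in seconds for n basis functions."""
    return k * n ** 4


# Hypothetical timings (seconds), not measurements from the talk.
n_basis = [50, 100, 150]
times_s = [625.0, 10_000.0, 50_625.0]

k = fit_quartic_scaling(n_basis, times_s)
print(f"k = {k:.3e} s per (basis function)^4")
print(f"predicted time for n = 170: {predict_time(k, 170):.0f} s")
```

Such an estimate is useful for scheduling: it tells the workflow system roughly how much wall time to request before a job is submitted.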
13
Protocol
Diagram nodes: Log Files, Parse, System Crashes, Science Errors, Analysis, Pathological Behaviour, Statistics, Other Science, Disseminate Results, Unsuitable Data, Program Crashes, Inform Developer
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
14
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
Conclusions from NCI “Experiment” (2005)
Protocols can be automated
Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider
Computational programs can provide high quality “experimental” molecular properties
15
Motivation
The orchestration of complex workflow scenarios is on today’s agenda.
complex scientific solution paths linking in-house and (commercial) legacy codes
Transformation of scientific ventures into a scientifically validated protocol,
allowing highly (semi-)automated data generation (pre-processing) and data processing steps.
16
Goals of the CCWF Working Group
implementation of workflow environments for QC by adapting standard (Grid) technologies
fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure interoperability between application programs and to support efficient access to chemical information based on a CC ontology
implementation of computational chemistry illustrator scenarios to demonstrate the applicability of our approach
17
Generic Workflow
1. Automatic generation + validation of input data
2. Submission, monitoring, and gathering of output data of simulation jobs
3. Integration of results (primary data) into project database
4. Data mining and visualization techniques to reduce complexity
5. Knowledge generation by applying methods of statistical analysis and pattern recognition.
6. On-line publication and archiving of valuable scientific data.
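The six steps can be read as a chain of functions operating on shared state. The sketch below is only a skeleton under assumptions: every function body is a placeholder, and the "energy" values are dummy numbers derived from the molecule name, not chemical results.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Campaign:
    """State handed from one workflow step to the next."""
    molecules: list
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    database: list = field(default_factory=list)
    report: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)


def generate_inputs(c):
    # Step 1: one input per molecule; "validation" only rejects blank names.
    c.inputs = {m: f"optimize {m}" for m in c.molecules if m.strip()}


def run_jobs(c):
    # Step 2: placeholder for submit/monitor/gather; energies are dummy values.
    c.outputs = {m: {"energy": -float(len(m))} for m in c.inputs}


def store_results(c):
    # Step 3: integrate primary data into the project database (a list here).
    c.database = [{"molecule": m, **data} for m, data in c.outputs.items()]


def reduce_data(c):
    # Step 4: reduce complexity, here by keeping a single column.
    c.report["energies"] = {row["molecule"]: row["energy"] for row in c.database}


def analyse(c):
    # Step 5: a trivial minimum search stands in for statistical analysis.
    energies = c.report["energies"]
    c.report["lowest"] = min(energies, key=energies.get)


def publish(c):
    # Step 6: serialise the reduced data for archiving or publication.
    c.report["archive"] = json.dumps(c.report["energies"], sort_keys=True)


STEPS = [generate_inputs, run_jobs, store_results, reduce_data, analyse, publish]


def run_workflow(molecules):
    campaign = Campaign(molecules=list(molecules))
    for step in STEPS:
        step(campaign)
        campaign.trace.append(step.__name__)  # record which steps completed
    return campaign


result = run_workflow(["water", "methane", " "])
print(result.trace)
print(result.report["lowest"], result.report["archive"])
```

Keeping each step a separate function with a recorded trace is what later allows a workflow system to restart a campaign at the step that failed.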
18
Challenges
Diversity: Molecular properties derived from state functions obtained with electronic-structure methods.
ab-initio, semi-empirical, DFT, approximate potentials
Gaussian, COLUMBUS, Dalton, Turbomole, MOPAC, Vamp, CPMD…
Data formats: How to implement seamless data export/import?
~80 relevant formats known in CC: XYZ, MDL, SDF, PDB, …
OpenBABEL
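To make the import/export problem concrete, here is a hand-written reader and writer for XYZ, one of the simplest of these formats (atom count, comment line, then one "symbol x y z" line per atom). In practice a converter such as Open Babel would be used; this sketch does not use its API, and the water geometry is approximate and only illustrative.

```python
def parse_xyz(text):
    """Parse XYZ text into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1] if len(lines) > 1 else ""
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


def write_xyz(comment, atoms):
    """Serialise atoms back to XYZ text with six decimals per coordinate."""
    rows = [str(len(atoms)), comment]
    rows += [f"{s:<2} {x:12.6f} {y:12.6f} {z:12.6f}" for s, x, y, z in atoms]
    return "\n".join(rows) + "\n"


# Approximate water geometry in Angstrom, for illustration only.
water = [("O", 0.0, 0.0, 0.117),
         ("H", 0.0, 0.757, -0.469),
         ("H", 0.0, -0.757, -0.469)]

text = write_xyz("water", water)
comment, atoms = parse_xyz(text)
print(text)
```

Even this trivial format loses information (charge, multiplicity, bonds), which is why a richer, extensible representation is needed between workflow steps.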
19
Challenges (cont.)
Scaling, Robustness, Load Balancing: I can handle O(10) jobs by hand, but what about campaigns of O(1000) jobs?
→ workflow system
computational resources → distributed computing
persistence, automated failure recovery, …
long simulation times, sometimes unpredictable
Acceptance: ease of use, GUI + CLI
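What automated failure recovery means for a campaign of O(1000) jobs can be sketched with a thread pool and a retry loop. Everything here is hypothetical: `flaky_job` stands in for a real simulation and fails by a fixed rule so the example is reproducible, and persistence across restarts of the workflow system itself is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ATTEMPTS = 3


def flaky_job(job_id, attempt):
    """Hypothetical stand-in for a simulation job with scripted failures."""
    if job_id % 10 == 0 and attempt == 1:
        raise RuntimeError("node failure")        # transient: a retry succeeds
    if job_id % 250 == 7:
        raise ValueError("SCF not converged")     # persistent: retries do not help
    return job_id * 2


def run_with_recovery(job_id):
    """Run one job, retrying up to MAX_ATTEMPTS times."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return job_id, "done", flaky_job(job_id, attempt), attempt
        except Exception as exc:                  # record and retry
            last_error = exc
    return job_id, "failed", repr(last_error), MAX_ATTEMPTS


def run_campaign(job_ids, workers=8):
    """Run all jobs concurrently; results keep the order of job_ids."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_with_recovery, job_ids))


results = run_campaign(range(1000))
failed = [r for r in results if r[1] == "failed"]
retried = [r for r in results if r[1] == "done" and r[3] > 1]
print(len(results), "jobs,", len(retried), "recovered after retry,",
      len(failed), "failed permanently")
```

Separating transient from persistent failures matters: the first kind should be retried silently, the second should be reported to a human.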
20
What I Want…
ease of use: workflow orchestration, usage, installation / maintenance
sharing of workflow descriptions with my colleagues: standard languages
support in a heterogeneous environment: laptop – server – cluster – supercomputer – grid
21
Which Workflow System?
… to be spoilt for choice?
22
Some Assessment Criteria
workflows in distributed systems
recovery / backup
quality of the documentation
required installation effort
robustness, stability
restart/stop/debugging
status & exception handling
GUI
supported batch systems: PBS (, LSF)
customizability
Web interface
Grid environment
user/installation base
legacy codes and Web services
support for managing large files
PKI / security
WF language
open source
project development activity
23
TRIANA Experiences (2005/06)
workflow orchestration
integration of web services
semantic check of WSDL files
support for self-written Triana modules
negligible control logic overhead
pre-requisite for migration to Grid environments
- proprietary workflow description language in TRIANA (BPEL is announced)
- GUI robustness for very complex workflow definitions
24
GWES Experiences (MediGRID, since 2006)
integration of web services and legacy codes
monitoring + debugging support
Grid environments
under active development (A. Hoheisel et al./FhG FIRST)
- workflow orchestration (WF GUI builder in preparation)
- proprietary workflow description language
25
26
OMII Server: Attractive Features
Workflows
• language: BPEL (Active BPEL)
• WF editor (Eclipse)
• Web Services customization
Jobs
• submission & monitoring via WS
• job manager API
• persistent (job recovery), in-memory (via Hibernate)
Distributed Resource Management (DRM)
• Condor-G, Globus GRAM
• SSH-exec
• your own plug-ins, e.g. PBS
Data
• GridSAM file staging support within job (JSDL): file stage in/out
• Apache Virtual File System library (vfs): FTP, local files, http, https, sftp; zip, jar, tar, bzip2, gzip; ram - data in memory
• GridFTP
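File staging within a JSDL job description can be illustrated by generating a minimal document with the standard library. This is a sketch under assumptions: the element names and namespaces follow the JSDL 1.0 specification as far as recalled here, the executable path, file names and URLs are hypothetical, and whether a given GridSAM deployment accepts exactly this document is not something the slide states.

```python
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
ET.register_namespace("jsdl", JSDL)
ET.register_namespace("jsdl-posix", POSIX)


def q(namespace, tag):
    """Qualified tag name in ElementTree's {namespace}tag notation."""
    return f"{{{namespace}}}{tag}"


def add_staging(description, filename, direction, uri):
    """Append one DataStaging element; direction is 'Source' or 'Target'."""
    staging = ET.SubElement(description, q(JSDL, "DataStaging"))
    ET.SubElement(staging, q(JSDL, "FileName")).text = filename
    ET.SubElement(staging, q(JSDL, "CreationFlag")).text = "overwrite"
    endpoint = ET.SubElement(staging, q(JSDL, direction))
    ET.SubElement(endpoint, q(JSDL, "URI")).text = uri


def build_job(executable, arguments, stage_in, stage_out):
    """Build a JobDefinition with a POSIX application and file staging."""
    job = ET.Element(q(JSDL, "JobDefinition"))
    description = ET.SubElement(job, q(JSDL, "JobDescription"))
    application = ET.SubElement(description, q(JSDL, "Application"))
    posix = ET.SubElement(application, q(POSIX, "POSIXApplication"))
    ET.SubElement(posix, q(POSIX, "Executable")).text = executable
    for argument in arguments:
        ET.SubElement(posix, q(POSIX, "Argument")).text = argument
    for filename, uri in stage_in.items():
        add_staging(description, filename, "Source", uri)   # stage in
    for filename, uri in stage_out.items():
        add_staging(description, filename, "Target", uri)   # stage out
    return job


# All paths and URLs below are hypothetical.
job = build_job(
    executable="/usr/local/bin/qc_program",
    arguments=["molecule.inp"],
    stage_in={"molecule.inp": "ftp://data.example.org/in/molecule.inp"},
    stage_out={"molecule.out": "ftp://data.example.org/out/molecule.out"},
)
jsdl_text = ET.tostring(job, encoding="unicode")
print(jsdl_text)
```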
27
OMII/Active BPEL Experiences (3 months)
workflow orchestration (Eclipse plugin)
standardized WF language
monitoring support
Grid environments
security features: https + signed messages (X.509 cert.)
active development (UK eScience)
- deployment requires manual workarounds
- learning barrier (BPEL)
- BPEL editor not fully mature (validation of BPEL workflows)
28
Summary
there are a couple of workflow systems available
design/development of workflow systems is still ongoing research
not yet decided for our working group
barriers: ease of use vs. robustness
middleware stack: more complicated Grid environments vs. script-based approaches on clusters
standards vs. proprietary but powerful/sufficient WF languages
BPEL has a high chance to survive
29
Acknowledgement
Core members of D37 CCWF working group
Hans-Peter Lüthi, ETH Zurich
Tim Clark, CCC Uni Erlangen
J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Uni Cambridge/Unilever Inst.
developers of workflow systems mentioned in this talk
30
QUESTIONS?