Scientific Workflows: what do we have, what do we miss?

Post on 28-Jan-2015

DESCRIPTION

Presentation given on June 22, 2013, in Nice, at the CIBB 2013 International Workshop. In collaboration with Paolo Missier, University of Newcastle upon Tyne, UK

TRANSCRIPT

  • 1. Scientific Workflows: what do we have, what do we miss? Paolo Romano IRCCS AOU San Martino IST, Genova, Italy (paolo.dm.romano@gmail.com, skype: p.romano)

  • 2. Talk outline
      o Aims of data integration in Life Sciences
      o A methodology for the automation of data retrieval and analysis processes
      o Workflow Management Systems
      o Issues related to: automatic composition, execution performance, workflow reuse
    22 June 2013, Scientific Workflows: what do we miss?

  • 3. Biomedical databases
      o Accessible on-line by means of human-centered interfaces
      o Don't share interface, data contents and structure, or encoding
      o Don't interoperate
      o Oblige researchers to cut & paste data
      o May have huge size

  • 4. Some figures: DB size
      o European Nucleotide Archive: 195,241,608 sequences, 292,078,866,691 bases
      o UniProtKB: 12,347,303 sequences, 3,974,018,240 AAs
      o PRIDE: 111,219,191 spectra
      o IntAct: 229,082 interactions
      o ArrayExpress: ~16,000 experiments, ~450,000 hybridizations
      o Next-Generation Sequencing: 16 Gb / experiment!

  • 6. 1000 Genomes Project
      o An international collaboration aimed at building a detailed map of human genome variability.
      o Pilot phase: identification of 95% of variations present in at least 1% of population for three ethnic groups (Oct 28, 2010).
      o Data: ~4.9 Tbases (~3 Gbases/individual)
      o Found: 15M mutations, 1M deletions/insertions, 20K major variants
      o The 1000 Genomes Consortium. A map of human genome variation from population scale sequencing. Published online in Nature on 28 October 2010. DOI:10.1038/nature09534
      o http://www.1000genomes.org/

  • 7. 1000 Genomes Project
      o Impossible without bioinformatics
      o Unmanageable without automation of processes

  • 8. Data integration: aims
    Data integration and automation of retrieval and analysis processes are needed for:
      o Achieving a precise and comprehensive vision of available information
      o Carrying out queries and analyses involving many databases and software tools automatically
      o Carrying out analyses of huge data quantities efficiently
      o Implementing effective data mining

  • 9. What is a Workflow
    "A computerized facilitation or automation of a business process, in whole or part" (Workflow Management Coalition)
    Aim: implementing data analysis processes in standardized environments
    Main advantages:
      o efficiency: being automatic procedures, workflows free researchers from repetitive tasks and support good practices,
      o reproducibility: analyses may be replicated over time, easily and effectively,
      o reuse: both intermediate results and workflows may be reused,
      o traceability: the workflow is enacted in an environment that allows tracing back results.

  • 10. An experiment
    Prediction of the structure of a protein by homology

  • 11. Manual
    Researchers carrying out the analysis need to know:
      o Which tools and DBs are needed, where they reside, and how to use them
      o In which order they must be used
      o How to transfer data between them
      o How to reconcile semantics of data used by services

  • 12. Automatic
    In an automated procedure, software must:
      o Know which tool/DB is able to carry out a given task (e.g. aligning sequences, retrieving protein structure data)
      o Find real implementations (e.g. BLAST, provided by NCBI)
      o Link services into a workflow that achieves the desired task
      o Transfer data appropriately between services

  • 13. Workflow for CABRI Network Services

  • 14. Methodology: components
      o Define XML languages with controlled vocabularies
      o Archive data in XML formats
      o Make use of Web Services for data exchange between services
      o Associate data and analyses with proper items of an ontology of bioinformatics data, data types, and tasks
      o Encode processes as workflows

  • 15. Workflow Management Systems
    Both industrial and academic WfMS are available, and their use in Life Sciences is now widespread.
      o Biopipe, an add-on for bioperl
      o GPipe, an extension of Pise
      o Taverna (EBI), a component of the myGrid platform
      o Pegasys (University of British Columbia)
      o EGene (Universidade de São Paulo)
      o Wildfire (Bioinformatics Institute, Singapore)
      o Pipeline Pilot (SciTegic)
      o BioWBI, Bioinformatic Workflow Builder Interface (IBM)

  • 16. Workflow Management Systems: various software types and different standards
    Software | Type | Standard | License | URL
    Taverna Workbench | Stand-alone | XScufl | Open source | http://taverna.sourceforge.net/
    Biopipe | Software library | Pipeline XML | Open source | http://www.gmod.org/biopipe/
    ProGenGrid | Stand-alone | NA | NA | http://datadog.unile.it/progen
    DiscoveryNet | Stand-alone | DPML | Commercial | http://www.discovery-on-the.net/
    Kepler | Stand-alone | MoML | Open source | http://kepler-project.org/
    GPipe | Web interface, local services | GPipe XML | Open source | http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
    EGene | Stand-alone | NA | Open source | http://www.lbm.fmvz.usp.br/egene/
    BioWMS | Web interface, remote services | XPDL | Public use | http://litbio.unicam.it:8080/biowms/
    BioWEP | Portal | XScufl, XPDL | Open source | http://bioinformatics.istge.it/biowep/
    BioWBI | Web interface, local services | Proprietary | Commercial | http://www.alphaworks.ibm.com/tech/biowbi
    Pegasys | Stand-alone | Pegasys DAG | Open source | http://bioinformatics.ubc.ca/pegasys/
    Wildfire | Stand-alone | GEL | Open source | http://wildfire.bii.a-star.edu.sg/wildfire/
    Triana | Stand-alone | Triana Workflow Language | Open source | http://www.trianacode.org/
    Pipeline Pilot | Stand-alone | Proprietary | Commercial | http://www.scitegic.com/
    FreeFluo | Software library | WSFL and XScufl | Open source | http://freefluo.sourceforge.net/
    Biomake | Software library | NA | Open source | http://skam.sourceforge.net/

  • 17. Taverna Workbench
      o Taverna Workbench is the best known and most widely adopted WfMS in life sciences
      o Developed in the context of the myGrid platform
      o Main developers: University of Manchester and EBI
      o Open source at SourceForge.net
    It allows users to build and execute workflows for complex analyses by:
      o getting access to remote and local services
      o displaying results in various formats
      o describing data through an ad-hoc ontology
    Requirements: Java plus Windows / Mac / Linux
    Open source: http://taverna.sourceforge.net/
    Current version: 2.4

  • 18. WfMS: some current issues
    WfMS are increasingly used for data integration and analysis in biomedical research. Here, we highlight some of the current issues:
      o Automatic composition of workflows
      o Performance
      o Reproducibility and reuse

  • 19. Automatic composition
      o Researchers only care about scientific results!
      o Building workflows may be a burden
      o Various skills are required, and GUIs do not solve the problem
      o Workflow composition should be much simpler, and become semi-automatic

  • 20. Automatic composition
      o Automatic selection of best services
      o Automated service identification and composition
      o Adapters for different data formats
      o Automatic conversion of formats
      o Ontology of methods, tools and data types
      o Integration with repositories
      o Controlled Language Interface

  • 21. Automatic composition
      o A trade-off is required between rich semantic annotations and design complexity.
      o Semantic-based solutions are available for controlled sets of services.

  • 22. Beyond Taverna
    The myGrid team developed tools for identifying services and supporting reuse of workflows:
      o BioCatalogue: annotated catalogue of Web Services for Life Science
      o myExperiment: repository of workflows for Life Science, enabled by social networking features

  • 23. EDAM Ontology
    EDAM (EMBRACE Data and Methods) Ontology
    Allows defining all:
      o Data analysis tasks for bioinformatics
      o Data types
      o Possible relations between tasks and data types (I/O)
      o Transformations between equivalent data (format)
      o Transformations between related data (through elaboration, e.g.: triplet → AA, gene symbol → sequence)
    Fundamental in order to:
      o Validate data flow and elaborations
      o Support automatic workflow composition

  • 24. EDAM (EMBRACE Data an
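The EDAM slide names two uses of an ontology of tasks and data types: validating data flow and supporting automatic workflow composition. The Python sketch below illustrates both ideas on a toy scale. It is not taken from the talk and does not use EDAM or Taverna: the `Task` class, the three example tasks, and the data type labels are all hypothetical, chosen to echo the homology-modelling experiment mentioned earlier. Composition is done here with a plain breadth-first search over data types, which is only one possible approach.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A workflow step annotated with the data type it consumes and produces."""
    name: str
    input_type: str
    output_type: str


# Hypothetical annotations for illustration; these are NOT real EDAM identifiers.
TRANSLATE = Task("translate", "nucleotide_sequence", "protein_sequence")
ALIGN = Task("align", "protein_sequence", "sequence_alignment")
MODEL = Task("build_model", "sequence_alignment", "protein_structure")


def validate_data_flow(tasks):
    """Return the type mismatches between consecutive tasks (empty list if valid)."""
    problems = []
    for producer, consumer in zip(tasks, tasks[1:]):
        if producer.output_type != consumer.input_type:
            problems.append(
                f"{producer.name} -> {consumer.name}: "
                f"{producer.output_type} != {consumer.input_type}"
            )
    return problems


def compose(available, start_type, goal_type):
    """Search breadth-first for a chain of tasks turning start_type into goal_type.

    Returns the list of tasks in execution order, or None if no chain exists.
    """
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, path = queue.popleft()
        if current == goal_type:
            return path
        for task in available:
            if task.input_type == current and task.output_type not in seen:
                seen.add(task.output_type)
                queue.append((task.output_type, path + [task]))
    return None


if __name__ == "__main__":
    chain = compose([MODEL, ALIGN, TRANSLATE], "nucleotide_sequence", "protein_structure")
    print([task.name for task in chain])
    print(validate_data_flow(chain))
    print(validate_data_flow([TRANSLATE, MODEL]))
```

In this toy setting each task has exactly one input and one output type, so composition reduces to path finding. Real services have several ports, format variants, and quality differences, which is where the trade-off between rich semantic annotations and design complexity noted on the automatic composition slides comes in.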
