Doing it again: Workflows and Ontologies Supporting Science

Phillip Lord, Frank Gibson
Newcastle University

Post on 03-Jan-2016

Outline

• Describe the background problem

• Introduce distributed services, workflows, eScience and (a bit of) ontologies.

• CARMEN

• Provenance

• Can we repeat an experiment?

Data-intensive bioinformatics

ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
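Entries like the one above are line-oriented: each line opens with a two-letter code (ID, DE, OS, KW, SQ and so on) and sequence lines carry no code. The sketch below groups the lines of one entry by code. It is illustrative only: the function name `parse_flat_file` is hypothetical, the embedded entry is a shortened copy of the slide's example, and the column layout (code, three spaces, content) is an assumption, since the transcript lost the original whitespace.

```python
from collections import defaultdict

# Shortened copy of the slide's entry; column spacing is assumed.
ENTRY = """\
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
"""


def parse_flat_file(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(content)
        elif content:
            # Lines with a blank code hold the sequence in blocks of ten.
            sequence.append(content.replace(" ", ""))
    return dict(fields), "".join(sequence)


fields, sequence = parse_flat_file(ENTRY)
keywords = fields["KW"][0].rstrip(".").split("; ")
```

The point of the sketch is that the structure is simple enough for many small tools to be built over it, which is the property the later slides contrast with neuroscience data.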

Around the world in 80 days

• Biology is still largely a cottage industry

• On a global stage

Websites everywhere

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

WBS Workflows:

[Workflow diagram; the layout was lost in extraction, so services are listed with the output labels that describe them]

Input: GenBank Accession No, giving a GenBank Entry

Nucleotide sequence services:
• Seqret: nucleotide seq (Fasta)
• RepeatMasker: repetitive elements
• GenScan: coding sequence
• sixpack: 6 ORFs
• prettyseq: translation/sequence file. Good for records and publications
• restrict: restriction enzyme map
• cpgreport: CpG island locations and %
• ncbiBlastWrapper: Blastn vs nr, est databases
• transeq: amino acid translation

Peptide sequence services:
• epestfind: identifies PEST seq
• pepcoil: predicts coiled-coil regions
• pepstats: MW, length, charge, pI, etc
• pscan: identifies FingerPRINTS
• SignalP, TargetP, PSORTII: predict cellular location
• InterPro, PFAM, Prosite, Smart: identify functional and structural domains/motifs
• Pepwindow? Octanol?: hydrophobic regions
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr

Other labels: START; ORFs; URL inc GB identifier; Query nucleotide sequence; Sort for appropriate sequences only

Legend:
• Pink: outputs/inputs of a service
• Purple: tailor-made services
• Green: Emboss soaplab services
• Yellow: Manchester soaplab services
• Grey: unknowns

myGrid is an EPSRC funded UK eScience Program Pilot Project

Particular thanks to the other members of the Taverna project, http://taverna.sf.net

Web Services

Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web.

Web services are:
– a technology and standard for exposing code / databases with an API that can be consumed remotely by a third party.
– accompanied by a description of how to interact with them.

They are:
• Self-contained
• Self-describing
• Modular
• Platform independent

Workflows

• A workflow language specifies how bioinformatics processes fit together.

• The high-level workflow diagram is separated from any lower-level coding – you don’t have to be a coder to build workflows.

• A workflow is a kind of script or protocol that you configure when you run it.

• Easier to explain, share, relocate, reuse and repurpose.

• The METHODS section of a scientific publication
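The idea that a workflow is a description of how processes fit together, kept apart from the code of each process, can be sketched in a few lines. This is not Taverna's API or workflow language. The three local functions are hypothetical stand-ins for remote services, and `run_workflow` is an illustrative name. The workflow itself is plain data (a list of step names), so it can be shared or repurposed without touching the services.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalise(seq):
    """Stand-in service: strip layout spaces and upper-case the sequence."""
    return seq.replace(" ", "").upper()


def reverse_complement(seq):
    """Stand-in service: reverse-complement a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Stand-in service: fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Registry of available services (hypothetical names).
SERVICES = {
    "normalise": normalise,
    "reverse_complement": reverse_complement,
    "gc_fraction": gc_fraction,
}

# The workflow is data, not code: an ordered list of step names.
WORKFLOW = ["normalise", "reverse_complement", "gc_fraction"]


def run_workflow(steps, value, services=SERVICES):
    """Feed the output of each step into the next and keep a trace."""
    trace = []
    for name in steps:
        value = services[name](value)
        trace.append((name, value))
    return value, trace


result, trace = run_workflow(WORKFLOW, "acatttctac caacagtgga")
```

The trace of intermediate values is what replaces manual cutting and pasting between tools, and it is also the raw material for a methods section.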

The Taverna Workbench

http://taverna.sourceforge.net

http://www.mygrid.org.uk

Workflows

• Automating away cutting and pasting.

• Helps to deal with distribution of data.

• myGrid and Taverna built on the open nature of bioinformatics.

• Can we adapt the same approach to another discipline?

CARMEN

Code, Analysis, Repository and Modelling for e-Neuroscience

www.carmen.org.uk

Engineering and Physical Sciences Research Council

Consortium & Profile

Stirling

St. Andrews

Newcastle

York

Sheffield

Cambridge

Imperial

Plymouth

Warwick

Leicester

Manchester

• $10M over 4 years

• 20 Investigators

• Commenced 1st October 2006

Industry & Associates

Virtual Laboratory for Neurophysiology

• Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated

Potential Barriers

• Technical
– Multiple proprietary formats
– No standardised metadata
– Volume of data to be analysed

• Cultural
– Multiple communities acting independently
– Concerns about implications of sharing

Comparing to bioinformatics

• Cottage industry

• Global distribution

• Need to share

• But….

Age and Impact.

No sequences!

• DNA and Protein sequence form a core datatype for bioinformatics

• It’s simple to structure and to store, and it is of high-value

• Initially, there wasn’t much of it, and textual metadata was fine.

• Many people built tools over it, for transforming and manipulating.

The need for clear metadata

• Most neuroscience data is relatively simple in structure

• But often contextually complex

• Sometimes associated with behavioural features

Neuroscience spike data

• The raw data is just a waveform

• But what is the experiment for?

• What stimulus is the organism/tissue receiving?

• Even, which channel is which?

• The data sets being produced are (reasonably) large (10s of Gb, or 1Tb in three months)
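The questions above are exactly what a metadata record has to answer before a waveform becomes reusable. The sketch below is a minimal illustration, not CARMEN's metadata schema: the class `RecordingMetadata` and every field name are hypothetical, chosen to mirror the bullets (what the experiment is for, what stimulus was given, which channel is which).

```python
from dataclasses import dataclass, field


@dataclass
class RecordingMetadata:
    """Hypothetical context record attached to one raw recording."""
    purpose: str = ""        # what is the experiment for?
    stimulus: str = ""       # what is the organism/tissue receiving?
    preparation: str = ""    # which organism or tissue was recorded?
    channels: dict = field(default_factory=dict)  # channel number -> description

    def missing(self):
        """Names of the fields that have not been filled in yet."""
        return [name for name, value in vars(self).items() if not value]


# A partly described recording: the waveform exists, the context does not.
record = RecordingMetadata(purpose="illustrative example",
                           channels={0: "electrode A"})
unanswered = record.missing()
```

A repository could refuse or flag uploads for which `missing()` is non-empty, which is one way of raising metadata quality without inspecting the waveform itself.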

Data Sharing in bioinformatics

• Data Sharing was an early tradition in biology.

• Gene patenting, NDAs and the like came as quite a surprise

• Many political battles were fought, culminating in the Clinton/Blair statement

Data Sharing in Neurosciences

• The data is easy to structure, but the metadata is not
– There is, therefore, less point in sharing data

• Many neuroscientists come from a medical background
– this tends to be a more hierarchical, secretive profession, with everyone worried about getting sued.

• A lot of neuroscientists use invasive, live animal experiments
– security is more than a passing concern.

The difference in neuroscience

• Less data sharing tradition

• No rich ecosystem of tools

• Higher barrier to entry for metadata

• Larger datasets

Virtual Laboratory Node

[Architecture diagram] Components:
• Web Portals and Rich Clients
• Security
• Workflow Enactment Engine
• Registry
• Service Repository
• Data
• Metadata
• Compute Cluster on which Services are Dynamically Deployed

Functions labelled on the diagram:
• Search for Data & Analysis Code
• Raw Signal Data Search & Visualisation
• Deployment of Data & Analysis Code in Processes
• Raw & Derived Data File Store
• Security Policies Controlling Access to Data & Code
• Structured Metadata Store Enabling Search & Annotation
• Analysis & Model Code Store

CARMEN

Development Timeline

• Metadata (April 2008): structured metadata allowing data and analysis code to be described and searched
• Data and Scripting Support (April 2008): support for extended range of data formats and scripting languages
• Security (April 2008): security allowing access to data and analysis code to be controlled
• Provenance (July 2008): provenance of analysis and modelling processes leading to scientific results
• CARMEN v1.0 (October 2008): release of CARMEN v1.0; virtual laboratory nodes open to the CARMEN consortium
• CARMEN v2.0 (October 2009): release of CARMEN v2.0; virtual laboratory nodes “networked”

Virtual Laboratory Infrastructure

[Diagram: two virtual laboratory nodes, each comprising data, metadata, a compute cluster on which services are dynamically deployed, web portals, rich clients, security, a workflow enactment engine, a registry and a service repository]

Networked Nodes at Newcastle and York.

More planned …

Vision – Global Laboratory

[Diagram: six virtual laboratory nodes, each comprising data, metadata, a compute cluster on which services are dynamically deployed, web portals, rich clients, security, a workflow enactment engine, a registry and a service repository]

Some Unexpected Advantages

• Big problem with bioinformatics services

• Over time they tend to disappear

• CARMEN keeps services and data together

• This means we should be able to rerun analyses later.

• We should be able to store provenance
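Keeping services and data together is what makes a provenance record checkable later. The sketch below shows one possible shape for such a record, under stated assumptions: it is not CARMEN's provenance model, the function names `run_with_provenance` and `can_replay` and the record keys are hypothetical, and a "service" is reduced to a local function from bytes to bytes.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used to identify a piece of data."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(service_name, service_version, service, data: bytes):
    """Run a service and record what went in, what came out, and what ran."""
    output = service(data)
    record = {
        "service": service_name,
        "version": service_version,
        "input_sha256": digest(data),
        "output_sha256": digest(output),
    }
    return output, record


def can_replay(record, service, data: bytes) -> bool:
    """True if the stored service on the stored data reproduces the output."""
    return (digest(data) == record["input_sha256"]
            and digest(service(data)) == record["output_sha256"])


# bytes.upper stands in for an analysis service kept alongside the data.
output, record = run_with_provenance("upper", "1.0", bytes.upper, b"spike")
replayed = can_replay(record, bytes.upper, b"spike")
```

If the service had disappeared, as the slide notes often happens with bioinformatics services, the record could still be stored but `can_replay` could never be evaluated.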

What is Provenance?

What does it mean to rerun an experiment?

• Replicability: one scientist should be able to repeat another’s experiment, under equivalent conditions, at a different time.

• Rerunability: a scientist should be able to apply an equivalent technique under new circumstances.

• The addition of services into this mix complicates the issue.

[Diagram: Replicability and Rerunability set against Old Data and New Data]

[Diagram: Replicability and Rerunability set against Old Data / New Data and Old Services / New Services]

Questions labelled on the diagram:
• Is the specification of what happened actually right?
• Has the state of the world advanced since previously?
• Has the world changed, in a comparable way?
• Has the service changed in a comparable way?

Roles labelled on the diagram:
• Error-Prone Neuroscientist
• Eager Neuroscientist
• Neuroscientist comparing to existing work
• Tool Builder
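The grid of old/new data against old/new services can be read as a small decision procedure over two run descriptions. The sketch below assumes, from the definitions given earlier, that repeating an analysis on the same data counts as replication and applying it to new data counts as a rerun. The function name `compare_runs` and the dictionary keys are hypothetical, and the sketch does not try to assign the diagram's questions or roles to particular quadrants.

```python
def compare_runs(original, repeat):
    """Place a repeated analysis in the data-by-services grid."""
    new_data = original["input_sha256"] != repeat["input_sha256"]
    new_services = original["service_version"] != repeat["service_version"]
    # Assumption: same data means replication, new data means a rerun.
    kind = "rerunability" if new_data else "replicability"
    return {"kind": kind, "new_data": new_data, "new_services": new_services}


first = {"input_sha256": "aaa", "service_version": "1.0"}
same_data_new_service = {"input_sha256": "aaa", "service_version": "2.0"}
new_data_same_service = {"input_sha256": "bbb", "service_version": "1.0"}

replication = compare_runs(first, same_data_new_service)
rerun = compare_runs(first, new_data_same_service)
```

When `new_services` is true, any difference in results may come from the service rather than the science, which is why the service dimension complicates both kinds of repetition.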

There is a difficulty

• There is less tradition of data sharing

• The tendency to want to control data is much stronger

• If we want to data mine, we have to cope with “data is mine”

• If we have many different repositories, this needs to be supported computationally

[Diagram: six virtual laboratory nodes, each with its own data, metadata, compute cluster, web portals, rich clients, security, workflow enactment engine, registry and service repository]

An Example: Licensing

• Computationally amenable licenses are available

• Take, for example, Creative Commons
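Creative Commons licences are computationally amenable because each one is a combination of a few standard elements: BY (attribution), NC (non-commercial), ND (no derivatives) and SA (share-alike). The sketch below checks a proposed use against a licence code. It is a simplification and not legal advice: the function names `parse_cc` and `permits` are hypothetical, only codes of the form "CC BY-NC-SA" are handled (not CC0), and version and jurisdiction are ignored.

```python
def parse_cc(code):
    """Split a code such as 'CC BY-NC-SA' into its set of licence elements."""
    parts = code.upper().split()
    if len(parts) < 2 or parts[0] != "CC":
        raise ValueError(f"not a recognised Creative Commons code: {code!r}")
    return set(parts[1].split("-"))


def permits(code, commercial=False, derivative=False):
    """Check a proposed use against the NC and ND elements of a licence."""
    elements = parse_cc(code)
    if commercial and "NC" in elements:
        return False
    if derivative and "ND" in elements:
        return False
    return True


# Hypothetical repository listing: dataset name -> licence code.
datasets = {"recording-a": "CC BY", "recording-b": "CC BY-ND", "recording-c": "CC BY-NC-SA"}

# Data mining typically produces derived data, so ND datasets are excluded.
minable = sorted(name for name, code in datasets.items()
                 if permits(code, derivative=True))
```

A check like this lets a repository honour each owner's conditions automatically across many nodes, instead of relying on case-by-case negotiation.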

Conclusions

• Automated workflows have been applied very successfully in bioinformatics.

• But applying these directly to neuroinformatics is a different issue.

• Technology has to fit the domain.

• We are investigating metadata for describing neuroinformatics

myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble.

• Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell.

• Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding: EPSRC, Wellcome Trust.

Acknowledgements

Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr. Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigio Quian Quiroga, Dr. Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders, Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher, Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez, Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan

University of St Andrews

The University of Sheffield
