Workflows, Semantics & future eScience Integrative Bioinformatics Workshop, Tom Oinn – [email protected] , 6 th September 2006

Page 1: Workflows, Semantics & future eScience

Workflows, Semantics & future eScience

Integrative Bioinformatics Workshop, Tom Oinn – [email protected], 6th September 2006

Page 2: Workflows, Semantics & future eScience

Workflows

Data driven workflow system
– Graph of operations (nodes) and data transfer (edges)
– Operations are services, databases, command line tools, scripts…
– Workflow engine software (enactor) responsible for coordination of operations
– Enactor is data agnostic, apart from collection structures (for Taverna)
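The data-driven model on this slide can be illustrated with a minimal sketch. This is not Taverna's enactor: the function `enact`, the edge format `(source, target, port)` and the stub operations are all assumptions made for illustration, and collection handling is not modelled. An operation runs as soon as every edge feeding it has a value, and the engine never inspects the data it passes along.

```python
def enact(operations, edges, workflow_inputs):
    """Run each operation once all of its incoming edges carry a value.

    operations: name -> callable taking keyword arguments named after its ports
    edges: list of (source, target, port); source is an operation or a workflow input
    workflow_inputs: name -> value
    """
    values = dict(workflow_inputs)      # data produced so far, keyed by producer
    pending = set(operations)
    while pending:
        ready = sorted(
            name for name in pending
            if all(src in values for (src, dst, _port) in edges if dst == name)
        )
        if not ready:
            raise ValueError("cycle or unconnected input in workflow graph")
        for name in ready:
            # The engine only moves values along edges; it does not look inside them.
            kwargs = {port: values[src] for (src, dst, port) in edges if dst == name}
            values[name] = operations[name](**kwargs)
            pending.discard(name)
    return values


# Hypothetical stub operations standing in for real services.
operations = {
    "fetch_sequence": lambda protein_id: f"SEQ({protein_id})",
    "fetch_structure": lambda protein_id: f"STRUCT({protein_id})",
    "report": lambda sequence, structure: f"{sequence}+{structure}",
}
edges = [
    ("id", "fetch_sequence", "protein_id"),
    ("id", "fetch_structure", "protein_id"),
    ("fetch_sequence", "report", "sequence"),
    ("fetch_structure", "report", "structure"),
]
results = enact(operations, edges, {"id": "P12345"})
print(results["report"])
```

Execution order falls out of data availability rather than an explicit control flow, which is what "data driven" means here.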

Page 3: Workflows, Semantics & future eScience

Taverna 1.4

Rewritten generic web service client
Rewritten BioMart client
Provenance capture system
Performance and usability enhancements
Groundwork for new architecture and build system

Page 4: Workflows, Semantics & future eScience

Web Service Support

Enhanced support for document / literal style services
– e.g. NCBI eUtils services
More robust invocation
– Copes better with various broken service types
Support for wsdl:documentation tags
– Now shows free text service docs
– Not ideal, but it’s all that web services give us
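As an illustration of what reading wsdl:documentation tags involves, the sketch below pulls the free text description of each operation out of a WSDL 1.1 document using Python's standard library. The sample WSDL, its service and operation names, and the helper `operation_docs` are invented for this example; it says nothing about how Taverna's own client is implemented.

```python
import xml.etree.ElementTree as ET

# WSDL 1.1 namespace; the prefix "wsdl" is only a local alias for the queries below.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical, heavily trimmed WSDL used only for this example.
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="ExampleService">
  <portType name="ExamplePort">
    <operation name="runSearch">
      <documentation>Searches a database and returns matching identifiers.</documentation>
    </operation>
    <operation name="fetchRecord"/>
  </portType>
</definitions>
"""


def operation_docs(wsdl_text):
    """Map each portType operation name to its documentation text ('' if absent)."""
    root = ET.fromstring(wsdl_text)
    docs = {}
    for op in root.iterfind(".//wsdl:portType/wsdl:operation", WSDL_NS):
        doc = op.find("wsdl:documentation", WSDL_NS)
        docs[op.get("name")] = (doc.text or "").strip() if doc is not None else ""
    return docs


docs = operation_docs(SAMPLE_WSDL)
for name, text in docs.items():
    print(name, "->", text or "(no documentation)")
```

The second operation shows the limitation noted on the slide: when a provider supplies no documentation element, there is nothing to display.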

Page 5: Workflows, Semantics & future eScience
Page 6: Workflows, Semantics & future eScience

BioMart Support

UI changed to reflect current website
– Ease ‘techno-shock’ to users
Supports all Mart features
– Data set linking and federation
Uses Mart service
– Connects over HTTP reducing firewall issues
– Mart providers no longer need to open JDBC access ports, fewer ports open so better security for service providers.

Page 7: Workflows, Semantics & future eScience
Page 8: Workflows, Semantics & future eScience

Provenance Capture & Browse

Observes events from the workflow engine
Populates a triple store with information from these events
Presents a simple browse interface over this metadata
– Replicates Taverna’s existing result and status browser
Allows for more complex query interfaces in the future
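The observe, populate, browse sequence above can be sketched with an in-memory set of triples. Everything named here is an assumption for illustration: the event dictionary layout, the predicates `part_of_run`, `has_status` and `produced`, and the classes `TripleStore` and `ProvenanceObserver` are not taken from Taverna's provenance plug-in.

```python
class TripleStore:
    """A toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the given fields; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


class ProvenanceObserver:
    """Turns workflow engine events into triples (hypothetical event format)."""

    def __init__(self, store):
        self.store = store

    def on_event(self, event):
        node = f"{event['run']}/{event['processor']}"
        self.store.add(node, "part_of_run", event["run"])
        self.store.add(node, "has_status", event["status"])
        for value in event.get("outputs", []):
            self.store.add(node, "produced", value)


store = TripleStore()
observer = ProvenanceObserver(store)
for event in [
    {"run": "run1", "processor": "fetch_sequence", "status": "running"},
    {"run": "run1", "processor": "fetch_sequence", "status": "completed",
     "outputs": ["MKTAYIAK"]},
]:
    observer.on_event(event)

# A simple "browse" query: what did each processor produce?
produced = store.match(predicate="produced")
print(produced)
```

Because the store holds plain triples, richer query interfaces can be layered on later without changing how events are captured.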

Page 9: Workflows, Semantics & future eScience
Page 10: Workflows, Semantics & future eScience

Taverna is now part of OMII-UK

Taverna 1.4 production target: Sept 2006
– Packaging, Installation, Deployment, Maintenance, Testing
– GridSAM, GRIMOIRES, BioMOBY registry integration
– Semantic content for registry
– Integration of discovery and metadata management
– Security AA for KAVE data and metadata management
Taverna 2.0: Spring 2007
– Redevelopment of the plug in and enactor framework, improved iteration events, data management
Close collaboration with pioneers
Incremental rollouts to early adopters

Page 11: Workflows, Semantics & future eScience

[Diagram: software release pipeline. Labels: Ingest; Pioneers, Early adopters, Conservatives; myGrid Pre-release, myGrid Release, OMII-UK Release; Software Engineering; XP; Quality & Test; Evaluation; OMII Software Engineering; Prioritise & Plan; Production; Applications & Professional Services; myGrid Alliance; Source-forge community]

Page 12: Workflows, Semantics & future eScience

Evolving challenges

Long running data intensive workflows
Manipulation of confidential or otherwise protected information
Use with classical grid systems
Interaction with users during workflows
Workflow authoring, service discovery and composition
Fine grained runtime updates
Data comprehension, provenance and visualization – the rest of this talk!

Page 13: Workflows, Semantics & future eScience
Page 14: Workflows, Semantics & future eScience

And now, the future…

[Diagram: stages plotted against two axes, “Increasing Automation” and “Better Semantics and Understanding”. Stages: Manual use of tools, web pages; Scripted tool invocation; Naïve workflow systems; Basic ‘discovery’ style service annotations; Guided workflow construction; Workflow design with annotation overlays; Automated hypothesis generation (really!); Knowledge driven visualization; Hypothesis validation; ‘Data playground’ exploratory tool]

Page 15: Workflows, Semantics & future eScience

Service Annotations

Immediate problem – too many services!
– At workflow construction time users cannot isolate the services they need
Multiple levels of annotation
– Interface and syntactic definitions, e.g. WSDL
– Free text descriptions
– Semantic annotation of operations

Page 16: Workflows, Semantics & future eScience

Service annotations

[Diagram: stages plotted against two axes, “Increasing Automation” and “Better Semantics and Understanding” (Manual use of tools, web pages; Scripted tool invocation; Naïve workflow systems; Basic ‘discovery’ style service annotations; Guided workflow construction; Automated hypothesis generation (really!); Knowledge driven visualization; Hypothesis validation), overlaid with example annotations for one service. Side labels: “Natural language” and “Requires type ontology or ontologies!”]

‘myalignscript.pl’
‘A tool to compare multiple protein structures’
performs_task : alignment
input_type{seq_a} : sequence … output_type{score} : d_value
output{score} is_distance_between pair {input{sequence a}, input{sequence b}}
Also needs workflow level annotation!
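One way to picture the levels of annotation is as fields on a service record, with discovery as a filter over those fields. The sketch below reuses the slide's example values (`myalignscript.pl`, `performs_task : alignment`, `sequence`, `d_value`). The record layout, the `seq_b` port, the second registry entry and the function `find_services` are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceAnnotation:
    name: str                                   # all that an unannotated script offers
    description: str = ""                       # free text level
    performs_task: Optional[str] = None         # semantic level: task ontology term
    input_types: Dict[str, str] = field(default_factory=dict)   # port -> type term
    output_types: Dict[str, str] = field(default_factory=dict)  # port -> type term


def find_services(registry: List[ServiceAnnotation],
                  task: Optional[str] = None,
                  produces: Optional[str] = None) -> List[str]:
    """Return names of services matching the requested task and/or output type."""
    hits = []
    for svc in registry:
        if task is not None and svc.performs_task != task:
            continue
        if produces is not None and produces not in svc.output_types.values():
            continue
        hits.append(svc.name)
    return hits


registry = [
    ServiceAnnotation(
        name="myalignscript.pl",
        description="A tool to compare multiple protein structures",
        performs_task="alignment",
        # seq_b is assumed; the slide elides the remaining inputs.
        input_types={"seq_a": "sequence", "seq_b": "sequence"},
        output_types={"score": "d_value"},
    ),
    # Hypothetical second entry so the filter has something to exclude.
    ServiceAnnotation(name="fetchseq", performs_task="retrieval",
                      output_types={"seq": "sequence"}),
]

print(find_services(registry, task="alignment"))
print(find_services(registry, produces="sequence"))
```

Filtering by task or type only works if providers and users share the same vocabulary, which is why the slide notes that a type ontology is required.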

Page 17: Workflows, Semantics & future eScience

Building the semantic network

Workflow engine uses service annotations to annotate the results of invocations of those services.

For example:

[Diagram: example workflow with an ID input and the services Fetch Structure, Fetch Sequence, InterproScan, GetGO (cellular location) and ExtractMotifRanges]
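A minimal sketch of this idea follows, assuming each service carries an annotation with an input type, an output type and a relation between the two. The type and relation names (`protein_identifier`, `protein_sequence`, `has_sequence`, `3d_structure`, `has_structure`) are taken from the later slides of this deck. The `is_a` predicate, the annotation dictionary layout, the function `invoke_and_annotate` and the stub services are assumptions.

```python
def invoke_and_annotate(service, annotation, input_value, network):
    """Call a service, then record typed nodes and a labelled edge for the result."""
    result = service(input_value)
    outputs = result if isinstance(result, list) else [result]
    network.add((input_value, "is_a", annotation["input_type"]))
    for value in outputs:
        network.add((value, "is_a", annotation["output_type"]))
        network.add((input_value, annotation["relation"], value))
    return outputs


# Stub services; real ones would be remote calls.
def fetch_sequence(protein_id):
    return "MKTAYIAK"


def fetch_structure(protein_id):
    return f"structure-of-{protein_id}"


annotations = {
    fetch_sequence: {"input_type": "protein_identifier",
                     "output_type": "protein_sequence",
                     "relation": "has_sequence"},
    fetch_structure: {"input_type": "protein_identifier",
                      "output_type": "3d_structure",
                      "relation": "has_structure"},
}

network = set()
for service, annotation in annotations.items():
    invoke_and_annotate(service, annotation, "P12345", network)

for triple in sorted(network):
    print(triple)
```

The services themselves stay unchanged; the semantic network is a by-product of the engine knowing what each service means.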

Page 18: Workflows, Semantics & future eScience

No service annotations

[Diagram: result graph of 12 untyped “ID” nodes, with edges labelled only by the service that produced them: Fetch Structure, Fetch Sequence, InterproScan (×2), GetMotifRanges (×4), GetGO (×3)]

Page 19: Workflows, Semantics & future eScience

Input / Output type annotation

[Diagram: the same graph of 12 nodes, with edges still labelled by service (Fetch Structure, Fetch Sequence, InterproScan ×2, GetMotifRanges ×4, GetGO ×3) and nodes now typed: protein_identifier, 3d_structure, protein_sequence, ipro_identifier (×2), go_term (×3), range_set (×4)]

Page 20: Workflows, Semantics & future eScience

Full static semantics

[Diagram: the same graph of 12 typed nodes (protein_identifier, 3d_structure, protein_sequence, ipro_identifier ×2, go_term ×3, range_set ×4), with edges now labelled by relation instead of service: has_structure, has_sequence, has_ipro_hit (×2), contains_domain (×4), has_go (×3)]

Page 21: Workflows, Semantics & future eScience

Dynamic semantics

[Diagram: reduced graph of 7 “ID” nodes. Node types: protein_identifier, 3d_structure, protein_sequence, ipro_identifier, go_term, range_set (×2), location_prediction. Edge labels: has_structure, has_sequence, has_ipro_hit, contains_domain (×2), has_go, has_evidence (×2), predicts_location]

(nodes omitted to prevent further insanity)

Driven by workflow level annotation

Page 22: Workflows, Semantics & future eScience

Visualization

Naïve rendering of the graph isn’t good enough
Any scientific domain already has visualization mechanisms
Create an ecosystem of visualization agents
– Iteratively consume the semantic network
– Replace node(s) with markers into the visualizer’s space
– Render any remaining edges using graph layout
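The agent idea can be sketched as a claiming pass over the typed nodes of the network. The agent names below come from the “Rendering Agents” slide of this deck. Which node types each agent claims, the sample nodes and edges, and the function `assign_agents` are assumptions made for illustration.

```python
def assign_agents(node_types, edges, agents):
    """Let each agent claim the nodes it can draw; return claims and leftover edges.

    node_types: node -> type name
    edges: list of (subject, predicate, object)
    agents: list of (agent name, set of type names it handles), in priority order
    """
    owner = {}
    for agent, handled in agents:
        for node, node_type in node_types.items():
            if node not in owner and node_type in handled:
                owner[node] = agent      # node becomes a marker in this agent's view
    # Edges inside a single agent's view are drawn by that agent;
    # everything else is left for a generic graph layout.
    leftover = [
        (s, p, o) for (s, p, o) in edges
        if owner.get(s) is None or owner.get(s) != owner.get(o)
    ]
    return owner, leftover


# Illustrative network fragment; node identifiers and the exact edges are invented.
node_types = {
    "P12345": "protein_identifier",
    "struct1": "3d_structure",
    "seq1": "protein_sequence",
    "range1": "range_set",
    "GO:0005634": "go_term",
}
edges = [
    ("P12345", "has_structure", "struct1"),
    ("P12345", "has_sequence", "seq1"),
    ("seq1", "contains_domain", "range1"),
    ("P12345", "has_go", "GO:0005634"),
]
agents = [
    ("3D Structure Renderer", {"3d_structure"}),
    ("Sequence + Feature Renderer", {"protein_sequence", "range_set"}),
    ("Gene Ontology Subgraph Renderer", {"go_term"}),
]

owner, leftover = assign_agents(node_types, edges, agents)
print(owner)
print(leftover)
```

Here the sequence and its range set end up in one agent's view, so the edge between them disappears from the generic layout, while the protein identifier stays as an ordinary graph node linking the three specialised views.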

Page 23: Workflows, Semantics & future eScience

Rendering Agents

[Diagram: the dynamic semantics graph of 7 “ID” nodes (types: protein_identifier, 3d_structure, protein_sequence, ipro_identifier, go_term, range_set ×2, location_prediction; edges: has_structure, has_sequence, has_ipro_hit, contains_domain ×2, has_go, has_evidence ×2, predicts_location), with regions handed to three agents: 3D Structure Renderer, Sequence + Feature Renderer, Gene Ontology Subgraph Renderer]

Page 24: Workflows, Semantics & future eScience

Hypothesis Validation

Express hypothesis as a pattern that can match the semantic network topology
Combination of structure and node values
– Need to use a rich graph aware query language, various options
For each object of a certain class test whether the structure around it matches
Link back to the visualization to show exceptions
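A minimal sketch of this check, assuming a hypothesis can be written as a list of required `(relation, object type)` pairs around every node of a given class. A real system would more likely use a graph query language (SPARQL would be one candidate, though the slides do not name one). The function `validate` and the sample data are invented; the type and relation names come from the earlier slides.

```python
def validate(triples, node_types, subject_type, pattern):
    """Split nodes of subject_type into those matching the pattern and the exceptions.

    pattern: list of (predicate, object_type); every entry must be satisfied
    by at least one outgoing edge of the node.
    """
    matches, exceptions = [], []
    for node, node_type in sorted(node_types.items()):
        if node_type != subject_type:
            continue
        satisfied = all(
            any(s == node and p == predicate and node_types.get(o) == object_type
                for (s, p, o) in triples)
            for (predicate, object_type) in pattern
        )
        (matches if satisfied else exceptions).append(node)
    return matches, exceptions


# Illustrative data: P1 has both a structure and a GO term, P2 lacks the GO term.
node_types = {
    "P1": "protein_identifier", "P2": "protein_identifier",
    "s1": "3d_structure", "s2": "3d_structure",
    "g1": "go_term",
}
triples = [
    ("P1", "has_structure", "s1"),
    ("P1", "has_go", "g1"),
    ("P2", "has_structure", "s2"),
]
# Hypothesis: every protein with a structure also has a GO term.
hypothesis = [("has_structure", "3d_structure"), ("has_go", "go_term")]

matches, exceptions = validate(triples, node_types, "protein_identifier", hypothesis)
print("matches:", matches)
print("exceptions:", exceptions)
```

The list of exceptions is what would be linked back to the visualization, so a user can inspect the objects that break the hypothesis.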

Page 25: Workflows, Semantics & future eScience

Hypothesis Generation (!)

Use genetic algorithms to ‘evolve’ a suitable match for the previous stage
Relatively easy to create a fitness function (precision, specificity, match percentage)
Easy to ‘mutate’ patterns
‘Tell me anything interesting you’ve noticed about protein structures in this workflow’ capability
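To make the fitness and mutation steps concrete, here is a small sketch under strong simplifying assumptions. A pattern is a set of required `(relation, object type)` pairs, each object is described by the set of such pairs found around it, and fitness is precision multiplied by the fraction of positive examples matched. The search is mutation-only with elitist selection (no crossover), so it is a simplified evolutionary search and not a full genetic algorithm. All names and data are invented for illustration.

```python
import random


def matches(pattern, features):
    """A pattern matches an object when all its requirements are present."""
    return pattern <= features


def fitness(pattern, positives, negatives):
    """Precision times coverage of the positive examples (an assumed scoring rule)."""
    tp = sum(matches(pattern, f) for f in positives)
    fp = sum(matches(pattern, f) for f in negatives)
    if tp == 0:
        return 0.0
    return (tp / (tp + fp)) * (tp / len(positives))


def mutate(pattern, vocabulary, rng):
    """Toggle one randomly chosen requirement in or out of the pattern."""
    return pattern ^ {rng.choice(vocabulary)}


def evolve(positives, negatives, vocabulary, generations=30, size=8, seed=0):
    rng = random.Random(seed)
    population = [frozenset()] * size        # start from the empty pattern
    for _ in range(generations):
        children = [mutate(p, vocabulary, rng) for p in population]
        # Elitist selection: parents compete with children, so the best never gets worse.
        population = sorted(population + children,
                            key=lambda p: fitness(p, positives, negatives),
                            reverse=True)[:size]
    return population[0]


vocabulary = [("has_structure", "3d_structure"),
              ("has_go", "go_term"),
              ("contains_domain", "range_set")]
# Objects of interest versus the rest, each described by the structure around it.
positives = [frozenset(vocabulary[:2]), frozenset(vocabulary)]
negatives = [frozenset(vocabulary[:1]), frozenset(vocabulary[2:])]

best = evolve(positives, negatives, vocabulary)
print(sorted(best), fitness(best, positives, negatives))
```

The evolved pattern is a candidate hypothesis: a structural feature that separates the objects of interest from the rest and that a scientist can then inspect and test.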

Page 26: Workflows, Semantics & future eScience
Page 27: Workflows, Semantics & future eScience

Obtaining Taverna

Taverna is available under the LGPL from our project site on Sourceforge.net
– http://taverna.sourceforge.net
Release 1.4 as of May 2006
Win32, Solaris / Linux & OS-X
Includes online and downloadable user manual, examples etc.
Support via project mailing lists

Page 28: Workflows, Semantics & future eScience

myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble.

Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell.

Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

Funding: EPSRC, Wellcome Trust.