the past, present and future of knowledge in biology

The Past, Present and Future of Knowledge in Biology Robert Stevens BioHealth Informatics Group The University of Manchester Manchester United Kingdom [email protected]


Post on 21-May-2015



DESCRIPTION

Keynote talk at SMBM 2010

TRANSCRIPT

Page 1: The Past, Present and Future of Knowledge in Biology

The Past, Present and Future of Knowledge in Biology

Robert Stevens
BioHealth Informatics Group
The University of Manchester
Manchester
United Kingdom

[email protected]

Page 2: The Past, Present and Future of Knowledge in Biology

Overview

• A look at the state of play
• For what are we using ontologies?
• What do we count as knowledge?
• Doing so much more with knowledge
• Stopping text being a dead end

Page 3: The Past, Present and Future of Knowledge in Biology

Text and Ontologies: The Terrible Twins of Knowledge in Biology

Robert Stevens
BioHealth Informatics Group
The University of Manchester
Manchester
United Kingdom

[email protected]

Page 4: The Past, Present and Future of Knowledge in Biology

Biology now has lots of facts

Page 5: The Past, Present and Future of Knowledge in Biology

Genome

Proteome

Transcriptome

Interactome

Metabolome

PHENOME

Lots of catalogues

Page 6: The Past, Present and Future of Knowledge in Biology

Data are only as Good as their Metadata

• There is a lot of biology out there…
• How these entities are described in our data varies
• We don’t even agree on what entities there are to describe in our data
• This makes analysing data hard: You have to know what your data represent
• …, but also how the entities described in your data relate to each other
• We need to describe our data – their metadata

Page 7: The Past, Present and Future of Knowledge in Biology

Creating Woods, not Trees

Genes

Proteins

Pathways

Interactions

Literature

Complex Machines

Virtual Organism

… from biological facts, we make a system that is some model of a real organism

Page 8: The Past, Present and Future of Knowledge in Biology

Timeline

Page 9: The Past, Present and Future of Knowledge in Biology

There’s a Lot of it About

Searching for “ontology” in five-year chunks on the ACM digital portal

Searching for “ontology” in five-year chunks on PubMed

Page 10: The Past, Present and Future of Knowledge in Biology

It’s all Gruber’s Fault

• “In the context of knowledge sharing, the term ontology means a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.” (DOI:10.1006/knac.1993.1008; DOI:10.1006/ijhc.1995.1081)

Page 11: The Past, Present and Future of Knowledge in Biology

Angels on the head of a pin

Page 12: The Past, Present and Future of Knowledge in Biology

Everything with a Blob and Line is called an Ontology

• Wide acceptance criteria
• Narrow evaluation criteria
• Different sorts of knowledge for different situations
• Different styles of representation; some scruffy and some formal
• Representing knowledge in biology is more than ontologies
• We could stop calling them ontologies

RDF graph

Database schema

Thesaurus

OWL Ontology

Formal ontology

SKOS vocabulary

Page 13: The Past, Present and Future of Knowledge in Biology

Uses of Ontologies

Page 14: The Past, Present and Future of Knowledge in Biology

Knowing What We’ve got is so Useful

• We could computationally handle lots of data, but we couldn’t do so with what we know about those data

• Ontologies so far mainly used for a common tongue so that we can compare

• … and it works!
• Still getting lots of mileage from ontology annotation
• … but there is so much more

Page 15: The Past, Present and Future of Knowledge in Biology

GENERIC GENE ONTOLOGY (GO) TERM FINDER

S000003093
MXR1
YPL250C
S000004294
SAM3
YIR017C
S000003152
MMP1
MET1

Expressed Genes

P-value score

http://go.princeton.edu/cgi-bin/GOTermFinder
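The “P-value score” on this slide can be illustrated with a hypergeometric test, which is the usual way term-enrichment tools of this kind score a GO term. The following is a minimal sketch rather than the GO Term Finder's own code; the function name and the example numbers are assumptions made up for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     number of genes in the submitted (expressed) list
    hits:       genes in the submitted list annotated with the term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Invented numbers: 9 submitted genes, 5 of which carry a term that
# annotates only 40 of 6000 background genes.
p = enrichment_p_value(population=6000, annotated=40, sample=9, hits=5)
print(f"p = {p:.3g}")
```

A small p-value suggests that the term is over-represented in the gene list. Real tools also correct for testing many terms at once, which this sketch leaves out.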

Page 16: The Past, Present and Future of Knowledge in Biology

Classifying a Mouse

Individual Description:

Stops wriggling after 3 sec

Has 3 cm tail

Mass 10g

10 days old (since birth)

Strain C57Bl/6

Class Description:

Class: DepressedMouse

EquivalentTo: Mouse that (wriggles for <=30 OR swims for <=45)

Data Transformation
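The step from an individual description to a class can be mimicked in a few lines of ordinary code. This is only a sketch of the idea, not how an OWL reasoner works: the field names are hypothetical, the thresholds are assumed to be in seconds, and a missing measurement is treated as “not satisfied”, whereas a reasoner working under the open-world assumption would simply draw no conclusion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mouse:
    strain: str
    wriggles_for_s: Optional[float] = None  # seconds of wriggling observed
    swims_for_s: Optional[float] = None     # seconds of swimming observed


def is_depressed_mouse(m: Mouse) -> bool:
    """Mirror of: Mouse that (wriggles for <= 30 OR swims for <= 45)."""
    short_wriggle = m.wriggles_for_s is not None and m.wriggles_for_s <= 30
    short_swim = m.swims_for_s is not None and m.swims_for_s <= 45
    return short_wriggle or short_swim


# The individual from the slide: stops wriggling after 3 seconds.
mouse = Mouse(strain="C57Bl/6", wriggles_for_s=3)
print(is_depressed_mouse(mouse))
```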

Page 17: The Past, Present and Future of Knowledge in Biology

Short tailed mouse

Class: ShortTailedMouse

EquivalentTo: Mouse that hasPart EXACTLY 1 (Tail that hasAssay SOME (LengthAssay that hasValue SOME int[<= 20] and hasUnit SOME Millimetre))

SubClassOf: Mouse that hasPart SOME (Tail that hasQuality SOME Short)

• We can recognise an instance of short-tailed mouse, but we also know that it has the quality “short”

• Even when the fact isn’t asserted
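The two axioms play different roles: the equivalence lets us recognise a short-tailed mouse from its measurement, and the subclass axiom then lets us conclude something that was never stated. A rough sketch of that two-step inference in plain code, with hypothetical names and assuming the tail length is given in millimetres:

```python
def classify_tail(length_mm: int) -> dict:
    """Return the classes and tail qualities inferred for one mouse."""
    inferred = {"classes": ["Mouse"], "tail_qualities": []}
    if length_mm <= 20:
        # EquivalentTo: the measurement is enough to recognise the class.
        inferred["classes"].append("ShortTailedMouse")
        # SubClassOf: class membership implies the quality, although
        # nobody asserted "Short" for this individual.
        inferred["tail_qualities"].append("Short")
    return inferred


print(classify_tail(15))
print(classify_tail(30))  # a 3 cm tail is not recognised as short
```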


Page 18: The Past, Present and Future of Knowledge in Biology

Classifying Proteins

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV…

InterPro

Instance Store

Reasoner

Translate

Codify
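The pipeline on this slide (find the domains of a sequence, translate them into a description, and let a reasoner match that description against codified class definitions) can be approximated as set containment. This is a sketch under assumptions: the class definitions and domain names below are illustrative placeholders, not InterPro identifiers or the definitions used in the original work.

```python
# Hypothetical codified knowledge: class name -> domains that must all be present.
CLASS_DEFINITIONS = {
    "TyrosinePhosphatase": {"tyrosine_phosphatase_catalytic"},
    "ReceptorTyrosinePhosphatase": {
        "tyrosine_phosphatase_catalytic",
        "transmembrane",
    },
}


def classify(domains):
    """Return every class whose required domains are all found in the protein."""
    found = set(domains)
    return sorted(
        name for name, required in CLASS_DEFINITIONS.items() if required <= found
    )


# Domain list as it might come back from a domain-detection step.
print(classify(["tyrosine_phosphatase_catalytic", "transmembrane", "fibronectin_III"]))
```

A description logic reasoner does considerably more than this, for example handling cardinality and negation, but the subset test shows why a protein can fall under several classes at once.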

Page 19: The Past, Present and Future of Knowledge in Biology

OWL’s Automated Reasoners

• Demonstrably useful in:
– Building ontologies
– Querying ontologies
– Automatically annotating
– Making “discoveries”
But there is more than OWL’s reasoning

Page 20: The Past, Present and Future of Knowledge in Biology

Separation of Knowledge and Software

• We realised a long time ago that we needed to separate knowledge from software

• We only recently called this knowledge component an ontology

• We don’t really need to see the ontology
• We certainly shouldn’t show people OWL; it “scares the horses”
• Ontology for software not humans (L. Hunter)

Page 21: The Past, Present and Future of Knowledge in Biology

The Ontology cottage Industry

• We’ve industrialised data production
• We’ve (to some extent) industrialised data analysis
• We’ve not really moved away from hand-crafted, “whittled” ontologies

Page 22: The Past, Present and Future of Knowledge in Biology

Can we have Mass Editing of Ontologies?

• Probably not;
• Computer scientists in love with synchronous editing
• …, but not really necessary (see CSCW)
• Mass gathering of Knowledge

Page 23: The Past, Present and Future of Knowledge in Biology

Mass Gathering of Knowledge and the Application of Patterns or a metamodel

http://rightfield.org.uk
http://www.e-lico.eu/populous
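The idea of gathering knowledge in a simple table and then applying a pattern to it can be sketched as template filling. This is not the RightField or Populous implementation; the column names, the pattern and the example row are assumptions chosen to echo the HeLa example used elsewhere in the talk.

```python
import csv
import io

# Hypothetical pattern: one axiom block per spreadsheet row.
PATTERN = (
    "Class: '{cell}'\n"
    "    SubClassOf: Cell,\n"
    "        derives_from some '{organism}'"
)


def rows_to_axioms(table_text):
    """Apply the pattern to every row of a comma-separated table."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [PATTERN.format(**row) for row in reader]


# What a domain expert might fill in, without ever seeing OWL.
table = "cell,organism\nHeLa,Homo sapiens\n"
for axiom in rows_to_axioms(table):
    print(axiom)
```

The expert works only with the table; the pattern, written once by an ontologist, decides what axioms each row becomes.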

Page 24: The Past, Present and Future of Knowledge in Biology

There’s so much more to Ontology Building than editing Axioms

• Gathering knowledge
• Adding labels
• Adding other human-orientated content
• Reviewing, checking, suggesting
• Deploying, using, creating “views”
• Ontology comprehension

Page 25: The Past, Present and Future of Knowledge in Biology

There’s More to KR than OWL

• OWL and its automated reasoners are useful
• But there is so much more to KR than ontologies and OWL
• Higher order reasoning
• Rules
• Other sorts of reasoning

Page 26: The Past, Present and Future of Knowledge in Biology

Generating natural language

Class: HeLa

SubClassOf: Cell,
    bearer_of some 'cervical carcinoma',
    derives_from some 'Homo sapiens',
    derives_from some cervix,
    derives_from some 'epithelial cell'

OWL

HeLa is a cell line. A hela is all of the following: something that is bearer of a cervical carcinoma, something that derives from a homo sapiens, something that derives from an epithelial cell, and something that derives from a cervix.

Generated natural language

Experimental Factor Ontology (EFO)
http://www.ebi.ac.uk/efo
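A very small verbaliser shows the principle of turning axioms into sentences. It is a sketch only: it does not reproduce the wording generated on the slide, and the relation-to-phrase mapping and function names are assumptions.

```python
# Hypothetical mapping from property names to English phrases.
RELATION_PHRASES = {
    "bearer_of": "is bearer of",
    "derives_from": "derives from",
}


def article(noun):
    """Pick 'a' or 'an' from the first letter (a deliberately crude rule)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def verbalise(name, parent, restrictions):
    """Render a class, its superclass and its 'some' restrictions as a sentence."""
    clauses = [
        f"{RELATION_PHRASES[rel]} {article(filler)} {filler}"
        for rel, filler in restrictions
    ]
    sentence = f"Every {name} is {article(parent)} {parent.lower()}"
    if clauses:
        body = " and ".join([", ".join(clauses[:-1]), clauses[-1]]) if len(clauses) > 1 else clauses[0]
        sentence += f" that {body}"
    return sentence + "."


print(verbalise("HeLa", "Cell", [
    ("bearer_of", "cervical carcinoma"),
    ("derives_from", "epithelial cell"),
]))
```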

Page 27: The Past, Present and Future of Knowledge in Biology

Ontology as book

Title: Experimental Factor Ontology

Table of Contents

Chapter 1. Cell line
Chapter 2. Cell type
Chapter 3. Chemical Compound
Chapter 4. Organism

HeLa is a cell line. A hela is all of the following: something that is bearer of a cervical carcinoma, something that derives from a homo sapiens, something that derives from an epithelial cell, and something that derives from a cervix.

entry

Page 28: The Past, Present and Future of Knowledge in Biology

Types of Knowledge

Data

Biologist’s head

Papers

Databases

Ontologies

??????

Page 29: The Past, Present and Future of Knowledge in Biology

It’s not Just “Things”

• Experiments produce data about things
• Proteins, genes, chemicals, reactions, diseases, size, shape, speed, …
• As well as this knowledge we have knowledge of how it was done
• OBI is still the “things” to do with production
• We still need the methods by which these “things” were deployed
• The protocol

Page 30: The Past, Present and Future of Knowledge in Biology

Knowledge about an experiment

Workflow Run

Workflow

Provenance

Organisational

Results and Interpretation

Page 31: The Past, Present and Future of Knowledge in Biology

Workflows are knowledge about methods

Get genes in region

Get pathways that contain genes

Merge data into single files

Get gene descriptions

Get pathway descriptions

Cross-reference ids

Methods:

1. A QTL (region of chromosome) is entered into the workflow, specified as base pairs. These base pairs are subsequently used to identify, in the Ensembl database, any genes that lie within this region.

2. Any genes found within this region are subsequently annotated with Entrez and UniProt identifiers.

3. The Entrez and UniProt identifiers are then passed to a KEGG id conversion Web Service, to cross-reference the input ids to KEGG gene identifiers. This enables gene descriptions and biological pathway data to be returned from KEGG.

4. Each KEGG gene id is then used in a search for KEGG pathways. Any pathways found to contain the gene are returned as KEGG pathway ids.

5. Both KEGG gene and pathway ids are then sent to individual services, provided by KEGG, which provide a description of the gene and pathway.

6. The outputs of the workflow are then combined into single flat files, which can be saved locally and used to identify novel pathways and genes within the QTL region.
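The numbered method above is itself a program, which is the point of the slide. The following sketch follows the same steps with in-memory lookup tables standing in for the Ensembl and KEGG services; every identifier, description and function name here is invented for illustration, and the Entrez/UniProt annotation step is folded into a single cross-reference table.

```python
# Hypothetical stand-ins for the remote services named in the method.
GENE_POSITIONS = {"geneA": 1200, "geneB": 5400, "geneC": 9900}  # gene -> base pair
KEGG_GENE_IDS = {"geneA": "kegg:A", "geneB": "kegg:B"}          # cross-reference ids
KEGG_PATHWAYS = {"kegg:A": ["path:1"], "kegg:B": ["path:1", "path:2"]}
DESCRIPTIONS = {
    "kegg:A": "example gene A",
    "kegg:B": "example gene B",
    "path:1": "example pathway 1",
    "path:2": "example pathway 2",
}


def genes_in_region(start, end):
    """Step 1: genes whose position lies inside the QTL region."""
    return sorted(g for g, pos in GENE_POSITIONS.items() if start <= pos <= end)


def run_workflow(start, end):
    genes = genes_in_region(start, end)
    # Steps 2-3: cross-reference to KEGG gene ids (genes without one are dropped).
    kegg_ids = [KEGG_GENE_IDS[g] for g in genes if g in KEGG_GENE_IDS]
    # Step 4: pathways that contain any of the genes.
    pathway_ids = sorted({p for k in kegg_ids for p in KEGG_PATHWAYS.get(k, [])})
    # Steps 5-6: fetch descriptions and merge into flat, tab-separated text.
    gene_file = "\n".join(f"{k}\t{DESCRIPTIONS[k]}" for k in kegg_ids)
    pathway_file = "\n".join(f"{p}\t{DESCRIPTIONS[p]}" for p in pathway_ids)
    return gene_file, pathway_file


gene_file, pathway_file = run_workflow(1000, 6000)
print(gene_file)
print(pathway_file)
```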

Page 32: The Past, Present and Future of Knowledge in Biology

myExperiment

http://www.myexperiment.org

Page 33: The Past, Present and Future of Knowledge in Biology

Research Objects

Method

Data

Introduction

Conclusions

Results

Human Written

Workflow

Generated Text

Semantically annotated

Page 34: The Past, Present and Future of Knowledge in Biology

Model, View, Controller

Annotated Data

Controller

Projection

Text

Tables

Graphs

Steve Pettifer
http://utopia.cs.man.ac.uk/
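The model, view, controller arrangement on this slide treats text and tables as projections of the same annotated data. A minimal sketch of that idea, unrelated to the Utopia code base; the records and function names are placeholders invented for illustration.

```python
# The model: annotated data, independent of any presentation.
ANNOTATED_DATA = [
    {"gene": "gene1", "term": "term A"},
    {"gene": "gene2", "term": "term B"},
]


def as_text(records):
    """Project the model into a sentence-like view."""
    return "; ".join(f"{r['gene']} is annotated with {r['term']}" for r in records)


def as_table(records):
    """Project the same model into a tab-separated table view."""
    lines = ["gene\tterm"] + [f"{r['gene']}\t{r['term']}" for r in records]
    return "\n".join(lines)


PROJECTIONS = {"text": as_text, "table": as_table}


def controller(view, records):
    """Choose a projection by name and apply it to the model."""
    return PROJECTIONS[view](records)


print(controller("text", ANNOTATED_DATA))
print(controller("table", ANNOTATED_DATA))
```

Because both views are generated from one model, the text can never drift out of step with the data behind it.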

Page 35: The Past, Present and Future of Knowledge in Biology

What Next?

• Ontologies are not the only fruit
• We could stop calling them ontologies
• We need to produce “ontologies” faster
• We need to do more interesting things with our knowledge
• We need to make them pervade our tools
• We need them to be “agile”
• Open to other forms of KR and other forms of reasoning
• Adding to data automatically
• Generating our descriptions of data

Page 36: The Past, Present and Future of Knowledge in Biology

Acknowledgements

• Simon Jupp for the slides
• Alan Rector and Carole Goble
• SysMO-DB for RightField (Katy Wolstencroft, Stuart Owen, Matt Horridge)
• Populous – Simon Jupp
• SWAT – Richard Power, Sandra Williams and Allan Third at the OU
• EFO – James Malone and Helen Parkinson
• Steve Pettifer for the Utopia and MVC
• Paul Fisher and the Taverna team
• The myExperiment team at Southampton and Manchester