Keynote talk at SMBM 2010
The Past, Present and Future of Knowledge in Biology
Robert Stevens
BioHealth Informatics Group
The University of Manchester
Manchester
United Kingdom
Overview
• A look at the state of play
• For what are we using ontologies?
• What do we count as knowledge?
• Doing so much more with knowledge
• Stopping text being a dead end
Text and Ontologies: The Terrible Twins of Knowledge in Biology
Biology now has lots of facts
Genome
Proteome
Transcriptome
Interactome
Metabolome
PHENOME
Lots of catalogues
Data are only as Good as their Metadata
• There is a lot of biology out there…
• How these entities are described in our data varies
• We don’t even agree on what entities there are to describe in our data
• This makes analysing data hard: You have to know what your data represent
• …, but also how the entities described in your data relate to each other
• We need to describe our data – their metadata
Creating Woods, not Trees
Genes
Proteins
Pathways
Interactions
Literature
Complex Machines
Virtual Organism
… from biological facts, we make a system that is some model of a real organism
Timeline
There’s a Lot of it About
Searching for “ontology” in five-year chunks on the ACM digital portal
Searching for “ontology” in five-year chunks on PubMed
It’s all Gruber’s Fault
• “In the context of knowledge sharing, the term ontology means a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.”
DOI:10.1006/knac.1993.1008
DOI:10.1006/ijhc.1995.1081
Angels on the head of a pin
Everything with a Blob and Line is called an Ontology
• Wide acceptance criteria
• Narrow evaluation criteria
• Different sort of knowledge for different situations
• Different styles of representation; some scruffy and some formal
• Representing knowledge in biology is more than ontologies
• We could stop calling them ontologies
RDF graph
Database schema
Thesaurus
OWL Ontology
Formal ontology
SKOS vocabulary
Uses of Ontologies
Knowing What We’ve got is so Useful
• We could computationally handle lots of data, but we couldn’t do so with what we know about those data
• Ontologies so far mainly used for a common tongue so that we can compare
• … and it works!
• Still getting lots of mileage from ontology annotation
• …, but there is so much more
Generic Gene Ontology (GO) Term Finder
S000003093 MXR1 YPL250C S000004294 SAM3 YIR017C S000003152 MMP1 MET1
Expressed Genes
P-value score
http://go.princeton.edu/cgi-bin/GOTermFinder
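A P-value score of this kind is typically a hypergeometric tail probability: the chance of seeing at least the observed number of genes annotated to a GO term in the gene list, given how many genes carry that term in the background. A minimal sketch of that calculation follows; the function and parameter names are assumptions made for illustration, the counts in the usage line are made up, and real tools also correct for multiple testing.

```python
from math import comb

def enrichment_p_value(hits: int, sample_size: int, annotated: int, population: int) -> float:
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample_size).

    hits        genes in the list that carry the GO term
    sample_size genes in the list (e.g. the expressed genes)
    annotated   genes in the background that carry the GO term
    population  genes in the background
    """
    upper = min(sample_size, annotated)
    total = comb(population, sample_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total

# Illustrative counts only (not taken from the slide).
print(enrichment_p_value(hits=3, sample_size=6, annotated=30, population=6000))
```

A small value suggests the term is over-represented in the list relative to the background.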
Classifying a Mouse
Individual Description:
Stops wriggling after 3 sec
Has 3 cm tail
Mass 10g
10 days old (since birth)
Strain C57Bl/6
Class Description:
Class: DepressedMouse
EquivalentTo: Mouse that (wriggles for <= 30 OR swims for <= 45)
Data Transformation
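The idea of recognising an individual as a member of a defined class can be sketched in plain Python. This is only an illustration of the logic, not how an OWL reasoner works internally; the field names are hypothetical, and the thresholds are assumed to be in seconds, as the "3 sec" in the individual description suggests.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MouseObservation:
    # Hypothetical field names; None means the assay was not recorded.
    wriggles_for_s: Optional[float] = None
    swims_for_s: Optional[float] = None

def is_depressed_mouse(m: MouseObservation) -> bool:
    """Mirror of: Mouse that (wriggles for <= 30 OR swims for <= 45)."""
    wriggle_ok = m.wriggles_for_s is not None and m.wriggles_for_s <= 30
    swim_ok = m.swims_for_s is not None and m.swims_for_s <= 45
    return wriggle_ok or swim_ok

# The individual on the slide stops wriggling after 3 sec.
mouse = MouseObservation(wriggles_for_s=3)
print(is_depressed_mouse(mouse))
```

A missing measurement is treated as "cannot conclude membership" rather than as evidence against it, which loosely echoes the open-world reading of OWL.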
Short tailed mouse
Class: ShortTailedMouse
EquivalentTo: Mouse that hasPart EXACTLY 1 (Tail that hasAssay SOME (LengthAssay that hasValue SOME int[<= 20] and hasUnit SOME Millimetre))
SubClassOf: Mouse that hasPart SOME (Tail that hasQuality SOME Short)
• We can recognise an instance of short-tailed mouse, but we also know that it has the quality “short”
• Even when the fact isn’t asserted
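A rough sketch of the two steps involved: transform the recorded measurement into the unit the definition expects, then add the implied fact once the individual is recognised. The conversion table and the strings returned are assumptions for illustration; a reasoner would derive the same conclusion from the axioms rather than from hand-written code.

```python
# Hypothetical conversion table for the data transformation step.
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0}

def tail_length_mm(value: float, unit: str) -> float:
    """Normalise a tail length to millimetres."""
    return value * UNIT_TO_MM[unit]

def classify_tail(value: float, unit: str) -> set:
    """Return the facts inferred about a mouse from its tail length."""
    inferred = set()
    if tail_length_mm(value, unit) <= 20:
        # Recognised via the EquivalentTo definition...
        inferred.add("ShortTailedMouse")
        # ...so the SubClassOf consequence holds even though nobody asserted it.
        inferred.add("hasPart some (Tail that hasQuality some Short)")
    return inferred

print(classify_tail(1.5, "cm"))
print(classify_tail(3, "cm"))
```

Under the 20 mm threshold, the 3 cm tail from the earlier individual description would not be classified as short.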
Classifying Proteins
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV……
…..
InterPro
Instance Store
Reasoner
Translate
Codify
OWL’s Automated Reasoners
• Demonstrably useful in:
  – Building ontologies
  – Querying ontologies
  – Can automatically annotate
  – Have made “discoveries”
But there is more than OWL’s reasoning
Separation of Knowledge and Software
• We realised a long time ago that we needed to separate knowledge from software
• We only recently called this knowledge component ontology
• We don’t really need to see the ontology
• We certainly shouldn’t show people OWL; it “scares the horses”
• Ontology for software not humans (L. Hunter)
The Ontology cottage Industry
• We’ve industrialised data production
• We’ve (to some extent) industrialised data analysis
• We’ve not really moved away from hand-crafted, “whittled” ontologies
Can we have Mass Editing of Ontologies?
• Probably not;
• Computer scientists in love with synchronous editing
• …, but not really necessary (see CSCW)
• Mass gathering of Knowledge
Mass Gathering of Knowledge and the Application of Patterns or a metamodel
http://rightfield.org.uk
http://www.e-lico.eu/populous
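The general idea of gathering knowledge in a simple table and expanding each row through a pattern can be sketched as follows. This is not the pattern language or file format used by RightField or Populous; the column names, the pattern, and the second table row are all invented for illustration, and plain Python string formatting stands in for a real pattern engine.

```python
import csv
import io
from typing import List

# Hypothetical pattern: one Manchester-syntax-like class frame per table row.
PATTERN = (
    "Class: {cell}\n"
    "    SubClassOf: Cell,\n"
    "        derives_from some {tissue},\n"
    "        derives_from some {organism}"
)

def quote(name: str) -> str:
    """Wrap names containing spaces in single quotes."""
    return f"'{name}'" if " " in name else name

def apply_pattern(table_text: str) -> List[str]:
    """Expand every row of a CSV table through PATTERN."""
    rows = csv.DictReader(io.StringIO(table_text))
    return [
        PATTERN.format(**{column: quote(value.strip()) for column, value in row.items()})
        for row in rows
    ]

# The second row is a made-up placeholder, not real data.
TABLE = """cell,tissue,organism
HeLa,cervix,Homo sapiens
ExampleLine,liver,Mus musculus
"""

for frame in apply_pattern(TABLE):
    print(frame)
    print()
```

The point of the design is that the people filling in the table never see the axioms; only the pattern author does.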
There’s so much more to Ontology Building than editing Axioms
• Gathering knowledge
• Adding labels
• Adding other human orientated content
• Reviewing, checking, suggesting
• Deploying, using, creating “views”
• Ontology comprehension
There’s More to KR than OWL
• OWL and its automated reasoners are useful
• But there is so much more to KR than ontologies and OWL
• Higher order reasoning
• Rules
• Other sorts of reasoning
Generating natural language
Class: HeLa
SubClassOf: Cell,
    bearer_of some 'cervical carcinoma',
    derives_from some 'Homo sapiens',
    derives_from some cervix,
    derives_from some 'epithelial cell'
OWL
HeLa is a cell line. A hela is all of the following: something that is bearer of a cervical carcinoma, something that derives from a homo sapiens, something that derives from an epithelial cell, and something that derives from a cervix.
Generated natural language
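A very small template-based verbaliser gives the flavour of how text like this can be produced from class axioms. It is a sketch only and not the SWAT tools' method; the function names are hypothetical, and the property phrases ("is bearer of", "derives from") are supplied by hand rather than derived from the property names.

```python
from typing import List, Tuple

def article(noun: str) -> str:
    """Pick 'a' or 'an' from the first letter (a crude heuristic)."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def verbalise(name: str, parent: str, restrictions: List[Tuple[str, str]]) -> str:
    """Render a class with existential restrictions as English.

    Expects at least one (property phrase, filler) pair.
    """
    parts = [
        f"something that {phrase} {article(filler)} {filler}"
        for phrase, filler in restrictions
    ]
    if len(parts) > 1:
        body = ", ".join(parts[:-1]) + ", and " + parts[-1]
    else:
        body = parts[0]
    return (
        f"{name} is {article(parent)} {parent}. "
        f"{article(name).capitalize()} {name.lower()} is all of the following: {body}."
    )

print(verbalise(
    "HeLa",
    "cell",
    [
        ("is bearer of", "cervical carcinoma"),
        ("derives from", "Homo sapiens"),
        ("derives from", "cervix"),
        ("derives from", "epithelial cell"),
    ],
))
```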
Experimental Factor Ontology (EFO)
http://www.ebi.ac.uk/efo
Ontology as book
Title: Experimental Factor Ontology
Table of Contents
Chapter 1. Cell line
Chapter 2. Cell type
Chapter 3. Chemical Compound
Chapter 4. Organism
HeLa is a cell line. A hela is all of the following: something that is bearer of a cervical carcinoma, something that derives from a homo sapiens, something that derives from an epithelial cell, and something that derives from a cervix.
entry
Data
Types of Knowledge
Biologist’s head
Papers
Databases
Ontologies
??????
It’s not Just “Things”
• Experiments produce data about things
• Proteins, genes, chemicals, reactions, diseases, size, shape, speed, ….
• As well as this knowledge we have knowledge of how it was done
• OBI is still the “things” to do with production
• We still need the methods by which these “things” were deployed
• The protocol
Knowledge about an experiment
Workflow Run
Workflow
Provenance
Organisational
Results and Interpretation
Workflows are knowledge about methods
Get genes in region
Get pathways that contain genes
Merge data into single files
Get gene descriptions
Get pathway descriptions
Cross-reference ids
Methods:
1. A QTL (region of chromosome) is entered into the workflow, specified as base pairs. These base pairs are subsequently used to identify, in the Ensembl database, any genes that lie within this region.
2. Any genes found within this region are subsequently annotated with Entrez and UniProt identifiers.
3. The Entrez and UniProt identifiers are then passed to a KEGG id conversion Web Service, to cross-reference the input ids to KEGG gene identifiers. This enables gene descriptions and biological pathway data to be returned from KEGG.
4. Each KEGG gene id is then used in a search for KEGG pathways. Any pathways found to contain the gene are returned as KEGG pathway ids.
5. Both KEGG gene and pathway ids are then sent to individual services, provided by KEGG, which provide a description of the gene and pathway.
6. The outputs of the workflow are then combined into single flat files, which can be saved locally and used to identify novel pathways and genes within the QTL region.
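The six steps above can be written down as an executable method, which is the sense in which a workflow is knowledge about how an analysis was done. The sketch below is not the Taverna workflow itself: the lookup functions are passed in as parameters, and the stub data standing in for the Ensembl and KEGG services are entirely made up.

```python
from typing import Callable, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start base pair, end base pair)

def run_qtl_workflow(
    region: Region,
    genes_in_region: Callable[[Region], List[str]],
    cross_reference: Callable[[str], str],
    pathways_for_gene: Callable[[str], List[str]],
    describe: Callable[[str], str],
) -> str:
    """Return a tab-separated report of genes and pathways for a QTL region."""
    genes = genes_in_region(region)                       # steps 1-2: find and identify genes
    kegg_genes = [cross_reference(g) for g in genes]      # step 3: map to KEGG gene ids
    rows = []
    for gene_id in kegg_genes:
        for pathway_id in pathways_for_gene(gene_id):     # step 4: pathways per gene
            rows.append((gene_id, describe(gene_id),      # step 5: descriptions
                         pathway_id, describe(pathway_id)))
    return "\n".join("\t".join(row) for row in rows)      # step 6: merge into one flat file

# Made-up stand-ins for the remote services.
GENES = {("chr1", 100, 200): ["geneA", "geneB"]}
XREF = {"geneA": "kegg:A", "geneB": "kegg:B"}
PATHWAYS = {"kegg:A": ["path:1"], "kegg:B": ["path:1", "path:2"]}
DESCRIPTIONS = {
    "kegg:A": "made-up gene A",
    "kegg:B": "made-up gene B",
    "path:1": "made-up pathway 1",
    "path:2": "made-up pathway 2",
}

report = run_qtl_workflow(
    ("chr1", 100, 200),
    genes_in_region=lambda r: GENES.get(r, []),
    cross_reference=XREF.__getitem__,
    pathways_for_gene=lambda g: PATHWAYS.get(g, []),
    describe=DESCRIPTIONS.__getitem__,
)
print(report)
```

Because each step is an explicit call, the same structure could be used to generate the numbered methods text, rather than writing that text by hand.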
Research Objects
Method
Data
Introduction
Conclusions
Results
Human Written
Workflow
Generated Text
Semantically annotated
Model, View, Controller
Annotated Data
Controller
Projection
Text
Tables
Graphs
Steve Pettifer
http://utopia.cs.man.ac.uk/
What Next?
• Ontologies are not the only fruit
• We could stop calling them ontologies
• We need to produce “ontologies” faster
• We need to do more interesting things with our knowledge
• We need to make them pervade our tools
• We need them to be “agile”
• Open to other forms of KR and other forms of reasoning
• Adding to data automatically
• Generating our descriptions of data
Acknowledgements
• Simon Jupp for the slides
• Alan Rector and Carole Goble
• SysMO-DB for RightField (Katy Wolstencroft, Stuart Owen, Matt Horridge)
• Populous – Simon Jupp
• SWAT – Richard Power, Sandra Williams and Allan Third at the OU
• EFO – James Malone and Helen Parkinson
• Steve Pettifer for Utopia and MVC
• Paul Fisher and the Taverna team
• The myExperiment team at Southampton and Manchester