rapid development of an ontology of coriell cell lines chao pang, tomasz adamusiak, helen parkinson...

19
Rapid Development of an Ontology of Coriell Cell Lines Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone [email protected]

Upload: aileen-lawson

Post on 31-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Rapid Development of an Ontology of Coriell Cell Lines Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone

[email protected]

Master headline

Overview

• Motivation – our EBI use cases, big stuff in little time

• Methods – normalization, axiomatisation, automation

• Results – large flat ontology, adding structure

• Desiderata

• Future work

Master headline

Motivation

• Lot of large (often online) catalogues

• Infeasible to manually ontologise all of them (and cost is not easy to justify)

Master headline

High quality artefacts take time

33k classes 75k classes

FMA Gene Ontology

3k classes

OBI

30 person years

Since 1999

Since 2006

$money$

Funders

Master headline

Motivation

• Would like to include these in BioSample Database for curating data, querying, browsing

• Ontology use has proven to add value (e.g. GXA)

• RDF representations of data for integration

Master headline

Methods: A Model for Cell Lines

• Agreed with cell line ontology folk

• With addition of is_model_for, to reflect use of cell lines as models for particular diseases (they can be used as models for diseases other than the original disease borne by the subject)

Master headline

Methods: Programmatic Engineering

• Lexical concept recognition (2) - Map the terms to existing ontologies by either class label or synonyms in order to get identifiers (e.g. label – Homo sapiens and synonym – human)

• Produce a list of classes from reference ontologies (3) e.g.• Map organism terms: NCBI taxonomy ontology• Map disease terms: Disease ontology

• Map these to an upper ontology and axiomatic structure (4) cell line model on previous slide

Methods: Our “raw” data• Coriell database dump • Extract information for cell line, cell type, anatomy,

organism, sex and disease. (27002 cell lines)

Lexical Concept Recognition

• Our mapper tool was used, adapted the Levenshtein distance algorithm <http://www.ebi.ac.uk/efo/tools>

• Mapping result:• 93 organism - NCBI taxonomy ontology• 61 anatomy – MAT (species neutral), FMA and other anatomy• 23 cell type – cell ontology• 337 disease - Disease ontology (DO) (via OMIM)

• Check the mapping file manually• Inconsistency e.g. fibroma as anatomy• Unmapped terms• Unclear information e.g. buttock-thigh

Master headline

Using OWL-API to build ontology automatically

buildingOntology.java

Mapping files

Coriell cell line protocol

Reference ontologies Step one:Adding classes

Step two:Adding restriction

Coriell cell line Ontology

Spreadsheet.txt

Example: import class cervix from EFOIt has parent class organism part

It has class relations part_of some female reproductive system

Hela cell line

Master headline

Adding Structure to Normalized Hierarchy

Coriell cell line ontology

Query: human cancer cell line

Master headline

Explanation of query: human cancer cell line

Cancer

humanany cell type any organism part

Cell line

Master headline

Results In Brief

• Large (~28,000) cell line ontology, lots of axiomatisation for the axes mentioned (disease, organism, cell type, anatomy some extra bits like gender)

• Running code took ~5 minutes

• Flat asserted hierarchy under cell line; structure is all inferred using axioms as per Rector Normalization

• Makes more flexible to change and also querying more powerful

Master headline

BioPortal Analysis by Matt Horridge et al• http://bio-ontologies.knowledgeblog.org/135

• “the Coriell Cell Line Ontology, which contains over 45,000 non-trivial entailments, and an average of 4 justifications per entailment, peaking out at 65 justifications for one particular entailment“

• Coriell is quite rich axiomatically – this is what we would expect given the normalization method

• Suggestion rich ontologies can be created rapdily

Master headline

Discussion

• Good bits:• Adding new classes was simple once code was done• Re-running the code also simple if a part of pattern was changed• Can be reused for different ontology with similar sort of input• Axiomatisation gives richer querying and renders implicit knowledge

explicit

• Harder bits:• Some data clean up necessary before using the code (SQL dumps can

be messy)• Eliminating errors from concept matching (though expansion on

synonyms helped)• Textual definitions for all classes • Size of the ontology created a few issues, e.g. BioPortal not able to

browse, most reasoners can not complete on the ontology

Conclusions

• To build very big ontologies we need either• Lots of resource (i.e. money)• Lots of people (crowd-sourcing)• Good tools and methods

• Each having pros and cons• Lots of money requires, well, lots of money• Lots of people requires community buy in and ‘trust’ (so

audit trail crucial) – we’re not always good at that • Automation can introduce errors and require some curation

Master headline

Master headline

Desiderata: Querying and External URIs • Easier to use interfaces, users like it simple

• One of top user feedbacks is “we’d like it to work like Google does”

• Users also like putting URIs into a webpage and getting something sensible back i.e.:

Master headline

Future Work• Creating text definitions

• Will follow up on work by Stevens R, Malone, J et al (2011)* for creating natural language definitions from OWL

• Text mining to extract missing disease information from textual descriptions from original catalogue

• Begin curating data for BioSample Database at EBI using ontology

• Integrate Coriell ontology with rest of process from sample to analysis, e.g. software ontology

http://theswo.sourceforge.net

*Robert Stevens, James Malone, Sandra Williams, Richard Power, Alan Third (2011) Automating generation of textual class definitions from OWL to English.

Journal of Biomedical Semantics, 2011 May 17, Vol. 2 Suppl 2:S5.

Master headline

Acknowledgements

• Thank to Chao Pang, Helen Parkinson,Tomasz Adamusiak, Paulo Silva and all people from functional genomics production team.

• Gen2phen• The Coriell Institute for Medical Research• Alan Ruttenberg and Science Commons for providing the

Coriell SQL dump• Lynn Schriml and colleagues from the Disease Ontology

for OMIM mappings • Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry

Meehan for discussions on the cell line model.