ontology for the financial services industry
TRANSCRIPT
Reference Data Integration: A Strategy For The Future
Barry Smith
National Center for Ontological Research
University at Buffalo
presented at FIMA, March 21, 2012
Who am I?
National Center for Biomedical Ontology
based in Stanford Medical School, the Mayo Clinic and Buffalo Department of Philosophy
• Cleveland Clinic Semantic Database
• Duke University Health System
• University of Pittsburgh Medical Center
• German Federal Ministry of Health
• European Union eHealth Directorate
• Plant Genome Research Resource
• Protein Information Resource
Who am I?
National Center for Ontological Research (http://ncor.us)
• Joint Warfighting Center, US Joint Forces Command
• Intelligence and Information Warfare Directorate (I2WD)
• US Department of the Army Net-Centric Data Strategy Center of Excellence
• NextGen (Next Generation Air Transportation System) Ontology Team
• National Nuclear Security Administration (NNSA), Department of Energy
Some questions
• How to find data?
• How to understand data when you find it?
• How to use data when you find it?
• How to compare and integrate with other data?
• How to avoid data silos?
The Web (net-centricity) as part of the solution
• You build a site
• Others discover the site and they link to it
• The more they link, the more well known the page becomes (Google …)
• Your data becomes discoverable
1. Make your data available in a standard way on the Web
2. Use controlled vocabularies (‘ontologies’) to capture common meanings, in ways understandable to both humans and computers – Web Ontology Language (OWL)
3. Build links among the datasets to create a ‘web of data’
The roots of Semantic Technology
Controlled vocabularies for tagging (‘annotating’) data
• Hardware changes rapidly
• Organizations rapidly forming and disbanding
• Data is exploding
• Meanings of common words change slowly
• Use web architecture to annotate exploding data stores using ontologies to capture these common meanings in a stable way
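The annotation idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term IDs (`FIN:0001`, `FIN:0002`), the local codes, and the record layouts are all hypothetical and are not taken from any real vocabulary. Each data store keeps its own local codes, but tags its records with a shared, stable term ID, so records from different stores can be brought together by meaning.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: stable term IDs with agreed labels.
VOCAB = {
    "FIN:0001": "equity security",
    "FIN:0002": "debt security",
}

# Each data store keeps its own local codes, mapped once to the shared term IDs.
STORE_A_MAP = {"EQ": "FIN:0001", "BOND": "FIN:0002"}
STORE_B_MAP = {"common stock": "FIN:0001", "fixed income": "FIN:0002"}


def annotate(records, local_map, field):
    """Return copies of the records tagged with the shared term ID for `field`."""
    annotated = []
    for rec in records:
        term = local_map.get(rec[field])  # None if the local code is unmapped
        annotated.append({**rec, "term": term})
    return annotated


def integrate(*annotated_sets):
    """Group records from all stores under the shared term they are tagged with."""
    by_term = defaultdict(list)
    for records in annotated_sets:
        for rec in records:
            if rec["term"] in VOCAB:  # skip records with no recognized term
                by_term[rec["term"]].append(rec)
    return dict(by_term)


store_a = [{"id": "A1", "type": "EQ"}, {"id": "A2", "type": "BOND"}]
store_b = [{"id": "B1", "class": "common stock"}]

merged = integrate(
    annotate(store_a, STORE_A_MAP, "type"),
    annotate(store_b, STORE_B_MAP, "class"),
)
for term, recs in merged.items():
    print(VOCAB[term], [r["id"] for r in recs])
```

The local codes can change as systems are replaced; only the small mapping tables need to be updated, while the shared term IDs stay stable.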
Where we stand today
• increasing availability of semantically enhanced data and semantic software
• increasing use of XML, RDF, OWL in attempts to create useful integration of on-line data and information
• “Linked Open Data” the New Big Thing
Ontology success stories, and some reasons for failure
as of September 2010
The problem: the more Semantic Technology is successful, the more it fails
The original idea was to break down silos via common controlled vocabularies for the tagging of data
The very success of the approach leads to the creation of ever new controlled vocabularies – semantic silos – as ever more ontologies are created in ad hoc ways
The Semantic Web framework as currently conceived and governed by the W3C yields minimal standardization
Multiplying (Meta)data registries are creating data cemeteries
NCBO Bioportal (Ontology Registry)
Reasons for this effect
• Low incentives for reuse of existing ontologies
• Each organization wants its own ontology
• Poor licensing regime, poor standards, poor training
• People think: Information technology (hardware) is changing constantly, so it’s not worth the effort of getting things right
• People have egos: “We have done it this way for 30 years, we are not going to change now”
Why should you care?
• when there are many ad hoc systems, average quality will be low
• constant need for ad hoc repair through manual effort
• DoD alone spends $6 billion per annum on this problem
• regulatory agencies are recognizing the need for common controlled vocabularies
So now people are scrambling
• to learn how to create ontologies
• serious lag in creating trained expertise
• poor quality coding leads to poor quality ontologies
• poor quality ontology management
How to do it right?
• how to create an incremental, evolutionary process, where what is good survives?
• how to bring about ontology death?
A success story from biology
Old biology data
MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVISVMVGKNVKKFLTFVEDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLERCHEIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERLKRDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNFGAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCKLRSPNTPRRLRKTLDAVKALLVSSCACTARDLDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLLAFAGPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFNHDPWMDVVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDV
New biology data
[Chart: Ontology in PubMed, by year from 2000 to 2010; vertical axis from 0 to 1200]
By far the most successful: GO (Gene Ontology)
what cellular component?
what molecular function?
what biological process?
the Gene Ontology is not an ontology of genes
[Figure: microarray gene tree, gene list: all genes (14010); axis and group labels: attacked, control, time. Gene clusters are labeled as follows:]
• Puparial adhesion, Molting cycle, hemocyanin
• Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes
• Immune response, Toll regulated genes
• Amino acid catabolism, Lipid metabolism
• Peptidase activity, Protein catabolism, Immune response
Microarray data shows changed expression of thousands of genes.
How will you spot the patterns?
Why is GO successful
• built by bench biologists
• multi-species, multi-disciplinary, open source
• compare use of kilograms, meters, seconds in formulating experimental results
• natural language and logical definitions for all terms
• initially low-tech to ensure aggressive use and testing
now used not just in biology but also in hospital research
Lab / pathology data
EHR data
Clinical trial data
Family history data
Medical imaging
Microarray data
Model organism data
Flow cytometry
Mass spec
Genotype / SNP data
How will you spot the patterns?
How will you find the data you need?
over 11 million annotations relating UniProt, Ensembl and other databases to terms in the GO
Hierarchical view representing relations between represented types
~ $200 mill. invested in the GO so far
A new kind of biomedical research
Over 11 million GO annotations to biomedical research literature freely available on the web
Powerful software tool support for navigating this data means that what used to take researchers months of data comparison effort, can now be performed in milliseconds
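A minimal sketch of why shared annotations make this kind of comparison fast: once every gene carries terms from a common vocabulary, spotting a pattern in a list of changed genes reduces to counting terms. The gene names and their annotations below are made up for illustration; real annotations would come from the GO annotation files rather than a hand-written dictionary.

```python
from collections import Counter

# Illustrative annotations only: gene names and their term labels are made up.
ANNOTATIONS = {
    "geneA": {"immune response", "defense response"},
    "geneB": {"immune response"},
    "geneC": {"lipid metabolism"},
    "geneD": {"immune response", "response to stimulus"},
}


def summarize(changed_genes, annotations):
    """Count how many of the changed genes are annotated with each term."""
    counts = Counter()
    for gene in changed_genes:
        # Genes with no annotation contribute nothing to the counts.
        counts.update(annotations.get(gene, ()))
    return counts


changed = ["geneA", "geneB", "geneD", "geneX"]  # geneX has no annotation
counts = summarize(changed, ANNOTATIONS)
top_term, top_count = counts.most_common(1)[0]
print(top_term, top_count)
```

In practice a statistical enrichment test would replace the raw count, but the lookup-and-count step is the part that the shared vocabulary makes possible.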
If controlled vocabularies are to serve to remove silos
they have to be respected by many owners of data as resources that ensure accurate description of their data
– GO maintained not by computer scientists but by biologists
they have to be willingly used in annotations by many owners of data
they have to be maintained by persons who are trained in common principles of ontology maintenance
The new profession of biocurator
GO has been amazingly successful
Has created a community consensus
Has created a web of feedback loops where users of the GO can easily report errors and gaps
Has identified principles for successful ontology management
Indispensable to every drug company and every biology lab
But GO is limited in its scope
it covers only generic biological entities of three sorts:
– cellular components
– molecular functions
– biological processes
no diseases, symptoms, disease biomarkers, protein interactions, experimental processes …
Extending the GO methodology to other domains of biology and medicine
OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow)
Columns, by RELATION TO TIME: CONTINUANT (INDEPENDENT or DEPENDENT) and OCCURRENT
Rows, by GRANULARITY:
ORGAN AND ORGANISM
– independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– independent continuant: Cell (CL); Cellular Component (FMA, GO)
– dependent continuant: Cellular Function (GO)
MOLECULE
– independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– dependent continuant: Molecular Function (GO)
– occurrent: Molecular Process (GO)
The strategy of orthogonal modules
[Figure: the same matrix of ontology modules as on the preceding slide]
Ontology | Scope | URL | Custodians
Cell Ontology (CL) | cell types from prokaryotes to mammals | obo.sourceforge.net/cgi-bin/detail.cgi?cell | Jonathan Bard, Michael Ashburner, Oliver Hofman
Chemical Entities of Biological Interest (ChEBI) | molecular entities | ebi.ac.uk/chebi | Paula Dematos, Rafael Alcantara
Common Anatomy Reference Ontology (CARO) | anatomical structures in human and model organisms | (under development) | Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
Foundational Model of Anatomy (FMA) | structure of the human body | fma.biostr.washington.edu | JLV Mejino Jr., Cornelius Rosse
Functional Genomics Investigation Ontology (FuGO) | design, protocol, data instrumentation, and analysis | fugo.sf.net | FuGO Working Group
Gene Ontology (GO) | cellular components, molecular functions, biological processes | www.geneontology.org | Gene Ontology Consortium
Phenotypic Quality Ontology (PaTO) | qualities of anatomical structures | obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value | Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
Protein Ontology (PrO) | protein types and modifications | (under development) | Protein Ontology Consortium
Relation Ontology (RO) | relations | obo.sf.net/relationship | Barry Smith, Chris Mungall
RNA Ontology (RnaO) | three-dimensional RNA structures | (under development) | RNA Ontology Consortium
Sequence Ontology (SO) | properties and features of nucleic sequences | song.sf.net | Karen Eilbeck
How to recreate the success of the GO in other areas
1. create a portal for sharing of information about existing controlled vocabularies, needs and institutions operating in a given area
2. create a library of ontologies in this area
3. create a consortium of developers of these ontologies who agree to pool their efforts to create a single set of non-overlapping ontology modules
– one ontology for each sub-area
NextGen Ontology Portal
[Diagram labels: Portal; Communities; Search; Ontology Library; NextGen Enterprise Ontology]
Ontology Portal
• Two-Tiered Registry
– NextGen Ontology – consists of vetted ontologies
– Ontology Library – open to the wider community
• Ontology Metadata
– Ontology owner, domain, and location
• Ontology Search*
– Support ontology discovery
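The portal structure on this slide (a two-tiered registry, metadata per ontology, and search) might be modeled roughly as follows. This is a sketch under assumptions, not the NextGen portal's actual design: the class names, the `vetted` flag used to separate the two tiers, and the sample entries with their owners and URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyRecord:
    """Metadata kept for each registered ontology: owner, domain, location."""
    name: str
    owner: str
    domain: str
    location: str
    vetted: bool = False  # True = vetted tier, False = open library tier


class Registry:
    """A single registry whose records fall into a vetted tier or an open tier."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def search(self, keyword, vetted_only=False):
        """Find ontologies whose name or domain contains the keyword."""
        kw = keyword.lower()
        return [
            r for r in self._records
            if (kw in r.name.lower() or kw in r.domain.lower())
            and (r.vetted or not vetted_only)
        ]


registry = Registry()
# Hypothetical entries for illustration only.
registry.register(OntologyRecord(
    "Weather Ontology", "Team A", "weather", "http://example.org/wx", vetted=True))
registry.register(OntologyRecord(
    "Draft Weather Terms", "Team B", "weather", "http://example.org/wx-draft"))

print([r.name for r in registry.search("weather")])
print([r.name for r in registry.search("weather", vetted_only=True)])
```

Keeping both tiers in one registry means a draft ontology can be promoted by changing a single flag, while search can still be restricted to vetted content.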
Developers commit in advance to collaborating with developers of ontologies in adjacent domains and
to working to ensure that, for each domain, there is community convergence on a single ontology
http://obofoundry.org
The OBO Foundry: a step-by-step, principles-based approach
OBO Foundry Principles
Common governance
Common training
Robust versioning
Common architecture
OBO Foundry Modular Organization
top level
– Basic Formal Ontology (BFO)
mid-level
– Information Artifact Ontology (IAO)
– Ontology for Biomedical Investigations (OBI)
– Ontology of General Medical Science (OGMS)
domain level
– Anatomy Ontology (FMA*, CARO)
– Environment Ontology (EnvO)
– Infectious Disease Ontology (IDO*)
– Biological Process Ontology (GO*)
– Cell Ontology (CL)
– Cellular Component Ontology (FMA*, GO*)
– Phenotypic Quality Ontology (PaTO)
– Subcellular Anatomy Ontology (SAO)
– Sequence Ontology (SO*)
– Molecular Function (GO*)
– Protein Ontology (PRO*)
UCore 2.0 / UCore SL
Extension Strategy
[Diagram levels: top level; mid-level; domain level]
Military domain ontologies as extensions of the Universal Core Semantic Layer
Existing efforts to create modular ontology suites
NASA Sweet Ontologies
Military Intelligence Ontology Foundry
Planned OMG efforts:
• OMG (CIA) Financial Event Ontology
• Semantic Layer for ISO 20022 (Financial Industry Message Scheme)
Example: Financial Securities Ontology
Mike Bennett (2007)
Basic principles of ontology development
– for formulating definitions
– of modularity
– of user feedback for error correction and gap identification
– for ensuring compatibility between modules
– for using ontologies to annotate legacy data
– for using ontologies to create new data
– for developing user-specific views
Modularity designed to ensure
• non-redundancy
• annotations can be additive
• division of labor among SMEs
• lessons learned in one module can benefit work on other modules
• transferrable training
• motivation of SME users
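The non-redundancy goal can be checked mechanically. The sketch below assumes, purely for illustration, that each module is represented as a set of term labels and that two labels count as the same term when they match after trimming and lowercasing; the module names and terms are hypothetical. A real check would compare term identifiers and definitions rather than labels alone.

```python
from collections import defaultdict


def find_overlaps(modules):
    """Map each term label defined in more than one module to those modules."""
    owners = defaultdict(set)
    for module, terms in modules.items():
        for term in terms:
            # Normalize labels so that "Counterparty" and "counterparty " match.
            owners[term.strip().lower()].add(module)
    return {term: sorted(mods) for term, mods in owners.items() if len(mods) > 1}


# Hypothetical modules and terms for illustration only.
modules = {
    "instruments": {"Bond", "Equity", "Counterparty"},
    "parties": {"Counterparty", "Issuer"},
    "events": {"Coupon payment"},
}

overlaps = find_overlaps(modules)
print(overlaps)
```

An empty result means every term has exactly one home module, which is what allows annotations made against different modules to be combined without double counting.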
How should the FIMA Reference Data community solve this problem?
Major financial institutions
Major software vendors
Major data management companies
EDMC and government principals
– should pool information about the controlled vocabularies which already exist
– create a common library of these controlled vocabularies
– create a subset of thought leaders who agree to pool their efforts in the creation of a suite of ontology modules for common use
– create a strategy to disseminate and evolve the selected modules
– create a governance strategy to manage the modules over time
– allow bad ontologies to die
Urgent need for trained ontologists
Severe shortage of persons with the needed expertise
University at Buffalo Online Training and Certification Program for Ontologists
for details: [email protected]