
Page 1: Ontologies for Data Integration: A Semantic Web Perspective Training Course Olivier Bodenreider Lister Hill National Center for Biomedical Communications

Ontologies for Data Integration: A Semantic Web Perspective

Training Course

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

May 24, 2006

Schloß Dagstuhl

Biomedical Ontology in

Page 2

Outline

• From terminology integration to information integration: Unified Medical Language System (UMLS)

• UMLS in use: Mapping across terminologies

• Beyond UMLS: Towards a Biomedical Knowledge Repository

Page 3

From terminology integration to information integration

Unified Medical Language System (UMLS)

Page 4

What does UMLS stand for?

Unified Medical Language System

UMLS®

Unified Medical Language System®

UMLS Metathesaurus®

Page 5

Motivation

• Started in 1986
• National Library of Medicine
• “Long-term R&D project”

«[…] the UMLS project is an effort to overcome two significant barriers to effective retrieval of machine-readable information.

• The first is the variety of ways the same concepts are expressed in different machine-readable sources and by different people.

• The second is the distribution of useful information among many disparate databases and systems.»

Page 6

Overview through an example

Page 7

Addison’s disease

• Addison's disease is a rare endocrine disorder
• Addison's disease occurs when the adrenal glands do not produce enough of the hormone cortisol
• For this reason, the disease is sometimes called chronic adrenal insufficiency, or hypocortisolism

Page 8

Adrenal insufficiency: Clinical variants

• Primary / Secondary
– Primary: lesion of the adrenal glands themselves
– Secondary: inadequate secretion of ACTH by the pituitary gland
• Acute / Chronic
• Isolated / Polyendocrine deficiency syndrome

[Figure label: ACTH]

Page 9

Addison’s disease: Symptoms

• Fatigue
• Weakness
• Low blood pressure
• Pigmentation of the skin (exposed and non-exposed parts of the body)
• …

Page 10

AD in medical vocabularies

• Synonyms: different terms
– Addisonian syndrome
– Bronzed disease
– Addison melanoderma
– Asthenia pigmentosa
– Primary adrenal deficiency
– Primary adrenal insufficiency
– Primary adrenocortical insufficiency
– Chronic adrenocortical insufficiency
• Contexts: different hierarchies

[Figure callouts: symptoms, clinical variants, eponym]

Page 11

Organize terms

• Synonymous terms clustered into a concept
• Preferred term
• Unique identifier (CUI)

Addison's disease (C0001403)

Term | Source | Code
Addison Disease | MeSH | D000224
Primary hypoadrenalism | MedDRA | 10036696
Primary adrenocortical insufficiency | ICD-10 | E27.1
Addison's disease (disorder) | SNOMED CT | 363732003
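The clustering on this slide can be sketched as a small data structure. This is only an illustration in Python, not how the Metathesaurus is stored: the class names `SourceTerm` and `Concept` and the helper `build_index` are assumptions introduced here, while the terms, codes, and CUI are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SourceTerm:
    """One term as it appears in one source vocabulary (hypothetical class)."""
    name: str
    source: str
    code: str


@dataclass
class Concept:
    """A cluster of synonymous terms under one unique identifier (CUI)."""
    cui: str
    preferred_term: str
    terms: List[SourceTerm] = field(default_factory=list)


def build_index(concepts: List[Concept]) -> Dict[Tuple[str, str], str]:
    """Map every (source, code) pair to the CUI of the concept that holds it."""
    return {(t.source, t.code): c.cui for c in concepts for t in c.terms}


# Values copied from the slide.
addison = Concept(
    cui="C0001403",
    preferred_term="Addison's disease",
    terms=[
        SourceTerm("Addison Disease", "MeSH", "D000224"),
        SourceTerm("Primary hypoadrenalism", "MedDRA", "10036696"),
        SourceTerm("Primary adrenocortical insufficiency", "ICD-10", "E27.1"),
        SourceTerm("Addison's disease (disorder)", "SNOMED CT", "363732003"),
    ],
)

index = build_index([addison])
print(index[("ICD-10", "E27.1")])
```

With such an index, two codes from different vocabularies can be treated as equivalent whenever they resolve to the same CUI.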

Page 12

[Figure: hierarchy around Addison’s Disease in SNOMED International (Diseases/Diagnoses); node labels:]

Diseases of the endocrine system
Diseases of the Adrenal Glands
Addison’s Disease

Page 13

[Figure: hierarchy around Addison’s Disease in MeSH (Diseases); node labels:]

Endocrine Diseases
Adrenal Gland Diseases
Adrenal Gland Hypofunction
Addison’s Disease

Page 14

[Figure: hierarchy around Addison’s Disease in AOD; node labels:]

Endocrine disorder
Adrenal disorder
Adrenal cortical disorder
Adrenal cortical hypofunction
Addison’s Disease

Page 15

[Figure: hierarchy around Addison’s Disease in Read Codes; node labels:]

Endocrine disorder
Disorder of adrenal gland
Hypoadrenalism
Adrenal Hypofunction
Corticoadrenal insufficiency
Addison’s Disease

Page 16

[Figure: hierarchy around primary adrenocortical insufficiency in ICD-10; node labels:]

Disorders of other endocrine gland
Other disorders of adrenal gland
Primary adrenocortical insufficiency

Page 17

Organize concepts

• Inter-concept relationships: hierarchies from the source vocabularies
• Redundancy: multiple paths
• One graph instead of multiple trees (multiple inheritance)

[Figure: several source trees over nodes A to H combined into a single graph]
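The idea of "one graph instead of multiple trees" can be sketched as follows. This is a minimal Python illustration under stated assumptions: the edges below are hypothetical (they are not read off the slide's figure), and `merge_trees` and `ancestors` are names introduced here.

```python
from collections import defaultdict


def merge_trees(trees):
    """Merge several trees, each a list of (parent, child) edges,
    into one graph stored as child -> set of parents."""
    parents = defaultdict(set)
    for edges in trees:
        for parent, child in edges:
            parents[child].add(parent)
    return parents


def ancestors(parents, node):
    """Collect every ancestor of a node, following all parent paths."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Two hypothetical source hierarchies that both contain node E.
tree_1 = [("A", "B"), ("B", "D"), ("B", "E")]
tree_2 = [("A", "C"), ("C", "E"), ("C", "F")]

graph = merge_trees([tree_1, tree_2])
print(sorted(graph["E"]))            # E now has two parents
print(sorted(ancestors(graph, "E")))
```

In each source tree E has a single parent; in the merged graph it has two, which is the multiple inheritance the slide mentions. Edges asserted by more than one source collapse into one, which is one way to read "redundancy: multiple paths".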

Page 18

[Figure: "organize concepts": the source hierarchies combined in the UMLS around Addison’s Disease]

Node labels: Endocrine Diseases; Adrenal Gland Diseases; Adrenal Cortex Diseases; Hypoadrenalism; Adrenal Gland Hypofunction; Adrenal cortical hypofunction; Addison’s Disease
Sources: SNOMED, MeSH, AOD, Read Codes

Page 19

Relate to other concepts

• Additional hierarchical relationships
– link to other trees
– make relationships explicit
• Non-hierarchical relationships
• Co-occurring concepts
• Mapping relationships

Page 20

[Figure: "relate to other concepts": the combined hierarchy extended with additional concepts]

Node labels: Endocrine Diseases; Adrenal Gland Diseases; Adrenal Cortex Diseases; Hypoadrenalism; Adrenal Gland Hypofunction; Adrenal cortical hypofunction; Addison’s Disease; Adrenal Cortex Dysfunction; Adrenal Dysfunction; Addison’s disease due to autoimmunity; Secondary hypocortisolism; Other disorders of adrenal gland; Disorders of other endocrine gland; Adrenal Glands; Adrenal Cortex; Endocrine System; Endocrine Glands; Abdominal organ Diseases

Page 21

Categorize concepts

• High-level categories (semantic types)
• Assigned by the Metathesaurus editors
• Independently of the hierarchies in which these concepts are located

[Figure: the semantic type Disease or Syndrome shown alongside the hierarchy nodes Diseases; Endocrine Diseases; Adrenal Gland Diseases; Adrenal Gland Hypofunction; Addison’s Disease]

Page 22

How do they do that?

• Lexical knowledge
• Semantic pre-processing
• UMLS editors

Page 23

Lexical knowledge

Adrenal gland diseases
Adrenal disorder
Disorder of adrenal gland
Diseases of the adrenal glands

C0001621
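How lexical knowledge can suggest synonymy among such names can be sketched with a toy normalizer. This is an assumption-laden Python illustration, not the UMLS lexical tools: the stop-word list, the crude plural stripping, and the function name `normalize` are all introduced here for the example.

```python
import re

# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"of", "the"}


def normalize(name: str) -> str:
    """Lowercase, drop stop words, strip a trailing plural 's',
    and sort the remaining words so word order no longer matters."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    kept = []
    for word in words:
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]  # crude singular form
        kept.append(word)
    return " ".join(sorted(kept))


names = [
    "Adrenal gland diseases",
    "Adrenal disorder",
    "Disorder of adrenal gland",
    "Diseases of the adrenal glands",
]
for name in names:
    print(f"{name!r} -> {normalize(name)!r}")
```

In this toy version only "Adrenal gland diseases" and "Diseases of the adrenal glands" receive the same normalized form; linking the "disorder" variants to the same concept takes knowledge beyond string normalization.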

Page 24

Semantic pre-processing

• Metadata in the source vocabularies
• Tentative categorization
• Positive (or negative) evidence for tentative synonymy relations based on lexical features

Page 25

Additional knowledge: UMLS editors

[Figure: hierarchy with relationships added by the UMLS editors]

Node labels: Adrenal Gland Diseases; Adrenal Cortex Diseases; Adrenal Cortex Dysfunction; Hypoadrenalism; Adrenal Gland Hypofunction; Adrenal cortical hypofunction; Addison’s Disease; Other disorders of adrenal gland

Page 26

UMLS: 3 components

• SPECIALIST Lexicon (lexical resources)
– 200,000 lexical items
– Part of speech and variant information
• Metathesaurus (terminological resources)
– 5M names from over 100 terminologies
– 1M concepts
– 16M relations
• Semantic Network (ontological resources)
– 135 high-level categories
– 7000 relations among them

Page 27

UMLS Metathesaurus

Page 28

Source Vocabularies

• 139 source vocabularies
– 17 languages
• Broad coverage of biomedicine
– 5.1M names
– 1.3M concepts
– 16M relations
• Common presentation

(2006AB)

Page 29

Addison’s Disease: Concept

Addison’s Disease (C0001403)

Synonyms:
• ADRENAL INSUFFICIENCY (ADDISON'S DISEASE)
• ADRENOCORTICAL INSUFFICIENCY, PRIMARY FAILURE
• Addison melanoderma
• Melasma addisonii
• Primary adrenal deficiency
• Asthenia pigmentosa
• Bronzed disease
• Insufficiency, adrenal primary
• Primary adrenocortical insufficiency
• Addison's, disease

Names in other languages:
• MALADIE D'ADDISON - French
• Addison-Krankheit - German
• Morbo di Addison - Italian
• DOENCA DE ADDISON - Portuguese
• ADDISONOVA BOLEZN' - Russian
• ENFERMEDAD DE ADDISON - Spanish

Definition: A disease characterized by hypotension, weight loss, anorexia, weakness, and sometimes a bronze-like melanotic hyperpigmentation of the skin. It is due to tuberculosis- or autoimmune-induced disease (hypofunction) of the adrenal glands that results in deficiency of aldosterone and cortisol. In the absence of replacement therapy, it is usually fatal.

Sources: SNOMED, MeSH, AOD, Read Codes, …

Semantic type: Disease or Syndrome

Page 30

Metathesaurus: Concepts

• Concept (> 1.3M), CUI: set of synonymous concept names
• Term (> 4.6M), LUI: set of normalized names
• String (> 5.1M), SUI: distinct concept name
• Atom (> 6.2M), AUI: concept name in a given source

(2006AB)

Example:

C0000001
  L0000001
    S0000001
      A0000001 headache (source 1)
      A0000002 headache (source 2)
    S0000002
      A0000003 Headache (source 1)
      A0000004 Headache (source 2)
  L0000002
    S0000003
      A0000005 Cephalgia (source 1)
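The four identifier levels of the headache example can be sketched as a nested grouping. This is a minimal Python illustration: the tuple layout and the function `group_atoms` are assumptions made for the example and do not reflect the Metathesaurus file formats; the identifiers are the ones from the slide.

```python
from collections import defaultdict

# (AUI, SUI, LUI, CUI, name, source), copied from the slide's example.
ATOMS = [
    ("A0000001", "S0000001", "L0000001", "C0000001", "headache", "source 1"),
    ("A0000002", "S0000001", "L0000001", "C0000001", "headache", "source 2"),
    ("A0000003", "S0000002", "L0000001", "C0000001", "Headache", "source 1"),
    ("A0000004", "S0000002", "L0000001", "C0000001", "Headache", "source 2"),
    ("A0000005", "S0000003", "L0000002", "C0000001", "Cephalgia", "source 1"),
]


def group_atoms(atoms):
    """Group atoms as concept (CUI) -> term (LUI) -> string (SUI) -> [AUI]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for aui, sui, lui, cui, _name, _source in atoms:
        tree[cui][lui][sui].append(aui)
    return tree


tree = group_atoms(ATOMS)
for cui, terms in tree.items():
    print(cui)
    for lui, strings in terms.items():
        print("  ", lui)
        for sui, auis in strings.items():
            print("    ", sui, auis)
```

The same name in two sources gives two atoms but one string; case variants give two strings but one term; and lexically unrelated synonyms give two terms under one concept.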

Page 31

Cluster of synonymous terms

Concept C0001621

[…]

Term L0001621
  S0011232 Adrenal Gland Diseases
  S0011231 Adrenal Gland Disease
  S0000441 Disease of adrenal gland
  S0481705 Disease of adrenal gland, NOS
  S0220090 Disease, adrenal gland
  S0044801 Gland Disease, Adrenal

Term L0041793
  S0860744 Disorder of adrenal gland, unspecified
  S0217833 Unspecified disorder of adrenal glands

[…]

Term L0368399
  S0586222 Adrenal disease
  S0466921 ADRENAL DISEASE, NOS

[…]

Term L0181041
  S0632950 Disorder of adrenal gland
  S0354509 Adrenal Gland Disorders

[…]

Term L0161347
  S0225481 ADRENAL DISORDER
  S0627685 DISORDER ADRENAL (NOS)

[…]

Term L1279026
  S1520972 Nebennierenkrankheiten (GER)

Term L0162317
  S0226798 SURRENALE, MALADIES (FRE)

Page 32

Metathesaurus: Evolution over time

• Concepts never die (in principle)
– CUIs are permanent identifiers
• What happens when they do die (in reality)?
– Concepts can merge or split
– Resulting in new concepts and deletions

[Figure: timeline (1992 1993 1994 1995 1996 1997 1998 1999 2004…) for Addison's disease (C0001403) and Addison's disease, NOS (C0271735)]

Page 33

Metathesaurus: Relationships

• Symbolic relations: ~9 M pairs of concepts
• Statistical relations: ~7 M pairs of concepts (co-occurring concepts)
• Mapping relations: 100,000 pairs of concepts
• Categorization: Relationships between concepts and semantic types from the Semantic Network

Page 34

Symbolic relations

• Relation
– Pair of “atom” identifiers
– Type
– Attribute (if any)
– List of sources (for type and attribute)
• Semantics of the relationship: defined by its type [and attribute]

Source transparency: the information is recorded at the “atom” level
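A relation record with the fields listed on this slide might look like the sketch below. It is a hedged Python illustration only: the class `Relation`, the function `lift_to_concepts`, and every identifier and source name in the sample data are hypothetical, and nothing here describes the actual Metathesaurus relation files.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Relation:
    """A symbolic relation recorded between two atoms (hypothetical class)."""
    atom1: str
    atom2: str
    rel_type: str
    attribute: Optional[str]
    sources: Tuple[str, ...]


def lift_to_concepts(
    relations: Iterable[Relation], atom_to_cui: Dict[str, str]
) -> Dict[Tuple[str, str, str, Optional[str]], Set[str]]:
    """Aggregate atom-level relations into concept pairs while keeping
    track of which sources asserted each one (source transparency)."""
    lifted: Dict[Tuple[str, str, str, Optional[str]], Set[str]] = {}
    for r in relations:
        key = (atom_to_cui[r.atom1], atom_to_cui[r.atom2], r.rel_type, r.attribute)
        lifted.setdefault(key, set()).update(r.sources)
    return lifted


# Hypothetical atoms, concepts, and sources, for illustration only.
atom_to_cui = {
    "A0000010": "C0000010", "A0000011": "C0000010",
    "A0000020": "C0000020", "A0000021": "C0000020",
}
relations = [
    Relation("A0000010", "A0000020", "CHD", "isa", ("SRC1",)),
    Relation("A0000011", "A0000021", "CHD", "isa", ("SRC2",)),
]

lifted = lift_to_concepts(relations, atom_to_cui)
for key, sources in lifted.items():
    print(key, sorted(sources))
```

Because each relation is stored between atoms, the concept-level view can always be traced back to the sources that asserted it.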

Page 35

Symbolic relationships: Type

• Hierarchical
– Parent / Child (PAR/CHD)
– Broader / Narrower than (RB/RN)
• Derived from hierarchies
– Siblings (children of parents) (SIB)
• Associative
– Other (RO)
• Various flavors of near-synonymy
– Similar (RL)
– Source asserted synonymy (SY)
– Possible synonymy (RQ)

Page 36

Symbolic relationships: Attribute

• Hierarchical
– isa (is-a-kind-of)
– part-of
• Associative
– location-of
– caused-by
– treats
– …
• Cross-references (mapping)

Page 37

[Figure: the concept Heart in the Metathesaurus, its related concepts, and their semantic types in the Semantic Network]

Concepts (Metathesaurus): Heart; Esophagus; Left Phrenic Nerve; Heart Valves; Fetal Heart; Mediastinum; Saccular Viscus; Angina Pectoris; Cardiotonic Agents; Tissue Donors
Semantic Types (Semantic Network): Anatomical Structure; Fully Formed Anatomical Structure; Embryonic Structure; Body Part, Organ or Organ Component; Pharmacologic Substance; Disease or Syndrome; Population Group
Numbers shown in the figure: 22; 225; 97; 4; 12; 9; 31

Page 38

UMLS Semantic Network

Page 39

Semantic Network

• Semantic types (135)
– tree structure
– 2 major hierarchies
  • Entity
    – Physical Object
    – Conceptual Entity
  • Event
    – Activity
    – Phenomenon or Process

Page 40

Semantic Network

• Semantic network relationships (54)
– hierarchical (isa = is a kind of)
  • among types
    – Animal isa Organism
    – Enzyme isa Biologically Active Substance
  • among relations
    – treats isa affects
– non-hierarchical
  • Sign or Symptom diagnoses Pathologic Function
  • Pharmacologic Substance treats Pathologic Function

Page 41

“Biologic Function” hierarchy (isa)

Biologic Function
  Physiologic Function
    Organism Function
      Mental Process
    Organ or Tissue Function
    Cell Function
    Molecular Function
      Genetic Function
  Pathologic Function
    Disease or Syndrome
      Mental or Behavioral Dysfunction
      Neoplastic Process
    Cell or Molecular Dysfunction
    Experimental Model of Disease

Page 42

Associative (non-isa) relationships

[Figure: associative relationships among semantic types]

Semantic types shown: Organism; Embryonic Structure; Anatomical Abnormality; Congenital Abnormality; Acquired Abnormality; Fully Formed Anatomical Structure; Anatomical Structure; Organism Attribute; Body Substance; Body System; Body Part, Organ or Organ Component; Tissue; Cell; Cell Component; Gene or Genome; Body Space or Junction; Finding; Laboratory or Test Result; Sign or Symptom; Biologic Function; Physiologic Function; Pathologic Function; Body Location or Region; Injury or Poisoning
Relationships shown: process of; part of; property of; contains, produces; conceptual part of; evaluation of; adjacent to; location of; disrupts; co-occurs with

Page 43

Why a semantic network?

• Semantic Types serve as high level categories assigned to Metathesaurus concepts, independently of their position in a hierarchy
• A relationship between 2 Semantic Types (ST) is a possible link between 2 concepts that have been assigned to those STs
– The relationship may or may not hold at the concept level
– Other relationships may apply at the concept level

Page 44

Relationships can inherit semantics

[Figure: a Metathesaurus relation read against the Semantic Network]

Metathesaurus: Adrenal Cortex (Body Part, Organ, or Organ Component) location of Adrenal Cortical hypofunction (Disease or Syndrome)
Semantic Network: Body Part, Organ, or Organ Component isa Fully Formed Anatomical Structure; Disease or Syndrome isa Pathologic Function isa Biologic Function; Fully Formed Anatomical Structure location of Biologic Function
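The inheritance shown on this slide can be sketched as a lookup that walks up the isa hierarchy. This is a minimal Python illustration, not the UMLS implementation: the names `ISA`, `ST_RELATIONS`, `SEMANTIC_TYPE`, `lineage`, and `is_possible_link` are assumptions introduced here, and the reading of the figure's arrows (which type is linked to which) is itself an interpretation of the slide's labels.

```python
# isa links among semantic types, as labeled on the slide (child -> parent).
ISA = {
    "Disease or Syndrome": "Pathologic Function",
    "Pathologic Function": "Biologic Function",
    "Body Part, Organ, or Organ Component": "Fully Formed Anatomical Structure",
}

# Relationship asserted between two semantic types (assumed reading of the figure).
ST_RELATIONS = {
    ("Fully Formed Anatomical Structure", "location of", "Biologic Function"),
}

# Semantic types assigned to the two Metathesaurus concepts on the slide.
SEMANTIC_TYPE = {
    "Adrenal Cortex": "Body Part, Organ, or Organ Component",
    "Adrenal Cortical hypofunction": "Disease or Syndrome",
}


def lineage(semantic_type):
    """Return the type followed by all of its isa ancestors."""
    chain = [semantic_type]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain


def is_possible_link(subject, relation, obj):
    """True if some ancestor pair of the two concepts' semantic types
    carries the relation, so the link is possible at the concept level."""
    return any(
        (s, relation, o) in ST_RELATIONS
        for s in lineage(SEMANTIC_TYPE[subject])
        for o in lineage(SEMANTIC_TYPE[obj])
    )


print(lineage("Disease or Syndrome"))
print(is_possible_link("Adrenal Cortex", "location of", "Adrenal Cortical hypofunction"))
```

A positive answer only means the link is possible: a relationship between semantic types may or may not hold between two particular concepts.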

Page 45

UMLS: Summary

• Synonymous terms clustered into concepts
• Unique identifier
• Finer granularity
• Broader scope
• Additional hierarchical relationships
• Semantic categorization

Page 46

Integrating subdomains

[Figure: the UMLS at the center, linking subdomains through their vocabularies]

• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED
• Other subdomains
• Anatomy: UWDA

Page 47

Integrating subdomains

[Figure: subdomains shown without vocabularies: Biomedical literature; Genome annotations; Model organisms; Genetic knowledge bases; Clinical repositories; Other subdomains; Anatomy]

Page 48

Information integration

Genomics as an example

Page 49

NF2: Gene, protein, and disease

Neurofibromatosis 2 is an autosomal dominant disease characterized by tumors called schwannomas involving the acoustic nerve, as well as other features. The disorder is caused by mutations of the NF2 gene resulting in absence or inactivation of the protein product. The protein product of NF2 is commonly called merlin (but also neurofibromin 2 and schwannomin) and functions as a tumor suppressor.

Page 50

Schwannoma (acoustic neuroma)

http://www.mayoclinic.com

Page 51

Page 52

NF2 gene

http://staff.washington.edu/timk/cyto/human/
http://www.ncbi.nlm.nih.gov/mapview/

Page 53

Page 54

Merlin

• Synonyms
– Neurofibromin 2
– Schwannomin
– Schwannomerlin
– Neurofibromatosis-2
• 10 isoforms
• Annotations
– Negative regulation of cell proliferation
– Cytoskeleton
– Plasma membrane

[Diagram in three layers: external resources, UMLS Metathesaurus, UMLS Semantic Network]

External resources
  OMIM #101000: NEUROFIBROMATOSIS, TYPE II; NF2
  GenBank U49724: Drosophila melanogaster merlin (Dmerlin) mRNA, complete cds.

UMLS Metathesaurus (concepts and relations)
  C0027832: Neurofibromatosis 2 (Type II neurofibromatosis, Bilateral acoustic neurofibromatosis)
  C0085114: NF2 (Neurofibromin 2 gene)
  C0254123: Merlin (Schwannomin, Neurofibromin 2)
  Related concepts: Merlin, Drosophila; Tumor suppressor genes; Benign neoplasms of cranial nerves; Neurofibromatoses; Tumor suppressor proteins

UMLS Semantic Network (semantic types)
  Amino Acid, Peptide, or Protein
  Biologically Active Substance
  Neoplastic Process
  Gene or Genome

Limitations

Genes not systematically represented
  Most gene products and diseases are
Gene/Gene product-Disease relations
  Not systematically represented
  Not explicitly represented (e.g., co-occurrence)
Cross-references not systematically represented
Naming conventions (genes)

References

UMLS
  umlsinfo.nlm.nih.gov
UMLS browsers (free, but UMLS license required)
  Knowledge Source Server: umlsks.nlm.nih.gov
  Semantic Navigator: http://mor.nlm.nih.gov/perl/semnav.pl
RRF browser (standalone application distributed with the UMLS)

Recent overviews
  Bodenreider O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research; D267-D270.
  Nelson, S. J., Powell, T. & Humphreys, B. L. (2002). The Unified Medical Language System (UMLS) Project. In: Kent, Allen; Hall, Carolyn M., editors. Encyclopedia of Library and Information Science. New York: Marcel Dekker. p.369-378.

UMLS as a research project
  Lindberg, D. A., Humphreys, B. L., & McCray, A. T. (1993). The Unified Medical Language System. Methods Inf Med, 32(4), 281-91.
  Humphreys, B. L., Lindberg, D. A., Schoolman, H. M., & Barnett, G. O. (1998). The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc, 5(1), 1-11.

Technical papers
  McCray, A. T., & Nelson, S. J. (1995). The representation of meaning in the UMLS. Methods Inf Med, 34(1-2), 193-201.
  Bodenreider O. & McCray A. T. (2003). Exploring semantic groups through visual approaches. Journal of Biomedical Informatics, 36(6), 414-432.

UMLS in Use: Mapping across Vocabularies

The problem

For noun phrases extracted from medical texts, map to UMLS concepts
Then, select from the MeSH vocabulary the concepts that are the most closely related to the original concepts

[Diagram: Medical text -> Noun phrase -> UMLS -> MeSH descriptor]

Map noun phrases to UMLS

Normalization
  normalize noun phrases
  use the normalized string index
MetaMap
  approximate matching
  more aggressive approach
    use derivational variants
    allow partial matches
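
As a rough illustration of the normalization route, the sketch below builds a toy normalized string index and looks up a noun phrase in it. The function names (`normalize`, `build_index`, `lookup`) are hypothetical, and the normalization shown (lowercasing, stripping punctuation, sorting words) is a simplification: the UMLS lexical tools do more, such as handling inflection. The concept identifiers and names are the ones quoted earlier in this deck.

```python
import re
from collections import defaultdict

def normalize(phrase):
    """Lowercase, strip punctuation, and sort the words alphabetically.

    Simplified stand-in for UMLS normalization (an assumption, not the real tool).
    """
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    return " ".join(sorted(words))

def build_index(terms):
    """Map each normalized string to the set of concept identifiers it names."""
    index = defaultdict(set)
    for cui, name in terms:
        index[normalize(name)].add(cui)
    return index

# Toy term list: (concept identifier, name).
TERMS = [
    ("C0027832", "Neurofibromatosis 2"),
    ("C0027832", "Type II neurofibromatosis"),
    ("C0027832", "Bilateral acoustic neurofibromatosis"),
    ("C0254123", "Merlin"),
]
INDEX = build_index(TERMS)

def lookup(phrase):
    """Return the concept identifiers whose normalized names match the phrase."""
    return sorted(INDEX.get(normalize(phrase), set()))

print(lookup("neurofibromatosis, type II"))
```

Because word order and punctuation are discarded, "neurofibromatosis, type II" and "Type II neurofibromatosis" share one index key. An exact-match index of this kind cannot recover partial matches or derivational variants, which is where an approximate matcher such as MetaMap comes in.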

Restrict to MeSH

Based on the principle of semantic locality
Use different components of the UMLS
4 techniques of increasing aggressiveness
  Use Synonymy: MRCON + MRSO
  Use Associated expressions (ATXs): MRATX
  Explore the Ancestors: MRREL + SN
  Explore the Other related concepts: MRREL + SN
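
The idea of trying techniques of increasing aggressiveness can be sketched as a simple cascade that stops at the first technique returning a mapping. This is only an illustration: `map_to_mesh` is a hypothetical name, and the four lambdas are toy stand-ins with hard-coded results, not real lookups against the UMLS files.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Technique = Callable[[str], List[str]]

def map_to_mesh(concept: str,
                techniques: Iterable[Tuple[str, Technique]]
                ) -> Tuple[Optional[str], List[str]]:
    """Try each technique in order and return the first non-empty mapping."""
    for name, technique in techniques:
        mapped = technique(concept)
        if mapped:
            return name, mapped
    return None, []  # no mapping found

# Toy stand-ins for the four techniques (hard-coded, hypothetical results).
TECHNIQUES = [
    ("synonymy", lambda c: []),
    ("associated expressions", lambda c: []),
    ("ancestors", lambda c: ["Neck", "Veins"] if c == "Vein of neck, NOS" else []),
    ("other related concepts", lambda c: []),
]

print(map_to_mesh("Vein of neck, NOS", TECHNIQUES))
```

Ordering the techniques from least to most aggressive means a more speculative mapping is only used when the more precise ones fail.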

Restrict to MeSH: Synonymy

Term mapped to Source concept
For this concept, is there a synonym term that comes from MeSH? (MRSO)
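
A minimal sketch of the synonymy check, assuming the concept names and their sources have been loaded as (concept identifier, source, term) rows. The rows below are illustrative only, "OTHER" is a placeholder source label, and `mesh_synonyms` is a hypothetical helper. "MSH" is the source abbreviation the UMLS uses for MeSH.

```python
# Toy rows: (concept identifier, source abbreviation, term).
# The data and the "OTHER" source label are assumptions for illustration.
ROWS = [
    ("C0027832", "MSH", "Neurofibromatosis 2"),
    ("C0027832", "OTHER", "Type II neurofibromatosis"),
    ("C0027832", "OTHER", "Bilateral acoustic neurofibromatosis"),
    ("C0254123", "OTHER", "Merlin"),
]

def mesh_synonyms(cui, rows, mesh_source="MSH"):
    """Return the terms of concept `cui` that come from MeSH."""
    return [term for c, source, term in rows
            if c == cui and source == mesh_source]

print(mesh_synonyms("C0027832", ROWS))
```

An empty result means the concept has no MeSH synonym in the table, and the next technique is tried.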

Restrict to MeSH: Assoc. expressions

If not,
Is there an associated expression (ATX) that describes this concept using a combination of MeSH descriptors? (MRATX)

Example
  Endoscopic removal of intraluminal foreign body from oesophagus without incision
  = Foreign Bodies AND Esophagus/surgery (MH/SH)

Restrict to MeSH: Ancestors

If not, let us build the graph of the ancestors of this concept
  using parents and broader concepts (MRREL)
  all the way to the top
  excluding ancestors whose semantic types are not compatible with those of the source concept (MRSTY)
From the graph, select the concepts that come from MeSH (MRCONSO)
Remove those that are ancestors of another concept coming from MeSH
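
The ancestor technique can be sketched as a graph traversal followed by two filters. In the sketch below, the parent links, the set of MeSH concepts, and the function names are all assumptions made for illustration; they are not taken from MRREL or MRCONSO. The `compatible` predicate stands in for the semantic type check, and the sketch assumes that an incompatible ancestor is neither kept nor traversed further.

```python
def ancestors(start, parents, keep=lambda concept: True):
    """Return every ancestor of `start` reachable through `parents`.

    Ancestors rejected by `keep` are excluded and not traversed further.
    """
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen and keep(parent):
                seen.add(parent)
                stack.append(parent)
    return seen

def mesh_ancestors(source, parents, mesh, compatible=lambda concept: True):
    """Return the MeSH ancestors of `source` that are closest to it."""
    candidates = ancestors(source, parents, compatible) & mesh
    redundant = set()
    for concept in candidates:
        # A MeSH concept that is an ancestor of another MeSH candidate is dropped.
        redundant |= ancestors(concept, parents) & candidates
    return sorted(candidates - redundant)

# Hypothetical parent/broader links (child -> parents), invented for the example.
PARENTS = {
    "Vein of neck": ["Vein of head and neck", "Neck"],
    "Vein of head and neck": ["Systemic veins", "Head and neck"],
    "Systemic veins": ["Veins"],
    "Veins": ["Blood Vessels"],
    "Blood Vessels": ["Vascular structure"],
    "Neck": ["Head and neck"],
    "Head": ["Head and neck"],
    "Head and neck": ["Body part"],
}
MESH = {"Neck", "Head", "Veins", "Blood Vessels"}  # assumed MeSH membership

print(mesh_ancestors("Vein of neck", PARENTS, MESH))
```

In this toy graph, "Blood Vessels" is dropped because it is an ancestor of "Veins", so only the most specific MeSH ancestors remain.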

Restrict to MeSH: Other related concepts

If not, explore the other related concepts (MRREL)
  whose semantic types are compatible with those of the source concept (MRSTY)
From those, select the concepts that come from MeSH (MRCONSO)
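
A minimal sketch of this last technique, assuming relations are available as (concept, relation, related concept) rows and that two concepts are "compatible" when they share at least one semantic type. That compatibility rule is a simplifying assumption, and the relations, type assignments, and MeSH membership below are invented for illustration. "RO" is the MRREL label for relationships other than synonymous, broader, or narrower.

```python
def other_related_in_mesh(source, relations, semantic_types, mesh):
    """Return MeSH concepts linked to `source` by a non-hierarchical relation."""
    source_types = semantic_types[source]
    result = []
    for concept, rel, related in relations:
        if concept != source or rel != "RO":
            continue  # keep only "other" relations of the source concept
        shares_type = bool(semantic_types.get(related, set()) & source_types)
        if shares_type and related in mesh:
            result.append(related)
    return result

# Invented toy data (assumptions, not actual UMLS content).
RELATIONS = [
    ("Merlin", "RO", "Tumor suppressor proteins"),
    ("Merlin", "RO", "Neurofibromatosis 2"),
    ("Merlin", "PAR", "Membrane proteins"),
]
SEMANTIC_TYPES = {
    "Merlin": {"Amino Acid, Peptide, or Protein", "Biologically Active Substance"},
    "Tumor suppressor proteins": {"Amino Acid, Peptide, or Protein"},
    "Neurofibromatosis 2": {"Neoplastic Process"},
    "Membrane proteins": {"Amino Acid, Peptide, or Protein"},
}
MESH = {"Tumor suppressor proteins", "Neurofibromatosis 2", "Membrane proteins"}

print(other_related_in_mesh("Merlin", RELATIONS, SEMANTIC_TYPES, MESH))
```

Here the disease is rejected because its semantic type differs from those of the protein, and the parent is ignored because hierarchical relations belong to the ancestor technique.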

Restrict to MeSH: Example

Source concept (SC): Vein of neck, NOS

  There is a MeSH term in the synonyms of SC
  SC is described by a combination of MeSH terms (ATX)
  The ancestors of SC contain MeSH terms
  MeSH terms from non-hierarchically related concepts

Neck + Vein

Restrict to MeSH: Example

[Diagram: graph of the ancestors of "Vein of neck, NOS". Nodes: Vein of neck, NOS; Vein of head and neck, NOS; Systemic veins; Veins; Blood Vessels; Vascular structure; Neck; Head; Head and neck, NOS; Body part, NOS]

Overall results

Synonymy: 24%
Built-in mapping: 1%
Ancestors
  From concept: 49%
  From children: 2%
  From siblings: 1%
Other: 11%
No mapping: 12%

References

Bodenreider O, Nelson SJ, Hole WT, Chang HF. Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. Proceedings of AMIA Annual Symposium 1998:815-9.
  http://mor.nlm.nih.gov/pubs/pdf/1998-amia-ob.pdf
Fung KW, Bodenreider O. Utilizing the UMLS for semantic mapping between terminologies. Proceedings of AMIA Annual Symposium 2005:266-270.
  http://mor.nlm.nih.gov/pubs/pdf/2005-amia-kwf.pdf

Advanced Library Services: Towards a Biomedical Knowledge Repository

Disclaimer

The project presented in this talk is being proposed as a new research initiative at the Lister Hill National Center for Biomedical Communications
  It has not been approved or reviewed by NLM yet
  The ideas presented here may not reflect NLM’s views
In collaboration with Tom Rindflesch, NLM

Delivering Health Information

Provide biomedical text to health care professionals and consumers
  Maintain NLM’s cutting edge
  Support public health and healthy behavior
  Assist clinical practice
  Enable biomedical research and discovery
Exploit current Library resources and advanced technology

Why additional services?

Biomedical literature is growing at an ever faster pace
  High-throughput approach to literature processing
Integration between literature and other resources is insufficient
  Adequate for navigation purposes
  Insufficient for knowledge processing
Information retrieval is the starting point, not the end of the journey for the researcher

Integration for navigation purposes

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

What additional services?

Multi-document summarization
  Extract and visualize the facts extracted from 250 recent abstracts on the treatment of Parkinson’s disease
Question answering
  Clinical and biological questions
Knowledge discovery
  Connect facts from heterogeneous resources
Refined information retrieval
  Indexing on relations in addition to concepts or main heading/subheading associations

Fact-based vs. concept-based

(concept, relationship, concept) triples are the common denominator of the various advanced services
  Facts
  Relations
  Semantic predications
  RDF triples
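
To make the (concept, relationship, concept) idea concrete, here is a minimal sketch of a triple store with wildcard matching. The `Triple` type and `match` function are hypothetical names. The first fact is the example used later in this deck; the other two are added purely for illustration.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Illustrative facts; only the first is quoted from the deck.
FACTS = [
    Triple("Rasagiline", "TREATS", "Parkinson Disease"),
    Triple("Levodopa", "TREATS", "Parkinson Disease"),
    Triple("Parkinson Disease", "ISA", "Neurodegenerative Diseases"),
]

def match(facts, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [t for t in facts
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)]

# "What treats Parkinson Disease?" asked as a pattern over triples.
for fact in match(FACTS, predicate="TREATS", obj="Parkinson Disease"):
    print(fact.subject)
```

The same pattern query could serve question answering, summarization, or retrieval, which is the sense in which triples are a common denominator.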

Biomedical knowledge repository

Knowledge integration
  Unique repository
  Common format
  Seamless environment
  Phenotype and genotype information together
Enabling resource for the various services
  Summarization
  Question answering
  Knowledge discovery
  Refined information retrieval

Sources of knowledge

Biomedical literature
  Facts extracted from MEDLINE abstracts and publicly available full-text articles using text mining techniques
  Other corpora
Structured databases / knowledge bases
  NCBI resources
  Model organism databases
  Terminological knowledge
  …
Contributed knowledge
  The repository is open to collaborators outside NLM

Annotated knowledge

Provenance information
  Source (e.g., PMID)
  Extraction mechanism
  Timestamp
Frequency information
  Redundancy
Collaborative annotation
  “Was this information useful?”
  Context of use/usefulness
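
One way to picture annotated knowledge is a fact record that carries its provenance, with frequency derived by counting the distinct sources that assert the same triple. The sketch below is an assumption about how such records might look: the field names and `redundancy` are hypothetical, and the PMIDs are obvious placeholders, not real articles.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedFact:
    subject: str
    predicate: str
    obj: str
    source: str     # provenance, e.g., a PMID (placeholders below)
    extractor: str  # extraction mechanism
    timestamp: str  # date the fact was extracted

def redundancy(facts):
    """Count the distinct sources asserting each (subject, predicate, object)."""
    sources = defaultdict(set)
    for fact in facts:
        sources[(fact.subject, fact.predicate, fact.obj)].add(fact.source)
    return {triple: len(found) for triple, found in sources.items()}

# Placeholder records invented for illustration.
FACTS = [
    AnnotatedFact("Rasagiline", "TREATS", "Parkinson Disease",
                  "PMID:0000001", "text mining", "2006-05-24"),
    AnnotatedFact("Rasagiline", "TREATS", "Parkinson Disease",
                  "PMID:0000002", "text mining", "2006-05-24"),
    AnnotatedFact("Levodopa", "TREATS", "Parkinson Disease",
                  "PMID:0000002", "text mining", "2006-05-24"),
]

for triple, count in redundancy(FACTS).items():
    print(triple, count)
```

Keeping provenance on every record lets a redundancy count double as a rough confidence signal while still pointing back to each source.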

Semantic Web perspective

Common format for knowledge
  Resource Description Framework (RDF)
Common identification scheme
  Uniform Resource Identifier (URI)
Standard tools
  RDF browsers
  RDF “reasoners”
High level of interest for biomedicine in the SW community
  Health Care and Life Sciences Interest Group
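
As a small illustration of RDF as a common format with URIs as identifiers, the sketch below writes a fact as one line of N-Triples, a plain-text RDF serialization. The namespace `http://example.org/bkr/` and the helper names are hypothetical; a real repository would use its own identifiers, for instance ones derived from UMLS concept identifiers.

```python
from urllib.parse import quote

BASE = "http://example.org/bkr/"  # hypothetical namespace, for illustration only

def uri(kind, name):
    """Build an angle-bracketed URI reference, percent-encoding the name."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"

def to_ntriple(subject, predicate, obj):
    """Serialize one (concept, relationship, concept) fact as an N-Triples line."""
    return (f"{uri('concept', subject)} "
            f"{uri('relation', predicate)} "
            f"{uri('concept', obj)} .")

print(to_ntriple("Rasagiline", "TREATS", "Parkinson Disease"))
```

Once facts are expressed this way, generic RDF browsers and reasoners can in principle consume them without knowing how they were produced.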

[Diagram, dated 01DEC05: the Biomedical Knowledge Repository and its surroundings]
  Services: Information Retrieval, Question Answering, Knowledge Discovery, Document Summarization
  Knowledge inputs: Text Mining, Terminological Knowledge, Other K. Sources, Contributed Knowledge
  Other labels: Source selection, UMLS, MetaMap, SemRep, MEDLINE, CT.gov, Model organism annotation databases, OMIM

Towards a Biomedical Knowledge Repository

Creating the repository

[Diagram, built up over nine slides under the banner "Enhanced Information Management for Medicine"]
  Text sources: Medline, PubMed, ClinicalTrials.gov
  Text Mining produces semantic relations, e.g., Rasagiline TREATS Parkinson Disease
  Structured Biomedical Data: Entrez Gene, Genetics Home Reference, OMIM, Gene Ontology
  UMLS
  Contributed Knowledge
  Target: Biomedical Knowledge Repository

Advanced library services

[Diagram, built up over four slides: Medline, PubMed, ClinicalTrials.gov, Text Mining, Structured Biomedical Data, UMLS, and Contributed Knowledge around the Biomedical Knowledge Repository, with one label highlighted per slide]
  Enhanced Information Management for Medicine
  Knowledge Discovery
  Question Answering
  Summarization and Focused Retrieval

Summarizing Biomedical Text

Search
  Medline
  ClinicalTrials.gov
Summarize documents
  Most salient semantic relations
Visualize the summary
Link the semantic relations to
  Original text
  Related structured knowledge

Text Mining Workflow

[Diagram: the query "Parkinson disease/therapy" passes through information retrieval (294 articles retrieved), text mining, and summarization, yielding a network of relations]

Treatment of Parkinson’s disease: SemGen output

[Graph of semantic relations]
  Concepts: Movement Disorders, Parkinson Disease, pramipexole, Dopamine Agonists, Dopamine, Brain, rasagiline, Levodopa, Entire subthalamic nucleus, Neurodegenerative Diseases, entacapone, Anhedonia, Gene Therapy, Deep brain Stimulation, Procedure, Depressive disorder, Bilateral breast cancer, Dementia, Dyskinetic syndrome
  Relations: treats, location of, occurs in, isa

Page 105:

Treatment of Parkinson's disease: SemGen output + UMLS relations + additional UMLS concepts

[Figure: the same network of semantic relations, extended with relations and concepts from the UMLS.]

Concepts: Parkinson Disease, Movement Disorders, Neurodegenerative Diseases, Dementia, Depressive disorder, Anhedonia, Dyskinetic syndrome, Bilateral breast cancer, Levodopa, pramipexol, rasagiline, entacapone, Dopamine, Dopamine Agonists, Gene Therapy, Deep brain Stimulation, Procedure, Brain, Entire subthalamic nucleus

Additional concepts: Catechol-O-methyl-transferase inhibitor, Monoamine Oxidase Inhibitors, Antiparkinson Agents, Antidepressive Agents

Relations: treats, isa, location of, part of, occurs in, associated with
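The step from the previous slide to this one, enriching relations extracted from text with relations and concepts from a terminology, can be sketched as a set union restricted to the concepts already in the graph. This is only an illustration under stated assumptions: the edges below are hypothetical examples rather than the edges of the figure, the `knowledge` set stands in for relations that would come from the UMLS, and `augment` is a made-up helper name.

```python
# Hypothetical edges extracted from text, as (subject, relation, object).
# Illustrative only, not the actual edges shown in the figure.
extracted = {
    ("entacapone", "treats", "Parkinson Disease"),
    ("rasagiline", "treats", "Parkinson Disease"),
}

# Hypothetical terminology relations standing in for UMLS relations.
knowledge = {
    ("entacapone", "isa", "Catechol-O-methyl-transferase inhibitor"),
    ("rasagiline", "isa", "Monoamine Oxidase Inhibitors"),
    ("Aspirin", "isa", "Anti-Inflammatory Agents, Non-Steroidal"),
}


def augment(extracted, knowledge):
    """Add knowledge-source relations that touch a concept already in the graph."""
    nodes = {node for subj, _, obj in extracted for node in (subj, obj)}
    added = {t for t in knowledge if t[0] in nodes or t[2] in nodes}
    return extracted | added


augmented = augment(extracted, knowledge)
for subj, rel, obj in sorted(augmented):
    print(subj, rel, obj)
```

Restricting the union to relations that touch an existing concept keeps the graph focused on the query topic: the unrelated Aspirin relation is left out, while the two drug classes enter the graph as additional concepts.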

Page 106:

Conclusions

Need to go beyond information retrieval

Need to integrate multiple, heterogeneous knowledge sources to support knowledge processing, not only navigation

Synergistic with the Semantic Web
  Emerging standard framework
  W3C Health Care and Life Sciences Interest Group
    http://www.w3.org/2001/sw/hcls/
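To make the link to the Semantic Web concrete, semantic relations of the kind shown earlier can be written as RDF statements. The sketch below serializes (subject, relation, object) triples as N-Triples lines using only the Python standard library. The `http://example.org/` namespace, the URI layout, and the `to_ntriples` helper are assumptions made for illustration; the slides do not prescribe a URI scheme, and the example relation is not taken from the course data.

```python
from urllib.parse import quote

# Placeholder namespace, assumed for illustration only.
BASE = "http://example.org/"


def to_uri(kind, name):
    """Build a URI for a concept or relation name under the placeholder namespace."""
    return f"<{BASE}{kind}/{quote(name.replace(' ', '_'))}>"


def to_ntriples(triples):
    """Serialize (subject, relation, object) triples as N-Triples lines."""
    return [
        f"{to_uri('concept', s)} {to_uri('relation', r)} {to_uri('concept', o)} ."
        for s, r, o in triples
    ]


# Hypothetical example relation.
lines = to_ntriples([("Levodopa", "treats", "Parkinson Disease")])
print("\n".join(lines))
```

In practice the URIs would point to identifiers from the source terminologies, so that statements from different knowledge sources can be merged on shared identifiers.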

Page 107:

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: [email protected]
Web: mor.nlm.nih.gov