From biomedical information integration to knowledge discovery

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA

5th International Symposium on Semantic Mining in Biomedicine (SMBM)

Institute of Computational Linguistics University of Zurich, Switzerland

September 4, 2012

Semantic mining

Extract information from structured and unstructured sources
  From text: text mining
  From ontologies and knowledge bases

Integrate information
  From structured and unstructured sources

Aggregate information
  Subsumption reasoning

Use the extracted information for a meaningful purpose
  Hypothesis generation / knowledge discovery
  Better information retrieval
  Question answering

Outline

Knowledge, integration and aggregation

Knowledge sources
  Structured sources
  Relations extracted from text

Integrating relations from text mining and ontologies

Biomedical Knowledge Repository

KNOWLEDGE, INTEGRATION AND AGGREGATION

Definitional knowledge

Universally true

Examples
  Lung cancer has_location Lung
  Myocardial infarction isa Cardiovascular disease
  Liver part_of Abdomen (canonical anatomy, in a given species)

Typically found in ontologies

Useful as background knowledge

Assertional knowledge

True in a given context

Examples
  Aspirin treats headache
  IL-13 inhibits COX2
  Chest pain manifestation_of Myocardial infarction
  Ciprofloxacin causes Tendon rupture

Typically found in knowledge bases (and in text)

Useful for knowledge discovery, question answering, biocuration support, etc.

Definitional vs. assertional knowledge

Definitional knowledge
  Universally true
  Typically found in ontologies
  Useful as background knowledge

Assertional knowledge
  True in a given context
  Typically found in knowledge bases (and in text)
  Useful for knowledge discovery, question answering, biocuration support, etc.

Why integrate assertional and definitional knowledge?

To bridge the granularity mismatch
  Differences in granularity between
    What is expressed in text (or structured sources)
    What is needed in “semantic mining” applications

To increase statistical power
  Low frequency for individual, fine-grained assertions
  Higher frequency when frequencies are aggregated at a coarser level

Aggregating frequencies

Moxifloxacin causes Tendon rupture [7]
Levofloxacin causes Tendon rupture [2]
Ciprofloxacin causes Tendon rupture [3]

Moxifloxacin, Levofloxacin and Ciprofloxacin each isa fluoroquinolone

fluoroquinolone causes Tendon rupture [12]
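
The aggregation on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the counts and the isa links are the ones shown on the slide, while the names `assertion_counts`, `isa_parent` and `aggregate_by_subject_class` are hypothetical and each subject is assumed to have a single parent.

```python
from collections import defaultdict

# Frequencies of fine-grained assertions, as shown on the slide.
assertion_counts = {
    ("Moxifloxacin", "causes", "Tendon rupture"): 7,
    ("Levofloxacin", "causes", "Tendon rupture"): 2,
    ("Ciprofloxacin", "causes", "Tendon rupture"): 3,
}

# isa links from each drug to its class (definitional knowledge).
# ASSUMPTION: one parent per concept, which real ontologies do not guarantee.
isa_parent = {
    "Moxifloxacin": "fluoroquinolone",
    "Levofloxacin": "fluoroquinolone",
    "Ciprofloxacin": "fluoroquinolone",
}


def aggregate_by_subject_class(counts, parents):
    """Sum assertion frequencies after replacing each subject by its isa parent."""
    totals = defaultdict(int)
    for (subject, predicate, obj), frequency in counts.items():
        coarse_subject = parents.get(subject, subject)
        totals[(coarse_subject, predicate, obj)] += frequency
    return dict(totals)


aggregated = aggregate_by_subject_class(assertion_counts, isa_parent)
print(aggregated)
```

The three low-frequency assertions collapse into a single class-level assertion whose frequency is the sum of the individual counts.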

Bridging the granularity mismatch

A researcher is interested in glycosylation and its implications for one disorder: congenital muscular dystrophy.

Link between glycosyltransferase activity and congenital muscular dystrophy?

LARGE (GeneID: 9215) has_associated_disease Congenital muscular dystrophy, type 1D

LARGE (GeneID: 9215) has_molecular_function acetylglucosaminyltransferase activity

Using SPARQL to test a hypothesis

Find all the genes annotated with the GO molecular function glycosyltransferase, or any of its descendants, and associated with any form of congenital muscular dystrophy

[Figure: query graph pattern with nodes GO ID, GO ID, Gene ID, OMIM ID and OMIM name, and edges is_a and has textual description]
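
The talk runs this query in SPARQL against an RDF store. As a rough stand-in, the same pattern can be sketched in plain Python over a handful of triples. The identifiers are the ones reported in the talk; the exact is_a paths through GO:0008194 and GO:0016758, the predicate spellings, and the names `TRIPLES`, `DESCRIPTIONS`, `descendants` and `genes_for` are assumptions made for illustration.

```python
# Triples (subject, predicate, object) built from identifiers shown in the talk.
# ASSUMPTION: the two is_a paths through GO:0008194 and GO:0016758 are one
# reading of the talk's figure, used here only for illustration.
TRIPLES = {
    ("GO:0008375", "is_a", "GO:0008194"),
    ("GO:0008194", "is_a", "GO:0016757"),
    ("GO:0008375", "is_a", "GO:0016758"),
    ("GO:0016758", "is_a", "GO:0016757"),
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
}

# Textual descriptions attached to OMIM entries.
DESCRIPTIONS = {"MIM:608840": "Muscular dystrophy, congenital, type 1D"}


def descendants(root, triples):
    """Return root plus every term that reaches root through is_a edges."""
    children = {}
    for subject, predicate, obj in triples:
        if predicate == "is_a":
            children.setdefault(obj, set()).add(subject)
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_for(function_root, disease_words, triples, descriptions):
    """Genes annotated with function_root (or a descendant) and linked to a
    phenotype whose description contains every word in disease_words."""
    functions = descendants(function_root, triples)
    annotated = {
        (s, o) for s, p, o in triples
        if p == "has_molecular_function" and o in functions
    }
    hits = []
    for gene, go_id in annotated:
        for s, p, disease in triples:
            text = descriptions.get(disease, "").lower()
            if (s == gene and p == "has_associated_phenotype"
                    and all(word in text for word in disease_words)):
                hits.append((gene, go_id, disease))
    return sorted(hits)


results = genes_for("GO:0016757", ["congenital", "muscular dystrophy"],
                    TRIPLES, DESCRIPTIONS)
print(results)
```

The subsumption step (`descendants`) is what bridges the granularity mismatch: the gene is annotated with a specific function, while the question is asked at the level of glycosyltransferase.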

Results: instantiated graph

[Figure: instantiated query graph]
  GO:0008375 (acetylglucosaminyltransferase) is_a GO:0016757 (glycosyltransferase)
  EG:9215 (LARGE)
  MIM:608840 has textual description Muscular dystrophy, congenital, type 1D

From glycosyltransferase to congenital muscular dystrophy

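[Figure: path from glycosyltransferase to congenital muscular dystrophy]
  GO:0016757 (glycosyltransferase), with GO:0008194 and GO:0016758 shown in the isa hierarchy above GO:0008375 (acetylglucosaminyltransferase)
  EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase)
  EG:9215 (LARGE) has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)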

KNOWLEDGE SOURCES

Knowledge sources

Ontologies – definitional knowledge (mostly)
  Terminology integration systems
    Unified Medical Language System (NLM)
    BioPortal (NCBO)

Relations extracted from text – assertional knowledge (mostly)
  Text corpus
    MEDLINE
  Relation extraction system
    SemRep (NLM), MedLEE (Columbia)
    Commercial systems, specialized systems

Unified Medical Language System

SPECIALIST Lexicon (lexical resources)
  460,000 lexical items
  Part of speech and variant information

Metathesaurus (terminological resources)
  8M names from over 160 terminologies
  2.7M concepts
  16M relations

Semantic Network (ontological resources)
  133 high-level categories
  7000 relations among them

Metathesaurus Basic organization

Concepts
  Synonymous terms are clustered into a concept
  Properties are attached to concepts, e.g.,
    Unique identifier
    Definition

Relations
  Concepts are related to other concepts
  Properties are attached to relations, e.g.,
    Type of relationship
    Source

Organize terms

Synonymous terms clustered into a concept
  Preferred term
  Unique identifier (CUI)

Example: Addison's disease (CUI C0001403)
  Addison Disease                         MeSH        D000224
  Primary hypoadrenalism                  MedDRA      10036696
  Primary adrenocortical insufficiency    ICD-10      E27.1
  Addison's disease (disorder)            SNOMED CT   363732003

Integrating subdomains

[Figure: the UMLS integrating the vocabularies of several subdomains]
  Biomedical literature: MeSH
  Genome annotations: GO
  Model organisms: NCBI Taxonomy
  Genetic knowledge bases: OMIM
  Clinical repositories: SNOMED CT
  Anatomy: FMA
  Other subdomains

Integrating subdomains

[Figure: the subdomains to be integrated: biomedical literature, genome annotations, model organisms, genetic knowledge bases, clinical repositories, anatomy, other subdomains]

Trans-namespace integration

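[Figure: the same subdomains and vocabularies, with one concept mapped across namespaces]
  MeSH (biomedical literature): Addison Disease (D000224)
  SNOMED CT (clinical repositories): Addison's disease (363732003)
  UMLS: C0001403
  Other vocabularies shown: GO (genome annotations), NCBI Taxonomy (model organisms), OMIM (genetic knowledge bases), FMA (anatomy), other subdomains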
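
Trans-namespace integration amounts to going from a code in one vocabulary to the shared concept identifier and back out to another vocabulary. The sketch below uses the codes listed in the talk for Addison's disease; the `atoms` layout and the `translate` helper are hypothetical, and the sketch assumes at most one code per source for a concept, which the real Metathesaurus does not guarantee.

```python
# (CUI, source vocabulary, code in that source, term), as listed in the talk.
atoms = [
    ("C0001403", "MeSH", "D000224", "Addison Disease"),
    ("C0001403", "MedDRA", "10036696", "Primary hypoadrenalism"),
    ("C0001403", "ICD-10", "E27.1", "Primary adrenocortical insufficiency"),
    ("C0001403", "SNOMED CT", "363732003", "Addison's disease (disorder)"),
]

# Index 1: (source, code) -> CUI
code_to_cui = {(source, code): cui for cui, source, code, _ in atoms}

# Index 2: CUI -> {source: (code, term)}
# ASSUMPTION: one code per source for each concept (a simplification).
cui_to_codes = {}
for cui, source, code, term in atoms:
    cui_to_codes.setdefault(cui, {})[source] = (code, term)


def translate(source, code, target):
    """Map a code from one vocabulary to another through the shared CUI.

    Returns (code, term) in the target vocabulary, or None if either the
    source code or the target vocabulary is unknown.
    """
    cui = code_to_cui.get((source, code))
    if cui is None:
        return None
    return cui_to_codes[cui].get(target)


print(translate("MeSH", "D000224", "SNOMED CT"))
```

Because every source code points at the same CUI, no pairwise mapping table between vocabularies is needed.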

Organize concepts

Inter-concept relationships: hierarchies from the source vocabularies

Redundancy: multiple paths

One graph instead of multiple trees (multiple inheritance)

[Figure: source hierarchies over concepts A to H combined into a single graph]

SemRep

Part of the Semantic Knowledge Representation project at NLM
  Tom Rindflesch & Marcelo Fiszman

Knowledge extraction system for the automatic summarization system Semantic MEDLINE
  http://skr3.nlm.nih.gov/SemMedDemo/

SemRep

Extract semantic predications from biomedical research literature (MEDLINE citations)

Based on
  Generalizations about the structure of English
  Structured domain knowledge: UMLS

Balances linguistic insight with practical implementation
  Underspecified syntax
  Core predications only
  Limited by domain

SemRep: Extract Predication

Text: … Exemestane after non-steroidal aromatase inhibitor for post-menopausal women with advanced breast cancer

Predication: Aromatase Inhibitor TREATS Breast Carcinoma
  Aromatase Inhibitor: Metathesaurus concept
  TREATS: Semantic Network relation
  Breast Carcinoma: Metathesaurus concept
  Concepts and relation both come from the Unified Medical Language System

Several Evaluations

Focused on biomedical subdomains, e.g.
  Clinical treatment, genetic etiology of disease, pharmacogenomics

Focused on structure, e.g.
  Hypernymic predications, comparatives, nominalizations

Overall
  Precision is around 75% (lower for molecular biology)
  Recall is around 60%

Predication Database: SemMedDB

Processed all of MEDLINE
  More than 21 million citations
  Titles and abstracts

SemRep predications extracted
  57 million predications (through 06/30/2012)

Made available to the research community
  MySQL database
  RDF triples

INTEGRATING RELATIONS FROM TEXT MINING AND ONTOLOGIES

[Figure: Treatment of Parkinson’s disease, SemRep output]
  Concepts: Parkinson Disease, Movement Disorders, Neurodegenerative Diseases, Dyskinetic syndrome, Dementia, Depressive disorder, Anhedonia, Bilateral breast cancer, pramipexole, rasagiline, Levodopa, entacapone, Dopamine, Dopamine Agonists, Gene Therapy, Deep brain Stimulation Procedure, Brain, Entire subthalamic nucleus
  Relations: treats, isa, location of, occurs in

[Figure: Treatment of Parkinson’s disease, SemRep output + UMLS relations + additional UMLS concepts]
  Concepts from SemRep: Parkinson Disease, Movement Disorders, Neurodegenerative Diseases, Dyskinetic syndrome, Dementia, Depressive disorder, Anhedonia, Bilateral breast cancer, pramipexole, rasagiline, Levodopa, entacapone, Dopamine, Dopamine Agonists, Gene Therapy, Deep brain Stimulation Procedure, Brain, Entire subthalamic nucleus
  Additional UMLS concepts: Catechol-O-methyltransferase inhibitor, Monoamine Oxidase Inhibitors, Antiparkinson Agents, Antidepressive Agents
  Relations: treats, isa, location of, part of, occurs in, associated with

Original graph

Adding hierarchy between any two concepts in the graph
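
The slide itself is a graph image, so the following is only one plausible reading of this step: for every pair of concepts already in the graph, add an isa edge when the ontology says one is an ancestor of the other, even if the path goes through concepts that are not in the graph. The toy `ontology_isa` edges and the names `ancestors` and `hierarchy_edges` are assumptions for illustration, not content extracted from the UMLS.

```python
# Concepts already present in the graph built from extracted relations.
graph_nodes = {"pramipexole", "Dopamine Agonists",
               "Parkinson Disease", "Movement Disorders"}

# ASSUMPTION: toy isa edges (child -> parents) standing in for the ontology.
# "Parkinsonian Disorders" is deliberately absent from graph_nodes.
ontology_isa = {
    "pramipexole": {"Dopamine Agonists"},
    "Parkinson Disease": {"Parkinsonian Disorders"},
    "Parkinsonian Disorders": {"Movement Disorders"},
}


def ancestors(concept, isa):
    """All concepts reachable from concept by following isa edges upward."""
    seen, stack = set(), [concept]
    while stack:
        for parent in isa.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def hierarchy_edges(nodes, isa):
    """isa edges between any two graph nodes related in the ontology."""
    edges = set()
    for node in nodes:
        for ancestor in ancestors(node, isa):
            if ancestor in nodes:
                edges.add((node, "isa", ancestor))
    return edges


new_edges = hierarchy_edges(graph_nodes, ontology_isa)
print(sorted(new_edges))
```

Parkinson Disease ends up linked to Movement Disorders although the intermediate concept never appears in the graph.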

Add UMLS concepts for improving connectivity

Pruning

Break up large cluster

Aggregation

After aggregation

Liqin Wang U. Utah

BIOMEDICAL KNOWLEDGE REPOSITORY

Biomedical Knowledge Repository

Integrated set of relations
  From the UMLS Metathesaurus
  Extracted from MEDLINE by SemRep

Together with metadata
  Source of the relations (provenance)

Semantic Web technologies
  RDF store (Virtuoso)

Representation

Contextualized relation (instance level)
  Captopril treats Congestive heart failure, PMID:12345
  Metadata: PMID:12345 publication_date 9/4/2012

Non-contextualized relation (class level)
  Captopril treats Congestive heart failure
  ACE Inhibitors treats Cardiovascular disease

Non-contextualized relation (semantic type level)
  Pharm. substance treats Disease or Syndrome
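
The three levels of representation can be sketched as plain data structures. This is an illustration only: the class names, the `lift` helper and the two lookup tables are hypothetical, and the values are the example from the slide (with "Pharm. substance" spelled out as the UMLS semantic type Pharmacologic Substance).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A non-contextualized relation: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class ContextualizedRelation:
    """An instance-level relation tied to the citation it was extracted from."""
    relation: Relation
    pmid: str


# Metadata attached to the provenance, as on the slide.
publication_date = {"PMID:12345": "9/4/2012"}

# ASSUMPTION: single-parent lookups, enough for the slide's example.
isa_parent = {
    "Captopril": "ACE Inhibitors",
    "Congestive heart failure": "Cardiovascular disease",
}
semantic_type = {
    "Captopril": "Pharmacologic Substance",
    "Congestive heart failure": "Disease or Syndrome",
}


def lift(relation, mapping):
    """Rewrite both arguments of a relation through a concept mapping."""
    return Relation(mapping[relation.subject], relation.predicate,
                    mapping[relation.obj])


instance = ContextualizedRelation(
    Relation("Captopril", "treats", "Congestive heart failure"), "PMID:12345")

class_level = instance.relation                 # drop the context
parent_level = lift(class_level, isa_parent)    # aggregate along isa
type_level = lift(class_level, semantic_type)   # semantic type level

print(class_level, parent_level, type_level, sep="\n")
```

Keeping the instance level separate from the class level is what lets provenance (the PMID and its date) survive while the class-level and type-level relations are shared across citations.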

Status

Experimental

Fully populated
  UMLS 2012AA
  50M relations extracted from MEDLINE

SemMedDB available for download

UMLS in RDF not yet available for download

Not available as a SPARQL endpoint
  Licensing issues
  Lack of access control in RDF stores

Potential applications

Multi-document summarization
  Semantic MEDLINE “plus”

Information retrieval of relations
  Beyond keywords or concepts

Simple question answering
  Which drugs treat congestive heart failure?

Knowledge discovery
  Swanson’s paradigm (e.g., finding “B”s)
  Patterns of relations
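
Finding “B”s in Swanson's paradigm means looking for intermediate concepts that are related to a starting concept A and to a target concept C when A and C are not directly related. A minimal sketch follows; the toy triples are loosely modeled on Swanson's well-known fish oil and Raynaud example and are not data from the talk, and `find_bs` is a hypothetical helper name.

```python
# ASSUMPTION: toy predications for illustration only, not SemMedDB content.
toy_triples = [
    ("Fish oil", "reduces", "Blood viscosity"),
    ("Fish oil", "inhibits", "Platelet aggregation"),
    ("Fish oil", "treats", "Hypertriglyceridemia"),
    ("Blood viscosity", "associated_with", "Raynaud disease"),
    ("Platelet aggregation", "associated_with", "Raynaud disease"),
]


def find_bs(a, c, triples):
    """Return the sorted intermediate concepts B with A -> B and B -> C.

    Returns an empty list when A and C are already directly related,
    since the goal is to surface undiscovered connections.
    """
    if any(s == a and o == c for s, _, o in triples):
        return []
    reached_from_a = {o for s, _, o in triples if s == a}
    leading_to_c = {s for s, _, o in triples if o == c}
    return sorted(reached_from_a & leading_to_c)


bs = find_bs("Fish oil", "Raynaud disease", toy_triples)
print(bs)
```

In a repository that also holds isa relations, the same search could be run after aggregating A, B or C to a coarser level, which is where the integration with ontologies pays off.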

A knowledge discovery platform

Collaboration with domain experts
  Effect of cortisol on sleep quality in aging men

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA

Contact: [email protected]
Web: mor.nlm.nih.gov
