UNIVERSITÀ DI PAVIA

A UMLS-Based System for Literature-Based Discovery in Medicine

Matteo Gabetta

MEDINFO, Copenhagen, August 21st 2013

Angelo Nuzzo IIT@SEMM, Milan, 2011 | MEDINFO 2013 - Copenhagen, August 21st 2013 | Matteo Gabetta

Literature Based Discovery (LBD)

Discover unknown relationships within existing scientific knowledge

Swanson DR: “Fish oil, Raynaud’s syndrome, and undiscovered public knowledge”. Perspectives in Biology and Medicine 1986, 30(1):7-18.

Literature Based Discovery

• Methods of discovery: OPEN vs. CLOSED
• Sources of knowledge: abstracts, full text, MeSH, …
• Knowledge representation: concepts, (groups of) words
• Knowledge extraction: text mining techniques
• Relationship measurement: citation frequency, association rules, …
• Process automation: user interaction level

System characteristics

• Methods of discovery: OPEN discovery
• Sources of knowledge: abstracts
• Knowledge representation: UMLS concepts
• Knowledge extraction: text mining techniques
• Relationship measurement: Support/Confidence from association rule theory
• Process automation: highly interactive discovery process
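The OPEN discovery mode listed above can be illustrated with a minimal sketch. It assumes each abstract has already been reduced to a set of concept labels. The function names and the toy abstracts are hypothetical, loosely modelled on Swanson's fish oil example, and are not taken from the system itself.

```python
from collections import defaultdict


def cooccurring(docs, concept):
    """Return {other_concept: number of abstracts mentioning both}."""
    counts = defaultdict(int)
    for concepts in docs:
        if concept in concepts:
            for other in concepts - {concept}:
                counts[other] += 1
    return dict(counts)


def open_discovery(docs, a):
    """A -> B -> C: find C concepts linked to A only through some B concept."""
    b_concepts = set(cooccurring(docs, a))
    candidates = defaultdict(set)
    for b in b_concepts:
        for c in cooccurring(docs, b):
            # Skip the start concept and anything already co-cited with it.
            if c != a and c not in b_concepts:
                candidates[c].add(b)
    return dict(candidates)


# Toy abstracts (hypothetical data): each set holds the concepts found in one abstract.
docs = [
    {"Raynaud", "blood viscosity"},
    {"Raynaud", "platelet aggregation"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"Raynaud", "vasoconstriction"},
]

print(open_discovery(docs, "Raynaud"))
```

In this toy corpus "fish oil" is never co-cited with "Raynaud", yet it is reachable through two intermediate B concepts, which is the kind of hidden link open discovery looks for.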

System characteristics

Moreover:
• Co-cited UMLS concepts = related concepts
• Semantic Types used for filtering
• Literature-Mining Database as a persistence layer

Technologies:
• Java
• Entrez Programming Utilities (eUtils)
• GWT (Google Web Toolkit)
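As a rough illustration of "co-cited = related" combined with Semantic Type filtering, the sketch below counts concepts that appear in the same abstract as a seed concept and keeps only those of an allowed type. It is written in Python for brevity, although the slides list Java as the implementation language. The CUI C0007193 and the type T028 appear in the slides. The other CUIs are placeholders, the remaining type codes are illustrative, and the data layout is an assumption.

```python
from collections import Counter


def co_cited(abstracts, seed_cui, allowed_types):
    """Count concepts co-cited with seed_cui whose semantic type is allowed."""
    counts = Counter()
    for annotations in abstracts:
        cuis = {cui for cui, _ in annotations}
        if seed_cui not in cuis:
            continue  # the seed concept is not cited in this abstract
        for cui, sem_type in annotations:
            if cui != seed_cui and sem_type in allowed_types:
                counts[cui] += 1
    return counts


# Hypothetical annotations: one set of (CUI, semantic type) pairs per abstract.
abstracts = [
    {("C0007193", "T047"), ("CUI_GENE_1", "T028"), ("CUI_DRUG_1", "T121")},
    {("C0007193", "T047"), ("CUI_GENE_1", "T028"), ("CUI_GENE_2", "T028")},
    {("CUI_GENE_2", "T028"), ("CUI_DRUG_1", "T121")},
]

related = co_cited(abstracts, "C0007193", allowed_types={"T028"})
print(related.most_common())
```

The third abstract contributes nothing because it does not cite the seed concept, and the drug placeholder is dropped by the type filter.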

System Workflow

[Figure: workflow diagram shown in four stages: overview, (AB), (BC), (final)]

Support & Confidence
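For readers unfamiliar with association rule theory, the following sketch computes the two measures over abstracts represented as concept sets. Support is taken here as an absolute co-citation count and confidence as support divided by the number of abstracts citing the antecedent. The "above average" filter reads the average as an arithmetic mean, and the confidence direction (B to A) is also an assumption, since the transcript does not preserve the formulas from these slides.

```python
from statistics import mean


def support(docs, x, y):
    """Number of abstracts citing both x and y."""
    return sum(1 for d in docs if x in d and y in d)


def confidence(docs, antecedent, consequent):
    """support(antecedent, consequent) / number of abstracts citing antecedent."""
    n_antecedent = sum(1 for d in docs if antecedent in d)
    if n_antecedent == 0:
        return 0.0
    return support(docs, antecedent, consequent) / n_antecedent


def above_average(docs, a, candidates):
    """Keep candidates whose support and confidence both exceed the mean."""
    stats = {b: (support(docs, a, b), confidence(docs, b, a)) for b in candidates}
    avg_support = mean([s for s, _ in stats.values()])
    avg_confidence = mean([c for _, c in stats.values()])
    return {b for b, (s, c) in stats.items() if s > avg_support and c > avg_confidence}


# Toy corpus (hypothetical): concept labels per abstract.
docs = [
    {"A", "B1"}, {"A", "B1"}, {"A", "B1", "B2"}, {"A", "B3"},
    {"B2"}, {"B2"}, {"B3"}, {"B1"},
]

print(support(docs, "A", "B1"))      # co-citation count of A and B1
print(confidence(docs, "B1", "A"))   # share of B1 abstracts that also cite A
print(above_average(docs, "A", ["B1", "B2", "B3"]))
```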

The INHERITANCE project

Integrated Heart Research In Translational Genetics of Cardiomyopathies in Europe

• Dilated cardiomyopathies
• 3-year health research project
• European Commission funding (7th Framework Programme)
• 11 European centers

Validation

“Re-discover” DCM/gene association

• Only literature prior to 1st explicit DCM/gene association

TNNT2 TPM1 DES LMNA
TTN MYH7 DMD MVCL
MYBPC3 ABCC9 DSP PLN
ACTC CLP LDB3 SGCD

Validation: idea

“Re-discover” DCM/gene association

• Only literature prior to 1st explicit DCM/gene association

Timeline:
• Nov 1975, DCM: Angiology. 1975 Nov;26(10):723-33. The differential diagnosis of congestive cardiomyopathy and ischemic cardiomyopathy by echocardiography. Shors CM, et al.
• Apr 1982, LMNA: J Biol Chem. 1982 Apr 25;257(8):4328-32. Oligomeric structure of the major nuclear envelope protein lamin B. Shelton KR, et al.
• Dec 1999, LMNA+DCM: N Engl J Med. 1999 Dec 2;341(23):1715-24. Missense mutations in the rod domain of the lamin A/C gene as causes of dilated cardiomyopathy and conduction-system disease. Fatkin D, et al.

Validation: an example

• A string: “Dilated cardiomyopathy”
• A concept: “Cardiomyopathy, Dilated – (C0007193)”
• Query dates: (Apr 1982 – Nov 1999)
• Literature A obtained
• B concepts:
  o Semantic Type filter (21 types allowed)
  o Support & Confidence (greater than average)
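A date-restricted literature query of this kind could be expressed through the Entrez Programming Utilities roughly as follows. This is a sketch that only builds an ESearch URL and does not call the service. How the system actually issues its queries is not shown in the slides, so the helper name, the choice of publication date (`pdat`) as the date type and the `retmax` value are assumptions.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, mindate, maxdate, retmax=100):
    """Build an ESearch URL limited to a publication-date window (YYYY/MM)."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date (assumed choice)
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,     # maximum number of PMIDs returned
    }
    return ESEARCH + "?" + urlencode(params)


# Window from the example slide: Apr 1982 to Nov 1999.
url = pubmed_search_url("Dilated cardiomyopathy", "1982/04", "1999/11")
print(url)
```

Restricting the window to dates before the first explicit DCM/gene paper is what keeps the "re-discovery" honest: the system only sees literature that predates the known association.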

Validation: an example

• Query dates: (Apr 1982 – Nov 1999)
• Literature B obtained
• C concepts:
  o One Semantic Type: “Gene or Genome – T028”

Is LMNA among the C concepts?

Evaluation of Support and Score

Validation: results

Gene   | First date | First date w/ DCM | B concepts    | # Papers
TNNT2  | 1994 May   | 2000 Jan          | Not Found     | 5
TTN    | 1975 Jan   | 1994 Oct          | 64            | 546
MYBPC3 | 1993 Feb   | 1997 Mar          | Not Found     | 17
ACTC   | 1977 Feb   | 1998 May          | 98            | 1313
TPM1   | 1974 Jan   | 2000 Jan          | Not Found     | 51
MYH7   | 1989 Feb   | 2000 Jan          | Not Found     | 35
ABCC9  | 2001 Apr   | 2004 Apr          | Not Found     | 9
CLP    | 1991 Sep   | 1997 Feb          | Not Found     | 11
DES    | 1976 Dec   | 1990 Jan          | 82            | 943
DMD    | 1978 May   | 1990 Feb          | 35            | 290
DSP    | 1982 Jan   | 2000 Oct          | 189           | 313
LDB3   | 1993 Jan   | 2003 Dec          | Not Found     | 14
LMNA   | 1983 Jan   | 1999 Dec          | 166           | 214
MVCL   | 1985 Jan   | 1997 Jan          | Not Found     | 30
PLN    | 1975 Jan   | 1990 May          | 45            | 203
SGCD   | 1999 Aug   | 1999 Aug          | Not Available | 2

Validation: results

Gene | Score  | Support | Rank (Support) | Rank (Score)
TTN  | 26832  | 92      | 68/542         | 41/542
ACTC | 203577 | 1025    | 7/662          | 6/662
DES  | 21598  | 150     | 11/349         | 8/349
DMD  | 15268  | 300     | 2/349          | 21/349
DSP  | 256598 | 1115    | 5/887          | 8/887
LMNA | 252739 | 752     | 9/822          | 5/822
PLN  | 7906   | 47      | 69/380         | 75/380
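One simple way to read this table is to check, gene by gene, which measure places the known DCM gene higher among the C concepts. The sketch below does only that arithmetic on the ranks reported above. The helper names are hypothetical, and the heuristic Score itself is not reproduced, since its definition is not given in the transcript.

```python
# Ranks copied from the table: (rank by Support, rank by Score), as "position/total".
ranks = {
    "TTN": ("68/542", "41/542"),
    "ACTC": ("7/662", "6/662"),
    "DES": ("11/349", "8/349"),
    "DMD": ("2/349", "21/349"),
    "DSP": ("5/887", "8/887"),
    "LMNA": ("9/822", "5/822"),
    "PLN": ("69/380", "75/380"),
}


def parse_rank(text):
    """Split 'position/total' into two integers."""
    position, total = text.split("/")
    return int(position), int(total)


def compare(ranks):
    """Return (genes ranked higher by Score, genes ranked higher by Support)."""
    score_better, support_better = [], []
    for gene, (by_support, by_score) in ranks.items():
        support_pos, _ = parse_rank(by_support)
        score_pos, _ = parse_rank(by_score)
        # A smaller position means a higher rank; ties would count for Support.
        if score_pos < support_pos:
            score_better.append(gene)
        else:
            support_better.append(gene)
    return score_better, support_better


score_better, support_better = compare(ranks)
print("Score ranks higher:", score_better)
print("Support ranks higher:", support_better)
```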

Discussion and Future Developments

• Effective in ranking DCM related genes
• Heuristic score good alternative to Support
• Limitation: fails for C concepts with small literature
• Analyze in depth the “threshold problem”
• Overcome the empirical set-up of some parameters
• Practical comparison with other systems
• Improve effectiveness of Text Mining system

Thank You.

In loving memory of Gilles Belley
