
A System for Finding Biological Entities that Satisfy Certain Conditions from Texts

Wei Zhou, Clement Yu

University of Illinois at Chicago

Weiyi Meng

SUNY at Binghamton


Outline

Problem statement

Techniques and methods

Experimental results

Discussion and conclusion

CIKM 2008, by Clement Yu from UIC


Problem statement

Given a complex biological question, output relevant passages (or excerpts) where the answer can be found.


An Example

A sample question:

What [GENES] are involved in insect segmentation?

Target: GENES

Qualification concepts: 1) insect 2) segmentation

A sample relevant passage:

In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects.

[hb, ftz, and eve are targets found in the passage]


Techniques and methods

Identify concepts in queries and texts

Use of domain knowledge

Related concepts (query expansion)

Gene symbol disambiguation

Conceptual IR models


Identify concepts in queries and texts

In queries

PubMed automatic term mapping

In texts

Window size: all component words appear within a certain window size.

An example:

“...Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who ...”

[Query concept: colon cancer]
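The window-size rule for texts can be illustrated with a minimal Python sketch. The slides do not give the window size or the tokenization, so `window_size=6`, the `tokenize` helper, and the function name `concept_in_window` are assumptions made only for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def concept_in_window(concept, text, window_size=6):
    """Return True if every word of `concept` occurs inside some run of
    `window_size` consecutive tokens of `text` (word order is ignored).
    The default window size is an assumption, not a value from the slides."""
    words = set(tokenize(concept))
    tokens = tokenize(text)
    for start in range(len(tokens)):
        window = tokens[start:start + window_size]
        if words.issubset(window):
            return True
    return False

sentence = ("Women who are postmenopausal and who have never used hormone "
            "replacement therapy have a higher risk of colon, but not rectal, "
            "cancer than do women who")

# "colon" and "cancer" are five tokens apart, so a window of 6 matches
# while a window of 2 (adjacent words only) does not.
print(concept_in_window("colon cancer", sentence))
print(concept_in_window("colon cancer", sentence, window_size=2))
```

A larger window finds more discontinuous mentions such as "colon, but not rectal, cancer", at the cost of more accidental matches.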


Use of domain knowledge

Gene/protein species control (rule-based): if a query is asking for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant.

Example:

Query: What [GENES] are involved in axon guidance in C.elegans?

An irrelevant passage because of a different species:

“We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes”.

[Ptp52F is not a relevant target because the passage is about Drosophila, not C.elegans]
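A rule of this kind could be sketched as follows. The species lexicon `SPECIES_TERMS` and the function names are hypothetical, and the decision to keep passages that mention no species at all is an assumption; the slides state only that genes/proteins of other species are treated as irrelevant.

```python
# Hypothetical miniature species lexicon; a real system would need a curated one.
SPECIES_TERMS = {
    "c. elegans": ["c.elegans", "c. elegans", "caenorhabditis elegans"],
    "drosophila": ["drosophila", "d. melanogaster"],
}

def species_mentioned(text):
    """Return the set of species whose terms occur in the text."""
    lowered = text.lower()
    return {species for species, terms in SPECIES_TERMS.items()
            if any(term in lowered for term in terms)}

def passes_species_control(query, passage):
    """Reject a passage when the query names a species and the passage
    mentions only other species."""
    wanted = species_mentioned(query)
    found = species_mentioned(passage)
    if not wanted or not found:
        # Assumption: with no species constraint, or nothing in the passage
        # that contradicts it, the passage is kept.
        return True
    return bool(wanted & found)

query = "What [GENES] are involved axon guidance in C.elegans?"
passage = ("We describe DPTP52F, which is probably the last remaining RPTP "
           "encoded in the Drosophila genome. Ptp52F mutations cause specific "
           "CNS and motor axon guidance phenotypes.")

print(passes_species_control(query, passage))  # False: passage is about Drosophila
```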


Use of domain knowledge

Compilation of Instances from Thesauruses: Retrieve concepts from UMLS, genes from Entrez Gene, and map them to the TREC entity types.

An example:

[Target types]: TUMOR TYPES

[Dictionary]: UMLS Metathesaurus

[Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma
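Once instances are compiled per entity type, they can be looked up in candidate passages. The sketch below assumes a plain in-memory dictionary (`INSTANCES`) and case-insensitive whole-phrase matching; both are illustrative choices, and the test sentence is made up rather than taken from the collection.

```python
import re

# Hypothetical miniature dictionary keyed by TREC entity type, holding the
# three instances named on the slide.
INSTANCES = {
    "TUMOR TYPES": ["Lung Cancer", "T-cell lymphoma", "Pheochromocytoma"],
}

def find_targets(entity_type, passage):
    """Return the dictionary instances of `entity_type` found in `passage`,
    matched case-insensitively on word boundaries."""
    found = []
    for instance in INSTANCES.get(entity_type, []):
        pattern = r"\b" + re.escape(instance) + r"\b"
        if re.search(pattern, passage, flags=re.IGNORECASE):
            found.append(instance)
    return found

# Made-up sentence, used only to exercise the lookup.
text = "Patients with pheochromocytoma or lung cancer were excluded."
print(find_targets("TUMOR TYPES", text))
```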


Related concepts

Synonyms

Hyponyms (one-level only)

Hypernyms (one-level only)

Lexical variants

Related abbreviations


Related concepts: lexical variants

Type 1:

Automatically generate lexical variants using manually created heuristics:

e.g., PLA2 → PLA 2, PLAII, and PLA II

Note: PLA2: Phospholipase A2
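The slides do not list the heuristics themselves. As a sketch, two plausible ones that reproduce the PLA2 example are inserting a space at the letter/digit boundary and replacing the Arabic number with a Roman numeral; the function name and the restriction to numbers 1 to 10 are assumptions.

```python
import re

# Roman numerals for small numbers only; the 1-10 range is an assumption.
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}

def lexical_variants(term):
    """Generate variants of a term made of letters followed by a number,
    e.g. PLA2 -> PLA 2, PLAII, PLA II. Other terms yield no variants."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", term)
    if not match:
        return set()
    letters, number = match.group(1), int(match.group(2))
    variants = {f"{letters} {number}"}          # space at the boundary
    roman = ROMAN.get(number)
    if roman:
        variants.add(f"{letters}{roman}")       # Roman numeral, no space
        variants.add(f"{letters} {roman}")      # Roman numeral with space
    return variants

print(sorted(lexical_variants("PLA2")))
```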


Related concepts: lexical variants

Type 2:

Retrieve additional lexical variants from a term database of MEDLINE

e.g., PLA2 → PL-A2

Note: PLA2: Phospholipase A2


Related concepts: lexical variants

Type 3:

Retrieve additional lexical variants by recognizing equivalent long forms of an abbreviation.

6 subtypes of Type 3:

Type 3.1: Identical after stemming. Example: APC: "antigen presenting cell" ≈ "antigen presented cell"

Type 3.2: Different by a small edit distance. Example: HPV: "Human papillomavirus" ≈ "Human papillomaviral"

Type 3.3: Identical after normalization. Example: NFkb: "Nuclear factor kappa beta" ≈ "Nuclear factor kb"

Type 3.4: Different ordering. Example: Abeta: "amyloid beta protein" ≈ "beta amyloid protein"

Type 3.5: Extra words. Example: ACD: "cerebral amyloid angiopathies" ≈ "cerebral beta amyloid angiopathies"

Type 3.6: Internal abbreviations. Example: APC: "ag presenting cell" ≈ "antigen presenting cell"
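Three of the Type 3 checks (small edit distance, different ordering, extra words) can be sketched without external resources. The thresholds `max_distance=3` and `max_extra=1` and all function names are assumptions; stemming, normalization, and internal abbreviations would need a stemmer or a lexicon and are left out.

```python
def tokens(phrase):
    """Lowercase a long form and split it on whitespace."""
    return phrase.lower().split()

def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def small_edit_distance(a, b, max_distance=3):
    """Type 3.2: the long forms differ by a few character edits."""
    return 0 < edit_distance(a.lower(), b.lower()) <= max_distance

def different_ordering(a, b):
    """Type 3.4: same words, different order."""
    return tokens(a) != tokens(b) and sorted(tokens(a)) == sorted(tokens(b))

def extra_words(a, b, max_extra=1):
    """Type 3.5: the shorter form is a word subsequence of the longer one."""
    short, long_ = sorted((tokens(a), tokens(b)), key=len)
    if not 0 < len(long_) - len(short) <= max_extra:
        return False
    remaining = iter(long_)
    return all(word in remaining for word in short)

print(small_edit_distance("Human papillomavirus", "Human papillomaviral"))
print(different_ordering("amyloid beta protein", "beta amyloid protein"))
print(extra_words("cerebral amyloid angiopathies",
                  "cerebral beta amyloid angiopathies"))
```

All three calls print `True` for the slide's own examples. Looser thresholds would merge more long forms but also risk conflating distinct concepts.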


Related concepts: related abbreviations

Abbreviations whose definitions (or long forms) contain the query concept.

For example, some related abbreviations for the concept “lung cancer” are:

SCLC (small cell lung cancer)

LCSS (lung cancer symptom scale)

NSCLC (non-small cell lung cancer)
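Given an abbreviation dictionary, related abbreviations can be found by testing whether the query concept occurs as a contiguous word sequence inside each long form. The dictionary `ABBREVIATIONS` below is a hypothetical stand-in built from examples on these slides, and treating hyphens as word breaks is an assumption.

```python
# Hypothetical abbreviation dictionary: abbreviation -> long form.
ABBREVIATIONS = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "HPV": "human papillomavirus",
}

def related_abbreviations(concept, abbreviations=ABBREVIATIONS):
    """Return abbreviations whose long form contains the concept as a
    contiguous sequence of words."""
    target = concept.lower().split()
    if not target:
        return []
    size = len(target)
    related = []
    for abbreviation, long_form in abbreviations.items():
        # Assumption: hyphens separate words ("non-small" -> "non", "small").
        words = long_form.lower().replace("-", " ").split()
        if any(words[i:i + size] == target
               for i in range(len(words) - size + 1)):
            related.append(abbreviation)
    return related

print(related_abbreviations("lung cancer"))
```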


Gene symbol disambiguation

3 simple rules are defined to disambiguate gene symbols from:

Abbreviations of non-gene meanings (Rules 1 & 2)

Example: “Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM.” [NOD is a gene symbol, but it has a non-gene meaning here because it has a non-gene definition “non-obese diabetic”]

Common English words (Rule 3)

Example: “The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study.” [“Kit” is a common English word, but it has a gene meaning here because of the adjacent word “gene”]
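The slides give examples but not the exact wording of the three rules, so the following is only a guess at how such checks might look. It assumes that a symbol defined in parentheses right after a long form whose initials spell it is a non-gene abbreviation unless that long form is a known gene name, and that a common English word counts as a gene only when followed by a cue word. The word lists and function names are hypothetical.

```python
import re

# Hypothetical resources, for illustration only.
COMMON_ENGLISH_WORDS = {"kit", "cat", "not", "was"}
GENE_CUES = {"gene", "genes", "protein", "proteins"}
KNOWN_GENE_LONG_FORMS = {"phospholipase a2"}

def local_definition(symbol, sentence):
    """Return the words just before '(SYMBOL)' if their initials spell the
    symbol, e.g. 'non obese diabetic' for NOD; otherwise None."""
    match = re.search(r"([\w\s-]+)\(" + re.escape(symbol) + r"\)", sentence)
    if not match:
        return None
    words = [w for w in re.split(r"[\s-]+", match.group(1).strip()) if w]
    candidate = words[-len(symbol):]
    initials = "".join(word[0] for word in candidate).lower()
    return " ".join(candidate).lower() if initials == symbol.lower() else None

def is_gene_mention(symbol, sentence):
    """Apply the two assumed checks to one symbol in one sentence."""
    definition = local_definition(symbol, sentence)
    if definition is not None and definition not in KNOWN_GENE_LONG_FORMS:
        return False  # locally defined with a non-gene long form
    if symbol.lower() in COMMON_ENGLISH_WORDS:
        pattern = r"\b" + re.escape(symbol) + r"\b\s+(\w+)"
        following = re.findall(pattern, sentence, flags=re.IGNORECASE)
        return any(word.lower() in GENE_CUES for word in following)
    return True

nod_sentence = ("Here, utilizing non-obese diabetic (NOD) mice deficient for "
                "CD154 (CD154-KO/NOD), we have identified a mandatory role of "
                "CD4 T cells as the functional source of CD154.")
kit_sentence = ("The Kit gene, which codes for the KIT ligand (KITL) receptor "
                "or stem cell factor, was one of the genes identified.")

print(is_gene_mention("NOD", nod_sentence))  # False: defined as "non-obese diabetic"
print(is_gene_mention("Kit", kit_sentence))  # True: followed by the cue word "gene"
```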


Conceptual IR Models

Model 1: Differentiate target instances

Model 2: Equally weight target instances


Conceptual IR Models – Model 1


Conceptual IR Models – Model 2


Experimental results

Data sets and evaluation metrics

Impact of different techniques and methods

Comparison with best reported results


Data sets and evaluation metrics

Query collection: 36 questions collected from biologists in 2007.

Document collection: 162,259 Highwire full-text documents in HTML format.

Performance metrics:

Passage MAP

Aspect MAP

Document MAP
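For reference, mean average precision (MAP) at the document level can be computed as in the sketch below. This is the standard textbook definition, assuming each ranking lists a document at most once; the passage and aspect variants use different units of relevance and are not reproduced here. The document identifiers are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant
    document appears, divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant)
               for ranked, relevant in runs) / len(runs)

# Two made-up queries: AP = (1/1 + 2/3) / 2 for the first, 1/2 for the second.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(round(mean_average_precision(runs), 4))
```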


Impact of different techniques and methods


Impact of different techniques and methods


Comparison with best reported results


The improvement of our result over the best reported results is significant (22% for automatic and 16.7% for non-automatic in passage retrieval).


Summary

Studied five different levels of related concepts for query expansion and examined their impacts on retrieval effectiveness.

Achieved significant improvement over the best reported results

Compared two conceptual IR models in retrieval effectiveness

Evaluated a simple method for gene symbol disambiguation


Conclusions

1. Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved the retrieval effectiveness.


Conclusions

2. The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and that these variants might not be collected in the existing biomedical thesauruses.


Future work

Improve the quality of target instances retrieved from different resources

Improve gene symbol disambiguation method

Handle pronouns

More evaluations on other gold standards


Questions

Thanks
