Information Extraction from Literature Yue Lu BeeSpace Seminar Oct 24, 2007


Page 1: Information Extraction from Literature

Information Extraction from Literature

Yue Lu
BeeSpace Seminar
Oct 24, 2007

Page 2: Information Extraction from Literature

Outline

Overview of BeeSpace v4
Entity Recognition
Relation Extraction

Page 3: Information Extraction from Literature

Overview

BeeSpace V4
  Deeper semantic base than the current v3 system
  Entities and relations vs. mutual information

Four levels
  Level1: Entity Recognition
  Level2: Entity Association Mining
  Level3: Relation Extraction
  Level4: Inference and Hypothesis Generation

Page 4: Information Extraction from Literature

Overview

Level1: Entity Recognition (detailed later)
Level2: Entity Association Mining
  Suppose entities are properly tagged
  Utilize the co-occurrence patterns of entities to extract semantics
  e.g. a bee biologist may want to know which genes are important for foraging behavior
  Similar to the TREC Genomics 2007 task
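A minimal sketch of the co-occurrence idea, assuming entities are already tagged inline. The `<BEHAVIOR>` tag name, the gene names geneA/geneB, and the sentences are all invented for illustration; the slides do not specify how associations are scored, so plain sentence-level counts are used here.

```python
import re
from collections import Counter
from itertools import product

# Matches inline tags such as <GENE>...</GENE>; the BEHAVIOR tag name is an assumption.
TAG = re.compile(r"<(GENE|BEHAVIOR)>(.*?)</\1>")

def cooccurrence_counts(sentences):
    """Count how often each (gene, behavior) pair appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        genes, behaviors = set(), set()
        for tag, text in TAG.findall(sentence):
            (genes if tag == "GENE" else behaviors).add(text.lower())
        for pair in product(sorted(genes), sorted(behaviors)):
            counts[pair] += 1
    return counts

# Hypothetical tagged sentences, not taken from any real collection.
sentences = [
    "<GENE>geneA</GENE> expression changes with <BEHAVIOR>foraging</BEHAVIOR>.",
    "Levels of <GENE>geneA</GENE> rise before <BEHAVIOR>foraging</BEHAVIOR> begins.",
    "<GENE>geneB</GENE> is expressed in the labial segment.",
]
counts = cooccurrence_counts(sentences)
print(counts.most_common())
```

Ranking genes by their count with a given behavior gives a rough answer to the bee biologist's question; a real system would likely normalize the counts (for example with mutual information, as in v3).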

Page 5: Information Extraction from Literature

TREC Genomics 2007

e.g. “Which [PATHWAYS] are possibly involved in the disease ADPKD?”

Currently only retrieval techniques
  Gene synonym expansion
  Conjunctive query interpretation
  User relevance feedback

Tagged entities definitely would help
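A rough sketch of two of the retrieval techniques listed above: synonym expansion (OR within a concept) combined with a conjunctive interpretation of the query (AND across concepts). The synonym table, the documents, and the function names `expand` and `matches` are illustrative assumptions, and plain substring matching stands in for a real retrieval engine.

```python
def expand(term, synonyms):
    """Return the term plus any known synonyms, lowercased."""
    return {term.lower()} | {s.lower() for s in synonyms.get(term, [])}

def matches(document, concepts, synonyms):
    """Conjunctive query: every concept must match (AND);
    a concept matches if any of its variants occurs in the text (OR)."""
    text = document.lower()
    return all(
        any(variant in text for variant in expand(concept, synonyms))
        for concept in concepts
    )

# Illustrative synonym table and made-up documents.
synonyms = {"PKD1": ["polycystin-1", "polycystin 1"]}
documents = [
    "Polycystin-1 may act through a signaling pathway in kidney cells.",
    "PKD1 mutations were screened in a patient cohort.",
]
hits = [d for d in documents if matches(d, ["PKD1", "pathway"], synonyms)]
print(hits)
```

Only the first document is returned: it mentions a synonym of PKD1 and the word "pathway". With tagged entities, the second concept could be "any PATHWAY entity" instead of a literal keyword.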

Page 6: Information Extraction from Literature

Overview

Level3: Relation Extraction Goal is to extract the relations between entities Generally requires entities to be properly tagged first Detailed later

Level4: Inference and Hypothesis Generation Inference on knowledge base Graph mining

Page 7: Information Extraction from Literature

Outline

Overview of BeeSpace v4
Entity Recognition
Relation Extraction

Page 8: Information Extraction from Literature

Entity Recognition

Gene Example:

Although <GENE>mxp</GENE> and <GENE>Pb</GENE> display very similar expression patterns, <GENE>pb</GENE> null embryos develop normally
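A small sketch of how inline tags like the ones in this example could be read back into (type, text) pairs. The regular expression and the function name `extract_entities` are assumptions made for illustration; tag names with spaces (e.g. PROTEIN FAMILY) are allowed because the slides use them.

```python
import re

# An opening tag, the shortest possible body, and the matching closing tag.
TAGGED = re.compile(r"<([A-Z ]+)>(.*?)</\1>")

def extract_entities(text):
    """Return (entity_type, surface_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in TAGGED.finditer(text)]

sentence = (
    "Although <GENE>mxp</GENE> and <GENE>Pb</GENE> display very similar "
    "expression patterns, <GENE>pb</GENE> null embryos develop normally"
)
entities = extract_entities(sentence)
print(entities)
# → [('GENE', 'mxp'), ('GENE', 'Pb'), ('GENE', 'pb')]
```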

Page 9: Information Extraction from Literature

Entity Recognition

Anatomy Example:

In normal embryos, mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments, whereas ectopic expression is observed in some GOF variants.

Page 10: Information Extraction from Literature

Entity Recognition

Biological process Example:

Amongst these are the Bicoid, the Nanos, and the terminal class gene products, some of which are oncoproteins involved in signal transduction for <BIOLOGICAL PROCESS>the formation of terminal structures in the embryo</BIOLOGICAL PROCESS>.

Page 11: Information Extraction from Literature

Entity Recognition

Pathways Example:

Several signal transduction pathways have been described in Drosophila, and this review explores the potential of oncogene studies using one of those pathways - <PATHWAY>the terminal class signal transduction pathway</PATHWAY> - to better understand the cellular mechanisms of proto-oncogenes that mediate cellular responses in vertebrates including humans

Page 12: Information Extraction from Literature

Entity Recognition

Protein family Example:

While non-arthropod orthologs have been found for many Drosophila eye developmental genes, this has not been the case for the glass (gl) gene, which encodes a <PROTEIN FAMILY>zinc finger transcription factor</PROTEIN FAMILY> required for photoreceptor cell specification, differentiation, and survival.

Page 13: Information Extraction from Literature

Entity Recognition

CRE (cis-regulatory elements) Example:

A synthetic, 23-bp <CRE>ecdysterone regulatory element (EcRE)</CRE>, derived from the upstream region of the Drosophila melanogaster hsp27 gene, was inserted adjacent to the herpes simplex virus thymidine kinase promoter fused to a bacterial gene for chloramphenicol acetyltransferase (CAT).

Page 14: Information Extraction from Literature

Entity Recognition

Phenotype
  Definition: a set of observable physical characteristics of an individual organism
  Example: Fog, dumpy

Page 15: Information Extraction from Literature

Entity Recognition

Class1: Small Variation (Dictionary/Ontology)
  Organism, Anatomy, Biological Process, Pathway, Protein Family
Class2: Medium Variation
  Gene, cis Regulatory Element
Class3: Large Variation
  Phenotype, Behavior

Page 16: Information Extraction from Literature

Entity Recognition

Generally can be defined as a classification problem

Boils down to feature definition
  Class1: matching a word in the Dictionary/Ontology
  Class2: prefix/suffix of the word, POS tags, …
  Class3: ?
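A sketch of what the per-word features for Class1 and Class2 might look like. The tiny dictionary, the affix length of 3, and the extra `has_digit` and `is_capitalized` features are assumptions added for illustration; POS tags are left out because they would need an external tagger.

```python
def word_features(word, dictionary, affix_len=3):
    """Build a feature dict for one word.

    in_dictionary is the Class1 cue (exact dictionary/ontology match);
    prefix and suffix are Class2 cues. The remaining features are
    illustrative additions, not taken from the slides.
    """
    return {
        "in_dictionary": word.lower() in dictionary,
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_capitalized": word[:1].isupper(),
    }

# Hypothetical anatomy dictionary, reusing terms from the Anatomy example.
anatomy_terms = {"maxillary", "labial"}

print(word_features("maxillary", anatomy_terms))
print(word_features("hsp27", anatomy_terms))
```

Feature dicts of this shape can be passed to most classifiers after one-hot encoding the string-valued features.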

Page 17: Information Extraction from Literature

Entity Recognition

Firstly focus on Class1
  Relatively simple
  Class2 and Class3 need training examples
  Useful in entity association mining
  Useful in facilitating extraction of many interesting relations
Related work: Textpresso

Page 18: Information Extraction from Literature

Textpresso

Input: full text C. elegans literature
Output: tagged XML format
Defined a Textpresso ontology
  First category is biological entities
  Manually curated a lexicon of names
Implemented by Perl regular expressions
We could reuse some of the regular expressions
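A minimal sketch of lexicon-driven tagging with regular expressions, in the spirit of the approach described above. It is written in Python rather than Perl, and the lexicon, tag name, and `build_tagger` helper are illustrative assumptions, not Textpresso's actual code.

```python
import re

def build_tagger(lexicon, tag):
    """Compile a lexicon into one regex and return a function that tags text."""
    # Longest names first so multi-word names win over their shorter prefixes.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )

    def tagger(text):
        return pattern.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", text)

    return tagger

# Hypothetical two-entry anatomy lexicon.
tag_anatomy = build_tagger({"maxillary", "labial"}, "ANATOMY")
tagged = tag_anatomy("mxp is expressed in the maxillary and labial segments")
print(tagged)
# → mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments
```

This covers only Class1 entities, where names vary little and a curated lexicon is enough.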

Page 19: Information Extraction from Literature
Page 20: Information Extraction from Literature

Entity Recognition

Resources:
  Organism: Entrez gene table, Textpresso, BeeSpace DB
  Anatomy: FlyBase
  Biological Process, Cellular Component, Molecular Function: Textpresso
  Pathway: KEGG
  Protein Family: PDB, NCBI

Page 21: Information Extraction from Literature

Outline

Overview of BeeSpace v4
Entity Recognition
Relation Extraction

Page 22: Information Extraction from Literature

Relation Extraction

Expression Location
  the expression of a gene in some location (tissues, body parts)
Homology/Orthology
  one gene is homologous to another gene

Page 23: Information Extraction from Literature

Relation Extraction

Biological process
  one gene has some role in a biological process
Genetic/Physical/Regulatory Interaction
  one gene interacts with another gene in a certain fashion (3 types of relations)
  a simple case: Protein-Protein Interaction (PPI)

Page 24: Information Extraction from Literature

Relation Extraction

Generally can be defined as a classification problem, which requires training data

Domain adaptation?
  an example of PPI

Page 25: Information Extraction from Literature

PPI

Problem Definition:
  Gene/protein names are already tagged
  A known list of interaction words (133 words)
  Classify each tuple (p1, p2, interWord) in one single sentence
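A sketch of how the candidate tuples for one sentence might be enumerated before classification. The two-word interaction list is a hypothetical stand-in for the 133-word list, which the slides do not reproduce, and `candidate_tuples` is an illustrative name.

```python
from itertools import combinations

def candidate_tuples(proteins, tokens, interaction_words):
    """Enumerate (p1, p2, interWord) candidates for a single sentence."""
    found = [t for t in tokens if t.lower() in interaction_words]
    unique = list(dict.fromkeys(proteins))  # drop repeated mentions, keep order
    return [(p1, p2, w) for p1, p2 in combinations(unique, 2) for w in found]

sentence = "IgG antibody was able to inhibit binding of IgE antibody"
proteins = ["IgG antibody", "IgE antibody"]   # assumed to be tagged already
interaction_words = {"inhibit", "binding"}    # hypothetical subset of the list

candidates = candidate_tuples(proteins, sentence.split(), interaction_words)
print(candidates)
```

Each candidate is then labeled as a true or false PPI instance by the classifier. Here both `inhibit` and `binding` produce a candidate for the same protein pair, so the classifier has to pick the right interaction word.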

Page 26: Information Extraction from Literature

PPI

Methods
  Learning algorithm: Maximum Entropy
  Context features
    “Extracting protein-protein interactions using simple contextual features training data” BioNLP Workshop on HLT-NAACL 06
    e.g. lexical forms, POS tags, …
    Less dependent on domain
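A toy sketch of the setup: simple lexical context features for a tuple, fed to a binary maximum entropy model (equivalent to logistic regression) trained by stochastic gradient ascent. The feature names, the two training sentences, and the hyperparameters are assumptions for illustration; this is not the feature set of the cited paper, and POS tags are omitted.

```python
import math
from collections import defaultdict

def context_features(tokens, i1, i2, iw):
    """Lexical context features for proteins at i1, i2 and interaction word at iw."""
    lo, hi = min(i1, i2), max(i1, i2)
    feats = {f"inter={tokens[iw].lower()}": 1.0}
    for tok in tokens[lo + 1:hi]:
        feats[f"between={tok.lower()}"] = 1.0
    feats["inter_between"] = 1.0 if lo < iw < hi else 0.0
    return feats

def score(weights, feats):
    """P(label = 1 | feats) under the current weights."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - score(weights, feats)
            for f, v in feats.items():
                weights[f] += lr * error * v
    return weights

# Two made-up training tuples: one true interaction, one false.
positive = context_features(["P1", "binds", "P2"], 0, 2, 1)
negative = context_features(["P1", "and", "P2", "binds", "DNA"], 0, 2, 3)
examples = [(positive, 1), (negative, 0)]

weights = train_maxent(examples)
print(round(score(weights, positive), 3), round(score(weights, negative), 3))
```

Because the features are surface words and positions rather than domain-specific resources, a model of this kind can at least be applied to a different collection without changes.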

Page 27: Information Extraction from Literature

PPI

Training/Testing data: BioCreative
  1000 hand labeled sentences, 3964 tuples
  5-fold cross validation

Performance
  avgpr = 47.14624
  avgre = 43.97337
  avgf1 = 45.35523
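A sketch of how precision, recall, and F1 can be computed per fold and then averaged, assuming the reported figures are percentages averaged over the five folds. The reported avgf1 is slightly below the harmonic mean of avgpr and avgre (about 45.50), which is consistent with F1 being computed per fold and then averaged, although the slides do not say so. The function names and the contiguous fold split are illustrative choices.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 for two sets of tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

def average(values):
    return sum(values) / len(values)

# Toy check with made-up tuple ids: 2 of 4 predictions are correct, 2 of 3 gold found.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f)

folds = list(k_fold_indices(1000, k=5))
print([len(test) for _, test in folds])
```

In a full run, each fold would train on its `train` indices, score its `test` indices with `precision_recall_f1`, and the three per-fold values would be passed to `average`.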

Page 28: Information Extraction from Literature

PPI

Training data: BioCreative
  1000 hand labeled sentences, 3964 tuples

Testing data (different domain)
  Bee collection

Performance (judged by Moushumi)
  Total number of tuples extracted as PPI instances: 92
  Precision: 63%

Page 29: Information Extraction from Literature

PPI Misclassification examples

Type1: No interaction
Sentence: Pretreatment of platelet suspension with phospholipase A2 from N. naja atra or A. mellifera venom (50 µg/ml) inhibited platelet aggregation induced by sodium arachidonate or collagen, but not induced by thrombin or ionophore A-23187.
False: (collagen, thrombin, induced)
True: relation between protein and platelet aggregation; no PPI

Page 30: Information Extraction from Literature

PPI Misclassification examples

Type2: Incorrect interaction word
Sentence: IgG antibody was able to inhibit binding of IgE antibody in the PLA radioallergosorbent test (RAST) from 10-40% at a molar excess of 10- to 1000-fold.
False: (IgG antibody, IgE antibody, binding)
True: (IgG antibody, IgE antibody, inhibit)

Page 31: Information Extraction from Literature

PPI Misclassification examples

Type3: Incorrect protein involved
Sentence: AChE exhibits a butyrylcholinesterase (BuChE) activity that represents about 14% of AChE activity.
False: (AChE, AChE, exhibits)
True: (AChE, BuChE, exhibits)

Page 32: Information Extraction from Literature

PPI

Possible Improvement
  syntactic patterns: “Optimizing syntax-patterns for discovering protein-protein interactions”, in Proc. ACM Symposium on Applied Computing (SAC), Bioinformatics Track
  parse tree
  dependency parsing
  …

Page 33: Information Extraction from Literature

The End