Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature Deyu Zhou, Yulan He and Chee Keong Kwoh School of Computer Engineering Nanyang Technological University, Singapore 30 August 2007


Page 1

Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature

Deyu Zhou, Yulan He and Chee Keong Kwoh

School of Computer Engineering

Nanyang Technological University, Singapore

30 August 2007

Page 2

Outline

• Protein-protein interactions (PPIs) extraction

• Hidden Vector State (HVS) model for PPIs extraction

• Reranking approaches

• Experimental results

• Conclusions

Page 3

Protein-Protein Interactions Extraction

[Diagram: Protein - Interact - Protein]

Spc97p interacts with Spc98 and Tub4 in the two-hybrid system

• Spc97p interact Spc98

• Spc97p interact Tub4

Page 4

Existing Approaches

Ordered from simple to complicated:

• Statistics Methods

• Pattern Matching

• Parsing-Based

Page 5

An example

However, unlike another tumor suppressor protein, p53, Rb did not have any significant effect on basal levels of transcription, suggesting that Rb specifically interacts with IE2 rather ...

Part-of-speech tagging

However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, p53/NN ,/, Rb/NN did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN Rb/NN specifically/RB interacts/VBZ with/IN IE2/NN rather/RB ...

Protein name identification

However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, PROTEIN(p53/NN) ,/, PROTEIN(Rb/NN) did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN PROTEIN(Rb/NN) specifically/RB interacts/VBZ with/IN PROTEIN(IE2/NN) rather/RB ...

Page 6

Statistics-Based Approaches

Sentence level statistic (Relation, Occurrence):

(p53, Rb) +1
(p53, IE2) +1
(Rb, IE2) +1

Corpus level statistic:

[Tables: Relation vs. Occurrence (counts shown include 8, 1 and 6) and Relation vs. Confidence (e.g. (p53, IE2): 75%)]

Predefined threshold a = 7
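The counting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy corpus, and the choice of comparing counts with `>=` against the threshold are hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered protein pair co-occurs in a sentence."""
    counts = Counter()
    for proteins in sentences:
        # Sentence level statistic: every distinct pair in the sentence gets +1.
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return counts

def select_relations(counts, threshold):
    """Corpus level statistic: keep pairs whose count reaches the threshold.
    Using >= here is an assumption; the slide only states a threshold value."""
    return {pair for pair, n in counts.items() if n >= threshold}

# Toy corpus (hypothetical): protein names recognised in each sentence.
corpus = [["p53", "Rb", "IE2"], ["Rb", "IE2"], ["Rb", "IE2"]]
counts = cooccurrence_counts(corpus)
print(counts)
print(select_relations(counts, threshold=3))
```

Pairs are stored as sorted tuples so that (Rb, IE2) and (IE2, Rb) share one counter.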

Page 7

Pattern Matching Approaches

Pattern 1: Protein [*] interact[s] with protein

Result: Rb interact IE2, p53 interact IE2

Pattern 2: protein RB VBZ WITH protein

Result: Rb interact IE2
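A rough sketch of how the two patterns behave on the example sentence, written with Python regular expressions. The protein list, the helper `extract`, and the way Pattern 2 is approximated without part-of-speech tags (exactly one word between the protein and the verb) are assumptions made for illustration.

```python
import re

# Assumed output of a protein name identifier for the example sentence.
PROTEINS = ["p53", "Rb", "IE2"]
NAME = "|".join(re.escape(p) for p in PROTEINS)

# Pattern 1: Protein [*] interact[s] with protein (anything may sit in between).
PATTERN_1 = re.compile(rf"({NAME})\b.*?\binteracts? with ({NAME})\b")
# Pattern 2: protein RB VBZ WITH protein, approximated here without POS tags as
# protein, exactly one word, "interact(s) with", protein.
PATTERN_2 = re.compile(rf"({NAME}) \w+ interacts? with ({NAME})\b")

def extract(pattern, text):
    """Try the pattern at every protein mention and collect the interactions."""
    found = set()
    for mention in re.finditer(rf"\b(?:{NAME})\b", text):
        m = pattern.match(text, mention.start())
        if m:
            found.add((m.group(1), "interact", m.group(2)))
    return found

SENTENCE = ("However, unlike another tumor suppressor protein, p53, Rb did not "
            "have any significant effect on basal levels of transcription, "
            "suggesting that Rb specifically interacts with IE2 rather ...")

print(extract(PATTERN_1, SENTENCE))
print(extract(PATTERN_2, SENTENCE))
```

The wildcard in Pattern 1 lets p53 pair with IE2 across the whole clause, while the stricter Pattern 2 only accepts the protein that directly precedes the adverb and verb.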

Page 8

Parsing-Based Approaches

...Rb specifically interacts with IE2...

Syntactic processing: word categories N ADV V P N, grouped into phrases NP, PP, VP, VP

Semantic processing: (<INTERACT><THE Rb PROTEIN><THE IE2 PROTEIN>)

Rb interact IE2

Page 9

Semantic Parser

For each candidate word string Wn, need to compute most likely set of embedded concepts

$$\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)$$

P(C): semantic model

P(Wn|C): lexical model

Page 10

We could use a simple finite state tagger …

P(Wn|C)

P(C)

… can be robustly trained using EM, but the model is too weak to represent embeddings in natural language

<s> → SS
Spc97p → PROTEIN
interacts → INTERACT
with → DUMMY
Spc98 → PROTEIN
and → DUMMY
Tub4 → PROTEIN
in the two-hybrid system → DUMMY
</s> → SE
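To make the finite state tagger concrete, here is a minimal Viterbi sketch that computes the most likely tag sequence under a first-order tag model P(C) and a lexical model P(Wn|C). All probabilities below are invented toy values, the sentence is shortened, and the function name `viterbi` and the floor for unseen events are assumptions; nothing here is the authors' trained model.

```python
import math

def viterbi(words, tags, trans, emit, start, floor=1e-6):
    """Return the tag sequence maximising P(C) * P(W|C) for a flat tag model."""
    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    best = {t: (lp(start, t) + lp(emit, (t, words[0])), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + lp(trans, (p, t)), best[p][1]) for p in tags),
                key=lambda x: x[0],
            )
            new[t] = (score + lp(emit, (t, w)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Toy model (hypothetical numbers).
tags = ["SS", "PROTEIN", "INTERACT", "DUMMY", "SE"]
start = {"SS": 1.0}
trans = {("SS", "PROTEIN"): 1.0, ("PROTEIN", "INTERACT"): 0.5,
         ("PROTEIN", "DUMMY"): 0.25, ("PROTEIN", "SE"): 0.25,
         ("INTERACT", "DUMMY"): 1.0, ("DUMMY", "PROTEIN"): 0.5,
         ("DUMMY", "DUMMY"): 0.25, ("DUMMY", "SE"): 0.25}
emit = {("SS", "<s>"): 1.0, ("SE", "</s>"): 1.0,
        ("PROTEIN", "Spc97p"): 0.4, ("PROTEIN", "Spc98"): 0.3,
        ("PROTEIN", "Tub4"): 0.3, ("INTERACT", "interacts"): 1.0,
        ("DUMMY", "with"): 0.5, ("DUMMY", "and"): 0.5}

words = "<s> Spc97p interacts with Spc98 and Tub4 </s>".split()
print(viterbi(words, tags, trans, emit, start))
```

The output is a flat tag per word. Nothing in it records that Spc98 and Tub4 are both partners of Spc97p, which is the weakness the slide points out.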

Page 11

Perhaps use some form of hierarchical HMM in which each state is a terminal or a nested HMM …

P(Wn|C)

P(C)

… but when using EM, models rarely converge on good solutions and, in practice, direct maximum-likelihood estimates from “tree-bank” data are needed to train models

[Parse tree for: Spc97p interacts with Spc98 and Tub4 in the two-hybrid system. Node labels: S; INTERACTION; SUBJECT, OBJECT, OBJECT; PROTEIN, INTERACT, PREP, PROTEIN, AND, PROTEIN, DUMMY]

Page 12

Hidden Vector State Model

Each word is labelled with a vector state, listed here from the pre-terminal concept (first) down to the root SS (last):

<s> → [SS]
Spc97p → [PROTEIN, SS]
interacts → [INTERACT, PROTEIN, SS]
with → [DUMMY, INTERACT, PROTEIN, SS]
Spc98 → [PROTEIN, INTERACT, PROTEIN, SS]
and → [DUMMY, INTERACT, PROTEIN, SS]
Tub4 → [PROTEIN, INTERACT, PROTEIN, SS]
in the two-hybrid system → [DUMMY, SS]
</s> → [SE, SS]

Page 13

The HVS model is an HMM in which the states correspond to the stack of a push-down automaton with a bounded stack size …

P(Wn|C)

P(C)

… this is a very convenient framework for applying constraints

[Figure: the vector state sequence for "<s> Spc97p interacts with Spc98 and Tub4 in the two-hybrid system </s>", running from [SS] through [PROTEIN, INTERACT, PROTEIN, SS] to [SE, SS]]

Page 14

HVS model transition constraints:

• finite stack depth – D

• push only one non-terminal semantic concept onto the stack at each step

… model defined by three simple probability tables

$$\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)$$
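As a worked instance of one factor in this product, take the example sentence from the earlier slides and assume the word "in" directly follows "Tub4" and receives the vector state [DUMMY, SS]. The previous state is [PROTEIN, INTERACT, PROTEIN, SS], so three concepts are popped before DUMMY is pushed. The step then contributes:

```latex
\[
P\bigl(n_t = 3 \mid C_{t-1} = [\text{PROTEIN}, \text{INTERACT}, \text{PROTEIN}, \text{SS}]\bigr)\;
P\bigl(C_t[1] = \text{DUMMY} \mid C_t[2..D_t] = [\text{SS}]\bigr)\;
P\bigl(W_t = \text{in} \mid C_t = [\text{DUMMY}, \text{SS}]\bigr)
\]
```

Each of the three terms is looked up in one of the three probability tables: the pop table, the push table, and the word table.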

Page 15

Parsing with the HVS model

1) Pop n elements from the previous stack state (here n = 1): P(nt|Ct-1)

2) Push 1 pre-terminal semantic concept onto the stack: P(Ct[1]|Ct[2..Dt])

3) Generate the next word: P(Wt|Ct)

Example: … with Spc98 and Tub4 …

[PROTEIN, INTERACT, PROTEIN, SS] → pop 1 → [INTERACT, PROTEIN, SS] → push DUMMY → [DUMMY, INTERACT, PROTEIN, SS]
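The pop and push steps above can be replayed in code. The sketch below assumes a stack represented as a Python list with the top concept first, a depth bound of 4, and a hand-written list of (pop count, pushed concept) moves for the Spc97p example; the function name `step` and these moves are illustrative, not part of the original slides.

```python
def step(stack, n_pop, concept, max_depth=4):
    """One HVS transition: pop n_pop concepts, then push exactly one concept.
    The stack is a list whose first element is the top (pre-terminal) concept."""
    if not 0 <= n_pop <= len(stack):
        raise ValueError("cannot pop more concepts than the stack holds")
    new_stack = [concept] + stack[n_pop:]
    if len(new_stack) > max_depth:
        raise ValueError("finite stack depth exceeded")
    return new_stack

# Hypothetical (n_pop, pushed concept) moves for:
# Spc97p interacts with Spc98 and Tub4 in(...) </s>
moves = [(0, "PROTEIN"), (0, "INTERACT"), (0, "DUMMY"), (1, "PROTEIN"),
         (1, "DUMMY"), (1, "PROTEIN"), (3, "DUMMY"), (1, "SE")]

states = [["SS"]]
for n_pop, concept in moves:
    states.append(step(states[-1], n_pop, concept))

for s in states:
    print(s)
```

Because only one concept is pushed per step, the whole transition is described by two choices, how many concepts to pop and which concept to push, which is what keeps the state space manageable.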

Page 16

Train using EM and apply constraints

Training text: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system

Abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )

[Diagram components: Training text, Data Constraints, Parse Statistics, EM Parameter Estimation, HVS Model Parameters]

Limit forward-backward search to only include states which are consistent with the constraints
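One simplified way to picture how an abstract annotation restricts the search: enumerate the vector states that the annotation licenses and discard every other state during forward-backward. The sketch below is an assumption about how such a filter could look, not the authors' training procedure; the function name `allowed_stacks` and the treatment of DUMMY, SS and SE are hypothetical.

```python
import re

def allowed_stacks(annotation, root="SS", filler="DUMMY", end="SE"):
    """Enumerate vector states (tuples, top concept first) licensed by an
    abstract annotation such as 'PROTEIN ( INTERACT ( PROTEIN ) )'."""
    tokens = re.findall(r"[A-Z_]+|[()]", annotation)
    path = [root]  # current concept path, root first
    allowed = {(root,), (filler, root), (end, root)}
    for i, tok in enumerate(tokens):
        if tok == "(":
            continue  # the preceding concept stays open for its children
        if tok == ")":
            path.pop()  # close the concept that owned this bracket
            continue
        path.append(tok)
        stack = tuple(reversed(path))
        allowed.add(stack)
        allowed.add((filler,) + stack)  # irrelevant words may sit on top
        if i + 1 == len(tokens) or tokens[i + 1] != "(":
            path.pop()  # leaf concept: nothing nested below it
    return allowed

allowed = allowed_stacks("PROTEIN ( INTERACT ( PROTEIN ) )")
print(("PROTEIN", "INTERACT", "PROTEIN", "SS") in allowed)
print(("INTERACT", "SS") in allowed)
```

A state such as [INTERACT, SS] is rejected because the annotation only allows INTERACT nested under PROTEIN.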

Page 17

Reranking Methodology

• Reranking approaches attempt to improve upon an existing probabilistic parser by reranking the output of the parser.

• Reranking has benefited applications such as named-entity extraction, semantic parsing and semantic labeling.

• Here, reranking is applied to the parses generated by the HVS model for protein-protein interaction extraction.

Page 18

Architecture

[Figure: system architecture. Components: Annotated Corpus E; Training data; Test Data; HVS model (Training, Semantic Parsing); Parse results; Reranking Model (Training, Reranking); Ranked 1st parse; Extracted protein-protein Interactions]

Features: Parsing Information IP, Structure Information IS, Complexity Information IC, ...

Page 19

Reranking approaches

• Features for Reranking

Suppose sentence Si has its corresponding parse set Ci = {Cij, j = 1, ..., N}

– Parsing Information

– Structure Information

– Complexity Information

Page 20

Reranking approaches

Score is defined as [equation not recovered]

• Log-linear regression (LLR) model

• Neural Network (NN)

• Support Vector Machines (SVM)
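Since the score formula did not survive in the transcript, the following is only a generic sketch of log-linear reranking, not the authors' definition. It assumes each candidate parse carries one number per feature group (parsing, structure, complexity) and that the score is a weighted sum of those numbers; the feature values, weights and the name `rerank` are all hypothetical.

```python
def rerank(candidates, weights):
    """Sort candidate parses by a linear score sum_k w_k * f_k (highest first).
    In a log-linear model the probability is proportional to exp(score),
    so ordering by the score gives the same ranking."""
    def score(candidate):
        return sum(weights[name] * value
                   for name, value in candidate["features"].items())
    return sorted(candidates, key=score, reverse=True)

# Two hypothetical candidate parses for one sentence.
candidates = [
    {"parse": "A", "features": {"parsing": -12.0, "structure": 0.0, "complexity": 3.0}},
    {"parse": "B", "features": {"parsing": -12.5, "structure": 1.0, "complexity": 2.0}},
]
# Hypothetical weights; in practice they would be learned from training data.
weights = {"parsing": 1.0, "structure": 2.0, "complexity": -0.5}

best = rerank(candidates, weights)[0]
print(best["parse"])
```

Parse A has the higher parser score, but B wins once the structure and complexity features are taken into account, which is the kind of correction reranking is meant to provide.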

Page 21

Experiments

• Setup

– Corpus I

• comprises 300 abstracts randomly retrieved from the GENIA corpus

• GENIA is a collection of research abstracts selected from the search results of the MEDLINE database with keywords (MeSH terms) “human, blood cells and transcription factors”

• split into two parts:

– Part I contains 1500 sentences (training data)

– Part II consists of 1000 sentences (test data)

Page 22

Experimental Results

[Figure 1: F-measure vs number of candidate parses.]

Page 23

Experimental Results (cont’d)

Experiments   Recall (%)   Precision (%)   F-Score (%)
Baseline      55.8         55.6            55.7
SVM           59.1         60.2            59.7
NN            57.9         61.8            59.8
LLR           58.5         61.2            59.8

Table 3: Results based on the interaction category.
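For reference, the F-score column is the harmonic mean of precision and recall. The short check below recomputes the baseline row from its reported recall and precision; the function name `f_measure` is an assumption, and the formula is the standard balanced F-measure rather than anything specific to this work.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Baseline row of Table 3: recall 55.8 %, precision 55.6 %.
baseline_f = f_measure(precision=55.6, recall=55.8)
print(round(baseline_f, 1))
```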

Page 24

Conclusions

• Three reranking methods were applied to the HVS model for extracting protein-protein interactions from biomedical literature.

• Experimental results show that a 4% relative improvement in F-measure can be obtained by reranking the semantic parse results.

• Incorporating other semantic or syntactic information might give further gains.