Page 1: A Distributed Architecture System for Recognizing Textual Entailment

SYNASC 2007, Timişoara, Romania

A Distributed Architecture System for Recognizing Textual Entailment

Adrian Iftene, Alexandra Balahur-Dobrescu, Daniel Matei
{adiftene, abalahur, dmatei}@info.uaic.ro

"Al. I. Cuza" University, Iasi, Romania

Faculty of Computer Science

Page 2: A Distributed Architecture System for Recognizing Textual Entailment

Iftene, Balahur-Dobrescu, Matei – SYNASC, 2007

Overview

• Textual Entailment
– Definition
– System presentation
– Results

• Peer-to-Peer Architecture
– Presentation
– Transfer protocol
– Synchronization problem
– Results

• Conclusions

Page 3: A Distributed Architecture System for Recognizing Textual Entailment

Textual Entailment

• TE is defined (Dagan et al., 2006) as a directional relation between two text fragments, termed T (text), the entailing text, and H (hypothesis), the entailed text.

• It is then said that T entails H if, typically, a human reading T would infer that H is most likely true.

• Example:
– T: The carmine cat devours the mouse in the garden.
– H: The red cat killed the mouse.

Page 4: A Distributed Architecture System for Recognizing Textual Entailment

RTE Competition

• Organized by PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning), the European Commission's IST-funded Network of Excellence for Multimodal Interfaces.

• This year, a limited number of longer texts were added.

• Results by year:
– 2005: 16 groups, 55% average, 70% the best
– 2006: 23 groups, 58% average, 75% the best
– 2007: 26 groups, 80% the best, our result 69.13% (third place)

Page 5: A Distributed Architecture System for Recognizing Textual Entailment

System presentation

[Figure: system architecture. Diagram labels: Initial data; Minipar module (dependency trees for (T, H) pairs); LingPipe module (named entities for (T, H) pairs); Resources: DIRT, Acronyms, Background knowledge, WordNet, Wikipedia; Core Module 1, Core Module 2, Core Module 3; P2P Computers; Final result.]

Page 6: A Distributed Architecture System for Recognizing Textual Entailment

Tools - LingPipe

LingPipe is a suite of Java libraries for the linguistic analysis of human language. The major tools are for:
• Sentences
• Parts of Speech
• Named Entities
• Coreference

Example: Hypothesis from pair 111: Leloir was born in Argentina.

<ENAMEX TYPE="PERSON">Leloir</ENAMEX> was born in <ENAMEX TYPE="LOCATION">Argentina</ENAMEX>.
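The annotated hypothesis above can be turned into a list of named entities with a small amount of post-processing. The following is a minimal Python sketch, not the authors' code: the function name `extract_named_entities` and the regular expression are assumptions made for illustration, and it only handles the flat, non-nested ENAMEX annotation shown in the example.

```python
import re

# Matches inline annotations of the form
# <ENAMEX TYPE="PERSON">Leloir</ENAMEX> (flat, non-nested tags only).
ENAMEX_PATTERN = re.compile(r'<ENAMEX TYPE="([^"]+)">(.*?)</ENAMEX>')


def extract_named_entities(annotated_text):
    """Return (entity_text, entity_type) pairs in order of appearance."""
    return [(text, etype) for etype, text in ENAMEX_PATTERN.findall(annotated_text)]


# The annotated hypothesis from pair 111, as shown on the slide.
annotated = (
    '<ENAMEX TYPE="PERSON">Leloir</ENAMEX> was born in '
    '<ENAMEX TYPE="LOCATION">Argentina</ENAMEX>.'
)
entities = extract_named_entities(annotated)
print(entities)
```

Pairs of this kind could then be compared between T and H, for instance to check that a person or location named in the hypothesis also occurs in the text.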

Page 7: A Distributed Architecture System for Recognizing Textual Entailment

Tools - MINIPAR

• MINIPAR transforms the text and the hypothesis into dependency trees

Example: Le Beau Serge was directed by Chabrol.

(
E0 (() fin C * )
1 (Le ~ U 3 lex-mod (gov Le Beau Serge))
2 (Beau ~ U 3 lex-mod (gov Le Beau Serge))
3 (Serge Le Beau Serge N 5 s (gov direct))
4 (was be be 5 be (gov direct))
5 (directed direct V E0 i (gov fin))
E2 (() Le Beau Serge N 5 obj (gov direct) (antecedent 3))
6 (by ~ Prep 5 by-subj (gov direct))
7 (Chabrol ~ N 6 pcomp-n (gov by))
8 (. ~ U * punc)
)

Resulting dependency tree (edge label: node):

direct (V)
    s: Le_Beau_Serge (N)
        lex-mod: Le (U)
        lex-mod: Beau (U)
    be: be (be)
    by: Chabrol (N)
    obj: Le_Beau_Serge (N)

Page 8: A Distributed Architecture System for Recognizing Textual Entailment

Resources – DIRT

• DIRT is both an algorithm and a resulting knowledge collection created by Lin and Pantel

Paraphrases of "X solves Y":
– Y is solved by X
– X resolves Y
– X finds a solution to Y
– X tries to solve Y
– X deals with Y
– Y is resolved by X
– …

Example: Le Beau Serge was directed by Chabrol

Patterns shown for this example:
– N:s:V<direct>V:by:N
– N:obj:V<direct>V:by:N
– N:s:V<direct>V:
– :V<direct>V:by:N
– :V<direct>V:by:N
– N:obj:V<direct>V:

Page 9: A Distributed Architecture System for Recognizing Textual Entailment

Resources – eXtended WordNet

For every synonym, we check to see which word appears in the text tree, and select the mapping with the best value according to the values from eXtended WordNet.

For example, the relation between “relative” and “niece” is made with a score of 0.078652.
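The selection step described above can be sketched in Python. This is only an illustration under stated assumptions: the function `best_mapping` and the dictionary layout are hypothetical, the real system reads its scores from eXtended WordNet, and the "cousin" score below is invented for the example (only the 0.078652 value for "niece" comes from the slide).

```python
def best_mapping(scored_relations, text_tree_words):
    """Pick the related word that occurs in the text tree with the best score.

    scored_relations maps each word related to a hypothesis word to its
    similarity score (a hypothetical stand-in for an eXtended WordNet lookup).
    Returns (word, score), or None when no related word occurs in the text.
    """
    candidates = [
        (word, score)
        for word, score in scored_relations.items()
        if word in text_tree_words
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])


# Words related to the hypothesis word "relative".
# Only the "niece" score is taken from the slide; "cousin" is made up.
relations_for_relative = {"niece": 0.078652, "cousin": 0.05}
text_words = {"his", "niece", "arrived"}

result = best_mapping(relations_for_relative, text_words)
print(result)
```

Returning `None` when nothing matches leaves the caller free to fall back on another resource, such as the acronym database or the background knowledge.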

Page 10: A Distributed Architecture System for Recognizing Textual Entailment

Resources - Acronyms

The acronyms’ database helps our program in finding relations between the acronym and its meaning: “US - United States”

Page 11: A Distributed Architecture System for Recognizing Textual Entailment

Background Knowledge - Example

<pair id="748" entailment="YES">

<T>Argentina President Carlos Menem has ordered an 'immediate' investigation into war crimes allegedly committed by British troops during the 1982 Falklands War.</T>

<H>Argentine demanded an investigation of alleged war crimes during the Falklands War.</H>

</pair>

Page 12: A Distributed Architecture System for Recognizing Textual Entailment

Resources – Background Knowledge

"Argentine": extracted snippets from Wikipedia:
– ar |calling_code = 54 |footnotes = Argentina also has a territorial dispute
– Argentina', , Nación Argentina (Argentine Nation) for many legal purposes), is
– in the world. Argentina occupies a continental surface area of
– Argentina national football team

Extracted relations:
– Argentine [is] Argentina
– Chinese [in] China
– Los Angeles [in] California
– 2 [is] two
– Netherlands [is] Holland

Relations extracted for "Netherlands":
– Netherlands [is] Dutch
– Netherlands [is] Nederlandse
– Netherlands [is] Antillen
– Netherlands [in] Europe
– Netherlands [is] Holland
– Antilles [in] Netherlands

The extraction patterns are usually "definition" patterns:
- verbs like "is", "define", "represent", etc.
- punctuation context , " ' () [] :
- anaphora resolution

Page 13: A Distributed Architecture System for Recognizing Textual Entailment

Semantic Variability Rules

• Negation rule – given by terms like "no", "not", "never"

• Modal verbs: "may", "might", "cannot", "should", "could"

• Certain cases for particle "to" when it precedes:
– a verb: "allow", "impose", "galvanize"
– an adjective like "necessary", "compulsory", "free"
– a noun like "attempt", "trial"

• Influence of context:
– Positive words: "certainly", "absolutely"
– Negative words: "probably", "likely"
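As a rough illustration of how such rules could mark a verb as positive or not, here is a minimal Python sketch. It is an assumption-laden simplification rather than the authors' implementation: the function names, the two-token look-back window, and the choice to return 1.0 when there are no verbs are all hypothetical, and the rules for the particle "to" and for positive context words are left out.

```python
# Word lists taken from the rules above.
NEGATION_TERMS = {"no", "not", "never"}
MODAL_VERBS = {"may", "might", "cannot", "should", "could"}
NEGATIVE_CONTEXT = {"probably", "likely"}


def is_positive_verb(tokens, verb_index, window=2):
    """Return False when a negating or weakening term occurs just before the verb.

    The look-back window of two tokens is an assumption of this sketch.
    """
    start = max(0, verb_index - window)
    preceding = {token.lower() for token in tokens[start:verb_index]}
    return not (preceding & (NEGATION_TERMS | MODAL_VERBS | NEGATIVE_CONTEXT))


def negation_value(tokens, verb_indices):
    """Share of verbs considered positive among all verbs."""
    if not verb_indices:
        return 1.0  # assumption: no verbs means nothing is negated
    positive = sum(is_positive_verb(tokens, i) for i in verb_indices)
    return positive / len(verb_indices)


negated = "The cat did not kill the mouse".split()
plain = "The cat killed the mouse".split()
print(negation_value(negated, [4]))  # verb "kill" is at index 4
print(negation_value(plain, [2]))    # verb "killed" is at index 2
```

A real system would work on the dependency tree rather than on a token window, since the negation can be attached to the verb without being adjacent to it.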

Page 14: A Distributed Architecture System for Recognizing Textual Entailment

Fitness calculation 1

• Local Fitness:
– 1 at direct mapping, Acronyms, BK
– DIRT score
– eXtended WordNet score

• Extended Local Fitness:
– Local Fitness
– Parent Fitness
– Mapping of edge label
– Node Position (left or right)

[Figure: mapping between the Hypothesis tree and the Text tree. Diagram labels: node mapping, father mapping, edge label mapping.]

Page 15: A Distributed Architecture System for Recognizing Textual Entailment

Fitness calculation 2

Total Fitness:

$$TF = \frac{\sum_{node \in H} \mathit{ExtendedLocalFitness}_{node}}{\mathit{HypothesisNodesNumber}}$$

The Negation Value:

$$NV = \frac{\mathit{Positive\_VerbsNumber}}{\mathit{TotalNumberOfVerbs}}$$

Global Fitness:

$$\mathit{GlobalFitness} = NV \cdot TF + (1 - NV) \cdot (4 - TF)$$

Threshold value = 2.06

Page 16: A Distributed Architecture System for Recognizing Textual Entailment

Fitness calculation 3

• T: The French railway company SNCF is cooperating in the project.

• H: The French railway company is called SNCF.

Initial entity            Node Fitness   Extended local fitness
(the, company, det)       1              3.125
(French, company, nn)     1              3.125
(railway, company, nn)    1              3.125
(company, call, s)        1              2.5
(be, call, be)            1              4
(call, -, -)              0.096          3.048
(company, call, obj)      1              1.125
(SNCF, call, desc)        1              2.625

• Total_Fitness = (3.125 + 3.125 + 3.125 + 2.5 + 4 + 3.048 + 1.125 + 2.625)/8 = 22.673/8 = 2.834

• Positive_Verbs_Number = 1/1 = 1

• GlobalFitness = 1*2.834 + (1 – 1)*(4 – 2.834) = 2.834
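The arithmetic of this example can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and treating a global fitness at or above the 2.06 threshold as a "YES" answer is an assumption, since the slides give the threshold but not the exact comparison.

```python
THRESHOLD = 2.06  # threshold value given on the slides


def total_fitness(extended_local_fitness):
    """Average of the extended local fitness over all hypothesis nodes."""
    return sum(extended_local_fitness) / len(extended_local_fitness)


def global_fitness(nv, tf):
    """GlobalFitness = NV * TF + (1 - NV) * (4 - TF)."""
    return nv * tf + (1 - nv) * (4 - tf)


# Extended local fitness values from the table above.
extended = [3.125, 3.125, 3.125, 2.5, 4, 3.048, 1.125, 2.625]

tf = total_fitness(extended)
nv = 1 / 1  # one positive verb out of one verb
gf = global_fitness(nv, tf)

# Assumption: entailment is answered YES when the global fitness
# reaches the threshold.
entails = gf >= THRESHOLD
print(round(tf, 3), round(gf, 3), entails)
```

With NV = 1 the second term vanishes and the global fitness equals the total fitness; with NV = 0 the formula yields 4 – TF instead, so a well-mapped but fully negated hypothesis gets a low score.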

Page 17: A Distributed Architecture System for Recognizing Textual Entailment

Results

Run      IE      IR      QA      SUM     Global
Run01    0.57    0.69    0.87    0.635   0.6913
Run02    0.57    0.685   0.865   0.645   0.6913

Language Computer Corporation, USA         0.8000
LCC Richardson, USA                        0.7225
"Al. I. Cuza" University, Romania          0.6913
University of Texas, USA                   0.6700
LT-lab, Germany                            0.6687
University of Rome "Tor Vergata", Italy    0.6675

Page 18: A Distributed Architecture System for Recognizing Textual Entailment

Peer-to-Peer Architecture

[Figure: peer-to-peer architecture. Diagram labels: Initiator; DIRT db; Acronyms; CM nodes (six shown); SMB upload; SMB download.]

• Speed optimization
– P2P architecture, cache mechanism

• Transfer protocol
– Fail-over mechanism

• Ending synchronization
– Quota mechanism

Page 19: A Distributed Architecture System for Recognizing Textual Entailment

Transfer protocol

SMB header

CIFS protocol

Page 20: A Distributed Architecture System for Recognizing Textual Entailment

Synchronization problem

• Dynamic quota (~ 0.26 s)

Page 21: A Distributed Architecture System for Recognizing Textual Entailment

Results

No   Run details                                                           Duration
1    One computer without caching mechanism                                5:28:45
2    One computer with caching mechanism, but with empty cache at start    2:03:13
3    One computer with full cache at start                                 0:00:41
4    5 computers with 7 processes                                          0:00:06.7
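To make the gains easier to compare, the durations can be converted to seconds and expressed relative to run 1. This is a small Python sketch for the arithmetic only; the helper `to_seconds` is hypothetical, and the last duration is read as 0:00:06.7 for the run with 5 computers.

```python
def to_seconds(duration):
    """Convert an 'H:MM:SS' or 'H:MM:SS.f' string to seconds."""
    hours, minutes, seconds = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Durations from the results table, keyed by run number.
runs = {
    1: "5:28:45",    # one computer, no caching
    2: "2:03:13",    # one computer, empty cache at start
    3: "0:00:41",    # one computer, full cache at start
    4: "0:00:06.7",  # 5 computers with 7 processes
}

baseline = to_seconds(runs[1])
for number, duration in runs.items():
    seconds = to_seconds(duration)
    speedup = baseline / seconds
    print(f"run {number}: {seconds:.1f} s, {speedup:.1f}x relative to run 1")
```

Most of the gain between runs 2 and 3 comes from the cache already being full, while the step from run 3 to run 4 reflects the distribution over several computers.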

Page 22: A Distributed Architecture System for Recognizing Textual Entailment

Conclusions

• The core of our approach is based on a tree edit distance algorithm (Kouylekov, Magnini, 2005)

• The main idea is to transform the hypothesis using resources such as DIRT, WordNet, Wikipedia and the acronyms database

• To improve speed, we use a P2P architecture and a caching mechanism

• For ending synchronization, we use a dynamic quota

Page 23: A Distributed Architecture System for Recognizing Textual Entailment

Acknowledgments

• NLP group of Iasi:
– Supervisor: Prof. Dan Cristea
– Diana Trandabat, Corina Forascu, Ionut Pistol, Marius Raschip

• Anaphora resolution group:
– Iustin Dornescu, Alex Moruz, Gabriela Pavel

Page 24: A Distributed Architecture System for Recognizing Textual Entailment

THANK YOU!