SYNASC 2007, Timişoara, Romania
A Distributed Architecture System for Recognizing Textual Entailment
Adrian Iftene, Alexandra Balahur-Dobrescu, Daniel Matei
{adiftene, abalahur, dmatei}@info.uaic.ro
“Al. I. Cuza” University, Iasi, Romania
Faculty of Computer Science
Iftene, Balahur-Dobrescu, Matei – SYNASC, 2007
Overview
• Textual Entailment
  – Definition
  – System presentation
  – Results
• Peer-to-Peer Architecture
  – Presentation
  – Transfer protocol
  – Synchronization problem
  – Results
• Conclusions
Textual Entailment
• TE is defined (Dagan et al., 2006) as a directional relation between two text fragments, termed T (text) - the entailing text, and H (hypothesis) - the entailed text.
• It is then said that T entails H if, typically, a human reading T would infer that H is most likely true.
• Example:
  – T: The carmine cat devours the mouse in the garden.
  – H: The red cat killed the mouse.
RTE Competition
• Organized by PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning), the European Commission's IST-funded Network of Excellence for Multimodal Interfaces.
• This year, a limited number of longer texts were added.
• 2005: 16 groups, 55% average, 70% the best
• 2006: 23 groups, 58% average, 75% the best
• 2007: 26 groups, 80% the best, our result 69.13% (third place)
System presentation
[Architecture diagram. Components shown:]
• Initial data
• LingPipe module: named entities for (T, H) pairs
• Minipar module: dependency trees for (T, H) pairs
• Core Module 1, Core Module 2, Core Module 3, running on the P2P computers
• Resources: DIRT, Acronyms, Background knowledge (from Wikipedia), WordNet
• Final result
Tools - LingPipe
• LingPipe is a suite of Java libraries for the linguistic analysis of human language.
• The major tools are for: sentences, parts of speech, named entities, coreference.
• Example: hypothesis from pair 111: Leloir was born in Argentina.
<ENAMEX TYPE="PERSON">Leloir</ENAMEX> was born in <ENAMEX TYPE="LOCATION">Argentina</ENAMEX>.
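The annotated hypothesis can be turned back into a list of named entities with a few lines of Python. This is only a minimal sketch: the function name `extract_entities` is hypothetical, and it assumes the flat, non-nested ENAMEX markup shown in the example rather than LingPipe's full output format.

```python
import re

# Matches inline annotations of the form
# <ENAMEX TYPE="PERSON">Leloir</ENAMEX> (flat, non-nested markup assumed).
ENAMEX_RE = re.compile(r'<ENAMEX TYPE="([^"]+)">(.*?)</ENAMEX>')


def extract_entities(annotated):
    """Return (entity_text, entity_type) pairs in order of appearance."""
    return [(text, etype) for etype, text in ENAMEX_RE.findall(annotated)]


annotated = ('<ENAMEX TYPE="PERSON">Leloir</ENAMEX> was born in '
             '<ENAMEX TYPE="LOCATION">Argentina</ENAMEX>.')
entities = extract_entities(annotated)
print(entities)  # [('Leloir', 'PERSON'), ('Argentina', 'LOCATION')]
```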
Tools - MINIPAR
• MINIPAR transforms the text and the hypothesis into dependency trees.
• Example: Le Beau Serge was directed by Chabrol.
(
E0 (() fin C * )
1 (Le ~ U 3 lex-mod (gov Le Beau Serge))
2 (Beau ~ U 3 lex-mod (gov Le Beau Serge))
3 (Serge Le Beau Serge N 5 s (gov direct))
4 (was be be 5 be (gov direct))
5 (directed direct V E0 i (gov fin))
E2 (() Le Beau Serge N 5 obj (gov direct) (antecedent 3))
6 (by ~ Prep 5 by-subj (gov direct))
7 (Chabrol ~ N 6 pcomp-n (gov by))
8 (. ~ U * punc)
)
• Resulting dependency tree (edge label before each node):
direct (V)
  s: Le_Beau_Serge (N)
    lex-mod: Le (U)
    lex-mod: Beau (U)
  be: be (be)
  by: Chabrol (N)
  obj: Le_Beau_Serge (N)
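The dependency tree for the example sentence can be held in a small recursive structure and flattened into (node, parent, edge label) triples, the form used later in the fitness tables. The sketch below is illustrative only: the `Node` class and the `triples` helper are hypothetical names, and the tree is typed in by hand rather than read from MINIPAR output.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    category: str
    relation: str = ""  # label of the edge to the parent; empty for the root
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it, so subtrees can be chained."""
        self.children.append(child)
        return child


def triples(node, parent=None):
    """Yield (word, parent_word, relation) for every node, root first."""
    yield (node.word, parent.word if parent else "-", node.relation or "-")
    for child in node.children:
        yield from triples(child, node)


# Hand-built tree for "Le Beau Serge was directed by Chabrol."
root = Node("direct", "V")
subj = root.add(Node("Le_Beau_Serge", "N", "s"))
subj.add(Node("Le", "U", "lex-mod"))
subj.add(Node("Beau", "U", "lex-mod"))
root.add(Node("be", "be", "be"))
root.add(Node("Chabrol", "N", "by"))
root.add(Node("Le_Beau_Serge", "N", "obj"))

tree_triples = list(triples(root))
for triple in tree_triples:
    print(triple)
```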
Resources – DIRT
• DIRT is both an algorithm and a resulting knowledge collection created by Lin and Pantel.
• Paraphrases of "X solves Y":
  – Y is solved by X
  – X resolves Y
  – X finds a solution to Y
  – X tries to solve Y
  – X deals with Y
  – Y is resolved by X
  – …
• Example: Le Beau Serge was directed by Chabrol
  – Full path N:s:V<direct>V:by:N, with parts N:s:V<direct>V: and :V<direct>V:by:N
  – Full path N:obj:V<direct>V:by:N, with parts N:obj:V<direct>V: and :V<direct>V:by:N
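The paraphrase list for "X solves Y" can be pictured as a lookup table from a template to its alternatives. The following is a toy sketch, not the DIRT database format: the table holds only the paraphrases listed on this slide, the similarity scores that DIRT attaches to each entry are left out, and `instantiate` and `variants` are hypothetical helper names.

```python
import re

# Toy paraphrase table with the entries listed for "X solves Y".
# Real DIRT entries also carry similarity scores, omitted here.
PARAPHRASES = {
    "X solves Y": [
        "Y is solved by X",
        "X resolves Y",
        "X finds a solution to Y",
        "X tries to solve Y",
        "X deals with Y",
        "Y is resolved by X",
    ],
}


def instantiate(template, x, y):
    """Fill the X and Y slots of a template in a single pass."""
    slots = {"X": x, "Y": y}
    return re.sub(r"\b[XY]\b", lambda m: slots[m.group()], template)


def variants(template, x, y):
    """Return every listed paraphrase of `template` with its slots filled."""
    return [instantiate(p, x, y) for p in PARAPHRASES.get(template, [])]


for sentence in variants("X solves Y", "Mary", "the puzzle"):
    print(sentence)
```

Filling both slots in one regular-expression pass avoids the mistake of replacing an "X" or "Y" that appears inside an already inserted argument.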
Resources – eXtended WordNet
For every synonym, we check to see which word appears in the text tree, and select the mapping with the best value according to the values from eXtended WordNet.
For example, the relation between “relative” and “niece” is made with a score of 0.078652.
Resources - Acronyms
The acronym database helps our program find relations between an acronym and its meaning, for example “US - United States”.
Background Knowledge - Example
<pair id="748" entailment="YES">
<T>Argentina President Carlos Menem has ordered an 'immediate' investigation into war crimes allegedly committed by British troops during the 1982 Falklands War.</T>
<H>Argentine demanded an investigation of alleged war crimes during the Falklands War.</H>
</pair>
Resources – Background Knowledge
• Examples of extracted relations:
  – Argentine [is] Argentina
  – Chinese [in] China
  – Los Angeles [in] California
  – 2 [is] two
  – Netherlands [is] Holland
• “Argentine”: extracted snippets from Wikipedia:
  – ar |calling_code = 54 |footnotes = Argentina also has a territorial dispute
  – Argentina', , Nación Argentina (Argentine Nation) for many legal purposes), is
  – in the world. Argentina occupies a continental surface area of
  – Argentina national football team
• Relations extracted for “Netherlands”:
  – Netherlands [is] Dutch
  – Netherlands [is] Nederlandse
  – Netherlands [is] Antillen
  – Netherlands [in] Europe
  – Netherlands [is] Holland
  – Antilles [in] Netherlands
• The relations usually come from “definition” patterns:
  – verbs like “is”, “define”, “represent”, etc.
  – punctuation context: , “ ‘ () [] :
  – anaphora resolution
Semantic Variability Rules
• Negation rule – given by terms like “no”, “not”, “never”
• Modal verbs: “may”, “might”, “cannot”, “should”, “could”
• Certain cases for particle “to” when it precedes:
  – a verb: “allow”, “impose”, “galvanize”
  – an adjective like “necessary”, “compulsory”, “free”
  – a noun like “attempt”, “trial”
• Influence of context:
  – Positive words: “certainly”, “absolutely”
  – Negative words: “probably”, “likely”
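One way to picture how such cue words could flag a verb is a simple token-window check. This is a sketch under loud assumptions: the system described here works on dependency trees, whereas the code below only looks at the two tokens before a verb, and the `window` parameter and the `is_negated` name are invented for illustration.

```python
# Cue words taken from the rules above.
NEGATION = {"no", "not", "never"}
MODALS = {"may", "might", "cannot", "should", "could"}
NEGATIVE_CONTEXT = {"probably", "likely"}
CUES = NEGATION | MODALS | NEGATIVE_CONTEXT


def is_negated(verb_index, tokens, window=2):
    """True if a cue word occurs within `window` tokens before the verb.

    The fixed window is an assumption made for this sketch only.
    """
    start = max(0, verb_index - window)
    return any(tok.lower() in CUES for tok in tokens[start:verb_index])


negated = "The cat did not kill the mouse".split()
plain = "The cat killed the mouse".split()
print(is_negated(negated.index("kill"), negated))  # True
print(is_negated(plain.index("killed"), plain))    # False
```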
Fitness calculation 1
• Local Fitness:
  – 1 at direct mapping, Acronyms, BK
  – DIRT score
  – eXtended WordNet score
• Extended Local Fitness:
  – Local Fitness
  – Parent Fitness
  – Mapping of edge label
  – Node Position (left or right)
[Figure: hypothesis tree mapped onto the text tree through node mapping, father mapping and edge label mapping]
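Fitness calculation 2
• Total Fitness:
$$TF = \frac{\sum_{node \in H} \mathit{ExtendedLocalFitness}_{node}}{\mathit{HypothesisNodesNumber}}$$
• The Negation Value:
$$NV = \frac{\mathit{Positive\_VerbsNumber}}{\mathit{TotalNumberOfVerbs}}$$
• Global Fitness:
$$\mathit{GlobalFitness} = NV \cdot TF + (1 - NV) \cdot (4 - TF)$$
• Threshold value = 2.06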
Fitness calculation 3
• T: The French railway company SNCF is cooperating in the project.
• H: The French railway company is called SNCF.

Initial entity            Node Fitness   Extended local fitness
(the, company, det)       1              3.125
(French, company, nn)     1              3.125
(railway, company, nn)    1              3.125
(company, call, s)        1              2.5
(be, call, be)            1              4
(call, -, -)              0.096          3.048
(company, call, obj)      1              1.125
(SNCF, call, desc)        1              2.625

• Total_Fitness = (3.125 + 3.125 + 3.125 + 2.5 + 4 + 3.048 + 1.125 + 2.625)/8 = 22.673/8 = 2.834
• NV = Positive_VerbsNumber / TotalNumberOfVerbs = 1/1 = 1
• GlobalFitness = 1*2.834 + (1–1)*(4–2.834) = 2.834
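The arithmetic of this example can be checked with a short script that applies TF = sum of extended local fitness values / number of hypothesis nodes, NV = positive verbs / total verbs, and GlobalFitness = NV*TF + (1 - NV)*(4 - TF). The function names are illustrative, and the final comparison assumes that a pair is accepted when the global fitness reaches the 2.06 threshold; the slides give the threshold but not the exact comparison.

```python
THRESHOLD = 2.06  # threshold value given on the slides


def total_fitness(extended_local_fitness):
    """Mean extended local fitness over the hypothesis nodes."""
    return sum(extended_local_fitness) / len(extended_local_fitness)


def negation_value(positive_verbs, total_verbs):
    """Share of hypothesis verbs that are not negated."""
    return positive_verbs / total_verbs


def global_fitness(nv, tf):
    """Blend TF with its mirror value (4 - TF) according to NV."""
    return nv * tf + (1 - nv) * (4 - tf)


# Extended local fitness values from the table for the SNCF example.
elf = [3.125, 3.125, 3.125, 2.5, 4, 3.048, 1.125, 2.625]

tf = total_fitness(elf)
nv = negation_value(1, 1)
gf = global_fitness(nv, tf)
entailed = gf >= THRESHOLD  # assumed acceptance rule

print(round(tf, 3), nv, round(gf, 3), entailed)  # 2.834 1.0 2.834 True
```

With NV = 1 the second term vanishes and the global fitness equals the total fitness; with NV = 0 the score would flip to 4 - TF, which is how negated verbs push a well-mapped pair below the threshold.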
Results

Run      IE      IR      QA      SUM     Global
Run01    0.57    0.69    0.87    0.635   0.6913
Run02    0.57    0.685   0.865   0.645   0.6913

Language Computer Corporation, USA         0.8000
LCC Richardson, USA                        0.7225
”Al. I. Cuza” University, Romania          0.6913
University of Texas, USA                   0.6700
LT-lab, Germany                            0.6687
University of Rome ”Tor Vergata”, Italy    0.6675
Peer-to-Peer Architecture
[Diagram. Components shown: Initiator, DIRT db, Acronyms, six core modules (CM), SMB upload, SMB download]
• Speed optimization
  – P2P architecture, cache mechanism
• Transfer protocol
  – Fail-over mechanism
• Ending synchronization
  – Quota mechanism
Transfer protocol
SMB header
CIFS protocol
Synchronization problem
• Dynamic quota (~ 0.26 s)
Results

No   Run details                                                           Duration
1    One computer without caching mechanism                                5:28:45
2    One computer with caching mechanism, but with empty cache at start    2:03:13
3    One computer with full cache at start                                 0:00:41
4    5 computers with 7 processes                                          0:00:06.7
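The reported durations translate into speed-ups over the run without caching as follows. The script is only a worked calculation on the four figures above, reading the last row as 0:00:06.7 for 5 computers with 7 processes; the dictionary keys are shorthand labels, not names used by the system.

```python
def to_seconds(duration):
    """Convert 'H:MM:SS' or 'H:MM:SS.f' to seconds."""
    hours, minutes, seconds = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Durations as reported in the results table.
runs = {
    "no cache": "5:28:45",
    "empty cache at start": "2:03:13",
    "full cache at start": "0:00:41",
    "5 computers, 7 processes": "0:00:06.7",
}

baseline = to_seconds(runs["no cache"])
speedups = {name: baseline / to_seconds(d) for name, d in runs.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Under this reading, caching alone takes the run from hours to under a minute once the cache is full, and distributing the work brings it down to a few seconds.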
Conclusions
• The core of our approach is based on a tree edit distance algorithm (Kouylekov, Magnini, 2005).
• The main idea is to transform the hypothesis using sources like DIRT, WordNet, Wikipedia and the acronyms database.
• To improve speed, we use a P2P architecture and a caching mechanism.
• For ending synchronization, we use a dynamic quota.
Acknowledgments
• NLP group of Iasi:
  – Supervisor: Prof. Dan Cristea
  – Diana Trandabat, Corina Forascu, Ionut Pistol, Marius Raschip
• Anaphora resolution group:
  – Iustin Dornescu, Alex Moruz, Gabriela Pavel
THANK YOU!