
Page 1:

Recognizing textual entailment: Rational, evaluation and approaches

Source: Natural Language Engineering 15 (4)
Authors: Ido Dagan, Bill Dolan, Bernardo Magnini and Dan Roth
Reporter: Yong-Xiang Chen

Page 2:

Textual Entailment

• Textual Entailment: one piece of text can be plausibly inferred from another
– Entailment: a text t entails another text (hypothesis) h

• if h is true in every circumstance (possible world) in which t is true

• the hypothesis is necessarily true in any circumstance for which the text is true

– Example:
• T: iTunes software has seen strong sales in Europe
• H: Strong sales for iTunes in Europe
• Text T entails hypothesis H
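
To make the data model concrete, here is a minimal sketch of how such a pair could be represented in code; the `RTEPair` class and its field names are illustrative, not the RTE datasets' actual format.

```python
from dataclasses import dataclass

@dataclass
class RTEPair:
    """One text-hypothesis pair with its gold entailment judgment."""
    text: str        # T: the entailing text
    hypothesis: str  # H: the text to be inferred from T
    entails: bool    # gold label: does T entail H?

# The iTunes example above, encoded as a positive pair
pair = RTEPair(
    text="iTunes software has seen strong sales in Europe",
    hypothesis="Strong sales for iTunes in Europe",
    entails=True,
)
```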

Page 3:

Applied definition

• Allows the truth of the hypothesis to be highly plausible, rather than certain

• Example:
– T: The Republic of Yemen is an Arab, Islamic and independent sovereign state whose integrity is inviolable, and no part of which may be ceded
– H: The national language of Yemen is Arabic

Page 4:

Multiple applications

• Text-understanding applications which need semantic inference

• QA:
– System has to identify (candidate) texts that entail the expected answer
• Question: ‘Who is John Lennon’s widow?’
• Answer: ‘Yoko Ono is John Lennon’s widow’
• Candidate text: ‘Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England’s Liverpool Airport as Liverpool John Lennon Airport’
• The candidate text entails the expected answer

Page 5:

Current lines of research

• Automatic acquisition of paraphrases and lexical semantic relationships

• Unsupervised inference in applications
– Question answering
– Information extraction

• Needs:
– Inference methods
– Knowledge representations
– The use of learning methods

• Community: PASCAL Recognizing Textual Entailment (RTE) challenges

Page 6:

RTE task

• Given two text fragments (T and H), decide whether the meaning of one can be inferred (is entailed) from the other
– Applied notion: a directional relationship between pairs of text expressions
– T entails H if a human reading T would infer that H is most probably true
• By language
• By common background knowledge

– Contests provided datasets for evaluation and a forum for presenting and comparing

– RTE1 in 2005, RTE4 in 2008; now part of TAC 2010
• Other workshops:

– Answer Validation Exercise (AVE) at Cross-Language Evaluation Forum (QA@CLEF 2008)

– The second evaluation campaign of NLP tools for Italian (EVALITA 2009)

Page 7:

Rationale of RTE

• Variability of semantic expression
– the same meaning can be expressed by, or inferred from, different texts
– considered the dual problem of language ambiguity
– together they form a many-to-many mapping between language expressions and meanings

• Different applications need similar models for semantic variability

• need a model that recognizes when different text variants convey a particular target meaning

• to evaluate the performance of application-oriented methods
• it is difficult to compare them under a generic evaluation framework

Page 8:

Entailment for applications

• IR
– A query denotes a combination of semantic concepts and relations
– Relevant retrieved documents should entail the query
• IE
– Different text variants entail the same target relation
• Multi-document summarization
– a sentence that is entailed by other sentences in the summary is redundant and should be omitted from the summary
• MT
– A correct automatic translation should be semantically equivalent to the gold-standard translation

Page 9:

Evaluation of RTE

• Operational evaluation task
– annotators decide whether this relationship holds for a given pair of texts or not

Page 10:

RTE1 dataset & gold-standard annotation

Page 11:

The RTE systems’ results

• Overall accuracy levels:
– RTE1: from 50% to 60% (17 submissions)
– RTE2: from 53% to 75% (23 submissions)
– RTE3: from 49% to 80% (26 submissions)
– RTE4: from 45% to 74% (26 submissions, three-way task)

• Common approaches:
– machine learning (typically SVM)
– logical inference
– cross-pair similarity measures between T and H
– word alignment

Page 12:

Collecting RTE datasets

• The text–hypothesis pairs were collected from several application scenarios

1. information extraction pairs

2. information retrieval pairs

3. question answering pairs

4. summarization pairs

Page 13:

Collecting information extraction pairs

• Simulate the need of IE systems to recognize that a given text indeed entails the target semantic relation
– Relation: X works for Y
– T: ‘An Afghan interpreter, employed by the United States, was also wounded’
– H: ‘An interpreter worked for Afghanistan’

• Adapting the setting to pairs of texts rather than a text and a structured template

• The pairs were generated using four different approaches
– 1.
• ACE-2004 relations (relations tested in the ACE-2004 RDR task) were taken as templates for hypotheses
• Relevant news articles were collected as texts
• Texts were then given to actual IE systems for extraction of ACE relation instances
• The system outputs were used as hypotheses
– generating both positive examples (from correct outputs) and negative examples (from incorrect outputs)

Page 14:

– 2.
• the output of IE systems on the dataset of the MUC-4 TST3 task (the events are acts of terrorism)
– 3.
• additional entailment pairs were manually generated from both the annotated MUC-4 dataset and the news articles collected for the ACE relations
– 4.
• hypotheses which correspond to new types of semantic relations (not found in the ACE and MUC datasets)
• manually generated for sentences in the collected news articles
• relations were taken from various semantic domains, such as sports, entertainment and science

Page 15:

Collecting information retrieval pairs

• Assume: relevant documents should entail the given propositional query (hypothesis)

• Hypotheses: propositional IR queries
– which specify some statement
– e.g. ‘Alzheimer’s disease is treated using drugs’
– adapted and simplified from standard IR evaluation datasets (TREC and CLEF)

• Texts: selected from documents retrieved by different search engines for each hypothesis
– e.g. Google, Yahoo and MSN

Page 16:

Collecting question answering pairs

• In a QA system, the retrieved answer passage should entail the correct answer

• Annotators were given questions, taken from TREC-QA and QA@CLEF datasets

• Text: corresponding answers were extracted from the web by QA systems

• Hypothesis: transforming a question–answer pair into a text–hypothesis pair
1. Annotators picked from the answer passage an answer term of the expected answer type, either a correct or an incorrect one
2. Then, the annotators turned the question into an affirmative sentence with the answer term ‘plugged in’
• given the question: ‘How many inhabitants does Slovenia have?’
• Answer text: ‘In other words, with its 2 million inhabitants, Slovenia has only 5.5 thousand professional soldiers’ (T)
• picking ‘2 million’ as the (correct) answer term and turning the question into the statement ‘Slovenia has 2 million inhabitants’ (H) produces a positive entailment pair
• picking ‘5.5 thousand’ as an (incorrect) answer term produces a negative pair
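
As a rough illustration of this annotation recipe (the real pairs were written by human annotators; `make_pair` and its hard-coded affirmative template are purely hypothetical), the Slovenia example could be encoded as:

```python
question = "How many inhabitants does Slovenia have?"
answer_text = ("In other words, with its 2 million inhabitants, "
               "Slovenia has only 5.5 thousand professional soldiers")

def make_pair(answer_term: str, is_correct: bool) -> dict:
    # Step 2 of the recipe: the question turned into an affirmative
    # sentence with the answer term 'plugged in' (done manually in RTE)
    hypothesis = f"Slovenia has {answer_term} inhabitants"
    return {"T": answer_text, "H": hypothesis, "entails": is_correct}

positive = make_pair("2 million", True)      # correct term -> positive pair
negative = make_pair("5.5 thousand", False)  # incorrect term -> negative pair
```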

Page 17:

Collecting summarization pairs

• T and H are sentences taken from a news document cluster
– news articles that describe the same news item

• Annotators were given the output of multi-document summarization systems
– document clusters
– the summary generated for each cluster

• Annotators picked sentence pairs with high lexical overlap
• T: at least one of the sentences was taken from the summary
• H:
– positive examples: H was simplified by removing the parts not entailed by T, until H was fully entailed by T
– negative examples: H was simplified similarly, but without reaching entailment of H by T

Page 18:

Creating the final dataset

• Cross-annotation by at least two annotators
– in RTE1, pairs on which the annotators disagreed were filtered out
– average agreement on the test set (between each pair of annotators who shared at least 100 examples) was 89.2%, with an average kappa level of 0.78
– additional filtering: discarded pairs that seemed controversial, too difficult or redundant
• 25.5% of the (original) pairs were removed from the test set
• spelling and punctuation were fixed, but not style
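
For reference, raw agreement and Cohen's kappa can be computed as below; the annotator judgments here are made up, and the paper does not specify which kappa variant or tooling was used.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up entailment judgments from two annotators over shared pairs
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 1]

# Raw agreement: fraction of pairs judged identically
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects the raw agreement for chance agreement
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```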

Page 19:

Evaluation measures

• the main task in the RTE Challenges was classification
– an entailment judgment for each pair in the test set
– the evaluation criterion for this task was accuracy

• a secondary, optional task was ranking
– the T–H pairs are ordered according to their entailment confidence
• the first pair is the one for which entailment is most certain
• the last pair is the one for which entailment is least likely

– A perfect ranking would place all the positive pairs before all the negative pairs

– This task was evaluated using the average precision measure
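
Both measures are standard and can be sketched as follows; the labels and confidence scores are invented for illustration, and scikit-learn is just one convenient way to compute them.

```python
from sklearn.metrics import accuracy_score, average_precision_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # gold labels (1 = entailment)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard judgments (classification task)
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # confidences (ranking task)

# Classification task: accuracy over the hard judgments
print("accuracy:", accuracy_score(y_true, y_pred))

# Ranking task: average precision rewards placing positive pairs first;
# a perfect ranking (all positives before all negatives) scores 1.0
print("average precision:", average_precision_score(y_true, y_score))
```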

Page 20:

RTE1

• Manually collected text fragment pairs
– text (T): one or two sentences
– hypothesis (H): one sentence

• Participating systems were required to judge, for each pair, whether T entails H
– The pairs represented success and failure settings of inferences in various application types: QA, IE, IR and MT
– development set: 567 pairs
– test set: 800 pairs
– 17 submissions
– low accuracy: best results below 60%

Page 21:

Results

• Allowed submission of partial-coverage results which do not cover all the test data

• Balanced dataset in terms of true and false examples
– a system that uniformly predicts True (or False) would achieve an accuracy of 50%, serving as a baseline

• The most basic inference type: measuring word overlap between T and H
– using a simple decision tree trained on the development set, this obtained an accuracy of 0.568, as sketched below
• Might reflect a knowledge-poor baseline
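
A minimal sketch of such a knowledge-poor baseline, assuming a single overlap feature (the fraction of hypothesis words that also occur in the text) fed to a decision tree; the two training pairs are toy examples, not RTE1 data.

```python
from sklearn.tree import DecisionTreeClassifier

def overlap(text: str, hypothesis: str) -> float:
    """Fraction of hypothesis words that also appear in the text."""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    return len(h_words & t_words) / len(h_words) if h_words else 0.0

# Toy development pairs: (T, H, gold label)
dev = [
    ("iTunes software has seen strong sales in Europe",
     "Strong sales for iTunes in Europe", 1),
    ("iTunes software has seen strong sales in Europe",
     "iTunes was banned in Europe", 0),
]
X = [[overlap(t, h)] for t, h, _ in dev]
y = [label for _, _, label in dev]

# The decision tree simply learns an overlap threshold
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[overlap("Strong sales of iTunes in Europe",
                            "iTunes sells well in Europe")]]))
```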

Page 22:

Page 23:

RTE2

• Four sites participated in the data collection and annotation, from Israel, Italy and the USA

• The RTE2 dataset aimed to provide more ‘realistic’ text–hypothesis examples, based mostly on outputs of actual systems

• Annotation processes included cross-annotation
• Sentence splitting and dependency parsing were added to the challenge data

Page 24:

RTE3

• part of the pairs contained longer texts (up to one paragraph)
– encouraging participants to move towards discourse-level inference

• 26 teams participated

• the results were presented at the ACL 2007 Workshop

• scores ranged from 0.49 to 0.80

Page 25:

RTE4

• Included as a track in the Text Analysis Conference (TAC), organized by NIST

• Three-way classification: non-entailment cases are split into
• CONTRADICTION: the negation of the hypothesis is entailed by the text
• UNKNOWN: the truth of the hypothesis cannot be determined based on the text

– Runs submitted to the three-way task were automatically converted to two-way runs, conflating the non-entailment cases into NO ENTAILMENT (see the sketch after this list)

– in the three-way task, the best accuracy was 0.685
• the average three-way accuracy was 0.51
– in the two-way judgment, the best accuracy was 0.72
• lower than the results achieved in RTE3 (though the datasets are different)
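
The conversion itself is mechanical, as in this small sketch (the label strings follow the slide; the function name is ours):

```python
def to_two_way(label: str) -> str:
    """Conflate the two non-entailment classes into NO ENTAILMENT."""
    return "ENTAILMENT" if label == "ENTAILMENT" else "NO ENTAILMENT"

three_way_run = ["ENTAILMENT", "CONTRADICTION", "UNKNOWN", "ENTAILMENT"]
print([to_two_way(label) for label in three_way_run])
# ['ENTAILMENT', 'NO ENTAILMENT', 'NO ENTAILMENT', 'ENTAILMENT']
```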

Page 26:

Some issues

• Adopts approaches from applied semantics
• The availability of a training set made it possible to formulate the problem as a classification task, with features including:
– lexical-syntactic features
– semantic features
– document co-occurrence counts
– first-order syntactic rewrite rules
– the information gain provided by lexical measures

• Designing transformations to derive the hypothesis H from the text T
– transformation rules designed to preserve the entailment relation
– a probabilistic setting

• Precision-oriented RTE modules
– specialized textual entailment engines are designed to address a specific aspect of language variability
• e.g. contradiction, lexical similarity
• combined, by applying a voting mechanism, with a high-coverage backup module

Page 27:

Resources

• lexical-semantic resources
– WordNet and its extensions
• statistically learned inference rules
– DIRT (Discovery of Inference Rules from Text); see the sketch after this list
• X is author of Y ≈ X wrote Y
• X solved Y ≈ X found a solution to Y
• X caused Y ≈ Y is triggered by X
• verb-oriented resources
– VerbNet and VerbOcean
• the web as a resource
– to extract entailment rules, named entities and background knowledge
– Wikipedia
• various text collections
– Reuters corpus
– English Gigaword
– used to extract features based on documents’ co-occurrence counts and InfoMap
• Dekang Lin’s thesaurus and gazetteers
– to draw lexical similarity judgements
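
The sketch below shows, very schematically, how a DIRT-style rule could rewrite one expression into its paraphrase. DIRT itself learns and matches dependency-tree paths, not surface strings, so the regular-expression matching here is only an illustration.

```python
import re
from typing import Optional

# One DIRT-style inference rule with X/Y slots, stated over surface text
PATTERN, TEMPLATE = r"(?P<X>\w+) wrote (?P<Y>[\w\s]+)", "{X} is author of {Y}"

def apply_rule(sentence: str) -> Optional[str]:
    """Rewrite a sentence matching the rule's left-hand side."""
    m = re.fullmatch(PATTERN, sentence)
    return TEMPLATE.format(**m.groupdict()) if m else None

print(apply_rule("Tolstoy wrote War and Peace"))
# Tolstoy is author of War and Peace
```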

Page 28:

Predicts entailment using syntactic features and a general purpose thesaurus

• To understand what proportion of the entailments in the RTE1 test set could be solved using a robust parser

• Human annotators evaluated each T–H pair– true by syntax, false by syntax, not syntax and can’t decide

• Annotators also indicated whether the information in a general-purpose thesaurus entry would allow a pair to be judged true or false
– 37% of the test items can be handled by syntax
– 49% of the test items can be handled by syntax plus a general-purpose thesaurus

• It is easier to decide when syntax can be expected to return ‘true’; it is less certain when to assign ‘false’

Page 29:

Two intermediate models of textual entailment

• Lexical level
– lexical-semantic relations
– morphological relations
– lexical world knowledge

• Lexical-syntactic level
– syntactic relationships and transformations
– lexical-syntactic inference patterns (rules)
– co-reference

• Compared the outcomes for the two models as well as for their individual components

• The lexical-syntactic model outperforms the lexical one
• Both models fail to achieve high recall
• the majority of pairs rely on a significant amount of the so-called common human understanding of lexical and world knowledge

Page 30:

A harder task: contradiction

• Contradiction occurs when two sentences are extremely unlikely to be true simultaneously (and involve the same event)

• accompanied by a collection of contradiction corpora

• a harder task, since it requires deeper inference, assessing event co-reference, and model building

Page 31:

Conclusions

• The RTE task has reached a noticeable level of maturity

• the long-term goal of textual entailment research is the development of robust entailment ‘engines’
– to be used as a generic component in many text-understanding applications