
Decontextualization: Making Sentences Stand-Alone

Eunsol Choi2∗, Jennimaria Palomaki1, Matthew Lamm1, Tom Kwiatkowski1, Dipanjan Das1, Michael Collins1

1Google Research   2Department of Computer Science, The University of Texas at Austin

[email protected], {jpalomaki, mrlamm, tomkwiat, dipanjand, mjcollins}@google.com

Abstract

Models for question answering, dialogue agents, and summarization often interpret the meaning of a sentence in a rich context and use that meaning in a new context. Taking excerpts of text can be problematic, as key pieces may not be explicit in a local window. We isolate and define the problem of sentence decontextualization: taking a sentence together with its context and rewriting it to be interpretable out of context, while preserving its meaning. We describe an annotation procedure, collect data on the Wikipedia corpus, and use the data to train models to automatically decontextualize sentences. We present preliminary studies that show the value of sentence decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important subtask in many downstream applications, and that the definitions and resources provided can benefit tasks that operate on sentences that occur in a richer context.

1 Introduction

Many applications of natural language processing need to be able to interpret, or present, text independently from the rich context in which it occurs. For example, summarization systems extract salient information from documents and present it in a reduced context. Many systems also segment documents prior to interpretation or retrieval for computational efficiency. In all of these cases, we would like the context-reduction step to be meaning preserving but, to date, there has been no independent method of ensuring this.

In this paper we isolate and define the problem of sentence decontextualization: taking a sentence

∗Work done at Google.

Page Title: Croatia at the FIFA World Cup
Paragraph: Croatia national football team have appeared in the FIFA World Cup on five occasions (in 1998, 2002, 2006, 2014 and 2018) since gaining independence in 1991. Before that, from 1930 to 1990 Croatia was part of Yugoslavia. Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.
Decontextualized Sentence: The Croatia national football team's best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France.

Figure 1: An example decontextualization. The sentence to decontextualize is highlighted in grey.

together with its context and rewriting it to be interpretable out of context if feasible, while preserving its meaning.1 Having defined the problem, we operationalize this definition into a high quality annotation procedure; use the resulting data to train models to automatically decontextualize sentences; and present preliminary results that show the value of automatic decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important sub-task in many downstream applications, and we believe this work can benefit tasks that operate on sentences that occur in a wider context.

One contribution of this work is to release a dataset of decontextualized sentences that can be used as training and evaluation data, together with the evaluation script: on publication of this paper the data will be available at https://github.com/google-research/language/tree/master/language/decontext.

Figure 1 shows an example decontextualization. In this example we have a coreference resolution step (their → The Croatia national football team's) and a bridging step (insertion of the prepositional phrase "in the FIFA World Cup" to modify "Croatia's best result thus far").

1 More precisely the truth-conditional meaning or explicature (Sperber and Wilson, 1986); see Section 2 for discussion.


Decontextualization involves various linguistic phenomena, including coreference resolution, global scoping, and bridging anaphora (Clark, 1975). We present a linguistically motivated definition of decontextualization in Section 2 and show that this definition can be reliably applied by crowdworkers in Section 3.

We generate a corpus of decontextualized sentences corresponding to original sentences drawn from the English Wikipedia. We show that a high proportion of these original sentences can be decontextualized using a relatively simple set of rewrite operations, and we use the data to define a new automatic decontextualization task in which a computer system needs to create a decontextualized sentence from an original sentence presented in paragraph context. We discuss the implications of choosing Wikipedia as a domain in Section 3.4.

We present two methods for automatic decontextualization based on state-of-the-art coreference (Joshi et al., 2020) and generation (Raffel et al., 2019a) models. We evaluate the output of these models with automatic measures (derived from Xu et al. (2016)), as well as through human evaluation. Both automatic and human evaluations show that the largest sequence-to-sequence model produces high quality decontextualizations in the majority of cases, although it still lags human performance in the thoroughness and accuracy of these decontextualization edits.

Finally, we present two demonstrations of the utility of decontextualization. The first is a user study giving evidence that decontextualized sentences can be valuable when presented to users as answers in a question-answering task: raters judge that they balance conciseness with informativeness. In the second, we use decontextualization as a preprocessing component for generating a retrieval corpus for open domain question answering. Decontextualizing the sentences to be indexed by the retrieval system enables more efficient answer string retrieval for information seeking queries. These demonstrations are presented as preliminary results, and we argue that decontextualization is an important sub-task for a wide range of NLP applications.

2 Linguistic Background

We start with the following definition:

Definition 1 (Decontextualization) Given a sentence-context pair (s, c), a sentence s′ is a valid decontextualization of s if: (1) the sentence s′ is interpretable in the empty context; and (2) the truth-conditional meaning of s′ in the empty context is the same as the truth-conditional meaning of s in context c.

A context c is a sequence of sentences preceding s, and the empty context is the empty sequence.

We have been careful here to use the more specific term "truth conditional meaning" rather than "meaning". Here we follow the distinction in semantics/pragmatics between truth conditional meaning and implicature, and deliberately exclude implicatures (which can also be considered part of the meaning of an utterance) from our definition. There is a rich history of work in semantics and pragmatics on truth-conditional meaning and implicatures, going back to Grice (1975). Our concept of "truth conditional meaning" is very close to "explicature" as used in Relevance Theory (Sperber and Wilson, 1986). Consider this description of explicature from Birner (2012) (pages 96–97, our own emphasis added):

"The explicature in an utterance is the result of enriching the semantic content with the sorts of pragmatic information necessary to provide us with a truth-evaluable proposition. This includes calculating the referents for pronouns, working out the intended interpretation for deictic phrases like here and later ..., disambiguating lexically and structurally ambiguous words and phrases, making any "bridging" inferences necessary for reference resolution ... and so on."

We will see in the next section that our annotation task follows this definition quite closely.

As an example consider the following ex-change:

Susan: Has the Croatia national football team ever won the FIFA World Cup?
Jon: Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.

Here the truth conditional meaning of Jon's reply is equivalent to "Croatia's best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France.", whereas the implicature would be "the Croatia national football team has never won the FIFA World Cup" (which answers Susan's question). In our definition the decontextualized sentence s′ should preserve the truth-conditional meaning, but is not required to preserve the implicature(s) of the sentence.2

Remark (extra-linguistic context): In addition to its document context, a given sentence s and its counterpart s′ also come with a temporal, cultural, and geographic context, i.e., where and when they are being written or read and by whom.3

We assume that these aspects of context are preserved during decontextualization. The effect of this is that elements of s which derive their meaning from outside of the document context will receive equivalent interpretation in s′, and hence do not require decontextualization. For example, the expression "thus far" in Figure 1 is interpreted relative to the time of utterance, not relative to what has been previously said in the Wikipedia article, and hence it appears in both the original and decontextualized sentences.

3 Task Definition

An annotator is provided with an entire document d with a target sentence within the document, represented by its start and end indices (s_start, s_end). First, the annotator decides whether the target sentence can be decontextualized or not, labeling it as FEASIBLE or INFEASIBLE. If the example is marked as FEASIBLE, the annotator decontextualizes the sentence, producing y, a new sentence that satisfies the conditions in Definition 1.
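For concreteness, a minimal sketch of how one such annotated example might be represented is shown below; the field names and types are our own illustration and are not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DecontextExample:
    """One annotation unit: a target sentence in its Wikipedia page context."""
    page_title: str
    document_sentences: List[str]   # the page d, split into sentences
    target_start: int               # start index of the target sentence in d
    target_end: int                 # end index of the target sentence in d
    category: Optional[str] = None  # "FEASIBLE" or "INFEASIBLE", set by the annotator
    decontextualized: Optional[str] = None  # the rewritten sentence y, when FEASIBLE
```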

3.1 Feasibility

Sentences in FEASIBLE include sentences that do not require any modification to be decontextualized (e.g., "Émilie du Châtelet proposed the hypothesis of the conservation of total energy, as distinct from momentum."), and sentences that require edits to stand alone.

In the decontextualization step, we instructed annotators to make only minor modifications, which includes copying and pasting a few phrases from the document to the target sentence and deleting phrases from the target sentence. When it is too challenging to decontextualize, it is classified into the INFEASIBLE category. Often, sentences in this category are a part of a narrative

2 We have not necessarily given up on recovering implicatures: the decontextualized sentence will likely be a valuable intermediate step in deriving the implicatures of an utterance.

3 Research on text simplification (Xu et al., 2015; Bingel et al., 2018) also shows how target output depends on the expected audience.

Page Title: Postage stamps and postal history of India
Paragraph: ... In the opinion of Geoffrey Clarke, the reformed system was to be maintained "for the benefit of the people of India and not for the purpose of swelling the revenue." The Commissioners voted to abolish the earlier practice of conveying official letters free of postage ("franking"). The new system was recommended by the Governor-General, Lord Dalhousie, and adopted by the East India Company's Court of Directors.

Page Title: Thermodynamic temperature
Paragraph: ... To completely melt ice at 0 °C into water at 0 °C, one must add roughly 80 times the thermal energy as is required to increase the temperature of the same mass of liquid water by one degree Celsius. The metals' ratios are even greater, typically in the range of 400 to 1200 times.

Figure 2: Decontextualization examples falling into the INFEASIBLE category. The sentence to be decontextualized is highlighted in grey.

story, or rely heavily on the preceding few sentences. See Figure 2 for examples.

3.2 Edit Types and Linguistic Phenomena

When an example is classified as FEASIBLE, the annotator makes edits to decontextualize the sentence. Table 1 shows the different edit types. They fall into four broad categories:

NAME COMPLETION and PRONOUN / NP SWAP correspond to replacement of a referring expression that is unclear out of context with a referring expression that is unambiguous out of context. For example, replacing the pronoun "She" with "Cynthia Nixon", the definite NP "the copper statue" with "The Statue of Liberty", or the abbreviated name "Meg" with "Megan "Meg" Griffin".

DM REMOVAL involves removal of discourse markers (DMs) such as "therefore".

BRIDGING and GLOBAL SCOPING involve addition of a phrase (typically a prepositional phrase) that modifies either a particular noun phrase ("bridging") or the entire sentence ("global scoping"). For example, adding "in the Ultimate Fighting Championship" as a modifier to "all fights", or adding "at the 2018 Cannes Film Festival" at the end of the sentence. The additional phrase essentially spells out a modifier that is implied by the context.

ADDITION inserts background information that significantly improves readability: in many cases, this involves adding an appositive or premodifier

Edit Type | Description | Example | %
PRONOUN/NP SWAP | Replacement of a definite pronoun / noun phrase with another referring expression | [The copper statue → The Statue of Liberty], a gift from the people of France to the people of the United States, was designed by French sculptor Frédéric Auguste Bartholdi and built by Gustave Eiffel. | 40.5
NAME COMPLETION | Expansion of acronyms or partial names | [Meg → Megan "Meg" Griffin] made her first appearance on television when Family Guy debuted on Fox on January 31, 1999, with the episode "Death Has a Shadow". | 11.5
DM REMOVAL | Removal of discourse markers that can only be understood in context | [For instance, → ] Alaska could be regarded as the highest state because Denali, at 20,310 feet, is the highest point in the US. | 3.5
BRIDGING | Addition of a modifier (typically a PP) to a noun phrase | In all fights [→ in the Ultimate Fighting Championship], each round can be no longer than five minutes. | 13
GLOBAL SCOPING | Addition of a phrase (typically a PP) that modifies the entire sentence | The Japanese film Shoplifters, directed by Hirokazu Kore-eda, won the Palme d'Or [→ at the 2018 Cannes Film Festival]. | 7
ADDITION | Addition of background information that is not necessary but helps readability significantly | Charles Darwin [→ , an English naturalist and biologist,] was among the first to suggest that physiological changes caused by an emotion had a direct impact on, rather than being just the consequence of, that emotion. | 10

Table 1: The list of possible edits in decontextualization. The last column represents how frequently each phenomenon occurs in the data, from manual analysis of 200 examples, including examples that belong to the INFEASIBLE category and examples that do not require any edits. The edit notation [x → y] removes x and adds y at its position.

to a named entity to add useful background information about that entity. Unlike other edits described above, edits in this category are optional. For example, replacing "The Eagles" with "The American rock band The Eagles."
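Purely as an illustration of the edit types above (not the annotation tool's actual mechanics), the following sketch applies copy-and-paste edits as plain string replacements; the helper function and the example edits, taken from Figure 1, are our own.

```python
def apply_edits(sentence, edits):
    """Apply a sequence of (old, new) copy-and-paste edits to the target sentence.

    `new` is text copied from elsewhere in the document (PRONOUN/NP SWAP,
    NAME COMPLETION, BRIDGING, GLOBAL SCOPING, ADDITION); an empty `new`
    corresponds to a deletion such as DM REMOVAL."""
    for old, new in edits:
        sentence = sentence.replace(old, new, 1)
    return sentence

# The two edits from Figure 1: a pronoun swap and a bridging insertion.
original = ("Their best result thus far was reaching the 2018 final, "
            "where they lost 4-2 to France.")
edits = [
    ("Their", "The Croatia national football team's"),   # pronoun / NP swap
    ("thus far", "thus far in the FIFA World Cup"),      # bridging insertion
]
print(apply_edits(original, edits))
# The Croatia national football team's best result thus far in the FIFA
# World Cup was reaching the 2018 final, where they lost 4-2 to France.
```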

3.3 Variability

We note that for a given sentence frequently there will be more than one possible decontextualization. While this inherent subjectivity makes the task challenging to crowdsource and evaluate, we argue this is an important feature, as shown in recent literature (Aroyo and Welty, 2015; Pavlick and Kwiatkowski, 2019; Kwiatkowski et al., 2019), and propose to collect multiple references per example. Table 2 shows examples where there can be multiple different correct decontextualizations. In the first example, while the semantics of the edits are roughly equivalent, i.e., the annotators agreed on what noun phrases have to be disambiguated and what information has to be added, they differ in how to rewrite the sentence. In the second example, we see disagreement on what information should be added to the sentence. We do not make any explicit assumptions about what is known and salient to the reader, and instructed

them to use their best judgement to rewrite such that the new sentence is fluent, unambiguous and clear when posed alone. In the last example, annotators disagree on the feasibility. While the sentence is a part of a bigger narrative, two annotators judged it could be edited to stand alone, by adding a global scoping modifier, "In Greek mythology."

3.4 Scope of Current Task Formulation

Our data comes from the English portion of the Wikipedia corpus. We sampled sentences as follows. We first pick a (question, Wikipedia page, short answer) triple from the Natural Questions (Kwiatkowski et al., 2019) uniformly at random from the questions that have a short answer. We include the sentence containing the short answer as one example; as a second example we choose a sentence at random from the Wikipedia page. After sampling we exclude (1) sentences under a "Plot" category as they are often infeasible to decontextualize; (2) any sentence that is the first sentence of the page; and (3) any sentence from a paragraph containing only a single sentence.

We designed this data selection process to ensure that a large proportion of examples (90%) could be decontextualized using the simple edits described in Section 3.2.

Page title / Section title: We Don't Talk Anymore (Charlie Puth song) / Music video
Paragraph: The music video premiered on August 2, 2016, on BuzzFeed and was directed by Phil Pinto. It shows Puth and Mirella Cardoso as his love interest. ...
Decontextualization 1: [It → We Don't Talk Anymore music video] shows [Puth → Charlie Puth] and Mirella Cardoso as his love interest.
Decontextualization 2: [It → The "We Don't Talk Anymore" (Charlie Puth song) music video] shows Puth and Mirella Cardoso as his love interest.

Page title: The American Baking Competition
Paragraph: CBS placed casting calls for participants on November 14, 2012. Auditions were held between December 1 and December 15, 2012. The competition took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.
Decontextualization 1: The [competition → American Baking Competition] took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.
Decontextualization 2: The [competition → American Baking Competition, a reality competition television series,] took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.

Page title: Gemini (Constellation)
Paragraph: In Greek mythology, Gemini was associated with the myth of Castor and Pollux, the children of Leda and Argonauts both. Pollux was the son of Zeus, who seduced Leda, while Castor was the son of Tyndareus, king of Sparta and Leda's husband. Castor and Pollux were also mythologically associated with St. Elmo's fire in their role as the protectors of sailors. When Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens.
Decontextualization 1: INFEASIBLE
Decontextualization 2: [→ In Greek mythology,] when Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens.

Table 2: Examples showing the diversity of valid decontextualization edits.

Before settling on Wikipedia, we conducted an initial pilot study which revealed that encyclopedic text is substantially easier to decontextualize compared to newswire or literary text. In the latter genres, the context required for the comprehension of any given sentence appears to be much more complex in structure. Similarly, it is difficult to posit decontextualization for sentences that appear on social media platforms, because they are situated within complex and highly specific social contexts. In contrast, being written for a general audience, Wikipedia makes limited assumptions about its reader.

Within Wikipedia, we similarly found that articles on popular historical or cultural entities and events were easier to decontextualize by crowdworkers compared to articles from technical domains, such as ones on medical or mathematical concepts. Comprehension of such articles requires a considerable body of background knowledge or information from preceding paragraphs. Articles in our dataset cover topics that require little background knowledge to comprehend.

We focus on decontextualization of sentences, where the space of edits is restricted, to make the task easier to quality control and annotate. However, alternative formulations, such as decontextualization of paragraphs, could also be studied. One could also consider allowing a wider range of edits, such as multi-sentence outputs and edits beyond copy-and-pasting, such as paraphrasing and re-ordering. We anticipate that exploring such alternative formulations would help to extend the scope of decontextualization to the more challenging domains previously mentioned.

We stress however that in spite of our restriction to single sentences in Wikipedia, the decontextualization task is nevertheless valuable: Wikipedia (and other encyclopedic sources) contain a wealth of factual information, and a high proportion (over 60%; see Table 3) of sentences both require decontextualization and can be decontextualized under our definitions (only 30% of sentences are interpretable out of context without any edits).

4 Data Collection

Annotation Interface The annotator is presented with a sentence in the context of an entire Wikipedia page. In the first step the annotator judges whether the example is FEASIBLE or INFEASIBLE. If the example is marked as FEASIBLE, the annotator can use delete, add, or swap operations within a user interface to produce a decontextualized string.

       | #     | par. len | sent. len | FEASIBLE w/ edit (%) | FEASIBLE as is (%) | INFEASIBLE (%)
Train  | 11290 | 695      | 156       | 60                   | 31                 | 9
Dev    | 1945  | 695      | 162       | 67                   | 21                 | 12
Test   | 1945  | 711      | 160       | 68                   | 20                 | 12
Expert | 100   | 658      | 163       | 63                   | 26                 | 12

Table 3: Data statistics. par. len refers to paragraph length and sent. len refers to sentence length; all lengths are in bytes. The development and test sets are five-way annotated, and the expert data is four-way annotated.

Data Statistics We collected one reference for each example in the training data, and five references for each example in the evaluation data. Annotators are native speakers of English located in the United States, and on average, they took 4 minutes to annotate a single example.

In total, 28 annotators annotated the examples, with 11 annotators annotating more than 1K examples each.

Table 3 presents some overall data statistics. Decontextualization is possible for the majority of examples, with the INFEASIBLE category covering roughly 10% of the data. We note a slight discrepancy between the train and evaluation dataset distributions, potentially due to a change in the annotation interface. A small subset of data is annotated by the authors to be compared with the crowd-sourced data (last row in the table).

Annotation Quality We quantify the annotation agreement on the category classification. The Fleiss' kappa on category classification is 0.51 among expert annotators, and is 0.30 among the crowd annotators (binary agreement is at 85%). We observed more variability among crowdworkers, as their backgrounds are more diverse, and some annotators have a loose concept of "stand alone" and consistently attempted decontextualization.

We also measured agreement among the individual edits. For each of the edit operations (as defined in Section 3.2), we compare the output sentence after the single edit to the set of output sentences produced by single edits from other annotators. About 32.5% of edits were covered.

Because of the inherent annotation variability, four of the authors manually evaluated 100 crowd-sourced annotations from the evaluation data based on two measures: (1) whether the sentence is sufficiently and correctly decontextualized, and (2) whether the sentence is grammatically correct and fluent. Overall, 88% of annotations were valid in both, 89% on the content and 88% on form.

5 Automatic Decontextualization

5.1 Models

We present two models for decontextualization: a coreference resolution model and a sequence-to-sequence generation model. For both models, the input is a concatenation of the title of the Wikipedia document, the section titles, and the paragraph containing the target sentence. During the annotation pilots, we found the document title is crucial for decontextualization, while section headers were frequently necessary or missing. We denote the title of the Wikipedia page as the sequence of tokens t, the section titles of the paragraph as the sequence of tokens ts, and the n sentences in the paragraph containing the target sentence as x1 . . . xn, where each xi is a sequence of tokens and xt is the target sentence (1 ≤ t ≤ n). The model considers the concatenation of a subset of the document,

[CLS] t [S] ts [S] x1 · · · xt−1 [S] xt [S] xt+1 · · · xn [S]

where [S] is a separator token. This representation differs from the setting of the annotators, who were given the full document context. As an approximation, we include article and section titles in the inputs, as these often contain salient contextual elements. We did experiment with giving more context, i.e., adding the first paragraph of the article as an additional input, but did not observe a performance improvement. On the initial pilot, annotators marked that 10-20% of examples required access to the full document.
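A minimal sketch of this input construction, assuming sentences are already split and tokenization is handled downstream; the function name is our own and the literal "[CLS]" / "[S]" strings stand in for the model's special tokens:

```python
def build_model_input(title, section_titles, sentences, t):
    """Flatten the context into
    [CLS] t [S] ts [S] x_1..x_{t-1} [S] x_t [S] x_{t+1}..x_n [S],
    where sentences[t] is the target sentence to decontextualize."""
    before = " ".join(sentences[:t])
    after = " ".join(sentences[t + 1:])
    return (f"[CLS] {title} [S] {section_titles} [S] "
            f"{before} [S] {sentences[t]} [S] {after} [S]")

example_input = build_model_input(
    title="Croatia at the FIFA World Cup",
    section_titles="",
    sentences=[
        "Croatia national football team have appeared in the FIFA World Cup "
        "on five occasions (in 1998, 2002, 2006, 2014 and 2018) ...",
        "Their best result thus far was reaching the 2018 final, where they "
        "lost 4-2 to France.",
    ],
    t=1,
)
```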

The Coreference Model As many decontextualization edits can be recovered by a coreference resolution module, we adapt the output from the state-of-the-art coreference resolution system of Joshi et al. (2020), trained on the CoNLL dataset (Pradhan et al., 2012), as a decontextualization system. We used the publicly available pre-trained checkpoint of SpanBERT-Large with the original hyperparameters.4

We run this model on the input sequence, and map the coreference cluster predictions to modify the sentence as follows. We only consider clusters with a mention in the target sentence. For each

4https://github.com/facebookresearch/SpanBERT/

such cluster, we find its first mention inside the target sentence, and find another mention in the same cluster that was presented earlier in the input and is longer than the current mention. If such a mention is found, we replace the current entity mention string with the earliest such mention string (e.g., "She" is replaced with "Taylor Swift"). On average, 36.5% of examples were modified through this process.
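The replacement heuristic can be sketched as follows. We assume the coreference model returns clusters as lists of (start, end) character spans in document order; this is our paraphrase of the procedure described above, not the authors' released code.

```python
def decontextualize_with_coref(text, target_span, clusters):
    """Replace the first mention of each cluster inside the target sentence
    with the earliest earlier mention of that cluster that is longer
    (e.g. "She" -> "Taylor Swift"). Returns the modified target sentence."""
    t_start, t_end = target_span
    target = text[t_start:t_end]
    replacements = []  # (start, end, replacement), offsets relative to `target`
    for cluster in clusters:  # mentions assumed sorted in document order
        in_target = [m for m in cluster if t_start <= m[0] and m[1] <= t_end]
        if not in_target:
            continue
        s, e = in_target[0]  # first mention inside the target sentence
        earlier_longer = [m for m in cluster
                          if m[1] <= s and (m[1] - m[0]) > (e - s)]
        if earlier_longer:
            es, ee = earlier_longer[0]  # earliest such mention
            replacements.append((s - t_start, e - t_start, text[es:ee]))
    for s, e, repl in sorted(replacements, reverse=True):  # apply right to left
        target = target[:s] + repl + target[e:]
    return target
```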

The Seq2Seq Generation Model is based on the recent T5 model (Raffel et al., 2019b). We show two variations of the model, BASE and 11B, which mainly differ in the model capacity. We fine-tune the model on our crowd-sourced training set, by setting the target sequence to be [CAT] [SEP] y, where [CAT] ∈ {UNNECESSARY, FEASIBLE, INFEASIBLE}, and y is the decontextualized sentence when [CAT] = FEASIBLE and the original sentence when [CAT] ∈ {UNNECESSARY, INFEASIBLE}. UNNECESSARY are examples where the original sentence can stand alone without any edits.

We limit the input/output to 512/128 tokens for both variants, and fine-tuned from pre-trained checkpoints5 with a batch size of 100 examples until the validation loss stopped decreasing, after about 32K steps for the larger model and 500K steps for the smaller model.
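A sketch of how the fine-tuning target might be assembled; the paper specifies the [CAT] [SEP] y layout, while the exact token strings used here are an assumption:

```python
from typing import Optional

def build_t5_target(category: str, original: str,
                    decontextualized: Optional[str] = None) -> str:
    """Target sequence '[CAT] [SEP] y': y is the decontextualized sentence
    for FEASIBLE examples and the original sentence otherwise."""
    assert category in {"UNNECESSARY", "FEASIBLE", "INFEASIBLE"}
    y = decontextualized if category == "FEASIBLE" else original
    return f"{category} [SEP] {y}"
```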

5.2 Evaluation

5.2.1 Feasibility Detection

We first evaluate the accuracy of models in making the feasible vs. infeasible decision. To do this we compute the binary agreements with all human references and average them to get an accuracy.
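In other words, each model label is scored by the fraction of human reference labels it agrees with, and these per-example agreements are averaged; a minimal sketch (function name ours):

```python
def feasibility_accuracy(predictions, references):
    """predictions: one FEASIBLE/INFEASIBLE label per example;
    references: the list of human labels for each example."""
    per_example = [sum(pred == ref for ref in refs) / len(refs)
                   for pred, refs in zip(predictions, references)]
    return sum(per_example) / len(per_example)
```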

Results For the feasible vs. infeasible classification task, a baseline that always predicts FEASIBLE will have 88% accuracy. The larger variant of T5, T5-11B, achieves 89% accuracy, outperforming human agreement (85% accuracy), affirming the strong performance of pre-trained language models on classification tasks (Devlin et al., 2018). This model predicts the INFEASIBLE category infrequently (5% of examples), while humans classify an example as INFEASIBLE for 12% of examples. We observe the smaller variant, T5-Base, is less accurate, over-predicting the INFEASIBLE category (for 20% of

5https://github.com/google-research/text-to-text-transfer-transformer

examples), getting 77% accuracy. As an untrained baseline, the coreference model cannot make the feasibility decision.

5.2.2 Decontextualized Sentence Generation

Setup For development / test examples, we have five human annotations per example. We only consider examples marked by three or more annotators (out of five) as FEASIBLE for decontextualized sentence generation. For each of these examples, we discard annotations which mark the example as INFEASIBLE. For automatic evaluation and comparison, we need a human output, which will be compared to model outputs, and a set of reference annotations which will be considered as correct, gold annotations. The single human output provides a reference point for evaluation measures to which the automatic output can be compared.

We observed that comparing a longer decontextualized sentence to shorter decontextualized sentences often erroneously results in low scores on automatic metrics (e.g., in the last example of Table 2, adding extra information will be erroneously punished). Thus, instead of randomly selecting one annotation to be used as the representative human output, we sort the annotations by the length of the output sentence (raw bytes), and take the annotation with median length6 as the human output and take the remaining annotations as the set of reference annotations. From manual inspection of the data, the median-length output often appeared to be optimal in terms of balancing length versus accuracy of the decontextualization.

Metric For each model prediction and human output, we report:

• Length Increase, the average value of (len(decontext) - len(original)) / len(original).

• % edited, the proportion of examples that were modified for decontextualization (as opposed to being left unchanged).

• Sentence match, a binary score computed between the output and a set of references, indicating whether the output matches any of the references after normalization (stripping away articles and punctuation and lowercasing). We report two numbers, a score on all examples, and a score on examples where all references edited the sentence.

6 When there are four references, we take the second shortest sentence.

        | len inc. | % edited | match all / edited | SARI add F1 (P/R) | SARI del F1 (P/R)
Repeat  | 0        | 0        | 38 / 0             | 0 (0/0)           | 0 (0/0)
Coref   | 7        | 42       | 39 / 13            | 22 (51/14)        | 31 (34/28)
T5-Base | 8        | 40       | 48 / 21            | 29 (67/19)        | 40 (54/32)
T5-11B  | 12       | 59       | 53 / 32            | 42 (72/30)        | 46 (49/43)
Human   | 24       | 76       | 45 / 29            | 56 (64/49)        | 58 (61/55)

Table 4: Development set performance. len inc. is the average percentage increase in length from decontextualization. % edited is the proportion of examples that have at least one edit. match-all shows the percentage of outputs that have at least one match in the human references; match-edited shows the match value calculated on cases where all references include at least one edit.

• SARI (system output against references and against the input sentence) metric (Xu et al., 2016). To compute this, for each reference, we calculate a set of add edits, corresponding to which unigrams are seen in the reference but not in the original sentence. Conversely, we can calculate the set of delete edits, corresponding to unigrams that are in the original sentence but not in the reference. We calculate precision/recall/F1-measure on add and delete edits. We look at unigrams only, and use fractional counts for the words in the references (i.e., a word appearing in one of r references will be counted as 1/r). We compute a micro average across examples, i.e., globally by counting the total true positives, false negatives and false positives, as many examples do not require any edits.7

While the sentence match score is the easiest to interpret, it punishes longer outputs, making comparisons across systems producing outputs of different lengths challenging, and it overly rewards conservative strategies that simply copy across the original sentence. Thus, we use the SARI metric as our main evaluation metric. SARI can be thought of as a precision/recall measure on the unigrams that should be added or deleted.
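To make the two main measures concrete, here is a simplified per-example sketch of the sentence-match check and of the SARI 'add' F1 (the 'delete' side is symmetric); the released evaluation script is authoritative, and the micro-averaging across examples is omitted here.

```python
import re

ARTICLES = {"a", "an", "the"}

def normalize(s):
    """Lowercase, strip punctuation and articles, for the sentence-match score."""
    s = re.sub(r"[^\w\s]", "", s.lower())
    return " ".join(w for w in s.split() if w not in ARTICLES)

def sentence_match(output, references):
    return any(normalize(output) == normalize(ref) for ref in references)

def added_unigrams(original, sentence):
    """Unigrams in `sentence` that are not in `original` (the 'add' edits)."""
    return set(sentence.split()) - set(original.split())

def sari_add_f1(original, output, references):
    """F1 on 'add' edits, with each reference contributing weight 1/r."""
    out_add = added_unigrams(original, output)
    gold = {}
    for ref in references:
        for w in added_unigrams(original, ref):
            gold[w] = gold.get(w, 0.0) + 1.0 / len(references)
    tp = sum(gold.get(w, 0.0) for w in out_add)
    precision = tp / len(out_add) if out_add else 0.0
    recall = tp / sum(gold.values()) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```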

Automatic Evaluation Tables 4 and 5 show development and test performance. A successful

7 Similar to BLEU in machine translation, SARI is a useful measure for comparing different systems; however, due to the relatively large space of possible decontextualizations it will not be possible to achieve anything close to 100% F1 on SARI measures, and thus the absolute score is harder to interpret. A SARI score of, for example, 50% should not be interpreted as indicating a system with 50% accuracy.

        | len inc. | % edited | match all / edited | SARI add F1 (P/R) | SARI del F1 (P/R)
Repeat  | 0        | 0        | 36 / 0             | 0 (0/0)           | 0 (0/0)
Coref   | 8        | 42       | 38 / 13            | 23 (50/15)        | 36 (40/32)
T5-11B  | 13       | 61       | 52 / 32            | 43 (69/31)        | 47 (49/46)
Human   | 23       | 77       | 44 / 28            | 56 (64/49)        | 58 (61/56)

Table 5: Test set results. See the Table 4 caption for a key.

          | T5 | either | Annotator | Sum
T5        | 13 | 12     | 2         | 27
either    | 7  | 22     | 4         | 33
Annotator | 1  | 15     | 24        | 40
Sum       | 21 | 49     | 30        | 100

Table 6: Preference between T5 output and human annotation. Columns represent the judgement of expert A, rows that of expert B. We see high agreement between the two expert annotators, although one expert annotator (the column annotator) is ambivalent more frequently.

decontextualization system would result in a high sentence match, an adequate changed ratio (experts edited about 79% of examples) and length change ratio (the experts' ratio is 1.19), as well as high SARI addition and deletion scores. As a sanity check, we report REPEAT, which outputs the original sentence. This alone results in a high sentence match score, around 40%, meaning that on this number of examples at least one of the annotators deemed the sentence could stand alone without any edits.

The coreference system has an exact match on about 13% of examples that require edits, without any task-specific fine-tuning. Its SARI add score shows high precision and low recall, and its deletion scores are low as it cannot delete discourse markers. The Seq2seq generation model achieves high scores across all measures. The bigger variant is substantially better, editing more than its smaller variant without losing precision. We observe the larger variant outperforms the average human on the sentence match measure, but not on the SARI measures. The T5 model modifies fewer examples than the annotators, and its edits involve fewer tokens, benefiting it on the sentence match measure. However, the model is more likely to miss required edits, as shown in the low recall for the SARI add and deletion measures. We discuss this further in the following human evaluation section.

Human Evaluation We sampled 100 examples in the evaluation set, where at least two annotators and our best model made decontextualization edits. We randomized the order of presentation of the T5 and human outputs so as to not bias the annotation. On this set, we (two of the authors) conducted a manual evaluation. Given two decontextualized sentences, one from the best model and another randomly selected from a set of annotations with decontextualization edits, we evaluated each on two dimensions: (a) is it fluent and grammatically correct? (b) is it sufficiently and correctly decontextualized? Lastly, we chose the preference between the two outputs (A, B, or either).

Expert annotators marked as 'sufficient' those items for which all possible referential ambiguities had been resolved. Given the subjective nature of the task, some 'insufficient' decontextualizations by the expert annotator could be valid for another annotator with different world knowledge. We report averaged binary scores from the two experts. The model output scored 88.0% on fluency and 67.5% on correct decontextualization, while the human reference output scored 84.5% on fluency and 78.5% on correct decontextualization. Both annotators found T5 to be slightly more fluent, while humans are more thorough and accurate in decontextualizing. Table 6 shows the preferences of the two annotators. Both preferred the human output, and their preferences exhibit high agreement (matching on 37 out of 40 examples when both had preferences).

We briefly characterize common error patterns for annotators and the T5 model. Similar error patterns emerge between the annotations and model outputs. Both occasionally fail to identify generics that need to be replaced with referring NPs, phrases that require bridging, and temporal contexts that should be provided. Additionally, we noticed that the T5 model heavily relies on the title cues, and sometimes fails to clarify ambiguous entities that are not the main entity of the page. We noticed very few examples where T5 hallucinates factually incorrect content.

6 Two Applications

We present two demonstrations of the utility of decontextualization. First, we argue that decontextualized sentences can be valuable in themselves in question answering, and second, we show that they can be useful as a preprocessing step.

Opt. A vs. Opt. B | Prefer A | Prefer B | either | log odds intercept [CI]
Dec. vs. Ori.     | 730      | 426      | 364    | 0.85 [0.4, 1.3]
Dec. vs. Par.     | 850      | 456      | 234    | 0.55 [0.1, 1.0]
Ori. vs. Par.     | 741      | 505      | 274    | 0.31 [-0.2, 0.8]

Table 7: User study results. Dec. refers to the decontextualized sentence answer, Ori. means the original sentence answer, Par. means the paragraph answer. We present raw counts of preferences and the log odds of preferring option A with its 95% confidence interval.

6.1 Decontextualized Answer As Is

We showcase a use case of decontextualized sentences as providing a succinct yet informative answer to open domain factoid questions (Kwiatkowski et al., 2019). We design a user study where people compare a decontextualized-sentence answer with an original-sentence answer and a paragraph answer to the same query.8

Setup Given a question and two presentations of the same answer, raters were tasked with marking their preference between the two answer presentations (option A, option B, or either). The actual short span answer in the sentence is always highlighted (similar to what is shown in Table 8); see Figure 3 for a screenshot.

We conduct three comparison studies on the same set of 150 questions: (a) decontextualized sentence vs. original sentence, (b) original sentence vs. original paragraph, (c) decontextualized sentence vs. original paragraph. For each example in each study, we collected 10 user ratings. The questions are randomly chosen from a set of questions which have a short answer, and such that the sentence containing the short answer is categorized as FEASIBLE by the annotators and edits were necessary to decontextualize. We use crowd-sourced annotations of decontextualized sentences. Figure 3 shows a screenshot of the user study interface.

Result Table 7 shows the results of the user study. We observe that decontextualized sentence answers are preferred to both the original sentence answers and the original paragraph answers. We also note that the users preferred sentence answers to paragraph answers in general.

8 Understanding how to present answers to users is a complex problem with many desiderata, e.g., preserving the original content, crediting the source, interaction with the user interface, which we are not covering comprehensively.

Figure 3: A screenshot of the instructions and an example instance comparing the original sentence answer and the paragraph answer shown to annotators for the user study.

Figure 4: Each dot represents how frequently the decontextualized answer is preferred over the original sentence answer, for a single question (top plot) or a single rater (bottom plot). Questions (top plot) and raters (bottom plot) are sorted by their preference towards the decontextualized answer. The red line is where both are equally preferred, and points above the line represent questions / raters for which decontextualized answers were preferred. While the decontextualized answer is preferred overall, we see large variability.

We further investigated the statistical significance of the preferences reported in Table 7. We noticed a quite large amount of question and rater variability: some raters consistently preferred a sentence answer, valuing conciseness, while some raters behaved in the other direction. Similarly, for some questions, all raters preferred a sentence answer. Figure 4 visualizes this variability across questions and raters.

To control for the correlations induced by the rater and question groups, we fit a generalized linear mixed model (GLMM) using the brms R package (Bürkner, 2017). For this analysis, we excluded data points where users did not show a preference (selected either). We used the formula p ~ 1 + (1|r) + (1|q), where p is whether a rater chose one option over the other, r is the rater id, and q is the question id. This formula specifies a regression of the log-odds of the rater preference while allowing for random effects in the raters (r) and questions (q). The last column of Table 7 shows the fixed effect coefficients and their confidence intervals. The intercept represents the strength of preference towards option A. We found a statistically significant preference for decontextualized sentences over both the original sentences and the paragraphs (the p-value was smaller than 0.05 for both studies).

Examples We qualitatively investigated which examples benefit from decontextualization, and in which examples raters prefer paragraph answers. Table 8 shows questions together with two answer presentations, along with the predicted fixed effect question coefficient towards the decontextualized answer in study (b) and towards the sentence answer in study (c). In the first row, the added information from the decontextualization is not relevant to the question, thus we observe a preference against decontextualization. In the second and third rows, the decontextualized sentence answer is preferred as it provides enough evidence to answer the query,

Query: when was the rising of the moon written
Decontextualized answer: The Rising of the Moon, Irish ballad recounting a battle between the United Irishmen and the British Army, has been in circulation since circa 1865.
Paragraph answer (sentence answer highlighted): The ballad has been in circulation since circa 1865. The earliest verifiable date found in publication is 1867.
Decont.: -2.09 | Ori.: -1.53

Query: what is the most viewed video on youtube in 24 hours
Decontextualized answer: The most viewed music video within 24 hours of its release is Taylor Swift's Look What You Made Me Do.
Paragraph answer (sentence answer highlighted): This list of most viewed online videos in the first 24 hours contains the top 30 online videos that received the most views within 24 hours of release across the world. This list excludes movie trailers, which are featured on the list of most viewed online trailers in the first 24 hours. The most viewed music video in this time period is Taylor Swift's Look What You Made Me Do.
Decont.: 1.06 | Ori.: -0.48

Query: when was last time england got to quarter finals in world cup
Decontextualized answer: The England national football team have reached the quarter-finals on nine occasions, the latest of which were at the 2002 and the 2006.
Paragraph answer (sentence answer highlighted): England did not enter the competition until 1950. ... Their best ever performance is winning the Cup in the 1966, whilst they also finished in fourth place in 1990, and in 2018. Other than that, the team have reached the quarter-finals on nine occasions, the latest of which were at the 2002 and the 2006.
Decont.: 1.40 | Ori.: 0.70

Table 8: Examples from the user study. The last column (Ori.) presents coefficients for preferring the original sentence over the original paragraph, and the fourth column (Decont.) presents coefficients for the decontextualized sentence over the paragraph. Positive values mean a preference towards the sentence-length answer over the paragraph-length answer.

while the original sentence answer does not.

6.2 Decontextualizing System Inputs

Having shown the benefits of decontextualization in a user facing task, we now investigate the use of decontextualization as a preprocessing step. Specifically, we construct a passage retrieval corpus for open domain question answering (Chen et al., 2017) with decontextualized sentences. Experiments show that decontextualized sentences ensure completeness of the passages while minimizing their length (and thus computational cost).

Background Open domain question answering typically pairs a passage retrieval module (Liu and Croft, 2002) with a transformer-based answer extractor (reading comprehension model) operating on the retrieved passages (Guu et al., 2020; Karpukhin et al., 2020; Izacard and Grave, 2020). The computational cost is dominated by the cost of co-encoding the query with the retrieved passages (typically paragraphs or overlapping 100 word windows).

Setup We create a corpus using the 7k documents (233k paragraphs, 868k sentences) associated with the questions in the NQ-open development set (Lee et al., 2019). We consider a retrieved passage to be correct if it contains one of the answer strings9 and investigate the number of questions for which we can retrieve a correct passage for a fixed computational cost. Under this measure, we compare paragraphs, windows of 100 words, sentences, and decontextualized sentences as sets of retrieval passages. These segmentation approaches generate different numbers of passages for the same article (paragraph and 100-word window segmentation produce fewer passages than sentence-level segmentation). To generate decontextualized sentences, we process all paragraphs with the T5-11B model, which is trained on all annotated data (including the development and test sets). For about 40% of sentences, the model classified the sentence as infeasible to decontextualize or as not requiring any edits; for these we use the original sentence. For the other 60%, the model tended to add more information. For example, for the sentence "Bush was widely seen as a 'pragmatic caretaker' president who lacked a unified and compelling long-term theme in his efforts.", the decontextualized sentence will be "George H.W. Bush was widely seen as a 'pragmatic caretaker' president of the United States who lacked a unified and compelling long-term theme in his efforts.", and a paragraph would

9 We adopt the answer match heuristics from Lee et al. (2019).

be the entire paragraph containing this sentence, while a 100-word window would be a chunk segmented without using sentence boundaries. For all passage types, we prepend the document title to the passage, following the literature, and use TF-IDF as the retriever model.

Metric Let q_i be a question; let A_i = [a_i^0, ..., a_i^n] be the set of valid answers; let C_i = [c_i^1, ..., c_i^k] be a ranked list of evidence passages; and let H(A_i, C_i) be the index of the top ranked passage that contains one of the valid answers (or k + 1 if there is no such passage). We first define the cost of encoding a single question and passage as

c(q_i, c_i^m) = (|q_i| + 1 + |c_i^m|)^2.

This captures the fact that the Transformer's computation cost scales quadratically with the length of the encoded text (question + separator + evidence passage). The total cost for an example is

O(q_i, A_i, C_i) = \sum_{m=1}^{H(A_i, C_i)} c(q_i, c_i^m).

Given the per example cost defined above, we define the recall of the retrieval system at computational cost budget t to be

(1/N) \sum_{i=1}^{N} 1[O(q_i, A_i, C_i) < t]     (1)

where N is the total number of examples and 1[·] is an indicator function. We use this as an evaluation measure, instead of mean reciprocal rank or recall at N, to compare across different retrieval passage lengths.
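A sketch of this measure, under the assumption that question and passage lengths are token counts and that an answer-hit flag is available per ranked passage (the answer-match heuristic of Lee et al. (2019) is elided):

```python
def per_example_cost(question_len, passage_lens, answer_hits):
    """Sum of quadratic encoding costs c(q, c_m) = (|q| + 1 + |c_m|)^2 over the
    ranked passages up to and including the first one containing an answer.
    Returns infinity if no ranked passage contains an answer."""
    total = 0.0
    for length, hit in zip(passage_lens, answer_hits):
        total += (question_len + 1 + length) ** 2
        if hit:
            return total
    return float("inf")

def recall_at_cost_budget(example_costs, budget):
    """Fraction of questions whose per-example cost is below budget t (Eqn 1)."""
    return sum(cost < budget for cost in example_costs) / len(example_costs)
```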

Results Figure 5 plots the recall of each retrieval corpus at different computational cost budgets t on the whole NQ-open evaluation set. The graph shows that sentence level segmentation is more cost-effective than paragraph or 100-word level segmentation, and using decontextualized sentences is more cost effective than using the original sentences. Decontextualized sentences approach the performance of the commonly used 100-word windows at a tenth of the cost.

This result exemplifies the way in which decontextualization can be used to ensure that the input to a natural language understanding system is concise yet complete. We think this way of using decontextualization as a preprocessing step could also aid tasks such as summarization.

Figure 5: Retrieval recall plotted against computational cost budget (Eqn 1) for different methods of document segmentation.

7 Related Work

Prior literature in summarization studied how article context affects the understanding of sentences within an article. It has been observed that disambiguating entity mentions and correctly resolving anaphora is crucial for automatic summarization (Otterbacher et al., 2002; Steinberger et al., 2007) and for evaluation of summarization systems (Pitler et al., 2010). Li et al. (2016) found that information missing from a sentence could be identified in the article context in newswire text 60% of the time. This is considerably less frequent than for the encyclopedic text studied here, but nevertheless hints that decontextualization for newswire text could be feasible. It remains unclear whether information accessible in newswire contexts can be readily incorporated into sentences using controlled edits of the type we employ.

Successful decontextualization models must resolve entity and event coreferences (Humphreys et al., 1997) as well as other forms of anaphora (Rösiger et al., 2018). These are necessary but insufficient for decontextualization, however, which also involves discourse marker removal, acronym expansion, and fluent and grammatical sentence generation.

The term decontextualization was introduced in a recent table-to-text generation dataset (Parikh et al., 2020) where a sentence from a Wikipedia document was decontextualized such that it can be interpretable when presented with a table alone. They cover only the sentences that are relevant to the table, and adapt them to the table context. In a recent image captioning dataset (Sharma et al., 2018), sentences are re-written such that information that cannot be inferred from the image is removed. For example, entity names are replaced with generics (e.g., [Tom Cruz → A man] is waiting.).

8 Conclusion

We define decontextualization, the task of rewriting a sentence from a document to be interpretable in an empty context, while preserving its meaning. We build a crowdsourced dataset and a model for decontextualization, and demonstrate how decontextualization can be used in a user-facing task and as a sub-component of an application system.

We believe that decontextualization will also be helpful in a wide range of other applications. For example, in multi-document summarization (Fabbri et al., 2019), co-referring entities and events must be resolved across different documents and removing ambiguous references may help; extractive summarization (Cheng and Lapata, 2016) could benefit from the type of pre-processing that we presented for open-domain QA; anaphora resolution is crucial for both summarization and machine translation (Susanne et al., 1992); and decontextualizing sentences may help in recovering explicit mentions of entities and relations which can help information extraction (Narasimhan et al., 2016). The current formulation focuses on the English encyclopedic corpus and rewriting for an empty context, and future work can explore different domains of text as well as mapping to a different context.

Acknowledgments

We would like to thank members of Google AI, especially Jacob Eisenstein, Kenton Lee, Santiago Ontanon, Ankur Parikh, Daniel Andor, Chris Alberti, and Slav Petrov for helpful discussions and comments. Lastly, we would like to thank our annotators.

References

Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36:15–24.

Joachim Bingel, Gustavo Paetzold, and Anders Søgaard. 2018. Lexi: A tool for adaptive, personalized text simplification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 245–258, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Betty J. Birner. 2012. Introduction to Pragmatics, 1st edition. Wiley Publishing.

Paul-Christian Bürkner. 2017. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1):1–28.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 484–494, Berlin, Germany. Association for Computational Linguistics.

Herbert H. Clark. 1975. Bridging. In Theoretical Issues in Natural Language Processing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Alexander R. Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R. Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

H. Paul Grice. 1975. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors, Speech Acts, volume 3 of Syntax and Semantics, pages 41–58. Academic Press, New York.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training.

K. Humphreys, R. Gaizauskas, and Saliha Azzam. 1997. Event coreference for information extraction.

Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

V. Karpukhin, B. Oguz, S. Min, L. Wu, S. Edunov, and others. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. arXiv preprint 1906.00300.

Junyi Jessy Li, Bridget O'Daniel, Y. Wu, W. Zhao, and A. Nenkova. 2016. Improving the annotation of sentence specificity. In LREC.

X. Liu and W. Croft. 2002. Passage retrieval based on language models. In CIKM '02.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jahna Otterbacher, Dragomir R. Radev, and Airong Luo. 2002. Revisions that improve cohesion in multi-document summaries: A preliminary study. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Emily Pitler, Annie Louis, and Ani Nenkova. 2010. Automatic evaluation of linguistic quality in multi-document summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In EMNLP-CoNLL Shared Task.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019a. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint 1910.10683.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019b. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.

Ina Rösiger, Arndt Riester, and Jonas Kuhn. 2018. Bridging resolution: Task definition, corpus resources and rule-based experiments. In Proceedings of the International Conference on Computational Linguistics (COLING).

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Dan Sperber and Deirdre Wilson. 1986. Relevance: Communication and Cognition. Harvard University Press, USA.

Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing & Management, 43(6):1663–1680.

Preusz Susanne, Birte Schmitz, Christa Hauenschild, and Carla Umbach. 1992. Anaphora resolution in machine translation.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.