


Human acceptability judgements for extractive sentence compression

Abram Handler†, Brian Dillon∗ and Brendan O'Connor†
†College of Information and Computer Sciences
∗Department of Linguistics
University of Massachusetts Amherst
[email protected]

Abstract

Recent approaches to English-language sentence compression rely on parallel corpora consisting of sentence–compression pairs. However, a sentence may be shortened in many different ways, each of which might be suited to the needs of a particular application. Therefore, in this work, we collect and model crowdsourced judgements of the acceptability of many possible sentence shortenings. We then show how a model of such judgements can be used to support a flexible approach to the compression task. We release our model and dataset for future work.

1 Introduction

In natural language processing, sentence compression refers to the task of automatically shortening a longer sentence (Knight and Marcu, 2000; Clarke and Lapata, 2008; Filippova et al., 2015). Traditional approaches attempt to create compressions which (i) maintain readability, (ii) retain the most important information from the source sentence and (iii) achieve some fixed or flexible rate of compression (Napoles et al., 2011).

However, in practice, applications which make use of sentence compression techniques will require dramatically different strategies for how best to achieve these three competing goals: a journalist might want abbreviated quotes from politicians, while a bus commuter might want movie review snippets on their mobile phone. Each such application presents (i) different readability requirements, (ii) different definitions of importance and (iii) different brevity constraints.

Thus, while recent research into extractive sentence compression often assumes a single, best shortening of a sentence (Napoles et al., 2011), in this work we argue that any compression which matches the readability, informativeness and brevity requirements of a given application is a plausible shortening. Practical compression methods will need to identify the best possible shortening for a given application, not just recover the "gold standard" compression.

Traditional supervision for the compression task does not offer an obvious method for achieving this application-specific objective. Standard supervision consists of individual sentences paired with "gold standard" compressions (Filippova and Altun, 2013): offering just one reasonable shortening for each sentence in the corpus, from among the many plausible compressions. Table 1 demonstrates this point in detail. Therefore, in this work, we:

• Collect and release1 a large, crowdsourced dataset of human acceptability judgements (Sprouse, 2011), specifically tailored for the sentence compression task (§4). Acceptability judgments are native speakers' self-reported perceptions of the well-formedness of a sentence (Sprouse and Schütze, 2014).

• Present and evaluate a model of these acceptability judgements (§5).

• Use this model to define an ACCEPTABILITY function which predicts the well-formedness of a shortening (§6), for use in application-specific compression systems.

2 Related work

Researchers have been studying extractive sentence compression for nearly two decades (Knight and Marcu, 2000; Clarke and Lapata, 2008; Filippova et al., 2015).

1 http://slanglab.cs.umass.edu/compression



Sentence. Pakistan launched a search for its missing ambassador to Afghanistan on Tuesday, a day after he disappeared in a Taliban area.
Headline. Pakistan searches for missing ambassador.
"Gold" compression. Pakistan launched a search for its missing ambassador.
Alternate 1. Pakistan launched a search for its missing ambassador to Afghanistan on Tuesday. (A(c) = -1.367, BREVITY = 84 characters max., IMPORTANCE = 1)
Alternate 2. Pakistan launched search Tuesday. (A(c) = -6.144, BREVITY = 59 characters max., IMPORTANCE = 0)

Table 1: A sentence, headline and "gold compression" from a standard sentence compression dataset (Filippova and Altun, 2013), along with two alternate compressions constructed with a system supervised with human acceptability judgements (§6). The alternate compressions reflect different BREVITY requirements and different adherence to an application-specific IMPORTANCE criterion. In this case, the brevity requirement is expressed with a hard maximum character constraint and the importance criterion is expressed with a binary score, indicating if a sentence includes a query term "Afghanistan". We use our A(c) ∈ (−∞, 0] metric (§6) to measure the ACCEPTABILITY of each compression; a higher score indicates a compression is more likely to be well-formed. Alternate 2 is neither entirely garbled nor perfectly well-formed, reflecting the gradient-based nature of acceptability (§4).

Recent approaches are often based on a large compression corpus,2 which was automatically constructed by using news headlines to identify "gold standard" shortenings (Filippova and Altun, 2013). State-of-the-art models trained on this dataset (Filippova et al., 2015; Andor et al., 2016; Wang et al., 2017) can reproduce gold compressions (i.e., perfect token-for-token match) with accuracy higher than 30%.

However, because a sentence may be compressed in many ways (Table 1), this work introduces human acceptability judgements as a new and more flexible form of supervision for the sentence compression task. Our approach is thus closely connected to research which seeks to model human judgements of the well-formedness of a sentence (Heilman et al., 2014; Sprouse and Schütze, 2014; Lau et al., 2017; Warstadt et al., 2018). Unlike such studies, our work is strictly concerned with human perceptions of shortened sentences.3 Our work also solicits human judgements of shortenings from naturally-occurring news text, instead of sentences drawn from syntax textbooks (Sprouse and Almeida, 2017; Warstadt et al., 2018) or created via automatic translation (Lau et al., 2017).

We note that our effort focuses strictly on anticipating the well-formedness of extractive compressions, rather than identifying compressions which contradict or distort the meaning of the original sentence.

2 https://github.com/google-research-datasets/sentence-compression

3 We compare our model to Warstadt et al. (2018) in §5.

Identifying which compressions do not modify the meaning of source sentences is closely connected to the unsolved textual entailment problem, a recent area of focus in computational semantics (Bowman et al., 2015; Pavlick and Callison-Burch, 2016; McCoy and Linzen, 2018). In the future, we hope to apply this evolving research to the compression task. Some current compression methods use simple hand-written rules to guard against changes in meaning (Clarke and Lapata, 2008) or syntactic mistakes (Jing and McKeown, 2000).

Finally, following much prior work, this study approaches sentence compression as a purely extractive task. Closely related work on abstractive compression (Cohn and Lapata, 2008; Rush et al., 2015; Mallinson et al., 2018) and sentence simplification (Zhu et al., 2010; Xu et al., 2015) seeks to shorten sentences via paraphrases or reordering of words. Despite superficial similarity, extractive methods typically use different datasets, different evaluation metrics and different modeling techniques.

3 Compression via subtree deletion

Any sentence compression technique requires a framework for generating possible shortenings of a sentence, s. We generate compressions with a subtree deletion approach, based on prior work (Knight and Marcu, 2000; Filippova and Strube, 2008; Filippova and Altun, 2013). To generate a single compression (from among all possible compressions) we begin with a dependency parse of s.4


Then, at each timestep, we prune a single subtree from the parse. After M subtrees are removed from the parse (one at a time, over M timesteps) the remaining vertexes are linearized in their original order. Formally, pruning a subtree refers to removing a vertex and all of its descendants from a dependency tree; pruning singleton subtrees (one vertex) is permitted.

We find that it is possible to construct 88.2% of the gold compressions in the training set of a standard corpus (Filippova and Altun, 2013) by only pruning subtrees. Therefore, we only examine prune-based compression in this work.5
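To make the prune operation concrete, the following is a minimal sketch (not the authors' released code) of prune-based compression, assuming a dependency parse represented as parallel lists of tokens and head indices, with -1 marking the root; the example sentence and attachment choices are illustrative.

def descendants(heads, v):
    """Return the set containing vertex v and all of its descendants."""
    out = {v}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in out and child not in out:
                out.add(child)
                changed = True
    return out

def prune(tokens, heads, v):
    """Remove the subtree rooted at v; linearize the survivors in original order."""
    removed = descendants(heads, v)
    keep = [i for i in range(len(tokens)) if i not in removed]
    new_tokens = [tokens[i] for i in keep]
    # Re-index heads for the surviving vertexes; a kept vertex cannot have its
    # head inside the removed set, because pruning removes whole subtrees.
    index = {old: new for new, old in enumerate(keep)}
    new_heads = [index[heads[i]] if heads[i] != -1 else -1 for i in keep]
    return new_tokens, new_heads

# Example: prune the temporal modifier "on Tuesday" (rooted at "Tuesday").
tokens = ["Pakistan", "launched", "a", "search", "on", "Tuesday"]
heads  = [1, -1, 3, 1, 5, 1]   # "Tuesday" attaches to "launched"; "on" to "Tuesday"
print(" ".join(prune(tokens, heads, 5)[0]))   # Pakistan launched a search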

4 Methodology and Dataset

Our data collection methodology follows extensive research into human judgements of linguistic well-formedness (Sprouse and Schütze, 2014). Such work has shown that non-expert participants offer consistent judgements of natural and unnatural sounding sentences with high test-retest reliability (Langsford et al., 2018), across different data collection techniques (Bader and Häussler, 2010; Sprouse and Almeida, 2017). We apply this research to sentence compression, with confidence that our results reflect genuine human perceptions of shortened sentences because:

1. We carefully screen out many workers who offer judgements that violate well-known properties of English syntax, such as workers who approve deletion of objects from obligatory transitive verbs.

2. We observe that workers approve and disapprove of classes of sentences that English speakers would categorize as "grammatical" and "ungrammatical," respectively. For instance, workers rarely endorse deletion of nominal subjects, and often endorse deletion of temporal modifiers.

3. Annotator agreement in our dataset is similar to agreement in prior, comprehensive studies of acceptability judgments (§4.4).

The appendix details our screening procedures, and discusses classes of accepted and rejected compressions in our dataset.

4 We use Universal Dependency (Nivre et al., 2016) trees (v1), parsed using CoreNLP (Manning et al., 2014).

5 Extracting nested subclauses from source sentences is not possible with prune-only methods, because the root node of the compression must be the same as the root node of the original sentence. We plan to address this in future work.

[Figure 1: two histograms of judgement counts for the same sentence, one binary (no/yes) and one Likert (1–7).]

Figure 1: Binary judgements and graded (Likert) judgements from Sprouse and Almeida (2017) for the slightly-awkward sentence, "They suspected and we believed Peter would visit the hospital". Bader and Häussler (2010) describe correlations between such measurement techniques.

4.1 Measuring well-formedness

Our study adopts a standard distinction between acceptability and grammaticality (Chomsky, 1965; Schütze, 1996). Grammaticality is a binary and theoretical notion, used to characterize whether a sentence is or is not generable under a grammatical model. Acceptability is a measurement of an individual's perception of the well-formedness of a sentence. Empirical studies have shown acceptability to be a gradient-based phenomenon, affected by a range of factors including plausibility, syntactic well-formedness, and frequency (Sprouse and Schütze, 2014). Based on this work, we expect that workers will have graded (not binary) perceptions of the well-formedness of compressions shown in our task.

Although acceptability is gradient-based, we nevertheless measure worker perceptions by collecting binary judgements of well-formedness. Earlier studies (Bader and Häussler, 2010; Sprouse and Almeida, 2017; Langsford et al., 2018) have shown that such binary measurements of acceptability correlate strongly with explicitly gradient collection techniques, such as Likert scales (Figure 1). We chose to collect binary judgements instead of graded judgments because (1) binary judgments avoid ambiguity in how participants interpret a gradient scale, (2) binary judgements allowed us to write clear screener questions to block unreliable crowdworkers from the task and (3) binary judgements allowed us to apply binary logistic modeling to directly predict observable worker behavior.

4.2 Data collection prompt

We show crowdworkers on Figure Eight6 a naturally-occurring sentence, along with a proposed compression of that same sentence,

6 https://www.figure-eight.com/


generated by executing a single prune operation on the sentence's dependency tree. We then ask a binary question: can the longer sentence be shortened to the shorter sentence? Our prompt is shown in Figure 2. We instruct workers to say yes if the shorter sentence "sounds good" or "sounds like something a person would say," following verbiage for the acceptability task (Sprouse and Schütze, 2014).

Because we designed our task to follow typical acceptability prompts, we expect that workers completing the task evaluated the well-formedness of each compressed sentence, and then answered yes if they deemed it acceptable.

We instruct workers to say yes if a compression sounds good, even if it changes the meaning of a sentence. While practitioners will need to identify shortenings which are both syntactically well-formed and which do not modify the meaning of a sentence, this work focuses strictly on identifying well-formed compressions. In the future, we plan to apply active research in semantics (§2) to identify disqualifying changes in meaning.

Figure 2: Prompt to collect human judgements of acceptability for sentence compression. Workers are instructed to answer yes if the shorter sentence "sounds good" or "sounds like something a person would say."

4.3 Dataset details

We generate 10,128 sentence–compression pairs from a freely-distributable corpus of web news (Zheleva, 2017). Each source sentence s is chosen at random, and each compression c is produced by a single, randomly-chosen prune operation on s. Our data thus reflects the natural distribution of dependency types in the corpus.

N judgements (train)       | 6010 (4522 sents.)
N judgements (test)        | 640 (486 sents.)
Class balance              | 64.2% / 35.8% (no/yes)
Overall compression rate   | µ = 0.867, σ = 0.174

Table 2: Corpus statistics.

We present each (s, c) pair to 3 or more workers,7 then conservatively exclude many judgements from workers who are revealed to be inattentive or careless (see appendix), in order to be certain that worker disagreements in our dataset reflect genuine perceptions of well-formedness. We then divide filtered data into a training and test set by (s, c) pair, so that our model does not use a train-time judgement about (s, c) from worker k to predict a test-time judgement about (s, c) from worker k′. Table 2 presents dataset statistics.8 Our organization does not require institutional approval for crowdsourcing.

4.4 Inter-annotator agreement

There are at least two sources of inter-annotator disagreement which could arise in our data. First, in cases when a compression is neither entirely garbled nor perfectly well-formed, previous empirical studies (Sprouse and Almeida, 2017; Langsford et al., 2018) suggest that annotators will likely disagree (see Figure 1 and §4.1). Second, we suspect that different individuals set different thresholds on how acceptable a sentence must be before they give a "yes" response: a compression that one person rates as acceptable might be rated by the next as unacceptable, even if they have the same impression of the compression's acceptability. Such between-annotator variability might represent a form of response bias, which is common in psychological experiments (Macmillan and Creelman, 1990, 2004). We attempt to control for such bias by including a worker ID feature in our model (§5.1).

To evaluate the extent of such disagreement, and to compare with other work, we measure inter-annotator agreement using Fleiss' kappa (Fleiss, 1971), computing κ = .294 on the entire filtered dataset.9 We observe a similar rate of agreement (κ = .323) in a comprehensive study of acceptability judgements (Sprouse and Almeida, 2017).10 Our κ is lower than typical in standard annotation paradigms in NLP, which often attempt to assign instances to hard classes, rather than measure graded phenomena.

7 Figure Eight will sometimes solicit additional judgements automatically.

8 Note that we use a character-based (Filippova et al., 2015) rather than token-based (Napoles et al., 2011) definition of compression rate.

9 See the appendix for details of Fleiss' κ for crowdsourcing.

10 We compute this number using publicly released data from the YN study.


5 Intrinsic Task: Modeling single operations

We create a model to predict if a given worker will judge that a given single-operation compression is acceptable. We say that Y = 1 if worker k answers that s can be shortened to c. We then model p(Y = 1|W, x) using binary logistic regression, where x = φ(s, c, k) is a feature vector reflecting the nature of the edit which produces c from s, as observed by worker k.

5.1 Model Features

The major features in our model are: (i) language model features, (ii) dependency type features, (iii) worker ID features, (iv) features reflecting properties of the edit from s to c and (v) interaction features. We discuss each below.

Language model features. Our model builds upon earlier work examining the relationship between language modeling and acceptability judgements. In a prior study, Lau et al. (2017) define several functions which normalize predictions from a language model by token length and by word choice, then test which functions best align with human acceptability judgements. We use their Norm LP function in our model, defined as:

Norm LP(ξ) ≜ − log p_m(ξ) / log p_u(ξ)    (1)

where ξ is a sentence, p_m(ξ) is the probability of ξ given by a language model and p_u(ξ) is the unigram probability of the words in ξ.

We use Norm LP as a part of two features in our approach. One real-valued feature records the probability of a compression computed by Norm LP(c). Another binary feature computes Norm LP(s) − Norm LP(c) > 0. The test set performance of these language model (LM) features is shown in Table 3. The appendix further describes our implementation of Norm LP.
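As an illustration, the following sketch computes the two Norm LP-based features; log_pm and log_pu are placeholder callables standing in for a trained language model and a unigram model, and are not part of the released implementation.

def norm_lp(sentence, log_pm, log_pu):
    """Norm LP(s) = -log p_m(s) / log p_u(s), following Lau et al. (2017)."""
    return -log_pm(sentence) / log_pu(sentence)

def lm_features(source, compression, log_pm, log_pu):
    """Return the real-valued and binary LM features described above."""
    nlp_c = norm_lp(compression, log_pm, log_pu)
    nlp_s = norm_lp(source, log_pm, log_pu)
    return {
        "norm_lp_c": nlp_c,                        # real-valued feature: Norm LP(c)
        "norm_lp_gain": float(nlp_s - nlp_c > 0),  # binary feature: Norm LP(s) - Norm LP(c) > 0
    }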

Dependency type features. We use the dependency type governing the subtree pruned from s to predict the acceptability of c. This is because workers are more likely to endorse deletion of certain dependency types. For instance, workers will often endorse deletion of temporal modifiers, and often reject deletion of nominal subjects.

Worker ID features. We also include a feature indicating which of the workers in our study submitted a particular judgement for a given (s, c) pair. We include this feature because we observe that different workers have greater or lesser tolerance for more and less awkward compressions.11

Including the worker ID feature allows our model to partially account for an individual worker's judgement based on their overall endorsement threshold, and partially account for a worker's judgement based on the linguistic properties of the edit. The feature thus controls for variability in each worker's baseline propensity to answer yes. Because real applications will not have access to worker-specific information, we do not use the worker ID feature in evaluating our model and dataset for use in practical compression systems (§6). All workers in the test set submit judgements in the training set.

Edit property features. We also include several features which register properties of an edit, such as features which indicate if an operation removes tokens from the start of the sentence, removes tokens from the end of a sentence, or removes tokens which follow a punctuation mark. We include a feature that indicates if a given operation breaks a collocation (e.g. "He hit a home run"). The appendix details our collocation-detection technique.

Interaction features. Finally, we include seventeen interaction features formed by crossing edit property features with particular dependency types. For instance, we include a feature which records if a prune of a conj edge removes a token following a punctuation mark.
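A hypothetical sketch of how the feature families above might be assembled into the feature vector x = φ(s, c, k); the `edit` record and feature names are illustrative, not the released feature set.

def features(edit, worker_id, norm_lp_c, norm_lp_gain):
    """Assemble one judgement's feature dict from the feature families above."""
    x = {
        "norm_lp_c": norm_lp_c,                         # language model features
        "norm_lp_gain": norm_lp_gain,
        "dep=" + edit["dep_type"]: 1.0,                 # dependency type of the pruned subtree
        "worker=" + worker_id: 1.0,                     # worker ID (train time only)
        "removes_start": float(edit["removes_start"]),  # edit property features
        "removes_end": float(edit["removes_end"]),
        "after_punct": float(edit["after_punct"]),
        "breaks_collocation": float(edit["breaks_collocation"]),
    }
    # Example interaction feature: a conj prune that removes a token after punctuation.
    if edit["dep_type"] == "conj" and edit["after_punct"]:
        x["conj*after_punct"] = 1.0
    return x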

5.2 Evaluation

We compare our model of individual worker judgements to simpler approaches which use fewer features (Table 3), including an approach which uses only language model information (Lau et al., 2017) to predict acceptability. We compute the test set accuracy of each approach in predicting binary judgements from individual workers, which allows for comparison with agreement rates between workers. However, because acceptability is a gradient-based phenomenon (§4), we also evaluate without an explicit decision threshold via the area under the receiver operating characteristic curve (ROC AUC), which measures the extent to which an approach ranks good deletions over bad deletions.

11 More formally, we define a given worker's deletion endorsement rate as the number of times a worker answers yes, divided by their total judgements. We observe a roughly normal distribution (µ = .402, σ = .216) of worker deletion endorsement rates across the dataset.


(Accuracy and Fleiss κ use hard classification at t = 0.5; ROC AUC evaluates ranking.)

Model                        | Accuracy | Fleiss κ | ROC AUC (p)
CoLA                         | 0.622    | -0.210   | 0.590 (< .001)
language model (LM)          | 0.623    | -0.232   | 0.583 (< .001)
+ dependencies               | 0.664    | 0.124    | 0.646 (< .001)
+ worker ID                  | 0.695    | 0.232    | 0.746 (< .001)
full, p(Y = 1|W, x)          | 0.742    | 0.400    | 0.807
- dependencies               | 0.731    | 0.368    | 0.797 (0.073)
- worker ID                  | 0.667    | 0.170    | 0.691 (< .001)
worker–worker agreement      | 0.636*   | 0.270    | —
Sprouse and Almeida (2017)   | —        | 0.323    | —

Table 3: Test set accuracy, Fleiss' κ and ROC AUC scores for six models, trained on the single-prune dataset (§4), as well as scores for a model trained on the CoLA dataset (Warstadt et al., 2018). The simplest model uses only language modeling (LM) features. We add dependency type (+ dependencies) and worker ID (+ worker ID) information to this simple model. We also remove dependency information (- dependencies) and worker information (- worker ID) from the full model. The full model achieves the highest test set AUC; p values beside each smaller AUC score show the probability that the full model's gain over that smaller AUC score occurs by chance. We also compute κ for each model by calculating the observed and pairwise agreement rates (Fleiss, 1971) for judgements submitted by the crowdworker and "judgements" submitted by the model. Models which can account for worker effects achieve higher accuracies than the observed agreement rate among workers (0.636*), leading to higher κ than for worker–worker pairs.

ROC AUC thus measures how well predicted probabilities of binary positive judgements correlate with human perceptions of well-formedness. Other work which solicits gradient-based judgements instead of binary judgements evaluates with Pearson correlation (Lau et al., 2017); ROC AUC is a close variant of the Kendall ranking correlation (Newson, 2002). Our full model also achieves a higher AUC than approaches that remove features from the model. We use bootstrap sampling (Berg-Kirkpatrick et al., 2012) to test the significance of AUC gains (Table 3); p values reflect the probability that the difference in AUC between the full model and the simpler model occurs by chance.
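A minimal sketch of this kind of paired bootstrap test, loosely following Berg-Kirkpatrick et al. (2012): it estimates how often the full model's AUC gain over a smaller model disappears when the test set is resampled. Variable names are illustrative and this is a simplification of the published procedure, not the authors' exact test.

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_pvalue(y_true, p_full, p_small, n_boot=10_000, seed=0):
    """Fraction of bootstrap resamples in which the full model's AUC gain vanishes."""
    rng = np.random.default_rng(seed)
    y_true, p_full, p_small = map(np.asarray, (y_true, p_full, p_small))
    vanished = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(set(y_true[idx])) < 2:      # skip degenerate resamples with one class
            continue
        gain = (roc_auc_score(y_true[idx], p_full[idx])
                - roc_auc_score(y_true[idx], p_small[idx]))
        if gain <= 0:
            vanished += 1
    return vanished / n_boot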

The probability that two workers, chosen at random, will agree that a given s in the test set may be shortened to a given c in the test set is 63.6%. We hypothesize that the full model's accuracy of 74.2% is higher than the observed agreement rate between workers because the full model is better able to predict if worker k will endorse an individual deletion.

We also compare our full model to a baseline neural acceptability predictor trained on CoLA (Warstadt et al., 2018): a corpus of grammatical and ungrammatical sentences drawn from syntax textbooks. Using a pretrained model, we predict the probability that each source sentence and each compression is well-formed, denoted CoLA(s) and CoLA(c). We use these predictions to define four features: CoLA(c), log CoLA(c), CoLA(s) − CoLA(c), and log CoLA(s) − log CoLA(c). We show the performance of this model in Table 3. CoLA's performance for extractive compression warrants future examination: large corpora designed for neural methods sometimes contain limitations which are not initially understood (Chen et al., 2016; Jia and Liang, 2017; Gururangan et al., 2018).

6 Extrinsic Task: Modeling multi-operation compression

In this work, we argue that a single sentence may be compressed in many ways; a "gold standard" compression is just one of the exponentially many possible shortenings of a sentence. Instead of mimicking gold standard training data, we argue that compression systems should support application-specific BREVITY constraints and IMPORTANCE criteria: stock traders and stylists will have different definitions of "important" content in fashion blogs, and compressions for a desktop will be longer than compressions for a phone.

Nevertheless, in many application settings, compression systems will try to show users well-formed shortenings. Thus, in the remainder of this study, we examine how our transition-based


sentence compressor (§3), supervised with human acceptability judgements (§5), may be used to provide ACCEPTABILITY scores which align with human perceptions of well-formedness.12 Such scores could be used as a component of many different practical sentence compression systems, including a method described in §6.3.

12 Early approaches to sentence compression used language models (Clarke and Lapata, 2008) or corpus statistics (Filippova and Strube, 2008) to generate "readability" scores. Newer neural approaches (Filippova et al., 2015) rely on implicit definitions of well-formedness encoded in training data.

6.1 ACCEPTABILITY scores

We consider any function which maps a compression to some real number reflecting its well-formedness to be an ACCEPTABILITY score. In §5, we present a model, p(Y = 1|W, x), which attempts to predict which operations on well-formed sentences return reasonable compressions. If we execute a chain of M such operations, and assume that each operation's effect on acceptability is independent, we can model the probability that M prune operations will result in an acceptable compression with ∏_{i=1}^{M} p(Y = 1|W, x_i), which is equal to the chance that a person will endorse each of the M deletions. We test this model with a function that expresses the probability that all operations are acceptable:

A(c) ≜ ∑_{i=1}^{M} log p(Y = 1|W, x_i)    (2)

where each x_i is a feature vector reflecting the nature of the prune operation which shortens c_i to c_{i+1} in the chain of operations, and where p(Y = 1|W, x_i) is the predicted probability of deletion endorsement under our full model. (Because no worker observes the deletion, we do not use the worker ID feature in predicting deletion endorsement.)

The sum of log probabilities in A(c) reflects the fact that any operation on a well-formed sentence carries inherent risk: modifying a sentence's dependency tree may result in a compression which is not acceptable. The more operations executed, the greater the chance of generating a garbled compression. We use this intuition to define a simpler alternative, A_M(c) ≜ −M, where M is the number of prune operations used to create the compression. We also examine a function A_min(c) ≜ min{log p(Y = 1|W, x_i) | i ∈ 1..M}, which represents our observation that a single operation with a low chance of endorsement will often create a garbled compression. We compare these functions to A_LM(c), which is equal to the probability of the compression c under a language model, normalized by sentence length. (We use the Norm LP formula defined in §5. Language model predictions have been shown to correlate with the well-formedness of a sentence (Lau et al., 2017; Kann et al., 2018).) Finally, we test a function A_CoLA(c), which is equal to the predicted probability of well-formedness of the compression from a pretrained acceptability predictor (§5).
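The following sketch shows the three judgement-based ACCEPTABILITY functions side by side, assuming `probs` holds the predicted endorsement probabilities p(Y = 1|W, x_i) for each prune operation in the chain that produced a compression; the probabilities shown are illustrative.

import math

def A(probs):
    """All operations acceptable: sum of log endorsement probabilities (Eq. 2)."""
    return sum(math.log(p) for p in probs)

def A_min(probs):
    """Log probability of the least acceptable operation in the chain."""
    return min(math.log(p) for p in probs)

def A_M(probs):
    """Negative number of prune operations."""
    return -len(probs)

probs = [0.91, 0.40, 0.76]   # per-operation endorsement probabilities
print(A(probs), A_min(probs), A_M(probs))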

6.2 Experiment

To evaluate each ACCEPTABILITY function, we collect a small dataset of multi-prune compressions.13 We draw 1000 initial candidate sentences from a standard compression corpus (Filippova and Altun, 2013), and then remove sentences which are shorter than 100 characters to create a sample of 958 sentences. We then compress each of the sentences in the sample by executing prune operations in succession until the character length of the remaining sentence is less than B, a randomly sampled integer between 50 and 100. This creates an evaluation dataset with a (roughly) uniform distribution by character length.

To generate each compression, we use a sampling method which allows us to explore a wide range of well-formed and garbled shortenings, without generating too many obviously terrible compressions.14 Concretely, we sample each prune operation in each chain in proportion to p(Y = 1), a model's prediction (§5) that the edit will be judged acceptable. This means that we delete vertex v and its descendants with probability (1/Z) p(y = 1|W, s, c_v), where c_v is the compression formed by pruning the subtree rooted at v and Z = ∑_{v∈V} p(y = 1|W, s, c_v) is the sum of the endorsement probability of all possible compressions.15
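A minimal sketch of this probability-matching sampler: each prunable vertex is selected in proportion to its predicted endorsement probability. The candidate subtrees and probabilities below are illustrative.

import random

def sample_prune(candidates, rng):
    """candidates maps a prunable vertex to p(y = 1 | W, s, c_v)."""
    vertices = list(candidates)
    z = sum(candidates[v] for v in vertices)           # normalizer Z
    weights = [candidates[v] / z for v in vertices]
    return rng.choices(vertices, weights=weights, k=1)[0]

rng = random.Random(0)
candidates = {"on Tuesday": 0.81, "to Afghanistan": 0.64, "its missing": 0.07}
print(sample_prune(candidates, rng))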

We show each sentence in the evaluation dataset to 3 annotators, using our acceptability prompt (Figure 2).

13 Rather than single-prune compressions (§4).

14 Very many of the exponentially many possible compressions of a single sentence will be garbled or nonsensical. Pruning even a single subtree at random from an acceptable sentence (§5) destroys acceptability more than 60% of the time.

15 In the behavioral sciences, this method of choosing actions is sometimes called probability matching (Vulkan, 2000).


Function    | Description                 | ROC AUC
A_CoLA(c)   | CoLA pretrained             | .510
A_LM(c)     | Language model              | .557
A_M(c)      | Number of operations × −1   | .580
A_min(c)    | Least acceptable operation  | .581
A(c)        | All operations acceptable   | .591

Table 4: ROC AUC for several ACCEPTABILITY functions for the multi-operation compression task. The A(c) model achieves a gain of .034 in AUC over the A_LM(c) model (p = 0.005).

This creates a final evaluation set consisting of 2,388 judgements of 940 multi-prune compressions, after we implement the judgement filtering process described in the appendix. We compute κ = 0.099 for the evaluation dataset.

For all defined ACCEPTABILITY functions, we measure AUC against binary worker deletion endorsements (yes or no judgements) in the evaluation dataset to determine the quality of the ranking produced by each ACCEPTABILITY function (§5.2). The function A(c), which integrates information from a language model, as well as information about the grammatical details of the process which creates c from s, achieves the highest AUC on the evaluation set, best correlating with human judgements of well-formedness.

6.3 One sentence, many compressions

This work argues that there is no single best way to compress a sentence. We demonstrate this idea by examining some of the exponentially many possible compressions of the sentence shown below. (This same sentence is also shown in Table 1.)

s = Pakistan launched a search for its missing ambassador to Afghanistan on Tuesday, a day after he disappeared in a Taliban area.
cg = Pakistan launched a search for its missing ambassador
c1 = Pakistan launched a search for its missing ambassador to Afghanistan on Tuesday
c2 = Pakistan launched search Tuesday

Table 5: A sentence s and a "gold" compression (cg) from a standard corpus (Filippova and Altun, 2013), along with two alternate compressions (c1 and c2).

We generate an initial list of 1000 possible compressions of this 126-character sentence via the procedure defined in §6.2, and we score the ACCEPTABILITY of each compression by A(c). In this instance, we define BREVITY(c) to be the maximum character length of a compression and IMPORTANCE(c) to be a binary function returning 1 only if the compression includes the query term, "Afghanistan". (Practical compression systems would also need to check for changes in meaning resulting from deletion (§2), but we leave this step for future work.)
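The paper does not prescribe a particular selection rule, but as an illustration, a practical system could simply keep the highest-A(c) candidate that satisfies both constraints; the sketch below does exactly that for the two alternates in Table 1, with A(c) scores taken from that table.

def best_compression(scored, max_chars, query_term):
    """Return the highest-A(c) candidate satisfying BREVITY and IMPORTANCE."""
    ok = [(a, c) for c, a in scored
          if len(c) <= max_chars and query_term in c]
    return max(ok)[1] if ok else None

scored = [
    ("Pakistan launched a search for its missing ambassador to Afghanistan on Tuesday", -1.367),
    ("Pakistan launched search Tuesday", -6.144),
]
print(best_compression(scored, max_chars=84, query_term="Afghanistan"))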

[Figure 3: scatter plot of Acceptability A(c) (y-axis, −15 to 0) against Brevity in characters (x-axis, 50 to 100), with points colored by Importance (True/False); the gold compression, alternate 1 and alternate 2 are highlighted.]

Figure 3: 554 possible compressions of a single sentence (Table 5), displayed by A(c) score, BREVITY constraint and IMPORTANCE criterion. The "gold standard" compression, cg, is shown with a large red square, along with alternate c1 (large triangle), and alternate c2 (large circle).

Following deduplication steps described in the appendix, we generate a final list of 554 different possible compressions of the sentence. We plot each of the 554 compressions in Figure 3, which shows many possible shortenings with high A(c) scores. The "gold standard" compression is just one arbitrary shortening of a sentence.

7 Conclusion and future work

Our effort suggests areas for future work. To begin, our study is strictly concerned with grammatical and not semantic acceptability. In the future, we hope to apply active work from semantics (§2) to identify meaning-changing compressions. We also plan to add support for additional operations, such as paraphrasing or forming compressions from nested subclauses. In addition, we plan to develop improved models of multi-prune compression, and to apply our ACCEPTABILITY scores in loss functions for neural compression techniques.


8 Appendix

8.1 Crowdsourcing details

We paid workers 5 cents per judgement and only opened our task to US-based workers with a level-2 designation on Figure Eight.

Following standard practice on the crowdsourcing platform, we used test questions to screen out careless workers (Snow et al., 2008). All workers began our job with a quiz mode of 10 screener questions, and then saw one screener question in every subsequent page of 10 judgements. Workers who failed more than 80% of test questions were screened out from the task. Screener questions were indistinguishable from our regular collection prompt.

We wrote test screener questions based on established understanding of English syntax, to avoid biasing results with our own subjective judgements of acceptability. For example, linguists have extensively examined which English verbs require objects and which verbs do not require objects via corpus-based, elicitation-based and eye-tracking methods (Gahl et al., 2004; Staub et al., 2006). We used this work to write screener questions which check that workers answer no for operations that prune direct objects of obligatory transitive verbs. Similarly, we wrote screener questions which check that workers answer no to deletions which split a verb and a known obligatory particle in a multiword expression (Baldwin and Kim, 2010), or remove determiners before singular count nouns (Huddleston and Pullum, 2002, p. 354). For the multi-prune dataset (§6.2), we added test questions which confirmed that workers approved of well-formed, gold standard compressions from a standard corpus (Filippova and Altun, 2013).

We also include screener questions which check if a worker is paying attention, along with several poll questions which ask workers if they grew up speaking English.16 We ignore judgements from known non-native speakers and known inattentive workers in downstream analysis.17 We defined rules for filtering the dataset before examining the test set to ensure that filtering decisions did not influence test-set results.

16 Workers are instructed there is no right answer to questions about language background, so there is no incentive to answer dishonestly.

17 We also exclude 1663 suspected fraudulent judgements from 17 IP addresses associated with multiple worker IDs.

We release screener questions and task instructions along with crowdsourced data for this work.

8.2 Per-dependency deletion endorsements

Breaking out worker responses by dependency type provides additional validation for our data collection approach. We observe that workers are unlikely to endorse deletion of dependency types which create compressions that English speakers would deem "ungrammatical," and likely to endorse deletions which speakers would deem "grammatical."

For example, in UD, the mwe relation is most commonly used to link two or more function words that obligatorily occur together (e.g. because of, due to, as well as). Since deleting an mwe amounts to suppressing a critical closed-class item, it is not surprising that, overall, workers only assented to deleting mwe in 9.5% of cases. Similarly, low deletion endorsement for the cop relation (15.6%) is consistent with the grammatical rules of mainstream varieties of American English, which generally require an overt copula in copular constructions.18

On the other hand, we found that optional (Larson, 1985) pre-conjunction operators like both or either were almost always considered removable (80.0% deletion endorsement). Workers also endorsed the deletion of temporal adverbs such as tomorrow or the day after next 78.9% of the time, which is sensible as temporal adverbs are typically considered adjuncts.

Since these response patterns generally align with well-established grammatical generalizations (Huddleston and Pullum, 2002), they serve to validate our data collection approach.

8.3 Experimental details

We report additional details regarding several of the experiments in the paper, presented in the order in which the experiments appear.

Fleiss κ. Fleiss' original metric (Fleiss, 1971) assumes that each judged item will be judged by exactly the same number of raters. However, our data filtering procedures create a dataset with a variable number of raters per sentence–compression pair. (This is common in crowdsourcing.) We thus calculate the observed agreement rate for an individual item (P_i, in Fleiss' notation) by computing the pairwise agreement rate from among all raters for that item.

18 Not all dialects require overt copulas (Green, 2002).


We ignore cases where only one rater judged a given (s, c) pair, which occurs for 73.2% of pairs.
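A minimal sketch of this variable-rater computation: per-item observed agreement is the pairwise agreement rate among that item's raters, single-rater items are skipped, and chance agreement is derived from the overall yes/no proportions (an assumption in the spirit of Fleiss, 1971, not a verbatim reimplementation).

from itertools import combinations

def fleiss_kappa(items):
    """items: list of lists of binary judgements, one inner list per (s, c) pair."""
    items = [js for js in items if len(js) > 1]            # ignore single-rater items
    per_item = []
    for js in items:
        pairs = list(combinations(js, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    p_bar = sum(per_item) / len(per_item)                  # observed agreement
    all_js = [j for js in items for j in js]
    p_yes = sum(all_js) / len(all_js)
    p_e = p_yes ** 2 + (1 - p_yes) ** 2                    # chance agreement
    return (p_bar - p_e) / (1 - p_e)

print(round(fleiss_kappa([[1, 1, 0], [0, 0, 0], [1, 0], [1]]), 3))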

Tuning and implementation. We implement our model, p(Y = 1|W, x), with scikit-learn (Pedregosa et al., 2011) using L2 regularization. We tune the inverse regularization constant to c = 0.1 to optimize ROC AUC in 5-fold cross validation over the training set, after testing c ∈ {10^j | j ∈ {−3, −2, −1, 0, 1, 2}}. We do not include a bias term. All other settings are set to default values.
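A minimal sketch of this setup, assuming features arrive as dicts of named values; the example features and labels below are illustrative, not the released training data.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_x = [{"dep=tmod": 1, "norm_lp_c": -0.82, "worker=w17": 1},
           {"dep=nsubj": 1, "norm_lp_c": -1.31, "worker=w03": 1}]
train_y = [1, 0]   # 1 = worker endorsed the deletion

model = make_pipeline(
    DictVectorizer(sparse=True),
    # L2-regularized logistic regression, inverse regularization C = 0.1, no bias term.
    LogisticRegression(penalty="l2", C=0.1, fit_intercept=False, solver="liblinear"),
)
model.fit(train_x, train_y)
print(model.predict_proba(train_x)[:, 1])   # p(Y = 1 | W, x)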

NormLP. Following Lau et al. (2017), in this work, we use the Norm LP function to normalize output from a language model to predict grammaticality. Our Norm LP function uses predictions from a 3-gram language model trained on English Gigaword (Parker et al., 2011) and implemented with KenLM (Heafield, 2011). Lau et al. (2017) report identical performance for the Norm LP function using 3-gram and 4-gram models.

Lau et al. (2017) found that another function, SLOR(ξ), performed as well as Norm LP in predicting human judgements. We found that Norm LP achieved higher AUC than SLOR in 5-fold cross-validation experiments with the training set.19

Collocations. Our model includes a binary feature, indicating if an edit breaks a collocation. We identify collocations by computing the offsets (signed token distances) between words (Manning and Schütze, 1999, ch. 5.2) in English Gigaword (Parker et al., 2011). If the variance in token distance between two words is less than 2 and the mean token distance between the words is less than 1.5, we deem the words a collocation. We identify 647 total edits (across train and test sets) which break a collocation; only 11 of such edits are for mwe relations. Examples include: "forget about it", "kind of" and "as well".
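As an illustration, the offset-based test reduces to two thresholds on the observed signed distances between a word pair; the offsets below are made up.

from statistics import mean, pvariance

def is_collocation(offsets, var_thresh=2.0, mean_thresh=1.5):
    """offsets: signed token distances observed between a word pair in a corpus."""
    return pvariance(offsets) < var_thresh and mean(offsets) < mean_thresh

print(is_collocation([1, 1, 1, 2, 1]))   # e.g. offsets of "of" relative to "kind"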

CoLA. All reported results for the CoLA model use the Real/Fake + LM (ELMo) baseline from Warstadt et al. (2018).20 Across our entire dataset, the mean predicted acceptability of source sentences from the CoLA model is 0.867 (σ = 0.264) and the predicted acceptability of compressions is 0.740 (σ = 0.363). We hypothesize that compression scores have a greater variance and a lower mean because only some compressions are well-formed.

19 Kann et al. (2018) also examine SLOR for automatic fluency evaluation.

20 https://github.com/nyu-mll/CoLA-baselines

Deduplication of possible compressions. In this work, we describe a method for generating multiple compressions from a single sentence. In our generation procedure, it is possible to randomly select the exact same sequence of operations multiple times. During these experiments, we remove any such duplicates from the initial list.

Additionally, in our compression framework, the sequence of operations which produces a given shortening is not unique.21 In cases where different sequences of operations return the same compression, we select the compression with the highest A(c) score, which represents the best available path to the shortening.

21 For instance, it is possible to prune a leaf vertex with one operation, and then prune its parent vertex with a second operation; or just remove both vertexes at once via a single prune of the parent. (Each sequence returns the same compression.)

References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In ACL.

Markus Bader and Jana Häussler. 2010. Toward a model of grammaticality judgments. Journal of Linguistics, 46(2):273–330.

Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing.

Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. M.I.T. Press, Cambridge.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:399–429.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP.

Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In EMNLP.

Katja Filippova and Michael Strube. 2008. Dependency tree based sentence compression. In Proceedings of the Fifth International Natural Language Generation Conference.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Susanne Gahl, Dan Jurafsky, and Douglas Roland. 2004. Verb subcategorization frequencies: American English corpus data, methodological studies, and cross-corpus comparisons. Behavior Research Methods, Instruments, & Computers, 36(3):432–443.

Lisa Green. 2002. African American English: A Linguistic Introduction. Cambridge University Press, Cambridge, U.K.; New York.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In EMNLP Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom.

Michael Heilman, Aoife Cahill, Nitin Madnani, Melissa Lopez, Matthew Mulholland, and Joel Tetreault. 2014. Predicting grammaticality on an ordinal scale. In ACL.

Rodney Huddleston and Geoffrey K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge University Press.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP, Copenhagen, Denmark.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In NAACL.

Katharina Kann, Sascha Rothe, and Katja Filippova. 2018. Sentence-level fluency evaluation: References help, but can be spared! In CoNLL 2018.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In AAAI/IAAI.

Steven Langsford, Amy Perfors, Andrew T. Hendrickson, Lauren A. Kennedy, and Danielle J. Navarro. 2018. Quantifying sentence acceptability measures: Reliability, bias, and variability. Glossa: a journal of general linguistics, 3(1).

Richard K. Larson. 1985. On the syntax of disjunction scope. Natural Language & Linguistic Theory, 3(2):217–264.

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017. Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science, 41(5):1202–1241.

Neil A. Macmillan and C. Douglas Creelman. 1990. Response bias: Characteristics of detection theory, threshold theory, and "nonparametric" indexes. Psychological Bulletin, 107(3):401.

Neil A. Macmillan and C. Douglas Creelman. 2004. Detection Theory: A User's Guide. Psychology Press.

Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2018. Sentence compression for arbitrary languages via multilingual pivoting. In EMNLP 2018.

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Mass.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.

Thomas R. McCoy and Tal Linzen. 2018. Non-entailed subsequences as a challenge for natural language inference. CoRR. Version 1.

Courtney Napoles, Benjamin Van Durme, and Chris Callison-Burch. 2011. Evaluating sentence compression: Pitfalls and suggested remedies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation.

Roger Newson. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. The Stata Journal, 2(1):45–64.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In LREC.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition.

Ellie Pavlick and Chris Callison-Burch. 2016. So-called non-subsective adjectives. In *SEM.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.

Carson Schütze. 1996. The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. University of Chicago Press, Chicago, IL.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP.

Jon Sprouse and Carson Schütze. 2014. Research Methods in Linguistics, chapter Judgment Data. Cambridge University Press, Cambridge, UK.

Jon Sprouse. 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods, 43(1):155–167.

Jon Sprouse and Diogo Almeida. 2017. Design sensitivity and statistical power in acceptability judgment experiments. Glossa, 2(1):1.

Adrian Staub, Charles Clifton, and Lyn Frazier. 2006. Heavy NP shift is the parser's last resort: Evidence from eye movements. Journal of Memory and Language, 54(3):389–406.

Nir Vulkan. 2000. An economist's perspective on probability matching. Journal of Economic Surveys, 14(1):101–118.

Liangguo Wang, Jing Jiang, Hai Leong Chieu, Chen Hui Ong, Dandan Song, and Lejian Liao. 2017. Can syntax help? Improving an LSTM-based sentence compression model for new domains. In ACL.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. CoRR, abs/1805.12471v1.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. TACL, 3:283–297.

Elena Zheleva. 2017. Vox media dataset. In KDD DS + J Workshop, Halifax, Canada.

Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In COLING.