

How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

Julius Steen   Katja Markert
Department of Computational Linguistics

Heidelberg University
69120 Heidelberg, Germany

(steen|markert)@cl.uni-heidelberg.de

Abstract

Manual evaluation is essential to judge progress on automatic text summarization. However, we conduct a survey on recent summarization system papers that reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries' linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations and show that the best choice of evaluation method can vary from one aspect to another. In our survey, we also find that study parameters such as the overall number of annotators and distribution of annotators to annotation items are often not fully reported and that subsequent statistical analysis ignores grouping factors arising from one annotator judging multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that for the purpose of system comparison the current practice of eliciting multiple judgements per summary leads to less powerful and reliable annotations given a fixed study budget.

1 Introduction

Current automatic metrics for summary evaluation have low correlation with human judgements on summary quality, especially for linguistic quality evaluation (Fabbri et al., 2020). As a consequence, manual evaluation is still vital to properly compare the linguistic quality of summarization systems.

While the document understanding conferences (DUC) established a standard manual evaluation procedure (Dang, 2005), we conduct a comprehensive survey of recent works in text summarization that reveals a wide array of different evaluation questions and methods in current use. Furthermore, DUC procedures were designed for a small set of expert judges, while current evaluation campaigns are often conducted by untrained crowd-workers. The design of the manual annotation, specifically the overall number of annotators as well as the distribution of annotators to annotation items, has substantial impact on power, reliability and type I errors of subsequent statistical analysis. However, most current papers (see Section 2) do not consider the interaction of annotation design and statistical analysis. We investigate the optimal annotation methods, design and statistical analysis of summary evaluation studies, making the following contributions:

1. We conduct a comprehensive survey on the current practices in manual summary evaluation in Section 2. Often, important study parameters, such as the total number of annotators, are not reported. In addition, statistical significance is either not assessed at all or with tests (t-test or one-way ANOVA) that lead to inflated type I error in the presence of grouping factors (Barr et al., 2013). In summarization evaluation, grouping factors arise whenever one annotator rates multiple summaries.

2. We carry out annotation experiments for coherence and repetition. We use both Likert- and ranking-style questions on the output of four recent summarizers and reference summaries. We show that ranking-style evaluations are more reliable and cost-efficient for coherence, similar to prior findings by Novikova et al. (2018) and Sakaguchi and Van Durme (2018). However, on repetition, where many documents do not exhibit any problems, Likert outperforms ranking.

3. Based on our annotation data, we perform Monte-Carlo simulations to show the risk posed by ignoring grouping factors in statistical analysis and find up to eight-fold increases in type I errors when using standard significance tests. As an alternative, we propose to either use mixed effect models (Barr et al., 2013) for analysis or to design studies in such a manner that results can be aggregated into independent samples, amenable to simpler analysis tools.

4. Finally, we show that the common practice of eliciting repeated judgements for the same summary leads to less reliable and powerful studies for system-level comparison when compared to studies with the same budget but only one judgement per summary.

Code and data for our experiments are available at https://github.com/julmaxi/summary_lq_analysis.

2 Literature Survey

We survey all summarization papers in ACL, EACL, NAACL, CoNLL, EMNLP, TACL and the Computational Linguistics journal in the years 2017-2019. We choose this timeframe as we are interested in current practices in summarization evaluation: 2017 marks the publication of the pointer generator network (See et al., 2017), which has been highly influential for neural summarization. We focus our analysis on papers that present a novel system for single- or multi-document summarization and take a single or multiple full texts as input and also output text (SDS/MDS). This allows us to concentrate on recommendations for human evaluation of newly developed summarization systems.1

Out of the resulting 105 SDS/MDS system papers, we identify all papers that conduct at least one new comparative system evaluation with human annotators for further analysis, leading to 58 papers in the survey. The fact that this is only about half of all papers is troubling given that it has been recently demonstrated that current automatic evaluation measures such as ROUGE (Lin, 2004) are not good at predicting summary scores for modern systems (Schluter, 2017; Kryscinski et al., 2019; Peyrard, 2019).

1 Excluded from the analysis are sentence summarization or headline generation papers, although most of the points we make hold for their evaluation campaigns as well. Summarization evaluation papers that do not present a new system but concentrate on sometimes large-scale system comparisons are discussed in the Related Work section instead. Lists of all included and excluded papers are given in the Supplementary Material, which also contains exact evaluation parameters per paper in a spreadsheet.

Category                            Pa.   St.

Evaluation Questions
  Overall                            17    23
  Content                            45    65
  Fluency                            29    34
  Coherence                          10    11
  Repetition                         14    17
  Faithfulness                        6     8
  Referential Clarity                 2     2
  Other                               8     9

Evaluation Method
  Likert                             32    43
  Pairwise                           10    14
  Rank                                9     9
  BWS                                 6     9
  QA                                  9    14
  Binary                              4     4
  Other                               2     2

Number of Documents in Evaluation
  < 20                                6    10
  20-34                              22    41
  35-49                               3     4
  50-99                              14    21
  100                                11    14
  > 100                               4     4
  not given                           1     1

Number of Systems Considered
  < 3                                13    20
  3                                  17    23
  4                                  16    23
  5                                   6    10
  > 5                                12    19
  w/ Reference                       16    25
  w/o Reference                      45    70

Number of Annotations per Summary
  1                                   2     5
  2-3                                20    30
  4-5                                12    27
  6-10                                3     5
  not given                          23    28

Overall Number of Annotators
  1-5                                19    25
  6-10                                3     3
  > 10                                5     9
  not given                          32    58

Annotator Recruitment
  Crowd                              25    49
  Other                              35    46

Statistical Evaluation
  t-test                              9    16
  ANOVA                               9    18
  CI                                  4     6
  Other/unspecified                   7     8
  None                               32    47

Table 1: Our survey for 58 system papers with 95 manual evaluation studies (2017-2019). We show numbers both for individual studies and per paper. As a paper may contain several studies with different parameters, counts in the paper column do not always add up.



We assess both what studies ask annotators to judge, as well as how they elicit and analyse judgements. The survey was conducted by one of the authors: for most papers, the categories they fell into were obvious. For difficult cases (unclear specifications, papers that do not fit the normal mould) the two authors discussed the categorisations. Survey results are given in Table 1. Further details about the choices made in the survey, including category groupings/definitions and what is included under Other, can be found in Appendix B. As many papers conduct more than one human evaluation (for example on different corpora), we also list individual annotation studies (a total of 95).

Of the systems that do have human evaluation, many focus on content, including informativeness, coverage, focus, and relevance. Where linguistic quality is evaluated, most focus on general questions about fluency/readability, with a smaller number of papers evaluating coherence and repetition.

In the rest of this section we focus on the three aspects of evaluation we cover in this paper: how to elicit judgements, how these judgements are analysed statistically and how studies are designed.

2.1 Methods

The majority of evaluations is conducted using Likert-type judgements, with the second most frequent method being rank-based annotations, including pairwise comparison. Best-worst scaling (BWS) is a specific type of ranking-oriented evaluation that requires annotators to specify only the first and last rank (Kiritchenko and Mohammad, 2017). QA (Narayan et al., 2018) is used for content evaluation only. This motivates us to compare both Likert and ranking annotations in Section 4.1.

2.2 Statistical Analysis

If a significance test is conducted, most papers analyse their data either using ANOVA or a sequence of paired t-tests. Both tests are based on the assumption that judgements (or pairs of judgements, in the case of a paired t-test) are sampled independently from each other. However, in almost all studies, annotators give judgements on more than one summary from the same system. Thus the resulting judgements are only independent if we assume that all annotators behave identically. Given that prior work (Gillick and Liu, 2010; Amidei et al., 2018), as well as our own reliability analysis in Section 4.1, shows that especially crowd-workers tend to disagree about judgements, this assumption does not seem warranted. As a consequence, traditional significance tests are at high risk of inflated type I error rates. This is well known in the broader field of linguistics (Barr et al., 2013), but is disregarded in summarization evaluation. We show in Section 5 that this is a substantial problem for current summarization evaluations and suggest alternative analysis methods.

2.3 Design

Most papers only report the number of documents in the evaluation and the number of judgements per summary. This, however, is not sufficient to describe the design of a study, lacking any indication about the overall number of annotators that made these judgements. A study with 100 summaries and 3 annotations per summary can mean 3 annotators did all judgements in one extreme, or a study with 300 distinct annotators in the other. Only 26 of the 95 studies describe their annotation design in full, almost all of which use designs in which a small number of annotators judge all summaries. Only 6 of 49 crowdsourced studies report the full design.

We show in Section 5 that a low total number of annotators aggravates type I error rates with improper statistical analysis. In Section 6 we further show that with proper analysis, a low total number of annotators leads to less powerful experiments. Almost all analysed papers choose designs with multiple judgements per summary. However, we show in Section 6.2 that, for the purpose of system ranking, this leads to a loss of reliability as well as power when compared to a study with the same budget and only one annotation per summary.

3 Coherence and Repetition Annotation

To elicit summary judgements for analysis, we conduct studies on two linguistic quality tasks. In the first, we ask annotators to judge the coherence of the summaries, while in the second we ask for the repetitiveness of the summary. We select these two tasks over the more frequent Fluency task as we found in preliminary investigations that many recent summarization systems already produce highly fluent text, making them hard to differentiate. We do not evaluate Overall and Content as both require access to the input document, which differentiates these questions from the linguistic quality evaluation of the summaries.


For both tasks, we conduct one study using a seven-point Likert scale (Likert) and another using a ranking-based annotation method (Rank), where annotators rank summaries for the same document from best to worst. Screenshots of the interfaces for both approaches and full annotator instructions are given in Appendix A.

Corpus and Systems. Mirroring a common setup (see Section 2), we select four abstractive summarization systems and the reference summaries (ref) for analysis:

• The pointer generator summarizer (PG) (See et al., 2017), which is still often used as a baseline for abstractive summarization

• The abstractive sentence rewriter (ASR) of Gehrmann et al. (2018), which is a strong summarization system that does not rely on external pretraining for its generation step

• Seneca (Sharma et al., 2019), a system that combines explicit modelling of coreference information with an external coherence model

• BART (Lewis et al., 2020), a transformer network that achieves state-of-the-art results on CNN/DM.

We randomly sample 100 documents from the popular CNN/DM corpus (Hermann et al., 2015) with corresponding summaries from all systems to form the item set for all our studies.

Study design. We ensure a sufficient total number of annotators by using a block design. We separate our corpus into 20 blocks of 5 documents and include all 5 summaries for each document in the same block, which results in 5 × 5 = 25 summaries per block.

All items in a block were judged by the same set of three annotators. No annotator was allowed to judge more than one block. This results in a total of 3 × 20 = 60 annotators and 1500 judgements per task. Figure 1 shows a schematic overview of our design, which balances the need for a large enough annotator pool with a sufficient task size to be worthwhile to annotators.
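To make the layout concrete, the following Python sketch builds one possible assignment of this kind; document IDs, annotator IDs and the helper name are illustrative and not taken from our released code.

```python
import random

# Illustrative sketch of the block design described above: 100 documents are
# split into 20 blocks of 5, each block contains the summaries of all five
# systems for its documents (25 items), and every block is judged by 3
# annotators who see no other block.
SYSTEMS = ["BART", "ref", "ASR", "PG", "seneca"]

def build_blocks(doc_ids, block_size=5, annotators_per_block=3):
    random.shuffle(doc_ids)
    blocks, next_annotator = [], 0
    for i in range(0, len(doc_ids), block_size):
        docs = doc_ids[i:i + block_size]
        annotators = list(range(next_annotator, next_annotator + annotators_per_block))
        next_annotator += annotators_per_block   # no annotator appears in a second block
        items = [(doc, system) for doc in docs for system in SYSTEMS]
        blocks.append({"docs": docs, "annotators": annotators, "items": items})
    return blocks

blocks = build_blocks(list(range(100)))
print(len(blocks), "blocks,",
      sum(len(b["annotators"]) for b in blocks), "annotators,",
      sum(len(b["items"]) * len(b["annotators"]) for b in blocks), "judgements")
# expected: 20 blocks, 60 annotators, 1500 judgements
```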

We recruited native English speakers from the crowdsourcing platform Prolific2 and carefully adjusted the reward to be no lower than £7.50 per hour based on pilot studies. Summaries (or sets of summaries for Rank) within a block were presented in random order.

2 prolific.com

Figure 1: Schematic representation of our study design. Rows represent annotators, columns documents. Each blue square corresponds to a judgement of the summaries of all five systems for a document. Every rectangular group of blue squares forms one block.

4 Ranking vs. Likert

Table 2 shows the average Likert scores and the average rank for all systems, tasks and annotation methods. We use mixed-effect ordinal regression to identify significant score differences (see Section 5 for details). Both annotation methods provide compatible system rankings for the two tasks, though for the repetition task both methods struggle to differentiate between systems. If we were interested in the true ranking, we could conduct a power analysis given some effect size of interest and elicit additional judgements to improve the ranking. However, as we are concerned with the process of system evaluation and not the evaluation itself, we do not conduct any further analysis.

In the remainder of this section, we focus on the reliability of the two methods as well as their cost-effectiveness.

4.1 Reliability

Traditionally, reliability is computed by chance-adjusted agreement on individual instances. However, for NLG evaluation, Amidei et al. (2018) argue that a low agreement often reflects variability in language perception. Additionally, we are not interested in individual document scores, but in whether independent runs of the same study would result in consistent system scores. In Table 3 we thus report split-half reliability (SHR) in addition to Krippendorff's α (Krippendorff, 1980). To compute SHR, we randomly divide judgements into two groups that share neither annotators nor documents, i.e. two independent runs of the study. We then compute the correlation3 between the system scores in both halves. The final score is the average correlation after 1000 trials.
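A minimal Python sketch of this split-half computation, assuming a hypothetical layout in which the judgements are already grouped by block and each block is a dict mapping a system name to the scores it received there; because annotators and documents are unique to a block, splitting by block yields halves that share neither.

```python
import random
import numpy as np
from scipy.stats import pearsonr

def split_half_reliability(block_scores, systems, trials=1000, seed=0):
    """Average correlation of system scores between two disjoint halves of the blocks."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(trials):
        shuffled = block_scores[:]
        rng.shuffle(shuffled)
        half_a = shuffled[: len(shuffled) // 2]
        half_b = shuffled[len(shuffled) // 2:]
        # mean score per system within each half, then Pearson correlation
        scores_a = [np.mean([s for block in half_a for s in block[sys]]) for sys in systems]
        scores_b = [np.mean([s for block in half_b for s in block[sys]]) for sys in systems]
        correlations.append(pearsonr(scores_a, scores_b)[0])
    return float(np.mean(correlations))
```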


System Likert (Coh) Rank (Coh) Likert (Rep) Rank (Rep)

BART 5.25(1) 1.73(1) 5.85(2/3) 2.88(2/3/4)

ref 4.33(3/4) 3.31(3/4) 6.14(1/2) 2.41(1/2)

ASR 4.17(3/4) 3.17(3/4) 4.88(4/5) 3.51(4/5)

PG 4.81(2) 2.68(2) 5.63(3) 2.92(3/4)

seneca 3.52(5) 4.11(5) 5.16(4/5) 3.27(3/4/5)

Table 2: Results of our annotation experiment. Numbers in brackets indicate the rank for a system for a given annotation method. Multiple ranks in the brackets indicate systems at these ranks are not statistically significantly different (p ≥ 0.05, mixed-effects ordinal regression).

System        α     SHR
Coh: Likert   0.22  0.96
Coh: Rank     0.43  0.98
Rep: Likert   0.27  0.95
Rep: Rank     0.18  0.91

Table 3: Krippendorff's α with ordinal level of measurement and split-half reliability for both annotation methods on the two tasks.


Though agreement on individual summaries is relatively low for all annotation methods, studies still arrive at consistent system scores when we average over many annotators, as demonstrated by the SHR. This reflects similar observations made by Gillick and Liu (2010).

We find that on coherence, Rank is more reliable than Likert, though not on repetition. An investigation of the Likert score distributions for both tasks in Figure 2 shows that coherence scores are relatively well differentiated, whereas a majority of repetition judgements give the highest score of 7, indicating no repetition at all in most summaries. We speculate overall agreement suffers because ranking summaries with a similarly low level of repetition (and not allowing ties) is potentially arbitrary.4

4.2 Cost-efficiency

While more reliable annotation methods allow for fewer annotations, the cost of a study is ultimately determined by the work-time that needs to be invested to achieve a reliable result. To investigate this, we randomly sample between 2 and 19 blocks from our annotations and compute the total time annotators spent to complete each sample.

3 We use the Pearson correlation implementation of scipy (Virtanen et al., 2020).

4 This is supported by feedback we received from annotators that the summaries were difficult to rank as they mostly avoided repetition well.

Figure 2: Score distribution of Likert for both tasks. Each data point shows the number of times a particular score was assigned to each system.

Figure 3: Time spent on annotation (in minutes) vs. correlation with the full-sized score. We gather annotation times in buckets with a width of ten minutes and show the 95% confidence interval for each bucket.

We also compute the Pearson correlation of the system scores in each sample with the scores on the full annotation set. We relate time spent to similarity between sample and full score in Figure 3.

For coherence, Rank is more efficient than Likert. On repetition, the lower reliability of Rank also results in lower efficiency. However, with additional annotation effort, reliability becomes on par with Likert. This is a consequence of the overall lower annotator workload for Rank.


5 Statistical Analysis and Type I Errors

The two most common significance tests in summarization studies, ANOVA and t-test (see Table 1), both assume judgements (or pairs of judgements, in the case of t-test) are independent. This is, however, not true for most study setups as a single annotator typically judges multiple summaries and multiple summaries are generated from the same input document. Both documents and annotators are thus grouping factors in a study that must be taken into account by the statistical analysis. Generalized mixed effect models (Barr et al., 2013) offer a solution but have, to the best of our knowledge, not been used in summarization evaluation at all. We choose a mixed effect ordered logit model to analyse our Likert data for both tasks.5 We will show that traditional analysis methods have a substantially elevated risk of type I errors, i.e. differences between systems found in manual analysis might be overstated.

Method. The ordered logit model we employ can be described as follows:

logit(P(Y ≤ c)) = µ_c − (Xβ + Z_a u_a + Z_d u_d)

where P(Y ≤ c) is the probability that the score of a summary is at most c, µ_c is the threshold coefficient for level c, β is the vector of fixed effects, and u_a, u_d are the vectors of annotator- and document-level random effects respectively, where u_a, u_d are both drawn from normal distributions with mean 0. Finally, X, Z_a, Z_d are design matrices for fixed and random effects. As the only fixed effect, we use a dummy-coded variable indicating the system that has produced the summary, with ref as the reference level. We estimate both random intercepts and slopes for both documents and annotators, following the advice of Barr et al. (2013) to always specify the maximal random effect structure. In practical terms this means that we allow annotators to both differ in how harsh or generous they are in their assessment, as well as in which system they prefer. Similarly, we allow system performance to vary per document, leading to both generally higher or lower scores, as well as different system rankings per document.

5 We do not include Rank data as the ordinal regression model does not generate ranks.
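The following Python sketch draws artificial judgements from a generative model of this form; the threshold values and random-effect standard deviations are illustrative, not the estimates obtained from our data.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SYS = 5                                             # systems (index 0 = reference level)
THRESH = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])  # thresholds mu_c for a 7-point scale
SD_ANN, SD_DOC = 1.0, 0.5                             # SDs of annotator / document effects

def simulate_study(n_docs, n_annotators, docs_per_annotator, beta):
    """Sample Likert-style scores with annotator- and document-level random effects."""
    u_ann = rng.normal(0.0, SD_ANN, size=(n_annotators, N_SYS))  # per-system annotator effects
    u_doc = rng.normal(0.0, SD_DOC, size=(n_docs, N_SYS))        # per-system document effects
    rows = []
    for a in range(n_annotators):
        docs = (np.arange(docs_per_annotator) + a * docs_per_annotator) % n_docs
        for d in docs:
            for s in range(N_SYS):
                eta = beta[s] + u_ann[a, s] + u_doc[d, s]
                latent = eta + rng.logistic()         # logistic noise -> ordered logit
                score = 1 + np.searchsorted(THRESH, latent)
                rows.append((a, int(d), s, int(score)))
    return np.array(rows)                             # columns: annotator, document, system, score
```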

Figure 4: Relation of type I error rates at p < 0.05 to the total number of annotators for different designs, all with 100 documents and 3 judgements per summary. We conduct the experiment with both the t-test and approximate randomization test (ART). We show results both with averaging results per document and without any aggregation. We run 2000 trials per design. The red line marks the nominal error rate of 0.05.

We fit all models using the ordinal R-package (Christensen, 2019) and compute pairwise contrasts between the parameters estimated for each system using the emmeans package (Lenth et al., 2018) with Tukey adjustment.

To demonstrate the problem of ignoring the grouping factors, we can now sample artificial data from the model distribution and try to analyse it with inappropriate tests. This Monte-Carlo simulation is similar to the more general analysis of Barr et al. (2013).

We set β to 0 so all systems perform equally well on the population level and only keep the (zero-mean) document and annotator effects in the model. The false-positive rate of statistical tests on this artificial data should thus be no higher than the significance level. We then repeatedly apply both the t-test and the approximate randomization test (ART) (Noreen, 1989), a non-parametric test, to samples drawn from the model and determine the type I error rate at p < 0.05. We set the number of documents to 100 and demand 3 judgements per summary to mirror a common setup in manual evaluation. We then vary the total number of annotators between 3 and 300 by changing how many summaries a single annotator judges.
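A compact Python version of this check, reusing the simulate_study sketch above: with β = 0 every significant difference is a false positive, and judgements are averaged per document before a (deliberately inappropriate) paired t-test on one system pair. Trial counts are kept small for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

def type_i_error_rate(n_annotators, docs_per_annotator, trials=200, alpha=0.05):
    """Rate at which the paired t-test rejects although all systems are equal."""
    false_positives = 0
    for _ in range(trials):
        rows = simulate_study(100, n_annotators, docs_per_annotator, beta=np.zeros(N_SYS))
        sums = np.zeros((100, N_SYS))
        counts = np.zeros((100, N_SYS))
        for _, d, s, score in rows:                   # average repeated judgements per document
            sums[d, s] += score
            counts[d, s] += 1
        doc_means = sums / counts
        p = ttest_rel(doc_means[:, 0], doc_means[:, 1]).pvalue
        false_positives += p < alpha
    return false_positives / trials

# e.g. three annotators who each judge all 100 documents (3 judgements per summary)
print(type_i_error_rate(n_annotators=3, docs_per_annotator=100))
```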

Results. We report results given the model estimated for Likert in Figure 4. Ignoring the dependencies between samples leads to inflated type I error rates, whether using the t-test or the ART. This is especially severe when only a few annotators judge the whole corpus. In the extreme case with only three annotators in total, the null hypothesis is rejected in about 40% of trials at a significance level of 0.05 in both tasks.


Even our original design with 60 annotators still sees an increase of the type I error rate by about 3%. Only if every annotator judges a single document and annotations are averaged per document are the samples independent, and thus the real error is at the nominal 0.05 level. This design, however, is unrealistic given that annotators must be recruited and instructed.

We suggest two solutions to this problem: Either use mixed effect models or aggregate the judgements so samples become independent. This allows the assumptions of simpler tools such as ART to be met. In our study, we could average judgements in every block to receive independent samples. This is only possible, however, if the design of the study considers this problem in advance: a crowd-sourcing study that allows annotators to judge as many samples as they like is unlikely to result in such a design.
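As an illustration of the second option, the sketch below applies a paired approximate randomization (sign-flip) test to per-block mean scores of two systems; the aggregation into one mean per block and system is assumed to have happened beforehand.

```python
import numpy as np

def paired_art(block_means_a, block_means_b, n_permutations=10000, seed=0):
    """p-value for the difference between two systems, given one mean score per block."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(block_means_a) - np.asarray(block_means_b)
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_permutations):
        signs = rng.choice([-1, 1], size=len(diffs))  # randomly swap the labels within each pair
        hits += abs((signs * diffs).mean()) >= observed
    return (hits + 1) / (n_permutations + 1)
```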

6 Study Design and Study Power

When conducting studies for system comparison, we are interested in maximizing their power to detect differences between systems. For traditional analysis, the power is ultimately determined by the number of documents (or judgements, when no aggregation takes place) in the study. However, when analysis takes into account individual annotators, power becomes additionally dependent on the total number of annotators and how evenly they participated in the study. This gives additional importance to the design of evaluation studies. In this section, we thus focus on how to optimize studies for power and reliability.

We first show that for well-powered experiments, we need to ensure that a sufficient total number of annotators participates in a study. In the second part of this section, we will then demonstrate studies can improve their power by not eliciting multiple judgements per summary.

6.1 Overall Number of Annotators

To demonstrate the difference in power caused by varying the total number of annotators in a study, we determine the power for a design with the same total number of documents and judgements per document but different total numbers of annotators.

We run the experiment both with regression and ART with proper aggregation of dependent samples as described in Section 5. We refer to the latter as ARTagg to differentiate it from normal ART.

Figure 5: Power for 100 documents and 3 judgements per summary with different numbers of total annotators.

For each design we repeatedly sample artificial data from the Likert model and apply both tests to the data. The process is the same as in Section 5, except that we do not set β to zero and count rejections of the null hypothesis.6

We again set the number of documents to 100 and the number of repeat judgements to 3 and vary the total number of annotators between 3 and 75 by varying the number of blocks between 1 and 25. We test for power at a significance level of 0.05.
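For the block-aggregated test, a hedged sketch of such a power estimate can reuse simulate_study and paired_art from the sketches above; the effect size is illustrative rather than one estimated from our annotations, and annotators that share a document set are treated as one block.

```python
import numpy as np

def power_artagg(n_annotators, docs_per_annotator, effect=0.5, trials=200, alpha=0.05):
    """Detection rate for an (illustrative) effect using the block-aggregated ART."""
    beta = np.zeros(N_SYS)
    beta[1] = effect                                  # one system is better by `effect` logits
    n_docs = 100
    n_blocks = n_docs // docs_per_annotator           # assumes n_docs % docs_per_annotator == 0
    detected = 0
    for _ in range(trials):
        rows = simulate_study(n_docs, n_annotators, docs_per_annotator, beta)
        sums = np.zeros((n_blocks, 2))
        counts = np.zeros((n_blocks, 2))
        for _, d, s, score in rows:
            if s < 2:                                 # compare systems 0 and 1 only
                sums[d // docs_per_annotator, s] += score
                counts[d // docs_per_annotator, s] += 1
        block_means = sums / counts
        p = paired_art(block_means[:, 0], block_means[:, 1], n_permutations=2000)
        detected += p < alpha
    return detected / trials
```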

Figure 5 shows how power drops sharply when only a few annotators take part in the study. This is in line with the theoretical analysis of Judd et al. (2017), which shows that the number of participants is crucial for power when analysing studies with mixed effect models. The drop is worse for ARTagg as fewer annotators mean fewer independent blocks and thus a lack of datapoints for the analysis.

6.2 Annotator Distribution

Most studies elicit multiple judgements per summary, following best practices in NLP for corpus design (Carletta, 1996). While this leads to better judgements per document, the goal of many summarization evaluations is a per-system judgement.

For this kind of study, Judd et al. (2017) show that for mixed models that include both annotator and target (in our case, input document) effects, a design where targets are nested within annotators, i.e. every annotator has their own set of documents, is always more powerful than one where they are (partially) crossed with annotators, i.e. a study with multiple annotations per summary, given the same total number of judgements. In fact, power could be maximized by having each annotator judge the summaries for only a single, unique document.

6 As this is an observed power analysis it probably overestimates the power of our analysis for the true effect. The analysis is thus only useful to compare designs under our best estimate of actual effect sizes.


Figure 6: Reliabilities of nested vs. crossed designs for Rank and Likert for both tasks.

However, this is usually not realistic due to the fixed costs of annotator recruitment and instruction. We demonstrate on our dataset how both reliability and power are affected by nested vs. crossed designs.
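The difference between the two designs can be made concrete with a small sketch that builds both assignments for the same judgement budget; annotator and document IDs as well as the grouping scheme are illustrative.

```python
def nested_design(n_annotators, docs_per_annotator):
    """Each annotator judges the summaries of their own disjoint documents (1 judgement each)."""
    return {a: list(range(a * docs_per_annotator, (a + 1) * docs_per_annotator))
            for a in range(n_annotators)}

def crossed_design(n_annotators, docs_per_annotator, judgements_per_summary=3):
    """Same budget, but groups of annotators share a document set (repeated judgements)."""
    assert n_annotators % judgements_per_summary == 0
    n_groups = n_annotators // judgements_per_summary
    return {a: list(range((a % n_groups) * docs_per_annotator,
                          (a % n_groups + 1) * docs_per_annotator))
            for a in range(n_annotators)}

nested = nested_design(60, 5)    # 60 annotators, 300 distinct documents, 1 judgement per summary
crossed = crossed_design(60, 5)  # 60 annotators, 100 distinct documents, 3 judgements per summary
```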

To compare reliability, we randomly sample both nested and crossed designs from our full study and then compute the Pearson correlation of the system scores given by this smaller annotation set with the system scores given by the full study. As shown in Figure 6, nested samples are always at least as good and mostly better at approximating the results of the full annotation compared to a crossed sample with the same annotation effort.

We also conduct a power analysis for regression and ARTagg comparing nested and crossed designs. We again turn to Monte-Carlo simulation on the Likert models and sample nested and crossed designs with the same total number of judgements (i.e. the same cost). We keep the block size constant at 5 and vary the number of annotators between 3 and 60. For nested designs, we drop the document-level random effects from the ordinal regression, as document is no longer a grouping factor in nested designs.

Figure 7 shows that nested designs always have a power advantage over crossed designs, especially when few judgements are elicited. We also find that ART can be used to analyse data without loss of power when there are enough independent blocks. This might be attractive as ART is less computationally expensive than ordinal regression.

Figure 7: Power for p < 0.05 of nested and crossed designs for ARTagg and regression. The x-axis shows the number of judgements elicited, the y-axis the power level.

7 Related Work

Human evaluation has a long history in summarization research. This includes work on the correlation of automatic metrics with human judgements (Lin, 2004; Liu and Liu, 2008; Graham, 2015; Peyrard and Eckle-Kohler, 2017; Gao et al., 2019; Sun and Nenkova, 2019; Xenouleas et al., 2019; Zhao et al., 2019; Fabbri et al., 2020; Gao et al., 2020) and improving the efficiency of the annotation process (Nenkova and Passonneau, 2004; Hardy et al., 2019; Shapira et al., 2019). The impact of annotator inconsistency on system ranking has been studied both by Owczarzak et al. (2012) and Gillick and Liu (2010). To the best of our knowledge, we are the first to investigate the implications of annotator variance on the statistical analysis and the design in summarization system comparison studies.

For general NLG evaluation, van der Lee et al. (2019) establish best practices for evaluation studies. We extend on their advice by conducting experimental studies specifically for summary evaluation. In addition, we show the importance of study design and consideration of annotator effects in analysis on real-world data. The advice of Mathur et al. (2017) regarding annotation sequence effects should be taken into account in addition to our suggestions.

Method Comparison. Ranking has been shown to be effective in multiple NLP tasks (Kiritchenko and Mohammad, 2017; Zopf, 2018), including NLG quality evaluation (Novikova et al., 2018). In this work we confirm this for coherence evaluation, although we find evidence that ranking is less efficient on repetition, where many documents do not exhibit any problems. We also add the dimension of annotator workload as a primary determinant of cost to the analysis of the comparison.


Multiple methods have been suggested to reduce study cost by sample selection (Sakaguchi et al., 2014; Novikova et al., 2018; Sakaguchi and Van Durme, 2018; Liang et al., 2020) or integration with automatic metrics (Chaganty et al., 2018). These efforts complement ours, as care still needs to be taken in analysis and study design.

Recently, rank-based magnitude estimation has been shown to be a promising method for eliciting judgements in NLG tasks and offers a combination of ranking and rating approaches (Novikova et al., 2018; Santhanam and Shaikh, 2019). However, it has not yet found widespread use in the summarization community. While magnitude estimation has been shown to reduce annotator variance, our advice regarding experimental design and grouping factors in statistical analysis applies to this method as well, as annotators can still systematically differ in which systems they prefer.

Statistical analysis. With regard to statistical analysis of experimental results, Dror et al. (2018) give advice for hypothesis testing in NLP. However, they do not touch on the problem of dependent samples. Rankel et al. (2011) analyse TAC data and show the importance of accounting for input documents in statistical analysis of summarizer performance and suggest the use of the Wilcoxon signed rank test for analysis. Sadeqi Azer et al. (2020) argue that p-values are often not well understood and advocate Bayesian methods as an alternative. While the analysis in our paper is frequentist, the mixed effect model approach can also be integrated into a Bayesian framework. Kulikov et al. (2019) model annotator bias in such a framework but do not account for differences in annotator preferences. In work conducted in parallel to ours, Card et al. (2020) show that many human experiments in NLP underreport their experimental parameters and are underpowered, including Likert-type judgements. Their simulation approach to power analysis is very similar to our experiments. In addition to their analysis, we show that ignoring grouping factors in statistical analysis of human annotations leads to inflated type I error rates. We also show that power can be increased by choosing nested over crossed designs with the same budget. The problem of underpowered studies has also been tackled outside of NLP by Brysbaert (2019).

For psycholinguistics, Barr et al. (2013) demonstrate how generalizability of results is negatively impacted by ignoring grouping factors in the analysis. Mixed effect models have found use in NLP before (Green et al., 2014; Cagan et al., 2017; Karimova et al., 2018; Kreutzer et al., 2020), but to the best of our knowledge they have not been used in summary evaluation.

8 Conclusion

We surveyed the current state of the art in manual summary quality evaluation and investigated methods, statistical analysis and design of these studies. We distill our findings into the following guidelines for manual summary quality evaluation:

Method. Both ranking and Likert-type annotations are valid choices for quality judgements. However, we present preliminary evidence that the optimal choice of method is dependent on task characteristics: If many summaries are similar for a given aspect, Likert may be the better option.

Analysis. Analysis of elicited data should take into account variance in annotator preferences to avoid inflated type I error rates. We suggest the use of mixed effect models for analysis that can explicitly take into account grouping factors in studies. Alternatively, traditional tests can be used with proper study design and aggregation.

Study Design. Study designers should control the number of annotators and how many summaries each individual annotator judges to ensure sufficient study power. Additionally, to ensure reliability of results, studies should report the design and the total number of annotators in addition to the number of documents and repeat judgements. Studies with repeat judgements on the same summary do not provide any advantage for system comparison and are less reliable and powerful than nested studies of the same size.

We hope that these findings will help researchers plan their own evaluation studies by allowing them to allocate their budget better. We also hope that our findings will encourage researchers to take more care in the statistical analysis of results. This prevents misleading conclusions due to ignoring the effect of differences in annotator behaviour.

Acknowledgements

We would like to thank Stefan Riezler for many fruitful discussions about the applications of mixed effect models.


References

Jacopo Amidei, Paul Piwek, and Alistair Willis. 2018. Rethinking the agreement in human evaluation tasks. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3318–3329, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Dale J. Barr, Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278.

Marc Brysbaert. 2019. How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition, 2(1).

Tomer Cagan, Stefan L. Frank, and Reut Tsarfaty. 2017. Data-driven broad-coverage grammars for opinionated natural language generation (ONLG). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1331–1341, Vancouver, Canada. Association for Computational Linguistics.

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274, Online. Association for Computational Linguistics.

Jean Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

Arun Chaganty, Stephen Mussmann, and Percy Liang. 2018. The price of debiasing automatic metrics in natural language evalaution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 643–653, Melbourne, Australia. Association for Computational Linguistics.

R. H. B. Christensen. 2019. ordinal—regression models for ordinal data. R package version 2019.12-10. https://CRAN.R-project.org/package=ordinal.

Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the Document Understanding Conf. Wksp. 2005 (DUC 2005) at the Human Language Technology Conf./Conf. on Empirical Methods in Natural Language Processing (HLT/EMNLP).

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational Linguistics.

Alexander R. Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2020. SummEval: Re-evaluating summarization evaluation.

Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354, Online. Association for Computational Linguistics.

Yanjun Gao, Chen Sun, and Rebecca J. Passonneau. 2019. Automated pyramid summarization evaluation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 404–418, Hong Kong, China. Association for Computational Linguistics.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium. Association for Computational Linguistics.

Dan Gillick and Yang Liu. 2010. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 148–151, Los Angeles. Association for Computational Linguistics.

Yvette Graham. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 128–137, Lisbon, Portugal. Association for Computational Linguistics.

Spence Green, Sida I. Wang, Jason Chuang, Jeffrey Heer, Sebastian Schuster, and Christopher D. Manning. 2014. Human effort and machine learnability in computer aided translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1225–1236, Doha, Qatar. Association for Computational Linguistics.

Hardy Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3381–3392, Florence, Italy. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1693–1701, Cambridge, MA, USA. MIT Press.

Charles M. Judd, Jacob Westfall, and David A. Kenny. 2017. Experiments with more than one random factor: Designs, analytic models, and statistical power. Annual Review of Psychology, 68:601–625.

Sariya Karimova, Patrick Simianer, and Stefan Riezler. 2018. A user-study on online adaptation of neural machine translation to human post-edits. Machine Translation, 32(4):309–324.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470, Vancouver, Canada. Association for Computational Linguistics.

Julia Kreutzer, Nathaniel Berger, and Stefan Riezler. 2020. Correct me if you can: Learning from error corrections and markings. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Sage, Beverly Hills, CA.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Ilia Kulikov, Alexander Miller, Kyunghyun Cho, and Jason Weston. 2019. Importance of search and evaluation strategies in neural dialogue modeling. In Proceedings of the 12th International Conference on Natural Language Generation, pages 76–87, Tokyo, Japan. Association for Computational Linguistics.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics.

Russell Lenth, Henrik Singmann, Jonathon Love, Paul Buerkner, and Maxime Herve. 2018. Emmeans: Estimated marginal means, aka least-squares means. R package version, 1(1):3.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Weixin Liang, James Zou, and Zhou Yu. 2020. Beyond user self-reported Likert scale ratings: A comparison model for automatic dialog evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1363–1374, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Feifan Liu and Yang Liu. 2008. Correlation between ROUGE and human evaluation of extractive meeting summaries. In Proceedings of ACL-08: HLT, Short Papers, pages 201–204, Columbus, Ohio. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2017. Sequence effects in crowdsourced annotations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2860–2865, Copenhagen, Denmark. Association for Computational Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana. Association for Computational Linguistics.

Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.

Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley, New York.

Jekaterina Novikova, Ondrej Dusek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana. Association for Computational Linguistics.

Karolina Owczarzak, John M. Conroy, Hoa Trang Dang, and Ani Nenkova. 2012. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 1–9, Montreal, Canada. Association for Computational Linguistics.

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100, Florence, Italy. Association for Computational Linguistics.

Maxime Peyrard and Judith Eckle-Kohler. 2017. A principled framework for evaluating summarizers: Comparing models of summary quality against human judgments. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 26–31, Vancouver, Canada. Association for Computational Linguistics.

Peter Rankel, John Conroy, Eric Slud, and Dianne O'Leary. 2011. Ranking human and machine summarization systems. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 467–473, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Erfan Sadeqi Azer, Daniel Khashabi, Ashish Sabharwal, and Dan Roth. 2020. Not all claims are created equal: Choosing the right statistical approach to assess hypotheses. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5715–5725, Online. Association for Computational Linguistics.

Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2014. Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 1–11, Baltimore, Maryland, USA. Association for Computational Linguistics.

Keisuke Sakaguchi and Benjamin Van Durme. 2018. Efficient online scalar annotation with bounded support. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 208–218, Melbourne, Australia. Association for Computational Linguistics.

Sashank Santhanam and Samira Shaikh. 2019. Towards best experiment design for evaluating dialogue system output. In Proceedings of the 12th International Conference on Natural Language Generation, pages 88–94, Tokyo, Japan. Association for Computational Linguistics.

Natalie Schluter. 2017. The limits of automatic summarisation according to ROUGE. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 41–45, Valencia, Spain. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Ori Shapira, David Gabay, Yang Gao, Hadar Ronen, Ramakanth Pasunuru, Mohit Bansal, Yael Amsterdamer, and Ido Dagan. 2019. Crowdsourcing lightweight pyramids for manual summary evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 682–687, Minneapolis, Minnesota. Association for Computational Linguistics.

Eva Sharma, Luyang Huang, Zhe Hu, and Lu Wang. 2019. An entity-driven framework for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3280–3291, Hong Kong, China. Association for Computational Linguistics.

Simeng Sun and Ani Nenkova. 2019. The feasibility of embedding based automatic evaluation for single document summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1216–1221, Hong Kong, China. Association for Computational Linguistics.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stefan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antonio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272.

Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, and Ion Androutsopoulos. 2019. SUM-QE: a BERT-based summary quality estimation model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6005–6011, Hong Kong, China. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.

Markus Zopf. 2018. Estimating summary quality with pairwise preferences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1687–1696, New Orleans, Louisiana. Association for Computational Linguistics.

A Interface Screenshots

We show screenshots of the instructions for both annotation methods and tasks in Figure 8 and of the interfaces in Figure 9.

B Survey

B.1 Categories

While most categories are self-explanatory, we elaborate on some of the decisions we made during the survey in this section.

Evaluation Questions. We allow a single study to include multiple evaluation questions, as long as all questions are answered by the same annotators and use the same method. We make no distinction between informativeness, coverage, focus and relevance and summarize them under Content. Similarly, we summarize fluency, grammaticality and readability under Fluency. Other includes:

• One study with a specialized set of evaluation questions evaluating the usefulness of a generated related work summary

• One study of polarity in a sentiment summarization context

• One study where annotators were asked to identify the aspect a summary covers in the context of review summarization

• Two studies evaluating formality and meaning similarity of reference and system summary

• One study evaluating diversity

• One study conducting a Turing test

• One study asking paper authors whether they would consider a sentence part of a summary of their own paper

• One study evaluating structure and topic diversity

Evaluation Method. Binary includes any task with a yes/no style decision, while pairwise includes any method in which two systems are ranked against each other. Other includes:

• The aspect identification task mentioned above

• One study in which participants selected a single best summary out of a set of summaries

Annotator Recruitment. Other includes any recruitment strategy that does not rely on crowdsourcing: cases in which recruitment was not specified, as well as students, experts, the authors themselves and various kinds of volunteers.

Statistical Evaluation. Other/unspecified includes:

• Four studies that reported statistical significance without reporting the test used

• Two studies using the approximate randomization test

• One study using the chi-square test

• One study using a Tukey test without a prior ANOVA

B.2 Survey Files

All papers we considered for the survey are listed in the supplementary material in the file all papers.yaml by their id in the ACL anthology bib-file. The 58 SDS/MDS system papers that contain new human evaluation studies and are thus included in the survey are listed in the category with human eval.

For the sake of completeness, we further list the summarization papers we did not include in our survey. We separate them into the following categories (see the sketch after this list):

no human eval 47 SDS/MDS system papers without human evaluation

sentsum 27 sentence summarization and headline generation papers

non system 34 summarization papers that do not introduce new systems, like surveys, opinion pieces and evaluation studies

other 10 papers that conduct summarization with either non-textual input or non-textual output
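The exact structure of all papers.yaml is not specified beyond the category names above. As a minimal, hypothetical sketch (not part of the released material), assuming the YAML file maps each category name to a list of ACL anthology ids, and that the spaces shown in this appendix correspond to underscores in the actual file and key names, the file could be inspected as follows:

# Minimal sketch; file and key names are assumptions, not taken verbatim
# from the released supplementary material.
import yaml  # requires PyYAML

with open("all_papers.yaml") as f:
    papers = yaml.safe_load(f)

# Print the size of each category (e.g. with_human_eval, no_human_eval,
# sentsum, non_system, other).
for category, ids in papers.items():
    print(f"{category}: {len(ids)} papers")

# The 58 surveyed system papers with human evaluation studies.
surveyed = papers.get("with_human_eval", [])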


Figure 8: Screenshots of the Annotator Instructions. Panels: (a) Likert - Coherence, (b) Rank - Coherence, (c) Likert - Repetition, (d) Rank - Repetition.

Figure 9: Screenshots of the Annotation Interfaces. Panels: (a) Likert - Coherence, (b) Rank - Coherence, (c) Likert - Repetition, (d) Rank - Repetition. The ranking interfaces ask annotators to sort the displayed summaries from most to least coherent (coherence task) or from least to most unnecessary repetition (repetition task).


We give a full list of the survey results for all papers with human evaluation studies in the file survey details.csv. The file has the following columns (a loading sketch follows the column list):

paper Id of the paper in the ACL anthology

eval id Id of the evaluation study to differentiate them in papers with multiple studies

task Summarization task of the paper: SDS vs. MDS

genre Genre of the summarized documents

#docs Number of documents in the evaluation

#systems Number of systems in the evaluation

includes reference Whether the reference summary is included in the human evaluation

#ann total Total number of annotators in the study

#ann item Number of annotators per summary

content, fluency, repetition, coherence, referential clarity, other, overall Binary columns indicating the evaluation questions in the paper

measure Annotation method used in the study

anntype Annotator recruitment strategy

stattest Statistical test used

design specified Indicates whether it is possible to determine the full design from the information given about the study in the paper

comments Comment column. This column describes the use of other where present.
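As a minimal, hypothetical sketch (again not part of the released material), assuming the file and column names are spelled as listed above (the released CSV may use underscores instead of spaces), the survey results could be loaded and summarized with pandas:

# Minimal sketch; column names follow this appendix and may differ in the
# released file.
import pandas as pd

survey = pd.read_csv("survey_details.csv")

# Cross-tabulate the annotation method against the statistical test used.
print(pd.crosstab(survey["measure"], survey["stattest"]))

# How many studies report enough information to reconstruct the full design?
print(survey["design specified"].value_counts())

# Distribution of the total number of annotators and annotators per summary.
print(survey[["#ann total", "#ann item"]].describe())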