

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1059–1073, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


A Simple Theoretical Model of Importance for Summarization

Maxime Peyrard∗
EPFL

[email protected]

Abstract

Research on summarization has mainly been driven by empirical approaches, crafting systems to perform well on standard datasets with the notion of information Importance remaining latent. We argue that establishing theoretical models of Importance will advance our understanding of the task and help to further improve summarization systems. To this end, we propose simple but rigorous definitions of several concepts that were previously used only intuitively in summarization: Redundancy, Relevance, and Informativeness. Importance arises as a single quantity naturally unifying these concepts. Additionally, we provide intuitions to interpret the proposed quantities and experiments to demonstrate the potential of the framework to inform and guide subsequent works.

1 Introduction

Summarization is the process of identifying the most important information from a source to produce a comprehensive output for a particular user and task (Mani, 1999). While producing readable outputs is a problem shared with the field of Natural Language Generation, the core challenge of summarization is the identification and selection of important information. The task definition is rather intuitive but involves vague and undefined terms such as Importance and Information.

Since the seminal work of Luhn (1958), automatic text summarization research has focused on empirical developments, crafting summarization systems to perform well on standard datasets, leaving the formal definitions of Importance latent (Das and Martins, 2010; Nenkova and McKeown, 2012). This view entails collecting datasets, defining evaluation metrics and iteratively selecting the best-performing systems, either via supervised learning or via repeated comparison of unsupervised systems (Yao et al., 2017).

∗ Research partly done at UKP Lab from TU Darmstadt.

Such solely empirical approaches may lack guidance as they are often not motivated by more general theoretical frameworks. While these approaches have facilitated the development of practical solutions, they only identify signals correlating with the vague human intuition of Importance. For instance, structural features like centrality and repetitions are still among the most used proxies for Importance (Yao et al., 2017; Kedzie et al., 2018). However, such features just correlate with Importance in standard datasets. Unsurprisingly, simple adversarial attacks reveal their weaknesses (Zopf et al., 2016).

We postulate that theoretical models of Importance are beneficial to organize research and guide future empirical works. Hence, in this work, we propose a simple definition of information importance within an abstract theoretical framework. This requires the notion of information, which has received a lot of attention since the work of Shannon (1948) in the context of communication theory. Information theory provides the means to rigorously discuss the abstract concept of information, which seems particularly well suited as an entry point for a theory of summarization. However, information theory concentrates on uncertainty (entropy) about which message was chosen from a set of possible messages, ignoring the semantics of messages (Shannon, 1948). Yet, summarization is a lossy semantic compression depending on background knowledge.

In order to apply information theory to summarization, we assume texts are represented by probability distributions over so-called semantic units (Bao et al., 2011). This view is compatible with the common distributional embedding representation of texts, rendering the presented framework applicable in practice. When applied to semantic symbols, the tools of information theory indirectly operate at the semantic level (Carnap and Bar-Hillel, 1953; Zhong, 2017).

Contributions:

• We define several concepts intuitively connected to summarization: Redundancy, Relevance and Informativeness. These concepts have been used extensively in previous summarization works and we discuss along the way how our framework generalizes them.

• From these definitions, we formulate properties required from a useful notion of Importance as the quantity unifying these concepts. We provide intuitions to interpret the proposed quantities.

• Experiments show that, even under simplifying assumptions, these quantities correlate well with human judgments, making the framework promising for guiding future empirical works.

2 Framework

2.1 Terminology and Assumptions

We call semantic unit an atomic piece of information (Zhong, 2017; Cruse, 1986). We note Ω the set of all possible semantic units.

A text X is considered as a semantic source emitting semantic units, as envisioned by Weaver (1953) and discussed by Bao et al. (2011). Hence, we assume that X can be represented by a probability distribution P_X over the semantic units Ω.

Possible interpretations:
One can interpret P_X as the frequency distribution of semantic units in the text. Alternatively, P_X(ω_i) can be seen as the (normalized) likelihood that a text X entails an atomic information ω_i (Carnap and Bar-Hillel, 1953). Another interpretation is to view P_X(ω_i) as the normalized contribution (utility) of ω_i to the overall meaning of X (Zhong, 2017).

Motivation for semantic units:
In general, existing semantic information theories either postulate or imply the existence of semantic units (Carnap and Bar-Hillel, 1953; Bao et al., 2011; Zhong, 2017). For example, the Theory of Strongly Semantic Information produced by Floridi (2009) implies the existence of semantic units (called information units in his work). Building on this, Tsvetkov (2014) argued that the original theory of Shannon can operate at the semantic level by relying on semantic units.

In particular, existing semantic information theories imply the existence of semantic units in formal semantics (Carnap and Bar-Hillel, 1953), which treats natural languages as formal languages (Montague, 1970). In general, lexical semantics (Cruse, 1986) also postulates the existence of elementary constituents called minimal semantic constituents. For instance, with frame semantics (Fillmore, 1976), frames can act as semantic units.

Recently, distributional semantics approaches have received a lot of attention (Turian et al., 2010; Mikolov et al., 2013b). They are based on the distributional hypothesis (Harris, 1954) and the assumption that meaning can be encoded in a vector space (Turney and Pantel, 2010; Erk, 2010). These approaches also search for latent and independent components that underlie the behavior of words (Gabor et al., 2017; Mikolov et al., 2013a).

While different approaches to semantics postulate different basic units and different properties for them, they have in common that meaning arises from a set of independent and discrete units. Thus, the semantic units assumption is general and has minimal commitment to the actual nature of semantics. This makes the framework compatible with most existing semantic representation approaches. Each approach specifies these units and can be plugged into the framework, e.g., frame semantics would define units as frames, topic models (Allahyari et al., 2017) would define units as topics, and distributional representations would define units as dimensions of a vector space.

In the following paragraphs, we represent the source document(s) D and a candidate summary S by their respective distributions P_D and P_S.¹
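To make this concrete, the following is a minimal Python sketch of how a text can be turned into such a distribution, using words as a crude stand-in for semantic units (the same simplification the experiments in Section 3 make). The tokenization and function name are illustrative assumptions, not part of the paper.

```python
# Sketch: representing a text X as a probability distribution P_X over units.
# Words stand in for semantic units here, as in the paper's own experiments.
from collections import Counter

def unit_distribution(text):
    """Return P_X as a dict {unit: relative frequency}."""
    units = text.lower().split()  # naive tokenization, purely illustrative
    counts = Counter(units)
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}
```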

2.2 Redundancy

Intuitively, a summary should contain a lot of information. In information-theoretic terms, the amount of information is measured by Shannon's entropy. For a summary S represented by P_S:

H(S) = −∑_{ω_i} P_S(ω_i) · log(P_S(ω_i))    (1)

¹ We sometimes note X instead of P_X when it is not ambiguous.

H(S) is maximized for a uniform probability distribution when every semantic unit is present only once in S: ∀(i, j), P_S(ω_i) = P_S(ω_j). Therefore, we define Redundancy, our first quantity relevant to summarization, via entropy:

Red(S) = H_max − H(S)    (2)

Since H_max = log |Ω| is a constant independent of S, we can simply write: Red(S) = −H(S).
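A minimal sketch of Eqs. (1)–(2) under the word-as-unit simplification; the function names are illustrative:

```python
# Sketch of Eqs. (1)-(2): entropy of a summary and its Redundancy.
import math

def entropy(p):
    """H(S) in bits for a distribution given as {unit: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def redundancy(p_s, num_units):
    """Red(S) = H_max - H(S), with H_max = log2 |Omega| (Eq. 2)."""
    return math.log2(num_units) - entropy(p_s)
```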

Redundancy in Previous Works:
By definition, entropy encompasses the notion of maximum coverage. Low redundancy via maximum coverage is the main idea behind the use of submodularity (Lin and Bilmes, 2011). Submodular functions are generalizations of coverage functions which can be optimized greedily with guarantees that the result would not be far from optimal (Fujishige, 2005). Thus, they have been used extensively in summarization (Sipos et al., 2012; Yogatama et al., 2015). Otherwise, low redundancy is usually enforced during the extraction/generation procedures, like MMR (Carbonell and Goldstein, 1998).

2.3 Relevance

Intuitively, observing a summary should reduce our uncertainty about the original text. A summary approximates the original source(s) and this approximation should incur a minimum loss of information. This property is usually called Relevance.

Here, estimating Relevance boils down to comparing the distributions P_S and P_D, which is done via the cross-entropy: Rel(S, D) = −CE(S, D):

Rel(S, D) = ∑_{ω_i} P_S(ω_i) · log(P_D(ω_i))    (3)

The cross-entropy is interpreted as the average surprise of observing S while expecting D. A summary with a low expected surprise produces a low uncertainty about what the original sources were. This is achieved by exhibiting a distribution of semantic units similar to the one of the source documents: P_S ≈ P_D.
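A sketch of Eq. (3) under the same assumptions; the epsilon floor for units that D never emits is our own smoothing choice, not specified by the paper:

```python
# Sketch of Eq. (3): Relevance as negative cross-entropy between S and D.
import math

def cross_entropy(p_s, p_d, eps=1e-12):
    """CE(S, D); eps is an assumed floor for units absent from D."""
    return -sum(ps * math.log2(max(p_d.get(u, 0.0), eps))
                for u, ps in p_s.items())

def relevance(p_s, p_d):
    """Rel(S, D) = -CE(S, D): highest when P_S matches P_D."""
    return -cross_entropy(p_s, p_d)
```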

Furthermore, we observe the following connection with Redundancy:

KL(S||D) = CE(S, D) − H(S)
−KL(S||D) = Rel(S, D) − Red(S)    (4)

The KL divergence is the information loss incurred by using D as an approximation of S (i.e., the uncertainty about D arising from observing S instead of D). A summarizer that minimizes the KL divergence minimizes Redundancy while maximizing Relevance.

In fact, this is an instance of the Kullback Minimum Discrimination Information (MDI) principle (Kullback and Leibler, 1951), a generalization of the Maximum Entropy Principle (Jaynes, 1957): the summary minimizing the KL divergence is the least biased (i.e., least redundant or with highest entropy) summary matching D. In other words, this summary fits D while inducing a minimum amount of new information. Indeed, any new information is necessarily biased since it does not arise from observations in the sources. The MDI principle and KL divergence unify Redundancy and Relevance.
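This unification can be checked numerically; a minimal sketch (the eps floor is again an assumption):

```python
# Sketch of Eq. (4): KL(S||D) = CE(S, D) - H(S), so minimizing KL jointly
# maximizes Relevance (-CE) and minimizes Redundancy (-H, up to H_max).
import math

def kl_divergence(p_s, p_d, eps=1e-12):
    return sum(ps * math.log2(ps / max(p_d.get(u, 0.0), eps))
               for u, ps in p_s.items() if ps > 0)
```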

Relevance in Previous Works:
Relevance is the most heavily studied aspect of summarization. In fact, by design, most unsupervised systems model Relevance. Usually, they use the idea of topical frequency, where the most frequent topics from the sources must be extracted. Different notions of topics and counting heuristics have been proposed. We briefly discuss these developments here.

Luhn (1958) introduced the simple but influential idea that sentences containing the most important words are most likely to embody the original document. Later, Nenkova et al. (2006) showed experimentally that humans tend to use words appearing frequently in the sources to produce their summaries. Then, Vanderwende et al. (2007) developed the system SumBasic, which scores each sentence by the average probability of its words.

The same ideas can be generalized to n-grams. A prominent example is the ICSI system (Gillick and Favre, 2009), which extracts frequent bigrams. Despite being rather simple, ICSI produces strong and still close to state-of-the-art summaries (Hong et al., 2014).

Different but similar words may refer to the same topic and should not be counted separately. This observation gave rise to a set of important techniques based on topic models (Allahyari et al., 2017). These approaches cover sentence clustering (McKeown et al., 1999; Radev et al., 2000; Zhang et al., 2015), lexical chains (Barzilay and Elhadad, 1999), Latent Semantic Analysis (Deerwester et al., 1990) or Latent Dirichlet Allocation (Blei et al., 2003) adapted to summarization (Hachey et al., 2006; Daume III and Marcu, 2006; Wang et al., 2009; Davis et al., 2012). Approaches like hLDA can exploit repetitions both at the word and at the sentence level (Celikyilmaz and Hakkani-Tur, 2010).

Graph-based methods form another particularly powerful class of techniques to estimate the frequency of topics, e.g., via the notion of centrality (Mani and Bloedorn, 1997; Mihalcea and Tarau, 2004; Erkan and Radev, 2004). A significant body of research was dedicated to tweaking and improving various components of graph-based approaches. For example, one can investigate different similarity measures (Chali and Joty, 2008). Also, different weighting schemes between sentences have been investigated (Leskovec et al., 2005; Wan and Yang, 2006).

Therefore, in existing approaches, the topics (i.e., atomic units) were words, n-grams, sentences or combinations of these. The general idea of preferring frequent topics based on various counting heuristics is formalized by cross-entropy. Indeed, requiring the summary to minimize the cross-entropy with the source documents implies that frequent topics in the sources should be extracted first.

An interesting line of work is based on the assumption that the best sentences are the ones that permit the best reconstruction of the input documents (He et al., 2012). It was refined by a stream of works using distributional similarities (Li et al., 2015; Liu et al., 2015; Ma et al., 2016). There, the atomic units are the dimensions of the vector spaces. This information bottleneck idea is also neatly captured by the notion of cross-entropy, which is a measure of information loss. Alternatively, Daume and Marcu (2002) viewed summarization as a noisy communication channel, which is also rooted in information theory ideas. Wilson and Sperber (2008) provide a more general and less formal discussion of relevance in the context of Relevance Theory (Lavrenko, 2008).

2.4 Informativeness

Relevance still ignores other potential sources of information such as previous knowledge or preconceptions. We need to further extend the contextual boundary. Intuitively, a summary is informative if it induces, for a user, a great change in her knowledge about the world. Therefore, we introduce K, the background knowledge (or preconceptions about the task). K is represented by a probability distribution P_K over the semantic units Ω.

Formally, the amount of new information contained in a summary S is given by the cross-entropy: Inf(S, K) = CE(S, K):

Inf(S, K) = −∑_{ω_i} P_S(ω_i) · log(P_K(ω_i))    (5)

For Relevance, the cross-entropy between S and D should be low. However, for Informativeness, the cross-entropy between S and K should be high because we measure the amount of new information induced by the summary in our knowledge.

Background knowledge is modeled by assigning a high probability to known semantic units. These probabilities correspond to the strength of ω_i in the user's memory. A simple model could be the uniform distribution over known information: P_K(ω_i) is 1/n if the user knows ω_i, and 0 otherwise.

However, K can control other variants of the summarization task: a personalized K_p models the preferences of a user by setting low probabilities to the semantic units of interest. Similarly, a query Q can be encoded by setting low probability to semantic units related to Q. Finally, there is a natural formulation of update summarization. Let U and D be two sets of documents. Update summarization consists in summarizing D given that the user has already seen U. This is modeled by setting K = U, considering U as previous knowledge.
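A sketch of Eq. (5), plus the update-summarization choice of K described above (word-level distributions; names and the eps floor are illustrative assumptions):

```python
# Sketch of Eq. (5): Informativeness as cross-entropy between S and K.
import math

def informativeness(p_s, p_k, eps=1e-12):
    """Inf(S, K) = CE(S, K): high when S carries units the user barely knows."""
    return -sum(ps * math.log2(max(p_k.get(u, 0.0), eps))
                for u, ps in p_s.items())

# Update summarization, per the text: take K to be the distribution of the
# already-seen documents U, e.g. p_k = unit_distribution(" ".join(U)).
```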

Informativeness in Previous Works:
The modeling of Informativeness has received less attention from the summarization community. The problem of identifying stopwords originally faced by Luhn (1958) could be addressed by developments in the field of information retrieval using background corpora, like TF·IDF (Sparck Jones, 1972). Based on the same intuition, Dunning (1993) outlined an alternative way of identifying highly descriptive words: the log-likelihood ratio test. Words identified with such techniques are known to be useful in news summarization (Harabagiu and Lacatusu, 2005).

Furthermore, Conroy et al. (2006) proposed to model background knowledge by a large random set of news articles. In update summarization, Delort and Alfonseca (2012) used Bayesian topic models to ensure the extraction of informative summaries. Louis (2014) investigated background knowledge for update summarization with Bayesian surprise. This is comparable to the combination of Informativeness and Redundancy in our framework when semantic units are n-grams. Thus, previous approaches to Informativeness generally craft an alternative background distribution to model the a priori importance of units. Then, units from the document that are rare in the background are preferred, which is captured by maximizing the cross-entropy between the summary and K. Indeed, infrequent units in the background would be preferred in the summary because they would be surprising (i.e., informative) to an average user.

2.5 Importance

Since Importance is a measure that guides which choices to make when discarding semantic units, we must devise a way to encode their relative importance. Here, this means finding a probability distribution unifying D and K by encoding expectations about which semantic units should appear in a summary.

Informativeness requires a biased summary (w.r.t. K) and Relevance requires an unbiased summary (w.r.t. D). Thus, a summary should, by using only information available in D, produce what brings the most new information to a user with knowledge K. This could formalize a common intuition in summarization that units frequent in the source(s) but rare in the background are important.

Formally, let d_i = P_D(ω_i) be the probability of the unit ω_i in the source D. Similarly, we note k_i = P_K(ω_i). We seek a function f(d_i, k_i) encoding the importance of unit ω_i. We formulate simple requirements that f should satisfy:

• Informativeness: ∀i ≠ j, if d_i = d_j and k_i > k_j, then f(d_i, k_i) < f(d_j, k_j)

• Relevance: ∀i ≠ j, if d_i > d_j and k_i = k_j, then f(d_i, k_i) > f(d_j, k_j)

• Additivity: I(f(d_i, k_i)) ≡ αI(d_i) + βI(k_i), where I is the information measure from Shannon's theory (Shannon, 1948)

• Normalization: ∑_i f(d_i, k_i) = 1

The first requirement states that, for two semantic units equally represented in the sources, we prefer the more informative one. The second requirement is an analogous statement for Relevance. The third requirement is a consistency constraint to preserve additivity of the information measures (Shannon, 1948). The fourth requirement ensures that f is a valid distribution.

Theorem 1. The functions satisfying the previous requirements are of the form:

P_{D/K}(ω_i) = (1/C) · d_i^α / k_i^β    (6)

C = ∑_i d_i^α / k_i^β,  α, β ∈ R+    (7)

C is the normalizing constant. The parameters α and β represent the strength given to Relevance and Informativeness respectively, which is made clearer by equation (11). The proof is provided in appendix B.
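Eq. (6) translates directly into code. A sketch follows; the eps floor for units with zero background probability is our assumption (it mirrors the theory's behavior that units unknown in K receive very high importance):

```python
# Sketch of Eqs. (6)-(7): the importance-encoding distribution P_{D/K}.
def importance_distribution(p_d, p_k, alpha=1.0, beta=1.0, eps=1e-12):
    scores = {u: (pd ** alpha) / (max(p_k.get(u, 0.0), eps) ** beta)
              for u, pd in p_d.items()}
    c = sum(scores.values())  # normalizing constant C of Eq. (7)
    return {u: s / c for u, s in scores.items()}
```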

Summary scoring function:
By construction, a candidate summary should approximate P_{D/K}, which encodes the relative importance of semantic units. Furthermore, the summary should be non-redundant (i.e., have high entropy). These two requirements are unified by the Kullback MDI principle: the least biased summary S∗ that best approximates the distribution P_{D/K} is the solution of:

S∗ = argmax_S θ_I = argmin_S KL(S || P_{D/K})    (8)

Thus, we note θ_I as the quantity that scores summaries:

θ_I(S, D, K) = −KL(P_S || P_{D/K})    (9)
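A sketch of Eq. (9) under the same conventions:

```python
# Sketch of Eq. (9): scoring a summary by theta_I = -KL(P_S || P_{D/K}).
import math

def theta_i(p_s, p_dk, eps=1e-12):
    """Higher is better; 0 is attained when P_S equals P_{D/K} exactly."""
    return -sum(ps * math.log2(ps / max(p_dk.get(u, 0.0), eps))
                for u, ps in p_s.items() if ps > 0)
```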

Interpretation of P_{D/K}:
P_{D/K} can be viewed as an importance-encoding distribution because it encodes the relative importance of semantic units and gives an overall target for the summary.

For example, if a semantic unit ω_i is prominent in D (P_D(ω_i) is high) and not known in K (P_K(ω_i) is low), then P_{D/K}(ω_i) is very high, which means it is very desired in the summary. Indeed, choosing this unit will fill the gap in the knowledge K while matching the sources.

Figure 1 illustrates how this distribution behaves with respect to D and K (for α = β = 1).

Summarizability:
The target distribution P_{D/K} may exhibit different properties. For example, it might be clear which semantic units should be extracted (i.e., a spiky probability distribution) or it might be unclear (i.e., many units have more or less the same importance score). This can be quantified by the entropy of the importance-encoding distribution:

H_{D/K} = H(P_{D/K})    (10)

Intuitively, this measures the number of possibly good summaries. If H_{D/K} is low, then P_{D/K} is spiky and there is little uncertainty about which semantic units to extract (few possible good summaries). Conversely, if the entropy is high, many equivalently good summaries are possible.

Interpretation of θ_I:
To better understand θ_I, we remark that it can be expressed in terms of the previously defined quantities:

θ_I(S, D, K) ≡ −Red(S) + α·Rel(S, D)    (11)
                       + β·Inf(S, K)    (12)

Equality holds up to a constant term log C independent of S. Maximizing θ_I is equivalent to maximizing Relevance and Informativeness while minimizing Redundancy. Their relative strengths are encoded by α and β.

Finally, H(S), CE(S, D) and CE(S, K) are the three independent components of Importance.

It is worth noting that each previously defined quantity, Red, Rel and Inf, is measured in bits (using base 2 for the logarithm). Then, θ_I is also an information measure expressed in bits. Shannon (1948) initially axiomatized that information quantities should be additive, and therefore θ_I arising as the sum of other information quantities is unsurprising. Moreover, we ensured additivity with the third requirement on P_{D/K}.
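Putting the pieces together, here is an end-to-end toy run (all texts, names, and the eps smoothing are invented for illustration; α = β = 1 as in the experiments of Section 3):

```python
# Toy end-to-end example: a summary covering units frequent in D but absent
# from K should outscore one that only repeats what the user already knows.
import math
from collections import Counter

def dist(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def theta_i(p_s, p_dk, eps=1e-12):
    return -sum(p * math.log2(p / max(p_dk.get(w, 0.0), eps))
                for w, p in p_s.items() if p > 0)

D = dist("mine safety mine accident safety regulation accident report")
K = dist("report report regulation")  # background knowledge of the user
raw = {w: pd / max(K.get(w, 0.0), 1e-12) for w, pd in D.items()}  # alpha=beta=1
C = sum(raw.values())
P_DK = {w: v / C for w, v in raw.items()}

s_new = dist("mine safety accident")   # frequent in D, unknown to the user
s_old = dist("regulation report")      # already known to the user
assert theta_i(s_new, P_DK) > theta_i(s_old, P_DK)
```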

2.6 Potential Information

Relevance relates S and D, Informativeness relates S and K, but we can also connect D and K. Intuitively, we can extract a lot of new information from D only when K and D are different.

With the same argument laid out for Informativeness, we can define the amount of potential information as the average surprise of observing D while already knowing K. Again, this is given by the cross-entropy: PI_K(D) = CE(D, K):

PI_K(D) = −∑_{ω_i} P_D(ω_i) · log(P_K(ω_i))    (13)

Previously, we stated that a summary should aim, using only information from D, to offer the maximum amount of new information with respect to K. PI_K(D) can be understood as Potential Information, or maximum Informativeness: the maximum amount of new information that a summary can extract from D while knowing K. A summary S cannot extract more than PI_K(D) bits of information (if using only information from D).
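A sketch of Eq. (13), mirroring the Informativeness code above (eps floor again assumed):

```python
# Sketch of Eq. (13): Potential Information PI_K(D) = CE(D, K), an upper
# bound on the Informativeness any summary of D can achieve.
import math

def potential_information(p_d, p_k, eps=1e-12):
    return -sum(pd * math.log2(max(p_k.get(u, 0.0), eps))
                for u, pd in p_d.items())
```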

3 Experiments

3.1 Experimental setup

To further illustrate the workings of the formulas, we provide examples of experiments done with a simplistic choice for semantic units: words. Even with simple assumptions, θ_I is a meaningful quantity which correlates well with human judgments.

Data:
We experiment with standard datasets for two different summarization tasks: generic and update multi-document summarization.

We use two datasets from the Text Analysis Conference (TAC) shared task: TAC-2008 and TAC-2009.² In the update part, 10 new documents (B documents) are to be summarized assuming that the first 10 documents (A documents) have already been seen. The generic task consists in summarizing the initial document set (A).

For each topic, there are 4 human reference summaries and a manually created Pyramid set (Nenkova et al., 2007). In both editions, all system summaries and the 4 reference summaries were manually evaluated by NIST assessors for readability, content selection (with Pyramid) and overall responsiveness. At the time of the shared tasks, 57 systems were submitted to TAC-2008 and 55 to TAC-2009.

² http://tac.nist.gov/2009/Summarization/, http://tac.nist.gov/2008/


(a) distribution P_D  (b) distribution P_K  (c) distribution P_{D/K}

Figure 1: Figure 1a represents an example distribution of sources, Figure 1b an example distribution of background knowledge, and Figure 1c is the resulting target distribution that summaries should approximate.

Setup and Assumptions:
To keep the experiments simple and focused on illustrating the formulas, we make several simplistic assumptions. First, we choose words as semantic units and therefore texts are represented as frequency distributions over words. This assumption was already employed by previous works using information-theoretic tools for summarization (Haghighi and Vanderwende, 2009). While it is limiting, this remains a simple approximation letting us observe the quantities in action.

K, α and β are the parameters of the theory and their choice is subject to empirical investigation. Here, we make simple choices: for update summarization, K is the frequency distribution over words in the background documents (A). For generic summarization, K is the uniform probability distribution over all words from the source documents. Furthermore, we use α = β = 1.

3.2 Correlation with humans

First, we measure how well the different quantities correlate with human judgments. We compute the score of each system summary according to each quantity defined in the previous section: Red, Rel, Inf, θ_I(S, D, K). We then compute the correlations between these scores and the manual Pyramid scores. Indeed, each quantity is a summary scoring function and could, therefore, be evaluated based on its ability to correlate with human judgments (Lin and Hovy, 2003). Thus, we also report the performances of the summary scoring functions from several standard baselines: Edmundson (Edmundson, 1969), which scores sentences based on 4 methods: term frequency, presence of cue-words, overlap with title and position of the sentence. LexRank (Erkan and Radev, 2004) is a popular graph-based approach which scores sentences based on their centrality in a sentence similarity graph. ICSI (Gillick and Favre, 2009) extracts a summary by solving a maximum coverage problem considering the most frequent bigrams in the source documents. KL and JS (Haghighi and Vanderwende, 2009) measure the divergence between the distribution of words in the summary and in the sources. Furthermore, we report two baselines from Louis (2014) which account for background knowledge: KL_back and JS_back, which measure the divergence between the distribution of the summary and the background knowledge K. Further details concerning baseline scoring functions can be found in appendix A.

We measure the correlations with Kendall's τ, a rank correlation metric which compares the orders induced by both scored lists. We report results for both generic and update summarization, averaged over all topics for both datasets, in Table 1.
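The evaluation protocol is simple to reproduce in outline; a sketch assuming scipy is available (variable names are illustrative):

```python
# Sketch of the correlation protocol: rank system summaries by a scoring
# function and compare against manual Pyramid scores with Kendall's tau.
from scipy.stats import kendalltau

def correlation_with_humans(metric_scores, pyramid_scores):
    """Both arguments are lists aligned by system summary."""
    tau, _p_value = kendalltau(metric_scores, pyramid_scores)
    return tau
```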

In general, the models of Relevance (based only on the sources) correlate better with human judgments than other quantities. Metrics accounting for background knowledge work better in the update scenario. This is not surprising, as the background knowledge K is more meaningful in this case (using the previous document set).

We observe that the JS divergence gives slightly better results than KL. Even though KL is more theoretically appealing, JS is smoother and usually works better in practice when distributions have different supports (Louis and Nenkova, 2013).

Finally, θ_I significantly³ outperforms all baselines in both the generic and the update case. Red, Rel and Inf are not particularly strong on their own, but combined together they yield a strong summary scoring function θ_I. Indeed, each quantity models only one aspect of content selection; only together do they form a strong signal for Importance.

³ At 0.01, with significance testing done with a t-test to compare two means.


We need to be careful when interpreting these results because we made several strong assumptions: by choosing n-grams as semantic units and by choosing K rather arbitrarily. Nevertheless, these are promising results. By investigating better text representations and more realistic K, we should expect even higher correlations.

We provide a qualitative example on one topic in appendix C with a visualization of P_{D/K} in comparison to reference summaries.

           Generic   Update
ICSI       .178      .139
Edm.       .215      .205
LexRank    .201      .164
KL         .204      .176
JS         .225      .189
KL_back    .110      .167
JS_back    .066      .187
Red        .098      .096
Rel        .212      .192
Inf        .091      .086
θ_I        .294      .211

Table 1: Correlation of various information-theoretic quantities with human judgments, measured by Kendall's τ, on generic and update summarization.

3.3 Comparison with Reference Summaries

Intuitively, the distribution P_{D/K} should be similar to the probability distribution P_R of the human-written reference summaries.

To verify this, we scored the system summaries and the reference summaries with θ_I and checked whether there is a significant difference between the two lists.⁴ We found that θ_I scores reference summaries significantly higher than system summaries. The p-value is 9.2e−6 for the generic case and 1.1e−3 for the update case. Both are much smaller than the 1e−2 significance level. Therefore, θ_I is capable of distinguishing system summaries from human-written ones. For comparison, the best baseline (JS) has the following p-values: 8.2e−3 (generic) and 4.5e−2 (update). It does not pass the 1e−2 significance level for the update scenario.

⁴ With a standard t-test for comparing two related means.

4 Conclusion and Future Work

In this work, we argued for the development of theoretical models of Importance and proposed one such framework, rooted in information theory. We formalized several summary-related quantities: Redundancy, Relevance and Informativeness. Importance arises as the notion unifying these concepts. More generally, Importance is the measure that guides which choices to make when information must be discarded. The introduced quantities generalize the intuitions that have previously been used in summarization research.

Conceptually, it is straightforward to build a system out of θ_I once a semantic unit representation and a K have been chosen. A summarizer intends to extract or generate a summary maximizing θ_I. This fits within the general optimization framework for summarization (McDonald, 2007; Peyrard and Eckle-Kohler, 2017b; Peyrard and Gurevych, 2018).

The background knowledge and the choice of semantic units are free parameters of the theory. They are design choices which can be explored empirically by subsequent works. Our experiments already hint that strong summarizers can be developed from this framework. Characters, character n-grams, morphemes, words, n-grams, phrases, and sentences do not actually qualify as semantic units. Even though previous works that relied on information-theoretic motivation (Lin et al., 2006; Haghighi and Vanderwende, 2009; Louis and Nenkova, 2013; Peyrard and Eckle-Kohler, 2016) used some of them as support for probability distributions, they are neither atomic nor independent. This is mainly because they are surface forms, whereas semantic units are abstract and operate at the semantic level. However, they might serve as convenient approximations. Then, interesting research questions arise, like: Which granularity offers a good approximation of semantic units? Can we automatically learn good approximations? N-grams are known to be useful, but other granularities have rarely been considered together with information-theoretic tools.

For the background knowledge K, a promising direction would be to use the framework to actually learn it from data. In particular, one can apply supervised techniques to automatically search for K, α and β: finding the values of these parameters such that θ_I has the best correlation with human judgments. By aggregating over many users and many topics, one can find a generic K: what, on average, people consider as known when summarizing a document. By aggregating over different people but in one domain, one can uncover a domain-specific K. Similarly, by aggregating over many topics for one person, one would find a personalized K.

These constitute promising research directions for future works.

Acknowledgements

This work was partly supported by the German Research Foundation (DFG) as part of the Research Training Group "Adaptive Preparation of Information from Heterogeneous Sources" (AIPHES) under grant No. GRK 1994/1, and via the German-Israeli Project Cooperation (DIP, grant No. GU 798/17-1). We also thank the anonymous reviewers for their comments.

References

Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text Summarization Techniques: A Brief Survey. International Journal of Advanced Computer Science and Applications, 8(10).

Jie Bao, Prithwish Basu, Mike Dean, Craig Partridge, Ananthram Swami, Will Leland, and James A. Hendler. 2011. Towards a Theory of Semantic Communication. In Network Science Workshop (NSW), 2011 IEEE, pages 110–117. IEEE.

Regina Barzilay and Michael Elhadad. 1999. Using Lexical Chains for Text Summarization. Advances in Automatic Text Summarization, pages 111–121.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 335–336.

Rudolf Carnap and Yehoshua Bar-Hillel. 1953. An Outline of a Theory of Semantic Information. British Journal for the Philosophy of Science, 4.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2010. A Hybrid Hierarchical Model for Multi-Document Summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 815–824, Uppsala, Sweden. Association for Computational Linguistics.

Yllias Chali and Shafiq R. Joty. 2008. Improving the Performance of the Random Walk Model for Answering Complex Questions. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 9–12. Association for Computational Linguistics.

John M. Conroy, Judith D. Schlesinger, and Dianne P. O'Leary. 2006. Topic-Focused Multi-Document Summarization Using an Approximate Oracle Score. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 152–159, Sydney, Australia. Association for Computational Linguistics.

D. A. Cruse. 1986. Lexical Semantics. Cambridge University Press, Cambridge, UK.

Dipanjan Das and Andre F. T. Martins. 2010. A Survey on Automatic Text Summarization. Literature Survey for the Language and Statistics II Course at CMU.

Hal Daume III and Daniel Marcu. 2002. A Noisy-channel Model for Document Compression. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 449–456.

Hal Daume III and Daniel Marcu. 2006. Bayesian Query-Focused Summarization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 305–312. Association for Computational Linguistics.

Sashka T. Davis, John M. Conroy, and Judith D. Schlesinger. 2012. OCCAMS – An Optimal Combinatorial Covering Algorithm for Multi-document Summarization. In Proceedings of the 12th International Conference on Data Mining Workshops (ICDMW), pages 454–463. IEEE.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Jean-Yves Delort and Enrique Alfonseca. 2012. DualSum: A Topic-model Based Approach for Update Summarization. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 214–223.

Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61–74.

H. P. Edmundson. 1969. New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 16(2):264–285.

Katrin Erk. 2010. What is Word Meaning, Really? (And How Can Distributional Models Help Us Describe It?). In Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics, pages 17–26. Association for Computational Linguistics.

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical Centrality As Salience in Text Summarization. Journal of Artificial Intelligence Research, pages 457–479.

Charles J. Fillmore. 1976. Frame Semantics and the Nature of Language. Annals of the New York Academy of Sciences, 280(1):20–32.

Luciano Floridi. 2009. Philosophical Conceptions of Information. In Formal Theories of Information, pages 13–53. Springer.

Satoru Fujishige. 2005. Submodular Functions and Optimization. Annals of Discrete Mathematics. Elsevier, Amsterdam, Boston, Paris.

Kata Gabor, Haifa Zargayouna, Isabelle Tellier, Davide Buscaldi, and Thierry Charnois. 2017. Exploring Vector Spaces for Semantic Relations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1814–1823, Copenhagen, Denmark. Association for Computational Linguistics.

Dan Gillick and Benoit Favre. 2009. A Scalable Global Model for Summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 10–18, Boulder, Colorado. Association for Computational Linguistics.

Ben Hachey, Gabriel Murray, and David Reitter. 2006. Dimensionality Reduction Aids Term Co-Occurrence Based Multi-Document Summarization. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, pages 1–7. Association for Computational Linguistics.

Aria Haghighi and Lucy Vanderwende. 2009. Exploring Content Models for Multi-document Summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370.

Sanda Harabagiu and Finley Lacatusu. 2005. Topic Themes for Multi-document Summarization. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 202–209.

Zellig Harris. 1954. Distributional Structure. Word, 10:146–162.

Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document Summarization Based on Data Reconstruction. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence.

Kai Hong, John Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A Repository of State of the Art and Competitive Baseline Summaries for Generic News Summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1608–1616, Reykjavik, Iceland.

Edwin T. Jaynes. 1957. Information Theory and Statistical Mechanics. Physical Review, 106:620–630.

Chris Kedzie, Kathleen McKeown, and Hal Daume III. 2018. Content Selection in Deep Learning Models of Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828. Association for Computational Linguistics.

Solomon Kullback and Richard A. Leibler. 1951. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Victor Lavrenko. 2008. A Generative Theory of Relevance, volume 26. Springer Science & Business Media.

Jure Leskovec, Natasa Milic-Frayling, and Marko Grobelnik. 2005. Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts. In Proceedings of the National Conference on Artificial Intelligence, pages 1069–1074.

Piji Li, Lidong Bing, Wai Lam, Hang Li, and Yi Liao. 2015. Reader-Aware Multi-document Summarization via Sparse Coding. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1270–1276.

Chin-Yew Lin, Guihong Cao, Jianfeng Gao, and Jian-Yun Nie. 2006. An Information-Theoretic Approach to Automatic Evaluation of Summaries. In Proceedings of the Human Language Technology Conference at NAACL, pages 463–470, New York City, USA.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, volume 1, pages 71–78.

Hui Lin and Jeff A. Bilmes. 2011. A Class of Submodular Functions for Document Summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 510–520, Portland, Oregon.

He Liu, Hongliang Yu, and Zhi-Hong Deng. 2015. Multi-document Summarization Based on Two-level Sparse Representation Model. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 196–202.

Annie Louis. 2014. A Bayesian Method to Incorporate Background Knowledge during Automatic Text Summarization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 333–338, Baltimore, Maryland.

Annie Louis and Ani Nenkova. 2013. Automatically Assessing Machine Summary Content Without a Gold Standard. Computational Linguistics, 39(2):267–300.

Hans Peter Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2:159–165.

Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. 2016. An Unsupervised Multi-Document Summarization Framework Based on Neural Document Model. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1514–1523. The COLING 2016 Organizing Committee.

Inderjeet Mani. 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge, MA, USA.

Inderjeet Mani and Eric Bloedorn. 1997. Multi-document Summarization by Graph Search and Matching. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, pages 622–628, Providence, Rhode Island. AAAI Press.

Ryan McDonald. 2007. A Study of Global Inference Algorithms in Multi-document Summarization. In Proceedings of the 29th European Conference on Information Retrieval Research, pages 557–564.

Kathleen R. McKeown, Judith L. Klavans, Vasileios Hatzivassiloglou, Regina Barzilay, and Eleazar Eskin. 1999. Towards Multidocument Summarization by Reformulation: Progress and Prospects. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, pages 453–460.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, Lake Tahoe, Nevada, USA.

Richard Montague. 1970. English as a Formal Language. In Bruno Visentini, editor, Linguaggi nella societa e nella tecnica, pages 188–221. Edizioni di Communita.

Ani Nenkova and Kathleen McKeown. 2012. A Survey of Text Summarization Techniques. Mining Text Data, pages 43–76.

Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2).

Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A Compositional Context Sensitive Multi-document Summarizer: Exploring the Factors That Influence Summarization. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 573–580.

Maxime Peyrard and Judith Eckle-Kohler. 2016. A General Optimization Framework for Multi-Document Summarization Using Genetic Algorithms and Swarm Intelligence. In Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 247–257.

Maxime Peyrard and Judith Eckle-Kohler. 2017a. A Principled Framework for Evaluating Summarizers: Comparing Models of Summary Quality against Human Judgments. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 2: Short Papers, pages 26–31. Association for Computational Linguistics.

Maxime Peyrard and Judith Eckle-Kohler. 2017b. Supervised Learning of Automatic Pyramid for Optimization-Based Multi-Document Summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 1: Long Papers, pages 1084–1094. Association for Computational Linguistics.

Maxime Peyrard and Iryna Gurevych. 2018. Objective Function Learning to Match Human Judgements for Optimization-Based Summarization. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 654–660. Association for Computational Linguistics.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based Summarization of Multiple Documents: Sentence Extraction, Utility-based Evaluation, and User Studies. In Proceedings of the NAACL-ANLP Workshop on Automatic Summarization, volume 4, pages 21–30, Seattle, Washington.

Claude E. Shannon. 1948. A Mathematical Theory of Communication. Bell Systems Technical Journal, 27:623–656.

Ruben Sipos, Pannaga Shivaswamy, and Thorsten Joachims. 2012. Large-margin Learning of Submodular Summarization Models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 224–233, Avignon, France. Association for Computational Linguistics.

Karen Sparck Jones. 1972. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1):11–21.

Victor Yakovlevich Tsvetkov. 2014. The K.E. Shannon and L. Floridi's Amount of Information. Life Science Journal, 11(11):667–671.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188.

Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused Summarization with Sentence Simplification and Lexical Expansion. Information Processing & Management, 43(6):1606–1618.

Xiaojun Wan and Jianwu Yang. 2006. Improved Affinity Graph Based Multi-Document Summarization. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 181–184. Association for Computational Linguistics.

Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document Summarization Using Sentence-based Topic Models. In Proceedings of the ACL-IJCNLP 2009, pages 297–300. Association for Computational Linguistics.

Warren Weaver. 1953. Recent Contributions to the Mathematical Theory of Communication. ETC: A Review of General Semantics, pages 261–281.

Deirdre Wilson and Dan Sperber. 2008. Relevance Theory, chapter 27. John Wiley and Sons, Ltd.

Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. 2017. Recent Advances in Document Summarization. Knowledge and Information Systems, 53(2):297–336.

Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive Summarization by Maximizing Semantic Volume. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1961–1966, Lisbon, Portugal.

Yang Zhang, Yunqing Xia, Yi Liu, and Wenmin Wang. 2015. Clustering Sentences with Density Peaks for Multi-document Summarization. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1262–1267, Denver, Colorado. Association for Computational Linguistics.

Yixin Zhong. 2017. A Theory of Semantic Information. In Proceedings of the IS4SI 2017 Summit Digitalisation for a Sustainable Society, 129.

Markus Zopf, Eneldo Loza Mencía, and Johannes Fürnkranz. 2016. Beyond Centrality and Structural Features: Learning Information Importance for Text Summarization. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), pages 84–94.


A Details about Baseline Scoring Functions

In the paper, we compare the summary scoring function θ_I against the summary scoring functions derived from several summarizers, following the methodology of Peyrard and Eckle-Kohler (2017a). Here, we give explicit formulations of the baseline scoring functions.

Edmundson: (Edmundson, 1969)
Edmundson (1969) presented a heuristic which scores sentences according to 4 different features:

• Cue-phrases: It is based on the hypothesis that the probable relevance of a sentence is affected by the presence of certain cue words such as 'significant' or 'important'. Bonus words have positive weights, stigma words have negative weights and all the others have no weight. The final score of the sentence is the sum of the weights of its words.

• Key: High-frequency content words are believed to be positively correlated with relevance (Luhn, 1958). Each word receives a weight based on its frequency in the document if it is not a stopword. The score of the sentence is also the sum of the weights of its words.

• Title: It measures the overlap between the sentence and the title.

• Location: It relies on the assumption that sentences appearing early or late in the source documents are more relevant.

By combining these scores with a linear combination, we can recognize the objective function:

θ_Edm(S) = ∑_{s∈S} α_1 · C(s) + α_2 · K(s)    (14)
                  + α_3 · T(s) + α_4 · L(s)    (15)

The sum runs over sentences, and C, K, T and L output the sentence scores for each method (Cue, Key, Title and Location).

ICSI: (Gillick and Favre, 2009)
A global linear optimization that extracts a summary by solving a maximum coverage problem of the most frequent bigrams in the source documents. ICSI has been among the best systems in a classical ROUGE evaluation (Hong et al., 2014). Here, the identification of the scoring function is trivial because it was originally formulated as an optimization task. If c_i is the i-th bigram selected in the summary and w_i is its weight computed from D, then:

θ_ICSI(S) = ∑_{c_i∈S} c_i · w_i    (16)

LexRank: (Erkan and Radev, 2004)
This is a well-known graph-based approach. A similarity graph G(V, E) is constructed, where V is the set of sentences and an edge e_ij is drawn between sentences v_i and v_j if and only if the cosine similarity between them is above a given threshold. Sentences are scored according to their PageRank score in G. Thus, θ_LexRank is given by:

θ_LexRank(S) = ∑_{s∈S} PR_G(s)    (17)

Here, PR_G(s) is the PageRank score of sentence s.

KL-Greedy: (Haghighi and Vanderwende, 2009)
In this approach, the summary should minimize the Kullback-Leibler (KL) divergence between the word distribution of the summary S and the word distribution of the documents D (i.e., θ_KL = −KL):

θ_KL(S) = −KL(S||D)    (18)
        = −∑_{g∈S} P_S(g) · log(P_S(g) / P_D(g))    (19)

P_X(g) represents the frequency of the word (or n-gram) g in the text X. The minus sign indicates that KL should be lower for better summaries. Indeed, we expect a good system summary to exhibit a probability distribution of n-grams similar to that of the sources.

Alternatively, the Jensen-Shannon (JS) divergence can be used instead of KL. Let M be the average of the word frequency distributions of the candidate summary S and the source documents D:

∀g ∈ S, P_M(g) = (1/2) · (P_S(g) + P_D(g))    (20)

Then, the formula for JS is given by:

θ_JS(S) = −JS(S||D)    (21)
        = −(1/2) · (KL(S||M) + KL(D||M))    (22)
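A sketch of Eqs. (20)–(22); the handling of differing supports is our own choice for distributions over different vocabularies:

```python
# Sketch of Eqs. (20)-(22): Jensen-Shannon divergence via the mixture M.
import math

def js_divergence(p, q):
    support = set(p) | set(q)
    m = {u: 0.5 * (p.get(u, 0.0) + q.get(u, 0.0)) for u in support}
    def kl(a, b):
        return sum(pa * math.log2(pa / b[u]) for u, pa in a.items() if pa > 0)
    return 0.5 * (kl(p, m) + kl(q, m))  # theta_JS(S) would be -js_divergence
```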


Within our framework, the KL divergence acts as the unification of Relevance and Redundancy when semantic units are bigrams.

B Proof of Theorem 1

Let Ω be the set of semantic units. The notation ω_i represents one unit. Let P_T and P_K be the text representations of the source documents and background knowledge as probability distributions over semantic units.

We note t_i = P_T(ω_i), the probability of the unit ω_i in the source T. Similarly, we note k_i = P_K(ω_i). We seek a function f unifying T and K such that: f(ω_i) = f(t_i, k_i).

We recall the simple requirements that f should satisfy:

• Informativeness: ∀i ≠ j, if t_i = t_j and k_i > k_j, then f(t_i, k_i) < f(t_j, k_j)

• Relevance: ∀i ≠ j, if t_i > t_j and k_i = k_j, then f(t_i, k_i) > f(t_j, k_j)

• Additivity: I(f(t_i, k_i)) ≡ αI(t_i) + βI(k_i) (I is the information measure from Shannon's theory (Shannon, 1948))

• Normalization: ∑_i f(t_i, k_i) = 1

Theorem 1 states that the functions satisfying the previous requirements are:

P_{T/K}(ω_i) = (1/C) · t_i^α / k_i^β,  C = ∑_i t_i^α / k_i^β,  α, β ∈ R+    (23)

with C the normalizing constant.

Proof. The information function defined by Shannon (1948) is the logarithm: I = log. Then, the Additivity criterion can be written:

log(f(t_i, k_i)) = α log(t_i) + β log(k_i) + A    (24)

with A a constant independent of t_i and k_i.

Since log is monotonic and increasing, the Informativeness and Additivity criteria can be combined: ∀i ≠ j, if t_i = t_j and k_i > k_j, then:

log f(t_i, k_i) < log f(t_j, k_j)
α log(t_i) + β log(k_i) < α log(t_j) + β log(k_j)
β log(k_i) < β log(k_j)

But k_i > k_j, therefore:

β < 0

For clarity, we can now use −β with β ∈ R+.

Similarly, we can combine the Relevance and Additivity criteria: ∀i ≠ j, if t_i > t_j and k_i = k_j, then:

log f(t_i, k_i) > log f(t_j, k_j)
α log(t_i) + β log(k_i) > α log(t_j) + β log(k_j)
α log(t_i) > α log(t_j)

But t_i > t_j, therefore:

α > 0

Then, we have the following form from the Additivity criterion:

log f(t_i, k_i) = α log(t_i) − β log(k_i) + A
f(t_i, k_i) = e^A · e^{α log(t_i) − β log(k_i)}
f(t_i, k_i) = e^A · t_i^α / k_i^β

Finally, the Normalization constraint specifies the constant e^A:

C = 1/e^A  and  C = ∑_i t_i^α / k_i^β

then: A = − log(∑_i t_i^α / k_i^β)

C Example

As an example, for one selected topic of the TAC-2008 update track, we computed P_{D/K} and compared it to the distribution of the 4 reference summaries.

We report the two distributions together in Figure 2. For visibility, only the top 50 words according to P_{D/K} are considered. However, we observe a good match between the distribution of the reference summaries and the ideal distribution as defined by P_{D/K}.

Figure 2: Example of P_{D/K} in comparison to the word distribution of reference summaries for one topic of TAC-2008 (D0803).

Furthermore, the most desired words according to P_{D/K} make sense. This can be seen by looking at one of the human-written reference summaries of this topic:

Reference summary for topic D0803:
China sacrificed coal mine safety in its massive demand for energy. Gas explosions, flooding, fires, and cave-ins cause most accidents. The mining industry is riddled with corruption from mining officials to owners. Officials are often illegally invested in mines and ignore safety procedures for production. South Africa recently provided China with information on mining safety and technology during a conference. China is beginning enforcement of safety regulations. Over 12,000 mines have been ordered to suspend operations and 4,000 others ordered closed. This year 4,228 miners were killed in 2,337 coal mine accidents. China's mines are the most dangerous worldwide.