
Predicting sentence specificity, with applications to news summarization

Ani Nenkova, joint work with Annie Louis

University of Pennsylvania

Motivation

- A well-written text is a mix of general statements and sentences providing details
- Information retrieval: find relevant and well-written documents
- Writing support: visualize general and specific areas of a text

Supervised sentence-level classifier for general vs. specific sentences

- Training data: existing discourse relation annotations from the PDTB
- Features: lexical, language model, syntactic, and others
- Testing data: additional sentences judged by annotators
- Application to analysis of summarization output: automatic summaries are too specific, and their quality is worse for it

Training data: Penn Discourse Treebank (PDTB)

- Largest annotated corpus of explicit and implicit discourse relations
- 1 million words of Wall Street Journal text

- Arguments: the spans linked by a relation (Arg1, Arg2)
- Sense: the semantics of the relation (a 3-level hierarchy)
- Explicit relations are signaled by discourse connectives: "I love ice-cream but I hate chocolates."
- Implicit relations hold between adjacent sentences in the same paragraph: "I came late. I missed the train."

[Chart: distribution of relation types between adjacent sentences. EntRel marks adjacent sentences linked only by an entity, which is not considered a true discourse relation.]


Training data from PDTB Expansions

Expansion
- Conjunction [Also, Further]
- Restatement [Specifically, Overall]
  - Specification
  - Equivalence
  - Generalization
- Instantiation [For example]
- List [And]
- Alternative [Or, Instead]
  - Conjunctive
  - Disjunctive
  - Chosen alternative
- Exception [except]

Instantiation example

The 40-year-old Mr. Murakami is a publishing sensation in Japan.

A more recent novel, "Norwegian Wood", has sold more than forty million copies since Kodansha published it in 1987.

Examples of general/specific sentences

General: Despite recent declines in yields, investors continue to pour cash into money funds.
Specific: Assets of the 400 taxable funds grew by $1.5 billion during the latest week, to $352 billion. [Instantiation]

General: By most measures, the nation's industrial sector is now growing very slowly—if at all.
Specific: Factory payrolls fell in September. [Specification]
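To make the construction of the training data concrete, here is a minimal sketch (not the authors' code) of how labeled sentences could be derived from implicit Instantiation and Specification relations, with Arg1 treated as general and Arg2 as specific. The relation record format and the function name are hypothetical stand-ins for however the PDTB is loaded.

```python
# Hypothetical PDTB-style records: dicts with relation type, sense, and argument texts.
def make_training_pairs(relations, senses=("Instantiation", "Specification")):
    """Label Arg1 as 'general' and Arg2 as 'specific' for the chosen senses."""
    examples = []
    for rel in relations:
        # keep only implicit relations of the senses used here
        if rel["type"] != "Implicit":
            continue
        if not any(s in rel["sense"] for s in senses):
            continue
        examples.append((rel["arg1"], "general"))
        examples.append((rel["arg2"], "specific"))
    return examples

# Example with a toy record
relations = [
    {"type": "Implicit",
     "sense": "Expansion.Restatement.Specification",
     "arg1": "By most measures, the nation's industrial sector is now growing very slowly.",
     "arg2": "Factory payrolls fell in September."},
]
print(make_training_pairs(relations))
```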

Experimental setup: two classifiers

- Instantiation-based: Arg1 labeled general, Arg2 labeled specific; 1,403 examples
- Specification-based (Restatement.Specification): Arg1 labeled general, Arg2 labeled specific; 2,370 examples
- Implicit relations only; 50% baseline accuracy; 10-fold cross-validation; logistic regression
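The setup above (logistic regression, 10-fold cross-validation, a 50% baseline) can be sketched as follows. scikit-learn is an assumption on my part; the slides do not name a toolkit. The sketch uses only the "words" feature set (per-word counts).

```python
# Minimal sketch of the classification setup: logistic regression with
# 10-fold cross-validation on word-count features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def evaluate(sentences, labels):
    """10-fold CV accuracy of a words-only general/specific classifier."""
    model = make_pipeline(
        CountVectorizer(),               # the 'words' feature set: count of each word
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, sentences, labels, cv=10)
    return scores.mean()

# sentences, labels = zip(*make_training_pairs(relations))  # from the sketch above
# print(evaluate(list(sentences), list(labels)))
```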

Features

Developed from a small development set: 10 pairs of Specification and 10 pairs of Instantiation.

Features for general vs. specific:
- Sentence length: number of tokens, number of nouns (general sentences were expected to be shorter)
- Polarity: number of positive/negative/polarity words, also normalized by length, using the General Inquirer and the MPQA subjectivity lexicon (in the dev set, sentences with strong opinion are general)
- Language models: unigram/bigram/trigram probability and perplexity, trained on one year of New York Times news (in the dev set, general sentences contained unexpected, catchy phrases)

Features for general vs. specific (continued):
- Specificity: min/max/avg IDF; WordNet hypernym distance to the root for nouns and verbs (min/max/avg)
- Syntax: number of adjectives, adverbs, ADJPs, ADVPs, and verb phrases; average VP length
- Entities: numbers, proper names, the $ sign, plural nouns
- Words: count of each word in the sentence
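A sketch of a few of these feature groups (length, entities, IDF statistics, WordNet hypernym depth) is below. NLTK is an assumption, not necessarily the authors' tooling, and `idf` is a hypothetical precomputed {word: idf} dictionary.

```python
# Requires the NLTK data packages punkt, averaged_perceptron_tagger, and wordnet.
import nltk
from nltk.corpus import wordnet as wn

def sentence_features(sentence, idf):
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    feats = {}

    # Sentence length features
    feats["n_tokens"] = len(tokens)
    feats["n_nouns"] = sum(1 for _, t in tags if t.startswith("NN"))

    # Entity-style features
    feats["n_numbers"] = sum(1 for _, t in tags if t == "CD")
    feats["n_proper"] = sum(1 for _, t in tags if t in ("NNP", "NNPS"))
    feats["n_plural"] = sum(1 for _, t in tags if t in ("NNS", "NNPS"))
    feats["has_dollar"] = int("$" in tokens)

    # IDF statistics over tokens found in the (hypothetical) idf table
    vals = [idf[w.lower()] for w in tokens if w.lower() in idf]
    if vals:
        feats["idf_min"], feats["idf_max"] = min(vals), max(vals)
        feats["idf_avg"] = sum(vals) / len(vals)

    # WordNet hypernym distance to the root (synset depth) for nouns and verbs
    depths = []
    for word, tag in tags:
        pos = wn.NOUN if tag.startswith("NN") else wn.VERB if tag.startswith("VB") else None
        if pos:
            synsets = wn.synsets(word, pos=pos)
            if synsets:
                depths.append(synsets[0].min_depth())
    if depths:
        feats["wn_depth_min"], feats["wn_depth_max"] = min(depths), max(depths)
        feats["wn_depth_avg"] = sum(depths) / len(depths)
    return feats
```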

Accuracy of the general/specific classifier using Instantiations

[Bar chart: accuracy (50-80%) for each feature set: verbs, sentence length, polarity, syntax, specificity, language model, entities, words, all, all except words.]

Best: 76% accuracy

Accuracy of the general/specific classifier using Specifications

[Bar chart: accuracy (50-65%) for each feature set: polarity, verbs, language model, entities, sentence length, specificity, syntax, words, all, all except words.]

Best: 60% accuracy

The Instantiation-based classifier gave better performance:
- Best individual feature set: words (74.8%)
- Non-lexical features are equally good: 74.1%
- No improvement from combining: 75.8%

Feature analysis: words with highest weight [Instantiation-based classifier]

- General: number, but, also, however, officials, some, what, lot, prices, business, were, ...
- Specific: one, a, to, co, I, called, we, could, get, ...

General sentences are characterized by plural nouns, the dollar sign, lower language-model probability, more polarity words, and more adjectives and adverbs.

Specific sentences are characterized by numbers and names.

More testing data

- Direct judgments of WSJ and AP sentences on Amazon Mechanical Turk
- ~600 sentences, 5 judgments per sentence

Agree     WSJ total  WSJ general  WSJ specific  AP total  AP general  AP specific
5                96           51            45       108          33           75
4               102           57            45        91          35           56
3                95           52            43        88          49           39
Total           294          160           133       292         117          170

In WSJ, more sentences are general (55%); in AP, more sentences are specific (60%).

Why the difference between Instantiation and Specification?

Some of the Mechanical Turk annotations covered sentences from our initial training data:

Instantiation (32)
        General  Specific
Arg1         29         3
Arg2          6        26

Specification (16)
        General  Specific
Arg1         10         6
Arg2          8         8

Instantiation has more detectable general/specific properties associated with Arg1 and Arg2; for Specification the split is much noisier.

Accuracy of the classifier on the new data

Examples      All features  Non-lexical  Words      All features  Non-lexical  Words
5 agree               90.6         96.8   84.3              69.4         94.4   78.7
4+5 agree             80.8         88.8   77.7              65.8         89.9   74.8
All                   73.7         76.7   71.6              59.2         81.1   67.5

- Non-lexical features work better on this data
- Performance is almost the same as in cross-validation
- The classifier is more accurate on examples where people agree
- Classifier confidence correlates with annotator agreement

Application of our classifier to full articles

- Distribution of general/specific sentences in news documents
- Can the classifier detect differences between general and specific summaries written by people?
- Do summaries have more general or specific content than their inputs? How does this impact summary quality?

Compare different types of summaries:
- Human abstracts: written from scratch
- Human extracts: whole sentences selected from the inputs
- System summaries: all extracts

Example general and specific predictions

Seismologists said the volcano had plenty of built-up magma and even more severe eruptions could come later. [general]

The volcano's activity -- measured by seismometers detecting slight earthquakes in its molten rock plumbing system -- is increasing in a way that suggests a large eruption is imminent, Lipman said. [specific]

Example predictions

Specific: The novel, a story of a Scottish low-life narrated largely in Glaswegian dialect, is unlikely to prove a popular choice with booksellers who have damned all six books shortlisted for the prize as boring, elitist and – worst of all – unsaleable.

General: ...The Booker prize has, in its 26-year history, always provoked controversy.

Computing specificity for a text

Sentences in a summary vary in length, so we compute the score at the word level: the average specificity of the words in the text.

[Diagram: each sentence S1, S2, S3 receives the classifier's confidence for the specific class (e.g., 0.68, 0.23, 0.81); that confidence is assigned to every token of the sentence, and the specificity score of the text is the average score over all tokens.]
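A minimal sketch of this scoring scheme follows: each token inherits its sentence's classifier confidence for the "specific" class, and the text score is the average over all tokens. How the confidences are produced is left out; the values in the example are made up.

```python
import nltk  # requires the punkt tokenizer data

def text_specificity(sentences, specific_confidences):
    """Average per-token specificity, weighting sentences by their length."""
    total, n_tokens = 0.0, 0
    for sent, conf in zip(sentences, specific_confidences):
        tokens = nltk.word_tokenize(sent)
        total += conf * len(tokens)      # every token gets the sentence confidence
        n_tokens += len(tokens)
    return total / n_tokens if n_tokens else 0.0

# Example with made-up confidences
sents = ["Factory payrolls fell in September.",
         "The economy is showing signs of weakness."]
print(text_specificity(sents, [0.81, 0.23]))
```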

Specificity scores for 50 general and specific human summaries

Text        General category   Specific category
Summaries               0.55                0.63
Inputs                  0.63                0.65

- No significant difference in the specificity of the inputs
- Significant difference in the specificity of the summaries in the two categories
- Our classifier is able to detect the difference

Data: DUC 2002

- Generic multi-document summarization task
- 59 input sets, each with 5 to 15 news documents
- 3 types of 200-word summaries, with manually assigned content and linguistic quality scores:
  1. Human abstracts (2 assessors x 59 inputs)
  2. Human extracts (2 assessors x 59 inputs)
  3. System extracts (9 systems x 59 inputs)

Specificity analysis of summaries

1. More general content is preferred in abstracts
2. The process of extraction alone makes summaries more specific
3. System summaries are overly specific

[Histogram of specificity scores; average specificity: inputs 0.65, human abstracts 0.62, human extracts 0.72, system extracts 0.74.]

Human summaries are more general. Is this aspect related to summary quality?

Analysis of system summaries: specificity and quality

1. Content quality: importance of the content included in the summary
2. Linguistic quality: how well-written the summary is perceived to be
3. Quality of general/specific summaries: when a summary is intended to be general or specific

Relationship to content selection scores

- Coverage score: closeness to a human summary, compared at the clause level
- For system summaries, the correlation between coverage score and average specificity is -0.16*, p-value = 0.0006
- Less specific ~ better content, but the correlation is not very high
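The correlation reported above could be computed as in the sketch below, using scipy's Pearson correlation; the score lists are hypothetical placeholders for the per-summary coverage scores and average specificity values.

```python
from scipy.stats import pearsonr

coverage_scores = [0.31, 0.12, 0.45, 0.27, 0.20]   # hypothetical values
avg_specificity = [0.70, 0.78, 0.66, 0.74, 0.81]   # hypothetical values

r, p_value = pearsonr(coverage_scores, avg_specificity)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```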

Specificity is related to the realization of content, which is different from the importance of the content. Content quality = content importance + appropriate specificity level. Content importance is measured by ROUGE scores: the n-gram overlap between a system summary and human summaries, the standard evaluation of automatic summaries.

Specificity as one of the predictors

Linear regression: coverage score ~ ROUGE-2 (bigrams) + specificity

Weights for the predictors in the regression model:

Predictor      Mean β    Significance (hypothesis β = 0)
(Intercept)     0.212    2.3e-11
ROUGE-2         1.299    < 2.0e-16
Specificity    -0.166    3.1e-05

Is the combination a better predictor than ROUGE alone?
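A sketch of such a regression is below, fit with statsmodels' formula API (my choice; the slides do not name a toolkit). The data frame values are hypothetical placeholders for per-summary scores.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "coverage":    [0.31, 0.12, 0.45, 0.27, 0.20, 0.38],   # hypothetical
    "rouge2":      [0.08, 0.03, 0.11, 0.07, 0.05, 0.10],   # hypothetical
    "specificity": [0.70, 0.78, 0.66, 0.74, 0.81, 0.69],   # hypothetical
})

# coverage ~ ROUGE-2 + specificity, as on the slide
model = smf.ols("coverage ~ rouge2 + specificity", data=df).fit()
print(model.params)    # intercept and per-predictor weights (β)
print(model.pvalues)   # significance of each β against the null hypothesis β = 0
```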

2. Specificity and linguistic quality

- Used different data: TAC 2009. DUC 2002 only reported the number of errors, and only as a range (1-5 errors)
- TAC 2009 linguistic quality score: manually judged on a 1-10 scale; combines different aspects (coherence, referential clarity, grammaticality, redundancy)

What is the average specificity in different score categories?

Ling. score     No. summaries   Average specificity
Poor (1, 2)               202                  0.71
Mediocre (5)              400                  0.72
Best (9, 10)               79                  0.77

More general ~ lower score! General content is useful, but it needs the proper context.

If a summary starts as follows:
"We are quite a ways from that, actually." As ice and snow at the poles melt, ...
then specificity is low and linguistic quality is 1.

Data for analysing the generalization operation

- Aligned pairs of abstract and source sentences conveying the same content; traditionally used for compression experiments
- Ziff-Davis tree alignment corpus [Galley & McKeown (2007)]: 15,964 sentence pairs, allowing any number of deletions and up to 7 substitutions
- Only 25% of abstract sentences are mapped, but the data is still useful for observing trends

Generalization operation in human abstracts

Transition (source -> abstract)   No. pairs   % pairs
S -> S                                 6371      39.9
S -> G                                 5679      35.6
G -> G                                 3562      22.3
G -> S                                  352       2.2

About one-third of all transformations are specific to general: human abstracts involve a lot of generalization.
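The transition counts above could be tallied as in this sketch, assuming each aligned (source, abstract) sentence pair has already been labeled general ('G') or specific ('S') by the classifier; the pair list is a hypothetical stand-in for the Ziff-Davis alignments.

```python
from collections import Counter

def count_transitions(labeled_pairs):
    """labeled_pairs: iterable of (source_label, abstract_label), each 'G' or 'S'."""
    counts = Counter(f"{src}->{abs_}" for src, abs_ in labeled_pairs)
    total = sum(counts.values())
    return {t: (n, 100.0 * n / total) for t, n in counts.items()}

# Toy example
pairs = [("S", "S"), ("S", "G"), ("G", "G"), ("S", "G")]
print(count_transitions(pairs))   # e.g. {'S->G': (2, 50.0), ...}
```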

How do specific sentences get converted to general ones?

Transition   Orig. length   New/orig. length (%)   Avg. deletions (words)
S -> G               33.5                   40.8                     21.4
S -> S               33.4                   56.6                     16.3
G -> G               21.5                   60.8                      9.3
G -> S               22.7                   66.0                      8.4

Choose long sentences and compress them heavily!

A measure of generality would be useful to guide compression; currently only importance and grammaticality are used.

Use of general sentences in human extracts

Examples of general sentences selected by people:
- Details of Maxwell's death were sketchy.
- Folksy was an understatement.
- "Long live democracy!"
- Instead it sank like the Bismarck.

Example use of a general sentence in a summary:
...With Tower's qualifications for the job, the nominations should have sailed through with flying colors. [Specific]
Instead it sank like the Bismarck. [General]

Future: can we learn to generate and select general sentences to include in automatic summaries?

Conclusions

- Built a classifier for general and specific sentences, using existing annotations, but tested on new data and in a task-based evaluation
- The confidence of the classifier is highly correlated with human agreement
- Analyzed human and machine summaries: machine summaries are too specific, but adding general sentences is difficult because the context has to be right

Further details in:

- Annie Louis and Ani Nenkova. Automatic Identification of General and Specific Sentences by Leveraging Discourse Annotations. Proceedings of IJCNLP, 2011 (to appear).
- Annie Louis and Ani Nenkova. Text Specificity and Impact on Quality of News Summaries. Proceedings of the ACL-HLT Workshop on Monolingual Text-to-Text Generation, 2011.
- Annie Louis and Ani Nenkova. Creating Local Coherence: An Empirical Assessment. Proceedings of NAACL-HLT, 2010.

Two types of local coherence: entity and rhetorical

- Local coherence: adjacent sentences in a text flow from one to another
- Entity coherence: the sentences stay on the same topic ("John was hungry. He went to a restaurant.")
- But only 42% of sentence pairs are entity-linked [previous corpus studies]
- Will core discourse relations connect the non-entity-sharing sentence pairs? A popular hypothesis in prior work

Investigations into text quality

- The mix of discourse relations in a text is highly predictive of the perceived quality of the text
- Both implicit and explicit relations are needed to predict text quality
- Predicting the sense of implicit discourse relations is a very difficult task; most are predicted to be "expansion"

How is local coherence created?

- Joint analysis combining PDTB and OntoNotes annotations over 590 articles, with noun phrase coreference from OntoNotes
- 40 to 50% of sentence pairs do not share entities, in articles of different lengths
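The entity-sharing statistic above could be computed roughly as in this sketch, which represents each coreference chain simply as the set of sentence indices in which the entity is mentioned; this is a hypothetical simplification of the OntoNotes annotation.

```python
def entity_sharing_rate(n_sentences, coref_chains):
    """Fraction of adjacent sentence pairs that share at least one entity.
    coref_chains: list of chains; each chain is a set of sentence indices
    in which some entity is mentioned."""
    shared = set()
    for chain in coref_chains:
        for i in chain:
            if i + 1 in chain:          # the entity appears in two adjacent sentences
                shared.add(i)
    n_pairs = max(n_sentences - 1, 0)
    return len(shared) / n_pairs if n_pairs else 0.0

# Toy example: 5 sentences, one entity in sentences 0 and 1, another in sentence 3 only
print(entity_sharing_rate(5, [{0, 1}, {3}]))   # -> 0.25
```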

[Charts: Expansions cover most of the non-entity-sharing instances; Expansions have the lowest rate of coreference; rate of coreference in 2nd-level elaboration relations.]

Example Instantiation and List relations

Instantiation:
The economy is showing signs of weakness, particularly among manufacturers.
Exports, which played a key role in fueling growth over the last two years, seem to have stalled.

List:
Many of Nasdaq's biggest technology stocks were in the forefront of the rally.
- Microsoft added 2 1/8 to 81 3/4 and Oracle Systems rose 1 1/2 to 23 1/4.
- Intel was up 1 3/8 to 33 3/4.

Overall distribution of sentence pairs among the two coherence devices

- 30% of sentence pairs have no coreference and are in a weak discourse relation (Expansion/EntRel)
- We must explore elaboration more closely to identify how these relations create coherence
