TRANSCRIPT
Summarization
CS276B Web Search and Mining
Lecture 14: Text Mining II
(includes slides borrowed from G. Neumann, M. Venkataramani, R. Altman, L. Hirschman, and D. Radev)
Automated Text Summarization Tutorial — COLING/ACL '98
An exciting challenge...
...put a book on the scanner, turn the dial to '2 pages', and read the result...
...download 1000 documents from the web, send them to the summarizer, and select the best ones by reading the summaries of the clusters...
...forward the Japanese email to the summarizer, select ‘1 par’, and skim the translated summary.
Headline news — informing
TV guides — decision making
Abstracts of papers — time saving
Graphical maps — orienting
Textual directions — planning
Cliff notes — laziness support
Real systems — money making
Questions
What kinds of summaries do people want? What are summarizing, abstracting, gisting,...?
How sophisticated must summarization systems be? Are statistical techniques sufficient, or do we need symbolic techniques and deep understanding as well?
What milestones would mark quantum leaps in summarization theory and practice? How do we measure summarization quality?
What is a Summary?
Informative summary. Purpose: replace the original document. Example: executive summary.
Indicative summary. Purpose: support a decision (do I want to read the original document, yes or no?). Example: headline, scientific abstract.
'Genres' of Summary
Indicative vs. informative: used for quick categorization vs. content processing.
Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
Generic vs. query-oriented: provides the author's view vs. reflects the user's interest.
Background vs. just-the-news: assumes the reader's prior knowledge is poor vs. up-to-date.
Single-document vs. multi-document source: based on one text vs. fuses together many texts.
Examples of Genres
Exercise: summarize the following texts for the following readers:
text 1: Coup Attempt
text 2: children's story
reader 1: your friend, who knows nothing about South Africa.
reader 2: someone who lives in South Africa and knows the political situation.
reader 3: your 4-year-old niece.
reader 4: the Library of Congress.
Why Automatic Summarization?
The algorithm for reading in many domains is: 1) read the summary; 2) decide whether it is relevant or not; 3) if relevant, read the whole document.
Summary is gate-keeper for large number of documents.
Information overload: often the summary is all that is read.
Human-generated summaries are expensive.
Summary Length (Reuters)
Goldstein et al. 1999
Summarization Algorithms
Keyword summaries: display the most significant keywords. Easy to do, but hard to read and a poor representation of content.
Sentence extraction: extract key sentences. Medium hard. The summaries often don't read well, but they are a good representation of content.
Natural language understanding / generation: build a knowledge representation of the text and generate sentences summarizing its content. Hard to do well.
Something between the last two methods?
Sentence Extraction
Represent each sentence as a feature vector. Compute a score based on the features. Select the n highest-ranking sentences and present them in the order in which they occur in the text. Postprocess to make the summary more readable/concise: eliminate redundant sentences; resolve anaphors/pronouns ("A woman walks. She smokes."); delete subordinate clauses and parentheticals.
Sentence Extraction: Example
SIGIR '95 paper on summarization by Kupiec, Pedersen, and Chen
Trainable sentence extraction
The proposed algorithm is applied to its own description (the paper).
Sentence Extraction: Example
Feature Representation
Fixed-phrase feature: certain phrases indicate a summary, e.g. "in summary".
Paragraph feature: paragraph-initial/final sentences are more likely to be important.
Thematic word feature: repetition is an indicator of importance.
Uppercase word feature: uppercase often indicates named entities. (Taylor)
Sentence length cut-off: a summary sentence should be > 5 words.
Feature Representation (cont.)
Sentence length cut-off: summary sentences have a minimum length.
Fixed-phrase feature: true for sentences with an indicator phrase ("in summary", "in conclusion", etc.).
Paragraph feature: paragraph initial/medial/final.
Thematic word feature: do any of the most frequent content words occur?
Uppercase word feature: is an uppercase thematic word introduced?
Training
Hand-label sentences in training set (good/bad summary sentences)
Train classifier to distinguish good/bad summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and show top n to user.
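The Naïve Bayes scoring above can be sketched as follows. The feature probabilities and prior here are made-up illustrative values, not the ones Kupiec et al. estimated from labeled abstracts; only the model form (log-odds as a sum of per-feature terms) follows the slides.

```python
# Naive Bayes sentence scoring in the style of Kupiec et al. (1995).
# All numeric values below are assumed for illustration.
from math import log

P_SUMMARY = 0.1  # assumed prior P(sentence is in the summary)

# (P(feature | summary), P(feature | not summary)) -- assumed values
FEATURES = {
    "fixed_phrase":  (0.30, 0.05),
    "para_initial":  (0.40, 0.15),
    "thematic_word": (0.50, 0.30),
    "long_enough":   (0.95, 0.70),
}

def nb_score(feature_values):
    """Log-odds that a sentence belongs in the summary, given binary features."""
    score = log(P_SUMMARY) - log(1 - P_SUMMARY)
    for name, present in feature_values.items():
        p_s, p_ns = FEATURES[name]
        if not present:                      # use the complement probabilities
            p_s, p_ns = 1 - p_s, 1 - p_ns
        score += log(p_s) - log(p_ns)
    return score

s1 = nb_score({"fixed_phrase": True, "para_initial": True,
               "thematic_word": True, "long_enough": True})
s2 = nb_score({"fixed_phrase": False, "para_initial": False,
               "thematic_word": False, "long_enough": False})
print(s1 > s2)  # a sentence with all indicator features ranks higher
```

Ranking sentences by this score and showing the top n to the user is exactly the extraction step described above.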
Evaluation
Compare extracted sentences with sentences in abstracts
Evaluation of features
Baseline (choose the first n sentences): 24%. Overall performance (42-44%) is not very good; however, there is more than one good summary.
Multi-Document (MD) Summarization
Summarize more than one document. Harder, but the benefit is large (one can't scan hundreds of docs).
To do well, the system needs to adopt a more specific strategy depending on the document set.
Other components are needed for a production system, e.g., manual post-editing.
DUC: a government-sponsored bake-off with 200- or 400-word summaries (longer → easier).
Types of MD Summaries
Single event/person tracked over a long time period (Elizabeth Taylor's bout with pneumonia): give extra weight to the character/event; may need to include the outcome (dates!).
Multiple events of a similar nature (marathon runners and races): more broad-brush; ignore dates.
An issue with related events (gun control): identify key concepts and select sentences accordingly.
Determine MD Summary Type
First, determine which type of summary to generate.
Compute all pairwise similarities.
Very dissimilar articles → multi-event (marathon).
Mostly similar articles: is the most frequent concept a named entity? Yes → single event/person (Taylor). No → issue with related events (gun control).
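The decision procedure above can be sketched directly. The similarity threshold and the bag-of-words cosine are assumptions for illustration; the slides only specify the branching logic.

```python
# Sketch of the MD summary-type decision: compute pairwise document
# similarities, then branch on the average. The 0.2 threshold and the
# named-entity flag are illustrative assumptions.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def md_summary_type(docs, most_frequent_concept_is_entity):
    sims = [cosine(a, b) for i, a in enumerate(docs) for b in docs[i + 1:]]
    avg = sum(sims) / len(sims)
    if avg < 0.2:                       # very dissimilar articles
        return "multi-event"            # e.g. many marathon races
    if most_frequent_concept_is_entity:
        return "single event/person"    # e.g. Taylor's pneumonia
    return "issue with related events"  # e.g. gun control

docs = ["taylor was hospitalized with pneumonia",
        "taylor remained in hospital with pneumonia today"]
print(md_summary_type(docs, most_frequent_concept_is_entity=True))
```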
MultiGen Architecture (Columbia)
Generation
Ordering according to date.
Intersection: find concepts that occur repeatedly in a time chunk.
Sentence generator.
Processing
Selection of good summary sentences.
Elimination of redundant sentences.
Replacement of anaphors/pronouns with the noun phrases they refer to (needs coreference resolution).
Deletion of non-central parts of sentences.
Newsblaster (Columbia)
Query-Specific Summarization
So far, we've looked at generic summaries. A generic summary makes no assumption about the reader's interests. Query-specific summaries are specialized for a single information need, the query. Summarization is much easier if we have a description of what the user wants.
Genre
Some genres are easy to summarize: newswire stories have an inverted-pyramid structure, so the first n sentences are often the best summary of length n. Some genres are hard to summarize: long documents (novels, the Bible); scientific articles?
Trainable summarizers are genre-specific.
Discussion
Correct parsing of the document format is critical: we need to know headings, sequence, etc.
Limits of current technology: some good summaries require natural language understanding. Example: President Bush's nominees for ambassadorships (contributors to Bush's campaign, veteran diplomats, others).
Coreference Resolution
Coreference
Two noun phrases referring to the same entity are said to corefer.
Example: Transcription from RL95-2 is mediated through an ERE element at the 5-flanking region of the gene.
Coreference resolution is important for many text mining tasks: information extraction, summarization, and first story detection.
Types of Coreference
Noun phrases: Transcription from RL95-2 ... the gene ...
Pronouns: They induced apoptosis.
Possessives: ... induces their rapid dissociation ...
Demonstratives: This gene is responsible for Alzheimer's.
Preferences in pronoun interpretation
Recency: John has an Integra. Bill has a Legend. Mary likes to drive it.
Grammatical role: John went to the Acura dealership with Bill. He bought an Integra. (?) John and Bill went to the Acura dealership. He bought an Integra.
Repeated mention: John needed a car to go to his new job. He decided that he wanted something sporty. Bill went to the Acura dealership with him. He bought an Integra.
Preferences in pronoun interpretation
Parallelism: Mary went with Sue to the Acura dealership. Sally went with her to the Mazda dealership. ??? Mary went with Sue to the Acura dealership. Sally told her not to buy anything.
Verb semantics: John telephoned Bill. He lost his pamphlet on Acuras. John criticized Bill. He lost his pamphlet on Acuras.
An algorithm for pronoun resolution
Two steps: discourse model update and pronoun resolution.
Salience values are introduced when a noun phrase that evokes a new entity is encountered.
Salience factors: set empirically.
Salience weights in Lappin and Leass
Sentence recency 100
Subject emphasis 80
Existential emphasis 70
Accusative emphasis 50
Indirect object and oblique complement emphasis 40
Non-adverbial emphasis 50
Head noun emphasis 80
Lappin and Leass (cont’d)
Recency: weights are cut in half after each sentence is processed.
Examples:
An Acura Integra is parked in the lot. (subject)
There is an Acura Integra parked in the lot. (existential predicate nominal)
John parked an Acura Integra in the lot. (object)
John gave Susan an Acura Integra. (indirect object)
In his Acura Integra, John showed Susan his new CD player. (demarcated adverbial PP)
Algorithm
1. Collect the potential referents (up to four sentences back).
2. Remove potential referents that do not agree in number or gender with the pronoun.
3. Remove potential referents that do not pass intrasentential syntactic coreference constraints.
4. Compute the total salience value of the referent by adding any applicable values for role parallelism (+35) or cataphora (-175).
5. Select the referent with the highest salience value. In case of a tie, select the closest referent in terms of string position.
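A toy version of this procedure, using the salience weights from the table above. The agreement check and the syntactic filters are reduced to simple attributes, and the candidate feature assignments are invented for the example; only the weights, the per-sentence halving, the +35 parallelism bonus, and the closest-referent tie-break follow the slides.

```python
# Toy Lappin & Leass-style pronoun resolution. Weights follow the
# lecture's table; everything about the candidate encoding is an
# assumption made for this sketch.
WEIGHTS = {"recency": 100, "subject": 80, "existential": 70,
           "accusative": 50, "indirect_object": 40,
           "non_adverbial": 50, "head_noun": 80}

def salience(referent):
    base = sum(WEIGHTS[f] for f in referent["factors"])
    return base / (2 ** referent["sentences_back"])  # halved per sentence

def resolve(pronoun, candidates):
    # Step 2: keep candidates that agree in number and gender.
    ok = [c for c in candidates
          if c["gender"] == pronoun["gender"] and c["number"] == pronoun["number"]]
    scored = []
    for c in ok:
        s = salience(c)
        if c.get("parallel_role"):
            s += 35  # role parallelism bonus
        # Tie-break on position: the later (closer) referent wins.
        scored.append((s, c["position"], c["name"]))
    return max(scored)[2]

candidates = [
    {"name": "John", "gender": "m", "number": "sg", "position": 0,
     "sentences_back": 1,
     "factors": ["recency", "subject", "non_adverbial", "head_noun"]},
    {"name": "Bill", "gender": "m", "number": "sg", "position": 1,
     "sentences_back": 1,
     "factors": ["recency", "non_adverbial", "head_noun"]},
]
he = {"gender": "m", "number": "sg"}
print(resolve(he, candidates))  # subject emphasis favors "John"
```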
Observations
Lappin & Leass: tested on computer manuals; 86% accuracy on unseen data.
Another well known theory is Centering (Grosz, Joshi, Weinstein), which has an additional concept of a “center”. (More of a theoretical model; less empirical confirmation.)
Making Sense of it All...
To understand summarization, it helps to consider several perspectives simultaneously:
1. Approaches: basic starting point, angle of attack, core focus question(s): psycholinguistics, text linguistics, computation...
2. Paradigms: theoretical stance; methodological preferences: rules, statistics, NLP, Info Retrieval, AI...
3. Methods: the nuts and bolts: modules, algorithms, processing: word frequency, sentence position, concept generalization...
Psycholinguistic Approach: 2 Studies
Coarse-grained summarization protocols from professional summarizers (Kintsch and van Dijk, 78):
Delete material that is trivial or redundant. Use superordinate concepts and actions. Select or invent a topic sentence.
552 fine-grained summarization strategies from professional summarizers (Endres-Niggemeyer, 98):
Self control: make yourself feel comfortable.
Processing: produce a unit as soon as you have enough data.
Info organization: use the "Discussion" section to check results.
Content selection: the table of contents is relevant.
Computational Approach: Basics
Top-down: "I know what I want! Don't confuse me with drivel!"
User needs: only certain types of info.
System needs: particular criteria of interest, used to focus the search.
Bottom-up: "I'm dead curious: what's in the text?"
User needs: anything that's important.
System needs: generic importance metrics, used to rate content.
Query-Driven vs. Text-Driven Focus
Top-down: query-driven focus. Criteria of interest are encoded as search specs, which the system uses to filter or analyze text portions. Examples: templates with slots with semantic characteristics; term lists of important terms.
Bottom-up: text-driven focus. Generic importance metrics are encoded as strategies, which the system applies over a representation of the whole text. Examples: degree of connectedness in semantic graphs; frequency of occurrence of tokens.
Bottom-Up, using Info. Retrieval
IR task: Given a query, find the relevant document(s) from a large set of documents.
Summ-IR task: Given a query, find the relevant passage(s) from a set of passages (i.e., from one or more documents).
Questions: 1. IR techniques work on large volumes of data; can they scale down accurately enough? 2. IR works on words; do abstracts require abstract representations?
Top-Down, using Info. Extraction
IE task: Given a template and a text, find all the information relevant to each slot of the template and fill it in.
Summ-IE task: Given a query, select the best template, fill it in, and generate the contents.
Questions: 1. IE works only for very particular templates; can it scale up? 2. What about information that doesn't fit into any template? Is this a generic limitation of IE?
Paradigms: NLP/IE vs. IR/Statistics
NLP/IE:
Approach: try to 'understand' the text; re-represent content using a 'deeper' notation, then manipulate that.
Need: rules for text analysis and manipulation, at all levels.
Strengths: higher quality; supports abstracting.
Weaknesses: speed; still needs to scale up to robust open-domain summarization.
IR/Statistics:
Approach: operate at the lexical level; use word frequency, collocation counts, etc.
Need: large amounts of text.
Strengths: robust; good for query-oriented summaries.
Weaknesses: lower quality; inability to manipulate information at abstract levels.
Toward the Final Answer...
Problem: what if neither IR-like nor IE-like methods work?
Solution: semantic analysis of the text (NLP), using adequate knowledge bases that support inference (AI).
Mrs. Coolidge: "What did the preacher preach about?"
Coolidge: "Sin."
Mrs. Coolidge: "What did he say?"
Coolidge: "He's against it."
Word counting vs. inference: sometimes counting and templates are insufficient, and then you need to do inference to understand.
The Optimal Solution...
Combine strengths of both paradigms…
...use IE/NLP when you have suitable template(s),
...use IR when you don’t…
…but how exactly to do it?
A Summarization Machine
[Diagram: a summarization machine takes a DOC (or MULTIDOCS) plus a QUERY; control dials select extract vs. abstract, indicative vs. informative, generic vs. query-oriented, background vs. just-the-news, and length (headline, very brief 10%, brief 50%, long 100%); outputs are EXTRACTS, ABSTRACTS, and intermediate representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms).]
The Modules of the Summarization Machine
[Diagram: the document enters EXTRACTION, which produces extracts; INTERPRETATION maps extracts to representations such as case frames, templates, core concepts, core events, relationships, clause fragments, and index terms; GENERATION produces abstracts from these; FILTERING reduces multi-doc extracts to extracts.]
Summarization methods (& exercise). Topic Extraction. Interpretation. Generation.
Overview of Extraction Methods
Position in the text: lead method; optimal position policy; title/heading method.
Cue phrases in sentences.
Word frequencies throughout the text.
Cohesion (links among words): word co-occurrence; coreference; lexical chains.
Discourse structure of the text.
Information extraction: parsing and analysis.
Position-Based Method (1)
Claim: important sentences occur at the beginning (and/or end) of texts.
Lead method: just take the first sentence(s)!
Experiments: in 85% of 200 individual paragraphs the topic sentences occurred in initial position, and in 7% in final position (Baxendale, 58). But only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).
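The lead method is trivial to state as code, which is part of its appeal: it is the baseline that trained extractors must beat on news text. The sentence splitter here is a simple regex, an assumption standing in for a real tokenizer.

```python
# The lead method: the summary is just the first n sentences.
import re

def lead_summary(text, n=2):
    # naive sentence split: break after ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

text = ("The president vetoed the bill. Congress plans an override vote. "
        "Markets were unmoved. Analysts expect a long fight.")
print(lead_summary(text, n=2))
# -> "The president vetoed the bill. Congress plans an override vote."
```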
Optimum Position Policy (OPP)
Claim: important sentences are located at positions that are genre-dependent; these positions can be determined automatically through training (Lin and Hovy, 97). Corpus: 13,000 newspaper articles (ZIFF corpus).
Step 1: for each article, determine the overlap between sentences and the index terms for the article.
Step 2: determine a partial ordering over the locations where sentences containing important words occur: the Optimal Position Policy (OPP).
Title-Based Method (1)
Claim: words in titles and headings are positively relevant to summarization.
Shown to be statistically valid at 99% level of significance (Edmundson, 68).
Empirically shown to be useful in summarization systems.
Cue-Phrase Method (1)
Claim 1: important sentences contain 'bonus phrases', such as significantly, In this paper we show, and In conclusion, while non-important sentences contain 'stigma phrases' such as hardly and impossible.
Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97).
Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase.
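The scoring method above reduces to a few lines. The phrase lists and the unit weights are illustrative assumptions; real systems learn which phrases matter and by how much.

```python
# Cue-phrase scoring: reward bonus phrases, penalize stigma phrases.
# Phrase lists and +/-1 weights are assumptions for this sketch.
BONUS = {"in summary", "in conclusion", "significantly",
         "in this paper we show"}
STIGMA = {"hardly", "impossible"}

def cue_score(sentence):
    s = sentence.lower()
    score = sum(1 for p in BONUS if p in s)
    score -= sum(1 for p in STIGMA if p in s)
    return score

print(cue_score("In conclusion, the method works."))   # 1
print(cue_score("It is hardly a complete solution."))  # -1
```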
Word-frequency-based method (1)
Claim: Important sentences contain words that occur “somewhat” frequently.
Method: Increase sentence score for each frequent word.
Evaluation: Straightforward approach empirically shown to be mostly detrimental in summarization systems.
[Figure: word frequency vs. "the resolving power of words" (Luhn, 59): mid-frequency words resolve content best.]
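A Luhn-style scorer can be sketched as follows: content words above a frequency cutoff are "significant", and a sentence scores by how many significant words it contains. The stopword list and the cutoff of 2 are assumptions for this toy example.

```python
# Luhn-style word-frequency scoring: pick the sentence containing the
# most "significant" (frequent, non-stopword) words.
from collections import Counter
import re

STOP = {"the", "a", "of", "and", "to", "is", "in", "it", "were"}  # assumed

def luhn_rank(text, min_freq=2):
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if w not in STOP)
    significant = {w for w, c in freq.items() if c >= min_freq}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # score = number of significant-word tokens; ties go to the earlier sentence
    return max(sentences,
               key=lambda s: sum(1 for w in re.findall(r"[a-z]+", s.lower())
                                 if w in significant))

text = ("The pump delivers anaesthetic. The computer controls the pump. "
        "Testing impressed regulators. The pump and the computer were licensed.")
print(luhn_rank(text))  # "pump" and "computer" are the frequent content words
```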
Cohesion-based methods
Claim: Important sentences/paragraphs are the highest connected entities in more or less elaborate semantic structures.
Classes of approaches: word co-occurrences; local salience and grammatical relations; co-reference; lexical similarity (WordNet, lexical chains); combinations of the above.
Cohesion: Word Co-occurrence (1)
Apply IR methods at the document level: texts are collections of paragraphs (Salton et al., 94; Mitra et al., 97; Buckley and Cardie, 97). Use a traditional, IR-based word similarity measure to determine, for each paragraph Pi, the set Si of paragraphs that Pi is related to.
Method: determine a relatedness score Si for each paragraph, and extract the paragraphs with the largest Si scores.
[Figure: a graph of paragraphs P1-P9 linked by similarity; highly connected paragraphs are extracted.]
Word Co-occurrence Method (2)
In the context of query-based summarization:
Cornell's SMART-based approach: expand the original query; compare the expanded query against paragraphs; select the top three paragraphs (max 25% of the original) that are most similar to the original query. (SUMMAC, 98): 71.9% F-score for relevance judgment.
CGI/CMU approach: maximize query-relevance while minimizing redundancy with previous information. (SUMMAC, 98): 73.4% F-score for relevance judgment.
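The CGI/CMU idea of trading off query-relevance against redundancy is maximal marginal relevance (MMR). A minimal sketch, assuming bag-of-words cosine similarity and a lambda of 0.7 (both illustrative choices, not SUMMAC's actual configuration):

```python
# Sketch of maximal marginal relevance (MMR): greedily pick the passage
# most similar to the query but least similar to passages already chosen.
from collections import Counter
from math import sqrt

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query, passages, k=2, lam=0.7):
    selected, pool = [], list(passages)
    while pool and len(selected) < k:
        def mmr(p):
            redundancy = max((cosine(p, s) for s in selected), default=0.0)
            return lam * cosine(p, query) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

passages = ["the bill cuts taxes on fuel",
            "the bill cuts taxes on fuel sharply",   # near-duplicate
            "opponents say the bill hurts schools"]
picked = mmr_select("tax bill", passages, k=2)
print(picked)  # the near-duplicate is skipped in favor of new information
```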
Cohesion: Local Salience Method
Assumes that important phrasal expressions are given by a combination of grammatical, syntactic, and contextual parameters (Boguraev and Kennedy, 97):
CNTX: 50 iff the expression is in the current discourse segment
SUBJ: 80 iff the expression is a subject
EXST: 70 iff the expression is an existential construction
ACC: 50 iff the expression is a direct object
HEAD: 80 iff the expression is not contained in another phrase
ARG: 50 iff the expression is not contained in an adjunct
Cohesion: Lexical chains method (1)
But Mr. Kenny's move speeded up work on a machine which uses micro-computers to control the rate at which an anaesthetic is pumped into the blood of patients undergoing surgery. Such machines are nothing new. But Mr. Kenny's device uses two personal computers to achieve much closer monitoring of the pump feeding the anaesthetic into the patient. Extensive testing of the equipment has sufficiently impressed the authorities which regulate medical equipment in Britain, and, so far, four other countries, to make this the first such machine to be licensed for commercial sale to hospitals.
Based on (Morris and Hirst, 91)
Lexical chains-based method (2)
Assumes that important sentences are those that are 'traversed' by strong chains (Barzilay and Elhadad, 97): Strength(C) = length(C) - #DistinctOccurrences(C). For each chain, choose the first sentence that is traversed by the chain and that uses a representative set of concepts from that chain.
Evaluation on the [Jing et al., 98] corpus:
              LC algorithm        Lead-based algorithm
              Recall   Prec       Recall   Prec
10% cutoff    67%      61%        82.9%    63.4%
20% cutoff    64%      47%        70.9%    46.9%
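The chain-strength formula from the slide is easy to verify on a toy example. Representing a chain as a flat list of its word occurrences (an assumption of this sketch), a chain that repeats a few related words often is stronger than one that mentions each word once:

```python
# Strength(C) = length(C) - #DistinctOccurrences(C)
def strength(chain):
    return len(chain) - len(set(chain))

machine_chain = ["machine", "machines", "device", "equipment",
                 "machine", "equipment"]
misc_chain = ["pump", "blood", "patient"]
print(strength(machine_chain))  # 6 - 4 = 2
print(strength(misc_chain))     # 3 - 3 = 0
```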
Cohesion: Coreference Method
(In the context of query-based summarization.) Build co-reference chains (noun/event identity, part-whole relations) between: query and document; title and document; sentences within the document.
Important sentences are those traversed by a large number of chains; a preference is imposed on chains (query > title > doc).
Evaluation: 67% F-score for relevance (SUMMAC, 98). (Baldwin and Morton, 98)
Cohesion: Connectedness Method (1)
Map texts into graphs: the nodes of the graph are the words of the text; arcs represent adjacency, grammatical, co-reference, and lexical similarity-based relations.
Associate importance scores with words (and sentences) by applying the tf.idf metric.
Assume that important words/sentences are those with the highest scores. (Mani and Bloedorn, 97)
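A reduced version of the tf.idf scoring step: score each word against a background collection, score a sentence by the sum over its words, and extract the top sentence. The graph relations are omitted, and the background documents and tokenization are made up for this sketch.

```python
# Toy tf.idf sentence scoring (the graph-connectedness part is omitted).
from collections import Counter
from math import log
import re

background = ["stocks fell on profit warnings",
              "rain is expected across the region",
              "the team won the championship game"]  # assumed collection

def tfidf_scores(doc, collection):
    words = re.findall(r"[a-z]+", doc.lower())
    tf = Counter(words)
    n = len(collection) + 1
    def idf(w):
        df = 1 + sum(1 for d in collection if w in d.lower().split())
        return log(n / df)
    return {w: tf[w] * idf(w) for w in tf}

def top_sentence(doc, collection):
    scores = tfidf_scores(doc, collection)
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    return max(sentences,
               key=lambda s: sum(scores.get(w, 0.0)
                                 for w in re.findall(r"[a-z]+", s.lower())))

doc = "The anaesthetic pump was licensed. The weather was mild."
print(top_sentence(doc, background))  # rare words outweigh common ones
```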
Information Extraction Method (1)
Idea: content selection using templates. Predefine a template whose slots specify what is of interest. Use a canonical IE system to extract the relevant information from a (set of) document(s) and fill the template. Generate the content of the template as the summary.
Previous IE work:
FRUMP (DeJong, 78): 'sketchy scripts' of terrorism, natural disasters, political visits...
(Mauldin, 91): templates for conceptual IR.
(Rau and Jacobs, 91): templates for business.
(McKeown and Radev, 95): templates for news.
Information Extraction method (2) Example template:
MESSAGE:ID          TSL-COL-0001
SECSOURCE:SOURCE    Reuters
SECSOURCE:DATE      26 Feb 93, early afternoon
INCIDENT:DATE       26 Feb 93
INCIDENT:LOCATION   World Trade Center
INCIDENT:TYPE       Bombing
HUM TGT:NUMBER      AT LEAST 5
IE State of the Art
MUC conferences (1988-97): test IE systems on a series of domains (Navy sub-language (89), terrorism (92), business (96), ...); create increasingly complex templates; evaluate systems using two measures:
Recall: how many slots did the system actually fill, out of the total number it should have filled?
Precision: how correct were the slots that it filled?
           1989   1992   1996
Recall     63.9   71.5   67.1
Precision  87.4   84.2   78.3
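The two measures can be stated directly (toy slot counts, not the official MUC scorer, which also handles partial credit):

```python
# Slot-filling recall and precision as defined on the slide.
def recall(filled_correct, total_expected):
    return filled_correct / total_expected

def precision(filled_correct, filled_total):
    return filled_correct / filled_total

# e.g. the system filled 80 slots, 67 of them correctly, out of 100 expected
print(round(recall(67, 100), 2), round(precision(67, 80), 2))
```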
Review of Methods
Bottom-up methods:
Text location: title, position
Cue phrases
Word frequencies
Internal text cohesion: word co-occurrences; local salience; co-reference of names, objects; lexical similarity; semantic rep/graph centrality
Discourse structure centrality
Top-down methods:
Information extraction templates
Query-driven extraction: query expansion lists; co-reference with query names; lexical similarity to query