TRANSCRIPT
Summarization
CS276B Web Search and Mining
Lecture 14: Text Mining II
(includes slides borrowed from G. Neumann, M. Venkataramani, R. Altman, L. Hirschman, and D. Radev)
Automated Text Summarization Tutorial — COLING/ACL '98
An exciting challenge...
...put a book on the scanner, turn the dial to '2 pages', and read the result...
...download 1000 documents from the web, send them to the summarizer, and select the best ones by reading the summaries of the clusters...
...forward the Japanese email to the summarizer, select ‘1 par’, and skim the translated summary.
Headline news — informing
TV guides — decision making
Abstracts of papers — time saving
Graphical maps — orienting
Textual directions — planning
Cliff notes — laziness support
Real systems — money making
Questions
What kinds of summaries do people want? What are summarizing, abstracting, gisting,...?
How sophisticated must summarization systems be? Are statistical techniques sufficient, or do we need symbolic techniques and deep understanding as well?
What milestones would mark quantum leaps in summarization theory and practice? How do we measure summarization quality?
What is a Summary?
Informative summary. Purpose: replace the original document. Example: executive summary.
Indicative summary. Purpose: support a decision (do I want to read the original document, yes or no?). Example: headline, scientific abstract.
'Genres' of Summary
Indicative vs. informative: used for quick categorization vs. content processing.
Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
Generic vs. query-oriented: provides the author's view vs. reflects the user's interest.
Background vs. just-the-news: assumes the reader's prior knowledge is poor vs. up-to-date.
Single-document vs. multi-document source: based on one text vs. fuses together many texts.
Examples of Genres
Exercise: summarize the following texts for the following readers:
text 1: Coup Attempt
text 2: children's story
reader 1: your friend, who knows nothing about South Africa.
reader 2: someone who lives in South Africa and knows the political situation.
reader 3: your 4-year-old niece.
reader 4: the Library of Congress.
Why Automatic Summarization?
The algorithm for reading in many domains is: 1) read the summary; 2) decide whether it is relevant or not; 3) if relevant, read the whole document.
Summary is gate-keeper for large number of documents.
Information overload: often the summary is all that is read.
Human-generated summaries are expensive.
Summary Length (Reuters)
Goldstein et al. 1999
Summarization Algorithms
Keyword summaries: display the most significant keywords. Easy to do, but hard to read and a poor representation of content.
Sentence extraction: extract key sentences. Medium hard. The summaries often don't read well, but they are a good representation of content.
Natural language understanding / generation: build a knowledge representation of the text and generate sentences summarizing its content. Hard to do well.
Something between the last two methods?
Sentence Extraction
Represent each sentence as a feature vector. Compute a score based on the features. Select the n highest-ranking sentences and present them in the order in which they occur in the text. Postprocess to make the summary more readable/concise: eliminate redundant sentences; resolve anaphors/pronouns ("A woman walks. She smokes."); delete subordinate clauses and parentheticals.
Sentence Extraction: Example
SIGIR '95 paper on summarization by Kupiec, Pedersen, and Chen
Trainable sentence extraction
The proposed algorithm is applied to its own description (the paper).
Sentence Extraction: Example
Feature Representation
Fixed-phrase feature: certain phrases indicate a summary, e.g. "in summary".
Paragraph feature: paragraph-initial/final sentences are more likely to be important.
Thematic word feature: repetition is an indicator of importance.
Uppercase word feature: uppercase often indicates named entities. (Taylor)
Sentence length cut-off: a summary sentence should be > 5 words.
Feature Representation (cont.)
Sentence length cut-off: summary sentences have a minimum length.
Fixed-phrase feature: true for sentences with an indicator phrase ("in summary", "in conclusion", etc.).
Paragraph feature: paragraph initial/medial/final.
Thematic word feature: do any of the most frequent content words occur?
Uppercase word feature: is an uppercase thematic word introduced?
Training
Hand-label sentences in training set (good/bad summary sentences)
Train classifier to distinguish good/bad summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and show top n to user.
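The Naïve Bayes scoring above can be sketched as follows. The feature probabilities and prior here are made-up illustrative values, not the ones Kupiec et al. estimated from labeled abstracts; only the model form (log-odds as a sum of per-feature terms) follows the slides.

```python
# Naive Bayes sentence scoring in the style of Kupiec et al. (1995).
# All numeric values below are assumed for illustration.
from math import log

P_SUMMARY = 0.1  # assumed prior P(sentence is in the summary)

# (P(feature | summary), P(feature | not summary)) -- assumed values
FEATURES = {
    "fixed_phrase":  (0.30, 0.05),
    "para_initial":  (0.40, 0.15),
    "thematic_word": (0.50, 0.30),
    "long_enough":   (0.95, 0.70),
}

def nb_score(feature_values):
    """Log-odds that a sentence belongs in the summary, given binary features."""
    score = log(P_SUMMARY) - log(1 - P_SUMMARY)
    for name, present in feature_values.items():
        p_s, p_ns = FEATURES[name]
        if not present:                      # use the complement probabilities
            p_s, p_ns = 1 - p_s, 1 - p_ns
        score += log(p_s) - log(p_ns)
    return score

s1 = nb_score({"fixed_phrase": True, "para_initial": True,
               "thematic_word": True, "long_enough": True})
s2 = nb_score({"fixed_phrase": False, "para_initial": False,
               "thematic_word": False, "long_enough": False})
print(s1 > s2)  # a sentence with all indicator features ranks higher
```

Ranking sentences by this score and showing the top n to the user is exactly the extraction step described above.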
Evaluation
Compare extracted sentences with sentences in abstracts
Evaluation of features
Baseline (choose the first n sentences): 24%. Overall performance (42-44%) is not very good; however, there is more than one good summary.
Multi-Document (MD) Summarization
Summarize more than one document. Harder, but the benefit is large (one can't scan hundreds of docs).
To do well, the system needs to adopt a more specific strategy depending on the document set.
Other components are needed for a production system, e.g., manual post-editing.
DUC: a government-sponsored bake-off with 200- or 400-word summaries (longer → easier).
Types of MD Summaries
Single event/person tracked over a long time period (Elizabeth Taylor's bout with pneumonia): give extra weight to the character/event; may need to include the outcome (dates!).
Multiple events of a similar nature (marathon runners and races): more broad-brush; ignore dates.
An issue with related events (gun control): identify key concepts and select sentences accordingly.
Determine MD Summary Type
First, determine which type of summary to generate.
Compute all pairwise similarities.
Very dissimilar articles → multi-event (marathon).
Mostly similar articles: is the most frequent concept a named entity? Yes → single event/person (Taylor). No → issue with related events (gun control).
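The decision procedure above can be sketched directly. The similarity threshold and the bag-of-words cosine are assumptions for illustration; the slides only specify the branching logic.

```python
# Sketch of the MD summary-type decision: compute pairwise document
# similarities, then branch on the average. The 0.2 threshold and the
# named-entity flag are illustrative assumptions.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def md_summary_type(docs, most_frequent_concept_is_entity):
    sims = [cosine(a, b) for i, a in enumerate(docs) for b in docs[i + 1:]]
    avg = sum(sims) / len(sims)
    if avg < 0.2:                       # very dissimilar articles
        return "multi-event"            # e.g. many marathon races
    if most_frequent_concept_is_entity:
        return "single event/person"    # e.g. Taylor's pneumonia
    return "issue with related events"  # e.g. gun control

docs = ["taylor was hospitalized with pneumonia",
        "taylor remained in hospital with pneumonia today"]
print(md_summary_type(docs, most_frequent_concept_is_entity=True))
```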
MultiGen Architecture (Columbia)
Generation
Ordering according to date.
Intersection: find concepts that occur repeatedly in a time chunk.
Sentence generator.
Processing
Selection of good summary sentences.
Elimination of redundant sentences.
Replacement of anaphors/pronouns with the noun phrases they refer to (needs coreference resolution).
Deletion of non-central parts of sentences.
Newsblaster (Columbia)
Query-Specific Summarization
So far, we've looked at generic summaries. A generic summary makes no assumption about the reader's interests. Query-specific summaries are specialized for a single information need, the query. Summarization is much easier if we have a description of what the user wants.
Genre
Some genres are easy to summarize: newswire stories have an inverted-pyramid structure, so the first n sentences are often the best summary of length n. Some genres are hard to summarize: long documents (novels, the Bible); scientific articles?
Trainable summarizers are genre-specific.
Discussion
Correct parsing of the document format is critical: we need to know headings, sequence, etc.
Limits of current technology: some good summaries require natural language understanding. Example: President Bush's nominees for ambassadorships (contributors to Bush's campaign, veteran diplomats, others).
Coreference Resolution
Coreference
Two noun phrases referring to the same entity are said to corefer.
Example: Transcription from RL95-2 is mediated through an ERE element at the 5-flanking region of the gene.
Coreference resolution is important for many text mining tasks: information extraction, summarization, and first story detection.
Types of Coreference
Noun phrases: Transcription from RL95-2 ... the gene ...
Pronouns: They induced apoptosis.
Possessives: ... induces their rapid dissociation ...
Demonstratives: This gene is responsible for Alzheimer's.
Preferences in pronoun interpretation
Recency: John has an Integra. Bill has a Legend. Mary likes to drive it.
Grammatical role: John went to the Acura dealership with Bill. He bought an Integra. (?) John and Bill went to the Acura dealership. He bought an Integra.
Repeated mention: John needed a car to go to his new job. He decided that he wanted something sporty. Bill went to the Acura dealership with him. He bought an Integra.
Preferences in pronoun interpretation
Parallelism: Mary went with Sue to the Acura dealership. Sally went with her to the Mazda dealership. ??? Mary went with Sue to the Acura dealership. Sally told her not to buy anything.
Verb semantics: John telephoned Bill. He lost his pamphlet on Acuras. John criticized Bill. He lost his pamphlet on Acuras.
An algorithm for pronoun resolution
Two steps: discourse model update and pronoun resolution.
Salience values are introduced when a noun phrase that evokes a new entity is encountered.
Salience factors: set empirically.
Salience weights in Lappin and Leass
Sentence recency 100
Subject emphasis 80
Existential emphasis 70
Accusative emphasis 50
Indirect object and oblique complement emphasis 40
Non-adverbial emphasis 50
Head noun emphasis 80
Lappin and Leass (cont’d)
Recency: weights are cut in half after each sentence is processed.
Examples:
An Acura Integra is parked in the lot. (subject)
There is an Acura Integra parked in the lot. (existential predicate nominal)
John parked an Acura Integra in the lot. (object)
John gave Susan an Acura Integra. (indirect object)
In his Acura Integra, John showed Susan his new CD player. (demarcated adverbial PP)
Algorithm
1. Collect the potential referents (up to four sentences back).
2. Remove potential referents that do not agree in number or gender with the pronoun.
3. Remove potential referents that do not pass intrasentential syntactic coreference constraints.
4. Compute the total salience value of the referent by adding any applicable values for role parallelism (+35) or cataphora (-175).
5. Select the referent with the highest salience value. In case of a tie, select the closest referent in terms of string position.
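A toy version of this procedure, using the salience weights from the table above. The agreement check and the syntactic filters are reduced to simple attributes, and the candidate feature assignments are invented for the example; only the weights, the per-sentence halving, the +35 parallelism bonus, and the closest-referent tie-break follow the slides.

```python
# Toy Lappin & Leass-style pronoun resolution. Weights follow the
# lecture's table; everything about the candidate encoding is an
# assumption made for this sketch.
WEIGHTS = {"recency": 100, "subject": 80, "existential": 70,
           "accusative": 50, "indirect_object": 40,
           "non_adverbial": 50, "head_noun": 80}

def salience(referent):
    base = sum(WEIGHTS[f] for f in referent["factors"])
    return base / (2 ** referent["sentences_back"])  # halved per sentence

def resolve(pronoun, candidates):
    # Step 2: keep candidates that agree in number and gender.
    ok = [c for c in candidates
          if c["gender"] == pronoun["gender"] and c["number"] == pronoun["number"]]
    scored = []
    for c in ok:
        s = salience(c)
        if c.get("parallel_role"):
            s += 35  # role parallelism bonus
        # Tie-break on position: the later (closer) referent wins.
        scored.append((s, c["position"], c["name"]))
    return max(scored)[2]

candidates = [
    {"name": "John", "gender": "m", "number": "sg", "position": 0,
     "sentences_back": 1,
     "factors": ["recency", "subject", "non_adverbial", "head_noun"]},
    {"name": "Bill", "gender": "m", "number": "sg", "position": 1,
     "sentences_back": 1,
     "factors": ["recency", "non_adverbial", "head_noun"]},
]
he = {"gender": "m", "number": "sg"}
print(resolve(he, candidates))  # subject emphasis favors "John"
```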
Observations
Lappin & Leass: tested on computer manuals; 86% accuracy on unseen data.
Another well known theory is Centering (Grosz, Joshi, Weinstein), which has an additional concept of a “center”. (More of a theoretical model; less empirical confirmation.)
Making Sense of it All...
To understand summarization, it helps to consider several perspectives simultaneously:
1. Approaches: basic starting point, angle of attack, core focus question(s): psycholinguistics, text linguistics, computation...
2. Paradigms: theoretical stance; methodological preferences: rules, statistics, NLP, Info Retrieval, AI...
3. Methods: the nuts and bolts: modules, algorithms, processing: word frequency, sentence position, concept generalization...
Psycholinguistic Approach: 2 Studies
Coarse-grained summarization protocols from professional summarizers (Kintsch and van Dijk, 78):
Delete material that is trivial or redundant. Use superordinate concepts and actions. Select or invent a topic sentence.
552 fine-grained summarization strategies from professional summarizers (Endres-Niggemeyer, 98):
Self control: make yourself feel comfortable.
Processing: produce a unit as soon as you have enough data.
Info organization: use the "Discussion" section to check results.
Content selection: the table of contents is relevant.
Computational Approach: Basics
Top-down: "I know what I want! Don't confuse me with drivel!"
User needs: only certain types of info.
System needs: particular criteria of interest, used to focus the search.
Bottom-up: "I'm dead curious: what's in the text?"
User needs: anything that's important.
System needs: generic importance metrics, used to rate content.
Query-Driven vs. Text-Driven Focus
Top-down: query-driven focus. Criteria of interest are encoded as search specs, which the system uses to filter or analyze text portions. Examples: templates with slots with semantic characteristics; term lists of important terms.
Bottom-up: text-driven focus. Generic importance metrics are encoded as strategies, which the system applies over a representation of the whole text. Examples: degree of connectedness in semantic graphs; frequency of occurrence of tokens.
Bottom-Up, using Info. Retrieval
IR task: Given a query, find the relevant document(s) from a large set of documents.
Summ-IR task: Given a query, find the relevant passage(s) from a set of passages (i.e., from one or more documents).
Questions: 1. IR techniques work on large volumes of data; can they scale down accurately enough? 2. IR works on words; do abstracts require abstract representations?
Top-Down, using Info. Extraction
IE task: Given a template and a text, find all the information relevant to each slot of the template and fill it in.
Summ-IE task: Given a query, select the best template, fill it in, and generate the contents.
Questions: 1. IE works only for very particular templates; can it scale up? 2. What about information that doesn't fit into any template? Is this a generic limitation of IE?
Paradigms: NLP/IE vs. IR/Statistics
NLP/IE:
Approach: try to 'understand' the text; re-represent content using a 'deeper' notation, then manipulate that.
Need: rules for text analysis and manipulation, at all levels.
Strengths: higher quality; supports abstracting.
Weaknesses: speed; still needs to scale up to robust open-domain summarization.
IR/Statistics:
Approach: operate at the lexical level; use word frequency, collocation counts, etc.
Need: large amounts of text.
Strengths: robust; good for query-oriented summaries.
Weaknesses: lower quality; inability to manipulate information at abstract levels.
Toward the Final Answer...
Problem: what if neither IR-like nor IE-like methods work?
Solution: semantic analysis of the text (NLP), using adequate knowledge bases that support inference (AI).
Mrs. Coolidge: "What did the preacher preach about?"
Coolidge: "Sin."
Mrs. Coolidge: "What did he say?"
Coolidge: "He's against it."
Word counting vs. inference: sometimes counting and templates are insufficient, and then you need to do inference to understand.
The Optimal Solution...
Combine strengths of both paradigms…
...use IE/NLP when you have suitable template(s),
...use IR when you don’t…
…but how exactly to do it?
A Summarization Machine
[Diagram: a summarization machine takes a DOC (or MULTIDOCS) plus a QUERY; control dials select extract vs. abstract, indicative vs. informative, generic vs. query-oriented, background vs. just-the-news, and length (headline, very brief 10%, brief 50%, long 100%); outputs are EXTRACTS, ABSTRACTS, and intermediate representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms).]
The Modules of the Summarization Machine
[Diagram: the document enters EXTRACTION, which produces extracts; INTERPRETATION maps extracts to representations such as case frames, templates, core concepts, core events, relationships, clause fragments, and index terms; GENERATION produces abstracts from these; FILTERING reduces multi-doc extracts to extracts.]
Summarization methods (& exercise). Topic Extraction. Interpretation. Generation.
Overview of Extraction Methods
Position in the text: lead method; optimal position policy; title/heading method.
Cue phrases in sentences.
Word frequencies throughout the text.
Cohesion (links among words): word co-occurrence; coreference; lexical chains.
Discourse structure of the text.
Information extraction: parsing and analysis.
Position-Based Method (1)
Claim: important sentences occur at the beginning (and/or end) of texts.
Lead method: just take the first sentence(s)!
Experiments: in 85% of 200 individual paragraphs the topic sentences occurred in initial position, and in 7% in final position (Baxendale, 58). But only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).
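The lead method is trivial to state as code, which is part of its appeal: it is the baseline that trained extractors must beat on news text. The sentence splitter here is a simple regex, an assumption standing in for a real tokenizer.

```python
# The lead method: the summary is just the first n sentences.
import re

def lead_summary(text, n=2):
    # naive sentence split: break after ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

text = ("The president vetoed the bill. Congress plans an override vote. "
        "Markets were unmoved. Analysts expect a long fight.")
print(lead_summary(text, n=2))
# -> "The president vetoed the bill. Congress plans an override vote."
```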
Optimum Position Policy (OPP)
Claim: important sentences are located at positions that are genre-dependent; these positions can be determined automatically through training (Lin and Hovy, 97). Corpus: 13,000 newspaper articles (ZIFF corpus).
Step 1: for each article, determine the overlap between sentences and the index terms for the article.
Step 2: determine a partial ordering over the locations where sentences containing important words occur: the Optimal Position Policy (OPP).
Title-Based Method (1)
Claim: words in titles and headings are positively relevant to summarization.
Shown to be statistically valid at 99% level of significance (Edmundson, 68).
Empirically shown to be useful in summarization systems.
Cue-Phrase Method (1)
Claim 1: important sentences contain 'bonus phrases', such as significantly, In this paper we show, and In conclusion, while non-important sentences contain 'stigma phrases' such as hardly and impossible.
Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97).
Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase.
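The scoring method above reduces to a few lines. The phrase lists and the unit weights are illustrative assumptions; real systems learn which phrases matter and by how much.

```python
# Cue-phrase scoring: reward bonus phrases, penalize stigma phrases.
# Phrase lists and +/-1 weights are assumptions for this sketch.
BONUS = {"in summary", "in conclusion", "significantly",
         "in this paper we show"}
STIGMA = {"hardly", "impossible"}

def cue_score(sentence):
    s = sentence.lower()
    score = sum(1 for p in BONUS if p in s)
    score -= sum(1 for p in STIGMA if p in s)
    return score

print(cue_score("In conclusion, the method works."))   # 1
print(cue_score("It is hardly a complete solution."))  # -1
```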
Word-frequency-based method (1)
Claim: Important sentences contain words that occur “somewhat” frequently.
Method: Increase sentence score for each frequent word.
Evaluation: Straightforward approach empirically shown to be mostly detrimental in summarization systems.
[Figure: word frequency vs. "the resolving power of words" (Luhn, 59): mid-frequency words resolve content best.]
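A Luhn-style scorer can be sketched as follows: content words above a frequency cutoff are "significant", and a sentence scores by how many significant words it contains. The stopword list and the cutoff of 2 are assumptions for this toy example.

```python
# Luhn-style word-frequency scoring: pick the sentence containing the
# most "significant" (frequent, non-stopword) words.
from collections import Counter
import re

STOP = {"the", "a", "of", "and", "to", "is", "in", "it", "were"}  # assumed

def luhn_rank(text, min_freq=2):
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if w not in STOP)
    significant = {w for w, c in freq.items() if c >= min_freq}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # score = number of significant-word tokens; ties go to the earlier sentence
    return max(sentences,
               key=lambda s: sum(1 for w in re.findall(r"[a-z]+", s.lower())
                                 if w in significant))

text = ("The pump delivers anaesthetic. The computer controls the pump. "
        "Testing impressed regulators. The pump and the computer were licensed.")
print(luhn_rank(text))  # "pump" and "computer" are the frequent content words
```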
Cohesion-based methods
Claim: Important sentences/paragraphs are the highest connected entities in more or less elaborate semantic structures.
Classes of approaches: word co-occurrences; local salience and grammatical relations; co-reference; lexical similarity (WordNet, lexical chains); combinations of the above.
Cohesion: Word Co-occurrence (1)
Apply IR methods at the document level: texts are collections of paragraphs (Salton et al., 94; Mitra et al., 97; Buckley and Cardie, 97). Use a traditional, IR-based word similarity measure to determine, for each paragraph Pi, the set Si of paragraphs that Pi is related to.
Method: determine a relatedness score Si for each paragraph, and extract the paragraphs with the largest Si scores.
[Figure: a graph of paragraphs P1-P9 linked by similarity; highly connected paragraphs are extracted.]
Word Co-occurrence Method (2)
In the context of query-based summarization:
Cornell's SMART-based approach: expand the original query; compare the expanded query against paragraphs; select the top three paragraphs (max 25% of the original) that are most similar to the original query. (SUMMAC, 98): 71.9% F-score for relevance judgment.
CGI/CMU approach: maximize query-relevance while minimizing redundancy with previous information. (SUMMAC, 98): 73.4% F-score for relevance judgment.
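The CGI/CMU idea of trading off query-relevance against redundancy is maximal marginal relevance (MMR). A minimal sketch, assuming bag-of-words cosine similarity and a lambda of 0.7 (both illustrative choices, not SUMMAC's actual configuration):

```python
# Sketch of maximal marginal relevance (MMR): greedily pick the passage
# most similar to the query but least similar to passages already chosen.
from collections import Counter
from math import sqrt

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query, passages, k=2, lam=0.7):
    selected, pool = [], list(passages)
    while pool and len(selected) < k:
        def mmr(p):
            redundancy = max((cosine(p, s) for s in selected), default=0.0)
            return lam * cosine(p, query) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

passages = ["the bill cuts taxes on fuel",
            "the bill cuts taxes on fuel sharply",   # near-duplicate
            "opponents say the bill hurts schools"]
picked = mmr_select("tax bill", passages, k=2)
print(picked)  # the near-duplicate is skipped in favor of new information
```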
Cohesion: Local Salience Method
Assumes that important phrasal expressions are given by a combination of grammatical, syntactic, and contextual parameters (Boguraev and Kennedy, 97):
CNTX: 50 iff the expression is in the current discourse segment
SUBJ: 80 iff the expression is a subject
EXST: 70 iff the expression is an existential construction
ACC: 50 iff the expression is a direct object
HEAD: 80 iff the expression is not contained in another phrase
ARG: 50 iff the expression is not contained in an adjunct
Cohesion: Lexical chains method (1)
But Mr. Kenny's move speeded up work on a machine which uses micro-computers to control the rate at which an anaesthetic is pumped into the blood of patients undergoing surgery. Such machines are nothing new. But Mr. Kenny's device uses two personal computers to achieve much closer monitoring of the pump feeding the anaesthetic into the patient. Extensive testing of the equipment has sufficiently impressed the authorities which regulate medical equipment in Britain, and, so far, four other countries, to make this the first such machine to be licensed for commercial sale to hospitals.
Based on (Morris and Hirst, 91)
Lexical chains-based method (2)
Assumes that important sentences are those that are 'traversed' by strong chains (Barzilay and Elhadad, 97): Strength(C) = length(C) - #DistinctOccurrences(C). For each chain, choose the first sentence that is traversed by the chain and that uses a representative set of concepts from that chain.
Evaluation on the [Jing et al., 98] corpus:
              LC algorithm        Lead-based algorithm
              Recall   Prec       Recall   Prec
10% cutoff    67%      61%        82.9%    63.4%
20% cutoff    64%      47%        70.9%    46.9%
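The chain-strength formula from the slide is easy to verify on a toy example. Representing a chain as a flat list of its word occurrences (an assumption of this sketch), a chain that repeats a few related words often is stronger than one that mentions each word once:

```python
# Strength(C) = length(C) - #DistinctOccurrences(C)
def strength(chain):
    return len(chain) - len(set(chain))

machine_chain = ["machine", "machines", "device", "equipment",
                 "machine", "equipment"]
misc_chain = ["pump", "blood", "patient"]
print(strength(machine_chain))  # 6 - 4 = 2
print(strength(misc_chain))     # 3 - 3 = 0
```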
Cohesion: Coreference Method
(In the context of query-based summarization.) Build co-reference chains (noun/event identity, part-whole relations) between: query and document; title and document; sentences within the document.
Important sentences are those traversed by a large number of chains; a preference is imposed on chains (query > title > doc).
Evaluation: 67% F-score for relevance (SUMMAC, 98). (Baldwin and Morton, 98)
Cohesion: Connectedness Method (1)
Map texts into graphs: the nodes of the graph are the words of the text; arcs represent adjacency, grammatical, co-reference, and lexical similarity-based relations.
Associate importance scores with words (and sentences) by applying the tf.idf metric.
Assume that important words/sentences are those with the highest scores. (Mani and Bloedorn, 97)
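A reduced version of the tf.idf scoring step: score each word against a background collection, score a sentence by the sum over its words, and extract the top sentence. The graph relations are omitted, and the background documents and tokenization are made up for this sketch.

```python
# Toy tf.idf sentence scoring (the graph-connectedness part is omitted).
from collections import Counter
from math import log
import re

background = ["stocks fell on profit warnings",
              "rain is expected across the region",
              "the team won the championship game"]  # assumed collection

def tfidf_scores(doc, collection):
    words = re.findall(r"[a-z]+", doc.lower())
    tf = Counter(words)
    n = len(collection) + 1
    def idf(w):
        df = 1 + sum(1 for d in collection if w in d.lower().split())
        return log(n / df)
    return {w: tf[w] * idf(w) for w in tf}

def top_sentence(doc, collection):
    scores = tfidf_scores(doc, collection)
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    return max(sentences,
               key=lambda s: sum(scores.get(w, 0.0)
                                 for w in re.findall(r"[a-z]+", s.lower())))

doc = "The anaesthetic pump was licensed. The weather was mild."
print(top_sentence(doc, background))  # rare words outweigh common ones
```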
Information Extraction Method (1)
Idea: content selection using templates. Predefine a template whose slots specify what is of interest. Use a canonical IE system to extract the relevant information from a (set of) document(s) and fill the template. Generate the content of the template as the summary.
Previous IE work:
FRUMP (DeJong, 78): 'sketchy scripts' of terrorism, natural disasters, political visits...
(Mauldin, 91): templates for conceptual IR.
(Rau and Jacobs, 91): templates for business.
(McKeown and Radev, 95): templates for news.
Information Extraction method (2) Example template:
MESSAGE:ID          TSL-COL-0001
SECSOURCE:SOURCE    Reuters
SECSOURCE:DATE      26 Feb 93, early afternoon
INCIDENT:DATE       26 Feb 93
INCIDENT:LOCATION   World Trade Center
INCIDENT:TYPE       Bombing
HUM TGT:NUMBER      AT LEAST 5
IE State of the Art
MUC conferences (1988-97): test IE systems on a series of domains (Navy sub-language (89), terrorism (92), business (96), ...); create increasingly complex templates; evaluate systems using two measures:
Recall: how many slots did the system actually fill, out of the total number it should have filled?
Precision: how correct were the slots that it filled?
           1989   1992   1996
Recall     63.9   71.5   67.1
Precision  87.4   84.2   78.3
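The two measures can be stated directly (toy slot counts, not the official MUC scorer, which also handles partial credit):

```python
# Slot-filling recall and precision as defined on the slide.
def recall(filled_correct, total_expected):
    return filled_correct / total_expected

def precision(filled_correct, filled_total):
    return filled_correct / filled_total

# e.g. the system filled 80 slots, 67 of them correctly, out of 100 expected
print(round(recall(67, 100), 2), round(precision(67, 80), 2))
```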
Review of Methods
Bottom-up methods:
Text location: title, position
Cue phrases
Word frequencies
Internal text cohesion: word co-occurrences; local salience; co-reference of names, objects; lexical similarity; semantic rep/graph centrality
Discourse structure centrality
Top-down methods:
Information extraction templates
Query-driven extraction: query expansion lists; co-reference with query names; lexical similarity to query