A New Approach to Unsupervised Text Summarization
Agenda
Introduction
The Approach
Diversity-Based Summarization
Test Data and Evaluation Procedure
Results and Discussion
Conclusion and Future Work
Introduction
Supervised
Supervised methods typically make use of human-made summaries or extracts to find features or parameters of summarization algorithms.
Problem: the human-made summaries have to be reliable enough.
Unsupervised
Unsupervised methods determine relevant parameters without regard to human-made summaries.
Introduction (cont’d)
Validity?
Introduction (cont’d)
Experiment
A large group of university students was asked to identify the 10% of sentences in a text (drawn from various domains in a newspaper corpus) that they believed to be most important.
The reported result was a rather modest 25% agreement among their choices.
Problem
1. Reliability
2. Portability
The Approach
Evaluating summaries
Not in terms of how well they match human-made extracts.
Not in terms of how much time it takes humans to make relevance judgments on them.
But in terms of how well they represent source documents in usual IR tasks such as document retrieval and text categorization.
The Approach (cont’d)
Extraction
Extracts lack fluency and cohesion.
But humans are able to perform as well reading 20%-30% extracts as reading the original full text.
Diversity-Based Summarization
Problem
Which sentences are the most important ones, that is, the ones that can represent the text?
Katz made the important observation that the numbers of occurrences of content words in a document do not depend on the document's length.
In other words, the per-document frequencies of individual content words do not grow proportionally with the length of a document.
Diversity-Based Summarization (cont’d)
Two important properties of text
1. Redundancy – how repetitive the concepts are.
2. Diversity – how many different concepts are in the text.
Much of the prior work focuses on redundancy; little of it takes up the problem of diversity.
MMR (maximal marginal relevance)
Diversity-Based Summarization (cont’d)
Method
1. Find Diversity – find diverse topic areas in the text.
2. Reduce-Redundancy – from each topic area, identify the most important sentence and take that sentence as a representative of the area.
A summary is then the set of sentences generated by Reduce-Redundancy.
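The two steps can be sketched as a small pipeline. This is only an illustrative outline: the function name `diversity_based_summary` and its two callable parameters are hypothetical, and the clustering and weighting functions are toy stand-ins, not the methods used in the presented system.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def diversity_based_summary(
    sentences: Sequence[str],
    find_diversity: Callable[[Sequence[str]], Sequence[int]],
    weight: Callable[[str], float],
) -> List[str]:
    """Return one representative sentence per topic area, in document order."""
    # Step 1 (Find Diversity): one topic-area label per sentence.
    labels = find_diversity(sentences)
    areas: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        areas[label].append(index)
    # Step 2 (Reduce-Redundancy): keep the highest-weighted sentence of each area.
    chosen = [
        max(members, key=lambda i: weight(sentences[i]))
        for members in areas.values()
    ]
    return [sentences[i] for i in sorted(chosen)]


# Toy stand-ins (hypothetical): two topic areas, weight = sentence length.
toy = ["cats purr", "cats purr loudly at night", "dogs bark", "dogs bark at strangers"]
toy_areas = lambda ss: [0 if s.startswith("cats") else 1 for s in ss]
print(diversity_based_summary(toy, toy_areas, len))
```

Passing the clustering and weighting steps in as functions keeps the outline independent of how topic areas are found or how sentences are scored.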
Diversity-Based Summarization (cont’d)
Find Diversity
Built upon the K-means clustering algorithm, extended with a Minimum Description Length (MDL) version of X-means.
X-means is an extension of K-means with the added functionality of estimating K, which otherwise has to be supplied by the user.
Diversity-Based Summarization (cont’d)
$\mu_j$ – the coordinates of the centroid with index $j$.
$x_i$ – the coordinates of the $i$-th data point.
$(i)$ – the index of the centroid closest to data point $i$.
Ex. $\mu(j)$ denotes the centroid associated with data point $j$.
$c_i$ – the cluster with index $i$.
Diversity-Based Summarization (cont’d)
K-means
A hard clustering algorithm that partitions the input data points into K disjoint subsets.
It starts with some randomly chosen initial points; a bad choice of initial centers can have adverse effects on clustering performance.
The best solution is the one that minimizes distortion.
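A minimal sketch of standard K-means (Lloyd's iteration) makes the role of the random initial points concrete. The function names, the `seed` parameter and the fixed iteration cap are choices made for this illustration and do not come from the slides.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 100,
           seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Partition points into k disjoint clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Randomly chosen initial centers: a poor draw can lead to a poor result.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels: Optional[List[int]] = None
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, new_labels) if label == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, new_labels


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(data, k=2))
```

Note that `k` is an explicit argument here: the algorithm has no way to work it out on its own.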
Diversity-Based Summarization (cont’d)
Define distortion as the averaged sum of squares of the Euclidean distances between the objects of a cluster and its centroid.
For some clustering solution $S = \{c_1, \ldots, c_k\}$, its distortion is
$$V(S) = \sum_{i=1}^{k} \frac{1}{|c_i|} \sum_{x_j \in c_i} \lVert x_j - \mu(i) \rVert^2$$
where
$c_i$ - a cluster
$x_j$ - an object in $c_i$
$\mu(i)$ - the centroid of $c_i$
$|\cdot|$ - the cardinality function
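The definition can be turned into a few lines of code. The sketch below reads "averaged sum of squares" as a per-cluster average that is then summed over clusters; that reading, and the function names, are assumptions of this example.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(members: List[Point]) -> List[float]:
    """Coordinate-wise mean of the objects in one cluster."""
    return [sum(coords) / len(members) for coords in zip(*members)]


def distortion(clusters: List[List[Point]]) -> float:
    """Sum over clusters of the mean squared distance to the cluster centroid."""
    total = 0.0
    for members in clusters:
        mu = centroid(members)
        squared = sum(
            sum((x - m) ** 2 for x, m in zip(point, mu)) for point in members
        )
        total += squared / len(members)  # division by the cardinality |c_i|
    return total


solution = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
print(distortion(solution))
```

Each cluster is assumed to be non-empty, since the average is undefined otherwise.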
Diversity-Based Summarization (cont’d)
Problems of K-means
The user has to supply the number of clusters.
It is prone to getting stuck in local minima.
Diversity-Based Summarization (cont’d)
X-means
Globally searches the space of centroid locations to find the best way of partitioning the input data.
Resorts to a model selection criterion known as the Bayesian Information Criterion (BIC) to decide whether to split a cluster.
A cluster is split when the information gain from splitting it, as measured by BIC, is greater than the gain from keeping it as it is.
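The split test can be illustrated with one common way of scoring a hard clustering by BIC. The sketch assumes spherical Gaussian clusters with a shared variance and uses the "higher is better" form, log-likelihood minus a parameter penalty. That modelling choice, the parameter count and all function names are assumptions of this example; the exact scoring used by X-means or by the presented system may differ.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def _sum_squares(members: List[Point]) -> float:
    """Sum of squared distances from each member to the cluster mean."""
    mu = [sum(c) / len(members) for c in zip(*members)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in members)


def bic(clusters: List[List[Point]]) -> float:
    """BIC of a hard clustering under a shared-variance spherical Gaussian model."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    d = len(clusters[0][0])
    ss = sum(_sum_squares(c) for c in clusters)
    # Pooled per-dimension variance; the tiny floor avoids log(0).
    variance = max(ss / (d * max(n - k, 1)), 1e-12)
    log_likelihood = (
        sum(len(c) * math.log(len(c) / n) for c in clusters)  # mixing weights
        - 0.5 * n * d * math.log(2 * math.pi * variance)
        - ss / (2 * variance)
    )
    free_parameters = (k - 1) + k * d + 1  # weights, centroids, variance
    return log_likelihood - 0.5 * free_parameters * math.log(n)


def should_split(parent: List[Point], left: List[Point], right: List[Point]) -> bool:
    """Split only if the two-cluster model scores higher than the one-cluster model."""
    return bic([left, right]) > bic([parent])


left = [(0.0,), (0.1,), (-0.1,), (0.2,)]
right = [(10.0,), (10.1,), (9.9,), (10.2,)]
print(should_split(left + right, left, right))
```

The penalty term is what stops the procedure from splitting forever: a split has to improve the fit by more than the cost of the extra parameters it introduces.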
Diversity-Based Summarization (cont’d)
Modification of X-means
Replace BIC with MDL.
Diversity-Based Summarization (cont’d)
Reduce-Redundancy
Uses a simple sentence weighting model (the Z-model), which takes the weight of a given sentence as the sum of the tf · idf values of the index terms in that sentence.
x - an index term
tf(x) - the frequency of term x in the document
idf(x) - the inverse document frequency of x
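Read literally, the weighting is a plain sum of tf · idf products. The sketch below follows that reading under several assumptions that are not stated in the slides: idf is taken as log(N / df), each distinct index term of a sentence is counted once, and documents are represented as lists of already extracted index terms. The exact formula on the original slide was not preserved.

```python
import math
from collections import Counter
from typing import List, Sequence


def idf(term: str, documents: Sequence[Sequence[str]]) -> float:
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def sentence_weight(sentence_terms: Sequence[str],
                    document_terms: Sequence[str],
                    documents: Sequence[Sequence[str]]) -> float:
    """Sum of tf(x) * idf(x) over the distinct index terms x of the sentence."""
    tf = Counter(document_terms)  # term frequencies within the source document
    return sum(tf[x] * idf(x, documents) for x in set(sentence_terms))


# Hypothetical toy collection of four documents given as index-term lists.
collection: List[List[str]] = [
    ["market", "stocks", "market"],
    ["weather", "rain"],
    ["market", "rain"],
    ["sports"],
]
print(sentence_weight(["market", "stocks"], collection[0], collection))
```

Terms that occur in every document get an idf of zero under this definition, so they contribute nothing to a sentence's weight.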
Diversity-Based Summarization (cont’d)
Z-model sentence selection
1. Determine the weights of the sentences in the text.
2. Sort them in decreasing order.
3. Select the top sentences.
In Reduce-Redundancy, the sentence weight is further normalized for length.
In each cluster, find the sentence with the best W(s) score (the weight of sentence s) and take it as the representative of the cluster.
This minimizes the loss of the resulting summary's relevance to potential queries.
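The three selection steps amount to a sort and a cut-off. In this sketch the compression rate is taken to be the fraction of sentences kept, and length normalization is taken to be division by the number of terms; both are assumptions, as are the function names.

```python
from typing import List, Sequence


def length_normalized(weight: float, n_terms: int) -> float:
    """Normalize a sentence weight by sentence length (assumed: simple division)."""
    return weight / n_terms if n_terms else 0.0


def z_model_extract(sentences: Sequence[str],
                    weights: Sequence[float],
                    compression_rate: float) -> List[str]:
    """Keep the top-weighted fraction of sentences, returned in document order."""
    n_keep = max(1, int(len(sentences) * compression_rate))
    # Steps 1-2: the weights are given; sort sentence indices by decreasing weight.
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    # Step 3: select the top sentences.
    return [sentences[i] for i in sorted(ranked[:n_keep])]


print(z_model_extract(["s0", "s1", "s2", "s3"], [0.2, 0.9, 0.5, 0.7], 0.5))
```

Restoring document order after the cut-off is a presentation choice of this sketch; it does not affect which sentences are selected.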
Diversity-Based Summarization (cont’d)
Problem
The extraction process does not preserve the statistical properties of a source text; index terms are often left statistically indistinguishable after the process.
Solution
Extrapolate the frequencies of index terms in extracts in order to estimate their true frequencies in the source texts.
Diversity-Based Summarization (cont’d)
Extrapolation formula
$$E(k \mid k \ge m) = \frac{\sum_{r \ge m} r \, p_r}{\sum_{r \ge m} p_r}$$
$p_r$ - the probability of a given word occurring $r$ times in the document
$m \ge 0$
In these experiments, index terms are those with two or more occurrences in the document, so the extrapolation is $E(k \mid k \ge 2)$.
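A conditional expectation of the form E(k | k ≥ m) can be computed directly once the probabilities of each occurrence count are known. The distribution used below is a made-up example; the model that supplies these probabilities in the presented work is not preserved in this transcript, and the function name is hypothetical.

```python
from typing import Sequence


def conditional_expected_count(p: Sequence[float], m: int) -> float:
    """E(k | k >= m), where p[r] is the probability of exactly r occurrences."""
    tail = [(r, pr) for r, pr in enumerate(p) if r >= m]
    mass = sum(pr for _, pr in tail)
    if mass == 0:
        raise ValueError("no probability mass at or above m")
    return sum(r * pr for r, pr in tail) / mass


# Hypothetical distribution over 0, 1, 2 and 3 occurrences.
p = [0.5, 0.25, 0.125, 0.125]
print(conditional_expected_count(p, 2))
```

Conditioning on k ≥ 2 discards the mass at 0 and 1 occurrences and renormalizes, so the estimate can never fall below 2.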
Test Data and Evaluation Procedure
BMIR-J2
BMIR-J2 (Benchmark for Japanese IR systems, version 2) is a test collection of 5080 news articles published in Japan in 1994.
Test Data and Evaluation Procedure
F-measure
$$F = \frac{2PR}{P + R}$$
P – precision
R – recall
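As a small illustration, precision, recall and the balanced F-measure (the harmonic mean of the two) can be computed from a set of retrieved documents and a set of relevant ones. Using the balanced form is an assumption of this sketch, and the document identifiers are made up.

```python
from typing import Iterable, Tuple


def precision_recall_f(retrieved: Iterable[str],
                       relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Return (P, R, F) with F = 2PR / (P + R), or 0.0 when undefined."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    p = hits / len(retrieved_set) if retrieved_set else 0.0
    r = hits / len(relevant_set) if relevant_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


print(precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]))
```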
Test Data and Evaluation Procedure
Two sets of experiments
Strict relevance scheme (SRS): takes only A-labeled documents as relevant to the query.
Moderate relevance scheme (MRS): takes both A- and B-labeled documents as relevant.
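The two schemes differ only in which relevance labels count as relevant. A minimal sketch, assuming judgments are stored as a mapping from document id to label; the mapping, the ids and the "C" label for a non-relevant document are hypothetical.

```python
from typing import Dict, Set

# Labels accepted as relevant under each scheme.
ACCEPTED_LABELS = {"SRS": {"A"}, "MRS": {"A", "B"}}


def relevant_documents(judgments: Dict[str, str], scheme: str) -> Set[str]:
    """Return the ids of documents counted as relevant under the given scheme."""
    accepted = ACCEPTED_LABELS[scheme]
    return {doc for doc, label in judgments.items() if label in accepted}


judgments = {"d1": "A", "d2": "B", "d3": "C"}
print(relevant_documents(judgments, "SRS"), relevant_documents(judgments, "MRS"))
```

Because the MRS relevant set contains the SRS one, recall under MRS is computed against a larger target for the same retrieved documents.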
Test Data and Evaluation Procedure
Summarization methods
1. Z-model
2. Diversity-based summarizer with the standard K-means (DBS/K)
3. Diversity-based summarizer with XM-means (DBS/XM)
Compression rates range from 20% to 50%.
Test Data and Evaluation Procedure
Experiment procedure
1. At each compression rate, run the Z-model, DBS/K and DBS/XM on the entire BMIR-J2 collection to produce respective pools of extracts.
2. For each query from BMIR-J2, perform a search on each pool generated and score performance with the uninterpolated average F-measure.
Results and Discussion
Conclusion and Future Work
Diversity-based summarization (DBS/XM) was found to be superior to relevance-based summarization (Z-model) when the loss of information in extracts is measured in terms of retrieval performance.
Future Work
Extending the current DBS framework to deal with multi-document summarization.
Speech summarization with audio input and output.
Text categorization.