A Critique and Improvement of an Evaluation Metric for Text Segmentation
A Paper by
Lev Pevzner (Harvard University) and Marti A. Hearst (UC Berkeley)
Presented by
Saima Aman, SITE, University of Ottawa
Nov 10, 2005
Presentation Outline
● Problem Description: Text Segmentation
● Evaluation Measures: Precision and Recall
● Evaluation Metric Pk
● Problems with Evaluation Metric Pk
● Solution: Modified Metric – WindowDiff
● Simulation Results
● Conclusions
What is Text Segmentation?
Documents are generally composed of multiple sub-topics. Text segmentation is the task of determining the positions at which topics change in a document.
Applications of Text Segmentation
● Information Retrieval (IR): retrieval of relevant passages
● Automated Summarization
● Story Segmentation of Video
● Detection of Topic and Story Boundaries in News Feeds
Approaches to Text Segmentation
● Patterns of lexical co-occurrence and distribution
– Large shifts in vocabulary indicate subtopic boundaries
– Clustering based on word co-occurrences
– Lexical chains: a large number of lexical chains are found to originate and end at segment boundaries
● Cue words that tend to be used near segment boundaries
– Hand-selected cue words
– Machine learning techniques used to select cue words
Segmentation Evaluation
Challenges of Evaluation:
● Difficult to choose a reference segmentation
– Human judges disagree over the placement of boundaries
– Judges also disagree on how fine-grained the segmentation should be
● Criticality of errors is often application dependent
– Near misses may be acceptable in information retrieval
– Near misses are critical in news boundary detection
How to Evaluate Segmentation?
● There is a set of true boundaries according to the reference segmentation
● A segmentation algorithm may identify correct as well as incorrect boundaries
● The set of segment boundaries identified by the algorithm may not perfectly match the set of true boundaries
Precision & Recall
Recall: Ratio of the number of true segment boundaries identified to the total number of true segment boundaries in the document.
Precision: Ratio of the number of correct segment boundaries identified to the total number of boundaries identified.
Precision and Recall - Challenges
● Inherent trade-off between Precision and Recall: trying to improve one quantity may deteriorate the other, so the F1-measure is sometimes maximized. Placing more boundaries may improve Recall but reduces Precision.
● Not sensitive to "near misses": an algorithm A-0 that narrowly misses every true boundary and an algorithm A-1 that places its boundaries far from the truth would both receive scores of 0 for both Precision and Recall.
● It is desirable to have a metric that penalizes A-0 less harshly than A-1.
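As a concrete illustration of this insensitivity (a sketch of my own, not from the paper; the helper name `precision_recall` and the boundary positions are hypothetical), a near-miss algorithm and a badly wrong algorithm receive identical zero scores:

```python
def precision_recall(reference, hypothesis):
    """Boundary-level Precision and Recall over sets of boundary positions."""
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)                      # exactly matching boundaries
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

true_bounds = [4, 8, 12]   # reference segmentation boundaries
near_miss   = [5, 9, 13]   # "A-0": every boundary off by one unit
far_miss    = [1, 2, 3]    # "A-1": boundaries far from the truth

print(precision_recall(true_bounds, near_miss))   # (0.0, 0.0)
print(precision_recall(true_bounds, far_miss))    # (0.0, 0.0)
```

Because only exact matches count, both hypothetical algorithms score zero even though A-0 is clearly the better segmentation.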
A New Metric: Pk
Proposed by Beeferman, Berger, and Lafferty (1997)
Attempts to resolve problems with Precision and Recall
Pk measures the probability that two sentences k units apart are incorrectly labeled as being in different segments.

Pk = (total number of disagreements with the reference) / (number of measurements taken)

It computes penalties via a moving window of length k, where k = (average segment size)/2.
How is Pk Calculated?
• Segment size = 8, and window size k = 4
• At each location, the algorithm determines if the two ends of the probe are in the same or different segments.
• Penalties are assigned whenever two units are incorrectly labelled with respect to reference.
In the diagram, solid lines indicate that no penalty is assigned; dashed lines indicate that a penalty is assigned.
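The window-based computation described above can be sketched as follows (a minimal illustration of my own; the function name, the per-unit segment-label representation, and the example data are my assumptions, not the paper's code). Note how a single missed boundary is penalized in k consecutive windows:

```python
def pk(reference, hypothesis, k):
    """Fraction of probe positions where reference and hypothesis disagree
    on whether units i and i+k lie in the same segment.
    Segmentations are given as per-unit segment labels."""
    n = len(reference)
    measurements = n - k
    disagreements = 0
    for i in range(measurements):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        if same_ref != same_hyp:          # the probe ends are labeled inconsistently
            disagreements += 1
    return disagreements / measurements

# Two segments of size 8 each; window size k = (average segment size)/2 = 4
ref = [0] * 8 + [1] * 8
hyp = [0] * 16                # algorithm missed the boundary (false negative)
print(pk(ref, hyp, k=4))      # 4 penalized windows out of 12 -> 0.3333333333333333
```

The missed boundary is caught in exactly k = 4 windows, matching the observation below that false negatives are always assigned a penalty of k.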
Scope of the Paper
● The authors identify several limitations of the metric Pk
● They propose a modified metric – WindowDiff
● They claim that the new metric (WindowDiff) solves most of the problems associated with Pk
● They present simulation results suggesting that the modified metric is an improvement over the original
Problems with Metric Pk
● False Negatives Penalized More Than False Positives
– False negatives are always assigned a penalty of k
– On average, false positives are assigned a penalty of k/2
● Number of Boundaries Between Probe Ends Ignored
– Causes some errors to go un-penalized
● Sensitivity to Variations in Segment Size
– As segment size gets smaller, penalty for both false positives and false negatives decreases
– As segment size increases, penalty for false positives increases
● Near-Miss Error Penalized Too Much
● Pk is Non-intuitive and its Interpretation is Difficult
Modified Metric - WindowDiff
For each position of the probe, compute:
● ri – the number of reference segmentation boundaries that fall between the two ends of a fixed-length probe
● ai – the number of boundaries assigned in this interval by the algorithm
The algorithm is penalized if the two numbers do not match, that is, if |ri – ai| > 0.
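Under the same caveats as before (a minimal sketch of my own; the 0/1 boundary-indicator representation and the example data are my choices for illustration), WindowDiff compares the counts ri and ai in each window and, unlike Precision/Recall, penalizes a near miss less than a far miss:

```python
def window_diff(ref_bounds, hyp_bounds, k):
    """Fraction of windows where the reference boundary count r_i and the
    algorithm's boundary count a_i differ (|r_i - a_i| > 0).
    bounds[i] == 1 means a boundary follows unit i."""
    n = len(ref_bounds)
    windows = n - k
    penalties = 0
    for i in range(windows):
        r_i = sum(ref_bounds[i:i + k])    # reference boundaries in the window
        a_i = sum(hyp_bounds[i:i + k])    # algorithm boundaries in the window
        if r_i != a_i:
            penalties += 1
    return penalties / windows

ref  = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]  # boundaries after units 3 and 7
near = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # each boundary one unit late
far  = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # boundaries far from the truth

# k = (average segment size)/2 = 2 for these 4-unit segments
print(window_diff(ref, near, k=2))  # 0.4 -> near miss penalized less
print(window_diff(ref, far, k=2))   # 0.5 -> far miss penalized more
```

A readily available implementation with the same behavior is `nltk.metrics.segmentation.windowdiff`, for readers who prefer a tested library function.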
Validation via Simulation
Simulations performed for the following metrics:
● Evaluation metric Pk
● Metric P'k (which doubles the penalty for false positives)
● WindowDiff

Simulation Details
● A single trial included generating a reference segmentation of 1,000 segments
● Generating experimental segmentations of a specific type 100 times
● Computing the metrics and averaging over the 100 results
● Different segment size distributions were used
Results for WindowDiff
● Successfully distinguishes near misses as a separate kind of error
● Penalizes near misses less than pure false positives and pure false negatives
● Gives equal weight to false positive and false negative penalties (eliminating the asymmetry seen in the Pk metric)
● Catches false positives and false negatives within segments of length less than k
● Only slightly affected by variation in segment size distribution
Interpretation of WindowDiff
Test results show that the WindowDiff metric grows roughly linearly with the difference between the reference and the experimental segmentations.
WindowDiff metric value can be interpreted as an indication of the number of discrepancies occurring between the reference and the algorithm’s result.
Evaluation Metric Pk is a measure of how often two text units are incorrectly labelled as being in different segments. This interpretation is less intuitive.
Conclusions
● The evaluation metric Pk suffers from several drawbacks.
● A modified version P'k, which doubles the false-positive penalty, solves only the problem of over-penalizing false negatives, not the other problems.
● The WindowDiff metric solves all of the problems associated with Pk.
Popularity of the new metric WindowDiff
A search of the internet shows several citations of this paper.
Most people in text and media segmentation now use the WindowDiff measure.