
Page 1:

A Critique and Improvement of an Evaluation Metric for Text Segmentation

A Paper by

Lev Pevzner (Harvard University) and Marti A. Hearst (UC Berkeley)

Presented by

Saima Aman, SITE, University of Ottawa, Nov 10, 2005

Page 2:

Presentation Outline

● Problem Description: Text Segmentation
● Evaluation Measures: Precision and Recall
● Evaluation Metric Pk
● Problems with Evaluation Metric Pk
● Solution: Modified Metric – WindowDiff
● Simulation Results
● Conclusions

Page 3:

What is Text Segmentation?

Documents generally comprise multiple sub-topics. Text segmentation is the task of determining the positions at which topics change in a document.

Applications of Text Segmentation:

● Information Retrieval (IR): retrieval of relevant passages
● Automated Summarization
● Story Segmentation of Video
● Detection of Topic and Story Boundaries in News Feeds

Page 4:

Approaches to Text Segmentation

● Patterns of Lexical Co-occurrence and Distribution
– Large shifts in vocabulary indicate subtopic boundaries.
– Clustering based on word co-occurrences.
– Lexical chains: a large number of lexical chains are found to originate and end at segment boundaries.

● Cue Words that tend to be used near segment boundaries
– Hand-selected cue words.
– Machine learning techniques used to select cue words.

Page 5:

Segmentation Evaluation

Challenges of Evaluation:

● Difficult to choose a reference segmentation
– Human judges disagree over the placement of boundaries.
– Disagreement on how fine-grained the segmentation should be.

● Criticality of errors is often application-dependent
– Near misses may be okay in information retrieval.
– Near misses are critical in news boundary detection.

Page 6:

How to Evaluate Segmentation?

● There is a set of true boundaries defined by the reference segmentation.

● A segmentation algorithm may identify correct as well as incorrect boundaries.

● The set of segment boundaries identified by the algorithm may not perfectly match the set of true boundaries.

Page 7:

Precision & Recall

Recall: Ratio of the number of true segment boundaries identified to the total number of true segment boundaries in the document.

Precision: Ratio of the number of correct segment boundaries identified to the total number of boundaries identified.
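
In symbols (a standard formulation, not from the slides; here B_ref and B_hyp denote the sets of reference and hypothesized boundaries):

\[
\text{Recall} = \frac{|B_{\text{hyp}} \cap B_{\text{ref}}|}{|B_{\text{ref}}|},
\qquad
\text{Precision} = \frac{|B_{\text{hyp}} \cap B_{\text{ref}}|}{|B_{\text{hyp}}|}
\]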

Page 8:

Precision and Recall - Challenges

● Inherent trade-off between Precision and Recall: trying to improve one quantity may deteriorate the other.
– Placing more boundaries may improve Recall but reduces Precision.
– The F1-measure is therefore sometimes maximized instead.

● Not sensitive to “near misses”
– Both algorithms A-0 and A-1 (A-0’s boundaries being near misses, A-1’s farther off) would receive scores of 0 for both Precision and Recall.
– It is desirable to have a metric that penalizes A-0 less harshly than A-1.

Page 9:

A New Metric: Pk

Proposed by Beeferman, Berger, and Lafferty (1997)

Attempts to resolve problems with Precision and Recall

Pk measures the probability that two sentences k units apart are incorrectly labelled as being in different segments:

Pk = (total number of disagreements with the reference) / (number of measurements taken)

It computes penalties via a moving window of length k, where k = (average segment size) / 2.
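
A minimal sketch of this computation (illustrative only: it assumes segmentations are encoded as per-unit segment labels, so ref[i] is the segment number of text unit i; the function name and encoding are ours, not the paper's):

```python
def pk(ref, hyp, k=None):
    """Sketch of the Pk metric of Beeferman et al.

    ref, hyp: one segment label per text unit, e.g. [0, 0, 0, 1, 1].
    k: window size; defaults to half the average reference segment size.
    """
    n = len(ref)
    if k is None:
        k = max(1, round(n / (2 * len(set(ref)))))
    disagreements = 0
    for i in range(n - k):
        same_ref = ref[i] == ref[i + k]   # probe ends in same reference segment?
        same_hyp = hyp[i] == hyp[i + k]   # same segment per the algorithm?
        if same_ref != same_hyp:          # labels disagree -> penalty
            disagreements += 1
    return disagreements / (n - k)        # fraction of probe positions penalized
```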

Page 10:

How is Pk Calculated?

● Segment size = 8, window size k = 4.

● At each location, the metric determines whether the two ends of the probe are in the same or in different segments.

● Penalties are assigned whenever two units are incorrectly labelled with respect to the reference.

(Figure: solid lines indicate probe positions where no penalty is assigned; dashed lines indicate positions where a penalty is assigned.)
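
Since the figure itself is not reproduced in this transcript, a toy run of the pk sketch above (with invented data) illustrates the same setup: a boundary misplaced by one unit is penalized at only a few probe positions, while a badly misplaced boundary is penalized at many.

```python
ref  = [0] * 8 + [1] * 8    # two reference segments of size 8
near = [0] * 9 + [1] * 7    # boundary misplaced by one unit
far  = [0] * 12 + [1] * 4   # boundary misplaced by four units

print(pk(ref, near, k=4))   # 2 of 12 probes disagree -> ~0.17
print(pk(ref, far,  k=4))   # 8 of 12 probes disagree -> ~0.67
```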

Page 11:

Scope of the Paper

The authors identify several limitations of the Pk metric.

They propose a modified metric, WindowDiff.

They claim that the new metric solves most of the problems associated with Pk.

They present simulation results suggesting that the modified metric is an improvement over the original.

Page 12:

Problems with Metric Pk

● False negatives penalized more than false positives (see the toy example after this list)
– False negatives are always assigned a penalty of k.
– On average, false positives are assigned a penalty of about k/2.

● Number of boundaries between probe ends ignored
– This causes some errors to go unpenalized.

● Sensitivity to variations in segment size
– As segment size gets smaller, the penalty for both false positives and false negatives decreases.
– As segment size increases, the penalty for false positives increases.

● Near-miss errors penalized too much

● Pk is non-intuitive and its interpretation is difficult
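
The first asymmetry can be demonstrated with the pk sketch from above (data invented for illustration): a missed boundary is caught by all k probes that straddle it, while a spurious boundary is partially "shielded" when it falls within k units of a true boundary, which is why false positives cost about k/2 on average.

```python
ref     = [0] * 8 + [1] * 8            # reference: one true boundary after unit 7
fn      = [0] * 16                     # false negative: the boundary is missed
fp_near = [0] * 8 + [1] * 2 + [2] * 6  # extra boundary two units past the true one

print(pk(ref, fn, k=4))                # missed boundary: 4 probes penalized = k
print(pk([0] * 16, ref, k=4))          # isolated false positive: 4 probes = k
print(pk(ref, fp_near, k=4))           # shielded false positive: 2 probes = k/2
```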

Page 13:

Modified Metric - WindowDiff

For each position of the probe, compute:

● r_i – the number of reference segmentation boundaries that fall between the two ends of a fixed-length probe.

● a_i – the number of boundaries assigned in this interval by the algorithm.

The algorithm is penalized if the two numbers do not match, that is, if |r_i – a_i| > 0.
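
A minimal sketch under the same illustrative conventions as the pk sketch above, except that segmentations are now given as sets of boundary positions (a boundary at position b separates unit b-1 from unit b); the names and the exact window convention are assumptions, not the paper's code:

```python
def window_diff(ref_bounds, hyp_bounds, n, k):
    """Sketch of WindowDiff (Pevzner & Hearst).

    ref_bounds, hyp_bounds: sets of boundary positions in 1..n-1.
    n: number of text units; k: window size (half the average
    reference segment size).
    """
    penalized = 0
    for i in range(n - k):
        # count boundaries falling inside the probe (i, i + k]
        r_i = sum(1 for b in ref_bounds if i < b <= i + k)
        a_i = sum(1 for b in hyp_bounds if i < b <= i + k)
        if r_i != a_i:              # any mismatch in counts draws a penalty
            penalized += 1
    return penalized / (n - k)

# Invented data: a near miss is penalized less than a distant boundary.
print(window_diff({8}, {9},  n=16, k=4))   # ~0.17
print(window_diff({8}, {12}, n=16, k=4))   # ~0.67
```

Because it counts boundaries rather than merely testing "same or different segment", this formulation also penalizes windows in which the algorithm finds one boundary where the reference has two.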

Page 14:

Validation via Simulation

Simulations were performed for the following metrics:

● Evaluation metric Pk
● Metric P'k (which doubles the penalty for false positives)
● WindowDiff

Simulation details:

● A single trial involved generating a reference segmentation of 1,000 segments.
● Different experimental segmentations of a specific type were generated 100 times.
● The metrics were computed and averaged over the 100 results.
● Different segment-size distributions were used.

Page 15:

Results for WindowDiff

● Successfully distinguishes near misses as a separate kind of error.

● Penalizes near misses less than pure false positives and pure false negatives.

● Gives equal weight to false-positive and false-negative penalties (eliminating the asymmetry seen in Pk).

● Catches false positives and false negatives within segments of length less than k.

● Is only slightly affected by variation in the segment-size distribution.

Page 16:

Interpretation of WindowDiff

Test results show that the WindowDiff metric grows in a roughly linear fashion with the difference between the reference and the experimental segmentations.

The WindowDiff value can therefore be interpreted as an indication of the number of discrepancies between the reference and the algorithm's result.

The evaluation metric Pk, by contrast, measures how often two text units are incorrectly labelled as being in different segments, which is a less intuitive interpretation.

Page 17:

Conclusions

The evaluation metric Pk suffers from several drawbacks.

A modified version, P'k, which doubles the false-positive penalty, solves only the problem of over-penalizing false negatives, not the other problems.

The WindowDiff metric solves all of the problems associated with Pk.

Popularity of the new metric WindowDiff:

● An internet search shows several citations of this paper.
● Most people working in text and media segmentation now use the WindowDiff measure.