
Post on 19-Dec-2015


1

Evaluating Summary Content Selection

Pyramid Method: Work in Progress

Rebecca Passonneau

Ani Nenkova

2

OUTLINE

1. Motivation

2. Problems

3. DUC Evaluations

4. Pyramid Method: Current Status

5. Open Issues

6. Conclusions

3

EVALUATION GOALS

Define parameters of the problem
o What is summarization?

Compare systems
o Is the metric meaningful?

Track progress
o When does output improve?

Cost Effectiveness
o Can it be (partly) automated?

4

PICTURING CONTENT “OVERLAP”

Philippine Airlines (PAL) experienced a crisis in 1998. Unable to make payments on a $2.1 billion debt, it was faced by a pilot's strike in June and the region's currency problems which reduced passenger numbers and inflated costs. On September 23 PAL shut down after the ground crew union turned down a settlement which it accepted two . . .

Starting in May 1998, Philippine Airlines (PAL) laid off 5000 of its 13,000 workers. A 3-week pilots' strike in June and a currency crisis that reduced passenger numbers made payments on PAL's $2 billion debt impossible. President Estrada brokered an agreement to suspend collective bargaining for 10 years in exchange for 20% of PAL stock and union seats on its board. The large ground crew union initially voted no. After PAL shut down operations for 13 days starting Sept. 23rd, leaving much of the country without air service and foreign . . .

5

OBSTACLES

Humans select different content

Humans present same content differently

Lack clear standard of “good” summary

[Contrasts with translation: L1(C)L2(C)]

Need objective method to get at subjective notion of what a summary IS

6

PREVIOUS WORK: Pessimism

Human Judgments

Extraction
o Low agreement (Rath, 1961; Salton et al., 1997)
o Inconsistent over time (Rath, 1961; Lin & Hovy, 2002)

Abstraction
o Depends on individual’s orientation (Gerrig et al., 1991)

Automated Evaluation

Extraction (Pastra & Saggion, 2003 EACL)
o 3 humans; multiple “models”; inconclusive

Abstraction (Lin & Hovy, 2002 ACL)
o Accepts inconsistent judgments as target
o Difficult to extend

7

PREVIOUS WORK: Optimism

Good design methodology leads to better understanding of areas of agreement

High compression rate leads to high agreement (Jing et al., 1998)

Content variation offset by logarithmic growth in pool of distinct content units (Halteren & Teufel, 2003)

Content can be reliably annotated (Beck et al., 1991)

8

HOW TO GET AT “CONTENT” FROM ITS “EXPRESSION”

1. ADAPT BLEU MT EVALUATION
a) Collect multiple “model” summaries
b) Quantify ngram overlap

2. IDENTIFY ABSTRACT CONTENT UNITS
a) DUC
b) Reading Comprehension

3. A THIRD WAY
a) Content unit “level”
b) Multiple expressions of same content unit
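Option 1 (n-gram overlap against several model summaries) can be sketched as follows. This is a minimal illustration of BLEU-style clipped n-gram precision, not the scoring used by any particular tool; the function names, the whitespace tokenization, and the toy strings are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(peer, models, n=1):
    """Fraction of peer n-grams that also occur in some model summary.

    Each peer n-gram count is clipped to the largest count seen in any
    single model, as in BLEU's modified precision.
    """
    peer_counts = ngrams(peer.lower().split(), n)
    if not peer_counts:
        return 0.0
    max_model = Counter()
    for model in models:
        for gram, count in ngrams(model.lower().split(), n).items():
            max_model[gram] = max(max_model[gram], count)
    matched = sum(min(count, max_model[gram]) for gram, count in peer_counts.items())
    return matched / sum(peer_counts.values())


# Toy strings, assumed for illustration only.
models = ["pal has a debt of over $2 billion", "pal cannot make its payments"]
peer = "pal cannot repay its $2 billion debt"
score = clipped_ngram_precision(peer, models, n=1)
print(round(score, 3))
```

Note that the peer's "repay" gets no credit even though it expresses the same content as "make its payments"; that gap between expression and content is what option 3 tries to address.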

9

DUC: THE CURRENT APPROACH

Yearly evaluation of systems on new data sets

NIST evaluations performed by humans

Widely cited results

Does it work?
• Compare current systems
• Track individual system progress
• Track community progress from year to year
• Identify specific strengths/weaknesses
• Can it eventually be automated?

10

DUC SCORING METHOD

Datasets: human/machine summaries

Designate “model” human summary

(Automatically) identify content units in “model” summary

Split “peer” summaries into sentences

Human judges evaluate “peer” against model

11

COMPUTE DUC SCORES

1. For each EDU:
a) Does peer sentence express any part?
b) How much? (0, 20, 40, 60, 80, 100%)

2. Average EDU percent overlap scores

3. Resulting score ranges from 0 to 1
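The three steps above amount to a simple average. The sketch below assumes the judgments arrive as a list of percentages, one per EDU; the function name and the sample judgments are hypothetical.

```python
# The six overlap levels a judge can assign, per the slide.
ALLOWED_PERCENTS = {0, 20, 40, 60, 80, 100}


def duc_coverage_score(edu_percents):
    """Average the per-EDU overlap judgments and rescale to the 0..1 range."""
    if not edu_percents:
        raise ValueError("need at least one EDU judgment")
    for percent in edu_percents:
        if percent not in ALLOWED_PERCENTS:
            raise ValueError(f"unexpected judgment: {percent}")
    return sum(edu_percents) / (100 * len(edu_percents))


# Hypothetical judgments for a model summary with four EDUs.
judgments = [20, 0, 60, 100]
score = duc_coverage_score(judgments)
print(score)
```

Every EDU contributes equally to the average, whichever one it is.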

12

DRAWBACKS TO DUC SCORES

• Very sensitive to choice of “model”

• All “model” units created equal

• Difficult to interpret scores
o Human summary scores as low as 0.1
o Scores vary for same summarizer
o Scores vary for same summary

• Systems cannot be differentiated

13

DUC SCATTERPLOT

[Figure: scatterplot titled “10 DUC Summary Evaluators”. x-axis: Summary ID (1 to 30); y-axis: Score (0 to 0.9); one series per summarizer, Summarizer A through Summarizer J.]

14

FOUNDATION OF PYRAMID

A few CUs appear in many summaries

Humans can identify same/different CUs

Weight CUs differentially

15

MULTIPLE GOOD SUMMARIES

This pyramid predicts 6 different good summaries consisting of 4 SCUs:
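The pyramid figure for this slide is not preserved in the transcript. As a hedged illustration of how such a count can arise, the sketch below assumes a pyramid with two SCUs in a top tier (W=4) and four SCUs in the next tier (W=3); those tier sizes are an assumption, not taken from the slide. An optimal 4-SCU summary then takes both top-tier SCUs plus any two of the four below.

```python
from math import comb


def count_optimal_summaries(tier_sizes, summary_size):
    """Count the SCU selections of a given size that maximize total weight.

    tier_sizes maps an SCU weight to the number of SCUs with that weight.
    An optimal selection fills from the heaviest tier downward, so only the
    last, partially used tier offers a choice.
    """
    remaining = summary_size
    ways = 1
    for weight in sorted(tier_sizes, reverse=True):
        size = tier_sizes[weight]
        take = min(size, remaining)
        ways *= comb(size, take)
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("summary_size exceeds the number of SCUs in the pyramid")
    return ways


# Assumed tier sizes: two SCUs at W=4, four SCUs at W=3.
assumed_pyramid = {4: 2, 3: 4}
print(count_optimal_summaries(assumed_pyramid, 4))
```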

16

SCU ANNOTATION EXAMPLE

A.2 Unable to make payments on a $2 billion debt

H.2 made payments on PAL’s $2 billion debt impossible

I.1 With a rising $2.1 billion debt

J.3 PAL is buried under a $2.2 billion dollar debt it cannot repay

SCU1 W=4 PAL has a debt of over $2 billion

SCU2 W=3 PAL cannot make its payments
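An SCU's weight is the number of summaries that express it. The sketch below records which clauses contribute to each SCU and derives the weights; the assignment of A.2, H.2 and J.3 to SCU2 is one reading of the four clauses above, and the data layout and function name are assumptions.

```python
# Clause IDs are "<summary letter>.<sentence number>", as on the slide.
contributors = {
    "SCU1: PAL has a debt of over $2 billion": ["A.2", "H.2", "I.1", "J.3"],
    # Assumed reading: I.1 mentions the debt but not the inability to pay.
    "SCU2: PAL cannot make its payments": ["A.2", "H.2", "J.3"],
}


def scu_weight(clause_ids):
    """Weight = number of distinct summaries contributing to the SCU."""
    return len({clause_id.split(".")[0] for clause_id in clause_ids})


weights = {label: scu_weight(ids) for label, ids in contributors.items()}
for label, weight in weights.items():
    print(f"{label}  W={weight}")
```

One clause can contribute to more than one SCU: A.2, H.2 and J.3 each appear under both.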

17

PAL PYRAMID TIER: W=3 (N=4)

SCU1: PAL has $2.1 billion debt

H2 [PAL’s $2 billion debt]1

I1 [and with a rising $2.1 billion debt,]1

J3 [PAL is buried under a $2.2 billion dollar debt]1

 

SCU2: PAL enforced a shutdown

H5 [After PAL shut down operations]2

I1 [stopped all operations]2

J5 [by a]2 [shutdown]2

 

SCU3: PAL in crisis

H1 [Philippine Airlines]3

I1 [Philippines Airlines (PAL),]3 [devastated]3

J1 [The fate]3 [is uncertain.]3

 

18

PAL PYRAMID TIER: W=2 (N=8)

SCU5: PAL unable to repay debt

H2 [made payments on]5 [impossible.]5

J3 [it cannot repay]5

SCU6: PAL experienced pilots' strike

H2 [A]6 [pilots' strike]6

I1 [by pilot]6 [strikes]6

SCU7: this PAL crisis occurred in 1998

H1 [1998,]7

I1 [in 1998]7

. . .

19

ANNOTATION: KEEPING TRACK

H1 [Starting in May]23 [1998,]7 [Philippine Airlines]3 [laid off 5000 of its 13,000 workers.]24

H2 [A]6 [3-week]25 [pilots' strike]6 [in June]11 [and a currency crisis]12 [that reduced passenger numbers]13

H3 [President Estrada brokered an agreement to suspend collective bargaining for 10 years]17 [in exchange for 20% of PAL stock and union seats on its board.]26

H4 [The large ground crew union initially voted no.]18

H5 [After PAL shut down operations]2 [for 13 days]4 [starting Sept. 23rd,]8 [leaving much of the country without air service]27 [and foreign carriers flying some domestic routes,]9 [61% voted yes.]19

. . .
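The bracket-plus-index notation above can be read mechanically. The sketch below parses one annotated sentence into (text, SCU id) pairs and groups discontinuous spans by SCU; the regular expression and function names are assumptions about how one might process this notation, not a description of the authors' tooling.

```python
import re
from collections import defaultdict

# A span is "[text]" immediately followed by the integer ID of its SCU.
SPAN = re.compile(r"\[([^\]]*)\](\d+)")


def parse_annotated_sentence(line):
    """Split 'H2 [A]6 [3-week]25 ...' into ('H2', [('A', 6), ('3-week', 25), ...])."""
    sentence_id, _, rest = line.partition(" ")
    return sentence_id, [(text, int(scu)) for text, scu in SPAN.findall(rest)]


def group_by_scu(spans):
    """Collect the spans of each SCU, so discontinuous contributors stay together."""
    grouped = defaultdict(list)
    for text, scu in spans:
        grouped[scu].append(text)
    return dict(grouped)


line = ("H2 [A]6 [3-week]25 [pilots' strike]6 [in June]11 "
        "[and a currency crisis]12 [that reduced passenger numbers]13")
sentence_id, spans = parse_annotated_sentence(line)
grouped = group_by_scu(spans)
print(sentence_id, grouped[6])
```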

20

RELIABILITY

Number of SCUs
o Two annotators: 33 versus 37
o Consensus annotation: 35

Count of Pairwise Agreements (PAs)
o SCU Label
o SCU Members

Comparison of Annotations to Consensus
o Recall/Precision not valid
o 65/69 PAs
o Most “disagreements” due to membership size
o Only 2 “conflicts”

21

ANOTHER CONSISTENCY TEST

Pyramid A H C J

Consensus .95 .89 .85 .76

Annotation1 .97 .87 .83 .82

Annotation2 .94 .87 .84 .74

22

PYRAMID SCORE PART I

1. For N summaries, score each “peer” against a pyramid with N-1 tiers

2. “Peer” annotation
a) Gives SCU “size”
b) Yields a residue of SCUs not in pyramid

3. Compute D (Observed distribution) where D=sum of weights of SCUs

EG: Summary A (D30042), size=20
D=(6x3) + (6x2) + (4x1) + (4x0) = 34
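The arithmetic for D can be written out as a short sketch. The dictionary layout (weight mapped to the number of the peer's SCUs at that weight) and the function name are assumptions; the counts are the ones in the example for Summary A, with weight 0 standing for the residue of SCUs not found in the pyramid.

```python
def observed_weight(scu_counts_by_weight):
    """D: the sum of the pyramid weights of the SCUs found in the peer summary."""
    return sum(weight * count for weight, count in scu_counts_by_weight.items())


# Summary A: 6 SCUs of weight 3, 6 of weight 2, 4 of weight 1,
# and 4 residue SCUs (weight 0) that are not in the pyramid.
summary_a = {3: 6, 2: 6, 1: 4, 0: 4}

size = sum(summary_a.values())
d = observed_weight(summary_a)
print(size, d)
```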

23

PYRAMID SCORE PART II

1. Compute Max = Ideal Sum of weights of SCUs, given the summary SCU size

2. Pyramid of H,I,J:
a) 9 SCUs in tier, w=3
b) 10 SCUs in tier, w=2
c) 12 SCUs in tier, w=1

3. Size=20, Max=(9x3) + (10x2) + (1x1)=48

4. P=D/Max; P(A)=34/48=.71
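Max is obtained by filling the summary's SCU budget from the top tier down. The sketch below reproduces the numbers on this slide; the function name and data layout are assumptions, and D=34 is the observed weight for Summary A from the preceding example.

```python
def max_weight(tier_sizes, summary_size):
    """Max: the best achievable sum of weights for a summary with this many SCUs.

    tier_sizes maps an SCU weight to the number of SCUs in that tier.
    SCUs are taken greedily from the heaviest tier downward.
    """
    remaining = summary_size
    total = 0
    for weight in sorted(tier_sizes, reverse=True):
        take = min(tier_sizes[weight], remaining)
        total += weight * take
        remaining -= take
        if remaining == 0:
            break
    return total


# Pyramid of H, I, J: 9 SCUs at w=3, 10 at w=2, 12 at w=1.
pyramid = {3: 9, 2: 10, 1: 12}

max_score = max_weight(pyramid, summary_size=20)
d = 34  # observed weight for Summary A
p = d / max_score
print(max_score, round(p, 2))
```

With a budget of 20 SCUs, the ideal summary uses all 9 SCUs of weight 3, all 10 of weight 2, and one of weight 1.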

24

COMPARISON TO DUC SCORES: HUMAN SUMMARIES

Lockerbie   A     B     C     D
DUC         n.a.  .82   .54   .74
Pyramid     .71   .82   .71   .81

PAL         A     H     I     J
DUC         .30   n.a.  .30   .10
Pyramid     .76   .72   .60   .45

China       C     D     D     F
DUC         n.a.  .28   .27   .13
Pyramid     .52   .65   .73   .62

25

MACHINE SUMMARY EXAMPLE

African countries voted in June to ignore the U.N. flight ban which was imposed in 1992 to try and force Libya to hand over for trial two suspects wanted in the 1988 bombing of an American airliner over Lockerbie, Scotland. The reported jailing of the three officials comes as Gadhafi is under pressure to accept a plan to turn over for trial two other Libyans wanted for the 1988 bombing of Pan am flight 103 over Lockerbie, Scotland, that led to 270 deaths. The visit was Farrakhan's …

26

COMPARISON TO DUC SCORES: MACHINE SUMMARIES

SYSTEM DUC PYRAMID

Sys06* .30 .79

Sys13 .03 .24

Sys14 .25 .51

Sys16* .25 .26

Sys17* .03 .17

Sys18 .03 .20

Sys20 .10 .64
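One way to read this table is to ask which systems a metric cannot tell apart. The sketch below groups systems that share a score under each metric; the helper name is hypothetical, and the check says something only about these seven rows, not about the metrics in general.

```python
from collections import defaultdict

# (DUC score, pyramid score) for each system, copied from the table above.
scores = {
    "Sys06": (0.30, 0.79),
    "Sys13": (0.03, 0.24),
    "Sys14": (0.25, 0.51),
    "Sys16": (0.25, 0.26),
    "Sys17": (0.03, 0.17),
    "Sys18": (0.03, 0.20),
    "Sys20": (0.10, 0.64),
}


def tied_groups(scores, metric_index):
    """Return {score: [systems]} for every score shared by two or more systems."""
    groups = defaultdict(list)
    for system, pair in scores.items():
        groups[pair[metric_index]].append(system)
    return {score: sorted(names) for score, names in groups.items() if len(names) > 1}


duc_ties = tied_groups(scores, 0)
pyramid_ties = tied_groups(scores, 1)
print(duc_ties)
print(pyramid_ties)
```

On these rows, five of the seven systems share a DUC score with at least one other system, while no two systems share a pyramid score.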

27

MACHINE SUMMARIES

System 6

PAL, Asia’s oldest airline, has been unable to make payments on dlrs 2.1 billion debt after being devasted by a pilot’s strike and by Asia’s currency crisis. PAL earlier accepted a preliminary investment offer from Cathay Pacific, Ailing Philippine Airlines and prospective investor Cathy Pacific Airways have clashed over . . .

28

MACHINE SUMMARIES

System 16

President Joseph Estrada on Saturday urged militant unionists at Philippine Airlines to accept a vote by workers approving a 10-year no-strike deal to revive the debt-laden airline. President Joseph Estrada said Saturday the financially troubled airlines will resume its international flights on Sunday by flying him to Singapore . . .

29

MACHINE SUMMARIES

System 17

Christmas is a sacred holiday in the Philippines, and nowhere is that more evident than at the headquarters of Philippine Airlines. But Ramos, who was intent on privatizing the economy, opened the industry to competition, licensing rivals like Air Philippines, Cebu Pacific, and Grand Air. PAL closed for nearly 2 weeks on Sep. 23 after . . .

30

OPEN ISSUES

Distribution of SCUs NOT an independent variable
o Ordering
o Knowledge
o Informational Goal

Can Pyramid Scoring be Automated?

31

SCU INTERDEPENDENCIES

1. SCU4 presupposes SCU1:

SCU1 (w=4): PAL has a debt > 2 billion

SCU4 (w=3): PAL cannot make its debt payments

2. SCU7, SCU8 depend on SCU2

  SCU2 (w=4): PAL shut down operations

  SCU7 (w=3): shutdown began on 9/23

  SCU8 (w=3): shutdown lasted 2 weeks
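The two dependencies above can be recorded as a small map and used to flag a selection of SCUs that includes a dependent SCU without the SCU it relies on. This is only a sketch of the bookkeeping; the map layout and function name are assumptions, and the slide does not say how such cases should affect the score.

```python
# Dependent SCU -> set of SCUs it presupposes or depends on (from the slide).
DEPENDS_ON = {
    4: {1},  # "cannot make its debt payments" presupposes "has a debt"
    7: {2},  # "shutdown began on 9/23" depends on "PAL shut down operations"
    8: {2},  # "shutdown lasted 2 weeks" depends on "PAL shut down operations"
}


def unsupported_scus(selected):
    """Return {scu: [missing prerequisites]} for a set of selected SCU IDs."""
    selected = set(selected)
    return {
        scu: sorted(DEPENDS_ON[scu] - selected)
        for scu in sorted(selected)
        if scu in DEPENDS_ON and not DEPENDS_ON[scu] <= selected
    }


# A selection with SCU4 but not SCU1: the debt payments lack their premise.
print(unsupported_scus({2, 4, 7}))
```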

32

SCUs and DEPENDENCY/TAG GR

A3

[On September 23]7

[PAL shut down]2

[after the ground crew union turned down a settlement]18

[which it accepted two weeks later.]19

SCU7
1 On IN 5 shut t0
2 September NNP 4 PAL t2
3 23 CD 4 PAL t2

33

“LARGE” CONSTITUENTS

1. PAL experienced a crisis in 1998.

2. Unable to make payments on a $2.1 billion debt,

3. it was faced by a pilot's strike in June

4. and the region's currency problems

5. which reduced passenger numbers and inflated costs.

6. On September 23 pal shut down

7. after the ground crew union turned down a settlement

8. which it accepted two weeks later.

9. PAL resumed domestic flights on October 7

10. and [resumed] international flights on October 26.

11. Resolution of the basic financial problems was elusive, however,

12. and as of December 18 pal was still $2.2 billion in debt

13. and [pal was] losing close to $1 million a day.

34

DOCSET TF*IDF

TERMS: $2, airline, billion, day, debt, pal (6 of 13 LCs)

1   1. Philippine Airlines (pal) experienced a crisis in 1998.   SCU3 w=3

3   2. Unable to make payments on a $2.1 billion debt,   SCU1 w=4

1   6. On September 23 pal shut down   SCU2 w=4 & SCU7 w=3

1   9. pal resumed domestic flights on October 7   SCU10 w=2

4   12. and as of December 18 pal was still $2.2 billion in debt   NO SCU

1   13. and losing close to $1 million a day.   SCU15 w=2
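The number in front of each large constituent (LC) appears to be how many of its tokens match one of the six listed terms; that reading is an inference, not something the slide states. The sketch below applies the terms to the 13 LCs from the previous slide. It does not compute TF*IDF weights: the term list is taken as given, and the prefix-matching rule and the removal of bracketed restorations are assumptions chosen because they reproduce the counts shown.

```python
import re

TERMS = ("$2", "airline", "billion", "day", "debt", "pal")

# The 13 large constituents, as listed on the previous slide.
LARGE_CONSTITUENTS = [
    "PAL experienced a crisis in 1998.",
    "Unable to make payments on a $2.1 billion debt,",
    "it was faced by a pilot's strike in June",
    "and the region's currency problems",
    "which reduced passenger numbers and inflated costs.",
    "On September 23 pal shut down",
    "after the ground crew union turned down a settlement",
    "which it accepted two weeks later.",
    "PAL resumed domestic flights on October 7",
    "and [resumed] international flights on October 26.",
    "Resolution of the basic financial problems was elusive, however,",
    "and as of December 18 pal was still $2.2 billion in debt",
    "and [pal was] losing close to $1 million a day.",
]


def term_hits(constituent, terms=TERMS):
    """Count whitespace tokens that start with one of the terms.

    Bracketed words (restored ellipsis, not in the original text) are dropped.
    """
    text = re.sub(r"\[[^\]]*\]", " ", constituent.lower())
    return sum(1 for token in text.split() if any(token.startswith(t) for t in terms))


hits = [term_hits(lc) for lc in LARGE_CONSTITUENTS]
selected = [number for number, count in enumerate(hits, start=1) if count > 0]
print(hits)
print(selected)
```

Under these assumptions, six of the 13 LCs contain at least one term, and LC 12 has the most matches even though it corresponds to no SCU.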

35

CONCLUSIONS

Define parameters of the problem
o What is summarization?

Compare systems and/or humans
o Is the metric meaningful?

Track progress
o When does output improve?

Cost Effectiveness
o Can it be (partly) automated?
