

Page 1: GISTexter: A System for Summarizing Text Documents Sanda Harabagiu, Dan Moldovan, Paul Morarescu, Finley Lacatusu, Rada Mihalcea, Vasile Rus and Roxana

GISTexter: A System for Summarizing Text Documents

Sanda Harabagiu, Dan Moldovan, Paul Morarescu, Finley Lacatusu, Rada Mihalcea, Vasile Rus and Roxana Girju

Department of Computer Sciences
The University of Texas at Austin
Austin TX 78712-1188

Department of Computer Science
The University of Texas at Dallas
Richardson TX 75083-0688

Dept. of Computer Science & Engr.
Southern Methodist University
Dallas TX 75275-0122

Language Computer Corporation
6440 N. Central Expressway
Dallas TX 75206

Page 2

Outline

1. Background
2. System Architecture
3. Single-Document Summaries
4. Multi-Document Summaries
5. Results and Conclusions

Page 3

GISTexter: generating summaries as similar as possible to human-written abstracts.

Two assumptions:
(1) Single-document summaries: extract the same information a human would consider when writing an abstract of the same document.
(2) Multi-document summaries: capture textual information shared across the document set.

Page 4

Our interest

Multi-document summaries applicable to Question-Answering.

This enables the use of IE technology.

IE needs domain information:
- use CICERO for topics that are already encoded in it;
- develop a back-up solution for other topics: gisting information by combining cohesion and coherence indicators for sentence extraction.

Page 5

What is gisting?

- an activity in which the information taken into account is less than the full information content available.

Empirical principles:
- Named entities common to the set of documents are anchors for argument structures that act like ad-hoc templates.
- Sometimes cue phrases indicate coherence with related information that should be gleaned for the summary.

Page 6

[Figure: GISTexter architecture]

Inputs: input article (Single-Document Summarizer); input article set (Multi-Document Summarizer).
Single-document components: Single-Document Decomposition; Sentence Extraction; Sentence Compression; Summary Reduction.
Multi-document components: Multi-Document Decomposition; Known Topic? (Yes/No); Extraction Templates; Cohesive and Coherence-Based Extraction; Content-Based Planning and Generation.
Resources: corpus of human-written abstracts; WordNet.
CICERO Information Extraction System stages: Tokenizer + Preprocessor; Named-Entity Recognizer; Part-of-Speech Disambiguator; Named-Entity Alias Recognition; Phrasal Parser; Combiner; Entity Coreference; Domain-Event Recognizer; Domain Coreference; Merging of Event Information.
Output: Summary.

Page 7

CICERO: Technical Details

World knowledge incorporated; the implementation infrastructure:

[Figure: template objects]
Template Object: Person, Location, Organization, Money, Measures, Date, MergeTemplate
DocumentTemplate; state template; transition template
Operations: compare, merge

Page 8

Novelties introduced by Cicero

Incorporation of World Knowledge
- extend the IE paradigm with linguistic patterns that are non-deterministic and capable of handling ambiguities (unlike lex & yacc)
- conceive a language that extends the C/C++ Integration

Qualitative improvements
- usage of a full parser that is fast enough
- improved grammars

Page 9

Novelties introduced by Cicero (cont’d)

Ease of Use and Customization
- explicit domain and combiner rules
- minimalist rules that are expanded by the built-in compiler

Quantitative improvements
- unprecedented F-measure: 78.8%
- unprecedented average speed: 2.2 seconds/document

Page 10

The Organization of Knowledge in CICERO

Layer 1: Rule Compiler and Run Time System

Layer 2: Information Extraction Phrases

Layer 3: Domain Knowledge

Layer 4: World Knowledge

Page 11

New Domain: Natural Disasters

Think of new words characteristic of the domain.

Nouns: tornado, flood, ice storm
Verbs: happen, fear, rip, hit

Page 12

World Knowledge

What causes disasters?
Nouns: ice, storm
Verbs: bring, raise

How are people affected?
Verbs: injure, evacuate

Why Natural Disasters? (consequences)
Verbs: cost, exceed

Page 13

Create One New Domain Pattern

Example: “The tornado ripped through Florida”

expand { Active Base Active Infinitive Active Relative Subject }
with {
    ??label = "HAPPENED_IN"
    ??head = $HAPPEN_WORD
    ??subj = in.B("isDisaster")
    ??obj = $Absent
    ??prep = "through" | "in"
    ??pobj1 = in.type == TYPE.LOCATION
    ??pobj2 = $Absent
    ??pobj3 = $Absent
    ??semantics = cerr << "APPEND-IN PATTERN FOUND \n"
}

Page 14

A New Combiner Pattern

Example: “Flood caused by ice and snow”

COMPLEX_NG[10] ==> DISASTER_NG;
    out.cat += #NG;
    out.B["isDisaster"] = true;;

DISASTER_NG ==> #NG[$DISASTER_WORD]:1
    { #VG[$CAUSE_WORD, in.tense == TENSE_PAST] "by" #NG[$DISASTER_CAUSE]
      { "," #NG[$DISASTER_CAUSE] }?
      { ","? "and" #NG[$DISASTER_CAUSE] }?
    }?;
    out.item = in(1).item;;

Page 15

Compilation process

[Figure: compilation pipeline]
“.g” specification file → psc → “.cc” C++ source code file → g++ → “.o” object file → g++ → binary file
Also shown: runtime library

Page 16

Information/Sentence Extraction

Sentence extraction: learn an extraction function that identifies sentences containing information essential to the summary.

METHOD: Abstract decomposition
CASE: Single-document summaries
- based on HMMs to assign probabilities and the Viterbi algorithm to decide the positions (Sentence, Word-in-Sentence) (Jing & McKeown, SIGIR’99)
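The decomposition idea above can be illustrated with a small Viterbi search. This is only a sketch under stated assumptions: each abstract word is aligned to a (sentence, word) position in the document, and the transition scores in `transition_logp` are hypothetical values chosen for illustration, not the ones used by Jing & McKeown or by GISTexter. The names `decompose` and `transition_logp` are likewise invented for this example.

```python
import math
from typing import List, Tuple

Position = Tuple[int, int]  # (sentence index, word index within the sentence)


def transition_logp(prev: Position, cur: Position) -> float:
    """Hypothetical transition scores: moving to the adjacent word is most likely."""
    (ps, pw), (cs, cw) = prev, cur
    if cs == ps and cw == pw + 1:
        p = 0.6  # next word of the same sentence
    elif cs == ps and cw > pw:
        p = 0.2  # later word of the same sentence
    else:
        p = 0.1  # earlier word of the same sentence, or another sentence
    return math.log(p)


def decompose(abstract: List[str], document: List[List[str]]) -> List[Position]:
    """Return the highest-scoring document position for every abstract word."""
    # States for each abstract word: all document positions holding that word.
    candidates = [
        [(s, w) for s, sent in enumerate(document)
         for w, tok in enumerate(sent) if tok == word]
        for word in abstract
    ]
    if any(not c for c in candidates):
        raise ValueError("every abstract word must occur in the document")

    # best[i][k] = (best path score ending in candidates[i][k], backpointer)
    best = [[(0.0, -1) for _ in candidates[0]]]
    for i in range(1, len(abstract)):
        row = []
        for cur in candidates[i]:
            row.append(max(
                (best[i - 1][k][0] + transition_logp(prev, cur), k)
                for k, prev in enumerate(candidates[i - 1])
            ))
        best.append(row)

    # Backtrack from the best final state.
    k = max(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(abstract) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return path[::-1]


document = [["a", "storm", "hit", "the", "coast"],
            ["the", "flood", "damaged", "homes"]]
abstract = ["the", "flood", "hit", "the", "coast"]
positions = decompose(abstract, document)
```

In this toy input the first "the" is aligned to the second sentence (it precedes "flood") and the second "the" to the first sentence (it precedes "coast"), which is the kind of disambiguation the position sequence is meant to resolve.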

Page 17

Decomposition of multiple abstracts

- Maximize similarity to all human-written abstracts.

Abstract 1: Sentences 1, 3, 5, 9
Abstract 2: Sentences 2, 5, 7, 9
Abstract 3: Sentences 2, 4, 7, 9

Step 1: Extract each sentence used by at least one human
Step 2: Rank the sentences
Step 3: Reduce the summary
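The three steps can be sketched on the example abstracts from this slide. The ranking criterion used here (number of abstracts that use a sentence, ties broken by document order) is an assumption made for illustration; the slides rank sentences with a learned function. The function names `pool_and_rank` and `reduce_summary` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def pool_and_rank(abstracts: Dict[str, Set[int]]) -> List[int]:
    """Steps 1 and 2: pool sentences used by any abstract, then rank them."""
    votes = Counter(s for used in abstracts.values() for s in used)
    # Assumed ranking: sentences used by more abstracts first, then document order.
    return sorted(votes, key=lambda s: (-votes[s], s))


def reduce_summary(ranked: List[int], max_sentences: int) -> List[int]:
    """Step 3: keep the top-ranked sentences and restore document order."""
    return sorted(ranked[:max_sentences])


abstracts = {
    "Abstract 1": {1, 3, 5, 9},
    "Abstract 2": {2, 5, 7, 9},
    "Abstract 3": {2, 4, 7, 9},
}
ranked = pool_and_rank(abstracts)
summary = reduce_summary(ranked, max_sentences=4)
```

Sentence 9 is used by all three abstracts, so it ranks first under this criterion; sentences 2, 5 and 7 follow with two uses each.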

Page 18

Interesting Data

Agreements between pairs of human-written abstracts:

$$A = \frac{\#\,\text{common sentences}}{\text{total number of sentences extracted by both humans}}, \qquad 40\% \le A \le 60\%$$

$$D = \frac{\#\,\text{sentences in summary}}{\text{total sentences used by any human-written abstract}}$$
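As a worked example, the two ratios can be computed for three toy abstracts represented as sets of sentence numbers. One assumption is labeled in the code: "total number of sentences extracted by both humans" is read as the union of the two sets. The helper names `agreement` and `summary_ratio` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def agreement(a: Set[int], b: Set[int]) -> float:
    """A: common sentences over all sentences extracted by the two humans.

    Assumption: the denominator is the union of the two extracts.
    """
    return len(a & b) / len(a | b)


def summary_ratio(summary: Set[int], abstracts: Dict[int, Set[int]]) -> float:
    """D: sentences in the summary over sentences used by any human abstract."""
    used = set().union(*abstracts.values())
    return len(summary) / len(used)


abstracts = {1: {1, 3, 5, 9}, 2: {2, 5, 7, 9}, 3: {2, 4, 7, 9}}
pairwise = {
    (x, y): agreement(abstracts[x], abstracts[y])
    for x, y in combinations(sorted(abstracts), 2)
}
d = summary_ratio({2, 5, 7, 9}, abstracts)
```

For these toy sets the pairwise values are 2/6, 1/7 and 3/5, and D is 4/7; they are illustrative only and are not the figures reported on the slide.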

Page 19

Sentence Extraction

Instance-Based Learning (26 features)

Features:

Position-Related
1. Sentence position in document
2. Sentence position in paragraph

Frequency-Related
3. Sum of TF of all terms in sentence
4. Sum of IDF of all terms in sentence
5. Maximal Marginal Relevance
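Features 1 to 4 can be sketched as follows. The assumptions are: the document is already split into paragraphs, sentences and tokens; TF is the raw count of a term in the document; IDF is log(N / df), with terms missing from the document-frequency table treated as df = 1. The function name `sentence_features` and the dictionary keys are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def sentence_features(
    paragraphs: List[List[List[str]]],  # paragraphs -> sentences -> tokens
    doc_freq: Dict[str, int],           # assumed collection document frequencies
    n_docs: int,                        # assumed collection size
) -> List[Dict[str, float]]:
    """Compute position and frequency features for every sentence."""
    tf = Counter(tok for par in paragraphs for sent in par for tok in sent)
    rows = []
    doc_pos = 0
    for par in paragraphs:
        for par_pos, sent in enumerate(par):
            rows.append({
                "pos_in_doc": doc_pos,    # feature 1
                "pos_in_par": par_pos,    # feature 2
                "sum_tf": sum(tf[t] for t in sent),  # feature 3
                "sum_idf": sum(           # feature 4
                    math.log(n_docs / doc_freq.get(t, 1)) for t in sent
                ),
            })
            doc_pos += 1
    return rows


paragraphs = [
    [["storm", "hit", "coast"], ["storm", "moved"]],
    [["coast", "flooded"]],
]
rows = sentence_features(paragraphs, {"storm": 10, "moved": 100}, n_docs=100)
```

Each row could then serve as part of the feature vector handed to the instance-based learner.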

Page 20

Maximal Marginal Relevance

A measure for quantifying the degree of dissimilarity between the sentence being considered and the sentences already selected for extraction (Goldstein & Carbonell).

Suppose S is the set of sentences already selected and R the set of relevant sentences.

$$\mathrm{MMR} = \arg\max_{c_i \in R \setminus S} \Big[ \lambda\, \mathrm{Sim}_1(c_i, \mathrm{Topic}) - (1 - \lambda) \max_{c_j \in S} \mathrm{Sim}_2(c_i, c_j) \Big]$$

[Sim_1 formula: terms TF, IDF, ne and Word_count, constants 1.0 and 10; operators not recoverable]

$$\mathrm{Sim}_2 = \mathrm{weights}(\mathrm{content\_words})$$
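A greedy MMR selection loop can be sketched as below. The slide's own similarity functions are not reproduced here; as a stand-in, both similarities use bag-of-words cosine, and the trade-off parameter `lam` plays the role of the weight between relevance to the topic and redundancy with already selected sentences. The names `cosine` and `mmr_select` and the default `lam=0.7` are assumptions for this example.

```python
import math
from collections import Counter
from typing import List


def cosine(a: List[str], b: List[str]) -> float:
    """Bag-of-words cosine similarity (stand-in for both similarity functions)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) \
        * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def mmr_select(candidates: List[List[str]], topic: List[str],
               k: int, lam: float = 0.7) -> List[int]:
    """Greedily pick k candidate indices, maximizing the MMR score each round."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(candidates[i], topic)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


sentences = [
    ["storm", "hit", "florida"],
    ["storm", "hit", "florida", "coast"],
    ["rescue", "teams", "arrived"],
]
topic = ["storm", "florida", "rescue"]
picked = mmr_select(sentences, topic, k=2)
```

With `lam=0.7` the second pick is the rescue sentence rather than the near-duplicate storm sentence; with `lam=1.0` redundancy is ignored and the near-duplicate is picked instead.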

Page 21

Other features

NE-Related
6. # Person NEs in the sentence
7. # Organization NEs in the sentence
8. # Date NEs in the sentence
9. # Disease NEs in the sentence
10. # Money NEs in the sentence
11. # Location NEs in the sentence

Topic Signature-Related
12-26. Frequency of term in document * weight of term in topic signature

Topic Signatures (Lin & Hovy)
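Features 12-26 can be sketched as one value per signature term. The assumptions are: the topic signature is given as a term-to-weight mapping, the 15 features correspond to its 15 highest-weighted terms, and each value is the document frequency of the term times its signature weight, as the slide states. The function name `topic_signature_features` and the sample weights are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def topic_signature_features(
    document_tokens: List[str],
    signature: Dict[str, float],  # assumed format: term -> signature weight
    n_terms: int = 15,            # assumption: one feature per top-ranked term
) -> List[float]:
    """Return frequency-in-document * signature-weight for the top terms."""
    tf = Counter(document_tokens)
    top = sorted(signature.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    features = [tf[term] * weight for term, weight in top]
    # Pad with zeros so the vector always has n_terms entries.
    return features + [0.0] * (n_terms - len(features))


signature = {"tornado": 3.0, "damage": 2.0, "wind": 1.0}
tokens = ["tornado", "wind", "tornado", "roof"]
features = topic_signature_features(tokens, signature, n_terms=3)
```

Terms that belong to the signature but do not occur in the document contribute a zero, so the vector length stays fixed across documents.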

Page 22

Ranking Table (Single-Doc)

G         C         O         T
R(1)      V(1)      V(1)      V(1)
P(2)      R(2)      R(2)      R(2)
Q(3)      O(3)      T(3)      O(3)
O(4)      T(4)      O(4)      T(4)
V(5)      W(5)      P(5)      P(5)
S(6)      P(6)      W(6)      Q(6)
T(7)      Q(7)      X(7)      W(7)
X(8)      X(8)      Y(8)      X(8)
W(9)      Y(9)      Q(9)      S(9)
Z(10)     S(10)     S(10)     Y(10)
Y(11)     Z(11)     Z(11)     Z(11)
L(*)      L(*)      L(*)      L(*)
M(*)      M(*)      M(*)      M(*)
N,U(*)    N,U(*)    N,U(*)    M,U(*)

Page 23

Ranking Table (Multiple-Doc)

G         C         O         T
O(1)      T(1)      T(1)      T(1)
L(2)      O(2)      R(2)      O(2)
P(3)      R(3)      O(3)      R(3)
N(4)      M(4)      M(4)      N(4)
R(5)      N(5)      N(5)      P(5)
S(6)      P(6)      P(6)      M(6)
T(7)      S(7)      S(7)      L(7)
M(8)      L(8)      Z(8)      S(8)
Z(9)      Z(9)      L(9)      Z(9)
U(10)     Y(10)     Y(10)     Y(10)
W(11)     W(11)     W(11)     W(11)
Y(12)     U(12)     U(12)     U(*)
Q(*)      Q(*)      Q(*)      Q(*)
V,X(*)    V,X(*)    V,X(*)    V,X(*)

Page 24

Details

http://www.seas.smu.edu/~sanda/duc.ps.gz
http://www.cs.utexas.edu/users/sanda/duc.ps.gz