TRANSCRIPT
GISTexter: A System for Summarizing Text Documents
Sanda Harabagiu, Dan Moldovan, Paul Morarescu, Finley Lacatusu, Rada Mihalcea, Vasile Rus and Roxana Girju
Department of Computer Sciences, The University of Texas at Austin
Austin, TX 78712-1188
Department of Computer Science, The University of Texas at Dallas
Richardson, TX 75083-0688
Dept. of Computer Science & Engr., Southern Methodist University
Dallas, TX 75275-0122
Language Computer Corporation, 6440 N. Central Expressway
Dallas, TX 75206
Outline
1. Background
2. System Architecture
3. Single-Document Summaries
4. Multi-Document Summaries
5. Results and Conclusions
GISTexter: generating summaries as similar as possible to human-written abstracts.

Two assumptions:
(1) Single-document summaries: extract the same information a human would consider when writing an abstract of the same document.
(2) Multi-document summaries: capture textual information shared across the document set.
Our interest: Multi-Document Summaries applicable to Question-Answering
Enables the use of IE technology!
Need domain information:
- use CICERO for topics that are already encoded in it
- develop a back-up solution: gisting information by combining cohesion and coherence indicators for sentence extraction.
What is gisting?
- An activity in which the information taken into account is less than the full information content available.
Empirical Principles:
- Named Entities common to the set of documents are anchors for argument structures that act like ad-hoc templates.
- Sometimes cue phrases indicate coherence with related information that should be gleaned into the summary.
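The named-entity anchoring idea can be sketched in a few lines of Python. This is a toy illustration: the function names and the substring-based entity test are assumptions, not the system's actual NE recognizer.

```python
from collections import defaultdict

def common_entity_anchors(doc_entities):
    """Named entities shared by every document in the set.
    doc_entities: one set of entity strings per document (assumed format)."""
    return set.intersection(*doc_entities)

def adhoc_templates(sentences, anchors):
    """Group sentences around each shared entity, so each anchor acts
    like a slot in an ad-hoc template (rough sketch of the principle)."""
    slots = defaultdict(list)
    for sent in sentences:
        for ent in anchors:
            if ent in sent:
                slots[ent].append(sent)
    return dict(slots)
```

Sentences that mention no shared entity fall outside every ad-hoc template and would be candidates for the cohesion/coherence back-up route.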
[System architecture figure]
GISTexter takes an input article (single-document case) or an input article set (multi-document case) and tests "Known Topic?":
- Yes: the CICERO Information Extraction System fills Extraction Templates that feed Content-Based Planning and Generation.
- No: a back-up path performs Cohesion- and Coherence-Based Extraction.
The Single-Document Summarizer applies Single-Document Decomposition, Sentence Extraction, Sentence Compression, and Summary Reduction; the Multi-Document Summarizer applies Multi-Document Decomposition. Both draw on WordNet and a corpus of human-written abstracts, and produce the Summary.

CICERO pipeline stages: Tokenizer + Preprocessor, Named-Entity Recognizer, Part-of-Speech Disambiguator, Named-Entity Alias Recognition, Phrasal Parser, Combiner, Entity Coreference, Domain-Event Recognizer, Domain Coreference, Merging of Event Information.
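The "Known Topic?" decision in the architecture amounts to a simple dispatch between the IE route and the back-up route. A minimal Python sketch follows; all function bodies are hypothetical stubs standing in for the real components.

```python
def cicero_extract(documents, topic):
    # Stub for CICERO template extraction (hypothetical behavior).
    return [f"[{topic}] " + d for d in documents]

def generate_from_templates(templates):
    # Stub for content-based planning and generation.
    return " ".join(templates)

def cohesion_coherence_extract(documents):
    # Stub for the back-up route: here, just the first sentence of each document.
    return [d.split(".")[0] + "." for d in documents]

def compress_and_reduce(sentences):
    # Stub for sentence compression and summary reduction.
    return " ".join(sentences)

def gistexter(documents, topic, known_topics):
    """Top-level control flow from the figure: IE-based summarization
    for topics encoded in CICERO, cohesion/coherence extraction otherwise."""
    if topic in known_topics:
        return generate_from_templates(cicero_extract(documents, topic))
    return compress_and_reduce(cohesion_coherence_extract(documents))
```

The same dispatch serves both the single-document and multi-document summarizers; only the decomposition step behind the stubs differs.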
CICERO: Technical Details
World knowledge is incorporated in the implementation infrastructure:
- Template Object hierarchy: Person, Location, Organization, Money, Measures, Date
- MergeTemplate and DocumentTemplate
- state template and transition template, supporting compare and merge operations
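A rough Python sketch of the template-object idea, with the compare and merge operations named on the slide. The slot names and merge semantics are assumptions based on the slide, not CICERO's actual implementation.

```python
class Template:
    """Minimal template object: slots for Person, Location, Organization,
    Money, Measures, Date, plus compare and merge operations."""
    SLOTS = ("person", "location", "organization", "money", "measure", "date")

    def __init__(self, **slots):
        self.slots = {k: slots.get(k) for k in self.SLOTS}

    def compare(self, other):
        """Slot names on which two templates agree (both filled, same value)."""
        return {k for k in self.SLOTS
                if self.slots[k] is not None and self.slots[k] == other.slots[k]}

    def merge(self, other):
        """Fill the empty slots of this template from another one
        (a guess at what MergeTemplate does when unifying state templates)."""
        merged = dict(self.slots)
        for k, v in other.slots.items():
            if merged[k] is None:
                merged[k] = v
        return Template(**{k: v for k, v in merged.items() if v is not None})
```

A DocumentTemplate could then be modeled as the result of repeatedly merging compatible event templates found in the same document.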
Novelties introduced by CICERO
- Incorporation of World Knowledge: extends the IE paradigm with linguistic patterns that are non-deterministic and capable of handling ambiguities (unlike lex & yacc); a specification language conceived to extend and integrate with C/C++.
- Qualitative improvements: use of a full parser that is fast enough; improved grammars.
Novelties introduced by CICERO (cont'd)
- Ease of Use and Customization: explicit domain and combiner rules; minimalist rules that are expanded by a built-in compiler.
- Quantitative improvements: unprecedented F-measure of 78.8%; unprecedented speed, averaging 2.2 seconds/document.
The Organization of Knowledge in CICERO
Layer 1: Rule Compiler and Run Time System
Layer 2: Information Extraction Phrases
Layer 3: Domain Knowledge
Layer 4: World Knowledge
New Domain: Natural Disasters
Think of new words characteristic of the domain:
- Nouns: tornado, flood, ice storm
- Verbs: happen, fear, rip, hit

World Knowledge
- What causes disasters? Nouns: ice, storm; Verbs: bring, raise
- How are people affected? Verbs: injure, evacuate
- Why natural disasters? Consequences; Verbs: cost, exceed
Create One New Domain Pattern
Example: “The tornado ripped through Florida”
expand {
  Active Base
  Active Infinitive
  Active Relative Subject
}
with {
  ??label     = "HAPPENED_IN"
  ??head      = $HAPPEN_WORD
  ??subj      = in.B("isDisaster")
  ??obj       = $Absent
  ??prep      = "through" | "in"
  ??pobj1     = in.type == TYPE.LOCATION
  ??pobj2     = $Absent
  ??pobj3     = $Absent
  ??semantics = cerr << "HAPPENED-IN PATTERN FOUND\n"
}
A New Combiner Pattern
Example: “Flood caused by ice and snow”
COMPLEX_NG[10] ==> DISASTER_NG;
    out.cat += #NG;
    out.B["isDisaster"] = true;;

DISASTER_NG ==> #NG[$DISASTER_WORD]:1
    { #VG[$CAUSE_WORD, in.tense == TENSE_PAST]
      "by" #NG[$DISASTER_CAUSE]
      { "," #NG[$DISASTER_CAUSE] }?
      { ","? "and" #NG[$DISASTER_CAUSE] }?
    }?;
    out.item = in(1).item;;
Compilation process
The ".g" specification file is compiled by psc into a ".cc" C++ source code file; g++ compiles that into a ".o" object file; g++ then links the object file against the runtime library to produce the binary.
Information/Sentence Extraction
Sentence Extraction: learn an extraction function that identifies sentences containing information essential to the summary.
METHOD: Abstract decomposition
CASE: Single-document summaries: based on HMMs to assign probabilities and the Viterbi algorithm to decide the positions (Sentence, Word-in-Sentence) (Jing & McKeown, SIGIR'99)
Decomposition of multiple abstracts
- Maximize similarity to all human-written abstracts.

Abstract 1: Sentences 1, 3, 5, 9
Abstract 2: Sentences 2, 5, 7, 9
Abstract 3: Sentences 2, 4, 7, 9
Step 1: Extract each sentence used by at least one human.
Step 2: Rank the sentences.
Step 3: Reduce the summary.
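The three steps can be sketched directly in Python. The ranking key (number of abstracts that use a sentence, ties broken by position) and the length budget k are assumptions; the real system ranks with the learned extraction function.

```python
from collections import Counter

def decompose_abstracts(abstract_sentences):
    """Steps 1 and 2: take every sentence used by at least one human
    abstractor, then rank by how many abstracts use it."""
    counts = Counter(s for abstract in abstract_sentences for s in abstract)
    # Sort by descending agreement, breaking ties by sentence position.
    return sorted(counts, key=lambda s: (-counts[s], s))

def reduce_summary(ranked, k):
    """Step 3: keep the top-k ranked sentences, restored to document order."""
    return sorted(ranked[:k])
```

On the three abstracts above, sentence 9 (used by all three humans) ranks first, and a four-sentence budget recovers {2, 5, 7, 9}.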
Interesting Data
Agreements between pairs of human-written abstracts:

A = (# common sentences) / (# total sentences extracted by both humans)
40% <= A <= 60%

D = (# sentences in summary) / (# total sentences used by any human-written abstract)
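A small sketch of the agreement measure A, taking "total sentences extracted by both humans" to mean the union of the two extracts (an assumption; the slide does not spell out the denominator).

```python
def agreement(extract_a, extract_b):
    """Agreement A between two human extracts: shared sentences over
    the union of all sentences either human extracted."""
    common = set(extract_a) & set(extract_b)
    total = set(extract_a) | set(extract_b)
    return len(common) / len(total)
```

On the example abstracts 1 and 2 above, the extracts share sentences 5 and 9 out of six distinct sentences, giving A = 1/3, just below the 40%-60% band observed in the data.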
Sentence Extraction: Instance-Based Learning (26 features)

Position-related features:
1. Sentence position in document
2. Sentence position in paragraph

Frequency-related features:
3. Sum of TF of all terms in sentence
4. Sum of IDF of all terms in sentence
5. Maximal Marginal Relevance
Maximal Marginal Relevance
A measure for quantifying the degree of dissimilarity between the sentence being considered and the sentences already selected for extraction. (Goldstein & Carbonell)
Suppose S is the set of sentences selected; R the set of relevant sentences.
MMR = arg max_{c_i in R\S} [ lambda * Sim1(c_i, Topic) - (1 - lambda) * max_{c_j in S} Sim2(c_i, c_j) ]

Sim1(c_i, Topic) = ((TF / 0.1) * 10 * IDF) / Word_count
Sim2(c_i, c_j) = sum of weights(content_words)
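One selection step of MMR can be written directly from the formula. The similarity functions Sim1 and Sim2 are supplied by the caller, and lam=0.7 is an arbitrary default here, not a value from the paper.

```python
def mmr_select(candidates, selected, sim_topic, sim_pair, lam=0.7):
    """One MMR step (Goldstein & Carbonell): pick the candidate maximizing
    lam * Sim1(c, Topic) - (1 - lam) * max over selected s of Sim2(c, s)."""
    def score(c):
        # Redundancy penalty: similarity to the closest already-selected sentence.
        redundancy = max((sim_pair(c, s) for s in selected), default=0.0)
        return lam * sim_topic(c) - (1 - lam) * redundancy
    return max(candidates, key=score)
```

Iterating this step until the length budget is exhausted yields a summary that balances topic relevance against redundancy with what has already been chosen.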
Other features
NE-related features:
6. # Person NEs in the sentence
7. # Organization NEs in the sentence
8. # Date NEs in the sentence
9. # Disease NEs in the sentence
10. # Money NEs in the sentence
11. # Location NEs in the sentence

Topic-signature-related features:
12-26. Frequency of term in document * weight of term in topic signature
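The feature vector for instance-based learning can be assembled as follows. This is a sketch over precomputed inputs: all input formats are assumptions, and feature 5 (MMR, computed against the running selection) is omitted for brevity.

```python
def sentence_features(sent_idx, para_idx, terms, tf, idf, ne_counts, topic_weights):
    """Assemble (most of) the feature vector from the lists above:
    position features 1-2, frequency features 3-4, NE counts 6-11,
    and topic-signature features 12-26."""
    feats = [sent_idx,                                # 1: position in document
             para_idx,                                # 2: position in paragraph
             sum(tf.get(t, 0) for t in terms),        # 3: sum of TF
             sum(idf.get(t, 0.0) for t in terms)]     # 4: sum of IDF
    # 6-11: counts of Person, Organization, Date, Disease, Money, Location NEs.
    feats += [ne_counts.get(k, 0) for k in
              ("person", "organization", "date", "disease", "money", "location")]
    # 12-26: term frequency in document * weight of term in topic signature.
    feats += [tf.get(t, 0) * w for t, w in topic_weights]
    return feats
```

An instance-based learner (e.g. nearest neighbors over abstract-decomposition training instances) then decides from this vector whether the sentence should be extracted.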
Topic Signatures (Lin & Hovy)
Ranking Table (Single-Doc)
G      C      O      T
R(1) V(1) V(1) V(1)
P(2) R(2) R(2) R(2)
Q(3) O(3) T(3) O(3)
O(4) T(4) O(4) T(4)
V(5) W(5) P(5) P(5)
S(6) P(6) W(6) Q(6)
T(7) Q(7) X(7) W(7)
X(8) X(8) Y(8) X(8)
W(9) Y(9) Q(9) S(9)
Z(10) S(10) S(10) Y(10)
Y(11) Z(11) Z(11) Z(11)
L(*) L(*) L(*) L(*)
M(*) M(*) M(*) M(*)
N,U(*) N,U(*) N,U(*) M,U(*)
Ranking Table (Multiple-Doc)
G      C      O      T
O(1) T(1) T(1) T(1)
L(2) O(2) R(2) O(2)
P(3) R(3) O(3) R(3)
N(4) M(4) M(4) N(4)
R(5) N(5) N(5) P(5)
S(6) P(6) P(6) M(6)
T(7) S(7) S(7) L(7)
M(8) L(8) Z(8) S(8)
Z(9) Z(9) L(9) Z(9)
U(10) Y(10) Y(10) Y(10)
W(11) W(11) W(11) W(11)
Y(12) U(12) U(12) U(*)
Q(*) Q(*) Q(*) Q(*)
V,X(*) V,X(*) V,X(*) V,X(*)
Details
http://www.seas.smu.edu/~sanda/duc.ps.gz
http://www.cs.utexas.edu/users/sanda/duc.ps.gz