TRANSCRIPT
GISTexter: A System for Summarizing Text Documents
Sanda Harabagiu, Dan Moldovan, Paul Morarescu, Finley Lacatusu, Rada Mihalcea, Vasile Rus and Roxana Girju
Department of Computer Sciences, The University of Texas at Austin
Austin, TX 78712-1188
Department of Computer Science, The University of Texas at Dallas
Richardson, TX 75083-0688
Dept. of Computer Science & Engr., Southern Methodist University
Dallas, TX 75275-0122
Language Computer Corporation, 6440 N. Central Expressway
Dallas, TX 75206
Outline
1. Background
2. System Architecture
3. Single-Document Summaries
4. Multi-Document Summaries
5. Results and Conclusions
GISTexter: generating summaries as similar as possible to human-written abstracts.

Two assumptions:
(1) Single-document summaries: extract the same information a human would consider when writing an abstract of the same document.
(2) Multi-document summaries: capture textual information shared across the document set.
Our interest: Multi-Document Summaries applicable to Question-Answering
Enables the use of IE technology!
Need domain information:
- use CICERO for topics that are already encoded in it
- develop a back-up solution: gisting information by combining cohesion and coherence indicators for sentence extraction.
What is gisting?
- An activity in which the information taken into account is less than the full information content available.
Empirical Principles:
- Named Entities common to the set of documents are anchors for argument structures that act like ad-hoc templates.
- Sometimes cue phrases indicate coherence with related information that should be gleaned into the summary.
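The named-entity anchoring idea can be sketched in a few lines of Python. This is a toy illustration: the function names and the substring-based entity test are assumptions, not the system's actual NE recognizer.

```python
from collections import defaultdict

def common_entity_anchors(doc_entities):
    """Named entities shared by every document in the set.
    doc_entities: one set of entity strings per document (assumed format)."""
    return set.intersection(*doc_entities)

def adhoc_templates(sentences, anchors):
    """Group sentences around each shared entity, so each anchor acts
    like a slot in an ad-hoc template (rough sketch of the principle)."""
    slots = defaultdict(list)
    for sent in sentences:
        for ent in anchors:
            if ent in sent:
                slots[ent].append(sent)
    return dict(slots)
```

Sentences that mention no shared entity fall outside every ad-hoc template and would be candidates for the cohesion/coherence back-up route.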
[System architecture figure]
GISTexter takes an input article (single-document case) or an input article set (multi-document case) and tests "Known Topic?":
- Yes: the CICERO Information Extraction System fills Extraction Templates that feed Content-Based Planning and Generation.
- No: a back-up path performs Cohesion- and Coherence-Based Extraction.
The Single-Document Summarizer applies Single-Document Decomposition, Sentence Extraction, Sentence Compression, and Summary Reduction; the Multi-Document Summarizer applies Multi-Document Decomposition. Both draw on WordNet and a corpus of human-written abstracts, and produce the Summary.

CICERO pipeline stages: Tokenizer + Preprocessor, Named-Entity Recognizer, Part-of-Speech Disambiguator, Named-Entity Alias Recognition, Phrasal Parser, Combiner, Entity Coreference, Domain-Event Recognizer, Domain Coreference, Merging of Event Information.
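The "Known Topic?" decision in the architecture amounts to a simple dispatch between the IE route and the back-up route. A minimal Python sketch follows; all function bodies are hypothetical stubs standing in for the real components.

```python
def cicero_extract(documents, topic):
    # Stub for CICERO template extraction (hypothetical behavior).
    return [f"[{topic}] " + d for d in documents]

def generate_from_templates(templates):
    # Stub for content-based planning and generation.
    return " ".join(templates)

def cohesion_coherence_extract(documents):
    # Stub for the back-up route: here, just the first sentence of each document.
    return [d.split(".")[0] + "." for d in documents]

def compress_and_reduce(sentences):
    # Stub for sentence compression and summary reduction.
    return " ".join(sentences)

def gistexter(documents, topic, known_topics):
    """Top-level control flow from the figure: IE-based summarization
    for topics encoded in CICERO, cohesion/coherence extraction otherwise."""
    if topic in known_topics:
        return generate_from_templates(cicero_extract(documents, topic))
    return compress_and_reduce(cohesion_coherence_extract(documents))
```

The same dispatch serves both the single-document and multi-document summarizers; only the decomposition step behind the stubs differs.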
CICERO: Technical Details
World knowledge is incorporated in the implementation infrastructure:
- Template Object hierarchy: Person, Location, Organization, Money, Measures, Date
- MergeTemplate and DocumentTemplate
- state template and transition template, supporting compare and merge operations
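A rough Python sketch of the template-object idea, with the compare and merge operations named on the slide. The slot names and merge semantics are assumptions based on the slide, not CICERO's actual implementation.

```python
class Template:
    """Minimal template object: slots for Person, Location, Organization,
    Money, Measures, Date, plus compare and merge operations."""
    SLOTS = ("person", "location", "organization", "money", "measure", "date")

    def __init__(self, **slots):
        self.slots = {k: slots.get(k) for k in self.SLOTS}

    def compare(self, other):
        """Slot names on which two templates agree (both filled, same value)."""
        return {k for k in self.SLOTS
                if self.slots[k] is not None and self.slots[k] == other.slots[k]}

    def merge(self, other):
        """Fill the empty slots of this template from another one
        (a guess at what MergeTemplate does when unifying state templates)."""
        merged = dict(self.slots)
        for k, v in other.slots.items():
            if merged[k] is None:
                merged[k] = v
        return Template(**{k: v for k, v in merged.items() if v is not None})
```

A DocumentTemplate could then be modeled as the result of repeatedly merging compatible event templates found in the same document.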
Novelties introduced by CICERO
- Incorporation of World Knowledge: extends the IE paradigm with linguistic patterns that are non-deterministic and capable of handling ambiguities (unlike lex & yacc); a specification language conceived to extend and integrate with C/C++.
- Qualitative improvements: use of a full parser that is fast enough; improved grammars.
Novelties introduced by CICERO (cont'd)
- Ease of Use and Customization: explicit domain and combiner rules; minimalist rules that are expanded by a built-in compiler.
- Quantitative improvements: unprecedented F-measure of 78.8%; unprecedented speed, averaging 2.2 seconds/document.
The Organization of Knowledge in CICERO
Layer 1: Rule Compiler and Run Time System
Layer 2: Information Extraction Phrases
Layer 3: Domain Knowledge
Layer 4: World Knowledge
New Domain: Natural Disasters
Think of new words characteristic of the domain:
- Nouns: tornado, flood, ice storm
- Verbs: happen, fear, rip, hit

World Knowledge
- What causes disasters? Nouns: ice, storm; Verbs: bring, raise
- How are people affected? Verbs: injure, evacuate
- Why natural disasters? Consequences; Verbs: cost, exceed
Create One New Domain Pattern
Example: “The tornado ripped through Florida”
expand {
  Active Base
  Active Infinitive
  Active Relative Subject
}
with {
  ??label     = "HAPPENED_IN"
  ??head      = $HAPPEN_WORD
  ??subj      = in.B("isDisaster")
  ??obj       = $Absent
  ??prep      = "through" | "in"
  ??pobj1     = in.type == TYPE.LOCATION
  ??pobj2     = $Absent
  ??pobj3     = $Absent
  ??semantics = cerr << "HAPPENED-IN PATTERN FOUND\n"
}
A New Combiner Pattern
Example: “Flood caused by ice and snow”
COMPLEX_NG[10] ==> DISASTER_NG;
    out.cat += #NG;
    out.B["isDisaster"] = true;;

DISASTER_NG ==> #NG[$DISASTER_WORD]:1
    { #VG[$CAUSE_WORD, in.tense == TENSE_PAST]
      "by" #NG[$DISASTER_CAUSE]
      { "," #NG[$DISASTER_CAUSE] }?
      { ","? "and" #NG[$DISASTER_CAUSE] }?
    }?;
    out.item = in(1).item;;
Compilation process
The ".g" specification file is compiled by psc into a ".cc" C++ source code file; g++ compiles that into a ".o" object file; g++ then links the object file against the runtime library to produce the binary.
Information/Sentence Extraction
Sentence Extraction: learn an extraction function that identifies sentences containing information essential to the summary.
METHOD: Abstract decomposition
CASE: Single-document summaries: based on HMMs to assign probabilities and the Viterbi algorithm to decide the positions (Sentence, Word-in-Sentence) (Jing & McKeown, SIGIR'99)
Decomposition of multiple abstracts
- Maximize similarity to all human-written abstracts.

Abstract 1: Sentences 1, 3, 5, 9
Abstract 2: Sentences 2, 5, 7, 9
Abstract 3: Sentences 2, 4, 7, 9
Step 1: Extract each sentence used by at least one human.
Step 2: Rank the sentences.
Step 3: Reduce the summary.
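The three steps can be sketched directly in Python. The ranking key (number of abstracts that use a sentence, ties broken by position) and the length budget k are assumptions; the real system ranks with the learned extraction function.

```python
from collections import Counter

def decompose_abstracts(abstract_sentences):
    """Steps 1 and 2: take every sentence used by at least one human
    abstractor, then rank by how many abstracts use it."""
    counts = Counter(s for abstract in abstract_sentences for s in abstract)
    # Sort by descending agreement, breaking ties by sentence position.
    return sorted(counts, key=lambda s: (-counts[s], s))

def reduce_summary(ranked, k):
    """Step 3: keep the top-k ranked sentences, restored to document order."""
    return sorted(ranked[:k])
```

On the three abstracts above, sentence 9 (used by all three humans) ranks first, and a four-sentence budget recovers {2, 5, 7, 9}.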
Interesting Data
Agreements between pairs of human-written abstracts:

A = (# common sentences) / (# total sentences extracted by both humans)
40% <= A <= 60%

D = (# sentences in summary) / (# total sentences used by any human-written abstract)
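A small sketch of the agreement measure A, taking "total sentences extracted by both humans" to mean the union of the two extracts (an assumption; the slide does not spell out the denominator).

```python
def agreement(extract_a, extract_b):
    """Agreement A between two human extracts: shared sentences over
    the union of all sentences either human extracted."""
    common = set(extract_a) & set(extract_b)
    total = set(extract_a) | set(extract_b)
    return len(common) / len(total)
```

On the example abstracts 1 and 2 above, the extracts share sentences 5 and 9 out of six distinct sentences, giving A = 1/3, just below the 40%-60% band observed in the data.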
Sentence Extraction: Instance-Based Learning (26 features)

Position-related features:
1. Sentence position in document
2. Sentence position in paragraph

Frequency-related features:
3. Sum of TF of all terms in sentence
4. Sum of IDF of all terms in sentence
5. Maximal Marginal Relevance
Maximal Marginal Relevance
A measure for quantifying the degree of dissimilarity between the sentence being considered and the sentences already selected for extraction. (Goldstein & Carbonell)
Suppose S is the set of sentences selected; R the set of relevant sentences.
MMR = arg max_{c_i in R\S} [ lambda * Sim1(c_i, Topic) - (1 - lambda) * max_{c_j in S} Sim2(c_i, c_j) ]

Sim1(c_i, Topic) = ((TF / 0.1) * 10 * IDF) / Word_count
Sim2(c_i, c_j) = sum of weights(content_words)
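One selection step of MMR can be written directly from the formula. The similarity functions Sim1 and Sim2 are supplied by the caller, and lam=0.7 is an arbitrary default here, not a value from the paper.

```python
def mmr_select(candidates, selected, sim_topic, sim_pair, lam=0.7):
    """One MMR step (Goldstein & Carbonell): pick the candidate maximizing
    lam * Sim1(c, Topic) - (1 - lam) * max over selected s of Sim2(c, s)."""
    def score(c):
        # Redundancy penalty: similarity to the closest already-selected sentence.
        redundancy = max((sim_pair(c, s) for s in selected), default=0.0)
        return lam * sim_topic(c) - (1 - lam) * redundancy
    return max(candidates, key=score)
```

Iterating this step until the length budget is exhausted yields a summary that balances topic relevance against redundancy with what has already been chosen.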
Other features
NE-related features:
6. # Person NEs in the sentence
7. # Organization NEs in the sentence
8. # Date NEs in the sentence
9. # Disease NEs in the sentence
10. # Money NEs in the sentence
11. # Location NEs in the sentence

Topic-signature-related features:
12-26. Frequency of term in document * weight of term in topic signature
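The feature vector for instance-based learning can be assembled as follows. This is a sketch over precomputed inputs: all input formats are assumptions, and feature 5 (MMR, computed against the running selection) is omitted for brevity.

```python
def sentence_features(sent_idx, para_idx, terms, tf, idf, ne_counts, topic_weights):
    """Assemble (most of) the feature vector from the lists above:
    position features 1-2, frequency features 3-4, NE counts 6-11,
    and topic-signature features 12-26."""
    feats = [sent_idx,                                # 1: position in document
             para_idx,                                # 2: position in paragraph
             sum(tf.get(t, 0) for t in terms),        # 3: sum of TF
             sum(idf.get(t, 0.0) for t in terms)]     # 4: sum of IDF
    # 6-11: counts of Person, Organization, Date, Disease, Money, Location NEs.
    feats += [ne_counts.get(k, 0) for k in
              ("person", "organization", "date", "disease", "money", "location")]
    # 12-26: term frequency in document * weight of term in topic signature.
    feats += [tf.get(t, 0) * w for t, w in topic_weights]
    return feats
```

An instance-based learner (e.g. nearest neighbors over abstract-decomposition training instances) then decides from this vector whether the sentence should be extracted.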
Topic Signatures (Lin & Hovy)
Ranking Table (Single-Doc)
G      C      O      T
R(1) V(1) V(1) V(1)
P(2) R(2) R(2) R(2)
Q(3) O(3) T(3) O(3)
O(4) T(4) O(4) T(4)
V(5) W(5) P(5) P(5)
S(6) P(6) W(6) Q(6)
T(7) Q(7) X(7) W(7)
X(8) X(8) Y(8) X(8)
W(9) Y(9) Q(9) S(9)
Z(10) S(10) S(10) Y(10)
Y(11) Z(11) Z(11) Z(11)
L(*) L(*) L(*) L(*)
M(*) M(*) M(*) M(*)
N,U(*) N,U(*) N,U(*) M,U(*)
Ranking Table (Multiple-Doc)
G      C      O      T
O(1) T(1) T(1) T(1)
L(2) O(2) R(2) O(2)
P(3) R(3) O(3) R(3)
N(4) M(4) M(4) N(4)
R(5) N(5) N(5) P(5)
S(6) P(6) P(6) M(6)
T(7) S(7) S(7) L(7)
M(8) L(8) Z(8) S(8)
Z(9) Z(9) L(9) Z(9)
U(10) Y(10) Y(10) Y(10)
W(11) W(11) W(11) W(11)
Y(12) U(12) U(12) U(*)
Q(*) Q(*) Q(*) Q(*)
V,X(*) V,X(*) V,X(*) V,X(*)
Details
http://www.seas.smu.edu/~sanda/duc.ps.gz
http://www.cs.utexas.edu/users/sanda/duc.ps.gz