Summarization of XML Documents
Kondreddi Sarath Kumar
Outline
I. Motivation
II. System for XML Summarization
III. Ranking Model and Summary Generation
IV. User Evaluation
V. Xoom tool and a few example summaries
VI. Conclusion
Motivation
XML Document Collection (e.g., IMDB)
XML Document
Types of XML Document Summaries
1) Generic summary – summarizes the entire contents of the document.
2) Query-biased summary – summarizes those parts of the document that are relevant to the user’s query.
Aims
We aim at summaries that are:
• Generated automatically
• Highly constrained by size
• Highly informative
• High in coverage
Challenges
• Structure is as important as text
• Varying text length
System for XML Summarization
[Figure: system architecture. Components: Info Unit Generator, Ranking Unit (Tag Ranker, Text Ranker), Summary Generator. Inputs: XML document, corpus statistics, summary size. Intermediate data: tag units, text units, ranked tag units, ranked text units. Output: summary.]
Information Units of an XML Document
Tag
- Regarded as metadata
- Can be highly redundant
Text
- An instance of the tag
- Much less redundant
- Varies in size
Ranking Unit
I. Tag Ranking
Typicality: How salient is the tag in the corpus?
E.g.: <title>
• Typical tags define the context of the document
• Occur regularly in most or all of the documents
• Quantified by the fraction of documents in which the tag occurs (df)
Specialty: Does the tag occur more or less frequently in this document than usual?
• Special tags denote a special aspect of the current document
• Occur many more or many fewer times in the current document than usual
• Quantified by the deviation from the average number of occurrences per document
Combined tag score: $P(T_i) = \alpha \, P_{typ}(T_i) + (1 - \alpha) \, P_{spe}(T_i)$
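A minimal sketch of how typicality and specialty could be combined into one tag score. Everything beyond the slide's wording is an assumption: the function name `tag_scores`, the representation of a document as a plain list of tag names, the averaging of occurrences over all documents in the corpus, the normalization of each score over the tags of the current document, and the reading of the evaluation's alpha as the mixing weight.

```python
from collections import Counter

def tag_scores(corpus, doc, alpha=0.8):
    """Score each tag of `doc` by mixing typicality and specialty.

    corpus: list of documents, each a list of tag names (hypothetical format)
    doc:    the document to summarize, also a list of tag names
    alpha:  weight on typicality; (1 - alpha) goes to specialty (assumed)
    """
    n_docs = len(corpus)
    df = Counter()      # number of documents in which each tag occurs
    total = Counter()   # total occurrences of each tag in the corpus
    for d in corpus:
        counts = Counter(d)
        total.update(counts)
        df.update(counts.keys())

    doc_counts = Counter(doc)
    # Typicality: fraction of documents in which the tag occurs.
    typ = {t: df[t] / n_docs for t in doc_counts}
    # Specialty: deviation from the average number of occurrences per document.
    spe = {t: abs(doc_counts[t] - total[t] / n_docs) for t in doc_counts}

    def normalize(scores):
        # Assumption: turn raw scores into a distribution over the document's tags.
        s = sum(scores.values())
        return {t: (v / s if s else 0.0) for t, v in scores.items()}

    p_typ, p_spe = normalize(typ), normalize(spe)
    return {t: alpha * p_typ[t] + (1 - alpha) * p_spe[t] for t in doc_counts}


if __name__ == "__main__":
    corpus = [
        ["title", "actor", "actor"],
        ["title", "actor"],
        ["title", "trivia", "trivia", "trivia", "trivia"],
    ]
    print(tag_scores(corpus, corpus[2], alpha=0.8))
```

With alpha = 1 only typicality counts, so a tag such as title that occurs in every document ranks first; lowering alpha shifts weight toward tags that occur unusually often or rarely in the current document.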
II. Text Ranking
Two categories of text
1) Entities
2) Regular text
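Ranking is done based on the context of occurrence of a text unit $t_j$ under tag $T_i$ in document $D$. The score $P(t_j \mid D, T_i)$ combines three terms:
- Tag context: $P(t_j \mid D, c(T_i))$
- Document context: $P(t_j \mid D)$
- Corpus context: $P(t_j \mid C)$, which enters with a complementary (1 − weight) factor
Text under a tag falls into one of two cases:
- No redundancy in tag context (e.g., actor names, genre)
- Redundancy in tag context (e.g., plots, goofs, trivia items)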
Correlated tags and text
Related tag units are often siblings of each other. E.g.: Actor and Role
Inclusion Principle
Let $T_{sib} = \{T_1, T_2, T_3, \ldots, T_k\}$ and $rank(T_1) \ge rank(T_2) \ge \ldots \ge rank(T_k)$.
Case 1: $rank(T_1) = rank(T_2) = \ldots = rank(T_j)$ where $j \le k$
- All of $\{T_1, \ldots, T_j\}$ have to be included at once.
- Choose a random $T_i$ from $\{T_1, \ldots, T_j\}$, include its best-ranked text value and its siblings.
Case 2: $rank(T_1) > rank(T_2)$ (also the other $T_i$'s)
- Only $T_1$ is currently included, with its best text value $t_1$.
- At a later stage, if $T_2$ is considered for inclusion, then the text value $t_2$ which is a sibling of $t_1$ is included, and so on.
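The two cases of the inclusion principle can be illustrated with a short sketch. It is an illustration under assumptions, not the tool's implementation: the function name `include_siblings`, the data layout (one dict per sibling group, such as one casting element holding an actor and a role), the convention that a higher rank is better, and exact equality as the test for tied ranks are all hypothetical.

```python
import random

def include_siblings(ranks, groups, text_score, rng=random):
    """Return the text values that enter the summary first.

    ranks:      tag -> rank of the tag (higher is better; assumed)
    groups:     list of dicts, tag -> text value, one dict per sibling group
    text_score: (tag, text value) -> score of that text value
    rng:        source of randomness with a choice() method
    """
    top = max(ranks.values())
    # Tags tied at the top rank. One tag means Case 2, several mean Case 1.
    tied = [t for t in ranks if ranks[t] == top]
    # Case 1 picks a random tag among the tied ones; Case 2 has only one choice.
    anchor = rng.choice(tied)
    # Take the sibling group that holds the anchor tag's best-ranked text value.
    best = max(groups, key=lambda g: text_score[(anchor, g[anchor])])
    # Include that text value together with its siblings under the tied tags.
    return {t: best[t] for t in tied}


if __name__ == "__main__":
    groups = [
        {"actor": "A", "role": "r1"},
        {"actor": "B", "role": "r2"},
    ]
    scores = {("actor", "A"): 0.9, ("actor", "B"): 0.1,
              ("role", "r1"): 0.8, ("role", "r2"): 0.2}
    print(include_siblings({"actor": 0.5, "role": 0.5}, groups, scores))
    print(include_siblings({"actor": 0.6, "role": 0.4}, groups, scores))
```

In the second call only the actor tag is included. When the role tag is considered later, the same sibling group would supply its text value, which is the behaviour described for Case 2.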
Generation of Summary
Consider the following tag rank table:

Tag      Prob.
Actor    0.5
Keyword  0.3
Trivia   0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags and 6 trivia tags would be required.

Tag      Required no. of tags  Available no. of tags
Actor    15                    30
Keyword  9                     2
Trivia   6                     15
Distribute the remaining “tag-budget” by re-normalizing the distribution over the tags that are still available.

Generating the summary with 30 tags (cumulative counts in parentheses):

Step  Tag      Prob.  No. of tags available  No. of tags to be added  No. of tags added in the round
Round 1
1.1   actor    0.5    30                     15                       15
1.2   keyword  0.3    2                      9                        2
1.3   trivia   0.2    15                     6                        6
Total                                                                 23
Round 2
2.1   actor    0.714  15                     5                        5 (20)
-     keyword  0      0                      0                        0 (2)
2.2   trivia   0.286  9                      2                        2 (8)
Total                                                                 30
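The round-based distribution of the tag budget can be sketched as follows. This is a sketch under assumptions rather than the tool's actual code: the function name `allocate_tags`, the rounding of fractional shares with Python's built-in `round`, and the fallback that hands out a single tag when rounding leaves a round empty are all hypothetical.

```python
def allocate_tags(probs, available, budget):
    """Distribute a tag budget in proportion to tag probabilities.

    probs:     tag -> rank probability
    available: tag -> number of instances of the tag in the document
    budget:    total number of tags allowed in the summary
    """
    added = {t: 0 for t in probs}
    remaining = budget
    while remaining > 0:
        # Tags that still have unused instances in the document.
        open_tags = [t for t in probs if available[t] - added[t] > 0]
        if not open_tags:
            break  # the document has fewer tags than the budget allows
        # Re-normalize the probabilities over the tags that are still available.
        norm = sum(probs[t] for t in open_tags)
        round_budget = remaining
        progress = 0
        for t in sorted(open_tags, key=lambda t: -probs[t]):
            share = probs[t] / norm if norm > 0 else 1 / len(open_tags)
            want = round(round_budget * share)
            take = min(want, available[t] - added[t], remaining)
            added[t] += take
            remaining -= take
            progress += take
        if progress == 0:
            # Assumed fallback: rounding gave every tag zero, so add one
            # instance of the most probable open tag to guarantee progress.
            best = max(open_tags, key=lambda t: probs[t])
            added[best] += 1
            remaining -= 1
    return added


if __name__ == "__main__":
    probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
    available = {"actor": 30, "keyword": 2, "trivia": 15}
    print(allocate_tags(probs, available, 30))
```

For the example above, round 1 adds 15, 2 and 6 tags, and round 2 splits the remaining 7 over actor and trivia, giving 20 actor, 2 keyword and 8 trivia tags.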
User Evaluation

Dataset  No. of files  No. of unique tags  No. of documents used for evaluation
Movie    200,000       39                  8
People   150,000       11                  4

Dataset  Size  alpha
Movie    5     1, 0.8
Movie    10    1, 0.8, 0.6
Movie    20    1, 0.8, 0.6
People   5     1, 0.6
People   10    1, 0.6
Total number of generated summaries: 64 + 16 = 80

• Automatically generated summaries (80) were mixed with human-generated summaries (32)
• Summaries were graded on a scale of 1-7, where 1 is extremely bad and 7 is perfect
• Six different evaluators; each summary was evaluated by at least three
User Evaluation

Tabulation of average and above-average grades (4-7)

Dataset  Size                  alpha = 1.0    alpha = 0.8    alpha = 0.6   Total (across alpha)
Movie    5                     8/8 (100%)     5/8 (62.5%)    -             13/16 (81.25%)
Movie    10                    8/8 (100%)     7/8 (87.5%)    1/8 (12.5%)   16/24 (66.6%)
Movie    20                    7/8 (87.5%)    7/8 (87.5%)    4/8 (50%)     18/24 (75%)
Movie    Total (across sizes)  23/24 (95.8%)  19/24 (79.1%)  5/16 (31.2%)  47/64 (73.4%)
People   5                     3/4 (75%)      -              1/4 (25%)     4/8 (50%)
People   10                    4/4 (100%)     -              4/4 (100%)    8/8 (100%)
People   Total (across sizes)  7/8 (87.5%)    -              5/8 (62.5%)   12/16 (75%)

Note: A grade is shown only if at least 2 evaluators agreed on it.
Xoom
A tool for exploring and summarizing XML documents
Exploration Mode
Summarization Mode - Titanic.xml
Conclusion
• A fully automated XML summary generator
• Ranking of tags and text based on the ranking model
• Generation of summary from ranked tags & text within memory budget
• Xoom – a tool for exploring and summarizing XML documents
• User Evaluation
Publications
• Xoom: A tool for zooming in and out of XML Documents (Demo). Maya Ramanath and Kondreddi Sarath Kumar. Proc. of Intl. Conf. on Extending Database Technology (EDBT), St. Petersburg, Russia, March 2009.
• A Rank-Rewrite Framework for Summarizing XML Documents. Maya Ramanath and Kondreddi Sarath Kumar. 2nd Intl. Workshop on Ranking in Databases (DBRank, in conjunction with ICDE 2008), Cancun, Mexico, April 2008.
User Evaluation of Summaries
Link: http://www.mpi-inf.mpg.de/~ramanath/Summarization/
Thanks!
Appendix
Informativeness
Coverage
Why not tag-text pairs?
Ocean’s Eleven.xml - Summaries
Titanic.xml on OST Summarizer
Gern </actor> <role > Drowning man </role> </casting> <casting> <actor > Martin, Johnny (I) </actor> <role > Rescue boat crewman </role> </casting> <casting> <actor > Lynch, Don (II) </actor> <role > Frederick Spedden </role> </casting> <casting> <actor > Cameron, James (I) </actor> <role > Cameo appearance (steerage dancer) </role> </casting> <casting> <actor > Cragnotti, Chris </actor> <role > Victor Giglio </role> </casting> <casting> <actor > Kenny, Tony (I) </actor> <role > Deckhand </role> </casting> <casting> <actor > Campolo, Bruno </actor> <role > Second-class man </role> </casting> </cast> <misc> <miscEntry> <person > Abercrombie, Ian </person> <job > adr loop group </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Allen, Melinda </person> <job > assistant: James Cameron </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Altman, John (I) </person> <job > historical music advisor </job> </miscEntry> <miscEntry> <person > Altman, John (I) </person> <job > music arranger: period music </job> </miscEntry> <miscEntry> <person > Amorelli, Mike </person> <job > rigging gaffer </job> </miscEntry> <miscEntry> <person > Amorelli, Paul </person> <job > rigging best boy electric </job> </miscEntry> <miscEntry> <person > Anaya, Daniel </person> <job > grip </job> </miscEntry> <miscEntry> <person > Andrade, Maria Louise </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Baker, Brett </person> <job > photo double: Leonardo DiCaprio </job> </miscEntry> <miscEntry> <person > Arvizu, Ricardo </person> <job > grip </job> </miscEntry> <miscEntry> <person > Bailes, Tim </person> <job > marine consultant </job> </miscEntry> <miscEntry> <person > Arneson, Charlie </person> <job > aquatic researcher </job> </miscEntry> <miscEntry> <person > Arneson, Charlie </person> <job > aquatic supervisor </job> </miscEntry> <miscEntry> <person > Arnold, Amy </person> <job > key set costumer: women </job> </miscEntry> <miscEntry> 
<person > Atkinson, Lisa (I) </person> <job > pre-production consultant </job> </miscEntry> <miscEntry> <person > Barius, Claudette </person> <job > additional still photographer: pre-production </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Baker, Jeanie </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Barton, Roger </person> <job > associate editor </job> </miscEntry> <miscEntry> <person > Baker, Tom (VI) </person> <job > electrician </job> </miscEntry> <miscEntry> <person > Bass, Andy (I) </person> <job > assistant music engineer </job> </miscEntry> <miscEntry> <person > Barber, Jamie (I) </person> <job > first assistant camera: Halifax </job> <miscEntry> <person > Baylon, Hugo </person> <job > location assistant </job> </miscEntry> <miscEntry> <person > Bee, Guy Norman </person> <job > camera operator </job> </miscEntry> <miscEntry> <person > Benarroch, Ariel </person> <job > first assistant camera: second unit </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Bendt, Tony </person> <job > company grip </job> </miscEntry> <miscEntry> <person > Boccoli, Daniel </person> <job > apprentice editor </job> </miscEntry> <miscEntry> <person > Botham, Buddy </person> <job > generator operator </job> </miscEntry> <miscEntry> <person > Bonner, Kit </person> <job > naval consultant </job> </miscEntry> <miscEntry> <person > Blevins, Cha </person> <job > costumer </job> <extra > as Deborah 'Cha' Blevins </extra> </miscEntry> <miscEntry> <person > Bloom, Kirk </person> <job > second assistant camera </job> </miscEntry> <miscEntry> <person > Bolton, Paul </person> <job > electrician </job> </miscEntry> <miscEntry> <person > Bornstein, Bob </person> <job > music preparation </job> </miscEntry> <miscEntry> <person > Bozeman, Marsha </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Broberg, David </person> <job > first assistant film editor </job> </miscEntry> <miscEntry> <person > Brady, Kenneth 
Patrick </person> <job > production assistant </job> </miscEntry> <miscEntry> <person > Bruno, Keri </person> <job > production assistant </job> </miscEntry> <miscEntry> <person > Bryan, Mitch (III) </person> <job > assistant video assist operator </job> </miscEntry> <miscEntry> <person > Bryce, Malcolm </person> <job > lamp operator </job> </miscEntry> <miscEntry> <person > Burdick, Geoff </person> <job > production associate </job> </miscEntry> <miscEntry> <person > Buckley, John (III) </person> <job > gaffer </job> </miscEntry> <miscEntry> <person > Cameron, James (I) </person> <job > director of photography: Titanic deep dive camera </job> </miscEntry> <miscEntry> <person > Cameron, James (I) </person> <job > special camera equipment designer </job> </miscEntry> <miscEntry> <person > Cameron, Michael (II) </person> <job > special deep ocean camera system </job> </miscEntry> <miscEntry> <person > Byall, Bruce </person> <job > grip </job> </miscEntry> <miscEntry> <person > Byron, Carol Sue </person> <job > additional production accountant </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Canedo, Luis </person> <job > rigging electrician </job> <extra > as Jose
User Evaluation of Summaries – IMDB Dataset
Files used

Dataset  Filename
Movie    American Beauty, Ocean’s Eleven, Kill Bill Part II, Saving Private Ryan, The Last Samurai, The Usual Suspects, Titanic, A Space Odyssey
People   Brad Pitt, Matt Damon, Ben Affleck, Leonardo DiCaprio
Xoom