Search and Information Extraction Lab, IIIT Hyderabad
TRANSCRIPT
![Page 1: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/1.jpg)
A SUMMARIZATION JOURNEY
Search and Information Extraction Lab
IIIT Hyderabad
![Page 2: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/2.jpg)
Information Overload
Explosive growth of information on web
Failure of information retrieval systems to satisfy user’s information need
Need for sophisticated information access solutions
![Page 3: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/3.jpg)
Summarization
A summary is a condensed version of the source document(s) with a recognizable genre; its purpose is to give the reader an exact and concise idea of the contents of the source.
Text interpretation
Extraction of Relevant information
Condensing Extracted Information
Summary Generation
![Page 4: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/4.jpg)
Flavors of Summarization
Progressive
Single document
Query Focused
Opinion/Sentiment
Code
Comparative
Guided
Personalized
![Page 5: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/5.jpg)
Extract Vs. Abstract
Extract: An extract is a summary consisting entirely of material from the input text.
Abstract: An abstract is a summary at least some of whose material is not present in the input, e.g. paraphrases of content, subject categories.
![Page 6: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/6.jpg)
Towards Abstraction
Personalized, Cross Lingual Summarization
Guided Summarization
Code Summarization
Comparison Summarization
Blog Summarization
Progressive Summarization
Abstractive
Single Document, Query Focused Multi Document Summarization
![Page 7: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/7.jpg)
Technological Aspects
Summarization
Support Vector Regression
Relevance based Language Models
External Knowledge: Web, Wikipedia
User Modeling
Statistics – word and document
Similarity measures, Novelty detection
Graph Clustering – Topic identification
![Page 8: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/8.jpg)
EXTRACTIVE SUMMARIZERS
![Page 9: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/9.jpg)
Query Focused Summarization
Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system
Query Dependent ranking: Relevance Based Language Models (RBLM), Language models (PHAL)
Query Independent ranking: Sentence Prior
![Page 10: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/10.jpg)
RBLM is an IR approach that computes the conditional probabilities of relevance from the document and the query
PHAL: probabilistic extension to HAL spaces
HAL constructs dependencies of a term w on other terms based on their occurrence in its context in the corpus
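The HAL construction described above can be sketched roughly as follows. This is a minimal illustration, not the lab's implementation: the window size, the distance weighting (`window - d + 1`, as in the usual HAL formulation) and the row normalisation used as a stand-in for the probabilistic extension are all assumptions, and the function names `hal_space` and `phal` are hypothetical.

```python
from collections import defaultdict

def hal_space(tokens, window=5):
    """Build a HAL-style space: hal[w][c] accumulates a weight for every
    term c seen up to `window` positions before w in the token stream."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d < 0:
                break
            # Closer context terms receive a larger weight.
            hal[w][tokens[i - d]] += window - d + 1
    return hal

def phal(hal):
    """Assumed probabilistic reading: normalise each row to sum to 1,
    so that probs[w][c] can be read as P(c | w)."""
    probs = {}
    for w, row in hal.items():
        total = sum(row.values())
        probs[w] = {c: v / total for c, v in row.items()}
    return probs

tokens = "the summary condenses the source document".split()
p = phal(hal_space(tokens, window=2))
```

Here `p["condenses"]` gives more mass to "summary" (the adjacent term) than to "the" (two positions away), which is the dependency on context the slide refers to.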
![Page 11: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/11.jpg)
DUC Performance
38 systems participated in 2006
Significant difference between first two systems
![Page 12: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/12.jpg)
Extract vs. Abstract Summarization
We conducted a study (post TAC 2006)
Generated best possible extracts
Calculated the scores for these extracts
Evaluation with respect to the reference summaries

| | ROUGE-2 | ROUGE-SU4 |
|---|---|---|
| Human Answers | 0.1025 | 0.1624 |
| Best Answers | 0.09965 | 0.15407 |
| HAL Feature | 0.07618 | 0.13805 |
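For readers unfamiliar with the metric, a simplified ROUGE-2 recall can be sketched as below. It is only an approximation of what the official ROUGE toolkit computes: whitespace tokenisation, no stemming or stop-word handling, and pooling over references are assumptions made here, and ROUGE-SU4 is not covered.

```python
from collections import Counter

def bigrams(text):
    """Count adjacent word pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2_recall(candidate, references):
    """Clipped bigram matches summed over all references,
    divided by the total number of reference bigrams."""
    cand = bigrams(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = bigrams(ref)
        total += sum(ref_counts.values())
        matched += sum(min(n, cand[bg]) for bg, n in ref_counts.items())
    return matched / total if total else 0.0

score = rouge_2_recall("the cat sat on the mat", ["the cat lay on the mat"])
```

In this toy case three of the five reference bigrams also occur in the candidate.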
![Page 13: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/13.jpg)
Cross Lingual Summarization
![Page 14: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/14.jpg)
Cross Lingual Summarization
A bridge between CLIR and MT
Extended our mono-lingual summarization framework to a cross-lingual setting in RBLM framework
Designed a cross-lingual experimental setup using DUC 2005 dataset
Experiments were conducted for Telugu-English language pair
Comparison with the mono-lingual baseline shows that the cross-lingual system reaches about 90% of its performance in ROUGE-SU4 and about 85% in ROUGE-2 F-measure
![Page 15: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/15.jpg)
Progressive Summarization
Emerging area of research in summarization
Summarization with a sense of prior knowledge
Introduced as “Update Summarization” at DUC 2007, TAC 2008, TAC 2009
Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles.
To keep track of temporal news stories
![Page 16: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/16.jpg)
Key challenge
To detect information that is not only relevant but also new given the prior knowledge of the reader
Relevant and new vs. non-relevant and new vs. relevant and redundant
![Page 17: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/17.jpg)
Three level approach to Novelty Detection
Sentence Scoring
Developing new features that capture novelty along with relevance of a sentence
NF, NW
Ranking
Sentences are re-ranked based on the amount of novelty they contain
ITSim, CoSim
Summary Generation
A pool of sentences that contain novel facts is selected; all remaining sentences are filtered out
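The re-ranking and filtering levels can be illustrated with a small sketch. The slides do not define NF, NW, ITSim or CoSim, so this uses plain cosine similarity against the previously read sentences as an assumed stand-in; the function names and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_by_novelty(scored_sentences, prior_sentences, threshold=0.5):
    """scored_sentences: list of (sentence, relevance).
    Novelty is 1 minus the highest similarity to any prior sentence;
    sentences below `threshold` are dropped, the rest are re-ranked
    by relevance * novelty."""
    prior = [Counter(s.lower().split()) for s in prior_sentences]
    ranked = []
    for sent, relevance in scored_sentences:
        vec = Counter(sent.lower().split())
        novelty = 1.0 - max((cosine(vec, p) for p in prior), default=0.0)
        if novelty >= threshold:
            ranked.append((sent, relevance * novelty))
    return sorted(ranked, key=lambda x: x[1], reverse=True)

prior = ["the storm hit the coast on monday"]
candidates = [
    ("the storm hit the coast on monday night", 0.9),
    ("rescue teams evacuated two thousand residents", 0.7),
]
result = rerank_by_novelty(candidates, prior)
```

The first candidate is relevant but redundant with what the reader already knows, so only the second survives.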
![Page 18: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/18.jpg)
Evaluations
TAC 2008 Update Summarization data for training: 48 topics
Each topic is divided into clusters A and B with 10 documents each
Summary for cluster A is a normal summary; summary for cluster B is an update summary
TAC 2009 Update Summarization data for testing: 44 topics
Baseline summarizer generates a summary by picking the first 100 words of the last document
Run1 – DFS + SL1
Run2 – PHAL + KL
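The baseline described above is simple enough to sketch directly. Two assumptions are made here: "last document" is taken to mean the final item of a list ordered oldest to newest, and words are split on whitespace. The function name is hypothetical.

```python
def baseline_summary(documents, max_words=100):
    """documents: document texts ordered oldest to newest.
    Returns the first `max_words` whitespace-separated words of the last one."""
    words = documents[-1].split()
    return " ".join(words[:max_words])

docs = ["an older article", " ".join(f"w{i}" for i in range(150))]
summary = baseline_summary(docs)
```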
![Page 19: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/19.jpg)
Personalized Summarization
Perception of text differs with background of the reader
Need of incorporating user background in the summarization process
Summarization not only a function of input text but also the reader
Example: “Serve” is read differently by a tennis player, a hotel manager and a politician
![Page 20: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/20.jpg)
Web-based profile creation: personal information available on the web, such as a conference page, a project page, an online paper, or even a weblog.
Estimate the user model P(w | Mu) to incorporate the user in the sentence extraction process
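One way to picture the user model is as a smoothed unigram language model estimated from the user's web pages. The sketch below is an assumption about the general shape only: the linear interpolation with a background model, the weight `lam`, and scoring a sentence by its average word probability are illustrative choices, not the estimation procedure from the slides.

```python
from collections import Counter

def user_model(profile_texts, background_texts, lam=0.7):
    """Return a function p(word) approximating P(w | Mu): a mixture of the
    profile unigram model and a background unigram model."""
    prof = Counter(w for t in profile_texts for w in t.lower().split())
    back = Counter(w for t in background_texts for w in t.lower().split())
    back.update(prof)  # background also covers every profile word
    n_prof = max(sum(prof.values()), 1)
    n_back = max(sum(back.values()), 1)

    def p(word):
        return lam * prof[word] / n_prof + (1 - lam) * back[word] / n_back

    return p

def sentence_score(sentence, p):
    """Average user-model probability of the words in the sentence."""
    words = sentence.lower().split()
    return sum(p(w) for w in words) / len(words) if words else 0.0

p = user_model(
    ["tennis serve ace court", "serve and volley tennis"],
    ["the hotel will serve dinner", "tennis court"],
)
tennis = sentence_score("a strong serve wins tennis points", p)
hotel = sentence_score("the hotel restaurant will serve dinner", p)
```

With a tennis player's profile, the sentence about a tennis serve scores higher than the one about a hotel serving dinner, even though both contain "serve".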
![Page 21: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/21.jpg)
Opinion summarization
Sentiment Analysis
User-generated content is growing rapidly through blogs
Sentiment analysis provides better access to information
Sentiment
Textual information on the Web can be categorized as facts and opinions
Computational study of opinions, sentiments in market perspective
![Page 22: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/22.jpg)
Optimization of sentiment in the summary to the maximum extent
Sentiment summarization as a two stage classification problem at sentence level
Stage 1: Opinion/fact
Stage 2: Polarity Estimation (Positive/Negative)
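The two-stage structure can be sketched with a toy example. The slides describe a classification problem, which presumably means trained classifiers; the tiny hand-made word lists below are a hypothetical stand-in used only to show how the two stages chain together.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "terrible", "awful"}
SUBJECTIVE = POSITIVE | NEGATIVE | {"think", "feel"}

def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def is_opinion(sentence):
    """Stage 1: opinion vs. fact."""
    return any(w in SUBJECTIVE for w in words(sentence))

def polarity(sentence):
    """Stage 2: positive vs. negative (neutral when the counts tie)."""
    ws = words(sentence)
    score = sum(w in POSITIVE for w in ws) - sum(w in NEGATIVE for w in ws)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def classify(sentences):
    """Keep only opinion sentences, then label each with its polarity."""
    return [(s, polarity(s)) for s in sentences if is_opinion(s)]

labels = classify([
    "The phone has a 5 inch display.",
    "I love the battery life.",
    "The camera is terrible.",
])
```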
![Page 23: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/23.jpg)
SEMI ABSTRACTIVE SUMMARIZERS
![Page 24: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/24.jpg)
Comparative summarization
Summaries for comparing multiple items belonging to a category
Category of “Mobile phones” will have “Nokia”, “BlackBerry” as its items
Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item, e.g. “Memory”, “Display”, “Battery Life”
![Page 25: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/25.jpg)
Comparative Summaries Generation
Attribute Extraction: Find the attributes of the product class
Attribute Ranking: Rank the attributes according to importance in comparison
Summary Generation: Find the occurrence of attributes in various products
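The three steps can be sketched on already-extracted attribute/value pairs. This is a simplified illustration: it assumes extraction from text has been done, ranks attributes by how many products mention them (an assumed proxy for "importance in comparison"), and uses invented placeholder products and values.

```python
from collections import Counter

def comparative_summary(products, top_k=2):
    """products: {name: {attribute: value}}.
    Returns the top_k attributes and a product-by-attribute table."""
    # Attribute extraction: pool the attributes seen across all products.
    counts = Counter(attr for specs in products.values() for attr in specs)
    # Attribute ranking: most widely shared first, ties broken alphabetically.
    ranked = [a for a, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    ranked = ranked[:top_k]
    # Summary generation: look up each ranked attribute in each product.
    table = {
        name: {a: specs.get(a, "n/a") for a in ranked}
        for name, specs in products.items()
    }
    return ranked, table

# Placeholder data, not real specifications.
products = {
    "Phone A": {"Memory": "8 GB", "Battery Life": "10 h", "Display": "3.2 in"},
    "Phone B": {"Memory": "4 GB", "Battery Life": "8 h"},
}
attributes, table = comparative_summary(products)
```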
![Page 26: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/26.jpg)
Guided Summarization
Query Focused Summarization
User’s information need expressed as a query along with a narrative
Set of documents related to the topic
Goal is to produce a short coherent summary focused on answering the query
Guided Summarization
Each topic is classified into a set of predefined categories
Each category has a template of important aspects about the topic
Summary is expected to answer all the aspects of template while containing other relevant information
![Page 27: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/27.jpg)
Guided summarization
Encourage deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts
Shares similarity with information extraction
Specific information from unstructured text is identified and consequently classified into a set of semantic labels (templates)
Makes information more suitable for other information processing tasks
A guided summarization system has to produce a readable summary encompassing all the information about the templates
Very few investigations exploring the potential of merging summarization with information extraction techniques
![Page 28: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/28.jpg)
Our approach
Building a domain model: Essential background knowledge for information extraction
Sentence Annotations: To identify sentences having answers to aspects of template
Concept Mining: To use semantic concepts instead of words to calculate sentence importance
Summary Extraction: Modification of summary extraction algorithm to adapt to the requirements using sentence annotations
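How sentence annotations might steer summary extraction can be sketched with a greedy selection. This is an assumed illustration, not the algorithm from the slides: it takes sentences that already carry an importance score and a set of annotated aspects, and prefers whichever sentence covers the most still-unanswered aspects. The aspect labels and the function name are hypothetical.

```python
def guided_extract(sentences, aspects, max_sentences=3):
    """sentences: list of (text, importance, set_of_aspects_answered).
    Returns the chosen sentence texts and the aspects left unanswered."""
    uncovered, chosen = set(aspects), []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        # Most newly covered aspects first; importance breaks ties.
        best = max(remaining, key=lambda s: (len(s[2] & uncovered), s[1]))
        chosen.append(best[0])
        uncovered -= best[2]
        remaining.remove(best)
    return chosen, uncovered

annotated = [
    ("An earthquake struck the region", 0.9, {"WHAT"}),
    ("It happened on Tuesday near the coast", 0.6, {"WHEN", "WHERE"}),
    ("Officials held a press briefing", 0.8, set()),
]
summary, missing = guided_extract(annotated, {"WHAT", "WHEN", "WHERE"}, max_sentences=2)
```

A purely importance-based extractor would pick the first and third sentences here; using the annotations instead yields a summary that answers every aspect of the template.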
![Page 29: Search and Information Extraction Lab IIIT Hyderabad](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ebf5503460f94bca7d5/html5/thumbnails/29.jpg)
THANKS