StreamGrid: Summarization of Large-Scale Events Using Topic Modeling and Temporal Analysis
DESCRIPTION
Paper presentation at the #somus2014 workshop, part of ICMR 2014, Glasgow, Scotland.
TRANSCRIPT
SoMuS 2014 Workshop ICMR, Glasgow, Scotland, 1 April 2014
StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis
Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas
Information Technologies Institute (ITI), Center for Research & Technology Hellas (CERTH)
Department of Electrical & Computer Engineering, Aristotle University of Thessaloniki (AUTh)
#somus2014 #2
Overview
• The Problem
• Existing Approaches
• StreamGrid
• Experimental Study
• Summary – Future Work
Event Summarization: motivation & definition
Large-scale Public Events
• Many attendees use social media
• Huge amount of event-related social content
#oscars: 4.5M tweets
#sxsw: 1.35M tweets
#SB48 (Super Bowl): 24.9M tweets in 4 hours!
Large-scale Public Events
• Long-running events consist of several sub-events, e.g. 10 days of Sundance Film Festival include opening and awards ceremonies, screenings etc.
• Many aspects and entities of interest in the context of an event e.g. films in film festivals, teams in sports events, etc.
• Many messages can be considered as spam or non-informative.
• Redundancy due to near-duplicate messages
Event-based Summarization
Produce concise multi-document summaries for a given event, covering its main aspects.
[Diagram: list of all messages → event-based summarizer → set of selected messages]
Related Work
Existing Approaches
Radev et al. 2004 (baseline)
• Summary consists of the messages closest to the tf·idf centroid of all messages
Shen et al. 2013
• Mixture model to detect sub-events at participant level
• tf·idf centroid to find a summary of each sub-event
Chakrabarti & Punera 2011
• Hidden Markov Model to obtain a time-based segmentation
• tf·idf centroid to find a summary of each time segment
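The centroid idea shared by these approaches can be sketched roughly as follows. This is a minimal illustration, not the cited authors' code: whitespace tokenization, the log(N/df) idf variant, and the function names `tfidf_vectors`, `cosine` and `centroid_summary` are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(messages):
    """Return one sparse tf-idf vector (dict: term -> weight) per message."""
    docs = [Counter(m.lower().split()) for m in messages]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all-zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_summary(messages, length):
    """Pick the `length` messages whose tf-idf vectors are closest to the centroid."""
    vectors = tfidf_vectors(messages)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(vectors)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    return [messages[i] for i in ranked[:length]]
```

Note that nothing in this selection penalizes near-duplicates, so the top-ranked messages can be very similar to each other.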
Existing Approaches
Erkan et al. 2004 (LexRank)
• Graph-based approach to find salient sentences
• Uses centrality of each sentence in a similarity graph
• Adapted for multi-document summarization using each message as a sentence
• Outperforms naïve centroid-based approach
Shou et al. 2013
• Online clustering algorithm to find sub-events
• Greedy algorithm for summarization using the LexRank score of each message
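A simplified sketch of graph-based centrality over messages, in the spirit of LexRank, is given below. It is not the published algorithm: it uses plain bag-of-words cosine instead of idf-modified cosine, omits self-loops, and the threshold, damping factor and iteration count are illustrative assumptions, as are the names `bow_cosine` and `lexrank_scores`.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between two messages using raw term counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(messages, threshold=0.1, damping=0.85, iterations=50):
    """Score each message by PageRank-style centrality in a thresholded similarity graph."""
    n = len(messages)
    if n == 0:
        return []
    # Unweighted edge between two distinct messages if they are similar enough.
    adj = [[1.0 if i != j and bow_cosine(messages[i], messages[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(adj[j][i] * scores[j] / degree[j]
                                  for j in range(n) if degree[j] > 0)
                  for i in range(n)]
    return scores
```

Retweets of the same tweet form a densely connected group in such a graph, so they all receive high scores.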
StreamGrid: approach description
StreamGrid Overview
• Find topics using Latent Dirichlet Allocation (LDA)
• Create a timeline for each topic
• Create StreamGrid structure
• Summarize using StreamGrid
Topic Modeling using LDA
• To work with very short documents (tweets), LDA needs some kind of message pooling
• Number of topics estimation
– Minimize: (a) total perplexity for a set of test documents and (b) average textual similarity across topics
Pooling schemes: microblog messages are merged into longer documents by
• Time proximity
• Same author
• Same hashtag
• Textual similarity
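As one possible illustration of pooling, the sketch below merges messages that share a hashtag into a single pseudo-document. The regular expression, the lowercasing of hashtags, the separate handling of untagged messages, and the name `pool_by_hashtag` are assumptions for the example, not details taken from the slides.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")


def pool_by_hashtag(messages):
    """Merge messages sharing a hashtag; return (merged documents, untagged messages)."""
    pools = defaultdict(list)
    unpooled = []
    for text in messages:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            unpooled.append(text)  # no hashtag: kept aside as its own document
        for tag in tags:
            pools[tag].append(text)  # a message joins every pool it mentions
    merged = {tag: " ".join(texts) for tag, texts in pools.items()}
    return merged, unpooled
```

A message with several hashtags contributes to several pooled documents, which is one design choice among others.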
Topic Modeling using LDA
• Split documents D into D_train and D_test
• Estimate K topics over D_train
• Calculate total perplexity of D_test
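For reference, held-out perplexity is commonly computed as exp(-(sum of log p(w | d)) / (number of tokens)), with p(w | d) = Σ_k θ_dk · φ_kw. The sketch below assumes the per-document topic mixtures θ and per-topic word distributions φ have already been estimated by some LDA implementation; the function name, argument layout and the 1e-12 floor for unseen words are assumptions.

```python
import math


def perplexity(test_docs, doc_topic, topic_word):
    """Held-out perplexity of tokenized documents under a fitted topic model.

    test_docs:  list of token lists
    doc_topic:  list of topic mixtures, one list of K probabilities per document
    topic_word: list of K dicts mapping word -> probability under that topic
    """
    log_likelihood, token_count = 0.0, 0
    for doc, theta in zip(test_docs, doc_topic):
        for word in doc:
            p = sum(theta[k] * topic_word[k].get(word, 0.0) for k in range(len(theta)))
            log_likelihood += math.log(max(p, 1e-12))  # floor avoids log(0)
            token_count += 1
    if token_count == 0:
        raise ValueError("test_docs contains no tokens")
    return math.exp(-log_likelihood / token_count)
```

Repeating this for several values of K gives the perplexity curve against which a number of topics can be chosen.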
StreamGrid Creation
• Assign each message to the topic with the highest probability p, under the condition p > p_th (messages that do not meet the threshold are treated as spam and discarded)
• Create StreamGrid
[Diagram: grid of cells indexed by topic i and time interval j]
cell c(i,j) = {set of messages associated with topic i, posted during time interval j}
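The grid construction can be sketched as a dictionary keyed by (topic, interval). This is only an illustration of the structure described above: timestamps in seconds, fixed-width intervals, the default threshold value, and the name `build_streamgrid` are assumptions.

```python
from collections import defaultdict


def build_streamgrid(messages, topic_probs, start, interval_seconds, p_th=0.5):
    """Group messages into cells c(i, j) keyed by (topic index, time interval index).

    messages:    list of (timestamp_in_seconds, text) pairs
    topic_probs: one non-empty list of topic probabilities per message
    """
    grid = defaultdict(list)
    for (timestamp, text), probs in zip(messages, topic_probs):
        topic = max(range(len(probs)), key=probs.__getitem__)
        if probs[topic] <= p_th:
            continue  # no topic is confident enough: message is discarded
        interval = int((timestamp - start) // interval_seconds)
        grid[(topic, interval)].append(text)
    return dict(grid)
```

For example, with one-minute intervals, a message posted 70 seconds after `start` lands in interval 1 of its most probable topic.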
StreamGrid Creation
• For each cell c(i,j) calculate a merged tf·idf vector u_ij
• For each term t calculate the weight:
where tf_ij(t) is the frequency of t in cell c(i,j)
• For each message m of c(i,j) calculate the weight:
StreamGrid Creation
• Detect active cells of each topic by applying peak detection on the associated topic timeline.
• Given a topic i and a detected peak in time window [a,b], all cells c(i,j), a < j < b, are defined as active.
• For the set of active topics A during a time interval j, calculate a significance score:
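The activation of cells from detected peaks (first two bullets) might look like the sketch below. The slides do not say which peak detector is used, so a simple "mean plus k standard deviations" rule stands in for it here; that rule, the parameter `k`, and the names `detect_peaks` and `active_cells` are assumptions. The significance score is not covered by this sketch.

```python
from statistics import mean, pstdev


def detect_peaks(timeline, k=1.0):
    """Return peak windows (a, b) for one topic timeline (message counts per interval).

    A burst is a run of intervals whose count exceeds mean + k * std; a and b are
    the intervals just before and just after the run, so that a < j < b is the burst.
    """
    threshold = mean(timeline) + k * pstdev(timeline)
    peaks, start = [], None
    for j, count in enumerate(timeline):
        if count > threshold and start is None:
            start = j
        elif count <= threshold and start is not None:
            peaks.append((start - 1, j))
            start = None
    if start is not None:  # burst still running at the end of the timeline
        peaks.append((start - 1, len(timeline)))
    return peaks


def active_cells(peaks):
    """Interval indices j with a < j < b for some detected peak window (a, b)."""
    return {j for a, b in peaks for j in range(a + 1, b)}
```

Applied per topic, this yields the set of active cells c(i,j) for that topic's row of the grid.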
StreamGrid Creation
• To get an overall estimation of the importance of each topic throughout the event we calculate two measures:
Topic-time Summarization
• Our goal is the generation of a summary of an event for an arbitrary time frame F=[x1,x2].
• The summary has to meet the following criteria:
– As many aspects of the event as possible are covered
– Redundancy due to near-duplicate messages is minimized
• We use a greedy algorithm that selects important messages from each active topic in F and minimizes redundancy simultaneously.
Topic-time Summarization
• A topic i is active in F if any of the cells contained in F is active.
• The significance score of an active topic i in F is the max significance score across all time intervals in F.
• The weight W(m,F) of a message m in F is the sum of the weights in each time interval.
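These three rules translate fairly directly into code. The sketch below assumes the per-interval quantities are already available as dictionaries and that F is given as an inclusive pair of interval indices; the data layout and the name `aggregate_over_frame` are hypothetical.

```python
def aggregate_over_frame(active, significance, weights, frame):
    """Aggregate per-interval quantities over a time frame F = (x1, x2), inclusive.

    active:       dict topic -> set of active interval indices
    significance: dict (topic, interval) -> significance score
    weights:      dict (message_id, interval) -> message weight in that interval
    Returns (topic_score, message_weight) restricted to F.
    """
    x1, x2 = frame
    intervals = range(x1, x2 + 1)
    # A topic is active in F if any of its cells inside F is active.
    active_topics = {t for t, js in active.items() if any(j in js for j in intervals)}
    # Topic significance in F: maximum over the intervals of F.
    topic_score = {t: max(significance.get((t, j), 0.0) for j in intervals)
                   for t in active_topics}
    # Message weight W(m, F): sum of its per-interval weights inside F.
    message_weight = {}
    for (message_id, j), w in weights.items():
        if x1 <= j <= x2:
            message_weight[message_id] = message_weight.get(message_id, 0.0) + w
    return topic_score, message_weight
```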
[Diagram: two time frames F and F’ over the grid, each with its own set of active topics]
Topic-time Summarization: Algorithm
Input: StreamGrid, time frame F, summary length L
Output: summary set S
Get active topics in F
for each active topic select message with highest weight → Mc
while |S| < L do
    for each message m in Mc do
        calculate score(m)
    end for
    Add message with highest score to S and remove it from Mc
end while
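A runnable sketch of the greedy loop follows. The exact scoring formula is not given in the transcript, so the linear trade-off `alpha * weight - (1 - alpha) * redundancy`, the bag-of-words cosine used for redundancy, and the names `cosine` and `greedy_summary` are assumptions. The loop also stops when the candidate set Mc runs out, which the pseudocode leaves implicit.

```python
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity between two message texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_summary(candidates, length, alpha=0.5):
    """Greedily pick up to `length` messages from the candidate set Mc.

    candidates: dict message text -> weight W(m, F), assumed scaled to [0, 1]
    """
    pool = dict(candidates)  # Mc: messages still available for selection
    summary = []             # S: messages selected so far, in selection order
    while len(summary) < length and pool:
        def score(m):
            # Redundancy: average similarity to the already selected messages.
            redundancy = (sum(cosine(m, s) for s in summary) / len(summary)
                          if summary else 0.0)
            return alpha * pool[m] - (1 - alpha) * redundancy
        best = max(pool, key=score)
        summary.append(best)
        del pool[best]
    return summary
```

With `alpha` close to 1 the selection reduces to ranking by weight; lower values penalize near-duplicates more strongly.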
Topic-time Summarization
• The score of a message m is a combination of its importance and of the redundancy introduced by its selection.
• Redundancy is the average textual similarity among the set of already selected messages S
Experimental Study: Sundance Film Festival 2013
Dataset & Event
Sundance Film Festival
• Two week festival: Jan 15-30, 2013
• Data collection based on Streaming API with the following parameters:
– hashtags: #sundance, #sundance2013, #sundancefest
– account: @sundancefest
• Total number of tweets: 201,752
• Total number of original tweets: 100,046
Topic Modeling
• Merging messages with the same hashtag gave the best results with respect to perplexity.
• Main trend for perplexity is to decrease as K increases.
• Average similarity between clusters stabilized for K>200 → K = 200
Peaky & Persistent Topics
Event Timeline
[Timeline figure; annotated peaks: awards ceremony; “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film]
Selected Timeslots
• Evaluate using two timeslots with high activity.
• The first time frame has a small number of very popular tweets mainly about two films.
• The second is a more diverse set of tweets.
• A good measure of the quality of a summary is the number of films covered.
From | To | Tweets | Description
Mon Jan 21 05:00:00 EET 2013 | Mon Jan 21 06:00:00 EET 2013 | 5755 | “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
Sun Jan 27 03:00:00 EET 2013 | Sun Jan 27 09:00:00 EET 2013 | 9009 | Awards ceremony
Baselines
• Random Summarizer: Selects L random tweets.
• Popularity Summarizer: Selects the top L tweets based on retweet count.
• tf·idf Summarizer: Uses the tf·idf weight of each tweet to select the top L.
• Cluster-based Summarizer: Creates L clusters using k-means clustering and selects the highest weighted message of each cluster.
• LexRank Summarizer: Graph-based method that assigns a weight to each tweet based on its adjacent edges.
Timeslot #1 (Stoker & Use Orally as Indicated)
Popularity-based Summarizer
• 5/10 tweets of the summary are related to the Stoker film → Tends to cover only a few popular aspects of the event
• Minimizes near-duplicate redundancy, as it uses only the original tweets.
• “Use Orally as Indicated” is the second film covered in the summary (130 RTs)
Timeslot #1 (Stoker & Use Orally as Indicated)
LexRank Summarizer
• 9/10 tweets of the summary are retweets of a tweet related to the “Use Orally as Indicated” film → A lot of redundancy
• These tweets have high degree centrality, as there are many connections between them.
tf·idf Summarizer
• Covers two different films (Stoker, Stuart Hall).
• Many tweets about these films.
Timeslot #1 (Stoker & Use Orally as Indicated)
StreamGrid Summarizer
• Covers five different films (The Look of Love, Dirty Wars, Before Midnight, Kill Your Darlings, Life According to Sam)
• There are no duplicates or near-duplicates.
• “Stoker” and “Use Orally as Indicated” are not covered!
• A combination of StreamGrid Summarization and Popularity Summarization could solve this.
Timeslot #2 (Awards Ceremony)
KPI: Number of winning films covered by the summary
• Popularity-based summarizer outperforms all other approaches: covers 8 films that won an award that night (Afternoon Delight, Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty Wars, Crystal Fairy, Pussy Riot)
• StreamGrid covers 6 films (Computer Chess, Inequality for All, Fruitvale, Afternoon Delight, In a World, American Promise).
• Only two films in common → Integrate popularity into StreamGrid to obtain better results.
• LexRank does not cover any of the winning films, but includes this: 'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan to Blame?
• tf·idf Summarizer includes three films, but none of the winning ones!
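The KPI itself is simple to compute once a list of winning films is available. The sketch below counts a film as covered when its title appears as a case-insensitive substring of any summary message; that matching rule and the name `films_covered` are assumptions, and the sample tweets in the usage lines are invented for illustration.

```python
def films_covered(summary, films):
    """Return the subset of `films` whose title appears in at least one summary message."""
    covered = set()
    for film in films:
        needle = film.lower()
        if any(needle in message.lower() for message in summary):
            covered.add(film)
    return covered


# Usage with made-up messages; the film titles come from the lists above.
sample_summary = [
    "Fruitvale wins the Grand Jury Prize!",
    "So happy for AFTERNOON DELIGHT",
    "Great party tonight",
]
winners = ["Fruitvale", "Afternoon Delight", "Blood Brothers"]
kpi = len(films_covered(sample_summary, winners))
```

Plain substring matching misses abbreviated or misspelled titles, so it gives a lower bound on coverage.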
Multimedia Summaries
[Figures: popularity-based summary and StreamGrid summary]
Is there a systematic, objective way to evaluate these?
Conclusions & Future Work
Summary
• Topic modeling approach to capture automatically the main aspects of the event from a large set of event-related microblogging messages.
• Peak detection on each topic-related timeline to find active moments of each topic.
• Use of active topics to select a set of representative messages for an arbitrary time frame.
• Greedy algorithm for the selection of messages with respect to content coverage and redundancy reduction.
Future Work
• Real-time version of StreamGrid framework to get summaries of evolving and continuous social streams.
• Investigate how different topic modeling techniques affect the produced summary.
• Find a more systematic way to evaluate summaries (especially multimedia!).
Thank you!
Questions?
Key References
• Shou, Lidan, et al. "Sumblr: Continuous summarization of evolving tweet streams." Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013.
• Marcus, Adam, et al. "TwitInfo: Aggregating and visualizing microblogs for event exploration." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011.
• Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
• Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research 22.1 (2004): 457-479.