StreamGrid: Summarization of Large-Scale Events Using Topic Modeling and Temporal Analysis

SoMuS 2014 Workshop ICMR, Glasgow, Scotland, 1 April 2014

StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis

Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas

Information Technologies Institute (ITI), Center for Research & Technology Hellas (CERTH)

Department of Electrical & Computer Engineering, Aristotle University of Thessaloniki (AUTh)

#somus2014 #2

Overview

• The Problem

• Existing Approaches

• StreamGrid

• Experimental Study

• Summary – Future Work


Event Summarization: motivation & definition


Large-scale Public Events

• Many attendees use social media
• Huge amount of event-related social content

#oscars: 4.5M tweets
#sxsw: 1.35M tweets
#SB48 (Super Bowl): 24.9M tweets in 4 hours!

Large-scale Public Events

• Long-running events consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings, etc.

• Many aspects and entities of interest in the context of an event, e.g. films in film festivals, teams in sports events, etc.

• Many messages are spam or non-informative.

• Redundancy due to near-duplicate messages


Event-based Summarization

Produce concise multi-document summaries for a given event, covering its main aspects.

[Diagram: list of all messages → Event-based Summarizer → set of selected messages]

Related Work


Existing Approaches

Radev et al. 2004 (baseline)
• Summary consists of the messages closest to the tf·idf centroid of all messages

Shen et al. 2013
• Mixture model to detect sub-events at participant level
• tf·idf centroid to find a summary of each sub-event

Chakrabarti & Punera 2011
• Hidden Markov Model to obtain a time-based segmentation
• tf·idf centroid to find a summary of each time segment
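The centroid idea shared by these three approaches can be illustrated with a minimal sketch. It is not the code of any of the cited papers: the whitespace tokenizer, the smoothed idf formula and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption; the slides do not give the exact variant).
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_summary(messages, length):
    """Pick the `length` messages closest to the tf-idf centroid."""
    docs = [m.lower().split() for m in messages]
    vecs = tfidf_vectors(docs)
    centroid = Counter()
    for v in vecs:
        for t, w in v.items():
            centroid[t] += w / len(vecs)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(vecs[i], centroid), reverse=True)
    return [messages[i] for i in ranked[:length]]
```

Applied to a whole event this gives the baseline; applied separately to each sub-event or time segment it gives the per-segment variant.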


Existing Approaches

Erkan & Radev 2004 (LexRank)
• Graph-based approach to find salient sentences
• Uses centrality of each sentence in a similarity graph
• Adapted for multi-document summarization using each message as a sentence
• Outperforms naïve centroid-based approach

Shou et al. 2013
• Online clustering algorithm to find sub-events.
• Greedy algorithm for summarization using the LexRank score of each message.
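A simplified sketch of the graph-based scoring follows. It treats each message as a node, connects messages whose cosine similarity exceeds a threshold, and runs a PageRank-style power iteration. The raw term counts (instead of idf-weighted vectors), the threshold and the damping value are simplifying assumptions, not the published LexRank settings.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank_scores(messages, threshold=0.1, damping=0.85, iterations=50):
    """Return one centrality score per message."""
    bags = [Counter(m.lower().split()) for m in messages]
    n = len(bags)
    # Unweighted similarity graph: edge if two distinct messages are similar enough.
    adj = [[1.0 if i != j and cosine(bags[i], bags[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(adj[j][i] * scores[j] / degree[j]
                           for j in range(n) if degree[j])
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

Because retweets are nearly identical to each other, they form dense cliques in this graph and all receive high scores, which is one way to see why a LexRank summary can be redundant on tweet data.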

StreamGrid: approach description

StreamGrid Overview

• Find topics using Latent Dirichlet Allocation (LDA)
• Create a timeline for each topic
• Create StreamGrid structure
• Summarize using StreamGrid

Topic Modeling using LDA

• To work with very short documents (tweets), LDA needs some kind of message pooling

• Estimating the number of topics
– Minimize: (a) total perplexity for a set of test documents and (b) average textual similarity across topics

Pooling schemes
• Time proximity
• Same author
• Same hashtag
• Textual similarity

[Diagram: microblog messages → merge → merged messages]
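As an illustration of one pooling scheme, the sketch below merges messages that share a hashtag into a single pseudo-document that could then be passed to LDA. The regular expression, the lower-casing and the handling of messages without hashtags are assumptions; the slides do not specify them.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")

def pool_by_hashtag(messages):
    """Merge messages that share a hashtag into one pseudo-document per tag.

    Returns (merged, unpooled): merged maps a lower-cased hashtag to the
    concatenated text; unpooled lists messages that carry no hashtag
    (keeping them as separate documents is an assumption).
    """
    pools = defaultdict(list)
    unpooled = []
    for text in messages:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            unpooled.append(text)
        for tag in tags:
            pools[tag].append(text)
    merged = {tag: " ".join(texts) for tag, texts in pools.items()}
    return merged, unpooled
```

A message with several hashtags contributes to several pools here; in a collection gathered by event hashtags, the event-wide tags would likely need to be excluded, otherwise they pool nearly everything together.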


Topic Modeling using LDA

• Split the documents D into D_train and D_test

• Estimate K topics over D_train

• Calculate the total perplexity of D_test
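The perplexity step can be sketched as follows, using the usual definition exp(−Σ log p(w) / N) over all N test tokens. The sketch assumes that the topic-word distributions learned on D_train and the per-document topic mixtures inferred for D_test are already available; the data layout and names are hypothetical.

```python
import math

def perplexity(test_docs, doc_topic, topic_word):
    """Perplexity of held-out documents under a fitted topic model.

    test_docs:  list of tokenized documents
    doc_topic:  doc_topic[d][k] = p(topic k | document d)
    topic_word: topic_word[k][w] = p(word w | topic k), as dicts
    """
    log_likelihood, n_tokens = 0.0, 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            # Tiny floor for unseen words avoids log(0); the value is an assumption.
            p = sum(theta * topic_word[k].get(w, 1e-12)
                    for k, theta in enumerate(doc_topic[d]))
            log_likelihood += math.log(p)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)
```

Repeating this for several values of K gives the perplexity curve used to choose the number of topics; lower is better.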


StreamGrid Creation

• Assign each message to the topic with the highest probability p, provided that p > p_th (spam messages are discarded)

• Create StreamGrid

[Grid diagram: rows are topics i, columns are time intervals j]

cell c(i,j) = {set of messages associated with topic i, posted during time interval j}
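The grid construction can be sketched as a dictionary keyed by (topic, time interval). The input layout, the default threshold and the one-hour interval length are assumptions chosen for illustration.

```python
from collections import defaultdict

def build_streamgrid(messages, p_th=0.5, interval=3600):
    """Group messages into cells c(i, j).

    messages: iterable of (text, timestamp_in_seconds, topic_probabilities)
    p_th:     probability threshold (hypothetical default)
    interval: length of a time interval in seconds (hypothetical default)
    """
    grid = defaultdict(list)
    for text, timestamp, probs in messages:
        topic = max(range(len(probs)), key=probs.__getitem__)
        if probs[topic] <= p_th:
            continue  # no sufficiently probable topic: message is discarded
        grid[(topic, int(timestamp // interval))].append(text)
    return dict(grid)
```

The number of messages per cell along one row then gives the timeline of that topic.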


StreamGrid Creation

• For each cell c(i,j), calculate a merged tf·idf vector u_ij

• For each term t, calculate the weight [equation], where tf_ij(t) is the frequency of t in cell c(i,j)

• For each message m of c(i,j), calculate the weight [equation]
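The exact weighting formulas are not preserved in this text, so the following is only a sketch under stated assumptions: each cell is treated as one document, the idf is computed over cells with a smoothed logarithm, and a message is weighted by the average cell weight of its distinct terms. None of these choices is confirmed by the slides.

```python
import math
from collections import Counter

def cell_vectors(grid):
    """grid: {(topic, interval): [tokenized message, ...]}.

    Returns {(topic, interval): {term: weight}} with
    weight = tf_ij(t) * log(1 + n_cells / df(t))   (assumed formula).
    """
    tf = {cell: Counter(t for msg in msgs for t in msg)
          for cell, msgs in grid.items()}
    df = Counter(t for counts in tf.values() for t in counts)
    n_cells = len(grid)
    return {cell: {t: c * math.log(1 + n_cells / df[t])
                   for t, c in counts.items()}
            for cell, counts in tf.items()}

def message_weight(message, cell_vector):
    """Average cell weight of the distinct terms of a tokenized message (assumed)."""
    terms = set(message)
    if not terms:
        return 0.0
    return sum(cell_vector.get(t, 0.0) for t in terms) / len(terms)
```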


StreamGrid Creation

• Detect active cells of each topic by applying peak detection on the associated topic timeline.

• Given a topic i and a detected peak in time window [a,b], all cells c(i,j), a < j < b, are defined as active.

• For the set of active topics A during a time interval j, calculate a significance score: [equation]
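The active-cell detection in the first two bullets can be sketched as below. The slides do not say which peak detector is used, so a simple mean-plus-k-standard-deviations threshold stands in for it; the window [a, b] is taken to start and end at the intervals just outside the run above the threshold, so that the active cells are those with a < j < b.

```python
from statistics import mean, pstdev

def detect_peaks(timeline, k=1.0):
    """Return peak windows (a, b) for one topic timeline.

    timeline: message counts per time interval
    k:        threshold multiplier (hypothetical default)
    """
    threshold = mean(timeline) + k * pstdev(timeline)
    windows, start = [], None
    for j, count in enumerate(timeline):
        if count > threshold and start is None:
            start = j
        elif count <= threshold and start is not None:
            windows.append((start - 1, j))
            start = None
    if start is not None:  # peak still running at the end of the timeline
        windows.append((start - 1, len(timeline)))
    return windows

def active_cells(topic, timeline, k=1.0):
    """All cells c(topic, j) with a < j < b for some detected window (a, b)."""
    return {(topic, j)
            for a, b in detect_peaks(timeline, k)
            for j in range(a + 1, b)}
```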


StreamGrid Creation

• To get an overall estimation of the importance of each topic throughout the event, we calculate two measures: [equations]


Topic-time Summarization

• Our goal is the generation of a summary of an event for an arbitrary time frame F=[x1,x2].

• The summary has to meet the following criteria:
– As many aspects of the event as possible are covered
– Redundancy due to near-duplicate messages is minimized

• We use a greedy algorithm that selects important messages from each active topic in F and minimizes redundancy simultaneously.


• A topic i is active in F if any of the cells contained in F is active.
• The significance score of an active topic i in F is the max significance score across all time intervals in F.
• The weight W(m,F) of a message m in F is the sum of the weights in each time interval.
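These three rules can be sketched as one aggregation over the time frame. The dictionary layouts and the inclusive frame bounds are assumptions made for illustration.

```python
def frame_view(active, significance, weights, frame):
    """Aggregate StreamGrid information over a time frame F = [x1, x2].

    active:       set of active cells (topic, interval)
    significance: {(topic, interval): significance score}
    weights:      {(message_id, interval): message weight}
    frame:        (x1, x2) interval indices, both inclusive (assumed)
    """
    x1, x2 = frame

    def in_frame(j):
        return x1 <= j <= x2

    # A topic is active in F if any of its cells inside F is active.
    active_topics = {i for (i, j) in active if in_frame(j)}
    # Topic significance in F: maximum over the intervals of F.
    topic_score = {
        i: max((s for (t, j), s in significance.items()
                if t == i and in_frame(j)), default=0.0)
        for i in active_topics
    }
    # W(m, F): sum of the message weights over the intervals of F.
    message_weight = {}
    for (m, j), w in weights.items():
        if in_frame(j):
            message_weight[m] = message_weight.get(m, 0.0) + w
    return active_topics, topic_score, message_weight
```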

[Diagram: two time frames F and F′ on the StreamGrid, each with its own set of active topics]

Topic-time Summarization: Algorithm

Input: StreamGrid, time frame F, summary length L
Output: summary set S
1. Get active topics in F
2. for each active topic, select the message with the highest weight and add it to Mc
3. while |S| < L do
4.     for each message m in Mc do
5.         calculate score(m)
6.     end for
7.     Add the message with the highest score to S and remove it from Mc
8. end while


• The score of a message m is a combination of its importance and of the redundancy introduced by its selection.

• Redundancy is the average textual similarity between m and the already selected messages in S
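A minimal sketch of the greedy selection with such a score is given below. The slides only say that importance and redundancy are combined, so the linear combination with a parameter alpha, the Jaccard similarity and the assumption that weights are scaled to [0, 1] are all illustrative choices.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity (stand-in for the textual similarity)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def greedy_summary(candidates, length, alpha=0.5):
    """candidates: {message_text: weight in [0, 1]}; returns the selected messages."""
    pool = dict(candidates)
    summary = []
    while pool and len(summary) < length:
        def score(m):
            # Redundancy: average similarity to the messages selected so far.
            redundancy = (sum(jaccard(m, s) for s in summary) / len(summary)
                          if summary else 0.0)
            return alpha * pool[m] - (1 - alpha) * redundancy
        best = max(pool, key=score)
        summary.append(best)
        del pool[best]
    return summary
```

With alpha = 0.5, a near-duplicate of an already selected message loses to a lower-weighted message on a different film, which is the behaviour the two criteria ask for.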


Experimental Study: Sundance Film Festival 2013

Dataset & Event

Sundance Film Festival
• Two-week festival: Jan 15-30, 2013
• Data collection based on the Streaming API with the following parameters:
– hashtags: #sundance, #sundance2013, #sundancefest
– account: @sundancefest
• Total number of tweets: 201,752
• Total number of original tweets: 100,046

Topic Modeling

• Merging messages with the same hashtag gave the best results with respect to perplexity.

• Main trend for perplexity is to decrease as K increases.

• Average similarity between clusters stabilized for K>200 → K = 200

Peaky & Persistent Topics


Event Timeline

[Timeline figure with two annotated peaks: “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film; Awards ceremony]

Selected Timeslots

• Evaluate using two timeslots with high activity.

• The first time frame has a small number of very popular tweets mainly about two films.

• The second is a more diverse set of tweets.
• A good measure of the quality of a summary is the number of films covered.

From | To | Tweets | Description
Mon Jan 21 05:00:00 EET 2013 | Mon Jan 21 06:00:00 EET 2013 | 5755 | “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
Sun Jan 27 03:00:00 EET 2013 | Sun Jan 27 09:00:00 EET 2013 | 9009 | Awards ceremony

Baselines

• Random Summarizer: Selects L random tweets.
• Popularity Summarizer: Selects the top L tweets based on retweet count.
• tf·idf Summarizer: Uses the tf·idf weight of each tweet to select the top L.
• Cluster-based Summarizer: Creates L clusters using k-means clustering and selects the highest weighted message of each cluster.
• LexRank Summarizer: Graph-based method that assigns a weight to each tweet based on its adjacent edges.
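The two simplest baselines can be sketched in a few lines. The tweet dictionary keys (`text`, `retweet_count`, `is_retweet`) are hypothetical field names, and restricting the popularity baseline to original tweets follows the remark on the Timeslot #1 slide.

```python
import random

def random_summary(tweets, length, seed=None):
    """Select `length` tweets uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(length, len(tweets)))

def popularity_summary(tweets, length):
    """Select the `length` most retweeted original tweets.

    tweets: list of dicts with (hypothetical) keys
            'text', 'retweet_count' and optionally 'is_retweet'.
    """
    originals = [t for t in tweets if not t.get("is_retweet", False)]
    ranked = sorted(originals, key=lambda t: t["retweet_count"], reverse=True)
    return ranked[:length]
```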


Timeslot #1 (Stoker & Use Orally as Indicated)

Popularity-based Summarizer
• 5/10 tweets of the summary are related to the film “Stoker” → tends to cover only a few popular aspects of the event
• Minimizes near-duplicate redundancy, as it uses only the original tweets.
• “Use Orally as Indicated” is the second film covered in the summary (130 RTs)


LexRank Summarizer
• 9/10 tweets of the summary are retweets of a tweet related to the “Use Orally as Indicated” film → a lot of redundancy
• These tweets have high degree centrality, as there are many connections between them.

tf·idf Summarizer
• Covers two different films (Stoker, Stuart Hall).
• Many tweets about these films.


StreamGrid Summarizer
• Covers five different films (The Look of Love, Dirty Wars, Before Midnight, Kill Your Darlings, Life According to Sam)
• There are no duplicates or near-duplicates.
• “Stoker” and “Use Orally as Indicated” are not covered!
• A combination of StreamGrid Summarization and Popularity Summarization could solve this.

Timeslot #2 (Awards Ceremony)

KPI: Number of winning films covered by the summary

• Popularity-based summarizer outperforms all other approaches: covers 8 films that won an award that night (Afternoon Delight, Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty Wars, Crystal Fairy, Pussy Riot)

• StreamGrid covers 6 films (Computer Chess, Inequality for All, Fruitvale, Afternoon Delight, In a World..., American Promise).

• Only two films in common → Integrate popularity into StreamGrid to obtain better results.

• LexRank does not cover any of the winning films, but includes this: 'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan to Blame?

• tf·idf Summarizer covers three films, but none of the winning ones!

Multimedia Summaries


Popularity-based summary

StreamGrid summary

Is there a systematic, objective way to evaluate these?

Conclusions & Future Work


Summary

• Topic modeling approach to automatically capture the main aspects of the event from a large set of event-related microblogging messages.

• Peak detection on each topic-related timeline to find active moments of each topic.

• Use of active topics to select a set of representative messages for an arbitrary time frame.

• Greedy algorithm for the selection of messages with respect to content coverage and redundancy reduction.

Future Work

• Real-time version of the StreamGrid framework to get summaries of evolving and continuous social streams.

• Investigate how different topic modeling techniques affect the produced summary.

• Find a more systematic way to evaluate summaries (especially multimedia!).


Thank you!

Questions?

Key References

• Shou, Lidan, et al. "Sumblr: continuous summarization of evolving tweet streams." Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013.

• Marcus, Adam, et al. "Twitinfo: aggregating and visualizing microblogs for event exploration." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2011.

• Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.

• Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research 22.1 (2004): 457-479.
