
BCS SGAI Workshop on Social Media Analysis, 10th December 2013

Mining Newsworthy Topics from Social Media

Carlos Martin, David Corney and Ayse Goker (Robert Gordon University)
Andrew MacFarlane (City University London)

Outline

• Introduction & Motivation
• BNgram approach
• Further modifications
• Experiments
• Demo
• Conclusions and Future work
• References


Introduction & Motivation

• Newsworthy stories are increasingly being shared through social networking platforms such as Twitter and Reddit

• Journalists use Social Media to rapidly discover stories and eye-witness accounts.


Introduction & Motivation

• Other tools to detect newsworthy stories:
– Twitter trends – http://www.twitter.com
– Trendsmap – http://trendsmap.com/
– NewsWhip – http://www.newswhip.com/

Introduction & Motivation

• Gap in the market:
– Story description is incomplete or unclear (based on the use of hashtags and entities)
– Use of mainstream media
• Proposal: an approach to detect newsworthy stories in real time from Twitter, where the story description is complete and posts from social network users are associated with each story
– Journalists and news readers don’t get overwhelmed.


BNgram approach

• Detection of the most representative topics in a timeslot, with special emphasis on the temporal dimension of the data.

1. Detection of emerging phrases (word n-grams) based on the df-idft score, a variant of tf-idf:

\[
\text{df-idf}_t = \frac{df_i + 1}{\log\left(\frac{\sum_{j=1}^{t} df_{i-j}}{t} + 1\right) + 1} \cdot boost
\]

where df_i is the document frequency of the n-gram in timeslot i, t is the number of previous timeslots considered, and boost is the boost factor.

N-grams are ranked per timeslot by df-idft, avoiding overlaps. Boost factor: named entity recognition (Stanford), 3-class classifier (person, location and organization).
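The df-idft score can be sketched in a few lines of Python. This is a minimal illustration, assuming the score has the form (df_i + 1) / (log(mean df over the t previous slots + 1) + 1) × boost. The function name `df_idf_t`, the list-based history and the example boost value of 1.5 are assumptions of this sketch, not details taken from the slides.

```python
import math


def df_idf_t(df_history, boost=1.0):
    """Score one n-gram in the current slot.

    df_history: document frequencies of the n-gram per slot, oldest first.
                The last entry is the current slot i; the entries before it
                are the t previous slots.
    boost:      multiplier for n-grams containing a named entity
                (the value used by the authors is not given here).
    """
    *previous, current = df_history
    t = len(previous)
    # Average document frequency over the t previous slots (0 if no history).
    mean_previous = sum(previous) / t if t else 0.0
    return (current + 1) / (math.log(mean_previous + 1) + 1) * boost


# An n-gram that suddenly appears scores higher than one that was
# already frequent in the previous slots.
emerging = df_idf_t([0, 0, 0, 9])          # 10.0
steady = df_idf_t([9, 9, 9, 9])            # lower than `emerging`
boosted = df_idf_t([0, 0, 0, 9], boost=1.5)  # 15.0 (hypothetical boost value)
```

The logarithm in the denominator damps the penalty for past frequency, so a phrase needs a sharp rise relative to its recent history to rank highly.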

BNgram approach

2. Hierarchical clustering of the top k n-grams with the highest df-idft scores. Topic score is computed as the maximum df-idft of its n-grams.
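The clustering step can be illustrated with a small agglomerative procedure. This is a sketch under stated assumptions: the distance between two n-grams (one minus the fraction of shared tweets relative to the smaller tweet set), the group-average linkage and the 0.5 cut-off are choices made for this example, and the data are toy values.

```python
def distance(docs_a, docs_b):
    """Assumed measure: 1 - shared tweets / size of the smaller tweet set.

    Both sets must be non-empty.
    """
    return 1.0 - len(docs_a & docs_b) / min(len(docs_a), len(docs_b))


def cluster_ngrams(ngram_docs, threshold=0.5):
    """Group-average agglomerative clustering of n-grams.

    ngram_docs: dict mapping each n-gram to the set of tweet ids containing it.
    Merging stops once the closest pair of clusters is at least
    `threshold` apart.
    """
    clusters = [[name] for name in ngram_docs]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance(ngram_docs[a], ngram_docs[b])
                        for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] >= threshold:
            break
        _, i, j = best
        merged = clusters.pop(j)   # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


def topic_score(cluster, scores):
    """Topic score = maximum df-idft score among the topic's n-grams."""
    return max(scores[ngram] for ngram in cluster)


# Toy data (hypothetical tweet ids and scores).
docs = {
    "obama wins": {1, 2, 3, 4},
    "four more years": {2, 3, 4, 5},
    "wins ohio": {8, 9},
}
scores = {"obama wins": 9.0, "four more years": 7.5, "wins ohio": 3.0}
topics = cluster_ngrams(docs)
ranked = sorted(topics, key=lambda c: topic_score(c, scores), reverse=True)
```

Here the first two n-grams share three of four tweets and merge into one topic, while the third stays on its own.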


BNgram approach

• Evaluation benchmark: comparison with four other TDT approaches (document-pivot and feature-pivot) and a baseline (LDA) – TMM paper
• User-centred evaluation:
– Collections: FA Cup, Super Tuesday and US Elections (tracking keywords).
– Ground truth: set of representative topics (manually selected) corresponding to different timeslots, taken from mainstream media (MSM).
– Timeslot size: FA Cup – 1 min.; Super Tuesday and US Elections – 10 min.
– Topics: 13 FA Cup, 22 Super Tuesday and 64 US Elections.

BNgram approach

• Collections:


BNgram approach

• Results – TMM paper

| Method | T-REC@2 – FA Cup | T-REC@10 – Super Tuesday | T-REC@10 – US Elections |
|---|---|---|---|
| Latent Dirichlet Allocation (baseline) | 0.6923 | 0 | 0.1094 |
| Document-pivot topic detection | 0.7692 | 0.2273 | 0.2344 |
| Graph-based feature-pivot topic detection | 0 | 0.0455 | 0.0781 |
| Frequent pattern mining | 0.3077 | 0.1364 | 0 |
| Soft frequent pattern mining | 0.6154 | 0.1818 | 0.3594 |
| BNgram | 0.7692 | 0.5 | 0.4844 |

BNgram approach

• Examples of topics

| Dataset | Detected topic | Corresponding story | Sample tweet |
|---|---|---|---|
| FA Cup | over line saved super cech claiming went @chelseafc carroll header liverpool #cfcwembley #facupfinal sl | Liverpool nearly score. Andy Carroll takes a shot. Petr Cech makes a fantastic save. | Liverpool nearly score. Andy Carroll takes a shot. Petr Cech makes a fantastic save. |
| Super Tuesday | romney wins virginia republican presidential primary mitt @ap breaking | Fox/NBC is projecting Mitt Romney has won the Virginia primary. | @ap: BREAKING: Mitt Romney wins the Virginia Republican presidential primary. -RAS |
| US Elections | @barackobama four more years | Several television networks report Obama has been reelected; Obama tweeted “Four more years” | @MessyNelle: @barackobama four more years http://t.co/6ortbfqt |


Further modifications

• BNgram approach modifications:
– Study of different types of n-grams.
– Timeslots vs. number-of-tweets slots.
– Other clustering techniques have been tested for the BNgram approach: Apriori and GMM algorithms.
– A new topic ranking technique has been considered.

N-grams

• Word order is often essential to indicate meaning. For example, 'dog bites man' is not news, but 'man bites dog' is news. A bag-of-words approach cannot distinguish these cases.

• Popular in NLP
• In this work, "n-grams" refers to sequences of up to n consecutive terms
• Copies of posts and RTs are very frequent in Twitter space. Posts are short and focused (140 characters).
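A short sketch of n-gram extraction shows why word order survives where a bag of words loses it. The whitespace tokenisation and the helper name `word_ngrams` are assumptions for illustration; real tweets would need a proper tokeniser.

```python
def word_ngrams(text, n=3):
    """Return all word sequences of length 1..n from a lowercased text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[start:start + size])
        for size in range(1, n + 1)
        for start in range(len(tokens) - size + 1)
    ]


news = word_ngrams("Man bites dog")
not_news = word_ngrams("Dog bites man")

# The two sentences contain the same words ...
same_bag = sorted("man bites dog".split()) == sorted("dog bites man".split())
# ... but only the first contains the 3-gram "man bites dog".
distinguishable = "man bites dog" in news and "man bites dog" not in not_news
```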

Timeslots vs. Number of tweet slots

• What’s the best timeslot size?
• Alternative: slots with a fixed number of tweets instead of a fixed time span – minimal changes in the approach.
• Small slot size: stories are missed
• Large slot size: delay in some stories (refresh rate)
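Slotting a stream by tweet count rather than by time can be sketched as a small generator. The name `slots_by_count` and the choice to hold back the last, still incomplete slot are assumptions of this example.

```python
def slots_by_count(tweet_stream, slot_size):
    """Yield consecutive slots of exactly `slot_size` tweets.

    Tweets are assumed to arrive in time order. A trailing slot that has
    not yet filled up is not yielded.
    """
    slot = []
    for tweet in tweet_stream:
        slot.append(tweet)
        if len(slot) == slot_size:
            yield slot
            slot = []


# With a busy stream a slot fills quickly; with a quiet one it takes longer,
# so the slot duration adapts to the tweet rate.
slots = list(slots_by_count(range(7), 3))  # [[0, 1, 2], [3, 4, 5]]
```

Each yielded slot can then be scored against the preceding slots exactly as a timeslot would be.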

Clustering approaches

• Weakness detected in our clustering technique:
– Example: US elections n-gram ranking (sorted by df-idft):

| Position | N-gram | Docs |
|---|---|---|
| #1 | Barack Obama wins | 1,2,4,6,7,8,9,10 |
| #2 | wins Wisconsin | 1,2,4,6 |
| #3 | wins California | 7,8,9,10 |

• Basic hierarchical clustering: incomplete stories.
– From our example, the candidate clusters could be:
  • Cluster 1: Barack Obama wins + wins Wisconsin (complete)
  • Cluster 2: wins California (incomplete: who?)
• New grouping techniques where one n-gram can be assigned to different clusters.

Clustering approaches – Gaussian Mixture Models (GMM)

• Unsupervised method
• Assigns probabilities (or strengths) of membership of each n-gram to each cluster – partial membership
• Iterative approach: tries to find the parameters of the probability distribution that maximise the likelihood of the data.
• Input: number of clusters – Bayesian Information Criterion (BIC)

Clustering approaches – Gaussian Mixture Models (GMM)

• Expectation-Maximisation – two steps:
– E-step: estimates the probability that each point belongs to each cluster.
– M-step: re-estimates the parameter vector of the probability distribution of each cluster.
• The algorithm finishes when the distribution parameters converge or the maximum number of iterations is reached.
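The two EM steps can be made concrete with a one-dimensional mixture. This is a toy sketch: how n-grams are turned into numeric points is not stated in the slides, so the points below are arbitrary numbers, the number of clusters is fixed by hand rather than chosen by BIC, and the function names are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(points, initial_means, max_iter=100, tol=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and memberships."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in points:
            dens = [weights[c] * gaussian(x, means[c], variances[c])
                    for c in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each cluster.
        new_means = []
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            mean = sum(r[c] * x for r, x in zip(resp, points)) / n_c
            var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, points)) / n_c
            weights[c] = n_c / len(points)
            variances[c] = max(var, 1e-6)  # guard against collapse to zero
            new_means.append(mean)
        # Stop when the means no longer move (or max_iter is reached).
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights, resp


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights, memberships = fit_gmm_1d(points, [0.0, 6.0])
```

Each row of `memberships` holds the partial membership of one point in every cluster, which is what allows an item to contribute to more than one cluster.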

Clustering approaches - Apriori algorithm

• Explore associations between n-grams based on the number of shared tweets.

• Number of n-grams per association: Each association contains from 1 n-gram to the considered number of n-grams from the ranking.

• One association is considered if the number of shared tweets for the n-grams of the association is bigger than a threshold (support value).

• In a subsequent step, the maximal associations are obtained to avoid overlaps.

Clustering approaches - Apriori algorithm

• From the previous example (if the threshold is 3):
– Candidate associations: #1, #2, #3, #1#2, #1#3
– Maximal associations: #1#2, #1#3
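The example can be reproduced with a small level-wise search in the style of Apriori. This is a sketch, not the authors' implementation: the function names are hypothetical, and "bigger than a threshold" is read as strictly more shared tweets than the support value.

```python
from itertools import combinations


def frequent_associations(ngram_docs, min_support):
    """Find every set of n-grams sharing more than `min_support` tweets.

    ngram_docs: dict mapping an n-gram label to the set of tweet ids
                containing it.
    Returns a dict: frozenset of labels -> set of shared tweet ids.
    """
    # Level 1: single n-grams with enough tweets.
    current = {
        frozenset([name]): set(docs)
        for name, docs in ngram_docs.items()
        if len(docs) > min_support
    }
    frequent = dict(current)
    while current:
        next_level = {}
        for (a, docs_a), (b, docs_b) in combinations(current.items(), 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Apriori pruning: all subsets one item smaller must be frequent.
            if any(candidate - {item} not in current for item in candidate):
                continue
            shared = docs_a & docs_b
            if len(shared) > min_support:
                next_level[candidate] = shared
        frequent.update(next_level)
        current = next_level
    return frequent


def maximal_associations(frequent):
    """Keep only associations that are not contained in a larger one."""
    return [s for s in frequent if not any(s < other for other in frequent)]


ranking = {
    "#1": {1, 2, 4, 6, 7, 8, 9, 10},  # Barack Obama wins
    "#2": {1, 2, 4, 6},               # wins Wisconsin
    "#3": {7, 8, 9, 10},              # wins California
}
candidates = frequent_associations(ranking, min_support=3)
maximal = maximal_associations(candidates)
```

With these inputs the candidates are #1, #2, #3, #1#2 and #1#3, and the maximal associations are #1#2 and #1#3, so "Barack Obama wins" ends up in both stories.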

Topic ranking

• Maximum df-idft n-gram approach is not the best alternative for these new clustering techniques

• Problematic for slots with active and diverse topics.

[Figure: mapping from the n-gram ranking (n-gram1, n-gram2, ...) to the topic ranking (topic1 to topic5)]

Topic ranking

• Weighted topic-length approach:

\[
s_t = \alpha \cdot \frac{L_t}{L_{max}} + (1 - \alpha) \cdot \frac{N_t}{N_s}
\]

where s_t is the score of topic t, L_t is the length of the topic, L_max is the maximum number of terms in any topic from the current slot, N_t is the number of tweets in topic t and N_s is the number of tweets in the slot. Finally, α is a weighting term.
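A sketch of this ranking, assuming the score is the weighted sum α·L_t/L_max + (1 − α)·N_t/N_s, a form consistent with the symbol definitions above. The function names, the default α of 0.5 and the toy topics are assumptions of this example.

```python
def weighted_topic_score(topic_length, max_length, topic_tweets, slot_tweets,
                         alpha=0.5):
    """Assumed form: alpha * L_t / L_max + (1 - alpha) * N_t / N_s."""
    return (alpha * topic_length / max_length
            + (1 - alpha) * topic_tweets / slot_tweets)


def rank_topics(topics, slot_tweets, alpha=0.5):
    """Rank topics of one slot, best first.

    topics: list of (terms, tweet_ids) pairs.
    slot_tweets: number of tweets in the slot (N_s).
    """
    max_length = max(len(terms) for terms, _ in topics)
    scored = [
        (weighted_topic_score(len(terms), max_length, len(ids),
                              slot_tweets, alpha), terms)
        for terms, ids in topics
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy slot of 10 tweets: a short topic with many tweets and a long topic
# with few tweets (hypothetical data).
topics = [
    (["obama", "wins"], {1, 2, 3, 4, 5, 6}),
    (["romney", "concedes", "speech", "boston"], {7, 8}),
]
by_length = rank_topics(topics, slot_tweets=10, alpha=1.0)  # long topic first
by_volume = rank_topics(topics, slot_tweets=10, alpha=0.0)  # popular topic first
```

Under this form, α close to 1 favours topics with more terms, while α close to 0 favours topics covering more tweets.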

Evaluation

• We have estimated the starting and ending times of each event in the ground-truth

[Figure: timeline of slots i-3 to i between the starting and ending times of an event; the top m topics of each slot are merged to evaluate the event]


Experiments – n-grams

• Topic recall for different types of n-grams and three datasets using hierarchical clustering and maximum n-gram topic ranking techniques and fixing the slot size to 1000 tweets (similar patterns observed using other configurations)


Experiments – n-grams

• Normalised area under the curve for the three datasets and their weighted average.

Experiments – slot size

• Topic recall for different slot sizes using hierarchical clustering and weighted topic-length topic ranking techniques (3-grams).
• Possible correlation between slot size and tweet rate in tweets per minute (Super Tuesday: 832 tpm, FA Cup: 1293 tpm, US elections: 2209 tpm)
• Consider the refresh rate of the UI

Experiments – clustering and topic ranking techniques

• Topic recall for different clustering techniques in the three datasets, using both topic ranking techniques (3-grams and slot size = 1500 tweets)

Experiments – clustering and topic ranking techniques

• Normalised area under the curve


Demo

• Social Sensor project – http://www.socialsensor.eu


Conclusions and Future work

• New TDT approach based on temporal dimension of data and n-grams in Twitter space

• Improve tracking issues – ongoing
• Trust and verification based on following newshounds – ongoing
• Improve topic titles – ongoing
• Better association of tweets to topics – ongoing
• Improve evaluation methods/metrics
• Smoothing techniques for df-idft computation
• Entity recognition – other approaches (Illinois NLP tools, …)
• Participation in TDT challenges (SNOW14)


References

• Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., Jaimes, A.: Sensing trending topics in Twitter. IEEE Transactions on Multimedia 15(6) (2013) 1268–1282

• Martin, C., Corney, D., Goker, A.: Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter 1(3) (September 2013)

• Schifferes, S., Newman, N., Thurman, N., Corney, D., Göker, A., Martin, C.: Identifying and verifying news through social media: Developing a user-centred tool for professional journalists. In: The Future of Journalism Conference 2013, Cardiff, UK (2013)

• Spot the ball: Detecting sports events on Twitter. In: Proceedings of ECIR 2014, Amsterdam, Netherlands (to appear)

Thank you!
