hashtags as milestones in time

Post on 08-Feb-2016

DESCRIPTION

Hashtags as Milestones in Time. Stewart Whiting, University of Glasgow; Omar Alonso, Microsoft/Bing. Time Aware Information Access Workshop, SIGIR Oregon, 2012. (Work done while on internship at Microsoft.) Identifying the hashtags for meaningful…

TRANSCRIPT

Hashtags as Milestones in Time
Identifying the hashtags for meaningful events using Twitter search logs and Wikipedia data

Stewart Whiting, University of Glasgow
Omar Alonso, Microsoft/Bing

Time Aware Information Access Workshop, SIGIR Oregon, 2012.
(Work done while on internship at Microsoft)

Alright…

Outline
1. Hashtags as milestones in time
2. Introduction
   1. Why milestones?
   2. Why hashtags? Can they be useful as milestones?
3. Motivation
4. Approach
   1. Data preparation
   2. Approach steps
5. Constructing a timeline – examples
6. Preliminary conclusions

Abstract: Hashtags as milestones in time
What we want to do:
• Identify event-based hashtags, for timeline creation
  – Currently using historic/past data
• Filter out junk
• Find most temporally significant hashtags
  – Use multiple signals: Twitter search logs + related Wikipedia article popularity
• We are not doing topic detection/tracking!

Why?
• A good way to express (anchor) a topic on a timeline…
• Help users make sense of/navigate temporal information

#what?

Introduction
• Hashtags are used by authors to explicitly denote the relevant topic(s) in a message
  – “Great passing, great game #euro2012”
• Used by authors and searchers
  – Broadcast or consume a specific topic
  – Especially useful in short text retrieval, where bag-of-words/language modelling approaches are challenging
• Reflect mainstream events (or memes!) in real-time
  – See trending topics right now
• Timelines are very good for displaying events
  – But you need to express the events as a meaningful marker, or milestone!

Introduction to the data
• Two crowds of people
  – Authors/searchers on Twitter
  – Editors/browsers on Wikipedia
• Correlation between signals from the two crowds
  – People search for what is happening
  – People edit Wikipedia with what is happening
  – Two very distinctive signals!

Twitter hashtag signals (in search logs)

• But plenty of memes too…
  – #20PeopleWhoIWantToMeet
  – #PresentingInTheBatCave
  – #whiteppldoitbutblackppldont

Wikipedia signals

• Whitney Houston
• TV appearances
• Her death in February 2012
• Events were reflected by discussion with hashtags in Twitter, e.g.
  – #ripwhitney
  – #bgtwhitney (BGT = Britain’s Got Talent)

Motivation
• Both signals have large coverage
  – Celebrities, news, weather, people, science, movies etc.
• Two robust signals coming from large crowds
  – Difficult for individuals to influence (spam?)
  – Not so reliant on single-signal analysis (e.g. wavelets or burst detection)
• Discard memes by looking for associated Wikipedia articles.
• Meaningful milestones in timelines provide strong features to navigate temporal content
  – Alonso et al. (2010), Matthews et al. (2010), From et al. (2003)

Data Preparation – Hashtag Data
• Extracted from Bing Social and IE8 query logs
• Provides hashtag use, aggregated per day
• (Proprietary, but could be extracted from other sources)
• Hashtags are mostly a mix of unigrams and bigrams!
• We also want the words in the hashtag
• Need to use a word breaker…
  – We used Microsoft Web N-Gram Services
  – Breaks #crosstownshootout into ‘cross town shootout’ and #basketballwivesla into ‘basketball wives la’
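The word-breaking step can be illustrated with a small sketch. This is not the Microsoft Web N-Gram Services API, which is not reproduced here; it is a minimal dynamic-programming segmenter over a hypothetical toy unigram table (`UNIGRAM_COUNTS`, with invented counts), chosen only to show how a hashtag might be split into its most probable sequence of words.

```python
import math

# Hypothetical toy unigram counts (invented for illustration). A real word
# breaker would rely on web-scale n-gram statistics instead.
UNIGRAM_COUNTS = {
    "cross": 50, "town": 60, "shootout": 10, "shoot": 20, "out": 90,
    "basketball": 30, "wives": 15, "la": 25, "euro": 40, "2012": 35,
}
TOTAL = sum(UNIGRAM_COUNTS.values())


def word_cost(word):
    """Negative log-probability of a word; unknown words get a heavy penalty."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        return 10.0 * len(word)  # assumed penalty, grows with word length
    return -math.log(count / TOTAL)


def break_hashtag(hashtag, max_word_len=20):
    """Split a hashtag into the lowest-cost sequence of words."""
    text = hashtag.lstrip("#").lower()
    # best[i] holds (cost, words) for the cheapest segmentation of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            prev_cost, prev_words = best[start]
            word = text[start:end]
            candidates.append((prev_cost + word_cost(word), prev_words + [word]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]


print(break_hashtag("#crosstownshootout"))
print(break_hashtag("#basketballwivesla"))
```

With this toy table the two hashtags from the slide split into `cross town shootout` and `basketball wives la`. A unigram-only model is the simplest choice; bigram statistics would handle ambiguous splits better.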

Data Preparation – Wikipedia Data
• Created a Lucene index using the Wikipedia Extraction (WEX) data.
• Wikipedia article viewing popularity statistics
  – Dump available for each hour since Dec 2007
  – Published near real-time, for the past hour (on the hour)
  – Huge number of data points!
  – So we sampled 8am/8pm each day
  – Transformed into a daily aggregated time-series (therefore comparable with hashtag signals)
  – Smoothed with exponential smoothing (alpha = 0.2)
  – Over 2 billion data points!
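The aggregation and smoothing steps can be sketched as follows. The slides do not say how the two daily samples were combined, so summing them is an assumption here, as are the function names and the sample values; only alpha = 0.2 comes from the slide.

```python
from collections import defaultdict


def daily_series(hourly_samples):
    """Combine sampled hourly view counts into one value per day.

    hourly_samples: iterable of (iso_date, hour, views) tuples, for example
    the 8am and 8pm samples of each day. Summing is an assumed aggregation.
    """
    totals = defaultdict(int)
    for day, _hour, views in hourly_samples:
        totals[day] += views
    # ISO date strings sort chronologically.
    return [totals[day] for day in sorted(totals)]


def exponential_smoothing(values, alpha=0.2):
    """Simple exponential smoothing: s[0] = x[0], s[t] = a*x[t] + (1-a)*s[t-1]."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Invented sample values, for illustration only.
samples = [
    ("2012-02-10", 8, 100), ("2012-02-10", 20, 200),
    ("2012-02-11", 8, 1000), ("2012-02-11", 20, 1000),
]
daily = daily_series(samples)
print(daily, exponential_smoothing(daily))
```

A low alpha such as 0.2 weights history heavily, so a single-day spike is damped rather than removed.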

Approach Outline
1. For each hashtag from the logs, use the word breaker service to extract the hashtag terms.
2. Use the separated terms to query the Wikipedia index – this maps each hashtag to a set of possibly associated articles.
3. For each article/hashtag pair, prepare same-length, comparable time-series of popularity:
   1. Frequency of hashtag over time
   2. Popularity of article over time
• Pearson correlation coefficient computed.
  – Measures association between the temporality of the hashtag occurrence and the Wikipedia article popularity.
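The correlation step above can be sketched in a few lines. For two same-length series x and y, the Pearson coefficient is the covariance of x and y divided by the product of their standard deviations. The `rank_articles` helper, its `min_r` threshold, and the sample series are hypothetical and only illustrate how candidate articles for one hashtag might be scored; the slides do not specify a cut-off.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two same-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention: a flat series carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_articles(hashtag_series, article_series_by_title, min_r=0.5):
    """Score candidate articles against one hashtag; keep those with r >= min_r."""
    scored = [
        (title, pearson(hashtag_series, series))
        for title, series in article_series_by_title.items()
    ]
    kept = [(title, r) for title, r in scored if r >= min_r]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented daily counts, for illustration only.
hashtag_freq = [0, 1, 10, 3, 1]
candidates = {
    "Article A": [5, 6, 50, 20, 8],  # spikes on the same day as the hashtag
    "Article B": [9, 9, 9, 9, 9],    # flat, so uncorrelated
}
print(rank_articles(hashtag_freq, candidates))
```

Hashtags with no sufficiently correlated article, such as memes without a Wikipedia page, end up with an empty list and can be discarded.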

Example Correlations

Constructing a Timeline

Conclusions
• Early work, but correlating the signals does yield high-profile temporal events
  – Hashtags can therefore be used to anchor events on a timeline
• Occasional spurious correlation (need better hashtag frequency data to improve this)
  – Correlation does not imply causation!
• Future work…
  – Automatic construction of timelines
  – Improving correlation quality – examine time windows
  – Designing an evaluation framework to assess overall timeline quality
