Using Tag Recommendations to Homogenize Folksonomies in Microblogging Environments

Eva Zangerle, Wolfgang Gassler and Günther Specht


Outline

• Motivation

• Approach

• Ranking Methods

• Evaluation

• Future Directions

• Conclusion


Hashtags

• Tags for Tweets

• (Manual) Categorization of conversations

• Follow streams of conversation

• Indicator for certain topic or audience


Motivation

• Only 20% of tweets contain hashtags

• Hashtags can be chosen freely

– #socinfo2011? #socinfo11? #socinfo? all?

– Synonymous hashtags

– Heterogeneity

– Search capability limited

– Which stream to follow?


Motivation

Proposed Solution: Hashtag Recommendations

Goals

• Recommend suitable hashtags while the user is entering a tweet

• Encourage use of hashtags

– Improve search capabilities

– Better categorization

• Fight heterogeneity

– Avoid use of synonymous hashtags


Our Approach in a Nutshell

• Based on a set of existing tweets

• Analysis of entered tweet

• Analysis of dataset

• Recommendations based on hashtags within similar messages


Approach - Workflow

User enters message

Retrieve 500 most similar messages

Retrieve candidate set of hashtags

Ranking of Hashtags

Top-k Recommendations


Crawled Dataset

• Crawled July 2010 – April 2011

• 18,731,800 messages in total

• 3,753,927 messages containing hashtags

– about 20%

– used as dataset for evaluation

• 5,968,571 hashtag occurrences → average of 1.6 hashtags per tagged message

• 585,140 distinct hashtags

– 502,172 hashtags occurred fewer than 5 times
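The ratios on this slide can be re-derived from the raw counts. The short sketch below only repeats that arithmetic; the variable names are illustrative and not taken from the authors' code.

```python
# Counts as reported on the slide.
total_messages = 18_731_800
tagged_messages = 3_753_927       # messages containing at least one hashtag
hashtag_occurrences = 5_968_571   # all hashtag uses within tagged messages
distinct_hashtags = 585_140
rare_hashtags = 502_172           # distinct hashtags used fewer than 5 times

share_tagged = tagged_messages / total_messages
avg_per_tagged = hashtag_occurrences / tagged_messages
share_rare = rare_hashtags / distinct_hashtags

print(f"{share_tagged:.1%} of messages contain hashtags")
print(f"{avg_per_tagged:.2f} hashtags per tagged message")
print(f"{share_rare:.1%} of distinct hashtags occur fewer than 5 times")
```

The last ratio shows that the large majority of distinct hashtags are used only a handful of times, which is the long tail the next slide refers to.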


Longtail Distribution


Hashtags per Tweet


Candidate Set Generation

• Find tweets most similar to the user's tweet

• Cosine similarity of tf/idf weighted term vectors

• Take 500 most similar tweets

• Extract hashtags from these tweets

• Next step: ranking the hashtags
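The candidate set generation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the tokenizer, the smoothed idf formula, stripping hashtags before comparing texts, and the names `tokenize`, `tfidf_vectors`, `cosine` and `candidate_hashtags` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def tokenize(text):
    """Lowercase word tokens; hashtags are stripped first (an assumption)."""
    return re.findall(r"\w+", HASHTAG.sub(" ", text.lower()))

def tfidf_vectors(docs):
    """Return ({term: tf*idf} per tokenized document, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf; the slides do not state which tf/idf variant was used.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def candidate_hashtags(query, corpus, n_similar=500):
    """Return (hashtag, similarity) pairs from the n_similar closest tweets."""
    vectors, idf = tfidf_vectors([tokenize(t) for t in corpus])
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    top = [(s, corpus[i]) for s, i in scored[:n_similar] if s > 0]
    return [(tag.lower(), s) for s, text in top for tag in HASHTAG.findall(text)]
```

Each candidate keeps the similarity of the tweet it came from, so a ranking step can reuse it. For a corpus of millions of tweets an inverted index would replace the linear scan used here.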


Basic Ranking Methods

Input: Set of candidate hashtags (from 500 similar tweets)

Output: Ranked candidate list -> top k shown

1. SimRank

– Use similarity measure of tweets for ranking (tf/idf cosine similarity)

– The higher the similarity of the tweets, the higher the ranking of the corresponding hashtags

2. TimeRank

– Recency of usage of the hashtag

– The more recently a hashtag has been used, the higher the ranking within the candidate hashtags
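As a rough sketch of these two methods, the functions below rank candidates given as `(hashtag, similarity, used_at)` triples. The slides do not say how several occurrences of the same hashtag are combined; taking the highest similarity and the most recent timestamp is an assumption made here, and the function names are hypothetical.

```python
from datetime import datetime

def sim_rank(candidates):
    """Rank hashtags by the best similarity of any tweet carrying them."""
    best = {}
    for tag, similarity, _ in candidates:
        best[tag] = max(best.get(tag, 0.0), similarity)
    return sorted(best, key=lambda t: (-best[t], t))

def time_rank(candidates):
    """Rank hashtags by their most recent use, newest first."""
    latest = {}
    for tag, _, used_at in candidates:
        if tag not in latest or used_at > latest[tag]:
            latest[tag] = used_at
    return sorted(latest, key=lambda t: latest[t], reverse=True)

# Illustrative candidates only; not data from the crawled dataset.
candidates = [
    ("#a", 0.9, datetime(2011, 1, 1)),
    ("#b", 0.5, datetime(2011, 4, 1)),
    ("#a", 0.2, datetime(2010, 8, 1)),
]
print(sim_rank(candidates))   # "#a" first: highest similarity
print(time_rank(candidates))  # "#b" first: most recently used
```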


Basic Ranking Methods

Input: Set of candidate hashtags (from 500 similar tweets)

Output: Ranked candidate list -> top k shown

3. RecCountRank

– Count number of occurrences for each hashtag within candidate list

– The more similar tweets feature the hashtag, the higher the rank of the hashtag

4. PopRank

– Global popularity of the hashtag within the whole dataset

– The more popular a hashtag is overall, the higher its ranking
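Both count-based methods reduce to frequency counting, once within the candidate list and once over the whole dataset. The sketch below assumes the global counts are available as a precomputed dictionary; the function names, the example counts and the alphabetical tie-break are hypothetical.

```python
from collections import Counter

def rec_count_rank(candidate_tags):
    """Rank hashtags by how often they occur among the similar tweets."""
    counts = Counter(candidate_tags)
    return sorted(counts, key=lambda t: (-counts[t], t))

def pop_rank(candidate_tags, global_counts):
    """Rank candidate hashtags by their frequency in the whole dataset."""
    return sorted(set(candidate_tags), key=lambda t: (-global_counts.get(t, 0), t))

# Illustrative values only; not data from the crawled dataset.
candidate_tags = ["#socinfo2011", "#keynote", "#socinfo2011", "#socinfo"]
global_counts = {"#keynote": 5000, "#socinfo2011": 120, "#socinfo": 30}

print(rec_count_rank(candidate_tags))
print(pop_rank(candidate_tags, global_counts))
```

The two methods can disagree: a globally popular but generic hashtag wins under PopRank, while RecCountRank favours the hashtag that dominates among the similar tweets.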


Hybrid Ranking Methods

• Based on 4 basic ranking methods

• $\mathrm{hybridrank}(r_1, r_2) = \alpha \cdot r_1 + (1 - \alpha) \cdot r_2$

• Hybrid ranking computed for all possible combinations of basic ranking methods
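The weighted combination can be sketched as follows. The example assumes each basic method yields a score per hashtag normalized to [0, 1]; the slides do not specify the normalization, so that is an assumption. "All possible combinations" is read here as all pairs of basic methods, since the formula takes two inputs.

```python
from itertools import combinations

def hybrid_rank(scores_1, scores_2, alpha=0.5):
    """Combine two score tables as alpha * r1 + (1 - alpha) * r2 and rank."""
    tags = set(scores_1) | set(scores_2)
    combined = {
        t: alpha * scores_1.get(t, 0.0) + (1 - alpha) * scores_2.get(t, 0.0)
        for t in tags
    }
    return sorted(combined, key=lambda t: (-combined[t], t))

BASIC_METHODS = ["SimRank", "TimeRank", "RecCountRank", "PopRank"]
pairs = list(combinations(BASIC_METHODS, 2))  # 6 pairwise hybrids

# Illustrative normalized scores only.
sim_scores = {"#a": 0.9, "#b": 0.4}
pop_scores = {"#a": 0.1, "#b": 1.0}
print(hybrid_rank(sim_scores, pop_scores, alpha=0.8))  # similarity dominates
print(hybrid_rank(sim_scores, pop_scores, alpha=0.2))  # popularity dominates
```

With α close to 1 the first method decides the order, and with α close to 0 the second one does.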


Evaluation

Randomly select tweet t from dataset

Remove hashtags from t

Use t as input for recommendation algorithm

Compute hashtag recommendations for t

Use proposed ranking methods

Compare top-k recommendations
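One step of this evaluation loop can be sketched as below: the hashtags of a tweet are held out as ground truth, and the top-k recommendations are scored against them. The recall@k and precision@k definitions used here are the common ones (hits divided by the number of true hashtags, and hits divided by k); the slides do not spell out the exact definitions, so treat them as assumptions, along with the helper names.

```python
import re

HASHTAG = re.compile(r"#\w+")

def split_tweet(tweet):
    """Return (text without hashtags, set of lowercased held-out hashtags)."""
    truth = {t.lower() for t in HASHTAG.findall(tweet)}
    text = " ".join(HASHTAG.sub(" ", tweet).split())
    return text, truth

def recall_at_k(recommended, truth, k):
    """Share of the held-out hashtags found in the top-k recommendations."""
    hits = len(set(recommended[:k]) & truth)
    return hits / len(truth) if truth else 0.0

def precision_at_k(recommended, truth, k):
    """Share of the top-k recommendations that are held-out hashtags."""
    hits = len(set(recommended[:k]) & truth)
    return hits / k

text, truth = split_tweet("Great keynote today #SocInfo2011 #keynote")
# Hypothetical recommender output for `text`.
recommended = ["#socinfo2011", "#socinfo", "#conference", "#keynote", "#talk"]
print(recall_at_k(recommended, truth, 5), precision_at_k(recommended, truth, 5))
```

Because a tweet carries only 1.6 hashtags on average, precision@5 is bounded well below 1 even for a perfect recommender, which is worth keeping in mind when reading the precision results.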


Evaluation

• Dataset

– 3,753,927 messages

– 5,968,571 hashtags

– 585,140 distinct hashtags

• Test run

– 10,000 randomly chosen tweets (at most 5 hashtags each)

– Retweets excluded


Recall - Basic Methods


Recall@5 - Hybrid Methods

Top-5 recommendations enough?

Precision@5


Development of Recall Values


Future Directions

• Social Graph

• User's Timeline

• Realtime Recommendations

• Real User Tests


Conclusion

• Motivation

• Hashtag Recommendations

• Simple, straightforward approach

• Promising results
