Using Tag Recommendations to
Homogenize Folksonomies in Microblogging Environments
Eva Zangerle, Wolfgang Gassler and Günther Specht
1
Outline
• Motivation
• Approach
• Ranking Methods
• Evaluation
• Future Directions
• Conclusion
2
Hashtags
• Tags for Tweets
• (Manual) Categorization of conversations
• Follow streams of conversation
• Indicator for certain topic or audience
3
Motivation
• Only 20% of tweets contain hashtags
• Hashtags can be chosen freely
– #socinfo2011? #socinfo11? #socinfo? all?
– Synonymous hashtags
– Heterogeneity
– Search capability limited
– Which stream to follow?
4
Motivation
5
Motivation
Proposed Solution: Hashtag Recommendations
6
Goals
• Recommend suitable hashtags while the user is entering a tweet
• Encourage use of hashtags
– Improve search capabilities
– Better categorization
• Fight heterogeneity
– Avoid use of synonymous hashtags
7
Our Approach in a Nutshell
• Based on a set of existing tweets
• Analysis of entered tweet
• Analysis of dataset
• Recommendations based on hashtags within similar messages
8
Approach - Workflow
User enters message
Retrieve 500 most similar messages
Retrieve candidate set of hashtags
Ranking of Hashtags
Top-k Recommendations
9
Crawled Dataset
• Crawled July 2010 – April 2011
• 18,731,800 messages in total
• 3,753,927 messages containing hashtags
– about 20%
– used as dataset for evaluation
• 5,968,571 hashtags → average of 1.6 hashtags per message
• 585,140 distinct hashtags
– 502,172 hashtags occurred fewer than 5 times
10
Longtail Distribution
11
Hashtags per Tweet
12
Candidate Set Generation
• Find tweets most similar to the user's tweet
• Cosine similarity of tf/idf weighted term vectors
• Take 500 most similar tweets
• Extract hashtags from these tweets
• Next step: ranking the hashtags
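The candidate set generation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the tokenizer, and the smoothed idf formula (`log(n / df) + 1`) are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def tokenize(text):
    """Lowercase word tokens; hashtags are removed before tokenizing."""
    return re.findall(r"\w+", HASHTAG.sub(" ", text.lower()))

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per tokenized document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: smoothed idf so that terms occurring in every document keep a weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def candidate_hashtags(query, tweets, n_similar=500):
    """Return (similarity, hashtags) for the n_similar tweets most similar to the query."""
    vectors, idf = tfidf_vectors([tokenize(t) for t in tweets])
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = sorted(
        ((cosine(q, v), tweet) for v, tweet in zip(vectors, tweets)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Tweets with zero similarity share no term with the query and are dropped.
    return [(s, HASHTAG.findall(t.lower())) for s, t in scored[:n_similar] if s > 0]
```

A real system over millions of tweets would use an inverted index rather than scoring every tweet; the linear scan here only keeps the example short.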
13
Basic Ranking Methods
Input: Set of candidate hashtags (from 500 similar tweets)
Output: Ranked candidate list → top k shown
1. SimRank
– Use similarity measure of tweets for ranking (tf/idf cosine similarity)
– The higher the similarity of the tweets, the higher the ranking of the corresponding hashtags
2. TimeRank
– Recency of usage of the hashtag
– The more recently a hashtag has been used, the higher the ranking within the candidate hashtags
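A possible reading of SimRank and TimeRank, sketched in Python. The slides do not say how a hashtag that appears in several similar tweets is scored; taking the highest similarity and the most recent timestamp is an assumption of this sketch, as are the function names and the `(similarity, timestamp, hashtags)` tuple layout.

```python
def sim_rank(candidates):
    """Rank hashtags by the best similarity of any candidate tweet containing them.

    candidates: list of (similarity, timestamp, hashtags) tuples, one per similar tweet.
    """
    best = {}
    for sim, _ts, tags in candidates:
        for tag in tags:
            best[tag] = max(best.get(tag, 0.0), sim)
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(best, key=lambda tag: (-best[tag], tag))

def time_rank(candidates):
    """Rank hashtags by their most recent use (larger timestamp = more recent)."""
    latest = {}
    for _sim, ts, tags in candidates:
        for tag in tags:
            latest[tag] = max(latest.get(tag, ts), ts)
    return sorted(latest, key=lambda tag: (-latest[tag], tag))
```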
14
Basic Ranking Methods
Input: Set of candidate hashtags (from 500 similar tweets)
Output: Ranked candidate list → top k shown
3. RecCountRank
– Count number of occurrences for each hashtag within candidate list
– The more similar tweets feature the hashtag, the higher the rank of the hashtag
4. PopRank
– Global popularity of the hashtag within the whole dataset
– The more popular a hashtag is overall, the higher its ranking
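RecCountRank and PopRank both reduce to counting, as the following Python sketch suggests. The function names and the `global_counts` mapping (hashtag to number of occurrences in the whole dataset) are hypothetical and only serve the illustration.

```python
from collections import Counter

def rec_count_rank(candidate_tag_lists):
    """Rank hashtags by the number of similar tweets that contain them."""
    counts = Counter(tag for tags in candidate_tag_lists for tag in set(tags))
    return sorted(counts, key=lambda tag: (-counts[tag], tag))

def pop_rank(candidate_tag_lists, global_counts):
    """Rank candidate hashtags by their popularity in the whole dataset."""
    candidates = {tag for tags in candidate_tag_lists for tag in tags}
    # Hashtags outside the candidate set are ignored, however popular they are.
    return sorted(candidates, key=lambda tag: (-global_counts.get(tag, 0), tag))
```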
15
Hybrid Ranking Methods
• Based on 4 basic ranking methods
• $\mathrm{hybridrank}(r_1, r_2) = \alpha \cdot r_1 + (1 - \alpha) \cdot r_2$
• Hybrid ranking computed for all possible combinations of basic ranking methods
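The weighted combination can be sketched as follows. The sketch assumes that each basic method yields a score per hashtag normalized to [0, 1] and that "all possible combinations" means all pairs of the four basic methods; neither detail is stated on the slide, and the function name is illustrative.

```python
from itertools import combinations

def hybrid_rank(scores_1, scores_2, alpha=0.5):
    """Combine two per-hashtag score dicts as alpha * r1 + (1 - alpha) * r2."""
    tags = set(scores_1) | set(scores_2)
    combined = {
        tag: alpha * scores_1.get(tag, 0.0) + (1 - alpha) * scores_2.get(tag, 0.0)
        for tag in tags
    }
    return sorted(combined, key=lambda tag: (-combined[tag], tag))

# Assumption: hybrids are built from pairs of basic methods, giving 6 combinations.
BASIC_METHODS = ("SimRank", "TimeRank", "RecCountRank", "PopRank")
METHOD_PAIRS = list(combinations(BASIC_METHODS, 2))
```

With `alpha` close to 1 the first method dominates the ordering; with `alpha` close to 0 the second one does.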
16
Evaluation
Randomly select tweet t from dataset
Remove hashtags from t
Use t as input for recommendation algorithm
Compute hashtag recommendations for t
Use proposed ranking methods
Compare top-k recommendations with the hashtags removed from t
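The comparison step of this hold-out evaluation is commonly expressed as recall@k and precision@k. The following Python sketch uses the standard definitions (hits divided by the number of held-out hashtags, and hits divided by k); that the authors computed the metrics exactly this way is an assumption, and the function names are illustrative.

```python
import re

HASHTAG = re.compile(r"#\w+")

def split_hashtags(tweet):
    """Return (text without hashtags, held-out hashtags) for a test tweet."""
    held_out = HASHTAG.findall(tweet.lower())
    text = " ".join(HASHTAG.sub(" ", tweet).split())
    return text, held_out

def recall_at_k(recommended, relevant, k):
    """Fraction of held-out hashtags that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(set(relevant))

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are held-out hashtags."""
    if k <= 0:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k
```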
17
Evaluation
• Dataset
– 3,753,927 messages
– 5,968,571 hashtags
– 585,140 distinct hashtags
• Testrun
– 10,000 randomly chosen tweets (max. 5 hashtags)
– Retweets excluded
18
Recall - Basic Methods
19
Top-5 recommendations enough?
Recall@5 - Hybrid Methods
20
Precision@5
21
Development of Recall Values
22
Future Directions
• Social Graph
• User's Timeline
• Realtime Recommendations
• Real User Tests
23
Conclusion
• Motivation
• Hashtag Recommendations
• Simple, straightforward approach
• Promising results
24