Harnessing Volume and Velocity Challenges on the Social Web Using Crowd-Sourced Knowledge Bases
DESCRIPTION
Pavan Kapanipathi's talk at IBM's Frontiers of Cloud Computing and Big Data Workshop 2014. http://researcher.ibm.com/researcher/view_group_subpage.php?id=5565
Due to the increased adoption of the social web, users, specifically Twitter users, are facing information overload. Unless a user is willing to restrict their sources (e.g., the number of accounts they follow), important information relevant to the user's interests often goes unnoticed. The reasons include: (1) the postings may arrive at a time when the user is not looking; (2) the user is unaware of, and hence not following, the information source; and (3) the information arrives at a rate at which the user cannot consume it. Furthermore, temporally relevant information that is discovered late might be of no use.
My research addresses these challenges by (1) generating user profiles of interests from Twitter using Wikipedia. The interests gleaned from users' Twitter data can be leveraged by personalization and recommendation systems to reduce information overload (Volume) for users. (2) Filtering Twitter data relevant to dynamically evolving entities. In addition to Volume, this addresses the Velocity challenge of delivering relevant information in real time. The approach is deployed on Twitris to crawl dynamic event-relevant tweets for analysis. The prominent aspect of both approaches is the use of a crowd-sourced knowledge base such as Wikipedia.
TRANSCRIPT
Pavan Kapanipathi
Kno.e.sis Center, Wright State University
Dayton, OH
Frontiers of Cloud Computing and Big Data Workshop 2014 IBM TJ Watson Research Center
Yorktown Heights, NY
Addressing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases
Overview
o Social Web in 60 seconds
  o Twitter
o Big Data Challenges on Social Web
o Addressing Volume
  o Hierarchical Interest Graphs
o Addressing Velocity
  o Tracking Dynamic Topics on Twitter
o Conclusion
Social Web in 60 secs
500M tweets per day
Leveraging Twitter
o Brands are monitoring Twitter
  o 62% active in 2011 to 97% active in 2013
o Twitter is used for disaster management
  o 35% of 20M tweets during Hurricane Sandy shared information and news
o Personalization using Twitter
  o Search engines use influence scores derived from the Twitter network.
Challenges
The Four Vs of Big Data
o Volume: Scale of Data
o Velocity: Streaming Data
o Variety: Different forms of Data
o Veracity: Uncertainty of Data
• Data Perspective: 12TB of data/day.
• Information Perspective: Information that interests me. Reducing information overload for users.
• Tracking dynamic topics on Twitter.
• Improving recall of relevant, dynamic streaming Twitter data for real-time Twitter analysis.
Addressing Volume: Information Perspective
State of the Art
Addressing Volume: User Perspective
o Approach
  o Generating interest profiles of users by understanding their activities on Twitter.
  o Filtering/Recommending content that matches their interests.
o Determining user interests from tweets
  o Exploiting a knowledge base to gain further insights about the interests and infer a hierarchical interest graph.
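[Figure: example hierarchical interest graph over the Wikipedia categories Technology, Internet, World Wide Web, Semantic Web, Structured Information, Semantic Search, Linked Data, and Metadata. Labels: "User Interests" and "Scores for Interests"; score values shown: 0.8, 0.2, 0.6, 0.7, 0.5, 0.4, 0.3.]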
Hierarchical Interest Graphs
o Spreading Activation Theory
o Wikipedia Graph based Distributional Semantics
o Hierarchical Interest Graph with scores for each category in the Hierarchy.
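A minimal sketch of how spreading activation could assign scores to categories in a hierarchy. The slides do not give the activation function, so the `decay` factor, the additive update, and the toy parent links below are assumptions made for illustration; only the category names come from the slide's example figure. The sketch also assumes the hierarchy is acyclic, which the raw Wikipedia category graph is not.

```python
from collections import defaultdict


def spread_activation(parents, seed_scores, decay=0.5):
    """Propagate interest scores upward through a category hierarchy.

    parents: dict mapping a node to a list of its parent categories
             (assumed acyclic for this sketch).
    seed_scores: dict mapping user interests to their initial scores.
    decay: hypothetical attenuation applied at every hop upward.
    """
    scores = defaultdict(float)
    frontier = dict(seed_scores)
    while frontier:
        next_frontier = defaultdict(float)
        for node, activation in frontier.items():
            # Accumulate the activation that reached this node.
            scores[node] += activation
            # Pass an attenuated share on to each parent category.
            for parent in parents.get(node, []):
                next_frontier[parent] += activation * decay
        frontier = dict(next_frontier)
    return dict(scores)


# Toy hierarchy; the parent links are an illustrative assumption.
parents = {
    "Semantic Search": ["Semantic Web"],
    "Linked Data": ["Semantic Web"],
    "Semantic Web": ["World Wide Web"],
    "World Wide Web": ["Internet"],
    "Internet": ["Technology"],
}
seeds = {"Semantic Search": 0.8, "Linked Data": 0.6}
interest_scores = spread_activation(parents, seeds)
```

Categories reached by several interests accumulate more activation than categories reached by one, and broader categories further up receive progressively less.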
Evaluation
Evaluated the top-10 categories of interests derived from the hierarchy
• 76% Mean Average Precision
• 98% Mean Reciprocal Recall
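For readers unfamiliar with the first metric, here is a small sketch of Mean Average Precision over ranked category lists. The relevance judgments and category lists are made up for illustration, and the normalization (dividing by the number of relevant items found in the ranked list) is one common variant, not necessarily the one used in this evaluation.

```python
def average_precision(ranked_items, relevant):
    """Average of precision@rank taken at each rank holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the relevant items actually retrieved (an assumption).
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: list of (ranked_items, relevant_set) pairs, one per user."""
    if not rankings:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in rankings) / len(rankings)


# Hypothetical judgments for two users.
rankings = [
    (["Semantic Web", "Sports", "Linked Data"], {"Semantic Web", "Linked Data"}),
    (["Music", "Technology"], {"Technology"}),
]
map_score = mean_average_precision(rankings)
```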
Addressing Velocity: Tracking Dynamic Topics on Twitter
Tracking Dynamic Events on Twitter
o Twitris – A Semantic Web application for analyzing tweets.
  o Political, Disasters & Healthcare tweets
o Event relevant tweets
  o Twitter Streaming API, Keywords/geo-location based.
  o Dynamic events are not easy to crawl using these techniques.
  o Hashtags as queries.
Hashtag Analysis for Dynamic Topics
Events analyzed: Colorado Shooting, OWS
o Hashtags co-occur with each other
o Power-law distribution
  o Very few hashtags are popular. The top 1% can account for 85% of the tweets.
o Clustering coefficient
  o The top hashtags co-occur with each other the most.
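The kind of analysis summarized on this slide can be sketched as follows: count how often each hashtag appears, how often pairs co-occur in the same tweet, and what share of tweets the most popular hashtags cover. The function names and the toy tweets are hypothetical, and the sketch assumes each tweet is already reduced to its list of hashtags.

```python
from collections import Counter
from itertools import combinations


def hashtag_stats(tweets):
    """Count hashtag frequencies and pairwise co-occurrences.

    tweets: list of hashtag lists, one list per tweet.
    """
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tweets:
        # Normalize case and drop duplicates within a single tweet.
        unique = sorted({t.lower() for t in tags})
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return tag_counts, pair_counts


def top_share(tag_counts, tweets, fraction=0.01):
    """Share of tweets containing at least one of the top `fraction` hashtags."""
    if not tweets or not tag_counts:
        return 0.0
    k = max(1, int(len(tag_counts) * fraction))
    top = {tag for tag, _ in tag_counts.most_common(k)}
    covered = sum(1 for tags in tweets if top & {t.lower() for t in tags})
    return covered / len(tweets)


# Toy data, not taken from the talk.
tweets = [["OWS", "occupy"], ["ows"], ["OWS", "nypd"], ["occupy", "nypd"]]
tag_counts, pair_counts = hashtag_stats(tweets)
share = top_share(tag_counts, tweets)
```

With a real crawl, plotting `tag_counts` on log-log axes is a quick way to check for the power-law shape the slide reports.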
Approach using Wikipedia
o Input an event-relevant hashtag and the corresponding event Wikipedia page.
o Utilize the dynamically evolving hyperlink structure of the Wikipedia event page.
o Determine event-relevant hashtags based on their similarity to the event page and their co-occurrence with the existing relevant hashtags.
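One way the two signals in the last step could be combined is sketched below. The slides do not specify the similarity measure or how the signals are weighted, so the Jaccard overlap, the co-occurrence ratio, and the `alpha` weight are all assumptions, and `hashtag_relevance` is a hypothetical helper rather than the method deployed on Twitris.

```python
def jaccard(a, b):
    """Set overlap between two term collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hashtag_relevance(candidate_terms, page_terms,
                      cooccur_with_relevant, candidate_total, alpha=0.5):
    """Score a candidate hashtag for an event.

    candidate_terms: terms seen in tweets that contain the candidate hashtag.
    page_terms: terms taken from the event's Wikipedia page (for example,
                hyperlink anchor texts); how to extract them is an assumption.
    cooccur_with_relevant: tweets where the candidate appears together with
                           an already-relevant hashtag.
    candidate_total: all tweets containing the candidate.
    alpha: hypothetical weight between the two signals.
    """
    similarity = jaccard(candidate_terms, page_terms)
    cooccurrence = (cooccur_with_relevant / candidate_total
                    if candidate_total else 0.0)
    return alpha * similarity + (1 - alpha) * cooccurrence


# Illustrative values only.
score = hashtag_relevance(
    candidate_terms={"aurora", "theater", "shooting"},
    page_terms={"aurora", "theater", "shooting", "colorado"},
    cooccur_with_relevant=8,
    candidate_total=10,
)
```

Because the event page keeps changing, `page_terms` would be refreshed periodically, so a hashtag's score can rise or fall as the event evolves.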
Evaluation
Evaluated tweets comprising the top relevant hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average Precision
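As a reminder of how NDCG at a cutoff is computed, here is a short sketch. The graded relevance values are invented for illustration, and the ideal ordering is derived from the judged list itself, which is a simplifying assumption.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(gains, k):
    """DCG of the top-k results divided by the DCG of the ideal top-k ordering."""
    ideal_dcg = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded relevance of tweets retrieved for one hashtag.
gains = [3, 2, 3, 0, 1]
ndcg5 = ndcg_at_k(gains, 5)
```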
Conclusion
o What’s there in this presentation
  o Big Data challenges in leveraging Twitter.
  o Focus on “Information” overload instead of “Data” overload.
  o Wikipedia categories in the hierarchy are considered as interests by users.
  o Evolving set of hashtags as queries for dynamic events.
o What I missed (catch me at the poster session)
  o How are knowledge bases exploited for our work?
  o Impact of knowledge bases, specifically Wikipedia.