Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases

Pavan Kapanipathi, Kno.e.sis Center, Wright State University, Dayton, OH. Frontiers of Cloud Computing and Big Data Workshop 2014, IBM TJ Watson Research Center, Yorktown Heights, NY. Addressing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases

Upload: knoesis-center-wright-state-university

Post on 22-May-2015


DESCRIPTION

Pavan Kapanipathi's talk at IBM's Frontiers of Cloud Computing and Big Data Workshop 2014. http://researcher.ibm.com/researcher/view_group_subpage.php?id=5565 Due to the increased adoption of the social web, users, specifically Twitter users, are facing information overload. Unless a user is willing to restrict their sources (e.g., the number of accounts they follow), important information relevant to the user's interests often goes unnoticed. The reasons include: (1) the information may be posted at a time when the user is not looking; (2) the user may be unaware of, and hence not following, the information source; and (3) the information arrives at a rate faster than the user can consume. Furthermore, temporally relevant information that is discovered late may be of no use. My research addresses these challenges by (1) generating user interest profiles from Twitter using Wikipedia. The interests gleaned from users' Twitter data can be leveraged by personalization and recommendation systems to reduce information overload (Volume) for users; and (2) filtering Twitter data relevant to dynamically evolving entities. In addition to Volume, this addresses the Velocity challenge of delivering relevant information in real time. The approach is deployed on Twitris to crawl dynamic event-relevant tweets for analysis. The prominent aspect of both approaches is the use of a crowd-sourced knowledge base such as Wikipedia.

TRANSCRIPT

Page 1: Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases

Pavan Kapanipathi

Kno.e.sis Center, Wright State University

Dayton, OH

Frontiers of Cloud Computing and Big Data Workshop 2014 IBM TJ Watson Research Center

Yorktown Heights, NY

Addressing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases

Page 2:

Overview

o Social Web in 60 seconds
   o Twitter

o Big Data Challenges on Social Web

o Addressing Volume
   o Hierarchical Interest Graphs

o Addressing Velocity
   o Tracking Dynamic Topics on Twitter

o Conclusion

Page 3:

Social Web in 60 secs

Page 4:

Twitter

500M tweets per day

Page 5:

Leveraging Twitter

o Brands are monitoring Twitter
   o 62% active in 2011 to 97% active in 2013

o Twitter is used for disaster management
   o 35% of 20M tweets during Hurricane Sandy shared information and news

o Personalization using Twitter
   o Search engines use influence scores derived from the Twitter network.

Page 6:

Challenges

The Four Vs of Big Data

Volume: Scale of Data

Velocity: Streaming Data

Variety: Different forms of Data

Veracity: Uncertainty of Data

• Data Perspective: 12TB of data/day1.

• Information Perspective: Information that interests me. Reducing information overload for users.

• Tracking dynamic topics on Twitter.

• Improving recall of relevant, dynamic streaming twitter data for real-time Twitter analysis.

Page 7:

Addressing Volume: Information Perspective

Page 8:

State of the Art

Page 9:

Addressing Volume: User Perspective

o Approach
   o Generating interest profiles of users by understanding their activities on Twitter.
   o Filtering/recommending content that matches their interests.

o Determining user interests from tweets
   o Exploiting a knowledge base to gain further insights about the interests and infer a hierarchical interest graph.

Page 10:

Page 11:

[Figure: example hierarchical interest graph. Node labels: Semantic Search, Linked Data, Metadata, Semantic Web, Structured Information, World Wide Web, Internet, Technology. Annotations: "User Interests", "Scores for Interests". Scores shown: 0.8, 0.2, 0.6, 0.7, 0.5, 0.4, 0.3.]

Page 12:

Hierarchical Interest Graphs

o Spreading Activation Theory

o Wikipedia Graph based Distributional Semantics

o Hierarchical Interest Graph with scores for each category in the Hierarchy.
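
To make the spreading-activation idea concrete, here is a minimal sketch of pushing interest scores from entities up a category hierarchy. It is an illustration only, not the activation function used in the talk: the `parents` mapping, the entity scores, the `decay` factor, and the `max_hops` limit are all hypothetical choices, and a real Wikipedia category graph would first need cleaning into a hierarchy.

```python
from collections import defaultdict


def spread_activation(entity_scores, parents, decay=0.5, max_hops=3):
    """Propagate scores from entities to their ancestor categories.

    entity_scores: {entity: score} gleaned from a user's tweets (assumed input).
    parents: {node: [parent categories]} (hypothetical hierarchy).
    decay: fraction of a node's activation passed to each parent per hop.
    """
    activation = defaultdict(float)
    for node, score in entity_scores.items():
        activation[node] += score

    frontier = dict(entity_scores)
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, score in frontier.items():
            for parent in parents.get(node, ()):
                next_frontier[parent] += score * decay
        # Accumulate this hop's activation, then continue from the new frontier.
        for node, score in next_frontier.items():
            activation[node] += score
        frontier = next_frontier
    return dict(activation)


# Toy data: the hierarchy and the scores below are made up for illustration.
parents = {
    "Semantic Search": ["Semantic Web"],
    "Linked Data": ["Semantic Web"],
    "Metadata": ["Structured Information"],
    "Semantic Web": ["World Wide Web"],
    "World Wide Web": ["Internet"],
    "Internet": ["Technology"],
}
entity_scores = {"Semantic Search": 0.8, "Linked Data": 0.2, "Metadata": 0.6}

scores = spread_activation(entity_scores, parents)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Categories that collect activation from several entities rise in the ranking, which is the intuition behind scoring every category in the hierarchy; how activation should be dampened or normalized at each level is a design choice this sketch does not settle.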

Page 13:

Evaluation

Evaluated the top-10 categories of interests derived from the hierarchy
• 76% Mean Average Precision
• 98% Mean Reciprocal Recall
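
For readers unfamiliar with the first metric, the following sketch shows one common way to compute Mean Average Precision over per-user ranked lists. The relevance judgments are invented, and normalizing by the number of relevant items found in the list (rather than by all relevant items that exist) is an assumption; the talk does not say which variant was used.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list of booleans (True = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_user_rankings):
    """Mean of the per-user average precision values."""
    return sum(average_precision(r) for r in per_user_rankings) / len(per_user_rankings)


# Hypothetical judgments for two users' ranked interest categories.
user_a = [True, False, True]
user_b = [False, True]
map_score = mean_average_precision([user_a, user_b])
print(round(map_score, 4))
```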

Page 14:

Addressing Velocity: Tracking Dynamic Topics on Twitter

Page 15:

Tracking Dynamic Events on Twitter

o Twitris – A Semantic Web application for analyzing tweets.
   o Political, Disasters & Healthcare tweets

o Event relevant tweets
   o Twitter Streaming API, Keywords/geo-location based.
   o Dynamic events are not easy to crawl using these techniques.
   o Hashtags as queries.

Page 21:

Hashtag Analysis for Dynamic Topics

Colorado Shooting, OWS

Hashtags co-occur with each other

Power-law distribution

Very few hashtags are popular. The top 1% can get 85% of the tweets.

Clustering coefficient

The top ones co-occur with each other the most
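
The kind of analysis summarized on this slide can be sketched as follows: count hashtag frequencies and pairwise co-occurrences, then measure what share of hashtagged tweets the most popular hashtags cover. The tweets, the regular expression, and the function names are hypothetical; this is not the Twitris implementation, and it does not compute the clustering coefficient.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")


def tags_in(text):
    """Lower-cased set of hashtags in a tweet's text."""
    return {tag.lower() for tag in HASHTAG.findall(text)}


def hashtag_stats(tweets):
    """Return (hashtag frequency, hashtag-pair co-occurrence) counters."""
    tag_counts = Counter()
    pair_counts = Counter()
    for text in tweets:
        tags = sorted(tags_in(text))
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))  # pairs are alphabetically ordered
    return tag_counts, pair_counts


def coverage_of_top(tweets, tag_counts, fraction):
    """Share of hashtagged tweets containing at least one top-`fraction` hashtag."""
    k = max(1, int(len(tag_counts) * fraction))
    top = {tag for tag, _ in tag_counts.most_common(k)}
    tagged = [tags_in(text) for text in tweets if tags_in(text)]
    covered = sum(1 for tags in tagged if tags & top)
    return covered / len(tagged) if tagged else 0.0


# Invented example tweets.
tweets = [
    "#ows march today #occupy",
    "#ows livestream",
    "police at #ows #nypd",
    "#occupy general assembly",
    "no hashtags here",
]
tag_counts, pair_counts = hashtag_stats(tweets)
top_share = coverage_of_top(tweets, tag_counts, fraction=0.34)
print(tag_counts.most_common(), top_share)
```

On real event data with a power-law frequency distribution, a very small `fraction` would be expected to cover most tweets, which is the observation the slide reports.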

Page 22:

Approach using Wikipedia

o Input an event-relevant hashtag and the corresponding Wikipedia Event page.

o Utilize the dynamically evolving hyperlink structure of the Wikipedia Event page.

o Determine event-relevant hashtags based on their similarity to the event page and their co-occurrence with the existing relevant hashtags.
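
The two signals in the last bullet can be combined in many ways; the sketch below shows one simple possibility. Everything here is an assumption made for illustration: the event page is reduced to a weighted bag of terms (for example, titles of linked pages), a candidate hashtag is represented by the terms of the tweets that contain it, similarity is plain cosine, and `alpha` linearly mixes similarity with the co-occurrence rate. The talk does not specify these details.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_hashtag(candidate, tweets, event_terms, relevant_tags, alpha=0.5):
    """Score a candidate hashtag for event relevance (hypothetical scoring).

    tweets: list of {"hashtags": set, "terms": list} dicts (assumed format).
    event_terms: term weights taken from the event's Wikipedia page.
    relevant_tags: hashtags already known to be relevant to the event.
    """
    with_candidate = [t for t in tweets if candidate in t["hashtags"]]
    if not with_candidate:
        return 0.0
    context = Counter(term for t in with_candidate for term in t["terms"])
    similarity = cosine(context, event_terms)
    cooccurrence = sum(
        1 for t in with_candidate if relevant_tags & t["hashtags"]
    ) / len(with_candidate)
    return alpha * similarity + (1 - alpha) * cooccurrence


# Invented data for illustration.
event_terms = Counter({"protest": 3, "zuccotti": 2, "wall": 1})
tweets = [
    {"hashtags": {"ows", "occupy"}, "terms": ["protest", "zuccotti"]},
    {"hashtags": {"occupy"}, "terms": ["protest", "wall"]},
    {"hashtags": {"nowplaying"}, "terms": ["song", "radio"]},
]
relevant_tags = {"ows"}

occupy_score = score_hashtag("occupy", tweets, event_terms, relevant_tags)
noise_score = score_hashtag("nowplaying", tweets, event_terms, relevant_tags)
print(occupy_score, noise_score)
```

Because the event page's links change as the event unfolds, `event_terms` would be refreshed periodically, and hashtags scoring above some threshold would join `relevant_tags` for the next round.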

Page 23:

Evaluation

Evaluated tweets containing the top relevant hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average Precision
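
As a reminder of how NDCG at a cut-off is computed, here is a short sketch using the common log2 discount. The graded relevance values are invented, and the ideal ranking is taken from the judged list itself, which is an assumption; the slide does not describe the grading scheme.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments for the top-5 tweets of one topic.
judged = [3, 2, 3, 0, 1]
print(round(ndcg_at_k(judged, 5), 4))
```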

Page 24:

Conclusion

o What’s there in this presentation
   o Big Data challenges in leveraging Twitter.
   o Focus on “Information” overload instead of “Data” overload.
   o Wikipedia categories in the Hierarchy are considered as interests by users.
   o Evolving set of hashtags as queries for dynamic events.

o What I missed (catch me at the poster session)
   o How are Knowledge-bases exploited for our work?
   o Impact of Knowledge Bases, specifically Wikipedia.

Page 25:

Thanks

Contact: [email protected]

Twitter: @pavankaps