The Web Goes Social:
Blogosphere and Twittersphere
March 24, 2011
A Look into the Science of Web Retrieval
Multimedia University, Invited Talk
Presenter: Younus, Arjumand
Contents
Introduction
The Changing Role of Today’s Web
Web as Media: Social Media
Role of Search Engines in the Social Web
A Blogosphere Case-Study
A Twittersphere Case-Study
Quick Survey
Do you have a Facebook, MySpace, Twitter, or LinkedIn account?
Do you own a blog?
Do you read blogs?
Have you ever searched for something on Wikipedia?
Have you ever submitted content to a social network?
Web 1.0 vs. Web 2.0
Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission
What is so Different about Web 2.0?
User Generated Content
Collaborative Environment: Participatory Web, Citizen Journalism
User is the Driving Factor
A Paradigm Shift rather than a Technology Shift
Top 20 Most Visited Web Sites
Internet traffic report by Alexa on July 29th 2008
Rank  Site                 Rank  Site
1     Yahoo!               11    Orkut
2     Google               12    RapidShare
3     YouTube              13    Baidu.com
4     Windows Live         14    Microsoft Corporation
5     Microsoft Network    15    Google India
6     Myspace              16    Google Germany
7     Wikipedia            17    QQ.Com
8     Facebook             18    EBay
9     Blogger              19    Hi5
10    Yahoo! Japan         20    Google France
Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission
Role of Today’s Web
Marketing Tool
Information Finding Tool
Media Tool
New Dimensions in Search with The Social Web
Information Overload
Search engines don’t always hold the answers that users are looking for
Smart Search (CNN Money)
“The Web, they say, is leaving the era of search and entering one of discovery. What’s the difference? Search is what you do when you’re looking for something. Discovery is when something wonderful that you didn’t know existed, or didn’t know how to ask for, finds you.”
What does that mean for search engines? Will they be left behind?
Research Issues in the Blogosphere
Understanding the structure and properties of the blogosphere [GLM+09] [CZS+07]
Community extraction from the blogosphere through an understanding of relationships between bloggers, readers, blog posts, comments, and different sites in the blogosphere [CZS+07] [YSK+09]
Blog clustering (particularly relevant for blog search engines) [QYS+10] [AGL+10]
Trend analysis through event detection in the blogosphere [LJS+10]
Blog mining for influence analysis and opinion mining [MGL09]
Research Issues in the Twittersphere
Study of information diffusion [RMK11]
Influence analysis [KLP+10]
Sentiment analysis and opinion mining [OBR+10]
Event detection through identification of breaking news [SOM10]
Study of unfollow phenomenon [KGN11]
Characteristics of Blog Search and Microblog Search [MR06] [TRM11]
Blog Search [MR06]
Tracking references to named entities
Locating blogs by theme
Searchers are engaged in technology, entertainment and politics, with a particular interest in current events
Microblog Search [TRM11]
Temporal nature
Locating people using specialized syntax
Repetitive queries which change very little
Social Search [HK10]
From “library” paradigm of search to “village” paradigm of search
Trust in Web search based on “authority”
Trust in Social search based on “intimacy”
Key Characteristics
Communities of users actively participating in the search process
Users interact with the system
Users interact with other users either implicitly or explicitly
Enhancing Search using Social Network Features
Recency Crawling and Ranking
Identification of Hot Topics on the Social Web [YQG+11]
News in the Making
Real-Time Search
Wael Ghonim’s tweets shown on Google during Egypt uprising.
Blogosphere Case-Study
Blogosphere Clustering: Problem Definition
Given the blogosphere, with blogs containing diverse information on a broad range of topics:
Which cluster of blogs should be read by someone interested in a particular topic?
Which blog holds the greatest influence for the particular topic?
Blog Clustering Approach
Blog considered along three dimensions:
Part of speech
Occurrence
Blog post number
Topic Discussion Isolation Rank
Metric used to discover the topic clusters
Based on a set of given topic words and some linguistic rules
We define the TDIR score of a blog as follows:
n_noun, n_adjective and n_adverb are, respectively, the number of times a noun, adjective or adverb for a specific topic is found across all the blog posts
w_n, w_adj and w_adv are the respective weights assigned to the noun, adjective and adverb for a specific topic
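The definitions above suggest a score built from weighted part-of-speech counts. The sketch below assumes the simplest such form, a plain weighted sum; that form, the `TopicCounts` container, the default weights and the example numbers are all assumptions made for illustration and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class TopicCounts:
    """Topic-word counts over all posts of one blog (hypothetical container)."""
    n_noun: int
    n_adjective: int
    n_adverb: int


def tdir(counts: TopicCounts, w_n: float = 1.0, w_adj: float = 1.0,
         w_adv: float = 1.0) -> float:
    """Assumed TDIR form: weighted sum of noun, adjective and adverb counts."""
    return (w_n * counts.n_noun
            + w_adj * counts.n_adjective
            + w_adv * counts.n_adverb)


# Illustrative values only: 10 topic nouns, 4 adjectives, 2 adverbs.
example = TopicCounts(n_noun=10, n_adjective=4, n_adverb=2)
score = tdir(example, w_n=1.0, w_adj=0.5, w_adv=0.25)
print(score)
```

Under this assumed form, raising w_n relative to w_adj and w_adv makes topic nouns dominate the score.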
Topic Discussion Rank
Metric used to rank the blogs within a topic cluster
Based on the hyperlinked social network of blogs and blog post contents
We define the TDR score of a blog as follows:
Matching_Outlinks represents the linked blogs that are part of the topic cluster
o : (o,b) – outlinks from blog b
damp is the damping factor
Role of Damping Factor
Assume TDIR of blog A is 2 and TDIR of blog B is 1
TDR without damping factor
A: 2 + (1/1 x 1) = 3
B: 1 + (1/1 x 2) = 3
TDR with damping factor (damp = 0.9)
A: 2 + (1/1 x 1 x 0.9) = 2.9
B: 1 + (1/1 x 2 x 0.9) = 2.8
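The worked example can be reproduced with a short sketch. It assumes that a blog's TDR is its own TDIR plus, for each matching outlink, the linked blog's TDIR scaled by the damping factor and by a 1/n share, where n is taken here to be the number of matching outlinks. That reading of the "1/1" factor is an assumption (the example has exactly one outlink per blog, so it cannot distinguish alternatives), and the function and variable names are hypothetical.

```python
def tdr(blog, tdir_scores, outlinks, damp=1.0):
    """Sketch of a TDR score under the assumptions stated above.

    tdir_scores: TDIR per blog in the topic cluster (keys define the cluster).
    outlinks:    blog -> list of blogs it links to.
    damp:        damping factor; 1.0 means no damping.
    """
    # Keep only outlinks that point to blogs inside the topic cluster.
    matching = [o for o in outlinks.get(blog, []) if o in tdir_scores]
    if not matching:
        return float(tdir_scores[blog])
    share = 1.0 / len(matching)  # assumed meaning of the "1/1" factor
    return tdir_scores[blog] + sum(share * tdir_scores[o] * damp for o in matching)


# Blogs A and B link to each other; TDIR(A) = 2 and TDIR(B) = 1 as in the example.
scores = {"A": 2, "B": 1}
links = {"A": ["B"], "B": ["A"]}

for damp in (1.0, 0.9):
    print(damp, round(tdr("A", scores, links, damp), 2),
          round(tdr("B", scores, links, damp), 2))
```

Without damping both blogs tie; with damping, blog A, which has the higher TDIR of its own, ranks above blog B, which is the point of the example.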
Performance Evaluation
Experimental data
Real blog data collected by crawling the blogspot domain
102 blog sites comprising 50,471 blog posts
Experimental topics
“compute”, “democracy”, “secularism”, “bioinformatics”, “Haiti”, “Obama”
Experimental measures
Precision
Recall
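For reference, the two measures can be computed per topic as in the sketch below, using the standard set-based definitions. The blog identifiers are made-up illustration values, not the experimental data, and the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    retrieved: blogs the method placed in the topic cluster.
    relevant:  blogs that truly belong to the topic (ground truth).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative input: 4 blogs clustered, 3 truly on topic, 2 in common.
p, r = precision_recall({"blog1", "blog2", "blog3", "blog4"},
                        {"blog1", "blog2", "blog5"})
print(p, r)
```

Averaging these per-topic values over the experimental topics gives figures of the kind reported as average precision and average recall.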
Experimental Results - Precision
Average precision found to be 0.87
Experimental Results - Recall
Average recall found to be 0.971
Twittersphere Case-Study
Studying Ins and Outs of News
Using Twitter to study hot news items people are heavily tweeting about
Algorithm for Identification of Popular News
1. Crawl daily news data and send the unranked news articles list to the UI module of the system.
2. Extract news title, news summary and news text for each news article per day.
3. From the news summary, extract named entities per news article through a named entity recognition approach.
4. Match each named entity against the entities in the common entity corpus and tag each named entity per news article as common or uncommon.
5. Use a Boolean query model to compose the query per news article:
   a. Use the AND predicate with each common entity and the OR predicate with each uncommon entity. If all entities for a particular news article are common, use the news title for the query construction.
6. For all articles per day:
   a. Send a request to the Twitter Search API and extract the result tweets per news article per day.
   b. For each t in the result tweets:
      i. Use t’s metadata to find following and follower statistics for each unique Twitterer who has tweeted about the news article.
      ii. Calculate the rank of each article using the ranking function.
7. Send the ranked news articles list to the UI module of the system.
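The query-composition step (step 5) can be sketched as follows. The algorithm does not say how the AND and OR parts are grouped, so this sketch assumes one possible reading: every common entity is required (AND) together with at least one of the uncommon entities (OR group). The function name, the quoting of entities and the example corpus are hypothetical, and no call to the Twitter Search API is made here.

```python
def build_query(title, entities, common_corpus):
    """Compose a Boolean query string for one news article.

    title:         news title, used as fallback query.
    entities:      named entities extracted from the news summary.
    common_corpus: lowercase entity names considered common.
    """
    common = [e for e in entities if e.lower() in common_corpus]
    uncommon = [e for e in entities if e.lower() not in common_corpus]

    # If every entity is common (or none was found), fall back to the title.
    if not uncommon:
        return title

    parts = [f'"{e}"' for e in common]              # each common entity: AND
    or_group = " OR ".join(f'"{e}"' for e in uncommon)  # uncommon entities: OR
    if len(uncommon) > 1:
        or_group = f"({or_group})"
    parts.append(or_group)
    return " AND ".join(parts)


corpus = {"russia", "iraq"}  # hypothetical common entity corpus
print(build_query("Rebels raid parliament in Grozny; 7 dead",
                  ["Grozny", "Chechnya", "Russia"], corpus))
```

The intuition behind this grouping is that a common entity alone would match too many unrelated tweets, while an uncommon entity is usually specific enough to identify the story.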
Application Prototype
Observations (1/3)
Percentage of news in tweets per day was greater than 50% for all days except one
Observations (2/3)
Highest Number of Recorded Tweets per Day
Observations (3/3)
DATE EVENT
17th Oct. Karachi violence (local)
18th Oct. Hopes fade for trapped Chinese miners (international)
19th Oct. Lakki Marwat suicide attack attempt foiled (local)
20th Oct. Rebels raid parliament in Grozny; 7 dead (international)
21st Oct. Obama to visit next year (local)
22nd Oct. Nuclear plant completes decade of good performance (local)
23rd Oct. Ghazi shrine suicide bomber back home (local)
24th Oct. WikiLeaks makes fresh claim about Iraq deaths (international)
25th Oct. Court orders Iraqi parliament back to work (international)
26th Oct. SCBA presidential election (local)
References
[AGL+10] Agarwal, N., Galan, M., Liu, H., and Subramanya, S., "WisColl: Collective Wisdom based Blog Clustering." In Journal Information Sciences, Special Issue on Collective Intelligence, Vol. 180, Issue 1, Jan. 2010.
[GLM+09] Gotz, M., Leskovec, J., McGlohon, M., and Faloutsos, M., “Modeling Blog Dynamics.” In Proc. 3rd International Conference on Weblogs and Social Media (ICWSM 2009), San Jose, California, United States, May 2009.
[CZS+07] Chi, Y., Zhu, S., Song, X., Tatemura, J., and Tseng, B.L., "Structural and Temporal Analysis of the Blogosphere through Community Factorization." In Proc. 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), San Jose, California, United States, Aug. 2007.
[MR06] Mishne, D., and de Rijke, M., “A Study of Blog Search.” In Proc. 28th European Conf. on Information Retrieval (ECIR 2006), London, United Kingdom, Apr. 2006.
[TRM11] Teevan, J., Ramage, D., and Morris, M. R., “#TwitterSearch: A Comparison of Microblog Search and Web Search.” In Proc. 4th Int’l Conf. on Web Search and Data Mining (WSDM 2011), Hong Kong, China, Feb. 2011.
[HK10] Horowitz, D., and Kamvar, S. D., “The Anatomy of a Large-Scale Social Search Engine.” In Proc. 19th Int’l Conf. on World Wide Web (WWW 2010), Raleigh, USA, Apr. 2010.
[KGN11] Kivran-Swaine, F., Govindan, P., and Naaman, M., “The Impact of Network Structure on Breaking Ties in Online Social Networks: Unfollowing on Twitter.” In Proc. ACM SIGCHI Conf. on Human Factors in Computing Systems (SIGCHI’11), Vancouver, Canada, May 2011.
[KLP+10] Kwak, H., Lee, C., Park, H., and Moon, S., “What is Twitter, a social network or a news media?” In Proc. WWW 2010, ACM Press (2010), 591-600.
[LJS+10] Lee, Y., Jung, H., Song, W., and Lee, J.H., “Mining the Blogosphere for Top News Stories Identification.” In Proc. 33rd ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR’10), Geneva, Switzerland, Apr. 2010.
[MGL09] Melville, P., Gryc, W., and Lawrence, R.D., “Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification.” In Proc. 15th ACM SIGKDD Int’l. Conf. on Knowledge Discovery and Data Mining (SIGKDD’09), Paris, France, June 2009.
[OBR+10] O’Connor, B., Balasubramanyan, R., Routledge, B.R., and Smith, N.A., “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series.” In Proc. International AAAI Conference on Weblogs and Social Media, Washington DC, USA, May 2010.
[QYS+10] Qureshi, M.A., Younus, A., Saeed, M., and Touheed, N. T., “Identifying and Ranking Topic Clusters in the Blogosphere.” In Proc. 20th Int’l Conf. on World Wide Web (WWW 2011), Hyderabad, India, Mar. 2011.
[RMK11] Romero, D., Meeder, B., and Kleinberg, J., “Differences in the Mechanics of Information Diffusion Across Topics: Idioms, Political Hashtags and Complex Contagion on Twitter.” In Proc. COLING Workshop on People’s Web Meets NLP 2010, Beijing, China, Aug. 2010.
[SOM10] Sakaki, T., Okazaki, M., and Matsuo, Y., “Earthquake shakes Twitter users: real-time event detection by social sensors.” In Proc. WWW 2010, ACM Press (2010), 851-860.
[YQG+11] Younus, A., Qureshi, M.A., Ghazi, A.N., Mumtaz, S., Saeed, M., Touheed, N. T., and Qureshi, M.S., “Ins and Outs of News: Twitter as a Real-Time News Analysis Service.” In Proc. IUI Workshop on Visual Interfaces to the Social and Semantic Web (VISSW ’11), Stanford University, California, USA, Feb. 2011.
[YSK+09] Yoon, S.H., Shin, J.H., Kim, S.W., and Park, S., “Extraction of a Latent Blog Community based on Subject.” In Proc. 18th ACM Conference on Information and Knowledge Management (CIKM '09), Hong Kong, China, Nov. 2009.