paper presentation for inf 384h (
Based on the paper: Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search?
Can Social Bookmarking Improve Web Search?
Ashish Jain
Information Retrieval
Paper Presentation
Outline
1 Introduction
2 Terminology
3 Collection of Data
4 Related Work
5 URLs
  Result 1 (Positive)
  Result 2 (Positive)
  Result 3 (Positive)
  Result 4 (Positive)
  Result 5 (Positive)
  Result 8 (Negative)
  Result 9 (Negative)
6 Tags
  Result 6 (Positive)
  Result 7 (Positive)
  Result 10 (Negative)
  Result 11 (Negative)
7 Discussion
Introduction
What is social bookmarking?
Show video (http://www.commoncraft.com/video/social-bookmarking).
Ashish Jain (INF384H) Social Bookmarking Paper Presentation 3 / 51
Figure: Major types of data used by search engines
What information does del.icio.us have?
Lots of < url , tag , user > tuples.
How can del.icio.us information help a search engine?
If the URLs are unknown to a search engine, they can be added to the list of URLs to be crawled.
Vocabulary problem: Users use different words to refer to the same information. For example, a user searching for pain killers might enter the query “analgesic”.
Possibilities
Suppose K represents known to a search engine and U represents unknown to a search engine.
            Tags (K)        Tags (U)
URLs (K)    Both known      Tags unknown
URLs (U)    URLs unknown    Both tags and URLs unknown
When will del.icio.us information be useful to a search engine?
When the set of URLs in del.icio.us is not a subset of the URLs crawled by a search engine.
When the tags given to a particular web page are not present in the URL, title, or content of that page.
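The two conditions above can be sketched as a set difference and a substring check. This is a minimal illustration, not the authors' procedure; the function names, the sample URLs, and the page fields are all hypothetical.

```python
def unknown_urls(bookmarked_urls, crawled_urls):
    """URLs bookmarked on del.icio.us that the search engine has not crawled."""
    return set(bookmarked_urls) - set(crawled_urls)


def non_obvious_tags(tags, url, title, content):
    """Tags that do not appear in the page's URL, title, or content."""
    haystack = " ".join([url, title, content]).lower()
    return [tag for tag in tags if tag.lower() not in haystack]


# Hypothetical sample data, for illustration only.
bookmarked = {"http://example.org/a", "http://example.org/b"}
crawled = {"http://example.org/a"}
new_urls = unknown_urls(bookmarked, crawled)

extra_tags = non_obvious_tags(
    tags=["analgesic", "health"],
    url="http://example.org/pain-killers",
    title="Pain killers explained",
    content="A guide to common pain killers and health risks.",
)
```

In this toy case the bookmark set contributes one uncrawled URL, and "analgesic" is the only tag that the page itself does not already contain.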
Authors are trying to find answers to the following questions:
How often do we find “non-obvious” tags?
Is del.icio.us really more up-to-date than a search engine?
What coverage does del.icio.us have of the web?
Terminology
Definitions
Triple A triple is a < user_i , tag_j , url_k > tuple, signifying that user i has tagged URL k with tag j.
Post A post is a URL bookmarked by a user together with its associated metadata. A post is made up of many triples, though it may also contain information such as a user comment.
Label A label is a < tag_i , url_k > pair signifying that at least one triple containing tag i and URL k exists in the system.
Host The full host part of a URL. For example, in http://i.stanford.edu/index.html, i.stanford.edu is the host.
Domain The institutional-level part of the host. For example, in http://i.stanford.edu/index.html, stanford.edu is the domain.
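The definitions above map onto simple data structures. The sketch below is an assumption-laden illustration: the user names and triples are made up, and `domain_of` naively takes the last two dot-separated parts of the host, which works for stanford.edu but not for suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Full host part of a URL, e.g. 'i.stanford.edu'."""
    return urlsplit(url).hostname


def domain_of(url):
    """Naive domain: last two dot-separated parts of the host (assumption)."""
    return ".".join(host_of(url).split(".")[-2:])


def labels_from_triples(triples):
    """Collapse (user, tag, url) triples into the set of (tag, url) labels."""
    return {(tag, url) for _user, tag, url in triples}


# Hypothetical triples; two users apply the same tag to the same URL.
triples = [
    ("alice", "database", "http://i.stanford.edu/index.html"),
    ("bob", "database", "http://i.stanford.edu/index.html"),
    ("bob", "research", "http://i.stanford.edu/index.html"),
]
labels = labels_from_triples(triples)
```

Three triples collapse into two labels here, because a label only records that at least one user applied the tag to the URL.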
Collection of Data
Possible Sources
Del.icio.us Interfaces
“Recent” feed provides the most recent bookmarks posted to del.icio.us in real time
All posts for a given URL
All posts by a given user
Most recent posts with a given tag
Crawl
Alternatively, one can crawl del.icio.us, treating it as a tripartite graph of users, URLs, and tags.
Datasets
(C)rawl
Large-scale crawl of del.icio.us in September 2006.
(R)ecent
Data gathered using the del.icio.us recent feed interface for nearly 8 months beginning September 28, 2006.
(M)onth
Data gathered from the del.icio.us recent feed interface for one complete month starting May 25, 2007. The gathering process was enhanced, so this dataset is more accurate than the R dataset.
Comparison
              (C)rawl                  (R)ecent       (M)onth
Posts         ≈ 22M                    ≈ 11M          ≈ 3.6M
Unique URLs   ≈ 1.3M                   ≈ 3M           ≈ 2.5M
Disadvantage  Biased towards popular   Missing data   Missing data
              URLs, tags, users
Query Dataset
AOL Query Dataset
About 20 million search queries by roughly 650,000 users.
Used to simulate the distribution of queries that a search engine might receive.
URLs
Figure: Overview
URLs Result 1 (Positive)
Result 1
Aim
Are pages posted to del.icio.us often recently modified?
Methodology
Modification Date of a Web page
As we studied in previous papers, determining the exact modification date of a web page is hard.
Search engines have to estimate the modification date of a web page in order to crawl the web efficiently.
The Yahoo! Search API reports a modification date for a web page. The authors use this value as the modification date of each page.
Methodology
Compare
del.icio.us Pages sampled from the del.icio.us recent feed as they were posted
Yahoo! 1, 10, and 100 The top 1, 10, and 100 results (respectively) of Yahoo! searches for queries sampled from the AOL query dataset
ODP Pages sampled from the Open Directory Project (dmoz.org)
Results
Pages from del.icio.us are often more recently modified than pages from ODP.
The authors found a correlation between a search result being ranked higher and the result having been modified more recently.
The top 10 results from Yahoo! Search were about the same age as the pages bookmarked in del.icio.us.
Conclusion
del.icio.us users post interesting pages that are actively updated or have been recently created.
URLs Result 2 (Positive)
Result 2
Aim
How many pages belonging to del.icio.us are not known to a search engine?
Methodology
Sample pages from the del.icio.us feed as they are posted, and search for each page immediately afterwards.
Of those pages, about 42.5% were not found. This could be due to several reasons:
Page is indexed under another canonicalized URL
Could be spam
Could be an odd MIME type, for example an image
Page might not have been found yet
Keep searching for each missing page over the next four weeks. If it is found later, assume it had not yet been indexed when it was posted.
Result
Out of 5,724 sampled URLs that were missing, 1,750 were later found.
Implies roughly 30% of the missing URLs were new URLs.
Implies roughly 12.5% of the URLs posted to del.icio.us are new, unindexed pages, i.e. 42.5% × 30%.
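The arithmetic behind these percentages can be checked directly. The snippet below only restates the figures quoted above; the variable names are illustrative. Using the unrounded ratio gives values slightly above the rounded figures on the slide.

```python
sampled_missing = 5724   # sampled URLs missing from the index at posting time
later_found = 1750       # of those, URLs that the search engine indexed later
missing_rate = 0.425     # share of sampled posts not found right after posting

# Share of the missing URLs that turned out to be new pages.
new_among_missing = later_found / sampled_missing

# Share of all sampled posts that were new, unindexed pages.
new_among_all = missing_rate * new_among_missing

print(f"new among missing: {new_among_missing:.1%}")
print(f"new among all posts: {new_among_all:.1%}")
```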
Conclusion
del.icio.us can serve as a (small) data source for new web pages and can help with crawl ordering.
Figure: Result 2
URLs Result 3 (Positive)
Aim
Check coverage of search results by del.icio.us
Methodology
Sample queries from the AOL dataset based on query event frequency (i.e. the sample is biased towards popular queries).
Run query on Yahoo! Search
Intersect search results with datasets C, M, R.
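The intersection step can be sketched as follows. This is a simplified illustration under stated assumptions: the ranked result lists and the bookmark set are hypothetical, URLs are compared as exact strings with no canonicalization, and `coverage` is an illustrative name rather than anything from the paper.

```python
def coverage(results_per_query, delicious_urls, k):
    """Fraction of top-k search results that also appear in del.icio.us."""
    hits = total = 0
    for results in results_per_query:
        top_k = results[:k]
        total += len(top_k)
        hits += sum(1 for url in top_k if url in delicious_urls)
    return hits / total if total else 0.0


# Hypothetical ranked results for two queries and a tiny bookmark set.
results_per_query = [
    ["http://a.example/", "http://b.example/", "http://c.example/"],
    ["http://d.example/", "http://a.example/", "http://e.example/"],
]
delicious_urls = {"http://a.example/", "http://e.example/"}

top1 = coverage(results_per_query, delicious_urls, k=1)
top3 = coverage(results_per_query, delicious_urls, k=3)
```

Computing the same ratio for several values of k is what allows the top-10 and top-100 coverage figures to be compared.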
Results
For the top 100 results, del.icio.us covers 9% of the results returned for a set of over 30,000 queries.
For the top 10 results, del.icio.us covers 19% of the results returned.
Conclusion
del.icio.us URLs are disproportionately common in search results compared to their coverage of the web.
URLs Result 4 (Positive)
Q. Is some subset of users responsible for most of the data in del.icio.us?
On social news sites, it is commonly cited that the majority of front page posts come from a dedicated group of fewer than 100 users.
del.icio.us does exhibit some of these traits, but it is not as dependent on a relatively small group of users.
The top 10% of users account for only 56% of the posts.
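A figure like "the top 10% account for 56% of the posts" can be computed from a list of post authors. The sketch below uses made-up data and an illustrative function name; it is not the authors' code.

```python
from collections import Counter


def top_user_share(post_users, fraction=0.10):
    """Share of posts made by the most active `fraction` of users."""
    counts = sorted(Counter(post_users).values(), reverse=True)
    n_top = max(1, round(len(counts) * fraction))
    return sum(counts[:n_top]) / sum(counts)


# Hypothetical data: 10 users, one of whom made 11 of the 20 posts.
post_users = ["u0"] * 11 + [f"u{i}" for i in range(1, 10)]
share = top_user_share(post_users)
```

In this toy dataset the single most active user (the top 10% of ten users) accounts for 55% of the posts.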
Figure: Result 4
URLs Result 5 (Positive)
How much of the information added to del.icio.us is new?
Estimated using Dataset M.
The URL of a new post in Dataset M was not already in del.icio.us 40% of the time.
This should be about 30% after adjusting for filtering (how the authors arrived at this number is not explained).
How often is a completely new domain added to del.icio.us?
12% of posts in Dataset M were URLs whose domains were not in either Dataset C or R.
Implies about 1/8th of the time.
Figure: Result 5
URLs Result 8 (Negative)
Aim
How many URLs are posted to del.icio.us every day?
Methodology
Plot the posts for every hour in Dataset M and compare them with data collected by Philipp Keller [a]. The two are mutually reinforcing.
Also plot posts from Dataset R.
[a] http://deli.ckoma.net/stats (defunct website)
Results
About 92,000 posts per weekend day
About 133,000 posts per weekday
Implies about 851,000 posts per week
About 44 million posts per year [a]
[a] For comparison, there are about 1.5 million blog posts per day.
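The weekly and yearly totals follow from the daily rates. The snippet below redoes the arithmetic with the rounded figures quoted above; with these rounded inputs the weekly total comes out at 849,000, slightly below the 851,000 on the slide, presumably because the slide's figure was derived from unrounded daily rates (an assumption).

```python
weekend_posts_per_day = 92_000
weekday_posts_per_day = 133_000

# Five weekdays plus two weekend days.
posts_per_week = 5 * weekday_posts_per_day + 2 * weekend_posts_per_day
posts_per_year = posts_per_week * 52

blog_posts_per_day = 1_500_000  # figure quoted in the footnote
ratio_to_blogs = weekday_posts_per_day / blog_posts_per_day

print(posts_per_week, posts_per_year, round(ratio_to_blogs, 3))
```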
Conclusion
Compared to blog posts, the number of del.icio.us posts per day is small, about 1/10.
The posting rate on del.icio.us is marked by a series of increases followed by periods of relative stability.
URLs Result 9 (Negative)
Aim
What is the size of del.icio.us ?
Methodology
Divide time into three periods.
t1 Period before Schachter’s announcement on May 24th [a]
t2 Period between May 24th and the start of Philipp Keller’s data gathering
t3 Start of Philipp Keller’s data gathering to the present
t1 + t2 + t3 = 400,000 + (p1 × db × f) + (nk × f + mk × dk × f)
Equal to about 117 million posts [b]
A reasonable estimate should be between 60 and 150 million posts. [c]
The authors estimate that between 20 and 50 percent of posts are unique URLs.
[a] Joshua Schachter, creator of del.icio.us, announced in May 2004 that there were 400,000 posts and 200,000 URLs.
[b] Most likely an overestimate, as the authors chose upper-bound values for db and dk.
[c] It does not include private posts.
Results
There are about 115 million public posts. [a]
There are about 30-50 million unique URLs.
[a] The authors estimate that there are between 60 and 150 million posts. Note that 115 million is not the average of 60 and 150 million!
Conclusion
The total number of posts is relatively small compared to the web as a whole.
Figure: Result 9
Tags Result 6 (Positive)
Aim
Is there any correlation between tags and queries?
Methodology
Checked the tag-query overlap between the tags in Dataset M and the query terms in the AOL query dataset.
22% of the AOL query dataset is made up of domain or URL queries. These were removed.
Removed certain stop-word-like tags from Dataset M.
Plotted the number of times a tag occurs in Dataset M versus the number of times it occurs in the AOL query dataset.
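The counting behind such a plot can be sketched as below. All names and sample data are hypothetical, queries are split on whitespace only, and URL-like queries are assumed to have been removed beforehand; the authors' actual tokenization and stop-tag list are not given here.

```python
from collections import Counter


def tag_query_counts(tags, queries, stop_tags=frozenset()):
    """Pair each tag's count in the posts with its count among query terms."""
    tag_counts = Counter(t.lower() for t in tags if t.lower() not in stop_tags)
    term_counts = Counter(
        term for query in queries for term in query.lower().split()
    )
    return {tag: (n, term_counts[tag]) for tag, n in tag_counts.items()}


def share_of_queries_with_top_tags(tags, queries, top_n):
    """Fraction of queries containing at least one of the top_n tags."""
    top = {t for t, _ in Counter(t.lower() for t in tags).most_common(top_n)}
    matched = sum(1 for q in queries if top & set(q.lower().split()))
    return matched / len(queries)


# Hypothetical tags and queries (URL-like queries assumed already removed).
tags = ["java", "java", "java", "music", "music", "toread"]
queries = ["java tutorial", "free music downloads", "weather austin"]
points = tag_query_counts(tags, queries, stop_tags={"toread"})
share = share_of_queries_with_top_tags(tags, queries, top_n=2)
```

Each entry of `points` is one (tag count, query count) pair, i.e. one point of the scatter plot; `share` corresponds to the "one of the top N tags occurred in X% of queries" style of figure.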
Figure: A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset
Results
One of the top 100, 500, and 1000 tags occurred in 8.6%, 25.3%, and 36.8%, respectively, of the non-domain, non-URL queries.
Conclusion
del.icio.us may be able to help with queries where tags overlap with query terms.
Tags Result 7 (Positive)
Aim
Are the tags in del.icio.us of good quality, or are they nonsensical tags like “cool”, “fi32”, etc.?
Methodology: User Study
10 people (graduate students and a “mix of individuals associated with our department”) manually evaluated posts to determine their quality.
The authors sampled one post out of every five hundred, and then gave blocks of posts to individuals to label.
Most individuals labeled 100 to 150 posts.
For each tag, the evaluators were asked whether the tag was “relevant”, “applies to the whole document,” and/or “subjective.”
The bar for relevance was set low: whether a random person would agree that it was reasonable to say that the tag described the page.
Results
Only about 7% of tags were deemed subjective (less than one in twenty for all users)
No “spam”
Conclusion
Tags on the whole are of good quality.
Tags Result 10 (Negative)
Aim
Do people use tags which are not obvious from the context?
Methodology
Randomly pick 20,000 posts from Dataset M.
Convert HTML to text. Also look at the page text of pages that link to the URL in question (backlinks) and pages that are linked from the URL in question (forward links).
Extract tokens. Check whether pages are in English or not.
Lower case all tags and tokens.
Compare
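The comparison step can be sketched as a token lookup. This is a simplified illustration, assuming the HTML has already been converted to text; the tokenizer (a plain `\w+` regular expression), the function names, and the sample page are assumptions, not the authors' pipeline.

```python
import re


def tokens(text):
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def tag_locations(tag, title, page_text, backlink_texts, forward_texts):
    """Report where a lower-cased tag occurs as a token."""
    tag = tag.lower()
    return {
        "title": tag in tokens(title),
        "page": tag in tokens(page_text),
        "backlinks": any(tag in tokens(t) for t in backlink_texts),
        "forward_links": any(tag in tokens(t) for t in forward_texts),
    }


# Hypothetical page and neighbours, already converted from HTML to text.
found = tag_locations(
    tag="Java",
    title="Collections tutorial",
    page_text="This tutorial covers the Java collections framework.",
    backlink_texts=["Good programming links"],
    forward_texts=["Java API reference"],
)
```

Aggregating these per-tag flags over a sample of posts yields percentages of the kind reported in the results.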
Results
50% of the time, the tag is in the page text.
16% of the time, it is in the title itself.
20% of the time, it appears in all three places: the page it annotates, at least one of its backlinks, and at least one of its forward links.
80% of the time, the tag appears in at least one of three places: the page, its backlinks, or its forward links.
The tags in the other 20% seem to be of lower quality: misspellings and confusing tagging schemes (food/dining).
Conclusion
Most tags can be discovered by a search engine
Tags Result 11 (Negative)
Aim
Are some domains strongly correlated with particular tags and vice-versa?
Example
Table: This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.
Methodology
Used Dataset C, which is highly biased towards popular URLs, tags, and users. Therefore, the results of this experiment do not necessarily apply to del.icio.us as a whole.
Build a simple binary classifier and see how it does.
Figure: Function for classification
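The classification function from the figure is not reproduced in this transcript. As a hypothetical stand-in (not the authors' function), a simple binary host-based rule could look like this: predict that a URL carries a tag if a large enough share of the known URLs on its host carry that tag. The threshold, names, and training pairs are all assumptions.

```python
from collections import defaultdict


def train_host_rule(labeled_urls, tag, threshold=0.5):
    """Hosts where at least `threshold` of the URLs carry `tag`."""
    with_tag = defaultdict(int)
    total = defaultdict(int)
    for host, tags in labeled_urls:
        total[host] += 1
        with_tag[host] += tag in tags  # True counts as 1
    return {h for h in total if with_tag[h] / total[h] >= threshold}


def predict(host, positive_hosts):
    """Binary decision: does a URL on this host get the tag?"""
    return host in positive_hosts


# Hypothetical (host, tags) training pairs.
labeled_urls = [
    ("java.example.com", {"java", "programming"}),
    ("java.example.com", {"java"}),
    ("news.example.com", {"news"}),
    ("news.example.com", {"java", "news"}),
    ("news.example.com", {"politics"}),
]
positive_hosts = train_host_rule(labeled_urls, tag="java")
```

How well such a rule performs on held-out URLs is one way to gauge how strongly hosts and tags are correlated.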
Result
Domains are often highly correlated with particular tags and vice-versa.
Conclusion
It may be more efficient to train librarians to label domains than to ask users to tag pages.
Discussion
Summary
Advantages
Actively updated
Prominent in search results
Tags are relevant and objective
Disadvantages
Small amount of data
Tags in titles, page text, URLs
Not good enough to be used by major search engines.
Discussion
Personalized search using del.icio.us bookmarks.
I found the conclusions drawn in subsection Result 1 hard to believe.
I found the conclusions drawn in subsection Result 5 hard to believe.
I found the conclusions drawn in subsection Result 11 hard to believe.
Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search? WSDM 2008.