paper presentation for inf 384h (

Can Social Bookmarking Improve Web Search? Ashish Jain Information Retrieval Paper Presentation

DESCRIPTION

Based on the paper: Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search?

TRANSCRIPT

Page 1: Paper Presentation for INF 384H (

Can Social Bookmarking Improve Web Search?

Ashish Jain

Information Retrieval

Paper Presentation

Outline

1 Introduction

2 Terminology

3 Collection of Data

4 Related Work

5 URLs

    Result 1 (Positive)

    Result 2 (Positive)

    Result 3 (Positive)

    Result 4 (Positive)

    Result 5 (Positive)

    Result 8 (Negative)

    Result 9 (Negative)

6 Tags

    Result 6 (Positive)

    Result 7 (Positive)

    Result 10 (Negative)

    Result 11 (Negative)

7 Discussion

Introduction

What is social bookmarking?

Show video (http://www.commoncraft.com/video/social-bookmarking).

Ashish Jain (INF384H) Social Bookmarking Paper Presentation 3 / 51

Introduction

Figure: Major types of data used by search engines

Introduction

What information does del.icio.us have?

Lots of <url, tag, user> tuples.

How can del.icio.us information help a search engine?

If the URLs are unknown to a search engine, they can be added to the list of URLs to be crawled.

Vocabulary problem: Users use different words to refer to the same information. For example, a user searching for pain killers might enter the query “analgesic”.

Introduction

Possibilities

Suppose K represents known to a search engine and U represents unknown to a search engine.

            Tags (K)        Tags (U)
URLs (K)    Both known      Tags unknown
URLs (U)    URLs unknown    Both tags and URLs unknown

When will del.icio.us information be useful to a search engine?

When the URLs in del.icio.us are not a subset of the URLs crawled by a search engine.

When the tags given to a particular web page are not present in the URL, title, or content of that web page.

Introduction

Authors are trying to find answers to the following questions:

How often do we find “non-obvious” tags?

Is del.icio.us really more up-to-date than a search engine?

What coverage does del.icio.us have of the web?

Terminology

Definitions

Triple A triple is a <user_i, tag_j, url_k> tuple, signifying that user i has tagged URL k with tag j.

Post A post is a URL bookmarked by a user, together with its associated metadata. A post is made up of many triples, though it may also contain information such as a user comment.

Label A label is a <tag_i, url_k> pair, signifying that at least one triple containing tag i and URL k exists in the system.

Host The full host part of a URL. For example, in http://i.stanford.edu/index.html, i.stanford.edu is the host.

Domain The institution-level part of the host. For example, in http://i.stanford.edu/index.html, stanford.edu is the domain.
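The definitions above can be illustrated with a small Python sketch. The sample triples, the user names, and the helper names host_of and domain_of are made up for illustration, and taking the last two host labels as the domain is a naive assumption that fails for suffixes such as .co.uk.

```python
from urllib.parse import urlparse

def host_of(url):
    """Full host part of a URL, e.g. i.stanford.edu."""
    return urlparse(url).hostname

def domain_of(url):
    """Naive assumption: the domain is the last two labels of the host."""
    return ".".join(host_of(url).split(".")[-2:])

# Hypothetical <user, tag, url> triples.
triples = [
    ("alice", "databases", "http://i.stanford.edu/index.html"),
    ("alice", "research", "http://i.stanford.edu/index.html"),
    ("bob", "databases", "http://i.stanford.edu/index.html"),
]

# A label exists if at least one triple contains that (tag, url) pair.
labels = {(tag, url) for _, tag, url in triples}

# A post groups all triples from one user for one URL.
posts = {}
for user, tag, url in triples:
    posts.setdefault((user, url), set()).add(tag)

print(host_of(triples[0][2]), domain_of(triples[0][2]))
print(len(labels), "labels,", len(posts), "posts")
```

Here three triples collapse into two labels and two posts, which shows why the three counts differ in the datasets.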

Collection of Data

Possible Sources

Del.icio.us Interfaces

“Recent” feed provides the most recent bookmarks posted to del.icio.us in real time

All posts for a given URL

All posts by a given user

Most recent posts with a given tag

Crawl

Alternatively, one can crawl del.icio.us, treating it as a tripartite graph of users, URLs and tags.

Collection of Data

Datasets

(C)rawl

Large-scale crawl of del.icio.us in September 2006.

(R)ecent

Data gathered using the del.icio.us recent feed interface for nearly 8 months beginning September 28, 2006.

(M)onth

Data gathered from the del.icio.us recent feed interface for one complete month starting May 25, 2007. The gathering process was enhanced, so it is more accurate than the R dataset.

Collection of Data

Comparison

                (C)rawl                         (R)ecent        (M)onth
Posts           ≈ 22M                           ≈ 11M           ≈ 3.6M
Unique URLs     ≈ 1.3M                          ≈ 3M            ≈ 2.5M
Disadvantage    Biased towards popular          Missing data    Missing data
                URLs, tags, users

Collection of Data

Query Dataset

AOL Query Dataset

About 20 million search queries by roughly 650,000 users

Used to simulate the distribution of queries that a search engine might receive.

URLs

Figure: Overview

URLs Result 1 (Positive)

Result 1

Aim

Are pages posted to del.icio.us often recently modified?

URLs Result 1 (Positive)

Methodology

Modification Date of a Web page

As we studied in previous papers, determining the exact modification date of a web page is hard.

Search engines have to estimate the modification date of a web page in order to crawl the web efficiently.

The Yahoo! Search API gives the modification date of a web page. The authors use it to determine the modification date of a web page.

URLs Result 1 (Positive)

Methodology

Compare

del.icio.us: Pages sampled from the del.icio.us recent feed as they were posted

Yahoo! 1, 10, and 100: The top 1, 10, and 100 results (respectively) of Yahoo! searches for queries sampled from the AOL query dataset

ODP: Pages sampled from the Open Directory Project (dmoz.org)

URLs Result 1 (Positive)

Results

Pages from del.icio.us are often more recently modified than pages from ODP.

Found a correlation between a search result being ranked higher and a result having been modified more recently.

Top 10 results from Yahoo! Search were about the same age as the pages found bookmarked in del.icio.us.

Conclusion

del.icio.us users post interesting pages that are actively updated or have been recently created.

URLs Result 2 (Positive)

Result 2

Aim

How many pages belonging to del.icio.us are not known to a search engine?

URLs Result 2 (Positive)

Methodology

Sample pages from the del.icio.us feed as they were posted, and then run searches for those pages immediately afterwards.

Of those pages, about 42.5% were not found. This could be due to several reasons:

Page is indexed under another canonicalized URL

Could be spam

Could be an odd MIME type, for example an image

Page might not have been found yet

Continue searching for each missing page over the next four weeks. If it is found later, assume it was a new page that had not yet been indexed when it was posted.

URLs Result 2 (Positive)

Result

Out of 5,724 URLs which were sampled and were missing, 1,750 were later found.

Implies roughly 30% of the missing URLs were new URLs.

Implies roughly 12.5% of the URLs posted to del.icio.us were new, i.e. 42.5% × 30%.

Conclusion

del.icio.us can serve as a (small) data source for new web pages and to help crawl ordering.
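The arithmetic behind this estimate can be checked with a short Python sketch. The variable names are illustrative; the numbers are the ones quoted on this slide and the previous one.

```python
# Figures quoted on the slides.
missing_fraction = 0.425   # share of sampled pages not found in the search engine
sampled_missing = 5724     # missing URLs that were tracked
later_found = 1750         # of those, later found in the index

# Share of missing URLs that turned out to be new pages (about 0.306).
new_among_missing = later_found / sampled_missing

# Share of all sampled del.icio.us URLs that were new (about 0.13).
new_overall = missing_fraction * new_among_missing

# Using the rounded 30% instead gives 0.1275, close to the quoted "roughly 12.5%".
new_overall_rounded = missing_fraction * 0.30

print(f"{new_among_missing:.3f} {new_overall:.3f} {new_overall_rounded:.4f}")
```

Either way, the estimate lands between 12% and 13%, so the quoted 12.5% should be read as a rough figure.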

URLs Result 2 (Positive)

Figure: Result 2

URLs Result 3 (Positive)

Aim

Check coverage of search results by del.icio.us

URLs Result 3 (Positive)

Methodology

Sample queries from the AOL dataset based on query event frequency (this implies a bias towards popular queries).

Run query on Yahoo! Search

Intersect search results with datasets C, M, R.
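The intersection step can be sketched in Python as follows. This is only an illustration: the function name coverage, the toy queries, and the toy URLs are assumptions, and a real comparison would also need URL normalization, which is not shown.

```python
def coverage(results_by_query, bookmarked_urls, k):
    """Fraction of top-k search results that also appear in the bookmark dataset."""
    hits = 0
    total = 0
    for results in results_by_query.values():
        top_k = results[:k]
        total += len(top_k)
        hits += sum(1 for url in top_k if url in bookmarked_urls)
    return hits / total if total else 0.0

# Toy data: ranked result lists per query, and the union of URLs in C, M and R.
results_by_query = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["e", "f", "g", "h"],
}
bookmarked_urls = {"a", "d", "e"}

print(coverage(results_by_query, bookmarked_urls, k=1))
print(coverage(results_by_query, bookmarked_urls, k=4))
```

Computing the same ratio for k = 10 and k = 100 over the sampled queries is what produces the coverage percentages on the next slide.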

URLs Result 3 (Positive)

Results

For the top 100 results, del.icio.us covers 9% of the results returned for a set of over 30,000 queries.

For the top 10 results, del.icio.us covers 19% of the results returned.

Conclusion

del.icio.us URLs are disproportionately common in search results compared to their coverage of the web.

URLs Result 4 (Positive)

Q. Is some subset of users responsible for most of the data in del.icio.us?

On social news sites, it is commonly cited that the majority of front page posts come from a dedicated group of fewer than 100 users.

del.icio.us does exhibit some of these traits, but it is not as dependent on a relatively small group of users.

The top 10% of users account for only 56% of the posts.
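A concentration figure like this one can be computed from per-user post counts. The sketch below is an assumption about how such a share might be calculated; the function name top_share and the toy counts are made up so that the result matches 56%.

```python
def top_share(posts_per_user, fraction=0.10):
    """Share of all posts contributed by the most active `fraction` of users."""
    counts = sorted(posts_per_user.values(), reverse=True)
    n_top = max(1, int(len(counts) * fraction))
    return sum(counts[:n_top]) / sum(counts)

# Toy data: ten hypothetical users with 100 posts in total.
posts_per_user = {
    "u0": 56, "u1": 8, "u2": 8, "u3": 8, "u4": 4,
    "u5": 4, "u6": 4, "u7": 4, "u8": 2, "u9": 2,
}

print(top_share(posts_per_user))
```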

URLs Result 4 (Positive)

Figure: Result 4

URLs Result 5 (Positive)

How many of the posts added to del.icio.us are URLs that are new to del.icio.us?

Estimated using dataset M.

A new post in dataset M was a URL not already in del.icio.us 40% of the time. This should be about 30% after adjusting for filtering (how they came up with this number is not known!).

How often is a completely new domain added to del.icio.us?

12% of posts in Dataset M were URLs whose domains were not in either Dataset C or R. This implies a new domain about 1/8th of the time.

URLs Result 5 (Positive)

Figure: Result 5

URLs Result 8 (Negative)

Aim

How many URLs are posted to del.icio.us every day?

URLs Result 8 (Negative)

Methodology

Plot the posts for every hour in Dataset M and compare them with data collected by Philipp Keller [a]. The two are mutually reinforcing.

Also plot posts from dataset R.

[a] http://deli.ckoma.net/stats (defunct website)

URLs Result 8 (Negative)

Results

About 92,000 posts per weekend day

About 133,000 posts per weekday

Implies about 851,000 posts per week

About 44 million posts per year [a]

[a] There are about 1.5 million blog posts per day

Conclusion

Compared to blog posts, the number of posts per day is small, about 1/10.

Posting rate on del.icio.us is marked by a series of increases followed by periods of relative stability.
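The weekly and yearly totals can be reproduced with a few lines of Python. The variable names are illustrative. Note that the rounded daily rates give 849,000 posts per week, slightly below the 851,000 on the slide; presumably the slide's figure comes from unrounded rates, which is an assumption.

```python
# Rounded daily rates quoted on the slide.
weekday_posts = 133_000
weekend_posts = 92_000

# Five weekdays plus two weekend days.
per_week_rounded = 5 * weekday_posts + 2 * weekend_posts

# Continue with the weekly figure quoted on the slide.
per_week_slide = 851_000
per_year = per_week_slide * 52
per_day = per_week_slide / 7

# Comparison with the blog-post rate from the footnote.
blog_posts_per_day = 1_500_000
ratio = per_day / blog_posts_per_day

print(per_week_rounded, per_year, round(per_day), round(ratio, 3))
```

The ratio comes out a little above 0.08, which is consistent with the "about 1/10" in the conclusion.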

URLs Result 9 (Negative)

Aim

What is the size of del.icio.us ?

URLs Result 9 (Negative)

Methodology

Divide time into three sets.

t1 Period before Schachter’s announcement on May 24th [a]

t2 Between May 24th and the start of Philipp Keller’s data gathering

t3 Start of Philipp Keller’s data gathering to the present

$t_1 + t_2 + t_3 = 400{,}000 + (p_1 \times d_b \times f) + (n_k \times f + m_k \times d_k \times f)$

Equal to about 117 million posts [b]

A reasonable estimate should be between 60 and 150 million posts. [c]

Estimate that between 20 and 50 percent of posts are unique URLs.

[a] Joshua Schachter, creator of del.icio.us, announced in May 2004 that there were 400,000 posts and 200,000 URLs.

[b] Most likely an overestimate, as the authors chose upper bound values for $d_b$ and $d_k$.

[c] It does not include private posts.

URLs Result 9 (Negative)

Results

There are about 115 million public posts [a].

There are about 30-50 million unique URLs.

[a] They estimate that there are between 60 and 150 million posts. 115 million is not an average of 60 and 150 million!

Conclusion

The number of total posts is relatively small compared to the web as a whole.

URLs Result 9 (Negative)

Figure: Result 9

Tags Result 6 (Positive)

Aim

Is there any correlation between tags and queries?

Tags Result 6 (Positive)

Methodology

Checked the tag-query overlap between the tags in dataset M and the query terms in the AOL query dataset.

22% of the AOL query dataset is made up of domain or URL queries. Removed those.

Removed certain stop-word-like tags from dataset M.

Plotted the number of times a tag occurs in Dataset M versus the number of times it occurs in the AOL query dataset.
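The overlap measurement described above could be sketched as follows. This is not the authors' code: the function names, the crude looks_like_url heuristic, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def looks_like_url(query):
    """Crude, hypothetical heuristic for domain or URL queries."""
    q = query.lower().strip()
    return q.startswith(("http://", "https://", "www.")) or q.endswith(
        (".com", ".org", ".net")
    )

def overlap(queries, tag_counts, k):
    """Fraction of non-URL queries containing at least one of the top-k tags."""
    top_tags = {tag for tag, _ in tag_counts.most_common(k)}
    kept = [q for q in queries if not looks_like_url(q)]
    hits = sum(1 for q in kept if top_tags & set(q.lower().split()))
    return hits / len(kept) if kept else 0.0

# Toy data standing in for Dataset M tag counts and AOL queries.
tag_counts = Counter({"java": 5, "music": 3, "recipes": 1})
queries = [
    "java tutorial",
    "www.google.com",
    "free music downloads",
    "weather austin",
    "myspace.com",
]

print(overlap(queries, tag_counts, k=1))
print(overlap(queries, tag_counts, k=2))
```

Raising k can only add tags to the matching set, so the overlap never decreases as k grows, which matches the increasing percentages reported for the top 100, 500, and 1000 tags.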

Tags Result 6 (Positive)

Figure: A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset

Tags Result 6 (Positive)

Results

At least one of the top 100, 500, and 1000 tags occurred in 8.6%, 25.3%, and 36.8% (respectively) of these non-domain, non-URL queries.

Conclusion

del.icio.us may be able to help with queries where tags overlap with query terms.

Tags Result 7 (Positive)

Aim

Are the tags in del.icio.us of good quality? Or are they nonsensical tags like “cool”, “fi32”, etc.?

Tags Result 7 (Positive)

Methodology: User Study

10 people (graduate students and “mix of individuals associated with our department”) manually evaluated posts to determine their quality.

Sampled one post out of every five hundred, and then gave blocks of posts for individuals to label.

Most individuals labeled 100 to 150 posts.

For each tag, the authors asked whether the tag was “relevant”, “applies to the whole document,” and/or “subjective.”

Bar for relevance was set low: whether a random person would agree that it was reasonable to say that the tag described the page.

Tags Result 7 (Positive)

Results

Only about 7% were deemed subjective (less than one in twenty for all users)

No “spam”

Conclusion

Tags on the whole are of good quality.

Tags Result 10 (Negative)

Aim

Do people use tags which are not obvious from the context?

Tags Result 10 (Negative)

Methodology

Randomly pick 20,000 posts from Dataset M.

Convert HTML to text. Also look at the page text of pages that link to the URL in question (backlinks) and pages that are linked from the URL in question (forward links).

Extract tokens. Check whether pages are in English or not.

Lower case all tags and tokens.

Compare
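The comparison step might look like the following Python sketch. The tokenizer, the function names, and the sample texts are assumptions; in particular, this simple tokenizer splits on punctuation, so a tag such as "c++" would never match.

```python
import re

def tokens(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def tag_locations(tag, title, page_text, backlink_texts, forward_texts):
    """Report where a lower-cased tag appears as a token."""
    tag = tag.lower()
    return {
        "title": tag in tokens(title),
        "page": tag in tokens(page_text),
        "backlinks": any(tag in tokens(t) for t in backlink_texts),
        "forward": any(tag in tokens(t) for t in forward_texts),
    }

# Hypothetical post: one tag and the text already extracted from HTML.
found = tag_locations(
    "Java",
    title="Java Tutorials",
    page_text="Learn the Java language step by step",
    backlink_texts=["a list of programming links"],
    forward_texts=["download the java sdk"],
)
print(found)
```

Aggregating these booleans over all sampled posts gives the kind of percentages reported in the results.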

Tags Result 10 (Negative)

Results

50% of the time, the tag is in the page text.

16% of the time, it is in the title itself.

20% of the time, it appears in all three places: the page it annotates, at least one of its backlinks, and at least one of its forward links.

80% of the time, the tag appears in at least one of three places: the page, its backlinks, or its forward links.

The tags in the other 20% seem to be of lower quality: misspellings, confusing tagging schemes (food/dining).

Conclusion

Most tags can be discovered by a search engine.

Tags Result 11 (Negative)

Aim

Are some domains strongly correlated with particular tags and vice-versa?

Tags Result 11 (Negative)

Example

Table: This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.

Tags Result 11 (Negative)

Methodology

Used Dataset C, which is highly biased towards popular URLs, tags and users. Therefore, the results of this experiment do not necessarily apply to del.icio.us as a whole.

Build a simple binary classifier and see how it does.

Figure: Function for classification
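The classification function from the figure is not reproduced in this transcript. As a rough illustration only, a host-based binary classifier for a single tag could look like the sketch below; the 0.5 threshold, the use of hosts rather than domains, the function names, and the toy data are all assumptions and not the authors' function.

```python
from collections import defaultdict
from urllib.parse import urlparse

def train(labeled_urls, tag, threshold=0.5):
    """Return the hosts where at least `threshold` of the URLs carry `tag`."""
    with_tag = defaultdict(int)
    total = defaultdict(int)
    for url, tags in labeled_urls:
        host = urlparse(url).hostname
        total[host] += 1
        if tag in tags:
            with_tag[host] += 1
    return {h for h in total if with_tag[h] / total[h] >= threshold}

def predict(url, positive_hosts):
    """Predict that a URL has the tag if its host was learned as positive."""
    return urlparse(url).hostname in positive_hosts

# Toy training data: (URL, set of tags) pairs.
labeled_urls = [
    ("http://java.sun.com/docs", {"java", "docs"}),
    ("http://java.sun.com/jdk", {"java"}),
    ("http://example.org/a", {"news"}),
    ("http://example.org/b", {"java"}),
    ("http://example.org/c", {"blog"}),
]

positive_hosts = train(labeled_urls, "java")
print(positive_hosts)
print(predict("http://java.sun.com/new", positive_hosts))
```

If a rule this simple predicts a tag well, the tag carries little information beyond the host itself, which is the intuition behind counting this result as negative.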

Tags Result 11 (Negative)

Result

Domains are often highly correlated with particular tags and vice-versa.

Conclusion

It may be more efficient to train librarians to label domains than to ask users to tag pages.

Discussion

Summary

Advantages

Actively updated

Prominent in search results

Tags are relevant and objective

Disadvantages

Small amount of data

Tags mostly already appear in titles, page text, and URLs

Not good enough to be used by major search engines.

Discussion

Discussion

Personalized search using del.icio.us bookmarks.

I found the conclusions drawn in subsection Result 1 hard to believe.

I found the conclusions drawn in subsection Result 5 hard to believe.

I found the conclusions drawn in subsection Result 11 hard to believe.

Discussion

Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search? WSDM 2008.
