The Focus Project Soumen Chakrabarti (IIT Bombay) David Gibson (Berkeley) Piotr Indyk (Stanford) Kevin McCurley (IBM Almaden) Martin van Den Berg (Xerox) Byron Dom (IBM Almaden)

Post on 15-Dec-2015

The Focus Project

Soumen Chakrabarti (IIT Bombay), David Gibson (Berkeley)

Piotr Indyk (Stanford), Kevin McCurley (IBM Almaden)

Martin van Den Berg (Xerox), Byron Dom (IBM Almaden)

Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Soumen Chakrabarti (IIT Bombay), Martin van Den Berg (Xerox), Byron Dom (IBM Almaden)

Soumen Chakrabarti, IIT Bombay, 1999

Quote 1

Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe.

Dan Gillmor, San Jose Mercury News

Scenario

Disk drive research group wants to track magnetic surface technologies

Compiler research group wants to trawl the web for graduate student resumés

____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links

Virtual libraries like Yahoo!, the Open Directory Project and the Mining Co.

Structured web queries

How many links were found from an environment protection agency site to a site about oil and natural gas in the last year?

Apart from cycling, what is the most common topic cited by pages on cycling? Answer: “first-aid”

Find Web research pages which are widely cited by Hawaiian vacation pages

Quote 2

As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.

Jim Hake, Founder, Global Information Infrastructure

http://www.gii-awards.com/

Goals

Spontaneous, decentralized formation of topical communities

Automatic construction of a “focused portal” containing resources that are
  Relevant to the user’s focus of interest
  Of high influence and quality
  Collectively comprehensive

Discovery that combines structure and content

Model

Taxonomy with some ‘chosen’ topics

Each page has a relevance score w.r.t. chosen topics

Mendelzon and Milo’s web access cost model

Goal is to ‘expand’ start set to maximize average relevance

[Figure: example taxonomy rooted at All, with Science (Physics, Zoology) and Sports (Cycling, Hiking)]

Properties to be exploited

A page with high relevance tends to link to at least some other relevant pages (radius-one rule)

Given that a page u links to relevant page(s), chances are increased that u points to other relevant pages (radius-two rule)

Syntactic “query-by-example”

If part of the answer is known, trivial search techniques may do quite well

E.g., “European airlines”: +swissair +iberia +klm

E.g., “Car makers”: Which pages link to www.honda.com and www.toyota.com?
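
The “Car makers” example can be sketched as a co-citation lookup: given a link graph, return every page that links to all of the known example sites. This is only an illustration; the `LINKS` graph, the page names, and the helper `pages_linking_to_all` are hypothetical and not part of the original slides.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy link graph: page -> set of hosts it links to.
LINKS: Dict[str, Set[str]] = {
    "autos.example/makers": {"www.honda.com", "www.toyota.com", "www.ford.com"},
    "blog.example/civic": {"www.honda.com"},
    "dealers.example/japan": {"www.toyota.com", "www.honda.com"},
    "news.example/today": {"www.cnn.com"},
}


def pages_linking_to_all(links: Dict[str, Set[str]],
                         targets: Iterable[str]) -> List[str]:
    """Return, in sorted order, the pages whose out-links include every target."""
    wanted = set(targets)
    return sorted(page for page, outs in links.items() if wanted <= outs)


# Pages that cite both known car makers are candidate "car maker" hubs.
car_maker_hubs = pages_linking_to_all(LINKS, ["www.honda.com", "www.toyota.com"])
print(car_maker_hubs)
```

Pages found this way are likely to point to further car makers, which is the radius-two intuition applied by hand.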

The backlink architecture

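[Figure: client C follows a link on http://S1/P1 (server S1) to http://S2/P2 (server S2) and sends the request

GET /P2 HTTP/1.0
Referer: http://S1/P1

Server S2 keeps a local backlink database, which another client C’ can ask: “Who points to S2/P2?”]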

www.cs.berkeley.edu/~soumen/doc/www99back/userstudy
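
A minimal sketch of the server-side idea, assuming each server simply remembers the `Referer` header of incoming requests and answers “who points to this page?” from that local store. The class name `BacklinkDatabase` and its methods are hypothetical; the slides do not specify an interface.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Set


class BacklinkDatabase:
    """Per-server store mapping a local path to the URLs seen as referrers."""

    def __init__(self, host: str) -> None:
        self.host = host
        self._backlinks: DefaultDict[str, Set[str]] = defaultdict(set)

    def record_request(self, path: str, referer: Optional[str] = None) -> None:
        # Only requests that carry a Referer header reveal a backlink.
        if referer:
            self._backlinks[path].add(referer)

    def who_points_to(self, path: str) -> List[str]:
        # .get avoids creating empty entries for paths never requested.
        return sorted(self._backlinks.get(path, set()))


s2 = BacklinkDatabase("S2")
s2.record_request("/P2", referer="http://S1/P1")  # GET /P2 with Referer: http://S1/P1
s2.record_request("/P2")                          # direct visit, no Referer
print(s2.who_points_to("/P2"))
```

Because each server stores only links into its own pages, the extra storage per server stays small and no central service is needed.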

Backlink rationale

Centralized backlink service does not scale

Limited additional storage per server

Turn hyperlinks into undirected edges

A series of forward and backward ‘clicks’ can quickly build a topical community

Can be used to boot-strap the focused crawler

Backlink example 1

Backlink example 2

Backlink example 3

Backlink example 4

Estimating popularity

Extensive research on social network theory
  Wasserman and Faust

Hyperlink based
  Large in-degree indicates popularity/authority
  Not all votes are worth the same

Several similar ideas and refinements
  Google (Page and Brin) and HITS (Kleinberg)
  Resource compilation (Chakrabarti et al)
  Topic distillation (Bharat and Henzinger)

Topic distillation overview

Given web graph and query

Search engine selects sub-graph

Expansion, pruning and edge weights

Nodes iteratively transfer authority to cited neighbors

[Figure: a query is sent to a search engine, which selects a subgraph of the Web]
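
The iterative transfer of authority can be sketched with a plain HITS-style hub/authority iteration (Kleinberg). This sketch assumes unit edge weights and a fixed iteration count, and it leaves out the expansion, pruning, and edge-weighting steps named on the slide; the toy edge list is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (citing page, cited page)


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to unit L2 norm; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return scores if norm == 0 else {k: v / norm for k, v in scores.items()}


def hits(edges: List[Edge], iterations: int = 50):
    """Return (hub, authority) score dictionaries for the given subgraph."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages citing it.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # A page's hub score is the sum of authority scores of pages it cites.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return hub, auth


# Hypothetical subgraph: three hubs cite a1, only h1 also cites a2.
EDGES = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(EDGES)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph `a1` ends up as the top authority and `h1` as the top hub, since `h1` cites both authorities.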

Preliminary distillation-based approach

Design a keyword query to represent topics of focus

Using a large web crawl, run topic distillation on the query

Refine query by inspecting result and trial-and-error

Problems with preliminary approach

Unreliability of keyword match
  Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  Narrow, arbitrary view of relevant subgraph
  Topic model does not improve over time

Dependence on large web crawl and index (lack of “output sensitivity”)

Difficulty of query construction

Output sensitivity

Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages

Ideally effort should scale with size of the result

Time spent crawling and indexing sites unrelated to the topic is wasted

Likewise, time that does not improve comprehensiveness is wasted

Query construction

+“power suppl*”

“switch* mode” smps

-multiprocessor*

“uninterrupt* power suppl*” ups

-parcel*

/Companies/Electronics/Power_Supply

Query complexity

Complex queries needed for distillation

Typical Alta Vista queries are much simpler (Silverstein, Henzinger, Marais and Moricz)

Forcing a hub or authority helps 86% of the time

[Figure: bar chart of words and operators per query (axis 0 to 8). AltaVista: 2.35 words, 0.41 operators. Distillation: 7.03 words, 4.34 operators.]

Proposed solution

Resource discovery system that can be customized to crawl for any topic by giving examples

Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality

Crawler has guidance hooks controlled by these two scores
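
A minimal sketch of how a relevance score can steer a crawler, assuming a priority-queue frontier in which each unvisited link inherits the relevance of the page that cites it. It uses relevance only and omits the centrality score; `TOY_WEB`, its scores, and `focused_crawl` are hypothetical stand-ins for a real fetcher and classifier.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical web: page -> (relevance assigned by a classifier, out-links).
TOY_WEB: Dict[str, Tuple[float, List[str]]] = {
    "seed": (0.9, ["bike-club", "news"]),
    "bike-club": (0.8, ["bike-shop", "trail-map"]),
    "news": (0.1, ["stocks", "weather"]),
    "bike-shop": (0.7, []),
    "trail-map": (0.6, []),
    "stocks": (0.05, []),
    "weather": (0.05, []),
}


def focused_crawl(web: Dict[str, Tuple[float, List[str]]],
                  seeds: List[str], budget: int) -> List[str]:
    """Visit up to `budget` pages, preferring links cited by relevant pages."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        relevance, out_links = web[url]  # stands in for fetch + classify
        visited.append(url)
        for target in out_links:
            if target not in seen:
                seen.add(target)
                # The unseen target inherits the citing page's relevance.
                heapq.heappush(frontier, (-relevance, target))
    return visited


order = focused_crawl(TOY_WEB, ["seed"], budget=5)
print(order)
```

With a budget of five fetches, the links out of the low-relevance `news` page are never followed, while both pages cited by `bike-club` are.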

Administration scenario

[Figure: screenshot with panels labeled Taxonomy Editor, Current Examples, and Suggested Additional Examples, with a “Drag” annotation]

Relevance

[Figure: taxonomy with nodes All, Bus&Econ, Recreation, Arts, Companies, Cycling, Bike Shops, Mt. Biking, Clubs; legend: path nodes, good nodes, subsumed nodes]

$$\Pr[d \text{ is good}] = \sum_{\text{good}(c)} \Pr[c \mid d]$$
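
The relevance of a document on this slide can be read as the total posterior probability that the classifier assigns to the classes marked good. A small sketch under that reading follows; the class names and the probabilities are made up for illustration.

```python
from typing import Dict, Set


def relevance(posterior: Dict[str, float], good_classes: Set[str]) -> float:
    """Sum Pr[c | d] over the classes c that the user marked as good."""
    return sum(p for c, p in posterior.items() if c in good_classes)


# Hypothetical classifier output Pr[c | d] for one document d.
posterior = {
    "Recreation/Cycling/Mt.Biking": 0.5,
    "Recreation/Cycling/Clubs": 0.25,
    "Bus&Econ/Companies/Bike Shops": 0.125,
    "Arts": 0.125,
}
good = {"Recreation/Cycling/Mt.Biking", "Recreation/Cycling/Clubs"}

r = relevance(posterior, good)
print(r)  # 0.75
```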

Classification

How relevant is a document w.r.t. a class?
  Supervised learning, filtering, classification, categorization

Many types of classifiers
  Bayesian, nearest neighbor, rule-based

Hypertext
  Both text and links are class-dependent clues
  How to model link-based features?

The “bag-of-words” document model

Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$

Each c has parameters $\theta(c,t)$ for terms t

Coin with face probabilities $\sum_t \theta(c,t) = 1$

Fix document length $n(d)$ and keep tossing coin

Given c, probability of document is

$$\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
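
A minimal sketch of the multinomial bag-of-words model on this slide: the probability of a document given a class is a multinomial coefficient times a product of per-term probabilities, and Bayes' rule with the class prior gives a posterior over classes. The two classes, their term probabilities, and the sample document are hypothetical, and every term is assumed to have non-zero probability (no smoothing is shown).

```python
import math
from collections import Counter
from typing import Dict, List


def log_prob_doc_given_class(doc_terms: List[str],
                             theta_c: Dict[str, float]) -> float:
    """log Pr[d | c] = log multinomial coefficient + sum_t n(d,t) log theta(c,t)."""
    counts = Counter(doc_terms)
    n = sum(counts.values())
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts.values())
    return log_coeff + sum(k * math.log(theta_c[t]) for t, k in counts.items())


def class_posterior(doc_terms: List[str], prior: Dict[str, float],
                    theta: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Pr[c | d] by Bayes' rule, computed in log space for stability."""
    log_joint = {c: math.log(prior[c]) + log_prob_doc_given_class(doc_terms, theta[c])
                 for c in prior}
    top = max(log_joint.values())
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Hypothetical parameters: each class's term probabilities sum to 1.
prior = {"cycling": 0.5, "finance": 0.5}
theta = {
    "cycling": {"bike": 0.6, "fund": 0.1, "ride": 0.3},
    "finance": {"bike": 0.1, "fund": 0.7, "ride": 0.2},
}
doc = ["bike", "ride", "bike"]

likelihood = math.exp(log_prob_doc_given_class(doc, theta["cycling"]))
post = class_posterior(doc, prior, theta)
print(round(likelihood, 3), max(post, key=post.get))
```

For this document the cycling likelihood is 3 × 0.6² × 0.3 = 0.324, against 0.006 for finance, so the posterior strongly favors cycling.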

Exploiting link features

c=class, t=text, N=neighbors

Text-only model: Pr[t | c]

Using neighbors’ text to judge my topic: Pr[t, t(N) | c]

Better model: Pr[t, c(N) | c]

Non-linear relaxation

Improvement using link features

9600 patents from 12 classes marked by USPTO

Patents have text and cite other patents

Expand test patent to include neighborhood

‘Forget’ fraction of neighbors’ classes

[Figure: %Error (0 to 40) versus %Neighborhood known (0 to 100) for the Text, Link, and Text+Link classifiers]

Putting it together

[Figure: system block diagram with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller, and a Feedback path]

Monitoring the crawler

[Figure: relevance versus time; each point is one URL, overlaid with a moving average]

Measures of success

Harvest rate
  What fraction of crawled pages are relevant

Robustness across seed sets
  Separate crawls with random disjoint samples
  Measure overlap in URLs and servers crawled
  Measure agreement in best-rated resources

Evidence of non-trivial work
  #Links from start set to the best resources
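
Two of these measures can be sketched in a few lines: harvest rate as a moving average of per-page relevance, and crawl overlap as the fraction of one crawl's URLs (or servers) that also appear in the other. The exact definitions used in the experiments are not given on the slides, so the window handling and the overlap formula below are assumptions, as are the example URLs.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def harvest_rate(relevances: List[float], window: int) -> List[float]:
    """Moving average of relevance over the last `window` fetched pages."""
    rates = []
    for i in range(len(relevances)):
        recent = relevances[max(0, i - window + 1): i + 1]
        rates.append(sum(recent) / len(recent))
    return rates


def overlap(first: Iterable[str], second: Iterable[str]) -> float:
    """Fraction of distinct items in `first` that also occur in `second`."""
    a, b = set(first), set(second)
    return len(a & b) / len(a) if a else 0.0


def servers(urls: Iterable[str]) -> List[str]:
    """Host part of each URL, used for server-level overlap."""
    return [urlsplit(u).netloc for u in urls]


rates = harvest_rate([1.0, 0.0, 1.0, 1.0], window=2)

# Two hypothetical crawls started from disjoint seed sets.
crawl1 = ["http://a.example/x", "http://b.example/y"]
crawl2 = ["http://a.example/z", "http://c.example/y"]
url_overlap = overlap(crawl1, crawl2)
server_overlap = overlap(servers(crawl1), servers(crawl2))
print(rates, url_overlap, server_overlap)
```

Here the two crawls share no URLs but half of the first crawl's servers, which is the kind of gap between URL overlap and server overlap the robustness measure is meant to expose.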

Harvest rate

[Figure, two panels. Unfocused: “Harvest Rate (Cycling, Unfocused)”, average relevance (0 to 1) versus #URLs fetched (0 to 10000), averaged over 100. Focused: “Harvest Rate (Cycling, Soft Focus)”, average relevance (0 to 1) versus #URLs fetched (0 to 6000), averaged over 100 and over 1000.]

Crawl robustness

[Figure, two panels titled “Crawl Robustness (Cycling)”: URL overlap and server overlap versus #URLs crawled (0 to 3000), with series Overlap1 and Overlap2 for Crawl 1 and Crawl 2]

Robustness of resource discovery

Sample disjoint sets of starting URLs

Two separate crawls

Find best authorities

Order by rank

Find overlap in the top-rated resources

[Figure: “Resource Robustness (Cycling)”, server overlap (0 to 1) versus #Top resources (0 to 25), with series Overlap1 and Overlap2]

Distance to best resources

[Figure, two panels: “Resource Distance (Mutual Funds)” and “Resource Distance (Cycling)”, #Servers in top 100 versus minimum distance from crawl seed (#links)]

Cycling: cooperative

Mutual funds: competitive

Observations

Random walk on the Web “rapidly mixes” topics

Yet, there are large coherent paths and clusters

Focused crawling gives topic distillation richer data to work on

Combining content with link structure eliminates the need to tune link-based heuristics

Related work

WebWatcher, HotList and ColdList
  Filtering as post-processing, not acquisition

ReferralWeb
  Social network on the Web

Ahoy!, Cora
  Hand-crafted to find home pages and papers

WebCrawler, Fish, Shark, Fetuccino, agents
  Crawler guided by query keyword matches

Comparison with agents

Agents usually look for keywords and hand-crafted patterns

Cannot learn new vocabulary dynamically

Do not use distance-2 centrality information

Client-side assistant

We use taxonomy with statistical topic models

Models can evolve as crawl proceeds

Combine relevance and centrality

Broader scope: inter-community linkage analysis and querying

Conclusion

New architecture for example-driven topic-specific web resource discovery

No dependence on full web crawl and index

Modest desktop hardware adequate

Variable radius goal-directed crawling

High harvest rate

High quality resources found far from keyword query response nodes