The Focus Project
Soumen Chakrabarti (IIT Bombay), David Gibson (Berkeley)
Piotr Indyk (Stanford), Kevin McCurley (IBM Almaden)
Martin van Den Berg (Xerox), Byron Dom (IBM Almaden)
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
Soumen Chakrabarti (IIT Bombay), Martin van Den Berg (Xerox), Byron Dom (IBM Almaden)
Soumen Chakrabarti, IIT Bombay, 1999
Quote 1
Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe.
Dan Gillmor, San Jose Mercury News
Scenario
Disk drive research group wants to track magnetic surface technologies
Compiler research group wants to trawl the web for graduate student résumés
____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
Virtual libraries like Yahoo!, the Open Directory Project and the Mining Co.
Structured web queries
How many links were found from an environment protection agency site to a site about oil and natural gas in the last year?
Apart from cycling, what is the most common topic cited by pages on cycling? Answer: “first-aid”
Find Web research pages which are widely cited by Hawaiian vacation pages
Quote 2
As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.
Jim Hake, Founder, Global Information Infrastructure
http://www.gii-awards.com/
Goals
Spontaneous, decentralized formation of topical communities
Automatic construction of a “focused portal” containing resources that are
  Relevant to the user’s focus of interest
  Of high influence and quality
  Collectively comprehensive
Discovery that combines structure and content
Model
Taxonomy with some ‘chosen’ topics
Each page has a relevance score w.r.t. chosen topics
Mendelzon and Milo’s web access cost model
Goal is to ‘expand’ start set to maximize average relevance
[Figure: taxonomy tree. All has children Science (Physics, Zoology) and Sports (Cycling, Hiking)]
Properties to be exploited
A page with high relevance tends to link to at least some other relevant pages (radius-one rule)
Given that a page u links to relevant page(s), chances are increased that u points to other relevant pages (radius-two rule)
Syntactic “query-by-example”
If part of the answer is known, trivial search techniques may do quite well
E.g., “European airlines”: +swissair +iberia +klm
E.g., “Car makers”: Which pages link to www.honda.com and www.toyota.com?
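The “car makers” example amounts to intersecting the backlink sets of the known answers. A minimal sketch, assuming a hypothetical in-memory link table; every page name other than www.honda.com and www.toyota.com is made up for illustration, and the helper names are not from the talk:

```python
from collections import defaultdict

def build_backlinks(links):
    """Invert (source, target) pairs into target -> set of sources."""
    backlinks = defaultdict(set)
    for source, target in links:
        backlinks[target].add(source)
    return backlinks

def pages_linking_to_all(backlinks, examples):
    """Return the pages that link to every example URL."""
    sets = [backlinks.get(url, set()) for url in examples]
    return set.intersection(*sets) if sets else set()

# Hypothetical link data, for illustration only.
links = [
    ("carguide.example/makers", "www.honda.com"),
    ("carguide.example/makers", "www.toyota.com"),
    ("bikes.example/honda", "www.honda.com"),
    ("news.example/toyota", "www.toyota.com"),
]
backlinks = build_backlinks(links)
co_citers = pages_linking_to_all(backlinks, ["www.honda.com", "www.toyota.com"])
print(sorted(co_citers))
```

Pages that cite all the examples are candidate hubs for the topic; their other out-links are candidate answers.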
The backlink architecture
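[Figure: client C, viewing http://S1/P1 on server S1, follows a link to http://S2/P2. The request sent to server S2 is
GET /P2 HTTP/1.0
Referer: http://S1/P1
S2 keeps a local backlink database. Another client C’ asks S2: who points to S2/P2?]
www.cs.berkeley.edu/~soumen/doc/www99back/userstudy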
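A minimal sketch of the server-side bookkeeping the figure suggests, assuming each server simply records the Referer header of incoming requests in a local table. The class and method names are hypothetical, and no real HTTP server is involved:

```python
from collections import defaultdict

class LocalBacklinkDatabase:
    """Per-server table mapping a local path to the URLs seen in Referer headers."""

    def __init__(self, host):
        self.host = host
        self._referrers = defaultdict(set)

    def record_request(self, path, headers):
        """Call once per incoming GET; remembers the Referer, if any."""
        referer = headers.get("Referer")
        if referer:
            self._referrers[path].add(referer)

    def who_points_to(self, path):
        """Answer the 'who points to host/path?' query from local data only."""
        return sorted(self._referrers.get(path, set()))

# Client C follows the link on http://S1/P1 to http://S2/P2.
s2 = LocalBacklinkDatabase("S2")
s2.record_request("/P2", {"Referer": "http://S1/P1"})
s2.record_request("/P2", {})  # a request typed in directly carries no Referer

# Client C' later asks S2 for the backlinks of /P2.
print(s2.who_points_to("/P2"))
```

Each server stores only the links that point at its own pages, which is what keeps the extra storage per server small.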
Backlink rationale
Centralized backlink service does not scale
Limited additional storage per server
Turn hyperlinks into undirected edges
A series of forward and backward ‘clicks’ can quickly build a topical community
Can be used to boot-strap the focused crawler
Backlink example 1
Backlink example 2
Backlink example 3
Backlink example 4
Estimating popularity
Extensive research on social network theory
  Wasserman and Faust
Hyperlink based
  Large in-degree indicates popularity/authority
  Not all votes are worth the same
Several similar ideas and refinements
  Google (Page and Brin) and HITS (Kleinberg)
  Resource compilation (Chakrabarti et al.)
  Topic distillation (Bharat and Henzinger)
Topic distillation overview
Given web graph and query
Search engine selects sub-graph
Expansion, pruning and edge weights
Nodes iteratively transfer authority to cited neighbors
[Figure: a query is sent to a search engine, which selects a subgraph of the Web]
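The iterative transfer of authority can be sketched as a hub/authority iteration in the style of HITS. This is a minimal illustration on a hypothetical edge list; it leaves out the expansion, pruning and edge-weight steps mentioned above, and the function name is an assumption:

```python
import math

def hits(edges, iterations=50):
    """Iterate hub/authority scores on a directed graph given as (u, v) pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of v: sum of hub scores of the pages citing v.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub of u: sum of authority scores of the pages u cites.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical subgraph: two hubs co-cite a1 and a2; a3 has a single citation.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a3")]
hub, auth = hits(edges)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

Pages cited together by several hubs reinforce each other, so a1 and a2 end up with the highest authority scores while the isolated a3 fades.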
Preliminary distillation-based approach
Design a keyword query to represent topics of focus
Using a large web crawl, run topic distillation on the query
Refine query by inspecting results and by trial and error
Problems with preliminary approach
Unreliability of keyword match
  Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  Narrow, arbitrary view of relevant subgraph
  Topic model does not improve over time
Dependence on large web crawl and index (lack of “output sensitivity”)
Difficulty of query construction
Output sensitivity
Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
Ideally effort should scale with size of the result
Time spent crawling and indexing sites unrelated to the topic is wasted
Likewise, time that does not improve comprehensiveness is wasted
Query construction
+“power suppl*”
“switch* mode” smps
-multiprocessor*
“uninterrupt* power suppl*” ups
-parcel*
/Companies/Electronics/Power_Supply
Query complexity
Complex queries needed for distillation
Typical Alta Vista queries are much simpler (Silverstein, Henzinger, Marais and Moricz)
Forcing a hub or authority helps 86% of the time
[Chart: average number of words and operators per query. AltaVista: 2.35 words, 0.41 operators. Distillation: 7.03 words, 4.34 operators]
Proposed solution
Resource discovery system that can be customized to crawl for any topic by giving examples
Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality
Crawler has guidance hooks controlled by these two scores
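One way to picture the guidance hooks is a crawl frontier ordered by the relevance of the page on which each link was found. The sketch below assumes a toy in-memory web and a stand-in relevance function; `crawl`, `fetch_links` and `relevance` are hypothetical names, and the centrality score is left out for brevity:

```python
import heapq

def crawl(seeds, fetch_links, relevance, budget):
    """Visit up to `budget` pages, always expanding the most promising link first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(url)
        for out in fetch_links(url):
            if out not in seen:
                seen.add(out)
                # An unvisited link inherits the relevance of its parent page.
                heapq.heappush(frontier, (-score, out))
    return visited

# Toy web and relevance scores, for illustration only.
web = {
    "seed": ["bike-club", "stock-tips"],
    "bike-club": ["bike-shop"],
    "stock-tips": ["mutual-funds"],
    "bike-shop": [],
    "mutual-funds": [],
}
scores = {"seed": 0.9, "bike-club": 0.8, "stock-tips": 0.1,
          "bike-shop": 0.7, "mutual-funds": 0.05}
order = crawl(["seed"], lambda u: web.get(u, []), lambda u: scores.get(u, 0.0), budget=4)
print(order)
```

With a budget of four pages, the link found on the low-relevance page is never fetched, which is the intended effect of steering the crawl by relevance.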
Administration scenario
[Screenshot: Taxonomy Editor with panes for Current Examples and Suggested Additional Examples; annotation: Drag]
Relevance
[Figure: taxonomy tree with nodes All, Arts, Bus&Econ, Recreation, Companies, Cycling, Bike Shops, Mt. Biking, Clubs; legend: path nodes, good nodes, subsumed nodes]
$$\Pr[d \text{ is good}] = \sum_{\mathrm{good}(c)} \Pr[c \mid d]$$
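Reading the slide's formula as the sum of Pr[c | d] over the classes c marked good, the relevance of a page is the posterior mass the classifier puts on those classes. A minimal sketch; the posterior values and the choice of good classes are hypothetical:

```python
def relevance(posterior, good_classes):
    """Pr[d is good]: total posterior mass on the classes marked good."""
    return sum(p for c, p in posterior.items() if c in good_classes)

# Hypothetical classifier output Pr[c | d] over leaf classes for one page.
posterior = {"Mt.Biking": 0.55, "Clubs": 0.25, "Bike Shops": 0.15, "Arts": 0.05}
good = {"Mt.Biking", "Clubs"}
print(round(relevance(posterior, good), 2))
```

The sum is only a probability if no good class is an ancestor of another, which this sketch assumes.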
Classification
How relevant is a document w.r.t. a class?
  Supervised learning, filtering, classification, categorization
Many types of classifiers
  Bayesian, nearest neighbor, rule-based
Hypertext
  Both text and links are class-dependent clues
  How to model link-based features?
The “bag-of-words” document model
Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
Each c has parameters $\theta(c,t)$ for terms t
  Coin with face probabilities $\sum_t \theta(c,t) = 1$
Fix document length and keep tossing coin
Given c, probability of document is
$$\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
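The bag-of-words model can be evaluated in log space: the multinomial coefficient over the term counts n(d,t), times each term probability raised to its count. A minimal sketch with hypothetical parameters over a three-word vocabulary; smoothing for unseen terms is omitted, so every token must appear in the class's parameter table:

```python
import math
from collections import Counter

def log_multinomial_coefficient(counts):
    """log of n(d)! / prod_t n(d,t)!, computed with lgamma for stability."""
    n = sum(counts.values())
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts.values())

def log_prob_doc_given_class(tokens, theta_c):
    """log Pr[d | c] under the bag-of-words model with term probabilities theta_c."""
    counts = Counter(tokens)
    log_p = log_multinomial_coefficient(counts)
    for term, k in counts.items():
        log_p += k * math.log(theta_c[term])
    return log_p

def classify(tokens, prior, theta):
    """Pick the class maximizing log prior(c) + log Pr[d | c]."""
    return max(
        prior,
        key=lambda c: math.log(prior[c]) + log_prob_doc_given_class(tokens, theta[c]),
    )

# Hypothetical parameters, for illustration only.
prior = {"cycling": 0.5, "finance": 0.5}
theta = {
    "cycling": {"bike": 0.6, "fund": 0.1, "ride": 0.3},
    "finance": {"bike": 0.1, "fund": 0.8, "ride": 0.1},
}
doc = ["bike", "ride", "bike"]
print(classify(doc, prior, theta))
```

The multinomial coefficient is the same for every class, so it does not affect which class wins; it is kept here so the function matches the formula term by term.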
Exploiting link features
c=class, t=text, N=neighbors
Text-only model: Pr[t|c]
Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
Better model: Pr[t, c(N) | c]
Non-linear relaxation
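A simplified stand-in for the relaxation idea: start from text-only class estimates, then repeatedly re-estimate each page's class from its own text evidence and its neighbors' current estimates. The compatibility table, the independence assumption across neighbors, and all names below are assumptions made for illustration, not the method from the talk:

```python
def relax(text_probs, neighbors, compat, rounds=10):
    """Re-estimate class probabilities from text evidence and neighbors' estimates."""
    classes = list(compat)
    current = {n: dict(p) for n, p in text_probs.items()}
    for _ in range(rounds):
        updated = {}
        for node, p_text in text_probs.items():
            scores = {}
            for c in classes:
                score = p_text[c]
                for nb in neighbors.get(node, []):
                    # Expected compatibility of class c with the neighbor's current estimate.
                    score *= sum(compat[c][cn] * current[nb][cn] for cn in classes)
                scores[c] = score
            total = sum(scores.values())
            updated[node] = {c: s / total for c, s in scores.items()}
        current = updated
    return current

# Hypothetical data: page x has ambiguous text but two clearly on-topic neighbors.
text_probs = {
    "a": {"cycling": 0.9, "finance": 0.1},
    "b": {"cycling": 0.9, "finance": 0.1},
    "x": {"cycling": 0.5, "finance": 0.5},
}
neighbors = {"x": ["a", "b"], "a": ["x"], "b": ["x"]}
compat = {
    "cycling": {"cycling": 0.8, "finance": 0.2},
    "finance": {"cycling": 0.2, "finance": 0.8},
}
result = relax(text_probs, neighbors, compat)
print(round(result["x"]["cycling"], 3))
```

Page x ends up classified as cycling even though its own text is uninformative, because its neighbors' classes carry the signal.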
Improvement using link features
9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
‘Forget’ fraction of neighbors’ classes
[Chart: %Error (0 to 40) vs. %Neighborhood known (0 to 100); series: Text, Link, Text+Link]
Putting it together
[Architecture diagram; components: Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller; one arrow is labeled Feedback]
Monitoring the crawler
[Screenshot: relevance plotted against time; each point is one URL, with a moving-average curve]
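The moving-average curve can be computed incrementally as pages arrive. A minimal sketch; the window size and the per-URL scores are hypothetical:

```python
from collections import deque

def moving_average(values, window):
    """Yield the mean of the last `window` values after each new value arrives."""
    recent = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(recent) == window:
            total -= recent[0]  # the oldest value is about to be evicted
        recent.append(v)
        total += v
        yield total / len(recent)

# Hypothetical per-URL relevance scores in fetch order.
relevance_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
smoothed = list(moving_average(relevance_scores, window=2))
print(smoothed)
```

A sustained drop in the smoothed curve is the signal an operator would watch for when the crawl drifts off topic.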
Measures of success
Harvest rate
  What fraction of crawled pages are relevant
Robustness across seed sets
  Separate crawls with random disjoint samples
  Measure overlap in URLs and servers crawled
  Measure agreement in best-rated resources
Evidence of non-trivial work
  #Links from start set to the best resources
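The first two measures are simple to compute once the crawl logs are available. A minimal sketch, assuming a relevance threshold of 0.5 to decide what counts as relevant and hypothetical crawl results; the function names are illustrative:

```python
from urllib.parse import urlsplit

def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance is at least `threshold`."""
    if not relevance_scores:
        return 0.0
    return sum(1 for r in relevance_scores if r >= threshold) / len(relevance_scores)

def overlap(crawl_a, crawl_b):
    """Fraction of crawl_a's items that also appear in crawl_b."""
    a, b = set(crawl_a), set(crawl_b)
    return len(a & b) / len(a) if a else 0.0

def servers(urls):
    """Reduce URLs to their host names."""
    return {urlsplit(u).netloc for u in urls}

# Hypothetical results of two crawls started from disjoint seed sets.
crawl1 = ["http://a.example/1", "http://a.example/2", "http://b.example/x"]
crawl2 = ["http://a.example/2", "http://b.example/y", "http://c.example/z"]
print(harvest_rate([0.9, 0.2, 0.7, 0.6]))
print(overlap(crawl1, crawl2))
print(overlap(servers(crawl1), servers(crawl2)))
```

Server overlap is typically higher than URL overlap, since two crawls can reach the same site through different pages.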
Harvest rate
[Chart: Harvest Rate (Cycling, Unfocused). Average relevance (0 to 1) vs. #URLs fetched (0 to 10000); series: Avg over 100]
[Chart: Harvest Rate (Cycling, Soft Focus). Average relevance (0 to 1) vs. #URLs fetched (0 to 6000); series: Avg over 100, Avg over 1000]
Crawl robustness
[Chart: Crawl Robustness (Cycling). URL overlap (0 to 0.9) vs. #URLs crawled (0 to 3000)]
[Chart: Crawl Robustness (Cycling). Server overlap (0 to 1) vs. #URLs crawled (0 to 3000); series: Overlap1, Overlap2]
[Diagram: URL overlap and server overlap between Crawl 1 and Crawl 2]
Robustness of resource discovery
Sample disjoint sets of starting URL’s
Two separate crawls
Find best authorities
Order by rank
Find overlap in the top-rated resources
[Chart: Resource Robustness (Cycling). Server overlap (0 to 1) vs. #Top resources (0 to 25); series: Overlap1, Overlap2]
Distance to best resources
[Chart: Resource Distance (Mutual Funds). #Servers in top 100 (0 to 35) vs. min. distance from crawl seed in #links (1 to 15)]
[Chart: Resource Distance (Cycling). #Servers in top 100 (0 to 18) vs. min. distance from crawl seed in #links (1 to 12)]
Cycling: cooperative
Mutual funds: competitive
Observations
Random walk on the Web “rapidly mixes” topics
Yet, there are large coherent paths and clusters
Focused crawling gives topic distillation richer data to work on
Combining content with link structure eliminates the need to tune link-based heuristics
Related work
WebWatcher, HotList and ColdList
  Filtering as post-processing, not acquisition
ReferralWeb
  Social network on the Web
Ahoy!, Cora
  Hand-crafted to find home pages and papers
WebCrawler, Fish, Shark, Fetuccino, agents
  Crawler guided by query keyword matches
Comparison with agents
Agents usually look for keywords and hand-crafted patterns
Cannot learn new vocabulary dynamically
Do not use distance-2 centrality information
Client-side assistant
We use taxonomy with statistical topic models
Models can evolve as crawl proceeds
Combine relevance and centrality
Broader scope: inter-community linkage analysis and querying
Conclusion
New architecture for example-driven topic-specific web resource discovery
No dependence on full web crawl and index
Modest desktop hardware adequate
Variable radius goal-directed crawling
High harvest rate
High quality resources found far from keyword query response nodes