The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University

Post on 22-Dec-2015


Page 1

The Evolution of the Web and Implications for an Incremental Crawler

Junghoo Cho
Stanford University

Page 2

What is a Crawler?

[Diagram: crawler loop. Init loads the initial URLs into the to-visit URLs; the loop is get next URL, get page (from the web), extract URLs. State kept: to-visit URLs, visited URLs, web pages.]
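
The loop named in the diagram (init, get next URL, get page, extract URLs) could be sketched roughly as follows. This is a minimal illustration, not the crawler used in the talk: `crawl`, `max_pages`, and `TOY_WEB` are hypothetical names, and the fetch and link-extraction steps are passed in as functions so the sketch runs without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(initial_urls: Iterable[str],
          get_page: Callable[[str], str],
          extract_urls: Callable[[str], List[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Run the basic crawl loop and return {url: page content}."""
    to_visit = deque(initial_urls)        # "to visit urls", seeded by init
    visited = set()                       # "visited urls"
    pages: Dict[str, str] = {}            # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()          # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)              # get page
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy in-memory "web" (an assumption for illustration): each page body
# is just a space-separated list of the URLs it links to.
TOY_WEB = {"a": "b c", "b": "c", "c": "a"}

toy_pages = crawl(["a"], lambda u: TOY_WEB.get(u, ""), lambda p: p.split())
```

A real crawler would replace the two lambdas with an HTTP fetch and an HTML link parser, and would add the load and scope controls discussed on the following slides.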

Page 3

Crawling Issues (1)

• Load at visited web sites
• Load at crawlers
• Scope of the crawl

Page 4

Crawling Issues (2)

• Typical crawler
  Periodic, Batch, Shadowing
• Incremental crawling
  Maintain pages “fresh”
  Avoid crawling from scratch

How do we crawl?

Page 5

Outline

• Web evolution experiments
• Freshness metrics
• Design issues and comparison

Page 6

Web Evolution Experiment

• How often does a web page change?
• What is the lifespan of a page?
• How long does it take for 50% of the web to change?

Page 7

Experimental Setup

• February 17 to June 24, 1999
• 270 sites visited (with permission)
  identified 400 sites with highest “page rank”
  contacted administrators
• 720,000 pages collected
  3,000 pages from each site daily
  start at root, visit breadth first (get new & old pages)
  ran only 9pm - 6am, 10 seconds between site requests
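
The per-site rules on this slide (breadth first from the root, 3,000 pages a day, 9pm to 6am only, 10 seconds between requests) could be encoded along these lines. This is only a sketch under assumptions: the function and constant names are hypothetical, and the site is modelled as an in-memory link map rather than fetched over HTTP.

```python
from collections import deque
from datetime import time
from typing import Dict, List

PAGES_PER_SITE = 3000        # daily cap per site, from the slide
REQUEST_DELAY_SECONDS = 10   # pause between requests to one site, from the slide


def in_crawl_window(now: time) -> bool:
    """True between 9pm and 6am, the window stated on the slide."""
    return now >= time(21, 0) or now < time(6, 0)


def breadth_first_order(root: str, links: Dict[str, List[str]],
                        limit: int = PAGES_PER_SITE) -> List[str]:
    """Return up to `limit` pages of one site in breadth-first order from root."""
    order: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)
        for child in links.get(page, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order
```

A driver loop would check `in_crawl_window` before each request and sleep `REQUEST_DELAY_SECONDS` between requests to the same site. Restarting from the root every day is what lets the crawl pick up both new and old pages.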

Page 8

How Often Does a Page Change?

Example: 50 visits to a page, 5 changes
average change interval = 50/5 = 10 days

Is this correct?

[Diagram: timeline marking page changes and page visits, with visits 1 day apart.]
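
The 50/5 estimate can be reproduced from one checksum per visit, as in the sketch below (the function name and the checksum representation are assumptions made for illustration). One likely reason the slide asks whether the estimate is correct: with one visit per day, several changes between two visits are counted as a single change, so the estimate can overstate the true interval for fast-changing pages.

```python
from typing import Sequence


def average_change_interval(checksums: Sequence[str],
                            days_between_visits: float = 1.0) -> float:
    """Estimate the mean change interval (in days) from one checksum per visit."""
    # A change is detected whenever two consecutive visits see different content.
    detected = sum(1 for prev, cur in zip(checksums, checksums[1:]) if prev != cur)
    if detected == 0:
        raise ValueError("no change detected during the observation period")
    # Same arithmetic as the slide: number of visits divided by detected changes.
    return len(checksums) * days_between_visits / detected


# 50 daily visits in which the content differs from the previous visit 5 times.
snapshots = [str(i // 9) for i in range(50)]
estimate = average_change_interval(snapshots)
```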

Page 9

Average Change Interval

[Plot: fraction of pages by average change interval.]

Page 10

Average Change Interval, by Domain

[Plot: fraction of pages by average change interval, broken down by domain.]

Page 11

How Long Does a Page Live?

[Diagram: four cases relating a page's lifetime to the experiment duration.]

Page 12

Page Lifespans

[Plot: fraction of pages by page lifespan.]

Page 13

Page Lifespans

Method 1 used

[Plot: fraction of pages by page lifespan.]

Page 14

Time for a 50% Change

[Plot: fraction of unchanged pages (y-axis) against days (x-axis).]
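
The curve on this slide (fraction of unchanged pages against days) can be computed from the first day on which each page was seen to change. The sketch below assumes that representation; the function names and the use of `None` for "never changed during the experiment" are illustrative choices, not taken from the talk.

```python
from typing import Optional, Sequence


def unchanged_fraction(first_change_day: Sequence[Optional[int]], day: int) -> float:
    """Fraction of pages with no detected change up to and including `day`."""
    unchanged = sum(1 for d in first_change_day if d is None or d > day)
    return unchanged / len(first_change_day)


def days_until_half_changed(first_change_day: Sequence[Optional[int]],
                            horizon: int) -> Optional[int]:
    """First day on which at most half the pages are still unchanged, or None."""
    for day in range(horizon + 1):
        if unchanged_fraction(first_change_day, day) <= 0.5:
            return day
    return None
```

Returning `None` when the threshold is not reached within `horizon` avoids extrapolating beyond the observation period.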

Page 15

Change Metrics

Freshness [SIGMOD 2000]

Freshness of element e_i at time t is

\[
F(e_i; t) =
\begin{cases}
1 & \text{if } e_i \text{ is up-to-date at time } t \\
0 & \text{otherwise}
\end{cases}
\]

[Diagram: elements e_i on the web and their copies e_i in the database.]

Freshness of the database S at time t is

\[
F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)
\]
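
As a small worked example of the freshness definition, the sketch below compares a local database against the live web at one instant. Representing both as dictionaries keyed by URL is an assumption made for illustration, as is the function name.

```python
from typing import Dict


def database_freshness(web: Dict[str, str], database: Dict[str, str]) -> float:
    """F(S; t): fraction of database elements identical to their live version."""
    if not database:
        raise ValueError("empty database")
    # F(e_i; t) is 1 when the stored copy equals the live page, 0 otherwise.
    fresh = sum(1 for key, copy in database.items() if web.get(key) == copy)
    return fresh / len(database)


live_web = {"a": "v2", "b": "v1", "c": "v3", "d": "v1"}
local_db = {"a": "v1", "b": "v1", "c": "v3", "d": "v1"}
freshness_now = database_freshness(live_web, local_db)  # 3 of 4 copies are current
```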

Page 16

Change Metrics

Age [SIGMOD 2000]

Age of element e_i at time t is

\[
A(e_i; t) =
\begin{cases}
0 & \text{if } e_i \text{ is up-to-date at time } t \\
t - (\text{modification time of } e_i) & \text{otherwise}
\end{cases}
\]

[Diagram: elements e_i on the web and their copies e_i in the database.]

Age of the database S at time t is

\[
A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)
\]
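
The age definition can be worked through the same way. In this sketch each element is described only by the modification time of its live page, with `None` standing for "the stored copy is up-to-date"; that encoding and the function name are assumptions for illustration.

```python
from typing import Optional, Sequence


def database_age(t: float, modification_times: Sequence[Optional[float]]) -> float:
    """A(S; t): mean age of the elements, given each one's modification time."""
    if not modification_times:
        raise ValueError("empty database")
    # A(e_i; t) is 0 for an up-to-date copy, else t minus the modification time.
    ages = [0.0 if m is None else t - m for m in modification_times]
    return sum(ages) / len(ages)


# At t = 10 days: two copies are current, two went stale on days 4 and 8.
age_now = database_age(10.0, [None, 4.0, 8.0, None])
```

Unlike freshness, age keeps growing for as long as a stale copy is left unrefreshed, so it penalises long-neglected pages more heavily.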

Page 17

Crawler Types

• In-place vs. shadow
• Steady vs. batch

[Diagram: elements e_i on the web and in the database, with a separate shadow database; a timeline shows periods with the crawler on and the crawler off.]
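
One way to picture the in-place versus shadow distinction is as two storage policies for crawled pages. The sketch below is an interpretation, not code from the talk: the class and method names are hypothetical, and it assumes that shadowing means new pages are held aside and swapped in only when a crawl finishes.

```python
from typing import Dict


class InPlaceCollection:
    """Each crawled page replaces the live copy immediately."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.current = dict(pages)

    def store(self, url: str, content: str) -> None:
        self.current[url] = content

    def end_of_crawl(self) -> None:
        pass  # nothing to do: users already see the new pages


class ShadowCollection:
    """Crawled pages go to a shadow copy that is swapped in after the crawl."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.current = dict(pages)
        self.shadow: Dict[str, str] = {}

    def store(self, url: str, content: str) -> None:
        self.shadow[url] = content  # users still see the old collection

    def end_of_crawl(self) -> None:
        self.current = self.shadow  # swap: the shadow becomes the live collection
        self.shadow = {}
```

Steady versus batch is an orthogonal choice: it concerns when `store` is called (continuously, or only while the crawler is on), not where the pages are written.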

Page 18

Comparison: Batch vs. Steady

[Plots: batch-mode in-place crawler (with the periods when the crawler is running marked) and steady in-place crawler.]

Page 19

Shadowing Steady Crawler

[Plots: crawler's collection and current collection, compared with the curve without shadowing.]

Page 20

Shadowing Batch Crawler

[Plots: crawler's collection and current collection, compared with the curve without shadowing.]

Page 21

Experimental Data: Freshness

             Steady   Batch
In-Place     0.88     0.88
Shadowing    0.77     0.86

• Pages change on average every 4 months
• Batch crawler works one week out of 4

[Overlay values on the slide: 1, 2, 0.63, 0.50]
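
The 0.88 figure for the steady in-place crawler is consistent with a standard Poisson model of page change. The slide does not state the model or the crawl cycle, so both are assumptions here: pages change as a Poisson process with a mean interval of 4 months, and every page is revisited once a month. Under those assumptions the time-averaged freshness of a page is (1 - e^(-x)) / x, where x is the crawl cycle divided by the mean change interval.

```python
import math


def steady_in_place_freshness(change_interval: float, crawl_cycle: float) -> float:
    """Time-averaged freshness under a Poisson change model (an assumption).

    Both arguments must use the same time unit, e.g. months.
    """
    x = crawl_cycle / change_interval  # expected number of changes per cycle
    return (1.0 - math.exp(-x)) / x


# Mean change interval of 4 months (from the slide), assumed 1-month crawl cycle.
estimated_freshness = steady_in_place_freshness(change_interval=4.0, crawl_cycle=1.0)
```

This sketch covers only the steady in-place case; the shadowing and batch figures depend on details the slide does not give.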

Page 22

Uniform vs. Variable

            Freshness   Age
Uniform     0.57        5.6 days
Variable    0.62        4.3 days

In-place, steady crawler; based on our experimental data
[Pages change at different frequencies, as measured in experiment.]

[SIGMOD 2000]

Page 23

Summary

• Steady
• In-place
• Variable visit frequencies
  Improvement depends on how the web changes

improves freshness!

Page 24

The End

• The paper proposes an architecture
• Thank you for your attention