synchronizing a database to improve freshness junghoo cho hector garcia-molina stanford university
Post on 19-Dec-2015
220 views
TRANSCRIPT
Synchronizing a DatabaseTo Improve Freshness
Junghoo ChoHector Garcia-Molina
Stanford University
2
Application– Web search engines/crawlers
– Data warehouse
. . .
Problem
Polling
Remote database Local database
QueryUpdate
3
Challenge: How to maintain pages “fresh?”
How does the web change over time?– Web evolution experiment
What does fresh page/database mean?– Change metrics
How can we increase “freshness”?– Crawl policy
4
Web Evolution Experiment
How often does a web page change? How do we model web changes? What is the lifespan of a page? How long does it take for 50% of the web change?
5
Experimental Setup
February 17 to June 24, 1999 270 sites visited (with permission)
– identified 400 sites with highest “page rank”
– contacted administrators
720,000 pages collected– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
6
How Often Does a Page Change?
Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days
Is this correct?
1 day
changes
page visited
7
Average Change Intervalfr
actio
n of
pag
es
8
Modeling Web Evolution
Poisson process with rate T is time to next event fT(t) = e-t (t > 0)
9
Change Interval of Pages
for pages thatchange every
10 days on average
interval in days
frac
tion
of c
hang
esw
ith g
iven
inte
rval
Poisson model
10
Change Metrics
Freshness– Freshness of page ei at time t is
F( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise
eiei
......
web database
– Freshness of the database S at time t is
F( S ; t ) = F( ei ; t )N
1 N
i=1
11
Change Metrics
Age– Age of page ei at time t is
A( ei ; t ) = 0 if ei is up-to-date at time t t - (modification ei time) otherwise
eiei
......
web database
– Age of the database S at time t is
A( S ; t ) = A( ei ; t )N
1 N
i=1
12
Change Metrics
F(ei)
A(ei)
0
0
1
time
time
update refresh
F( S ) = lim F(S ; t ) dtt1 t
0t
F( ei ) = lim F(ei ; t ) dtt1 t
0t
Time averages:
similar for age...
13
Refresh Order
Fixed order– Example: Explicit list of URLs to visit
Random Order– Example: Start from seed URLs
& follow links
Purely Random– Example: Refresh pages on demand, as requested by user
eiei
......
webdatabase
14
Freshness vs. Order
r = / f = average change frequency / average revisit frequency
15
Trick Question
Two page database
e1 changes daily
e2 changes once a week
Can visit pages once a week How should we visit pages?
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– ?
e1
e2
e1
e2
webdatabase
16
Proportional Often Not Good!
Visit fast changing e1 get 1/2 day of freshness
Visit slow changing e2 get 1/2 week of freshness
Visiting e2 is a better deal!
17
Selecting Optimal Refresh Frequency
• Analysis is complex• Shape of curve is the same in all cases• Holds for any distribution g( )
18
Optimal Refresh Frequency for Age
• Analysis is also complex• Shape of curve is the same in all cases• Holds for any distribution g( )
19
Comparing Policies
Freshness AgeProportional 0.12 400 daysUniform 0.57 5.6 daysOptimal 0.62 4.3 days
Based on Statistics from experimentand revisit frequency of every month
20
Summary
Maintaining the collection fresh:– Web evolution experiment
– Change metrics
– Optimal policy
Intuitive policy does not always perform well– Should be careful in deciding revisit policy
21
Future work
Weighted freshness model Non-Poisson process model Change frequency estimation