Modeling Web Content Dynamics

Brian Brewington (brew@dartmouth.edu)
George Cybenko (gvc@dartmouth.edu)

IMA February 2001

Observing changing information sources

An index of changing information sources must re-index items periodically to keep the index from becoming out-of-date.

What does it mean for an observer or index to be “up-to-date” or “current”?

Our work on the web has two parts:
– Estimation of change rates for a large sample of web pages
– Re-indexing speed requirements with respect to a formal definition of “up-to-date”


Your brain is good at this

Where is your visual attention directed when driving a car? Why?

Form state estimates; re-observe when uncertainty becomes too large


Ingredients

1. A formal definition of “up-to-dateness”

2. Data

3. Scheduling to optimize “up-to-dateness”

A meaning for “up to date”

An index entry is (α, β)-current if it is correct to within a grace period of time β, with probability at least α.

To be “β-current”: no alteration is allowed in the gray region for the index entry to be “β-current”.

[Figure: timeline (time axis) marking t0 (last observed), tn - β, tn (now), and t0 + T (next observed), with the grace period β and the gray region indicated.]
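The definition can be sketched as a small check. This is an illustrative sketch, not code from the talk: the name `is_beta_current` and its arguments are hypothetical, and it assumes the gray region runs from the last observation t0 up to tn - β, so that changes inside the final grace period are tolerated.

```python
def is_beta_current(change_times, t_last_observed, t_now, beta):
    """Return True if the index entry is beta-current at time t_now.

    Assumption (not spelled out on the slide): the entry is beta-current
    when the source has no change after the last observation and at or
    before t_now - beta. Changes inside the grace period are tolerated.
    """
    cutoff = t_now - beta
    return not any(t_last_observed < t <= cutoff for t in change_times)


# A change at t=5 falls inside the 2-unit grace period before t_now=6.
print(is_beta_current([5.0], t_last_observed=0.0, t_now=6.0, beta=2.0))  # True
# A change at t=3 falls before the cutoff t_now - beta = 4.
print(is_beta_current([3.0], t_last_observed=0.0, t_now=6.0, beta=2.0))  # False
```

An index is then (α, β)-current when this check succeeds for a randomly chosen entry with probability at least α.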

(α, β)-currency has meaning in many contexts

Any source has a spectrum of possibilities; here are some possible values (guesses):
– Newspaper: (0.9, 1 day)
– Television news: (0.95, 1 hour)
– Broker watching stocks: (0.95, 30 min)
– Air traffic controller: (0.95, 20 sec)
– Web search engine: (0.6, 1 day)
– An old web page’s links: (0.4, 70 days)

Collecting web page data

Our web page data comes from a web monitoring service.

The Informant runs periodic standing user queries against four search engines and monitors user-selected URLs. When new or updated results appear, users are notified via email.

We download ~100,000 pages per day for ~30,000 users.

See http://informant.dartmouth.edu


Sampling issues

Biased towards search engine results in the top 10 for users’ queries

No more than one observation of a page per day; pages are usually observed once every three days.

Queries and page checks are run only at night, so sample times are correlated.

Filesystem timestamps are available for about 65% of our observations.

Data in our collection

As of March 2000, we had observations of about 3 million web pages. Data in paper spans 7 months.

Each page is observed an average of 12 times, and the average time span of observation is 38 days.

Each observation includes:
– “Last-Modified” timestamps, when available
– Observation time (using remote server’s if possible)
– Document summary information
  » Number of bytes (“Content-Length”)
  » Number of images, tables, forms, lists, banner ads
  » 16-bit hash of text, hyperlinks, and image references
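One way to picture a single observation is as a small record. The sketch below is hypothetical: the class and field names are invented for illustration, the talk does not say which hash function was used (a CRC-32 folded to 16 bits stands in here), and keeping one hash per category rather than one combined hash is also an assumption.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


def hash16(content: str) -> int:
    """Stand-in 16-bit hash: the low 16 bits of a CRC-32 (an assumption)."""
    return zlib.crc32(content.encode("utf-8")) & 0xFFFF


@dataclass(frozen=True)
class PageObservation:
    """Hypothetical record for one observation of one page."""
    url: str
    observed_at: float               # observation time, seconds since epoch
    last_modified: Optional[float]   # "Last-Modified" value, None if absent
    content_length: int              # number of bytes
    num_images: int
    num_tables: int
    num_forms: int
    num_lists: int
    num_banner_ads: int
    text_hash: int                   # 16-bit hash of the page text
    links_hash: int                  # 16-bit hash of the hyperlinks
    image_refs_hash: int             # 16-bit hash of the image references


obs = PageObservation(
    url="http://example.com/", observed_at=953_000_000.0, last_modified=None,
    content_length=12_000, num_images=3, num_tables=1, num_forms=0,
    num_lists=2, num_banner_ads=1,
    text_hash=hash16("page text"), links_hash=hash16("a.html b.html"),
    image_refs_hash=hash16("logo.gif"),
)
print(0 <= obs.text_hash <= 0xFFFF)  # True
```

With only 16 bits, distinct contents can collide, so a matching hash suggests but does not prove that a page is unchanged.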

“Lifetimes” vs. “ages”

We can model objects as having independent, identically-distributed time periods between modifications. We call these “lifetimes.”

The “age” is the time since the present lifetime began.

By analogy, think of replacement parts, each with an independent lifetime length.

[Figure: age plotted against time (0 to 4), with each change marked and successive lifetimes labeled L1, L2, ...; the lifetimes shown are 1.53, 1.14, 0.62, and 0.84.]
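The relationship between lifetimes and ages can be sketched in a few lines. The function names are hypothetical, and the order of the four lifetime values taken from the figure is an assumption.

```python
import bisect
import itertools


def change_times(lifetimes, start=0.0):
    """Cumulative sums of the lifetimes give the times of the changes."""
    return [start + s for s in itertools.accumulate(lifetimes)]


def age_at(t, changes, start=0.0):
    """Time elapsed at t since the most recent change (or since start)."""
    i = bisect.bisect_right(changes, t)
    last_change = changes[i - 1] if i > 0 else start
    return t - last_change


lifetimes = [1.53, 1.14, 0.62, 0.84]   # values from the figure; order assumed
changes = change_times(lifetimes)
print(round(age_at(2.0, changes), 2))  # 2.0 - 1.53 = 0.47
```

The age resets to zero at every change and then grows linearly until the next one.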

Determining dynamics from the time data

Two ways to find the distribution of change rates:

1. Observe the time between successive modifications. (Lifetimes)
   Good: direct measurement of time between changes
   Bad: aliasing possible; needs repeat observations

2. Observe the time since the most recent modification. (Ages)
   Good: doesn’t have aliasing problems; works without having to make repeat observations
   Bad: requires that we accurately account for growth

Sampling the lifetime distribution

There are two problems with trying to sample the difference of successive change times:

1. Second observation (o) will miss two changes (x)

2. Observation window not big enough to see any changes (x)

[Figure: two timelines, with x = modification and o = observation, labeling the observation timespan, the actual lifetime, and the observed lifetime.]
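Both problems can be reproduced with a toy simulation. This is a sketch under one assumption: each observation reports the time of the most recent modification (as a “Last-Modified” timestamp would), and an observed lifetime is the gap between successive distinct timestamps. The function name is hypothetical.

```python
import bisect


def observed_lifetimes(changes, observations):
    """Lifetimes inferred from the modification timestamps seen at each check."""
    changes = sorted(changes)
    stamps = []
    for t_obs in sorted(observations):
        i = bisect.bisect_right(changes, t_obs)
        if i == 0:
            continue                      # no modification has happened yet
        stamp = changes[i - 1]            # most recent modification time
        if not stamps or stamp != stamps[-1]:
            stamps.append(stamp)
    return [b - a for a, b in zip(stamps, stamps[1:])]


# Problem 1: changes at t=2 and t=3 fall between the two observations,
# so four lifetimes of length 1 are reported as one lifetime of length 3.
print(observed_lifetimes([1.0, 2.0, 3.0, 4.0], [1.5, 4.5]))   # [3.0]

# Problem 2: the observation window (t=2..4) sees no change at all.
print(observed_lifetimes([1.0, 10.0], [2.0, 3.0, 4.0]))       # []
```

Sparse observations inflate the apparent lifetimes, while short windows drop long lifetimes from the sample entirely.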

Web page age CDF

[Figure: cumulative probability (0 to 1) versus age [days, log scale], with ticks at 1 day, 10 days, and 100 days.]

• Median age 120 days
• Upper 25% > 1 year
• Lowest 25% < 1 month

Empirical lifetime distribution

[Figure, two panels. Lifetime PDF: probability density (log scale, 10^-4 to 10^-2) versus lifetime [days] (0 to 600). Lifetime CDF: cumulative probability (0.2 to 1) versus lifetime [days] (log scale, 10^0 to 10^2).]

When do changes happen?

Change times, mod 24×7 hours, show that more changes happen during the span of US working hours (8AM to 8PM, EST).

[Figure: relative frequency (0 to 4×10^-3) versus time since Thursday 12:00 GMT [hours] (0 to 150), with the days labeled Weds afternoon, Thursday, Friday, Saturday, Sunday, Monday, Tuesday, Weds morning.]
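Folding change times into a single week can be sketched as follows. The function names and the one-hour bin width are assumptions. The sketch relies on the fact that the Unix epoch (1970-01-01 00:00 UTC) fell on a Thursday, so a 12-hour shift places the origin at Thursday 12:00 GMT.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR
# The Unix epoch was a Thursday at 00:00 UTC; shift the origin to 12:00.
ORIGIN_OFFSET = 12 * SECONDS_PER_HOUR


def hours_into_week(unix_seconds):
    """Hours elapsed since the most recent Thursday 12:00 GMT."""
    return ((unix_seconds - ORIGIN_OFFSET) % SECONDS_PER_WEEK) / SECONDS_PER_HOUR


def weekly_relative_frequency(change_timestamps):
    """Relative frequency of changes per one-hour bin of the week."""
    counts = Counter(int(hours_into_week(ts)) for ts in change_timestamps)
    total = sum(counts.values())
    return {hour: n / total for hour, n in sorted(counts.items())}


print(hours_into_week(12 * 3600))  # 0.0
```

Timestamps one week apart land in the same bin, which is what makes the daily and weekly pattern visible.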

Distribution of mean change times

The Weibull distribution, a generalized exponential, models mean lifetimes fairly well:

$$f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}}$$

where t is the mean lifetime. This can be used to find an age or lifetime CDF for any shape parameter σ and scale parameter δ. But for the age CDF, a growth model is needed, so age-based estimates can be inaccurate.
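A minimal sketch of the Weibull density and its CDF, written in the standard shape/scale form. The function names are hypothetical; the parameter values 1.4 and 152.2 are the ones quoted on the lifetime CDF slide and are used here only as an example.

```python
import math


def weibull_pdf(t, shape, scale):
    """Weibull density (shape/scale) * (t/scale)**(shape-1) * exp(-(t/scale)**shape)."""
    if t <= 0:
        return 0.0
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))


def weibull_cdf(t, shape, scale):
    """Probability that a Weibull variable is at most t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


shape, scale = 1.4, 152.2      # example values quoted later in the talk
median = scale * math.log(2) ** (1 / shape)
print(abs(weibull_cdf(median, shape, scale) - 0.5) < 1e-12)  # True
```

With shape equal to 1 the density reduces to an ordinary exponential, which is the sense in which the Weibull is a generalized exponential.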

Lifetime CDF: F (σ=1.4, δ=152.2)

[Figure: cumulative probability (0 to 1) versus lifetime [days] (log scale, 10^0 to 10^3), with Trial and Reference curves.]

$$f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}}$$

(α, β)-currency for Poisson source

A single source has Poisson changes at rate λ. If re-indexed every T time units, the expected probability of the index entry being β-current is:

$$\Pr[\beta\text{-current}] = \frac{\beta}{T} + \frac{1 - e^{-\lambda (T - \beta)}}{\lambda T} = \frac{\beta}{T} + \frac{1 - e^{-z(1 - \beta/T)}}{z}, \qquad z = \lambda T$$

[Figure: probability α versus expected changes per check period, λT (log scale, 10^-2 to 10^2), with curves for β/T = 0.9, 0.6, 0.25, and 0.0 and the 1/(λT) asymptote.]
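The expected probability can be evaluated directly. This sketch assumes the closed form β/T + (1 - e^{-λ(T-β)})/(λT), which follows from averaging over a time since the last check that is uniformly distributed on [0, T]; the function and argument names are hypothetical.

```python
import math


def prob_beta_current(rate, period, beta):
    """Expected probability that an entry is beta-current.

    rate   : Poisson change rate (lambda), changes per unit time
    period : re-indexing period T
    beta   : grace period
    """
    if beta >= period or rate == 0:
        return 1.0                      # nothing outside the grace period can go stale
    z = rate * period                   # expected changes per check period
    # -expm1(-x) computes 1 - exp(-x) accurately for small x.
    return beta / period - math.expm1(-rate * (period - beta)) / z


# One expected change per period and no grace period: 1 - 1/e.
print(round(prob_beta_current(rate=1.0, period=1.0, beta=0.0), 4))  # 0.6321
```

For a fast-changing source with no grace period the value approaches 1/(λT), matching the asymptote named in the figure.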

Probability of currency over a collection

Expected probability of a random index entry being β-current (given distribution f(t) of mean change times t):

$$\Pr[\beta\text{-current}] = \int_0^{\infty} f(t)\left[\frac{\beta}{T} + \frac{1 - e^{-(T-\beta)/t}}{T/t}\right] dt, \qquad f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}}$$

Here f(t) is the distribution of average lifetimes, and the bracketed term is the probability of being β-current given average lifetime t.
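The integral over the collection can be approximated numerically. The sketch below is an illustration, not the authors' computation: it assumes a Weibull density for the mean lifetimes, a Poisson source with rate 1/t for each mean lifetime t, and a plain midpoint rule truncated at 20 scale lengths. All names are hypothetical, and the parameters 1.4 and 152.2 are the values quoted on the lifetime CDF slide.

```python
import math


def weibull_pdf(t, shape, scale):
    """Density of mean lifetimes, assumed Weibull."""
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))


def prob_current_given_mean(t_mean, period, beta):
    """Probability of being beta-current for a source with mean lifetime t_mean."""
    if beta >= period:
        return 1.0
    # Rate is 1/t_mean, so 1/(rate * period) equals t_mean / period.
    return beta / period - math.expm1(-(period - beta) / t_mean) * t_mean / period


def collection_currency(period, beta, shape, scale, steps=20_000):
    """Midpoint-rule estimate of the expected probability over the collection."""
    upper = 20.0 * scale                 # the Weibull tail beyond this is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h                # midpoints avoid evaluating at t = 0
        total += weibull_pdf(t, shape, scale) * prob_current_given_mean(t, period, beta)
    return total * h


# Re-indexing every 18 days with a 1-week grace period (all times in days).
alpha = collection_currency(period=18.0, beta=7.0, shape=1.4, scale=152.2)
print(0.0 < alpha < 1.0)  # True
```

Sweeping `period` for a fixed `beta` and looking for the value where the result crosses a target such as 0.95 is one way to read off a re-indexing requirement.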

Index performance surface: α as a function of T, β/T

Surface formed by integrating out the rate dependence

Large period T implies α = β/T

Plane shown for α = 0.95 intersects the surface at a level set of (β, T) pairs

95% level set: (T, β) pairs

[Figure: grace period β [days] (log scale, 10^-1 to 10^2) versus re-indexing period T [days] (log scale, 10^1 to 10^2), with age-based and lifetime-based curves. Annotated values: β = 1 day, 1 week, 1 month, 1 year; T = 8.5, 11.5, 18, 23, 50, 59 days.]

Bandwidth needed for (0.95, 1-week) currency

For (0.95, 1 week) currency of this collection:
– Must re-index with period around 18 days.
– A (0.95, 1-week) index of the whole web (~800 million pages) processes about 50 megabits/sec.
– A more “modest” (0.95, 1-week) index of 150 million pages will process 9 megabits/sec.

For fixed-period checks, we can estimate processing speed requirements.
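The bandwidth figures follow from simple arithmetic: every page is fetched once per re-indexing period. The average page size is not stated on the slide; the 12,000 bytes used below is an assumption chosen because it roughly reproduces the quoted numbers, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def required_megabits_per_sec(num_pages, period_days, avg_page_bytes=12_000):
    """Sustained rate needed to fetch every page once per period.

    avg_page_bytes is an assumed value, not a figure from the slide.
    """
    total_bits = num_pages * avg_page_bytes * 8
    return total_bits / (period_days * SECONDS_PER_DAY) / 1e6


print(round(required_megabits_per_sec(800e6, 18), 1))  # 49.4
print(round(required_megabits_per_sec(150e6, 18), 1))  # 9.3
```

Halving the re-indexing period doubles the required rate, so the period derived from the currency target drives the cost directly.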

Empirical search engine currency

[Figure: curves for Google, Infoseek, AltaVista, and Northern Light; horizontal axis β [days] (log scale, 10^0 to 10^3), vertical axis from 0.4 to 1.]

A calculus for currency

If x is current and y is current, then (x, y) is max current.

Extend this to other atomic operations on information, e.g. composition.

Summary

About one in five pages has been modified within the last 12 days.

(0.95, 1-week) on our collection: must observe every 18 days.

Ideas: More specialty search engines? Distributed monitoring/remote update?

Other work: algorithms for scheduling observation based on source change rate and importance.


Mathematics of “Semantic Hacking”


Problem

Denial of Service Attacks Infrastructure

System attacks Systems

Semantic attacks Information

easy to detect

hard to detect

Distribution of information

“Gaussian” is expected.

Outliers

Collusion?

What makes a good mystery/thriller?

“Correct” conclusion

“Wrong” conclusion

A wrong conclusion can be reached by one large, detectable bad decision or a sequence of small, undetectably perturbed decisions.

Understand the whole sequence of decisions, not just one in isolation.


Ongoing research

Develop a model of such “semantic attacks”.

Develop a way to quantify such things.

Develop some tools for detecting/managing complex decision sequences.

Make information/decision systems more robust.

Acknowledgements

DARPA contract F30602-98-2-0107

DoD MURI (AFOSR contract F49620-97-1-03821)

NSF KDI Grant 9873138