no slide title · ppt file · web view · 1999-10-04empirical investigations of www surfing paths...
TRANSCRIPT
Empirical Investigations of WWW Surfing Paths
Jim PitkowUser Interface Research
Xerox Palo Alto Research Center
August 1999
Agenda
• A few characteristics of the Web and clicks• Aggregate click models
– user surfing behaviors– post-hoc hit prediction
• Individual click models– entropy– path prediction
August 1999
1
10
100
1000
10000
100000
1000000
10000000
Aug-92
Feb-93
Aug-93
Feb-94
Aug-94
Feb-95
Aug-95
Feb-96
Aug-96
Feb-97
Aug-97
Feb-98
Aug-98
Serv
ers
Source: World Wide Web Consortium, Mark Gray, Netcraft Server Survey
Web is big, Web is good
1 new server every 2 seconds7.5 new pages per second
August 1999
Users, sessions, and clicksCurrent Internet Universe Estimate 97.1 millionTime spent/month 7:28:16Number of unique sites visited/month 15Page views/month 313Number of sessions/month 16Page views/session 19Time spent/session 0:28:01Time spent/site 0:31:27Duration of a page viewed 0:01:28
Source Nielsen//NetRatings February 1999
August 1999
Popularity of pages—Zipf• Zipf Distribution:
– frequency is inversely proportionate to rank
• Zipf Law:– slope equals
minus one
August 1999
Enter Exit
Users enter a website at variouspages and begin surfing
Continuing surfers distribute themselves down various paths
Surfers arrive at pages having traveled different paths
After some number of page visits surfers leave the web site
(a)
(b)
(c)
(d)
p1
p3
p2
August 1999
Model of surfing
VL = VL-1 + L
• Where L is the number of clicks and is L varies as independent and identically distributed Gaussian random variables
• Surfing proceeds until the perceived cost is larger
than the discounted expected future value
August 1999
Random walk with a stopping threshold• Two parameter inverse Gaussian distribution
• mean (L) = and variance (L) = 3/
LL
LLP 2
3
3 2exp
2)(
August 1999
Experimental design• Client data
– Georgia Tech (3 weeks August, 1994)– Boston University (1995)
• tens of thousands of requests
• Proxy data– AOL (5 days in December, 1997)
• tens of millions of requests
• Server data– Xerox WWW Site (week during May 1997)
August 1999
0
4,000
8,000
12,000
16,000
20,000
0 5 10 15 20 25 30 35 40 45 50Clicks
Freq
uenc
yProbability distribution function
Clicks 1 click/site mode3-4 clicks/site median8-10 clicks/site mean
August 1999
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100Clicks
Prob
abili
ty
Experimentalinverse Gaussian
Cumulative distribution function
*** Experimental
— inverse Gaussian
75% of the distribution accounted for in three clicks
August 1999
Two observations
• Inverse Gaussian distribution has very long tail– expect to see large deviations from average
• Due to asymmetric nature of the Inverse Gaussian distribution, typical behavior does not equal average behavior
August 1999
An interesting derivation
• Up to a constant given by the third term, the probability of finding a group surfing at a given level scales inversely in proportion to its depth
2og
2)(og
23)(og 2
2
lL
LLlLPl
2/3)( LLP
August 1999
Number of surfers at each level
1
10
100
1000
1 10 100 1000
log(Clicks)
log(
Freq
uenc
y)
2/3)( LLP
August 1999
Implications of the Law of Surfing
• Implications on techniques designed to enhance performance– HTTP Keep-Alive, Pre-fetching of content
• Can adapt content based location on curve in a cost sensitive manner– different user modalities (browser, searcher, etc.)– expend different CPU resources for different users
• Web site visitation modeling
August 1999
Spreading Activation
Pump activation into source
Activation spreads through
the network
Activation settles into asymptotic pattern
August 1999
Matrix Formulation
0010000
123456
0 0 0 0 1 00 0 1 0 0 01 1 0 0 1 00 1 0 0 1 00 0 1 1 0 00 1 1 0 0 0
1 2 3 4 5 6
C R
..5
.74311
ASources Network Activation
Content + Usage + Topology Technique
A(t) = C + M A(t - 1)
M = (1 - ) I + RNetworksUser Paths
Text SimilarityTopology
August 1999
Application to hit prediction• Let fL be the fraction of users who, having surfed along
L-1 links, continue to surf to depth L.
• Define the activation value Ni,L as the number of users who are at node i after surfing through L clicks
Lkk
kiLLi NSfN ,,1,
),,1(1),,(1
LFLFfL
August 1999
Predicted versus observed
0
1
10
100
1,000
10,000
100,000
1 10 100 1,000 10,000 100,000
log(Observed Hits)
log(
Pred
icte
d Hi
ts)
August 1999
Proportion of surfers traversing link
Freq
uenc
y
Surfing probabilities
by outlinkdensity
No. Links, Li 2 (df = 6)0 < Li 3 25.213 < Li 6 26.446 < Li 9 172.99
9 < Li 12 59.7012 < Li 3666.30
August 1999
Investigating user paths
• Each user path can be thought of as an ngram and represented as tuples of the form <X1, X2, … Xn> to indicate sequences of page clicks– Distribution of ngrams is the Law of Surfing
• Determine the conditional probability of seeing the next page given a matching prior ngram
),...,|Pr( 1 knnnn XXxX
August 1999
Uncertainty and entropy
• Conditional probabilities are also know as as kth-order Markov approximations/models
• Entropy is the expected (average) uncertainty of the random variable measured in bits– minimal number of bits to encode information– uncertainty in the sequence of letters in languages
August 1999
Mathematics of entropy
Xx
Xx
xpxpxp
xp
XpEXH
)(log)()(
1log)(
)(1log)(
2
2
2
August 1999
Conditional entropy
nn Xx
knknnnnnknnn xXxXXHxpXXXH ),...,|()(),...|( 111
Xx Yy
yxpyxpYXH ),(log),(),( 2
),...,|()|()(),...,( 111211 nnn XXXHXXHXHXXH
Conditional probability
Chaining rule
Joint probabilities
August 1999
Entropy versus ngram length
01
234
567
89
1 2 3 4 5 6 7 8 9 10Order of transition
Bits
2345678910
NgramTotal Information Gain
0
1
2
3
4
5
6
7
8
9
2 3 4 5 6 7 8 9 10Path Length
Tota
l Bits
Gai
ned
August 1999
Path predictions
Pr(PPM) the probability that a penultimate path, <xn-1,…xn-k>, observed in the test data was matched by the same penultimate path in the training data
Pr(Hit|PPM) the probability that page xn is visited, given that <xn-1,…xn-k>, is the penultimate path and the highest probability conditional on that path is p(xn|xn-1,…xn-k )
Pr(Hit) = Pr(Hit|PPM)*Pr(PPM), the probability that the page visited in the test set is the one estimated from the training as the most likely to occur
August 1999
Pr(Path Match)
0.00
0.20
0.40
0.60
0.80
1.00
0 2 4 6 8 10Length of Path Used in Prediction
Prob
2345678910
August 1999
Pr(Hit|Path Match)
0.00
0.20
0.40
0.60
0.80
1.00
0 2 4 6 8 10
Length of Path Used in Prediction
Prob
2345678910
August 1999
Pr(Hit)
0.00
0.05
0.10
0.15
0.20
0.25
0 2 4 6 8 10Length of Path Used in Prediction
Prob
2345678910
August 1999
Principles of path prediction
• Paths are not completely modeled by 1st order Markov approximations
• Path Specificity Principle: longer paths contain more predictive power than lower order paths
• Complexity Reduction: keep models as simple and as small as possible
August 1999
Agenda revisited
• A few characteristics of the Web and clicks• Aggregate click models
– user surfing behaviors– post-hoc hit prediction
• Individual click models– entropy– path prediction
August 1999
Areas of further investigation
• Advance the state of predictive modeling– Move from post-hoc to a-priori prediction of user
interactions on the Web– Test hypothetical models of Web site usage
• Validate existing and new models on more representative data sets
• Understand new and emerging applications– streaming video and audio, mobile, etc.
August 1999
More information
http://www.parc.xerox.com/istl/projects/uir/projects/Webology.html