TRANSCRIPT
Google's PageRank Algorithm
Probability, linear algebra, and numerical analysis: the mathematics behind Google's™ PageRank™
Grady Wright
Department of Mathematics
Boise State University
A Google search
● Two-step process
1) Text processing
2) Ranking
● Information Retrieval score
● PageRank™ score: “Heart of Google software”
● Brin and Page (1998)
● Kleinberg: HITS
www.teoma.com
Outline
● Heuristic interpretation of PageRank
● PageRank as a random walk (surf)
● Linear algebra formulation
● Computing PageRank
● Example and tools
● Advanced topics
A tiny web example
● Pages of the web W: one, two, three, four, five, six
A tiny web example
● Pages of the web W as a directed graph:
[Figure: directed graph of W with nodes one, two, three, four, five, six]
● Interpretation of PageRank:
● A page is important if an important page has a link to it.
● “Democracy of the web”: a link from page A to page B is a vote from A to B.
● The web according to Google has about 4.43 billion pages (4,430,000,000) (estimated October 17, 2018 by http://www.worldwidewebsize.com)
PageRank: a random walk (or surf)
● Example:
[Figure: directed graph of W with nodes one, two, three, four, five, six]
● Infinitely dedicated random surfer
● Outlinks
● Dangling node
● Markov Chain
● Probabilistic interpretation of PageRank:
A webpage's PageRank is the probability that at any particular time, the infinitely dedicated random surfer is visiting that page.
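Linear algebra formulation
● Example:
[Figure: directed graph of W with nodes one, two, three, four, five, six]
● A directed graph can be represented using a connectivity matrix G.
\[
G = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix}
\]
Entries of G:
\[
g_{i,j} = \begin{cases} 1 & \text{if page } j \text{ has a link to page } i \\ 0 & \text{otherwise} \end{cases}
\qquad i,j = 1,2,\dots,n
\]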
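As a rough illustration of the definition above, the connectivity matrix can be assembled from a list of links. This is only a sketch in plain Python: the edge list `links` is not on the slide, it is read off the nonzero entries of G, and the variable names are chosen here for illustration.

```python
# Hypothetical edge list read off the connectivity matrix G above:
# (j, i) means "page j has a link to page i" (1-based page numbers).
links = [(1, 2), (1, 3), (3, 1), (3, 2), (3, 4),
         (4, 5), (4, 6), (5, 6), (6, 4), (6, 5)]
n = 6

# G[i][j] = 1 if page j+1 has a link to page i+1, and 0 otherwise.
G = [[0] * n for _ in range(n)]
for j, i in links:
    G[i - 1][j - 1] = 1

# Column sums count the outgoing links of each page.
out_links = [sum(G[i][j] for i in range(n)) for j in range(n)]

for row in G:
    print(row)
print("outgoing links per page:", out_links)
```

Page two has no outgoing links, so its column of G is all zeros; this is the dangling node of the example.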
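Linear algebra formulation
● Add the random surf info to G:
Directed graph of W:
[Figure: directed graph of W with nodes one, two, three, four, five, six]
Connectivity matrix:
\[
G = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix},
\quad
c = [\, 2 \; 0 \; 3 \; 2 \; 1 \; 2 \,]^T,
\quad
d = [\, 0 \; 1 \; 0 \; 0 \; 0 \; 0 \,]^T,
\quad
e = [\, 1 \; 1 \; 1 \; 1 \; 1 \; 1 \,]^T
\]
Transition probability matrix:
\[
A = \underbrace{\begin{bmatrix}
0 & 0 & 1/3 & 0 & 0 & 0 \\
1/2 & 0 & 1/3 & 0 & 0 & 0 \\
1/2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 1/2 & 0 & 1/2 \\
0 & 0 & 0 & 1/2 & 1 & 0
\end{bmatrix}}_{P}
+ \underbrace{\begin{bmatrix}
0 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 0 & 0 & 0 & 0
\end{bmatrix}}_{e d^T / 6}
= \begin{bmatrix}
0 & 1/6 & 1/3 & 0 & 0 & 0 \\
1/2 & 1/6 & 1/3 & 0 & 0 & 0 \\
1/2 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 1/3 & 0 & 0 & 1/2 \\
0 & 1/6 & 0 & 1/2 & 0 & 1/2 \\
0 & 1/6 & 0 & 1/2 & 1 & 0
\end{bmatrix}
\]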
Linear algebra formulation
● For a general web W:
Define:
\[
c_j = \sum_{i=1}^{n} g_{i,j}, \qquad j = 1,2,\dots,n
\]
(number of outgoing links from page j)
\[
p_{i,j} = \begin{cases} \dfrac{g_{i,j}}{c_j} & \text{if } c_j \neq 0 \\ 0 & \text{otherwise} \end{cases}
\qquad i,j = 1,2,\dots,n
\]
(probability of visiting page i based on a random choice from the links on page j)
\[
d_j = \begin{cases} 1 & \text{if } c_j = 0 \\ 0 & \text{otherwise} \end{cases}
\qquad j = 1,2,\dots,n
\]
(tracks dangling pages)
● Transition probability matrix A:
\[
A = P + \frac{1}{n} e d^T, \qquad \text{where } e = [\, 1 \; 1 \; \cdots \; 1 \,]^T
\]
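The definitions of c, P, d, and A translate almost line by line into code. The following is a minimal sketch in plain Python, using exact fractions so the entries can be compared with the 6-page example; the function name `transition_matrix` is an assumption made here, not something from the slides.

```python
from fractions import Fraction

def transition_matrix(G):
    """Return (P, d, A) for a 0/1 connectivity matrix G given as a list of rows."""
    n = len(G)
    # c[j] = number of outgoing links from page j (column sum of G).
    c = [sum(G[i][j] for i in range(n)) for j in range(n)]
    # d[j] = 1 marks a dangling page (no outgoing links).
    d = [1 if c[j] == 0 else 0 for j in range(n)]
    # P[i][j] = g_ij / c_j when c_j != 0, and 0 otherwise.
    P = [[Fraction(G[i][j], c[j]) if c[j] != 0 else Fraction(0)
          for j in range(n)] for i in range(n)]
    # A = P + (1/n) e d^T: each dangling column becomes a uniform jump.
    A = [[P[i][j] + Fraction(d[j], n) for j in range(n)] for i in range(n)]
    return P, d, A

G = [
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
P, d, A = transition_matrix(G)
column_sums = [sum(A[i][j] for i in range(len(A))) for j in range(len(A))]
print("d =", d)
print("column sums of A:", [str(s) for s in column_sums])
```

Every column of A sums to 1, which is what makes A a transition probability matrix for the random surfer.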
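Avoiding cycles around cliques
● Problem:
[Figure: directed graph of W with nodes one, two, three, four, five, six]
● Solution: random teleportation
● Modification to the transition probability matrix:
\[
A = \alpha \left( P + \frac{1}{n} e d^T \right) + \frac{1-\alpha}{n} e e^T, \qquad 0 < \alpha < 1
\]
(Google originally set α = 0.85)
● Matrix for above example:
\[
A = \alpha \begin{bmatrix}
0 & 1/6 & 1/3 & 0 & 0 & 0 \\
1/2 & 1/6 & 1/3 & 0 & 0 & 0 \\
1/2 & 1/6 & 0 & 0 & 0 & 0 \\
0 & 1/6 & 1/3 & 0 & 0 & 1/2 \\
0 & 1/6 & 0 & 1/2 & 0 & 1/2 \\
0 & 1/6 & 0 & 1/2 & 1 & 0
\end{bmatrix}
+ (1-\alpha) \begin{bmatrix}
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6
\end{bmatrix}
\]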
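To see what teleportation does to the entries, the modified matrix can be evaluated exactly for α = 0.85. This is a sketch in plain Python with exact fractions; `google_matrix` is a name chosen here for illustration, and P and d are typed in from the 6-page example.

```python
from fractions import Fraction as F

def google_matrix(P, d, alpha):
    """A = alpha * (P + e d^T / n) + (1 - alpha) * e e^T / n, entry by entry."""
    n = len(P)
    return [[alpha * (P[i][j] + F(d[j], n)) + (1 - alpha) * F(1, n)
             for j in range(n)] for i in range(n)]

# P and d for the 6-page example web.
P = [
    [0,       0, F(1, 3), 0,       0, 0],
    [F(1, 2), 0, F(1, 3), 0,       0, 0],
    [F(1, 2), 0, 0,       0,       0, 0],
    [0,       0, F(1, 3), 0,       0, F(1, 2)],
    [0,       0, 0,       F(1, 2), 0, F(1, 2)],
    [0,       0, 0,       F(1, 2), 1, 0],
]
d = [0, 1, 0, 0, 0, 0]

A = google_matrix(P, d, F(85, 100))   # alpha = 0.85
for row in A:
    print("  ".join(str(a) for a in row))
```

Every entry of the result is strictly positive and every column still sums to 1, so the surfer can reach any page from any other page and can no longer get trapped cycling among pages four, five, and six.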
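Importance of the transition prob. matrix
● Probability distribution vector $x$: $\sum_{j=1}^{n} x_j = 1$
$x_j$ = prob. the random surfer is currently visiting page j.
$(Ax)_j$ = prob. the random surfer will be visiting page j after leaving her current location.
● Example: α = 0.85
[Figure: directed graph of W with nodes one, two, three, four, five, six]
\[
\underbrace{\begin{bmatrix}
1/40 & 1/6 & 37/120 & 1/40 & 1/40 & 1/40 \\
9/20 & 1/6 & 37/120 & 1/40 & 1/40 & 1/40 \\
9/20 & 1/6 & 1/40 & 1/40 & 1/40 & 1/40 \\
1/40 & 1/6 & 37/120 & 1/40 & 1/40 & 9/20 \\
1/40 & 1/6 & 1/40 & 9/20 & 1/40 & 9/20 \\
1/40 & 1/6 & 1/40 & 9/20 & 7/8 & 1/40
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}}_{x}
= \begin{bmatrix} 0.025 \\ 0.45 \\ 0.45 \\ 0.025 \\ 0.025 \\ 0.025 \end{bmatrix}
\]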
Importance of the transition prob. matrix (continued)
● Example: α = 0.85, same matrix A, now applied to $x = [\, 0.025 \; 0.45 \; 0.45 \; 0.025 \; 0.025 \; 0.025 \,]^T$:
\[
A x = A \begin{bmatrix} 0.025 \\ 0.45 \\ 0.45 \\ 0.025 \\ 0.025 \\ 0.025 \end{bmatrix}
= \begin{bmatrix} 0.21625 \\ 0.226875 \\ 0.099375 \\ 0.226875 \\ 0.11 \\ 0.120625 \end{bmatrix}
\]
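The two matrix-vector products in this example can be reproduced with a few lines of code. This is a small sketch in plain Python; the helper name `step` is made up here, and exact fractions are used so that the entries of each result visibly sum to 1.

```python
from fractions import Fraction as F

# Transition probability matrix of the 6-page example with alpha = 0.85.
A = [
    [F(1, 40), F(1, 6), F(37, 120), F(1, 40), F(1, 40), F(1, 40)],
    [F(9, 20), F(1, 6), F(37, 120), F(1, 40), F(1, 40), F(1, 40)],
    [F(9, 20), F(1, 6), F(1, 40),   F(1, 40), F(1, 40), F(1, 40)],
    [F(1, 40), F(1, 6), F(37, 120), F(1, 40), F(1, 40), F(9, 20)],
    [F(1, 40), F(1, 6), F(1, 40),   F(9, 20), F(1, 40), F(9, 20)],
    [F(1, 40), F(1, 6), F(1, 40),   F(9, 20), F(7, 8),  F(1, 40)],
]

def step(A, x):
    """One move of the random surfer: return the vector A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

x0 = [F(1), F(0), F(0), F(0), F(0), F(0)]   # surfer starts on page one
x1 = step(A, x0)
x2 = step(A, x1)
print([float(v) for v in x1])
print([float(v) for v in x2])
print("sums:", sum(x1), sum(x2))
```

Starting on page one, one step spreads the probability mostly over pages two and three, which are the pages that page one links to.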
PageRank defined
● Page j's PageRank: jth entry of the PDV (probability distribution vector) $v$ satisfying
\[
v = Av, \qquad v = \text{stationary distribution vector of } A.
\]
● What is the mathematical name for $v$?
● Three concerns:
1. Existence
2. Uniqueness
3. Computation
Perron-Frobenius Theorem
Theorem: If A is an n-by-n matrix with positive entries then
1) One of its eigenvalues λ is positive and dominant.
2) There exists a unique (up to scaling) positive eigenvector corresponding to the dominant eigenvalue.
3) The dominant eigenvalue is simple.
Corollary: If the sum of each column of A equals 1 then λ=1 is the dominant eigenvalue.
Recall:
\[
A = \alpha \left( P + \frac{1}{n} e d^T \right) + \frac{1-\alpha}{n} e e^T, \qquad 0 < \alpha < 1
\]
The PageRank vector is the dominant eigenvector of A.
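Computing PageRank: Power Method
● All we need is the dominant eigenvector!
● An idea:
Define: eigenvalues/vectors of A
\[
\{1, \lambda_2, \lambda_3, \dots, \lambda_m\}, \qquad \{v, v_2, v_3, \dots, v_m\}, \qquad
1 > |\lambda_2| \geq |\lambda_3| \geq \cdots \geq |\lambda_m|
\]
Suppose:
\[
x^{(0)} = v + \beta_2 v_2 + \beta_3 v_3 + \cdots + \beta_m v_m, \qquad \sum_{j=1}^{n} x_j^{(0)} = 1
\]
Consider:
\[
\begin{aligned}
x^{(1)} &= A x^{(0)} \\
&= A v + \beta_2 A v_2 + \beta_3 A v_3 + \cdots + \beta_m A v_m \\
&= v + \beta_2 \lambda_2 v_2 + \beta_3 \lambda_3 v_3 + \cdots + \beta_m \lambda_m v_m \\
x^{(2)} &= A x^{(1)} = A^2 x^{(0)} \\
&= v + \beta_2 \lambda_2^2 v_2 + \beta_3 \lambda_3^2 v_3 + \cdots + \beta_m \lambda_m^2 v_m \\
&\;\;\vdots \\
x^{(k)} &= A x^{(k-1)} = A^k x^{(0)} \\
&= v + \beta_2 \lambda_2^k v_2 + \beta_3 \lambda_3^k v_3 + \cdots + \beta_m \lambda_m^k v_m
\end{aligned}
\]
Thus: since $|\lambda_j| < 1$ for $j \geq 2$, every term except $v$ decays, and $x^{(k)}$ converges to the PageRank vector: $x^{(k)} \to v$ as $k \to \infty$.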
Example of power method
● Example: α = 0.85
[Figure: directed graph of W with nodes one, two, three, four, five, six]
\[
A = \begin{bmatrix}
1/40 & 1/6 & 37/120 & 1/40 & 1/40 & 1/40 \\
9/20 & 1/6 & 37/120 & 1/40 & 1/40 & 1/40 \\
9/20 & 1/6 & 1/40 & 1/40 & 1/40 & 1/40 \\
1/40 & 1/6 & 37/120 & 1/40 & 1/40 & 9/20 \\
1/40 & 1/6 & 1/40 & 9/20 & 1/40 & 9/20 \\
1/40 & 1/6 & 1/40 & 9/20 & 7/8 & 1/40
\end{bmatrix}
\]
● Initial guess: $x^{(0)} = [\, 1 \; 1 \; \cdots \; 1 \,]^T / n$
● Iterates $x^{(k)} = A x^{(k-1)}$, with $\| x^{(k)} - x^{(k-1)} \|_\infty = \max_{1 \leq j \leq n} | x_j^{(k)} - x_j^{(k-1)} |$:

k    x_1          x_2          x_3          x_4          x_5          x_6          ‖x^(k) − x^(k−1)‖∞
1    0.095833333  0.16666667   0.11944444   0.16666667   0.19027778   0.26111111   9.4 ∙ 10^−2
2    0.082453704  0.12318287   0.089340278  0.19342593   0.23041667   0.28118056   4.3 ∙ 10^−2
3    0.067763985  0.10280681   0.077493731  0.18722657   0.24415866   0.32051109   3.9 ∙ 10^−2
4    0.061520855  0.090320549  0.068363992  0.19773807   0.25536944   0.32668709   1.3 ∙ 10^−2
5    0.057165209  0.083311572  0.063941774  0.19600722   0.26067610   0.33889812   1.2 ∙ 10^−2
6    0.054919309  0.079214523  0.061097686  0.19895101   0.26413724   0.34168023   4.1 ∙ 10^−3
7    0.053533069  0.076873775  0.059562764  0.19874717   0.26599033   0.34529289   3.6 ∙ 10^−3
⋮
34   0.051704746  0.073679264  0.057412413  0.19990381   0.26859608   0.34870368   (not shown)
35   0.051704746  0.073679263  0.057412413  0.19990381   0.26859608   0.34870368   1.0 ∙ 10^−9

● $x^{(35)}$ is taken as the PageRank vector.
Power method algorithm and efficiency
● Power method algorithm:
x^(0) = [ 1 1 ⋯ 1 ]^T / n
k = 1
δ = ∞
while δ > ϵ
    x^(k) = A x^(k−1)
    δ = ‖x^(k) − x^(k−1)‖∞
    k = k + 1
end

Operation            # FLOP
A x^(k−1)            2n^2 − n
x^(k) − x^(k−1)      n
Total                2n^2 = O(n^2)

Cocktail napkin computational cost analysis for Google
● n = 45 ∙ 10^9
● FLOP ≈ 4.05 ∙ 10^21
● Sunway TaihuLight: 125.435 ∙ 10^15 FLOP/sec
● 1 iteration: 9 hours
● 50 – 100 iterations: 19 – 37 days!
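The loop above can be written out directly. The sketch below is plain Python with dense lists of lists, so it is meant for the 6-page example only; the function name `power_method`, the default tolerance, and the iteration cap `max_iter` are assumptions added here as a safety net and are not part of the slide's algorithm.

```python
def power_method(A, tol=1e-9, max_iter=1000):
    """Iterate x <- A x from the uniform vector until the max-norm change is <= tol."""
    n = len(A)
    x = [1.0 / n] * n                      # x^(0) = [1 1 ... 1]^T / n
    for k in range(1, max_iter + 1):
        x_new = [sum(a * xj for a, xj in zip(row, x)) for row in A]
        delta = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if delta <= tol:
            return x, k
    raise RuntimeError("power method did not converge within max_iter iterations")

# Transition probability matrix of the 6-page example with alpha = 0.85.
A = [
    [1/40, 1/6, 37/120, 1/40, 1/40, 1/40],
    [9/20, 1/6, 37/120, 1/40, 1/40, 1/40],
    [9/20, 1/6, 1/40,   1/40, 1/40, 1/40],
    [1/40, 1/6, 37/120, 1/40, 1/40, 9/20],
    [1/40, 1/6, 1/40,   9/20, 1/40, 9/20],
    [1/40, 1/6, 1/40,   9/20, 7/8,  1/40],
]

v, iterations = power_method(A)
print("iterations:", iterations)
print("PageRank:", [round(p, 8) for p in v])
```

Each pass costs one dense matrix-vector product, which is the 2n^2 term in the FLOP table.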
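More efficient power method for PageRank
● Idea: exploit the structure of the transition probability matrix
Recall:
\[
A = \alpha \left( P + \frac{1}{n} e d^T \right) + \frac{1-\alpha}{n} e e^T, \qquad 0 < \alpha < 1
\]
Thus:
\[
x^{(k)} = A x^{(k-1)} = \alpha P x^{(k-1)} + \frac{\alpha}{n} e d^T x^{(k-1)} + \frac{1-\alpha}{n} e e^T x^{(k-1)}
\]
and, since $e^T x^{(k-1)} = 1$,
\[
x^{(k)} = \alpha P x^{(k-1)} + \frac{e}{n} \left( \alpha d^T x^{(k-1)} + 1 - \alpha \right)
\]
● Example:
\[
x^{(k)} = \alpha \begin{bmatrix}
0 & 0 & 1/3 & 0 & 0 & 0 \\
1/2 & 0 & 1/3 & 0 & 0 & 0 \\
1/2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 1/2 & 0 & 1/2 \\
0 & 0 & 0 & 1/2 & 1 & 0
\end{bmatrix} x^{(k-1)}
+ \frac{1}{6} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
\left( \alpha \, [\, 0 \; 1 \; 0 \; 0 \; 0 \; 0 \,] \, x^{(k-1)} + 1 - \alpha \right)
\]
Dense way: 47 FLOPs. Sparse way: 30 FLOPs.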
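One way to exploit this structure is to store only the links and never form A. The following plain-Python sketch keeps, for each page, the list of pages it links to (0-based indices); that adjacency-list layout, the function name `sparse_pagerank`, and the parameters `tol` and `max_iter` are choices made here for illustration and are not the author's implementation.

```python
def sparse_pagerank(out_links, alpha=0.85, tol=1e-9, max_iter=1000):
    """Power method using x <- alpha*P*x + (e/n)*(alpha*d^T x + 1 - alpha)."""
    n = len(out_links)
    x = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        # d^T x: total probability currently sitting on dangling pages.
        dangling_mass = sum(x[j] for j in range(n) if not out_links[j])
        # Scalar added to every page: (alpha * d^T x + 1 - alpha) / n.
        base = (alpha * dangling_mass + 1.0 - alpha) / n
        x_new = [base] * n
        # alpha * P x, touching only the nonzero entries of P.
        for j, targets in enumerate(out_links):
            if targets:
                share = alpha * x[j] / len(targets)
                for i in targets:
                    x_new[i] += share
        delta = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if delta <= tol:
            return x, k
    raise RuntimeError("did not converge within max_iter iterations")

# The 6-page example: out_links[j] lists the pages that page j+1 links to.
out_links = [[1, 2], [], [0, 1, 3], [4, 5], [5], [3, 4]]
v, iterations = sparse_pagerank(out_links)
print("iterations:", iterations)
print("PageRank:", [round(p, 8) for p in v])
```

The work per iteration is proportional to the number of links plus the number of pages, rather than to n^2.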
More efficient power method
● Power method algorithm with sparse matrix
x^(0) = [ 1 1 ⋯ 1 ]^T / n
k = 1
δ = ∞
while δ > ϵ
    x^(k) = α P x^(k−1) + (e/n) (α d^T x^(k−1) + 1 − α)
    δ = ‖x^(k) − x^(k−1)‖∞
    k = k + 1
end

Cocktail napkin computational cost analysis for Google
● P averages 7 nonzero entries per row.

Operation                  Approx. # FLOP
(1) α P x^(k−1)            14 n
(2) α d^T x^(k−1)          n
(1) + (2)                  n
x^(k) − x^(k−1)            n
Total                      17 n = O(n)

● FLOP ≈ 7.65 ∙ 10^11
● Sunway TaihuLight: 125.435 ∙ 10^15 FLOP/sec
● 1 iteration: 6.1 ∙ 10^−6 seconds
● 50 – 100 iterations: < 0.00061 seconds!
● In actuality it takes a couple of days.
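The napkin arithmetic for both variants can be checked in a few lines. This sketch simply redoes the division using the figures quoted on the slides (n = 45 ∙ 10^9 pages, 2n^2 and 17n FLOP per iteration, and the quoted Sunway TaihuLight rate); it assumes the machine runs at that rate throughout, which is the same idealization the slides make.

```python
n = 45e9                    # number of pages used in the cost analysis
flop_rate = 125.435e15      # Sunway TaihuLight, FLOP per second (figure quoted above)

dense_flop = 2 * n**2       # dense matrix-vector product per iteration
sparse_flop = 17 * n        # sparse update per iteration

dense_seconds = dense_flop / flop_rate
sparse_seconds = sparse_flop / flop_rate

print(f"dense:  {dense_flop:.3g} FLOP, {dense_seconds / 3600:.1f} hours per iteration")
for iters in (50, 100):
    print(f"        {iters} iterations: {iters * dense_seconds / 86400:.1f} days")

print(f"sparse: {sparse_flop:.3g} FLOP, {sparse_seconds:.2g} seconds per iteration")
for iters in (50, 100):
    print(f"        {iters} iterations: {iters * sparse_seconds:.2g} seconds")
```

The gap between this estimate and the "couple of days" noted above is a reminder that FLOP counts ignore memory traffic and communication.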
Example: Boise State University
Connectivity matrix for www.boisestate.edu
● Cleve Moler's (2004) surfer.m (http://www.mathworks.com/moler)
● Modifications for BSU: bsusurfer.m (See course website)
● Number of pages = 20000
● nnz(P) = 1529518, sparsity ratio = 99.6%
● α = 0.85
● Repeat until δ = ‖x^(k) − x^(k−1)‖∞ ≤ 10^−7
● Number of iterations = 58

Algorithm        Time (sec.)
Dense matrix     8.28
Sparse matrix    0.08
PageRank results for Boise State
#    PageRank      Webpage
01.  3.67059e-02   http://www.boisestate.edu
02.  8.84482e-03   http://template.boisestate.edu/feed
03.  7.61458e-03   http://template.boisestate.edu/comments/feed
04.  6.98594e-03   http://www.boisestate.edu/index.html
05.  6.58044e-03   http://my.boisestate.edu
06.  6.56559e-03   http://index.boisestate.edu
07.  6.34645e-03   http://directory.boisestate.edu
08.  6.15998e-03   http://maps.boisestate.edu
09.  5.87635e-03   http://news.boisestate.edu/update
10.  5.78294e-03   http://events.boisestate.edu
11.  5.77344e-03   http://go.boisestate.edu
12.  5.67036e-03   http://go.boisestate.edu/about
13.  5.61788e-03   http://go.boisestate.edu/boise-beyond
14.  5.49048e-03   http://go.boisestate.edu/year-in-photos
15.  5.48960e-03   http://news.boisestate.edu/facts
PageRank results for Boise State (selected pages)
#      PageRank      Webpage
28     5.45711e-03   http://cobe.boisestate.edu
29     5.45711e-03   http://coas.boisestate.edu
43     5.45703e-03   http://coen.boisestate.edu
48     5.42033e-03   http://president.boisestate.edu
116    7.59747e-04   http://biology.boisestate.edu
117    7.56711e-04   http://biomolecularsciences.boisestate.edu
157    5.11459e-04   http://coen.boisestate.edu/ce
164    5.03192e-04   http://coen.boisestate.edu/cs
188    4.58116e-04   http://cobe.boisestate.edu/economics
192    4.26967e-04   http://coen.boisestate.edu/ece
265    3.36702e-04   http://coen.boisestate.edu/mse
267    3.36702e-04   http://coen.boisestate.edu/mbe
269    3.36702e-04   http://math.boisestate.edu
12539  9.81223e-06   http://math.boisestate.edu/~wright
Exercises
1. What is the PageRank vector for the following web:
[Figure: directed graph with nodes one, two, three, four, five]
2. What effect does decreasing α have on the PageRank model?
Recall:
\[
A = \alpha \left( P + \frac{1}{n} e d^T \right) + \frac{1-\alpha}{n} e e^T, \qquad 0 < \alpha < 1
\]
3. Let the web W consist of n pages and suppose $x$ satisfies $\sum_{j=1}^{n} x_j = 1$. If $y = Ax$, show $\sum_{j=1}^{n} y_j = 1$, where A is given above.
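More advanced PageRank topics
1. Effect of changing the teleport probability α.
\[
x^{(k+1)} = \alpha P x^{(k)} + \frac{e}{n} \left( \alpha d^T x^{(k)} + 1 - \alpha \right) = v + O(\alpha^k)
\]
2. Faster algorithms.
\[
Av = v \iff \left[ I - \alpha \left( P + \frac{1}{n} e d^T \right) \right] v = \frac{1-\alpha}{n} e
\]
3. Personalizing PageRank.
\[
A = \alpha \left( P + \frac{1}{n} e d^T \right) + \frac{1-\alpha}{n} e e^T
\quad \longrightarrow \quad
A_u = \alpha \left( P + u d^T \right) + (1-\alpha) u e^T
\]
4. Search engine optimization (SEO).
5. Updating the PageRank vector.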
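As a rough illustration of the personalized matrix A_u, the uniform vector e/n can be swapped for a personalization vector u in the sparse update. The sketch below is plain Python; the function name `personalized_pagerank`, the adjacency-list input, and the particular choice of u are assumptions made here for illustration, with u required to be a probability vector.

```python
def personalized_pagerank(out_links, u, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method for A_u = alpha*(P + u d^T) + (1 - alpha)*u e^T."""
    n = len(out_links)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        dangling_mass = sum(x[j] for j in range(n) if not out_links[j])
        # Dangling jumps and teleports both land according to u.
        scale = alpha * dangling_mass + 1.0 - alpha
        x_new = [scale * ui for ui in u]
        for j, targets in enumerate(out_links):
            if targets:
                share = alpha * x[j] / len(targets)
                for i in targets:
                    x_new[i] += share
        delta = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if delta <= tol:
            return x
    raise RuntimeError("did not converge within max_iter iterations")

# The 6-page example (0-based): out_links[j] lists the pages that page j+1 links to.
out_links = [[1, 2], [], [0, 1, 3], [4, 5], [5], [3, 4]]

uniform = personalized_pagerank(out_links, [1 / 6] * 6)           # u = e/n
favor_one = personalized_pagerank(out_links, [1, 0, 0, 0, 0, 0])  # teleport only to page one
print("u = e/n:     ", [round(p, 6) for p in uniform])
print("u = page one:", [round(p, 6) for p in favor_one])
```

With u = e/n the result agrees with the ordinary PageRank vector of the example, and concentrating u on page one raises that page's score.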
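Search engine optimization
[Figure: directed graph of the example web with nodes one, two, three, four, five, six, annotated "SEO"]
● Idea
Before SEO:
\[
v = [\, 0.051704746 \;\; 0.073679263 \;\; 0.057412413 \;\; 0.19990381 \;\; 0.26859608 \;\; 0.34870368 \,]^T
\]
After SEO:
\[
v = [\, 0.12742415 \;\; 0.061268076 \;\; 0.050530372 \;\; 0.11365423 \;\; 0.26626095 \;\; 0.17423120 \;\; 0.20663101 \,]^T
\]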
Concluding remarks
● PageRank is the “Heart of Google software”
● Use random walk (surf) to formulate PageRank problem.
● Use linear algebra to define PageRank.
● Can use the simple power method to compute PageRank.
● PageRank idea has been applied in many different areas:
For more details see: David F. Gleich, PageRank Beyond the Web, SIAM Review, 57 (2015), 321-363.