Numerical Linear Algebra for Data and Link Analysis
DESCRIPTION
Talk at Google about spectral graph partitioning and distributed PageRank computation using linear systems.
TRANSCRIPT
Numerical Linear Algebra for Data and Link Analysis
Leonid Zhukov
June 9, 2005
Abstract
Modern information retrieval and data mining systems must operate on extremely large datasets and require efficient, robust and scalable algorithms. Numerical linear algebra provides a solid foundation for the development of such algorithms and the analysis of their behavior.
In this talk I will discuss several linear algebra based methods and their practical applications:
i) Spectral graph partitioning. I will describe a recursive spectral algorithm for bi-partite graph partitioning and its application to simultaneous clustering of bidded terms and advertisers in pay-for-performance market data. I will also present a new local refinement strategy that allows us to improve cluster quality.
ii) Web graph link analysis. I will discuss a linear system formulation of the PageRank algorithm and the use of Krylov subspace methods for an efficient solution. I will also describe our scalable parallel implementation and present results of numerical experiments on the convergence of iterative methods on multiple graphs with various parameter settings.
In conclusion I will outline some difficulties encountered while developing these applications and address possible solutions and future research directions.
Outline
• Introduction
– Computational science and information retrieval
• Spectral clustering and graph partitioning
– Spectral clustering
– Flow refinement
– Bi-partite spectral and advertiser-term clustering
• Web graph link analysis
– PageRank as linear system
– Krylov subspace methods
– Numerical experiments
• Parallel implementation
– Distributed matrices
– MPI, PETSc, etc.
• Conclusion and future work
1. Introduction
1.1. Computational science for information retrieval
• Many applications of numerical methods, but no specialized algorithms
• Large scale problems
• Practical applications
Scientific Computing                       Information Retrieval
Problem in continuum, governed by PDE      Discrete data is given
Discretization for numerical solution      No control over problem size
Control over resolution
2D or 3D geometry                          High-dimensional spaces
Uniform distribution of node degrees       Power-law degree distribution
1.2. Scientific Computing vs Information Retrieval: graphs
[figures: FEM mesh for CFD simulations; Artist-Artist similarity graph]
2. Spectral Graph Partitioning
2.1. Graph partitioning
• Bisecting the graph: edge separator
[figure: a good and balanced cut]
• Balanced partition
• “Natural” boundaries: partitioning = clustering
2.2. Metrics - good cut
• Partitioning:

  cut(V_1, V_2) = \sum_{i \in V_1, j \in V_2} e_{ij}, \qquad assoc(V_1, V) = \sum_{i \in V_1} d(v_i)

• Objective functions:

  – Minimal cut: MCut(V_1, V_2) = cut(V_1, V_2)
  – Normalized cut: NCut(V_1, V_2) = \frac{cut(V_1, V_2)}{assoc(V_1, V)} + \frac{cut(V_1, V_2)}{assoc(V_2, V)}
  – Quotient cut: QCut(V_1, V_2) = \frac{cut(V_1, V_2)}{\min(assoc(V_1, V), assoc(V_2, V))}
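To make these metrics concrete, here is a minimal C sketch (my illustration, not code from the talk): it computes cut, NCut and QCut for a small unweighted graph stored as an edge list; the toy graph and the 0/1 partition array are assumed for the example.

/* Hedged sketch: cut, NCut and QCut for an unweighted undirected graph
   given as an edge list; graph and partition are illustrative. */
#include <stdio.h>

typedef struct { int u, v; } Edge;

int main(void) {
    Edge e[5] = {{0,1}, {1,2}, {2,3}, {3,4}, {1,3}};
    int part[5] = {0, 0, 1, 1, 1};     /* node -> side of the cut */
    double cut = 0, assoc1 = 0, assoc2 = 0;
    for (int k = 0; k < 5; k++) {
        if (part[e[k].u] != part[e[k].v]) cut++;      /* crossing edge */
        /* each endpoint adds 1 to the degree sum of its side,
           so these accumulate assoc(V_1, V) and assoc(V_2, V) */
        assoc1 += (part[e[k].u] == 0) + (part[e[k].v] == 0);
        assoc2 += (part[e[k].u] == 1) + (part[e[k].v] == 1);
    }
    printf("cut  = %g\n", cut);
    printf("NCut = %g\n", cut / assoc1 + cut / assoc2);
    printf("QCut = %g\n", cut / (assoc1 < assoc2 ? assoc1 : assoc2));
    return 0;
}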
2.3. Graph cuts
• Let G = (V, E) be a graph, A(G) its adjacency matrix
• Let V = V^+ ∪ V^- be a partitioning of the nodes
• Let v = (+1, -1, +1, ..., -1, +1)^T be an indicator vector:
[figure: path graph with node labels -1, -1, +1, +1, +1]
• v(i) = +1 if node i ∈ V^+; v(i) = -1 if node i ∈ V^-
• Compute the number of edges connecting V^+ and V^-:

  cut(V^+, V^-) = \frac{1}{4} \sum_{e(i,j)} (v(i) - v(j))^2 = \frac{1}{4} v^T L v

• L = D - A
• Minimal cut partitioning: remove the smallest number of edges
• Exact solution is NP-hard!
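As a quick numerical sanity check of the identity (a sketch under an assumed 4-node path graph and partition):

/* Tiny check of cut(V+,V-) = (1/4) v^T L v on a 4-node path graph;
   the graph and partition are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    /* path 0-1-2-3: L = D - A */
    double L[4][4] = {
        { 1, -1,  0,  0},
        {-1,  2, -1,  0},
        { 0, -1,  2, -1},
        { 0,  0, -1,  1}
    };
    double v[4] = {+1, +1, -1, -1};   /* V+ = {0,1}, V- = {2,3} */
    double q = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            q += v[i] * L[i][j] * v[j];
    printf("(1/4) v^T L v = %g, cut edges = 1\n", q / 4);  /* prints 1 */
    return 0;
}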
2.4. Spectral method - motivation (from Physics)
• Linear graph with 5 nodes:
[figure: path graph with nodes x_1, x_2, x_3, x_4, x_5]
• Energy of the system:

  E = \frac{1}{2} m \sum_i \dot{x}_i^2 + \frac{1}{2} k \sum_{(i,j)} (x_i - x_j)^2

• Equations of motion:

  M \frac{d^2 x}{dt^2} = -k L x

• Laplacian matrix (5×5):

  L = \begin{pmatrix}
       1 & -1 &    &    &    \\
      -1 &  2 & -1 &    &    \\
         & -1 &  2 & -1 &    \\
         &    & -1 &  2 & -1 \\
         &    &    & -1 &  1
      \end{pmatrix}
2.5. Spectral method - motivation (from Physics)
• Eigenproblem: L x = \lambda x
• The second lowest eigenvalue \lambda_2 = \omega_2^2 gives the mode bisecting the string into two equal-sized components
2.6. Spectral method - relaxation
• Discrete problem → continuous problem
• Discrete problem: find

  \min \frac{1}{4} v^T L v

  subject to v(i) = \pm 1, \sum_i v(i) = 0
• Relaxation - continuous problem: find

  \min \frac{1}{4} x^T L x

  subject to \sum_i x(i)^2 = N, \sum_i x(i) = 0
• An exact (discrete) solution satisfies the relaxed constraints, but not the other way around!
• Given x(i), round by v(i) = sign(x(i))
2.7. Spectral method - computations
• Constrained optimization problem:

  Q(x) = \frac{1}{4} x^T L x - \lambda (x^T x - N)

• Additional constraint: x \perp e = (1, 1, ..., 1)^T
• Minimization:

  \min_{x \perp x_1} \frac{1}{4} \frac{x^T L x}{x^T x}

• Courant-Fischer minimax theorem:

  L x = \lambda x

  Looking for the second smallest eigenvalue \lambda_2 and its eigenvector x_2
2.8. Family of spectral methods
• Ratio cut:

  RCut(V_1, V_2) = \frac{cut(V_1, V_2)}{|V_1|} + \frac{cut(V_1, V_2)}{|V_2|}

  (D - A) x = \lambda x

• Normalized cut:

  NCut(V_1, V_2) = \frac{cut(V_1, V_2)}{assoc(V_1, V)} + \frac{cut(V_1, V_2)}{assoc(V_2, V)}

  NCut(V_1, V_2) = 2 - \left( \frac{assoc(V_1, V_1)}{assoc(V_1, V)} + \frac{assoc(V_2, V_2)}{assoc(V_2, V)} \right)

  (D - A) x = \lambda D x
2.9. Spectral partitioning algorithm
Algorithm 1
  Compute the eigenvector v_2 corresponding to \lambda_2 of L(G)
  for all nodes n in G do
    if v_2(n) < 0 then put node n in partition V^-
    else put node n in partition V^+
    end if
  end for
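A self-contained C sketch of Algorithm 1 (an illustration under stated assumptions, not the talk's implementation): the Fiedler vector of the 5-node path Laplacian from section 2.4 is approximated by power iteration on \sigma I - L with the constant eigenvector deflated, then rounded by sign. The shift \sigma = 4 (an upper bound on \lambda_max here) and the fixed iteration count are assumptions; in practice one would use a Lanczos solver, as noted later in the talk.

/* Hedged sketch of spectral bisection: power iteration on (sigma*I - L)
   with the e = (1,...,1) direction deflated yields the Fiedler vector. */
#include <stdio.h>
#include <math.h>

#define N 5

int main(void) {
    double L[N][N] = {               /* Laplacian of the 5-node path */
        { 1,-1, 0, 0, 0},
        {-1, 2,-1, 0, 0},
        { 0,-1, 2,-1, 0},
        { 0, 0,-1, 2,-1},
        { 0, 0, 0,-1, 1}
    };
    double sigma = 4.0;              /* >= lambda_max(L) for a path graph */
    double x[N] = {0.1, 0.3, -0.2, 0.4, -0.5};  /* arbitrary start */
    for (int it = 0; it < 1000; it++) {
        double mean = 0, y[N], norm = 0;
        for (int i = 0; i < N; i++) mean += x[i] / N;
        for (int i = 0; i < N; i++) x[i] -= mean;   /* deflate e */
        for (int i = 0; i < N; i++) {               /* y = (sigma*I - L) x */
            y[i] = sigma * x[i];
            for (int j = 0; j < N; j++) y[i] -= L[i][j] * x[j];
        }
        for (int i = 0; i < N; i++) norm += y[i] * y[i];
        norm = sqrt(norm);
        for (int i = 0; i < N; i++) x[i] = y[i] / norm;
    }
    /* round: v(i) = sign(x2(i)); the eigenvector sign is arbitrary, so
       V+/V- labels may swap, and the midpoint node (entry ~0) may land
       on either side */
    for (int i = 0; i < N; i++)
        printf("node %d -> %s\n", i, x[i] < 0 ? "V-" : "V+");
    return 0;
}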
2.10. Spectral ordering algorithm
Algorithm 2
  Compute the eigenvector v_2 corresponding to \lambda_2 of L(G)
  sort the nodes n of G according to v_2(n)

• Permute the columns and rows of A according to the “new” ordering
• Since \sum_{e(i,j)} (v(i) - v(j))^2 is minimized ⇒ there are few edges connecting distant v(i) and v(j)
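Spectral ordering is then a single sort by the Fiedler-vector entries; a minimal sketch assuming v2 comes from an eigensolver such as the one sketched above:

/* Hedged sketch of Algorithm 2: order nodes by their v2 entries with
   qsort; the v2 values below are illustrative. */
#include <stdio.h>
#include <stdlib.h>

static const double *v2_ref;   /* comparator context */

static int by_v2(const void *a, const void *b) {
    double d = v2_ref[*(const int *)a] - v2_ref[*(const int *)b];
    return (d > 0) - (d < 0);
}

int main(void) {
    double v2[5] = {0.3, -0.6, 0.1, -0.2, 0.5};
    int order[5] = {0, 1, 2, 3, 4};
    v2_ref = v2;
    qsort(order, 5, sizeof order[0], by_v2);
    /* permute rows/columns of A by `order` to expose block structure */
    for (int i = 0; i < 5; i++) printf("%d ", order[i]);
    printf("\n");
    return 0;
}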
2.11. Spectral Example I (good)
2.12. Linear sweep
• Linear sweep along the spectral ordering, evaluating NCut(V_1, V_2) and QCut(V_1, V_2) at each candidate split point
2.13. Spectral Example II (not so good)
2.14. Flow refinement
Set up and solve a minimum s-t cut problem:
• Divide the nodes into 3 sets according to the embedding ordering
• Set up an s-t max-flow problem with one set of nodes pinned to the source and another to the sink by infinite-capacity links
• Solve to obtain the s-t min cut (max-flow min-cut theorem; find the saturated frontier)
• Move the partition accordingly
2.15. Flow refinement

cut(A,B) = 171          cut(A,B) = 70
QCut = 0.0108           QCut = 0.0053
NCut = 0.0206           NCut = 0.0088
part size = 1433        part size = 1195
2.16. Flow refinement

cut(A,B) = 11605        cut(A,B) = 36688
QCut = 0.242            QCut = 0.160
NCut = 0.267            NCut = 0.296
part size = 266         part size = 1103
2.17. Recursive spectral
• tree→ flat clusters
2.18. Example: Recursive Spectral
2.19. Data: Advertiser - bidded term data
[figure: advertiser-term matrix A, rows t_i = terms, columns a_j = advertisers]
• Simultaneous clustering of advertisers and bidded terms (co-clustering)
• Bi-partite graph partitioning problem
2.20. Bi-partite graph case
• Adjacency matrix of the bipartite graph:

  \hat{A} = \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}

• Eigensystem:

  \begin{pmatrix} D_1 & -A \\ -A^T & D_2 \end{pmatrix}
  \begin{pmatrix} x \\ y \end{pmatrix}
  = \lambda
  \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}
  \begin{pmatrix} x \\ y \end{pmatrix}

• Normalization:

  A_n = D_1^{-1/2} A D_2^{-1/2}

  A_n v = (1 - \lambda) u, \qquad A_n^T u = (1 - \lambda) v

• SVD decomposition:

  A_n = u \sigma v^T, \qquad \sigma = 1 - \lambda
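The normalization step is elementwise; a small dense sketch (the term-advertiser matrix values are illustrative) producing A_n to hand to an SVD routine:

/* Hedged sketch: scale a term-advertiser matrix by the inverse square
   roots of its row and column sums, A_n = D1^{-1/2} A D2^{-1/2}. */
#include <stdio.h>
#include <math.h>
#define T 3   /* terms (rows) */
#define K 2   /* advertisers (columns) */

int main(void) {
    double A[T][K] = {{1, 0}, {2, 1}, {0, 3}};
    double d1[T] = {0}, d2[K] = {0};
    for (int i = 0; i < T; i++)
        for (int j = 0; j < K; j++) { d1[i] += A[i][j]; d2[j] += A[i][j]; }
    for (int i = 0; i < T; i++)
        for (int j = 0; j < K; j++) {
            double an = A[i][j] / sqrt(d1[i] * d2[j]);
            printf("%8.4f%c", an, j == K - 1 ? '\n' : ' ');
        }
    return 0;
}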
2.21. Advertiser - bidded search term matrix
2.22. Advertiser - bidded search term matrix
2.23. Computational considerations
• Large and very sparse matrices
• Only top few eigenvectors needed
• Precision requirements low
• Iterative Krylov subspace methods, Lanczos and Arnoldi algorithms
• Only matrix-vector multiply
3. Web graph link analysis
3.1. PageRank model
• Random walk on the graph
• Markov process: memoryless, homogeneous
• Stationary distribution: existence, uniqueness, convergence
• Perron-Frobenius theorem applies when the chain is irreducible (every state is reachable from every other) and aperiodic (no cycles)
3.2. PageRank model
• Construct the probability matrix:

  P = D^{-1} A, \qquad D = diag(A e)  (diagonal matrix of out-degrees)

• Construct the transition matrix for the Markov process (row-stochastic); d marks the dangling nodes:

  P' = P + d v^T

• Correct reducibility (make the chain irreducible):

  P'' = c P' + (1 - c) e v^T

• The Markov chain stationary distribution exists and is unique (Perron-Frobenius):

  P''^T p = \lambda p
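This construction maps directly onto the power iteration; a self-contained sketch on an assumed 4-page toy graph, with uniform v = e/N and the dangling mass folded back through v:

/* Hedged sketch of the PageRank power iteration p <- P''^T p on a toy
   graph; node 3 is dangling, v = e/N, c = 0.85 (all illustrative). */
#include <stdio.h>
#define N 4
#define M 4

int main(void) {
    int src[M] = {0, 0, 1, 2}, dst[M] = {1, 2, 2, 0};
    double c = 0.85, p[N], pnew[N];
    int outdeg[N] = {0};
    for (int k = 0; k < M; k++) outdeg[src[k]]++;
    for (int i = 0; i < N; i++) p[i] = 1.0 / N;
    for (int it = 0; it < 100; it++) {
        double dangling = 0;
        for (int i = 0; i < N; i++) {
            pnew[i] = 0;
            if (outdeg[i] == 0) dangling += p[i];   /* the d v^T term */
        }
        for (int k = 0; k < M; k++)                 /* c P^T p */
            pnew[dst[k]] += c * p[src[k]] / outdeg[src[k]];
        for (int i = 0; i < N; i++)                 /* teleportation + dangling */
            p[i] = pnew[i] + (c * dangling + 1 - c) / N;
    }
    for (int i = 0; i < N; i++) printf("p[%d] = %f\n", i, p[i]);
    return 0;
}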
3.3. Linear system formulation
• PageRank equation:

  (c P + c\, d v^T + (1 - c)\, e v^T)^T x = \lambda x

• Normalization: e^T x = x^T e = \|x\|_1, \quad \lambda_1 = 1
• Identity: d^T x = \|x\|_1 - \|P^T x\|_1
• Linear system:

  (I - c P^T) x = v (\|x\|_1 - c \|P^T x\|_1)
3.4. Linear System vs Eigensystem
Eigensystem                                   Linear system
P''^T p = \lambda p                           (I - c P^T) x = k(x) v
P'' = c P + c d v^T + (1 - c) e v^T           k(x) = \|x\|_1 - c \|P^T x\|_1
\lambda = 1                                   p = x / \|x\|_1

• Iteration matrices P'' and I - c P^T have different rates of convergence
• The vector v moves to the right-hand side instead of sitting inside the matrix
• More methods are available for linear systems
• The solution is linear with respect to v
3.5. Flowchart of computational methods
3.6. Stationary iterations
• Power iteration:

  p^{(k+1)} = c P'^T p^{(k)} + (1 - c) v

• Jacobi iteration:

  p^{(k+1)} = c P^T p^{(k)} + k v

• Iteration error and residual:

  e^{(k)} = \|x^{(k)} - x^{(k-1)}\|_1, \qquad r^{(k)} = \|b - A x^{(k)}\|_1

• Convergence to error e in k steps:

  k \sim \log(e) / \log(c)
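A sketch of the Jacobi variant under the same toy-graph assumptions as before: the dangling correction never enters the matrix, and x is simply rescaled to a probability vector at the end, using the linearity in v.

/* Hedged sketch of the Jacobi iteration x <- c P^T x + b for
   (I - c P^T) x = b with b = (1-c) v; toy graph, node 3 dangling. */
#include <stdio.h>
#define N 4
#define M 4

int main(void) {
    int src[M] = {0, 0, 1, 2}, dst[M] = {1, 2, 2, 0};
    double c = 0.85, x[N], xnew[N], s = 0;
    int outdeg[N] = {0};
    for (int k = 0; k < M; k++) outdeg[src[k]]++;
    for (int i = 0; i < N; i++) x[i] = 1.0 / N;
    for (int it = 0; it < 100; it++) {
        for (int i = 0; i < N; i++) xnew[i] = (1 - c) / N;  /* b = (1-c)/N */
        for (int k = 0; k < M; k++)                          /* c P^T x */
            xnew[dst[k]] += c * x[src[k]] / outdeg[src[k]];
        for (int i = 0; i < N; i++) x[i] = xnew[i];
    }
    for (int i = 0; i < N; i++) s += x[i];
    for (int i = 0; i < N; i++)
        printf("p[%d] = %f\n", i, x[i] / s);   /* p = x / ||x||_1 */
    return 0;
}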
3.7. Stationary methods convergence
[figure: error metrics for Jacobi convergence vs. iteration: \|x^{(k)} - x^*\|, \|A x^{(k)} - b\|, \|x^{(k)} - x^*\| / \|x^{(k)}\|]
3.8. Krylov subspace methods
• Linear system A x = b, with A = I - c P^T, b = k v
• Residual: r = b - A x
• Krylov subspace:

  K_m = span\{r, A r, A^2 r, A^3 r, ..., A^m r\}

• x_m is built from x_0 + K_m: x_m = x_0 + q_{m-1}(A) r_0
• Only matrix-vector products
• Explicit minimization in the subspace; extra information for the next step
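All of these methods rest on building an orthonormal basis of K_m from matrix-vector products alone; below is a short Arnoldi-process sketch (modified Gram-Schmidt) on an assumed small dense matrix, not the talk's production solver.

/* Hedged sketch of the Arnoldi process: builds orthonormal basis V of
   the Krylov subspace and the small Hessenberg matrix H. */
#include <stdio.h>
#include <math.h>
#define N 4
#define M 3

static void matvec(const double A[N][N], const double *x, double *y) {
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
    }
}

int main(void) {
    double A[N][N] = {{4,1,0,0},{1,3,1,0},{0,1,2,1},{0,0,1,1}};
    double V[M + 1][N], H[M + 1][M] = {{0}};
    double r[N] = {1, 0, 0, 0};        /* starting residual, ||r|| = 1 */
    for (int i = 0; i < N; i++) V[0][i] = r[i];
    for (int k = 0; k < M; k++) {
        double w[N], nrm = 0;
        matvec(A, V[k], w);
        for (int j = 0; j <= k; j++) {  /* orthogonalize against V_0..V_k */
            for (int i = 0; i < N; i++) H[j][k] += V[j][i] * w[i];
            for (int i = 0; i < N; i++) w[i] -= H[j][k] * V[j][i];
        }
        for (int i = 0; i < N; i++) nrm += w[i] * w[i];
        H[k + 1][k] = sqrt(nrm);
        for (int i = 0; i < N; i++) V[k + 1][i] = w[i] / H[k + 1][k];
    }
    printf("H[1][0] = %f\n", H[1][0]);
    return 0;
}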
3.9. Krylov subspace methods
• Generalized Minimal Residual (GMRES): pick x_n ∈ K_n such that \|b - A x_n\| is minimized, i.e. r_n \perp A K_n
• Biconjugate Gradient (BiCG): pick x_n ∈ K_n such that r_n \perp span\{w, A^T w, ..., (A^T)^{n-1} w\}
• Biconjugate Gradient Stabilized (BiCGSTAB)
• Quasi-Minimal Residual (QMR)
• Conjugate Gradient Squared (CGS)
• Chebyshev iterations

Preconditioners

• Convergence depends on cond(A) = \lambda_{max} / \lambda_{min}
• Preconditioner M: M^{-1} A x = M^{-1} b
• Iterate with M^{-1} A, which has a better condition number
• Diagonal preconditioner: M = D
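Applying a diagonal preconditioner costs one elementwise scale per matrix-vector product; a tiny sketch with an assumed dense matrix:

/* Minimal sketch of diagonal (Jacobi) preconditioning: apply the
   preconditioned operator M^{-1}A with M = diag(A) to a vector. */
#include <stdio.h>
#define N 3

int main(void) {
    double A[N][N] = {{4, 1, 0}, {1, 5, 2}, {0, 2, 6}};
    double x[N] = {1, 2, 3}, y[N];
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
        y[i] /= A[i][i];               /* apply M^{-1} = diag(A)^{-1} */
    }
    for (int i = 0; i < N; i++) printf("%g\n", y[i]);
    return 0;
}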
3.10. Krylov subspace methods: convergence
[figure: error metrics for BiCG convergence vs. iteration: \|x^{(k)} - x^*\|, \|A x^{(k)} - b\|, \|x^{(k)} - x^*\| / \|x^{(k)}\|]
3.11. Computational Requirements
Method      IP      SAXPY   MV   Storage
PAGERANK    1       1       1    M + 3v
JACOBI      1       1       1    M + 3v
GMRES       i + 1   i + 1   1    M + (i + 5)v
BiCG        2       5       2    M + 10v
BiCGSTAB    4       6       2    M + 10v
• IP - inner vector products
• SAXPY - scalar times vector plus vector
• MV - matrix vector products
• M - matrix storage, v - vector storage
3.12. Graph statistics
Name       Nodes    Links   Storage Size
bs-cc      20k      130k    1.6 MB
edu        2M       14M     176 MB
yahoo-r2   14M      266M    3.25 GB
uk         18.5M    300M    3.67 GB
yahoo-r3   60M      850M    10.4 GB
db         70M      1B      12.3 GB
av         1.4B     6.6B    80 GB
3.13. Graph statistics
[figures: log-log degree distributions with power-law fits — bs outdegree b = 1.526, bs indegree b = 1.747, y2 outdegree b = 1.454, y2 indegree b = 1.848, db outdegree b = 2.010, db indegree b = 1.870]
3.14. Convergence I
[figures: uk graph convergence — error vs. iteration and error vs. time (sec) for std, jacobi, gmres, bicg, bcgs]
3.15. Convergence II
[figures: db graph convergence — error vs. iteration and error vs. time (sec) for std, jacobi, gmres, bicg, bcgs]
3.16. Convergence on AV graph
[figure: av graph — error/residual vs. time (sec) for std and bcgs]
3.17. PageRank Timing results
Format: iterations (time per iteration / total time)

Graph (procs)          PR                 Jacobi             GMRES              BiCG               BCGS
edu (20)               84 (0.09s/7.56s)   84 (0.07s/5.88s)   21† (0.6s/12.6s)   44* (0.4s/17.6s)   21* (0.4s/8.4s)
yahoo-r2 (20)          71 (1.8s/127s)     65 (1.9s/123s)     12 (16s/192s)      35 (8.6s/301s)     10 (9.9s/99s)
uk (60)                73 (0.09s/6.57s)   71 (0.1s/7.1s)     22* (0.8s/17.6s)   25* (0.80s/20s)    11* (1.0s/11s)
yahoo-r3 (60)          76 (1.6s/122s)     75 (1.5s/112s)
db (60)                62 (9.0s/558s)     58 (8.7s/505s)     29 (15s/435s)      45 (15s/675s)      15* (15s/225s)
av (226)               72 (6.5s/468s)                                                              26 (16.5s/429s)
av, host order (140)   72 (4.6s/331s)                                                              26 (15.0s/390s)
3.18. Dependence on teleportation
[figure: convergence and conditioning for db — error/residual vs. iteration for std and gmres at c = 0.85, 0.90, 0.95, 0.99]
4. Parallel system
4.1. Matrix-Vector multiply
• Iterative process: A x → x
• Every process “owns” several rows of the matrix
• Every process “owns” corresponding part of the vector
• Communications required for multiplication
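A minimal MPI sketch of this row-block scheme (dense storage and a matrix dimension divisible by the number of ranks are simplifying assumptions): each rank gathers the full vector, then multiplies only its own rows. With sparse power-law matrices the gather is replaced by targeted exchanges of just the needed entries, which is where the distribution schemes below matter.

/* Hedged sketch: row-distributed dense matrix-vector multiply. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nproc, N = 8;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    int nloc = N / nproc;                 /* rows owned by this rank */
    double *Aloc = malloc(nloc * N * sizeof(double));
    double *xloc = malloc(nloc * sizeof(double));
    double *x    = malloc(N * sizeof(double));
    double *yloc = malloc(nloc * sizeof(double));
    for (int i = 0; i < nloc; i++) {      /* toy data: A = 2I */
        xloc[i] = 1.0;
        for (int j = 0; j < N; j++)
            Aloc[i * N + j] = (rank * nloc + i == j) ? 2.0 : 0.0;
    }
    /* every rank needs the whole x for its row block */
    MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    for (int i = 0; i < nloc; i++) {
        yloc[i] = 0;
        for (int j = 0; j < N; j++) yloc[i] += Aloc[i * N + j] * x[j];
    }
    printf("rank %d: y[0] = %g\n", rank, yloc[0]);
    free(Aloc); free(xloc); free(x); free(yloc);
    MPI_Finalize();
    return 0;
}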
4.2. Distributed matrices
• Computing:
– Load balancing: equal number of non-zeros per processor
– Minimize communication: smallest number of off-processor elements
• Storage:
– Number of non-zeros per processor
– Number of rows per processor
4.3. Practical data distribution
• Balanced graph partitioning
  – Exact: NP-hard
  – Approximate: multi-resolution, spectral, geometric, ...
• Practical solution
  – Sort the graph in lexicographic order
  – Fill processors consecutively by row, adding rows until

    w_{rows} n_p + w_{nnz} nnz_p > (w_{rows} n + w_{nnz} nnz) / p

    with w_{rows} : w_{nnz} = 1/1, 2/1, 4/1
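The fill rule is a one-pass greedy scan; a small sketch with assumed per-row nonzero counts and w_rows : w_nnz = 2/1:

/* Hedged sketch of the greedy fill rule: assign rows consecutively to
   a processor until its weighted load exceeds the average target. */
#include <stdio.h>

int main(void) {
    int nnz_per_row[10] = {5, 1, 9, 2, 2, 7, 3, 8, 1, 2};
    int n = 10, p = 3, proc = 0;
    double w_rows = 2, w_nnz = 1, nnz = 0, load = 0;
    for (int i = 0; i < n; i++) nnz += nnz_per_row[i];
    double target = (w_rows * n + w_nnz * nnz) / p;  /* average load */
    for (int i = 0; i < n; i++) {
        load += w_rows + w_nnz * nnz_per_row[i];
        printf("row %d -> proc %d\n", i, proc);
        if (load > target && proc < p - 1) { proc++; load = 0; }
    }
    return 0;
}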
4.4. Data distribution schemes
[figure: y2 std parallelization and distribution — time (s) vs. # of processors for the "smart" and "nrows" schemes]
4.5. Implementation details
4.6. Implementation: MPI
• Message Passing Interface (MPI) standard
• Library specification for message-passing
• Message passing = data transfer + synchronization
• MPI_SEND, MPI_RECV
• MPI_Bcast, MPI_Reduce, MPI_Gather, MPI_Scatter
• Implementations: LAM, mpich, Intel, etc.
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);

while (!done) {
    if (myid == 0) {
        printf("Enter the number of intervals: (0 quits) ");
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (n == 0) break;
}
4.7. Implementation: PETSc
• Portable Extensible Toolkit for Scientific Computing
• Implements basic linear algebra operations on distributed matrices.
• Advanced linear and nonlinear solvers

PetscInitialize(&argc, &args, (char *)0, help);
MatCreate(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, &A);
MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);
MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
VecAssemblyBegin(b);
VecAssemblyEnd(b);
MatMult(A, b, x);
4.8. Network topology
[figure: network topology effects — error/residual vs. time (sec) for std-140-full, bcgs-140-full, std-140-star, bcgs-140-star]
4.9. Host Ordering on AV graph
[figure: host order improvement on the av graph — error/residual vs. time (sec) for std-140, bcgs-140, std-140-host, bcgs-140-host]
4.10. Parallel performance
[figure: scaling for computing with the full web — performance increase (percent decrease in time) for std and bcgs]
5. Conclusions
• Eigenvalues everywhere! Linear algebra methods provide provably good solutions to many problems. The methods are very general.
• Power-law graphs with high variance in node degrees present challenges to high performance parallel computing
• Skewed distributions, chains, a central core, and singletons make clustering of power-law data a difficult problem
• Embedding in 1D is probably not sufficient for this type of data; higher dimensions are needed
5.1. References
• Collaborators:
– Kevin Lang, Pavel Berkhin
– David Gleich and Matt Rasmussen
• Publications:
– “Fast Parallel PageRank: A Linear System Approach”, 2004
– “Spectral Clustering of Large Advertiser Datasets”, 2003
– “Clustering of bipartite advertiser-keyword graph”, 2002
• References:
– Spectral graph partitioning: M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995), D. Spielman (1996), F. Chung (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon (2001), A. Ng (2001), H. Zha (2001), C. Ding (2001)
– PageRank computing: S. Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)