Numerical Linear Algebra for Data and Link Analysis
DESCRIPTION
Talk at Google about spectral graph partitioning and distributed PageRank computation using linear systems.
TRANSCRIPT
Numerical Linear Algebra for Data and Link Analysis
Leonid Zhukov
June 9, 2005
Abstract
Modern information retrieval and data mining systems must operate on extremely large datasets and require efficient, robust and scalable algorithms. Numerical linear algebra provides a solid foundation for the development of such algorithms and the analysis of their behavior.
In this talk I will discuss several linear algebra based methods and their practical applications:
i) Spectral graph partitioning. I will describe a recursive spectral algorithm for bi-partite graph partitioning and its application to simultaneous clustering of bidded terms and advertisers in pay-for-performance market data. I will also present a new local refinement strategy that allows us to improve cluster quality.
ii) Web graph link analysis. I will discuss a linear system formulation of the PageRank algorithm and the use of Krylov subspace methods for an efficient solution. I will also describe our scalable parallel implementation and present results of numerical experiments on the convergence of iterative methods on multiple graphs with various parameter settings.
In conclusion I will outline some difficulties encountered while developing these applications and address possible solutions and future research directions.
Outline
• Introduction
– Computational science and information retrieval
• Spectral clustering and graph partitioning
– Spectral clustering
– Flow refinement
– Bi-partite spectral and advertiser-term clustering
• Web graph link analysis
– PageRank as linear system
– Krylov subspace methods
– Numerical experiments
• Parallel implementation
– Distributed matrices
– MPI, PETSc, etc.
• Conclusion and future work
1. Introduction
1.1. Computational science for information retrieval
• Many applications of numerical methods, but no specialized algorithms
• Large scale problems
• Practical applications
Scientific Computing                       Information Retrieval
Problem in continuum, governed by PDE      Discrete data is given
Discretization for numerical solution      No control over problem size
Control over resolution
2D or 3D geometry                          High-dimensional spaces
Uniform distribution of node degrees       Power-law degree distribution
1.2. Scientific Computing vs Information Retrieval: graphs
[figures: FEM mesh for CFD simulations; Artist-Artist similarity graph]
2. Spectral Graph Partitioning
2.1. Graph partitioning
• Bisecting the graph: edge separator
[figure: a good and balanced cut]
• Balanced partition
• “Natural” boundaries: partitioning = clustering
2.2. Metrics - good cut
• Partitioning:

  cut(V_1, V_2) = \sum_{i \in V_1, j \in V_2} e_{ij}, \qquad assoc(V_1, V) = \sum_{i \in V_1} d(v_i)

• Objective functions:

  – Minimal cut: MCut(V_1, V_2) = cut(V_1, V_2)
  – Normalized cut: NCut(V_1, V_2) = \frac{cut(V_1, V_2)}{assoc(V_1, V)} + \frac{cut(V_1, V_2)}{assoc(V_2, V)}
  – Quotient cut: QCut(V_1, V_2) = \frac{cut(V_1, V_2)}{\min(assoc(V_1, V), assoc(V_2, V))}
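To make these metrics concrete, here is a minimal C sketch (my illustration, not code from the talk): it computes cut, NCut and QCut for a small unweighted graph stored as an edge list; the toy graph and the 0/1 partition array are assumed for the example.

/* Hedged sketch: cut, NCut and QCut for an unweighted undirected graph
   given as an edge list; graph and partition are illustrative. */
#include <stdio.h>

typedef struct { int u, v; } Edge;

int main(void) {
    Edge e[5] = {{0,1}, {1,2}, {2,3}, {3,4}, {1,3}};
    int part[5] = {0, 0, 1, 1, 1};     /* node -> side of the cut */
    double cut = 0, assoc1 = 0, assoc2 = 0;
    for (int k = 0; k < 5; k++) {
        if (part[e[k].u] != part[e[k].v]) cut++;      /* crossing edge */
        /* each endpoint adds 1 to the degree sum of its side,
           so these accumulate assoc(V_1, V) and assoc(V_2, V) */
        assoc1 += (part[e[k].u] == 0) + (part[e[k].v] == 0);
        assoc2 += (part[e[k].u] == 1) + (part[e[k].v] == 1);
    }
    printf("cut  = %g\n", cut);
    printf("NCut = %g\n", cut / assoc1 + cut / assoc2);
    printf("QCut = %g\n", cut / (assoc1 < assoc2 ? assoc1 : assoc2));
    return 0;
}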
2.3. Graph cuts
• Let G = (V, E) be a graph, A(G) its adjacency matrix
• Let V = V^+ ∪ V^- be a partitioning of the nodes
• Let v = (+1, -1, +1, ..., -1, +1)^T be an indicator vector:
[figure: path graph with node labels -1, -1, +1, +1, +1]
• v(i) = +1 if node i ∈ V^+; v(i) = -1 if node i ∈ V^-
• Compute the number of edges connecting V^+ and V^-:

  cut(V^+, V^-) = \frac{1}{4} \sum_{e(i,j)} (v(i) - v(j))^2 = \frac{1}{4} v^T L v

• L = D - A
• Minimal cut partitioning: remove the smallest number of edges
• Exact solution is NP-hard!
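As a quick numerical sanity check of the identity (a sketch under an assumed 4-node path graph and partition):

/* Tiny check of cut(V+,V-) = (1/4) v^T L v on a 4-node path graph;
   the graph and partition are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    /* path 0-1-2-3: L = D - A */
    double L[4][4] = {
        { 1, -1,  0,  0},
        {-1,  2, -1,  0},
        { 0, -1,  2, -1},
        { 0,  0, -1,  1}
    };
    double v[4] = {+1, +1, -1, -1};   /* V+ = {0,1}, V- = {2,3} */
    double q = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            q += v[i] * L[i][j] * v[j];
    printf("(1/4) v^T L v = %g, cut edges = 1\n", q / 4);  /* prints 1 */
    return 0;
}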
2.4. Spectral method - motivation (from Physics)
• Linear graph with 5 nodes:
[figure: path graph with nodes x_1, x_2, x_3, x_4, x_5]
• Energy of the system:

  E = \frac{1}{2} m \sum_i \dot{x}_i^2 + \frac{1}{2} k \sum_{(i,j)} (x_i - x_j)^2

• Equations of motion:

  M \frac{d^2 x}{dt^2} = -k L x

• Laplacian matrix (5×5):

  L = \begin{pmatrix}
       1 & -1 &    &    &    \\
      -1 &  2 & -1 &    &    \\
         & -1 &  2 & -1 &    \\
         &    & -1 &  2 & -1 \\
         &    &    & -1 &  1
      \end{pmatrix}
2.5. Spectral method - motivation (from Physics)
• Eigenproblem: L x = \lambda x
• The second lowest eigenvalue \lambda_2 = \omega_2^2 gives the mode bisecting the string into two equal-sized components
2.6. Spectral method - relaxation
• Discrete problem → continuous problem
• Discrete problem: find

  \min \frac{1}{4} v^T L v

  subject to v(i) = \pm 1, \sum_i v(i) = 0
• Relaxation - continuous problem: find

  \min \frac{1}{4} x^T L x

  subject to \sum_i x(i)^2 = N, \sum_i x(i) = 0
• An exact (discrete) solution satisfies the relaxed constraints, but not the other way around!
• Given x(i), round by v(i) = sign(x(i))
2.7. Spectral method - computations
• Constrained optimization problem:

  Q(x) = \frac{1}{4} x^T L x - \lambda (x^T x - N)

• Additional constraint: x \perp e = (1, 1, ..., 1)^T
• Minimization:

  \min_{x \perp x_1} \frac{1}{4} \frac{x^T L x}{x^T x}

• Courant-Fischer minimax theorem:

  L x = \lambda x

  Looking for the second smallest eigenvalue \lambda_2 and its eigenvector x_2
2.8. Family of spectral methods
• Ratio cut:

  RCut(V_1, V_2) = \frac{cut(V_1, V_2)}{|V_1|} + \frac{cut(V_1, V_2)}{|V_2|}

  (D - A) x = \lambda x

• Normalized cut:

  NCut(V_1, V_2) = \frac{cut(V_1, V_2)}{assoc(V_1, V)} + \frac{cut(V_1, V_2)}{assoc(V_2, V)}

  NCut(V_1, V_2) = 2 - \left( \frac{assoc(V_1, V_1)}{assoc(V_1, V)} + \frac{assoc(V_2, V_2)}{assoc(V_2, V)} \right)

  (D - A) x = \lambda D x
2.9. Spectral partitioning algorithm
Algorithm 1
  Compute the eigenvector v_2 corresponding to \lambda_2 of L(G)
  for all nodes n in G do
    if v_2(n) < 0 then put node n in partition V^-
    else put node n in partition V^+
    end if
  end for
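A self-contained C sketch of Algorithm 1 (an illustration under stated assumptions, not the talk's implementation): the Fiedler vector of the 5-node path Laplacian from section 2.4 is approximated by power iteration on \sigma I - L with the constant eigenvector deflated, then rounded by sign. The shift \sigma = 4 (an upper bound on \lambda_max here) and the fixed iteration count are assumptions; in practice one would use a Lanczos solver, as noted later in the talk.

/* Hedged sketch of spectral bisection: power iteration on (sigma*I - L)
   with the e = (1,...,1) direction deflated yields the Fiedler vector. */
#include <stdio.h>
#include <math.h>

#define N 5

int main(void) {
    double L[N][N] = {               /* Laplacian of the 5-node path */
        { 1,-1, 0, 0, 0},
        {-1, 2,-1, 0, 0},
        { 0,-1, 2,-1, 0},
        { 0, 0,-1, 2,-1},
        { 0, 0, 0,-1, 1}
    };
    double sigma = 4.0;              /* >= lambda_max(L) for a path graph */
    double x[N] = {0.1, 0.3, -0.2, 0.4, -0.5};  /* arbitrary start */
    for (int it = 0; it < 1000; it++) {
        double mean = 0, y[N], norm = 0;
        for (int i = 0; i < N; i++) mean += x[i] / N;
        for (int i = 0; i < N; i++) x[i] -= mean;   /* deflate e */
        for (int i = 0; i < N; i++) {               /* y = (sigma*I - L) x */
            y[i] = sigma * x[i];
            for (int j = 0; j < N; j++) y[i] -= L[i][j] * x[j];
        }
        for (int i = 0; i < N; i++) norm += y[i] * y[i];
        norm = sqrt(norm);
        for (int i = 0; i < N; i++) x[i] = y[i] / norm;
    }
    /* round: v(i) = sign(x2(i)); the eigenvector sign is arbitrary, so
       V+/V- labels may swap, and the midpoint node (entry ~0) may land
       on either side */
    for (int i = 0; i < N; i++)
        printf("node %d -> %s\n", i, x[i] < 0 ? "V-" : "V+");
    return 0;
}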
2.10. Spectral ordering algorithm
Algorithm 2
  Compute the eigenvector v_2 corresponding to \lambda_2 of L(G)
  sort the nodes n of G according to v_2(n)

• Permute the columns and rows of A according to the “new” ordering
• Since \sum_{e(i,j)} (v(i) - v(j))^2 is minimized ⇒ there are few edges connecting distant v(i) and v(j)
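Spectral ordering is then a single sort by the Fiedler-vector entries; a minimal sketch assuming v2 comes from an eigensolver such as the one sketched above:

/* Hedged sketch of Algorithm 2: order nodes by their v2 entries with
   qsort; the v2 values below are illustrative. */
#include <stdio.h>
#include <stdlib.h>

static const double *v2_ref;   /* comparator context */

static int by_v2(const void *a, const void *b) {
    double d = v2_ref[*(const int *)a] - v2_ref[*(const int *)b];
    return (d > 0) - (d < 0);
}

int main(void) {
    double v2[5] = {0.3, -0.6, 0.1, -0.2, 0.5};
    int order[5] = {0, 1, 2, 3, 4};
    v2_ref = v2;
    qsort(order, 5, sizeof order[0], by_v2);
    /* permute rows/columns of A by `order` to expose block structure */
    for (int i = 0; i < 5; i++) printf("%d ", order[i]);
    printf("\n");
    return 0;
}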
2.11. Spectral Example I (good)
2.12. Linear sweep
• Linear sweep along the spectral ordering, evaluating NCut(V_1, V_2) and QCut(V_1, V_2) at each candidate split point
2.13. Spectral Example II (not so good)
2.14. Flow refinement
Set up and solve a minimum s-t cut problem:
• Divide the nodes into 3 sets according to the embedding ordering
• Set up an s-t max-flow problem with one set of nodes pinned to the source and another to the sink by infinite-capacity links
• Solve to obtain the s-t min cut (max-flow min-cut theorem; find the saturated frontier)
• Move the partition accordingly
2.15. Flow refinement

cut(A,B) = 171          cut(A,B) = 70
QCut = 0.0108           QCut = 0.0053
NCut = 0.0206           NCut = 0.0088
part size = 1433        part size = 1195
2.16. Flow refinement

cut(A,B) = 11605        cut(A,B) = 36688
QCut = 0.242            QCut = 0.160
NCut = 0.267            NCut = 0.296
part size = 266         part size = 1103
2.17. Recursive spectral
• tree→ flat clusters
2.18. Example: Recursive Spectral
2.19. Data: Advertiser - bidded term data
[figure: advertiser-term matrix A, rows t_i = terms, columns a_j = advertisers]
• Simultaneous clustering of advertisers and bidded terms (co-clustering)
• Bi-partite graph partitioning problem
2.20. Bi-partite graph case
• Adjacency matrix of the bipartite graph:

  \hat{A} = \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}

• Eigensystem:

  \begin{pmatrix} D_1 & -A \\ -A^T & D_2 \end{pmatrix}
  \begin{pmatrix} x \\ y \end{pmatrix}
  = \lambda
  \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}
  \begin{pmatrix} x \\ y \end{pmatrix}

• Normalization:

  A_n = D_1^{-1/2} A D_2^{-1/2}

  A_n v = (1 - \lambda) u, \qquad A_n^T u = (1 - \lambda) v

• SVD decomposition:

  A_n = u \sigma v^T, \qquad \sigma = 1 - \lambda
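The normalization step is elementwise; a small dense sketch (the term-advertiser matrix values are illustrative) producing A_n to hand to an SVD routine:

/* Hedged sketch: scale a term-advertiser matrix by the inverse square
   roots of its row and column sums, A_n = D1^{-1/2} A D2^{-1/2}. */
#include <stdio.h>
#include <math.h>
#define T 3   /* terms (rows) */
#define K 2   /* advertisers (columns) */

int main(void) {
    double A[T][K] = {{1, 0}, {2, 1}, {0, 3}};
    double d1[T] = {0}, d2[K] = {0};
    for (int i = 0; i < T; i++)
        for (int j = 0; j < K; j++) { d1[i] += A[i][j]; d2[j] += A[i][j]; }
    for (int i = 0; i < T; i++)
        for (int j = 0; j < K; j++) {
            double an = A[i][j] / sqrt(d1[i] * d2[j]);
            printf("%8.4f%c", an, j == K - 1 ? '\n' : ' ');
        }
    return 0;
}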
2.21. Advertiser - bidded search term matrix
2.22. Advertiser - bidded search term matrix
2.23. Computational considerations
• Large and very sparse matrices
• Only top few eigenvectors needed
• Precision requirements low
• Iterative Krylov subspace methods, Lanczos and Arnoldi algorithms
• Only matrix-vector multiply
3. Web graph link analysis
3.1. PageRank model
• Random walk on the graph
• Markov process: memoryless, homogeneous
• Stationary distribution: existence, uniqueness, convergence
• Perron-Frobenius theorem applies when the chain is irreducible (every state is reachable from every other) and aperiodic (no cycles)
3.2. PageRank model
• Construct the probability matrix:

  P = D^{-1} A, \qquad D = diag(A e)  (diagonal matrix of out-degrees)

• Construct the transition matrix for the Markov process (row-stochastic); d marks the dangling nodes:

  P' = P + d v^T

• Correct reducibility (make the chain irreducible):

  P'' = c P' + (1 - c) e v^T

• The Markov chain stationary distribution exists and is unique (Perron-Frobenius):

  P''^T p = \lambda p
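This construction maps directly onto the power iteration; a self-contained sketch on an assumed 4-page toy graph, with uniform v = e/N and the dangling mass folded back through v:

/* Hedged sketch of the PageRank power iteration p <- P''^T p on a toy
   graph; node 3 is dangling, v = e/N, c = 0.85 (all illustrative). */
#include <stdio.h>
#define N 4
#define M 4

int main(void) {
    int src[M] = {0, 0, 1, 2}, dst[M] = {1, 2, 2, 0};
    double c = 0.85, p[N], pnew[N];
    int outdeg[N] = {0};
    for (int k = 0; k < M; k++) outdeg[src[k]]++;
    for (int i = 0; i < N; i++) p[i] = 1.0 / N;
    for (int it = 0; it < 100; it++) {
        double dangling = 0;
        for (int i = 0; i < N; i++) {
            pnew[i] = 0;
            if (outdeg[i] == 0) dangling += p[i];   /* the d v^T term */
        }
        for (int k = 0; k < M; k++)                 /* c P^T p */
            pnew[dst[k]] += c * p[src[k]] / outdeg[src[k]];
        for (int i = 0; i < N; i++)                 /* teleportation + dangling */
            p[i] = pnew[i] + (c * dangling + 1 - c) / N;
    }
    for (int i = 0; i < N; i++) printf("p[%d] = %f\n", i, p[i]);
    return 0;
}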
3.3. Linear system formulation
• PageRank equation:

  (c P + c\, d v^T + (1 - c)\, e v^T)^T x = \lambda x

• Normalization: e^T x = x^T e = \|x\|_1, \quad \lambda_1 = 1
• Identity: d^T x = \|x\|_1 - \|P^T x\|_1
• Linear system:

  (I - c P^T) x = v (\|x\|_1 - c \|P^T x\|_1)
3.4. Linear System vs Eigensystem
Eigensystem                                   Linear system
P''^T p = \lambda p                           (I - c P^T) x = k(x) v
P'' = c P + c d v^T + (1 - c) e v^T           k(x) = \|x\|_1 - c \|P^T x\|_1
\lambda = 1                                   p = x / \|x\|_1

• Iteration matrices P'' and I - c P^T have different rates of convergence
• The vector v moves to the right-hand side instead of sitting inside the matrix
• More methods are available for linear systems
• The solution is linear with respect to v
3.5. Flowchart of computational methods
3.6. Stationary iterations
• Power iteration:

  p^{(k+1)} = c P'^T p^{(k)} + (1 - c) v

• Jacobi iteration:

  p^{(k+1)} = c P^T p^{(k)} + k v

• Iteration error and residual:

  e^{(k)} = \|x^{(k)} - x^{(k-1)}\|_1, \qquad r^{(k)} = \|b - A x^{(k)}\|_1

• Convergence to error e in k steps:

  k \sim \log(e) / \log(c)
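A sketch of the Jacobi variant under the same toy-graph assumptions as before: the dangling correction never enters the matrix, and x is simply rescaled to a probability vector at the end, using the linearity in v.

/* Hedged sketch of the Jacobi iteration x <- c P^T x + b for
   (I - c P^T) x = b with b = (1-c) v; toy graph, node 3 dangling. */
#include <stdio.h>
#define N 4
#define M 4

int main(void) {
    int src[M] = {0, 0, 1, 2}, dst[M] = {1, 2, 2, 0};
    double c = 0.85, x[N], xnew[N], s = 0;
    int outdeg[N] = {0};
    for (int k = 0; k < M; k++) outdeg[src[k]]++;
    for (int i = 0; i < N; i++) x[i] = 1.0 / N;
    for (int it = 0; it < 100; it++) {
        for (int i = 0; i < N; i++) xnew[i] = (1 - c) / N;  /* b = (1-c)/N */
        for (int k = 0; k < M; k++)                          /* c P^T x */
            xnew[dst[k]] += c * x[src[k]] / outdeg[src[k]];
        for (int i = 0; i < N; i++) x[i] = xnew[i];
    }
    for (int i = 0; i < N; i++) s += x[i];
    for (int i = 0; i < N; i++)
        printf("p[%d] = %f\n", i, x[i] / s);   /* p = x / ||x||_1 */
    return 0;
}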
3.7. Stationary methods convergence
[figure: error metrics for Jacobi convergence vs. iteration: \|x^{(k)} - x^*\|, \|A x^{(k)} - b\|, \|x^{(k)} - x^*\| / \|x^{(k)}\|]
3.8. Krylov subspace methods
• Linear system A x = b, with A = I - c P^T, b = k v
• Residual: r = b - A x
• Krylov subspace:

  K_m = span\{r, A r, A^2 r, A^3 r, ..., A^m r\}

• x_m is built from x_0 + K_m: x_m = x_0 + q_{m-1}(A) r_0
• Only matrix-vector products
• Explicit minimization in the subspace; extra information for the next step
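All of these methods rest on building an orthonormal basis of K_m from matrix-vector products alone; below is a short Arnoldi-process sketch (modified Gram-Schmidt) on an assumed small dense matrix, not the talk's production solver.

/* Hedged sketch of the Arnoldi process: builds orthonormal basis V of
   the Krylov subspace and the small Hessenberg matrix H. */
#include <stdio.h>
#include <math.h>
#define N 4
#define M 3

static void matvec(const double A[N][N], const double *x, double *y) {
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
    }
}

int main(void) {
    double A[N][N] = {{4,1,0,0},{1,3,1,0},{0,1,2,1},{0,0,1,1}};
    double V[M + 1][N], H[M + 1][M] = {{0}};
    double r[N] = {1, 0, 0, 0};        /* starting residual, ||r|| = 1 */
    for (int i = 0; i < N; i++) V[0][i] = r[i];
    for (int k = 0; k < M; k++) {
        double w[N], nrm = 0;
        matvec(A, V[k], w);
        for (int j = 0; j <= k; j++) {  /* orthogonalize against V_0..V_k */
            for (int i = 0; i < N; i++) H[j][k] += V[j][i] * w[i];
            for (int i = 0; i < N; i++) w[i] -= H[j][k] * V[j][i];
        }
        for (int i = 0; i < N; i++) nrm += w[i] * w[i];
        H[k + 1][k] = sqrt(nrm);
        for (int i = 0; i < N; i++) V[k + 1][i] = w[i] / H[k + 1][k];
    }
    printf("H[1][0] = %f\n", H[1][0]);
    return 0;
}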
3.9. Krylov subspace methods
• Generalized Minimal Residual (GMRES): pick x_n ∈ K_n such that \|b - A x_n\| is minimized, i.e. r_n \perp A K_n
• Biconjugate Gradient (BiCG): pick x_n ∈ K_n such that r_n \perp span\{w, A^T w, ..., (A^T)^{n-1} w\}
• Biconjugate Gradient Stabilized (BiCGSTAB)
• Quasi-Minimal Residual (QMR)
• Conjugate Gradient Squared (CGS)
• Chebyshev iterations

Preconditioners

• Convergence depends on cond(A) = \lambda_{max} / \lambda_{min}
• Preconditioner M: M^{-1} A x = M^{-1} b
• Iterate with M^{-1} A, which has a better condition number
• Diagonal preconditioner: M = D
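Applying a diagonal preconditioner costs one elementwise scale per matrix-vector product; a tiny sketch with an assumed dense matrix:

/* Minimal sketch of diagonal (Jacobi) preconditioning: apply the
   preconditioned operator M^{-1}A with M = diag(A) to a vector. */
#include <stdio.h>
#define N 3

int main(void) {
    double A[N][N] = {{4, 1, 0}, {1, 5, 2}, {0, 2, 6}};
    double x[N] = {1, 2, 3}, y[N];
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
        y[i] /= A[i][i];               /* apply M^{-1} = diag(A)^{-1} */
    }
    for (int i = 0; i < N; i++) printf("%g\n", y[i]);
    return 0;
}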
3.10. Krylov subspace methods: convergence
[figure: error metrics for BiCG convergence vs. iteration: \|x^{(k)} - x^*\|, \|A x^{(k)} - b\|, \|x^{(k)} - x^*\| / \|x^{(k)}\|]
3.11. Computational Requirements
Method      IP      SAXPY   MV   Storage
PAGERANK    1       1       1    M + 3v
JACOBI      1       1       1    M + 3v
GMRES       i + 1   i + 1   1    M + (i + 5)v
BiCG        2       5       2    M + 10v
BiCGSTAB    4       6       2    M + 10v
• IP - inner vector products
• SAXPY - scalar times vector plus vector
• MV - matrix vector products
• M - matrix storage, v - vector storage
3.12. Graph statistics
Name       Nodes    Links   Storage Size
bs-cc      20k      130k    1.6 MB
edu        2M       14M     176 MB
yahoo-r2   14M      266M    3.25 GB
uk         18.5M    300M    3.67 GB
yahoo-r3   60M      850M    10.4 GB
db         70M      1B      12.3 GB
av         1.4B     6.6B    80 GB
3.13. Graph statistics
[figures: log-log degree distributions with power-law fits — bs outdegree b = 1.526, bs indegree b = 1.747, y2 outdegree b = 1.454, y2 indegree b = 1.848, db outdegree b = 2.010, db indegree b = 1.870]
3.14. Convergence I
[figures: uk graph convergence — error vs. iteration and error vs. time (sec) for std, jacobi, gmres, bicg, bcgs]
3.15. Convergence II
[figures: db graph convergence — error vs. iteration and error vs. time (sec) for std, jacobi, gmres, bicg, bcgs]
3.16. Convergence on AV graph
[figure: av graph — error/residual vs. time (sec) for std and bcgs]
3.17. PageRank Timing results
Format: iterations (time per iteration / total time)

Graph (procs)          PR                 Jacobi             GMRES              BiCG               BCGS
edu (20)               84 (0.09s/7.56s)   84 (0.07s/5.88s)   21† (0.6s/12.6s)   44* (0.4s/17.6s)   21* (0.4s/8.4s)
yahoo-r2 (20)          71 (1.8s/127s)     65 (1.9s/123s)     12 (16s/192s)      35 (8.6s/301s)     10 (9.9s/99s)
uk (60)                73 (0.09s/6.57s)   71 (0.1s/7.1s)     22* (0.8s/17.6s)   25* (0.80s/20s)    11* (1.0s/11s)
yahoo-r3 (60)          76 (1.6s/122s)     75 (1.5s/112s)
db (60)                62 (9.0s/558s)     58 (8.7s/505s)     29 (15s/435s)      45 (15s/675s)      15* (15s/225s)
av (226)               72 (6.5s/468s)                                                              26 (16.5s/429s)
av, host order (140)   72 (4.6s/331s)                                                              26 (15.0s/390s)
3.18. Dependence on teleportation
[figure: convergence and conditioning for db — error/residual vs. iteration for std and gmres at c = 0.85, 0.90, 0.95, 0.99]
4. Parallel system
4.1. Matrix-Vector multiply
• Iterative process: A x → x
• Every process “owns” several rows of the matrix
• Every process “owns” corresponding part of the vector
• Communications required for multiplication
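A minimal MPI sketch of this row-block scheme (dense storage and a matrix dimension divisible by the number of ranks are simplifying assumptions): each rank gathers the full vector, then multiplies only its own rows. With sparse power-law matrices the gather is replaced by targeted exchanges of just the needed entries, which is where the distribution schemes below matter.

/* Hedged sketch: row-distributed dense matrix-vector multiply. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nproc, N = 8;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    int nloc = N / nproc;                 /* rows owned by this rank */
    double *Aloc = malloc(nloc * N * sizeof(double));
    double *xloc = malloc(nloc * sizeof(double));
    double *x    = malloc(N * sizeof(double));
    double *yloc = malloc(nloc * sizeof(double));
    for (int i = 0; i < nloc; i++) {      /* toy data: A = 2I */
        xloc[i] = 1.0;
        for (int j = 0; j < N; j++)
            Aloc[i * N + j] = (rank * nloc + i == j) ? 2.0 : 0.0;
    }
    /* every rank needs the whole x for its row block */
    MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    for (int i = 0; i < nloc; i++) {
        yloc[i] = 0;
        for (int j = 0; j < N; j++) yloc[i] += Aloc[i * N + j] * x[j];
    }
    printf("rank %d: y[0] = %g\n", rank, yloc[0]);
    free(Aloc); free(xloc); free(x); free(yloc);
    MPI_Finalize();
    return 0;
}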
4.2. Distributed matrices
• Computing:
– Load balancing: equal number of non-zeros per processor
– Minimize communication: smallest number of off-processor elements
• Storage:
– Number of non-zeros per processor
– Number of rows per processor
4.3. Practical data distribution
• Balanced graph partitioning
  – Exact: NP-hard
  – Approximate: multi-resolution, spectral, geometric, ...
• Practical solution
  – Sort the graph in lexicographic order
  – Fill processors consecutively by row, adding rows until

    w_{rows} n_p + w_{nnz} nnz_p > (w_{rows} n + w_{nnz} nnz) / p

    with w_{rows} : w_{nnz} = 1/1, 2/1, 4/1
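The fill rule is a one-pass greedy scan; a small sketch with assumed per-row nonzero counts and w_rows : w_nnz = 2/1:

/* Hedged sketch of the greedy fill rule: assign rows consecutively to
   a processor until its weighted load exceeds the average target. */
#include <stdio.h>

int main(void) {
    int nnz_per_row[10] = {5, 1, 9, 2, 2, 7, 3, 8, 1, 2};
    int n = 10, p = 3, proc = 0;
    double w_rows = 2, w_nnz = 1, nnz = 0, load = 0;
    for (int i = 0; i < n; i++) nnz += nnz_per_row[i];
    double target = (w_rows * n + w_nnz * nnz) / p;  /* average load */
    for (int i = 0; i < n; i++) {
        load += w_rows + w_nnz * nnz_per_row[i];
        printf("row %d -> proc %d\n", i, proc);
        if (load > target && proc < p - 1) { proc++; load = 0; }
    }
    return 0;
}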
4.4. Data distribution schemes
[figure: y2 std parallelization and distribution — time (s) vs. # of processors for the "smart" and "nrows" schemes]
4.5. Implementation details
4.6. Implementation: MPI
• Message Passing Interface (MPI) standard
• Library specification for message-passing
• Message passing = data transfer + synchronization
• MPI_SEND, MPI_RECV
• MPI_Bcast, MPI_Reduce, MPI_Gather, MPI_Scatter
• Implementations: LAM, mpich, Intel, etc.
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);

while (!done) {
    if (myid == 0) {
        printf("Enter the number of intervals: (0 quits) ");
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (n == 0) break;
}
4.7. Implementation: PETSc
• Portable Extensible Toolkit for Scientific Computing
• Implements basic linear algebra operations on distributed matrices.
• Advanced linear and nonlinear solvers

PetscInitialize(&argc, &args, (char *)0, help);
MatCreate(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, &A);
MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);
MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
VecAssemblyBegin(b);
VecAssemblyEnd(b);
MatMult(A, b, x);
4.8. Network topology
[figure: network topology effects — error/residual vs. time (sec) for std-140-full, bcgs-140-full, std-140-star, bcgs-140-star]
4.9. Host Ordering on AV graph
[figure: host order improvement on the av graph — error/residual vs. time (sec) for std-140, bcgs-140, std-140-host, bcgs-140-host]
4.10. Parallel performance
[figure: scaling for computing with the full web — performance increase (percent decrease in time) for std and bcgs]
5. Conclusions
• Eigenvalues everywhere! Linear algebra methods provide provably good solutions to many problems. The methods are very general.
• Power-law graphs with high variance in node degrees present challenges to high performance parallel computing
• Skewed distributions, chains, a central core, and singletons make clustering of power-law data a difficult problem
• Embedding in 1D is probably not sufficient for this type of data; higher dimensions are needed
5.1. References
• Collaborators:
– Kevin Lang, Pavel Berkhin
– David Gleich and Matt Rasmussen
• Publications:
– “Fast Parallel PageRank: A Linear System Approach”, 2004
– “Spectral Clustering of Large Advertiser Datasets”, 2003
– “Clustering of bipartite advertiser-keyword graph”, 2002
• References:
– Spectral graph partitioning: M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995), D. Spielman (1996), F. Chung (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon (2001), A. Ng (2001), H. Zha (2001), C. Ding (2001)
– PageRank computing: S. Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)