numerical linear algebra for data and link analysis

Numerical Linear Algebra for Data and Link Analysis Leonid Zhukov June 9, 2005

Upload: leonid-zhukov

Post on 01-Nov-2014


DESCRIPTION

Talk at Google about spectral graph partitioning and distributed PageRank computation using linear systems

TRANSCRIPT

Page 1: Numerical Linear Algebra for Data and Link Analysis

Numerical Linear Algebra for Data and Link Analysis

Leonid Zhukov, June 9, 2005

Abstract

Modern information retrieval and data mining systems must operate on extremely large datasets and require efficient, robust and scalable algorithms. Numerical linear algebra provides a solid foundation for the development of such algorithms and analysis of their behavior.

In this talk I will discuss several linear algebra based methods and their practical applications:

i) Spectral graph partitioning. I will describe a recursive spectral algorithm for bi-partite graph partitioning and its application to simultaneous clustering of bidded terms and advertisers in pay-for-performance market data. I will also present a new local refinement strategy that allows us to improve cluster quality.

ii) Web graph link analysis. I will discuss a linear system formulation of the PageRank algorithm and the use of Krylov subspace methods for an efficient solution. I will also describe our scalable parallel implementation and present results of numerical experiments for the convergence of iterative methods on multiple graphs with various parameter settings.

In conclusion I will outline some difficulties encountered while developing these applications and address possible solutions and future research directions.


Outline

• Introduction

– Computational science and information retrieval

• Spectral clustering and graph partitioning

– Spectral clustering

– Flow refinement

– Bi-partite spectral and advertiser-term clustering

• Web graph link analysis

– PageRank as linear system

– Krylov subspace methods

– Numerical experiments

• Parallel implementation

– Distributed matrices

– MPI, PETSc, etc

• Conclusion and future work

1. Introduction

1.1. Computational science for information retrieval

• Multiple applications of numerical methods, no specialized algorithms

• Large scale problems

• Practical applications

Scientific Computing | Information Retrieval
Problem in continuum, governed by PDE | Discrete data is given
Discretization for numerical solution, control over resolution | No control over problem size
2D or 3D geometry | High dimensional spaces
Uniform distribution of node degrees | Power-law degree distribution

1.2. Scientific Computing vs Information Retrieval graphs

[Figures: FEM mesh for CFD simulations; artist-artist similarity graph]


2. Spectral Graph Partitioning


2.1. Graph partitioning

• Bisecting the graph, edge separator

Good and balanced cut

• Balanced partition

• “Natural” boundaries partition = clustering

2.2. Metrics - good cut

• Partitioning:

$$\mathrm{cut}(V_1, V_2) = \sum_{i \in V_1,\, j \in V_2} e_{ij}; \qquad \mathrm{assoc}(V_1, V) = \sum_{i \in V_1} d(v_i)$$

• Objective functions:

– Minimal cut:

$$\mathrm{MCut}(V_1, V_2) = \mathrm{cut}(V_1, V_2)$$

– Normalized cut:

$$\mathrm{NCut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)}$$

– Quotient cut:

$$\mathrm{QCut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\min(\mathrm{assoc}(V_1, V), \mathrm{assoc}(V_2, V))}$$
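As a worked example of these metrics, the sketch below evaluates cut, assoc, NCut and QCut for one partition of a small unweighted graph. The graph (two triangles joined by a single edge) and the function name `cut_metrics` are illustrative assumptions, not taken from the talk.

```python
def cut_metrics(edges, part1):
    """Return cut, assoc(V1,V), assoc(V2,V), NCut and QCut for an unweighted graph."""
    part1 = set(part1)
    cut = 0
    assoc1 = assoc2 = 0              # sums of node degrees d(v_i) on each side
    for i, j in edges:
        for node in (i, j):          # every edge adds 1 to the degree of both endpoints
            if node in part1:
                assoc1 += 1
            else:
                assoc2 += 1
        if (i in part1) != (j in part1):
            cut += 1                 # the edge crosses the partition
    ncut = cut / assoc1 + cut / assoc2
    qcut = cut / min(assoc1, assoc2)
    return cut, assoc1, assoc2, ncut, qcut


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
cut, assoc1, assoc2, ncut, qcut = cut_metrics(EDGES, {0, 1, 2})
print(cut, assoc1, assoc2, round(ncut, 4), round(qcut, 4))  # 1 7 7 0.2857 0.1429
```

Splitting along the bridge removes one edge and leaves both sides with total degree 7, so NCut = 2/7 and QCut = 1/7.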

2.3. Graph cuts

• Let $G = (V, E)$ be a graph and $A(G)$ its adjacency matrix

• Let $V = V^+ \cup V^-$ be a partitioning of the nodes

• Let $v = \{+1, -1, +1, \ldots, -1, +1\}^T$ be the indicator vector

[Figure: five nodes labeled −1, −1, +1, +1, +1]

• $v(i) = +1$ if node $i \in V^+$; $v(i) = -1$ if node $i \in V^-$

• Compute the number of edges connecting $V^+$ and $V^-$:

$$\mathrm{cut}(V^+, V^-) = \frac{1}{4} \sum_{e(i,j)} (v(i) - v(j))^2 = \frac{1}{4} v^T L v$$

• $L = D - A$, where $D$ is the diagonal matrix of node degrees

• Minimal cut partitioning - smallest number of edges to remove

• Exact solution of the balanced problem is NP-hard!
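The identity between the edge count and the quadratic form can be checked numerically. The sketch below builds $L = D - A$ for a small hypothetical graph and compares $\frac{1}{4} v^T L v$ with a direct count of crossing edges; the graph and helper names are assumptions made for illustration.

```python
def laplacian(n, edges):
    """Dense Laplacian L = D - A of an unweighted, undirected graph."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L


def quadratic_form(L, v):
    """Compute v^T L v."""
    n = len(v)
    return sum(v[i] * L[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
L = laplacian(6, EDGES)

results = []
for v in ([1, 1, 1, -1, -1, -1], [1, -1, 1, -1, 1, -1]):
    direct = sum(1 for i, j in EDGES if v[i] != v[j])   # edges crossing the cut
    results.append((direct, quadratic_form(L, v) / 4))
print(results)  # [(1, 1.0), (5, 5.0)]
```

Each crossing edge contributes $(\pm 2)^2 = 4$ to the sum, which is where the factor $\frac{1}{4}$ comes from.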

2.4. Spectral method - motivation (from Physics)

• Linear graph - 5 nodes: $x_1, x_2, x_3, x_4, x_5$

• Energy of the system:

$$E = \frac{1}{2} m \sum_i \dot{x}(i)^2 + \frac{1}{2} k \sum_{i,j} (x(i) - x(j))^2$$

• Equations of motion:

$$M \frac{d^2 x}{dt^2} = -k L x$$

• Laplacian matrix 5x5:

$$L = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}$$

2.5. Spectral method - motivation (from Physics)

• Eigenproblem: $L x = \lambda x$

• The second-lowest mode, $\lambda_2 = \omega_2^2$, bisects the string into two equal-sized components
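For the 5-node linear graph this mode can be written down explicitly. The sketch below relies on a standard closed form for the path-graph Laplacian that is not stated in the slides: $\lambda_k = 2 - 2\cos(\pi k / n)$ with eigenvector entries $\cos(\pi k (i + \frac{1}{2}) / n)$. It checks $L x = \lambda x$ for $k = 1$ and prints the sign pattern.

```python
import math


def path_laplacian(n):
    """Laplacian of a path (linear) graph with n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def path_mode(n, k):
    """Closed-form k-th mode of the path Laplacian (k = 0 is the constant mode)."""
    lam = 2.0 - 2.0 * math.cos(math.pi * k / n)
    x = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
    return lam, x


n = 5
L = path_laplacian(n)
lam2, x2 = path_mode(n, 1)
Lx = [sum(L[i][j] * x2[j] for j in range(n)) for i in range(n)]
residual = max(abs(Lx[i] - lam2 * x2[i]) for i in range(n))
signs = ["+" if x > 1e-12 else "-" if x < -1e-12 else "0" for x in x2]
print(round(lam2, 4), signs, residual < 1e-12)
```

The sign pattern is `+ + 0 - -`: the middle node sits on the nodal point and the two ends move in opposite directions, which is the bisection the slide refers to.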

2.6. Spectral method - relaxation

• Discrete problem → continuous problem

• Discrete problem: find

$$\min\left(\frac{1}{4} v^T L v\right)$$

subject to $v(i) = \pm 1$, $\sum_i v(i) = 0$

• Relaxation - continuous problem: find

$$\min\left(\frac{1}{4} x^T L x\right)$$

subject to $\sum_i x(i)^2 = N$, $\sum_i x(i) = 0$

• Every vector that satisfies the exact (discrete) constraints also satisfies the relaxed constraints, but not the other way around!

• Given $x(i)$, round them by $v(i) = \mathrm{sign}(x(i))$

2.7. Spectral method - computations

• Constrained optimization problem:

$$Q(x) = \frac{1}{4} x^T L x - \lambda (x^T x - N)$$

• Additional constraint: $x \perp e$, $e = \{1, 1, 1, \ldots, 1\}$

• Minimization:

$$\min_{x \perp x_1} \left( \frac{1}{4} \frac{x^T L x}{x^T x} \right)$$

• Courant-Fischer Minimax Theorem:

$$L x = \lambda x$$

Looking for the second smallest eigenvalue $\lambda_2$ and its eigenvector $x_2$

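2.8. Family of spectral methods

• Ratio cut:

$$\mathrm{RCut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{|V_1|} + \frac{\mathrm{cut}(V_1, V_2)}{|V_2|}$$

$$(D - A) x = \lambda x$$

• Normalized cut:

$$\mathrm{NCut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)}$$

$$\mathrm{NCut}(V_1, V_2) = 2 - \left( \frac{\mathrm{assoc}(V_1, V_1)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{assoc}(V_2, V_2)}{\mathrm{assoc}(V_2, V)} \right)$$

$$(D - A) x = \lambda D x$$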

2.9. Spectral partitioning algorithm

Algorithm 1
Compute the eigenvector $v_2$ corresponding to $\lambda_2$ of $L(G)$
for all nodes $n$ in $G$ do
    if $v_2(n) < 0$ then
        put node $n$ in partition $V^-$
    else
        put node $n$ in partition $V^+$
    end if
end for
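A minimal sketch of Algorithm 1 in plain Python follows. It approximates $v_2$ by power iteration on the shifted matrix $sI - L$ while projecting out the constant vector. This solver is a simple stand-in chosen for self-containment, not the method used in the talk; the example graph, the function names and the iteration count are assumptions.

```python
def fiedler_vector(n, edges, iters=500):
    """Approximate the eigenvector of L = D - A for the second-smallest eigenvalue."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    shift = 2.0 * max(deg)                      # upper bound on the largest eigenvalue of L
    x = [i - (n - 1) / 2.0 for i in range(n)]   # deterministic start, orthogonal to e
    for _ in range(iters):
        # y = (shift * I - L) x, using only the edge list (one matrix-vector product)
        y = [(shift - deg[i]) * x[i] for i in range(n)]
        for i, j in edges:
            y[i] += x[j]
            y[j] += x[i]
        mean = sum(y) / n
        y = [yi - mean for yi in y]             # remove the constant mode (lambda_1 = 0)
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x


def spectral_bisection(n, edges):
    """Algorithm 1: split the nodes by the sign of v2."""
    v2 = fiedler_vector(n, edges)
    v_minus = {i for i in range(n) if v2[i] < 0}
    v_plus = set(range(n)) - v_minus
    return v_minus, v_plus


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
v_minus, v_plus = spectral_bisection(6, EDGES)
print(sorted(v_minus), sorted(v_plus))
```

The two triangles end up on opposite sides, so the only cut edge is the bridge. Which side is labeled $V^-$ depends on the sign of the computed eigenvector, which is arbitrary.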

2.10. Spectral ordering algorithm

Algorithm 2
Compute the eigenvector $v_2$ corresponding to $\lambda_2$ of $L(G)$
for all nodes $n$ in $G$ do
    sort $n$ according to $v_2(n)$
end for

• Permute columns and rows of $A$ according to the “new” ordering

• Since $\sum_{e(i,j)} (v(i) - v(j))^2$ is minimized ⇒ there are few edges connecting distant $v(i)$ and $v(j)$


2.11. Spectral Example I (good)

2.12. Linear sweep

• Linear sweep: evaluate $\mathrm{NCut}(V_1, V_2)$ and $\mathrm{QCut}(V_1, V_2)$ for every split of the spectral ordering into a prefix $V_1$ and a suffix $V_2$, and keep the best split
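A linear sweep can be sketched as follows: sort the nodes by their embedding value, try every prefix/suffix split, and keep the one with the smallest NCut. The embedding values below are made-up numbers standing in for a computed $v_2$, and the graph is a hypothetical example. The cut is recomputed from scratch at each split for clarity rather than speed, and the sketch assumes no isolated nodes.

```python
def linear_sweep(edges, embedding):
    """Return (best NCut, left node set) over all splits of the sorted embedding."""
    n = len(embedding)
    order = sorted(range(n), key=lambda i: embedding[i])
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    total = sum(deg)
    best = None
    left, assoc_left = set(), 0
    for split in range(n - 1):                 # keep at least one node on the right
        node = order[split]
        left.add(node)
        assoc_left += deg[node]
        cut = sum(1 for i, j in edges if (i in left) != (j in left))
        ncut = cut / assoc_left + cut / (total - assoc_left)
        if best is None or ncut < best[0]:
            best = (ncut, set(left))
    return best


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
EMBEDDING = [1.0, 1.0, 0.56, -0.56, -1.0, -1.0]   # assumed 1D spectral embedding
best_ncut, best_left = linear_sweep(EDGES, EMBEDDING)
print(round(best_ncut, 4), sorted(best_left))      # 0.2857 [3, 4, 5]
```

Replacing the NCut expression with `cut / min(assoc_left, total - assoc_left)` gives the QCut variant of the sweep.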


2.13. Spectral Example II (not so good)

2.14. Flow refinement

Set up and solve a minimum S-T cut problem

• Divide the nodes into 3 sets according to the embedding ordering

• Set up an s-t max flow problem with one set of nodes pinned to the source and another to the sink with infinite-capacity links

• Solve to obtain the S-T min cut (min-cut max-flow theorem, find the saturated frontier)

• Move the partition
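The steps above can be sketched with a small augmenting-path max-flow solver (breadth-first search, Edmonds-Karp style). The choice of solver, the weighted chain used as input, and the `pinned` parameter (how many nodes at each end of the ordering are tied to the source and sink) are assumptions for illustration. Nodes are assumed to be numbered 0..n-1.

```python
from collections import deque


def flow_refine(order, weighted_edges, pinned):
    """Pin the first/last `pinned` nodes of `order` to source/sink and return
    (max flow value, set of nodes on the source side of the min cut)."""
    n = len(order)
    s, t = n, n + 1                              # extra super-source and super-sink
    INF = float("inf")
    cap = [dict() for _ in range(n + 2)]         # residual capacities

    def add_arc(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)                  # reverse (residual) arc

    for u, v, w in weighted_edges:               # undirected edge: capacity both ways
        add_arc(u, v, w)
        add_arc(v, u, w)
    for u in order[:pinned]:
        add_arc(s, u, INF)
    for u in order[-pinned:]:
        add_arc(u, t, INF)

    flow = 0
    while True:
        parent = {s: None}                       # BFS tree in the residual graph
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break                                # no augmenting path: flow is maximal
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
    source_side = {u for u in parent if u < n}   # nodes still reachable from the source
    return flow, source_side


# Hypothetical chain in embedding order; the weakest link is the edge (4, 5).
ORDER = list(range(9))
W_EDGES = [(0, 1, 3), (1, 2, 3), (2, 3, 3), (3, 4, 2), (4, 5, 1), (5, 6, 3), (6, 7, 3), (7, 8, 3)]
flow, source_side = flow_refine(ORDER, W_EDGES, pinned=3)
print(flow, sorted(source_side))                 # 1 [0, 1, 2, 3, 4]
```

Only the middle third of the ordering is free to move, so the refined boundary lands on the saturated edge (4, 5) wherever the initial split was.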

2.15. Flow refinement

Metric | First panel | Second panel
cut(A,B) | 171 | 70
QCut | 0.0108 | 0.0053
NCut | 0.0206 | 0.0088
part size | 1433 | 1195

2.16. Flow refinement

Metric | First panel | Second panel
cut(A,B) | 11605 | 36688
QCut | 0.242 | 0.160
NCut | 0.267 | 0.296
part size | 266 | 1103

2.17. Recursive spectral

• tree → flat clusters


2.18. Example: Recursive Spectral

2.19. Data: Advertiser - bidded term data

[Figure: matrix $A$ with Terms ($t_i$) and Advertisers ($a_j$) along its two axes]

• Simultaneous clustering of advertisers and bidded terms (co-clustering)

• Bi-partite graph partitioning problem

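2.20. Bi-partite graph case

• Adjacency matrix for the bipartite graph:

$$\mathbf{A} = \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$$

• Eigensystem:

$$\begin{pmatrix} D_1 & -A \\ -A^T & D_2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

• Normalization:

$$A_n = D_1^{-1/2} A D_2^{-1/2}$$

$$A_n v = (1 - \lambda) u$$

$$A_n^T u = (1 - \lambda) v$$

• SVD decomposition:

$$A_n = u \sigma v^T, \quad \sigma = 1 - \lambda$$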
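The sketch below illustrates the SVD view on a tiny hypothetical matrix. It assumes $u = D_1^{1/2} x$ and $v = D_2^{1/2} y$, which makes $v_1 \propto D_2^{1/2} e$ the trivial singular vector with $\sigma_1 = 1$ ($\lambda = 0$). The second singular pair is found by power iteration on $A_n^T A_n$ with $v_1$ deflated, and rows and columns are then co-clustered by sign. The matrix, the solver and all names are illustrative assumptions.

```python
def normalized_matrix(A):
    """Return A_n = D1^{-1/2} A D2^{-1/2} with row sums d1 and column sums d2."""
    m, n = len(A), len(A[0])
    d1 = [sum(row) for row in A]
    d2 = [sum(A[i][j] for i in range(m)) for j in range(n)]
    An = [[A[i][j] / (d1[i] * d2[j]) ** 0.5 for j in range(n)] for i in range(m)]
    return An, d1, d2


def second_singular_pair(A, iters=500):
    """Second-largest singular triple (sigma, u, v) of A_n via deflated power iteration."""
    An, d1, d2 = normalized_matrix(A)
    m, n = len(d1), len(d2)
    scale = sum(d2) ** 0.5
    v1 = [d ** 0.5 / scale for d in d2]          # trivial right singular vector, sigma = 1
    v = [float(j + 1) for j in range(n)]         # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, v1))
        v = [a - proj * b for a, b in zip(v, v1)]                        # deflate v1
        u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]   # u = A_n v
        v = [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)]   # v = A_n^T u
        norm = sum(a * a for a in v) ** 0.5
        v = [a / norm for a in v]
    u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = sum(a * a for a in u) ** 0.5
    u = [a / sigma for a in u]
    return sigma, u, v


# Hypothetical 4 x 4 matrix: two dense 2 x 2 blocks linked by the single entry A[1][2].
A = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
sigma, u, v = second_singular_pair(A)
row_side = [a > 0 for a in u]
col_side = [a > 0 for a in v]
print(round(sigma, 4), row_side, col_side)
```

Rows 0 and 1 fall on the same side as columns 0 and 1, and rows 2 and 3 with columns 2 and 3, so both sides of the bipartite graph are clustered together. In this example $\sigma_2 \approx 0.86$, i.e. $\lambda_2 = 1 - \sigma_2 \approx 0.14$.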


2.21. Advertiser - bidded search term matrix


2.22. Advertiser - bidded search term matrix


2.23. Computational consideration

• Large and very sparse matrices

• Only top few eigenvectors needed

• Precision requirements low

• Iterative Krylov subspace methods, Lanczos and Arnoldi algorithms

• Only matrix-vector multiply


3. Web graph link analysis

3.1. PageRank model

• Random walk on the graph

• Markov process: memoryless, homogeneous

• Stationary distribution: existence, uniqueness, convergence

• Perron-Frobenius theorem: the chain must be irreducible (every state is reachable from every other) and aperiodic (no fixed-period cycles)

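3.2. PageRank model

• Construct probability matrix, with $D$ the diagonal matrix of out-degrees

$$P = D^{-1} A, \quad D = \mathrm{diag}(A e)$$

• Construct transition matrix for Markov process (row-stochastic), with $d$ the indicator vector of dangling nodes and $v$ a probability distribution

$$P' = P + d v^T$$

• Correct reducibility (make the chain irreducible), with teleportation coefficient $c$ and $e$ the vector of ones

$$P'' = c P' + (1 - c)(e v^T)$$

• Markov chain stationary distribution exists and is unique (Perron-Frobenius)

$$P''^T p = \lambda p$$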
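A minimal sketch of this construction on a 4-node graph with one dangling node follows. It builds $P''$ as a dense matrix, checks that every row sums to one, and applies $P''^T$ repeatedly to approximate the stationary distribution. The uniform choice of $v$, the example adjacency matrix and the function names are assumptions; a dense matrix is only practical for toy sizes.

```python
def google_matrix(adj, c=0.85):
    """Dense P'' = c (P + d v^T) + (1 - c) e v^T for a 0/1 adjacency matrix."""
    n = len(adj)
    v = [1.0 / n] * n                            # uniform teleportation vector (assumption)
    P2 = []
    for row in adj:
        out_degree = sum(row)
        if out_degree:
            p_row = [a / out_degree for a in row]    # row of P = D^{-1} A
        else:
            p_row = list(v)                          # dangling node: row of d v^T
        P2.append([c * pij + (1 - c) * vj for pij, vj in zip(p_row, v)])
    return P2


def stationary(P2, iters=200):
    """Approximate the solution of P''^T p = p by repeated multiplication."""
    n = len(P2)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(P2[i][j] * p[i] for i in range(n)) for j in range(n)]
    return p


# Hypothetical graph: 0 -> 1, 2, 3; 1 -> 2; 2 -> 0; node 3 is dangling.
ADJ = [[0, 1, 1, 1],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 0]]
P2 = google_matrix(ADJ)
p = stationary(P2)
print([round(sum(row), 12) for row in P2], [round(x, 4) for x in p])
```

Because every row of $P''$ sums to one, the iteration preserves $\sum_i p(i) = 1$ and the result can be read directly as a probability distribution over nodes.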

3.3. Linear system formulation

• PageRank equation

$$(cP + c(d v^T) + (1 - c)(e v^T))^T x = \lambda x$$

• Normalization: $(e^T x) = (x^T e) = \lVert x \rVert_1$, $\lambda_1 = 1$

• Identity: $(d^T x) = \lVert x \rVert_1 - \lVert P^T x \rVert_1$

• Linear system

$$(I - cP^T) x = v\,(\lVert x \rVert_1 - c \lVert P^T x \rVert_1)$$
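The sketch below solves $(I - cP^T) x = v$ with a simple fixed-point sweep, normalizes the result, and checks that it satisfies the PageRank equation with $\lambda = 1$. Since the right-hand side is $v$ times a scalar, solving with $v$ itself and normalizing afterwards gives the same $p$. The adjacency lists, the uniform $v$ and the function names are assumptions made for illustration.

```python
def pagerank_linear_system(out_links, c=0.85, tol=1e-14, max_iter=1000):
    """Solve (I - c P^T) x = v by the sweep x <- c P^T x + v, then normalize."""
    n = len(out_links)
    v = [1.0 / n] * n                            # uniform teleportation vector (assumption)
    x = list(v)
    for _ in range(max_iter):
        x_new = list(v)
        for i, targets in enumerate(out_links):
            for j in targets:                    # adds c * (P^T x)_j, one edge at a time
                x_new[j] += c * x[i] / len(targets)
        diff = sum(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if diff < tol:
            break
    total = sum(x)
    return [xi / total for xi in x]


def pagerank_residual(out_links, p, c=0.85):
    """1-norm of (cP + c d v^T + (1 - c) e v^T)^T p - p."""
    n = len(out_links)
    v = [1.0 / n] * n
    dangling_mass = sum(p[i] for i, t in enumerate(out_links) if not t)   # d^T p
    y = [(c * dangling_mass + (1 - c) * sum(p)) * vj for vj in v]
    for i, targets in enumerate(out_links):
        for j in targets:
            y[j] += c * p[i] / len(targets)
    return sum(abs(a - b) for a, b in zip(y, p))


# Hypothetical graph: 0 -> 1, 2, 3; 1 -> 2; 2 -> 0; node 3 is dangling.
OUT_LINKS = [[1, 2, 3], [2], [0], []]
p = pagerank_linear_system(OUT_LINKS)
print([round(x, 4) for x in p], pagerank_residual(OUT_LINKS, p) < 1e-12)
```

The dangling-node correction $d v^T$ never appears in the solver: it only changes the scalar in front of $v$, which the final normalization absorbs.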

3.4. Linear System vs Eigensystem

Eigensystem | Linear system
$P''^T p = \lambda p$ | $(I - cP^T) x = k(x)\, v$
$P'' = cP + c(d v^T) + (1 - c)(e v^T)$ | $k(x) = \lVert x \rVert_1 - c \lVert P^T x \rVert_1$
$\lambda = 1$ | $p = x / \lVert x \rVert_1$

• Iteration matrices: $P''$, $I - cP^T$ - different rate of convergence

• Vector $v$ - rhs or in the matrix

• More methods available for linear system

• Solution is linear with respect to $v$


3.5. Flowchart of computational methods

3.6. Stationary iterations

• Power iterations:

$$p^{(k+1)} = c P'^T p^{(k)} + (1 - c) v$$

• Jacobi iterations (the scalar $k$ multiplying $v$ is the factor $k(x)$ from 3.4, not the iteration index):

$$p^{(k+1)} = c P^T p^{(k)} + k v$$

• Iteration error:

$$e^{(k)} = \lVert x^{(k)} - x^{(k-1)} \rVert_1, \qquad r^{(k)} = \lVert b - A x^{(k)} \rVert_1$$

• Convergence in $k$ steps:

$$k \sim \log(e) / \log(c)$$
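The estimate can be turned into numbers directly. The sketch below evaluates it for a target error of $10^{-8}$ and the teleportation values that appear later in the talk; the function name and the choice of tolerance are assumptions.

```python
import math


def iterations_needed(eps, c):
    """Smallest k with c**k <= eps, following k ~ log(eps) / log(c)."""
    return math.ceil(math.log(eps) / math.log(c))


estimates = {c: iterations_needed(1e-8, c) for c in (0.85, 0.90, 0.95, 0.99)}
print(estimates)  # {0.85: 114, 0.9: 175, 0.95: 360, 0.99: 1833}
```

The count grows roughly like $1 / (1 - c)$, which is why values of $c$ close to 1 are expensive for stationary iterations.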

3.7. Stationary methods convergence

[Figure: "Error Metrics for Jacobi Convergence". x-axis: Iteration (0 to 80); y-axis: Metric Value (log scale, $10^{-8}$ to $10^{0}$); curves: $\lVert x^{(k)} - x^* \rVert$, $\lVert A x^{(k)} - b \rVert$, $\lVert x^{(k)} - x^* \rVert / \lVert x^{(k)} \rVert$]

3.8. Krylov subspace methods

• Linear system $A x = b$, $A = I - cP^T$, $b = k v$

• Residual $r = b - A x$

• Krylov subspace

$$K_m = \mathrm{span}\{r, A r, A^2 r, A^3 r, \ldots, A^{m-1} r\}$$

• $x_m$ is built from $x_0 + K_m$, $x_m = x_0 + q_{m-1}(A) r_0$

• Only matrix-vector products

• Explicit minimization in subspace, extra information for next step

3.9. Krylov subspace methods

• Generalized Minimal Residual (GMRES): pick $x_n \in K_n$ such that $\min \lVert b - A x_n \rVert$, $r_n \perp A K_n$

• Biconjugate Gradient (BiCG): pick $x_n \in K_n$ such that $r_n \perp \mathrm{span}\{w, A^T w, \ldots, (A^T)^{n-1} w\}$

• Biconjugate Gradient Stabilized (BiCGSTAB)

• Quasi-Minimal Residual (QMR)

• Conjugate Gradient Squared (CGS)

• Chebyshev Iterations

Preconditioners

• Convergence depends on $\mathrm{cond}(A) = \lambda_{max} / \lambda_{min}$

• Preconditioner $M$: $M^{-1} A x = M^{-1} b$

• Iterate $M^{-1} A$ - better condition number

• Diagonal preconditioner $M = D$
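As an illustration of one of the Krylov methods listed above, here is a textbook, unpreconditioned BiCGSTAB iteration applied to $(I - cP^T) x = v$. It touches the matrix only through matrix-vector products. This is a plain-Python sketch, not the parallel implementation described in the talk; the synthetic 30-node graph, the uniform $v$, the tolerance and all function names are assumptions.

```python
def matvec(out_links, x, c):
    """y = (I - c P^T) x for P = D^{-1} A given as adjacency lists."""
    y = list(x)
    for i, targets in enumerate(out_links):
        for j in targets:
            y[j] -= c * x[i] / len(targets)
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bicgstab_pagerank(out_links, c=0.85, tol=1e-10, max_iter=200):
    """Solve (I - c P^T) x = v with BiCGSTAB; return (x / ||x||_1, iterations)."""
    n = len(out_links)
    b = [1.0 / n] * n                  # right-hand side v: uniform (assumption)
    x = [0.0] * n
    r = list(b)                        # r0 = b - A x0 with x0 = 0
    r_hat = list(r)                    # fixed shadow residual
    rho = alpha = omega = 1.0
    p = [0.0] * n
    q = [0.0] * n
    iterations = 0
    for iterations in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * qi) for ri, pi, qi in zip(r, p, q)]
        q = matvec(out_links, p, c)    # first matrix-vector product
        alpha = rho_new / dot(r_hat, q)
        s = [ri - alpha * qi for ri, qi in zip(r, q)]
        if dot(s, s) ** 0.5 < tol:     # converged after the first half-step
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            break
        t = matvec(out_links, s, c)    # second matrix-vector product
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if dot(r, r) ** 0.5 < tol:
            break
    total = sum(x)
    return [xi / total for xi in x], iterations


# Synthetic graph (assumption): a ring with chords; every 7th node is dangling.
N = 30
LINKS = [[] if i % 7 == 6 else sorted({(i + 1) % N, (3 * i + 2) % N} - {i}) for i in range(N)]
p, iterations = bicgstab_pagerank(LINKS)
print(iterations, round(sum(p), 12))
```

Each iteration costs two matrix-vector products and a handful of inner products and vector updates, with no access to individual matrix entries. Breakdown safeguards (for example a vanishing `rho_new`) are omitted to keep the sketch short.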

3.10. Krylov subspace methods: convergence

[Figure: "Error Metrics for BiCG Convergence". x-axis: Iteration (0 to 45); y-axis: Metric Value (log scale, $10^{-8}$ to $10^{2}$); curves: $\lVert x^{(k)} - x^* \rVert$, $\lVert A x^{(k)} - b \rVert$, $\lVert x^{(k)} - x^* \rVert / \lVert x^{(k)} \rVert$]

3.11. Computational Requirements

Method | IP | SAXPY | MV | Storage
PAGERANK | | 1 | 1 | M + 3v
JACOBI | | 1 | 1 | M + 3v
GMRES | i + 1 | i + 1 | 1 | M + (i + 5)v
BiCG | 2 | 5 | 2 | M + 10v
BiCGSTAB | 4 | 6 | 2 | M + 10v

• IP - inner vector products

• SAXPY - scalar times vector plus vector

• MV - matrix vector products

• M - matrix, v - vector

3.12. Graph statistics

Name | Nodes | Links | Storage Size
bs-cc | 20k | 130k | 1.6 MB
edu | 2M | 14M | 176 MB
yahoo-r2 | 14M | 266M | 3.25 GB
uk | 18.5M | 300M | 3.67 GB
yahoo-r3 | 60M | 850M | 10.4 GB
db | 70M | 1B | 12.3 GB
av | 1.4B | 6.6B | 80 GB

3.13. Graph statistics

[Figure: six log-log degree distribution plots, each titled with a value b: bs outdegree, b = 1.526; bs indegree, b = 1.747; y2 outdegree, b = 1.454; y2 indegree, b = 1.848; db outdegree, b = 2.010; db indegree, b = 1.870]

3.14. Convergence I

[Figure: two panels for the uk graph. "uk iteration convergence": Error (log scale, $10^{-8}$ to $10^{0}$) vs Iteration (0 to 80). "uk time convergence": Error (log scale, $10^{-8}$ to $10^{0}$) vs Time (sec) (0 to 40). Curves in both panels: std, jacobi, gmres, bicg, bcgs]

3.15. Convergence II

[Figure: two panels for the db graph. "db iteration convergence": Error (log scale, $10^{-8}$ to $10^{0}$) vs Iteration (0 to 70). "db time convergence": Error (log scale, $10^{-8}$ to $10^{0}$) vs Time (sec) (0 to 700). Curves in both panels: std, jacobi, gmres, bicg, bcgs]

3.16. Convergence on AV graph

[Figure: "av time". x-axis: Time (sec) (0 to 500); y-axis: error/residual (log scale, $10^{-8}$ to $10^{0}$); curves: std, bcgs]

3.17. PageRank Timing results

For each graph, the first row gives iteration counts and the second row gives time per iteration / total time (for example, 84 × 0.09s = 7.56s).

Graph | PR | Jacobi | GMRES | BiCG | BCGS
edu | 84 | 84 | 21† | 44∗ | 21∗
20 procs | 0.09s / 7.56s | 0.07s / 5.88s | 0.6s / 12.6s | 0.4s / 17.6s | 0.4s / 8.4s
yahoo-r2 | 71 | 65 | 12 | 35 | 10
20 procs | 1.8s / 127s | 1.9s / 123s | 16s / 192s | 8.6s / 301s | 9.9s / 99s
uk | 73 | 71 | 22∗ | 25∗ | 11∗
60 procs | 0.09s / 6.57s | 0.1s / 7.1s | 0.8s / 17.6s | 0.80s / 20s | 1.0s / 11s
yahoo-r3 | 76 | 75 | | |
60 procs | 1.6s / 122s | 1.5s / 112s | | |
db | 62 | 58 | 29 | 45 | 15∗
60 procs | 9.0s / 558s | 8.7s / 505s | 15s / 435s | 15s / 675s | 15s / 225s
av | 72 | | | | 26
226 procs | 6.5s / 468s | | | | 16.5s / 429s
av (host order) | 72 | | | | 26
140 procs | 4.6s / 331s | | | | 15.0s / 390s

3.18. Dependence on teleportation

[Figure: "Convergence and Conditioning for db". x-axis: Iteration (0 to 200); y-axis: Error/Residual (log scale, $10^{-8}$ to $10^{2}$); curves: std and gmres for c = 0.85, c = 0.90, c = 0.95, c = 0.99]


4. Parallel system

4.1. Matrix-Vector multiply

• Iterative process $A x \to x$

• Every process “owns” several rows of the matrix

• Every process “owns” corresponding part of the vector

• Communications required for multiplication
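The ownership model can be simulated in a single Python process. In the sketch below each "process" owns a contiguous block of rows and the matching slice of $x$; the function computes its part of $y = A x$ and counts how many vector entries it would have to fetch from other processes. The sparse row format, the example matrix and the names are illustrative assumptions, and no real message passing takes place.

```python
def distributed_matvec(rows, x, owners):
    """Simulate y = A x with block-row ownership.

    rows:   list of {column: value} dicts, one per matrix row
    owners: list of (start, end) row ranges, one per process
    Returns y and, per process, the number of distinct remote x entries needed.
    """
    y = [0.0] * len(rows)
    remote_reads = []
    for start, end in owners:
        needed = set()
        for i in range(start, end):              # rows owned by this process
            for j, a in rows[i].items():
                if not (start <= j < end):
                    needed.add(j)                # x[j] lives on another process
                y[i] += a * x[j]
        remote_reads.append(len(needed))
    return y, remote_reads


# Hypothetical 6 x 6 sparse matrix split between two processes.
ROWS = [{0: 2.0, 1: 1.0}, {0: 1.0, 2: 3.0}, {1: 1.0, 4: 1.0},
        {3: 1.0, 0: 2.0}, {4: 1.0, 5: 1.0}, {2: 1.0, 5: 2.0}]
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y, remote_reads = distributed_matvec(ROWS, X, [(0, 3), (3, 6)])
print(y, remote_reads)  # [4.0, 10.0, 7.0, 6.0, 11.0, 15.0] [1, 2]
```

The `remote_reads` counts are the communication volume per product. They depend only on the sparsity pattern and on how rows are assigned to processes.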

4.2. Distributed matrices

• Computing:

– Load balancing: equal number of non-zeros per processor

– Minimize communications: smallest number of off-processor elements

• Storage:

– Number of non-zeros per processor

– Number of rows per processor

4.3. Practical data distribution

• Balanced graph partitioning

– Exact - NP hard

– Approximate - multi-resolution, spectral, geometric

• Practical solution

– Sort graph in lexicographic order

– Fill processors consecutively by row, adding rows until

$$w_{rows}\, n_p + w_{nnz}\, nnz_p > (w_{rows}\, n + w_{nnz}\, nnz) / p$$

with $w_{rows} : w_{nnz} = 1/1, 2/1, 4/1$
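The filling rule can be sketched in a few lines. Rows are assigned to the current processor until its weighted load exceeds the per-processor budget, and the last processor takes whatever remains. The row sizes in the example and the function name are assumptions; the treatment of the last processor is one possible reading of the rule.

```python
def distribute_rows(row_nnz, p, w_rows=1.0, w_nnz=1.0):
    """Split rows 0..n-1 into p consecutive ranges [start, end) by weighted load."""
    n = len(row_nnz)
    nnz = sum(row_nnz)
    budget = (w_rows * n + w_nnz * nnz) / p      # right-hand side of the inequality
    parts, start, load = [], 0, 0.0
    for i, k in enumerate(row_nnz):
        load += w_rows + w_nnz * k               # w_rows * n_p + w_nnz * nnz_p so far
        if load > budget and len(parts) < p - 1:
            parts.append((start, i + 1))         # this processor is full
            start, load = i + 1, 0.0
    parts.append((start, n))                     # last processor takes the remainder
    return parts


# Hypothetical non-zero counts per row, 3 processors, weights 1 : 1.
ROW_NNZ = [8, 1, 1, 1, 1, 4, 4, 2, 1, 1]
parts = distribute_rows(ROW_NNZ, 3)
print(parts)  # [(0, 3), (3, 7), (7, 10)]
```

With weights 1 : 1 the budget is (10 + 24) / 3 ≈ 11.3, so the heavy first row pulls the first boundary forward compared with an equal-rows split. Raising `w_rows` shifts the balance toward equal row counts.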

4.4. Data distribution schemes

[Figure: "y2 std parallelization and distribution". x-axis: # of processors (5 to 30); y-axis: time, s (50 to 400); curves: smart, nrows]


4.5. Implementation details

4.6. Implementation: MPI

• Message Passing Interface (MPI) standard

• Library specification for message-passing

• Message passing = data transfer + synchronization

• MPI_Send, MPI_Recv

• MPI_Bcast, MPI_Reduce, MPI_Gather, MPI_Scatter

• Implementations: LAM, mpich, Intel, etc.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int n = 0, numprocs, myid, done = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        while (!done) {
            if (myid == 0) {    /* only rank 0 reads the input */
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            /* rank 0 broadcasts n to every process */
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;
        }

        MPI_Finalize();
        return 0;
    }

4.7. Implementation: PETSc

• Portable, Extensible Toolkit for Scientific Computation

• Implements basic linear algebra operations on distributed matrices.

• Advanced linear and nonlinear solvers

    PetscInitialize(&argc, &args, (char *)0, help);

    /* create a distributed N x N matrix; PETSc decides the local sizes */
    MatCreate(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, &A);
    /* add the 4 x 4 block of values Ke at rows and columns idx */
    MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);

    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    VecAssemblyBegin(b);
    VecAssemblyEnd(b);

    /* x = A b */
    MatMult(A, b, x);

4.8. Network topology

[Figure: "Network Topology Effects". x-axis: Time (sec) (0 to 2500); y-axis: error/residual (log scale, $10^{-8}$ to $10^{0}$); curves: std-140-full, bcgs-140-full, std-140-star, bcgs-140-star]

4.9. Host Ordering on AV graph

[Figure: "Host Order Improvement". x-axis: Time (sec) (0 to 900); y-axis: error/residual (log scale, $10^{-8}$ to $10^{0}$); curves: std-140, bcgs-140, std-140-host, bcgs-140-host]

4.10. Parallel performance

[Figure: "Scaling for computing with full-web". x-axis: 90% to 170% (axis label not recoverable); y-axis: Performance Increase (Percent decrease in time), 0% to 250%; curves: std, bcgs]

5. Conclusions

• Eigenvalues everywhere! Linear algebra methods provide provably good solutions to many problems. Methods are very general.

• Power-law graphs with high variance in node degrees present challenges to high performance parallel computing

• Skewed distribution, chains, central core, and singletons make clustering of power-law data a difficult problem

• Embedding in 1D is probably not sufficient for this type of data; higher dimensions are needed


5.1. References

• Collaborators:

– Kevin Lang, Pavel Berkhin

– David Gleich and Matt Rasmussen

• Publications:

– “Fast Parallel PageRank: A Linear System Approach”, 2004

– “Spectral Clustering of Large Advertiser Datasets”, 2003

– “Clustering of bipartite advertiser-keyword graph”, 2002

• References:

– Spectral graph partitioning: M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995), D. Spielman (1996), F. Chang (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon (2001), A. Ng (2001), H. Zha (2001), C. Ding (2001)

– PageRank computing: S. Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)