Towards Scaling Fully Personalized PageRank

Dániel Fogaras, Balázs Rácz

Budapest University of Technology and Economics

Computer and Automation Research Institute of the

Hungarian Academy of Sciences

Problem formulation

PageRank (Brin, Page, ’98)
PV: PageRank vector; r: uniform distribution vector
Overall quality measure of Web pages
Pre-computation: evaluate PV by power iteration
Query: order results by PV

Personalized PageRank (Brin, Page, ’98)
r: preference vector of a user, query dependent
PPV(r) := PV, a personalized quality measure of Web pages
Pre-computation: r is not known. What to compute?
Query: power iteration. 5 hours/query!

PV = (1-c)∙PV∙M + c∙r
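The power iteration mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `personalized_pagerank`, the toy graph, and the value c = 0.15 are assumptions, and the sketch further assumes that every vertex has at least one out-link so that M is row-stochastic.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=100):
    """Iterate PV <- (1-c)*PV*M + c*r, where M is the row-normalized
    adjacency matrix given implicitly by the out_links lists.

    Assumption: every vertex has at least one out-link.
    """
    n = len(out_links)
    pv = list(r)
    for _ in range(iterations):
        # The c*r term of the equation.
        nxt = [c * r[v] for v in range(n)]
        # The (1-c)*PV*M term: each vertex splits its score among its out-links.
        for u, targets in enumerate(out_links):
            share = (1 - c) * pv[u] / len(targets)
            for v in targets:
                nxt[v] += share
        pv = nxt
    return pv


# Hypothetical toy graph: vertex i links to the vertices in graph[i].
graph = [[1, 2], [2], [0]]

# Ordinary PageRank: r is the uniform distribution.
pv = personalized_pagerank(graph, [1 / 3] * 3)

# Personalized PageRank for a user who prefers vertex 0 only.
ppv0 = personalized_pagerank(graph, [1.0, 0.0, 0.0])
```

On a web-scale graph each iteration is a full pass over the edges, which is why running this at query time for an arbitrary r is impractical.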

Preliminaries

Linearity: PPV(α1∙r1 + … + αk∙rk) = α1∙PPV(r1) + … + αk∙PPV(rk)

Full personalization
Pre-compute PPV(ri) for all pages
V^2 disk, V∙(V+E) time, where V ≈ 10^9, E ≈ 10^10 ???

Topic-Sensitive PageRank (Haveliwala ’01)
Linearity
Pre-compute PPV(ri) for a topical basis r1,…,rk, k ≈ 20
Query: user submits a topic by α1,…,αk
Query engine combines PPV(ri) vectors

Scaling Personalized Web Search (Jeh, Widom, ’03)
Decomposition, linearity
Pre-compute PPV(ri) for unit vectors r1,…,rk, corresponding to k ≈ 10,000 pages
Query: personalization over the 10,000 pages

Speaker note: Should we talk about the database size? Given linearity, it might seem trivial to pre-compute every PPV(u).

Towards full personalization

Our algorithm
Monte Carlo simulation, not power iteration
Pre-compute approximate PPV(ri) for all unit vectors r1,…,rk, k = number of pages
Scalability: quasi-linear pre-computation and sub-linear query

Main points of this presentation
Outline of the algorithm
Pre-computation: external-memory, distributed
Query: used to increase precision
Error of approximation tends to zero exponentially
Exact vs. approximated PPV: space lower bounds


Outline of the Algorithm

Theorem (Jeh, Widom ’03, F ’03)

Random walk starts from page u
Uniform step with probability 1-c, stops with probability c
PPV(u,v) = Pr{ the walk stops at page v }

Monte Carlo algorithm
Pre-computation
From u simulate N independent random walks
Database of fingerprints: ending vertices of the walks from all vertices
Query
PPV(u,v) := #( walks u→v ) / N
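A minimal in-memory sketch of the Monte Carlo algorithm above. The names `simulate_fingerprints` and `query`, the toy graph, and c = 0.15 are illustrative assumptions; the sketch also assumes every vertex has at least one out-link and ignores the external-memory concerns of a real web graph.

```python
import random
from collections import Counter


def simulate_fingerprints(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: for each vertex u, store the ending vertices
    (fingerprints) of n_walks independent random walks started at u.

    Assumption: every vertex has at least one out-link.
    """
    rng = random.Random(seed)
    fingerprints = []
    for u in range(len(out_links)):
        ends = []
        for _ in range(n_walks):
            v = u
            # Continue with probability 1-c, stop with probability c.
            while rng.random() >= c:
                v = rng.choice(out_links[v])
            ends.append(v)
        fingerprints.append(ends)
    return fingerprints


def query(fingerprints, u):
    """Query: PPV(u,v) := #(walks u -> v) / N, for every v that was hit."""
    ends = fingerprints[u]
    return {v: k / len(ends) for v, k in Counter(ends).items()}


# Hypothetical toy graph: vertex i links to the vertices in graph[i].
graph = [[1, 2], [2], [0]]
db = simulate_fingerprints(graph, n_walks=4000)
estimate = query(db, 0)  # approximate PPV(0, .) as a dict {vertex: probability}
```

The query touches only the N fingerprints stored for u, so its cost does not depend on the size of the graph.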


External memory pre-computation

Goal: N independent random walks from each vertex

Input: webgraph, V ≈ 10^9, E ≈ 10^10
V+E > memory

Accessing the edges
Edge scan: stream access
Edges sorted by source vertices


External memory pre-computation (2)

Goal: N independent random walks from each vertex

Simulate all walks together
Iteration: 1 blink = 1 edge scan
Sort path ends
Merge with the sorted graph
Each walk stops with prob. c
E( #walks ) = (1-c)^k∙N∙V after k iterations
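The sort-and-merge iteration can be illustrated with an in-memory stand-in for the edge stream. This is only a sketch under assumptions: the function name `walks_by_edge_scans`, the tuple layout, and c = 0.15 are hypothetical, Python lists replace the disk files, and every vertex is assumed to have at least one out-edge.

```python
import random


def walks_by_edge_scans(edges, num_vertices, n_walks, c=0.15, seed=0):
    """Advance all walks together, one sequential edge scan per iteration.

    edges: list of (source, target) pairs sorted by source (the edge stream).
    Returns (finished, scans): finished is a list of (start, end) pairs.
    Assumption: every vertex has at least one out-edge.
    """
    rng = random.Random(seed)
    # Each active walk is stored as (current_vertex, start_vertex).
    active = [(u, u) for u in range(num_vertices) for _ in range(n_walks)]
    finished = []
    scans = 0
    while active:
        # Each walk stops with probability c.
        still = []
        for cur, start in active:
            if rng.random() < c:
                finished.append((start, cur))
            else:
                still.append((cur, start))
        if not still:
            break
        # Sort path ends so they line up with the sorted edge stream.
        still.sort()
        # Merge with the sorted graph in a single forward pass.
        active = []
        i = j = 0
        while i < len(still):
            cur = still[i][0]
            while j < len(edges) and edges[j][0] < cur:
                j += 1
            k = j
            while k < len(edges) and edges[k][0] == cur:
                k += 1
            targets = [t for _, t in edges[j:k]]
            while i < len(still) and still[i][0] == cur:
                active.append((rng.choice(targets), still[i][1]))
                i += 1
            j = k
        scans += 1
    return finished, scans


# Hypothetical toy edge list, sorted by source vertex.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
finished, scans = walks_by_edge_scans(edges, num_vertices=3, n_walks=2000)
```

Because the number of surviving walks shrinks by a factor of about 1-c per iteration, the amount of data sorted and merged decreases geometrically from one edge scan to the next.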


Distributed indexing

M machines with fast local network connections

memory < V+E ≤ M∙(memory)
Parallelize for N∙V walks
Parts of the graph in RAM
Remote transfers batched

Heuristic partition: one site to one machine
Machine1: www.cnn.com/*, Machine2: www.yahoo.com/*
Uniform load balance ← ordinary PR distributed equally

[Figure: example with M = 3 machines]

Query, increasing precision

Database of N∙V fingerprints (path endings)
Query: PPV(u) := empirical distribution from N samples

Theorem (Jeh, Widom, ’03)
PPV(u) = c∙r_u + (1-c)/|O(u)| ∙ Σ_{v∈O(u)} PPV(v)
O(u) denotes the out-neighbors of u; r_u is the unit vector of u

Query: PPV(u) := empirical distribution from N∙|O(u)| samples

Number of fingerprints for a query
F = N∙(db accesses/query)
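The decomposition above suggests a query that averages the stored fingerprints of the out-neighbors instead of reading only those of u. A minimal sketch, assuming a hypothetical in-memory fingerprint table and the illustrative names `empirical` and `query_with_neighbors`; the fingerprint values below are made up for the example.

```python
from collections import Counter


def empirical(ends):
    """Empirical distribution of a list of walk endpoints."""
    return {v: k / len(ends) for v, k in Counter(ends).items()}


def query_with_neighbors(fingerprints, out_links, u, c=0.15):
    """Estimate PPV(u) = c*r_u + (1-c)/|O(u)| * sum over v in O(u) of PPV(v),
    plugging in the empirical distribution of each out-neighbor v.
    This uses N*|O(u)| samples instead of N.
    """
    result = {u: c}  # the c*r_u term
    weight = (1 - c) / len(out_links[u])
    for v in out_links[u]:
        for w, p in empirical(fingerprints[v]).items():
            result[w] = result.get(w, 0.0) + weight * p
    return result


# Hypothetical toy graph and a made-up fingerprint database with N = 4.
graph = [[1, 2], [2], [0]]
db = [[0, 2, 0, 1], [2, 2, 1, 0], [2, 0, 0, 2]]
estimate = query_with_neighbors(db, graph, 0)
```

Each out-neighbor costs one additional database access, which is the trade-off expressed by F = N∙(db accesses/query).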


Error of approximation

Exact: PPV(u,v)
Approximate by F fingerprints: PPV̂(u,v)

Theorem
If PPV(u,v) - PPV(u,w) > δ holds, then
Pr{ PPV̂(u,v) < PPV̂(u,w) } < exp( -0.3∙F∙δ^2 )

Idea of the proof
F∙( PPV̂(u,v) - PPV̂(u,w) ) = #(u→v) - #(u→w) = sum of F i.i.d. random variables with values -1, 0, 1
Bernstein’s inequality

Error of approximation → 0 exponentially with F = (db size/vertex)∙(db accesses/query) → ∞
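Taking the bound exp(-0.3∙F∙δ^2) at face value, one can estimate how many fingerprints a query needs for a target misordering probability. The helper names `misorder_bound` and `fingerprints_needed` and the sample values of δ and ε are assumptions for illustration only.

```python
import math


def misorder_bound(F, delta):
    """Upper bound exp(-0.3*F*delta^2) on the probability that two pages
    whose exact PPV values differ by more than delta are ranked in the
    wrong order when F fingerprints are used."""
    return math.exp(-0.3 * F * delta ** 2)


def fingerprints_needed(delta, eps):
    """Smallest integer F for which the bound is at most eps,
    obtained by solving exp(-0.3*F*delta^2) <= eps for F."""
    return math.ceil(math.log(1 / eps) / (0.3 * delta ** 2))


# Hypothetical targets: separate pages whose PPV differs by more than 0.1,
# with at most a 1% chance of swapping them.
F = fingerprints_needed(delta=0.1, eps=0.01)
bound = misorder_bound(F, 0.1)
```

The required F grows like 1/δ^2 but only logarithmically in 1/ε, which is the practical meaning of the error tending to zero exponentially.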

Exact versus approximate

Model of computation
Input: G graph with V vertices
Pre-compute a database of size D
Query: respond by accessing only the db

Exact
Query: u, v, w
Decide if PPV(u,v) > PPV(u,w) holds

Approximate for fixed ε and δ
Query: u, v, w
Decide if PPV(u,v) > PPV(u,w) holds, with error probability ε, when | PPV(u,v) - PPV(u,w) | > δ


Lower bounds for the db size

For the webgraph V ≈ 10^9

Theorem 1
For the Exact problem a D = Ω(V^2) sized db is required in the worst case

Theorem 2
For the Approximate problem D = Ω(V)

Is it possible to improve the 2nd lower bound? Our algorithm uses a D = O(V log V) sized db
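To get a feel for the gap between these bounds at V ≈ 10^9, the following back-of-the-envelope sketch plugs in numbers. It is purely illustrative: the Ω and O notations hide constant factors, so setting every constant to 1 and measuring in bits is an assumption, not a statement about the actual database.

```python
import math

V = 10 ** 9  # number of pages, as on the slide

# Assumption: all hidden constants are 1 and sizes are counted in bits.
exact_bits = V ** 2                      # Omega(V^2) for the Exact problem
approx_lower_bits = V                    # Omega(V) for the Approximate problem
algorithm_units = V * math.log2(V)       # O(V log V) for the Monte Carlo db

exact_petabytes = exact_bits / 8 / 10 ** 15
gap = algorithm_units / approx_lower_bits  # equals log2(V), roughly 30

print(f"Exact lower bound:   {exact_bits:.1e} bits (~{exact_petabytes:.0f} PB)")
print(f"Approx. lower bound: {approx_lower_bits:.1e} bits")
print(f"V log2 V:            {algorithm_units:.1e} (a factor {gap:.1f} above V)")
```

Under these assumptions the quadratic bound is nine orders of magnitude above the linear one, while the algorithm's database exceeds the linear lower bound only by a logarithmic factor.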


Idea of the lower bound proofs

One-way communication complexity

Bit-vector probing (BVP)
Alice has a bit vector. Input: x = (x1, x2, …, xm)
Bob has a number. Input: 1 ≤ k ≤ m
Communication: B bits, from Alice to Bob
Bob must answer: xk = ?
Theorem: B ≥ m for any protocol

Reduction from Exact-PPV to BVP
Alice has x = (x1, x2, …, xm)
G graph with V vertices, where V^2 = m
Pre-compute an Exact PPV database of size D
Communication: Exact PPV db, D bits
Bob has 1 ≤ k ≤ m
u, v, w vertices
PPV(u,v) ? PPV(u,w)
xk = ?
Thus D = B ≥ m = V^2


Summary

Fully personalized PR: Monte Carlo method, not power iteration

Pre-computation: external-memory, distributed

Query: increase precision by (db accesses/query)

Error of approximation: tends to zero exponentially

Space lower bounds: quadratic for Exact PPR, linear for Approximate PPR

Thank you!


Misc

N∙PPV̂(u,v) = #(u→v) ~ Binom(N, PPV(u,v)), where PPV̂ denotes the estimate from N fingerprints

Claim (by Chernoff’s bound): Pr{ PPV̂(u,v) > (1+δ)∙PPV(u,v) } < exp( -N∙PPV(u,v)∙δ^2/4 )

If for a protocol Pr{ right answer } ≥ (1+γ)/2, then B ≥ γ∙m

PV: PageRank vector, c: constant, M: normalized adjacency matrix