Towards Scaling Fully Personalized PageRank

Page 1: Towards Scaling Fully Personalized PageRank

Towards Scaling Fully Personalized PageRank

Dániel Fogaras, Balázs Rácz

Budapest University of Technology and Economics

Computer and Automation Research Institute of the Hungarian Academy of Sciences

Page 2: Towards Scaling Fully Personalized PageRank

Problem formulation

PageRank (Brin, Page, '98)
PV: PageRank vector; r: uniform distribution vector
Overall quality measure of Web pages
Pre-computation: evaluate PV by power iteration
Query: order results by PV

Personalized PageRank (Brin, Page, '98)
r: preference vector of a user, query dependent
PPV(r) := PV, a personalized quality measure of Web pages
Pre-computation: r is not known. What to compute?
Query: power iteration, about 5 hours per query!

$$PV = (1 - c) \cdot PV \cdot M + c \cdot r$$

where c is a constant and M is the normalized adjacency matrix.
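
The power iteration named on this slide can be sketched on a toy graph as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-based graph, the value c = 0.15 and the three-page graph are all assumptions made for the example, and every vertex is assumed to have at least one out-link.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=100):
    """Iterate PV <- (1 - c) * PV * M + c * r on a small in-memory graph.

    out_links: dict mapping each vertex to a non-empty list of out-neighbours.
    r: dict mapping vertices to preference weights that sum to 1.
    """
    pv = dict(r)  # start the iteration from the preference vector
    for _ in range(iterations):
        # the c * r term of the equation
        nxt = {v: c * r.get(v, 0.0) for v in out_links}
        # the (1 - c) * PV * M term: each vertex spreads its mass uniformly
        for u, targets in out_links.items():
            share = (1.0 - c) * pv.get(u, 0.0) / len(targets)
            for v in targets:
                nxt[v] += share
        pv = nxt
    return pv


# Hypothetical 3-page graph, used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
uniform = {v: 1.0 / len(toy_graph) for v in toy_graph}

pv = personalized_pagerank(toy_graph, uniform)          # ordinary PageRank
ppv_a = personalized_pagerank(toy_graph, {"a": 1.0})    # personalized on page "a"
```

With a uniform r the result is ordinary PageRank; with a unit vector on one page it is that page's personalized vector. Each call is a full iteration over the graph, which is why running it at query time does not scale.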

Page 3: Towards Scaling Fully Personalized PageRank

Preliminaries

Linearity:

$$PPV(\alpha_1 r_1 + \dots + \alpha_k r_k) = \alpha_1 PPV(r_1) + \dots + \alpha_k PPV(r_k)$$

Full personalization
Pre-compute PPV(ri) for all pages: V^2 disk space, V(V+E) time, where V ≈ 10^9, E ≈ 10^10 ???

Topic-Sensitive PageRank (Haveliwala '01): linearity
Pre-compute PPV(ri) for a topical basis r1,…,rk, k ≈ 20
Query: the user submits a topic by α1,…,αk; the query engine combines the PPV(ri) vectors

Scaling Personalized Web Search (Jeh, Widom, '03): decomposition, linearity
Pre-compute PPV(ri) for unit vectors r1,…,rk, corresponding to k ≈ 10,000 pages
Query: personalization over the 10,000 pages
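
The linearity property is what lets a query engine combine pre-computed basis vectors instead of running a new iteration. The sketch below checks it numerically on a toy graph. The helper names `ppv` and `combine`, the value c = 0.15, the graph and the weights 0.3 and 0.7 are illustrative assumptions, not taken from the talk.

```python
def ppv(out_links, r, c=0.15, iterations=100):
    """Power iteration for PPV(r); every vertex needs at least one out-link."""
    pv = dict(r)
    for _ in range(iterations):
        nxt = {v: c * r.get(v, 0.0) for v in out_links}
        for u, targets in out_links.items():
            for v in targets:
                nxt[v] += (1.0 - c) * pv.get(u, 0.0) / len(targets)
        pv = nxt
    return pv


def combine(alphas, vectors):
    """Return the linear combination sum_i alphas[i] * vectors[i]."""
    out = {}
    for alpha, vec in zip(alphas, vectors):
        for v, x in vec.items():
            out[v] = out.get(v, 0.0) + alpha * x
    return out


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical graph
basis = [{"a": 1.0}, {"b": 1.0}]                         # unit vectors r_1, r_2
alphas = [0.3, 0.7]                                      # hypothetical user weights

precomputed = [ppv(toy_graph, r) for r in basis]         # done offline
at_query = combine(alphas, precomputed)                  # cheap, done at query time
direct = ppv(toy_graph, combine(alphas, basis))          # expensive, for comparison
```

Here `at_query` and `direct` agree up to floating-point error, so only the basis vectors need to be stored. The catch is the size of the basis: full personalization needs one vector per page.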

Presenter note (FD): Should we talk about the database size? Seeing the linearity, pre-computing every PPV(u) in advance might seem trivial.

Page 4: Towards Scaling Fully Personalized PageRank

Towards full personalization

Our algorithm
Monte Carlo simulation, not power iteration
Pre-compute approximate PPV(ri) for all unit vectors r1,…,rk, k = number of pages
Scalability: quasi-linear pre-computation and sub-linear query

Main points of this presentation
Outline of the algorithm
Pre-computation: external-memory, distributed
Query: used to increase precision
Error of approximation tends to zero exponentially
Exact vs. approximated PPV: space lower bounds

Page 5: Towards Scaling Fully Personalized PageRank

Outline of the Algorithm

Theorem (Jeh, Widom '03, F '03)
A random walk starts from page u
It takes a uniform step with probability 1-c and stops with probability c
PPV(u,v) = Pr{ the walk stops at page v }

Monte Carlo algorithm
Pre-computation
From u simulate N independent random walks
Database of fingerprints: ending vertices of the walks from all vertices
Query
PPV(u,v) := #( walks u→v ) / N
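
The Monte Carlo outline above can be sketched in memory as follows. The function names, the seeded `random.Random` generator, c = 0.15 and the toy graph are assumptions for illustration; the real pre-computation described in the talk is external-memory and distributed.

```python
import random
from collections import Counter


def fingerprint(out_links, u, c, rng):
    """Walk from u; before each step, stop with probability c. Return the end vertex."""
    v = u
    while rng.random() >= c:           # continue with probability 1 - c
        v = rng.choice(out_links[v])   # uniform step to an out-neighbour
    return v


def build_fingerprints(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: N fingerprints (walk endings) for every start vertex."""
    rng = random.Random(seed)
    return {u: [fingerprint(out_links, u, c, rng) for _ in range(n_walks)]
            for u in out_links}


def approx_ppv(db, u):
    """Query: PPV(u, v) is estimated by #(walks u -> v) / N."""
    counts = Counter(db[u])
    n = len(db[u])
    return {v: k / n for v, k in counts.items()}


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical graph
db = build_fingerprints(toy_graph, n_walks=2000)
estimate = approx_ppv(db, "a")
```

The database stores only N end vertices per page, so its size grows linearly with the number of pages rather than quadratically.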

Page 6: Towards Scaling Fully Personalized PageRank

External memory pre-computation

Goal: N independent random walks from each vertex

Input: webgraph with V ≈ 10^9, E ≈ 10^10

V+E > memory

Accessing the edges
Edge scan: stream access
Edges sorted by source vertices

Page 7: Towards Scaling Fully Personalized PageRank

External memory pre-computation (2)

Goal: N independent random walks from each vertex

Simulate all walks together
Iteration: 1 blink = 1 edge scan
Sort path ends
Merge with the sorted graph
Each walk stops with prob. c

E( #walks ) = (1-c)^k ∙ N ∙ V after k iterations
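
The sort-and-merge iteration can be mimicked in memory to show the access pattern. In this sketch the "edge scan" is a single pass over a Python list sorted by source vertex. The function name, data layout and toy graph are assumptions, and every vertex is assumed to have at least one out-edge.

```python
import random
from itertools import groupby
from operator import itemgetter


def simulate_walks_by_edge_scans(edges, vertices, n_walks, c=0.15, seed=0):
    """Advance all walks together, using one sorted edge scan per iteration.

    edges: list of (source, target) pairs sorted by source.
    Returns (fingerprints, active_counts), where active_counts[k] is the
    number of walks still running after k iterations.
    """
    rng = random.Random(seed)
    active = [(u, u) for u in vertices for _ in range(n_walks)]  # (path end, start)
    fingerprints = {u: [] for u in vertices}
    active_counts = []
    while active:
        active_counts.append(len(active))
        survivors = []
        for end, start in active:
            if rng.random() < c:
                fingerprints[start].append(end)   # the walk stops: record its ending
            else:
                survivors.append((end, start))
        survivors.sort()                          # sort path ends
        active = []
        i = 0
        # merge the sorted path ends with the sorted edge stream (one scan)
        for source, group in groupby(edges, key=itemgetter(0)):
            targets = [t for _, t in group]
            while i < len(survivors) and survivors[i][0] < source:
                i += 1   # path end with no out-edges; excluded by assumption
            while i < len(survivors) and survivors[i][0] == source:
                active.append((rng.choice(targets), survivors[i][1]))
                i += 1
    return fingerprints, active_counts


toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]   # hypothetical graph
fingerprints, active_counts = simulate_walks_by_edge_scans(
    toy_edges, ["a", "b", "c"], n_walks=1000)
```

Because each walk stops with probability c per iteration, `active_counts` shrinks roughly geometrically, in line with the expectation (1-c)^k ∙ N ∙ V stated above.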

Page 8: Towards Scaling Fully Personalized PageRank

Distributed indexing

M machines with fast local network connections

memory < V+E ≤ M∙(memory)
Parallelize for N∙V walks
Parts of the graph in RAM
Remote transfers batched

Heuristic partition: one site to one machine
Machine 1: www.cnn.com/*, Machine 2: www.yahoo.com/*

Uniform load balance ← ordinary PR distributed equally

[Figure: example with M=3 machines]
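
One way to read the partition heuristic is sketched below: whole sites go to a single machine, and sites are placed so that the total ordinary PageRank per machine is roughly equal. The greedy "heaviest site to the least-loaded machine" rule, the function name and the PageRank values are assumptions for illustration. The slide does not say how the balancing is actually done.

```python
import heapq
from urllib.parse import urlsplit


def partition_sites(page_rank, n_machines):
    """Assign each host to one machine, balancing summed PageRank greedily.

    page_rank: dict mapping absolute URLs to ordinary PageRank values.
    Returns a dict mapping host name to machine index.
    """
    # total PageRank mass per site (host)
    site_load = {}
    for url, pr in page_rank.items():
        host = urlsplit(url).hostname
        site_load[host] = site_load.get(host, 0.0) + pr

    # min-heap of (current load, machine index)
    heap = [(0.0, m) for m in range(n_machines)]
    heapq.heapify(heap)

    assignment = {}
    # place the heaviest sites first, each on the least-loaded machine
    for host, load in sorted(site_load.items(), key=lambda kv: -kv[1]):
        total, machine = heapq.heappop(heap)
        assignment[host] = machine
        heapq.heappush(heap, (total + load, machine))
    return assignment


# Hypothetical PageRank values, used only to exercise the function.
pages = {
    "http://www.cnn.com/a": 0.3,
    "http://www.cnn.com/b": 0.2,
    "http://www.yahoo.com/": 0.3,
    "http://example.org/x": 0.1,
    "http://example.net/y": 0.1,
}
assignment = partition_sites(pages, n_machines=3)
```

Keeping a site on one machine means most steps of a walk stay local, since links within a site are common. Only the steps that cross sites need the batched remote transfers.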

Page 9: Towards Scaling Fully Personalized PageRank

Query, increasing precision

Database of N∙V fingerprints (path endings)
Query: PPV(u) := empirical distribution from N samples

Theorem (Jeh, Widom, '03)

$$PPV(u) = (1 - c) \cdot \frac{1}{|O(u)|} \sum_{v \in O(u)} PPV(v) + c \cdot r_u$$

O(u) denotes the out-neighbors of u, and r_u is the unit vector of u

Query: PPV(u) := empirical distribution from N∙|O(u)| samples

Number of fingerprints for a query
F = N∙(db accesses/query)
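
The decomposition theorem suggests a query that reads the fingerprints of the out-neighbours of u rather than those of u itself, so N∙|O(u)| samples contribute. The sketch below is an in-memory illustration. The function names, c = 0.15, the seed and the toy graph are assumptions.

```python
import random


def build_fingerprints(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: for every vertex, store the end vertices of N random walks."""
    rng = random.Random(seed)
    db = {}
    for u in out_links:
        ends = []
        for _ in range(n_walks):
            v = u
            while rng.random() >= c:          # continue with probability 1 - c
                v = rng.choice(out_links[v])
            ends.append(v)
        db[u] = ends
    return db


def query_with_out_neighbours(out_links, db, u, c=0.15):
    """Estimate PPV(u) as c * r_u + (1 - c) / |O(u)| * sum of empirical PPV(v)."""
    estimate = {u: c}                         # the c * r_u term
    neighbours = out_links[u]
    for v in neighbours:                      # one database access per out-neighbour
        weight = (1.0 - c) / (len(neighbours) * len(db[v]))
        for end in db[v]:
            estimate[end] = estimate.get(end, 0.0) + weight
    return estimate


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical graph
db = build_fingerprints(toy_graph, n_walks=2000)
estimate = query_with_out_neighbours(toy_graph, db, "a")
```

The database size per vertex is unchanged. The extra precision is paid for with more database accesses per query, which is the trade-off behind F = N∙(db accesses/query).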

Page 10: Towards Scaling Fully Personalized PageRank

Error of approximation

Exact: PPV(u,v)
Approximate by F fingerprints: $\widehat{PPV}(u,v)$

Theorem
If $PPV(u,v) - PPV(u,w) > \delta$ holds, then

$$\Pr\{\widehat{PPV}(u,v) < \widehat{PPV}(u,w)\} < \exp(-0.3 \cdot F \cdot \delta^2)$$

Idea of the proof
$F \cdot (\widehat{PPV}(u,v) - \widehat{PPV}(u,w)) = \#(u \to v) - \#(u \to w)$ is a sum of F i.i.d. random variables with values -1, 0, 1
Apply Bernstein's inequality

Error of approximation → 0 exponentially as F = (db size/vertex)∙(db accesses/query) → ∞
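
To get a feel for the bound, the sketch below evaluates exp(-0.3∙F∙δ^2) and inverts it to find how many fingerprints keep the mis-ordering probability below a target ε. The function names and the sample values δ = 0.05 and ε = 0.01 are assumptions chosen for illustration.

```python
import math


def misorder_bound(f, delta):
    """Upper bound exp(-0.3 * F * delta^2) on the probability of swapping v and w."""
    return math.exp(-0.3 * f * delta ** 2)


def fingerprints_needed(delta, epsilon):
    """Smallest integer F for which the bound is at most epsilon."""
    return math.ceil(math.log(1.0 / epsilon) / (0.3 * delta ** 2))


delta, epsilon = 0.05, 0.01          # hypothetical gap and error target
f_required = fingerprints_needed(delta, epsilon)
bound_at_f = misorder_bound(f_required, delta)
```

The required F grows only logarithmically in 1/ε but quadratically in 1/δ, so pages whose personalized scores are close together are the expensive ones to order correctly.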

Page 11: Towards Scaling Fully Personalized PageRank

Exact versus approximate

Model of computation
Input: graph G with V vertices
Pre-compute a database of size D
Query: respond by accessing only the db

Exact
Query: u, v, w
Decide if PPV(u,v) > PPV(u,w) holds

Approximate, for fixed ε and δ
Query: u, v, w
Decide if PPV(u,v) > PPV(u,w) holds, with error probability ε, when | PPV(u,v) - PPV(u,w) | > δ

Page 12: Towards Scaling Fully Personalized PageRank

Lower bounds for the db size

For the webgraph V ≈ 10^9

Theorem 1
For the Exact problem a db of size D = Ω(V^2) is required in the worst case

Theorem 2
For the Approximate problem D = Ω(V)

Is it possible to improve the 2nd lower bound? Our algorithm uses a db of size D = O(V log V)

Page 13: Towards Scaling Fully Personalized PageRank

Idea of the lower bound proofs

One-way communication complexity

Bit-vector probing (BVP)
Alice has a bit vector. Input: x = (x1, x2, …, xm)
Bob has a number. Input: 1 ≤ k ≤ m
Communication: Alice sends B bits to Bob
Bob must answer: xk = ?
Theorem: B ≥ m for any protocol

Reduction from Exact-PPV to BVP
Alice has x = (x1, x2, …, xm)
She encodes x in a graph G with V vertices, where V^2 = m, and pre-computes an Exact PPV database of size D
Communication: the Exact PPV db, D bits
Bob has 1 ≤ k ≤ m
He maps k to vertices u, v, w and compares PPV(u,v) ? PPV(u,w)
The comparison answers xk = ?
Thus D = B ≥ m = V^2

Page 14: Towards Scaling Fully Personalized PageRank

Summary

Fully personalized PR: Monte Carlo method, not power iteration
Pre-computation: external-memory, distributed
Query: increase precision by (db accesses/query)
Error of approximation: tends to zero exponentially
Space lower bounds: quadratic for Exact PPR, linear for Approximate PPR

Page 15: Towards Scaling Fully Personalized PageRank

Thank you!

Page 16: Towards Scaling Fully Personalized PageRank

Misc

$N \cdot \widehat{PPV}(u,v) = \#(u \to v) \sim \mathrm{Binom}(N, PPV(u,v))$

Claim (by Chernoff's bound):

$$\Pr\{\widehat{PPV}(u,v) > (1+\delta) \cdot PPV(u,v)\} < \exp(-N \cdot PPV(u,v) \cdot \delta^2 / 4)$$

If for a protocol Pr{right answer} ≥ (1+γ)/2, then B ≥ γ∙m

PV: PageRank vector, c: constant, M: normalized adjacency matrix