Towards Scaling Fully Personalized PageRank
Dániel Fogaras, Balázs Rácz
Budapest University of Technology and Economics
Computer and Automation Research Institute of the Hungarian Academy of Sciences

Problem formulation

PageRank (Brin, Page, ’98)
  PV: PageRank vector; r: uniform distribution vector
  Overall quality measure of Web pages
  Pre-computation: evaluate PV by power iteration
  Query: order results by PV

Personalized PageRank (Brin, Page, ’98)
  r: preference vector of a user, query dependent
  PPV(r) := PV, a personalized quality measure of Web pages
  Pre-computation: r is not known. What to compute?
  Query: power iteration, 5 hours/query!

In both cases PV is the solution of

$$PV = (1-c)\,PV \cdot M + c\,r$$

where c is a constant and M is the normalized adjacency matrix.
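The power iteration mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy graph, and the value c = 0.15 are assumptions, and every page is assumed to have at least one out-link.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=100):
    """Iterate PV <- (1 - c) * PV * M + c * r.

    out_links maps each page to its list of out-neighbors, so row u of the
    normalized adjacency matrix M holds 1/len(out_links[u]) for each target.
    Assumes every page has at least one out-link.
    """
    pv = {page: r.get(page, 0.0) for page in out_links}
    for _ in range(iterations):
        nxt = {page: c * r.get(page, 0.0) for page in out_links}
        for u, targets in out_links.items():
            share = (1 - c) * pv[u] / len(targets)
            for v in targets:
                nxt[v] += share
        pv = nxt
    return pv


# Toy graph (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
graph = {0: [1, 2], 1: [2], 2: [0]}
uniform = {page: 1 / len(graph) for page in graph}   # r for ordinary PageRank
personal = {0: 1.0}                                  # all preference on page 0
pv = personalized_pagerank(graph, uniform)
ppv = personalized_pagerank(graph, personal)
```

With a uniform r the result is the ordinary PageRank vector; with all preference mass on page 0 it is the personalized vector of that page. Running such an iteration per query over the whole webgraph is what makes the query step so expensive.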

Preliminaries

Linearity:

$$PPV(\alpha_1 r_1 + \dots + \alpha_k r_k) = \alpha_1\,PPV(r_1) + \dots + \alpha_k\,PPV(r_k)$$

Full personalization
  Pre-compute PPV(ri) for all pages
  V^2 disk, V(V+E) time, where V ≈ 10^9, E ≈ 10^10 ???

Topic-Sensitive PageRank (Haveliwala ’01): linearity
  Pre-compute PPV(ri) for a topical basis r1,…,rk, k ≈ 20
  Query: user submits a topic by α1,…,αk; the query engine combines the PPV(ri) vectors

Scaling Personalized Web Search (Jeh, Widom, ’03): decomposition, linearity
  Pre-compute PPV(ri) for unit vectors r1,…,rk corresponding to k ≈ 10,000 pages
  Query: personalization over the 10,000 pages
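The query-time combination that linearity allows can be sketched as below. The helper name and the two basis vectors are made up for illustration; the weights are assumed to be non-negative and to sum to 1.

```python
def combine_ppv(basis_ppvs, weights):
    """Return sum_i weights[i] * basis_ppvs[i] as a sparse vector (dict)."""
    combined = {}
    for alpha, ppv in zip(weights, basis_ppvs):
        for page, score in ppv.items():
            combined[page] = combined.get(page, 0.0) + alpha * score
    return combined


# Hypothetical pre-computed basis vectors PPV(r1), PPV(r2)
ppv_r1 = {"a": 0.5, "b": 0.5}
ppv_r2 = {"b": 1.0}
mixed = combine_ppv([ppv_r1, ppv_r2], [0.25, 0.75])
```

Only the k pre-computed vectors are touched at query time, which is why the size of the basis limits how fine-grained the personalization can be.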

Towards full personalization

Our algorithm
  Monte Carlo simulation, not power iteration
  Pre-compute approximate PPV(ri) for all unit vectors r1,…,rk, where k = number of pages
  Scalability: quasi-linear pre-computation and sub-linear query

Main points of this presentation
  Outline of the algorithm
  Pre-computation: external-memory, distributed
  Query: used to increase precision
  Error of approximation tends to zero exponentially
  Exact vs. approximated PPV: space lower bounds

Outline of the Algorithm

Theorem (Jeh, Widom ’03, F ’03)
  A random walk starts from page u
  It takes a uniform step with probability 1-c and stops with probability c
  PPV(u,v) = Pr{ the walk stops at page v }

Monte Carlo algorithm
  Pre-computation
    From each vertex u simulate N independent random walks
    Database of fingerprints: ending vertices of the walks from all vertices
  Query
    Approximate PPV(u,v) := #( walks u→v ) / N
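The Monte Carlo scheme above can be sketched in memory as follows. The function names, the toy graph, c = 0.15, and N = 1000 are assumptions for illustration, and every vertex is assumed to have at least one out-link.

```python
import random
from collections import Counter


def simulate_walk(out_links, start, c, rng):
    """Walk from `start`; before each step, stop with probability c."""
    page = start
    while rng.random() >= c:              # continue with probability 1 - c
        page = rng.choice(out_links[page])  # uniform step to an out-neighbor
    return page


def build_fingerprints(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: ending vertices of n_walks walks from every vertex."""
    rng = random.Random(seed)
    return {
        u: [simulate_walk(out_links, u, c, rng) for _ in range(n_walks)]
        for u in out_links
    }


def estimate_ppv(fingerprints, u):
    """Query: fraction of the walks from u that ended at each vertex v."""
    ends = fingerprints[u]
    return {v: count / len(ends) for v, count in Counter(ends).items()}


graph = {0: [1, 2], 1: [2], 2: [0]}   # hypothetical toy graph
fingerprints = build_fingerprints(graph, n_walks=1000)
estimate = estimate_ppv(fingerprints, 0)
```

The database stores only N ending vertices per page, so a query is a lookup and a count rather than an iteration over the graph.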

External memory pre-computation

Goal: N independent random walks from each vertex
Input: webgraph with V ≈ 10^9 vertices and E ≈ 10^10 edges
V+E > memory

Accessing the edges
  Edge scan: stream access
  Edges sorted by source vertices

External memory pre-computation (2)

Goal: N independent random walks from each vertex
Simulate all walks together
Iteration: 1 blink = 1 edge scan
  Sort path ends
  Merge with the sorted graph
Each walk stops with prob. c
E( #walks ) = (1-c)^k∙N∙V after k iterations
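One sort-and-merge iteration can be sketched as below. This version keeps everything in Python lists, so it only illustrates the access pattern (sorted walk ends merged against one sequential pass over the sorted edges), not real external-memory I/O. All names and the toy graph are assumptions, and every vertex is assumed to have at least one out-edge.

```python
import random
from itertools import groupby


def advance_walks(walks, sorted_edges, c, rng):
    """One iteration over all unfinished walks, using a single edge scan.

    walks: list of (current_vertex, start_vertex) pairs.
    sorted_edges: list of (source, target) pairs sorted by source.
    Returns (surviving_walks, fingerprints), fingerprints as (start, end).
    """
    survivors, fingerprints = [], []
    groups = groupby(sorted_edges, key=lambda edge: edge[0])
    source, targets = None, []
    for current, start in sorted(walks):             # sort path ends
        while source is None or source < current:    # merge with the sorted graph
            source, group = next(groups)
            targets = [target for _, target in group]
        if rng.random() < c:                         # walk stops with prob. c
            fingerprints.append((start, current))
        else:                                        # uniform step
            survivors.append((rng.choice(targets), start))
    return survivors, fingerprints


def precompute(sorted_edges, n_walks, c=0.15, seed=0):
    """Run iterations until every walk has stopped."""
    rng = random.Random(seed)
    vertices = sorted({source for source, _ in sorted_edges})
    walks = [(u, u) for u in vertices for _ in range(n_walks)]
    database, alive_per_iteration = [], [len(walks)]
    while walks:
        walks, finished = advance_walks(walks, sorted_edges, c, rng)
        database.extend(finished)
        alive_per_iteration.append(len(walks))
    return database, alive_per_iteration


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]   # hypothetical toy graph, sorted by source
database, alive = precompute(edges, n_walks=100)
```

The list `alive` records how many walks are still running after each iteration; in expectation it shrinks by a factor of 1-c per edge scan, which matches the (1-c)^k∙N∙V formula.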

Distributed indexing

M machines with fast local network connections
memory < V+E ≤ M∙(memory)

Parallelize for N∙V walks
  Parts of the graph in RAM
  Remote transfers batched

Heuristic partition: one site to one machine
  Machine 1: www.cnn.com/*, Machine 2: www.yahoo.com/*
  Uniform load balance ← ordinary PR distributed equally
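A site-based partition could look like the sketch below. Hashing the host name is an assumption of this example; the slides state only the heuristic "one site to one machine" and do not say how sites are assigned. The function name and URLs are illustrative.

```python
import zlib
from urllib.parse import urlsplit


def machine_for(url, num_machines):
    """Map a page to a machine so that all pages of one site land together.

    Uses a CRC32 of the host name (an assumption of this sketch) so the
    assignment is stable across runs, unlike Python's built-in str hash.
    """
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_machines


pages = [
    "http://www.cnn.com/world",
    "http://www.cnn.com/sports/index.html",
    "http://www.yahoo.com/news",
]
assignment = {page: machine_for(page, 3) for page in pages}
```

Keeping a site on one machine means that steps along intra-site links need no remote transfer; only walks that leave the site have to be batched and shipped.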
[Figure: example with M = 3 machines]

Query, increasing precision

Database of N∙V fingerprints (path endings)
Query: approximate PPV(u) := empirical distribution from N samples

Theorem (Jeh, Widom, ’03)

$$PPV(u) = (1-c)\,\frac{1}{|O(u)|}\sum_{v \in O(u)} PPV(v) + c\,r_u$$

where O(u) denotes the out-neighbors of u and r_u is the unit vector of page u.

Query: approximate PPV(u) := empirical distribution from N∙|O(u)| samples
Number of fingerprints for a query: F = N∙(db accesses/query)
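The higher-precision query can be sketched as follows: instead of the N fingerprints stored for u, it averages the fingerprints of the out-neighbors of u and adds the mass c at u itself. The function name, the fingerprint lists, and c = 0.2 are made up for illustration; u is assumed to have at least one out-neighbor.

```python
from collections import Counter


def query_ppv(fingerprints, out_links, u, c=0.15):
    """Estimate PPV(u) as c * r_u + (1 - c) * average of the empirical
    distributions of the out-neighbors of u."""
    neighbors = out_links[u]
    estimate = Counter({u: c})
    for v in neighbors:
        ends = fingerprints[v]                     # one db access per neighbor
        weight = (1 - c) / (len(neighbors) * len(ends))
        for end in ends:
            estimate[end] += weight
    return dict(estimate)


# Hypothetical database with N = 4 fingerprints per vertex
out_links = {"a": ["b", "c"]}
fingerprints = {"b": ["b", "c", "a", "b"], "c": ["c", "a", "c", "c"]}
estimate = query_ppv(fingerprints, out_links, "a", c=0.2)
```

Here the query for "a" uses 2 database accesses and 8 fingerprints, so precision grows with the number of accesses without enlarging the database.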

Error of approximation

Exact: PPV(u,v)
Approximate by F fingerprints: $\widehat{PPV}(u,v)$

Theorem
  If PPV(u,v) > PPV(u,w) holds, then with δ = PPV(u,v) − PPV(u,w)

$$\Pr\{\widehat{PPV}(u,v) < \widehat{PPV}(u,w)\} < \exp(-0.3 \cdot F \cdot \delta^2)$$

Idea of the proof
  $F \cdot (\widehat{PPV}(u,v) - \widehat{PPV}(u,w)) = \#(u \to v) - \#(u \to w)$ is a sum of F i.i.d. random variables with values -1, 0, 1
  Apply Bernstein’s inequality

Error of approximation → 0 exponentially with F = (db size/vertex)∙(db accesses/query) → ∞
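To get a feel for the exponential bound, the sketch below evaluates exp(-0.3∙F∙δ²) and inverts it to find how many fingerprints keep the probability of swapping two pages below a target. The function names and the sample values δ = 0.01 and target 0.01 are assumptions chosen for illustration.

```python
import math


def misorder_bound(num_fingerprints, delta):
    """Upper bound on the probability that two pages whose exact scores
    differ by delta are ranked in the wrong order."""
    return math.exp(-0.3 * num_fingerprints * delta ** 2)


def fingerprints_needed(delta, epsilon):
    """Smallest F for which the bound is at most epsilon."""
    return math.ceil(math.log(1 / epsilon) / (0.3 * delta ** 2))


needed = fingerprints_needed(delta=0.01, epsilon=0.01)
```

Because F enters the exponent linearly, doubling the number of fingerprints squares the bound, while halving δ requires four times as many fingerprints for the same guarantee.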

Exact versus approximate

Model of computation
  Input: G graph with V vertices
  Pre-compute a database of size D
  Query: respond by accessing only the db

Exact
  Query: u, v, w
  Decide if PPV(u,v) > PPV(u,w) holds

Approximate for fixed ε and δ
  Query: u, v, w
  Decide if PPV(u,v) > PPV(u,w) holds with error probability ε when | PPV(u,v) - PPV(u,w) | > δ

Lower bounds for the db size

For the webgraph V ≈ 10^9

Theorem 1
  For the Exact problem a db of size D = Ω(V^2) is required in the worst case

Theorem 2
  For the Approximate problem D = Ω(V)

Is it possible to improve the 2nd lower bound? Our algorithm uses a db of size D = O(V log V)
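For a rough sense of scale, the three growth rates can be evaluated at V = 10^9. This ignores the constants hidden in the Ω and O notation, and the base-2 logarithm is an assumption, so the figures indicate orders of magnitude only.

```latex
\begin{align*}
\text{Exact lower bound:} \quad & V^2 = (10^9)^2 = 10^{18} \\
\text{Approximate lower bound:} \quad & V = 10^9 \\
\text{Our algorithm:} \quad & V \log_2 V \approx 10^9 \times 29.9 \approx 3 \times 10^{10}
\end{align*}
```

The gap between the quadratic and the quasi-linear figure is about seven orders of magnitude.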

Idea of the lower bound proofs

One-way communication complexity

Bit-vector probing (BVP)
  Alice has a bit vector; input: x = (x1, x2, …, xm)
  Bob has a number; input: 1 ≤ k ≤ m
  Communication: B bits
  Question: xk = ?
  Theorem: B ≥ m for any protocol

Reduction from Exact-PPV to BVP
  Alice has x = (x1, x2, …, xm)
    G graph with V vertices, where V^2 = m
    Pre-compute an Exact PPV database of size D
  Communication: Exact PPV db, D bits
  Bob has 1 ≤ k ≤ m
    u, v, w vertices
    PPV(u,v) ? PPV(u,w)
    xk = ?
  Thus D = B ≥ m = V^2

Summary

Fully personalized PR: Monte Carlo method, not power iteration
Pre-computation: external-memory, distributed
Query: increase precision by (db accesses/query)
Error of approximation: tends to zero exponentially
Space lower bounds: quadratic for Exact PPR, linear for Approximate PPR

Thank you!

Misc

$N \cdot \widehat{PPV}(u,v) = \#(u \to v) \sim \mathrm{Binom}(N, PPV(u,v))$

Claim (by Chernoff’s bound):

$$\Pr\{\widehat{PPV}(u,v) > (1+\delta)\,PPV(u,v)\} < \exp(-N \cdot PPV(u,v) \cdot \delta^2 / 4)$$

If for a protocol Pr{ right answer } ≥ (1+γ)/2, then B ≥ γ∙m

Notation: PV is the PageRank vector, c a constant, M the normalized adjacency matrix