[icde 2012] on top-k structural similarity search
DESCRIPTION
In this talk, we discuss the following classic problem: given a node in a graph, how can we efficiently find the top-k nodes most similar to it by checking only the graph's link structure? This talk accompanies the ICDE 2012 paper "On Top-k Structural Similarity Search", which can be found at http://www.cs.ubc.ca/~peil/research.html
TRANSCRIPT
04/14/2023
1
Pei Lee, ICDE 2012
On Top-k Structural Similarity Search
Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada
Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China
2
Outline
Problem definition: structural similarity; top-k structural similarity search
Existing top-k structural similarity search methods: SimRank, P-Rank; constraints
TopSim: a family of efficient top-k structural similarity search algorithms with accuracy guarantees; Truncated TopSim, Prioritized TopSim
Experiments
Problem Statement
3
Graph structures are ubiquitous
Social networks, citation networks, web graphs, etc
4
What’s structural similarity?
Structural similarity: the pairwise similarity between nodes in a graph
Applications: link prediction, recommendation, etc
Intuition: two nodes are similar if their neighbors are similar
Derived from PageRank's intuition: a node is important if it is referenced by many other important nodes
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v]
How to quantify the similarity between nodes u and v?
Problem Definition:
Input: G(V, E), u ∈ V, v ∈ V
Output: S(u, v)
5
What’s top-k structural similarity search?
Problem Definition:
Input: G(V, E), v ∈ V, k
Output: top-k similar nodes for v
Given a node v in a huge graph, find the top-k nodes most similar to v. But:
We definitely do not want to compare v with every node
The accuracy of the results should be guaranteed
6
Existing Structural Similarity Measures
Neighbor-based approaches: Jaccard coefficient, cosine similarity, Pearson correlation, co-citation, etc.
Cons: no common neighbors, no similarity!
Random-walk-based approaches: SimRank (Jeh & Widom, KDD'02); P-Rank (Zhao et al., CIKM'09), which extends SimRank
Cons: high computational cost; not designed for top-k similarity search
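The drawback of neighbor-based measures can be seen on a tiny example. The sketch below computes the Jaccard coefficient on in-neighbor sets; the function name `jaccard` and the toy graph are assumptions made for illustration, not taken from the talk.

```python
def jaccard(in_nbrs, u, v):
    """Jaccard coefficient of the in-neighbor sets of u and v."""
    a, b = set(in_nbrs.get(u, ())), set(in_nbrs.get(v, ()))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy graph (an assumption): a -> b, a -> c, b -> u, c -> v.
in_nbrs = {"b": ["a"], "c": ["a"], "u": ["b"], "v": ["c"]}

print(jaccard(in_nbrs, "b", "c"))  # b and c share their only in-neighbor a
print(jaccard(in_nbrs, "u", "v"))  # u and v have no in-neighbor in common
```

Here u and v get a score of zero even though their in-neighbors b and c are themselves similar, which is the gap that random-walk-based measures aim to close.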
Related Work
7
SimRank & P-Rank
SimRank: two nodes are similar if they are referenced by similar nodes
[Figure: example graph with nodes a, b, c, u, v]
S(a, a) = 1, S(b, c) = 0.5 > 0, S(u, v) = 0.25 > 0
Pairwise iterative form (I(u) denotes the in-neighbors of u):
$$S_{n+1}(u, v) = \frac{C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j)$$
Matrix form (W is the transition matrix; the slide also marks a correction matrix):
$$S_{n+1} = C\, W S_n W^{T}$$
P-Rank: two nodes are similar if they are related to similar nodes
$$S_{n+1}(u, v) = \frac{\lambda C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j) + \frac{(1 - \lambda)\, C}{|O(u)|\,|O(v)|} \sum_{i \in O(u)} \sum_{j \in O(v)} S_n(i, j)$$
0 < C < 1
0 < λ < 1
The first term is SimRank (in-neighbors I); the second term is reversed SimRank (out-neighbors O)
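The pairwise iteration can be sketched naively as follows. This is a minimal illustration, not the authors' implementation: the function name `p_rank`, the dictionary-based graph representation, and the toy graph (a -> b, a -> c, b -> u, c -> v) are assumptions. Setting `lam = 1.0` keeps only the in-neighbor term, which is the SimRank limit case.

```python
from itertools import product


def p_rank(in_nbrs, out_nbrs, nodes, C=0.5, lam=1.0, iterations=3):
    """Naive pairwise iteration; lam = 1.0 keeps only the in-neighbor (SimRank) term."""

    def avg(S, A, B):
        # Average of S(i, j) over i in A and j in B; 0 if either set is empty.
        if not A or not B:
            return 0.0
        return sum(S[(i, j)] for i in A for j in B) / (len(A) * len(B))

    # S_0 is the identity: a node is only similar to itself.
    S = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        S = {
            (u, v): 1.0 if u == v else C * (
                lam * avg(S, in_nbrs.get(u, ()), in_nbrs.get(v, ()))
                + (1 - lam) * avg(S, out_nbrs.get(u, ()), out_nbrs.get(v, ()))
            )
            for u, v in product(nodes, repeat=2)
        }
    return S


# Hypothetical toy graph (an assumption): a -> b, a -> c, b -> u, c -> v.
nodes = ["a", "b", "c", "u", "v"]
in_nbrs = {"b": ["a"], "c": ["a"], "u": ["b"], "v": ["c"]}
out_nbrs = {"a": ["b", "c"], "b": ["u"], "c": ["v"]}

S = p_rank(in_nbrs, out_nbrs, nodes)
print(S[("b", "c")], S[("u", "v")], S[("a", "a")])
```

Every iteration touches all |V|^2 node pairs, which is why this form does not scale to a top-k query on a large graph.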
8
Top-k similarity search: challenges
Matrix-based approach (KDD'02, VLDB'08):
Offline: compute a |V|-by-|V| similarity matrix
SimRank/P-Rank takes O(|E|^2) time, which degenerates to O(|V|^4) in the worst case
Space cost: this huge similarity matrix is hard to store
Vector-based approach (SDM'10):
Offline: compute a vector of length |V|
Takes O(|V|D2n) time in the worst case, where n is the number of iterations and D is the average degree
All these approaches need to access the whole graph to find the exact top-k similar nodes
Challenges
9
Contributions
Transform the computation of pairwise similarity on graph G to the computation of authority on G×G, based on a propagation & aggregation process;
Propose TopSim, a local top-k structural similarity search algorithm that avoids accessing the whole graph while the accuracy is guaranteed.
Propose Trun-TopSim-SM and Prio-TopSim-SM, which are two approximations allowing us to trade accuracy for speed.
10
How TopSim works
Coupling random walk on G
Single random walk on G×G
Propagation & Aggregation
Similarity Path
Similarity Score
11
Product of graphs: G×G
Given G(V, E), G×G is defined as follows:
For nodes u and v in G, uv is a node in G×G
For edges (e, u) and (e, v) in G, (ee, uv) is an edge in G×G
[Figure: graph G with nodes a, b, c, d, e, u, v, and its product graph G×G with nodes such as uv, ce, ee, bd, uu, vu, vv, dd, cb, aa, da, ec, cc, ae, ea]
Each node pair in G becomes a node in G×G
Each edge pair in G becomes an edge in G×G
No need to materialize G×G: it exists only conceptually, to facilitate analysis
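The definition can be written down in a few lines. The sketch below materializes G×G only to illustrate the definition on a two-edge graph; the function name `product_graph` and the edge-list representation are assumptions, and the talk itself stresses that G×G never needs to be built in practice.

```python
def product_graph(edges):
    """Edges of G×G: ((a, b), (u, v)) for every pair of edges (a, u), (b, v) in G."""
    return {((a, b), (u, v)) for (a, u) in edges for (b, v) in edges}


# The two edges used in the definition: e -> u and e -> v.
edges = [("e", "u"), ("e", "v")]
pg = product_graph(edges)

for src, dst in sorted(pg):
    print("".join(src), "->", "".join(dst))
```

Since every pair of edges in G yields one edge in G×G, the product graph has |E|^2 edges, which is the reason for treating it as a purely conceptual device.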
12
Coupling random walk
Coupling random walk: two random surfers (Surf1, Surf2) walk simultaneously and follow the same edge direction
A coupling random walk on G can be equivalently transformed into a single random walk on G×G
SimRank: S(u, v) is the first meeting probability of two random surfers starting from u and v respectively and following backward links
[Figure: graph G and its product graph G×G]
13
Compute similarity based on coupling random walk
We actually transform a similarity ranking problem on G into an authority ranking problem on G×G: R(uv) = S(u, v)
Initialization:
Source node (u = v): R(uv) = 1, fixed
Target node (u ≠ v): R(uv) = 0, to be updated
How is R(uv) updated? By a propagation & aggregation process on G×G
Propagation: nodes propagate their authority to their neighbors following random walk steps
Aggregation: nodes receive and aggregate the authorities that are propagated in from their neighbors
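The propagation & aggregation process can be sketched as a push-style loop over node pairs, without ever building G×G. This is an illustrative reading under stated assumptions: the function name `propagate`, the per-step weight C / (|I(u)| |I(v)|) taken from the SimRank recurrence, and the toy graph are not from the slides, and the graph is assumed to have no duplicate edges.

```python
from collections import defaultdict


def propagate(edges, C=0.5, steps=3):
    """Authority R on node pairs of G×G after `steps` rounds of propagation."""
    out_nbrs, in_deg, nodes = defaultdict(list), defaultdict(int), set()
    for a, b in edges:
        out_nbrs[a].append(b)
        in_deg[b] += 1
        nodes.update((a, b))

    # Source nodes ww start at 1; every other pair implicitly starts at 0.
    R = {(w, w): 1.0 for w in nodes}
    for _ in range(steps):
        nxt = defaultdict(float)
        for (a, b), r in R.items():
            # Propagation: push authority from ab to each out-neighbor pair uv.
            for u in out_nbrs[a]:
                for v in out_nbrs[b]:
                    if u != v:
                        # Aggregation: uv sums everything it receives.
                        nxt[(u, v)] += C * r / (in_deg[u] * in_deg[v])
        for w in nodes:
            nxt[(w, w)] = 1.0  # source nodes stay fixed at 1
        R = dict(nxt)
    return R


# Hypothetical toy graph (an assumption): a -> b, a -> c, b -> u, c -> v.
R = propagate([("a", "b"), ("a", "c"), ("b", "u"), ("c", "v")])
print(R[("b", "c")], R[("u", "v")])
```

Only pairs reachable from a source node ever receive authority, so the loop stays sparse when few pairs are related.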
14
Compute S(u,v)?
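Similarity path: a path from a source node to a target node that does not pass through any other source node
Probability of a transition step: (formula not preserved in this transcript)
Similarity: the sum over all similarity paths with end node uv
[Figure: product graph G×G with nodes uv, ce, ee, bd, uu, vu, vv, dd, cb, aa, da, ec, cc, ae, ea]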
15
Compute S(u,v): example
[Figure: product graph G×G with the two similarity paths ending at uv highlighted]
If we only consider 3 steps, with C = 0.5:
Path 1: (ee, uv), P(ee, uv) = 0.5
Path 2: (aa, bd, ce, uv), P(aa, bd, ce, uv) = 0.5 * 1 * 0.5 = 0.25
S3(u, v) = P(ee, uv) * C + P(aa, bd, ce, uv) * C^3 = 0.5 * 0.5 + 0.25 * 0.125 ≈ 0.28
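The arithmetic of this example can be checked directly. In the sketch below the path probabilities are the ones given on the slide; the function name `score_from_paths` and the representation of a path as a tuple of G×G node labels are assumptions.

```python
def score_from_paths(paths, C=0.5):
    """Sum of P(path) * C**(number of steps) over the given similarity paths."""
    return sum(p * C ** (len(path) - 1) for path, p in paths.items())


# Path probabilities as given in the example (3 steps at most).
paths = {
    ("ee", "uv"): 0.5,                      # 1 step
    ("aa", "bd", "ce", "uv"): 0.5 * 1 * 0.5,  # 3 steps
}

S3 = score_from_paths(paths)
print(round(S3, 2))
```

The exact value is 0.25 + 0.03125 = 0.28125, so the longer path contributes only about a tenth of the score, which hints at why a few steps can already be enough.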
16
TopSim
17
Optimization based on SimMap
Observation: many similarity paths overlap
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v; levels 0 to 3 mark the steps of the similarity paths]
SimMap: SM(u) = {(key, value)}
key is the node visited by Surf2 on step i when Surf1 visits the node u
value = Si(key, u)
SM(v) is exactly the result list
TopSim-SM
Example: start from c
SM(b) = {(d, 1/2), (f, 1/2)}
SM(a) = {(e, 1/8)}
SM(v) = {(u, 1/32)}
18
Family of TopSim Algorithms
Algorithm | Quality | Performance
TopSim | Exact | Slow if the graph is not sparse
TopSim-SM | Exact | More efficient than TopSim
Trun-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM
Prio-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM
19
TopSim approximations for Scale-free graphs
Scale-free graphs: some nodes have very high degree (web graphs, citation networks, etc.)
Random surfers get trapped by high-degree nodes
The size of the SimMaps explodes
Revisit the transition probabilities (the formula is not preserved in this transcript)
20
TopSim approximations
Basic idea: only consider similarity paths with higher probability
Truncated TopSim-SM: if P(u0u0, …, uivi) < η, stop and ignore this path
Prioritized TopSim-SM: set a buffer size H for each SimMap and only expand the top H nodes in each SimMap; if |SM(u)| > H, keep only the top H entries so that |SM(u)| = H
See the paper for the accuracy and complexity analysis
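The two pruning rules can be sketched as small filters. The names `truncate_paths` and `prioritize`, and the dictionary representations of paths and SimMaps, are assumptions for illustration; how the paper integrates these rules into the search is not shown here.

```python
import heapq


def truncate_paths(paths, eta=0.001):
    """Truncation rule: drop every similarity path whose probability is below eta."""
    return {path: p for path, p in paths.items() if p >= eta}


def prioritize(simmap, H=100):
    """Prioritization rule: keep only the H highest-scoring entries of a SimMap."""
    if len(simmap) <= H:
        return dict(simmap)
    return dict(heapq.nlargest(H, simmap.items(), key=lambda kv: kv[1]))


# Hypothetical inputs (assumptions), chosen to trigger both rules.
paths = {("aa", "uv"): 0.5, ("bb", "cd", "uv"): 0.0005}
simmap = {"a": 0.3, "b": 0.1, "c": 0.2}

print(truncate_paths(paths, eta=0.001))
print(prioritize(simmap, H=2))
```

Both rules bound the work per step: η cuts off long, low-probability paths, while H caps the fan-out of each SimMap regardless of node degree.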
21
Experiments
Datasets:
Arxiv High Energy Physics paper citation network, with 34,546 nodes and 421,578 edges
DBLP co-author graph, with 0.92M nodes and 6.1M edges
DBLP citation network, with 1.5M papers and 2.1M citations
LiveJournal social network, with 4.84M users and 68.99M friendship ties
Factors: C = 0.5, η = 0.001, H = 100
22
Accuracy of similarity scores
[Figure panels: accuracy ratio; accuracy loss]
(Running on Arxiv citation network)
3 steps/iterations are good enough for the accuracy of top-20 list
23
Precision@k
(Running on DBLP citation network)
k around 20~30 yields the highest precision
3 steps/iterations yield a high precision
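One common way to compute Precision@k is the overlap between an approximate top-k list and the exact one. The sketch below assumes that definition; the slides do not spell out the formula used in the experiments, and the function name and the sample lists are hypothetical.

```python
def precision_at_k(approx, exact, k):
    """Fraction of the approximate top-k that also appears in the exact top-k."""
    return len(set(approx[:k]) & set(exact[:k])) / k


# Hypothetical ranked lists (assumptions), best match first.
approx = ["a", "b", "c", "d"]
exact = ["a", "c", "e", "f"]

print(precision_at_k(approx, exact, 2))
print(precision_at_k(approx, exact, 3))
```

This measure ignores the order inside the top-k, which is why a rank-aware measure is reported as well.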
24
Kendall Tau distance (care more about the ranking order …)
[Figure: a pair (a, b) ranked in the same order in both lists is concordant; a pair ranked in opposite orders is discordant]
The higher, the better
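Counting concordant and discordant pairs can be sketched as below. The normalization (concordant minus discordant, divided by the number of compared pairs) is the usual Kendall tau coefficient and is an assumption here, chosen because it matches "the higher, the better"; the slides do not state the exact formula. Items missing from either list are ignored, and 0.0 is returned when fewer than two items are shared.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """(concordant - discordant) / pairs, over items present in both rankings."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    common = [x for x in rank_a if x in pos_b]

    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # Same sign of the position differences means the same relative order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1

    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0


print(kendall_tau(["a", "b"], ["a", "b"]))  # the pair keeps its order
print(kendall_tau(["a", "b"], ["b", "a"]))  # the pair is swapped
```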
25
Kendall Tau distance (care more about the ranking order …)
k around 20~30 yields the highest precision
3 steps/iterations yield a high precision
26
Running time with different node sizes and node degrees
TopSim algorithms are not very sensitive to the graph size
TopSim approximations can handle high degree nodes
27
Running time and accessed nodes
28
Excitements
We transform a similarity problem on graph G into an equivalent authority ranking problem on the product graph G × G to facilitate analysis;
We propose a family of TopSim algorithms that:
produce top-k results with an accuracy guarantee;
only access a small portion of the graph;
handle both SimRank and P-Rank under the same top-k framework.
Questions?
SimRank P-Rank
TopSim
29
TopSim-SM
Start from v and find source nodes at each step
From level n-1 to 0:
Let Surf1 start from a source node and walk to node v
Let Surf2 start from the same source node and put the visited nodes into SimMaps
When Surf1 visits v, Surf2 visits exactly the nodes similar to v in the same step
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v; levels 0 to 3 mark the steps]
Example: start from c
SM(b) = {(d, 1/2), (f, 1/2)}
SM(a) = {(e, 1/8)}
SM(v) = {(u, 1/32)}
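The steps above can be sketched as a memoized recursion over SimMaps. This is one possible reading, not the paper's algorithm: the function names, the per-step weight C / (|I(x)| |I(y)|), and the memoization by (node, level) are assumptions. The edge list is also hypothetical, since the figure's edges are not recoverable; it was chosen so that the sketch reproduces the SimMap values of the example.

```python
from collections import defaultdict
from functools import lru_cache
import heapq


def topsim_sm(edges, v, k=5, C=0.5, steps=3):
    """Top-k nodes most similar to v, looking at most `steps` levels behind v."""
    in_nbrs, out_nbrs = defaultdict(list), defaultdict(list)
    for a, b in edges:
        out_nbrs[a].append(b)
        in_nbrs[b].append(a)

    @lru_cache(maxsize=None)
    def simmap(x, level):
        # SimMap of x: where Surf2 can be while Surf1 is at x, with scores.
        sm = defaultdict(float)
        for w in in_nbrs[x]:
            # Surf1 reaches x from w; reuse w's SimMap and add w as a source node.
            prev = dict(simmap(w, level + 1)) if level + 1 < steps else {}
            prev[w] = 1.0
            for y, score in prev.items():
                for nxt in out_nbrs[y]:
                    if nxt != x:  # the two surfers must not meet again
                        sm[nxt] += score * C / (len(in_nbrs[x]) * len(in_nbrs[nxt]))
        return tuple(sm.items())

    return heapq.nlargest(k, simmap(v, 0), key=lambda kv: kv[1])


# Hypothetical edges (an assumption) consistent with the example's numbers.
edges = [("c", "b"), ("c", "d"), ("c", "f"), ("b", "a"), ("g", "a"),
         ("d", "e"), ("f", "e"), ("a", "v"), ("h", "v"), ("e", "u")]

print(topsim_sm(edges, "v", k=1))
```

Because a SimMap depends only on the node and its level, overlapping similarity paths are expanded once and shared, and only nodes within `steps` hops behind v are ever touched.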