billion-scale network embedding with iterative random ...4 challenge: billion-scale network data...

23
Billion-scale Network Embedding with Iterative Random Projection Ziwei Zhang Peng Cui Haoyang Li Xiao Wang Wenwu Zhu Tsinghua U Tsinghua U Tsinghua U Tsinghua U Tsinghua U

Upload: others

Post on 23-Aug-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

Billion-scale Network Embedding

with Iterative Random Projection

Ziwei Zhang Peng Cui Haoyang Li Xiao Wang Wenwu Zhu

Tsinghua U Tsinghua U Tsinghua U Tsinghua U Tsinghua U

Page 2: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

2

Network Data is Ubiquitous

Social Network Biology Network

Traffic Network

Page 3: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

3

Network Embedding: Vector Representation of Nodes

Generate

Embed

Apply feature-based machine

learning algorithms

Fast computing of nodes similarity

Support parallel computing

Applications: link prediction,

node classification, community

detection, centrality measure,

anomaly detection ...

Page 4: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

4

Challenge: Billion-scale Network Data

Social Networks

WeChat: 1 billion monthly active users (March,

2018)

Facebook: 2 billion active users (2017)

E-commerce Networks

Amazon: 353 million products, 310 million users, 5

billion orders (2017)

Citation Networks

130 million authors, 233 million publications, 754

million citations (Aminer, 2018)

How to conduct network embedding for

such large-scale network data?

Page 5: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

5

Bottleneck of Existing Methods Methods based on random-walks

DeepWalk, B. Perozzi, et al. KDD 2014.

LINE, J. Tang, et al. WWW 2015.

Node2vec, A. Grover, et al. KDD 2016.

Methods based on matrix factorization

M-NMF, X. Wang, et al. AAAI 2017.

AROPE, Z. Zhang, et al. KDD 2018.

Methods based on deep learning

SDNE, D. Wang, et al. KDD 2016.

DVNE, D. Zhu, et al. KDD 2018.

Common bottleneck: based on sophisticated optimization

Computationally expansive

Hard to resort to distributed computing scheme

Optimization is entangled and needs global information

β†’ Communication cost is high

Only handle thousands or millions of nodes and edges

Page 6: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

Network embedding: essentially a dimension reduction problem

Random projection: optimization-free for dimension reduction

Basic idea: randomly project data into a low-dimensional subspace

Extremely efficient and friendly to distributed computing

6

Random Projection

Page 7: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

Key network property: high-order proximity

Can solve the network sparsity problem

Measure indirect relationship between nodes

β†’ How to design a high-order proximity preserved random projection?

7

High-Order Proximity

Target Node

First-order

Second-order

Third-order

…

Page 8: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

8

Problem Formulation Objective function: matrix factorization of preserving high-order proximity

minπ‘ˆ,𝑉

𝑆 βˆ’ π‘ˆπ‘‰π‘‡ 𝑝2

𝑆 = 𝑓 𝐴 = 𝛼1𝐴1 + 𝛼2𝐴

2 +β‹―+ π›Όπ‘žπ΄π‘ž

Slight modification: assuming positive semi-definite and using 2 norm

minπ‘ˆ

𝑆𝑆𝑇 βˆ’ π‘ˆπ‘ˆπ‘‡2

𝑆 = 𝑓 𝐴 = 𝛼1𝐴1 + 𝛼2𝐴

2 +β‹―+ π›Όπ‘žπ΄π‘ž

Random projection:

Denote 𝑅 ∈ ℝ𝑁×𝑑 as a Gaussian random matrix

𝑅𝑖𝑗~𝒩 0,1

𝑑

Surprisingly simple result:

π‘ˆ = 𝑆𝑅

Page 9: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

9

Theoretical Guarantee

Theoretical guarantee

Basically, random projection can effectively minimize the objective function

However, calculating 𝑆 is still very time consuming

Page 10: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

10

Iterative Projection Iterative projection:

π‘ˆ = 𝑆𝑅 = 𝛼1𝐴1 + 𝛼2𝐴

2 +β‹―+ π›Όπ‘žπ΄π‘ž 𝑅

= 𝛼1𝐴1𝑅 + 𝛼2𝐴

2𝑅 +β‹―+ π›Όπ‘žπ΄π‘žπ‘…

Can be calculated iteratively

Why efficient?

𝐴: 𝑁 ×𝑁 sparse adjacency matrix

𝑅: 𝑁 Γ— 𝑑 low-dimensional matrix

Associative law of matrix multiplication

𝐴𝐴…𝐴𝐴𝐴𝑅

Γ— A Γ— A Γ— A

𝐴𝐴…𝐴𝐴 𝐴𝑅

𝐴𝐴…𝐴 𝐴𝐴𝑅

Sparse

Low-dimensional

Sparse

Low-dimensional

Sparse

Low-dimensional

Sparse matrix multiplication!

Page 11: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

11

Iterative Projection

A1 A2 AqΓ— A Γ— A Γ— A…

Γ— R

π‘ˆ1Γ— A

π‘ˆ2 π‘ˆπ‘žΓ— A Γ— A…

Γ— R

Time Consuming!

Efficient!

Page 12: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

12

RandNE: Iterative Random projection Network Embedding

Time Complexity: 𝑂 π‘žπ‘€π‘‘ + 𝑁𝑑2

𝑁/𝑀: number of nodes/edges; 𝑑: dimension; π‘ž: order

Linear w.r.t. network size

Only need to calculate q sparse matrix products

Orders of magnitude faster than existing methods!

Advantages:

Distributed Calculation

Dynamic Updating

Page 13: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

13

Distributed Calculation

Iterative random projection only involves matrix product π‘ˆπ‘– = π΄π‘ˆπ‘–βˆ’1

Each dimension can be calculated separately

Property of sparse matrix product

No communication is needed during calculation!

π‘ˆ ∈ ℝ𝑁×𝑑

Page 14: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

14

Dynamic Updating Networks are dynamic in nature

E.g., in social networks, users add/delete friends, new users join, old users leave

Changes of edges β†’ Calculate incremental parts!

π‘ˆπ‘– + Ξ”π‘ˆπ‘– = 𝐴 + Δ𝐴 β‹… (π‘ˆπ‘–βˆ’1+Ξ”π‘ˆπ‘–βˆ’1)

β†’ Ξ”π‘ˆπ‘– = 𝐴 β‹… Ξ”π‘ˆπ‘–βˆ’1 + Δ𝐴 β‹… π‘ˆπ‘–βˆ’1 + Δ𝐴 β‹… Ξ”π‘ˆπ‘–βˆ’1

Changes of nodes β†’ adjust the dimensionality

TimeT T+1

Page 15: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

15

Dynamic Updating

Linear scalability w.r.t. number of changed nodes/edges

No error accumulation

Identical results as re-running the algorithm

Page 16: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

16

Experimental Setting: Moderate-scale Networks

Datasets: BlogCatalog, Flickr, YouTube

Baselines:

DeepWalk (KDD 2014): DFS random walk + skip-gram

LINE (WWW 2015): BFS random walk + skip-gram

Node2vec (KDD 2016): biased random walk + skip-gram

SDNE (KDD 2016): deep auto-encoder

Page 17: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

17

Experimental Results Running time

At least dozens of times faster

Page 18: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

20

Experimental Results Node Classification

Page 19: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

22

Experimental Results Parameter analysis:

Effectiveness of preserving high-order proximity

Scalability

<4 minutes for network with 1 million nodes, 100 million edges with one PC

Page 20: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

23

Experiments on a Billion-scale Network Experimental results on WeChat

250 millions nodes, 4.8 billion edges

Network Reconstruction

Dynamic link prediction

Better results and no error accumulation!

Page 21: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

24

Experimental Results Running time and acceleration ratio

Practical running time for real billion-scale networks

Number of Computing Nodes 4 8 12 16

Running Time(s) 82157 46029 33965 24757

<7 hours!

Support distributed computing

Page 22: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

25

Conclusion RandNE: a billion-scale network embedding method

Based on iterative random projection to preserve high-order proximities

Much more computationally efficient

Distributed algorithm

Handle dynamic networks

Experimental results on moderate-scale networks

At least one order of magnitude faster

Better or comparable performance

Linear scalability

Experiments on WeChat, a real billion-scale network

Better results in network reconstruction and link prediction

No error accumulation

Linear acceleration ratio

Page 23: Billion-scale Network Embedding with Iterative Random ...4 Challenge: Billion-scale Network Data Social Networks WeChat: 1 billion monthly active users (March, 2018) Facebook: 2 billion

26

Thanks!Ziwei Zhang, Tsinghua University

[email protected]

http://zw-zhang.github.io/

http://nrl.thumedialab.com/