MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi
Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu,
Qi Zhang, Xiaoling Li and Lei Luo
College of Computer
National University of Defense Technology
10/7/2017
Outline
Section 1 Backgrounds & Motivation
– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture
Section 2 The MicRun Framework
– Bucket Grouping Module
– Auto-tuning Module
Section 3 Experiments & Conclusions
• Scale-free Graphs are Widely Used
− Social Networks Applications
− Chemical Molecular Structures
− Reference Citations
• Features of Scale-free Graphs
− The Sparsity Characteristic of Graphs
− The Connectivity of Vertices Follows Power-law Distribution
[Figure: degree distributions on log-log axes (x: Degree, y: Number of Vertices). (a) Higgs-twitter (b) Soc-pokec]
Backgrounds & Motivation
Power-law distribution: $y = x^{-\gamma}$
• Graph Algorithms: sequential computation steps
− Load values of source vertices
− Load values of edges
− Compute (e.g., addition, minimum, etc.)
− Update destination vertices
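The four steps above map onto an edge-centric loop. As a minimal sketch, assuming an edge list of (source, destination, weight) triples and Bellman-Ford as the algorithm, one pass could look like the following. The names `relax_edges` and `bellman_ford` are illustrative and are not taken from MicRun.

```python
def relax_edges(dist, edges):
    """Run one sequential pass over all edges; return True if any value changed."""
    changed = False
    for src, dst, weight in edges:
        src_val = dist[src]              # step 1: load value of the source vertex
        edge_val = weight                # step 2: load value of the edge
        candidate = src_val + edge_val   # step 3: compute (addition)
        if candidate < dist[dst]:        # step 3: compute (minimum)
            dist[dst] = candidate        # step 4: update the destination vertex
            changed = True
    return changed


def bellman_ford(num_vertices, edges, source):
    """Repeat edge passes until no distance changes (at most num_vertices - 1 passes)."""
    dist = [float("inf")] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):
        if not relax_edges(dist, edges):
            break
    return dist
```

For example, `bellman_ford(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)], 0)` computes the shortest distances from vertex 0. PageRank fits the same four steps with a different compute operator.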
• The Xeon Phi Architecture
− Architecture: Many Integrated Core (MIC)
− 512-bit VPU and four hyper-threads supported
− Frequency is more than 1.50 GHz
− Memory (GDDR5) is more than 8 GB
− 57-72 cores with optimized KNC instruction set
− Connected to the CPU via PCIe
• Challenges of Executing Graph Algorithms on the Xeon Phi
− SIMD access locality is influenced by the access range
− Write conflicts can occur in SIMD parallelism
• Tiling-and-Grouping Strategy is Commonly Used
− Tiling enhances data locality
− Grouping removes parallel conflicts
− Related citations:
  Efficient Parallel Graph Processing over CPU and MIC (Chen et al., CGO 2016)
  Reusing Data Reorganization of Graph Applications (Jiang et al., IPDPS 2016)
  Optimizing Scale-free SpMV on the Intel Xeon Phi (Tang et al., CGO 2015)
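To make the write-conflict challenge concrete, here is a small pure-Python simulation. It is only a sketch under stated assumptions: `simd_scatter_add` is a hypothetical helper that mimics one SIMD step, in which all lanes gather their destination values first and then scatter their results, so lanes that share a destination overwrite each other. It does not model the actual KNC instructions.

```python
def simd_scatter_add(values, dst_idx, contrib):
    """Simulate one SIMD gather-add-scatter step over a group of lanes."""
    values = list(values)
    gathered = [values[d] for d in dst_idx]               # all lanes read first
    summed = [g + c for g, c in zip(gathered, contrib)]   # lane-wise addition
    for d, s in zip(dst_idx, summed):
        values[d] = s                                     # later lanes overwrite earlier ones
    return values


def sequential_add(values, dst_idx, contrib):
    """Reference result: apply the same updates one at a time."""
    values = list(values)
    for d, c in zip(dst_idx, contrib):
        values[d] += c
    return values
```

With destinations `[1, 1, 2]` and contributions `[5, 7, 1]`, the simulated SIMD step loses the update of 5 to vertex 1, while the sequential version keeps both. When every lane in a group has a distinct destination, the two versions agree, which is the property grouping is meant to guarantee.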
• New Challenges Appear
− High penalty when using greedy grouping
− Difficult to select the optimal tile size
[Figure: (a) Time Overhead: blocking time and grouping time in seconds for soc-pokec and higgs-twitter. (b) Memory Overhead: file size in MB versus tile size (orig, 128, 256, 512, 1024, 2048, 4096, 8192, 16384) for soc-pokec and higgs-twitter]
• Overview of the Framework and the Modules
− Tiling Module
− Bucket Grouping Module
− Auto-tuning Module
− Graph Algorithms
[Figure: Workflow of the MicRun Framework.]
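The slides do not spell out how the Tiling Module partitions a graph. As a rough sketch, assuming tiles are square blocks of the adjacency matrix and edges are (source, destination) pairs, tiling could be expressed as follows. The function name `tile_edges` and the block layout are assumptions made for illustration.

```python
from collections import defaultdict


def tile_edges(edges, tile_size):
    """Assign each (src, dst) edge to the square tile that contains it."""
    tiles = defaultdict(list)
    for src, dst in edges:
        tile_key = (src // tile_size, dst // tile_size)  # (tile row, tile column)
        tiles[tile_key].append((src, dst))
    return dict(tiles)
```

Edges inside one tile touch at most `tile_size` distinct sources and `tile_size` distinct destinations, which bounds the access range of a SIMD step. Smaller tiles tighten that range but leave fewer edges per tile to group, which is why the tile size needs tuning.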
The MicRun Framework
• Grouping Module
− Bucket structure is introduced to construct groups
− Max-heap optimization is used to improve efficiency
− Grouping complexity: O(n^2), reduced to O(n*log(b)) with the max-heap
[Figure: (a) nnz in a tile, numbered 1 to 16 (axes: Source Vertices, Dest. Vertices). (b) nnz transformed into groups using buckets (axes: Bucket number 8 to 1, nnz in buckets)]

Strategy                  Group1    Group2     Group3      Group4     Group5    Group6    SIMD utilization
Bucket                    7-1-2-4   11-3-9-12  14-6-10-13  15-5-8-D   16-D-D-D  NULL      16/20
Sequential (Chen, 2016)   1-2-3-4   5-6-7-8    9-10-11-12  13-14-D-D  15-D-D-D  16-D-D-D  16/24
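The bucket idea can be sketched in a few lines. This is not the MicRun implementation. It is a minimal illustration that assumes one bucket per destination vertex, a SIMD group of `width` lanes, and a max-heap (built with Python's min-heap `heapq` on negated sizes) that always draws from the fullest buckets first. The names `bucket_group` and `simd_utilization` are hypothetical.

```python
import heapq
from collections import defaultdict


def bucket_group(edges, width):
    """Group (src, dst) edges so that no group holds two edges with the same dst."""
    buckets = defaultdict(list)
    for edge in edges:
        buckets[edge[1]].append(edge)             # one bucket per destination vertex
    heap = [(-len(items), dst) for dst, items in buckets.items()]
    heapq.heapify(heap)                           # negated sizes give max-heap order
    groups = []
    while heap:
        group, refill = [], []
        for _ in range(min(width, len(heap))):
            neg_size, dst = heapq.heappop(heap)   # fullest remaining bucket
            group.append(buckets[dst].pop())
            if neg_size + 1 < 0:                  # bucket still has edges left
                refill.append((neg_size + 1, dst))
        for item in refill:
            heapq.heappush(heap, item)
        groups.append(group)
    return groups


def simd_utilization(groups, width):
    """Fraction of SIMD lanes that carry a real edge rather than padding."""
    return sum(len(g) for g in groups) / (len(groups) * width)
```

Each edge costs one heap pop and at most one push, so the work is O(n*log(b)) for n edges and b buckets. In the slide's example the bucket strategy fills 16 of 20 lanes (0.80), against 16 of 24 (about 0.67) for sequential grouping.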
• Auto-tuning Module
− Extract features based on the ideal graph application
  The size of the graph's adjacency matrix is related to its sparsity
  The number of nnz in the graph affects the overall memory usage
  The number of nnz in each column is related to the nnz distribution
  The average stride between nnz affects cache misses
  The feature tuple is constructed as: (s, n, γ, NC, ST)
− Decision tree model is employed
  The training target OT is obtained by manual probing
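[Equation: time model of the ideal graph application, T_sum, composed of time terms T (with int/float superscripts and subscripts c, r, nc, g, comp, s) summed over i = 1..p, j = 1..q, k = 1..t, together with the total nnz]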
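As an illustration of the feature tuple (s, n, γ, NC, ST), the sketch below computes one plausible version of each feature from a list of nonzero positions. The exact definitions are not given on the slides, so every choice here is an assumption: s is taken as the matrix dimension, NC as the mean nnz per non-empty column, ST as the mean gap between consecutive nnz in a row, and γ as a standard continuous maximum-likelihood estimate of the power-law exponent with minimum degree 1. The name `extract_features` is hypothetical.

```python
import math
from collections import defaultdict


def extract_features(num_vertices, edges):
    """Return (s, n, gamma, nc, st) for the (row, col) nonzeros of an adjacency matrix."""
    edges = sorted(set(edges))                 # sorted, so columns ascend within each row
    s = num_vertices                           # assumed: matrix dimension
    n = len(edges)                             # number of nonzeros
    per_col = defaultdict(int)
    per_row = defaultdict(list)
    for row, col in edges:
        per_col[col] += 1
        per_row[row].append(col)
    nc = n / len(per_col) if per_col else 0.0  # assumed: mean nnz per non-empty column
    strides = [b - a for cols in per_row.values() for a, b in zip(cols, cols[1:])]
    st = sum(strides) / len(strides) if strides else 0.0  # assumed: mean in-row gap
    degrees = list(per_col.values())
    log_sum = sum(math.log(d) for d in degrees)           # d_min = 1, so log(d / 1)
    gamma = 1.0 + len(degrees) / log_sum if log_sum > 0 else float("inf")
    return (s, n, gamma, nc, st)
```

A tuple like this, paired with a manually probed tile size as the label, is the kind of training row a decision tree classifier could consume.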
• Platform
− MIC node on the Tianhe-Ⅱ supercomputer
− The version of the Xeon Phi is 31S1P
− 57 x86 cores, 1.10 GHz, 4 hyper-threads per core
− The capacity of L2 cache is 28.5 MB
− Intel ICC 13.0.0, -O3 enabled
• Graph Applications
− Bellman-Ford Algorithm
− PageRank Algorithm
• Datasets
− SNAP Dataset
− University of Florida Sparse Matrix Collection
Experiments
• College of Computer of NUDT
• Hometown of Supercomputers: Tianhe-Ⅱ
– No. 1 in TOP500 (2013.6 – 2015.11)
– 33.86 PFLOPS, 32,000 CPUs + 48,000 MICs
• Bucket Grouping vs. Sequential Grouping (Chen, 2016)
[Figure: (a) Time overhead during the grouping stage. (b) SIMD utilization of the two grouping strategies]
− Grouping time overhead: decreases steadily
− SIMD utilization ratio: converges to 1 faster
• The Execution of Two Graph Algorithms
[Figure: (a) Comparison of execution time. (b) Execution time of Bellman-Ford. (c) Execution time of PageRank]
1.2x on average
Table: Speedup achieved by OPT. and AUTO. tiling over sequential tiling performance (Val: speedup, Size: tile size)

                 Bellman-Ford                        PageRank
                 OPT. vs. SEQ.    AUTO. vs. SEQ.     OPT. vs. SEQ.    AUTO. vs. SEQ.
Datasets         Val     Size     Val     Size       Val     Size     Val     Size
lp_osa_60        1.08    1024     1.03    256        1.07    256      1.07    256
msdoor           1.11    1152     1.05    4096       1.14    512      1.14    512
rajat24          1.18    2048     1.09    256        1.09    768      1.09    768
Si87H76          1.05    128      1.05    128        1.14    128      1.03    512
higgs-twitter    1.26    896      1.13    3072       1.33    1024     1.21    640
kron-logn18      1.29    4096     1.29    4096       1.36    2048     1.25    1024
• The Performance of the Auto-tuning Module
− Optimal over Sequential: 1.05x ~ 1.36x
− Auto-tuning over Sequential: 1.03x ~ 1.29x
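The two summary ranges can be checked directly against the speedup table. The snippet below is only a worked cross-check using the values reported on the slide. The dictionary layout and variable names are illustrative.

```python
# Speedups over sequential tiling, copied from the table:
# (Bellman-Ford OPT., Bellman-Ford AUTO., PageRank OPT., PageRank AUTO.)
speedups = {
    "lp_osa_60":     (1.08, 1.03, 1.07, 1.07),
    "msdoor":        (1.11, 1.05, 1.14, 1.14),
    "rajat24":       (1.18, 1.09, 1.09, 1.09),
    "Si87H76":       (1.05, 1.05, 1.14, 1.03),
    "higgs-twitter": (1.26, 1.13, 1.33, 1.21),
    "kron-logn18":   (1.29, 1.29, 1.36, 1.25),
}

opt = [v for row in speedups.values() for v in (row[0], row[2])]
auto = [v for row in speedups.values() for v in (row[1], row[3])]

opt_range = (min(opt), max(opt))      # optimal tiling over sequential
auto_range = (min(auto), max(auto))   # auto-tuned tiling over sequential

# Cases where the auto-tuned speedup equals the optimal one.
matches = sum(1 for o, a in zip(opt, auto) if o == a)
```

The computed ranges agree with the summary (1.05 to 1.36 and 1.03 to 1.29), and the auto-tuned choice reaches the optimal speedup in 5 of the 12 dataset and algorithm combinations.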
• The MicRun Framework
− Grouping Module
  Bucket structure is employed
  Max-heap mechanism is embedded
− Auto-tuning Module
  Decision tree classifier is introduced
• Future work
− Enrich the built-in graph algorithms
− Extend the framework to the MIMD parallel level
Conclusions
The Tianhe-2 supercomputer is available online. All scientists are welcome to collaborate with us on developing new software and to access Tianhe-2 through the Internet.
Contact us! Email: [email protected]
Thank you! Questions?