
MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi

Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo

College of Computer

National University of Defense Technology

10/7/2017

Outline

Section 1 Backgrounds & Motivation
– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture

Section 2 The MicRun Framework
– Bucket Grouping Module
– Auto-tuning Module

Section 3 Experiments & Conclusions


• Scale-free Graphs are Widely Used
− Social Network Applications
− Chemical Molecular Structures
− Reference Citations

• Features of Scale-free Graphs
− The Sparsity Characteristic of Graphs
− The Connectivity of Vertices Follows a Power-law Distribution

[Figure: vertex degree distributions on log-log axes (x: Degree, y: Number of Vertices). (a) Higgs-twitter (b) Soc-pokec]

Backgrounds & Motivation

$y = x^{-\gamma}$


• Graph Algorithms: Sequential Computation Steps
− Load values of source vertices
− Load values of edges
− Compute (e.g., addition, minimum, etc.)
− Update destination vertices
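The four steps above can be illustrated with Bellman-Ford, one of the algorithms evaluated later in the talk. This is a minimal Python sketch, not the MicRun implementation: the function names `relax_edges` and `bellman_ford` and the `(src, dst, weight)` edge-list layout are assumptions made for illustration.

```python
import math

def relax_edges(dist, edges):
    """One sequential pass over all edges; returns True if any distance changed."""
    changed = False
    for src, dst, weight in edges:
        src_val = dist[src]               # 1. load the value of the source vertex
        edge_val = weight                 # 2. load the value of the edge
        candidate = src_val + edge_val    # 3. compute (addition, then minimum below)
        if candidate < dist[dst]:
            dist[dst] = candidate         # 4. update the destination vertex
            changed = True
    return changed

def bellman_ford(num_vertices, edges, source):
    """Repeat the relaxation pass until nothing changes (at most V-1 passes)."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):
        if not relax_edges(dist, edges):
            break
    return dist

example_edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
example_dist = bellman_ford(4, example_edges, 0)
```

Each edge is handled independently except for step 4, which is where the SIMD difficulties discussed on the following slides come from.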

• The Xeon Phi Architecture
− Architecture: Many Integrated Core (MIC)
− 512-bit VPU and four hyper-threads supported
− Frequency is more than 1.50 GHz
− Memory (GDDR5) is more than 8 GB
− 57-72 cores with optimized KNC instruction set
− Connects to the CPU via PCIe



• Challenges of Executing Graph Algorithms on Phi
− SIMD access locality is influenced by the access range
− Write conflicts can occur in SIMD parallelism
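The write-conflict problem can be reproduced without any vector hardware. The sketch below emulates a SIMD gather-add-scatter in plain Python, under the assumption that all lanes read their destination before any lane writes. The helper names are hypothetical and the lane width of 4 is chosen only to keep the example small.

```python
def simd_scatter_add(values, dst_idx, contrib):
    """Emulate one SIMD step: every lane reads first, then every lane writes."""
    gathered = [values[d] for d in dst_idx]                # gather for all lanes
    results = [g + c for g, c in zip(gathered, contrib)]   # lane-wise addition
    out = list(values)
    for d, r in zip(dst_idx, results):                     # scatter: a later lane overwrites an earlier one
        out[d] = r
    return out

def scalar_scatter_add(values, dst_idx, contrib):
    """Reference result: apply the updates one after another."""
    out = list(values)
    for d, c in zip(dst_idx, contrib):
        out[d] += c
    return out

values = [0, 0, 0, 0]
dst_idx = [1, 2, 1, 3]        # lanes 0 and 2 update the same destination vertex
contrib = [10, 20, 30, 40]

simd_result = simd_scatter_add(values, dst_idx, contrib)      # the update from lane 0 is lost
scalar_result = scalar_scatter_add(values, dst_idx, contrib)  # both updates are kept
```

Grouping edges so that no two lanes of a vector share a destination avoids this lost update.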

• Tiling-and-Grouping Strategy is Commonly Used
− Tiling enhances data locality
− Grouping removes parallel conflicts
− Related citations:
Efficient Parallel Graph Processing over CPU and MIC (Chen et al. CGO. 2016)
Reusing Data Reorganization of graph Applications. (Jiang et al. IPDPS. 2016)
Optimizing scale-free SPVM on the Intel Xeon Phi. (Tang et al. CGO 2015)
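As a rough illustration of the tiling half of the strategy, the sketch below partitions an edge list into square tiles of the adjacency matrix, so that each tile touches only a bounded range of source and destination vertices. The square tile shape, the `(src, dst)` edge layout and the name `tile_edges` are assumptions, not details taken from the cited papers.

```python
from collections import defaultdict

def tile_edges(edges, tile_size):
    """Partition (src, dst) edges by the tile of the adjacency matrix they fall in."""
    tiles = defaultdict(list)
    for src, dst in edges:
        tile_id = (src // tile_size, dst // tile_size)   # (tile row, tile column)
        tiles[tile_id].append((src, dst))
    return dict(tiles)

example_tiles = tile_edges([(0, 1), (1, 5), (4, 4), (5, 0), (7, 7)], tile_size=4)
```

Within one tile, the vertex values accessed span at most `tile_size` consecutive entries on each side, which is what improves SIMD access locality.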



• New Challenges Appear
− High penalty when using greedy grouping
− Difficult to select the optimal tile size

[Figure: (a) Time Overhead: blocking time and grouping time in seconds for soc-pokec and higgs-twitter. (b) Memory Overhead: file size (MB) versus tile size (orig, 128, 256, 512, 1024, 2048, 4096, 8192, 16384) for soc-pokec and higgs-twitter]



• Overview of the Framework and the Modules
− Tiling Module
− Bucket Grouping Module
− Auto-tuning Module
− Graph Algorithms

Workflow of the MicRun Framework.

The MicRun Framework


• Grouping Module
− Bucket structure is introduced to construct groups
− Max-heap optimization is used to improve efficiency

[Figure: (a) nnz in a tile (axes: Dest. Vertices, Source Vertices); (b) nnz transformed into groups using buckets (axes: Bucket number, nnz in buckets)]

Method                    Group1    Group2      Group3       Group4      Group5     Group6     SIMD utilization
Bucket                    7-1-2-4   11-3-9-12   14-6-10-13   15-5-8-D    16-D-D-D   NULL       16/20
Sequential (Chen. 2016)   1-2-3-4   5-6-7-8     9-10-11-12   13-14-D-D   15-D-D-D   16-D-D-D   16/24

Complexity: O(n^2)


Complexity: O(n^2) vs. O(n*log(b))
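One plausible reading of the bucket-plus-max-heap idea is sketched below; the slides do not spell out the algorithm, so everything here is an assumption. Edges that share a destination vertex go into the same bucket, and each group takes at most one edge from each of the currently fullest buckets, which a max-heap keyed on remaining bucket size makes cheap to find. The names `bucket_grouping` and `simd_utilization` are hypothetical.

```python
import heapq
from collections import defaultdict

def bucket_grouping(edges, width=4):
    """Pack (src, dst) edges into groups of at most `width` edges with distinct dst."""
    buckets = defaultdict(list)
    for src, dst in edges:
        buckets[dst].append((src, dst))          # one bucket per destination vertex

    # heapq is a min-heap, so sizes are negated to pop the fullest bucket first.
    heap = [(-len(items), dst) for dst, items in buckets.items()]
    heapq.heapify(heap)

    groups = []
    while heap:
        group, taken = [], []
        for _ in range(min(width, len(heap))):
            neg_size, dst = heapq.heappop(heap)
            group.append(buckets[dst].pop())     # one edge per bucket: no shared destination
            taken.append((neg_size + 1, dst))
        for neg_size, dst in taken:
            if neg_size < 0:                     # bucket still holds edges
                heapq.heappush(heap, (neg_size, dst))
        groups.append(group)
    return groups

def simd_utilization(groups, width=4):
    """Fraction of SIMD lanes that carry a real edge rather than padding."""
    return sum(len(g) for g in groups) / (len(groups) * width)

example_edges = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 3), (7, 4)]
example_groups = bucket_grouping(example_edges, width=4)
```

In this sketch every edge causes one heap pop and at most one push, so with b buckets the work is O(n*log(b)), which matches the bound shown on the slide if b is taken to be the number of buckets.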


• Auto-tuning Module
− Extract Features Based on the Ideal Graph Application
The size of the graph's adjacency matrix is related to its sparsity characteristic
The number of nnz in the graph influences the overall memory usage
The number of nnz in each column is related to the distribution of the nnz
The average stride between nnz influences the cache miss rate
The feature tuple is constructed as: (s, n, γ, NC, ST)
− Decision Tree Model is Employed
The training target OT is obtained by manual probing


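[Equation: T_sum for the ideal graph application, written as sums over i = 1..p, j = 1..q and k = 1..t of integer and floating-point time terms T (subscripts c, r, nc, g, comp, s) together with nnz_total]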
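A feature tuple of the form (s, n, γ, NC, ST) could be computed along the following lines. The slides give only the feature names, so the concrete definitions here are assumptions: s is the adjacency matrix dimension, n the number of nnz, NC the mean nnz per non-empty column, and ST the mean gap between consecutive nnz in a row. For γ the sketch swaps in the standard continuous maximum-likelihood estimate of a power-law exponent, γ = 1 + m / Σ ln(d_i / d_min), applied to the column degrees; the authors may have fitted γ differently.

```python
import math
from collections import Counter, defaultdict

def extract_features(edges, num_vertices):
    """Return (s, n, gamma, nc, st) for a list of (src, dst) edges."""
    n = len(edges)
    col_counts = Counter(dst for _, dst in edges)
    nc = n / len(col_counts) if col_counts else 0.0       # mean nnz per non-empty column

    # Power-law exponent via continuous MLE over the column degrees (assumed method).
    degrees = list(col_counts.values())
    d_min = min(degrees) if degrees else 1
    log_sum = sum(math.log(d / d_min) for d in degrees)
    gamma = 1.0 + len(degrees) / log_sum if log_sum > 0 else math.inf

    # Average stride: mean gap between consecutive column indices within each row.
    rows = defaultdict(list)
    for src, dst in edges:
        rows[src].append(dst)
    gaps = []
    for cols in rows.values():
        cols.sort()
        gaps.extend(b - a for a, b in zip(cols, cols[1:]))
    st = sum(gaps) / len(gaps) if gaps else 0.0

    return (num_vertices, n, gamma, nc, st)

example_features = extract_features(
    [(0, 1), (0, 3), (0, 7), (1, 1), (2, 1), (2, 3)], num_vertices=8
)
```

A tuple like this, paired with a manually probed optimal tile size as the label, is the kind of training sample a decision tree classifier would consume.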









• Platform
− MIC node on the Tianhe-Ⅱ supercomputer
− The version of the Xeon Phi is 31S1P
− 57 x86 cores, 1.10 GHz, 4 hyper-threads per core
− The capacity of the L2 cache is 28.5 MB
− Intel ICC 13.0.0, -O3 enabled

• Graph Applications
− Bellman-Ford Algorithm
− PageRank Algorithm

• Datasets
− SNAP Dataset
− University of Florida Sparse Matrix Collection

Experiments

• College of Computer of NUDT

• Hometown of Supercomputers: Tianhe-Ⅱ
– No. 1 in TOP500 (2013.6 – 2015.11)
– 33.86 PFLOPS, 32,000 CPUs + 48,000 MICs


• Bucket Grouping vs. Seq. Grouping (Chen. 2016)
− Grouping Time Overhead: decreases stably
− SIMD Utilization Ratio: converges to 1 faster

[Figure: (a) Time Overhead during Grouping Stage (b) SIMD utilization by two Grouping Strategies]

• The Execution of Two Graph Algorithms

[Figure: (a) Comparison of Execution Time (b) Execution Time of Bellman-Ford (c) Execution Time of PageRank]

1.2x on Average


Table: Speedup achieved by OPT. and AUTO. tiling over sequential tiling performance

                 Bellman-Ford                       PageRank
                 OPT. vs. SEQ.    AUTO. vs. SEQ.    OPT. vs. SEQ.    AUTO. vs. SEQ.
Datasets         Val     Size     Val     Size      Val     Size     Val     Size
lp_osa_60        1.08    1024     1.03    256       1.07    256      1.07    256
msdoor           1.11    1152     1.05    4096      1.14    512      1.14    512
rajat24          1.18    2048     1.09    256       1.09    768      1.09    768
Si87H76          1.05    128      1.05    128       1.14    128      1.03    512
higgs-twitter    1.26    896      1.13    3072      1.33    1024     1.21    640
kron-logn18      1.29    4096     1.29    4096      1.36    2048     1.25    1024

• The Performance of the Auto-tuning Module
− Optimal over Sequential: 1.05x ~ 1.36x
− Auto-tuning over Sequential: 1.03x ~ 1.29x


• The MicRun Framework
− Grouping Module
Bucket structure is employed
Max-heap mechanism is embedded
− Auto-tuning Module
Decision Tree Classifier is introduced

• Future Work
− Enrich the built-in graph algorithms
− Expand the framework to the MIMD parallel level

Conclusions


The Tianhe-2 supercomputer is available online. Scientists are welcome to collaborate with us on developing new software and can access Tianhe-2 through the Internet.

You are welcome to contact us! Email: [email protected]


Thank you! Questions?
