an approach to the highest efficiency of the hpcg...
Post on 21-Feb-2020
[Figure: Performance improvements of the optimizations. HPCG results (GFLOP/s, axis from 0 to 35) for configurations A to F.]
An Approach to the Highest Efficiency of the HPCG Benchmark on the SX-ACE Supercomputer
Kazuhiko Komatsu1)4), Ryusuke Egawa1)4), Yoko Isobe1)3), Ryusei Ogata3), Hiroyuki Takizawa2)4), Hiroaki Kobayashi1)
1) Cyberscience Center, Tohoku University 2) GSIS, Tohoku University 3) NEC Corporation 4) JST CREST
Introduction
HPCG (High Performance Conjugate Gradient) was developed to narrow the large performance gap between real applications and the HPL benchmark. The major features of HPCG are:
- It includes the major communication and computational patterns of real applications.
- It is easy to understand, optimize, and run.
- It can examine memory and network performance, e.g., indirect memory accesses, collective operations, and point-to-point messages.
Overview of the SX-ACE Supercomputer
Preliminary Evaluation before Optimization
Performance Evaluation of HPCG on SX-ACE
Conclusions
[Figure: CPU architecture. Labels: Core 0 to Core 3 (SPU, VPU, ADB 1 MB, MSHR), crossbar, 16 memory controllers (MC), RCU. Bandwidth labels: 256 GB/s, 4 GB/s.]
The SX-ACE supercomputer consists of 512 nodes. Each node is equipped with an SX-ACE processor, which can provide a high memory bandwidth for practical HPC applications.
- High vector computational performance by a 4-core vector processor
- High sustained memory bandwidth by a strong memory subsystem
- ADB (Assignable Data Buffer) to keep a high sustained memory bandwidth
- Scalable multinode system by a two-stage fat-tree custom network
Optimizations of HPCG for SX-ACE
To decide on the optimization plan, the reference HPCG was first executed as is.

Preliminary Evaluation Environments
Supercomputer : SX-ACE
# of nodes : 1
Compiler : NEC SX C/C++ Compiler Version 1.0
HPCG version : Release 2.4
Problem size : 104 x 104 x 104 (default)
Parallelization : Flat-MPI

HPCG-Benchmark-2.4.yaml
<snip>
GFLOP/s Summary:
  Raw DDOT: 28.2595
  Raw WAXPBY: 17.8771
  Raw SpMV: 0.441609
  Raw MG: 0.586906
  Raw Total: 0.577329
  Total with convergence overhead: 0.577329
<snip>

The low performance is caused by very low vectorization ratios and inefficient memory accesses. Optimizations for efficient vector calculations and memory accesses are therefore essential.
Data rearrangement for continuous memory accesses
[Figure: two memory layouts of the per-row arrays matrixValues[0], matrixValues[1], ..., matrixValues[i] and mtxIndL[0], mtxIndL[1], ..., mtxIndL[i].]
Data packing for vector-friendly matrix memory allocation of sparse matrices
Parallelization approaches to eliminate data dependencies
Eight-coloring
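The idea behind eight-coloring can be illustrated with a minimal sketch. It assumes a regular 3D grid with a 27-point stencil and a parity-based color assignment; the function names (`color_of`, `neighbors`, `rows_by_color`) are hypothetical and are not taken from the poster or from the HPCG source. Points that share a color are never stencil neighbors, so all rows of one color can be updated together without data dependencies.

```python
from itertools import product


def color_of(x, y, z):
    """Eight-coloring: one bit per axis parity, giving colors 0..7."""
    return (x % 2) + 2 * (y % 2) + 4 * (z % 2)


def neighbors(x, y, z, nx, ny, nz):
    """Yield the in-grid neighbors of (x, y, z) in a 27-point stencil."""
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue
        px, py, pz = x + dx, y + dy, z + dz
        if 0 <= px < nx and 0 <= py < ny and 0 <= pz < nz:
            yield px, py, pz


def coloring_is_conflict_free(nx, ny, nz):
    """Return True if no grid point shares its color with a stencil neighbor."""
    for x, y, z in product(range(nx), range(ny), range(nz)):
        c = color_of(x, y, z)
        if any(color_of(*p) == c for p in neighbors(x, y, z, nx, ny, nz)):
            return False
    return True


def rows_by_color(nx, ny, nz):
    """Group lexicographic row indices by color (assumed row numbering)."""
    groups = [[] for _ in range(8)]
    for z, y, x in product(range(nz), range(ny), range(nx)):
        groups[color_of(x, y, z)].append(x + nx * (y + ny * z))
    return groups
```

A smoother would then loop over the eight groups in sequence and process each group as one long, dependency-free vector loop. Coloring changes the update order of Gauss-Seidel, which is one reason an order-preserving approach such as the hyperplane method is also of interest.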
CSR: Default matrix data storage format of HPCG 2.4
JAD: Used by the SpMV function in a highly optimized numerical library
ELL: Well known as suitable for the vector architecture
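To make the difference between CSR and ELL concrete, the following is a minimal sketch in plain Python. The function names and the column-major padded layout are illustrative assumptions, not the poster's implementation. In CSR the inner loop runs over the few nonzeros of one row, whereas in ELL the inner loop runs over all rows, which gives the long loops a vector processor prefers.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A x with A in CSR: the inner loop is as short as one row."""
    y = []
    for i in range(len(row_ptr) - 1):
        y.append(sum(values[p] * x[col_idx[p]]
                     for p in range(row_ptr[i], row_ptr[i + 1])))
    return y


def csr_to_ell(row_ptr, col_idx, values, pad_col=0):
    """Convert CSR to ELL, padding every row to the widest row.

    ell_vals[k][i] holds the k-th nonzero of row i (column-major layout),
    and padded slots carry value 0.0 with a harmless column index.
    """
    n_rows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n_rows))
    ell_cols = [[pad_col] * n_rows for _ in range(width)]
    ell_vals = [[0.0] * n_rows for _ in range(width)]
    for i in range(n_rows):
        for k, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[k][i] = col_idx[p]
            ell_vals[k][i] = values[p]
    return ell_cols, ell_vals


def spmv_ell(ell_cols, ell_vals, x):
    """y = A x with A in ELL: the inner loop runs over all rows."""
    n_rows = len(ell_vals[0]) if ell_vals else 0
    y = [0.0] * n_rows
    for cols_k, vals_k in zip(ell_cols, ell_vals):
        for i in range(n_rows):  # long, regular loop with one indirect load
            y[i] += vals_k[i] * x[cols_k[i]]
    return y


# Small tridiagonal example matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] in CSR.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, values)
```

Both kernels produce the same result for this matrix. The padding costs extra storage and some wasted multiplications, which stays small when row lengths are nearly uniform, as they are for a stencil-based matrix.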
Selective data store into the on-chip memory ADB
- Only highly reusable data are selectively stored in ADB
By exploiting the programmer's knowledge about reusability, data can be selectively stored in the on-chip memory ADB of the SX-ACE processor. Not all data that the compiler considers reusable need to be stored there.
Problem size tuning for efficient use of ADB
To avoid evicting highly reusable data from ADB, the problem size of HPCG was tuned in consideration of both the capacity of ADB and the size of the hyperplanes. In particular, the size of each dimension is important for the hyperplane method.
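How the grid dimensions determine hyperplane sizes can be explored with a small sketch. It rests on assumptions that the poster does not state: a 27-point stencil, lexicographic row order, and hyperplanes defined by h = x + 2y + 4z, which is one choice that keeps every earlier-ordered neighbor on an earlier plane. The function names and the 8-bytes-per-point estimate are likewise hypothetical.

```python
from collections import Counter
from itertools import product


def respects_lexicographic_order(weights=(1, 2, 4)):
    """Check that each 27-point neighbor preceding a point in lexicographic
    order (x fastest, then y, then z) lies on a strictly earlier hyperplane."""
    wx, wy, wz = weights
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dz, dy, dx) < (0, 0, 0):  # neighbor comes earlier in row order
            if wx * dx + wy * dy + wz * dz >= 0:
                return False
    return True


def hyperplane_sizes(nx, ny, nz, weights=(1, 2, 4)):
    """Number of grid points on each hyperplane h = wx*x + wy*y + wz*z."""
    wx, wy, wz = weights
    counts = Counter(wx * x + wy * y + wz * z
                     for x, y, z in product(range(nx), range(ny), range(nz)))
    return [counts[h] for h in range(max(counts) + 1)]


def max_plane_bytes(nx, ny, nz, bytes_per_entry=8):
    """Rough working set of the largest plane, assuming one double per point."""
    return max(hyperplane_sizes(nx, ny, nz)) * bytes_per_entry
```

Comparing `max_plane_bytes(nx, ny, nz)` against the 1 MB ADB capacity for several candidate shapes gives a rough feel for why the size of each dimension matters. The weights are asymmetric, so swapping the dimensions changes the plane sizes even when the total number of points is unchanged.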
To exploit the high potential of the SX-ACE supercomputer on the HPCG benchmark, this poster discusses the following optimization techniques:
- Data rearrangement for efficient continuous memory accesses
- Various sparse matrix memory allocations
- The eight-coloring method and the hyperplane method for parallelization
- Selective data storage in ADB and problem size tuning
As a result, the SX-ACE supercomputer achieves the highest efficiency, 11.4%, in the latest HPCG ranking.
* Because the overhead of memory rearrangement is too large for the SX-ACE processor, the row data rearrangement was not applied.
Performance improvements of the optimizations
Effect of selective caching and size tuning
Scalability and efficiency
[Figure: HPCG execution flow. Problem Setup, Validation Testing, Reference SpMV and MG Kernel Timing, Reference CG Timing & Residual Reduction, Optimized CG Setup, Optimized CG Timing and Analysis, Report Results.]
Hyperplane
[Figure: Scalability and efficiency. HPCG results (TFLOP/s, logarithmic axis from 0.0 to 100.0) and efficiency (%, axis from 0 to 16) versus number of nodes (1, 2, 4, 8, 16, 32, 64, 128, 256, 512).]
Legend: Performance, Efficiency
The vectorization ratios of SpMV and MG reach 99.37% and 99.23%, respectively.
A: Reference
B: JAD + Coloring
C: ELL + Coloring
D: ELL + Hyperplane
F: ELL + Hyperplane + ADB
G: ELL + Hyperplane + ADB + Size tuning
HPCG solves a sparse linear system Ax = b, where A is a large sparse matrix discretized by the finite element method. The linear system is solved by the multigrid preconditioned conjugate gradient method with a symmetric Gauss-Seidel smoother.
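The structure of such a solver can be sketched compactly. The sketch below is a simplification: it uses a single symmetric Gauss-Seidel sweep directly as the preconditioner instead of a multigrid cycle, stores A as a dense list of lists, and uses hypothetical function names. It is meant only to show where the dot products (DDOT), vector updates (WAXPBY), matrix-vector products (SpMV), and the smoother appear in each iteration.

```python
def dot(a, b):
    """Dot product of two vectors (DDOT)."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    """Matrix-vector product (SpMV, stored densely here for brevity)."""
    return [dot(row, x) for row in A]


def symgs(A, r):
    """One symmetric Gauss-Seidel sweep (forward, then backward) for
    A z = r, starting from z = 0. Used here as the preconditioner."""
    n = len(r)
    z = [0.0] * n
    for order in (range(n), reversed(range(n))):
        for i in order:
            s = sum(A[i][j] * z[j] for j in range(n) if j != i)
            z[i] = (r[i] - s) / A[i][i]
    return z


def pcg(A, b, tol=1e-10, max_iter=50):
    """Preconditioned conjugate gradient for symmetric positive definite A.

    Returns the approximate solution and the number of iterations used."""
    x = [0.0] * len(b)
    norm_b = dot(b, b) ** 0.5
    if norm_b == 0.0:
        return x, 0
    r = list(b)
    z = symgs(A, r)
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # WAXPBY
        r = [ri - alpha * api for ri, api in zip(r, Ap)]  # WAXPBY
        if dot(r, r) ** 0.5 <= tol * norm_b:
            return x, it
        z = symgs(A, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

The forward and backward sweeps in `symgs` update each unknown from values computed earlier in the same sweep. That data dependency is what makes the smoother hard to vectorize directly.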
[Figure: IXS network by two-stage fat-tree. Two planes (plane #0 and plane #1), each with edge routers (RTR (edge)) and spine routers (RTR (spine)). Nodes #000 to #511 connect to the IXS through their RCUs. Labels: 4 GB/s x2, 8 GB/s x2, 16 ports, 16 ports, 32 ports.]
ADB of SX-ACE
- Private on-chip memory
- 1 MB, 4-way, 16-bank
- 256 GB/s bandwidth
- Customized for fast random accesses
Software controllable function
- Compiler and user can specify the use of the ADB
- A bypass mechanism for memory instructions
- Avoiding cache pollution
- Enhancement of indirect memory access
[Figure: Effect of selective caching and size tuning. MG results (GFLOP/s, axis from 22 to 34) for problem sizes 88x216x512, 96x216x464, 104x216x432, 112x216x400, and 120x216x376. Series: Selective caching, All data caching.]