an approach to the highest efficiency of the hpcg...

A B C D E F

An Approach to the Highest Efficiency of the HPCG Benchmark on the SX-ACE SupercomputerKazuhiko Komatsu1)4), Ryusuke Egawa1)4), Yoko Isobe1)3), Ryusei Ogata3), Hiroyuki Takizawa2)4), Hiroaki Kobayashi1)

1) Cyberscience Center, Tohoku University 2) GSIS, Tohoku University 3) NEC Corporation 4) JST CREST

IntroductionHPCG (High Performance Conjugate Gradient) has been developed to narrow the large performance gap between real applications and the HPL benchmark. The major features of HPCG are; - Including major communication and computational patterns of real applications. - Easy to understand, optimize, and run. - Able to examine memory and network performances. ex) indirect memory accesses, collective ops, p2p messages

Overview of the SX-ACE Supercomputer

Preliminary Evaluation before Optimization

Performance Evaluation of HPCG on SX-ACE

Conclusions

Crossbar

ADB 1MBMSHR

SPU VPUCore 1

256GB/s

Core 2

Core 3

Core 0

256GB/s

256GB/sCPU architecture

The SX-ACE supercomputer consists of 512 nodes. Each node is equipped with an SX-ACE processor, which can provide a high memory bandwidth for practical HPC applications. - High vector computational performance by a 4-core vector processor - High sustained memory bandwidth by a strong memory subsystem - ADB(Assignable Data Buffer) to keep a high sustained memory bandwidth - Scalable multinode system by a two-stage fat-tree custom network

Optimizations of HPCG for SX-ACE

<snip>GFLOP/s Summary: Raw DDOT: 28.2595 Raw WAXPBY: 17.8771 Raw SpMV: 0.441609 Raw MG: 0.586906 Raw Total: 0.577329 Total with convergence overhead: 0.577329<snip>

To decide the optimization plan, the reference HPCG is executed.Preliminary Evaluation EnvironmentsSupercomputer : SX-ACE# of nodes : 1Compiler : NEC SX C/C++ Compiler Version 1.0HPCG version : Release 2.4Problem size : 104 x 104 x 104 (default)Parallelization : Flat-MPI

HPCG-Benchmark-2.4.yamlThese low performances are caused by quite low vectorization rates andinefficient memory accesses. Optimiza- tions for efficient vector calculations and memory accesses are essential.

Data rearrangement for continuous memory accesses

matrixValues[0]

matrixValues[1]

matrixValues[i]

mtxIndL[0]

mtxIndL[1]

mtxIndL[i]

matrixValues[0]

matrixValues[1]

matrixValues[i]

mtxIndL[0]mtxIndL[1]

mtxIndL[i] 01234 1

Data packing for vector-friendly matrixmemory allocation of sparse matrices

Eight-coloring

Parallelization approachesto eliminate data dependencies

CSRDefault matrix data storage

format of HPCG2.4

JADSpMV function in a highly optimized numeri-

cal library

ELLWell known as suitable for the vector archi-

tecture

Selective data store into the on-chip memory ADB- Only highly-reusable data are selectively stored in ADBBy exploiting programmer's knowledge about the reusability, data can selectively be stored in the on-chip memory ADB of the SX-ACE processor. All data that are considered reusable by the compiler are not always necessary.

Problem size tuning for efficient use of ADBTo avoid evicting highly-reusable data from ADB, the problem size of HPCG has been tuned under consideration of both the capacity of ADB and the size of hyperplanes. Especially, the size of each dimension becomes important for the hyperplane method.

To exploit high potential of the SX-ACE supercomputer on the HPCG benchmark, this poster discusses the optimization techniques; - Data rearrangement for efficient continuous memory accesses- Various sparse matrix memory allocation- the eight-coloring method and the hyperplane method for parallelization- Selectively data stored in ADB, and the problem size tuning.As a result, the SX-ACE supercomputer successfully achieves the highest efficiency of 11.4% in the latest HPCG ranking.

* Since an overhead of memory arrangement is too large for the SX-ACE processor, the row data rearrangement was not applied.

Performance improvements of the optimizations Effect of selective caching and size tuning Scalability and efficiency

Reference CG Timing & Residual Reduction

Reference SpMV and MG Kernel Timing

Optimized CG Setup

Optimized CG Timing and Analysis

Report Results

Problem Setup

Validation Testing

Hyperplane

1 2 4 8 16 32 64 128 256 512

Number of nodes

Performance Efficiency Vec. ratios of SpMV and MG become 99.37% and 99.23%

A: ReferenceB: JAD + ColoringC: ELL + ColoringD: ELL + HyperplaneF: ELL + Hyperplane + ADBG: ELL + Hyperplane + ADB + Size tuning

HPCG solves a linear system of a sparse matrix. Ax=bA is a large sparse matrix discretized by the finite element method.The linear system is solved by multigrid preconditioned conjugate gradient with the symmetric Gauss-Seidel smoother.

IXS network by two-stage fat-tree

RTR (spine)

RTR (edge)

plane #0

4GB/s x28GB/s x2

RTR (spine)

RTR (edge)

plane #1

node #000

node #015

node #496

node #511

16 ports

32 ports

ADB of SX-ACE - Private on-chip memory - 1 MB, 4-way, 16-bank - 256 GB/s bandwidth - Customized for fast random accessesSoftware controllable function - Compiler and user can specify the use of the ADB - A bypass mechanism for memory instructions - Avoiding cache pollution - Enhancement of indirect memory access

Problem size

Selective caching All data caching

an approach to the highest efficiency of the hpcg...

Documents

permonfllop permonmultibody...

large scale artificial neural network training...

hpcg technical specification - sandia national … report...

hpcg @ arm · inspiration •we have used different sources...

pak markthub, akihiro nomura and satoshi matsuoka tokyo...

performance modeling of the hpcg...

mitigation of failures in high performance computing via...

the hpcg benchmark

hpcg benchmark for characterising performance of soc ... ·...

coachfederation.org/icfglobal20121copyright hpcg®, sage...

high performance model based image...

student cluster...

design of a nvram specialized dynamic graph data...

hpcg update: isc’15 update: isc’15 jack dongarra michael...

hpcg pleiades

hpcg 1 toward a new (another) metric for ranking high

i/o performance analysis framework on measurement data from...

urban ecosystem design - hpcg...

active global address space...

large-scale mo calculation with gpu-accelerated fmo...