TRANSCRIPT

JST-CREST ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies

Satoshi Matsuoka (PI), Tokyo Inst. Technology
Reiji Suda, Univ. Tokyo
Takayuki Aoki, Tokyo Inst. Technology
Hiroki Honda, Univ. of Electro-Communications
Michihiro Koibuchi, National Inst. Informatics
DoE Exascale Parameters: x1000 power efficiency in 10 years

System attributes          | "2010" (Jaguar / TSUBAME) | "2015"              | "2018-20"
System peak                | 2 PetaFlops               | 100-200 PetaFlops   | 1 ExaFlops
Power                      | 6 MW / 1.3 MW             | 15 MW               | 20 MW
System Memory              | 0.3 PB / 0.1 PB           | 5 PB                | 32-64 PB
Node Perf                  | 125 GF / 1.6 TF           | 0.5 TF or 7 TF      | 1 TF or 10 TF
Node Mem BW                | 25 GB/s / 0.5 TB/s        | 0.1 TB/s or 1 TB/s  | 0.4 TB/s or 4 TB/s
Node Concurrency           | 12 / O(1000)              | O(100) or O(1000)   | O(1000) or O(10000)
#Nodes                     | 18,700 / 1,442            | 50,000 or 5,000     | 1 million or 100,000
Total Node Interconnect BW | 1.5 GB/s / 8 GB/s         | 20 GB/s             | 200 GB/s
MTTI                       | O(days)                   | O(1 day)            | O(1 day)
JST CREST ULP-HPC Scheme
Ultra Multi-Core: Auto-Tuning for Perf. & Power
Bayesian fusion of model and measurements
• Bayesian model and prior distribution
σ_i^2 ~ Inv-χ^2(ν_0, σ_0^2)
μ_i | σ_i^2 ~ N(x_i, σ_i^2 / κ_0)     (x_i: run time estimated by the model)
y_ij | μ_i, σ_i^2 ~ N(μ_i, σ_i^2)
!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
Target 1 (Algorithm 1)
Directives specifying auto-tuning before execution invocation and the algorithm-selection process
Input variables used in the cost definition function
Cost definition function
ABCLibScript: algorithm selection
Slow & Parallel (& ULP) ULP-HPC
SIMD-Vector (GPGPU, etc.)
• Posterior predictive distribution after n measurements
μ_n = (κ_0 x_i + n ȳ_i) / (κ_0 + n),   κ_n = κ_0 + n,   ν_n = ν_0 + n
ν_n σ_n^2 = ν_0 σ_0^2 + Σ_j (y_ij - ȳ_i)^2 + (κ_0 n / (κ_0 + n)) (ȳ_i - x_i)^2
μ_i | (y_i1, ..., y_in, σ_i^2) ~ N(μ_n, σ_i^2 / κ_n)
σ_i^2 | (y_i1, ..., y_in) ~ Inv-χ^2(ν_n, σ_n^2)
y_{i,n+1} | (y_i1, ..., y_in) ~ t_{ν_n}(μ_n, (1 + 1/κ_n) σ_n^2)
(y_i1, ..., y_in: measured run-time data)
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
Target 2 (Algorithm 2)
!ABCLib$ select sub region end
!ABCLib$ static select region end
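
A minimal Python sketch of the model/measurement fusion above, using the conjugate update from the posterior formulas (the prior parameters, measurement values, and the two-candidate comparison are illustrative; this is not ABCLibScript's implementation):

import math

def posterior_predictive(model_estimate, measurements, kappa0=1.0, nu0=1.0, sigma0_sq=1.0):
    """Fuse the model-based run-time estimate (prior mean) with measured run times;
    returns (mean, scale, dof) of the Student-t posterior predictive."""
    n = len(measurements)
    ybar = sum(measurements) / n
    mu_n = (kappa0 * model_estimate + n * ybar) / (kappa0 + n)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    ss = sum((y - ybar) ** 2 for y in measurements)
    sigma_n_sq = (nu0 * sigma0_sq + ss
                  + kappa0 * n * (ybar - model_estimate) ** 2 / (kappa0 + n)) / nu_n
    return mu_n, math.sqrt(sigma_n_sq * (1.0 + 1.0 / kappa_n)), nu_n

# Rank two algorithm candidates by their predicted run time after a few measurements each:
alg1 = posterior_predictive(model_estimate=2.0, measurements=[2.3, 2.1, 2.4])
alg2 = posterior_predictive(model_estimate=1.8, measurements=[2.6, 2.5])
print("pick algorithm 1" if alg1[0] < alg2[0] else "pick algorithm 2")

In an autotuner, such predictive means and their uncertainty can also be used to decide which candidate is worth measuring next.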
Target regions 1 and 2
MRAM, PRAM, Flash, etc.
ULP-HPC Networks
[Scheme diagram: power vs. performance trade-off; labels include "Low Power", "High Perf Model", "Optimization etc.", "Power Optimize using Novel Components in HPC", "Perf. Model", "x10 Power Efficiency", "Power-Aware and Optimizable Applications / Algorithms", and "x1000 Improvement in 10 years"]
Machine (year)        | Power (incl. cooling) | Linpack Perf (PF) | Linpack MFlops/W | Power efficiency factor vs. Exa
Tsubame1.0 (2006Q1)   | 1.8 MW                | 0.038             | 21               | 2368
ORNL Jaguar (2009Q4)  | ~9 MW                 | 1.76              | 196              | 256
Tsubame2.0 (2010Q4)   | 1.8 MW                | 1.2               | 667              | 75   (x31.6 over Tsubame1.0)
K Computer (2011Q2)   | ~16 MW                | 10                | 625              | 80
BlueGene/Q (2012Q1)   | ~12 MW?               | 17                | 1417             | 35.3
Tsubame3.0 (2015Q1)   | 1.8 MW                | 20                | 11000            | 4.6  (~x16 over Tsubame2.0)
EXA (2018Q4)?         | 20 MW                 | 1000              | 50000            | 1
GPU Many-Core Power Efficiency
E.g. CFD applications, compared to one TSUBAME Opteron CPU core; max 200 W/GPU
Weather Prediction: x80 (10 GPUs = 1000 CPUs)
Lattice Boltzmann: ~x100
Phase Field Model: x170
When we assume the acceleration is typically x100 and 1 GPU < 200 W for our applications:
100 GPUs: 20 kW, vs. TSUBAME ~1 MW / 10,000 CPUs
10,000 CPU cores of TSUBAME 1.2: 1 MW (1000 kW)
=> x5-10 lower power per socket
GPU vs. CPU Performance: FLOP/Byte = F/B
Roofline model: Williams & Patterson, 2008, Communications of the ACM
[Figure: roofline plot of performance (GFLOPS, log scale; 708 GFlops peak for the GeForce GTX 285) vs. arithmetic intensity (FLOP/Byte, 10^-2 to 10^1) for the GeForce GTX 285 and Core i7 Extreme; plotted kernels include FFT (NUKADA), SGEMM, N-Body (N log N), MD (N log N), Diffusion, LBM, and the Riken BMT; annotated GPU-over-CPU speedups are roughly x5-10 in the memory-bound and compute-bound regions, up to x10-20]
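
As a reading aid for the roofline figure, a small sketch of the attainable-performance bound it plots (the GTX 285 peak of 708 GFLOPS is from the slide; the memory bandwidths and the CPU peak used here are assumed, illustrative values):

def roofline_gflops(flop_per_byte, peak_gflops, peak_bw_gbs):
    """Attainable GFLOPS = min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_bw_gbs * flop_per_byte)

# Example: GTX 285-class GPU (708 GFLOPS, assumed ~159 GB/s) vs. an assumed quad-core CPU
for f_per_b in (0.1, 0.5, 1.0, 4.0):
    gpu = roofline_gflops(f_per_b, 708.0, 159.0)
    cpu = roofline_gflops(f_per_b, 70.0, 25.0)   # assumed CPU peaks, not from the slide
    print(f"F/B={f_per_b:4.1f}  GPU~{gpu:6.1f} GF  CPU~{cpu:5.1f} GF  ratio~x{gpu/cpu:.1f}")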
TSUBAME 1.2 Experimental Evolution (Oct. 2008)
[System diagram; recoverable details:]
• Compute nodes: Sun x4600 (16 Opteron cores each), 10,480 cores / 655 nodes, 32-128 GBytes/node, 21.4 TeraBytes total memory, 50.4 TeraFlops; OS: Linux (SuSE 9, 10); NAREGI Grid MW; NEC SX-8i
• Unified InfiniBand network: Voltaire ISR9288 x8, ~1310+50 ports, 10 Gbps x2, ~13.5 Terabits/s (3 Tbits bisection), 10 Gbps + external NW
• Storage: 1.5 Petabyte (Sun x4500 x 60, 48 disks of 500 GB each) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 60 GB/s aggregate I/O BW
• ClearSpeed CSX600 SIMD accelerators on PCI-e: 648 boards, 52.2 TeraFlops SFP/DFP
• New production experiment: 170 NVIDIA Tesla S1070 units (~680 Tesla cards); high performance in many BW-intensive apps; only ~10% power increase over TSUBAME 1.0
• New deployment: GCOE TSUBASA, Harpertown Xeon, 90 nodes / 720 CPUs, 8.2 TeraFlops
• 87.1 TF Linpack: the first GPU supercomputer on the Top500 [IPDPS10]
TSUBAME2.0, Nov. 1, 2011: "The Greenest Production Supercomputer in the World"
TSUBAME 2.0: New Development
Precise GPU Power Measurement
Analysis of the lines of a PCI Express riser card and extraction of the power supply lines; automatic identification of measurement sequences and high-precision power measurement with marker kernels (High marker: cudaMemcpy; Low marker: computation with 1 thread)

PCI Express riser card lines (Line # | Description):
1-13 GND | 14 PRSNT#1 | 15-19 12V | 20 SMCLK | 21 12V(B3) | 22 SMDAT | 23 GND | 24 JTAG2 | 25 JTAG3 | 26 JTAG4 | 27 JTAG5 | 28-31 3.3V | 32 JTAG1 | 33 PWRGD | 34 3.3Vaux | 35 WAKE# | 36 --

[Figure: measured power trace (normalized, 0-0.9) over time (~0.976-0.98), showing the high (cudaMemcpy) and low (1-thread compute) marker kernels]
Correction formulae (dependent on probe and PSU); Ubiquitous Probe by the Kuroda team
[Figure: precise (wired) measurements vs. wireless measurements over time, with the fitted correction formula]
Y = 38.6 + 1.15 X - 0.000637 (X - 100)(X - 200) +/- 0.9
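
A one-line application of the correction formula above, mapping a wireless probe reading X to the precise measurement Y (the calibration range and units are not stated here and are assumed):

def correct_wireless_power(x):
    """Map a wireless probe reading X to the precise (wired) value Y using the fitted
    quadratic correction above (residual about +/- 0.9)."""
    return 38.6 + 1.15 * x - 0.000637 * (x - 100.0) * (x - 200.0)

print(correct_wireless_power(150.0))  # example reading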
Measuring GPU Power Consumption
• Two power sources
  - Via PCI Express: < 75 W
  - Direct inputs from PSU: < 240 W
• Uses current clamp sensors around the power inputs
Precise and accurate measurement with current probes + A/D converter (NI PCIe-6259) and a measurement PC:
• Attaches current sensors to the two power lines in PCIe
• Direct power inputs from the PSU
• Reads currents at a 100 µs interval
Statistical Power Modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically from n GPU performance counters
• Linear regression model using performance counters p_i as explanatory variables:
  average power consumption ≈ Σ_{i=1}^{n} c_i p_i
• High accuracy (avg. err. 4.7%), validated against a power meter with high resolution
• Prevents overtraining by ridge regression
• Determines optimal parameters by cross fitting
• Accurate even with DVFS
Future: model-based power optimization. The linear model shows sufficient accuracy, suggesting the possibility of optimizing Exascale systems with O(10^8) processors.
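
A minimal sketch of the counter-based linear power model with a ridge penalty, as described above (the counters, data, and regularization strength are illustrative; the original model's exact counters are not listed here):

import numpy as np

def fit_ridge_power_model(counters, measured_power, lam=1.0):
    """Fit power ~ c0 + sum_i c_i * counter_i with an L2 (ridge) penalty."""
    X = np.hstack([np.ones((counters.shape[0], 1)), counters])  # intercept column
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + reg, X.T @ measured_power)

def predict_power(coeffs, counters):
    X = np.hstack([np.ones((counters.shape[0], 1)), counters])
    return X @ coeffs

# Illustrative usage: 100 samples x 4 counters (e.g. instructions, GM reads, GM writes, ...)
rng = np.random.default_rng(0)
counters = rng.random((100, 4))
power = 60 + counters @ np.array([40.0, 25.0, 20.0, 5.0]) + rng.normal(0, 1, 100)
coeffs = fit_ridge_power_model(counters, power, lam=0.1)
print(predict_power(coeffs, counters[:3]))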
Optimization Effect Predictors for GPUs
Optimization effects heavily depend on the GPU architecture and generation.
[Figure: speedup by optimization (0-11x, with medians) of a naive program for MatrixMul (AMD), MatrixMul (NVIDIA), and FDTD on CPU, Radeon 5870, GTX 285, and GTX 480]
Goal: estimate the effects before making the optimization effort.
Prediction workflow for a naive program:
• Execute with profiling → performance counters of the naive execution (# instructions, # GM reads, # GM writes, ...)
• Execute with access-log output → memory access log → access analyzer → performance counters of the optimized execution (derived from the log)
• Performance model: from the naive exec. time and the estimated opt. exec. time, compute the estimated speedup of the "shared memory" optimization
[Figures: measured vs. estimated speedup of the shared-memory optimization for MatrixMul, DCT, NBody, FDTD, and BinomialOption on GTX 480 and GTX 285 GPUs]
Comparing the measured speedup and the estimation.
Future work: estimating the improvement of energy consumption by integrating the power model.
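
A heavily hedged sketch of the prediction idea: score the naive run and the access-log-derived "optimized" run by a weighted sum of counters and take the ratio as the estimated speedup (the weights and counter names are illustrative placeholders, not the predictor's actual cost model):

def estimated_speedup(naive_counters, predicted_opt_counters, weights):
    """Ratio of weighted counter costs as the estimated speedup of the optimization."""
    cost = lambda c: sum(weights[k] * c.get(k, 0) for k in weights)
    return cost(naive_counters) / cost(predicted_opt_counters)

weights = {"instructions": 1.0, "gm_reads": 50.0, "gm_writes": 50.0}    # assumed relative costs
naive = {"instructions": 1e9, "gm_reads": 2e8, "gm_writes": 1e8}
optimized = {"instructions": 1.1e9, "gm_reads": 2e7, "gm_writes": 1e7}  # after shared-memory reuse
print(f"estimated speedup: x{estimated_speedup(naive, optimized, weights):.1f}")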
Low Power Scheduling in GPU Clusters [IEEE IPDPS-HPPAC 09]
Objective: optimize scheduling on CPU/GPU heterogeneous clusters
• Optimally schedule mixed sets of jobs executable on either CPU or GPU but with different performance & power
• Assume the GPU acceleration factor (%) is known
Result: ~30% improvement in the Energy-Delay Product (EDP)
[Figure: EDP (10^9 sJ, 0-7000) for the Fixed, ECT, Proposed (static), and Proposed (on-line) schedulers]
TODO: a more realistic environment
• Different application power profiles
• PCI bus vs. memory conflicts: GPU applications slow down by 10% or more when co-scheduled with a memory-intensive CPU application
[Figure: SRAD slowdown, execution-time increase (%) vs. background CPU process bandwidth (MB/s, 0-10000)]
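
A minimal sketch of the energy-delay-product objective and a per-job CPU/GPU choice (the job parameters and the greedy rule are illustrative, not the paper's scheduling algorithm):

def edp(energy_joules, time_seconds):
    """Energy-Delay Product: lower is better."""
    return energy_joules * time_seconds

def choose_device(cpu_time, cpu_power, gpu_accel, gpu_power):
    """Greedy per-job choice between CPU and GPU by EDP (illustrative only)."""
    gpu_time = cpu_time / gpu_accel
    cpu_edp = edp(cpu_power * cpu_time, cpu_time)
    gpu_edp = edp(gpu_power * gpu_time, gpu_time)
    return ("GPU", gpu_edp) if gpu_edp < cpu_edp else ("CPU", cpu_edp)

print(choose_device(cpu_time=100.0, cpu_power=80.0, gpu_accel=20.0, gpu_power=200.0))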
Intelligent Task Scheduling for GPU Clusters - considering conflicts on the memory bus and the PCI-E bus
T. Hamano, et al., IPSJ SIGHPC-124 (domestic)
[Node diagram: 4 processes share a node with CPU cores C0-C3, L3 cache, memory, a router, PCI & HT links, and GPU0-GPU3]
The status of each process is one of: (1) memory access, (2) PCI transfer (assuming blocking cuMemcpy), (3) no memory access.
Speed-down ratio of a process: s ≈ Σ_i r(i) × α_i, where r(i) is the percentage of time spent in each status and α_i is the conflict coefficient.
Access time increases due to conflicts; we integrated this model into ECT (Earliest Complete Time) scheduling.
Assuming the following information is known for each job:
• Execution time (in the stand-alone case)
• Total memory accesses (from performance counters)
• Total PCI-E transfers (from the CUDA profiler)
the actual execution time is estimated by our model.
Measured results: energy reduction of up to 2.3%
[Figure: makespan (sec, 700-1400) of ECT vs. the proposed method for task_set1, task_set2, task_set3]
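
A small sketch of the conflict-aware completion-time estimate described above (the status fractions and the conflict coefficients α_i are placeholders; in the original work they come from performance counters, the CUDA profiler, and calibration):

def slowdown(status_fractions, alpha):
    """Speed-down ratio as a weighted sum of the fractions of time spent in each status:
    (1) memory access, (2) PCI transfer, (3) no memory access."""
    return sum(status_fractions[i] * alpha.get(i, 1.0) for i in status_fractions)

def completion_time(standalone_time, status_fractions, alpha):
    """ECT-style estimate of the actual run time under memory/PCI-E conflicts."""
    return standalone_time * slowdown(status_fractions, alpha)

alpha = {1: 1.4, 2: 1.2, 3: 1.0}      # assumed conflict coefficients
fractions = {1: 0.5, 2: 0.2, 3: 0.3}  # assumed fractions of time in each status
print(completion_time(100.0, fractions, alpha))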
ULP-HPC Research Example: Next-Gen ULP Memory System [IEEE IPDPS-HPPAC 08]
[Memory hierarchy diagram:]
• CPUs: L1 cache, L2 cache
• MAIN MEMORY: MRAM + DRAM (decreased DRAM requirement; MRAM is a non-volatile, low-power, fast device)
• SWAP: FLASH (non-volatile, low-power, medium-speed device for fast swap)
• HDD
Performance vs. Power Consumption
Application: NAS CG (class B)
Standard system: DRAM 1 GB
Proposed system: MRAM 128 MB + DRAM 192 MB
[Figures: execution time (sec, 0-50) and memory energy consumption (J, 0-20) vs. installed main memory capacity (MB), with series labeled 0 MB, 64 MB, 128 MB, 256 MB]
Execution overhead: x1.6
Memory energy consumption: reduced to 28%
Ultra-Low-Power HPC Interconnects
• Objective: improve the power efficiency of interconnects by 100 times within 10 years
• Using a large number of small switches
• No more expensive monster-switches (1,000-port switches)
[Diagram: 16 processors (0-15) connected by four trees of small switches (TREE 1-4); initially Trees 1-4 are all on; as traffic decreases (e.g. host 0 finishes its data transfer to host 2) links are turned off until only Tree 1 is on and Trees 2-4 are off, and links are turned back on when traffic returns (e.g. host 3 starts a data transfer to host 2)]
[Figure: frame bandwidth (Mbps, 0-1000) vs. generation time (sec, 0-20), showing the on-to-off and off-to-on transitions between Topology A and Topology B]
Reference: Michihiro Koibuchi, Tomohiro Otsuka, Tomohiro Kudoh, Hideharu Amano, "A Switch-tagged Routing Methodology for PC Clusters with VLAN Ethernet", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 2, pp. 217-230, Feb 2011.
Research Outcome (PC cluster + switches)
SuperNova cluster (225 hosts, 93rd-ranked supercomputer in Nov. 2003)
Before: monster switch (312 ports x 1). After (ULP-HPC): small switches (48 ports x 8).
Power: using small switches (-87%) x cyclic topology (+60%) x On/Off link regulation (-37%) ≈ -87% (1/8)
Cost: using small switches (-98%) x cyclic topology (+60%) x On/Off link regulation (±0%) ≈ -97% (1/30)
NII Group (Michihiro Koibuchi ([email protected]), Masato Yoshimi ([email protected]))
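
A quick arithmetic check of the combined factors quoted above, treating each percentage as a multiplicative factor:

power_factor = (1 - 0.87) * (1 + 0.60) * (1 - 0.37)   # small switches x cyclic topology x on/off links
cost_factor = (1 - 0.98) * (1 + 0.60) * (1 + 0.00)
print(f"power: x{power_factor:.3f} (~1/8), cost: x{cost_factor:.3f} (~1/30)")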
Numerical Weather Prediction (Collaboration: Japan Meteorological Agency)
Meso-scale atmosphere model: cloud-resolving non-hydrostatic model, e.g. WRF (Weather Research and Forecast)
Phenomena: typhoon (~1000 km); tornado, downburst, heavy rain (1-10 km)
WSM5 microphysics: 1% of lines of code, 25% of elapsed time ⇒ 20x boost in microphysics (overall 1.2-1.3x)
ASUCA: full GPU implementation (model developed by the Japan Meteorological Agency)
[Diagrams: block division of the nx x ny x nz domain for advection, and for the 1-D Helmholtz eq.]
ASUCA typhoon simulation: 500 m horizontal resolution, 4792 x 4696 x 48 grid
145.0 TFlops single precision, 76.1 TFlops double precision on TSUBAME 2.0 (3990 M2050 GPUs)
Mesoscale Atmosphere Model ASUCA: horizontal 5 km resolution (present) → horizontal 500 m resolution (x1000)
Dendrite Solidification: Allen-Cahn Equation based on the Phase-Field Model, 1.017 PFLOPS at ~1.1 MW
[Equations: time evolution of the phase field (mobility M, second derivatives in x, y, z, interface parameters W and a), coupled with a thermal conduction equation (conductivity k, latent heat L, heat capacity C); 19 points to solve per stencil update]
First over Peta-FLOPS for a stencil application
SC'11 Gordon Bell Award Finalist & Technical Paper
Dendrite solidification for pure metal on GPGPU; now scales to the PFLOPS level.
TSUBAME Precise Power Measurement of Petascale Applications
• Minute-resolution measurement of TSUBAME2.0 power during petascale "Grand Challenge" runs (4/7 20:40 - 4/8 8:50)
[Figure: power (kW, 0-2000) over the run, with traces labeled Sys, All, 104-105 Sys, 104-105 Cool, and 104-105 All]
GPU-accelerated AMBER11: high-speed and low-energy molecular dynamics simulation
• PMEMD module from AMBER11
• Comparing 16 GPUs to 96 CPU cores: about 10 times acceleration and 75% energy reduction
• Test case: nucleosome, 25,095 atoms
Mathematical Core of Autotuning
Optimization of the unrolling of the outer and middle loops in matrix-matrix multiply; average performance for n = 1 to 256 or 512.
[Figure: performance relative to the best configuration on ~20 processors (PentiumM, Core2Duo, Opteron, Xeon, mobile Pentium 4/3, Pentium4, Itanium2, UltraSPARC3, Power5, Crusoe, SPARC3i, ...) for "Experiment only", "Model only", "Our method 1", and "Our method 2"]
Previous method: high overheads from exhaustive search. Our methods: ~9x efficiency, with a drastic reduction of overheads via sequential experimental design.
Our methods:
• Bayesian formulation
• Suboptimal experimental design
• Parallel experimental design
• Randomized experimental design
• Online autotuning
• Offline autotuning
First English book dedicated to autotuning (Springer). Organizing the international workshop iWAPT (every year from 2006); reported in SIAM News; invited to the US autotuning workshop.
Toward Auto-tuned Math Library
Investigating speed and power with data sizes, operation types, and programming.
5 classes of operations (cf. matrix computations):
1. R1: operations on registers, HT compatible
2. R2: operations on registers, HT incompatible
3. L1: operations on L1 cache
4. L2: operations on L2 cache
5. MM: operations on main memory
Seek the optimal #threads and clock frequency:
• All but MM: fastest and most economical with the maximum number of threads and the fastest clock
• MM: executing with 1 thread gives 25% lower power on Core2Quad; executing at 1733 MHz gives 20% lower power on Core i7.
[Figures: power (W) vs. clock frequency (MHz) for thread counts th=1, 2, 4 on Core2Quad Q6700 (333-2666 MHz) and th=1, 2, 4, 8 on Core i7 975 Extreme Edition (1600-3334 MHz)]
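
A minimal sketch of the thread-count / clock-frequency sweep described above (the candidate grid is illustrative, and measure is a placeholder callback returning elapsed time and average power for one configuration):

import itertools

def pick_config(measure):
    """Sweep thread count and clock frequency; return the configuration with the lowest
    measured energy (time * power)."""
    candidates = itertools.product([1, 2, 4, 8], [1600, 2000, 2666, 3334])  # illustrative grid
    best = None
    for threads, mhz in candidates:
        seconds, watts = measure(threads, mhz)
        energy = seconds * watts
        if best is None or energy < best[0]:
            best = (energy, threads, mhz)
    return best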
NukadaFFT 1.0 release!? [SC08, SC09]
The NukadaFFT library is a very fast auto-tuning FFT library for CUDA GPUs.
Tuning parameters:
(1) Factorization of the transform size
(2) # of threads launched per SM
(3) Padding insertion patterns for shared memory
The library generates many kinds of FFT kernels and selects the fastest one (exhaustive).
For more details, see the GTC 2010 Research Poster, or catch the author.
You can try online benchmarking for the size you are interested in.
http://matsu-www.is.titech.ac.jp/~nukada/nufft/
[Figure: performance of 1-D FFT (GFLOPS, 0-150) over transform sizes 128-512, CUFFT 3.1 vs. NukadaFFT 1.0 (double precision, batch=32,768, GPU=GeForce GTX 480)]
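
A minimal sketch of the exhaustive generate-and-select strategy described above (build_kernel and benchmark are hypothetical callbacks and the candidate values are illustrative; this is not NukadaFFT's actual API):

import itertools

def autotune_fft(transform_size, build_kernel, benchmark):
    """Enumerate candidate (factorization, threads-per-SM, padding) combinations,
    benchmark each generated kernel, and keep the fastest."""
    factorizations = [(8, 8, 4), (16, 4, 4), (4, 4, 4, 4)]   # illustrative for size 256
    threads_per_sm = [64, 128, 256]
    paddings = [0, 1, 2]
    best = None
    for params in itertools.product(factorizations, threads_per_sm, paddings):
        kernel = build_kernel(transform_size, *params)
        t = benchmark(kernel)
        if best is None or t < best[0]:
            best = (t, params)
    return best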
"Auto-Tuning" Sparse Iterative Solvers on Multi-GPU Clusters (A. Cevahir, A. Nukada, S. Matsuoka)
High communication due to irregular sparsity; GPU computing is fast, so the communication bottleneck is even more severe. We propose techniques to achieve scalable parallelization:
• A new sparse matrix-vector multiplication (MxV) kernel [ICCS'09]
• Auto-selection of the optimal MxV kernel at run time [ISC'10]
• Minimization of communication between GPUs [ISC'10]
Two methods: CG and PageRank (on 100 CPUs)
[Figures: comparisons of different CG implementations; 64-bit CG performance results]
Towards Auto-Tuning of Stencil Computation on GPU-accelerated Clusters
• Programming on GPU clusters gets harder because of:
  - Hybrid programming with CUDA/MPI/threads
  - An increasing number of tuning parameters
  - Hiding latency
⇒ Auto-tuning becomes more important.
• 3-HOP communication has been required (GPU <-> CPU <-> CPU <-> GPU); a 1-HOP method is proposed, and we auto-tune between communication methods (up to 20% speed-up with the 1-HOP method).
• We find the optimal parameters, such as block size, by auto-tuning: x60 faster than 1 CPU core.
[Figures: execution time vs. thread/block size, and speed (GFlops) vs. number of GPUs (up to 32) for problem sizes 512x512x512, 1024x1024x1024, 2048x2048x2048, and 512x512x4096, GPU-only (G) and GPU+CPU (G+C)]
526 GFlops is achieved with 32 GPUs on TSUBAME 1.2: x1170 faster than 1 CPU core.
Scalability of Parallel Stencil Computations with Automatic GPU Code Generation and Auto-Tuning
Parallel stencil codes get complex because of:
• Domain decomposition
• MPI+CUDA communication
• Overlapping communication
8 TF is achieved in the Himeno benchmark with 1024 GPUs on TSUBAME2.0: world record!
[Figure: Himeno benchmark GFlops (up to ~9,000) vs. number of sockets (0-1,024) for CPU-L, GPU-L, CPU-XL, GPU-XL]
Physis: a code generation framework for multi-GPUs. The user writes the computation function of a single point:

__stencil__ void average(int x, int y, int z, grid3d_real g)
{
  if (x == 0 || y == 0 || z == 0) return;
  float ret = psGridGet(g,0,0,0)
    + psGridGet(g,-1,0,0) + psGridGet(g,1,0,0)
    + psGridGet(g,0,-1,0) + psGridGet(g,0,1,0)
    + psGridGet(g,0,0,-1) + psGridGet(g,0,0,1);
  psGridEmit(g, ret / 7.0);
}

Parallel stencil code for GPU clusters is generated from this.
[Figure: weak scalability of a seismic simulation, GFlops (0-5000) vs. number of GPUs (0-1000)]
Energy and Performance Tuning Framework for the GPU Accelerator Environment
[Framework diagram: a source program with tuning directives and OpenMP directives is processed by ABCLibScript and OMPCUDA; a tuning policy and user policy, together with a Power API, drive automatic tuning of the resulting CPU program and GPU program on the hardware]
System components of the energy and performance tuning framework:
• Automatic tuning processor: ABCLibScript (ULP-HPC, UT). A source program with tuning directives is automatically tuned into a tuned program.
• OpenMP processor for GPU: OMPCUDA. A source program with OpenMP directives is translated into a CPU program and a GPU program; the program runs on both CPU & GPU.
MatMult Automatic Tuning Result (CPU vs. GPU)
CPU (Nehalem, 4 cores) vs. GPU (Tesla C2050)
[Figure: performance vs. matrix size; the CPU is faster below N = 600 and the GPU is faster above N = 600]
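
A tiny sketch of using the auto-tuned crossover point to dispatch between CPU and GPU (the N = 600 threshold is the measured crossover above; matmul_cpu and matmul_gpu are placeholder callables, not part of OMPCUDA or ABCLibScript):

GPU_CROSSOVER_N = 600  # crossover size found by auto-tuning in the measurement above

def tuned_matmult(a, b, n, matmul_cpu, matmul_gpu):
    """Dispatch to CPU below the crossover size and to GPU above it."""
    return matmul_cpu(a, b) if n < GPU_CROSSOVER_N else matmul_gpu(a, b)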
Tsubame2.0 Efficient Cooling Infrastructure
HP's water-cooled rack: completely closed racks with their own heat exchanger; 1.5x the width of a normal rack plus a rear extension. One rack = 50 TF (~= the entire Earth Simulator).
Cooling for high-density deployments:
• 35 kW of cooling capacity in a single rack (highest rack heat density ever)
• 3000 CFM intake airflow with 7 C chiller water
• Up to 2000 lbs of IT equipment
• Uniform air flow across the front of the servers
• Automatic door opening mechanism controlling both racks
• Adjustable temperature set point
• Removes 95% to 97% of the heat inside racks
• Polycarbonate front door reduces ambient noise considerably
TSUBAME2.0 World Rankings (Nov. 2010 announcement, Green500!!!)
The Top500 (absolute performance):
#1: ~2.5 PetaFlops: China Defense Univ. (NUDT) Tianhe-1A
#2: 1.76 PetaFlops: US ORNL Cray XT5 Jaguar
#3: 1.27 PetaFlops: China Shenzen SC Nebulae
#4: 1.19 PetaFlops: Japan Tokyo Tech. HP/NEC TSUBAME2.0
#5: 1.054 PetaFlops: US LBNL Cray XE6 Hopper
#~33 (#2 Japan): 0.191 PetaFlops: JAEA Fujitsu
The Green500 (performance/power efficiency):
#1: 1684.20: US IBM Research BG/Q Prototype (116)
#2: 958.35: Japan Tokyo Tech/HP/NEC Tsubame 2.0 (4)
#3: 933.06: US NCSA Hybrid Cluster Prototype (403)
#4: 828.67: Japan Riken "K" Supercomputer Prototype (170)
#5-7: 773.38: Germany Julich etc. IBM QPACE SFB TR (207-209)
(#2+: 1448.03: Japan NAO Grape-DR Prototype (383), added in Dec.)
TSUBAME2.0: "Greenest Production Supercomputer in the World", Nov. 2010 and June 2011 (two in a row!)
Petaflops? Gigaflops/W?
TSUBAME 2.0 is x66,000 faster, holds x44,000 the data, and is x3 more power efficient than a laptop.
Laptop: SONY Vaio type Z (VPCZ1)
• CPU: Intel Core i7 620M (2.66 GHz)
• MEMORY: DDR3-1066 4 GB x 2; 256 GB HDD
• OS: Microsoft Windows 7 Ultimate 64bit
• HPL: Intel(R) Optimized LINPACK Benchmark for Windows (10.2.6.015)
• 18.1 GigaFlops Linpack, 369 MegaFlops/W
Supercomputer: TSUBAME 2.0
• CPU: 2714 Intel Westmere 2.93 GHz; GPU: 4071 NVIDIA Fermi M2050
• MEMORY: DDR3-1333 80 TB + GDDR5 12 TB; 11 PB hierarchical storage
• OS: SuSE Linux 11 + Windows HPC Server R2
• HPL: Tokyo Tech Heterogeneous HPL
• 1.192 PetaFlops Linpack, 1043 MegaFlops/W
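
A quick arithmetic check of the quoted ratios (assuming the x44,000 "data" figure compares the 11 PB hierarchical storage with the laptop's 256 GB HDD):

speed_ratio = 1.192e15 / 18.1e9      # Linpack: PetaFlops vs. GigaFlops
efficiency_ratio = 1043.0 / 369.0    # MegaFlops/W
data_ratio = 11e15 / 256e9           # 11 PB storage vs. 256 GB HDD (assumed basis)
print(f"~x{speed_ratio:,.0f} faster, ~x{efficiency_ratio:.1f} power efficient, ~x{data_ratio:,.0f} data")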
Two-Phase Flow Simulation
Particle methods, e.g. SPH (no splash), vs. a mesh scheme (surface capturing):
■ Full Navier-Stokes solver: fractional step
■ Advection term: 5th-order WENO
■ Poisson: MG-BiCGStab (sparse matrix: BiCGStab + MG preconditioner)
■ Surface tension: CSF (Continuous Surface Force) model
■ Surface capture: CLSVOF (THINC + Level-Set), anti-diffusion THINC for interface capture