TRANSCRIPT

JST-CREST ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies

Satoshi Matsuoka (PI), Tokyo Inst. Technology
Reiji Suda, Univ. Tokyo
Takayuki Aoki, Tokyo Inst. Technology
Hiroki Honda, Univ. of Electro-Communications
Michihiro Koibuchi, National Inst. Informatics
DoE Exascale Parameters: x1000 power efficiency in 10 years

System attributes          | "2010" (Jaguar / TSUBAME) | "2015"              | "2018-20"
System peak                | 2 PetaFlops               | 100-200 PetaFlops   | 1 ExaFlops
Power                      | 6 MW / 1.3 MW             | 15 MW               | 20 MW
System Memory              | 0.3 PB / 0.1 PB           | 5 PB                | 32-64 PB
Node Perf                  | 125 GF / 1.6 TF           | 0.5 TF or 7 TF      | 1 TF or 10 TF
Node Mem BW                | 25 GB/s / 0.5 TB/s        | 0.1 TB/s or 1 TB/s  | 0.4 TB/s or 4 TB/s
Node Concurrency           | 12 / O(1000)              | O(100) or O(1000)   | O(1000) or O(10000)
#Nodes                     | 18,700 / 1,442            | 50,000 or 5,000     | 1 million or 100,000
Total Node Interconnect BW | 1.5 GB/s / 8 GB/s         | 20 GB/s             | 200 GB/s
MTTI                       | O(days)                   | O(1 day)            | O(1 day)
JST CREST ULP-HPC Scheme
Ultra Multi-Core: Auto-Tuning for Perf. & Power
Bayesian fusion of model and measurements
• Bayesian model and prior distribution
σ_i^2 ~ Inv-χ^2(ν_0, σ_0^2)
μ_i | σ_i^2 ~ N(x_i, σ_i^2 / κ_0)     (x_i: run time estimated by the model)
y_ij | μ_i, σ_i^2 ~ N(μ_i, σ_i^2)
!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
Target 1 (Algorithm 1)
Directives specifying auto-tuning before execution invocation and the algorithm-selection process
Input variables used in the cost definition function
Cost definition function
ABCLibScript: algorithm selection
Slow & Parallel (& ULP) ULP-HPC
SIMD-Vector (GPGPU, etc.)
• Posterior predictive distribution after n measurements
μ_n = (κ_0 x_i + n ȳ_i) / (κ_0 + n),   κ_n = κ_0 + n,   ν_n = ν_0 + n
ν_n σ_n^2 = ν_0 σ_0^2 + Σ_j (y_ij - ȳ_i)^2 + (κ_0 n / (κ_0 + n)) (ȳ_i - x_i)^2
μ_i | (y_i1, ..., y_in, σ_i^2) ~ N(μ_n, σ_i^2 / κ_n)
σ_i^2 | (y_i1, ..., y_in) ~ Inv-χ^2(ν_n, σ_n^2)
y_{i,n+1} | (y_i1, ..., y_in) ~ t_{ν_n}(μ_n, (1 + 1/κ_n) σ_n^2)
(y_i1, ..., y_in: measured run-time data)
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
Target 2 (Algorithm 2)
!ABCLib$ select sub region end
!ABCLib$ static select region end
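
A minimal Python sketch of the model/measurement fusion above, using the conjugate update from the posterior formulas (the prior parameters, measurement values, and the two-candidate comparison are illustrative; this is not ABCLibScript's implementation):

import math

def posterior_predictive(model_estimate, measurements, kappa0=1.0, nu0=1.0, sigma0_sq=1.0):
    """Fuse the model-based run-time estimate (prior mean) with measured run times;
    returns (mean, scale, dof) of the Student-t posterior predictive."""
    n = len(measurements)
    ybar = sum(measurements) / n
    mu_n = (kappa0 * model_estimate + n * ybar) / (kappa0 + n)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    ss = sum((y - ybar) ** 2 for y in measurements)
    sigma_n_sq = (nu0 * sigma0_sq + ss
                  + kappa0 * n * (ybar - model_estimate) ** 2 / (kappa0 + n)) / nu_n
    return mu_n, math.sqrt(sigma_n_sq * (1.0 + 1.0 / kappa_n)), nu_n

# Rank two algorithm candidates by their predicted run time after a few measurements each:
alg1 = posterior_predictive(model_estimate=2.0, measurements=[2.3, 2.1, 2.4])
alg2 = posterior_predictive(model_estimate=1.8, measurements=[2.6, 2.5])
print("pick algorithm 1" if alg1[0] < alg2[0] else "pick algorithm 2")

In an autotuner, such predictive means and their uncertainty can also be used to decide which candidate is worth measuring next.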
Target regions 1 and 2
MRAM, PRAM, Flash, etc.
ULP-HPC Networks
[Scheme diagram: power vs. performance trade-off; labels include "Low Power", "High Perf Model", "Optimization etc.", "Power Optimize using Novel Components in HPC", "Perf. Model", "x10 Power Efficiency", "Power-Aware and Optimizable Applications / Algorithms", and "x1000 Improvement in 10 years"]
Machine (year)        | Power (incl. cooling) | Linpack Perf (PF) | Linpack MFlops/W | Power efficiency factor vs. Exa
Tsubame1.0 (2006Q1)   | 1.8 MW                | 0.038             | 21               | 2368
ORNL Jaguar (2009Q4)  | ~9 MW                 | 1.76              | 196              | 256
Tsubame2.0 (2010Q4)   | 1.8 MW                | 1.2               | 667              | 75   (x31.6 over Tsubame1.0)
K Computer (2011Q2)   | ~16 MW                | 10                | 625              | 80
BlueGene/Q (2012Q1)   | ~12 MW?               | 17                | 1417             | 35.3
Tsubame3.0 (2015Q1)   | 1.8 MW                | 20                | 11000            | 4.6  (~x16 over Tsubame2.0)
EXA (2018Q4)?         | 20 MW                 | 1000              | 50000            | 1
GPU Many-Core Power Efficiency
E.g. CFD applications, compared to one TSUBAME Opteron CPU core; max 200 W/GPU
Weather Prediction: x80 (10 GPUs = 1000 CPUs)
Lattice Boltzmann: ~x100
Phase Field Model: x170
When we assume the acceleration is typically x100 and 1 GPU < 200 W for our applications:
100 GPUs: 20 kW, vs. TSUBAME ~1 MW / 10,000 CPUs
10,000 CPU cores of TSUBAME 1.2: 1 MW (1000 kW)
=> x5-10 lower power per socket
GPU vs. CPU Performance: FLOP/Byte = F/B
Roofline model: Williams & Patterson, 2008, Communications of the ACM
[Figure: roofline plot of performance (GFLOPS, log scale; 708 GFlops peak for the GeForce GTX 285) vs. arithmetic intensity (FLOP/Byte, 10^-2 to 10^1) for the GeForce GTX 285 and Core i7 Extreme; plotted kernels include FFT (NUKADA), SGEMM, N-Body (N log N), MD (N log N), Diffusion, LBM, and the Riken BMT; annotated GPU-over-CPU speedups are roughly x5-10 in the memory-bound and compute-bound regions, up to x10-20]
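
As a reading aid for the roofline figure, a small sketch of the attainable-performance bound it plots (the GTX 285 peak of 708 GFLOPS is from the slide; the memory bandwidths and the CPU peak used here are assumed, illustrative values):

def roofline_gflops(flop_per_byte, peak_gflops, peak_bw_gbs):
    """Attainable GFLOPS = min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_bw_gbs * flop_per_byte)

# Example: GTX 285-class GPU (708 GFLOPS, assumed ~159 GB/s) vs. an assumed quad-core CPU
for f_per_b in (0.1, 0.5, 1.0, 4.0):
    gpu = roofline_gflops(f_per_b, 708.0, 159.0)
    cpu = roofline_gflops(f_per_b, 70.0, 25.0)   # assumed CPU peaks, not from the slide
    print(f"F/B={f_per_b:4.1f}  GPU~{gpu:6.1f} GF  CPU~{cpu:5.1f} GF  ratio~x{gpu/cpu:.1f}")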
TSUBAME 1.2 Experimental Evolution (Oct. 2008)
[System diagram; recoverable details:]
• Compute nodes: Sun x4600 (16 Opteron cores each), 10,480 cores / 655 nodes, 32-128 GBytes/node, 21.4 TeraBytes total memory, 50.4 TeraFlops; OS: Linux (SuSE 9, 10); NAREGI Grid MW; NEC SX-8i
• Unified InfiniBand network: Voltaire ISR9288 x8, ~1310+50 ports, 10 Gbps x2, ~13.5 Terabits/s (3 Tbits bisection), 10 Gbps + external NW
• Storage: 1.5 Petabyte (Sun x4500 x 60, 48 disks of 500 GB each) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 60 GB/s aggregate I/O BW
• ClearSpeed CSX600 SIMD accelerators on PCI-e: 648 boards, 52.2 TeraFlops SFP/DFP
• New production experiment: 170 NVIDIA Tesla S1070 units (~680 Tesla cards); high performance in many BW-intensive apps; only ~10% power increase over TSUBAME 1.0
• New deployment: GCOE TSUBASA, Harpertown Xeon, 90 nodes / 720 CPUs, 8.2 TeraFlops
• 87.1 TF Linpack: the first GPU supercomputer on the Top500 [IPDPS10]
TSUBAME2.0, Nov. 1, 2011: "The Greenest Production Supercomputer in the World"
TSUBAME 2.0: New Development
Precise GPU Power Measurement
Analysis of the lines of a PCI Express riser card and extraction of the power supply lines; automatic identification of measurement sequences and high-precision power measurement with marker kernels (High marker: cudaMemcpy; Low marker: computation with 1 thread)

PCI Express riser card lines (Line # | Description):
1-13 GND | 14 PRSNT#1 | 15-19 12V | 20 SMCLK | 21 12V(B3) | 22 SMDAT | 23 GND | 24 JTAG2 | 25 JTAG3 | 26 JTAG4 | 27 JTAG5 | 28-31 3.3V | 32 JTAG1 | 33 PWRGD | 34 3.3Vaux | 35 WAKE# | 36 --

[Figure: measured power trace (normalized, 0-0.9) over time (~0.976-0.98), showing the high (cudaMemcpy) and low (1-thread compute) marker kernels]
Correction formulae (dependent on probe and PSU); Ubiquitous Probe by the Kuroda team
[Figure: precise (wired) measurements vs. wireless measurements over time, with the fitted correction formula]
Y = 38.6 + 1.15 X - 0.000637 (X - 100)(X - 200) +/- 0.9
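
A one-line application of the correction formula above, mapping a wireless probe reading X to the precise measurement Y (the calibration range and units are not stated here and are assumed):

def correct_wireless_power(x):
    """Map a wireless probe reading X to the precise (wired) value Y using the fitted
    quadratic correction above (residual about +/- 0.9)."""
    return 38.6 + 1.15 * x - 0.000637 * (x - 100.0) * (x - 200.0)

print(correct_wireless_power(150.0))  # example reading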
Measuring GPU Power Consumption
• Two power sources
  - Via PCI Express: < 75 W
  - Direct inputs from PSU: < 240 W
• Uses current clamp sensors around the power inputs
Precise and accurate measurement with current probes + A/D converter (NI PCIe-6259) and a measurement PC:
• Attaches current sensors to the two power lines in PCIe
• Direct power inputs from the PSU
• Reads currents at a 100 µs interval
Statistical Power Modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically from n GPU performance counters
• Linear regression model using performance counters p_i as explanatory variables:
  average power consumption ≈ Σ_{i=1}^{n} c_i p_i
• High accuracy (avg. err. 4.7%), validated against a power meter with high resolution
• Prevents overtraining by ridge regression
• Determines optimal parameters by cross fitting
• Accurate even with DVFS
Future: model-based power optimization. The linear model shows sufficient accuracy, suggesting the possibility of optimizing Exascale systems with O(10^8) processors.
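
A minimal sketch of the counter-based linear power model with a ridge penalty, as described above (the counters, data, and regularization strength are illustrative; the original model's exact counters are not listed here):

import numpy as np

def fit_ridge_power_model(counters, measured_power, lam=1.0):
    """Fit power ~ c0 + sum_i c_i * counter_i with an L2 (ridge) penalty."""
    X = np.hstack([np.ones((counters.shape[0], 1)), counters])  # intercept column
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + reg, X.T @ measured_power)

def predict_power(coeffs, counters):
    X = np.hstack([np.ones((counters.shape[0], 1)), counters])
    return X @ coeffs

# Illustrative usage: 100 samples x 4 counters (e.g. instructions, GM reads, GM writes, ...)
rng = np.random.default_rng(0)
counters = rng.random((100, 4))
power = 60 + counters @ np.array([40.0, 25.0, 20.0, 5.0]) + rng.normal(0, 1, 100)
coeffs = fit_ridge_power_model(counters, power, lam=0.1)
print(predict_power(coeffs, counters[:3]))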
Optimization Effect Predictors for GPUs
Optimization effects heavily depend on the GPU architecture and generation.
[Figure: speedup by optimization (0-11x, with medians) of a naive program for MatrixMul (AMD), MatrixMul (NVIDIA), and FDTD on CPU, Radeon 5870, GTX 285, and GTX 480]
Goal: estimate the effects before making the optimization effort.
Prediction workflow for a naive program:
• Execute with profiling → performance counters of the naive execution (# instructions, # GM reads, # GM writes, ...)
• Execute with access-log output → memory access log → access analyzer → performance counters of the optimized execution (derived from the log)
• Performance model: from the naive exec. time and the estimated opt. exec. time, compute the estimated speedup of the "shared memory" optimization
[Figures: measured vs. estimated speedup of the shared-memory optimization for MatrixMul, DCT, NBody, FDTD, and BinomialOption on GTX 480 and GTX 285 GPUs]
Comparing the measured speedup and the estimation.
Future work: estimating the improvement of energy consumption by integrating the power model.
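
A heavily hedged sketch of the prediction idea: score the naive run and the access-log-derived "optimized" run by a weighted sum of counters and take the ratio as the estimated speedup (the weights and counter names are illustrative placeholders, not the predictor's actual cost model):

def estimated_speedup(naive_counters, predicted_opt_counters, weights):
    """Ratio of weighted counter costs as the estimated speedup of the optimization."""
    cost = lambda c: sum(weights[k] * c.get(k, 0) for k in weights)
    return cost(naive_counters) / cost(predicted_opt_counters)

weights = {"instructions": 1.0, "gm_reads": 50.0, "gm_writes": 50.0}    # assumed relative costs
naive = {"instructions": 1e9, "gm_reads": 2e8, "gm_writes": 1e8}
optimized = {"instructions": 1.1e9, "gm_reads": 2e7, "gm_writes": 1e7}  # after shared-memory reuse
print(f"estimated speedup: x{estimated_speedup(naive, optimized, weights):.1f}")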
Low Power Scheduling in GPU Clusters [IEEE IPDPS-HPPAC 09]
Objective: optimize scheduling on CPU/GPU heterogeneous clusters
• Optimally schedule mixed sets of jobs executable on either CPU or GPU but with different performance & power
• Assume the GPU acceleration factor (%) is known
Result: ~30% improvement in the Energy-Delay Product (EDP)
[Figure: EDP (10^9 sJ, 0-7000) for the Fixed, ECT, Proposed (static), and Proposed (on-line) schedulers]
TODO: a more realistic environment
• Different application power profiles
• PCI bus vs. memory conflicts: GPU applications slow down by 10% or more when co-scheduled with a memory-intensive CPU application
[Figure: SRAD slowdown, execution-time increase (%) vs. background CPU process bandwidth (MB/s, 0-10000)]
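
A minimal sketch of the energy-delay-product objective and a per-job CPU/GPU choice (the job parameters and the greedy rule are illustrative, not the paper's scheduling algorithm):

def edp(energy_joules, time_seconds):
    """Energy-Delay Product: lower is better."""
    return energy_joules * time_seconds

def choose_device(cpu_time, cpu_power, gpu_accel, gpu_power):
    """Greedy per-job choice between CPU and GPU by EDP (illustrative only)."""
    gpu_time = cpu_time / gpu_accel
    cpu_edp = edp(cpu_power * cpu_time, cpu_time)
    gpu_edp = edp(gpu_power * gpu_time, gpu_time)
    return ("GPU", gpu_edp) if gpu_edp < cpu_edp else ("CPU", cpu_edp)

print(choose_device(cpu_time=100.0, cpu_power=80.0, gpu_accel=20.0, gpu_power=200.0))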
Intelligent Task Scheduling for GPU Clusters - considering conflicts on the memory bus and the PCI-E bus
T. Hamano, et al., IPSJ SIGHPC-124 (domestic)
[Node diagram: 4 processes share a node with CPU cores C0-C3, L3 cache, memory, a router, PCI & HT links, and GPU0-GPU3]
The status of each process is one of: (1) memory access, (2) PCI transfer (assuming blocking cuMemcpy), (3) no memory access.
Speed-down ratio of a process: s ≈ Σ_i r(i) × α_i, where r(i) is the percentage of time spent in each status and α_i is the conflict coefficient.
Access time increases due to conflicts; we integrated this model into ECT (Earliest Complete Time) scheduling.
Assuming the following information is known for each job:
• Execution time (in the stand-alone case)
• Total memory accesses (from performance counters)
• Total PCI-E transfers (from the CUDA profiler)
the actual execution time is estimated by our model.
Measured results: energy reduction of up to 2.3%
[Figure: makespan (sec, 700-1400) of ECT vs. the proposed method for task_set1, task_set2, task_set3]
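
A small sketch of the conflict-aware completion-time estimate described above (the status fractions and the conflict coefficients α_i are placeholders; in the original work they come from performance counters, the CUDA profiler, and calibration):

def slowdown(status_fractions, alpha):
    """Speed-down ratio as a weighted sum of the fractions of time spent in each status:
    (1) memory access, (2) PCI transfer, (3) no memory access."""
    return sum(status_fractions[i] * alpha.get(i, 1.0) for i in status_fractions)

def completion_time(standalone_time, status_fractions, alpha):
    """ECT-style estimate of the actual run time under memory/PCI-E conflicts."""
    return standalone_time * slowdown(status_fractions, alpha)

alpha = {1: 1.4, 2: 1.2, 3: 1.0}      # assumed conflict coefficients
fractions = {1: 0.5, 2: 0.2, 3: 0.3}  # assumed fractions of time in each status
print(completion_time(100.0, fractions, alpha))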
ULP-HPC Research Example: Next-Gen ULP Memory System [IEEE IPDPS-HPPAC 08]
[Memory hierarchy diagram:]
• CPUs: L1 cache, L2 cache
• MAIN MEMORY: MRAM + DRAM (decreased DRAM requirement; MRAM is a non-volatile, low-power, fast device)
• SWAP: FLASH (non-volatile, low-power, medium-speed device for fast swap)
• HDD
Performance vs. Power Consumption
Application: NAS CG (class B)
Standard system: DRAM 1 GB
Proposed system: MRAM 128 MB + DRAM 192 MB
[Figures: execution time (sec, 0-50) and memory energy consumption (J, 0-20) vs. installed main memory capacity (MB), with series labeled 0 MB, 64 MB, 128 MB, 256 MB]
Execution overhead: x1.6
Memory energy consumption: reduced to 28%
Ultra-Low-Power HPC Interconnects
• Objective: improve the power efficiency of interconnects by 100 times within 10 years
• Using a large number of small switches
• No more expensive monster-switches (1,000-port switches)
[Diagram: 16 processors (0-15) connected by four trees of small switches (TREE 1-4); initially Trees 1-4 are all on; as traffic decreases (e.g. host 0 finishes its data transfer to host 2) links are turned off until only Tree 1 is on and Trees 2-4 are off, and links are turned back on when traffic returns (e.g. host 3 starts a data transfer to host 2)]
[Figure: frame bandwidth (Mbps, 0-1000) vs. generation time (sec, 0-20), showing the on-to-off and off-to-on transitions between Topology A and Topology B]
Reference: Michihiro Koibuchi, Tomohiro Otsuka, Tomohiro Kudoh, Hideharu Amano, "A Switch-tagged Routing Methodology for PC Clusters with VLAN Ethernet", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 2, pp. 217-230, Feb 2011.
Research Outcome (PC cluster + switches)
SuperNova cluster (225 hosts, 93rd-ranked supercomputer in Nov. 2003)
Before: monster switch (312 ports x 1). After (ULP-HPC): small switches (48 ports x 8).
Power: using small switches (-87%) x cyclic topology (+60%) x On/Off link regulation (-37%) ≈ -87% (1/8)
Cost: using small switches (-98%) x cyclic topology (+60%) x On/Off link regulation (±0%) ≈ -97% (1/30)
NII Group (Michihiro Koibuchi ([email protected]), Masato Yoshimi ([email protected]))
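
A quick arithmetic check of the combined factors quoted above, treating each percentage as a multiplicative factor:

power_factor = (1 - 0.87) * (1 + 0.60) * (1 - 0.37)   # small switches x cyclic topology x on/off links
cost_factor = (1 - 0.98) * (1 + 0.60) * (1 + 0.00)
print(f"power: x{power_factor:.3f} (~1/8), cost: x{cost_factor:.3f} (~1/30)")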
Numerical Weather Prediction (Collaboration: Japan Meteorological Agency)
Meso-scale atmosphere model: cloud-resolving non-hydrostatic model, e.g. WRF (Weather Research and Forecast)
Phenomena: typhoon (~1000 km); tornado, downburst, heavy rain (1-10 km)
WSM5 microphysics: 1% of lines of code, 25% of elapsed time ⇒ 20x boost in microphysics (overall 1.2-1.3x)
ASUCA: full GPU implementation (model developed by the Japan Meteorological Agency)
[Diagrams: block division of the nx x ny x nz domain for advection, and for the 1-D Helmholtz eq.]
ASUCA typhoon simulation: 500 m horizontal resolution, 4792 x 4696 x 48 grid
145.0 TFlops single precision, 76.1 TFlops double precision on TSUBAME 2.0 (3990 M2050 GPUs)
Mesoscale Atmosphere Model ASUCA: horizontal 5 km resolution (present) → horizontal 500 m resolution (x1000)
Dendrite Solidification: Allen-Cahn Equation based on the Phase-Field Model, 1.017 PFLOPS at ~1.1 MW
[Equations: time evolution of the phase field (mobility M, second derivatives in x, y, z, interface parameters W and a), coupled with a thermal conduction equation (conductivity k, latent heat L, heat capacity C); 19 points to solve per stencil update]
First over Peta-FLOPS for a stencil application
SC'11 Gordon Bell Award Finalist & Technical Paper
Dendrite solidification for pure metal on GPGPU; now scales to the PFLOPS level.
TSUBAME Precise Power Measurement of Petascale Applications
• Minute-resolution measurement of TSUBAME2.0 power during petascale "Grand Challenge" runs (4/7 20:40 - 4/8 8:50)
[Figure: power (kW, 0-2000) over the run, with traces labeled Sys, All, 104-105 Sys, 104-105 Cool, and 104-105 All]
GPU-accelerated AMBER11: high-speed and low-energy molecular dynamics simulation
• PMEMD module from AMBER11
• Comparing 16 GPUs to 96 CPU cores: about 10 times acceleration and 75% energy reduction
• Test case: nucleosome, 25,095 atoms
Mathematical Core of Autotuning
Optimization of the unrolling of the outer and middle loops in matrix-matrix multiply; average performance for n = 1 to 256 or 512.
[Figure: performance relative to the best configuration on ~20 processors (PentiumM, Core2Duo, Opteron, Xeon, mobile Pentium 4/3, Pentium4, Itanium2, UltraSPARC3, Power5, Crusoe, SPARC3i, ...) for "Experiment only", "Model only", "Our method 1", and "Our method 2"]
Previous method: high overheads from exhaustive search. Our methods: ~9x efficiency, with a drastic reduction of overheads via sequential experimental design.
Our methods:
• Bayesian formulation
• Suboptimal experimental design
• Parallel experimental design
• Randomized experimental design
• Online autotuning
• Offline autotuning
First English book dedicated to autotuning (Springer). Organizing the international workshop iWAPT (every year from 2006); reported in SIAM News; invited to the US autotuning workshop.
Toward Auto-tuned Math Library
Investigating speed and power with data sizes, operation types, and programming.
5 classes of operations (cf. matrix computations):
1. R1: operations on registers, HT compatible
2. R2: operations on registers, HT incompatible
3. L1: operations on L1 cache
4. L2: operations on L2 cache
5. MM: operations on main memory
Seek the optimal #threads and clock frequency:
• All but MM: fastest and most economical with the maximum number of threads and the fastest clock
• MM: executing with 1 thread gives 25% lower power on Core2Quad; executing at 1733 MHz gives 20% lower power on Core i7.
[Figures: power (W) vs. clock frequency (MHz) for thread counts th=1, 2, 4 on Core2Quad Q6700 (333-2666 MHz) and th=1, 2, 4, 8 on Core i7 975 Extreme Edition (1600-3334 MHz)]
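
A minimal sketch of the thread-count / clock-frequency sweep described above (the candidate grid is illustrative, and measure is a placeholder callback returning elapsed time and average power for one configuration):

import itertools

def pick_config(measure):
    """Sweep thread count and clock frequency; return the configuration with the lowest
    measured energy (time * power)."""
    candidates = itertools.product([1, 2, 4, 8], [1600, 2000, 2666, 3334])  # illustrative grid
    best = None
    for threads, mhz in candidates:
        seconds, watts = measure(threads, mhz)
        energy = seconds * watts
        if best is None or energy < best[0]:
            best = (energy, threads, mhz)
    return best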
NukadaFFT 1.0 release!? [SC08, SC09]
The NukadaFFT library is a very fast auto-tuning FFT library for CUDA GPUs.
Tuning parameters:
(1) Factorization of the transform size
(2) # of threads launched per SM
(3) Padding insertion patterns for shared memory
The library generates many kinds of FFT kernels and selects the fastest one (exhaustive).
For more details, see the GTC 2010 Research Poster, or catch the author.
You can try online benchmarking for the size you are interested in.
http://matsu-www.is.titech.ac.jp/~nukada/nufft/
[Figure: performance of 1-D FFT (GFLOPS, 0-150) over transform sizes 128-512, CUFFT 3.1 vs. NukadaFFT 1.0 (double precision, batch=32,768, GPU=GeForce GTX 480)]
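
A minimal sketch of the exhaustive generate-and-select strategy described above (build_kernel and benchmark are hypothetical callbacks and the candidate values are illustrative; this is not NukadaFFT's actual API):

import itertools

def autotune_fft(transform_size, build_kernel, benchmark):
    """Enumerate candidate (factorization, threads-per-SM, padding) combinations,
    benchmark each generated kernel, and keep the fastest."""
    factorizations = [(8, 8, 4), (16, 4, 4), (4, 4, 4, 4)]   # illustrative for size 256
    threads_per_sm = [64, 128, 256]
    paddings = [0, 1, 2]
    best = None
    for params in itertools.product(factorizations, threads_per_sm, paddings):
        kernel = build_kernel(transform_size, *params)
        t = benchmark(kernel)
        if best is None or t < best[0]:
            best = (t, params)
    return best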
"Auto-Tuning" Sparse Iterative Solvers on Multi-GPU Clusters (A. Cevahir, A. Nukada, S. Matsuoka)
High communication due to irregular sparsity; GPU computing is fast, so the communication bottleneck is even more severe. We propose techniques to achieve scalable parallelization:
• A new sparse matrix-vector multiplication (MxV) kernel [ICCS'09]
• Auto-selection of the optimal MxV kernel at run time [ISC'10]
• Minimization of communication between GPUs [ISC'10]
Two methods: CG and PageRank (on 100 CPUs)
[Figures: comparisons of different CG implementations; 64-bit CG performance results]
Towards Auto-Tuning of Stencil Computation on GPU-accelerated Clusters
• Programming on GPU clusters gets harder because of:
  - Hybrid programming with CUDA/MPI/threads
  - An increasing number of tuning parameters
  - Hiding latency
⇒ Auto-tuning becomes more important.
• 3-HOP communication has been required (GPU <-> CPU <-> CPU <-> GPU); a 1-HOP method is proposed, and we auto-tune between communication methods (up to 20% speed-up with the 1-HOP method).
• We find the optimal parameters, such as block size, by auto-tuning: x60 faster than 1 CPU core.
[Figures: execution time vs. thread/block size, and speed (GFlops) vs. number of GPUs (up to 32) for problem sizes 512x512x512, 1024x1024x1024, 2048x2048x2048, and 512x512x4096, GPU-only (G) and GPU+CPU (G+C)]
526 GFlops is achieved with 32 GPUs on TSUBAME 1.2: x1170 faster than 1 CPU core.
Scalability of Parallel Stencil Computations with Automatic GPU Code Generation and Auto-Tuning
Parallel stencil codes get complex because of:
• Domain decomposition
• MPI+CUDA communication
• Overlapping communication
8 TF is achieved in the Himeno benchmark with 1024 GPUs on TSUBAME2.0: world record!
[Figure: Himeno benchmark GFlops (up to ~9,000) vs. number of sockets (0-1,024) for CPU-L, GPU-L, CPU-XL, GPU-XL]
Physis: a code generation framework for multi-GPUs. The user writes the computation function of a single point:

__stencil__ void average(int x, int y, int z, grid3d_real g)
{
  if (x == 0 || y == 0 || z == 0) return;
  float ret = psGridGet(g,0,0,0)
    + psGridGet(g,-1,0,0) + psGridGet(g,1,0,0)
    + psGridGet(g,0,-1,0) + psGridGet(g,0,1,0)
    + psGridGet(g,0,0,-1) + psGridGet(g,0,0,1);
  psGridEmit(g, ret / 7.0);
}

Parallel stencil code for GPU clusters is generated from this.
[Figure: weak scalability of a seismic simulation, GFlops (0-5000) vs. number of GPUs (0-1000)]
Energy and Performance Tuning Framework for the GPU Accelerator Environment
[Framework diagram: a source program with tuning directives and OpenMP directives is processed by ABCLibScript and OMPCUDA; a tuning policy and user policy, together with a Power API, drive automatic tuning of the resulting CPU program and GPU program on the hardware]
System components of the energy and performance tuning framework:
• Automatic tuning processor: ABCLibScript (ULP-HPC, UT). A source program with tuning directives is automatically tuned into a tuned program.
• OpenMP processor for GPU: OMPCUDA. A source program with OpenMP directives is translated into a CPU program and a GPU program; the program runs on both CPU & GPU.
MatMult Automatic Tuning Result (CPU vs. GPU)
CPU (Nehalem, 4 cores) vs. GPU (Tesla C2050)
[Figure: performance vs. matrix size; the CPU is faster below N = 600 and the GPU is faster above N = 600]
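
A tiny sketch of using the auto-tuned crossover point to dispatch between CPU and GPU (the N = 600 threshold is the measured crossover above; matmul_cpu and matmul_gpu are placeholder callables, not part of OMPCUDA or ABCLibScript):

GPU_CROSSOVER_N = 600  # crossover size found by auto-tuning in the measurement above

def tuned_matmult(a, b, n, matmul_cpu, matmul_gpu):
    """Dispatch to CPU below the crossover size and to GPU above it."""
    return matmul_cpu(a, b) if n < GPU_CROSSOVER_N else matmul_gpu(a, b)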
Tsubame2.0 Efficient Cooling Infrastructure
HP's water-cooled rack: completely closed racks with their own heat exchanger; 1.5x the width of a normal rack plus a rear extension. One rack = 50 TF (~= the entire Earth Simulator).
Cooling for high-density deployments:
• 35 kW of cooling capacity in a single rack (highest rack heat density ever)
• 3000 CFM intake airflow with 7 C chiller water
• Up to 2000 lbs of IT equipment
• Uniform air flow across the front of the servers
• Automatic door opening mechanism controlling both racks
• Adjustable temperature set point
• Removes 95% to 97% of the heat inside racks
• Polycarbonate front door reduces ambient noise considerably
TSUBAME2.0 World Rankings (Nov. 2010 announcement, Green500!!!)
The Top500 (absolute performance):
#1: ~2.5 PetaFlops: China Defense Univ. (NUDT) Tianhe-1A
#2: 1.76 PetaFlops: US ORNL Cray XT5 Jaguar
#3: 1.27 PetaFlops: China Shenzen SC Nebulae
#4: 1.19 PetaFlops: Japan Tokyo Tech. HP/NEC TSUBAME2.0
#5: 1.054 PetaFlops: US LBNL Cray XE6 Hopper
#~33 (#2 Japan): 0.191 PetaFlops: JAEA Fujitsu
The Green500 (performance/power efficiency):
#1: 1684.20: US IBM Research BG/Q Prototype (116)
#2: 958.35: Japan Tokyo Tech/HP/NEC Tsubame 2.0 (4)
#3: 933.06: US NCSA Hybrid Cluster Prototype (403)
#4: 828.67: Japan Riken "K" Supercomputer Prototype (170)
#5-7: 773.38: Germany Julich etc. IBM QPACE SFB TR (207-209)
(#2+: 1448.03: Japan NAO Grape-DR Prototype (383), added in Dec.)
TSUBAME2.0: "Greenest Production Supercomputer in the World", Nov. 2010 and June 2011 (two in a row!)
Petaflops? Gigaflops/W?
TSUBAME 2.0 is x66,000 faster, holds x44,000 the data, and is x3 more power efficient than a laptop.
Laptop: SONY Vaio type Z (VPCZ1)
• CPU: Intel Core i7 620M (2.66 GHz)
• MEMORY: DDR3-1066 4 GB x 2; 256 GB HDD
• OS: Microsoft Windows 7 Ultimate 64bit
• HPL: Intel(R) Optimized LINPACK Benchmark for Windows (10.2.6.015)
• 18.1 GigaFlops Linpack, 369 MegaFlops/W
Supercomputer: TSUBAME 2.0
• CPU: 2714 Intel Westmere 2.93 GHz; GPU: 4071 NVIDIA Fermi M2050
• MEMORY: DDR3-1333 80 TB + GDDR5 12 TB; 11 PB hierarchical storage
• OS: SuSE Linux 11 + Windows HPC Server R2
• HPL: Tokyo Tech Heterogeneous HPL
• 1.192 PetaFlops Linpack, 1043 MegaFlops/W
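
A quick arithmetic check of the quoted ratios (assuming the x44,000 "data" figure compares the 11 PB hierarchical storage with the laptop's 256 GB HDD):

speed_ratio = 1.192e15 / 18.1e9      # Linpack: PetaFlops vs. GigaFlops
efficiency_ratio = 1043.0 / 369.0    # MegaFlops/W
data_ratio = 11e15 / 256e9           # 11 PB storage vs. 256 GB HDD (assumed basis)
print(f"~x{speed_ratio:,.0f} faster, ~x{efficiency_ratio:.1f} power efficient, ~x{data_ratio:,.0f} data")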
Two-Phase Flow Simulation
Particle methods, e.g. SPH (no splash), vs. a mesh scheme (surface capturing):
■ Full Navier-Stokes solver: fractional step
■ Advection term: 5th-order WENO
■ Poisson: MG-BiCGStab (sparse matrix: BiCGStab + MG preconditioner)
■ Surface tension: CSF (Continuous Surface Force) model
■ Surface capture: CLSVOF (THINC + Level-Set), anti-diffusion THINC for interface capture