Scalable Heterogeneous Computing: Accelerating MPI & I/O on CPU+GPU Nodes

Sadaf Alam, Swiss National Supercomputing Centre, SOS16 Workshop

TRANSCRIPT

Page 1:

Scalable Heterogeneous Computing: Accelerating MPI & I/O on CPU+GPU Nodes

Sadaf Alam, Swiss National Supercomputing Centre, SOS16 Workshop

Page 2:

Collaborators & contributors

•  DK Panda & Devendar Bureddy, Ohio State University

•  Antonio J. Peña, Technical University of Valencia

•  Oliver Fuhrer, MeteoSwiss

Page 3:

Characteristics of Internode GPU Communication

[Figure: GPU-GPU transfer abstraction. Each node pairs CPUs and GPUs over PCIe; an on-node interconnect links the sockets within a node, and an off-node interconnect links the nodes. An internode GPU transfer is staged in three steps, sketched in code below:]

1. Copy data from GPU to CPU (device to host)

2. MPI message transfer between two CPUs (host to host MPI)

3. Copy data from CPU to GPU (host to device)
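For concreteness, a minimal C sketch of this staged path (a hedged illustration, not the talk's code; buffer names, message size, and the tag are assumptions):

    /* Staged internode GPU transfer: D2H copy, host-to-host MPI, H2D copy.
       d_buf is device memory, h_buf a host staging buffer (illustrative). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void staged_send(const double *d_buf, double *h_buf, int n, int dst) {
        /* step 1: device-to-host copy */
        cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
        /* step 2: host-to-host MPI */
        MPI_Send(h_buf, n, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
    }

    void staged_recv(double *d_buf, double *h_buf, int n, int src) {
        MPI_Recv(h_buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* step 3: host-to-device copy */
        cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
    }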

Page 4:

Motivating Examples

•  GPU-accelerated weather modeling application (COSMO)

–  Halo Exchanges between GPUs (on-node and off-node)

–  Different halo sizes

–  Goal: a single-source, auto-tuned code base for platforms with diverse CPU/GPU node configurations

•  GPU virtualization (rCUDA)

–  Solution for small to mid-size heterogeneous clusters

–  Scheduling on remote GPU without performance penalties

–  Goal: define a cost model for scheduling decisions through detailed characterization of GPU-GPU data transfers


Page 5:

Bandwidth and Latency Considerations on Homogeneous Multi-core Systems

[Figure: CPU-CPU transfer abstraction with Non-Uniform Memory Access (NUMA) considerations. Node 0 and Node 1 each contain four NUMA domains (NUMA0-NUMA3) and a network buffer, and are joined by the interconnect.]

Page 6:

Empirical Results (One Dual-socket Interlagos Node)

[Figure: core-to-NUMA layout of the dual-socket Interlagos node: cores 0-7 in NUMA 0, cores 8-15 in NUMA 1, cores 16-23 in NUMA 2, and cores 24-31 in NUMA 3, annotated with the measured MPI bandwidth (GB/s) between NUMA domain pairs at different distances. A generic ping-pong sketch follows.]

Page 7:

[Figure: Bandwidth on the Cray XE6 system with Interlagos processors. Series: MPI between the two NUMA domains closest to the network interface, and MPI between the two NUMA domains farthest from it.]

Page 8:

[Figure: Bandwidth on the Cray XE6 system with Interlagos processors using the low-level communication API (uGNI) and MPI.]

Benefit of tuning for NUMA0; no benefit of tuning for NUMA2.

Page 9:

Latency and Bandwidth: CPU & Accelerators

[Figure: Node 0 and Node 1, each with four NUMA domains (NUMA0-NUMA3) and two GPUs (GPU0, GPU1) attached via an I/O bus, connected through the interconnect.]

Page 10:

Implementation Scenarios

Naive implementation:

$Latency_{total} = L_{CPU \to GPU}^{pageable} + L_{GPU \to CPU}^{pageable} + L_{CPU \to CPU}^{MPI}$

Typical implementation:

$Latency_{total} = L_{CPU \to GPU}^{pinned} + L_{GPU \to CPU}^{pinned} + L_{CPU \to CPU}^{MPI}$

Advanced implementation:

$Latency_{total} = \min(L_{CPU \to GPU}^{pinned}) + \min(L_{GPU \to CPU}^{pinned}) + \min(L_{CPU \to CPU}^{MPI})$

Expert implementation:

$Latency_{total} = \max\left(\min(L_{CPU \to GPU}),\ \min(L_{GPU \to CPU}),\ \min(L_{CPU \to CPU}^{MPI})\right)$
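The expert formula reflects full overlap: when the device-to-host copy, the MPI transfer, and the host-to-device copy are pipelined over chunks, the total latency is governed by the slowest stage rather than by their sum. A minimal double-buffered sketch of the sending side in C (a hedged illustration, not the talk's code; the chunk size, buffer names, and tag are assumptions, and the receiver would mirror this with MPI_Irecv plus H2D copies):

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 18)   /* 256 KiB per pipeline stage (illustrative) */

    /* D2H copy of chunk i overlaps the MPI send of chunk i-1, so the
       pipeline throughput is set by the slower of the two stages. */
    void pipelined_send(const char *d_buf, size_t bytes, int dst) {
        char *h[2];
        cudaStream_t s[2];
        MPI_Request req = MPI_REQUEST_NULL;
        for (int i = 0; i < 2; i++) {
            cudaMallocHost((void **)&h[i], CHUNK);  /* pinned staging buffers */
            cudaStreamCreate(&s[i]);
        }
        int b = 0;
        for (size_t off = 0; off < bytes; off += CHUNK, b ^= 1) {
            size_t len = bytes - off < CHUNK ? bytes - off : CHUNK;
            /* start copying this chunk while the previous send is in flight */
            cudaMemcpyAsync(h[b], d_buf + off, len, cudaMemcpyDeviceToHost, s[b]);
            MPI_Wait(&req, MPI_STATUS_IGNORE);      /* previous chunk sent    */
            cudaStreamSynchronize(s[b]);            /* this chunk now on host */
            MPI_Isend(h[b], (int)len, MPI_CHAR, dst, 0, MPI_COMM_WORLD, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        for (int i = 0; i < 2; i++) { cudaFreeHost(h[i]); cudaStreamDestroy(s[i]); }
    }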

Page 11:

[Figure: Bandwidth between two GPU nodes on the Cray XK6 platform. Series: CUDA device-to-host copy + MPI node-to-node + CUDA host-to-device copy, compared with uGNI GPU-GPU transfers.]

Page 12:

[Figure: Bandwidth between two GPU nodes on the Cray XK6 (Todi) platform: tuning using different packet sizes.]

Page 13:

Advances in CUDA and InfiniBand Drivers

•  Shared access to device memories

•  Unified virtual addressing (UVA)

•  Peer-to-peer support

•  GPU and CPU bindings

•  An optimal implementation exploits all of the above, BUT introducing a new feature should not force a code rewrite (see the sketch below)

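As a hedged illustration of the peer-to-peer feature (the function name, device IDs, and fallback policy are assumptions, not from the talk; the CUDA calls themselves are the standard runtime API):

    #include <cuda_runtime.h>

    /* Enable direct peer-to-peer access from one GPU to another when the
       hardware/driver combination supports it; otherwise signal the caller
       to fall back to staged copies through the host. */
    int enable_p2p(int dev, int peer) {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, dev, peer);
        if (!can) return 0;                    /* fall back to staged copies */
        cudaSetDevice(dev);
        cudaDeviceEnablePeerAccess(peer, 0);   /* flags must be 0 */
        return 1;
    }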

Page 14:

Halo Exchange Code Implementation with MVAPICH2-GPU

Single code base for all configurations

[Figure: halo exchanges between GPUs on the same I/O hub, between GPUs on the same node, and between GPUs on different nodes (Node 0 and Node 1).]

Page 15:

Implementation using MVAPICH2-GPU

The three implementation variants compare as follows (✓ = present, ✗ = absent); a sketch of the MPI-only variant follows the table:

                                                        Naive  Advanced  MPI only
    Buffer initialization and final copy on the host      ✓       ✓         ✗
    Copy buffers to and from the host each time step      ✓       ✗         ✗
    Computation on the device                             ✓       ✓         ✓
    Non-blocking sends and receives                       ✓       ✓         ✓

Advanced: lower memory use & fewer lines of code. MPI only: least host memory & fewest lines of code.
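A minimal C sketch of the MPI-only variant (assuming a CUDA-aware MPI such as MVAPICH2-GPU that accepts device pointers directly; the neighbor names, tags, and buffer layout are illustrative assumptions):

    #include <mpi.h>

    /* Halo exchange with device pointers passed straight to MPI: the host
       staging buffers and explicit cudaMemcpy calls disappear entirely. */
    void exchange_halos(double *d_west, double *d_east,        /* send halos */
                        double *d_recv_west, double *d_recv_east,
                        int n, int west, int east) {
        MPI_Request req[4];
        /* non-blocking receives and sends operate directly on GPU memory */
        MPI_Irecv(d_recv_west, n, MPI_DOUBLE, west, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(d_recv_east, n, MPI_DOUBLE, east, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(d_east, n, MPI_DOUBLE, east, 0, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(d_west, n, MPI_DOUBLE, west, 1, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }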

Page 16:

!"

!#!!!!$"

!#!!!%"

!#!!!%$"

!#!!!&"

!#!!!&$"

!#!!!'"

&" (" )" *" %!" %&" %(" %)"

!"#$%&'$(

)*&$+,)

-./)'0&1&)2)3.4)5$6"+$&)

+,-./"0&(*"12" 34."0&(*"12" 5&5"0&(*"12"

Results (MPI + GPU Halo Exchange)

•  On a dual-socket, dual-GPU, ID QDR system

16

!"

!#!!!!$"

!#!!!%"

!#!!!%$"

!#!!!&"

!#!!!&$"

!#!!!'"

&" (" )" *" %!" %&" %(" %)"

!"#$%&'$(

)*&$+,)

-./)'0&1&)2)3.4)5$6"+$&)

+,-./"0%!%)"12" 34."0%!%)"12" 5&5"0%!%)"12"

!"

!#!!!$"

!#!!!%"

!#!!!&"

!#!!!'"

!#!!!("

!#!!!)"

%" '" )" *" $!" $%" $'" $)"

!"#$%&'$(

)*&$+,)

-./)'0&1&)2)3.4)5$6"+$&)

+,-./"0'!**"12" 34."0'!**"12" 5%5"0'!**"12"

!"!#!!!$"!#!!%"

!#!!%$"!#!!&"!#!!&$"!#!!'"!#!!'$"!#!!("

&" (" )" *" %!" %&" %(" %)"

!"#$%&'$(

)*&$+,)

-./)'0&1&)2)3.4)5$6"+$&)

+,-./"0%)'1)"23" 45."0%)'1)"23" 6&6"0%)'1)"23"

Page 17:

Results (MPI + GPU Halo Exchange)

•  On an 8-GPU, single-node system with 2 I/O hubs

[Figure: four charts of time/step (secs) versus the number of MPI tasks & GPU devices (2-8), one per halo size (248 B, 1016 B, 4088 B, 16376 B), each comparing the Naive, Adv, and D2D implementations.]

Page 18:

Results (MPI + GPU Halo Exchange)

•  Without MVAPICH2-GPU

[Figure: four charts of time/step (secs) versus the number of MPI tasks & GPU devices (2-8), one per halo size (248 B, 1016 B, 4088 B, 16376 B), comparing the Naive, Adv, and D2D implementations against an additional variant without MVAPICH2-GPU support.]

Page 19:

Shortcomings of the Current Solution

•  Not a portable solution

–  Only a single MPI library implementation

–  Dependency on a given runtime & driver

–  Network hardware dependency

•  Interoperability constraints

–  Emerging, directive-based programming models

–  MPI standards conformance


Page 20:

Future Plans & Thoughts

•  Move additional computation to the GPU to minimize transfers

–  Towards a self-hosted implementation

–  Reduction in host memory requirements

•  Encourage middleware developers to provide similar solutions

–  OpenCL support

–  Support for multiple accelerators

–  Support for multiple MPI implementations (discussions in the MPI Forum)

•  Explore interoperability with emerging programming models for accelerators
