GWDG

Matrix Transpose Results with Hybrid OpenMP / MPI

O. Haan

Gesellschaft für wissenschaftliche Datenverarbeitung (GWDG)

Göttingen, Germany

SCICOMP 2000, SDSC, La Jolla

Overview

• Hybrid Programming Model

• Distributed Matrix Transpose

• Performance Measurements

• Summary of Results


Architecture of Scalable Parallel Computers

Two level hierarchy

• cluster of SMP nodes
    distributed memory
    high-speed interconnect

• SMP nodes with multiple processors
    shared memory
    bus or switch connected


Programming Models

• message passing over all processors
    MPI implementation for shared memory
    multiple access to switch adapters
    SP: 4-way Winterhawk2 + , 8-way Nighthawk -

• shared memory over all processors
    virtual global address space
    SP: -

• hybrid message passing - shared memory
    message passing between nodes
    shared memory within nodes
    SP: +


Hybrid Programming Model

SPMD program with MPI tasks

OpenMP threads within each task

communication between MPI tasks


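Example of Hybrid Program

      program hybrid_example
      implicit none
      include 'mpif.h'
      integer com, nk, my_task, kp, my_thread, i, ierr
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      real node_res, glob_res
      real, allocatable :: thread_res(:)

      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)
      call MPI_COMM_RANK(com, my_task, ierr)

!     one OpenMP thread per processor of the node
      kp = OMP_GET_NUM_PROCS()
      call OMP_SET_NUM_THREADS(kp)
      allocate(thread_res(0:kp-1))

!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
!     work is user supplied; it is expected to set thread_res(my_thread)
      call work(my_thread, kp, my_task, nk, thread_res)
!$OMP END PARALLEL

!     sum the thread results within the node, then reduce over MPI tasks
      node_res = 0.0
      do i = 0, kp-1
         node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res, glob_res, 1,
     :                MPI_REAL, MPI_SUM, 0, com, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end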


Hybrid Programming vs. Pure Message Passing

+

• works on all SP configurations

• coarser internode communication granularity

• faster intranode communication

-

• larger programming effort

• additional synchronization steps

• reduced reuse of cached data

the net score depends on the problem


Distributed Matrix Transpose


3-step Transpose

n1 x n2 matrix A( i1 , i2 ) --> n2 x n1 matrix B( i2 , i1 )

decompose n1, n2 into local and global parts:
    n1 = n1l * np , n2 = n2l * np

write matrices A, B as 4-dim arrays (the index after the semicolon is the one distributed over the processors):
    A( i1l , i1g , i2l ; i2g ) , B( i2l , i2g , i1l ; i1g )

step 1 : local reorder
    A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )

step 2 : global reorder
    a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )

step 3 : local transpose
    a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
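The three steps can be checked on a small matrix with a serial sketch in Python. This is an illustration only, not the code used for the measurements: the processors are simulated by a list of dictionaries, the name np_ stands for np, and the index split i1 = i1l + n1l * i1g (likewise for i2) is an assumption taken from the Fortran ordering of the 4-dim arrays.

```python
def three_step_transpose(A, n1, n2, np_):
    """Transpose the n1 x n2 matrix A (list of rows) via the 3-step scheme.

    Each simulated processor is a dict keyed by its three local indices.
    Assumes np_ divides both n1 and n2.
    """
    n1l, n2l = n1 // np_, n2 // np_

    # Distribute A: processor i2g holds A(i1l, i1g, i2l ; i2g).
    local_A = [dict() for _ in range(np_)]
    for i1 in range(n1):
        for i2 in range(n2):
            i1g, i1l = divmod(i1, n1l)
            i2g, i2l = divmod(i2, n2l)
            local_A[i2g][(i1l, i1g, i2l)] = A[i1][i2]

    # Step 1, local reorder: A(i1l, i1g, i2l ; i2g) -> a1(i1l, i2l, i1g ; i2g)
    a1 = [{(i1l, i2l, i1g): v for (i1l, i1g, i2l), v in proc.items()}
          for proc in local_A]

    # Step 2, global reorder: a1(i1l, i2l, i1g ; i2g) -> a2(i1l, i2l, i2g ; i1g)
    # The block labelled i1g on processor i2g moves to processor i1g.
    a2 = [dict() for _ in range(np_)]
    for i2g in range(np_):
        for (i1l, i2l, i1g), v in a1[i2g].items():
            a2[i1g][(i1l, i2l, i2g)] = v

    # Step 3, local transpose: a2(i1l, i2l, i2g ; i1g) -> B(i2l, i2g, i1l ; i1g)
    local_B = [{(i2l, i2g, i1l): v for (i1l, i2l, i2g), v in proc.items()}
               for proc in a2]

    # Gather B into an n2 x n1 list of rows for checking.
    B = [[None] * n1 for _ in range(n2)]
    for i1g in range(np_):
        for (i2l, i2g, i1l), v in local_B[i1g].items():
            B[i2l + n2l * i2g][i1l + n1l * i1g] = v
    return B


# Small check: 4 x 6 matrix on 2 simulated processors.
A = [[10 * i1 + i2 for i2 in range(6)] for i1 in range(4)]
B = three_step_transpose(A, 4, 6, 2)
print(B == [list(col) for col in zip(*A)])
```

Only step 2 moves data between processors; steps 1 and 3 are copies within one processor's memory.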


Local Steps: Copy with Reorder

• data in memory:
    speed limited by performance of bus and memory subsystems
    Winterhawk2: all processors share the same bus, bandwidth: 1.6 GB/s

• data in cache:
    speed limited by processor performance
    Winterhawk2: one load plus one store per cycle, bandwidth: 8 B per cycle at 375 MHz = 3 GB/s


Copy: Data in Memory


Copy : Prefetch


Copy : Data in Cache


Global Reorder

a1( * , * , i1g ; i2g ) -> a2( * , * , i2g ; i1g )

global reorder on np processors in np steps

[Figure: block exchange between processors p0, p1, p2 in step0, step1, step2]
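One way to organise the np steps is a cyclic schedule in which, in step s, processor p passes one block to processor (p + s) mod np. The slides do not state which schedule was used, so the pattern below is an assumption, and the function name cyclic_exchange is hypothetical. The sketch simulates the exchange serially in Python.

```python
def cyclic_exchange(blocks):
    """Simulate a global reorder on np_ processors in np_ steps.

    blocks[p][q] is the block a1(*, *, i1g=q ; i2g=p) held by processor p.
    Returns out with out[q][p] = a2(*, *, i2g=p ; i1g=q) held by processor q.
    Assumed schedule: in step s, processor p sends to (p + s) mod np_.
    """
    np_ = len(blocks)
    out = [[None] * np_ for _ in range(np_)]
    for step in range(np_):
        for p in range(np_):
            dest = (p + step) % np_          # step 0 is the local copy
            out[dest][p] = blocks[p][dest]   # one block sent per processor
    return out


# Three processors; each block is labelled by (owner, destination).
blocks = [[(p, q) for q in range(3)] for p in range(3)]
for q, row in enumerate(cyclic_exchange(blocks)):
    print("p%d holds" % q, row)
```

With this schedule every processor sends one block and receives one block in each step, so no link carries more than one message at a time.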


Performance Modelling

Hardware model: nk nodes with kp processors each
    np = nk * kp is the total processor count

Switch model: nk concurrent links between nodes
    latency tlat , bandwidth c

Execution model for hybrid: reorder on nk nodes
    nk steps with n1*n2 / nk**2 data per node

Execution model for MPI: reorder on np processors
    np steps with n1*n2 / np**2 data per processor
    switch links shared between kp processors


Performance Modelling

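Hybrid timing model:

$$ t_{smp} = \mathit{nk} \left( t_{lat} + \frac{1}{c} \, \frac{n_1 n_2}{\mathit{nk}^2} \right) = \mathit{nk} \, t_{lat} + \frac{1}{c} \, \frac{n_1 n_2}{\mathit{nk}} $$

MPI timing model:

$$ t_{mpi} = \mathit{np} \left( t_{lat} + \frac{1}{c / \mathit{kp}} \, \frac{n_1 n_2}{\mathit{np}^2} \right) = \mathit{kp} \, \mathit{nk} \, t_{lat} + \frac{1}{c} \, \frac{n_1 n_2}{\mathit{nk}} $$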
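The two models can be compared numerically with a short Python sketch. It charges each step one latency plus the transfer time of that step's data: nk steps at bandwidth c for hybrid, np steps at bandwidth c / kp for MPI. The values chosen for t_lat and c are placeholders for illustration, not measured SP values, and the function names are hypothetical.

```python
def t_hybrid(n1, n2, nk, t_lat, c):
    """Hybrid model: nk steps, n1*n2/nk**2 words per step at bandwidth c."""
    return nk * (t_lat + (n1 * n2 / nk**2) / c)


def t_mpi(n1, n2, nk, kp, t_lat, c):
    """MPI model: np steps, n1*n2/np**2 words per step, link shared by kp."""
    np_ = nk * kp
    return np_ * (t_lat + (n1 * n2 / np_**2) / (c / kp))


# Placeholder parameters (assumptions): latency in seconds, bandwidth in words/s.
t_lat, c = 30e-6, 1.0e8
nk, kp = 16, 4
for n in (1000, 10000):
    th = t_hybrid(n, n, nk, t_lat, c)
    tm = t_mpi(n, n, nk, kp, t_lat, c)
    print("n1 = n2 = %5d : t_hybrid = %.3e s, t_mpi = %.3e s, ratio = %.2f"
          % (n, th, tm, tm / th))
```

In this model the bandwidth terms of the two variants are identical, so they differ only in the latency term, which is kp times larger for MPI. The gap therefore shrinks as the matrix grows.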


Timing of Global Reorder (internode part)


Timing of Global Reorder


Timing of Transpose


Scaling of Transpose


Timing of Transpose Steps


Summary of Results: Hardware

• Memory access in Winterhawk2 is not adequate:
    copy rate of 400 MB/s = 50 Mwords/s
    peak CPU rate of 6000 Mflop/s
    a factor of 100 between computational speed and memory speed

• Sharing of a switch link by 4 processors degrades communication speed:
    bandwidth smaller by more than a factor of 4 (factor of 4 expected)
    latency larger by nearly a factor of 4 (factor of 1 expected)


Summary of Results: Hybrid vs. MPI

• Hybrid OpenMP / MPI programming is profitable for distributed matrix transpose:
    1000 x 1000 matrix on 16 nodes: 2.3 times faster
    10000 x 10000 matrix on 16 nodes: 1.1 times faster

• Competing influences:
    MPI programming enhances use of cached data
    Hybrid programming has lower communication latency and coarser communication granularity


Summary of Results: Use of Transpose in FFT

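2-dim complex array of size

$$ n = n_1 n_2 $$

Execution time on nk nodes:

$$ t = \frac{5 n \log_2 n}{\mathit{nk}} \, r^{-1} + \frac{2 n}{\mathit{nk}} \, c^{-1} $$

where r : computational speed per node, c : transpose speed per node

effective execution speed per node:

$$ r_{eff} = r \, \frac{1}{1 + \frac{2}{5 \log_2 n} \, \frac{r}{c}} $$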


Summary of Results: Use of Transpose in FFT - Example SP

r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and programming model

nk = 16

transpose speed per node c [Mword/s]

              n = 10**6    n = 10**9
    hybrid       5.6          7.8
    MPI          2.5          7.0

effective execution speed per node r_eff [Mflop/s]

              n = 10**6    n = 10**9
    hybrid       208          338
    MPI          108          317
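The effective speeds in this example can be recomputed from r, c and n with a few lines of Python. The sketch assumes the execution-time model t = 5 n log2(n) / (nk r) + 2 n / (nk c), which gives r_eff = r / (1 + 2 r / (5 c log2(n))); the function name effective_speed is hypothetical.

```python
import math


def effective_speed(r, c, n):
    """Effective FFT speed per node.

    r: computational speed per node in Mflop/s
    c: transpose speed per node in Mword/s
    n: total array size n1 * n2
    """
    return r / (1.0 + 2.0 * r / (5.0 * math.log2(n) * c))


r = 4 * 200.0  # Mflop/s per node, as in the example
for model, c_values in (("hybrid", (5.6, 7.8)), ("MPI", (2.5, 7.0))):
    for n, c in zip((10**6, 10**9), c_values):
        print("%-6s n = %.0e  c = %.1f Mword/s  r_eff = %.0f Mflop/s"
              % (model, n, c, effective_speed(r, c, n)))
```

Because the c values are quoted to two digits, the recomputed speeds match the quoted ones only to within about one percent.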
