GWDG Matrix Transpose Results with Hybrid OpenMP / MPI O. Haan Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Germany ( GWDG ) SCICOMP 2000, SDSC, La Jolla


Post on 16-Dec-2015



GWDG

Matrix Transpose Results with Hybrid OpenMP / MPI

O. Haan

Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Germany

( GWDG )

SCICOMP 2000, SDSC, La Jolla


Overview

• Hybrid Programming Model

• Distributed Matrix Transpose

• Performance Measurements

• Summary of Results


Architecture of Scalable Parallel Computers

Two level hierarchy

• cluster of SMP nodes
  distributed memory
  high speed interconnect

• SMP nodes with multiple processors
  shared memory
  bus or switch connected


Programming Models

• message passing over all processors
  MPI implementation for shared memory
  multiple access to switch adapters
  SP: 4-way Winterhawk2 +
      8-way Nighthawk -

• shared memory over all processors
  virtual global address space
  SP: -

• hybrid message passing - shared memory
  message passing between nodes
  shared memory within nodes
  SP: +


Hybrid Programming Model

SPMD program with MPI tasks

OpenMP threads within each task

communication between MPI tasks


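Example of Hybrid Program

      program hybrid_example
      use omp_lib
      implicit none
      include 'mpif.h'
      integer :: com, ierr, nk, my_task, kp, my_thread, i
      real :: node_res, glob_res
      real, allocatable :: thread_res(:)

      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)
      call MPI_COMM_RANK(com, my_task, ierr)
!     one result slot per processor of the node; this assumes that
!     the number of OpenMP threads does not exceed kp
      kp = OMP_GET_NUM_PROCS()
      allocate(thread_res(0:kp-1))
      thread_res = 0.0
!     each thread calls the user-supplied routine work, which is
!     expected to store its result in thread_res(my_thread)
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread, kp, my_task, nk, thread_res)
!$OMP END PARALLEL
!     sum the thread results within the node, then over all MPI tasks
      node_res = 0.0
      do i = 0, kp-1
         node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res, glob_res, 1,
     :                MPI_REAL, MPI_SUM, 0, com, ierr)
      call MPI_FINALIZE(ierr)
      end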


Hybrid Programming vs. Pure Message Passing

+

• works on all SP configurations

• coarser internode communication granularity

• faster intranode communication

-

• larger programming effort

• additional synchronization steps

• reduced reuse of cached data

the net score depends on the problem


Distributed Matrix Transpose


3-step Transpose

n1 x n2 matrix A( i1 , i2 ) --> n2 x n1 matrix B( i2 , i1 )

decompose n1, n2 into local and global parts:
  n1 = n1l * np     n2 = n2l * np

write matrices A, B as 4-dim arrays:
  A( i1l , i1g , i2l ; i2g ) ,  B( i2l , i2g , i1l ; i1g )

step 1 : local reorder
  A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )

step 2 : global reorder
  a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )

step 3 : local transpose
  a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
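
The three steps can be checked with a small serial simulation. The sketch below is not the author's code: it assumes Fortran-style column-major storage, reads the index after the semicolon as the processor number, and models each processor's memory as a flat Python list. The names distribute and transpose_3step are made up for this illustration. It shows why step 1 is useful: after the local reorder, the block destined for each processor is contiguous.

```python
def distribute(A, n1, n2, nproc):
    """Processor p gets the n2l columns of A (given as A[i1][i2]) whose
    global part is i2g = p, stored column-major in a flat list."""
    n2l = n2 // nproc
    return [[A[i1][i2l + n2l * p] for i2l in range(n2l) for i1 in range(n1)]
            for p in range(nproc)]


def transpose_3step(a_loc, n1, n2, nproc):
    """Return the per-processor pieces of B = A^T (assumes nproc divides n1, n2)."""
    n1l, n2l = n1 // nproc, n2 // nproc
    blk = n1l * n2l  # words exchanged between one pair of processors

    # Step 1, local reorder: A(i1l,i1g,i2l ; i2g) -> a1(i1l,i2l,i1g ; i2g)
    a1 = [[None] * (blk * nproc) for _ in range(nproc)]
    for p in range(nproc):
        for i2l in range(n2l):
            for i1g in range(nproc):
                for i1l in range(n1l):
                    src = i1l + n1l * (i1g + nproc * i2l)
                    dst = i1l + n1l * (i2l + n2l * i1g)
                    a1[p][dst] = a_loc[p][src]

    # Step 2, global reorder: a1(i1l,i2l,i1g ; i2g) -> a2(i1l,i2l,i2g ; i1g)
    # Processor p (= i2g) sends its contiguous block q to processor q (= i1g).
    a2 = [[None] * (blk * nproc) for _ in range(nproc)]
    for p in range(nproc):
        for q in range(nproc):
            a2[q][p * blk:(p + 1) * blk] = a1[p][q * blk:(q + 1) * blk]

    # Step 3, local transpose: a2(i1l,i2l,i2g ; i1g) -> B(i2l,i2g,i1l ; i1g)
    b_loc = [[None] * (blk * nproc) for _ in range(nproc)]
    for q in range(nproc):
        for i2g in range(nproc):
            for i2l in range(n2l):
                for i1l in range(n1l):
                    src = i1l + n1l * (i2l + n2l * i2g)
                    dst = i2l + n2l * (i2g + nproc * i1l)
                    b_loc[q][dst] = a2[q][src]
    return b_loc
```

In this layout processor q ends up holding B( i2 , i1l ) for its own i1g = q at offset i2 + n2 * i1l, that is, n1l complete columns of the transposed matrix.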


Local Steps: Copy with Reorder

• data in memory:
  speed limited by performance of bus and memory subsystems
  Winterhawk2 : all processors share the same bus
  bandwidth : 1.6 GB/s

• data in cache:
  speed limited by processor performance
  Winterhawk2 : one load plus one store per cycle
  bandwidth : 8 MB / (1/375) s = 3 GB / s
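
The in-cache figure can be read as a per-cycle rate. This assumes 8-byte words and a 375 MHz processor clock, which is what the 1/375 in the slide suggests: one load plus one store per cycle copies one word per cycle.

\[
8\ \mathrm{B/cycle} \times 375 \times 10^{6}\ \mathrm{cycles/s} = 3 \times 10^{9}\ \mathrm{B/s} = 3\ \mathrm{GB/s}
\]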


Copy: Data in Memory


Copy : Prefetch


Copy : Data in Cache


Global Reorder

a1( * , * , i1g ; i2g ) -> a2( * , * , i2g ; i1g )
global reorder on np processors in np steps

[figure: block exchange between processors p0, p1, p2 in step0, step1, step2]
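
The exchange pattern of the figure did not survive in this transcript. One common way to organise a global reorder on np processors in np steps is a cyclic shift: in step s, processor p sends one block to processor (p + s) mod np, and step 0 is a purely local copy. The sketch below illustrates that assumed pattern and is not necessarily the schedule used in the talk. The function name exchange_schedule is hypothetical.

```python
def exchange_schedule(nproc):
    """Assumed cyclic-shift schedule: entry s lists the (sender, receiver)
    pairs that are active in step s."""
    return [[(p, (p + s) % nproc) for p in range(nproc)]
            for s in range(nproc)]


if __name__ == "__main__":
    for s, pairs in enumerate(exchange_schedule(3)):
        print("step", s, pairs)
```

With this pattern every processor sends and receives exactly one block per step, and after np steps every (sender, receiver) pair has been served exactly once.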


Performance Modelling

Hardware model: nk nodes with kp processors each

np = nk * kp is the total processor count

Switch model: nk concurrent links between nodes

latency tlat , bandwidth c

execution model for Hybrid: reorder on nk nodes:

nk steps with n1*n2 / nk**2 data per node

execution model for MPI: reorder on np processors:

np steps with n1*n2 / np**2 data per processor

switch links shared between kp processors


Performance Modelling

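Hybrid timing model:

\[
t_{smp} = \mathit{nk} \left( t_{lat} + \frac{1}{c} \, \frac{n_1 n_2}{\mathit{nk}^2} \right)
        = \mathit{nk} \, t_{lat} + \frac{1}{c} \, \frac{n_1 n_2}{\mathit{nk}}
\]

MPI timing model:

\[
t_{mpi} = \mathit{np} \left( t_{lat} + \frac{\mathit{kp}}{c} \, \frac{n_1 n_2}{\mathit{np}^2} \right)
        = \mathit{nk} \, \mathit{kp} \, t_{lat} + \frac{1}{c} \, \frac{n_1 n_2}{\mathit{nk}}
\]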
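
Both models charge each step one latency plus the block size divided by the available bandwidth. The hybrid version takes nk steps at the full link bandwidth c; the MPI version takes np = nk * kp steps at bandwidth c / kp. A minimal sketch of the two models follows. The function names and the numeric values in the usage part are illustrative assumptions, not measurements from the talk.

```python
def t_hybrid(n1, n2, nk, t_lat, c):
    """nk steps; each moves n1*n2/nk**2 words at bandwidth c (words/s)."""
    return nk * (t_lat + (n1 * n2 / nk**2) / c)


def t_mpi(n1, n2, nk, kp, t_lat, c):
    """np = nk*kp steps; each moves n1*n2/np**2 words at bandwidth c/kp,
    because kp processors share one switch link."""
    nproc = nk * kp
    return nproc * (t_lat + (n1 * n2 / nproc**2) / (c / kp))


if __name__ == "__main__":
    # Hypothetical parameters: 16 nodes, 4 processors per node,
    # 30 microseconds latency, 10 Mword/s link bandwidth.
    args = dict(n1=1000, n2=1000, nk=16, t_lat=30e-6, c=10e6)
    print("hybrid:", t_hybrid(**args))
    print("MPI:   ", t_mpi(kp=4, **args))
```

In this model the bandwidth terms of the two versions are identical, so the MPI version is slower by exactly nk * (kp - 1) * tlat. The model therefore predicts the largest relative gain from hybrid programming for small matrices, where latency dominates.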


Timing of Global Reorder (internode part)


Timing of Global Reorder (internode part)


Timing of Global Reorder


Timing of Transpose


Scaling of Transpose


Timing of Transpose Steps


Summary of Results: Hardware

• Memory access in Winterhawk2 is not adequate:
  copy rate of 400 MB/s = 50 Mwords/s
  peak CPU rate of 6000 Mflop/s
  a factor of about 100 between computational speed and memory speed

• Sharing of switch link by 4 processors degrades communication speed:
  bandwidth smaller by more than a factor of 4 ( factor of 4 expected )
  latency larger by nearly a factor of 4 ( factor of 1 expected )


Summary of Results: Hybrid vs. MPI

• hybrid OpenMP / MPI programming is profitable for distributed matrix transpose :
  1000 x 1000 matrix on 16 nodes : 2.3 times faster
  10000 x 10000 matrix on 16 nodes : 1.1 times faster

• Competing influences :
  MPI programming enhances use of cached data
  Hybrid programming has lower communication latency and coarser communication granularity


Summary of Results: Use of Transpose in FFT

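2-dim complex array of size

\[ n = n_1 n_2 \]

Execution time on nk nodes :

\[
t = \frac{1}{r} \, \frac{5 \, n \log_2 n}{\mathit{nk}} + \frac{1}{c} \, \frac{2 \, n}{\mathit{nk}}
\]

where r : computational speed per node
      c : transpose speed per node

effective execution speed per node :

\[
r_{eff} = r \, \frac{1}{1 + \dfrac{2}{5 \log_2 n} \, \dfrac{r}{c}}
\]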


Summary of Results: Use of Transpose in FFT - Example SP

r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and programming model

nk = 16
                 n = 10**6    n = 10**9

transpose speed per node c
  hybrid         5.6          7.8         Mword/s
  MPI            2.5          7.0         Mword/s

effective execution speed per node r_eff
  hybrid         208          338         Mflop/s
  MPI            108          317         Mflop/s
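
The tabulated speeds can be cross-checked against the effective-speed formula, r_eff = r / (1 + 2 r / (5 c log2(n))), with r in Mflop/s and c in Mword/s. The sketch below simply evaluates that formula for the four (c, n) pairs of the example; the function name effective_speed is made up for this illustration.

```python
import math


def effective_speed(r, c, n):
    """Effective FFT speed per node in Mflop/s, for computational speed r
    (Mflop/s), transpose speed c (Mword/s) and total array size n."""
    return r / (1.0 + 2.0 * r / (5.0 * math.log2(n) * c))


if __name__ == "__main__":
    r = 4 * 200.0  # Mflop/s per node, as in the example
    for label, c, n in [("hybrid", 5.6, 1e6), ("hybrid", 7.8, 1e9),
                        ("MPI", 2.5, 1e6), ("MPI", 7.0, 1e9)]:
        print(label, n, round(effective_speed(r, c, n)))
```

The computed values agree with the table to within about 1 to 2 Mflop/s; the small differences are presumably due to rounding of c in the slide.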