Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara, M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago, May 2009


DESCRIPTION

Slides from the thesis defence in Chicago by Vittorio Giovara.

TRANSCRIPT

Page 1: Parallel and Distributed Computing on Low Latency Clusters

Parallel and Distributed Computing on Low Latency Clusters

Vittorio Giovara, M.S. Electrical Engineering and Computer Science

University of Illinois at Chicago, May 2009

Page 2: Parallel and Distributed Computing on Low Latency Clusters

Contents

• Motivation

• Strategy

• Technologies

- OpenMP

- MPI

- Infiniband

• Application

- Compiler Optimizations

- OpenMP and MPI over Infiniband

• Results

• Conclusions

Page 3: Parallel and Distributed Computing on Low Latency Clusters

Motivation

Page 4: Parallel and Distributed Computing on Low Latency Clusters

Motivation

• Scaling trend has to stop for CMOS technology:

✓ Direct-tunneling limit in SiO2 ~3 nm

✓ Distance between Si atoms ~0.3 nm

✓ Variability

• Fundamental reason: rising fab cost

Page 5: Parallel and Distributed Computing on Low Latency Clusters

Motivation

• Easy to build multi-core processors

• Human action is required to modify and adapt software for concurrency

• New classification for computer architectures

Page 6: Parallel and Distributed Computing on Low Latency Clusters

Classification

[Diagram: Flynn's taxonomy (SISD, SIMD, MISD, MIMD), each class shown as an instruction pool and a data pool feeding one or more CPUs]

Page 7: Parallel and Distributed Computing on Low Latency Clusters

[Diagram: abstraction levels, from algorithm to loop level to process management, with an axis indicating how easy each is to parallelize]

Page 8: Parallel and Distributed Computing on Low Latency Clusters

Levels

[Diagram: parallelization levels (algorithm, loop level, process management) annotated with their concerns: data dependency, branching overhead, control flow, recursion, memory management, profiling, SMP, multiprogramming, multithreading and scheduling]

Page 9: Parallel and Distributed Computing on Low Latency Clusters

Backfire

• Difficulty in fully exploiting the parallelism offered

• Automatic tools required to adapt software to parallelism

• Compiler support for manual or semi-automatic enhancement

Page 10: Parallel and Distributed Computing on Low Latency Clusters

Applications

• OpenMP and MPI are two popular tools that simplify the parallelization of both new and existing software

• Mathematics and Physics

• Computer Science

• Biomedicine

Page 11: Parallel and Distributed Computing on Low Latency Clusters

Specific Problem and Background

• Sally3D is a micromagnetism program suite for field analysis and modeling, developed at Politecnico di Torino (Department of Electrical Engineering)

• Computationally intensive (even days of CPU time); a speedup is required

• Previous work does not fully cover the problem (no Infiniband or OpenMP+MPI solutions)

Page 12: Parallel and Distributed Computing on Low Latency Clusters

Strategy

Page 13: Parallel and Distributed Computing on Low Latency Clusters

Strategy

• Install a Linux kernel with an ad-hoc configuration for scientific computation

• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)

• Add the Infiniband link between cluster nodes, with the proper drivers in kernel and user space

• Select an MPI implementation library

Page 14: Parallel and Distributed Computing on Low Latency Clusters

Strategy

• Verify Infiniband network through some MPI test examples

• Install the target software

• Proceed to include OpenMP and MPI directives in the code

• Run test cases

Page 15: Parallel and Distributed Computing on Low Latency Clusters

OpenMP

• standard

• supported by most modern compilers

• requires little knowledge of the target software

• very simple constructs

Page 16: Parallel and Distributed Computing on Low Latency Clusters

OpenMP - example

Page 17: Parallel and Distributed Computing on Low Latency Clusters

OpenMP - example

[Diagram: the example code is split into Parallel Task 1, Parallel Task 2, Parallel Task 3 and Parallel Task 4]

Page 18: Parallel and Distributed Computing on Low Latency Clusters

[Diagram: fork-join execution of the same example: the master thread forks Parallel Tasks 1-4 onto worker threads (Thread A and Thread B), which then join back into the master thread]
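The code on the original example slide is not preserved in the transcript; a minimal C sketch of the fork-join pattern shown in the diagram, with hypothetical task functions task1() through task4(), could look like this (compiled with, e.g., gcc -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical independent tasks, stand-ins for the work items
 * labelled "Parallel Task 1..4" in the diagram. */
static void task1(void) { printf("task 1 on thread %d\n", omp_get_thread_num()); }
static void task2(void) { printf("task 2 on thread %d\n", omp_get_thread_num()); }
static void task3(void) { printf("task 3 on thread %d\n", omp_get_thread_num()); }
static void task4(void) { printf("task 4 on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* The master thread forks a team of threads; each section runs on
     * some worker thread, and all threads join at the end of the block. */
    #pragma omp parallel sections
    {
        #pragma omp section
        task1();
        #pragma omp section
        task2();
        #pragma omp section
        task3();
        #pragma omp section
        task4();
    }
    return 0;
}
```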

Page 19: Parallel and Distributed Computing on Low Latency Clusters

OpenMP Scheduler

• Which scheduler is best suited to the available hardware? (a sketch of the schedule clause follows)

- Static

- Dynamic

- Guided
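The three policies are selected with the schedule clause on a parallel loop; a minimal C sketch, assuming placeholder arrays and an illustrative chunk size of 100:

```c
#define N 100000

void scale(double *a, double *b)
{
    int i;

    /* Static: iterations are split into fixed chunks of 100 and
     * assigned to threads up front. */
    #pragma omp parallel for schedule(static, 100)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Dynamic: chunks of 100 are handed out to threads on demand,
     * at the cost of extra scheduling overhead. */
    #pragma omp parallel for schedule(dynamic, 100)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Guided: chunk size starts large and shrinks towards 100. */
    #pragma omp parallel for schedule(guided, 100)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
}
```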

Page 20: Parallel and Distributed Computing on Low Latency Clusters

OpenMP Scheduler

[Chart: OpenMP Static Scheduler, runtime in microseconds (0 to 80,000) versus number of threads (1 to 16), one curve per chunk size: 1, 10, 100, 1000, 10000]

Page 21: Parallel and Distributed Computing on Low Latency Clusters

OpenMP Scheduler

[Chart: OpenMP Dynamic Scheduler, runtime in microseconds (0 to 117,000) versus number of threads (1 to 16), one curve per chunk size: 1, 10, 100, 1000, 10000]

Page 22: Parallel and Distributed Computing on Low Latency Clusters

OpenMP Scheduler

[Chart: OpenMP Guided Scheduler, runtime in microseconds (0 to 80,000) versus number of threads (1 to 16), one curve per chunk size: 1, 10, 100, 1000, 10000]

Page 23: Parallel and Distributed Computing on Low Latency Clusters

OpenMP Scheduler

Page 24: Parallel and Distributed Computing on Low Latency Clusters

OpenMP Scheduler

[Charts: static scheduler, dynamic scheduler and guided scheduler results side by side]

Page 25: Parallel and Distributed Computing on Low Latency Clusters

MPI

• standard

• widely used in cluster environments

• many transport links supported

• different implementations available (a minimal example follows the list)

- OpenMPI

- MVAPICH
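As a point of reference, a minimal MPI program in C using the basic send/receive pattern (the payload is a placeholder); it builds with mpicc and runs under mpiexec with either OpenMPI or MVAPICH:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Rank 0 sends a placeholder value to rank 1. */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 receives it. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 of %d received %d\n", size, value);
    }

    MPI_Finalize();
    return 0;
}
```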

Page 26: Parallel and Distributed Computing on Low Latency Clusters

Infiniband

• standard

• widely used in cluster environments

• very low latency for small packets

• up to 16 Gb/s transfer speed

Page 27: Parallel and Distributed Computing on Low Latency Clusters

MPI over Infiniband

[Chart: transfer time on a logarithmic scale from 1.0 µs to 10,000,000.0 µs versus message size from 1 kB to 16 GB, comparing OpenMPI and MVAPICH2]

Page 28: Parallel and Distributed Computing on Low Latency Clusters

MPI over Infiniband

[Chart: transfer time on a logarithmic scale from 1.00 µs to 10,000,000.00 µs versus message size from 1 kB to 8 MB, comparing OpenMPI and MVAPICH2]
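Curves of this kind are typically produced by a ping-pong test between two ranks; a minimal sketch (message size and repetition count are arbitrary, not the values used for the charts above):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, i;
    int bytes = 1024;                  /* message size under test */
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(bytes);

    /* Rank 0 and rank 1 bounce the buffer back and forth REPS times;
     * the one-way latency is half the average round-trip time. */
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n",
               bytes, (t1 - t0) / REPS / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```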

Page 29: Parallel and Distributed Computing on Low Latency Clusters

Optimizations

• Active at compile time

• Available only after porting the software to standard FORTRAN

• Consistent documentation available

• Unexpected positive results

Page 30: Parallel and Distributed Computing on Low Latency Clusters

Optimizations

• -march=native

• -O3

• -ffast-math

• -Wl,-O1

Page 31: Parallel and Distributed Computing on Low Latency Clusters

Target Software

Page 32: Parallel and Distributed Computing on Low Latency Clusters

Target Software

• Sally3D

• micromagnetic equation solver

• written in FORTRAN with some C libraries

• the program uses a linear formulation of the mathematical models

Page 33: Parallel and Distributed Computing on Low Latency Clusters

Implementation Scheme

[Diagram: the sequential loop of the standard programming model becomes an OpenMP parallel loop, and is then distributed via MPI across Host 1 and Host 2, each running its own OpenMP threads]

Page 34: Parallel and Distributed Computing on Low Latency Clusters

Implementation Scheme

• Data Structure: not embarrassingly parallel

• Three dimensional matrix

• Several temporary arrays; synchronization objects required:

➡ send() and recv() mechanism

➡ critical regions using OpenMP directives (a sketch of both mechanisms follows the list)

➡ function merging

➡ matrix conversion
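A minimal sketch of how the send()/recv() mechanism and an OpenMP critical region combine in a hybrid loop; the slice size, the accumulator and the per-element computation are hypothetical and not taken from Sally3D, and MPI is assumed to be already initialized by the caller:

```c
#include <mpi.h>

#define NLOC 1000   /* illustrative per-host slice of the 3D matrix, flattened */

/* Each host works on its own slice with OpenMP threads; partial results
 * are then exchanged between hosts with MPI send/recv. */
double process_slice(double *slice, int rank, int nprocs)
{
    double local_sum = 0.0;

    /* Loop-level parallelism on one host; the critical region protects
     * the shared accumulator (an OpenMP reduction clause would usually
     * be more efficient, but this mirrors the directives listed above). */
    #pragma omp parallel for
    for (int i = 0; i < NLOC; i++) {
        double v = slice[i] * slice[i];   /* placeholder computation */
        #pragma omp critical
        local_sum += v;
    }

    /* Distribution across hosts: every rank sends its partial result
     * to rank 0, which combines them. */
    if (rank != 0) {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < nprocs; src++) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            local_sum += other;
        }
    }
    return local_sum;   /* meaningful only on rank 0 */
}
```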

Page 35: Parallel and Distributed Computing on Low Latency Clusters

Results

Page 36: Parallel and Distributed Computing on Low Latency Clusters

Results

OMP   MPI   OPT   seconds
 *     *     *       133
 *     *     -       400
 *     -     *       186
 *     -     -       487
 -     *     *       200
 -     *     -       792
 -     -     *       246
 -     -     -      1062

Total Speed Increase: 87.52%

Page 37: Parallel and Distributed Computing on Low Latency Clusters

Actual Results

OMP   MPI   seconds
 *     *        59
 *     -       129
 -     *       174
 -     -       249

Function Name     Normal   OpenMP   MPI      OpenMP+MPI
calc_intmudua     24.5 s   4.7 s    14.4 s   2.8 s
calc_hdmg_tet     16.9 s   3.0 s    10.8 s   1.7 s
calc_mudua        12.1 s   1.9 s    7.0 s    1.1 s
campo_effettivo   17.7 s   4.5 s    9.9 s    2.3 s

Page 38: Parallel and Distributed Computing on Low Latency Clusters

Actual Results

Total Raw Speed Increment: 76%

• OpenMP – 6-8x

• MPI – 2x

• OpenMP + MPI – 14-16x

Page 39: Parallel and Distributed Computing on Low Latency Clusters

Conclusions

Page 40: Parallel and Distributed Computing on Low Latency Clusters

Conclusions and Future Work

• Computational time has been significantly decreased

• Speedup is consistent with expected results

• Submitted to COMPUMAG ‘09

• Continue inserting OpenMP and MPI directives

• Perform algorithm optimizations

• Increase cluster size