A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris

National Technical University of Athens
Dept. of Electrical and Computer Engineering
Computing Systems Laboratory

Overview

• Advanced architectures
• Tiling for parallelization
• Non-overlapping vs. overlapping scheme
• Vertical vs. hyperplane grouping
• Application on clusters of SMP nodes

TCP/IP over FastEthernet

• Use of the popular socket interface
• Create a socket descriptor sd, then read/write from/to descriptor sd

[Figure: nodes connected through a hub; write sends, read receives]

Example: Send

write(sd, buffer, length);

1) System call (CPU)
2) CPU copies data from user to kernel space
3) CPU adds protocol headers
4) CPU programs the DMA engine
5) DMA copies data to the NIC

[Figure: buffer of the given length passing from user mode to kernel mode, through the TCP, IP and ETH layers, to the Fast Ethernet NIC]
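The send path above can be exercised from user space with a few lines of C. This is a minimal sketch, not the authors' code: `send_all` and `demo_roundtrip` are hypothetical helper names, and a local `socketpair` stands in for a TCP connection over Fast Ethernet so that the example runs on a single machine.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `length` bytes to descriptor `sd`.
 * write() may accept fewer bytes than requested, so loop until done.
 * Returns 0 on success, -1 on error (errno is set by write()). */
int send_all(int sd, const void *buffer, size_t length)
{
    const char *p = buffer;
    assert(buffer != NULL || length == 0);
    while (length > 0) {
        ssize_t n = write(sd, p, length);   /* each call enters the kernel */
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        length -= (size_t)n;
    }
    return 0;
}

/* Send a short message through a local socket pair and read it back.
 * Returns 0 if the received bytes match the sent ones, -1 otherwise. */
int demo_roundtrip(void)
{
    int sv[2];
    const char msg[] = "tile boundary data";
    char got[sizeof msg];
    size_t have = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send_all(sv[0], msg, sizeof msg) != 0)
        return -1;
    while (have < sizeof got) {             /* read() may also return short */
        ssize_t n = read(sv[1], got + have, sizeof got - have);
        if (n <= 0)
            return -1;
        have += (size_t)n;
    }
    close(sv[0]);
    close(sv[1]);
    return memcmp(msg, got, sizeof msg) == 0 ? 0 : -1;
}
```

Every `write` here costs a system call and a user-to-kernel copy, which is the CPU overhead the following slides try to avoid.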

SCI

• What about Scalable Coherent Interface?
• Point-to-point, DSM approach

SCI DSM scheme

[Figure: an exported memory segment and an imported memory segment connected over SCI; labels: write 100, read, values 100 and 50]

SCI Zero Copy Scheme

[Figure: process VM area and Physical Memory]

Contiguous data in the process VM are not contiguous in Physical Memory

SCI Zero Copy Scheme

[Figure: process VM area is mapped to pinned-down memory in Physical Memory]

SCICreateSegment, SCIMapLocalSegment: mapping between virtual and contiguous physical memory

Data transfers

• Programmed I/O mode
  • CPU handles data transferring
  • “Lost” CPU cycles
• DMA mode
  • CPU programs the NIC’s buffers
  • Not blocked during transfer
  • Performs useful tasks

SCI DMA approach

No copying by CPU
• Data already contiguous in PM
• DMA engine copies data to network

No packetization
• Done in hardware
• But, init only by kernel

We need VIA

Nested For-Loops

for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    …
      for (in=ln; in<=un; in++)
      {
        Loop Body
      }

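Dependence Vectors

for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];

Each iteration (i1, i2) reads the results of iterations (i1-1, i2) and (i1, i2-1), so the dependence vectors are (1,0) and (0,1).

[Figure: iteration space with axes i1 and i2]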
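As written, the slide's loop reads one element before the start of the array at i1=0 and i2=0, so a runnable version needs boundary values. The sketch below is one way to do that, assuming a one-element halo filled with ones; the halo, the boundary value and the name `run_example` are illustrative choices, not taken from the slides.

```c
#include <assert.h>

#define N 8   /* iterations per dimension, as on the slide */

/* Row 0 and column 0 form a halo holding the boundary values, so
 * iteration (i1, i2) of the slide is stored at grid[i1 + 1][i2 + 1]. */
static long grid[N + 1][N + 1];

/* Run the loop nest and return the value of the last iteration.
 * Every point reads its neighbours at distance (1,0) and (0,1),
 * which are exactly the two dependence vectors. */
long run_example(void)
{
    for (int k = 0; k <= N; k++) {   /* assumed boundary condition: ones */
        grid[k][0] = 1;
        grid[0][k] = 1;
    }
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            grid[i1][i2] = grid[i1 - 1][i2] + grid[i1][i2 - 1];
    return grid[N][N];
}
```

Because both dependence vectors have non-negative components, any execution order that visits a point after its left and upper neighbours gives the same result; tiling relies on this.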

Tiling

[Figure: tiled iteration space with axes i1 and i2]

Tiling

[Figure: tiled iteration space with axes i1 and i2; tiles mapped to Processor 0 and Processor 1]
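To make the tiling transformation concrete, the sketch below runs the two-dimensional example loop once in its original order and once tile by tile, then checks that both orders produce the same array. The tile side `T`, the boundary values and all function names are assumptions made for illustration; the slides do not give a tile size.

```c
#include <assert.h>

#define N 8   /* iteration space is N x N, as in the dependence example */
#define T 4   /* tile side; a hypothetical choice that divides N */

static long ref[N + 1][N + 1];     /* result of the untiled loop */
static long tiled[N + 1][N + 1];   /* result of the tiled loop   */

/* Fill the halo (row 0 and column 0) with assumed boundary values. */
static void init(long a[N + 1][N + 1])
{
    for (int k = 0; k <= N; k++) { a[k][0] = 1; a[0][k] = 1; }
}

/* Original loop nest. */
static void run_untiled(long a[N + 1][N + 1])
{
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Same computation, traversed tile by tile. The two outer loops walk
 * the tile space; the two inner loops walk the points of one tile. */
static void run_tiled(long a[N + 1][N + 1])
{
    for (int t1 = 1; t1 <= N; t1 += T)
        for (int t2 = 1; t2 <= N; t2 += T)
            for (int i1 = t1; i1 < t1 + T; i1++)
                for (int i2 = t2; i2 < t2 + T; i2++)
                    a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Returns 1 if the tiled and untiled executions agree on every point. */
int tiling_preserves_result(void)
{
    init(ref);
    run_untiled(ref);
    init(tiled);
    run_tiled(tiled);
    for (int i1 = 0; i1 <= N; i1++)
        for (int i2 = 0; i2 <= N; i2++)
            if (ref[i1][i2] != tiled[i1][i2])
                return 0;
    return 1;
}
```

In a parallel execution each row or column of tiles would go to a different processor, and only the values on tile boundaries would have to be communicated.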

Non-Overlapping Scheme

[Figure: tiled iteration space with axes i1 and i2; tiles mapped to Processor 0, Processor 1 and Processor 2]

Non-Overlapping vs. Overlapping Scheme

[Figure: two schedules, each over processors P0, P1, P2 and P3]

Overlapping Scheme

[Figure: tiled iteration space with axes i1 and i2; tiles mapped to Processor 0, Processor 1 and Processor 2]

Generalization to SMPs

[Figure: processors P0, P1, P2 and P3]

Generalization to SMPs

[Figure: SMP nodes SMP0, SMP1, SMP2 and SMP3, each with CPU0 and CPU1]

Generalization to SMPs

[Figure: CPU0 and CPU1]

Vertical vs. Hyperplane grouping

[Figure: two mappings of CPU0 and CPU1 onto SMP nodes SMP0, SMP1, SMP2 and SMP3, one per grouping]

Example

[Figure: Tile Space and Group Space; groups mapped to SMP node0 and SMP node1]

Scheduling vector Π=(1,1)
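A linear scheduling vector assigns each point j of the space the time step Π·j, so all points on the same hyperplane run concurrently. The sketch below illustrates this for Π=(1,1); the function names and the 3 x 3 space used in the usage note are hypothetical and are not the dimensions of the slide's example.

```c
#include <assert.h>

/* Time step of point `j` of an n-dimensional space under the linear
 * schedule `pi`: the dot product pi . j. */
int time_step(const int *pi, const int *j, int n)
{
    int t = 0;
    assert(n > 0);
    for (int d = 0; d < n; d++)
        t += pi[d] * j[d];
    return t;
}

/* Number of points of a rows x cols space that execute at time step t
 * under the scheduling vector (1,1). */
int points_at_step(int rows, int cols, int t)
{
    const int pi[2] = {1, 1};
    int count = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            const int j[2] = {r, c};
            if (time_step(pi, j, 2) == t)
                count++;
        }
    return count;
}
```

For a 3 x 3 space, `points_at_step(3, 3, 2)` counts the points (0,2), (1,1) and (2,0), which lie on the same anti-diagonal and can therefore run in the same step.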

Non-overlapping vs. Overlapping scheme

In the overlapping scheme:
• Each execution step takes almost half as long
• Slightly more steps are needed

[Figure: two schedules, each over processors P0, P1, P2 and P3]

• Non-overlapping scheme: 9 computation + 8 communication steps
• Overlapping scheme: 12 steps
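The step counts on this slide can be reproduced with a small model. The sketch below assumes a pipeline of `procs` processors that each execute `tiles` tiles in sequence. Four processors matches the labels P0 to P3, while six tiles per processor is inferred from the counts and is not stated on the slide. The overlapping formula is one schedule consistent with the slide's 12 steps, in which neighbouring processors start two steps apart; all function names are hypothetical.

```c
#include <assert.h>

/* Non-overlapping scheme: tile k of processor p runs at step p + k
 * (scheduling vector (1,1)), giving procs + tiles - 1 computation steps. */
int nonoverlap_compute_steps(int procs, int tiles)
{
    assert(procs >= 1 && tiles >= 1);
    return procs + tiles - 1;
}

/* One communication step sits between consecutive computation steps. */
int nonoverlap_comm_steps(int procs, int tiles)
{
    return nonoverlap_compute_steps(procs, tiles) - 1;
}

/* Overlapping scheme (assumed model): results are sent during the step
 * after they are computed, so tile k of processor p runs at step 2*p + k. */
int overlap_steps(int procs, int tiles)
{
    assert(procs >= 1 && tiles >= 1);
    return 2 * (procs - 1) + tiles;
}

/* Total time when computation and communication steps alternate. */
double nonoverlap_time(int procs, int tiles, double t_comp, double t_comm)
{
    return nonoverlap_compute_steps(procs, tiles) * t_comp
         + nonoverlap_comm_steps(procs, tiles) * t_comm;
}

/* Total time when each step overlaps computation with communication;
 * idealized so that a step lasts as long as the slower of the two. */
double overlap_time(int procs, int tiles, double t_comp, double t_comm)
{
    double step = (t_comp > t_comm) ? t_comp : t_comm;
    return overlap_steps(procs, tiles) * step;
}
```

With 4 processors and 6 tiles the model gives 9 computation plus 8 communication steps against 12 overlapped steps. If computation and communication cost one time unit each, that is 17 units against 12, which is the trade-off the slide describes: more steps, but each one shorter.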

Vertical vs. Hyperplane Grouping

• Slower pipeline filling
• Faster execution because of lack of intra-tile synchronization
• Preferable for tile spaces where the mapping direction is comparatively large

[Figure: two mappings of CPU0 and CPU1 onto SMP nodes SMP0, SMP1, SMP2 and SMP3, one per grouping]

Experimental Platform

• Linux SMP (Symmetric Multi-Processor) cluster
• 8 nodes (128MB RAM, 2 Pentium III 800MHz)
• SCI ring (Dolphin’s PCI-SCI D330 cards)

Initial Code

for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }
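The sketch below is a runnable, sequential version of this loop nest with the k dimension cut into tiles of height `V`, assuming that tile height is measured along k. The body of `func`, the array sizes and the boundary values are placeholders chosen for illustration, since the slides do not give them; `run_k_tiled` is a hypothetical name.

```c
#include <assert.h>

enum { X = 4, Y = 4, Z = 64 };   /* small hypothetical sizes */

/* Placeholder for the slide's func(); the real function is not given. */
static long func(long a, long b, long c)
{
    return (a + b + c) % 1000003L;
}

/* Index 0 in each dimension is a halo holding assumed boundary values. */
static long A3[X + 1][Y + 1][Z + 1];

/* Run the nest with the k dimension split into tiles of height V
 * (V = Z gives the original, untiled order). Returns A3[X][Y][Z]. */
long run_k_tiled(int V)
{
    assert(V >= 1);
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                A3[i][j][k] = 1;

    for (int kk = 1; kk <= Z; kk += V) {            /* walk the tiles */
        int kend = (kk + V - 1 < Z) ? kk + V - 1 : Z;
        for (int i = 1; i <= X; i++)
            for (int j = 1; j <= Y; j++)
                for (int k = kk; k <= kend; k++)    /* points of one tile */
                    A3[i][j][k] = func(A3[i - 1][j][k],
                                       A3[i][j - 1][k],
                                       A3[i][j][k - 1]);
    }
    return A3[X][Y][Z];
}
```

Every tile height produces the same values because all three dependences point in non-negative directions. In a parallel run, the choice of `V` sets how much data is exchanged per step and how many steps the pipeline needs.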

Experimental results

[Figure: two plots of Time (sec) against Tile Height, one for Iteration Space 16x16x1024K and one for Iteration Space 48x48x512K. Curves: non-overlapping scheme with vertical grouping, overlapping scheme with vertical grouping, non-overlapping scheme with hyperplane grouping, overlapping scheme with hyperplane grouping]

Grouping matrix

[Equation: definition of the grouping matrix H^G]

m_1 ⋯ m_(i-1) m_(i+1) ⋯ m_n = number of CPUs within an SMP node

Example

[Figure: Tile Space and Group Space; groups mapped to SMP node0 and SMP node1]

[Equation: matrices H^G and P^G for this example]

Scheduling vector Π=(1,1)
