
A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris

National Technical University of Athens

Dept. of Electrical and Computer Engineering

Computing Systems Laboratory


Overview

• Advanced Architectures
• Tiling for parallelization
• Non-overlapping vs. Overlapping scheme
• Vertical vs. hyperplane grouping
• Application on clusters of SMPs

TCP/IP over FastEthernet

Use of the popular Socket Interface: create a socket descriptor sd, then read/write from/to the descriptor (write = send, read = receive).

Example: Send

write(sd, buffer, length);

1) System call (CPU switches from user to kernel mode)
2) CPU copies data from user to kernel space
3) CPU adds protocol headers (IP, ETH)
4) CPU programs the DMA engine
5) DMA copies data to the NIC
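The send path above starts from a single write() call on a socket descriptor. A minimal sketch of that user-side call follows, under stated assumptions: it uses a local UNIX-domain socket pair created with socketpair() instead of TCP/IP over FastEthernet, so it only illustrates the user-space side; the helper names send_all, recv_all and socket_roundtrip are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `length` bytes to descriptor `sd`, retrying on short
 * writes and EINTR. Returns 0 on success, -1 on error. */
int send_all(int sd, const char *buffer, size_t length)
{
    size_t done = 0;
    while (done < length) {
        /* Each write() is a system call: the kernel copies the data
         * out of the user buffer before the call returns. */
        ssize_t n = write(sd, buffer + done, length - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t)n;
    }
    return 0;
}

/* Read exactly `length` bytes from `sd`. Returns 0 on success, -1 on
 * error or if the peer closes the connection early. */
int recv_all(int sd, char *buffer, size_t length)
{
    size_t done = 0;
    while (done < length) {
        ssize_t n = read(sd, buffer + done, length - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Hypothetical demo: send `msg` (including its terminating NUL) through
 * a local socket pair. Returns 0 if the received bytes match, 1 if they
 * differ, -1 on a socket error. Messages must fit in 256 bytes. */
int socket_roundtrip(const char *msg)
{
    int sd[2];
    char reply[256];
    size_t length = strlen(msg) + 1;

    assert(length <= sizeof reply);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sd) != 0)
        return -1;

    int rc = send_all(sd[0], msg, length);
    if (rc == 0)
        rc = recv_all(sd[1], reply, length);
    close(sd[0]);
    close(sd[1]);
    if (rc != 0)
        return -1;
    return memcmp(msg, reply, length) == 0 ? 0 : 1;
}
```

Calling socket_roundtrip("tile boundary data") exercises one send and one receive. On a real TCP socket the same write() would also trigger the header, DMA and NIC steps listed above inside the kernel.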

SCI

What about the Scalable Coherent Interface? Point-to-point, DSM approach

SCI DSM scheme

[Figure labels: exported memory segment, imported memory segment, write 100]

SCI Zero Copy Scheme

Contiguous data in the process VM area is not contiguous in Physical Memory.

[Figure: the process VM area is mapped to pinned-down memory in Physical Memory]

SCICreateSegment, SCIMapLocalSegment: mapping between Virtual and contiguous Physical Memory

Data transfers

• Programmed I/O mode: the CPU handles the data transfer ("lost" CPU cycles)
• DMA mode: the CPU programs the NIC's buffers, is not blocked during the transfer, and performs useful tasks

SCI DMA approach

• No copying by CPU
• Data already contiguous in PM
• DMA engine copies data to network
• No packetization: done in hardware
• But, init only by kernel

We need VIA

Nested For-Loops

for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    ...
      for (in=ln; in<=un; in++)
        Loop Body

Dependence Vectors

for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];

Dependence vectors of this loop: (1,0) and (0,1)

Tiling

[Figure: tiled iteration space, with tiles assigned to Processor 0 and Processor 1]
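As a minimal sketch of what tiling does to the 8x8 example loop, the code below runs the recurrence once in the original order and once tile by tile, then compares the results. Several details are assumptions, not taken from the slides: the rectangular tile shape t1 x t2, the boundary values of 1, the shifted array indexing, and all function names. Rectangular tiles are legal here because both dependence vectors, (1,0) and (0,1), have only non-negative components.

```c
#include <assert.h>
#include <string.h>

#define N 8 /* the iteration space is 8x8, as in the example loop */

/* Row 0 and column 0 hold boundary values, so iteration (i1, i2)
 * updates array cell (i1 + 1, i2 + 1). */
typedef long grid_t[N + 1][N + 1];

/* Zero the grid and set the assumed boundary values to 1. */
void init_grid(grid_t A)
{
    memset(A, 0, sizeof(grid_t));
    for (int i = 0; i <= N; i++) {
        A[i][0] = 1;
        A[0][i] = 1;
    }
}

/* Original loop order. */
void run_untiled(grid_t A)
{
    for (int i1 = 0; i1 < N; i1++)
        for (int i2 = 0; i2 < N; i2++)
            A[i1 + 1][i2 + 1] = A[i1][i2 + 1] + A[i1 + 1][i2];
}

/* Tiled order: the two outer loops walk over tiles, the two inner
 * loops walk over the iterations inside one t1 x t2 tile. */
void run_tiled(grid_t A, int t1, int t2)
{
    assert(t1 > 0 && t2 > 0);
    for (int b1 = 0; b1 < N; b1 += t1)
        for (int b2 = 0; b2 < N; b2 += t2)
            for (int i1 = b1; i1 < b1 + t1 && i1 < N; i1++)
                for (int i2 = b2; i2 < b2 + t2 && i2 < N; i2++)
                    A[i1 + 1][i2 + 1] = A[i1][i2 + 1] + A[i1 + 1][i2];
}

/* Returns 1 if the tiled execution produces the same array as the
 * untiled one, 0 otherwise. */
int tiled_matches_untiled(int t1, int t2)
{
    grid_t a, b;
    init_grid(a);
    init_grid(b);
    run_untiled(a);
    run_tiled(b, t1, t2);
    return memcmp(a, b, sizeof a) == 0;
}
```

In a parallel run, each column of tiles would go to a different processor, and the values on tile edges are what has to be communicated between neighbours.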

Non-Overlapping Scheme

[Figure: execution timeline for Processor 0, Processor 1 and Processor 2]

Non-Overlapping vs. Overlapping Scheme

Overlapping Scheme

[Figure: execution timeline for Processor 0, Processor 1 and Processor 2]

Generalization to SMPs

Vertical vs. Hyperplane grouping

Example

[Figure: Tile Space and Group Space, with groups assigned to SMP node0 and SMP node1]

Scheduling vector Π=(1,1)

Non-overlapping vs. Overlapping scheme

In the overlapping scheme each execution step lasts almost half as long, at the cost of slightly more steps.

• Non-overlapping scheme: 9 computation + 8 communication steps
• Overlapping scheme: 12 steps
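The step counts above can be reproduced with a simple linear-schedule model. This is a sketch under assumptions the slides do not state: a two-dimensional tile space of 4 processors by 6 tiles per processor, a schedule of (1,1) for the non-overlapping scheme and (2,1) for the overlapping scheme, and equal per-step computation and communication times. The function names are hypothetical.

```c
#include <assert.h>

/* Number of time steps of a linear schedule (pi_p, pi_t) over a tile
 * space of `procs` x `tiles` tiles: tile (p, t) runs at step
 * pi_p * p + pi_t * t, so the total is the last step index plus one. */
int schedule_steps(int pi_p, int pi_t, int procs, int tiles)
{
    assert(procs > 0 && tiles > 0);
    return pi_p * (procs - 1) + pi_t * (tiles - 1) + 1;
}

/* Non-overlapping scheme: computation steps alternate with
 * communication steps, so S computation steps need S - 1
 * communication steps in between. */
double nonoverlap_time(int procs, int tiles, double t_comp, double t_comm)
{
    int steps = schedule_steps(1, 1, procs, tiles);
    return steps * t_comp + (steps - 1) * t_comm;
}

/* Overlapping scheme (assumed schedule (2,1)): computation and
 * communication proceed concurrently, so one step lasts as long as
 * the slower of the two. */
double overlap_time(int procs, int tiles, double t_comp, double t_comm)
{
    int steps = schedule_steps(2, 1, procs, tiles);
    double step = t_comp > t_comm ? t_comp : t_comm;
    return steps * step;
}
```

With these assumptions, schedule_steps(1, 1, 4, 6) gives 9 computation steps (plus 8 communication steps) and schedule_steps(2, 1, 4, 6) gives 12 steps. With unit costs the model times are 17 versus 12, which matches the pattern of shorter steps but slightly more of them.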

Vertical vs. Hyperplane Grouping

• Slower pipeline filling
• Faster execution because of the lack of intra-tile synchronization
• Preferable for Tile Spaces where the mapping direction is comparatively large

Experimental Platform

• Linux SMP (Symmetric Multi-Processor) cluster
• 8 nodes: 128MB RAM, 2 Pentium III 800MHz
• SCI ring (Dolphin's PCI-SCI D330 cards)

Initial Code

for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }
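A minimal sketch of how this loop can be cut into tiles of height V along the k dimension follows. It is an illustration, not the authors' implementation: the sizes X, Y, Z are small hypothetical values, func is a hypothetical stand-in (the slides do not define it), boundary cells are assumed to be zero, and the code runs sequentially on one CPU with no communication.

```c
#include <assert.h>
#include <stdlib.h>

enum { X = 4, Y = 4, Z = 64 }; /* small hypothetical sizes */

/* Flat index into an (X+1) x (Y+1) x (Z+1) array; index 0 in each
 * dimension is a boundary cell. */
size_t idx(int i, int j, int k)
{
    return ((size_t)i * (Y + 1) + (size_t)j) * (Z + 1) + (size_t)k;
}

/* Hypothetical stand-in for func; unsigned arithmetic wraps around. */
unsigned func(unsigned a, unsigned b, unsigned c)
{
    return a + 2u * b + 3u * c + 1u;
}

/* Execute the loop body for all (i, j) and for k in [k_lo, k_hi]. */
void sweep(unsigned *A, int k_lo, int k_hi)
{
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = k_lo; k <= k_hi; k++)
                A[idx(i, j, k)] = func(A[idx(i - 1, j, k)],
                                       A[idx(i, j - 1, k)],
                                       A[idx(i, j, k - 1)]);
}

/* Run the whole iteration space in tiles of height V along k and
 * return the value of the last cell as a checksum (0 if allocation
 * fails). */
unsigned run_with_tile_height(int V)
{
    assert(V > 0);
    unsigned *A = calloc((size_t)(X + 1) * (Y + 1) * (Z + 1), sizeof *A);
    if (A == NULL)
        return 0;
    for (int k_lo = 1; k_lo <= Z; k_lo += V) {
        int k_hi = k_lo + V - 1;
        if (k_hi > Z)
            k_hi = Z;
        sweep(A, k_lo, k_hi);
    }
    unsigned result = A[idx(X, Y, Z)];
    free(A);
    return result;
}
```

Calling run_with_tile_height with V equal to Z is the untiled loop, and any other V should return the same checksum because all three dependences point in non-negative directions. In a parallel run, V sets how much computation is done between two communication steps.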

Experimental results

[Figure: two plots against Tile Height (axis range 0 to 35000), one for Iteration Space 16x16x1024K and one for Iteration Space 48x48x512K. Curves: Non-overlapping scheme (vertical grouping), Overlapping scheme (vertical grouping), Non-overlapping scheme (hyperplane grouping), Overlapping scheme (hyperplane grouping)]

Grouping matrix

$m_1 \cdots m_{i-1}\, m_{i+1} \cdots m_n$ = number of CPUs within an SMP node

Example

[Figure: Tile Space and Group Space, with groups assigned to SMP node0 and SMP node1]

111GGG HPH

Scheduling vector Π=(1,1)
