An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Eduard Ayguadé (2), Rosa M. Badia (2), Francisco Igual (1), Jesús Labarta (2), Rafael Mayo (1), Enrique S. Quintana-Ortí (1)

(1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain)
(2) Barcelona Supercomputing Center - Centro Nacional de Supercomputación, Barcelona (Spain)

Post on 09-Jun-2021

Motivation

Motivation (I)

The past: emergence of hardware accelerators

- Hardware accelerators (especially GPUs) have become a real solution for HPC
- Hardware: manycore systems on chip (up to 240 cores on modern Nvidia GPUs)
- Software: problem solved with high-level programming and execution models (e.g. Nvidia CUDA, Brook+, OpenCL)

Motivation (II)

The present: heterogeneous multi-accelerator systems

- One accelerator is not always enough for many applications
- Different accelerators adapted to specific applications
- Multi-accelerator systems are the next step
- Hardware: Nvidia Tesla series, multiple ClearSpeed boards per system, hybrid architectures, ...
- Software: the problem is not solved yet:
  - Big code modifications from sequential code
  - Manual scheduling
  - The user has to know the best accelerator for each part of the application

Motivation (III)

The future: heterogeneous multi-accelerator systems (on-chip)

- Number of cores is increasing
- The programmability problem must be addressed as soon as possible
- Hardware: Larrabee, AMD Fusion, ...
- Software: will determine the success or failure of novel architectures

Outline

1 Introduction

2 The StarSs programming model. New extensions

3 The GPUSs framework

4 Experimental results

5 Conclusions and future work

Introduction

Introduction. StarSs

- The StarSs programming model addresses the programmability problem by exploiting task-level parallelism
- It consists of:
  - A few OpenMP-like pragmas identifying tasks in the user code
  - A source-to-source compiler
  - A runtime system adapted to the underlying architecture
- Many instantiations of StarSs have been developed: CellSs, SMPSs, GridSs
- Each instantiation targets one specific architecture

Introduction. GPUSs

Our proposal: GPUSs

GPUSs: Instantiation of the StarSs programming model targeting heterogeneous multi-accelerator platforms

1 Heterogeneity: The target architecture is a heterogeneous multi-accelerator system

2 Separate memory spaces: The user does not have to deal with separate memory spaces for each accelerator

3 Simplicity: It adds few pragmas to the sequential user code to port it to the multi-accelerator system

4 Portability: It can be easily ported to other similar architectures based on multiple accelerators

The StarSs programming model. New extensions

The StarSs programming model

- Automatic parallelization of sequential applications
- Runtime system: efficient use of the available resources (e.g. GPUs) in parallel
- The user annotates the application: pieces of code that will be executed on a GPU (tasks)
- Runtime extracts parallelism building a data dependency graph

[Figure: the user turns the user code into annotated code; the runtime builds the task dependency graph (TDG) from it and maps the tasks onto the Tesla system]

User code:

void task1( float * A ){ ... }

Annotated code:

#pragma css task
void task1( float * A ){ ... }

Proposed extensions

Extensions to the StarSs programming model

GPUSs provides OpenMP-like constructs to annotate code:

- To identify a unit of work, or task: pragma css task
- To select the execution device: pragma css target device

Defining tasks: the task clause

Taskifying functions

#pragma css task [clause_list]
{ function-header | function-definition }

- The task clause denotes a function that is always executed as a task.
- Whenever the program calls a function annotated in this way, the runtime will create an explicit task.

Identifying the directionality of the arguments

#pragma css task input(parameter) | output(parameter) | inout(parameter)
{ function-header | function-definition }

- The input, output and inout clauses denote the directionality of each argument.
- Used by the runtime to track dependencies among tasks and manage data transfers.
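How directionality can translate into task dependencies may be easier to see in code. The following is a minimal sketch in plain C, not the GPUSs implementation: the names `direction`, `access` and `must_wait` are hypothetical, and the rule used (two tasks are ordered if they touch the same block and at least one of them writes it) is the conservative textbook rule. A real runtime may be able to relax some of these orderings.

```c
#include <assert.h>
#include <stddef.h>

/* Direction of one task argument, mirroring the input/output/inout clauses. */
typedef enum { DIR_INPUT, DIR_OUTPUT, DIR_INOUT } direction;

/* One argument access: which data block is touched and how (hypothetical type). */
typedef struct {
    const void *block;
    direction   dir;
} access;

/* An output or inout argument is written by the task. */
static int writes(direction d) { return d == DIR_OUTPUT || d == DIR_INOUT; }

/* Returns 1 if the later task must wait for the earlier one: both touch the
 * same block and at least one of them writes it. Returns 0 otherwise. */
int must_wait(const access *earlier, size_t n_earlier,
              const access *later, size_t n_later)
{
    assert(earlier != NULL && later != NULL);
    for (size_t i = 0; i < n_earlier; i++)
        for (size_t j = 0; j < n_later; j++)
            if (earlier[i].block == later[j].block &&
                (writes(earlier[i].dir) || writes(later[j].dir)))
                return 1;
    return 0;
}
```

Two tasks that only read a shared block are left unordered, which is what lets independent tasks run on different devices at the same time.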

Specifying target devices: the target clause

Specifying target devices

#pragma css target device(device-name-list) [clause-list]
{ function-header | function-definition }

- The target construct specifies that the execution of a task can be offloaded to a given device.
- The target device is specified in device-name-list.
- When a task becomes ready, the runtime can choose among the available targets to decide where to execute the task.

Managing heterogeneity: the implements clause

The implements clause is used to specify alternative implementations for a function.

Example

#pragma css task
void matmul( float *A, float *B, float *C );

#pragma css target device(cuda) implements(matmul)
void matmul_cuda( float *A, float *B, float *C ) {
    // tuned version for a CUDA-compatible device
}

#pragma css target device(smp) implements(matmul)
void matmul_smp( float *A, float *B, float *C ) {
    // tuned version for an SMP device
}
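One way to picture what a runtime can do with several implementations of the same task is a per-device table of function pointers. The sketch below is an assumption made for illustration, not the GPUSs code: `task_versions`, `choose_device` and the two stub functions are hypothetical names, and the policy (prefer an idle CUDA device, fall back to SMP) is only one possible choice.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEV_SMP, DEV_CUDA, DEV_COUNT } device_kind;
typedef void (*matmul_fn)(float *A, float *B, float *C);

/* Hypothetical registry: one slot per device kind, NULL if no version exists. */
typedef struct { matmul_fn impl[DEV_COUNT]; } task_versions;

/* Stand-ins for the tuned versions; each only marks which one ran. */
void matmul_cuda_stub(float *A, float *B, float *C) { (void)A; (void)B; C[0] = 1.0f; }
void matmul_smp_stub(float *A, float *B, float *C)  { (void)A; (void)B; C[0] = 2.0f; }

/* Returns the device kind to use for a ready task, or DEV_COUNT if no idle
 * device has an implementation. Scans from the accelerator down to the CPU. */
device_kind choose_device(const task_versions *t, const int idle[DEV_COUNT])
{
    assert(t != NULL && idle != NULL);
    for (int d = DEV_COUNT - 1; d >= 0; d--)
        if (idle[d] && t->impl[d] != NULL)
            return (device_kind)d;
    return DEV_COUNT;
}
```

With both versions registered, a ready task goes to the CUDA version when a GPU is idle and to the SMP version when only a CPU core is free.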

Example: the matrix-matrix multiplication

Parallelizing the matrix-matrix multiplication

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
#pragma css target device(cuda)
void matmul( float *A, float *B, float *C ) {
    // tuned CUDA code for the matmul
}

float *A[NB][NB], *B[NB][NB], *C[NB][NB];

int main( void ) {
    for ( int i = 0; i < NB; i++ )
        for ( int j = 0; j < NB; j++ )
            for ( int k = 0; k < NB; k++ )
                matmul( A[i][k], B[k][j], C[i][j] );
}
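To make the block structure concrete, here is a small sequential stand-in written in plain C. It is only a sketch: the tuned CUDA kernel is replaced by a naive loop, the block size `BS` and block count `NB` are tiny values chosen for illustration, and blocks are assumed to be stored row-major. The driver returns the number of task invocations, which for this loop nest is NB^3.

```c
#include <assert.h>

enum { BS = 2, NB = 2 };  /* illustrative sizes, not taken from the slides */

/* Body of one task: C += A * B on BS x BS row-major blocks. */
void matmul_block(const float *A, const float *B, float *C)
{
    assert(A != 0 && B != 0 && C != 0);
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i * BS + j] += A[i * BS + k] * B[k * BS + j];
}

/* Same triple loop over blocks as the annotated program; each call to
 * matmul_block corresponds to one task. Returns the number of tasks. */
int matmul_blocked(float A[NB][NB][BS * BS], float B[NB][NB][BS * BS],
                   float C[NB][NB][BS * BS])
{
    int tasks = 0;
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++) {
                matmul_block(A[i][k], B[k][j], C[i][j]);
                tasks++;
            }
    return tasks;
}
```

For a fixed block C[i][j], the NB calls along k all update the same inout argument and so form a chain, while calls that update different C blocks are independent of each other.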

Example: the Cholesky factorization

The Cholesky factorization of a dense SPD matrix $A \in \mathbb{R}^{n \times n}$ is defined as

$$A = L L^T$$

where $L \in \mathbb{R}^{n \times n}$ is a lower triangular matrix.

Blocked algorithm:

- Chol_spotrf
- Chol_trsm
- Chol_upd: Chol_gemm and Chol_syrk
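The three steps of the blocked algorithm follow from partitioning $A = L L^T$ into 2 x 2 blocks. The derivation below uses standard notation that is not taken from the slides; the block names $A_{11}$, $A_{21}$, $A_{22}$ are introduced here only for illustration.

```latex
\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\quad\Longrightarrow\quad
\begin{aligned}
A_{11} &= L_{11} L_{11}^T                   && \text{(factorize the diagonal block)} \\
L_{21} &= A_{21} L_{11}^{-T}                && \text{(triangular solve)} \\
A_{22} - L_{21} L_{21}^T &= L_{22} L_{22}^T && \text{(update, then repeat on the trailing block)}
\end{aligned}
```

The first line corresponds to the factorization of the diagonal block, the second to the triangular solves, and the third to the trailing update, whose off-diagonal blocks are matrix products and whose diagonal blocks are symmetric rank-k updates.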

Sequential Cholesky factorization

void Cholesky( float **A, int ts, int nt ) {
    for ( int k = 0; k < nt; k++ ) {
        chol_spotrf( A[k*nt+k], ts );                 // Factorize diagonal block

        for ( int i = k+1; i < nt; i++ )              // Triangular solves
            chol_strsm( A[k*nt+k], A[k*nt+i], ts );

        // Update trailing submatrix
        for ( int i = k+1; i < nt; i++ ) {
            for ( int j = k+1; j < i; j++ )
                chol_sgemm( A[k*nt+i], A[k*nt+j], A[j*nt+i], ts );
            chol_ssyrk( A[k*nt+i], A[i*nt+i], ts );
        }
    }
}

int main( void ) {
    float *A[nt][nt];
    ...
    // Compute the Cholesky factor
    Cholesky( &A[0][0], ts, nt );
}
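The loop nest can be exercised without BLAS or LAPACK by swapping in naive reference kernels. The sketch below does that under several assumptions that are not in the slides: tiles are ts x ts and stored row-major, the tile in block-row i and block-column k is `A[k*nt + i]` (matching the indexing of the listing), and all function names (`ref_potrf`, `ref_trsm`, `ref_gemm_nt`, `tiled_cholesky`, `sqrt_newton`, `approx_eq`) are hypothetical. The square root uses a Newton iteration only to keep the example free of library dependencies.

```c
#include <assert.h>

/* Square root by Newton iteration, so the sketch needs no math library. */
static float sqrt_newton(float x)
{
    float r = x > 1.0f ? x : 1.0f;
    for (int it = 0; it < 40; it++)
        r = 0.5f * (r + x / r);
    return r;
}

/* Comparison helper with an absolute tolerance, for checking results. */
int approx_eq(float a, float b)
{
    float d = a - b;
    return d < 1e-4f && d > -1e-4f;
}

/* In-place Cholesky of one SPD tile; the factor ends up in the lower triangle. */
void ref_potrf(float *a, int ts)
{
    for (int j = 0; j < ts; j++) {
        for (int p = 0; p < j; p++)
            a[j * ts + j] -= a[j * ts + p] * a[j * ts + p];
        assert(a[j * ts + j] > 0.0f);   /* fails if the tile is not SPD */
        a[j * ts + j] = sqrt_newton(a[j * ts + j]);
        for (int i = j + 1; i < ts; i++) {
            for (int p = 0; p < j; p++)
                a[i * ts + j] -= a[i * ts + p] * a[j * ts + p];
            a[i * ts + j] /= a[j * ts + j];
        }
    }
}

/* B := B * L^{-T}, where L is the lower triangle of tile l. */
void ref_trsm(const float *l, float *b, int ts)
{
    for (int r = 0; r < ts; r++)
        for (int c = 0; c < ts; c++) {
            for (int p = 0; p < c; p++)
                b[r * ts + c] -= b[r * ts + p] * l[c * ts + p];
            b[r * ts + c] /= l[c * ts + c];
        }
}

/* C := C - A * B^T; with b == a this is the symmetric rank-k update. */
void ref_gemm_nt(const float *a, const float *b, float *c, int ts)
{
    for (int r = 0; r < ts; r++)
        for (int q = 0; q < ts; q++)
            for (int p = 0; p < ts; p++)
                c[r * ts + q] -= a[r * ts + p] * b[q * ts + p];
}

/* Same loop structure as the sequential listing, with reference kernels. */
void tiled_cholesky(float **A, int ts, int nt)
{
    for (int k = 0; k < nt; k++) {
        ref_potrf(A[k * nt + k], ts);
        for (int i = k + 1; i < nt; i++)
            ref_trsm(A[k * nt + k], A[k * nt + i], ts);
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++)
                ref_gemm_nt(A[k * nt + i], A[k * nt + j], A[j * nt + i], ts);
            ref_gemm_nt(A[k * nt + i], A[k * nt + i], A[i * nt + i], ts);
        }
    }
}
```

Only the lower-triangular tiles are read and written, so the strictly upper tiles of the input can hold anything. The strictly upper part of each diagonal tile is left untouched and should be ignored when reading back the factor.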

Taskifying the Cholesky factorization

Each block function can be converted into a task:

spotrf task

#pragma css task inout(A[NT][NT])
void chol_spotrf( float *A ) {
    spotrf( "Lower", &ts, A, &ts, &info );
}

sgemm task

#pragma css task input(A[NT][NT], B[NT][NT]) inout(C[NT][NT])
void chol_sgemm( float *A, float *B, float *C ) {
    sgemm( "N", "T", &ts, &ts, &ts,
           -1.0, A, &ts, B, &ts,
            1.0, C, &ts );
}

ssyrk task

#pragma css task input(A[NT][NT]) inout(C[NT][NT])
void chol_ssyrk( float *A, float *C ) {
    ssyrk( "L", "N", &ts, &ts,
           -1.0, A, &ts,
            1.0, C, &ts );
}

strsm task

#pragma css task input(T[NT][NT]) inout(B[NT][NT])
void chol_strsm( float *T, float *B ) {
    strsm( "R", "L", "T", "N",
           &ts, &ts, 1.0, T, &ts,
           B, &ts );
}

Specifying the target device for each task

- By default, each task is executed on the SMP device unless the target clause is given.
- Example: chol_spotrf can be executed on a CUDA-capable device:

spotrf task on a CUDA-capable device

#pragma css task inout(A[NT][NT]) target device(cuda)
void chol_spotrf( float *A ) {
    // CUDA kernel for
    // the Cholesky factorization
}

Specifying multiple implementations for each task

Multiple implementations of chol_spotrf can be given:

spotrf task with CUDA and SMP implementations

#pragma css task inout(A[NT][NT])
void chol_spotrf( float *A );

#pragma css task inout(A[NT][NT]) target device(cuda) implements(chol_spotrf)
void chol_spotrf_cuda( float *A ) {
    // CUDA kernel for
    // the Cholesky factorization
}

#pragma css task inout(A[NT][NT]) target device(smp) implements(chol_spotrf)
void chol_spotrf_smp( float *A ) {
    // SMP routine for
    // the Cholesky factorization
}

The GPUSs framework

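The target architecture

A typical multi-accelerator system

[Figure: a host with main memory, connected through a PCIExpress bus to several devices, each with its own device memory]

- Host with main memory
- Devices with local memory
- Communication through PCIExpress
- No direct device-device communication
- Communication through main memory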

The GPUSs runtime. Overview

- Many features inherited from the CellSs and SMPSs runtimes
- Two main modules:
  1 Execution of the annotated user code, task generation and scheduling
  2 Data movements and task execution

The GPUSs runtime. Structure

1 A master thread:
  - Executes the user code
  - Intercepts calls to annotated functions
  - Generates tasks
  - Inserts them in a Task Dependency Graph (TDG)

2 A helper thread:
  - Consumes tasks from the TDG as the GPUs become idle
  - Maps tasks to the most suitable device
  - Intercepts finalization signals from the worker threads

3 A set of worker threads:
  - Wait for available tasks
  - Perform the necessary data transfers from RAM to GPU
  - Invoke the task call on the GPU
  - Retrieve the results (if necessary)
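The interplay of the three roles can be sketched with a dependency graph in which every task counts its unfinished predecessors. The code below is a single-threaded simulation written for illustration, not the GPUSs runtime: the types `tdg` and `tdg_node`, the fixed capacities, and the functions `tdg_add_task`, `tdg_add_dep` and `tdg_run` are all hypothetical. The two `tdg_add_*` functions play the master's part; `tdg_run` plays the helper's part and stands in for the workers by simply recording the execution order.

```c
#include <assert.h>

enum { MAX_TASKS = 16, MAX_SUCC = 8 };  /* illustrative capacities */

typedef struct {
    int pending;          /* number of unfinished predecessors */
    int nsucc;            /* number of successors */
    int succ[MAX_SUCC];   /* tasks that depend on this one */
    int done;             /* set once the task has executed */
} tdg_node;

typedef struct {
    tdg_node node[MAX_TASKS];
    int ntasks;
} tdg;

/* Master side: create a task and return its id. */
int tdg_add_task(tdg *g)
{
    assert(g->ntasks < MAX_TASKS);
    tdg_node *n = &g->node[g->ntasks];
    n->pending = 0;
    n->nsucc = 0;
    n->done = 0;
    return g->ntasks++;
}

/* Master side: record that task `after` must wait for task `before`. */
void tdg_add_dep(tdg *g, int before, int after)
{
    assert(before >= 0 && before < after && after < g->ntasks);
    tdg_node *b = &g->node[before];
    assert(b->nsucc < MAX_SUCC);
    b->succ[b->nsucc++] = after;
    g->node[after].pending++;
}

/* Helper side: repeatedly pick tasks with no pending predecessors, "run"
 * them, and release their successors. Returns the number of tasks run and
 * writes the execution order to `order`. */
int tdg_run(tdg *g, int *order)
{
    int ran = 0, progress = 1;
    while (progress) {
        progress = 0;
        for (int t = 0; t < g->ntasks; t++) {
            tdg_node *n = &g->node[t];
            if (n->done || n->pending > 0)
                continue;
            order[ran++] = t;                    /* a worker would execute it here */
            n->done = 1;
            for (int s = 0; s < n->nsucc; s++)   /* finalization signal */
                g->node[n->succ[s]].pending--;
            progress = 1;
        }
    }
    return ran;
}
```

In a threaded runtime the ready tasks would be handed to whichever worker is idle, so tasks without mutual dependencies could run on different GPUs at the same time.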

The GPUSs runtime. Locality exploitation

- Host and device memories: two-level memory hierarchy
- Data is transferred to device memory prior to any task execution
- Data is transferred back after execution
- Consider the local memory of each GPU as a cache memory storing recently-used data blocks
- Software cache + memory coherence policies:
  - Write-invalidate
  - Write-back
- The runtime keeps a memory map of each accelerator cache
- This information can be used to improve the mapping of tasks to resources
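The effect of a write-back, write-invalidate software cache on the number of transfers can be illustrated with a small bookkeeping model. This is a sketch under assumptions, not the GPUSs data structures: `block_cache` and `cache_access` are hypothetical names, the capacities are arbitrary, device memory is assumed large enough that no block is ever evicted, and a block is fetched even when the task only writes it.

```c
#include <assert.h>

enum { MAX_BLOCKS = 8, MAX_DEVS = 4 };  /* illustrative capacities */

typedef struct {
    int valid[MAX_DEVS][MAX_BLOCKS];  /* device holds an up-to-date copy */
    int dirty[MAX_DEVS][MAX_BLOCKS];  /* device copy is newer than the host copy */
    int h2d;                          /* host-to-device transfers so far */
    int d2h;                          /* device-to-host transfers so far */
} block_cache;

/* Write the block back to host memory if some device holds a newer copy. */
static void write_back(block_cache *c, int blk)
{
    for (int d = 0; d < MAX_DEVS; d++)
        if (c->dirty[d][blk]) {
            c->dirty[d][blk] = 0;
            c->d2h++;
        }
}

/* Called before a task on device `dev` uses block `blk`.
 * will_write is nonzero for output and inout arguments. */
void cache_access(block_cache *c, int dev, int blk, int will_write)
{
    assert(dev >= 0 && dev < MAX_DEVS && blk >= 0 && blk < MAX_BLOCKS);
    if (!c->valid[dev][blk]) {
        write_back(c, blk);           /* devices exchange data through main memory */
        c->valid[dev][blk] = 1;
        c->h2d++;
    }
    if (will_write) {
        for (int d = 0; d < MAX_DEVS; d++)   /* write-invalidate: drop other copies */
            if (d != dev)
                c->valid[d][blk] = 0;
        c->dirty[dev][blk] = 1;              /* write-back: host is updated lazily */
    }
}
```

Repeated accesses to a block from the same device cost no transfer after the first one, which is also why a scheduler that consults such a map can prefer the device that already holds a task's data.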

The GPUSs runtime. Additional features

- Definition of the number of accelerators at runtime
- Paraver traces to analyze performance
- Hybrid CPU/GPU execution of tasks
- Ported to a system with multiple ClearSpeed boards

Experimental results

Experimental setup

CPU                   Dual Xeon QuadCore E5440
CPU frequency         2.83 GHz
RAM memory            16 Gbytes
GPU                   Tesla S1070
Graphics processors   4 x GT200
GPU frequency         1.3 GHz
Video memory          4 Gbytes per GPU
Interconnection       PCIExpress Gen2
CUDA version          2.0
MKL version           10.0.1
Driver version        185.18

Performance measured in GFLOPS

Experimental results. Cholesky factorization

[Figure: Cholesky factorization - GPUSs runtime. GFLOPS (0 to 500) versus matrix size (0 to 20480). Series: GPUSs - Cached Write-back - 4 GPUs; GPUSs - Basic implementation - 4 GPUs; MKL spotrf - Dual Intel Xeon (8 cores); Cholesky GPU (CUBLAS) - 1 GPU]

- Tasks executed exclusively on GPUs (single precision)
- Important improvement with software cache

Experimental results. Scalability

[Figure: Cholesky factorization - Speed-up. Speedup (0 to 4) versus matrix size (0 to 20480). Series: GPUSs with 4, 3, 2 and 1 GPUs]

- PCIExpress: main bottleneck as number of GPUs increases

Experimental results. GEMM

[Figure: Performance of the Matrix-Matrix Product (C=C+A*B) on S1070. GFLOPS (0 to 1200) versus matrix size (m=n=k, 0 to 20480). Series: GPUSs - 4 GPUs; CUBLAS - 1 GPU]

- Performance near 1.1 TFLOPS (346 GFLOPS on 1 GPU using CUBLAS)
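For readers who want to relate such rates to run times, the helpers below spell out the arithmetic. They assume the usual operation count of 2n^3 floating-point operations for C = C + A*B with m = n = k; the function names are hypothetical and no timing from the slides is reproduced.

```c
#include <assert.h>

/* Floating-point operations of C = C + A*B for square matrices of order n,
 * using the conventional count of 2 n^3. */
double gemm_flops(double n)
{
    assert(n > 0.0);
    return 2.0 * n * n * n;
}

/* Rate in GFLOPS for a given operation count and run time in seconds. */
double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

/* Speed-up of one rate over a baseline rate (same units for both). */
double speedup(double rate, double baseline_rate)
{
    assert(baseline_rate > 0.0);
    return rate / baseline_rate;
}
```

If the two quoted figures are compared directly, taking 1.1 TFLOPS as 1100 GFLOPS, `speedup(1100.0, 346.0)` gives a ratio of roughly 3.2 for four GPUs over one.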

Experimental results on ClearSpeed boards

[Figure: Performance of the Matrix-Matrix Product (C=C+A*B). GFLOPS (0 to 300) versus matrix size (m=n=k, 0 to 12288). Series: GPUSs - Double precision - 4 CSX700; GPUSs - Double precision - 4 GPUs]

- Performance similar to 4 GPUs (double precision)

Conclusions and future work

Conclusions

- StarSs programming model: versatile and extensible for new architectures
- Programmability will determine the success of emerging architectures
- Our approach relies on a runtime system: little user intervention
- Many ideas can be applied to other multi-accelerator systems

Future work

- More complex scheduling strategies
- Porting to other multi-accelerator platforms
- Porting to heterogeneous multi-accelerator platforms
- Let the runtime automatically decide where to execute each task

Questions?

Related work

- SuperMatrix: Extension of the SuperMatrix SMP runtime
  - Automatic parallelization of linear algebra programs
  - Hybrid CPU / Multi-GPU systems
- Volkov et al.: Some highly tuned codes for multi-GPU systems
  - Linear algebra codes
  - No runtime or automatic scheduling
- Lee et al.: Compiler framework for automatic translation and optimization
  - OpenMP → GPU translation
- StarPU

Tesla vs. Cell B.E.

Similarities with the Cell B.E.

- Heterogeneous architectures:
  - Cell B.E.: 1 PPE + 8 SPEs
  - Tesla: 1 (multicore) CPU + 4 GPUs
- Each accelerator has its own local memory pool
- Fast interconnection network

Differences with the Cell B.E.

- GPUs need more granularity to attain good performance
- PPE performance is poor compared to that of the SPE
- Larger local memory spaces for each GPU (Gbytes) than for each SPE (Kbytes)
- Impact of data transfers (PCIExpress vs EIB)
- GPUs are passive elements: no system threads can be run on them

Specifying data movements

Some additional clauses can be used with the device pragma:

Data movement clauses

copy_in(data-reference-list)
copy_out(data-reference-list)

These clauses specify data movement for the shared variables inside a task:

- copy_in moves variables from host to device memory once the task is ready for execution.
- copy_out moves variables from device to host once the task finishes execution.
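The order of events implied by the two clauses can be mimicked on the host. In the sketch below, which is an illustration and not how GPUSs moves data, a static array stands in for device memory and `memcpy` stands in for the transfers; `device_buf`, `VEC_LEN` and `scale_on_device` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

enum { VEC_LEN = 4 };  /* illustrative size */

/* Stand-in for device memory; on a real system this is a separate address space. */
static float device_buf[VEC_LEN];

/* Emulates a task that uses a shared variable with copy_in and copy_out. */
void scale_on_device(float *host, float factor)
{
    assert(host != NULL);
    memcpy(device_buf, host, sizeof device_buf);   /* copy_in: host to device, task is ready */
    for (int i = 0; i < VEC_LEN; i++)              /* task body works on the device copy */
        device_buf[i] *= factor;
    memcpy(host, device_buf, sizeof device_buf);   /* copy_out: device to host, task finished */
}
```

The host array changes only at the final copy, which mirrors the point that results reach main memory when the task finishes.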
