Asian Option Pricing on cluster of GPUs: First Results
(ANR project « GCPMF »)
S. Vialle – SUPELEC, L. Abbas-Turki – ENPC
With the help of P. Mercier (SUPELEC). Previous work of G. Noaje, March-June 2008.
1 – Building a GPU cluster for experimentation…
1 – Building a cluster of GPUs
1.1 – Objectives & Strategy
1. Build a 16-node cluster to experiment with distributed computing on GPUs
2. With a multi-core CPU and a GPU on each node
3. Choose hardware that supports asynchronous communications and a maximal overlapping strategy:
Parallelization and overlapping of: GPU computation, CPU-GPU communications, CPU computations, and CPU-CPU communications (across the interconnection network).
[Diagram: cluster nodes, each with a CPU and a GPU and their own RAM, linked by the interconnection network]
1 – Building a cluster of GPUs
1.2 – Hardware choice
GPU on each node: ASUS GeForce 8800 GT
Product model: EN8800GT/G/HTDP/512M
• Multiprocessors: 14
• Stream processors: 112
• Core clock: 600 MHz
• Memory clock: 900 MHz
• Memory amount: 512 MB
• Memory interface: 256-bit
• Memory bandwidth: 57.6 GB/sec
• Texture fill rate: 33.6 billion/sec
Asynchronous communications and CUDA 1.1 supported
CPU on each node: one dual-core Intel E8200 processor, 2.66 GHz, front side bus: 1333 MHz. RAM: 4 GB DDR3, cache: 6 MB.
1 – Building a cluster of GPUs
1.3 – Software installed
Software installed & available:
• MPICH-2: yes
• OpenMPI: yes
• GCC (with OpenMP): yes
• ICC (with OpenMP): no, coming soon
• CUDA 1.1: yes
• OAR: no, coming soon
• Linux – Fedora Core 8 (64-bit kernel): yes
• Other software can be installed to support various experiments (contact us)
1 – Building a cluster of GPUs
1.4 – Interconnection networks
Networks installed & available:
• Gigabit Ethernet: yes
• InfiniBand: yes, on half of the cluster
• 10-Gigabit Ethernet: no, next year
GPUs compute very fast: network communication times are not negligible. Experiment and identify the best (or least bad) network.
2 – First benchmarks on a GPU cluster…
2 – First benchmarks
Distributed matrix product
[Diagram: P PCs numbered 0, 1, …, P-1 arranged in a ring]
Principles (C = A×B):
1. Matrices A and B are partitioned across the P PCs.
2. The B partition is static.
3. The A partition circulates on the ring of PCs.
4. The algorithm includes P steps.
5. At each step, each PC computes a part of the C matrix.
6. At the end, the C = A×B matrix is distributed across the P PCs.
Each local computation is run on the GPU…
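To make the ring algorithm concrete, here is a minimal sketch (not the authors' code) of the main loop each PE could run; the buffer and function names, in particular LocalGpuProduct standing for the CUDA kernel launch, are illustrative assumptions. This is the non-overlapped version; the per-step overlap is described on the next slide.

#include <mpi.h>

/* Assumed wrapper around the CUDA kernel computing the local partial product. */
void LocalGpuProduct(const float *blockA, const float *blockB, float *blockC,
                     int blockElems, int step);

/* Sketch of the P-step ring algorithm, as seen from one PE:
 * blockA circulates around the ring, blockB stays in place,
 * and blockC accumulates the local part of C = A×B.          */
void RingMatrixProduct(float *blockA, const float *blockB, float *blockC,
                       int blockElems, int Me, int NbPE)
{
    int next = (Me + 1) % NbPE;          /* successor on the ring   */
    int prev = (Me - 1 + NbPE) % NbPE;   /* predecessor on the ring */
    MPI_Status status;

    for (int step = 0; step < NbPE; step++) {
        /* Local partial product of the current A slice (on the GPU in the real code). */
        LocalGpuProduct(blockA, blockB, blockC, blockElems, step);

        /* Circulate the A partition to the next PE on the ring. */
        MPI_Sendrecv_replace(blockA, blockElems, MPI_FLOAT,
                             next, 0, prev, 0, MPI_COMM_WORLD, &status);
    }
}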
2 – First benchmarks
Distributed matrix product
[Diagram: one step on PE i, overlapping the computation of C on the GPU, the circulation of the A partition between PE i-1 and PE i+1, and the CPU-GPU data transfers]
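A possible shape for one such overlapped step is sketched below: the GPU computes on the A slice already on the device while the next slice circulates between PEs with non-blocking MPI, and is then staged on the GPU with an asynchronous copy. All names (LaunchProductKernel, h_Acur, h_Anext, d_Anext, …) are assumptions, and the sketch relies on CUDA 1.1 asynchronous copies with page-locked host buffers; it is not the authors' actual code.

/* One overlapped step on PE i (illustrative sketch, hypothetical names). */
cudaStream_t stream;
cudaStreamCreate(&stream);

MPI_Request reqs[2];

/* 1. Start the GPU computation on the A slice already on the device
 *    (asynchronous kernel launch, assumed wrapper).                  */
LaunchProductKernel(d_Acur, d_B, d_C, blockElems, stream);

/* 2. Meanwhile, circulate the A partition on the ring of CPUs. */
MPI_Irecv(h_Anext, blockElems, MPI_FLOAT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(h_Acur,  blockElems, MPI_FLOAT, next, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

/* 3. Stage the received slice on the GPU for the next step
 *    (asynchronous copy, requires page-locked host memory).   */
cudaMemcpyAsync(d_Anext, h_Anext, blockElems * sizeof(float),
                cudaMemcpyHostToDevice, stream);

/* 4. Wait for the GPU before swapping buffers and starting the next step. */
cudaStreamSynchronize(stream);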
2 – First benchmarks
Distributed matrix product
[Plot: MPI - No Overlap (1 core/node): execution time Texec (s) vs. number of nodes, log scales; curves: MPI-NoOverlap-LoopTime, MPI-NoOverlap-ComputTime, MPI-NoOverlap-CommTime]
[Plot: MPI - Overlap (1 core/node): execution time Texec (s) vs. number of nodes, log scales; curves: MPI-Overlap-LoopTime, MPI-Overlap-ComputTime, MPI-Overlap-WaitTime]
MPI on cluster of CPUs: 1 core/node + Gigabit Ethernet
• Computation time >> communication time, so overlapping has no impact
• Regular decrease of the execution time (good scalability)
2 – First benchmarks
Distributed matrix product
• Computation time ≈ communication time! Gigabit Ethernet is not fast enough for GPU communications!
• Overlapping of CPU communications & GPU computation has an impact: the overlap is incomplete, but it seems the right strategy
[Plot: MPI+CUDA - No Overlap: execution time Texec (s) vs. number of nodes (1 to 8); curves: MPI+CUDA-NoOverlap-LoopTime, MPI+CUDA-NoOverlap-ComputTime, MPI+CUDA-NoOverlap-CommTime]
[Plot: MPI+CUDA - Overlap: execution time Texec (s) vs. number of nodes (1 to 8); curves: MPI+CUDA-Overlap-LoopTime, MPI+CUDA-Overlap-ComputTime, MPI+CUDA-Overlap-WaitTime]
MPI+CUDA on cluster of CPU-GPUs: 1 core/node + 1 GPU/node + Gigabit Ethernet
2 – First benchmarks
Distributed matrix product
[Plot: Matrix product (6080×6080) on the GPLEC cluster: performance (GigaFlops) vs. number of nodes (1 to 10); curves: MPI+CUDA - Overlap, MPI+CUDA - No overlap, MPI - Overlap (1 core/node)]
Nb of nodes | Nb of CPUs (1 core/node) | Nb of GPUs (1 GPU/node)
1           | 2.1 GFlops               | 77 GFlops
8           | 15.5 GFlops              | 155 GFlops
Finally:
• 2.1 GFlops on 1 CPU, 155 GFlops on 8 GPUs
• But much time is spent in cluster communications
• The network is slow and the overlap is incomplete!
• The InfiniBand interconnect does not improve performance (the difference appears for a larger number of nodes).
Results are encouraging but far from peak performance!
3 – Parallelization of an « Asian Option Pricer » on a GPU cluster
3 – Parallelization of an Asian Option Pricer
Parallelization principle
[Diagram: timeline of the parallel execution on PE-0 … PE-P-1, each PE driving its GPU, with the following phases]
1. Read input data (on PE-0)
2. Broadcast input data to all PEs
3. Transfer data to the GPU
4. Run computation on the GPU
5. Transfer results to the CPU
6. Make final computations
7. Print result (on PE-0)
3 – Parallelization of an Asian Option Pricer
Implementation (1)
int main(int argc, char **argv)
{
  ...                                      // Variable declarations
  MPI_Init(&argc, &argv);                  // MPI initializations
  MPI_Comm_size(MPI_COMM_WORLD, &NbPE);
  MPI_Comm_rank(MPI_COMM_WORLD, &Me);
  if (Me == 0) {
    InitStockParaCPU();                    // Input data file reading from PE 0
  }
  BroadcastInputData();                    // Broadcast input data to all PEs
  InitStockParaGPU();                      // Transfer input data to the GPU
  ………
3 – Parallelization of an Asian Option Pricer
Implementation (2)

………
// Call « kernels » on the GPU, and transfer results to the CPU
for (int jj = 0; jj <= N; jj++) {
  for (int k = 0; k < NbStocks; k++) {
    ComputeUniformRandom(); GaussPRNG(k); OutputInputPRNG();
  }
  ActStock(jj);
  for (int k = 0; k < NbStocks; k++) {
    AsianSum(k,jj); OutputInputSum(k);
  }
}
ComputeIntegralSum();
ComputePriceSum();

// CPU computations
for (int i = 0; i < Nx; i++) {
  for (int j = 0; j < Ny; j++) {
    value = maxi((float)(BasketPriceCPU[i][j] - BasketSumCPU[i][j]), 0);
    sum  = sum  + value;
    sum2 = sum2 + value*value;
  }
}
………
3 – Parallelization of an Asian Option Pricer
Implementation (3)
………
// Collect all results on PE-0
MPI_Reduce(&sum,  &TotalSum,  1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Reduce(&sum2, &TotalSum2, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

// Compute the last result on PE-0
if (Me == 0) {
  value = exp(-r)*(TotalSum/(((double)Nx)*Ny*NbPE));
  fprintf(stdout, "Computed price: %f\n", (float)value);
}

// Close MPI mechanisms
MPI_Finalize();
return(EXIT_SUCCESS);
}
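Read back as a formula, the final computation on PE-0 is the discounted Monte Carlo average over the Nx·Ny·NbPE simulated values (assuming, as the code suggests, that r already accounts for the maturity); TotalSum2 is presumably gathered to estimate the variance of the estimator, a step not shown here:

\[ \text{price} \;=\; e^{-r}\,\frac{\text{TotalSum}}{N_x\,N_y\,\text{NbPE}} \;=\; e^{-r}\,\frac{1}{N_x\,N_y\,\text{NbPE}} \sum_{\text{trajectories}} \max\bigl(\text{BasketPrice} - \text{BasketSum},\,0\bigr) \]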
3 – Parallelization of an Asian Option Pricer
Compilation
OpenMPI + CUDA:

nvcc -O3                        // Serial automatic optimizations
     -I/opt/openmpi/include/    // MPI include files
     -I/usr/include/c++/4.1.2/  // Include files required by MPI
     -DOMPI_SKIP_MPICXX         // NVCC does not support « exceptions »
     -o AsianPricer  *.cu       // ALL source files are .cu files

Compilation using OpenMPI + CUDA appears easy when all files have the .cu extension.
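The resulting binary is then launched like any other MPI program; for example, with OpenMPI (the node count and hostfile name below are illustrative):

mpirun -np 16 --hostfile nodes.txt ./AsianPricer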
4 – Usage and performances of an « Asian Option Pricer » on a GPU cluster
4 – Usage and performances
Experimental performances
[Plot: Asian Pricing on GPU cluster: time T(s) vs. number of CPU+GPU nodes (1 to 16); curves: T-TotalExec(s), T-CalculGPU, T-Transfert, T-DataBcast, T-IO, T-CalculComm]
Size up: from 10⁶/4 to 10⁶ trajectories; Speedup: 10⁶ trajectories.
From 1 to 4 nodes: size up (to achieve the accuracy of 10⁶ trajectories);
Beyond 4 nodes: speedup (to run the computation faster).
[Plot: Asian Pricing on GPU cluster, log-log scales: time T(s) vs. number of CPU+GPU nodes; curves: T-TotalExec(s), T-CalculGPU, T-Transfert, T-DataBcast, T-IO, T-CalculComm]
4 – Usage and performances
Experimental performances
Good scaling of GPU computations and CPU-GPU transfers …
… but data broadcast could become a problem.
4 – Usage and performances
Experimental speedup
[Plot: Asian pricing on GPU cluster: relative speedup vs. 4 nodes (GPU-SU vs 4Nodes) as a function of the number of groups of 4 nodes; curves: SU-ideal(X) = X and SU-vs-4Nodes]
The « speedup » part of the experiment (from 4 to 16 nodes) exhibits a correct relative speedup (compared to the execution on 4 nodes).
4 – Usage and performances
Experimental speedup … to do!
The « speedup » of the GPU cluster compared to an execution on 1 CPU & 1 core is: … ???
• Requires a sequential execution on the CPU of one node of the GPU cluster… TO DO! (Previously measured close to 100 on other systems.)
The GPU cluster could achieve a speedup close to 360, compared to a sequential execution on one CPU of the same cluster.
The « speedup » of the GPU cluster compared to an execution on a cluster of P multi-core CPUs is … ???
• Requires a parallel MPI+OpenMP execution on the CPUs of the GPU cluster… TO DO !
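As an illustration of what that reference run involves, the per-PE CPU reduction loop shown in Implementation (2) could, for example, be parallelized with OpenMP on the dual-core nodes; a minimal sketch reusing the variable names from the code above (only one of the loops that would need this treatment):

// Hypothetical OpenMP parallelization of the CPU reduction loop of Implementation (2).
#pragma omp parallel for private(value) reduction(+:sum,sum2)
for (int i = 0; i < Nx; i++) {
  for (int j = 0; j < Ny; j++) {
    value = maxi((float)(BasketPriceCPU[i][j] - BasketSumCPU[i][j]), 0);
    sum  += value;
    sum2 += value*value;
  }
}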
5 – Conclusion and perspectives
5 – Conclusion and perspectives
• Current results are promising.
• Size up + Speedup seems to be the realistic way to use a cluster of GPUs.
Future work:
• Optimize parallel algorithms and source code (many issues to investigate in MPI+CUDA programming).
• Measure performances on a cluster of multi-core CPUs, and compare.
• Measure the energy consumed and compare CPU and GPU energetic performances.
Next events:
• 2nd JTE-GPGPU (December 4, 2008, Paris)
• PDCoF’09 (May 2009, Rome, Italy)
Asian Option Pricing on cluster of GPUs: First Results (ANR project « GCPMF »)
Questions ?