Asian Option Pricing on cluster of GPUs: First Results
(ANR project « GCPMF »)
S. Vialle – SUPELEC, L. Abbas-Turki – ENPC
With the help of P. Mercier (SUPELEC). Previous work of G. Noaje, March-June 2008.
1 – Building a GPU cluster for experimentation…
1 – Building a cluster of GPUs
1.1 – Objectives & Strategy
1. Build a 16-node cluster to experiment with distributed computing on GPUs
2. With a multi-core CPU and a GPU on each node
3. Choose hardware that supports asynchronous communications and a maximal overlapping strategy:
Parallelization and overlapping of: GPU computation, CPU-GPU communications, CPU computations, and CPU-CPU communications (across the interconnection network).
[Diagram: cluster nodes, each with a CPU and a GPU and their own RAM, linked by the interconnection network]
1 – Building a cluster of GPUs
1.2 – Hardware choice
GPU on each node: ASUS GeForce 8800 GT
Product model: EN8800GT/G/HTDP/512M
• Multiprocessors: 14
• Stream processors: 112
• Core clock: 600 MHz
• Memory clock: 900 MHz
• Memory amount: 512 MB
• Memory interface: 256-bit
• Memory bandwidth: 57.6 GB/sec
• Texture fill rate: 33.6 billion/sec
Asynchronous communications and CUDA 1.1 supported
CPU on each node: one dual-core Intel E8200 processor, 2.66 GHz, front side bus: 1333 MHz. RAM: 4 GB DDR3, cache: 6 MB.
1 – Building a cluster of GPUs
1.3 – Software installed
Software installed & available:
• MPICH-2: yes
• OpenMPI: yes
• GCC (with OpenMP): yes
• ICC (with OpenMP): no, coming soon
• CUDA 1.1: yes
• OAR: no, coming soon
• Linux – Fedora Core 8 (64-bit kernel): yes
• Other software can be installed to support various experiments (contact us)
1 – Building a cluster of GPUs
1.4 – Interconnection networks
Networks installed & available:
• Gigabit Ethernet: yes
• InfiniBand: yes, on half of the cluster
• 10-Gigabit Ethernet: no, next year
GPUs compute very fast: network communication times are not negligible. Experiment and identify the best (or least bad) network.
2 – First benchmarks on a GPU cluster…
2 – First benchmarks
Distributed matrix product
[Diagram: P PCs numbered 0, 1, …, P-1 arranged in a ring]
Principles (C = A×B):
1. Matrices A and B are partitioned across the P PCs.
2. The B partition is static.
3. The A partition circulates on the ring of PCs.
4. The algorithm includes P steps.
5. At each step, each PC computes a part of the C matrix.
6. At the end, the C = A×B matrix is distributed across the P PCs.
Each local computation is run on the GPU…
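To make the ring algorithm concrete, here is a minimal sketch (not the authors' code) of the main loop each PE could run; the buffer and function names, in particular LocalGpuProduct standing for the CUDA kernel launch, are illustrative assumptions. This is the non-overlapped version; the per-step overlap is described on the next slide.

#include <mpi.h>

/* Assumed wrapper around the CUDA kernel computing the local partial product. */
void LocalGpuProduct(const float *blockA, const float *blockB, float *blockC,
                     int blockElems, int step);

/* Sketch of the P-step ring algorithm, as seen from one PE:
 * blockA circulates around the ring, blockB stays in place,
 * and blockC accumulates the local part of C = A×B.          */
void RingMatrixProduct(float *blockA, const float *blockB, float *blockC,
                       int blockElems, int Me, int NbPE)
{
    int next = (Me + 1) % NbPE;          /* successor on the ring   */
    int prev = (Me - 1 + NbPE) % NbPE;   /* predecessor on the ring */
    MPI_Status status;

    for (int step = 0; step < NbPE; step++) {
        /* Local partial product of the current A slice (on the GPU in the real code). */
        LocalGpuProduct(blockA, blockB, blockC, blockElems, step);

        /* Circulate the A partition to the next PE on the ring. */
        MPI_Sendrecv_replace(blockA, blockElems, MPI_FLOAT,
                             next, 0, prev, 0, MPI_COMM_WORLD, &status);
    }
}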
2 – First benchmarks
Distributed matrix product
[Diagram: one step on PE i, overlapping the computation of C on the GPU, the circulation of the A partition between PE i-1 and PE i+1, and the CPU-GPU data transfers]
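A possible shape for one such overlapped step is sketched below: the GPU computes on the A slice already on the device while the next slice circulates between PEs with non-blocking MPI, and is then staged on the GPU with an asynchronous copy. All names (LaunchProductKernel, h_Acur, h_Anext, d_Anext, …) are assumptions, and the sketch relies on CUDA 1.1 asynchronous copies with page-locked host buffers; it is not the authors' actual code.

/* One overlapped step on PE i (illustrative sketch, hypothetical names). */
cudaStream_t stream;
cudaStreamCreate(&stream);

MPI_Request reqs[2];

/* 1. Start the GPU computation on the A slice already on the device
 *    (asynchronous kernel launch, assumed wrapper).                  */
LaunchProductKernel(d_Acur, d_B, d_C, blockElems, stream);

/* 2. Meanwhile, circulate the A partition on the ring of CPUs. */
MPI_Irecv(h_Anext, blockElems, MPI_FLOAT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(h_Acur,  blockElems, MPI_FLOAT, next, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

/* 3. Stage the received slice on the GPU for the next step
 *    (asynchronous copy, requires page-locked host memory).   */
cudaMemcpyAsync(d_Anext, h_Anext, blockElems * sizeof(float),
                cudaMemcpyHostToDevice, stream);

/* 4. Wait for the GPU before swapping buffers and starting the next step. */
cudaStreamSynchronize(stream);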
2 – First benchmarks
Distributed matrix product
[Plot: MPI - No Overlap (1 core/node): execution time Texec (s) vs. number of nodes, log scales; curves: MPI-NoOverlap-LoopTime, MPI-NoOverlap-ComputTime, MPI-NoOverlap-CommTime]
[Plot: MPI - Overlap (1 core/node): execution time Texec (s) vs. number of nodes, log scales; curves: MPI-Overlap-LoopTime, MPI-Overlap-ComputTime, MPI-Overlap-WaitTime]
MPI on cluster of CPUs: 1 core/node + Gigabit Ethernet
• Computation time >> communication time, so overlapping has no impact
• Regular decrease of the execution time (good scalability)
2 – First benchmarks
Distributed matrix product
• Computation time ≈ communication time! Gigabit Ethernet is not fast enough for GPU communications!
• Overlapping of CPU communications & GPU computation has an impact: the overlap is incomplete, but it seems the right strategy
[Plot: MPI+CUDA - No Overlap: execution time Texec (s) vs. number of nodes (1 to 8); curves: MPI+CUDA-NoOverlap-LoopTime, MPI+CUDA-NoOverlap-ComputTime, MPI+CUDA-NoOverlap-CommTime]
[Plot: MPI+CUDA - Overlap: execution time Texec (s) vs. number of nodes (1 to 8); curves: MPI+CUDA-Overlap-LoopTime, MPI+CUDA-Overlap-ComputTime, MPI+CUDA-Overlap-WaitTime]
MPI+CUDA on cluster of CPU-GPUs: 1 core/node + 1 GPU/node + Gigabit Ethernet
2 – First benchmarks
Distributed matrix product
[Plot: Matrix product (6080×6080) on the GPLEC cluster: performance (GigaFlops) vs. number of nodes (1 to 10); curves: MPI+CUDA - Overlap, MPI+CUDA - No overlap, MPI - Overlap (1 core/node)]
Nb of nodes | Nb of CPUs (1 core/node) | Nb of GPUs (1 GPU/node)
1           | 2.1 GFlops               | 77 GFlops
8           | 15.5 GFlops              | 155 GFlops
Finally:
• 2.1 GFlops on 1 CPU, 155 GFlops on 8 GPUs
• But much time is spent in cluster communications
• The network is slow and the overlap is incomplete!
• The InfiniBand interconnect does not improve performance (the difference appears for a larger number of nodes).
Results are encouraging but far from peak performance!
3 – Parallelization of an « Asian Option Pricer » on a GPU cluster
3 – Parallelization of an Asian Option Pricer
Parallelization principle
[Diagram: timeline of the parallel execution on PE-0 … PE-P-1, each PE driving its GPU, with the following phases]
1. Read input data (on PE-0)
2. Broadcast input data to all PEs
3. Transfer data to the GPU
4. Run computation on the GPU
5. Transfer results to the CPU
6. Make final computations
7. Print result (on PE-0)
3 – Parallelization of an Asian Option Pricer
Implementation (1)
int main(int argc, char **argv)
{
  ...                                      // Variable declarations
  MPI_Init(&argc, &argv);                  // MPI initializations
  MPI_Comm_size(MPI_COMM_WORLD, &NbPE);
  MPI_Comm_rank(MPI_COMM_WORLD, &Me);
  if (Me == 0) {
    InitStockParaCPU();                    // Input data file reading from PE 0
  }
  BroadcastInputData();                    // Broadcast input data to all PEs
  InitStockParaGPU();                      // Transfer input data to the GPU
  ………
3 – Parallelization of an Asian Option Pricer
Implementation (2)

………
// Call « kernels » on the GPU, and transfer results to the CPU
for (int jj = 0; jj <= N; jj++) {
  for (int k = 0; k < NbStocks; k++) {
    ComputeUniformRandom(); GaussPRNG(k); OutputInputPRNG();
  }
  ActStock(jj);
  for (int k = 0; k < NbStocks; k++) {
    AsianSum(k,jj); OutputInputSum(k);
  }
}
ComputeIntegralSum();
ComputePriceSum();

// CPU computations
for (int i = 0; i < Nx; i++) {
  for (int j = 0; j < Ny; j++) {
    value = maxi((float)(BasketPriceCPU[i][j] - BasketSumCPU[i][j]), 0);
    sum  = sum  + value;
    sum2 = sum2 + value*value;
  }
}
………
3 – Parallelization of an Asian Option Pricer
Implementation (3)
………
// Collect all results on PE-0
MPI_Reduce(&sum,  &TotalSum,  1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Reduce(&sum2, &TotalSum2, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

// Compute the last result on PE-0
if (Me == 0) {
  value = exp(-r)*(TotalSum/(((double)Nx)*Ny*NbPE));
  fprintf(stdout, "Computed price: %f\n", (float)value);
}

// Close MPI mechanisms
MPI_Finalize();
return(EXIT_SUCCESS);
}
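Read back as a formula, the final computation on PE-0 is the discounted Monte Carlo average over the Nx·Ny·NbPE simulated values (assuming, as the code suggests, that r already accounts for the maturity); TotalSum2 is presumably gathered to estimate the variance of the estimator, a step not shown here:

\[ \text{price} \;=\; e^{-r}\,\frac{\text{TotalSum}}{N_x\,N_y\,\text{NbPE}} \;=\; e^{-r}\,\frac{1}{N_x\,N_y\,\text{NbPE}} \sum_{\text{trajectories}} \max\bigl(\text{BasketPrice} - \text{BasketSum},\,0\bigr) \]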
3 – Parallelization of an Asian Option Pricer
Compilation
OpenMPI + CUDA:

nvcc -O3                        // Serial automatic optimizations
     -I/opt/openmpi/include/    // MPI include files
     -I/usr/include/c++/4.1.2/  // Include files required by MPI
     -DOMPI_SKIP_MPICXX         // NVCC does not support « exceptions »
     -o AsianPricer  *.cu       // ALL source files are .cu files

Compilation using OpenMPI + CUDA appears easy when all files have the .cu extension.
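The resulting binary is then launched like any other MPI program; for example, with OpenMPI (the node count and hostfile name below are illustrative):

mpirun -np 16 --hostfile nodes.txt ./AsianPricer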
4 – Usage and performances of an « Asian Option Pricer » on a GPU cluster
4 – Usage and performances
Experimental performances
[Plot: Asian Pricing on GPU cluster: time T(s) vs. number of CPU+GPU nodes (1 to 16); curves: T-TotalExec(s), T-CalculGPU, T-Transfert, T-DataBcast, T-IO, T-CalculComm]
Size up: from 10⁶/4 to 10⁶ trajectories; Speedup: 10⁶ trajectories.
From 1 to 4 nodes: size up (to achieve the accuracy of 10⁶ trajectories);
Beyond 4 nodes: speedup (to run the computation faster).
[Plot: Asian Pricing on GPU cluster, log-log scales: time T(s) vs. number of CPU+GPU nodes; curves: T-TotalExec(s), T-CalculGPU, T-Transfert, T-DataBcast, T-IO, T-CalculComm]
4 – Usage and performances
Experimental performances
Good scaling of GPU computations and CPU-GPU transfers …
… but data broadcast could become a problem.
4 – Usage and performances
Experimental speedup
[Plot: Asian pricing on GPU cluster: relative speedup vs. 4 nodes (GPU-SU vs 4Nodes) as a function of the number of groups of 4 nodes; curves: SU-ideal(X) = X and SU-vs-4Nodes]
The « speedup » part of the experiment (from 4 to 16 nodes) exhibits a correct relative speedup (compared to the execution on 4 nodes).
4 – Usage and performances
Experimental speedup … to do!
The « speedup » of the GPU cluster compared to an execution on 1 CPU & 1 core is: … ???
• Requires a sequential execution on the CPU of one node of the GPU cluster… TO DO! (Previously measured close to 100 on other systems.)
The GPU cluster could achieve a speedup close to 360, compared to a sequential execution on one CPU of the same cluster.
The « speedup » of the GPU cluster compared to an execution on a cluster of P multi-core CPUs is … ???
• Requires a parallel MPI+OpenMP execution on the CPUs of the GPU cluster… TO DO !
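As an illustration of what that reference run involves, the per-PE CPU reduction loop shown in Implementation (2) could, for example, be parallelized with OpenMP on the dual-core nodes; a minimal sketch reusing the variable names from the code above (only one of the loops that would need this treatment):

// Hypothetical OpenMP parallelization of the CPU reduction loop of Implementation (2).
#pragma omp parallel for private(value) reduction(+:sum,sum2)
for (int i = 0; i < Nx; i++) {
  for (int j = 0; j < Ny; j++) {
    value = maxi((float)(BasketPriceCPU[i][j] - BasketSumCPU[i][j]), 0);
    sum  += value;
    sum2 += value*value;
  }
}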
5 – Conclusion and perspectives
5 – Conclusion and perspectives
• Current results are promising.
• Size up + Speedup seems to be the realistic way to use a cluster of GPUs.
Future work:
• Optimize parallel algorithms and source code (many issues to investigate in MPI+CUDA programming).
• Measure performances on a cluster of multi-core CPUs, and compare.
• Measure the energy consumed and compare CPU and GPU energetic performances.
Next events:
• 2nd JTE-GPGPU (December 4, 2008, Paris)
• PDCoF’09 (May 2009, Rome, Italy)
Asian Option Pricing on cluster of GPUs: First Results (ANR project « GCPMF »)
Questions ?