outline speeding up matlab computations symmetric multi-processing with matlab accelerating matlab...

26
Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed memory environments Using the Parallel Computing Toolbox Using the Matlab Distributed Compute Engine Server Using pMatlab Mixing Matlab and Fortran/C code Compiling MEX code from C/Fortran Turning Matlab routines into C code

Upload: dorian-hanchett

Post on 14-Dec-2015

233 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Outline

Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed memory environments

Using the Parallel Computing Toolbox Using the Matlab Distributed Compute Engine Server Using pMatlab

Mixing Matlab and Fortran/C code Compiling MEX code from C/Fortran Turning Matlab routines into C code

Page 2: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Symmetric Multi-Processing

By default Matlab uses all cores on a given node for operations that can be threadedArrays and matrices• Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN• Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE)• Array operations: PROD, SUM• Array manipulation: BSXFUN, SORTLinear algebra• Matrix Analysis: DET, RCOND• Linear Equations: CHOL, INV, LDL, LINSOLVE, LU, QR• Eigenvalues and singular values: EIG, HESS, SCHUR, SVD, QZElementary math• Trigonometric: ATAN2, COS, CSC, HYPOT, SEC, SIN, TAN, including variants for radians, degrees, hyperbolics, and inverses.• Exponential: EXP, POW2, SQRT• Complex: ABS• Rounding and remainder: CEIL, FIX, FLOOR, MOD, REM, ROUND• LOG, LOG2, LOG10, LOG1P, EXPM1, SIGN, BITAND, BITOR, BITXORSpecial Functions• ERF, ERFC, ERFCINV, ERFCX, ERFINV, GAMMA, GAMMALNData Analysis• CONV2, FILTER, FFT and IFFT of multiple columns or long vectors, FFTN, IFFTN

Page 3: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Symmetric Multi-Processing

To be sure you only use the resources you request, you should either request an entire node and all of the CPU’s...

Or request a single cpu and specify that Matlab should only use a single thread

salloc –t 60 -c 24 module load matlabsrun matlab

salloc –t 60module load matlabsrun matlab -singleCompThread

Page 4: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using GPUs with Matlab

Matlab can use GPUs to do calculations, provided a GPU is available on the node Matlab is running on.

We can query the connected GPUs from within Matlab using

And obtain a list of GPU supported functions using

salloc -t 60 --gres=gpu:1module load matlabmodule load cudamatlab

gpuDeviceCount()gpuDevice()

methods('gpuArray')

Page 5: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using GPUs with Matlab

So there is a 2D FFT – but no Hilbert function...

We could do the log and abs functions on the GPU as well.

H=hilb(1000);H_=gpuArray(H);Z_=fft2(H_);Z=gather(Z_);imagesc(log(abs(Z)));

H=hilb(1000);H_=gpuArray(H);Z_=fft2(H_);imagesc(gather(log(abs(Z_)));

Distribute data to GPUFFT performed on GPUGather data from GPU onto CPU

Page 6: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using GPUs with Matlab

For our example, doing the FFT on the GPU is 7x faster. (4x if you include moving the data to the GPU and back)

>> H=hilb(5000);>> tic; A=gather(gpuArray(H)); tocElapsed time is 0.161166 seconds.>> tic; A=gather(fft2(gpuArray(H))); tocElapsed time is 0.348159 seconds.>> tic; A=fft2(H); tocElapsed time is 1.210464 seconds.

Page 7: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using GPUs with Matlab

Matlab has no built in hilb() function that can run on the GPU – but we can write our own function(kernel) in cuda – and save it to hilbert.cu

And compile it with nvcc to generate a Parallel Thread eXecution file

__global__ void HilbertKernel( double * const out, size_t const numRows, size_t const numCols){ const int rowIdx = blockIdx.x * blockDim.x + threadIdx.x; const int colIdx = blockIdx.y * blockDim.y + threadIdx.y; if ( rowIdx >= numRows ) return; if ( colIdx >= numCols ) return; size_t linearIdx = rowIdx + colIdx*numRows; out[linearIdx] = 1.0 / (double)(1+rowIdx+colIdx) ;}

nvcc -ptx hilbert.cu

Page 8: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using GPUs with Matlab

We have to initialize the kernel and specify the grid size before executing the kernel.

The default for matlab kernel’s is 1 thread per block, but we could create fewer blocks that were each 10 x 10 threads.

H_=gpuArray.zeros(1000);hilbert_kernel=parallel.gpu.CUDAKernel('hilbert.ptx','hilbert.cu');hilbert_kernel.GridSize=size(H_);H_=feval(hilbert_kernel, H_, 1000,1000);Z_=fft2(H_);imagesc(gather(log(abs(Z_))));

hilbert_kernel.ThreadBlockSize=[10,10,1];hilbert_kernel.GridSize=[100,100];

Page 9: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

As an alternative you can also use the Parallel Computing Toolbox. This supports parallelism via MPI

https://info.circ.rochester.edu/BlueHive/Software/Mathematics/matlab.html

You should create a pool that is the same size as the number of processors you requested in your job submission. Matlab also sells licenses for using a Distributed Computing Server which allows for matlabpools that use more than just the local node.

Page 10: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

• You can achieve parallelism in several ways:• parfor loops – execute for loops in parallel• smpd – execute instructions in parallel (using ‘labindex’ or

‘drange’)• pmode – interactive version of smpd• distributed arrays – very similar to gpuArrays.

Page 11: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

• You can achieve parallelism in several ways:• parfor loops – execute for loops in parallel• smpd – execute instructions in parallel (using ‘labindex’ or

‘drange’)• pmode – interactive version of smpd• distributed arrays – very similar to gpuArrays.matlabpool(4)

parfor n=1:100 H=hilb(n); Z=fft2(H); f=figure('Visible','off'); imagesc(log(abs(Z))); print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));end

matlabpool close

Page 12: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

• You can achieve parallelism in several ways:• parfor loops – execute for loops in parallel• smpd – execute instructions in parallel (using ‘labindex’ or

‘drange’)• pmode – interactive version of smpd• distributed arrays – very similar to gpuArrays.

matlabpool(4)spmd for n=drange(1:100) H=hilb(n); Z=fft2(H); f=figure('Visible','off'); imagesc(log(abs(Z))); endendmatlabpool close

matlabpool(4)spmd for n=labindex:numlabs:100 H=hilb(n); Z=fft2(H); f=figure('Visible','off'); imagesc(log(abs(Z))); endendmatlabpool close

Page 13: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

• You can achieve parallelism in several ways:• parfor loops – execute for loops in parallel• smpd – execute instructions in parallel (using ‘labindex’ or

‘drange’)• pmode – interactive version of smpd• distributed arrays – very similar to gpuArrays.

pmode start 4

pmode lab2client H 3 H3H3pmode close

n=labindex;H=hilb(n);Z=fft2(H);f=figure('Visible','off');imagesc(log(abs(Z)));print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));

Page 14: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

• You can achieve parallelism in several ways:• parfor loops – execute for loops in parallel• smpd – execute instructions in parallel (using ‘labindex’ or

‘drange’)• pmode – interactive version of smpd• distributed arrays – very similar to gpuArrays.

H=hilb(1000);H_=gpuArray(H);Z_=fft2(H_);Z=gather(Z_);imagesc(log(abs(Z)));

matlabpool(8)

H=hilb(1000); H_=distributed(H); Z_=fft(fft(H_,[],1),[],2); Z=gather(Z_); imagesc(log(abs(Z)));

matlabpool close

Example using gpuArray

Example using distributed arrays

Page 15: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Parallel Computing Toolbox

matlabpool(4)spmd codist=codistributor1d(1,[250,250,250,250],[1000,1000]); [i_lo, i_hi]=codist.globalIndices(1); H_local=zeros(250,1000); for i=i_lo:i_hi for j=1:1000 H_local(i-i_lo+1,j)=1/(i+j-1); end end H_ = codistributed.build(H_local, codist);endZ_=fft(fft(H_,[],1),[],2);Z=gather(Z_);imagesc(log(abs(Z)));matlabpool close

What about building hilbert matrix in parallel?

Define partitionGet local indices in x-direction

Allocate space for local part

Initialize local array with Hilbert values.

Assemble codistributed arrayNow it's distributed like before!

Page 16: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using pMatlab

pMatlab is an alternative method to get distributed matlab functionality without relying on Matlab’s Distributed Computing Server. It is built on top of MapMPI (an MPI implementation for matlab – written in matlab - that uses file I/O for communication) It supports various operations on distributed arrays (up to 4D)

Remapping, aggregating, finding non-zero entries, transposing, ghosting

Elementary math functions (trig, exponential, complex, remainder/rounding)

2D Convolutions, FFTs, Discrete Cosine TransformFFT's need to be properly mapped (cannot be distributed along transform dimension).

It does not have as much functionality as the parallel computing toolbox – but it does support ghosting and more flexible partitioning!

Page 17: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Using pMatlab

Since pMatlab works by launching other Matlab instances – we need them to startup with pMatlab functionality. To do so we need to add a few lines to our startup.m file in our matlab path.

addpath('/software/pMatlab/MatlabMPI/src');addpath('/software/pMatlab/src');rehash;pMatlabGlobalsInit;

Page 18: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Running pMatlab in Batch Mode

To submit a job in batch mode we need to create a batch script

And a Matlab script to launch the pMatlab script

#SBATCH -N Matlab#SBATCH -p standard#SBATCH –t 60#SBATCH –N 2#SBATCH --ntasks-per-node=8module load matlabmatlab -nodisplay -r "pmatlab_launcher"

nProcs=getenv('SLURM_NTASKS); [sreturn, machines]=system('nodelist'); machines=regexp(machines, '\n', 'split'); machines=machines(1:size(machines,2)-1); eval(pRUN('pmatlab_script',nProcs,machines));

sample_script.pbs

pmatlab_launcher.m

Page 19: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Running pMatlab in Batch Mode

And finally we have our pmatlab script.

Xmap=map([Np 1],{},0:Np-1);H_=zeros(1000,1000,Xmap);[I1,I2]=global_block_range(H_);H_local=zeros(I1(2)-I1(1)+1,I2(2)-I2(1)+1);for i=I1(1):I1(2) for j=I2(1):I2(2) H_local(i-I1(1)+1,j-I2(1)+1)=1/(i+j-1); endendH_=put_local(H_,H_local);Z_=fft(fft(H_,[],2),[],1);Z=agg(Z_);if (pMATLAB.my_rank == pMATLAB.leader) f=figure('Visible','off'); imagesc(log(abs(Z))); print('-dpdf','-r300', 'fig1.pdf');end

map for distributing arrayDistributed matrix constructor

Indices for local portion of array

Allocate and populate local portion of array with Hilbert matrix values

Copy local values into distributed arrayDo y-fft and do x-fft. Z_ has different mapCollect resulting matrix onto 'leader'

Plot result from 'leader' matlab process

pmatlab_script.m

X = put_local(X, fft(local(X),[],2));Z = transpose_grid(X);Z = put_local(Z, fft(local(Z),[],1));

Page 20: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Compiling Mex Code

C, C++, or Fortran routines can be called from within Matlab.#include "fintrf.h"subroutine mexfunction(nlhs, plhs, nrhs, prhs) mwPointer :: plhs(*), prhs(*) integer :: nlhs, nrhs mwPointer :: mxGetPr mwPointer :: mxCreateDoubleMatrix real(8) :: mxGetScalar mwPointer :: pr_out integer :: n n = nint(mxGetScalar(prhs(1))) plhs(1) = mxCreateDoubleMatrix(n,n, 0) pr_out = mxGetPr(plhs(1)) call compute(%VAL(pr_out),n)end subroutine mexfunction

subroutine compute(h, n) integer :: n real(8) :: h(n,n) integer :: i,j do i=1,n do j=1,n h(i,j)=1d0/(i+j-1d0) end do end doend subroutine compute

mex hilbert.F90

>> H=hilbert(10)

Page 21: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Turning Matlab code into C

First we create a log_abs_fft_hilb.m function

And then we run

This will produce a mex file that we can test.

We could have specified the type of 'n' in our matlab function

function result = log_abs_fft_hilb(n) result=log(abs(fft2(hilb(n))));

>> codegen log_abs_fft_hilb.m –args {uint32(0)}

>> A=log_abs_fft_hilb_mex(uint32(16));>> B=log_abs_fft_hilb(16);>> max(max(abs(A-B)))ans = 8.8818e-16

function result = log_abs_fft_hilb(n) assert(isa(n,'uint32')); result=log(abs(fft2(hilb(n))));

Page 22: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Turning Matlab code into C

Now we can also export a static library that we can link to:

This will create a subdirectory codegen/lib/log_abs_fft_hilb that will have the source files '.c and .h' as well as a compiled object files '.o' and a library 'log_abs_fft_hilb.a' The source files are portable to any platform with a 'C' compiler (ie BlueStreak). We can rebuild the library on BlueStreak by running

>> codegen log_abs_fft_hilb.m -config coder.config('lib') -args {'uint32(0)'}

mpixlc –c *.car rcs log_abs_fft_hilb.a *.o

Page 23: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Turning Matlab code into C

To use the function, we still need to write a main subroutine that links to it. This requires working with matlab's variable types (which include dynamically resizable arrays)#include "stdio.h"#include "rtwtypes.h"#include "log_abs_fft_hilb_types.h"void main() { uint32_T n=64; emxArray_real_T *result; int32_T i,j; emxInit_real_T(&result, 2); log_abs_fft_hilb(n, result); for(i=0;i<result->size[0];i++) { for(j=0;j<result->size[1];j++) { printf("%f ",result->data[i+result->size[0]*j]); } printf("\n"); } emxFree_real_T(&result);}

Matlab type definitions

Argument to Matlab functionReturn value of Matlab function

Initialize Matlab array to have rank 2

Call matlab function

Free up memory associated with return array

Output result in column major order

Page 24: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Turning Matlab code into C

And here is another example of calling 2D fft's on real datavoid main() { int32_T q0; int32_T i; int32_T n=8; emxArray_creal_T *result; emxArray_real_T *input; emxInit_creal_T(&result, 2); emxInit_real_T(&input, 2); q0 = input->size[0] * input->size[1]; input->size[0]=n; input->size[1]=n; emxEnsureCapacity((emxArray__common *)input, q0, (int32_T)sizeof(real_T)); for(j=0;j<input->size[1];j++ { for(i=0;i<input->size[0];i++) { input->data[i+input->size[0]*j]=1.0 / (real_T)(i+j+1); } }

my_fft(input, result);

for(i=0;i<result->size[0];i++) { for(j=0;j<result->size[1];j++) { printf("[% 10.4f,% 10.4f] ", result->data[i+result->size[0]*j].re, result->data[i+result->size[0]*j].im); } printf("\n"); }

emxFree_creal_T(&result); emxFree_real_T(&input);}

Page 25: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Turning Matlab code into C

Exported FFT's only work on vectors of length 2N Error checking is disabled in exported C code Mex code will have the same functionality as exported C code, but will also have error checking. It will warn about doing FFT's on arbitrary length vectors, etc... Always test your mex code!

Page 26: Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed

Matlab code is not that different from C code

#include <stdio.h>#include <math.h>#include <complex.h>#include <fftw3.h>void main() { int n=4096; int i,j; double complex temp[n][n], input[n][n]; double result[n][n]; fftw_plan p; p=fftw_plan_dft_2d(n, n, &input[0][0], &temp[0][0], FFTW_FORWARD, FFTW_ESTIMATE); for (i=0;i<n;i++){ for(j=0;j<n;j++) { input[i][j]=(double complex)(1.0/(double)(i+j+1)); } } fftw_execute(p);

for (i=0;i<n;i++){ for(j=0;j<n;j++) { result[i][j]=log(cabs(temp[i][j])); } } for (i=0;i<n;i++){ for(j=0;j<n;j++) { printf("%f ",result[i][j]); } } fftw_destroy_plan(p);}

Or you can write your own 'C' code that uses open source mathematical libraries (ie fftw).