hpc in python

High Performance Computing in Python

A. Tyulpinalekseytyulpin@gmail.com

Measurement Systems and Digital Signal Processing laboratory,Northern (Arctic) Federal University

Arkhangelsk, 2014

A. Tyulpin (DSPLab) HPC in Python Arkhangelsk, 2014 1 / 22

1 Introduction2 Computing on CPU

1 Multiprocessing module2 MPI and mpi4py

3 GPGPU computing (NVidia CUDA)1 CUDA and PyCUDA2 Theano library

IntroductionWhy Python?

Fast and easy developmentFlexibilityExtensibility (many libraries)Possible to use code written in Fortran, C and C++

Popular scientific librariesNumPySciPyscikit-learnmatplotlibPandas...

Multiprocessing module

FeaturesProcess-based multithreadingBuilt in moduleInterections between processes via thread-safe queuesSynchronization between processesPossible to use shared memory

Multiprocessing moduleProcess-based multithreading.from multiprocessing import Pool , cpu_countimport time , urllib.requestfrom hashlib import sha256

def process(prc_id):u = urllib.request.urlopen(’http ://en.wikipedia.org/wiki/

Special:Random ’)data , hasher = u.read(), sha256 ()hasher.update(data)return (hasher.digest (), data)

if __name__ == ’__main__ ’:pool = Pool(processes = cpu_count ())t1 = time.time()processed = pool.map(process , range (100))pool.close()pool.join()print(’Parallel calculation takes ’,time.time() - t1)t1 = time.time()processed = tuple(map(process , range (100)))print(’Serial calculation takes’,time.time() - t1)

Results and discussion

Hardware3 Mbit/s Internet connectionCPU – Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz. 2 cores withmultithreading.

ResultsParallel calculation takes 24.88 secondsSerial calculation takes 80.53 seconds

Given example could be used in MapReduce-like tasks. For example, if youhave a lot of objects, which can be processed independently, it is good touse this approach for parallel execution of tasks.

Under the hood

1 pool = Pool() launches one slave process per physical processor on thecomputer. On Unix systems, the slaves are forked from the masterprocess. Under Windows, a new process is started that imports thescript.

2 pool.map(process, range(100)) divides the input list into chunks ofroughly equal size and puts the tasks (function + chunk) on a todolist.

3 Each slave process takes a task (function + a chunk of data) from thetodo list, runs map(function, chunk), and puts the result on a resultlist.

4 pool.map on the master process waits until all tasks are handled andreturns the concatenation of the result lists.

Taken from http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf

mpi4py

Website – http://mpi4py.scipy.org

FeaturesProvides bindings of the Message Passing Interface (MPI)Point-to-point (sends, receives) communicationsCollective (broadcasts, scatters, gathers) communicationsSupport of import of MPI-C code using SWIGSupport of virtual topologies

mpi4py. Hello world

#!/usr/bin/env python3

from mpi4py import MPI

comm = MPI.COMM_WORLDrank = comm.Get_rank ()

print(rank)

Not need to use call MPI Init()or Finalize()Sripts is run using mpirun

Execution and resultsmpirun -np 4 python3 ex1.pyResults: 0 2 3 1

mpi4py. MPI Broadcast

import numpy as npfrom mpi4py import MPI

comm = MPI.COMM_WORLDcomm.Barrier ()

N = 5if comm.rank == 0:

A = np.arange(N, dtype=np.float64)else:

A = np.empty(N, dtype=np.float64)comm.Bcast( [A, MPI.DOUBLE] )

print("[%02d] %s" % (comm.rank , A))

mpi4py. MPI Reduce

from mpi4py import MPIimport numpy

comm = MPI.COMM_WORLDsize = comm.Get_size ()rank = comm.Get_rank ()

N = 1000000h = 2.0 / N;s = 0.0

for i in range(rank , N+1, size):x = h * i - 1s += numpy.sqrt(1 - x**2)

PI_part = numpy.array(2 * s * h, dtype=’d’)PI = numpy.array (0.0, ’d’)comm.Reduce ([PI_part , MPI.DOUBLE], PI , op=MPI.SUM , root =0)

if rank == 0:print ("Calculated value of PI is: %f16" % PI)

GPGPU computing

DefinitionGeneral-purpose graphics processing units - is the utilization of agraphics processing unit (GPU), which typically handles computation onlyfor computer graphics, to perform computation in applications traditionallyhandled by the central processing unit (CPU).

TechnologiesNVidia CUDAOpenCLOpenACC...

NVidia CUDA paradigm

Features and advantages of CUDAFeatures

Support of SDK for Linux, Mac OS and WindowsSIMT ArchitectureThreads are grouped in warps(32), warps in blocks, blocks in grids.Cores have low-frequency.Key concpet in programming is kernel. Kernel is executed by eachthread.

Advantages

Many cores (good for massive-parallel tasks)Fast downloads and readbacks to and from the GPUEven laptop can provide fast calculationStandard libraries CUFFT, CUBLAS, Thrust, ... are simple to use

CUDA C kernel. Simple example#define N 1000

__global__ void add( int *a, int *b, int *c ) {int tid = blockIdx.x;if (tid < N)

c[tid] = a[tid] + b[tid];}

int main( void ) {int a[N], b[N], c[N];int *dev_a , *dev_b , *dev_c;

// allocate the memory on the GPU// fill the arrays ’a’ and ’b’ on the CPU// copy the arrays ’a’ and ’b’ to the GPU

add <<<N,1>>>( dev_a , dev_b , dev_c );

// copy the array ’c’ back from the GPU to the CPU// display the results// free the memory allocated on the GPU

return 0;}

PyCUDAimport pycuda.driver as cudaimport pycuda.autoinitfrom pycuda.compiler import SourceModuleimport numpy

a = numpy.round(numpy.random.randn (4,4)*10,0)a = a.astype(numpy.float32)a_gpu = cuda.mem_alloc(a.nbytes)cuda.memcpy_htod(a_gpu , a)mod = SourceModule("""

__global__ void doublify(float *a){

int idx = threadIdx.x + threadIdx.y*4;a[idx] *= 2;

func = mod.get_function("doublify")func(a_gpu , block =(4,4,1))a_doubled = numpy.empty_like(a)cuda.memcpy_dtoh(a_doubled , a_gpu)print(a_doubled)print(a)

TheanoWebsite – http://deeplearning.net/software/theano/Features

Tight integration with NumPyTransparent use of a GPUEfficient symbolic differentiation

MLP benchmark:

Theano vs NumexprAll on CPUSolid blue: TheanoDashed Red: numexpr (without MKL)

Theano. Example

import theano.tensor as Tfrom theano import functionimport numpy as npimport time

a = T.matrix ()b = T.matrix ()out = a ** 2 + b ** 2 + 2*a*bf = function ([a, b], out)

x = np.random.randn (1000000 ,100).astype(np.float32)y = np.random.randn (1000000 ,100).astype(np.float32)t1 = time.time()res = f(x, y)print(’Theano:’, time.time() - t1)

Theano CPU: 0.8769919872283936 secondsTheano GPU: 0.3814992904663086 seconds

Comparison with Numpy

import numpy as npimport time

def f(a,b):return a**2 + b**2 + 2*a*b

x = np.random.randn (1000000 ,100).astype(np.float32)y = np.random.randn (1000000 ,100).astype(np.float32)t1 = time.time()res = f(x, y)print(’Numpy:’, time.time() - t1)

Numpy: 0.8708460330963135 seconds

Thanks!alekseytyulpin@gmail.com

hpc in python

Software

improving python- based workloads in hpc - microsoft ·...

jussi enkovaara harri hämäläinen exercises for python in...

hpc in bridges

an introduction to python - hpc-forge › files ›...

python for hpc: best practices for hpc: best practices ......

debugging and profiling hpc applications · arm map: python...

python in hpc · python in hpc nih high performance ......

python for matlab users - csdms · python for matlab users...

hpc in poland

code generation for high-level problem speci cation and...

upcoming biowulfseminars - nih hpc › training › handouts...

python in hpc - exascale computing project · basic...

hpc python workshop: matplotlib · hpc python workshop:...

course work and research thesis masters in … ·...

scientific( python( basics(on(gacrc clusters( 7i ·...

hpc in healthcare

python best practices on blue waters2 python on blue waters...

what’s working in hpc: investigating hpc user...

hpc python programming

flash in hpc