hpc in python

Post on 06-May-2015

DESCRIPTION

The author gave a lecture at the Northern (Arctic) Federal University, Arkhangelsk, Russia, while he was studying HPC.

TRANSCRIPT

High Performance Computing in Python

A. Tyulpin
alekseytyulpin@gmail.com

Measurement Systems and Digital Signal Processing laboratory, Northern (Arctic) Federal University

Arkhangelsk, 2014

A. Tyulpin (DSPLab) HPC in Python Arkhangelsk, 2014 1 / 22

Plan

1 Introduction
2 Computing on CPU
    1 Multiprocessing module
    2 MPI and mpi4py
3 GPGPU computing (NVidia CUDA)
    1 CUDA and PyCUDA
    2 Theano library

Introduction

Why Python?
  Fast and easy development
  Flexibility
  Extensibility (many libraries)
  Possible to use code written in Fortran, C and C++

Popular scientific libraries
  NumPy
  SciPy
  scikit-learn
  matplotlib
  Pandas
  ...

Multiprocessing module

Features
  Process-based parallelism (separate processes instead of threads)
  Built-in module of the standard library
  Interactions between processes via thread- and process-safe queues
  Synchronization between processes
  Possible to use shared memory
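
The queue, synchronization and shared-memory features listed above can be sketched in a few lines. This is a minimal illustration rather than part of the lecture: the names square_worker, tasks, results and done_count are hypothetical, and a None value is assumed as the "no more work" sentinel.

```python
from multiprocessing import Process, Queue, Value


def square_worker(tasks, results, done_count):
    """Take numbers from `tasks` until a None sentinel arrives."""
    while True:
        n = tasks.get()
        if n is None:                      # sentinel: no more work
            break
        results.put(n * n)                 # send the result back via a queue
        with done_count.get_lock():        # synchronize access to shared memory
            done_count.value += 1


if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    done_count = Value("i", 0)             # shared integer, protected by a lock

    workers = [
        Process(target=square_worker, args=(tasks, results, done_count))
        for _ in range(2)
    ]
    for w in workers:
        w.start()

    for n in range(10):
        tasks.put(n)
    for _ in workers:
        tasks.put(None)                    # one sentinel per worker

    # Drain the result queue before joining the workers.
    squares = sorted(results.get() for _ in range(10))
    for w in workers:
        w.join()

    print(squares, done_count.value)
```

The results are collected before join() is called, because a process that still has items buffered for a queue may not exit until they are consumed.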

Multiprocessing module
Process-based parallelism.

    from multiprocessing import Pool, cpu_count
    import time, urllib.request
    from hashlib import sha256

    def process(prc_id):
        # Download a random Wikipedia page and hash its contents
        u = urllib.request.urlopen('http://en.wikipedia.org/wiki/Special:Random')
        data, hasher = u.read(), sha256()
        hasher.update(data)
        return (hasher.digest(), data)

    if __name__ == '__main__':
        # Parallel version: one worker process per CPU
        pool = Pool(processes=cpu_count())
        t1 = time.time()
        processed = pool.map(process, range(100))
        pool.close()
        pool.join()
        print('Parallel calculation takes', time.time() - t1)
        # Serial version for comparison
        t1 = time.time()
        processed = tuple(map(process, range(100)))
        print('Serial calculation takes', time.time() - t1)

Results and discussion

Hardware
  3 Mbit/s Internet connection
  CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz, 2 cores with multithreading

Results
  Parallel calculation takes 24.88 seconds
  Serial calculation takes 80.53 seconds (the parallel run is about 3.2 times faster)

The given example can be applied to MapReduce-like tasks: if you have many objects that can be processed independently, this approach is a good way to execute the tasks in parallel.
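
As an illustration of the MapReduce-like pattern mentioned above, here is a small word-count sketch. It is an assumption-laden example, not the author's code: count_words, merge_counts and word_count are hypothetical names, and the input is a short list of in-memory strings instead of downloaded pages.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_words(text):
    """Map step: count the words of one document independently."""
    return Counter(text.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial word counts."""
    return left + right


def word_count(texts, processes=2):
    """Run the map step in a process pool, then reduce in the parent."""
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_words, texts)
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    docs = [
        "Python is fast to write",
        "MPI is fast to run",
        "Python and MPI work together",
    ]
    print(word_count(docs).most_common(3))
```

The map step is the only part that runs in parallel here; the reduce step is cheap and stays in the parent process.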

Under the hood

1 pool = Pool() launches one slave process per logical processor on the computer (the value of cpu_count()). On Unix systems, the slaves are forked from the master process. Under Windows, a new process is started that imports the script.

2 pool.map(process, range(100)) divides the input list into chunks of roughly equal size and puts the tasks (function + chunk) on a todo list.

3 Each slave process takes a task (function + a chunk of data) from the todo list, runs map(function, chunk), and puts the result on a result list.

4 pool.map on the master process waits until all tasks are handled and returns the concatenation of the result lists.

Taken from http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
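
The four steps above can be emulated serially to make the data flow visible. This is only a sketch of the idea, not the actual implementation inside multiprocessing: split_into_chunks and emulate_pool_map are hypothetical helpers, and the chunk size is passed explicitly (the real Pool.map accepts an optional chunksize argument and otherwise picks one itself).

```python
def split_into_chunks(items, chunksize):
    """Step 2: divide the input into chunks of roughly equal size."""
    items = list(items)
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


def emulate_pool_map(func, items, chunksize):
    """Serial stand-in for pool.map(func, items)."""
    # Step 2: build the todo list of (function, chunk) tasks.
    todo = [(func, chunk) for chunk in split_into_chunks(items, chunksize)]
    # Step 3: each task would be run by some worker as map(function, chunk).
    result_lists = [list(map(f, chunk)) for f, chunk in todo]
    # Step 4: concatenate the per-chunk results in the original order.
    return [value for part in result_lists for value in part]


if __name__ == "__main__":
    print(split_into_chunks(range(7), 3))
    print(emulate_pool_map(abs, range(-5, 5), 3))
```

Larger chunks reduce the overhead of passing tasks between processes, while smaller chunks balance the load better when individual items take very different amounts of time.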

mpi4py

Website – http://mpi4py.scipy.org

Features
  Provides Python bindings for the Message Passing Interface (MPI)
  Point-to-point communications (sends, receives)
  Collective communications (broadcasts, scatters, gathers)
  Supports wrapping MPI-based C code using SWIG
  Supports virtual topologies

mpi4py. Hello world

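    #!/usr/bin/env python3

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # index of this process within the communicator

    print(rank)

There is no need to call MPI_Init() or MPI_Finalize(): mpi4py initializes MPI on import and finalizes it at exit.
Scripts are run using mpirun.

Execution and results
    mpirun -np 4 python3 ex1.py
    Results: 0 2 3 1 (the order varies between runs)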

mpi4py. MPI Broadcast

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    comm.Barrier()

    N = 5
    if comm.rank == 0:
        # The root process owns the data to broadcast
        A = np.arange(N, dtype=np.float64)
    else:
        # Other processes allocate a buffer of the same size and type
        A = np.empty(N, dtype=np.float64)
    comm.Bcast([A, MPI.DOUBLE])

    print("[%02d] %s" % (comm.rank, A))

mpi4py. MPI Reduce

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    N = 1000000
    h = 2.0 / N
    s = 0.0

    # Each rank sums every size-th sample of sqrt(1 - x^2) on [-1, 1]
    for i in range(rank, N + 1, size):
        x = h * i - 1
        s += numpy.sqrt(1 - x**2)

    # The area of the unit half-disc is PI / 2, hence the factor of 2
    PI_part = numpy.array(2 * s * h, dtype='d')
    PI = numpy.array(0.0, 'd')
    comm.Reduce([PI_part, MPI.DOUBLE], PI, op=MPI.SUM, root=0)

    if rank == 0:
        print("Calculated value of PI is: %.16f" % PI)
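
The decomposition used in the Reduce example can be checked without an MPI installation by emulating the ranks serially. The sketch below is an assumption-based illustration: partial_pi and emulate_reduce_sum are hypothetical names, the loop over ranks replaces the parallel processes, and the built-in sum replaces comm.Reduce with op=MPI.SUM. The max(0.0, ...) guard is an addition that protects against a tiny negative argument caused by rounding at x = 1.

```python
import math


def partial_pi(rank, size, n):
    """Contribution of one rank: every size-th sample, starting at `rank`."""
    h = 2.0 / n
    s = 0.0
    for i in range(rank, n + 1, size):
        x = h * i - 1.0
        s += math.sqrt(max(0.0, 1.0 - x * x))  # guard against rounding at x = 1
    return 2.0 * s * h


def emulate_reduce_sum(size, n):
    """Serial stand-in for Reduce(op=SUM): add up the parts of all ranks."""
    return sum(partial_pi(rank, size, n) for rank in range(size))


if __name__ == "__main__":
    estimate = emulate_reduce_sum(size=4, n=100000)
    print("Estimate:", estimate, "error:", abs(estimate - math.pi))
```

Because every index i from 0 to n belongs to exactly one rank, the sum of the parts equals the single-process result up to floating-point rounding, whatever the number of ranks.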

GPGPU computing

Definition
General-purpose computing on graphics processing units (GPGPU) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).

Technologies
  NVidia CUDA
  OpenCL
  OpenACC
  ...

NVidia CUDA paradigm

Features and advantages of CUDA

Features
  SDK support for Linux, Mac OS and Windows
  SIMT architecture
  Threads are grouped into warps (32 threads each), warps into blocks, and blocks into grids.
  Cores run at a low clock frequency.
  The key concept in programming is the kernel: a function that is executed by each thread.

Advantages
  Many cores (good for massively parallel tasks)
  Fast downloads and readbacks to and from the GPU
  Even a laptop can provide fast computation
  Standard libraries (cuFFT, cuBLAS, Thrust, ...) are simple to use

CUDA C kernel. Simple example

    #define N 1000

    __global__ void add( int *a, int *b, int *c ) {
        int tid = blockIdx.x;
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

    int main( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;

        // allocate the memory on the GPU
        // fill the arrays 'a' and 'b' on the CPU
        // copy the arrays 'a' and 'b' to the GPU

        add<<<N,1>>>( dev_a, dev_b, dev_c );

        // copy the array 'c' back from the GPU to the CPU
        // display the results
        // free the memory allocated on the GPU

        return 0;
    }
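
The kernel above is launched as N blocks of one thread each, so blockIdx.x alone identifies the element. A more typical launch uses several threads per block and the flat index blockIdx.x * blockDim.x + threadIdx.x. The Python sketch below emulates such a launch serially to show the indexing only; it is not CUDA code, and global_index, add_kernel and emulate_launch are hypothetical names introduced for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated on the CUDA features slide


def global_index(block_idx, block_dim, thread_idx):
    """CUDA-style flat index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    """Body executed by every (emulated) thread."""
    tid = global_index(block_idx, block_dim, thread_idx)
    if tid < len(c):  # bounds guard, like `if (tid < N)` in the CUDA C kernel
        c[tid] = a[tid] + b[tid]


def emulate_launch(kernel, grid_dim, block_dim, *args):
    """Serial stand-in for kernel<<<grid_dim, block_dim>>>(args)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


if __name__ == "__main__":
    n = 1000
    a = list(range(n))
    b = [2 * x for x in a]
    c = [0] * n
    block_dim = 128
    grid_dim = (n + block_dim - 1) // block_dim  # round up to cover all elements
    emulate_launch(add_kernel, grid_dim, block_dim, a, b, c)
    print(c[:5], "warps per block:", block_dim // WARP_SIZE)
```

Calling emulate_launch(add_kernel, n, 1, a, b, c) reproduces the indexing of the original add<<<N,1>>> launch, since the flat index then reduces to the block index.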

PyCUDA

    import pycuda.driver as cuda
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy

    # Create a 4x4 float32 matrix on the host and copy it to the GPU
    a = numpy.round(numpy.random.randn(4, 4) * 10, 0)
    a = a.astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

    # Compile the CUDA C kernel at run time
    mod = SourceModule("""
    __global__ void doublify(float *a)
    {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
    }
    """)

    # Launch one block of 4x4 threads, then copy the result back
    func = mod.get_function("doublify")
    func(a_gpu, block=(4, 4, 1))
    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)
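
To see what the doublify kernel computes without a GPU, the following sketch emulates the 4x4 thread block in NumPy. It is only a CPU illustration of the index arithmetic: doublify_emulated is a hypothetical name, and the two loops stand in for the 16 threads that the GPU would run concurrently.

```python
import numpy as np


def doublify_emulated(a):
    """Emulate the 4x4 thread block: each (tx, ty) pair doubles one element."""
    flat = a.astype(np.float32).ravel()   # row-major copy, like the device buffer
    for ty in range(4):                   # plays the role of threadIdx.y
        for tx in range(4):               # plays the role of threadIdx.x
            idx = tx + ty * 4             # same index formula as in the kernel
            flat[idx] *= 2
    return flat.reshape(a.shape)


if __name__ == "__main__":
    a = np.round(np.random.randn(4, 4) * 10, 0).astype(np.float32)
    print(doublify_emulated(a))
    print(a)
```

The formula threadIdx.x + threadIdx.y*4 is the row-major offset of element (y, x) in a matrix with 4 columns, which is why a single block of shape (4, 4, 1) covers the whole array.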

Theano

Website – http://deeplearning.net/software/theano/

Features
  Tight integration with NumPy
  Transparent use of a GPU
  Efficient symbolic differentiation

MLP benchmark: [figure]

Theano vs Numexpr

[figure] All on CPU
  Solid blue: Theano
  Dashed red: numexpr (without MKL)

Theano. Example

    #!/usr/bin/env python3

    import theano.tensor as T
    from theano import function
    import numpy as np
    import time

    a = T.matrix()
    b = T.matrix()
    out = a ** 2 + b ** 2 + 2*a*b
    f = function([a, b], out)

    x = np.random.randn(1000000, 100).astype(np.float32)
    y = np.random.randn(1000000, 100).astype(np.float32)
    t1 = time.time()
    res = f(x, y)
    print('Theano:', time.time() - t1)

Theano CPU: 0.8769919872283936 seconds
Theano GPU: 0.3814992904663086 seconds

Comparison with Numpy

    #!/usr/bin/env python3

    import numpy as np
    import time

    def f(a, b):
        return a**2 + b**2 + 2*a*b

    x = np.random.randn(1000000, 100).astype(np.float32)
    y = np.random.randn(1000000, 100).astype(np.float32)
    t1 = time.time()
    res = f(x, y)
    print('Numpy:', time.time() - t1)

Numpy: 0.8708460330963135 seconds
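
A single time.time() measurement like the one above can be noisy. The sketch below shows one possible way to repeat the NumPy measurement more robustly, and also times an algebraically equivalent form, (a + b)**2, that allocates fewer temporary arrays. It makes no claim about what Theano does internally: f_expanded, f_factored and best_time are hypothetical names, and the array size is reduced so that the example runs quickly.

```python
import time

import numpy as np


def f_expanded(a, b):
    """The expression from the slides; creates several temporary arrays."""
    return a**2 + b**2 + 2*a*b


def f_factored(a, b):
    """Equivalent form (a + b)**2, computed with a single temporary."""
    out = a + b
    np.multiply(out, out, out=out)  # square in place
    return out


def best_time(func, *args, repeats=3):
    """Best wall-clock time over several runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    x = np.random.randn(1000, 100).astype(np.float32)
    y = np.random.randn(1000, 100).astype(np.float32)
    print("expanded:", best_time(f_expanded, x, y))
    print("factored:", best_time(f_factored, x, y))
```

The two forms agree only up to floating-point rounding, so results should be compared with a tolerance rather than for exact equality.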

Thanks!
alekseytyulpin@gmail.com
