hpc in python

High Performance Computing in Python
A. Tyulpin
Measurement Systems and Digital Signal Processing laboratory (DSPLab), Northern (Arctic) Federal University
Arkhangelsk, 2014

Upload: aleksey-tyulpin

Post on 06-May-2015

DESCRIPTION

The author gave this lecture at the Northern (Arctic) Federal University, Arkhangelsk, Russia, while he was studying HPC.

TRANSCRIPT

Page 2: HPC in python

Plan

1. Introduction
2. Computing on CPU
   1. Multiprocessing module
   2. MPI and mpi4py
3. GPGPU computing (NVidia CUDA)
   1. CUDA and PyCUDA
   2. Theano library

Page 3: HPC in python

Introduction
Why Python?
- Fast and easy development
- Flexibility
- Extensibility (many libraries)
- Possible to use code written in Fortran, C and C++

Popular scientific libraries
- NumPy
- SciPy
- scikit-learn
- matplotlib
- Pandas
- ...

Page 4: HPC in python

Multiprocessing module

Features
- Process-based parallelism (separate processes instead of threads)
- Built-in module of the standard library
- Interaction between processes via thread- and process-safe queues
- Synchronization between processes
- Shared memory can be used

Page 5: HPC in python

Multiprocessing module
Process-based parallelism.

    from multiprocessing import Pool, cpu_count
    import time, urllib.request
    from hashlib import sha256

    def process(prc_id):
        # Download a random Wikipedia page and hash its contents
        u = urllib.request.urlopen('http://en.wikipedia.org/wiki/Special:Random')
        data, hasher = u.read(), sha256()
        hasher.update(data)
        return (hasher.digest(), data)

    if __name__ == '__main__':
        # Parallel version: one worker process per CPU
        pool = Pool(processes=cpu_count())
        t1 = time.time()
        processed = pool.map(process, range(100))
        pool.close()
        pool.join()
        print('Parallel calculation takes', time.time() - t1)
        # Serial version of the same work, for comparison
        t1 = time.time()
        processed = tuple(map(process, range(100)))
        print('Serial calculation takes', time.time() - t1)

Page 6: HPC in python

Results and discussion

Hardware
- 3 Mbit/s Internet connection
- CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz, 2 cores with multithreading

Results
- Parallel calculation takes 24.88 seconds
- Serial calculation takes 80.53 seconds

The given example could be used in MapReduce-like tasks. For example, if you have many objects that can be processed independently, this approach is a good way to execute the tasks in parallel.

Page 7: HPC in python

Under the hood

1. pool = Pool() launches one slave process per processor reported by cpu_count() (logical processors, not physical ones). On Unix systems, the slaves are forked from the master process. Under Windows, a new process is started that imports the script.
2. pool.map(process, range(100)) divides the input list into chunks of roughly equal size and puts the tasks (function + chunk) on a todo list.
3. Each slave process takes a task (function + a chunk of data) from the todo list, runs map(function, chunk), and puts the result on a result list.
4. pool.map on the master process waits until all tasks are handled and returns the concatenation of the result lists.

Taken from http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
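
The four steps above can be modeled in a single process. This is only a minimal sketch, not the actual multiprocessing implementation: the helper names split_into_chunks and model_pool_map are hypothetical, using exactly one chunk per worker is a simplifying assumption (the real Pool may choose a different number of chunks), and the "workers" run one after another instead of in parallel.

```python
from itertools import chain


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous chunks of roughly equal size."""
    items = list(items)
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional element
        stop = start + size + (1 if i < extra else 0)
        if stop > start:
            chunks.append(items[start:stop])
        start = stop
    return chunks


def model_pool_map(function, items, n_workers):
    """Sequential model of pool.map: build tasks, run them, concatenate results."""
    # Step 2: the todo list holds (function, chunk) tasks
    todo = [(function, chunk) for chunk in split_into_chunks(items, n_workers)]
    # Step 3: each task is handled by running map(function, chunk)
    results = [list(map(f, chunk)) for f, chunk in todo]
    # Step 4: the result lists are concatenated in task order
    return list(chain.from_iterable(results))


def square(x):
    return x * x


squares = model_pool_map(square, range(10), n_workers=4)
print(squares)
```

Because the chunks are contiguous and the partial results are concatenated in task order, the output keeps the order of the input, which is also what pool.map returns.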

Page 8: HPC in python

mpi4py

Website – http://mpi4py.scipy.org

Features
- Provides bindings of the Message Passing Interface (MPI)
- Point-to-point (sends, receives) communications
- Collective (broadcasts, scatters, gathers) communications
- Supports wrapping MPI-based C code with SWIG
- Support of virtual topologies

Page 9: HPC in python

mpi4py. Hello world

    #!/usr/bin/env python3

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    print(rank)

- There is no need to call MPI_Init() or MPI_Finalize() explicitly
- Scripts are run using mpirun

Execution and results
    mpirun -np 4 python3 ex1.py
Results: 0 2 3 1 (the order in which the ranks print is not deterministic)

Page 10: HPC in python

mpi4py. MPI Broadcast

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    comm.Barrier()

    N = 5
    if comm.rank == 0:
        # The root process holds the data to be broadcast
        A = np.arange(N, dtype=np.float64)
    else:
        # The other processes only allocate a receive buffer
        A = np.empty(N, dtype=np.float64)
    comm.Bcast([A, MPI.DOUBLE])

    print("[%02d] %s" % (comm.rank, A))

Page 11: HPC in python

mpi4py. MPI Reduce

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    N = 1000000
    h = 2.0 / N
    s = 0.0

    # Each process sums every size-th point of the grid on [-1, 1]
    for i in range(rank, N + 1, size):
        x = h * i - 1
        s += numpy.sqrt(1 - x**2)

    PI_part = numpy.array(2 * s * h, dtype='d')
    PI = numpy.array(0.0, 'd')
    # Sum the partial results of all processes on the root process
    comm.Reduce([PI_part, MPI.DOUBLE], PI, op=MPI.SUM, root=0)

    if rank == 0:
        print("Calculated value of PI is: %.16f" % PI)
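
The decomposition used in this example can be checked without MPI. The area under sqrt(1 - x^2) on [-1, 1] is PI/2, so twice the sum of h * sqrt(1 - x_i^2) over the grid points approximates PI; each rank takes the points i = rank, rank + size, rank + 2*size, ... and the reduction adds the partial sums. The sketch below emulates the ranks in one process; the function names partial_pi and emulated_reduce_sum are hypothetical, and the smaller N is an assumption made to keep the pure-Python loop fast.

```python
import math


def partial_pi(rank, size, n):
    """Contribution of one emulated rank: every size-th grid point, starting at rank."""
    h = 2.0 / n
    s = 0.0
    for i in range(rank, n + 1, size):
        x = h * i - 1
        s += math.sqrt(1 - x * x)
    return 2 * s * h


def emulated_reduce_sum(size, n):
    """Stand-in for comm.Reduce(..., op=MPI.SUM): add up all partial results."""
    return sum(partial_pi(rank, size, n) for rank in range(size))


pi_estimate = emulated_reduce_sum(size=4, n=100_000)
print("Estimated value of PI: %.10f" % pi_estimate)
```

The estimate does not depend on the number of emulated ranks, apart from floating-point rounding, because the ranks together cover every grid point exactly once.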

Page 12: HPC in python

GPGPU computing

Definition
General-purpose computing on graphics processing units (GPGPU) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).

Technologies
- NVidia CUDA
- OpenCL
- OpenACC
- ...

Page 13: HPC in python

NVidia CUDA paradigm

Page 14: HPC in python

NVidia CUDA paradigm

Page 15: HPC in python

Features and advantages of CUDA

Features
- SDK is available for Linux, Mac OS and Windows
- SIMT architecture
- Threads are grouped into warps (32 threads), warps into blocks, blocks into grids
- Cores run at a low clock frequency
- The key concept in programming is the kernel. A kernel is executed by each thread.

Advantages
- Many cores (good for massively parallel tasks)
- Fast downloads and readbacks to and from the GPU
- Even a laptop can provide fast calculations
- Standard libraries (CUFFT, CUBLAS, Thrust, ...) are simple to use

Page 16: HPC in python

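CUDA C kernel. Simple example

    #include <stdio.h>

    #define N 1000

    __global__ void add(int *a, int *b, int *c) {
        int tid = blockIdx.x;    // one thread per block, so the block index selects the element
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

    int main(void) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;

        // allocate the memory on the GPU
        cudaMalloc((void **)&dev_a, N * sizeof(int));
        cudaMalloc((void **)&dev_b, N * sizeof(int));
        cudaMalloc((void **)&dev_c, N * sizeof(int));

        // fill the arrays 'a' and 'b' on the CPU (values chosen only for illustration)
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        // copy the arrays 'a' and 'b' to the GPU
        cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        add<<<N, 1>>>(dev_a, dev_b, dev_c);    // N blocks of 1 thread each

        // copy the array 'c' back from the GPU to the CPU
        cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

        // display the results
        for (int i = 0; i < N; i++) printf("%d + %d = %d\n", a[i], b[i], c[i]);

        // free the memory allocated on the GPU
        cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);

        return 0;
    }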
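
How a launch configuration maps threads to array elements can be illustrated with a plain-Python emulation. This is a sequential sketch and not CUDA: the helpers launch and add_kernel are hypothetical, and the block size of 256 is an arbitrary choice for illustration. It uses the usual global index blockIdx.x * blockDim.x + threadIdx.x, which reduces to blockIdx.x when each block has a single thread, as in the kernel above.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Call kernel once per (block, thread) pair of a 1-D launch, one after another."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    # Global index of this thread; equals block_idx when block_dim == 1
    tid = block_idx * block_dim + thread_idx
    # Guard against threads that fall beyond the end of the arrays
    if tid < len(c):
        c[tid] = a[tid] + b[tid]


n = 1000
a = list(range(n))
b = [2 * i for i in range(n)]

# Same configuration as add<<<N, 1>>>: N blocks of one thread each
c_one_thread_per_block = [0] * n
launch(add_kernel, n, 1, a, b, c_one_thread_per_block)

# Fewer, larger blocks: enough blocks of 256 threads to cover n elements
c_blocks_of_256 = [0] * n
launch(add_kernel, (n + 255) // 256, 256, a, b, c_blocks_of_256)

print(c_one_thread_per_block[:5])
```

Both configurations produce the same result. In the second one the last block has more threads than remaining elements, which is why the bounds check in the kernel matters.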

Page 17: HPC in python

PyCUDA

    import pycuda.driver as cuda
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy

    # Create a 4x4 single-precision array and copy it to the GPU
    a = numpy.round(numpy.random.randn(4, 4) * 10, 0)
    a = a.astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    # Compile the CUDA C kernel at run time
    mod = SourceModule("""
        __global__ void doublify(float *a)
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    # Launch one block of 4x4 threads, then copy the result back
    func = mod.get_function("doublify")
    func(a_gpu, block=(4, 4, 1))
    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)
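
The index formula in the doublify kernel can be checked on the CPU. The sketch below is a sequential, pure-Python emulation under stated assumptions: the function doublify_emulated is hypothetical, nested lists stand in for the NumPy array, and the input values are arbitrary. It shows that with a (4, 4, 1) block the thread (x, y) handles the element in row y and column x of a row-major 4x4 array, so every element is doubled exactly once.

```python
def doublify_emulated(a_flat, block):
    """Apply the kernel body once per thread of an (x, y, z) block, sequentially."""
    block_x, block_y, _ = block
    for thread_y in range(block_y):
        for thread_x in range(block_x):
            idx = thread_x + thread_y * 4   # same formula as in the kernel
            a_flat[idx] *= 2


# A 4x4 matrix with distinct values, flattened row by row (row-major order)
rows = [[float(4 * r + c) for c in range(4)] for r in range(4)]
a_flat = [value for row in rows for value in row]

doublify_emulated(a_flat, block=(4, 4, 1))

# Restore the 4x4 shape to compare with the input
a_doubled = [a_flat[4 * r:4 * r + 4] for r in range(4)]
print(a_doubled)
```

The hard-coded factor 4 in the index is the row length, so the formula is only valid for arrays with four columns.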

Page 18: HPC in python

Theano
Website – http://deeplearning.net/software/theano/

Features
- Tight integration with NumPy
- Transparent use of a GPU
- Efficient symbolic differentiation

MLP benchmark:

Page 19: HPC in python

Theano vs Numexpr
All on CPU
Solid blue: Theano
Dashed red: numexpr (without MKL)

Page 20: HPC in python

Theano. Example

    #!/usr/bin/env python3

    import theano.tensor as T
    from theano import function
    import numpy as np
    import time

    a = T.matrix()
    b = T.matrix()
    out = a ** 2 + b ** 2 + 2*a*b
    f = function([a, b], out)

    x = np.random.randn(1000000, 100).astype(np.float32)
    y = np.random.randn(1000000, 100).astype(np.float32)
    t1 = time.time()
    res = f(x, y)
    print('Theano:', time.time() - t1)

Theano CPU: 0.8769919872283936 seconds
Theano GPU: 0.3814992904663086 seconds

Page 21: HPC in python

Comparison with Numpy

    #!/usr/bin/env python3

    import numpy as np
    import time

    def f(a, b):
        return a**2 + b**2 + 2*a*b

    x = np.random.randn(1000000, 100).astype(np.float32)
    y = np.random.randn(1000000, 100).astype(np.float32)
    t1 = time.time()
    res = f(x, y)
    print('Numpy:', time.time() - t1)

Numpy: 0.8708460330963135 seconds
