Post on 06-May-2015
High Performance Computing in Python
A. Tyulpin
alekseytyulpin@gmail.com
Measurement Systems and Digital Signal Processing laboratory, Northern (Arctic) Federal University
Arkhangelsk, 2014
A. Tyulpin (DSPLab) HPC in Python Arkhangelsk, 2014 1 / 22
Plan
1 Introduction
2 Computing on CPU
    1 Multiprocessing module
    2 MPI and mpi4py
3 GPGPU computing (NVidia CUDA)
    1 CUDA and PyCUDA
    2 Theano library
Introduction
Why Python?
- Fast and easy development
- Flexibility
- Extensibility (many libraries)
- Possible to use code written in Fortran, C and C++
Popular scientific libraries
- NumPy
- SciPy
- scikit-learn
- matplotlib
- Pandas
- ...
Multiprocessing module
Features
- Process-based parallelism
- Built-in module
- Interactions between processes via thread- and process-safe queues
- Synchronization between processes
- Possible to use shared memory
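A minimal sketch of the queue-based interaction listed above, assuming a simple layout in which the main process feeds numbers to a few workers. The names square_worker and run_workers are illustrative and do not come from the slides; only Process and Queue are part of the multiprocessing module.

```python
from multiprocessing import Process, Queue


def square_worker(tasks, results):
    """Take numbers from `tasks` until a None sentinel arrives; put (n, n*n) on `results`."""
    while True:
        n = tasks.get()
        if n is None:              # sentinel: no more work for this worker
            break
        results.put((n, n * n))


def run_workers(numbers, n_workers=2):
    """Square `numbers` in `n_workers` processes and return {n: n*n}."""
    numbers = list(numbers)
    tasks, results = Queue(), Queue()
    workers = [Process(target=square_worker, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for n in numbers:
        tasks.put(n)
    for _ in workers:
        tasks.put(None)            # one sentinel per worker
    # Drain the results before joining so that no worker blocks on a full pipe
    collected = dict(results.get() for _ in numbers)
    for w in workers:
        w.join()
    return collected


if __name__ == '__main__':
    print(run_workers(range(5)))
```

The worker only relies on get() and put(), so it can be tried out with any queue-like object before real processes are involved.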
Multiprocessing module
Process-based parallelism.

    from multiprocessing import Pool, cpu_count
    import time, urllib.request
    from hashlib import sha256

    def process(prc_id):
        # Download a random Wikipedia page and hash its contents
        u = urllib.request.urlopen('http://en.wikipedia.org/wiki/Special:Random')
        data, hasher = u.read(), sha256()
        hasher.update(data)
        return (hasher.digest(), data)

    if __name__ == '__main__':
        pool = Pool(processes=cpu_count())
        t1 = time.time()
        processed = pool.map(process, range(100))
        pool.close()
        pool.join()
        print('Parallel calculation takes', time.time() - t1)
        t1 = time.time()
        processed = tuple(map(process, range(100)))
        print('Serial calculation takes', time.time() - t1)
Results and discussion
Hardware
- 3 Mbit/s Internet connection
- CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz, 2 cores with multithreading
Results
- Parallel calculation takes 24.88 seconds
- Serial calculation takes 80.53 seconds
The given example can be used in MapReduce-like tasks. For example, if you have many objects that can be processed independently, this approach is a good way to execute the tasks in parallel.
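A MapReduce-like task of this kind could look roughly as follows. This is a sketch under assumptions: the word-count problem and the names count_words, merge_counts and word_count are chosen for illustration and are not part of the original talk. The map step is handed to pool.map, the reduce step runs in the main process.

```python
from functools import reduce
from multiprocessing import Pool


def count_words(line):
    """Map step: word frequencies of a single line, computed independently."""
    counts = {}
    for word in line.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial dictionaries into a new one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def word_count(lines, pool_map=map):
    """Apply the map step with `pool_map` (serial `map` by default), then reduce."""
    return reduce(merge_counts, pool_map(count_words, lines), {})


if __name__ == '__main__':
    lines = ['to be or not to be', 'that is the question', 'to be is to do']
    with Pool() as pool:
        print(word_count(lines, pool_map=pool.map))
```

Passing the mapping function as a parameter keeps the serial and the parallel variant identical apart from one argument, which makes it easy to compare their results.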
Under the hood
1 pool = Pool() launches one worker process per available CPU (logical cores are counted, not physical processors). On Unix systems using the fork start method, the workers are forked from the main process. Under Windows, a new process is started that imports the script.
2 pool.map(process, range(100)) divides the input list into chunks of roughly equal size and puts the tasks (function + chunk) on a to-do list.
3 Each worker process takes a task (function + a chunk of data) from the to-do list, runs map(function, chunk), and puts the result on a result list.
4 pool.map on the main process waits until all tasks are handled and returns the concatenation of the result lists.
Taken from http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
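The four steps above can be imitated serially to see what ends up on the to-do list. This is an illustrative sketch, not the actual implementation inside multiprocessing: split_into_chunks, run_chunk and chunked_map are hypothetical helper names. The only library feature used is the chunksize argument of pool.map.

```python
from multiprocessing import Pool


def split_into_chunks(items, chunksize):
    """Cut `items` into consecutive lists of at most `chunksize` elements."""
    items = list(items)
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


def run_chunk(task):
    """What a worker does with one task: apply the function to every item of the chunk."""
    func, chunk = task
    return list(map(func, chunk))


def chunked_map(func, items, chunksize, task_map=map):
    """Serial stand-in for pool.map: build tasks, run them, concatenate the results."""
    tasks = [(func, chunk) for chunk in split_into_chunks(items, chunksize)]
    results = []
    for part in task_map(run_chunk, tasks):
        results.extend(part)
    return results


if __name__ == '__main__':
    print(split_into_chunks(range(10), 4))
    print(chunked_map(abs, [-3, 2, -1], 2))
    # The real pool accepts the chunk size directly
    with Pool(2) as pool:
        print(pool.map(abs, range(-5, 5), chunksize=3))
```

Larger chunks mean fewer tasks and less communication overhead, while smaller chunks balance the load better when items take unequal time.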
mpi4py
Website – http://mpi4py.scipy.org
Features
- Provides bindings of the Message Passing Interface (MPI)
- Point-to-point (sends, receives) communications
- Collective (broadcasts, scatters, gathers) communications
- Support for wrapping MPI-based C code using SWIG
- Support of virtual topologies
mpi4py. Hello world
    #!/usr/bin/env python3
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    print(rank)

- No need to call MPI_Init() or MPI_Finalize() explicitly
- Scripts are run using mpirun

Execution and results
    mpirun -np 4 python3 ex1.py
Results: 0 2 3 1 (the order of the ranks may differ between runs)
mpi4py. MPI Broadcast
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    comm.Barrier()

    N = 5
    if comm.rank == 0:
        A = np.arange(N, dtype=np.float64)    # root holds the data
    else:
        A = np.empty(N, dtype=np.float64)     # other ranks allocate a receive buffer
    comm.Bcast([A, MPI.DOUBLE])

    print("[%02d] %s" % (comm.rank, A))
mpi4py. MPI Reduce
    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    N = 1000000
    h = 2.0 / N
    s = 0.0

    # Each rank sums every size-th sample of sqrt(1 - x^2) on [-1, 1]
    for i in range(rank, N+1, size):
        x = h * i - 1
        s += numpy.sqrt(1 - x**2)

    PI_part = numpy.array(2 * s * h, dtype='d')
    PI = numpy.array(0.0, 'd')
    comm.Reduce([PI_part, MPI.DOUBLE], PI, op=MPI.SUM, root=0)

    if rank == 0:
        print("Calculated value of PI is: %.16f" % PI)
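The way the work is split between ranks can be checked without an MPI installation. The sketch below is an assumption-laden stand-in: it loops over the ranks serially and replaces comm.Reduce with a plain sum, and the names partial_pi and simulated_reduce are hypothetical. It uses a smaller N than the slide to keep the run short.

```python
import math


def partial_pi(rank, size, n=100000):
    """Contribution of one rank: every `size`-th sample of 2*sqrt(1 - x^2) on [-1, 1]."""
    h = 2.0 / n
    s = 0.0
    for i in range(rank, n + 1, size):
        x = h * i - 1
        s += math.sqrt(max(0.0, 1 - x * x))   # clamp tiny negative rounding errors
    return 2 * s * h


def simulated_reduce(size, n=100000):
    """Stand-in for comm.Reduce(..., op=MPI.SUM): add up the per-rank parts."""
    return sum(partial_pi(rank, size, n) for rank in range(size))


if __name__ == '__main__':
    for size in (1, 2, 4):
        print(size, simulated_reduce(size))
```

Because the ranks take interleaved samples (rank, rank + size, rank + 2*size, ...), every sample is used exactly once, so the total should not depend on the number of ranks beyond floating-point rounding.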
GPGPU computing
Definition
General-purpose computing on graphics processing units (GPGPU) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
Technologies
- NVidia CUDA
- OpenCL
- OpenACC
- ...
NVidia CUDA paradigm
[Two figure slides; the images are not preserved in this transcript]
Features and advantages of CUDA
Features
- SDK support for Linux, Mac OS and Windows
- SIMT architecture
- Threads are grouped into warps (32 threads), warps into blocks, blocks into grids.
- Cores run at a low clock frequency.
- The key concept in programming is the kernel. A kernel is executed by each thread.
Advantages
- Many cores (good for massively parallel tasks)
- Fast downloads and readbacks to and from the GPU
- Even a laptop can provide fast computation
- Standard libraries (CUFFT, CUBLAS, Thrust, ...) are simple to use
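The thread hierarchy can be made concrete with a little index arithmetic. The following is a serial Python emulation, not GPU code: launch, global_thread_id, warps_per_block and blocks_for are hypothetical helper names, and the one-dimensional formula blockIdx.x * blockDim.x + threadIdx.x is the usual CUDA convention rather than something shown on the slides.

```python
WARP_SIZE = 32  # threads per warp, as stated in the feature list


def global_thread_id(block_idx, block_dim, thread_idx):
    """1-D equivalent of blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def warps_per_block(block_dim):
    """Number of warps needed for a block; a partially filled warp still counts."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


def blocks_for(n_items, block_dim):
    """Grid size so that every item gets one thread (rounded up)."""
    return (n_items + block_dim - 1) // block_dim


def launch(kernel, n_items, block_dim):
    """Serial emulation of a kernel launch: call `kernel` once per thread."""
    for block_idx in range(blocks_for(n_items, block_dim)):
        for thread_idx in range(block_dim):
            kernel(global_thread_id(block_idx, block_dim, thread_idx))


if __name__ == '__main__':
    a, b = list(range(1000)), list(range(1000))
    c = [0] * 1000

    def add(tid):
        if tid < len(c):           # guard: the last block may have spare threads
            c[tid] = a[tid] + b[tid]

    launch(add, n_items=1000, block_dim=256)
    print(blocks_for(1000, 256), warps_per_block(256), c[:3], c[-1])
```

On a real GPU the two loops in launch do not exist: all threads of the grid run the kernel concurrently, which is why the kernel body needs the bounds check.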
CUDA C kernel. Simple example

    #define N 1000

    __global__ void add( int *a, int *b, int *c ) {
        int tid = blockIdx.x;
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

    int main( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;

        // allocate the memory on the GPU
        // fill the arrays 'a' and 'b' on the CPU
        // copy the arrays 'a' and 'b' to the GPU

        add<<<N,1>>>( dev_a, dev_b, dev_c );

        // copy the array 'c' back from the GPU to the CPU
        // display the results
        // free the memory allocated on the GPU

        return 0;
    }
PyCUDA

    import pycuda.driver as cuda
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy

    a = numpy.round(numpy.random.randn(4, 4) * 10, 0)
    a = a.astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    mod = SourceModule("""
        __global__ void doublify(float *a)
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    func = mod.get_function("doublify")
    func(a_gpu, block=(4, 4, 1))
    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)
Theano
Website – http://deeplearning.net/software/theano/
Features
- Tight integration with NumPy
- Transparent use of a GPU
- Efficient symbolic differentiation
MLP benchmark: [figure not preserved in this transcript]
Theano vs Numexpr
[Benchmark plot; the image is not preserved in this transcript]
All on CPU
Solid blue: Theano
Dashed red: numexpr (without MKL)
Theano. Example
    #!/usr/bin/env python3
    import theano.tensor as T
    from theano import function
    import numpy as np
    import time

    a = T.matrix()
    b = T.matrix()
    out = a ** 2 + b ** 2 + 2*a*b
    f = function([a, b], out)

    x = np.random.randn(1000000, 100).astype(np.float32)
    y = np.random.randn(1000000, 100).astype(np.float32)
    t1 = time.time()
    res = f(x, y)
    print('Theano:', time.time() - t1)

Theano CPU: 0.8769919872283936 seconds
Theano GPU: 0.3814992904663086 seconds
Comparison with Numpy
    #!/usr/bin/env python3
    import numpy as np
    import time

    def f(a, b):
        return a**2 + b**2 + 2*a*b

    x = np.random.randn(1000000, 100).astype(np.float32)
    y = np.random.randn(1000000, 100).astype(np.float32)
    t1 = time.time()
    res = f(x, y)
    print('Numpy:', time.time() - t1)

Numpy: 0.8708460330963135 seconds
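A single time.time() measurement can be noisy, so timings like the ones above are often repeated. The helper below is one possible way to do that, given as a sketch: best_of is a hypothetical name, and the pure-Python workload only stands in for the NumPy and Theano calls so that the example runs without extra libraries.

```python
import time


def best_of(func, repeats=5):
    """Run `func` several times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def f(a, b):
    """Same expression as in the slides, here applied to plain Python numbers."""
    return a**2 + b**2 + 2*a*b


if __name__ == '__main__':
    xs = [0.001 * i for i in range(100000)]
    ys = [1.0 - x for x in xs]
    elapsed = best_of(lambda: [f(x, y) for x, y in zip(xs, ys)])
    print('Pure Python, best of 5:', elapsed)
```

To time the library versions, the same helper could be called as best_of(lambda: f(x, y)) with the arrays and the function f from the corresponding slide.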
Thanks!
alekseytyulpin@gmail.com