TRANSCRIPT
Advanced and Parallel Python
December 1st, 2016
1
http://tinyurl.com/cq-advanced-python-20161201
By: Bart Oldeman and Pier-Luc St-Onge
Financial Partners
2
Setup for the workshop
1. Get a user ID and password paper (provided in class):
   ##: userNMXXXXXXXXXX **********
2. Access to local computer (replace ## and ___ with appropriate values, “___” is provided in class):
   a. User name: csuser##
   b. Password: ___@[S##
3. HTTPS connection to Colosse (replace **********):
   a. https://jupyter.calculquebec.ca
   b. User name: userNM
   c. Password: **********
   d. If requested:
      i. click the Start Server button, set walltime 8
3
Select Modules - Change Notebook Kernel
● In the Software tab, select:
  ○ compilers/llvm/3.7.1
  ○ compilers/gcc/4.8.5
● Open notebooks/01-stack.ipynb
  ○ File -> Save and Checkpoint
4
Import Examples and Exercises
In case the cq-formation-advanced-python folder is not in your home directory, open a Terminal and type:
module load apps/git/1.8.5.3 # If on Colosse
git clone -b ulaval \
https://github.com/calculquebec/cq-formation-advanced-python.git
cd cq-formation-advanced-python
5
Outline
● Revisiting the Scientific Python Stack
● Why (and What) is Python?
  ○ Accelerating Python code: PyPy and Numpy
  ○ Using C code from Python code
● Finding Bottlenecks - Profiling code
● Compiling Python Code
  ○ Using Cython and Numba
● Parallelizing Python Programs
  ○ Parallel Programming Concepts
  ○ The multiprocessing Module
  ○ MPI for Python (mpi4py)
6
7
The Scientific Python stack
Scientific Python stack
In the introductory workshop we looked at:
● Python itself
● Numpy, for numerical array objects
● Scipy, for higher-level routines
● IPython, an advanced Python shell
● Matplotlib, for plotting
On top of that we introduce some new components, for example:
● Cython, for speed and interfacing
● mpi4py, for using MPI in Python
8
9
Speeding up Python programs
Speeding up Python
Central example: approx_pi.c / approx_pi.py:
// approx_pi.c
double approx_pi(int intervals)
{
    double pi = 0.0;
    int i;
    for (i = 0; i < intervals; i++) {
        pi += (4 - ((i % 2) * 8)) /
              (double)(2 * i + 1);
    }
    return pi;
}
10
# approx_pi.py
def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / \
              float(2 * i + 1)
    return pi
Speeding up Python
Compile:
$ gcc -O2 pi_collect.c approx_pi.c -o pi_collect
$ ./pi_collect 100000000
.. Time = 0.88 sec
Python run (example on Guillimin):
$ module load iomkl/2015b Python/3.5.0
$ python pi_collect.py approx_pi 100000000
The compiled C code runs almost 100 times faster than the Python code (0.88 vs. 66 seconds with intervals = 100000000).
Note that “approx_pi” is the module to import for pi_collect.py.
11
Speeding up Python
How to speed up: two approaches
1. Make Python go faster
   a. Use the PyPy just-in-time compiler
   b. Use Numpy with vectorized code
   c. Use Cython
2. Call C code from Python
   a. Manually
   b. Use SWIG
   c. Use Ctypes
   d. Use Cython
   e. ....
12
Speeding up Python using PyPy
How to speed up: use PyPy:
$ module add pypy/3-2.4.0
$ pypy3 pi_collect.py approx_pi 100000000
gives 2.2 seconds (30 times faster)
An alternative to PyPy is Numba (not installed on Guillimin).
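Numba takes a similar just-in-time approach but works inside standard CPython. A minimal sketch of what Numba usage looks like (our illustration, not from the slides; the fallback decorator lets the code run even where Numba is absent, as on Guillimin):

```python
import math

try:
    from numba import jit  # JIT-compiles the loop to machine code
except ImportError:
    # Numba not installed: fall back to plain (slow) Python
    def jit(nopython=True):
        def wrap(func):
            return func
        return wrap

@jit(nopython=True)
def approx_pi(intervals):
    # same Leibniz series as approx_pi.py
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (2 * i + 1)
    return pi
```

With Numba present, the first call triggers compilation; subsequent calls run at near-C speed.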
13
Speeding up with numpy
How to speed up: use vectorized code:
from __future__ import division  # only needed for Python 2.x
import numpy

def approx_pi(intervals):
    pi1 = 4 / numpy.arange(1, intervals * 2, 4)
    pi2 = -4 / numpy.arange(3, intervals * 2, 4)
    return numpy.sum(pi1) + numpy.sum(pi2)
$ python3 pi_collect.py approx_pi_numpy 100000000
gives 1.4 seconds (47 times faster). Drawback: extra memory use.
How to speed up: Cython: see later
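The vectorized version can be checked against the plain loop with a quick sanity test (our addition, not from the slides):

```python
import numpy

def approx_pi_loop(intervals):
    # original pure-Python version
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi

def approx_pi_numpy(intervals):
    # positive terms 4/1, 4/5, ... and negative terms -4/3, -4/7, ...
    pi1 = 4 / numpy.arange(1, intervals * 2, 4)
    pi2 = -4 / numpy.arange(3, intervals * 2, 4)
    return numpy.sum(pi1) + numpy.sum(pi2)
```

Both compute the same partial sum of the Leibniz series, so the results agree to within floating-point rounding.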
14
15
Interfacing with C/C++/Fortran
Interfacing with C and C++
● There are at least 14 different ways to do it:
1. By hand using the Python API (*)
2. Pyrex
3. Cython (**)
4. SWIG (*)
5. SIP
6. Boost.Python
7. PyCXX
8. CTypes (*)
9. Py++
10. f2py (*)
11. PyD
12. Interrogate
13. Robin
14. Pybind11 (**)
(*) Quick introduction
(**) Most popular now, more thorough introduction
16
Using the Python API
● Pros: no extra dependencies
● Cons: a lot of boilerplate code, which can change between Python versions
/* Example of wrapping approx_pi() with the Python-C-API. */
#include <Python.h>
#include "approx_pi.h"
static PyObject* approx_pi_func(PyObject* self, PyObject* args)  // wrapped approx_pi()
{
    int value;
    double answer;
    if (!PyArg_ParseTuple(args, "i", &value))  // parse input: Python int to C int
        return NULL;
    /* if the above function returns false, an appropriate Python exception will
     * have been set, and the function simply returns NULL */
    answer = approx_pi(value);
    /* construct the output from approx_pi: C double to Python float */
    return Py_BuildValue("d", answer);
}
17
Using the Python API
/* define functions in module */
static PyMethodDef PiMethods[] =
{
    {"approx_pi", approx_pi_func, METH_VARARGS, "approximate Pi"},
    {NULL, NULL, 0, NULL}
};
static struct PyModuleDef PiModule = {
    PyModuleDef_HEAD_INIT, "approx_pi_pyapi", NULL, -1, PiMethods,
    NULL, NULL, NULL, NULL
};
/* module initialization */
PyMODINIT_FUNC PyInit_approx_pi_pyapi(void)
{
    return PyModule_Create(&PiModule);
}
Compile using $ python3 setup_approx_pi_pyapi.py build_ext --inplace

from distutils.core import setup, Extension
# define the extension module
module = Extension('approx_pi_pyapi', sources=['approx_pi_pyapi.c', 'approx_pi.c'])
setup(ext_modules=[module])  # run the setup
18
Using CTypes
● Pros: the ctypes package is in Python by default; a pure Python solution
● Cons: wrapped code must be in a shared library, and the interface is not fast
First compile approx_pi_ctypes.so:
$ gcc -fPIC -shared -O2 approx_pi.c -o approx_pi_ctypes.so
# approx_pi_ctypes.py
""" Example of wrapping approx_pi using ctypes. """
import ctypes
approx_pi_dll = ctypes.cdll.LoadLibrary('./approx_pi_ctypes.so') # find and load the library
approx_pi_dll.approx_pi.argtypes = [ctypes.c_int] # set the argument type
approx_pi_dll.approx_pi.restype = ctypes.c_double # set the return type
def approx_pi(arg):
    ''' Wrapper for approx_pi '''
    return approx_pi_dll.approx_pi(arg)
19
Using SWIG
● Mature solution
● Wrapper file is autogenerated from interface file.
/* approx_pi_swig.i */
/* Example of wrapping approx_pi using SWIG. */
%module approx_pi_swig
%{
/* the resulting C file should be built as a python extension */
#define SWIG_FILE_WITH_INIT
/* Includes the header in the wrapper code */
#include "approx_pi.h"
%}
/* Parse the header file to generate wrappers */
%include "approx_pi.h"
20
Using SWIG
● Use distutils as before (python3 setup_approx_pi_swig.py build_ext --inplace) but mention the interface file in the setup script.
from distutils.core import setup, Extension
approx_pi_module = Extension("_approx_pi_swig", sources=["approx_pi.c", "approx_pi_swig.i"])
setup(ext_modules=[approx_pi_module])
● This generates three files: approx_pi_swig.py, approx_pi_swig_wrap.c, and _approx_pi_swig*.so
21
Using f2py
● Fortran version: approx_pi.f90
subroutine approx_pi(intervals, pi)
integer, intent(in) :: intervals
double precision, intent(out) :: pi
integer i
pi = 0
do i = 0, intervals - 1
pi = pi + (4 - (mod(i,2) * 8)) / dble(2 * i + 1)
enddo
end subroutine approx_pi
● Compile using
  f2py3 -c -m approx_pi_f2py approx_pi.f90
● Then do
  python3 pi_collect.py approx_pi_f2py 100000000
22
23
Cython
Cython
● Cython compiles from Python (with extensions) to C.
● Based on Pyrex
● Goals: faster execution (especially with those extensions) and easier interoperability with other C code.
● Cython files use the .pyx extension.
24
Cython
● Example: approx_pi_cython1.pyx (same as approx_pi.py)
def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi
● Executing python3 setup_cython.py build_ext --inplace
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("*.pyx"))
turns all .pyx files into .c files and .so modules
● Run python3 pi_collect.py approx_pi_cython1 100000000
  ○ 25 seconds: the C code uses only Python objects.
25
Cython: declare variables
● Need to declare variables using cdef to make it fast
● Example: approx_pi_cython2.pyx
def approx_pi(int intervals):
    cdef double pi
    cdef int i
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi
● Execute python3 setup_cython.py build_ext --inplace
● Run python3 pi_collect.py approx_pi_cython2 100000000
  ○ 0.89 seconds: almost as fast as native C.
26
Cython: division
● Inspecting approx_pi_cython2.c we found it uses __Pyx_mod_long(__pyx_v_i, 2) instead of a plain __pyx_v_i % 2. This is because in C, -1 % 10 == -1, but in Python, -1 % 10 == 9.
● Here we can ignore this and tell Cython to use C behaviour, by adding the line
  # cython: cdivision=True
● Execute python3 setup_cython.py build_ext --inplace
  ○ Check that approx_pi_cython3.c uses %.
● Run python3 pi_collect.py approx_pi_cython3 100000000
  ○ 0.88 seconds: the same as native C.
● Note: use Cython in IPython/Jupyter using “%load_ext cythonmagic” and “%%cython” in a cell.
27
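The semantic difference is easy to check from plain Python, since math.fmod follows the C sign rule (this check is our addition, not from the slides):

```python
import math

# Python's % takes the sign of the divisor (result in [0, 10) here)
python_mod = -1 % 10               # 9
# C's % takes the sign of the dividend; math.fmod follows the C rule
c_style_mod = math.fmod(-1, 10)    # -1.0
```

For non-negative operands, as in the pi loop, both conventions agree, which is why cdivision=True is safe here.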
Cython: wrapping C code
● Last but not least: interfacing with C code:
# approx_pi_cython4.pyx
cdef extern from "approx_pi.h":
    double c_approx_pi "approx_pi" (int intervals)
    # C name: approx_pi, Cython name: c_approx_pi

def approx_pi(int intervals):
    return c_approx_pi(intervals)
● Plus special setup_cython4.py script
from distutils.core import setup, Extension
from Cython.Distutils import build_ext
setup(cmdclass={'build_ext': build_ext},
      ext_modules=[Extension("approx_pi_cython4",
                   sources=["approx_pi_cython4.pyx", "approx_pi.c"])])
● Execute python3 setup_cython4.py build_ext --inplace
● Run python3 pi_collect.py approx_pi_cython4 100000000
28
Parallel Programming Concepts
29
Vocabulary
● Serial tasks
  ○ Any task that cannot be split in two simultaneous sequences of actions
  ○ Examples: starting a process, reading a file, any communication between two processes
● Parallel tasks
  ○ Data parallelism: same action applied on different data. Could be serial tasks done in parallel.
  ○ Process parallelism: one action on one set of data. Action split in multiple processes or threads.
    ■ Data partitioning: rectangles or blocks
30
Parallel tasks
● Parallel efficiency (scaling)
  ○ Amdahl’s law: how long does it take to compute a fixed-size task with an infinite number of processors? (the serial fraction sets the limit)
  ○ Gustafson's law: what size of problem can we solve in a given time with N processors?
● Shared memory
  ○ Multiple threads share the same memory space in a single process: full read and write access.
● Distributed memory
  ○ Each process has its own memory space
  ○ Information is sent and received by messages
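Amdahl's law can be made concrete in a few lines. The formula S(N) = 1/((1-p) + p/N), where p is the parallel fraction of the work, is the standard statement of the law; the function itself is our illustration:

```python
def amdahl_speedup(p, n):
    """Speedup with n processors when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# With 90% parallel code, even a huge number of processors gives at most 10x,
# because the 10% serial part always runs at single-processor speed.
```

Gustafson's law answers the complementary question: if the problem grows with N, the achievable speedup keeps growing too.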
31
Distributed Memory Model
32
[Figure: two processes connected by a network; each holds its own copy of A(10) - different variables!]
Serial Code Parallelization
● Implicit Parallelization - minimum work for you
  ○ Threaded libraries (MKL, ACML, GOTO, etc.)
  ○ Compiler directives (OpenMP)
  ○ Good for desktops and shared memory machines
● Explicit Parallelization - work is required!
  ○ You tell what should be done on what CPU
  ○ Solution for distributed clusters (shared nothing!)
● Hybrid Parallelization - work is required!
  ○ Mix of implicit and explicit parallelization
    ■ Vectorization and parallel CPU instructions
  ○ Good for accelerators (CUDA, OpenCL, etc.)
33
The multiprocessing Module
34
The multiprocessing Module
● Because of the implementation of CPython (the global interpreter lock), only one thread at a time can execute Python code
  ○ This avoids common issues with the shared memory model: race conditions, ...
  ○ There is a threading module, but it is no longer recommended
● Solution: the multiprocessing module!
35
Pool of Workers
For embarrassingly parallel tasks, the Pool class allows the creation of worker processes. Each process will compute different data.
Warning: only works in a script!
36
from multiprocessing import Pool

def prod(values):
    return values[0] * values[1]

if __name__ == '__main__':
    N = 12
    values = [(i + 1, N - i)
              for i in range(0, N)]
    print(values)
    workers = Pool(processes=4)
    results = workers.map(prod, values)
    print(results)
Pool of Workers
● Run: python script.py
● What happens with 4 workers:
[Figure: the 12 value pairs distributed over the 4 worker processes]
37
Pool of Workers
Asynchronous map calls can be used in order to do something else in the main process. The map_async() method returns an AsyncResult object which can wait until all workers are done.
38
from multiprocessing import Pool
import time

def prod(values):
    time.sleep(1)
    return values[0] * values[1]

if __name__ == '__main__':
    N = 12
    values = [(i + 1, N - i)
              for i in range(0, N)]
    print(values)
    workers = Pool(processes=4)
    results = workers.map_async(prod, values)
    print('Waiting...')
    print(results.get(timeout=10))
Pool of Workers
Asynchronous map calls can use a callback function. Then, the main thread has to wait by first closing the access to workers, and by joining the pool of workers.
39
from multiprocessing import Pool
# prod() defined as in the previous example

def printRes(results):
    print(results)

if __name__ == '__main__':
    N = 12
    values = [(i + 1, N - i)
              for i in range(0, N)]
    print(values)
    workers = Pool(processes=4)
    results = workers.map_async(prod,
        values, callback=printRes)
    print('Waiting...')
    workers.close()
    workers.join()
Pool of Workers
● class Pool([processes[, ...]])
  ○ processes: number of worker processes. If None, processes = multiprocessing.cpu_count()
  ○ Methods:
    ■ map(func, iterable[, ...]): returns results
    ■ map_async(func, iterable[, ...]): returns an AsyncResult object
    ■ close(): closes access to worker processes
    ■ join(): waits for all workers to exit. Must call close() before.
40
Pool of Workers
● class AsyncResult
  ○ Methods:
    ■ get([timeout]): blocking, gets results as soon as they are available. In case of error, get() re-raises the exception.
    ■ wait([timeout]): blocking, waits until the call is done
    ■ ready(): non-blocking, returns a boolean indicating if the call has completed.
    ■ successful(): non-blocking, returns a boolean indicating if the call has succeeded.
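A short sketch tying these methods together (our example, not from the slides; the function and worker count are illustrative, and the worker function must be defined at module level so it can be pickled):

```python
from multiprocessing import Pool

def square(x):
    return x * x

def run():
    with Pool(processes=2) as workers:
        result = workers.map_async(square, range(6))
        result.wait(timeout=10)       # blocking: wait until the call is done
        assert result.ready()         # non-blocking: the call has completed
        assert result.successful()    # non-blocking: no worker raised
        return result.get(timeout=1)  # results are already available

if __name__ == '__main__':
    print(run())
```

Note that successful() may only be called once the result is ready, which is why the sketch calls wait() first.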
41
Exercise - Baby Genomic
● Edit baby-genomic.py
  ○ Use a pool of 4 workers
  ○ Use the asynchronous map function
  ○ Provide a callback function that will print results at the end
  ○ Tip: use the edProxy() function in order to call the real editDistance() function.
● Run: time -p python baby-genomic.py
42
The Process class
● https://docs.python.org/2/library/multiprocessing.html
  ○ The Process class: manually spawn and control each process
    Process(target=fct, args=(arg1, arg2)).start()
  ○ Communication channels:
    ■ The Pipe class: to communicate between two processes, one sends data, one receives data
    ■ The Queue class: a shared pipe managed with locks and semaphores, one puts data, one gets data
  ○ Synchronization:
    ■ The Lock class: one acquires the lock, one releases the lock
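The Process and Queue classes above can be combined as follows (our sketch; function and variable names are illustrative):

```python
from multiprocessing import Process, Queue

def worker(queue, values):
    # each process computes a partial sum and puts it on the shared queue
    queue.put(sum(values))

def run():
    queue = Queue()
    data = list(range(100))
    # manually spawn two processes, each handling half of the data
    procs = [Process(target=worker, args=(queue, data[:50])),
             Process(target=worker, args=(queue, data[50:]))]
    for p in procs:
        p.start()
    total = queue.get() + queue.get()  # the workers put, the parent gets
    for p in procs:
        p.join()
    return total

if __name__ == '__main__':
    print(run())  # sum of 0..99
```

The queue handles the locking internally, so the two partial sums can arrive in either order without a race.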
43
44
MPI for Python (mpi4py)
MPI for Python
● The mpi4py package provides bindings from Python to MPI (Message Passing Interface).
● MPI functions are then available in Python but with some simplifications:
  ○ MPI_Init() and MPI_Finalize() are done automatically
  ○ The bindings can auto-detect many values that need to be specified as explicit parameters in the C and Fortran bindings.
  ○ Example:
    dest = 1; tag = 54321;
    MPI_Send( &matrix, count, MPI_INT, dest, tag, MPI_COMM_WORLD )
    becomes
    MPI.COMM_WORLD.Send(matrix, dest=1, tag=54321)
45
MPI for Python
● Import as from mpi4py import MPI
● Then often use comm = MPI.COMM_WORLD
● Two variations for most functions:
  a. all lowercase, e.g. comm.recv()
    ■ works on general Python objects, using pickle (can be slow)
    ■ received object (value) returned:
      ● matrix = comm.recv(source=0, tag=MPI.ANY_TAG)
  b. capitalized, e.g. comm.Recv()
    ■ works fast on numpy arrays & other buffers
    ■ received object given as parameter:
      ● comm.Recv(matrix, source=0, tag=MPI.ANY_TAG)
    ■ Specify [matrix, MPI.INT], or [data, count, MPI.INT] if autodetection fails.
46
Conclusions
● Main techniques covered:
  ○ Speeding up: PyPy, Numba, CTypes, Cython
  ○ Parallel programming: multiprocessing, mpi4py
● Useful links:
  ○ http://www.scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html
○ https://github.com/kwmsmith/scipy-2015-cython-tutorial
○ https://docs.python.org/3/library/multiprocessing.html
○ http://materials.jeremybejarano.com/MPIwithPython
47
Questions?
● Calcul Quebec support team:
  ○ [email protected]
● Specific site support teams:
  ○ [email protected]
  ○ [email protected]
  ○ [email protected]
  ○ [email protected]
48