Python as number crunching code glue

Post on 27-Jan-2015


DESCRIPTION

Presented to the Boston Python User Group on 9/21/2011

TRANSCRIPT

Python as number crunching glue

Jiahao Chen
jiahao@mit.edu
@mitpostdoc

theochem.mit.edu

Thursday, September 22, 2011

This is not a crash course on scientific computing or numerical linear algebra

Recommended texts:

nr.com

NumPy and SciPy

How to say:

NumPy: no official pronunciation

SciPy: “sigh pie”

Where to get:

scipy.org, numpy.scipy.org

You might already have it

Otherwise, have fun installing it ;)

You may already know how to use numpy/scipy!

Similar to Matlab, Octave, Scilab, R.

see: http://mathesaurus.sourceforge.net/

In many cases, Matlab/Octave/Scilab code can be translated easily to use numpy+scipy+matplotlib.

Other interfaces exist: e.g. mlabwrap lets you wrap Python around Matlab.


Approximately continuous arithmetic: floating point*

- vs -

Exact discrete arithmetic: booleans, integers, strings, ...

*David Goldberg, “What every computer scientist should know about floating-point arithmetic”
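A minimal sketch of the distinction, using only built-in Python numbers. It is not from the slides; Python 3 syntax is assumed and all variable names are illustrative.

```python
import math

# Floating point is approximate: 0.1 and 0.2 have no exact binary representation.
float_sum = 0.1 + 0.2
floats_equal = (float_sum == 0.3)             # exact comparison fails
floats_close = math.isclose(float_sum, 0.3)   # compare with a tolerance instead

# Integer arithmetic is exact, at any size.
big = 2**100 + 1
ints_exact = (big - 2**100 == 1)

# The same subtraction in floating point loses the +1,
# because a double cannot resolve 1 next to 2**100.
float_big = float(2**100) + 1.0
float_lost = (float_big - float(2**100) == 0.0)

print(floats_equal, floats_close, ints_exact, float_lost)
```

Comparing floats with a tolerance rather than `==` is the usual way to live with the approximate side of this divide.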

Using numpy can make code cleaner

a = range(10000000)   # creates list of dynamically typed int
b = range(10000000)
c = []                # a + b is concatenation
for i in range(len(a)):
    c.append(a[i] + b[i])

import numpy as np
a = np.arange(10000000)   # creates ndarray of statically typed int
b = np.arange(10000000)
c = a + b                 # vectorized addition

What’s different?

Using numpy can save lots of time: 7.050 s for the list loop vs. 0.333 s for numpy (21x)

numpy is a convenient interface to compiled C/Fortran libraries: BLAS, LAPACK, FFTW, UMFPACK, ...
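One way to reproduce a comparison like this on your own machine is sketched below. It assumes Python 3 and uses a much smaller array than the slide so it finishes quickly; the function names are illustrative and the timings will differ from the 21x quoted above.

```python
import timeit
import numpy as np

n = 100000  # far smaller than the slide's 10,000,000, to keep the run short

def add_lists():
    # Pure-Python version: one interpreted addition per element.
    a = list(range(n))
    b = list(range(n))
    c = []
    for i in range(len(a)):
        c.append(a[i] + b[i])
    return c

def add_arrays():
    # NumPy version: a single vectorized addition in compiled code.
    a = np.arange(n)
    b = np.arange(n)
    return a + b

list_time = timeit.timeit(add_lists, number=3)
array_time = timeit.timeit(add_arrays, number=3)
same_result = (add_lists() == add_arrays().tolist())

print("list loop: %.4f s, numpy: %.4f s" % (list_time, array_time))
```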

Numerical software stack

[Diagram labels: Your code, SciPy, NumPy, Python; External Fortran/C: BLAS, LAPACK, ... (linear algebra), FFTW (Fourier transforms), ...]

“One thing that graduate students eventually learn is that you can hide just about anything in a NxN matrix... (for sufficiently large N)” - anonymous string theorist

If your data can be put into a matrix/vector, numpy/scipy can help you!


You may already be working with matrix/vector data...

[Figure panels: bitmap/video, waveform, database table, text, differential equation model, graph]

# Chapter NumPy SciPy

1 Scientific Computing

2 Systems of linear equations X X

3 Linear least squares X

4 Eigenvalue problems X X

5 Nonlinear equations X

6 Optimization X

7 Interpolation X

8 Numerical integration and differentiation X

9 Initial value problems for ODEs X

10 Boundary value problems for ODEs X

11 Partial differential equations X

12 Fast Fourier Transform X

13 Random numbers and stochastic simulation X

Table of contents from Michael Heath’s textbook

Outline:

* NumPy: explicit data typing with dtypes; array manipulation with ndarrays

* SciPy: high-level numerical routines; use cases

* NumPy/SciPy as code glue: f2py and weave

The most fundamental object in NumPy is the ndarray (N-dimensional array)

v[:]            vector
M[:,:]          matrix
x[:,:,...,:]    higher order tensor

Unlike built-in Python data types, ndarrays are designed for homogeneous, explicitly typed data

numpy primitive dtypes

Bits  Boolean  Signed integer    Unsigned integer  Float           Complex
8     bool     int8              uint8
16             int16             uint16
32             int32             uint32            float32
64             int, intp, int64  uint, uint64      float, float64  complex64
128                                                float128        complex128
256                                                                complex256

dtypes bring explicit typing to Python
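A short sketch of what explicit typing means in practice. It is not from the slides; the variable names are illustrative, and only dtypes available on every platform are used.

```python
import numpy as np

# The dtype fixes the storage size of every element.
a = np.array([1, 2, 3], dtype=np.uint8)
b = np.array([1, 2, 3], dtype=np.float64)
sizes = (a.itemsize, b.itemsize)   # bytes per element

# A list can mix types; an ndarray converts everything to one dtype.
mixed = np.array([1, 2.5, 3])
mixed_dtype = mixed.dtype

# Fixed-width integers wrap around instead of growing like Python ints.
wrapped = np.array([250], dtype=np.uint8) + np.array([10], dtype=np.uint8)

# astype() converts explicitly to another dtype.
as_complex = a.astype(np.complex128)

print(sizes, mixed_dtype, wrapped, as_complex.dtype)
```

The wraparound in `wrapped` (260 modulo 256) is the price of fixed-width storage, and a common source of surprises when choosing small integer dtypes.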

Structured array: ndarray of data structure

>>> mol = np.zeros(3, dtype=('uint8, 3float64'))
>>> mol[0] = 8, (-0.464, 0.177, 0.0)
>>> mol[1] = 1, (-0.464, 1.137, 0.0)
>>> mol[2] = 1, (0.441, -0.143, 0.0)
>>> mol
array([(8, [-0.46400000000000002, 0.17699999999999999, 0.0]),
       (1, [-0.46400000000000002, 1.137, 0.0]),
       (1, [0.441, -0.14299999999999999, 0.0])],
      dtype=[('f0', '|u1'), ('f1', '<f8', (3,))])

Recarray: ndarray of data structure with named fields (record)

>>> mol = np.array(mol, dtype={'atomicnum':('uint8',0), 'coords':('3float64',1)})
>>> mol['atomicnum']
array([8, 1, 1], dtype=uint8)
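The same molecule can also be built with named fields from the start. The sketch below uses the list-of-tuples dtype form rather than the slide's dictionary form; the name `water_dtype` is illustrative, and the field names are taken from the slide.

```python
import numpy as np

# One record per atom: an atomic number plus three Cartesian coordinates.
water_dtype = np.dtype([("atomicnum", np.uint8),
                        ("coords", np.float64, (3,))])

mol = np.zeros(3, dtype=water_dtype)
mol[0] = (8, (-0.464, 0.177, 0.0))
mol[1] = (1, (-0.464, 1.137, 0.0))
mol[2] = (1, (0.441, -0.143, 0.0))

# Field access returns ordinary ndarrays over all records.
atomic_numbers = mol["atomicnum"]
coords = mol["coords"]
centroid = coords.mean(axis=0)

print(atomic_numbers, coords.shape, centroid)
```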

The most fundamental object in NumPy is the ndarray (N-dimensional array)

In 2D, the matrix class is also useful, especially when porting Matlab/Octave code.

* For matrices, a*b is matrix multiply. For ndarrays, a*b is elementwise multiply.

* Matrices have convenient attributes:
  M.T  transpose of M
  M.H  Hermitian conjugate of M
  M.I  matrix inverse of M

* Matrices are always 2D, no matter how you manipulate them. ****** This can lead to some very severe, insidious bugs. ******

Using asarray() and asmatrix() views allows the best of both worlds.

see: http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#matrix-objects
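A small sketch of the differences listed above, assuming a NumPy version that still ships the matrix class (current NumPy documentation no longer recommends it for new code). Variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

elementwise = a * a     # ndarray: * multiplies element by element
matmul = a.dot(a)       # ndarray: dot() is the matrix product

m = np.asmatrix(a)      # matrix view of the same data
via_matrix = np.asarray(m * m)   # matrix: * is the matrix product

# "Always 2D": indexing a row of a matrix keeps two dimensions,
# while indexing a row of an ndarray drops one.
row_of_matrix = m[0]
row_of_array = a[0]

print(elementwise.tolist(), matmul.tolist())
print(row_of_matrix.shape, row_of_array.shape)
```

The shape difference in the last two lines is the kind of thing that causes the insidious bugs mentioned above: code written for 1D rows silently receives 2D ones.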

Memory layout of matrices

column major: first dimension is contiguous in memory (Fortran, Matlab, R, ...)

row major: last dimension is contiguous in memory (C, Java, numpy, ...)

Why you should care:
• Cache coherence
• Transposing a matrix is very expensive
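The layout can be inspected through an array's strides (bytes to step along each dimension). The sketch below is illustrative, not from the slides. It also shows that `.T` in NumPy only swaps the strides and returns a view; the expensive step is physically rearranging the data into the other layout.

```python
import numpy as np

c = np.arange(6, dtype=np.int64).reshape(2, 3)   # row major (C order) by default
f = np.asfortranarray(c)                         # same values, column major

c_strides = c.strides   # last dimension is contiguous
f_strides = f.strides   # first dimension is contiguous

# Transposing is free as a view: no data moves, only the strides swap.
t = c.T
t_is_view = np.shares_memory(c, t)
t_is_fortran = t.flags["F_CONTIGUOUS"]

# Forcing the transposed data into row-major order copies every element.
t_copy = np.ascontiguousarray(t)
copy_is_view = np.shares_memory(c, t_copy)

print(c_strides, f_strides, t.strides, t_copy.strides)
```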

Creating ndarrays

• from Python iterable: lists, tuples, ...
  e.g. array([1, 2, 3]) == asarray((1, 2, 3))
• from intrinsic functions
  empty() allocates memory only
  zeros() initializes to 0
  ones() initializes to 1
  arange() creates a uniform range
  rand() initializes to uniform random
  randn() initializes to standard normal random
  ...
• from binary representation in string/buffer
• from file on disk
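A brief sketch of the first three routes (files are omitted here). Python 3 is assumed, so the "string" route goes through a `bytes` object and `np.frombuffer`; the variable names are illustrative.

```python
import numpy as np

# From Python iterables
from_list = np.array([1, 2, 3])
from_tuple = np.asarray((1, 2, 3))

# From intrinsic functions
zeros = np.zeros((2, 3))             # initialized to 0
ones = np.ones(4, dtype=np.int32)    # initialized to 1, with an explicit dtype
uniform = np.random.rand(2, 2)       # uniform random numbers in [0, 1)

# From a binary representation in a buffer
raw = np.arange(4, dtype=np.uint8).tobytes()
from_buffer = np.frombuffer(raw, dtype=np.uint8)

print(from_list, zeros.shape, ones.dtype, from_buffer)
```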

Generating ndarrays

fromfunction() creates an ndarray whose entries are functions of its indices

e.g. the Hilbert matrix (H_ij = 1/(i+j-1), i, j = 1..n)

>>> np.fromfunction(lambda i,j: 1./(i+j+1), (4,4))
array([[ 1.        ,  0.5       ,  0.33333333,  0.25      ],
       [ 0.5       ,  0.33333333,  0.25      ,  0.2       ],
       [ 0.33333333,  0.25      ,  0.2       ,  0.16666667],
       [ 0.25      ,  0.2       ,  0.16666667,  0.14285714]])

Generating ndarrays

arange(): like range() but accepts floats

>>> import numpy as np
>>> np.arange(2, 2.5, 0.1)
array([ 2. ,  2.1,  2.2,  2.3,  2.4])

linspace(): creates array with specified number of elements, spaced equally between the specified beginning and ending.

>>> np.linspace(2.0, 2.4, 5)
array([ 2. ,  2.1,  2.2,  2.3,  2.4])
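The practical difference between the two is how the end of the range is specified. A minimal sketch, with illustrative variable names:

```python
import numpy as np

# arange takes a step and excludes the stop value.
half_open = np.arange(0, 10, 2)

# linspace takes a count and, by default, includes both endpoints.
closed = np.linspace(0.0, 1.0, 5)

# endpoint=False gives arange-like behavior with an exact element count.
no_end = np.linspace(0.0, 1.0, 5, endpoint=False)

# retstep=True also returns the spacing that linspace chose.
grid, step = np.linspace(2.0, 2.4, 5, retstep=True)

print(half_open, closed, no_end, step)
```

With a floating-point step, the number of elements returned by arange depends on rounding, so linspace is usually the safer choice when the count matters.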

ndarray native I/O

Format      Reader          Writer
pickle      pickle.loads()  dumps()
pickle      np.load()       dumps()
NPY         np.load()       np.save()
NPZ         np.load()       np.savez()
Memory map  np.memmap       np.memmap

NPY is numpy’s native binary format
NPZ is a zip file of NPYs
Memory map: a class useful for handling huge matrices; won’t load entire matrix into memory
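A round trip through the NPY and NPZ formats might look like the sketch below. The file names are made up and live in a temporary directory; `mmap_mode="r"` is used as one way to get a memory-mapped read of an NPY file.

```python
import os
import tempfile
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    # NPY: one array per file
    npy_path = os.path.join(tmp, "x.npy")
    np.save(npy_path, x)
    x_loaded = np.load(npy_path)

    # NPZ: several named arrays in one zip archive
    npz_path = os.path.join(tmp, "arrays.npz")
    np.savez(npz_path, x=x, y=2 * x)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        y_loaded = archive["y"]

    # Memory map: only the parts that are indexed get read from disk
    x_mapped = np.load(npy_path, mmap_mode="r")
    first_row = np.array(x_mapped[0])
    del x_mapped   # release the file before the directory is removed

print(names, first_row)
```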

ndarray text I/O

Format         Reader                   Writer
String         eval()                   np.array_repr()
String         or below with StringIO   or below with StringIO
Text file      np.loadtxt()             savetxt()
               np.genfromtxt()
               np.recfromtxt()
CSV            np.recfromcsv()
Matrix Market  scipy.io.mmread()        mmwrite()
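The "with StringIO" rows mean that the text-file readers and writers also accept file-like objects. A minimal sketch, assuming Python 3's `io.StringIO` and a reasonably recent NumPy:

```python
import io
import numpy as np

x = np.arange(6).reshape(2, 3)

# Write the array as comma-separated text into an in-memory buffer.
buf = io.StringIO()
np.savetxt(buf, x, fmt="%d", delimiter=",")
text = buf.getvalue()

# Read it back from a second in-memory buffer.
x_back = np.loadtxt(io.StringIO(text), delimiter=",", dtype=int)

print(repr(text))
print(x_back)
```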

ndarray binary I/O

Format           Reader                       Writer
List             np.array()                   ndarray.tolist()
String           np.fromstring()              tostring()
String           or below with StringIO       or below with StringIO
Raw binary file  scipy.io.numpyio.fread()     fwrite()
                 ndarray.fromfile()           .tofile()
MATLAB           scipy.io.loadmat()           savemat()
netCDF           scipy.io.netcdf.netcdf_file  scipy.io.netcdf.netcdf_file
WAV audio        scipy.io.wavfile.read()      write()
Image (via PIL)  scipy.misc.imread()          imsave()
                 scipy.misc.fromimage()       toimage()

Also video (OpenCV), HDF5 (PyTables), FITS (PyFITS)...

Indexing

>>> x = np.arange(12).reshape(3,4); x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> x[1,2]      # row, then column
6
>>> x[2,-1]
11
>>> x[0][2]
2
>>> x[(2,2)]    # tuple
10
>>> x[:1]       # slices return views, not copies
array([[0, 1, 2, 3]])
>>> x[::2,1:4:2]
array([[ 1,  3],
       [ 9, 11]])
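The remark that slices return views has a practical consequence: writing into a slice changes the original array. A small sketch with illustrative names:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# A slice is a view onto the same memory...
view = x[::2, 1:4:2]
view[0, 0] = -1                  # ...so this also changes x[0, 1]
changed_original = x[0, 1]
view_shares = np.shares_memory(x, view)

# An explicit copy is independent of the original.
copy = x[::2, 1:4:2].copy()
copy[0, 0] = 99
copy_shares = np.shares_memory(x, copy)

print(changed_original, view_shares, copy_shares)
```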

Fancy indexing

>>> x = np.arange(12).reshape(3,4); x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> x[(2,2)]
10
>>> x[np.array([2,2])]   # array index; same as x[[2,2]]
array([[ 8,  9, 10, 11],
       [ 8,  9, 10, 11]])
>>> x[np.array([1,0]), np.array([2,1])]
array([6, 1])
>>> x[x>8]               # Boolean mask
array([ 9, 10, 11])
>>> x>8
array([[False, False, False, False],
       [False, False, False, False],
       [False,  True,  True,  True]], dtype=bool)
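Unlike slices, indexing with an array or a Boolean mask returns a copy, although assigning through such an index still modifies the original. A sketch with illustrative names:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Indexing with a list or array of indices returns a copy...
picked = x[[2, 2]]
picked[0, 0] = -1                 # ...so x is untouched
original_kept = x[2, 0]
picked_shares = np.shares_memory(x, picked)

# Assignment through a Boolean mask works in place on the original.
x[x > 8] = 0

print(original_kept, picked_shares)
print(x)
```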

Fancy indexing II

>>> y = np.arange(1*2*3*4).reshape(1,2,3,4); y
array([[[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]]])
>>> y[0, Ellipsis, 0]  # == y[0, ..., 0] == y[0,:,:,0]
array([[ 0,  4,  8],
       [12, 16, 20]])
>>> y[0, 0, 0, slice(2,4)]  # == y[0, 0, 0, 2:4]
array([2, 3])

Broadcasting

What happens when you multiply ndarrays of different dimensions?

Case I: trailing dimensions match

>>> x  #.shape = (3,4)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> y  #.shape = (1,2,3,4)
array([[[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]]])
>>> y * x
array([[[[  0,   1,   4,   9],
         [ 16,  25,  36,  49],
         [ 64,  81, 100, 121]],

        [[  0,  13,  28,  45],
         [ 64,  85, 108, 133],
         [160, 189, 220, 253]]]])

Broadcasting

What happens when you multiply ndarrays of different dimensions?

>>> a = np.arange(4); a
array([0, 1, 2, 3])
>>> b = np.arange(4)[::-1]; b
array([3, 2, 1, 0])
>>> a + b
array([3, 3, 3, 3])

Case II: trailing dimension is 1

>>> b.shape = 4,1
>>> a + b
array([[3, 4, 5, 6],
       [2, 3, 4, 5],
       [1, 2, 3, 4],
       [0, 1, 2, 3]])

>>> b.shape = 1,4
>>> a + b
array([[3, 3, 3, 3]])
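Both cases follow from one rule: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The helper below, `broadcast_shape`, is a hypothetical function written for illustration (it is not a NumPy API); it applies the rule by hand and is checked against NumPy's own result.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Illustrative helper: result shape of combining two arrays, or ValueError."""
    ndim = max(len(shape_a), len(shape_b))
    # Missing leading dimensions are treated as length 1.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError("shapes %r and %r do not broadcast" % (shape_a, shape_b))
    return tuple(result)

case_1 = broadcast_shape((1, 2, 3, 4), (3, 4))   # trailing dimensions match
case_2 = broadcast_shape((4,), (4, 1))           # trailing dimension is 1

# Cross-check against NumPy itself.
numpy_case_2 = np.broadcast(np.empty((4,)), np.empty((4, 1))).shape

print(case_1, case_2, numpy_case_2)
```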

Matrix operations

In 2D, the matrix class is often more useful than ndarrays, especially when porting Matlab/Octave code.

* For matrices, a*b is matrix multiply. For ndarrays, a*b is elementwise multiply.

* Matrices have convenient attributes:
  M.T  transpose of M
  M.H  Hermitian conjugate of M
  M.I  matrix inverse of M

* Matrices are always 2D, no matter how you manipulate them. ****** This can lead to some very severe, insidious bugs. ******

Using asarray() and asmatrix() views allows the best of both worlds.

see: http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#matrix-objects

Matrix functions

You can apply a function elementwise to a matrix...

>>> from numpy import array, exp
>>> X = array([[1, 1], [1, 0]])
>>> exp(X)
array([[ 2.71828183,  2.71828183],
       [ 2.71828183,  1.        ]])

...or a matrix version of that function

>>> from scipy.linalg import expm
>>> expm(X)   # output rounded to 4 decimals
array([[ 3.7982,  2.0143],
       [ 2.0143,  1.7839]])

other functions in scipy.linalg.matfuncs
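One way to convince yourself that the two results really differ is to compute the matrix exponential independently. The sketch below is not from the slides. It checks scipy.linalg.expm against an eigendecomposition (valid here because X is symmetric) and against a truncated Taylor series.

```python
import numpy as np
from scipy.linalg import expm

X = np.array([[1.0, 1.0], [1.0, 0.0]])

elementwise = np.exp(X)   # exp applied to each entry
matrix_exp = expm(X)      # exp of the matrix as a whole

# For symmetric X = V diag(w) V^T, exp(X) = V diag(exp(w)) V^T.
w, V = np.linalg.eigh(X)
via_eig = V.dot(np.diag(np.exp(w))).dot(V.T)

# Truncated Taylor series: I + X + X^2/2! + X^3/3! + ...
taylor = np.zeros_like(X)
term = np.eye(2)
for k in range(1, 30):
    taylor += term
    term = term.dot(X) / k

print(matrix_exp)
print(np.allclose(matrix_exp, via_eig), np.allclose(matrix_exp, taylor))
```

Because X is symmetric, its matrix exponential is symmetric too, which is a quick sanity check on any printed result.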

SciPy by example

* Data fitting

* Signal matching

* Disease outbreak modeling (epidemiology)


http://scipy-central.org/


Least-squares curve fitting

from numpy import linspace, cos, pi, exp
from numpy.random import rand
from scipy.optimize import leastsq
from matplotlib.pyplot import plot, show

# Make up data x(t) with Gaussian noise
num_points = 150
t = linspace(5, 8, num_points)
x = 11.86*cos(2*pi/0.81*t-1.32) + 0.64*t \
    + 4*((0.5-rand(num_points)) * exp(2*rand(num_points)**2))

# Target function
model = lambda p, x: p[0]*cos(2*pi/p[1]*x+p[2]) + p[3]*x
# Distance to the target function
error = lambda p, x, y: model(p, x) - y
# Initial guess for the parameters
p0 = [-15., 0.8, 0., -1.]
p1, _ = leastsq(error, p0, args=(t, x))  # fit data to model

t2 = linspace(t.min(), t.max(), 100)
plot(t, x, "ro", t2, model(p1, t2), "b-")
show()  # display the plot

Matching signals

Suppose I have a short audio clip that I know to be part of a larger file

How can I figure out its offset?

Problem: naïve matching scales as O(N^2)

An O(N lg N) solution

Naïve matching scales as O(N^2)

How can we go faster?

Exploit Fourier transforms: they encode relative offsets in complex phase

phase correlation

[Figure: a phase of 60° corresponds to 1/6 of a cycle]

From math to code

import numpy

# Make up some data
N = 30000
idx = 24700
size = 300
data = numpy.random.rand(N)
frag_pad = numpy.zeros(N)
frag = data[idx:idx+size]
frag_pad[:size] = frag

# Compute phase correlation
data_ft = numpy.fft.rfft(data)
frag_ft = numpy.fft.rfft(frag_pad)
phase = data_ft * numpy.conj(frag_ft)
phase /= abs(phase)
cross_correlation = numpy.fft.irfft(phase)
offset = numpy.argmax(cross_correlation)

print('Input offset: %d, computed: %d' % (idx, offset))
from matplotlib.pyplot import plot, show
plot(cross_correlation)
show()  # display the plot
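The math behind the phase-correlation code did not survive in this transcript. The derivation below is a reconstruction from the discrete Fourier shift theorem, not the slide's own formulas. It assumes the idealized case where the fragment g is the whole signal d circularly shifted by s samples and D[k] is nonzero; D, G, R and r correspond to data_ft, frag_ft, phase and cross_correlation in the code, and s plays the role of idx.

```latex
\begin{align*}
g[n] &= d[(n + s) \bmod N]
  && \text{(fragment modeled as the data shifted by } s\text{)} \\
G[k] &= \sum_{n=0}^{N-1} g[n]\, e^{-2\pi i k n / N} = e^{2\pi i k s / N}\, D[k]
  && \text{(shift theorem)} \\
R[k] &= \frac{D[k]\,\overline{G[k]}}{\left| D[k]\,\overline{G[k]} \right|} = e^{-2\pi i k s / N}
  && \text{(normalized cross-power spectrum)} \\
r[n] &= \frac{1}{N} \sum_{k=0}^{N-1} R[k]\, e^{2\pi i k n / N} = \delta_{n,s}
  && \text{(peak at the offset } n = s\text{)}
\end{align*}
```

In the code the fragment is a short zero-padded excerpt rather than a full shifted copy, so the result is a noisy peak rather than an exact delta, which is why argmax is used to locate it.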


Modeling a zombie apocalypse

http://www.scipy.org/Cookbook/Zombie_Apocalypse_ODEINT

Each person can be in one of three states: Normal (S), Zombie (Z), Dead (R)

Various processes connect these states: birth (P), normal death (d), transmission (B), resurrection (G), destruction (A)

From math to code

http://www.scipy.org/Cookbook/Zombie_Apocalypse_ODEINT

from numpy import linspace
from scipy.integrate import odeint

P = 0       # birth rate
d = 0.0001  # natural death rate
B = 0.0095  # transmission rate
G = 0.0001  # resurrection rate
A = 0.0001  # destruction rate

def f(y, t):
    Si, Zi, Ri = y
    return [P - B*Si*Zi - d*Si,
            B*Si*Zi + G*Ri - A*Si*Zi,
            d*Si + A*Si*Zi - G*Ri]

y0 = [500, 0, 0]            # initial conditions
t = linspace(0, 5., 1000)   # time grid

soln = odeint(f, y0, t)     # solve ODE
S, Z, R = soln[:, :].T

[Diagram: states S, Z, R connected by the processes P, d, B, G, A]
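Read back from the function f in the code, the model is the following system of ordinary differential equations. This is a transcription of the code rather than a recovered slide, using the same symbols as the rate constants above.

```latex
\begin{align*}
\frac{dS}{dt} &= P - B\,S\,Z - d\,S \\
\frac{dZ}{dt} &= B\,S\,Z + G\,R - A\,S\,Z \\
\frac{dR}{dt} &= d\,S + A\,S\,Z - G\,R
\end{align*}
```

Adding the three equations gives d(S + Z + R)/dt = P, so with P = 0 the total population is conserved, which is a handy check on any numerical solution.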

Using external code

“NumPy can get you most of the way to compiled speeds through vectorization. In situations where you still need the last ounce of speed in a critical section, or when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython or Weave. If you already know C/C++, then weave is a simple and speedy solution. If, however, you are not already familiar with C then you may find Cython to be exactly what you are looking for to get the speed you need out of Python.” - Travis Oliphant, 2011-06-20

see:
http://www.scipy.org/PerformancePython
http://technicaldiscovery.blogspot.com/2011/06/speeding-up-python-numpy-cython-and.html

Python as code glue

- numpy.f2py: wraps
  * C, Fortran 77/90/95 functions
  * Fortran 90/95 module data
  * Fortran 77 COMMON blocks

- scipy.weave
  * .inline: compiles & runs C/C++ code manipulating Python scalars/ndarrays
  * .blitz: interfaces with Blitz++

Other wrapper libraries and programs: see http://scipy.org/Topical_Software

numpy.f2py: Fortran/C

$ cat>invsqrt.f
      real*8 function invsqrt (a)
      real*8 a
      invsqrt = 1.0/sqrt(a)
      end

$ f2py -c -m invsqrt invsqrt.f
$ python -c 'import invsqrt; print(invsqrt.invsqrt(4))'
0.5

$ cat>invsqrt.c
#include <math.h>
double invsqrt(double a) {
  return 1.0/sqrt(a);
}
$ cat>invsqrt.m
python module invsqrt
interface
  real*8 function invsqrt(x)
    intent(c) :: invsqrt
    real*8 intent(in) :: x
  end function invsqrt
end interface
end python module invsqrt
$ f2py invsqrt.m invsqrt.c -c
$ python -c 'import invsqrt; print(invsqrt.invsqrt(4))'
0.5

see: http://www.scipy.org/F2py

scipy.weave.inline

>>> from scipy.weave import inline
>>> x = 4.0
>>> inline('return_val = 1./sqrt(x);', ['x'])
0.5

see: https://github.com/scipy/scipy/blob/master/scipy/weave/doc/tutorial.txt

[Diagram labels: python, scipy, weave, inline; distutils core, Extension; on-the-fly compiled C/C++ program]

scipy.weave.blitz

Uses the Blitz++ numerical library for C++

Converts between ndarrays and Blitz arrays

>>> # Computes five-point average using numpy and weave.blitz
>>> import numpy
>>> from scipy.weave import blitz
>>> a = numpy.zeros((4096,4096)); c = numpy.zeros((4096, 4096))
>>> b = numpy.random.randn(4096,4096)
>>> c[1:-1,1:-1] = (b[1:-1,1:-1] + b[2:,1:-1] + b[:-2,1:-1] + b[1:-1,2:] + b[1:-1,:-2]) / 5.0
>>> blitz("a[1:-1,1:-1] = (b[1:-1,1:-1] + b[2:,1:-1] + b[:-2,1:-1] + b[1:-1,2:] + b[1:-1,:-2]) / 5.")
>>> (a == c).all()
True

see: https://github.com/scipy/scipy/blob/master/scipy/weave/doc/tutorial.txt
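scipy.weave has since been removed from SciPy, so the blitz call above will not run on a current installation. The NumPy half of the example still works on its own; the sketch below (illustrative names, a small array instead of 4096 x 4096) checks the sliced five-point average against explicit loops to show what the slice expression computes.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
b = rng.randn(6, 6)

# Vectorized five-point average over the interior points, as on the slide.
vectorized = np.zeros_like(b)
vectorized[1:-1, 1:-1] = (b[1:-1, 1:-1] + b[2:, 1:-1] + b[:-2, 1:-1]
                          + b[1:-1, 2:] + b[1:-1, :-2]) / 5.0

# The same computation written as explicit loops over interior points.
looped = np.zeros_like(b)
for i in range(1, b.shape[0] - 1):
    for j in range(1, b.shape[1] - 1):
        looped[i, j] = (b[i, j] + b[i + 1, j] + b[i - 1, j]
                        + b[i, j + 1] + b[i, j - 1]) / 5.0

print(np.allclose(vectorized, looped))
```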

Parallelization

The easy way: numpy/scipy’s primitives automatically use vectorization compiled into external BLAS/LAPACK/... libraries

The usual way:
- MPI interfaces (mpi4py, ...)
- Python threads/multiprocessing/...
- OpenMP/pthreads... in external C/Fortran

see: http://www.scipy.org/ParallelProgramming

How I use NumPy/Scipy

[Workflow diagram labels: Text input, Matrices, Test model, Visualize, Text output; scipy.optimize (Quasi-Newton optimizers); External binary, Binary output, ndarray.fromfile()]

Beyond NumPy/SciPy

[Stack diagram labels: My script and My interactive session (Pylab) on top of SciPy, NumPy, Python and External Fortran/C, alongside CVXOpt (numerical optimization), PyTables (HDF5 file I/O), VTK (visualization), matplotlib (plots), PyMol (molecule viz.)]

many more examples at http://www.scipy.org/Topical_Software