Crushing the Head of the Snake
Robert Brewer, PyData SV 2014


DESCRIPTION

Big Data brings with it particular challenges in any language, mostly in performance. This talk will explain how to get immediate speedups in your Python code by exploiting both timeless programming techniques and fixes specific to Python. We will cover:

I. Amongst Our Weaponry
   1. How to time and profile Python
   2. Extracting loop invariants: constants, lookup tables, even methods!
   3. Caching: memoization and heavier things

II. Gunfight at the O.K. Corral in Morse Code
   1. Python functions vs. C functions
   2. Vector operations: NumPy
   3. Reducing calls: loops, generators, recursion

III. The Semaphore Version of Wuthering Heights
   1. Using select instead of Queue
   2. Serialization overhead
   3. Parallelizing work

TRANSCRIPT

Crushing the Head of the Snake

Robert Brewer, Chief Architect

Crunch.io

How to Time

from timeit import Timer

>>> range(5)
[0, 1, 2, 3, 4]
>>> t = Timer("range(a)", "a = 1000000")
>>> t.timeit(1)
0.028472900390625
>>> t.timeit(100)
1.8600409030914307
>>> t.timeit(1000)
18.056041955947876

Comparing algorithms

>>> Timer("range(1000)").timeit(1 000 000)>>> Timer("range(1000)").timeit()11.392634868621826

>>> Timer("xrange(1000)").timeit()0.20040297508239746

>>> Timer("list(xrange(1000))").timeit()12.207480907440186

Caveat: Overhead

>>> # an empty Timer times "pass": the overhead of the timing loop itself
>>> Timer().timeit(1000000)
0.029289960861206055

Caveat: Wall time not CPU time

>>> Timer("xrange(1000)").timeit()0.20040297508239746>>> Timer("xrange(1000)").repeat(3)[0.20735883712768555, 0.1968221664428711, 0.18882489204406738] take the minimum

How to Profile

>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")

How to Profile

>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")

(make changes to module)

>>> reload(mod)
>>> cProfile.run("mod.b()", sort="cumulative")

How to Profile

>>> cProfile.run("for i in xrange(3000): range(i).sort()", sort="cumulative") 6002 function calls in 0.093 seconds

Ordered by: cumulative time

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.019    0.019    0.093    0.093 <string>:1(<module>)
   3000    0.052    0.000    0.052    0.000 {list.sort}
   3000    0.022    0.000    0.022    0.000 {range}
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

How to Profile

6002 function calls in 0.093 seconds

ncalls tottime percall cumtime percall filename:lineno(func)

   3000    0.052    0.000    0.052    0.000 {list.sort}
   3000    0.022    0.000    0.022    0.000 {range}

Example: Standard Deviation

>>> import numpy
>>> n = 100
>>> a = numpy.array(xrange(n), dtype=float)
>>> a.std(ddof=1)
29.011491975882016

Example: Standard Deviation

>>> n = 4000000000   # four billion
>>> a = numpy.array(xrange(n), dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.

Example: Standard Deviation

>>> n = 4000000000   # four billion
>>> arr = numpy.zeros(n, dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError


Example: Standard Deviation

Given an array $A$ broken into $n$ parts $a_1 \dots a_n$, with part mean $\bar{a}_i$, global mean $\bar{A}$, and local variance sum

$$V(a_i) = \sum_j (a_{ij} - \bar{a}_i)^2,$$

the standard deviation of the whole array is

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} \Big[ V(a_i) + 2\big(\textstyle\sum_j a_{ij}\big)(\bar{a}_i - \bar{A}) + |a_i|\,(\bar{A}^2 - \bar{a}_i^2) \Big]}{|A| - \mathrm{ddof}}}$$
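As a quick sanity check of this formula (illustrative sketch, not from the talk), it can be verified against numpy.std on a small array:

import math
import numpy

A = numpy.arange(100, dtype=float)   # the whole array
parts = numpy.split(A, 4)            # a_1 ... a_n
g = A.mean()                         # global mean
ddof = 1

final = 0.0
for p in parts:
    m = p.mean()                     # part mean
    V = ((p - m) ** 2).sum()         # local variance sum V(a_i)
    final += V + 2 * p.sum() * (m - g) + (g ** 2 - m ** 2) * len(p)

# Agrees with numpy (29.011491975882016 for this array).
assert abs(math.sqrt(final / (len(A) - ddof)) - A.std(ddof=1)) < 1e-9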

Example: Standard Deviation

def run():
    points = 400000      # the full run will use 4 000 000 000
    segments = 100
    part_len = points / segments

    partitions = []
    for p in range(segments):
        part = range(part_len * p, part_len * (p + 1))
        partitions.append(part)

    return stddev(partitions, ddof=1)

Example: Standard Deviation

def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)

        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))

Example: Standard Deviation

2052106 function calls in 71.025 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   71.023   71.023 stddev.py:39(run)
      1    0.006    0.006   71.013   71.013 stddev.py:22(stddev)
 410400   63.406    0.000   70.490    0.000 stddev.py:4(total)
    100    0.341    0.003   69.178    0.692 stddev.py:15(varsum)
 410601    7.076    0.000    7.076    0.000 {range}
 410200    0.151    0.000    0.174    0.000 stddev.py:11(length)
 820700    0.042    0.000    0.042    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}

Example: Standard Deviation

400,000 values in 71.025 seconds.

Assuming no other effects of scale, the full array would take
71.025 s × (4,000,000,000 / 400,000) = 710,250 s ≈ 197.3 hours (over 8 days)
to calculate our 4 billion-row array.

Example: Standard Deviation

Can we calculate our 4-billion-row array in

1 minute?

That’s 400,000 in 6 ms.

All we need is an 11,837.5x speedup (710,250 s / 60 s = 11,837.5).

Optimization

Example: Standard Deviation

2052106 function calls in 71.025 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   71.023   71.023 stddev.py:39(run)
      1    0.006    0.006   71.013   71.013 stddev.py:22(stddev)
 410400   63.406    0.000   70.490    0.000 stddev.py:4(total)
    100    0.341    0.003   69.178    0.692 stddev.py:15(varsum)
 410601    7.076    0.000    7.076    0.000 {range}
 410200    0.151    0.000    0.174    0.000 stddev.py:11(length)
 820700    0.042    0.000    0.042    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}

Amongst Our Weaponry

Extracting loop invariants

Extracting Loop Invariants

def varsum(arr):
    vs = 0
    for j in range(len(arr)):
        mean = (total(arr) / length(arr))
        vs += (arr[j] - mean) ** 2
    return vs

Extracting Loop Invariants

def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs

Extracting Loop Invariants

52606 function calls in 1.944 seconds (36x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    1.942    1.942 stddev1.py:41(run)
      1    0.006    0.006    1.932    1.932 stddev1.py:23(stddev)
  10500    1.673    0.000    1.859    0.000 stddev1.py:4(total)
  10701    0.196    0.000    0.196    0.000 {range}
    100    0.062    0.001    0.081    0.001 stddev1.py:15(varsum)
  10300    0.003    0.000    0.003    0.000 stddev1.py:11(length)
  20900    0.001    0.000    0.001    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}

still 5.4 hrs

Extracting Loop Invariants

def stddev(partitions, ddof=0):
    final = 0.0

    for part in partitions:
        m = total(part) / length(part)

        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))

Extracting Loop Invariants

def stddev(partitions, ddof=0):
    final = 0.0

    # Find the mean of the entire group.
    gtotal = total([total(p) for p in partitions])
    glength = total([length(p) for p in partitions])
    g = gtotal / glength

    for part in partitions:
        m = total(part) / length(part)

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))

Extracting Loop Invariants

2512 function calls in 0.142 seconds (13x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.140    0.140 stddev1.py:42(run)
      1    0.000    0.000    0.136    0.136 stddev1.py:23(stddev)
    100    0.063    0.001    0.082    0.001 stddev1.py:15(varsum)
    402    0.064    0.000    0.071    0.000 stddev1.py:4(total)
    603    0.013    0.000    0.013    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev1.py:11(length)
    902    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}

still 23 minutes

Amongst Our Weaponry

Use builtin Python functions whenever possible

Use Python Builtins

def total(arr):
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s

Use Python Builtins

def total(arr):
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s

def total(arr):
    return sum(arr)

Use Python Builtins

2110 function calls in 0.096 seconds (1.47x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.093    0.093 stddev1.py:39(run)
      1    0.000    0.000    0.083    0.083 stddev1.py:20(stddev)
    100    0.065    0.001    0.070    0.001 stddev1.py:12(varsum)
    402    0.000    0.000    0.015    0.000 stddev1.py:4(total)
    402    0.015    0.000    0.015    0.000 {sum}
    201    0.012    0.000    0.012    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev1.py:8(length)
    500    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}

still 16 minutes


Use Python Builtins

def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs

Use Python Builtins

def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2 for v in arr)

Use Python Builtins

402110 function calls in 0.122 seconds (1.27x slower)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.120    0.120 stddev.py:36(run)
      1    0.000    0.000    0.115    0.115 stddev.py:17(stddev)
    502    0.044    0.000    0.114    0.000 {sum}
    100    0.000    0.000    0.106    0.001 stddev.py:12(varsum)
 400100    0.070    0.000    0.070    0.000 stddev.py:14(genexpr)
    402    0.000    0.000    0.011    0.000 stddev.py:4(total)

Amongst Our Weaponry

Reduce function calls

Reduce Function Calls

>>> Timer("sum(a)", "a = range(10)").repeat(3)
[0.15801000595092773, 0.1406857967376709, 0.14577603340148926]

>>> Timer("total(a)",
...       "a = range(10); total = lambda x: sum(x)").repeat(3)
[0.2066800594329834, 0.1998300552368164, 0.21536493301391602]

That is roughly 0.000000059 seconds of extra overhead per wrapped call.

Reduce Function Calls

def variances_squared(arr):
    mean = (total(arr) / length(arr))
    for v in arr:
        yield (v - mean) ** 2

Reduce Function Calls

def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2 for v in arr)

def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum([(v - mean) ** 2 for v in arr])
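The list-comprehension version turned out faster for this workload, presumably because sum() then walks a plain list instead of resuming a generator frame for every element. A rough comparison (illustrative sketch, not from the slides):

from timeit import Timer

setup = "a = range(100000); m = 50000.0"
# Generator expression: one generator resume per element.
print min(Timer("sum((v - m) ** 2 for v in a)", setup).repeat(3, 100))
# List comprehension: build the whole list, then sum it.
print min(Timer("sum([(v - m) ** 2 for v in a])", setup).repeat(3, 100))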

Reduce Function Calls

2010 function calls in 0.082 seconds (1.17x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.080    0.080 stddev.py:36(run)
      1    0.000    0.000    0.071    0.071 stddev.py:17(stddev)
    100    0.050    0.001    0.056    0.001 stddev.py:12(varsum)
    502    0.020    0.000    0.020    0.000 {sum}
    402    0.000    0.000    0.016    0.000 stddev.py:4(total)
    101    0.009    0.000    0.009    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev.py:8(length)
    400    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}

still 13+ minutes

Amongst Our Weaponry

Vector operations with NumPy

Vector Operations

part = numpy.array(xrange(...), dtype=float)

def total(arr):
    return arr.sum()

def varsum(arr):
    return ((arr - arr.mean()) ** 2).sum()

Vector Operations

3408 function calls in 0.057 seconds (1.43x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.057    0.057 stddev1.py:37(run)
    200    0.051    0.000    0.051    0.000 {numpy...array}
      1    0.001    0.001    0.006    0.006 stddev1.py:18(stddev)
    500    0.003    0.000    0.003    0.000 {numpy.ufunc.reduce}
    100    0.001    0.000    0.003    0.000 stddev1.py:14(varsum)
    400    0.000    0.000    0.003    0.000 {numpy.ndarray.sum}
    300    0.000    0.000    0.002    0.000 stddev1.py:6(total)
    100    0.000    0.000    0.001    0.000 {numpy.ndarray.mean}

still 9.5 minutes


Vector Operations

3408 function calls in 0.006 seconds (13.6x)
(the same profile with the 0.051 s of numpy array creation excluded)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.001    0.001    0.006    0.006 stddev1.py:18(stddev)
    500    0.003    0.000    0.003    0.000 {numpy.ufunc.reduce}
    100    0.001    0.000    0.003    0.000 stddev1.py:14(varsum)
    400    0.000    0.000    0.003    0.000 {numpy.ndarray.sum}
    300    0.000    0.000    0.002    0.000 stddev1.py:6(total)
    100    0.000    0.000    0.001    0.000 {numpy.ndarray.mean}

At scale that should be exactly 1 minute: 0.006 s × 10,000 = 60 s.

Vector Operations

Let’s try 4 billion!

Bump up that N...

Vector Operations

MemoryError

Oh, yeah...

Amongst Our Weaponry

Parallelization with multiprocessing

Parallelization

from multiprocessing import Pool

def run():
    results = Pool().map(run_one, range(segments))
    result = stddev(results)
    return result

Parallelization

def run_one(i):
    p = numpy.memmap('stddev.%d' % i, dtype=float,
                     mode='r', shape=(part_len,))

    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V
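The slides never show how the stddev.%d files were written. One minimal way to produce them (an illustrative sketch, reusing part_len and segments from run() above) would be:

import numpy

segments = 100
part_len = 400000 // segments   # or 4 billion / segments for the full run

for i in range(segments):
    out = numpy.memmap('stddev.%d' % i, dtype=float,
                       mode='w+', shape=(part_len,))
    out[:] = numpy.arange(part_len * i, part_len * (i + 1), dtype=float)
    out.flush()
    del out                      # drop the reference to close the file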

Parallelization

def stddev(TLVs, ddof=0):
    final = 0.0

    totals = [T for T, L, V in TLVs]
    lengths = [L for T, L, V in TLVs]
    glength = sum(lengths)
    g = sum(totals) / glength

    for T, L, V in TLVs:
        m = T / L
        adj = ((2 * T * (m - g)) +
               ((g ** 2 - m ** 2) * L))
        final += V + adj

    return math.sqrt(final / (glength - ddof))

Parallelization

3734 function calls in 0.024 seconds (6x slower)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.024    0.024 stddev.py:47(run)
      4    0.000    0.000    0.011    0.003 threading.py:234(wait)
     22    0.011    0.000    0.011    0.000 {thread.lock.acquire}
      1    0.000    0.000    0.011    0.011 pool.py:222(map)
      1    0.000    0.000    0.008    0.008 pool.py:113(__init__)
      4    0.001    0.000    0.005    0.001 process.py:116(start)
      1    0.003    0.003    0.005    0.005 stddev.py:11(stddev)
      4    0.000    0.000    0.004    0.001 forking.py:115(init)
      4    0.003    0.001    0.003    0.001 {posix.fork}

...

Parallelization

Could that waiting be insignificant when we scale up to 4 billion?

Let’s try it!

Parallelization

3766 function calls in 67.811 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   67.811   67.811 stddev.py:47(run)
      4    0.000    0.000   67.747   16.930 threading.py:234(wait)
     22   67.747    3.079   67.747    3.079 {thread.lock.acquire}
      1    0.000    0.000   67.747   67.747 pool.py:222(map)
      1    0.000    0.000    0.062    0.060 pool.py:113(__init__)
      4    0.000    0.000    0.058    0.014 process.py:116(start)
      4    0.057    0.014    0.057    0.014 {posix.fork}
      1    0.003    0.003    0.005    0.005 stddev.py:11(stddev)
      2    0.002    0.001    0.002    0.001 {sum}

SO CLOSE! 1.13 minutes

Parallelization

def run_one(i):
    if i == 50:
        cProfile.runctx(..., "prf.50")

>>> import pstats
>>> s = pstats.Stats("prf.50")
>>> s.sort_stats("cumulative")
<pstats.Stats instance at 0x2bddcb0>
>>> _.print_stats()

Parallelization

57 function calls in 2.804 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.431    0.431    2.791    2.791 stddev.py:43(run_one)
      2    0.000    0.000    2.360    1.180 numpy.ndarray.sum
      2    2.360    1.180    2.360    1.180 numpy.ufunc.reduce
      1    0.000    0.000    0.000    0.000 memmap.py:195(__new__)

Parallelization

def run_one(i):
    p = numpy.memmap('stddev.%d' % i, dtype=float,
                     mode='r', shape=(part_len,))

    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V

Roughly 2 seconds of data loading per partition × 100 partitions = 200 seconds;
200 seconds / 4 cores = 50 seconds of wall-clock time.

Parallelization? Serialization!

67.8 seconds for 4 billion rows, but ~50 of those are loading data! 17.8 seconds to do the actual math.

Serialization

import bloscpack as bp

bargs = bp.args.DEFAULT_BLOSC_ARGS
bargs['clevel'] = 6

bp.pack_ndarray_file(part, fname, blosc_args=bargs)

part = bp.unpack_ndarray_file(fname)
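run_one then presumably swaps the memmap read for an unpack call. A hedged sketch (the .blp file names are assumed, not from the talk):

import bloscpack as bp

def run_one(i):
    # Load the compressed partition instead of memmapping the raw file;
    # reading less from disk is what buys back the load time.
    p = bp.unpack_ndarray_file('stddev.%d.blp' % i)
    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V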

Serialization

Let’s try it!

I Crush Your Head!

I Crush Your Head!

1153 function calls in 26.166 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   26.166   26.166 stddev_bp.py:56(run)
      4    0.000    0.000   26.134    6.53  threading.py:234(wait)
     22   26.134    1.188   26.134    1.188 {thread.lock.acquire}
      1    0.000    0.000   26.133   26.133 pool.py:222(map)
      1    0.000    0.000   26.133   26.133 pool.py:521(get)
      1    0.000    0.000   26.133   26.133 pool.py:513(wait)
      1    0.003    0.003    0.030    0.030 __init__.py:227(Pool)
      1    0.000    0.000    0.021    0.021 pool.py:113(__init__)

I Crush Your Head!

With some time-tested general programming techniques:

Extract loop invariants

Use language builtins

Reduce function calls

I Crush Your Head!

And some Python libraries for architectural improvements:

Use NumPy for vector ops

Use multiprocessing for parallelization

Use bloscpack for compression

I Crush Your Head!

We sped up our calculation so that it runs in:

0.003% of the time

or 27317 times faster

4.4 orders of magnitude

Crushing the Head of the Snake

Any questions?

@aminusfu
bob@crunch.io
