TRANSCRIPT
HEADTAIL working group meeting 1
Code optimization in BLonD
D. Quartullo
30/10/2014
Acknowledgements: T. Argyropoulos, B. Hegner, A. Lasheen,J. E. Muller, D. Piparo, E. Shaposhnikova, H. Timko
Outlook

Why optimize BLonD?
Language used and tools for optimization
• Python
• Faster with C: Ctypes
• Spyder profiler
• GCC compiler and flags
BLonD structure and optimization strategies
• Main file, packages, modules and setup files
• What to optimize: RAM memory and computation time
Asymptotic study and routine optimization
• Definitions and parameters of interest
• Histogram constant space
• Tracker
Two realistic cases:
• LHC ramp with feedback and no impedances
• SPS with full impedance model at injection
• Observations
Summary and next steps
Experimental: parallelization with OpenMP
Why optimize BLonD?
Because, in general, simulating the acceleration ramp requires the code to handle millions of turns (e.g. 9 million for the LHC). This implies several days, or even weeks, of computing time, even with a number of macroparticles quite low compared to the real number of particles. In addition, the CERN LXPLUS batch service advises against submitting jobs to the 1- or 2-week queues!

Because the multibunch case is quite expensive, since one has to consider a lot of macroparticles and slices.

So the present runtimes are unacceptable in view of the multibunch extension:
• e.g. the LHC acceleration ramp takes 1 week on LXPLUS without intensity effects, for 50000 particles only!

Effective parallelization requires optimization first, because a non-optimized parallelized code may perform no better than an optimized serial code, and the serial code is more stable!

Because, as we will see, it's possible to optimise the code quite significantly!
Python
The language used is Python, mainly because it is:
• fast to program with
• open source
• very widespread in the scientific computing community, so there is a lot of support available
• full of packages and routines, so the user has nothing to envy of Matlab or Mathematica users.

We are using the Anaconda distribution, which includes Python v2.7 64-bit and all the packages we need. Anaconda can be installed very easily on Linux, Windows and Mac.
Link: https://store.continuum.io/cshop/anaconda/

But since Python code, like that of its interpreted cousins, is interpreted rather than compiled, it typically cannot reach the performance of compiled code such as C/C++ or Fortran.

We need something to boost Python...
Faster with C: Ctypes
There are plenty of plugins that allow one to 'connect' C/C++ and Python code. The main ones are the Python C API, Ctypes, SWIG and Cython.

We started using Ctypes because:
• It allows coding the routines one wants to optimise in pure C/C++, with just an unavoidable overhead due to the type casting from Python to C.
• For our applications this overhead is in general small, since the number of floating-point operations that BLonD has to carry out is large.
• The programmer has full control of the process of embedding C in Python (flags, parallelization in C, autovectorisation).
• The programmer does not have to learn a new language, as with Cython, assuming that he knows C!
• If a certain routine is optimised in C and then linked with Python, the programmer can be sure that the best possible has been done.
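A minimal sketch of the Ctypes mechanism (illustrative only, not a BLonD routine): load the C math library and call its scalar sin from Python. The library lookup assumes a Unix-like system.

```python
import ctypes
import ctypes.util
import math

# Load the C math library (libm); find_library resolves the platform-specific
# name (e.g. "libm.so.6" on Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sin(double). The 'type casting' overhead
# mentioned above is exactly this Python-float -> C-double conversion.
libm.sin.restype = ctypes.c_double
libm.sin.argtypes = [ctypes.c_double]

assert abs(libm.sin(1.0) - math.sin(1.0)) < 1e-15
```

For array-sized workloads one would instead pass a whole buffer (e.g. a NumPy array via `ctypes` pointers) to a C loop, so the per-call casting cost is paid once per array rather than once per element.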
Spyder profiler

Spyder is an IDE for Python. It includes a profiler able to measure and display, with a convenient interface, the time spent in the various routines of a Python run.

Spyder ships, for example, with the Anaconda distribution.

The Spyder profiler uses Python's cProfile module.

A good web page with useful hints on profiling techniques and performance analysis: http://www.huyng.com/posts/python-performance-analysis/
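Under the hood, the same measurement can be scripted directly with cProfile (a minimal sketch, independent of Spyder; the function `busy` is just a placeholder workload):

```python
import cProfile
import pstats
import io

def busy():
    # Placeholder workload to have something worth profiling.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Print the per-function timing table sorted by cumulative time,
# which is essentially what the Spyder GUI displays.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
assert "busy" in report
```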
GCC compiler and flags

We need a C++ compiler to compile the files that have to be linked to Python via the Ctypes module.
GCC is open source, complete and easy to use.
We are using GCC v4.8.1 64-bit.
Windows: http://tdm-gcc.tdragon.net/
Linux (CERN): source /afs/cern.ch/sw/lcg/contrib/gcc/4.8.1/x86_64-slc6/setup.sh
One can compile a .cpp file with certain options, called 'flags', used mostly to optimize the code. For example:
• '-O' optimization flags: autovectorisation, fast mathematical operations, ... (think of it as parallelization within a single core!)
• '-mfma4': fused multiply-add operations, speeding up operations of the type a + b × c (not all machines support this feature, but LXPLUS does!)
• '-std=c++11': the latest standard of the C++ language
• '-fopenmp': task parallelization with OpenMP
Packages, modules and setup files
[Structure diagram labels: code documentation; test cases; beam creation and slicing; C++ optimised routines; Cython routines (just for test); impedances in time and frequency domain; beam and RF parameter definitions; feedbacks and RF noise; statistics; plots; trackers; README file; setup file for the C++ optimised routines; setup file for the Cython optimised routines]
What to optimize: RAM memory and computation time
In BLonD we store the momentum program, beam energy, beam coordinates and similar quantities in arrays. In that way we save computation time on average, since we compute once, at the beginning, everything that is likely to be needed at least twice in the rest of the code.

This technique obviously increases the RAM used significantly. On the other hand, RAM consumption is currently not a problem, even though we use the double-precision type, which requires 8 bytes for every number stored.
• E.g. our office PCs, as well as the LXPLUS machines, have at least 4 GB of RAM, and the simulation of the LHC ramp takes approximately 1.5 GB of memory.

Obviously there could be problems if we want to launch two or more simulations on the same machine (e.g. locally), but on the LXPLUS batch service, for example, one can request at least 4 GB of RAM for each submitted job.

So all the effort has been put into saving computation time.
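As a quick sanity check on these numbers (illustrative arithmetic, not a measurement): one double-precision entry per turn of a 9-million-turn ramp costs roughly 70 MB per array.

```python
# One 8-byte double per turn for a 9-million-turn ramp:
turns = 9_000_000
bytes_per_double = 8
mib = turns * bytes_per_double / 1024**2   # mebibytes
assert round(mib) == 69                    # ~69 MiB per turn-by-turn array
```

So a handful of such turn-by-turn arrays, plus the macroparticle coordinate arrays, is consistent with the ~1.5 GB quoted above.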
[Diagram: quantities in the code classified by cost — high computation time / negligible RAM usage; negligible computation time / high RAM usage; zero computation time / negligible RAM usage]
Definitions and parameters of interest

We'll say that f is of the order of g, in symbols f(n) ~ g(n) for n → +∞, if there exists a positive constant C such that

    lim_{n→+∞} f(n)/g(n) = C

(f and g grow at the same pace).

We'll say that f is little-o of g, in symbols f(n) = o(g(n)) for n → +∞, if

    lim_{n→+∞} f(n)/g(n) = 0

(f grows more slowly than g).

Example: with f(n) = n, g(n) = 2n and h(n) = n log n, we have f(n) ~ g(n) and f(n) = o(h(n)).

This type of notation (asymptotic analysis) is used to compare two or more different algorithms that give the same result.

Example: the insertion sort and quicksort sorting algorithms. In the average case we have

    C[insertion_sort](n) ~ n²,    C[quick_sort](n) ~ n log n,

where C stands for the computation time (or cost), which obviously depends on the type of algorithm and on the number n of elements to be sorted.
Definitions and parameters of interest

This implies that C[quick_sort] = o(C[insertion_sort]); that is, quicksort performs better than insertion sort, at least for large n.

Main parameters of interest for the optimization study of BLonD:
• number of macroparticles M
• number of slices S

We will assume that these two variables are independent of each other; in other words, we can have M → ∞ and/or S → ∞.
Histogram constant space
Code not optimized: numpy.histogram is too expensive if the slicing is done with constant bin spacing. Why? Let's have a look with Spyder, which can easily show all the subroutines of a given routine.

From this picture we still can't say much, since we don't know how the various subroutines are used. Let's go deeper...

[Profiler annotations: C[quicksort](n) ~ n log n; binary search algorithm used, C[bin_search](n) ~ log n; total ~ 16 M + ...]
Histogram constant space

numpy.histogram (a = array of coordinates, bins = array of edges, M = len(a), S = len(bins) - 1) internally does:

    block = 65536
    for i in arange(0, len(a), block):
        sa = sort(a[i:i+block])
        n += np.r_[sa.searchsorted(bins[:-1], 'left'),
                   sa.searchsorted(bins[-1], 'right')]
    n = np.diff(n)

So if M > 65536, C[numpy.histogram] ~ 16 M + S ⌈M/65536⌉, i.e. linear in both M and S (each block of 65536 elements is quicksorted, and log₂ 65536 = 16). If the length of our array is M ≤ 65536, C[numpy.histogram] ~ M log₂ M + S, i.e. linearithmic in M and linear in S. One problem is the large number of function calls (searchsorted, diff, and quicksort, which is a recursive method), so even though the cost is linear in M for large M, the slope is high. The second problem is the linear dependence on S.
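For illustration, the blocked strategy above can be reproduced in a few lines of NumPy (a sketch, not BLonD code) and checked against numpy.histogram directly:

```python
import numpy as np

def blocked_histogram(a, bins, block=65536):
    """Reproduce numpy.histogram's internal blocked sort + binary-search strategy."""
    n = np.zeros(len(bins), dtype=np.intp)
    for i in range(0, len(a), block):
        sa = np.sort(a[i:i + block])                 # quicksort: ~ block * log2(block)
        n += np.r_[sa.searchsorted(bins[:-1], 'left'),   # binary search per edge
                   sa.searchsorted(bins[-1], 'right')]
    return np.diff(n)                                # cumulative counts -> per-bin counts

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 200_000)
bins = np.linspace(0.0, 1.0, 101)
assert np.array_equal(blocked_histogram(a, bins), np.histogram(a, bins)[0])
```

Each block pays ~16 comparisons per element for the sort, and every bin edge is searched in every block — which is where the S ⌈M/65536⌉ term comes from.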
Histogram constant space

Code optimized (input = array of coordinates, output = array of slice counts, M = len(input), S = n_slices):

    inv_bin_width = n_slices / (cut_right - cut_left);
    for (i = 0; i < n_macroparticles; i++) {
        a = input[i];
        if ((a < cut_left) || (a > cut_right))
            continue;
        fbin = (a - cut_left) * inv_bin_width;
        ffbin = (uint)(fbin);
        output[ffbin] = output[ffbin] + 1.0;
    }

The cost doesn't depend on S, because the access time to a cell of an array stored in RAM doesn't depend on the length of the array (direct addressing).

So C[optimized histogram] ~ M: we have zero function calls and no dependence on S! It's really difficult to do better. This code is not autovectorizable, however, since the accesses to the output array are not sequential.
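For comparison, the same direct-indexing idea can be sketched in NumPy (illustrative only, not the BLonD C++ routine; the names cut_left/cut_right mirror the C code above): each coordinate is mapped to its bin with one subtraction, one multiply and one cast, with no search at all.

```python
import numpy as np

def histogram_const_space(a, cut_left, cut_right, n_slices):
    """Direct-indexing histogram for uniform bins: cost ~ M, independent of S."""
    inv_bin_width = n_slices / (cut_right - cut_left)
    inside = (a >= cut_left) & (a <= cut_right)          # drop out-of-range values
    fbin = ((a[inside] - cut_left) * inv_bin_width).astype(np.intp)
    fbin = np.minimum(fbin, n_slices - 1)                # a == cut_right -> last bin
    return np.bincount(fbin, minlength=n_slices)

rng = np.random.default_rng(1)
a = rng.uniform(-1.0, 2.0, 100_000)                      # some values outside the cuts
counts = histogram_const_space(a, 0.0, 1.0, 100)
edges = np.linspace(0.0, 1.0, 101)
assert np.array_equal(counts, np.histogram(a[(a >= 0.0) & (a <= 1.0)], edges)[0])
```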
Tracker
The tracker consists of three parts: Kick, kick_acceleration, drift.Obviously it doesn’t depend on S.
KICK EQUATION
beam_dE = beam_dE + V numpy.sin( h )
The problem is that we can’t use directly in Python the C math library since the routines in it can take just scalars. On the other hand making a for loop on a big array in Python is really a suicide!
Then we need to code in C++...
It can be even 17x more expensive than math.sin in Python!!!
Tracker
The sin in math.h is not autovectorizable, for two reasons:
• It is not inline, so when it is called inside an otherwise vectorizable for loop, the compiler doesn't vectorize the loop.
• It doesn't use polynomials but large look-up tables, and so it can't be vectorized.

Solution: a Taylor series? Good, since we would deal with polynomials, which are easily vectorizable. But even better: Padé rational functions, which allow autovectorisation (they are ratios of polynomials) and need fewer terms than a Taylor expansion to reach a given accuracy.
We use the fast_sin routine from the VDT CERN library (D. Piparo and others)
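As a toy illustration of why rational approximants work well (this is not the VDT formula; fast_sin is far more accurate), the low-order Padé [3/2] approximant of sin is a ratio of two short polynomials, branch-free and easily vectorized:

```python
import numpy as np

def pade_sin(x):
    """Padé [3/2] approximant of sin(x) around 0: (x - 7x³/60) / (1 + x²/20)."""
    x2 = x * x
    return (x - 7.0 * x * x2 / 60.0) / (1.0 + x2 / 20.0)

x = np.linspace(-1.0, 1.0, 1001)
err = np.max(np.abs(pade_sin(x) - np.sin(x)))
assert err < 1e-3   # only a few multiply/add/divide operations per element
```

The body is pure element-wise arithmetic with no branches or table lookups, so a compiler can emit SIMD instructions for the whole loop.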
KICK_ACCELERATION EQUATION
beam_dE = beam_dE + acceleration_kick
In C++ we have a for loop that is immediately vectorised
Tracker

SIMPLE_DRIFT EQUATION

    beam_theta = beam_theta + 2π · η₀ · beam_dE / (β² · E)

In C++ we have a for loop that is immediately vectorised
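Putting the three parts together, one tracking turn amounts to a few element-wise array updates. The sketch below is schematic: all parameter names and values are illustrative placeholders, not the BLonD API.

```python
import numpy as np

def track_turn(theta, dE, V=1e6, h=35640, phi=0.0, acc_kick=0.0, drift_coef=1e-9):
    """One schematic tracking turn in (theta, dE) coordinates.

    All parameters are made-up placeholders for illustration.
    """
    dE = dE + V * np.sin(h * theta + phi)   # kick
    dE = dE + acc_kick                      # kick_acceleration
    theta = theta + drift_coef * dE         # simple drift
    return theta, dE

theta = np.zeros(5)
dE = np.linspace(-1e6, 1e6, 5)
theta, dE = track_turn(theta, dE)
assert theta.shape == dE.shape == (5,)
```

Every line is an independent element-wise update over the macroparticle arrays, which is exactly why the C++ versions vectorize (and, later, parallelize) so well.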
LHC ramp with feedback and no impedances
Parameters: M = 50000, S = 100, num_turns = 1000. Machine: my office PC
CODE NOT OPTIMIZED
CODE OPTIMIZED

RESULTS: histogram: from 3.477 s to 0.188 s; tracker: from 1.877 s to 0.747 s
SPS with full impedance model at injection
Parameters: M = 5000000, S = 500, num_turns = 10. Machine: my office PC
CODE NOT OPTIMIZED
CODE OPTIMIZED

RESULTS: histogram: from 3.480 s to 0.178 s; tracker: from 4.949 s to 1.010 s
Observations

HISTOGRAM:

LHC: M = 50000, S = 100, num_turns = 1000, histogram: from 3.477 s to 0.188 s
SPS: M = 5000000, S = 500, num_turns = 10, histogram: from 3.480 s to 0.178 s

• Remember that the optimised histogram is linear in M and independent of S, so if M × num_turns is preserved, as is the case here, the computation times should not change significantly.
• Recall the two formulas for the non-optimised histogram:
  if M ≤ 65536, C[numpy.histogram] ~ M log₂ M + S;
  if M > 65536, C[numpy.histogram] ~ 16 M + S ⌈M/65536⌉.
  In both the LHC and SPS cases we have S << M and S << 65536, so
  C[numpy.histogram_LHC] ~ M log₂ M and C[numpy.histogram_SPS] ~ 16 M + S ⌈M/65536⌉ ~ 16 M.
  Since M × num_turns is preserved, the total times have to be largely the same!
Observations
[Profiler screenshots: LHC numpy.histogram vs SPS numpy.histogram]

Number of sorted blocks in the SPS case: ceil(5000000 / 65536) × 10 = ceil(76.29) × 10 = 770
Summary and next steps

BLonD has to be optimised if we want to carry out expensive simulations.
The tools used for the various optimizations have been shown.
An asymptotic study of the histogram routine has been done.
The histogram and the tracker have been optimized with very good results; in addition, it has been shown that it's difficult to optimize these routines further, at least with serial code.

IN THE NEAR FUTURE...
Optimization of other time-consuming routines, for example FFT, convolution, interpolation, Hamiltonian..., with and without parallelization.
Multibunch and parallelization have to be done at the same time:
• task parallelization: define dependences among classes
• 'trivial' parallelization over cores, e.g. for the tracker
• physics parallelization of intensity effects
• time domain: where to truncate the convolution
Experimental:Parallelization with OpenMP
OpenMP is an interface that allows parallelizing tasks when multiple cores share a common memory, as happens on our office PCs or on LXPLUS.

The user can easily choose the number of cores to use. We found that on LXPLUS it's often better to use 7 cores rather than 8, perhaps because one core, being the main one, is generally more loaded than the others (see next slide).

In our case a lot of tasks can be parallelized, for example the tracker, the histogram and the interpolation routines, since the particles are independent of each other.

However, parallelizing with OpenMP is not trivial, and a serial code can even be faster if the parallelization is not done efficiently, or if the problem size is so small that the cost of communication between processors exceeds the gain from the splitting (see the last benchmark).

cProfile, and therefore the Spyder profiler as well, has not been tested by its developers for profiling multithreaded code. On the other hand, it's difficult to find reliable time profilers for multicore routines.
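The private-partial-histogram reduction pattern used with OpenMP can be illustrated in Python with the multiprocessing module (an analogue for illustration, not the BLonD OpenMP code): each worker bins only its own chunk of particles, so there is no shared state to lock, and the partial histograms are summed at the end.

```python
import numpy as np
from multiprocessing import Pool

EDGES = np.linspace(0.0, 1.0, 101)

def partial_histogram(chunk):
    # Each worker bins its own particles: no shared state, no locking.
    return np.histogram(chunk, EDGES)[0]

def parallel_histogram(a, n_workers=4):
    chunks = np.array_split(a, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(partial_histogram, chunks)
    return np.sum(partials, axis=0)     # reduction: sum the private histograms

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    a = rng.uniform(0.0, 1.0, 400_000)
    assert np.array_equal(parallel_histogram(a), np.histogram(a, EDGES)[0])
```

As the slide notes, the pattern only pays off when the per-chunk work outweighs the cost of spawning workers and shipping the partial results back.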
Experimental: Parallelization with OpenMP

Comparison of the NumPy histogram against a parallelized version of the optimised histogram discussed earlier. Parameters: 500000 particles, 100 turns. Machine: LXPLUS. Times in seconds:

          NUMPY    1 core   2 cores  3 cores  4 cores  5 cores  6 cores  7 cores  8 cores  max
  Test 1  7.489    1.648    0.842    0.650    0.457    0.422    0.313    0.305    0.346    4.084
  Test 2  7.514    1.636    0.821    0.566    0.472    0.350    0.302    0.291    0.336    0.462
  Test 3  7.496    1.600    0.874    0.576    0.446    0.377    0.337    0.299    0.348    0.331
  Test 4  7.519    1.609    0.827    0.645    0.508    0.363    0.305    0.299    0.326    0.312
  Test 5  7.486    1.624    0.831    0.576    0.453    0.396    0.356    0.278    0.353    0.274
Times in seconds; the non-parallel columns use 1 core.

  PARALLEL with 2 cores (LOCAL machine):
          NUMPY    PARALLEL 2 CORES   VECTORIZABLE DOUBLE   OPTIMISED DOUBLE   VECTORIZABLE FLOAT   OPTIMISED FLOAT
  Test 1  3.547    0.488              0.445                 0.204              0.283                0.183
  Test 2  3.566    0.520              0.439                 0.207              0.285                0.185
  Test 3  3.537    0.536              0.455                 0.206              0.295                0.185

  PARALLEL with 7 cores, best (LXPLUS):
          NUMPY    PARALLEL 7 CORES   VECTORIZABLE DOUBLE   OPTIMISED DOUBLE   VECTORIZABLE FLOAT   OPTIMISED FLOAT
  Test 1  7.924    0.267              0.565                 0.263              0.390                0.203
  Test 2  7.582    0.257              0.558                 0.252              0.386                0.204
  Test 3  7.807    0.291              0.584                 0.265              0.376                0.221

  (Slide annotation: + 0.2 s for casting in the FLOAT columns.)
Benchmarking of various histogram methods: 500000 particles, 100 turns.
OPTIMISED is the optimised method discussed before in these slides.
VECTORIZABLE derives from OPTIMISED; it's autovectorizable, but it has two for loops inside instead of one.
The PARALLEL method is the same as the one on the previous slide.
Experimental: Parallelization with OpenMP
“ We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
Donald Knuth, computer scientist, Professor Emeritus at Stanford University, called the "father of the analysis of algorithms".
THANK YOU!