TRANSCRIPT
HEADTAIL working group meeting 1
Code optimization in BLonD
D. Quartullo
30/10/2014
Acknowledgements: T. Argyropoulos, B. Hegner, A. Lasheen,J. E. Muller, D. Piparo, E. Shaposhnikova, H. Timko
Outlook

Why optimize BLonD?
Language used and tools for optimization
• Python
• Faster with C: Ctypes
• Spyder profiler
• GCC compiler and flags
BLonD structure and optimization strategies
• Main file, packages, modules and setup files
• What to optimize: RAM memory and computation time
Asymptotic study and routine optimization
• Definitions and parameters of interest
• Histogram constant space
• Tracker
Two realistic cases:
• LHC ramp with feedback and no impedances
• SPS with full impedance model at injection
• Observations
Summary and next steps
Experimental: parallelization with OpenMP
Why optimize BLonD?
Because, in general, simulating the acceleration ramp requires the code to handle millions of turns (e.g. 9 million for the LHC). This implies several days, or even weeks, of computing time, even with a number of macroparticles quite low compared to the real number of particles. In addition, the CERN LXPLUS batch service advises against submitting jobs to the 1- or 2-week queues!

Because the multibunch case is quite expensive, since one has to consider a lot of macroparticles and slices.

So the present runtimes are unacceptable in view of the multibunch extension:
• e.g. the LHC acceleration ramp takes 1 week on LXPLUS without intensity effects, for 50000 particles only!

Effective parallelization requires optimization first, because a non-optimized parallelized code may perform no better than an optimized serial code, and the serial code is more stable!

Because, as we will see, it's possible to optimise the code quite significantly!
Python
The language used is Python, mainly because it is:
• fast to program with
• open source
• very widespread in the scientific computing community, so there is a lot of support available
• full of packages and routines, so the user has nothing to envy of Matlab or Mathematica users.

We are using the Anaconda distribution, which includes Python v2.7 64-bit and all the packages we need. Anaconda can be installed very easily on Linux, Windows and Mac.
Link: https://store.continuum.io/cshop/anaconda/

But since Python code, like that of its interpreted cousins, is interpreted rather than compiled, it typically cannot reach the performance of compiled code such as C/C++ or Fortran.

We need something to boost Python...
Faster with C: Ctypes
There are plenty of plugins that allow one to 'connect' C/C++ and Python code. The main ones are the Python C API, Ctypes, SWIG and Cython.

We started using Ctypes because:
• It allows coding the routines one wants to optimise in pure C/C++, with just an unavoidable overhead due to the type casting from Python to C.
• For our applications this overhead is in general small, since the number of floating-point operations that BLonD has to carry out is large.
• The programmer has full control of the process of embedding C in Python (flags, parallelization in C, autovectorisation).
• The programmer does not have to learn a new language, as with Cython, assuming that he knows C!
• If a certain routine is optimised in C and then linked with Python, the programmer can be sure that the best possible has been done.
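A minimal sketch of the Ctypes mechanism (illustrative only, not a BLonD routine): load the C math library and call its scalar sin from Python. The library lookup assumes a Unix-like system.

```python
import ctypes
import ctypes.util
import math

# Load the C math library (libm); find_library resolves the platform-specific
# name (e.g. "libm.so.6" on Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sin(double). The 'type casting' overhead
# mentioned above is exactly this Python-float -> C-double conversion.
libm.sin.restype = ctypes.c_double
libm.sin.argtypes = [ctypes.c_double]

assert abs(libm.sin(1.0) - math.sin(1.0)) < 1e-15
```

For array-sized workloads one would instead pass a whole buffer (e.g. a NumPy array via `ctypes` pointers) to a C loop, so the per-call casting cost is paid once per array rather than once per element.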
Spyder profiler

Spyder is an IDE for Python. It includes a profiler able to measure and display, with a convenient interface, the time spent in the various routines of a Python run.

Spyder ships, for example, with the Anaconda distribution.

The Spyder profiler uses Python's cProfile module.

A good web page with useful hints on profiling techniques and performance analysis: http://www.huyng.com/posts/python-performance-analysis/
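Under the hood, the same measurement can be scripted directly with cProfile (a minimal sketch, independent of Spyder; the function `busy` is just a placeholder workload):

```python
import cProfile
import pstats
import io

def busy():
    # Placeholder workload to have something worth profiling.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Print the per-function timing table sorted by cumulative time,
# which is essentially what the Spyder GUI displays.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
assert "busy" in report
```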
GCC compiler and flags

We need a C++ compiler to compile the files that have to be linked to Python via the Ctypes module.
GCC is open source, complete and easy to use.
We are using GCC v4.8.1 64-bit.
Windows: http://tdm-gcc.tdragon.net/
Linux (CERN): source /afs/cern.ch/sw/lcg/contrib/gcc/4.8.1/x86_64-slc6/setup.sh
One can compile a .cpp file with certain options, called 'flags', used mostly to optimize the code. For example:
• '-O' optimization flags: autovectorisation, fast mathematical operations, ... (think of it as parallelization within a single core!)
• '-mfma4': fused multiply-add operations, speeding up operations of the type a + b × c (not all machines support this feature, but LXPLUS does!)
• '-std=c++11': the latest standard of the C++ language
• '-fopenmp': task parallelization with OpenMP
Packages, modules and setup files
[Structure diagram labels: code documentation; test cases; beam creation and slicing; C++ optimised routines; Cython routines (just for test); impedances in time and frequency domain; beam and RF parameter definitions; feedbacks and RF noise; statistics; plots; trackers; README file; setup file for the C++ optimised routines; setup file for the Cython optimised routines]
What to optimize: RAM memory and computation time
In BLonD we store the momentum program, beam energy, beam coordinates and similar quantities in arrays. In that way we save computation time on average, since we compute once, at the beginning, everything that is likely to be needed at least twice in the rest of the code.

This technique obviously increases the RAM used significantly. On the other hand, RAM consumption is currently not a problem, even though we use the double-precision type, which requires 8 bytes for every number stored.
• E.g. our office PCs, as well as the LXPLUS machines, have at least 4 GB of RAM, and the simulation of the LHC ramp takes approximately 1.5 GB of memory.

Obviously there could be problems if we want to launch two or more simulations on the same machine (e.g. locally), but on the LXPLUS batch service, for example, one can request at least 4 GB of RAM for each submitted job.

So all the effort has been put into saving computation time.
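As a quick sanity check on these numbers (illustrative arithmetic, not a measurement): one double-precision entry per turn of a 9-million-turn ramp costs roughly 70 MB per array.

```python
# One 8-byte double per turn for a 9-million-turn ramp:
turns = 9_000_000
bytes_per_double = 8
mib = turns * bytes_per_double / 1024**2   # mebibytes
assert round(mib) == 69                    # ~69 MiB per turn-by-turn array
```

So a handful of such turn-by-turn arrays, plus the macroparticle coordinate arrays, is consistent with the ~1.5 GB quoted above.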
[Diagram: quantities in the code classified by cost — high computation time / negligible RAM usage; negligible computation time / high RAM usage; zero computation time / negligible RAM usage]
Definitions and parameters of interest

We'll say that f is of the order of g, in symbols f(n) ~ g(n) for n → +∞, if there exists a positive constant C such that

    lim_{n→+∞} f(n)/g(n) = C

(f and g grow at the same pace).

We'll say that f is little-o of g, in symbols f(n) = o(g(n)) for n → +∞, if

    lim_{n→+∞} f(n)/g(n) = 0

(f grows more slowly than g).

Example: with f(n) = n, g(n) = 2n and h(n) = n log n, we have f(n) ~ g(n) and f(n) = o(h(n)).

This type of notation (asymptotic analysis) is used to compare two or more different algorithms that give the same result.

Example: the insertion sort and quicksort sorting algorithms. In the average case we have

    C[insertion_sort](n) ~ n²,    C[quick_sort](n) ~ n log n,

where C stands for the computation time (or cost), which obviously depends on the type of algorithm and on the number n of elements to be sorted.
Definitions and parameters of interest

This implies that C[quick_sort] = o(C[insertion_sort]); that is, quicksort performs better than insertion sort, at least for large n.

Main parameters of interest for the optimization study of BLonD:
• number of macroparticles M
• number of slices S

We will assume that these two variables are independent of each other; in other words, we can have M → ∞ and/or S → ∞.
Histogram constant space
Code not optimized: numpy.histogram is too expensive if the slicing is done with constant bin spacing. Why? Let's have a look with Spyder, which can easily show all the subroutines of a given routine.

From this picture we still can't say much, since we don't know how the various subroutines are used. Let's go deeper...

[Profiler annotations: C[quicksort](n) ~ n log n; binary search algorithm used, C[bin_search](n) ~ log n; total ~ 16 M + ...]
Histogram constant space

numpy.histogram (a = array of coordinates, bins = array of edges, M = len(a), S = len(bins) - 1) internally does:

    block = 65536
    for i in arange(0, len(a), block):
        sa = sort(a[i:i+block])
        n += np.r_[sa.searchsorted(bins[:-1], 'left'),
                   sa.searchsorted(bins[-1], 'right')]
    n = np.diff(n)

So if M > 65536, C[numpy.histogram] ~ 16 M + S ⌈M/65536⌉, i.e. linear in both M and S (each block of 65536 elements is quicksorted, and log₂ 65536 = 16). If the length of our array is M ≤ 65536, C[numpy.histogram] ~ M log₂ M + S, i.e. linearithmic in M and linear in S. One problem is the large number of function calls (searchsorted, diff, and quicksort, which is a recursive method), so even though the cost is linear in M for large M, the slope is high. The second problem is the linear dependence on S.
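For illustration, the blocked strategy above can be reproduced in a few lines of NumPy (a sketch, not BLonD code) and checked against numpy.histogram directly:

```python
import numpy as np

def blocked_histogram(a, bins, block=65536):
    """Reproduce numpy.histogram's internal blocked sort + binary-search strategy."""
    n = np.zeros(len(bins), dtype=np.intp)
    for i in range(0, len(a), block):
        sa = np.sort(a[i:i + block])                 # quicksort: ~ block * log2(block)
        n += np.r_[sa.searchsorted(bins[:-1], 'left'),   # binary search per edge
                   sa.searchsorted(bins[-1], 'right')]
    return np.diff(n)                                # cumulative counts -> per-bin counts

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 200_000)
bins = np.linspace(0.0, 1.0, 101)
assert np.array_equal(blocked_histogram(a, bins), np.histogram(a, bins)[0])
```

Each block pays ~16 comparisons per element for the sort, and every bin edge is searched in every block — which is where the S ⌈M/65536⌉ term comes from.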
Histogram constant space

Code optimized (input = array of coordinates, output = array of slice counts, M = len(input), S = n_slices):

    inv_bin_width = n_slices / (cut_right - cut_left);
    for (i = 0; i < n_macroparticles; i++) {
        a = input[i];
        if ((a < cut_left) || (a > cut_right))
            continue;
        fbin = (a - cut_left) * inv_bin_width;
        ffbin = (uint)(fbin);
        output[ffbin] = output[ffbin] + 1.0;
    }

The cost doesn't depend on S, because the access time to a cell of an array stored in RAM doesn't depend on the length of the array (direct addressing).

So C[optimized histogram] ~ M: we have zero function calls and no dependence on S! It's really difficult to do better. This code is not autovectorizable, however, since the accesses to the output array are not sequential.
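For comparison, the same direct-indexing idea can be sketched in NumPy (illustrative only, not the BLonD C++ routine; the names cut_left/cut_right mirror the C code above): each coordinate is mapped to its bin with one subtraction, one multiply and one cast, with no search at all.

```python
import numpy as np

def histogram_const_space(a, cut_left, cut_right, n_slices):
    """Direct-indexing histogram for uniform bins: cost ~ M, independent of S."""
    inv_bin_width = n_slices / (cut_right - cut_left)
    inside = (a >= cut_left) & (a <= cut_right)          # drop out-of-range values
    fbin = ((a[inside] - cut_left) * inv_bin_width).astype(np.intp)
    fbin = np.minimum(fbin, n_slices - 1)                # a == cut_right -> last bin
    return np.bincount(fbin, minlength=n_slices)

rng = np.random.default_rng(1)
a = rng.uniform(-1.0, 2.0, 100_000)                      # some values outside the cuts
counts = histogram_const_space(a, 0.0, 1.0, 100)
edges = np.linspace(0.0, 1.0, 101)
assert np.array_equal(counts, np.histogram(a[(a >= 0.0) & (a <= 1.0)], edges)[0])
```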
Tracker
The tracker consists of three parts: Kick, kick_acceleration, drift.Obviously it doesn’t depend on S.
KICK EQUATION
beam_dE = beam_dE + V numpy.sin( h )
The problem is that we can’t use directly in Python the C math library since the routines in it can take just scalars. On the other hand making a for loop on a big array in Python is really a suicide!
Then we need to code in C++...
It can be even 17x more expensive than math.sin in Python!!!
Tracker
The sin in math.h is not autovectorizable, for two reasons:
• It is not inline, so when it is called inside an otherwise vectorizable for loop, the compiler doesn't vectorize the loop.
• It doesn't use polynomials but large look-up tables, and so it can't be vectorized.

Solution: a Taylor series? Good, since we would deal with polynomials, which are easily vectorizable. But even better: Padé rational functions, which allow autovectorisation (they are ratios of polynomials) and need fewer terms than a Taylor expansion to reach a given accuracy.
We use the fast_sin routine from the VDT CERN library (D. Piparo and others)
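As a toy illustration of why rational approximants work well (this is not the VDT formula; fast_sin is far more accurate), the low-order Padé [3/2] approximant of sin is a ratio of two short polynomials, branch-free and easily vectorized:

```python
import numpy as np

def pade_sin(x):
    """Padé [3/2] approximant of sin(x) around 0: (x - 7x³/60) / (1 + x²/20)."""
    x2 = x * x
    return (x - 7.0 * x * x2 / 60.0) / (1.0 + x2 / 20.0)

x = np.linspace(-1.0, 1.0, 1001)
err = np.max(np.abs(pade_sin(x) - np.sin(x)))
assert err < 1e-3   # only a few multiply/add/divide operations per element
```

The body is pure element-wise arithmetic with no branches or table lookups, so a compiler can emit SIMD instructions for the whole loop.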
KICK_ACCELERATION EQUATION
beam_dE = beam_dE + acceleration_kick
In C++ we have a for loop that is immediately vectorised
Tracker

SIMPLE_DRIFT EQUATION

    beam_theta = beam_theta + 2π · η₀ · beam_dE / (β² · E)

In C++ we have a for loop that is immediately vectorised
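Putting the three parts together, one tracking turn amounts to a few element-wise array updates. The sketch below is schematic: all parameter names and values are illustrative placeholders, not the BLonD API.

```python
import numpy as np

def track_turn(theta, dE, V=1e6, h=35640, phi=0.0, acc_kick=0.0, drift_coef=1e-9):
    """One schematic tracking turn in (theta, dE) coordinates.

    All parameters are made-up placeholders for illustration.
    """
    dE = dE + V * np.sin(h * theta + phi)   # kick
    dE = dE + acc_kick                      # kick_acceleration
    theta = theta + drift_coef * dE         # simple drift
    return theta, dE

theta = np.zeros(5)
dE = np.linspace(-1e6, 1e6, 5)
theta, dE = track_turn(theta, dE)
assert theta.shape == dE.shape == (5,)
```

Every line is an independent element-wise update over the macroparticle arrays, which is exactly why the C++ versions vectorize (and, later, parallelize) so well.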
LHC ramp with feedback and no impedances
Parameters: M = 50000, S = 100, num_turns = 1000. Machine: my office PC
CODE NOT OPTIMIZED
CODE OPTIMIZED

RESULTS: histogram: from 3.477 s to 0.188 s; tracker: from 1.877 s to 0.747 s
SPS with full impedance model at injection
Parameters: M = 5000000, S = 500, num_turns = 10. Machine: my office PC
CODE NOT OPTIMIZED
CODE OPTIMIZED

RESULTS: histogram: from 3.480 s to 0.178 s; tracker: from 4.949 s to 1.010 s
Observations

HISTOGRAM:

LHC: M = 50000, S = 100, num_turns = 1000, histogram: from 3.477 s to 0.188 s
SPS: M = 5000000, S = 500, num_turns = 10, histogram: from 3.480 s to 0.178 s

• Remember that the optimised histogram is linear in M and independent of S, so if M × num_turns is preserved, as is the case here, the computation times should not change significantly.
• Recall the two formulas for the non-optimised histogram:
  if M ≤ 65536, C[numpy.histogram] ~ M log₂ M + S;
  if M > 65536, C[numpy.histogram] ~ 16 M + S ⌈M/65536⌉.
  In both the LHC and SPS cases we have S << M and S << 65536, so
  C[numpy.histogram_LHC] ~ M log₂ M and C[numpy.histogram_SPS] ~ 16 M + S ⌈M/65536⌉ ~ 16 M.
  Since M × num_turns is preserved, the total times have to be largely the same!
Observations
[Profiler screenshots: LHC numpy.histogram vs SPS numpy.histogram]

Number of sorted blocks in the SPS case: ceil(5000000 / 65536) × 10 = ceil(76.29) × 10 = 770
Summary and next steps

BLonD has to be optimised if we want to carry out expensive simulations.
The tools used for the various optimizations have been shown.
An asymptotic study of the histogram routine has been done.
The histogram and the tracker have been optimized with very good results; in addition, it has been shown that it's difficult to optimize these routines further, at least with serial code.

IN THE NEAR FUTURE...
Optimization of other time-consuming routines, for example FFT, convolution, interpolation, Hamiltonian..., with and without parallelization.
Multibunch and parallelization have to be done at the same time:
• task parallelization: define dependences among classes
• 'trivial' parallelization over cores, e.g. for the tracker
• physics parallelization of intensity effects
• time domain: where to truncate the convolution
Experimental:Parallelization with OpenMP
OpenMP is an interface that allows parallelizing tasks when multiple cores share a common memory, as happens on our office PCs or on LXPLUS.

The user can easily choose the number of cores to use. We found that on LXPLUS it's often better to use 7 cores rather than 8, perhaps because one core, being the main one, is generally more loaded than the others (see next slide).

In our case a lot of tasks can be parallelized, for example the tracker, the histogram and the interpolation routines, since the particles are independent of each other.

However, parallelizing with OpenMP is not trivial, and a serial code can even be faster if the parallelization is not done efficiently, or if the problem size is so small that the cost of communication between processors exceeds the gain from the splitting (see the last benchmark).

cProfile, and therefore the Spyder profiler as well, has not been tested by its developers for profiling multithreaded code. On the other hand, it's difficult to find reliable time profilers for multicore routines.
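The private-partial-histogram reduction pattern used with OpenMP can be illustrated in Python with the multiprocessing module (an analogue for illustration, not the BLonD OpenMP code): each worker bins only its own chunk of particles, so there is no shared state to lock, and the partial histograms are summed at the end.

```python
import numpy as np
from multiprocessing import Pool

EDGES = np.linspace(0.0, 1.0, 101)

def partial_histogram(chunk):
    # Each worker bins its own particles: no shared state, no locking.
    return np.histogram(chunk, EDGES)[0]

def parallel_histogram(a, n_workers=4):
    chunks = np.array_split(a, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(partial_histogram, chunks)
    return np.sum(partials, axis=0)     # reduction: sum the private histograms

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    a = rng.uniform(0.0, 1.0, 400_000)
    assert np.array_equal(parallel_histogram(a), np.histogram(a, EDGES)[0])
```

As the slide notes, the pattern only pays off when the per-chunk work outweighs the cost of spawning workers and shipping the partial results back.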
Experimental: Parallelization with OpenMP

Comparison of the NumPy histogram against a parallelized version of the optimised histogram discussed earlier. Parameters: 500000 particles, 100 turns. Machine: LXPLUS. Times in seconds:

          NUMPY    1 core   2 cores  3 cores  4 cores  5 cores  6 cores  7 cores  8 cores  max
  Test 1  7.489    1.648    0.842    0.650    0.457    0.422    0.313    0.305    0.346    4.084
  Test 2  7.514    1.636    0.821    0.566    0.472    0.350    0.302    0.291    0.336    0.462
  Test 3  7.496    1.600    0.874    0.576    0.446    0.377    0.337    0.299    0.348    0.331
  Test 4  7.519    1.609    0.827    0.645    0.508    0.363    0.305    0.299    0.326    0.312
  Test 5  7.486    1.624    0.831    0.576    0.453    0.396    0.356    0.278    0.353    0.274
Times in seconds; the non-parallel columns use 1 core.

  PARALLEL with 2 cores (LOCAL machine):
          NUMPY    PARALLEL 2 CORES   VECTORIZABLE DOUBLE   OPTIMISED DOUBLE   VECTORIZABLE FLOAT   OPTIMISED FLOAT
  Test 1  3.547    0.488              0.445                 0.204              0.283                0.183
  Test 2  3.566    0.520              0.439                 0.207              0.285                0.185
  Test 3  3.537    0.536              0.455                 0.206              0.295                0.185

  PARALLEL with 7 cores, best (LXPLUS):
          NUMPY    PARALLEL 7 CORES   VECTORIZABLE DOUBLE   OPTIMISED DOUBLE   VECTORIZABLE FLOAT   OPTIMISED FLOAT
  Test 1  7.924    0.267              0.565                 0.263              0.390                0.203
  Test 2  7.582    0.257              0.558                 0.252              0.386                0.204
  Test 3  7.807    0.291              0.584                 0.265              0.376                0.221

  (Slide annotation: + 0.2 s for casting in the FLOAT columns.)
Benchmarking of various histogram methods: 500000 particles, 100 turns.
OPTIMISED is the optimised method discussed before in these slides.
VECTORIZABLE derives from OPTIMISED; it's autovectorizable, but it has two for loops inside instead of one.
The PARALLEL method is the same as the one on the previous slide.
Experimental: Parallelization with OpenMP
“ We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
Donald Knuth, computer scientist, Professor Emeritus at Stanford University, called the "father of the analysis of algorithms".
THANK YOU!