
presented at ICALEPCS 2017

Optimized Calculation of Timing for Parallel Beam Operation at the FAIR Accelerator Complex

A. Schaller, J. Fitzek - GSI, Darmstadt, Germany

Prof. Dr. F. Wolf, Dr. D. Lorenz - TU Darmstadt, Germany

Abstract

For the new FAIR accelerator complex at GSI, the settings management system LSA is used. It is developed in collaboration with CERN and, until now, has been executed strictly serially. Nowadays the performance gain of single-core processors has nearly stagnated and multicore processors dominate the market. This evolution forces software projects to make use of parallel hardware to increase their performance. In this thesis, LSA is analyzed and parallelized using different parallelization patterns such as task and loop parallelization. The most common case of user interaction is to change specific settings so that the accelerator performs at its best. For each changed setting, LSA needs to calculate all child settings in the parameter hierarchy. To maximize the speedup of the calculations, they are also optimized sequentially. The data structures and algorithms used are reviewed to ensure minimal resource usage and maximal compatibility with parallel execution. The overall goal of this thesis is to speed up the calculations so that the results can be shown in a user interface with nearly no noticeable latency.

Motivation

To allow the commissioning and operation of FAIR, the software used today has to be optimized. The Cryring (YR), with its local injector, acts as a test facility for the new control system and in particular for its central component, the settings management system LSA. During the last YR commissioning beam time, about 3 700 manual trims were calculated per week across 80 working hours, which is about one trim every 77 seconds. Since the YR is a very small accelerator ring, with a circumference of approximately 54 m, everything worked fine: the waiting time sums to about 19 minutes, and the human reaction time is not much less. But when it comes to calculating trims for the Heavy Ion Synchrotron 18 (SIS18) or SIS100, the calculations become very slow. To calculate 3 700 trims for the SIS18, with its circumference of approximately 216 m, an operator would have to wait over 13 hours. For the SIS100, with approximately 1 100 m, the calculation would even take over 91 hours.

Speedup

The speedup is a factor that shows how two different algorithms perform on the same task. In the context of parallelization, the speedup indicates how much faster the parallel algorithm is than the sequential one. It is given by

\[ S(P) = \frac{T(1)}{T(P)} \tag{1} \]

where T(n) is the total execution time on a system with n processing units. T(1) can also be written as

\[ T(1) = T_{\mathrm{setup}} + T_{\mathrm{compute}} + T_{\mathrm{finalize}} \tag{2} \]

Since the only part that can benefit from parallel optimization is T_compute, T(n) can be written as

\[ T(n) = T_{\mathrm{setup}} + \frac{T_{\mathrm{compute}}(1)}{n} + T_{\mathrm{finalize}} \tag{3} \]

The efficiency can be expressed as

\[ E(P) = \frac{S(P)}{P} \tag{4} \]

where
E = Efficiency
P = Number of Processing Units
S = Speedup
T = Time
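To make Equations 1, 3 and 4 concrete, here is a minimal Java sketch (LSA itself is a Java system, but the class and method names below are purely illustrative, not LSA APIs; the only real number is the measured 10-core speedup of 9.97 from the results section):

```java
// Purely illustrative sketch of Equations 1, 3 and 4; not LSA code.
public final class SpeedupModel {

    /** Equation 3: only the compute part of T(1) is divided among n units. */
    static double predictedTime(double setup, double compute1, double finalize, int n) {
        return setup + compute1 / n + finalize;
    }

    /** Equation 1: S(P) = T(1) / T(P). */
    static double speedup(double t1, double tp) {
        return t1 / tp;
    }

    /** Equation 4: E(P) = S(P) / P. */
    static double efficiency(double s, int p) {
        return s / (double) p;
    }

    public static void main(String[] args) {
        // Measured on the 10-core target platform (see results): S(10) = 9.97.
        System.out.printf("E(10) = %.3f%n", efficiency(9.97, 10)); // prints 0.997
    }
}
```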

Work Depth Model

The Work Depth Model, described by Blelloch, allows comparing the execution times of parallel algorithms. Especially when parallelizing over trees, like the parameter hierarchy in LSA, other comparison mechanisms do not fit the problem. In the context of parallelizing LSA, the work W is the number of settings to be calculated and the depth D is the depth of the parameter hierarchy. Using Equation 5 of Blelloch, a range for the time T can be calculated for a given number of processing units P, where T depends on the hardware.

\[ \frac{W}{P} \le T < \frac{W}{P} + D \tag{5} \]

With respect to Equation 1, we can say that the speedup in the Work Depth Model is

\[ \frac{W P}{W + P D} \le S(P) < P \tag{6} \]
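As a worked check with the SIS100 numbers used in the results below (W = 3728 changed settings, D = 20 hierarchy levels) and P = 10 processing units, Equation 6 gives

\[ \frac{3728 \cdot 10}{3728 + 10 \cdot 20} = \frac{37280}{3928} \approx 9.49 \;\le\; S(10) \;<\; 10, \]

which brackets the measured speedup of 9.97 reported below.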

Test Scenarios

The following figures visualize the two patterns used for testing:

Pattern 1
• Including one Chain
• Changed one high-level parameter in
  – SIS18 (P1-1)

Pattern 2
• Including three Chains
• Changed one high-level parameter in
  – SIS100 (P2-1)
  – SIS18 (P2-2)
  – SIS18 and SIS100 (P2-3)

Overview of the test scenarios

Scenario   Nr. of Settings   Nr. of changed high-level Settings   Nr. of calculated dependent Settings   Average original time
P1-1       14 538            1                                    8 707                                  12.7 s
P2-1       38 943            1                                    11 184                                 132.9 s
P2-2       38 943            1                                    8 466                                  18.1 s
P2-3       38 943            2                                    19 650                                 155.2 s

Optimizations

Sequential

• use caching where possible
• use suitable data structures for the main use case
• reduce array copies when inserting (or deleting) multiple points in a function (see the sketch after this list)
• change algorithms with complexity O(n²) to those with O(n log n) where possible
• don't calculate a setting once per parent, but only once all of its parents have been calculated
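As referenced in the list above, here is a minimal sketch of the array-copy optimization, assuming sorted point arrays (insertAll is a hypothetical helper, not an LSA API): merging m new points in one pass replaces m separate O(n) insertions, each with its own full array copy, by a single allocation and one O(n + m) pass.

```java
import java.util.Arrays;

// Illustrative sketch, not LSA code: batch-inserting points into a
// sorted function instead of inserting them one at a time.
final class BatchInsert {

    /** Merge two sorted arrays into one new sorted array in a single pass. */
    static double[] insertAll(double[] xs, double[] newPoints) {
        double[] out = new double[xs.length + newPoints.length]; // one allocation
        int i = 0, j = 0, k = 0;
        while (i < xs.length && j < newPoints.length) {
            out[k++] = (xs[i] <= newPoints[j]) ? xs[i++] : newPoints[j++];
        }
        while (i < xs.length) out[k++] = xs[i++];          // drain remaining xs
        while (j < newPoints.length) out[k++] = newPoints[j++]; // drain new points
        return out;
    }

    public static void main(String[] args) {
        double[] merged = insertAll(new double[]{1, 3, 5}, new double[]{2, 4});
        System.out.println(Arrays.toString(merged)); // [1.0, 2.0, 3.0, 4.0, 5.0]
    }
}
```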

Parallel

• run static data preparation in parallel
• run calculation loops in parallel where possible
• use the parameter hierarchy as a task graph (see the sketch after this list)
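As referenced in the list above, here is a minimal sketch of the task-graph idea using plain CompletableFuture composition; it is not LSA's actual implementation, and Node, schedule and calculate are hypothetical names. A setting node is submitted only once all of its parents have completed, so each node is calculated exactly once (matching the last sequential optimization) while independent branches run in parallel.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

// Illustrative sketch, not LSA code: the parameter hierarchy as a task graph.
final class TaskGraphDemo {
    record Node(String name, List<Node> parents) {}

    /**
     * Schedule a node so that it runs only after all its parents.
     * Called from a single thread, so a plain HashMap suffices.
     */
    static CompletableFuture<Void> schedule(Node node,
            Map<Node, CompletableFuture<Void>> scheduled, Executor pool) {
        CompletableFuture<Void> existing = scheduled.get(node);
        if (existing != null) return existing; // already scheduled: calculated once
        CompletableFuture<?>[] deps = new CompletableFuture<?>[node.parents().size()];
        for (int i = 0; i < deps.length; i++) {
            deps[i] = schedule(node.parents().get(i), scheduled, pool);
        }
        CompletableFuture<Void> f = CompletableFuture.allOf(deps)
                .thenRunAsync(() -> calculate(node), pool);
        scheduled.put(node, f);
        return f;
    }

    static void calculate(Node n) {
        System.out.println("calculated " + n.name());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newWorkStealingPool();
        Node root = new Node("root", List.of());
        Node a = new Node("a", List.of(root));
        Node b = new Node("b", List.of(root));
        Node leaf = new Node("leaf", List.of(a, b)); // two parents, one calculation
        schedule(leaf, new HashMap<>(), pool).join();
        pool.shutdown();
    }
}
```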

Optimization Results

Average execution times on the target platform (10 cores with HT, 64 GB RAM), where each scenario was run twice for warmup and five times for measurements. The parallel execution was measured with the default threadpool size of 19 plus the main thread.

[Figure: average execution time in seconds (logarithmic scale, 0.25 to 256) for scenarios P1-1, P2-1, P2-2 and P2-3, comparing no optimization, serial optimization, and serial + parallel optimization.]

Work Depth Model: Equation 6 for W = 3728 (changed settings in SIS100) and D = 20 (depth of the parameter hierarchy for SIS100).

[Figure: Work Depth range of the speedup (Equation 6) over the number of processing units (both axes logarithmic, 1 to 512), with measured speedups marked: 3.95 for 4 cores with HT and 9.97 for 10 cores with HT.]

Work Depth Model for
• 4 cores: speedup is between 3.92 and 4.00; measured speedup is 3.95
• 10 cores: speedup is between 9.49 and 10.00; measured speedup is 9.97

The parallel speedup on the target platform with 10 cores has an efficiency E of 0.997; on the test platform with 4 cores, the efficiency is 0.987 (see Equation 4). This is nearly a so-called perfect linear speedup, where E = 1.

The following image shows the memory consumption on the target platform for scenario P2-3. The scenario was run twice for warmup and five times for the measurements.

[Figure: heap memory consumption (0 GB to 6 GB) for P2-3, no optimizations vs. serial + parallel optimizations: 738.6 s vs. 8.7 s runtime, 305 vs. 5 garbage collections, 1 353.4 GB vs. 18.9 GB total memory allocation.]

Overview of the needed resources for P2-3

optimization              none           sequential     sequential & parallel
trim time                 95.6 s         5.4 s          0.7 s
CPU usage avg             6.2 %          9.1 %          62.8 %
CPU usage max             19.9 %         11.1 %         62.8 %
Heap avg                  2.5 GB         1.8 GB         2.0 GB
Heap max                  5.4 GB         3.4 GB         3.8 GB
GC pause time avg         11.4 ms        29.7 ms        33.8 ms
GC pause time max         70.5 ms        29.7 ms        33.8 ms
Main thread usage         100 %          100 %          8.5 %
Thread pool usage total   0 %            0 %            91.5 %
Thread pool usage avg     0 %            0 %            4.8 %
TLAB size total           147.5 GB       6.4 GB         4.0 GB
TLAB size avg             64.1 MB        56.8 MB        3.6 MB
TLAB size max             105.7 MB       75.0 MB        72.8 MB
Alloc. rate for TLAB      1 495.0 MB/s   731.3 MB/s     672.0 MB/s
Object size total         114.3 kB       10.4 kB        5.0 kB
Object size avg           1.0 kB         10.4 kB        0.8 kB
Object size max           32.0 kB        10.4 kB        2.1 kB
Alloc. rate for Objects   1.1 kB/s       1.1 kB/s       0.8 kB/s

For this chart, each scenario was executed two times for warmup and five times for measurements on a test platform (4 cores, 12 GB RAM), with and without Hyper-Threading Technology (HT). The default threadpool size is n − 1, so for the setup without HT the default threadpool size is 3, and with HT it is 7.

[Figure: execution time in seconds (16 to 30) over threadpool size (3 to 20), without HT and with HT.]
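The n − 1 default matches the default parallelism of Java's ForkJoinPool.commonPool(), which is one less than the number of available processors because the thread that submits and joins the work also participates; this is consistent with the "19 plus the main thread" setup above. Whether LSA's measurements actually used the common pool is an assumption here; a minimal sketch of inspecting and overriding the default:

```java
import java.util.concurrent.ForkJoinPool;

// Minimal sketch (assumption: the measurements used Java's common pool).
// The common pool's default parallelism is availableProcessors() - 1,
// because the submitting/joining thread also helps with the work.
public final class PoolSizeDemo {
    public static void main(String[] args) {
        int n = Runtime.getRuntime().availableProcessors();
        System.out.println("logical processors: " + n);
        // 19 on the 10-core HT target platform, 7 on the 4-core HT test platform
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        // The default can be overridden at JVM startup, e.g. with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=12
    }
}
```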

Summary

By reducing the memory consumption and the complexity of the most-used algorithms of LSA from O(n²) to O(n log n), the sequential calculation time could be sped up by a factor of 22.00. Parallelizing the DAG containing the parameter hierarchy, as well as some loops in the trim calculations, increased the speedup by a further factor of 9.97 on the target platform with 10 cores with hyperthreading. This leads to an average speedup of 219.23, which now allows the user to seamlessly change the overall accelerator scheduling.