

THE GAP PROJECT:

GPU for Realtime Applications in High Energy Physics and Medical Imaging

M. Bauce¹, A. Messina, M. Rescigno, S. Giagu, S. Capuani, M. Palombo, G. Lamanna, M. Fiorini, G. Di Domenico

¹ Sapienza Università di Roma

IEEE - Realtime, Nara, May 26-30, 2014

M. Bauce (Sapienza Universitá di Roma) The GPU Application Project May 26-30, 2014 1 / 22


Basic GPU architecture

Fewer CONTROL units, many more ARITHMETIC-LOGIC units.

Optimized to execute the same operation simultaneously on many different data (Single Instruction, Multiple Data)

An SMX executes kernels (functions) using hundreds of threads concurrently


CPU vs GPU

While a CPU executes programs serially, a GPU is tailored for highly parallel operation

GPUs have many parallel execution units (cores) and higher transistor counts, while CPUs have few execution units and higher clock speeds

GPUs have significantly faster and more advanced memory interfaces, as they need to move around a lot more data than CPUs

A GPU is for the most part deterministic in its operations


The GAP Project

The GPU Application Project for Realtime in HEP and medical imaging is a 3-year project funded by the Italian Ministry of Research, started in April ’13

Collaboration between: Sapienza Università di Roma, INFN Pisa, Università di Ferrara

The aim is to demonstrate that off-the-shelf computer commodities can be used to accelerate computation in scientific applications. The main challenge will be to extend this to real-time applications for the first time.

APE group - INFN Roma


Different scales of Realtime


see P. Vicini’s talk on 29/05


What is a GPU suitable for?

Different colors → different tasks

Every algorithm that can be divided into tasks operating on independent data can gain from parallelization.

The weak point is memory usage: only a small amount is available, and algorithms that cross-read data incur overhead.


Amdahl’s law
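Amdahl's law bounds the speed-up S of a program in which a fraction p of the run time can be parallelized over n execution units: S = 1 / ((1 - p) + p / n). The sketch below evaluates that textbook formula; the function name and the sample fractions are illustrative assumptions and are not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Ideal speed-up when a fraction p of the run time is spread over n units."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    # Illustrative fractions only: the serial part quickly dominates.
    for p in (0.50, 0.90, 0.99):
        print(f"p = {p:.2f}: speed-up on 2048 units = {amdahl_speedup(p, 2048):.1f}")
```

However many cores are available, the speed-up can never exceed 1 / (1 - p), which is why only algorithms with a very small serial part profit from thousands of GPU cores.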


A physics scenario: NA62

Kaon decays in flight:

High-intensity unseparated hadron beam (6% kaons).

Need a good kinematic reconstruction.

Efficient veto and PID systems to suppress the background contribution.

Trigger/DAQ:

Three levels

L0 max latency: 1 ms

10 MHz event rate on the main detectors

RICH:

17 m long, 3 m in diameter, filled with Ne at 1 atm

10 MHz events: about 20 hits per particle

Distinguishes between pions and muons.


Ring reconstruction: single and multiple

Several approaches to the ring fitting problem in HEP.

Incomplete rings (non-symmetric hit distribution)

Trackless, difficult to merge detector info at L0

Fast reconstruction, not iterative

Accurate results, offline quality

Low latency for synchronous trigger

State of the art:

Geometric fits

Algebraic fits

Statistical fits

Conformal fits

Single ring Multi-body decays


Multi ring fit - a new algorithm

ALMAGEST: a new algorithm based on Ptolemy’s theorem:

If a quadrilateral is inscribable in a circle, then the product of the measures of its diagonals is equal to the sum of the products of the measures of the pairs of opposite sides.

AD × BC + AB × DC = AC × BD
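As a hedged illustration of how the relation above can be used as a test, the sketch below checks whether four points lie on a common circle. It relies on Ptolemy's inequality: among the three possible pairings of the six distances, the largest product equals the sum of the other two exactly when the points are concyclic, so the points do not need to be ordered along the circle. The function names and the tolerance are assumptions made for this example, not part of ALMAGEST as presented.

```python
import math


def ptolemy_residual(a, b, c, d):
    """Relative violation of Ptolemy's equality for four points given as (x, y).

    The value is >= 0 and close to 0 when the four points are concyclic.
    """
    p1 = math.dist(a, b) * math.dist(c, d)
    p2 = math.dist(a, c) * math.dist(b, d)
    p3 = math.dist(a, d) * math.dist(b, c)
    largest = max(p1, p2, p3)
    if largest == 0.0:
        return 0.0  # degenerate input: at least three points coincide
    return (p1 + p2 + p3 - 2.0 * largest) / largest


def is_concyclic(a, b, c, d, tol=1e-6):
    """True when the four points satisfy Ptolemy's relation within tol."""
    return ptolemy_residual(a, b, c, d) < tol


if __name__ == "__main__":
    on_circle = [(math.cos(t), math.sin(t)) for t in (0.1, 1.3, 2.9, 4.4)]
    print(is_concyclic(*on_circle))                      # four points on the unit circle
    print(is_concyclic(*on_circle[:3], (0.0, 0.0)))      # fourth point at the centre
```

Collinear points also satisfy the equality (a circle of infinite radius), so a practical implementation would have to reject such degenerate triplets separately.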


Almagest parallelization


1) Select a triplet (3 starting points).

2) Loop over the remaining points: if the next point does NOT satisfy Ptolemy’s theorem, reject it.

3) If the point satisfies Ptolemy’s theorem, consider it for the fit.

4) Repeat again and again for the remaining points.

5) Perform a single-ring fit.

6) Repeat, excluding the points already assigned to a ring.

Add complications
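The numbered steps can be sketched serially as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the triplet search is a hypothetical brute-force scan over all triplets, the single-ring fit is an algebraic least-squares circle fit chosen here for brevity, and `tol`, `min_hits` and `make_ring` are illustrative names.

```python
import itertools
import math


def ptolemy_residual(a, b, c, d):
    """Relative violation of Ptolemy's equality; close to 0 for concyclic points."""
    p1 = math.dist(a, b) * math.dist(c, d)
    p2 = math.dist(a, c) * math.dist(b, d)
    p3 = math.dist(a, d) * math.dist(b, c)
    largest = max(p1, p2, p3)
    return (p1 + p2 + p3 - 2.0 * largest) / largest if largest > 0.0 else 0.0


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_circle(points):
    """Least-squares fit of x^2 + y^2 = A*x + B*y + C (assumed fit method)."""
    sxx = sxy = syy = sx = sy = sxz = syz = sz = 0.0
    for x, y in points:
        z = x * x + y * y
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        sxz += x * z
        syz += y * z
        sz += z
    matrix = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]]
    rhs = [sxz, syz, sz]
    det = _det3(matrix)
    if det == 0.0:
        raise ValueError("degenerate point set: cannot fit a circle")
    solution = []
    for col in range(3):  # Cramer's rule, one unknown per column
        replaced = [row[:] for row in matrix]
        for row in range(3):
            replaced[row][col] = rhs[row]
        solution.append(_det3(replaced) / det)
    xc, yc = solution[0] / 2.0, solution[1] / 2.0
    return xc, yc, math.sqrt(solution[2] + xc * xc + yc * yc)


def collect_hits(triplet, hits, tol):
    """Steps 2-4: keep the triplet plus every other hit passing the Ptolemy test."""
    a, b, c = triplet
    others = [p for p in hits if p not in triplet]
    return list(triplet) + [p for p in others if ptolemy_residual(a, b, c, p) < tol]


def find_rings(hits, tol=1e-6, min_hits=5):
    """Repeat steps 1-6 until too few unassigned hits remain."""
    remaining, rings = list(hits), []
    while len(remaining) >= min_hits:
        # Step 1 (hypothetical strategy): try every triplet, keep the most populated.
        best = max((collect_hits(t, remaining, tol)
                    for t in itertools.combinations(remaining, 3)), key=len)
        if len(best) < min_hits:
            break
        rings.append(fit_circle(best))                        # step 5
        remaining = [p for p in remaining if p not in best]   # step 6
    return rings


def make_ring(xc, yc, r, n, phase):
    """Illustrative helper: n noise-free hits on a circle."""
    return [(xc + r * math.cos(phase + 2.0 * math.pi * k / n),
             yc + r * math.sin(phase + 2.0 * math.pi * k / n)) for k in range(n)]


if __name__ == "__main__":
    hits = make_ring(6.65, 6.15, 11.0, 8, 0.1) + make_ring(8.42, 4.59, 12.6, 8, 0.3)
    for ring in find_rings(hits):
        print(ring)
```

With real detector hits the tolerance would have to reflect the spatial resolution, and each candidate triplet is independent of the others, which is what makes the search a natural fit for parallel execution.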


Almagest on GPU


Two levels of parallelism on data:

Several triplets run in parallel

Several events at the same time

Very high parallelism on GPU:

Huge number of computing cores, > 2000

Huge memory bandwidth
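One way to picture the two levels of data parallelism is a flat index over (event, triplet) pairs, each pair being an independent unit of work. The mapping below is a hypothetical sketch, not the project's kernel layout; the sample sizes (256 events packed together, 8 triplets per event) are borrowed from the performance slide of this talk and give a number of work items comparable to the core count quoted above.

```python
def launch_size(n_events: int, triplets_per_event: int) -> int:
    """Number of independent work items for one batch of events."""
    return n_events * triplets_per_event


def decompose(work_item: int, triplets_per_event: int):
    """Map a flat work-item index to an (event index, triplet index) pair."""
    if work_item < 0 or triplets_per_event < 1:
        raise ValueError("work_item must be >= 0 and triplets_per_event >= 1")
    return divmod(work_item, triplets_per_event)


if __name__ == "__main__":
    n_items = launch_size(256, 8)
    print(n_items, decompose(n_items - 1, 8))
```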


A simple example

The real parameters of the generated rings:

1 (6.65, 6.15) R=11.0

2 (8.42, 4.59) R=12.6

The fitted parameters of the two rings:

1 (7.29, 6.57) R=11.6

2 (8.44, 4.32) R=12.3

Fitting time on Tesla C1060: 1.5 µs/event


Algorithm performance

Total computing time of the order of a few µs per event (on a single GPU, Tesla K20).

Reasonable efficiency (using 8 triplets): room for improvement (further tests ongoing).

Single- and multiple-ring algorithms have similar latencies (packing together 256 events).


The ATLAS experiment

Energy: 7-8 TeV (14 TeV)

Inst. luminosity: 10^34 cm^-2 s^-1

Bunch crossing rate: 40 MHz

Event size: 1.5 MB

Input data rate: 30 TB/s

Data processing is really challenging:

Vectorization of the data-analysis code

Task parallelization during processing

Include cutting-edge technologies: GPU, MIC, etc. under investigation


Atlas Trigger

New challenges in the upgraded luminosity regime

ATLAS is considering new technologies to include in the upgrade

Foreseen merge of L2 and EF

Possibility to exploit GPUs to run high-level trigger algorithms, aiming at offline-like resolution.

Two issues to take into account:

1 Integration of the GPU in the complex framework

2 Exploiting highly parallelizable algorithms


Framework Architecture Implementation

Atlas Trigger has a complex framework: how to integrate a GPU in it?

APE project

Technical implementation of the GPU in the ATLAS software through a client-server structure.

Flexible solution, useful for several parallel tasks:

can be expanded to a GPU farm.


Muon Isolation algorithm

[Figure: histograms of entries vs. time [ms], 0 to 4 ms, for the stages info retrieval, data packing, data transfer, algorithm and total; values marked on the plot: 880, 930, 950, 1080, 1160 µs.]

1 Retrieve calorimeter cell information

2 Pack data for transfer

3 Send them to the APE server

4 Perform the algorithm

5 Send the output back to the main trigger framework

6 Total time: ∼1.2 ms

Overhead well within time budget!


∆t ∼ 250 µs


Atlas Standalone test: Track fitting

Track fitting on GPU

Kalman filter and track fit on tracks from the detector

Parallel fit of an increasing number of tracks

[Figures: average time per track [µs] vs. tracks/event, on linear and logarithmic scales, and average time per event vs. tracks/event, each comparing Intel Xeon E5-2620 and Nvidia GTX Titan.]

GPU implementation less dependent on the track multiplicity: promising for upcoming data-taking conditions, with increased luminosity.


Nuclear Magnetic Resonance

Typical NMR brain scan:

128×128 pixel images

10-40 slices

≥15 diffusion gradient directions

On average: 128×128×32×15 = 7.86×10^6 calculations

Conventional DTI: 7.86×10^6 linear systems to solve

∼15 s on a CPU

Kurtosis tensor imaging: 7.86×10^6 non-linear fits to perform

∼7200 s on a CPU

Computationally intensive, massively parallel:

Non-linear fit: ∼7 ms per fit

10^6-10^7 independent per-pixel fits

only 10-80 ns to read data from memory

8 threads on an Intel Xeon E5-2609 CPU at 2.4 GHz
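The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below reproduces the count of 7.86×10^6 fits and estimates the CPU time from the quoted ∼7 ms per non-linear fit; dividing the serial time evenly over the 8 threads is an assumption made here to compare with the quoted ∼7200 s, and the function names are illustrative.

```python
def number_of_fits(width: int, height: int, slices: int, directions: int) -> int:
    """One fit per pixel, per slice and per diffusion gradient direction."""
    return width * height * slices * directions


def cpu_time_seconds(n_fits: int, seconds_per_fit: float, threads: int) -> float:
    """Wall time assuming ideal scaling over the threads (an assumption)."""
    return n_fits * seconds_per_fit / threads


if __name__ == "__main__":
    n_fits = number_of_fits(128, 128, 32, 15)
    print(f"fits: {n_fits:.3g}")
    print(f"serial time: {cpu_time_seconds(n_fits, 7e-3, 1):.0f} s")
    print(f"8-thread estimate: {cpu_time_seconds(n_fits, 7e-3, 8):.0f} s")
```

Under these assumptions the 8-thread estimate is about 6.9×10^3 s, the same order as the ∼7200 s quoted for kurtosis tensor imaging.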


Nuclear Magnetic Resonance

Estimated improvements:

High-res. images: 512×512

Speed-up factor of 20-200 after algorithm optimization

Image simulations:

matrix size: 32×32 up to 2048×2048

about 2×10^3 axons

30 b-value points, from 0 to 5000 s/mm^2

Significant improvement observed using GPU

[Figure legend: GT650m, GTX Titan]


Computed Tomography

Feldkamp-Davis-Kress algorithm to evaluate the reconstruction integral below

Three intrinsically parallel steps

Big improvement on GPU with respect to parallel OpenMP

Big benefits expected from the next GPU generation (K40)

$$\hat{f}(s_1, s_2, s_3) = \frac{1}{2} \int_0^{2\pi} d\beta \, \frac{1}{U^2} \int_{-u_m}^{u_m} du \, \frac{D}{\sqrt{D^2 + u^2 + v^2}} \cdot g_\beta(u, v) \cdot h(u - u')$$
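The first of the three steps, weighting, multiplies each projection value g_β(u, v) by the factor D / √(D² + u² + v²) that appears in the integrand. The sketch below applies that factor pixel by pixel; reading D as the source distance and (u, v) as detector coordinates follows the usual FDK convention and is an assumption here, as are the function names and the list-of-rows data layout.

```python
import math


def cosine_weight(u: float, v: float, source_distance: float) -> float:
    """Weighting factor D / sqrt(D^2 + u^2 + v^2) for detector position (u, v)."""
    return source_distance / math.sqrt(source_distance ** 2 + u * u + v * v)


def weight_projection(projection, u_coords, v_coords, source_distance):
    """Weight one projection; projection[i][j] is sampled at (u_coords[j], v_coords[i])."""
    return [
        [projection[i][j] * cosine_weight(u, v, source_distance)
         for j, u in enumerate(u_coords)]
        for i, v in enumerate(v_coords)
    ]


if __name__ == "__main__":
    flat = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(weight_projection(flat, [-10.0, 0.0, 10.0], [-5.0, 5.0], 100.0))
```

Every output pixel depends on a single input pixel, so this step maps directly onto one GPU thread per pixel; filtering and backprojection are parallel in the same sense but involve more data per output value.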

Step             GTS 450   GTX 680   Titan   OpenMP   OpenMP/Titan
weighting           0.32      1.5     0.39     0.22       0.55
filtering          49.17     15.01   13.57     1.26       0.09
backprojection     17.45      4.37    3.08    80.85      26.3
Total              66.94     20.88   17.04    82.33       4.83


Conclusions

GPU: quickly developing technology, now available for realtime applications.

Trigger algorithms optimized for GPU parallel execution are a viable solution in typical HEP experiments:

Tested interface for integration in complex TDAQ frameworks

GPUs allow the implementation of complex, good-resolution algorithms for image reconstruction in medical diagnostics

· · · room for improvements and applications in other scenarios.
