“Update on GPU trigger”
Gianluca Lamanna, Scuola Normale Superiore and INFN Pisa
NA62 collaboration meeting 9.12.2009
TDAQ WG – Gianluca Lamanna – 9.12.2009
Introduction
We are investigating the possible use of video cards (GPUs) for trigger purposes in our experiment. A first attempt, presented at the Capri meeting, was based on my notebook's video card and a preliminary ring-finding algorithm for the RICH. The first result was 128 ms to find a ring (far too slow!). Two possible applications identified: L1 and/or L0.
GPU parallelization structure
The GPU (video card processor) is natively built for parallelism. The GT200 (in the NVIDIA TESLA C1060) has 240 computing cores subdivided into 30 multiprocessors of 8 cores each. Each core runs a thread; a group of threads is called a thread block. Each thread block runs on one multiprocessor. 32 threads (a warp) are scheduled concurrently on a multiprocessor with a SIMD structure.
[Diagram: SIMD layout – a single instruction pool driving several processing units (PUs) over a common data pool]
All the threads in a kernel execute the same instruction pool on different data sets. Divergence between threads, due to different code paths, causes a loss of parallelism.
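The divergence penalty can be illustrated with a toy model (a hedged sketch only, not the actual GT200 scheduler; `warp_cost` is a hypothetical name):

```python
# Toy model of SIMD (warp) divergence: when threads in a warp take
# different code paths, the distinct paths execute one after the other,
# so the warp's cost is the SUM of the distinct path lengths.
# Illustrative sketch only -- not the real hardware scheduler.

def warp_cost(path_lengths):
    """Cost of one warp given each thread's code-path length (cycles)."""
    return sum(set(path_lengths))  # distinct paths are serialized

# All 32 threads follow the same 10-cycle path: full parallel speed.
uniform = warp_cost([10] * 32)                 # 10 cycles

# Threads split between a 10-cycle and a 40-cycle branch:
# both branches run back to back for the whole warp.
divergent = warp_cost([10] * 16 + [40] * 16)   # 50 cycles

print(uniform, divergent)
```

The model uses the path length as the path's identity, which is enough to show why branch-free kernels parallelize best.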
Parallelization in a multi-core structure:
- Fast parallel code (many threads work on the same problem)
- Multi-event (many events are processed at the same time)
Memory management
The correct use of memory is crucial for any GPU application with high performance requirements.
Memory    On/off chip  Cached  Access  Scope           Lifetime
Register  On           no      R/W     1 thread        Thread
Local     Off          no      R/W     1 thread        Thread
Shared    On           no      R/W     1 thread block  Block
Global    Off          no      R/W     All threads     Kernel
Constant  Off          yes     R       All threads     Kernel
Texture   Off          yes     R       All threads     Kernel

The shared memory is subdivided into 16 banks. Each bank can be accessed simultaneously by different threads in the block; read/write conflicts are serialized.
[Diagram: 16 shared-memory banks with example thread accesses (Thr 1, 3, 6, 7, 12, 15); accesses hitting the same bank are serialized]
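The bank-serialization rule above can be sketched as a small Python model (illustrative only; the mapping assumes 4-byte words over the 16 banks described in the slide):

```python
# Toy bank-conflict check for a 16-bank shared memory:
# the bank of a 4-byte word is (word index) mod 16, and accesses by
# different threads to the same bank are served in separate passes.
from collections import Counter

def shared_mem_passes(word_indices):
    """Number of serialized passes for one group of word accesses."""
    banks = Counter(i % 16 for i in word_indices)
    return max(banks.values())

# Conflict-free: 16 threads read 16 consecutive words (one per bank).
print(shared_mem_passes(range(16)))          # 1 pass

# Worst case: stride-16 access, every thread hits the same bank.
print(shared_mem_passes(range(0, 256, 16)))  # 16 passes
```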
Global memory access is coalesced in 64-byte (or 128-byte) segments. Alignment within a warp guarantees the best performance.
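A rough model of why alignment matters: the number of 64-byte segments a group of threads touches sets the number of memory transactions (a sketch under that assumption, not the exact hardware rule):

```python
# Count how many 64-byte segments a set of byte addresses touches;
# fewer segments means fewer global-memory transactions.

def segments_touched(byte_addresses, segment=64):
    return len({a // segment for a in byte_addresses})

# 16 threads reading consecutive 4-byte words from an aligned base:
aligned = segments_touched([4 * t for t in range(16)])          # 1 segment

# The same pattern shifted off the 64-byte boundary:
misaligned = segments_touched([60 + 4 * t for t in range(16)])  # 2 segments

print(aligned, misaligned)
```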
GPU in NA62 trigger system
[Diagrams: GPU placement in the NA62 trigger. Scheme 1 (L0): RICH, MUV/L0TS and STRAWS feed GPUs that compute the L0 trigger primitives and send reduced data to L1. Scheme 2 (L1/L2): RICH, MUV/L0TS and STRAWS feed PCs and PC+GPU nodes that compute trigger primitives and send data to L1.]
At L1 & L2 the GPUs can support the decision in the PC farm. There are no hard constraints on latency and computing time (they must only stay reasonable); reliability and stability are required. Benefits: a smaller L1 PC farm, and more sophisticated algorithms that can allow higher trigger purity.
At L0 the GPU can be exploited for an effective trigger rate reduction. Latency and computing time limitations: 10 MHz input at L0 and 1 ms latency in the TELL1 readout → the trigger decision has to be ready within 1 ms and the computing time budget is 100 ns per event (example: 100 ns to find a ring in the RICH). Very challenging! Benefits: very selective L0 triggers both for signal and for the collection of secondary processes.
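The L0 numbers quoted above can be checked back-of-envelope (the in-flight event count is a derived figure, not stated on the slide):

```python
# A 10 MHz input rate leaves an average computing budget of
# 1 / (10 MHz) = 100 ns per event; during the ~1 ms TELL1 latency
# about 10 000 events are in flight at that rate.

input_rate_hz = 10e6
budget_per_event_s = 1.0 / input_rate_hz            # 100 ns
events_in_flight = round(1e-3 * input_rate_hz)      # events buffered in 1 ms
print(budget_per_event_s, events_in_flight)
```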
Hough transform
Each hit is taken as the center of a test circle with a given radius; the ring center is the best common point of the test circles. PM positions → constant memory; hits → global memory; test circle prototypes → constant memory.
[Diagram: 3D histogram space in shared memory – a 2D (X, Y) grid vs the test circle radius]
Limitations come from the total shared memory amount (16 KB). One thread for each center (hit) → 32 threads (in one thread block) per event.
Pros: natural parallelization, small number of threads per event. Cons: unpredictable memory access (read & write conflicts), heavy use of on-chip fast memory.
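Since the slides contain no code, here is a minimal Python sketch of the Hough voting scheme just described (grid granularity, radius and the function name are illustrative assumptions):

```python
# Each hit votes, on a coarse 2D grid, for every candidate center lying
# at the test radius from that hit; the most-voted cell approximates
# the ring center.
import math
from collections import Counter

def hough_center(hits, radius, n_angles=64):
    votes = Counter()
    for (x, y) in hits:
        for k in range(n_angles):
            a = 2 * math.pi * k / n_angles
            votes[(round(x + radius * math.cos(a)),
                   round(y + radius * math.sin(a)))] += 1
    return votes.most_common(1)[0][0]

# 12 hits on a ring of radius 10 around (30, 40):
hits = [(30 + 10 * math.cos(2 * math.pi * i / 12),
         40 + 10 * math.sin(2 * math.pi * i / 12)) for i in range(12)]
print(hough_center(hits, 10))  # ~(30, 40)
```

In the GPU version the accumulator lives in shared memory, which is why concurrent votes produce the read/write conflicts listed under "Cons".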
Optimized for problem multi histos (OPMH)
Each PM (1000) is considered as the center of a circle. For each center a histogram is built from the distances between the center and the hits (<32). The whole processor is used for a single event (huge number of centers): each thread computes only a few distances, and several histograms are computed in different shared memory spaces. This mapping is not natural for the processor, and it is not possible to process more than one event at a time (the parallelism is fully exploited to speed up the computation).
Pros: very fast and simple kernel operation. Cons: unnatural resource assignment; the whole processor is used for a single event.
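The distance-histogram idea behind OPMH (and ODMH below) can be sketched in Python (bin width, geometry and the function name are illustrative, not the experiment's values):

```python
# For a candidate center, histogram the distances to all hits: a real
# ring centered there produces a sharp peak at the ring radius.
import math
from collections import Counter

def radius_peak(center, hits, bin_width=1.0):
    cx, cy = center
    hist = Counter(round(math.hypot(x - cx, y - cy) / bin_width)
                   for (x, y) in hits)
    bin_idx, votes = hist.most_common(1)[0]
    return bin_idx * bin_width, votes

# 16 hits on a ring of radius 10 around (30, 40):
hits = [(30 + 10 * math.cos(2 * math.pi * i / 16),
         40 + 10 * math.sin(2 * math.pi * i / 16)) for i in range(16)]
r, votes = radius_peak((30, 40), hits)
print(r, votes)  # peak at r = 10 collecting all 16 hits
```

On the GPU, one such histogram is built per candidate PM center, which is why the shared memory budget dominates the design.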
Optimized for device multi histo (ODMH)
Exactly the same algorithm as OPMH, but with a different resource assignment. The system is exploited in a more natural way: each block is dedicated to a single event, using the shared memory for one histogram. Several events are processed in parallel at the same time, and it is easier to avoid conflicts in shared and global memory.
[Diagram: each thread block (1 event → M threads, each thread handling N PMs) has its own shared memory; all blocks read from global memory]
Pros: natural memory optimization and resource assignment. Cons: some waste of resources.
Optimization procedure & results
[Plot: processing time (µs) vs optimization step – first version, multiple events, memory init optimization, memory access optimization, kernel optimization – for algo 1, algo 2 and algo 3]
Several steps of optimization (still room for improvement). Algo 1 and Algo 3 are suitable for processing many events concurrently at the same time (in Algo 2 a single event saturates the chip resources).
Algorithm  TESLA C1060
Algo 1     85 µs
Algo 2     139 µs
Algo 3     12.4 µs
Improvement with respect to the results shown at Capri (128 ms per ring):
- Different video card (factor of 10)
- Better understanding of resource assignment
- Better understanding of memory conflicts and memory management
- Multi-event approach
- Other algorithms
[Event display: hits generated with the NA62 G4 MC; best ring found by the GPU through 12 hits]
Further optimization on ODMH
Two crucial parameters: N, the total number of events per kernel, and PM, the number of PMs processed in the same thread.
[Plots: processing time (µs) vs N events per kernel at 8 PMs per thread, and vs PMs per thread at 1000 events per kernel]
The time is linear in the number of events per kernel. As a function of the number of PMs processed in a single thread, the minimum is at PMs = 8.
Working point (N = 1000, PMs = 8) → 10.8 µs
Present result
Further optimizations on working point:
- V11: moving arrays from registers (spilled to local memory) to shared memory → 6 µs
- V12: occupancy optimization → 5.9 µs
- V13: temporary histogram optimization in shared memory → 5 µs
This means that a group of 1000 events is processed in 5 ms; the latency (due to the algorithm) is 5 ms and the maximum rate is 200 kHz.
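These packet figures are internally consistent, as a quick check shows:

```python
# 1000 events at ~5 us each give a 5 ms packet processing time (and
# hence a 5 ms latency) and a sustained rate of 1000 / 5 ms = 200 kHz.

events_per_packet = 1000
time_per_event_s = 5e-6
packet_latency_s = events_per_packet * time_per_event_s
rate_hz = events_per_packet / packet_latency_s
print(packet_latency_s, rate_hz)  # ~0.005 s, ~200 kHz
```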
Data transfer time (per packet): data from host to GPU RAM → 70 µs; results from GPU RAM to host → 7 µs.
Concurrent copy and kernel execution in streams
Prospects
3 µs is not a too optimistic hypothesis for the next month → several optimizations still to be done. 1 µs is not too far from our reach: a new algorithm (“triplets”) will be implemented very soon.
1 µs means that, by using 10 video cards in parallel at the same time, we can reach the 10 MHz with a latency of 5 ms → the NA62-TELL1 can easily manage this latency → the only issue is with the GTK readout. The next generation of video cards (already available on the market) will offer at least a factor of 2 in performance (probably much more, due to the different architecture).
Next optimization steps:
- Copy the hits into shared memory
- Compute the PM center positions (instead of using constant memory)
- Eliminate slow operations (integer division)
- Eliminate __syncthreads() in perfectly aligned warp structures
- Understand the timing differences between reading and writing shared memory
GPU                           NVIDIA FERMI   ATI HD 5870
Single precision performance  2 Teraflops    2.72 Teraflops
Memory                        6 GB DDR5      6 GB DDR5
Memory speed                  ?              1.2 GHz
Bandwidth                     230 GB/s       153 GB/s
Stream processors             512            1600
Going through a working system
[Diagram: data path from an INTEL PRO/1000 quad GbE NIC through the PC (RAM, two CPUs) to the TESLA GPU and its VRAM. Link bandwidths: PCI-E x16 4 GB/s (20 MHz)*, PCI-E gen2 x16 8 GB/s (40 MHz)*, RAM–CPU 30 GB/s (150 MHz)*, VRAM–GPU 100 GB/s (500 MHz)*]
The maximum bandwidth in each part of the system is very high: there is no intrinsic system bottleneck.
(* the maximum “frequency equivalent” is computed assuming 200 B/event)
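The starred "frequency equivalent" figures follow directly from that assumption:

```python
# Divide each link's bandwidth by the assumed event size of 200 B/event
# to obtain the maximum event rate that link could carry.

EVENT_SIZE_B = 200

def freq_equivalent_hz(bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / EVENT_SIZE_B

print(freq_equivalent_hz(4e9))    # PCI-E x16: 20 MHz
print(freq_equivalent_hz(8e9))    # PCI-E gen2 x16: 40 MHz
print(freq_equivalent_hz(100e9))  # VRAM-GPU: 500 MHz
```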
Three important parts are still missing to test a real system:
- An intelligent GbE receiver, in which the data coming from the TELL1 (ethernet packets) are prepared to be transferred directly to the GPU memory
- A real-time Linux system on a dual-processor PC (one processor for the system, one processor to move the data)
- A re-synchronization card to send the trigger decision synchronously to the TTC
Conclusions
[Diagram: “quasi-triggerless” scheme with GPUs – front-end digitization + buffer (+ trigger primitives), followed by PCs+GPU at L0 and PCs+GPU at L1]
The use of GPUs in the trigger will be useful both at L1/L2 and at L0. An effective reduction of the trigger rate at L0 will be very important both for the main trigger and for secondary triggers, with great benefits for physics. The L0-GPU is very challenging with respect to the requirements on latency and processing time. A very careful use of the parallelization structure and of the memory allows reaching 5 µs of processing time and 5 ms of latency (for 1000-event packets). The 3 µs level will be reached with relatively small effort by optimizing the data transfer and dispatching, and even 1 µs is not completely impossible. By using 10 video cards in parallel and increasing the latency in the readout, the system seems feasible. More work is needed on the PC & HW side to design a completely new trigger paradigm, with high-quality primitive computing in the first stage.