“Update on GPU trigger”
Gianluca Lamanna, Scuola Normale Superiore and INFN Pisa
NA62 collaboration meeting 9.12.2009
TDAQ WG – Gianluca Lamanna – 9.12.2009
Introduction
We are investigating the possible use of video cards (GPUs) for trigger purposes in our experiment. A first attempt, presented at the Capri meeting, was based on my notebook's video card and a preliminary ring-finding algorithm for the RICH. The first result was 128 ms to find a ring (far too slow!). Two possible applications identified: L1 and/or L0.
GPU parallelization structure
The GPU (video card processor) is natively built for parallelism. The GT200 (in the NVIDIA TESLA C1060) has 240 computing cores subdivided into 30 multiprocessors of 8 cores each. Each core runs a thread; a group of threads is called a thread block. Each thread block runs on one multiprocessor. 32 threads (a warp) are scheduled concurrently on a multiprocessor with a SIMD structure.
[Diagram: SIMD layout – a single instruction pool driving several processing units (PUs) over a common data pool]
All the threads in a kernel execute the same instruction pool on different data sets. Divergence between threads, due to different code paths, causes a loss of parallelism.
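The divergence penalty can be illustrated with a toy model (a hedged sketch only, not the actual GT200 scheduler; `warp_cost` is a hypothetical name):

```python
# Toy model of SIMD (warp) divergence: when threads in a warp take
# different code paths, the distinct paths execute one after the other,
# so the warp's cost is the SUM of the distinct path lengths.
# Illustrative sketch only -- not the real hardware scheduler.

def warp_cost(path_lengths):
    """Cost of one warp given each thread's code-path length (cycles)."""
    return sum(set(path_lengths))  # distinct paths are serialized

# All 32 threads follow the same 10-cycle path: full parallel speed.
uniform = warp_cost([10] * 32)                 # 10 cycles

# Threads split between a 10-cycle and a 40-cycle branch:
# both branches run back to back for the whole warp.
divergent = warp_cost([10] * 16 + [40] * 16)   # 50 cycles

print(uniform, divergent)
```

The model uses the path length as the path's identity, which is enough to show why branch-free kernels parallelize best.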
Parallelization in a multi-core structure:
- Fast parallel code (many threads work on the same problem)
- Multi-event (many events are processed at the same time)
Memory management
The correct use of memory is crucial for any GPU application with high performance requirements.
Memory    On/off chip  Cached  Access  Scope           Lifetime
Register  On           no      R/W     1 thread        Thread
Local     Off          no      R/W     1 thread        Thread
Shared    On           no      R/W     1 thread block  Block
Global    Off          no      R/W     All threads     Kernel
Constant  Off          yes     R       All threads     Kernel
Texture   Off          yes     R       All threads     Kernel

The shared memory is subdivided into 16 banks. Each bank can be accessed simultaneously by different threads in the block; read/write conflicts are serialized.
[Diagram: 16 shared-memory banks with example thread accesses (Thr 1, 3, 6, 7, 12, 15); accesses hitting the same bank are serialized]
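The bank-serialization rule above can be sketched as a small Python model (illustrative only; the mapping assumes 4-byte words over the 16 banks described in the slide):

```python
# Toy bank-conflict check for a 16-bank shared memory:
# the bank of a 4-byte word is (word index) mod 16, and accesses by
# different threads to the same bank are served in separate passes.
from collections import Counter

def shared_mem_passes(word_indices):
    """Number of serialized passes for one group of word accesses."""
    banks = Counter(i % 16 for i in word_indices)
    return max(banks.values())

# Conflict-free: 16 threads read 16 consecutive words (one per bank).
print(shared_mem_passes(range(16)))          # 1 pass

# Worst case: stride-16 access, every thread hits the same bank.
print(shared_mem_passes(range(0, 256, 16)))  # 16 passes
```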
Global memory access is coalesced in 64-byte (or 128-byte) segments. Alignment within a warp guarantees the best performance.
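A rough model of why alignment matters: the number of 64-byte segments a group of threads touches sets the number of memory transactions (a sketch under that assumption, not the exact hardware rule):

```python
# Count how many 64-byte segments a set of byte addresses touches;
# fewer segments means fewer global-memory transactions.

def segments_touched(byte_addresses, segment=64):
    return len({a // segment for a in byte_addresses})

# 16 threads reading consecutive 4-byte words from an aligned base:
aligned = segments_touched([4 * t for t in range(16)])          # 1 segment

# The same pattern shifted off the 64-byte boundary:
misaligned = segments_touched([60 + 4 * t for t in range(16)])  # 2 segments

print(aligned, misaligned)
```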
GPU in NA62 trigger system
[Diagrams: GPU placement in the NA62 trigger. Scheme 1 (L0): RICH, MUV/L0TS and STRAWS feed GPUs that compute the L0 trigger primitives and send reduced data to L1. Scheme 2 (L1/L2): RICH, MUV/L0TS and STRAWS feed PCs and PC+GPU nodes that compute trigger primitives and send data to L1.]
At L1 & L2 the GPUs can support the decision in the PC farm. There are no hard constraints on latency and computing time (they must only stay reasonable); reliability and stability are required. Benefits: a smaller L1 PC farm, and more sophisticated algorithms that can allow higher trigger purity.
At L0 the GPU can be exploited for an effective trigger rate reduction. Latency and computing time limitations: 10 MHz input at L0 and 1 ms latency in the TELL1 readout → the trigger decision has to be ready within 1 ms and the computing time budget is 100 ns per event (example: 100 ns to find a ring in the RICH). Very challenging! Benefits: very selective L0 triggers both for signal and for the collection of secondary processes.
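The L0 numbers quoted above can be checked back-of-envelope (the in-flight event count is a derived figure, not stated on the slide):

```python
# A 10 MHz input rate leaves an average computing budget of
# 1 / (10 MHz) = 100 ns per event; during the ~1 ms TELL1 latency
# about 10 000 events are in flight at that rate.

input_rate_hz = 10e6
budget_per_event_s = 1.0 / input_rate_hz            # 100 ns
events_in_flight = round(1e-3 * input_rate_hz)      # events buffered in 1 ms
print(budget_per_event_s, events_in_flight)
```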
Hough transform
Each hit is taken as the center of a test circle with a given radius; the ring center is the best common point of the test circles. PM positions → constant memory; hits → global memory; test circle prototypes → constant memory.
[Diagram: 3D histogram space in shared memory – a 2D (X, Y) grid vs the test circle radius]
Limitations come from the total shared memory amount (16 KB). One thread for each center (hit) → 32 threads (in one thread block) per event.
Pros: natural parallelization, small number of threads per event. Cons: unpredictable memory access (read & write conflicts), heavy use of on-chip fast memory.
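Since the slides contain no code, here is a minimal Python sketch of the Hough voting scheme just described (grid granularity, radius and the function name are illustrative assumptions):

```python
# Each hit votes, on a coarse 2D grid, for every candidate center lying
# at the test radius from that hit; the most-voted cell approximates
# the ring center.
import math
from collections import Counter

def hough_center(hits, radius, n_angles=64):
    votes = Counter()
    for (x, y) in hits:
        for k in range(n_angles):
            a = 2 * math.pi * k / n_angles
            votes[(round(x + radius * math.cos(a)),
                   round(y + radius * math.sin(a)))] += 1
    return votes.most_common(1)[0][0]

# 12 hits on a ring of radius 10 around (30, 40):
hits = [(30 + 10 * math.cos(2 * math.pi * i / 12),
         40 + 10 * math.sin(2 * math.pi * i / 12)) for i in range(12)]
print(hough_center(hits, 10))  # ~(30, 40)
```

In the GPU version the accumulator lives in shared memory, which is why concurrent votes produce the read/write conflicts listed under "Cons".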
Optimized for problem multi histos (OPMH)
Each PM (1000) is considered as the center of a circle. For each center a histogram is built from the distances between the center and the hits (<32). The whole processor is used for a single event (huge number of centers): each thread computes only a few distances, and several histograms are computed in different shared memory spaces. This mapping is not natural for the processor, and it is not possible to process more than one event at a time (the parallelism is fully exploited to speed up the computation).
Pros: very fast and simple kernel operation. Cons: unnatural resource assignment; the whole processor is used for a single event.
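The distance-histogram idea behind OPMH (and ODMH below) can be sketched in Python (bin width, geometry and the function name are illustrative, not the experiment's values):

```python
# For a candidate center, histogram the distances to all hits: a real
# ring centered there produces a sharp peak at the ring radius.
import math
from collections import Counter

def radius_peak(center, hits, bin_width=1.0):
    cx, cy = center
    hist = Counter(round(math.hypot(x - cx, y - cy) / bin_width)
                   for (x, y) in hits)
    bin_idx, votes = hist.most_common(1)[0]
    return bin_idx * bin_width, votes

# 16 hits on a ring of radius 10 around (30, 40):
hits = [(30 + 10 * math.cos(2 * math.pi * i / 16),
         40 + 10 * math.sin(2 * math.pi * i / 16)) for i in range(16)]
r, votes = radius_peak((30, 40), hits)
print(r, votes)  # peak at r = 10 collecting all 16 hits
```

On the GPU, one such histogram is built per candidate PM center, which is why the shared memory budget dominates the design.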
Optimized for device multi histo (ODMH)
Exactly the same algorithm as OPMH, but with a different resource assignment. The system is exploited in a more natural way: each block is dedicated to a single event, using the shared memory for one histogram. Several events are processed in parallel at the same time, and it is easier to avoid conflicts in shared and global memory.
[Diagram: each thread block (1 event → M threads, each thread handling N PMs) has its own shared memory; all blocks read from global memory]
Pros: natural memory optimization and resource assignment. Cons: some waste of resources.
Optimization procedure & results
[Plot: processing time (µs) vs optimization step – first version, multiple events, memory init optimization, memory access optimization, kernel optimization – for algo 1, algo 2 and algo 3]
Several steps of optimization (still room for improvement). Algo 1 and Algo 3 are suitable for processing many events concurrently at the same time (in Algo 2 a single event saturates the chip resources).
Algorithm  TESLA C1060
Algo 1     85 µs
Algo 2     139 µs
Algo 3     12.4 µs
Improvement with respect to the results shown at Capri (128 ms per ring):
- Different video card (factor of 10)
- Better understanding of resource assignment
- Better understanding of memory conflicts and memory management
- Multi-event approach
- Other algorithms
[Event display: hits generated with the NA62 G4 MC; best ring found by the GPU through 12 hits]
Further optimization on ODMH
Two crucial parameters: N, the total number of events per kernel, and PM, the number of PMs processed in the same thread.
[Plots: processing time (µs) vs N events per kernel at 8 PMs per thread, and vs PMs per thread at 1000 events per kernel]
The time is linear in the number of events per kernel. As a function of the number of PMs processed in a single thread, the minimum is at PMs = 8.
Working point (N = 1000, PMs = 8) → 10.8 µs
Present result
Further optimizations on working point:
- V11: moving arrays from registers (spilled to local memory) to shared memory → 6 µs
- V12: occupancy optimization → 5.9 µs
- V13: temporary histogram optimization in shared memory → 5 µs
This means that a group of 1000 events is processed in 5 ms; the latency (due to the algorithm) is 5 ms and the maximum rate is 200 kHz.
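These packet figures are internally consistent, as a quick check shows:

```python
# 1000 events at ~5 us each give a 5 ms packet processing time (and
# hence a 5 ms latency) and a sustained rate of 1000 / 5 ms = 200 kHz.

events_per_packet = 1000
time_per_event_s = 5e-6
packet_latency_s = events_per_packet * time_per_event_s
rate_hz = events_per_packet / packet_latency_s
print(packet_latency_s, rate_hz)  # ~0.005 s, ~200 kHz
```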
Data transfer time (per packet): data from host to GPU RAM → 70 µs; results from GPU RAM to host → 7 µs.
Concurrent copy and kernel execution in streams
Prospects
3 µs is not a too optimistic hypothesis for the next month → several optimizations still to be done. 1 µs is not too far from our reach: a new algorithm (“triplets”) will be implemented very soon.
1 µs means that, by using 10 video cards in parallel at the same time, we can reach the 10 MHz with a latency of 5 ms → the NA62-TELL1 can easily manage this latency → the only issue is with the GTK readout. The next generation of video cards (already available on the market) will offer at least a factor of 2 in performance (probably much more, due to the different architecture).
Next optimization steps:
- Copy the hits into shared memory
- Compute the PM center positions (instead of using constant memory)
- Eliminate slow operations (integer division)
- Eliminate __syncthreads() in perfectly aligned warp structures
- Understand the timing differences between reading and writing shared memory
GPU                           NVIDIA FERMI   ATI HD 5870
Single precision performance  2 Teraflops    2.72 Teraflops
Memory                        6 GB DDR5      6 GB DDR5
Memory speed                  ?              1.2 GHz
Bandwidth                     230 GB/s       153 GB/s
Stream processors             512            1600
Going through a working system
[Diagram: data path from an INTEL PRO/1000 quad GbE NIC through the PC (RAM, two CPUs) to the TESLA GPU and its VRAM. Link bandwidths: PCI-E x16 4 GB/s (20 MHz)*, PCI-E gen2 x16 8 GB/s (40 MHz)*, RAM–CPU 30 GB/s (150 MHz)*, VRAM–GPU 100 GB/s (500 MHz)*]
The maximum bandwidth in each part of the system is very high: there is no intrinsic system bottleneck.
(* the maximum “frequency equivalent” is computed assuming 200 B/event)
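The starred "frequency equivalent" figures follow directly from that assumption:

```python
# Divide each link's bandwidth by the assumed event size of 200 B/event
# to obtain the maximum event rate that link could carry.

EVENT_SIZE_B = 200

def freq_equivalent_hz(bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / EVENT_SIZE_B

print(freq_equivalent_hz(4e9))    # PCI-E x16: 20 MHz
print(freq_equivalent_hz(8e9))    # PCI-E gen2 x16: 40 MHz
print(freq_equivalent_hz(100e9))  # VRAM-GPU: 500 MHz
```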
Three important parts are still missing to test a real system:
- An intelligent GbE receiver, in which the data coming from the TELL1 (ethernet packets) are prepared to be transferred directly to the GPU memory
- A real-time Linux system on a dual-processor PC (one processor for the system, one processor to move the data)
- A re-synchronization card to send the trigger decision synchronously to the TTC
Conclusions
[Diagram: “quasi-triggerless” scheme with GPUs – front-end digitization + buffer (+ trigger primitives), followed by PCs+GPU at L0 and PCs+GPU at L1]
The use of GPUs in the trigger will be useful both at L1/L2 and at L0. An effective reduction of the trigger rate at L0 will be very important both for the main trigger and for secondary triggers, with great benefits for physics. The L0-GPU is very challenging with respect to the requirements on latency and processing time. A very careful use of the parallelization structure and of the memory allows reaching 5 µs of processing time and 5 ms of latency (for 1000-event packets). The 3 µs level will be reached with relatively small effort by optimizing the data transfer and dispatching, and even 1 µs is not completely impossible. By using 10 video cards in parallel and increasing the latency in the readout, the system seems feasible. More work is needed on the PC & HW side to design a completely new trigger paradigm, with high-quality primitive computing in the first stage.