Status of the L1 STS Tracking
I. Kisel
GSI / KIP
CBM Collaboration Meeting
GSI, March 12, 2009



Page 1:

Status of the L1 STS Tracking
I. Kisel, GSI / KIP
CBM Collaboration Meeting
GSI, March 12, 2009

Page 2: L1 CA Track Finder Efficiency

Track category                        Efficiency, %
Reference set (>1 GeV/c)              95.2
All set (≥4 hits, >100 MeV/c)         89.8
Extra set (<1 GeV/c)                  78.6
Clone                                  2.8
Ghost                                  6.6
MC tracks/ev found                    672
Speed, s/ev                           0.8

I. Rostovtseva

• Fluctuating magnetic field?
• Too large STS acceptance?
• Too large distance between STS stations?

Page 3: Many-core HPC

• High performance computing (HPC)
• Highest clock rate is reached
• Performance/power optimization
• Heterogeneous systems of many (>8) cores
• Similar programming languages (OpenCL, Ct and CUDA)
• We need a uniform approach to all CPU/GPU families

• On-line event selection
• Mathematical and computational optimization
• SIMDization of the algorithm (from scalars to vectors)
• MIMDization (multi-threads, many-cores)

[Diagram: candidate many-core architectures, each marked with a question mark – Gaming (STI: Cell), GP CPU (Intel: Larrabee), GP GPU (Nvidia: Tesla), CPU (Intel: XX-cores), FPGA (Xilinx), CPU/GPU (AMD: Fusion)]

Page 4: Current and Expected Eras of Intel Processor Architectures

From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.

• Future programming is 3-dimensional: cores, threads, SIMD width
• The amount of data doubles every 18-24 months
• Massive data streams
• The RMS (Recognition, Mining, Synthesis) workload in real time
• Supercomputer-level performance in ordinary servers and PCs
• Applications such as real-time decision-making analysis

Page 5: Cores and Threads

[Diagram: CPU architecture in 19XX – 1 process per CPU; in 2000 – 2 threads per process per CPU, each thread interleaving execute (exe) and read/write (r/w) phases; the architecture in 2009; the CPU of your laptop in 2015]

Page 6: SIMD Width

Register type                                         Lanes     Faster or slower?
Scalar double precision (64 bits)                     D         1
Vector (SIMD) double precision (128 bits)             D1 D2     2 or 1/2
Vector (SIMD) single precision (128 bits)             S1…S4     4 or 1/4
Intel AVX (2010) vector single precision (256 bits)   S1…S8     8 or 1/8
Intel LRB (2010) vector single precision (512 bits)   S1…S16    16 or 1/16

SIMD = Single Instruction, Multiple Data. SIMD uses vector registers and exploits data-level parallelism.

[Diagram: CPU with a scalar register (D) and a vector register (S S S S)]

Page 7: SIMD KF Track Fit on Intel Multicore Systems: Scalability

H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering

• Real-time performance on the quad-core Xeon 5345 (Clovertown) at 2.4 GHz – speed-up 30 with 16 threads
• Speed-up 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1
• Real-time performance on different Intel CPU platforms

Page 8: Intel Larrabee: 32 Cores

L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
• use the x86 instruction set with Larrabee-specific extensions;
• feature cache coherency across all its cores;
• include very little specialized graphics hardware.

LRB vs. CPU: The x86 processor cores in Larrabee will be different in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's 32 x86 cores will be based on the much simpler Pentium design;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
• each core contains a 512-bit vector processing unit, able to process 16 single-precision floating point numbers at a time;
• LRB includes explicit cache control instructions;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes one fixed-function graphics hardware unit.

Page 9: General Purpose Graphics Processing Units (GPGPU)

• Substantial evolution of graphics hardware over the past years
• Remarkable programmability and flexibility
• Reasonably cheap
• New branch of research – GPGPU

Page 10: NVIDIA Hardware

S. Kalcher, M. Bach

• Streaming multiprocessors
• No-overhead thread switching
• FPUs instead of cache/control
• Complex memory hierarchy
• SIMT – Single Instruction, Multiple Threads

GT200:
• 30 multiprocessors
• 30 DP units
• 8 SP FPUs per MP
• 240 SP units
• 16 000 registers per MP
• 16 kB shared memory per MP
• ≥1 GB main memory
• 1.4 GHz clock
• 933 GFlops SP

Page 11: SIMD/SIMT Kalman Filter on the CSC-Scout Cluster

[Plot: CPU 1600 vs. GPU 9100]

M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth

18 × (2 × (Quad-Xeon, 3.0 GHz, 2×6 MB L2), 16 GB) + 27 × Tesla S1070 (4 × (GT200, 4 GB))

Page 12: CPU/GPU Programming Frameworks

• Cg, OpenGL Shading Language, DirectX
  • Designed to write shaders
  • Require the problem to be expressed graphically

• AMD Brook
  • Pure stream computing
  • Not hardware-specific

• AMD CAL (Compute Abstraction Layer)
  • Generic usage of hardware at assembler level

• NVIDIA CUDA (Compute Unified Device Architecture)
  • Defines the hardware platform
  • Generic programming
  • Extension to the C language
  • Explicit memory management
  • Programming at thread level

• Intel Ct (C for throughput)
  • Extension to the C language
  • Intel CPU/GPU specific
  • SIMD exploitation for automatic parallelism

• OpenCL (Open Computing Language)
  • Open standard for generic programming
  • Extension to the C language
  • Supposed to work on any hardware
  • Usage of specific hardware capabilities via extensions

Page 13: On-line = Off-line Reconstruction?

• Off-line and on-line reconstructions will and should be parallelized
• Both versions will run on similar many-core systems, or even on the same PC farm
• Both versions will (probably) use the same parallel language(s), such as OpenCL
• Can we use the same code, but with some physics cuts applied when running on-line, like L1 CA?
• If the final code is fast, can we think about a global on-line event reconstruction and selection?

                      Intel SIMD   Intel MIMD   Intel Ct   NVIDIA CUDA   OpenCL
STS                       +            +            +           +           –
MuCh
RICH
TRD
Your Reco
Open Charm Analysis
Your Analysis

Page 14: Summary

• Think parallel!
• Parallel programming is the key to the full potential of Tera-scale platforms
• Data parallelism vs. parallelism of the algorithm
• Stream processing – no branches
• Avoid directly accessing main memory; no maps, no look-up tables
• Use the SIMD unit in the nearest future (many-cores, TF/s, …)
• Use single-precision floating point where possible
• In critical parts, use double precision if necessary
• Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
• New parallel languages appear: OpenCL, Ct, CUDA
• A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
• Should we start buying them for testing?

[Diagram: the many-core landscape again – Gaming (STI: Cell), GP CPU (Intel: Larrabee), GP GPU (Nvidia: Tesla), CPU (Intel: XXX-cores), FPGA (Xilinx), CPU/GPU (AMD: Fusion) – with "OpenCL?" as the common programming question]