TRANSCRIPT
ALICE Week 17.11.99
Technical Board
TPC Intelligent Readout Architecture
Volker Lindenstruth
Universität Heidelberg
What's new?
• TPC occupancy is much higher than originally assumed
• New trigger detector: TRD
• For the first time, TPC selective readout becomes relevant
• New Readout/L3 architecture - no intermediate buses and buffer memories, use PCI and local memory instead
• New dead-time / throttling architecture
TRD/TPC Overall Timeline
[Timeline figure (time axis 0-5): event; TRD pretrigger; TEC drift; end of TEC drift; TRD trigger at L1; data sampling, linear fit; track segment processing; track matching; trigger at TPC (gate opens); data shipping off detector.]
TPC L3 trigger and processing
[Flow diagram, TPC L3 trigger and processing. Stages: Global Trigger (L0, L1, L2), fed by the other trigger detectors and the TRD L0pre; TRD trigger; Front-End/Trigger; TPC intelligent readout; DAQ. Actions along the chain: ship TRD e+/e- tracks (L1); select regions of interest; conical zero-suppressed readout seeded by e+/e- tracks plus RoIs; tracking of e+/e- candidates inside the TPC (track segments and space points); verify e+/e- hypothesis; reject event; trigger and readout of the TPC at ~2 kHz; ship zero-suppressed TPC data sector-parallel (L2, 144 links, ~60 MB/evt); on-line data reduction in DAQ (tracking, reconstruction, partial readout, data compression).]
Architecture from TP
[Diagram, TP readout architecture: FEE of TPC, ITS, PHOS, PID and TRIG read out over the DDL into FEDCs and LDCs, through a switch (STL) to the GDCs, a second switch, the EDM and the PDS. Event rate: 10^4 Hz Pb-Pb, 10^5 Hz p-p. L0 trigger; L1 trigger: 10^3 Hz Pb-Pb, 10^4 Hz p-p, 1.5-2 µs; L2 trigger: 50 Hz central + 1 kHz dimuon Pb-Pb, 550 Hz p-p, 10-100 µs. Data rates: 2500 MB/s Pb+Pb / 20 MB/s p+p and 1250 MB/s Pb+Pb / 20 MB/s p+p. Trigger, data and BUSY paths shown.]
Some Technology Trends
DRAM
Year   Size     Cycle time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
...    ...      ...

        Capacity         Speed (latency)
Logic:  2x in 3 years    2x in 3 years
DRAM:   4x in 3 years    2x in 15 years
Disk:   4x in 3 years    2x in 10 years
        1000:1!          2:1!
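The 1000:1 and 2:1 figures can be read off the table directly (a back-of-the-envelope check, not spelled out on the slide): over the 15 years shown, DRAM capacity grows by three orders of magnitude while the cycle time improves by barely a factor of two:

$64\ \mathrm{Mb} / 64\ \mathrm{Kb} = 1024 \approx 1000$ (capacity)
$250\ \mathrm{ns} / 120\ \mathrm{ns} \approx 2.1 \approx 2$ (cycle time)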
Processor-DRAM Memory Gap
[Figure (Dave Patterson, UC Berkeley): performance vs. time, 1980-2000, log scale. CPU ("Moore's Law"): 60%/yr, i.e. 2x every 1.5 years; DRAM: 6%/yr, i.e. 2x every 15 years. The processor-memory performance gap grows by about 50% per year.]
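The ~50% per year gap follows directly from the two growth rates quoted in the figure (a back-of-the-envelope check, not stated on the slide):

$1.60 / 1.06 \approx 1.51$, i.e. the processor/DRAM performance ratio grows by roughly 50% per year.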
Testing the uniformity of memory
// Note: array, get_seconds(), SIZE_MIN, SIZE_MAX and ITERATIONS are assumed
// to be declared/defined elsewhere in the benchmark.
// Vary the size of the array, to determine the size of the cache or the
// amount of memory covered by TLB entries.
for (size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
    // Vary the stride at which we access elements,
    // to determine the line size and the associativity.
    for (stride = 1; stride <= size; stride *= 2) {
        // Do the following test multiple times so that the granularity of the
        // timer is better and the start-up effects are reduced.
        sec = 0; iter = 0;
        limit = size - stride + 1;
        iterations = ITERATIONS;
        do {
            sec0 = get_seconds();
            for (i = iterations; i; i--)
                // The main loop.
                // Does a read and a write from various memory locations.
                for (index = 0; index < limit; index += stride)
                    *(array + index) += 1;
            sec += (get_seconds() - sec0);
            iter += iterations;
            iterations *= 2;
        } while (sec < 1);
    }
}
[Figure: resulting access pattern - an array of the given size is walked repeatedly at the given stride; axes show address vs. iteration, with the stride and the array size marked.]
360 MHz Pentium MMX
[Figure: measured memory access time as a function of array size and stride for the 360 MHz Pentium MMX; plateaus at 2.7 ns, 95 ns and 190 ns, with transitions at 32 bytes and 4094 bytes.]
L1 instruction cache: 16 kB; L1 data cache: 16 kB (4-way associative, 16-byte line); L2 cache: 512 kB (unified); MMU: 32 I / 64 D TLB entries (4-way associative)
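One plausible reading of the three plateaus (an interpretation, not stated on the slide): 2.7 ns is a cache hit, i.e. roughly one clock cycle at 360 MHz; 95 ns is a main-memory access; and 190 ns is about two memory accesses, as expected when every access additionally suffers a TLB miss and the page-table entry has to be fetched first:

$1 / 360\ \mathrm{MHz} \approx 2.8\ \mathrm{ns} \approx 2.7\ \mathrm{ns}$
$2 \times 95\ \mathrm{ns} = 190\ \mathrm{ns}$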
360 MHz Pentium MMX
[Figures: the same measurement with the L2 cache switched off, and with all caches switched off.]
Comparison of Two Supercomputers
HP V-Class (PA-8x00): L1 instruction cache 512 kB; L1 data cache 1024 kB (4-way associative, 16-byte line); MMU: 160-entry fully associative TLB.
SUN E10k (UltraSparc II): L1 instruction cache 16 kB; L1 data cache 16 kB (write-through, non-allocating, direct-mapped, 32-byte line); L2 cache 512 kB (unified); MMU: 2x64-entry fully associative TLB.
LogP
[Diagram: P processors, each with memory (M) and a NIC, connected through an interconnection network; the parameters L, o and g are marked on the send and receive paths.]
L: time a packet travels in the network from sender to receiver
o: CPU overhead to send or receive a message
g: shortest time between two sent or two received messages
P: number of processors
Volume limited by L/g (aggregate throughput).
NIC: Network Interface Card
Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", PPoPP, May 1993.
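These parameters translate into simple time estimates (the standard LogP relations from Culler et al., not derived on the slide): a single short message costs the send overhead plus the network latency plus the receive overhead; a pipelined burst is limited by the gap; and at most ⌈L/g⌉ messages can be in flight between any pair of processors, which is the "volume limited by L/g" noted above:

$T_{1\,\mathrm{msg}} = o + L + o = L + 2o$
$T_{n\,\mathrm{msgs}} = L + 2o + (n-1)\,g$ (assuming $g \ge o$)
$N_{\mathrm{in\ flight}} \le \lceil L/g \rceil$ per processor pair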
2-Node Ethernet Cluster
[Figure (source: Intel): throughput of Fast Ethernet (100 Mb/s), Gigabit Ethernet, and Gigabit Ethernet with carrier extension.]
Test:
• SUN Gigabit Ethernet PCI card IP 2.0
• 2 SUN Ultra 450 servers, 1 CPU each
• The sender produces a TCP data stream with large data buffers; the receiver simply throws the data away (a sketch of the receiver side follows below)
• Processor utilization: sender 40%, receiver 60%!
• Throughput ca. 160 Mbit/s!
• Net throughput increases if the receiver is implemented as a twin processor
Why is the TCP/IP Gigabit Ethernet performance so much worse than what is theoretically possible?
Note: CMS implemented their own proprietary network API for Gigabit Ethernet and Myrinet.
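A minimal sketch of such a discard-only receiver (my own illustration in C, not the code actually used; the port number is a placeholder): it accepts one TCP connection, reads and throws away everything it gets, and reports the achieved throughput.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT    5001          /* placeholder test port */
#define BUFSIZE (64 * 1024)   /* large receive buffer, as in the test setup */

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    char buf[BUFSIZE];
    struct sockaddr_in addr;
    int srv, conn;
    long long total = 0;
    ssize_t n;
    double t0, t1;

    srv = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(srv, 1) < 0) {
        perror("bind/listen");
        return 1;
    }
    conn = accept(srv, NULL, NULL);   /* sender connects and streams data */
    t0 = seconds();
    while ((n = read(conn, buf, sizeof(buf))) > 0)
        total += n;                   /* simply throw the data away */
    t1 = seconds();
    printf("%.1f MB in %.1f s -> %.1f Mbit/s\n",
           total / 1e6, t1 - t0, total * 8.0 / (t1 - t0) / 1e6);
    close(conn);
    close(srv);
    return 0;
}

The per-byte cost in this loop - the read() system call, the TCP checksum and at least one kernel-to-user copy - is the kind of memory-bandwidth-bound work that drives the CPU utilization figures quoted above.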
First Conclusions - Outlook
• Memory bandwidth is the limiting and determining factor; moving data requires significant memory bandwidth.
• The number of TPC data links dropped from 528 to 180; aggregate data rate per link ~34 MB/s @ 100 Hz (see the estimate below).
• The TPC has the highest processing requirements; the majority of the TPC computation can be done on a per-sector basis.
• Keep the number of CPUs that process one sector in parallel to a minimum - today this number is 5 due to the TPC granularity. Try to get the sector data directly into one processor.
• Selective readout of TPC sectors can reduce the data rate requirement by factors of at least 2-5.
• The overall complexity of the L3 processor can be reduced by using PCI-based receiver modules delivering the data straight into the host memory, thus eliminating the need for VME crates combining the data from multiple TPC links.
• DATE already uses a GSM paradigm as memory pool - no software changes.
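For orientation, the quoted per-link figure implies the following aggregate numbers (my arithmetic from the values above, rounded):

$180 \times 34\ \mathrm{MB/s} \approx 6.1\ \mathrm{GB/s}$ aggregate, i.e. about $61\ \mathrm{MB}$ per event at $100\ \mathrm{Hz}$;
a selective-readout factor of 2-5 would bring this down to roughly $1.2$-$3\ \mathrm{GB/s}$.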
PCI Receiver Card Architecture
[Block diagram: optical receiver → data FIFO → multi-event buffer, controlled by an FPGA with a 66 MHz/64-bit PCI interface; readout pointers are pushed over PCI through the PCI host bridge so that event data land directly in the host memory.]
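A minimal sketch of how such a push readout might look from the host side (my own illustration; the register and field names are hypothetical, not the actual card interface): the driver hands the bus addresses of free event buffers to the card, and the FPGA pushes each received event into the next buffer and reports it back, so the CPU never copies the raw data.

/* Hypothetical register layout of the PCI receiver card (illustration only;
 * compiles as C but obviously needs the real hardware and a bus-address
 * mapping to do anything). */
struct rorc_regs {
    volatile unsigned int ptr_fifo;    /* write: bus address of a free event buffer */
    volatile unsigned int done_addr;   /* read: bus address of a filled buffer, 0 if none */
    volatile unsigned int done_length; /* read: length in bytes of that event */
};

#define NBUF       16
#define EVBUF_SIZE (2 * 1024 * 1024)   /* must hold the largest event fragment */

/* Hand all free buffers to the card; from then on it fills them autonomously. */
static void arm_buffers(struct rorc_regs *card, const unsigned int busaddr[NBUF])
{
    int i;
    for (i = 0; i < NBUF; i++)
        card->ptr_fifo = busaddr[i];   /* push a readout pointer into the FIFO */
}

/* Poll for completed events; the data are already in host memory when the
 * buffer address reappears, so only the pointer is handled by the CPU. */
static void poll_events(struct rorc_regs *card)
{
    unsigned int addr, len;
    while ((addr = card->done_addr) != 0) {
        len = card->done_length;
        /* ... hand (addr, len) to L3 processing or DAQ here ... */
        card->ptr_fifo = addr;         /* recycle the buffer */
        (void)len;
    }
}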
PCI Readout of one TPC sector
[Diagram: optical links run from the cave (x2, x18) to the counting house; there each link ends on a receiver board (RcvBd) sitting on the PCI bus of a receiver processor, which also carries a NIC connecting it to the L3 network.]
• Each TPC sector is read out by four optical links, which are fed by a small derandomizing buffer in the TPC front-end.
• The optical PCI receiver modules mount directly in a commercial off-the-shelf (COTS) receiver computer in the counting house.
• The COTS receiver processor performs any necessary hit-level functionality on the data in case of L3 processing.
• The receiver processor can also perform lossless compression and simply forward the data to DAQ, implementing the TP baseline functionality.
• The receiver processor is much less expensive than any crate-based solution.
Overall TPC Intelligent Readout Architecture
[Architecture diagram: the 36 TPC sectors feed FEE and DDLs into RORCs sitting on the PCI bus of LDC/L3 CPU nodes (PCI, MEM, CPU, NIC); the Inner Tracking System, Photon Spectrometer, Particle Identification and Muon Tracking Chambers feed RORCs in LDC/FEDC nodes. Trigger detectors (micro channel plate, zero-degree calorimeter, muon trigger chambers, Transition Radiation Detector) drive the L0/L1/L2 triggers; trigger decisions and detector-busy signals go back to the FEE; trigger data go to the DAQ. All LDC nodes connect through a switch and the L3 matrix (with EDM) to a farm of GDC/L3 CPU nodes (PCI, MEM, CPU, NIC) and onward to the PDS in the computer center.]
• Each TPC sector forms an independent sector cluster.
• The sector clusters merge through a cluster interconnect/network into a global processing cluster.
• The aggregate throughput of this network can be scaled up to beyond 5 GB/s at any point in time, allowing a fall-back to simple lossless binary readout.
• All nodes in the cluster are generic COTS processors, which are acquired at the latest possible time.
• All processing elements can be replaced and upgraded at any point in time.
• The network is commercial.
• The resulting multiprocessor cluster is generic and can be used as an off-line farm.
Dead Time / Flow Control
[Diagram: TPC FEE buffer (8 black events) → optical link → receiver board (RcvBd, PCI, NIC) → TPC receiver buffer (> 100 events); an event-receipt daisy chain runs through the receiver nodes.]
Scenario I - TPC dead time is determined centrally (a sketch of this bookkeeping follows below):
• For every TPC trigger a counter is incremented.
• For every completely received event the last receiver module produces a message (single-bit pulse), which is forwarded through all nodes after they have also received the event.
• The event-receipt pulse decrements the counter.
• When the counter reaches 7, TPC dead time is asserted (there could be another event already in the queue).

Scenario II - TPC dead time is determined centrally based on rates, assuming worst-case event sizes:
• Overflow protection for the FEE buffers: assert TPC BUSY if 7 events occur within 50 ms (assuming 120 MB/event, 1 Gbit).
• Overflow protection for the receiver buffers: ~100 events in 1 second - or a high-water mark in any receiver buffer (preferred way): high-water mark - send XOFF; low-water mark - send XON.

No need for reverse flow control on the optical link.
No need for dead-time signalling at the TPC front-end.
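A minimal sketch of the Scenario I bookkeeping (my own illustration; the names and the demo in main() are assumptions, only the FEE buffer depth of 8 and the BUSY threshold of 7 come from the slide):

#include <stdio.h>

#define FEE_BUFFER_DEPTH 8
#define BUSY_THRESHOLD   (FEE_BUFFER_DEPTH - 1)  /* keep one slot free for an event already queued */

static int events_in_flight = 0;  /* triggers issued minus events completely received */
static int tpc_busy = 0;

/* Called centrally for every TPC trigger that is issued. */
void on_tpc_trigger(void)
{
    events_in_flight++;
    if (events_in_flight >= BUSY_THRESHOLD)
        tpc_busy = 1;             /* assert TPC dead time */
}

/* Called when the event-receipt pulse has passed the whole receiver
 * daisy chain, i.e. every receiver module holds the complete event. */
void on_event_receipt_pulse(void)
{
    if (events_in_flight > 0)
        events_in_flight--;
    if (events_in_flight < BUSY_THRESHOLD)
        tpc_busy = 0;             /* release TPC dead time */
}

int main(void)
{
    int i;
    for (i = 1; i <= 10; i++) {   /* triggers arrive but no event is drained yet */
        if (!tpc_busy)
            on_tpc_trigger();
        printf("trigger attempt %2d: in flight = %d, busy = %d\n",
               i, events_in_flight, tpc_busy);
    }
    on_event_receipt_pulse();     /* one event fully received -> BUSY released */
    printf("after one receipt:   in flight = %d, busy = %d\n",
           events_in_flight, tpc_busy);
    return 0;
}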
Summary
• Memory bandwidth is a very important factor in designing high-performance multiprocessor systems; it needs to be studied in detail.
• Do not move data if not required - moving data costs money (except for some granularity effects).
• Overall complexity can be reduced by using PCI-based receiver modules delivering the data straight into the host memory, thus eliminating the need for VME.
• General-purpose COTS processors are less expensive than any crate solution.
• An FPGA-based PCI receiver card prototype is built; the NT driver is completed, the Linux driver almost completed.
• The DDL is already planned as a PCI version.
• No reverse flow control is required for the DDL.
• The DDL URD is to be revised by the collaboration ASAP.
• No dead time or throttling is required to be implemented at the front-end.
• Two scenarios exist for how to implement it for the TPC at the back-end without additional cost.