TRANSCRIPT
ALICE Week 17.11.99
Technical Board
TPC Intelligent Readout Architecture
Volker Lindenstruth
Universität Heidelberg
What's new?
• TPC occupancy is much higher than originally assumed
• New trigger detector: TRD
• For the first time, TPC selective readout becomes relevant
• New Readout/L3 architecture - no intermediate buses and buffer memories, use PCI and local memory instead
• New dead-time / throttling architecture
TRD/TPC Overall Timeline
[Timeline figure (time axis 0-5): event; TRD pretrigger; TEC drift; end of TEC drift; TRD trigger at L1; data sampling, linear fit; track segment processing; track matching; trigger at TPC (gate opens); data shipping off detector.]
TPC L3 trigger and processing
[Flow diagram, TPC L3 trigger and processing. Stages: Global Trigger (L0, L1, L2), fed by the other trigger detectors and the TRD L0pre; TRD trigger; Front-End/Trigger; TPC intelligent readout; DAQ. Actions along the chain: ship TRD e+/e- tracks (L1); select regions of interest; conical zero-suppressed readout seeded by e+/e- tracks plus RoIs; tracking of e+/e- candidates inside the TPC (track segments and space points); verify e+/e- hypothesis; reject event; trigger and readout of the TPC at ~2 kHz; ship zero-suppressed TPC data sector-parallel (L2, 144 links, ~60 MB/evt); on-line data reduction in DAQ (tracking, reconstruction, partial readout, data compression).]
Architecture from TP
[Diagram, TP readout architecture: FEE of TPC, ITS, PHOS, PID and TRIG read out over the DDL into FEDCs and LDCs, through a switch (STL) to the GDCs, a second switch, the EDM and the PDS. Event rate: 10^4 Hz Pb-Pb, 10^5 Hz p-p. L0 trigger; L1 trigger: 10^3 Hz Pb-Pb, 10^4 Hz p-p, 1.5-2 µs; L2 trigger: 50 Hz central + 1 kHz dimuon Pb-Pb, 550 Hz p-p, 10-100 µs. Data rates: 2500 MB/s Pb+Pb / 20 MB/s p+p and 1250 MB/s Pb+Pb / 20 MB/s p+p. Trigger, data and BUSY paths shown.]
Some Technology Trends
DRAM
Year   Size     Cycle time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
...    ...      ...

        Capacity         Speed (latency)
Logic:  2x in 3 years    2x in 3 years
DRAM:   4x in 3 years    2x in 15 years
Disk:   4x in 3 years    2x in 10 years
        1000:1!          2:1!
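The 1000:1 and 2:1 figures can be read off the table directly (a back-of-the-envelope check, not spelled out on the slide): over the 15 years shown, DRAM capacity grows by three orders of magnitude while the cycle time improves by barely a factor of two:

$64\ \mathrm{Mb} / 64\ \mathrm{Kb} = 1024 \approx 1000$ (capacity)
$250\ \mathrm{ns} / 120\ \mathrm{ns} \approx 2.1 \approx 2$ (cycle time)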
Processor-DRAM Memory Gap
[Figure (Dave Patterson, UC Berkeley): performance vs. time, 1980-2000, log scale. CPU ("Moore's Law"): 60%/yr, i.e. 2x every 1.5 years; DRAM: 6%/yr, i.e. 2x every 15 years. The processor-memory performance gap grows by about 50% per year.]
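The ~50% per year gap follows directly from the two growth rates quoted in the figure (a back-of-the-envelope check, not stated on the slide):

$1.60 / 1.06 \approx 1.51$, i.e. the processor/DRAM performance ratio grows by roughly 50% per year.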
Testing the uniformity of memory
// Note: array, get_seconds(), SIZE_MIN, SIZE_MAX and ITERATIONS are assumed
// to be declared/defined elsewhere in the benchmark.
// Vary the size of the array, to determine the size of the cache or the
// amount of memory covered by TLB entries.
for (size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
    // Vary the stride at which we access elements,
    // to determine the line size and the associativity.
    for (stride = 1; stride <= size; stride *= 2) {
        // Do the following test multiple times so that the granularity of the
        // timer is better and the start-up effects are reduced.
        sec = 0; iter = 0;
        limit = size - stride + 1;
        iterations = ITERATIONS;
        do {
            sec0 = get_seconds();
            for (i = iterations; i; i--)
                // The main loop.
                // Does a read and a write from various memory locations.
                for (index = 0; index < limit; index += stride)
                    *(array + index) += 1;
            sec += (get_seconds() - sec0);
            iter += iterations;
            iterations *= 2;
        } while (sec < 1);
    }
}
[Figure: resulting access pattern - an array of the given size is walked repeatedly at the given stride; axes show address vs. iteration, with the stride and the array size marked.]
360 MHz Pentium MMX
[Figure: measured memory access time as a function of array size and stride for the 360 MHz Pentium MMX; plateaus at 2.7 ns, 95 ns and 190 ns, with transitions at 32 bytes and 4094 bytes.]
L1 instruction cache: 16 kB; L1 data cache: 16 kB (4-way associative, 16-byte line); L2 cache: 512 kB (unified); MMU: 32 I / 64 D TLB entries (4-way associative)
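One plausible reading of the three plateaus (an interpretation, not stated on the slide): 2.7 ns is a cache hit, i.e. roughly one clock cycle at 360 MHz; 95 ns is a main-memory access; and 190 ns is about two memory accesses, as expected when every access additionally suffers a TLB miss and the page-table entry has to be fetched first:

$1 / 360\ \mathrm{MHz} \approx 2.8\ \mathrm{ns} \approx 2.7\ \mathrm{ns}$
$2 \times 95\ \mathrm{ns} = 190\ \mathrm{ns}$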
360 MHz Pentium MMX
[Figures: the same measurement with the L2 cache switched off, and with all caches switched off.]
Comparison of Two Supercomputers
HP V-Class (PA-8x00): L1 instruction cache 512 kB; L1 data cache 1024 kB (4-way associative, 16-byte line); MMU: 160-entry fully associative TLB.
SUN E10k (UltraSparc II): L1 instruction cache 16 kB; L1 data cache 16 kB (write-through, non-allocating, direct-mapped, 32-byte line); L2 cache 512 kB (unified); MMU: 2x64-entry fully associative TLB.
LogP
[Diagram: P processors, each with memory (M) and a NIC, connected through an interconnection network; the parameters L, o and g are marked on the send and receive paths.]
L: time a packet travels in the network from sender to receiver
o: CPU overhead to send or receive a message
g: shortest time between two sent or two received messages
P: number of processors
Volume limited by L/g (aggregate throughput).
NIC: Network Interface Card
Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", PPoPP, May 1993.
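These parameters translate into simple time estimates (the standard LogP relations from Culler et al., not derived on the slide): a single short message costs the send overhead plus the network latency plus the receive overhead; a pipelined burst is limited by the gap; and at most ⌈L/g⌉ messages can be in flight between any pair of processors, which is the "volume limited by L/g" noted above:

$T_{1\,\mathrm{msg}} = o + L + o = L + 2o$
$T_{n\,\mathrm{msgs}} = L + 2o + (n-1)\,g$ (assuming $g \ge o$)
$N_{\mathrm{in\ flight}} \le \lceil L/g \rceil$ per processor pair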
2-Node Ethernet Cluster
[Figure (source: Intel): throughput of Fast Ethernet (100 Mb/s), Gigabit Ethernet, and Gigabit Ethernet with carrier extension.]
Test:
• SUN Gigabit Ethernet PCI card IP 2.0
• 2 SUN Ultra 450 servers, 1 CPU each
• The sender produces a TCP data stream with large data buffers; the receiver simply throws the data away (a sketch of the receiver side follows below)
• Processor utilization: sender 40%, receiver 60%!
• Throughput ca. 160 Mbit/s!
• Net throughput increases if the receiver is implemented as a twin processor
Why is the TCP/IP Gigabit Ethernet performance so much worse than what is theoretically possible?
Note: CMS implemented their own proprietary network API for Gigabit Ethernet and Myrinet.
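A minimal sketch of such a discard-only receiver (my own illustration in C, not the code actually used; the port number is a placeholder): it accepts one TCP connection, reads and throws away everything it gets, and reports the achieved throughput.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT    5001          /* placeholder test port */
#define BUFSIZE (64 * 1024)   /* large receive buffer, as in the test setup */

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    char buf[BUFSIZE];
    struct sockaddr_in addr;
    int srv, conn;
    long long total = 0;
    ssize_t n;
    double t0, t1;

    srv = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(srv, 1) < 0) {
        perror("bind/listen");
        return 1;
    }
    conn = accept(srv, NULL, NULL);   /* sender connects and streams data */
    t0 = seconds();
    while ((n = read(conn, buf, sizeof(buf))) > 0)
        total += n;                   /* simply throw the data away */
    t1 = seconds();
    printf("%.1f MB in %.1f s -> %.1f Mbit/s\n",
           total / 1e6, t1 - t0, total * 8.0 / (t1 - t0) / 1e6);
    close(conn);
    close(srv);
    return 0;
}

The per-byte cost in this loop - the read() system call, the TCP checksum and at least one kernel-to-user copy - is the kind of memory-bandwidth-bound work that drives the CPU utilization figures quoted above.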
First Conclusions - Outlook
• Memory bandwidth is the limiting and determining factor; moving data requires significant memory bandwidth.
• The number of TPC data links dropped from 528 to 180; aggregate data rate per link ~34 MB/s @ 100 Hz (see the estimate below).
• The TPC has the highest processing requirements; the majority of the TPC computation can be done on a per-sector basis.
• Keep the number of CPUs that process one sector in parallel to a minimum - today this number is 5 due to the TPC granularity. Try to get the sector data directly into one processor.
• Selective readout of TPC sectors can reduce the data rate requirement by factors of at least 2-5.
• The overall complexity of the L3 processor can be reduced by using PCI-based receiver modules delivering the data straight into the host memory, thus eliminating the need for VME crates combining the data from multiple TPC links.
• DATE already uses a GSM paradigm as memory pool - no software changes.
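For orientation, the quoted per-link figure implies the following aggregate numbers (my arithmetic from the values above, rounded):

$180 \times 34\ \mathrm{MB/s} \approx 6.1\ \mathrm{GB/s}$ aggregate, i.e. about $61\ \mathrm{MB}$ per event at $100\ \mathrm{Hz}$;
a selective-readout factor of 2-5 would bring this down to roughly $1.2$-$3\ \mathrm{GB/s}$.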
PCI Receiver Card Architecture
[Block diagram: optical receiver → data FIFO → multi-event buffer, controlled by an FPGA with a 66 MHz/64-bit PCI interface; readout pointers are pushed over PCI through the PCI host bridge so that event data land directly in the host memory.]
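A minimal sketch of how such a push readout might look from the host side (my own illustration; the register and field names are hypothetical, not the actual card interface): the driver hands the bus addresses of free event buffers to the card, and the FPGA pushes each received event into the next buffer and reports it back, so the CPU never copies the raw data.

/* Hypothetical register layout of the PCI receiver card (illustration only;
 * compiles as C but obviously needs the real hardware and a bus-address
 * mapping to do anything). */
struct rorc_regs {
    volatile unsigned int ptr_fifo;    /* write: bus address of a free event buffer */
    volatile unsigned int done_addr;   /* read: bus address of a filled buffer, 0 if none */
    volatile unsigned int done_length; /* read: length in bytes of that event */
};

#define NBUF       16
#define EVBUF_SIZE (2 * 1024 * 1024)   /* must hold the largest event fragment */

/* Hand all free buffers to the card; from then on it fills them autonomously. */
static void arm_buffers(struct rorc_regs *card, const unsigned int busaddr[NBUF])
{
    int i;
    for (i = 0; i < NBUF; i++)
        card->ptr_fifo = busaddr[i];   /* push a readout pointer into the FIFO */
}

/* Poll for completed events; the data are already in host memory when the
 * buffer address reappears, so only the pointer is handled by the CPU. */
static void poll_events(struct rorc_regs *card)
{
    unsigned int addr, len;
    while ((addr = card->done_addr) != 0) {
        len = card->done_length;
        /* ... hand (addr, len) to L3 processing or DAQ here ... */
        card->ptr_fifo = addr;         /* recycle the buffer */
        (void)len;
    }
}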
PCI Readout of one TPC sector
[Diagram: optical links run from the cave (x2, x18) to the counting house; there each link ends on a receiver board (RcvBd) sitting on the PCI bus of a receiver processor, which also carries a NIC connecting it to the L3 network.]
• Each TPC sector is read out by four optical links, which are fed by a small derandomizing buffer in the TPC front-end.
• The optical PCI receiver modules mount directly in a commercial off-the-shelf (COTS) receiver computer in the counting house.
• The COTS receiver processor performs any necessary hit-level functionality on the data in case of L3 processing.
• The receiver processor can also perform lossless compression and simply forward the data to DAQ, implementing the TP baseline functionality.
• The receiver processor is much less expensive than any crate-based solution.
Overall TPC Intelligent Readout Architecture
[Architecture diagram: the 36 TPC sectors feed FEE and DDLs into RORCs sitting on the PCI bus of LDC/L3 CPU nodes (PCI, MEM, CPU, NIC); the Inner Tracking System, Photon Spectrometer, Particle Identification and Muon Tracking Chambers feed RORCs in LDC/FEDC nodes. Trigger detectors (micro channel plate, zero-degree calorimeter, muon trigger chambers, Transition Radiation Detector) drive the L0/L1/L2 triggers; trigger decisions and detector-busy signals go back to the FEE; trigger data go to the DAQ. All LDC nodes connect through a switch and the L3 matrix (with EDM) to a farm of GDC/L3 CPU nodes (PCI, MEM, CPU, NIC) and onward to the PDS in the computer center.]
• Each TPC sector forms an independent sector cluster.
• The sector clusters merge through a cluster interconnect/network into a global processing cluster.
• The aggregate throughput of this network can be scaled up to beyond 5 GB/s at any point in time, allowing a fall-back to simple lossless binary readout.
• All nodes in the cluster are generic COTS processors, which are acquired at the latest possible time.
• All processing elements can be replaced and upgraded at any point in time.
• The network is commercial.
• The resulting multiprocessor cluster is generic and can be used as an off-line farm.
Dead Time / Flow Control
[Diagram: TPC FEE buffer (8 black events) → optical link → receiver board (RcvBd, PCI, NIC) → TPC receiver buffer (> 100 events); an event-receipt daisy chain runs through the receiver nodes.]
Scenario I - TPC dead time is determined centrally (a sketch of this bookkeeping follows below):
• For every TPC trigger a counter is incremented.
• For every completely received event the last receiver module produces a message (single-bit pulse), which is forwarded through all nodes after they have also received the event.
• The event-receipt pulse decrements the counter.
• When the counter reaches 7, TPC dead time is asserted (there could be another event already in the queue).

Scenario II - TPC dead time is determined centrally based on rates, assuming worst-case event sizes:
• Overflow protection for the FEE buffers: assert TPC BUSY if 7 events occur within 50 ms (assuming 120 MB/event, 1 Gbit).
• Overflow protection for the receiver buffers: ~100 events in 1 second - or a high-water mark in any receiver buffer (preferred way): high-water mark - send XOFF; low-water mark - send XON.

No need for reverse flow control on the optical link.
No need for dead-time signalling at the TPC front-end.
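A minimal sketch of the Scenario I bookkeeping (my own illustration; the names and the demo in main() are assumptions, only the FEE buffer depth of 8 and the BUSY threshold of 7 come from the slide):

#include <stdio.h>

#define FEE_BUFFER_DEPTH 8
#define BUSY_THRESHOLD   (FEE_BUFFER_DEPTH - 1)  /* keep one slot free for an event already queued */

static int events_in_flight = 0;  /* triggers issued minus events completely received */
static int tpc_busy = 0;

/* Called centrally for every TPC trigger that is issued. */
void on_tpc_trigger(void)
{
    events_in_flight++;
    if (events_in_flight >= BUSY_THRESHOLD)
        tpc_busy = 1;             /* assert TPC dead time */
}

/* Called when the event-receipt pulse has passed the whole receiver
 * daisy chain, i.e. every receiver module holds the complete event. */
void on_event_receipt_pulse(void)
{
    if (events_in_flight > 0)
        events_in_flight--;
    if (events_in_flight < BUSY_THRESHOLD)
        tpc_busy = 0;             /* release TPC dead time */
}

int main(void)
{
    int i;
    for (i = 1; i <= 10; i++) {   /* triggers arrive but no event is drained yet */
        if (!tpc_busy)
            on_tpc_trigger();
        printf("trigger attempt %2d: in flight = %d, busy = %d\n",
               i, events_in_flight, tpc_busy);
    }
    on_event_receipt_pulse();     /* one event fully received -> BUSY released */
    printf("after one receipt:   in flight = %d, busy = %d\n",
           events_in_flight, tpc_busy);
    return 0;
}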
Summary
• Memory bandwidth is a very important factor in designing high-performance multiprocessor systems; it needs to be studied in detail.
• Do not move data if not required - moving data costs money (except for some granularity effects).
• Overall complexity can be reduced by using PCI-based receiver modules delivering the data straight into the host memory, thus eliminating the need for VME.
• General-purpose COTS processors are less expensive than any crate solution.
• An FPGA-based PCI receiver card prototype is built; the NT driver is completed, the Linux driver almost completed.
• The DDL is already planned as a PCI version.
• No reverse flow control is required for the DDL.
• The DDL URD is to be revised by the collaboration ASAP.
• No dead time or throttling is required to be implemented at the front-end.
• Two scenarios exist for how to implement it for the TPC at the back-end without additional cost.