Experience with multi-threaded C++ applications in the ATLAS DataFlow
Szymon Gadomski, University of Bern, Switzerland, and INP Cracow, Poland
on behalf of the ATLAS Trigger/DAQ DataFlow group, CHEP 2003 conference
Performance problems found and solved:
• STL containers
• thread scheduling
• other
CHEP, March 03. S. Gadomski, "Experience with multi-threaded C++ in ATLAS DataFlow"
ATLAS DataFlow software
• Flow of data in the ATLAS DAQ system:
– data to LVL2 (part of the event), to the EF (whole event), to mass storage;
– see the talks by Giovanna Lehman (overview of the DataFlow) and by Stefan Stancu (networking).
• PCs, standard Linux, applications written in C++ (so far compiled only with gcc), standard network technology (Gb Ethernet).
• A "soft" real-time system with no guaranteed response time; the average response time is what matters.
• Common tasks (exchanging messages, state machine, access to the configuration DB, error reporting, …) are handled by a framework (well, actually two…).
ATLAS DataFlow software (2)
• State of the project:
– development done mostly in 2001-2002,
– measurements for the Technical Design Report: performance,
– preparation for beam test support: stability, robustness and deployment.
• 7 kinds of applications (+3 kinds of controllers).
• Always several threads (independent flows of execution within one application, without resources of their own).
• Roles, challenges and use of threads are very different.
• In this short talk only a few examples: use of threads, problems, solutions.
Testbed at CERN
• 1U PCs >= 2 GHz
• 4U PCs >= 2 GHz
• FPGA traffic generators
LVL2 processing unit (L2PU) - role
(Multiplicities are indicative only.)
[Diagram: up to 500 L2PUs; each gets L1 + RoI data from one of 10 L2SVs, sends data requests (RoI only) to the ~140 ROSs (holding detector data from ~1600 ROBs) and receives the data, sends the detailed LVL2 result to the single pROS and the LVL2 decision back to the L2SV. The L2PU is a DataFlow application with an interface to the control software; the mass-storage connection is an open choice.]
The L2PU:
• gets the LVL1 decision
• asks for data
• gets it
• makes the LVL2 decision
• sends it
• sends the detailed result
L2PU design
[Diagram: one Input Thread and several Worker Threads inside the L2PU. The Input Thread receives the LVL1 Result from the L2SV and RoI Data from the ROSs; the Worker Threads send RoI Data Requests to the ROSs, the LVL2 Decision to the L2SV and the LVL2 Result to the pROS.]
• Input Thread: assemble RoI data, add to the event queue.
• Worker Thread: get the next event from the queue, run the LVL2 selection code, request data and wait, continue the selection code, send the decision, if accept send the result, if complete restart the worker.
Sub-farm Interface (SFI) - role
(Multiplicities are indicative only.)
[Diagram: ~50 SFIs; the single DFM (which gets the LVL2 accepts and rejects) assigns events to an SFI and sends clears to the ~140 ROSs; the SFI requests and receives data from the ROSs, reports End of Event (EoE) to the DFM, and sends the complete event to the EF on request. The SFI is a DataFlow application with an interface to control; mass storage sits downstream.]
The SFI:
• gets an event id (L2 accept)
• asks for all event data
• gets it
• builds the complete event
• buffers it
• sends it to the Event Filter
SFI Design
[Diagram: inside the SFI, a Request Thread sends data requests (and re-asks for missing fragment IDs) to the ROSs, an Input Thread receives the fragments, an Assembly Thread builds the events, and an Event Handler sends the full event to the EF. The DFM sends event assigns (EB rate per SFI ~50 Hz) and receives End of Event.]
• Different threads for requesting and for receiving data.
• Separate threads for assembly and for sending to the Event Handler.
Lesson with L2PU and SFI – STL containers
[Plot: time spent blocked vs. number of threads]
• With no apparent dependence between the threads in the code, it was observed that the threads were not running independently; adding more threads had no effect.
• VisualThreads, using an instrumented pthread library, showed why: STL containers use a memory pool, by default one per executable. There is a lock on it, so threads may block each other.
Lesson with L2PU and SFI – STL containers (2)
• The solution is to use the pthread allocator: independent memory pools for each thread, no lock, no blocking.
• Use it for all containers used at event rate.
• Be careful with creating objects in one thread and deleting them in another.
[Plot: threads blocked less often vs. number of threads]
SFI History

Date        Change                                       EB        EB + Output to EF
30 Oct '02  First integration on testbed                 0.5 MB/s  -
13 Nov      Sending data requests at a regular pace      8.0 MB/s  -
14 Nov      Reduce the number of threads                 15 MB/s   -
20 Nov      Switch off hyper-threading                   17 MB/s   -
21 Nov      Introduce credit-based traffic shaping       28 MB/s   -
13 Dec      First try on throughput                      -         14 MB/s
17 Jan      Chose pthread allocator for STL objects      53 MB/s   18 MB/s
29 Jan      DC buffer recycling when sending             56 MB/s   19 MB/s
05 Feb      IOVec storage type in the EFormat library    58 MB/s   46 MB/s
21 Feb      Buffer pool per thread                       64 MB/s   48 MB/s
21 Feb      Grouping interthread communication           73 MB/s   51 MB/s
26 Feb      Avoiding one system call per message         80 MB/s   55 MB/s

Most improvements (and most problems) are related to threads.
Lessons from SFI
• Traffic shaping (limiting the number of outstanding requests for data) eliminates packet loss.
• Grouping interthread communication decreases the frequency of thread activation.
• Some improvements in more predictable areas:
– avoiding copies and system calls,
– avoiding object creation by recycling buffers,
– avoiding contention: each thread has its own buffers.
• Optimizations driven by measurements with full functionality.
• Effective development: the developer works on a good testbed, tests and optimizes in a short cycle.
Performance of the SFI
[Plot: EB rate (Hz) vs. #ROLs/ROS (0 to 10), EB-only throughput; flat line at the 95 MB/s I/O limit; CPU-limited region (2.4 GHz CPU)]
• Reaching the I/O limit at 95 MB/s, otherwise CPU limited.
• 35% performance gain with at least 8 ROLs/ROS.
• Will approach the I/O limit for 1 ROL/ROS with a faster CPU.
Readout System (ROS) - role
[Diagram: LVL2 or the EB sends a data request to the ROS; the I/O Manager, with ~12 buffers for data, requests the data from the ROBins and returns it; a ROS controller supervises.]
• RoI collection and partial event building.
• Not exactly like the SFI (all numbers approximate):

               ROS                    SFI
Request rate   24 kHz L2 + 3 kHz EB   50 Hz
Data per req.  2 kB LVL2, 8 kB EB     1.5 MB
Data rate      72 MB/s                75 MB/s
IOManager in ROS
[Diagram: incoming requests (L2, EB, delete) and the trigger feed a Request Queue inside the ROS process; Request Handler threads take requests from the queue and serve them from the RobIns; control and error paths on the side; the threads are dispatched by the Linux scheduler.]
• The number of request handlers is configurable.
Thread scheduling problem
[Plot: request rate (kHz) vs. number of request handlers, with and without the scheduler patch]
• The system runs without interrupts: poll and yield.
• The standard Linux scheduler puts the yielding thread away until the next time slice, up to 10 ms, against a latency of ~20 µs for getting the data.
• The solution is to change the scheduling in the kernel:
– for 2.4.9 kernels there exists an unofficial patch (tested on CERN RH 7.2),
– for CERN RH 7.3 there is a CERN-certified patch, linux_2.4.18_18_sched.yield.patch.
• This is an evolving field; thread-related changes of Linux kernels need continued evaluation.
Conclusions
• The DataFlow of the ATLAS DAQ has a set of applications managing the flow of data.
• All prototypes exist, have been optimized, are used for performance measurements and are prepared for the beam test.
• Standard technology (Gb Ethernet, PCs, standard Linux, multi-threaded C++ compiled with gcc) meets the ATLAS requirements.
• A few lessons were learned.
Backup slides
Data Flow Manager (DFM) - role
(Multiplicities are indicative only.)
[Diagram: the single DFM receives the LVL2 accepts and rejects from the L2SVs, assigns events to the SFIs, receives End of Event (EoE) and sends clears to the ROSs; the SFIs send data to the EF and to the ~30 SFOs, which write disk files to mass storage; interface with the Online SW. Further multiplicities shown: 100x, 200x, 16x, 1x.]
DFM Design
• The bulk of the work is done in the I/O thread.
• A cleanup thread identifies timed-out events.
• Fully embedded in the DC framework.
• Threads allow for independent and parallel processing within an application.
[Diagram: inside the DFM, the I/O Thread (load balancing, bookkeeping) receives L2 decisions from the L2SV and End of Event from the SFIs at an I/O rate of ~4 kHz, sends event assigns to the SFIs and clears to the ROSs; the Cleanup Thread handles the timeouts.]
STL containers (3)
SFI performance
• Input up to 95 MB/s (~3/4 of the 1 Gb/s line).
• Input and output at 55 MB/s (~1/2 line speed).
• With all the logic of event building and all the objects involved, the performance is already close to the network limit (on a 2.4 GHz PC).
Performance of Event Building
[Plot: total data rate of the EB (MB/s, 0 to 800) vs. number of SFIs (0 to 10)]
• Setup: N SFIs, 1 DFM, hardware emulators of the ROS.
• Max EB rate with 8 SFIs ~350 Hz (17% of the ATLAS EB rate).
After the patch: Xeon/2 GHz, Linux 2.4.18 + CERN scheduling patch
[Plot: L2 request rate (kHz, 0 to 200) vs. number of request handlers (0 to 40), one curve per simulated I/O latency: 2, 5, 10, 20, 50, 100 and 1000 µs]
• 100% L2 requests, 1 ROL per L2 request.
• Release grouping = 100.
Flow of messages
[Diagram: message sequence between RoIB, L2SV, L2PU, ROS/ROB, pROS, DFM, SFI and EF, with waits and timeouts at each stage (wait for the LVL2 decision or time out, receive or time out, sequential processing or time out, wait for EoE, reassign, time-out event) and flow-control messages SFI_FlowControl and DFM_FlowControl.]
1a: L2SV_LVL1Result (from the RoIB)
1b: L2PU_LVL2Decision
2a: L2PU_DataRequest (1..n)
2b: ROS/ROB_Fragment (1..n)
3a: L2PU_LVL2Result
3b: pROS_Ack
4a: L2SV_LVL2Decision
4b: DFM_Ack
5a: DFM_Decision; 5a': DFM_SFIAssign
5b: SFI_EoE
6a: SFI_DataRequest (1..i)
6b: ROS/ROB_EventFragment (1..i)
7: DFM_Clear
Note: the association of 6a (SFI_DataRequest) with 5a (DFM_Decision) is used for error recovery.
Deployment view
[Diagram: RODs feed the RO{B,S}, which connect to the LVL2 Switch and the EB Switch. The LVL2 side has the RoIB, the LVL2 Supervisors behind an SV Switch, and the LVL2 Processors. The EB side has the DFMs behind a DFM Switch and the SFIs; sub-farm switches and EF switches connect to the local EF farms and to a remote EF farm.]