Timm Morten Steinbeck, Computer Science/Computer Engineering Group
Kirchhoff Institute f. Physics, Ruprecht-Karls-University Heidelberg
A Framework for Building Distributed Data Flow
Chains in Clusters
Requirements
ALICE: a relativistic heavy-ion physics experiment
Very large multiplicity: 15,000 particles/event
Full event size: > 70 MB
The last trigger stage (High Level Trigger, HLT) is the first stage with complete event data
Data rate into the HLT: up to 25 GB/s
High Level Trigger
The HLT has to process a large volume of data: it performs reconstruction of particle tracks from raw ADC data, turning a stream of digitized charge values such as
1, 2, 123, 255, 100, 30, 5, 1, 4, 3, 2, 3, 4, 5, 3, 4, 60, 130, 30, 5, ...
into reconstructed particle tracks.
High Level Trigger
The HLT consists of a Linux PC farm (500-1000 nodes) with a fast network.
- Nodes are arranged hierarchically
- The first stage reads data from the detector and performs first-level processing
- Each stage sends the data it produces to the next stage
- Each further stage performs processing or merging on its input data
High Level Trigger
The HLT needs a software framework to transport data through the PC farm. Requirements for the framework:
Efficiency:
- The framework should not use too many CPU cycles (which are needed for data analysis)
- The framework should transport the data as fast as possible
Flexibility:
- The framework should consist of components which can be plugged together in different configurations
- The framework should allow reconfiguration at runtime
The Data Flow Framework
Components are single processes
Communication based on publisher-subscriber principle:
One publisher can serve multiple subscribers
[Diagram: three subscribers each send a Subscribe request to a publisher; the publisher then announces New Data to every subscribed component.]
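To make the principle concrete, here is a minimal C++ sketch of such a publisher-subscriber interface. All names (Publisher, Subscriber, EventAnnouncement, NewEvent) are illustrative assumptions, not the framework's actual API:

#include <list>

// Hypothetical announcement sent to subscribers; the real framework
// announces shared memory descriptors rather than the data itself.
struct EventAnnouncement
{
    unsigned long fEventID;
};

// Interface implemented by subscriber components.
class Subscriber
{
public:
    virtual ~Subscriber() {}
    // Called by the publisher for each newly available event.
    virtual void NewEvent( const EventAnnouncement& event ) = 0;
};

// One publisher can serve multiple subscribers.
class Publisher
{
public:
    void Subscribe( Subscriber* s )   { fSubscribers.push_back( s ); }
    void Unsubscribe( Subscriber* s ) { fSubscribers.remove( s ); }

    // Announce a new event to every subscribed component.
    void AnnounceEvent( const EventAnnouncement& event )
    {
        for ( std::list<Subscriber*>::iterator it = fSubscribers.begin();
              it != fSubscribers.end(); ++it )
            (*it)->NewEvent( event );
    }

private:
    std::list<Subscriber*> fSubscribers;
};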
Framework Architecture
The framework consists of mutually dependent packages:
- Utility Classes
- Communication Classes
- Publisher & Subscriber Base Classes
- Data Source, Sink, & Processing Components
- Data Source, Sink, & Processing Templates
- Data Flow Components
Utility Classes
General helper classes (e.g. timer, thread)
Rewritten classes:
- Thread-safe string class
- Faster vector class
Communication Classes
Two abstract base classes:
- For small message transfers
- For large data blocks
Derived classes exist for each network technology.
Currently existing: message and block classes for TCP and SCI (Scalable Coherent Interface)
[Class diagram: Message Base Class with derived TCP Message Class and SCI Message Class; Block Base Class with derived TCP Block Class and SCI Block Class.]
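In C++ the hierarchy might look roughly as follows; class and method names are assumptions for illustration:

// Abstract base class for small message transfers. Large data blocks
// have an analogous abstract base class with the same pattern of
// technology-specific derived classes.
class MessageCommunication
{
public:
    virtual ~MessageCommunication() {}
    // API partly modelled after the socket API.
    virtual int Bind( const char* localAddress ) = 0;
    virtual int Connect( const char* remoteAddress ) = 0;
    virtual int Disconnect() = 0;
    virtual int Send( const void* msg, unsigned long size ) = 0;
    virtual int Receive( void* buffer, unsigned long size ) = 0;
};

// One derived class per network technology.
class TCPMessageCommunication : public MessageCommunication
{
    // ...implements the above calls using BSD sockets...
};

class SCIMessageCommunication : public MessageCommunication
{
    // ...implements the above calls using the SCI API...
};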
Communication Classes
Implementations are foreseen for the Atoll network (University of Mannheim) and the Scheduled Transfer Protocol.
The API is partly modelled after the socket API (Bind, Connect). Both implicit and explicit connection establishment are possible:
Explicit connection:
  User call    System calls
  Connect      connect
  Send         transfer data
  Send         transfer data
  Send         transfer data
  Disconnect   disconnect

Implicit connection:
  User call    System calls
  Send         connect, transfer data, disconnect
  Send         connect, transfer data, disconnect
  Send         connect, transfer data, disconnect
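In terms of the MessageCommunication sketch above, the two modes would be used like this (illustrative only; the address "node2" is made up, and the implicit case assumes the remote address was configured beforehand):

// Explicit connection: the user manages the connection's lifetime,
// so several Send calls share one connect/disconnect pair.
void SendExplicit( MessageCommunication& comm, const void* msg, unsigned long size )
{
    comm.Connect( "node2" );  // one connect system call
    comm.Send( msg, size );
    comm.Send( msg, size );
    comm.Send( msg, size );
    comm.Disconnect();        // one disconnect system call
}

// Implicit connection: each Send internally connects, transfers the
// data, and disconnects again; the user never calls Connect.
void SendImplicit( MessageCommunication& comm, const void* msg, unsigned long size )
{
    comm.Send( msg, size );   // connect + transfer + disconnect per call
}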
Publisher Subscriber Classes
Implement publisher-subscriber interface
Abstract interface for communication mechanism between processes/modules
Currently available: named pipes or shared memory
Multi-threaded implementation
Publisher Subscriber Classes
Efficiency: event data is not sent to the subscriber components.
- The publisher process places the data into shared memory (ideally already during production)
- Descriptors holding the location of the data in shared memory are sent to the subscribers
- This requires buffer management in the publisher
[Diagram: the publisher writes event M's data blocks 0..n into shared memory; a New Event message carries the event M descriptor, listing blocks 0..n, to the subscriber, which reads the data directly from shared memory.]
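A descriptor for such an event might look roughly like this; the structure and field names are assumptions:

// Sketch of the descriptor exchanged between publisher and subscriber.
// Only this small structure crosses the process boundary; the event
// data itself stays in shared memory.
struct BlockDescriptor
{
    unsigned long fShmID;   // identifies the shared memory segment
    unsigned long fOffset;  // offset of the data block in the segment
    unsigned long fSize;    // size of the data block in bytes
};

struct EventDescriptor
{
    unsigned long   fEventID;     // event M
    unsigned long   fBlockCount;  // number of data blocks (0..n)
    BlockDescriptor fBlocks[1];   // variable-length list of blocks
};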
Data Flow Components
Several components exist to shape the data flow in a chain:
- To merge data streams belonging to one event (EventMerger)
- To split and rejoin a data stream, e.g. for load balancing (EventScatterer and EventGatherer); see the sketch after the diagrams
[Diagram: an EventMerger combines Event N, Part 1, Part 2, and Part 3 into a single Event N.]
[Diagram: an EventScatterer distributes Events 1-3 over three Processing components; an EventGatherer recombines their outputs into one stream of Events 1-3.]
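For the scatterer, load balancing can be as simple as handing events to the processing chains in turn. A minimal round-robin sketch (the real component's policy and interface may differ):

// Sketch of an event scatterer's target selection. The matching
// gatherer recombines the processed events into a single stream.
class EventScatterer
{
public:
    explicit EventScatterer( unsigned targetCount )
        : fTargetCount( targetCount ), fNext( 0 ) {}

    // Returns the index of the processing chain for the next event.
    unsigned SelectTarget()
    {
        unsigned target = fNext;
        fNext = ( fNext + 1 ) % fTargetCount;
        return target;
    }

private:
    unsigned fTargetCount;
    unsigned fNext;
};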
Data Flow Components
Also provided: components to transparently transport data over the network to other computers (Bridge).
- The SubscriberBridgeHead uses a subscriber class to accept incoming data on one node
- The PublisherBridgeHead uses a publisher class to announce the data on the next node
[Diagram: SubscriberBridgeHead on Node 1, connected over the network to PublisherBridgeHead on Node 2.]
Component Template
Templates for user components are provided:
- To read out data from a source and insert it into a chain (Data Source Template)
- To accept data from the chain, process it, and reinsert the results (Analysis Template)
- To accept data from the chain and process it, e.g. store it (Data Sink Template)
A sketch of the analysis case follows the diagram.
[Diagram of the templates' internal stages:
Data Source Template: data source addressing & handling; buffer management; data announcing.
Analysis Template: accepting input data; input data addressing; data analysis & processing; output buffer management; output data announcing.
Data Sink Template: accepting input data; input data addressing; data sink addressing & writing.]
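With such templates the user mainly has to fill in the processing step. Schematically (the base class here is a made-up stand-in for the framework's analysis template):

// Minimal stand-in for the analysis template: the real template also
// handles subscribing, buffer management, and output announcing.
class AnalysisComponentTemplate
{
public:
    virtual ~AnalysisComponentTemplate() {}
protected:
    // Called for each input event; returns the number of output bytes.
    virtual unsigned long ProcessEvent( const void* input, unsigned long inputSize,
                                        void* output, unsigned long maxOutputSize ) = 0;
};

// A user component overrides just the processing method.
class MyClusterFinder : public AnalysisComponentTemplate
{
protected:
    virtual unsigned long ProcessEvent( const void* input, unsigned long inputSize,
                                        void* output, unsigned long maxOutputSize )
    {
        // ...analysis code, e.g. cluster finding on unpacked ADC data...
        return 0;  // bytes written to the output buffer
    }
};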
Benchmarks
Performance tests of the framework:
- Dual Pentium III, 733 MHz-800 MHz
- Tyan Thunder 2500 or Thunder HEsl motherboard, ServerWorks HE/HEsl chipset
- 512 MB RAM, >20 GB disk space (system, swap, tmp)
- Fast Ethernet, switch with 21 Gb/s backplane
- SuSE Linux 7.2 with kernel 2.4.16
Benchmarks
Benchmark of the publisher-subscriber interface:
- Publisher process with a 16 MB output buffer, event size 128 B
- The publisher does buffer management and copies the data into the buffer; the subscriber just replies to each event
- Maximum performance: more than 12.5 kHz
Benchmarks
Benchmark of the TCP message class:
- Client sending messages to a server on another PC
- TCP over Fast Ethernet, message size 32 B
- Maximum message rate: more than 45 kHz
Benchmarks
Publisher-Subscriber network benchmark:
- Publisher on node A, subscriber on node B, connected via a Bridge, TCP over Fast Ethernet
- 31 MB buffer in the publisher and the receiving bridge
- Message size from 128 B to 1 MB
[Setup: Publisher and SubscriberBridgeHead on node A; network; PublisherBridgeHead and Subscriber on node B.]
Benchmarks
Publisher-Subscriber network benchmark notes:
- Dual-CPU nodes; 100% CPU load corresponds to 2 fully loaded CPUs
- Theoretical rate: (100 Mb/s) / (event size)
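For example: 100 Mb/s corresponds to 12.5*10^6 B/s, so 128 B events give a theoretical rate of about 97.7 kHz, while 1 MB events give about 12 Hz.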
Benchmarks
- CPU load and event rate decrease with larger blocks
- The receiver is more loaded than the sender
- Minimum CPU load: 20% sender, 30% receiver
- Maximum CPU load (at 128 B events): 80% sender, 90% receiver
- Management of more than 250,000 events in the system at once!
Benchmarks
- At 32 kB event size the network bandwidth becomes the limit
- At 32 kB event size: more than 10 MB/s
- At maximum event size: 12.3*10^6 B/s of the theoretical 12.5*10^6 B/s
"Real-World" Test
13-node test setup
Simulation of the read-out and processing of 1/36 (one slice) of the ALICE Time Projection Chamber (TPC)
Simulated piled-up (overlapped) proton-proton events
Target processing rate: 200 Hz (the maximum read-out rate of the TPC)
"Real-World" Test
[Diagram of the 13-node chain: six parallel chains, one per TPC patch (patches 1-6). In each chain a File Publisher (FP) feeds an ADC Unpacker (AU) and a Cluster Finder (CF); an Event Scatterer (ES) distributes the events to two Trackers (T T), whose output an Event Gatherer (EG) recombines. Event Mergers (EM) feed a Patch Merger (PM), and a further Event Merger feeds the final Slice Merger (SM).
Legend: FP = File Publisher, AU = ADC Unpacker, CF = Cluster Finder, ES = Event Scatterer, T = Tracker (2x per chain), EG = Event Gatherer, EM = Event Merger, PM = Patch Merger, SM = Slice Merger.]
"Real-World" Test
[Diagram detail, first line of the chain: File Publisher (FP) → ADC Unpacker (AU) → Cluster Finder (CF). The ADC Unpacker expands zero-suppressed, run-length-encoded data such as
1, 2, 123, 255, 100, 30, 5, 4*0, 1, 4, 3*0, 4, 1, 2, 60, 130, 30, 5, ...
into explicit ADC values:
1, 2, 123, 255, 100, 30, 5, 0, 0, 0, 0, 1, 4, 0, 0, 0, 4, 1, 2, 60, 130, 30, 5, ...]
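As an illustration of this unpacking step, here is a minimal sketch of a run-length decoder. It assumes a simple encoding in which a zero value is followed by the count of suppressed zeros; the actual ALICE TPC raw format and the framework's unpacker are more involved, and the function name is made up:

#include <vector>

// Expands zero-suppressed, run-length-encoded ADC data: a 0 in the
// packed stream is followed by the number of zeros it stands for,
// so e.g. {..., 0, 4, ...} becomes {..., 0, 0, 0, 0, ...}.
std::vector<unsigned short> UnpackADC( const std::vector<unsigned short>& packed )
{
    std::vector<unsigned short> adc;
    for ( std::vector<unsigned short>::size_type i = 0; i < packed.size(); ++i )
    {
        if ( packed[i] == 0 && i + 1 < packed.size() )
        {
            // Zero marks a suppressed run; the next word holds its length.
            adc.insert( adc.end(), packed[++i], 0 );
        }
        else
            adc.push_back( packed[i] );
    }
    return adc;
}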
"Real-Word" Test
Third (line of) node(s) connects track segments to form complete tracks (Track Merging)
Second line of nodes finds curved particle track segments going through charge space-points
First line of nodes unpacks zero-suppressed, run-length-encoded ADC values and calculates 3D space-point coordinates of charge depositions
Test Results
[Annotated diagram of the test chain. Annotations:
Rate: 270 Hz, above the 200 Hz target.
CPU load per dual-CPU stage: 2 * 100%, 2 * 75%-100%, and 2 * 60%-70%.
Network: 6 * 130 kB/s and 6 * 1.8 MB/s of event data on the inter-stage links.]
Conclusion & Outlook
- The framework allows flexible creation of data flow chains while still maintaining efficiency
- No dependencies are created during compilation
- It is applicable to a wide range of tasks
- Performance with TCP on Fast Ethernet is already good enough for many applications
Future work: use the dynamic configuration ability for fault tolerance purposes; further performance improvements
More information: http://www.ti.uni-hd.de/HLT