Extensible Message Layers for Multimedia Cluster Computers
Dr. Craig Ulmer
Center for Experimental Research in Computer Systems
Outline
Background
  Evolution of cluster computers
  Multimedia of “Resource-rich” cluster computers
Design of extensible message layers
  GRIM: General-purpose Reliable In-order Messages
Extensions
  Integrating peripheral devices
  Streaming computations
Host-to-host performance
Concluding remarks
Background
An Evolution of Cluster Computers
Cluster Computers
Cost-effective alternative to supercomputers
  Number of commodity workstations
  Specialized network hardware and software
Result: Large pool of host processors
[Figure: four workstations, each with CPU, memory, I/O bus, and network interface, connected by a system area network]
Improving Cluster Computers
Adding more host CPUs
Adding intelligent peripheral devices
[Figure labels: Peripheral Devices, Host CPUs]
Peripheral Device Trends
Increasingly independent, intelligent peripheral devices
Feature on-card processing and memory facilities
Migration of computing power and bandwidth requirements to peripherals
[Figure: host CPU with Ethernet, storage, SAN NI, and media capture peripherals]
Resource-Rich Cluster Computers
Inclusion of diverse peripheral devices
  Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators
Processing takes place in host CPUs and peripherals
[Figure: cluster of hosts on a system area network, with host CPUs and peripherals including SAN NI, Ethernet, video capture, FPGA, and storage]
Benefits of Resource-Rich Clusters
Employ cluster computing in new applications
  Real-time constraints
  I/O intensive
  Network
Example: Digital libraries
  Enormous amounts of data
  Large number of network users
Example: Multimedia
  Capture and process large streams of multimedia data
  CAVE or Visualization clusters
Extensible Message Layers
Supporting Resource-Rich Cluster Computers
Problem: Utilizing distributed cluster resources
  How is efficient intra-cluster communication provided?
  How can applications make use of resources?
[Figure: host CPUs and peripherals (video capture, FPGA, RAID, and Ethernet cards) with unresolved connections marked "?"]
Answer: Flexible “Message Layer” Communication Software
Message layers are an enabling technology for clusters
  Enable cluster to function as a single-image multiprocessor system
Current message layers
  Optimized for transmissions between host CPUs
  Peripheral devices only available in context of the local host
What is needed
  Support efficient communication with host CPUs and peripherals
  Ability to harness peripheral devices as a pool of resources
GRIM: An Implementation
A message layer for resource-rich clusters
GRIM Core
General-purpose Reliable In-order Message Layer (GRIM)
Message layer for resource-rich clusters
  Myrinet SAN backbone
  Both host CPUs and peripheral devices are endpoints
  Communication core implemented in NI
[Figure: CPU, FPGA card, and storage card attached to a network interface card on the system area network]
Per-hop Flow Control
End-to-end flow control necessary for reliable delivery
  Prevents buffer overflows in communication path
Endpoint-managed schemes
  Impractical for peripheral devices
Per-hop flow control scheme
  Transfer data as soon as next stage can accept
  Optimistic approach
[Figure: path from sending endpoint over PCI, network interface, SAN, network interface, and PCI to receiving endpoint; an endpoint-managed Send/Reply exchange is contrasted with per-hop DATA/ACK exchanges]
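The per-hop scheme can be illustrated with a small simulation. This is only a sketch under assumptions: the `Stage` class, the buffer capacities, and the step-based timing are hypothetical and are not taken from GRIM. Each hop forwards a message only when the next hop has a free buffer slot, so no buffer can overflow and messages stay in order.

```python
from collections import deque


class Stage:
    """One hop in the path (for example an NI) with a bounded buffer."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.buffer = deque()

    def can_accept(self):
        return len(self.buffer) < self.capacity


def run_per_hop(messages, stages):
    """Move messages hop by hop; a hop forwards only when the next hop has room."""
    pending = deque(messages)   # still held by the sending endpoint
    delivered = []              # handed to the receiving endpoint
    steps = 0
    while pending or any(s.buffer for s in stages):
        steps += 1
        # Work from the last hop backwards so a slot freed by one hop
        # can be refilled by the previous hop in the same step.
        if stages[-1].buffer:
            delivered.append(stages[-1].buffer.popleft())
        for i in range(len(stages) - 2, -1, -1):
            if stages[i].buffer and stages[i + 1].can_accept():
                # DATA moves forward; removing it models the ACK freeing the slot.
                stages[i + 1].buffer.append(stages[i].buffer.popleft())
        if pending and stages[0].can_accept():
            stages[0].buffer.append(pending.popleft())
        # Invariant of the scheme: no hop ever exceeds its buffer capacity.
        assert all(len(s.buffer) <= s.capacity for s in stages)
    return delivered, steps
```

For example, `run_per_hop(range(5), [Stage("send NI", 1), Stage("recv NI", 1)])` delivers the five messages in their original order even though each hop can hold only one message at a time.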
Logical Channels
Multiple endpoints in a host share the NI
  Employ multiple logical channels in the NI
Each endpoint owns one or more logical channels
  Logical channel provides virtual interface to network
[Figure: endpoints 1 through n, each with a logical channel into the network interface, whose scheduler feeds the network]
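A minimal sketch of the idea, assuming a round-robin scheduler (the slides do not state the scheduling policy; the class and method names here are hypothetical). Each endpoint gets its own channel object, and the NI interleaves queued messages from all channels onto the network.

```python
from collections import deque


class LogicalChannel:
    """Per-endpoint virtual interface to the network."""

    def __init__(self, owner):
        self.owner = owner
        self.outgoing = deque()

    def send(self, payload):
        self.outgoing.append((self.owner, payload))


class NetworkInterface:
    """Shared NI that multiplexes several logical channels."""

    def __init__(self):
        self.channels = []

    def open_channel(self, owner):
        channel = LogicalChannel(owner)
        self.channels.append(channel)
        return channel

    def schedule(self):
        """Visit channels round-robin, taking one queued message per visit."""
        wire = []
        while any(ch.outgoing for ch in self.channels):
            for ch in self.channels:
                if ch.outgoing:
                    wire.append(ch.outgoing.popleft())
        return wire
```

With two channels, one holding two messages and the other holding one, the scheduler alternates between them, so neither endpoint can monopolize the shared NI.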
Programming Interfaces: Active Messages
Message specifies function to be executed at receiver
  Similar to remote procedure calls, but lightweight
  Invoke operations at remote resources
Useful for constructing device-specific APIs
  Example: Interactions with remote storage controller
[Figure: CPU and storage controller, each behind an NI on the SAN, exchanging AM_fetch_file() and AM_return_file()]
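The storage-controller example can be sketched as a handler table. The handler names follow the slide (`AM_fetch_file`, `AM_return_file`), but the message format, the `Endpoint` class, and the handler signatures are assumptions made for illustration only.

```python
AM_FETCH_FILE = 1    # hypothetical handler identifiers
AM_RETURN_FILE = 2


class Endpoint:
    """An endpoint that runs the handler named by each incoming message."""

    def __init__(self, name):
        self.name = name
        self.handlers = {}

    def register(self, handler_id, func):
        self.handlers[handler_id] = func

    def deliver(self, message):
        handler_id, args = message
        return self.handlers[handler_id](*args)


def build_demo():
    """Wire up a CPU endpoint and a storage endpoint holding one file."""
    files = {"a.txt": b"hello"}
    received = {}
    cpu = Endpoint("cpu")
    storage = Endpoint("storage")

    def am_return_file(name, data):
        # Runs at the CPU: record the file contents that came back.
        received[name] = data

    def am_fetch_file(name, reply_to):
        # Runs at the storage controller: reply with another active message.
        reply_to.deliver((AM_RETURN_FILE, (name, files[name])))

    cpu.register(AM_RETURN_FILE, am_return_file)
    storage.register(AM_FETCH_FILE, am_fetch_file)
    return cpu, storage, received
```

Calling `storage.deliver((AM_FETCH_FILE, ("a.txt", cpu)))` runs the fetch handler at the storage endpoint, which in turn sends a return message that runs at the CPU endpoint.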
Programming Interfaces: Remote Memory
Transfer blocks of data from one host to another
  Receiving NI executes transfer directly
Read and Write operations
  NI interacts with kernel driver to translate virtual addresses
  Optional notification mechanisms
[Figure: two hosts, each with CPU, memory, and NI, connected by the SAN]
Integrating Peripheral Devices
Hardware Extensibility
Peripheral Device Overview
In GRIM peripherals are endpoints
Intelligent peripherals
  Operate autonomously
  On-card message queues
  Process incoming active messages
  Eject outgoing active messages
Legacy peripherals
  Managed by host application or remote memory operations
[Figure: NI, CPUs, an intelligent peripheral device, and a legacy peripheral device]
Peripheral Device Examples
Video display card
  Manipulate frame buffer
  Remote memory writes
Server adaptor card
  Networked host on PCI card
  AM handlers for LAN-SAN bridge
Video capture card
  Specialized DMA engine
  AM handlers capture data
[Figure: video display card (AGP, frame buffer, D/A), server adaptor card (PCI, i960, Ethernet, SCSI), and video capture card (A/D, frame buffer, PCI DMA to host memory)]
Celoxica RC-1000 FPGA Card
FPGAs provide acceleration
  Load with application-specific circuits
Celoxica RC-1000 FPGA card
  Xilinx Virtex-1000 FPGA
  8 MB SRAM
Hardware implementation
  Endpoint as state machines
  AM handlers are circuits
[Figure: RC-1000 card with SRAM banks 0 through 3, control and switching logic, the FPGA, and the PCI interface]
FPGA Endpoint Organization
[Figure: FPGA endpoint organization. Frame: communication library API (input queues, output queues), memory API (application data in FPGA card memory), and user circuit API (user circuits 1 through n on the FPGA circuit canvas)]
Example FPGA Configuration
Cryptography configuration
  DES, RC6, MD5, and ALU
Occupies 70% of FPGA
  Newer FPGAs 8x in size
Operates with 20 MHz clock
  Newer FPGAs 6x faster
  4 KB payload => 55 μs (73 MB/s)
Expansion: Sharing the FPGA
FPGA has limited space for hardware circuits
  Host reconfigures FPGA on demand
  FPGA Function Fault
[Figure: the host CPU holds Configurations A, B, and C (containing Circuits X, Y, E, F, and G) and state storage; a message in SRAM 0 requesting Circuit F raises a function fault and the host reconfigures the FPGA (150 ms)]
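The function-fault mechanism resembles demand loading, and can be sketched as follows. This is an illustration under assumptions: the `FpgaCard` class, the configuration and circuit names in the usage note, and the reading of the slide's 150 ms figure as the reconfiguration cost are all hypothetical.

```python
class FpgaCard:
    """Tracks which configuration is loaded and reloads on a function fault."""

    def __init__(self, configurations, reconfig_ms=150):
        self.configurations = configurations   # name -> set of circuit names
        self.loaded = None                      # name of the loaded configuration
        self.reconfig_ms = reconfig_ms          # assumed cost of one reconfiguration
        self.faults = 0

    def _find_configuration(self, circuit):
        for name, circuits in self.configurations.items():
            if circuit in circuits:
                return name
        raise KeyError(f"no configuration provides circuit {circuit!r}")

    def invoke(self, circuit):
        """Return the simulated reconfiguration delay (ms) before the circuit can run."""
        if self.loaded is not None and circuit in self.configurations[self.loaded]:
            return 0                            # circuit already on the FPGA
        # Function fault: the host loads a configuration that has the circuit.
        self.faults += 1
        self.loaded = self._find_configuration(circuit)
        return self.reconfig_ms
```

For instance, with `FpgaCard({"crypto": {"des", "md5"}, "dsp": {"fft"}})`, invoking `"des"` and then `"md5"` costs one reconfiguration, while a following `"fft"` request faults and triggers a second one.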
Extension: Streaming Computations
Software extensibility
Streaming Computation Overview
Programming method for distributed resources
  Establish pipeline for streaming operations
  Example: Multimedia processing
[Figure: pipeline of a video capture endpoint and three media processor endpoints, each attached to a host with CPU and NI on the system area network; label: Celoxica RC-1000 FPGA endpoint]
Streaming Fundamentals
Computation: How is a computation performed?
  Active message approach
Forwarding: Where are results transmitted?
  Programmable forwarding directory
[Figure: the FPGA receives an in message (Destination: FPGA, Forward Entry: X, AM: Perform FFT), runs one of its computational circuits (Circuit 1: FFT through Circuit N: Encrypt), consults the forwarding directory, and emits an out message (Destination: Host, Forward Entry: X, AM: Receive FFT)]
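The two questions above map onto two tables per endpoint: handlers for computation and a forwarding directory for routing results. The sketch below assumes a dictionary-based message format and a `StreamingEndpoint` class, neither of which comes from GRIM; the "FFT" in the usage note is a stand-in function, not a real transform.

```python
class StreamingEndpoint:
    """Endpoint that computes on a message, then forwards the result."""

    def __init__(self, name, network):
        self.name = name
        self.network = network      # shared dict: endpoint name -> endpoint
        self.handlers = {}          # AM name -> function applied to the payload
        self.forwarding = {}        # entry id -> (destination, next entry, next AM)
        network[name] = self

    def receive(self, message):
        # Computation: run the handler named by the active message.
        result = self.handlers[message["am"]](message["payload"])
        # Forwarding: look up where the result should go next.
        entry = self.forwarding.get(message["forward_entry"])
        if entry is None:
            return result           # no entry, so this is the end of the pipeline
        destination, next_entry, next_am = entry
        out = {"am": next_am, "forward_entry": next_entry, "payload": result}
        return self.network[destination].receive(out)
```

A two-stage pipeline is built by registering a handler on an `fpga` endpoint, another on a `host` endpoint, and programming `fpga.forwarding["X"] = ("host", None, "receive_fft")`. Changing that one directory entry redirects the stream without touching the handlers.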
Host-to-Host Performance
Transferring data between two host-level endpoints
Host-to-Host Communication Performance
Host-to-host transfers are the standard benchmark
Three phases of data transfer
  Injection most challenging
Overall communication path
[Figure: path from source to destination in three numbered phases: (1) host memory to NI, (2) NI to NI across the SAN, (3) NI to host memory; used by both active messages and remote memory operations]
Host-NI: Data Injections
Host-NI transfers challenging
  Host lacks DMA engine
Multiple transfer methods
  Programmed I/O
  DMA
Automatically select method
Result: Tunable PCI Injection Library (TPIL)
[Figure: host CPU with cache, memory controller, and main memory connected over the PCI bus to a peripheral device with PCI DMA and device memory]
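One way automatic method selection could work is a simple cost model: programmed I/O has no setup cost but lower bandwidth, while DMA pays a fixed setup cost before streaming at a higher rate. The function below is a sketch of that idea, not TPIL's actual algorithm; the function name and all parameter values in the usage note are hypothetical.

```python
def choose_injection_method(size_bytes, pio_bandwidth, dma_bandwidth, dma_setup_us):
    """Pick the transfer method with the lower estimated time.

    Bandwidths are in MB/s, which equals bytes per microsecond,
    so size / bandwidth yields microseconds.
    """
    pio_time = size_bytes / pio_bandwidth
    dma_time = dma_setup_us + size_bytes / dma_bandwidth
    if pio_time <= dma_time:
        return "pio", pio_time
    return "dma", dma_time
```

With assumed values of 40 MB/s for programmed I/O, 120 MB/s for DMA, and a 5 μs DMA setup cost, a 64-byte injection selects programmed I/O and a 4 KB injection selects DMA. In a tunable library the crossover point would be measured per host rather than fixed.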
TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host
[Figure: bandwidth (MBytes/s) versus injection size (bytes)]
Overall Communication Pipeline
Three phases of transmission
Optimization: Use fragmentation to increase utilization
Optimization: Allow cut-through transmissions
[Figure: timelines of the sending host-NI, NI-NI, and receiving NI-host stages for Messages 1 through 3, with overall transmission time marked]
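The benefit of fragmentation can be estimated with a simple timing model. This is a sketch under assumptions: equal-size fragments, a message size that is a multiple of the fragment size, and an optional fixed per-fragment overhead per stage. The function names are hypothetical.

```python
import math


def store_and_forward_us(size_bytes, bandwidths):
    """Time when each stage handles the whole message before the next starts.

    Bandwidths are in MB/s (bytes per microsecond).
    """
    return sum(size_bytes / bw for bw in bandwidths)


def pipelined_us(size_bytes, fragment_bytes, bandwidths, overhead_us=0.0):
    """Time when fragments overlap across stages.

    The first fragment crosses every stage; each later fragment then
    completes one slowest-stage interval after the previous one.
    """
    n = math.ceil(size_bytes / fragment_bytes)
    stage_times = [overhead_us + fragment_bytes / bw for bw in bandwidths]
    return sum(stage_times) + (n - 1) * max(stage_times)
```

As the fragment count grows, the pipelined time approaches the time of the slowest stage alone, but a nonzero per-fragment overhead penalizes very small fragments, which is one reason performance depends on fragment size.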
Overall Host-to-Host Performance
Host        NI        Latency (μs)   Bandwidth (MB/s)
P4-1.7GHz   LANai 9   8              146
P4-1.7GHz   LANai 4   14.5           108
P3-550MHz   LANai 9   9.5            116
P3-550MHz   LANai 4   14             96
[Figure: bandwidth (MBytes/s) versus message size (bytes)]
Comparison to Existing Message Layers
[Figure: latency (μs) and bandwidth (MB/s) comparison charts]
Concluding Remarks
Key Contributions
Framework for communication in resource-rich clusters
  Reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces
  Comparable performance to state-of-the-art message layers
Extensible for peripheral devices
  Suitable for intelligent and legacy peripherals
  Methods for managing card resources
Extensible for higher-level programming abstractions
  Endpoint-level: Streaming computations and sockets emulation
  NI-level: Multicast support
Future Directions
Continued work with GRIM
  Video card vendors opening cards to developers
  Myrinet connected embedded devices
Adaptation to other network substrates
  Gigabit Ethernet appealing because of cost
  Modification to transmission protocols
  InfiniBand technology promising
Active system area networks
  FPGA chips beginning to feature gigabit transceivers
  Use FPGA chips as networked processing device
Additional Research Projects
Wireless Sensor Networks
NASA JPL Research
  In-situ WSNs
  Exploration of Mars
Communication
  Self organization
  Routing
SensorSim
  Java simulator
  Evaluate protocols
PeZ: Pole-Zero Editor for MATLAB
Related Publications
A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.
Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.
A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.
An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.
Papers and Software Available at
http://www.CraigUlmer.com/research
Backup Slides
Performance: FPGA Computations
[Figure: timeline of FPGA message-processing stages with clock counts. Stages: Acquire SRAM, Detect New Message, Fetch Header, Fetch Payload (1024 clocks), Computation, Store Results, Store Header, Lookup Forwarding, Update Queues, Release SRAM. Remaining clock counts as shown: 8, 4, 7, 1024, 1024, 16, 5, 3, 1]
Clock Speed: 20 MHz
Operation Latency: 55 μs (4 KB, 73 MB/s)
[Figure: FPGA endpoint block diagram with SRAM 0 (incoming queues), SRAM 1 (user page 0), SRAM 2 (user page 1), SRAM 3 (outgoing queues), ports A, B, and C, built-in ALU ops, message generator, results cache, two scratchpad controllers, fetch/decode, control/status port, and page fault]
Expansion: Sharing On-Card Memory
Limited on-card memory for storing application data
  Construct virtual memory system for on-card memory
  Swap space is host memory
[Figure: user-defined circuits on the FPGA access page frames 1 and 2 in SRAM 1 and SRAM 2; the host CPU supplies user page X]
RC-1000 Challenges
Hardware implementation
  Queue state machines
Memory locking
  SRAM single ported
  Arbitrate for use
CPU / NI contention
  NI manages FPGA lock
[Figure: CPU and NI contend for the card SRAM used by the FPGA user circuits, guarded by a memory lock]
Example: Autonomous Spaceborne Clusters
NASA Remote Exploration and Experimentation
  Spaceborne vehicle processes data locally
  Clusters in the sky
Number of peripheral devices
  Data sensors
  FPGA & DSPs
Adaptive hardware
  Modify functionality after deployment
Performance: Card Interactions
Acquire FPGA SRAM
  CPU-NI: 20 μs
  NI: 8 μs
Inject 4 KB message to FPGA
  CPU: 58 μs (70 MB/s)
  NI: 32 μs (128 MB/s)
Release FPGA SRAM
  CPU-NI: 8 μs
  NI: 5 μs
[Figure: CPU and NI accessing the FPGA card's SRAM and user circuits through the memory lock]
Example: Digital Libraries
Enormous amount of data and users
Intelligent LAN and storage cards to manage requests
[Figure: clients connect to three hosts, each with CPU, intelligent LAN adaptor, storage adaptor, and SAN NI; the hosts store Files A-H, Files I-R, and Files S-Z and are joined by the SAN backbone]
Cyclone Systems I2O Server Adaptor Card
Networked host on a PCI card
Integration with GRIM
  Interact directly with the NI
  Ported host-level endpoint software
Utilized as a LAN-SAN bridge
[Figure: card block diagram with i960 Rx processor, DMA engines, DRAM, ROM, local bus, primary PCI interface to the host system, secondary PCI interface, and a daughter card with 10/100 Ethernet and SCSI ports]
GRIM Multicast Extensions
Distribute the same message to multiple receivers
  Tree based distributions
  Replicate message at NI
  Messages are recycled back into network
Extensions to NI’s core communication operations
  Recycled messages in separate logical channel
  Utilize per-hop flow control for reliable delivery
[Figure: multicast tree over endpoints A through E and the corresponding message path through each endpoint's NI]
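Tree-based distribution can be sketched as follows. The tree shape (a complete binary tree in list order) and the function names are assumptions for illustration; the slides do not specify how GRIM builds its trees. Each node stands for an NI that delivers the message locally and then replicates it to its children.

```python
def build_tree(endpoints, fanout=2):
    """Arrange endpoints as a complete fanout-ary tree: {parent: [children]}."""
    children = {e: [] for e in endpoints}
    for i, endpoint in enumerate(endpoints[1:], start=1):
        parent = endpoints[(i - 1) // fanout]
        children[parent].append(endpoint)
    return children


def multicast(root, children, message):
    """Deliver at each node, then replicate the message to that node's children."""
    received = {}
    pending = [root]
    while pending:
        node = pending.pop(0)
        received[node] = message             # local delivery at this NI
        pending.extend(children[node])       # replicas recycled into the network
    return received
```

For endpoints A through E this yields A sending to B and C, and B sending to D and E, so the root injects two copies instead of the four that separate unicasts would need.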
Multicast Performance
[Figure: time (μs, 1 to 100,000) versus multicast message size (bytes, 1 to 1,000,000), log scales; series: multicast RTT, unicast RTT, multicast injection overhead, unicast injection overhead; 8 hosts, LANai 4, P4-1.7 GHz hosts]
Multicast Observations
Beneficial: reduces sending overhead
Performance loss for large messages
  Dependent on NI memory copy bandwidth
On-card memory copy benchmark:
  LANai 4: 19 MB/s
  LANai 9: 66 MB/s
Extension: Sockets Emulation
Berkeley sockets is a communication standard
  Utilized in numerous distributed applications
GRIM provides sockets API emulation
  Functions for intercepting socket calls
  AM handler functions for buffering connection data
[Figure: the sender intercepts write() and generates an AM (Append Socket X); the receiver's Append Socket AM handler stores the data in Socket X, and an intercepted read() extracts it]
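The interception flow can be sketched with two small classes. This is an illustration only: the class and method names are hypothetical, and a direct method call stands in for the active message that would cross the network.

```python
class EmulatedSocket:
    """Receive-side buffer for one emulated connection."""

    def __init__(self):
        self.buffer = bytearray()


class Receiver:
    def __init__(self):
        self.sockets = {}

    def am_append_socket(self, socket_id, data):
        """AM handler: buffer incoming connection data for the given socket."""
        self.sockets.setdefault(socket_id, EmulatedSocket()).buffer.extend(data)

    def read(self, socket_id, nbytes):
        """Intercepted read(): extract up to nbytes of buffered data."""
        sock = self.sockets.setdefault(socket_id, EmulatedSocket())
        data = bytes(sock.buffer[:nbytes])
        del sock.buffer[:nbytes]
        return data


class Sender:
    def __init__(self, receiver):
        self.receiver = receiver

    def write(self, socket_id, data):
        """Intercepted write(): generate an active message carrying the data."""
        self.receiver.am_append_socket(socket_id, data)   # stands in for the network
        return len(data)
```

Two writes of `b"hello "` and `b"world"` followed by `read(socket_id, 5)` return `b"hello"`, leaving the rest buffered for the next read, which matches stream-socket semantics.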
Sockets Emulation Performance
[Figure: bandwidth (MBytes/s, 0 to 120) versus transfer size (bytes, 1 to 10,000,000, log scale); series: GRIM Sockets LANai 4 and 100 Mb/s Ethernet; P4-1.7 GHz hosts]
Overall Performance: Store-and-Forward
Approach: Single message, no overlap
  Three transmission stages
  Expect roughly 1/3 of bandwidth of individual stage
[Figure: timeline of Message 1 through the sending host-NI (PCI: 132 MB/s), NI-NI (Myrinet: 160 MB/s), and receiving NI-host (PCI: 132 MB/s) stages, with overall transmission time marked; plot of bandwidth (MBytes/s) versus message size (bytes), P3-550 MHz hosts]
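The "roughly 1/3" expectation follows from adding the per-stage transfer times. A short check using the peak rates quoted on the slide (the function name is hypothetical, and real stages would fall short of these peaks):

```python
def serial_bandwidth(stage_bandwidths):
    """Effective bandwidth when stages run one after another with no overlap.

    Time per byte is the sum of the per-stage times per byte,
    so the effective rate is the reciprocal of that sum.
    """
    return 1.0 / sum(1.0 / bw for bw in stage_bandwidths)


# PCI, Myrinet, PCI peak rates in MB/s, as listed on the slide.
print(round(serial_bandwidth([132, 160, 132]), 1))   # prints 46.7
```

The result of about 46.7 MB/s is close to one third of either link's peak rate, which is the ceiling that pipelining and cut-through aim to lift.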
Enhancement: Message Pipelining
Allow overlap with multiple in-flight messages
  GRIM uses AM and RM fragmentation/reassembly
  Performance depends on fragment size
[Figure: timeline with Messages 1, 2, and 3 overlapped across the sending host-NI, NI-NI, and receiving NI-host stages, with overall transmission time marked; plot of bandwidth (MBytes/s) versus message size (bytes), LANai 9, P3-550 MHz hosts]
Enhancement: Cut-through Transfers
Forward data as soon as it begins to arrive
  Cut-through at sending and receiving NIs
[Figure: timeline with Messages 1 and 2 overlapping across the sending host-NI, NI-NI, and receiving NI-host stages, with overall transmission time marked; plot of bandwidth (MBytes/s) versus message size (bytes), LANai 9, P3-550 MHz hosts]