
Extensible Message Layers for Multimedia Cluster Computers

Dr. Craig Ulmer

Center for Experimental Research in Computer Systems

Outline

Background: evolution of cluster computers; multimedia of “resource-rich” cluster computers

Design of extensible message layers: GRIM (General-purpose Reliable In-order Messages)

Extensions: integrating peripheral devices; streaming computations

Host-to-host performance

Concluding remarks

Background

An Evolution of Cluster Computers

Cluster Computers

Cost-effective alternative to supercomputers: a number of commodity workstations; specialized network hardware and software

Result: Large pool of host processors

[Figure: four workstations, each with a CPU, memory, I/O bus, and network interface, connected by a System Area Network]

Improving Cluster Computers

Adding more host CPUs; adding intelligent peripheral devices

[Figure labels: Peripheral Devices, Host CPUs]

Peripheral Device Trends

Increasingly independent, intelligent peripheral devices

Feature on-card processing and memory facilities

Migration of computing power and bandwidth requirements to peripherals

[Figure: host CPU with Ethernet, storage, SAN NI, and media capture peripherals]

Resource-Rich Cluster Computers

Inclusion of diverse peripheral devices: Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators

Processing takes place in host CPUs and peripherals

[Figure: cluster of hosts on a System Area Network; host peripherals include SAN NIs, Ethernet, video capture, FPGA, and storage alongside the CPUs]

Benefits of Resource-Rich Clusters

Employ cluster computing in new applications: real-time constraints, I/O intensive, network

Example: digital libraries, with enormous amounts of data and a large number of network users

Example: multimedia. Capture and process large streams of multimedia data; CAVE or visualization clusters

Extensible Message Layers

Supporting Resource-Rich Cluster Computers

Problem: Utilizing distributed cluster resources

How is efficient intra-cluster communication provided? How can applications make use of resources?

[Figure: pool of cluster resources (CPUs, video capture, FPGAs, RAID, Ethernet) joined by links marked with question marks]

Answer: Flexible “Message Layer” Communication Software

Message layers are an enabling technology for clusters: they enable a cluster to function as a single-image multiprocessor system

Current message layers: optimized for transmissions between host CPUs; peripheral devices are only available in the context of the local host

What is needed: support for efficient communication with host CPUs and peripherals; the ability to harness peripheral devices as a pool of resources

GRIM: An Implementation

A message layer for resource-rich clusters

GRIM

Core

General-purpose Reliable In-order Message Layer (GRIM)

Message layer for resource-rich clusters: Myrinet SAN backbone; both host CPUs and peripheral devices are endpoints; communication core implemented in the NI

[Figure: CPU, FPGA card, storage card, and network interface card attached to the System Area Network]

Per-hop Flow Control

End-to-end flow control is necessary for reliable delivery: it prevents buffer overflows in the communication path

Endpoint-managed schemes: impractical for peripheral devices

Per-hop flow control scheme: transfer data as soon as the next stage can accept it (an optimistic approach)

[Figure: path from sending endpoint over PCI, network interface, SAN, network interface, and PCI to the receiving endpoint; an endpoint-managed Send/Reply exchange is contrasted with per-hop DATA/ACK exchanges on each link]
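The per-hop idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not GRIM's NI firmware: the function name `simulate_per_hop`, the buffer capacities, and the one-packet-per-step timing are all hypothetical. Each hop forwards a packet only when the next hop has a free slot, which is what keeps buffers from overflowing.

```python
from collections import deque


def simulate_per_hop(num_packets, capacities):
    """Push packets through a chain of bounded buffers, one hop at a time.

    capacities[i] is the number of packets hop i can hold (hypothetical values).
    A packet moves forward only when the next hop has a free slot, so no
    buffer can overflow. Returns (delivery order, peak occupancy per hop).
    """
    source = deque(range(num_packets))     # packets waiting at the sender
    hops = [deque() for _ in capacities]   # bounded buffers along the path
    peak = [0] * len(capacities)
    delivered = []

    while len(delivered) < num_packets:
        # The receiving endpoint consumes one packet from the last hop.
        if hops[-1]:
            delivered.append(hops[-1].popleft())
        # Work from the receiver back toward the sender so that a slot
        # freed downstream can be refilled by the upstream hop.
        for i in range(len(hops) - 1, 0, -1):
            if hops[i - 1] and len(hops[i]) < capacities[i]:
                hops[i].append(hops[i - 1].popleft())
        # The sender injects only when the first hop can accept.
        if source and len(hops[0]) < capacities[0]:
            hops[0].append(source.popleft())
        for i, hop in enumerate(hops):
            peak[i] = max(peak[i], len(hop))

    return delivered, peak
```

Calling `simulate_per_hop(10, [2, 1, 3])` delivers the ten packets in order while no hop ever holds more than its capacity, even though the middle hop has room for only one packet.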

Logical Channels

Multiple endpoints in a host share the NI: employ multiple logical channels in the NI

Each endpoint owns one or more logical channels; a logical channel provides a virtual interface to the network

[Figure: Endpoint 1 through Endpoint n, each with a logical channel into the network interface, where a scheduler feeds the network]
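A minimal sketch of how a scheduler might drain several logical channels onto one network port follows. The round-robin policy, the `LogicalChannel` class, and the `schedule` function are assumptions made for illustration; the slides do not state which scheduling policy the NI uses.

```python
from collections import deque


class LogicalChannel:
    """Per-endpoint outgoing queue inside the NI (hypothetical structure)."""

    def __init__(self, owner):
        self.owner = owner
        self.outgoing = deque()


def schedule(channels):
    """Yield (owner, message) pairs, visiting the channels in round-robin order.

    Round-robin is an assumed policy chosen for simplicity.
    """
    while any(ch.outgoing for ch in channels):
        for ch in channels:
            if ch.outgoing:
                yield ch.owner, ch.outgoing.popleft()
```

With a host CPU channel holding three messages and an FPGA channel holding one, the FPGA's message is sent second rather than waiting behind all of the CPU's traffic.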

Programming Interfaces: Active Messages

Message specifies the function to be executed at the receiver: similar to remote procedure calls, but lightweight; invokes operations at remote resources

Useful for constructing device-specific APIs. Example: interactions with a remote storage controller

[Figure: CPU and storage controller connected through NIs and the SAN, exchanging AM_fetch_file() and AM_return_file()]
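The storage-controller exchange can be sketched as a handler-table dispatch. The handler names `AM_fetch_file` and `AM_return_file` come from the slide; everything else (the `Endpoint` class, its methods, the argument lists, and the synchronous delivery) is a hypothetical stand-in and not GRIM's actual API.

```python
class Endpoint:
    """A host CPU or peripheral that can send and receive active messages."""

    def __init__(self, name, network):
        self.name = name
        self.network = network   # shared dict: endpoint name -> Endpoint
        self.handlers = {}       # handler name -> function
        network[name] = self

    def register(self, handler_name, function):
        self.handlers[handler_name] = function

    def send(self, dest, handler_name, *args):
        # In this sketch delivery is an immediate function call; a real
        # message layer would queue the message in the NI instead.
        receiver = self.network[dest]
        receiver.handlers[handler_name](receiver, self.name, *args)


def am_fetch_file(endpoint, sender, filename):
    """Runs at the storage controller: reply with the file contents."""
    endpoint.send(sender, "AM_return_file", filename, endpoint.files[filename])


def am_return_file(endpoint, sender, filename, data):
    """Runs at the requesting CPU: store the returned contents."""
    endpoint.received[filename] = data


network = {}
cpu = Endpoint("cpu", network)
storage = Endpoint("storage", network)
storage.files = {"clip.dat": b"frame-data"}   # made-up file for the example
cpu.received = {}
storage.register("AM_fetch_file", am_fetch_file)
cpu.register("AM_return_file", am_return_file)

cpu.send("storage", "AM_fetch_file", "clip.dat")
```

The message names a handler rather than carrying code, so the receiver only runs functions it has registered itself.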

Programming Interfaces: Remote Memory

Transfer blocks of data from one host to another: the receiving NI executes the transfer directly

Read and write operations: the NI interacts with a kernel driver to translate virtual addresses; optional notification mechanisms

[Figure: two hosts, each with CPU and memory, connected through NIs and the SAN]
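A rough sketch of remote memory read and write with address translation is shown below. The 4096-byte page size, the `Host` class, the `rm_write` and `rm_read` names, and the notification list are assumptions for illustration; the real translation is performed by the NI together with a kernel driver.

```python
PAGE = 4096  # assumed page size in bytes


class Host:
    """Target host with physical memory and a virtual-to-physical page table."""

    def __init__(self, num_frames, page_table):
        self.phys = bytearray(num_frames * PAGE)
        self.page_table = page_table      # virtual page number -> physical frame
        self.notifications = []           # optional completion notices

    def translate(self, vaddr):
        frame = self.page_table[vaddr // PAGE]
        return frame * PAGE + vaddr % PAGE


def rm_write(dest, vaddr, data, notify=False):
    """Write data into dest's memory, translating each byte's address."""
    for i, byte in enumerate(data):
        dest.phys[dest.translate(vaddr + i)] = byte
    if notify:
        dest.notifications.append((vaddr, len(data)))


def rm_read(src, vaddr, length):
    """Read length bytes from src's memory starting at a virtual address."""
    return bytes(src.phys[src.translate(vaddr + i)] for i in range(length))
```

A write that crosses a page boundary lands in two non-adjacent physical frames, which is why the address must be translated per page rather than once per transfer.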

Integrating Peripheral Devices

Hardware Extensibility

Peripheral Device Overview

[Figure: NI, CPUs, and a peripheral device]

In GRIM peripherals are endpoints

Intelligent peripherals operate autonomously: on-card message queues; process incoming active messages; eject outgoing active messages

Legacy peripherals: managed by a host application or by remote memory operations

[Figure label: Legacy Peripheral Device]

Peripheral Devices Examples

Video display card: manipulate frame buffer; remote memory writes

[Figure: Video Display card with AGP, frame buffer, and D/A]

Server adaptor card: networked host on a PCI card; AM handlers for LAN-SAN bridge

[Figure: Server Adaptor with PCI, i960, Ethernet, and SCSI]

Video capture card: specialized DMA engine; AM handlers capture data

[Figure: Video Capture card with A/D, frame buffer, and PCI DMA into host memory]

Celoxica RC-1000 FPGA Card

FPGAs provide acceleration: load with application-specific circuits

Celoxica RC-1000 FPGA card: Xilinx Virtex-1000 FPGA; 8 MB SRAM

Hardware implementation: endpoint as state machines; AM handlers are circuits

[Figure: FPGA with PCI interface, control and switching logic, and four SRAM banks (SRAM 0 to SRAM 3)]

FPGA Endpoint Organization

[Figure: FPGA endpoint organization. Frame: input queues, output queues, communication library API. FPGA card memory: application data behind a memory API. FPGA circuit canvas: user circuits 1 to n behind a user circuit API]

Example FPGA Configuration

Cryptography configuration: DES, RC6, MD5, and ALU

Occupies 70% of the FPGA; newer FPGAs are 8x in size

Operates with a 20 MHz clock; newer FPGAs are 6x faster; 4 KB payload => 55 μs (73 MB/s)

Expansion: Sharing the FPGA

FPGA has limited space for hardware circuits: the host reconfigures the FPGA on demand (FPGA function fault)

[Figure: host CPU and FPGA with Configuration A (Circuit X, Circuit Y) loaded; Configurations A, B, and C (Circuits E, F, G) are available, with state storage in SRAM0; the message "Use Circuit F" raises a function fault and Configuration C is loaded (150 ms)]
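The function-fault mechanism can be sketched as a small manager that swaps configurations on demand. The circuit and configuration names and the 150 ms figure are taken from the slide's diagram; the `FpgaManager` class, its methods, and the policy of loading the first configuration that contains the requested circuit are assumptions, and saving and restoring circuit state is left out.

```python
class FpgaManager:
    """Tracks which configuration is loaded and reconfigures on a function fault."""

    def __init__(self, configurations, loaded, reconfig_ms=150):
        self.configurations = configurations   # name -> set of circuit names
        self.loaded = loaded
        self.reconfig_ms = reconfig_ms
        self.faults = 0

    def use(self, circuit):
        """Return the configuration that serves the circuit, reloading if needed."""
        if circuit not in self.configurations[self.loaded]:
            # Function fault: the requested circuit is not on the FPGA.
            for name, circuits in self.configurations.items():
                if circuit in circuits:
                    self.loaded = name
                    self.faults += 1
                    break
            else:
                raise KeyError(f"no configuration provides circuit {circuit!r}")
        return self.loaded

    def reconfig_time_ms(self):
        return self.faults * self.reconfig_ms
```

For the request sequence X, F, G, Y starting from Configuration A, only F and Y fault, so the sketch charges two reconfigurations (300 ms). Grouping requests by configuration therefore matters far more than the 55 μs cost of a single operation.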

Extension: Streaming Computations

Software extensibility

Streaming Computation Overview

Programming method for distributed resources: establish a pipeline for streaming operations. Example: multimedia processing

[Figure: video capture, media processor, and Celoxica RC-1000 FPGA endpoints, each behind a CPU and NI, connected by the System Area Network]

Streaming Fundamentals

Computation: How is a computation performed? Active message approach

Forwarding: Where are results transmitted? Programmable forwarding directory

[Figure: FPGA with computational circuits (Circuit 1: FFT through Circuit N: Encrypt) and a forwarding directory. In message: Destination: FPGA, Forward Entry: X, AM: Perform FFT. Out message: Destination: Host, Forward Entry: X, AM: Receive FFT]
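The two questions above (how a computation is selected, and where its result goes) can be sketched together. The message fields mirror the figure (destination, forward entry, AM); the `StreamingEndpoint` class, the dictionary message format, and the handler names `perform_fft` and `receive_fft` are hypothetical. A naive DFT written with `cmath` stands in for the FFT circuit.

```python
import cmath


def dft(samples):
    """Naive discrete Fourier transform, standing in for the FFT circuit."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


class StreamingEndpoint:
    """Runs the circuit named by the AM, then forwards per the directory."""

    def __init__(self, circuits, forwarding_directory):
        self.circuits = circuits                          # AM name -> function
        self.forwarding_directory = forwarding_directory  # entry -> (dest, AM, next entry)

    def handle(self, message):
        result = self.circuits[message["am"]](message["payload"])
        dest, next_am, next_entry = self.forwarding_directory[message["forward_entry"]]
        return {
            "destination": dest,
            "forward_entry": next_entry,
            "am": next_am,
            "payload": result,
        }


fpga = StreamingEndpoint(
    circuits={"perform_fft": dft},
    forwarding_directory={0: ("host", "receive_fft", 0)},
)
out = fpga.handle(
    {"destination": "fpga", "forward_entry": 0, "am": "perform_fft", "payload": [1, 0, 0, 0]}
)
```

Because the next hop lives in the directory rather than in the circuit, the same circuit can be reused in different pipelines by reprogramming directory entries.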

Host-to-Host Performance

Transferring data between two host-level endpoints

Host-to-Host Communication Performance

Host-to-host transfers are a standard benchmark; three phases of data transfer

Injection is the most challenging

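[Figure: overall communication path from source to destination in three numbered phases: (1) host CPU and memory to NI, (2) NI to NI across the SAN, (3) NI to host CPU and memory; shown for active messages and remote memory operations]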

Host-NI: Data Injections

Host-NI transfers are challenging: the host lacks a DMA engine

Multiple transfer methods: programmed I/O, DMA

Automatically select method

Result: Tunable PCI Injection Library (TPIL)

[Figure: host CPU with cache, memory controller, and main memory, connected over the PCI bus to a peripheral device with its own memory and a PCI DMA engine]
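One way to picture automatic method selection is a cost model that compares a low-setup method against a high-throughput one. This is a sketch only: the linear cost model, the function names, and every number in `HYPOTHETICAL_COSTS` are assumptions for illustration and are not measurements or the actual TPIL algorithm.

```python
def transfer_time_us(size_bytes, method, costs):
    """Estimated time: fixed setup cost plus size divided by throughput."""
    setup_us, bytes_per_us = costs[method]
    return setup_us + size_bytes / bytes_per_us


def select_method(size_bytes, costs):
    """Pick the method with the lowest estimated time for this injection size."""
    return min(costs, key=lambda method: transfer_time_us(size_bytes, method, costs))


# (setup time in microseconds, throughput in bytes per microsecond); made-up values
HYPOTHETICAL_COSTS = {
    "pio": (0.0, 80.0),    # programmed I/O: no setup, lower throughput
    "dma": (10.0, 120.0),  # DMA: setup cost, higher throughput
}
```

With these made-up costs the crossover sits at 2400 bytes, so a 256-byte injection selects programmed I/O and a 4096-byte injection selects DMA. A tunable library would fit such parameters to each host and NI pairing.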

TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host

[Plot: bandwidth (MBytes/s) versus injection size (bytes)]

Overall Communication Pipeline

Three phases of transmission. Optimization: use fragmentation to increase utilization. Optimization: allow cut-through transmissions

[Figure: timelines of the three stages (sending host-NI, NI-NI, receiving NI-host) showing the overall transmission time for Messages 1 to 3 under the optimizations]

Overall Host-to-Host Performance

Host        NI        Latency (μs)   Bandwidth (MB/s)

P4-1.7GHz   LANai 9   8              146

P4-1.7GHz   LANai 4   14.5           108

P3-550MHz   LANai 9   9.5            116

P3-550MHz   LANai 4   14             96

[Plot: bandwidth (MBytes/s) versus message size (bytes)]

Comparison to Existing Message Layers

[Charts: latency (μs) and bandwidth (MB/s) comparison]

Concluding Remarks

Key Contributions

Framework for communication in resource-rich clusters: reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces; comparable performance to state-of-the-art message layers

Extensible for peripheral devices: suitable for intelligent and legacy peripherals; methods for managing card resources

Extensible for higher-level programming abstractions. Endpoint-level: streaming computations and sockets emulation. NI-level: multicast support

Future Directions

Continued work with GRIM: video card vendors opening cards to developers; Myrinet-connected embedded devices

Adaptation to other network substrates: Gigabit Ethernet is appealing because of cost but needs modification to transmission protocols; InfiniBand technology is promising

Active system area networks: FPGA chips are beginning to feature gigabit transceivers; use FPGA chips as networked processing devices

Additional Research Projects

Wireless Sensor Networks

NASA JPL research: in-situ WSNs; exploration of Mars

Communication: self-organization, routing

SensorSim: Java simulator; evaluate protocols

PeZ: Pole-Zero Editor for MATLAB

Related Publications

A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.

Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.

A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.

An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.

Papers and Software Available at

http://www.CraigUlmer.com/research

Backup Slides

Performance: FPGA Computations

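[Figure: FPGA operation stages: Acquire SRAM, Detect New Message, Fetch Header, Fetch Payload, Computation, Store Results, Store Header, Lookup Forwarding, Update Queues, Release SRAM. Clock counts as extracted, without a reliable stage mapping: 8, 4, 7, 1024, 1024, 16, 5, 3, 1; Fetch Payload: 1024]

Clock speed: 20 MHz. Operation latency: 55 μs (4 KB, 73 MB/s)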

[Figure: FPGA internals: SRAM0 (incoming queues), SRAM1 (user page 0), SRAM2 (user page 1), SRAM3 (outgoing queues); ports A, B, and C; built-in ALU ops; message generator; results cache; scratchpad controllers; fetch/decode; control/status port; page fault]

Expansion: Sharing On-Card Memory

Limited on-card memory for storing application data: construct a virtual memory system for on-card memory; swap space is host memory

[Figure: host CPU and FPGA with user-defined circuits; page frames 1 and 2 in SRAM1 and SRAM2; user page X]
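The on-card virtual memory idea can be sketched with two page frames backed by a host-side swap area. Two frames matches the figure (SRAM1 and SRAM2); the least-recently-used replacement policy, the 16-byte page size, and the `CardMemory` class are assumptions made to keep the example small.

```python
from collections import OrderedDict


class CardMemory:
    """Two on-card page frames, with host memory acting as swap space."""

    def __init__(self, num_frames=2, page_size=16):
        self.num_frames = num_frames
        self.page_size = page_size
        self.frames = OrderedDict()   # page id -> data, least recently used first
        self.swap = {}                # evicted pages held in host memory
        self.page_faults = 0

    def access(self, page):
        """Return the page's data, faulting it in from host memory if absent."""
        if page in self.frames:
            self.frames.move_to_end(page)     # mark as most recently used
            return self.frames[page]
        self.page_faults += 1
        if len(self.frames) >= self.num_frames:
            victim, data = self.frames.popitem(last=False)   # evict the LRU page
            self.swap[victim] = data
        # Restore from swap, or start with a zeroed page on first touch.
        self.frames[page] = self.swap.pop(page, bytearray(self.page_size))
        return self.frames[page]
```

Data written to a page survives eviction because the victim is copied to the swap area before its frame is reused.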

RC-1000 Challenges

Hardware implementation: queue state machines

Memory locking: SRAM is single ported; arbitrate for use

CPU / NI contention: the NI manages the FPGA lock

[Figure: FPGA user circuits and SRAM guarded by a memory lock, accessed by the CPU and the NI]

Example: Autonomous Spaceborne Clusters

NASA Remote Exploration and Experimentation: spaceborne vehicle processes data locally; clusters in the sky

Number of peripheral devices: data sensors, FPGAs and DSPs

Adaptive hardware: modify functionality after deployment

Acquire FPGA SRAM: CPU-NI 20 μs; NI 8 μs

Inject 4 KB message to FPGA: CPU 58 μs (70 MB/s); NI 32 μs (128 MB/s)

Release FPGA SRAM: CPU-NI 8 μs; NI 5 μs

Performance: Card Interactions

[Figure: FPGA user circuits and SRAM with memory lock, accessed by the NI and the CPU]

Example: Digital Libraries

Enormous amount of data and users; intelligent LAN and storage cards manage requests

[Figure: clients connect through an intelligent LAN adaptor; three hosts, each with a CPU, storage adaptor, and SAN NI, hold files A-H, I-R, and S-Z on the SAN backbone]

Cyclone Systems I2O Server Adaptor Card

Networked host on a PCI card

Integration with GRIM: interacts directly with the NI; ported host-level endpoint software

Utilized as a LAN-SAN bridge

[Figure: I2O server adaptor block diagram: host system, primary and secondary PCI interfaces, i960 Rx processor, DMA engines, DRAM, ROM, local bus, and a daughter card with two 10/100 Ethernet ports and two SCSI ports]

GRIM Multicast Extensions

Distribute the same message to multiple receivers: tree-based distributions; replicate the message at the NI; messages are recycled back into the network

Extensions to the NI’s core communication operations: recycled messages travel in a separate logical channel; per-hop flow control is used for reliable delivery

[Figure: multicast tree with root A, then B and C, then D and E, alongside the corresponding path through the NIs of endpoints A to E]
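Tree-based distribution can be sketched as each NI delivering the message locally and recycling one copy per child. The tree shape below is an assumption: the figure shows levels A, then B and C, then D and E, but not which node is the parent of D and E. The `multicast` function and its return values are likewise hypothetical.

```python
from collections import deque


def multicast(tree, root, message):
    """Distribute message down a tree; return (deliveries, copies made per node).

    tree maps a node to the list of children its NI forwards to.
    """
    deliveries = {}       # receiver -> message
    copies_made = {}      # node -> number of copies its NI re-injected
    pending = deque([root])
    while pending:
        node = pending.popleft()
        children = tree.get(node, [])
        copies_made[node] = len(children)
        for child in children:
            deliveries[child] = message   # NI replicates the message for each child
            pending.append(child)
    return deliveries, copies_made


# Assumed shape: A sends to B and C; B forwards to D and E.
tree = {"A": ["B", "C"], "B": ["D", "E"]}
deliveries, copies_made = multicast(tree, "A", b"frame")
```

In this shape the root's NI sends two copies instead of four, with the remaining work moved to B's NI. That matches the observation that multicast reduces sending overhead while depending on NI copy bandwidth.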

Multicast Performance

[Plot: time (μs) versus multicast message size (bytes), logarithmic axes, with curves for multicast RTT, unicast RTT, multicast injection overhead, and unicast injection overhead; 8 hosts, LANai 4, P4-1.7 GHz hosts]

Multicast Observations

Beneficial: reduces sending overhead

Performance loss for large messages: dependent on NI memory copy bandwidth

On-card memory copy benchmark: LANai 4: 19 MB/s; LANai 9: 66 MB/s

Extension: Sockets Emulation

Berkeley sockets is a communication standard utilized in numerous distributed applications

GRIM provides sockets API emulation: functions for intercepting socket calls; AM handler functions for buffering connection data

[Figure: on the sender, write() is intercepted and generates an AM (Append Socket X) carrying the socket data; on the receiver, the Append Socket AM handler buffers the data for Socket X, and an intercepted read() extracts it]
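The write and read paths in the figure can be sketched with a receiver-side buffer table. The class and function names are hypothetical, and delivery of the active message is modeled as a direct method call rather than a trip through the NI.

```python
class SocketTable:
    """Receiver-side buffers, one per emulated socket."""

    def __init__(self):
        self.buffers = {}

    def am_append_socket(self, socket_id, data):
        """AM handler: append incoming connection data to the socket's buffer."""
        self.buffers.setdefault(socket_id, bytearray()).extend(data)

    def emulated_read(self, socket_id, max_bytes):
        """Intercepted read(): extract up to max_bytes of buffered data."""
        buffer = self.buffers.setdefault(socket_id, bytearray())
        data = bytes(buffer[:max_bytes])
        del buffer[:max_bytes]
        return data


def emulated_write(receiver, socket_id, data):
    """Intercepted write(): generate an append-socket AM for the receiver."""
    receiver.am_append_socket(socket_id, data)
    return len(data)
```

Two writes of `b"hello "` and `b"world"` followed by a 5-byte read return `b"hello"`, leaving `b" world"` buffered, which preserves the byte-stream semantics applications expect from sockets.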

Sockets Emulation Performance

[Plot: bandwidth (MBytes/s, 0 to 120) versus transfer size (bytes, 1 to 10,000,000) for GRIM sockets on LANai 4 and for 100 Mb/s Ethernet; P4-1.7 GHz hosts]

Overall Performance: Store-and-Forward

Approach: single message, no overlap; three transmission stages; expect roughly 1/3 of the bandwidth of an individual stage

[Figure: timeline of Message 1 crossing three stages in sequence: sending host-NI (PCI: 132 MB/s), NI-NI (Myrinet: 160 MB/s), receiving NI-host (PCI: 132 MB/s); plot of bandwidth (MBytes/s), P3-550 MHz hosts]
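The "roughly 1/3" expectation can be checked with a short calculation. It is a simplified model: it assumes the message crosses each stage completely before the next stage starts and ignores all per-message overhead, so the per-byte times of the stages simply add. The stage rates are the ones labeled in the figure; the function name is hypothetical.

```python
def store_and_forward_bandwidth(stage_bandwidths):
    """Effective bandwidth when stages run one after another with no overlap.

    Time per byte is the sum of the per-stage times, so the result is the
    reciprocal of the summed reciprocals.
    """
    return 1.0 / sum(1.0 / b for b in stage_bandwidths)


# PCI, Myrinet, PCI in MB/s, as labeled in the figure
effective = store_and_forward_bandwidth([132.0, 160.0, 132.0])
```

The result is about 46.7 MB/s, a little over one third of the 132 MB/s PCI stages. This is the baseline that fragmentation and cut-through aim to improve on.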


Enhancement: Message Pipelining

Allow overlap with multiple in-flight messages: GRIM uses AM and RM fragmentation/reassembly; performance depends on fragment size

[Figure: timeline of Messages 1 to 3 overlapping across the sending host-NI, NI-NI, and receiving NI-host stages; plot of bandwidth (MBytes/s), LANai 9, P3-550 MHz hosts]
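Why performance depends on fragment size can be sketched with a standard pipeline timing model. It assumes every fragment has the same size (the last one is rounded up), a fixed per-fragment overhead at each stage, and the stage rates from the store-and-forward figure. The 2 μs overhead and the function name are hypothetical.

```python
import math


def pipelined_bandwidth(message_bytes, fragment_bytes, stage_bandwidths, overhead_us):
    """Estimated bandwidth in bytes per microsecond (numerically equal to MB/s).

    The first fragment pays every stage once; each later fragment adds one
    slot of the slowest stage.
    """
    fragments = math.ceil(message_bytes / fragment_bytes)
    stage_times = [overhead_us + fragment_bytes / b for b in stage_bandwidths]
    total_us = sum(stage_times) + (fragments - 1) * max(stage_times)
    return message_bytes / total_us


STAGES = [132.0, 160.0, 132.0]   # bytes per microsecond, from the figure labels
whole = pipelined_bandwidth(65536, 65536, STAGES, overhead_us=2.0)   # no fragmentation
medium = pipelined_bandwidth(65536, 4096, STAGES, overhead_us=2.0)
tiny = pipelined_bandwidth(65536, 64, STAGES, overhead_us=2.0)
```

Under these assumptions 4 KB fragments beat both extremes: one large fragment gives no overlap between stages, while very small fragments are dominated by the per-fragment overhead.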


Enhancement: Cut-through Transfers

Forward data as soon as it begins to arrive; cut-through at both the sending and receiving NIs

[Figure: timeline of Messages 1 and 2 with cut-through across the sending host-NI, NI-NI, and receiving NI-host stages, showing the overall transmission time; plot of bandwidth (MBytes/s) versus message size (bytes), LANai 9, P3-550 MHz hosts]
