Modularity and Costs
Greg Busby
Computer Science 614
March 26, 2002
Problem 1 – Complexity
Protocols are necessary for network communication
Both ends must agree on a format to exchange messages
Communication protocols are complex
Using several protocols together is even more complex
Solution 1 – Layers
Implement each protocol independently
Allows cleaner implementation
Layer protocols
Maintains modularity
Reduces complexity – no need to understand interactions between protocols
Problem 2 – Delays
Messages get larger as additional headers are added at each layer
Processing overhead for switching between layers
Need to wait for one protocol to finish before starting the next
I/O overhead from multiple writes to memory as buffers are stored between layers
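The header growth above can be sketched in a few lines. This is a hypothetical illustration, not code from any of the papers: the layer names and header sizes are made up, and real stacks would also copy between buffers at each step.

```python
# Hypothetical sketch: each protocol layer prepends its own header on send.
# Layer names and header sizes are illustrative only.

LAYERS = [("RPC", 8), ("TCP", 20), ("IP", 20), ("ETH", 14)]

def layered_send(payload: bytes) -> bytes:
    """Encapsulate payload top-down; each layer runs only after the one above."""
    msg = payload
    for name, hdr_len in LAYERS:
        header = name.encode().ljust(hdr_len, b"\x00")  # stand-in header bytes
        msg = header + msg  # a real stack would also copy between buffers here
    return msg

packet = layered_send(b"hello")
# The message grows by the sum of all header sizes: 8 + 20 + 20 + 14 = 62 bytes.
print(len(packet) - len(b"hello"))
```

The loop also makes the sequentiality problem visible: each layer's work starts only after the layer above has finished.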
Solution 2 – Improve Performance
Will discuss several approaches, including pros and cons of each:
x-Kernel: Puts the entire communication system directly in the kernel, with specific objects and support routines
Integrated Layer Processing (ILP): Integrates protocol layers to reduce task switching and memory writes
Protocol Accelerator (PA): Reduces total data to send and shortens the critical path of code between messages
x-Kernel
Defines a uniform set of abstractions for protocols
Structures protocols for efficient interaction in the common case
Supports primitive routines for common protocol tasks
x-Kernel Architecture
Provides objects for protocols, sessions, and messages
Creates a kernel for a specific set of protocols (static)
Instantiates sessions for each protocol as needed (dynamic)
Messages are active objects that move through protocol/sessions
Provides specific support routines
[Figure: example protocol graph – TCP and UDP over IP over ETH]
x-Kernel Objects
Protocols – create sessions, demux messages received
Sessions – represent connections, created and destroyed when connections are made/terminated
Messages – contain the data itself, passed from level to level
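The object model above can be sketched as follows. This is a hypothetical illustration of the protocol/session/message split, not the actual x-Kernel API; the class and method names are invented.

```python
# Hypothetical sketch of the x-Kernel object model: protocol objects are
# static, session objects are created per connection, and incoming messages
# are demultiplexed to the right session by a key.

class Session:
    """Represents one open connection; destroyed when the connection ends."""
    def __init__(self, key):
        self.key = key
        self.received = []

    def deliver(self, msg):
        self.received.append(msg)

class Protocol:
    """Static object: creates sessions and demuxes incoming messages."""
    def __init__(self, name):
        self.name = name
        self.sessions = {}

    def open(self, key) -> Session:
        # Dynamic part: instantiate a session for this connection as needed.
        return self.sessions.setdefault(key, Session(key))

    def demux(self, key, msg):
        # Hand the message to the session for this connection, if any.
        sess = self.sessions.get(key)
        if sess is not None:
            sess.deliver(msg)

tcp = Protocol("TCP")
s = tcp.open(("10.0.0.1", 80))
tcp.demux(("10.0.0.1", 80), b"data")
print(s.received)  # [b'data']
```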
x-Kernel Primitives
Buffer managers – allocate, concatenate, split, and truncate; operate in the local process heap
Map managers – add, remove, and map bindings for protocols
Event managers – provide timers to allow timeouts
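A map manager can be sketched as a small binding table. This is an illustrative stand-in, not the real x-Kernel interface; the names are invented.

```python
# Hypothetical sketch of an x-Kernel map manager: binds external identifiers
# (e.g. a port number taken from a header) to internal objects such as
# sessions, so demultiplexing is a single lookup.

class MapManager:
    def __init__(self):
        self._bindings = {}

    def bind(self, external_key, obj):
        self._bindings[external_key] = obj

    def unbind(self, external_key):
        self._bindings.pop(external_key, None)

    def resolve(self, external_key):
        return self._bindings.get(external_key)

m = MapManager()
m.bind(("udp", 53), "dns-session")
print(m.resolve(("udp", 53)))  # dns-session
m.unbind(("udp", 53))
print(m.resolve(("udp", 53)))  # None
```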
x-Kernel Performance
2-3x faster than Unix overall
Unix cost is primarily due to sockets
Protocol performance is comparable
Conclusion: the architecture is the difference
x-Kernel Conclusions
Pros:
Architecture simplifies the implementation of protocols
Uniformity of interface between protocols makes protocol performance predictable and reduces overhead between protocols
Possible to write efficient protocols by tuning the underlying architecture
Don't need to know the exact protocol stack
Cons:
Requires recompiling the kernel for each new set of protocols
Doesn't reduce message size (headers) or the sequentiality of processing
Primarily useful as a research tool for protocol implementation, not to improve performance per se
Integrated Layer Processing (ILP)
Reduces protocol layers by integrating processing
Tunes performance to increase caching and avoid memory I/O
Eliminates redundant copies (similar to U-Net's shared memory)
ILP Architecture
Combine protocol-specific manipulations in a single loop where possible
Process small pieces to make use of the processor's on-board cache
Put as much processing as possible in-line (macros) versus function calls
ILP Loop
Combine marshalling (encoding), encryption, and checksumming
Work in memory, reduce copying
Reduces steps from 5 to 2 (increased processing at step 1)
[Figure: send paths compared]
Non-ILP send: 1. marshalling (r/w), 2. encryption (r/w), 3. copying (r/w), 4. checksum (r), 5. system copy (r/w) – application data moves through the TCP buffer and kernel buffer
ILP send: 1. marshalling, encryption, and checksumming (r/w), 2. system copy (r/w)
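The fused loop can be sketched as below. This is a hypothetical illustration: the XOR "cipher" is a toy stand-in for real encryption, and the point is only that marshalling, encryption, and checksumming touch each byte in one pass instead of three.

```python
# Hypothetical sketch of an ILP-style send path: marshalling, a toy XOR
# "encryption", and a checksum are fused into ONE pass over the data, so
# each byte is read and written once instead of once per layer.

KEY = 0x5A  # illustrative single-byte XOR key, not a real cipher

def ilp_send(words):
    """Marshal, encrypt, and checksum in a single integrated loop."""
    out = bytearray()
    checksum = 0
    for w in words:
        for b in w.encode():                    # marshalling: serialize to bytes
            enc = b ^ KEY                       # "encryption" on the same pass
            checksum = (checksum + enc) & 0xFF  # checksum on the same pass
            out.append(enc)
    return bytes(out), checksum

def separate_send(words):
    """Same result computed as three separate passes (the layered way)."""
    data = b"".join(w.encode() for w in words)  # pass 1: marshal
    enc = bytes(b ^ KEY for b in data)          # pass 2: encrypt
    checksum = sum(enc) & 0xFF                  # pass 3: checksum
    return enc, checksum

assert ilp_send(["ab", "c"]) == separate_send(["ab", "c"])
print("integrated and layered paths agree")
```

The integrated version does more work per iteration but walks the data once, which is what keeps it inside the on-board cache.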
ILP Processing (send)
Divide message into small parts
Begin marshalling and encryption on part B, then C…
Process part A once length is known
Finish protocol-specific processing
Doesn't work if A must be processed first (ordering-constrained)
[Figure: message split into Part A (TCP header; RPC header with length, alignment), Part B, and Part C (data); marshalling and encryption run per part, then checksum]
ILP Performance
Processing reduction of 20-25%
Throughput improvement of 10-15%
Actually reduces cache usage, although designed to optimize it
Performance gains can easily be masked by using strong encryption, which drastically increases processing
Conclusion: performance results were such that use is "debatable in existing communication systems…"
ILP Conclusions
Pros:
Decreased memory access by up to 30%
Slightly improved performance
Cons:
Only applicable to non-ordering-constrained functions
Requires macros to increase speed, reducing flexibility
Protocol stack must be known beforehand
The Protocol Accelerator (PA)
Reduces header overhead by sending non-changing protocol headers only once
Further reduces total bytes by packing other header information across protocols
Reduces layered protocol processing overhead by splitting processing of header and data (canonical processing)
PA Header Reduction
Four classes of header information:
Connection Identification – doesn't change during a session
Protocol-specific Information – depends only on protocol state, not on the message
Message-specific Information – depends on the contents of the message but not on protocol state
Gossip – optional, but included because its overhead is small
Connection Cookies – an 8-byte field that replaces the Connection Identification information
PA Message Format
Connection cookie suffices for the Connection ID on 2nd and later messages
Packing information explained below
Gossip is optional but useful
[Figure: PA message format]
Connection cookie (62-bit number, plus a Connection-ID-present bit and a byte-order bit (big- or little-endian))
Connection Identification (first message only)
Protocol-specific Information
Message-specific Information
Gossip (optional)
Packing Information (if packed)
Application Data
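The header-reduction idea can be sketched as follows. This is a hypothetical illustration, not the PA's actual wire format: the connection-id string and cookie value are invented, and the real format also carries the flag bits and other header classes shown above.

```python
# Hypothetical sketch of PA-style header reduction: the full Connection
# Identification goes out only on the first message; later messages carry
# just a small connection cookie that both ends map back to the connection.
import struct

class PASender:
    def __init__(self, conn_id: bytes, cookie: int):
        self.conn_id = conn_id  # full identification (addresses, ports, ...)
        self.cookie = cookie    # small number agreed for this connection
        self.first = True

    def frame(self, payload: bytes) -> bytes:
        header = struct.pack(">Q", self.cookie)  # 8-byte cookie field
        if self.first:                           # full id on first message only
            header += self.conn_id
            self.first = False
        return header + payload

s = PASender(conn_id=b"10.0.0.1:5000->10.0.0.2:80", cookie=42)
first = s.frame(b"req1")
later = s.frame(b"req2")
# Every message after the first is shorter by the full connection-id length.
print(len(first) - len(later) == len(s.conn_id))  # True
```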
PA Processing Reduction
Canonical Protocol Processing – breaks processing in a protocol layer into 2 parts:
Pre-processing Phase – build or check the message header without changing protocol state
Post-processing Phase – update protocol state; attempt to do this after the message is sent or delivered
Pre-processing at every layer is done before post-processing at any layer
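The two-phase split can be sketched as below. This is a hypothetical illustration with invented layer names; the only point it demonstrates is that header construction reads state without changing it, so all state updates can be deferred until after the message is handed off.

```python
# Hypothetical sketch of canonical protocol processing: each layer is split
# into a pre-processing step (build the header, no state change) and a
# post-processing step (update state). Pre-processing at every layer runs
# before post-processing at any layer, so the message can be on the wire
# while state is still being updated.

class Layer:
    def __init__(self, name):
        self.name = name
        self.sent = 0  # protocol state, touched only in post-processing

    def pre(self, msg: bytes) -> bytes:
        # Build the header from current state; do NOT change state here.
        return f"[{self.name}#{self.sent}]".encode() + msg

    def post(self):
        self.sent += 1  # deferred state update

def send(stack, payload: bytes) -> bytes:
    msg = payload
    for layer in stack:   # pre-processing at EVERY layer first...
        msg = layer.pre(msg)
    # (the message could be handed to the network at this point)
    for layer in stack:   # ...then post-processing at any layer
        layer.post()
    return msg

stack = [Layer("RPC"), Layer("TCP")]
print(send(stack, b"hi"))  # b'[TCP#0][RPC#0]hi'
print(send(stack, b"hi"))  # b'[TCP#1][RPC#1]hi'
```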
PA Processing Reduction (cont.)
Header Prediction – use the post-processing phase to predict the formation of the next header
Packet Filters – a pre-pre-processor that checks or ensures header correctness without invoking the protocol where possible; invokes the protocol if necessary
Message Packing – pack backlogged messages together if the application gets ahead; reduces space and processing since checksums etc. are calculated only once
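Message packing can be sketched as below. This is a hypothetical illustration, not the PA's actual packing format: the length-prefixed layout and the simple additive checksum are invented stand-ins, chosen to show that one header and one checksum cover the whole backlog.

```python
# Hypothetical sketch of PA message packing: when the application gets ahead
# of the network, backlogged messages go out as one packet with a single
# checksum, each message prefixed by its length so the receiver can unpack.
import struct

def pack(messages):
    """Pack backlogged messages into one body: [len][bytes][len][bytes]..."""
    body = b"".join(struct.pack(">H", len(m)) + m for m in messages)
    checksum = sum(body) & 0xFFFF  # computed once for the whole pack
    return struct.pack(">H", checksum) + body

def unpack(packet):
    (checksum,), body = struct.unpack(">H", packet[:2]), packet[2:]
    assert checksum == sum(body) & 0xFFFF, "corrupt pack"
    msgs, i = [], 0
    while i < len(body):
        (n,) = struct.unpack(">H", body[i:i + 2])
        msgs.append(body[i + 2:i + 2 + n])
        i += 2 + n
    return msgs

backlog = [b"msg1", b"second message", b"x"]
assert unpack(pack(backlog)) == backlog
print("round trip ok")
```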
PA Processing (send)
Check backlog; queue and exit if any
Create packing and predicted header, add to message data
Run packet filter to create message-specific data (and gossip, if any)
Push to protocol if necessary
Push connection cookie onto front of message and send
Pass to protocol stack for post-processing to update protocol state
[Figure: PA architecture – Application, Packer/Unpacker, PA (PreSend, PreDeliver), Protocol Stack, Network]
PA Performance
Can gain an order-of-magnitude improvement over pure layered protocols
Maximal throughput achieved by reducing garbage collection and doing post-processing while messages are "on the wire"
Conclusion: useful in improving performance as long as the PA is used on both ends of the connection
PA Conclusions
Pros:
Eliminates much of the overhead of layered protocols
Significant speed improvement
Canonical processing applicable in any case
Cons:
Can't communicate with a non-PA peer
A specific PA is needed for each set of protocols
No fragmentation of messages, so only works on small messages
Summary
Protocols are layered to improve modularity and reduce complexity
This reduces performance
Improving performance reduces modularity
Requires foreknowledge of the protocol stack
Approaches:
Increase use of the kernel (x-Kernel)
Integrate processing of all layers together (ILP)
Reduce message size and speed up the critical path (PA)
All improve performance, but only PA results in significant improvement.