Optimizing RPC


  • Optimizing RPC
    "Lightweight Remote Procedure Call" (1990)
    Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, Henry M. Levy (University of Washington)
    "U-Net: A User-Level Network Interface for Parallel and Distributed Computing" (1995)
    Thorsten von Eicken, Anindya Basu, Vineet Buch, Werner Vogels (Cornell University)

    Dan Sandler COMP 520 September 9, 2004

  • Review: Scalability
    Scalable systems distribute work along an axis that can scale without bound
    e.g., number of CPUs, machines, networks
    Distributed work requires coordination
    Coordination requires communication
    Communication is slow

  • Review: RPC
    Remote procedure call extends the classic procedure call model: execution happens elsewhere
    Goal: API transparency; communication details are hidden
    Remember, RPC is just one part of a distributed system
    It solves only one problem: communication
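The transparency goal above can be sketched in Python: the caller invokes what looks like an ordinary function, while client and server stubs handle marshalling and transport. The stub names and the JSON wire format are illustrative, not from either paper.

```python
import json

def server_dispatch(request_bytes):
    """Server-side stub: unmarshal the request, call the real
    procedure, and marshal the reply."""
    req = json.loads(request_bytes)
    procedures = {"add": lambda a, b: a + b}   # hypothetical service
    result = procedures[req["proc"]](*req["args"])
    return json.dumps({"result": result}).encode()

def rpc_call(proc, *args):
    """Client-side stub: the caller sees a plain procedure call;
    marshalling and transport are hidden in here."""
    request = json.dumps({"proc": proc, "args": args}).encode()
    reply = server_dispatch(request)  # stands in for the network round trip
    return json.loads(reply)["result"]

print(rpc_call("add", 2, 3))  # -> 5
```

The direct `server_dispatch` call is where a real RPC runtime would put the network; the caller's code would not change.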

  • Performance: A war on two fronts
    Conventional RPC
      Procedure calls between hosts
      Network communication (protocols, etc.) hidden from the programmer
      Performance obstacle: the network
    Local RPC
      Processes cannot communicate directly (security, stability)
      The RPC abstraction is useful here too
      Performance obstacle: protection domains

  • Overview
    Two papers, addressing these two RPC usage models
    What is the common case? Where is performance lost? How can we optimize RPC?
    Build the system; evaluate the improvements

  • The Remote Case
    U-Net. von Eicken, et al., 1995.

    Historically, the network is the bottleneck

    Networks are getting faster all the time. Is RPC seeing this benefit?

  • Message latency
    End-to-end latency = (network latency) + (processing overhead)
    Network latency
      Transmission delay; increases with message size
      Faster networks address this directly
    Processing overhead
      At the endpoints, in hardware & software
      Faster networks don't help here

  • Latency observations
    Network latency
      Impact per message is O(message size)
      Dominant factor for large messages
    Processing overhead
      Impact is O(1) per message
      Dominant for small messages
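A toy latency model makes these observations concrete. The link speed and overhead figures below are assumed for illustration, not taken from the paper.

```python
def end_to_end_latency_us(msg_bytes, bandwidth_mbps, overhead_us):
    """End-to-end latency = transmission delay + fixed per-message
    processing overhead (bits / (Mbit/s) comes out in microseconds)."""
    transmission_us = msg_bytes * 8 / bandwidth_mbps
    return transmission_us + overhead_us

# Assumed figures: a 155 Mbit/s ATM link, 100 us of host processing overhead.
for size in (64, 65536):
    t = end_to_end_latency_us(size, 155, 100)
    print(f"{size:6d} B: {t:8.1f} us total, overhead share = {100 / t:.0%}")
```

With these numbers, overhead is roughly 97% of the latency of a 64-byte message but only a few percent for a 64 KB one, which is the O(1) vs. O(message size) split the slide describes.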

  • Impact on RPC
    Insight: applications tend to use small RPC messages
    Examples:
      OOP (messages between distributed objects)
      Database queries (requests vs. results)
      Caching (consistency/synchronization)
      Network filesystems

  • Poor network utilization
    Per-message overhead at each host
    + Most RPC clients use small messages
    = Lots of messages
    = Lots of host-based overhead
    = Latency & poor bandwidth utilization

  • Review: the microkernel OS
    Benefits:
      Protected memory provides security, stability
      Modular design enables flexible development
      Kernel programming is hard, so keep the kernel small
    [Diagram: application, small kernel, OS services]

  • Review: the microkernel OS
    Drawback:
      Most OS services are now implemented in other processes
      What was a simple kernel trap is now a full IPC situation
      Result: overhead
    [Diagram: application, small kernel, OS services]

  • Overhead hunting
    Lifecycle of a message send:
      User-space application makes a kernel call
        Context switch to the kernel; copy arguments to kernel memory
      Kernel dispatches to the I/O service
        Context switch to that process; copy arguments to I/O process space
      I/O service calls the network interface
        Copy arguments to NI hardware
    The return path is similar, and this all happens on the remote host too
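Tallying the costs in the traditional send path is instructive. The step names below paraphrase the slide; the return path roughly doubles the totals.

```python
# Costs along the traditional one-way send path, as described above.
TRADITIONAL_SEND_PATH = [
    ("kernel call from user space",     {"context_switch": 1}),
    ("copy arguments to kernel memory", {"copy": 1}),
    ("dispatch to I/O service process", {"context_switch": 1}),
    ("copy arguments to I/O service",   {"copy": 1}),
    ("copy arguments to NI hardware",   {"copy": 1}),
]

copies = sum(cost.get("copy", 0) for _, cost in TRADITIONAL_SEND_PATH)
switches = sum(cost.get("context_switch", 0) for _, cost in TRADITIONAL_SEND_PATH)
print(f"{copies} copies, {switches} context switches per one-way send")
# -> 3 copies, 2 context switches per one-way send
```

Counting both directions on both hosts, a single small RPC pays this toll four times over, which is exactly the fixed overhead that dominates small messages.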

  • U-Net design goals
    Eliminate data copies & context switches wherever possible
    Preserve the microkernel architecture for ease of protocol implementation
    No special-purpose hardware

  • U-Net architecture
    [Diagram: in traditional RPC, apps communicate through the microkernel
    and an I/O service to reach the network interface; in U-Net RPC, the
    kernel is involved only in connection setup, and communication goes
    directly between apps and the network interface]

  • U-Net architecture summary
    Implement RPC as a library in user space
    Connect the library to the network interface (NI) via shared memory
    regions instead of kernel calls
    App & NI poll & write this memory to communicate → fewer copies
    NI responsible for routing messages to/from applications
    Kernel involved only in connection setup → fewer context switches

  • U-Net implementations
    Simple ATM hardware: Fore SBA-100
      Message routing must still be done in the kernel (simulated U-Net)
      Proof of concept & experimentation
    Programmable ATM: Fore SBA-200
      Message multiplexing performed on the board itself
      Kernel uninvolved in most operations
      Maximum benefit to the U-Net design

  • U-Net as a protocol platform
    TCP and UDP implemented on U-Net
    Modular: no kernel changes necessary
    Fast: huge latency win over the vendor's TCP/UDP implementation
    Extra fast: bandwidth utilization also improved over Fore TCP/UDP

  • U-Net: TCP, UDP results
    Round-trip latency (µsec) vs. packet size (bytes) on ATM

    U-Net latency is roughly 1/5 of the Fore implementation's

  • U-Net: TCP, UDP results
    Bandwidth (Mbits/sec) vs. packet size (bytes) on ATM

    Fore maxes out at 10 Mbyte/sec; U-Net achieves nearly 15

  • Active Messages on U-Net
    Active Messages (AM): a standard network protocol and API designed for
    parallel computation
    Split-C: a parallel programming language built on AM
    By implementing AM on U-Net, we can compare performance with parallel
    computers running the same Split-C programs

  • Active Messages on U-Net
    Contenders:
      U-Net cluster, 60 MHz SuperSparc
      Meiko CS-2, 40 MHz SuperSparc
      CM-5, 33 MHz Sparc-2
    Results: the U-Net cluster is roughly competitive with the
    supercomputers on a variety of Split-C benchmarks
    Conclusion: U-Net is a viable platform for parallel computing on
    general-purpose hardware

  • U-Net design goals: recap
    Eliminate context switches & copies
      Kernel removed from the fast paths
      Most communication goes straight from app to network interface
    Preserve modular system architecture
      Plug-in protocols do not involve kernel code
    (Almost) no special-purpose hardware
      Programmable controllers with fancy DMA features are needed to get
      the most out of U-Net
      At least you don't need custom chips & boards (cf. parallel computers)

  • Local RPC
    Model: inter-process communication as simple as a function call
    [Diagram: user processes and OS services calling one another through
    the kernel]

  • A closer look
    Reality: the RPC mechanism is heavyweight
      Stub code oblivious to the local case
      Unnecessary context switching
      Argument/return data copying
      Kernel bottlenecks
    [Diagram: user processes and OS services]

  • Slow RPC discourages IPC
    System designers will find ways to avoid using slow RPC even if it
    conflicts with the overall design...
    [Diagram: an OS service folded into the kernel, yielding a larger,
    more complex kernel]

  • Slow RPC discourages IPC
    ...or defeats it entirely.
    [Diagram: two user processes]

  • Local RPC trouble spots
    Suboptimal parts of the code path:
      Copying argument data
      Context switches & rescheduling
      Copying return data
      Concurrency bottlenecks in the kernel
    For even the smallest remote calls, network speed dominates these factors
    For local calls... we can do better

  • LRPC: Lightweight RPC
    Bershad, et al., 1990. Implemented within the Taos OS
    Target: multiprocessor systems
    A wide array of low-level optimizations applied to all aspects of
    procedure calling

  • Guiding optimization principle
    Optimize the common case:
      Most procedure calls do not cross machine boundaries (20:1)
      Most procedure calls involve small parameters and small return
      values (32 bits each)

  • LRPC: Key optimizations
    (a) Threads transfer between processes during a call, avoiding full
        context switches and rescheduling
        Compare: client thread blocks while a server thread switches in
        and performs the task
    (b) Simplified data transfer: a shared argument stack, with
        optimizations for small arguments that can be byte-copied
    (c) Simpler call stubs for simple arguments, thanks to (b); many
        decisions made at compile time
    (d) Kernel bottlenecks reduced: fewer shared data structures
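The shared argument stack can be sketched as follows, assuming a pre-allocated shared region and a hypothetical Add procedure with two 32-bit arguments. This is illustrative Python, not Taos code; in LRPC the "call" is the client's own thread entering the server domain.

```python
import struct

# Shared argument stack ("A-stack"), allocated pairwise at bind time.
A_STACK = bytearray(64)

def client_stub_add(a, b):
    """Client stub: byte-copy the two 32-bit arguments directly into
    the shared stack -- the only copy they will ever get."""
    struct.pack_into("<ii", A_STACK, 0, a, b)
    # In LRPC, control now transfers into the server domain on the
    # same thread; here an ordinary call stands in for that transfer.
    return server_add()

def server_add():
    """Server stub: read the arguments in place; no second copy."""
    a, b = struct.unpack_from("<ii", A_STACK, 0)
    return a + b

print(client_stub_add(2, 3))  # -> 5
```

Because both stubs agree on the layout at bind time, a simple fixed-size pack/unpack replaces general-purpose marshalling, which is what makes the compile-time stub simplification in (c) possible.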

  • LRPC: Even more optimizations
    Shared argument memory allocated pairwise at bind time
      Saves some security checks at call time, too
    Arguments copied only once, from the optimized stub into the shared stack
    Complex RPC parameters can be tagged as "pass-through" and optimized
    as simple ones
      e.g., a pointer eventually handed off to another user process
    Domains are cached on idle CPUs
      A thread migrating to such a domain can jump to that CPU (where the
      domain is already available) to avoid a full context switch

  • LRPC performance vs. Taos RPC
    Dispatch time roughly 1/3 of Taos (times in microseconds):
      Null()                              LRPC: 157   Taos: 464
      Add(byte[4], byte[4]) -> byte[4]    LRPC: 164   Taos: 480
      BigIn(byte[200])                    LRPC: 192   Taos: 539
      BigInOut(byte[200]) -> byte[200]    LRPC: 227   Taos: 636
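From the figures above, the per-call speedups work out as follows:

```python
# LRPC vs. Taos RPC dispatch times in microseconds, from the slide.
benchmarks = {
    "Null":     (157, 464),
    "Add":      (164, 480),
    "BigIn":    (192, 539),
    "BigInOut": (227, 636),
}

for name, (lrpc_us, taos_us) in benchmarks.items():
    print(f"{name:8s}: {taos_us / lrpc_us:.1f}x faster")
```

The ratio shrinks slightly as argument size grows (about 3.0x for Null down to about 2.8x for BigInOut), since the single unavoidable argument copy takes a larger share of each call.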

  • LRPC performance vs. Taos RPC
    Multiprocessor performance: substantial improvement

    [Chart: measured calls/sec vs. number of CPUs (1-4); LRPC scales well
    above Taos RPC]

  • Common Themes
    Distributed systems need RPC to coordinate distributed work
    Small messages dominate RPC traffic
    Sources of latency for small messages:
      Cross-machine RPC: overhead in network interface communication
      Cross-domain RPC: overhead in context switching and argument copying
    Solution: remove the kernel from the fast path