sdn

2
Automatic Failure Recovery for Software-Defined Networks Maciej Ku´ zniar †* Peter Perešíni †* Nedeljko Vasi´ c Marco Canini Dejan Kosti´ c EPFL TU Berlin / T-Labs Institute IMDEA Networks <name.surname>@epfl.ch [email protected] [email protected] * These authors contributed equally to this work ABSTRACT Tolerating and recovering from link and switch failures are fundamental requirements of most networks, includ- ing Software-Defined Networks (SDNs). However, instead of traditional behaviors such as network-wide routing re- convergence, failure recovery in an SDN is determined by the specific software logic running at the controller. While this admits more freedom to respond to a failure event, it ultimately means that each controller application must in- clude its own recovery logic, which makes the code more difficult to write and potentially more error-prone. In this paper, we propose a runtime system that auto- mates failure recovery and enables network developers to write simpler, failure-agnostic code. To this end, upon de- tecting a failure, our approach first spawns a new controller instance that runs in an emulated environment consisting of the network topology excluding the failed elements. Then, it quickly replays inputs observed by the controller before the failure occurred, leading the emulated network into the for- warding state that accounts for the failed elements. Finally, it recovers the network by installing the difference ruleset between emulated and current forwarding states. Categories and Subject Descriptors C.2.4 [Distributed Systems]: Network operating systems; C.4 [Performance of Systems]: Reliability, availability, and serviceability Keywords Software Defined Network, Fault tolerance 1. INTRODUCTION To ensure uninterrupted service, Software-Defined Net- works (SDNs) must continue to forward traffic in the face of link and switch failures. Therefore, as with traditional net- works, SDNs necessitate failure recovery —adapting to fail- ures by directing traffic over functioning alternate paths. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third- party components of this work must be honored. For all other uses, contact the owner/author(s). HotSDN’13, August 16, 2013, Hong Kong, China. ACM 978-1-4503-2178-5/13/08. Conceptually, this requires computing the new forwarding state in response to a failure event, similarly to how a link- state routing protocol re-converges after detecting a failure. In an SDN however, this computation takes place at the SDN controller, based on its current view of the network. To achieve higher availability, this computation could be done in advance to determine backup paths for a range of failure scenarios. With the appropriate hardware support 1 , the backup paths can be preinstalled to enable switches to quickly adapt to failures based just on local state. Regardless of whether the recovered forwarding state is computed in response to a failure or precomputed in ad- vance, failure recovery requires the presence of extra soft- ware logic at the controller. In particular, in today’s modu- lar controller platforms such as POX, Floodlight, etc., each individual module (e.g., access control, routing) potentially needs to include its own failure recovery logic [6]. This leads to controller modules that are more difficult to develop, and potentially increases the chance of bugs [1]. Avoiding this problem by relying on simple recovery logic like timeouts or deleting rules forwarding to an inactive port is inefficient [6] and may introduce forwarding loops in the network [4]. In this paper, we propose AFRO (Automatic Failure Re- covery for OpenFlow), a system that provides automatic fail- ure recovery on behalf of simpler, failure-agnostic controller modules. Inspired by the successful implementation of such a model in other domains (e.g., in the large-scale comput- ing framework MapReduce [2]), we argue that application- specific functionality should be separated from the failure recovery mechanisms. Instead, a runtime system should be responsible for transparently recovering the network from failures. Doing so is critical for dramatically reducing the chance of introducing bugs and insidious corner cases, and increasing the overall chance of SDN’s success. AFRO extends the functionality of a basic controller pro- gram that is only capable of correctly operating over a net- work without failures and allows it to react to changes in the network topology. The intuition behind our approach is based on a straightforward recovery mechanism, or rather a solution that sidesteps the recovery problem altogether: After any topology change, we could wipe clean the entire network forwarding state and restart the controller. Then, the forwarding state gets recovered as the controller starts to install forwarding rules according to its control logic, ini- tial configuration and the external events that influence its execution. 1 In OpenFlow, the FastFailover group type is available since version 1.1 of the specification. 159

Upload: maheshwar-bhat

Post on 16-Aug-2015

217 views

Category:

Documents


2 download

DESCRIPTION

SDN

TRANSCRIPT

Automatic Failure Recovery for Software-Dened NetworksMaciej Ku zniarPeter PereniNedeljko Vasi cMarco CaniniDejan Kosti cEPFLTU Berlin / T-LabsInstitute IMDEA [email protected]@[email protected] authors contributed equally to this workABSTRACTTolerating and recovering fromlink and switch failuresare fundamental requirements of most networks, includ-ingSoftware-DenedNetworks (SDNs). However, insteadof traditional behaviors suchas network-wide routing re-convergence, failurerecoveryinanSDNis determinedbythespecicsoftwarelogicrunningatthecontroller. Whilethisadmitsmorefreedomtorespondtoafailureevent, itultimatelymeansthateachcontrollerapplicationmustin-clude its ownrecoverylogic, whichmakes thecode morediculttowriteandpotentiallymoreerror-prone.Inthis paper, we propose aruntimesystemthat auto-mates failure recoveryandenables networkdevelopers towritesimpler, failure-agnosticcode. Tothisend, uponde-tectingafailure,ourapproachrstspawnsanewcontrollerinstance that runs in an emulated environment consisting ofthe network topology excluding the failed elements. Then, itquicklyreplaysinputsobservedbythecontrollerbeforethefailureoccurred,leadingtheemulatednetworkintothefor-wardingstatethat accountsforthe failedelements. Finally,it recoversthenetworkbyinstallingthedierencerulesetbetweenemulatedandcurrentforwardingstates.Categories and Subject DescriptorsC.2.4 [DistributedSystems]: Network operating systems;C.4[Performance of Systems]: Reliability, availability,andserviceabilityKeywordsSoftwareDenedNetwork,Faulttolerance1. INTRODUCTIONTo ensure uninterruptedservice, Software-DenedNet-works (SDNs) must continue to forwardtrac in the face oflinkandswitchfailures. Therefore, aswithtraditionalnet-works, SDNsnecessitatefailurerecoveryadaptingtofail-ures bydirecting trac over functioning alternate paths.Permissiontomakedigital orhardcopiesofpart orall ofthisworkforpersonal orclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmade ordistributedforprot orcommercialadvantageand thatcopiesbearthisnoticeand the full citationon the rst page. Copyrightsfor third-party components of this work must be honored. For all other uses, contactthe owner/author(s).HotSDN13, August 16, 2013, Hong Kong, China.ACM 978-1-4503-2178-5/13/08.Conceptually, thisrequirescomputingthenewforwardingstateinresponsetoafailureevent,similarlytohowalink-stateroutingprotocolre-convergesafterdetectingafailure.InanSDNhowever, this computationtakes place at theSDNcontroller,basedonitscurrentviewofthenetwork.Toachievehigheravailability, thiscomputationcouldbedoneinadvancetodeterminebackuppathsforarangeoffailurescenarios. Withtheappropriatehardwaresupport1,thebackuppathscanbepreinstalledtoenableswitchestoquicklyadapttofailuresbasedjustonlocalstate.Regardless of whether therecoveredforwardingstateiscomputedinresponse toafailure or precomputedinad-vance, failurerecoveryrequiresthepresenceof extrasoft-warelogicatthecontroller. Inparticular,intodaysmodu-larcontrollerplatformssuchasPOX,Floodlight, etc.,eachindividualmodule(e.g.,accesscontrol,routing)potentiallyneeds to include its own failure recovery logic [6]. This leadsto controller modules that are more dicult to develop, andpotentiallyincreasesthechanceof bugs[1]. Avoidingthisproblemby relyingonsimplerecoverylogicliketimeouts ordeleting rules forwardingto aninactive port is inecient [6]andmayintroduceforwardingloopsinthenetwork[4].Inthispaper,weproposeAFRO(AutomaticFailureRe-covery for OpenFlow), a system that provides automatic fail-urerecoveryonbehalfofsimpler,failure-agnosticcontrollermodules. Inspiredbythesuccessfulimplementationofsuchamodel inotherdomains(e.g., inthelarge-scalecomput-ingframeworkMapReduce[2]), wearguethatapplication-specicfunctionalityshouldbeseparatedfromthefailurerecoverymechanisms. Instead,aruntimesystemshouldberesponsible for transparentlyrecoveringthenetworkfromfailures. Doingsoiscritical for dramaticallyreducingthechanceof introducingbugsandinsidiouscornercases, andincreasingtheoverallchanceofSDNssuccess.AFRO extends thefunctionalityofabasiccontrollerpro-gramthatisonlycapableofcorrectlyoperatingoveranet-workwithout failures andallows it toreact tochangesinthe networktopology. Theintuitionbehindourapproachisbasedonastraightforwardrecoverymechanism, or ratherasolutionthat sidesteps therecoveryproblemaltogether:Afteranytopologychange, wecouldwipecleantheentirenetworkforwardingstateandrestartthecontroller. Then,theforwardingstategetsrecoveredasthecontrollerstartstoinstallforwardingrulesaccordingtoitscontrollogic,ini-tial congurationandtheexternaleventsthatinuenceitsexecution.1In OpenFlow, the FastFailover group type is available sinceversion1.1ofthespecication.159However, becauseall rulesinthenetworkneedtobere-installed,such a simplemethod isinecient andhas signi-cant disadvantages. First, a large number of packets may bedroppedorredirectedinsidethePacketInmessagestothecontroller,whichcouldbeoverwhelmed. Second,therecov-eryspeedwill beadverselyaectedbyseveral factors, in-cluding the timeto resetthe networkandcontroller,aswellas theoverheadof computing, transmittingandinstallingthenewrules.Toovercometheseissues,wetakeadvantageofthe soft-ware partinSDN.Sinceallthecontroldecisionsaremadein software running at the controller, AFRO repeats (or pre-dicts)thesedecisionsbysimplyrunningthesamesoftwarelogicinanemulatedenvironmentof theunderlyingphysi-cal network[3, 7]. Duringnormal execution, AFROsimplylogsatraceoftheeventsobservedatthecontrollerthatin-uenceitsexecution(forexample, PacketInmessages). Incase of a network failure, AFRO replays at accelerated speedtheeventspriortothefailurewithinanisolatedinstanceofthe controller with its environmentexcept it pretends thatthenetworktopologyneverchangedandallfailedelementsneverexistedsincethebeginning. Finally, AFROrecoversthe actual network forwarding state by transitioning it to therecoveredforwardingstateoftheemulatedenvironment.2. AUTOMATING FAILURE RECOVERYAFRO worksintwooperationmodes: recordmode,whenthe networkisnot experiencing failures,andrecoverymode,whichactivatesafterafailureisdetected. Thenetworkandcontroller start inacombinedcleanforwardingstatethatlaterisgraduallymodiedbythecontrollerinresponsetoPacketInevents. Inrecordmode, AFROrecordsall Pack-etInarrivingatthecontrollerandkeepstrackofcurrentlyinstalledrules bylogging FlowModandFlowRemmessages.When a network failure is detected, AFRO enters the recov-erymode. Recoveryinvolvestwomainphases: replayandreconguration.Duringreplay,AFROcreatesacleancopyoftheoriginalcontroller, whichwecall ShadowController. ShadowCon-troller hasthesamefunctionalityas theoriginal one, butit starts withacleanstateandfromthebeginningworksonaShadow Networkanemulatedcopyoftheactualnet-workthatismodiedbyremovingfailedelementsfromitstopology. Then, AFROfeedstheShadowController withrecordedPacketInmessagesprocessedbytheoriginal con-troller beforethefailureoccurred. Whenreplayends, theShadowControllerandShadowNetworkcontainanewfor-wardingstate.Atthispoint, recongurationtransitionstheactual net-workfromthe current statetothe newone. Thistransitionrequiresmakingmodicationstoboththecontrollerinter-nal stateaswell asforwardingrulesintheswitches. First,AFRO computes a minimal set of rule changes. Then, it usesan ecient, yet consistent method to update all switches. Fi-nally, it replaces the controller state with the state of ShadowController and ends the recovery procedure. In case anotherfailure is detected during recovery, the system needs to inter-rupttheaboveprocedureandrestartwithanotherShadowNetworkrepresentingthecurrenttopology.2.1 ReplayReplay needstobequickandecient. WhileAFROisreacting to the failure, the network is vulnerable and its per-formanceissuboptimal evenif thereisfastfailoverimple-mented. Further,consecutivefailuresmaycauseissuesthatevenfailover cannot handle. Toimprovethe replay per-formance, weusetwoorthogonal optimizations. First, welterPacketInmessagesand chooseonly those necessary toreachtherecoveredforwardingstate. Then, weparallelizethereplaybasedonpacketindependence.2.2 Network RecongurationAfter thereplayends, AFROneeds toecientlyapplythechangeswithoutputtingthenetworkinaninconsistentstate. The recongurationrequires twosteps: (i) modifyingrulesintheswitches, and(ii)migratingthecontrollertoanewstate.For rules in the switches we need a consistent and propor-tional updatemechanism. Ifthesizeofanupdateismuchbigger thanthesizeof arequiredcongurationchange, itosetstheadvantagesof AFRO. Thereforeweproposeus-ing a two-phase commit mechanism of per-packet consistentupdatesthat is verysimilar totheonepresentedas opti-mizationin[5]. Westartmodifyingrulesthatwill notbeusedanduseabarriertowaituntil switchesapplyanewconguration. At this point we installrulesthat direct traf-ctothepreviouslyinstalled,newrules.The transitionbetweenthe oldandnewcontroller canbedoneeitherbyexposinganinterfacetopasstheentirestatebetweenthetwocontrollers, or byreplacingtheoldonealtogether. Inthesecondcase, theShadowControllersimplytakes over the connections to realswitches andtakesresponsibilityoftheoriginalcontroller.2.3 PrototypeWeimplementedaprototypeof AFROworkingwithaPOXcontrollerplatform. POXallowsprogrammerstoex-tendthe controller functionalitybyaddingnewmodules.Therefore, it is possible tointroduce AFROrequiringnomodicationstocontrollerlogic.3. REFERENCES[1] M.Canini,D.Venzano,P.Peresni,D.Kostic,andJ.Rexford.ANICEWaytoTestOpenFlowApplications.InNSDI,2012.[2] J.DeanandS.Ghemawat.MapReduce: SimpliedDataProcessingonLargeClusters. CommunicationsoftheACM,51(1),2008.[3] N.Handigol,B.Heller,V.Jeyakumar,B.Lantz,andN.McKeown.ReproducibleNetworkExperimentsUsingContainer-BasedEmulation.InCoNEXT,2012.[4] P.Peresni,M.Kuzniar,N.Vasic,M.Canini,andD.Kostic.OF.CPP:ConsistentPacketProcessingforOpenFlow.InHotSDN,2013.[5] M.Reitblatt,N.Foster,J.Rexford,C.Schlesinger,andD.Walker.AbstractionsforNetworkUpdate.InSIGCOMM,2012.[6] S.Sharma,D.Staessens,D.Colle,M.Pickavet,andP.Demeester.EnablingFastFailureRecoveryinOpenFlowNetworks.InDRCN,2011.[7] A.Wundsam,D.Levin,S.Seetharaman,andA.Feldmann.OFRewind: EnablingRecordandReplayTroubleshootingforNetworks.InUSENIXATC,2011.160