sdn

Automatic Failure Recovery for Software-Dened NetworksMaciej Ku zniarPeter PereniNedeljko Vasi cMarco CaniniDejan Kosti cEPFLTU Berlin / T-LabsInstitute IMDEA [email protected]@[email protected] authors contributed equally to this workABSTRACTTolerating and recovering fromlink and switch failuresare fundamental requirements of most networks, includ-ingSoftware-DenedNetworks (SDNs). However, insteadof traditional behaviors suchas network-wide routing re-convergence, failurerecoveryinanSDNis determinedbythespecicsoftwarelogicrunningatthecontroller. Whilethisadmitsmorefreedomtorespondtoafailureevent, itultimatelymeansthateachcontrollerapplicationmustin-clude its ownrecoverylogic, whichmakes thecode morediculttowriteandpotentiallymoreerror-prone.Inthis paper, we propose aruntimesystemthat auto-mates failure recoveryandenables networkdevelopers towritesimpler, failure-agnosticcode. Tothisend, uponde-tectingafailure,ourapproachrstspawnsanewcontrollerinstance that runs in an emulated environment consisting ofthe network topology excluding the failed elements. Then, itquicklyreplaysinputsobservedbythecontrollerbeforethefailureoccurred,leadingtheemulatednetworkintothefor-wardingstatethat accountsforthe failedelements. Finally,it recoversthenetworkbyinstallingthedierencerulesetbetweenemulatedandcurrentforwardingstates.Categories and Subject DescriptorsC.2.4 [DistributedSystems]: Network operating systems;C.4[Performance of Systems]: Reliability, availability,andserviceabilityKeywordsSoftwareDenedNetwork,Faulttolerance1. INTRODUCTIONTo ensure uninterruptedservice, Software-DenedNet-works (SDNs) must continue to forwardtrac in the face oflinkandswitchfailures. Therefore, aswithtraditionalnet-works, SDNsnecessitatefailurerecoveryadaptingtofail-ures bydirecting trac over functioning alternate paths.Permissiontomakedigital orhardcopiesofpart orall ofthisworkforpersonal orclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmade ordistributedforprot orcommercialadvantageand thatcopiesbearthisnoticeand the full citationon the rst page. Copyrightsfor third-party components of this work must be honored. For all other uses, contactthe owner/author(s).HotSDN13, August 16, 2013, Hong Kong, China.ACM 978-1-4503-2178-5/13/08.Conceptually, thisrequirescomputingthenewforwardingstateinresponsetoafailureevent,similarlytohowalink-stateroutingprotocolre-convergesafterdetectingafailure.InanSDNhowever, this computationtakes place at theSDNcontroller,basedonitscurrentviewofthenetwork.Toachievehigheravailability, thiscomputationcouldbedoneinadvancetodeterminebackuppathsforarangeoffailurescenarios. Withtheappropriatehardwaresupport1,thebackuppathscanbepreinstalledtoenableswitchestoquicklyadapttofailuresbasedjustonlocalstate.Regardless of whether therecoveredforwardingstateiscomputedinresponse toafailure or precomputedinad-vance, failurerecoveryrequiresthepresenceof extrasoft-warelogicatthecontroller. Inparticular,intodaysmodu-larcontrollerplatformssuchasPOX,Floodlight, etc.,eachindividualmodule(e.g.,accesscontrol,routing)potentiallyneeds to include its own failure recovery logic [6]. This leadsto controller modules that are more dicult to develop, andpotentiallyincreasesthechanceof bugs[1]. Avoidingthisproblemby relyingonsimplerecoverylogicliketimeouts ordeleting rules forwardingto aninactive port is inecient [6]andmayintroduceforwardingloopsinthenetwork[4].Inthispaper,weproposeAFRO(AutomaticFailureRe-covery for OpenFlow), a system that provides automatic fail-urerecoveryonbehalfofsimpler,failure-agnosticcontrollermodules. Inspiredbythesuccessfulimplementationofsuchamodel inotherdomains(e.g., inthelarge-scalecomput-ingframeworkMapReduce[2]), wearguethatapplication-specicfunctionalityshouldbeseparatedfromthefailurerecoverymechanisms. Instead,aruntimesystemshouldberesponsible for transparentlyrecoveringthenetworkfromfailures. Doingsoiscritical for dramaticallyreducingthechanceof introducingbugsandinsidiouscornercases, andincreasingtheoverallchanceofSDNssuccess.AFRO extends thefunctionalityofabasiccontrollerpro-gramthatisonlycapableofcorrectlyoperatingoveranet-workwithout failures andallows it toreact tochangesinthe networktopology. Theintuitionbehindourapproachisbasedonastraightforwardrecoverymechanism, or ratherasolutionthat sidesteps therecoveryproblemaltogether:Afteranytopologychange, wecouldwipecleantheentirenetworkforwardingstateandrestartthecontroller. Then,theforwardingstategetsrecoveredasthecontrollerstartstoinstallforwardingrulesaccordingtoitscontrollogic,ini-tial congurationandtheexternaleventsthatinuenceitsexecution.1In OpenFlow, the FastFailover group type is available sinceversion1.1ofthespecication.159However, becauseall rulesinthenetworkneedtobere-installed,such a simplemethod isinecient andhas signi-cant disadvantages. First, a large number of packets may bedroppedorredirectedinsidethePacketInmessagestothecontroller,whichcouldbeoverwhelmed. Second,therecov-eryspeedwill beadverselyaectedbyseveral factors, in-cluding the timeto resetthe networkandcontroller,aswellas theoverheadof computing, transmittingandinstallingthenewrules.Toovercometheseissues,wetakeadvantageofthe soft-ware partinSDN.Sinceallthecontroldecisionsaremadein software running at the controller, AFRO repeats (or pre-dicts)thesedecisionsbysimplyrunningthesamesoftwarelogicinanemulatedenvironmentof theunderlyingphysi-cal network[3, 7]. Duringnormal execution, AFROsimplylogsatraceoftheeventsobservedatthecontrollerthatin-uenceitsexecution(forexample, PacketInmessages). Incase of a network failure, AFRO replays at accelerated speedtheeventspriortothefailurewithinanisolatedinstanceofthe controller with its environmentexcept it pretends thatthenetworktopologyneverchangedandallfailedelementsneverexistedsincethebeginning. Finally, AFROrecoversthe actual network forwarding state by transitioning it to therecoveredforwardingstateoftheemulatedenvironment.2. AUTOMATING FAILURE RECOVERYAFRO worksintwooperationmodes: recordmode,whenthe networkisnot experiencing failures,andrecoverymode,whichactivatesafterafailureisdetected. Thenetworkandcontroller start inacombinedcleanforwardingstatethatlaterisgraduallymodiedbythecontrollerinresponsetoPacketInevents. Inrecordmode, AFROrecordsall Pack-etInarrivingatthecontrollerandkeepstrackofcurrentlyinstalledrules bylogging FlowModandFlowRemmessages.When a network failure is detected, AFRO enters the recov-erymode. Recoveryinvolvestwomainphases: replayandreconguration.Duringreplay,AFROcreatesacleancopyoftheoriginalcontroller, whichwecall ShadowController. ShadowCon-troller hasthesamefunctionalityas theoriginal one, butit starts withacleanstateandfromthebeginningworksonaShadow Networkanemulatedcopyoftheactualnet-workthatismodiedbyremovingfailedelementsfromitstopology. Then, AFROfeedstheShadowController withrecordedPacketInmessagesprocessedbytheoriginal con-troller beforethefailureoccurred. Whenreplayends, theShadowControllerandShadowNetworkcontainanewfor-wardingstate.Atthispoint, recongurationtransitionstheactual net-workfromthe current statetothe newone. Thistransitionrequiresmakingmodicationstoboththecontrollerinter-nal stateaswell asforwardingrulesintheswitches. First,AFRO computes a minimal set of rule changes. Then, it usesan ecient, yet consistent method to update all switches. Fi-nally, it replaces the controller state with the state of ShadowController and ends the recovery procedure. In case anotherfailure is detected during recovery, the system needs to inter-rupttheaboveprocedureandrestartwithanotherShadowNetworkrepresentingthecurrenttopology.2.1 ReplayReplay needstobequickandecient. WhileAFROisreacting to the failure, the network is vulnerable and its per-formanceissuboptimal evenif thereisfastfailoverimple-mented. Further,consecutivefailuresmaycauseissuesthatevenfailover cannot handle. Toimprovethe replay per-formance, weusetwoorthogonal optimizations. First, welterPacketInmessagesand chooseonly those necessary toreachtherecoveredforwardingstate. Then, weparallelizethereplaybasedonpacketindependence.2.2 Network RecongurationAfter thereplayends, AFROneeds toecientlyapplythechangeswithoutputtingthenetworkinaninconsistentstate. The recongurationrequires twosteps: (i) modifyingrulesintheswitches, and(ii)migratingthecontrollertoanewstate.For rules in the switches we need a consistent and propor-tional updatemechanism. Ifthesizeofanupdateismuchbigger thanthesizeof arequiredcongurationchange, itosetstheadvantagesof AFRO. Thereforeweproposeus-ing a two-phase commit mechanism of per-packet consistentupdatesthat is verysimilar totheonepresentedas opti-mizationin[5]. Westartmodifyingrulesthatwill notbeusedanduseabarriertowaituntil switchesapplyanewconguration. At this point we installrulesthat direct traf-ctothepreviouslyinstalled,newrules.The transitionbetweenthe oldandnewcontroller canbedoneeitherbyexposinganinterfacetopasstheentirestatebetweenthetwocontrollers, or byreplacingtheoldonealtogether. Inthesecondcase, theShadowControllersimplytakes over the connections to realswitches andtakesresponsibilityoftheoriginalcontroller.2.3 PrototypeWeimplementedaprototypeof AFROworkingwithaPOXcontrollerplatform. POXallowsprogrammerstoex-tendthe controller functionalitybyaddingnewmodules.Therefore, it is possible tointroduce AFROrequiringnomodicationstocontrollerlogic.3. REFERENCES[1] M.Canini,D.Venzano,P.Peresni,D.Kostic,andJ.Rexford.ANICEWaytoTestOpenFlowApplications.InNSDI,2012.[2] J.DeanandS.Ghemawat.MapReduce: SimpliedDataProcessingonLargeClusters. CommunicationsoftheACM,51(1),2008.[3] N.Handigol,B.Heller,V.Jeyakumar,B.Lantz,andN.McKeown.ReproducibleNetworkExperimentsUsingContainer-BasedEmulation.InCoNEXT,2012.[4] P.Peresni,M.Kuzniar,N.Vasic,M.Canini,andD.Kostic.OF.CPP:ConsistentPacketProcessingforOpenFlow.InHotSDN,2013.[5] M.Reitblatt,N.Foster,J.Rexford,C.Schlesinger,andD.Walker.AbstractionsforNetworkUpdate.InSIGCOMM,2012.[6] S.Sharma,D.Staessens,D.Colle,M.Pickavet,andP.Demeester.EnablingFastFailureRecoveryinOpenFlowNetworks.InDRCN,2011.[7] A.Wundsam,D.Levin,S.Seetharaman,andA.Feldmann.OFRewind: EnablingRecordandReplayTroubleshootingforNetworks.InUSENIXATC,2011.160

sdn

Documents