Passive realtime datacenter fault detection and localization
Arjun Roy, James Hongyi Zeng*, Jasmeet Bagga*, and Alex C. Snoeren
University of California, San Diego; Facebook*

Post on 15-Jul-2020


Page 1

Passive realtime datacenter fault detection and localization

Arjun Roy, James Hongyi Zeng*, Jasmeet Bagga*, and Alex C. Snoeren
University of California, San Diego; Facebook*

Page 2

“It would be nice if we could figure out which link was causing these retransmits.”

- Ranjeeth Dasineni, Facebook (paraphrased)

Page 3

Contemporary datacenter network

However: faults may be partial/intermittent.

Page 4

Partial faults: A few examples

• NetPilot (SIGCOMM 2012): Frame check error, unequal ECMP hashing, etc.
  Wu, Xin, et al. "NetPilot: Automating Datacenter Network Failure Mitigation." ACM SIGCOMM Computer Communication Review 42.4 (2012): 419-430.

• Everflow (SIGCOMM 2015): TCAM bit errors, silent packet drops.
  Zhu, Yibo, et al. "Packet-Level Telemetry in Large Datacenter Networks." SIGCOMM, 2015.

• Pingmesh (SIGCOMM 2015): “fiber FCS … errors, switching ASIC defects, switch fabric flaw, switch software bug, NIC configuration issue, network congestions, etc. We have seen all these types of issues in our production networks.”
  Guo, Chuanxiong, et al. "Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis." SIGCOMM, 2015.

Page 5

Vast body of prior work (just a small sample…)

• Application instrumentation: various production systems
• Active probing: Pingmesh (SIGCOMM ’15), NetNORAD (Facebook), ATPG (CoNEXT ’12), Everflow (SIGCOMM ’15)
• Machine learning: NetPoirot (SIGCOMM ’16)
• Graph algorithms: Gestalt (USENIX ATC ’14), SCORE (NSDI ’05)
• Path tracing: Everflow (SIGCOMM ’15), NetNORAD (Facebook), NetSight (NSDI ’14), Tiny Packet Programs (SIGCOMM ’14)
• Network instrumentation: FlowRadar (NSDI ’16), Planck (SIGCOMM ’14), NetPilot (SIGCOMM ’12)

Page 6

We exploit: highly regular load balanced traffic

[Figure; labels: "Source rack traffic magnitude", "Destination rack traffic magnitude".]

Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the Social Network's (Datacenter) Network. ACM SIGCOMM '15, London, England.

Page 7

Load balanced traffic simplifies fault handling

• Evenly loaded paths mean per-path performance is similar if there are no errors.
• Network faults lead to outlier paths.
• If a flow's network path is known, we can correlate flow performance with path.

• Approach allows us to find and localize faults:
  • In an application-agnostic manner
  • Incurring no additional probing overhead
  • More rapidly than prior published works

Page 8

Facebook datacenter topology

Alexey Andreyev. Introducing data center fabric, the next-generation Facebook data center network. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/

Page 9

Finding path information at Facebook

[Figure: network topology diagram; labels: Source host, Destination host, ToR, Agg, Core.]

Pages 10-14

Finding path information at Facebook

[Figure: network topology diagram with the same labels as on page 9: Source host, Destination host, ToR, Agg, Core.]

Pages 15-16

Finding path information at Facebook

[Figure: reduced topology diagram; labels: Source host, Destination host, ToR, Agg, Core.]

Page 17

Finding path information at Facebook

[Figure: reduced topology diagram; labels: Source host, Destination host, ToR, Agg, Core.]

Solution: aggregation switch marks packets based on core downlink traversed.

Page 18

How do we use path information?

• In principle: can compare flow performance by path.
  1. Combinatorial disaster: O(10,000) paths from a single host to remote racks.
  2. No localization: doesn’t tell us which link/switch is at fault.

• But: for this traffic pattern, ECMP routing gives us even bytes/link.

• Solution: Just compare links!

Create “Equivalence Sets”: sets of links handling similar load and exhibiting similar performance, in the absence of faults.

Equivalence sets:
  1. Reduce the number of comparisons needed.
  2. Pinpoint the fault to a specific location.
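The grouping step can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: the link names, the set keys, and both helper functions are hypothetical, and the only idea taken from the slide is that each link is compared against the other members of its own set rather than against every path.

```python
from collections import defaultdict

# Hypothetical link inventory: (link name, equivalence-set key).
# The key names a group of links expected to carry similar load,
# for example the uplinks of one ToR or of one pod Agg switch.
EXAMPLE_LINKS = [
    ("tor1-agg1", "tor1-uplinks"),
    ("tor1-agg2", "tor1-uplinks"),
    ("tor1-agg3", "tor1-uplinks"),
    ("tor1-agg4", "tor1-uplinks"),
    ("agg1-core1", "agg1-uplinks"),
    ("agg1-core2", "agg1-uplinks"),
]


def build_equivalence_sets(links):
    """Group link names by their equivalence-set key, preserving input order."""
    sets = defaultdict(list)
    for name, key in links:
        sets[key].append(name)
    return dict(sets)


def comparisons_needed(sets):
    """One comparison per link: that link against the rest of its own set."""
    return sum(len(members) for members in sets.values())
```

With this layout the number of comparisons grows with the number of links, and a link that stands out within its set is itself the suspected fault location.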

Page 19

Equivalence sets in Facebook topology

[Figure: topology diagram; labels: Source host, ToR, Agg, Core.]

Equivalence set: 4 uplinks from each ToR to pod Agg layer
…each has close to identical performance distribution in absence of errors

Page 20

Equivalence sets in Facebook topology

[Figure: topology diagram; labels: Source host, ToR, Agg, Core.]

Equivalence set: N uplinks from pod Agg layer to core layer
…each has close to identical performance distribution in absence of errors

Page 21

Outlier analysis with application agnostic metrics

Hosts already track metrics for congestion control or performance monitoring:

TCP congestion window: Affected by packet loss.
TCP retransmits: Affected by packet loss.
Smoothed round trip time: Affected by latency spikes.
System call latency: Affected by packet loss.

Caveat: Can be difficult to determine if an effect is due to a faulty link, overloaded hosts, application variance, etc.

With equivalence set based grouping, we can compare distributions by link.

Only link faults cause variance between links.

Page 22

Demonstrating equivalence sets from Agg to ToR

[Figure: Host and ToR, with inbound links from Agg 1, Agg 2, Agg 3, and Agg 4.]

(1) ToR marks packet DSCP per inbound link
(2) Host aggregates TCP metrics by link
(3a) We simulate error on this link:
(3b) Host drops 0.5% of packets traversing link
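A rough sketch of steps (2) and (3b) follows. The record layout (TOS byte plus retransmit count per flow), the function names, and the simulation are assumptions made for illustration; the simulation also assumes every dropped packet shows up as exactly one retransmit. The one fixed fact used is that DSCP occupies the upper six bits of the IPv4 TOS / IPv6 traffic-class byte.

```python
import random
from collections import defaultdict


def dscp_from_tos(tos_byte):
    """Extract the 6-bit DSCP value from a TOS / traffic-class byte."""
    return (tos_byte >> 2) & 0x3F


def aggregate_by_link(flow_records):
    """Group per-flow retransmit counts by the DSCP mark the flow carried.

    flow_records: iterable of (tos_byte, retransmits) pairs (hypothetical layout).
    """
    per_link = defaultdict(list)
    for tos_byte, retransmits in flow_records:
        per_link[dscp_from_tos(tos_byte)].append(retransmits)
    return dict(per_link)


def simulate_flows(num_flows, packets_per_flow, drop_rate_by_mark, seed=0):
    """Generate synthetic flow records; each flow uses one randomly chosen link.

    Assumption: one dropped packet equals one retransmit.
    """
    rng = random.Random(seed)
    marks = sorted(drop_rate_by_mark)
    records = []
    for _ in range(num_flows):
        mark = rng.choice(marks)
        rate = drop_rate_by_mark[mark]
        drops = sum(rng.random() < rate for _ in range(packets_per_flow))
        records.append((mark << 2, drops))  # place the mark in the DSCP bits
    return records
```

For example, `aggregate_by_link(simulate_flows(2000, 200, {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.005}))` yields four per-link samples in which only the link marked 4 accumulates retransmits.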

Page 23

TCP congestion window in Agg to ToR equivalence set

[Figure: plot labeled "Cache server".]

Page 24

Congestion window signal is application agnostic

[Figure: plots labeled "Cache server" and "Web server".]

Page 25

We use: TCP retransmits in our work

[Figure: plots labeled "Cache server" and "Web server".]

Page 26

Detecting faults in production

• Monitored traffic through pod aggregation switch.
  1. No faults injected.
  2. Collected TCP metric data on 30 web server hosts.
  3. Equivalence set: four line cards connecting to core layer (each line card has an equal share of uplinks).

• On January 25th, a single line card had a software fault.
  1. Line card controller software hung.
  2. BGP routes timed out; production traffic through the line card was routed away.
  3. A few minutes later, NetNORAD flagged the unresponsive line card.

Page 27

Fault visible to our approach in 30 seconds

Page 28

Classifying faulty links

• “Does this link have more retransmits per flow than the other links?”

• “Do two distributions have the same mean, or is one greater?”

Classifier: compare each link to other links with one sample Student’s T-Test.
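A minimal sketch of such a classifier is below. It is not the talk's implementation: the function names, the 0.05 threshold, and the choice of reference mean (the pooled mean of the other links in the set) are assumptions. To stay within the standard library, the one-sided p-value uses a normal approximation to Student's t distribution, which is only reasonable when each link has many flows.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, mu0):
    """One-sample t statistic of `sample` against reference mean `mu0`."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    diff = mean(sample) - mu0
    if s == 0:
        # Degenerate sample with no variance: sign of the difference decides.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / (s / math.sqrt(n))


def approx_one_sided_p(t_stat):
    """Upper-tail p-value, ASSUMING a normal approximation (large samples)."""
    return 1.0 - NormalDist().cdf(t_stat)


def classify_links(per_link, alpha=0.05):
    """Return links whose mean retransmits per flow is significantly higher
    than the pooled mean of the other links in the same equivalence set."""
    suspects = []
    for link, sample in per_link.items():
        others = [x for other, vals in per_link.items() if other != link for x in vals]
        t_stat = one_sample_t(sample, mean(others))
        if t_stat > 0 and approx_one_sided_p(t_stat) < alpha:
            suspects.append(link)
    return suspects
```

The input is a mapping from link to a list of per-flow retransmit counts, one entry per flow observed on that link.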

Page 29

Online fault monitoring with T-Test alone

• In principle: can set up a system that uses end host T-Test results to tell us which network links are faulty.

• However: by itself this is susceptible to false positives.

• Can’t afford false positives in a network with O(10,000) links!

Page 30

Accounting for false positives

• However, two characteristics aid us:
  1. Per-host false positives are evenly distributed per link over time.
  2. Datacenter has a plethora of hosts for which this is true.

• Thus, we’re not trying to see if a given link is marked faulty by hosts.

• Instead, we once again perform outlier analysis.
  1. “Are all the links being marked faulty by hosts at similar rates?”
  2. “Are hosts flagging a particular subset of links as faulty at higher rates?”

Chi-squared test: determines if any links are outliers.

P-Value ≈ 1: “Yes, all the links are being marked faulty by hosts at similar rates.”

P-Value ≈ 0: “No, a subset has a comparatively high percentage of hosts claiming fault.”
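The aggregation across hosts can be sketched as a chi-squared goodness-of-fit test on per-link flag counts. This is an illustration under assumptions: the input format (number of hosts that flagged each link) and the function names are hypothetical, and the null hypothesis is that flags are spread evenly over the links, which presumes each link is observed by a similar number of hosts. The p-value is computed with a standard series for the regularized incomplete gamma function so the example needs only the standard library.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-squared variable with `df` degrees of
    freedom, via the series expansion of the regularized lower incomplete gamma."""
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    term = 1.0 / a
    total = term
    denom = a
    for _ in range(10000):
        denom += 1.0
        term *= x / denom
        total += term
        if term < total * 1e-15:
            break
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def flag_rate_test(flag_counts):
    """Test whether links are flagged at similar rates.

    flag_counts: mapping link -> number of hosts that flagged it (hypothetical).
    Returns (chi-squared statistic, p-value).
    """
    counts = list(flag_counts.values())
    if len(counts) < 2:
        raise ValueError("need at least two links to compare")
    total = sum(counts)
    if total == 0:
        return 0.0, 1.0  # no flags at all: nothing stands out
    expected = total / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi2_sf(stat, len(counts) - 1)
```

Evenly spread counts such as 10, 11, 9, 10 give a p-value near 1, while one link flagged by 40 hosts against roughly 10 for the others drives the p-value toward 0.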

Page 31

Evaluation in the datacenter

• Small detection surface; did not detect any ‘organic’ partial faults.

• Approach: inject ‘simulated’ faults to evaluate approach.

• Induced a variety of fault scenarios to challenge our system.

Pages 32-33

Evaluation in the datacenter: fault scenarios

• Minuscule faults: faults can have very low drop rates.

• Concurrent faults: multiple faults can occur simultaneously.

• Masked faults: a larger fault can mask the effect of a minuscule fault.

• Correlated faults: a hardware fault can impact multiple nearby links, confounding outlier analysis.

Pages 34-35

Finding minuscule faults: experiment setup

[Figure: topology diagram; labels: Host, ToR, Agg, Core 1, Core 2, Core 3, …, Core N.]

Page 36

Finding minuscule faults: experiment setup

[Figure: topology diagram; labels: Host, ToR, Agg, Core 1, Core 2, Core 3, …, Core N.]

Equivalence set: N uplinks from pod Agg layer to core layer

Page 37

Finding minuscule faults: experiment setup

[Figure: topology diagram; labels: Host, ToR, Agg, Core 1, Core 2, Core 3, …, Core N.]

Partial fault induced on single Core to Agg downlink.

Page 38

Fault detection rate vs drop rate

Pages 39-44

Minuscule faults: choosing between detection speed and sensitivity

Page 45

“It would be nice if we could figure out which link was causing these retransmits.”

- Ranjeeth Dasineni, Facebook (paraphrased)


Page 47

Interpreting the T-Test

1. T-Statistic: “Does this link have more or fewer retransmits than average?”
   • Positive T-statistic means larger than average.
   • Negative T-statistic means smaller than average.

2. P-Value: “Is the difference in mean big enough to concern us?”
   • Close to 0 means this link could be an outlier.
   • Close to 1 means we are not concerned.
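For reference, the textbook one-sample t statistic has the form below. Reading $\bar{x}$ and $s$ as the mean and sample standard deviation of retransmits per flow on the link under test, $n$ as the number of flows on that link, and $\mu_0$ as the average the link is compared against is an assumption about how the slide's quantities map onto the formula.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
```

The sign of $t$ follows the sign of $\bar{x} - \mu_0$, which is why a positive statistic means more retransmits than average and a negative one means fewer. Its magnitude grows with the size of the difference and with $\sqrt{n}$, and a larger magnitude corresponds to a smaller p-value.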

Page 48

Interpreting the T-Test

P-value ≈ 0, t-stat > 0

P-value ≈ 1, t-stat ≈ 0