Data Centers - courses.cs.washington.edu
TRANSCRIPT

Data Centers
Tom Anderson

Transport Clarification
• RPC messages can be arbitrary size
  – Ex: ok to send a tree or a hash table
  – Can require more than one packet sent/received
• We assume messages can be dropped, duplicated, reordered
  – It is packets that are dropped, duplicated, reordered
  – Most RPCs are single packets
• TCP addresses some of these issues
  – But we'll assume the general case
RPC Semantics
• At least once (NFS, DNS, lab 1b)
  – true: executed at least once
  – false: maybe executed, maybe multiple times
• At most once (lab 1c)
  – true: executed once
  – false: maybe executed, but never more than once
• Exactly once
  – true: executed once
  – false: never returns false
When does at-least-once work?

• If no side effects
  – read-only operations (or idempotent ops)
• Example: MapReduce
• Example: NFS
  – readFileBlock (vs. POSIX file read)
  – writeFileBlock
  – (but not: delete file)
At Most Once

Client includes a unique ID (UID) with each request
  – use the same UID for re-sends

Server RPC code detects duplicate requests
  – return the previous reply instead of re-running the handler:

    if seen[uid] {
        r = old[uid]
    } else {
        r = handler()
        old[uid] = r
        seen[uid] = true
    }
Some At-Most-Once Issues

How do we ensure the UID is unique?
  – Big random number?
  – Combine unique client ID (IP address?) with seq #?
  – What if the client crashes and restarts? Can it reuse the same UID?

• In labs, nodes never restart
  – Equivalent to: every node gets a new ID on start
  – Client address + seq # is unique
  – Seq # alone is NOT unique
When Can the Server Discard Old RPCs?

Option 1: never?
Option 2: unique client IDs
  – per-client RPC sequence numbers
  – client includes "seen all replies <= X" with every RPC
Option 3: one client RPC at a time
  – arrival of seq+1 allows server to discard all <= seq

Labs use Option 3
WhatifServerCrashes?
Ifat-most-oncelistofrecentRPCresultsisstoredinmemory,serverwillforgetandacceptduplicaterequestswhenitreboots– DoesserverneedtowritetherecentRPCresultsto
disk?– Ifreplicated,doesreplicaalsoneedtostorerecent
RPCresults?InLabs,servergetsnewaddressonrestart– Clientmessagesaren’tdeliveredtorestartedserver– (discardifsenttowrongversionofserver)
A Data Center

Inside a Data Center
  – 10k-100k servers: 250k-10M cores
  – 1-100 PB of DRAM
  – 100 PB-10 EB storage
  – 1-10 Pbps bandwidth (>> Internet)
  – 10-100 MW power (1-2% of global energy consumption)
  – 100s of millions of dollars
Servers

Limits driven by power consumption:
  – 1-4 multicore sockets
  – 20-24 cores/socket (150W each)
  – 100s GB-1 TB of DRAM (100-500W)
  – 40 Gbps link to network switch
Servers in racks
  – 19" wide, 1.75" tall (1U) (defined in 1922!)
  – 40-120 servers/rack
  – network switch at top
Racks in rows
Rows in hot/cold pairs
Hot/cold pairs in data centers
Where is the cloud?

Amazon, in the US:
  – Northern Virginia
  – Ohio
  – Oregon
  – Northern California

Why those locations?
MTTF/MTTR

Mean Time to Failure / Mean Time to Repair

Disk failures (not reboots) per year: ~2-4%
  – At data center scale, that's about 2/hour.
  – It takes 10 hours to restore a 10TB disk

Server crashes
  – 1/month * 30 seconds to reboot => 5 mins/year
  – 100K+ servers
Typical Year in a Data Center (2008)
• ~0.5 overheatings (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~1 rack move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
• ~1 network rewiring (rolling ~5% of machines down over 2-day span)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
Typical Year in a Data Center (2008)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.
This Week: Primary Backup

• How do we build systems that survive a single node failure?
• State replication: run two copies of the server
  – primary, backup
  – different racks, different PDUs, different DCs!
  – If the primary fails, the backup takes over
• How do we know the primary failed vs. is slow?
Data Center Networks

Every server wired to a ToR (top of rack) switch
ToRs in neighboring aisles wired to an aggregation switch
Agg. switches wired to core switches
Early data center networks

3 layers of switches:
  – Edge (ToR)
  – Aggregation
  – Core

[Figure: three-layer topology; upper-layer links optical, lower-layer links electrical]
Early data center limitations

Cost
  – Core, aggregation routers = high capacity, low volume
  – Expensive!
Fault-tolerance
  – Failure of a single core or aggregation router = large bandwidth loss
Bisection bandwidth limited by capacity of largest available router
  – Google's DC traffic doubles every year!
Clos networks

How can I replace a big switch by many small switches?

[Figure: one big switch replaced by an array of small switches]
Clos Networks

What about bigger switches?

[Figure: a bigger switch built from multiple stages of many small switches]
Multi-rooted tree
  – Every pair of nodes has many paths
  – Fault tolerant! But how do we pick a path?
Multipath routing

Lots of bandwidth, split across many paths
ECMP: hash on packet header to determine route
  – (5-tuple): source IP, source port, destination IP, destination port, protocol
  – Packets from client to server usually take the same route
On switch or link failure, ECMP sends subsequent packets along a different route
  => Out of order packets!
DataCenterNetworkTrends
RTlatencyacrossdatacenter~10usec40Gbpslinkscommon,100Gbpsontheway– 1KBpacketevery80nsona100Gbpslink– Directdeliveryintotheon-chipcache(DDIO)
Upperlevelsoftreeare(expensive)opticallinks– Thintreetoreducecosts
Withinrack>withinaisle>withinDC>crossDC– Latencyandbandwidth:keepcommunicationlocal
Local Storage

• Magnetic disks for long-term storage
  – High latency (10ms), low bandwidth (250MB/s)
  – Compressed and replicated for cost, resilience
• Solid state storage for persistence, cache layer
  – 50us block access, multi-GB/s bandwidth
• Emerging NVM
  – Low-energy DRAM replacement
  – Sub-microsecond persistence
Day in the Life of a Web Request

User types "google.com"; what happens?
DNS = Domain Name System
  – Global database of name -> IP address
DNS query to root name server
  – Root name server is replicated (600+)
  – Hardcoded IP multicast addresses
  – Updates happen async in background (rare)
Root returns: IP address of Google's name server
  – Cached on client (for a configurable period)
  – Can be out of date
Name Server to DC

• DNS returns Google's name server
• Google name server returns local name server
  – Name server in DC close to user's IP address
• Local name server returns IP of web frontend
  – Also in local data center
  – With short timeout, reroute if frontend failure
• Browser sends HTTP request to frontend
Resilient Scalable Web Front End

• IP address of frontend is an array of machines
• Packets first go through a load balancer
  – Actually, an array of load balancers (all the same)
• Load balancer hashes 3- or 5-tuple -> frontend
  – source IP, port #, destination IP, port #, protocol
  – This way, all traffic from a user goes to the same server
• When a frontend fails, the load balancer redirects incoming packets elsewhere
Challenge

How do we build a hash function that's stable? One that reroutes only the packets for the failed node, leaving the packets to other nodes alone.
Web Front End -> Cache, Storage Layer

Key-value store
  – Sharded across many cache and storage nodes
  – Hash(key) -> cache, storage node
  – If a cache node fails, rehash to a new cache node
  – If a storage node fails, ???