Data Centers - courses.cs.washington.edu
TRANSCRIPT

Data Centers
Tom Anderson

Transport Clarification
• RPC messages can be arbitrary size
  – Ex: ok to send a tree or a hash table
  – Can require more than one packet sent/received
• We assume messages can be dropped, duplicated, reordered
  – It is packets that are dropped, duplicated, reordered
  – Most RPCs are single packets
• TCP addresses some of these issues
  – But we'll assume the general case
RPC Semantics
• At least once (NFS, DNS, lab 1b)
  – true: executed at least once
  – false: maybe executed, maybe multiple times
• At most once (lab 1c)
  – true: executed once
  – false: maybe executed, but never more than once
• Exactly once
  – true: executed once
  – false: never returns false
When does at-least-once work?

• If no side effects
  – read-only operations (or idempotent ops)
• Example: MapReduce
• Example: NFS
  – readFileBlock (vs. POSIX file read)
  – writeFileBlock
  – (but not: delete file)
At Most Once

Client includes a unique ID (UID) with each request
  – use the same UID for re-sends

Server RPC code detects duplicate requests
  – return the previous reply instead of re-running the handler:

    if seen[uid] {
        r = old[uid]
    } else {
        r = handler()
        old[uid] = r
        seen[uid] = true
    }
Some At-Most-Once Issues

How do we ensure the UID is unique?
  – Big random number?
  – Combine unique client ID (IP address?) with seq #?
  – What if the client crashes and restarts? Can it reuse the same UID?

• In labs, nodes never restart
  – Equivalent to: every node gets a new ID on start
  – Client address + seq # is unique
  – Seq # alone is NOT unique
When Can the Server Discard Old RPCs?

Option 1: never?
Option 2: unique client IDs
  – per-client RPC sequence numbers
  – client includes "seen all replies <= X" with every RPC
Option 3: one client RPC at a time
  – arrival of seq+1 allows server to discard all <= seq

Labs use Option 3
WhatifServerCrashes?
Ifat-most-oncelistofrecentRPCresultsisstoredinmemory,serverwillforgetandacceptduplicaterequestswhenitreboots– DoesserverneedtowritetherecentRPCresultsto
disk?– Ifreplicated,doesreplicaalsoneedtostorerecent
RPCresults?InLabs,servergetsnewaddressonrestart– Clientmessagesaren’tdeliveredtorestartedserver– (discardifsenttowrongversionofserver)
A Data Center

Inside a Data Center
  – 10k-100k servers: 250k-10M cores
  – 1-100 PB of DRAM
  – 100 PB-10 EB storage
  – 1-10 Pbps bandwidth (>> Internet)
  – 10-100 MW power (1-2% of global energy consumption)
  – 100s of millions of dollars
Servers

Limits driven by power consumption:
  – 1-4 multicore sockets
  – 20-24 cores/socket (150W each)
  – 100s GB-1 TB of DRAM (100-500W)
  – 40 Gbps link to network switch
Servers in racks
  – 19" wide, 1.75" tall (1U) (defined in 1922!)
  – 40-120 servers/rack
  – network switch at top
Racks in rows
Rows in hot/cold pairs
Hot/cold pairs in data centers
Where is the cloud?

Amazon, in the US:
  – Northern Virginia
  – Ohio
  – Oregon
  – Northern California

Why those locations?
MTTF/MTTR

Mean Time to Failure / Mean Time to Repair

Disk failures (not reboots) per year: ~2-4%
  – At data center scale, that's about 2/hour.
  – It takes 10 hours to restore a 10TB disk

Server crashes
  – 1/month * 30 seconds to reboot => 5 mins/year
  – 100K+ servers
Typical Year in a Data Center (2008)
• ~0.5 overheatings (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~1 rack move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
• ~1 network rewiring (rolling ~5% of machines down over 2-day span)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
Typical Year in a Data Center (2008)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.
This Week: Primary Backup

• How do we build systems that survive a single node failure?
• State replication: run two copies of the server
  – primary, backup
  – different racks, different PDUs, different DCs!
  – If the primary fails, the backup takes over
• How do we know the primary failed vs. is slow?
Data Center Networks

Every server wired to a ToR (top of rack) switch
ToRs in neighboring aisles wired to an aggregation switch
Agg. switches wired to core switches
Early data center networks

3 layers of switches:
  – Edge (ToR)
  – Aggregation
  – Core

[Figure: three-layer topology; upper-layer links optical, lower-layer links electrical]
Early data center limitations

Cost
  – Core, aggregation routers = high capacity, low volume
  – Expensive!
Fault-tolerance
  – Failure of a single core or aggregation router = large bandwidth loss
Bisection bandwidth limited by capacity of largest available router
  – Google's DC traffic doubles every year!
Clos networks

How can I replace a big switch by many small switches?

[Figure: one big switch replaced by an array of small switches]
Clos Networks

What about bigger switches?

[Figure: a bigger switch built from multiple stages of many small switches]
Multi-rooted tree
  – Every pair of nodes has many paths
  – Fault tolerant! But how do we pick a path?
Multipath routing

Lots of bandwidth, split across many paths
ECMP: hash on packet header to determine route
  – (5-tuple): source IP, source port, destination IP, destination port, protocol
  – Packets from client to server usually take the same route
On switch or link failure, ECMP sends subsequent packets along a different route
  => Out of order packets!
DataCenterNetworkTrends
RTlatencyacrossdatacenter~10usec40Gbpslinkscommon,100Gbpsontheway– 1KBpacketevery80nsona100Gbpslink– Directdeliveryintotheon-chipcache(DDIO)
Upperlevelsoftreeare(expensive)opticallinks– Thintreetoreducecosts
Withinrack>withinaisle>withinDC>crossDC– Latencyandbandwidth:keepcommunicationlocal
Local Storage

• Magnetic disks for long-term storage
  – High latency (10ms), low bandwidth (250MB/s)
  – Compressed and replicated for cost, resilience
• Solid state storage for persistence, cache layer
  – 50us block access, multi-GB/s bandwidth
• Emerging NVM
  – Low-energy DRAM replacement
  – Sub-microsecond persistence
Day in the Life of a Web Request

User types "google.com"; what happens?
DNS = Domain Name System
  – Global database of name -> IP address
DNS query to root name server
  – Root name server is replicated (600+)
  – Hardcoded IP multicast addresses
  – Updates happen async in background (rare)
Root returns: IP address of Google's name server
  – Cached on client (for a configurable period)
  – Can be out of date
Name Server to DC

• DNS returns Google's name server
• Google name server returns local name server
  – Name server in DC close to user's IP address
• Local name server returns IP of web frontend
  – Also in local data center
  – With short timeout, reroute if frontend failure
• Browser sends HTTP request to frontend
Resilient Scalable Web Front End

• IP address of frontend is an array of machines
• Packets first go through a load balancer
  – Actually, an array of load balancers (all the same)
• Load balancer hashes 3- or 5-tuple -> frontend
  – source IP, port #, destination IP, port #, protocol
  – This way, all traffic from a user goes to the same server
• When a frontend fails, the load balancer redirects incoming packets elsewhere
Challenge

How do we build a hash function that's stable? One that reroutes only the packets for the failed node, leaving the packets to other nodes alone.
Web Front End -> Cache, Storage Layer

Key-value store
  – Sharded across many cache and storage nodes
  – Hash(key) -> cache, storage node
  – If a cache node fails, rehash to a new cache node
  – If a storage node fails, ???