data centers - courses.cs.washington.edu– readfileblock (vs. posix file read) – writefileblock...

39
Data Centers Tom Anderson

Upload: others

Post on 20-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

DataCenters

TomAnderson

Page 2: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

TransportClarification

•  RPCmessagescanbearbitrarysize– Ex:oktosendatreeorahashtable– Canrequiremorethanonepacketsent/received

•  Weassumemessagescanbedropped,duplicated,reordered–  Itispacketsthataredropped,duplicated,reordered– MostRPCsaresinglepackets

•  TCPaddressessomeoftheseissues– Butwe’llassumethegeneralcase

Page 3: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

RPCSemantics

•  Atleastonce(NFS,DNS,lab1b)–  true:executedatleastonce–  false:maybeexecuted,maybemultipletimes

•  Atmostonce(lab1c)–  true:executedonce–  false:maybeexecuted,butnevermorethanonce

•  Exactlyonce–  true:executedonce–  false:neverreturnsfalse

Page 4: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Whendoesat-least-oncework?

•  Ifnosideeffects–  read-onlyoperations(oridempotentops)

•  Example:MapReduce•  Example:NFS–  readFileBlock(vs.Posixfileread)– writeFileBlock–  (butnot:deletefile)

Page 5: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

AtMostOnce

ClientincludesuniqueID(UID)witheachrequest–  usesameUIDforre-send

ServerRPCcodedetectsduplicaterequests–  returnpreviousreplyinsteadofre-runninghandlerifseen[uid]{r=old[uid]}else{r=handler()old[uid]=rseen[uid]=true}

Page 6: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

SomeAt-Most-OnceIssues

HowdoweensureUIDisunique?–  Bigrandomnumber?–  CombineuniqueclientID(IPaddress?)withseq#?–  Whatifclientcrashesandrestarts?Canitreusethe

sameUID?

•  Inlabs,nodesneverrestart–  Equivalentto:everynodegetsnewIDonstart–  Clientaddress+seq#isunique–  Seq#aloneisNOTunique

Page 7: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

WhenCanServerDiscardOldRPCs?

Option1:Never?Option2:uniqueclientIDs–  per-clientRPCsequencenumbers–  clientincludes"seenallreplies<=X"witheveryRPC

Option3:oneclientRPCatatime–  arrivalofseq+1allowsservertodiscardall<=seq

LabsuseOption3

Page 8: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

WhatifServerCrashes?

Ifat-most-oncelistofrecentRPCresultsisstoredinmemory,serverwillforgetandacceptduplicaterequestswhenitreboots–  DoesserverneedtowritetherecentRPCresultsto

disk?–  Ifreplicated,doesreplicaalsoneedtostorerecent

RPCresults?InLabs,servergetsnewaddressonrestart–  Clientmessagesaren’tdeliveredtorestartedserver–  (discardifsenttowrongversionofserver)

Page 9: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

ADataCenter

Page 10: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

InsideaDataCenter

Page 11: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Datacenter

10k-100kservers:250k–10Mcores

1-100PBofDRAM

100PB-10EBstorage

1-10Pbpsbandwidth(>>Internet)

10-100MWpower-1-2%ofglobalenergyconsumption

100sofmillionsofdollars

Page 12: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Servers

Limitsdrivenbythepowerconsumption

1-4multicoresockets

20-24cores/socket(150Weach)

100sGB–1TBofDRAM(100-500W)

40Gbpslinktonetworkswitch

Page 13: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Serversinracks

19”wide

1.75”tall(1u)

(definedin1922!)

40-120servers/rack

networkswitchattop

Page 14: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Racksinrows

Page 15: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Rowsinhot/coldpairs

Page 16: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Hot/coldpairsindatacenters

Page 17: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Whereisthecloud?

Amazon,intheUS:-NorthernVirginia

-Ohio

-Oregon

-NorthernCalifornia

Whythoselocations?

Page 18: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

MTTF/MTTR

MeanTimetoFailure/MeanTimetoRepairDiskfailures(notreboots)peryear~2-4%– Atdatacenterscale,that’sabout2/hour.–  Ittakes10hourstorestorea10TBdisk

Servercrashes– 1/month*30secondstoreboot=>5mins/year– 100K+servers

Page 19: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

TypicalYearinaDataCenter(2008)•  ~0.5overheating(powerdownmostmachinesin<5mins,~1-2daystorecover)

•  ~1PDUfailure(~500-1000machinessuddenlydisappear,~6hourstocomeback)

•  ~1rack-move(plentyofwarning,~500-1000machinespowereddown,~6hours)

•  ~1networkrewiring(rolling~5%ofmachinesdownover2-dayspan)

•  ~20rackfailures(40-80machinesinstantlydisappear,1-6hourstogetback)

Page 20: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

TypicalYearinaDataCenter(2008)•  ~5racksgowonky(40-80machinessee50%packetloss)

•  ~8networkmaintenances(4mightcause~30-minuterandomconnectivitylosses)

•  ~12routerreloads(takesoutDNSandexternalvipsforacoupleminutes)

•  ~3routerfailures(havetoimmediatelypulltrafficforanhour)

•  ~dozensofminor30-secondblipsfordns•  ~1000individualmachinefailures•  ~thousandsofharddrivefailures•  slowdisks,badmemory,misconfiguredmachines,flakymachines,etc

Page 21: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

ThisWeek:PrimaryBackup

•  Howdowebuildsystemsthatsurviveasinglenodefailure?

•  Statereplication:runtwocopiesofserver–  primary,backup–  differentracks,differentPDUs,differentDCs!–  Ifprimaryfails,backuptakesover

•  Howdoweknowprimaryfailedvs.isslow?

Page 22: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

DataCenterNetworks

EveryserverwiredtoaToR(topofrack)switch

ToR’sinneighboringaisleswiredtoanaggregationswitch

Agg.switcheswiredtocoreswitches

Page 23: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Earlydatacenternetworks

3layersofswitches-Edge(ToR)

-Aggregation

-Core

Page 24: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Earlydatacenternetworks

3layersofswitches-Edge(ToR)

-Aggregation

-CoreOptical

Electrical

Page 25: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Earlydatacenterlimitations

Cost-Core,aggregationrouters=highcapacity,lowvolume

-Expensive!

Fault-tolerance-Failureofasinglecoreoraggregationrouter=large

bandwidthloss

Bisectionbandwidthlimitedbycapacityoflargestavailablerouter

-Google’sDCtrafficdoubleseveryyear!

Page 26: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Closnetworks

HowcanIreplaceabigswitchbymanysmallswitches?

Bigswitch

Smallswitch

Page 27: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Closnetworks

HowcanIreplaceabigswitchbymanysmallswitches?

Bigswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Page 28: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

ClosNetworks

Whataboutbiggerswitches?

Page 29: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Page 30: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Smallswitch

Page 31: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Multi-rootedtree

Everypairofnodeshasmanypaths

Faulttolerant!Buthowdowepickapath?

Page 32: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Multipathrouting

Lotsofbandwidth,splitacrossmanypaths

ECMP:hashonpacketheadertodetermineroute

-(5tuple):SourceIP,port,destinationIP,port,prot.

-Packetsfromclient–serverusuallytakesameroute

Onswitchorlinkfailure,ECMPsendssubsequentpacketsalongadifferentroute

=>Outoforderpackets!

Page 33: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

DataCenterNetworkTrends

RTlatencyacrossdatacenter~10usec40Gbpslinkscommon,100Gbpsontheway– 1KBpacketevery80nsona100Gbpslink– Directdeliveryintotheon-chipcache(DDIO)

Upperlevelsoftreeare(expensive)opticallinks– Thintreetoreducecosts

Withinrack>withinaisle>withinDC>crossDC– Latencyandbandwidth:keepcommunicationlocal

Page 34: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

LocalStorage

•  Magneticdisksforlongtermstorage– Highlatency(10ms),lowbandwidth(250MB/s)– Compressedandreplicatedforcost,resilience

•  Solidstatestorageforpersistence,cachelayer– 50usblockaccess,multi-GB/sbandwidth

•  EmergingNVM– LowenergyDRAMreplacement– Sub-microsecondpersistence

Page 35: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

DayintheLifeofaWebRequest

Usertypes“google.com”,whathappensDNS=DomainNameSystem– Globaldatabaseofname->IPaddress

DNSquerytorootnameserver– Rootnameserverisreplicated(600+)– HardcodedIPmulticastaddresses– Updateshappenasyncinbackground(rare)

Rootreturns:IPaddressofgoogle’snameserver–  Cachedonclient(foraconfigurableperiod)–  Canbeoutofdate

Page 36: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

NameServertoDC

•  DNSreturnsgoogle’snameserver•  Googlenameserverreturnslocalnameserver– NameserverinDCclosetouser’sIPaddress

•  LocalnameserverreturnsIPofwebfrontend– Alsoinlocaldatacenter– Withshorttimeout,rerouteiffrontendfailure

•  BrowsersendsHTTPrequesttofrontend

Page 37: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

ResilientScalableWebFrontEnd

•  IPaddressoffrontendisanarrayofmachines•  Packetsfirstgothroughloadbalancer– Actually,anarrayofloadbalancers(allthesame)

•  Loadbalancerhashes3or5tuple->frontend– SourceIP,port#,DestinationIP,port#,protocol– Thisway,alltrafficfromusergoestosameserver

•  Whenfrontendfails,loadbalancerredirectsincomingpacketselsewhere

Page 38: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

Challenge

Howdowebuildahashfunctionthat’sstable?Thatreroutesonlythepacketsforthefailednode,leavingthepacketstoothernodesalone.

Page 39: Data Centers - courses.cs.washington.edu– readFileBlock (vs. Posix file read) – writeFileBlock – (but not: delete file) At Most Once Client includes unique ID (UID) with each

WebFrontEnd->Cache,StorageLayer

Key-valuestoreShardedacrossmanycacheandstoragenodesHash(key)->cache,storagenodeIfcachenodefails,rehashtonewcachenodeIfstoragenodefails,???