distributed systems communicaon and models · e.g. different types of failures occur with...

33
Distributed Systems Communica3on and models Rik Sarkar Spring 2018 University of Edinburgh

Upload: others

Post on 13-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

DistributedSystems

Communica3onandmodels

RikSarkarSpring2018

UniversityofEdinburgh

Page 2: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Models

•  Expecta3ons/assump3onsaboutthings•  Everyideaorac3onanywhereisbasedonamodel

•  Determineswhatcanorcannothappen

2

Page 3: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Communica3on&modeling•  Modelingdistributedsystems:– Howwecanthinkaboutthem

•  Communica3onbetweennearbynodes•  Communica3onbetweendistantnodes•  Communica3onwithmanynodesSometerminology:•  Oneto“all”:Broadcast– All:insomesetofinterest

•  Onetoone:pointtopoint3

Page 4: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Packets

•  Networkscommunicatedatainmessagesoffixed(bounded)size–calledpackets

•  Moredatarequiresmorepackets

•  NumberofmessagesorpacketstransmiWedisameasureofcommunica3onused

4

Page 5: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Localareanetworks•  Medium:Broadcast

– Messagegoesfromonecomputertoallothercomputers(restrictedtosomeset)•  Forexample,allothercomputersintheLAN,orsomeothersysteminconsidera3on

–  EthernetLANisabroadcastmedium•  Allcomputersareconnectedtoawire.Theytransmitmessagesonthewireandallcanreceive

– WirelessLAN(WiFi)isabroadcastmedium•  Electromagne3cwavesisthecommonmedium

5

Page 6: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Localareanetworks

Advantages:•  Sendingacommonmessagetoeveryoneiseasy•  Findingdes3na3oniseasy– Messagegoestoeveryone–  Justhavea“des3na3on”field

Mainissue:Mediumaccess•  Sincemediumisbroadcast,twopeopletransmiangatthesame3megarblesmessage

6

Page 7: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Mediumaccess•  Onlyonetransmissionata3mecanbeallowed– Mutualexclusionproblem(sharedresourceofcommunica3onwire)

– Wecannotusemessagestosolveit–  Protocols:

•  TDMA:Everyonehasaperiodicslot•  CSMA:Seeifanyoneelseistransmiang.Ifso,defer.•  Usuallyacksarealsousedtoensuretransmission

–  Retransmitifnecessary

–  Bandwidthreduceswithnumberofnodestryingtotransmit.•  OneLANshouldnothavetoomanynodes

7

Page 8: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Mediumaccess

•  Wireless:morecomplicated•  Hiddenterminalproblem•  Morecomplexprotocolusingacks

8

Page 9: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

OurmodelsofLAN

•  Graph:Everynodehasanedgetoeveryother

•  Weofenassumethattosendamessage(packet)toanodeonthesamenetworktakesoneunitof3me(or,atmostaconstant)

•  ThismaynotbetrueiftherecanbemanynodesinthesameLAN– Butusuallythenumberisnotverylarge

9

Page 10: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Reallifenetworks

•  LANsconnectedbyrouters

Ethernet/WifiEthernet/Wifi

PointtoPoint

Routers

10

Page 11: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

1 4

2 4

3 4

4 4

1 3

2 3

3 3

5 5

1 1

2 2

4 4

5 4

2 2

3 3

4 3

5 3

Rou3ng•  Findingapathinthenetwork•  Everynodehasarou3ngtable•  EquivalenttoaBFStreeforeverynode

1

2

4

3

5

1 1

3 3

4 3

5 3 11

Page 12: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

1-4 4

1-3 3

5 5

1 1

2 2

4-5 42 2

3-5 3

Rou3ng:Distributedsearchforapath

•  Smallerrou3ngtablesbycombiningaddresses•  UsedinIP(Internet)rou3ng•  Smallerrou3ngtablesarepreferable

1

2

4

3

5

1 1

3-5 3

12

Page 13: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Rou3ng

•  Realrou3ngismorecomplicated•  Withmorethanonepathtoades3na3on,backupsetc

13

Page 14: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Related:Loca3onbasedrou3ng

•  Geographicalrou3ngusesanode’sloca3ontodiscoverpathtothatnode.

x

y

GreedyRou3ng:Forwardtotheneighborthatisnearesttothedes3na3on

14

Page 15: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Largenetworks•  Communica3onistypicallypoint-to-pointusingrou3ng

•  Broadcastisnotautoma3c–  Ifweneedbroadcast,wewillhavetoarrangeaflood(orsomeothermethod)

Ethernet/WifiEthernet/Wifi

PointtoPoint

Routers

15

Page 16: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Transportmanagement•  UDP:

–  Sendapacket,hopethatthenetworkroutesanddeliversit,in3me

–  NoSequencenumber•  NotnecessarilyFIFO

–  Usefulinstreamingaudio/video.Notforimportantdata.

•  TCP:–  Sendapacket(orfewpackets)–  Packetshavesequencenumber

•  FIFO–  Ifnoacksarrive,resendpackets–  Ifnoacksarefoundafermanytries,returnerror

16

Page 17: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

TCP

•  Doesdistributedconges3oncontrol– Whenpacketsdon’tgetdelivered,TCPslowsdownthestream•  Assump3on:routersdroppacketswhentherearetoomany

•  Difficulty– Acksmaynotarriveduetootherfactors•  Someconnec3onfailedtemporarily•  Usermovedfromonenetworktoanother

17

Page 18: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Networkstack

•  Eachlayersolvesadifferentdistributedproblem

18

Formwikipedia

Page 19: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Communica3on:Overlaynetwork•  Wemaysome3mesignorepartsofthenetwork

–  Nodesthatcarrymessagesbutdonotdirectlypar3cipateeg.routers

–  Oredgesthatexistbutwearenotusing–  Orwedon’tknowabout

•  Ofenusedinpeer-to-peernetworks–  Noteverynodeknowsallothernodesinthenetwork–  Butcommunicatestoknownnodesthroughrou3ng

19

Page 20: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Blockingandnon-blockingcommunica3on

•  Blockingcommunica3on–  Sendersendsmessage,waitsun3lreceiverreplies–  Doesnotdoanythinginthemean3me

•  Non–blockingcommunica3on–  Sendersendsmessage,thencon3nuesitsownworkwithoutwai3ng

– Whenreceiverreplies,orsomeothermessagearrives,nodeinterruptscurrentworktohandlemessage

•  Some3mesthesearecalledsynchronous/asynchronous,butwewilltrynottousethat

20

Page 21: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Computa3on•  Synchronous:– Opera3oninrounds–  Inaround,anodeperformssomecomputa3on,andthensendssomemessages

– Allmessagessentattheendofroundxareavailabletorecipientsatstartofroundx+1•  Butnotearlier

21

Computa3on

Communica3on

Round1 Round2 Round3

Page 22: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Communica3on

•  Synchronous– Canbeimplementedifmessagetransmission3meisboundedbysomeconstantsaym

– Computa3on3mesforallnodesareboundedbysomeconstantc

– Clocksaresynchronized(sufficiently)– Thenseteachroundtobem+cindura3on

22

Page 23: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

AsynchronousCommunica3on

•  Nosynchroniza3onorrounds– Nodescomputeatdifferentandarbitraryspeeds– Messagesproceedatdifferentspeeds:maybearbitrarilydelayed,maybereceivedatany3me

•  Worstcasemodel– Noassump3onaboutspeedsofprocessesorchannel–  (Butdoesnotincludecommunica3on/computa3onerrors)

23

Page 24: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

AsynchronousCommunica3on

•  Hardertomanage– Messagecanarriveatany3meaferbeingsent,mustbehandledsuitably

– Possibletomakesomesimplifyingassump3onsE.g.:•  ChannelsareFIFO:orderofmessagesonachannelarepreserved•  Somecodeblocksareatomic(notinterruptedbymessages)•  Eithercommunica3onorcomputa3on3mesbounded

24

Page 25: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Synchronouscommunica3oninRealsystems

•  Synchronouscommunica3oncanbeafairmodel•  Moderncomputersandnetworksarefast–  (thoughnotarbitrarilyfast)

•  Easiertodesignalgorithmsandanalyze•  Welldesignedalgorithmsarefasterandmoreefficient

•  Ofencanbeadaptedtoasynchronoussystems– Ofenastar3ngpointfordesign

25

Page 26: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Failures

•  Nodesmayfail– Hardwarefailure– Runoutofenergyorpowerfailure– Sofwarefailure(crash)– Permanent– Temporary(whathappenswhenitrestarts?Recoversthestate?Startsfromini3alstate?)

– Modeldependsonsystem.E.g.differenttypesoffailuresoccurwithcorrespondingprobabili3es

26

Page 27: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Nodefailures•  Commonabstractmodels

–  Stoppingfailure:nodejuststopsworking•  Mayneedassump3onsaboutwhichcomputa3on/communica3onitfinishesbeforestopping

•  Mayneedassump3onaboutneighborsknowingoffailure–  Byzan3nefailure:nodebehavesasanadversary

•  Imagineyourenemyhastakencontrolofthenode•  Istryingtospoilyourcomputa3on

•  Nodesmayfailindividually–  E.g.eachnodefailswithprobabilityp

•  Nodesmayhavecorrelatedfailure–  E.g.allnodesfailinaregion(datacenter,sensorfield)

27

Page 28: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Link/communica3onfailure•  Maybetemporary/permanent•  Mayhappendueto

–  Hardwarefailure–  Noise:electronicdevices(microwavesetc)maytransmitradiowavesatsimilarfrequenciesanddisruptcommunica3on

–  Interference:Othercommunica3ngnodesnearbymaydisruptcommunica3on

•  Effects–  Channelsilentandunusable(hardwarefailure)–  Channelac3ve,butunusableduetonoiseandinterference–  Channelac3ve,butmaycontainerroneousmessage(maybedetectedbyerrorcorrec3ngcodes)

28

Page 29: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Security•  Issues:– Unauthorizedaccess,modifica3on.Makingsystemsunavailable(DOS)

– AWackononeormorenodes•  Causingtoitfail•  Readdata•  Takingcontroltoreadfuturedata,disruptopera3on

– AWackoncommunica3onlinks/channel•  Blockcommunica3on•  Readdatainthechannel(easyinwirelesswithoutencyp3on)

•  Corruptdatainthechannel

29

Page 30: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Security

•  Solu3onsusuallyhavespecificassump3onsofwhattheadversarycando

•  E.g.Ifadversaryhasaccesstochannel– Cryptographymaybeabletopreventreading/corrup3ngdata

30

Page 31: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Mobility

•  Movementmakesithardertodesigndistributedsystems–  Communica3onisdifficult

•  Delays,lostmessages•  Edgeweightscanchange

– Applica3onsthatdependonloca3onmustadapttomovement

•  Howdopeoplemove?Whatisamodelofmovement?– Notyetwellunderstood

31

Page 32: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

Modelingdistributedsystems

•  Manypossibili3es•  Chooseyourassump3onscarefullyforyourproblem

•  PaycloseaWen3ontowhatisknownaboutcommunica3on/network

•  Startwithsimplermodels– Usuallymoreassump3ons,fewerparameters– Seewhatcanbeachieved– Thentrytodrop/relaxassump3ons

32

Page 33: Distributed Systems Communicaon and models · E.g. different types of failures occur with corresponding probabili3es 26 Node failures • Common abstract models – Stopping failure:

33