towards a redundancy -aware network stack for data centersmusa/uploads/hotnets_2016_talk.pdf ·...
TRANSCRIPT
TowardsaRedundancy-AwareNetworkStackforDataCenters
AliMusaIftikhar(Tufts)
Ihsan AyyubQazi(LUMS)
FahadR.Dogar(Tufts)
TheProblemofTailLatency inDataCenters!
2
TheProblemofTailLatency inDataCenters!
Highfan-out
3
TheProblemofTailLatency inDataCenters!
• Loadimbalance• Backgroundtasks• Failures,etc.
Highfan-out
Straggler
4
TheProblemofTailLatency inDataCenters!
• Loadimbalance• Backgroundtasks• Failures,etc.
LongtaillatencyHighfan-out Stragglers+
Highfan-out
Straggler
5
Howtoavoidstragglers?
Reactively Proactively
6
Howtoavoidstragglers?
Reactively Proactively
PRO:lowoverhead
Hopper(SIGCOMM’15)C3(NSDI’15)
Sinbad(SIGCOMM’13)
CON:requiresstragglerdetection(slowandinaccurate)
7
Howtoavoidstragglers?
Reactively Proactively
PRO:lowoverhead
Dolly(NSDI’13)Lowlatencyvia
redundancy(CoNext’13)
Hopper(SIGCOMM’15)C3(NSDI’15)
Sinbad(SIGCOMM’13)
PRO:fastandaccurate
CON:requiresdeterminingthresholdload(non-trivial)
CON:requiresstragglerdetection(slowandinaccurate)
8
Howtoavoidstragglers?
Reactively Proactively
PRO:lowoverhead
Can we achieve the benefits of both without their limitations?
Dolly(NSDI’13)Lowlatencyvia
redundancy(CoNext’13)
Hopper(SIGCOMM’15)C3(NSDI’15)
Sinbad(SIGCOMM’13)
PRO:fastandaccurate
CON:requiresdeterminingthresholdload(non-trivial)
CON:requiresstragglerdetection(slowandinaccurate)
9
Overview• Duplicate-Aware Scheduling Framework
• Redundancy-Aware Network Stack
• Preliminary Results
Genericframework
NewnetworkstackforDC
10
Duplicate-awarescheduling
Replica1
Client
Replica2 11
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
high
low
1. PriorityQueues
12
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
high
low
request
1. PriorityQueues
13
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
high
low
request
1. PriorityQueues
14
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
high
low
P
request
B
1. PriorityQueues
15
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
P
high
lowB
request
1. PriorityQueues
16
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
high
lowB
request
1. PriorityQueues
17
Duplicate-awarescheduling
Replica1
Client
Replica2
high
low
high
lowB
request
purge
1. PriorityQueues2. Purging
18
NeedforPriorityQueuing
high
lowbackup
primary
19
ØDuplicationhasanoverhead!
L
NeedforPriorityQueuing
high
lowbackup
primary
üStrictprioritiesüWorkconservationüPreemption
20
ØDuplicationhasanoverhead!
LPropertiesrequired:
NeedforPriorityQueuing
high
lowbackup
primary
üStrictprioritiesüWorkconservationüPreemption
21
ØDuplicationhasanoverhead!
LPropertiesrequired:
PQ makes the overhead of duplication low. sJ
NeedforPriorityQueuing
high
lowbackup
primary
üStrictprioritiesüWorkconservationüPreemption
22
ØDuplicationhasanoverhead!
LPropertiesrequired:
PQ makes the overhead of duplication low. sJessential
ImportanceofPurging
ØStalerequestsblocknewrequests.
Lhigh
lowreq1req2
stale
23
ImportanceofPurging
high
lowreq1req2
stale
ØStalerequestsblocknewrequests.
L
Purging makes the system more efficient! A J24
ImportanceofPurging
high
lowreq1req2
stale
ØStalerequestsblocknewrequests.
L
Purging makes the system more efficient! A Joptimization 25
RealizingDuplicate-AwareSchedulingateverypotentialbottleneck resourceinaDC
26
RealizingDuplicate-AwareSchedulingateverypotentialbottleneck resourceinaDC
Network
27
RealizingDuplicate-AwareSchedulingateverypotentialbottleneck resourceinaDC
Compute
Network
28
RealizingDuplicate-AwareSchedulingateverypotentialbottleneck resourceinaDC
Memory
Compute
Network
29
RealizingDuplicate-AwareSchedulingateverypotentialbottleneck resourceinaDC
GFSHDFS BigTable
Memory
Compute
Filesystem/Database
Network
30
RealizingDuplicate-AwareSchedulingateverypotentialbottleneck resourceinaDC
GFSHDFS BigTable
Storage
Memory
Compute
Filesystem/Database
Network
31
ateverypotentialbottleneck resourceinaDC
GFSHDFS BigTable
Memory
Compute
Storage
Network
Filesystem/Database
In-networkpurging
Prioritization
Purging+preemption
challenges
32
RealizingDuplicate-AwareScheduling
RedundancyAwareNetworkStack(RANS)
Application
Transport
Link
Network
Duplicate-Awareness
Pointtomultipoint
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Layer NewRole
+purging
+purging
33
RedundancyAwareNetworkStack(RANS)
Application
Transport
Link
Network
Duplicate-Awareness
Pointtomultipoint
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Layer NewRole
+purging
Applicationsneedtobemodified.
challenge
+purging
34
RedundancyAwareNetworkStack(RANS)
Application
Transport
Link
Network
Duplicate-Awareness
Pointtomultipoint
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Layer NewRole
+purging
Applicationsneedtobemodified.
Expressiveinterfaceallowsrichcommunicationb/wAppand
Transport.E.g.DAG
challenge
opportunity
+purging
35
RedundancyAwareNetworkStack(RANS)
Application
Transport
Link
Network
Duplicate-Awareness
Pointtomultipoint
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Layer NewRole
+purging
Applicationsneedtobemodified.
Expressiveinterfaceallowsrichcommunicationb/wAppand
Transport.E.g.DAG
challenge
opportunity
Hardtoimplementperpacketpurging.
challenge+purging
36
RedundancyAwareNetworkStack(RANS)
Application
Transport
Link
Network
Duplicate-Awareness
Pointtomultipoint
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Layer NewRole
+purging
Applicationsneedtobemodified.
Expressiveinterfaceallowsrichcommunicationb/wAppand
Transport.E.g.DAG
challenge
opportunity
Hardtoimplementperpacketpurging.
challenge
AddssupportforexistingPQsinDCswitches.
opportunity
+purging
37
RedundancyAwareNetworkStack(RANS)
Application
Transport
Link
Network
Duplicate-Awareness
Pointtomultipoint
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Layer NewRole
+purging
Applicationsneedtobemodified.
Expressiveinterfaceallowsrichcommunicationb/wAppand
Transport.E.g.DAG
challenge
opportunity
Hardtoimplementperpacketpurging.
challenge
AddssupportforexistingPQsinDCswitches.
opportunity
+purging
38
e.g.Improvedfaulttolerance
üMultipath
üMulti-destination
RANSTransport:PointtoMulti-point
ØEnables:Richtransport
Sender1(replica1)
Receiver(client)
Sender2(replica2)
39
RANSTransport:ByteAggregation
Sender1(replica1)
Receiver(client)
Sender2(replica2)
ØOpportunity:Receiverdriventransport
Response
e.g.Moreefficientcongestioncontrol(2xormore)
üTwoormoreresponsestreams
üAggregatebytesatreceiverside
40
RANSTransport:PriorityAssignment
Sender1(replica1)
Receiver(client)
Sender2(replica2)
ØDynamicreplicaassignment
Response+Feedback
e.g.Improvedreplicaassignment
üFinegrainedmonitoringofcongestionwindow
üDynamicallyreprioritizeflows
üFeedbacktoApplication
41
Overview• Duplicate-Aware Scheduling Framework
• Redundancy-Aware Network Stack
• Preliminary Results
42
PreliminaryEvaluation:ns-2setupdetailsØ Storagescenario
Client
10servers
Replica1
Replica2
43
PreliminaryEvaluation:ns-2setupdetailsØ Storagescenario
Client
10servers
Replica1
Replica2
bottlenecks
44
PreliminaryEvaluation:ns-2setupdetails
TrafficDetails
Totalrequests 20K
Arrivalprocess Poisson
Server&replicaselection Uniformlyrandom
Ø Storagescenario
Client
10servers
Replica1
Replica2
bottlenecks
45
PreliminaryEvaluation:ns-2setupdetails
TrafficDetails
Totalrequests 20K
Arrivalprocess Poisson
Server&replicaselection Uniformlyrandom
Ø Storagescenario
Client
10servers
Replica1
Replica2
bottlenecks
The only source of stragglers is load imbalance.
46
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
47
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
48
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
49
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
50
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
51
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
~2X
52
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
~2X
Expecting more gains even at lower loads with additional straggler sources.
53
Noduplicates(baseline)
2-copies(proactivew/oPQ)
+PQs
+Purging
+ByteAggregation(RANS)
Averagerequestcompletiontimeof:
Load(%)
Requ
estcom
pletiontim
e(s)
50-80% improvement over the baseline across all loads.
Expecting more gains even at lower loads with additional straggler sources.
~2X
54
Summary&Futurework
• TheIssueofStragglers
• Duplicate-AwareSchedulingFramework
• RANS
• ImplementinginHDFSandCassandra
Simpleyetchallengingsolution
Afirststeptowardsaduplicate-awarenetwork
55
RANS:FeedbackandDiscussion
• AliMusaIftikhar([email protected])
• FahadR.Dogar ([email protected])
• Ihsan A.Qazi ([email protected])Transport
Link
Network
PointtomultipointByteaggregationPriorityassignment
PriorityQueues
Physical Sameasbefore
Sameasbefore
ExpressiveInterface
Application Duplicate-Awareness
Layer NewRole
+purging
+purging
56
Possiblequestions– backupslide• Preemptionoverhead
• Notreallyanissueinthenetworkbecausepacketsaresmall.
• Packetpurging• PFC(backpressure,buildqueuesattheendhostsandpurgethem)
• Droptheentireduplicatequeue(easierthanper-packetdrops)
• Recenttrendtowardsprogrammableswitches
• GainswithPQ• Moregainswithfailuresasstragglers(primaryundergoesafailure)
• Alsomorebenefitswithdifferentresources
• Duplicationoverheadatclient• Clientisusuallynotthebottleneck
• Non-Idempotentrequests• Wearetargetingtheclassofappswhichhaveflexibleendpointsandrequireatleastoncesemantics
• Replicatingonlysmallpacketsandprioritizingthem• Onlybeneficialwithbursty smallflows• HDFShaveatypicalchunksizeb/w64MB-128MB
• Quorumsystems• RANScomplementssuchsystems,theycanusethistechniqueandsendKoutofNrequestsathighprio whileN-Kasbackups
• Can’tjustimplementattheappandgetthesamebenefits?• Networkcouldbeabottleneck• Finegrainedcontrol,muchmorecontrol
• Rootcausesofperformanceimprovement• PQavoidsoverheads• Nowwecaneasilygetthebenefitsofduplicationslikeaggregationetc.
• Purgingwillalsoattimespurgeprimarymakingthesystemmoreefficient.
57
Foodforthought
DCPrimary DCFailover
InterDCDuplicate-AwareScheduling
e.g.Google’sGeo-DistributedDatabase“Spanner”(OSDI’12)
58
Foodforthought
DCPrimary DCFailover
InterDCDuplicate-AwareScheduling
e.g.Google’sGeo-DistributedDatabase“Spanner”(OSDI’12)
Prefetch
SearchsuggestionsSpellcheck
• Searchenginesdropspellcheck,suggestions,etc.athighloads.
• Canbenefitfromduplicate-awarescheduling.
59
WhenRANSworksbest?
• Applicationfanout ishighandstragglersarefrequent.• End-pointsareflexibleand“atleastonce”semanticsaresufficient.• Clientisnotthebottleneck.• Requestsizesaresmall(orpreemptionoverheadisminimal).
60