hedera : dynamic flow scheduling for data center networks

33
Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat hics stolen from original NSDI 2010 slides (thanks Moha Presented by TD Hedera: Dynamic Flow Scheduling for Data Center Networks

Upload: jenn

Post on 23-Feb-2016

50 views

Category:

Documents


0 download

DESCRIPTION

Hedera : Dynamic Flow Scheduling for Data Center Networks. Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat. Presented by TD. Graphics stolen from original NSDI 2010 slides (thanks Mohammad!). Easy to understand problem - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Hedera : Dynamic Flow Scheduling for Data Center Networks

Mohammad Al-FaresSivasankar RadhakrishnanBarath RaghavanNelson HuangAmin Vahdat

Graphics stolen from original NSDI 2010 slides (thanks Mohammad!)

Presented by TD

Hedera: Dynamic Flow Scheduling for Data Center Networks

Page 2: Hedera : Dynamic Flow Scheduling for Data Center Networks

Easy to understand problem• MapReduce style DC applications need bandwidth • DC networks have many ECMP paths between servers • Flow-hash-based load balancing insufficient

Simple key insight/idea• Find flows that need bandwidth and periodically rearrange

them in the network to balance load across ECMP paths

Algorithmic challenges• Estimate bandwidth demands of flows• Find optimal allocation network paths to flows

Page 3: Hedera : Dynamic Flow Scheduling for Data Center Networks

ECMP Paths• Many equal cost paths going up to the core switches• Only one path down from each core switch• Randomly allocate paths to flows using hash of the flow

– Agnostic to available resources– Long lasting collisions between long (elephant) flows

DS

Page 4: Hedera : Dynamic Flow Scheduling for Data Center Networks

Collisions of elephant flows

• Collisions possible in two different ways– Upward path– Downward path

D1S1 D2S2 D3S3 D4S4

Page 5: Hedera : Dynamic Flow Scheduling for Data Center Networks

Collisions of elephant flows

• Average of 61% of bisection bandwidth wasted on a network of 27K servers

D1S1 D2S2 D3S3 D4S4

Page 6: Hedera : Dynamic Flow Scheduling for Data Center Networks

Hedera Scheduler

• Detect Large Flows– Flows that need bandwidth but are network-limited

• Estimate Flow Demands– Use min-max fairness to allocate flows between src-dst pairs

• Place Flows– Use estimated demands to heuristically find better placement of

large flows on the ECMP paths

EstimateFlow Demands

PlaceFlows

DetectLarge Flows

Page 7: Hedera : Dynamic Flow Scheduling for Data Center Networks

Elephant Detection

• Scheduler continually polls edge switches for flow byte-counts– Flows exceeding B/s threshold are “large”

• > %10 of hosts’ link capacity (i.e. > 100Mbps)

• What if only mice on host?– Default ECMP load-balancing efficient for small flows

Page 8: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

• Flows can be constrained in two ways– Host-limited (at source, or at destination)– Network-limited

• Measured flow rate is misleading

• Need to find a flow’s “natural” bandwidth requirement when not limited by the network

• Forget network, just allocate capacity between flows using min-max fairness

Page 9: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

• Given traffic matrix of large flows, modify each flow’s size at it source and destination iteratively…

– Sender equally distributes bandwidth among outgoing flows that are not receiver-limited

– Network-limited receivers decrease exceeded capacity equally between incoming flows

– Repeat until all flows converge

• Guaranteed to converge in O(|F|) time

Page 10: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

A

B

C

X

Y

Flow Estimate Conv. ?AXAYBYCY

Sender Available Unconv. BW Flows Share

A 1 2 1/2B 1 1 1C 1 1 1

Senders

Page 11: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

Recv RL? Non-SLFlows Share

X No - -Y Yes 3 1/3

ReceiversFlow Estimate Conv. ?AX 1/2AY 1/2BY 1CY 1

A

B

C

X

Y

Page 12: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

Flow Estimate Conv. ?AX 1/2AY 1/3 YesBY 1/3 YesCY 1/3 Yes

Sender Available Unconv. BW Flows Share

A 2/3 1 2/3B 0 0 0C 0 0 0

Senders

A

B

C

X

Y

Page 13: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

Flow Estimate Conv. ?AX 2/3 YesAY 1/3 YesBY 1/3 YesCY 1/3 Yes

Recv RL? Non-SLFlows Share

X No - -Y No - -

Receivers

A

B

C

X

Y

Page 14: Hedera : Dynamic Flow Scheduling for Data Center Networks

Flow Placement

• Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized– That is maximum utilization of theoretically available b/w

• Two approaches– Global First Fit:

Greedily choose path that has sufficient unreserved b/w– Simulated Annealing:

Iteratively find a globally better mapping of paths to flows

Page 15: Hedera : Dynamic Flow Scheduling for Data Center Networks

Global First-Fit

• New flow detected, linearly search all possible paths from SD• Place flow on first path whose component links can fit that flow

?Flow AFlow BFlow C

? ?0 1 2 3

Scheduler

Page 16: Hedera : Dynamic Flow Scheduling for Data Center Networks

Global First-Fit

• Flows placed upon detection, are not moved• Once flow ends, entries + reservations time out

Flow AFlow BFlow C

0 1 2 3

Scheduler

Page 17: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

• Annealing: slowly cooling metal to give it nice properties like ductility, homogeneity, etc– Heating to enter high energy state (shake up things)– Slowly cooling to let the crystalline structure settle down in

a low energy state

• Simulated Annealing: treat everything as metal– Probabilistically, shake things up– Let it settle slowly (gradient descent)

Page 18: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

• 4 specifications– State space– Neighboring states– Energy– Temperature

• Simple example: Minimizing f(x)

f(x)

x

Page 19: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing• State: All possible mapping of flows to paths

– Constrained to reduce state space size– Flows to a destination constrained to use same core

• Neighbor State: Swap paths between 2 hosts– Within same pod, – Within same ToR,– etc

Page 20: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing• Function/Energy: Total exceeded b/w capacity

– Using the estimated demand of flows– Minimize the exceeded capacity

• Temperature: Iterations left– Fixed number of iterations (1000s)

• Achieves good core-to-flow mapping– Sometimes very close to global optimal– Non-zero probability of worse state

Page 21: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

– Example run: 3 flows, 3 iterations

0 1 2 3Flow AFlow BFlow C

Core210

? ? ?202

?203

Scheduler

Page 22: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

• Final state is published to the switches and used as the initial state for next round

0 1 3Flow AFlow BFlow C

Core ? ? ? ?203

2

Scheduler

Page 23: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

• Optimizations– Assign a single core switch to each destination host– Incremental calculation of exceeded capacity– Using previous iterations best result as initial state

Page 24: Hedera : Dynamic Flow Scheduling for Data Center Networks

Fault-Tolerance

• Link / Switch failure– Use PortLand’s fault notification protocol– Hedera routes around failed components

0 1 3Flow AFlow BFlow C

2

Scheduler

Page 25: Hedera : Dynamic Flow Scheduling for Data Center Networks

Fault-Tolerance

• Scheduler failure– Soft-state, not required for correctness (connectivity)– Switches fall back to ECMP

0 1 3Flow AFlow BFlow C

2

Scheduler

Page 26: Hedera : Dynamic Flow Scheduling for Data Center Networks

Rand0Rand1

RandBij0RandBij1

RandX2RandX3

Hotspot0

2

4

6

8

10

12

14

16ECMP Global First-Fit

Simulated Annealing Non-blocking

Communication Pattern

Bise

ction

Ban

dwid

th (G

bps)

Evaluation

Stag1(.2,.3)

Stag2(.2,.3)

Stag1(.5,.3)

Stag2(.5,.3)

Stride(2)

Stride(4)

Stride(8)

0

2

4

6

8

10

12

14

16ECMP Global First-FitSimulated Annealing Non-blocking

Communication PatternBi

secti

on B

andw

idth

(Gbp

s)

Stag(.5,0)

Stag(.2,.3)

Stride(1)

Stride(16)

Stride(256)

RandBijRand

0

1000

2000

3000

4000

5000

6000

7000ECMP Global First-Fit Simulated Annealing Non-blocking

Communication Pattern

Bise

ction

Ban

dwid

th (G

bps)

Page 27: Hedera : Dynamic Flow Scheduling for Data Center Networks

Evaluation

• 16-hosts: 120 GB all-to-all in-memory shuffle

– Hedera achieves 39% better bisection BW over ECMP, 88% of ideal non-blocking switch

ECMP GFF SA ControlTotal Shuffle Time (s) 438.4 335.5 336.0 306.4

Avg. Completion Time (s) 358.1 258.7 262.0 226.6

Avg. Bisection BW (Gbps) 2.81 3.89 3.84 4.44

Avg. host goodput (MB/s) 20.9 29.0 28.6 33.1

Data Shuffle

Page 28: Hedera : Dynamic Flow Scheduling for Data Center Networks

Reactiveness

• Demand Estimation: – 27K hosts, 250K flows, converges < 200ms

• Simulated Annealing:– Asymptotically dependent on # of flows + # iterations

• 50K flows and 10K iter: 11ms

– Most of final bisection BW: first few hundred iterations

• Scheduler control loop:– Polling + Estimation + SA = 145ms for 27K hosts

Page 29: Hedera : Dynamic Flow Scheduling for Data Center Networks

Limitations

• Dynamic workloads, large flow turnover faster than control loop

– Scheduler will be continually chasing the traffic matrix

• Need to include penalty term for unnecessary SA flow re-assignmentsFlow Size

Mat

rix S

tabi

lity

Stab

leUn

stab

le

ECMP Hedera

Page 30: Hedera : Dynamic Flow Scheduling for Data Center Networks

Conclusions

• Simulated Annealing delivers significant bisection BW gains over standard ECMP

• Hedera complements ECMP – RPC-like traffic is fine with ECMP

• If you are running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; tiny investment!

Page 31: Hedera : Dynamic Flow Scheduling for Data Center Networks

Thoughts [1]

Inconsistencies

• Motivation is MapReduce, but Demand Estimation assumes sparse traffic matrix– Something fuzzy

• Demand Estimation step assumes all bandwidth available at host – Forgets the poor mice :’(

Page 32: Hedera : Dynamic Flow Scheduling for Data Center Networks

Thoughts [2]

Inconsistencies

• Evaluation Results (already discussed)

• Simulation using their own flow-level simulator– Makes me question things

• Some evaluation details fuzzy– No specification of flow sizes (other than shuffle), arrival

rate, etc

Page 33: Hedera : Dynamic Flow Scheduling for Data Center Networks

Thoughts [3]

• Periodicity of 5 seconds– Sufficient? – Limit? – Scalable to higher b/w?