a large scale study of data center network reliability · data center network reliability ... older...

45
A Large Scale Study of Data Center Network Reliability Justin Meza * ,• Tianyin Xu ‡,• Kaushik Veeraraghavan Onur Mutlu †, * Facebook *CMU ETH Zürich UIUC

Upload: others

Post on 18-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

A Large Scale Study of Data Center Network Reliability

Justin Meza *,•

Tianyin Xu ‡,• Kaushik Veeraraghavan •

Onur Mutlu †,*

•Facebook *CMU †ETH Zürich ‡UIUC

Page 2: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

2

Page 3: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

3

Page 4: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

4

Page 5: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

5

Internet ISP Edge Node

WAN

Page 6: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

6

Core Switches Data Center Fabric Top of Rack Switch

Page 7: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

7

Internet ISP Edge Node

WAN

Core Switches Data Center Fabric Rack Switch

Page 8: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• As DC networks become more software managed, the next challenge is first and last hop reliability

• Growth in geo replicated software drives the need for

reliable backbone network capacity planning

8

Key takeaways

Page 9: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Tracking how network failures affect software

• A next challenge for data center network reliability

• Geo replication and backbone capacity planning

• Concluding thoughts

9

Roadmap

Page 10: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

10

Network Incidents cause

Software Failures that result in

Site Events (SEVs)

Page 11: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

11

Core Switches Data Center Fabric Top of Rack Switch

Page 12: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

Automated repair

• Not deployed on all devices, but highly effective • Top fixes: port failures, config files, switch ping failures 12

Page 13: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

13

• Filled out by engineers who fixed the problem

• Contain metadata (switch ID, switch type, ...)

• Classified based on severity

• Continually audited for accuracy and completeness

SEV reports

Page 14: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

14

• Want to know software impact and severity

• Events alone don't provide enough context

• Often masked by redundancy and automated repairs

• We examine a different class of failures

• Software failures resulting from network events

• Top-line impact on software reliability

Network incidents, not events

Page 15: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Tracking how network failures affect software

• A next challenge for data center network reliability

• Geo replication and backbone capacity planning

• Concluding thoughts

15

Roadmap

Page 16: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

16

• Simple, custom switches

• Software-based fabric networks

• Automated repair

Data center network trends

Page 17: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

17

• Simple, custom switches

• Software-based fabric networks

• Automated repair

Data center network trends

Page 18: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

18

Older cluster-based design

Page 19: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Cluster network incidents increased 9x over 4 years 19

Older cluster-based design

Page 20: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

20

Cluster versus fabric designs

Page 21: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Cluster have 2X total incidents & 2.8X on a per-device level 21

Cluster versus fabric designs

Page 22: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

22

DC fabric has fewer incidents

• Reversing the negative software-level reliability trend

Page 23: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

23

First and last hop reliability

Page 24: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

24

First and last hop reliability

Page 25: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

25

First and last hop reliability

Page 26: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

26

of network devices82%

rack switches make up

Page 27: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

27

Main cause across all severities

Page 28: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

28

• More redundant switches is one approach

• Make software more resilient

• More aggressive automated repairs

Implications for DC networks

Page 29: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Tracking how network failures affect software

• A next challenge for data center network reliability

• Geo replication and backbone capacity planning

• Concluding thoughts

29

Roadmap

Page 30: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

30

Page 31: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

31

Backbone traffic growth

Page 32: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

32

Page 33: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

33

• Shared resource

• Frequent link failures

• Capacity planning dictates reliability

Data center backbones

Page 34: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Sent via email for maintenances and outages

• Parsed and logged in a database

• Used to compute reliability statistics: • Mean time between failures (MTBF) • Mean time to repair (MTTR)

34

Measuring backbone reliability

Page 35: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

35

Edge node MTBF distribution

• Typical edge node failure rate is on the order of months

Page 36: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

36

Edge node MTTR distribution

• Edge node mean time to repair is on the order of hours

Page 37: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

37

Fiber vendor MTBF distribution

• Typical vendor link failure rate is on the order of months

Page 38: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

38

Fiber vendor MTTR distribution

• Vendor MTBF and MTTR span multiple orders of magnitude

Page 39: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

39

Minimizing backbone outages

Model 2

Model 3

...

Simulation objective = six 9's yearly reliability

Capacity planNode1: Links A, BNode 2: Links X, Y

Page 40: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

40

Minimizing backbone outages

Model 2

Model 3

...

Simulation objective = six 9's yearly reliability

Capacity planNode1: Links A, BNode 2: Links X, Y

Page 41: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

Forest City Data Center

41

Page 42: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

Forest City Data Center

42

Page 43: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• Tracking how network failures affect software

• A next challenge for data center network reliability

• Geo replication and backbone capacity planning

• Concluding thoughts

43

Roadmap

Page 44: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

• First and last hop reliability forces us to rethink how

network and software share the task of reliability

• Reliable backbone planning is a key enabler for geo

replication and software management flexibility

44

Concluding thoughts

Page 45: A Large Scale Study of Data Center Network Reliability · Data Center Network Reliability ... Older cluster-based design. 20 Cluster versus fabric designs ... • Concluding thoughts

45