Network for the Large-scale Hadoop Cluster at Yahoo! JAPAN



2016/10/27

Kai Fukazawa, Yahoo Japan Corporation


Agenda

- Hadoop and Related Network
- Yahoo! JAPAN's Hadoop Network Transition
- Network-Related Problems and Solutions
  - Network-Related Problems
  - Network Requirements of the Latest Cluster
  - Adopted IP CLOS Network for Solving Problems
- Yahoo! JAPAN's IP CLOS Network
  - Architecture
  - Performance Tests
  - New Problems
- Future Plan

Hadoop and Related Network

Hadoop has various communication events:
- Heartbeat
- Reports (Job/Block/Resource)
- Block data transfer

"HDFS Architecture". Apache Hadoop. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. (10/06/2016).
"Google I/O 2011: App Engine MapReduce". (05/11/2011). Retrieved https://www.youtube.com/watch?v=EIxelKcyCC0. (10/06/2016).


Heartbeats and reports are North/South traffic (worker nodes to master nodes), while block data transfer is East/West traffic (worker node to worker node). The East/West traffic volume is high; the North/South volume is low.


"Introduction to Facebook's data center fabric". (11/14/2014). Retrieved https://www.youtube.com/watch?v=mLEawo6OzFM. (10/06/2016).


Oversubscription is commonly expressed as the ratio of the desired (required) bandwidth to the available bandwidth.

Example: a rack of 40 nodes with 1 Gbps NICs behind a 10 Gbps uplink:
1 Gbps NIC x 40 nodes = 40 Gbps desired vs. 10 Gbps available
Oversubscription = 40 : 10 = 4 : 1

Hadoop Operations by Eric Sammer (O'Reilly). Copyright 2012 Eric Sammer, 978-1-449-32705-7.
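The same arithmetic recurs for every cluster below, so here is a minimal Python sketch of it (the function is ours, for illustration only; the example values come from the slides):

```python
def oversubscription(nic_gbps: float, nodes: int, uplink_gbps: float) -> str:
    """Oversubscription = desired bandwidth : available uplink bandwidth."""
    desired = nic_gbps * nodes        # total bandwidth the servers can demand
    ratio = desired / uplink_gbps     # normalize to 'X : 1'
    return f"{ratio:g} : 1"

print(oversubscription(nic_gbps=1, nodes=40, uplink_gbps=10))   # '4 : 1'
print(oversubscription(nic_gbps=1, nodes=90, uplink_gbps=20))   # '4.5 : 1' (Cluster1)
```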

Yahoo! JAPAN's Hadoop Network Transition

[Bar chart: cluster volume in PB at each release - Cluster1 (Jun. 2011), Cluster2 (Jan. 2013), Cluster3 (Apr. 2014), Cluster4 (Dec. 2015), Cluster5 (Jun. 2016), growing from 3 PB to 75 PB]

Cluster1
- Architecture: Stack (4 switches/stack, up to ~10 switches)
- Nodes/Rack: 90
- Server NIC: 1 Gbps
- Uplink: 20 Gbps
- Oversubscription: 4.5 : 1

Cluster2
- Architecture: Spanning Tree Protocol (redundant uplinks blocked by STP)
- Nodes/Rack: 40
- Server NIC: 1 Gbps
- Uplink: 10 Gbps
- Oversubscription: 4 : 1

Cluster3
- Architecture: L2 Fabric/Channel
- Nodes/Rack: 40
- Server NIC: 1 Gbps
- Uplink: 20 Gbps
- Oversubscription: 2 : 1

Cluster4
- Architecture: L2 Fabric/Channel
- Nodes/Rack: 16
- Server NIC: 10 Gbps
- Uplink: 80 Gbps
- Oversubscription: 2 : 1

Yahoo! JAPAN's Hadoop Network Transition

Release    Volume    #Nodes/Switch    NIC        Oversubscription
Cluster1   3 PB      90               1 Gbps     4.5 : 1
Cluster2   20 PB     40               1 Gbps     4 : 1
Cluster3   38 PB     40               1 Gbps     2 : 1
Cluster4   58 PB     16               10 Gbps    2 : 1
Cluster5   75 PB     ?                ? Gbps     ? : ?

Network-Related Problems and Solutions

Network-Related Problems
- Effect of switch failure in the Stack architecture
- Load on the switch due to BUM traffic
- Limitations for the DataNode decommission
- Limitations for the scale-out

Effect of switch failure in the Stack architecture
- One of the switches forming a stack failed, and the failure affected the other switches in the same stack
- Communication among 90 nodes (5 racks) was interrupted
- Result: insufficient computing resources and processing stoppage

Load on the switch due to BUM traffic
- In the L2 Fabric, ~4,400 nodes share one network; ARP traffic from the servers drives up CPU load on the core switches
- Workaround: tuning the ARP cache entry timeout
- The root problem is the large network address (a single broadcast domain spanning the cluster)

Limitations for the DataNode decommission
- Decommissioning triggers block re-replication traffic, so its impact on running jobs must be considered
- The number of nodes decommissioned at a time had to be limited

Limitations for the scale-out
- Stack architecture: up to ~10 switches per stack
- L2 Fabric architecture: limited by the number of chassis

Network Requirements of the Latest Cluster
- 120-200 racks, scale-out possible up to 10,000 nodes
- 100-200 Gbps uplink per rack
- Servers with 10 Gbps NICs, 20 nodes/rack
- Data center located in the US

How to solve these problems?

We adopted IP CLOS Network!

Adopted IP CLOS Network for Solving Problems

Google, Facebook, Amazon, Yahoo... the "Over The Top" companies have adopted the IP CLOS DC network architecture.

"Introducing data center fabric, the next-generation Facebook data center network". Facebook Code. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/. (10/06/2016).

Benefits:
- Improved scalability
- Improved high availability
- Copes with the increase in East/West traffic
- Reduction in operating cost

Yahoo! JAPAN's IP CLOS Network

Box switch architecture: no limitation on scale-out, but it requires many switches.

[Diagram: IP CLOS fabric built from box switches - Spine at the top, Leaf below, ToR at the bottom]
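To see why box switches remove the scale-out cap, it helps to count ports. A rough sizing sketch for one two-tier leaf-spine stage (all port counts here are illustrative assumptions, not figures from the talk):

```python
def clos_capacity(spine_ports: int, leaf_ports: int, uplinks_per_leaf: int) -> dict:
    """Rough size of a two-tier leaf-spine fabric.

    Each leaf reserves `uplinks_per_leaf` ports for spines (one per spine);
    every spine port can feed a different leaf.
    """
    spines = uplinks_per_leaf
    max_leaves = spine_ports
    server_ports_per_leaf = leaf_ports - uplinks_per_leaf
    return {
        "spines": spines,
        "max_leaves": max_leaves,
        "max_server_ports": max_leaves * server_ports_per_leaf,
    }

# Illustrative: 32-port spines, 48-port leaves, 4 uplinks per leaf
print(clos_capacity(spine_ports=32, leaf_ports=48, uplinks_per_leaf=4))
# -> {'spines': 4, 'max_leaves': 32, 'max_server_ports': 1408}
```

Growing the fabric then means adding more box switches (or another tier) rather than buying ever-bigger chassis.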

Architecture

48

・・・・・

Internet

Spine

Core

Router

Layer3Layer2・・・・・

Leaf

Architecture

Architecture

Why was this architecture adopted?
- Reduces the items to be managed (IP addresses and cables, interfaces, BGP neighbors, ...)
- Overcomes physical constraints, such as the one-floor limit
- Reduction in cost

Between Spine and Leaf the fabric runs BGP, and traffic is balanced across the equal-cost uplinks with ECMP.
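ECMP typically picks an uplink by hashing a flow's 5-tuple, so all packets of one flow stay on one path. A minimal sketch of the idea (hashing details vary by switch ASIC; this is an illustration, not the switches' actual algorithm):

```python
import hashlib

def ecmp_pick(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: str, n_paths: int) -> int:
    """Pick one of n equal-cost uplinks from a flow's 5-tuple.

    Same flow -> same hash -> same path, which keeps packets in order.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# A DataNode-to-DataNode block transfer always rides the same uplink:
print(ecmp_pick("192.168.0.10", "192.168.0.100", 50010, 50010, "tcp", 4))
```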


Addressing:
- Between Spine and Leaf: /31 point-to-point links
- Per rack: /26 or /27

Because each rack is its own small L3 subnet, broadcast domains stay tiny: this resolved the "BUM traffic problem".
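A small sketch of how such an address plan can be carved out of one aggregate with Python's standard ipaddress module (the 10.0.0.0/16 aggregate is an assumed example, not Yahoo! JAPAN's real plan):

```python
import ipaddress

aggregate = ipaddress.ip_network("10.0.0.0/16")   # assumed aggregate, for illustration

# /26 per rack: 64 addresses, enough for 20 nodes plus gateway and spares
racks = list(aggregate.subnets(new_prefix=26))[:200]

# /31 point-to-point links between Spine and Leaf
p2p_pool = ipaddress.ip_network("10.1.0.0/24")
p2p_links = list(p2p_pool.subnets(new_prefix=31))

print(racks[0])        # 10.0.0.0/26  -> rack 1
print(racks[1])        # 10.0.0.64/26 -> rack 2
print(p2p_links[0])    # 10.1.0.0/31  -> a spine-leaf link
```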

Uplinks and oversubscription
- Leaf uplink: 40 Gbps x 4 = 160 Gbps per rack
- Servers: 20 nodes with 10 Gbps NICs = 200 Gbps per rack
- Oversubscription: 200 : 160 = 1.25 : 1
- This headroom resolved the "limitations for the DataNode decommission"
- The four ECMP uplinks also improved high availability: losing one uplink removes only a quarter of the rack's bandwidth

How the problems fared with IP CLOS:
- Effect of switch failure in the Stack architecture: limited (a single switch failure no longer cascades across a stack)
- Load on the switch due to BUM traffic: resolved
- Limitations for the DataNode decommission: resolved
- Limitations for the scale-out: resolved

Yahoo! JAPAN's Hadoop Network Transition

Release    Volume    #Nodes/Switch    NIC        Oversubscription
Cluster1   3 PB      90               1 Gbps     4.5 : 1
Cluster2   20 PB     40               1 Gbps     4 : 1
Cluster3   38 PB     40               1 Gbps     2 : 1
Cluster4   58 PB     16               10 Gbps    2 : 1
Cluster5   75 PB     20               10 Gbps    1.25 : 1

Performance Tests (5 TB Terasort)

[Charts: traffic during a 5 TB Terasort]

Performance Tests (40 TB DistCp)

[Charts: uplink traffic during a 40 TB DistCp]
- 16 nodes/rack, about 8 Gbps/node
- About 30 Gbps x 4 uplinks = 120 Gbps leaving the rack

New Problems

Delay in data transfer
- One of the four uplinks was generating error packets
- That single bad uplink delayed the data transfers hashed onto it
- DataNodes logged: "org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror"
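One way to spot this early is simply to watch the DataNode logs for the message above. A minimal sketch (the log path and threshold are assumptions for illustration, not from the talk):

```python
import re

SLOW_RE = re.compile(r"Slow BlockReceiver write packet to mirror")

def count_slow_writes(log_path: str) -> int:
    """Count 'Slow BlockReceiver' warnings in a DataNode log file."""
    with open(log_path, errors="replace") as f:
        return sum(1 for line in f if SLOW_RE.search(line))

# Assumed log location; adjust to your installation.
n = count_slow_writes("/var/log/hadoop/hadoop-hdfs-datanode.log")
if n > 100:   # arbitrary threshold, for illustration
    print(f"{n} slow mirror writes - inspect this rack's uplinks")
```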

IP address changes when a server changes racks
- Each rack has its own network address (e.g. 192.168.0.0/26 vs. 192.168.0.64/26), so relocating a server changes its IP (e.g. 192.168.0.10 -> 192.168.0.100)
- Access control uses IP addresses, so every relocation requires an ACL update

Future Plan

Detect error-packet failures before they affect data transfers: when a link starts emitting errors, shut it down automatically so ECMP reroutes traffic over the remaining uplinks.
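A minimal sketch of such a watchdog on Linux, polling the kernel's per-interface error counters (the sysfs paths are standard; the interface name, thresholds, and shutdown action are assumptions - a real deployment would act on the switch port, not the server):

```python
import subprocess
import time
from pathlib import Path

def rx_errors(iface: str) -> int:
    """Read the kernel's cumulative RX error counter for an interface."""
    return int(Path(f"/sys/class/net/{iface}/statistics/rx_errors").read_text())

def watch(iface: str, threshold: int = 10, interval_s: int = 60) -> None:
    """Shut the interface down if errors grow faster than `threshold` per interval."""
    last = rx_errors(iface)
    while True:
        time.sleep(interval_s)
        now = rx_errors(iface)
        if now - last > threshold:
            # Illustrative action: take the link out of service so ECMP
            # stops hashing flows onto it.
            subprocess.run(["ip", "link", "set", iface, "down"], check=True)
            print(f"{iface}: {now - last} new RX errors - link shut down")
            return
        last = now

# watch("eth0")   # assumed interface name
```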

Use Erasure Coding
- The original raw data is striped into 64 kB cells
- Each stripe of 6 data cells (D1-D6) is encoded into 3 parity cells (P1-P3)
- Data and parity cells are distributed across different nodes; a read reassembles the stripe from the data cells
- Trade-off: low data locality, since one logical block's data is spread over many nodes
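To make the striping concrete, here is a minimal sketch. HDFS erasure coding computes three Reed-Solomon parity cells per stripe; for brevity this substitutes a single XOR parity, so it illustrates only the striping layout, not the real encoder:

```python
from functools import reduce

CELL = 64 * 1024      # 64 kB striping cell, as on the slides
DATA_CELLS = 6        # D1..D6 per stripe

def stripe(block: bytes):
    """Split a block into 64 kB cells and group them into stripes of 6."""
    cells = [block[i:i + CELL] for i in range(0, len(block), CELL)]
    for i in range(0, len(cells), DATA_CELLS):
        yield cells[i:i + DATA_CELLS]

def xor_parity(cells):
    """One XOR parity cell - a stand-in for the 3 Reed-Solomon parities."""
    size = max(len(c) for c in cells)
    padded = [c.ljust(size, b"\x00") for c in cells]
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*padded))

demo_block = bytes(range(256)) * 2048            # ~512 kB of demo data
for data_cells in stripe(demo_block):
    parity = xor_parity(data_cells)              # would be P1..P3 with RS(6,3)
    # every data and parity cell would land on a different DataNode
    print(f"{len(data_cells)} data cells + parity cell of {len(parity)} bytes")
```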

Interconnecting various platforms
- As the cluster interconnects with more and more platforms, the links between them become the bottleneck

Isolation of computing and storage
- Separate storage machines from computing machines so each pool can be scaled independently

Thank You for Listening!

Appendix

JANOG38: http://www.janog.gr.jp/meeting/janog38/program/clos
