BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
B96611024 謝宗廷  B96b02016 張煥基


  • Slide 1
  • BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. B96611024 謝宗廷, B96b02016 張煥基
  • Slide 2
  • Outline: Introduction; BCube structure; BCube source routing; Other design issues; Graceful degradation; Implementation architecture; Conclusion
  • Slide 3
  • Introduction: Organizations now use MDCs (shorter deployment time, higher system and power density, lower cooling and manufacturing cost). BCube is a high-performance and robust network architecture for MDCs, designed to support all of these traffic patterns well: one-to-one, one-to-several, one-to-all, and all-to-all.
  • Slide 4
  • Bandwidth-intensive application support. One-to-one: one server moves data to another server (disk backup). One-to-several: one server transfers the same copy of data to several receivers (distributed file systems). One-to-all: a server transfers the same copy of data to all the other servers in the cluster (broadcast). All-to-all: every server transmits data to all the other servers (MapReduce).
  • Slide 5
  • BCUBE STRUCTURE
  • Slide 6
  • BCube construction (BCube_k with n-port switches)
  • Slide 7
  • BCube_1
  • Slide 8
  • BCube_2 (n=4), level-2
  • Slide 9
  • Each server in a BCube_k has k+1 ports, numbered from level-0 to level-k. A BCube_k has N = n^(k+1) servers and k+1 levels of switches, with each level having n^k n-port switches. A server in a BCube_k is identified by an address array a_k a_{k-1} ... a_0.
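The counts above can be checked with a short sketch (the helper names here are ours, not from the paper's implementation; the address array is just the base-n representation of the server index):

```python
def bcube_params(n, k):
    """Return (num_servers, switch_levels, switches_per_level) for BCube_k
    built from n-port switches: N = n^(k+1), k+1 levels, n^k switches each."""
    return n ** (k + 1), k + 1, n ** k

def address(server_id, n, k):
    """Base-n address array [a_k, ..., a_0] of a server in BCube_k."""
    digits = []
    for _ in range(k + 1):
        digits.append(server_id % n)
        server_id //= n
    return digits[::-1]

print(bcube_params(8, 3))   # (4096, 4, 512): matches the 8^4 = 4096 figure later
```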
  • Slide 10
  • Single-path Routing in BCube: we use h(A,B) to denote the Hamming distance of two servers A and B. Two servers are neighbors if they connect to the same switch; the Hamming distance of two neighboring servers is one. More specifically, two neighboring servers that connect to the same level-i switch differ only at the i-th digit of their address arrays.
  • Slide 11
  • The diameter of a BCube_k, i.e., the longest shortest path among all server pairs, is k+1. Since k is a small integer, typically at most 3, BCube is a low-diameter network.
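The single-path routing idea on these two slides, correcting one differing address digit per hop, can be sketched as follows (a minimal illustration under our own naming, not the paper's code):

```python
def bcube_route(src, dst):
    """One shortest path from src to dst (given as address arrays):
    correct one differing digit per hop, so the path length equals the
    Hamming distance h(src, dst) <= k+1, matching the diameter bound."""
    path = [list(src)]
    cur = list(src)
    for i in range(len(src)):    # correcting digit i is one hop via a switch
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(list(cur))
    return path

print(bcube_route([0, 0, 1], [1, 0, 2]))   # two hops: Hamming distance is 2
```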
  • Slide 12
  • Multi-paths for One-to-one Traffic: two paths between a source server and a destination server are parallel if they are node-disjoint, i.e., the intermediate servers and switches on one path do not appear on the other. It is also easy to observe that the number of parallel paths between two servers is upper-bounded by k+1, since each server has only k+1 links.
  • Slide 13
  • There are k+1 parallel paths between any two servers in a BCube_k.
  • Slide 14
  • There are h(A,B) and k+1-h(A,B) paths in the first and second categories, respectively. Observe that the maximum path length of the paths constructed by BuildPathSet is k+2. It is easy to see that BCube also supports several-to-one and all-to-one traffic patterns well.
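For the special case h(A,B) = k+1 (all digits differ), the k+1 parallel paths of the first category can be obtained by correcting the digits in k+1 rotated orders. This is only a simplified sketch of the idea behind BuildPathSet; the full algorithm also builds the second-category paths through one-hop detours at digits where A and B agree:

```python
def rotated_paths(src, dst):
    """When every digit of src and dst differs, path j corrects the digits
    in the rotated order j, j+1, ..., j-1; the resulting k+1 paths share no
    intermediate server."""
    m = len(src)
    paths = []
    for start in range(m):
        cur, path = list(src), [list(src)]
        for step in range(m):
            i = (start + step) % m
            cur[i] = dst[i]
            path.append(list(cur))
        paths.append(path)
    return paths
```

For src = [0,0] and dst = [1,1] this yields the two paths [0,0]→[1,0]→[1,1] and [0,0]→[0,1]→[1,1], whose intermediates are disjoint.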
  • Slide 15
  • Speedup for One-to-several Traffic: these complete graphs can speed up data replication in distributed file systems. src has n-1 choices for each d_i, so src can build (n-1)^(k+1) such complete graphs. When a client writes a chunk to r chunk servers, it sends 1/r of the chunk to each chunk server. This is r times faster than the pipeline model.
  • Slide 16
  • Source: 00000. We want to build a complete graph over 00001, 00010, 00100, 01000, 10000. Complete graph among (00000, 00001, 00010, 00100): 01000 -> 01001 -> 00001; 01000 -> 01010 -> 00010; 01000 -> 01100 -> 00100.
  • Slide 17
  • Speedup for One-to-all Traffic: in one-to-all, a source server delivers a file to all the other servers. Under tree and fat-tree, the time for all the receivers to receive a file of size L is at least L. In a BCube_k, a source can deliver the file to all the other servers in L/(k+1) time, by constructing k+1 edge-disjoint server spanning trees from the k+1 neighbors of the source.
  • Slide 18
  • When a source distributes a file to all the other servers, it can split the file into k+1 parts and simultaneously deliver all the parts via different spanning trees.
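The splitting step can be sketched as follows (a minimal illustration, assuming the k+1 edge-disjoint spanning trees are already built; the function name is ours):

```python
def split_for_trees(data: bytes, k: int):
    """Split a file into k+1 near-equal parts, one per edge-disjoint
    spanning tree, so all parts travel simultaneously and the whole file
    arrives in about L/(k+1) time instead of L."""
    m = k + 1
    step = -(-len(data) // m)    # ceiling division, so m slices cover the file
    return [data[i * step:(i + 1) * step] for i in range(m)]

print(split_for_trees(b"abcdefgh", 3))   # 4 parts of 2 bytes for k = 3
```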
  • Slide 19
  • Aggregate Bottleneck Throughput for All-to-all Traffic: the flows that receive the smallest throughput are called the bottleneck flows. The aggregate bottleneck throughput (ABT) is defined as the number of flows times the throughput of the bottleneck flow. For BCube, ABT = n/(n-1) * (N-1), where n is the switch port number and N is the number of servers.
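Plugging in the simulation setting used later in the deck (n = 8 ports, N = 2048 servers, 1Gb/s links) gives the bound for a complete structure; the partial BCube_3 simulated later reports 2006 Gb/s, somewhat under this value:

```python
def abt(n, N):
    """ABT = n/(n-1) * (N-1), in units of link capacity
    (Gb/s here, since every link is 1Gb/s)."""
    return n / (n - 1) * (N - 1)

print(round(abt(8, 2048)))   # 2339
```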
  • Slide 20
  • BCUBE SOURCE ROUTING
  • Slide 21
  • [Figure: BSR probing: the source probes k+1 parallel paths to the destination through intermediate servers using probe packets]
  • Slide 22
  • Source: it obtains k+1 parallel paths and then probes them. If a path is found to be unavailable, the source uses the breadth-first search (BFS) algorithm to find another parallel path: it removes the existing parallel paths and the failed links from the BCube_k graph, and then uses BFS to search for a path. In this case the number of parallel paths becomes smaller than k+1.
  • Slide 23
  • Intermediate: Case 1: if its next hop is not available, it returns a path-failure message (which includes the failed link) to the source. Case 2: it updates the available-bandwidth field of the probe packet if its own available bandwidth is smaller than the existing value.
  • Slide 24
  • Destination: when a destination server receives a probe packet, it first updates the available-bandwidth field if the available bandwidth of the incoming link is smaller than the value carried in the packet. It then sends the value back to the source in a probe response message.
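The probe handling on the last three slides amounts to a running minimum over each path, with the source keeping the widest surviving path. A hedged sketch (our own simplification: a path is just its list of link bandwidths, and 0 models an unavailable next hop):

```python
def probe_path(link_bws):
    """Evaluate one probe along a path: each hop lowers the probe's
    available-bandwidth field to the minimum seen so far; a down link
    stands in for the path-failure message sent back to the source."""
    bw = float("inf")
    for b in link_bws:
        if b == 0:
            return None              # next hop unavailable: path failure
        bw = min(bw, b)
    return bw

def pick_path(paths):
    """Source side: drop failed paths and prefer the largest bottleneck bandwidth."""
    alive = [(probe_path(p), p) for p in paths]
    alive = [(bw, p) for bw, p in alive if bw is not None]
    return max(alive)[1] if alive else None
```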
  • Slide 25
  • 5.1 Partial BCube
  • Slide 26
  • Why a partial BCube? In some cases it may be difficult or unnecessary to build a complete BCube structure. For example, when n = 8 and k = 3, a BCube_3 has 8^4 = 4096 servers. However, due to space constraints, we may only be able to pack 2048 servers.
  • Slide 27
  • Partial BCube_k: (1) build the BCube_{k-1}s; (2) use partial layer-k switches to interconnect the BCube_{k-1}s.
  • Slide 28
  • Example 28
  • Slide 29
  • Slide 30
  • Solution: when building a partial BCube_k, we first build the needed BCube_{k-1}s; we then connect the BCube_{k-1}s using full layer-k switches.
  • Slide 31
  • Pros and cons of full layer-k switches: BCubeRouting performs just as in a complete BCube, and BSR works as before; however, the switches in layer-k are not fully utilized.
  • Slide 32
  • 5.2 Packaging and Wiring
  • Slide 33
  • Condition We show how packaging and wiring can be addressed for a container with 2048 servers and 1280 8-port switches (a partial BCube with n = 8 and k = 3). 33
  • Slide 34
  • 40-foot container
  • Slide 35
  • One rack: 16 layer-1, 8 layer-2, 16 layer-3 switches
  • Slide 36
  • One rack = one BCube_1: 64 servers and 16 8-port switches
  • Slide 37
  • One super-rack = one BCube_2. The level-2 wires are within a super-rack and the level-3 wires are between super-racks.
  • Slide 38
  • 5.3 Routing to External Networks
  • Slide 39
  • We assume that both internal and external computers use TCP/IP. We propose aggregators and gateways for external communication.
  • Slide 40
  • We can use a 48x1G + 1x10G aggregator to replace several mini-switches and use the 10G link to connect to the external network. The servers that connect to the aggregator become gateways.
  • Slide 41
  • When an internal server sends a packet to an external IP address: (1) it chooses one of the gateways; (2) the packet is then routed to the gateway using BSR (BCube Source Routing); (3) after the gateway receives the packet, it strips the BCube protocol header and forwards the packet to the external network via the 10G uplink.
  • Slide 42
  • Slide 43
  • Aggregate bottleneck throughput (ABT): ABT reflects the all-to-all network capacity. ABT = (throughput of the bottleneck flow) * (the number of total flows in the all-to-all traffic model). Graceful degradation states that when server or switch failures increase, ABT decreases slowly and there are no dramatic performance falls.
  • Slide 44
  • In this section, we use simulations to compare the aggregate bottleneck throughput (ABT) of BCube, fat-tree [1], and DCell [9], under random server and switch failures.
  • Slide 45
  • THE FAT TREE
  • Slide 46
  • DCell
  • Slide 47
  • Assumptions: all the links are 1Gb/s and there are 2048 servers. Switches: we use 8-port switches to construct the network structures.
  • Slide 48
  • BCube: the network we use is a partial BCube_3 with n = 8 that uses 4 full BCube_2s. Fat-tree: five layers of switches, with layers 0 to 3 having 512 switches per layer and layer 4 having 256 switches. DCell: a partial DCell_2 which contains 28 full DCell_1s and one partial DCell_1 with 32 servers.
  • Slide 49
  • BCube: only BCube provides high ABT and graceful degradation. Fat-tree: when there is no failure, both BCube and fat-tree provide high ABT values, 2006 Gb/s for BCube and 1895 Gb/s for fat-tree. DCell: its ABT is 298 Gb/s. First, the traffic is imbalanced at different levels of links in DCell; second, the partial DCell makes the traffic imbalanced even for links at the same level, complicating load-balancing.
  • Slide 50
  • ABT under server failure
  • Slide 51
  • ABT under switch failure
  • Slide 52
  • BCube performs well under both server and switch failures; the degradation is graceful. When the switch failure ratio reaches 20%, the fat-tree ABT is 267 Gb/s while the BCube ABT is 765 Gb/s.
  • Slide 53
  • Slide 54
  • BCube stack: we have prototyped the BCube architecture by designing and implementing a BCube protocol stack.
  • Slide 55
  • BCube stack: the BSR protocol handles routing; the neighbor maintenance protocol maintains a neighbor status table; the packet sending/receiving part interacts with the TCP/IP stack; the packet forwarding engine relays packets for other servers.
  • Slide 56
  • BCube packet
  • Slide 57
  • BCube header
  • Slide 58
  • BCube header: source and destination BCube addresses, packet id, protocol type, payload length, header checksum. BCube stores the complete path and a next hop index (NHI) in the header of every BCube packet.
  • Slide 59
  • NHA
  • Slide 60
  • relays packets for other servers
  • Slide 61
  • BCube stack: the BSR protocol handles routing; the neighbor maintenance protocol maintains a neighbor status table; the packet sending/receiving part interacts with the TCP/IP stack; the packet forwarding engine relays packets for other servers.
  • Slide 62
  • Packet forwarding engine: we have designed an efficient packet forwarding engine which decides the next hop of a packet with only one table lookup. Neighbor status table: (1) NeighborMAC: the MAC address of the neighbor; (2) OutPort: the port that connects to the neighbor; (3) StatusFlag: whether the neighbor is available.
  • Slide 63
  • Sending packets to the next hop: the engine extracts the status and the MAC address of the next hop, using the NHA value as the index.
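The one-lookup forwarding described on the last two slides can be sketched as follows. The table fields follow the slide (NeighborMAC, OutPort, StatusFlag); the table layout and function names are our assumptions, not the prototype's actual code:

```python
from dataclasses import dataclass

@dataclass
class NeighborEntry:
    neighbor_mac: str    # NeighborMAC: MAC address of the neighbor
    out_port: int        # OutPort: the port that connects to the neighbor
    status_flag: bool    # StatusFlag: whether the neighbor is available

def forward(nha, table):
    """Decide the next hop with a single table lookup, indexed by the
    packet's next-hop value (NHA) carried in the BCube header."""
    entry = table[nha]
    if not entry.status_flag:
        return None      # neighbor down: packet cannot be relayed on this path
    return entry.out_port, entry.neighbor_mac
```

A single indexed lookup is what keeps per-packet forwarding cost constant regardless of network size.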
  • Slide 64
  • CONCLUSION: BCube is a novel network architecture for shipping-container-based modular data centers (MDCs). It accelerates one-to-x traffic patterns and provides high network capacity for all-to-all traffic.
  • Slide 65
  • Future work: how to scale our server-centric design from a single container to multiple containers.