CamCube Rethinking the Data Center Cluster
Paolo Costa
Joint work with Austin Donnelly, Greg O’Shea, Antony Rowstron (MSRC), Hussam Abu-Libdeh (Intern, Cornell), and Simon Schubert (Intern, EPFL)
Paolo Costa 2 CamCube - Rethinking the Data Center Cluster
A New Software Stack
Dremel
Dryad/DryadLINQ
Network is a critical component
Focus of this talk: how to make it easy to design and deploy efficient data center applications
Building Data Center Applications is Hard!
Abstraction
• Application logical topologies − Dynamo, MapReduce, Tree, Dremel, Databus
Reality
• Data center physical topology
Abstraction & Reality Mismatch
Switches
Router
One logical hop is mapped to multiple physical hops
Two disjoint logical paths share some physical links
Issue #1: Oversubscription
Switches
Router
Bandwidth gets scarce as you move up the tree
Locality is key to performance
Issue #2: Path collision
The network allocates paths independently
Applications cannot modify the way packets are routed
Addressing These Issues…
• Oversubscription: Fat-tree [SIGCOMM’08], VL2 [SIGCOMM’09], …
• Path collision: Hedera [NSDI’10], MPTCP [SIGCOMM’11], SPAIN [NSDI’10], …
• TCP Incast: DCTCP [SIGCOMM’10], ICTCP [CoNEXT’10], FDS [OSDI’12], …
• Traffic prioritization: Orchestra [SIGCOMM’11], D2TCP [SIGCOMM’12], …
• Fair sharing: Seawall [NSDI’11], FairCloud [SIGCOMM’12], …
Applications & Network Gap
The network is a black box for applications (and vice versa)
Applications perspective
• Applications only see IP addresses (e.g., 10.0.1.4, 10.0.2.3) − Hard to infer locality & congestion
• No control on packet routing − Point-to-point only
• Need to reverse-engineer the network − "Why slow?"
Network perspective
• The network only sees packets
• No insights about application behaviour
• Has to infer application patterns − "Are these related? Long vs. short flows?"
Internet & Data Centers
Internet
• Multiple administration domains
• Heterogeneous HW and network
• Topology not known
• Malicious software
Data Centers
• Single administration domain
• Homogeneous HW and network − x86 and Ethernet
• Topology known − and can be customised
• Trusted components − e.g., using virtualization
Strict layer isolation
• This is due to how the Internet was designed… − …but data centers are not mini-Internets
How can we exploit this flexibility to improve efficiency and reduce complexity?
CamCube
How can we design a data center closer to what a distributed systems builder expects?
• Today: The network is a given and apps adapt to it
• CamCube: Adapt the network to the apps’ needs
Direct-Connect topology: servers are directly interconnected to each other with physical Ethernet cables (no switches / routers)
A fully connected mesh topology would be ideal: all logical topologies (e.g., Dynamo) can be mapped perfectly
Not very scalable: node degree grows linearly with N (high server load and cabling complexity)
Which topology?
• Various options available − Trees, rings, hypercubes, tori, …
• Scalable − Node degree is constant (=6)
• Fault-tolerant − High degree of multi-path
• Easy to wire − Only short links are needed
• Trade-off − Increased hop count
[Figure: 2D Torus and 3D Torus]
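As a rough illustration of the 3D torus properties listed above, the following Python sketch enumerates the six neighbors of a server and computes the hop count between two servers. The function names and the side length of 3 are assumptions chosen for illustration; they are not taken from the CamCube implementation.

```python
def torus_neighbors(coord, side):
    """Return the 1-hop neighbors of coord in a side x side x side 3D torus."""
    x, y, z = coord
    neighbors = []
    for axis in range(3):
        for step in (1, -1):
            n = [x, y, z]
            n[axis] = (n[axis] + step) % side  # wrap around at the edges
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, side):
    """Minimum hop count between a and b, using wrap-around links when shorter."""
    return sum(min((p - q) % side, (q - p) % side) for p, q in zip(a, b))


# Hypothetical usage on a 3 x 3 x 3 torus.
print(torus_neighbors((1, 2, 2), 3))
print(torus_hops((1, 2, 2), (1, 2, 1), 3))
```

For a side length of at least 3, every server has exactly six distinct neighbors regardless of the total number of servers, whereas a fully connected mesh of N servers would need N - 1 links per server.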
Network Visibility
• Today: limited network visibility − IP addresses only (e.g., 10.0.1.4, 10.0.2.3) − Hard to infer server location − Hard to infer congestion
• CamCube: nodes have (x,y,z) coordinates, e.g., (1,2,2) and (1,2,1) − Easy to understand locality
• CamCube: servers have full visibility on the status of network links
Packet Routing
• Today: single routing protocol − Point-to-point only
• CamCube: servers can intercept, process, and forward packets − Multiple custom routing protocols − e.g., multicast, multipath
Packet Processing
• Today: application-agnostic packet processing − Typically header-only − e.g., OpenFlow
• CamCube: application-specific packet processing − Servers understand the application semantics − e.g., caching, aggregation
CamCube Services
• Several services have been implemented on top of CamCube, including:
• CamKey − Key-value store
• Camdoop − MapReduce-like system
• CamGraph − Graph processing engine
• TCP/IP service − Enables running unmodified TCP applications
Key-based Routing
• Packets are routed based on the key rather than server address
• Inspired by Distributed Hash Tables (DHTs) − The (x,y,z) coordinates define a key-space
• 160-bit keys are expressed as (x,y,z,w) − If alive, (x,y,z) is the server responsible for the key − Otherwise, keys are re-mapped to 1-hop neighbors based on w
• Example − (2,2,0,27) -> (2,2,0), (2,1,0), (1,2,0), …
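The key-to-server mapping described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the slides only say that the fallback order among 1-hop neighbors is "based on w", so the rotation rule used here (`w % 6`) is a hypothetical placeholder and is not expected to reproduce the exact ordering in the example. The function names are also illustrative.

```python
def neighbors(coord, side):
    """1-hop neighbors of coord in a side x side x side 3D torus."""
    x, y, z = coord
    out = []
    for axis in range(3):
        for step in (1, -1):
            n = [x, y, z]
            n[axis] = (n[axis] + step) % side
            out.append(tuple(n))
    return out


def candidate_servers(key, side):
    """Home server (x, y, z) first, then its neighbors in an order derived from w."""
    x, y, z, w = key
    home = (x, y, z)
    nbrs = neighbors(home, side)
    start = w % len(nbrs)  # assumption: w selects where the neighbor rotation starts
    return [home] + nbrs[start:] + nbrs[:start]


def responsible_server(key, side, alive):
    """First live server in the candidate list, or None if all of them are down."""
    for server in candidate_servers(key, side):
        if server in alive:
            return server
    return None


# Hypothetical usage: the home server handles the key while it is alive.
servers = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
print(responsible_server((2, 2, 0, 27), 3, servers))
print(responsible_server((2, 2, 0, 27), 3, servers - {(2, 2, 0)}))
```

Because the fallback list is a pure function of the key, every server computes the same re-mapping without coordination once it learns that the home server has failed.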
CamKey
• Reliable high-performance key-value store − Combination of BigTable + memcached
Two components:
• Replicated store − Ensures fault tolerance
• Caching service − Provides high performance
Replicated Store
hash(ID) = e689eb3… = (2,2,0,27)
(2,2,0,27) -> (2,2,0), (2,1,0), (1,2,0), …
• Data object IDs are hashed using SHA-1 and the result is interpreted as 4D coordinates
• The primary replica is stored at the server responsible for the key, here (2,2,0)
• The first secondary replica is stored at the server that will become responsible for the key if the primary fails, here (2,1,0)
• The second secondary replica is stored on the next server on the list, here (1,2,0), and so on
• High locality − Secondary replicas are 1-hop neighbors − Disjoint paths can be used
• Client transparency − Clients do not need to know the replica identity − Key-based routing is used to deliver packets, e.g., route to (2,2,0,27)
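To make the hashing step concrete, here is a small Python sketch that turns an object ID into an (x,y,z,w) key with SHA-1. The slides do not say how the 160 bits are split across the four coordinates, so the layout below (four 40-bit chunks, with x, y, z reduced modulo the cube side) is purely an assumption, as are the function name and the example ID.

```python
import hashlib


def key_from_id(object_id, side):
    """Hash object_id with SHA-1 and fold the 160-bit digest into (x, y, z, w)."""
    digest = hashlib.sha1(object_id.encode("utf-8")).digest()  # 20 bytes = 160 bits
    # Assumed layout: four 5-byte (40-bit) chunks, one per coordinate.
    chunks = [int.from_bytes(digest[i:i + 5], "big") for i in range(0, 20, 5)]
    x, y, z = (c % side for c in chunks[:3])
    w = chunks[3]
    return (x, y, z, w)


# Hypothetical usage on a 3 x 3 x 3 cube.
print(key_from_id("photo-42", 3))
```

The same ID always maps to the same key, so any client can locate the primary replica by routing to the key without knowing which server currently holds it.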
Caching Service
hash(ID) = e689eb3… = (2,2,0,27)
f(2,2,0,27) -> (1,1,0,27), (3,1,0,27), (1,3,0,27), (3,3,0,27), …
• For each key, we generate c additional keys that represent the location of caches
• These cache keys are assigned to servers using the usual mapping function
• When a server looks up a key, the path is chosen so as to pass through the closest cache
• On a cache miss, the lookup request is forwarded to the primary replica and the response is cached on the way back
• Subsequent requests for the same key are intercepted on-path and the associated value is returned
• Write operations always go to the primary replica and caches are invalidated
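The lookup, miss, and invalidation behaviour described above can be modelled with a toy Python class. This is only a sketch of the semantics, not of CamKey itself: routing is abstracted away (the caller names the closest cache), and the class name, method names, and the `primary_reads` counter are all hypothetical.

```python
class CachedStore:
    """Toy model of one primary replica plus the on-path caches for its keys."""

    def __init__(self, cache_ids):
        self.primary = {}                             # key -> value at the primary replica
        self.caches = {cid: {} for cid in cache_ids}  # cache server -> cached entries
        self.primary_reads = 0                        # lookups that reached the primary

    def lookup(self, key, closest_cache):
        cache = self.caches[closest_cache]
        if key in cache:                 # hit: the request is intercepted on-path
            return cache[key]
        self.primary_reads += 1          # miss: forwarded to the primary replica
        value = self.primary.get(key)
        if value is not None:
            cache[key] = value           # response is cached on the way back
        return value

    def write(self, key, value):
        self.primary[key] = value        # writes always go to the primary replica
        for cache in self.caches.values():
            cache.pop(key, None)         # invalidate every cached copy


# Hypothetical usage: the second lookup is served by the cache.
store = CachedStore(["c1", "c2"])
store.write("img", "v1")
store.lookup("img", "c1")
store.lookup("img", "c1")
print(store.primary_reads)
```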
Evaluation
Testbed − 27-server CamCube (3 x 3 x 3) − Quad-core 2.27 GHz, 12 GB RAM − Six 1 Gbps ports per server − Runtime & services implemented in user-space (C#)
Workload: Image store − 9 external servers (up to 150 concurrent requests) − Insert: 1.47 MB average image size − Lookup: 3.55 KB average thumbnail size
Insert Throughput
[Figure: insert throughput (Gbps, 0 to 6) vs. concurrent insert requests (0 to 150, load increases) for switch, CamKey, switch (no disk), and CamKey (no disk); higher is better]
Annotations on the plot:
• Disk I/O bounded
• Server bandwidth bounded
• CamKey exploits disjoint paths to create replicas
Lookup Throughput
[Figure: lookup rate (reqs/s, 0 to 160,000) vs. concurrent lookup requests (0 to 150, load increases) for switch, CamKey (disabled cache), and CamKey; higher is better]
Annotations on the plot:
• Latency 0.83 ms (median), 1.70 ms (95th perc)
• Latency 0.97 ms (median), 2.13 ms (95th perc)
• Higher hop count
• Caches reduce hop count
Failures
[Figure: CamKey insert throughput (Gbps, 0 to 6) vs. time (s, 0 to 140)]
• A random server fails every 10 s
• Only 18 servers left
MapReduce
• Map − Processes input data and generates (key, value) pairs
• Shuffle − Distributes the intermediate pairs to the reduce tasks
• Reduce − Aggregates all values associated with each key
[Figure: input file (Chunk 0, Chunk 1, Chunk 2) -> Map Tasks -> intermediate results -> Reduce Tasks -> final results]
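The three phases can be illustrated with a single-process word count in Python. This is a generic sketch of the MapReduce model, not Camdoop or any particular framework; the function names and the byte-sum partitioner are assumptions made to keep the example short and deterministic.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(map_outputs, num_reducers):
    """Shuffle: send every pair to the reduce task that owns its key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            idx = sum(key.encode("utf-8")) % num_reducers  # simple stable partitioner
            partitions[idx][key].append(value)
    return partitions


def reduce_task(partition):
    """Reduce: aggregate all values associated with each key."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(chunks, num_reducers=3):
    map_outputs = [map_task(chunk) for chunk in chunks]
    result = {}
    for partition in shuffle(map_outputs, num_reducers):
        result.update(reduce_task(partition))
    return result


# Hypothetical usage with three input chunks.
print(word_count(["a b a", "b c", "a"]))
```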
Shuffle Phase
[Figure: Split 0, Split 1, Split 2 -> Map Tasks -> intermediate results -> Reduce Tasks]
• Shuffle phase is challenging for data center networks − All-to-all traffic pattern with O(N²) flows
• Often a bottleneck for MapReduce jobs − Led to proposals for full-bisection bandwidth
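The quadratic growth follows from every map task sending one flow to every reduce task. A tiny Python sketch of that arithmetic, assuming one map task and one reduce task per server (the function name is illustrative):

```python
def shuffle_flows(num_map_tasks, num_reduce_tasks):
    """Each map task sends one flow to each reduce task."""
    return num_map_tasks * num_reduce_tasks


# Doubling the number of servers quadruples the number of flows.
print(shuffle_flows(27, 27))
print(shuffle_flows(54, 54))
```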
Data Reduction
• The final results are typically much smaller than the intermediate results (e.g., WordCount)
• In most Facebook jobs, the final size is 5.4% of the intermediate size
• In most Yahoo jobs, the ratio is 8.2%
[Figure: input file (Split 0, Split 1, Split 2) -> Map Tasks -> intermediate results -> Reduce Tasks -> final results]
How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?
Aggregation Tree
• We could use aggregation trees to perform multiple steps of aggregation to reduce inter-rack traffic − e.g., rack-level aggregation
Mapping a tree…
… on a traditional topology
• Mismatch between logical and physical topology
• Link shared by all children (rack switch)
… on CamCube
• 1:1 mapping between logical and physical topology
• Only one child per link
• Packets are aggregated on path (=> less traffic)
Camdoop
Improve the performance of the shuffle phase by reducing the traffic rather than by increasing the bandwidth
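The idea of reducing traffic through on-path aggregation can be sketched in Python as a recursive merge over a tree of servers. This is a simplified model under stated assumptions, not Camdoop's implementation: the tree shape, the word-count style combine function, and the "entries sent" traffic metric are all hypothetical.

```python
from collections import Counter


def aggregate(node, children, local):
    """Return (aggregated counts at node, entries sent over the links below it)."""
    total = Counter(local[node])
    sent = 0
    for child in children.get(node, []):
        child_total, child_sent = aggregate(child, children, local)
        sent += child_sent + len(child_total)  # child forwards one entry per distinct key
        total.update(child_total)              # combine on-path before forwarding up
    return total, sent


# Hypothetical tree: r is the root, a and b are its children, c and d are children of a.
children = {"r": ["a", "b"], "a": ["c", "d"]}
local = {name: {"x": 1, "y": 1} for name in ["r", "a", "b", "c", "d"]}
total, sent = aggregate("r", children, local)
print(total, sent)
```

In this example every server holds the same two keys, so each link carries only two entries no matter how many servers sit below it in the tree.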
Workload Parameter
• Output size / intermediate size (S)
− S=1 (no aggregation)
o All map outputs have a disjoint set of keys
− S=1/N ≈ 0 (full aggregation)
o All map outputs share the same set of keys
− We use synthetic workloads to explore different values of S
o Intermediate data size is 22.2 GB (843 MB/server)
[Figure: input file (Split 0, Split 1, Split 2) -> Map Tasks -> intermediate results -> Reduce Tasks -> output results]
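The two extremes of S can be checked with a short Python sketch. It assumes, for illustration only, that sizes are measured as the number of (key, value) pairs and that the reduce phase emits one pair per distinct key; the function name and the synthetic key sets are hypothetical.

```python
def aggregation_ratio(map_outputs):
    """S = output size / intermediate size, counting one unit per key."""
    intermediate = sum(len(keys) for keys in map_outputs)
    output = len(set().union(*map_outputs))  # one output pair per distinct key
    return output / intermediate


n = 27  # number of map outputs, matching the 27-server testbed
disjoint = [{(i, j) for j in range(10)} for i in range(n)]  # no key is shared
shared = [set(range(10)) for _ in range(n)]                 # every key is shared
print(aggregation_ratio(disjoint))
print(aggregation_ratio(shared))
```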
Evaluation
[Figure: time (s, log scale, 1 to 1000) vs. output size / intermediate size (S, 0 to 1, from full aggregation to no aggregation) for Baseline, Camdoop (no agg.), and Camdoop; lower is better]
Annotations on the plot:
• Running on the switch using TCP
• Impact of running on CamCube
• Impact of in-network aggregation
• Facebook reported aggregation ratio
Summary
• Data centers present both unique challenges and opportunities to network designers
• Good time to revisit previous assumptions and rethink application and protocol design
• CamCube − Enables applications to “control” the network − Removes distinction between computation and network devices