TRANSCRIPT
Camdoop Exploiting In-network Aggregation for Big Data Applications
Paolo Costa [email protected]
joint work with Austin Donnelly, Antony Rowstron, and Greg O’Shea (MSR Cambridge)
MapReduce Overview
• Map − Processes input data and generates (key, value) pairs
• Shuffle − Distributes the intermediate pairs to the reduce tasks
• Reduce − Aggregates all values associated with each key
[Diagram: input file split into chunks 0–2 → map tasks → intermediate results → reduce tasks → final results]
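The three phases above can be sketched as a toy in-memory word count (illustrative Python, not the speaker's code; `map_fn`, `shuffle`, and `reduce_fn` are hypothetical names):

```python
from collections import defaultdict

def map_fn(chunk):
    # Map: emit (key, value) pairs from one input chunk
    for word in chunk.split():
        yield word, 1

def shuffle(mapped):
    # Shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: aggregate all values associated with a key
    return key, sum(values)

chunks = ["a rose is", "a rose", "is a rose"]
mapped = [list(map_fn(c)) for c in chunks]
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'a': 3, 'rose': 3, 'is': 2}
```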
Problem
• Shuffle phase is challenging for data center networks − All-to-all traffic pattern with O(N²) flows − Led to proposals for full-bisection bandwidth
Data Reduction
• The final results are typically much smaller than the intermediate results
• In most Facebook jobs the final size is 5.4 % of the intermediate size
• In most Yahoo jobs the ratio is 8.2 %
How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?
Background: Combiners
• To reduce the data transferred in the shuffle, users can specify a combiner function − Aggregates the local intermediate pairs
• Server-side only => limited aggregation
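A combiner applies the reduce logic locally on each map server, which is why it shrinks the shuffle but cannot aggregate across servers (a minimal sketch; the word-count workload and names are illustrative):

```python
from collections import Counter

def map_fn(chunk):
    # Map: emit one (word, 1) pair per occurrence
    for word in chunk.split():
        yield word, 1

def combiner(pairs):
    # Runs on the map server: pre-aggregate local pairs before the shuffle,
    # so each key is sent at most once per server instead of once per occurrence.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

local = list(map_fn("a rose is a rose"))
print(len(local))            # 5 pairs before combining
print(len(combiner(local)))  # 3 pairs after combining
```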
Distributed Combiners
• It has been proposed to use aggregation trees in MapReduce to perform multiple steps of combiners − e.g., rack-level aggregation [Yu et al., SOSP’09]
Logical and Physical Topology
[Diagram: logical aggregation tree vs. physical topology with ToR switch]
What happens when we map the tree to a typical data center topology?
• Mismatch between physical and logical topology − two logical links are mapped onto the same physical link
• The server link is the bottleneck − only 500 Mbps per child; full-bisection bandwidth does not help here
Camdoop Goal
Perform the combiner functions within the network, as opposed to application-level solutions.
Reduce shuffle time by aggregating packets on path.
How Can We Perform In-network Processing?
• We exploit CamCube − Direct-connect topology − 3D torus − Uses no switches / routers for internal traffic
• Servers intercept, forward, and process packets
• Nodes have (x, y, z) coordinates − This defines a key space (=> key-based routing) − Coordinates are locally re-mapped in case of failures
[Diagram: 3D torus with x, y, z axes; example nodes (1,2,1) and (1,2,2)]
Key property − No distinction between network and computation devices; servers can perform arbitrary packet processing on-path.
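On a k-ary 3D torus each coordinate wraps modulo k, so every server has exactly six one-hop neighbors; a sketch of the coordinate arithmetic (illustrative, not CamCube's actual API):

```python
def neighbors(node, k=3):
    # The six one-hop neighbors of (x, y, z) on a k-ary 3D torus:
    # one step in each direction along each axis, wrapping modulo k.
    x, y, z = node
    return [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),
    ]

def torus_distance(a, b, k=3):
    # Hop distance with wraparound, the basis for key-based routing decisions.
    return sum(min((u - v) % k, (v - u) % k) for u, v in zip(a, b))

print((1, 2, 2) in neighbors((1, 2, 1)))     # True: physical neighbors
print(torus_distance((0, 0, 0), (2, 2, 2)))  # 3: wrapping, one hop per axis
```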
Mapping a tree… on a switched topology vs. on CamCube
• On the switched topology, the 1 Gbps server link becomes the bottleneck
• On CamCube, packets are aggregated on path (=> less traffic), with a 1:1 mapping between logical and physical topology − each child link runs at 1/in-degree Gbps
Camdoop Design
Goals
1. No change in the programming model
2. Exploit network locality
3. Good server and link load distribution
4. Fault-tolerance
Design Goal #1
Programming Model
• Camdoop adopts the same MapReduce model
• GFS-like distributed file system − Each server runs map tasks on local chunks
• We use a spanning tree − Combiners aggregate the map task output and the children's results (if any) and stream the result to the parent − The root runs the reduce task and generates the final output
Design Goal #2
Network Locality
How do we map the tree nodes to servers?
• Map task outputs are always read from the local disk
• Parent-children pairs are mapped onto physical neighbors − e.g., parent (1,2,1) with children (1,2,2) and (1,1,1)
This ensures maximum locality and optimizes network transfer
Network Locality
[Diagram: logical view vs. physical view (3D torus)]
One physical link is used by one and only one logical link.
Design Goal #3
Load Distribution
• First issue: poor server load distribution − tree nodes have different in-degrees
• Second issue: poor bandwidth utilization − a single tree uses only 1 Gbps per server (instead of 6)
Solution: stripe the data across 6 disjoint trees
• All links are used => (up to) 6 Gbps / server
• Good load distribution
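Striping can be as simple as a deterministic key-to-tree assignment, so that every server routes a given key over the same tree (a sketch; the CRC-based hash is an illustrative choice, not Camdoop's actual scheme):

```python
import zlib

def tree_for_key(key, num_trees=6):
    # Deterministically assign each intermediate key to one of the six
    # disjoint trees, so all links carry a share of the shuffle traffic
    # and each key is still aggregated along a single tree.
    return zlib.crc32(key.encode()) % num_trees

for key in ["alpha", "beta", "gamma", "delta"]:
    print(key, "-> tree", tree_for_key(key))
```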
Design Goal #4
Fault-tolerance
• The tree is built in the coordinate space − CamCube remaps coordinates in case of failures
• Details in the paper
Evaluation
Testbed − 27-server CamCube (3 x 3 x 3) − Quad-core Intel Xeon 5520 2.27 GHz − 12 GB RAM − 6 Intel PRO/1000 PT 1 Gbps ports − Runtime & services implemented in user space
Simulator − Packet-level simulator (CPU overhead not modelled) − 512-server (8x8x8) CamCube
Evaluation
Design and implementation recap
• Camdoop: the reduce phase is parallelized with the shuffle phase − Since all streams are ordered, as soon as the root receives at least one packet from all children, it can start the reduce function − No need to store intermediate results to disk on the reduce servers
[Diagram: map, shuffle, and reduce phases overlapping in time]
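Because each child streams its pairs in key order, the root can merge and reduce incrementally instead of waiting for complete inputs; a sketch with `heapq.merge` standing in for the ordered network streams (names and workload are illustrative):

```python
import heapq
from itertools import groupby

def streaming_reduce(child_streams):
    # Each child stream yields (key, value) pairs sorted by key.
    # heapq.merge interleaves them lazily, so reduction can start as
    # soon as at least one pair has arrived from every child.
    merged = heapq.merge(*child_streams)
    for key, pairs in groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in pairs)

children = [
    [("a", 2), ("b", 1)],
    [("a", 1), ("c", 4)],
]
print(list(streaming_reduce(children)))  # [('a', 3), ('b', 1), ('c', 4)]
```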
• Camdoop − runs on CamCube − six disjoint trees − in-network aggregation − shuffle & reduce parallelized
• TCP Camdoop (switch) − 27 CamCube servers attached to a ToR switch − TCP is used to transfer data in the shuffle phase
• Camdoop (no agg) − like Camdoop but without in-network aggregation − shows the impact of just running on CamCube
Validation against Hadoop & Dryad
• Sort and WordCount
• The Camdoop baselines are competitive against Hadoop and Dryad
• Several reasons: − Shuffle and reduce parallelized − Fine-tuned implementation
[Chart: time (s, log scale) for Sort and WordCount across Hadoop, Dryad/DryadLINQ, TCP Camdoop (switch), and Camdoop (no agg)]
We consider TCP Camdoop (switch) and Camdoop (no agg) as our baselines.
Parameter Sweep
• Output size / intermediate size (S) − S = 1 (no aggregation): every key is unique − S = 1/N ≈ 0 (full aggregation): every key appears in all map task outputs − We use synthetic workloads to explore different values of S; the intermediate data size is 22.2 GB (843 MB/server)
• Reduce tasks (R) − R = 1 (all-to-one): e.g., interactive queries, top-k jobs − R = N (all-to-all): common setup in MapReduce jobs; N output files are generated
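Measured in (key, value) pairs, S is just distinct keys over total pairs, which makes the two extremes easy to see (a sketch on illustrative data; `aggregation_ratio` is a hypothetical helper):

```python
def aggregation_ratio(map_outputs):
    # S = output size / intermediate size, counted in (key, value) pairs:
    # the reducer emits one pair per distinct key.
    total = sum(len(pairs) for pairs in map_outputs)
    unique = len({k for pairs in map_outputs for k, _ in pairs})
    return unique / total

# S = 1: every key is unique across all map outputs
print(aggregation_ratio([[("a", 1)], [("b", 1)], [("c", 1)]]))  # 1.0
# S = 1/N: every key appears in all N map outputs (here N = 3)
print(aggregation_ratio([[("a", 1)], [("a", 1)], [("a", 1)]]))
```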
All-to-one (R = 1)
[Chart: time (s, log scale) vs. output size / intermediate size (S), from full aggregation (S ≈ 0) to no aggregation (S = 1), for TCP Camdoop (switch), Camdoop (no agg), and Camdoop]
• TCP Camdoop (switch) and Camdoop (no agg): performance is independent of S; the gap between them shows the impact of running on CamCube
• Camdoop: the gap to Camdoop (no agg) shows the impact of in-network aggregation
• The aggregation ratio reported for Facebook jobs (S ≈ 0.054) lies toward the full-aggregation end of this range
All-to-all (R = 27)
[Chart: time (s, log scale) vs. output size / intermediate size (S), from full aggregation (S ≈ 0) to no aggregation (S = 1), for TCP Camdoop (switch), Camdoop (no agg), Camdoop, and the switch 1 Gbps bound]
• The gap between TCP Camdoop (switch) and Camdoop (no agg) shows the impact of running on CamCube; the gap to Camdoop shows the impact of in-network aggregation
• The switch 1 Gbps line is the maximum theoretical performance over the switch
Number of reduce tasks (S = 0)
[Chart: time (s, log scale) vs. number of reduce tasks R, from R = 1 (all-to-one) to R = 27 (all-to-all), for TCP Camdoop (switch), Camdoop (no agg), and Camdoop; annotated speedups of 1.13x, 1.91x, 4.31x, and 10.19x]
• For Camdoop, R does not (significantly) impact performance; for the baselines, performance depends on R
• An implementation bottleneck is visible in part of the range
• All resources are used even when R = 1
• Camdoop decouples the job execution time from the number of output files generated
Behavior at scale (simulated)
N = 512, S = 0
[Chart: time (s, log scale) vs. number of reduce tasks R (log scale, 1 to 512, all-to-one to all-to-all) for Camdoop and the switch 1 Gbps bound; up to 512x improvement at R = 1]
Note that the switch bound assumes full-bisection bandwidth.
• More experiments (failures, multiple jobs,…) in the paper
Beyond MapReduce
• The partition-aggregate model is also common in interactive services − e.g., Bing Search, Google Dremel
• Small-scale data − 10s to 100s of KB returned per server
• Typically, these services use one reduce task (R = 1) − a single result must be returned to the user
• Full aggregation is common (S ≈ 0) − e.g., N servers each generate their best k responses and the final result contains the best k responses
[Diagram: web server issues requests through parent servers to leaf servers with caches]
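The best-k pattern composes naturally up a tree: each parent merges its children's top-k lists into its own top-k, so payloads stay small at every hop (a sketch with `heapq.nlargest`; the scores and document names are illustrative):

```python
import heapq

def merge_top_k(child_results, k=3):
    # Each child sends its k best (score, item) pairs; the parent keeps
    # only the k best of the union, so every hop forwards O(k) data
    # regardless of how many servers sit below it.
    return heapq.nlargest(k, (r for child in child_results for r in child))

leaves = [
    [(0.9, "doc1"), (0.5, "doc7")],
    [(0.8, "doc3"), (0.4, "doc9")],
    [(0.7, "doc2"), (0.6, "doc5")],
]
print(merge_top_k(leaves))  # [(0.9, 'doc1'), (0.8, 'doc3'), (0.7, 'doc2')]
```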
Small-scale data (R = 1, S = 0)
[Chart: time (ms) for 20 KB and 200 KB of input data per server, for TCP Camdoop (switch), Camdoop (no agg), and Camdoop]
In-network aggregation can be beneficial also for (small-scale data) interactive services.
Conclusions
• Camdoop − Explores the benefits of in-network processing by running combiners within the network − No change in the programming model − Achieves lower shuffle and reduce time − Decouples performance from the number of output files
• A final thought: how would Camdoop run on this? − AMD SeaMicro, a 512-core cluster for data centers using a 3D torus − Fast interconnect: 5 Gbps / link