System Architecture for Web-Scale Applications Using Lightweight
CPUs and Virtualized I/O
Kshitij Sudan*, Saisanthosh Balakrishnan§, Sean Lie§, Min Xu§, Dhiraj Mallick§, Gary Lauterbach§, Rajeev Balasubramonian*§
*
HPCA-2013
Exec Summary
• Focus on web-scale applications
• Contribution 1: use of simple cores
• This amplifies the power/cost contribution of the I/O subsystem
• Contribution 2: virtualize I/O, e.g., single disk shared by many cores
• Contribution 3: software stack optimizations
• Contribution 4: evaluations on a production-quality real design
Web Scale Applications
• Targeting datacenter platforms
• Focus on power and cost (OpEx and CapEx)
• Web scale applications have large datasets, high concurrency, high communication, high I/O – e.g., MapReduce
• Typically, performance increases as cluster size grows, but so do power and cost
Energy Efficient CPUs
• For embarrassingly parallel workloads, energy per instruction (EPI) is important
• For a given power/energy budget, many low-EPI cores can yield a higher throughput than a few high-EPI cores
• Hence, use many light-weight energy-efficient CPUs (Atom CPU at 8.5 W)
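The EPI argument above can be sketched as simple arithmetic: power (joules per second) divided by energy per instruction (joules) gives instructions per second. The EPI values and the budget below are hypothetical, chosen only for illustration, and the sketch assumes the whole budget goes to the cores and that the workload scales perfectly across them.

```python
def throughput_ips(power_budget_w: float, epi_joules: float) -> float:
    """Aggregate instructions/second sustainable within a power budget.

    Watts are joules/second, so budget / EPI = instructions/second.
    Assumes perfect parallel scaling and no non-CPU power.
    """
    if epi_joules <= 0:
        raise ValueError("EPI must be positive")
    return power_budget_w / epi_joules


# Hypothetical numbers, not taken from the talk.
BUDGET_W = 1000.0
HIGH_EPI_J = 10e-9   # a few "big" cores
LOW_EPI_J = 2e-9     # many "small" cores

big = throughput_ips(BUDGET_W, HIGH_EPI_J)
small = throughput_ips(BUDGET_W, LOW_EPI_J)
print(f"low-EPI cluster throughput advantage: {small / big:.1f}x")
```

Under these assumptions the advantage is simply the ratio of the two EPI values; real workloads with serial phases or I/O stalls would see less.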
Contribution of the I/O Sub-System
• With light-weight cores, the energy and cost contributions of “other” components grow
  – Intel Atom CPU + Chipset = 11 Watts
  – Typical disk, or Ethernet card = 5-25 Watts
  – Fans, power supplies, etc.
• The application only uses 20-60 MB/s disk bw, while the disk has a peak read bw of 120 MB/s
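A rough sketch of what these figures imply, using only the numbers on the slide. It assumes a node with a single dedicated I/O device and ignores fans and power supplies, so the shares are illustrative rather than measured.

```python
CPU_CHIPSET_W = 11.0                 # Atom CPU + chipset, from the slide
IO_DEVICE_W_RANGE = (5.0, 25.0)      # typical disk or Ethernet card
DISK_PEAK_MBPS = 120.0               # peak disk read bandwidth
APP_DISK_MBPS_RANGE = (20.0, 60.0)   # bandwidth the application uses


def io_power_share(cpu_w: float, io_w: float) -> float:
    """Fraction of (CPU + one I/O device) power drawn by the I/O device."""
    return io_w / (cpu_w + io_w)


def disk_utilization(used_mbps: float, peak_mbps: float) -> float:
    """Fraction of peak disk bandwidth actually used."""
    return used_mbps / peak_mbps


shares = [io_power_share(CPU_CHIPSET_W, w) for w in IO_DEVICE_W_RANGE]
utils = [disk_utilization(u, DISK_PEAK_MBPS) for u in APP_DISK_MBPS_RANGE]

print(f"I/O device power share: {shares[0]:.0%} to {shares[1]:.0%}")
print(f"Disk bandwidth utilization: {utils[0]:.0%} to {utils[1]:.0%}")
```

With one device per node, that device can draw roughly a third to two thirds of the combined power while at most half of its bandwidth is used.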
[Figure: Atom TeraSort, aggregate disk bandwidth. Y-axis: disk BW (MB/sec), 0 to 160. Series: read, write, and a moving average of each.]
Wasting energy on over-provisioned resources
Cluster-in-a-Box with Virtualized I/O
• Use energy-efficient CPUs
  – ~10x more CPUs in same power budget than using typical server class CPUs
• Virtualize I/O devices – disk and Ethernet
  – Balanced resource provisioning and lower cost/power
• Amortize fixed server overheads by sharing components
  – Fans, power supplies, etc.
Compute Cards
Compute card – 6 CPUs share 4 ASICs (PCIe connection), ASIC implements the fabric, 4GB DDR2 memory per CPU on the back
Logical Organization
[Figure: logical organization]
• Compute card: CPU + chipset nodes attached to ASICs
• 3D-torus interconnect formed by the ASICs
• E-Cards (Ethernet FPGA): up to 8 per system, each with 8x1 GbE or 2x10 GbE
• S-Cards (storage FPGA): up to 8 per system, each with 8x SATA HDD/SSD
Physical Organization
[Figure: physical organization. Labeled components: S-Card, E-Card, Compute Card, Midplane Interconnect, HDD/SSD.]
Cluster-in-a-Box Summary
• 768 CPU cores interconnected using a high bandwidth fabric in a 3D torus topology
  – Low-latency distributed fabric architecture based on low-power ASICs
• FPGAs implement the disk and Ethernet controllers
• Fabric and FPGAs implement I/O virtualization
  – Up to 64 disks shared by 384 server nodes
• Server nodes don’t require a top-of-rack switch to communicate
  – All internal cluster communication via fabric
• Entire cluster consumes < 3.5 kW under full load
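To make the 3D torus topology concrete, here is a minimal sketch of neighbor lookup and hop counting with wrap-around links. The 4x4x4 dimensions are hypothetical; the slides do not give the actual torus dimensions or the routing algorithm, so this only illustrates the topology in general.

```python
Coord = tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> list[Coord]:
    """Nodes one hop away along each axis, with wrap-around."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            neighbors.add(tuple(coord))
    neighbors.discard(tuple(node))  # only matters when an axis has size 1
    return sorted(neighbors)


def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Minimum hop count: per axis, take the shorter way around the ring."""
    return sum(
        min(abs(x - y), size - abs(x - y))
        for x, y, size in zip(a, b, dims)
    )


DIMS = (4, 4, 4)  # hypothetical dimensions, for illustration only
print(torus_neighbors((0, 0, 0), DIMS))
print(torus_hops((0, 0, 0), (3, 3, 3), DIMS))
```

The wrap-around links are what keep the worst-case distance low: in this example the far corner (3, 3, 3) is three hops from the origin rather than nine.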
System Software Improvements
• Implement large SATA packet sizes to reduce disk seek overheads
• Other OS/ethernet configuration knobs: avoid journaling in the filesystem, jumbo TCP/IP frames, interrupt coalescing
• MapReduce configuration: designate the few nodes near the S-cards as DataNodes
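One way to picture the DataNode placement above is as a selection by distance to storage. The slide does not say how "near" is measured; the sketch below assumes fabric hop count to the closest S-card, and the node names and hop counts are hypothetical.

```python
def pick_datanodes(hops_to_storage: dict[str, int], count: int) -> list[str]:
    """Return the `count` nodes with the fewest hops to an S-card.

    Ties are broken by node name so the result is deterministic.
    """
    ranked = sorted(hops_to_storage, key=lambda n: (hops_to_storage[n], n))
    return ranked[:count]


# Hypothetical hop counts from each node to its closest S-card.
hops = {"node0": 3, "node1": 1, "node2": 2, "node3": 1}
print(pick_datanodes(hops, 2))
```

The chosen hosts would then be the ones configured to run the HDFS DataNode role, while the remaining nodes run compute tasks only.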
Methodology
• Compare two cluster designs with the same power envelope to evaluate TCO and power for cluster architectures
  – 17-node Core i7 CPU based cluster (baseline) and a 384-node Atom cluster-in-a-box
  – 4 kW Core i7 cluster; 3.5 kW Atom cluster-in-a-box
  – Four Apache Hadoop benchmarks
  – TCO calculations based on Hamilton’s model
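The node counts and power envelopes above imply a per-node power budget for each design. This is simple division of the stated envelope by the node count, so it includes each node's share of disks, network, fans, and power supplies; it is not a measured per-node figure.

```python
CLUSTERS = {
    # name: (power envelope in watts, node count), both from the slide
    "Core i7": (4000.0, 17),
    "Atom cluster-in-a-box": (3500.0, 384),
}


def watts_per_node(envelope_w: float, nodes: int) -> float:
    """Power envelope divided evenly across nodes."""
    return envelope_w / nodes


per_node = {name: watts_per_node(w, n) for name, (w, n) in CLUSTERS.items()}
for name, watts in per_node.items():
    print(f"{name}: {watts:.1f} W per node")
```

The Atom design fits about 22 times as many nodes into a slightly smaller envelope, at roughly 9 W per node against roughly 235 W per Core i7 node.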
Execution Time Results
[Figure: bar chart of execution time (mins) for TeraGen, TeraSort, WordCount, and GridMix. Series: Atom, Core i7. Data labels, in extraction order: 9.5, 34.26, 6.11, 34.48, 23.68, 98, 5.66, 65.63.]
Improvement in EDP
[Figure: bar chart. Y-axis: % change in Perf./W-h, -100% to 700%. Data labels: 329%, 606%, -34%, 273%.]
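For reference, energy-delay product (EDP) is commonly defined as energy multiplied by execution time, with energy being average power multiplied by time. The sketch below uses that definition with hypothetical power and time values. The slides do not state exactly how the plotted percentages were derived, so the improvement formula here is an assumption, not the authors' method.

```python
def energy_j(avg_power_w: float, time_s: float) -> float:
    """Energy consumed at a given average power over a run."""
    return avg_power_w * time_s


def edp(avg_power_w: float, time_s: float) -> float:
    """Energy-delay product: energy * time (lower is better)."""
    return energy_j(avg_power_w, time_s) * time_s


def improvement_pct(baseline: float, candidate: float) -> float:
    """Percent improvement of a lower-is-better metric over a baseline."""
    return (baseline / candidate - 1.0) * 100.0


# Hypothetical runs, not values from the talk.
base_power_w, base_time_s = 100.0, 10.0
new_power_w, new_time_s = 80.0, 5.0

energy_gain = improvement_pct(
    energy_j(base_power_w, base_time_s), energy_j(new_power_w, new_time_s)
)
edp_gain = improvement_pct(
    edp(base_power_w, base_time_s), edp(new_power_w, new_time_s)
)
print(f"energy improvement: {energy_gain:.0f}%, EDP improvement: {edp_gain:.0f}%")
```

Because EDP multiplies in the execution time a second time, a system that is both faster and lower power gains more in EDP than in energy alone, and a slower system is penalized twice.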
Improvement in Energy
[Figure: bar chart. Y-axis: change in Perf./Watt, -40% to 160%. TeraGen 75.50%, TeraSort 147.75%, WordCount -15.38%, GridMix 46.96%.]
Performance/TCO vs. Number of Disks and Number of Cores
Conclusions
• Datacenter power and cost are limiting factors when scaling web-scale apps
  – Build clusters using light-weight, low-power CPUs
• Balanced resource provisioning can improve utilization, cost, power
  – Virtualize I/O (disk and Ethernet)
  – Amortize the overheads of fans, power supplies, etc.
• The cluster-in-a-box system yields up to 6x improvement in EDP, relative to a traditional cluster
Questions?
Thank You
CPU and Disk Utilization
[Figure panels: 768 CPUs, 64 disks; 64 CPUs, 32 disks]