Capacity Provisioning Problems in Geo-distributed Data Centers
Bhuvan Urgaonkar
Dept. of CSE
The Pennsylvania State University
Geo-distributed Data Centers
• Reasons for geo-distribution:
  - Latency
  - Availability
• What are the cost implications?
7/15/2014 Microsoft Faculty Summit 2014
What’s New?
• What is well-understood:
  - How to build single data centers cost-effectively
  - How to create distributed applications using an existing pool of data centers (that were built separately)
      E.g., content distribution networks such as Akamai
      E.g., recent work, such as SpanStore, on procuring resources from geo-distributed public clouds
• What is (likely) less well-explored:
  - Building a fleet of distributed data centers from scratch to support large-scale distributed workloads
• Approach: specific case studies -> general insights & challenges
How do costs change when we build a geo-distributed version of a centralized DC?
A Simple Thought Experiment
[Figure: a centralized DC split into four DCs labeled 25 each, with four additional blocks labeled 8 each. Plot of IT cost (y-axis, ticks at 100 and 200) against degree of distribution (x-axis, 1 to 4), with curves for base IT cost, base + spare IT cost, and costs of networking DCs.]
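The arithmetic behind this thought experiment can be sketched as follows. This is a minimal illustration, assuming identical DCs that split a base demand of 100 units evenly and must absorb the load of any single failed DC; the function name and the even-split model are assumptions made here, not taken from the talk.

```python
def total_capacity_with_one_spare(base_demand, num_dcs):
    """Total IT capacity needed so that any one DC can fail (hypothetical model).

    Assumes identical DCs, demand split evenly, and a failed DC's share
    spread evenly over the surviving DCs.
    """
    if num_dcs < 2:
        raise ValueError("need at least 2 DCs to survive a failure")
    per_dc_base = base_demand / num_dcs
    # Each survivor takes an equal slice of the failed DC's base load.
    per_dc_spare = per_dc_base / (num_dcs - 1)
    return num_dcs * (per_dc_base + per_dc_spare)


for n in (2, 3, 4):
    print(f"{n} DCs: total capacity {total_capacity_with_one_spare(100, n):.1f}")
```

Under these assumptions, four DCs each hold 25 units of base capacity plus roughly 8.3 units of spare, which appears consistent with the 25 and 8 labels in the diagram, and the total falls from 200 (two DCs) toward the base of 100 as the degree of distribution grows.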
Costs: What have we made worse?
• Networking infrastructure to connect DCs
• Larger overall IT capacity
  - Redundancy for availability
      Higher for heterogeneous collection of DCs
  - Poorer statistical multiplexing
  How do we keep this “small”?
• Non-IT infrastructure (power + cooling) costs
  - To support higher IT capacity
  Can we keep non-IT infra. “size” small?
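One way to see why redundancy costs more for a heterogeneous collection of DCs is the sketch below. It assumes a deliberately simple model that is not from the talk: each DC serves a fixed load, a failed DC's load can be split freely across the spare capacity of the survivors, and latency is ignored. The function name and the example loads are hypothetical.

```python
def min_total_spare(loads):
    """Smallest total spare capacity that lets any single DC fail.

    Hypothetical model: loads[j] is the demand served by DC j, and a failed
    DC's load can be split freely across the survivors' spare capacity.
    """
    n = len(loads)
    if n < 2:
        raise ValueError("need at least 2 DCs")
    # For every DC f we need: total_spare - spare[f] >= loads[f], spare[f] >= 0.
    # Such a split exists iff total_spare >= sum(loads) / (n - 1)
    # and total_spare >= max(loads).
    return max(sum(loads) / (n - 1), max(loads))


homogeneous = [25, 25, 25, 25]
heterogeneous = [70, 10, 10, 10]
print(min_total_spare(homogeneous))
print(min_total_spare(heterogeneous))
```

With the same total load of 100, the homogeneous fleet needs about 33 units of spare in this model, while the skewed fleet needs 70, because the survivors must be able to absorb the largest DC.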
Costs: What has improved?
• Revenue improvements due to better latency
• Aspects of availability
[Figure: IT cost (y-axis, ticks at 100 and 200) against degree of distribution (x-axis, 1 to 4), with curves for base IT cost, base + spare IT cost, and costs of networking DCs. Legend labels: Avail.=0.99, Avail.=0.999.]
Can we lower the availability of individual DC infra.?
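The question of lowering per-DC availability can be made concrete with a small calculation. The sketch below assumes independent DC failures and identical per-DC availability, both simplifying assumptions introduced here; the function name is hypothetical.

```python
from math import comb


def prob_at_most_k_down(num_dcs, dc_availability, k=1):
    """Probability that at most k of num_dcs DCs are down at the same time.

    Assumes independent failures and the same availability for every DC
    (illustrative assumptions, not claims from the talk).
    """
    p_down = 1.0 - dc_availability
    return sum(
        comb(num_dcs, i) * p_down**i * dc_availability ** (num_dcs - i)
        for i in range(k + 1)
    )


print(f"{prob_at_most_k_down(4, 0.99):.6f}")
```

Under these assumptions, four DCs of availability 0.99 that are provisioned to tolerate one failure are jointly usable with probability of roughly 0.9994, which is above 0.999 even though each individual DC is built to the cheaper 0.99 level.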
Outline
• An example of cost-effective IT provisioning
• Keeping non-IT infrastructure costs low
  - Lowering peak power related costs using batteries
• Conclusions
Problem Setting
• DC locations given
• Client demands known, time-varying
• Goal: determine total capacity at each DC
  - To meet latency constraints, and
  - To allow for one DC to fail
• Our optimizer: an LP
  - More generally, these are NP-hard facility location problems
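One plausible way to write such an LP is sketched below. The notation is an assumption made for illustration and is not necessarily the formulation used in the cited work: $i$ indexes client regions, $j$ indexes DCs, $t$ indexes time slots, $f = 0$ denotes normal operation and $f = 1, \dots, J$ denotes failure of DC $f$, $d_{it}$ is demand, $\ell_{ij}$ is the latency from region $i$ to DC $j$, and $L$ and $L'$ are the latency bounds without and with a failure.

```latex
\begin{align*}
\min_{C,\,x} \quad & \sum_{j} C_j \\
\text{s.t.} \quad
  & \sum_{j \in A_f(i)} x^{f}_{ijt} = d_{it}
      && \forall i,\ t,\ f \\
  & \sum_{i} x^{f}_{ijt} \le C_j
      && \forall t,\ f,\ j \ne f \\
  & x^{f}_{ijt} \ge 0, \quad C_j \ge 0 \\[4pt]
\text{where} \quad
  & A_f(i) = \{\, j \ne f : \ell_{ij} \le L_f \,\}, \quad
    L_0 = L, \quad L_f = L' \ \text{for } f \ge 1 .
\end{align*}
```

Here $C_j$ is the capacity to build at DC $j$ and $x^{f}_{ijt}$ is the demand from region $i$ routed to DC $j$ at time $t$ in scenario $f$. Because the DC locations are fixed inputs, every variable is continuous and the problem stays linear.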
Results
• DC locations
  - 6 MS data centers in the US
• Client demand model
  - Exhibits time-zone-specific variation
  - Proportional to population
[Figure: example time-varying demand curves for New York and Oregon.]
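A demand model of this kind can be sketched as below. The sinusoidal shape, the 20:00 local peak, and the population values are illustrative assumptions, not the measured demand used in the talk; the function name is hypothetical.

```python
import math


def diurnal_demand(population, utc_offset_hours):
    """Hourly demand over one UTC day for a region (hypothetical shape).

    Demand is proportional to population and follows a sinusoid that
    peaks at 20:00 local time; both choices are illustrative assumptions.
    """
    demand = []
    for utc_hour in range(24):
        local_hour = (utc_hour + utc_offset_hours) % 24
        shape = 1.0 + 0.5 * math.cos(2 * math.pi * (local_hour - 20) / 24)
        demand.append(population * shape)
    return demand


# Illustrative populations (arbitrary units) and UTC offsets.
new_york = diurnal_demand(population=2.0, utc_offset_hours=-5)
oregon = diurnal_demand(population=1.0, utc_offset_hours=-8)

sum_of_peaks = max(new_york) + max(oregon)
peak_of_sum = max(a + b for a, b in zip(new_york, oregon))
print(f"sum of regional peaks: {sum_of_peaks:.3f}")
print(f"peak of total demand:  {peak_of_sum:.3f}")
```

Because the two regions peak at different UTC hours, the peak of the combined demand is lower than the sum of the regional peaks. Capacity that can be shared across time zones can exploit that gap.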
Results
Experiments using demand measured for one Microsoft cluster and 6 MS DC locations within the US, with L’ = L.
[Figure: bar chart of total capacity (axis from 0 to 160) for five provisioning schemes: Single DC capacity; Nearest DC (no failure); Optimized (support 1 failure); Without time-of-day; Optimized (no failure). Callouts: “Availability (against 1 failure) for free!” and “Excess capacity for high availability”.]
Details: Narayanan et al., “Towards leaner geo-distributed cloud infrastructure,” Proc. HotCloud 2014
A Closer Look at Power Infrastructure
[Figure: data center power infrastructure hierarchy, showing the utility substation, diesel generator, auto transfer switch (ATS), UPS battery, power distribution units (PDUs), and server racks. Cost labels on the figure: 1 $/W, 0.6 $/W, 0.3 $/W, 0.2 $/W.]
Lowering Peak Draw
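[Figure: power against time. A power cap sits below the original peak, the re-shaped power profile stays under the cap, and the gap between the original peak and the cap is marked as the cap-ex saving.]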
Using Energy Storage
[Figure: power (W) against time with a power cap. An energy storage device (ESD) covers the demand above the cap, so the new draw stays at or below the cap with no performance impact.]
How to provision and harness ESDs in data centers?
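The idea of holding the utility draw under a cap with an ESD can be sketched as a simple greedy policy. The model below is an assumption made for illustration: the ESD is lossless, starts full, has no power limit of its own, and the demand trace is made up. It is not the provisioning method from the talk.

```python
def shave_peaks(power_trace, power_cap, esd_energy, step_hours=1.0):
    """Greedy peak shaving with an ideal energy storage device (ESD).

    Hypothetical model: the ESD is lossless, starts full, has no power
    limit, discharges whenever demand exceeds the cap, and recharges with
    whatever headroom is left below the cap.
    Returns the utility draw per step; raises if the ESD runs out.
    """
    stored = esd_energy
    draw = []
    for demand in power_trace:
        if demand > power_cap:
            needed = (demand - power_cap) * step_hours
            if needed > stored + 1e-9:
                raise ValueError("ESD too small to hold the power cap")
            stored -= needed
            draw.append(power_cap)
        else:
            headroom = (power_cap - demand) * step_hours
            charge = min(headroom, esd_energy - stored)
            stored += charge
            draw.append(demand + charge / step_hours)
    return draw


trace_kw = [60, 70, 100, 110, 80, 60]  # made-up hourly demand in kW
new_draw = shave_peaks(trace_kw, power_cap=90, esd_energy=30)
print(new_draw)
```

In this example the servers still receive their full demand in every hour, but the utility draw never exceeds the 90 kW cap, so the upstream power infrastructure could be sized for 90 kW instead of 110 kW.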
ESDs in Current Data Centers
[Figure: two copies of the power hierarchy, each marked “Cost Saving”, with cost labels 1 $/W, 0.6 $/W, 0.3 $/W, 0.2 $/W.]
Why restrict ESDs to any one level of the data center power hierarchy (e.g., central or server)?
Ragone Plot
[Figure: Ragone plot of specific energy (Wh/kg, log scale from 0.1 to 10,000) against specific power (W/kg, log scale from 10 to 1,000,000). Technologies shown: batteries (lead acid, lithium ion), fuel cell, compressed air (CAES), flywheels (FW), capacitors, and supercapacitors/ultracapacitors (UC).]
Capital Cost (Energy and Power)
                      Ultracapacitor   Flywheel   Lithium ion battery   Lead-acid battery   Compressed air
Energy cost ($/kWh)   10,000           5,000      525                   200                 50
Power cost ($/kW)     100              250        175                   125                 600
Why restrict to a single ESD technology (e.g., lead-acid battery)?
Multi-level Multi-technology ESDs
[Figure: power hierarchy (utility and diesel generator, ATS, PDUs, racks, server hardware) with ESDs placed at several levels rather than one. Technology labels on the figure: battery, capacitor, flywheel, compressed air.]
Cost Savings for Google Workloads
[Figure: bar chart of savings ($/day, axis from 0 to 5,000) for four ESD configurations: single-tech datacenter-level; single-tech server-level; multi-tech server-level; multi-tech multi-level. Bars are annotated with (savings, ESD cost) pairs (3.9k, 0.2k), (4.7k, 0.3k), (4.9k, 0.4k), (5.2k, 0.3k), with the percentages 20%, 25%, 30%, and with the technologies chosen (datacenter: compressed air; server: lead acid; server: ultracapacitor + lead acid; datacenter: flywheel + compressed air with server: lead acid). Inset: power (MW) against time (hour).]
Total cost without ESD is $12k/day
Details: Wang et al., “Energy Storage in the Datacenter: What, Where, and How Much?,” Proc. ACM SIGMETRICS 2012
Related Work
• IT capacity provisioning
  - Capacity planning [Goiri et al. ICDCS’11]
      Showed that building more DCs, each of lower availability (and lower cost) but with extra geo-spares, is better
      Computed optimal capacity placements
• Lowering infrastructure availability/cost
  - Reducing the “size” of power infrastructure
      Under-provisioning backup generators [Wang14]
      Reducing component redundancy [Govindan11, Kansal13]
  - Less aggressive cooling design
      Similar in offering an availability vs. cost trade-off [Schroeder@Sigmetrics12]
      Related work in geo-distributed setting: [Wierman]
  - Lower availability IT
Conclusions
• Cost-effective capacity provisioning of geo-distributed data centers presents opportunities for novel problems in optimization and system design
- Putting together lower availability data centers with appropriate fault tolerance mechanisms during subsequent operation
- Key source of difficulty is uncertainty of subsequent workload evolution
    Typical facility-location-based formulations might be inadequate
    Stochastic optimization? Robust optimization?
• More information: http://www.cse.psu.edu/~bhuvan
• Joint work with: Anand Sivasubramaniam, Aman Kansal, Di Wang, Sriram Govindan, Hosam Fathy, Iyswarya Narayanan