Cost-effective Cloud HPC Resource Provisioning by Building Semi-Elastic Virtual Clusters
Shuangcheng Niu1, Jidong Zhai1, Xiaosong Ma2,3
Xiongchao Tang1, Wenguang Chen1
THU1 & NCSU2 & ORNL3
2
Is “HPC in Cloud” a Trend?
HPC in cloud
◦ On-demand
◦ Elastic
◦ No upfront cost
◦ Saving management fee
◦ …
More and more engineers are starting to use the HPC cloud
3
Is the “On-demand Model” Effective?
Reserved instance pricing model
◦ 6 reserved instance classes in Amazon EC2 CCI
◦ Discounted charge rate with upfront fee
[Figure: Amazon’s EC2 cc2.8xlarge Pricing Model. Total instance cost ($) vs. time (day) for On-Demand, 3Y-Light, and 3Y-Medium; annotations: 6.8%, 38.3%]
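The trade-off between a discounted rate and an upfront fee can be written as a break-even condition. This is a sketch under assumed notation that does not appear on the slides: $F$ is the upfront fee, $r_{\mathrm{OD}}$ and $r_{\mathrm{res}}$ are the on-demand and reserved hourly rates, and $t$ is the number of hours the instance is actually used within the reservation term.

```latex
C_{\mathrm{OD}}(t) = r_{\mathrm{OD}}\, t, \qquad
C_{\mathrm{res}}(t) = F + r_{\mathrm{res}}\, t, \qquad
C_{\mathrm{res}}(t^{*}) = C_{\mathrm{OD}}(t^{*})
\;\Longrightarrow\;
t^{*} = \frac{F}{r_{\mathrm{OD}} - r_{\mathrm{res}}}
```

Under these assumptions a reserved instance only pays off when it is used for more than $t^{*}$ hours, which is why low-utilization users gain little from reservations.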
4
“On-demand Model” Is Lower Utilized!
Reserved instance pricing model
◦ Difficult for individuals to utilize
SDSC Data Star system trace
◦ 391 days
◦ 460 users
◦ 1 user, 1 3Y-Light
Instance Type    Used
3Y-Medium        0
3Y-Light         0.15 %
On-demand        99.85 %
5
Short Jobs
Hourly-charging granularity
Several minutes of delay at startup
Maybe I should pack my short jobs to lower my rental cost.
70%
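The effect of hourly charging on short jobs can be illustrated with a small calculation. This is a minimal sketch, assuming each job runs on one instance, billing rounds every rental up to a whole hour, and startup delay is ignored; the function names are hypothetical.

```python
import math


def billed_hours_separate(runtimes_min):
    """Each job rents its own instance, so each job is rounded up to whole hours."""
    return sum(math.ceil(r / 60) for r in runtimes_min)


def billed_hours_packed(runtimes_min):
    """Jobs run back to back on one instance, so only the total is rounded up."""
    return math.ceil(sum(runtimes_min) / 60)


if __name__ == "__main__":
    short_jobs = [10, 20, 15, 5]  # runtimes in minutes (illustrative values)
    print(billed_hours_separate(short_jobs))  # 4 billed hours
    print(billed_hours_packed(short_jobs))    # 1 billed hour
```

Packing trades a longer completion time for fewer billed hours, which is the manual workaround the slide describes.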
6
Our Proposal
Semi-Elastic Cluster computing model
◦ Organization-owned
◦ Cloud-based virtual cluster
◦ Dynamic capacity
◦ Sharing resources between users
7
SEC Architecture
8
SEC Model
Traditional local cluster: wait time 15 min, utilization 56.7%
[Figure: job diagram with jobs A (0,1.5), C (1,0.75), D (1.75,1.5)]
9
SEC Model
Traditional local cluster: wait time 15 min, utilization 56.7%
Pure on-demand cloud: wait time 0 min, utilization 70.8%
[Figure: job diagrams with jobs A (0,1.5), C (1,0.75), D (1.75,1.5)]
10
SEC Model
Traditional local cluster: wait time 15 min, utilization 56.7%
Pure on-demand cloud: wait time 0 min, utilization 70.8%
Semi-elastic cluster: wait time 0 min, utilization 77.3%
[Figure: job diagrams with jobs A (0,1.5), C (1,0.75), D (1.75,1.5)]
11
Aggregated Workloads
SEC trace slices with SDSC Data Star workload
[Figure: number of instances vs. time (day); series: Used Size, Allocated]
Instance type usage: 3Y-Medium 73.66 %, 3Y-Light 15.75 %, On-Demand 10.59 %
12
SEC Challenges
Finer-tuned capacity
◦ Intelligently controlled capacity according to job queue and submission history
◦ Tradeoff between responsiveness and lower cost
Aggregated workloads
◦ Predict long-term resource requirements
◦ Auto resource provisioning
Evaluation without real traces
13
Job Scheduling & Cluster Size Scaling
Problem definition
◦ Configurable wait time constraint
◦ Minimize total cost
Batch scheduling
◦ Extended backfilling algorithms
◦ Dynamic resource provisioning
Resource provisioning strategies
◦ Wait-time bounded instance acquisition
◦ Expanding capacity according to job queue
Job placement policies
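One way to read "wait-time bounded instance acquisition" is sketched below. It is not the scheduler from the talk: the FCFS walk over the queue, the `QueuedJob` type, and the `instances_to_acquire` function are all assumptions made for illustration. A queued job waits for nodes to be released if its estimated wait stays within the bound; otherwise enough new instances are requested to start it immediately.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    nodes: int          # number of instances the job needs
    submit_time: float  # submission time in seconds


def instances_to_acquire(queue, now, free_nodes, releases, wait_bound):
    """Return how many extra instances to request at time `now`.

    queue:      queued jobs in FCFS order
    releases:   (finish_time, nodes) pairs for running jobs (estimated)
    wait_bound: maximum tolerated wait time in seconds
    """
    pending = sorted(releases)  # earliest release first
    free = free_nodes
    extra = 0
    for job in queue:
        if job.nodes <= free:   # fits on idle instances right away
            free -= job.nodes
            continue
        # Find the earliest release at which enough nodes become available.
        available, start, used, last_time = free, None, 0, now
        for i, (finish, nodes) in enumerate(pending):
            available += nodes
            if available >= job.nodes:
                start, used, last_time = max(finish, now), i + 1, finish
                break
        if start is not None and start - job.submit_time <= wait_bound:
            # The job can wait: hold the idle and soon-to-be-released nodes for it.
            leftover = available - job.nodes
            pending = pending[used:]
            if leftover:
                pending.insert(0, (last_time, leftover))
            free = 0
        else:
            # Waiting would break the bound: acquire the missing instances now.
            extra += job.nodes - free
            free = 0
    return extra
```

For example, with one idle node, releases of 2 nodes at t=150 and 3 nodes at t=400, and a 120-second bound at t=100, a 4-node job submitted at t=0 triggers the acquisition of 3 instances, while a 2-node job submitted at t=50 simply waits for the release at t=150.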
14
Experimental Setup
Workload
◦ 391-day trace from SDSC’s Data Star system
Cloud platform
◦ Amazon's EC2 Cluster Compute Instances (CCIs)
◦ Cluster Compute Eight Extra Large instances (cc2.8xlarge)
◦ 16 cores (2 × eight-core Intel Xeon E5-2670)
◦ 60.5 GB memory
◦ 4 × 850 GB instance storage
15
SEC vs. On-demand Model
◦ Individual
◦ NoWait
◦ SEC-On-Demand
◦ SEC-Hybrid
Trace: SDSC DS
[Figure: avg. cost rate ($/hour) vs. avg. wait time (sec) for Individual, NoWait, SEC-On-Demand, and SEC-Hybrid; annotations: 61.0%, 13.3%]
16
SEC vs. Local Cluster
◦ Traditional local cluster
◦ SEC-Hybrid
Trace: SDSC DS
[Figure: avg. cost rate ($/hour) vs. avg. wait time (sec) for Local-1.5X, Local-1.75X, Local-2X, and SEC-Hybrid]
17
Offline Reserved Instance Configuration
Offline configuration problem
◦ Input
  Utilization matrix $U_{n \times m}$ (from given cluster capacity trace)
  Pricing classes $\{C_0, C_1, C_2, \ldots, C_h\}$
◦ Solution
  Purchased instance matrix: $R_{n \times m}$, where $R_{i,k} \ge 0$
◦ Optimization
  Minimizing total rental cost
A hard problem!
18
Offline Forward Greedy Algorithm
Choose a larger time interval, e.g., a week
◦ Reduces computation granularity
Running: at the beginning of each time interval
Steps:
1) Calculate every instance's utilization level based on the given future demands
2) Identify the first economical class for each instance
3) Summarize the provisioning plan
4) Compare the provisioning plan with the current inventory and decide how many instances to purchase
5) Adjust the active reserved instances
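Steps 1 to 4 can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: instance "levels" are numbered from 0, the utilization of level i is the fraction of upcoming weeks that need more than i instances, "first economical class" is read as the first reserved class in a preference-ordered list that beats on-demand over its own term, and all prices are hypothetical. Step 5 (adjusting active reserved instances) is not modeled.

```python
from collections import Counter
from dataclasses import dataclass

HOURS_PER_WEEK = 168


@dataclass(frozen=True)
class ReservedClass:
    name: str
    upfront: float   # one-time fee in dollars (hypothetical)
    hourly: float    # discounted hourly rate in dollars (hypothetical)
    term_weeks: int  # reservation length in weeks


def utilization(demand, level, horizon):
    """Step 1: fraction of the next `horizon` weeks needing more than `level` instances."""
    window = demand[:horizon]
    return sum(d > level for d in window) / len(window) if window else 0.0


def first_economical_class(demand, level, classes, on_demand_hourly):
    """Step 2: first reserved class that is cheaper than on-demand for this level."""
    for cls in classes:
        # If the demand list is shorter than the term, the observed
        # utilization is assumed to hold for the whole term.
        used_hours = utilization(demand, level, cls.term_weeks) * cls.term_weeks * HOURS_PER_WEEK
        if cls.upfront + cls.hourly * used_hours < on_demand_hourly * used_hours:
            return cls.name
    return "on-demand"


def greedy_plan(demand, classes, on_demand_hourly, inventory):
    """Steps 3 and 4: summarize the plan and compare it with the inventory."""
    horizon = max(c.term_weeks for c in classes)
    peak = max(demand[:horizon], default=0)
    plan = Counter(
        first_economical_class(demand, level, classes, on_demand_hourly)
        for level in range(peak)
    )
    purchases = {c.name: max(0, plan[c.name] - inventory.get(c.name, 0)) for c in classes}
    return dict(plan), purchases
```

With weekly demands of `[4, 3, 3, 1, 1]`, an on-demand rate of 1.0, a "heavy" class (upfront 300, hourly 0.5, 5-week term) and a "light" class (upfront 40, hourly 0.8, 5-week term), the always-busy level goes to "heavy", the two levels busy 60% of the time go to "light", and the level busy 20% of the time stays on-demand.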
19
Offline Optimal-Competitive Algorithm
Transform the original pricing classes into new classes
TotalCost (Ck) ≥ TotalCost(Ck’) =
20
Online Reserved Instance Configuration
Use weekly time intervals
◦ Reduces computation complexity
◦ Reduces short-term variance
◦ Less impact on long-term reservation decisions
Evolution model
◦ Assumes a quadratic polynomial model
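A quadratic polynomial model of weekly demand can be fitted by ordinary least squares. The sketch below is only an illustration of that idea: it assumes the evolution model describes demand as a function of the week index, and `fit_quadratic` and `predict` are hypothetical helper names, not part of SEC.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_quadratic(ys):
    """Least-squares fit of y = a*t^2 + b*t + c for t = 0, 1, ..., len(ys) - 1."""
    ts = range(len(ys))
    s = [sum(t ** k for t in ts) for k in range(5)]                     # sums of t^k
    r = [sum((t ** k) * y for t, y in zip(ts, ys)) for k in range(3)]   # sums of t^k * y
    m = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]    # normal equations
    d = det3(m)
    if d == 0:
        raise ValueError("need at least three weeks of data")
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        mc = [row[:] for row in m]
        for row_idx in range(3):
            mc[row_idx][col] = r[row_idx]
        coeffs.append(det3(mc) / d)
    c, b, a = coeffs
    return a, b, c


def predict(coeffs, t):
    """Evaluate the fitted polynomial at week index t."""
    a, b, c = coeffs
    return a * t * t + b * t + c
```

Fitting weekly rather than hourly values keeps the input short and smooths short-term variance, in line with the bullets above.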
21
Long-Term Demand Prediction
Classical Exponential Smoothing (ES) method
◦ Relatively simple
◦ Quite robust to non-stationary noise
◦ Widely used
Our prediction method
◦ Extended Holt's double-parameter ES method
◦ Automatically adjusted smoothing factors
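For reference, the standard form of Holt's double-parameter exponential smoothing is sketched below. The talk's extension is not described on the slides, so the grid search in `tune_factors` is only an assumed stand-in for "auto adjusting smoothing factors"; all function names are hypothetical.

```python
def holt_forecast(series, alpha, beta, steps_ahead=1):
    """Holt's linear (double-parameter) exponential smoothing forecast."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)    # smoothed level
        trend = beta * (level - prev_level) + (1 - beta) * trend  # smoothed trend
    return level + steps_ahead * trend


def one_step_sse(series, alpha, beta):
    """Sum of squared one-step-ahead forecast errors over the history."""
    return sum(
        (holt_forecast(series[:k], alpha, beta) - series[k]) ** 2
        for k in range(2, len(series))
    )


def tune_factors(series, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Assumed tuning rule: pick the (alpha, beta) pair with the lowest history error."""
    return min(
        ((a, b) for a in grid for b in grid),
        key=lambda ab: one_step_sse(series, *ab),
    )
```

A typical use would be `alpha, beta = tune_factors(weekly_demand)` followed by `holt_forecast(weekly_demand, alpha, beta, steps_ahead=4)` to estimate demand four weeks ahead, where `weekly_demand` is a list of observed weekly instance counts.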
22
Verifying Workloads
Validation workloads
HPC cluster
•Bounded by fixed machine size
•6 real traces
SEC
•Semi-elastic machine size
SNS
•Not bounded
•6 SNS traces
23
SNS-based Synthetic Workloads
[Figure: Synthetic Workload Generation. Labels: SearchTraffic, ActiveUsers_SNS, ActiveUsers, ResourceDemand_HPC; inputs: SNS search traffic, HPC trace slices; output: synthetic workload]
24
Reserved Instance Configuration Analysis
HPC trace
[Figure: avg. charge rate ($/hour) on SDSC DS, HPC2N, and Sandia Ross for Optimal-Competitive, Offline-Greedy, Online-SEC, Online-3Y-Only, Online-1Y-Only, and Online-OD-Only]
25
Reserved Instance Configuration Analysis
Synthetic workloads using SNS trace
[Figure: avg. charge rate ($/hour) on Facebook, MySpace, Flickr, and Renren for Optimal-Competitive, Offline-Greedy, Online-SEC, Online-3Y-Only, Online-1Y-Only, and Online-OD-Only]
26
Overhead Analysis with SEC Prototype
Overhead for data protection with instance reuse
◦ Reformatting EC2 ephemeral 4×845GB disks
◦ 3.4 seconds
Configuration overhead when requesting new instances
◦ Configuring host names, hosts file, file system, etc.
◦ About 8.0 seconds
Configuration overhead when releasing instances
◦ About 5.0 seconds
27
Conclusion
SEC: A new execution model for HPC
◦ Organization-owned dynamic cloud-based clusters
◦ Reduced costs through workload aggregation
◦ Better responsiveness through instance reuse
◦ Higher utilization through efficient use of residual resources
SEC can potentially become a viable alternative to organizations owning and managing physical clusters
28
Related Work
[1] Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/, 2012.
[2] SLURM: A Highly Scalable Resource Manager. https://computing.llnl.gov/linux/slurm/, 2012.
[3] StarCluster. http://web.mit.edu/star/cluster/, 2012.
[4] Google Trends. http://www.google.com/trends/, 2013.
[5] E. S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 1985.
[6] W. Voorsluys, S. Garg, and R. Buyya. Provisioning spot market cloud resources to create cost-effective virtual clusters. Algorithms and Architectures for Parallel Processing, 2011.
[7] H. Zhao, M. Pan, X. Liu, X. Li, and Y. Fang. Optimal resource rental planning for elastic applications in cloud market. In Parallel & Distributed Processing Symposium (IPDPS), IEEE, 2012.
29
Acknowledgments
We would like to thank
◦ HPC Workloads archive
◦ Anonymous reviewers and shepherd
◦ Research grants from Chinese 863 project, NSF grants, a joint faculty appointment between ORNL and NCSU, and a senior visiting scholarship at Tsinghua University
30
Thanks!
31
Classical HPC traces
SDSC's Data Star (SDSC DS), SDSC's Blue Horizon (SDSC Blue), SDSC's IBM SP2 (SDSC SP2), Cornell Theory Center IBM SP2 (CTC SP2), High Performance Computing Center North (HPC2N), Sandia Ross cluster (Sandia Ross).
Variance in node-hour per active user
32
Synthetic workloads
SNS search trace from Google Trends
33
Cost-responsiveness analysis
Local cluster expense items
34
Impact of scheduling parameters
35
Impact of scheduling parameters
Average wait time
[Figure: average wait time by expanding strategies and wait time threshold]
36
Impact of scheduling parameters
Average charge rate
[Figure: average charge rate by expanding strategies and wait time threshold]
37
Overhead Analysis with SEC Prototype
Overhead for data protection with instance reuse
◦ Reformatting EC2 ephemeral 4×845GB disks
◦ 3.4 seconds
Configuration overhead when requesting new instances
◦ Configuring host names, hosts file, and the file system
◦ Setting up user accounts and adding nodes to the SLURM partition
Configuration overhead when releasing instances