Computer Science

AGILE: Elastic distributed resource scaling for Infrastructure-as-a-Service

Hiep Nguyen, Zhiming Shen, Xiaohui (Helen) Gu
North Carolina State University

Sethuraman Subbiah
NetApp

John Wilkes
Google

Elastic resource scaling for Infrastructure-as-a-Service

§  Elasticity: grow/shrink resources as required

[Figure: CPU demand over time, with the number of VMs growing and shrinking to match.]

Design goals

§  Application agnostic
–  Easy to deploy
–  Support different applications

§  Effective overload handling
–  Predict overload accurately
–  Minimize SLO violations
–  Minimize resource cost

§  Low overhead
–  Light-weight
–  Little interference

State of the art

§  Reactive resource scaling [e.g., Amazon EC2]
–  Performance degradation due to long instantiation latency (≈ 2 minutes)

§  Trace-driven resource scaling [e.g., Chandra et al. IWQoS 2003, Gong et al. CNSM 2007, Shen et al. SOCC 2011]
–  Focus on short-term prediction or need to assume cyclic workload patterns

§  Model-driven resource scaling [e.g., Zhu et al. ICAC 2008, Kalyvianaki et al. ICAC 2009, Padala et al. EuroSys 2007]
–  Have parameters that need to be specified or tuned offline for different applications/workloads

AGILE system overview

[Figure: AGILE architecture. Components: resource usage monitoring, resource demand prediction, resource pressure modeling (with SLO violation feedback), server pool prediction, and a server pool scaling manager that adds/removes VMs. Annotations: future resource demand, resource to maintain, when to scale, how many VMs. A timeline marks the advance alert, the start of the overload, and the end of the overload.]
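The two questions on this slide, when to scale and how many VMs, can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a predicted demand series, a fixed per-server capacity, and a maximum resource pressure to maintain, and every name in it (`servers_needed`, `plan_scaling`, `max_pressure`) is hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def servers_needed(demand: float, per_server_capacity: float, max_pressure: float) -> int:
    """Smallest pool size that keeps per-server pressure at or below max_pressure."""
    usable = per_server_capacity * max_pressure
    return max(1, math.ceil(demand / usable))


def first_overload_step(predicted: Sequence[float], servers: int,
                        per_server_capacity: float, max_pressure: float) -> Optional[int]:
    """Index of the first predicted sample the current pool cannot absorb, or None."""
    limit = servers * per_server_capacity * max_pressure
    for step, demand in enumerate(predicted):
        if demand > limit:
            return step
    return None


def plan_scaling(predicted: Sequence[float], servers: int,
                 per_server_capacity: float, max_pressure: float) -> Optional[Tuple[int, int]]:
    """Return (overload_step, target_pool_size), or None if no overload is predicted."""
    step = first_overload_step(predicted, servers, per_server_capacity, max_pressure)
    if step is None:
        return None
    # Size the pool for the highest demand from the overload onward (an assumption).
    peak = max(predicted[step:])
    return step, servers_needed(peak, per_server_capacity, max_pressure)


if __name__ == "__main__":
    # One server of capacity 100, kept at or below 75% pressure.
    print(plan_scaling([60, 70, 90, 160, 120], 1, 100.0, 0.75))
```

The returned step gives the advance alert its deadline; the new VMs would need to be ready by then.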

Pre-copy live VM cloning

§  Design goals
–  Immediate performance scale-up
–  Avoid storing and maintaining VM snapshots

[Figure: a VM's memory and disk image; the memory is pre-copied.]

[Figure: the memory and its copy; the disk image is read-only, with two incremental images.]

Pre-copy live VM cloning

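§  Dynamic copy-rate configuration
–  Minimum copy-rate (e.g., little interference)
–  Finish cloning within the overload pending time

$$\mathit{CopyRate} = \frac{\mathit{MemorySize}}{t_{clone}} + r_{dirty}$$

where $t_{clone}$ is the time allowed for cloning and $r_{dirty}$ is the rate at which memory is dirtied.

[Figure annotations: predicted overload; avg. ≈ 100 Mbps.]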
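The copy-rate rule on this slide reads as CopyRate = MemorySize / t_clone + r_dirty. A minimal sketch of that arithmetic follows. The units (memory in megabytes, rates in Mbps, time in seconds) and the function name `copy_rate_mbps` are assumptions made for illustration; the slide does not state them.

```python
def copy_rate_mbps(memory_size_mb: float, time_to_overload_s: float,
                   dirty_rate_mbps: float) -> float:
    """CopyRate = MemorySize / t_clone + r_dirty, in megabits per second.

    memory_size_mb     -- VM memory to copy, in megabytes (assumed unit)
    time_to_overload_s -- time left before the predicted overload, in seconds
    dirty_rate_mbps    -- rate at which memory is dirtied, in Mbps (assumed unit)
    """
    if time_to_overload_s <= 0:
        raise ValueError("no time left before the predicted overload")
    memory_megabits = memory_size_mb * 8  # megabytes -> megabits
    return memory_megabits / time_to_overload_s + dirty_rate_mbps


if __name__ == "__main__":
    # 1000 MB of memory, 100 s until the predicted overload, 20 Mbps dirty rate.
    print(copy_rate_mbps(1000, 100, 20))
```

A longer lead time lowers the required rate, which is why an earlier alert means less interference with the running VM.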

Performance scale-up comparison

Immediate performance scale-up

[Figure: time from new VM request to application start; the two cases shown take 2.5 minutes and 5 seconds.]

Wavelet-based medium-term prediction

[Figure: the original signal is decomposed into detail signals at scales 1 to 4 and an approximation signal at scale 4, which are then synthesized.]
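The decompose/synthesize step in this figure can be sketched with a four-level wavelet transform. The slide does not name the wavelet, so the unnormalized Haar wavelet used here is an assumption, as are the function names. The sketch covers only decomposition and synthesis; predicting each scale's signal before synthesis is left out.

```python
from typing import List, Tuple


def haar_decompose(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Split a signal into one approximation signal and `levels` detail signals."""
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])  # half-differences
        approx = [(a + b) / 2 for a, b in pairs]         # pairwise averages
    return approx, details  # details[0] is scale 1 (finest)


def haar_synthesize(approx: List[float], details: List[List[float]]) -> List[float]:
    """Rebuild the signal from the approximation and detail signals."""
    signal = list(approx)
    for detail in reversed(details):  # coarsest scale first
        rebuilt = []
        for avg, diff in zip(signal, detail):
            rebuilt.extend([avg + diff, avg - diff])
        signal = rebuilt
    return signal


if __name__ == "__main__":
    cpu = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0,
           3.0, 1.0, 7.0, 9.0, 2.0, 2.0, 0.0, 4.0]
    approx, details = haar_decompose(cpu, 4)
    print([len(d) for d in details], approx)
    print(haar_synthesize(approx, details) == cpu)
```

Each coarser scale has half as many samples, so slow trends end up in the approximation signal and short-lived fluctuations in the fine-scale details.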

Online resource pressure modeling

§  Mapping function between:
–  Resource pressure
–  SLO violation rate

[Figure: mapping function plotted against observations.]
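One way to picture a mapping between resource pressure and SLO violation rate that is learned online is a running average per pressure bucket. This is a deliberately simple stand-in and not the fitting method behind the slide; the class `PressureModel`, its methods, and the 10% bucket width are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class PressureModel:
    """Running average of SLO violation rate per resource-pressure bucket."""

    def __init__(self, bucket_percent: int = 10):
        self.bucket_percent = bucket_percent
        # bucket index -> (sum of observed violation rates, number of samples)
        self._totals: Dict[int, Tuple[float, int]] = {}

    def _bucket(self, pressure: float) -> int:
        # Work in whole percent to avoid float edge cases such as 0.7 / 0.1.
        return int(round(pressure * 100)) // self.bucket_percent

    def observe(self, pressure: float, violation_rate: float) -> None:
        """Record one (pressure, SLO violation rate) observation."""
        b = self._bucket(pressure)
        total, count = self._totals.get(b, (0.0, 0))
        self._totals[b] = (total + violation_rate, count + 1)

    def expected_violation_rate(self, pressure: float) -> Optional[float]:
        """Average violation rate seen at this pressure level, or None if unseen."""
        entry = self._totals.get(self._bucket(pressure))
        if entry is None:
            return None
        total, count = entry
        return total / count

    def max_safe_pressure(self, target_rate: float) -> Optional[float]:
        """Upper edge of the last bucket, scanning upward, that meets the target."""
        safe_edge = None
        for b in sorted(self._totals):
            total, count = self._totals[b]
            if total / count > target_rate:
                break
            safe_edge = (b + 1) * self.bucket_percent / 100
        return safe_edge


if __name__ == "__main__":
    model = PressureModel()
    for p, v in [(0.55, 0.0), (0.65, 0.0), (0.75, 0.02),
                 (0.78, 0.04), (0.85, 0.20), (0.95, 0.60)]:
        model.observe(p, v)
    print(model.max_safe_pressure(0.05))
```

The pressure returned by `max_safe_pressure` is the kind of value a scaling decision could use as the pressure to maintain.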

Optimizations for AGILE cloning

§  Post-cloning auto-configuration
–  Event-driven auto-configuration
–  Application VMs can subscribe to critical events

§  False alarm handling
–  Continuously check predicted overload state
–  Cancel cloning triggered by the false alarm
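The false alarm handling described above amounts to re-checking the prediction while a clone is in progress. The sketch below shows that control flow only. `CloneJob` and `supervise_clone` are hypothetical names, and no real hypervisor API is called; a real job would wrap the modified KVM cloning operation.

```python
from typing import Callable


class CloneJob:
    """Hypothetical handle for an in-progress clone (no hypervisor calls here)."""

    def __init__(self) -> None:
        self.state = "copying"

    def cancel(self) -> None:
        self.state = "cancelled"

    def finish(self) -> None:
        self.state = "done"


def supervise_clone(job: CloneJob,
                    overload_still_predicted: Callable[[], bool],
                    copy_steps: int) -> str:
    """Re-check the prediction before every copy step; cancel on a false alarm."""
    for _ in range(copy_steps):
        if not overload_still_predicted():
            job.cancel()  # the predicted overload went away: stop cloning
            return job.state
        # A real implementation would copy one batch of memory pages here.
    job.finish()
    return job.state


if __name__ == "__main__":
    checks = iter([True, True, False, True])
    print(supervise_clone(CloneJob(), lambda: next(checks), 4))
    print(supervise_clone(CloneJob(), lambda: True, 3))
```

Checking between copy steps keeps the cost of a wrong prediction bounded by the memory already copied.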

Experimental evaluation

§  Implemented on top of KVM
–  Modified KVM to support pre-copy live cloning

§  Test bed:
–  10 cloud nodes running CentOS 6.2 with KVM 0.12.1.2

§  Benchmark systems
–  RUBiS driven by four real workload traces
•  WorldCup '98, EPA, NASA, ClarkNet (one-day traces)
–  Google cluster data: 100 CPU usage and 100 memory usage traces (29 days)

Wavelet-based prediction accuracy

§  RUBiS traces

Wavelet-based prediction accuracy

§  Google CPU traces

Overload handling

§  Web server and database server scaling (≈ 2 hours, scale from 1 to 2 servers)

Overload handling

§  Web server: during scaling

Dynamic copy-rate configuration

Accurately control the cloning time under different deadlines

Conclusion

§  Prediction-driven elastic distributed resource scaling:
–  Accurate medium-term prediction based on wavelet transforms
–  Adaptive copy-rate to minimize interference
–  Application-agnostic performance model

§  Immediate performance scale-up with little overhead

Thank you! http://dance.csc.ncsu.edu

Acknowledgement

§  This work was sponsored in part by NSF grants CNS0915567 and CNS0915861, the U.S. Army Research Office (ARO) under grant W911NF-10-1-0273, and Google Research Awards.