Operational Best Practices in the Cloud

Operational Best Practices in the Cloud October 27, 2011 Watch the video of this webinar

Page 1: Operational Best Practices in the Cloud

Operational Best Practices in the Cloud

October 27, 2011

Watch the video of this webinar

Page 2: Operational Best Practices in the Cloud

Cloud Management Platform

Your Panel Today

Presenting
• Rafael H. Saavedra, VP Engineering, RightScale
• Josep Blanquer, Sr. Systems Architect, RightScale

Q&A
• David Manriquez, Account Manager, RightScale

Please use the “Questions” window to ask questions any time!

Page 3: Operational Best Practices in the Cloud

Agenda
• RightScale architecture
• The release cycle
• Monitoring, alerts and escalations
• When servers fail
• Our best practices

Today’s material will discuss how we run RightScale in the cloud. From this, we distill best practices that are relevant for all.

Page 4: Operational Best Practices in the Cloud

RightScale architecture

Page 5: Operational Best Practices in the Cloud

The scale of RightScale

• > 3M servers launched by RightScale

• RightScale continuously monitors > 100K servers

• Every day at RightScale:
  • 2,000 array resize actions are executed
  • 35,000 alert escalations are triggered
  • 20,000 escalation emails are sent to users
  • 9.0TB of monitoring data is exchanged with our servers
  • 1.6TB of logging data is sent to our servers

Page 6: Operational Best Practices in the Cloud

Architecture of a cloud-based SaaS app
• RightScale is a SaaS application that runs completely in the cloud
  • Databases
  • Core web app and API
  • Services such as monitoring, logging, and MultiCloud Marketplace

Page 7: Operational Best Practices in the Cloud

A quick primer on ServerTemplates

Configuring servers through bundling images:
• Custom MySQL 5.0.24 (CentOS 5.2)
• Custom MySQL 5.0.24 (CentOS 5.4)
• MySQL 5.0.36 (CentOS 5.4)
• MySQL 5.0.36 (Ubuntu 8.10)
• MySQL 5.0.36 (Ubuntu 8.10) 64bit
• Frontend Apache 1.3 (Ubuntu 8.10)
• Frontend Apache 2.0 (Ubuntu 9.10) - patched
• CMS v1.0 (CentOS 5.4)
• CMS v1.1 (CentOS 5.4)
• My ASP appserver (Windows 2008)
• My ASP.net (Windows 2008) – security update 1
• My ASP.net (Windows 2008) – security update 8
• SharePoint v4 (Windows 2003) – 32bit
• SharePoint v4 (Windows 2003) – 64bit
• SharePoint v4.5 (Windows 2003) – 64bit

Configuring servers with ServerTemplates:
• Base image (MultiCloudImage): very few and basic
  • CentOS 5.2
  • CentOS 5.4
  • Ubuntu 8.10
  • Ubuntu 9.10
  • Win 2003
  • Win 2007
• ServerTemplate: a set of configuration directives that will install and configure software on top of the base image
• Boot sequence steps shown in the diagram:
  • Install monitoring
  • Install MySQL Server
  • Configure MySQL
  • Restore last backup
  • Setup DNS and IPs
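The contrast on this slide can be pictured with a small sketch: instead of one bundled image per software/OS combination, a ServerTemplate is a list of boot scripts that can be applied to any of a few base images. The class, field names, and function below are hypothetical and are not RightScale's data model; the step order is an assumption, since the extracted diagram does not make the order certain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerTemplate:
    """Hypothetical model: a name plus ordered boot scripts, independent of any one image."""
    name: str
    boot_scripts: tuple


def boot_plan(template, base_image):
    """Return the ordered actions for launching `template` on top of `base_image`."""
    return [f"launch {base_image}"] + [f"run: {script}" for script in template.boot_scripts]


# Step names come from the slide; their order here is assumed.
mysql = ServerTemplate(
    name="MySQL 5.0.36",
    boot_scripts=(
        "Install monitoring",
        "Install MySQL Server",
        "Configure MySQL",
        "Restore last backup",
        "Setup DNS and IPs",
    ),
)

# The same template is reused across base images instead of bundling one image per combination.
for image in ("CentOS 5.4", "Ubuntu 8.10"):
    print(boot_plan(mysql, image))
```

With this shape, supporting a new OS means adding one base image rather than re-bundling every software image.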

Page 8: Operational Best Practices in the Cloud

We use the same ServerTemplates our customers do
• RightScale uses 15-20 different ServerTemplates in production
  • We don’t build images; we use pre-built MultiCloud Images with RightLink
  • We make heavy use of RightScale-provided toolboxes (EBS, DNS, LB)
• Off-the-shelf: 1 template (MySQL)
• Customized: App servers and load balancers
  • Written with RightScripts in Ruby, Bash, etc.
  • Mostly Rails apps to run our core services: front-end, API, Marketplace, etc.
• From MultiCloud Image: Messaging and databases
  • RabbitMQ, Cassandra

Page 9: Operational Best Practices in the Cloud

Deployments group RightScale services

Page 10: Operational Best Practices in the Cloud

Best practices: Architecture

• ServerTemplates can be used off the shelf or customized
  • Don’t bundle images
  • Make heavy use of MCIs instead of hardcoding base RightImages
• Deployments let you stage servers in the cloud
  • The use of inputs guarantees consistency across all servers
  • Easily test or fail over
• Macros/API automation can quickly stand up entire deployments
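One way to see why deployment-level inputs keep servers consistent is a sketch in which every server inherits the deployment's values and only overrides what it must. The function name, input names, and host name are hypothetical, and the precedence shown (a per-server value wins over the deployment value) is an assumption; RightScale's actual input resolution rules may differ.

```python
def resolve_inputs(deployment_inputs, server_overrides=None):
    """Return the inputs a server would see: deployment values, then per-server overrides."""
    resolved = dict(deployment_inputs)       # every server starts from the same values
    resolved.update(server_overrides or {})  # assumed precedence: server-level wins
    return resolved


# Hypothetical inputs for a staging deployment.
staging = {"DB_HOST": "db.staging.example.com", "APP_ENV": "staging"}

app1 = resolve_inputs(staging)
app2 = resolve_inputs(staging, {"LOG_LEVEL": "debug"})

# Both servers agree on everything the deployment defines.
print(app1["DB_HOST"] == app2["DB_HOST"])
```

Because the shared values live in one place, cloning the deployment for a test or a failover carries the same configuration along.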

Page 11: Operational Best Practices in the Cloud

The release cycle

Page 12: Operational Best Practices in the Cloud

Challenges of the release cycle

• Limited resources and lead time for procuring and provisioning equipment

• Maintaining multiple environments from development through production

• Maintaining consistency for reusability and QA

• Distributed teams and team members

Page 13: Operational Best Practices in the Cloud

A typical release cycle flow

Page 14: Operational Best Practices in the Cloud

Our development environment
• We keep a number of different deployments
  • Each development team has its own mini-environment
  • A larger integrated staging environment
  • One production environment
• Accounts keep things organized and secure
  • We keep separate accounts for staging and production
  • One team of sys admins manages all environments

Page 15: Operational Best Practices in the Cloud

RightScale release cycle
• One set of scripts and ServerTemplates is used everywhere
  • Gate accounts for security, development vs. production, etc.
  • Less test variance between production and staging
  • Only difference is size of environment
• Easy to bring up a development environment on demand using deployments and macros
  • Get it up and running on demand in less than an hour
  • Cloud is pay-by-the-hour, so it is cheap to run temporary environments

Page 16: Operational Best Practices in the Cloud

Best practices: Release cycle

• Don’t be afraid to run many environments
  • Dynamically clone, launch and tear down environments for quick tests
  • Configure a fixed set of environments for development, integration, staging
  • Use different accounts to segregate users and configurations
  • Sys admins are expensive. Cloud servers are cheap.
• Reuse ServerTemplates to keep environments consistent
  • Make use of versioning, and freeze software repositories
  • Share or publish them through the MultiCloud Marketplace
  • Create all-in-one ServerTemplates from the same RightScripts and recipes
• Avoid upgrading existing servers, fail forward instead
  • Keep old servers running so you can roll back, or do a post-mortem later on
  • For databases: Launch additional slaves. Freeze replication at upgrade point. Take snapshots!

Page 17: Operational Best Practices in the Cloud

Release night steps

Diagram components: Front Ends; Main App (two groups); Databases: DB Master, DB Slave (two)

1) Servers with current code
2) Servers with new code
3) Add second slave
4) Cut access to site
5) Stop all access to databases
6) Stop replication
7) Take snapshot at cutoff
8) Update schema
9) Reconnect all servers
10) Open access to site
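The ten release night steps can be treated as an ordered runbook that halts at the first failure, so the team knows exactly how far the release got before deciding whether to roll back. This is only a sketch: the step wording paraphrases the slide, and `run_release` and the `execute` callback are hypothetical names, not part of any RightScale tooling.

```python
# Step names paraphrase the slide; steps 1 and 2 are phrased as actions for illustration.
RELEASE_NIGHT_STEPS = [
    "Confirm servers with current code are serving",
    "Launch servers with new code",
    "Add second slave",
    "Cut access to site",
    "Stop all access to databases",
    "Stop replication",
    "Take snapshot at cutoff",
    "Update schema",
    "Reconnect all servers",
    "Open access to site",
]


def run_release(steps, execute):
    """Run steps in order; stop at the first step whose `execute(step)` returns False."""
    completed = []
    for number, step in enumerate(steps, start=1):
        if not execute(step):
            return {"ok": False, "failed_step": number, "completed": completed}
        completed.append(step)
    return {"ok": True, "failed_step": None, "completed": completed}


# Simulate a schema update that fails: the snapshot and the untouched slave still exist.
result = run_release(RELEASE_NIGHT_STEPS, lambda step: step != "Update schema")
print(result["failed_step"], len(result["completed"]))
```

Stopping replication and snapshotting before the schema change is what makes a failure at that point recoverable.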

Page 18: Operational Best Practices in the Cloud

Monitoring, alerts and escalations

Page 19: Operational Best Practices in the Cloud

Monitoring and alerts: Diagnose & optimize

• Off-the-shelf monitoring
  • OS: CPU, Disk, Memory, Network, Processes, System
  • App: Apache, IIS, MySQL, Nginx, SQL Server
  • Plus many more collectd plug-ins!

• Custom monitoring

• Cluster monitoring

• Alerts & escalations

Page 20: Operational Best Practices in the Cloud

Monitoring, alerts & escalations
• We monitor as much relevant data as possible and display it in insightful ways to quickly detect patterns and abnormalities
• We proactively eliminate the conditions that raise critical alerts
  • No-broken-windows policy: no critical alert can remain unresolved.

Graphs shown: API Network Activity; Dashboard Network Activity

Page 21: Operational Best Practices in the Cloud

Off-the-shelf: MySQL Collectd Plugin

Page 22: Operational Best Practices in the Cloud

Off-the-shelf: MySQL reads graphs
• Read-random-next represents a table scan
• Read-next represents an index scan
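To make the two graph series concrete, the sketch below turns two samples of MySQL status counters into per-second read rates. It assumes the graphs are derived from the `Handler_read_rnd_next` and `Handler_read_next` status variables, which the slide does not state explicitly; the function name and the sample numbers are made up for illustration.

```python
def scan_read_rates(prev, curr, seconds):
    """Per-second row reads from table scans vs. index scans between two counter samples."""
    table = (curr["Handler_read_rnd_next"] - prev["Handler_read_rnd_next"]) / seconds
    index = (curr["Handler_read_next"] - prev["Handler_read_next"]) / seconds
    return {"table_scan_reads_per_s": table, "index_scan_reads_per_s": index}


# Illustrative counter samples taken 10 seconds apart (not real measurements).
before = {"Handler_read_rnd_next": 100, "Handler_read_next": 1000}
after = {"Handler_read_rnd_next": 600, "Handler_read_next": 4000}

rates = scan_read_rates(before, after, seconds=10)
print(rates)
```

A table-scan rate that climbs relative to the index-scan rate is a common hint that some query is missing an index.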

Page 23: Operational Best Practices in the Cloud

Custom: Whatever you want with collectd
• Any statistic you can think of can easily be added as a monitor.
  • All of these are graph-able and alert-able in our dashboard!
• Many can be written in less than an hour.
  • As easy as printing a line of formatted numbers every few seconds
  • support.rightscale.com is an authority on collectd
• How we do it:
  • We use Ruby to write our custom monitors
  • Cassandra: jcollectd with JMX to pull out monitoring data from JavaBeans
  • Passenger: Ruby script that parses data from Passenger command line interface
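"Printing a line of formatted numbers" can be illustrated with collectd's exec plugin, which reads `PUTVAL` lines from a script's standard output. The slides say RightScale writes its monitors in Ruby; the sketch below uses Python only for illustration. The plugin name `passenger`, the instance name `queue_depth`, and the value are hypothetical, while `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` are the environment variables the exec plugin sets for its scripts.

```python
import os
import sys


def putval_line(host, plugin, type_name, type_instance, interval, value):
    """Format one value in collectd's plain-text PUTVAL form ("N" means "now")."""
    identifier = f"{host}/{plugin}/{type_name}-{type_instance}"
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


# Fall back to defaults when the script is run outside collectd.
host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
interval = float(os.environ.get("COLLECTD_INTERVAL", "10"))

# A real monitor would measure something here; 3 is a placeholder reading.
print(putval_line(host, "passenger", "gauge", "queue_depth", interval, 3))
sys.stdout.flush()  # collectd reads from a pipe, so flush after each line
```

A long-running monitor would wrap the measurement and the print in a loop that sleeps for the interval between samples.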

Page 24: Operational Best Practices in the Cloud

Custom: Cassandra monitors

Page 25: Operational Best Practices in the Cloud

Cluster: Monitor hundreds of servers
• We leverage a monitoring data warehouse to develop heat maps & stacked graphs

Page 26: Operational Best Practices in the Cloud

Automated actions using alerts from monitors

• Create an alert for any monitor, even your custom ones
  • RightScale example: Cassandra pending reads signals overloading
• Break alerts into critical and warning
  • Critical: Wake me up! Page me!
  • Warning: Send email to team.
• Trigger many actions: email, run script, scale, relaunch, reboot, …
  • Customize to your monitor, situation, and IT processes
  • RightScale example: Run a RightScript if swap is too high
  • Integrate with 3rd party services like PagerDuty
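The critical/warning split above can be sketched as a small rule table: each rule fires only when a monitor stays above its threshold for several consecutive samples, and the most severe matching rule decides the action. The thresholds, sample counts, and action names are hypothetical and do not reflect RightScale's alert specification format.

```python
# (severity, threshold, consecutive samples required, action), most severe first.
# All values are illustrative assumptions.
RULES = [
    ("critical", 90.0, 3, "page_on_call"),
    ("warning", 50.0, 3, "email_team"),
]


def sustained(samples, threshold, count):
    """True if the latest `count` samples all exceed `threshold`."""
    recent = samples[-count:]
    return len(recent) == count and all(value > threshold for value in recent)


def escalate(samples, rules=RULES):
    """Return (severity, action) for the most severe rule that fires, or None."""
    for severity, threshold, count, action in rules:
        if sustained(samples, threshold, count):
            return severity, action
    return None


print(escalate([10, 60, 70, 95, 96, 97]))
print(escalate([10, 60, 70, 80]))
print(escalate([10, 20]))
```

Requiring consecutive breaches keeps a single noisy sample from paging anyone.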

Page 27: Operational Best Practices in the Cloud

Best practices: Monitoring and alerts
• Monitor your critical processes off-the-shelf
  • Set monitors with scripts on your ServerTemplates
  • Use mon_process (e.g. Ruby)
• Customize to your application needs
  • Use collectd plug-ins or easily build your own
  • The monitor is graphed in the RightScale dashboard
• Plan out your critical alerts
  • Set your response plan: warnings vs. critical

Page 28: Operational Best Practices in the Cloud

When servers fail

Page 29: Operational Best Practices in the Cloud

How to think about server failure in the cloud
• Design for failure
  • Make sure your application remains healthy after the failure of a node
  • Don’t use sticky sessions
  • Distribute your application services
  • Debug ServerTemplates and not servers
  • Use alerts to reboot and/or relaunch
  • Auto-scale app server arrays
  • Use dynamic DNS and static IPs for load balancers
    • Your app servers and databases will always know where to look

Page 30: Operational Best Practices in the Cloud

Deep dive on database failure
• Use database backups for rollbacks or disaster scenarios
  • Restore from backups in event of complete system failure
  • One-click with fully automated RightScale Database Managers
• Use database redundancy for high availability (for example, master/slave)
  • Promote slave if master fails
  • Possible to prime your slave database to make failover more seamless
  • After promotion is complete, quick to launch a new slave
  • Worry about troubleshooting when you have time
  • One-click with fully automated RightScale Database Managers
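The failover order described above (promote, replace the slave, troubleshoot later, and fall back to backups only when nothing is left to promote) can be written down as a small planning function. This is a sketch, not the RightScale Database Manager's logic: the function name is hypothetical, and choosing the slave with the lowest replication lag is an added assumption.

```python
def plan_failover(master_alive, slave_lag_seconds):
    """Return the ordered recovery actions for the current cluster state."""
    if master_alive:
        return []
    if not slave_lag_seconds:
        # Complete system failure: nothing to promote, so restore from backup.
        return ["restore a new master from the latest backup"]
    # Assumption: prefer the slave that is closest to the failed master's state.
    candidate = min(slave_lag_seconds, key=slave_lag_seconds.get)
    return [
        f"promote {candidate} to master",
        "launch a replacement slave",
        "troubleshoot the failed master later",
    ]


print(plan_failover(False, {"db2": 4, "db3": 0}))
print(plan_failover(False, {}))
```

Keeping troubleshooting as the last step reflects the slide's point: restore service first, investigate when there is time.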

Page 31: Operational Best Practices in the Cloud

Backups to block volumes and object stores
• Block volumes: EBS snapshots
  • + Easy to snapshot
  • + Easy to rotate
  • + Easy consistency
  • + Instant restore (mount)
  • - Difficult to move between clouds/regions
  • - Must back up entire volume
  • What we do:
    • EBS: Databases
• Object stores: S3/Cloud Files
  • + Back up into other clouds
  • + Back up individual folders or files
  • + Incremental backups (e.g. as files/data are flushed)
  • - More coding, customization
  • - Custom rotation strategy
  • - Download time
  • What we do:
    • S3: Monitoring system (Cassandra in the future)

Page 32: Operational Best Practices in the Cloud

Best practices: Planning for failure
• No excuse for not backing up your servers
  • RightScale Database Manager + EBS tools make it easy to take backups
• Plan your rotation policy
  • Database Manager helps you tailor daily, weekly, and monthly backups
• Back up across clouds and regions
  • Database Managers for MySQL and SQL Server make it easy to back up to S3 or CloudFiles from AWS, CloudStack, Eucalyptus, and Rackspace
• Organize your backups
  • Keep track with lineages and timelines using the Database Managers
• Test your backups!
  • It is easy and cheap on the cloud
  • A crisis is the worst time to find out your backups are corrupted
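A daily/weekly/monthly rotation policy like the one mentioned above can be sketched as a function that decides which backup dates to keep. The retention counts (7 daily, 4 weekly, 12 monthly) and the function name are assumptions for illustration; they are not the Database Manager's defaults.

```python
from datetime import date, timedelta


def backups_to_keep(dates, daily=7, weekly=4, monthly=12):
    """Keep the newest `daily` backups, plus the newest backup of each of the
    last `weekly` ISO weeks and `monthly` calendar months that have backups."""
    newest_first = sorted(set(dates), reverse=True)
    keep = set(newest_first[:daily])
    weeks, months = {}, {}
    for day in newest_first:
        weeks.setdefault(tuple(day.isocalendar())[:2], day)  # newest backup per ISO week
        months.setdefault((day.year, day.month), day)        # newest backup per month
    keep.update(list(weeks.values())[:weekly])
    keep.update(list(months.values())[:monthly])
    return keep


# Sixty consecutive daily backups ending on the webinar date.
last = date(2011, 10, 27)
history = [last - timedelta(days=offset) for offset in range(60)]

kept = backups_to_keep(history)
print(sorted(kept))
```

Everything not in the returned set is a candidate for deletion, which keeps storage bounded while preserving older restore points.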

Page 33: Operational Best Practices in the Cloud

Our best practices

Page 34: Operational Best Practices in the Cloud

Best practices for operating in the cloud
• Keep your environment organized and consistent
  • Accounts, deployments, ServerTemplates, and macros
• Change and debug configurations, not servers
  • ServerTemplates, MultiCloudImages, fail-forward
• Monitor your servers efficiently
  • Off-the-shelf and custom monitoring and alerts
• Automate, automate and also automate
  • Server arrays, macros/API for more complex flows, alert actions …
• Back up your databases (organize, multi-cloud, rotate, test)
  • Database Manager ServerTemplates

Page 35: Operational Best Practices in the Cloud

Getting Started and Q&A

Contact RightScale
(866) [email protected]
www.rightscale.com

More Info
Webinar archive: RightScale.com/webinars
White Papers: RightScale.com/whitepapers
Free Edition: RightScale.com/Free

RightScale Conference
Nov 9 in Santa Clara, CA
www.RightScale.com/Conference
• Attend technical breakout sessions
• Talk with RightScale customers
• Ask questions at the Expert Bar
• Training on 11/8 and 11/10