anynines - Running Cloud Foundry for 12 months - An experience report

Upload: anynines

Post on 22-Nov-2014

DESCRIPTION

anynines runs a public PaaS based on Cloud Foundry, hosted in a German datacenter. In more than 12 months of running a Cloud Foundry PaaS, many lessons about security, high availability, OpenStack and other exciting topics have been learned. See how Bosh can be used and how it shouldn't be used. Learn how to perform Cloud Foundry upgrades and read how to harden Cloud Foundry by adding more fault tolerance with Pacemaker.

TRANSCRIPT

Page 1: Anynines - Running Cloud Foundry for 12 months - An experience report

Running Cloud Foundry An Experience Report

Page 2: Anynines - Running Cloud Foundry for 12 months - An experience report

About this talk

Page 3: Anynines - Running Cloud Foundry for 12 months - An experience report

• Receive an opinion about running Cloud Foundry (CF)

• How to shoot yourself in the foot with CF and overcommitment settings

• How to perform CF updates

• How to harden CF

• Wise words about CF services

Page 4: Anynines - Running Cloud Foundry for 12 months - An experience report

Introduction

Page 5: Anynines - Running Cloud Foundry for 12 months - An experience report

about.me/fischerjulian

Page 6: Anynines - Running Cloud Foundry for 12 months - An experience report

Running a public Cloud Foundry

for more than a year.

Page 7: Anynines - Running Cloud Foundry for 12 months - An experience report

It works.

Page 8: Anynines - Running Cloud Foundry for 12 months - An experience report

In order to run Cloud Foundry smoothly …

Page 9: Anynines - Running Cloud Foundry for 12 months - An experience report

… refer to the package leaflet for risks and side effects and consult Pivotal, CloudCredo or anynines.

Page 10: Anynines - Running Cloud Foundry for 12 months - An experience report

The details

Page 11: Anynines - Running Cloud Foundry for 12 months - An experience report

The anynines Stack

Page 12: Anynines - Running Cloud Foundry for 12 months - An experience report

Hardware

OpenStack

Cloud Foundry

VMware

Page 13: Anynines - Running Cloud Foundry for 12 months - An experience report

We migrated from a rented VMware environment to a

self-hosted OpenStack.

Page 14: Anynines - Running Cloud Foundry for 12 months - An experience report

For more details on this: http://rh.gd/a9vmw2sos

Page 15: Anynines - Running Cloud Foundry for 12 months - An experience report

Proof point made…

Page 16: Anynines - Running Cloud Foundry for 12 months - An experience report

Cloud Foundry saves investments into software development

by being infrastructure agnostic.

Page 17: Anynines - Running Cloud Foundry for 12 months - An experience report

Running Cloud Foundry. What happened.

Page 18: Anynines - Running Cloud Foundry for 12 months - An experience report

Security Issues

Page 19: Anynines - Running Cloud Foundry for 12 months - An experience report

• Pivotal informs partners early about issues

• Usually along with fixes

Page 20: Anynines - Running Cloud Foundry for 12 months - An experience report

OpenStack Issues

Page 21: Anynines - Running Cloud Foundry for 12 months - An experience report

• Ext4 vs. Ext3

• DEA MTU

• rsyslogd command not found

Page 22: Anynines - Running Cloud Foundry for 12 months - An experience report

CF Gotchas

Page 23: Anynines - Running Cloud Foundry for 12 months - An experience report

DEA evacuate & Bosh timeout race-condition

Page 24: Anynines - Running Cloud Foundry for 12 months - An experience report

• Removing a DEA → Apps will be evacuated → DEA will be stopped

• Bosh deployment will fail when evacuation takes longer than the Bosh timeout

• Set your Bosh timeout accordingly!
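
A rough way to sanity-check the timeout before removing a DEA, as a minimal Python sketch. The function name, the per-instance move time and the safety factor are illustrative assumptions, not Bosh settings; the sketch also assumes instances are moved one after another, which may not match how a real evacuation proceeds.

```python
def evacuation_fits_timeout(app_instances, seconds_per_instance,
                            timeout_seconds, safety_factor=1.5):
    """Return True if the estimated evacuation time stays within the timeout.

    Assumption: instances are moved one after another, so the estimate is
    app_instances * seconds_per_instance, padded by safety_factor.
    """
    estimated_seconds = app_instances * seconds_per_instance * safety_factor
    return estimated_seconds <= timeout_seconds


# 40 instances at about 10 s each (padded to 600 s) against a 900 s timeout.
print(evacuation_fits_timeout(40, 10, 900))
# 100 instances at about 10 s each (padded to 1500 s) against the same timeout.
print(evacuation_fits_timeout(100, 10, 900))
```

If the check fails, either raise the timeout or remove fewer DEAs per deployment.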

Page 25: Anynines - Running Cloud Foundry for 12 months - An experience report

DEA over-commitment

Page 26: Anynines - Running Cloud Foundry for 12 months - An experience report

Default overcommitment factor = 4

Page 27: Anynines - Running Cloud Foundry for 12 months - An experience report

RAM peaks may cause random errors

Page 28: Anynines - Running Cloud Foundry for 12 months - An experience report

• Failures during staging

• Random application crashes

• No meaningful log information

Page 29: Anynines - Running Cloud Foundry for 12 months - An experience report

Reducing over-commitment

Page 30: Anynines - Running Cloud Foundry for 12 months - An experience report

• Native strategy

• Reduce over-commitment factor

• Bosh deploy

Page 31: Anynines - Running Cloud Foundry for 12 months - An experience report
Page 32: Anynines - Running Cloud Foundry for 12 months - An experience report

• 8 GB VM, OC factor 4 → Announces 32 GB (V)RAM

• 8 GB VM, OC factor 2 → Announces 16 GB (V)RAM

• When evacuating a 32 GB (V)RAM host, another 32 GB (V)RAM host will be preferred (more free space)
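
The arithmetic behind these numbers can be sketched in a few lines of Python. The placement rule used here (pick the DEA with the most free announced memory) is a simplification of the "more free space" preference on the slide, not the actual Cloud Foundry placement algorithm; the DEA names and usage figures are made up.

```python
def announced_memory_gb(vm_ram_gb, overcommit_factor):
    """Memory a DEA announces: physical VM RAM times the over-commitment factor."""
    return vm_ram_gb * overcommit_factor


def pick_target(deas):
    """Pick the DEA with the most free announced memory.

    deas maps a DEA name to a tuple (announced_gb, used_gb).
    """
    return max(deas, key=lambda name: deas[name][0] - deas[name][1])


deas = {
    "old-dea": (announced_memory_gb(8, 4), 10),  # announces 32 GB, 22 GB free
    "new-dea": (announced_memory_gb(8, 2), 0),   # announces 16 GB, 16 GB free
}
print(pick_target(deas))
```

Even though the new DEA is empty, the partly filled old DEA still looks roomier and is chosen.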

Page 33: Anynines - Running Cloud Foundry for 12 months - An experience report

Evacuation Wave

Page 34: Anynines - Running Cloud Foundry for 12 months - An experience report

1 GB

1 GB

1 GB

1 GB

Page 35: Anynines - Running Cloud Foundry for 12 months - An experience report

= maximum impact on running apps!

Page 36: Anynines - Running Cloud Foundry for 12 months - An experience report

New DEAs (OC 2) will receive apps when old DEAs

(OC 4) have been stopped.
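
The evacuation wave can be reproduced with a toy simulation. This is a sketch under stated assumptions: DEAs are updated in place one at a time, every app needs 1 GB, and an evacuated app goes to the DEA with the most free announced memory. All names and numbers are hypothetical and the real placement logic is more involved.

```python
from collections import Counter


def simulate_rolling_update(num_deas=4, vm_ram_gb=8, old_oc=4, new_oc=2,
                            apps_per_dea=6, app_gb=1):
    """Count how often each app is moved while all DEAs are updated one by one."""
    capacity = [vm_ram_gb * old_oc] * num_deas  # announced memory per DEA
    apps = {d: [f"app-{d}-{i}" for i in range(apps_per_dea)]
            for d in range(num_deas)}
    moves = Counter()
    for dea in range(num_deas):
        evacuees, apps[dea] = apps[dea], []
        others = [d for d in range(num_deas) if d != dea]
        for app in evacuees:
            # Prefer the DEA that currently announces the most free memory.
            target = max(others,
                         key=lambda d: capacity[d] - len(apps[d]) * app_gb)
            apps[target].append(app)
            moves[app] += 1
        # The DEA comes back empty and announces less memory (new factor).
        capacity[dea] = vm_ram_gb * new_oc
    return moves


moves = simulate_rolling_update()
print("apps moved:", len(moves))
print("most moves for a single app:", max(moves.values()))
```

Apps evacuated early land on old DEAs that are stopped later, so the same app is restarted more than once during a single deployment.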

Page 37: Anynines - Running Cloud Foundry for 12 months - An experience report

Hints

Page 38: Anynines - Running Cloud Foundry for 12 months - An experience report

• Create 2nd resource pool for new DEAs

• Deploy the 2nd resource pool before starting to stop old DEAs

• (-) Needs more resources

• (+) Smoother transition

Page 39: Anynines - Running Cloud Foundry for 12 months - An experience report

Updating Cloud Foundry

Page 40: Anynines - Running Cloud Foundry for 12 months - An experience report

Required: Staging System

Page 41: Anynines - Running Cloud Foundry for 12 months - An experience report

• Structurally identical

• Fewer VMs

Page 42: Anynines - Running Cloud Foundry for 12 months - An experience report

1. Determine new features

since last release

Page 43: Anynines - Running Cloud Foundry for 12 months - An experience report

2. Study

deployment manifest changes

Page 44: Anynines - Running Cloud Foundry for 12 months - An experience report

3. Apply

deployment manifest changes

Page 45: Anynines - Running Cloud Foundry for 12 months - An experience report

4. First staging attempt

Page 46: Anynines - Running Cloud Foundry for 12 months - An experience report

5. Debug and Fix it!

Page 47: Anynines - Running Cloud Foundry for 12 months - An experience report

6. Simulate the live-upgrade

Page 48: Anynines - Running Cloud Foundry for 12 months - An experience report

7. Schedule maintenance on

status.anynines.com

Page 49: Anynines - Running Cloud Foundry for 12 months - An experience report

8. Perform the upgrade

and cross fingers.

Page 50: Anynines - Running Cloud Foundry for 12 months - An experience report

CF Hardening

Page 51: Anynines - Running Cloud Foundry for 12 months - An experience report

Accept that VMs are ephemeral

Page 52: Anynines - Running Cloud Foundry for 12 months - An experience report

VM Failover Strategies

Page 53: Anynines - Running Cloud Foundry for 12 months - An experience report

Resurrect

Page 54: Anynines - Running Cloud Foundry for 12 months - An experience report

• Monitor VM

• Re-Build VMs automatically

• e.g. using Cloud Foundry Bosh

• + Easy

• - Slow (minutes, not seconds)

• - OpenStack doesn’t release persistent disks automatically
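
The resurrect strategy boils down to a monitor-and-rebuild loop. The following Python sketch shows only that idea; `is_alive` and `rebuild` are hypothetical callables standing in for whatever health check and VM re-creation the tooling provides, and this is not how Bosh implements it.

```python
def resurrect_failed(vm_names, is_alive, rebuild):
    """Rebuild every VM that fails its health check; return the rebuilt names."""
    rebuilt = []
    for name in vm_names:
        if not is_alive(name):
            rebuild(name)
            rebuilt.append(name)
    return rebuilt


# Toy health table standing in for real monitoring data.
health = {"uaa": True, "cc": False, "dea-0": True}


def rebuild(name):
    # Stand-in for re-creating the VM; here it only flips the health flag.
    health[name] = True


print(resurrect_failed(list(health), health.get, rebuild))
```

In practice the rebuild step is what takes minutes, which is why this strategy is simple but slow.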

Page 55: Anynines - Running Cloud Foundry for 12 months - An experience report

Failover to Standby VM

Page 56: Anynines - Running Cloud Foundry for 12 months - An experience report

Distribute CF components across availability zones

Page 57: Anynines - Running Cloud Foundry for 12 months - An experience report

• Build disjoint networks, racks, etc.

• Each disjoint zone = availability zone

• Tell your IaaS about availability zones

• On provision choose the AZ

• Build Bosh releases accordingly

Page 58: Anynines - Running Cloud Foundry for 12 months - An experience report

• Provide stand-by VM

• Monitor VM and perform failover

• IP failover using Pacemaker

• + Fast failover (seconds)

• - Pacemaker not easy to use (& boshify)

• - Increased resource usage by standby VM(s)
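
The standby approach can be pictured as a pair of VMs where a service IP moves after a number of missed health checks. This Python sketch is illustrative only: the class, its threshold and the VM names are assumptions, and it does not use or mimic Pacemaker's actual interfaces.

```python
class FailoverPair:
    """Track which of two VMs currently holds the service IP."""

    def __init__(self, active, standby, max_missed=3):
        self.active = active
        self.standby = standby
        self.max_missed = max_missed  # consecutive misses before failover
        self.missed = 0

    def record_heartbeat(self, ok):
        """Record one health check of the active VM; return the IP holder."""
        if ok:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # Fail over: the standby takes the IP, roles are swapped.
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active


pair = FailoverPair("cc-0", "cc-1")
for ok in (True, False, False, False):
    holder = pair.record_heartbeat(ok)
print(holder)  # cc-1
```

Because only the IP moves and the standby is already running, the switch takes seconds instead of the minutes a rebuild needs.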

Page 59: Anynines - Running Cloud Foundry for 12 months - An experience report

• 2 * UAA

• 2 * CC

• 2 * n * DEAs

• 2 * Health Manager

• …

Page 60: Anynines - Running Cloud Foundry for 12 months - An experience report

UAA & CC DB =

SPOF

Page 61: Anynines - Running Cloud Foundry for 12 months - An experience report

HA Postgres

Page 62: Anynines - Running Cloud Foundry for 12 months - An experience report

• UAA and Cloud Controller database

• Single point of failure for Cloud Foundry

Page 63: Anynines - Running Cloud Foundry for 12 months - An experience report

• Postgres not inherently clusterable → failover with standby VM

• Master/slave replication

• Pacemaker/corosync

• IP-Failover using NIC-reattachment

Page 64: Anynines - Running Cloud Foundry for 12 months - An experience report

That’s half way towards a PostgreSQL CF Service

Page 65: Anynines - Running Cloud Foundry for 12 months - An experience report

• Add a V2 Service Broker

• Add provisioning logic

• Provision 2-node DB cluster on cf create-service postgres medium-cluster
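
The provisioning logic of such a broker might look roughly like the following Python sketch. In the v2 broker API, provisioning arrives as a PUT to /v2/service_instances/:instance_id with a plan_id in the body; everything else here is an assumption for illustration: the plan catalog, the `deploy_cluster` callable (a stand-in for a Bosh-driven deployment) and the choice of 400 for an unknown plan.

```python
# Hypothetical plan catalog: plan_id -> plan settings.
PLANS = {"medium-cluster-plan-id": {"name": "medium-cluster", "nodes": 2}}


def provision(instances, instance_id, plan_id, deploy_cluster):
    """Handle a provision request; return (http_status, response_body)."""
    if instance_id in instances:
        return 409, {}  # an instance with this id already exists
    plan = PLANS.get(plan_id)
    if plan is None:
        return 400, {"description": "unknown plan"}  # assumed error handling
    instances[instance_id] = deploy_cluster(instance_id, plan["nodes"])
    return 201, {}


def fake_deploy(instance_id, nodes):
    # Stand-in for deploying a database cluster; returns made-up node names.
    return [f"{instance_id}-pg-{i}" for i in range(nodes)]


instances = {}
print(provision(instances, "abc", "medium-cluster-plan-id", fake_deploy))
print(instances["abc"])
```

Keeping the deployment behind a callable makes it possible to test the broker logic without touching any infrastructure.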

Page 66: Anynines - Running Cloud Foundry for 12 months - An experience report

Services

Page 67: Anynines - Running Cloud Foundry for 12 months - An experience report

“The best way to find yourself is to lose yourself in the service of others.”

― Mahatma Gandhi

Page 68: Anynines - Running Cloud Foundry for 12 months - An experience report

Wardenized Services (community services)

are cute for pet projects.

Page 69: Anynines - Running Cloud Foundry for 12 months - An experience report

Not suitable for production.

Page 70: Anynines - Running Cloud Foundry for 12 months - An experience report

• Implementations are outdated

• One size doesn’t fit all!

Page 71: Anynines - Running Cloud Foundry for 12 months - An experience report

No production CF without high quality services.

Page 72: Anynines - Running Cloud Foundry for 12 months - An experience report

CF Service Design

Page 73: Anynines - Running Cloud Foundry for 12 months - An experience report

• Use clusterable services if possible

• Implement automatic failover if not

• Autoprovisioning using Bosh

• Organize self-healing

• (Semi-)Automatic recovery from degraded mode

Page 74: Anynines - Running Cloud Foundry for 12 months - An experience report

Summary

Page 75: Anynines - Running Cloud Foundry for 12 months - An experience report

• Bosh & the CF release are powerful, yet you can cut yourself.

• HA services are essential.

• CF is ready to be used in production.

Page 76: Anynines - Running Cloud Foundry for 12 months - An experience report

Questions?

Page 77: Anynines - Running Cloud Foundry for 12 months - An experience report

Thank you!