Cloud Foundry Summit 2015: Building a Robust Cloud Foundry (HA, Security and DR)
TRANSCRIPT
Building a Robust Cloud Foundry: HA, Security and DR
Haydon Ryan | Duncan Winn
This Talk
• High Availability (HA)
• Security
• Backing Up to Mitigate Disasters
© Copyright 2014 Pivotal. All rights reserved.
HA
High Availability Focus
Keep apps and services running in a performant, reliable and recoverable manner with timely error detection
1. Application Instances
2. Platform Processes
3. Platform VMs
4. Availability Zones
Keep Cloud Foundry running in a performant, reliable and recoverable manner with timely error detection
HA Deployments
[Diagram: Single Foundation Deployment (one data center, two AZs, RDS) vs Dual Foundation Deployment (two data centers)]
WHAT IF I TOLD YOU
IT’S POSSIBLE TO SANELY STRETCH LAYER 2
User Targets myapp.mycf.com
[Diagram: DNS resolution through DNS Global Traffic Management (GTM) to a VIP with SSL termination in each of two NSX boundaries, each containing LTM appliances and HA Proxy instances]
Domains (System and Application)
• Client: targets myapp.mycf.com
• CF1
  • System domain api.runtime-cf1.com: Developer runs cf api
  • Application domain cf1.com: Developer runs cf push myapp
• CF2
  • System domain api.runtime-cf2.com: Developer runs cf api
  • Application domain cf2.com: Developer runs cf push myapp
Services
[Diagram: apps and services in each foundation]
Customer Requirements
• AWS with One VPC
• Specific IP Ranges
• Using their internal corporate DNS
• No ELBs or Route 53 due to security setup
• Multiple Deployments of Cloud Foundry
• Availability Requirements:
  • App uptime
  • Failure matrix for downtime situations
[Diagram: Bind DNS, HA Proxy instances with SSL termination, and CF Routers]
Who does the deployment need to be highly available for?
• Users
• Developers
• Operations
Any non-critical jobs?
• clock_global
  • Used to clean up cc jobs.
  • Rely on Resurrector?
  • Redeploy to a different AZ by changing the resource_pool
Critical Jobs & VMs
• haproxy
• router
• nats
• cloud controller
• uaa/login?
• doppler?
Any less-critical jobs?
• loggregator / doppler
• loggregator traffic controller
• etcd
• Jumpbox?
• bosh?
Caveats with this design
• Single points of failure?
  • DNS
  • Bosh
  • Jumpbox
• Human interaction required in outage
• Bind DNS does not do health monitoring. Monitoring scripts were outside the scope of the engagement.
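Since Bind DNS does no health monitoring, some external script has to notice when an HAProxy stops answering. The following is only a sketch of such a check, not the script used in the engagement (none was written). The function names and the two addresses are hypothetical; the addresses were picked from inside the VPC range purely for illustration.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_endpoints(endpoints, probe=is_listening):
    """Keep only the (host, port) pairs that the probe reports as reachable."""
    return [(host, port) for host, port in endpoints if probe(host, port)]


# Hypothetical HAProxy addresses, one per AZ; substitute the real ones.
HAPROXIES = [("10.202.64.10", 443), ("10.202.80.10", 443)]
```

Running healthy_endpoints(HAPROXIES) on a schedule would give the list of proxies still accepting connections. What to do with that list (alerting, updating DNS records) is deliberately left out, because it depends on how the customer manages DNS.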
[Diagram: AWS architecture. VPC 10.202.64.0/19 with private subnets in AZ 1 and AZ 2, a Bosh subnet (Bosh SG: bosh, jumpbox) and an RDS subnet (RDS SG: boshdb, uaadb, ccdb), linked by Direct Connect to a customer-managed interstate data center. VMs in the CF SG across both AZs: ha Proxy, router, login, uaa, dea, cc, cc worker, nats, health, etcd, doppler, loggregator traffic controller, clock, apps manager. Also shown: bind dns, NAT, bastion.]
How We Deployed Services
• Proxy is a Single Point of Failure
• No Load Balancer to use
• Accepted by customer in failure matrix
[Diagram: App, Proxies, Servers]
Best Practices for Services
• By default the service binding uses the first proxy address only
[Diagram: App, Load Balancer, Proxies, Servers]
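The slide's recommendation is a load balancer in front of the proxies. Where none is available, an app could at least try each proxy address in turn instead of only the first. This is a rough sketch of that idea, assuming the app can obtain the full list of proxy hosts. The function name and the connect callable are hypothetical and stand in for whatever client library the app uses.

```python
def connect_with_failover(hosts, connect):
    """Try each proxy host in order and return the first successful connection.

    `connect` is any callable that takes a host and either returns a
    connection object or raises OSError (hypothetical stand-in for a
    real database or service client).
    """
    errors = []
    for host in hosts:
        try:
            return connect(host)
        except OSError as exc:
            # Remember why this proxy failed, then fall through to the next.
            errors.append(f"{host}: {exc}")
    raise ConnectionError("no proxy reachable: " + "; ".join(errors))
```

This only moves the failover decision into the app. It does not replace a load balancer, which would also spread traffic and hide proxy failures from every bound app at once.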
Which Deployment
[Diagram: Dual Foundation Deployment (two data centers) vs Single Foundation Dual AZs (one data center, two AZs, RDS) vs Single Foundation Single DC (one data center)]
Security and Networking (on AWS)
Security
• Security is Hard
• Three main concepts
  • Restrict
  • Limit scope if Compromised
  • Mitigate
• Feedback Loop
Restrict Users
• Individual Multi Factor Authentication
  • IaaS Console/Hypervisor
  • Jumpbox
• Separate accounts
  • jumpbox
  • bosh
  • github
Restrict Packets
• IaaS
  • Security Groups (Instance Level) (better)
  • ACLs (Subnet Level)
  • Routes
Restrict Containers
• Cloud Foundry
  • Application Security Groups
  • dea network properties
    • (allow_networks, deny_networks)
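An Application Security Group is defined by a JSON array of egress rules, each with a protocol, a destination and, for tcp/udp, a ports field. As a small sketch, the rules file could be generated like this. The helper name, the destinations and the ports are all made up for illustration and do not come from the talk.

```python
import json


def asg_rule(protocol, destination, ports=None):
    """Build one Application Security Group egress rule as a dict."""
    rule = {"protocol": protocol, "destination": destination}
    if ports is not None:
        rule["ports"] = ports
    return rule


# Hypothetical rules: allow app containers to reach a database subnet
# and a single HTTPS endpoint, and nothing else.
rules = [
    asg_rule("tcp", "10.202.64.0/24", "3306"),
    asg_rule("tcp", "10.202.65.10", "443"),
]

rules_json = json.dumps(rules, indent=2)
print(rules_json)
```

The resulting JSON would typically be saved to a file and passed to cf create-security-group, then bound to an org or space. Check the cf CLI help for the exact arguments on the version in use.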
Pivotal Cloud Foundry for AWS 1.4
[Diagram: VPC 10.0.0.0/16 with an Internet Gateway, ELB, NAT, Ops Manager, and Public, Private and RDS subnets. Security groups: ELB SG, NAT SG, Ops Manager SG, Elastic Runtime SG, RDS SG. SG allow-rule labels: vpcall; restricted ip 80, 443, 22*; 80?, 443. VMs: login, uaa, micro, router, dea, cc, cc worker, nats, health, etcd, doppler, loggregator traffic controller, clock, autoscaling. Databases: boshdb, uaadb, ccdb, apps manager db. Arrows mark the common traffic flow.]
Speaker note: was it just DEAs that used NAT?
Limit Scope if Compromised
• Different user/pass for each component
• Strong passwords (and usernames)
  • 20 Characters Long
  • RANDOM
  • Both Cases
  • Best to avoid special characters
  • e.g.: YxLIodYrUBQJrvMRYSQL
• Avoid cloud cow (image: http://vanmethod.deviantart.com/art/Purple-Cow-on-a-Cloud-146265642)
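The password rules on this slide (20 characters, random, both cases, no special characters) are easy to automate. This is one possible sketch using Python's secrets module. The function name is hypothetical, and restricting the alphabet to ASCII letters is an assumption based on the all-letter example on the slide.

```python
import secrets
import string

# Both cases, no special characters (assumption: letters only, as in the
# slide's example credential).
ALPHABET = string.ascii_letters


def random_credential(length=20):
    """Return a random mixed-case string suitable for a username or password."""
    if length < 2:
        raise ValueError("length must be at least 2 to contain both cases")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # Regenerate in the unlikely event that one case is missing.
        if any(c.islower() for c in candidate) and any(c.isupper() for c in candidate):
            return candidate
```

Calling random_credential() once per component and per username gives each job in the manifest its own credentials, which is the point of the first bullet.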
Limit Scope if Compromised
[Diagram: Runner, UAA, Login, uaadb, mySql, App Data]
Post Breach Security Measures
• Roll
  • AWS Credentials
  • Username and password (Manifest)
  • PEMs
• Investigate:
  • VM Logs (stored in Splunk / CloudWatch Logs)
  • Bosh and Login Audit Trail
  • Isolate the VM for investigation
    • Resurrector will resurrect a non-compromised VM
• Feedback:
  • Incident Reports and Management Support
Paranoid Level Security for AWS
• CloudTrail
  • Alerts
  • Audit Logs
  • Rollback
• Remove ability to delete
  • S3 buckets
  • subnets / VPC
  • backups
• Everything else can be recovered from a backup…
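One way to "remove the ability to delete" is an explicit Deny statement in an IAM policy attached to the operators' group. The sketch below builds such a policy document. The helper name is hypothetical, and mapping "backups" to EBS and RDS snapshot deletion is an assumption; the talk does not say which backup resources were meant.

```python
import json

# Delete actions to deny. The snapshot actions are an assumed reading of
# "backups"; adjust to wherever backups are actually stored.
DENIED_ACTIONS = [
    "s3:DeleteBucket",
    "ec2:DeleteSubnet",
    "ec2:DeleteVpc",
    "ec2:DeleteSnapshot",
    "rds:DeleteDBSnapshot",
]


def deny_delete_policy(actions=DENIED_ACTIONS):
    """Return an IAM policy document that explicitly denies the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Deny", "Action": list(actions), "Resource": "*"}
        ],
    }


policy_json = json.dumps(deny_delete_policy(), indent=2)
print(policy_json)
```

An explicit Deny overrides any Allow in other attached policies, so even an administrator-level user in that group cannot remove the listed resources.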
Disaster Recovery
Backing Up Cloud Foundry
• Configuration
• Databases: CCDB, UAADB, Apps Man DB, BOSH DB
• Blobstore / NFS Server
SCENARIO ONE: LOSE PCF OPS-MGR OR CF DEPLOYMENT
Restoring Ops Manager
1. Export Configuration
2. Create New Ops Manager
3. Import Configuration
Backup Ops Manager
• Configuration
Backup Deployment Manifests
scp ubuntu@<OPS MGR HOST>:/var/tempest/workspaces/default/deployments/*yml .
Deployment Manifests in BOSH
~$ bosh deployments
bosh download manifest cf-c700aee17d9f801eb152 cfmanifest.yml
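To keep manifest backups organised, the bosh download manifest command above can be repeated for each deployment into a dated directory. The sketch below only builds the command lines and does not run them. The function name, the backups directory and the date-based layout are assumptions, not something shown in the talk.

```python
from datetime import date
from pathlib import PurePosixPath


def manifest_backup_commands(deployments, backup_root="backups", day=None):
    """Build one `bosh download manifest` command per deployment name.

    Manifests are written to <backup_root>/<YYYY-MM-DD>/<deployment>.yml
    (hypothetical layout).
    """
    day = day or date.today()
    target = PurePosixPath(backup_root) / day.isoformat()
    return [
        ["bosh", "download", "manifest", name, str(target / f"{name}.yml")]
        for name in deployments
    ]
```

Each returned list could be passed to subprocess.run once the target directory exists. The deployment names themselves come from the bosh deployments output shown above.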
SCENARIO TWO: LOSE BOSH
Restoring Bosh With PCF
Export Configuration, then Import Configuration
:/var/tempest/workspaces/default/deployments/micro
BOSH Director + bosh.yml
Restoring Bosh Manually
[Diagram: BOSH with bosh.yml; BOSH DB (pg_dump) or External MySQL; Volume /var/vcap/store on /dev/xvda, /dev/sdb, /dev/sdf; Blobstore]
Critical Databases
Backup Cloud Controller DB Encryption Credentials
Locate Databases Info From Deployment Manifest
bosh download manifest cf-c700aee17d9f801eb152 cfmanifest.yml
NFS / Blobstore
✦ Managing Access with ACLs
✦ Create Group Bucket Policy for “Deny DeleteBucket”
✦ Turn on versioning
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "s3:DeleteBucket",
        "s3:DeleteObjectVersion"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
Takeaway
Takeaways
✦ Tradeoffs: No “One Size Fits All”
✦ Service Layer
✦ Existing: Environmental Security and Networking Constraints
✦ Backup: Configuration, Databases, Blobstore (This is your CF).
KEEP CALM
AND
CF PUSH