20171122 aws usergrp_coretech-spn-cicd-aws-v01
TRANSCRIPT
SPN CI/CD journey on AWS
SPN Infra., CoreTech
Scott Miao
11/22/2017
Who am I
• Scott Miao
• RD, SPN Infra., TrendMicro
• OOAD system dev. 10+ years
• Hadoop ecosystem 6 years
• AWS for BigData 4 years
• @slideshare
Agenda
• Original services delivery process in SPN
• Dev/Ops
– DevOps goals V.S. our original way
• CI/CD on AWS
• An example service CI/CD on AWS
• DevOps goals V.S. our original way V.S. CI/CD on AWS
• Lessons learned
Original services delivery process in SPN
Roles: Developers, Operators, Infra. admin
Teams: Service team, Operation team, DCS team
Environments: Release portal, Stg., PROD
1. Dev, utests,…
2. Source Repo
3. Back and forth
4. Trigger CI
5. Devices spec. for both Stg/PROD
6.1 Monitoring scripts
6.2 Puppet scripts
6.3 Operation guides
7. Trigger release build
8. Release artifacts
9. Stg resources ready
10. Deploy service & scripts
11. Deploy and monitor
12.1 Itests
12.2 Stress tests
12.3 UAT
13. Release artifacts
14. PROD resources ready
15. 16. 17. PROD release
Dev/Ops
DevOps is not a new technology or a product. It’s an approach or culture of
software development that seeks stability and performance at the same time that it
speeds software deliveries to the business.
── Andi Mann, CA Technology ──
Cited from: Derek Chen, RD, TrendMicro, https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#15
Software Delivery
Stages: Plan, Code, Build, Test, Release, Deploy, Operate, Monitor
Practices: Agile Development, Continuous Integration, Continuous Delivery, Continuous Deployment, DevOps
Cited from: Derek Chen, RD, TrendMicro, https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#23
DevOps goals V.S. our original way
• Faster time to market
– So complicated that steps are easily missed
– Service team needs to follow up themselves
– Some steps need lead time (machine resources, etc.)
• Lower failure rate of new releases
– Manual steps lead to errors
• Shorten lead time between fixes
– Rolling upgrade
– Invasive
• Faster mean time to recovery
– Hard to deal with machine errors and peak loads
“Very often, automation supports this objective”
Quoted from Wikipedia for DevOps goals: https://en.wikipedia.org/wiki/DevOps#Goals
CI/CD on AWS
TWO ACHIEVE SAME DEVOPS GOALS
DEVOPS FOCUSES ON ORGANIZATIONAL CHANGES
CI/CD FOCUSES ON TECHNICAL IMPLEMENTATIONS
Review for CI and CD
• Continuous Integration
– is the practice of merging all developer working copies to a shared mainline (trunk) several times a day
• Continuous Delivery
– produce software in short cycles, ensuring that the software can be reliably released at any time
• Continuous Deployment
– means that every change is automatically deployed to production
https://en.wikipedia.org/wiki/Continuous_integration
https://en.wikipedia.org/wiki/Continuous_delivery
Characteristics of Cloud Computing
• On-demand self-service
– A consumer can unilaterally provision computing capabilities
• Broad network access
– Capabilities are available over the network and accessed through standard mechanisms
• Resource pooling
– The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model
• Rapid elasticity
– Capabilities can be elastically provisioned and released
• Measured service
– Cloud systems automatically control and optimize resource use
http://www.inforisktoday.com/5-essential-characteristics-cloud-computing-a-4189
https://en.wikipedia.org/wiki/Infrastructure_as_Code
(AWS)
DevOps
CI/CD
Automation
Cloud Computing
AWS managed services SPN used
• AWS CloudFormation
– Gives developers and systems administrators an easy way to create and manage a collection of related AWS resources
– We use it to provision our service components
• Such as load balancer (ALB), machines (EC2)
• AWS OpsWorks
– A configuration management service that uses Chef, an automation platform that treats server configurations as code
– We use it to deploy, configure and start up our service components
https://aws.amazon.com/cloudformation/
https://aws.amazon.com/opsworks/
AWS CloudFormation + OpsWorks
[Figure: the user puts inputs into AWS S3; AWS CloudFormation (main template with IAM, ALB and OpsWorks templates) provisions resources in the AWS VPC; AWS OpsWorks deploys the service]
1. Put CF templates
2. Put artifacts
3. Put Chef recipes
4. Create CF W/ params, VPC ID, etc
5. Templates input
6. Create CF stacks
7. Provision AWS resources
8. Create OpsWorks
9. Artifacts/recipes input
10. Deploy/config/start up service
Ready to serve
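Step 4 above ("Create CF W/ params, VPC ID, etc") can be sketched as a small helper that assembles the request for a stack creation. This is a minimal sketch, not the team's actual script: the parameter keys `VpcId` and `ArtifactVersion`, the template URL and the VPC ID are hypothetical. The returned dict is shaped like the keyword arguments of boto3's CloudFormation `create_stack` call; no AWS call is made here.

```python
import json


def build_create_stack_request(stack_name, template_url, vpc_id, artifact_version):
    """Assemble keyword arguments shaped like boto3's CloudFormation create_stack call."""
    return {
        "StackName": stack_name,
        # The main template is read from S3 (steps 1 and 5 in the flow above).
        "TemplateURL": template_url,
        "Parameters": [
            # Hypothetical parameter keys; the real template's names are not in the slides.
            {"ParameterKey": "VpcId", "ParameterValue": vpc_id},
            {"ParameterKey": "ArtifactVersion", "ParameterValue": artifact_version},
        ],
        # Needed because the stack creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


request = build_create_stack_request(
    stack_name="ae-dev-myAE",
    template_url="https://s3.amazonaws.com/dev-us-east-1/cf/main.template",  # hypothetical
    vpc_id="vpc-0123456789abcdef0",  # hypothetical
    artifact_version="1.19.AE_100",
)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, such a dict could be passed as `client.create_stack(**request)`.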
CoreTech DCS managed services
• Enterprise github
– Just like the github we use on Internet
• CloudCI – Enterprise Circle CI
– A Docker container based CI solution
– Seamlessly integrated with github
• JFrog Artifactory
– A CoreTech-wide shared artifacts repo.
An example service CI/CD on AWS
ANALYTIC ENGINE
Analytic Engine is an API service for…
Common Big Data computation service on Cloud (AWS)
https://www.slideshare.net/takeshi_miao/analytic-engine-a-common-big-data-computation-service-on-the-aws
AE High Level Architecture Design
[Figure: AE architecture in Oregon (us-west-2). Labels: AE API servers, workers and Eureka in Multi-AZ Auto Scaling groups (AZa, AZb, AZc); Private ALB; RDS; EMR; cross-account S3 buckets; Amazon SNS; Splunk; Cloud Storage (isValidUser, CS output); IDC services connected via VPN and peering over HTTPS/HTTP Basic]
This is really what we are taking care of
[Figure: the AE high level architecture diagram, repeated]
What components in CI/CD scope
• In scope
– API, Worker, Eureka, Genie W/ auto-scaling group
• EC2, deploy, configure and start up component services
– AWS Elastic Application Load Balancer
– AWS Simple Notification Service
• NOT in scope
– VPC/subnets/VPC peerings
• We use fixed VPC and subnets for both VPN connections and VPC peerings
– RDS MySQL DB
• Already pre-created
– EMR clusters
• Created by user API calls via AWS Java SDK
CI/CD use cases
1. Developer edits/pushes code to github
2. Developer deploys AE to Dev env. for tests
3. Developer terminates AE in Dev env. after tests
4. Developer deploys AE to Stg env. for integrated tests/UAT
5. Developer deploys AE to PROD env.
6. Developer patches hotfixes and deploys to PROD
7. Monitor your service components
1. Developer edits/pushes code to github
Feature branch workflow: https://www.atlassian.com/git/tutorials/comparing-workflows
Repo: spn/ae-saas, Project: spn/ae-saas
Branches: master, AE-100; version: 1.19.0
1. Push AE-100 branch
2. Trigger CI
3. Build
4. Utests
5. Package
6. cp artifacts to S3
7. cp to S3
8. Publish artifacts to mvn repo.
9. Publish artifacts to mvn repo.
S3: dev-us-east-1 (CF templates, Chef recipes, ae-1.19.AE_100.jars)
Every commit will trigger this build
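The artifact names on these slides (ae-1.19.AE_100.jars for branch AE-100, and later ae-1.19.563.jars for a merged build tagged 1.19.<buildNum>) suggest a naming rule. The following is a minimal sketch of that rule as inferred from the slides; the function names are hypothetical and the real CI scripts may differ.

```python
def artifact_version(base_version, branch, build_num=None):
    """Derive the artifact version from the branch (rule inferred from the slides)."""
    if branch == "master":
        # Merged builds are versioned as <base>.<buildNum>, e.g. 1.19.563.
        if build_num is None:
            raise ValueError("master builds need a build number")
        return f"{base_version}.{build_num}"
    # Feature branches reuse the branch name, with '-' turned into '_'.
    return f"{base_version}.{branch.replace('-', '_')}"


def artifact_name(version):
    """Name of the packaged artifact as it appears in the S3 bucket."""
    return f"ae-{version}.jars"


print(artifact_name(artifact_version("1.19", "AE-100")))
print(artifact_name(artifact_version("1.19", "master", build_num=563)))
```

Keeping the branch name inside the version lets several feature builds sit side by side in the same dev bucket without overwriting each other.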
2. Developer deploys AE to Dev env. for tests
Developers, feature branch workflow (branches: master, AE-100)
Repo: spn/ae-saas, Project: spn/ae-saas
S3: dev-us-east-1 (CF templates, Chef recipes, ae-1.19.AE_100.jars)
Env. variables in CI
1. Git tag: c-1.19.AE_100-dev-us-east-1-myAE
2. Push tag
3. Trigger CI
4. Create CF
5. CF creating for stack: ae-dev-myAE (AWS CF, Dev VPC)
5.1 Templates input
6. Provision resources
7. Deploy/config/startup service
Ready for tests
3. Developer terminates AE in Dev env. after tests
Developers, feature branch workflow (branch: master)
Repo: spn/ae-saas, Project: spn/ae-saas
1. Git tag: d-1.19.AE_100-dev-us-east-1-myAE
2. Push tag
3. Trigger CI
4. Delete CF
5. CF deleting for stack: ae-dev-myAE (AWS CF, Dev VPC)
6. Terminating resources
8.1 Deploy/config/startup service
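The git tags in these use cases appear to encode the whole deployment request: an action prefix (c for create, d for delete, and u for update in the hotfix use case), the artifact version, the environment, the region and an instance name. A minimal sketch of how a CI job might parse such a tag follows; the regular expression and the names `DeployTag` and `parse_tag` are assumptions, not the team's actual implementation.

```python
import re
from typing import NamedTuple

# Action prefixes seen in the slides' tags.
ACTIONS = {"c": "create", "d": "delete", "u": "update"}

TAG_PATTERN = re.compile(
    r"^(?P<action>[cdu])-"
    r"(?P<version>\d+\.\d+\.\w+)-"        # e.g. 1.19.AE_100 or 1.19.563
    r"(?P<env>dev|stg|prod)-"
    r"(?P<region>[a-z]{2}-[a-z]+-\d+)-"   # e.g. us-east-1
    r"(?P<name>\w+)$"                     # e.g. myAE
)


class DeployTag(NamedTuple):
    action: str
    version: str
    env: str
    region: str
    name: str

    @property
    def stack_name(self):
        # Matches the stack names on the slides, e.g. ae-dev-myAE.
        return f"ae-{self.env}-{self.name}"


def parse_tag(tag):
    """Split a deployment tag into its parts, or raise ValueError."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a deployment tag: {tag!r}")
    return DeployTag(ACTIONS[match["action"]], match["version"],
                     match["env"], match["region"], match["name"])


tag = parse_tag("c-1.19.AE_100-dev-us-east-1-myAE")
print(tag.action, tag.version, tag.region, tag.stack_name)
```

Rejecting anything that does not match the pattern keeps ordinary release tags from accidentally triggering a deployment.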
4. Developer deploys AE to Stg env. for integrated tests/UAT (Much like UC#2)
Developers, feature branch workflow (branches: master, AE-100; merged version: 1.19.563)
Repo: spn/ae-saas, Project: spn/ae-saas
S3: dev-us-east-1 (CF templates, Chef recipes, ae-1.19.563.jars)
S3: stg-us-east-1
Env. variables in CI
1. Merge feature branch: 1.19.<buildNum>
2. Git tag: c-1.19.563-stg-us-east-1-myAE
3. Push tag
4. Trigger CI
5. cp artifacts to stg S3
6. cp artifacts from dev to stg
6.1 Copying
7. Create CF
8. Provision resources for stack: ae-stg-myAE (AWS CF, Dev VPC)
9. Run itests on service
Ready for tests
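Artifacts are promoted from one environment's S3 location to the next (dev to stg here, stg to prod in the hotfix use case) rather than rebuilt. The following is a minimal sketch of computing the source and destination of such a copy, assuming the S3 labels on the slides (dev-us-east-1, stg-us-east-1, prod-us-west-2) are bucket names and that artifacts sit at the bucket root; both are assumptions.

```python
# Promotion order and the S3 labels shown on the slides (assumed to be bucket names).
PROMOTION_CHAIN = [
    ("dev", "dev-us-east-1"),
    ("stg", "stg-us-east-1"),
    ("prod", "prod-us-west-2"),
]


def promotion_copy(target_env, artifact):
    """Return (source_uri, destination_uri) for promoting an artifact into target_env."""
    envs = [env for env, _ in PROMOTION_CHAIN]
    if target_env not in envs:
        raise ValueError(f"unknown environment: {target_env!r}")
    index = envs.index(target_env)
    if index == 0:
        raise ValueError("dev artifacts come from the CI build, not from a promotion")
    source_bucket = PROMOTION_CHAIN[index - 1][1]
    target_bucket = PROMOTION_CHAIN[index][1]
    # Hypothetical key layout: artifact stored at the bucket root.
    return (f"s3://{source_bucket}/{artifact}", f"s3://{target_bucket}/{artifact}")


print(promotion_copy("stg", "ae-1.19.563.jars"))
print(promotion_copy("prod", "ae-1.19.570.jars"))
```

Copying the exact bytes that passed tests in the previous environment avoids differences that a rebuild could introduce.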
5. Developer deploys AE to PROD env. (Much like UC#4)
Much like UC#4
Git tag: c-1.19.563-prod-us-west-2-myAE
6. Developer patches hotfixes and deploys to PROD (1/2)
Developers, feature branch workflow (branches: master, AE-105; version: 1.19.570)
Repo: spn/ae-saas, Project: spn/ae-saas
S3: stg-us-east-1 (CF templates, Chef recipes, ae-1.19.563.jars)
S3: prod-us-west-2
Env. variables in CI
1. Git tag: u-1.19.570-prod-us-west-2-myAE
2. Push tag
3. Trigger CI
4. cp artifacts to prod S3
5. cp artifacts from stg to prod
5.1 Copying
6. Update CF
7. Update CF stack: ae-prod-myAE (AWS CF, Dev VPC)
8.1 Re-deploy/config/startup service
Ready to serve
6. Developer patches hotfixes and deploys to PROD (2/2)
• Updating W/O SLA impact
– ALB W/ AutoScalingReplacingUpdate configured for the UpdatePolicy attribute
• Better and more flexible auto-scaling
– EC2 Auto-scaling group + OpsWorks
• Cross-region deployment as early as possible
– Minor configuration diffs
• A successful deploy to us-east-1 does not assure success in other regions…
– AWS SDK default region is us-east-1
• You may forget to set it in your code…
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
https://aws.amazon.com/tw/blogs/devops/auto-scaling-aws-opsworks-instances/
(Auto-healing really sucks)
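The AutoScalingReplacingUpdate policy mentioned above is set on the Auto Scaling group resource in the CloudFormation template. With WillReplace set to true, CloudFormation creates a replacement group during an update and keeps the old group until the new one is created. The following is a minimal sketch of such a resource, built as a Python dict and printed as template JSON; the logical name `ApiLaunchConfig`, the subnet IDs and the sizes are hypothetical, and this fragment is not the team's actual template.

```python
import json


def asg_with_replacing_update(min_size, max_size, subnet_ids, launch_config_ref):
    """Build an Auto Scaling group resource that is replaced as a whole on update."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "MinSize": str(min_size),
            "MaxSize": str(max_size),
            "VPCZoneIdentifier": list(subnet_ids),
            "LaunchConfigurationName": {"Ref": launch_config_ref},
        },
        # UpdatePolicy is a resource attribute, a sibling of Properties.
        "UpdatePolicy": {
            "AutoScalingReplacingUpdate": {"WillReplace": True},
        },
    }


resource = asg_with_replacing_update(
    min_size=2,
    max_size=4,
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # hypothetical
    launch_config_ref="ApiLaunchConfig",                # hypothetical logical ID
)
print(json.dumps({"ApiServerGroup": resource}, indent=2))
```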
7. Monitor your service components (1/2)
These are the practices we learned from other teams in Trend
• Visibility
– Operators can get timely system status anytime, anywhere
– Practice:
• CW metrics -> CW dashboard
• CloudWatchLog -> AWS Lambda -> Log management system
• Monitoring
– Operators can set a threshold on any metric as a monitor
– The monitor can then trigger corresponding actions to notify operators
– Practice:
• [App logs -> CW agent -> | custom] CW metrics -> CW Alarm
• Auto-Recovery
– The system can recover itself automatically whenever a component fails
– Practice:
• EC2 auto-scaling group + OpsWorks
• CW metrics -> CW Alarm -> AWS Lambda -> AWS SDK -> AWS OpsWorks | AWS EC2
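The "CW metrics -> CW Alarm" practice can be illustrated with a CloudFormation alarm resource on a custom metric that notifies an SNS topic. This is a minimal sketch: the namespace `Custom/AE`, the metric name `MemoryUtilization`, the period settings and the logical ID `NotifyTopic` are hypothetical, since the slides do not give the actual metric definitions.

```python
import json


def memory_alarm(threshold_percent, sns_topic_ref):
    """Build a CloudWatch alarm resource that notifies an SNS topic when breached."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "Namespace": "Custom/AE",           # hypothetical custom namespace
            "MetricName": "MemoryUtilization",  # hypothetical custom metric
            "Statistic": "Average",
            "Period": 60,                       # seconds per evaluation period
            "EvaluationPeriods": 3,             # breach must last 3 periods
            "Threshold": threshold_percent,
            "ComparisonOperator": "GreaterThanThreshold",
            # Ref on an SNS topic resolves to its ARN.
            "AlarmActions": [{"Ref": sns_topic_ref}],
        },
    }


alarm = memory_alarm(threshold_percent=85, sns_topic_ref="NotifyTopic")
print(json.dumps({"ApiMemoryAlarm": alarm}, indent=2))
```

The SNS topic can then fan out to a pager for monitoring, or to an AWS Lambda function for auto-recovery.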
7. Monitor your service components (2/2)
A high level architecture design
[Figure: monitoring architecture in four stages (Input, Store, Process, Output). Labels: app components and managed services; default metrics and custom metrics (CPU, mem, disk); app logs to CWLog; AWS CloudWatch (CW metrics); AWS CloudWatchLog (metric filters); CW Dashboard; CW Alarms; AWS SNS; pager; AWS Lambda; log management. Paths are marked Visibility, Monitoring and Auto-recovery]
DevOps goals V.S. our original way V.S. CI/CD on AWS
• Faster time to market
– Original way: so complicated that steps are easily missed; service team needs to follow up themselves; some steps need lead time (machine resources, etc.)
– CI/CD: one-click delivery; only one role, “developer”; minutes of lead time for resources
• Lower failure rate of new releases
– Original way: manual steps lead to errors
– CI/CD: full automation
• Shorten lead time between fixes
– Original way: rolling upgrade; invasive
– CI/CD: replacing/rolling upgrade deployment; non-invasive
• Faster mean time to recovery
– Original way: hard to deal with machine errors and peak loads
– CI/CD: elasticity provided by the cloud computing platform
https://en.wikipedia.org/wiki/DevOps#Goals
Lessons learned
• Automate everything you can
– CloudFormation + EC2 Auto-scaling group + OpsWorks
– AWS::CloudFormation::CustomResource is also a tool to the rescue
• Consider splitting your service CF template
– Service infra. (RDS, SNS, KMS key, etc.)
• You do not update your infra. often
– Service instances (EC2, etc.)
• We update our service instances very often
• Do not only consider first-time creation
– How to update your services W/O impacting SLA
• Monitor ! Monitor !! Monitor !!!
• TEST ! TEST !! TEST !!!
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cfn-customresource.html
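One way to split a template into a rarely updated infra stack and a frequently updated service stack is CloudFormation's cross-stack references: the infra stack exports a value and the service stack reads it with Fn::ImportValue. The slides do not say how the two stacks were actually linked, so the following is a minimal sketch under that assumption; the export name and logical IDs are hypothetical. The helper checks that every import in the service template has a matching export in the infra template.

```python
def find_imports(node):
    """Recursively collect the names referenced through Fn::ImportValue."""
    if isinstance(node, dict):
        names = []
        for key, value in node.items():
            if key == "Fn::ImportValue" and isinstance(value, str):
                names.append(value)
            else:
                names.extend(find_imports(value))
        return names
    if isinstance(node, list):
        return [name for item in node for name in find_imports(item)]
    return []


def exported_names(template):
    """Names exported by a template's Outputs section."""
    outputs = template.get("Outputs", {}).values()
    return {out["Export"]["Name"] for out in outputs if "Export" in out}


# Infra stack: updated rarely (an SNS topic stands in for RDS, KMS keys, etc.).
infra_template = {
    "Resources": {"NotifyTopic": {"Type": "AWS::SNS::Topic"}},
    "Outputs": {
        "NotifyTopicArn": {
            "Value": {"Ref": "NotifyTopic"},
            "Export": {"Name": "ae-infra-NotifyTopicArn"},  # hypothetical export name
        }
    },
}

# Service stack: updated often, refers to the infra stack by export name only.
service_template = {
    "Resources": {
        "ApiServerGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "MinSize": "2",
                "MaxSize": "4",
                "NotificationConfigurations": [{
                    "TopicARN": {"Fn::ImportValue": "ae-infra-NotifyTopicArn"},
                    "NotificationTypes": ["autoscaling:EC2_INSTANCE_LAUNCH_ERROR"],
                }],
            },
        }
    }
}

missing = set(find_imports(service_template)) - exported_names(infra_template)
print("unresolved imports:", sorted(missing))
```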
Backups
Different types of Auto-scaling group
Service: OpsWorks (deploy: Chef recipe)
• 24/7
– Manual creation/deletion
– Configure one instance for one AZ
• Time-based
– Can specify time slot(s) by the hour, every day or on any day of the week
– Configure one instance for one AZ
• Load-based
– Can specify CPU/MEM/workload avg. based on an OpsWorks layer
– Up: when to increase instances
– Down: when to decrease instances
– No max./min. # of instances setting
– Configure one instance for one AZ
Service: EC2 (deploy: user-data)
• Can set max./min. # of instances
• Multi-AZ support
Auto Recovery based on Monit
• OpsWorks already uses Monit for auto recovery
– Leverage Monit on EC2
– We already have on-premise practice with it
11/22/2017
Confidential | Copyright 2014 TrendMicro Inc.
[Figure: API servers in AZ1 and AZ2 within an Auto Scaling group]
https://mmonit.com/monit/
• Instance check by CloudWatch
• Process check by Monit
• No process – restart process
• Process health check failed – terminate EC2
• Terminate EC2 → Auto Scaling group launches new EC2
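The recovery rules above amount to a small decision table. A minimal sketch in Python follows, only to make the order of the checks explicit. In practice Monit expresses such rules in its own configuration language, and the names here are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART_PROCESS = "restart process"
    TERMINATE_INSTANCE = "terminate EC2"  # the Auto Scaling group then launches a new one


def decide(process_running, health_check_ok):
    """Pick the recovery action for one check cycle on an instance."""
    if not process_running:
        # No process: the cheapest fix is to restart it in place.
        return Action.RESTART_PROCESS
    if not health_check_ok:
        # Process is up but unhealthy: give up on this instance.
        return Action.TERMINATE_INSTANCE
    return Action.NONE


for running, healthy in [(False, False), (True, False), (True, True)]:
    print(running, healthy, "->", decide(running, healthy).value)
```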
Little variances among AWS regions
• Impact
– The same automation scripts cannot run successfully across regions, and sometimes not even within the same region
• Issues
– OpsWorks, within the same region (us-west-2)
• Root cause: the accepted S3 URL spec. for the property “Repository URL” changed from “https://s3.amazonaws.com” to “https://s3-us-west-2.amazonaws.com”
– OpsWorks, us-west-2 V.S. us-east-1
• Root cause: still the “Repository URL” issue, “https://s3-us-west-2.amazonaws.com” V.S. “https://s3.amazonaws.com”
– EC2, us-west-2 V.S. us-east-1
• Root cause: the EC2 FQDN spec. is different, “ip-10-104-33-152.us-west-2.compute.internal” V.S. “ip-10-103-73-248.ec2.internal”
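The FQDN difference can be handled in automation scripts by deriving the private DNS name from the region instead of hard-coding one form. A minimal sketch follows; the function name is hypothetical, and it covers only the default VPC-provided private DNS names shown above.

```python
def private_dns_name(private_ip, region):
    """Build the default EC2 private DNS name for an IPv4 address in a region."""
    host = "ip-" + private_ip.replace(".", "-")
    # us-east-1 uses the short ec2.internal suffix; other regions include the region name.
    if region == "us-east-1":
        suffix = "ec2.internal"
    else:
        suffix = f"{region}.compute.internal"
    return f"{host}.{suffix}"


print(private_dns_name("10.104.33.152", "us-west-2"))
print(private_dns_name("10.103.73.248", "us-east-1"))
```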
OpsWorks V.S. image-based deployment
• OpsWorks deployment
– This is what we currently use
– It takes too long to launch a service component
• E.g. it takes about 10 mins to launch a Genie node
• Image-based deployment
– Theoretically, it should take a very short time to launch a service component
– More responsive for peak workloads
– AMI (AWS Machine Images) V.S. Docker images ?
How about API Gateway and ECS ?
• API Gateway
– Not suitable because it is only accessible from the Internet
– Cold start
– RDB connection overflow
– CORS integration for web UI
• ECS
– Still need to run standby EC2 instances for peak…
– Only takes care of RESTful API services
– Kubernetes is more suitable for our use cases