20171122 aws usergrp_coretech-spn-cicd-aws-v01

SPN CI/CD journey on AWS

SPN Infra., CoreTech

Scott Miao

11/22/2017


Who am I

• Scott Miao

• RD, SPN Infra., TrendMicro

• OOAD system dev. 10+ years

• Hadoop ecosystem 6 years

• AWS for BigData 4 years

• LinkedIn: https://www.linkedin.com/in/scott-miao-2b891053/

• SlideShare: https://www.slideshare.net/takeshi_miao

Agenda

• Original services delivery process in SPN

• Dev/Ops

– DevOps goals V.S. our original way

• CI/CD on AWS

• An example service CI/CD on AWS

• DevOps goals V.S. our original way V.S. CI/CD on AWS

• Lessons learned

Original services delivery process in SPN

[Figure: delivery flow across roles and environments]

Roles: Developers, Operators, Infra. admin

Teams: Service team, Operation team, DCS team

Systems: Source Repo, Release portal

Environments: Stg., PROD

1. Dev, utests,…

2. Source Repo

3. Back and forth

4. Trigger CI

5. Devices spec. for both Stg/PROD

6.1 Monitoring scripts

6.2 Puppet scripts

6.3 Operation guides

7. Trigger Release build

8. Release artifacts

9. Stg resources ready

10. Deploy service & scripts

11. Deploy and monitor

12.1 Itests

12.2 Stress tests

12.3 UAT

13. Release artifacts

14. PROD resources ready

15. 16. 17. PROD release

Dev/Ops


DevOps is not a new technology or a product. It’s an approach or culture of software development that seeks stability and performance at the same time that it speeds software deliveries to the business.

── Andi Mann, CA Technology ──

Cited from: Derek Chen, RD, TrendMicro, https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#15


Software Delivery

Plan, Code, Build, Test, Release, Deploy, Operate, Monitor

Agile Development

Continuous Integration

Continuous Delivery

Continuous Deployment

DevOps

Cited from: Derek Chen, RD, TrendMicro, https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#23


DevOps goals V.S. our original way

• Faster time to market

– Too complicated; easy to miss steps

– Service team needs to follow up themselves

– Lead time needed for some steps (machine resources, etc.)

• Lower failure rate of new releases

– Manual steps lead to errors

• Shorten lead time between fixes

– Rolling upgrade

– Invasive

• Faster mean time to recovery

– Hard to deal with machine errors and peak workloads


https://en.wikipedia.org/wiki/DevOps#Goals

“Very often, automation supports this objective”


Quoted from Wikipedia for DevOps goals


CI/CD on AWS

THE TWO ACHIEVE THE SAME DEVOPS GOALS

DEVOPS FOCUSES ON ORGANIZATIONAL CHANGES

CI/CD FOCUSES ON TECHNICAL IMPLEMENTATIONS

Review for CI and CD

• Continuous Integration

– is the practice of merging all developer working copies to a shared mainline (trunk) several times a day

• Continuous Delivery

– produces software in short cycles, ensuring that the software can be reliably released at any time

• Continuous Deployment

– means that every change is automatically deployed to production


https://en.wikipedia.org/wiki/Continuous_integration

https://en.wikipedia.org/wiki/Continuous_delivery

Characteristics of Cloud Computing

• On-demand self-service

– A consumer can unilaterally provision computing capabilities

• Broad network access

– Capabilities are available over the network and accessed through standard mechanisms

• Resource pooling

– The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model

• Rapid elasticity

– Capabilities can be elastically provisioned and released

• Measured service

– Cloud systems automatically control and optimize resource use


http://www.inforisktoday.com/5-essential-characteristics-cloud-computing-a-4189

https://en.wikipedia.org/wiki/Infrastructure_as_Code

[Figure: DevOps, CI/CD, Automation, Cloud Computing (AWS)]

AWS managed services SPN used

• AWS CloudFormation

– Gives developers and systems administrators an easy way to create and manage a collection of related AWS resources

– We use it to provision our service components

  • Such as load balancer (ALB), machines (EC2)

• AWS OpsWorks

– A configuration management service that uses Chef, an automation platform that treats server configurations as code

– We use it to deploy, configure and start up our service components


https://aws.amazon.com/cloudformation/

https://aws.amazon.com/opsworks/

AWS CloudFormation + OpsWorks

[Figure: provisioning flow between the user, AWS S3, AWS CloudFormation, AWS OpsWorks and the AWS VPC]

Legend: User, CF, Ops

AWS S3 holds: CF templates (main; IAM, ELB, OpsWorks), artifacts, Chef recipes

AWS CloudFormation stacks: main; IAM, ALB, OpsWorks

1. Put CF templates

2. Put artifacts

3. Put Chef recipes

4. Create CF W/ params, VPC ID, etc

5. Templates input

6. Create CF stacks

7. Provision AWS resources

8. Create OpsWorks

9. Artifacts/recipes input

10. Deploy/Config/start up service

Ready to serve
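Step 4 above ("Create CF W/ params, VPC ID, etc") can be sketched as a small helper that shapes the arguments for CloudFormation's CreateStack call. This is only an illustration, not the team's actual script: the template path, the parameter names (`VpcId`, `ArtifactVersion`) and the VPC ID are hypothetical, and the helper only builds the request so it runs without AWS credentials.

```python
def build_create_stack_request(stack_name, template_url, params):
    """Return keyword arguments shaped for CloudFormation's CreateStack API."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # CloudFormation expects parameters as a list of key/value records.
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in sorted(params.items())
        ],
        # Required when a template creates IAM resources (the slides show an IAM stack).
        "Capabilities": ["CAPABILITY_IAM"],
    }


request = build_create_stack_request(
    stack_name="ae-dev-myAE",
    # Hypothetical object path inside the dev bucket named on the slides.
    template_url="https://s3.amazonaws.com/dev-us-east-1/cf/main.json",
    # Hypothetical parameter names and values.
    params={"VpcId": "vpc-0123abcd", "ArtifactVersion": "1.19.AE_100"},
)
print(request["StackName"], len(request["Parameters"]))
```

With boto3 installed and credentials configured, the request could be passed on as `boto3.client("cloudformation", region_name="us-east-1").create_stack(**request)`.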

CoreTech DCS managed services

• Enterprise github

– Just like the GitHub we use on the Internet

• CloudCI – Enterprise Circle CI

– A Docker container based CI solution

– Seamlessly integrated with github

• JFrog Artifactory

– A CoreTech-wide shared artifacts repo.

An example service CI/CD on AWS

ANALYTIC ENGINE

Analytic Engine is an API service for…

Common Big Data computation service on Cloud (AWS)

https://www.slideshare.net/takeshi_miao/analytic-engine-a-common-big-data-computation-service-on-the-aws


AE High Level Architecture Design

[Figure: AE architecture in Oregon (us-west-2). Labeled components: IDC services, VPN, VPC peering, private ALB, AE API servers, workers, Eureka, Auto Scaling groups across multiple AZs (AZa, AZb, AZc), RDS, EMR clusters, cross-account S3 buckets, Amazon SNS, Splunk, and Cloud Storage (isValidUser, CS output). Connections are labeled HTTPS and HTTPS/HTTP Basic.]

This is what we really care about

[Figure: the same AE high-level architecture diagram as on the previous slide]

What components in CI/CD scope

• In scope

– API, Worker, Eureka, Genie W/ auto-scaling group

  • EC2, deploy, configure and start up component services

– AWS Elastic Application Load Balancer

– AWS Simple Notification Service

• NOT in scope

– VPC/subnets/VPC peerings

  • We use fixed VPC and subnets for both VPN connections and VPC peerings

– RDS MySQL DB

  • Already pre-created

– EMR clusters

  • Created by user API calls via AWS Java SDK

CI/CD Usecases

1. Developer edits/pushes codes to github

2. Developer deploys AE to Dev env. for tests

3. Developer terminates AE in Dev env. after tests

4. Developer deploys AE to Stg env. for integrated tests/UAT

5. Developer deploys AE to PROD env.

6. Developer patches hotfixes and deploys to PROD

7. Monitor your service components

1. Developer edits/pushes codes to github

[Figure: feature branch workflow. Developers work on branches master and AE-100 (version 1.19.0) in repo spn/ae-saas; the CI project is spn/ae-saas]

1. Push AE-100 branch

2. Trigger CI

3. build

4. utests

5. package

6. cp artifacts to S3

7. cp to S3

8. publish artifacts to mvn repo.

9. Publish artifacts to mvn repo.

S3: dev-us-east-1 holds CF templates, ae-1.19.AE_100.jars and Chef recipes

Every commit will trigger this build

Feature branch workflow: https://www.atlassian.com/git/tutorials/comparing-workflows
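The artifact names on these slides (`ae-1.19.AE_100.jars` for branch AE-100, `ae-1.19.563.jars` for a master build) suggest a naming rule. A minimal sketch of that rule follows; the rule itself is inferred from the two examples, and the function names are hypothetical.

```python
def artifact_version(base_version, branch, build_num):
    """Derive the artifact version for a CI build.

    Assumed rule, inferred from the slide examples:
    - master builds use <base>.<buildNum>, e.g. 1.19.563
    - feature branches use <base>.<branch with '-' replaced by '_'>, e.g. 1.19.AE_100
    """
    if branch == "master":
        return f"{base_version}.{build_num}"
    return f"{base_version}.{branch.replace('-', '_')}"


def artifact_name(version):
    """Name of the packaged artifact as it appears in the S3 bucket."""
    return f"ae-{version}.jars"


print(artifact_name(artifact_version("1.19", "AE-100", 12)))
print(artifact_name(artifact_version("1.19", "master", 563)))
```

Replacing `-` with `_` in the branch name presumably keeps `-` free as the field separator in the deployment tags used in the following use cases.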

2. Developer deploys AE to Dev env. for tests

[Figure: Developers (branches master and AE-100), CI with env. variables, S3: dev-us-east-1 (CF templates, ae-1.19.AE_100.jars, Chef recipes), AWS CF, Dev VPC]

1. Git tag: c-1.19.AE_100-dev-us-east-1-myAE

2. Push tag

3. Trigger CI

4. Create CF

5. CF creating for stack: ae-dev-myAE

5.1 Templates input

6. Provision resources

7. Deploy/config/startup service

Ready for tests
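The deployment tags in these use cases (`c-…`, `d-…`, `u-…`) appear to encode action, version, environment, region and instance name. The parser below is a sketch of how a CI job might decode them; the grammar is inferred from the tags shown on the slides, and the names `DeployTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed grammar: <action>-<version>-<env>-<region>-<name>
TAG_RE = re.compile(
    r"^(?P<action>[cdu])-"
    r"(?P<version>[^-]+)-"
    r"(?P<env>dev|stg|prod)-"
    r"(?P<region>[a-z]{2}-[a-z]+-\d+)-"
    r"(?P<name>.+)$"
)

# c = create, d = delete, u = update, as used in UC#2, UC#3 and UC#6.
ACTIONS = {"c": "create", "d": "delete", "u": "update"}


class DeployTag(NamedTuple):
    action: str
    version: str
    env: str
    region: str
    name: str

    @property
    def stack_name(self):
        # Matches the stack names on the slides, e.g. ae-dev-myAE.
        return f"ae-{self.env}-{self.name}"


def parse_tag(tag):
    """Split a deployment tag into its fields, or raise ValueError."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a deployment tag: {tag!r}")
    return DeployTag(
        ACTIONS[match["action"]],
        match["version"],
        match["env"],
        match["region"],
        match["name"],
    )


tag = parse_tag("c-1.19.AE_100-dev-us-east-1-myAE")
print(tag.action, tag.version, tag.region, tag.stack_name)
```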

3. Developer terminates AE in Dev env. after tests

[Figure: Developers (branch master), CI, AWS CF, Dev VPC]

1. Git tag: d-1.19.AE_100-dev-us-east-1-myAE

2. Push tag

3. Trigger CI

4. delete CF

5. CF deleting for stack: ae-dev-myAE

6. Terminating resources

4. Developer deploys AE to Stg env. for integrated tests/UAT (Much like UC#2)

[Figure: Developers (branches master and AE-100, version 1.19.563), CI with env. variables, S3: dev-us-east-1 (CF templates, ae-1.19.563.jars, Chef recipes), S3: stg-us-east-1, AWS CF, Dev VPC]

1. Merge feature branch: 1.19.<buildNum>

2. Git tag: c-1.19.563-stg-us-east-1-myAE

3. Push tag

4. Trigger CI

5. cp artifacts to stg S3

6. cp artifacts from dev to stg

6.1 copying

7. Create CF

8. Provision resources for stack: ae-stg-myAE

8.1 Deploy/config/startup service

9. Run itests on service

Ready for tests
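The "cp artifacts from dev to stg" step can be sketched as a function that works out the source and destination of a promotion copy. The bucket names are the ones on the slides; the flat key layout and the rule that an environment is always fed from the previous one are assumptions made for illustration.

```python
# Bucket names as shown on the slides.
BUCKETS = {
    "dev": "dev-us-east-1",
    "stg": "stg-us-east-1",
    "prod": "prod-us-west-2",
}

# Assumed promotion order: artifacts move one environment at a time.
PROMOTION_ORDER = ["dev", "stg", "prod"]


def promotion_copy(version, target_env):
    """Return (source_uri, destination_uri) for promoting an artifact."""
    index = PROMOTION_ORDER.index(target_env)  # ValueError for unknown envs
    if index == 0:
        raise ValueError("dev artifacts come from the CI build, not from promotion")
    source_env = PROMOTION_ORDER[index - 1]
    key = f"ae-{version}.jars"  # hypothetical flat key layout
    return (
        f"s3://{BUCKETS[source_env]}/{key}",
        f"s3://{BUCKETS[target_env]}/{key}",
    )


print(promotion_copy("1.19.563", "stg"))
```

The resulting pair could be handed to whatever copy tool the CI job uses, for example `aws s3 cp <source> <destination>`.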

5. Developer deploys AE to PROD env. (Much like UC#4)

Much like UC#4

Git tag: c-1.19.563-prod-us-west-2-myAE

6. Developer patches hotfixes and deploys to PROD (1/2)

[Figure: Developers (branches master and AE-105, version 1.19.570), CI with env. variables, S3: stg-us-east-1 (CF templates, ae-1.19.563.jars, Chef recipes), S3: prod-us-west-2, AWS CF, Dev VPC]

1. Git tag: u-1.19.570-prod-us-west-2-myAE

2. Push tag

3. Trigger CI

4. cp artifacts to prod S3

5. cp artifacts from stg to prod

5.1 copying

6. Update CF

7. Update CF stack: ae-prod-myAE

8.1 Re-Deploy/config/startup service

Ready to serve

6. Developer patches hotfixes and deploys to PROD (2/2)

• Updating W/O SLA impact

– ALB W/ AutoScalingReplacingUpdate configured for the UpdatePolicy attribute

• Better and more flexible auto-scaling

– EC2 Auto-scaling group + Opsworks

• Cross-region deployment as early as possible

– Minor configuration diffs

  • A successful deployment to us-east-1 does not guarantee success in other regions…

– AWS SDK default value is us-east-1

  • You may forget to set it in your code…

(Auto-healing really sucks)

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html

https://aws.amazon.com/tw/blogs/devops/auto-scaling-aws-opsworks-instances/
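As an illustration of the UpdatePolicy setting mentioned on this slide, the sketch below builds a CloudFormation Auto Scaling group resource with `AutoScalingReplacingUpdate` enabled and prints it as JSON. The sizes, subnet IDs and logical resource names (`ApiLaunchConfig`, `ApiTargetGroup`) are hypothetical; the actual AE template is not shown in the slides.

```python
import json


def asg_resource(min_size, max_size, subnet_ids, launch_config_ref, target_group_ref):
    """Build an AWS::AutoScaling::AutoScalingGroup resource as a plain dict."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "MinSize": str(min_size),
            "MaxSize": str(max_size),
            "VPCZoneIdentifier": subnet_ids,
            "LaunchConfigurationName": {"Ref": launch_config_ref},
            # Registers the instances with the ALB target group.
            "TargetGroupARNs": [{"Ref": target_group_ref}],
        },
        # On update, CloudFormation creates a replacement group and removes
        # the old one only after the new group has been created.
        "UpdatePolicy": {"AutoScalingReplacingUpdate": {"WillReplace": True}},
    }


resource = asg_resource(
    min_size=2,
    max_size=4,
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # hypothetical IDs
    launch_config_ref="ApiLaunchConfig",                # hypothetical logical name
    target_group_ref="ApiTargetGroup",                  # hypothetical logical name
)
print(json.dumps({"Resources": {"ApiAutoScalingGroup": resource}}, indent=2))
```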

7. Monitor your service components (1/2)

These are the practices we learned from other teams in Trend

• Visibility

– Operators can get timely system status anytime, anywhere

– Practice:

  • CW metrics -> CW dashboard

  • CloudWatchLog -> AWS Lambda -> Log management system

• Monitoring

– Operators can set a threshold on any metric as a monitor

– The monitor can then trigger corresponding actions to notify operators

– Practice:

  • [App logs -> CW agent -> | custom] CW metrics -> CW Alarm

• Auto-Recovery

– The system can recover automatically when any component fails

– Practice:

  • EC2 auto-scaling group + Opsworks

  • CW metrics -> CW Alarm -> AWS Lambda -> AWS SDK -> AWS Opsworks|AWS EC2
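The auto-recovery chain "CW Alarm -> AWS Lambda -> AWS SDK" can be sketched as a Lambda-style function that reads alarm notifications delivered through SNS and decides which instances to replace. It assumes the alarm is published to an SNS topic that triggers the function, and the alarm message fields used here (`NewStateValue`, `Trigger.Dimensions` with `name`/`value` keys) should be checked against a real notification. The function only returns a plan instead of calling the AWS SDK.

```python
import json


def plan_recovery(event):
    """Return recovery actions for alarms that entered the ALARM state."""
    actions = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON string inside the record.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue
        dimensions = {
            d["name"]: d["value"]
            for d in message.get("Trigger", {}).get("Dimensions", [])
        }
        instance_id = dimensions.get("InstanceId")
        if instance_id:
            # Terminating lets the Auto Scaling group launch a replacement.
            actions.append({
                "alarm": message.get("AlarmName"),
                "action": "terminate",
                "instance_id": instance_id,
            })
    return actions


sample_alarm = {
    "AlarmName": "ae-api-unhealthy",  # hypothetical alarm name
    "NewStateValue": "ALARM",
    "Trigger": {"Dimensions": [{"name": "InstanceId", "value": "i-0123456789abcdef0"}]},
}
sample_event = {"Records": [{"Sns": {"Message": json.dumps(sample_alarm)}}]}
print(plan_recovery(sample_event))
```

In a real handler, each planned action would be carried out through the AWS SDK, for example an EC2 terminate call for the listed instance ID.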

7. Monitor your service components (2/2)

A high level architecture design

[Figure: monitoring architecture with stages Input, Store, Process, Output]

• Sources: App components, Managed Services

• AWS CloudWatch: default metrics, custom metrics (CPU, mem, disk), CW metrics

• AWS CloudWatchLog: app logs to CWLog, metric filters

• Visibility: CW Dashboard; AWS Lambda, Log management

• Monitoring: CW Alarms, AWS SNS, Pager

• Auto-recovery: AWS Lambda

DevOps goals V.S. our original way V.S. CI/CD on AWS

• Goal: Faster time to market

– Original way: Too complicated; easy to miss steps

– Original way: Service team needs to follow up themselves

– Original way: Lead time needed for some steps (machine resources, etc.)

– CI/CD: One-click delivery

– CI/CD: Only one role, “developer”

– CI/CD: Minutes of lead time for resources

• Goal: Lower failure rate of new releases

– Original way: Manual steps lead to errors

– CI/CD: Full automation

• Goal: Shorten lead time between fixes

– Original way: Rolling upgrade

– Original way: Invasive

– CI/CD: Replacing/rolling upgrade deployment

– CI/CD: Non-invasive

• Goal: Faster mean time to recovery

– Original way: Hard to deal with machine errors and peak workloads

– CI/CD: Elasticity provided by the cloud computing platform



Lessons learned

• Automate everything you can

– Cloudformation + EC2 Auto-scaling group + Opsworks

– AWS::CloudFormation::CustomResource is also a tool that can come to the rescue

• Consider splitting your service CF template

– Service infra. (RDS, SNS, KMS key, etc)

  • You do not update your infra. often

– Service instances (EC2, etc)

  • We update our service instances very often

• Do not only consider first-time creation

– How to update your services W/O impacting the SLA

• Monitor ! Monitor !! Monitor !!!

• TEST ! TEST !! TEST !!!

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cfn-customresource.html

Backups

Different types of Auto-scaling group

Columns: Service, Auto Scaling Group, Features, Deploy

• OpsWorks, 24/7 (deploy: chef recipe)

– manual creation/deletion

– configure one instance for one AZ

• OpsWorks, time-based (deploy: chef recipe)

– can specify time slot(s) by the hour, every day or on any day of the week

– configure one instance for one AZ

• OpsWorks, load-based (deploy: chef recipe)

– can specify CPU/MEM/workload avg. based on an OPS layer

– UP: when to increase instances

– Down: when to decrease instances

– No max./min. # of instances setting

– configure one instance for one AZ

• EC2 (deploy: user-data)

– can set max./min. # of instances

– Multi-AZs support

Auto Recovery based on Monit

• OpsWorks already uses Monit for Auto Recovery

– Leverage Monit on EC2

– We already have experience with it on-premises

Confidential | Copyright 2014 TrendMicro Inc.

[Figure: API servers in AZ1 and AZ2 inside an Auto Scaling group]

• Instance check by CloudWatch

• Process check by Monit

• No process – restart process

• Process health check failed – terminate EC2

• Terminate EC2 -> Auto Scaling group launches new EC2

https://mmonit.com/monit/
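The process-level rules above amount to a small decision table. The sketch below expresses them as plain Python for clarity; in practice these checks would live in Monit's own configuration, and the names `Action` and `decide` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART_PROCESS = "restart process"
    TERMINATE_INSTANCE = "terminate EC2"


def decide(process_running, health_check_ok):
    """Map the two process-level checks to a recovery action."""
    if not process_running:
        # No process: restart it in place.
        return Action.RESTART_PROCESS
    if not health_check_ok:
        # Process is up but unhealthy: terminate the instance and let the
        # Auto Scaling group launch a replacement.
        return Action.TERMINATE_INSTANCE
    return Action.NONE


for running, healthy in [(False, False), (True, False), (True, True)]:
    print(running, healthy, decide(running, healthy).value)
```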

Little variances among AWS regions

• Impact

– The same automation scripts may not run successfully across regions, and sometimes not even within the same region

• Issues


Columns: Service, Regions, Root cause

• OpsWorks, same region (us-west-2)

– The accepted S3 URL spec. for property “Repository URL” had changed from “https://s3.amazonaws.com” to “https://s3-us-west-2.amazonaws.com”

• OpsWorks, us-west-2 V.S. us-east-1

– Still the “Repository URL” issue: “https://s3-us-west-2.amazonaws.com” V.S. “https://s3.amazonaws.com”

• EC2, us-west-2 V.S. us-east-1

– EC2 FQDN spec. is different: “ip-10-104-33-152.us-west-2.compute.internal” V.S. “ip-10-103-73-248.ec2.internal”
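One way to keep automation scripts region-agnostic is to compute these region-dependent values instead of hard-coding them. The helpers below are a sketch that encodes only the differences listed above; generalizing the `s3-<region>` pattern beyond us-west-2 is an assumption, and the function names are hypothetical.

```python
def s3_repo_endpoint(region):
    """S3 endpoint to use in the OpsWorks "Repository URL" property."""
    if region == "us-east-1":
        return "https://s3.amazonaws.com"
    # Pattern observed for us-west-2 on the slide; assumed for other regions.
    return f"https://s3-{region}.amazonaws.com"


def ec2_internal_fqdn(private_ip, region):
    """Internal DNS name of an EC2 instance, which differs for us-east-1."""
    host = "ip-" + private_ip.replace(".", "-")
    suffix = "ec2.internal" if region == "us-east-1" else f"{region}.compute.internal"
    return f"{host}.{suffix}"


print(s3_repo_endpoint("us-west-2"))
print(ec2_internal_fqdn("10.104.33.152", "us-west-2"))
print(ec2_internal_fqdn("10.103.73.248", "us-east-1"))
```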

OpsWorks V.S. image-based deployment

• OpsWorks deployment

– This is what we currently use

– It takes too long to launch a service component

  • E.g. it takes about 10 mins to launch a Genie node

• Image-based deployment

– Theoretically, it should take very little time to launch a service component

– More responsive for peak workloads

– AMI (Amazon Machine Image) V.S. Docker images ?

How about API Gateway and ECS ?

• API Gateway

– Not a good fit because it is only accessible from the Internet

– Cold start

– RDB connection overflow

– CORS integration for web UI

• ECS

– Still need to run standby EC2 instances for peak…

– Only takes care of RESTful API services

– Kubernetes is more suitable for our use cases
