save millions by e cient resource utilization through mesos2010 ~560 services 15 mloc 2018 ~2000...

24
Save Millions By E cient Resource Utilization Through Mesos By Smarth Madan

Upload: others

Post on 26-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Save Millions By Efficient Resource Utilization Through Mesos

u By Smarth Madan

Page 2: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

PayPal and its Code is growing YOY <WIP>

2010 ~560 Services

15 MLOC

2018 ~2000 Services

80 MLOC

2014 ~1021 Services

45 MLOC

Page 3: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Test Cycle at PayPal Before Managed Stage

Secure VM Configure VM

Code

Deploy on VM

Learn, Debug & Troubleshoot

Failing Services

Up Rev Dependent

Services + DB Schema

Test

Keep your Dependent Services UP

Page 4: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Developer Pain points

Deploying all the components

Tons of time for setup and test

Identifying transitive dependencies

Maintaining stable environment

Page 5: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Infrastructure Team Pain points

Hardware requirements grows YOY

Huge Maintenance cost for ~4K test Environment

Network Bandwidth

Test topology is not same prod

Page 6: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Requirements

o  We need Production Like environment

o  Cluster of machines running all services

o  Multiple instances of each service with auto healing for availability

o  Needed to scale as the number of users grow

o  Code refresh in mins

o  Easy to connect from all other VMs

Page 7: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Resource Management

o  Manage a large set of machines in terms of compute

o  Moving away from static allocation of machines

o  Identifying unused capacity

o  Slave Categorization (diverse PayPal Tech stack)

Page 8: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Job Scheduler

o  Defines jobs with ordered tasks

o  Binding jobs to specific salves using constraints

o  Auto-healing

o  REST API capability

o  Final task to cleanup the slave for any new job

Page 9: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

What’s a Managed Stage?

Managed Stage is a continuously available Mesos based multi-node staging

environment that allows Developers and Quality engineers to certify and

release apps to LIVE.

Page 10: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Managed Stage Architecture

CC

Standby Mesos Master Standby Mesos Master Active Mesos Master

Standby Aurora Standby Aurora

Scheduler Active Aurora Scheduler

DB

Mesos slaves

Router Pool

Pool Front Pool Mid

Pool Back

Other pools

Zookeeper 1

Zookeeper 3

Zookeeper 2

Page 11: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Sample Aurora Task

Page 12: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Complex Job Scenario

Page 13: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Dynamic Service Discovery & Registration

CC

Standby Mesos Master Standby Mesos Master Active Mesos Master

Standby Aurora Standby Aurora

Scheduler Active Aurora Scheduler

DB

Mesos slaves

Router Pool

Pool Front

Pool Mid Pool Back

Other pools

Zookeeper 1

Zookeeper 3

Zookeeper 2

Job

Page 14: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Code Refresh Cycle

o  In-place and Full deploy o  5000 Total number of packages gets deployed

•  Full Deploy takes < 1 Hr •  Incremental deploy < 20 Mins

o  Code refresh frequency : •  Daily code refresh incrementally •  Biweekly full code refresh

Page 15: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Dependent Stage

MANAGED STAGE ENVIRONMENT

1000+ services

Foo [N+1]

Foo [N+1]

HAPROXY

Page 16: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Advantage

o  Developer •  Stage setup is reduced by 90% •  Increased productivity by 30 - 40% •  Abstracting all transitive dependencies

o  Infrastructure •  Optimized resource utilization •  Reduced hardware cost in data center •  Less network traffic for deploy

Page 17: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Test Cycle at PayPal After Managed Stage

Code

Deploy Your Service On UserStage

Test

Self-Serv

UserStage

Page 18: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Managed Stage Versions

18

MsMaster (Live)

         

N

R  

S  

H

C  

N

R  

S  

H

C  

N

R  

S  

H

C  

MsRelease (N+1)

         

N

R  

S  

H

C  

N

R  

S  

H

C  

N

R  

S  

H

C  

MsLnP

         

N

R  

S  

H

C  

N

R  

S  

H

C  

N

R  

S  

H

C  

Page 19: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Why Mesos ?

o  CI on Mesos was a success at PayPal

o  Setup cost and time is really low

o  Mesos & Aurora : an out-of-the-box solution

o  Docker integration in future

Page 20: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Cost Analysis

o  No. of VM : ~ 5000

o  Average Stage size : 16 CPU, 64 GB RAM, 256 GB HDD

o  Total CPU : 80,000

o Total CPU used in Managed Stage : 1500/env

o  Reclaimed CPU : 40,000 (just by 50% reduction)

Page 21: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Roadmap o  Using network isolation which was introduced in Mesos 0.22.0 o  Elastic scaling bases on usage patterns

o  Merging clusters with CI

o  Exploring Docker for containerizing pools

Page 22: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Vision

With a click of a button, you can create your own environment with components of specific versions.

Page 23: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Q&A

Page 24: Save Millions By E cient Resource Utilization Through Mesos2010 ~560 Services 15 MLOC 2018 ~2000 Services 80 MLOC 2014 ~1021 Services 45 MLOC . Test Cycle at PayPal Before Managed

Thanks