
Upload: pierre-grandin

Post on 20-Jan-2017


TRANSCRIPT

Page 1: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Building a Private Cloud to Efficiently Handle 40 Billion Requests / Day

October 28th, 2015

Pierre Gohon | Sr. Site Reliability Engineer
Pierre Grandin | Sr. Site Reliability Engineer

Page 2: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Who are we?

TubeMogul (Nasdaq: TUBE)
● Enterprise software company for digital branding
● Over 27 billion ads served in 2014
● Over 40 billion ad auctions per day in Q3 2015
● Bids processed in less than 50 ms
● Bids served in less than 80 ms (incl. network round trip)
● 5 PB of monthly video traffic served
● 1.6 EB of data stored
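To put the daily auction volume in perspective, it can be converted to an average per-second rate. This is a small arithmetic sketch only: it assumes traffic is spread evenly over the day, which real bidding traffic is not, so peak rates will be higher than the figure it prints.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rate_per_second(events_per_day: float) -> float:
    """Average events per second, assuming a perfectly flat daily profile."""
    return events_per_day / SECONDS_PER_DAY


auctions_per_day = 40e9  # "Over 40 billion ad auctions per day" from the slide
rate = average_rate_per_second(auctions_per_day)
print(f"{rate:,.0f} auctions/s on average")
```

Each of those auctions has to fit inside the 50 ms processing budget quoted on the slide, which is why latency comes up repeatedly in the rest of the deck.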

Page 3: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Who are we?

Operations Engineering
● Ensure the smooth day-to-day operation of the platform infrastructure
● Provide a cost-effective and cutting-edge infrastructure
● Provide support to dev teams
● Team composed of SREs, SEs and DBAs (US and UA)
● Managing over 2,500 servers (virtual and physical)

Page 4: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Our Infrastructure

Public Cloud On Premises

Multiple locations with a mix of Public Cloud and On Premises

Page 5: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Before OpenStack: we’re already very “Hybrid”…

● 6 AWS Regions (us-east*2, us-west*2, europe, apac)
● Physical servers in Michigan / Arizona (Web/Databases)
● DNS served by third parties (UltraDNS + Dynect)
● External monitoring using Catchpoint
● CDNs to deliver content
● External security audits

We’re not adding complexity!

Page 6: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Why?

● Own your infrastructure stack
● Physical proximity matters (reduced/controlled latency)
● Better infrastructure planning
● Technological transparency
● … $$ !

Page 7: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Project timeline

Page 8: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Where do we stand?

Page 9: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

OpenStack challenges - Operational aspect

● DIY ?
  ○ Small OPS team
    ■ 12 members in two timezones
    ■ Only 3 dedicated to OpenStack
  ○ New challenges
    ■ Internal training
    ■ Little external support (really ?) vs AWS
    ■ Manage data centers (Servers, Network, …)

Page 10: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

OpenStack challenges - Application migration aspect

● Are applications AWS dependent ?
  ○ Internal ops tools
  ○ Developers’ applications
  ○ AWS S3, DynamoDB, SNS, SQS, SES, SWF
● Convert developers to the project : we need their support
● OpenStack release cycle (when shall we update to the latest version?)
● Which OpenStack components are really needed ?
● How far do we go (S3 replacement ? Network control ? Hardware control ?)

Page 11: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Networking - External connectivity

● Managing our own ASN / IPs (v4/v6)
● Choose “best for needs” transit providers (tier 1)
● Better control routes to/from our endpoints
● Allow dedicated AWS connections / others
● Allow direct peerings to ad networks
● Want to be accountable for networking issues
● Cost control

Page 12: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Networking - Hybrid physical / virtualized

● Applications are already designed for redundancy/cloud
● Circumvent virtualized networking limitations
● Fine-tune bare-metal nodes for HAProxy
● Equipment is “cloud ready” for the future (Nexus 5K as top-of-rack switch)
  ○ automatic switch configuration
  ○ Cisco software evolutions ?
● 1G for admin, X*10G for public ?
● Leverage multicast ?

Page 13: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Networking - Hybrid physical / virtualized

[Diagram: network node, compute node and load balancer, attached to a public network and to a private network using VLANs]

Page 14: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Networking - RTT

● Latency from our DC to AWS is 6 ms on average in US-WEST

rtb-bidder01(rtb):~$ mtr -r -c 50 gw01.us-west-1a.public
HOST: rtb-bidder01                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.4.1                   0.0%    50    0.2   0.2   0.1   0.3   0.0
  2.|-- XXX.XXX.XXX.XXX            0.0%    50    0.2   0.3   0.2   2.6   0.3
  3.|-- ae-43.r02.snjsca04.us.bb.  0.0%    50    1.4   1.5   1.2   2.3   0.2
  4.|-- ae-4.r06.plalca01.us.bb.g  0.0%    50    2.0   2.1   1.8   3.4   0.3
  5.|-- ae-1.amazon.plalca01.us.b  0.0%    50   39.2   3.5   1.5  39.2   5.6
  6.|-- 205.251.229.40             0.0%    50    3.5   2.8   2.2   4.9   0.6
  7.|-- 205.251.230.120            0.0%    50    2.1   2.3   2.0   8.5   0.9
  8.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 11.|-- 216.182.237.133            0.0%    50    4.0   6.0   2.7  20.2   5.2
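The 6 ms figure is the Avg column of the last hop in the mtr report above. As a rough sketch of how such a report could be checked automatically, the snippet below parses a few hop lines copied from the slide and returns the average RTT of the last hop that answered. The regular expression and function names are illustrative assumptions, not part of the authors' tooling.

```python
import re

# A subset of the hop lines from the report on the slide.
REPORT = """\
  1.|-- 10.0.4.1                   0.0%    50    0.2   0.2   0.1   0.3   0.0
  7.|-- 205.251.230.120            0.0%    50    2.1   2.3   2.0   8.5   0.9
  8.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 11.|-- 216.182.237.133            0.0%    50    4.0   6.0   2.7  20.2   5.2
"""

# Columns: hop, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev
HOP_LINE = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_hops(report: str) -> list:
    """Turn mtr report lines into dicts; lines that do not match are skipped."""
    hops = []
    for line in report.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops.append({
                "hop": int(match.group(1)),
                "host": match.group(2),
                "loss_pct": float(match.group(3)),
                "avg_ms": float(match.group(6)),
            })
    return hops


def end_to_end_avg_ms(report: str) -> float:
    """Average RTT of the last hop that answered at least one probe."""
    answering = [h for h in parse_hops(report) if h["loss_pct"] < 100.0]
    return answering[-1]["avg_ms"]


print(end_to_end_avg_ms(REPORT))
```

Hops 8 to 10 show 100% loss because those routers do not answer probes, not because traffic is dropped, so the sketch ignores them when picking the final hop.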

Page 15: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Keep it simple

● If you are not building a multi-thousand-hypervisor cloud, you don’t need it to be complex
● Simplifies day-to-day operations
● Home-made Puppet catalog
  ○ because fewer lines of code
  ○ because of the learning curve
  ○ because we need to tweak settings (ulimit?)
● No need for Horizon
● No need for shared storage

Page 16: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Leverage your knowledge of your infrastructure

● Affinity / anti-affinity rules
  ○ Enforce resiliency using anti-affinity rules
  ○ Improve performance using affinity rules

{"profile": "OpenStack", "cluster": "rtb-hbase", "hostname": "rtb-hbase-region01", "nagios_host": "mgmt01"}
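The anti-affinity idea can be illustrated with a toy placement function keyed on the "cluster" field of the metadata above. This is only a sketch of the concept: it is not how the Nova scheduler is implemented, and the hypervisor names and the `placements` structure are hypothetical.

```python
from typing import Dict, List, Optional


def pick_hypervisor(
    cluster: str,
    placements: Dict[str, List[str]],
    hypervisors: List[str],
) -> Optional[str]:
    """Anti-affinity: return the first hypervisor that hosts no instance
    of the given cluster, or None if every hypervisor already has one."""
    for hypervisor in hypervisors:
        if cluster not in placements.get(hypervisor, []):
            return hypervisor
    return None


# Hypothetical state: which clusters already run on which hypervisor.
placements = {
    "hv01": ["rtb-hbase"],
    "hv02": ["rtb-bidder"],
    "hv03": [],
}

print(pick_hypervisor("rtb-hbase", placements, ["hv01", "hv02", "hv03"]))
```

An affinity rule is the mirror image: prefer a hypervisor that already hosts a member of the cluster, trading resiliency for locality.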

Page 17: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Treat your infrastructure as any other engineering project

Page 18: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

How? Continuous Delivery

Infrastructure As Code
● Follow standard development lifecycle
● Repeatable and consistent server provisioning

Continuous Delivery
● Iterate quickly
● Automated code review to improve code quality

Reliability
● Improve Production Stability
● Enforce Better Security Practices

Page 19: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Puppet

● We already have a lot of automation:
  ○ ~10,000 Puppet deployments last year
  ○ Over 8,500 production deployments via Jenkins last year
● On the infrastructure:
  ○ masterless mode for the deployment
  ○ master mode once the node is up and running
● On the VMs:
  ○ Puppet run is triggered by cloud-init, directly at boot
  ○ from boot to production ready: <5 minutes

see also : http://www.slideshare.net/NicolasBrousse/puppet-camp-paris-2015
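As an illustration of a Puppet run triggered by cloud-init at boot, the sketch below renders a minimal cloud-config user-data document whose `runcmd` section launches a one-shot Puppet agent run. The exact command the authors use is not given in the slides; the `puppet agent` invocation here is an assumption, and `render_user_data` is a hypothetical helper.

```python
import json


def render_user_data(commands: list) -> str:
    """Render a minimal cloud-config document with a runcmd section.

    Each command is emitted as a JSON string, which is also a valid
    double-quoted YAML scalar, so no YAML library is needed.
    """
    lines = ["#cloud-config", "runcmd:"]
    for command in commands:
        lines.append(f"  - {json.dumps(command)}")
    return "\n".join(lines) + "\n"


# Assumed command: a single foreground Puppet agent run at first boot.
user_data = render_user_data(
    ["puppet agent --onetime --no-daemonize --verbose"]
)
print(user_data)
```

The resulting text would be passed as user data when the instance is created, so configuration starts as soon as the VM boots rather than waiting for an operator.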

Page 20: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - Code Review

Page 21: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - Gerrit Integration

Gerrit, an industry standard : OpenStack, Eclipse, Google, Chromium, WikiMedia, LibreOffice, Spotify, GlusterFS, etc...

● Fine Grained Permissions Rules
● Plugged into LDAP
● Code Review per commit
● Stream Events
● Integrated with Jenkins, Jira and Hipchat
● Managing about 600 Git repositories

Page 22: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - Gerrit in Action

Automatic verify : -1 if the commit doesn’t pass Jenkins code validation

Page 23: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - The Workflow

Lab / QA

Prod cluster

Page 24: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - Continuous Delivery with Jenkins

Page 25: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - Team Awareness

Page 26: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Infrastructure As Code - Safe upgrade paths

Easy as 1-2-3:
1. Test your upgrades using Jenkins
2. Deploy the upgrade by pressing a single button*
3. Enjoy the rest of your day

* https://github.com/pgrandin/lcam

fig. 1 : N. Brousse, Sr. Director of Operation Engineering, switching our production workload to OpenStack

Page 27: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Get ready for production: Monitor everything

Page 28: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Monitor as much as you can ?

● Existing monitoring (Nagios, Graphite) still in use
● Specific checks for OpenStack
  ○ check component API : performance / availability / operability
  ○ check resources : ports, failed instances
● Monitoring capacity metrics for all hardware
● SNMP traps for network equipment
● Monitoring is just an extension of our existing monitoring in AWS
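A component API check of the kind listed above can be sketched as a Nagios-style plugin: time a probe call, then map the outcome to the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds and the `probe` callable are assumptions for illustration; the authors' actual checks are not shown in the slides.

```python
import time
from typing import Callable, Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def classify(elapsed_s: float, succeeded: bool,
             warn_s: float = 1.0, crit_s: float = 5.0) -> Tuple[int, str]:
    """Map availability and response time to a Nagios status and message."""
    if not succeeded:
        return CRITICAL, "CRITICAL - API call failed"
    if elapsed_s >= crit_s:
        return CRITICAL, f"CRITICAL - API answered in {elapsed_s:.2f}s"
    if elapsed_s >= warn_s:
        return WARNING, f"WARNING - API answered in {elapsed_s:.2f}s"
    return OK, f"OK - API answered in {elapsed_s:.2f}s"


def check(probe: Callable[[], object],
          warn_s: float = 1.0, crit_s: float = 5.0) -> Tuple[int, str]:
    """Run the probe (for example an authenticated API list call) and time it."""
    start = time.monotonic()
    try:
        probe()
        succeeded = True
    except Exception:
        succeeded = False
    return classify(time.monotonic() - start, succeeded, warn_s, crit_s)


status, message = check(lambda: None)  # stand-in probe that always succeeds
print(status, message)
```

In a real plugin the script would print the message and exit with the status code, and the probe would call the component's API endpoint.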

Page 29: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Monitoring auto-discovery

● New OpenStack node is automatically monitored
  ○ automatically / upon request
  ○ Nagios detects new hosts (API query)
  ○ Nagios applies component-related checks by role
  ○ graphing is also automatically updated
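One way to picture the auto-discovery loop: take the host list returned by an inventory or API query and render Nagios host objects, using the node's role as its hostgroup so that role-specific checks attach automatically. The inventory format, the role names, and the `generic-host` template name are assumptions for this sketch.

```python
HOST_TEMPLATE = """define host {{
    use         {template}
    host_name   {name}
    address     {address}
    hostgroups  {role}
}}
"""


def render_hosts(inventory: list, template: str = "generic-host") -> str:
    """Render one Nagios host definition per discovered node, sorted by name."""
    blocks = [
        HOST_TEMPLATE.format(
            template=template,
            name=node["name"],
            address=node["address"],
            role=node["role"],
        )
        for node in sorted(inventory, key=lambda n: n["name"])
    ]
    return "\n".join(blocks)


# Hypothetical result of the API query mentioned on the slide.
discovered = [
    {"name": "network01", "address": "10.0.4.20", "role": "openstack-network"},
    {"name": "compute01", "address": "10.0.4.11", "role": "openstack-compute"},
]

print(render_hosts(discovered))
```

Services defined against the `openstack-compute` or `openstack-network` hostgroups would then apply to new nodes without any per-host configuration.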

Page 30: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Centralized monitoring

Page 31: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Monitoring is graphing

Page 32: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

A look in the rearview mirror

Page 33: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Benefits - Transparency / visibility

Discover new odd/unexpected traffic/activity patterns

Page 34: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Benefits - Tailored Instances

Before: m3.xlarge + 2GB RAM? m3.2xlarge!

After: # nova flavor-create rtb.collector rtb.collector 17408 8 2
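The RAM value in the flavor can be reconstructed from the slide's own wording. An m3.xlarge offers 15 GiB of memory (figure recalled from AWS instance specifications, not stated in the slides), so "m3.xlarge + 2GB" is 17 GiB, or 17408 MiB. The helper below is a hypothetical sketch that builds the same command line using the positional order name, ID, RAM in MiB, disk in GB, vCPUs.

```python
MIB_PER_GIB = 1024


def flavor_create_cmd(name: str, ram_gib: int, disk_gb: int, vcpus: int) -> str:
    """Build a `nova flavor-create` command, reusing the name as flavor ID."""
    ram_mib = ram_gib * MIB_PER_GIB
    return f"nova flavor-create {name} {name} {ram_mib} {disk_gb} {vcpus}"


M3_XLARGE_RAM_GIB = 15  # assumption: AWS m3.xlarge memory size
EXTRA_RAM_GIB = 2       # the "+ 2GB RAM" the workload needs

print(flavor_create_cmd("rtb.collector", M3_XLARGE_RAM_GIB + EXTRA_RAM_GIB, 8, 2))
```

On AWS the same requirement forces a jump to the next instance size; with custom flavors the instance is sized to the workload instead.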

Page 35: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Benefits - Operational Transparency

OpenStack

# cerveza -m noc -- --zone tm-sjc-1a --start demo01

AWS

# cerveza -m noc -- --zone us-east-1a --start demo01
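The two commands differ only in the zone name, which suggests the tool picks the backend from the zone. How cerveza does this internally is not shown in the slides; the sketch below is a guess at the idea, assuming private zones share a `tm-` prefix and that each backend exposes a start function (both hypothetical).

```python
from typing import Callable, Dict


def provider_for_zone(zone: str, private_prefix: str = "tm-") -> str:
    """Pick a backend from the zone name (the prefix rule is an assumption)."""
    return "openstack" if zone.startswith(private_prefix) else "aws"


def start_instance(zone: str, hostname: str,
                   backends: Dict[str, Callable[[str, str], str]]) -> str:
    """Dispatch the same operator request to whichever cloud owns the zone."""
    return backends[provider_for_zone(zone)](zone, hostname)


# Stand-in backends; real ones would call the OpenStack and AWS APIs.
backends = {
    "openstack": lambda zone, host: f"openstack: started {host} in {zone}",
    "aws": lambda zone, host: f"aws: started {host} in {zone}",
}

print(start_instance("tm-sjc-1a", "demo01", backends))
print(start_instance("us-east-1a", "demo01", backends))
```

Keeping one command-line interface across both clouds means operators do not need to know where a zone lives to start a node in it.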

Page 36: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Benefits - Efficiency

Before After

Page 37: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Benefits - Efficiency

1+ million rx packets/s on only 2 Haproxy Load Balancers, full SSL

Page 38: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

What does not fit?

Downscaling does not really make sense for us: CPUs are online and paid for, so we should use them

Upscaling has its limits : AWS is refreshing instance types every year …

Sometimes a small added feature can have a huge impact on load.

It makes sense to keep the elastic workloads (machine learning, ...) in AWS

Page 39: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

What we’ve learnt

● We can be “double hybrids” (AWS + OpenStack + HAProxy bare metal)
● Dev environment is needed for OpenStack (new versions / break things)
● Storage is still a big issue due to our volume (1.6 EB)
● Some stuff may stay “forever” on AWS ?
● More dev/ops communication
● OpenStack is flexible
● No need for HA everywhere
● Spikes can be offloaded on AWS (cloud bursting)

Page 40: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Still a lot left to do

Technical aspect
● Need to migrate other AWS Regions
● Gain more experience
● Version upgrades
● Continue to adapt our tooling
● Add more alarms for capacity issues
● Different Regions, different issues ?

Human aspect
● Dev team still thinks in the AWS world (and sometimes OPS too…)

Page 41: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Aftermath

- Ad serving in production since 2015-05
- Bidding traffic in production since 2015-09
- 100% uptime since pre-production (2015-03)

Cost of operation for our current production workload:
- Reduced by a factor of two, including OpEx cost!

Page 42: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Questions?

Page 43: Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

Pierre Gohon
Pierre Grandin

@pierregohon
@p_grandin