(Kishore Jalleda) Launching products at massive scale - the DevOps way

Launching products at massive scale: The DevOps way. Velocity, Amsterdam, 2016.

Upload: kjalleda

Post on 21-Feb-2017

TRANSCRIPT

Page 1: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Launching products at massive scale: The DevOps way

Velocity, Amsterdam, 2016.

Page 2: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Who are we? Kishore Jalleda

Senior Director, Production Engineering, Yahoo!

Gopal Mor

Software Architect, Yahoo!

Page 3: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Yahoo Scale

● 1+ Billion MAUs

● 6+ major data centers in strategic locations around the world

● 50+ edge PODs

● 400,000+ servers

Yahoo! News

Yahoo! Sports

Yahoo! Finance

Yahoo! Fantasy

......

Page 4: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Product redesigns at scale are non-trivial

Page 5: (Kishore Jalleda) Launching products at massive scale - the DevOps way

We take feedback seriously!

Page 6: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Make sure to run a ton of experiments.

Page 7: (Kishore Jalleda) Launching products at massive scale - the DevOps way

● 100+ iterations / experiments at any given time - almost every user is in some sort of an experiment

● Validating metrics is not easy when you are dealing with a billion users - need to make the right decisions for the user.

● Should not cannibalize other services like search and mail

Page 8: (Kishore Jalleda) Launching products at massive scale - the DevOps way

And… there is “DevOps”

Page 9: (Kishore Jalleda) Launching products at massive scale - the DevOps way

What is DevOps?

“DevOps is about eliminating Technical, Process and Cultural barriers between Idea and Execution -- using Software”

-Kishore Jalleda

Page 10: (Kishore Jalleda) Launching products at massive scale - the DevOps way

The DevOps Way

(How)

Page 11: (Kishore Jalleda) Launching products at massive scale - the DevOps way

The DevOps Way

People Process Tech

DevOps

Culture Tools Process

Page 12: (Kishore Jalleda) Launching products at massive scale - the DevOps way

The DevOps Way

People Process Tech

DevOps

Culture Tools Process

Ownership Excellence

Enable

Page 13: (Kishore Jalleda) Launching products at massive scale - the DevOps way

The DevOps Way

People Process Tech

DevOps

Culture Tools Process

Agile Automated

Engineer

Page 14: (Kishore Jalleda) Launching products at massive scale - the DevOps way

The DevOps Way

People Process Tech

DevOps

Culture Tools Process

(Re)Usable Self-Serve

Develop

Page 15: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Functional Pillars

(What)

Page 16: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Functional Pillars

DevOps

Deliver … products to market quickly

Prevent … defects from reaching customers

Repair … production issues quickly

Page 17: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Functional Pillars

DevOps

Deliver

goal: sustainable velocity

metric: velocity (time to market)

use cases: provision, code, ship

strategy: easy CD, cloudify, platformize

culture & process: Agile, CD practices

tools: CD pipelines, Cloud, Dev Tools

Page 18: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Functional Pillars

DevOps

Prevent

goal: prevent defects from reaching users

metric: quality

use cases: testing axes: func, perf, resilience, scale...

strategy: self-serve tools, expert services

culture & process: test coverage, CD & launch gates

tools: Disruptive Testing, Metrics based promotion...

Page 19: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Functional Pillars

DevOps

Repair

goal: fix issues fast

metric: TT(R)

use cases: detect, decide / diagnose, alert / remediate

strategy: directed alerting, auto-remediation ...

culture & process: Directed Alerting, postmortems, user feedback ...

tools: Monitoring, Auto Remediation, Product Health Dashboards...

Page 20: (Kishore Jalleda) Launching products at massive scale - the DevOps way

In Summary...

Page 21: (Kishore Jalleda) Launching products at massive scale - the DevOps way

In Summary...

Enable a Culture of Ownership & Excellence

Engineer Agile & Automated Processes

Develop (Re)Usable & Self-Serve Tools

to kick ass at…

Delivery, Prevention, Repair

Page 22: (Kishore Jalleda) Launching products at massive scale - the DevOps way

(Product) Resilience

Resilience is critical to launching and operating products at a massive scale!

Let’s talk about it in detail!

Page 23: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Resilience at Yahoo Homepage and Media sites

Page 24: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Yahoo Homepage (www.yahoo.com)

● Among the top 3 destinations on the internet

● Personalized content

● Available in 22 international editions

● Page consists of multiple modules

● 99.999% availability

Page 25: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Availability Challenge

[Figure: many subsystems/layers behind the user agent]

Hard to guarantee availability and latency in a ...

● Distributed multilayer architecture

● 100s of subsystems

● Complex request flow

● Change is the only constant

Page 26: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Availability Challenge

[Figure: user agent in front of seven subsystems with availabilities 99.91%, 99.95%, 99.90%, 99.95%, 99.97%, 99.91%, 99.95%]

In this hypothetical example ...

● Each subsystem is highly available

● But combined system availability = 99.50%

● Downtime per year = 1 day, 19 hours, 49 min

The number against each box in the figure above is the availability of the individual subsystem.

Page 28: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Availability Challenge

[Figure: user agent in front of seven subsystems with availabilities 99.91%, 99.95%, 99.90%, 99.95%, 99.97%, 99.91%, 99.95%]

In this hypothetical example ...

● Each subsystem is highly available

● But combined system availability = 99.50%

● Downtime per year = 1 day, 19 hours, 49 min

Combined system is weaker than the weakest subsystem.

The number against each box in the figure above is the availability of the individual subsystem.
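
The arithmetic behind this hypothetical example can be checked with a short sketch. It assumes the seven subsystems sit in series and fail independently, so their availabilities multiply. The function names and the 365.25-day year are illustrative assumptions, not taken from the slides. Multiplying the seven listed values gives roughly 99.54%, in line with the approximately 99.50% quoted above and below the weakest single subsystem (99.90%).

```python
from math import prod

# Availabilities of the seven subsystems in the hypothetical figure.
SUBSYSTEMS = [0.9991, 0.9995, 0.9990, 0.9995, 0.9997, 0.9991, 0.9995]


def combined_availability(parts):
    """Availability of a serial chain, assuming independent failures."""
    return prod(parts)


def downtime_per_year(availability, days_per_year=365.25):
    """Expected yearly downtime as (days, hours, minutes), rounded down."""
    minutes = int((1 - availability) * days_per_year * 24 * 60)
    days, rest = divmod(minutes, 24 * 60)
    hours, mins = divmod(rest, 60)
    return days, hours, mins


combined = combined_availability(SUBSYSTEMS)
print(f"combined availability: {combined:.4%}")
print("weakest subsystem:", f"{min(SUBSYSTEMS):.2%}")
print("downtime at 99.50% (days, hours, minutes):", downtime_per_year(0.9950))
```

Every subsystem added to the request path multiplies in another factor below 1, which is why the combined figure can only go down as the stack grows.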

Page 29: (Kishore Jalleda) Launching products at massive scale - the DevOps way

How we ensure high availability

Four layers of resiliency in serving stack

1. Speculative Retry

2. Per module fallback

3. Fullpage failsafe

4. Failwhale Be-Right-Back page
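
As a rough illustration of how the four layers stack, the sketch below collapses them into one function. In the stack described in the talk these layers live in different tiers (the last one at the edge), so this is only a mental model. `render_page`, `fetch`, `module_cache`, `failsafe_page` and `BRB_PAGE` are hypothetical names, and the rule "a module with no cached copy degrades the whole page" is an assumption made for the example.

```python
BRB_PAGE = "<h1>We'll be right back</h1>"  # hypothetical static last-resort page


def render_page(modules, fetch, module_cache, failsafe_page=None):
    """Walk the layers from least degraded to most degraded."""
    parts = []
    for name in modules:
        try:
            # Layer 1: fetch() is assumed to do its own speculative retry.
            parts.append(fetch(name))
        except Exception:
            # Layer 2: fall back to cached, non-personalized module content.
            cached = module_cache.get(name)
            if cached is None:
                # Layer 3: cached full page; layer 4: be-right-back page.
                return failsafe_page if failsafe_page is not None else BRB_PAGE
            parts.append(cached)
    return "\n".join(parts)
```

Each layer is more degraded than the one before it, so the ordering is from best user experience to worst.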

Page 30: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Speculative Retry

● Trigger a retry when latency is higher than threshold

● High success rate for retry due to low latency at p95

● Addresses long tail latency and intermittent failures

[Figure: long-tail latency]

Page 31: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Speculative Retry

Not drawn to scale

Page 32: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Speculative vs. Backup

Speculative Retry Request

● Retry only when needed

● Need extra servers based on max retry rate

Backup Request

● Always send a backup request

● Need twice the number of servers

● Need twice the network resources
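
The capacity difference can be made concrete with a back-of-the-envelope sketch. It assumes backend capacity scales linearly with request volume; the function names and the example numbers (1,000 servers, 5% max retry rate) are hypothetical and not from the talk.

```python
import math


def speculative_capacity(base_servers, max_retry_rate):
    """Servers needed when at most max_retry_rate of requests are retried."""
    return base_servers + math.ceil(base_servers * max_retry_rate)


def backup_capacity(base_servers):
    """Servers needed when every request is always sent twice."""
    return 2 * base_servers


base = 1000
print("speculative retry:", speculative_capacity(base, 0.05))
print("backup request:   ", backup_capacity(base))
```

With a low retry cap, speculative retry needs only a small amount of headroom, whereas always-on backup requests double the fleet regardless of how often the second request is actually useful.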

Page 33: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Speculative Retry

Few more considerations

● Useful for idempotent requests only

● Define max retry rate

● Prefer new connection for retry

● Track retry requests

● Use feature flag to turn on/off
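
A minimal sketch of the idea, using Python's `asyncio`, is shown below. It is not the Yahoo implementation. `fetch` is a hypothetical zero-argument coroutine function that is assumed to be idempotent and to open a new connection on every call; `RetryBudget` and the `enabled` flag are illustrative stand-ins for the max retry rate, retry tracking and feature flag listed above.

```python
import asyncio


class RetryBudget:
    """Tracks retries and caps them at a fraction of all requests seen."""

    def __init__(self, max_retry_rate):
        self.max_retry_rate = max_retry_rate
        self.requests = 0
        self.retries = 0

    def allow_retry(self):
        return self.retries < self.max_retry_rate * self.requests


async def speculative_fetch(fetch, threshold_s, budget, enabled=True):
    """Send one request; if it is slower than threshold_s, race a second one."""
    budget.requests += 1
    first = asyncio.ensure_future(fetch())
    done, _ = await asyncio.wait({first}, timeout=threshold_s)
    if done or not enabled or not budget.allow_retry():
        # Fast enough, feature flag off, or retry budget exhausted.
        return await first

    budget.retries += 1
    second = asyncio.ensure_future(fetch())  # assumed to use a new connection
    pending = {first, second}
    error = None
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.exception() is None:
                for other in pending:
                    other.cancel()  # the loser is no longer needed
                return task.result()
            error = task.exception()
    raise error  # both attempts failed
```

The first successful response wins and the other attempt is cancelled. Because the original request is left running while the retry is in flight, the same operation may execute twice on the backend, which is why this only suits idempotent requests.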

Page 34: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Speculative Retry - Results

[Chart: per-module fallback rate (%) and speculative retry rate (% of total traffic)]

Speculative retry helps reduce the fallback rate by a big margin.

Page 35: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Per Module Fallback

● Serve cached content for failed module

● Non-personalized content

● Addresses prolonged failure of subsystem(s)

Parts (modules) served from cache.

Page 36: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Per Module Fallback

● Not possible for cases like

○ Real time data (Example - sports scores)

○ Personal info (Example - stock tickers)

Page 37: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Per Module Fallback

Non personalized cache, for each module, is always available on frontend servers

Populate non-personalized cache on Frontend servers

Page 38: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Per Module Fallback

Make sure that ...

● Cache is always fresh

● Strong validation needed on cache data

● Check for backward compatibility if TTL is high
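
One way to picture these three checks is a small guard that runs before a cached module is served. This is a sketch under assumptions: the entry layout (`stored_at`, `html`, `schema_version`), the 300-second age limit and the supported version numbers are all hypothetical and not from the talk.

```python
def is_usable(entry, now, max_age_s=300, supported_versions=(2, 3)):
    """Decide whether a cached module entry is safe to serve as a fallback."""
    if entry is None:
        return False
    # Freshness: reject entries older than the allowed age.
    if now - entry.get("stored_at", 0) > max_age_s:
        return False
    # Validation: the payload must be a non-empty string.
    html = entry.get("html")
    if not isinstance(html, str) or not html.strip():
        return False
    # Backward compatibility: with a high TTL, an entry may have been
    # written in a format the current frontend no longer understands.
    return entry.get("schema_version") in supported_versions
```

A fallback that serves stale or malformed content is arguably worse than no fallback, so the guard fails closed on any doubt.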

Page 39: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Per Module Fallback - Results

● It is a degraded experience

● Keep it as low as possible

[Chart: per-module fallback rate (%) and speculative retry rate (%)]

Page 40: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Fullpage Failsafe

● Cache entire page

● Non-personalized

● No ads

● Min interactions

● Used when page cannot be served

Entire page served from cache.

Page 41: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Fullpage Failsafe

No single point of failure between serving stack and failsafe stack

Page 42: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Fullpage Failsafe

● Using autoscale on AWS

● Automatic or manual switch

● Fine control on amount or type of traffic

● Helpful during unprecedented traffic spike

● Monitor cache freshness, failsafe traffic
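
The "amount of traffic" control could look something like the sketch below, which deterministically sends a configurable percentage of requests to the failsafe stack. It is an assumption-laden illustration: `route`, `failsafe_percent` and `manual_override` are hypothetical names, hashing a request ID is just one way to pick a stable bucket, and selection by traffic type is not covered.

```python
import hashlib


def route(request_id, failsafe_percent, manual_override=False):
    """Return "failsafe" or "primary" for a request."""
    if manual_override:
        # Manual switch: send everything to the failsafe stack.
        return "failsafe"
    # Hash the ID into a stable bucket in the range 0..99.
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "failsafe" if bucket < failsafe_percent else "primary"
```

An automatic switch would adjust `failsafe_percent` from health signals, while the manual override forces all traffic over in one step.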

Page 43: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Failwhale

Looks familiar?

Page 44: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Failwhale

● Last resort when everything fails

● All hands on deck situation

● This page is served from edge

Page 45: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Summary

1. Analyze entire range of failure types

2. Understand their rate and impact level

3. Holistic plan to cover all failure types

4. Fire drills - Test, Test, Test

Remember that Murphy’s law is not on our side.

Anything that can go wrong, will go wrong.

Page 46: (Kishore Jalleda) Launching products at massive scale - the DevOps way

Thank you!

CREDITS

Shay Holmes

Rashmi Tenginka

Santosh Mandi

Pushkar Sachdeva

Dreux Ludovic

Sandeep Davu

Karthikeyan Thangaraj

Phil Hayward

Natarajan Kannan