LinkedIn SRE: From Inception to Global Scale, by Bruno Connelly & Viji Nair


Posted on 17-Jul-2020


Page 1: LinkedIn SRE: From Inception to Global Scale · 24/7 deployments move to service oriented architecture dev driven deployments replace java serialized objects over RPC with REST APIs

LinkedIn SRE: From Inception to Global Scale

Bruno Connelly & Viji Nair

Page 2

About Us

Bruno Connelly, VP, Engineering

Joined LinkedIn in April 2010

Background in ISP and Consumer Internet

First experience building teams in Asia in 2005

Page 3

About Us

Viji Nair, Director, SRE

Joined LinkedIn in January 2012

16 Years as an SRE + Startups

Open source evangelist and contributor

Working with global teams since 2006

Page 4

OUR VISION

Create economic opportunity for every member of the global workforce

Page 5

ECONOMIC GRAPH

MEMBERS, COMPANIES, JOBS, SKILLS, SCHOOLS, KNOWLEDGE

Page 6

1M+ MEMBERS

80K+ COMPANIES

15K+ JOBS

22K SKILLS

8K HIGHER EDU ORGS

12M+ CONVERSATIONS

Page 7

Growing Global Network

500M+ members

100K articles published weekly

40% yr/yr increase in engaged feed sessions weekly

2 new sign-ups per second

50% of active members use LinkedIn Messaging weekly

100M+ monthly unique visitors

Page 8

Numbers Behind the Scenes

340K and 770K: Graph QPS and Edge QPS

60B Graph Edges

1.5K Services in production

4T Kafka messages consumed/day

2T Kafka messages published/day

600TB Data storage

12M Peak Data QPS

Page 9

Edge POP Footprint

15 Edge Locations: Hong Kong, Singapore, Sydney, Mumbai, Seattle, San Jose, Los Angeles, Dallas, Chicago, Miami, Ashburn, São Paulo, Dublin, London, Frankfurt

Page 10

Production Application Footprint

4 Active Data Centers: Oregon, Texas, Virginia, Singapore

Page 11

Engineering Footprint

San Francisco, New York, Sunnyvale, Bangalore

Page 12

Looking back: 2010...

Page 13

Entire Production Footprint (2010)

2 Data Centers: Chicago, Los Angeles

1 Active Data Center

Page 14

LinkedIn Operations (2010)

● Classical, stratified model: Systems, Networks, Applications, DBA

● Heavyweight processes driven by tickets and heroes

● Culture of not trusting developers in any deployed environment

● Huge wall and growing frustration between Dev and Ops teams (and within Ops itself)

● 7 engineers in total made up NOC, SRE, and Release Operations: “Site Operations”

● On-call was horrible

Page 15

Member Growth

[Chart: member count (0 to 500,000,000) by year, 2003 to 2017. Annotations: “7 Years of Tech Debt”, “We were here”, “32% YOY Growth”.]

Page 16

Is the Site Up? (2010)

● Peak traffic periods: Mon-Wed, ~6-10am

● Regular capacity-related outages: Mon-Wed, ~6-10am

● Zero tolerance for failure in the application stack

● Near-zero instrumentation

● Bi-weekly maintenance downtimes

Page 17

Let’s make a few changes

● change software development model

● active/active serving model

● cheaper datacenters

● remove monolithic databases

● graceful degradation

● remove hardware load balancers

● more data centers

● move to service oriented architecture

● 24/7 deployments

● dev driven deployments

● replace java serialized objects over RPC with REST APIs

● modernize our application stack

● move faster

● self service everything

● code contributions to the main application stack

● 3x3 deployments

● auto escalation

● auto remediation

● automated datacenter buildout

Page 18

Core Principles

1. Site Up

2. Empower Developer Ownership

3. Operations is an Engineering Problem

Page 19

Everyone should be able to deploy code [safely]

Page 20

Self-service Deployments

Five steps:

● Promote to a single production data center

● “Canary” to a single production instance

● EKG: automated metrics-based validation

● Ramp features slowly to the member base

● Promote to remaining production data centers

15K+ Successful commits/day

200+ Code promotions/day

600+ Feature ramps/day
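
The canary, metrics-based validation, and feature-ramp steps on this slide could look roughly like the sketch below. This is an illustration only, not LinkedIn's EKG or ramp tooling: the function names, the 10% regression threshold, the "lower is better" assumption for every metric, and the hash-based bucketing are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from statistics import mean


@dataclass
class CanaryVerdict:
    metric: str
    baseline: float
    canary: float
    passed: bool


def validate_canary(baseline, canary, max_regression=0.10):
    """Compare canary metric samples against baseline samples.

    Hypothetical rule: every metric is "lower is better" (latency,
    error rate) and may be at most max_regression worse than baseline.
    """
    verdicts = []
    for metric, base_samples in baseline.items():
        base_avg = mean(base_samples)
        canary_avg = mean(canary[metric])
        limit = base_avg * (1 + max_regression)
        verdicts.append(
            CanaryVerdict(metric, base_avg, canary_avg, canary_avg <= limit)
        )
    return verdicts


def may_promote(verdicts):
    """Promotion is allowed only if every metric passed validation."""
    return all(v.passed for v in verdicts)


def in_ramp(member_id, feature, percent):
    """Deterministically place a member in one of 100 buckets per feature.

    A member who is in the ramp at 10% stays in it at 50%, so a ramp
    can be widened gradually without flipping anyone back and forth.
    """
    digest = hashlib.sha256(f"{feature}:{member_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

A deployment tool built this way would call `validate_canary` after the single-instance canary has served traffic for a while, and continue to the next promotion step only when `may_promote` returns `True`.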

Page 21

Create a culture of operational metrics

“What gets measured gets fixed.”

Page 22

Self-service Instrumentation and Monitoring

Diagram components: java applications, non-java applications, REST API, metrics collectors, metrics api, alerting, visualization, IRIS

23K Graph dashboards

10M Metrics ingested/sec

340K Alerts processed/min

600M+ Total metrics
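
As a rough illustration of the alerting stage that sits behind the metrics collectors, the sketch below fires only after several consecutive samples breach a threshold, which avoids paging on a single spike. The class name, the threshold, and the "three consecutive samples" rule are hypothetical; the slides do not describe how LinkedIn's alert rules are actually evaluated.

```python
from collections import deque


class ThresholdAlert:
    """Fire when the last `consecutive` samples all exceed `threshold`."""

    def __init__(self, threshold, consecutive=3):
        self.threshold = threshold
        # Keep only the most recent `consecutive` breach flags.
        self.window = deque(maxlen=consecutive)

    def observe(self, value):
        """Record one sample and return True if the alert should fire."""
        self.window.append(value > self.threshold)
        return len(self.window) == self.window.maxlen and all(self.window)
```

For example, with `ThresholdAlert(500, consecutive=3)`, a latency series that dips below 500 once resets the run, and the alert fires only when three breaching samples arrive in a row.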

Page 23

We don’t want a traditional NOC [permanently]

Page 24

Self-service Remediation and Escalation

Diagram components: Alerts, Deployment, Metrics, Correlation Engine, Nurse, Salt, Notify (IRIS, JIRA, etc.), Feedback

15K Remediation Plans

9K Escalation Plans

17K Executions/day

Page 25

Unexpected outcomes from a different model

● Tiered escalation systems did not scale and incentivized undesirable outcomes

● Introduced “Production SRE”; SREs solving NOC problems via software

● Created NOC to SRE transition program

[Diagram: Alerts → Auto-remediation → Fixed? (Y / N) → Escalation]
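
The alert flow on this slide (alerts, auto-remediation, a “Fixed?” check, escalation) can be sketched as a small dispatcher. This is a minimal sketch under stated assumptions: the `handle_alert` function, the alert dictionary shape, and the plan registry are hypothetical and are not the interfaces of Nurse or IRIS.

```python
from typing import Callable, Dict


def handle_alert(
    alert: dict,
    remediations: Dict[str, Callable[[dict], bool]],
    escalate: Callable[[dict], None],
) -> str:
    """Try an automated remediation first; escalate only if it does not fix the alert.

    `remediations` maps an alert name to a plan that returns True when
    the problem is fixed. A missing plan or a failing plan both fall
    through to escalation.
    """
    plan = remediations.get(alert["name"])
    if plan is not None:
        try:
            fixed = plan(alert)
        except Exception:
            # A crashing remediation plan counts as "not fixed".
            fixed = False
        if fixed:
            return "remediated"
    escalate(alert)
    return "escalated"
```

Keeping the escalation call behind the remediation attempt means humans are notified only for alerts that software could not resolve, which is the point of replacing tiered escalation with automation.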

Page 26

Scaling the team into India

Page 27

Member Growth

[Chart: member count (0 to 500,000,000) by year, 2003 to 2017. Annotations: “SRE Established”, “Bangalore Office Online”, “Bangalore SRE”.]

Page 28

Ground Rules

1. Follow the sun is not the goal

2. Start simple

3. Same culture, same hiring bar, same everything

Page 29

Armed w/ our principles & playbook...

404: Resources not found

We made mistakes and learned from them

Page 30

Lesson 1: Bootstrapping Remote Teams Is Hard

1. Applying US model did not work: invest in potential & new grads

2. Negative association of the “SRE” title: standardize SRE title

3. Preconceived notion of US-based companies: evangelize SRE role

Page 31

Lesson 1: Bootstrapping Remote Teams Is Hard

OUTCOMES

● Don’t expect hiring patterns from other markets to necessarily work

● Leverage the community to help; you’re likely not alone

● Leverage other sources: college grads, high-potential junior engineers

Page 32

Lesson 2: Create Your Own Identity

1. Mirroring teams does not work: invest based on value addition & local needs

2. Work patterns are different: focus on opportunity and strengths

3. Need engaging work to sustain teams: provide quality work for the teams

Projects: FINDER, DC MANAGER, ALERT CORRELATION

Page 33

Lesson 2: Create Your Own Identity

OUTCOMES

● Don’t map teams 1:1 across offices

● Find the right balance, quality vs quantity

● Use lulls in workload to give them hard problems they can own

Page 34

Lesson 3: One Size Does Not Fit All

1. Every team is unique: curate to specific team need

2. Equal ownership is critical to success: equal partnership in running a product

3. Right leadership is essential: alignment on leadership commitment & investment

Page 35

Lesson 3: One Size Does Not Fit All

OUTCOMES

● Focus on teams that have shown that they can collaborate

● Mistakes will happen; own them and fix them quickly

Page 36

Bangalore Today

Building region-specific products:

● Performance focused mobile client

● New college grad placements

● Filling India’s blue-collar labor gap

Strong team presence:

● 60+ SREs across 10 teams

● 100+ application developers

● Owns LinkedIn core messaging platform

Page 37

SRE Globally Today

SRE offices: SF (San Francisco), SNV (Sunnyvale), BLR (Bangalore), NYC (New York City)

Learnings from BLR led to other regional expansions

300+ SREs across four global offices

Page 38

Takeaways from our Journey

● Unique office identities leveraged for other engineering presences

● Directionally good ideas haven’t failed us yet

● It really does take a village

● This isn’t easy, but it is possible

Page 39

Thanks!