linkedin sre: from inception to global scale · 24/7 deployments move to service oriented...
TRANSCRIPT
SRE
Bruno Connelly
LinkedIn SRE:
From Inception to Global Scale
Bruno Connelly & Viji Nair
About Us
Bruno Connelly VP, Engineering
Joined LinkedIn in April 2010
Background in ISP and Consumer Internet
Initial experience team building in Asia in 2005
About Us
Viji NairDirector, SRE
Joined LinkedIn in January 2012
16 Years as an SRE + Startups
Open source evangelist and contributor
Working with global teams since 2006
OUR VISION
Create economic opportunity for every member of the global workforce
ECONOMIC GRAPH
MEMBERS COMPANIES
JOBSSKILLS
SCHOOLS KNOWLEDGE
1M+ MEMBERS 80K+ COMPANIES
15K+ JOBS22K SKILLS
8K HIGHER EDU ORGS 12M+ CONVERSATIONS
Growing Global Network
500M+ 100KMembers Articles published weekly
40%yr/yr increase in engaged
feed sessions weekly
2 50%New sign-ups per second Active members use
Linkedin Messaging weekly
100M+Monthly Unique Visitors
Numbers Behind the Scenes
340K 770KGraph QPS
1.5KGraph EdgesEdge QPS
60BServices in production
4TKafka Messagesconsumed/day
600TB 2TKafka Messages published/day
12MData storage Peak Data QPS
Edge POP Footprint
15Edge Locations
HONG KONG
SINGAPORE
SYDNEY
MUMBAI
SEATTLE
SAN JOSE
LOS ANGELES DALLAS
CHICAGO
MIAMI
ASHBURN
SÃO PAULO
DUBLIN
LONDON
FRANKFURT
Production Application Footprint
OREGON
TEXAS
4Active Data Centers
VIRGINIA
SINGAPORE
Engineering Footprint
SAN FRANCISCO NEW YORK
SUNNYVALE BANGALORE
Looking back 2010 . . .
Entire Production Footprint
2010
1Active Data Center
CHICAGO
LOS ANGELES
2Data Centers
LinkedIn Operations
● Classical, stratified model: Systems, Networks, Applications, DBA
● Heavy-weight processes driven by tickets and heroes
● Culture of not trusting developers in any deployed environments
● Huge wall and growing frustration between Dev and Ops teams (and in ops itself)
● 7 engineers in total made up NOC, SRE, Release Operations: “Site Operations”
● On-call was horrible
2010
Member Growth
500,000,000
450,000,000
400,000,000
350,000,000
300,000,000
250,000,000
200,000,000
150,000,000
100,000,000
50,000,000
02003 2004 2005 20072006 2008 2009 2010 2011 2012 2013 2014 2015 2016
7 Years of Tech Debt
2017
We were here
32% YOY Growth
Is the Site Up?
● Peak traffic periods Mon-Wed ~ 6-10am
● Regular capacity related outages Mon-Wed ~ 6-10am
● Zero tolerance for failure in the application stack
● Near zero instrumentation
● Bi-weekly downtime maintenances
2010
Let’s make a few changeschange software development model
active/active serving model
cheaper datacenters
remove monolithic databasesgraceful degradation
remove hardware load balancers
more data centers
move to service oriented architecture24/7 deployments
dev driven deployments
replace java serialized objects over RPC with REST APIs
modernize our application stack
move faster
self service everything
code contributions to the main application stack3x3 deployments
auto escalation
auto remediation
automated datacenter buildout
Core Principles
Site Up Empower Developer Ownership
Operations is an Engineering Problem
1 2 3
Everyone should be able to deploy code[safely]
Self-service Deployments
Promote to a single production data center
“Canary” to a single production instance
EKG: automated metrics-based validation
Ramp features slowly to the member base
Promote to remaining production data centers
1
2
3
4
5
15K+Successful commits/day
Code promotions/day
200+
600+Feature ramps/day
Create a culture of operational metrics“What gets measured gets fixed.”
REST API
Self-service Instrumentation and Monitoring
java applications
non-java applications
metrics collectors
alerting visualization
metrics api
IRIS
23KGraph dashboards
10MMetrics ingested/sec
340KAlerts processed/min
600M+Total metrics
We don’t want a traditional NOC[permanently]
Correlation Engine
Self-service Remediation and Escalation
15KRemediation Plans
Escalation Plans
9K
17KExecutions/day
Alerts Salt
Deployment
Metrics
Notify(IRIS, JIRA, etc..)
FeedbackNurse
Unexpected outcomes from a different model
● Tiered escalation systems did not scale and incentivized undesirable outcomes
● Introduced “Production SRE”; SREs solving NOC problems via software
● Created NOC to SRE transition program
Escalation
Fixed?
N
Y
Alerts
Auto-remediation
Scaling the team into India
Member Growth
500,000,000
450,000,000
400,000,000
350,000,000
300,000,000
250,000,000
200,000,000
150,000,000
100,000,000
50,000,000
02003 2004 2005 20072006 2008 2009 2010 2011 2012 2013 2014 2015 2016
SRE Established
Bangalore Office Online
2017
Bangalore SRE
Ground Rules
Start simple
2
Follow the sun is not the goal
1
Same culture, same hiring bar, same everything
3
We made mistakes and learned from them
404
Armed w/ our principles & playbook...
Resources not found
Lesson 1: Bootstrapping Remote Teams Is Hard
Evangelize SRE role3
Pre conceived notion of US based companies
Standardize SRE title2 Negative association of the “SRE” title
Invest in potential & new grads
Applying US model did not work
1
OUTCOMES
Lesson 1: Bootstrapping Remote Teams Is Hard
Don’t expect hiring patterns from other markets to
necessarily work
Leverage the community to help; you’re likely not alone
Leverage other sources: College Grads, High-Potential
junior engineers
Lesson 2: Create Your Own Identity
3Need engaging work
to sustain teamsProvide quality work
for the teams
FINDERDC MANAGER Projects & & ALERT CORRELATION 2 Work patterns are
differentFocus on opportunity
and strengths
1 Invest based in value addition & local
needsMirroring teams does
not work
OUTCOMES
Lesson 2: Create Your Own Identity
Don’t map teams 1:1 across offices
Find the right balance, quality vs quantity
Use lulls in workload to give them hard problems they can
own
Lesson 3: One Size Does Not Fit All
33Alignment on
leadership commitment & investment
Right leadership is essential
1 Curate to specific team needEvery team is unique
2 Equal partnership in running a product
Equal ownership is critical to success
OUTCOMES
Lesson 3: One Size Does Not Fit All
Focus on teams that have shown that they can
collaborate
Mistakes will happen; own them and fix them quickly
Bangalore Today
Building region-specific products:
● Performance focused mobile client● New college grad placements● Filling India’s blue-collar labor gap
Strong team presence:
● 60+ SREs across 10 teams● 100+ application developers● Owns LinkedIn core messaging
platform
SFSan Francisco
SNVSunnyvale BLR Bangalore
NYC New York CitySRE
SRE Globally Today
Learnings from BLR led to other regional expansions
300+ SREs across four global offices
Takeaways from our Journey
Unique office identities
leveraged for other
engineering presencesDirectionally good
ideas haven’t failed us
yet
It really does take
a villageThis isn’t easy, but
it is possible
Thanks!