Riding the N Train: How We Dismantled Groupon's Ruby on Rails Monolith
DESCRIPTION
This is a story about how Groupon's business was changing and our technology couldn't keep up. We rewrote the website using Node.js and changed our company and culture in the process.
TRANSCRIPT
Riding the N(ode) Train: Dismantling the Monoliths
Tuesday, December 3, 2013
Sean McCullough – Engineer at Groupon @mcculloughsean
Part I
Broken Architecture and a Changing Business
Business in Early 2012
Architecture in 2012
Leading the Mobile Commerce Revolution
[Chart: Mobile Transaction Mix, monthly, January 2011 to September 2013 (% of transactions); y-axis 0% to 100%]
Product Engineering was Stuck
We couldn’t build features fast enough
We wanted to build features world-wide
Mobile and Web weren’t at feature parity
Part II
The Rewrite
Should ...
• be built on APIs for consistent contract with mobile
• make it easy to hire developers
• allow for teams to work at their own pace
• allow teams to deploy their own code
• allow for global design changes
• have out-of-the-box I18n/L10n support
• be optimized for our read-heavy traffic pattern
• be small
How do we…?
• Deploy
• Authorize Users
• Share Sessions
• Route to different applications
• Manage distributed ops
• QA the whole site
We Tried This Before and Failed
• Rolled out a new site design in our monolith
• Too many things changed all at once
• Hard to evaluate performance of each feature
New Platform Evaluation
We evaluated:
• Node
• MRI Ruby/Rails, MRI Ruby/Sinatra
• JRuby/Rails, JRuby/Sinatra
• MRI Ruby + Sinatra + EM (EventMachine)
• Java/Play, Java/Vert.x
• Python+Twisted
• PHP
Why Node?
• Vibrant community
• NPM!
• Easy to hire JavaScript developers
• Met our minimum viable performance characteristics
• Easy scaling (process model)
The First App
Growing Pains
Poking Holes in our Infrastructure
• Longevity Test over two days
• Try to root out memory leaks
• Talking only to non-production systems
Poking Holes in our Infrastructure
Within 2 hours we had a major site outage
Poking Holes in our Infrastructure
• SSL termination on our hardware load balancer caused CPU to max out at 100%
• Production systems were using same LB as test and development systems
Lessons Learned
• You will run into problems with Node
• You will find problems with your infrastructure
• Don’t panic!
The Second App
• Looking for the next page
• Chose the “Browse” page
• Recently Built
• Built using mostly Backbone
• Experienced team of JS developers
The Second App
New Problems:
• User authentication
• More service calls
• Complicated routing
• More traffic
• Needed to share look and feel
The Second App
• Cultural problems
• Change of workflow
• Feedback loop fell apart
3 rewrites
6 months to launch
Shared Layout
Maintain consistent look and feel across site:
• Distribute layout as library
• Use ESIs (Edge Side Includes) for top/bottom of page
• Apps are called through a “chrome service”
• Fetch templates from service
Groupon Interface Guidelines
Layout Service
• Uses semantic versioning
• Roll forward with bug fixes
• Stay locked on a specific version
• Enable Site-Wide Experiments
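The slides do not show how apps request a layout version, so the following is only a minimal sketch of how "roll forward with bug fixes" and "stay locked on a specific version" could both fall out of semantic versioning. The function names (`parseVersion`, `resolveLayoutVersion`), the `1.4.x` request syntax, and the version numbers are all hypothetical and are not taken from Groupon's Layout Service.

```typescript
// Hypothetical sketch: resolving which published layout version an app receives.
interface Version {
  major: number;
  minor: number;
  patch: number;
}

// Parses a strict MAJOR.MINOR.PATCH string such as "1.4.2".
function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (match === null) {
    throw new Error(`not a MAJOR.MINOR.PATCH version: ${text}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// `requested` is either an exact version ("1.4.0", stay locked) or a
// patch wildcard ("1.4.x", roll forward with bug fixes). This request
// syntax is an assumption made for illustration.
function resolveLayoutVersion(requested: string, available: string[]): string | null {
  const match = /^(\d+)\.(\d+)\.(\d+|x)$/.exec(requested);
  if (match === null) {
    throw new Error(`unsupported version request: ${requested}`);
  }
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patchText = match[3];

  let best: Version | null = null;
  for (const text of available) {
    const candidate = parseVersion(text);
    if (candidate.major !== major || candidate.minor !== minor) continue;
    // An exact request only accepts the matching patch release.
    if (patchText !== "x" && candidate.patch !== Number(patchText)) continue;
    // A wildcard request keeps the highest patch release seen so far.
    if (best === null || candidate.patch > best.patch) best = candidate;
  }
  return best === null ? null : `${best.major}.${best.minor}.${best.patch}`;
}
```

With published versions `1.4.0`, `1.4.1`, `1.4.2`, and `1.5.0`, a request for `1.4.x` resolves to `1.4.2`, a request for `1.4.0` stays on `1.4.0`, and a request for `2.0.x` resolves to nothing. Under this scheme an app picks up bug fixes without redeploying, while minor and major changes still require an explicit opt-in.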
Routing Service
The Big Push… or There’s No Going Back
• Decided to get the whole company to move at once
• Supporting two platforms is hard – Rip off the band aid!
• End of June 2012 - move to I-Tier by September 1st
The Big Push… or There’s No Going Back
• ~150 developers
• Global effort
• Feature freeze – A/B testing against mostly the same features
Part III
It Worked!
95% Consumer Traffic On Node
Sustained US Traffic Over 120k RPM
Our Pages Got Faster
Success?
• Moving to a new platform is not a straight line
• Solving for old problems
• Solving for new problems
• Culture shift
Next Steps
• Streaming responses for better performance
• Better resiliency to outages… circuit breakers, brownouts
• Distributed Tracing
• International
• Open Source
New I-Tier apps as we build new teams, products, ideas.
Latest technologies to help us drive our business.
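The deck lists circuit breakers under Next Steps without describing a design, so this is a generic sketch of the pattern rather than Groupon's implementation. The class name `CircuitBreaker`, the thresholds, and the fallback value are hypothetical. The wrapped call is synchronous to keep the example short, whereas a real service call in Node would be asynchronous.

```typescript
// Hypothetical sketch of the circuit-breaker pattern (not Groupon's code).
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  // `now` is injectable so the timing logic can be exercised without waiting.
  constructor(
    failureThreshold: number,
    resetAfterMs: number,
    now: () => number = () => Date.now()
  ) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // An open breaker becomes half-open once the reset interval has elapsed.
  currentState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Runs `call` unless the breaker is open. Returns `fallback` when the
  // breaker is open or when `call` throws.
  run<T>(call: () => T, fallback: T): T {
    if (this.currentState() === "open") {
      return fallback; // Fail fast: do not touch the struggling service.
    }
    try {
      const result = call();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call in half-open state reopens the breaker at once.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback;
    }
  }
}
```

A page could wrap each backend call as `breaker.run(() => fetchDeals(), cachedDeals)`, where `fetchDeals` and `cachedDeals` are placeholders. After repeated failures the page renders from the fallback instead of waiting on a service that is already down, and a single trial call after the reset interval decides whether normal traffic resumes.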
Q&A