Slash n: Tech Talk Track 2 – Website Architecture: Mistakes & Learnings - Siddhartha Reddy
![Page 1: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/1.jpg)
Flipkart Website Architecture
Mistakes & Learnings
Siddhartha Reddy
Architect, Flipkart
![Page 2: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/2.jpg)
June 2007
![Page 3: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/3.jpg)
November 2007
![Page 4: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/4.jpg)
December 2012
![Page 5: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/5.jpg)
www.flipkart.com
• Started in 2007
• Current Architecture from mid 2010
• Evolution of the architecture presented as…
  Issue[1] -> RCA[2] -> Actions -> Learnings
• [1] Issue: Website is “slow”
• [2] RCA = Root Cause Analysis
![Page 6: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/6.jpg)
INFANCY (2007 – MID-2010)
Surviving & reacting to the environment
![Page 7: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/7.jpg)
Website is “slow”!
![Page 8: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/8.jpg)
RCA
• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…
![Page 9: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/9.jpg)
Fixing it
• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db
[Diagram: before, a single MySQL server handles both reads and writes; after, writes go to a MySQL master, which replicates to a MySQL slave that serves reads]
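A minimal sketch of the read/write split described above. It is illustrative only: the slides do not show how the routing was implemented, and the `ReplicatedDb` class, its `pick` method and the `critical` flag are hypothetical names. The connection handles are plain strings so the example runs without a database.

```python
import random


class ReplicatedDb:
    """Routes writes and critical reads to the master, other reads to a slave."""

    def __init__(self, master, slaves):
        self.master = master          # handle for master_db
        self.slaves = list(slaves)    # handles for one or more slave_db replicas

    def pick(self, sql, critical=False):
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and not critical and self.slaves:
            # Ordinary reads tolerate replication lag, so spread them over slaves.
            return random.choice(self.slaves)
        # Writes, plus reads that must see the latest data, go to the master.
        return self.master


db = ReplicatedDb(master="master_db", slaves=["slave_db"])
print(db.pick("SELECT * FROM products WHERE id = 1"))                     # slave_db
print(db.pick("UPDATE orders SET status = 'paid' WHERE id = 7"))          # master_db
print(db.pick("SELECT status FROM orders WHERE id = 7", critical=True))   # master_db
```

The `critical` flag is the escape hatch for reads that cannot tolerate replication lag, for example reading back an order right after writing it.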
![Page 10: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/10.jpg)
Learning from it
• Scale out database reads by distributing load across systems
• Isolate database writes from reads
  – Writes are (usually) more critical
![Page 11: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/11.jpg)
Website is “slow”! (Again)
![Page 12: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/12.jpg)
RCA
• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other backend jobs
• Urm…
![Page 13: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/13.jpg)
Fixing it
• Analytics / reporting DB (archival_db)
  – Use MyISAM: optimized for reads
  – Additional indexes for quicker reporting
[Diagram: before, the MySQL master takes website writes and replicates to a single MySQL slave that serves both website reads and analytics reads; after, the master replicates to MySQL Slave 1 (website reads) and MySQL Slave 2 (analytics reads)]
![Page 14: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/14.jpg)
Learning from it
• Isolate the databases being used for serving website traffic from those being used for analytics/reporting
• Isolate systems being used by production website from those being used for background processing
![Page 15: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/15.jpg)
BABY (2010 – 2011)
Learning the basics
![Page 16: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/16.jpg)
Website is “slow”!
![Page 17: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/17.jpg)
RCA
• Why?
• How?
  – Instrumentation
![Page 18: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/18.jpg)
RCA - 1
• Why?
  – Logging a lot
  – PHP processes blocking on writing logs
[Diagram: Request1 -> Process1, Request2 -> Process2 and Request3 -> Process3 all write to the same log file; Process3 is writing while Process1 and Process2 are waiting]
![Page 19: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/19.jpg)
RCA - 2
• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
    • Creating fresh connection for each call
    • All the calls are made in serial order
[Diagram: Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response]
![Page 20: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/20.jpg)
RCA - 3
• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving each request
[Diagram: Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response, with each fetch going to the database]
![Page 21: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/21.jpg)
RCA – 1,2,3
• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!
![Page 22: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/22.jpg)
Fixing it
• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates with it through a local socket
  – Hosts pluggable “handlers”
![Page 23: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/23.jpg)
fk-w3-agent: LoggingHandler
[Diagram: before, Process1, Process2 and Process3 (serving Request1, Request2 and Request3) each write directly to the log file; after, they hand log messages to fk-w3-agent, which writes to the log file (async / buffered)]
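The async / buffered idea can be sketched as a single writer thread fed by a queue. This is not the fk-w3-agent code (that is a Java daemon whose source is not shown in the slides); `BufferedLogWriter`, `batch_size` and the `None` shutdown marker are assumptions made for the illustration.

```python
import io
import queue
import threading


class BufferedLogWriter:
    """One writer thread owns the log; request handlers only enqueue messages."""

    def __init__(self, sink, batch_size=100):
        self.sink = sink                  # any object with write(); a file in practice
        self.batch_size = batch_size
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def log(self, message):
        # Called on the request path: returns immediately, never waits on disk.
        self.pending.put(message)

    def _drain(self):
        while True:
            batch = [self.pending.get()]          # block until there is work
            while len(batch) < self.batch_size:
                try:
                    batch.append(self.pending.get_nowait())
                except queue.Empty:
                    break
            stop = None in batch                  # None is the shutdown marker
            lines = [m for m in batch if m is not None]
            if lines:
                # One write per batch instead of one write per message.
                self.sink.write("".join(line + "\n" for line in lines))
            if stop:
                return

    def close(self):
        self.pending.put(None)
        self.worker.join()


sink = io.StringIO()                      # stands in for the log file
writer = BufferedLogWriter(sink)
for i in range(3):
    writer.log(f"request {i} served")
writer.close()
print(sink.getvalue(), end="")
```

Because only one thread touches the sink, request handlers no longer queue up behind each other waiting for the file.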
![Page 24: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/24.jpg)
fk-w3-agent: ServiceHandler(s)
[Diagram: before, Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response; after, Receive request -> Call fk-w3-agent -> Send response, with fk-w3-agent talking to Service1 and Service2]
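A rough sketch of what moving service calls into a long-lived helper buys. The slides name two problems (a fresh connection per call, and calls made serially) but do not show the ServiceHandler internals, so the parallel fan-out below is an assumption about the fix. `ServiceGateway`, `fake_service` and the service labels are hypothetical, and `time.sleep` stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ServiceGateway:
    """Long-lived helper that fans out service calls in parallel."""

    def __init__(self, services, workers=8):
        # name -> callable; in a real agent these would wrap pooled connections.
        self.services = services
        # Created once and reused across requests, unlike a per-request PHP process.
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def call_all(self, calls):
        # Submit every call first so they overlap, then collect the results.
        futures = {name: self.pool.submit(self.services[name], arg)
                   for name, arg in calls.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_service(label, delay=0.2):
    def handler(arg):
        time.sleep(delay)            # stands in for network + remote processing
        return f"{label}({arg})"
    return handler


gateway = ServiceGateway({"Service1": fake_service("reviews"),
                          "Service2": fake_service("recommendations")})
start = time.perf_counter()
result = gateway.call_all({"Service1": "product-42", "Service2": "product-42"})
elapsed = time.perf_counter() - start
print(result)
print(f"{elapsed:.2f}s")             # close to one delay, not the sum of both
```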
![Page 25: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/25.jpg)
fk-w3-agent: ConfigHandler
[Diagram: before, Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response, with each fetch going to the database; after, Receive request -> Fetch all config from fk-w3-agent -> Send response, with fk-w3-agent polling the database and caching the config]
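“Poll and cache” can be sketched as an in-memory snapshot refreshed by a background thread. This is an illustration under assumptions, not the ConfigHandler itself: `ConfigCache`, the polling interval and the config keys in the usage lines are all made up for the example.

```python
import threading


class ConfigCache:
    """Keeps a full copy of the config in memory and refreshes it by polling."""

    def __init__(self, load_all, interval_seconds=30.0):
        self.load_all = load_all              # callable returning {key: value} from the database
        self.interval = interval_seconds
        self.snapshot = load_all()            # initial load so reads never see an empty cache
        self.stopped = threading.Event()
        self.poller = threading.Thread(target=self._poll, daemon=True)
        self.poller.start()

    def _poll(self):
        # wait() returns False on timeout (time to refresh) and True once stop() is called.
        while not self.stopped.wait(self.interval):
            self.refresh()

    def refresh(self):
        self.snapshot = self.load_all()       # swap in a whole new dict in one assignment

    def get_all(self):
        # Served from memory: one local call per request instead of N database queries.
        return self.snapshot

    def stop(self):
        self.stopped.set()
        self.poller.join()


database = {"free_shipping_min": 500, "reviews_enabled": True}   # hypothetical keys
loads = []


def load_all():
    loads.append(1)                           # count trips to the "database"
    return dict(database)


cache = ConfigCache(load_all, interval_seconds=3600)
for _ in range(1000):
    cache.get_all()                           # 1000 requests, no extra database reads
print(len(loads))                             # 1
database["reviews_enabled"] = False
cache.refresh()                               # what the poller would do on its next tick
print(cache.get_all()["reviews_enabled"])     # False
cache.stop()
```

The trade-off is staleness: a config change is only visible after the next poll.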
![Page 26: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/26.jpg)
Learning from it
• PHP: good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
    • Hurdle for high performance
• Java: stability and performance
• Horses for courses
![Page 27: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/27.jpg)
Website is “slow”! (Again)
![Page 28: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/28.jpg)
RCA
• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU
![Page 29: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/29.jpg)
Fixing it
• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity
![Page 30: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/30.jpg)
Caching: Complications (1)
• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
• Example: Logged-in user’s name
• Impossible to do effective bucket testing
• Or at least makes it prohibitively complex
![Page 31: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/31.jpg)
Caching: Complications (2)
• “Caching PHP serialized Product objects”
• Without caching:
  getProductInfo() -> Fetch from CMS
• With caching, cache hit:
  getProductInfo() -> Fetch from Cache
• With caching, cache miss:
  getProductInfo() -> Fetch from Cache -> Fetch from CMS -> Set in Cache
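The three paths above are the usual cache-aside pattern. A minimal sketch, with a plain dict standing in for Memcache and a lambda standing in for the CMS; `ProductStore` and `cms_calls` are names invented for the example.

```python
class ProductStore:
    """Cache-aside lookup: try the cache, fall back to the CMS, then fill the cache."""

    def __init__(self, cache, fetch_from_cms):
        self.cache = cache                    # a dict here; Memcache in the talk
        self.fetch_from_cms = fetch_from_cms
        self.cms_calls = 0                    # counts trips to the CMS, for illustration

    def get_product_info(self, product_id):
        product = self.cache.get(product_id)
        if product is not None:
            return product                    # cache hit: one step
        self.cms_calls += 1
        product = self.fetch_from_cms(product_id)   # cache miss: go to the source
        self.cache[product_id] = product      # set in cache for the next caller
        return product


store = ProductStore({}, lambda pid: {"id": pid, "title": f"Product {pid}"})
store.get_product_info("p1")   # miss: Fetch from Cache -> Fetch from CMS -> Set in Cache
store.get_product_info("p1")   # hit: Fetch from Cache
print(store.cms_calls)         # 1
```

The complication is visible in the code: a miss now costs three steps instead of one, and every caller has to reason about what happens when the cached copy is stale.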
![Page 32: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/32.jpg)
Caching: Complications (3)
• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
    • Notification Server: pushes notifications raised by CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
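A sketch of the “no TTL, repopulate on notification” approach. The Notification Server protocol is not described in the slides, so `on_product_updated` is a hypothetical callback, and a dict stands in for the persistent, distributed cache.

```python
class ProductCache:
    """No-expiry cache kept fresh by update notifications instead of TTLs."""

    def __init__(self, fetch_from_cms):
        self.fetch_from_cms = fetch_from_cms
        self.entries = {}                     # stands in for a persistent, distributed cache

    def on_product_updated(self, product_id):
        # Called for each notification: re-fetch and overwrite the entry in place.
        self.entries[product_id] = self.fetch_from_cms(product_id)

    def get(self, product_id):
        # Readers only ever touch the cache; there is no expiry to fall through.
        return self.entries.get(product_id)


cms = {"p1": {"price": 100}}
cache = ProductCache(lambda pid: dict(cms[pid]))
cache.on_product_updated("p1")    # initial population
cms["p1"]["price"] = 90           # product edited in the CMS...
cache.on_product_updated("p1")    # ...which raises a notification
print(cache.get("p1"))            # {'price': 90}
```

Since entries never expire, losing the cache means repopulating everything, which is one reason to prefer a persistent store here.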
![Page 33: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/33.jpg)
Learning from it
• Caching is a powerful tool for performance optimization
• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation
• Use caching to go from “acceptable performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable performance”
![Page 34: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/34.jpg)
KID (2012)
Growing up
![Page 35: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/35.jpg)
Website is “slow”!
![Page 36: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/36.jpg)
RCA
• Why?
  – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking processing threads
• Eh?!
![Page 37: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/37.jpg)
Let’s do some math
• Let’s say
  – Mean (or median) response time: 100 ms
  – 8-core server
  – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
  – 95th Percentile response time: 1000 ms
    • Call them “bad requests”
• 4 bad requests in a second
  – Throughput down to 44 rps
• 8 bad requests in a second?
  – Throughput down to 8 rps
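The arithmetic behind those numbers, as a small sketch. It assumes each request occupies one core for its full duration, so every bad request removes a whole core for that second; the function name and parameters are invented for the example.

```python
def throughput_rps(cores, normal_ms, bad_ms, bad_per_second):
    """Requests completed per second when some cores are tied up by slow requests."""
    budget_ms = cores * 1000                                 # core-milliseconds available per second
    bad_done = min(bad_per_second, budget_ms // bad_ms)      # bad requests that fit in the budget
    left_ms = budget_ms - bad_done * bad_ms                  # what remains for normal requests
    return bad_done + left_ms // normal_ms


for bad in (0, 4, 8):
    print(bad, throughput_rps(cores=8, normal_ms=100, bad_ms=1000, bad_per_second=bad))
# 0 80
# 4 44
# 8 8
```

With 4 bad requests, 4 cores are busy for the whole second, leaving 4 cores x 10 normal requests = 40, plus the 4 bad ones = 44. With 8 bad requests, nothing is left for normal traffic.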
![Page 38: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/38.jpg)
Fixing it
• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
    • only to pages that depend on it
• Very aggressive timeouts for non-critical services
  – Example: Recommendations
    • On a Product page, Search results page etc.
    • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
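A sketch of per-service timeouts with a fallback for the non-critical section. The timeout values, `call_with_timeout` and the two fake services are assumptions for illustration; the slides do not give actual numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(service, arg, timeout_seconds, fallback=None):
    """Return the service result, or the fallback if it takes too long."""
    future = pool.submit(service, arg)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        # Give up waiting; the page renders without this section.
        return fallback


def slow_recommendations(product_id):
    time.sleep(0.5)                  # simulates a degraded service
    return ["p7", "p9"]


def fast_reviews(product_id):
    return ["Great phone"]


page = {
    # Needed for the page: a more generous timeout.
    "reviews": call_with_timeout(fast_reviews, "p1", timeout_seconds=0.2, fallback=[]),
    # Good-to-have: a very aggressive timeout, and an empty section on failure.
    "recommendations": call_with_timeout(slow_recommendations, "p1",
                                         timeout_seconds=0.05, fallback=[]),
}
print(page)   # {'reviews': ['Great phone'], 'recommendations': []}
```

Note that this only stops the caller from waiting; the worker thread still runs to completion. A real client would also set connect and read timeouts on the socket so the slow call itself is abandoned.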
![Page 39: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/39.jpg)
Learning from it
• Isolate the impact of poorly performing services / systems
• Isolate the required from the good-to-have
![Page 40: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/40.jpg)
Website is “slow”! (Again)
![Page 41: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/41.jpg)
RCA
• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
    • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product information updates
![Page 42: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/42.jpg)
Fixing it
• Separate cluster for receiving product info update notifications from the cluster that serves users
• Admission control: Don’t let a system receive more requests than it can handle
  – Throttling
• Batch the notifications
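Throttling and batching can be sketched together. A token bucket is one common way to implement admission control; the slides do not say which mechanism was used, so treat `TokenBucket`, its parameters and the `batches` helper as illustrative. The clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Admits at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # over the limit: reject, or ask the client to retry later


def batches(items, size):
    """Group notifications so one request carries many product updates."""
    return [items[i:i + size] for i in range(0, len(items), size)]


bucket = TokenBucket(rate=2, capacity=2)
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
print(decisions)                                         # [True, True, False, True]
print(batches(["p1", "p2", "p3", "p4", "p5"], size=2))   # [['p1', 'p2'], ['p3', 'p4'], ['p5']]
```

Throttling protects the receiver regardless of what the client does; batching reduces the number of requests the client needs to send in the first place.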
![Page 43: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/43.jpg)
Learning from it
• Isolate the systems serving internal requests from those serving production traffic
• Admission control to ensure that a system is isolated from the over-enthusiasm of a client
• Look at the granularity at which we’re working
![Page 44: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/44.jpg)
TEENAGER
Increasing complexity
![Page 45: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/45.jpg)
![Page 46: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/46.jpg)
THANK YOU
![Page 47: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy](https://reader033.vdocuments.site/reader033/viewer/2022061205/5480c34fb4af9f8c288b462c/html5/thumbnails/47.jpg)
Mistake?
• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect