Handling Massive Traffic with Python
DESCRIPTION
At Paylogic we handle massive online sales peaks, with tens of thousands of customers arriving every second for a chance to buy a ticket. We built a virtual queue to handle this load and sell the tickets in a fair order. This is how we did it (as much as I can tell you!). I presented this talk at PyGrunn 2013.
TRANSCRIPT
What’s the problem?
• High Traffic (>10k hits/s)
• Redirect low traffic to Paylogic
• Change redirected TPS
• Expect things to break
• Be fair, respect FIFO (within reason)
• Keep users informed
In more detail
• Open/hold/close sales
• Expect any server to go down
• Expect ALL servers to go down
• Expect users to disappear
• Display expected waiting time and other information
• Keep it working
• Prevent attacks
How It Works
• A horde of customers appears! Each customer:
• Sees a pretty page.
• Gets a position in the queue.
• Waits while the page auto-refreshes.
• Their turn? On to the Frontoffice!
• Meanwhile, info is shown.
• (waiting time, information from event managers…)
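The talk does not say how a customer's turn is decided. One simple way to picture it, assuming queue positions are 1-based integers handed out in arrival order, is a gate that releases positions at a configurable rate. `ReleaseGate`, `tps`, and `is_turn` are hypothetical names used only for this sketch.

```python
import time


class ReleaseGate:
    """Tracks how many queue positions have been let through so far.

    Hypothetical helper: positions are handed out in arrival order (FIFO)
    and released at `tps` positions per second.
    """

    def __init__(self, tps, clock=time.monotonic):
        self._clock = clock
        self._tps = float(tps)
        self._released = 0.0      # positions released before self._since
        self._since = clock()     # moment the current rate took effect

    def released(self):
        """Number of positions allowed through right now."""
        elapsed = self._clock() - self._since
        return int(self._released + elapsed * self._tps)

    def set_tps(self, tps):
        """Change the rate without losing what was already released."""
        now = self._clock()
        self._released += (now - self._since) * self._tps
        self._since = now
        self._tps = float(tps)

    def is_turn(self, position):
        """True once the given queue position may go to the Frontoffice."""
        return position <= self.released()
```

Setting the rate to 0 puts the sale on hold: nobody new is released, but positions that were already released stay released.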
Data Storage
• Estimates
• Not much data, stored in the instances and synced.
• Tokens
• A LOT of data!
• way too much to store and sync
• use distributed storage
• (the browsers)
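Storing tokens in the browsers only works if the server can tell that a returned token has not been altered. The talk does not describe the token format; a common approach, shown here as an assumption, is to sign the payload with an HMAC so any queue instance holding the key can validate it without shared storage. `SECRET_KEY`, `issue_token`, `validate_token`, and the payload fields are hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key shared by all instances


def issue_token(position, issued_at, key=SECRET_KEY):
    """Build a signed token that the browser stores and sends back."""
    payload = json.dumps({"pos": position, "iat": issued_at}, sort_keys=True)
    body = base64.urlsafe_b64encode(payload.encode("utf-8"))
    signature = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    return (body + b"." + signature).decode("ascii")


def validate_token(token, key=SECRET_KEY):
    """Return the payload dict, or None if the token is malformed or forged."""
    try:
        body, signature = token.encode("ascii").rsplit(b".", 1)
    except (UnicodeEncodeError, ValueError):
        return None
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body))
```

With this design the instances keep no per-user state: everything needed to place a user in the queue travels with the request.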
Architecture
• ELB
• Queue Instances
• Bouncer Process
• Syncer Process
• HTML/JS Queue Page in Cloudfront
ELB
• Auto-scales (but not fast enough).
• Many regions.
• Can boot/kill instances automatically.
• We don’t do it yet.
Queue Instances
• EC2 instances, which handle the traffic.
• All identical; they sync with each other.
• They can be added or removed at will.
• If some (but not all) die, the users won’t notice.
• If all die, only the statistics will be affected.
• (Never happened).
Users Handler
• Give out and validate tokens.
• Determine if the user should:
• Keep waiting
• Go to the Frontoffice
• See the Sold Out page.
• Return the expected waiting time.
• Return the values configured by the Event Managers.
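The responsibilities above can be sketched as a single function that answers one poll from the queue page. This is an illustration, not Paylogic's handler: `handle_poll`, its parameters, and the response keys are hypothetical, and giving "sold out" priority over "go" is an assumption.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    GO = "go"
    SOLD_OUT = "sold_out"


def handle_poll(position, released, tps, sold_out, event_info):
    """Decide what one waiting user should do next.

    position   -- the user's queue position, taken from a validated token
    released   -- how many positions have been let through so far
    tps        -- current let-through rate, in users per second
    sold_out   -- whether the sale has sold out
    event_info -- values configured by the event managers, passed through
    """
    if sold_out:
        action = Action.SOLD_OUT
    elif position <= released:
        action = Action.GO
    else:
        action = Action.WAIT

    expected_wait = None
    if action is Action.WAIT and tps > 0:
        # Users still ahead of this one, divided by the let-through rate.
        expected_wait = (position - released) / tps

    return {
        "action": action.value,
        "expected_wait_seconds": expected_wait,
        "event_info": event_info,
    }
```

When the rate is 0 (sales on hold) no waiting time can be estimated, so the sketch returns `None` for it rather than guessing.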
Synchronization of Statistics
• Keep the Queue Instances synced so they know:
• How many users are waiting.
• How to calculate the waiting time.
• How many users are being let through by the system.
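How the instances actually sync is not covered in the talk. As one possible approach (a swapped-in technique, not the author's confirmed method), each instance can keep grow-only counters per instance id and merge a peer's counters by taking the per-instance maximum. Merging is then safe to repeat and to apply in any order. `InstanceStats` and its methods are hypothetical names.

```python
class InstanceStats:
    """Per-instance grow-only counters, merged by element-wise maximum."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.issued = {}       # instance id -> tokens issued by that instance
        self.let_through = {}  # instance id -> users let through by that instance

    def record_issued(self, count=1):
        current = self.issued.get(self.instance_id, 0)
        self.issued[self.instance_id] = current + count

    def record_let_through(self, count=1):
        current = self.let_through.get(self.instance_id, 0)
        self.let_through[self.instance_id] = current + count

    def merge(self, other):
        """Fold a peer's counters into ours; idempotent and order-independent."""
        pairs = ((self.issued, other.issued), (self.let_through, other.let_through))
        for mine, theirs in pairs:
            for instance_id, count in theirs.items():
                mine[instance_id] = max(mine.get(instance_id, 0), count)

    def users_waiting(self):
        """Global estimate: tokens issued minus users already let through."""
        return sum(self.issued.values()) - sum(self.let_through.values())
```

If every instance dies, these counters are lost while the tokens in the browsers survive, which matches the observation that only the statistics would be affected.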
HTML/JS Queue Page in Cloudfront
• Uses Handlebars
• Served by Cloudfront so that the Queue keeps looking good even if all our servers go down.
• Updated frequently.
• Calls the Load Balancer. Error? Retry.
• Errors are very rare.
Deployment
• Debs in private repos.
• Installed through tunnel.
• Custom python2deb tool (to be released).
Stress Test
• Custom client with human-like behaviour.
• Notify Amazon!
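The custom stress-test client is not shown in the talk. A minimal sketch of what "human-like" could mean, under stated assumptions: each simulated user polls at a jittered interval, treats an error as "retry on the next refresh" (as the queue page does), and sometimes simply disappears. `simulate_user`, `poll`, and all parameters are hypothetical; `poll` stands in for the real HTTP call to the load balancer.

```python
import random
import time


def simulate_user(poll, rng, refresh_seconds=5.0, abandon_probability=0.01,
                  max_polls=1000, sleep=time.sleep):
    """Poll until let through, sold out, or the simulated user gives up.

    poll -- hypothetical callable returning "wait", "go" or "sold_out";
            it may raise OSError to signal a network error
    rng  -- a random.Random instance, so runs can be reproduced with a seed
    """
    for _ in range(max_polls):
        # Real users close the tab; model that as a small chance per refresh.
        if rng.random() < abandon_probability:
            return "abandoned"
        try:
            action = poll()
        except OSError:
            action = "wait"  # Error? Retry on the next refresh.
        if action != "wait":
            return action
        # Jitter the refresh interval so clients do not poll in lockstep.
        sleep(rng.uniform(0.5, 1.5) * refresh_seconds)
    return "timed_out"
```

Passing `sleep` and `rng` as arguments keeps the sketch testable without waiting or touching the network; a real stress test would run many such users concurrently.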
What we learned
• Debugging distributed apps is hard.
• Last bugs are nasty.
• ELB doesn’t scale fast enough by itself.
Q&A