Best Practices for Large-Scale Web Sites
DESCRIPTION
This is a lightning presentation given by Brian Ko, summarizing a session he attended at JavaOne 2009 on how to build very large-scale websites.
TRANSCRIPT
![Page 1: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/1.jpg)
Best Practices for Large-Scale Web Sites
Lessons from eBay
Brian Ko
![Page 2: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/2.jpg)
eBay
• 276,000,000 registered users
• Stores over 2 petabytes of data
• Over 1 billion page views per day
• 113 million items for sale in over 50,000 categories
• 2 billion photos
![Page 3: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/3.jpg)
eBay
• 300+ features per quarter
• Rolls out 100,000+ lines of code every two weeks
• In 39 countries, in 7 languages, 24x7x365
• 48 billion SQL executions/day!
• (Figures for 2008)
![Page 4: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/4.jpg)
Design Goals
• Scalability
– Resource usage should increase linearly (or better!) with load
– Design for 10x growth in data, traffic, users, etc.
• Availability
– Resilience to failure
– Graceful degradation
– Recoverability from failure
![Page 5: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/5.jpg)
Design Goals
• Latency
– User experience, data latency
• Manageability
– Simplicity, maintainability
– Provide diagnostics
• Cost
– Development effort and complexity
– Operational cost (TCO)
![Page 6: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/6.jpg)
Architecture Considerations
• Partition everything
– “You eat an elephant only one bite at a time”
• Asynchrony everywhere
– “Good things come to those who wait”
• Automate everything
– “Automation will save time and eliminate human errors…”
• Assume everything fails
– “Be Prepared”
![Page 7: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/7.jpg)
Partition Everything: Split
• Split every problem into manageable chunks
– “If you can’t split it, you can’t scale it”
– By data, load, and/or usage pattern
– For example, there are 1000s of databases
![Page 8: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/8.jpg)
Partition Everything: Motivation
• Scalability: can scale horizontally and independently
• Availability: can isolate failures
• Manageability: can decouple different segments and functional areas
• Cost: can use less expensive hardware
![Page 9: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/9.jpg)
Partition Everything: Databases
• Functional Segmentation
– Segment databases into functional areas: user, item, transaction, product, account, feedback
– Over 1000 logical databases on over 400 physical hosts
• Horizontal Split
– Split (or “shard”) databases horizontally along primary access path
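A minimal sketch of a horizontal split, assuming the primary access path is a user ID and that shards are chosen by a stable hash modulo the shard count. The slides do not say how eBay routes keys; the shard names and the `shard_for` helper are hypothetical.

```python
import hashlib

# Hypothetical shard hosts; the names and the count are illustrative only.
SHARDS = ["user-db-0", "user-db-1", "user-db-2", "user-db-3"]


def shard_for(user_id: str, shards=SHARDS) -> str:
    """Map a key to one shard using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and so would not route consistently across servers.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Every read and write for a given user then goes to `shard_for(user_id)`. Plain modulo routing remaps most keys when the shard count changes, so a lookup table or consistent hashing is a common alternative.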
![Page 10: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/10.jpg)
Partition Everything: Databases
• No database transactions
• eBay’s transaction policy
– Absolutely no client-side transactions, two-phase commit, etc.
– Auto-commit for vast majority of DB writes
• Consistency is not always required or possible
– To guarantee availability and partition-tolerance, we are forced to trade off consistency (Brewer’s CAP Theorem)
![Page 11: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/11.jpg)
Partition Everything: Databases
• Consistency without transactions
– Careful ordering of DB operations
– Eventual consistency through asynchronous event or reconciliation batch
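The two techniques above can be sketched together. This is an illustration under stated assumptions, not eBay's implementation: two dictionaries stand in for separate databases, the item record is treated as the source of truth and written first, and a hypothetical `reconcile` batch repairs the derived seller counts if the second write is lost.

```python
from collections import Counter

items_db = {}          # stand-in for the item database (source of truth)
seller_counts_db = {}  # stand-in for a separate database of derived counts


def list_item(item_id, seller_id, fail_second_write=False):
    """Write to two databases with no transaction spanning them."""
    # Step 1: write the authoritative record first (auto-commit).
    items_db[item_id] = {"seller": seller_id}
    # Step 2: update derived data; this write may fail independently.
    if fail_second_write:  # simulated outage, for demonstration only
        raise ConnectionError("seller_counts_db unavailable")
    seller_counts_db[seller_id] = seller_counts_db.get(seller_id, 0) + 1


def reconcile():
    """Batch job: rebuild the derived counts from the source of truth."""
    actual = Counter(record["seller"] for record in items_db.values())
    seller_counts_db.clear()
    seller_counts_db.update(actual)
```

Because the authoritative write comes first, a failure between the two steps leaves the derived count stale but never leaves a count that refers to a missing item; the next reconciliation run brings the two back in line.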
![Page 12: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/12.jpg)
Partition Everything: Application Tier
• Over 17,000 application servers in 220 pools
• Functional Segmentation
– Segment functions into separate application pools
– Allows for parallel development, deployment, and monitoring
– Minimizes DB / resource dependencies
• Horizontal Split
– Within pool, all application servers are created equal
![Page 13: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/13.jpg)
Partition Everything: Application Tier
• User session flow moves through multiple application pools
• Absolutely no session state
• Transient state maintained by
– URL, cookie, scratch database
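One way to keep transient state in a cookie rather than in server memory is sketched below. The HMAC signature is an added assumption (the slides only say state lives in the URL, a cookie, or a scratch database), and `SECRET`, `encode_state`, and `decode_state` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-only-secret"  # hypothetical key shared by all servers


def encode_state(state: dict, secret: bytes = SECRET) -> str:
    """Serialize state into a cookie value of the form payload.signature."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_state(cookie: str, secret: bytes = SECRET) -> dict:
    """Verify the signature, then return the state carried by the cookie."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(secret, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("cookie signature mismatch")
    return json.loads(base64.urlsafe_b64decode(payload))
```

Since any server holding the key can decode the cookie, a request can land on any machine in any pool, which is what lets all application servers in a pool be treated as equal.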
![Page 14: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/14.jpg)
Async Everywhere
• Prefer asynchronous processing
– Where possible, integrate disparate components asynchronously
• Motivations
– Scalability: can scale components independently
– Availability
  • Can decouple availability state
  • Can retry operations
– Latency
  • Can significantly improve user experience latency at cost of data/execution latency
  • Can allocate more time to processing than user would tolerate
– Cost: can spread peak load over time
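A minimal sketch of asynchronous integration, assuming an in-process queue and a single worker thread; a real deployment would more likely use a durable message system between components. All names here (`events`, `handle_request`, `process`, `start_worker`) are hypothetical.

```python
import queue
import threading

events = queue.Queue()  # buffer between the request path and the slow work
processed = []          # stand-in for the effect of the slow work


def handle_request(order_id):
    """Request path: enqueue the work and return to the user immediately."""
    events.put(order_id)
    return "accepted"


def process(order_id):
    """Stand-in for slow work (e.g. recomputing derived data)."""
    processed.append(order_id)


def worker():
    """Consume events until a None sentinel arrives."""
    while True:
        order_id = events.get()
        try:
            if order_id is None:
                return
            process(order_id)
        finally:
            events.task_done()


def start_worker():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The user sees the latency of `handle_request` only, while the queue absorbs bursts so the worker can run at a steady rate, which is the "spread peak load over time" point above.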
![Page 15: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/15.jpg)
Async Everywhere: Batch
• Scheduled offline batch process appropriate for
– Infrequent, periodic, or scheduled processing
– Non-incremental computation (a.k.a. “Full Table Scan”)
• Examples
– Import data (catalogs, currency, etc.)
– Generate recommendations (items, products, searches, etc.)
– Process items at end of auction
![Page 16: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/16.jpg)
Automate Everything: Motivation
• Scalability
– Can scale with machines, not humans
• Availability / Latency
– Can adapt to changing environment more rapidly
• Cost
– Machines are far less expensive than humans
– Can learn / improve / adjust over time without manual effort
![Page 17: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/17.jpg)
Automate Everything: Deployment
• Challenge
– Need to deploy the application to over 17,000 application servers at the same time
• Solution
– Deploy the application in advance with the new feature switch turned off
– Turn on the switch through an automatic process on the target date
– Makes rollback easier
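The feature-switch approach might look like the sketch below. It assumes the switch is driven by a configured activation time and can be forced off for rollback; the `FeatureSwitch` class and `render_checkout` function are hypothetical and not taken from the talk.

```python
from datetime import datetime, timezone


class FeatureSwitch:
    """A switch that ships turned off and turns on at a configured time."""

    def __init__(self, name, enable_at=None):
        self.name = name
        self.enable_at = enable_at  # None means the feature stays off
        self.forced_off = False

    def is_on(self, now=None):
        if self.forced_off or self.enable_at is None:
            return False
        now = now or datetime.now(timezone.utc)
        return now >= self.enable_at

    def roll_back(self):
        """Rollback is a switch flip, not a redeployment."""
        self.forced_off = True


def render_checkout(switch, now=None):
    # Both code paths are already deployed; the switch selects one.
    return "new checkout" if switch.is_on(now) else "old checkout"
```

Because the new code is already on every server before the target date, activation and rollback no longer depend on pushing binaries to all machines at once.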
![Page 18: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/18.jpg)
Assume Everything Fails
• Build all systems to be tolerant of failure
– Assume every operation will fail and every resource will be unavailable
– Rapid failure detection and recovery
– Do as much as possible during failure
• Motivation
– Availability
![Page 19: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/19.jpg)
Assume Everything Fails
• Rollback
– Absolutely no changes to the site which cannot be undone (!)
• Failure Detection
– Real-time application state monitoring: exceptions and operational alerts
– “Resource slow” is often far more challenging than “resource down”
![Page 20: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/20.jpg)
Assume Everything Fails: Graceful Degradation
• Application “marks down” the resource
– Stops making calls to it and sends alert
• Non-critical functionality is removed or ignored
• Critical functionality is retried or deferred
– Failover to alternate resource
– Defer processing to async event
• Explicit “markup”
– Allows resource to be restored and brought online in a controlled way
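The mark-down / mark-up behavior can be sketched roughly as follows. The failure threshold, the in-memory alert list, and every name (`GuardedResource`, `fetch_with_failover`, `fetch_optional`) are assumptions made for illustration; the slides do not describe eBay's actual mechanism.

```python
class ResourceMarkedDown(Exception):
    """Raised instead of calling a resource that is currently marked down."""


class GuardedResource:
    def __init__(self, name, call, max_failures=3):
        self.name = name
        self.call = call
        self.max_failures = max_failures  # hypothetical threshold
        self.failures = 0
        self.marked_down = False
        self.alerts = []  # stand-in for an operational alerting channel

    def invoke(self, *args):
        if self.marked_down:
            # Stop making calls to a resource that is marked down.
            raise ResourceMarkedDown(self.name)
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.marked_down = True
                self.alerts.append(f"{self.name} marked down")
            raise
        self.failures = 0
        return result

    def mark_up(self):
        """Explicit operator action; the resource never comes back by itself."""
        self.failures = 0
        self.marked_down = False


def fetch_with_failover(primary, alternate, *args):
    """Critical path: try the primary, then fail over to the alternate."""
    try:
        return primary.invoke(*args)
    except Exception:
        return alternate.invoke(*args)


def fetch_optional(resource, default, *args):
    """Non-critical path: drop the feature by returning a default."""
    try:
        return resource.invoke(*args)
    except Exception:
        return default
```

Requiring an explicit `mark_up` call, rather than retrying automatically, keeps a recovering resource from being flooded the moment it responds again.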
![Page 21: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/21.jpg)
Summary
• Partition everything
• Asynchrony everywhere
• Automate everything
• Assume everything fails
![Page 22: Best Practices for Large-Scale Web Sites](https://reader035.vdocuments.site/reader035/viewer/2022062616/5492ff24ac795959288b4927/html5/thumbnails/22.jpg)
The End
5 minutes of question time
starts now!