Building a Large-Scale E-Commerce Site with Apache and mod_perl
Bill Hilf and Perrin Harkins
TRANSCRIPT
Myths about building large sites:
• You must use C++ or Java
• You must have a single packaged solution
• You must have incredibly expensive hardware, software, networks, storage, etc., etc.
Perl is great!
• Excellent performance with mod_perl
• Rapid development cycle
• Flexible OO
• Community support… and growing
– Programming Perl (3rd Edition) is still the #1 selling book for O’Reilly
Roll your own application server.
• Apache + mod_perl + CPAN
– Session handling
– Load balancing
– Persistent database connections
– Advanced HTML templating
– Security
You need source code and mailing lists.
• Direct line to developers
• All parts of the system are under your control -- skills, not technology, are the limits
Legacy System
• MySQL
• CGI
• Not very modular (Perl4-ish)
• Monolithic architecture - web + application serving on same resources
Christmas ’99 – Impending Doom
• Barely survived previous Christmas
• MySQL maxed out
• Seasonality - benefits and banes (time to make changes before next spike, but the deadline is truly fixed)
Port to mod_perl and Oracle
• Apache::PerlRun to the rescue
– Basic CGI to mod_perl porting
• Tuning DBI calls
– Persistent connections
– Bind variables
Surviving Christmas ’99
• Furious eight weeks
• Traffic patterns in 1999
– 60,000 to 70,000 sessions/hour
– 800,000 page views/hour
– 7,000 orders/hour
Planning the new architecture
• Goals
– Moving away from off-line generation
  • Handling the book store
– Improving database schema
– Easing team development
– Flexible system to support rapid growth and change
• Methods
– Training the team on Perl OO
– Coding standards
Surviving Christmas ’00
• Another 3X!
– 200,000+ sessions/hour
– 2.5 million+ page views/hour
– 20,000+ orders/hour
• How to cook a router
Linux for manageability
• Remote cluster administration
• Security
• Automated builds
• Easy horizontal scalability
Proxy servers
• Slim Apache
• Apache modules in C
• Session cookies
• Up to 400 processes/box
• Very fast and scalable
• Talks to application servers via HTTP
Search servers
• Custom C++ daemon
• Returns sorted list of IDs
• Inverted index built from database periodically
• Takes load off other systems
– Search is a large percentage of traffic
• Perl solutions
– Search::InvertedIndex
– DBIx::FullTextSearch
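The idea of an inverted index that returns sorted product IDs can be sketched in a few lines of core Perl. This is only an illustration, not the C++ daemon from the slides: the product rows, the `build_index` and `search` names, the all-words-must-match rule, and the numeric sort order are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical product rows, standing in for the periodic database dump.
my %products = (
    101 => 'red toy truck',
    102 => 'blue toy boat',
    103 => 'red kite',
);

# Build the inverted index: each word maps to the set of product IDs
# whose text contains it.
sub build_index {
    my ($rows) = @_;
    my %index;
    for my $id (keys %$rows) {
        for my $word (grep { length } split /\W+/, lc $rows->{$id}) {
            $index{$word}{$id} = 1;
        }
    }
    return \%index;
}

# Return the numerically sorted IDs of products containing every query word.
sub search {
    my ($index, $query) = @_;
    my @words = grep { length } split /\W+/, lc $query;
    return () unless @words;
    my %hits;
    for my $word (@words) {
        $hits{$_}++ for keys %{ $index->{$word} || {} };
    }
    return sort { $a <=> $b } grep { $hits{$_} == @words } keys %hits;
}

my $index = build_index(\%products);
print join(',', search($index, 'red toy')), "\n";
```

Because the search returns only IDs, the caller still has to fetch product data from another source, which matches the division of labor the slides describe.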
Handling Searches
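[Diagram: a load balancer in front of four app servers, two search servers, and the database. The arrows between app servers and search servers are labeled "search request" and "sorted product IDs"; the arrows between app servers and the database are labeled "product ID" and "product data".]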
Load Balancing and Failover
• Proxy servers are balanced using a random selection algorithm
• Application servers are randomly selected and then sticky for each user (Session cookies)
• Load balancers remove servers that fail a service check and move all users to another server
• All data on app servers is written to the database, so that nothing is lost when a server fails
• Database has separate failover system
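The selection rules above can be sketched as a small routine. This is a minimal illustration in core Perl, not the load balancer's actual logic: the server names, the `%healthy` table, and the `pick_server` function are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical app server pool; each value is the result of the last
# service check (1 = passed, 0 = failed).
my %healthy = (app1 => 1, app2 => 1, app3 => 1);

# Pick a server for a request. $sticky is the server name stored in the
# user's session cookie, or undef for a new user.
sub pick_server {
    my ($sticky, $health) = @_;

    # Sticky: keep the user on the same server while it is healthy.
    return $sticky if defined $sticky && $health->{$sticky};

    # Otherwise choose at random among the servers that pass the check.
    my @up = grep { $health->{$_} } keys %$health;
    die "no healthy app servers\n" unless @up;
    return $up[ int rand @up ];
}

my $server = pick_server(undef, \%healthy);    # new user: random choice
$healthy{$server} = 0;                         # server fails its service check
my $next = pick_server($server, \%healthy);    # user is moved elsewhere
print "moved from $server to $next\n";
```

Moving a user is only safe because session data is written to the database, so the new server can reload it.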
Code Structure
• Model-View-Controller pattern
• Controller objects
– Map HTTP requests and parameters to method calls on Model objects
– Choose the appropriate View object
• Model objects
– Represent business concepts like “product” or “user”
– Know nothing about HTTP
– Can be used in non-web applications (cron jobs)
– Talk to data sources (database, search server)
Caching
• Object
– Storable
– BerkeleyDB
– Multicast shared data
• Page
– mod_proxy
– Controlled by HTTP headers
– Cache deletion interface
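Object caching with Storable might look roughly like the sketch below. It assumes a plain in-memory hash in place of BerkeleyDB, and the `cache_set` and `cache_get` names and the time-to-live scheme are hypothetical, not taken from the original system.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# In-memory hash standing in for a shared store such as BerkeleyDB.
my %store;

# Serialize a data structure and record when it expires.
sub cache_set {
    my ($key, $data, $ttl) = @_;
    $store{$key} = { expires => time + $ttl, frozen => freeze($data) };
    return;
}

# Return a fresh copy of the cached data, or undef on a miss or expiry.
sub cache_get {
    my ($key) = @_;
    my $entry = $store{$key} or return;
    if ($entry->{expires} <= time) {
        delete $store{$key};
        return;
    }
    return thaw($entry->{frozen});
}

cache_set('product:101', { name => 'toy truck', price => 9.99 }, 300);
my $product = cache_get('product:101');
print "$product->{name}\n";
```

Freezing the data means each reader gets its own copy, so one request cannot accidentally modify what another request sees.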
Session Tracking
• mod_session_etoys
• mod_unique_id
• Apache::Session
– Write-through cache backed by database
• A sticky situation
• Failover
[Diagram: the database, with two app servers that each hold a local cache.]
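A write-through session cache can be sketched as follows. Two plain hashes stand in for the database and for one app server's local cache; the `save_session` and `load_session` names are hypothetical, and Apache::Session itself is not used here.

```perl
use strict;
use warnings;

my %database;       # stand-in for the session table shared by all app servers
my %local_cache;    # stand-in for one app server's local cache

# Write-through: every update goes to the database and to the local cache.
sub save_session {
    my ($id, $data) = @_;
    $database{$id}    = { %$data };
    $local_cache{$id} = { %$data };
    return;
}

# Read from the local cache first, falling back to the database.
sub load_session {
    my ($id) = @_;
    if (!exists $local_cache{$id} && exists $database{$id}) {
        $local_cache{$id} = { %{ $database{$id} } };
    }
    return $local_cache{$id};
}

save_session('abc123', { cart => 2 });
%local_cache = ();    # simulate failover to an app server with a cold cache
print load_session('abc123')->{cart}, "\n";
```

The local copy can only be trusted while the user stays on one server, which is why sticky sessions and failover appear on the same slide.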
Security
• You will be attacked.
• Avoid guessable session IDs
• Don’t trust the client!
• Message Authentication Code (MAC)
– Digest::SHA1
• Hiding data with Crypt::* modules
• More on MAC and security
– CGI Programming with Perl, 2nd Edition
– Writing Apache Modules with Perl and C
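A MAC lets the server detect whether a value sent to the client, such as a cookie, came back modified. The sketch below uses the HMAC functions from the core Digest::SHA module rather than Digest::SHA1 as named on the slide; the secret, the `value:mac` layout, and the `sign_value` and `verify_value` names are assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha1_hex);

# Hypothetical server-side secret; it must never be sent to the client.
my $secret = 'change-me';

# Append a MAC so that tampering with the value can be detected.
sub sign_value {
    my ($value) = @_;
    return $value . ':' . hmac_sha1_hex($value, $secret);
}

# Return the original value if the MAC matches, otherwise undef.
sub verify_value {
    my ($signed) = @_;
    my ($value, $mac) = $signed =~ /\A(.*):([0-9a-f]{40})\z/s or return;
    return hmac_sha1_hex($value, $secret) eq $mac ? $value : undef;
}

my $cookie = sign_value('user=42');
print defined(verify_value($cookie)) ? "ok\n" : "tampered\n";
```

The value itself stays readable by the client; a MAC only proves it was not changed, which is why the slide lists the Crypt::* modules separately for hiding data.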
Exceptions
• Java did this well: steal it!
• Graham Barr’s Error.pm

    use Error qw(:try);

    try {
        do_some_stuff();
    }
    catch My::Exception with {
        my $E = shift;
        handle_exception($E);
    };
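For readers without Error.pm installed, the same pattern of throwing an exception object and catching it by class can be approximated with core Perl's `eval` and `die`. This is a swapped-in technique, not what the slides use; the `My::Exception` methods, `do_some_stuff`, and `run` shown here are hypothetical.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical exception class; in the slides an Error.pm subclass plays this role.
package My::Exception;
sub new     { my ($class, %args) = @_; return bless {%args}, $class }
sub throw   { my $class = shift; die $class->new(@_) }
sub message { return $_[0]{message} }

package main;

sub do_some_stuff { My::Exception->throw(message => 'out of stock') }

# Run the risky code and report how it ended.
sub run {
    my $ok = eval { do_some_stuff(); 1 };
    return 'ok' if $ok;

    my $err = $@;
    if (blessed($err) && $err->isa('My::Exception')) {
        return 'handled: ' . $err->message;
    }
    die $err;    # not one of ours: rethrow
}

print run(), "\n";
```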
Handling DBI errors
• No need to check return codes, just use RaiseError => 1
    try {
        $sth->execute();
    }
    catch Error with {
        # roll back and recover
        $dbh->rollback();
        # etc.
    };
Templates
• Template::Toolkit
[% FOREACH item = cart.items %]
name: [% item.name %]
price: [% item.price %]
[% END %]
Hello Controller Code
    package ESF::Control::Hello;
    use strict;
    use ESF::Control;
    @ESF::Control::Hello::ISA = qw(ESF::Control);
    use ESF::Util;

    sub handler {
        ### do some setup work
        my $class = shift;
        my $apr = ESF::Util->get_request();

        ### instantiate the model
        my $name = $apr->param('name');
        # we create a new Model::Hello object.
        my $hello = ESF::Model::Hello->new(NAME => $name);
Hello Controller Code (cont’d)
        ### send out the view
        my %view_data;
        $view_data{'hello'} = $hello->view();
        # the process_template() method is inherited
        # from ESF::Control
        $class->process_template(TEMPLATE => 'hello.html',
                                 DATA     => \%view_data);
    }
• Base class wraps the handler method in a try block.
• process_template() sends out headers and text.
• Controller can specify a timeout in seconds, which will be used for cache control on the proxy server.
Hello Model Code
    package ESF::Model::Hello;
    use strict;

    sub new {
        my $class = shift;
        my %args  = @_;
        my $self  = bless {}, $class;
        $self->{'name'} = $args{'NAME'} || 'World';
        return $self;
    }

    sub view {
        # the object itself will work for the view
        return shift;
    }
Hello Template
<HTML><TITLE>Hello, My Oyster</TITLE><BODY>
[% PROCESS header.html %]
Hello [% hello.name %]!
[% PROCESS footer.html %]
</BODY> </HTML>
Performance Tuning
• Tune DBI code
• Cache like crazy
• Lazy data loading
• Avoid object creation when possible
– Re-use TT object
– Cache session and database handle for length of request ($r->pnotes)
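Lazy data loading means an object fetches expensive data only when it is first asked for, then reuses it. A minimal sketch, assuming a hypothetical `Product` class whose `_load_reviews` method stands in for a database query:

```perl
use strict;
use warnings;

package Product;

my $loads = 0;    # counts simulated database hits

sub new { my ($class, $id) = @_; return bless { id => $id }, $class }

# Hypothetical loader standing in for a database query.
sub _load_reviews {
    my ($self) = @_;
    $loads++;
    return ["review for $self->{id}"];
}

# Fetch reviews only on first access, then reuse the stored copy.
sub reviews {
    my ($self) = @_;
    $self->{reviews} = $self->_load_reviews unless exists $self->{reviews};
    return $self->{reviews};
}

sub load_count { return $loads }

package main;

my $product = Product->new(101);
$product->reviews for 1 .. 3;
print Product::load_count(), "\n";
```

Pages that never display reviews never pay for the query, and pages that display them several times pay only once.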
Be careful of nested exceptions!
    my $foo;
    try {
        # some stuff…
        try {
            $foo++;
            # more stuff…
        }
        catch Error with {
            # handle error
        };
    }
    catch Error with {
        # handle other error
    };
Berkeley DB
• Beyond DB_File
– Shared memory buffer
– No opening/closing files or syncing to disk
– Locking handled for you
– Transaction support and page-level locking available
• Hard kills and segfaults in Apache/mod_perl can cause corruption
• Deadlocks happen.
– Page-level locking requires some form of deadlock handling
Suggestions for using Berkeley DB
• Database-level locking is good enough for almost anyone
– Simpler (no deadlock issues)
– Still much faster than DB_File
– Be careful of long operations with cursors
• If you have the coding resources, write a daemon for it
– Handle signals safely
– Handle deadlocks
Valuable Tools
• Debugger
• Profiler
• Ability to run system on workstations
• Dia & Rational Rose for analysis and design
An Open Source Success Story
• Scalable
• Cost effective
– No license fees
– Commodity hardware
• Customized
• Learning environment
• Pervasive technologies
• Giving back to the Open Source community
• To contact us…
– Bill Hilf - [email protected]
– Perrin Harkins - [email protected]
Thank you!