
Building a Large-Scale E-commerce Site with Apache and mod_perl

Bill Hilf and Perrin Harkins

Myths about building large sites:

• You must use C++ or Java

• You must have a single packaged solution

• You must have incredibly expensive hardware, software, networks, storage, etc., etc.

Perl is great!

• Excellent performance with mod_perl

• Rapid development cycle

• Flexible OO

• Community support… and growing

– Programming Perl (3rd Edition) is still the #1 selling book for O’Reilly

Roll your own application server.

• Apache + mod_perl + CPAN

– Session handling

– Load balancing

– Persistent database connections

– Advanced HTML templating

– Security

You need source code and mailing lists.

• Direct line to developers

• All parts of the system are under your control -- skills, not technology, are the limits

Case Study: eToys.com

Legacy System

• MySQL

• CGI

• Not very modular (Perl4-ish)

• Monolithic architecture - web + application serving on same resources

Christmas ’99 – Impending Doom

• Barely survived previous Christmas

• MySQL maxed out

• Seasonality - benefits and banes (time to make changes before next spike, but the deadline is truly fixed)

Port to mod_perl and Oracle

• Apache::PerlRun to the rescue

– Basic CGI to mod_perl porting

• Tuning DBI calls

– Persistent connections

– Bind variables
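The two DBI tunings named above can be sketched roughly as follows. This is an illustration, not the eToys code: it assumes DBD::SQLite is installed and uses an in-memory SQLite database as a stand-in for Oracle, and the `product` table and its rows are hypothetical. Under mod_perl, Apache::DBI is the usual way to make connections persistent; `connect_cached` is used here only because it runs outside Apache.

```perl
use strict;
use warnings;
use DBI;

# Assumption: an in-memory SQLite database stands in for Oracle.
my @dsn = ('dbi:SQLite:dbname=:memory:', '', '',
           { RaiseError => 1, AutoCommit => 1 });

# connect_cached returns the same handle for the same arguments,
# so repeated calls do not reconnect.
my $dbh = DBI->connect_cached(@dsn);
$dbh->do('CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)');

# Bind variables: the statement is prepared once and reused with
# different values, instead of interpolating values into the SQL.
my $insert = $dbh->prepare_cached(
    'INSERT INTO product (id, name) VALUES (?, ?)');
$insert->execute(101, 'wooden train set');
$insert->execute(102, q{kid's "first" bike});    # no quoting problems

my $select = $dbh->prepare_cached('SELECT name FROM product WHERE id = ?');
for my $id (101, 102) {
    $select->execute($id);
    my ($name) = $select->fetchrow_array;
    $select->finish;
    print "$id: $name\n";
}
```

Besides avoiding quoting bugs, placeholders let the database reuse its parsed statement, which is where much of the speedup on Oracle typically comes from.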

Surviving Christmas ’99

• Furious eight weeks

• Traffic patterns in 1999

– 60,000-70,000 sessions/hour

– 800,000 page views/hour

– 7,000 orders/hour

Planning the new architecture

• Goals

– Moving away from off-line generation

  • Handling the book store

– Improving database schema

– Easing team development

– Flexible system to support rapid growth and change

• Methods

– Training the team on Perl OO

– Coding standards

Surviving Christmas ’00

• Another 3X!

– 200,000+ sessions/hour

– 2.5 million+ page views/hour

– 20,000+ orders/hour

• How to cook a router

The Architecture

Network Layout

Linux for manageability

• Remote cluster administration

• Security

• Automated builds

• Easy horizontal scalability

Proxy servers

• Slim Apache

• Apache modules in C

• Session cookies

• Up to 400 processes/box

• Very fast and scalable

• Talks to application servers via HTTP

Application servers

• mod_perl

• local cache

• Shared resources over NFS

• Dual CPU, 1GB RAM

Search servers

• Custom C++ daemon

• Returns sorted list of IDs

• Inverted index built from database periodically

• Takes load off other systems

– Search is a large percentage of traffic

• Perl solutions

– Search::InvertedIndex

– DBIx::FullTextSearch
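To make the inverted-index idea concrete, here is a minimal core-Perl sketch. It is not the eToys C++ daemon, whose internals the slides do not show. The product rows, the function names `build_index` and `search`, and the choice to sort by numeric ID are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical product rows, standing in for a periodic database dump.
my %products = (
    101 => 'wooden train set',
    102 => 'electric train engine',
    103 => 'wooden building blocks',
);

# Build the inverted index: word => { product ID => 1 }.
sub build_index {
    my ($rows) = @_;
    my %index;
    for my $id (keys %$rows) {
        for my $word (grep { length } split /\W+/, lc $rows->{$id}) {
            $index{$word}{$id} = 1;
        }
    }
    return \%index;
}

# Return the sorted IDs of products containing every query word.
sub search {
    my ($index, $query) = @_;
    my @words = grep { length } split /\W+/, lc $query;
    return () unless @words;
    my %hits;
    for my $word (@words) {
        $hits{$_}++ for keys %{ $index->{$word} || {} };
    }
    my @ids = grep { $hits{$_} == @words } keys %hits;
    return sort { $a <=> $b } @ids;
}

my $index = build_index(\%products);
print join(',', search($index, 'wooden train')), "\n";   # prints "101"
print join(',', search($index, 'train')), "\n";          # prints "101,102"
```

The search side only ever returns IDs, which matches the split described on the slide: the app server fetches the product data for those IDs separately.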

Handling Searches

[Diagram: a load balancer in front of several app servers. An app server sends a search request to one of the search servers and receives sorted product IDs, then uses each product ID to fetch product data from the database.]

Load Balancing and Failover

• Proxy servers are balanced using a random selection algorithm

• Application servers are randomly selected and then sticky for each user (Session cookies)

• Load balancers remove servers that fail a service check and move all users to another server

• All data on app servers is written to the database, so that nothing is lost when a server fails

• Database has separate failover system
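The "random, then sticky, with failover" policy above can be sketched in a few lines of Perl. This is only a model of the policy, not the configuration of the actual load balancers. The server names, the `%healthy` table, and `choose_server` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical app server pool; the value records whether the last
# service check passed.
my %healthy = (app1 => 1, app2 => 1, app3 => 1);

# Pick a server for a request. $sticky is the server name stored in the
# user's session cookie (undef for a new user).
sub choose_server {
    my ($sticky) = @_;

    # Keep the user on the same server while it is healthy.
    return $sticky if defined $sticky && $healthy{$sticky};

    # New user, or the sticky server failed: pick a healthy one at random.
    my @up = grep { $healthy{$_} } keys %healthy;
    die "no healthy app servers\n" unless @up;
    return $up[ int rand @up ];
}

my $server = choose_server(undef);      # new user: random healthy server
$healthy{$server} = 0;                  # simulate a failed service check
my $next = choose_server($server);      # user is moved to another server
print "moved from $server to $next\n";
```

Moving a user this way is only safe because, as noted above, everything on the app servers is also written to the database.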

Code Structure

• Model-View-Controller pattern

• Controller objects

– Map HTTP requests and parameters to method calls on Model objects

– Choose the appropriate View object

• Model objects

– Represent business concepts like “product” or “user”

– Know nothing about HTTP

– Can be used in non-web applications (cron jobs)

– Talk to data sources (database, search server)

Code Structure (cont’d)

• View objects

– HTML Templates

– No control flow code

MVC Diagram

Caching

• Object

– Storable

– BerkeleyDB

– Multicast shared data

• Page

– mod_proxy

– Controlled by HTTP headers

– Cache deletion interface
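As a rough sketch of object caching with Storable, the following keeps serialized data structures with an expiry time. A plain hash stands in for BerkeleyDB here so the example runs with core modules only. The key format, the time-to-live values, and the `cache_set`/`cache_get` names are assumptions.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# In-process stand-in for the on-disk store:
# key => [expiry time, frozen data].
my %store;

# Serialize a data structure and remember when it should expire.
sub cache_set {
    my ($key, $data, $ttl) = @_;
    $store{$key} = [ time() + $ttl, freeze($data) ];
    return;
}

# Return a fresh copy of the cached structure, or undef if the entry
# is missing or expired.
sub cache_get {
    my ($key) = @_;
    my $entry = $store{$key} or return;
    if ($entry->[0] <= time()) {
        delete $store{$key};
        return;
    }
    return thaw($entry->[1]);
}

cache_set('product:101', { name => 'wooden train set', price => 29.99 }, 300);
my $product = cache_get('product:101');
print "$product->{name}\n";    # prints "wooden train set"
```

Because `thaw` builds a new structure on every read, callers can modify what they get back without corrupting the cached copy.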

Page Cached

Page Not Cached

Session Tracking

• mod_session_etoys

• mod_unique_id

• Apache::Session

– Write-through cache backed by database

• A sticky situation

• Failover

[Diagram: two app servers, each with its own session cache, both backed by the shared database.]
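The write-through session cache can be modeled in a few lines. This sketch does not use Apache::Session itself: two plain hashes stand in for the shared database and for one app server's local cache, and `save_session`/`load_session` are hypothetical names.

```perl
use strict;
use warnings;

# Stand-ins: %database plays the shared database, %cache the local
# cache on one app server.
my (%database, %cache);

# Write-through: every update goes to the database and to the cache.
sub save_session {
    my ($id, $data) = @_;
    $database{$id} = { %$data };
    $cache{$id}    = { %$data };
    return;
}

# Read from the local cache, falling back to the database on a miss
# (for example after failover to a different app server).
sub load_session {
    my ($id) = @_;
    return $cache{$id} if exists $cache{$id};
    my $row = $database{$id} or return;
    return $cache{$id} = { %$row };
}

save_session('abc123', { cart => 2 });
%cache = ();                                  # simulate moving to a new server
print load_session('abc123')->{cart}, "\n";   # prints "2"
```

Sticky sessions keep the cache hit rate high in normal operation; the database copy is what makes failover to a server with a cold cache lossless.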

Security

• You will be attacked.

• Avoid guessable session IDs

• Don’t trust the client!

• Message Authentication Code (MAC)

– Digest::SHA1

• Hiding data with Crypt::* modules

• More on MAC and security

– CGI Programming with Perl, 2nd Edition

– Writing Apache Modules with Perl and C
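A MAC lets the server detect whether a value it handed to the client (a cookie or hidden field, say) came back modified. The sketch below is one way to do it, under stated assumptions: it uses the core Digest::SHA module and its `hmac_sha1_hex` function rather than the Digest::SHA1 module named on the slide, and the secret, the `value:mac` token format, and the `sign`/`verify` names are made up for illustration.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha1_hex);

# Hypothetical server-side secret; never sent to the client.
my $secret = 'change-me';

# Append a MAC so the server can detect tampering with the value.
sub sign {
    my ($value) = @_;
    return $value . ':' . hmac_sha1_hex($value, $secret);
}

# Return the original value if the MAC matches, otherwise undef.
sub verify {
    my ($token) = @_;
    my ($value, $mac) = $token =~ /\A(.*):([0-9a-f]{40})\z/s or return;
    return hmac_sha1_hex($value, $secret) eq $mac ? $value : undef;
}

my $token = sign('user=42');
print defined verify($token) ? "ok\n" : "tampered\n";      # prints "ok"

(my $forged = $token) =~ s/42/43/;                          # client edits the value
print defined verify($forged) ? "ok\n" : "tampered\n";     # prints "tampered"
```

Note that a MAC only proves integrity. The value is still readable by the client, which is why the slide lists hiding data with Crypt::* modules as a separate step.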

Exceptions

• Java did this well – steal it!

• Graham Barr’s Error.pm

    try {
        do_some_stuff();
    }
    catch My::Exception with {
        my $E = shift;
        handle_exception($E);
    };

Handling DBI errors

• No need to check return codes, just use RaiseError => 1

    try {
        $sth->execute();
    }
    catch Error with {
        # roll back and recover
        $dbh->rollback();
        # etc.
    };

Templates

• Template::Toolkit

[% FOREACH item = cart.items %]

name: [% item.name %]

price: [% item.price %]

[% END %]

Hello Controller Code

    package ESF::Control::Hello;
    use strict;
    use ESF::Control;
    @ESF::Control::Hello::ISA = qw(ESF::Control);
    use ESF::Util;
    use ESF::Model::Hello;

    sub handler {
        ### do some setup work
        my $class = shift;
        my $apr = ESF::Util->get_request();

        ### instantiate the model
        my $name = $apr->param('name');
        # we create a new Model::Hello object.
        my $hello = ESF::Model::Hello->new(NAME => $name);

Hello Controller Code (cont’d)

        ### send out the view
        my %view_data = (hello => $hello->view());
        # the process_template() method is inherited
        # from ESF::Control
        $class->process_template(
            TEMPLATE => 'hello.html',
            DATA     => \%view_data,
        );
    }

• Base class wraps handler method in try block.

• process_template() sends out headers and text.

• Controller can specify a timeout in seconds, which will be used for cache control on the proxy server.
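One plausible way to turn a controller's timeout into cache-control information is to emit standard HTTP caching headers. The slides do not say which headers the eToys proxies honored, so the `Cache-Control` and `Expires` pair below is an assumption, as is the `cache_headers` helper.

```perl
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an HTTP date, independent of locale.
sub http_date {
    my ($epoch) = @_;
    my ($s, $m, $h, $d, $mo, $y, $wd) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
                   $DAY[$wd], $d, $MON[$mo], $y + 1900, $h, $m, $s);
}

# Build headers that let a caching proxy keep a page for $timeout
# seconds. $now can be passed in to make the result reproducible.
sub cache_headers {
    my ($timeout, $now) = @_;
    $now = time() unless defined $now;
    return ('Cache-Control' => 'no-cache') unless $timeout && $timeout > 0;
    return (
        'Cache-Control' => "max-age=$timeout",
        'Expires'       => http_date($now + $timeout),
    );
}

my %headers = cache_headers(300);
print "$_: $headers{$_}\n" for sort keys %headers;
```

A timeout of zero (or none at all) marks the page as uncacheable, which is the safe default for personalized pages such as the shopping cart.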

Hello Model Code

    package ESF::Model::Hello;
    use strict;

    sub new {
        my $class = shift;
        my %args = @_;
        my $self = bless {}, $class;
        $self->{'name'} = $args{'NAME'} || 'World';
        return $self;
    }

    sub view {
        # the object itself will work for the view
        return shift;
    }

    1;

Hello Template

<HTML><TITLE>Hello, My Oyster</TITLE><BODY>

[% PROCESS header.html %]

Hello [% hello.name %]!

[% PROCESS footer.html %]

</BODY> </HTML>

Performance Tuning

• Tune DBI code

• Cache like crazy

• Lazy data loading

• Avoid object creation when possible

– Re-use TT object

– Cache session and database handle for length of request ($r->pnotes)
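Lazy data loading means a Model object fetches an expensive field only when something asks for it, then keeps the result. A minimal sketch, assuming a hypothetical `Product` class whose `_fetch_description` method stands in for a database query:

```perl
use strict;
use warnings;

package Product;

our $loads = 0;    # counts simulated database fetches

sub new {
    my ($class, %args) = @_;
    return bless { id => $args{ID} }, $class;
}

# Stand-in for an expensive database query.
sub _fetch_description {
    my ($self) = @_;
    $loads++;
    return "Description for product $self->{id}";
}

# Load the description only on first access, then reuse it.
sub description {
    my ($self) = @_;
    $self->{description} = $self->_fetch_description
        unless exists $self->{description};
    return $self->{description};
}

package main;

my $product = Product->new(ID => 101);   # no fetch yet
$product->description for 1 .. 3;        # fetched once, reused twice
print "fetches: $Product::loads\n";      # prints "fetches: 1"
```

Pages that list many products but show only names never pay for the descriptions at all.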

Be careful of nested exceptions!

    my $foo;
    try {
        # some stuff…
        try {
            $foo++;
            # more stuff…
        }
        catch Error with {
            # handle error
        };
    }
    catch Error with {
        # handle other error
    };

Berkeley DB

• Beyond DB_File

– Shared memory buffer

– No opening/closing files or syncing to disk

– Locking handled for you

– Transaction support and page-level locking available

• Hard kills and segfaults in Apache/mod_perl can cause corruption

• Deadlocks happen.

– Page-level locking requires some form of deadlock handling

Suggestions for using Berkeley DB

• Database-level locking is good enough for almost anyone

– Simpler (no deadlock issues)

– Still much faster than DB_File

– Be careful of long operations with cursors

• If you have the coding resources, write a daemon for it

– Handle signals safely

– Handle deadlocks

Valuable Tools

• Debugger

• Profiler

• Ability to run system on workstations

• Dia & Rational Rose for analysis and design

An Open Source Success Story

• Scalable

• Cost effective

– No license fees

– Commodity hardware

• Customized

• Learning environment

• Pervasive technologies

• Giving back to the Open Source community

• To contact us…

– Bill Hilf - [email protected]

– Perrin Harkins - [email protected]

Thank you!
