Infrastructure Strategies for Success Behind the University of Florida's Sakai Implementation

Chris Cuevas, Systems Administrator ([email protected])
Martin Smith, Systems Administrator ([email protected])


12th Sakai Conference – Los Angeles, California – June 14-16

What is a... Design pattern

"A general reusable solution to a commonly occurring problem." [1]

[1] http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29


Patterns for… Change control, build promotion, deployment


Pattern: Baseline set of artifacts for a change

• What do we consider a complete build?
  o Version number
  o Readme file
  o Change log
  o SQL scripts
  o Sakai 'binary' distribution

• Reduces ambiguity and recovery time, and improves the chance of catching errors early
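A baseline like this can be checked mechanically before a build is accepted. The sketch below is only an illustration: the slide names the artifact types, not their file names, so every file name in `REQUIRED_ARTIFACTS` is a hypothetical placeholder.

```python
from pathlib import Path

# Hypothetical file names: the slide lists artifact types, not names.
REQUIRED_ARTIFACTS = {
    "version number": "VERSION",
    "readme file": "README",
    "change log": "CHANGELOG",
    "SQL scripts": "sql",
    "Sakai 'binary' distribution": "sakai-bin.tar.gz",
}


def missing_artifacts(build_dir):
    """Return the labels of required artifacts that are absent from build_dir."""
    root = Path(build_dir)
    return [
        label
        for label, name in REQUIRED_ARTIFACTS.items()
        if not (root / name).exists()
    ]
```

An empty list means the build is complete in the sense of the checklist; anything else names exactly what is missing.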


Pattern: build promotion process

• All changes are load tested and functionally tested against monitoring scripts (i.e. our test cluster is the same size as our prod cluster, and it is monitored like prod)

• All changes require a full two weeks of testing time, a go/no-go decision at least 4 days before the change (this allows us to announce it), and a maintenance window of at least 2 hours
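The timing rule can be turned into dates. This is a sketch under one assumption that the slide does not state: the maintenance window is scheduled no earlier than the end of the two-week testing period, and the go/no-go deadline is counted back from that window. The function name is illustrative.

```python
from datetime import date, timedelta

TESTING_PERIOD = timedelta(weeks=2)   # "a full two weeks of testing time"
ANNOUNCE_LEAD = timedelta(days=4)     # go/no-go "at least 4 days before"


def promotion_schedule(testing_start):
    """Return the earliest maintenance date and the latest go/no-go date.

    Assumes maintenance happens no earlier than the end of testing.
    """
    earliest_maintenance = testing_start + TESTING_PERIOD
    return {
        "earliest_maintenance": earliest_maintenance,
        "go_no_go_deadline": earliest_maintenance - ANNOUNCE_LEAD,
    }
```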


Pattern: Maintenance for a new build

• During a deployment/build promotion, we have two strategies:
  o Rolling restart: Quiesce nodes, upgrade them, and reintroduce them
  o Full outage: Stop all nodes, upgrade in chunks, apply any SQL, and start them all
• Session replication is key here for seamless upgrades (and with Sakai, we don't have it).
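The rolling-restart strategy can be sketched as a loop over batches of nodes. The `quiesce`, `upgrade`, and `reintroduce` callables are hypothetical hooks supplied by the caller, not part of Sakai or of any tool named in the slides.

```python
def rolling_restart(nodes, quiesce, upgrade, reintroduce, batch_size=1):
    """Upgrade nodes batch by batch so the remaining nodes keep serving.

    quiesce, upgrade and reintroduce are caller-supplied hooks
    (hypothetical; e.g. drain the node in the dispatcher, install the
    build, put the node back into rotation).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size >= len(nodes):
        # Taking every node out at once is a full outage, not a rolling restart.
        raise ValueError("batch_size must be smaller than the cluster")
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            quiesce(node)
        for node in batch:
            upgrade(node)
        for node in batch:
            reintroduce(node)
```

Without session replication, users on a quiesced node still lose their sessions once it is upgraded, so quiescing in practice means waiting for sessions to drain.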


Patterns for… Other Software (OS/DB/etc patches, updates)


Patterns: Other updates

• High-risk packages are identified and only updated by those who know the application best
• All other packages are updated (at least) quarterly
• Database patches are done best-effort (for now)
• Rarely, infrastructure-wide changes will affect a particular service worse than others
• We reserve a weekly maintenance window
• Least well understood at this time


Patterns for… Traffic Management


Pattern: Application stack

• User traffic dispatching
  o Sticky TCP traffic to Apache httpd frontends based on perceived health
  o Cookie-based route from httpd to Tomcat, with ability to select a node
  o Both of these fail to failover session information well

• We’re considering a design pattern where we combine the httpd+tomcat stack and do full NAT dispatching so that we can get more change flexibility

• Compare other architectures
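Cookie-based routing to Tomcat typically relies on the `jvmRoute` suffix that Tomcat appends to the session ID after a dot (for example `ABC123.app02`). The sketch below shows the idea in isolation; the backend names and the fallback rule are illustrative assumptions, not the actual httpd configuration used at UF.

```python
def route_from_session_id(session_id):
    """Return the route suffix of a Tomcat-style session ID, or None."""
    _, sep, route = session_id.rpartition(".")
    return route if sep and route else None


def pick_backend(session_id, backends, healthy):
    """Pick a backend address for a request.

    backends maps route name -> address; healthy is a set of route names.
    Both are hypothetical inputs for this illustration.
    """
    route = route_from_session_id(session_id) if session_id else None
    if route in backends and route in healthy:
        return backends[route]
    # No usable route: fall back to any healthy node. Without session
    # replication the user's session does not follow them there.
    for name in sorted(healthy):
        if name in backends:
            return backends[name]
    return None
```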


Current cluster layout


Current cluster layout as two sites


Site-local dispatching


Combining more of the stack


Pattern: Resource clustering

• Database failover is automatic now with Oracle & JDBC
• File tier still doesn't do failover in any nice way
• Application+web tier no longer has complex dependencies
• (All state for a user lives on a single server now)
• Split presence across two sites for database (Data Guard), file storage (EMC Celerra), app/web tier (VMware)


Patterns for… Monitoring and logging


Pattern: System health checks

• Overall:
  o Fully synthetic login to Sakai
  o Cluster checks on Apache and Tomcat (more than X out of Y servers in the cluster in a bad state)
  o Wget?
• Individual server checks for web, app, db tiers
  o Database connection pool
  o Clock, SNMP, Ping, Disk
  o Java processes, Apache configtest
  o AJP and Web response time and status codes
  o Replication health, available storage growth
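The "more than X out of Y servers in a bad state" cluster check reduces to a threshold over per-server results. A minimal sketch, assuming the individual checks have already produced a pass/fail value per server; the function and state names are illustrative.

```python
def cluster_state(statuses, max_bad):
    """Summarize per-server health into one cluster-level state.

    statuses maps server name -> True (healthy) or False (bad).
    The cluster is CRITICAL when more than max_bad servers are bad.
    """
    bad = sorted(name for name, ok in statuses.items() if not ok)
    state = "CRITICAL" if len(bad) > max_bad else "OK"
    return state, bad
```

Alerting on the cluster rather than on each node keeps a single bad server from paging anyone while the service as a whole is still healthy.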


Pattern: Interventions

• Fully automated functional test that authenticates and requests some course sites
• Response time is as important as success or failure
• We're hesitant to automatically restart application nodes, since session replication isn't available; this would be a major interruption to our users
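Treating response time as equal in weight to success or failure means a check needs three outcomes rather than two. In the sketch below, `request` stands in for the synthetic login and course-site requests (a hypothetical callable that returns a truthy value on success), and the result names are illustrative.

```python
import time


def timed_check(request, max_seconds, clock=time.monotonic):
    """Run a functional check and classify it as OK, SLOW or FAIL.

    request is a caller-supplied callable (hypothetical) that performs
    the synthetic login and returns a truthy value on success.
    """
    start = clock()
    try:
        ok = bool(request())
    except Exception:
        ok = False
    elapsed = clock() - start
    if not ok:
        return "FAIL", elapsed
    if elapsed > max_seconds:
        return "SLOW", elapsed
    return "OK", elapsed
```

The check only reports; consistent with the caution above, it does not restart anything.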


Pattern: Collecting data

• Collect the usual suspects
• Sakai events, automatic (?) thread dumps to detect stuck processes, server-status results
• Sakai health: .jsp file that dumps many data points (JVM memory, ehcache stats, database pools, etc)
• Anything we can pull from the JVM or Sakai APIs, we'll collect with that .jsp file and collectd
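One way to feed such a health page into collectd is through its exec plugin, which reads `PUTVAL` lines from a script's standard output. The sketch below assumes the .jsp prints one `name=value` pair per line; that output format, the `exec-sakai` plugin instance, and the metric names are assumptions for illustration, since the slides do not show the page.

```python
import re


def parse_health(text):
    """Parse 'name=value' lines (an assumed format) into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric value, skip it
    return metrics


def to_putval(host, metrics, interval=60):
    """Format metrics as PUTVAL lines for collectd's exec plugin."""
    lines = []
    for name, value in sorted(metrics.items()):
        # Keep identifiers to letters, digits and underscores.
        safe = re.sub(r"[^A-Za-z0-9_]", "_", name)
        lines.append(
            'PUTVAL "%s/exec-sakai/gauge-%s" interval=%d N:%s'
            % (host, safe, interval, value)
        )
    return lines
```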


Pattern: Application responsiveness

• Also known as, "Get close to the user"
• Bug reports are aggregated using a shared mailbox; we send daily/weekly/yearly reports with buckets for browser, user, course site, tool, stack trace hash, etc
• Redirection for 4XX/5XX HTTP status codes as much as possible, with explanations
• Timeouts for long-running activities, to make sure traffic isn't waiting forever
• Watch for AJP errors from specific application servers


Summary of weekly Sakai bug reports for 2011-06-12:

browser-id => count:
  Mac-Mozilla => 377
  Win-InternetExplorer => 356
  Win-Mozilla => 194
  UnknownBrowser => 33
  empty => 12

service-version => count:
  [r329] => 967
  empty => 8

user => count:
  atorres78 (Alina Torres) => 32
  lisareeve (Lisa Jacobs) => 26
  ziggy41 (Stefan Katz) => 15
  ngrosztenger (Nathalie Grosz-Tenger) => 14
  agabriel2450 (Gabriel Arguello) => 12

stack-trace-digest => count:
  41D7C94702B20B270953EBB00ECA9F5C1388A393 => 180
  DEB88C2307DA572C9C1EFE1E8E17828DC29A7C00 => 154
  A600DAE1792C82B1472C9980EED8938E5F39B4F0 => 88
  15963E2F2314286E1BC1A24DF953560B7845BDCE => 33
  042CF39E8D34570CD3D79152B757A090AB6AB39F => 24

app-server => count:
  sakaiapp-prod06.osg.ufl.edu => 154
  sakaiapp-prod02.osg.ufl.edu => 146
  sakaiapp-prod04.osg.ufl.edu => 118
  sakaiapp-prod05.osg.ufl.edu => 96
  sakaiapp-prod03.osg.ufl.edu => 83
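A bucketed summary of this shape can be produced with a counter per field. This is a sketch, not the script behind the report above: the reports are assumed to be already parsed into dictionaries, and SHA-1 is a guess based on the 40-hex-digit digests shown.

```python
import hashlib
from collections import Counter


def stack_trace_digest(trace):
    """Hash a stack trace so identical traces fall into the same bucket.

    SHA-1 is an assumption; the report only shows 40-hex-digit digests.
    """
    return hashlib.sha1(trace.encode("utf-8")).hexdigest().upper()


def summarize(reports, fields, top=5):
    """Count the most common values of each field across bug reports."""
    lines = []
    for field in fields:
        counts = Counter(r.get(field) or "empty" for r in reports)
        lines.append("%s => count:" % field)
        lines.extend("  %s => %d" % (v, n) for v, n in counts.most_common(top))
    return "\n".join(lines)
```

Bucketing by digest rather than by full trace is what lets the same underlying bug show up as one line with a count.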


Patterns for… Backup and recovery


Pattern: Backing up for DR

• File tier is backed up every 4 hours, with a 2-week retention window

• Database tier is backed up daily, with archived redo logs every 4 hours, and a 2-week retention window
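The 4-hour interval and 2-week retention window translate directly into two checks: which backups have aged out, and whether the newest one is overdue. A minimal sketch, assuming backup timestamps are available as `datetime` values; the function names are illustrative.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=2)        # "2 week retention window"
BACKUP_INTERVAL = timedelta(hours=4)  # "every 4 hours"


def expired(backup_times, now):
    """Return the backups that have fallen outside the retention window."""
    return [t for t in backup_times if now - t > RETENTION]


def is_stale(backup_times, now):
    """True when there is no backup newer than the expected interval."""
    return not backup_times or now - max(backup_times) > BACKUP_INTERVAL
```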


Pattern: Backing up user data

• We hope this will come from application-specific operations to back up and restore (and delete!) user-specific data

• Can't do a full restore of your files and database every time a user deletes a site by accident

• Strive for reasonable windows of retention (e.g. hardware, software, application-level data)

• This is supposedly coming in Sakai 2.x


Pattern: Multi-site replication

• Database and file tier are both replicated to a 2nd site; the file tier is also redundant internally, but some manual intervention is still required there


Pattern: Bringing production to test

• We use 'snapshot standby' in Oracle RDBMS to take read-consistent copies of production for reloading test and development copies
• We use rsync to copy over the file storage tier
• With our full set of build artifacts from earlier, we can always build a complete version of what's in prod
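The rsync step might be wrapped as below. The slides do not give the actual command, so the flags are an assumption: `-a` preserves file attributes, and `--delete` removes files on the test side that no longer exist in production. Paths are placeholders supplied by the caller, and the default is a dry run.

```python
import subprocess


def rsync_command(source, dest, dry_run=True):
    """Build an rsync command that mirrors source into dest.

    The flag choice (-a, --delete) is an assumption for illustration.
    """
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash copies the directory's contents, not the directory itself.
    src = source if source.endswith("/") else source + "/"
    return cmd + [src, dest]


def refresh_test_files(source, dest, dry_run=True):
    """Run the mirror; raises CalledProcessError if rsync fails."""
    return subprocess.run(rsync_command(source, dest, dry_run), check=True)
```

Because `--delete` is destructive on the destination, reviewing a dry run before the real copy is the safer habit.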


Thank you! Questions?