No Fallen ANGELs! Redundancy, Backup, Recovery

Andrea Chappell: University of Waterloo
Adam Hauerwas: Providence College
Ruomiao Wang & Jie Li: Kelley Direct, Indiana University
Terry O'Heron & Crystal Foust: Penn State

Agenda

• How do you back up and archive courses?
• What policies and procedures guide your response to requests to recover a course, a file, an internal ANGEL page, or a student upload file?
• How do you protect your system from various failures, and in what time do you “promise” to have it back online?

University of Waterloo (Andrea)

• ANGEL has been the centrally supported LMS since summer 2004.
• Core to university business.
• Need to configure against various types of failures, e.g.:
  • Disaster (fire, flooding, etc.)
  • Partial system failure (ANGEL/IIS or SQL server systems, disks, etc.)

Constraints (what we can’t change)

• Support coverage is not 24x7: Central IT (IST) provides extended support for critical systems but not 24x7 support.
• Cannot survive lengthy power outages.
• Cannot survive some network outages.
  • Network support is also not 24x7.

Backup Processes

System data backup
• Database (dump of db file), transaction logs (cut once per day), and upload files backed up nightly by campus backup service.

Course archives
• Long term: Archive courses at end of term.
• Shorter term: Remove from system after 4 terms. (Note: to offer a course again, copy the course rather than reuse the same instance.)
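The retention rule above (archive at end of term, remove after 4 terms) can be sketched as a small selection function. This is only an illustration, not Waterloo's actual procedure: the `Course` record, the consecutive `term_index` numbering, and the reading of "after 4 terms" as "at least four terms have elapsed since the course ran" are all assumptions.

```python
from dataclasses import dataclass

# From the slide: courses are removed from the system after 4 terms.
RETENTION_TERMS = 4


@dataclass(frozen=True)
class Course:
    """Hypothetical course record; not an ANGEL data structure."""
    code: str
    term_index: int  # assumed: terms are numbered consecutively (0, 1, 2, ...)


def courses_to_remove(courses, current_term, retention=RETENTION_TERMS):
    """Return courses that ran at least `retention` terms before `current_term`.

    Assumes each returned course was already archived at the end of its term.
    """
    return [c for c in courses if current_term - c.term_index >= retention]


if __name__ == "__main__":
    # Placeholder course codes, for illustration only.
    catalogue = [Course("COURSE-A", 0), Course("COURSE-B", 2), Course("COURSE-C", 4)]
    for course in courses_to_remove(catalogue, current_term=5):
        print("remove from system:", course.code)
```

Keeping the retention period as a parameter makes it easy to compare policies, for example a term-based rule like this one against a day-based rule.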

Recovery Process

• Recover data to dev system and copy lost data to production.
  • This can be very complex if the missing data is a quiz that was run, a bulletin board, etc.!
• Currently no policies on what to recover, or promise of time to recovery. Requests considered on individual basis.

Protecting against failures

Current strategy: Buy robust equipment, configure to minimize points of failure.

Production Systems: ANGEL/IIS (Dell server), SQL Server (Dell server)
• Dual RAID disks
• Dual power supply
• 7x24, 4-hour hardware support (from vendor)
• Housed in access-controlled machine room
• Uninterruptible power supply (UPS)

Development System: ANGEL/IIS and SQL server

Vulnerabilities in Current Strategy

• Failure of the ANGEL/IIS or SQL Server hardware, e.g., system motherboard failure
  • Don’t have a ready backup machine.
    • Could temporarily use development system.
  • Likely a minimum of half a day of downtime.
• Machine room “fire”
  • All hardware lost.
  • Up to one day of lost data (if 24 hours from last backup).
  • Days of downtime!
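The "up to one day of lost data" figure follows from the nightly backup cycle: anything written after the last completed backup is gone if all hardware is lost. A minimal sketch of that arithmetic follows; the function name and the 01:00 backup time are hypothetical, since the slides only say the backups run nightly.

```python
from datetime import datetime


def data_at_risk(backup_times, failure):
    """Return the span of work lost if everything is destroyed at `failure`.

    `backup_times` are completion times of backups that survive off the
    failed hardware; only backups finished before the failure count.
    """
    earlier = [t for t in backup_times if t <= failure]
    if not earlier:
        raise ValueError("no backup completed before the failure")
    return failure - max(earlier)


if __name__ == "__main__":
    # Assumed schedule: one backup per night, completing at 01:00.
    nightly = [datetime(2005, 5, 9, 1, 0), datetime(2005, 5, 10, 1, 0)]
    # A failure shortly before the next backup approaches the one-day worst case.
    print(data_at_risk(nightly, datetime(2005, 5, 11, 0, 30)))
```

Shortening the interval between backups (or between transaction log cuts) shrinks this window proportionally, which is the motivation for the configurations that follow.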

Configurations under Investigation

• Looking for faster recovery time, less potential data loss, through increased redundancy.
• Config 1: Identical production and development systems, different locations.
• Config 2: Identical production and dev systems, shared data (data filer), load balancer (Cisco), different locations.

Config 1

Identical production and development systems, different locations.

Diagram components: ANGEL/IIS (Dell server), SQL Server (Dell server)

Gains:
• In system failure:
  • If possible, move disks to duplicate system: 4 working hours.
  • Or, recover data to duplicate systems: perhaps 8 working hours.

Issues:
• People intervention still required.

Cost:
• Two new systems.

Config 2

Identical prod and dev systems, shared data, load balancer, different locations.

Diagram components: Load Balancer, ANGEL/IIS (Dell server) x2, SQL Server (Dell server), Data Filer

Gains:
• Failure of one ANGEL/IIS system: instantaneous failover to the remaining system.
• Failure of SQL Server: reconfigure dev system to point to data filer.

Issues:
• Single point of failure unless filer clustered.
• Greater complexity may cause downtime.

Cost:
• 3 new systems, plus filer (~$30 USD)
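The failover gain in Config 2 depends on the load balancer sending requests only to ANGEL/IIS systems that are still up. The sketch below shows the general idea with a round-robin picker that skips servers marked as down. It is a generic illustration, not the behaviour of the Cisco device named in the slides; the class, method names, and server names are hypothetical.

```python
class RoundRobinBalancer:
    """Toy round-robin selector that skips backends marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        """Record that a health check failed for `backend`."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Record that `backend` is passing health checks again."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    # Hypothetical names for the two ANGEL/IIS systems at different locations.
    lb = RoundRobinBalancer(["angel-iis-a", "angel-iis-b"])
    print(lb.pick(), lb.pick())  # traffic alternates while both are up
    lb.mark_down("angel-iis-a")
    print(lb.pick(), lb.pick())  # all traffic goes to the remaining system
```

Note that the balancer itself and the shared data filer remain single points of failure in this picture, which matches the issue listed on the slide.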

Providence College (Adam)

• As at Waterloo, ANGEL is our LMS; we have used it since Fall 2001.
• Support coverage is not 24x7.
• Cannot survive lengthy power outages or network outages.

PC Backup and Recovery

System data backup
• Back up database and logs to files once per day.
• Use Tivoli to back up both DB and file system nightly.
• Creates “backup of a backup.”

Course archives
• Short term: Archive courses 90 days after term end.
• Long term: Store archives to DVD.

Recovery
• Like Waterloo, recover Production database in Development environment.

PC’s Redundancy

Today: Robust Production Server

Production System: ANGEL IIS/SQL (HP DL380)
• Multiple RAID disks (System, DB, Data)
• Dual power supplies and NICs
• Access-controlled machine room
• UPS

Development System: ANGEL IIS/SQL (Desktop)

IBM Storage Area Network

PC’s Future Architecture

This Summer: New Server and SAN

Production System: ANGEL IIS/SQL (New HP)
• Purchase new server and install O/S and SQL Server on local RAID.
• Store database and web files on SAN disk.
• In the event of Production hardware failure, connect Production disk to Development server with little downtime.

Development System: ANGEL IIS/SQL (Old HP)

Kelley Direct On-Line Programs, Indiana University (Ruomiao)

Road to ANGEL

• Piloted ANGEL as LMS in Fall 2003
• Spring 2004: all courses delivered via ANGEL
• Critical learning platform that connects KD to the students

Current Data Protection Measures

Backup

System Backups
• Full backups once a week starting Friday night
• Differential backups every night around 11 PM

Database Backups
• Full ANGEL SQL database backup every night at 10 PM. The database backup output files are then backed up by system tape backups for that night.
• Transaction log backups every six hours.

The backup tapes are then taken to an offsite location.
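The schedule above implies which backup sets are needed after a failure: for the system backups, the latest weekly full plus the newest differential taken after it; for the database, the latest nightly full plus every transaction log backup taken after it, applied in order. The sketch below picks those sets from lists of completion times. It is a hedged illustration only: the function names and the example clock times for the log backups are assumptions (the slides say "every six hours" without giving times), and it does not call any real backup tool.

```python
from datetime import datetime


def latest_before(times, cutoff):
    """Most recent timestamp in `times` that is not after `cutoff`, or None."""
    earlier = [t for t in times if t <= cutoff]
    return max(earlier) if earlier else None


def system_restore_set(fulls, differentials, failure):
    """Latest full backup plus the newest differential taken after that full."""
    full = latest_before(fulls, failure)
    if full is None:
        raise ValueError("no full backup precedes the failure")
    diff = latest_before([d for d in differentials if d > full], failure)
    return [full] if diff is None else [full, diff]


def database_restore_set(fulls, logs, failure):
    """Latest full database backup plus every later log backup, oldest first."""
    full = latest_before(fulls, failure)
    if full is None:
        raise ValueError("no full database backup precedes the failure")
    return [full] + sorted(t for t in logs if full < t <= failure)


if __name__ == "__main__":
    failure = datetime(2005, 5, 11, 9, 0)  # hypothetical failure time
    fulls = [datetime(2005, 5, 6, 23, 0)]  # weekly full, Friday night
    diffs = [datetime(2005, 5, d, 23, 0) for d in (7, 8, 9, 10)]
    print(system_restore_set(fulls, diffs, failure))
```

Under these assumptions, changes made after the last log backup in the returned set are not recoverable from backups, so the six-hour log interval bounds the database loss window.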

Current System Protection Measures

Disk
• Configured with RAID 5 with a spare disk

Dual power connections

UPS System connection (30 min.)

Spare Chassis
• Test server has identical hardware and serves as a spare chassis

Current Recovery Practices

File or Database Restore
• Restore from disk, tape backups, or individual developers’ machines.

System Component Failure
• Replace the faulty component(s) from the spare chassis (test server) or move the entire disk array from production to the test server.

Total System Failure or Disk Array Failure
• Rebuild entire system, possibly on alternate hardware.
• All the ANGEL components will either need to be installed from scratch or restored from backup tapes. Some system components have to be reconfigured manually.

Challenges for KD ANGEL Environment

Security
• ANGEL web server resides on the same physical machine that hosts the ANGEL databases

Scalability
• Limited capability to scale performance based on volume

Availability
• No redundancy built in. Single server design. Any component failure means downtime

Shrinking Maintenance Window (or do we still have one?)

(continued on next slide)

Challenges for KD ANGEL Environment

Storage Capacity
• Limited expansion capability

Recoverability
• Single copy of production data on disk. Tape restoration is time consuming and means data loss

Availability
• No redundancy built in. Single server design. Any component failure means downtime

Growth
• Significant enrollment growth is expected for the programs in the next three years

Development Environment
• Developers are coding on own machines. Configurations differ from production environment. Less efficient.

Some Questions

• How can backend infrastructure better support the vision of the on-line programs?
• How to plan system capacity when the program changes (such as enrollment growth)?
• How to better protect student data?
• What are the available options for long-term data retention?
• How to better meet the requirements for less service interruption?
• What should we do to ensure a faster ANGEL system recovery?

Penn State Environment (Terry, Crystal)

• Support coverage is 24x7
• Backup power (generator)
• Redundant network connectivity
• Failover capability
• Mirrored storage
• Daily backups/off-site storage
• Daily maintenance (5-7 am)
• Archive (courses, inactive groups)

Constraints

• Backup
  • SQL: 3 hours
  • File: 3-4 days
• Restoration
  • SQL: 1.5 hours
  • File: 2 min. - ??

ANGEL Production Environment

eND Load Balancer
eND Load Balancer (Failover)
Dell PE1650 (WIN2K), (2) 1.4 GHz, 2.5 GB RAM
Dell PE1650 (WIN2K), (2) 1.4 GHz, 2 GB RAM

Web Server 1: Dell PE1850, (2) 3.2 GHz, 3 GB RAM
Web Server 2: Dell PE1850, (2) 3.2 GHz, 3 GB RAM
Web Server 3: Dell PE1850, (2) 3.2 GHz, 3 GB RAM
Web Server 4: Dell PE1750, (2) 3.0 GHz, 4 GB RAM
Web Server 5: Dell PE1850, (2) 3.2 GHz, 3 GB RAM
Web Server 6: Dell PE1850, (2) 3.2 GHz, 3 GB RAM
Web Server 7: Dell PE1850, (2) 3.2 GHz, 3 GB RAM

SQL Server: IBM xSeries 445, (8) 2.7 GHz, 16 GB RAM
SQL Server Failover: IBM xSeries 445, (8) 2.7 GHz, 16 GB RAM

File Server: Dell PE2650, (2) 3.06 GHz, 8 GB RAM
File Server (Failover): Dell PE2650, (2) 2.8 GHz, 4 GB RAM