3178 24 x 7 StarTeam
Randy Guck, Chief Scientist, DSP
Borland
Overview
High availability fundamentals
– How available is highly-available?
– High availability at what price?
– Enemies of high availability
Overview
StarTeam high availability best practices
– Administrative practices
– Flash (demand peak) control
– Backup procedures
– Redundancy
– Failover and clustering
– Disaster recovery and replication
High Availability Fundamentals: How available is highly-available?
A Distorted Term
Depending on who you ask, high availability means:
– 24 x 7 uptime
– Clustering
– Failover
– Online backups
– “Five nines” or “Six sigma”
– Currently not dating
Availability by the Numbers
% Uptime                  % Downtime   Downtime per year   Downtime per week
99%                       1%           3.65 days           1 hour, 41 mins
99.9%                     0.1%         8 hours, 45 mins    10 mins, 5 secs
99.99%                    0.01%        52.5 mins           1 minute
99.999%                   0.001%       5.25 mins           6 seconds
99.9999% (“six sigma”)    0.0001%      31.5 seconds        0.6 seconds
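The table values follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a 7-day week (the function and constant names are illustrative, not from the talk):

```python
# Downtime implied by an uptime percentage, assuming a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(uptime_percent, period_seconds):
    """Return the seconds of downtime allowed in a period at the given uptime."""
    return (100.0 - uptime_percent) / 100.0 * period_seconds


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99, 99.999, 99.9999):
        per_year = downtime_seconds(uptime, SECONDS_PER_YEAR)
        per_week = downtime_seconds(uptime, SECONDS_PER_WEEK)
        print(f"{uptime}%: {per_year / 60:.2f} min/year, {per_week:.1f} s/week")
```

Small differences from the table (for example 52.56 versus 52.5 minutes) come from rounding on the slide.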
The Myth of the Nines
Most people want more than they need
Actual reliability difficult to compute (complex mathematics)
Example: 99.99% reliability (downtime=52 minutes/year) of 7 components results in 99.93% (downtime=6 hours/year).
Downtime often affected by future, unforeseeable business decisions
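The 7-component example can be checked with a short calculation. This sketch assumes the components are in series (every one must be up) and fail independently, which is the simplest model and not necessarily how a real deployment behaves:

```python
# Combined availability of components in series, assuming independent failures.
HOURS_PER_YEAR = 365 * 24


def serial_availability(component_availability, count):
    """Availability of `count` identical components that must all be up."""
    return component_availability ** count


combined = serial_availability(0.9999, 7)
downtime_hours = (1.0 - combined) * HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"combined availability: {combined * 100:.2f}%")
    print(f"downtime: {downtime_hours:.1f} hours/year")
```

Each component alone allows about 52 minutes of downtime per year, but the chain of seven allows roughly six hours.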
MTBF versus MTTR
MTBF = mean time between failures
MTTR = mean time to repair
Availability:
A = MTBF / (MTBF + MTTR)
Availability is good if MTTR is low
99.9999% availability (six sigma) = 6 mins downtime in 11.4 years!
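The formula is easy to experiment with. A minimal sketch, where the units (minutes) and the sample figures are chosen only to reproduce the six-sigma example above:

```python
# Availability from MTBF and MTTR; both must be in the same time unit.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


# One 6-minute repair every 11.4 years is roughly 99.9999% availability.
six_sigma_example = availability(11.4 * MINUTES_PER_YEAR, 6)

# For a fixed MTBF, a shorter repair time gives higher availability.
slow_repair = availability(1000, 10)
fast_repair = availability(1000, 1)

if __name__ == "__main__":
    print(f"{six_sigma_example * 100:.5f}%")
    print(f"{slow_repair * 100:.2f}% vs {fast_repair * 100:.2f}%")
```

The second comparison illustrates the slide's point: with the same failure rate, cutting MTTR is what raises availability.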
A Better Approach
Focus on scenarios and probabilities
– Examine organization’s needs
– Identify possible service disruptions
– Prioritize failures by probability
– Address scenarios on a cost/benefit basis
– Test each failure scenario
The result is your high availability plan!
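One way to make the prioritization concrete is to rank scenarios by probability and keep those whose expected loss exceeds the cost of mitigating them. This is only a sketch: the `Scenario` fields, the expected-loss model, and every number below are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # hypothetical chance of occurring in a year
    outage_cost: float         # hypothetical cost if the outage happens
    mitigation_cost: float     # hypothetical yearly cost of mitigation

    @property
    def expected_loss(self):
        return self.annual_probability * self.outage_cost

    @property
    def net_benefit(self):
        return self.expected_loss - self.mitigation_cost


def build_plan(scenarios):
    """Order scenarios by probability, keeping those worth mitigating."""
    ranked = sorted(scenarios, key=lambda s: s.annual_probability, reverse=True)
    return [s for s in ranked if s.net_benefit > 0]


scenarios = [
    Scenario("disk failure", 0.30, 20_000, 2_000),
    Scenario("site flood", 0.01, 500_000, 30_000),
    Scenario("power outage", 0.50, 8_000, 1_500),
]
plan = build_plan(scenarios)

if __name__ == "__main__":
    for s in plan:
        print(f"{s.name}: net benefit {s.net_benefit:.0f}")
```

With these made-up figures the flood scenario drops out of the plan because mitigation costs more than the expected loss; a real plan would also weigh factors that do not reduce to a single number.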
High Availability Fundamentals: High availability at what price?
Availability versus Investment
[Chart: availability plotted against investment, with the following steps in order]
• Basic administrative practices
• Demand peak management (flash control)
• Backup procedures
• Redundancy (no SPOFs)
• Failover management
• Disaster recovery planning
What kind of systems need the highest availability?
Life-rated systems
– Space shuttle onboard systems
– Emergency response systems
– Command-and-control systems
High financial cost systems
– Stock-trading systems
– Reservation systems
– Banking systems
ALM High Availability in Perspective
Although ALM systems are becoming more mission-critical, they do not have the same financial or “loss of life” impact as some systems, so it doesn’t make sense to model high availability after them.
Bottom line: Strike a reasonable balance between investment and high availability
High Availability Fundamentals: Enemies of availability
Infrastructure Issues
Hardware failures
– Disk, CPU, memory, power supply, fans, network card, motherboard, disk controller, etc.
Environmental failures
– Power, cooling, fire, flood, hurricane, earthquake, terrorism, etc.
Infrastructure Issues
Network outages
– LAN outages: switch/cable failures (server-to-DB network segment, client-to-server network segment, etc.)
– WAN outages: VPN failure, ISP failure, physical network issues, etc.
– Service outages: DNS, DHCP, directory server, email, etc.
Infrastructure Issues
Database outages
– Out-of-disk issues, recovery time after reboot, index corruption, etc.
Bandwidth issues
– Network congestion, database congestion, resource starvation
Denial-of-service attacks
– Viruses, worms, DDoS
Application Issues
Application brown-outs
– Locking/bottleneck issues, demand peaks, etc.
Application outages
– Hangs, fatal exceptions, out-of-memory
Scheduled outages
– Offline backups, application patches, database upgrades, etc.
Plan of Attack
To a specific user, a service is “down” when it is not available for any reason
A comprehensive high availability plan must consider all potential outages from end-to-end, on a cost/benefit basis
StarTeam High Availability Best Practices
Administrative Practices
Administrative best practices top 10 list
#10: Don’t be cheap
#9: Enforce security
#8: Centralize your servers
#7: Enforce change control
#6: Document everything
Administrative Practices
Administrative best practices top 10 list
#5: Test everything
#4: Design for growth
#3: Choose mature software
#2: Choose mature hardware
#1: K.I.S.S.
Flash Control
Client/server systems have natural demand peaks
Peaks are often time-based; e.g.:
– Everyone logs in in the morning
– Big reports launched just before lunch
Peaks are often calendar-based; e.g.:
– End-of-week builds
– End-of-month reports
Client/Server Architecture
[Diagram: StarTeam clients talk to the StarTeam Server through the command API; the server sits in front of the DB and the vault. A callout marks the demand peak congestion areas.]
All information is pulled by clients using a request/reply command API
StarTeamMPX
[Diagram: StarTeam clients, StarTeam Server, DB, vault, and Message Broker, connected by an event publish stream.]
Updated objects are pushed to clients, preventing poll and refresh requests and smoothing demand peaks
New for 7.0: MPX Cache Agent
[Diagram: StarTeam clients, StarTeam Server, DB, vault, Message Broker, Cache Agent, and encrypted caches, connected by a file publish stream and check-out requests.]
The Cache Agent is trickle-charged with file contents, providing an alternate check-out source for remote clients.
Backups for High Availability
Mirroring does not replace backups
Backups are an important part of high availability
Test integrity of backups periodically
Consider a rotating/hierarchical storage system, which can serve disaster recovery scenarios
StarTeam Backups
StarTeam 6.0 backup procedure:
– Lock the server
– Back up the database and vault
  • Disk-to-disk and differential dumps can speed things up
– Unlock the server
Why does the server need to be locked?
Review: StarTeam 6.0 Vault
[Diagram: The archive folder (single volume) stores text files as a base version plus deltas (Delta 1, Delta 2, Delta 3, …) and binary files as a base version plus revisions (Rev 1, Rev 2, Rev 3, …). The cache folder (single volume) holds uncompressed full versions.]
New StarTeam 7.0 Vault
[Diagram: StarTeam Server, DB, and a vault made up of multiple hives (Hive, Hive, …, Hive) with a hive index.]
7.0 Vault: Inside the Hive
[Diagram: Each hive has an Archive Root and a Cache Root, each divided into subfolders 00/0 through ff/f. Storage is MD5-based. Archive files are stored compressed (e.g., 000a807b9f393f58a69998b2cd7db7d2.gz) or uncompressed (e.g., fff16c26e911ac72abad5557ac44d84c); cache files are uncompressed (e.g., 000a807b9f393f58a69998b2cd7db7d2).]
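The hive layout can be sketched as a mapping from an MD5 digest to a relative path. Two assumptions are made here that the slide does not state outright: that the digest is computed over the file contents, and that the two folder levels are the first two hex digits followed by the third (inferred from the 00/0 … ff/f labels and the sample file names). The function names are illustrative, not StarTeam APIs.

```python
import hashlib
from pathlib import PurePosixPath


def path_for_digest(digest, compressed=False):
    """Map an MD5 hex digest to a hive-relative path such as 00/0/<digest>.gz.

    Assumption: folder levels are digest[0:2] and digest[2].
    """
    name = digest + (".gz" if compressed else "")
    return PurePosixPath(digest[:2]) / digest[2] / name


def path_for_content(content, compressed=False):
    """Assumption: the file name is the MD5 of the file's bytes."""
    return path_for_digest(hashlib.md5(content).hexdigest(), compressed)


if __name__ == "__main__":
    print(path_for_digest("000a807b9f393f58a69998b2cd7db7d2", compressed=True))
    print(path_for_content(b"example file contents"))
```

Because the name depends only on the content, a file with a given name never needs to change once written, which is what makes the on-line backup procedure possible.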
StarTeam 7.0 On-line Backup Procedure
The new vault allows on-line backups:
1. Back up the database on-line
2. When complete, back up archive and attachment folders
 2.1 Perform full backups weekly
 2.2 Perform incremental backups daily
No need to lock the server!
StarTeam 7.0 Recovery Procedure
To recover a full StarTeam configuration:
1. Reload the database
2. Simultaneously reload archive and attachment folders:
 2.1 Load latest full backup
 2.2 Load all incrementals since last full backup in parallel
Modify this procedure for partial recoveries
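Step 2 of the recovery procedure amounts to picking the latest full backup plus every incremental taken after it. A minimal sketch, assuming each backup is recorded as a hypothetical `(date, kind)` pair; this is bookkeeping only and does not call any StarTeam or backup-tool API:

```python
from datetime import date


def backups_to_restore(backups):
    """Return (latest_full, incrementals_since) from (date, kind) records.

    `kind` is "full" or "incremental"; the record format is hypothetical.
    """
    fulls = [b for b in backups if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    latest_full = max(fulls, key=lambda b: b[0])
    incrementals = sorted(
        b for b in backups if b[1] == "incremental" and b[0] > latest_full[0]
    )
    return latest_full, incrementals


history = [
    (date(2005, 3, 6), "full"),
    (date(2005, 3, 7), "incremental"),
    (date(2005, 3, 13), "full"),
    (date(2005, 3, 14), "incremental"),
    (date(2005, 3, 15), "incremental"),
]
full, incrementals = backups_to_restore(history)

if __name__ == "__main__":
    print("full:", full[0])
    print("incrementals:", [d for d, _ in incrementals])
```

The dates are made up. Since vault files are written once, the selected incrementals do not overwrite one another, which is why the slide allows loading them in parallel.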
Reducing SPOFs
Servers
– Dual power supplies, ECC/mirrored memory, dual fans, etc.
Storage
– Dual controllers, mirrored/RAID disks
Network
– Dual network cards, redundant switches, dual ISP connections, etc.
Redundant Everything
[Diagram: StarTeam Server and Database Server, each with dual controllers, dual NICs, ECC memory, dual fans, etc., connected through two switches, with RAID vault disks and RAID DB disks.]
Failover Checklist
At least two identically configured systems
Shared disks
Network connections
– Heartbeat/server network
– Client-facing service network
– Optional: administrative network
Failure Management System (FMS)
Cluster set: app, db connections, IP address
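The role of the heartbeat network can be illustrated with the decision a failure management system has to make: declare the active node dead once enough consecutive heartbeats are missed. This is a sketch only. The function names, the 5-second interval, and the three-miss threshold are hypothetical and do not describe any particular FMS product.

```python
def missed_heartbeats(last_heartbeat, now, interval):
    """Whole heartbeat intervals elapsed since the last heartbeat was seen."""
    return int(max(0.0, now - last_heartbeat) // interval)


def should_fail_over(last_heartbeat, now, interval=5.0, max_missed=3):
    """True once the active node has missed `max_missed` heartbeats in a row.

    Times are in seconds; the defaults are illustrative values.
    """
    return missed_heartbeats(last_heartbeat, now, interval) >= max_missed


if __name__ == "__main__":
    print(should_fail_over(last_heartbeat=100.0, now=104.0))
    print(should_fail_over(last_heartbeat=100.0, now=115.0))
```

Requiring several misses before failing over trades a slower reaction for fewer false alarms caused by a brief network hiccup.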
StarTeam Active/Passive Configuration Requirements
Each system identically configured
– StarTeam release (including patches)
– starteam-server-configs.xml
– EventServices\<config>\*.xml
– ServerLicenses.st
Access to shared vault and database
Only one instance can be running at a time
Failover time is the startup time of the secondary (passive) server
Active/Passive Configuration
[Diagram: an active server and a passive server share mirrored disks and are linked by a heartbeat network; clients reach the service address 12.34.56.78 over the client-facing service network.]
Failover Condition
[Diagram: the same active/passive configuration (mirrored disks, heartbeat network, client-facing service network, service address 12.34.56.78) with a failure marked by an X.]
StarTeam and BDOC
Borland Deployment Op-Center can assist with process monitoring and restart
– StarTeam Server process
– MPX Processes
  • Message Broker
  • Multicast Service
  • Cache Agent
– Workflow Notification Agent
Op-Center Example
Replication for DR
Types of replication based on latency
– Synchronous: Remote site is always up-to-date
– Asynchronous: Remote site lags by a small amount of time
– Batch: Remote site receives periodic snapshots (e.g., backups)
Synchronous Replication
Long-distance mirroring
– Fibre channel: 10km or more with newer technologies
– Variation: disk replication software (e.g., Veritas Volume Replicator)
– Advantages: real-time replication
– Disadvantages: cost
Asynchronous Replication
Possible strategy for StarTeam:
– Database-provided replication; e.g.:
  • SQL Server “Log Shipping”
  • Oracle Standby Database Replication
– Continuous/incremental copy of attachment and archive files
  • Exploits write-once feature of StarTeam 7.0 vault
– “Possible” because not yet in use!
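The incremental file copy can be sketched as follows. Because vault files are write-once, a file that already exists at the replica never needs to be re-copied, so each pass only transfers files that are missing. The function name, the directory arguments, and the `.partial` suffix are hypothetical, and like the strategy on the slide this is untested against a real StarTeam vault.

```python
import shutil
from pathlib import Path


def copy_new_files(source_root, replica_root):
    """Copy files that exist under source_root but not under replica_root.

    Relies on the write-once assumption: existing files are never modified,
    so presence at the replica means the file is already up to date.
    """
    source_root = Path(source_root)
    replica_root = Path(replica_root)
    copied = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source_root)
        dst = replica_root / relative
        if dst.exists():
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Copy to a temporary name first so a partial file is never mistaken
        # for a complete one on the next pass.
        partial = dst.with_name(dst.name + ".partial")
        shutil.copy2(src, partial)
        partial.replace(dst)
        copied.append(relative)
    return copied
```

Running this on a schedule approximates the "continuous" copy; the database side would still rely on the database's own replication.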
Asynchronous Replication
Advantages
– Less network bandwidth needed than synchronous replication
– Database “currency” window can be tuned
Disadvantages
– Requires reliable network
– Not yet tested!
Batch Replication
Sending backups offsite
– “Never underestimate the bandwidth of a station wagon filled with tapes barreling down the highway”
– Make copies of backups or rotate backups through offsite storage
– Send backups via FedEx, UPS, Volvo net, etc.
Batch Replication
Advantages
– Reliable
– Low cost
– Full backups ensure recoverability
Disadvantages
– Asynchronous (time lag)
– Manual process (media handling) unless network bandwidth is available
StarTeam High Availability Best Practices: Other Topics
Other High Availability Features for StarTeam 7.0
New StarTeam 7.0 Vault
– Conversion from StarTeam 6.0 vault can occur in real-time as background or scheduled process
– Vault space can be increased dynamically by adding new hives
– Archive files can be offloaded/reloaded dynamically
Other High Availability Features for StarTeam 7.0
New StarTeam 7.0 Memory Management
– New memory management caps memory growth with XxxCaching values > 0 (where Xxx = Files, ChangeRequests, etc.)
– Allows the server to run for very long periods without restarting
Summary
High availability is a cost/benefit pursuit
– Review administrative practices
– Smooth demand peaks (MPX)
– Establish on-line backup procedures
– Eliminate SPOFs
– Consider clustering for failover
– Create a disaster recovery plan
Document and test everything!
References
Blueprints for High Availability 2nd Edition, Evan Marcus and Hal Stern, Wiley Publishing Inc. (2003); detailed discussion of all issues related to high availability
Applied Reliability, Paul Tobias and David Trindade, Kluwer Academic Publishers (1995); detailed mathematical treatment of failure rates and renewability
Questions?
Thank You
3178 24 x 7 StarTeam
Please fill out the speaker evaluation
You can contact me further at …[email protected]