3178 24 x 7 starteam randy guck chief scientist, dsp borland

3178 24 x 7 StarTeam

Randy Guck, Chief Scientist, DSP

Borland

Overview

High availability fundamentals
– How available is highly-available?
– High availability at what price?
– Enemies of high availability

Overview

StarTeam high availability best practices
– Administrative practices
– Flash (demand peak) control
– Backup procedures
– Redundancy
– Failover and clustering
– Disaster recovery and replication

High Availability Fundamentals: How available is highly-available?

A Distorted Term

Depending on who you ask, high availability means:
– 24 x 7 uptime
– Clustering
– Failover
– Online backups
– “Five nines” or “Six sigma”
– Currently not dating

Availability by the Numbers

% Uptime                  % Downtime   Downtime per year   Downtime per week
99%                       1%           3.65 days           1 hour, 41 mins
99.9%                     0.1%         8 hours, 45 mins    10 mins, 5 secs
99.99%                    0.01%        52.5 mins           1 minute
99.999%                   0.001%       5.25 mins           6 seconds
99.9999% (“six sigma”)    0.0001%      31.5 seconds        0.6 seconds
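The downtime columns follow directly from the uptime percentage. A minimal sketch, assuming a 365-day year and a 7-day week (the function and constant names are illustrative, not from the talk):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(uptime_percent: float, period_seconds: int) -> float:
    """Return the downtime allowed within a period at the given uptime percentage."""
    return (1 - uptime_percent / 100) * period_seconds


# Recompute each row of the table from its uptime percentage.
for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
    per_week = downtime_seconds(pct, SECONDS_PER_WEEK)
    print(f"{pct}%: {per_year / 60:.1f} min/year, {per_week:.1f} s/week")
```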

The Myth of the Nines

Most people want more than they need

Actual reliability difficult to compute (complex mathematics)

Example: 7 components, each with 99.99% reliability (downtime = 52 minutes/year), combine to 99.93% overall (downtime = 6 hours/year).

Downtime often affected by future, unforeseeable business decisions
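The 7-component example can be reproduced with a short calculation. This sketch assumes the components are in series (all must be up) and fail independently, which the slide does not state explicitly; the names are illustrative:

```python
HOURS_PER_YEAR = 365 * 24


def series_availability(component_availability: float, count: int) -> float:
    """Availability of `count` independent components that must all be up."""
    return component_availability ** count


overall = series_availability(0.9999, 7)
downtime_hours = (1 - overall) * HOURS_PER_YEAR
print(f"{overall:.2%} available, about {downtime_hours:.1f} hours down per year")
```

Each added component multiplies in another chance of failure, so the overall figure is always lower than that of the weakest component.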

MTBF versus MTTR

MTBF = mean time between failures

MTTR = mean time to repair

Availability:

A = MTBF / (MTBF + MTTR)

Availability is good if MTTR is low

99.9999% availability (six sigma) = 6 mins downtime in 11.4 years!
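A small sketch of the availability formula. It reads the six-sigma figure as one 6-minute repair in roughly 11.4 years of operation, which is one possible interpretation; the 8-hour repair time is a made-up comparison value:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR), with both values in the same time unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


mtbf = 11.4 * 365 * 24   # about 11.4 years between failures, in hours
fast_repair = 6 / 60     # 6 minutes, in hours
slow_repair = 8.0        # hypothetical 8-hour repair, for comparison

print(f"6-minute repair: {availability(mtbf, fast_repair):.6%}")
print(f"8-hour repair:   {availability(mtbf, slow_repair):.6%}")
```

With the same failure rate, the longer repair time costs about two nines, which is why a low MTTR matters as much as a high MTBF.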

A Better Approach

Focus on scenarios and probabilities
– Examine organization’s needs
– Identify possible service disruptions
– Prioritize failures by probability
– Address scenarios on a cost/benefit basis
– Test each failure scenario

The result is your high availability plan!

High Availability Fundamentals: High availability at what price?

Availability versus Investment

[Chart: availability plotted against investment, with these practices marked:]

• Basic administrative practices

• Demand peak management (flash control)

• Backup procedures

• Redundancy (no SPOFs)

• Failover management

• Disaster recovery planning

What kind of systems need the highest availability?

Life-rated systems
– Space shuttle onboard systems
– Emergency response systems
– Command-and-control systems

High financial cost systems
– Stock-trading systems
– Reservation systems
– Banking systems

ALM High Availability in Perspective

Although ALM systems are becoming more mission-critical, they do not have the same financial or “loss of life” impact as some systems, so it doesn’t make sense to model high availability after them

Bottom line: Strike a reasonable balance between investment and high availability

High Availability Fundamentals: Enemies of availability

Infrastructure Issues

Hardware failures
– Disk, CPU, memory, power supply, fans, network card, motherboard, disk controller, etc.

Environmental failures
– Power, cooling, fire, flood, hurricane, earthquake, terrorism, etc.


Network outages
– LAN outages: switch/cable failures (server-to-DB network segment, client-to-server network segment, etc.)
– WAN outages: VPN failure, ISP failure, physical network issues, etc.
– Service outages: DNS, DHCP, directory server, email, etc.


Database outages
– Out-of-disk issues, recovery time after reboot, index corruption, etc.

Bandwidth issues
– Network congestion, database congestion, resource starvation

Denial-of-service attacks
– Viruses, worms, DDoS

Application Issues

Application brown-outs
– Locking/bottleneck issues, demand peaks, etc.

Application outages
– Hangs, fatal exceptions, out-of-memory

Scheduled outages
– Offline backups, application patches, database upgrades, etc.

Plan of Attack

To a specific user, a service is “down” when it is not available for any reason

A comprehensive high availability plan must consider all potential outages from end-to-end, on a cost/benefit basis

StarTeam High Availability Best Practices

Administrative Practices

Administrative best practices top 10 list

#10: Don’t be cheap

#9: Enforce security

#8: Centralize your servers

#7: Enforce change control

#6: Document everything

Administrative Practices

Administrative best practices top 10 list

#5: Test everything

#4: Design for growth

#3: Choose mature software

#2: Choose mature hardware

#1: K.I.S.S.


Flash Control

Client/server systems have natural demand peaks

Peaks are often time-based; e.g.:
– Everyone logs in in the morning
– Big reports are launched just before lunch

Peaks are often calendar-based; e.g.:
– End-of-week builds
– End-of-month reports

Client/Server Architecture

[Diagram: three StarTeam Clients connect through the Command API to the StarTeam Server, which uses the Vault and the DB. Callout: demand peak congestion areas.]

All information is pulled by clients using a request/reply command API

StarTeamMPX

[Diagram labels: StarTeam Clients; StarTeam Server; Vault; DB; Message Broker; event publish stream.]

Updated objects are pushed to clients, preventing poll and refresh requests and smoothing demand peaks
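The push model can be illustrated with a toy in-process publish/subscribe sketch. This is not the StarTeamMPX API; `Broker`, `Client`, the topic name, and the event fields are all hypothetical names chosen for illustration:

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """Toy message broker: forwards each published event to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, event: dict) -> None:
        for callback in self._subscribers[topic]:
            callback(event)


class Client:
    """Keeps a local view of object revisions, updated by pushed events."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[str, int] = {}

    def on_update(self, event: dict) -> None:
        self.cache[event["object_id"]] = event["revision"]


broker = Broker()
clients = [Client(f"client-{i}") for i in range(3)]
for client in clients:
    broker.subscribe("updates", client.on_update)

# The server publishes one event; no client has to poll for it.
broker.publish("updates", {"object_id": "file-42", "revision": 7})
```

One publish call reaches all three clients, whereas a pull model would need each client to send its own refresh request to see the same change.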

New for 7.0: MPX Cache Agent

[Diagram labels: StarTeam Clients; StarTeam Server; Vault; DB; Message Broker; Cache Agent; encrypted caches; file publish stream; check-out requests.]

The Cache Agent is trickle-charged with file contents, providing an alternate check-out source for remote clients.
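A minimal sketch of the alternate check-out source idea: try the nearby cache first and go to the central server only on a miss. The fallback behavior and all names here are assumptions for illustration, not the Cache Agent's actual protocol:

```python
from typing import Dict, Tuple


def check_out(
    file_id: str,
    cache_agent: Dict[str, bytes],
    server: Dict[str, bytes],
) -> Tuple[bytes, str]:
    """Return the file contents and the name of the source that supplied them."""
    if file_id in cache_agent:
        return cache_agent[file_id], "cache agent"
    # Cache miss: fall back to the central server (raises KeyError if unknown).
    return server[file_id], "server"


server_vault = {"a.txt": b"alpha", "b.txt": b"beta"}
local_cache = {"a.txt": b"alpha"}  # contents already trickle-charged to the cache

print(check_out("a.txt", local_cache, server_vault)[1])
print(check_out("b.txt", local_cache, server_vault)[1])
```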


Backups for High Availability

Mirroring does not replace backups

Backups are an important part of high availability

Test integrity of backups periodically

Consider a rotating/hierarchical storage system, which can also serve disaster recovery scenarios

StarTeam Backups

StarTeam 6.0 backup procedure:
– Lock the server
– Back up the database and vault
  • Disk-to-disk and differential dumps can speed things up
– Unlock the server

Why does the server need to be locked?

Review: StarTeam 6.0 Vault

[Diagram: an Archive Folder and a Cache Folder, each on a single volume. The Archive Folder holds per-file chains for text files and binary files, labeled “Base version, Delta 1, Delta 2, Delta 3, …” and “Base version, Rev 1, Rev 2, Rev 3, …”. The Cache Folder holds uncompressed full versions.]

New StarTeam 7.0 Vault

[Diagram labels: StarTeam Server; DB; Vault; multiple hives (Hive, Hive, …); Hive Index.]

7.0 Vault: Inside the Hive

[Diagram: each hive has an Archive Root and a Cache Root, both divided into subfolders 00/0 … ff/f. Label: MD5-based storage.]

Archive Root examples:
– 000a807b9f393f58a69998b2cd7db7d2.gz (compressed)
– 000752242cc7e16d573f299a127903f2.gz (compressed)
– fff16c26e911ac72abad5557ac44d84c (uncompressed)
– …

Cache Root examples (uncompressed):
– 000a807b9f393f58a69998b2cd7db7d2
– 000752242cc7e16d573f299a127903f2
– fffb865605a09eef1f06be92a38bc8da
– …
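The file names in the hive diagram suggest that a file's location is derived from the MD5 of its contents: the first two hex digits name a folder, the third a subfolder. The sketch below assumes that reading of the diagram; it is not taken from StarTeam documentation, and `archive_path` is a hypothetical helper:

```python
import hashlib
from pathlib import PurePosixPath


def archive_path(content: bytes, compressed: bool = True) -> PurePosixPath:
    """Derive a relative hive path such as 00/0/000a80....gz from file content."""
    digest = hashlib.md5(content).hexdigest()
    name = digest + (".gz" if compressed else "")
    # Assumed layout: <first two hex digits>/<third hex digit>/<full digest>
    return PurePosixPath(digest[:2]) / digest[2] / name


print(archive_path(b"hello"))
print(archive_path(b"hello", compressed=False))
```

Under this scheme identical content always maps to the same path, so a stored file never needs to be rewritten once it exists.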

StarTeam 7.0 On-line Backup Procedure

The new vault allows on-line backups:

1. Back up the database on-line
2. When complete, back up archive and attachment folders
   2.1 Perform full backups weekly
   2.2 Perform incremental backups daily

No need to lock the server!

StarTeam 7.0 Recovery Procedure

To recover a full StarTeam configuration:

1. Reload the database

2. Simultaneously reload archive and attachment folders:
   2.1 Load latest full backup
   2.2 Load all incrementals since last full backup in parallel

Modify this procedure for partial recoveries
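Step 2 of the recovery procedure amounts to picking the latest full backup plus every incremental taken after it. A minimal sketch of that selection, using a hypothetical `Backup` record rather than any real backup tool's catalog format:

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_on: date
    kind: str  # "full" or "incremental"


def restore_set(backups: List[Backup]) -> List[Backup]:
    """Return the latest full backup followed by all later incrementals."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    latest_full = max(fulls, key=lambda b: b.taken_on)
    incrementals = sorted(
        (b for b in backups
         if b.kind == "incremental" and b.taken_on > latest_full.taken_on),
        key=lambda b: b.taken_on,
    )
    return [latest_full] + incrementals


catalog = [
    Backup(date(2005, 3, 6), "full"),
    Backup(date(2005, 3, 7), "incremental"),
    Backup(date(2005, 3, 13), "full"),
    Backup(date(2005, 3, 14), "incremental"),
    Backup(date(2005, 3, 15), "incremental"),
]
for backup in restore_set(catalog):
    print(backup.taken_on, backup.kind)
```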


Reducing SPOFs

Servers
– Dual power supplies, ECC/mirrored memory, dual fans, etc.

Storage
– Dual controllers, mirrored/RAID disks

Network
– Dual network cards, redundant switches, dual ISP connections, etc.

Redundant Everything

[Diagram labels: StarTeam Server; Database Server; RAID vault disks; RAID DB disks; dual controllers; dual NICs; two switches; ECC memory, dual fans, etc.]


Failover Checklist

At least two identically configured systems

Shared disks

Network connections
– Heartbeat/server network
– Client-facing service network
– Optional: administrative network

Failure Management System (FMS)

Cluster set: app, db connections, IP address

StarTeam Active/Passive Configuration Requirements

Each system identically configured
– StarTeam release (including patches)
– starteam-server-configs.xml
– EventServices\<config>\*.xml
– ServerLicenses.st

Access to shared vault and database

Only one instance can be running at a time

Failover time is the secondary server’s startup time

Active/Passive Configuration

[Diagram: an Active Server and a Passive Server share mirrored disks and are joined by a heartbeat network; the service address 12.34.56.78 sits on the client-facing service network.]

Failover Condition

[Diagram: the same Active/Passive configuration (mirrored disks, heartbeat network, client-facing service network) with a failure marked (X) beside the service address 12.34.56.78.]
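The heartbeat network exists so the passive server can notice when the active one stops responding. A minimal sketch of that detection logic, with a made-up timeout and explicit timestamps so it can be run without a real network; `HeartbeatMonitor` is a hypothetical name, not part of any failure management system:

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat seen and reports when it is overdue."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def should_fail_over(self, now: float) -> bool:
        # Never fail over before the first heartbeat has been seen.
        if self.last_heartbeat is None:
            return False
        return now - self.last_heartbeat > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat(now=100.0)
print(monitor.should_fail_over(now=103.0))  # heartbeat still within the timeout
print(monitor.should_fail_over(now=106.0))  # heartbeat overdue
```

In a real cluster the failover action would also have to move the cluster set (application, database connections, IP address) and guarantee that only one server instance runs at a time.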

StarTeam and BDOC

Borland Deployment Op-Center can assist with process monitoring and restart
– StarTeam Server process
– MPX processes
  • Message Broker
  • Multicast Service
  • Cache Agent
– Workflow Notification Agent

Op-Center Example


Replication for DR

Types of replication based on latency
– Synchronous: Remote site is always up-to-date
– Asynchronous: Remote site lags by a small amount of time
– Batch: Remote site receives periodic snapshots (e.g., backups)

Synchronous Replication

Long-distance mirroring
– Fibre channel: 10km or more with newer technologies
– Variation: disk replication software (e.g., Veritas Volume Replicator)
– Advantages: real-time replication
– Disadvantages: cost

Asynchronous Replication

Possible strategy for StarTeam:
– Database-provided replication; e.g.:
  • SQL Server “Log Shipping”
  • Oracle Standby Database Replication
– Continuous/incremental copy of attachment and archive files
  • Exploits write-once feature of StarTeam 7.0 vault
– “Possible” because not yet in use!
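Because vault files are write-once, an incremental copy only has to look for files that are missing at the replica. The sketch below assumes plain directory trees on both sides; `copy_new_files` is a hypothetical helper, not a StarTeam utility, and, like the strategy itself, it is untested against a real vault:

```python
import shutil
from pathlib import Path
from typing import List


def copy_new_files(source_root: Path, replica_root: Path) -> List[Path]:
    """Copy files that exist under source_root but not yet under replica_root."""
    copied: List[Path] = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source_root)
        dst = replica_root / relative
        if dst.exists():
            continue  # write-once: a file already replicated never changes
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(relative)
    return copied
```

A production version would also need to guard against partially copied files, for example by copying to a temporary name and renaming on completion.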

Asynchronous Replication

Advantages
– Less network bandwidth needed than synchronous replication
– Database “currency” window can be tuned

Disadvantages
– Requires reliable network
– Not yet tested!

Batch Replication

Sending backups offsite
– “Never underestimate the bandwidth of a station wagon filled with tapes barreling down the highway”
– Make copies of backups or rotate backups through offsite storage
– Send backups via FedEx, UPS, Volvo net, etc.
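The station wagon quote can be checked with simple arithmetic: divide the data shipped by the transit time. The tape count, tape capacity, and delivery time below are made-up figures for illustration only:

```python
def effective_bandwidth_mbps(total_gigabytes: float, transit_hours: float) -> float:
    """Average throughput in megabits per second for data delivered physically."""
    bits = total_gigabytes * 8 * 1000**3  # decimal gigabytes to bits
    seconds = transit_hours * 3600
    return bits / seconds / 1e6


# Hypothetical shipment: 100 tapes of 400 GB each, delivered overnight (24 hours).
tapes, gigabytes_per_tape, hours = 100, 400, 24
rate = effective_bandwidth_mbps(tapes * gigabytes_per_tape, hours)
print(f"{rate:.0f} Mbit/s average")
```

The throughput is high, but the latency is the full transit time, which is why this counts as batch replication.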

Batch Replication

Advantages
– Reliable
– Low cost
– Full backups ensure recoverability

Disadvantages
– Asynchronous (time lag)
– Manual process (media handling) unless network bandwidth is available
