3178 24 x 7 StarTeam Randy Guck Chief Scientist, DSP Borland

Post on 17-Dec-2015


Page 1

3178
24 x 7 StarTeam

Randy Guck
Chief Scientist, DSP

Borland

Page 2

Overview

High availability fundamentals
– How available is highly-available?
– High availability at what price?
– Enemies of high availability

Page 3

Overview

StarTeam high availability best practices
– Administrative practices
– Flash (demand peak) control
– Backup procedures
– Redundancy
– Failover and clustering
– Disaster recovery and replication

Page 4

High Availability Fundamentals: How available is highly-available?

Page 5

A Distorted Term

Depending on who you ask, high availability means:
– 24 x 7 uptime
– Clustering
– Failover
– Online backups
– “Five nines” or “Six sigma”
– Currently not dating

Page 6

Availability by the Numbers

% Uptime                  % Downtime   Downtime per year   Downtime per week
99%                       1%           3.65 days           1 hour, 41 mins
99.9%                     0.1%         8 hours, 45 mins    10 mins, 5 secs
99.99%                    0.01%        52.5 mins           1 minute
99.999%                   0.001%       5.25 mins           6 seconds
99.9999% (“six sigma”)    0.0001%      31.5 seconds        0.6 seconds
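The figures in this table follow from simple arithmetic: the downtime fraction multiplied by the length of the period. A minimal Python sketch of that calculation is below. It assumes a 365-day year, and the function and constant names are illustrative, not part of the original slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumes a 365-day year
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(uptime_percent: float, period_seconds: int) -> float:
    """Return the downtime allowed in a period for a given uptime percentage."""
    return (1 - uptime_percent / 100) * period_seconds


for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
    per_week = downtime_seconds(pct, SECONDS_PER_WEEK)
    print(f"{pct}% uptime: {per_year / 60:.1f} min/year, {per_week:.1f} s/week")
```

Each added nine divides the allowed downtime by ten, which is why the last rows shrink to seconds.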

Page 7

The Myth of the Nines

Most people want more than they need

Actual reliability is difficult to compute (complex mathematics)

Example: 7 components that must all be up, each 99.99% reliable (downtime = 52 minutes/year), yield 99.93% overall (downtime = 6 hours/year).

Downtime is often affected by future, unforeseeable business decisions
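The 7-component example can be checked by multiplying the individual availabilities. The sketch below does this under two assumptions the slide does not state: all components must be up for the service to be up, and they fail independently. The function name is illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24


def series_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(component_availabilities)


combined = series_availability([0.9999] * 7)
print(f"combined availability: {combined:.4%}")
print(f"downtime: {(1 - combined) * HOURS_PER_YEAR:.1f} hours/year")
```

Real systems rarely satisfy the independence assumption, which is part of why the slide calls the mathematics complex.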

Page 8

MTBF versus MTTR

MTBF = mean time between failures

MTTR = mean time to repair

Availability:

A = MTBF / (MTBF + MTTR)

Availability is good if MTTR is low

99.9999% availability (six sigma) = 6 mins downtime in 11.4 years!
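The availability formula is easy to experiment with. The sketch below implements it and also rearranges it to give the largest MTTR that still meets a target. The function names and the use of hours as the unit are illustrative choices, not from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr(mtbf_hours: float, target: float) -> float:
    """Largest MTTR that still meets a target availability.

    Obtained by solving A = MTBF / (MTBF + MTTR) for MTTR.
    """
    return mtbf_hours * (1 - target) / target


# The slide's example: one 6-minute (0.1 hour) outage in 11.4 years.
print(f"{availability(11.4 * 365 * 24, 0.1):.6%}")
```

Because MTTR sits only in the denominator, halving the repair time improves availability about as much as doubling the time between failures.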

Page 9

A Better Approach

Focus on scenarios and probabilities
– Examine organization’s needs
– Identify possible service disruptions
– Prioritize failures by probability
– Address scenarios on a cost/benefit basis
– Test each failure scenario

The result is your high availability plan!
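One way to make the cost/benefit step concrete is to compare each scenario's expected yearly loss with the yearly cost of mitigating it. The sketch below is only an illustration of that idea. The `Scenario` class, the `prioritize` function, and every figure in the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # estimated chance of occurring in a year
    outage_cost: float         # estimated cost of one occurrence
    mitigation_cost: float     # estimated yearly cost of addressing it

    @property
    def expected_loss(self) -> float:
        return self.annual_probability * self.outage_cost


def prioritize(scenarios):
    """Return the scenarios worth mitigating, highest net benefit first."""
    worthwhile = [s for s in scenarios if s.expected_loss > s.mitigation_cost]
    return sorted(
        worthwhile,
        key=lambda s: s.expected_loss - s.mitigation_cost,
        reverse=True,
    )


# Hypothetical figures, for illustration only.
scenarios = [
    Scenario("disk failure", 0.30, 20_000, 1_500),
    Scenario("regional flood", 0.01, 500_000, 40_000),
    Scenario("application hang", 0.80, 2_000, 500),
]
for s in prioritize(scenarios):
    print(s.name, s.expected_loss - s.mitigation_cost)
```

A scenario dropped by this filter is not ignored; it is simply one where the estimated mitigation cost exceeds the estimated loss, so it may be accepted or covered by a cheaper measure such as offsite backups.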

Page 10

High Availability Fundamentals: High availability at what price?

Page 11

Availability versus Investment

[Chart axes: Availability, Investment]

• Basic administrative practices
• Demand peak management (flash control)
• Backup procedures
• Redundancy (no SPOFs)
• Failover management
• Disaster recovery planning

Page 12

What kinds of systems need the highest availability?

Life-rated systems
– Space shuttle onboard systems
– Emergency response systems
– Command-and-control systems

High financial cost systems
– Stock-trading systems
– Reservation systems
– Banking systems

Page 13

ALM High Availability in Perspective

Although ALM systems are becoming more mission-critical, they do not have the same financial or “loss of life” impact as some systems, so it doesn’t make sense to model high availability after them

Bottom line: Strike a reasonable balance between investment and high availability

Page 14

High Availability Fundamentals: Enemies of availability

Page 15

Infrastructure Issues

Hardware failures
– Disk, CPU, memory, power supply, fans, network card, motherboard, disk controller, etc.

Environmental failures
– Power, cooling, fire, flood, hurricane, earthquake, terrorism, etc.

Page 16

Infrastructure Issues

Network outages
– LAN outages: switch/cable failures (server-to-DB network segment, client-to-server network segment, etc.)
– WAN outages: VPN failure, ISP failure, physical network issues, etc.
– Service outages: DNS, DHCP, directory server, email, etc.

Page 17

Infrastructure Issues

Database outages
– Out-of-disk issues, recovery time after reboot, index corruption, etc.

Bandwidth issues
– Network congestion, database congestion, resource starvation

Denial-of-service attacks
– Viruses, worms, DDoS

Page 18

Application Issues

Application brown-outs
– Locking/bottleneck issues, demand peaks, etc.

Application outages
– Hangs, fatal exceptions, out-of-memory

Scheduled outages
– Offline backups, application patches, database upgrades, etc.

Page 19

Plan of Attack

To a specific user, a service is “down” when it is not available for any reason

A comprehensive high availability plan must consider all potential outages from end-to-end, on a cost/benefit basis

Page 20

StarTeam High Availability Best Practices

[Availability versus investment chart, as on page 11]

Page 21

Administrative Practices

Administrative best practices top 10 list

#10: Don’t be cheap

#9: Enforce security

#8: Centralize your servers

#7: Enforce change control

#6: Document everything

Page 22

Administrative Practices

Administrative best practices top 10 list

#5: Test everything

#4: Design for growth

#3: Choose mature software

#2: Choose mature hardware

#1: K.I.S.S.

Page 23

StarTeam High Availability Best Practices

[Availability versus investment chart, as on page 11]

Page 24

Flash Control

Client/server systems have natural demand peaks

Peaks are often time-based; e.g.:
– Everyone logs in each morning
– Big reports launched just before lunch

Peaks are often calendar-based; e.g.:
– End-of-week builds
– End-of-month reports

Page 25

Client/Server Architecture

[Diagram labels: StarTeam Clients, Command API, StarTeam Server, DB, Vault, demand peak congestion areas]

All information is pulled by clients using a request/reply command API

Page 26

StarTeamMPX

[Diagram labels: StarTeam Clients, StarTeam Server, DB, Vault, Message Broker, event publish stream]

Updated objects are pushed to clients, preventing poll and refresh requests, smoothing demand peaks

Page 27

New for 7.0: MPX Cache Agent

[Diagram labels: StarTeam Clients, StarTeam Server, DB, Vault, Message Broker, Cache Agent, Encrypted Caches, file publish stream, check-out requests]

The Cache Agent is trickle-charged with file contents, providing an alternate check-out source for remote clients.

Page 28

StarTeam High Availability Best Practices

[Availability versus investment chart, as on page 11]

Page 29

Backups for High Availability

Mirroring does not replace backups

Backups are an important part of high availability

Test integrity of backups periodically

Consider a rotating/hierarchical storage system, which can serve disaster recovery scenarios

Page 30

StarTeam Backups

StarTeam 6.0 backup procedure:
– Lock the server
– Back up the database and vault
  • Disk-to-disk and differential dumps can speed things up
– Unlock the server

Why does the server need to be locked?

Page 31

Review: StarTeam 6.0 Vault

[Diagram labels: Archive Folder (single volume) with rows “Base version, Rev 1, Rev 2, Rev 3, …” and “Base version, Delta 1, Delta 2, Delta 3, …”, marked Text files and Binary files; Cache Folder (single volume) holding uncompressed full versions]

Page 32

New StarTeam 7.0 Vault

[Diagram labels: StarTeam Server, DB, Vault, multiple Hives, Index]

Page 33

7.0 Vault: Inside the Hive

MD5-based storage

[Diagram: a hive contains an Archive Root and a Cache Root, each divided into subfolders 00/0 through ff/f. File names are MD5 values. Compressed files carry a .gz suffix (000a807b9f393f58a69998b2cd7db7d2.gz, 000752242cc7e16d573f299a127903f2.gz); uncompressed files do not (fff16c26e911ac72abad5557ac44d84c, 000a807b9f393f58a69998b2cd7db7d2, 000752242cc7e16d573f299a127903f2, fffb865605a09eef1f06be92a38bc8da…)]
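The file names and the 00/0 through ff/f subfolders suggest that a file's location is derived from its MD5 value. The sketch below illustrates that idea under assumptions the slide does not confirm: that the hash is taken over the file contents, and that the two folder levels come from the first two and the third hex digits. The function names are illustrative and are not StarTeam APIs.

```python
import hashlib
from pathlib import PurePosixPath


def path_for_digest(md5_hex: str, compressed: bool) -> PurePosixPath:
    """Relative path of a file inside a root, derived from its MD5 name.

    Assumed layout: <first two hex digits>/<third hex digit>/<name>.
    """
    name = md5_hex + (".gz" if compressed else "")
    return PurePosixPath(md5_hex[:2]) / md5_hex[2] / name


def archive_path(content: bytes, compressed: bool = True) -> PurePosixPath:
    """Hash the content with MD5 and map it to its assumed relative path."""
    return path_for_digest(hashlib.md5(content).hexdigest(), compressed)


# A name taken from the slide:
print(path_for_digest("000a807b9f393f58a69998b2cd7db7d2", compressed=True))
```

With content-derived names, identical contents map to the same file, and a stored file never needs to be rewritten once it exists.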

Page 34

StarTeam 7.0 On-line Backup Procedure

The new vault allows on-line backups:

1. Back up the database on-line
2. When complete, back up the archive and attachment folders
   2.1 Perform full backups weekly
   2.2 Perform incremental backups daily

No need to lock the server!
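Step 2.2 can be sketched as a pass that copies only the files not yet present in the backup. This is a minimal illustration, assuming vault files are written once and never modified afterwards, so a file that already exists in the backup can be skipped. The function and parameter names are hypothetical. The database backup in step 1 is not shown, since it would use the database vendor's own tools.

```python
import shutil
from pathlib import Path


def incremental_backup(source_root: Path, backup_root: Path) -> list[Path]:
    """Copy files that are not yet in the backup; return the relative paths copied."""
    copied = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = backup_root / rel
        if dst.exists():
            continue  # write-once assumption: an existing file never changes
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel)
    return copied
```

Running the database backup first and the file pass second presumably ensures that every file the database snapshot refers to is already on disk when the file pass runs. Files added in between are merely extra.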

Page 35

StarTeam 7.0 Recovery Procedure

To recover a full StarTeam configuration:

1. Reload the database
2. Simultaneously reload archive and attachment folders:
   2.1 Load latest full backup
   2.2 Load all incrementals since last full backup in parallel

Modify this procedure for partial recoveries
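The ordering in this procedure can be sketched with a thread pool. The sketch reads "simultaneously" as "while the database reload is running", which is an interpretation. The `recover` function is hypothetical, and the three restore callables are placeholders for whatever backup tooling is actually in use.

```python
from concurrent.futures import ThreadPoolExecutor


def recover(restore_database, restore_full, restore_incremental, incrementals):
    """Run the recovery steps; every restore_* argument is a caller-supplied callable."""
    with ThreadPoolExecutor() as pool:
        db_job = pool.submit(restore_database)  # step 1, runs in the background
        restore_full()                          # step 2.1
        # Step 2.2: incrementals only add files, so they can load in any order.
        list(pool.map(restore_incremental, incrementals))
        db_job.result()  # wait for step 1 and re-raise any error it hit
```

For a partial recovery, the same function could be called with no-op callables for the parts that do not need reloading.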

Page 36

StarTeam High Availability Best Practices

[Availability versus investment chart, as on page 11]

Page 37

Reducing SPOFs

Servers
– Dual power supplies, ECC/mirrored memory, dual fans, etc.

Storage
– Dual controllers, mirrored/RAID disks

Network
– Dual network cards, redundant switches, dual ISP connections, etc.

Page 38

Redundant Everything

[Diagram labels: StarTeam Server, Database Server, RAID vault disks, RAID DB disks, dual controllers, dual NICs, two Switches; ECC memory, dual fans, etc.]

Page 39

StarTeam High Availability Best Practices

[Availability versus investment chart, as on page 11]

Page 40

Failover Checklist

At least two identically configured systems

Shared disks

Network connections
– Heartbeat/server network
– Client-facing service network
– Optional: administrative network

Failure Management System (FMS)

Cluster set: app, db connections, IP address

Page 41

StarTeam Active/Passive Configuration Requirements

Each system identically configured
– StarTeam release (including patches)
– starteam-server-configs.xml
– EventServices\<config>\*.xml
– ServerLicenses.st

Access to shared vault and database

Only one instance can be running at a time

Failover time is the startup time of the secondary server
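The "identically configured" requirement can be verified mechanically by comparing file digests between the two systems. The sketch below is one possible check, not a StarTeam tool. The function names and root directories are hypothetical, and the caller is expected to supply the list of relative paths, including the expanded `EventServices\<config>\*.xml` entries.

```python
import hashlib
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a file's contents, used only to compare two copies."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def config_differences(active_root: Path, passive_root: Path, relative_files):
    """Return the relative paths that differ or are missing on either system."""
    differing = []
    for rel in relative_files:
        a, b = active_root / rel, passive_root / rel
        if not (a.is_file() and b.is_file()) or digest(a) != digest(b):
            differing.append(rel)
    return differing
```

An empty result for `["starteam-server-configs.xml", "ServerLicenses.st"]` would indicate the two copies of those files match. It says nothing about the installed release or patch level, which needs a separate check.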

Page 42

Active/Passive Configuration

[Diagram labels: Active Server, Passive Server, mirrored disks, heartbeat network, client-facing service network, service address 12.34.56.78]

Page 43

Failover Condition

[Diagram: same labels as the previous page (Active Server, Passive Server, mirrored disks, heartbeat network, client-facing service network, service address 12.34.56.78), with an X marking a failure]
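Detecting the failover condition usually comes down to noticing that heartbeats have stopped. The sketch below shows only that detection step, assuming a fixed timeout. The `HeartbeatMonitor` class is hypothetical and does not represent any particular failure management system. Taking over the service address and starting the passive server are left to the FMS.

```python
import time


class HeartbeatMonitor:
    """Tracks heartbeats from the active server and reports when they stop."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self) -> None:
        """Record a heartbeat received over the heartbeat network."""
        self.last_beat = self.clock()

    def should_fail_over(self) -> bool:
        """True once no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_beat > self.timeout_s
```

The timeout is a trade-off: too short and a brief network hiccup triggers an unnecessary failover, too long and it adds directly to repair time.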

Page 44

StarTeam and BDOC

Borland Deployment Op-Center can assist with process monitoring and restart
– StarTeam Server process
– MPX processes
  • Message Broker
  • Multicast Service
  • Cache Agent
– Workflow Notification Agent

Page 45

Op-Center Example

Page 46

StarTeam High Availability Best Practices

[Availability versus investment chart, as on page 11]

Page 47

Replication for DR

Types of replication based on latency
– Synchronous: Remote site is always up-to-date
– Asynchronous: Remote site lags by a small amount of time
– Batch: Remote site receives periodic snapshots (e.g., backups)

Page 48

Synchronous Replication

Long-distance mirroring
– Fibre channel: 10 km or more with newer technologies
– Variation: disk replication software (e.g., Veritas Volume Replicator)
– Advantages: real-time replication
– Disadvantages: cost

Page 49

Asynchronous Replication

Possible strategy for StarTeam:
– Database-provided replication; e.g.:
  • SQL Server “Log Shipping”
  • Oracle Standby Database Replication
– Continuous/incremental copy of attachment and archive files
  • Exploits write-once feature of StarTeam 7.0 vault
– “Possible” because not yet in use!

Page 50

Asynchronous Replication

Advantages
– Less network bandwidth needed than synchronous replication
– Database “currency” window can be tuned

Disadvantages
– Requires reliable network
– Not yet tested!

Page 51

Batch Replication

Sending backups offsite
– “Never underestimate the bandwidth of a station wagon filled with tapes barreling down the highway”
– Make copies of backups or rotate backups through offsite storage
– Send backups via FedEx, UPS, Volvo net, etc.
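The station wagon remark can be checked with a quick calculation of average throughput: the payload size divided by the transit time. The sketch below does that. The function name is illustrative, and the tape count, tape capacity, and transit time in the example are hypothetical.

```python
def effective_bandwidth_mbps(payload_gb: float, transit_hours: float) -> float:
    """Average throughput, in megabits per second, of physically shipping data."""
    bits = payload_gb * 8 * 1000**3  # decimal gigabytes to bits
    return bits / (transit_hours * 3600) / 1e6


# Hypothetical shipment: 100 tapes of 400 GB each, delivered in 24 hours.
print(f"{effective_bandwidth_mbps(100 * 400, 24):.0f} Mbit/s")
```

The throughput is high, but so is the latency: nothing arrives until the whole shipment does, which is the time lag that batch replication accepts.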

Page 52

Batch Replication

Advantages
– Reliable
– Low cost
– Full backups ensure recoverability

Disadvantages
– Asynchronous (time lag)
– Manual process (media handling) unless network bandwidth is available

Page 53

StarTeam High Availability Best Practices

Other Topics

Page 54

Other High Availability Features for StarTeam 7.0

New StarTeam 7.0 Vault
– Conversion from StarTeam 6.0 vault can occur in real-time as background or scheduled process
– Vault space can be increased dynamically by adding new hives
– Archive files can be offloaded/reloaded dynamically

Page 55

Other High Availability Features for StarTeam 7.0

New StarTeam 7.0 Memory Management
– New memory management caps memory growth with XxxCaching values > 0 (where Xxx = Files, ChangeRequests, etc.)
– Allows the server to run for very long periods without restarting

Page 56

Summary

High availability is a cost/benefit pursuit
– Review administrative practices
– Smooth demand peaks (MPX)
– Establish on-line backup procedures
– Eliminate SPOFs
– Consider clustering for failover
– Create a disaster recovery plan

Document and test everything!

Page 57

References

Blueprints for High Availability 2nd Edition, Evan Marcus and Hal Stern, Wiley Publishing Inc. (2003); detailed discussion of all issues related to high availability

Applied Reliability, Paul Tobias and David Trindade, Kluwer Academic Publishers (1995); detailed mathematical treatment of failure rates and renewability

Page 58

Questions?

Page 59

Thank You

3178
24 x 7 StarTeam

Please fill out the speaker evaluation

You can contact me further at …[email protected]