Exchange Server 2013 High Availability | Site Resilience
Scott Schnoll, Principal Technical Writer, Microsoft Corporation

Upload: microsoft-technet-belgium-and-luxembourg
Posted on 15-Jan-2015

DESCRIPTION

More info on http://techdays.be.

TRANSCRIPT

Page 1: Exchange Server 2013 High Availability - Site Resilience

Exchange Server 2013 High Availability | Site Resilience
Scott Schnoll, Principal Technical Writer, Microsoft Corporation

Page 2: Exchange Server 2013 High Availability - Site Resilience

Agenda

Storage

High Availability

Site Resilience

Page 3: Exchange Server 2013 High Availability - Site Resilience

Storage

Page 4: Exchange Server 2013 High Availability - Site Resilience

Storage Challenges

Capacity is increasing, but IOPS are not
Database sizes must be manageable
Reseeds must be fast and reliable
Passive copy IOPS are inefficient
Lagged copies have asymmetric storage requirements
Low agility from low disk space recovery

Page 5: Exchange Server 2013 High Availability - Site Resilience

Storage Innovations

Multiple Databases Per Volume
Automatic Reseed
Automatic Recovery from Storage Failures
Lagged Copy Enhancements

Page 6: Exchange Server 2013 High Availability - Site Resilience

Multiple databases per volume

Page 7: Exchange Server 2013 High Availability - Site Resilience

Multiple Databases Per Volume

[Diagram: a 4-member DAG with active, passive, and lagged copies of databases DB1–DB4 distributed across all four members]

4-member DAG
4 databases
4 copies of each database
4 databases per volume

Symmetrical design with balanced activation preference

Number of copies per database = number of databases per volume

Page 8: Exchange Server 2013 High Availability - Site Resilience

Multiple Databases Per Volume

[Diagram: active copy of DB1 seeding a single passive copy on another disk at 20 MB/s]

Single database copy per disk:
Reseed 2TB database = ~23 hrs
Reseed 8TB database = ~93 hrs

Page 9: Exchange Server 2013 High Availability - Site Resilience

Multiple Databases Per Volume

[Diagram: four database copies per disk; the copies on a failed disk reseed in parallel from different source servers at 12–20 MB/s each]

Single database copy per disk:
Reseed 2TB database = ~23 hrs
Reseed 8TB database = ~93 hrs

Four database copies per disk:
Reseed 2TB disk = ~9.7 hrs
Reseed 8TB disk = ~39 hrs

Page 10: Exchange Server 2013 High Availability - Site Resilience

Multiple Databases Per Volume

Requirements

Single logical disk/partition per physical disk

Best Practices

Same neighbors on all servers

Balance activation preferences

Database copies per volume = copies per database
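Balanced activation preferences can be configured when database copies are added. A minimal sketch in the Exchange Management Shell (database name, server names, and preference values are illustrative):

```powershell
# Add copies of DB1 so each server carries a different activation
# preference, spreading active databases evenly after failovers.
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX2 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX3 -ActivationPreference 3
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX4 -ActivationPreference 4
```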

Page 11: Exchange Server 2013 High Availability - Site Resilience

Autoreseed

Page 12: Exchange Server 2013 High Availability - Site Resilience

Seeding Challenges

Disk failure on active copy = database failover
Failed disk and database corruption issues need to be addressed quickly
Fast recovery to restore redundancy is needed

Page 13: Exchange Server 2013 High Availability - Site Resilience

Seeding Innovations

Automatic Reseed (Autoreseed) - use spares to automatically restore database redundancy after a disk failure

Page 14: Exchange Server 2013 High Availability - Site Resilience

Autoreseed

[Diagram: in-use storage disks and spare disks; a failed in-use disk (X) is replaced from the spares]

Page 15: Exchange Server 2013 High Availability - Site Resilience

Autoreseed Workflow

Periodically scan for failed and suspended copies
Check prerequisites: single copy, spare availability
Allocate and remap a spare
Start the seed
Verify that the new copy is healthy
Admin replaces failed disk

Page 16: Exchange Server 2013 High Availability - Site Resilience

Autoreseed Workflow

1. Detect a copy in an F&S (FailedAndSuspended) state for 15 min in a row
2. Try to resume the copy 3 times (with 5 min sleeps in between)
3. Try assigning a spare volume 5 times (with 1 hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1 hour sleeps in between)
5. Once all retries are exhausted, the workflow stops
6. If 3 days have elapsed and the copy is still F&S, the workflow state is reset and starts again from Step 1
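The copies this workflow acts on can be found manually with Get-MailboxDatabaseCopyStatus; a quick check, assuming a Mailbox server named MBX1 (the server name is illustrative):

```powershell
# List database copies in the FailedAndSuspended state on one server.
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Where-Object { $_.Status -eq 'FailedAndSuspended' } |
    Format-Table Name, Status, ContentIndexState -AutoSize
```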

Page 17: Exchange Server 2013 High Availability - Site Resilience

Autoreseed Workflow

Prerequisites
Copy is not ReseedBlocked or ResumeBlocked
Logs and EDB files are on the same volume
Database and log folder structure matches the required naming convention
No active copies on the failed volume
All copies are F&S on the failed volume
No more than 8 F&S copies on the server (if so, we might be in a controller failure situation)

For InPlaceSeed
If an EDB file exists, wait 2 days before in-place reseeding (based on the LastWriteTime of the EDB file)
Only up to 10 concurrent seeds are allowed
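The in-place seed step the workflow performs can also be run by hand. A sketch, assuming database DB1 with a failed copy on MBX2 (names illustrative; -SafeDeleteExistingFiles is an Exchange 2013 parameter of Update-MailboxDatabaseCopy):

```powershell
# Reseed the failed copy in place, safely deleting the existing files
# first — the manual equivalent of Autoreseed's InPlaceSeed step.
Update-MailboxDatabaseCopy -Identity "DB1\MBX2" -SafeDeleteExistingFiles
```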

Page 18: Exchange Server 2013 High Availability - Site Resilience

Autoreseed

Configure the storage subsystem with spare disks

Create the DAG, and add servers with configured storage

Create directories and mount points

Configure the three AutoDag properties

Create mailbox databases and database copies

[Diagram: C:\ExchDbs holds database mount points (MDB1, MDB2); C:\ExchVols holds volume mount points (Vol1–Vol3); each volume contains MDB1.DB and MDB1.log folders]

AutoDagDatabasesRootFolderPath
AutoDagVolumesRootFolderPath
AutoDagDatabaseCopiesPerVolume = 1
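The three AutoDag properties are set on the DAG object; a sketch, assuming a DAG named DAG1 and the folder paths shown above:

```powershell
# Point Autoreseed at the database and volume root folders, and declare
# how many database copies live on each volume.
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -AutoDagDatabasesRootFolderPath "C:\ExchDbs" `
    -AutoDagVolumesRootFolderPath "C:\ExchVols" `
    -AutoDagDatabaseCopiesPerVolume 1
```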

Page 19: Exchange Server 2013 High Availability - Site Resilience

Autoreseed

Requirements
Single logical disk/partition per physical disk
A specific database and log folder structure must be used

Recommendations
Same neighbors on all servers
Databases per volume should equal the number of copies per database
Balance activation preferences

Configuration instructions at http://aka.ms/autoreseed

Page 20: Exchange Server 2013 High Availability - Site Resilience

Autoreseed

Numerous fixes in CU1
Autoreseed not detecting spare disks correctly, or not using detected disks

GetCopyStatus has a new field 'ExchangeVolumeMountPoint' which shows the mount point of the database volume under C:\ExchangeVolumes

Better tracking around mount path and ExchangeVolume path

Increased autoreseed copy limits (previously 4, now 8)
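The new field can be inspected from the copy status output; a sketch, assuming database DB1 with a copy on MBX1 (names illustrative; the wildcard picks up the mount point fields without assuming their exact set):

```powershell
# Show the volume mount point fields, including the CU1-added
# ExchangeVolumeMountPoint, for one database copy.
Get-MailboxDatabaseCopyStatus -Identity "DB1\MBX1" |
    Format-List Name, *VolumeMountPoint*
```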

Page 21: Exchange Server 2013 High Availability - Site Resilience

Automatic Recovery From Storage Failures

Page 22: Exchange Server 2013 High Availability - Site Resilience

Storage Challenges

Storage controllers are essentially mini-PCs and they can crash/hang

Other operator-recoverable conditions can occur
Loss of vital system elements
Hung or highly latent IO

Page 23: Exchange Server 2013 High Availability - Site Resilience

Storage Innovations

Exchange Server 2013 includes functionality to automatically recover from a variety of new storage-related failures

Innovations added in Exchange 2010 also carried forward

Even more behaviors added in CU1

Page 24: Exchange Server 2013 High Availability - Site Resilience

Automatic Recovery from Storage Failures

Exchange Server 2010

ESE Database Hung IO (240s)

Failure Item Channel Heartbeat (30s)

SystemDisk Heartbeat (120s)

Exchange Server 2013

System Bad State (302s)

Long I/O times (41s)

MSExchangeRepl.exe memory threshold (4GB)

System Bus Reset (Event 129)

Replication service endpoints not responding

Page 25: Exchange Server 2013 High Availability - Site Resilience

Lagged Copy Challenges

Page 26: Exchange Server 2013 High Availability - Site Resilience

Lagged Copy Challenges

Activation is difficult
Require manual care
Cannot be page patched

Page 27: Exchange Server 2013 High Availability - Site Resilience

Lagged Copy Innovations

Automatic play down of log files in critical situations
Integration with Safety Net

Page 28: Exchange Server 2013 High Availability - Site Resilience

Lagged Copy Innovations

Automatic log play down
Low disk space (enable in registry)
Page patching (enabled by default)
Fewer than 3 other healthy copies (enable in AD; configure in registry)

Simpler activation with Safety Net
No need for log surgery or hunting for the point of corruption
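With Safety Net integration, activating a lagged copy no longer requires replaying logs to a chosen point. A sketch of one common activation path, assuming DB1 has a lagged copy on MBX4 (names illustrative; -SkipLagChecks is an Exchange 2013 parameter of Move-ActiveMailboxDatabase):

```powershell
# Suspend the lagged copy, then activate it, skipping the lag checks;
# missing messages are resubmitted from Safety Net after the mount.
Suspend-MailboxDatabaseCopy -Identity "DB1\MBX4" -Confirm:$false
Move-ActiveMailboxDatabase -Identity DB1 -ActivateOnServer MBX4 -SkipLagChecks
```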

Page 29: Exchange Server 2013 High Availability - Site Resilience

High Availability

Page 30: Exchange Server 2013 High Availability - Site Resilience

High Availability Challenges

High availability focuses on database health
Best copy selection is insufficient for the new architecture
Management challenges around maintenance and DAG network configuration

Page 31: Exchange Server 2013 High Availability - Site Resilience

High Availability Innovations

Managed Availability
Best Copy and Server Selection
DAG Network Autoconfig

Page 32: Exchange Server 2013 High Availability - Site Resilience

Managed Availability

Page 33: Exchange Server 2013 High Availability - Site Resilience

Managed Availability

Key tenet for Exchange 2013
All access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active copy of the user’s mailbox
If a protocol is down on a Mailbox server, all active databases lose access via that protocol

Managed Availability was introduced to detect these kinds of failures and automatically correct them
For most protocols, quick recovery is achieved via a restart action
If the restart action fails, a failover can be triggered

Page 34: Exchange Server 2013 High Availability - Site Resilience

Managed Availability

An internal framework used by component teams

Sequencing mechanism to control when recovery actions are taken versus alerting and escalation

Includes a mechanism for taking servers in/out of service (maintenance mode)

Enhancement to best copy selection algorithm

Page 35: Exchange Server 2013 High Availability - Site Resilience

Managed Availability

MA failovers come in two forms
Server: protocol failure can trigger a server failover
Database: Store-detected database failure can trigger a database failover

MA includes the Single Copy Alert
Alert is per-server to reduce alert flow
Still triggered across all machines with copies
Monitoring triggered through a notification
Logs 4138 (red) and 4139 (green) events based on 4113 (red) and 4114 (green) events
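The red/green events can be queried with Get-WinEvent; a sketch, with the event IDs taken from the slide but the crimson channel name an assumption that may differ per deployment:

```powershell
# Pull the most recent database redundancy alert events (4138 = red,
# 4139 = green) from the high availability monitoring channel.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-HighAvailability/Monitoring'
    Id      = 4138, 4139
} -MaxEvents 20
```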

Page 36: Exchange Server 2013 High Availability - Site Resilience

Best Copy and Server Selection

Page 37: Exchange Server 2013 High Availability - Site Resilience

Best Copy Selection Challenges

Process for finding the “best” copy of a specific database to activate

Exchange 2010 uses several criteria
Copy queue length
Replay queue length
Database copy status – including activation blocked
Content index status

Not good enough for Exchange Server 2013, because protocol health is not considered

Page 38: Exchange Server 2013 High Availability - Site Resilience

Best Copy and Server Selection

Still an Active Manager algorithm, performed at failover time based on extracted health of the system
Replication health still determined by the same criteria and phases

Criteria now include the health of the entire protocol stack
Considers a prioritized protocol health set in the selection
Four priorities – critical, high, medium, low (all health sets have a priority)
Failover responders trigger added checks to select a “protocol not worse” target

Page 39: Exchange Server 2013 High Availability - Site Resilience

Best Copy and Server Selection

All Healthy
Checks for a server hosting a copy that has all health sets in a healthy state

Up to Normal Healthy
Checks for a server hosting a copy that has all health sets Medium and above in a healthy state

All Better than Source
Checks for a server hosting a copy that has health sets in a state that is better than the current server hosting the affected copy

Same as Source
Checks for a server hosting a copy of the affected database that has health sets in a state that is the same as the current server hosting the affected copy
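The health-set states these checks consume can be inspected with the Exchange 2013 Get-ServerHealth cmdlet; a sketch, with the server name illustrative:

```powershell
# Summarize the health sets on a Mailbox server, grouped by alert state,
# to see roughly what best copy and server selection would see.
Get-ServerHealth -Identity MBX1 |
    Group-Object AlertValue |
    Format-Table Name, Count -AutoSize
```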

Page 40: Exchange Server 2013 High Availability - Site Resilience

DAG Network Autoconfig

Page 41: Exchange Server 2013 High Availability - Site Resilience

DAG Network Autoconfig

Automatic or manual DAG network configuration
Default is automatic
Requires specific configuration settings on MAPI and Replication network interfaces

Manual edits and EAC controls are blocked when automatic networking is enabled
Set the DAG to manual network setup to edit or change DAG networks

DAG networks are automatically collapsed in a multi-subnet environment
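Switching a DAG out of automatic network configuration is a single property change; a sketch, assuming a DAG named DAG1:

```powershell
# Enable manual DAG network configuration so DAG networks can be
# edited in the EAC or with the *-DatabaseAvailabilityGroupNetwork cmdlets.
Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $true
```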

Page 42: Exchange Server 2013 High Availability - Site Resilience
Page 43: Exchange Server 2013 High Availability - Site Resilience

Site Resilience

Page 44: Exchange Server 2013 High Availability - Site Resilience

Site Resilience Challenges

Operationally complex
Mailbox and Client Access recovery connected
Namespace is a SPOF (single point of failure)

Page 45: Exchange Server 2013 High Availability - Site Resilience

Site Resilience Innovations

Operationally simplified
Mailbox and Client Access recovery independent
Namespace provides redundancy

Page 46: Exchange Server 2013 High Availability - Site Resilience

Site Resilience

Key characteristics
DNS resolves to multiple IP addresses
Almost all protocol access in Exchange 2013 is HTTP
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Admins can switch over by removing a VIP from DNS
Namespace no longer a SPOF
No dealing with DNS latency
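Resolving one namespace to multiple VIPs is just two A records; a sketch using the Windows DnsServer module (zone name and addresses are illustrative, matching the later diagrams):

```powershell
# Publish both datacenter VIPs under the single mail.contoso.com
# namespace, so HTTP clients can fail over between them on their own.
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "mail" -IPv4Address "192.168.1.50"
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "mail" -IPv4Address "10.0.1.50"
```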

Page 47: Exchange Server 2013 High Availability - Site Resilience

Site Resilience

Previously, loss of a CAS, the CAS array, the VIP, the load balancer, or some portion of the DAG required the admin to perform a datacenter switchover

In Exchange Server 2013, recovery happens automaticallyThe admin focuses on fixing the issue, instead of restoring service

Page 48: Exchange Server 2013 High Availability - Site Resilience

Site Resilience

Previously, CAS and Mailbox server recovery were tied together in site recoveries

In Exchange Server 2013, recovery is independent, and may come automatically in the form of failoverThis is dependent on the customer’s business requirements and configuration

Page 49: Exchange Server 2013 High Availability - Site Resilience

Site Resilience

With the namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, de-coupling of CAS and Mailbox by AD site, and load balancing changes…

if available, three locations can simplify mailbox recovery in response to datacenter-level events

Page 50: Exchange Server 2013 High Availability - Site Resilience

Site Resilience

You must have at least three locations
Two locations with Exchange; one with the witness server

Exchange sites must be well-connected

Witness server site must be isolated from network failures affecting Exchange sites

Page 51: Exchange Server 2013 High Availability - Site Resilience

primary datacenter: Redmond | alternate datacenter: Portland

Site Resilience

[Diagram: cas1 and cas2 in Redmond behind VIP 192.168.1.50; cas3 and cas4 in Portland behind VIP 10.0.1.50]

mail.contoso.com: 192.168.1.50, 10.0.1.50

Page 52: Exchange Server 2013 High Availability - Site Resilience

primary datacenter: Redmond | alternate datacenter: Portland

Site Resilience

[Diagram: the Redmond VIP 192.168.1.50 has failed (X); the Portland VIP 10.0.1.50 remains in service]

mail.contoso.com: 192.168.1.50, 10.0.1.50

Removing the failing IP from DNS puts you in control of the in-service time of the VIP
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP(s)

mail.contoso.com: 10.0.1.50

Page 53: Exchange Server 2013 High Availability - Site Resilience

primary datacenter: Redmond | alternate datacenter: Portland | third datacenter: Paris

Site Resilience

[Diagram: dag1 spans mbx1 and mbx2 in Redmond (failed, X) and mbx3 and mbx4 in Portland; the witness server is in Paris]

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur

Page 54: Exchange Server 2013 High Availability - Site Resilience

primary datacenter: Redmond | alternate datacenter: Portland

Site Resilience

[Diagram: dag1 with its witness; mbx1, mbx2, and the witness have failed (XXX), leaving mbx3 and mbx4 in Portland without quorum]

Page 55: Exchange Server 2013 High Availability - Site Resilience

primary datacenter: Redmond | alternate datacenter: Portland

Site Resilience

[Diagram: dag1; Redmond (X) and the original witness are down; mbx3 and mbx4 in Portland are activated using the alternate witness]

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 –ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Clussvc
3. Activate the DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 –ActiveDirectorySite:Portland
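The switchover steps above, as they might be run in the Exchange Management Shell (DAG and site names are illustrative; -ConfigurationOnly is used because the failed servers are unreachable, and the Cluster service is stopped via Stop-Service):

```powershell
# 1. Mark the failed Redmond servers as down, updating only AD/config
#    state since the servers themselves cannot be contacted.
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Redmond -ConfigurationOnly

# 2. Stop the Cluster service on each remaining DAG member.
Stop-Service ClusSvc

# 3. Activate the surviving members in Portland, shrinking quorum and
#    switching to the alternate witness if one is configured.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Portland
```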

Page 56: Exchange Server 2013 High Availability - Site Resilience

Questions?

Scott Schnoll, Principal Technical Writer
[email protected]
http://aka.ms/schnoll
