Avanade: Time for Uptime
Bernt Lervik, Infrastructure Architect, Avanade

Avanade is the leading technology integrator specialising in the Microsoft platform.
Our people help customers around the world maximise their IT investment and create comprehensive solutions that drive business results.
Additional information can be found at www.avanade.com

Agenda
Failover Clustering
- Basic principles of failover clustering
- Best practices for failover clustering
- Best practices for geo clustering
Database Mirroring
- How database mirroring works
- What influences performance
- Failover considerations
- Deployment considerations
- Best practices for database mirroring

Basic Principle of Failover Clustering
[Diagram: Node A and Node B, each running Microsoft Cluster Server, with a SQL Server Virtual Server. Other labels: Heartbeat, Public Network, Shared Disk Array*.]
* Disks in a shared-nothing configuration: each drive letter is owned by only one node at a time.

Best Practices for Failover Clustering
Increasing reliability for site/server/database failure
Understand the technology: multiple nodes access a shared disk array in a shared-nothing configuration.
Pros
- Provides redundancy at the server level
- Supports automatic detection and automatic failover
- Clients are unaware of where SQL Server is running: SQL Server runs as a “Virtual Server” on the Windows cluster
- Entire server is protected (there is ONLY one database/server, no external dependencies)
- Relatively easy once it’s set up and configured
Cons
- Only *one* copy of the database
  - The shared disk is a single point of failure; failover only helps when the shared disk is unaffected
  - No standby server (that is, no copy) for reporting
- Proprietary hardware can be expensive
Only choose hardware from the Microsoft Windows Server Catalog, Cluster Solutions: http://www.microsoft.com/windows/catalog/server/default.aspx?subID=22&xslt=categoryProduct&pgn=8b712458-b91c-4a7d-8695-23e9cd3ae95b

Best Practices for Geo Clustering
Increasing reliability for site/server/database failure
Understand the technology: based on a common hardware strategy called remote mirroring, where disk activity is mirrored remotely to a second copy of the database. Can cause the secondary system to be corrupt if write order and block size are not preserved (be sure to choose a supported configuration).
Pros
- Provides zero to minimal data loss through redundant storage area networks/arrays, providing redundancy at the system level
- Removes single point of failure when compared to failover clustering
Cons
- Performance may be impacted in synchronous (no data loss) configurations, but data loss is possible in high performance configurations (there’s your key trade-off in a mirroring solution)
- Proprietary hardware can be expensive
Only choose hardware from the Windows Server Catalog, Geographically Dispersed Cluster Solution Category: http://www.microsoft.com/windows/catalog/server/default.aspx?subID=22&xslt=categoryProduct&pgn=b55095f4-71f3-4b26-98b1-05f3a9506d0d

How Database Mirroring Works: No Mirroring
[Diagram: Application, SQL Server (Principal), Log, Data. Numbered steps: 1, 2, >2, 3. Commit.]

How Database Mirroring Works: Asynchronous Mirroring
[Diagram: Application; Principal and Mirror, each with SQL Server, Log, and Data. Numbered steps: 1, 2, >2, >>2, >>>2, 3. Commit.]

How Database Mirroring Works: Synchronous Mirroring
[Diagram: Application; Principal and Mirror, each with SQL Server, Log, and Data; Witness. Numbered steps: 1, 2, 2.1, 3, >2, >3, 4, 5. Commit.]

Transaction Safety
Synchronous: SAFETY FULL (default)
- ALTER DATABASE <database name> SET PARTNER SAFETY FULL
- Guaranteed protection of data
- High availability mode (with a witness) or high protection mode (without a witness)
- Allows automatic failover (with a witness)
Asynchronous: SAFETY OFF
- ALTER DATABASE <database name> SET PARTNER SAFETY OFF
- Potential loss of data in the event of failure
- High performance mode
- Failover requires forcing service

What Influences Performance? Synchronous Mirroring
[Diagram: the synchronous mirroring diagram again (Application, Principal, Mirror, Witness; steps 1, 2, 2.1, 3, >2, >3, 4, 5; Commit), with a callout.]
Callout: The most important factor is the log generation rate.

What Influences Performance?
- Log generation rate
- Network latency and bandwidth
- Transaction safety level
- Number of concurrent user connections
- Transaction size and volume
- Hardware sizing (spindles, spindles, spindles)

Test Workloads
Characteristic                                          Workload1  Workload2
Database size (GB)                                      40         20
Number of concurrent user connections                   1000       20
Maximum think time between transactions (sec)           4          0
Baseline (No Mirroring) %CPU                            4          40
Baseline (No Mirroring) transactions/sec                241        215
Baseline (No Mirroring) log generation rate (KB/sec)    720        12000

Transaction Safety vs. Performance: Workload1
[Chart: Transaction Throughput for Workload1. Y axis: transactions/sec (0 to 250). X axis: transaction safety level (No Mirroring, Safety OFF, Safety FULL). Data labels as extracted: 238, 241, 241.]
Marginal impact when log generation rate is low.

Transaction Safety vs. Performance: Workload2
[Chart: Transaction Throughput for Workload2. Y axis: transactions/sec (0 to 250). X axis: transaction safety level (No Mirroring, Safety OFF, Safety FULL). Data labels as extracted: 215, 211, 158.]
More impact when log generation rate is high.
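
To put the Workload2 chart in relative terms, the following is a minimal sketch that converts the throughput figures into percentage change against the no-mirroring baseline. It assumes the three data labels (215, 211, 158) appear in the same order as the chart categories; the function name is illustrative only.

```python
def throughput_change_pct(baseline_tps, mirrored_tps):
    """Relative change in throughput versus the no-mirroring baseline, in percent."""
    return (mirrored_tps - baseline_tps) / baseline_tps * 100.0


# Assumption: data labels map to the categories in chart order.
workload2_tps = {"No Mirroring": 215, "Safety OFF": 211, "Safety FULL": 158}

baseline = workload2_tps["No Mirroring"]
changes = {
    level: round(throughput_change_pct(baseline, tps), 1)
    for level, tps in workload2_tps.items()
}

for level, pct in changes.items():
    print(f"{level}: {pct:+.1f}%")
```

Under that assumption, asynchronous mirroring costs about 2% of throughput for this workload and synchronous mirroring about 27%.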

Impact of Network Latency: Synchronous with Workload1
[Chart: Synchronous Mirroring with Network Latency for Workload1. X axis: round trip time (ms): 2, 14, 20, 50, 100, 200. Left Y axis: transactions/sec (0 to 250). Right Y axis: response time (sec) (0 to 12). Series: Transactions/sec, Response Time (sec).]
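
One way to reason about why latency matters: with synchronous mirroring, a commit is not acknowledged until the mirror has confirmed the log write, so each commit waits at least one network round trip. The sketch below is a deliberately simplified model, not a reproduction of the measured chart: it gives the upper bound on commits per second for a single connection that issues transactions back to back. The function name and the `local_commit_ms` parameter are assumptions for illustration.

```python
def max_commits_per_sec_per_connection(rtt_ms, local_commit_ms=0.0):
    """Upper bound on serial commits/sec for one connection.

    Simplified model: every commit costs one network round trip plus an
    optional local commit time; nothing else is accounted for.
    """
    total_ms = rtt_ms + local_commit_ms
    if total_ms <= 0:
        raise ValueError("total commit time must be positive")
    return 1000.0 / total_ms


# Round trip times taken from the chart's X axis.
RTTS_MS = [2, 14, 20, 50, 100, 200]
ceilings = {rtt: max_commits_per_sec_per_connection(rtt) for rtt in RTTS_MS}

for rtt, ceiling in ceilings.items():
    print(f"RTT {rtt:>3} ms -> at most {ceiling:.1f} commits/sec per connection")
```

The model also hints at why workloads with few connections are hit harder: many concurrent connections can overlap their waits, while a single connection cannot.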

Impact of Network Bandwidth: Synchronous with Workload1
[Chart: Synchronous Mirroring with varied Network Bandwidth for Workload1. X axis: network bandwidth (Mbps): 1, 10, 100, 1000. Left Y axis: transactions/sec (0 to 300). Right Y axis: response time (sec) (0 to 20). Series: Transactions/sec, Response Time (sec).]
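
A rough sanity check on bandwidth is to convert the log generation rate into megabits per second and compare it with the link speed. The sketch below does that for the Workload1 baseline of 720 KB/sec. It assumes 1 KB = 1000 bytes and 1 Mbps = 1,000,000 bits/sec, and it ignores protocol overhead and any other traffic on the link, so treat the result as a lower bound on what the link must carry.

```python
def required_mbps(log_rate_kb_per_sec):
    """Convert a log generation rate in KB/sec to Mbps (1 KB = 1000 bytes assumed)."""
    return log_rate_kb_per_sec * 8 / 1000


WORKLOAD1_LOG_KB_PER_SEC = 720       # baseline from the test workload table
LINK_SPEEDS_MBPS = [1, 10, 100, 1000]  # X axis of the bandwidth chart

needed = required_mbps(WORKLOAD1_LOG_KB_PER_SEC)
link_is_sufficient = {mbps: mbps >= needed for mbps in LINK_SPEEDS_MBPS}

print(f"Workload1 needs at least {needed:.2f} Mbps for log traffic")
for mbps, ok in link_is_sufficient.items():
    print(f"{mbps:>4} Mbps link: {'enough' if ok else 'too small'}")
```

By the same arithmetic, the Workload2 baseline of 12000 KB/sec corresponds to 96 Mbps.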

Failover Considerations
- Failover is at a database level
  - No group / instance failover
- Data outside the database is not propagated
  - Master: logins, user-written stored procedures, etc.
  - MSDB: jobs, histories, etc.

Events During an Automatic Failover
[Timeline diagram, labels along the Time axis: Failure occurs; Time to detect failure; Failure detected; Time to coordinate with witness; Decide to failover; Redo Phase; Redo Complete; Fixed overhead; Database available; Undo Phase.]
The time from detecting the failure of the principal to the time the mirror assumes the role of the principal is the database failover time.

Failure Detection for Automatic Failover
Two different types of failures
- SQL Server
  - Partners ping each other once a second
  - By default, if 10 “pings” are missed, a failure is declared
- Outside SQL Server
  - Operating system
  - Network errors
  - IO errors
  - Process errors
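
The ping rule above can be sketched as a small counter. This is an illustration of the rule as stated on the slide, not SQL Server's actual implementation; it assumes the 10 missed pings must be consecutive, and the function and parameter names are made up for the example.

```python
def seconds_until_failure_declared(ping_results, ping_interval_s=1, missed_limit=10):
    """Return the elapsed seconds at which a failure is declared, or None.

    ping_results: iterable of booleans, one per ping, True if the partner answered.
    Assumption: only consecutive missed pings count toward the limit.
    """
    missed = 0
    for ping_number, answered in enumerate(ping_results, start=1):
        missed = 0 if answered else missed + 1
        if missed >= missed_limit:
            return ping_number * ping_interval_s
    return None


# Partner answers 5 pings, then goes silent.
pings = [True] * 5 + [False] * 10
print(seconds_until_failure_declared(pings))
```

With the defaults from the slide (one ping per second, 10 missed pings), a partner that goes completely silent is declared failed about 10 seconds later.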

Examples of Failures: Fast
- SQL Server instance crashes
  - Endpoint closes port quickly
- Network retry from partner quickly fails
  - OS says that the port is closed
- Fast failure!
- Failover begins in seconds

Examples of Failures: Not as fast
- Catastrophic server failure
  - Power supply fails
- Network retry from partner waits for timeout
- SQL Server “ping” will most likely fail first
- Failover begins in 10 seconds

Examples of Failures: Slower…
- Someone pulls the log drive on the principal
  - Pending IOs to the log drive queue up
- SQL Server “pings” are working fine
- After 20 seconds, SQL Server issues an IO warning
- After 40 seconds, SQL Server declares an IO failure
- Failover begins 40 seconds after the log drive is pulled

Examples of Failures: Either no failover or fast failover
- Database page fails checksum
  - Client connection is broken
  - Transaction rolls back automatically
  - No failover
- Database page fails checksum
  - Transaction was in the middle of a rollback
  - Now the database is inconsistent
  - Database goes SUSPECT
  - Fast failover!!!

Issues with Extended Disconnects
Long disconnects
- Mirror unavailable → DISCONNECTED
- Mirroring session suspended → SUSPENDED
- Log records keep accumulating at the principal
- The transaction log can NOT be truncated, even if you back up the transaction log
- May eventually fill up the transaction log space, and the database comes to a halt
- Look at the LOG_REUSE_WAIT_DESC column in sys.databases
- RESUME the mirroring session, or break it (manually resynchronize via backup/copy/restore, then resume mirroring, just as when you set up mirroring)
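
To get a feel for how quickly an extended disconnect becomes a problem, the sketch below estimates how long the principal can keep accumulating log before the log space is exhausted. The 10 GB of free log space is a hypothetical figure chosen for illustration; the log rates are the baseline rates from the test workloads. It assumes a constant log generation rate, 1 GB = 1024 × 1024 KB, and no file autogrowth.

```python
def seconds_until_log_full(free_log_space_gb, log_rate_kb_per_sec):
    """Estimate seconds until the free log space is consumed at a constant log rate."""
    free_kb = free_log_space_gb * 1024 * 1024
    return free_kb / log_rate_kb_per_sec


FREE_LOG_SPACE_GB = 10  # hypothetical value, not from the slides

workload1_s = seconds_until_log_full(FREE_LOG_SPACE_GB, 720)
workload2_s = seconds_until_log_full(FREE_LOG_SPACE_GB, 12000)

print(f"Workload1 (720 KB/sec): about {workload1_s / 3600:.1f} hours")
print(f"Workload2 (12000 KB/sec): about {workload2_s / 60:.1f} minutes")
```

Under these assumptions the low-log-rate workload has roughly four hours of headroom, while the high-log-rate workload has under fifteen minutes.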

Deployment Considerations
Customer stories
- Mission-critical applications deploy synchronous with a witness
- For DR, customers deploy asynchronous with great success
- Some customers want synchronous, but prefer manual failover
  - Multiple databases
  - Corporate IT policies demand human involvement
- Start simple with asynchronous mirroring
- Increase complexity as needed, one step at a time
  - Turn on synchronous
  - Add a witness

Summary Performance Considerations
- Applications generating more transaction log experience higher performance impact with database mirroring
- Applications with fewer connections experience more impact on transaction throughput when synchronous mirroring is turned on
- Applications with smaller transaction size experience relatively larger performance impact with database mirroring
- Applications with low transaction log rate may sustain acceptable throughput with slight reduction in network bandwidth or slight increase in network latency
- Applications with high transaction log rate may experience severe performance degradation with lower network bandwidth or higher network latency
- While using asynchronous mirroring, monitor the send queue to determine the possible data loss in the event of failure of the principal
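
For the last point, one hedged way to interpret a send queue reading is to translate it into seconds of recent work at risk. The sketch below divides the unsent log by the log generation rate. The send queue value is hypothetical, the rate is the Workload1 baseline, and the conversion assumes the log rate was constant while the queue built up.

```python
def seconds_of_work_at_risk(send_queue_kb, log_rate_kb_per_sec):
    """Approximate how many seconds of logged work are still unsent to the mirror."""
    if log_rate_kb_per_sec <= 0:
        raise ValueError("log rate must be positive")
    return send_queue_kb / log_rate_kb_per_sec


SEND_QUEUE_KB = 3600          # hypothetical monitoring reading
LOG_RATE_KB_PER_SEC = 720     # Workload1 baseline log generation rate

at_risk_s = seconds_of_work_at_risk(SEND_QUEUE_KB, LOG_RATE_KB_PER_SEC)
print(f"Roughly {at_risk_s:.0f} seconds of work would be lost if the principal failed now")
```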

Best Practices for Database Mirroring
Increasing reliability for site/server/database failure
Understand the technology: communicates through a dedicated TCP endpoint and continuously sends transactional information to the mirror (copy) database.
Pros
- Provides zero to minimal data loss through a configured database mirroring partnership, which includes a copy of the database (database redundancy)
- Removes single point of failure when compared to failover clustering
- No hardware dependencies
- Includes transparent client redirect for better client connection management
Cons
- Performance may be impacted in synchronous (no data loss) configurations, but data loss is possible in high performance configurations (again, the key trade-off in any mirroring solution)
Understand the configurations
- High Availability: synchronous mirroring, automatic detection/failover, no data loss
- High Protection: synchronous mirroring, manual failover, no data loss
- High Performance: asynchronous mirroring, manual failover, some data loss possible
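
The three configurations follow from two settings discussed earlier: the transaction safety level and whether a witness is present. The sketch below encodes that mapping as a lookup. It is an illustration of the slide content only; the function name and the returned dictionary shape are assumptions, and SAFETY OFF is treated as High Performance whether or not a witness exists.

```python
def operating_mode(safety, witness):
    """Map a safety level ('FULL' or 'OFF') and witness presence to a configuration."""
    safety = safety.upper()
    if safety == "OFF":
        return {"mode": "High Performance", "mirroring": "asynchronous",
                "failover": "manual", "data_loss_possible": True}
    if safety == "FULL":
        if witness:
            return {"mode": "High Availability", "mirroring": "synchronous",
                    "failover": "automatic", "data_loss_possible": False}
        return {"mode": "High Protection", "mirroring": "synchronous",
                "failover": "manual", "data_loss_possible": False}
    raise ValueError(f"unknown safety level: {safety!r}")


for safety, witness in [("FULL", True), ("FULL", False), ("OFF", False)]:
    config = operating_mode(safety, witness)
    print(f"SAFETY {safety}, witness={witness}: {config['mode']} "
          f"({config['mirroring']}, {config['failover']} failover)")
```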

Summary Best Practices Recommendations
- Start simple (asynchronous), then gradually increase complexity to synchronous without a witness (therefore without automatic detection/automatic failover), and then add the witness
- If you are not interested in automatic failover, don’t set up a witness
- Understand the performance and availability requirements of the application
- Synchronous database mirroring is “generally” not recommended for a remote mirror
- Keep the mirror prepared for a failover by transferring the logins, jobs, etc.
- Test performance implications thoroughly before setting up in production
- Test performance over the network before deploying mirroring between two geographically distant servers
- Test failover with different failure scenarios