Oracle Active Data Guard Performance
Joseph Meeks, Director, Product Management, Oracle High Availability Systems
Note to viewer
These slides provide various aspects of performance data for Data Guard and Active Data Guard – we are in the process of updating them for Oracle Database 12c.
They can be shared with customers, but are not intended as a canned presentation ready to deliver in its entirety.
They provide SCs with data that can be used to substantiate Data Guard performance or to provide focused answers to particular concerns that customers may express.
Note to viewer
See this FAQ for more customer and sales collateral– http://database.us.oracle.com/pls/htmldb/f?
p=301:75:101451461043366::::P75_ID,P75_AREAID:21704,2
Agenda – Data Guard Performance
Failover and Switchover Timings
SYNC Transport Performance
ASYNC Transport Performance
Primary Performance with Multiple Standby Databases
Redo Transport Compression
Standby Apply Performance
Data Guard 12.1 Example - Faster Failover
[Chart] Failover time vs. number of database sessions on primary and standby: 43 seconds and 48 seconds, each with 2,000 sessions on both primary and standby (preliminary results).
Data Guard 12.1 Example – Faster Switchover
[Chart] Switchover time vs. number of database sessions on primary and standby: 83 seconds with 500 sessions and 72 seconds with 1,000 sessions on both primary and standby (preliminary results).
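For context (not part of the timed tests above), role transitions like these are typically driven through the Data Guard broker's DGMGRL interface; the standby name below is hypothetical:

    DGMGRL> SWITCHOVER TO 'chicago_stby';   -- planned role reversal, no data loss
    DGMGRL> FAILOVER TO 'chicago_stby';     -- unplanned transition when the primary is lost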
Agenda – Data Guard Performance
Failover and Switchover Timings
SYNC Transport Performance
ASYNC Transport Performance
Primary Performance with Multiple Standby Databases
Redo Transport Compression
Standby Apply Performance
Synchronous Redo Transport
Primary database performance is impacted by the total round-trip time for an acknowledgement to be received from the standby database:
– The Data Guard NSS process transmits redo to the standby directly from the log buffer, in parallel with the local log file write
– The standby receives the redo, writes it to a standby redo log file (SRL), then returns an ACK
– The primary receives the standby ACK, then acknowledges commit success to the application
The following performance tests show the impact of SYNC transport on the primary database using various workloads and latencies. In all cases, transport was able to keep pace with generation – no lag.
We are working on test data for Fast Sync (SYNC NOAFFIRM) in Oracle Database 12c (same process as above, but the standby ACKs the primary as soon as redo is received in memory – it does not wait for the SRL write).
Zero Data Loss
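As an illustrative sketch (service and database names are hypothetical, not from these tests), SYNC transport is configured through the LOG_ARCHIVE_DEST_n attributes: AFFIRM requires the SRL write before the ACK, while NOAFFIRM enables 12c Fast Sync.

    -- Zero data loss: standby ACKs after the SRL write completes
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=boston_stby SYNC AFFIRM NET_TIMEOUT=30
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston_stby';

    -- Oracle Database 12c Fast Sync: standby ACKs once redo reaches memory
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=boston_stby SYNC NOAFFIRM NET_TIMEOUT=30
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston_stby';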
Test 1) Synchronous Redo Transport
Workload:
– Random small inserts (OLTP) to 9 tables with 787 commits per second
– 132 KB redo size, 1,368 logical reads, 692 block changes per transaction
Sun Fire X4800 M2 (Exadata X2-8):
– 1 TB RAM, 64 cores, Oracle Database 11.2.0.3, Oracle Linux
– InfiniBand, seven Exadata cells, Exadata Software 11.2.3.2
Exadata Smart Flash Cache, Smart Flash Logging, and Write-Back Flash Cache were enabled and provided significant gains.
OLTP with Random Small Inserts, < 1 ms RTT Network Latency
Test 1) Synchronous Redo Transport
Local standby, < 1 ms RTT:
– 99 MB/sec redo rate
– < 1% impact on database throughput
– 1% impact on transaction rate
OLTP with Random Small Inserts and < 1 ms RTT Network Latency
[Chart] Transaction rate and redo rate: 104,143,368.00 with Data Guard transport disabled vs. 104,051,368.80 with Data Guard synchronous transport enabled. RTT = network round-trip time.
Test 2) Synchronous Redo Transport
Exadata X2-8, 2-node RAC database:
– Smart Flash Logging and Write-Back Flash Cache
Swingbench OLTP workload:
– Random DMLs, 1 ms think time, 400 users, 6,000+ transactions per second, 30 MB/sec peak redo rate (different from Test 1)
Transaction profile:
– 5 KB redo size, 120 logical reads, 30 block changes per transaction
1 and 5 ms RTT network latency
Swingbench OLTP Workload with Metro-Area Network Latency
Test 2) Synchronous Redo Transport
– 30 MB/sec redo
– 3% impact at 1 ms RTT
– 5% impact at 5 ms RTT
Swingbench OLTP Workload with Metro-Area Network Latency
[Chart] Swingbench OLTP transactions per second: 6,363 tps baseline (no Data Guard), 6,151 tps with Data Guard SYNC at 1 ms RTT network latency, 6,077 tps with Data Guard SYNC at 5 ms RTT network latency.
Test 3) Synchronous Redo Transport
Exadata X2-8, 2-node RAC database:
– Smart Flash Logging and Write-Back Flash Cache
Large-insert OLTP workload:
– 180+ transactions per second, 83 MB/sec peak redo rate, random tables
Transaction profile:
– 440 KB redo size, 6,000 logical reads, 2,100 block changes per transaction
1, 2, and 5 ms RTT network latency
Large Insert OLTP Workload with Metro-Area Network Latency
Test 3) Synchronous Redo Transport
– 83 MB/sec redo
– < 1% impact at 1 ms RTT
– 7% impact at 2 ms RTT
– 12% impact at 5 ms RTT
Large Insert OLTP Workload with Metro-Area Network Latency
[Chart] Large-insert OLTP transactions per second: 189 tps baseline (no Data Guard), 188 tps at 1 ms RTT, 177 tps at 2 ms RTT, 167 tps at 5 ms RTT.
Test 4) Synchronous Redo Transport
Exadata X2-8, 2-node RAC database:
– Smart Flash Logging and Write-Back Flash Cache
Mixed workload with high TPS – Swingbench plus large-insert workloads:
– 26,000+ transactions per second and 112 MB/sec peak redo rate
Transaction profile:
– 4 KB redo size, 51 logical reads, 22 block changes per transaction
1, 2, and 5 ms RTT network latency
Mixed OLTP Workload with Metro-Area Network Latency
Test 4) Synchronous Redo Transport
– Swingbench plus large insert
– 112 MB/sec redo
– 3% impact at < 1 ms RTT
– 5% impact at 2 ms RTT
– 6% impact at 5 ms RTT
Mixed OLTP Workload with Metro-Area Network Latency
[Chart] Transaction rate and redo rate vs. SYNC network latency (no SYNC, 0 ms, 2 ms, 5 ms, 10 ms, 20 ms). Note: 0 ms latency on the graph represents values falling in the range < 1 ms.
Additional SYNC Configuration Details
No system bottlenecks (CPU, I/O, or memory) were encountered during any of the test runs:
– Primary and standby databases had 4 GB online redo logs
– Log buffer was set to the maximum of 256 MB
– OS max TCP socket buffer size was set to 128 MB on both primary and standby
– Oracle Net was configured on both sides to send and receive 128 MB, with an SDU of 32K (see the sqlnet.ora sketch below)
– Redo was shipped over a 10GigE network between the two systems
– Approximately 8-12 checkpoints/log switches occurred per run
For the Previous Series of Synchronous Transport Tests
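A minimal sketch of the corresponding Oracle Net settings, assuming they were applied in sqlnet.ora on both primary and standby (the slides do not show the exact files used):

    # sqlnet.ora – 128 MB send/receive buffers and a 32K SDU, per the details above
    DEFAULT_SDU_SIZE=32767
    SEND_BUF_SIZE=134217728
    RECV_BUF_SIZE=134217728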
Customer References for SYNC Transport
Fannie Mae case study that includes performance data
Other SYNC references:
– Amazon
– Intel
– MorphoTrak – prior biometrics division of Motorola; case study, podcast, presentation
– Enterprise Holdings
– Discover Financial Services; podcast, presentation
– Paychex
– VocaLink
Synchronous Redo Transport
Redo rates achieved are influenced by network latency, redo-write size, and commit concurrency – in a dynamic relationship with each other that will vary for every environment and application.
Test results illustrate how an example workload can scale with minimal impact on primary database performance.
Actual mileage will vary with each application and environment. Oracle recommends that customers conduct their own tests using their own workload and environment; Oracle tests are not a substitute.
Caveat that Applies to ALL SYNC Performance Comparisons
Agenda
Failover and Switchover Timings
SYNC Transport Performance
ASYNC Transport Performance
Primary Performance with Multiple Standby Databases
Redo Transport Compression
Standby Apply Performance
Asynchronous Redo Transport
With ASYNC, the primary does not wait for an acknowledgement from the standby. The Data Guard NSA process transmits directly from the log buffer, in parallel with the local log file write:
– NSA reads from disk (the online redo log file) if the log buffer is recycled before redo transmission is complete
ASYNC has minimal impact on primary database performance.
Network latency has little, if any, impact on transport throughput:
– Uses the Data Guard 11g streaming protocol and correctly sized TCP send/receive buffers
Performance tests are useful to characterize the maximum redo volume that ASYNC is able to support without transport lag:
– The goal is to ship redo as fast as it is generated, without impacting primary performance
Near Zero Data Loss
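A sketch of an ASYNC destination, with hypothetical service and database names:

    -- Primary commits without waiting for any standby acknowledgement
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=remote_stby ASYNC NOAFFIRM
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=remote_stby';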
Asynchronous Test Configuration
Details:
– 100 GB online redo logs
– Log buffer set to the maximum of 256 MB
– OS max TCP socket buffer size set to 128 MB on primary and standby
– Oracle Net configured on both sides to send and receive 128 MB
– Read buffer size set to 256 (_log_read_buffer_size=256) and archive buffers set to 256 (_log_archive_buffers=256) on primary and standby
– Redo is shipped over the InfiniBand network between primary and standby nodes (ensures that transport is not bandwidth constrained): near-zero network latency, approximate throughput of 1,200 MB/sec
ASYNC Redo Transport Performance Test
Data Guard ASYNC transport can sustain very high rates:
– 484 MB/sec on a single node
– Zero transport lag
Add RAC nodes to scale transport performance:
– Each node generates its own redo thread and has a dedicated Data Guard transport process
– Performance will scale as nodes are added, assuming adequate CPU, I/O, and network resources
A 10GigE NIC on the standby receives data at a maximum of 1.2 GB/second:
– The standby can be configured to receive redo across two or more instances
[Chart] Redo transport rate, single instance: 484 MB/sec (Oracle Database 11.2).
Data Guard 11g Streaming Network Protocol
The streaming protocol is new with Data Guard 11g. The test measured throughput with 0 – 100 ms RTT.
ASYNC tuning best practices (see the worked example below):
– Set the correct TCP send/receive buffer size = 3 x BDP (bandwidth-delay product), where BDP = bandwidth x round-trip network latency
– Increase the log buffer size if needed to keep the NSA process reading from memory
– See support note 951152.1; use X$LOGBUF_READHIST to determine the buffer hit rate
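As a worked example (link figures assumed for illustration, not taken from these tests): on a 1 Gbit/sec link with 50 ms RTT, BDP = 125 MB/sec x 0.05 sec = 6.25 MB, so the recommended TCP socket buffer size is 3 x 6.25 MB, roughly 18.75 MB.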
High Network Latency has Negligible Impact on Network Throughput
[Chart] ASYNC redo transport rate (MB/sec) at 0 ms, 25 ms, and 50 ms network latency.
Agenda
Failover and Switchover Timings
SYNC Transport Performance
ASYNC Transport Performance
Primary Performance with Multiple Standby Databases
Redo Transport Compression
Standby Apply Performance
Multi-Standby Configuration
A growing number of customers use multi-standby Data Guard configurations. Additional standbys are used for:
– Local zero-data-loss HA failover combined with remote DR
– Rolling maintenance to reduce planned downtime
– Offloading backups, reporting, and recovery from the primary
– Reader farms – scale read-only performance
This leads to the question: how is primary database performance affected as the number of remote transport destinations increases?
[Diagram] Primary (A) ships redo via SYNC to a local standby (B) and via ASYNC to a remote standby (C).
Redo Transport in Multi-Standby Configuration
[Chart] Primary Performance Impact: 14 Asynchronous Transport Destinations. Increase in CPU and change in redo volume, each compared to baseline, as destinations grow from 0 to 14.
Redo Transport in Multi-Standby Configuration
[Chart] Primary Performance Impact: 1 SYNC and Multiple ASYNC Destinations. Increase in CPU and change in redo volume, each compared to baseline, for zero destinations, 1 SYNC/0 ASYNC, 1 SYNC/1 ASYNC, and 1 SYNC/14 ASYNC destinations.
Redo Transport for Gap Resolution
Standby databases can be configured to request log files needed to resolve gaps from other standbys in a multi-standby configuration (see the FAL_SERVER sketch below).
A standby database that is local to the primary database is normally the preferred location to service gap requests:
– A local standby database is least likely to be impacted by network outages
– Other standbys are listed next
– The primary database services gap requests only as a last resort
– Utilizing a standby for gap resolution avoids any overhead on the primary database
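This preference order is expressed on each standby through the FAL_SERVER parameter; the Oracle Net service names below are hypothetical:

    -- On a remote standby: fetch gaps from the local standby first, then the primary as a last resort
    ALTER SYSTEM SET FAL_SERVER = 'local_stby_b, primary_a';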
Agenda
Failover and Switchover Timings
SYNC Transport Performance
ASYNC Transport Performance
Primary Performance with Multiple Standby Databases
Redo Transport Compression
Standby Apply Performance
Redo Transport Compression
Test configuration:
– 12.5 MB/second bandwidth
– 22 MB/second redo volume
Uncompressed volume exceeds available bandwidth:
– Recovery Point Objective (RPO) impossible to achieve
– Perpetual increase in transport lag
A 50% compression ratio results in:
– Volume < bandwidth = RPO achieved
– The ratio will vary across workloads
Requires the Advanced Compression option (see the sketch below)
Conserve Bandwidth and Improve RPO when Bandwidth Constrained
[Chart] Transport lag (MB) vs. elapsed time (minutes): 22 MB/sec uncompressed vs. 12 MB/sec compressed.
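A sketch of enabling transport compression on a destination (hypothetical service and database names; requires a license for the Advanced Compression option):

    -- Compress redo in flight to fit the 12.5 MB/sec link in the scenario above
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=remote_stby ASYNC COMPRESSION=ENABLE
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=remote_stby';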
Agenda
Failover and Switchover Timings
SYNC Transport Performance
ASYNC Transport Performance
Primary Performance with Multiple Standby Databases
Redo Transport Compression
Standby Apply Performance
Standby Apply Performance Test
Redo apply was first disabled to accumulate a large number of log files at the standby database. Redo apply was then restarted to evaluate the maximum apply rate for this workload (see the monitoring query below).
All standby log files were written to disk in the Fast Recovery Area.
Exadata Write-Back Flash Cache increased the redo apply rate from 72 MB/second to 174 MB/second using the test workload (Oracle 11.2.0.3):
– Apply rates will vary based upon platform and workload
Achieved volumes do not represent physical limits:
– They only represent the particular test case configuration and workload; higher apply rates have been achieved in practice by production customers
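Apply rates like those cited here can be observed on the standby with the standard V$RECOVERY_PROGRESS view, for example:

    -- Run on the standby while redo apply is active; rates are reported per the UNITS column
    SELECT item, units, sofar
      FROM v$recovery_progress
     WHERE item IN ('Active Apply Rate', 'Average Apply Rate');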
Apply Performance at Standby Database
Test 1: no Write-Back Flash Cache
– On Exadata X2-2 quarter rack
– Swingbench OLTP workload
– 72 MB/second apply rate
– I/O bound during checkpoints: 1,762 ms for checkpoint complete, 110 ms DB File Parallel Write
Apply Performance at Standby Database
Test 2: a repeat of the previous test, but with Write-Back Flash Cache enabled
– On Exadata X2-2 quarter rack
– Swingbench OLTP workload
– 174 MB/second apply rate
– Checkpoint completes in 633 ms vs. 1,762 ms
– DB File Parallel Write is 21 ms vs. 110 ms
Two Production Customer Examples
Thomson Reuters:
– Data warehouse on Exadata, prior to Write-Back Flash Cache
– While resolving a gap, observed an average apply rate of 580 MB/second
Allstate Insurance:
– Data warehouse ETL processing resulted in an average apply rate over a 3-hour period of 668 MB/second, with peaks hitting 900 MB/second
Data Guard Redo Apply Performance
Redo Apply Performance for Different Releases
[Chart] Range of observed standby apply rates (MB/sec, 0 – 600) for high-end batch and high-end OLTP workloads.