
Oracle9i Data Guard: Primary Site and Network Configuration Best Practices
An Oracle White Paper
June 2006

Maximum Availability Architecture

Oracle Best Practices For High Availability


Executive Summary
Best Practices
    Local Area Network
    Metropolitan and Wide Area Network
Test Environment
    Database Server Hardware and Software
    Client Driver Hardware and Software
    Network
Test Description
    Transaction Profiles
    Data Guard Transport Test Cases
    Performance Metrics
Test Results
    LAN Test Results
    MAN / WAN Test Results
Troubleshooting Data Guard Performance Issues
    Data Guard Database Metrics
    Diagnosing Performance Issues
    Data Guard Database Wait Event Profiles
Example Scenarios
    Scenario 1: Business Requirement - DR with Zero Data Loss
    Scenario 2: Business Requirement - DR and Optimum Performance
Conclusion
Appendix
    A. Database Parameters
    B. Oracle Net Services Configuration
    C. Standby Database Monitoring Script
    D. Operating System Monitoring
    E. 9i Release 2 ASYNC Transport Details
    F. Network Throughput and Peak Redo Rates


EXECUTIVE SUMMARY

With Oracle Data Guard, Oracle addresses the business continuity requirements of disaster recovery. Because Data Guard transmits primary database changes to a standby database connected over a network, the performance and degree of protection of the databases in a Data Guard configuration depend on how the databases and the network are configured. This white paper establishes configuration best practices for Data Guard, using Data Guard Redo Apply, that optimize system performance while maximizing data protection.

Oracle's High Availability Systems team executed a series of tests over a local area network (LAN) and simulated metropolitan area networks (MAN) and wide area networks (WAN). The series of tests used different application profile workloads to determine Data Guard configuration best practices. The team determined that Data Guard's zero data loss synchronous log transport (i.e. the Maximum Protection and Maximum Availability modes) is recommended for a configuration in which the standby database is located up to hundreds of miles away from the production database, providing maximum protection and availability advantages with very little impact on the production database. Data Guard's asynchronous log transport (i.e. the Maximum Performance mode) is recommended for a configuration in which the network distance is up to thousands of miles, providing continual maximum performance, while minimizing the risks of transaction loss in the event of a disaster.1

Based on the test results, the following are realistic scenarios:

1. Building a Data Guard environment between Tokyo and Kyoto (229 miles / 366 kilometers, ~ 7ms RTT) using synchronous mode will provide zero data loss with minimal performance impact (less than 3% in tests). The performance impact is dependent on the application commit frequency. Our tests showed a higher commit frequency incurred more impact to the primary (2-3% more in our 400 user tests).

2. Building a Data Guard environment between San Francisco and New York (2582 miles / 4157 kilometers, ~ 78ms RTT) using asynchronous mode will have almost no performance impact (1% in tests) and minimal application impact in the event of a disaster. On average, the transaction loss potential was 0.6-1.3 seconds and the worst-case loss was 6-13 seconds worth of transactions in our tests.

1 These general best practices should apply to most customer environments. The results you experience will depend on your redo generation rate and commit rate. Testing with your own application(s) and systems is recommended to validate your service level requirements.


Note that this is not a benchmark of the Oracle9i Data Guard redo apply database feature. These tests did not attempt to determine the maximum rate at which redo can be transferred and applied.


BEST PRACTICES

These best practices were derived after extensive testing on Oracle9i databases as part of the studies within the Maximum Availability Architecture (MAA) best practices and recommendations. For more information about MAA, refer to http://otn.oracle.com/deploy/availability/htdocs/maa.htm. Descriptions of the test environment, test cases, and test results used to identify these best practices are included in subsequent sections of this paper.

The following experiments are documented:

• Performance of the different transport services in a LAN and MAN/WAN network environment

• Impact of different network round trip times (RTT) on primary database throughput

• Effect of using Secure Shell (SSH) port forwarding with compression on primary database throughput and network utilization

The Data Guard transport service options were tested in a LAN and an emulated MAN / WAN network environment using high transaction rates of 180 and 360 transactions per second (TPS) and redo rates ranging from 0.7 to 1.6 MB/sec for OLTP tests and 4.5 MB/sec for our batch runs.

Local Area Network

Data Guard is often used in a local area network (LAN) to protect against data loss due to human error or data failure. LAN testing was done on an internal private network with a network RTT of approximately 0.2 milliseconds (ms). When using Data Guard Redo Apply in a LAN the following is recommended:

• Use Maximum Protection or Maximum Availability modes for zero data loss; the performance impact was less than 3% in all synchronous tests.

• For very good performance and a minimal risk of transaction loss in the event of a disaster, use Maximum Performance mode, with LGWR ASYNC and a 10 MB async buffer (ASYNC=20480). LGWR ASYNC performance degraded no more than 1% as compared to using the ARCH transport. LGWR ASYNC also bounds the risk of potential transaction loss much better than the ARCH transport. The 10 MB async buffer outperformed smaller buffer sizes and reduced the chance of network timeout errors in a high latency / low bandwidth network.

Metropolitan and Wide Area Network

Data Guard is used across metropolitan area networks (MANs) or WANs to provide complete disaster recovery protection. Typically a MAN covers a large metropolitan area and has network round-trip times (RTT) of 2-10 ms. For the MAN/WAN tests, different network RTTs were simulated during testing to measure the impact of the RTT on primary database performance.


The tests were conducted for the following RTTs: 2 ms (MAN), 10 ms, 50 ms, and 100 ms (WAN). Additionally, tests using Secure Shell (SSH) port forwarding with compression were done for the different RTTs. The best practice recommendations are:

• Use Maximum Protection and Maximum Availability modes over a MAN for zero data loss. For these modes, the network RTT overhead over a WAN can impact response time and throughput of the primary database. The performance impact was less than 6% with a 10 ms network RTT and a high transaction rate.

• For very good performance and a minimal risk of transaction loss in the event of a disaster, use Maximum Performance mode, with LGWR ASYNC and a 10 MB async buffer (ASYNC=20480). LGWR ASYNC performance degraded no more than 2% as compared to remote archiving. The 10 MB async buffer outperformed smaller buffer sizes and reduced the chance of network timeout errors in a high latency / low bandwidth network.

• For optimal primary database performance throughput, use remote archiving (i.e. the ARCH process as the log transport). This configuration is best used when network bandwidth is limited and when your applications can risk some transaction loss in the event of a disaster.

• If you have sufficient memory, then set the TCP send and receive buffer sizes (these affect the advertised TCP window sizes) to the bandwidth delay product, the bandwidth times the network round trip time. This can improve transfer time to the standby by as much as 10 times, especially with the ARCH transport.

• Set SDU=32767 (32K) for the Oracle Net connections between the primary and standby. Setting the Oracle network services session data unit (SDU) to its maximum setting of 32K resulted in a 5% throughput improvement over the default setting of 2048 (2K) for LGWR ASYNC transport services and a 10% improvement for the LGWR SYNC transport service.

• Use SSH port forwarding with compression for WAN’s with a large RTT when using maximum performance mode. Do not use SSH with compression for Maximum Protection and Maximum Availability modes since it adversely affected the primary throughput. Using SSH port forwarding with compression reduced the network traffic by 23-60% at a 3-6% increase in CPU usage. This also eliminated network timeout errors. With the ARCH transport, using SSH also reduced the log transfer time for RTT’s of 50 ms or greater. For RTT’s of 10ms or less, the ARCH transport log transfer time was increased when using SSH with compression.


TEST ENVIRONMENT

The environment was set up to follow the recommendations from the Maximum Availability Architecture (MAA) paper. See http://otn.oracle.com/deploy/availability/htdocs/maa.htm for details on those recommendations. The database initialization parameters used are in Appendix A and the Oracle Net services files used are in Appendix B.

Database Server Hardware and Software

The profile of the database is a warehouse environment with a simplified OLTP transaction profile, approximately 200 GB in size with full archive log files of 256 MB. The production system consists of a 2-node Oracle Real Application Clusters (RAC) configuration on two Sun E4500 servers. Each node has 8 CPUs and 8 GB RAM, with the database files striped across EMC Symmetrix storage with 36 spindles over two controllers. Storage was configured using the Stripe And Mirror Everything (SAME)2 methodology with a stripe size of 1 MB.

The standby system was configured identically to the primary system; the managed recovery process (MRP) was run on a single node of the standby Real Application Clusters system.

Hardware

2-node Primary RAC cluster, 2-node Standby RAC cluster. Each node in the RAC Primary and the RAC Standby clusters has the following configuration:

• 8 x 400 MHz CPUs per node, 8 GB memory

• Sun Solaris 2.8 64-bit (SunOS 5.8 Generic_108528-12 sun4u SPARC SUNW, Ultra-Enterprise)

• EMC SYMMETRIX-SUNAPE Shared disk configured following the SAME2 methodology, using a 1MB stripe

• Archive destinations on a clustered file system using the SAME methodology, 1 MB stripe size

Software

• Sun Cluster 3.0

• Oracle Enterprise Edition Release 9.2.0.3.0 - Production with the Partitioning and Real Application Clusters options

• 256 MB redo log files

2 For more information about SAME, refer to http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf


Client Driver Hardware and Software

Hardware

• 2 x 360 MHz SPARCv9 processors per node, 2 GB memory

• Sun Solaris 2.8

Software

• Oracle Enterprise Edition Release 9.2.0.3.0 - Production

• Client driver software

Network

The following operating system TCP send and receive buffer size settings were made on each database server node using /usr/sbin/ndd:

• tcp_xmit_hiwat = 32767 (TCP send buffer size)

• tcp_recv_hiwat = 32767 (TCP receive buffer size)

The size of the receive buffer is the maximum size of the advertised window for that connection. The optimal setting for tcp_xmit_hiwat and tcp_recv_hiwat corresponds to the bandwidth delay product, which is the bandwidth of the network link between the primary and standby systems multiplied by the packet round trip time. See "Setting the TCP Send and Receive Buffer Sizes" under "Test Results" for more details.

LAN Tests

• Dedicated 100 Mbps private network between primary and standby.

• 100 Mbps switched network between clients and database servers.

MAN/WAN Tests

The MAN/WAN environment was simulated using Shunra\Storm. Shunra\Storm from Shunra3 Software re-creates a global network environment in the test lab, and was used in our tests to emulate different network latencies with a constant bandwidth of 100 Mbps. Network latency is the time it takes a signal to propagate from one machine to the other. The bandwidth available for communication between the two machines is the number of bits/second that can be transferred across the network. The emulations produced results that are in line with expectations when FTP testing is used to measure file transfer Round-Trip-Times (RTT's).

The ‘traceroute’ utility was used to measure the network RTT, the time it takes a signal to propagate to one machine and back. The Shunra\Storm appliance was installed between the primary and the standby hosts and acted as a bridge. Thus, the WAN simulation had only one network hop.

In general, the RTT was double the latency setting made in the Shunra\Storm appliance; for example, setting the Shunra\Storm latency to 50 ms yielded a 100 ms RTT.

3 http://shunra.com/products/storm/storm_1.php


The WAN bandwidth was limited to 100 Mbps by the network interface cards (NICs) on each host. See Appendix F for details on calculating the maximum network throughput.

Ordinarily, one would expect to see a correlation between latency and distance, a general rule of thumb being 33 miles (53 km) per 1 millisecond of RTT. This ratio will vary depending on the network configuration. Distance (propagation delay) does affect the latency, which in turn affects the network RTT. However, other factors also impact the network RTT, such as the number of repeaters, network traffic, poor system performance, and network congestion. Therefore, rather than give a ratio of network latency to distance, the test results include the network RTT. The network RTT can easily be measured with the 'traceroute' utility, and the network RTT metric can be a basis of comparison with other environments.

Using the Oracle Net Services Session Data Unit (SDU) parameter

Preliminary testing showed that setting the Oracle Net Services Session Data Unit (SDU) parameter to its maximum value of 32767 (i.e. 32K) provided about a 5% gain in throughput. Thus, all documented tests used the SDU=32767 for the Oracle Net connections between the primary and standby. The client connections to the primary did not set SDU.

The SDU parameter needs to be set at the listener and connection levels, i.e., in the tnsnames.ora and listener.ora on the primary and the standby. When using the SDU setting, dynamic instance registration cannot be used prior to release 9.2.0.4.

With 9.2.0.4 and later, you need to set the DEFAULT_SDU_SIZE parameter in the sqlnet.ora file to the desired SDU; this parameter is new in 9.2.0.4.

Prior to 9.2.0.4, using dynamic instance registration with SDU overrides any SDU setting to be the default SDU of 2048 (2K). If you’re using a release prior to 9.2.0.4, then to ensure that the appropriate SDU is being used:

• Use static instance registration

• Do not use the default port of 1521 since that automatically attempts to dynamically register. Our tests used port 1508.

Appendix B has example network services files for using SDU under the static registration section.
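As a quick illustration of the shape of those files (the host name, SID, and Oracle home below are hypothetical; Appendix B contains the actual files used in these tests), the SDU is set in the connect descriptor on the primary and in the static SID entry on the standby listener:

tnsnames.ora entry on the primary (the alias referenced by the log transport SERVICE attribute):

  DGPERF =
    (DESCRIPTION =
      (SDU = 32767)
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby_host)(PORT = 1508))
      (CONNECT_DATA = (SID = DGPERF1)))

listener.ora on the standby, statically registering the standby instance:

  LISTENER_DG =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby_host)(PORT = 1508)))

  SID_LIST_LISTENER_DG =
    (SID_LIST =
      (SID_DESC =
        (SDU = 32767)
        (SID_NAME = DGPERF1)
        (ORACLE_HOME = /u01/app/oracle/product/9.2.0)))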

Static instance registration prevents the use of advanced Oracle Net features such as connect-time failover (using a list of addresses), connection load balancing, and transparent application failover. With a RAC standby, the Oracle Net connect-time failover option can be used so that if the first standby instance (the first node in the Oracle Net TNS address list) fails then the primary will reconnect to the next instance in the Oracle Net TNS address list for Maximum Availability or Maximum Performance protection modes. In this case, the performance gain of SDU needs to be evaluated against the requirement for Oracle network services connect-time failover. An example of using connect-time failover is in Appendix B under the dynamic registration section.


TEST DESCRIPTION

Each of the test cases described under the "Data Guard Transport Test Cases" section below was run in the LAN. Both OLTP test cases were also run in MAN/WAN environments. All tests used a 2-node RAC primary database and a 2-node RAC standby database with one standby database instance utilized as the apply instance.

Transaction Profiles

This test environment included three simulated transaction loads labeled SQL*Loader, OLTP Large (360 TPS), and OLTP Small (180 TPS). A 200+ GB database was used. The same database was used with different transaction profiles, varying the TPS/redo rate by changing the number of clients and using a 3 second think time. Think time is the maximum wait time between transactions for each session. These tests used 256 MB online redo and standby redo logs. The redo rate and TPS listed here are for the ARCH transport service, which was used as the baseline of comparison for each profile's test results. The redo rate is the sum of the average "Redo size" statistic from each primary node during the peak load, captured from three 5-minute Statspack snapshots per node. Likewise, the TPS is the sum of the average "Transactions" statistic from the Statspack snapshots. The CPU percentage is the average CPU utilization of each primary node ((node-1 CPU + node-2 CPU) / 2).
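For reference, the cumulative values behind the "Redo size" and "Transactions" statistics are exposed in V$SYSSTAT; a sketch of sampling them directly is shown below (the results in this paper were taken from Statspack snapshot deltas, not from this query):

  -- Run on each primary instance at the start and end of an interval;
  -- divide the deltas by the elapsed seconds to get redo bytes/sec and TPS.
  SELECT name, value
  FROM   v$sysstat
  WHERE  name IN ('redo size', 'user commits');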

Table 1 LAN Transaction Test Profiles below describes the different tests, their combined 2-node production redo generation rates, TPS, and the CPU utilization per node. The OLTP type tests updated, queried, and inserted into several large partitioned tables (millions of rows, many indexes).


Table 1 LAN Transaction Test Profiles

SQL*Loader (redo rate 4.6 MB/sec, TPS N/A, CPU 25%)
This test consists of a SQL*Loader direct path load that inserts 312,500 records in APPEND mode for 4 distinct record sets in succession. After the 4 record sets are loaded, the loaded table is truncated and then the process restarts. This test runs on one node of the cluster. Changes are clustered together and primarily consist of sequentially changed blocks.

OLTP Large (redo rate 1.6 MB/sec, TPS 362, CPU 39%)
400 concurrent sessions (200/client/node), 3 second think time, and 2 types of transactions:
1. select; update cust table; commit
2. select; insert ordr table; insert ordl table; commit
Changes are scattered.

OLTP Small (redo rate 0.75 MB/sec, TPS 182, CPU 18%)
200 concurrent sessions (100/client/node), 3 second think time, and 2 types of transactions:
1. select; update cust table; commit
2. select; insert ordr table; insert ordl table; commit
Changes are scattered.

Data Guard Transport Test Cases

For each load profile, a set of three tests was run, and local archiving was done in all cases. Each of these tests was repeated twice and run for 20 minutes for the following service configurations:

1. Maximum Performance mode with ARCH remote archiving to a physical standby database with 2 ARCH processes (log_archive_max_processes=2): "service=DGPERF reopen=15 max_failure=10 arch optional"

2. Maximum Performance mode with LGWR ASYNC and a 10 MB async buffer: "service=DGPERF lgwr async=20480 reopen=15 max_failure=10 optional net_timeout=30"

3. Maximum Availability or Maximum Protection mode with LGWR SYNC: "service=DGPERF reopen=15 lgwr sync affirm optional"
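These attribute strings are the values given to a LOG_ARCHIVE_DEST_n initialization parameter on the primary database. A sketch of how test case 2 might look in the primary init.ora follows (the destination numbers and the local archive location are illustrative, not the exact settings from Appendix A):

  # Local archiving (used in all test cases)
  log_archive_dest_1       = 'LOCATION=/arch/dgperf'
  log_archive_dest_state_1 = enable

  # Test case 2: Maximum Performance with LGWR ASYNC and a 10 MB async buffer
  log_archive_dest_2       = 'SERVICE=DGPERF LGWR ASYNC=20480 REOPEN=15 MAX_FAILURE=10 OPTIONAL NET_TIMEOUT=30'
  log_archive_dest_state_2 = enable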


These test cases are referred to by their transport (ARCH, ASYNC, or SYNC) throughout the paper. The Data Guard protection mode is implied by the transport option:

1. ARCH or ASYNC implies Maximum Performance protection mode.

2. SYNC implies Maximum Availability or Maximum Protection mode.

Performance Metrics

These tests were all run using a 2-node RAC primary and a 2-node RAC standby with the client load driven by 2 Sun Ultra-60s with 2 x 360 MHz processors per Ultra-60. See the Test Environment section for complete testing environment details. Operating system (OS), network, and database performance statistics were gathered. The OS and network statistics were gathered using the SE Toolkit (http://www.setoolkit.com/) to capture CPU, memory, I/O, and network performance data for all hosts (production role database servers, standby database server, and clients). No OS bottlenecks were observed on the primary, standby, or client driver machines during any test.

The primary nodes' database statistics were captured via the RDBMS Statspack utility with snapshots at 5-minute intervals. For details on using Statspack, see the "Oracle9i Database Performance Tuning Guide and Reference" documentation. The database statistics are averages from the two peak snapshots in the third and fourth intervals (the 10-20 minute period) to allow for ramp-up time. Since a 2-node RAC was used, the individual node Statspack results were combined for each snapshot. The primary statistics being observed and compared were as follows:

• Redo rate was calculated from the average of the "Redo size" per second statistic value from the peak-load Statspack snapshots.

• Transaction rate was calculated from the average of the "Transactions" per second value from each peak-load Statspack snapshot.

• Database wait events were also analyzed as a comparative metric between tests. The "Top 5 Wait Events" section of the Statspack snapshots as well as the event average wait times were compared.

The standby database metrics were captured using the script in Appendix C.


TEST RESULTS

LAN Test Results

The different transaction profiles showed no more than a 3% decrease in throughput for the different transport services. All the LAN test volumes were within the 100 Mbps bandwidth capacity. The detailed table of test results is contained in Appendix G.

SQL*Loader

As the chart below shows, the SQL*Loader profile showed no more than a 2% difference for the different transport services. This can be attributed to the serial insert characteristic of the SQL*Loader direct path load and the fact that the write sizes to LGWR are consistently large as compared to the OLTP profile, which has smaller, more frequent writes and commits.

OLTP

The two OLTP profiles performed similarly to each other. For all the transaction profiles, the LGWR ASYNC option provided nearly the same optimal performance as the ARCH transport. The 10 MB async buffer provided the best throughput as compared to smaller async buffer sizes. The synchronous option used with Maximum Protection mode did have a small effect on throughput for the OLTP cases. For the OLTP synchronous tests, the throughput degradation is attributed to the network RTT; even though this is a LAN, there is still network latency. Running a UNIX 'traceroute' command showed the network RTT to be 0.2 milliseconds for the LAN. Comparing the OLTP transaction profiles, large and small, the synchronous transport throughput held at or above 97% of the ARCH transport throughput for both profiles: the OLTP Large redo rate was 97% of ARCH while the OLTP Small redo rate was 99% of ARCH. This can be attributed to the fact that increasing the number of sessions and the transaction rate also increases the redo write size (Redo size / Redo writes) by LGWR, which makes the synchronous LGWR network write more efficient, but the write also takes longer to send as the redo write size increases. This assumes that no new bottlenecks are incurred as the number of sessions is increased.

Lastly, the LGWR SYNC option did use slightly less CPU (2-7%) than the LGWR ASYNC and ARCH options. This is because there is a wait for the synchronous network I/O confirmation (SYNC AFFIRM).


[Figure 1 chart: throughput of the SQL*Loader, OLTP Large, and OLTP Small profiles under the ARCH, ASYNC, and SYNC transports, expressed as a percentage of ARCH throughput (0-100%).]

Figure 1 LAN Transaction Profile Comparison

LAN Network Profile

The Data Guard network throughput profile is dependent on whether ARCH or LGWR is used. This is important to understand in relation to the available network bandwidth, network traffic patterns, and the network RTT. The chart below shows the amount of network traffic output from a primary node during a 20-minute interval. As can be seen, the ARCH transport spikes the network traffic immediately following a log switch; the spike is large and short. The network profile for the LGWR transport service, ASYNC and SYNC, is a steady, smooth line. In a LAN with a dedicated network link between the primary and standby, each primary node's LGWR traffic is equal to the redo rate plus other network overhead. This is important information to discuss with your network architects and administrators when monitoring and planning your network. See Appendix F.

[Figure 2 chart: network output (KB/sec) from a primary node over a 20-minute run, plotted at 15-second intervals for the ARCH, ASYNC, and SYNC transports.]

Figure 2 LAN Network Profile


MAN / WAN Test Results

The same transport tests were executed for a MAN and WAN using different network round trip times (RTT) of 2 (MAN), 10, 50, and 100 milliseconds (WAN). Additionally, tests were run using SSH compression to identify the effects of compression. Guidelines for setting up Data Guard with SSH compression are detailed in MetaLink note 225633.1.
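The MetaLink note is the authoritative reference for the SSH setup; the general idea, sketched below with hypothetical host names and ports, is to start a compressed SSH tunnel from each primary node to the standby and point the Data Guard Oracle Net alias at the locally forwarded port:

  # Run on the primary host: forward local port 1508 to the standby listener port,
  # with compression (-C), no remote command (-N), running in the background (-f).
  ssh -C -N -f -L 1508:standby_host:1508 oracle@standby_host

  # The tnsnames.ora entry used by the log transport then points at
  # localhost:1508 so that redo traffic flows through the compressed tunnel.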

ARCH Transport

As the charts below show, the ARCH transport performed the best and the network RTT did not impact production database throughput. There should be a minimum of four production log groups per database instance to prevent LGWR from waiting for a group to be available following a log switch because a checkpoint has not yet completed or the group has not yet been archived. While ARCH has the least impact on the primary database throughput, if a disaster occurs, ARCH does have greater transaction loss risk than LGWR ASYNC or LGWR SYNC.

ASYNC Transport

With the ASYNC transport and differing buffer sizes, there was no significant difference in throughput. However, there was a difference in behavior depending on the transport option used. Starting with Oracle RDBMS patchset 3 (release 9.2.0.3), if the async network buffer becomes full and remains full for 5 seconds, then the transport will time out and convert to the ARCH transport for the remainder of that log. See Appendix E for further details on identifying this condition. In these tests, the buffer full condition occurred only for the 100 ms RTT in the 200 user tests (OLTP Small). In the 400 user tests (OLTP Large), the buffer full condition occurred in the 50 ms and 100 ms tests and intermittently in the 10 ms RTT tests. The buffer full condition occurred more frequently with the smaller async buffer sizes. Thus, using the largest async buffer size of 10 MB proved to avoid the buffer full timeout errors.

SYNC Transport

The SYNC transport tests were run with RTTs up to 10 ms. Going beyond a 10 ms RTT is plausible in many cases but should be tested in your own network configuration. Generally, due to the nature of synchronous communication, primary throughput decreases as the RTT increases. The SYNC tests performed at a high rate, with the 10 ms RTT test yielding a throughput that was 97% of ARCH for the 200-user test and 94% of ARCH for the 400-user test. These tests show that Data Guard zero data loss (maximum availability or maximum protection modes) is viable in a MAN.


[Figure 3 charts: average redo rate (KB/sec) for the ARCH, ASYNC, and SYNC transports in the 200 user (OLTP Small) and 400 user (OLTP Large) tests, for the LAN and for 2 ms, 10 ms, 50 ms, and 100 ms RTTs.]

Figure 3 MAN / WAN RTT Test Results

SSH with Compression

Again, proper setup of Data Guard with SSH compression is detailed in MetaLink note 225633.1. The SSH compression test reduced network traffic by 40-50% for the ASYNC 10 MB test and by 30-90% for the ARCH test. It also reduced network traffic for the SYNC transport tests, but it degraded primary throughput by 20% in the SYNC case. Thus, do NOT use SSH compression with SYNC and the maximum availability and protection modes. In addition to the reduced network traffic, the ARCH transport log transfer times were reduced by 15-30% for the 50 ms RTT tests and by more than 30% for the 100 ms tests. For RTTs of 10 ms or less, the ARCH transport log transfer time increased with SSH compression.


The different compression rates are due to the nature of the transport service. ARCH sends a complete file in consistent 1 MB chunks whereas LGWR ASYNC sends a maximum of 1 MB per transfer. SYNC sends varying sizes depending on the transaction load and the complete RTT for affirming the standby write. The reduction in network volume allowed the 10 MB ASYNC test to avoid conversion to ARCH due to async buffer full timeouts. This is important since exposure to transaction loss is minimized significantly if the ARCH conversion does not occur. The CPU usage increased by 6-9% when using SSH port forwarding with compression. A separate test using SSH without compression was run to identify how much additional CPU was used when using SSH just with encryption. That test showed that the SSH encryption processing overhead was 3-5% CPU. Thus, 50% of the SSH with compression overhead can be attributed to the encryption piece of SSH.

Using SSH with compression showed a higher compression rate in a higher network throughput environment. Higher network RTTs reduced the network throughput, which in turn reduced the SSH compression rate, as shown below for the 200 user tests.

[Figure 4 chart: percent reduction in network traffic from SSH compression for the ARCH and ASYNC transports at 2 ms, 10 ms, 50 ms, and 100 ms RTTs (200 user tests).]

Figure 4 SSH Compression Network Reduction

Use Oracle Net Session Data Unit (SDU)

Preliminary testing showed that setting the Oracle Net Services Session Data Unit (SDU) parameter to its maximum value of 32767 (i.e. 32K) provided about a 5% gain in throughput. Thus, all documented tests used the SDU=32767 for the Oracle Net connections between the primary and standby. Appendix B has example network services files for using SDU under the static registration section.

Increasing the SDU improves throughput on high-speed networks when large messages are being transmitted. The throughput gain is realized by reducing the number of system calls and the associated CPU.

Data Guard Primary Site and Network Configuration Best Practices Page 17

Page 18: Maximum Availability Architecture...standby database is located up to hundreds of miles away from the production database, providing maximum protection and availability advantages

Increasing SDU does increase memory usage. On low speed networks, the benefit is less significant, especially if the systems are not CPU bound. There is no benefit if the typical message size is less than the SDU.

The setting of SDU can be made even more precise by using the following formula:

(INT(Max. SDU / MSS) * MSS) + Oracle*Net overhead

e.g. (INT(32767 / 1460) * 1460) + 37 = 32120 + 37 = 32157

MSS = Maximum Segment Size, which is usually 1460 bytes on an Ethernet connection: a 1500-byte maximum transmission unit (MTU) minus a 20-byte TCP header and a 20-byte IP header. If window scaling is enabled (tcp_wscale_always=1 on Solaris, or the TCP send or receive buffer size is > 64K), then the MSS will be 1448.

Oracle*Net overhead = 37 bytes
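The same arithmetic can be checked from a shell prompt; a small sketch using the values from the example above:

  MSS=1460        # Ethernet MTU 1500 - 20 byte IP header - 20 byte TCP header
  OVERHEAD=37     # Oracle*Net overhead in bytes
  MAXSDU=32767

  # INT(MAXSDU / MSS) * MSS + OVERHEAD; expr truncates integer division,
  # so this prints 32157 for the values above.
  expr $MAXSDU / $MSS \* $MSS + $OVERHEAD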

An optimally tuned SDU setting will reduce the number of packets by ensuring that every packet, except perhaps the very last, is completely full. The table below is a brief summary of the effects of setting SDU as opposed to the default:

Table 2 SDU Setting Effects on Network Packets

SDU     Description                        MSS*    Avg. Packet Size           Packets Sent for a    % Reduction in Packets
                                                   (1 MB / total # of pkts)   1 MB ARCH Transfer    from SDU Default
------  ---------------------------------  ------  -------------------------  --------------------  ----------------------
2048    SDU default                        1448    1005                       1043                  0
32767   Maximum SDU setting                1448    1422                       737                   29
31893   Optimal SDU w/ window scaling      1448    1446                       725                   30
32157   Optimal SDU w/out window scaling   1460    1458                       719                   31

* If window scaling is enabled (tcp_wscale_always=1 on Solaris or the TCP send or receive buffer size is > 64K) then the MSS will be 1448, since 12 bytes of header are used.

This table illustrates that the efficiency of packing the data into packets is substantially improved by setting the SDU to its maximum value. As stated previously, CPU usage is also reduced since Oracle Net Services issues fewer writes and therefore fewer system calls.


As can be seen, fine-tuning the SDU to maximize packet filling only reduces the number of packets by a further 1-2%. It should also be noted that although the last entry, "Optimal SDU without window scaling", sends the fewest packets, it does so with TCP send and receive buffers of only 64K. Therefore, while it has the fewest packets, it will not have optimal TCP send and receive buffer sizes, and the transfer time will increase because the smaller buffers limit the amount of data in flight. A 1-2% reduction in the number of packets is not as beneficial as setting the TCP send and receive buffers to the bandwidth delay product, as described below.

Setting the TCP Send and Receive Buffer Sizes

TCP uses a sliding window algorithm to process data as it sends and receives. The details of this algorithm are discussed in Request for Comments (RFC) 793 and 1323. This sliding window method is inefficient when there is a large bandwidth delay product (BDP), the product of the estimated minimum bandwidth and the round trip time between two machines. This inefficiency can be improved by overriding the default TCP buffer size settings. The default buffer sizes must be changed on both the sender and the receiver. TCP buffer sizes should be set to the BDP to achieve maximum throughput. Increasing the TCP send and receive buffer sizes can be done at the host level or system-wide (all connections), and the buffers do consume memory, so consider this as well. Here is an example of a system-wide setting on Solaris 2.8:

1. Environment: Primary and secondary connected by a T3 (44.736 Mbps) link with a network RTT of 50 ms.

2. BDP = 44.736 Mbps * 50 ms = 44,736,000 bits/sec * 0.050 sec = 2,236,800 bits; 2,236,800 / 8 = 279,600 bytes

3. Check the TCP settings on both hosts by using the ‘ndd’ command:

echo tcp_xmit_hiwat = `/usr/sbin/ndd /dev/tcp tcp_xmit_hiwat`
echo tcp_recv_hiwat = `/usr/sbin/ndd /dev/tcp tcp_recv_hiwat`

Both hosts have the following settings:

tcp_xmit_hiwat = 16384
tcp_recv_hiwat = 24576

4. Change the settings on both hosts to the BDP of 279,600. This requires root privilege:

/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 279600

/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 279600

To change the buffer settings for any connections between a pair of hosts on Solaris, do the following:

Primary host = PRIM, Standby host name = STBY


1. On PRIM: ndd -set /dev/tcp tcp_host_param 'STBY sendspace 279600 recvspace 279600'

2. On STBY: ndd -set /dev/tcp tcp_host_param 'PRIM sendspace 279600 recvspace 279600'

These new settings take effect immediately for new connections and no reboot is required; however, existing connections are not affected. Thus, following these changes, the databases and listeners should also be restarted to obtain the new TCP buffer settings. To ensure these settings are maintained across system restarts, these commands should also be put in a system startup script, e.g. /etc/rc2.d/S69inet on Solaris. This host-to-host capability may not be available on other platforms.


TROUBLESHOOTING DATA GUARD PERFORMANCE ISSUES

Troubleshooting performance requires an end-to-end understanding of the Data Guard configuration and the proper measurement tools. The tools used in these tests are described in the "Test Description" section under "Performance Metrics" and in the appendices. To summarize, the database server statistics (CPU, memory, I/O, and network) and the database statistics all need to be measured periodically. These periodic measurements should also be kept historically for trend analysis and comparison against a "healthy" baseline.

The focus of this section is to:

• summarize the key Statspack database metrics for Data Guard, including wait events

• outline how to use the metrics to diagnose performance issues

• highlight what the typical wait event profiles were for each transport

For more comprehensive database and OS performance tuning documentation see the “Oracle9i Database Performance Planning” and the “Oracle 9i Performance Tuning Guide and Reference” manuals.

Before using Statspack, ensure the database initialization parameter, TIMED_STATISTICS, is set to TRUE. This allows the Statspack utility to collect important database timings.

Data Guard Database Metrics

Data Guard Database Statistics

The key Oracle statistics monitored were load based:

• Redo size – the amount of redo bytes generated during this report

• Transactions – TPS for the report, also equals ‘user commits’ statistic

• Redo writes – Number of redo writes made during this report

The 'redo size' divided by the 'redo writes' gives the average redo write size in bytes. This is an important metric for understanding the SYNC performance picture discussed later.
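Between Statspack snapshots, the same two statistics can be read directly from V$SYSSTAT; a sketch (these are cumulative values since instance startup, so deltas between two samples should be used for an interval average):

  SELECT r.value / w.value AS avg_redo_write_bytes
  FROM   (SELECT value FROM v$sysstat WHERE name = 'redo size')   r,
         (SELECT value FROM v$sysstat WHERE name = 'redo writes') w;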

Data Guard Wait Events

The Statspack report gives a "Top 5 Timed Events" summary near the start of the report. In addition to this summary, the average wait time for each event needs to be monitored as well. The average wait time is found in the wait event details further on in the Statspack report. The definitions of the wait events can be found in the "Oracle 9i Reference Guide" and in the 9.2.0.3 patch set release notes.


Diagnosing Performance Issues

With the proper monitoring in place as described above, and a baseline performance profile for "normal" operations to compare against, you have the foundation for diagnosing Data Guard performance issues.

Start the investigation at the OS level to verify that no OS bottlenecks exist. That investigation may eventually point to the database as the culprit, but do not immediately delve into database performance.

Data Guard primary database diagnosis checklist:

1. Validate OS performance metrics against baseline. This should be done on the primary and the standby.

2. Verify that the network is not experiencing errors or collisions. Use 'netstat -i' on UNIX.

3. Check the database alert log and the database dump destinations (bdump, udump, and cdump) to make sure that no errors are occurring. Frequent errors or disconnections from the standby RFS process(es) can cause added overhead on the primary.

4. Validate database performance metrics against baseline. Check redo rate, TPS, top 5 wait events, average wait event times for ‘log file parallel write’ and wait events pertinent to the transport being used as described below.
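A sketch of the sort of quick checks items 2 and 3 involve on Solaris (the dump destination path below is hypothetical and depends on the BACKGROUND_DUMP_DEST setting):

  # Item 2: check the Ierrs, Oerrs and Collis columns for each interface
  netstat -i

  # Item 3: scan the alert log in the background dump destination for recent errors
  grep "ORA-" /u01/app/oracle/admin/DGPERF/bdump/alert_DGPERF1.log | tail -20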

The above should suffice for diagnosing ARCH or ASYNC but SYNC requires further analysis. SYNC performance is affected by the standby performance and the network performance more directly. Below is a more detailed method for investigating SYNC performance issues.

Breaking Down SYNC Performance

The SYNC end-to-end performance profile includes the factors below, which make up the primary instance's 'LGWR wait on SENDREQ' (LWOS) time. An inline example is also included, showing the breakdown for a 200 user test with a 10 ms RTT. Part of this involves measuring the standby 'log file parallel write' (LFPW) wait event, which is the time it takes RFS to write to the standby redo log. This can be summarized as follows:

Primary LWOS time = network RTT to transfer the average redo write size (including the acknowledgement packet back) + standby LFPW time (the time for the standby to write the average redo write size)

• Primary average redo write size: This is 'redo size' / 'redo writes' from the Statspack snapshot. E.g. redo size = 115,453,376 bytes and redo writes = 16,668, giving an average redo write size of 6,926 bytes.

• Network RTT: This approximates the time for LGWR to transfer the average redo write size to the standby and to receive the acknowledgement back.


Use traceroute or ping with the redo write size as the packet size. E.g., using the redo write size of 6,926 bytes, obtain the average RTT:

traceroute 192.168.233.13 6926

traceroute to 192.168.233.13 (192.168.233.13), 30 hops max, 6926 byte packets
1 192.168.233.13 (192.168.233.13) 11.891 ms 11.461 ms 11.354 ms

The three times shown (11.891, 11.461, and 11.354 ms) are the traceroute times for the default 3 probe packets sent at each hop. These can be averaged to get the average network RTT for the specified packet size. In this traceroute there is only one hop.

• Standby LFPW time (the time for the standby to write the average redo write size): Using the standby database monitoring script shown in Appendix C, take the difference of two LFPW snapshots to get the average LFPW. The columns used from the two LFPW snapshots are TIME_WAITED_MICRO (TWM, the total wait time in microseconds) and TOTAL_WAITS (TW, the total number of waits). These values are cumulative. E.g.:

Standby LFPW = (TWM_T2 - TWM_T1) / (TW_T2 - TW_T1)

TWM_T1 = 50089160, TW_T1 = 82199
TWM_T2 = 62531532, TW_T2 = 88598

Standby LFPW = (62531532 - 50089160) / (88598 - 82199) = 1944 microseconds = 1.944 ms

Verify with your system administrator whether the average LFPW time is reasonable. Larger redo write sizes will take longer to write, but in general your average I/O time should be less than 5 ms. If not, you may need to tune I/O on the standby.

With this information in hand, the LWOS on the primary should be about 13 ms (11.354 ms + 1.944 ms). If the average wait time is higher than this, then the most likely culprit is the network. Verify with your network administrator whether there is sufficient bandwidth and whether the latency varies due to load or collisions on your network. This was an example of a normal system. In the case of a performance issue leading to this diagnosis, either the RTT of the redo-write-size packet or the standby LFPW time should show where the issue is: in the network or on the standby system. To add to the understanding of SYNC performance, the flow of a SYNC AFFIRM redo write is illustrated below in Figure 5. Note that the local online redo log write occurs asynchronously and that the reply is not sent until both writes are complete.


Figure 5 SYNC AFFIRM Redo Write Flow

Data Guard Database Wait Event Profiles

The different transport options (ARCH, LGWR ASYNC, and LGWR SYNC) have different wait event profiles. Below are sample wait event profiles taken from the 200-user OLTP test Statspack snapshots for a LAN during a 5-minute period in which no log switches occurred. There were no OS bottlenecks.

ARCH Wait Event Profile

The ARCH profile may vary during a log switch, but this has no impact on the primary database throughput. The 'ARCH wait on SENDREQ' wait event increases during a log switch period. This wait event increased in a WAN as the RTT increased, and it was reduced for network RTTs of 50 ms and 100 ms when using SSH with compression. The 'log file parallel write' average wait time was 1 ms across all RTT tests. Always ensure that there are enough online redo log groups to avoid any database hanging situations due to unavailable online redo log groups. A minimum of 4 online log groups will allow for a minimum of 3 log group buffers for the archiver(s) to catch up. It will also allow multiple archive processes (ARCn) to archive different groups simultaneously when the redo rate is high. Sizing the redo log groups is equally important.

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                            % Total
Event                                 Waits     Time (s)     Ela Time
------------------------------ ------------ ------------ ------------
CPU time                                             302        38.17
db file sequential read              56,814          261        33.00
control file sequential read        179,515          108        13.71
global cache open x                  31,226           30         3.84
log file sync                        27,814           29         3.65

Figure 6 ARCH LAN Wait Event Profile


Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                            % Total
Event                                 Waits     Time (s)     Ela Time
------------------------------ ------------ ------------ ------------
ARCH wait on SENDREQ                     90          294        31.29
CPU time                                             268        28.49
db file sequential read              57,590          243        25.87
log file sync                        28,062           35         3.76
global cache open x                  31,950           32         3.41

Figure 7 ARCH 100 ms RTT Wait Event Profile

ASYNC Wait Event Profile

The ASYNC profile did not vary during a log switch. The wait event profile did change in a WAN for the different RTT tests. The 'LNS wait on SENDREQ' wait event increased as the RTT increased. The 'log file parallel write' average wait time was 1 ms across all RTT tests.

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                            % Total
Event                                 Waits     Time (s)     Ela Time
------------------------------ ------------ ------------ ------------
CPU time                                             269        40.75
db file sequential read              58,295          252        38.16
log file sync                        28,289           31         4.70
global cache open x                  32,424           31         4.65
log file parallel write              27,704           23         3.46

Figure 8 ASYNC LAN Wait Event Profile

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                            % Total
Event                                 Waits     Time (s)     Ela Time
------------------------------ ------------ ------------ ------------
ARCH wait on SENDREQ                     91          297        31.98
CPU time                                             265        28.54
db file sequential read              57,485          243        26.16
global cache open x                  31,490           32         3.41
log file sync                        28,154           30         3.26

Figure 9 ASYNC 100 ms RTT Wait Event Profile (switched to ARCH)

Another wait event that may occur for the ASYNC transport is 'LGWR wait on full LNS buffer'. This wait event is new in release 9.2.0.3 and records the amount of time the log writer (LGWR) process spends waiting for the network server (LNS) to free up ASYNC buffer space. If buffer space is not freed in a reasonable amount of time, the archiver process (ARCn) transmits the redo log data instead, so the availability of the primary database is not compromised. Continual occurrence of this event can degrade primary throughput. If it is determined that this event is degrading performance, then test using SSH compression or switch to the ARCH transport. Frequent occurrence of this event is usually due to a network configuration issue. With the 100 ms RTT, the ASYNC transport timed out due to an ASYNC buffer full condition (see Appendix E) and consequently converted the transport to ARCH.
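One way to watch for this condition between Statspack snapshots is to check the cumulative statistics for the event on each primary instance; a sketch (a steadily growing TIME_WAITED_MICRO for this event suggests the ASYNC buffer is filling):

  SELECT total_waits,
         time_waited_micro,
         DECODE(total_waits, 0, NULL, time_waited_micro / total_waits) AS avg_wait_micro
  FROM   v$system_event
  WHERE  event = 'LGWR wait on full LNS buffer';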

SYNC Wait Event Profile

The SYNC profile does not vary during a log switch. The wait event profile did change in a MAN for the different RTT tests.


The 'LGWR wait on SENDREQ' and 'log file parallel write' wait events increased as the RTT increased. The 'LGWR wait on SENDREQ' event time is essentially equal to the 'log file parallel write' time. Additionally, the redo write size also increased as the RTT increased, which in turn increases the log file parallel write time, since a larger amount of redo is being sent and written and must wait for a standby write acknowledgement. This can be analyzed in detail as described above.

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                            % Total
Event                                 Waits     Time (s)     Ela Time
------------------------------ ------------ ------------ ------------
log file sync                        27,731          398        32.67
db file sequential read              56,300          301        24.70
CPU time                                             268        21.99
log file parallel write              23,833           78         6.44
LGWR wait on SENDREQ                 23,836           76         6.26

Figure 10 SYNC LAN Wait Event Profile

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                            % Total
Event                                 Waits     Time (s)     Ela Time
------------------------------ ------------ ------------ ------------
log file sync                        27,187        1,048        49.25
db file sequential read              55,329          292        13.71
CPU time                                             261        12.28
log file parallel write              14,293          226        10.60
LGWR wait on SENDREQ                 14,293          224        10.54

Figure 11 SYNC 10 ms RTT Wait Event Profile


EXAMPLE SCENARIOS

The following two scenarios illustrate how the best practices detailed in this paper might be applied in a real-world situation. These examples are intended to clarify application of the best practices. These scenarios also depend on optimally tuned systems that follow the best practices detailed in the Maximum Availability Architecture paper, http://otn.oracle.com/deploy/availability/htdocs/maa.htm.

Scenario 1: Business Requirement - DR with Zero Data Loss

Background

• Financial institution that requires a zero data loss disaster recovery solution

• OLTP system with an Oracle9i database

• Average transaction rate of about 200 transactions per second (TPS) with the size of each transaction approximately 2-3K (peak redo rate=800 K/sec)

• Long-term growth target of 1000 TPS

• A dedicated 1 Gbps network link is available between the primary and secondary data centers

• Another application with a 400 K/sec peak network utilization rate will be sharing the network connection

Analysis

A symmetric standby site (same hardware and software configuration) was placed 40 miles (64 km) from the primary, based on the network provider's network RTT SLA. Using 'traceroute' and 'ping', the network RTT has been determined to be 1 ms. The peak redo rate of the primary has been determined by looking at the "Redo size" statistic from RDBMS Statspack snapshots taken during peak transaction processing. The peak redo rate is 800 K/sec, which corresponds to 200 TPS (the "Transactions" statistic from the Statspack snapshots). Extrapolating to the long-term projected growth rate of 1000 TPS yields a peak redo rate of about 4 MB/sec. Using the rough calculation described in Appendix F, this translates to a long-term bandwidth requirement of about 48 Mbps. Even with the 1 Gbps network connection being shared with an application that has a peak network utilization rate of 400 K/sec, there should be adequate capacity for this growth.
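The 48 Mbps figure follows from the Appendix F rule of thumb; the SQL*Plus one-liner below is only a sketch of that arithmetic (4 MB/sec of projected redo, converted to bits and divided by the 0.7 usable-bandwidth factor):

select round((4 * 1024 * 1024 * 8) / 0.7 / 1000000, 1) as required_mbps
  from dual;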

An Oracle Data Guard test environment is installed at the primary and standby sites to test the impact of using maximum protection mode (LGWR SYNC AFFIRM). The test environment can simulate a 100 TPS load without any OS or database bottlenecks. A baseline test is run with just local archiving to measure system throughput without Data Guard, and then a test using Data Guard in maximum protection mode (LGWR SYNC AFFIRM) is run.


Comparing the baseline test to the Data Guard maximum protection mode test, a 2% decrease (98 TPS, 392 K/sec redo rate) in primary throughput is observed.

Conclusion

The 2% overhead for adding a zero data loss standby is well within the required service levels. Proper monitoring and trend analysis should be put in place to make sure service levels are maintained, including network performance and usage. Additional testing should also be done to quantify failover and switchover times for planned and unplanned outages.

Scenario 2: Business Requirement - DR and Optimum Performance

Background

• Manufacturing firm that can tolerate 30 minutes of transaction loss in the event of a disaster

• OLTP system with an Oracle9i database

• Transaction rate of about 125 transactions per second (TPS) with the size of each transaction approximately 2K

• The SLA of the primary has a 100 TPS performance requirement

• Online redo logs are 500 MB; log switches occur about every 40 minutes during normal operation and every 35 minutes during peak operation

• The network link is a leased T1 (1.538 Mbps) line between the primary (Boston) and secondary (Chicago) data centers

Analysis

The standby site in Chicago is a smaller system in terms of CPU power due to cost constraints, with the understanding that performance may be sacrificed if a failover is necessary. The network RTT has been determined to be 25 ms. The peak redo rate of the primary has been determined by looking at the "Redo size" statistic from RDBMS Statspack snapshots. The peak redo rate is 250 K/sec, which corresponds to 125 TPS. Using the rough calculation described in Appendix F with the 1.538 Mbps bandwidth, the network throughput capacity is about 135 K/sec. Based on the peak redo rate, this network link does not have enough capacity, and the network capacity cannot currently be upgraded.
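The 135 K/sec figure again follows from the Appendix F rule of thumb; the sketch below reproduces the arithmetic for the 1.538 Mbps T1 link:

select round((1538000 / 8) * 0.7) as max_throughput_bytes_per_sec
  from dual;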

Based on the SLA performance requirement of 100 TPS and the 30-minute maximum transaction loss tolerance, the maximum performance protection mode will be used. Also, given the network RTT, if LGWR ASYNC is used with a 10 MB buffer (ASYNC=20480) then the transport will most likely convert to ARCH according to the Data Guard performance study, but this should be validated through testing. With the limited bandwidth, using SSH port forwarding with compression will be investigated for both ASYNC and ARCH and compared with ASYNC and ARCH without compression.
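The destination setting being evaluated mirrors the Appendix A spfile entry for the 10 MB ASYNC buffer (ASYNC=20480 512-byte blocks); as a sketch, it could be applied dynamically as shown below. The DGPERF service name is taken from the test environment and is illustrative only.

alter system set log_archive_dest_2=
  'service=DGPERF lgwr async=20480 reopen=15 max_failure=10 optional net_timeout=30'
  scope=both sid='*';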


Conclusion

Ideally, this situation calls for an upgraded network link. Since that is not a realistic option in this scenario, SSH compression is likely the most feasible option. If SSH compression with ASYNC encounters network server timeouts as described in Appendix E, then SSH compression with the ARCH transport is the next choice. Finally, if the ARCH transport becomes the choice, another consideration is keeping transaction loss within the 30-minute SLA: set the database initialization parameter ARCHIVE_LAG_TARGET to 1800 seconds (30 minutes). This will force a log switch to occur every 30 minutes.
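A minimal sketch of that setting (the value is in seconds; the SCOPE and SID clauses assume an spfile-managed RAC configuration like the one in Appendix A):

alter system set archive_lag_target=1800 scope=both sid='*';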


CONCLUSION

The various test results discussed previously demonstrate that Data Guard is a highly efficient and flexible disaster recovery solution that can maximize protection without sacrificing performance. This paper has highlighted that:

• The Maximum Performance protection mode performs well in all environments

• The Maximum Protection and Maximum Availability protection modes perform well in LAN and MAN environments

• The network bandwidth and latency between the production and standby sites have an impact on production database throughput

• Maintaining optimal performance with the required database protection requires testing and continual monitoring of the database, system, network, and response times

Optimally tuned database server systems are paramount in consistently meeting application service level agreement (SLA) requirements. When using Data Guard it is also recommended that the network RTT be a component of an SLA. This implies that you have a well-understood SLA with your network provider and that you ask your network provider the right questions.4 Setting a network RTT guarantee as part of an SLA and monitoring RTT data long term is useful for trend analysis and correlation with other performance metrics. Ideally, the Data Guard primary/secondary network connection should be dedicated or, at a minimum, given a higher priority within an IP classification and prioritization scheme to reduce variance in the network RTT.

A performance assessment will detail and document your end-to-end performance and allow for the establishment of baseline performance numbers as part of your SLA(s). Following the best practices in this paper will assist with implementing a properly tuned Data Guard configuration that minimizes any impact to the primary database and gives superior protection to your business from disasters, site failures, data failures, and human errors.


4 "Metropolitan-Area Magic?", Kevin Tolly, Network World, 7/23/01, http://www.tolly.com/News/KT_NWW/20010723Metropolitan.asp


APPENDIX

A. Database Parameters

This spfile was used for testing. The log_archive_dest_2 settings were changed for each test case and are described under the "Test Cases" section. This spfile was used with Oracle Net static registration to facilitate use of SDU. Some lines wrap to two lines; a line NOT starting with one of the following is a continuation of the previous line: an asterisk (*=same for all instances), a comment (#), or an ORACLE_SID (DGPERF1 or DGPERF2).

*.COMPATIBLE='9.2.0.1.0'
*.FAL_CLIENT='DGPERF'
*.FAL_SERVER='DGPERF'
*.FAST_START_MTTR_TARGET=3600
*.STANDBY_ARCHIVE_DEST='/arch1/DGPERF'
*.STANDBY_FILE_MANAGEMENT='auto'
*.archive_lag_target=0
*.background_dump_dest='/mnt/app/oracle/admin/DGPERF/bdump'
*.cluster_database=TRUE
*.control_files='/dev/vx/rdsk/ha-dg/DGPERF_control_01.ctl'
*.core_dump_dest='/mnt/app/oracle/admin/DGPERF/cdump'
*.db_block_checking=true
*.db_block_checksum=true
*.db_block_size=8192
*.db_cache_size=750M
*.db_create_online_log_dest_1=''
*.db_name='DGPERF'
DGPERF1.instance_name='DGPERF1'
DGPERF2.instance_name='DGPERF2'
DGPERF1.instance_number=1
DGPERF2.instance_number=2
*.java_pool_size=0
# Used for dynamic registration tests only
*.local_listener='DGPERF_lsnr'
*.log_archive_max_processes=2
*.log_archive_dest_1='location=/arch1/DGPERF arch mandatory alternate=log_archive_dest_3'
#
# ASYNC 10 MB buffer test
*.log_archive_dest_2='service=DGPERF lgwr async=20480 reopen=15 max_failure=10 optional net_timeout=30'
#
# ASYNC 4 MB buffer test
#*.log_archive_dest_2='service=DGPERF lgwr async=8192 reopen=15 max_failure=10 optional net_timeout=30'
#
# ASYNC 2 MB buffer test
#*.log_archive_dest_2='service=DGPERF lgwr async=4096 reopen=15 max_failure=10 optional net_timeout=30'
#
# SYNC for maximum protection mode tests
#*.log_archive_dest_2='service=DGPERF reopen=15 lgwr sync affirm optional'


#
*.log_archive_dest_3='location=/arch2/DGPERF'
*.log_archive_dest_state_1='ENABLE'
*.log_archive_dest_state_2='ENABLE'
*.log_archive_dest_state_3='alternate'
*.log_archive_format='arch_%t_%S.arc'
*.log_archive_start=TRUE
*.log_archive_trace=0
*.log_buffer=1310720
*.log_checkpoint_interval=0
*.log_checkpoint_timeout=0
*.log_checkpoints_to_alert=TRUE
*.max_enabled_roles=30
*.open_cursors=300
*.os_authent_prefix=''
*.parallel_max_servers=50
*.pga_aggregate_target=524288000
*.processes=400
*.remote_archive_enable='true'
*.remote_login_passwordfile='none'
*.resource_manager_plan='system_plan'
*.service_names='DGPERF'
*.sessions=1000
*.shared_pool_size=252321536
*.sort_area_retained_size=655360
*.sort_area_size=655360
DGPERF1.thread=1
DGPERF2.thread=2
*.timed_statistics=true
*.transactions=1000
*.undo_management='auto'
*.undo_retention=0
DGPERF1.undo_tablespace='rbs01'
DGPERF2.undo_tablespace='rbs02'
*.user_dump_dest='/mnt/app/oracle/admin/DGPERF/udump'
*.workarea_size_policy='AUTO'


B. Oracle Net Services Configuration

# 192.168.233.3 = hasun3 dedicated network (Primary 1)
# 192.168.233.4 = hasun4 dedicated network (Primary 2)
# 192.168.233.13 = hasun13 dedicated network (Standby 1)
# hasun3 = public network (Primary 1 client connections)
# hasun4 = public network (Primary 2 client connections)

Static Registration

Primary 1 listener.ora

# primary/standby listener Primary 1, using static registration
DGPERF_LIST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.233.3)(PORT = 1508)))
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = hasun3)(PORT = 1522))))

SID_LIST_DGPERF_LIST =
  (SID_LIST =
    (SID_DESC =
      (SDU=32767)
      (ORACLE_HOME = /mnt/app/oracle/product/9.2)
      (SID_NAME = DGPERF1)))

Primary 1 & 2 tnsnames.ora

DGPERF =
  (DESCRIPTION =
    (SDU = 32767)
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = tcp)(PORT = 1508)(HOST = 192.168.234.13)))
    (CONNECT_DATA =
      (SID = DGPERF1)))

Primary 2 listener.ora

# primary/standby listener Primary 2, using static registration
DGPERF_LIST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.233.4)(PORT = 1508)))
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = hasun4)(PORT = 1522))))

SID_LIST_DGPERF_LIST =
  (SID_LIST =
    (SID_DESC =
      (SDU=32767)
      (ORACLE_HOME = /mnt/app/oracle/product/9.2)
      (SID_NAME = DGPERF2)))


Standby 1 listener.ora

DGPERF_LIST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.233.13)(PORT = 1508))))

SID_LIST_DGPERF_LIST =
  (SID_LIST =
    (SID_DESC =
      (SDU=32767)
      (ORACLE_HOME = /mnt/app/oracle/product/9.2)
      (SID_NAME = DGPERF1)))

Standby 1 tnsnames.ora

DGPERF =
  (DESCRIPTION =
    (SDU = 32767)
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = tcp)(PORT = 1508)(HOST = 192.168.233.3)))
    (CONNECT_DATA =
      (SID = DGPERF1)))


Dynamic Registration

Dynamic registration with a non-default port, as used below (port 1508 rather than the default 1521), also requires setting the following database initialization parameters in the init.ora or spfile:

*.local_listener=DGPERF_LSNR
*.service_names=DGPERF

If you want to modify the SDU as recommended in the best practice "Set SDU=32767 (32K) for the Oracle Net connections between the primary and standby", then for release 9.2.0.4 and later you must set DEFAULT_SDU_SIZE=32767 in the sqlnet.ora file. To use SDU prior to 9.2.0.4 you must use static instance registration.

Primary 1 listener.ora

DGPERF_LIST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.233.3)(PORT = 1508)))
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = hasun3)(PORT = 1522))))

Primary 2 listener.ora

DGPERF_LIST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.233.4)(PORT = 1508)))
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = hasun4)(PORT = 1522))))

Primary 1 & 2 tnsnames.ora

DGPERF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = tcp)(PORT = 1521)(HOST = 192.168.233.13))
      (ADDRESS = (PROTOCOL = tcp)(PORT = 1521)(HOST = 192.168.233.14)))
    (CONNECT_DATA =
      (SERVICE_NAME = DGPERF)))

# Local listener for dynamic instance registration
DGPERF_LSNR =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = tcp)(PORT = 1508))))

Standby 1 listener.ora

DGPERF_LIST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.233.13)(PORT = 1508))))

Standby 1 tnsnames.ora

DGPERF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = tcp)(PORT = 1508)(HOST = 192.168.233.3)))
    (CONNECT_DATA =
      (SERVICE_NAME = DGPERF)))


C. Standby Database Monitoring Script

#!/bin/ksh
trap "" TTIN TTOU
export ORACLE_HOME=/mnt/app/oracle/product/9.2
export ORACLE_SID=DGPERF1
export LD_LIBRARY_PATH=/mnt/app/oracle/product/9.2/lib
export TNS_ADMIN=/var/opt/oracle
export PATH=/mnt/app/oracle/product/9.2/bin:/usr/bin

while true
do
sqlplus '/ as sysdba' << END
spool $1/dbstats.`date +%b%d_%T`
set pagesize 10000 echo off feedback off TERMOUT OFF

# Verify the state of the DG processes for more detailed
# analysis if required
select process, status, thread#, sequence#, blocks
  from v\$managed_standby;

select max(sequence#), thread#
  from v\$log_history
 group by thread#;

column event format a35
column p1text format a20
column p2text format a20

# Obtain session wait information for more detailed
# analysis if required
select sid, event, p1, p1text, p2, p2text
  from v\$session_wait
 where wait_time != 0
   and event not in ('rdbms ipc message','smon timer')
 order by wait_time desc;

# Obtain file READ I/O and WRITE I/O times to ensure
# there's no IO bottlenecks on the standby. Should
# be similar to production I/O times.
column datafile format A45
column tspace format A30
select fs.*, df.name datafile, ts.name tspace
  from v\$filestat fs, v\$datafile df, v\$tablespace ts
 where fs.file#=df.file# and df.ts#=ts.ts# and PHYWRTS > 0
 order by writetim desc;

# Obtain top system wait events. Leveraged to get
# average log file parallel write times on the standby.
select * from v\$system_event
 where time_waited > 100
 order by time_waited desc;

# Obtain sysstat detailed statistics for detailed
# analysis if required.
select name, value from v\$sysstat where name like 'recovery%';

spool off
exit
END
sleep 60   # interval to obtain statistics
done
exit 0


D. Operating System Monitoring

Operating system statistics were gathered at 15-second intervals and captured using the SE Toolkit, an independently developed monitoring tool for Sun Solaris. See http://www.setoolkit.com/ for details. The actual SE Toolkit commands used are listed below.

CPU Statistics
/opt/RICHPse/bin/se /opt/RICHPse/examples/cpustat.se 15 999999

I/O Statistics
/net/hasun3/private/scripts/io_vx.sh -s hasun3,hasun4 -i 15 -c 999999 -x DGPERF_system_01.dbf

Memory Statistics
/opt/RICHPse/bin/se /opt/RICHPse/examples/vmstat.se 15 999999

Network Statistics
/opt/RICHPse/bin/se /opt/RICHPse/examples/nx.se 15


E. 9i Release 2 ASYNC Transport Details

Patchsets 1-5 for RDBMS release 9.2.0 have differing behavior in the way the ASYNC buffer is used, causing different effects on throughput as the network RTT increases. Below is a description of the behavior for each patchset.

Inactivity Timeout

In addition to the details below, the ASYNC buffer has a hidden database initialization parameter, _ns_max_flush_wt, which comes into play on systems with periods of inactivity, i.e. when no redo is being generated. If there is less than the optimal buffer send threshold (described below, usually 1 MB) of redo in the async buffer, the network server waits for 30 seconds (the _ns_max_flush_wt default) before sending the buffer contents to the standby. Thus, if a disaster occurs before the 30 seconds have elapsed, those transactions that have not yet been sent are lost. This did not affect our tests because the transaction profiles had no idle periods; the transaction flow was high and steady. If your application has periods of inactivity, you can bound data loss further by setting _ns_max_flush_wt to a lower value N, which flushes the async redo buffer every N seconds rather than every 30 seconds. The 200-user small OLTP async test was run with _ns_max_flush_wt=1 and showed no impact on the steady transaction profiles used in our tests. Starting with patchset 9.2.0.5, the default for _ns_max_flush_wt has been changed to 1.
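One way the parameter might be lowered is sketched below; underscore parameters must be quoted, the SCOPE=SPFILE clause defers the change until the next restart, and hidden parameters should only be changed after testing and under the guidance of Oracle Support.

alter system set "_ns_max_flush_wt"=1 scope=spfile sid='*';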

9.2.0.1

The ASYNC transport service is impacted by higher RTTs. This is primarily due to a default wait of 30 seconds when an "ASYNC buffer full" condition is reached: the network server waits a maximum of 30 seconds before timing out and switching the transport to ARCH. If the async buffer space frees up before the 30 seconds elapse, LGWR proceeds to write to the network server. Additionally, an internal algorithm calculates an optimal buffer send threshold (how much of the async buffer should be filled before the data is sent over the network) that is too small, causing more network transfers than necessary. Example alert.log messages when the "ASYNC buffer full" condition is encountered:

Sun Nov 24 13:26:16 2002
Timing out on NetServer 1
LGWR: I/O error 2 archiving log 9 to 'DGPERF'
Sun Nov 24 13:26:16 2002
Errors in file /mnt/app/oracle/admin/DGPERF/bdump/dgperf1_lgwr_1438.trc:
ORA-00002: Message 2 not found; product=RDBMS; facility=ORA

9.2.0.2

Same as 9.2.0.1, except that the ASYNC transport service converts to ARCH immediately if the "ASYNC buffer full" condition is reached, i.e. there is no wait time. Thus, ASYNC performs close to the ARCH transport and is not impacted by network RTT. This behavior was based on the premise that availability is more important than minimizing transaction loss, but it isn't "truly" ASYNC. The "ASYNC buffer full" condition alert.log message is the same as with 9.2.0.1.

9.2.0.3

This patchset changes the way the ASYNC buffer behaves and the size of the optimal buffer send threshold. The optimal buffer send threshold is set when the network server starts: it is set to 1 MB if the ASYNC buffer is larger than 2 MB; if the ASYNC buffer is smaller than 2 MB (which is not recommended), the threshold is set to 50% of the ASYNC buffer size.

When the "ASYNC buffer full" condition is reached, the network server now waits a maximum of 5 seconds (instead of 30) before timing out and switching the transport to ARCH. If the async buffer space frees up before the 5 seconds elapse, LGWR continues to write to the network server. If the transport has switched to ARCH, then at the next log switch, or when the REOPEN interval has elapsed, the transport reverts to LGWR ASYNC.

Additional wait events have been added and are documented in the 9.2.0.3 Release Notes. Some are also highlighted in the “Troubleshooting Data Guard Performance Issues” section.

Example alert.log messages when the “ASYNC buffer full” condition is encountered:

Fri Jan 10 22:03:46 2003
Timing out on NetServer 0 prod=9761,cons=9120,threshold=640
Fri Jan 10 22:03:46 2003
Errors in file /mnt/app/oracle/admin/DGPERF/bdump/dgperf2_lgwr_16548.trc:
ORA-16166: LGWR network server failed to send remote message
LGWR: I/O error 16166 archiving log 3 to 'DGPERF'

9.2.0.4

Modifying the SDU setting with dynamic instance registration now works.

9.2.0.5

The default for _ns_max_flush_wt has been changed from 30 seconds to 1 second. See the “Inactivity Timeout” paragraph above for details.


F. Network Throughput and Peak Redo Rates

For optimal performance of a Data Guard configuration, the network throughput must be greater than the maximum redo generation rate; otherwise, performance will be throttled by the lack of network bandwidth. The peak redo rate can be taken from a Statspack snapshot taken during peak load. The "Redo size" line in the Statspack report gives the redo rate in bytes/second for a given time interval. Next, the bandwidth of the network link between the primary and the standby sites must be determined.
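For example, the redo rate between two Statspack snapshots can also be derived directly from the repository. The query below is only a sketch: it assumes the default PERFSTAT schema, a single-instance repository, and placeholder snapshot IDs 100 and 101.

select round((e.value - b.value) /
             ((es.snap_time - bs.snap_time) * 86400)) as redo_bytes_per_sec
  from perfstat.stats$sysstat  b,
       perfstat.stats$sysstat  e,
       perfstat.stats$snapshot bs,
       perfstat.stats$snapshot es
 where b.name = 'redo size'
   and e.name = 'redo size'
   and b.snap_id = 100
   and e.snap_id = 101
   and bs.snap_id = b.snap_id
   and es.snap_id = e.snap_id;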

Network throughput is the number of bits (not bytes) transmitted per second through a communication medium or device. It is also referred to as data rate or wire speed. Rated throughput, or network bandwidth, is effectively reduced because there are typically several devices in the communication path that add delays. Delays, or increased transmission latency, can be caused by various factors such as processor limitations, network congestion, buffering inefficiencies, transmission errors, traffic loads, or inadequate hardware designs. Throughput can vary widely over time due to variations in network traffic and congestion. In addition, the data sent by Data Guard is packaged in lower-level protocol packets that contain header information (protocol overhead). If the actual data throughput is to be measured, the bits used by this overhead must be subtracted from the size of the packet. There is additional overhead, outside of the data packet, that both TCP/IP and Ethernet require. That said, a very general calculation for estimating the maximum network throughput capacity is given in the equation below. This calculation uses a conservative worst-case network overhead ratio of 30% (it multiplies the actual bandwidth in bytes by 0.7).

A more precise network overhead factor can be determined if the considerations mentioned above are well understood through measuring and monitoring of the network. When determining the bandwidth, use the maximum bandwidth of the least-capable hop among all of the router hops between the primary and secondary hosts.

max. transfer rate capacity/sec
   = (bandwidth in bytes) * (network data usage (subtract 30% overhead))
   = (bandwidth / 8) * .7

e.g. for a 10 Mbit (10,240,000 bits/second) bandwidth link:
   (10,240,000 / 8) * .7 = 896,000 bytes/second

Equation 1 Bandwidth Capacity in Bytes

The above equation gives an estimate of the maximum redo rate that an existing bandwidth can support, but it is not indicative of the maximum database or network redo rate that is achievable.


To estimate the maximum redo rate achievable with Data Guard enabled for a given network bandwidth, you must:

1. Tune the network with the prescribed best practices described in this document.

2. Simulate database load to mimic average and maximum database load and redo rates

3. Test without Data Guard and gather database and system statistics information

4. Test with Data Guard and gather database, system, and network statistics

5. Tune using the best practices described in this document and repeat any of the above steps

With synchronous redo transport, the redo and network rates are equal, since local and network redo writes complete prior to a commit. If the round-trip latency is higher than the local redo write time, throughput may be impacted. With asynchronous redo transport, the LGWR local redo write is decoupled from the LNS redo network write. The database network redo rate is affected by the maximum network bandwidth and by Data Guard application confirmation processing, which ensures that 1 MB of redo is acknowledged from the standby before the next 1 MB of redo is sent. Due to the combined impact of network bandwidth, network latency, network collisions, and the Data Guard application confirmation processing, there is no clear formula to estimate the maximum network redo rate.


G. Test Results Data

LAN Data

Table 3 LAN Test Results Summary

                        ------------- Transaction Profile --------------
                         SQL*Loader        OLTP Large        OLTP Small
Transport service     Redo Rate  Scale  Redo Rate  Scale  Redo Rate  Scale
-------------------   ---------  -----  ---------  -----  ---------  -----
ARCH                       4687    100     1630.7  100.0      745.9  100.0
Async 10 MB                4661     99     1628.9   99.9      752.0  100.8
SYNC                       4715    101     1587.6   97.4      738.6   99.0

MAN / WAN Data

Table 4 200 User MAN / WAN Test Results

Transport           RTT:  0.2 ms   2 ms  10 ms  50 ms  100 ms  100 ms   2 ms  10 ms  50 ms
                                                                   SSH    SSH    SSH    SSH
ARCH    Redo Rate            746    744    742    751     741     745    751    745    746
        TPS                  181    182    182    182     182     181    183    182    182
ASYNC   Redo Rate            752    744    745    745  732(1)     740    744    746    749
        TPS                  182    182    182    180  179(1)     181    181    182    182
        TPS Scale            101    100    100     99      99     100     99    100    100
SYNC    Redo Rate            739    732    722    577     563
        TPS                  180    178    177    143     140
        TPS Scale             99     98     97     78      77

(1) Both nodes/threads converted to ARCH
(2) A subset of the logs converted to ARCH


Table 5 400 User MAN / WAN Test Results

Transport           RTT:  0.2 ms   2 ms  10 ms  50 ms  100 ms  100 ms SSH  2 ms SSH
ARCH    Redo Rate           1631   1633   1638   1637    1634        1635      1633
        TPS                  362    363    363    363     362         363       363
ASYNC   Redo Rate           1629   1639   1631   1622    1629        1627      1624
        TPS                  360    363    362    359     360         360       360
        TPS Scale             99    100    100     99      99          99        99
SYNC    Redo Rate           1588   1554   1541    859
        TPS                  352    345    342    194
        TPS Scale             97     95     94     54

(1) Both nodes/threads converted to ARCH
(2) A subset of the logs converted to ARCH


Oracle9i Data Guard: Primary Site and Network Configuration Best Practices
June 2006
Authors: Ray Dutcher, High Availability Systems Team
Contributing Authors: Lawrence To, HA Systems Team, Ashish Ray, Kevin Reardon

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.

Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
www.oracle.com

Oracle is a registered trademark of Oracle Corporation. Various product and service names referenced herein may be trademarks of Oracle Corporation. All other product and service names mentioned may be trademarks of their respective owners.

Copyright © 2006 Oracle Corporation
All rights reserved.