
Oracle Support Services RAC Starter Kit

RAC Assurance Team

RAC System Test Plan Outline

11gR2 Version 2.0

Purpose

Before a new computer/cluster system is deployed in production, it is important to test the system thoroughly to validate that it will perform at a satisfactory level relative to its service level objectives. Testing is also required when introducing major or minor changes to the system. This document provides an outline consisting of basic guidelines and recommendations for testing a new RAC system. This test plan outline can be used as a framework for building a system test plan specific to each company's RAC implementation and its associated service level objectives.

Scope of System Testing

This document provides an outline of basic testing guidelines that will be used to validate core component functionality for RAC environments in the form of an organized test plan. Every application exercises the underlying software and hardware infrastructure differently and must be tested as part of a component testing strategy. Each new system must be tested thoroughly, in an environment that is a realistic representation of the production environment in terms of configuration, capacity, and workload, prior to going live or after implementing significant architectural/system modifications. Without a completed system implementation and functioning end-user applications, only core component testing is possible, verifying cluster, RDBMS, and sub-component behavior for networking, the I/O subsystem, and miscellaneous database administrative functions. In addition to the specific system testing outlined in this document, additional testing needs to be defined and executed for RMAN, backup and recovery, and Data Guard (for disaster recovery). Each component area of testing also requires specific operational procedures to be documented and maintained to address site-specific requirements.

Testing Objectives

In addition to application functionality testing, overall system testing is normally performed for one or more of the following reasons:

• Verify that the system has been installed and configured correctly. Check that nothing is broken. Establish a baseline of functional behavior so that we can answer the question down the road: 'has this ever worked in this environment?'
• Verify that basic functionality still works in a specific environment and for a specific workload. Vendors normally test their products very thoroughly, but it is not possible to test all possible hardware/software combinations and unique workloads.
• Make sure that the system will achieve its objectives, in particular availability and performance objectives. This can be very complex and normally requires some form of simulated production environment and workload.
• Test operational procedures. This includes normal operational procedures and recovery procedures.
• Train operations staff.

Planning System Testing

Effective system testing requires careful planning. The service level objectives for the system itself and for the testing must be clearly understood, and a detailed test plan should be documented. The basis for all testing is that the current best practices for RAC system configuration have been implemented before testing.


Testing should be performed in an environment that mirrors the production environment as much as possible. The software configuration should be identical, but for cost reasons it might be necessary to use a scaled-down hardware configuration. All testing should be performed while running a workload that is as close to production as possible. When planning for system testing it is extremely important to understand how the application has been designed to handle the failures outlined in this plan and to ensure that the expected results are met at the application level as well as the database level. Oracle technologies that enable fault tolerance of the database at the application level include the following (a sample service configuration sketch follows this list):

• Fast Application Notification (FAN) – Notification mechanism that alerts applications to service level changes of the database.
• Fast Connection Failover (FCF) – Utilizes FAN events to enable database clients to proactively react to down events by quickly failing over connections to surviving database instances.
• Transparent Application Failover (TAF) – Allows connections to be automatically re-established to a surviving database instance if the instance servicing the initial connection fails. TAF can fail over in-flight select statements (if configured), but insert, update and delete transactions will be rolled back.
• Runtime Connection Load Balancing (RCLB) – Provides intelligence about the current service level of the database instances to application connection pools. This increases application performance by using the least loaded servers to service application requests, and allows for dynamic workload balancing in the event of the loss of service by a database instance or an increase of service by adding a database instance.
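As an illustration only, the sketch below shows one way a TAF-enabled service and a matching client alias could be configured for these tests. The database name ORCL, instance names ORCL1/ORCL2, service name OLTP_SVC, and SCAN name mycluster-scan are hypothetical placeholders; verify the srvctl options against the 11.2 documentation for your platform.

  srvctl add service -d ORCL -s OLTP_SVC -r ORCL1,ORCL2 -P BASIC -e SELECT -m BASIC -z 180 -w 5 -B SERVICE_TIME -j LONG
  srvctl start service -d ORCL -s OLTP_SVC

  # tnsnames.ora entry pointing at the SCAN, with TAF parameters for SELECT failover
  OLTP_SVC =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = mycluster-scan)(PORT = 1521))
      (CONNECT_DATA =
        (SERVICE_NAME = OLTP_SVC)
        (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 180)(DELAY = 5))
      )
    )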

More information on each of the above technologies can be found in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2. Generating a realistic application workload can be complex and expensive, but it is the most important factor for effective testing. For each individual test in the plan, a clear understanding of the following is required:

• What is the objective of the test and how does it relate to the overall system objectives?
• Exactly how will the test be performed and what are the execution steps?
• What are the success/failure criteria, and what are the expected results?
• How will the test result be measured?
• Which tools will be used?
• Which logfiles and other data will be collected?
• Which operational procedures are relevant?
• What are the expected results of the application for each of the defined tests (TAF, FCF, RCLB)?

Notes for Windows Users

Many of the Fault Injection Tests outlined in this document involve abnormal termination of various processes within the Oracle software stack. On Unix/Linux systems this is easily achieved by using the "ps" and "kill" commands. Natively, Windows does not provide the ability to view enough details of running processes to properly identify and kill the processes involved in the Fault Injection Testing. To overcome this limitation, a utility called Process Explorer (provided by Microsoft) will be used to identify and kill the necessary processes. Process Explorer can be found on the Windows Sysinternals website within Microsoft TechNet (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx). In addition to Process Explorer, a utility called orakill will be used to kill individual threads within the database. More information on orakill can be found under Note 69882.1.
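For reference, the general pattern used throughout the fault injection tests looks like the following; the instance name ORCL1 is a hypothetical placeholder.

  For AIX, HPUX, Linux, Solaris:
    # ps -ef | grep ora_pmon_ORCL1        (identify a background process of a specific instance)
    # kill -9 <pmon pid>
  For Windows (background processes run as threads within oracle.exe):
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    cmd> orakill ORCL1 <thread id>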

Production Simulation / System Stress Test

The best way to ensure that the system will perform well without any problems is to simulate production workload and conditions before going live. Ideally the system should be stressed a little more than what is expected in production. In addition to running the normal user and application workload, all normal operational procedures should also be tested at the same time. The output from the normal monitoring procedures should be kept and compared with the real data when going live. Normal maintenance operations such as adding users, adding disk space, reorganizing tables and indexes, backup, archiving data, etc. must also be tested. A commercial or in-house developed workload generator is essential; a minimal example of driving concurrent sessions is sketched below.
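The fragment below is a minimal sketch of an in-house concurrency driver; the connect alias, test account, and test.sql script are hypothetical placeholders, and a commercial load tool will normally be more representative.

  #!/bin/bash
  # Spawn N concurrent sqlplus sessions, each running a script that simulates one unit of application work.
  SESSIONS=50
  for i in $(seq 1 $SESSIONS); do
    sqlplus -s testuser/testpwd@OLTP_SVC @test.sql &
  done
  wait
  echo "All $SESSIONS test sessions completed."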

Fault Injection Testing

The system configuration and operational procedures must also be tested to make sure that component failures and other problems can be dealt with as efficiently as possible and with minimum impact on system availability. This section provides some examples of tests that can be used as part of a system test plan. The idea is to test the system's robustness against various failures. Depending on the overall architecture and objectives, only some of the tests might be used and/or additional tests might have to be constructed. Introducing multiple failures at the same time should also be considered. This list only covers testing for RAC-related components and procedures. Additional tests are required for other parts of the system. These tests should be performed with a realistic workload on the system. Procedures for detecting and recovering from these failures must also be tested. In some worst-case scenarios it might not be possible to recover the system within an acceptable time frame, and a disaster recovery plan should specify how to switch to an alternative system or location. This should also be tested.

The result of a test should initially be measured at a business or user level to see if the result is within the service level agreement. If a test fails, it will be necessary to gather and analyze the relevant log and trace files. The analysis can result in system tuning, changing the system architecture, or possibly reporting component problems to the appropriate vendor. Also, if the system objectives turn out to be unrealistic, they might have to be changed.


System Testing Scenarios

Each test below lists the Test Procedure, Expected Results, and Measures. Actual Results/Notes should be recorded for each test during execution.

Test 1: Planned Node Reboot

Test Procedure:
• Start client workload.
• Identify the instance with the most client connections.
• Reboot the node where the most loaded instance is running:
  o For AIX, HPUX, Windows: "shutdown -r"
  o For Linux: "shutdown -r now"
  o For Solaris: "reboot"

Expected Results:
• The instances and other Clusterware resources that were running on that node go offline (no value in the 'SERVER' field of "crsctl stat res -t" output).
• The node VIP fails over to one of the surviving nodes and shows a state of "INTERMEDIATE" with state_details of "FAILED_OVER".
• The SCAN VIP(s) that were running on the rebooted node fail over to surviving nodes.
• The SCAN Listener(s) running on that node fail over to a surviving node.
• Instance recovery is performed by another instance.
• Services are moved to available instances if the downed instance was specified as a preferred instance.
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration). With TAF configured, select statements should continue; active DML will be aborted.
• After the database reconfiguration, surviving instances continue processing their workload.

Measures:
• Time to detect node or instance failure.
• Time to complete instance recovery (check the alert log of the instance performing the recovery).
• Time to restore client activity to the same level (assuming the remaining nodes have sufficient capacity to run the workload).
• Duration of the database reconfiguration.
• Time before the failed instance is restarted automatically by Clusterware and accepts new connections.
• Successful failover of the SCAN VIP(s) and SCAN Listener(s).
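The commands below are a sketch of how the expected post-reboot state can be verified from a surviving node; the resource name ora.racnode1.vip and database name ORCL are hypothetical placeholders.

  crsctl stat res -t                        (rebooted node's resources show no SERVER value)
  crsctl stat res ora.racnode1.vip -t       (node VIP should show INTERMEDIATE / FAILED_OVER on a surviving node)
  srvctl status scan
  srvctl status scan_listener
  srvctl status service -d ORCL             (services should be running on surviving instances)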


Test 2: Unplanned Node Failure of the OCR Master

Test Procedure:
• Start client workload.
• Identify the node that is the OCR master by running the following grep command from any node:
  grep -i "OCR MASTER" $GI_HOME/log/<node_name>/crsd/crsd.l*
  NOTE: Windows users must manually review the $GI_HOME/log/<node_name>/crsd/crsd.l* logs to determine the OCR master.
• Power off the node that is the OCR master.
  NOTE: On many servers the power-off switch performs a controlled shutdown; it might be necessary to cut the power supply.

Expected Results:
• Same as Planned Node Reboot.

Measures:
• Same as Planned Node Reboot.

Test 3: Restart Failed Node

Test Procedure:
• Restart the failed node.

Expected Results:
• On clusters having 3 or fewer nodes, one of the SCAN VIPs and SCAN Listeners will be relocated to the restarted node when Oracle Clusterware starts.
• The VIP will migrate back to the restarted node.
• Services that had failed over as a result of the node failure will NOT automatically be relocated.
• Failed resources (ASM, listener, instance, etc.) will be restarted by the Clusterware.

Measures:
• Time for all resources to become available again; check with "crsctl stat res -t".
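Because services do not move back automatically, a manual relocation is usually part of this test; the command below is a sketch using hypothetical database, service, and instance names (verify the exact srvctl syntax for your release).

  srvctl relocate service -d ORCL -s OLTP_SVC -i ORCL2 -t ORCL1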

Test 4: Reboot All Nodes at the Same Time

Test Procedure:
• Issue a reboot on all nodes at the same time:
  o For AIX, HPUX, Windows: "shutdown -r"
  o For Linux: "shutdown -r now"
  o For Solaris: "reboot"

Expected Results:
• All nodes, instances and resources are restarted without problems.

Measures:
• Time for all resources to become available again; check with "crsctl stat res -t".


Test 5: Unplanned Instance Failure

Test Procedure:
• Start client workload.
• Identify the single database instance with the most client connections and abnormally terminate that instance:
  o For AIX, HPUX, Linux, Solaris:
    Obtain the PID of the pmon process of the database instance: # ps -ef | grep pmon
    Kill the pmon process: # kill -9 <pmon pid>
  o For Windows:
    Obtain the thread ID of the pmon thread of the database instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread: cmd> orakill <SID> <Thread ID>

Expected Results:
• One of the other instances performs instance recovery.
• Services are moved to available instances if a preferred instance failed.
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration).
• After a short freeze, surviving instances continue processing the workload.
• The failed instance will be restarted by Oracle Clusterware, unless this feature has been disabled.

Measures:
• Time to detect instance failure.
• Time to complete instance recovery (check the alert log of the recovering instance).
• Time to restore client activity to the same level (assuming the remaining nodes have sufficient capacity to run the workload).
• Duration of the database freeze during failover.
• Time before the failed instance is restarted automatically by Oracle Clusterware and accepts new connections.

Test 6: Planned Instance Termination

Test Procedure:
• Issue a 'shutdown abort' on one instance.

Expected Results:
• One of the other instances performs instance recovery.
• Services are moved to available instances if a preferred instance failed.
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration).
• The instance will NOT be automatically restarted by Oracle Clusterware because the shutdown was user invoked.

Measures:
• Time to detect instance failure.
• Time to complete instance recovery (check the alert log of the recovering instance).
• Time to restore client activity to the same level (assuming the remaining nodes have sufficient capacity to run the workload).
• The instance is NOT restarted by Oracle Clusterware due to the user-induced shutdown.


Test 7: Restart Failed Instance

Test Procedure:
• The instance is restarted automatically by Oracle Clusterware if it was an uncontrolled failure.
• A manual restart is necessary if a "shutdown" command was issued.
• A manual restart is necessary when the "Auto Start" option for the related instance has been disabled.

Expected Results:
• The instance rejoins the RAC cluster without any problems (review alert logs, etc.).
• Client connections and workload will be load balanced across the new instance (a manual procedure might be required to redistribute the workload if connections are long running / permanent).

Measures:
• Time before services and workload are rebalanced across all instances (including any manual steps).
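Where a manual restart is required, the commands below are a sketch using hypothetical database and instance names.

  srvctl start instance -d ORCL -i ORCL1
  srvctl status service -d ORCL              (confirm which instances are serving each service; relocate manually if required)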

Test 8: Unplanned ASM Instance Failure

Test Procedure:
• Start client workload.
• Identify a single ASM instance in the cluster and abnormally terminate it:
  o For AIX, HPUX, Linux, Solaris:
    Obtain the PID of the pmon process of the ASM instance: # ps -ef | grep pmon
    Kill the pmon process: # kill -9 <pmon pid>
  o For Windows:
    Obtain the thread ID of the pmon thread of the ASM instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread: cmd> orakill <SID> <Thread ID>

Expected Results:
• The *.dg, *.acfs, *.asm and *.db resources that were running on that node go offline (crsctl stat res -t). By default these resources are automatically restarted by Oracle Clusterware.
• One of the other instances performs instance recovery.
• Services are moved to available instances if a preferred instance failed.
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration).
• After the database reconfiguration is complete, surviving instances continue processing the workload.
• The Clusterware alert log will show CRSD going offline due to an inaccessible OCR if the OCR is stored in ASM. CRSD will automatically restart.

Measures:
• Time to detect instance failure.
• Time to complete instance recovery (check the alert log of the recovering instance).
• Time to restore client activity to the same level (assuming the remaining nodes have sufficient capacity to run the workload).
• Duration of the database reconfiguration.
• Time before the failed resources are restarted and the database instance accepts new connections.


Test 9: Unplanned Multiple Instance Failure

Test Procedure:
• Start client workload.
• Abnormally terminate 2 different database instances from the same database at the same time:
  o For AIX, HPUX, Linux, Solaris:
    Obtain the PID of the pmon process of each database instance: # ps -ef | grep pmon
    Kill the pmon process: # kill -9 <pmon pid>
  o For Windows:
    Obtain the thread ID of the pmon thread of each database instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread: cmd> orakill <SID> <Thread ID>

Expected Results:
• Same as single instance failure.
• Both instances should be recovered and restarted without problems.

Measures:
• Same as single instance failure.


Test 10: Listener Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the listener process: # ps -ef | grep tnslsnr
  Kill the listener process: # kill -9 <listener pid>
• For Windows:
  Use Process Explorer to identify the tnslistener.exe process for the database listener. This will be the tnslistener.exe registered to the "<home name>TNSListener" service (not the "<home name>TNSListenerLISTENER_SCAN<n>" service). Once the proper tnslistener.exe is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• No impact on connected database sessions.
• New connections are redirected to a listener on another node (depends on client configuration).
• The local database instance will receive new connections if shared server is used; it will NOT receive new connections if dedicated server is used.
• The listener failure is detected by the ORAAGENT and the listener is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time for the Clusterware to detect the failure and restart the listener.
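A quick way to confirm the automatic restart is to re-check the listener after the kill; the node name racnode1 is a hypothetical placeholder.

  srvctl status listener -n racnode1
  lsnrctl status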

Test 11: SCAN Listener Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the SCAN listener process: # ps -ef | grep tnslsnr
  Kill the SCAN listener process: # kill -9 <listener pid>
• For Windows:
  Use Process Explorer to identify the tnslistener.exe process for the SCAN listener. This will be the tnslistener.exe registered to the "<home name>TNSListenerLISTENER_SCAN<n>" service (not the "<home name>TNSListener" service). Once the proper tnslistener.exe is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• No impact on connected database sessions.
• New connections are redirected to a listener on another node (depends on client configuration).
• The listener failure is detected by the CRSD ORAAGENT and the SCAN listener is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Same as Listener Failure.


Test 12: Public Network Failure

Test Procedure:
• Unplug all network cables for the public network.
NOTE: Configurations using NIS must also have NSCD implemented for this test to succeed with the expected results.
NOTE: It is recommended NOT to use ifconfig to bring down the interface; this may leave the address still plumbed to the interface, resulting in unexpected results.

Expected Results:
• Check with "crsctl stat res -t":
  o The ora.*.network and listener resources go offline for the node.
  o SCAN VIPs and SCAN Listeners running on the node fail over to a surviving node.
  o The VIP for the node fails over to a surviving node.
• The database instance remains up but is unregistered from the remote listeners.
• Database services fail over to one of the other available nodes.
• If TAF is configured, clients should fail over to an available instance.

Measures:
• Time to detect the network failure and relocate resources.

Test 13: Public NIC Failure

Test Procedure:
• Assumes dual NICs are configured for the public interface for redundancy (e.g. bonding, teaming, etc.).
• Unplug the network cable from 1 of the NICs.
NOTE: It is recommended NOT to use ifconfig to bring down the interface; this may leave the address still plumbed to the interface, resulting in unexpected results.

Expected Results:
• Network traffic fails over to the other NIC without impacting any of the cluster resources.

Measures:
• Time to fail over to the other NIC. With bonding/teaming configured this should be less than 100ms.


Test 14: Interconnect Network Failure (11.2.0.1)
Note: The method by which a node is evicted changed in 11.2.0.2 with the introduction of a new feature called Rebootless Restart. Rebootless Restart aims to achieve a node eviction without actually rebooting the node.

Test Procedure:
• Unplug all network cables for the interconnect network.
NOTE: It is recommended NOT to use ifconfig to bring down the interface; this may leave the address still plumbed to the interface, resulting in unexpected results.

Expected Results (11.2.0.1):
• CSSD will detect the split-brain situation and perform one of the following:
  o In a two-node cluster, the node with the lowest node number will survive; the other node will be rebooted.
  o In a multiple-node cluster, the largest sub-cluster will survive; the other nodes will be rebooted.
• Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Measures (11.2.0.1):
• Time to detect the split brain and start the eviction.
• See measures for node failure.


Test 14 (cont'd): Interconnect Network Failure (11.2.0.2 and higher)
Note: The method by which a node is evicted changed in 11.2.0.2 with the introduction of a new feature called Rebootless Restart. Rebootless Restart aims to achieve a node eviction without actually rebooting the node.

Test Procedure:
• Unplug all network cables for the interconnect network.
NOTE: It is recommended NOT to use ifconfig to bring down the interface; this may leave the address still plumbed to the interface, resulting in unexpected results.

Expected Results (11.2.0.2 and above):
• CSSD will detect the split-brain situation and perform one of the following:
  o In a two-node cluster, the node with the lowest node number will survive.
  o In a multiple-node cluster, the largest sub-cluster will survive.
• On the node(s) being evicted, a graceful shutdown of Oracle Clusterware will be attempted:
  o All I/O capable client processes will be terminated and all resources will be cleaned up. If process termination and/or resource cleanup does not complete successfully, the node will be rebooted.
  o Assuming the above completes successfully, OHASD will attempt to restart the stack. In this case the stack will be restarted once the network connectivity of the private interconnect has been restored.
• Review the following logs:
  o $GI_HOME/log/<nodename>/alert<nodename>.log
  o $GI_HOME/log/<nodename>/cssd/ocssd.log

Measures (11.2.0.2 and above):
• Oracle Clusterware shuts down gracefully; should the graceful shutdown fail (due to I/O processes not being terminated or resource cleanup failing), the node will be rebooted.
• Assuming the graceful shutdown of Oracle Clusterware succeeded, OHASD will restart the stack once network connectivity for the private interconnect has been restored.
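Before pulling the interconnect cables it can be useful to confirm which interface is classified as the private interconnect; the command below is a standard check (interface names and subnets are environment specific).

  oifcfg getif                               (lists interfaces marked as public or cluster_interconnect)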


Test 15: Interconnect NIC Failure (OS or 3rd Party NIC Redundancy)

Test Procedure:
• Assumes dual NICs are configured for the private interface for redundancy (e.g. bonding, teaming, etc.).
• Unplug the network cable from 1 of the NICs.
NOTE: It is recommended NOT to use ifconfig to bring down the interface; this may leave the address still plumbed to the interface, resulting in unexpected results.

Expected Results:
• Network traffic fails over to the other NIC without impacting any of the cluster resources.

Measures:
• Time to fail over to the other NIC. With bonding/teaming configured this should be less than 100ms.

Test 16: Interconnect NIC Failure (Oracle Redundant Interconnect, 11.2.0.2 and higher only)
Note: This test is applicable to those on 11.2.0.2 and higher using Oracle Redundant Interconnect/HAIP.

Test Procedure:
• Assumes 2 or more NICs are configured for Oracle Redundant Interconnect and HAIP.
• Unplug the network cable from 1 of the NICs.
NOTE: It is recommended NOT to use ifconfig to bring down the interface; this may leave the address still plumbed to the interface, resulting in unexpected results.

Expected Results:
• The HAIP running on the NIC whose cable was pulled will fail over to one of the surviving NICs in the configuration.
• Clusterware and/or RAC communication will not be impacted.
• Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/gipcd/gipcd.log
• Upon reconnecting the cable, the HAIP that failed over will relocate back to its original interface.

Measures:
• Failover (and fail back) will be seamless (no disruption in service from any node in the cluster).
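A hedged way to observe HAIP placement before and after the cable pull is to query the interconnect view from a database or ASM instance; exact output varies by release and platform.

  SQL> select inst_id, name, ip_address, is_public from gv$cluster_interconnects order by inst_id;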

Test 17: Interconnect Switch Failure (Redundant Switch Configuration)

Test Procedure:
• In a redundant network switch configuration, power off one switch.

Expected Results:
• Network traffic fails over to the other switch without any impact on interconnect traffic or instances.

Measures:
• Time to fail over to the other switch. With bonding/teaming/11.2 Redundant Interconnect configured this should be less than 100ms.


Test 18: Node Loses Access to Disks with CSS Voting Device
Note: The method by which a node is evicted changed in 11.2.0.2 with the introduction of a new feature called Rebootless Restart. Rebootless Restart aims to achieve a node eviction without actually rebooting the node.

Test Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from one node to the disks containing the CSS Voting Device(s).
NOTE: To perform this test it may be necessary to isolate the CSS Voting Device(s) in a dedicated ASM diskgroup or CFS.

Expected Results (11.2.0.1):
• CSS will detect this and evict the node with a reboot. Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Expected Results (11.2.0.2 and above):
• CSS will detect this and evict the node as follows:
  o All I/O capable client processes will be terminated and all resources will be cleaned up. If process termination and/or resource cleanup does not complete successfully, the node will be rebooted.
  o Assuming the above completes successfully, OHASD will attempt to restart the stack. In this case the stack will be restarted once access to the disks containing the voting device(s) has been restored.
• Review the following logs:
  o $GI_HOME/log/<nodename>/alert<nodename>.log
  o $GI_HOME/log/<nodename>/cssd/ocssd.log

Measures (11.2.0.1):
• See measures for node failure.

Measures (11.2.0.2 and above):
• Oracle Clusterware shuts down gracefully; should the graceful shutdown fail (due to I/O processes not being terminated or resource cleanup failing), the node will be rebooted.
• Assuming the graceful shutdown of Oracle Clusterware succeeded, OHASD will restart the stack once access to the disks containing the voting device(s) has been restored.


Test 19: Node Loses Access to Disks with OCR Device(s)

Test Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from one node to the disks containing the OCR Device(s).
NOTE: To perform this test it may be necessary to isolate the OCR Device(s) in a dedicated ASM diskgroup or CFS.

Expected Results:
• CRSD will detect the failure of the OCR device and abort. OHASD will attempt to restart CRSD 10 times, after which manual intervention will be required.
• The database instance, ASM instance and listeners will not be impacted.
• Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log

Measures:
• Monitor database status under load to ensure no service interruption occurs.
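As a sketch of how CRSD's state can be watched during this test on an 11.2 Grid Infrastructure installation:

  crsctl stat res -t -init                   (observe the state of ora.crsd while the OCR disks are inaccessible)
  ocrcheck                                   (run as root; reports the integrity of the accessible OCR copies)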

Test 20: Node Loses Access to a Single Path of the Disk Subsystem (OCR, Voting Device, Database Files)

Test Procedure:
• Unplug one external storage cable connection (SCSI, FC or LAN cable) from the node to the disk subsystem.

Expected Results:
• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency.
• No impact to database instances.

Measures:
• Monitor database status under load to ensure no service interruption occurs.
• Path failover should be visible in the OS logfiles.

Test 21: ASM Disk Lost

Test Procedure:
• Assumes ASM normal redundancy.
• Power off / pull out / offline (depending on configuration) one ASM disk.

Expected Results:
• No impact on database instances.
• ASM starts rebalancing (view the ASM alert logs).

Measures:
• Monitor progress: select * from v$asm_operation

Test 22: ASM Disk Repaired

Test Procedure:
• Power on / insert / online the ASM disk.

Expected Results:
• No impact on database instances.
• ASM starts rebalancing (view the ASM alert logs).

Measures:
• Monitor progress: select * from v$asm_operation
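The disk state transitions in Tests 21 and 22 can be followed with queries along these lines (a sketch; the column selection is illustrative):

  SQL> select group_number, name, state, header_status, mode_status from v$asm_disk order by group_number, name;
  SQL> select group_number, operation, state, power, est_minutes from v$asm_operation;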


Test 23: One Multiplexed Voting Device Is Inaccessible

Test Procedure:
• Remove access to a multiplexed voting disk from all nodes. If the voting disks are in a normal redundancy disk group, remove access to one of the ASM disks.

Expected Results:
• The cluster remains available.
• The voting disk is automatically brought online when access is restored.
• Voting disks can be queried using "crsctl query css votedisk". Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Measures:
• No impact on the cluster.

Test 24: Lose and Recover One Copy of the OCR

Test Procedure:
1. Remove access to one copy of the OCR or force a dismount of its ASM diskgroup (asmcmd umount <dg_name> -f).
2. Replace the disk or remount the diskgroup; ocrcheck will report the OCR to be out of sync.
3. Delete the corrupt OCR copy (ocrconfig -delete +<diskgroup>) and re-add the OCR (ocrconfig -add +<diskgroup>). This avoids having to stop CRSD.
NOTE: This test assumes that the OCR is mirrored to 2 ASM diskgroups that do not contain voting disks or data, or is stored on CFS.

Expected Results:
• There will be no impact on cluster operation. The loss of access and the restoration of the missing/corrupt OCR will be reported in:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Measures:
• There is no impact on cluster operation.
• The OCR can be replaced online, without a cluster outage.

Test 25: Add a Node to the Cluster and Extend the Database (If Admin Managed) to That Node

Test Procedure:
• Follow the procedures in the Oracle Clusterware Administration and Deployment Guide 11g Release 2, Chapter 4, to extend the Grid Infrastructure to the new node.
• After extending the Grid Infrastructure, follow the procedures in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2, Chapter 10, to extend the RDBMS binaries and the database to the new node.

Expected Results:
• The new node is successfully added to the cluster.
• If the database is policy managed and there is free space in the server pool for the new node, the database is extended to the new node automatically (OMF should be enabled so that no user intervention is required).
• The new database instance begins servicing connections.

Measures:
• The node is dynamically added to the cluster.
• If the database is policy managed, an instance for the database is automatically created on the new node.
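The fragment below is only a sketch of the cluster verification and Grid Infrastructure extension step, assuming a new node named racnode3 with VIP racnode3-vip; the full, authoritative procedure is in the guides referenced above.

  cluvfy stage -pre nodeadd -n racnode3
  $GI_HOME/oui/bin/addNode.sh -silent "CLUSTER_NEW_NODES={racnode3}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={racnode3-vip}"
  (run the root scripts on the new node when prompted)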


Test 26: Remove a Node from the Cluster

Test Procedure:
• Follow the procedures in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2, Chapter 10, to remove the database instance and RDBMS installation from the node.
• After successfully removing the RDBMS installation, follow the procedures in the Oracle Clusterware Administration and Deployment Guide 11g Release 2, Chapter 4, to remove the node from the cluster.

Expected Results:
• The connections to the database instance being removed fail over to the remaining instances (if configured).
• The node is successfully removed from the cluster.

Measures:
• The node is dynamically removed from the cluster.


System Testing Scenarios: Clusterware Process Failures

NOTE: This section of the system testing scenarios demonstrates failures of various Oracle Clusterware processes. These process failures are NOT within the realm of typical failures within a RAC system, and killing these processes under normal operation is highly discouraged by Oracle Support. This section is intended to provide a better understanding of the Clusterware processes, their functionality, and the logging performed by each of these processes.

Test 1: CRSD Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CRSD process: # ps -ef | grep crsd
  Kill the CRSD process: # kill -9 <crsd pid>
• For Windows:
  Use Process Explorer to identify the crsd.exe process. Once the crsd.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The CRSD process failure is detected by the orarootagent and CRSD is restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.log

Measures:
• Time to restart the CRSD process.

Test 2: EVMD Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the EVMD process: # ps -ef | grep evmd
  Kill the EVMD process: # kill -9 <evmd pid>
• For Windows:
  Use Process Explorer to identify the evmd.exe process. Once the evmd.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The EVMD process failure is detected by the OHASD oraagent and EVMD is restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/evmd/evmd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time to restart the EVMD process.


Test 3: CSSD Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CSSD process: # ps -ef | grep cssd
  Kill the CSSD process: # kill -9 <cssd pid>
• For Windows:
  Use Process Explorer to identify the ocssd.exe process. Once the ocssd.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The node will reboot.
• Cluster reconfiguration will take place.

Measures:
• Time for the eviction and cluster reconfiguration on the surviving nodes.
• Time for the node to come back online and for the reconfiguration that adds the node back as an active member of the cluster to complete.

Test 4: CRSD ORAAGENT RDBMS Process Failure
NOTE: This test is valid only for multi-user (separate Grid Infrastructure and RDBMS owner) installations.

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CRSD oraagent for the RDBMS software owner:
  # cat $GI_HOME/log/<nodename>/agent/crsd/oraagent_<rdbms_owner>/oraagent_<rdbms_owner>.pid
  # kill -9 <pid for RDBMS oraagent process>

Expected Results:
• The ORAAGENT process failure is detected by CRSD and the agent is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<rdbms_owner>/oraagent_<rdbms_owner>.log

Measures:
• Time to restart the ORAAGENT process.

Test 5: CRSD ORAAGENT Grid Infrastructure Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CRSD oraagent for the GI software owner:
  # cat $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
  # kill -9 <pid for GI oraagent process>
• For Windows:
  Use Process Explorer to identify the crsd oraagent.exe process that is a child process of crsd.exe (or obtain the PID for the crsd oraagent.exe as shown in the Unix/Linux instructions above). Once the proper oraagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The Grid Infrastructure ORAAGENT process failure is detected by CRSD and the agent is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time to restart the ORAAGENT process.


Test 6: CRSD ORAROOTAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CRSD orarootagent:
  # cat $GI_HOME/log/<nodename>/agent/crsd/orarootagent_root/orarootagent_root.pid
  # kill -9 <pid for orarootagent process>
• For Windows:
  Use Process Explorer to identify the crsd orarootagent.exe process that is a child process of crsd.exe (or obtain the PID for the crsd orarootagent.exe as shown in the Unix/Linux instructions above). Once the proper orarootagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The ORAROOTAGENT process failure is detected by CRSD and the agent is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/orarootagent_root/orarootagent_root.log

Measures:
• Time to restart the ORAROOTAGENT process.

Test 7: OHASD ORAAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the OHASD oraagent:
  # cat $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
  # kill -9 <pid for oraagent process>
• For Windows:
  Use Process Explorer to identify the ohasd oraagent.exe process that is a child process of ohasd.exe (or obtain the PID for the ohasd oraagent.exe as shown in the Unix/Linux instructions above). Once the proper oraagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The ORAAGENT process failure is detected by OHASD and the agent is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time to restart the ORAAGENT process.


Test 8: OHASD ORAROOTAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the OHASD orarootagent:
  # cat $GI_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.pid
  # kill -9 <pid for orarootagent process>
• For Windows:
  Use Process Explorer to identify the ohasd orarootagent.exe process that is a child process of ohasd.exe (or obtain the PID for the ohasd orarootagent.exe as shown in the Unix/Linux instructions above). Once the proper orarootagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The ORAROOTAGENT process failure is detected by OHASD and the agent is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.log

Measures:
• Time to restart the ORAROOTAGENT process.

Test 9: CSSDAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CSSDAGENT process: # ps -ef | grep cssdagent
  Kill the CSSDAGENT process: # kill -9 <pid for cssdagent process>
• For Windows:
  Use Process Explorer to identify the cssdagent.exe process. Once the cssdagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The CSSDAGENT process failure is detected by OHASD and the agent is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oracssdagent_root/oracssdagent_root.log

Measures:
• Time to restart the CSSDAGENT process.

Test 10: CSSDMONITOR Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID of the CSSDMONITOR process: # ps -ef | grep cssdmonitor
  Kill the CSSDMONITOR process: # kill -9 <pid for cssdmonitor process>
• For Windows:
  Use Process Explorer to identify the cssdmonitor.exe process. Once the cssdmonitor.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The CSSDMONITOR process failure is detected by OHASD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log

Measures:
• Time to restart the CSSDMONITOR process.


Component Functionality Testing

Normally it should not be necessary to perform additional functionality testing for each individual system component. However, for some new components in new environments it might be useful to perform additional testing to make sure that they are configured properly. This testing will also help system and database administrators become familiar with new technology components.

Cluster Infrastructure

To simplify testing and problem diagnosis it is often very useful to do some basic testing on the cluster infrastructure without Oracle software or a workload running. Normally this testing will be performed after installing the hardware and operating system, but before installing any Oracle software. If problems are encountered during the System Stress Test or Destructive Testing, diagnosis and analysis can be facilitated by testing the cluster infrastructure separately. Typically some of these destructive tests will be used:
• Node failure (obviously without Oracle software or workload)
• Restart failed node
• Reboot all nodes at the same time
• Lost disk access
• HBA failover (assuming multiple HBAs with failover capability)
• Disk controller failover (assuming multiple disk controllers with failover capability)
• Public NIC failure
• Interconnect NIC failure
• NAS (NetApp) storage failure: in case of a complete mirror failure, measure the time needed for the storage reconfiguration to complete. Check the same when going into maintenance mode.
If using non-Oracle cluster software:
• Interconnect network failure
• Lost access to cluster voting/quorum disk

ASM Test and Validation

This test and validation plan is intended to give the customer or engineer a procedural approach to:
• Validating the installation of RAC-ASM
• Functional and operational validation of ASM


Component Testing: ASM Functional Tests

Each test below lists the Test Procedure and Expected Results/Measures. Actual Results/Notes should be recorded during execution.

Test 1: Verify that candidate disks are available

Test Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by ASM.
• Log in to ASM via SQL*Plus and run:
  select name, group_number, path, state, header_status, mode_status, label from v$asm_disk;

Expected Results/Measures:
• The newly added LUN will appear as a candidate disk within ASM.

Test 2: Create an external redundancy ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  create diskgroup <dg name> external redundancy disk '<candidate path>';

Expected Results/Measures:
• The diskgroup is successfully created. It should also be listed in v$asm_diskgroup.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Test 3: Create a normal or high redundancy ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  create diskgroup <dg name> normal redundancy disk '<candidate1 path>', '<candidate2 path>';

Expected Results/Measures:
• The diskgroup is successfully created with normal redundancy and two failure groups. For high redundancy, three failure groups will be created.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Test 4: Add a disk to an ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  alter diskgroup <dg name> add disk '<candidate1 path>';
NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup.

Test 5: Drop an ASM disk from a diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  alter diskgroup <dg name> drop disk <disk name>;
NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The data from the removed disk will be rebalanced across the remaining disks in the diskgroup. Once the rebalance is complete the disk will have a header_status of "FORMER" (v$asm_disk) and will be a candidate to be added to another diskgroup.


Test 6: Undrop an ASM disk that is currently being dropped using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  alter diskgroup <dg name> drop disk <disk name>;
• Before the rebalance completes, run the following command via SQL*Plus:
  alter diskgroup <dg name> undrop disks;
NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The undrop operation will roll back the pending drop operation (assuming it has not completed). The disk entry will remain in v$asm_disk as a MEMBER.

Test 7: Drop an ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  drop diskgroup <dg name>;

Expected Results/Measures:
• The diskgroup will be successfully dropped.
• The diskgroup will be unregistered as a Clusterware resource (crsctl stat res -t).

Test 8: Modify the rebalance power of an active operation using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  alter diskgroup <dg name> add disk '<candidate1 path>';
• Before the rebalance completes, run the following command via SQL*Plus:
  alter diskgroup <dg name> rebalance power <1 - 11>;
  (1 is the default rebalance power.)
NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The rebalance power of the current operation will be changed to the specified value. This is visible in the v$asm_operation view.

Test 9: Verify CSS-database communication and ASM file access

Test Procedure:
• Start all the database instances and query the v$asm_client view in the ASM instances.

Expected Results/Measures:
• Each database instance should be listed in the v$asm_client view.

Test 10: Check the internal consistency of diskgroup metadata using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  alter diskgroup <dg name> check all;

Expected Results/Measures:
• If there are no internal inconsistencies, the message "Diskgroup altered" is returned (asmcmd returns to the asmcmd prompt). If inconsistencies are discovered, appropriate messages are displayed describing the problem.


Component Testing: ASM Functional Tests - ASMCMD

Test 1: Verify that candidate disks are available

Test Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by ASM.
• Log in to ASM via ASMCMD and run:
  lsdsk --candidate

Expected Results/Measures:
• The newly added LUN will appear as a candidate disk within ASM.

Test 2: Create an external redundancy ASM diskgroup using ASMCMD

Test Procedure:
• Identify the candidate disks for the diskgroup by running: lsdsk --candidate
• Create an XML config file to define the diskgroup, e.g.:
  <dg name="<dg name>" redundancy="external">
    <dsk string="<disk path>" />
    <a name="compatible.asm" value="11.1"/>
    <a name="compatible.rdbms" value="11.1"/>
  </dg>
• Log in to ASM via ASMCMD and run:
  mkdg <config file>.xml

Expected Results/Measures:
• The diskgroup is successfully created. It can be viewed using the "lsdg" ASMCMD command.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).


Test 3: Create a normal or high redundancy ASM diskgroup using ASMCMD

Test Procedure:
• Identify the candidate disks for the diskgroup by running: lsdsk --candidate
• Create an XML config file to define the diskgroup, e.g.:
  <dg name="<dg_name>" redundancy="normal">
    <fg name="fg1"> <dsk string="<disk path>" /> </fg>
    <fg name="fg2"> <dsk string="<disk path>" /> </fg>
    <a name="compatible.asm" value="11.1"/>
    <a name="compatible.rdbms" value="11.1"/>
  </dg>
• Log in to ASM via ASMCMD and run:
  mkdg <config file>.xml

Expected Results/Measures:
• The diskgroup is successfully created. It can be viewed using the "lsdg" ASMCMD command.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Test 4: Add a disk to an ASM diskgroup using ASMCMD

Test Procedure:
• Identify the candidate disk to be added by running: lsdsk --candidate
• Create an XML config file to define the diskgroup change, e.g.:
  <chdg name="<dg name>">
    <add> <dsk string="<disk path>"/> </add>
  </chdg>
• Log in to ASM via ASMCMD and run:
  chdg <config file>.xml
NOTE: Progress can be monitored by running "lsop".

Expected Results/Measures:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup. Progress of the rebalance can be monitored by running the "lsop" ASMCMD command.


Test 5 Drop an ASM disk from a diskgroup using ASMCMD

• Identify the ASM name for the disk to be dropped from the given diskgroup: “lsdsk -G <dg name> -k”

• Create an XML config file to define the diskgroup change, e.g.:
<chdg name="<dg name>">
  <drop>
    <dsk name="<disk name>"/>
  </drop>
</chdg>

• Login to ASM via ASMCMD and run: “chdg <config file>.xml”

NOTE: Progress can be monitored by running “lsop”

• The data from the removed disk will be rebalanced across the remaining disks in the diskgroup. Once the rebalance is complete, the disk will be listed as a candidate (lsdsk --candidate) to be added to another diskgroup. Progress can be monitored by running “lsop”
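A hypothetical config file for this test, dropping the disk named DATA_0003 from diskgroup DATA (names are placeholders taken from the “lsdsk -G DATA -k” output):

$ cat /tmp/drop_disk.xml
<chdg name="DATA">
  <drop>
    <dsk name="DATA_0003"/>
  </drop>
</chdg>

ASMCMD> chdg /tmp/drop_disk.xml
ASMCMD> lsop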

Test 6 Modify rebalance power of an active operation using ASMCMD

• Add a disk to a diskgroup (as shown above).

• Identify the rebalance operation by running “lsop” via ASMCMD.

• Before the rebalance completes, run the following command via ASMCMD: “rebal --power <1-11> <dg name>”. NOTE: Progress can be monitored by running “lsop”

• The rebalance power of the current operation will be increased to the specified value. This is visible with the lsop command.
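A short sketch of this test, assuming a diskgroup named DATA with a rebalance already running:

ASMCMD> lsop
ASMCMD> rebal --power 8 DATA
ASMCMD> lsop

The second “lsop” should show the REBAL operation for DATA running with the new power value.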

Test 7 Drop an ASM diskgroup using ASMCMD

• Login to ASM via ASMCMD and run: “dropdg <dg name>”

• The diskgroup will be successfully dropped.

• The diskgroup will be unregistered as a Clusterware resource (crsctl stat res -t)


Component Testing: ASM Objects Functional Tests

Test # Test Procedure Expected Results/Measures Actual Results/Notes

Test 1 Create an ASM template

• Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> add template unreliable attributes(unprotected fine);”

• The ASM template will be successfully created and visible within the v$asm_template view.

Test 2 Apply an ASM template

• Use the template above and apply it to a new tablespace to be created on the database

• Login to the database via SQL*Plus and run: “create tablespace test datafile '+<dg name>(unreliable)/my_files' size 10M;”

• The datafile is created using the attributes of the ASM template

Test 3 Drop an ASM template • Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> drop template unreliable;”

• This template should be removed from v$asm_template.

Test 4 Create an ASM directory

• Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> add directory '+<dg name>/my_files';”

• You can use the asmcmd tool to check that the new directory name was created in the desired diskgroup.

• The created directory will have an entry in v$asm_alias (with ALIAS_DIRECTORY = 'Y')

Test 5 Create an ASM alias • Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> add alias '+<dg name>/my_files/datafile_alias' for '+<dg name>/<db name>/DATAFILE/<file name>';”

• Verify that the alias exists in v$asm_alias

Test 6 Drop an ASM alias • Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> drop alias '+<dg name>/my_files/datafile_alias';”

• Verify that the alias does not exist in v$asm_alias.


Test 7 Drop an active database file within ASM

• Identify a data file from a running database.

• Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> drop file '+<dg name>/<db name>/DATAFILE/<file name>';”

• This will fail with the following message:
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15028: ASM file '+DATA/V102/DATAFILE/TEST.269.654602409' not dropped; currently being accessed

Test 8 Drop an inactive database file within ASM

• Identify a datafile that is no longer used by a database

• Login to ASM via SQL*Plus and run: “alter diskgroup <dg name> drop file '+<dg name>/<db name>/DATAFILE/<file name>';”

• Observe that the file's entry in v$asm_file is now removed.

Component Testing: ASM ACFS Functional Tests

Test # Test Procedure Expected Results/Measures Actual Results/Notes

Test 1 Create an ASM Dynamic Volume

• Create an ASM diskgroup to house the ASM Dynamic Volume. ASMCMD or SQL*Plus may be used to achieve this task. The diskgroup compatibility attributes COMPATIBLE.ASM and COMPATIBLE.ADVM must be set to 11.2 or higher.

• Login to ASM via ASMCMD and create the dynamic volume that will house the ACFS filesystem: “volcreate -G <dg name> -s <size> <vol name>”

• The volume will be created with the specified attributes. The volume can be viewed in ASMCMD by running “volinfo -a”.
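A hypothetical sketch of this test, creating a 10 GB volume named acfsvol1 in a diskgroup named ACFSDG:

ASMCMD> volcreate -G ACFSDG -s 10G acfsvol1
ASMCMD> volinfo -G ACFSDG acfsvol1

Note the reported Volume Device path (on Linux, typically of the form /dev/asm/<vol name>-<nnn>); it is used by the tests that follow.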


Test 2 Create an ACFS filesystem

• Within ASMCMD issue the “volinfo -a” command and take note of the Volume Device path.

• As the root user create an ACFS filesystem on the ASM volume as follows: “/sbin/mkfs -t acfs <volume device path>”

• The filesystem will be successfully created. The filesystem attributes can be viewed by running “/sbin/acfsutil info fs”

Test 3 Mount the ACFS filesystem

• As the root user execute the following to mount the ACFS filesystem: “/sbin/mount -t acfs <volume device path> <mount point>”

NOTE: If acfsutil was not used to register the file system, the dynamic volume must be enabled on the remote nodes before mounting (within ASMCMD run “volenable”).

• The filesystem will be successfully mounted and visible.
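Continuing the hypothetical volume above, and assuming its Volume Device is /dev/asm/acfsvol1-123, Tests 2 and 3 would look roughly like:

# as root, on one node
/sbin/mkfs -t acfs /dev/asm/acfsvol1-123

# as root, on each node that will mount the filesystem
mkdir -p /u02/acfsmounts/vol1
/sbin/mount -t acfs /dev/asm/acfsvol1-123 /u02/acfsmounts/vol1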

Test 4 Add an ACFS filesystem to the ACFS mount registry

• Use acfsutil to register the ACFS filesystem: “/sbin/acfsutil registry -a <volume device path> <mount point>”

• The filesystem will be registered with the ACFS registry. This can be validated by running “/sbin/acfsutil registry -l”

• The filesystem will be automounted on all nodes in the cluster on reboot
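A sketch of this test using the same hypothetical device and mount point:

# as root
/sbin/acfsutil registry -a /dev/asm/acfsvol1-123 /u02/acfsmounts/vol1
/sbin/acfsutil registry -l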

Test 5 Create a file on the ACFS filesystem

• Perform the following: echo "Testing ACFS" > <mount point>/testfile

• Perform a “cat” command on the file on all nodes in the cluster.

• The file will exist on all nodes with the specified contents.

Test 6 Remove an ACFS filesystem from the ACFS mount registry

• Use acfsutil to deregister the ACFS filesystem: “/sbin/acfsutil registry -d <volume device path>”

• The filesystem will be unregistered from the ACFS registry. This can be validated by running “/sbin/acfsutil registry -l”

• The filesystem will NOT be automounted on all nodes in the cluster on reboot


Test 7 Add an ACFS filesystem as a Clusterware resource

NOTE: This is required when using ACFS for a shared RDBMS Home. When ACFS is registered as a CRS resource it should NOT be registered in the ACFS mount registry.

• Execute the following command as root to add an ACFS filesystem as a Clusterware resource: “srvctl add filesystem -d <volume device path> -v <volume name> -g <dg name> -m <mount point> -u root”

• Start the ACFS filesystem resource: “srvctl start filesystem -d <volume device path>”

• The filesystem will be registered as a resource within the Clusterware. This can be validated by running “crsctl stat res -t”

• The filesystem will be automounted on all nodes in the cluster on reboot
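A sketch of this test with the same hypothetical names (run as root; remember the filesystem should not also be in the ACFS mount registry):

# as root
srvctl add filesystem -d /dev/asm/acfsvol1-123 -v acfsvol1 -g ACFSDG -m /u02/acfsmounts/vol1 -u root
srvctl start filesystem -d /dev/asm/acfsvol1-123
crsctl stat res -t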

Test 8 Increase the size of an ACFS filesystem

• Add a disk to the diskgroup housing the ACFS filesystem (if necessary)

• Use acfsutil as the root user to resize the ACFS filesystem: “acfsutil size <size><K|M|G> <mount point>”

• The dynamic volume and filesystem will be resized without an outage of the filesystem, provided enough free space exists in the diskgroup. Validate with “df -h”.
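For example, growing the hypothetical filesystem by 5 GB online:

# as root
/sbin/acfsutil size +5G /u02/acfsmounts/vol1
df -h /u02/acfsmounts/vol1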

Test 9 Install a shared Oracle Home on an ACFS filesystem

• Create an ACFS filesystem that is a minimum of 6GB in size.

• Add the ACFS filesystem as a Clusterware resource.

• Install the 11gR2 RDBMS on the shared ACFS filesystem (see install guide)

• The shared 11gR2 RDBMS Home will be successfully installed.

Test 10 Create a snapshot of an ACFS filesystem

• Use acfsutil to create a snapshot of an ACFS filesystem: “/sbin/acfsutil snap create <name> <ACFS mount point>”

• A snapshot of the ACFS file system will be created under <ACFS mount point>/.ACFS/snaps.

Test 11 Delete a snapshot of an ACFS filesystem

• Use acfsutil to delete a previously created snapshot of an ACFS filesystem: “/sbin/acfsutil snap delete <name> <ACFS mount point>”

• The specified snapshot will be deleted and will no longer appear under <ACFS mount point>/.ACFS/snaps.
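A sketch of Tests 10 and 11 with a hypothetical snapshot name and the same hypothetical mount point:

# as root
/sbin/acfsutil snap create before_patch /u02/acfsmounts/vol1
ls /u02/acfsmounts/vol1/.ACFS/snaps
/sbin/acfsutil snap delete before_patch /u02/acfsmounts/vol1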


Test 12 Perform an FSCK of an ACFS filesystem

• Dismount the ACFS filesystem to be checked on ALL nodes:

o If the filesystem is registered as a Clusterware resource, issue “srvctl stop filesystem -d <device path>” to dismount the filesystem on all nodes

o If the filesystem is only in the ACFS mount registry or is not registered with Clusterware in any way, dismount the filesystem using “umount <mount point>”.

• Execute fsck on the ACFS filesystem as follows: “/sbin/fsck -a -v -y -t acfs <device path>”. This command will automatically fix any errors (-a), answer yes to any prompts (-y) and provide verbose output (-v).

• FSCK will check the specified ACFS filesystem for errors, automatically fix any errors (-a), answer yes to any prompts (-y) and provide verbose output (-v).


Test 13 Delete an ACFS filesystem

• Dismount the ACFS filesystem to be deleted on ALL nodes:

o If the filesystem is registered as a Clusterware resource, issue “srvctl stop filesystem -d <device path>” to dismount the filesystem on all nodes

o If the filesystem is only in the ACFS mount registry or is not registered with CRS in any way, dismount the filesystem using “umount <mount point>”.

• If the filesystem is registered with the ACFS mount registry, deregister the mount point using acfsutil as follows: “/sbin/acfsutil registry -d <device path>”

• Remove the filesystem from the Dynamic Volume using acfsutil: “/sbin/acfsutil rmfs <device path>”

• The ACFS filesystem will be removed from the ASM Dynamic Volume. Attempts to mount the filesystem should now fail.

Test 14 Remove an ASM Dynamic Volume

• Use ASMCMD to delete an ASM Dynamic Volume: “voldelete -G <dg name> <vol name>”

• The removed Dynamic Volume will no longer be listed in the output of “volinfo -a”.

• The disk space utilized by the Dynamic Volume will be returned to the diskgroup.


Component Testing: ASM Tools & Utilities

Test # Test Procedure Expected Results/Measures Actual Results/Notes

Test 1 Run dbverify on the database files.

• Specify each file individually using the dbv utility: dbv userid=<user>/<password> file='<ASM filename>' blocksize=<blocksize>

• The output should be similar to the following, with no errors present:

DBVERIFY - Verification complete

Total Pages Examined         : 640
Total Pages Processed (Data) : 45
Total Pages Failing (Data)   : 0
Total Pages Processed (Index): 2
Total Pages Failing (Index)  : 0
Total Pages Processed (Other): 31
Total Pages Processed (Seg)  : 0
Total Pages Failing (Seg)    : 0
Total Pages Empty            : 562
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 0 (0.0)

Test 2 Use dbms_file_transfer to copy files from ASM to filesystem

• Use dbms_file_transfer.put_file and get_file functions to copy database files (datafiles, archives, etc) into and out of ASM.

NOTE: This requires that database directory objects be pre-created and available for the source and destination locations. See the PL/SQL Packages and Types Reference for dbms_file_transfer details (a sketch follows this test).

• The put_file and get_file procedures will copy files successfully to/from the filesystem. This provides an alternate option for migrating to ASM, or simply for copying files out of ASM.
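A minimal sketch with hypothetical directory paths and file names. Note that copy_file performs a purely local copy, while put_file/get_file additionally take a database link argument for copies to or from a remote database:

SQL> create directory asm_dir as '+DATA/ORCL/DATAFILE';
SQL> create directory fs_dir  as '/u01/app/oracle/stage';

SQL> begin
       dbms_file_transfer.copy_file(
         source_directory_object      => 'ASM_DIR',
         source_file_name             => 'users.259.654602409',
         destination_directory_object => 'FS_DIR',
         destination_file_name        => 'users01.dbf');
     end;
     /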


Component Testing: Miscellaneous Tests

Test # Test Procedure Expected Results/Measures Actual Results/Notes

Test 1 Diagnostics Procedure for Hang/Slowdown

• Start client workload

• Execute automatic and manual procedures to collect database, Clusterware and operating system diagnostics (hanganalyze, racdiag.sql)

• Diagnostics collection procedures complete normally.

• Measure the time required to run the diagnostics procedures. Is it acceptable to wait this long before restarting instances or nodes in a production situation?

Appendix I: Linux Specific Tests

Test # Test Procedure Expected Results/Measures Actual Results/Notes

Test 1 Create an OCFS2 filesystem

• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS2.

• Create the appropriate partition table on the disk and use “partprobe” to rescan the partition tables.

• Create the OCFS2 filesystem by running: “/sbin/mkfs -t ocfs2 <device path>”

• Add the filesystem to /etc/fstab on all nodes

• Mount the filesystem on all nodes

• The OCFS2 filesystem will be created.

• The OCFS2 filesystem will be mounted on all nodes
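A hypothetical end-to-end sketch of this test (device, mount point and mount options are placeholders; the datavolume,nointr options follow the notes in Tests 4 and 5 below and are only needed when the filesystem will hold database files):

# as root, on one node
fdisk /dev/sdd          # create a single partition, then run partprobe on all nodes
/sbin/mkfs -t ocfs2 /dev/sdd1

# as root, on every node
mkdir -p /u03/ocfs2
echo "/dev/sdd1  /u03/ocfs2  ocfs2  _netdev,datavolume,nointr  0 0" >> /etc/fstab
mount /u03/ocfs2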



Test 2 Create a file on the OCFS2 filesystem

• Perform the following: echo "Testing OCFS2" > <mount point>/testfile

• Perform a “cat” command on the file on all nodes in the cluster.

• The file will exist on all nodes with the specified contents.

Test 3 Verify that the OCFS2 filesystem is available after a system reboot

• Issue a “shutdown -r now”

• The OCFS2 filesystem will automatically mount and be accessible to all nodes after a reboot.

Test 4 Enable database archive logs to OCFS2

NOTE: If using the OCFS2 filesystem for database files it must be mounted with the following options: rw,datavolume,nointr

• Modify the database archive log settings to utilize OCFS2

• Archivelog files are created, and available to all nodes on the specified OCFS2 filesystem.

Test 5 Create an RMAN backup on an OCFS2 filesystem

NOTE: If using the OCFS2 filesystem for database files it must be mounted with the following options: rw,datavolume,nointr

• Back up ASM based datafiles to OCFS2 filesystem.

• Execute baseline recovery scenarios (full, point-in-time, datafile).

• RMAN backupsets are created, and available to all nodes on the specified OCFS2 filesystem.

• Recovery scenarios completed with no errors.

Test 6 Create a datapump export on an OCFS2 filesystem

• Using datapump, take an export of the database to an OCFS2 filesystem.

• A full system export should be created without errors or warnings.

Test 7 Validate OCFS2 functionality during node failures.

• Issue a “shutdown -r now” from a single node in the cluster

• OCFS2 filesystem should remain available to surviving nodes.



Test 8 Validate OCFS2 functionality during disk/disk subsystem path failures

NOTE: Only applicable on multipath storage environments.

• Unplug external storage cable connection (SCSI, FC or LAN cable) from node to disk subsystem.

• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency

• No impact to the OCFS2 filesystem.

• Path failover should be visible in the OS logfiles.

Test 9 Perform an FSCK of an OCFS2 filesystem

• Dismount the OCFS2 filesystem to be checked on ALL nodes

• Execute fsck on the OCFS2 filesystem as follows: “/sbin/fsck -v -y -t ocfs2 <device path>”. This command will answer yes to any prompts (-y) and provide verbose output (-v).

• FSCK will check the specified OCFS2 filesystem for errors, answer yes to any prompts (-y) and provide verbose output (-v).

Test 10 Check the OCFS2 cluster status

• Check the OCFS2 cluster status on all nodes by issuing “/etc/init.d/o2cb status”.

• The output of the command will be similar to:

Module "configfs": Loaded Filesystem "configfs": Mounted Module "ocfs2_nodemanager": Loaded Module "ocfs2_dlm": Loaded Module "ocfs2_dlmfs": Loaded Filesystem "ocfs2_dlmfs": Mounted Checking O2CB cluster ocfs2: Online Checking O2CB heartbeat: Active


Appendix II: Windows Specific Tests

Test # Test Procedure Expected Results Actual Results/Notes

Test 1 Create an OCFS filesystem

• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS.

• Create the appropriate partition table on the disk and validate disk and partition table is visible on ALL nodes (this can be achieved via diskpart).

• Assign a drive letter to the logical drive

• Create the OCFS filesystem by running:
cmd> %GI_HOME%\cfs\ocfsformat /m <drive_letter> /c <cluster size> /v <volume name> /f /a

• The OCFS filesystem will be created.

• The OCFS filesystem will be mounted on all nodes

Test 2 Create a file on the OCFS filesystem

• Perform the following: Use notepad to create a text file containing the text “TESTING OCFS” on an OCFS drive.

• Use notepad to validate that the file exists on all nodes.

• The file will exist on all nodes with the specified contents.

Test 3 Verify that the OCFS filesystem is available after a system reboot

• Reboot the node

• The OCFS filesystem will automatically mount and be accessible to all nodes after a reboot.

Test 4 Enable database archive logs to OCFS

• Modify the database archive log settings to utilize OCFS

• Archivelog files are created, and available to all nodes on the specified OCFS filesystem.



Test 5 Create an RMAN backup on an OCFS filesystem

• Back up ASM based datafiles to OCFS filesystem.

• Execute baseline recovery scenarios (full, point-in-time, datafile).

• RMAN backupsets are created, and available to all nodes on the specified OCFS filesystem.

• Recovery scenarios completed with no errors.

Test 6 Create a datapump export on an OCFS filesystem

• Using datapump, take an export of the database to an OCFS filesystem.

• A full system export should be created without errors or warnings.

Test 7 Validate OCFS functionality during node failures.

• Reboot a single node in the cluster

• OCFS filesystem should remain available to surviving nodes.

Test 8 Remove a drive letter and ensure that the letter is re-established for that partition

• Using Windows disk management use the ‘Change Drive Letter and Paths …’ option to remove a drive letter associated with an OCFS partition.

• OracleClusterVolumeService should restore the drive letter assignment within a short period of time.

Test 9 Run ocfscollect tool • OCFSCollect is available as an attachment to Note: 332872.1

• A .zap file is produced (rename it to .zip and extract). The contents can be used as a baseline regarding the health of the available OCFS drives.