recover a failed raid without deleting data on apg40

"Recover a failed RAID without deleting data on APG40"

ID: SCS128928

Domain: primus_owner@PRIMPRD

Usage Count: 3365

Class: External1

Conflicts: 0

Audience: Internal

Initiated by: epamks (Mark Scrivener)

Date Created: 9/18/2002

Date Modified: 12/22/2011

Modified By: epamks (Mark Scrivener)

Owner: epamks (Mark Scrivener)

Status: REL (released)

Suspected_Faulty: No

Type: How to

Goal

Recover a failed RAID without deleting data on APG40

Re-create a dead RAID without deleting data on APG40

Re-create a dead array without deleting data on APG40

Fact

APG40

APG40C/2

Network: CDMA

Network: GSM

Network: WCDMA

Network: Wireline

Node: AXE BSC

Node: AXE FNR

Node: AXE HLR

Node: AXE MSC

Service: Engine Integral

Symptom

Both nodes down

AP FAULT

PROBLEM: DOMAIN CONNECTION

PROBLEM: GENERAL ERROR

AP REBOOT, CAUSE by Command initiated

AP PROCESS STOPPED, CAUSE by Process death

Alarm: AP FAULT, MIRRORED DISKS NOT REDUNDANT.

Both disks of a RAID have failed

RAID marked as dead in DPT Storage Manager

STS stopped due to dead RAID disk

FOS failed

Command: raidutil displays an extra RAID entry

One node is Passive and one node is Undefined

fcc_integrate was not executed correctly

RTR is failed

Event ID: 1034

The disk associated with cluster disk resource 'Disks J: K: L: M:' could not be found.

Recover a failed RAID without deleting data on APG40 http://esessmw1008.ss.sw.ericsson.se/iview/ui/print.asp?t=1&Solution=C...

1 of 11 30-3-12 5:16 p.m.

The disk associated with cluster disk resource 'Disks ...' could not be found.

The expected signature of the disk was xxxxxxxx. If the disk was removed from the cluster, the resource should be deleted. If the disk was

replaced, the resource must be deleted and created again in order to bring the disk online. If the disk has not been removed or replaced, it may

be inaccessible at this time because it is reserved by another cluster node.

Both nodes in state undefined

Command: net start clussvc fails with A system error has occurred., Size of job is %1 bytes.

A system error has occurred.

Size of job is %1 bytes.

Command: net start clussvc fails with A system error has occurred., System error 2 has occurred., The system cannot find the file specified.

System error 2 has occurred.

No STS & no MML & One Node is undefined

The system cannot find the file specified.

Disk Resource is Failed

Cluster disk resource failed

fcc_save_to_remove other gives "removing mirroring: failed"

'fcc_save_to_remove other' command hangs

System error 1067 has occurred.

AP NOT AVAILABLE

Alarm: STATISTICS AND TRAFFIC MEASUREMENT FILE ACCESS FAULT, STS COULD NOT ACCESS FILE

OSS heartbeat failure alarm

Cause

A RAID is marked failed (dead) when both disk drives belonging to it have failed.

The RAID information is corrupt and/or a RAID controller is faulty.

One known cause is loading or updating the RAID firmware on an incompatible board, for example loading the FT06 RAID firmware (CN-I APZ 212 20/5-584 and -585) on version 3.1.3.3 of the PSU-HDD board.

An incorrectly terminated SCSI bus, e.g. when "fcc_save_to_remove other" was not performed.

A task force was created in PDU to address the large number of emergencies caused by RAID failures.

The first outcome of the task force is improved handling at the repair centre, e.g. if a node is returned due to a RAID failure, the RAIDs are now tested.

The second outcome of the task force was a modification of the SCSI BUS RESELECTION time-out parameters on the SCSI disks. PDU believes that this will reduce the number of emergencies caused by RAID failures by at least 30% to 50%.

The APG40 GCC (GSDC Spain) and PDU have set up a monthly "KCS Triggered Product Improvement" report to determine the most common problems in APG4x and make recommendations on how to fix them. The first SOLUTION fix in this Primus will be continuously updated to include any relevant information from this report.

Ericsson internal only

Fix

REMEDY:

CONDITIONS:

This solution is applicable to APG40C/1 and APG40C/2.

1.

The status of a RAID is Failed, Impacted or Dead.

If none of the RAIDs has the status Failed, Impacted or Dead then this solution is normally not applicable. See the note "Is this solution right for me?" below for more information.

AP Command:raidutil -L logical

Example:C:\> raidutil -L logical

Address   Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0  RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal

d0b0t2d0  RAID 1 (Mirrored)  DPT RAID-1          17522MB   Failed
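The check in condition 2 can be sketched as a small script that reads a saved "raidutil -L logical" printout and lists the RAIDs whose status is Failed, Impacted or Dead. This is only an illustration: the row layout is assumed from the example printout above, and the function names are hypothetical and not part of any APG40 tool.

```python
import re

# Statuses that condition 2 treats as requiring this procedure.
BAD_STATUSES = {"Failed", "Impacted", "Dead"}

# Assumed row layout (taken from the example printout only):
# address, free-text description, capacity in MB, status.
ROW = re.compile(r"^(d\d+b\d+t\d+d\d+)\s+(.*?)\s+(\d+)MB\s+(.+?)\s*$")


def parse_rows(printout):
    """Return (address, description, capacity_mb, status) for each data row."""
    rows = []
    for line in printout.splitlines():
        match = ROW.match(line.strip())
        if match:
            address, description, capacity, status = match.groups()
            rows.append((address, description, int(capacity), status))
    return rows


def raids_needing_recreate(printout):
    """Addresses of RAIDs whose status is Failed, Impacted or Dead."""
    return [address for address, _, _, status in parse_rows(printout)
            if status in BAD_STATUSES]


SAMPLE = """\
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
"""

print(raids_needing_recreate(SAMPLE))
```

An empty result means that none of the listed RAIDs matches the condition, in which case the note "Is this solution right for me?" should be consulted before continuing.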

2.

The RAID NVRAM is not "MOT V1.1".

RAID NVRAM version "MOT V1.1" has introduced problems which may cause this Primus solution to fail. Primus solution SCS684731 should be used to upgrade or downgrade the RAID firmware on both nodes.

3.



AP Command:raidutil -L version

Example printout:

# Controller Cache FW NVRAM BIOS SMOR Serial

---------------------------------------------------------------------------

d0 DPT PM3757U2 0MB FT0A MOT V1.1 10-10035
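The NVRAM condition can be checked against a saved "raidutil -L version" printout with a plain text search. The sketch below assumes only that controller rows start with an address such as "d0", as in the example printout; the function name is hypothetical.

```python
def nvram_is_mot_v11(version_printout):
    """True if any controller row (first field 'd<number>') mentions MOT V1.1."""
    for line in version_printout.splitlines():
        fields = line.split()
        is_controller_row = (
            bool(fields)
            and fields[0].startswith("d")
            and fields[0][1:].isdigit()
        )
        if is_controller_row and "MOT V1.1" in line:
            return True
    return False


SAMPLE_VERSION = """\
# Controller Cache FW NVRAM BIOS SMOR Serial
---------------------------------------------------------------------------
d0 DPT PM3757U2 0MB FT0A MOT V1.1 10-10035
"""

print(nvram_is_mot_v11(SAMPLE_VERSION))
```

A True result means the condition is not met and Primus solution SCS684731 applies first.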

PROCEDURE:

When a RAID is failed and/or both disks of the RAID are failed, the OPI "AP, System Data Disk Restore" should normally be followed to fix the problem. The OPI fixes the problem by zapping the drives, destroying all data on the data disks. This Primus solution fixes the problem by deleting and re-creating the RAID definitions without data loss. This Primus solution is meant to be used as an alternative to the OPI.

This Primus solution should therefore be used in similar circumstances. If this Primus solution does not fix the problem then the OPI "AP, System Data Disk Restore" should be considered.

The procedure takes about 30 minutes and during this time there will be no MML contact, charging will be buffered and STS data will be lost.

Collect information for further analysis.

Log the information below from both nodes and send the result to the owner of this solution.

AP Command:hostname

prcstate

date/t

time/t

raidutil -L all

frlbbdiag -v

raidutil -K

raidutil -e soft d0

raidutil -e recov d0

raidutil -e nonrecov d0

raidutil -e status d0

aehevls -l app -c dptelog

mktr <YYMMDD>-<HHMM> -c
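As an aid for logging, the command list above can be generated with the mktr timestamp already filled in. This is a minimal sketch; it assumes that <YYMMDD>-<HHMM> stands for the local date and time of collection with a two-digit year, and the helper name is hypothetical. The commands themselves are copied unchanged from the list above.

```python
from datetime import datetime

# Commands copied from the collection list, in the same order.
COLLECT_COMMANDS = [
    "hostname",
    "prcstate",
    "date/t",
    "time/t",
    "raidutil -L all",
    "frlbbdiag -v",
    "raidutil -K",
    "raidutil -e soft d0",
    "raidutil -e recov d0",
    "raidutil -e nonrecov d0",
    "raidutil -e status d0",
    "aehevls -l app -c dptelog",
]


def collection_commands(now):
    """Return the command list with the mktr name built from 'now'."""
    stamp = now.strftime("%y%m%d-%H%M")  # assumed meaning of <YYMMDD>-<HHMM>
    return COLLECT_COMMANDS + [f"mktr {stamp} -c"]


for command in collection_commands(datetime(2012, 3, 30, 17, 16)):
    print(command)
```

The printed list can be pasted into the session on each node so that both logs contain the same commands in the same order.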

1.

Determine the source disk for the RAID re-create.

When the RAID is deleted and re-created, a disk must be chosen as the source of the data for the RAID.

In this solution the node that will be used as the source of the data will be referred to as the good node and the other node will be referred to as the faulty node.

This is the most important step of the procedure and it is recommended that second line support performs this step. The "raidutil -e status d0" logs from both nodes should be used to determine the sequence of events. The node where the disks failed last should normally be used as the source node. The frlbbdiag command must also be used to verify that the source node is free from fault.

Command:frlbbdiag -v

raidutil -e status d0

2.

Connect to the faulty node.

This is the node that will not be used as the source of the data for the RAID.

AP Command:hostname

3.

Shut down the node.

AP Command:prcboot -s

4.

Connect to the good node.

Use the node IP address and not the cluster IP address.

This is the node that will be used as the source of the data for the RAID.

AP Command:hostname

5.

Disable the "Cluster Server" and Ericsson services startup.

Do not disable the "Cluster Disk" device as this will prevent the RAID from being deleted.

Windows 2003 Command:sc config Clussvc start= Disabled

6.



sc config ACS_PRC_ClusterControl start= Disabled

sc config ACS_FCH_Server start= Disabled

sc config ACS_FCR_Server start= Disabled

Windows NT Command:echo REGEDIT4 > C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_PRC_ClusterControl] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_FCH_Server] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
type C:\TEMP\Cluster_Disabled.reg
regedit /s C:\TEMP\Cluster_Disabled.reg
del C:\TEMP\Cluster_Disabled.reg
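The Windows NT commands above build a REGEDIT4 file one echo at a time. The same file content can be sketched as a function, which makes it easier to review before it is imported. The function name is hypothetical; the service names and key path are taken from the commands above, and the start values follow the standard Windows service convention (2 = automatic, 4 = disabled).

```python
SERVICES_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"


def build_reg_text(services, start_value):
    """Return REGEDIT4 text that sets "Start" for each named service."""
    lines = ["REGEDIT4", ""]
    for name in services:
        lines.append(f"[{SERVICES_KEY}\\{name}]")
        # dword values are written as 8 hexadecimal digits.
        lines.append(f'"Start"=dword:{start_value:08x}')
        lines.append("")
    return "\n".join(lines)


# Content equivalent to a Cluster_Disabled.reg for two of the services.
print(build_reg_text(["ClusSvc", "ACS_PRC_ClusterControl"], 4))
```

The output is only text; importing it still requires "regedit /s" on the node, as in the commands above.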

Set BIOS "Cluster Support" to Disabled (Off).

AP Command:raidutil +cluster off

7.

Reboot the node.

Do not use prcboot. The normal "prcboot" command normalises the "Cluster Server" and Ericsson services startup.

There may be no response from the terminal until the AP finishes rebooting after the shutdown command is entered. This will take about 6 minutes.

Windows 2003 Command:shutdown /f /r /t 0

Windows NT Command:shutdown /f /r %COMPUTERNAME%

8.

Check that SCSI disks are correct and available.

If the 6 SCSI disks, 3 per node, cannot be seen or the targets are incorrect then it will not be possible to re-create the RAID.

AP Command:raidutil -L physical

Example:C:\> raidutil -L physical


---------------------------------------------------------------------------

d0b0t0d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal

d0b0t1d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal

d0b0t2d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Failed drive



d0b1t2d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Failed drive

9.

Check the size of the RAID.

Make a note of the size of the RAID that will be deleted and re-created.

If the capacities of the disks differ then the size of the RAID has to be set when it is re-created.

AP Command:raidutil -L raid

Example where the RAID size is 17432:C:\> raidutil -L raid


---------------------------------------------------------------------------








d0b0t2d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Failed drive

d0b1t2d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Failed drive
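The size rule can be illustrated with the two capacities from the example. The sketch assumes that a mirror cannot be larger than its smaller disk, which matches the example RAID size of 17432 for disks of 17522MB and 17432MB; the size noted from the actual "raidutil -L raid" printout remains the value to use. The function name is hypothetical.

```python
def raid_size_to_set(capacity_a_mb, capacity_b_mb):
    """Return None when the capacities match, else the smaller capacity in MB."""
    if capacity_a_mb == capacity_b_mb:
        # Equal disks: the size does not have to be set explicitly.
        return None
    return min(capacity_a_mb, capacity_b_mb)


print(raid_size_to_set(17522, 17432))
print(raid_size_to_set(17522, 17522))
```

A returned value indicates that the size has to be given when the RAID is re-created; None indicates that it can be left out.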

10.



Delete the RAID.

Only delete the RAIDs that are Failed, Impacted or Dead.

If it is not possible to delete the RAID then follow the note "Additional steps to delete the RAID" below and then continue with the next step.

AP Command:raidutil -D d0b0t<#>d0

Examples:

Delete RAID d0b0t0d0:C:\> raidutil -D d0b0t0d0

d0b0t0d0


d0b0t1d0


d0b0t2d0

11.

Check that the RAID has been deleted.

If the RAID has not been deleted then follow the note "Additional steps to delete the RAID" below and then continue with the next step.


Expected Printout:Failure:Can't find component by address

Expected Printout:Failure:Can't find component by address

12.

Set the disk cache to write back.

AP Command:raidutil -w on d0b0t<#>d0

raidutil -w on d0b1t<#>d0

Examples:

RAID d0b0t0d0 deleted:C:\> raidutil -w on d0b0t0d0

C:\> raidutil -w on d0b1t0d0





13.

Re-create the RAID.

The first disk specified after the "-g" parameter is used as the source of the data when re-creating the RAID.

The "-s" parameter is only required if the size of the RAID has to be set as described above. If the "-s" parameter is not specified then the size of the RAID is set to the capacity of the first disk specified after the "-g" parameter.

Note: If it is not possible to re-create the RAID then follow the note "Disconnect SCSI cables" and then continue this procedure from the next step (that is, from step 15, without recreating the RAIDs). It is important to disconnect the SCSI cables, otherwise a disk on the shutdown node may still be accessed. This will leave the RAID deleted and allow the AP to run as a single node. The faulty node should be left shut down as it will be unable to be active. The RAID must be re-created when the faulty node is replaced using the note "RAID re-create during node change" below.

AP Command:raidutil -l 1 -g d0b0t<#>d0,d0b1t<#>d0 [-i -s <size>]

Examples:

Re-create RAID d0b0t0d0:C:\> raidutil -l 1 -g d0b0t0d0,d0b1t0d0

Created: RAID 1


14.



Created: RAID 1


Created: RAID 1

Re-create RAID d0b0t0d0 with size 17432MB:

C:\> raidutil -l 1 -g d0b0t0d0,d0b1t0d0 -i -s 17432

Created: RAID 1



Created: RAID 1



Created: RAID 1
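The command pattern for this step can be sketched as a helper that builds the command line for a given target, so that the source disk is always written first and "-i -s" is added only when a size is needed. The helper name is hypothetical; it only assembles text in the form given under "AP Command" above and does not run raidutil.

```python
def recreate_command(target, size_mb=None):
    """Build the re-create command for target number 'target'.

    The first disk after -g (bus 0) is the source of the data.
    """
    source = f"d0b0t{target}d0"
    mirror = f"d0b1t{target}d0"
    command = f"raidutil -l 1 -g {source},{mirror}"
    if size_mb is not None:
        # Only needed when the RAID size has to be set explicitly.
        command += f" -i -s {size_mb}"
    return command


print(recreate_command(0))
print(recreate_command(0, 17432))
```

Both printed lines match the examples above for RAID d0b0t0d0, without and with the size set to 17432MB.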

Stop the RAID rebuild.

This is a precaution in case the wrong node has been chosen as the source.

AP Command:raidutil -a stop d0

15.

Set the RAID cache to write through.

AP Command:raidutil -w off d0b0t0d0

raidutil -w off d0b0t1d0

raidutil -w off d0b0t2d0

16.

Check that the RAID has been re-created.

If the RAID has not been re-created then contact the next level of support.


Example:C:\> raidutil -L logical


---------------------------------------------------------------------------

d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded



17.

Set BIOS "Cluster Support" to Enabled (On).

AP Command:raidutil +cluster on

18.

Normalise the "Cluster Server" and Ericsson services startup.

Note: In APZ 11.3 and later the ACS_PRC_ClusterControl service startup type should be set to automatic. This will be done in a later step.

Windows 2003 Command:sc config ClusSvc start= Auto

sc config ACS_FCH_Server start= Auto

sc config ACS_FCR_Server start= Auto

Windows NT Command:echo REGEDIT4 > C:\TEMP\Cluster_Enabled.reg

echo. >> C:\TEMP\Cluster_Enabled.reg

echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc] >> C:\TEMP\Cluster_Enabled.reg

echo "Start"=dword:00000002 >> C:\TEMP\Cluster_Enabled.reg


echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_PRC_ClusterControl] >> C:\TEMP\Cluster_Enabled.reg



echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_FCH_Server] >> C:\TEMP\Cluster_Enabled.reg


type C:\TEMP\Cluster_Enabled.reg

regedit /s C:\TEMP\Cluster_Enabled.reg

del C:\TEMP\Cluster_Enabled.reg

19.



Reboot the node.

The prcboot command is not used with Windows Server 2003 due to problems with the node not rebooting.

Windows 2003 Command:shutdown /f /r /t 0

If the printout below is received then repeat the command until successful. The command will not be successful until the "Preparing network connections" dialog disappears.

The computer is processing another action and thus cannot be shut down. Wait until the computer has finished its action, and then try again.(21)

Windows NT Command:prcboot

20.

Check the status of the RAIDs.

If the RAID status has returned to Failed then replace the faulty node and repeat the procedure. If a spare node is not immediately available then follow the note "Disconnect SCSI cables" below and repeat this procedure. This will leave the RAID deleted and allow the AP to run as a single node. The faulty node should be left shut down until a replacement is available. If this is done the RAID must be re-created when the faulty node is replaced using the note "RAID re-create during node change" below.


21.

Wait for all resources to come online.

The resources owned by the faulty, shutdown node will not come online.

If the faulty node is going to be replaced then the procedure is complete.

22.

Reboot the faulty, shutdown node.

This step should not be performed if the faulty node should be left shutdown or if the RAID was not re-created.

AP Command:fcc_reset other

23.

Wait for all resources to come online.

24.

Normalise the ACS_PRC_ClusterControl resource.

This step is not required with Windows NT as the prcboot command above sets the startup type.

Windows Server 2003 Command:sc config ACS_PRC_ClusterControl start= Auto

25.

Make sure the RAID rebuild is set to fast.

AP Command:raidutil -r fast d0

Example:C:\> raidutil -r fast d0

Address Type Rate

---------------------------------------------------------------------------

d0b0t7d0 HBA 9.0s (fast)

d0b0t2d0 RAID 1 (Mirrored) 9.0s (fast)



26.

Check the RAID disks for faults.

If there are any faults then follow the OPI "AP FAULT" and do not attempt a rebuild - do not perform the remaining steps in this procedure.

Command:frlbbdiag

27.

Rebuild the re-created RAIDs.

AP Command:raidutil -a rebuild d0b0t<#>d0

Examples:

Rebuild RAID d0b0t0d0:C:\> raidutil -a rebuild d0b0t0d0

28.



d0b0t0d0


d0b0t1d0


d0b0t2d0

Perform a health check of the AP.

Follow Primus solution SCS123402.

29.

Query and change the SCSI BUS RESELECTION settings with FrChangeDisk.

Follow Primus solution SCS841510.

30.

Implement APG40C/2 RAID improvements as per the SOLUTION fix below.

31.

SOLUTION:

CONDITIONS:

As in the REMEDY above.

1.

PROCEDURE:

Implement recommendations from PDU task force and GCC/PDU APG40 KCS Triggered Product Improvement.

Implement the SCSI BUS RESELECTION time-out parameter change.

This change is introduced with CN-I APZ 212 30/4-1126.

This CN-I is included in the following packages:

- BSC PLM: APG40 One Track IP-A203.

- MSC PLM: APG40 One Track EP-A111.

- APZ PLM: APG40 One Track AGM018.

The FrChangeDisk tool introduced with CN-I APZ 212 30/4-1126 has been updated to fix faults in the following CN-Is.

CN-I APZ 212 30/4-1233.

This CN-I is included in the following packages:


CN-I APZ 212 30/4-1487.


- APZ PLM: APG40 One Track UAM009.

1.

Implement the FrLbbDiag tool and ContLogCollector service.




The FrLbbDiag tool introduced with CN-I APZ 212 30/4-1140 has been updated in the following CN-Is.

CN-I APZ 212 30/4-1375.


- APZ PLM: APG40 One Track AGM020.

2.

Implement NVRAM Force V2.1.




3.

Use the updated AP FAULT OPI when rebuilding the RAID when the AP FAULT alarm is raised.



4.





SOLUTION:

CONDITIONS:

As in the REMEDY above.

1.

Following the procedure in the above remedy either did not fix the problem, or there was a subsequent occurrence of the same fault.

2.

PROCEDURE:

Using the information from the log files gathered in the REMEDY above, determine which node is most likely to be faulty.

1.

Change the node. See the Operational Instruction "APG40, Node, Change".

If unsure about which node should be changed, please contact the next level of support for assistance. It is possible that the actual fault is not in the indicated node, but in the other node, or in one of the SCSI cables connecting the two nodes. These may also need to be changed if changing the indicated node still does not fix the problem.

2.

SOLUTION:

CONDITIONS:

As in the first REMEDY above.

1.

PROCEDURE:

It is normal for a RAID to be failed when both hard disks have failed.

It is the opinion of design that the RAID should be failed when both disk drives belonging to the RAID have failed. Preventing this issue from occurring requires choosing a disk drive to be used as the source of the data for the RAID. It is the opinion of design that it is too dangerous to allow the system to do this and it is better to follow the OPI "AP, System Data Disk Restore" and erase the data disks. It is therefore important to replace any node with a failed disk drive as soon as possible.

Several other TRs have been raised on this issue; TR HE82881 is a good example of design's opinion of the problem.

TR HG51610 has been raised to address this issue.

1.


Note

Is this solution right for me?

If the "raidutil -L raid" printout displays 3 "RAID 1" entries, each with 2 "Disk Drive" entries with correct targets, and the status of the "RAID 1" entries is Optimal or Degraded, then this solution is NOT applicable.
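The rule above can be written down as a simple predicate. This is a sketch over already-interpreted data, not a parser for the printout: the RaidEntry type and its field names are hypothetical, and whether the targets are "correct" is left as a judgement supplied by the person reading the printout.

```python
from dataclasses import dataclass


@dataclass
class RaidEntry:
    status: str            # status of the "RAID 1" entry, e.g. "Degraded"
    drive_count: int       # number of "Disk Drive" entries under it
    targets_correct: bool  # judged by the engineer from the printout


def solution_not_applicable(raids):
    """True when the rule in this note says the solution is NOT applicable."""
    return len(raids) == 3 and all(
        raid.drive_count == 2
        and raid.targets_correct
        and raid.status in {"Optimal", "Degraded"}
        for raid in raids
    )


healthy = [RaidEntry("Optimal", 2, True),
           RaidEntry("Optimal", 2, True),
           RaidEntry("Degraded", 2, True)]
print(solution_not_applicable(healthy))
```

A False result does not by itself mean that the solution applies; the examples in this note describe the remaining cases.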

Example 1: The printout shows a degraded RAID and a failed disk drive. This solution is NOT applicable.

The OPI "AP FAULT" should be used instead.

C:\> raidutil -L raid


---------------------------------------------------------------------------


d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal






d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Failed


Example 2: The printout shows a degraded RAID and a missing disk drive. This solution is NOT applicable.

Primus solution SCS388828 should be used instead. It may be necessary to reboot the node after power cycling the faulty node for the SCSI disks to be scanned.



---------------------------------------------------------------------------











d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 0MB Missing

Example 3: The printout shows a degraded RAID and a missing disk drive with an invalid target. This solution may be applicable.

If "raidutil -L physical" correctly shows all 6 disks then Primus solution SCS388828 should be used.

If the problem persists the faulty node should be shut down and the source node rebooted.

If the problem persists this solution should be followed to delete and re-create the corrupted RAID information.



---------------------------------------------------------------------------









d0b1t3d0 Disk Drive (DASD) FUJITSU MAN3184MP 0MB Missing

Example 4: The printout shows a failed RAID.

This solution is applicable. The failed RAIDs need to be deleted and re-created.



---------------------------------------------------------------------------










Special note when the OPI "APG40, Node, Change" was followed without zapping the RAIDs on the replaced node.

Example 5: The printout shows a failed RAID.

This solution is applicable. The failed RAIDs need to be deleted and re-created.

In this case the hard disks on the non-replaced node should be used as the source of the data. Therefore the procedure should be performed on that node.



---------------------------------------------------------------------------


d0b1t4d0 Disk Drive (DASD) DPT --UNKNOWN-- 0MB Missing

d0b0t0d0 Disk Drive (DASD) FUJITSU MAP3367NP 17522MB Failed drive







RAID re-create during node change

Follow the OPI "APG40, Node, Change" until the SCSI cables are reconnected and the node is powered on.

OPI "APG40, Node, Change, APG40C/2":

7/154 31-CRZ 222 02 revision X: Stop after step 136.

7/154 31-CRZ 222 02 revision Z: Stop after step 146.

7/154 31-CRZ 222 04 revision A: Stop after step 136.

7/154 31-CRZ 222 04 revision B: Stop after step 145.

7/154 31-CRZ 222 05 revision E: Stop after step 132.

7/154 31-CRZ 222 05 revision K: Stop after step 132.

7/154 31-CRZ 222 05 revision M: Stop after step 164.

7/154 31-CRZ 222 05 revision S: Stop after step 153.

7/154 31-CRZ 222 05 revision T: Stop after step 143.

7/154 31-CRZ 222 05 revision U: Stop after step 178.

OPI "APG40, Node, Change, C/2, Win 2003 Spare":

12/154 31-CRZ 222 05 revision C: Stop after step 141

1.

Repeat the procedure above to re-create the deleted RAID.

2.



Note: As the RAID has already been deleted the step "Delete the RAID" in the procedure should be skipped.

Continue with the OPI "APG40, Node, Change" from the next step.

3.

Additional steps to delete the RAID.

This note contains additional steps for the step "Delete the RAID" in the procedure above.

Disconnect the SCSI cables.

Remove the upper (top) SCSI cable from the good node.

Remove the lower (bottom) SCSI cable from the good node.

Remove the upper (top) SCSI cable from the faulty node.

Remove the lower (bottom) SCSI cable from the faulty node.

1.

Delete the RAID.

If it is not possible to delete the RAID then contact the next level of support.

AP Command:raidutil -D d0b0t<#>d0

Examples:


d0b0t0d0


d0b0t1d0


d0b0t2d0

2.

Check that the RAID has been deleted.

If the RAID has not been deleted then contact the next level of support.


3.

Reconnect the SCSI cables.

Connect the upper (top) SCSI cable to the faulty node.

Connect the lower (bottom) SCSI cable to the faulty node.

Connect the upper (top) SCSI cable to the good node.

Connect the lower (bottom) SCSI cable to the good node.

4.

Check that the six SCSI disks are correct and available.

If the 3 SCSI disks on bus 1 are not visible then follow the step "Reboot the node" in the procedure.

AP Command:raidutil -L physical

Example:C:\> raidutil -L physical


---------------------------------------------------------------------------




5.

Continue with the procedure above.

Continue with the step "Re-create the RAID" in the procedure.

6.

Disconnect SCSI cables

Disconnect the SCSI cables.

Remove the upper (top) SCSI cable from the good node.

Remove the lower (bottom) SCSI cable from the good node.

Remove the upper (top) SCSI cable from the faulty node.

Remove the lower (bottom) SCSI cable from the faulty node.

1.

