recover a failed raid without deleting data on apg40
TRANSCRIPT
"Recover a failed RAID without deleting data on APG40"
ID: SCS128928
Domain: primus_owner@PRIMPRD
Usage Count: 3365
Class: External1
Conflicts: 0
Audience: Internal
Initiated by: epamks (Mark Scrivener)
Date Created: 9/18/2002
Date Modified: 12/22/2011
Modified By: epamks (Mark Scrivener)
Owner: epamks (Mark Scrivener)
Status: REL (released)
Suspected_Faulty: No
Type: How to
Goal
Recover a failed RAID without deleting data on APG40
Re-create a dead RAID without deleting data on APG40
Re-create a dead array without deleting data on APG40
Fact
APG40
APG40C/2
Network: CDMA
Network: GSM
Network: WCDMA
Network: Wireline
Node: AXE BSC
Node: AXE FNR
Node: AXE HLR
Node: AXE MSC
Service: Engine Integral
Symptom
Both nodes down
AP FAULT
PROBLEM: DOMAIN CONNECTION
PROBLEM: GENERAL ERROR
AP REBOOT, CAUSE by Command initiated
AP PROCESS STOPPED, CAUSE by Process death
Alarm: AP FAULT, MIRRORED DISKS NOT REDUNDANT.
Both disks of a RAID have failed
RAID marked as dead in DPT Storage Manager
STS stopped due to dead RAID disk
FOS failed
Command: raidutil displays an extra RAID entry
One node is Passive and one node is Undefined
fcc_integrate was not executed correctly
RTR is failed
Event ID: 1034
The disk associated with cluster disk resource 'Disks J: K: L: M:' could not be found.
Recover a failed RAID without deleting data on APG40 http://esessmw1008.ss.sw.ericsson.se/iview/ui/print.asp?t=1&Solution=C...
1 of 11 30-3-12 5:16 p.m.
The disk associated with cluster disk resource 'Disks ...' could not be found.
The expected signature of the disk was xxxxxxxx. If the disk was removed from the cluster, the resource should be deleted. If the disk was
replaced, the resource must be deleted and created again in order to bring the disk online. If the disk has not been removed or replaced, it may
be inaccessible at this time because it is reserved by another cluster node.
Both nodes in state undefined
Command: net start clussvc fails with A system error has occurred., Size of job is %1 bytes.
A system error has occurred.
Size of job is %1 bytes.
Command: net start clussvc fails with A system error has occurred., System error 2 has occurred., The system cannot find the file specified.
System error 2 has occurred.
No STS & no MML & One Node is undefined
The system cannot find the file specified.
Disk Resource is Failed
Cluster disk resource failed
fcc_save_to_remove other gives "removing mirroring: failed"
'fcc_save_to_remove other' command hangs
System error 1067 has occurred.
AP NOT AVAILABLE
Alarm: STATISTICS AND TRAFFIC MEASUREMENT FILE ACCESS FAULT, STS COULD NOT ACCESS FILE
OSS heartbeat failure alarm
Cause
The RAID will be failed (dead) when both disk drives belonging to the RAID are failed.
The RAID information is corrupt and/or a RAID controller is faulty.
One known cause is loading/updating the RAID firmware on an incompatible board. For example loading the FT06 RAID firmware (CN-I APZ
212 20/5-584 and -585) on version 3.1.3.3 of the PSU-HDD board.
An incorrectly terminated SCSI bus, e.g. from not running "fcc_save_to_remove other".
A task force was created in PDU to address the large number of emergencies caused by RAID failures.
The first outcome of the task force is improved handling at the repair centre, e.g. if a node is returned due to a RAID failure the RAIDs are now
being tested.
The second outcome of the task force was a modification of the SCSI BUS RESELECTION time-out parameters on the SCSI disks. PDU believe
that this will reduce the number of emergencies caused by RAID failures by at least 30% to 50%.
The APG40 GCC (GSDC Spain) and PDU have set up a monthly "KCS Triggered Product Improvement" report to determine the most common
problems in APG4x and make recommendations on how to fix them. The first SOLUTION fix in this Primus will be continuously updated to
include any relevant information from this report.
Ericsson internal only
Fix
REMEDY:
CONDITIONS:
This solution is applicable to APG40C/1 and APG40C/2.
1.
The status of a RAID is Failed, Impacted or Dead.
If none of the RAIDs have the status Failed, Impacted or Dead then this solution is normally not applicable. See the note "Is this
solution right for me?" below for more information.
AP Command:raidutil -L logical
Example:C:\> raidutil -L logical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
2.
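The status check in this condition can also be made against a saved printout. The following is a minimal sketch in Python, intended for a support workstation and not for the AP itself. The function name and the assumed row layout ("<address> ... <capacity>MB <status>") are illustrative and are inferred only from the example printout above.

```python
import re

# Statuses that make this solution applicable, as listed in the condition above.
BAD_STATUSES = {"failed", "impacted", "dead"}

# Assumed row layout, inferred from the example printout only:
# <address> <type and model text> <capacity>MB <status>
ROW = re.compile(r"^(d\d+b\d+t\d+d\d+)\s+(.*?)\s+(\d+)MB\s+(.+?)\s*$")


def raids_needing_recreate(printout):
    """Return addresses of RAID rows whose status is Failed, Impacted or Dead."""
    hits = []
    for line in printout.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, separator or prompt line
        address, description, _capacity, status = match.groups()
        if description.startswith("RAID") and status.lower() in BAD_STATUSES:
            hits.append(address)
    return hits


sample = """\
Address Type Manufacturer/Model Capacity Status
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
"""
print(raids_needing_recreate(sample))
```

An empty list means that no RAID has one of the three statuses, in which case the note "Is this solution right for me?" should be consulted before continuing.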
The RAID NVRAM is not "MOT V1.1".
RAID NVRAM version "MOT V1.1" has introduced problems which may cause this Primus solution to fail. Primus solution
SCS684731 should be used to upgrade or downgrade the RAID firmware on both nodes.
AP Command:raidutil -L version
Example printout:# Controller Cache FW NVRAM BIOS SMOR Serial
---------------------------------------------------------------------------
d0 DPT PM3757U2 0MB FT0A MOT V1.1 10-10035
3.
PROCEDURE:
When a RAID is failed and/or both disks of the RAID are failed the OPI "AP, System Data Disk Restore" should normally be followed to fix the
problem. The OPI fixes the problem by zapping the drives, destroying all data on the data disks. This Primus solution fixes the problem by
deleting and re-creating the RAID definitions without data loss. This Primus solution is meant to be used as an alternative to the OPI.
This Primus solution should therefore be used in similar circumstances. If this Primus solution does not fix the problem then the OPI "AP,
System Data Disk Restore" should be considered.
The procedure takes about 30 minutes and during this time there will be no MML contact, charging will be buffered and STS data will be lost.
Collect information for further analysis.
Log the information below from both nodes and send the result to the owner of this solution.
AP Command:hostname
prcstate
date/t
time/t
raidutil -L all
frlbbdiag -v
raidutil -K
raidutil -e soft d0
raidutil -e recov d0
raidutil -e nonrecov d0
raidutil -e status d0
aehevls -l app -c dptelog
mktr <YYMMDD>-<HHMM> -c
1.
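The mktr argument in this step follows the pattern <YYMMDD>-<HHMM>. The sketch below, in Python on a support workstation, shows one way to produce that label and the full command list for a checklist or log header. The function names are illustrative; the commands themselves are copied from the step above.

```python
from datetime import datetime


def mktr_label(now):
    """Format a timestamp as <YYMMDD>-<HHMM>, the pattern shown for mktr above."""
    return now.strftime("%y%m%d-%H%M")


def collection_commands(now):
    """Return the log-collection commands from this step, in the listed order."""
    return [
        "hostname",
        "prcstate",
        "date/t",
        "time/t",
        "raidutil -L all",
        "frlbbdiag -v",
        "raidutil -K",
        "raidutil -e soft d0",
        "raidutil -e recov d0",
        "raidutil -e nonrecov d0",
        "raidutil -e status d0",
        "aehevls -l app -c dptelog",
        "mktr {} -c".format(mktr_label(now)),
    ]


for command in collection_commands(datetime(2012, 3, 30, 17, 16)):
    print(command)
```

The list has to be worked through once on each node, since the logs from both nodes are needed in the next step.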
Determine the source disk for the RAID re-create.
When the RAID is deleted and re-created a disk must be chosen as the source of the data for the RAID.
In this solution the node that will be used as the source of the data will be referred to as the good node and the other node will be
referred to as the faulty node.
This is the most important step of the procedure and it is recommended that second line support performs this step. The "raidutil -e
status d0" logs from both nodes should be used to determine the sequence of events. The node where the disks failed last should
normally be used as the source node. The frlbbdiag command must also be used to verify that the source node is free from fault.
Command:frlbbdiag -v
raidutil -e status d0
2.
Connect to the faulty node.
This is the node that will not be used as the source of the data for the RAID.
AP Command:hostname
3.
Shut down the node.
AP Command:prcboot -s
4.
Connect to the good node.
Use the node IP address and not the cluster IP address.
This is the node that will be used as the source of the data for the RAID.
AP Command:hostname
5.
Disable the "Cluster Server" and Ericsson services startup.
Do not disable the "Cluster Disk" device as this will prevent the RAID from being deleted.
Windows 2003 Command:sc config Clussvc start= Disabled
sc config ACS_PRC_ClusterControl start= Disabled
sc config ACS_FCH_Server start= Disabled
sc config ACS_FCR_Server start= Disabled
Windows NT Command:echo REGEDIT4 > C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_PRC_ClusterControl] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_FCH_Server] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
type C:\TEMP\Cluster_Disabled.reg
regedit /s C:\TEMP\Cluster_Disabled.reg
del C:\TEMP\Cluster_Disabled.reg
6.
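The Windows NT commands above build a REGEDIT4 file that sets the "Start" value of each service. In the standard Windows service start types, 2 is automatic, 3 is manual and 4 is disabled. The sketch below, in Python on a support workstation, generates the same kind of text so that it can be reviewed before it is typed on the AP. The function name and the dictionary layout are illustrative only.

```python
# Standard Windows service start types used by the .reg files in this procedure.
START_TYPES = {"auto": 2, "manual": 3, "disabled": 4}

SERVICES_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"


def build_reg_text(start_by_service):
    """Build REGEDIT4 text setting the "Start" value for each given service."""
    lines = ["REGEDIT4"]
    for service, start in start_by_service.items():
        lines.append("")  # blank line before each key, like "echo." above
        lines.append("[{}\\{}]".format(SERVICES_KEY, service))
        lines.append('"Start"=dword:{:08x}'.format(START_TYPES[start]))
    return "\n".join(lines) + "\n"


disabled = build_reg_text({
    "ClusSvc": "disabled",
    "ACS_PRC_ClusterControl": "disabled",
    "ACS_FCH_Server": "disabled",
})
print(disabled)
```

The same helper with "auto" and "manual" values gives text comparable to the Cluster_Enabled.reg file that is built later in the procedure.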
Set BIOS "Cluster Support" to Disabled (Off).
AP Command:raidutil +cluster off
7.
Reboot the node.
Do not use prcboot. The normal "prcboot" command normalises the "Cluster Server" and Ericsson services startup.
There may be no response from the terminal until the AP finishes rebooting after the shutdown command is entered. This will take
about 6 minutes.
Windows 2003 Command:shutdown /f /r /t 0
Windows NT Command:shutdown /f /r %COMPUTERNAME%
8.
Check that SCSI disks are correct and available.
If the 6 SCSI disks, 3 per node, cannot be seen or the targets are incorrect then it will not be possible to re-create the RAID.
AP Command:raidutil -L physical
Example:C:\> raidutil -L physical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b0t2d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Failed drive
d0b1t0d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Failed drive
9.
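The check in this step can be repeated on a saved "raidutil -L physical" printout. The following Python sketch, for a support workstation, compares the disk addresses found with the six addresses shown in the example above (targets 0 to 2 on bus 0 and bus 1). The row layout and the function name are assumptions taken from the example printout only.

```python
import re

# Assumed row layout, inferred from the example printout only.
ROW = re.compile(r"^(d\d+b\d+t\d+d\d+)\s+Disk Drive.*?\s(\d+)MB\s+(.+?)\s*$")

# Three targets on each of the two SCSI buses, as in the example above.
EXPECTED = {"d0b{}t{}d0".format(bus, target) for bus in (0, 1) for target in (0, 1, 2)}


def check_physical(printout):
    """Return (missing, unexpected) disk addresses compared with the six expected."""
    seen = set()
    for line in printout.splitlines():
        match = ROW.match(line.strip())
        if match:
            seen.add(match.group(1))
    return sorted(EXPECTED - seen), sorted(seen - EXPECTED)


good = """\
d0b0t0d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b0t2d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Failed drive
d0b1t0d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Failed drive
"""
# Same printout with the last disk reported on the wrong target.
bad = good.replace("d0b1t2d0", "d0b1t3d0")

print(check_physical(good))
print(check_physical(bad))
```

Two empty lists mean that all six disks are visible on the expected targets. Any other result means that the RAID cannot be re-created yet.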
Check the size of the RAID.
Make a note of the size of the RAID that will be deleted and re-created.
If the capacities of the disks are different then the size of the RAID has to be set when it is re-created.
AP Command:raidutil -L raid
Example where the RAID size is 17432:C:\> raidutil -L raid
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17432MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17432MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17432MB Failed
d0b0t2d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Failed drive
d0b1t2d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Failed drive
10.
Delete the RAID.
Only delete the RAIDs that are Failed, Impacted or Dead.
If it is not possible to delete the RAID then follow the note "Additional steps to delete the RAID" below and then continue with the
next step.
AP Command:raidutil -D d0b0t<#>d0
Examples:
Delete RAID d0b0t0d0:C:\> raidutil -D d0b0t0d0
d0b0t0d0
Delete RAID d0b0t1d0:C:\> raidutil -D d0b0t1d0
d0b0t1d0
Delete RAID d0b0t2d0:C:\> raidutil -D d0b0t2d0
d0b0t2d0
11.
Check that the RAID has been deleted.
If the RAID has not been deleted then follow the note "Additional steps to delete the RAID" below and then continue with the next
step.
AP Command:raidutil -L logical
Expected Printout:Failure:Can't find component by address
Expected Printout:Failure:Can't find component by address
12.
Set the disk cache to write back.
AP Command:raidutil -w on d0b0t<#>d0
raidutil -w on d0b1t<#>d0
Examples:
RAID d0b0t0d0 deleted:C:\> raidutil -w on d0b0t0d0
C:\> raidutil -w on d0b1t0d0
RAID d0b0t1d0 deleted:C:\> raidutil -w on d0b0t1d0
C:\> raidutil -w on d0b1t1d0
RAID d0b0t2d0 deleted:C:\> raidutil -w on d0b0t2d0
C:\> raidutil -w on d0b1t2d0
13.
Re-create the RAID.
The first disk specified after the "-g" parameter is used as the source of the data when re-creating the RAID.
The "-s" parameter is only required if the size of the RAID has to be set as described above. If the "-s" parameter is not specified then
the size of the RAID is set to the capacity of the first disk specified after the "-g" parameter.
Note: If it is not possible to re-create the RAID then follow the note "Disconnect SCSI cables" and then continue this procedure from
the next step (that is, from step 15, without recreating the RAIDs). It is important to disconnect the SCSI cables or it is possible a disk
on the shutdown node will still be accessed. This will leave the RAID deleted and allow the AP to run as a single node. The faulty node
should be left shutdown as it will be unable to be active. The RAID must be re-created when the faulty node is replaced using the note
"RAID re-create during node change" below.
AP Command:raidutil -l 1 -g d0b0t<#>d0,d0b1t<#>d0 [-i -s <size>]
Examples:
Re-create RAID d0b0t0d0:C:\> raidutil -l 1 -g d0b0t0d0,d0b1t0d0
Created: RAID 1
Re-create RAID d0b0t1d0:C:\> raidutil -l 1 -g d0b0t1d0,d0b1t1d0
Created: RAID 1
Re-create RAID d0b0t2d0:C:\> raidutil -l 1 -g d0b0t2d0,d0b1t2d0
Created: RAID 1
Re-create RAID d0b0t0d0 with size 17432MB:
C:\> raidutil -l 1 -g d0b0t0d0,d0b1t0d0 -i -s 17432
Created: RAID 1
Re-create RAID d0b0t1d0 with size 17432MB:
C:\> raidutil -l 1 -g d0b0t1d0,d0b1t1d0 -i -s 17432
Created: RAID 1
Re-create RAID d0b0t2d0 with size 17432MB:
C:\> raidutil -l 1 -g d0b0t2d0,d0b1t2d0 -i -s 17432
Created: RAID 1
14.
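The rule for the "-s" parameter in this step can be written down as a small helper. The Python sketch below, for a support workstation, only builds the command text so that it can be checked before it is entered on the AP. It assumes, as in the examples above, that the bus 0 disk is listed first after "-g" and is therefore the data source. The function name and parameters are illustrative.

```python
def recreate_command(target, raid_size_mb, source_capacity_mb):
    """Build the raidutil re-create command for the RAID on the given target.

    The bus 0 disk is listed first after -g, as in the examples above, so it is
    the data source. "-i -s <size>" is added only when the RAID size noted
    earlier differs from the capacity of that first disk.
    """
    disks = "d0b0t{0}d0,d0b1t{0}d0".format(target)
    command = "raidutil -l 1 -g {}".format(disks)
    if raid_size_mb != source_capacity_mb:
        command += " -i -s {}".format(raid_size_mb)
    return command


# RAID size equals the capacity of the first disk: no size parameter needed.
print(recreate_command(2, 17522, 17522))
# RAID size 17432MB on a 17522MB first disk: the size has to be set.
print(recreate_command(2, 17432, 17522))
```

The second call corresponds to the mixed-capacity example in the step "Check the size of the RAID", where the RAID size is 17432MB and the first disk is 17522MB.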
Stop the RAID rebuild.
This is a precaution in case the wrong node has been chosen as the source.
AP Command:raidutil -a stop d0
15.
Set the RAID cache to write through.
AP Command:raidutil -w off d0b0t0d0
raidutil -w off d0b0t1d0
raidutil -w off d0b0t2d0
16.
Check that the RAID has been re-created.
If the RAID has not been re-created then contact the next level of support.
AP Command:raidutil -L logical
Example:C:\> raidutil -L logical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
17.
Set BIOS "Cluster Support" to Enabled (On).
AP Command:raidutil +cluster on
18.
Normalise the "Cluster Server" and Ericsson services startup.
Note: In APZ 11.3 and later the ACS_PRC_ClusterControl service startup type should be set to automatic. This will be done in a later
step.
Windows 2003 Command:sc config ClusSvc start= Auto
sc config ACS_FCH_Server start= Auto
sc config ACS_FCR_Server start= Auto
Windows NT Command:echo REGEDIT4 > C:\TEMP\Cluster_Enabled.reg
echo. >> C:\TEMP\Cluster_Enabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc] >> C:\TEMP\Cluster_Enabled.reg
echo "Start"=dword:00000002 >> C:\TEMP\Cluster_Enabled.reg
echo. >> C:\TEMP\Cluster_Enabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_PRC_ClusterControl] >> C:\TEMP\Cluster_Enabled.reg
echo "Start"=dword:00000003 >> C:\TEMP\Cluster_Enabled.reg
echo. >> C:\TEMP\Cluster_Enabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_FCH_Server] >> C:\TEMP\Cluster_Enabled.reg
echo "Start"=dword:00000002 >> C:\TEMP\Cluster_Enabled.reg
type C:\TEMP\Cluster_Enabled.reg
regedit /s C:\TEMP\Cluster_Enabled.reg
del C:\TEMP\Cluster_Enabled.reg
19.
Reboot the node.
The prcboot command is not used with Windows Server 2003 due to problems with the node not rebooting.
Windows 2003 Command:shutdown /f /r /t 0
If the printout below is received then repeat the command until successful. The command will not be successful until the "Preparing
network connections" dialog disappears.
The computer is processing another action and thus cannot be shut down. Wait until the computer has finished its
action, and then try again.(21)
Windows NT Command:prcboot
20.
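The "repeat the command until successful" instruction in this step is a simple retry loop. The Python sketch below illustrates the logic only. run_shutdown is a hypothetical callable that issues "shutdown /f /r /t 0" and returns the text printed by the command; the number of attempts and the delay are assumed values, not values from this solution.

```python
import time

# Message quoted in the step above; the reboot has not been accepted while it appears.
BUSY_TEXT = "The computer is processing another action"


def reboot_until_accepted(run_shutdown, attempts=20, delay_seconds=15, sleep=time.sleep):
    """Call run_shutdown() until its output no longer contains the busy message.

    Returns the number of attempts used, or raises RuntimeError when every
    attempt was answered with the busy message.
    """
    for attempt in range(1, attempts + 1):
        output = run_shutdown()
        if BUSY_TEXT not in output:
            return attempt
        sleep(delay_seconds)
    raise RuntimeError("shutdown still refused after {} attempts".format(attempts))


# Simulated run: the command is refused twice and accepted on the third try.
replies = iter([BUSY_TEXT + " and thus cannot be shut down.", BUSY_TEXT, ""])
print(reboot_until_accepted(lambda: next(replies), sleep=lambda seconds: None))
```

In practice the command is simply re-entered by hand until the "Preparing network connections" dialog has disappeared.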
Check the status of the RAIDs.
If the RAID status has returned to the status failed then replace the faulty node and repeat the procedure. If a spare node is not
immediately available then follow the note "Disconnect SCSI cables" below and repeat this procedure. This will leave the RAID deleted
and allow the AP to run as a single node. The faulty node should be left shutdown until a replacement is available. If this is done the
RAID must be re-created when the faulty node is replaced using the note "RAID re-create during node change" below.
AP Command:raidutil -L logical
21.
Wait for all resources to come online.
The resources owned by the faulty, shutdown node will not come online.
If the faulty node is going to be replaced then the procedure is complete.
22.
Reboot the faulty, shutdown node.
This step should not be performed if the faulty node should be left shutdown or if the RAID was not re-created.
AP Command:fcc_reset other
23.
Wait for all resources to come online.
24.
Normalise the ACS_PRC_ClusterControl resource.
This step is not required with Windows NT as the prcboot command above sets the startup type.
Windows Server 2003 Command:sc config ACS_PRC_ClusterControl start= Auto
25.
Make sure the RAID rebuild is set to fast.
AP Command:raidutil -r fast d0
Example:C:\> raidutil -r fast d0
Address Type Rate
---------------------------------------------------------------------------
d0b0t7d0 HBA 9.0s (fast)
d0b0t2d0 RAID 1 (Mirrored) 9.0s (fast)
d0b0t0d0 RAID 1 (Mirrored) 9.0s (fast)
d0b0t1d0 RAID 1 (Mirrored) 9.0s (fast)
26.
Check the RAID disks for faults.
If there are any faults then follow the OPI "AP FAULT" and do not attempt a rebuild - do not perform the remaining steps in this
procedure.
Command:frlbbdiag
27.
Rebuild the re-created RAIDs.
AP Command:raidutil -a rebuild d0b0t<#>d0
Examples:
Rebuild RAID d0b0t0d0:C:\> raidutil -a rebuild d0b0t0d0
d0b0t0d0
Rebuild RAID d0b0t1d0:C:\> raidutil -a rebuild d0b0t1d0
d0b0t1d0
Rebuild RAID d0b0t2d0:C:\> raidutil -a rebuild d0b0t2d0
d0b0t2d0
28.
Perform a health check of the AP.
Follow Primus solution SCS123402.
29.
Query and change the SCSI BUS RESELECTION settings with FrChangeDisk.
Follow Primus solution SCS841510.
30.
Implement APG40C/2 RAID improvements as per the SOLUTION fix below.
31.
SOLUTION:
CONDITIONS:
As in the REMEDY above.
1.
PROCEDURE:
Implement recommendations from PDU task force and GCC/PDU APG40 KCS Triggered Product Improvement.
Implement the SCSI BUS RESELECTION time-out parameter change.
This change is introduced with CN-I APZ 212 30/4-1126.
This CN-I is included in the following packages:
- BSC PLM: APG40 One Track IP-A203.
- MSC PLM: APG40 One Track EP-A111.
- APZ PLM: APG40 One Track AGM018.
The FrChangeDisk tool introduced with CN-I APZ 212 30/4-1126 has been updated to fix faults in the following CN-Is.
CN-I APZ 212 30/4-1233.
This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM019.
CN-I APZ 212 30/4-1487.
This CN-I is included in the following packages:
- APZ PLM: APG40 One Track UAM009.
1.
Implement the FrLbbDiag tool and ContLogCollector service.
This change is introduced with CN-I APZ 212 30/4-1140.
This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM018.
The FrLbbDiag tool introduced with CN-I APZ 212 30/4-1140 has been updated in the following CN-Is.
CN-I APZ 212 30/4-1375.
This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM020.
2.
Implement NVRAM Force V2.1.
This change is introduced with CN-I APZ 212 30/4-1217.
This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM019.
3.
Use the updated AP FAULT OPI when rebuilding the RAID when the AP FAULT alarm is raised.
This change is introduced with CN-I APZ 212 30/4-1373.
This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM020.
4.
Ericsson internal only
SOLUTION:
CONDITIONS:
As in the REMEDY above.
1.
Following the procedure in the above remedy either did not fix the problem, or there was a subsequent occurrence of the same fault.
2.
PROCEDURE:
Using the information from the log files gathered in the REMEDY above, determine which node is most likely to be faulty.
1.
Change the node. See the Operational Instruction "APG40, Node, Change".
If unsure about which node should be changed, please contact the next level of support for assistance. It is possible that the actual
fault is not in the indicated node, but in the other node, or in one of the SCSI cables connecting the two nodes. These may also need to
be changed if changing the indicated node still does not fix the problem.
2.
SOLUTION:
CONDITIONS:
As in the first REMEDY above.
1.
PROCEDURE:
It is normal for a RAID to be failed when both hard disks have failed.
It is the opinion of design that the RAID should be failed when both disk drives belonging to the RAID have failed. Preventing this issue
from occurring requires choosing a disk drive to be used as the source of the data for the RAID. It is the opinion of design that it is too
dangerous to allow the system to do this and it is better to follow the OPI "AP, System Data Disk Restore" and erase the data disks. It is
therefore important to replace any node with a failed disk drive as soon as possible.
Several other TRs have been raised on this issue; TR HE82881 is a good example of design's opinion of the problem.
TR HG51610 has been raised to address this issue.
1.
Ericsson internal only
Note
Is this solution right for me?
If the "raidutil -L raid" printout displays 3 "RAID 1" entries each with 2 "Disk Drive" entries, with correct targets, and the status of the "RAID
1" entries is Optimal or Degraded then this solution is NOT applicable.
Example 1: The printout shows a degraded RAID and a failed disk drive. This solution is NOT applicable.
The OPI "AP FAULT" should be used instead.
C:\> raidutil -L raid
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Failed
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
Example 2: The printout shows a degraded RAID and a missing disk drive. This solution is NOT applicable.
Primus solution SCS388828 should be used instead. It may be necessary to reboot the node after power cycling the faulty node for the SCSI
disks to be scanned.
C:\> raidutil -L raid
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 0MB Missing
Example 3: The printout shows a degraded RAID and a missing disk drive with an invalid target. This solution may be applicable.
If "raidutil -L physical" correctly shows all 6 disks then Primus solution SCS388828 should be used.
If the problem persists the faulty node should be shut down and the source node rebooted.
If the problem persists this solution should be followed to delete and re-create the corrupted RAID information.
C:\> raidutil -L raid
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t3d0 Disk Drive (DASD) FUJITSU MAN3184MP 0MB Missing
Example 4: The printout shows a failed RAID.
This solution is applicable. The failed RAIDs need to be deleted and re-created.
C:\> raidutil -L raid
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
Special note when the OPI "APG40, Node, Change" was followed without zapping the RAIDs on the replaced node.
Example 5: The printout shows a failed RAID.
This solution is applicable. The failed RAIDs need to be deleted and re-created.
In this case the hard disks on the non-replaced node should be used as the source of the data. Therefore the procedure should be performed on
that node.
C:\> raidutil -L raid
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
d0b1t4d0 Disk Drive (DASD) DPT --UNKNOWN-- 0MB Missing
d0b0t0d0 Disk Drive (DASD) FUJITSU MAP3367NP 17522MB Failed drive
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
d0b1t5d0 Disk Drive (DASD) DPT --UNKNOWN-- 0MB Missing
d0b0t1d0 Disk Drive (DASD) FUJITSU MAP3367NP 17522MB Failed drive
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
d0b1t3d0 Disk Drive (DASD) DPT --UNKNOWN-- 0MB Missing
d0b0t2d0 Disk Drive (DASD) FUJITSU MAP3367NP 17522MB Failed drive
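The decision described in this note can be summarised as a small classifier for a saved "raidutil -L raid" printout. The Python sketch below is for a support workstation and is a reading aid only; the final decision stays with support. It assumes that a "correct target" means that each disk has the same target number as its RAID, which is inferred from Example 3, and that the rows have the layout shown in the examples. All names are illustrative.

```python
import re

# Assumed row layout, inferred from the example printouts only.
ROW = re.compile(r"^(d\d+b\d+t(\d+)d\d+)\s+(RAID|Disk Drive)\b.*?\s(\d+)MB\s+(.+?)\s*$")


def parse_raids(printout):
    """Group each RAID row with the Disk Drive rows that follow it."""
    raids = []
    for line in printout.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, separator or prompt line
        _address, target, kind, _capacity, status = match.groups()
        if kind == "RAID":
            raids.append({"target": target, "status": status.lower(), "disks": []})
        elif raids:
            raids[-1]["disks"].append({"target": target, "status": status.lower()})
    return raids


def assess(printout):
    """Return 'applicable', 'maybe' or 'not applicable' following the examples."""
    raids = parse_raids(printout)
    if any(raid["status"] in ("failed", "impacted", "dead") for raid in raids):
        return "applicable"
    if len(raids) != 3:
        return "maybe"
    for raid in raids:
        wrong_target = any(disk["target"] != raid["target"] for disk in raid["disks"])
        if len(raid["disks"]) != 2 or wrong_target:
            return "maybe"
    return "not applicable"


HEALTHY = """\
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t1d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t1d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
"""
example_2 = HEALTHY + """\
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Degraded
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 0MB Missing
"""
example_3 = example_2.replace("d0b1t2d0", "d0b1t3d0")
example_4 = HEALTHY + """\
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
"""
for name, text in (("Example 2", example_2), ("Example 3", example_3), ("Example 4", example_4)):
    print(name, assess(text))
```

A "maybe" result corresponds to Example 3, where Primus solution SCS388828 and a reboot of the source node are tried first.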
RAID re-create during node change
Follow the OPI "APG40, Node, Change" until the SCSI cables are reconnected and the node is powered on.
OPI "APG40, Node, Change, APG40C/2":
7/154 31-CRZ 222 02 revision X: Stop after step 136.
7/154 31-CRZ 222 02 revision Z: Stop after step 146.
7/154 31-CRZ 222 04 revision A: Stop after step 136.
7/154 31-CRZ 222 04 revision B: Stop after step 145.
7/154 31-CRZ 222 05 revision E: Stop after step 132.
7/154 31-CRZ 222 05 revision K: Stop after step 132.
7/154 31-CRZ 222 05 revision M: Stop after step 164.
7/154 31-CRZ 222 05 revision S: Stop after step 153.
7/154 31-CRZ 222 05 revision T: Stop after step 143.
7/154 31-CRZ 222 05 revision U: Stop after step 178.
OPI "APG40, Node, Change, C/2, Win 2003 Spare":
12/154 31-CRZ 222 05 revision C: Stop after step 141
1.
Repeat the procedure above to re-create the deleted RAID.
Note: As the RAID has already been deleted the step "Delete the RAID" in the procedure should be skipped.
2.
Continue with the OPI "APG40, Node, Change" from the next step.
3.
Additional steps to delete the RAID.
This note contains additional steps for step "Delete the RAID" in the procedure above.
Disconnect the SCSI cables.
Remove the upper (top) SCSI cable from the good node.
Remove the lower (bottom) SCSI cable from the good node.
Remove the upper (top) SCSI cable from the faulty node.
Remove the lower (bottom) SCSI cable from the faulty node.
1.
Delete the RAID.
If it is not possible to delete the RAID then contact the next level of support.
AP Command:raidutil -D d0b0t<#>d0
Examples:
Delete RAID d0b0t0d0:C:\> raidutil -D d0b0t0d0
d0b0t0d0
Delete RAID d0b0t1d0:C:\> raidutil -D d0b0t1d0
d0b0t1d0
Delete RAID d0b0t2d0:C:\> raidutil -D d0b0t2d0
d0b0t2d0
2.
Check that the RAID has been deleted.
If the RAID has not been deleted then contact the next level of support.
AP Command:raidutil -L logical
3.
Reconnect the SCSI cables.
Connect the upper (top) SCSI cable to the faulty node.
Connect the lower (bottom) SCSI cable to the faulty node.
Connect the upper (top) SCSI cable to the good node.
Connect the lower (bottom) SCSI cable to the good node.
4.
Check that the six SCSI disks are correct and available.
If the 3 SCSI disks on bus 1 are not visible then follow the step "Reboot the node" in the procedure.
AP Command:raidutil -L physical
Example:C:\> raidutil -L physical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
d0b0t1d0 Disk Drive (DASD) FUJITSU MAH3182MP 17432MB Optimal
d0b0t2d0 Disk Drive (DASD) FUJITSU MAT3073NP 17522MB Optimal
5.
Continue with the procedure above.
Continue with the step "Re-create the RAID" in the procedure.
6.
Disconnect SCSI cables
Disconnect the SCSI cables.
Remove the upper (top) SCSI cable from the good node.
Remove the lower (bottom) SCSI cable from the good node.
Remove the upper (top) SCSI cable from the faulty node.
Remove the lower (bottom) SCSI cable from the faulty node.
1.