EMC / CLARiiON Troubleshooting Guide, 2nd Edition
EMC Global Services - Problem Resolution & Escalation Management - CLARiiON



EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

Description

This is the 2nd edition of the CLARiiON Troubleshooting Manual, first introduced in January 2004. The original manual had an accompanying training course that still has relevant material. This document introduces new and updated information on topics related to the CLARiiON disk storage product. Please note that not all information will be accessible or available to all readers of this document.

Authors

Wayne Brain - Consulting Engineer [email protected]
David Davis - Technical Support Engineer [email protected]
Joseph Primm - Consulting Engineer [email protected]
Roy Carter - Corporate Systems Engineer [email protected]
Other various engineering sources - our thanks for everyone's input in putting this document together.

Intended Audience

• EMC and Vendor Technical Support Professionals
• CLARiiON trained CS Specialists [i.e., RxS or LRS]
• Other field personnel with management approval

Objectives

Build a solid understanding of specific topics related to CLARiiON.

Prerequisites

Good knowledge of Fibre Channel and an understanding of basic CLARiiON operations and functionality. The following are recommended prior to use of this manual:
• CLARiiON Core Curriculum (e-Learning)
• CLARiiON Core Curriculum (workshop)
• Field experience providing knowledge of the theory of operation of the CLARiiON CX Series hardware, and implementation of a CLARiiON using Navisphere 6.x

Content

The course will cover the topic areas noted below.
Section 1 Layered Applications
Section 2 NDU Basic Operations and Troubleshooting
Section 3 Backend Architecture
Section 4 Troubleshooting & Tools
Section 5 General FLARE

Revision History

Date     Approved By   Rev  Description
01/15/04 Joseph Primm  A02  CL_Troubleshooting_1stEdition (original document)
02/06/07 Joseph Primm  B00  CL_Troubleshooting_2ndEdition (initial draft)
02/07/07 Joseph Primm  B01  Formatting and statement corrections
02/28/07 Joseph Primm  B02  Corrections, added bookmarks, major changes to section 5
08/30/07 Joseph Primm  B03  Added CX3 Port numbering, page 218

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 1


Section 1 Layered Applications – Page 5
  General Terms – Page 5
    SnapView Snapshot Terms – Page 5
    SnapView Clone Terms – Page 6
    MirrorView/S and MirrorView/A Terms – Page 6
    SAN Copy Terms – Page 7
  SnapView Snapshots – Page 7
    Source LUN – Page 7
    Snapshot LUN – Page 8
    Reserved LU – Page 9
    Step-by-step snapshots overview - all platforms – Page 10
  SnapView Clones – Page 18
    Source LUN – Page 18
    Clone LUN (Fractured) – Page 20
    Clone LUN (Unfractured) – Page 21
    CPL – Page 22
    Step-by-step clone overview - all platforms – Page 23
    Reverse synchronization - all platforms – Page 27
  MirrorView/S – Page 28
    Primary LUN – Page 28
    Secondary LUN – Page 30
    WIL – Page 31
    How MirrorView/S handles failures – Page 33
      Access to the SP fails – Page 33
      Primary Image Fails – Page 33
      Promoting a secondary image to a primary image – Page 34
      Running MirrorView/S on a VMware ESX Server – Page 35
      Recovering by promoting a secondary image – Page 35
      Restoring the original mirror configuration after recovery of a failed primary image – Page 36
      Recovering without promoting a secondary image – Page 37
      Failure of the secondary image – Page 37
      Promoting a secondary image when there is no failure – Page 38
    Summary of MirrorView/S failures – Page 39
    Recovering from serious errors – Page 40
    How consistency groups handle failures – Page 40
      Access to the SP fails – Page 40
      Primary storage system fails – Page 40
      Recovering by promoting a secondary consistency group – Page 41
        Normal promotion – Page 41
        Force promote – Page 41
        Local only promote – Page 41
      Recovery policy after promoting – Page 42
  MirrorView/A – Page 43
    Primary LUN – Page 43
    Secondary LUN – Page 45
    Reserved LU (Primary) – Page 46
    Reserved LU (Secondary) – Page 47
    How MirrorView/A handles failures – Page 49
      Access to the primary SP fails – Page 49
      Primary image fails – Page 49
      Promoting a secondary image to a primary image – Page 50
      Running MirrorView/A on a VMware ESX Server – Page 51
      Recovering by promoting a secondary image – Page 52
      Restoring the original mirror configuration after recovery of a failed primary image – Page 52
      Recovering without promoting a secondary image – Page 53
      Failure of the secondary image – Page 54
      Promoting a secondary image when there is no failure – Page 54
    Summary of MirrorView/A failures – Page 55
    Recovering from serious errors – Page 56
    How consistency groups handle failures – Page 56
      Access to the SP fails – Page 56
      Primary storage system fails – Page 56
      Recovering by promoting a secondary consistency group – Page 57
        Normal promotion – Page 57
        Force promote – Page 57
        Local only promote – Page 57
      Failure of the secondary consistency group – Page 58
  SAN Copy – Page 59
    Destination LUN (Full Copy) – Page 59
    Source LUN (Incremental Copy) – Page 60



    Destination LUN (Incremental Copy) – Page 61
    Reserved LU – Page 61
  SAN Copy (ISC) – Page 63
    Creating an Incremental SAN Copy Session – Page 63
    Marking/Unmarking the Incremental SAN Copy Session – Page 64
    Starting the Incremental SAN Copy Session – Page 64
    Viewing/Modifying an Incremental SAN Copy Session – Page 64
    Destroying an Incremental SAN Copy Session – Page 65
    Error Cases – Page 65
      Incremental SAN Copy Session Failure – Page 65
      Incremental SAN Copy Session Destination Failure – Page 65
      Out of SnapCache for Incremental SAN Copy Session – Page 65
      SnapCache failure – Page 66
    Restrictions – Page 66
    Issues – Page 66
  Case Studies – Page 70
    MirrorView/A - Target array upgraded from CX500 to CX700, MV/A has stopped working – Page 70
    MirrorView/S - SPB MirrorView initiator is missing after switch cable change – Page 71
    SAN Copy - SAN Copy failure – Page 71
    SAN Copy - Host I/O failed with MV and host I/O running with 200ms and SPA rebooted – Page 74
    SAN Copy - LUN 23 corrupted – Page 74
    SnapView - Snap session failure during a trespass – Page 76
    SnapView - Unable to delete LUNs that were part of a mirror – Page 77
    SnapView - SP Bugcheck 0x000000d1, 0x00000002, 0x00000000, 0x00000000 – Page 78
    SnapView - Bugcheck 0xe111805f (0x81ff6c48, 0x00000000, 0x00000000, 0x000003cd) – Page 79

Section 2 NDU Basic Operations and Troubleshooting – Page 82
  General Theory – Page 82
    NDU Process – Page 82
  Sample Cases – Page 85
    Dependency Check Failed – Page 85
    PSM Access Failed – Page 85
    Cache Disable Failed – Page 86
    Check Script Failed – Page 86
    Setup Script Failed – Page 87
    Quiesce Failed – Page 87
    Deactivate Hang – Page 87
    Panic During Activate – Page 88
    Reboot Failed – Page 88
    Registry Flush Failed – Page 88
    Commit Failed – Page 88
    Post Conversion Bundle Inconsistency in Release 14 – Page 88
    R12/R13 to R16/R17 stack size problem – Page 88
    Initial Cleanup Failed – Page 88
    iSCSIPortx IP Configuration Restoration and Device Discovery – Page 90
    QLogic r4/r3 issue – Page 90
    One or both SPs in reboot cycle – Page 90
  Tips and Tricks – Page 92
    SPCollects – Page 92
    Event Logs – Page 92
    Ktrace – Page 92
    NDU Output Files – Page 92
    Force degraded mode – Page 92

Section 3 Backend Architecture – Page 92
  General Theory – Page 92
    CLARiiON Backend Arbitrated Loop – Page 93
    Backend data flow – Page 94
    How does this relate to the backend of a CLARiiON Storage System? – Page 94
  Data flow through each enclosure type – Page 95
    FC-series data flow – Page 95
    CX data flow – Page 95
    CX-series data flow with DAE2 – Page 96
  ATA (Advanced Technology Attachment) Disk Enclosures – Page 97
    ATA Disk Ownership – Page 98
  Ultrapoint (Stiletto) Disk Array Enclosure – DAE2P/DAE3P – Page 101
    Fibre Channel Data Path – Page 102
    How to troubleshoot an Ultrapoint backend bus using the 'counters' – Page 102



    Descriptions of the registers returned in the lccgetstats output – Page 104
    How to interpret the output of the Ultrapoint counters – Page 105
  Other options for backend isolation – Page 109
    SP Event Logs – Page 109
    RLS Monitor Logs – Page 110

Section 4 Troubleshooting & Tools – Page 112
  CAP – Page 112
  DRU – Page 133
  TRiiAGE – Page 137
  FLARE Centric Log Error Reporting Information – Page 147
    SP State – Page 156
    Advanced LUstat – Page 156
    Ktcons lustat – Page 157
    Ktcons Vpstat – Page 159
    FCOScan – Page 159
    Displaying Coherency Error Count – Page 160
    RAID Group Error Summary Information – Page 160
    DISKS SENSE DATA from SP*_System.evt files – Page 161
    FBI Error Information – Page 162
    YUKON Log Analysis – Page 163
  SPCollect Information – Page 164
  SPQ – Page 164

Section 5 General Troubleshooting and Information – Page 166
  Private Space Reference – Page 166
  SP Will Not Boot – Page 167
    First Steps To Try – Page 168
    CX Boot Failure Modes / Establishing PPP Connection to SP – Page 169
    LAN Service Port CX3 / EMCRemote Password R24 / SP Fault LED Blink Rates – Page 170
    Summary of Boot Process – Page 171
    CX200/400/600 Powerup – Page 172
    CX300/500/700 Powerup – Page 174
    CX3-20/CX3-40/CX3-80 Powerup – Page 177
  Data Sector Protection – Page 180
    How do these bytes work? – Page 181
    What can cause uncorrectable sectors? – Page 182
    Power Loss Scenario – Page 183
    Proactive Data Integrity – Page 184
    Dual Active Storage Processors – Page 186
    Stripe Access Management – Page 187
    How do we check the integrity of the (4) 2-byte sectors? – Page 188
    How to approach & resolve uncorrectable sector issues – Page 192
      CLARiiON stand-alone storage environment – Page 192
      New tool – BRT – Page 195
      CELERRA storage environment – Page 196
      CDL storage environment – Page 197
  General Array and Host Attach Related Information – Page 198
    Binding / Assignment / Initial Assignment / Auto-Assignment – Page 198
    Failover Feature (relative to Auto-Assign and not trespassing) / Trespass / Auto-Trespass – Page 200
    Storage Groups / Setting Up Storage Groups (SGs) / Special (predefined) Storage Groups – Page 201
    Default Storage Group / Defining Initiators / Heterogeneous Hosts – Page 202
    Initiatortype – Page 203
    Arraycommpath / Failovermode – Page 205
    Logical Unit Serial Number Reporting – Page 207
    emc99467 - Parameter settings – Page 208
  APPENDIX – Page 209
    Flare Revision Decoder – Page 209
    CX/CX3 Bus numbering charts – Page 210
    CX3-Series Array Port Numbering – Page 218



Section 1 Layered Applications

General Terms

Source LUN – This LUN is often considered the production LUN. Replicas are taken from source LUNs.

SP – Most CLARiiON storage systems have two Storage Processors (SPs) for high availability. Each LUN is owned by one SP, and all I/O to a LUN (including replication I/O) is performed by its owning SP.

Trespass – Ownership of a LUN can be changed to the peer SP via a trespass event. Trespass events are initiated by server path failover software or through a Navisphere administrative command.

PSM – The first five drives of every CLARiiON hold a storage system database maintained by the persistent storage manager (PSM) component. This database contains, among other things, the configuration information for the replication features. The PSM database is maintained on a triple mirror; this document therefore does not cover failure cases where the PSM is totally inaccessible, since the storage system cannot function at all in that catastrophic case. Transient I/O failures from the PSM are possible; however, due to their infrequency and the complexity of the error handling, they are outside the scope of this document.

LCC/BCC – There are two link controller cards for each DAE (disk array enclosure). LCCs are used for FC enclosures and BCCs for ATA enclosures. LCCs and BCCs can fail because the cards or their connecting cables are pulled, or because of hardware or software failures on the cards.

Disk Failure – Failure of two disks in a RAID 1, 1/0, 3, or 5 RAID group, or one disk in a RAID 0 RAID group, will make any LUs on that RAID group inaccessible. A RAID group can fail due to manually pulled disks or physical disk failures.

Cache Dirty – A LU can be marked cache dirty if modified data held in both SPs' memory could not be flushed out to the physical drives on which the LU lives. Cache dirty LUs are inaccessible until a procedure is invoked to clear the cache dirty state.

Bad block – Every CLARiiON storage system maintains block-level checksums on disk. When a block is read, the checksum is recalculated and compared with the saved checksum. If the checksums do not match, a read failure occurs. Overwriting the block repairs the bad block.

SnapView Snapshot Terms

Snap Session – A point-in-time virtual representation of a source LUN. A source LUN can have up to 8 snap sessions associated with it. Snap sessions incur copy-on-first-write processing in order to preserve the source LUN's data as it was when the snap session was started. When a snap session is stopped, that point in time is lost and the resources associated with the session are freed back to the system for use by new sessions as needed.

Snap Source LUN – A source LUN that has one or more snap sessions started on it.

Snapshot LUN – One of up to 8 virtual LUNs associated with a snap source LUN on which a snap session can be activated. The snapshot LUN appears to contain the point-in-time data of a snap session the instant the session is activated upon it via Navisphere or the admsnap command.

Reserved LU – A private LU used to store copy-on-first-write data and the associated map pointers, preserving up to 8 points in time for up to 8 snap sessions on a snap source LUN. A reserved LU is assigned to a source LUN the first time a session is started on that LUN; more reserved LUs are associated with the snap source LUN as the storage system needs them. In addition to maintaining point-in-time data, reserved LUs also maintain tracking and transfer information for incremental SAN Copy and MirrorView/A.
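The bad-block mechanism described above (recompute the checksum on every read, compare with the stored value, and repair by overwriting) can be sketched as a toy model. Everything here is an illustrative assumption: the BlockDevice class, the CRC32 checksum, and the 512-byte block size are not the array's actual on-disk format.

```python
import zlib

BLOCK_SIZE = 512  # illustrative sector size; not the array's real on-disk layout


class BlockDevice:
    """Toy model of per-block checksum protection (hypothetical simplification)."""

    def __init__(self, nblocks):
        self.data = [bytes(BLOCK_SIZE)] * nblocks
        self.sums = [zlib.crc32(bytes(BLOCK_SIZE))] * nblocks

    def write(self, lba, payload):
        # A write always stores a freshly computed checksum alongside the data,
        # which is why overwriting a bad block repairs it.
        self.data[lba] = payload
        self.sums[lba] = zlib.crc32(payload)

    def read(self, lba):
        # On read, recompute the checksum and compare with the saved one;
        # a mismatch surfaces as a read failure.
        if zlib.crc32(self.data[lba]) != self.sums[lba]:
            raise IOError(f"bad block at LBA {lba}: checksum mismatch")
        return self.data[lba]
```

In this model, corrupting `data[5]` behind the device's back makes the next read of LBA 5 fail, and a subsequent write to LBA 5 makes reads succeed again, mirroring the "overwriting the block repairs it" behavior.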



SnapView Clone Terms

Clone Group – A construct for associating clones with a source LUN.

Clone Source LUN – A source LUN that has a clone group associated with it. It can have zero or more SnapView clones associated with it.

Clone LUN – One of up to 8 LUNs associated with a clone source LUN. Each clone LUN is exactly the size of its clone source LUN. Clone LUNs are added to and removed from a clone group.

Clone image condition – The condition of a clone LUN provides information about the status of updates for the clone.

Clone image state – Clone image states reflect the contents of the clone's data with respect to the clone source LUN.

Fractured and Unfractured Clone LUN – A clone is either fractured or unfractured, and is either available or unavailable for I/O. An unfractured clone LUN is never available for I/O. A fractured clone LUN is available for I/O only if the clone was not in the synchronizing or reverse-synchronizing state when the administrative fracture occurred. A clone can be fractured via Navisphere administrative command or under certain failure scenarios.

Protected and Unprotected Clone reverse sync – A clone can optionally be protected or unprotected during a reverse synchronization. If protected, the clone remains fractured to allow for subsequent reverse synchronizations. If unprotected, the clone chosen for a reverse synchronization remains mirrored with the clone source LUN; when the reverse synchronization completes, the unprotected clone is consistent with the clone source LUN.

CPL – The clone private LU (CPL) contains bitmaps that describe changed regions, providing incremental synchronizations for clones. There is one CPL for each SP in the storage system.

MirrorView/S and MirrorView/A Terms

Primary LUN – A source LUN whose data contents are replicated on a remote storage system for the purpose of disaster recovery. Each primary LUN can have one or more secondary LUNs associated with it (MirrorView/S supports two secondary LUNs per primary; MirrorView/A supports one).

Secondary LUN – A LUN that contains a data mirror (replica) of the primary LUN. This LUN must reside on a different CLARiiON storage system than the primary LUN.

WIL – The Write Intent Log (WIL) contains bitmaps that describe changed regions, providing incremental synchronizations for MirrorView/S. There is one WIL for each SP in the storage system.

Secondary image condition – The condition of a secondary LUN provides additional information about the status of mirror updates to the secondary.

Secondary image state – Secondary image states reflect the contents of the secondary LUN's data with respect to the primary LUN.

Consistency group – A set of mirrors managed as a single entity whose secondary images remain in a write-order-consistent and recoverable state (except when synchronizing) with respect to their primary images and each other.

Fracture – A condition in which I/O is not mirrored to the secondary image (mirroring also stops when the secondary image condition is "waiting on administrative action"). A fracture can be caused by administrative command or under certain failure scenarios (administratively fractured), or when the system determines that the secondary image is unreachable (system fractured).

Auto recovery – A property of a mirror that causes the storage system to automatically start a synchronization operation as soon as a system-fractured secondary image is determined to be reachable.
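The fracture and recovery terms above can be summarized as a small state model. This is a hypothetical simplification: the ImageState names and the on_secondary_reachable helper are invented for illustration and do not mirror the array's internal state machine; the key point is only that auto recovery starts a synchronization when a system-fractured secondary becomes reachable, while an administrative fracture waits for an administrator either way.

```python
from enum import Enum


class ImageState(Enum):
    """Illustrative secondary-image states (simplified, not FLARE's real set)."""
    IN_SYNC = "in-sync"
    SYNCHRONIZING = "synchronizing"
    SYSTEM_FRACTURED = "system-fractured"
    ADMIN_FRACTURED = "admin-fractured"


def on_secondary_reachable(state, auto_recovery):
    """What happens when a fractured secondary becomes reachable again.

    Only a system fracture with auto recovery enabled triggers an automatic
    synchronization; everything else waits for an administrator.
    """
    if state is ImageState.SYSTEM_FRACTURED and auto_recovery:
        return ImageState.SYNCHRONIZING  # array starts the resync on its own
    return state  # manual recovery (or admin fracture): wait for admin request
```

For example, a system-fractured image with auto recovery enabled transitions to synchronizing, while the same image with manual recovery, or an administratively fractured image, stays put.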



Manual recovery – A property of a mirror that causes the storage system to wait for a synchronization request from an administrator when a system-fractured secondary image is determined to be reachable (the opposite of auto recovery).

Promote – The operation by which the administrator changes an image's or group's role from secondary to primary. As part of this operation, the previous primary image becomes a secondary image.

SAN Copy Terms

SAN Copy Session – A SAN Copy session describes the copy operation. The session contains information about the source LUN and all destination LUNs (SAN Copy can copy a source LUN to multiple destination LUNs in one session). A session can be a full copy or an incremental copy. Incremental SAN Copy sessions can be in the marked or unmarked state; marked sessions protect the point in time of the data for copying as of when the mark command was issued in Navisphere. Incremental SAN Copy sessions require reserved LUs.

SAN Copy Storage System – The storage system where the SAN Copy session resides; the SAN Copy processing occurs on this system. The SAN Copy storage system can contain a source LUN and/or one or more destination LUNs for any given SAN Copy session.

Target Storage System – The storage system where the SAN Copy session does not reside; the SAN Copy processing does not occur on this system. It can contain a source LUN or one or more destination LUNs for any given SAN Copy session.

Destination LUN – The recipient LUN of a data transfer; source LUNs are copied to destination LUNs. Every destination LUN must be the same size as or larger than the source LUN. SAN Copy can copy a source LUN to multiple destination LUNs.

SnapView Snapshots

Three user-visible storage system objects are used by the SnapView snapshot capability: snap source LUN(s), snapshot LUN(s), and reserved LU(s). There is a table for each object, with a number of events that pertain to each object. The Result entry describes the outcome of the event occurring while the action is in progress. For the purposes of this document, only persistent snap session behavior is described (non-persistent sessions terminate in all events described).

Source LUN

Action: Server write to a snap source LUN. The storage system needs to perform a copy on first write to preserve the point-in-time data of an existing snap session on the snap source LUN. This entails a read from the snap source LUN and writes to reserved LU(s) before the server write to the source LUN can proceed.
Event: The storage-system-generated read from the snap source LUN fails due to a bad block, LCC/BCC failure, cache dirty LUN, etc.
Result: The server write request succeeds. All snap sessions for which a copy on first write was required to maintain the point-in-time data for that write will stop. If the last session associated with the snap source LUN is stopped, the associated reserved LUs are freed back to the pool.
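The copy-on-first-write (COFW) processing these entries describe can be sketched roughly as follows. This is a hypothetical simplification: the SnapSession class, the dict-backed LUNs, and the chunk granularity are illustrative assumptions, not the actual SnapView driver implementation.

```python
class SnapSession:
    """Sketch of copy-on-first-write for one snap session (toy model)."""

    def __init__(self, source):
        self.source = source   # chunk -> current data on the source LUN
        self.reserved = {}     # chunk -> point-in-time copy saved in the reserved LU

    def server_write(self, chunk, data):
        # First write to a chunk since the session started: save the original
        # data to the reserved LU before the server write is allowed to proceed.
        if chunk not in self.reserved:
            self.reserved[chunk] = self.source.get(chunk)
        self.source[chunk] = data

    def snapshot_read(self, chunk):
        # The snapshot view is composed of the reserved-LU copy for chunks that
        # have changed, and the live source LUN for chunks that have not.
        if chunk in self.reserved:
            return self.reserved[chunk]
        return self.source.get(chunk)
```

Note that only the first write to a chunk pays the COFW cost: later writes to the same chunk go straight to the source, because the point-in-time copy is already preserved.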



Action: Server write to a snap source LUN. The storage system needs to perform a copy on first write to preserve the point-in-time data of an existing snap session on the snap source LUN. This entails a read from the snap source LUN and writes to reserved LU(s) before the server write to the source LUN can proceed.
Event: After the copy-on-first-write processing is completed, the write to the snap source LUN fails due to an LCC/BCC failure or some storage system software problem that occurs after the copy-on-first-write processing (if required) and while processing the write to the snap source LUN.
Result: The server write request fails, which may trigger server-based path failover software to trespass the snap source LUN (see the description of the trespass action below). All snap sessions associated with the snap source LUN are maintained.

Action: Server read from a snap source LUN.
Event: The read from the snap source LUN fails due to a bad block.
Result: The server read request fails. All snap sessions associated with the snap source LUN are maintained.

Action: SP that owns the snap source LUN is shut down.
Event: Active I/O to the snap source LUN. The SP can be shut down due to a Navisphere command to reboot (including NDU), an SP panic due to a software or hardware malfunction, or the SP being physically pulled.
Result: All snap sessions remain intact. If the snap source LUN is trespassed, I/O can resume on the peer SP.

Action: Snap source LUN is trespassed.
Event: Active I/O to the snap source LUN. The trespass can be triggered by an NDU, a Navisphere trespass command, or failover software explicit or auto trespass when a path from the server to the snap source LUN is determined to be bad.
Result: All snap sessions remain intact. I/O can resume on the peer SP.

Snapshot LUN

Action: Server I/O to a snapshot LUN. Snapshot LUNs are virtual LUNs; a single snap session may be activated on a snapshot LUN at any point in time. The point-in-time data represented by an activated snap session on a snapshot LUN is made up of data from the snap source LUN and the reserved LU(s) associated with the snap source LUN.
Event: The read from the snap source LUN fails due to a bad block, LCC/BCC failure, cache dirty LUN, etc., or a write to a reserved LU fails (all writes to snapshot LUNs are always written to the associated reserved LUs).
Result: The server I/O to the snapshot LUN fails. All snap sessions that require reading from the snap source LUN in order to maintain the point-in-time data will stop. If the last session associated with the snap source LUN is stopped, the associated reserved LUs are freed back to the pool.

Action: SP that owns the snapshot LUN is shut down (the snapshot LUN's SP owner is always the same as the snap source LUN's owner).
Event: Active I/O to the snapshot LUN. The SP can be shut down due to a Navisphere command to reboot (including NDU), an SP panic due to a software or hardware malfunction, or the SP being physically pulled while active.
Result: All snap sessions remain intact. If the snap source LUN is trespassed (which trespasses all associated snapshot LUNs), I/O can resume on the peer SP.

Action: Snapshot LUN is trespassed.
Event: Active I/O to the snap source LUN and snapshot LUN.
Result: All snap sessions remain intact. I/O can resume on the peer SP.



Event: Trespass of the snapshot LUN can happen due to failover software explicit or auto trespass when a path from the server to the snapshot LUN is determined to be bad. Snapshot LUNs cannot be explicitly trespassed via Navisphere.
Result: The snap source LUN associated with the snapshot LUN (and all other snapshot LUNs associated with that snap source LUN) will be trespassed. It is possible for a trespass storm (or trespass ping-pong) to occur if a path to the snapshot LUN is bad on one SP while the path to the associated snap source LUN, or to another snapshot LUN of the same snap source LUN, is bad on the peer SP. Server path failover software on one or more servers may trespass the LUN, only to have another server's path failover software trespass it back, causing LUN ownership to bounce back and forth with severely degraded performance as a result.

Reserved LU

Action: Server write request to a snap source LUN. The storage system may need to perform a copy on first write to preserve the point-in-time data of an existing snap session on the snap source LUN. This entails I/Os to reserved LU(s) before the server write to a snap source LUN or snapshot LUN can proceed.
Event: An I/O to a reserved LU fails due to an LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from the reserved LU due to a bad block, and also running out of reserved LU space (no space left in any assigned reserved LUs and no free reserved LUs remaining in the SP pool).
Result: The server write request succeeds. All snap sessions associated with the snap source LUN are stopped. Allocated reserved LUs are freed back to the reserved LU pool.

Action: Server I/O to a snapshot LUN. Processing a server read or write to a snapshot LUN entails I/Os to the associated reserved LU(s). A read from a snapshot LUN never causes a write to any associated reserved LU, but does cause one or more reads.
Event: A read or write to a reserved LU fails due to an LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from a reserved LU due to a bad block, and also running out of reserved LU space (no space left in any assigned reserved LUs and no free reserved LUs remaining in the SP pool).
Result: The server I/O request fails. All snap sessions that require I/O to the failed reserved LU (including the out-of-space case) will stop. If the last session associated with the snap source LUN is stopped, the associated reserved LUs are freed back to the pool.



Action: Rollback operation has started. The rollback process entails reads from the associated reserved LU(s) and writes to the snap source LUN. A server I/O causes a region of the snap source to be rolled back on demand in order to complete the server request.
Event: An I/O to a reserved LU fails due to an LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from a reserved LU due to a bad block. Server I/O may be occurring while the rollback is processing.
Result: Any server I/O request to the snap source LUN proceeds. If the server request was a read that required the data to be returned from the reserved LU (not the source LUN) and the region to be read failed due to a bad block, the server read request fails. The rollback process continues. Blocks that were bad in any associated reserved LU(s) will have the corresponding blocks marked bad on the snap source LUN (even though the disk region on the snap source LUN is good) to ensure the integrity of the rolled-back data.

Action: SP that owns the snap source LUN or snapshot LUN is shut down (all of a snap source LUN's reserved LUs are owned by the same SP as the snap source LUN).
Event: Active I/O to a snap source LUN or a snapshot LUN that generates I/Os to the reserved LUs associated with the source LUN. The SP can be shut down due to a Navisphere command to reboot, an SP panic due to a software or hardware malfunction, or the SP being physically pulled.
Result: All snap sessions remain intact. If the snap source LUN and all associated snapshot LUNs are trespassed, any active server I/Os to the snap source LUN or associated snapshot LUN(s), and any associated I/Os to the reserved LUs, can resume on the peer SP. Any rollback operations that were in progress are automatically continued on the peer SP.

Action: Snap source LUN or snapshot LUN is trespassed (which causes the associated reserved LUs to trespass).
Event: Active I/O to a snap source LUN or a snapshot LUN that generates I/Os to the reserved LUs associated with the source LUN. The trespass of the snap source LUN or any associated snapshot LUN can happen due to an NDU, a Navisphere trespass command, or failover software explicit or auto trespass when a path from the server to the snap source LUN is determined to be bad.
Result: All snap sessions remain intact. Active server I/O and associated reserved LU I/O can resume on the peer SP. The snap source LUN associated with the snapshot LUN (and all other snapshot LUNs also associated with the snap source LUN) will be trespassed along with any associated reserved LUs. Any rollback operations that were in progress are automatically continued on the peer SP.
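The rollback behavior described above, a background copy from the reserved LU(s) back to the snap source with regions restored on demand when server I/O touches them first, can be sketched as a toy model. The Rollback class, dict-backed LUNs, and region granularity are illustrative assumptions, not the actual driver design.

```python
class Rollback:
    """Sketch of a rollback with on-demand region restore (hypothetical model)."""

    def __init__(self, source, session_data):
        self.source = source                 # region -> current data on the source LUN
        self.pending = dict(session_data)    # regions not yet rolled back

    def restore_region(self, region):
        # Copy the point-in-time data for one region back to the source LUN.
        if region in self.pending:
            self.source[region] = self.pending.pop(region)

    def server_read(self, region):
        # A server read forces the touched region to be rolled back first,
        # so the server always sees the rolled-back point-in-time data.
        self.restore_region(region)
        return self.source.get(region)

    def background_pass(self):
        # The background rollback process continues until every region is restored.
        for region in list(self.pending):
            self.restore_region(region)
```

In this model a server read during the rollback is served from already-rolled-back data (restoring its region on demand), and the background pass finishes whatever regions the server never touched.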

Step-by-step snapshots overview - all platforms This contains examples, from setting up snapshots (with Navisphere CLI) to using them (with admsnap and Navisphere CLI). Some examples show the main steps outlined in the examples; other examples are specific to a particular platform. In the following procedures, you will use the SnapView snapshot CLI commands in addition to the admsnap snapshot commands to set up (from the production server) and use snapshots (from the secondary server). 1. Choose the LUNs for which you want a snapshot. The size of these LUNs will help you determine an approximate reserved LUN pool size. The LUN(s) in the reserved LUN pool store the original data when that data is first modified on the source LUN(s). To manually estimate a suitable LUN pool size, refer to Managing Storage Systems > Configuring and Monitoring the Reserved LUN Pool in the Table of Contents for the Navisphere Manager online help and select the Estimating the Reserved LUN Pool Size topic or the chapter on the reserved LUN pool in the latest revision of the EMC Navisphere Manager Administrator's Guide. 2. Configure the reserved LUN pool. You must configure the reserved LUN pool before you start a SnapView session. Use Navisphere Manager to configure the reserved LUN pool (refer to the online help topic Managing Storage Systems >

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 10

Page 12: 58348378 CL Troubleshooting 2ndEdition B03

EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

Configuring and Monitoring the Reserved LUN Pool) or the chapter on the reserved LUN pool in the latest revision of the EMC Navisphere Manager Administrator's Guide.

3. Stop I/O and make sure all data cached on the production server is flushed to the source LUN(s) before issuing the admsnap start command.
• For a Windows server, you can use the admsnap flush command to flush the data.
• For Solaris, HP-UX, AIX, and Linux servers, unmount the file system by issuing the umount command. If you are unable to unmount the file system, you can issue the admsnap flush command.
• For an IRIX server, unmount the file system by issuing the umount command. If you cannot unmount the file system, you can use the sync command to flush cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage.
• For a Novell NetWare server, use the dismount command on the volume to dismount the file system.
Neither the flush command nor the sync command is a substitute for unmounting the file system; both commands only complement unmounting the file system.

4. On the production server, log in as admin or root and issue an admsnap start command for the desired data object (drive letter, device name, or file system) and session name. The admsnap start command starts the session. You must start a session for each snapshot of a specific LUN(s) that you want to access simultaneously. You start a session from the production server based on the source LUN(s); you mount the snapshot on a different server (the secondary server). You can also mount additional snapshots on other servers.

You can start up to eight sessions per source LUN. This limit includes any reserved sessions that are used for another application such as SAN Copy and MirrorView/Asynchronous. However, only one SnapView session can be active on a secondary server at a time. If you want to access more than one snapshot simultaneously on a secondary server (for example, 2:00 p.m. and 3:00 p.m. snapshots of the same LUN(s), to use for rolling backups), you can create multiple snapshots, activate each one on a different SnapView session, and add the snapshots to different storage groups. Or you can activate and deactivate snapshots on a single server.

For an IRIX fabric connection only, the device name includes the worldwide port name and has the form /dev/rdsk/ZZZ/lunVsW/cXpYYY, where:
ZZZ - worldwide node name
V - LUN number
W - slice/partition number
X - controller number
YYY - port number

The SnapView driver will use this moment as the beginning of the session and will make a snapshot of this data available. Sample start commands follow.

IBM AIX Server (UNIX)
admsnap start -s session1 -o /dev/hdisk21 (for a device name)
admsnap start -s session1 -o /database (for a file system)

HP-UX Server (UNIX)
admsnap start -s session1 -o /dev/rdsk/c0t0d0 (for a device name)
admsnap start -s session1 -o /database (for a file system)

Veritas volume examples:
Example of a Veritas volume name: scratch
Example of a fully qualified pathname to a Veritas volume:
admsnap start -s session1 -o /dev/vx/dsk/scratchdg/scratch
Example of a fully qualified pathname to a raw Veritas device name:
admsnap start -s session1 -o /dev/vx/rdmp/c1t0d0


IRIX Server (UNIX)
admsnap start -s session1 -o /dev/rdsk/dks1d0l9 (for a device name)
admsnap start -s session1 -o /database (for a file system)

Linux Server (UNIX)
admsnap start -s session1 -o /dev/sdc (for a device name)
admsnap start -s session1 -o /database (for a file system)

Veritas volume examples:
Example of a Veritas volume name: scratch
Example of a fully qualified pathname to a Veritas volume:
admsnap start -s session1 -o /dev/vx/dsk/scratchdg/scratch
Example of a fully qualified pathname to a raw Veritas device name:
admsnap start -s session1 -o /dev/vx/rdmp/sdc6

NetWare Server
load sys:\emc\admsnap\admsnap start -s session1 -o V596-A2-D0:2 (for a device name; V596 is the vendor number)

Sun Solaris Server (UNIX)
admsnap start -s session1 -o /dev/rdsk/c0t0d0s7 (for a device name)
admsnap start -s session1 -o /database (for a file system)

Veritas volume examples:
Example of a Solaris Veritas volume name: scratch
Example of a fully qualified pathname to a Veritas volume:
admsnap start -s session1 -o /dev/vx/dsk/scratchdg/scratch
Example of a fully qualified pathname to a raw Veritas device name:
admsnap start -s session1 -o /dev/vx/rdmp/c1t0d0s2

Windows Server
admsnap start -s session1 -o \\.\PhysicalDrive1 (for a physical drive name)
admsnap start -s session1 -o H: (for a drive letter)

5. Using Navisphere CLI, create a snapshot of the source LUN(s) on the storage system that holds the source LUN(s). You must create a snapshot for each session you want to access simultaneously. Use the naviseccli or navicli snapview command with -createsnapshot to create each snapshot:
naviseccli -h hostname snapview -createsnapshot

6. If you do not have a VMware ESX Server, use the storagegroup command to assign each snapshot to a storage group on the secondary server. If you have a VMware ESX Server, skip to step 7 to activate the snapshot.

7. On the secondary server, use an admsnap activate command to make the new session available for use. A sample admsnap activate command is:
admsnap activate -s session1

• On a Windows server, the admsnap activate command finishes rescanning the system and assigns drive letters to newly discovered snapshot devices. You can use this drive immediately.

• On an AIX server, you need to import the snap volume (LUN) by issuing the chdev and importvg commands as follows:
o chdev -l hdiskn -a pv=yes (this command is needed only once for any LUN)
o importvg -y volume-group-name hdiskn
where n is the number of the hdisk that contains a LUN in the volume group and volume-group-name is the volume group name.

• On a UNIX server, after a delay, the admsnap activate command returns the snapshot device name. You need to run fsck on this device only if it contains a file system and you did not unmount the source LUN(s). Then, if the source LUN(s) contains a file system, mount the file system on the secondary server using the snapshot device name to make the file system available for use. If you failed to flush the file system buffers before starting the session, the snapshot may not be usable. Depending on your operating system platform, you may need to perform an additional step before admsnap activate to rescan the I/O bus; for more information, see the product release notes.

• For UNIX, run fsck on the device name returned by the admsnap command, but when you mount that device with the mount command, use the device name beginning with /dev/dsk instead of the /dev/rdsk name returned by admsnap.

• On a NetWare server, issue a list devices or Scan All LUNs command from the server console. After a delay, the system returns the snapshot device name. You can then mount the volume associated with this device name to make a file system available for use. You may need to perform an additional step to rescan the I/O bus; for more information, see the product release notes.
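As a consolidated sketch of the UNIX flow above (flush and start on the production server, steps 3-4; activate and mount on the secondary server, step 7): the session name, device names, mount point, and the run() wrapper below are illustrative assumptions, not part of admsnap. The wrapper echoes each command instead of executing it unless ADMSNAP_LIVE=1 is set, since admsnap exists only on hosts where it is installed.

```shell
#!/bin/sh
# Hypothetical sketch of the snapshot setup/use flow. SESSION, SRC_DEV,
# SNAP_RAW, MOUNTPOINT, and run() are assumptions for illustration.
SESSION="${SESSION:-session1}"
SRC_DEV="${SRC_DEV:-/dev/rdsk/c0t0d0}"       # source LUN on the production server
SNAP_RAW="${SNAP_RAW:-/dev/rdsk/c5t3d2s1}"   # name admsnap activate would print
MOUNTPOINT="${MOUNTPOINT:-/mnt/snap}"

run() {
    # Echo the command unless explicitly told to execute it live.
    if [ "$ADMSNAP_LIVE" = "1" ]; then "$@"; else echo "$@"; fi
}

# --- production server: flush cached data, then start the session ---
run admsnap flush -o "$SRC_DEV"
run admsnap start -s "$SESSION" -o "$SRC_DEV"

# --- secondary server: activate the session, then mount the snapshot ---
run admsnap activate -s "$SESSION"
# Mount using the /dev/dsk (block) name, not the /dev/rdsk name admsnap returns.
SNAP_BLK=$(echo "$SNAP_RAW" | sed 's;/rdsk/;/dsk/;')
run mount "$SNAP_BLK" "$MOUNTPOINT"
```

Running the sketch as-is previews the four commands in order; setting ADMSNAP_LIVE=1 on a host with admsnap installed would execute them.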


8. If you have a VMware ESX Server, do the following:
a. Use the storagegroup command to add the snapshot to a storage group connected to the ESX Server that will access the snapshot.
b. Rescan the bus at the ESX Server level.
c. If a Virtual Machine (VM) is already running, power off the VM and use the Service Console of the ESX Server to assign the snapshot to the VM. If a VM is not running, create a VM on the ESX Server and assign the snapshot to the VM.
d. Power on the VM and scan the bus at the VM level. For VMs running Windows, use the admsnap activate command to rescan the bus.

9. On the secondary server, you can access data on the snapshot(s) for backup, data analysis, modeling, or other use.

10. On the secondary server, when you finish with the snapshot data, release each active snapshot from the operating system:
• On a Windows server, release each snapshot device you activated using the admsnap deactivate command.
• On an AIX server, export the snap volume (LUN) by issuing the varyoffvg and exportvg commands as follows:
o varyoffvg volume-group-name
o exportvg volume-group-name
Then release each snapshot device you activated using the admsnap deactivate command.
• On a UNIX server, unmount any file systems that were mounted from the snapshot device by issuing the umount command. Then release each snapshot device you activated using the admsnap deactivate command.
• On a NetWare server, use the dismount command on the volume to dismount the file system.
A deactivate command is required for each active snapshot. If you do not deactivate a snapshot, the secondary server cannot activate another session using the pertinent source LUN(s). When you issue the admsnap deactivate command, any writes made to the snapshot are destroyed.
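The release sequence in step 10 can be sketched for a UNIX secondary server as follows. The mount point, session names, and run() wrapper are illustrative assumptions; commands are echoed unless ADMSNAP_LIVE=1 is set.

```shell
#!/bin/sh
# Hypothetical cleanup sketch for a UNIX secondary server (step 10):
# unmount anything mounted from the snapshot device, then deactivate
# each activated session. Writes made to the snapshot are destroyed.
MOUNTPOINT="${MOUNTPOINT:-/mnt/snap}"

run() {
    if [ "$ADMSNAP_LIVE" = "1" ]; then "$@"; else echo "$@"; fi
}

run umount "$MOUNTPOINT"
# One deactivate per active snapshot (assumed session names).
for s in session1 session2; do
    run admsnap deactivate -s "$s"
done
```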

11. On the production server, stop the session using the admsnap stop command. This frees the reserved LUN and SP memory used by the session, making them available for use by other sessions. Sample admsnap stop commands are identical to the start commands shown in step 4; substitute stop for start.

12. If you will not need the snapshot of the source LUN(s) again soon, use the CLI snapview -rmsnapshot command to remove it. If you remove the snapshot, then for a future snapshot you must execute all previous steps. If you do not remove the snapshot, then for a future snapshot you can skip steps 5 and 3.

HP-UX - admsnap snapshot script example

This example shows how to use admsnap with scripts for copying and accessing data on an HP-UX secondary server.

1. From the production server, create the following script:
Script 1

a. Quiesce I/O on the source server.
b. Unmount the file system by issuing the umount command. If you are unable to unmount the file system, issue the admsnap flush command. The flush command flushes all cached data; it is not a substitute for unmounting the file system and only complements the unmount operation.
c. Start the session by issuing the following command:
• /usr/admsnap/admsnap start -s snapsession_name -o device_name or filesystem_name
d. Invoke Script 2 on the secondary server using the remsh command.
e. Stop the session by issuing the following command:
• /usr/admsnap/admsnap stop -s snapsession_name -o device_name or filesystem_name

2. From the secondary server, create the following script:
Script 2
a. Perform any necessary application tasks in preparation for the snap activation (for example, shut down the database).
b. Activate the snapshot by issuing the following command:
• /usr/admsnap/admsnap activate -s snapsession_name
c. Create a new volume group directory and group device node, using the following form:
• mkdir /dev/volumegroup_name
• mknod /dev/volumegroup_name/group c 64 0xX0000
d. Issue the vgimport command, using the following form:
• vgimport volumegroup_name /dev/dsk/cNtNdN
e. Activate the volume group for this LUN by issuing the following command:
• vgchange -a y volumegroup_name


f. Run fsck on the volume group, as follows:
• fsck -F filesystem_type /dev/volumegroup_name/logicalvolume_name
This step is not necessary if the secondary server has a different HP-UX OS revision than the production server.
g. Mount the file system using the following command:
• mount /dev/volumegroup_name/logicalvolume_name /filesystem_name
h. Perform the desired tasks with the mounted data (i.e., copy the contents of the mounted file system to another location on the secondary server).
i. Unmount the file system mounted in step g using the following command:
• umount /dev/volumegroup_name/logicalvolume_name
j. Deactivate and export the volume group for this LUN by issuing the following commands:
• vgchange -a n volumegroup_name
• vgexport volumegroup_name
k. Unmount the file system by issuing the umount command. If you are unable to unmount the file system, issue the admsnap flush command. The flush command flushes all cached data. If this is not done, the next admsnap session may yield stale data.
l. Deactivate the snapshot by using the following command:
• /usr/admsnap/admsnap deactivate -s snapsession_name
m. Perform any necessary application tasks in preparation for using the data captured in step 6 (i.e., start up the database).
n. Exit this script and return to Script 1.
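Script 1 above can be condensed into a single sketch. The secondary host name, file system name, remote script path, and run() wrapper are assumptions for illustration (commands are echoed unless ADMSNAP_LIVE=1 is set), and a real script should add the error checking the steps imply.

```shell
#!/bin/sh
# Hypothetical condensed Script 1 (HP-UX production server). SESSION,
# FSNAME, SECONDARY, and the /scripts/script2.sh path are assumed names.
SESSION="${SESSION:-snapsession1}"
FSNAME="${FSNAME:-/database}"
SECONDARY="${SECONDARY:-backuphost}"

run() {
    if [ "$ADMSNAP_LIVE" = "1" ]; then "$@"; else echo "$@"; fi
}

run umount "$FSNAME"                                        # step b: flush by unmounting
run /usr/admsnap/admsnap start -s "$SESSION" -o "$FSNAME"   # step c: start the session
run remsh "$SECONDARY" /scripts/script2.sh "$SESSION"       # step d: invoke Script 2
run /usr/admsnap/admsnap stop -s "$SESSION" -o "$FSNAME"    # step e: stop the session
```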

UNIX - admsnap single session example

The following commands start, activate, and stop a SnapView session. This example shows UNIX device names.

On the production server, make sure all cached data is flushed to the source LUN by unmounting the file system:
umount /dev/dsk/c1t2d0s4

If you are unable to unmount the file system on a Solaris, HP-UX, AIX, or Linux server, issue the admsnap flush command:
admsnap flush -o /dev/rdsk/c1t2d0s4

On an IRIX server, the admsnap flush command is not supported; use the sync command to flush all cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage. A typical example would be:
sync /dev/dsk/c1t2d0s4

Neither the flush command nor the sync command is a substitute for unmounting the file system; both commands only complement unmounting the file system.

1. Start the session:
admsnap start -s friday -o /dev/rdsk/c1t2d0s4
Attempting to start session friday on device /dev/rdsk/c1t2d0s4
Attempting to start the session on the entire LUN.
Started session friday.

The start command starts a session named friday with the source named /dev/rdsk/c1t2d0s4.

2. On the secondary server, activate the session:
admsnap activate -s friday
Session friday activated on /dev/rdsk/c1t2d0s4.

On the secondary server, the activate command makes the snapshot image accessible.

3. On a UNIX secondary server, if the source LUN has a file system, mount the snapshot:
mount /dev/dsk/c5t3d2s1 /mnt

4. On the secondary server, the backup or other software accesses the snapshot as if it were a standard LUN.


5. When the desired operations are complete, unmount the snapshot from the secondary server. On UNIX, you can use admsnap deactivate to do this:
admsnap deactivate -s friday -o /dev/dsk/c5t3d2s1

6. From the production server, stop the session:
admsnap stop -s friday -o /dev/dsk/c1t2d0s4
Stopped session friday on object /dev/rdsk/c1t2d0s4.
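Tying the production-server side of this example together, a minimal sketch follows, using the session and device names from the example. The run() wrapper is an assumption that echoes commands unless ADMSNAP_LIVE=1 is set; admsnap itself is only present on hosts where it is installed.

```shell
#!/bin/sh
# Hypothetical wrapper around the single-session example (session friday).
SESSION=friday
SRC=/dev/rdsk/c1t2d0s4

run() {
    if [ "$ADMSNAP_LIVE" = "1" ]; then "$@"; else echo "$@"; fi
}

# production server: flush by unmounting (preferred), then start
run umount /dev/dsk/c1t2d0s4
run admsnap start -s "$SESSION" -o "$SRC"
# ... the secondary server activates, mounts, uses, and deactivates the
# snapshot here (steps 2-5 of the example) ...
# production server, after the secondary server deactivates:
run admsnap stop -s "$SESSION" -o "$SRC"
```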

The stop command terminates session friday, freeing the reserved LUN used by the session and making the snapshot inaccessible.

Windows - admsnap multiple session example

The following example shows three SnapView sessions, started and activated sequentially, using Windows device names. The example shows how each snapshot maintains the data at the time the snapshot was started; here, the data is a listing of files in a directory. The activity shown here is the only activity on this LUN during the sessions.

Procedural overview

1. Make sure the directory that holds admsnap is on your path.
2. Start sessions snap1, snap2, and snap3 on the production server in sequence and activate each session in turn on the secondary server. All sessions run on the same LUN.
3. When session snap1 starts, four files exist on the LUN. Before starting snap2, create four more files in the same directory. On the secondary server, deactivate snap1. Deactivation is needed because only one session can be active per server at one time.
4. On the production server start snap2, and on the secondary server activate snap2. After activating snap2, list the files, displaying the files created between session starts.
5. Create three more files on the source LUN and start session snap3. After deactivating snap2 and activating snap3, verify that you see the files created between the start of sessions snap2 and snap3. The filenames are self-explanatory.

Detailed procedures with output examples

Session Snap1
1. On the production server, list the files in the test directory.
F:\> cd test
F:\Test> dir
Directory of F:\Test

01/21/2002 09:21a 0 FilesBeforeSession1-a.txt
01/21/2002 09:21a 0 FilesBeforeSession1-b.txt
01/21/2002 09:21a 0 FilesBeforeSession1-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-d.txt

2. On the production server, flush data on the source LUN, and then start the first session, snap1.
F:\Test> admsnap flush -o f:
F:\Test> admsnap start -s snap1 -o f:
Attempting to start session snap1 on device \\.\PhysicalDrive1.
Attempting to start the session on the entire LUN.
Started session snap1.
F:\Test>

3. On the secondary server, activate the first session, snap1.
C:\> prompt $t $p
14:57:10.79 C:\> admsnap activate -s snap1
Scanning for new devices.
Activated session snap1 on device F:.

4. On the secondary server, list the files to show the production files that existed at session 1 start.
14:57:13.09 C:\> dir f:\test
Directory of F:\Test
01/21/02 09:21a 0 FilesBeforeSession1-a.txt
01/21/02 09:21a 0 FilesBeforeSession1-b.txt
01/21/02 09:21a 0 FilesBeforeSession1-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-d.txt


Session Snap2
1. On the production server, list the files in the test directory. The listing shows the files created before session 1 started plus the four additional files created since.
F:\Test> dir
Directory of F:\Test
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/2002 09:21a 0 FilesBeforeSession1-a.txt
01/21/2002 09:21a 0 FilesBeforeSession1-b.txt
01/21/2002 09:21a 0 FilesBeforeSession1-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-d.txt

2. On the production server, flush data and start the second session, snap2.
F:\Test> admsnap flush -o f:
F:\Test> admsnap start -s snap2 -o f:
Attempting to start session snap2 on device \\.\PhysicalDrive1.
Attempting to start the session on the entire LUN.
Started session snap2.
F:\Test>

3. On the secondary server, deactivate session snap1, and activate the second session, snap2.
15:10:10.52 C:\> admsnap deactivate -s snap1
Deactivated session snap1 on device F:.
15:10:23.89 C:\> admsnap activate -s snap2
Activated session snap2 on device F:.

4. On the secondary server, list the files to show the source LUN files that existed at session 2 start.
15:10:48.04 C:\> dir f:\test
Directory of F:\Test
01/21/02 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/02 09:21a 0 FilesBeforeSession1-a.txt
01/21/02 09:21a 0 FilesBeforeSession1-b.txt
01/21/02 09:21a 0 FilesBeforeSession1-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-d.txt

Session Snap3
1. On the production server, list the files in the test directory. The listing now also shows the files created between the start of sessions 2 and 3.
F:\Test> dir
Directory of F:\Test
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/2002 09:21a 0 FilesAfterS2BeforeS3-a.txt
01/21/2002 09:21a 0 FilesAfterS2BeforeS3-b.txt
01/21/2002 09:21a 0 FilesAfterS2BeforeS3-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-a.txt
01/21/2002 09:21a 0 FilesBeforeSession1-b.txt
01/21/2002 09:21a 0 FilesBeforeSession1-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-d.txt


2. On the production server, flush the buffers and start the third session, snap3.
F:\Test> admsnap flush -o f:
F:\Test> admsnap start -s snap3 -o f:
Attempting to start session snap3 on device \\.\PhysicalDrive1.
Attempting to start the session on the entire LUN.
Started session snap3.
F:\Test>

3. On the secondary server, flush the buffers, deactivate session snap2, and activate the third session, snap3.
15:28:06.96 C:\> admsnap flush -o f:
Flushed f:.
15:28:13.32 C:\> admsnap deactivate -s snap2
Deactivated session snap2 on device F:.
15:28:20.26 C:\> admsnap activate -s snap3
Scanning for new devices.
Activated session snap3 on device F:.

4. On the secondary server, list the files to show the production server files that existed at session 3 start.
15:28:39.96 C:\> dir f:\test
Directory of F:\Test
01/21/02 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/02 09:21a 0 FilesAfterS2BeforeS3-a.txt
01/21/02 09:21a 0 FilesAfterS2BeforeS3-b.txt
01/21/02 09:21a 0 FilesAfterS2BeforeS3-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-a.txt
01/21/02 09:21a 0 FilesBeforeSession1-b.txt
01/21/02 09:21a 0 FilesBeforeSession1-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-d.txt

5. On the secondary server, deactivate the last session.
15:28:45.04 C:\> admsnap deactivate -s snap3

6. On the production server, stop all sessions.
F:\Test> admsnap stop -s snap1 -o f:
F:\Test> admsnap stop -s snap2 -o f:
F:\Test> admsnap stop -s snap3 -o f:
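Step 6 above can be sketched as a loop over the sessions. It is shown in POSIX shell for illustration, though on the Windows production server these admsnap commands run from cmd.exe; the run() wrapper is an assumption that echoes commands unless ADMSNAP_LIVE=1 is set.

```shell
#!/bin/sh
# Hypothetical sketch of stopping every session started on the production
# server. OBJECT and run() are assumptions for illustration.
OBJECT="${OBJECT:-f:}"

run() {
    if [ "$ADMSNAP_LIVE" = "1" ]; then "$@"; else echo "$@"; fi
}

# Each stop frees the reserved LUN and SP memory used by that session.
for s in snap1 snap2 snap3; do
    run admsnap stop -s "$s" -o "$OBJECT"
done
```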


SnapView Clones

Three user-visible storage system objects are used by the SnapView clone capability: clone source LUN(s), clone LUN(s), and CPL(s). Because the fractured and unfractured states of a clone greatly affect the actions, a fractured clone LUN and an unfractured clone LUN are treated as two different objects in order to reduce complexity. There is a table for each object, with a number of events that pertain to each object. The Result entry describes the outcome of the event occurring on the object while the action is in progress.

Clone Source LUN

Action: Server I/O to a clone source LUN.

Event: I/O to the clone source LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.

Result: The server I/O request fails. Server-based path failover software may trespass the clone source LUN. If the I/O error condition was due to a problem related to the owner SP of the clone source LUN, the I/O will be able to continue on the peer SP (see the trespass action for a clone source LUN below). The administrator may trespass the clone source LUN or repair access to the LUN in an attempt to restore availability to the clone LUN (an unfractured clone LUN cannot be trespassed directly). When repaired, the clone LUN will require a manual restart of the synchronization; if trespassed, see the clone source LUN trespass action below. All fractured clones associated with the clone source LUN are unaffected. All unfractured clones will have their image condition set to “administratively fractured” with a clone property that indicates a media failure, and the image state will be changed to “consistent” if the clone was not synchronizing. If the clone was synchronizing, the image state will be set to “out of sync” (or “reverse out of sync”) until the clone source LUN is repaired. The repair may happen due to a trespass of the clone source LUN if the peer SP has access to the clone source LUN (see the trespass action below).

Action: Server read from a clone source LUN.

Event: The read from the clone source LUN fails due to a bad block.

Result: The server read request fails. There is no effect on any clones associated with the clone source LUN.


Action: A storage system generated read from the clone source LUN as part of a clone synchronization.

Event: The read from the clone source LUN fails due to a bad block.

Result: All unfractured clones will be marked with bad block(s) at the same corresponding logical offset(s) that were bad in the clone source LUN. If more than 32 KB of consecutive bad blocks from the clone source LUN are encountered as part of the synchronization operation, the clone image will be set to “administratively fractured” with a clone property that indicates a media failure, and the image state will be changed to “out of sync” until the clone source LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

Action: A storage system generated read from the clone source LUN as part of a clone synchronization.

Event: The read from the clone source LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.

Result: The clone synchronization is aborted. The clone image will be set to “administratively fractured” with a clone property that indicates a media failure, and the image state will be changed to “out of sync” until the clone source LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

Action: A storage system generated write to a clone source LUN. The storage system does this write as part of a reverse synchronization (a read from the clone LUN and a write to the clone source LUN).

Event: The write to the clone source LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc. Whether the clone reverse synchronization is protected or unprotected does not matter for this scenario.

Result: Until access to the clone source LUN is restored, the clone source LUN and the unfractured clone will be unusable; the image condition will be set to “administratively fractured” with a clone property that indicates a media failure, and the image state will be set to “reverse out of sync”. All other fractured clones associated with the same clone source LUN are unaffected.

Action: The SP that owns the clone source LUN is shut down.

Event: Active I/O to the clone source LUN. The SP is shut down due to a Navisphere command to reboot, an NDU, an SP panic caused by a SW or HW malfunction, or the SP being physically pulled.

Result: The clone source LUN is trespassed by the storage system; active server I/O to the clone source LUN can resume on the peer SP. See the clone source LUN trespass action below.


Action: The clone source LUN is trespassed.

Event: Active server I/O to the clone source LUN. Trespass of the clone source LUN can happen due to a Navisphere trespass request or via failover software when a path from the server to the clone source LUN has failed.

Result: Active server I/O to the clone source LUN can resume on the peer SP. All fractured clones associated with the clone source LUN are unaffected and will not trespass with the clone source LUN. All unfractured clones whose image condition is “normal” will trespass with the clone source LUN. Any clone synchronizations (including reverse synchronizations) that were in progress when the clone source LUN was trespassed will be queued to be started on the peer SP and will start automatically.

Clone LUN (Fractured)

Action: Server write to a fractured clone LUN.

Event: I/O to the clone LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.

Result: The server write request fails. Server-based path failover software may trespass the clone LUN. If the I/O error condition was due to a problem with the owner SP of the clone LUN, the I/O will be able to continue on the peer SP. The clone source LUN is unaffected. All other clones associated with the same clone source LUN are unaffected.

Action: A server read from a fractured clone LUN.

Event: The read from the clone LUN fails due to a bad block.

Result: The server read request fails. All other fractured and unfractured clones associated with the same clone source LUN are unaffected.

Action: The SP that owns the clone LUN is shut down.

Event: Active server I/O to the clone source LUN. The SP can be shut down due to a Navisphere command to reboot, an NDU, an SP panic caused by a SW or HW malfunction, or the SP being physically pulled.

Result: Server-based path failover software may trespass the clone LUN; active server I/O can resume on the peer SP. See the clone LUN trespass action below. The clone source LUN is unaffected (provided it is owned by the peer SP). All other clones associated with the same clone source LUN are unaffected.

Action: The clone LUN is trespassed.

Event: Active server I/O to the clone LUN. Trespass of the clone LUN can happen due to a trespass command, or failover software explicit or auto trespass when a path from the server to the clone LUN has failed.

Result: Active server I/O to the clone LUN can resume on the peer SP. The clone source LUN associated with this fractured clone LUN, and all other clones associated with the same clone source LUN, are unaffected.

Clone LUN (Unfractured)

Action: Storage system generated I/O to an unfractured clone LUN. The storage system does a write in order to replicate data to a clone source LUN or as a result of a synchronization (not a reverse synchronization). The storage system does a read of the unfractured clone LUN as part of a reverse synchronization operation.

Event: I/O to the clone LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.

Result: The clone LUN will be fractured, with the media failure property set to indicate the inability to access the clone LUN; the image condition will be set to “administratively fractured” with a clone property that indicates a media failure. The clone will be unusable, and the image state will be set to “out of sync”. The administrator may trespass the clone source LUN or repair access to the LUN in an attempt to restore availability to the clone LUN (an unfractured clone LUN cannot be trespassed directly). When repaired, the clone LUN will require a manual restart of the synchronization; if trespassed, see the clone source LUN trespass action above. All other fractured clones associated with the same clone source LUN are unaffected.

Action: A storage system generated read from a clone LUN. The storage system does this read as part of a reverse synchronization (a read from the clone LUN and a write to the clone source LUN).

Event: The read from the clone LUN fails due to a bad block. Whether the clone reverse synchronization is protected or unprotected does not matter for this scenario.

Result: The clone source LUN will be marked with bad block(s) at the same corresponding logical offset(s) that were bad in the clone LUN. If more than 32 KB of consecutive bad blocks from the clone LUN are encountered as part of the reverse synchronization operation, the clone image will be set to “administratively fractured” with a clone property that indicates a media failure, and the image state will be changed to “reverse out of sync” until the clone LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

Action: Storage system generated read from a clone LUN. The storage system does this read as part of a reverse synchronization (read from the clone LUN and a write to the clone source LUN).

Event: The read from the clone LUN fails due to an LCC/BCC failure, cache dirty LUN, etc. Protected or unprotected clone reverse synchronization does not matter for this scenario.

Result: The clone reverse synchronization operation is aborted. The clone source LUN is made inaccessible. The clone image will be set to “administratively fractured” with a clone property that indicates a media failure and the image state will be changed to “out of sync” until the clone LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only

Action: SP that owns the clone LUN is shut down (this is the same SP that owns the clone source LUN).

Event: Active I/O to the clone source LUN. The SP can be shut down due to a Navisphere command to reboot or during an NDU.

Result: All unfractured clone LUNs will follow the SP owner of the clone source LUN. See the event description under the clone source LUN SP shutdown action above. Any clone synchronizations that were in progress will be queued to be started on the peer SP if the recovery policy is set to automatic; otherwise the clone image condition will be set to “administratively fractured” with an image state of “out of sync” or “reverse out of sync” until the SP that failed reboots.

Action: SP that owns the clone LUN fails (this is the same SP that owns the clone source LUN).

Event: Active I/O to the clone source LUN. The SP can fail due to a SW or HW malfunction, or the SP is physically pulled.

Result: All unfractured clone LUNs will follow the SP owner of the clone source LUN. See the event description under the clone source LUN SP shutdown action above. Any clone synchronizations that were in progress will be queued to be started on the peer SP.
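The bad-block rule above (mark the same offsets on the source, but fracture the clone if too long a run of consecutive bad blocks is hit during reverse synchronization) can be sketched as a small simulation. This is purely illustrative; the function and variable names are my own, not FLARE internals, and the interpretation of the 32KB threshold as 64 consecutive 512-byte blocks is an assumption.

```python
BLOCK_SIZE = 512
MAX_BAD_RUN_BLOCKS = (32 * 1024) // BLOCK_SIZE  # assumed: 32 KB of consecutive bad blocks

def reverse_sync(clone_bad_blocks, total_blocks):
    """Simulate bad-block handling during a reverse synchronization.

    Returns (source_bad_blocks, image_state): bad blocks are re-marked at the
    same logical offsets on the source; a run longer than 32 KB aborts the
    operation and leaves the image "reverse out of sync".
    """
    source_bad = set()
    run = 0
    for lba in range(total_blocks):
        if lba in clone_bad_blocks:
            run += 1
            if run > MAX_BAD_RUN_BLOCKS:
                # Too many consecutive bad blocks: clone is administratively
                # fractured (media failure) and stays "reverse out of sync".
                return source_bad, "reverse out of sync"
            source_bad.add(lba)  # propagate the bad-block marking to the source
        else:
            run = 0
    return source_bad, "synchronized"
```

A scattered handful of bad blocks completes with the offsets mirrored onto the source; only a long contiguous run aborts the operation.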

CPL

Action: Storage system generated write to the CPL. The storage system does this write in order to mark regions that were changed due to a write to a clone source LUN or a write to a fractured clone LUN, to provide incremental synchronizations. Writes are also done by the storage system to clear marked regions as part of a synchronization operation.

Event: Write to the CPL fails due to an LCC/BCC failure, cache dirty LUN, etc. Protected or unprotected clone reverse synchronization does not matter for this scenario.

Result: If the CPL write was to mark a region for the clone source LUN or a fractured clone LUN, the operation proceeds without error, as the marked regions will be maintained in SP memory. The CPL can be reassigned to a newly bound LUN while the system is running to repair it. If the CPL-owning SP reboots before the reassignment can take place, all clone LUNs will require a full synchronization. If the CPL write was due to mirroring data or a synchronization (including reverse synchronization), all unfractured clone LUN image states are set to “out of sync” or “reverse out of sync” until the CPL is repaired. All clone LUNs that were synchronizing will have their image condition set to “administratively fractured” with a clone property that indicates a media failure. The CPL can be repaired while the system is running (as described above). After the CPL is repaired, all clone synchronizations must be manually restarted via administrative command.

Action: Storage system generated read from a CPL. The storage system does this read as part of a synchronization (including reverse synchronization) to determine which blocks need to be copied from the clone LUN to the clone source LUN.

Event: The read from the CPL fails due to a bad block, an LCC/BCC failure, cache dirty LUN, etc.

Result: Since the regions represented by the block(s) that cannot be read from the CPL are unknown, the storage system will cause full synchronizations (including reverse synchronization). If the operation was a protected reverse synchronization, the writes that were performed to the clone source LUN during the reverse synchronization operation will not be retained (the clone source LUN will have the data of the protected clone).
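The CPL write-failure behavior above (marks survive in SP memory, so losing the SP before the CPL is repaired forces a full synchronization) can be modeled with a tiny fracture-log sketch. This is an illustrative model only; class and method names are hypothetical, not FLARE code.

```python
class FractureLog:
    """Toy model of CPL region marking with the in-memory fallback."""

    def __init__(self):
        self.cpl_healthy = True
        self.persistent = set()       # regions recorded on the CPL
        self.in_memory = set()        # regions held only in SP memory
        self.needs_full_sync = False

    def mark_region(self, region):
        if self.cpl_healthy:
            self.persistent.add(region)
        else:
            # CPL write failed: proceed without error, track in SP memory.
            self.in_memory.add(region)

    def repair_cpl(self):
        # CPL reassigned to a newly bound LUN while the system is running:
        # flush the in-memory marks to the new CPL.
        self.persistent |= self.in_memory
        self.in_memory.clear()
        self.cpl_healthy = True

    def sp_reboot(self):
        # SP memory is lost; if unflushed marks existed, incremental
        # synchronization is no longer safe.
        if self.in_memory:
            self.needs_full_sync = True
        self.in_memory.clear()

    def regions_to_sync(self, total_regions):
        if self.needs_full_sync:
            return set(range(total_regions))  # full synchronization
        return self.persistent | self.in_memory
```

Repairing the CPL before the SP reboots preserves the incremental sync; rebooting first does not.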

Step-by-step clone overview - all platforms

Clones use an asynchronous write until they are in sync. Once they are in sync, writes are synchronous. When the clone is using the synchronous write you may see a performance impact. Clones spend most of their existence fractured. This section contains examples, from setting up clones (with Navisphere CLI) to using them (with admsnap and Navisphere CLI). Some examples outline the main steps; other examples are specific to a particular platform. In the following example, you will use the SnapView clone CLI commands in addition to the admsnap clone commands to set up a clone (from the production server) and use it (from the secondary server).

1. On the storage system, bind a LUN for each SP to serve as a clone private LUN. The clone private LUNs (one for each SP) are shared by all clone groups on a storage system. The clone private LUNs store temporary system information used to speed up synchronization of the source LUN and its clone. These structures are called fracture logs. The clone private LUN can be any public LUN that is not part of any storage group. The minimum and standard size for each clone private LUN is 250000 blocks. There is no benefit, in performance or otherwise, to using clone private LUNs larger than 250000 blocks.

2. On the storage system, bind a LUN to serve as the clone. Each clone should be the same size as the source LUN. The source and clone LUNs can be on the same SP or on different SPs.

3. If the source LUN does not exist (for example, because you are creating a new database), you can bind it at the same time as the clone. Then you can add the new source LUN to a storage group.

4. Assign the LUN you plan to use as your clone to a storage group. You must assign the clone LUN to a storage group other than the storage group that holds the source LUN. Use the Navisphere CLI command storagegroup as described in the EMC Navisphere Command Line Interface (CLI) Reference.

5. On the storage system, allocate the clone private LUNs. Use the CLI command function -allocatecpl for this.

6. On the storage system, create the clone group. Use the CLI command function -createclonegroup for this.

7. If the LUN you chose as your clone is mounted on a secondary server, deactivate the LUN from the server it is mounted on by issuing the appropriate command for your operating system:

• On a Windows server, use the following admsnap command: admsnap clone_deactivate -o clone drive_letter • On a UNIX server, unmount the file system on the LUN you want to use as a clone by issuing the umount command. • On a Novell NetWare server, use the dismount command on the volume to dismount the file system.


8. On the storage system, add the LUN you bound as your clone in step 2 to the clone group. Use the CLI command -addclone for this. By default, when you use the -addclone command, the software starts synchronizing the clone (copying source LUN data to the clone). If the source LUN has meaningful data on it, then synchronization is necessary. Depending on the size of the source LUN, a synchronization may take several hours. If you do not want the default synchronization to occur when you add the clone to the clone group, you can tell the CLI that synchronization is not required. To do this, use the -issyncrequired option of the -addclone command. An initial synchronization is not required if your source LUN does not contain any data. If you specify an initial sync with an empty source LUN, resources are needlessly used to synchronize the source LUN to the clone LUN.

9. After the clone is synchronized, you can use it independently by performing the following steps before fracturing it:

a. Quiesce I/O to the source LUN.
b. Flush all cached data to the source LUN by issuing the appropriate command for your operating system.

– For a Windows server, use the admsnap flush command to flush all server buffers:
  admsnap flush -o E:
– For Solaris, HP-UX, AIX, and Linux servers, unmount the file system by issuing the umount command. If you are unable to unmount the file system, you can issue the admsnap flush command:
  admsnap flush -o /dev/rdsk/c1t0d2s2
– For an IRIX server, the admsnap flush command is not supported. Unmount the file system by issuing the umount command. If you cannot unmount the file system, use the sync command to flush cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server’s file system. Refer to your system's man pages for sync command usage.
– For a Novell NetWare server, use the dismount command on the volume to dismount the file system.

Neither the flush command nor the sync command is a substitute for unmounting the file system; both commands only complement unmounting the file system. With some operating systems, additional steps may be required from the secondary server in order to flush all data and clear all buffers on the secondary server. For more information, see the product release notes.

c. Wait for the clone to transition to the synchronized state.
d. Fracture the clone using the CLI fracture command.
e. Resume I/O to the source LUN.
f. For Windows, use the admsnap clone_activate command to make the newly fractured clone available to the operating system.

After a delay, the admsnap clone_activate command finishes rescanning the system and assigns drive letters to newly discovered clone devices.

g. Important: If the secondary server is running Windows NT and the clone was already mounted on a secondary server, a reboot is required after you activate the fractured clone. If the secondary server is running Windows 2000, a reboot is recommended but not required. For UNIX servers, for all platforms except Linux, clone_activate tells the operating system to scan for new LUNs. For Linux, you must either reboot the server or unload and load the HBA driver. On a NetWare server, run the command list devices or use the command scan all LUNs on the console.

10. If you have a VMware ESX Server, do the following:
a. Rescan the bus at the ESX Server level.
b. If a Virtual Machine (VM) is already running, power off the VM and use the Service Console of the ESX Server to assign the clone to the VM. If a VM is not running, create a VM on the ESX Server and assign the clone to the VM.
c. Power on the VM and scan the bus at the VM level. For VMs running Windows, you can use the admsnap activate command to rescan the bus.

11. Use the fractured clone as you wish—for backup, reverse synchronization, or other use.

12. If you want to synchronize the clone LUN, perform the following steps to deactivate the clone:

a. For Windows, use the admsnap clone_deactivate command, which flushes all server buffers, dismounts, and removes the drive letter assigned by clone_activate. For multi-partitioned clone devices (those with more than one drive letter mounted on them), all other drive letters associated with the physical clone device will also be flushed, dismounted, and removed.
   admsnap clone_deactivate -o E:

b. For UNIX, unmount the file system by issuing the umount command. If you cannot unmount the file system, you can use the sync command to flush buffers. The sync command is not considered a substitute for unmounting the file system, but you can use it to reduce the number of incidents of having to fsck the file system on your backup server. Refer to your system's man pages for sync command usage.

c. For NetWare, use the dismount command on the clone volume to dismount the file system.

d. Start synchronizing the clone. Use the CLI command -syncclone for this.

13. If you have finished with this clone, you can remove the clone from its clone group. You can also do the following: • Destroy the clone group by using the CLI command -destroyclonegroup. • Remove the clone LUN by using the CLI command -removeclone. • Deallocate the clone private LUNs by using the CLI command -deallocatecpl.
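The lifecycle implied by the steps above (synchronize, fracture, use, resynchronize or reverse synchronize) can be sketched as a small state table. The state names follow this document; the transition rules and action names are a simplified illustration, not the actual FLARE state machine.

```python
# Allowed (state, action) -> next state transitions for a clone image.
ALLOWED = {
    ("out of sync", "synchronize"): "synchronizing",
    ("synchronizing", "complete"): "synchronized",
    ("synchronized", "fracture"): "fractured",
    ("fractured", "synchronize"): "synchronizing",           # incremental resync
    ("fractured", "reverse_synchronize"): "reverse syncing",
    ("reverse syncing", "complete"): "synchronized",
}

def transition(state, action):
    """Return the next clone image state, or raise on an illegal action."""
    try:
        return ALLOWED[(state, action)]
    except KeyError:
        raise ValueError(f"illegal action {action!r} in state {state!r}")
```

For example, a clone cannot be fractured before it has reached the synchronized state, which matches the procedure's requirement to wait for synchronization before issuing the fracture command.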


For future clone operations, if you have not removed any required clone components as in step 13, then synchronize if needed, and return to step 9.

Windows - clone example

The following example shows all the naviseccli or navicli and admsnap commands needed to set up and use a clone on a Windows platform. It includes binding and unbinding the LUNs and RAID Groups.

1. Create the source and clone RAID Groups and bind the LUNs.

naviseccli -h ss_spA createrg 10 1_0 1_1 1_2 1_3 1_4
naviseccli -h ss_spA createrg 11 1_5 1_6 1_7 1_8 1_9
naviseccli -h ss_spA bind r5 20 -rg 10 -sp A
naviseccli -h ss_spA bind r5 21 -rg 11 -sp A

To use these commands with navicli, replace naviseccli with navicli.

2. Create the clone private LUNs, each 250000 blocks long.

naviseccli -h ss_spA createrg 100 2_1 2_2 2_3 2_4 2_5
naviseccli -h ss_spA bind r5 100 -rg 100 -sp A -sq mb -cp 200
naviseccli -h ss_spa bind r5 101 -rg 100 -sp A -sq mb -cp 200

To use these commands with navicli, replace naviseccli with navicli.

3. Wait for all the LUNs to complete binding. Then set up the storage groups.

naviseccli -h ss_spa storagegroup -create -gname Production
naviseccli -h ss_spa storagegroup -create -gname Backup
naviseccli -h ss_spa storagegroup -connecthost -o -server ServerABC -gname Production
naviseccli -h ss_spa storagegroup -connecthost -o -server ServerXYZ -gname Backup
naviseccli -h ss_spa storagegroup -addhlu -gname Production -hlu 20 -alu 20
naviseccli -h ss_spa storagegroup -addhlu -gname Backup -hlu 21 -alu 21

To use these commands with navicli, replace naviseccli with navicli.

4. On both servers, rescan or reboot to let the operating systems see the new LUNs.

5. Allocate the clone private LUNs.

naviseccli -User GlobalAdmin -Password mypasssw -Scope 0 -Address ss_spa snapview -allocatecpl -spA 100 -spB 101 -o

To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

6. Create the clone group and add the clone.

naviseccli -user GlobalAdmin -password mypassw -scope 0 -address ss_spa snapview -createclonegroup -name lun20_clone -luns 20 -description Creatinglun20_clone -o
naviseccli -user GlobalAdmin -password password -scope 0 -address ss_spa snapview -addclone -name lun20_clone -luns 20

To use these commands with navicli.jar, replace naviseccli with java -jar navicli.jar.

7. Run Disk Management on the production server and create an NTFS file system on the source LUN. Copy files to the drive letter assigned to the source LUN on the production server. This example uses g: as the drive letter for the source LUN.

8. On the production server, run admsnap to flush the buffers.

admsnap flush -o g:

The clone transitions to the synchronized state.

9. Fracture the clone.

naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -fractureclone -name lun20_clone -cloneid 0100000000000000 -o

To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

10. On the secondary server, run admsnap to activate the clone.

admsnap clone_activate


The admsnap software returns a drive letter for the drive assigned to the clone that was just fractured. This example uses h: as the drive letter for the clone LUN.

11. Verify that the files that were copied to the source LUN also appear on the clone LUN.

12. If you have a VMware ESX Server, do the following:
a. Rescan the bus at the ESX Server level.
b. If a Virtual Machine (VM) is already running, power off the VM and use the Service Console of the ESX Server to assign the clone to the VM. If a VM is not running, create a VM on the ESX Server and assign the clone to the VM.
c. Power on the VM and scan the bus at the VM level. For VMs running Windows, you can use the admsnap activate command to rescan the bus.

13. On the secondary server, delete the existing files and copy different files to the clone (to h:).

14. On the secondary server, run admsnap to deactivate the clone.

admsnap clone_deactivate -o h:

15. On the production server, run admsnap to deactivate the source.

admsnap clone_deactivate -o g:

16. Reverse synchronize to copy the data written to the clone back to the source.

naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -reversesyncclone -name lun20_clone -cloneid 0100000000000000 -o

To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

17. On the production server, run admsnap to activate the source.

admsnap clone_activate

Wait for the reverse-sync operation to complete and the clone to transition to the synchronized state.

18. Fracture the clone again to make the source independent.

naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -fractureclone -name lun20_clone -cloneid 0100000000000000 -o

To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

19. On the production server, verify that the source LUN (g:) contains the files that were written to the clone on the secondary server. It should no longer contain the files that were deleted from the clone.

20. On the production server, use admsnap to deactivate the source.

admsnap clone_deactivate -o g:

21. Clean up the storage system by removing and destroying the clone group.

naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -removeclone -name lun20_clone -cloneid 0100000000000000 -o
naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -destroyclonegroup -name lun20_clone -o

To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.


Reverse synchronization - all platforms

The following example illustrates the admsnap and Navisphere CLI commands required to reverse synchronize a fractured clone.

1. From the production server, stop I/O to the source LUN.

2. Using admsnap, do the following:

a. From the production server, deactivate the source LUN by issuing the appropriate command for your operating system.
• On a Windows server, use the following admsnap command:
  admsnap clone_deactivate -o source-drive-letter
• On a UNIX server, unmount the file system by issuing the umount command. If you cannot unmount the file system, use the sync command to flush buffers. Although the sync command is not a substitute for unmounting the file system, you can use it to reduce the number of times you need to issue the fsck command on the secondary server’s file system. Refer to your system's man pages for sync command usage.

• On a NetWare server, use the dismount command on the volume to dismount the file system.

b. If the clone is mounted on a secondary server, flush all cached data to the clone LUN by issuing the appropriate command for your operating system.
• For a Windows server, use the admsnap flush command.
• For Solaris, HP-UX, AIX, and Linux servers, unmount the file system by issuing the umount command. If you are unable to unmount the file system, issue the admsnap flush command. The flush command flushes all data and clears all buffers.
• For an IRIX server, the admsnap flush command is not supported. Unmount the file system by issuing the umount command. If you cannot unmount the file system, use the sync command to flush cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server’s file system. Refer to your system's man pages for sync command usage.
• On a Novell NetWare server, use the dismount command on the volume to dismount the file system.

c. Neither the flush command nor the sync command is a substitute for unmounting the file system. Both commands only complement unmounting the file system. With some operating systems, additional steps may be required from the secondary server in order to flush all data and clear all buffers on the secondary server. For more information, see the product release notes.

3. Using Navisphere CLI, issue the following command from the SP that owns the source LUN:

snapview -reversesyncclone -name name|-clonegroupUid uid -cloneid id [-UseProtectedRestore 0|1]

Before you can use the protected restore feature, you must globally enable it by issuing the snapview -changeclonefeature [-AllowProtectedRestore 1] command.

Important: When the reverse synchronization begins, the software automatically fractures all clones in the clone group. Depending on whether or not you enabled the Protected Restore feature, the following occurs to the clone that initiated the reverse synchronization:

• With the Protected Restore feature - the software fractures the clone after the reverse synchronization completes.
• Without the Protected Restore feature - the software leaves the clone unfractured.
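The Protected Restore rule above can be condensed into a few lines. This is an illustrative sketch of the documented behavior, not product code; the function name and data shapes are my own.

```python
def start_reverse_sync(clones, initiator, protected_restore):
    """Simulate the fracture side effects of starting a reverse sync.

    clones: iterable of clone names in the clone group.
    Returns a dict name -> fractured? after the operation completes:
    every other clone in the group is fractured when the reverse sync
    begins; the initiating clone is fractured afterwards only when the
    Protected Restore feature is enabled.
    """
    state = {name: True for name in clones}   # all clones fractured at start
    state[initiator] = protected_restore      # initiator: fractured only if protected
    return state
```

So with Protected Restore enabled every clone in the group ends up fractured; without it, the initiating clone remains unfractured and continues mirroring the source.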


MirrorView/S

There are three user-visible storage system objects used by the MirrorView/S capability: mirror primary LUN(s), secondary LUN(s), and WIL(s). There is a table for each object and a number of events that pertain to each object. The result column describes the outcome of the event occurring while the action is in progress. The image states and image conditions described below apply to all the mirrors in a consistency group (e.g., if a mirror fractures, and the mirror is in a consistency group, all the mirrors in the group fracture).

Primary LUN

Action: Server write to a mirror primary LUN.

Event: Server write to the mirror primary LUN fails due to an LCC/BCC failure, cache dirty LUN, etc.

Result: The server write request fails. Server-based path failover software may trespass the mirror primary LUN. If the write error condition was due to a problem related to the owner SP of the mirror primary LUN, the write will be able to continue on the peer SP. See the trespass action for a mirror primary LUN below. Each mirror secondary is updated with the write request in parallel, even though the data on the primary may not be identical to the secondary. The storage system keeps track of the in-flight writes and will ensure that each mirror secondary LUN is resynchronized with the data from the primary so that they are eventually identical. Since the write that failed was never acknowledged to the server, the data on each secondary is undefined (and therefore is crash recoverable in the event of a disaster).

Action: Server read from a mirror primary LUN.

Event: The read from the mirror primary LUN fails due to a bad block.

Result: No effect on any mirror secondary LUNs associated with the mirror primary LUN. The server read request fails.

Action: A storage system generated read from the mirror primary LUN as part of a mirror synchronization.

Event: The read from the mirror primary LUN fails due to a bad block.

Result: All mirror secondary LUNs will be marked with bad block(s) at the same corresponding logical offset(s) that were bad in the mirror primary LUN.

Action: A storage system generated read from the mirror primary LUN as part of a mirror synchronization.

Event: The read from the mirror primary LUN fails due to an LCC/BCC failure, cache dirty LUN, etc.

Result: The mirror synchronization is aborted. The secondary image will be set to “administratively fractured” with a mirror property that indicates a media failure, and the image state will be changed to “out of sync” until the mirror primary LUN is repaired.

Action: SP that owns the mirror primary LUN fails.

Event: Active I/O to the mirror primary LUN. The SP can fail due to a SW or HW malfunction, or the SP is physically pulled.

Result: All mirror primary LUNs owned by the SP that has failed are trespassed by the storage system, and I/O can resume on the peer SP. See the trespass action below for more details. Any synchronizations that are currently in progress will automatically continue on the peer SP. The synchronization will continue where it left off if the mirror is configured to use the WIL. If the mirror does not use the WIL, the synchronization will be a full synchronization, as the fracture changed-region information (maintained in SP memory) is lost when the SP fails.

Action: SP that owns the mirror primary LUN is shut down.

Event: Active I/O to the mirror primary LUN. The SP can be shut down due to a Navisphere command to reboot, or an NDU.

Result: All mirror primary LUNs owned by the SP that has been shut down are trespassed by the storage system, and I/O can resume on the peer SP. See the trespass action below for more details. Any mirror synchronization that is currently in progress will automatically continue on the peer SP. The synchronization will continue where it left off, even for mirrors that are not configured to use the WIL. As part of the shutdown process, the fracture changed-region information (maintained in SP memory) is sent to the peer SP to avoid a full synchronization.

Action: Mirror primary LUN is trespassed.

Event: Active I/O to the mirror primary LUN. Trespass of the mirror primary LUN can happen due to an NDU, a Navisphere trespass command, or failover software explicit or auto trespass when a path from the server to the mirror primary LUN is determined to be bad.

Result: I/O to the mirror primary LUN can resume on the peer SP. All fractured secondary LUNs associated with the mirror primary LUN are unaffected and will not trespass to the corresponding new owning SP on the remote storage system. When the mirror is incrementally synchronized, the secondary will automatically trespass to the corresponding SP owner of the primary. All secondary LUNs that were actively mirroring with the primary LUN (normal image condition) will trespass to the corresponding new owning SP on the remote storage system (e.g., if the mirror primary LUN trespasses to SP B, the mirror secondary LUN previously owned by SP A will be trespassed to SP B). If there is no mirror connectivity to the remote storage system SP as a result of the trespass, the secondary image condition will be set to “system fractured”. The image state will be set to either “consistent” or “out of sync” (if a synchronization was in progress). When connectivity is restored, if a synchronization was in progress and the mirror property of automatic recovery is set, then the synchronization will be queued to be started and will start automatically; otherwise the image condition will be set to “waiting on administrative action” and will require a manual synchronization.

Secondary LUN

Action: Storage system generated write to a mirror secondary LUN. The storage system does this write in order to replicate a write to a mirror primary LUN or as a result of a synchronization.

Event: Write to the secondary LUN fails due to a connectivity issue between the primary storage system and the remote storage system. Connectivity issues could occur due to switch failures (including firmware revision issues), storage system port failures, cable failures, zoning errors, ISP failures, etc.

Result: Until access to the secondary LUN is restored, the image condition will be set to “system fractured”. If a mirror synchronization was in progress, the image state is set to “out of sync”; otherwise it will be set to “consistent”. When connectivity is restored, if the mirror property of automatic recovery is set, then the synchronization will be queued to be started and will start automatically; otherwise the image condition will be set to “waiting on administrative action” and will require a manual synchronization. The administrator may trespass the mirror primary LUN in an attempt to restore connectivity to the secondary LUN (a secondary LUN cannot be trespassed directly). See the mirror primary LUN trespass event above.

Action: Storage system generated write to a mirror secondary LUN. The array does this write in order to replicate a write to a mirror primary LUN or as a result of a synchronization.

Event: Write to the secondary LUN fails due to an LCC/BCC failure, cache dirty LUN, etc.

Result: Until access to the secondary LUN is restored, the image condition will be set to “administratively fractured” (to indicate that administrative action is required to repair and synchronize the secondary LUN). If a mirror synchronization was in progress, the image state is set to “out of sync”; otherwise it will be set to “consistent”. The administrator may trespass the mirror primary LUN in an attempt to restore availability to the secondary LUN (a secondary LUN cannot be trespassed directly). See the mirror primary LUN trespass event above.

Action: SP that owns the secondary LUN is shut down.

Event: Active I/O to the mirror primary LUN. The SP can be shut down due to a Navisphere command to reboot, the SP panics due to a SW or HW malfunction, or the SP is physically pulled.

Result: A failure of the remote storage system SP that owns the secondary LUN is handled the same as the connectivity event described above.

Action: Mirror secondary LUN is promoted to the role of primary.

Event: There is a mirror connectivity issue between the primary storage system and the remote storage system. Connectivity issues could occur due to switch failures (including firmware revision issues), storage system port failures, cable failures, zoning errors, ISP failures, etc.

Result: The secondary LUN is promoted successfully. The “old” mirror primary LUN is treated as a different mirror from the newly promoted secondary LUN. If the “old” primary had another secondary LUN associated with it (MirrorView/S supports up to two secondary LUNs), and mirror connectivity between that secondary and the newly promoted secondary exists, that secondary will become a secondary of the promoted LUN but will require a full manual synchronization (the image condition will be set to “waiting on administrative action”). If connectivity does not exist to the non-promoted secondary, that secondary will be left as a secondary of the “old” primary LUN.
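The promotion rule above can be captured in a short sketch: a surviving non-promoted secondary re-attaches to the new primary only if mirror connectivity to it exists, and then requires a full manual synchronization. Function and label names are illustrative only.

```python
def promote(secondaries, promoted, connectivity):
    """Simulate promoting one secondary of a MirrorView/S mirror.

    secondaries: list of secondary LUN names (up to two in MirrorView/S).
    connectivity: name -> bool, whether that secondary can reach the
    newly promoted LUN.
    Returns name -> (role, image condition) after the promotion.
    """
    result = {promoted: ("primary", None)}
    for s in secondaries:
        if s == promoted:
            continue
        if connectivity.get(s, False):
            # Re-attaches to the new primary, but needs a full manual sync.
            result[s] = ("secondary of promoted", "waiting on administrative action")
        else:
            # No connectivity: stays behind with the "old" primary.
            result[s] = ("secondary of old primary", None)
    return result
```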

WIL

Action: Storage-system-generated write to the WIL. The storage system does this write to mark regions that were changed by a write to a mirror primary LUN, in order to provide incremental syncs and to avoid full resyncs for writes that were in flight when an SP fails. The storage system also writes to the WIL to clear marked regions as part of a sync operation.
Event: The write to the WIL fails due to an LCC/BCC failure, a cache dirty LUN, etc.
Result: If the WIL write was to mark a region for the mirror primary LUN, the image condition is set to "administratively fractured". The WIL can be reassigned to a newly bound LUN while the system is running to repair it; however, this procedure requires that every mirror turn off the WIL feature before the WIL can be unallocated and reallocated to a new LUN. If the WIL-owning SP reboots before the reallocation can take place, all mirror secondary LUNs on the SP that owns the WIL will require a full synchronization.

Action: Storage-system-generated read from the WIL. The storage system does this read to avoid full mirror synchronizations after an SP has shut down. The read can occur after the SP that owns the WIL is rebooted, or from the peer SP if the primary LUN is trespassed. This includes restarting synchronizations that were in progress as well as recovering from in-flight writes that were in progress when the SP was shut down.
Event: The read from the WIL fails due to a bad block.
Result: Since the regions represented by the block(s) that cannot be read from the WIL are unknown, the storage system stops all synchronizations, and all secondary image states are set to "out of sync". All secondary LUNs that were synchronizing with the primary LUN have their image condition set to "administratively fractured". All the regions in the WIL for the mirrors that were synchronizing are set to indicate that they need to be synchronized; therefore, when any of those synchronization operations resume, the storage system performs a full synchronization. Once a write to the WIL succeeds (due to a host write, the addition of an out-of-sync secondary, or the mirror transitioning to the "in sync" image state), the in-memory bitmap is flushed to the WIL, which repairs any bad blocks in the WIL.

Action: The SP that owns the WIL fails.
Event: The SP can be shut down due to a SW or HW malfunction, or the SP is physically pulled.
Result: This is handled exactly the same as described above for when the owning SP of a mirror primary LUN fails.

Action: The SP that owns the WIL LUN is shut down.
Event: The SP can be shut down due to a Navisphere command to reboot, or an NDU.
Result: This is handled exactly the same as described above for when the owning SP of a mirror primary LUN is shut down.
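The WIL mechanics in the rows above can be sketched as a coarse dirty-region bitmap. This is a simplified illustration; the class, method names, and region size are assumptions, not FLARE internals:

```python
# Minimal sketch of a write intent log: a region is marked dirty BEFORE the
# host write is applied and cleared lazily during later sync activity, so
# after an SP failure only still-marked regions need re-copying instead of
# performing a full synchronization.

REGION = 1 << 20  # assumed region size in bytes (illustrative only)

class WriteIntentLog:
    def __init__(self):
        self.marked = set()  # stands in for the persisted on-disk bitmap

    def mark(self, offset, length):
        first = offset // REGION
        last = (offset + length - 1) // REGION
        self.marked.update(range(first, last + 1))  # persisted before the write

    def lazy_clear(self, region):
        self.marked.discard(region)  # done as part of a later sync operation

    def recovery_set(self):
        # Regions that may differ between primary and secondary after a crash.
        return sorted(self.marked)

wil = WriteIntentLog()
wil.mark(0, 4096)            # host write near the start of the LUN (region 0)
wil.mark(3 * REGION, 8192)   # host write inside region 3
wil.lazy_clear(0)            # region 0 later confirmed in sync on both images
print(wil.recovery_set())    # -> [3]
```

After a crash, only region 3 would be re-copied, which is the incremental-sync behavior the table describes.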


How MirrorView/S handles failures MirrorView/S uses a synchronous write to the secondary image. Mirrors spend most of their existence in sync. When a failure occurs during normal operations, MirrorView/S lets you perform several actions to recover. In recovering from failures, MirrorView/S achieves two goals:

• Preserves data integrity • Minimizes the amount of time that data is unavailable to the user

Access to the SP fails

If the SP that owns a mirrored LUN on the primary system fails, the LUN may be trespassed to the other SP (for example, by PowerPath software running on an attached Windows host). This allows mirroring to continue, provided the host is set up properly to handle the failover.

When the primary LUN is trespassed, MirrorView/S sends a trespass request to any secondary images. Therefore, you may notice that the mirrored LUNs on the secondary system have moved from SP A to SP B, or vice versa. MirrorView/S keeps the SP ownership the same on the primary and secondary systems: if the primary image is on SP A, then the secondary image(s) will be on SP A. If a secondary image is fractured when a trespass occurs on the primary, then the secondary image will not be trespassed until synchronization is started.

Primary Image Fails

If the storage system controlling the primary image fails, access to the mirror stops until you either repair the storage system or promote a secondary image of the mirror to primary. If the mirror has two secondary images and you promote one, the other secondary image becomes a secondary image to the promoted mirror. You can recover with a promotion, or you can wait until the primary image is repaired and then continue where you left off.


Promoting a secondary image to a primary image

In situations where you must replace the primary storage system due to a catastrophic failure, you can use a promotion to access data on the secondary storage system. To recover and restore I/O access, you must promote a secondary mirror image to the role of primary mirror image, so that a host can access it. Note that you can also promote a secondary image even if there has not been a catastrophic failure. If the primary image and secondary image can communicate with each other, then when the secondary image is promoted, the former primary image is demoted to a secondary image.

To promote a secondary image, the following conditions must be true:

• You must direct the navicli.jar mirror commands to the storage system holding the secondary image.
• The state of the secondary image you will promote must be either Consistent or Synchronized.

Note: If you promote a Consistent secondary image, you must perform a full sync to re-establish the mirror after promotion.

! CAUTION ! Promoting when I/O is going to the primary image can cause data loss. Any I/Os in progress during the promotion may not be recorded to the secondary image and will be unavailable after the secondary image is promoted. A full synchronization of the new secondary image will also probably be required after the promotion.

In a failure situation, before promoting a secondary image to a primary image:

a. If the existing primary image is accessible, remove the primary image from any storage groups before promoting the secondary image, to avoid I/O and therefore inconsistent data.
b. Ensure that no I/O, either generated from a host or by a synchronization in progress, is occurring on the mirror.
c. If the existing primary is available, make sure that it lists the secondary image that is to be promoted as "synchronized."

To promote a secondary image to a primary image:

a. Issue the mirror -sync -promoteimage command.
b. Add the newly promoted image to a storage group if necessary.

If you have two secondary images, the other secondary will also be added to the new mirror if it can be contacted. If there are two secondary images and one is promoted but cannot communicate with the other secondary, then the other secondary remains part of a mirror for which there is no primary image. You must remove this orphaned image by using the force destroy option.
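The promotion rules above reduce to two small checks, sketched here as a hedged illustration (the function names are ours, not part of the Navisphere CLI):

```python
# Sketch of the stated promotion rules: a secondary image can be promoted only
# from the Consistent or Synchronized state, and promoting a Consistent image
# requires a full synchronization afterwards to re-establish the mirror.

def can_promote(image_state):
    return image_state in ("Consistent", "Synchronized")

def full_sync_needed_after_promote(image_state):
    # "Synchronized" means no acknowledged writes are missing on the secondary;
    # "Consistent" allows in-flight writes, hence the full resynchronization.
    return image_state == "Consistent"

assert can_promote("Synchronized") and not full_sync_needed_after_promote("Synchronized")
assert can_promote("Consistent") and full_sync_needed_after_promote("Consistent")
assert not can_promote("Out-of-Sync")
```

This is why the note above recommends waiting for the Synchronized state before a planned promotion: it is the only path that avoids a full resync.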

At some point later, you can also perform the following steps:

a. Verify that the failed storage system is not the master of the domain. If it is, assign another storage system to be the master. See the EMC Navisphere Command Line Interface (CLI) Reference.

b. Verify that the failed storage system is not a portal. If it is a portal, remove the portal and configure a different storage system as a portal. See the EMC Navisphere Command Line Interface (CLI) Reference.

Note: If a planned promotion of a secondary (for example, for disaster recovery testing) occurs, make sure that the image you are promoting is in the Synchronized state to avoid a full resynchronization.


Running MirrorView/S on a VMware ESX Server

When you use MirrorView/S on a VMware ESX Server, after you promote the secondary image to a primary, perform the following steps:

1. Assign the newly promoted primary image to a storage group of the same or standby ESX Server.
2. Rescan the bus at the ESX Server level.
3. Create a Virtual Machine (VM) on the same or standby ESX Server.
4. Assign the newly promoted primary to the VM. Assign it to a different VM unless you remove the failed primary, in which case you can assign it to the same VM.
5. Power up the VM.

If the VM is created and running, perform these steps:

1. Power it down.
2. Use the Service Console on the ESX Server to assign the newly promoted primary to the powered-down VM.
3. Power up the VM.

The original primary image (which is now the secondary image) will not be accessible to the primary ESX Server.

Recovering by promoting a secondary image

When you promote the secondary image, the software assigns a new mirror ID to the promoted image to distinguish it from the original mirror, even though the mirrors have the same name. To promote a secondary image, use the mirror -sync -promoteimage command. The new image condition of the original primary image depends on whether the original primary image is accessible at the time of promotion. If the existing primary image is accessible when you promote, the software attempts to add the original primary image as a secondary image of the promoted mirror; that is, the images swap roles.

If the primary image is not accessible when you promote, the software creates a new mirror with the former secondary image as the new primary image and no secondary image, as shown in the example below. The mirror on the original primary storage system does not change. If the MirrorView/S connection between the storage systems is not working during a promotion, the storage system that holds the original primary image still has a record of the secondary image that was promoted. The original primary image is unable to communicate with the promoted secondary image even if the MirrorView/S connection between the storage systems is restored (since the secondary was promoted to a primary image).

Mirror before promotion:
  Mirror ID = aaa
  Primary image = LUN xxxx
  Secondary image = LUN yyyy

Mirror after promotion:
  Mirror ID = bbb
  Primary image = LUN yyyy
  Secondary image = none


Restoring the original mirror configuration after recovery of a failed primary image

If the original mirror becomes accessible following a failure and the mirror's secondary image has been promoted, the original mirror will be unable to communicate with the new one. To restore your mirrors to their original configuration, do the following:

1. If you want to retain any data on the original primary LUN, copy it to another LUN before continuing, or alternatively, you can create a LUN that will become the primary LUN. The following process overwrites the contents of the original primary LUN.

2. Remove the original primary LUN from any storage group of which it is a member.
3. Destroy the original mirror using the mirror -sync -destroy -force command.

Original mirror: destroyed. The original LUN used for the primary image remains (LUN xxxx).

New mirror:
  Primary image = LUN yyyy
  Secondary image = none

! CAUTION ! Data from the promoted LUN will overwrite all data on the secondary image (original primary) LUN if the administrator syncs the mirror.

4. Add a secondary image to the new mirror using the LUN that was the primary image for the original mirror (LUN xxxx).
5. Synchronize the secondary image.

New mirror:
  Primary image = LUN yyyy
  Secondary image = LUN xxxx

! CAUTION ! Allow the image to transition to the Synchronized state following the synchronization. If the image is in the Consistent state when you promote it, another full resynchronization is required, and data may be lost.

6. Promote the secondary image (LUN xxxx) in the new mirror to primary. The new mirror now has the same configuration as the original mirror.

New mirror:
  Primary image = LUN xxxx
  Secondary image = LUN yyyy

During a promotion, the recovery policy for a secondary image is always set to manual recovery. This prevents a full synchronization from starting until you want it to.

7. If required, reset the recovery policy back to automatic.
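The restore sequence above can be modeled end to end. This is a hypothetical sketch (the dict model and the make_mirror/add_secondary/promote helpers are ours, not array APIs) showing that the final mirror matches the original layout:

```python
# Hypothetical model of steps 2-7: the promoted mirror (ID bbb, primary yyyy)
# gets the old primary LUN (xxxx) added back as a secondary, is synchronized,
# and is then promoted again so xxxx is primary once more. Promotion assigns a
# new mirror ID, as described earlier in this section.

def make_mirror(mirror_id, primary, secondary=None):
    return {"id": mirror_id, "primary": primary, "secondary": secondary}

def add_secondary(mirror, lun):
    mirror["secondary"] = lun  # step 4: add the old primary LUN as a secondary

def promote(mirror, new_id):
    # step 6: roles swap and a new mirror ID is assigned
    mirror["primary"], mirror["secondary"] = mirror["secondary"], mirror["primary"]
    mirror["id"] = new_id

mirror = make_mirror("bbb", primary="LUN_yyyy")  # state after the earlier promotion
add_secondary(mirror, "LUN_xxxx")                # steps 2-5: destroy old mirror, add, sync
promote(mirror, "ccc")                           # step 6: promote xxxx back to primary

assert mirror["primary"] == "LUN_xxxx" and mirror["secondary"] == "LUN_yyyy"
```

The end state matches the original configuration (xxxx primary, yyyy secondary), although the mirror ID has changed twice along the way.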


Recovering without promoting a secondary image

If the primary storage system fails, but can be readily repaired, recovery is simpler. For mirrors configured to use the write intent log, MirrorView/S records any writes that had been received before the failure and can transfer them to the remote image when the primary storage system is repaired, thus bringing the secondary back in sync with the primary. Any writes that were sent to the storage system but not yet acknowledged may be lost, but application-specific recovery techniques, such as chkdsk or fsck for file systems, can usually correct any issues. If you did not use the write intent log, you must perform a full resynchronization of the secondary image.

To recover without promoting a secondary image, follow these steps:

1. Repair the primary storage system and/or host.
2. Fracture the mirror(s).
3. Complete any necessary application-specific recovery of the data on the primary image.
4. Make sure that the data is flushed from the host to the storage system.
5. Synchronize the mirror(s).

Failure of the secondary image

When a primary image cannot communicate with a secondary image, it marks the secondary as unreachable and stops trying to write to it. The loss of communication may be due to a failure of the link between storage systems, an SP failure on the secondary storage system, or some other failure on the secondary storage system. In the event of a communication failure, the secondary image remains a member of the mirror. The primary image also attempts to minimize the amount of work required to synchronize the secondary after it recovers. It does this by fracturing the mirror.


This means that while the secondary is unreachable (fractured), the primary storage system keeps track of write requests to the mirror, so that only areas that were modified need to be copied to the secondary during recovery. When the secondary is repaired, a synchronization operation brings the image up to date. The primary recognizes that the secondary is alive and restarts writes to that image if the recovery mode is set as automatic.
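The fracture-tracking behavior described above can be illustrated with a small sketch. The region granularity and the class/method names are assumptions for illustration only:

```python
# While a secondary is fractured, the primary records only WHICH fixed-size
# regions were written, not the data itself. When the secondary recovers, the
# synchronization copies just those regions, avoiding a full resync.

REGION_BLOCKS = 2048  # assumed tracking granularity, in blocks

class FractureLog:
    def __init__(self):
        self.dirty = set()

    def record_write(self, lba, blocks):
        first = lba // REGION_BLOCKS
        last = (lba + blocks - 1) // REGION_BLOCKS
        self.dirty.update(range(first, last + 1))

    def regions_to_copy(self):
        return sorted(self.dirty)  # the incremental-sync work list

log = FractureLog()
log.record_write(lba=0, blocks=16)       # dirties region 0
log.record_write(lba=4096, blocks=4096)  # dirties regions 2 and 3
print(log.regions_to_copy())             # -> [0, 2, 3]
```

On recovery, the synchronization would copy only regions 0, 2, and 3 to the secondary rather than the entire LUN.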

Promoting a secondary image when there is no failure

You may want to promote your secondary image even if no failure occurs. For example, you may want to test your disaster recovery procedure before a real failure occurs, or perhaps the server attached to the primary storage system failed and you must resume operations using the server attached to the secondary storage system. If the original primary is accessible when you promote, then the promoted image becomes the primary image of the new mirror, while the original primary becomes a secondary image of the new mirror (that is, the images swap roles). The software then verifies whether the two images are synchronized. If the images are synchronized, the software proceeds with mirrored I/O as usual. If the images are not synchronized, the software performs a full synchronization using the promoted image as the primary image. A full synchronization will not start until you initiate it or change the synchronization policy to auto after a promotion. After a promotion, all the secondary images in the new mirror are set to manual recovery.

Mirror before promotion:
  Mirror ID = aaa
  Primary image = LUN xxxx
  Secondary image = LUN yyyy

Mirror after promotion:
  Mirror ID = bbb
  Primary image = LUN yyyy
  Secondary image = LUN xxxx


Summary of MirrorView/S failures

The table below shows how MirrorView/S might help you recover from system failures at the primary and secondary sites. It assumes that the secondary image of the mirror is in either the Synchronized or Consistent state.

Event: Loss of access to primary image LUN
Result and recovery: Check the connections between the server and the storage system, including zoning and correct operation of switches. Check for an SP reboot.

Event: Secondary SP is rebooted
Result and recovery: If the secondary SP reboots, for example due to a software failure, an explicit command, or the SP being physically pulled and reseated, you may see the secondary image become system fractured. It is also possible for the secondary to become administratively fractured, in which case simply synchronize the image.

Event: Server accessing primary image fails
Result and recovery: Catastrophic failure; I/O stops. After a defined period of time, all non-fractured secondaries in the Consistent state transition to the Synchronized state. Nothing more happens until the server is repaired or replaced, or a secondary image is promoted.

Event: Array running primary image fails
Result and recovery:
Option 1 - Catastrophic failure. The mirror is left in the state it was already in. If the secondary image is in either the Consistent or Synchronized state, it may be promoted to provide access to your data. Note: Any writes in progress when the primary image fails may not propagate to the secondary image. Also, if the remote image was fractured at the time of the failure, any writes since the fracture will not have propagated.
Option 2 - Non-catastrophic failure, repair is feasible. The admin has the problem fixed, and normal production operation can resume. The write intent log, if used, shortens the sync time needed; if the write intent log is not used, a full sync is needed.
Option 3 - Only one SP fails. If the SP that controls the mirror fails, software on the server (e.g., PowerPath) can detect the failure and transfer control of the mirror to the surviving SP, and normal operations can continue. If such software is not running on the server, then you must either manually transfer control using Navisphere, or access to the mirror stops until the SP is back in service. If the SP that does not control the LUN fails, mirroring continues as normal.

Event: Array running secondary image fails
Result and recovery:
- If the SP that does not control the secondary image fails, nothing happens with respect to this mirror.
- If the SP that controls the mirror fails (or both SPs fail, or a catastrophic failure of the entire system occurs), the primary system will fracture the image hosted on this array. The mirror may consequently go to the Attention state (if it is so configured), but I/O continues as normal to the primary image. The admin has a choice: if the secondary can easily be fixed (e.g., if someone pulled out a cable), the admin can have it fixed and let things resume. Otherwise, to regain protection of your data, and if you have another array available, you can force destroy the existing mirror, recreate it, and add a secondary image on another working array. Protection is not established until the full sync of the secondary image completes.

Event: Loss of connection between arrays (indicated by system fractures)
Result and recovery: Check the cables, make sure that all SPs are still working, and make sure the MirrorView path between the arrays is still enabled and active. Check for correct zoning and proper function of any switches.

Event: Failures when adding secondary images
Result and recovery: Make sure that: the connection between arrays works; you are managing both arrays (which may require managing two domains); the secondary LUN is available and the same block size as the primary image; the secondary image LUN is not in a storage group; and the secondary image LUN is not already a secondary image of either a sync or async mirror.

Event: The secondary image does not sync
Result and recovery: Make sure that: the connection between the arrays is still good; the recovery policy is set to auto and not manual; and the secondary SP is working. Try manually fracturing and then manually synchronizing the image.


Recovering from serious errors

In the unlikely event that the mechanism for tracking changes made to the primary image fails (for example, insufficient memory available on the SP), the secondary image is marked as permanently fractured. To recover from this situation, you must remove the secondary image from the mirror and then add it again (which does a full resynchronization). This failure may indicate that you are using close to the storage system's capacity for layered features. Some other serious failures will transition MirrorView/S into a degraded mode of operation, where administrative requests are rejected and no further resynchronizations run. Degraded mode affects only a single SP; the other SP in the storage system may continue to run normally (depending on the nature of the failure). When an SP enters degraded mode, the system logs an event that indicates why MirrorView/S is in degraded mode. Usually you can recover from degraded mode by simply rebooting the affected SP, but some specific cases require you to check other components that MirrorView/S uses before rebooting the SP.

Event: Internal memory corruption
Result and recovery: Mirror data does not match the expected value; reboot the SP.

Event: Serious, unexpected errors
Result and recovery: MV/S receives unexpected errors from its underlying components during operation. Check the event log for a record of the errors and take steps to correct them (e.g., if the reserved LUN pool LUNs are faulted, recover them), then reboot the SP.

Event: Internal fracture failure
Result and recovery: A fracture operation fails due to reasons other than an error you made. Check the event log for the appropriate failure reason. Reboot the SP to fix the problem.

How consistency groups handle failures

When a failure occurs during normal operations for consistency groups, MirrorView/S lets you perform several actions to recover. When recovering from failures, MirrorView/S achieves three goals:

• Preserves data integrity
• Minimizes the amount of time that data is unavailable to the user
• Ensures that the consistency of the consistency group is maintained

Access to the SP fails

Consider a consistency group that has member mirrors, some of which SP A controls and some of which SP B controls. If SP A on the primary storage system fails, then software on the attached server (for example, PowerPath) moves control of the mirrors that were controlled by SP A to SP B. This allows applications on the server, as well as the mirroring of data to the secondary storage system, to continue uninterrupted. However, as part of the transfer of control, the consistency group becomes system fractured. If the recovery policy is set to automatic, a synchronization automatically starts on the surviving SP (SP B in this example). However, if the recovery policy is manual, you must manually start a synchronization.

Primary storage system fails

If the storage system running the primary consistency group fails, access to the data in the group's member LUNs is lost. You can either repair the failed storage system and then continue operations, or you can promote the secondary consistency group so as to access the data from the secondary storage system.


Recovering by promoting a secondary consistency group

As part of consistency group promotion, each of the mirror members is promoted. This section describes three types of group promotions, which are based on the connectivity status between the primary and the secondary and the condition of the individual members.

Note: You can promote a consistency group only if it is in the Consistent, Synchronized, or Scrambled state.

Normal promotion

When you request promotion for a secondary group, the software determines whether connectivity exists between the storage systems hosting the primary and secondary consistency groups. If connectivity is working, it tests the members of the group to determine whether the result of promotion will be an out-of-sync group or a synchronized one. The promote operation will fail if the primary is unreachable or the secondary group would be out of sync after promotion. You can then do a local only promote or a force promote, described below.

Force promote

Force promote promotes each member of the group and places the newly promoted mirrors in the group (removing the original members). If the original primary storage system is available, the original primary images become secondary images of the promoted mirrors. The promoted group is marked Out-of-Sync and its recovery policy is set to manual. You must initiate a synchronization for the group in order to start the full synchronization, which is required for the group to once again be protecting your data. If the original primary storage system is unavailable, force promote has the same effect as local only promote, described below.

Note: You must perform a full synchronization on the new secondary image, which will overwrite all existing data on that image.

Local only promote

A local only promote promotes the secondary image of each consistency group member to a primary image, but does not attempt to add the original primary image or any other existing secondary images to the promoting mirror. If a connection exists between the primary and the secondary, then for each member of the primary, the software attempts to remove the image being promoted on the secondary. Thus, the original primary consistency group will have all primary images and no secondary images. If no connection exists, the promote still continues on the secondary, and the operation does not fail. The original primary consistency group cannot communicate with the promoted secondary consistency group even if the MirrorView/S connection between the storage systems is restored.

If a failure occurs during promotion (for example, an SP reboots), the consistency group may be left in an inconsistent state. It is possible that some members have only primary images, or that some have been promoted and others not promoted at all. Check the state of the promoted consistency group to detect any problems during promotion. A consistency group is in the Scrambled state if at least one of its members' primary images is missing its corresponding secondary image.
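The three promotion paths above can be summarized as a small decision sketch (the function name and return strings are ours; the real checks are internal to MirrorView/S):

```python
# A normal promote succeeds only when the primary is reachable AND the group
# would be synchronized afterwards; otherwise the admin falls back to a force
# promote (full sync later, manual recovery) or a local-only promote (the new
# group gets no secondary images).

def group_promotion(primary_reachable, member_states):
    if primary_reachable and all(s == "Synchronized" for s in member_states):
        return "normal"
    return "force or local-only"

assert group_promotion(True, ["Synchronized", "Synchronized"]) == "normal"
assert group_promotion(False, ["Synchronized"]) == "force or local-only"
assert group_promotion(True, ["Consistent", "Synchronized"]) == "force or local-only"
```

The sketch makes the trade-off explicit: only a normal promote preserves the mirror relationship without a later full synchronization.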


Note: The table below lists the configurations in which the Scrambled state can occur.

Note: Either the local only promote or the force promote operation can result in a consistency group that contains mirrors that have no secondary images at all. In this case, the consistency group is no longer performing its function. The best way to correct this is to remove the mirrors from the consistency group, add secondary images as required, and add the mirrors back to the group.

Configuration: Local Only - Consistency group members consist of only primary images. Individual members do not have any secondary images associated with them.
Possible cause: You performed a local only promote, or a force promotion while the connection was down.
Recovery options: Force destroy the consistency group on both arrays. Choose which array you want to be the secondary array and destroy the mirrors that were in the group on that array. Then, on the other array, add secondary images to the mirrors. Create the consistency group and add the mirrors to the group.
Promotable in this state? No, because there are no secondary images associated with the consistency group.

Configuration: Incomplete - If a normal promotion or a force promotion fails in mid-operation, some members may consist of only primary images. The remaining members are successfully promoted and have secondaries associated with them.
Possible cause: A failure occurred during a normal or force promote.
Recovery options: Force the removal of the members with no secondary image and then add secondaries to those mirrors. Add them back into the consistency group as necessary.
Promotable in this state? If the original primary consistency group has a secondary image in it, it is force promotable. If there is a secondary image in the consistency group and it is synchronized or consistent, then promotion is an option.

Recovery policy after promoting

When you promote a mirror that is not part of a consistency group, its recovery policy automatically changes to manual. When you promote a consistency group, the recovery policy does not change. If a group with an automatic recovery policy is in the synchronized state after being promoted, the group remains in full automatic recovery mode. If a group with an automatic recovery policy is out of sync after being promoted, automatic recovery temporarily stops. Thus, the new primary group may have older data than the secondary system (the original primary system). You must decide whether or not to synchronize the group. When you perform a successful group synchronization, the group returns to full automatic recovery mode.

! CAUTION ! After force promoting a consistency group that is out of sync after the promotion, use caution before synchronizing the consistency group. If the consistency group was fractured before the promotion, the new primary system may have data that is much older than the original primary (now secondary) system. Synchronizing the consistency group may overwrite newer data on the original primary system.

Note: Failure scenarios may occur whereby a mirror no longer exists but is still considered a member of the consistency group. In this situation the consistency group is in the Incomplete state. A mirror can be missing from the consistency group if the promotion fails between destroying the original mirror and creating the new one. All the properties of the missing mirrors are shown as unknown. In the Group Properties dialog box and in the group node, the name of the mirror appears as unknown:n, where n is the LUN number. From the Group Properties dialog box you can remove the mirror. If a consistency group is in the Scrambled state, then a promotion is allowed.
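The recovery-policy rules above condense into one hedged sketch (the function name and return strings are ours, not product states):

```python
# A promoted standalone mirror always drops to manual recovery. A promoted
# consistency group keeps its policy, but if it comes out of the promotion
# out of sync, automatic recovery pauses until a successful manual group sync.

def policy_after_promote(is_group, policy, in_sync_after_promote):
    if not is_group:
        return "manual"
    if policy == "automatic" and not in_sync_after_promote:
        return "automatic (paused until manual group sync)"
    return policy

assert policy_after_promote(False, "automatic", True) == "manual"
assert policy_after_promote(True, "automatic", True) == "automatic"
assert policy_after_promote(True, "manual", False) == "manual"
```

In every path that could trigger a full synchronization, the system waits for an explicit administrator decision, which is the safety property the caution above relies on.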


MirrorView/A

There are three user-visible storage system objects used by the MirrorView/A capability: the mirror primary LUN(s), the secondary LUN(s), and the reserved LU(s), which exist on both the primary and secondary storage systems. There is a table for each object, listing a number of events that pertain to it. The Result column describes the outcome when the event occurs while the action is in progress. The image states and image conditions described below apply to all of the mirrors if any one of the mirrors is in a consistency group (e.g., if a mirror fractures, and the mirror is in a consistency group, all the mirrors in the group fracture).

Primary LUN

Action: Server write to a mirror primary LUN.
Event: The server write to the mirror primary LUN fails due to an LCC/BCC failure, cache dirty LUN, etc.
Result: The server write request fails. Server-based path failover software may trespass the mirror primary LUN. If the write error was due to a problem related to the owning SP of the mirror primary LUN, the write will be able to continue on the peer SP (see the trespass event for a mirror primary LUN below). If an update or synchronization to the secondary is not running, nothing happens to the mirroring state. When an update or synchronization is running, failure of a server write does not affect the mirror state; however, since the primary LUN is inaccessible, the mirror will fracture as described below.

Action: Update or synchronization is running on the mirror primary LUN. The storage system reads data from the mirror primary LUN in order to transfer it to the secondary LUN.
Event: The storage system read from the mirror primary LUN fails due to an LCC/BCC failure, cache dirty LUN, etc.
Result: Failure to access the mirror primary LUN causes the image state to be set to "out of sync" (if synchronizing) and the image condition to be set to "administratively fractured" to indicate that administrative action is required. The secondary image state of a mirror that is performing an update, or is between updates, is always "consistent" regardless of the failure scenario, because the golden copy protects the secondary LUN data to a consistent point in time.

Action: Server read from a mirror primary LUN.
Event: The read from the mirror primary LUN fails due to a bad block.
Result: The server read request fails. There is no effect on the mirror secondary associated with the mirror primary LUN.

Action: A storage system generated read from the mirror primary LUN as part of a mirror update or synchronization.
Event: The read from the mirror primary LUN fails due to a bad block.
Result: Failure to read the mirror primary LUN causes the image state to be set to "out of sync" (if synchronizing) and the image condition to be set to "administratively fractured" to indicate that administrative action is required. The secondary image state of a mirror that is performing an update, or is between updates, is always "consistent" regardless of the failure scenario, because the golden copy protects the secondary LUN data to a consistent point in time.

Action: The SP that owns the mirror primary LUN is shut down.
Event: Active I/O to the mirror primary LUN. The SP can be shut down due to a Navisphere command to reboot, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: If the mirror primary LUN is trespassed, I/O can resume on the peer SP (see the trespass action below for more details). Any updates or synchronizations currently in progress will stop until the SP is rebooted, at which time they automatically restart where they left off.

Action: Mirror primary LUN is trespassed.
Event: Active server I/O to the mirror primary LUN; an update to the secondary is not currently running. Trespass of the mirror primary LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the mirror primary LUN is determined to be bad.
Result: Server I/O to the mirror primary LUN can resume on the peer SP. If the secondary LUN associated with the mirror primary LUN is fractured, it is unaffected and will not trespass to the corresponding new owning SP on the remote storage system. If there is mirror connectivity, the mirror secondary is trespassed to the SP corresponding to the new owner of the mirror primary LUN (e.g., if the mirror primary LUN is owned by SP A after the trespass, the associated secondary LUN will be trespassed to SP A).

Action: Mirror primary LUN is trespassed.
Event: Active I/O to the mirror primary LUN; an update to the secondary is currently running. Trespass of the mirror primary LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the mirror primary LUN is determined to be bad.
Result: If there is no mirror connectivity to the remote storage system SP as a result of the trespass, the secondary image condition is set to "system fractured" (or "administratively fractured" if the mirror recovery property is set to manual) and the image state is set to "consistent". If there is mirror connectivity, the mirror secondary is trespassed to the SP corresponding to the new owner of the mirror primary LUN (e.g., if the mirror primary LUN is owned by SP A after the trespass, the associated secondary LUN will be trespassed to SP A). If an initial mirror synchronization was in progress and there is no mirror connectivity to the remote storage system SP as a result of the trespass, the secondary image condition is set to "system fractured" (or "administratively fractured" if the mirror recovery property is set to manual) and the image state is set to "out of sync".
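The trespass behavior in the rows above can be summarized in a small sketch: the secondary follows the primary to the matching SP only when mirror connectivity exists, and otherwise fractures according to the recovery policy. The names below are illustrative, not FLARE internals, and this models only the update-in-progress case.

```python
def trespass_primary(new_owner_sp, connectivity, recovery_policy,
                     initial_sync_in_progress):
    """Model the secondary image's fate when the primary LUN trespasses
    while an update or synchronization is running."""
    if connectivity:
        # Secondary trespasses to the same SP letter as the new primary owner.
        return {"secondary_owner": new_owner_sp,
                "condition": "normal",
                "state": "consistent"}
    # No connectivity: fracture type depends on the recovery policy, and
    # the state depends on whether an initial synchronization was running.
    condition = ("system fractured" if recovery_policy == "automatic"
                 else "administratively fractured")
    state = "out of sync" if initial_sync_in_progress else "consistent"
    return {"secondary_owner": None, "condition": condition, "state": state}
```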

Secondary LUN

Action: Storage system generated write to a mirror secondary LUN. The storage system does this write in order to update the secondary LUN with data that was modified on its mirror primary LUN, or as a result of an initial mirror synchronization.
Event: The write to the secondary LUN fails due to a connectivity issue between the primary storage system and the remote storage system. Connectivity issues can occur due to switch failures (including firmware revision issues), storage system port failures, cable failures, zoning errors, ISL failures, etc.
Result: Until access to the secondary LUN is restored, the image condition is set to "system fractured". If an initial mirror synchronization was in progress, the image state is set to "out of sync"; otherwise it is set to "consistent". As soon as connectivity is restored, the secondary LUN is automatically and incrementally updated or synchronized (provided the mirror recovery property is set to automatic, the default). If the recovery policy is set to manual, the image condition is set to "administratively fractured". The administrator may trespass the mirror primary LUN in an attempt to restore connectivity to the secondary LUN (a secondary LUN cannot be trespassed directly); see the mirror primary LUN trespass event above.

Action: Storage system generated write to a mirror secondary LUN. The storage system does this write in order to update the secondary LUN with data that was modified on its mirror primary LUN, or as a result of an initial mirror synchronization.
Event: The write to the secondary LUN fails due to an LCC/BCC failure, cache dirty LUN, etc.
Result: Until access to the secondary LUN is restored, the image condition is set to "administratively fractured". If an initial mirror synchronization was in progress, the image state is set to "out of sync"; otherwise it is set to "consistent". Once the secondary LUN is repaired, the administrator must manually restart the update/synchronization. The administrator may trespass the mirror primary LUN in an attempt to restore access to the secondary LUN (a secondary LUN cannot be trespassed directly); see the mirror primary LUN trespass event above.


Action: Storage system generated read from a mirror secondary LUN. The storage system does this read in order to maintain a golden copy of the data on the secondary LUN while an update is in progress.
Event: The read from the secondary LUN fails due to a bad block.
Result: The image condition is set to "administratively fractured" to indicate that administrative action is required. The administrator can start a new incremental update after the secondary LUN is repaired.

Action: The SP that owns the secondary LUN is shut down.
Event: Active I/O to the mirror secondary LUN (update or initial synchronization in progress). The SP can be shut down due to a Navisphere command to reboot, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: A failure of the remote storage system SP that owns the secondary LUN is handled exactly the same as a connectivity event, as described above.

Action: Mirror secondary LUN is promoted to the role of primary.
Event: There is a mirror connectivity issue between the primary storage system and the remote storage system. Connectivity issues can occur due to switch failures (including firmware revision issues), storage system port failures, cable failures, zoning errors, ISL failures, etc.
Result: The secondary LUN is promoted successfully. The promoted secondary LUN becomes a different mirror from the "old" primary LUN. There are three types of promote (normal, local only, and forced); however, when there is a connectivity issue, only the forced promote or local only promote options can be used.

Reserved LU (Primary)

Action: Server write to a mirror primary LUN. The storage system may need to perform a copy on first write to preserve the point-in-time data to be transferred during an update to the mirror secondary LUN. The storage system also needs to track disk regions that will need to be transferred, or to clear disk regions that were previously tracked after the data has been updated. This entails I/Os to reserved LU(s) before the server write to the mirror primary LUN can proceed.
Event: An I/O to a reserved LU fails due to an LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from the reserved LU due to a bad block.
Result: The server write request succeeds. All allocated reserved LUs associated with the mirror primary LUN are freed back to the reserved LU pool. The image condition is set to "administratively fractured" to indicate that administrative action is required. Since the tracked data is lost, after the reserved LU(s) are repaired or removed from the pool, the administrator must destroy and recreate the mirror. This condition is referred to as a permanently fractured mirror.


Action: Server write to a mirror primary LUN. The storage system may need to perform a copy on first write to preserve the point-in-time data to be transferred during an update to the mirror secondary LUN. The storage system also needs to track disk regions that will need to be transferred, or to clear disk regions that were previously tracked after the data has been updated. This entails I/Os to reserved LU(s) before the server write to the mirror primary LUN can proceed.
Event: There is no space left in any assigned reserved LU and no more free reserved LUs in the SP pool.
Result: The server write request succeeds. The image condition is set to "administratively fractured" to indicate that administrative action is required. After additional free reserved LU(s) are added to the pool, the administrator can restart an update or synchronization. It may not be necessary to add more reserved LU(s) for the update to succeed, as restarting an update frees reserved LU space in the LUs associated with the mirror primary. Since the tracked data is not lost, the restarted update/synchronization will be incremental.

Action: The SP that owns the mirror primary LUN is shut down (note that the SP owner of all the reserved LUs will always be the same as the mirror primary LUN SP owner).
Event: Active I/O to a mirror primary LUN, which generates I/Os to the associated reserved LUs. The SP can be shut down due to a Navisphere command to reboot, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: See the description above under the SP shutdown event for the mirror primary LUN.
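The two primary-side reserved-LU failure modes above differ in one critical way: an I/O failure loses the tracking data (a permanent fracture requiring the mirror to be destroyed and recreated), while pool exhaustion preserves it (a restarted update is incremental). A minimal sketch of that distinction, with invented names for illustration only:

```python
def reserved_lu_event(event):
    """Model the outcome of the two primary-side reserved-LU events."""
    if event == "io_failure":
        # LCC/BCC failure, cache dirty LUN, or bad block on a reserved LU:
        # tracking data is lost -> permanently fractured mirror.
        return {"condition": "administratively fractured",
                "tracking_lost": True,
                "recovery": "destroy and recreate the mirror"}
    if event == "pool_exhausted":
        # No free space in any assigned reserved LU and no free LUs in
        # the SP pool: tracking survives, so recovery is incremental.
        return {"condition": "administratively fractured",
                "tracking_lost": False,
                "recovery": "add reserved LUs, restart update (incremental)"}
    raise ValueError(f"unknown event: {event}")
```

Both events leave the image administratively fractured; only the recovery path differs.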

Reserved LU (Secondary)

Action: As part of an update to a mirror secondary LUN, the storage system may need to perform a copy on first write to preserve a "golden" copy of the data. This entails I/Os to the reserved LU(s) associated with the mirror secondary LUN.
Event: An I/O to a reserved LU fails due to an LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from the reserved LU due to a bad block.
Result: The image condition is set to "administratively fractured" to indicate that administrative action is required. After the reserved LU(s) are repaired or removed from the pool, the administrator can manually restart the update/synchronization. It is possible that issuing a trespass command to the mirror primary LUN (secondary LUNs and their associated reserved LUs cannot be trespassed directly) will enable access to the associated reserved LUs on the secondary. The "golden" copy is maintained throughout this event, so the secondary can be promoted if needed.

Action: As part of an update to a mirror secondary LUN, the storage system may need to perform a copy on first write to preserve a "golden" copy of the data. This entails I/Os to the reserved LU(s) associated with the mirror secondary LUN.
Event: There is no space left in any assigned reserved LU and no more free reserved LUs in the SP pool.
Result: The image condition is set to "administratively fractured" to indicate that administrative action is required. After additional free reserved LU(s) are added to the pool, the administrator can restart an update. Since the tracked data is not lost, the update will be incremental. The "golden" copy is maintained throughout this event, so the secondary can be promoted if needed.

Action: Promote of a mirror secondary. If an update was in progress, it is aborted, and the promote will require I/Os to the reserved LUs associated with the secondary LUN in order to restore the secondary LUN data to the data preserved in the "golden" copy.
Event: An I/O to a reserved LU fails due to an LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from a reserved LU due to a bad block. After the promote operation has succeeded, server I/O may be occurring while the "golden" copy is being restored by the storage system in the background.
Result: If the server I/O request was a read that required data to be returned from the reserved LU (not the newly promoted mirror primary LUN), and the region to be read failed due to a bad block, the server read request fails. Server write requests proceed. See the failure scenarios in the SnapView snapshot section under rollback for more details. The background "golden" copy restore process continues. Blocks that were bad in any associated reserved LU(s) will have the corresponding blocks marked bad on the newly promoted mirror primary LUN (even though the disk region on the LUN is good) to ensure the integrity of the "golden" copy data.

Action: The SP that owns the mirror secondary LUN is shut down (note that the SP owner of all the reserved LUs will always be the same as the mirror secondary LUN owner).
Event: An update or initial synchronization to the mirror secondary LUN is in progress. The SP can be shut down due to a Navisphere command to reboot, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: Any updates or initial synchronizations currently in progress will stop. All secondary image states, for mirrors owned by the SP that has shut down, are set to "out of sync" (if an initial synchronization was in progress) or "consistent", and the image conditions are set to either "system fractured" or "administratively fractured" (if the automatic recovery property is not selected). Once the SP reboots, after a reserved LU failure during an update, each "golden" copy associated with the failed reserved LU will be lost (a "purged" event log message is generated in this special case). Manually restarting each affected update will correct the secondary in this case; however, this one update will be unprotected by a "golden" copy. Any background recoveries of the "golden" copy that were in progress automatically continue on the peer SP.


How MirrorView/A handles failures

When a failure occurs during normal operations, MirrorView/A lets you perform several actions to recover. In recovering from failures, MirrorView/A achieves two goals:

• Preserves data integrity
• Minimizes the amount of time that data is unavailable to the user

Access to the primary SP fails

If an SP that owns mirrored LUNs on the primary system fails, MirrorView/A on the other SP takes ownership of those mirrored LUNs by trespassing them when something on the server (such as PowerPath) initiates the trespass. This allows mirroring to continue, provided the server is set up properly to handle failover (for example, a Windows server with PowerPath). When the primary LUN is trespassed, MirrorView/A sends a trespass request to any secondary images when the next update starts.

Therefore, you may notice that the mirrored LUNs on the secondary system have moved from SP A to SP B, or vice versa. MirrorView/A keeps the SP ownership the same on the primary and secondary systems during updates: if the primary image is on SP A, then the secondary image will be on SP A. This may not occur until the start of the next update.

Primary image fails

If the storage system controlling the primary image fails, access to the mirror stops until you either repair the storage system or promote a secondary image of the mirror to primary.


Promoting a secondary image to a primary image

In situations where you must replace the primary storage system due to a catastrophic failure, you can use a promotion to access data on the secondary storage system. To recover and restore I/O access, you must promote a secondary mirror image to the role of primary mirror image so that a server can access it.

Note: You can also promote a secondary image even if there has not been a catastrophic failure. If the primary image and secondary image can communicate with each other, then when the secondary image is promoted, the former primary image is demoted to a secondary image.

To promote a secondary image, the following conditions must be true:

• You must direct the navicli.jar mirror commands to the storage system holding the secondary image.
• The state of the secondary image you will promote must be either Consistent or Synchronized.
• An update is not currently transferring data for this mirror.

! CAUTION !
Promoting a secondary image causes the loss of any data written to the primary after the start of the last completed update. If any updates have been made to the primary image since that time, a full resynchronization of the mirror will be required after the promotion. Also, if an update is currently active (that is, transferring data), the promotion will not be allowed; allow the update to complete and the image to transition into the Synchronized state, then perform the promotion. An alternative to allowing the update to complete is to fracture the image.

In a failure situation, before promoting a secondary image to a primary image:

1. If the existing primary image is accessible, remove the primary image from any storage groups before promoting the secondary image to avoid I/O and therefore inconsistent data.

2. Ensure that no I/O, either generated from a server or by an update in progress, is occurring on the asynchronous mirror.
3. If the existing primary is available, make sure that it lists the secondary image that is to be promoted as "synchronized."
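The promotion preconditions listed above can be sketched as a simple check. This is an illustrative model only; the state strings and parameter names are invented, not Navisphere API identifiers.

```python
def can_promote(directed_at_secondary_array, image_state, update_transferring):
    """Return (allowed, reason) for a secondary-image promotion attempt.

    directed_at_secondary_array : command is issued to the array holding
                                  the secondary image
    image_state                 : "consistent", "synchronized", or other
    update_transferring         : an update is actively transferring data
    """
    if not directed_at_secondary_array:
        return False, "direct the command at the array holding the secondary"
    if image_state not in ("consistent", "synchronized"):
        return False, "image must be Consistent or Synchronized"
    if update_transferring:
        return False, "wait for the active update to complete, or fracture"
    return True, "ok"
```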

To promote a secondary image to a primary image:

1. Issue the mirror -async -promoteimage command.
   Note: If you do not specify the -type switch, the command performs a normal promote.

2. If the original primary storage system failed, remove the primary storage system from the domain.
3. Add the newly promoted image to a storage group if necessary.

At some point later, you can also perform the following steps:

1. Verify that the failed storage system is not the master of the domain. If it is, assign another storage system to be the master. See the EMC Navisphere Command Line Interface (CLI) Reference.

2. Verify that the failed storage system is not a portal. If it is a portal, remove the portal and configure a different storage system as a portal. See the EMC Navisphere Command Line Interface (CLI) Reference.


The following scenarios illustrate examples of promoting a mirror.

Scenario 1

You attempt to promote a mirror that has a secondary image, but the connection between the storage systems is not working. The secondary image indicates that it is synchronized when it is actually system fractured and consistent. An error, "Existing image unreachable", appears. You can investigate the reason for the loss of connectivity and correct the problem before continuing with the promotion, or you can select the Local Only Promote option to complete the promotion. If you select Local Only Promote, the software promotes the local mirror and attempts to contact the original primary image and remove the promoted image from the mirror. In the case described here, it cannot contact the other storage system, so it converts the local image to a primary image in a mirror with no secondary images.

Note: In this scenario, a Force Promote has exactly the same effect as the Local Only Promote operation.

Since the software cannot contact the remote storage system, the original mirror still exists on the storage system originally hosting the primary image. However, even if connectivity is restored, any attempt to start an update will fail (since the secondary has been promoted), and the secondary image will remain administratively fractured forever. You should use Force Destroy to remove this original mirror.

Scenario 2

You attempt to promote a mirror whose secondary image is in the Consistent state. An error, "Existing primary will be out-of-sync", appears. If possible, allow the secondary to go to the Synchronized state (for example, stop application I/O to the primary image, flush data from the server, start an update, and wait for it to complete). You can then promote the secondary without requiring a full resynchronization. Otherwise, you can select either the Force Promote or the Local Only Promote option to continue the promotion. In either case, you must perform a full resynchronization before the mirror again provides protection for your data.

Running MirrorView/A on a VMware ESX Server

When you use MirrorView/A on a VMware ESX Server, after you promote the secondary image to a primary, perform the following steps:

1. Assign the newly promoted primary image to a storage group of the same or a standby ESX Server.
2. Rescan the bus at the ESX Server level.
3. Create a Virtual Machine (VM) on the same or a standby ESX Server.
4. Assign the newly promoted primary to the VM. Assign it to a different VM unless you remove the failed primary, in which case you can assign it to the same VM.
5. Power up the VM.

If the VM is created and running, perform these steps:

1. Power it down.
2. Use the Service Console on the ESX Server to assign the newly promoted primary to the powered-down VM.
3. Power up the VM.

The primary image (which is now the secondary image) will not be accessible to the primary ESX Server.


Recovering by promoting a secondary image

When you promote the secondary image, the software assigns a new mirror ID to the promoted image to distinguish it from the old mirror, even though the mirrors have the same name. The new image condition of the old primary image depends on whether the old primary image is accessible at the time of promotion. If the existing primary image is accessible when you promote, the software attempts to add the old primary image as a secondary image of the promoted mirror; that is, the images swap roles.

If the primary image is not accessible when you promote, the software creates a new mirror with the former secondary image as the new primary image and no secondary image, as shown in the example below. The mirror on the original primary storage system does not change, and so continues to have a stale record of the former secondary. You must remove the original mirror with the mirror -async -destroy -force command once the original primary storage system is available again.

Mirror before promotion:
  Mirror ID = aaa
  Primary image = LUN xxxx
  Secondary image = LUN yyyy

Mirror after promotion:
  Mirror ID = bbb
  Primary image = LUN yyyy
  Secondary image = none

Restoring the original mirror configuration after recovery of a failed primary image

If the original mirror becomes accessible following a failure and the mirror's secondary image has been promoted, the old mirror will be unable to communicate with the new one. To restore the mirrors to their original configuration, do the following:

1. If you want to retain any data on the original primary LUN, copy it to another LUN before continuing, or alternatively, you can create a LUN that will become the primary LUN. The following process overwrites the contents of the original primary LUN.

2. Remove the original primary LUN from any storage groups of which it is a member.
3. Destroy the original mirror using the mirror -async -destroy -force command.

Original mirror: destroyed. The original LUN used for the primary image remains (LUN xxxx).

New mirror:
  Primary image = LUN yyyy
  Secondary image = none

4. Add a secondary image to the new mirror using the LUN that was the primary image for the original mirror (LUN xxxx).

! CAUTION !
Data from the promoted LUN will overwrite all the data in the secondary image (original primary) LUN.

The secondary image synchronizes automatically. Allow the synchronization to complete.

New mirror:
  Primary image = LUN yyyy
  Secondary image = LUN xxxx

! CAUTION !
Allow the image to transition to the Synchronized state following the synchronization. If the image is in the Consistent state when you promote it, another full resynchronization is required, and data may be lost.


5. Promote the secondary image (LUN xxxx) in the new mirror to primary. If you attempt promotion and the system indicates that the resulting mirror would be out-of-sync, do not complete the promotion. Instead, determine why the images are potentially different. If necessary, start an update of the mirror, wait for it to complete, and then wait for the secondary image to transition to the Synchronized state. Then you can retry the promotion.

The new mirror now has the same configuration as the original mirror:

New mirror:
  Primary image = LUN yyyy
  Secondary image = LUN xxxx

During a promotion, the recovery policy for a secondary image is always set to manual recovery. This prevents a full synchronization from starting until you want it to.

6. If required, reset the recovery policy back to automatic.

Recovering without promoting a secondary image

If the primary storage system fails, but can be readily repaired, recovery is simpler. MirrorView/A records any writes that had completed before the failure and transfers them to the remote image when the next update occurs. Any writes that were sent to the storage system but not yet acknowledged may be lost, but application-specific recovery techniques, such as chkdsk or fsck for filesystems, can usually correct any issues.

To recover without promoting a secondary image, follow these steps:

1. Repair the primary storage system and/or server.
2. Fracture the asynchronous mirror(s).
3. Complete any necessary application-specific recovery of the data on the primary image.
4. Make sure that the data is flushed from the server to the storage system.
5. Synchronize the asynchronous mirror(s).


Failure of the secondary image

When a primary image cannot communicate with a secondary image, it marks the secondary as unreachable and stops updating the secondary image. The secondary image is marked System Fractured. The loss of communication may be due to a failure of the link between the storage systems, an SP failure on the secondary storage system, or some other failure on the secondary storage system. In the event of a communication failure, the secondary image remains a member of the mirror.

If the mirror is set for automatic recovery, an update starts automatically once the secondary storage system is again accessible. Otherwise, you must start the update manually.
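The reconnect behavior just described reduces to a one-line policy check. A minimal sketch, with illustrative names only:

```python
def on_connectivity_restored(recovery_policy):
    """Model what happens when a system-fractured secondary becomes
    reachable again, per the mirror's recovery policy."""
    if recovery_policy == "automatic":
        # Default policy: an incremental update restarts on its own.
        return "incremental update starts automatically"
    # Manual policy: the image waits for an administrator action.
    return "secondary stays fractured until an update is started manually"
```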

Promoting a secondary image when there is no failure

You may want to promote your secondary image even if no failure occurs on the storage systems. For example, you may want to test your disaster recovery procedure before a real failure occurs, or perhaps the server attached to the primary storage system failed and you must resume operations using the server attached to the secondary storage system. If the original primary is accessible when you promote the secondary, the software verifies whether the images are identical. If possible, the secondary image should be in the Synchronized state (stop application I/O, flush data from the servers, start an update, and wait for it to complete). If the images are identical, they swap roles, resynchronization is not necessary, and the promotion is complete. If the images are potentially different (that is, the secondary image is not in the Synchronized state), then you must specify the type of promotion. As part of a promotion, any secondary images in the new mirror are set to manual recovery.

Mirror before promotion:
    Mirror ID = aaa
    Primary image = LUN xxxx
    Secondary image = LUN yyyy

Mirror after promotion:
    Mirror ID = bbb
    Primary image = LUN yyyy
    Secondary image = none

If the images are not synchronized, you can specify a force promote (out-of-sync, or "oos"), a local promote, or no promote. Both the oos promote and the local promote require a full resynchronization of the data before mirrored protection is again in effect.
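The promotion decision above can be condensed into a small decision function. This is an illustrative Python model, not EMC code; the state and option names follow the text.

```python
# Illustrative model of secondary-image promotion (not EMC code).
# If the primary is reachable and the secondary is Synchronized,
# the images swap roles with no resynchronization; otherwise the
# administrator must choose among oos promote, local promote, or
# no promote, and the first two require a full resync afterwards.
def promotion_options(primary_reachable, secondary_state):
    if primary_reachable and secondary_state == "Synchronized":
        return ["swap roles (no resync needed)"]
    return ["oos promote (full resync)",
            "local promote (full resync)",
            "do not promote"]

print(promotion_options(True, "Synchronized"))
print(promotion_options(True, "Consistent"))
```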


Summary of MirrorView/A failures

The table below shows how MirrorView/A might help you recover from a system failure at the primary or secondary site. It assumes that the secondary image of the mirror is in either the Synchronized or Consistent state.

Event: Loss of access to primary image LUN
Result and recovery: Check connections between the server and storage system, including zoning and correct operation of switches. Check for an SP reboot.

Event: Secondary SP is rebooted
Result and recovery: If the secondary SP reboots (for example, due to a software failure, an explicit command, or the SP being physically pulled and reseated), you may see the secondary image become system fractured. It is also possible for the secondary to become administratively fractured, in which case simply synchronize the image.

Event: Server accessing primary image fails
Result and recovery: Catastrophic failure. I/O stops. After a defined period of time, all nonfractured secondaries in the Consistent state transition to the Synchronized state. Nothing more happens until the server is repaired or replaced or a secondary image is promoted.

Event: Array running primary image fails
Result and recovery:
Option 1 - Catastrophic failure. The mirror is left in the state it was already in. If the secondary image is in either the Consistent or Synchronized state, it may be promoted to provide access to your data. Note: Any writes in progress when the primary image fails may not propagate to the secondary image. Also, if the remote image was fractured at the time of the failure, any writes since the fracture will not have propagated.
Option 2 - Non-catastrophic failure, repair is feasible. The admin has the problem fixed, and normal production operation can resume. The write intent log, if used, shortens the sync time needed. If the write intent log is not used, a full sync is needed.
Option 3 - Only one SP fails. If the SP that controls the mirror fails, software on the server (for example, PowerPath) can detect the failure. This software can cause control of the mirror to be transferred to the surviving SP so that normal operations can continue. If such software is not running on the server, then you must either manually transfer control using Navisphere, or access to the mirror stops until the SP is back in service. If the SP that does not control the LUN fails, then mirroring continues as normal.

Event: Array running secondary image fails
Result and recovery:
- If the SP that does not control the secondary image fails, nothing happens with respect to this mirror.
- If the SP that controls the mirror fails (or both SPs fail, or a catastrophic failure of the entire system occurs), the primary system will fracture the image hosted on this array. The mirror may consequently go to the Attention state (if it is so configured), but I/O continues as normal to the primary image. The admin has a choice: if the secondary can easily be fixed (for example, if someone pulled out a cable), the admin can have it fixed and let things resume. Otherwise, to regain protection of your data, if another array is available you can force destroy the existing mirror, recreate it, and add a secondary image on the other working array. Protection is not established until the full sync of the secondary image completes.

Event: Loss of connection between arrays (indicated by system fractures)
Result and recovery: Check the cables, make sure that all SPs are still working, and make sure the MirrorView path between the arrays is still enabled and active. Check for correct zoning and other correct function of any switches.

Failures when adding secondary images

Make sure that:

• The connection between the arrays works.
• You are managing both arrays, which may require managing two domains.
• The secondary LUN is available and the same size in blocks as the primary image.
• The secondary image LUN is not in a storage group.
• The secondary LUN is not part of a clone group.
• The secondary image LUN is not already a secondary image of either a sync or async mirror.
• The RLP (reserved LUN pool) on both the primary and secondary arrays is adequately configured.
• The secondary LUN is not set up as a destination for SAN Copy.
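The checklist above amounts to a set of preconditions that must all hold before a LUN can be added as a secondary image. A hypothetical Python sketch of such a pre-flight check follows; the field names are illustrative, not a real Navisphere API.

```python
# Hypothetical pre-flight check for adding a secondary image; the
# candidate LUN is described as a plain dict with illustrative fields.
def secondary_add_blockers(candidate):
    checks = {
        "array connection down": not candidate["connection_up"],
        "size (in blocks) differs from primary": (
            candidate["blocks"] != candidate["primary_blocks"]),
        "LUN is in a storage group": candidate["in_storage_group"],
        "LUN is in a clone group": candidate["in_clone_group"],
        "LUN is already a secondary image": candidate["already_secondary"],
        "reserved LUN pool not configured": not candidate["rlp_configured"],
        "LUN is a SAN Copy destination": candidate["san_copy_destination"],
    }
    return [reason for reason, failed in checks.items() if failed]

candidate = {"connection_up": True, "blocks": 2097152,
             "primary_blocks": 2097152, "in_storage_group": False,
             "in_clone_group": False, "already_secondary": False,
             "rlp_configured": True, "san_copy_destination": False}
print(secondary_add_blockers(candidate))  # an empty list means the add can proceed
```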

When the secondary image does not sync

Make sure that:

• The connection between the arrays is still good.
• The recovery policy is set to auto, not manual.
• The secondary SP is working.

Try manually fracturing and then manually synchronizing the image.


Recovering from serious errors

In the unlikely event that the mechanism for tracking changes made to the primary image fails (for example, insufficient memory available on the SP), the secondary image is marked as permanently fractured. To recover from this situation, you must remove the secondary image from the mirror and then re-add it (which performs a full resynchronization). This failure may indicate that you are using close to the storage system's capacity for layered features.

Some other serious failures transition MirrorView/A into a degraded mode of operation, in which administrative requests are rejected and no further updates run. Degraded mode affects only a single SP; the other SP in the storage system may continue to run normally (depending on the nature of the failure). When an SP enters degraded mode, the system logs an event that indicates why MirrorView/A is in degraded mode. Usually you can recover from degraded mode by simply rebooting the affected SP, but some specific cases require you to check other components that MirrorView/A uses before rebooting the SP. The table below lists various scenarios in which MirrorView/A goes to degraded mode and the recovery options you can take.

Table - Recovery from degraded mode

Event: Internal memory corruption
Result and recovery: Mirror data does not match the expected value; reboot the SP.

Event: Serious, unexpected errors
Result and recovery: MV/A receives unexpected errors from its underlying components during operation. Check the event log for a record of the errors and take steps to correct them. For example, if the reserved LUN pool LUNs are faulted, recover them, then reboot the SP.

Event: Internal fracture failure
Result and recovery: A fracture operation fails for reasons other than an error you made. Check the event log for the appropriate failure reason. Reboot the SP to fix the problem.

How consistency groups handle failures When a failure occurs during normal operations for consistency groups, MirrorView/A lets you perform several actions to recover. When recovering from failures, MirrorView/A achieves three goals:

• Preserves data integrity • Minimizes the amount of time that data is unavailable to the user • Ensures that the consistency of the consistency group is maintained

Access to the SP fails

Consider a consistency group that has member mirrors, some of which SP A controls and some of which SP B controls. If SP A on the primary storage system fails, then software on the attached server, for example PowerPath, moves control of the mirrors that were controlled by SP A to SP B. This allows applications on the server, as well as the mirroring of data to the secondary storage system, to continue uninterrupted. However, as part of the transfer of control, the consistency group becomes system fractured. If the recovery policy is set to automatic, an update automatically starts on the surviving SP (SP B in this example). However, if the recovery policy is manual, you must manually start an update.

Primary storage system fails

If the storage system running the primary consistency group fails, access to the data in the group's member LUNs is lost. You can either repair the failed storage system and then continue operations, or you can promote the secondary consistency group, so as to access the data from the secondary storage system.


Recovering by promoting a secondary consistency group

As part of a consistency group promotion, each of the mirror members is promoted. This section describes three types of group promotions, which are based on the connectivity status between the primary and the secondary and on the condition of the individual members.

Normal promotion

When you request promotion of a secondary consistency group, the software determines whether connectivity exists between the storage systems hosting the primary and secondary consistency groups. If connectivity is working, it tests the members of the group to determine whether the result of promotion will be an out-of-sync group or a synchronized group. The promote operation fails if the primary is unreachable or if the secondary group would be out-of-sync after promotion. You can then perform a local only promote or a force promote, described below.

Force promote

A force promote promotes each member of the group and places the newly promoted mirrors in the group (removing the original members). If the original primary storage system is available, the original primary images become secondary images of the promoted mirrors. The promoted group is marked Out-of-Sync and its recovery policy is set to manual. You must initiate an update for the group in order to start the full update, which is required for the group to once again be protecting your data. If the original primary storage system is unavailable, a force promote has the same effect as a local only promote, described below.

Important: You must perform a full update on the new secondary image, which will overwrite all existing data on that image.

Local only promote

A local only promote promotes the secondary image of each consistency group member to a primary image, but does not attempt to add the original primary image or any other existing secondary images to the promoting mirror. If a connection exists between the primary and the secondary, then for each member of the primary, the software attempts to remove the image being promoted on the secondary. Thus, the old primary consistency group will have all primary images and no secondary images. If no connection exists, the promote still continues on the secondary, and the operation does not fail. The original primary consistency group cannot communicate with the promoted secondary consistency group even if the MirrorView/A connection between the storage systems is restored (since the secondary consistency group was promoted to a primary consistency group).

If a failure occurs during promotion (for example, an SP reboots), the consistency group may be left in an inconsistent state. It is possible that some members have only primary images, or that some have been promoted and others not promoted at all. Check the state of the promoted consistency group to detect any problems during promotion. A consistency group is in the scrambled state if at least one of its members' primary images is missing its corresponding secondary image.

Note: The table below lists the configurations in which the scrambled state can occur.

Note: Either the Local Only Promote or the Force Promote operation can result in a consistency group that contains mirrors that have no secondary images at all. In this case, the consistency group is no longer performing its function. The best way to correct this is to remove the mirrors from the consistency group, add secondary images as required, and add the mirrors back to the group.


Table - Configurations in which the scrambled state can occur

Configuration 1: Consistency group members consist of only primary images. Individual members do not have any secondary images associated with them.
Ways to reach this state: After a local promote.
Recovery options: Force removal of each member from the consistency group, add secondary images to each mirror, and add the mirrors to the consistency group again.
Is the consistency group promotable in this state? No, because there are no secondary images associated with the consistency group.

Configuration 2: A normal promotion or a force promotion fails in mid-operation, so some members consist of only primary images. The remaining members are successfully promoted and have secondaries associated with them.
Ways to reach this state: After a failed normal promotion or out-of-sync promotion. Failure can occur by pulling the SP to which the promotion was directed.
Recovery options: Force the removal of the members with no secondary image and then add secondaries to those mirrors. Add them back into the consistency group as necessary.
Is the consistency group promotable in this state? It is not promotable from the old primary until you remove the consistency group members that lack a secondary image. However, you can issue a local promotion on the old primary in this case.

Configuration 3: Any type of promotion fails in mid-operation, so some members consist of only primary images. The remaining members are not successfully promoted.
Ways to reach this state: After a promotion fails on the local SP before you attempt a remote promotion.
Recovery options: Force the removal of the members with no secondary image, add secondaries to those mirrors, and add them back into the consistency group as needed.
Is the consistency group promotable in this state? Not until you remove the consistency group members that lack a secondary image. You can issue a Force Promote again in order to promote the mirrors that were not promoted.
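The scrambled state described above reduces to a simple membership test: the group is scrambled if any member mirror's primary image lacks a corresponding secondary image. An illustrative sketch, with a hypothetical data layout:

```python
# Illustrative scrambled-state check (the dict layout is hypothetical,
# not a real Navisphere object model): a consistency group is scrambled
# if any member's primary image has no corresponding secondary image.
def is_scrambled(members):
    return any(not m["secondary_images"] for m in members)

group = [{"name": "m1", "secondary_images": ["LUN 20"]},
         {"name": "m2", "secondary_images": []}]   # m2 lost its secondary
print(is_scrambled(group))  # True
```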

Failure of the secondary consistency group

When a primary cannot communicate with a secondary consistency group, the group's condition changes to system fractured. When a consistency group is system fractured, no writes are propagated to the secondary consistency group. The primary storage system attempts to minimize the amount of work required to synchronize the secondary after it recovers: it keeps track of the write requests to the consistency group, so that only modified areas are copied to the secondary during recovery.

Also, consider the case where the consistency group has some members whose primary image LUNs reside on SP A and some on SP B. If the MirrorView/A connection is broken between the SP Bs of the primary and the secondary storage system, the consistency group is system fractured to maintain the consistent state of the consistency group.
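The change tracking described above can be pictured as a dirty-region set: while the group is fractured, writes only mark regions, and recovery copies just the marked regions rather than the whole LUN. A minimal sketch follows; the region size here is arbitrary (real extent sizes are internal to FLARE).

```python
# Minimal model of fractured-state change tracking (illustrative).
class FractureTracker:
    def __init__(self, region_size=64 * 1024):
        self.region_size = region_size
        self.dirty = set()          # region indexes touched while fractured

    def record_write(self, offset, length):
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        self.dirty.update(range(first, last + 1))

    def regions_to_copy(self):
        # Only these regions need to be sent during recovery,
        # rather than re-copying the entire LUN.
        return sorted(self.dirty)

t = FractureTracker()
t.record_write(0, 512)            # touches region 0
t.record_write(65536, 8192)       # touches region 1
print(t.regions_to_copy())        # [0, 1]
```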


SAN Copy

There are three user-visible storage system objects used by the SAN Copy capability: source LUN(s), destination LUN(s), and reserved LU(s). The reserved LUs exist to provide the incremental SAN Copy capability on the SAN Copy storage system. The failure events for a full SAN Copy are different than for an incremental SAN Copy, so to reduce complexity, the tables for full SAN Copy are separated from the incremental SAN Copy tables. There is a table for each object, and a number of events pertain to each object. The Result column describes the outcome of the event occurring while the action is in progress.

Source LUN (Full Copy)

Action: The SAN Copy storage system reads data from the source LUN in order to write the data to the destination LUN.
Event: A read from the source LUN fails due to an LCC/BCC failure, a cache dirty LUN, or some other storage system failure that prevents read access to the source LUN. Note: the source LUN can be on a remote storage system, which may or may not be a CLARiiON.
Result: Failure to access the source LUN causes the SAN Copy session to fail. After access to the source LUN is repaired, the administrator can restart the copy at the last checkpoint (SAN Copy periodically records copy progress to avoid having to restart copies from the beginning).

Action: The SAN Copy storage system reads data from the source LUN in order to transfer the data to the destination LUN.
Event: The read from the source LUN fails due to a bad block.
Result: Failure to read the source LUN causes the SAN Copy session to fail. An event log message is created with information on the offending disk block. Once the bad block is repaired, the copy session can be restarted at the last checkpoint.

Action: The SP on the SAN Copy storage system that is processing the copy session is shut down.
Event: Active SAN Copy session. The SP can be shut down due to a Navisphere reboot command, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: The copy session fails. The session state is set to "halted on reboot". When the SP is rebooted, the state is set to "auto-recovery in progress" and the session resumes automatically from the last checkpoint.

Action: The source LUN is trespassed.
Event: Active copy session on the source LUN. Trespass of the source LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the source LUN is determined to be bad.
Result: Any active session fails. Once the source LUN is trespassed back to the previous owning SP, or the session is moved to the peer SP (provided there is connectivity to the destination LUN from the peer SP) and any destination LUNs on the SAN Copy storage system are trespassed, the copy session can be restarted at the last checkpoint.
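The checkpoint behavior that recurs in the tables above (restart "at the last checkpoint" rather than from block zero) can be sketched as a copy loop that persists its progress periodically. The interfaces here are hypothetical, not SAN Copy's internal API:

```python
# Sketch of a checkpointed copy loop (illustrative, not SAN Copy code).
def copy_with_checkpoints(read_block, write_block, total_blocks,
                          resume_from, save_checkpoint, interval=100):
    block = resume_from                 # restart at the last checkpoint
    while block < total_blocks:
        write_block(block, read_block(block))
        block += 1
        if block % interval == 0:
            save_checkpoint(block)      # periodic progress record
    save_checkpoint(block)              # final checkpoint at completion
    return block

# Simulated media: copy 250 blocks, checkpointing every 100.
src = list(range(250))
dst = [None] * 250
saved = []
done = copy_with_checkpoints(lambda b: src[b],
                             lambda b, v: dst.__setitem__(b, v),
                             250, 0, saved.append)
print(done, saved)   # 250 [100, 200, 250]
```

A failed session would re-run the same loop with `resume_from` set to the last saved checkpoint, skipping the blocks already copied.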

Destination LUN (Full Copy)

Action: The SAN Copy storage system issues write requests to the destination LUN(s) in order to copy the data from the source LUN.
Event: A write to the destination LUN fails due to an LCC/BCC failure, a cache dirty LUN, or some other storage system failure that prevents write access to the destination LUN. Note: the destination LUN can be on a remote storage system, which may or may not be a CLARiiON. The write to the storage system where one or more of the destination LUNs reside (assuming the copy was to a remote storage system) could also fail due to a connectivity issue. Connectivity issues can occur due to switch failures (including firmware revision issues), storage system port failures, cable failures, zoning errors, ISL failures, etc.
Result: Failure to write to a destination LUN causes the SAN Copy session to fail for that specific destination LUN (other destination LUNs associated with the same session may succeed). Depending on the connectivity issue, it could take as long as 5 minutes for the failure to occur. Once the problem is repaired, the copy session can be restarted at the last checkpoint.

Action: The destination LUN is trespassed.
Event: Active copy session that is transferring data to the destination LUN. Trespass of the destination LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the destination LUN is determined to be bad.
Result: Any active session fails to copy to that specific destination LUN (other destination LUNs associated with the same session may succeed). Once the destination LUN is trespassed back to the previous owning SP, the copy session can be restarted at the last checkpoint.

Source LUN (Incremental Copy)

Action: The SAN Copy storage system reads data from the source LUN in order to transfer the data to the destination LUN.
Event: The storage system read from the source LUN fails due to an LCC/BCC failure, a cache dirty LUN, or a bad block.
Result: Failure to access the source LUN causes the SAN Copy session to fail. After the source LUN is repaired (which could be done by trespassing the source LUN), the administrator must change the session type to full and back to incremental in order to reset the tracking mechanism. This incurs a full copy once the session is started.

Action: The SP on the SAN Copy storage system is shut down. It is the same SP that owns the source LUN.
Event: Active SAN Copy session. The SP can be shut down due to a Navisphere reboot command, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: The copy session fails. The session state is set to "halted on reboot". When the SP is rebooted, the state is set to "auto-recovery in progress" and the session resumes automatically from the last checkpoint.

Action: The source LUN is trespassed.
Event: Active copy session on the source LUN. Trespass of the source LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the source LUN is determined to be bad.
Result: Any active session fails. Once the source LUN is trespassed back to the previous owning SP, or the session is moved to the peer SP (provided there is connectivity to the destination LUN from the peer SP) and any destination LUNs on the SAN Copy storage system are trespassed, the copy session can be restarted from where it left off.


Destination LUN (Incremental Copy)

Action: The SAN Copy storage system issues write requests to the destination LUN(s) in order to copy the data from the source LUN.
Event: A write to the destination LUN fails due to an LCC/BCC failure, a cache dirty LUN, or some other storage system failure that prevents write access to the destination LUN. Note: the destination LUN can be on a remote storage system, which may or may not be a CLARiiON. The write to the storage system where the destination LUN(s) reside (assuming the copy was to a remote storage system) could also fail due to a connectivity issue. Connectivity issues can occur due to switch failures (including firmware revision issues), storage system port failures, cable failures, zoning errors, ISL failures, etc.
Result: Failure to write to a destination LUN causes the SAN Copy session to fail for that specific destination LUN (other destination LUNs associated with the same session may succeed). Depending on the connectivity issue, it could take as long as 5 minutes for the failure to occur. Once the problem is repaired, the copy session can be restarted from where it left off.

Action: The destination LUN is trespassed.
Event: Active copy session that is transferring data to the destination LUN. Trespass of the destination LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the destination LUN is determined to be bad.
Result: Any active session fails to copy to that specific destination LUN (other destination LUNs associated with the same session may succeed). Once the destination LUN is trespassed back to the previous owning SP, the copy session can be restarted from where it left off.

Reserved LU (Incremental SAN Copy only)

Action: Server write to a source LUN. The SAN Copy storage system may need to perform a copy on first write to preserve the point-in-time data to be transferred during an incremental session. The storage system also needs to track disk regions that will need to be transferred, or to clear disk regions that were previously tracked after the data has been updated. This entails I/Os to the reserved LU(s) before the server write to a source LUN can proceed.
Event: An I/O to a reserved LU fails due to an LCC/BCC failure, a cache dirty LUN, etc. This includes a read failure from the reserved LU due to a bad block.
Result: The server write request succeeds. All allocated reserved LUs associated with the source LUN are freed back to the reserved LU pool. The SAN Copy session fails. After the reserved LU is repaired (which could be done by trespassing the source LUN), the administrator must change the session type to full and back to incremental in order to reset the tracking mechanism. This incurs a full copy once the session is started.


Action: Server write to a source LUN. As above, the storage system may need to perform copy-on-first-write and region-tracking I/Os to the reserved LU(s) before the server write can proceed.
Event: There is no space left in any assigned reserved LUs and no more free reserved LUs in the SP pool.
Result: The server write request succeeds. The SAN Copy session fails. The session is unmarked, because the point-in-time data at the time of the mark was not maintained via the copy-on-first-write mechanism. After additional free reserved LU(s) are added to the pool, the administrator can restart the copy; however, a new mark point in time must be used. Since the tracked data is not lost, the new point-in-time copy will be incremental from the last successful copy session.

Action: The SP that is processing the incremental copy session is shut down. It is the same SP that owns the source LUN.
Event: Active SAN Copy session. The SP can be shut down due to a Navisphere reboot command, an SP panic caused by a software or hardware malfunction, or the SP being physically pulled.
Result: The copy session fails. The session state is set to "halted on reboot". When the SP is rebooted, the state is set to "auto-recovery in progress" and the session resumes automatically from the last checkpoint.

Action: The source LUN is trespassed, which causes any associated reserved LUs to trespass.
Event: Active copy session on the source LUN. Trespass of the source LUN can happen due to an NDU, a Navisphere trespass command, or failover software performing an explicit or automatic trespass when a path from the server to the source LUN is determined to be bad.
Result: Any active session fails. Once the source LUN is trespassed back to the previous owning SP, the copy session can be restarted from where it left off.


SAN Copy (ISC)

Introduction - This section informally summarizes the Incremental SAN Copy (ISC) usage model. Note: allocation of the snap cache is to be performed by the user prior to performing an Incremental SAN Copy operation; that task is outside the scope of this section.

Model - Incremental SAN Copy is a feature that provides users with the ability to perform incremental updates to copies of their production data. These copies can reside on the same array as the production data, or on remote arrays (CLARiiON and Symmetrix only). Incremental SAN Copy operations can be performed using a variety of LUN types as the source of the copy:

• Regular LUs • SnapSource LUs • Clone Source or Clone LUs • MirrorView Primary or Secondary LUs

A new Incremental SnapView Session is used to track the changes for an ISC Session. The Incremental SnapView Session can only be invoked via ISC administrative operations. A dedicated snapshot LU is used as a placeholder for each Copy Source LU. The Incremental SnapView Session is implicitly associated with the dedicated snapshot LU.

ISC Sessions have two modes of operation, "marked" and "unmarked". When "unmarked", the Incremental SnapView Session associated with the ISC Session keeps track of the areas of the Copy Source LUs that have been modified. When "marked", it also keeps track of the areas of the Copy Source LUs that have been modified; additionally, it performs copy on first write (COFW) operations, as necessary, to protect the integrity of the data during the life of the SAN Copy data transfer.

Creating an Incremental SAN Copy Session

In order to perform an Incremental SAN Copy operation, users must first create an ISC Session for the operation. This is done using the existing SAN Copy Session Wizard. Users must indicate whether the SAN Copy Session will be used for performing incremental copies. If it is, the Session Name specified (along with an indicator that this is a SnapSession started for SAN Copy) will be used as the name of the Incremental SnapView Session and the Incremental SnapShot LU created for the ISC Session. For example, if the user creates an ISC Session using "foo" as the Session Name, the Incremental SnapView Session started will have a name like "SAN Copy - foo".

Users can specify whether a full copy is to be performed when an ISC Session is started for the first time. If the In-Sync property is "FALSE" (the default), a full copy is performed the first time the ISC Session is started; subsequent starts copy only the data changed since the previous copy. If the In-Sync property is "TRUE", the ISC Session, when started for the first time, will only copy the data that has changed since it was created. This is useful for users who either know the contents of their destination is the same as their source (a full copy was already performed) or do not care about the contents of their source (the database or filesystem has not been created on the source yet).

A Latency value and an Available Link Bandwidth value must be specified for ISC Sessions. The user can either specify the Latency when the ISC Session is created, or select to have the value automatically determined by SAN Copy (method is TBD). The user must specify the Available Link Bandwidth value when the ISC Session is created. Both of these values are used to determine the buffer size/count for the optimal performance of the ISC Session. As part of the ISC Session's creation, the following operations occur:
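The "marked" versus "unmarked" behavior described above can be modeled in a few lines: changed regions are tracked in both modes, but only while marked is the original data preserved via copy on first write, so the point-in-time image survives later overwrites. A toy Python sketch (illustrative, not FLARE code):

```python
# Toy model of Incremental SnapView marked/unmarked behavior.
# In both modes, changed regions are tracked; only while marked are
# original contents preserved via copy on first write (COFW).
class IncrementalSnapSession:
    def __init__(self, lun):
        self.lun = lun                  # list of region contents
        self.marked = False
        self.changed = set()            # regions to send on the next copy
        self.cofw = {}                  # preserved point-in-time data

    def mark(self):
        self.marked = True

    def server_write(self, region, data):
        if self.marked and region not in self.cofw:
            self.cofw[region] = self.lun[region]   # COFW before overwrite
        self.lun[region] = data
        self.changed.add(region)

    def point_in_time_read(self, region):
        # Data as it was at the mark: the preserved copy if the
        # region was overwritten since, otherwise the live data.
        return self.cofw.get(region, self.lun[region])

s = IncrementalSnapSession(["a", "b", "c"])
s.mark()
s.server_write(1, "B")
print(s.point_in_time_read(1), s.lun[1])  # b B
```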

• An Incremental SnapShot LU is created for the specified Copy Source LU (using the Session Name along with an indicator that this is a SnapShot LU created for SAN Copy). This SnapShot LU will be reserved by SAN Copy and cannot be placed in a storage group. Users will be prohibited from performing any operations on Incremental SnapShot LUs.

• An Incremental SnapView Session is started on the Copy Source LU of the ISC Session (using the Session Name along with an indicator that this is a SnapSession started for SAN Copy).

• The Incremental SnapView Session is activated on the Incremental SnapShot LU created for the ISC Session.


• The act of starting the Incremental SnapView Session will automatically acquire one or more SnapView cache LUs (as needed) for each Copy Source LU. These SnapView cache LUs will remain locked to the Copy Session by virtue of the Incremental SnapView Session not being stopped until the Copy Session is destroyed. SnapView cache LUs will be assigned until there is enough disk space to maintain the complete DLM table for each Copy Source LU. The start of the Incremental SnapView Session could fail if there is not enough SnapView cache space defined. In this case the creation of the ISC Session will fail.

• A Copy Session is created and stored in the SAN Copy Database for the Incremental Copy operation.

• Incremental SnapShot LUs that are automatically created for an ISC Session will be reserved by SAN Copy and will be displayed as such.

• Incremental SnapView Sessions will be displayed in the list of existing SnapView Sessions for the Source LU. There will be an indication, though, that the SnapView Session has been started for an ISC Session.

Marking/Unmarking the Incremental SAN Copy Session

Any time after an ISC Session is created, the user can decide the point-in-time copy that is to be incrementally copied to the destination(s). The user does this by quiescing the appropriate source LU(s) and performing a “mark” operation on the ISC Session via Navisphere. As soon as the “mark” operation is performed, the Incremental SnapView Session starts performing copy-on-first-writes as necessary. Once the “mark” operation completes, the user unquiesces the source LU(s). The Copy Session does not have to be started at this time; for example, the user may perform an Incremental Copy operation at 11:00 PM for a point-in-time copy of the data at 1:00 PM.

After an ISC Session has been “marked”, the user can perform an “unmark” operation on it. After an ISC Session has been “unmarked”, the point-in-time copy of the data from the previous “mark” operation is lost, and copy-on-first-write operations are not performed until the session is “marked” again.

Starting the Incremental SAN Copy Session

The user can start an ISC Session any time after it has been created. If the user does not “mark” an ISC Session before it is started, the ISC Session will be auto-marked, resulting in an incremental copy of the point-in-time data at the time the copy starts. The user must first quiesce the Copy Source LU before starting an ISC Session that is not “marked”; after the Copy Session has started, the user can unquiesce the Copy Source LU. Users can also select to perform a full copy when starting an ISC Session. Once the ISC Session completes successfully to all destinations, the Incremental SnapView Session automatically transitions to the “unmarked” state. The ISC Session will display the status of the copy operation.
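The mark/start lifecycle described above can be summarized as a small state machine (a minimal illustrative sketch, not FLARE internals):

```python
# Minimal state machine for the mark/start lifecycle described above;
# names and structure are illustrative, not FLARE internals.

class IscSessionState:
    def __init__(self):
        self.marked = False     # a point-in-time image is being tracked
        self.copying = False    # an incremental copy is in progress

    def mark(self):             # caller quiesces the source LU first
        if self.copying:
            raise RuntimeError("cannot mark while copying")
        if self.marked:
            raise RuntimeError("session is already marked")
        self.marked = True      # COFW tracking begins for this image

    def unmark(self):
        if self.copying:
            raise RuntimeError("cannot unmark while copying")
        self.marked = False     # previous point-in-time image is lost

    def start(self):
        if not self.marked:
            self.mark()         # auto-mark: copy the data as of right now
        self.copying = True

    def complete(self):         # successful copy to all destinations
        self.copying = False
        self.marked = False     # session auto-transitions to "unmarked"

s = IscSessionState()
s.mark()        # e.g. 1:00 PM: quiesce, mark, unquiesce
s.start()       # e.g. 11:00 PM: push the 1:00 PM image incrementally
s.complete()
```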
Users will be provided with the ability to perform a full copy to destinations without “marking” the ISC Session first. This results in faster full copy operations to destinations, but leaves those destinations in an inconsistent state; a subsequent “start” operation by the user will make the data on all destinations consistent.

Viewing/Modifying an Incremental SAN Copy Session

The WWN of the Copy Source LU (not the SnapShot LU) is displayed as the source for ISC Sessions, along with whether the session is “marked” or “unmarked”. If an ISC Session is “marked”, an associated timestamp indicates when the data was “marked”. ISC Sessions that are not in progress (have not been started) display the number of blocks that would be copied if the Copy Session were started now. ISC Sessions that are in progress (have been started) display the progress of the copy, the number of blocks left to copy, and the total number of blocks for the Copy Session. The Incremental Copy property of a SAN Copy Session can be modified to start/stop it from being used to perform Incremental Copies. When the Incremental Copy property of a SAN Copy Session is turned off:

• the Incremental SnapView Session started for the ISC Session is automatically stopped
• the Incremental SnapShot LU created for the ISC Session is automatically destroyed


When the property is turned on, an Incremental SnapShot LU is created and an Incremental SnapView Session is started using the specified Session Name (with a SAN Copy indicator) of the ISC Session. After the Incremental flag is set to “True” on an existing SAN Copy Session, the user can set the In-Sync property to “True”. If it is set to “True”, a full copy of the destinations will not be performed when the ISC Session is started for the first time; if it is not set to “True”, a full copy will be performed.

Users can add destinations to an existing ISC Session that is not currently copying. If a new destination is added to an existing ISC Session, a full initial copy of that destination is performed the first time the ISC Session is started, followed by an automatic Incremental Copy to all destinations. There will be an indication that an initial copy operation is in progress on the new destination. Once both the initial copy of the new destination and the Incremental Copy to all destinations have completed, all destinations in the ISC Session contain the point-in-time data from when the ISC Session was marked (or started, if it had not been previously marked). Users can modify both the Latency and Available Link Bandwidth for an ISC Session.

Destroying an Incremental SAN Copy Session

When an ISC Session is removed, the following operations are performed:

• the Incremental SnapView Session started for the ISC Session is automatically stopped
• the Incremental SnapShot LU created for that ISC Session is destroyed

• the ISC Session entry is removed from the SAN Copy Database

Error Cases

Incremental SAN Copy Session Failure: If an ISC Session fails for any reason (except because there is no more available SnapCache), the status for the Copy Session will indicate that it has stopped and specify a reason for the failure. The Incremental SnapView Session associated with the Incremental Copy Session remains “marked”. The ISC Session can be restarted from the point of the failure using the “resume” operation (see the “Handling San Copy Session Failures” section of the “EMC SAN Copy Administrator’s Guide”).

Incremental SAN Copy Session Destination Failure: If one or more destinations of an ISC Session fail, the Copy Session continues copying to the other Destinations but stops clearing bits in the DeltaMap area of the DLM for the Incremental SnapView Session (assuming there is at least one non-failing destination). When the Copy Session completes, the status of the Copy Session will indicate which destinations have failed. The Incremental SnapView Session associated with the ISC Session stays in the “marked” state. An ISC Session that contains failed destinations will be inhibited from being “started” or “unmarked” until one of the following conditions is met:

• all failed Destinations have been removed from the Copy Session (they can be full-copied and added back later)
• the user performs a “resume” of the Copy Session as many times as necessary to finish copying to all Destinations

When a “resume” operation is performed, SAN Copy copies only to the failed Destinations. If more than one Destination failed, the DeltaMap area of the DLM for the Incremental SnapView Session represents the first failure (a superset of subsequent failures), so some “repaired” Destinations will receive redundant writes. As a future enhancement, the “resume” operation could begin copying data to each failed destination at the point of its failure (rather than where the first destination failed).

Out of SnapCache for Incremental SAN Copy Session: If there is no more SnapCache for the copy-on-first-write data of an Incremental SnapView Session that is “marked”, it becomes “unmarked”. If the ISC Session is in progress when this occurs, the Copy Session fails. The Incremental SnapView Session continues to track the changes for the ISC Session, but the point-in-time copy of the data from when it was “marked” is lost. When this occurs, there will be an indication that the Incremental SnapView Session is “unmarked” and that the ISC Session has failed. The user can then perform a new “mark” operation and start the ISC Session.
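The DeltaMap behaviour on destination failure can be illustrated with a small sketch (hypothetical Python, not FLARE code): copy-on-first-write sets a bit per changed extent, bits are cleared only while every destination is still succeeding, and the surviving map is therefore a superset of what each later-failing destination needs, which is why a resume can send redundant writes.

```python
# Illustrative sketch (not FLARE code) of DeltaMap clearing semantics.

class DeltaMap:
    def __init__(self):
        self.dirty = set()              # extents changed since the last copy

    def cofw(self, extent):
        self.dirty.add(extent)          # copy-on-first-write marks the extent

class Destination:
    def __init__(self, name, fail_at=None):
        self.name, self.fail_at = name, fail_at
        self.failed, self.received = False, []

    def write(self, extent):
        if extent == self.fail_at:      # simulate a link/device failure
            return False
        self.received.append(extent)
        return True

def incremental_copy(delta, destinations):
    clearing = True                     # stop clearing after the first failure
    for extent in sorted(delta.dirty):  # sorted() snapshots the set
        for d in destinations:
            if not d.failed and not d.write(extent):
                d.failed = True
                clearing = False
        if clearing:
            delta.dirty.discard(extent)

dm = DeltaMap()
for e in (1, 2, 3, 4):
    dm.cofw(e)
a, b = Destination("A"), Destination("B", fail_at=3)
incremental_copy(dm, [a, b])
# dm.dirty is now {3, 4}: a resume re-sends extents 3 and 4, even though
# destination A already received both (the redundant writes noted above).
```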


SnapCache failure: If an error occurs while writing the DeltaMaps to the SnapCache LUs, the Incremental SnapView Session will stop. If this occurs, an error will be returned to the user to indicate that the Incremental SnapView Session has terminated. When this occurs, the user can:

• Turn off the Incremental property of the SAN Copy Session and then turn it back on again. A full copy will be performed to all Copy Destinations the first time the ISC Session is started after this operation is performed.

• Destroy and then recreate the ISC Session. The ISC Session should be created Out-of-Sync so that a full copy is performed to all Copy Destinations the first time the ISC Session is started. This will ensure that all the destinations contain the correct data.

Restrictions:

• Users are prohibited from performing any SnapView operations on the Incremental SnapView Session created for an ISC Session.
• Users cannot use an existing SnapShot LU as the source of an ISC Session.
• Users will be prohibited from changing the Incremental Copy property of a Copy Session while it is copying.
• The user will be prohibited from changing the Session Name of an existing ISC Session.
• Users will be prohibited from modifying the Copy Source LU of an existing ISC Session.
• Users will be prohibited from “starting” or “unmarking” an ISC Session that currently has failed destinations.
• The Session Name specified for an ISC Session cannot be the same name used for an existing SnapView Session.
• Each Copy Session must have a unique Session Name.
• Users cannot “mark” or “unmark” an ISC Session while it is copying.
• ISC Sessions cannot be destroyed while they are in progress.
• Users cannot perform a “mark” operation on an already “marked” ISC Session.

Issues: If the number of Copy Sessions that can be concurrently active is increased to 50 (from 16), how do we ensure the CPU is not saturated if the sessions are running at full throttle?

Brief description - Many businesses require that information generated at a central location be available at branch or satellite offices. A variety of information (e.g., catalogs, buying trend databases, cost of goods databases, etc.) is created at the corporate headquarters. This information is periodically updated and transferred to the satellite offices, which are geographically dispersed from the central office. Distribution of the data can be a manual process, invoked by administrative command, or an automatic process that runs without human intervention at preconfigured periodic intervals. The periodic interval will be limited by cost: shorter intervals over greater geographical distances can be achieved, but will be bounded by the amount of data transferred, the distance, and the technology deployed to transfer it (cost).

Businesses with geographically distributed offices typically have a modest-sized IT staff and are willing to afford configurations that offer some degree of high availability. Cost is a concern, which makes this a prime usage target for midrange storage arrays.


For this class of customer, infrastructure cost needs to be kept at a minimum. For the purposes of this document, it is assumed that the network has been properly provisioned to meet the content distribution business objectives. Given the cost differential between a T1 and a T3 line, it is further assumed that most customers will run their content distribution applications over one or more T1 lines and, in the not too distant future, over even less costly lines (e.g., VPN over DSL). In certain business environments, some customers can justify the ROI for more expensive communications infrastructure. In addition, it is assumed that a significant number of these customers will have preexisting T1 lines between offices which they would like to deploy for content distribution during off-peak hours.

Cost dictates the amount of data that can be distributed within a fixed period. For example, assume that the business objectives require a daily update frequency and that cross-country data distribution must be done within a 4-hour window (off-peak business hours). 2.5GB of updated data can be transferred using SAN Copy to the satellite offices over a single dedicated T1 line across the country (~3000 miles) in about 4 hours, at a line cost of just over $8000 per year.

Using a single T3 line, 50GB of updated data can be distributed in the same 4 hour window at a cost exceeding $175,000 per year.
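The windows quoted above can be sanity-checked against raw line rates. Raw rates ignore IP/FC protocol overhead, so the achievable figures in the text (2.5 GB on a T1, 50 GB on a T3) sit below these theoretical ceilings:

```python
# Back-of-the-envelope check of the transfer windows quoted above.

WINDOW_S = 4 * 3600                           # 4-hour off-peak window

def max_gb(line_rate_bps, window_s=WINDOW_S):
    """Theoretical maximum transfer in decimal gigabytes."""
    return line_rate_bps / 8 * window_s / 1e9

t1_gb = max_gb(1.544e6)        # T1 line rate: 1.544 Mb/s
t3_gb = max_gb(44.736e6)       # T3 line rate: 44.736 Mb/s
print(f"T1 ceiling: {t1_gb:.2f} GB, T3 ceiling: {t3_gb:.1f} GB")
# -> T1 ceiling: 2.78 GB, T3 ceiling: 80.5 GB
```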

The amount of data that needs to be distributed varies across a wide spectrum of applications. For the purpose of this use case analysis, the amount of data is assumed to be sufficiently large that the costs prohibit sending full copies of the content; therefore, an incremental update capability is assumed.

Business Rules

Consistency - The data distributed to all the satellite offices must be identical and must represent the data at the central office at some previous point in time.

Performance impact - Impact to the central office processing of the data must be minimal during the period the data is replicated to the satellite offices.

Transfer period - The time needed to transfer the data must be less than the required update frequency (i.e., transferred in an off-peak time period). Some businesses are 7x24 and do not have an off-peak period; for these businesses it may be more advantageous to throttle the data transfer in order to reduce the impact on the production environment.

Satellite data accessibility - The data at the satellite offices must allow read/write access. The data must be accessible regardless of whether the last periodic update failed while it was in the middle of transferring the data.

Preconditions

Server (Host) Applications - Each satellite office has one or more applications capable of interpreting the data distributed by the central office. The applications are used to read and analyze the data. Any changes to the data from a satellite office are expected to be overwritten during the next periodic update from the central office. Regardless of the fact that the data will change during the next update period, many of the applications will require read-write access to the data to properly execute. In some cases, the applications are file system based and the servers are used as file servers storing client documents and data (e.g., Word, Excel, and PowerPoint).
Servers - The central office and each of the satellite offices have at least one server capable of running the applications referenced above. The servers must be running data-compatible revisions of the applications and operating systems. At each server location is a storage array. Access to the storage array is typically configured for high availability, so that there is no single point of failure. The servers have at least two Fibre Channel (FC) connections via host bus adapters (HBAs) to each storage processor (SP) in the array. The servers are either direct-connected to the array or are configured in a SAN, typically using two redundant FC switches configured for high availability.

Storage - The storage arrays at each office are interconnected to the storage arrays at the other offices via a storage router (FC to IP converter). Typically there is a storage router for each SP connected to the network. It is assumed that the array at the central office and at each satellite office is a CLARiiON CX series array. The arrays at the central office and each satellite office must have SnapView software installed and properly configured. The array at the central office must also have SAN Copy software installed. This requires that the arrays be capable of running layered software.


The array at the central office must be properly configured to run the SAN Copy software. Incremental SAN Copy descriptors must be created that represent the source and destination for the data being distributed. The descriptors are created once and maintained within the central office storage array. SAN Copy is integrated with SnapView snapshots to provide support for incremental data distribution. The SnapView cache must be provisioned and configured in order to create the incremental SAN Copy descriptors. The customer may choose to use clones at the central office in addition to the implicit use of snapshots. If clones are used, the storage must be provisioned and the clone groups created. The tradeoff on whether to use clones is discussed in the System Architecture section. Each satellite office must be configured to use SnapView snapshots. The configuration includes provisioning the SnapView cache and the creation of the snapshots.

Infrastructure - The central office and each satellite office are connected via an IP network. As discussed earlier, it is assumed the configuration uses one or more leased T1 lines. In a highly available configuration, two FC to IP storage routers at each site are dedicated to the storage data interchange; however, only one storage router at each office is required. Each storage router is independently connected to an SP. The storage router should be capable of data compression. The storage router(s) can be direct-connected to the array or connected via an FC switch.

System Architecture - The content distribution process, using CLARiiON hardware and software, consists of a flow of events performed at the central office to push consistent updates to the satellite offices, and a set of events performed at the satellite offices to guarantee that the data is accessible and consistent. Further, the events are structured to avoid server reboots and to allow partial updates to be sent from the central office.

CLARiiON SAN Copy software is used to send updates from the central office to the satellites. SAN Copy is an efficient multi-threaded, array-to-array copy engine with tight integration with SnapView to provide the ability to track changes and send periodic updates to multiple destinations from a single centralized source.

CLARiiON SnapView array software is used both at the central office and at each satellite office. At the central office, it is used to make a consistent image of the data that will be the source of the distribution to the satellite offices (consistent data distribution). SnapView is also used to present the data in such a way as to safeguard any modifications to the data made by the application(s) running at each satellite office. This capability is critical to enable only partial updates to be sent from the central office. This will be explained in greater detail as the flow of events is further broken down into subordinate use cases.

Customers will implicitly use SnapView snapshots (tight integration with SAN Copy) at the central office. They have a choice of additionally using clones to reduce the performance impact. Customers have a choice of using SnapView snapshots or SnapView clones at any of the satellite offices. The determination of which to use is based on a performance versus required disk space tradeoff.


Snapshots require a fraction of the disk space needed to store the distributed data, but incur a copy-on-first-write performance penalty while in use. A clone requires two times the storage (a full copy) of the data, but does not have a significant performance penalty when the clone is in use. If business conditions allow an off-peak time to distribute the data, then snapshots will be used at the central office, as the performance overhead will occur at times when the data is not heavily accessed. Use of snapshots at the satellite offices will dominate customer configurations, as the performance overhead affects only the time it takes to transfer the updates. In addition, given the cost constraints, the network infrastructure, rather than snapshot overhead, will likely be the limiting factor in the data transfer times; this is due to the multi-threaded nature of the SAN Copy array software, and it adds to the likelihood of snapshots being used in this environment. Regardless, the use of snapshots and/or clones at both offices will be explored in this use case analysis.
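The disk-space side of that tradeoff can be illustrated with assumed numbers (the 500 GB source size and 10% per-cycle change rate below are illustrative assumptions, not figures from this document):

```python
# Rough space comparison for the snapshot-vs-clone tradeoff above.
# Source size and change rate are assumed figures for illustration.

source_gb = 500                        # size of the distributed LU (assumed)
change_rate = 0.10                     # fraction changed per update cycle (assumed)

clone_gb = source_gb                   # clone: a full second copy of the data
snapshot_cache_gb = source_gb * change_rate   # cache holds only COFW chunks

print(f"clone: {clone_gb} GB extra, snapshot cache: {snapshot_cache_gb:.0f} GB extra")
```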


Case Studies

MirrorView/A - Target array was upgraded from a CX500 to a CX700; MirrorView/A has stopped working.

Array Name: ‘name’   Serial No: APM000xxxxxxxx
Time on SP A: 04/26/06 23:52:55   Time on SP B: 04/26/06 23:52:55
Revision (both SPs): 2.19.700.5.016
Serial Number For The SP: LKE00041100985 (SP A), LKE00041100984 (SP B)
Storage Processor IP Address (both SPs): 199.xxx.xx.xxx
System Fault LED: OFF
Total Number of Disk Enclosures: 15
Total disks reported by SPA: 225   Total disks reported by SPB: 225

MirrorView connections were "Disabled". We enabled them and no longer get the error "[0x71528003] Unable to send a message to the secondary or peer SP" that we were previously getting. When we synchronize an Admin Fractured Mirror, it seems to start momentarily, but some moments later it stops. We see the Reserved SnapView Sessions start, but don't see any traffic between the SPA 3 and SPB 3 ports on the switch. It is as if no data is being transferred between the arrays for these mirrors. Evidence seems to point to incorrect zoning or other issues with the connection between the arrays.

Customer has an existing primary and secondary which are admin fractured following a conversion of the secondary array from a CX500 to a CX700. The conversion has been completed and the MV connection between the arrays has been enabled, but when the customer tries to sync admin mirrors they go admin fractured immediately.
Here is tracing from an attempt to sync mirrors following the conversion:

23:37:34.889 # FarLib: closed filament
23:37:34.927 # FarLib: Created DestinationFilamentName: FAR_ef2c609060010650:1
23:37:34.927 # FarLib: Created RespReceiveFilamentName: FAR_ef2c609060010650:0:RESP:3824:333252
23:37:34.927 # FarLib: Transact sending a message with opcode 14
23:37:34.927 # FarLib: closed filament
23:37:34.927 # FarLib: Sent message to remote SP 1
23:37:34.927 # FarLib: waiting on response from cmispid ef2c609060010650:1
23:37:34.928 # FarLib: Received response is 0 (FAR_ef2c609060010650:0:RESP:3824:333252 )
23:37:34.928 # FarLib: closed filament
23:37:34.928 # FarLib: Created DestinationFilamentName: FAR_931822b060010650:0
23:37:34.929 # FarLib: Created RespReceiveFilamentName: FAR_ef2c609060010650:0:RESP:3824:333253
23:37:34.929 # FarLib: Transact sending a message with opcode 8  <- update remote mirror
23:37:34.929 # FarLib: closed filament
23:37:34.929 # FarLib: Mps returned error on sending a message 7110840a  <- K10_UMPS_INTERNAL_ERROR_NO_CONNECTION

(The specified destination could not be reached.)
23:37:34.929 # FarLib: closed filament
23:37:34.999 # FarLib: Created DestinationFilamentName: FAR_ef2c609060010650:1
23:37:34.999 # FarLib: Created RespReceiveFilamentName: FAR_ef2c609060010650:0:RESP:3824:333254
23:37:34.999 # FarLib: Transact sending a message with opcode 33
23:37:34.999 # FarLib: closed filament
23:37:34.999 # FarLib: Sent message to remote SP 1
23:37:34.999 # FarLib: waiting on response from cmispid ef2c609060010650:1
23:37:35.000 # FarLib: Received response is 0 (FAR_ef2c609060010650:0:RESP:3824:333254 )
23:37:35.000 # FarLib: closed filament
23:37:35.000 # FarLib: Created DestinationFilamentName: FAR_931822b060010650:0
23:37:35.000 # FarLib: Created RespReceiveFilamentName: FAR_ef2c609060010650:0:RESP:3824:333255
23:37:35.000 # FarLib: Transact sending a message with opcode 34  <- update group
23:37:35.000 # FarLib: closed filament
23:37:35.000 # FarLib: Mps returned error on sending a message 7110840a  <- K10_UMPS_INTERNAL_ERROR_NO_CONNECTION
23:37:35.000 # FarLib: closed filament
23:37:35.000 # FAR: Unable to update remote 71528003  <- FAR_ADMIN_ERROR_FAILED_TO_SEND_MESSAGE

Unable to send a message to the secondary or peer SP. Check to see if MirrorView connections between the secondary and primary arrays have been established.

Also tried to create a small mirror and add a secondary to it. Creating the primary mirror was successful, but the attempt to add a secondary failed with the message: 0x712a8021 CPM_STATUS_DEST_LU_NOT_FOUND


The destination device could not be found. This is probably due to incorrect zoning on the switch or the device is not in the correct storage group.

1. Yes, there were zoning errors. Initially, after the conversion from CX500 to CX700, the CE left SPA and SPB port 1 connected to the switch instead of connecting SPA and SPB port 3. When this was corrected, there was still a zoning error: SPA 3 was zoned to primary SPB 3 and vice versa. This was later also corrected.

2. Yes, there was a problem with the MV/A Storage Group when we ran "FarAccessControlTool -sanity". We then ran "FarAccessControlTool -rebuild", checked the "sanity" again, and all was good. As soon as this was done, the customer reported that MV/A started to move data to the target array.

3. Conversion went off without error. Note that the new array (CX700) will use port 3 from each SP for MV/A.

MirrorView/S - SPB MirrorView initiator is missing after a switch cable change.

Array Name: ‘name’   Serial No: F2002xxxxxxx
Time on SP A: 08/22/06 10:56:27   Time on SP B: 08/22/06 10:56:27
Revision (both SPs): 8.50.60
Serial Number For The SP: SA100039 (SP A), APM00023 (SP B)
Storage Processor IP Address (both SPs): 192.xxx.xxx.xx
System Fault LED: ON
WRITE CACHE: DISABLED   READ CACHE: ENABLED (both SPs)
Total Number of Disk Enclosures: 4
Total disks reported by SPA: 40   Total disks reported by SPB: 40

After the switch cable change on the SAN, SPB MV initiators were missing from Connectivity Status on both arrays. Customer attempted the following:

1. Verify that both arrays are in the same storage domain.
2. In Navisphere > Manage Mirror Connections > disable and re-enable all initiators. Do this on both arrays. Usually, disabling and re-enabling the initiators makes them register and log in to the array.
3. In Connectivity Status > deregister everything that is invalid or not logged in to the array.
4. Reboot both SPs on both arrays (will require a maintenance window).
5. Recreate the zone for the mirror, emc_SPB_mirror. (Make sure you delete it, recreate it, add it back to the configuration, and save and enable.)
6. Fibre cables have also been swapped.
7. SP-B on the primary side was replaced.

There is a note in the Clarify text to say that the initiator might have been manually deleted. The SPCollects and switch logs have been rechecked for zoning and login issues, but there does not seem to be anything obvious.

Customer called in ready to make the changes on both switch 1 and switch 4. Both switches were disabled, the long distance setting was changed to 1 on switch 4, the same was done on switch 1, and the switches were re-enabled. The NL port corrected itself to N; Connectivity Status now showed the B1 connection. The MV connectivity was corrected and an MV sync was started on SPB. Care was taken to disable sw4 and leave it disabled until sw1 was changed, then re-enable sw4. That was the first time both switches were rebooted at the same time.

SAN Copy - SAN Copy failure.

Array Name: ‘name’   Serial No: APM000xxxxxxxx
Time on SP A: 06/17/06 06:10:11   Time on SP B: 06/17/06 06:10:11
Revision (both SPs): 2.07.600.5.020
Serial Number For The SP: LKE00030300296 (SP A), LKE00031900667 (SP B)
Storage Processor IP Address (both SPs): 151.xxx.xx.xxx
System Fault LED: OFF


WRITE CACHE: ENABLED   READ CACHE: ENABLED (both SPs)
Total Number of Disk Enclosures: 10
ATA ENC list: DAE2-ATA Bus 0 Enc 1, DAE2-ATA Bus 1 Enc 1, DAE2-ATA Bus 0 Enc 2, DAE2-ATA Bus 1 Enc 2, DAE2-ATA Bus 1 Enc 4
Total disks reported by SPA: 148   Total disks reported by SPB: 148

CX600
SP Name: SP A   SP Port ID: 3
SP UID: 50:06:01:60:90:60:0D:A6:50:06:01:63:10:60:0D:A6
Link Status: Down   Port Status: DISABLED   Switch Present: NO

SP Name: SP A   SP Port ID: 2
SP UID: 50:06:01:60:90:60:0D:A6:50:06:01:62:10:60:0D:A6
Link Status: Down   Port Status: DISABLED   Switch Present: NO

Logical ports were going up and down; in the meantime, SP A was rebooted on both arrays, so the ktraces are missing during this span of time. SAN Copy failure: FE ports 2 and 3 are disabled on SP A. The CX600 and CX700 are connected directly to each other for a SAN Copy migration. The CX600 is at R14; the CX700 is at R19. The migration is failing intermittently. CX700 SPA ports 2 and 3 show no login from the CX600 ports 2 and 3.

• Currently SAN Copy sessions are progressing.
• Array is connected to ports 1 and 2 only.
• Replacing the SP while SAN Copy sessions are running may cause an issue.
• Replace SPA on the CX600 after the SAN Copy sessions are completed.
• Setting the priority to 2.
• TS2 to update the session status once migration is complete, and upload SPCollects after replacement of the SP on the CX600.

Logical ports were going up and down; in the meantime, SP A was rebooted on both arrays, so the ktraces are missing for this span of time. Ports 2 and 3 on SP A are disabled on both arrays. We plugged A2 and A3 from the CX700 into the switch and the LEDs came on; we plugged A2 and A3 from the CX600 into the switch and the LEDs did not come on, so it looks like SP A of the CX600 has bad ports. The array was connected to ports 1 and 2 only, and SAN Copy sessions were in progress. Replacing the SP while the SAN Copy sessions are running may cause an issue.

Details from the ktraces of the CX700; the peer had reported that NTFE had been shut down:
B 06/17/06 04:26:01 8295a9c8 FEDISK : Remove stale objs on adapter 8 (port 1)
B 06/17/06 04:26:01 8295a9c8 FEDISK : Remove stale objs on adapter 9 (port 0)
B 06/17/06 04:26:01 8295a9c8 FEDISK : Remove stale objs on adapter 10 (port 3)
B 06/17/06 04:26:01 8295a9c8 FEDISK : Remove stale objs on adapter 11 (port 2)
B 06/17/06 04:26:01 8295a9c8 FEDISK : CREATE FAILED for WWN 1010101010101010:1010101010101010
B 06/17/06 04:26:01 CPMCTRL 8295a9c8 CpmControlVerifyFrontendTarget(): Exited with 0xC000000E
B 06/17/06 04:26:01 CPMCTRL 8295a9c8 CpmControlDeviceControl(): Exited with 0xE12AC007
B 06/17/06 04:34:08 PEER 8054aaa0 *** NTFE Shutdown.
A 06/17/06 04:23:30 scsitarg 71170008 Fibre Channel loop down on logical port 2.
A 06/17/06 04:23:43 scsitarg 71170009 Fibre Channel loop up on logical port 3
A 06/17/06 04:23:44 scsitarg 71170009 Fibre Channel loop up on logical port 2
A 06/17/06 04:23:44 scsitarg 71170008 Fibre Channel loop down on logical port 2.
A 06/17/06 04:23:44 scsitarg 71170008 Fibre Channel loop down on logical port 3.
A 06/17/06 04:23:58 scsitarg 71170009 Fibre Channel loop up on logical port 3
A 06/17/06 04:23:58 scsitarg 71170008 Fibre Channel loop down on logical port 2.
A 06/17/06 04:23:58 scsitarg 71170008 Fibre Channel loop down on logical port 3.
A 06/17/06 04:23:58 scsitarg 71170009 Fibre Channel loop up on logical port 2

- LUNs being bound on Bus 0 Enc 0:
B 06/16/06 22:10:57 Bus0 Enc0 Dsk0 645 CRU Bound 0 ffff0028 ffe0000
B 06/16/06 22:10:57 Bus0 Enc0 Dsk6 645 CRU Bound 0 ffff0028 ffe0006
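The loop up/down churn above is easiest to see when the events are tallied per port. A minimal sketch of that tally, assuming navi event-log lines in the format shown in the excerpt (the `count_loop_events` helper and the regex are ours, not a FLARE or Navisphere tool):

```python
import re

# Matches navi event-log lines such as:
#   "A 06/17/06 04:23:44 scsitarg 71170008 Fibre Channel loop down on logical port 2."
LOOP_EVENT = re.compile(
    r"^(?P<sp>[AB]) (?P<date>\S+) (?P<time>\S+) scsitarg \S+ "
    r"Fibre Channel loop (?P<state>up|down) on logical port (?P<port>\d+)"
)

def count_loop_events(lines):
    """Tally loop up/down events per (SP, port, state) to spot flapping front-end ports."""
    counts = {}
    for line in lines:
        m = LOOP_EVENT.match(line)
        if m:
            key = (m.group("sp"), int(m.group("port")), m.group("state"))
            counts[key] = counts.get(key, 0) + 1
    return counts

# Three lines taken from the excerpt above.
sample = [
    "A 06/17/06 04:23:30 scsitarg 71170008 Fibre Channel loop down on logical port 2.",
    "A 06/17/06 04:23:44 scsitarg 71170009 Fibre Channel loop up on logical port 2",
    "A 06/17/06 04:23:44 scsitarg 71170008 Fibre Channel loop down on logical port 2.",
]
print(count_loop_events(sample))
```

Feeding an entire getlog file through this makes a bad port stand out as a large down count with matching up counts in a short window.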

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 72


EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

B 06/16/06 22:10:57 Bus0 Enc0 Dsk4 645 CRU Bound 0 ffff0028 ffe0004
B 06/16/06 22:10:57 Bus0 Enc0 Dsk5 645 CRU Bound 0 ffff0028 ffe0005
B 06/16/06 22:10:57 Bus0 Enc0 Dsk3 645 CRU Bound 0 ffff0028 ffe0003
B 06/16/06 22:10:57 Bus0 Enc0 Dsk2 645 CRU Bound 0 ffff0028 ffe0002
B 06/16/06 22:10:57 Bus0 Enc0 Dsk1 645 CRU Bound 0 ffff0028 ffe0001
B 06/16/06 22:11:57 Clones 71260123 Update driver properties request from remote array.
A 06/16/06 22:12:10 Bus0 Enc0 Dsk0 645 CRU Bound 0 ffff0028 ffe0000
A 06/16/06 22:12:10 Bus0 Enc0 Dsk6 645 CRU Bound 0 ffff0028 ffe0006
A 06/16/06 22:12:10 Bus0 Enc0 Dsk3 645 CRU Bound 0 ffff0028 ffe0003
A 06/16/06 22:12:10 Bus0 Enc0 Dsk4 645 CRU Bound 0 ffff0028 ffe0004
A 06/16/06 22:12:10 Bus0 Enc0 Dsk5 645 CRU Bound 0 ffff0028 ffe0005
A 06/16/06 22:12:10 Bus0 Enc0 Dsk2 645 CRU Bound 0 ffff0028 ffe0002
A 06/16/06 22:12:10 Bus0 Enc0 Dsk1 645 CRU Bound 0 ffff0028 ffe0001
A 06/16/06 22:12:11 4600 'BindRAID5LUN' called by 'admin' (10.xx.xx.xxx) on 'Subsystem NH_Target RAIDGroup 0' (Result: Success). LUN 4094 bound to Subsystem NH_Target RAIDGroup 0
A 06/16/06 22:12:11 Bus0 Enc0 Dsk0 60a A logical unit has been enabled [lun 4094] 0 ffff0028 ffe0000
A 06/16/06 22:12:44 Bus0 Enc0 Dsk0 621 Background Verify Started 0 ffff0028 ffe0000
A 06/16/06 22:12:45 Bus0 Enc0 Dsk0 622 Background Verify Complete 0 ffff0028 ffe0000
A 06/16/06 22:13:10 Clones 7126010e Set driver properties: AllocateWriteIntent flag = 1.

- Clone private LUNs being allocated:
A 06/16/06 22:13:13 4600 'Modify Clone Private LUNs' called by 'admin' (10.xx.xx.xxx) on 'CloneFeature' (Result: Success). Allocated Clone Private LUNs successfully. SP A: LUN 4095; SP B: LUN 4094.
B 06/17/06 03:12:36 Cpm 712a0010 Maximum number of concurrent copies was reset. (Concurrent copies: 8).
A 06/17/06 03:13:49 4600 'SetConcurrentSessionMax' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility' (Result: Success). Concurrent session max set to 8 on SP A
A 06/17/06 03:13:49 Cpm 712a0010 Maximum number of concurrent copies was reset. (Concurrent copies: 8).
A 06/17/06 03:13:49 4600 'SetConcurrentSessionMax' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility' (Result: Success). Concurrent session max set to 8 on SP B
B 06/17/06 03:29:29 scsitarg 71170009 Fibre Channel loop up on logical port 2
B 06/17/06 03:29:46 scsitarg 71170009 Fibre Channel loop up on logical port 3
B 06/17/06 03:31:25 scsitarg 71170009 Fibre Channel loop up on logical port 0
B 06/17/06 03:31:32 scsitarg 71170009 Fibre Channel loop up on logical port 1
A 06/17/06 03:34:58 scsitarg 71170009 Fibre Channel loop up on logical port 1
A 06/17/06 03:35:05 scsitarg 71170009 Fibre Channel loop up on logical port 0

- Data migration starting successfully:
B 06/17/06 03:37:12 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:38:26 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
A 06/17/06 03:38:26 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:38:26 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
B 06/17/06 03:39:41 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:40:54 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
A 06/17/06 03:40:54 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:40:54 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
B 06/17/06 03:41:22 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:42:35 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
A 06/17/06 03:42:35 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:42:35 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
B 06/17/06 03:43:32 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:44:46 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
A 06/17/06 03:44:46 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
A 06/17/06 03:44:46 4600 'PushInitiatorRecords' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.
A 06/17/06 03:50:46 4600 'Log In' called by 'admin' (10.xx.xx.xxx) on 'Security Service'. User has logged in.
A 06/17/06 03:50:46 4600 'Log In' called by 'admin' (10.xx.xx.xxx) on 'Security Service'. User has logged in.
A 06/17/06 03:50:47 4600 'Log In' called by 'admin' (10.xx.xx.xxx) on 'Security Service'. User has logged in.
B 06/17/06 03:51:56 4600 'VerifyDevice' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'.

- Had the following error from the SAN Copy Disk driver:
B 06/17/06 03:51:56 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016910600DA6, LUN 11 (SP B port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
A 06/17/06 03:53:53 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016110600DA6, LUN 5 (SP A port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
B 06/17/06 03:55:46 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016910600DA6, LUN 10 (SP B port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
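The SK/ASC/ASCQ triple 06/29/00 in the FEDsk entries decodes, per the standard SCSI sense tables, to UNIT ATTENTION / power on, reset, or bus device reset occurred, which is consistent with the port bounces and SP reboots in this case. A small decoder sketch (the lookup tables below cover only the codes seen here; the helper name is ours):

```python
# Subset of the standard SCSI sense-key and ASC/ASCQ tables (SPC); only the
# codes appearing in this case are listed.
SENSE_KEYS = {
    0x00: "NO SENSE",
    0x02: "NOT READY",
    0x06: "UNIT ATTENTION",
}
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
}

def decode_sense(triple):
    """Decode an 'SK/ASC/ASCQ' string such as '06/29/00' from the event log."""
    sk, asc, ascq = (int(part, 16) for part in triple.split("/"))
    key = SENSE_KEYS.get(sk, "UNKNOWN SENSE KEY 0x%02x" % sk)
    detail = ASC_ASCQ.get((asc, ascq), "UNKNOWN ASC/ASCQ")
    return "%s: %s" % (key, detail)

print(decode_sense("06/29/00"))
# -> UNIT ATTENTION: POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
```

A unit attention with this ASC/ASCQ is informational: the target is telling the initiator that a reset occurred, and the retried I/O normally succeeds.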


A 06/17/06 03:58:36 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016110600DA6, LUN 4 (SP A port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
B 06/17/06 04:00:05 4600 'VerifyDevice' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility'
B 06/17/06 04:00:05 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016910600DA6, LUN 3 (SP B port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
B 06/17/06 04:00:26 4600 'CreateCopyDescriptorVersion4' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility' (Result: Success). Copy descriptor name.com:5 created.
A 06/17/06 04:00:36 4600 'CreateCopyDescriptorVersion4' called by 'admin' (10.xx.xx.xxx) on 'Data Migration Facility' (Result: Success). Copy descriptor 4 created.
A 06/17/06 04:00:36 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016110600DA6, LUN 9 (SP A port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
B 06/17/06 04:01:27 NaviAgent 712a0030 Failing Command: K10CpmAdmin DBid 0 Op 6.
B 06/17/06 04:07:35 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016910600DA6, LUN 7 (SP B port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
A 06/17/06 04:10:16 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016110600DA6, LUN 2 (SP A port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.
A 06/17/06 04:15:56 FEDsk 712b0011 SANCopy Disk Driver: A non-fatal SCSI error occurred on target 5006016110600DA6, LUN 0 (SP A port 1): SCSI status 02, SK/ASC/ASCQ 06/29/00. The I/O request will be retried.

San Copy - Host I/O failed with MV and host I/O running with 200ms and SPA rebooted

SPA_navi_getlog.txt before each panic/reboot reports the errors "Transact request not executed. Reason: Unable to obtain ownership of transact mutex." and "SANCopy Disk Driver: A non-fatal I/O error." SAN Copy has implemented a workaround in R24 for the "request lost in scsiport" issue. Found that the CMID driver has been replaced to deal with another panic issue. The fix for "panic on transient link failures" is in R24, so if that fix is required then the array should be upgraded. CMID will declare lost contact at 20 seconds for a remote connection, but first the scsiport requests must return (they are issued with a 10-second timeout). So this is probably yet another case where scsiport loses a request. There is no evidence of link/fibre issues, there are no traces from CMI, and the FAR timer expired because it had held a lock for too long. In addition, FAR complains that CMI messages are slow. The LastAck of 43.8 indicates that scsiport has lost the message. It is recommended to upgrade to R24, or to the Q4 patch for R22.

San Copy - LUN 23 corrupted

Array Name: 'name'  Serial No: APM000xxxxxxxx
Time on SP A: 10/19/06 15:43:49  Time on SP B: 10/19/06 15:43:49
Revision: 2.19.700.5.030  Revision: 2.19.700.5.030
Serial Number For The SP: LKE00044901456  Serial Number For The SP: LKE00045000409
Storage Processor IP Address: 10.xx.xx.xxx  Storage Processor IP Address: 10.xx.xx.xxx
System Fault LED: OFF
WRITE CACHE: ENABLED  READ CACHE: ENABLED  WRITE CACHE: ENABLED  READ CACHE: ENABLED
Total Number of Disk Enclosures: 15
ATA ENC list: DAE2-ATA Bus 2 Enc 1, DAE2-ATA Bus 3 Enc 1
Total disks reported by SPA: 211  Total disks reported by SPB: 211

DU due to not being able to mount ALU 23 in the W2K3 SP1 environment. Review of the logs shows that the only events are MV/A and SAN Copy events: no backend errors, no drive errors. LUN 23 is the source LUN. Need confirmation that the array is OK and that LUN 23 is OK from the array's point of view. Also want to make sure MV/A is not holding up LUN 23 and preventing it from being accessed.
B 10/19/06 13:08:40 Bus3 Enc0 Dsk0 606 Unit Shutdown for Trespass [lun 23] 0 7001e 170168
B 10/19/06 13:08:41 SnapCopy 71000016 Incremental Session FAR_23 completing Full copy, LUN 23 (Disk0005).
B 10/19/06 13:08:41 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
B 10/19/06 13:08:41 Cpm 712a8017 The source device specified in the session failed. (WWN: 60060160311413009A1EBD76E65EDB11)
B 10/19/06 13:08:55 MirrorView_A 71520000 Info:Initiating trespass for FAR_23_FAR_37 this SP
B 10/19/06 13:08:55 SnapCopy 71000015 Incremental Session FAR_23 starting Full copy, LUN 23 (Disk0005).


B 10/19/06 13:08:55 Cpm 712a0033 Copy Command 0x"###" (DId: 23, Originator: 2) is using a Buffer Count of 7 and a Buffer Size of 128 blocks.
B 10/19/06 13:08:55 MirrorView_A 71520000 Info: Restarted trespassed session 23 for mirror FAR_23_FAR_37
B 10/19/06 14:03:27 Bus3 Enc0 Dsk0 606 Unit Shutdown for Trespass [lun 23] 0 7001e 170168
B 10/19/06 14:03:28 SnapCopy 71000016 Incremental Session FAR_23 completing Full copy, LUN 23 (Disk0005).
B 10/19/06 14:03:28 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
B 10/19/06 14:03:28 Cpm 712a8017 The source device specified in the session failed. (WWN: 60060160311413009A1EBD76E65EDB11)
B 10/19/06 14:03:30 MirrorView_A 71520000 Info:Initiating trespass for FAR_23_FAR_37 this SP
B 10/19/06 14:03:30 MirrorView_A 71520000 Info:Restarted trespassed session 23 for mirror FAR_23_FAR_37
B 10/19/06 14:03:31 Cpm 712a0033 Copy Command 0x"###" (DId: 23, Originator: 2) is using a Buffer Count of 7 and a Buffer Size of 128 blocks.
B 10/19/06 14:03:31 SnapCopy 71000015 Incremental Session FAR_23 starting Full copy, LUN 23 (Disk0005).
B 10/19/06 14:41:08 4600 'ConnectHosts' called by 'admin' (10.x.xx.xx) on 'name' (Result: Success). Host(s) "name" connected to "name"
B 10/19/06 14:41:29 Bus3 Enc0 Dsk0 606 Unit Shutdown for Trespass [lun 23] 0 7001e 170168
B 10/19/06 14:41:29 Cpm 712a8017 The source device specified in the session failed. (WWN: 60060160311413009A1EBD76E65EDB11)
B 10/19/06 14:41:29 SnapCopy 71000016 Incremental Session FAR_23 completing Full copy, LUN 23 (Disk0005).
B 10/19/06 14:41:30 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
B 10/19/06 14:41:33 MirrorView_A 71520000 Info:Initiating trespass for FAR_23_FAR_37 this SP
B 10/19/06 14:41:34 MirrorView_A 71520000 Info:Restarted trespassed session 23 for mirror FAR_23_FAR_37
B 10/19/06 14:41:34 SnapCopy 71000015 Incremental Session FAR_23 starting Full copy, LUN 23 (Disk0005).
B 10/19/06 14:41:34 Cpm 712a0033 Copy Command 0x"###" (DId: 23, Originator: 2) is using a Buffer Count of 7 and a Buffer Size of 128 blocks.
A 10/19/06 14:42:17 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
A 10/19/06 14:42:18 Bus3 Enc0 Dsk0 606 Unit Shutdown for Trespass [lun 23] 0 7001e 170168
A 10/19/06 14:42:21 NaviAgent 1 EV_Agent::Process--Cannot service request --unqualified user:User: root(0) from: name.com(160.xxx.x.xx)
A 10/19/06 14:42:21 NaviAgent 1 EV_Agent::Process--Cannot service request --unqualified user:User: root(0) from: name.com(160.xxx.x.xx)
A 10/19/06 14:42:21 NaviAgent 1 EV_Agent::Process--Cannot service request --unqualified user:User: root(0) from: name.com(160.xxx.x.xx)
B 10/19/06 14:45:24 Bus3 Enc0 Dsk0 606 Unit Shutdown for Trespass [lun 23] 0 7001e 170168
B 10/19/06 14:45:25 SnapCopy 71000016 Incremental Session FAR_23 completing Full copy, LUN 23 (Disk0005).
B 10/19/06 14:45:25 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
B 10/19/06 14:45:25 Cpm 712a8017 The source device specified in the session failed. (WWN: 60060160311413009A1EBD76E65EDB11)
B 10/19/06 14:45:34 MirrorView_A 71520000 Info:Initiating trespass for FAR_23_FAR_37 this SP
B 10/19/06 14:45:34 MirrorView_A 71520000 Info:Restarted trespassed session 23 for mirror FAR_23_FAR_37

Upon examination, the array looks very clean: all LUNs are owned and the drive map is not corrupt. There are no 801 Soft SCSI Bus messages. The logs only go back to 10/12/2006 because there are many messages from SnapView from what looks like normal operation. There is no indication that there would be any sort of corruption from an array perspective. Also, ALU 23 was not in storage group "name".

Storage Group Name: "name"
Storage Group UID: A6:77:AC:B4:D8:DA:DA:11:A2:13:08:00:1B:43:A4:81
HBA/SP Pairs:
HBA UID                                          SP Name  SP Port
-----------------------------------------------  -------  -------
20:00:00:00:C9:58:26:C9:10:00:00:00:C9:58:26:C9  SP B     2
20:00:00:00:C9:58:26:59:10:00:00:00:C9:58:26:59  SP A     2
20:00:00:00:C9:58:26:59:10:00:00:00:C9:58:26:59  SP B     1
20:00:00:00:C9:58:26:C9:10:00:00:00:C9:58:26:C9  SP A     1
HLU/ALU Pairs:
HLU Number  ALU Number
----------  ----------
0           20
1           21
2           66

It looks like an incremental session started but there are no completion messages. It also looks like LUN 23 was removed from the storage group.

B 10/19/06 14:45:25 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
B 10/19/06 14:45:35 SnapCopy 71000015 Incremental Session FAR_23 starting Full copy, LUN 23 (Disk0005).
A 10/19/06 14:46:12 Bus3 Enc0 Dsk0 60a A logical unit has been enabled [lun 23] 0 7001e 170168
A 10/19/06 14:46:14 Bus3 Enc0 Dsk0 606 Unit Shutdown for Trespass [lun 23] 0 7001e 170168
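The key observation in the storage group listing is which ALUs are actually mapped. A trivial sketch of that membership check against the HLU/ALU pairs shown above (the `alu_present` helper is ours, not navicli output):

```python
def alu_present(hlu_alu_pairs, alu):
    """Return the HLU mapped to the given ALU, or None if the ALU is not
    presented by this storage group."""
    for hlu, a in hlu_alu_pairs:
        if a == alu:
            return hlu
    return None

# HLU/ALU pairs from the storage group listing in this case.
pairs = [(0, 20), (1, 21), (2, 66)]

print(alu_present(pairs, 23))   # ALU 23 is not in the group
print(alu_present(pairs, 21))   # ALU 21 is presented as HLU 1
```

This matches the note above: ALU 23 had been removed from the storage group, so the host's hang on inquiry was not an array-side presentation problem for the remaining LUNs.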


B 10/19/06 14:51:47 4600 'RemoveLUs' called by 'admin' (10.x.xx.xx) on 'name' (Result: Success). LUN(s) LUN 21 LUN 66 LUN 20 LUN 23 removed from "name"
B 10/19/06 15:05:36 4600 'RemoveLUs' called by 'admin' (10.x.xx.xx) on "name" (Result: Success). LUN(s) LUN 23 removed from "name"
B 10/19/06 15:08:05 4600 'RemoveLUs' called by 'admin' (10.x.xx.xx) on "name" (Result: Success). LUN(s) LUN 23 removed from "name"

The customer called in stating that their server hangs when running an inquiry on ALU 23. Review of the array logs shows the array is operating normally: no faults, no backend errors, no media errors. LUN 23 is ENA. Engineering requested a trespass of LUN 23 to confirm that it can be assigned. I was able to trespass LUN 23 to SP A, back to SP B, to SP A, and then back to SP B. The customer thinks ALU 23 became corrupt after the host shut down abnormally. The Windows host can see the LUN, but diskscan hangs at 100% CPU utilization in Task Manager. Performed a healthcheck on the array and everything looked fine. All LUNs are owned and there is no DU from an array perspective. Checked for a stuck trespass early on because the host had full access to three other LUNs in the same storage group; ALU 23 trespassed from one SP to the other without any problems, so a stuck trespass was ruled out.

SnapView - Snapsession failure during a trespass

Array Name: 'name'  Serial No: APM000xxxxxxxx
Time on SP A: 07/24/06 23:58:44  Time on SP B: 07/24/06 23:58:44
Revision: 2.19.600.5.030  Revision: 2.19.600.5.030
Serial Number For The SP: LKE00034700731  Serial Number For The SP: LKE00024400393
Storage Processor IP Address: 10.x.x.xxx  Storage Processor IP Address: 10.x.x.xxx
System Fault LED: OFF
Total Number of Disk Enclosures: 16

A 07/23/06 03:26:09 SnapCopy 71004007 SnapView Session bk_0722062001 has been stopped due to an error during the trespass of a LUN.
A 07/23/06 03:26:09 SnapCopy 7100808c Lun 24 terminating SnapView usage status c000020f.

The LUN in question is LUN 24.

The customer has asked why their SnapView session for LUN 24 was stopped. The session was terminated at the same time that an issue was seen on the backend. Immediately prior to the Snap session being terminated, a trespass was initiated for SnapView source LUN 24. This trespass completed successfully. However, when a source LUN is trespassed from one SP to the other, its Reserved LUN Pool (RLP) LUNs are also trespassed. It is believed that the trespass of the reserved LUN pool LUN failed, and this failure is what caused the SnapSessions to terminate. We see that at the same time the source LUN was trespassing (and therefore the RLP LUN(s) would be trespassing), drive 1-6-0 was shut down and was being swapped out to a Hot Spare. Drive 1-6-0 is in RG 24, which contains thirteen LUNs that are Reserved LUN Pool LUNs. Although the logs do not tell us which RLP LUNs were associated with LUN 24 / Snap Session bk_0722062001, we expect that LUN 24 was using RLP LUN(s) from this raid group at that particular time. A bug has been identified in the FLARE code where a trespass of a LUN will not succeed if it occurs at the same time that a Hot Spare is swapping in for a drive in that raid group. Since a Hot Spare was swapping into RG 24, which contains the RLP LUNs, the trespass of the RLP LUN(s) could not complete successfully, causing the SnapSession to terminate. This bug is the result of a small timing window and has only been seen at one other customer site. A fix for this bug is planned for the next patch release, which would be in the end-of-2006/early-2007 timeframe. Since the SPCollects do not contain the ktraces from the time of the incident, we cannot be absolutely certain as to the problem encountered, but the above scenario is believed to be the most likely.
Details from Logs
B 07/23/06 03:25:34 Bus1 Enc6 Dsk0 a07 CRU Powered Down [CM: Killed by CM] 0 0 920c
B 07/23/06 03:25:47 Bus0 Enc2 Dsk7 606 Unit Shutdown for Trespass [lun 24] 0 7000d 180025
B 07/23/06 03:25:47 Bus0 Enc2 Dsk7 606 Unit Shutdown for Trespass [lun 24] 0 7000d 180025
A 07/23/06 03:26:08 Bus1 Enc6 Dsk0 a07 CRU Powered Down [DH: Bad CDB] 0 0 d0d
A 07/23/06 03:26:09 SnapCopy 71004007 SnapView Session bk_0722062001 has been stopped due to an error during the trespass of a LUN.
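The root-cause argument above rests on the trespass landing inside the hot-spare swap window. A minimal sketch of that timestamp check, using the times from the log details (the `overlaps` helper is ours; the log format uses naive local timestamps):

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"

def overlaps(window_start, window_end, event_time):
    """True if an event timestamp falls inside a time window (all naive datetimes)."""
    return window_start <= event_time <= window_end

# From the log excerpt above: drive 1-6-0 powered down at 03:25:34 and the
# hot spare finished swapping in at 03:26:09; the LUN 24 trespass fired at
# 03:25:47, inside the swap window.
swap_start = datetime.strptime("07/23/06 03:25:34", FMT)
swap_end = datetime.strptime("07/23/06 03:26:09", FMT)
trespass = datetime.strptime("07/23/06 03:25:47", FMT)

print(overlaps(swap_start, swap_end, trespass))
# -> True
```

When reviewing similar cases, the same comparison over the CRU Powered Down / Hot Spare events and the Unit Shutdown for Trespass events quickly confirms or rules out this timing window.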


A 07/23/06 03:26:09 Bus1 Enc6 Dsk0 6a2 Hot Spare is now replacing a failed drive. 0 d2 b
A 07/23/06 03:26:09 SnapCopy 7100808c Lun 24 terminating SnapView usage status c000020f.
A 07/23/06 03:26:11 Bus0 Enc2 Dsk7 60a A logical unit has been enabled [lun 24] 0 7000d 180025
A 07/23/06 03:26:22 Bus0 Enc2 Dsk7 60a A logical unit has been enabled [lun 24] 0 7000d 180025
A 07/23/06 03:26:22 Bus0 Enc2 Dsk7 60a A logical unit has been enabled [lun 24] 0 7000d 180025

SnapView - Unable to delete LUNs that were part of a mirror

The customer created a snapshot of the unwanted LUNs and then deleted them, but was still unable to unbind them. As seen from the stack of the LUNs the customer wants to unbind, layered drivers are still present. The following setfeature command was tried to remove the SnapCopy and Rollback drivers from the LUN stack, but it failed:

navicli -h SP_IP snapview -setfeature -off -lun lun#

Check the array status online and accordingly use the setfeature -off command to remove the drivers from the LUN stack. The commands are as follows.

Navicli command from the array:
C:\>navicli -h <SPIP> -user <username> -password <password> -scope <0/1> setfeature -messner -off -feature <Layered Driver> -lun <Lun#>

Javacli command off the array:
java -jar navicli.jar -h <SPIP> -user <username> -password <password> -setfeature -off -feature <Layered Driver> -lun <Lun#>

Engage the appropriate Layered Driver Group, as there may be layered driver dependencies involved which could result in Device Map issues.

DETAILS: Below is the list of LUNs along with the listed drivers.

LOGICAL UNIT NUMBER 71 - Listed Driver: K10RollBackAdmin, K10FarAdmin, K10SnapCopyAdmin
LOGICAL UNIT NUMBER 11 - Listed Driver: K10RollBackAdmin, K10FarAdmin, K10SnapCopyAdmin
LOGICAL UNIT NUMBER 101 - Listed Driver: K10RollBackAdmin, K10FarAdmin, K10SnapCopyAdmin
LOGICAL UNIT NUMBER 61 - Listed Driver: K10RollBackAdmin, K10FarAdmin, K10SnapCopyAdmin
LOGICAL UNIT NUMBER 81 - Listed Driver: K10RollBackAdmin, K10FarAdmin, K10SnapCopyAdmin
LOGICAL UNIT NUMBER 41 - Listed Driver: K10RollBackAdmin, K10FarAdmin, K10SnapCopyAdmin

Array Name: 'name'  Serial No: APM000xxxxxxxx
Time on SP A: 07/24/06 22:08:59  Time on SP B: 07/24/06 22:08:59
Revision: 2.19.700.5.030  Revision: 2.19.700.5.030
Serial Number For The SP: LKE00060100213  Serial Number For The SP: LKE00060100858
Storage Processor IP Address: 131.xxx.xxx.xx  Storage Processor IP Address: 131.xxx.xxx.xx
System Fault LED: OFF
WRITE CACHE: ENABLED  READ CACHE: ENABLED  WRITE CACHE: ENABLED  READ CACHE: ENABLED
Total Number of Disk Enclosures: 11
DAE2-4P Stiletto ENC list: DAE2P Bus 0 Enc 0, DAE2P Bus 1 Enc 0
ATA ENC list: DAE2-ATA Bus 0 Enc 1, DAE2-ATA Bus 0 Enc 2, DAE2-ATA Bus 1 Enc 1, DAE2-ATA Bus 1 Enc 2, DAE2-ATA Bus 2 Enc 0, DAE2-ATA Bus 2 Enc 1, DAE2-ATA Bus 2 Enc 2, DAE2-ATA Bus 3 Enc 0, DAE2-ATA Bus 3 Enc 1
Total disks reported by SPA: 165  Total disks reported by SPB: 165

Unable to unbind some LUNs (LUNs 41, 81, 101, 61, 11, 71) which were involved in layered apps that have since been destroyed. The error received by the customer is: "Being used by a feature of the storage system". The command "navicli -h SP_IP snapview -setfeature -off -lun lun#" failed. The customer did run "navicli snapview -listsnapableluns" and sees the ones that she wants to delete. She created a snapshot on these LUNs and then deleted them, but was still unable to unbind. The driver stack on the LUNs tells us that those LUNs are still in use by an async mirror.

So, the way to remove the async mirror is:

C:\Program Files\EMC\Navisphere CLI>java -jar navicli.jar -address <ip of SP> -user admin -password swqa mirror -async -setfeature -off -lun <lun>

They should then be able to unbind the LUN.
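Since the same command must be repeated for each stuck LUN, a small generator can spell out the full set. This sketch only builds the command strings quoted in the case notes above; the `async_mirror_off_cmd` helper and the credential placeholders are ours:

```python
# LUNs whose stack still lists K10FarAdmin (the async-mirror driver), per the
# DETAILS list in this case.
LUNS = [11, 41, 61, 71, 81, 101]

def async_mirror_off_cmd(sp_ip, lun):
    """Build the javacli command from the case notes that strips the
    async-mirror driver from a LUN's stack (credentials are placeholders)."""
    return ("java -jar navicli.jar -address {ip} -user <username> "
            "-password <password> mirror -async -setfeature -off -lun {lun}"
            ).format(ip=sp_ip, lun=lun)

for lun in LUNS:
    print(async_mirror_off_cmd("<SP_IP>", lun))
```

Run the emitted commands one LUN at a time and confirm with a fresh driver-stack listing that K10FarAdmin is gone before attempting the unbind.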


SnapView - SP Bugcheck 0x000000d1, 0x00000002, 0x00000000, 0x00000000

Array Name: 'name'  Serial No: CK2000xxxxxxxx
Time on SP A: 09/17/06 00:39:04  Time on SP B: 09/17/06 00:39:04
Revision: 2.19.700.5.034  Revision: 2.19.700.5.034
Serial Number For The SP: LKE00040200021  Serial Number For The SP: LKE00042902392
Storage Processor IP Address: 10.x.x.xx  Storage Processor IP Address: 10.x.x.xx
System Fault LED: OFF
WRITE CACHE: ENABLED  READ CACHE: ENABLED  WRITE CACHE: ENABLED  READ CACHE: ENABLED
Total Number of Disk Enclosures: 10
Total disks reported by SPA: 141  Total disks reported by SPB: 141

Dump analysis: DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1): an attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If a kernel debugger is available, get the stack backtrace.

0: kd> !ktrace
Boot 2006/08/10 13:52:09.718 stamp 00f4369cbc
DATE: 2006/09/17 07:52:03.012
07:52:03.012 82140020 SNAPDRVR:SnapCopySessionSetSessionStopPattern() -> Session srvnt409-000-302 is persistent, write deadbeef
07:52:03.018 82140020 SNAPDRVR:SetSessionStopPattern() -> Set SessionStoppingPattern bit 0
07:52:03.018 83312020 PSMSYS:psmDataAreaWrite() 2:126 O=94216 L=4096 01 01 73 72 76 6e 74 34 30 39 2d 30
07:52:03.109 82140020 SNAPDRVR:StopOnTarget() -> LUN 302 Device \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.109 82140020 SNAPDRVR:StopOnTarget() -> Session srvnt409-000-302, Index 0
07:52:03.109 82140020 SNAPDRVR:StopOnTarget() -> Session srvnt409-000-302 is persistent, write to PSM!
07:52:03.111 82140020 SNAPDRVR:StopOnTarget() -> SessionStoppingPattern bit 0 is set
07:52:03.112 83312020 PSMSYS:psmDataAreaWrite() 2:126 O=94216 L=4096 01 00 00 00 00 00 00 00 00 00 00 00
07:52:03.155 82140020 SNAPDRVR:DecrementNonIscSessionCount() New Non-ISC-Session Count = 0x27
07:52:03.155 83312020 PSMSYS:psmDataAreaWrite() 2:124 O=0 L=4096 c8 00 00 00 00 00 00 00 ff ff ff ff
07:52:03.170 82140020 SNAPDRVR:CacheScheduleScavenge() Scheduling
07:52:03.170 82140020 SNAPDRVR:ThreadMgmt() -> Processing SC_THREAD_CMD_SESSION_SCAVENGE
07:52:03.170 82140020 SNAPDRVR:CacheScavenge() (Before) ScavengesScheduled 1
07:52:03.170 82140020 SNAPDRVR:CacheScavenge() QUICK
07:52:03.170 82140020 SNAPDRVR:TrespassSeizeCaches() CacheSeizures = 1
07:52:03.170 82140020 SNAPDRVR: \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.170 82140020 SNAPDRVR:TrespassSeizeCaches() Locking \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.170 82140020 SNAPDRVR:TrespassSeizeCaches() SUCCESS \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.170 82140020 SNAPDRVR:CacheScavenge() Caches Seized UnAssigning
07:52:03.174 82140020 SNAPDRVR:CacheScavenge() NO SESSION, setting Ignore DLM
07:52:03.174 82140020 SNAPDRVR:CacheScavenge() Checking Cache \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:VMUnAssignCacheLU() \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:ChunkGetRequsitionedChunks() SEARCH_AFTER_SCAVENGE
07:52:03.174 82140020 SNAPDRVR:ChunkVetEmptyCache() Ignoring Status 0xa1004027
07:52:03.174 82140020 SNAPDRVR:VMUnAssignCache() REMOVING \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:ChunkUnAssignEmptyCache() Chunks 0 - 62865
07:52:03.174 82140020 SNAPDRVR:ChunkDeleteCacheParticulars() Cache List Count 1
07:52:03.174 82140020 SNAPDRVR:ChunkDeleteCacheParticulars() FREEING 0xB0B5E600
07:52:03.174 82140020 SNAPDRVR:ChunkDeleteCacheParticulars() Cache List Count 0
07:52:03.174 82140020 SNAPDRVR:VMUnAssignCache() REMOVING \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:CacheScavenge() Freeing Cache: \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:CacheScavenge() Clearing Cache Bit 23
07:52:03.174 82140020 SNAPDRVR:CacheDLSSurrenderPermit() Permit 0xB0B588A0 \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:CacheDLSSurrenderPermit() SURRENDERING 0xB0B588A0
07:52:03.174 82140020 SNAPDRVR:CacheDLSSurrenderPermit() Permit 0xB0B588A0 FREE
07:52:03.174 82140020 SNAPDRVR:CacheScavenge() Removing pTargetCacheListEntry for
07:52:03.174 82140020 SNAPDRVR: TARGET: \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.174 82140020 SNAPDRVR: CACHE : \Device\CLARiiONdisk98
07:52:03.174 82140020 SNAPDRVR:CacheCalculateChunks() CacheLuSize 0x00010000 Regions
07:52:03.174 83312020 PSMSYS:psmDataAreaWrite() 2:126 O=94216 L=4096 01 00 00 00 00 00 00 00 00 00 00 00
07:52:03.221 82140020 SNAPDRVR:CacheScavenge() Caches Release UnAssigning
07:52:03.221 82140020 SNAPDRVR:TrespassReleaseCaches() 1 Seizures for \Device\CLARiiON\SnapCopy\Disk0023


07:52:03.221 82140020 SNAPDRVR:TrespassReleaseCaches() \Device\CLARiiON\SnapCopy\Disk0023 CacheSeizures =0
07:52:03.221 82140020 SNAPDRVR:TrespassReleaseCaches() Unlocking \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.221 82140020 SNAPDRVR:TrespassReleaseCaches() = 0x00000000 Target = \Device\CLARiiON\SnapCopy\Disk0023
07:52:03.221 82140020 SNAPDRVR:CacheScavenge() QUICK succeeded
07:52:03.221 82140020 SNAPDRVR:CacheScavenge() Found no caches; New UID 1928
07:52:03.221 82140020 SNAPDRVR:CacheScavenge() (After) ScavengesScheduled 0
07:52:03.221 82140020 SNAPDRVR:CacheScavenge() QUICK UnAssign
07:52:03.264 82758b30 SNAPDRVR:Page::CacheMissCallback I/O Error 0xa1004034
07:52:04.265 82758b30 FCDMTL 5 (BE0) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 9 (FE2/SC) Suppressed all but 2 out of 30 identical log entries
07:52:04.265 82758b30 FCDMTL 9 (FE2/SC) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 8 (FE3/MV) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 7 (FE0/SC) Suppressed all but 2 out of 3106 identical log entries
07:52:04.265 82758b30 FCDMTL 7 (FE0/SC) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 6 (FE1/SC) Suppressed all but 2 out of 40 identical log entries
07:52:04.265 82758b30 FCDMTL 6 (FE1/SC) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 0 (PP1) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 4 (BE1) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 3 (BE3) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 2 (PP0) Bugcheck callback executed
07:52:04.265 82758b30 FCDMTL 1 (BE2) Suppressed all but 2 out of 6 identical log entries
07:52:04.265 82758b30 FCDMTL 1 (BE2) Bugcheck callback executed

SPA rebooted with bugcheck 0x0000000a. The panic is IRQL_NOT_LESS_OR_EQUAL, which means an attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses.
B 09/17/06 02:13:22 Save Dump 2183 Reboot from bugcheck: 0x000000d1 (0x00000000, 0x00000002, 0x00000000, 0x00000000).
- "SPA bugchecked twice with 0x0000000a (0x00000000, 0x00000002, 0x00000000, 0x00000000)". The panic is in snapcopy.sys.

This defect is fixed in R26. What happens in this panic is that the first call to SnapCopyPageDiskCacheMissCallback() called a function named InvalidatePagedRegions(). This had the effect of dumping pages (because the session took an error and was about to stop anyway) and putting them back on the free queue. However, if the CacheMissQueue still had a page on it, then that page is now on two queues. If another process allocates this page off the free queue while we are trying to pop the recursion below, then the page being operated on in this stack is corrupted. The way it manifests itself is that the page's callback routine ends up being zeroed out and we attempt to call NULL (which is the failed instruction address seen in the dump). A customer can reduce the likelihood of this failure by reducing the load on the vault disks, which hold most of the RLP LUNs on this RG as well as several heavily used source LUNs.
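The defect described above boils down to one page object sitting on two queues at once. A toy Python model of that failure mode (this is illustrative only, not FLARE code; the `Page` class and queue names are invented to mirror the dump analysis):

```python
class Page:
    """Toy stand-in for a paged-cache entry with a completion callback."""
    def __init__(self):
        self.callback = None

free_queue = []
cache_miss_queue = []

page = Page()
page.callback = lambda: "cache-miss serviced"
cache_miss_queue.append(page)   # waiting for its cache-miss I/O to complete

# The InvalidatePagedRegions() step dumps the page back to the free queue
# even though it is still on the cache-miss queue: one object, two queues.
free_queue.append(page)

# Another session allocates the "free" page and reinitializes it, zeroing
# the callback field of the object the first path still holds.
reused = free_queue.pop()
reused.callback = None

# The original cache-miss path later pops what it believes is its page and
# invokes the callback, which is now NULL: the zeroed failed-instruction
# address seen in the dump.
stale = cache_miss_queue.pop()
try:
    stale.callback()
except TypeError:
    print("called a NULL callback")
```

The same aliasing can only happen inside a narrow window between the invalidate and the completion pop, which is why the panic needs heavy concurrent load on the RLP raid group to reproduce.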
RaidGroup ID: 9
RaidGroup Type: r5
Drive Type: Fibre Channel
RaidGroup State: Explicit_Remove
List of disks: Bus 0 Enc 0 Disk 4, Bus 0 Enc 0 Disk 3, Bus 0 Enc 0 Disk 2, Bus 0 Enc 0 Disk 1, Bus 0 Enc 0 Disk 0
List of luns: 4 5 221 810 180 802 814 818 822 826 830 804 834 838 842 846 850 854 858 862 866 870 874 878 882 886 190 2005 2006 200 220 86 87 63 64 2003 2004 4072 80 60

SnapView - Bugcheck 0xe111805f (0x81ff6c48, 0x00000000, 0x00000000, 0x000003cd)

Array Name: A-APM000xxxxxx
Serial No: APM000xxxxxx
Time on SP A: 10/20/06 06:58:44          Time on SP B: 10/20/06 06:58:44
Revision: 2.19.600.5.027                 Revision: 2.19.600.5.027
Serial Number For The SP: LKE00024500243 Serial Number For The SP: LKE00031500156
Storage Processor IP Address: 172.xx.x.xx Storage Processor IP Address: 172.xx.x.xx
System Fault LED: OFF
WRITE CACHE: ENABLED READ CACHE: ENABLED WRITE CACHE: ENABLED READ CACHE: ENABLED

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 79

Page 81: 58348378 CL Troubleshooting 2ndEdition B03

EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

Total Number of Disk Enclosures: 12
ATA ENC list: DAE2-ATA Bus 0 Enclosure 5
Total disks reported by SPA: 162
Total disks reported by SPB: 162

A 10/20/06 05:56:26 Save Dump 2183 Reboot from bugcheck: 0xe111805f (0x81ff6c48, 0x00000000, 0x00000000, 0x000003cd). [BugcheckCode: e111805f Definition: DLS_BUGCHECK_EXECUTIONER_LOCK_REQUEST_BY_CABALID_EXPIRED]
A 10/20/06 06:04:38 DumpManager 41004100 Created Compressed Dump C:\dumps\SPA_APM00024100036_ad0f3_10-20-2006_05-18-39.dmp.zip

0: kd> !ktrace
Boot 2006/10/20 00:46:50.203 stamp 001823d26c
DATE: 2006/10/20 01:00:40.237
01:00:40.237 PSMSYS:psmDataAreaWrite() 1:30 O=458760 L=65536 00 00 00 00 00 00 00 00 00 00 00 00
01:00:40.243 SNAPDRVR:ThreadMain():SnapCopyTrespassAbdicate( LUN 32 )
01:00:40.243 SNAPDRVR:TrespassAbdicate() Deferring \Device\CLARiiON\SnapCopy\Disk0041; Scavenge
01:00:40.412 PSMSYS:psmDataAreaWrite() 1:30 O=524296 L=65536 00 00 00 00 00 00 00 00 00 00 00 00
01:00:40.650 SNAPDRVR:ThreadMain():SnapCopyTrespassAbdicate( LUN 49 )
01:00:40.650 SNAPDRVR:TrespassAbdicate() Deferring \Device\CLARiiON\SnapCopy\Disk0034; Scavenge
01:00:40.657 FCDMTL 6 (FE1) Abort received of type 31031 from loop ID 5. of exchange 0006
01:00:40.659 FCDMTL 6 (FE1) IOCTL DECREMENT_EVENT_COUNT to 0
01:00:40.758 PSMSYS:psmDataAreaWrite() 1:30 O=589832 L=65536 00 00 00 00 00 00 00 00 00 00 00 00
01:00:41.056 SNAPDRVR:ThreadMain():SnapCopyTrespassAbdicate( LUN 194 )
01:00:41.056 SNAPDRVR:TrespassAbdicate() Deferring \Device\CLARiiON\SnapCopy\Disk0009; Scavenge
01:00:41.127 PSMSYS:psmDataAreaWrite() 1:30 O=655368 L=65536 00 00 00 00 00 00 00 00 00 00 00 00
01:00:41.161 FCDMTL 7 (FE0) Abort received of type 31031 from loop ID 11. of exchange 81fc
01:00:41.162 FCDMTL 7 (FE0) Abort received of type 31031 from loop ID 11. of exchange 804c
01:00:41.162 FCDMTL 7 (FE0) IOCTL DECREMENT_EVENT_COUNT to 1
01:00:41.162 FCDMTL 7 (FE0) IOCTL DECREMENT_EVENT_COUNT to 0
01:00:41.174 FCDMTL 7 (FE0) Abort received of type 31031 from loop ID 8. of exchange 3440
01:00:41.175 FCDMTL 7 (FE0) IOCTL DECREMENT_EVENT_COUNT to 0
01:00:41.197 SNAPDRVR:ThreadMain():SnapCopyTrespassAbdicate( LUN 192 )
01:00:41.197 SNAPDRVR:TrespassAbdicate() Deferring \Device\CLARiiON\SnapCopy\Disk0003; Scavenge
01:00:41.202 TDD: IOCTL_FLARE_TRESPASS_QUERY returned 0xE1268548 for DLU B6B8E519504BD711.
01:00:41.202 FCDMTL 7 (FE0) Target command check condition: loopID = 12., SK = 05, ASC/Q = 0400
01:00:41.202 FCDMTL 7 (FE0) Target command check condition: loopID = 12., SK = 05, ASC/Q = 2501
01:00:41.259 SNAPDRVR:ThreadMain():SnapCopyTrespassAbdicate( LUN 143 )
01:00:41.259 SNAPDRVR:TrespassAbdicate() Deferring \Device\CLARiiON\SnapCopy\Disk0049; Scavenge
01:00:41.325 TDD: Ownership Loss IownLun:1 ExeLast:0 \Device\CLARiiON\Clones\1c5fd093a174064
01:00:41.326 Clones: Trespass loss received on device \Device\CLARiiON\Clones\1c5fd093a174064
01:00:41.326 TDD: Ownership Loss IownLun:1 ExeLast:0 \Device\CLARiiON\Clones\1c5fd00a4f3ad2c
01:00:41.326 Clones: Trespass loss received on device \Device\CLARiiON\Clones\1c5fd00a4f3ad2c
01:00:41.337 TDD: OwnerLossIrpCompletion to WWN 0x3ed2987cf5d8d911 Status 0x0.
01:00:41.337 TDD:\Device\CLARiiON\Clones\1c5fd093a174064 Cancel Notify Irp No Wait 0x940c3030
01:00:41.338 NTFE: Lun 288 Cancel Notify Irp 0x940c3030, Cap 3800000
01:00:41.338 TDD: Notify Irp Completion 0x940c3030 Fl:0x0 Status 0xc0000120
01:00:41.338 Device: \Device\CLARiiON\Clones\1c5fd093a174064
01:00:41.338 TDD: Don't reissue Notify 3800000 new 0
01:00:41.338 TDD: OwnerLossIrpCompletion to WWN 0x77c92f095423d811 Status 0x0.
01:00:41.338 TDD: \Device\CLARiiON\Clones\1c5fd00a4f3ad2c Cancel Notify Irp No Wait 0x940b3110
01:00:41.338 NTFE: Lun 7 Cancel Notify Irp 0x940b3110, Cap 3800000
01:00:41.338 TDD: Notify Irp Completion 0x940b3110 Fl:0x0 Status 0xc0000120
01:00:41.338 Device: \Device\CLARiiON\Clones\1c5fd00a4f3ad2c
01:00:41.338 TDD: Don't reissue Notify 3800000 new 0
01:00:41.341 LUSM Enter 288 LU_ENABLED op=LUSM_RELEASE_FOR_TRESPASS el.st=0x1901
01:00:41.341 LUSM Exit 288 LU_SHUTDOWN_TRESPASS op=LUSM_RELEASE_FOR_TRESPASS el.st=0x1901
01:00:41.341 CM Shutdown: cm_release_stop_verify - Stop verify LUN 288, state 0x1901, cm_element 0x919e2008
01:00:41.341 LUSM Enter 7 LU_ENABLED op=LUSM_RELEASE_FOR_TRESPASS el.st=0x1901
01:00:41.341 LUSM Exit 7 LU_SHUTDOWN_TRESPASS op=LUSM_RELEASE_FOR_TRESPASS el.st=0x1901
01:00:41.341 CM Shutdown: cm_release_stop_verify - Stop verify LUN 7, state 0x1901, cm_element 0x919e2358
01:00:41.368 CM: cm_handle_vp_response: GLUT not updatable. Unit:288, State:LU_SHUTDOWN_TRESPASS
01:00:41.369 PSMSYS:psmDataAreaWrite() 1:30 O=720904 L=65536 00 00 00 00 00 00 00 00 00 00 00 00
01:00:41.371 LUSM Enter 288 LU_SHUTDOWN_TRESPASS op=LUSM_RELEASE_FOR_TRESPASS_DONE el.st=0x1901
01:00:41.371 LUSM Exit 288 LU_PEER_ASSIGN op=LUSM_RELEASE_FOR_TRESPASS_DONE el.st=0x1901
01:00:41.447 DLSEXP:DlsExecutionerLiquidateByCabalId(0x81ff6c48)(LockId = 0x0:0x00000000000003cd)
01:00:41.447 KLogBugCheckEx(0xe111805f 0x81ff6c48 0x00000000 0x00000000 0x000003cd) at line 924
01:00:41.447 of D:\views\a5eaeb761c8a5088c313a64985317c3e.stg\catmerge\services\DLS\src\ExportDr deadbeef,
01:00:42.459 FCDMTL 5 (BE0) Bugcheck callback executed
01:00:42.459 FCDMTL 9 (FE2) Bugcheck callback executed
01:00:42.459 FCDMTL 8 (FE3/FAR) Bugcheck callback executed


01:00:42.459 FCDMTL 7 (FE0) Bugcheck callback executed
01:00:42.459 FCDMTL 6 (FE1) Bugcheck callback executed
01:00:42.459 FCDMTL 0 (PP1) Bugcheck callback executed
01:00:42.459 FCDMTL 4 (BE1) Bugcheck callback executed
01:00:42.459 FCDMTL 3 (AUX0) Bugcheck callback executed
01:00:42.459 FCDMTL 2 (PP0) Bugcheck callback executed
01:00:42.459 FCDMTL 1 (AUX1) Bugcheck callback executed

The E111805F (DLS timeout) bugcheck was caused by canceled host I/Os in SnapView; SnapView did not handle canceled host I/Os until release 24. The host I/O cancellations are likely the result of very long response times caused by significant trespassing on the array. The trespasses may be a result of problems on SPA: the logs show SPA was rebooted three other times prior to the panic without creating a dump. If these were unexpected reboots, then the SP should be replaced.

10/20/2006 03:49:05 Enclosure 0 Disk 0 60a Internal information only. Logical unit has been enabled
10/20/2006 03:58:33 71200002 Compiled at Jan 4 2006.
10/20/2006 04:11:56 Enclosure 1 Disk 10 606 Unit Shutdown for Trespass [0x00] 100030 300019
10/20/2006 04:22:43 71200002 Compiled at Jan 4 2006.
10/20/2006 04:35:54 Enclosure 0 Disk 5 606 Unit Shutdown for Trespass [0x00] 1000b b0005
10/20/2006 04:46:50 71200002 Compiled at Jan 4 2006.
10/20/2006 05:00:39 Enclosure 0 Disk 0 606 Unit Shutdown for Trespass [0x00] ffff0009 90000
10/20/2006 05:12:44 71200002 Compiled at Jan 4 2006.


Section 2 NDU – Basic Operations and Troubleshooting

General Theory

NDU (Non-Disruptive Upgrade) is the mechanism used for upgrading CLARiiON storage-system software on FC4700, CX series, and CX3 series arrays. This section provides a basic understanding of the NDU process and assistance with triage.

NDU Process

This section provides a brief overview of the NDU process. The best way to find where an NDU operation was when it failed is to look at the event logs on the primary SP for that operation. More detailed information is available in the user ktrace for each SP. For installs, reverts, and uninstalls, the NDU process performs the following steps:

1. Query Package Files

In this step, Navisphere queries the array with the package(s) to be installed and NDU performs its dependency checks. This step does not occur for reverts and uninstalls, since the packages involved are already installed. For releases before release 14, failures here could indicate that the user forgot to include the latest version of an already-installed package. See Dependency Check Failed for more details. From release 14 onward, a dependency failure may instead indicate that a package is being installed that does not belong on that platform; for example, attempting to install MirrorView on a CX500i, where it is not supported.

In addition to dependency checks, there are checks to ensure that no package being installed has a version already installed that requires a commit. There are also checks for duplicate packages being installed or an install that consists entirely of already installed packages.
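The pre-release-14 rule above amounts to a simple version comparison across the installed and proposed package sets. The sketch below is purely illustrative (the function name and data shapes are assumptions, not the actual NDU implementation):

```python
def check_dependencies(installed, proposed):
    """Hypothetical sketch of the pre-R14 rule: every already-installed
    package must be re-included (at the same or a newer generation)
    when any package it depends on is upgraded. Both arguments map
    package name -> generation number."""
    problems = []
    for name, gen in installed.items():
        new_gen = proposed.get(name)
        if new_gen is None:
            problems.append("Dependency not met: required package %s %d" % (name, gen))
        elif new_gen < gen:
            problems.append("%s %d is older than installed %d" % (name, new_gen, gen))
    return problems

# Upgrading Base without re-including SnapView trips the check:
print(check_dependencies(installed={"Base": 141, "SnapView": 141},
                         proposed={"Base": 150}))
# -> ['Dependency not met: required package SnapView 141']
```

This mirrors the shape of the "Dependency not met / Required packages" CLI message shown later in the Sample Cases.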

2. Setting Up

Once the user decides to proceed with the install, many of the checks performed in the Query Package Files step are performed again. For a revert or uninstall operation, however, this would be the first time these checks are run.

3. Store Package Files

In this step, NDU stores the package files being installed to the PSM. This action is both to allow the other SP to find the files and to keep a backup copy in case the SP is later re-imaged. This step could fail due to the PSM getting errors back from Flare. Also, if there is a lack of system resources, the commands could fail to make it to the PSM entirely.

4. Disable Cache

In this step, NDU disables and zeroes the read and write cache on both SPs. Other array settings are modified at this time as well, such as Flare throttling and the pausing of FBI activity. This step can fail if the cache-disabling code receives an error; see the Cache Disable Failed section for more details. This condition is often caused by a cache-dirty situation that can be resolved with AdminTool. A failure to throttle Flare or pause FBI will not directly cause the NDU to fail. It may, however, impact performance in a way that later causes the upgrade to fail.

5. Run Check Scripts

In this step, NDU runs any check scripts it finds in packages for the current operation. These scripts usually check for unusual conditions that can't otherwise be detected by the core code through its normal dependency checks.


For example, upgrading from RTM release 14 to the release 14 patches required that all async mirrors be fractured prior to the upgrade. The RTM release 14 code was the code running the upgrade, so it did not know about this restriction, but a check script put into the release 14 patch could enforce it.

Check scripts often fail due to package specific restrictions. More information can be found in the event logs, ktrace, and the c:\temp\ndu-check.out file on the primary SP for the upgrade.

Uninstalls may also fail a check script if the package being uninstalled has resources in use. For example, if you went to uninstall SnapView but had not removed all of your snapshots, the uninstall would fail.

Another cause of check script failures can be directories in use. See Initial Cleanup Failed as an example of this problem.

6. Install Peer

In this step, package files are copied from the PSM to disk, and each package's files are copied into that package's directory on the peer SP. The file copy is done mostly by using rundll32 from a script on the .inf file for that package.

This step could fail if a file was missing or if there was a problem unpacking the package file. Since PSM is being accessed, the failures mentioned in the “Store Package Files” step could also be seen here. See PSM Access Failed or Setup Script Failed for examples.

7. Quiesce Peer

In this step, NDU issues a command to halt all I/O to the peer SP. This command goes to admin, which then sends it to each of the drivers, including layered drivers. A failure during this step usually indicates that some driver reported a problem back to admin, which then returned it to NDU. See Quiesce Failed for examples.

8. Deactivate Peer

In this step, old packages are deactivated on the peer SP. This involves deregistering .dll files and removing registry settings. The registry settings are usually removed by using rundll32 from a script on the .inf file for that package.

This step could fail if a script was missing or a .dll file being deregistered was never registered in the first place. Usually, this indicates a problem with the way the package was created. The most information about the failure can be found in the c:\temp\ndu-deact.out file on the SP where the package was being deactivated. See Deactivate Hang for an example of a panic during a deactivate.

9. Activate Peer

In this step, the new packages are activated on the peer SP. This involves registering .dll files and adding new registry settings. The registry settings are usually added by running rundll32 from a script on the .inf file for that package. Activate scripts can also do some very unusual activities. For example, the Base package will start the ASIDC driver and the ICA process so it can put the ObsoleteImageRemover plug-in into the image repository.

Activate scripts can fail for a wide variety of reasons. The most information about the failure can be found in the c:\temp\ndu-act.out file on the SP where the package was being activated. See Panic During Activate for an example.

10. Post-Activate Peer

This step is new in release 14 and higher. Any package can optionally supply a post-activate perl script that is run after all of the activate scripts have been run. During this step, layered drivers that have an enabler installed actually become enabled. Performing this step after all activates ensures that the correct drivers are enabled, no matter in which order a driver's engine package and enabler were activated.
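One way to see why running this step after all activates removes the ordering problem: the enabled set can be computed as a pure function of the final installed package set. The sketch below is illustrative only; the `<engine>_PERMIT` naming is borrowed from the SANCopy_PERMIT example later in this section, not a confirmed convention for every enabler:

```python
def enabled_drivers(installed):
    """Sketch: a layered driver is enabled only when both its engine
    package and its enabler (assumed here to be named <engine>_PERMIT)
    are installed. Because the result depends only on the final
    installed set, activation order cannot change it."""
    enabled = set()
    for pkg in installed:
        if pkg.endswith("_PERMIT") and pkg[:-len("_PERMIT")] in installed:
            enabled.add(pkg[:-len("_PERMIT")])
    return enabled

# MirrorView has no enabler installed, so only SANCopy becomes enabled:
print(enabled_drivers({"SANCopy", "SANCopy_PERMIT", "MirrorView"}))
# -> {'SANCopy'}
```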


11. Reboot/Restart Peer

In this step, the peer SP is rebooted so that it can restart with the new drivers and libraries. In the case of a rebootless NDU, certain processes such as NaviAgent and NDUmon may be restarted instead of rebooting the entire SP. Restarting NDUmon simulates a reboot and causes NDU to move to the next step in the upgrade process.

This step can fail if the peer SP either comes up degraded or fails to unquiesce I/O. In this case, the SP has booted, but it is not processing I/O, so it is not safe to continue with the current NDU operation. The primary SP will eventually time out if NDUmon on the peer SP never returns a successful status code from a poll attempt. See Reboot Failed for examples.

There can also be problems flushing the registry changes to disk at reboot time, which can lead to problems on the next boot. See Registry Flush Failed for an example.

12. DMP Delay

When initiated from the Navisphere GUI, the NDU operation will include a DMP Delay step in between the reboot of the peer SP and the installation of software on the primary SP. This delay gives failover software a chance to notice that the original path has returned before the primary SP is taken offline for the NDU. This delay can also be specified with a command line option when using NaviCLI.

13. Install/Quiesce/Deactivate/Activate/Post-Activate Primary SP

These steps are much like the same steps on the peer SP. The main difference is that they do not require a message to be sent to NDUmon, so the steps are running in the NaviAgent process space on the primary SP. Output files and ktrace will be updated on the primary SP at this point as well. Failures during these steps would be similar to the failures on the peer SP for the same step.

14. Uninstall Peer/Primary SP

Before the primary SP reboots, it will uninstall any package files as needed on the peer and primary SPs. This consists mostly of deleting the directories for packages that are no longer needed. A package can optionally supply an uninstall.bat script that is run at this time.

15. Reboot/Restart Primary SP

One of the last steps of the upgrade is to reboot, or at least restart processes on, the primary SP. While the primary SP is operating, the peer SP has a thread watching the primary SP's progress. If the primary SP has not made any progress during its timeout period, the peer SP will time out and declare the upgrade a failure. However, if the primary SP has reached the Reboot/Restart step, the upgrade is past the point of no return, so a failure will not cause either SP to revert back to the original code.
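The peer-side watchdog behavior described above can be sketched as a polling loop; `progress_step`, the step name, and the polling shape below are illustrative assumptions, not the actual NDUmon protocol:

```python
import time

def watch_primary(progress_step, timeout_s, poll_s=1.0):
    """Sketch of the peer SP's watchdog on the primary: if the primary
    makes no progress within the timeout, declare the NDU failed --
    unless the primary has already reached the Reboot/Restart step,
    the point of no return. progress_step() returns a (step_name,
    counter) tuple; any change in it counts as progress."""
    last = progress_step()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        cur = progress_step()
        if cur != last:
            # Progress observed: remember it and extend the deadline.
            last, deadline = cur, time.monotonic() + timeout_s
        if cur[0] == "reboot_restart_primary":
            return "past point of no return"
        time.sleep(poll_s)
    return "upgrade failed"
```

A stalled primary exhausts the deadline and the peer declares failure; once the reported step is the Reboot/Restart step, the watchdog stands down.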

16. Restore Cache

When NDUmon on the primary SP starts for the first time after the reboot/restart, it will attempt to restore the original cache settings from a data area (ndu-cache-settings) in PSM where they were saved.

17. Sync Operations (this step only occurs if a failure occurs)

Sync operations happen after an SP is re-imaged or a previous NDU operation fails. NDU reads the ndu-toc data area in PSM for the list of packages that should currently be installed and makes whatever changes are required to make this SP’s software match that list.
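A sync is essentially a diff between the ndu-toc package list and the software actually present on the SP. A minimal sketch, with illustrative dict shapes (the real ndu-toc format is not shown in this guide):

```python
def plan_sync(ndu_toc, installed):
    """Sketch of the sync decision: compare the package list recorded
    in the ndu-toc PSM data area against what this SP actually has
    installed, and compute the changes needed to make them match.
    Both arguments map package name -> generation."""
    to_install = {p: g for p, g in ndu_toc.items()
                  if installed.get(p) != g}          # missing or wrong generation
    to_remove = [p for p in installed if p not in ndu_toc]
    return to_install, to_remove
```

For example, an SP re-imaged with old Base and a stray Navisphere package would install Base/SnapView at the recorded generations and remove the stray package.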

Before release 16, a sync initiated during the SP reboot would attempt to disable the cache settings, much like any other NDU operation. However, this was both unnecessary and prone to failure, so release 16 code won’t bother.


18. Commit

A commit operation indicates to the software that the current upgrade has completed and it is safe to start using new persistent data formats and messages. When the user commits a bundle, NDU will go through each of the component packages in that bundle and issue a commit to each of those first. If they all commit successfully, then the entire bundle will be marked as committed.
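The all-or-nothing bundle commit described above amounts to the following loop; `commit_package` is a stand-in for the per-package commit opcode sent to the admin libraries, and the dict shape is illustrative:

```python
def commit_bundle(bundle, commit_package):
    """Sketch: commit each component package in the bundle first; mark
    the bundle itself committed only if every component commit
    succeeds. commit_package(name) -> bool is a hypothetical stand-in
    for issuing the commit to that package's admin libraries."""
    for pkg in bundle["packages"]:
        if not commit_package(pkg):
            return False            # leave the bundle uncommitted
    bundle["committed"] = True
    return True
```

This matches the failure behavior seen in the Commit Failed cases later in this section: one admin library returning an error leaves the whole bundle uncommitted.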

Within a package, there can be a number of admin libraries that request to receive a commit. For example, when the Base package is committed, the Flare, PSM, Hostside, and System admin libraries all receive a commit opcode that is forwarded to the appropriate drivers.

A failure to commit could be due to out of sync software. For example, the peer SP may have been up and running while a single SP upgrade was done, so it still has not done a sync, which would ensure that both SPs are running the code in the ndu-toc file. It could also be the case that a registry flush to disk failed, so that the registry setting changes made during the previous upgrade were lost. See Commit Failed for examples.

A failure to commit could also indicate that one of the component packages' admin libraries returned an error. User ktrace would provide the most information in this case. See Commit Failed for examples of this case as well.

- Sample Cases -

Dependency Check Failed

These are cases where either some packages were being upgraded to a new release without upgrading other dependent packages or an enabler was being installed onto a platform where it does not belong.

In the first case, you may see messages in CLI like:

Uninstallable Reason: Dependency not met Required packages: SnapViewOption 141, SnapView 150, AccessLogixOption 131, ManagementServer 150, Navisphere 150, Base 150

The fix is to install all packages for the new release together or just use a bundle that already has all of the needed packages together.

In the second case, you may see a similar message before release 19. From the GUI, it may appear as:

-SANCopy - (Generation 141) - Cannot Install: Dependency not met: The following packages must be installed to install the requested package: OpenSANCopy generation >=140, SANCopy_PERMIT generation >=1, SANCopy_PERMIT >=10 MUST NOT BE INSTALLED

With release 19 or higher, the message is more direct: "Permit attribute dependency not met. This package is not allowed on this platform type." This indicates that this particular enabler is not allowed on this platform type. In this case, SANCopy, MirrorView, and MirrorView/A are all not allowed on CX500i arrays.

PSM Access Failed

In these cases, either a read or write access to PSM failed. The event logs will show a message like:

K10PsmFileImpl::GetFile: ReadTail exception. File: D:\views\412f8dfdf6c5e7457d72788e1651d928.stg\catmerge\mgmt\K10GlobalMgmt\K10PsmFileImpl.cpp Line: 870 Error: 76008005 NTError: 000005aa Description: IOCTL_PSM_READ failed for ndu-EMC-Base-02174003.008.

This indicates a failure to acquire the right kind of resources to perform the I/O, and may mean the system is low on memory.


Cache Disable Failed

In these cases, an upgrade was attempted when there are cache dirty LUNs. NDU must disable and zero the cache settings before it can proceed, but it will be prevented from doing so if there are cache dirty LUNs. In a recent case, the event logs showed:

5/13/04 12:00:58 PM NDU Information (1234) 0 N/A CPQA2125 Informational message. File: K10NDUAdminManage.cpp Line: 1398 Details: Disabling cache
5/13/04 12:00:59 PM NDU Error (1234) 32820 N/A CPQA2125 Failed to disable cache settings. File: K10NDUAdminManage.cpp Line: 1406 Details: Failed to disable cache Status: 0x6000026d

This means that NDU attempted to disable the cache, but got the 0x6000026d error code back. This is a sunburst error code, which can be found in a slightly different form:

#define HOST_LUNS_CACHE_DIRTY 0x206D

The corresponding sunburst error code can be derived by removing the leading 0x60000 and inserting a 0 between the first digit and the remaining digits. A "clear dirty cache" procedure may be required to resolve this issue.
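The digit manipulation just described can be checked mechanically. The helper below is only an illustration of that derivation applied to this example, not an actual FLARE or NDU routine:

```python
def ndu_status_to_sunburst(status):
    """Illustrative sketch: convert an NDU cache-disable status such as
    0x6000026d to its sunburst form (0x206D). Strip the leading
    0x60000 (keep the low three hex digits), then insert a 0 between
    the first digit and the remaining two."""
    low = status & 0xFFF        # 0x6000026d -> 0x26d
    first = low >> 8            # 0x2
    rest = low & 0xFF           # 0x6d
    return (first << 12) | rest # 0x206d

print(hex(ndu_status_to_sunburst(0x6000026D)))  # -> 0x206d
```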

Check Script Failed

Check script failures show up after the NDU operation has gone asynchronous. The "NaviCLI ndu –status" command may show:

Is Completed: YES
Status: Operation Failed: A check script contained in the associated package failed. Check the SP event logs on the primary SP and the package release notes (0x71518013)
Operation: Install

This indicates that a check script failed; the event logs as well as the c:\temp\ndu-check.out file on the primary SP will contain more details. In this case, the ndu-check.out file showed:

Found LU 620 with WWID 60:6:1:60:87:c7:c:0.f3:55:e5:19:36:a1:d9:11
Exporting driver is K10AggDrvAdmin
Consuming driver is K10CloneAdmin
Public is: 0
Error: Found private LU 620 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f3:55:e5:19:36:a1:d9:11.
Found LU 621 with WWID 60:6:1:60:87:c7:c:0.f4:55:e5:19:36:a1:d9:11
Exporting driver is K10AggDrvAdmin
Consuming driver is K10CloneAdmin
Public is: 0
Error: Found private LU 621 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f4:55:e5:19:36:a1:d9:11.
...
Cannot upgrade to this release while K10CloneAdmin is consuming MetaLUNs for private use.
Detected compatibility problems

The event logs showed:

04/27/2005 15:24:02 (71518013) A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: Found private LU 620 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f3:55:e5:19:36:a1:d9: NDU
04/27/2005 15:24:02 (71518013) A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: Found private LU 621 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f4:55:e5:19:36:a1:d9: NDU
04/27/2005 15:24:02 (71518013) A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Cannot upgrade to this release while K10CloneAdmin is consuming MetaLUNs for private use.

There is a restriction in R16 and higher, not present in R14, that private LUNs for layered drivers (such as Clone WILs and SnapCache LUNs) cannot be built on top of metaLUNs. These must be reallocated on normal LUNs before the array can be upgraded to release 16 or later.
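The check-script logic implied by the ndu-check.out excerpt above can be sketched as a scan over the LU table; the field names and dict shape below are assumptions based on that output, not the actual script:

```python
def find_private_lus_on_metaluns(lus):
    """Sketch of the R16 check: fail the upgrade if any *private* LU
    consumed by a layered driver (e.g. K10CloneAdmin) is exported by
    the metaLUN aggregate driver (K10AggDrvAdmin). Each entry is a
    dict with 'lun', 'exporter', 'consumer', and 'public' keys."""
    errors = []
    for lu in lus:
        if not lu["public"] and lu["exporter"] == "K10AggDrvAdmin":
            errors.append("Found private LU %d for %s built on a MetaLUN."
                          % (lu["lun"], lu["consumer"]))
    return errors

# A clone private LU on a metaLUN is flagged; an ordinary Flare LU is not:
report = find_private_lus_on_metaluns([
    {"lun": 620, "exporter": "K10AggDrvAdmin", "consumer": "K10CloneAdmin", "public": False},
    {"lun": 10,  "exporter": "K10FlareAdmin",  "consumer": "K10CloneAdmin", "public": False},
])
```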


Setup Script Failed

These incidents were upgrades from R13 to R14. In each case, the upgrade failed in a slightly different way. In one incident, it failed in the setup script for MirrorView, although the event log uses the words "check script":

B 02/01/06 14:32:41 NDU 71518013 A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: Failed to auto-install the MirrorViewOption package that is required to retain MirrorView capability when upgrading.

In another incident, it failed in a check script and produced this message:

A 03/04/06 12:29:31 NDU 71518013 A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: MirrorView has been stop-shipped, so this array cannot be upgraded to this bundle.

For both cases, the root cause was the same. Both arrays had MirrorView installed at one point, and the package was later uninstalled, but some registry settings were left behind on one SP. Since the upgrade was tried on different SPs in each case, different messages were seen.

A special package called ToR14EnablementCheck was created to fix this problem. Note that this package is not generally available and is only provided when required, at the discretion of EMC Engineering.

Quiesce Failed

This case looked to be a ScsiTarg timeout when NDU went to quiesce I/O on the peer SP. The event logs for the primary SP showed:

01/06/2005 09:50:48 (71518016) An attempt to quiesce I/O on the peer SP failed. Consult other SP Event Log entries for details. Call Service provider. File: K10NDUAdminManage.cpp Line: 279 Details: Quiesce returned 1901166615 Status: 0x71518017 71 51 80 16 NDU

The user ktrace output on the peer SP showed:

09:40:43.377 NDU: Received QUIESCE command
09:40:43.462 ndumon: QuesceAll TCD
09:40:43.463 ndumon: HostAdmin quiesce opcode 6
09:50:43.526 NDU: error: 0x71508003 File: D:\views\3a5fd66156ebf0222db8e58e642d6629.stg\catmerge\mgmt\K10Ho
09:50:43.526 NDU: stAdmi
09:50:43.534 NDU: QUIESCE failed 71518017
09:50:43.534 NDU: Calling TerminateThread to cancel HangTimer: 3a4
09:50:43.534 NDU: Hang Timer canceled
09:50:43.534 NDU: Returning RC: 71518017

The 71508003 error code is K10_HOSTADMIN_ERROR_IOCTL_TIMEOUT, which indicates that HostAdmin got an error from TCD (ScsiTarg).

Deactivate Hang

For this issue, there was a panic during an upgrade. The user ktrace buffer in that panic showed that NDU never finished deactivating the Base package:

06:01:48.321 NDU: Starting HangTimer
06:01:48.322 NDU: CreateThread() returned 118
06:01:48.322 NDU: Timer started 3c0
06:01:50.321 NDU: Received RUN command: set Operation=Install&& set BASE_GENERATION=180&& cd c:\EMC\Base\02
06:01:50.321 NDU: 185003.005&& ndu\bin\deactivate > \temp\ndu-deact.out 2>&1


The end of ndu-old-deact.out showed iSNS as the last line that was being executed:

C:\EMC\Base\02185003.005>msgbin\isns -UnregServer

The user ktrace showed that the iSNS process was still running up to the panic:

06:17:07.746 iSNS: Wait on mutex
06:17:07.746 iSNS: Got Devmap mutex
06:17:07.749 iSNS: release mutex
06:17:07.749 iSNS: FlareData mutex count dec 0

Since the command did not complete within the 16-minute timeout period, NDU panicked and the upgrade failed.

Panic During Activate

The user ktrace in this panic showed that NDU was activating the Base package:

11:03:58.476 NDU: Activating using set Operation=Revert&& set BASE_GENERATION=160&& cd c:\EMC\Base\02165003
11:03:58.476 NDU: .446&& ndu\bin\activate > \temp\ndu-act.out 2>&1

Engineering dump analysis showed that NDULoadBios was started by cmd.exe, which was started by cmd.exe, which was started by NaviAgent.exe. Because this panic was on the primary SP, the NDU thread runs in NaviAgent's process space. Also, the ndu-act.out file would likely show NDULoadBios.exe as the last line in the output file.

Reboot Failed

In this case, SPA was pinging, but there was no EMCRemote access. SPA was the primary SP in the NDU, and it failed following its reboot (before the chkdsk). Hitting the NMI button forced a panic, and dump analysis showed this was an NtRaiseHardError hang, which has been seen mostly on CX500s when they reboot as part of an upgrade.

Registry Flush Failed

In some cases, an SP will reboot, but come up unmanaged. It may be running with the latest drivers, but have old registry settings. This can be caused by a failure to flush the EMC key in the registry, which causes a message like this to appear in the event logs:

04/11/2005 10:28:40 (79508017)Dynamic strings:Cannot flush EMC keyD:\views\chainsaw_r12_ch2k_nal_fr.stg\catmerge\mgmt\K10SystemAdminLib\K10SystemAdminControl.cpp768 79 50 80 17 00 00 03 f8 00 00 00 00 naviagent

The safest way to fix this condition is to re-image the system. However, if that is not an option, contact EMC Engineering for a possible alternative procedure. This problem should only be seen when upgrading from release 19 or earlier.

Commit Failed

One case is where the software on one SP was out of sync with the ndu-toc file in PSM. The event logs showed a shutdown failure, which may have caused registry setting changes to be lost:

06/08/2004 20:18:03 (79508017) Exception: InitiateSystemShutdown failed; File: D:\views\b79e399bec2ddb1ffa549397821ab792.stg\catmerge\mgmt\K10SystemAdminLib\K10SystemAdminControl.cpp; Line: 387. 79 50 80 17 00 00 00 05 00 00 00 00 ndumon

This caused NDU to return the 7151803B error code back when the commit was attempted. Rebooting either SP should cause a sync which would fix the problem.


Another is where an admin library returned back an error to NDU, which caused the commit to fail. User ktrace showed:

06:54:16.775 NDU: Sending commit to Admin Library K10FlareAdmin
06:54:16.777 NDU: Admin Library returned 0
06:54:16.777 NDU: Sending commit to Admin Library K10HostAdmin
06:54:16.823 NDU: error: 0x71508010 File: D:\views\31bc63403d30f59a60e5e38999d17155.stg\catmerge\mgmt\K10Ho
06:54:16.824 NDU: Commit failed 71508010
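The pattern above (each admin library acknowledged in turn until one returns an error) can be triaged mechanically. A hedged sketch that walks ktrace lines in order and names the last admin library a commit was sent to before the error line; the function and the line formats are assumptions based on the excerpt above:

```python
import re

def failed_commit_library(ktrace_lines):
    """Remember the last admin library a commit was sent to, and report it
    when an 'NDU: error:' or 'Commit failed' line appears."""
    last_lib = None
    for line in ktrace_lines:
        m = re.search(r"Sending commit to Admin Library (\S+)", line)
        if m:
            last_lib = m.group(1)
        if "NDU: error:" in line or "Commit failed" in line:
            return last_lib
    return None

trace = [
    "06:54:16.775 NDU: Sending commit to Admin Library K10FlareAdmin",
    "06:54:16.777 NDU: Admin Library returned 0",
    "06:54:16.777 NDU: Sending commit to Admin Library K10HostAdmin",
    "06:54:16.823 NDU: error: 0x71508010 File: ...",
]
print(failed_commit_library(trace))  # K10HostAdmin
```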

The event logs showed:

Unexpected Exception. Call service provider. File: K10HostAdmin.cpp Line: 2612 Status: 0x71508010 NTErrorCode: 0x467 Exception Details: Error waiting on ioctl -2134236020 7151801d

From this information, it was eventually found to be a problem with Clones.

Post Conversion Bundle Inconsistency in Release 14

When running release 14, conversions will not leave the bundle software in a consistent state. That is, the BundleIndex for the new platform may not exactly match the installed software, some of which may be left over from the old platform. The fix is normally to install the latest patch for the new platform and commit it. Until that is done, however, any installs may see dependency failures with the 71518004 error code.

R12/R13 to R16/R17 stack size problem

In these incidents, an upgrade from R13 or earlier to R16 or higher on a CX400 or CX600 system with layered drivers installed caused a panic. In some cases, the panic was during the NDU, which caused it to fail. In others, the panic happened the first time layered drivers were used afterwards.

Usually, the panic code was “0x35, NO_MORE_IRP_STACK_LOCATIONS”. In other cases, however, the panic was “0xe1318013, CMID_BUGCHECK_PARTITION_FROM_LIVE_PEER_DETECTED”. The difference is that in some cases the problem was detected on that SP and it panicked itself, while in others it looks like memory was corrupted, which caused the other SP to panic.

The root cause in either case was that on NT systems upgrading from a release 13 or earlier, the script that sets the driver stack size was only running when the Base package was being activated, and not as the layered drivers were being enabled. This left the default stack size of 3 in place, rather than the correct number for the installed layered drivers.

There is now an R16_R17_StackSize_Fix_emc107453 package that can be used in both upgrades to R16/R17 and on systems that may have a latent problem. If the SP is already degraded, the SP can be fixed manually. Contact EMC Engineering for information about this procedure if required.

Initial Cleanup Failed

In these incidents, the UtilityPartition package was installed, but the ICA process was left running. This caused the next installation to fail because the c:\temp\ndu directory was in use and could not be deleted. User ktrace shows:

19:03:21.787 NDU: Cleanup cmd is: rd/s/q \temp\ndu & mkdir \temp\ndu
19:03:21.919 NDU: Initial cleanup returned 1
19:03:21.920 NDU: Activate check script failed
19:03:21.920 NDU: Backing out ndu-EMC-RecoveryImage-02163005.004

The current workaround is to reboot the SP and retry the installation.


iSCSIPortx IP Configuration Restoration and Device Discovery

There were two incidents uncovered when manufacturing started using a new rev of CX300i hardware. This problem applies to any iSCSI platform (including CX500i and AX100[SC]i) where the iSCSI chip rev may change, e.g. due to an SP replacement, or be different from the hardware revision stored in Windows Plug and Play metadata in a freshly imaged SP. Newer code has been added into R19 to address this on CX and AX platforms. Here is information that describes the issues in detail, as well as workarounds for R16 and R17.

QLogic r4/r3 issue

There are two problems: one regarding plug-and-play, one regarding network settings and PSM. Some operations bring out one problem, some bring out both (some bring out neither).

The plug-and-play issue occurs when new hardware is introduced and, since we can't control when plug-and-play runs, the SP ends up in a state where iSCSI ports are not correctly named. This occurs when an R3-based SP is swapped out for an R4 one. It would also happen in the hypothetical reverse swap case, but EMC currently only intends to spare with R4s.

The network setting case occurs when a data-in-place reimage is done without changing the hardware. This is caused by a bug where the NDIS settings for the iSCSI ports are not restored from what is correctly stored in PSM.

Given this information:
1. NDUs work fine. No special steps are required.
2. An SP swap where the old and new revs of chip are the same rev requires no special steps.
3. A data-in-place reimage will require the user to re-enter the IP information for the iSCSI ports. Contact EMC Engineering for any required procedures.
4. An SP swap where the old and new revs of the 4010 chip are different requires a modified process. Contact EMC Engineering for any required procedures.

One or both SPs in reboot cycle (CX200/CX400/CX600 being upgraded from pre-R11 only)

Event log shows newSP continually reinstalling a package successfully, then rebooting.

This problem most likely occurs when someone installs a TRANSIENT or EPHEMERAL NDU package (look at the package’s TOC.txt file for those keywords), perhaps along with other NDU packages, on a revision of software that does not support TRANSIENT or EPHEMERAL packages (i.e. packages that are supposed to disappear automatically after the NDU completes).
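Checking a package for those keywords can be scripted before installing. A minimal sketch; the TOC.txt layout shown is illustrative only:

```python
def package_lifetime_keywords(toc_text):
    """Return which of the TRANSIENT / EPHEMERAL keywords appear in a
    package's TOC.txt contents (the keywords are documented in upper case)."""
    return [kw for kw in ("TRANSIENT", "EPHEMERAL") if kw in toc_text]

# Hypothetical TOC.txt contents for demonstration:
toc = "PACKAGE ndu-EMC-SomeFix\nTRANSIENT\n"
print(package_lifetime_keywords(toc))  # ['TRANSIENT']
```

If either keyword is found, verify that the running Base revision supports such packages before proceeding.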

What happens is that the NDU succeeds, but because the Base software that is currently running does not know to treat the TRANSIENT or EPHEMERAL keywords properly (basically ignoring them), it includes the TRANSIENT or EPHEMERAL package in the ndu-toc file in PSM rather than leaving it out.

What happens on the next NDU sync opportunity, e.g. when an SP reboots after the NDU has already completed, is that newSP finds that this package’s <PKG>_REVISION environment variable is not set, indicating that it is not active, and proceeds to activate it, then reboot. But since the package activation never sets that environment variable (by nature of being TRANSIENT or EPHEMERAL), this cycle recurs when the SP comes up again.

To fix this, it will be necessary to edit ndu-toc to remove the reference to the TRANSIENT or EPHEMERAL package. Contact EMC Engineering for any required procedures. Note that if any SP has been put into HFOFF mode, run HFON then reboot it. When it boots up, it should automatically remove the TRANSIENT or EPHEMERAL package that had been causing problems.

Once one SP that is in this situation has been repaired and is servicing I/O, it will cause the peer SP to perform an NDU sync and clean up the remains of the TRANSIENT or EPHEMERAL package on that SP. This may cause the peer SP to reboot. If the peer SP does not reboot and remains unmanaged, reboot it and it should come back managed if this were the only issue involved in its unmanaged behavior.


Tips and Tricks

SPCollects

SPCollect output files are an excellent source of information for triaging an NDU problem. They contain the event logs, ktrace output, and NDU’s output files from the temp directory.

Event Logs

The sus.zip file in the SPCollect output file contains a SP?_navi_getlog.txt file, which has the event logs mixed together. If Navi was unavailable at the time SPCollect was run, you can still get the raw .evt files out of the evt.zip file. Use evtdump.exe to extract the text from these files.

Ktrace

The sus.zip file in the SPCollect output file has a SP?_kt_user.txt file, which has the latest user ktrace information. Ktrace information from previous boots can be found in the ktd.zip file. Searching for “-ruser” will jump to the start of the user ktrace output.

NDU Output Files

The rtp.zip file in the SPCollect output file has a number of NDU output files of the form ndu-*.out. These contain more detailed information about particular operations. For example, the ndu-act.out file contains information about the latest activation on that SP.

There may also be preserved output files from failures. If an activate for the Base package revision 02.16.500.5.001 failed, it would save the last ndu-act.out file as ndu-Base-02165004.001-act.out. Always check that this output file corresponds to that package though, as there are cases where a step can fail without producing a new output file, so the preserved file is from the previous package.
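When matching a preserved output file to the step that actually produced it, file modification times are the sanity check. A hypothetical helper, not an EMC tool; the file-name patterns follow the examples above:

```python
import glob
import os

def ndu_outputs_by_age(names_and_mtimes):
    """Sort (filename, mtime) pairs newest-first, so a preserved
    ndu-<Package>-<rev>-act.out can be checked against the step
    that most recently ran."""
    return sorted(names_and_mtimes, key=lambda pair: pair[1], reverse=True)

def collect_ndu_outputs(directory):
    """Gather ndu-*.out files from an extracted rtp.zip directory
    and order them newest-first."""
    pairs = [(name, os.path.getmtime(name))
             for name in glob.glob(os.path.join(directory, "ndu-*.out"))]
    return ndu_outputs_by_age(pairs)
```

If the newest preserved file predates the failing package's activation, assume it was carried over from the previous package.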

A sync operation will also save some files as ndu-old-*.out. This is useful when there is a panic or reboot that prevented the output files from being preserved at the time they were running.

Force degraded mode

If an SP can be pinged, but is otherwise inaccessible, you can force it into degraded mode. This procedure can also be useful if the SP is in a panic loop, but you do not want to wait until it hits the reboot counter on its own.


Section 3 Backend Architecture

General Theory

The back end of a CLARiiON array uses many of the same principles as the front end relative to data flow. The following section should be taken in a general context; then apply this knowledge to the back end loop.

If you start by looking at what a storage area network is, you find that it is a collection of fibre channel or iSCSI nodes that usually communicate with each other via some type of media, such as fiber optic or copper wire. A node is defined as a member of the fibre channel network that is provided a physical and logical connection to the network by a physical port on the switch. Every node requires the use of specific drivers to access this network. For example, on a host, one has to install an HBA and the corresponding drivers to implement FCP or iSCSI. These drivers are responsible for translating fibre channel or IP commands into something the host can understand (such as SCSI commands) and vice versa.

The Fibre Channel nodes communicate with each other using a device such as a Fabric Switch. The primary function of a fabric switch is to provide a physical connection and logical routing of data frames between the attached devices. A switch also provides fabric services to the nodes attached to it to allow them to communicate with each other within a fabric. In the absence of a switch, arbitrated-loop or point-to-point communication is in use (discussed later). A SAN usually consists of one or more of the following components:

Fibre Channel (FC) is used as the transport protocol in most SAN implementations. It is a serial protocol that can use either copper or optics as the physical medium. Earlier implementations of FC used copper while most modern implementations use fiber optic cables. The CLARiiON storage system front end uses optical media while the back end uses copper media.


CLARiiON Backend Arbitrated Loop The ANSI Fibre Channel Standard defines three topologies: Point-to-point, Switched fabric (FC-SW) and Arbitrated loop (FC-AL). The following describes each of these. The CLARiiON back end operates on the Arbitrated loop (FC-AL) topology with some differences dependent upon the type of disk enclosure in use.


Backend data flow

How does this relate to the backend of a CLARiiON Storage System?


Data flow through each enclosure type

FC-series data flow

The FC4700, FC5700, FC5300 and FC5500 backend loop data flow is shown only to depict the difference from newer CX/CX3-series arrays. The principles of determining a failing device are the same, but the data flow is different.


CX data flow

The CX-series data flow with DAE-2

Using the information in the above two diagrams, you can find the failing device. For a standard DAE2 type of enclosure, finding the failing device is fairly straightforward. In its simplest troubleshooting form, the device reporting the highest number of errors is just the messenger. The problem device will normally be the device just prior to the one reporting the errors.
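The "messenger" heuristic above can be sketched as a small function. This is an illustration of the rule of thumb only, not an EMC diagnostic; device names and counts are made up:

```python
def suspect_device(loop_order, error_counts):
    """Given devices in loop order (upstream first) and per-device error
    counts, return the device just prior to the top reporter. If the top
    reporter is first in the loop, fall back to the reporter itself."""
    reporter = max(loop_order, key=lambda d: error_counts.get(d, 0))
    idx = loop_order.index(reporter)
    return loop_order[idx - 1] if idx > 0 else reporter

order = ["disk4", "disk5", "disk6"]
counts = {"disk4": 0, "disk5": 1, "disk6": 12}
print(suspect_device(order, counts))  # disk5
```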

The consideration of ‘upstream’ and ‘downstream’ devices has to take into account the type of disk array enclosure in use. The above two examples are for the older FC-series DAE and original CX-series DAE2 enclosures. The next section will cover the ATA and Ultrapoint (Stiletto) enclosures, showing the differences internal to each relative to data flow.


ATA (Advanced Technology Attachment) Disk Enclosures

The ATA disk enclosure is the same DAE2 type enclosure used with FC-type drives in the CX-series array. It is not available on the FC-series arrays. The difference with ATA is that it uses PATA drives with a paddle card that converts their parallel interface to serial for use on the same midplane structure of the enclosure. Additionally, the LCC (Link Control Card) is replaced with a more intelligent BCC (Bridge Control Card) that is codenamed Klondike. The BCC is also known for marketing purposes as an LCC, so be sure you are clear about the type of ‘LCC’ you are working on when requesting assistance. The firmware or frumon code that resides on the BCC is codenamed Yukon.


ATA Disk Ownership

Upon power-up, the data path to all disk drives will be validated by each LCC to set the initial ownership values within the FRUMON drive bypass register. FRUMON will initiate the validation request by first reading the drive present register. This information is passed to an LCC redundant controller process to coordinate the validation of all drives from both controllers. Once all paths have been validated and the internal database is updated, the corresponding FC AL-PA is enabled and commands can be processed for the corresponding disk drive.


Ownership of a disk path is established after validation completes, with every other disk drive ‘owned’ by the local controller. The LCC will attempt to change ownership during operation if a majority of commands are received on the controller that does not ‘own’ the disk. This is done to minimize command latency while maximizing the bandwidth capabilities of the controller. This evaluation is performed every five minutes and is based on >50% of the I/Os traversing the midplane. Since SATA is a point-to-point connection, the LCC utilizes a dedicated inter-controller link between controllers for dynamic dual path capability. The FC inter-controller link is designed to behave very similarly to a second host port on each controller, with the exception that it must also act as an initiator to send commands and messages to the peer controller. Dynamic dual path also allows for controller ‘ownership’ of a SATA disk drive, simplifying management of the data path and alleviating cache coherency issues, since all data for a given disk drive is managed on a single controller.
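The ownership-switch rule (evaluated every five minutes, triggered by a strict majority of I/Os arriving on the non-owning controller) can be expressed compactly. This is illustrative only; the real decision is made inside LCC firmware, and the counter names here are assumptions:

```python
def should_switch_owner(owner_ios, peer_ios):
    """Return True when >50% of commands in the five-minute window
    arrived on the controller that does not own the disk (i.e. they
    had to traverse the midplane)."""
    total = owner_ios + peer_ios
    return total > 0 and peer_ios / total > 0.5

print(should_switch_owner(owner_ios=40, peer_ios=60))  # True
print(should_switch_owner(owner_ios=50, peer_ios=50))  # False
```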


To assist in troubleshooting an ATA enclosure, it is important to understand that a drive exhibiting issues may have the peer BCC as the ‘owning’ BCC.

Starting at R19 Base Software, it is possible to view the BCC event logs. Due to the nature of these logs, they should be gathered and provided for review as part of the normal analysis process. Note that power cycling or resetting the BCC will clear the event log, so it is important to obtain this log prior to any recovery activity. To accomplish this, you can run an FCLI command to retrieve these logs. Since FCLI is a powerful tool, caution should be taken when executing commands.

As indicated, FLARE R19 includes a new command to retrieve the ATA BCC (aka LCC) event log and include it in SPCollect information. This command was added to improve diagnosis of some of the issues seen with ATA (Klondike) enclosures. In many cases this is the only means for diagnosing ATA BCC behavior. Since each ATA enclosure contains two BCCs, it is important to collect logs from both to allow proper diagnosis. ATA BCC event logs should be retrieved prior to making any changes, and certainly before resetting, reseating, or power-cycling the ATA BCCs, since the event logs are cleared upon reset. The retrieval command must be run prior to executing SPCollect. The command retrieves the log from the ATA BCC or BCCs attached to the SP on which the command is executed. Always collect the logs from both ATA BCCs in an ATA enclosure (i.e., execute the log retrieval command on both SPs from the FCLI prompt).

- Run EMCRemote onto each of the storage processors (SPs).
- From the command line, enter:
  flarecons d f b (when connected to SP-B)
  flarecons d f a (when connected to SP-A)
- If you can determine which enclosure or ATA BCC on the particular backend bus is operating incorrectly, retrieve the logs from it using the following command (executed on each SP):
  lccgetlog –e <enclosure_number>
  example: lccgetlog –e rmb1
- If you cannot determine which enclosure or ATA LCC is the problem, or suspect that multiple enclosures are misbehaving, use the following command (executed on each SP) to retrieve event log information on all ATA LCCs:
  lccgetlog –all
- Disconnect from FCLI mode: CTRL-C
- Note: To exit from FCLI mode, you must use CTRL-C. If you type "Quit" while in FCLI mode, you return the SP to serial mode. The only way to then get back into FCLI mode is to break into debugger mode and restart FLARE, or reseat the SP.
- Repeat the procedure for the other storage processor.
- After using the command to gather all the logs, run SPCollect as usual (emc60493).

Note that the command takes about 1-2 seconds to retrieve the log. If the retrieval command fails, a “LCC Yukon Getlog Failed for Encl:” message appears in the event log along with the enclosure number. If the command fails, the ATA BCC is likely hung or otherwise unable to return the logs. If the customer situation allows, reseat the ATA BCC or power cycle the enclosure to clear the hang. The BCC reseat or chassis reset will clear the logs, so further log retrieval is not possible.

ATA BCC LEDs

There are five LED housings mounted at the air dam so that their LEDs are visible through holes in the air dam. Two are for cable connect status, two are for loop ID, and one is for fault and power indication. All LEDs are green except for the fault LED, which is amber.

Cable on Loop LED

The LEDs indicate, when on, that there is a valid fibre channel signal on the receive side of the cable and that the cable is configured onto the fibre channel loop. The cable is taken off of the fibre channel loop if there is no valid fibre signal on the receive side of the cable and/or FRUMON was told to take the cable off of the fibre channel loop. A single error is not sufficient to take the cable off of the fibre channel loop. Fibre channel protocol is robust enough to handle errors that happen on the loop. This prevents short error bursts from removing enclosures from the loop. Short error bursts can happen for many reasons like hot insertion and removal of FRUs.
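The burst-tolerance behavior described above amounts to a debounce: a single bad sample never takes the cable off the loop, only a sustained run of invalid-signal samples does. A sketch with an assumed threshold (the actual FRUMON constant is not documented here):

```python
class CableMonitor:
    """Illustrative debounce of the cable-on-loop decision.
    The threshold value is an assumption, not the FRUMON constant."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.bad_run = 0
        self.on_loop = True

    def sample(self, signal_valid):
        if signal_valid:
            self.bad_run = 0          # short error bursts are forgiven
        else:
            self.bad_run += 1
            if self.bad_run >= self.threshold:
                self.on_loop = False  # sustained loss: take cable off loop
        return self.on_loop

mon = CableMonitor(threshold=3)
mon.sample(False)   # e.g. a hot-inserted FRU glitch
mon.sample(True)    # signal recovers, run resets
print(mon.on_loop)  # True
```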

Loop ID LEDs

Loop ID LEDs indicate which backend loop the enclosure is on. There are eight LEDs, four in each of two LED enclosures. Each LED enclosure is a single LED wide and four LEDs high. At most, one LED of the eight is lit. No LEDs lit indicates that the loop ID has not been latched into FRUMON. A blinking LED indicates that the loop ID latched into this LCC does not agree with the loop ID latched into the other LCC in this enclosure.


Power and Fault LEDs

The power and fault LEDs share the same enclosure and have built-in current limiting resistors. The power LED, when lit, indicates that 5 volts is active and that the board has been inserted far enough for the short pins, used in the board inserted loop, to have made contact. The fault LED, when lit, indicates that something is wrong, the board needs to be replaced, or the board has not finished power-up testing and configuration.

Ultrapoint (Stiletto) Disk Array Enclosure – DAE2P/DAE3P

Fibre Channel Data Path

The fibre channel data flow in and out of every port is managed by a Cut-Through-Switch (CTS). The CTS controls the Fibre Cable ports and the Fibre Disk drive ports. The LCC (codename Stiletto) architecture supports a total of 15 drives. The CTS will route the I/O traffic to its respective drive port using the destination address of the data. This provides point-to-point connectivity from the host port of the switch to the respective drive ports, avoiding a link routed through preceding disk drives.

Fibre channel data flow in and out of the cable ports is managed by the CTS. The loop’s inputs and outputs consist of a Primary port and an Expansion port, where the Primary port provides connectivity either to a host (SP) or a previous chassis, and the Expansion port provides connectivity to a downstream chassis for each loop. A downstream or upstream chassis can be either a DAE, a DPE, or a host (SP).

Each cable input port has a link monitoring capability, which monitors loss of link, link errors and link level violations. Both a digital loss-of-link signal and an analog loss of link are provided. Link errors such as 8B/10B code violations, disparity errors, and Fibre Channel frame CRC32 errors are detected. Fibre Channel link level violations include link anomalies such as loss-of-sync and comma density violations. Utilizing the link input monitoring capability, the fibre channel data entering the Primary and Expansion ports is routed through the LCC when cables containing valid fibre data are connected to those ports.

The diagram shows a bit more detail of how the CTS works. When an I/O is destined for a drive, that drive and only that drive is brought onto the backend bus. All other devices are bypassed for that I/O on the data path. This eliminates the possibility that a single drive can cause multiple error events (801 Soft SCSI Bus) on other drives. If multiple drives are getting errors, then an issue most likely exists on the data path and not with the drives. This would include the SP, cabling and LCCs. The exception is when there are 2 or more actually faulted drives in a loop.


How to troubleshoot an Ultrapoint backend bus using the ‘counters’

Currently an FCLI command called lccgetstats is available. The command will retrieve and display output of the counters. The counters have limited functionality, but what is available is very important and useful in troubleshooting a Stiletto backend bus. It can also be used on a Stiletto LCC when installed in a mixed-FC or ATA backend environment. The commands to use are:

fcli> lccgetstats -b #
fcli> lccgetstats –display

The following example is from a CX500 running R19 Base Software.

• EMCRemote into SPB and bring up flarecons on SPB (flarecons d f b)
• Execute the lccgetstats commands. The help screen follows.

fcli> lccgetstats
For the lccgetstats, more parameters are required.
Usage:
  lccgetstats –h
  lccgetstats <operation>
Help options:
Operations:
  -b        Get information for the specified bus
  -e        Get information for specified enclosure
  -display  Send the retrieved information to fcli console.

• Note: First issue the retrieval, then issue a second command to display.
• In R19 code, the HELP screen is incorrect regarding command usage. It is corrected in current code releases.
• Issue the commands as shown. Backend BUS0 and BUS1 each have two Stiletto enclosures attached.

02/24/2006 19:36:49
fcli> lccgetstats -b 1
lccgetstats request sent.
fcli> lccgetstats -display

Enclosure 8, PRI_PORT, 02/24/2006 17:06:11
Retimer and Monitor Configuration: 0x38
Retimer and Monitor Status: 0x0 (Notice there is no status.)
Retimer and Monitor LCV Error Count: 0x0 (Notice there are no error counts.)
Retimer and Monitor CRC Error Count: 0x0
Expansion A: 0x0
Expansion B: 0x0
Expansion C: 0x0
Expansion D: 0x34

Enclosure 8, EXP_PORT, 02/24/2006 17:06:11
Retimer and Monitor Configuration: 0x38
Retimer and Monitor Status: 0x0 (Notice there is no status.)
Retimer and Monitor LCV Error Count: 0x0 (Notice there are no error counts.)
Retimer and Monitor CRC Error Count: 0x0
Expansion A: 0x0
Expansion B: 0x0
Expansion C: 0x0
Expansion D: 0x15

Enclosure 9, PRI_PORT, 02/24/2006 17:06:12
Retimer and Monitor Configuration: 0x38
Retimer and Monitor Status: 0x0 (Notice there is no status.)
Retimer and Monitor LCV Error Count: 0x0 (Notice there are no error counts.)
Retimer and Monitor CRC Error Count: 0x0
Expansion A: 0x0
Expansion B: 0x0
Expansion C: 0x0
Expansion D: 0x34

Enclosure 9, EXP_PORT, 02/24/2006 17:06:12


Retimer and Monitor Configuration: 0x38
Retimer and Monitor Status: 0x222 (Notice the status change.)
Retimer and Monitor LCV Error Count: 0xffff (Error count ‘ffff’, no cable on EXP port.)
Retimer and Monitor CRC Error Count: 0x0
Expansion A: 0x0
Expansion B: 0x0
Expansion C: 0x0
Expansion D: 0x15

• Status 0x222 (see table later in document) means ‘DLOL’, digital loss of link signal, ‘CDV’, no characters seen within the frame clock period and ‘LCV’, rate errors have exceeded the frame clock period threshold. You will get this status when a cable is not connected or improperly connected to the port.

• We pull the LCC cable from the expansion port of ENC 0_BUS1.
• Execute the lccgetstats commands.

02/24/2006 19:55:45
fcli> lccgetstats -b 1
lccgetstats request sent.
02/24/2006 19:55:45
fcli> lccgetstats -display

Enclosure 8, PRI_PORT, 02/24/2006 19:55:37
Retimer and Monitor Configuration: 0x38
Retimer and Monitor Status: 0x0 (Notice there is no status.)
Retimer and Monitor LCV Error Count: 0x0 (Notice there are no error counts.)
Retimer and Monitor CRC Error Count: 0x0

Enclosure 8, EXP_PORT, 02/24/2006 19:55:37
Retimer and Monitor Configuration: 0x38
Retimer and Monitor Status: 0x222 (Notice the status has changed.)
Retimer and Monitor LCV Error Count: 0xffff (Notice there are now error counts.)
Retimer and Monitor CRC Error Count: 0x0

• Notice that Enclosure 9 is no longer displayed as it is no longer connected. The expansion port line items are removed from this example as they are not currently in use.

• Reconnect Enclosure 9 and execute the lccgetstats command. The previous buffer will be shown and then the new output will be shown. Each time you run the command, the counters will be reset to zero and what is displayed is the ‘previous’ display results and the ‘current’ command results.

• Running the lccgetstats command needs to be done several times to see if errors are incrementing. If you have a bad bus, the first time you run the command hundreds of errors may be shown.

• Execute the command again to clear the counters (you can display them if you wish). Then wait 5-10 minutes and rerun the command to see if the errors have incremented.

• Do this a few times to see if errors continue. As with any troubleshooting effort of a backend bus, I/O is required and it may take minutes or hours for errors to get generated. Suggestion: execute a Background Verify to a LUN on the backend bus. This will help generate I/O to assist you in isolating the backend bus.
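Deciding whether errors are "continuing" across several lccgetstats runs reduces to flagging ports that show nonzero deltas in more than one sample (each retrieval resets the counters, so each display is already a delta). An illustrative sketch; the port names and sampling structure are assumptions:

```python
def recurring_error_ports(samples):
    """samples: list of dicts mapping port name -> LCV count, one dict per
    lccgetstats run. Return ports with nonzero counts in more than one run,
    i.e. ports whose errors keep incrementing."""
    nonzero_runs = {}
    for sample in samples:
        for port, count in sample.items():
            if count:
                nonzero_runs[port] = nonzero_runs.get(port, 0) + 1
    return [port for port, runs in nonzero_runs.items() if runs > 1]

runs = [
    {"Enc8 PRI": 120, "Enc8 EXP": 0},   # first run may show a large backlog
    {"Enc8 PRI": 35,  "Enc8 EXP": 0},   # after ~5-10 minutes of I/O
    {"Enc8 PRI": 9,   "Enc8 EXP": 0},
]
print(recurring_error_ports(runs))  # ['Enc8 PRI']
```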


The following are descriptions of the registers returned in the lccgetstats output.

Retimer and Monitor Configuration: 0x38
Indicates how the monitor is configured. Not important for field usage in reviewing lccgetstats values.

Retimer and Monitor Status: 0x###

Retimer and Monitor LCV Error Count: 0x#
Line code violations are the occurrence of either a Bipolar Violation (BPV) or Excessive Zeroes (EXZ) error event. BPV is the occurrence of a pulse of the same polarity as the previous pulse. EXZ is the occurrence of more than fifteen contiguous zeroes. This register counts all the line code violations/disparity errors detected by the 8B/10B decoder.

Retimer and Monitor CRC Error Count: 0x#
Cyclic Redundancy Check (used to verify the integrity of a data block) errors found during the retiming of the FC signal. This register counts all the CRC32 errors detected by the CRC checker.

Expansion A: 0x#
Expansion B: 0x#
Expansion C: 0x#
Expansion D: 0x#
Unused counters for future FRUMON error counts. Ignore any values.
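As an illustration of reading the status word: the bit masks below are assumptions, chosen only so that 0x222 decodes to the three conditions described earlier for a disconnected cable (DLOL, CDV, LCV). The authoritative layout is the status table referenced above:

```python
# Assumed bit positions; the real register layout is in the status table.
STATUS_BITS = {
    0x200: "DLOL (digital loss of link)",
    0x020: "CDV (comma density violation)",
    0x002: "LCV (line code violation rate exceeded)",
}

def decode_status(status):
    """Return the names of the assumed status bits set in the word."""
    return [name for mask, name in STATUS_BITS.items() if status & mask]

print(decode_status(0x222))  # all three conditions set
print(decode_status(0x0))    # []
```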


How to interpret the output of the Ultrapoint counters

Output of the lccgetstats is not difficult to interpret once you understand the order of data flow. Since data flow is monitored by the ‘outbound’ counters first and then the ‘inbound’ counters, it is suggested to realign the output of the display. From a previously shown example, you can rearrange the counters and list them in the order of the data path. It would look like the following:

Enclosure 8, PRI_PORT, 02/24/2006 17:06:11
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x0
  Retimer and Monitor LCV Error Count: 0x0
  Retimer and Monitor CRC Error Count: 0x0
  Expansion A: 0x0  Expansion B: 0x0  Expansion C: 0x0  Expansion D: 0x34

Enclosure 9, PRI_PORT, 02/24/2006 17:06:12
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x0
  Retimer and Monitor LCV Error Count: 0x0
  Retimer and Monitor CRC Error Count: 0x0
  Expansion A: 0x0  Expansion B: 0x0  Expansion C: 0x0  Expansion D: 0x34

Enclosure 9, EXP_PORT, 02/24/2006 17:06:12
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x222
  Retimer and Monitor LCV Error Count: 0xffff
  Retimer and Monitor CRC Error Count: 0x0
  Expansion A: 0x0  Expansion B: 0x0  Expansion C: 0x0  Expansion D: 0x15

Enclosure 8, EXP_PORT, 02/24/2006 17:06:11
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x0
  Retimer and Monitor LCV Error Count: 0x0
  Retimer and Monitor CRC Error Count: 0x0
  Expansion A: 0x0  Expansion B: 0x0  Expansion C: 0x0  Expansion D: 0x15

Simply find a method of viewing the output that lets you line up the counters and understand where each one resides in the data path.
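One such method is to realign the entries programmatically. The following is a sketch of the reordering described above — outbound (PRI_PORT) counters in ascending enclosure order, then inbound (EXP_PORT) counters in descending enclosure order. The parsing is simplified to pre-split (enclosure, port) pairs, and the function name is invented:

```python
def datapath_order(entries):
    """Reorder lccgetstats entries to follow the loop's data path:
    outbound (PRI_PORT) counters in ascending enclosure order,
    then inbound (EXP_PORT) counters in descending enclosure order.
    Each entry is an (enclosure, port) pair here for simplicity."""
    pri = sorted(e for e in entries if e[1] == "PRI_PORT")
    exp = sorted((e for e in entries if e[1] == "EXP_PORT"), reverse=True)
    return pri + exp

entries = [(9, "EXP_PORT"), (8, "PRI_PORT"), (8, "EXP_PORT"), (9, "PRI_PORT")]
print(datapath_order(entries))
# -> [(8, 'PRI_PORT'), (9, 'PRI_PORT'), (9, 'EXP_PORT'), (8, 'EXP_PORT')]
```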


EXAMPLE: The following array has been getting several (801) Soft SCSI Bus Errors on BUS 3, SPB side, of a CX700 that is configured with Stiletto-based enclosures. The CRU list for BUS 3 is as follows:

DAE2P Bus 3 Enclosure 0
  Bus 3 Enclosure 0 Fan A State: Present
  Bus 3 Enclosure 0 Fan B State: Present
  Bus 3 Enclosure 0 Power A State: Present
  Bus 3 Enclosure 0 Power B State: Present
  Bus 3 Enclosure 0 LCC A State: Present
  Bus 3 Enclosure 0 LCC B State: Present
  Bus 3 Enclosure 0 LCC A Revision: 6.60
  Bus 3 Enclosure 0 LCC B Revision: 6.60
  Bus 3 Enclosure 0 LCC A Serial #: FCNBD054403509
  Bus 3 Enclosure 0 LCC B Serial #: FCNBD054103814

DAE2P Bus 3 Enclosure 1
  Bus 3 Enclosure 1 Fan A State: Present
  Bus 3 Enclosure 1 Fan B State: Present
  Bus 3 Enclosure 1 Power A State: Present
  Bus 3 Enclosure 1 Power B State: Present
  Bus 3 Enclosure 1 LCC A State: Present
  Bus 3 Enclosure 1 LCC B State: Present
  Bus 3 Enclosure 1 LCC A Revision: 6.60
  Bus 3 Enclosure 1 LCC B Revision: 6.60
  Bus 3 Enclosure 1 LCC A Serial #: FCNBD054403853
  Bus 3 Enclosure 1 LCC B Serial #: FCNBD054405233

DAE2P Bus 3 Enclosure 2
  Bus 3 Enclosure 2 Fan A State: Present
  Bus 3 Enclosure 2 Fan B State: Present
  Bus 3 Enclosure 2 Power A State: Present
  Bus 3 Enclosure 2 Power B State: Present
  Bus 3 Enclosure 2 LCC A State: Present
  Bus 3 Enclosure 2 LCC B State: Present
  Bus 3 Enclosure 2 LCC A Revision: 6.60
  Bus 3 Enclosure 2 LCC B Revision: 6.60
  Bus 3 Enclosure 2 LCC A Serial #: FCNBD054207867
  Bus 3 Enclosure 2 LCC B Serial #: FCNBD053004074

The SP event log for SPB has been showing the following:

05/01/2006 00:41:15 Bus 3 Enclosure 0 Disk 10 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:43:08 Bus 3 Enclosure 0 Disk 8  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:44:02 Bus 3 Enclosure 0 Disk 5  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:44:51 Bus 3 Enclosure 0 Disk 7  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:44:53 Bus 3 Enclosure 0 Disk 6  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:45:03 Bus 3 Enclosure 0 Disk 6  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:45:16 Bus 3 Enclosure 0 Disk 6  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:45:26 Bus 3 Enclosure 0 Disk 5  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:46:38 Bus 3 Enclosure 0 Disk 10 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:47:15 Bus 3 Enclosure 0 Disk 11 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:47:38 Bus 3 Enclosure 0 Disk 7  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:47:41 Bus 3 Enclosure 0 Disk 11 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:47:53 Bus 3 Enclosure 0 Disk 8  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:49:52 Bus 3 Enclosure 0 Disk 12 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:49:55 Bus 3 Enclosure 0 Disk 6  (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:51:56 Bus 3 Enclosure 0 Disk 6  (801) Soft SCSI Bus Error [0x00] 0 2a

The log continues but is shortened for the purposes of this document.
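When the event log runs to hundreds of lines, tallying the (801) events per disk makes the distribution easier to see. A minimal sketch, assuming the log line format shown in the excerpt above (the regular expression and function name are illustrative, not part of any EMC tool):

```python
import re
from collections import Counter

# Tally (801) Soft SCSI Bus Errors per disk from SP event log text.
# The pattern matches the format shown above; a production parser
# would need to handle other event codes as well.
PAT = re.compile(r"Bus (\d+) Enclosure (\d+) Disk\s*(\d+)\s*\(801\)")

def tally_801(log_text):
    counts = Counter()
    for bus, enc, disk in PAT.findall(log_text):
        counts[(int(bus), int(enc), int(disk))] += 1
    return counts

log = """05/01/2006 00:44:53 Bus 3 Enclosure 0 Disk 6 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:45:03 Bus 3 Enclosure 0 Disk 6 (801) Soft SCSI Bus Error [0x00] 0 2a
05/01/2006 00:46:38 Bus 3 Enclosure 0 Disk 10(801) Soft SCSI Bus Error [0x00] 0 2a"""
print(tally_801(log))  # disk 6 twice, disk 10 once
```

A wide spread of errors across many disks in one enclosure, as in this example, points at the bus or LCC rather than at any single drive.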


In order to effectively troubleshoot this backend bus, obtain the lccgetstats output for BUS 3 on the SPB side.

05/31/2006 08:32:24
fcli> lccgetstats -b 3
lccgetstats request sent.

NOTE: The buffered contents for the previous lccgetstats request are not shown for brevity.

05/31/2006 08:32:31
fcli> lccgetstats -display

Enclosure 24, PRI_PORT, 05/31/2006 08:32:31    <- Outbound path (ENC 0 – BUS 3)
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x0
  Retimer and Monitor LCV Error Count: 0x0
  Retimer and Monitor CRC Error Count: 0x0     <- No errors outbound for CRC
Enclosure 24, EXP_PORT, 05/31/2006 08:32:31    <- Inbound path (ENC 0 – BUS 3)
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x2
  Retimer and Monitor LCV Error Count: 0x15d
  Retimer and Monitor CRC Error Count: 0x5     <- Inbound path on ENC 0 – BUS 3 detects CRC errors
Enclosure 25, PRI_PORT, 05/31/2006 08:32:31    <- Outbound path (ENC 1 – BUS 3)
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x0
  Retimer and Monitor LCV Error Count: 0x0
  Retimer and Monitor CRC Error Count: 0x0     <- No errors outbound
Enclosure 25, EXP_PORT, 05/31/2006 08:32:31    <- Inbound path (ENC 1 – BUS 3)
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x2
  Retimer and Monitor LCV Error Count: 0x245
  Retimer and Monitor CRC Error Count: 0x5     <- Inbound path on ENC 1 – BUS 3 detects CRC errors
Enclosure 26, PRI_PORT, 05/31/2006 08:32:31    <- Outbound path (ENC 2 – BUS 3)
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x0
  Retimer and Monitor LCV Error Count: 0x0
  Retimer and Monitor CRC Error Count: 0x0     <- No errors outbound
Enclosure 26, EXP_PORT, 05/31/2006 08:32:31    <- Inbound path (ENC 2 – BUS 3)
  Retimer and Monitor Configuration: 0x38
  Retimer and Monitor Status: 0x222
  Retimer and Monitor LCV Error Count: 0xffff  <- Last ENC, loops back before the connectors
  Retimer and Monitor CRC Error Count: 0x0

If we view these errors pictorially, the first diagram shows the data path on a single backend bus.


The second diagram above shows which devices could possibly be at issue. A bad 'CABLE 2' could potentially have been inserted; to test a cable, you can reverse it and recheck the counters, which will prove or disprove a bad cable. In the actual case this example comes from, the LCC in ENC 2 and 'Cable 2' were replaced. The remaining device to replace was the LCC in ENC 1, which fixed the issues on BUS 3. A bent pin was found on the EXP PORT of the LCC in ENC 1.

With the switched LCC design of Ultrapoint, the enclosure provides additional features that make troubleshooting easier and give greater stability on the backend bus. Use lccgetstats to assist in troubleshooting; a full understanding of how a Stiletto functions will greatly enhance your ability to effectively isolate issues.

LCC LEDS

FRUMON is in charge of a variety of LED tasks, each of which is described below.

Link status and speed

Each cable port has a link status and speed LED. This single LED per port serves a dual purpose. If the LED is off, the link is bad. If the LED is steadily on, the link is a valid 2G link. If the LED is steadily on and blinks 4 times every 4 seconds, the link is a valid 4G link. Both of these LEDs are GREEN. This follows the HAMMER family convention for LEDs. It is recommended that a new FRUMON command be added that 'aligns' the LEDs' strobe effect so that all LEDs on a loop blink at the same time.

Power good and Fault

The Power Good LED is lit when all of the voltages are good. The Fault LED should be lit at power up and should not turn off until all POST activity finishes.

Enclosure ID

Eight blue LEDs represent the Enclosure ID; if the enclosure ID is not set, it defaults to ENC_7.

Loop ID

Eight green LEDs represent the Loop ID; if the Loop ID is not set, the LEDs default to off. During power up, these Loop ID LEDs provide a kind of POST code status. Each LED is associated with a stage that FRUMON goes through; in the event of a problem, the LEDs indicate the stage that did not function properly.

Fault LED   Loop ID   Function working on
ON          No LEDS   OS didn't start (bad hardware)
ON          0         Test internal hardware of SMC
ON          1         Test external hardware of SMC (flash/sram/registers)
ON          2         I2C works
ON          3         Com Port works
ON          4         Fibre DDR works
ON          5
ON          6
ON          7
OFF         No LEDS   FRUMON finished - Set proper Loop ID when commanded by host.

(Note the host may have issued the command during this power up process)
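The power-up table above can be summarized as a simple lookup. This is only an illustrative mapping of the documented stages (stages 5-7 are unassigned in the table, so unknown combinations fall through to a placeholder; the function name is invented):

```python
# Maps the Fault LED state plus the lit Loop ID LED to the stage
# FRUMON was working on at power up, per the table above.
POST_STAGES = {
    ("ON", None): "OS didn't start (bad hardware)",
    ("ON", 0): "Test internal hardware of SMC",
    ("ON", 1): "Test external hardware of SMC (flash/sram/registers)",
    ("ON", 2): "I2C works",
    ("ON", 3): "Com Port works",
    ("ON", 4): "Fibre DDR works",
    ("OFF", None): "FRUMON finished - set proper Loop ID when commanded by host",
}

def post_stage(fault_led, loop_id):
    return POST_STAGES.get((fault_led, loop_id), "unknown / reserved stage")

print(post_stage("ON", 2))  # -> I2C works
```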


Other options for backend isolation

SP Event Logs

Open CAP2 and select LAUNCH -> SPLAT. The merged event log will be displayed.
Apply FILTERS to show just backend problems. Only the relevant events will be shown.
Set the display mode to ANALYZE. The events will be shown as in the display below.


RLS Monitor Logs

How to run FBI in CAP2 and analyze the output is detailed, along with other CAP2 uses, in Primus solution emc110664. To use FBI within CAP2, use the following instructions. Note that if you have FBI reports you wish to analyze within CAP2, see solution emc110663 (see below).

Run SPcollects against the array and have that XML file open in CAP2. Previously generated XML files for this array can be selected from the File drop-down menu; if none exist, you must run SPcollects.

From the CAP2 toolbar, select Tools -> Launch -> FBI DiagX GUI

Select Edit -> Preferences. Enter the IP address for SP A, and click "OK."

Select Actions -> Run Sizer. This may take a while to complete.

Click the green GO button. The red STOP button will be lit.

Run FBI for as long as required. Note that FBI will consume resources while it runs.

Click on the red STOP icon. FBI may take a minute to stop.

Close out the FBI program. FBI data will be analyzed and shown in the Analysis tab.

ID: emc110663

Goal: How to use CAP2 to review FBI .rls files

Symptom: CAP2 cannot use FBI .rls output files directly. If FBI is run external to CAP2, the output files have a format that cannot be directly used by the current CAP2 program. These files have a naming convention of SPA_<array serial number>_<enclosure>_<bus>_curr.rls. The files must be converted to a file format that CAP2 can import; the naming convention for that format is RlsReportLog<date and time>.txt.

To convert the files:

Start CAP2 using one of the following methods: double-click the CLARiiON Array Properties icon on your desktop (if you created one); click the Quick Launch icon in your task bar (if you created one); or go to Start -> Programs -> CAP2 -> CLARiiON Array Properties.
From the CAP2 main screen, select Tools -> Launch -> FBI DiagX GUI. FBI will be launched.
Select File -> Load RLS Common Format File. Locate the .rls file that is to be converted.
Select File -> Write RLS Progress Reports.
Close FBI.

The output file will be located in the C:\Program Files\CAP2\FBIDiagX directory and its name will have the format RlsReportLog<date and time>.txt. On the CAP2 screen, select File, Load FBI Data, and browse to the file that you have just created. The file contents will be analyzed and the output will be appended to the report under the Analysis tab.
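For scripted handling of FBI output, the SPA_<array serial number>_<enclosure>_<bus>_curr.rls naming convention described above can be parsed as follows. This is a sketch only; the serial number in the example is invented, and the serial format assumed here is simple alphanumerics:

```python
import re

# Parser for the FBI output naming convention:
# SP<A|B>_<array serial number>_<enclosure>_<bus>_curr.rls
RLS_PAT = re.compile(r"^SP([AB])_([A-Za-z0-9]+)_(\d+)_(\d+)_curr\.rls$")

def parse_rls_name(filename):
    m = RLS_PAT.match(filename)
    if not m:
        return None                      # not an FBI .rls file
    sp, serial, enclosure, bus = m.groups()
    return {"sp": sp, "serial": serial,
            "enclosure": int(enclosure), "bus": int(bus)}

print(parse_rls_name("SPA_CK200054300123_2_3_curr.rls"))
```

Grouping the parsed files by bus makes it easy to collect every .rls report for the backend loop under investigation.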


To launch the FBI tool and either view existing files or run the FBI program, perform the following:

Open CAP2 and select the TOOLS option LAUNCH.
Select 'FBI DiagX GUI'. The program will open.
Select FILE and the option shown, Load RLS Common…
The window of information shown below will display.


Section 4 Troubleshooting & Tools

CAP

Introduction - CAP is a distribution of tools used to service CLARiiON storage servers. Included in CAP are the CAP task bar, CLARiiON Array Properties, and SPLAT, the SP Log Analysis Tool. Functionality in the workbench includes:

• Ability to parse SPCollect zip files generated on FC4700 & CX arrays running all major releases of FLARE
• Ability to directly capture SPCollect zip files from FC4700 & CX arrays running FLARE software R12 and above
• Ability to directly capture configuration data from Cisco, McData and Brocade FIBRE switches and from hosts running Navisphere host agent software
• Ability to analyze configuration data, identifying issues which need to be addressed
• Ability to analyze server logs using extensible filters, with intuitive searching, marking and annotation capabilities
• Ability to interface with the EMC SYR CLARiiON repository, providing long-term data retention and retrieval services

CAP is distributed to Customer Service Engineers, Regional Support Specialists, Technical Support Specialists and Software Developers. The workbench provides a standard framework for exchanging customer configuration data via SPCollect zip files and CAP XML files.

Dependencies - CAP depends on SPCollect, which is distributed on CLARiiON storage processors, and on navicli (to communicate with the CLARiiON storage processors). Note that executing SPCollect on SPs requires a significant amount of resources.

Installation - The installation kit for CAP includes:

• CAP • SPLAT • mergelogs • NaviCli • FBI DiagX

The installation kit removes the prior version of CAP.

Functional Description - CAP is a Java Swing application providing a standard dashboard for managing common service activities via multiple service wizards. The dashboard is comprised of a menu/icon tool bar, a main display pane containing a variety of report tabs, and a status window. CAP contains a configuration manager that tracks user activity, including:

• Tracking of the most recently accessed directory used during file operations
• Profile management for recording array, switch and host capture configuration data
• Automatic registration of CAP XML configuration files to the SYR upload queue, used for batch uploading of configuration data when service personnel have network access
• Tracking of the 10 most recently used CAP configuration files, providing a quick reference to configuration files

CAP contains configuration file converters and multiple wizards used to interface with storage processors, FIBRE switches and hosts, and provides access facilities for uploading and downloading files to/from the SYR DBMS. The functionality within CAP is presented with screenshots of workbench panels and dialogs along with text describing the purpose of these components.


Main Display - The main display of CAP provides access to all of the facilities of Service Workbench.

Issue Report Tab - All major Service Workbench wizards and file converters generate an XML configuration file. Once created, this file is loaded into the main display window, displaying a summary of issues found in that configuration.

File Menu - The File Menu provides access to file converters.

File->Open - Use File open to load existing XML configuration files or to create XML configuration files from SPCollect zip files. The CAP XML configuration files can be constructed from SPCollect zip files, navicli getall files and ReportClariion naviall files. The default action of the file open dialog selects SPCollect zips and CAP XML reports in the browser. Select from the "Files of type" drop-down list to change the browser's file selection type.


File->Save As

Use “Save As” to convert a CAP configuration XML file to Excel. After selecting a file name the CAP2CCPF converter will generate an XLS file.

File->Add Report

Text files can be inserted into a CAP XML configuration file via File “Add Report”. Once added these files are displayed under the Analysis Tab.

File->Load FBI - Report files generated by the DiagX GUI can be merged into CAP XML configuration files via File "Load FBI". FBI reports are parsed, with counts stored within the appropriate drive structures; when non-zero counts are found, issues identifying the affected drives are added to the Issues Tab. NOTE: CLARiiON storage processors running Release 19 microcode automatically capture RLS statistics, which will be contained in the SPCollect zip files. Load FBI Data can be used to parse output from the DiagX GUI, which captures RLS data from arrays running older versions of microcode.


File->Print->Preview
File->Print->Print

Tools menu - The Tools menu provides access to CAP wizards.

Tools->Capture - Select Tools -> Capture to execute the capture wizard. This wizard uses the currently loaded CAP XML file to initialize the configure capture process before entering the capture dialog menus. The main capture Action dialog contains 3 check boxes used to enable array, switch and host capture. Selecting "Collect Array Information" enables the SP-A/B IP fields. Enter the IP addresses for the array.

Select “Configure SPCollect settings” and pick the Next button to bring up the SPCollect settings dialog. Use this dialog to configure tools used to communicate with the array or to change the timer values used when interacting with the array.

When "Collect Host Configuration" is selected, the Host capture dialog will be displayed. This dialog can be used to configure the list of hosts that will be queried during the capture process.


When "Collect Switch Configuration" is selected, the Switch configuration dialog will be displayed. This dialog can be used to configure the list of switches that will be queried during the capture process. Picking the Finish button launches the capture process. When the "Collect Array Configuration" check box is selected, the array collect dialog will be displayed. This dialog shows the progress of the array capture process by displaying the commands and command output of the polling process that interacts with the array. Once the capture process is completed, a new CAP XML configuration file is generated and displayed.

Tools->Download Dumps and SPCollect

Select Tools -> Download Dumps and SPCollect to launch the download wizard.

Enter the SP IP addresses and the download path before picking the Next button to begin the download process. The Choose Files dialog will display the files available for download. Select the desired files and pick the Ok button to start downloading files.


Tools->Service Workbench->Enhanced Install Procedure

The Enhanced Install Procedure wizard guides service personnel through the install procedure for a recently installed CLARiiON storage processor. This wizard is structured to complete the install procedure process in 30 minutes. The wizard displays an initial registration panel. All of the fields (except Remarks) need to be specified before the wizard will allow the user to proceed to the next panel.

Next, the wizard displays a configuration panel. All of the fields need to be specified before the install procedure process can begin.


FIELD NAME - DESCRIPTION

Max Run Time (minutes) - The approximate time (in minutes) that the wizard will run. This includes the time for the SPCollect and Zerodisk processes. The default is 30 minutes, and the minimum value allowed is 10 minutes, since the time needed for the SPCollect is assumed to be 10 minutes. The remainder of the time is allotted for the FBI/Zerodisk process. Must be a positive number greater than 10.

No Progress Timeout (seconds) - The maximum time any individual SPCollect step can run without responding. When SPCollect starts to run, it uses the No Progress Timeout (in seconds) to make sure that the process changes from one stage to the next within this time. If there is no progress, it will abort automatically. Must be a positive number.

Check Interval (seconds) - The interval (in seconds) at which progress is checked during the SPCollect capture process. Must be a positive number.

NaviCLI Path - The path used to locate the navicli executable. The file must exist.
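The field rules above amount to a small validation routine. The following sketch is illustrative only; the function is not part of CAP, and the rules (run time greater than 10 minutes, positive timeouts, an existing navicli file) are taken directly from the table:

```python
import os

# Validate the Enhanced Install Procedure wizard fields described above.
def validate_settings(max_run_time, no_progress_timeout,
                      check_interval, navicli_path):
    errors = []
    if max_run_time <= 10:
        errors.append("Max Run Time must be a positive number greater than 10")
    if no_progress_timeout <= 0:
        errors.append("No Progress timeout must be a positive number")
    if check_interval <= 0:
        errors.append("Check Interval must be a positive number")
    if not os.path.isfile(navicli_path):
        errors.append("NaviCLI Path: file must exist")
    return errors

print(validate_settings(30, 60, 5, "/nonexistent/navicli"))
# only the NaviCLI path check fails for these values
```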

Lastly, the wizard displays a panel in which the user specifies the IP addresses of the array on which the install procedure is to be performed, as well as the base output directory in which the output files will be stored. All of the fields need to be specified before the wizard will allow the install procedure to begin.

The wizard will verify that the array is accessible, that both IP addresses belong to the same storage processor, and that the user has Privileged Access to the storage processor. Pick Next to start the install procedure.

Once the install procedure starts, the status panel will be displayed. The wizard commences by contacting the storage processor to access needed configuration data. It then creates a directory for the Engagement Number and, under this, a dated subdirectory for the array (the subdirectory name is based on the array serial number plus the current date and time). It then places all of its output files into this subdirectory.
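The directory layout just described (an Engagement Number directory containing a dated subdirectory named from the array serial number plus the date and time) can be sketched as follows. The exact timestamp format is an assumption, since only the structure is documented, and the engagement number and serial in the example are invented:

```python
import os
import tempfile
from datetime import datetime

# Create <base>/<engagement number>/<serial>_<date>_<time>/ as described above.
def make_output_dir(base, engagement_number, serial, now=None):
    now = now or datetime.now()
    sub = "%s_%s" % (serial, now.strftime("%m-%d-%Y_%H-%M-%S"))
    path = os.path.join(base, engagement_number, sub)
    os.makedirs(path, exist_ok=True)
    return path

demo = make_output_dir(tempfile.mkdtemp(), "ENG12345", "CK200054300123",
                       datetime(2007, 8, 30, 10, 0, 0))
print(demo)  # ends with ENG12345/CK200054300123_08-30-2007_10-00-00
```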


It is expected that the user will perform enhanced install procedures before storage processors enter customer service. As such, this wizard will warn the user if any LUNs have been bound. If LUNs are bound, the user needs to contact the customer to ensure that the storage processor is not in use. If the user decides to proceed with the install procedure, the wizard will perform it, bypassing drives contained in RAID Groups. If the wizard detects that there are currently bound LUNs on the storage processor, it will display the following dialog. The user may then select the appropriate response after verifying the configuration with the customer.

The wizard then issues commands to the storage processor to reset statistics, reset the SP event logs, and clear the dump directory, after which it launches the FBI monitor (diagXGUI.exe) if the storage processor is running a FLARE release earlier than 16. If the SP is running FLARE Release 16 or above, the wizard uses array-side FBI monitoring. The wizard will configure write cache and enable statistics before directing the user to run Power Cycle Testing. The following two dialogs will be displayed to walk the user through this task.

The wizard next creates a list of unbound disks and uses it to perform Zerodisk activity, which continues until the Max Run Time (minutes) has elapsed. Should the user click the 'Cancel' button prior to this, the following will be displayed.


The user may then select the appropriate response. After completing (or skipping) Zerodisk, the wizard shows the progress of the array capture process by displaying the commands and command output of the polling process that interacts with the array.

Once the capture process is completed, the wizard restores the original cache settings, generates a report of the cache statistics, clears the SP getlogs, then waits for the user to click the Finish button. This allows the user to copy any external files (e.g. SPCollects from FLARE R12 or earlier) into the output directory. After the user clicks the Finish button, the wizard generates a registration.XML file, which documents this service activity, and a new CAP XML configuration file. The logging information from the status panels is saved to text files in the output directory, and all files in this directory are zipped up to create a zip file, xxxxxxxxxxxxxx_MM-DD-YYYY_hh-mm-ss_Enhanced_Install_Procedure.zip, where:

xxxxxxxxxxxxxx  Array serial number
MM-DD-YYYY      Date of assessment
hh-mm-ss        Time of assessment
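Putting the legend together, the zip name can be built mechanically. A sketch only; the serial number in the example is invented:

```python
from datetime import datetime

# Builds the zip file name in the documented format:
# <array serial>_MM-DD-YYYY_hh-mm-ss_Enhanced_Install_Procedure.zip
def install_zip_name(serial, when):
    return "%s_%s_Enhanced_Install_Procedure.zip" % (
        serial, when.strftime("%m-%d-%Y_%H-%M-%S"))

print(install_zip_name("CK200054300123", datetime(2007, 8, 30, 14, 5, 9)))
# -> CK200054300123_08-30-2007_14-05-09_Enhanced_Install_Procedure.zip
```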

The registration XML file contains the information entered into the initial wizard panels by the user. Lastly, the profile for the array is updated, the display is loaded with the new CAP XML configuration file, and, if automatic FTP forwarding has been enabled, the output zip file is uploaded to the target configured for FTP forwarding.

Tools->Service Workbench->CLARiiON Configuration Review

The CLARiiON Configuration Review wizard guides service personnel through the procedure to review the current configuration of a CLARiiON storage processor. This wizard is structured to complete the review process in 30 minutes. The wizard displays an initial registration panel. All of the fields (except Remarks) need to be specified before the wizard will allow the user to proceed to the next panel.


Next, the wizard displays a configuration panel. All of the fields are set to default values, which should be adequate during normal operation. Extend the timeout values if the storage processor is under extreme load. Specify an alternate NaviCLI executable if the storage processor is running microcode that requires a particular version of NaviCLI.

FIELD NAME - DESCRIPTION

Max Run Time (minutes) - The approximate time (in minutes) that the wizard will run. This includes the time for the SPCollect and FBI monitoring processes. The default is 30 minutes, and the minimum value allowed is 10 minutes, since the time needed for the SPCollect is assumed to be 10 minutes. The remainder of the time is allotted for the FBI monitoring process. Must be a positive number greater than 10.

No Progress Timeout (seconds) - The maximum time any individual SPCollect step can run without responding. When SPCollect starts to run, it uses the No Progress Timeout (in seconds) to make sure that the process changes from one stage to the next within this time. If there is no progress, it will abort automatically. Must be a positive number.

Check Interval (seconds) - The interval (in seconds) at which progress is checked during the SPCollect capture process. Must be a positive number.

NaviCLI Path - The path used to locate the navicli executable. The file must exist.

The FBI Configuration section changes its behavior depending on the selections the user has made on the "User Registration" form.

• If the user has selected an "Activity Type" of FCO or Software Upgrade, "Execute FBI Process" will be unchecked and enabled.
• If the user has specified the "Engagement Type" as Clarify Case, "Execute FBI Process" will be checked and disabled.
• If the user has selected a PAS Number, "Execute FBI Process" will be unchecked and enabled.
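The defaults above can be expressed as a small decision function. This is an illustrative sketch only; the precedence between the rules is an assumption, since the text does not say which selection wins when several apply:

```python
# Sketch of the "Execute FBI Process" defaults, returning
# (checked, enabled) for the checkbox.
def fbi_checkbox_state(activity_type=None, engagement_type=None,
                       pas_number_selected=False):
    if engagement_type == "Clarify Case":
        return (True, False)    # checked and disabled
    if activity_type in ("FCO", "Software Upgrade"):
        return (False, True)    # unchecked and enabled
    if pas_number_selected:
        return (False, True)    # unchecked and enabled
    return (False, True)        # assumed default

print(fbi_checkbox_state(engagement_type="Clarify Case"))  # -> (True, False)
```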

For arrays running Release 16 or above, array-side FBI monitoring will be utilized; otherwise the Windows utility diagXGUI.exe will be used to provide FBI monitoring. Next, the wizard displays a panel in which the user specifies the IP addresses of all of the arrays on which the configuration review will be performed, as well as the base output directory into which output files will be stored. At least one array must be entered before the wizard will allow the user to advance to the next panel. IP addresses for additional arrays can be added as needed.


For each array specified, the wizard will verify that the array is accessible, that both IP addresses belong to the same array, and that the user has Privileged Access to that array. Next, the wizard displays a dialog used to configure the list of hosts which will be queried during the configuration review. The Hosts to Query list is initially populated with all of the known hosts attached to the arrays specified in the previous panel. The user may specify the IP addresses of any other host, or may choose to move some of the hosts to the exclude list, which will prevent them from being queried. The Collect Host Data checkbox must be unchecked, or at least one host must be defined, before the wizard will allow the user to advance to the next panel. Lastly, the wizard displays a dialog used to configure the list of switches which will be queried during the configuration review. When the Collect Switch Data checkbox is checked, at least one switch must be defined before the wizard will allow the user to advance to the next panel.

Pick Next to begin the configuration review process. All configuration capture processes are executed in parallel. Separate logs are maintained for each configuration capture process managed by the Configuration Review wizard. The following progress panel is displayed during the configuration capture process.


The drop-down list at the top of this panel can be used to select which of the logs will be displayed in the main panel. Initially, the Overview log is displayed. This log tracks the high-level progress of the Configuration Review. More detailed logs tracking the capture progress on each array are also available for display.

When all configuration capture processes have completed, the Finish button will be enabled. The configuration review commences by giving the user instructions for running the Navisphere Service Tool on non-Windows hosts. If the user had checked the capture host data checkbox, he will be directed to run EMCGrab/EMCReports and told where to place the resulting output files. If the user had checked the capture switch data checkbox, he will be directed to perform the switch data collection and told where to place the resulting output files.

Since the configuration review wizard has had no communication with any switches, it has no way to differentiate between McData / Brocade / Cisco switches. As a result, a popup is always displayed when switch data collection is requested. The Navisphere Service Tool requires credentials to be authenticated with the array. These credentials are kept in the profile for the array. If credentials have changed, are invalid or have never been obtained, a popup will be shown. The user has a maximum of three attempts to enter a valid username/password combination for the specified array.

The main collection phase proceeds in parallel with a separate review for each array specified on the Array Definition Form. Also, if the user specified host or switch data collection, a host and switch review will commence. Once all configuration review processes are completed, the wizard prompts the user to click the Finish button. Picking the Finish button causes the wizard to copy all externally generated files to the appropriate Array Serial Number directories. It then generates a registration.XML file detailing the activities completed in this configuration Review and for each array as well as a new CAP XML configuration file for each array. The logging information for each array status panel is saved to text files in each array’s output directory. Lastly all files in each array directory are zipped up to create separate zip files (one per array):


xxxxxxxxxxxxxx_MM-DD-YYYY_hh-mm-ss_Enhanced_Install_Procedure-2.zip where:

xxxxxxxxxxxxxx   Array serial number
MM-DD-YYYY       Date of assessment
hh-mm-ss         Time of assessment
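The naming convention above can be sketched in code; the serial number and timestamp below are purely illustrative, not from a real array:

```python
from datetime import datetime

def output_zip_name(serial: str, when: datetime) -> str:
    """Build the per-array zip name used by the Configuration Review."""
    stamp = when.strftime("%m-%d-%Y_%H-%M-%S")
    return f"{serial}_{stamp}_Enhanced_Install_Procedure-2.zip"

# Serial number and time below are made up for illustration.
print(output_zip_name("APM00054800001", datetime(2007, 8, 30, 14, 5, 9)))
```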

The registration xml file contains the information entered into the initial wizard panels by Service Personnel. Lastly, the profile for each array is updated, the display is loaded with the new CAP XML configuration file from the configuration review of the first array, and, if automatic FTP forwarding has been enabled, all output zip files are uploaded to the target configured for FTP forwarding.

Tools->Service Workbench->Monitor Storage Processor - The Monitor Storage Processor wizard automates Knowledge Base article 111000, “SP Boot or Unmanaged Troubleshooting Guide”. Use this wizard to evaluate Storage Processor state.

Select a profile or specify one or more IP addresses to identify the array to be monitored. Pick OK to begin.

The monitor will continue to query the state of the SPs until the Cancel button is picked. The following state is tracked:

1. Storage Processor accessibility, via PING
2. Storage Processors reporting valid agent info
3. Storage Processors reporting active ports
4. Storage Processors reporting active LUN assignments

Pick the “Detailed” button to display a more detailed report on Storage Processor state.
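As a sketch of how the four tracked checks build on one another — the ping helper and the agent/port/LUN data sources below are assumptions for illustration, not the wizard's actual implementation:

```python
import platform
import subprocess

def sp_reachable(ip: str) -> bool:
    """State 1: Storage Processor accessibility, via PING."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return subprocess.run(["ping", count_flag, "1", ip],
                          capture_output=True).returncode == 0

def evaluate_sp_state(pingable, agent_info, active_ports, lun_assignments):
    """Roll the four tracked items into one per-SP status dict.
    agent_info/active_ports/lun_assignments stand in for whatever
    Navisphere queries the caller uses; a later check only passes
    if the earlier ones did."""
    ok = {"ping": bool(pingable)}
    ok["agent"] = ok["ping"] and bool(agent_info)
    ok["ports"] = ok["agent"] and bool(active_ports)
    ok["luns"] = ok["agent"] and bool(lun_assignments)
    return ok
```

The ordering reflects the troubleshooting flow: an SP that does not answer a ping cannot report agent info, and without agent info the port and LUN states are unknowable.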


Tools->Service Workbench->Apply Default Sniffer values - Select Tools “Service Workbench” “Apply Default Sniffer values” to apply the proper LUN sniffer configuration on an array.

Tools->Service Workbench->CX3 Conversion Readiness Check - The CX3 Conversion Readiness Check Wizard analyzes a CAP report to determine if user LUNs, on the first 5 drives, need to be moved before an array can be converted. This wizard generates results on the Analysis report, similar to the following:

CX3 Conversion Readiness Check (CX3CRC) Report
------------------------------------------------------------------------------------
Migration Summary: No lun needs to be moved

Summary of Luns/Metaluns which need to be migrated:
  Count      Description   Capacity (MB)
  0          LUN           0
  0          Metalun       0
  0(total)                 0

Summary of free disk space (excluding the system disks):
  Disk Count  Description       Disk Size (MB)
  12          Unbound Disk      2646342
  0           Empty RAID Group  0
  12(total)                     2646342(total)

Summary of free available RAID group capacity:
  Description             Capacity (MB)
  Inactive Hot Spare      1825014
  System RAID groups      0
  Non-System RAID groups  537816
                          2362830(total)

Migration Table
  Lun(s) which need to be migrated: None
  Metalun(s) which need to be migrated: None

Free Space Table
  Available Disk(s) (excluding the system disks):
  Bus  Enc  Slot  Drive#  Disk Size(MB)   State         RAID Group ID  RAID Type  Disk Type
  0    0    5     5       136888          Unbound Disk  N/A            N/A        Unknown
  0    0    6     6       136888          Unbound Disk  N/A            N/A        Unknown
  0    0    7     7       136888          Unbound Disk  N/A            N/A        Unknown
  0    0    8     8       136888          Unbound Disk  N/A            N/A        Unknown
  0    0    9     9       136888          Unbound Disk  N/A            N/A        Unknown
  0    0    10    10      136888          Unbound Disk  N/A            N/A        Unknown
  1    1    11    146     304169          Unbound Disk  N/A            N/A        Unknown
  2    0    0     240     304169          Unbound Disk  N/A            N/A        Unknown
  2    0    1     241     304169          Unbound Disk  N/A            N/A        Unknown
  2    0    2     242     304169          Unbound Disk  N/A            N/A        Unknown
  2    0    3     243     304169          Unbound Disk  N/A            N/A        Unknown
  2    0    4     244     304169          Unbound Disk  N/A            N/A        Unknown
                          2646342(total)

  Free capacity in System RAID Group(s): None

  Free capacity in Non-System RAID Group(s):
  RAID Type  Index  Free Capacity (MB)  Free Contiguous Capacity (MB)  Current Luns
  RAID5      3      107556              107556                         144 145 146 147 148 149 150...
  RAID5      4      36                  36                             213 214 215 216 217 218 219...
  RAID5      5      107556              107556                         282 283 284 285 286 287 288...
  RAID5      1      107556              107556                         6 7 8 9 10 11 12 13 14 15 1...
  RAID5      2      107556              107556                         75 76 77 78 79 80 81 82 83 ...
  RAID5      6      107556              107556                         351 352 353 354 355 356 357...
                    537816(total)       537816(total)


Tools->Corporate Repository - Access to the Corporate Repository is available under Tools->Corporate Repository. The corporate repository is configured via the Tools->Settings->File Forwarding dialog, under Corporate Repository. By default, CAP is configured to connect to the EMC CLARiiON Repository maintained by the EMC SYR/IT organization. Alternatively, CAP can be configured to use an FTP server as a corporate repository.

Tools->Corporate Repository->SYR Upload - Service Personnel travel between customer sites while servicing customer arrays, using CAP to review the state of the arrays. CAP tracks the last 100 created XML configuration files in the SYR upload queue. The SYR upload queue simplifies managing and uploading XML configuration files to the SYR CLARiiON repository. Select Tools Corporate Repository “SYR Upload” to launch the SYR upload wizard, which displays the following dialog.

Select those files which should be uploaded to SYR. Pick the “OK” button to continue. Enter account/password. Pick the “OK” button to continue. After user authorization, the selected CAP xml files, along with the source SPCollect zip files are uploaded to SYR.

Tools->Corporate Repository->SYR Download - Select Tools “Corporate Repository” “SYR Download” to launch the SYR Download wizard. This wizard can be used to download files from SYR. This dialog contains:

• a drop down list/input field, at the bottom of the dialog, used to specify an array serial number • a table that will display files, associated with the specified serial number, contained in SYR

Selecting a file in the table enables the Download button. Selecting the Download button causes that file to be downloaded to the local workstation, to c:\capData\ArraySerialNumber. After downloading completes, CAP XML files are loaded into CAP and the Issues Report is displayed.


Tools->Corporate Repository->Upload Current Configuration - The Tools “Corporate Repository” “Upload Current Configuration” button is available whenever a CAP XML file is being displayed. Select “Upload Current Configuration” to directly upload that configuration file to the corporate repository.

Tools->Launch - The Tools Launch menu provides access to a number of useful tools for analyzing SPCollect data, XML configuration files and other activities.

Tools->Launch->Splat - Select Tools Launch Splat to launch SPLAT, the SP Log Analysis Tool. SPLAT will load the merged navicli getlogs associated with the CAP XML files. See the SPLAT user's guide for further documentation on using SPLAT.


Tools->Launch->Storage System Log Assessment - Select Tools Launch “Storage System Log Assessment” to launch the Log Assessment wizard. This wizard provides access to the triage/Log Assessment analysis code. Select from the set of checkboxes and dialogs to configure the assessment. Pick “Apply” to generate a report.

Tools->Profile - Select Tools->Profile to launch the Profile wizard. This wizard can be used to create or execute a capture profile. The Profile wizard is a tabbed dialog with the following tabs:

• General Configuration • Navisphere Configuration • Storage System Capture Configuration • Host Capture Configuration • Switch Capture Configuration

General Configuration - Use the general configuration tab to specify:

• The root Output Directory for data generated by CAP Wizards. • The IP addresses for the Storage Processors • The model type


Navisphere Configuration - Use the Navisphere configuration tab to specify: • The communication mode to interact with Storage System. • Security credentials

Storage System Configuration - Use the Storage System configuration tab to configure the abort timeouts.


Host Configuration - Use the Host Configuration tab to configure the host capture process. Check Collect Host Data to enable host capture. Use the text input field and Add button to add IP addresses to the Hosts to Query list. Use “>>” to move IP addresses to the Exclude Hosts list.

Switch Configuration - Use the Switch Configuration tab to configure the switch capture process. Check Collect Switch Data to enable switch capture. Use the text input field and Add button to add IP addresses to the Switches to Query list.

Tools->Settings - Select Tools Settings to enter the Settings dialog. This dialog is used to configure global settings that control the behavior of the Service Workbench.

Tools->Settings->General - The General Settings tab is used to specify the default Output Path and default NaviCLI executable Path. The Output Path specifies the root directory for storing files created during normal operation. CAP stores files by creating serial number sub-directories.


The naviCLI and Secure naviCLI paths specify the default executables used to interact with CLARiiON arrays.

Tools->Settings->File Forwarding - Select the “File Forwarding” tab to access CAP file forwarding configuration. Service Personnel working at Customer sites should use the File Forwarding tab to configure the workbench to forward files to the appropriate destination per standard service agreements.

This tab contains three controls:

• Destination List: The destination list is pre-configured to include the EMC/CLARiiON sites CLARiiON_FTP, EMC_FTP_INCOMING. Use the Add, Modify or Delete buttons to configure additional FTP sites that CAP can access.

• Set Default actions: The workbench can be configured to use a default Destination and default directory. Set this default in Set Default actions. Note: SYR manages files by array serial number. FTP sites require a sub-directory into which files are copied. By convention, directory names are expected to be the Clarify case number or PAS engagement number associated with the service activity.

• Corporate Repository: The EMC/CLARiiON repository is SYR. For those Service Partners that do not have access to SYR, use this control to specify an alternate FTP site/destination.


Tools->Settings->File Monitoring - Select the “File Monitoring” tab to access CAP monitoring capabilities. Home office support Personnel should use the File Monitoring tab to configure the workbench to monitor directories on particular FTP sites and download all files found to the local workstation. By default, local files are copied to c:\capData\SourceDirectoryName. Note: By convention, directory names are expected to be the Clarify case number or PAS engagement number associated with the service activity.

Select Add, Delete or Modify to manipulate the list of FTP site/directories that the workbench will monitor. Select an FTP site/directory and pick the Activities button to display the log of monitor activity. This log tracks when files were transferred to the local workstation. Select “Display monitors on Screen” to enable the Monitor panel between the CAP display pane and the CAP status pane. All monitored FTP directories are displayed in the Monitor panel. As files are copied to the local workstation, these buttons blink to indicate that new file content is available for review.

Configuration Reports - CAP provides access to 21 reports that display different views of the configuration data contained in the XML file. These reports include: Issues; SP Info; LUN Info; RAID Groups; Raid Group Layout; CRU Info; Drive Modules; NDU Software; Metaluns; SAN Copy; Snap Clones; Snap Sessions; Snap Views; Async Mirrors; Mirrors; Switches; Hosts; Storage Groups; Analysis; View All; HA Host.


DRU

Introduction - The Disk Replacement Utility (DRU) is a tool provided as part of the Navisphere Service Taskbar (NST). DRU will be run when an end-user has identified one or more disk issues on their array. Its purpose is to analyze the array and determine whether the end user should attempt to replace the single failed disk themselves, or whether EMC Service needs to get involved. The utility will base its recommendations on a set of criteria that have been defined by EMC engineering and service organizations. If the state of the array has been deemed appropriate for an end user disk replacement, then it will provide aid in walking the user through the actual disk replacement action. The currently supported arrays are any CX Series array running FLARE R14 PatchLevel 16, R16, R17, R19 and any CX-3 Series array running R22 and higher.

Operation - The Disk Replacement Utility guides the customer through the process of replacing a failed drive. To accomplish this, it performs the following steps:

- Checks that the array is a candidate for the utility
- Identifies the drive that has failed
- Informs the customer of the location of the failed drive and the part number to order as a replacement
- If the customer has a replacement drive:
  • Graphically shows the location of the failed drive
  • Instructs the customer of the steps to replace the drive
  • Detects when the drive has been inserted
  • Automatically starts the rebuild/equalize process
  • Indicates when the rebuild is complete

Health Analysis - DRU performs a disk health analysis to identify the failed drive and to determine if replacement will be allowed. The disk health analysis consists of two components:

• A real time analysis of all installed disks and their current states
• An analysis of the array logs to determine if other issues may be present that would affect a customer's ability to successfully replace a disk drive. These issues may include, but are not limited to:
  o backend loop instability
  o sporadic drive fallout

The intention is to base the disk replacement recommendation on a combination of real time data and historical analysis.

Analysis Results - Results of the analysis will indicate to the user one of the following three easy-to-understand scenarios:

• No disk issues detected – the user can only exit the application as there are no other operations to perform.
• Issues detected on the array make it necessary for EMC Service to be contacted – multiple faults and/or log entries have removed this array from being a candidate for end user disk replacement.
• The only issue detected on the array is a single disk failure – the array is a candidate for end user disk replacement.

In the last case, the user will now have the option of continuing with the disk replacement operation. If they have the correct disk, then the user can choose to proceed with the disk replacement. If they need to order the disk, then all of the needed information is available to them. If they cannot or choose not to perform the physical disk replacement at this time, exiting the utility will cause the current state information to be written to the workstation that the utility was run from. This information will be used the next time the utility is run (presumably when the user is ready to perform the disk replacement) to validate the configuration before allowing the user to continue.


Disk Replacement - If the array has been deemed a candidate for user disk replacement, then the utility will present a replacement context that will aid the user in locating the failed drive. At this point, the utility will begin monitoring the array to assure that the proper drive is replaced and that it rebuilds properly. If the user is restarting the utility after a user disk replacement condition had been detected, then the utility will still do a full scan and assessment of the targeted array. After the assessment has been completed, the results will be checked against the information that is stored in the external file (noted in the previous section) to determine if the array is still in the same condition that it was in the last time the analysis was done. If any deviations in the disk health are noted (including the original faulted disk becoming “healthy”), then the user will be notified about the changes. If the system is found to be healthy, then the user will not be given the option to continue with the disk replacement. If multiple errors are found, they will be informed to contact EMC Service. If a different disk is found to be faulted, while the original disk is now healthy, then the user will be allowed to continue with the disk replacement (assuming that they have the appropriate replacement drive on hand).

Identifying the Faulted Disk - Navisphere has the ability to aid the user in physically locating the faulted disk using two methods:

• Flashing the LEDs on the affected enclosure
• Pictorially displaying the position of the disk within the enclosure

Once the user has chosen to continue with the disk replacement operation, they will be notified that the LEDs on the affected enclosure will start flashing. The utility will keep the lights of the enclosure flashing until it is determined that the drive has been replaced, or a certain timeout has occurred (discussed later). Because all CLARiiON disk drives are inserted in a numeric sequence starting at slot 0 (the far left hand side) of an enclosure, the utility displays the position of the affected disk relative to the left hand side. A simple display of a line of healthy disks (utilizing the existing Manager “good” disk icon) up to the position of the faulted disk (utilizing the existing Manager “faulted” disk icon) provides a sufficient visual to allow the user to locate the targeted disk.

Polling the Array for State Changes - Along with prepping the user for the actual disk replacement, the utility will now begin to actively query the array for disk state changes, anticipating that the replacement disk will be inserted in place of the faulted one. This means that the utility will create a tight poll loop (every 10 seconds) where the subsystem is hard polled to force instance refreshes, and then the disks are queried to determine state changes. This is a time-sensitive activity that could potentially be intrusive to the performance of the array, so it cannot be allowed to go on indefinitely. When the polling starts, a timer is set; if no disk state activity is noted after 10 minutes, then the user is asked whether they want to continue with the disk replacement or not. If the answer is “Yes”, then the 10 minute timer is reset. If the answer is “No”, then the user is advised to rerun the utility at a time when they can perform the disk replacement; the polling is ended, and the LED flashing is stopped. If there is no answer to the query within 2 minutes, then the utility will automatically treat the case as if the user has answered “No”. The normal operation would be that the user performs the physical disk replacement within the allotted 10 minutes. The state change of a disk in the array will be noted by the poll loop, and the utility will act accordingly. “Accordingly” is defined by the handling of two different scenarios:

• User pulls the wrong disk/another disk becomes faulted
• User pulls and replaces the correct disk

If the user pulls the wrong disk, or if another disk changes state before the faulted disk is replaced, then upon realizing this the utility will notify the user (in a message box) about the change, and advise them to abort the disk replacement and call EMC Service. This is consistent with our overall policy of not allowing end user disk replacement if more than one faulted drive is detected on the system. If the correct disk is replaced, then the state of that disk will change to “rebuilding”. Once this is noted by the utility, a special rebuilding context will be shown to the user that will reflect the rebuilding nature of the disk.
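The poll-and-timeout behavior described above can be sketched as follows. This is a minimal sketch, not DRU's actual implementation: the snapshot and prompt callables are hypothetical stand-ins for the utility's hard-poll and dialog code.

```python
import time

POLL_INTERVAL_S = 10          # tight poll loop: hard poll every 10 seconds
INACTIVITY_LIMIT_S = 10 * 60  # ask the user after 10 minutes with no change

def poll_for_replacement(get_disk_states, ask_continue,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until a disk state change is seen.

    get_disk_states() returns a snapshot of per-disk states;
    ask_continue() prompts the user and returns True, False, or None
    (None = no answer within the 2-minute prompt timeout, treated as "No").
    Returns the changed snapshot, or None if the user gave up.
    """
    baseline = get_disk_states()
    deadline = clock() + INACTIVITY_LIMIT_S
    while True:
        current = get_disk_states()
        if current != baseline:
            return current  # caller decides: wrong disk pulled vs. rebuilding
        if clock() >= deadline:
            answer = ask_continue()
            if not answer:  # "No" or no answer in time: stop polling and LEDs
                return None
            deadline = clock() + INACTIVITY_LIMIT_S  # "Yes": reset the timer
        sleep(POLL_INTERVAL_S)
```

Injecting `clock` and `sleep` keeps the timing logic testable without waiting out the real 10-minute window.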


Tracking the Rebuild of the Disk - The utility has two requirements with regard to displaying information about a rebuilding disk:

• Present information that reflects the percent of the disk that has been rebuilt
• Present information about the amount of time that is remaining before the rebuild completes

Unfortunately, this type of information is not tracked by array software, so it isn’t readily available to Navisphere. But it can be extrapolated using existing information.

Percent Rebuilt Calculations - While a disk knows that it is being rebuilt, it cannot determine the progress of the rebuild because in reality it is the LUNs on the disk that are being rebuilt, not the disk itself. Individual LUNs do know how far along their rebuild process is, and LUN rebuilds happen sequentially, so the second LUN on the disk will not start rebuilding until the first LUN is completed. So mathematically, the overall percent of the rebuild that is completed can be determined by adding all of the PercentRebuilt values of the LUNs in the RAID Group that the disk is a part of, and then dividing by the number of LUNs. For example, if there are 6 LUNs in the RAID Group that contained the replaced disk, and the 3rd LUN is 33% rebuilt (meaning that the first 2 LUNs have been rebuilt), then the overall progress of the rebuild is:

(100 + 100 + 33 + 0 + 0 + 0) / 6 = 233 / 6 = 39 (rounded)

So the disk is 39% rebuilt.
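The calculation above can be expressed as a short sketch:

```python
def percent_rebuilt(lun_percentages):
    """Overall rebuild progress for the RAID Group: LUNs rebuild
    sequentially, so the mean of the per-LUN PercentRebuilt values
    gives the disk's overall progress."""
    if not lun_percentages:
        return 0
    return round(sum(lun_percentages) / len(lun_percentages))

# The 6 LUN example: two LUNs done, the third 33% along.
print(percent_rebuilt([100, 100, 33, 0, 0, 0]))  # → 39
```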

Time to Complete Calculation - Predicting when an operation that is dependent on factors outside of our control will complete will never be an exact science. The array is geared to use some amount of its resources for the task of rebuilding the affected LUNs, but those resources may ebb and flow during the course of the rebuild. The best that we can hope for is to provide an educated guess based on current data, and that data will come from the same EV_LUN pool that we used to determine the progress of the rebuild. When a LUN is bound, the user has the option of setting its “Rebuild Priority”. Navisphere displays the choices for this property as “ASAP”, “High”, “Medium”, or “Low”, but those actually translate into “hours to rebuild” numbers of 0, 6, 12, and 18, respectively. So per EV_LUN, you can look at its “RebuildTime” property and determine how much time the user wants to allow the system in rebuilding this LUN. For the case of a disk rebuild, the total amount of time that will be allocated would be all of the RebuildTimes of the LUNs in the disk's RAID Group added together. So for the 6 LUN example cited above, let's say that the first 2 LUNs' RebuildTime is ASAP (0), the next 2 are Medium (12), and the last 2 are Low (18). So, on paper, the cumulative rebuild time for the disk that houses these LUNs is:

0 + 0 + 12 + 12 + 18 + 18 = 60 hours

But no LUN takes 0 hours to rebuild, so let’s assume ASAP actually means 1 hour

1 + 1 + 12 + 12 + 18 + 18 = 62 hours.
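The starting estimate can be sketched using the priority-to-hours mapping above, substituting 1 hour for ASAP as the text suggests:

```python
# Navisphere rebuild-priority labels and their "hours to rebuild" values.
REBUILD_HOURS = {"ASAP": 0, "High": 6, "Medium": 12, "Low": 18}

def initial_estimate_hours(priorities):
    """Sum the per-LUN RebuildTime values; no LUN rebuilds in zero
    time, so ASAP (0) is counted as 1 hour, as in the example above."""
    return sum(max(REBUILD_HOURS[p], 1) for p in priorities)

# The 6 LUN example: two ASAP, two Medium, two Low.
print(initial_estimate_hours(["ASAP", "ASAP", "Medium", "Medium", "Low", "Low"]))  # → 62
```

With all six LUNs at the default ASAP priority, the same function yields the 6-hour estimate discussed at the end of this section.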

Since LUNs rebuild in sequence, this is a valid starting point for our time to complete estimate. But now we want some checkpoints to determine if the actual progress of the rebuild warrants a change to the time to complete estimate. This can be done by using the following information:

• A timestamp that is noted when the rebuild is started
• On an interval (not every poll cycle, maybe every 5 minutes), note the “Percent Rebuilt” value that has been calculated

When the interval is hit, the current time should be noted to figure out how much time has elapsed since the rebuild has started. Then some math can be done to extrapolate, based on the progress, how much more time is needed to complete the job.


So, again using the 6 LUN scenario above, let’s assume that the code has found the following progress during the rebuild:

Minutes  Percent  Extrapolated total time
5        3%       5 × 33.33 (the factor needed to get to 100%) = 166.65 minutes
10       5%       10 × 20 = 200 minutes
15       8%       15 × 12.5 = 187.5 minutes
20       10%      20 × 10 = 200 minutes
25       13%      25 × 7.7 = 192.3 minutes
30       15%      30 × 6.66 = 200 minutes

And so forth and so on. In order to calculate the time to completion accurately, the actual elapsed time has to be subtracted from the overall time to completion, so the advertised time to completion for this example would actually look like this:

5:  166.65 minutes – 5 = 161.65 minutes (2.7 hours)
10: 200 minutes – 10 = 190 minutes (3.2 hours)
15: 187.5 minutes – 15 = 172.5 minutes (2.875 hours)
And so forth and so on.

You will notice that the estimates being calculated are much less than the original estimate of 62 hours. This is because of the mixed nature of the rebuild priority in this example, which means that when the LUNs that have a lower rebuild priority start rebuilding, the progress will be much slower and the time to complete will start going up accordingly. If all of the LUNs in this RAID Group had their rebuild priority set to ASAP (the default), then the originally estimated time to completion would have been 6 hours, which is a much closer approximation to the actual time that is being represented during the calculations.
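The checkpoint extrapolation can be sketched as follows; small differences versus the hand-worked figures (161.67 vs. 161.65) come from using the exact 100/3 factor rather than the rounded 33.33:

```python
def remaining_minutes(elapsed_min, percent_done):
    """Extrapolate time-to-complete from a checkpoint: scale the
    elapsed time by the factor needed to reach 100%, then subtract
    the time already spent."""
    if percent_done <= 0:
        return None  # no progress yet; nothing to extrapolate from
    total = elapsed_min * (100.0 / percent_done)
    return total - elapsed_min

# Checkpoints from the worked example above (minutes elapsed, percent rebuilt).
for t, pct in [(5, 3), (10, 5), (15, 8)]:
    print(f"{t}: {remaining_minutes(t, pct):.2f} minutes remaining")
```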


TRiiAGE

Introduction - The TRiiAGE analysis report is an amalgamation of information obtained by scanning the reports retrieved from the array as part of the SPCollect process. This document attempts to provide some hints and helpful information for reading and using the TRiiAGE analysis report. The TRiiAGE analysis report file names are TRiiAGE_Analysis.txt and TRiiAGE_full_Analysis.txt, depending on whether full output was requested. The term “TRiiAGELogs” is used throughout the document to mean the electronic merge, by date and time, of the SPA_navi_getlog.txt and SPB_navi_getlog.txt files. The physical file names for the TRiiAGELogs are TRiiAGE_SPlogs.txt and TRiiAGE_full_SPlogs.txt, depending on whether the -f(ull) switch was specified when triage was run. The TRiiAGELogs are also known as Mergedlogs for historical reasons.

Running TRiiAGE - Running TRiiAGE is straightforward if the location of the triage executable files is in the search path. Simply set the directory to the area where the *.zip files are located and issue the triage command. The following table describes the possible arguments that direct the operation of triage.

Argument   Description                                                              On by Default
-h, -help  Obtain help                                                              No
-c         Generate a CLARiiON Array Properties (CAP) report. CAP 2.03 or
           higher is required to be installed on the system.                        No
-f         Include all available log information in the TRiiAGE*SPlogs.txt file.    No
-q         Include 30 days of log information in the TRiiAGE*SPlogs.txt file.       Yes
-p         Include enclosure related resume prom information in TRiiAGE report.     No
-l         Include layered application and related information in TRiiAGE report.   No

Array Overview Information - The array overview section contains general information about the array, similar to that which is found under the SP tab in the CAP report. Unlike CAP, both ATA (Klondike) and Stiletto enclosures are called out for easy reference along with total enclosure count(s). Faulted enclosures and the system fault light provide a broad overview of the health of the system.

Tip: Who ran TRiiAGE?
The login name of the individual that ran TRiiAGE is located just below the total drive count in this section. You may wish to process the SPCollects yourself, knowing that you are executing the latest version. In the example below, stonec was the individual that ran TRiiAGE against these SPCollects.

Tip: Using the serial number
The array serial number can be used to search for other DIMs that may have been opened against this array. Simply copy the serial number into the “Details” window of a DIMs Incident form and press search.

Tip: Stiletto enclosures on the same bus as non-Stilettos
The Stiletto normally operates in point-to-point mode (each disk is isolated to its own circuit) unless non-Stiletto enclosures are on the bus. In this case, the Stiletto operates in loop mode just as the Katana does today. Stiletto enclosures are also known as 2/4-Gigabit Point-To-Point Disk Array Enclosures (DAE2P, DAE4P).

Stiletto Power Topology:
PS1 – LCCA, Disks 2, 3, 6, 7, 10, 11, 14
PS2 – LCCB, Disks 0, 1, 4, 5, 8, 9, 12, 13

Katana Power Topology:
PS1 – LCCA, Disks 2 through 9
PS2 – LCCB, Disks 0, 1 and 10 through 14

Disks that fail on power boundaries can indicate a problem with one or both power supplies, or possibly with one or more disk drives within those power boundaries. If disks on a power boundary go away and then come back, the problem most likely rests with a disk drive. In the best circumstances, the problematic drive will not come online and can easily be identified and replaced.
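The power boundaries above can be checked mechanically. This sketch encodes the slot-to-power-supply mapping quoted in the tip; `power_boundary` is a hypothetical helper for deciding whether a set of failed slots falls entirely within one supply's boundary.

```python
# Slot-to-power-supply maps, taken from the topology listed above.
STILETTO = {
    "PS1": [2, 3, 6, 7, 10, 11, 14],
    "PS2": [0, 1, 4, 5, 8, 9, 12, 13],
}
KATANA = {
    "PS1": list(range(2, 10)),            # disks 2 through 9
    "PS2": [0, 1] + list(range(10, 15)),  # disks 0, 1 and 10 through 14
}

def power_boundary(enclosure, failed_disks):
    """Return the power supplies whose boundaries cover the failed slots.
    A single PS covering every failed disk suggests a power problem;
    failures spanning both supplies point elsewhere."""
    return sorted({ps for ps, slots in enclosure.items()
                   for d in failed_disks if d in slots})

print(power_boundary(STILETTO, [2, 6, 14]))  # all on PS1
```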

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 137

Page 139: 58348378 CL Troubleshooting 2ndEdition B03

EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

Example of Array Overview:

Array Name: <host name>
Serial No: APM000548xxxxx
Time on SP A: 01/04/06 21:02:17            Time on SP B: 01/04/06 21:02:17
Revision: 2.19.700.5.016                   Revision: 2.19.700.5.016
Serial Number For The SP: LKE00054201990   Serial Number For The SP: LKE00054201992
Storage Processor IP Address: 144.xxx.xxx.xxx   Storage Processor IP Address: 144.xxx.xxx.xxx
System Fault LED: OFF
WRITE CACHE: ENABLED  READ CACHE: ENABLED  WRITE CACHE: ENABLED  READ CACHE: ENABLED
Total Number of Disk Enclosures: 14
-----DAE2-4P Stiletto ENC list:
DAE2P Bus 0 Enclosure 0   DAE2P Bus 1 Enclosure 0   DAE2P Bus 2 Enclosure 0   DAE2P Bus 3 Enclosure 0
DAE2P Bus 0 Enclosure 1   DAE2P Bus 1 Enclosure 1   DAE2P Bus 2 Enclosure 1   DAE2P Bus 3 Enclosure 1
DAE2P Bus 0 Enclosure 2   DAE2P Bus 1 Enclosure 2   DAE2P Bus 2 Enclosure 2
-------------------------------------------------------------
-----ATA ENC list:
DAE2-ATA Bus 3 Enclosure 2   DAE2-ATA Bus 0 Enclosure 3   DAE2-ATA Bus 1 Enclosure 3
-------------------------------------------------------------
Total disks reported by SPA: 210
Total disks reported by SPB: 210

CRU Replacement Information

The CRU replacement section was added to help determine what array components have been replaced. With the advent of R19 and beyond, CRU serial numbers are available in the log messages. Unfortunately, some of the serial numbers can be truncated in the 7127897c messages. If the serial numbers for a given line are the same, then more investigation as to replacement may be necessary. Also, care must be taken when reviewing logs for drive replacements, because a serial number change may indicate a hot spare swap. This precaution is called out in the listing below.

Example:

01/04/06 13:27:30 Log starts for SPA.
01/04/06 13:27:37 Log starts for SPB.
08/30/06 11:16:34 Bus0 Enc0 Dsk2 remove ECVR1NGC add 3KS3RDRD
08/30/06 12:14:38 Bus0 Enc0 Dsk2 remove 3KS3RDRD add ECVR1NGC potential hotspare swap
08/30/06 12:15:44 Bus0 Enc0 Dsk2 remove ECVR1NGC add 3KS3RDRD potential hotspare swap
08/30/06 12:15:53 Bus0 Enc0 Dsk2 remove 3KS3RDRD add ECVR1NGC potential hotspare swap
08/30/06 12:16:59 Bus0 Enc0 Dsk2 remove ECVR1NGC add 3KS3RDRD potential hotspare swap
05/10/06 04:41:52 Bus0 Enc0 Dsk3 remove ECVNS8EC add 3KS1PGBG
05/10/06 04:42:55 Bus0 Enc0 Dsk3 remove 3KS1PGBG add ECVNS8EC potential hotspare swap
05/10/06 04:44:07 Bus0 Enc0 Dsk3 remove ECVNS8EC add 3KS1PGBG potential hotspare swap
08/29/06 11:15:42 Bus0 Enc0 Dsk7 remove 3HY8Y7QM add 3KS3QH3R
07/04/06 11:05:25 Bus0 Enc0 LccA remove 42800802 add 53600467
07/04/06 12:28:02 Bus0 Enc0 LccA remove 53600467 add 42800802 potential hotspare swap
07/06/06 11:35:30 Bus0 Enc0 LccA remove 42800802 add 53600132
07/06/06 12:09:13 Bus0 Enc0 LccA remove 53600132 add 52900519
05/11/06 10:48:24 Bus0 Enc0 LccB remove 42800653 add 52900542
03/06/06 02:21:17 Bus0 Enc1 Dsk4 remove 3HY864VL add 3KS1R1A6
02/28/06 01:25:47 Bus0 Enc1 Dsk6 remove ECVLLRKD add 3KS161T5
05/11/06 10:57:57 Bus0 Enc1 Dsk9 remove ECVH7DBC add 3KS1WEGK
07/04/06 11:05:45 Bus0 Enc1 LccA remove 53202214 add 53201501
07/04/06 12:28:26 Bus0 Enc1 LccA remove 53201501 add 53202214 potential hotspare swap
07/06/06 12:09:35 Bus0 Enc1 LccA remove 53202214 add 53201501 potential hotspare swap
05/11/06 10:48:52 Bus0 Enc1 LccB remove 53201009 add 55011977
08/09/06 08:12:27 Bus3 Enc0 Dsk6 remove ECWGD6PC add 3KS3FHJP
08/09/06 08:12:32 Bus3 Enc0 Dsk6 remove 3KS3FHJP add ECWGD6PC potential hotspare swap
08/09/06 08:13:14 Bus3 Enc0 Dsk6 remove ECWGD6PC add 3KS3FHJP potential hotspare swap
03/06/06 02:21:50 Bus3 Enc0 Dsk7 remove 3HY1A9R2 add 3KS1W3C7
08/30/06 11:16:05 SPA remove 12120f add 16193a
02/21/06 06:47:39 SPB remove 1212a8 add 169678
09/08/06 13:31:55 Log ends for SPB.
09/08/06 13:31:57 Log ends for SPA.
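The “potential hotspare swap” annotation in listings like the one above can be reproduced with a simple heuristic: flag a remove/add line when the serial number being added has already been seen in that slot. This is an illustrative sketch of that heuristic, not the actual TRiiAGE code; `flag_hotspare_swaps` is a hypothetical helper name.

```python
def flag_hotspare_swaps(events):
    """events: (timestamp, slot, removed_sn, added_sn) tuples in log order.
    A line is flagged when the added serial has already been observed in
    that slot, i.e. a drive swapping back and forth with a hot spare."""
    seen = {}  # slot -> set of serial numbers observed in that slot so far
    out = []
    for ts, slot, removed, added in events:
        history = seen.setdefault(slot, set())
        flag = " potential hotspare swap" if added in history else ""
        history.update({removed, added})
        out.append(f"{ts} {slot} remove {removed} add {added}{flag}")
    return out

events = [
    ("08/30/06 11:16:34", "Bus0 Enc0 Dsk2", "ECVR1NGC", "3KS3RDRD"),
    ("08/30/06 12:14:38", "Bus0 Enc0 Dsk2", "3KS3RDRD", "ECVR1NGC"),
    ("08/30/06 12:15:44", "Bus0 Enc0 Dsk2", "ECVR1NGC", "3KS3RDRD"),
]
for line in flag_hotspare_swaps(events):
    print(line)
```

The first appearance of a new serial in a slot is a candidate replacement; every re-appearance of a previously seen serial gets the swap annotation, matching the pattern in the listing.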


NAVI and NDU Information

All Navisphere and NDU (Non-Disruptive Upgrade) information pulled from the log file is listed in this section.

NDU Audit Log Cleared

This event log message, called out by the TRiiAGE_Analysis.txt file, indicates that the event logs were cleared manually from the GUI. These messages are used to keep the CE and/or customer honest.

Example:
The audit log was cleared
-------------------------------------
B 12/29/03 05:54:22 Security 205 The audit log was cleared

Navi Logging Stopped

The Navisphere agent is experiencing some type of problem accessing the Windows NT event log during a poll cycle. This may occur for the following reasons:

1. The Windows event log files are corrupt and not readable.
2. The Windows event log files were completely overrun by new events since the last poll. Navisphere uses the last event from the previous poll cycle as a marker for the next poll, in order to determine what messages need to be obtained from the Windows event log and stored in the Navi getlog. If this marker message is missing (because of the overrun), Navisphere does not know what messages to get from the file.
3. A user set the time backward on the SP. The last event in the Navi getlog will now be considered to be some time in the future. Subsequently, no new messages will be gathered from the Windows NT event logs and stored in the Navi getlog.

Example message indicating trouble:
0x2086 Error Unable to read events from the Windows log. I/O from connected servers to the storage system will not be interrupted. Create a diagnostic file and call your service provider to correct the events display.

Navi Agent Problems

Messages indicating resource limitations that affect the operation of the Navisphere Agent are printed in this section. These limitations may be due to hardware restrictions or faults, inappropriate consumption of resources by some array application, or a problem with the Navisphere Agent itself.

Example:
A 04/05/06 01:50:42 NaviAgent 1 Agent exceeded virtual memory threshold.
A 04/05/06 01:50:42 NaviAgent 1 Agent exceeded resource threshold. Agent will terminate: 208148
B 04/11/06 15:38:51 NaviAgent 1 Agent exceeded virtual memory threshold.
B 04/11/06 15:38:52 NaviAgent 1 Agent exceeded resource threshold. Agent will terminate: 208148

DBPrep Activities

This section is useful when the customer is upgrading the array from R11 to R12. If updating a running Release 11 array to Release 12, you must load the DBPrep.ndu package before updating the array to Release 12. See Primus case emc80499 for installation precautions. If DBPrep is not run before upgrading from Release 11 to Release 12, CRU Signature errors will result.

NDU Deactivates

The NDU procedure follows three general phases: Deactivate, Install, and Commit. During the Deactivate phase, the previous version of Flare running on the array is deactivated. These messages are an indication that an actual NDU took place.

Example: Deactivates ---------------------------------------------------
A 02/19/05 17:57:32 NDU 71510000 Info:Deactivate ndu-EMC-Base-02066005.017 returned 0 @K10NDUAdminManage.cpp:354
A 02/19/05 18:15:54 NDU 71510000 Info:Deactivate ndu-EMC-Base-02066005.017 returned 0 @K10NDUAdminManage.cpp:354
A 05/13/05 17:55:49 NDU 71510000 Info:Deactivate ndu-EMC-Base-02076005.016 returned 0 @K10NDUAdminManage.cpp:382
A 05/13/05 18:16:34 NDU 71510000 Info:Deactivate ndu-EMC-Base-02076005.016 returned 0 @K10NDUAdminManage.cpp:382


NDU Installs

During this phase of the NDU process, a new base revision is installed on the array.

Example: Installs ------------------------------------------------------
A 02/19/05 18:16:44 NDU 71510000 Info:Successfully installed BundleIndex 02076005.016 @K10NDUAdminAsync.cpp:340
A 02/19/05 18:16:44 NDU 71510000 Info:Successfully installed Base 02076005.016 @ K10NDUAdminAsync.cpp:340
A 05/13/05 18:17:07 NDU 71510000 Info:Successfully installed Base 02166005.012 @ K10NDUAdminAsync.cpp:340
A 05/13/05 18:17:07 NDU 71510000 Info:Successfully installed BundleIndex 02166005.012 @K10NDUAdminAsync.cpp:340

Commits

This is the last phase of the NDU process. After committing the installed NDU packages, it is not possible to revert to a previous revision. No new LUNs can be bound until the outstanding NDU is committed.

Example: Commits -------------------------------------------------------
A 02/19/05 18:51:45 NDU 71510000 Info: Completed commit @ K10NDUAdminManage.cpp:2325
A 05/13/05 18:56:22 NDU 71510000 Info: Completed commit @ K10NDUAdminManage.cpp:2413

Dumps, Panics, Managed Restart and Flare not Rescheduling

Dumps

This section provides an easy reference to know what dumps might be available on the SP. The dump messages are from the DumpManager. Not all bugchecks produce dumps, so there may not be a dump file for every bugcheck.

Example:
A 12/15/05 23:18:05 Dump Manager 41004100 Created Compressed Dump C:\dumps\SPA_APM000415xxxxx_10c131_12-15-2005_23-11-18.dmp.zip
B 12/15/05 23:20:20 Dump Manager 41004100 Created Compressed Dump C:\dumps\SPB_APM000415xxxxx_10ba53_12-15-2005_23-13-24.dmp.zip

Panics

Occasionally an SP will panic in response to an unanticipated external event or code anomaly. Unfortunately, on rare occasions both SPs will panic simultaneously or near simultaneously, creating a momentary data unavailable situation and possibly data loss through the cache dirty condition. There are three types of panics seen with these arrays: those generated by the Microsoft Windows NT/XPe operating system, those caught by a driver (Flare, layered drivers, etc.), and those that are called “Admin panics”. Most of the causes of the Windows panics are not bugs in the operating system but problems with a driver. Flare panics are those recognized situations where continued operation would jeopardize the integrity of data in memory and on disk. All panics ultimately call the KeBugCheckEx() function. This function takes a stop code and four parameters that are stop code specific. Admin panics occur within the range 0x4xxxxxxx to 0x7xxxxxxx. However, codes in the 0x7xxxxxxx range are reported as 0xExxxxxxx for some reason. A common example of an admin panic is 0xE117B264, the TCD_T_O_ABORT_ERR scsitarg timeout.

Windows NT/XPe Panic: In this example, the Windows stop code is 0x0000000a, IRQL_NOT_LESS_OR_EQUAL. More information on a specific Windows stop code can be obtained by going to the Microsoft web site and simply using the code as search criteria.
Parameter1 – The address that was referenced incorrectly
Parameter2 – The IRQL that was required to access the memory
Parameter3 – The type of access, where 0 is a read operation and 1 is a write operation
Parameter4 – The address of the instruction that referenced the memory in Parameter1
0x0000000a (0xe24edf62, 0x00000002, 0x00000000, 0x8010b225)


Flare Panic: Flare panics are typically called with a stop code of 0x00000000, with the second parameter indicating the reason for the panic.
Parameter2 – Flare specific stop code (0x0080007d, CM_SWAP_PANIC)
Parameter3 – Stop code specific information field one
Parameter4 – Stop code specific information field two
0x00000000 (0x00000000, 0x0080007d, 0xa3b267d9, 0x82ab38b8)
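The three panic categories above can be told apart programmatically from the stop code alone. The sketch below is an illustrative heuristic, not an official decoder: it assumes the conventions stated in this section (Flare panics arrive as stop code 0x00000000 with the Flare code in Parameter2; Admin panics fall in 0x40000000-0x7FFFFFFF, with the 0x7xxxxxxx codes showing up as 0xExxxxxxx).

```python
def classify_panic(stop_code, params):
    """Rough classifier for bugcheck stop codes per the categories above.
    params is the 4-tuple of stop-code-specific parameters."""
    if stop_code == 0x00000000:
        # Flare panic: the real stop code is carried in parameter 2.
        return f"Flare panic, Flare stop code 0x{params[1]:08x}"
    if 0x40000000 <= stop_code <= 0x7FFFFFFF or (stop_code >> 28) == 0xE:
        return f"Admin panic 0x{stop_code:08x}"
    return f"Windows stop code 0x{stop_code:08x}"

print(classify_panic(0x00000000, (0x00000000, 0x0080007d, 0xa3b267d9, 0x82ab38b8)))
print(classify_panic(0xE117B264, (0, 0, 0, 0)))
print(classify_panic(0x0000000A, (0xe24edf62, 0x2, 0x0, 0x8010b225)))
```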

Recommended Course of Action

Determine what group should receive the dump and submit zip files taken from the c:\dumps directory on the array to that group for analysis. This can be done easily by looking at this section in the TRiiAGE Analysis for the contact information. The TRiiAGE process attempts to identify the owner of the code that produced the panic and recommends the “submit to” group.

Example:
B 02/16/06 19:45:19 Save Dump 2183 Reboot from bugcheck:0x00000035 (0x8f64a030,0x00000000,0x00000000,0x00000000) [BugcheckCode:35 Definition:NO_MORE_IRP_STACK_LOCATIONS Contact:Sustaining ]

Searching for more information
• Flare Panic Codes and the mnemonics they represent are found by searching Primus.
• Admin Panic Codes and the mnemonics they represent are found by searching Primus.
• Search the Microsoft web site for Windows stop codes to obtain a generic overview of the panic and associated parameters.
• Search DIMs for cases where this panic has happened in the past. Searching can be accomplished by entering either the hex stop code or the text mnemonic into the “Keywords”, “Details”, or “Brief Summary” fields. Searching can also be performed on DIMs notes. One thing to keep in mind is that an SP can panic for a wide variety of reasons under the same panic code. A solution for one case most likely will not be the solution for the case at hand.

Crash Dump Analysis

Dumps and all supporting DLL files from the SP in question are rolled up into a zip file and submitted to CLARiiON Sustaining Engineering for analysis. Analysis is performed using the WinDbg debugger with appropriate access to symbols and source code. Please contact Sustaining Engineering Management for more information.

Managed Restart

The 618 message "CM Managed this SP Restart" with an extended status of 0x00f2000c, FLARE_MANAGED_SP_RESTART, indicates that Flare called the “kill thyself" function and rebooted, rather than causing a panic. Flare will call this function when either SP is not in sync with its peer. This is not a hard reboot of the SP. Navisphere log (Ulog) code 0x618 indicates that CM expected the SP restart: the SP expected its peer to reboot, but the peer did not. This kind of SP reboot occurs when the two SPs are out of sync with each other; the underlying reason for the SPs being out of sync varies from case to case.

Example Log Entries:
A 11/09/2005 18:59:01 SP A (618) CM Managed this SP Restart [0x00] f2000c 4
B 07/07/2005 01:12:14 SP B 618 CM Managed this SP Restart 0 f2000c a0113a

Flare not Rescheduling

Flare not rescheduling is a warning that no Flare threads have run since the last time the Flare Thread Watcher DPC (Deferred Procedure Call) ran. DPCs perform the heavy lifting that cannot realistically be done from within the time constraints of an ISR. Essentially, Flare is experiencing difficulty scheduling tasks such as processing I/O. The DPC can eventually panic the SP with HEMI_CPU1_WATCHDOG (0x00000041) if, after the fourth check-up, it still sees that no Flare threads have run. This problem can occur for a variety of reasons, including software bugs that cause processes


(including Flare) to continuously loop and occupy the CPU, processes running at an elevated priority level for extended periods of time, and even excessive trespassing. In general, CPU Watchdog panics should only be encountered during development. In older FLARE revisions (10 or 11, very rarely in 12), watchdog panics were more common.

Example:
B 10/05/05 20:59:33 Flaredrv 71274001 CPU1_WATCHDOG: Flare not rescheduling
A 10/05/05 21:24:46 Flaredrv 71274001 CPU1_WATCHDOG: Flare not rescheduling
B 10/05/05 21:25:05 Flaredrv 71274001 CPU1_WATCHDOG: Flare not rescheduling
A 10/05/05 21:25:46 Flaredrv 71274001 CPU1_WATCHDOG: Flare not rescheduling

Excessive Trespassing

In this case, the customer may be complaining of poor performance. The TRiiAGELogs indicate that Flare is not rescheduling, and possibly that the write cache is not enabling due to “Not enough pages”. A quick look at the “LUN and MetaLUN state change info:” section in the TRiiAGE report indicates whether there is high trespassing. The important counts in this case include the “606 Unit Shutdown for Trespass Count” and/or the “642 Background Verify Aborted Count”. If the counts are in the hundreds or even thousands, there most likely is a trespassing problem. Keep in mind that for some third-party path management software packages like PVLinks and DMP, the 606 messages are not printed in the TRiiAGELogs. This was done on purpose, to prevent the log files from filling up exclusively with trespass messages. The only indication here is the high 642 message count. Background Verify is started every time a LUN is trespassed and will be aborted if it is not permitted to complete before the next trespass.

Trespassing problems are a broad topic and can occur because of backend problems on the array, incorrect path management installation, improper host failover settings on the array, and hardware problems between the host and the array. All of these items need to be investigated and may involve multiple disciplines from within EMC, including ASE, Sustaining, RTP, and the switch, HBA, and host configuration experts. In some cases, Flare may be occupied with copying a large amount of cache data over to the other SP as part of the trespass operation. This can lead the DPC to issue the “Flare not rescheduling” warning message.

"Write cache enable pending [Not enough pages]"

The message “Write cache enable pending [Not enough pages]” may be seen along with “Flare not rescheduling” in high trespass environments for versions of Flare up to release 14. From DIMs: “The problem happens when a majority of the I/O's are going to a LUN or a group of LUNs or there is a wholesale trespass of a large number of LUNs occurring at once. The trespass of a large group of LUNs will leave the current owner without enough pages to cache. This situation causes the cache to disable, which in turn causes a cache dump to vault and the not enough cache pages. The deadlock avoidance code added in Release 16 will reduce the write aside cache size so that it will trigger the other SP to aggressively flush pages and release cache pages. Write aside cache size would eventually be restored to normal value."

Example:
B 10/14/05 15:45:40 SP B 78a Write cache enable pending [Not enough pages] 0 f334c530 7
B 10/14/05 15:45:40 SP B 78a Write cache enable pending [Not enough pages] 0 f334c530 7
B 10/14/05 16:16:04 SP B 78a Write cache enable pending [Not enough pages] 0 f334c530 7
B 10/14/05 16:16:04 SP B 78a Write cache enable pending [Not enough pages] 0 f334c530 7

General Recommendations: Look for and solve all backend problems first. If the array appears to be clean, the short-term solution is to reboot the array. For the long term, look for signs of high trespassing activity, then examine host array configuration settings and multi-path failover software versions. Version 4.1 of DMP for Solaris implements failover similar to that of PowerPath. Recommendations to upgrade to this version or later may be appropriate. In an emergency, a host can be connected through a single path while the problem is resolved. Ultimately, high trespassing activity puts a big damper on performance and must be addressed.

1. If the host is using DMP as failover software, check Primus emc56168 (How VERITAS Dynamic Multi-Pathing (DMP) works with a CLARiiON array) to reduce the heavy trespassing.

2. Reboot one SP at a time or the whole array if the customer is willing.
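The at-a-glance 606/642 check described above can be automated. This is an illustrative sketch: `trespass_counts` and `looks_like_trespass_storm` are hypothetical helper names, and the threshold of 100 is an arbitrary cutoff chosen to match the “hundreds or even thousands” guidance, not an official figure.

```python
import re

def trespass_counts(lines):
    """Count 606 (Unit Shutdown for Trespass) and 642 (Background Verify
    Aborted) events in TRiiAGELogs text."""
    counts = {"606": 0, "642": 0}
    for line in lines:
        m = re.search(r"\b(606|642)\b", line)
        if m:
            counts[m.group(1)] += 1
    return counts

def looks_like_trespass_storm(counts, threshold=100):
    # With PVLinks or DMP the 606 lines are suppressed, so a high 642
    # count alone still counts as a positive indication.
    return counts["606"] >= threshold or counts["642"] >= threshold
```

Usage: feed the function the merged log lines and compare the counts against the threshold before digging further into host failover configuration.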


SP Boot History

This section provides information about the boot history of the SPs, as indicated by the SP identifier or signature. A quick glance will tell you if the boot activity for one or both SPs is excessive. Each SP has a unique signature. The signature is reported by the spid process each time an SP boots. Unfortunately, not all reboots are logged. However, the peer SP will report the signature of the other SP through the cmid transport. If the SP is replaced, the cmid will log the signature of the new SP. It will also give the date and time when the SP was first booted after replacement. Please keep in mind that it may not be easy to tell if an SP has been replaced. This situation can occur when the log wraps and older messages, including the record of a reboot, are forced out.

Example 1: From the example text below, the unique boot signatures for SPA (0x150001) and SPB (0x1325e2) are recorded. 0x9dd020b060010650:0 is the SP ID of SPA and 0x9dd020b060010650:1 is the SP ID of SPB.

SP BOOT SIGNATURE HISTORY: Review this section for excessive reboots and SP replacements. CMID reports the peer SP signature.
**************************************************************************************
A 03/19/06 23:44:50 Reboot 71200007 Found package Base02.16.500.5.012.
A 03/19/06 23:45:20 SP A 780 03/19/2006 23:45:20 SP A (780) BIOS Rev: 03.28 0 148 0
B 03/19/06 23:46:31 Reboot 71200007 Found package Base02.16.500.5.012.
B 03/19/06 23:47:20 SP B 780 03/19/2006 23:47:20 SP B (780) BIOS Rev: 03.28 0 148 0
A 03/20/06 15:20:24 Reboot 71200007 Found package Base02.16.500.5.012.
B 03/20/06 15:20:42 cmid 71310008 CMID Transport Device 0: Gate 0 connects to SP ID 0x9dd020b060010650:0 (Signature 0x150001)
A 03/20/06 15:20:54 SP A 780 03/20/2006 15:20:54 SP A (780) BIOS Rev: 03.28 0 148 0
B 03/20/06 15:28:43 Reboot 71200007 Found package Base02.16.500.5.012.
A 03/20/06 15:28:59 cmid 71310008 CMID Transport Device 0: Gate 0 connects to SP ID 0x9dd020b060010650:1 (Signature 0x1325e2)
B 03/20/06 15:29:13 SP B 780 03/20/2006 15:29:13 SP B (780) BIOS Rev: 03.28 0 148 0

Known Case of Spontaneous Reboots

Certain CX400 SPs with a serial number xxxx xx NNN x xxxx, where NNN is in the range of 304-326, may have a VRM hardware problem that can cause a “spontaneous” reboot. This problem was observed in June of 2003 and should rarely be seen today. Replace any SP that is a revision A30 or higher. Use the array serial number found in the Array Overview section in this document to determine if the SP is within the problematic range. Do not use the SPID for this determination.

SP BIOS Revisions

This section also lists information about the BIOS revision the SP is running, which is recorded when an SP boots. The BIOS revision is useful when there are problems identified that are version specific. For example, the CX700 array may experience hangs when running BIOS revision 3.15 and should be upgraded to revision 3.33 or later. In this case, the CX700 hang issue was solved in releases R14.20 and R16.08.

Example 2: In the following example, this CX700 was running BIOS version 3.14. The recommended course of action is to NDU to the latest Flare release to obtain the updated BIOS.

A 08/09/04 17:45:26 Reboot 71200007 Found package Base02.07.700.5.004.
B 08/09/04 17:45:26 Reboot 71200007 Found package Base02.07.700.5.004.
B 08/09/04 17:45:55 SP B 780 08/09/2004 17:45:55 SP B (780) BIOS Revision: 03.14 0 13a 0
A 08/09/04 17:45:59 SP A 780 08/09/2004 17:45:59 SP A (780) BIOS Revision: 03.14 0 13a 0
B 08/23/04 19:32:48 Reboot 71200007 Found package Base02.07.700.5.004.
A 08/23/04 19:33:21 Reboot 71200007 Found package Base02.07.700.5.004.
B 08/23/04 19:33:55 SP B 780 08/23/2004 19:33:55 SP B (780) BIOS Revision: 03.14 0 13a 0

SP Skipping Unquiesce, HFOFF Mode, SP Degraded

SP Skipping Unquiesce entries are indicative of several potential problems. With all of these issues, the symptoms are the same: front end ports are not enabled on the SP. This condition means that the array will not allow host login. First, the device map may be somehow corrupt, or the C:\ drive is full. The device map is created when the SP boots and must first be written to disk before it is loaded into memory as a memory mapped file. If the C:\ drive is full, this process cannot occur. Second, device map issues may be related to the fact that a layered application has not started properly. Look for NduApp and ndumon entries in the log files and reported exceptions for further information. Some evidence of this


condition should be available in the log files. Third, unquiesce problems can occur when Hostside XLU information in the PSM is corrupt. Most of these issues will require assistance from the appropriate ASE development group. Escalate as necessary.

The SP may have been left in HFOFF (Hands Free Off) and degraded mode. Both of these conditions are reported in this section. HFOFF is used when analyzing and troubleshooting SP boot related issues. The SP is manually set to HFOFF when attempting to start the array drivers in a step-by-step fashion. This is done by editing a copy of the flareandlayeredstart.bat file and inserting pause statements to start one driver at a time to determine the point of trouble. When the troubleshooting session is complete, the engineer must remember to replace the startup file with an original and set the SP back to HFON. Refer to the Analysis of Boot Problems document for information on troubleshooting boot issues and a presentation on the SP Boot Sequence. If the SP is in degraded mode, only those drivers required to boot the OS will start. Array drivers like Flare cannot be started at all when the SP is degraded. The array will automatically enter degraded mode if it has unsuccessfully rebooted four times. The messages that you will see in this section contain the phrases "skipping SP Unquiesce", "HFoffMode" or possibly "The SP has been rebooted unsuccessfully 4 times".

Example:
************************************************************************
SP Skipping Unquiesce, SP in HFOFF mode, SP reboots unsuccessfully (Degraded)
************************************************************************
A 08/16/05 15:27:56 ndumon 71514000 Warn: Degraded, skipping SP Unquiesce @ NDUmon.cpp:1245
A 08/16/05 16:35:16 ndumon 71514000 Warn: Degraded, skipping SP Unquiesce @ NDUmon.cpp:1245
A 08/16/05 16:41:28 ndumon 71514000 Warn: Inhibited, skipping SP Unquiesce @ NDUmon.cpp:1250
A 08/16/05 16:26:28 SafetyNet 0 HFoffMode - Not starting K10Governor
A 08/16/05 16:38:46 SafetyNet 0 HFoffMode - Not starting K10Governor

Stuck REMAP Thread Count

Each SP runs a single remap thread, covering all raid groups, that accesses a shared list of known bad sectors. The shared list is mirrored on each SP, and access to the list is coordinated by the threads. The remap thread handles sector remapping for disk drives, if necessary, when media errors are reported. When a particular error condition occurs, the remap thread will continuously re-attempt a remap request and become stuck in an infinite loop. This problem was found in R14 through R22.005, with an affinity of occurrence towards R19. The problem was fixed in R19.034, R16.024 and R22.505 (Vulcan). The remap thread is stuck if the number presented in the TRiiAGE Analysis Report is greater than zero. All cases of suspected stuck remap should be escalated to CLARiiON Sustaining Engineering for recovery.

Potential DU or DL Entry Counts

These counts are at-a-glance indicators of current and past data unavailable (DU) and data loss (DL). If the DIMs is submitted as DU/DL, these counts can indicate possible problems and what to look for in the logs. For example, if the DIMs is submitted as unowned LUNs, the CRU Signature error count is positive, and the other counts are zero, then begin by searching the logs for CRU Signature errors. These counts simply provide indicators on where to focus the troubleshooting effort relative to the problem reported in the DIMs.

Example: There are several issues here that may need addressing. All of the counts below indicate some level of data loss and most likely data unavailable. A brief glance at these counts would indicate that the investigation should focus on rectifying the cache dirty situation, if it is current, and then address the uncorrectable errors.

************************************************************************
Potential DU or DL entry counts:
************************************************************************
(a08)Database Sync Error: 0
(957)Uncorrectable Sector: 654
(953)Uncorrectable Parity Sector: 153
(951)CRU Signature Error: 0
(90a)Can't Assign - Cache Dirty: 6
(COH)Coherency Errors: 2

Statistical Information

The statistical information section records the counts of the number of times a message is seen in the log. These counts should be looked at for unusually high numbers and symmetry. For example, if the background verify abort count is excessively high, this could indicate a problem with host failover configuration and high trespassing. Large differences between start and complete or finish counts may indicate other problem areas and suggest a target of focus for further investigation. The top of this section reports the number of initiators that are registered with the array and logged in are


reported for each front end port. Check for “Warn: inhibited, skipping SP Unquiesce” messages in the log files if no initiators are logged in when they should be. Front end port speeds are also listed. This is important for the CX3-xx series arrays where port speeds can vary from 2gb to 4gb. Example: *********************************************************************************************** STATISTICAL INFORMATION: *********************************************************************************************** INITIATORS: SPA (Registered Initiator/Logged Initiator): Port 0 10/9, Port 1 9/8, Port 2 4/4, Port 3 4/4 SPB (Registered Initiator/Logged Initiator): Port 0 9/8, Port 1 10/9, Port 2 4/4, Port 3 4/4 PORT SPEEDS: SPA Port 0: 2Gbps, Port 1: 2Gbps, Port 2: 2Gbps, Port 3: 2Gbps SPB Port 0: 2Gbps, Port 1: 2Gbps, Port 2: 2Gbps, Port 3: 2Gbps MetaLUN: Create; 0 Destroy; 0 Stripe Start: 0 Stripe Complete: 0 Concatenation Complete: 0 SHUTDOWNS: (906)Unit: 0 (606)Trespass: 690 REBUILDS: (603)Start: 0 (604)Complete: 0 (605)Halt: 0 (641)Abort: 0 (67b)Start: 0 (67d)All Completed: 0 EQUALIZES: (613)Start: 0 (614)Complete: 0 (615)Abort: 0 EXPANSIONS: (6f0)Start: 0 (6f1)Finish: 0 (6f2)Halt: 0 BGVERIFIES: (621)Start: 0 (622)Complete: 0 (642)Abort: 0 HOT SPARE SWAPS: (6a2)In: 0 (6a3)Out: 0 RECONSTRUCTS: (689)Sector: 0 (684)Parity: 0 SP CORRECT_ECC: (ECC); 0 MetaLUN Information At-a-glance count of MetaLUNs currently bound on the array as well as the MetaLUN LUN number and state. If the LUN is in any other state other than “ENABLED”, more investigation is warranted. Example: ******************* MetaLUN INFORMATION ******************* Number of metaLUNs in the system: 4 MetaLUN Number: 3 Current State: ENABLED MetaLUN Number: 6 Current State: ENABLED MetaLUN Number: 132 Current State: ENABLED MetaLUN Number: 9 Current State: ENABLED Array Hardware Errors Array hardware errors are divided up into two categories, warning messages and errors. 
Warning messages, if any, are reported to the logs when an SP boots. Warning messages are an indication that something is not quite correct, but the SP is not prevented from moving on through the boot process. Errors, on the other hand, will prevent the SP from booting. Errors can occur for a variety of reasons, including problems with backend cabling. Unfortunately, there is no documentation available at this time regarding the meaning of these messages. The format of a message is as follows:

A 12/25/05 19:38:40 DGSSP 76004101 Found a warning during POST: 03/18/05 19:09:37 WARNING: Power Supply B AC Fail detected

The first date (12/25/05) indicates when the message was written to the Navisphere log file and the second date (03/18/05) indicates when the message actually occurred. This is important to remember: the first date may be relatively current, but the actual incident may have occurred some time in the distant past. The best way to use these messages is to look for other supporting evidence that may have occurred on or about the second date.

ECC Errors
ECC errors reported by the POST boot software indicate problems with one or more DIMM memory cards in the SP. Correctable (single-bit) ECC errors are recoverable; multi-bit ECC errors are not. The replacement rules for all SPs are as follows:

ECC Error Type                          Action
Less than 50 single-bit ECC errors      No action necessary
Greater than 50 single-bit ECC errors   Replace the offending SP
One or more multi-bit ECC errors        Replace the offending SP

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 145


However, SP replacement may be made even when thresholds are not reached in cases where the SP does not respond, has a history of “spontaneous” reboots, or panics in such a way as to indicate memory corruption. The algorithms for ECC vary for different memory controllers. A CX700 uses a memory controller called a CMIC. This particular controller can sometimes correct more than a single bit under certain circumstances (if the bits are in the same nibble, for example). “Correctable ECC error” is synonymous with “single-bit error” (more recent Flare revs use this terminology as well), and the terms are used interchangeably in this section.
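The replacement thresholds above can be sketched as a small decision helper. This is a hypothetical illustration, not a supported tool; the table leaves a count of exactly 50 single-bit errors unspecified, so this sketch treats only counts above 50 as actionable.

```python
# Hypothetical sketch of the ECC replacement table. Exactly 50 single-bit
# errors is not specified by the table, so only counts above 50 trigger
# replacement here.
def sp_ecc_action(single_bit_errors, multi_bit_errors):
    if multi_bit_errors >= 1:
        return "Replace the offending SP"
    if single_bit_errors > 50:
        return "Replace the offending SP"
    return "No action necessary"

print(sp_ecc_action(12, 0))  # No action necessary
print(sp_ecc_action(0, 1))   # Replace the offending SP
print(sp_ecc_action(51, 0))  # Replace the offending SP
```

Remember that, per the paragraph above, an unresponsive or spontaneously rebooting SP can justify replacement even when these thresholds are not reached.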

Background on ECC error handling: The error checking is performed by the CMIC hardware on every memory read. In addition to normal processor reads, the CMIC performs a background "scrubbing" algorithm, so all locations in memory are periodically checked for errors. When an ECC error is detected, the hardware latches information associated with the error (such as address and syndrome information) and generates the highest priority system interrupt, which calls special error handling firmware. This firmware reads the error registers within the CMIC and logs the information in non-volatile memory. A background software task checks this non-volatile memory periodically and updates the Navi log when it finds new ECC errors.

Example: Correctable (Single bit) ECC Errors
B date/time DGSSP 76008101 SINGLE_BIT_ECC: 09/15/05 11:07:47 : Failing address is 0x03bed4c0, Failing DIMM is DIMM CHA-1
B date/time DGSSP 76008101 SINGLE_BIT_ECC: 09/15/05 11:07:47 : Failing address is 0x03bedb80, Failing DIMM is DIMM CHA-1
B date/time DGSSP 76008101 SINGLE_BIT_ECC: 09/15/05 11:54:25 : Failing address is 0x03bed280, Failing DIMM is DIMM CHA-1

Example: Multi-bit ECC Errors
B date/time DGSSP 76008102 MULTI_BIT_ECC: 06/23/04 09:20:54 : Failing DIMM is DDR SLOT1, Syndrome is 0x6a000000
B date/time DGSSP 76008102 MULTI_BIT_ECC: 06/03/05 01:43:02 : Failing DIMM is DDR SLOT1, Syndrome is 0x53000000
B date/time DGSSP 76008102 MULTI_BIT_ECC: 06/07/05 09:41:23 : Failing DIMM is DDR SLOT1, Syndrome is 0x93000000
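When applying the thresholds, a per-DIMM tally of log entries is what matters. A minimal sketch, assuming log lines shaped like the examples above (the function and variable names are hypothetical):

```python
import re
from collections import Counter

# Hypothetical sketch: tally single-bit and multi-bit ECC entries per DIMM
# from Navisphere log lines shaped like the examples above.
ECC = re.compile(r"(SINGLE_BIT_ECC|MULTI_BIT_ECC):.*Failing DIMM is (.+?)(?:,|$)")

def tally_ecc(lines):
    single, multi = Counter(), Counter()
    for line in lines:
        match = ECC.search(line)
        if not match:
            continue
        kind, dimm = match.group(1), match.group(2).strip()
        (single if kind == "SINGLE_BIT_ECC" else multi)[dimm] += 1
    return single, multi

log = [
    "B date/time DGSSP 76008101 SINGLE_BIT_ECC: 09/15/05 11:07:47 : "
    "Failing address is 0x03bed4c0, Failing DIMM is DIMM CHA-1",
    "B date/time DGSSP 76008102 MULTI_BIT_ECC: 06/23/04 09:20:54 : "
    "Failing DIMM is DDR SLOT1, Syndrome is 0x6a000000",
]
single, multi = tally_ecc(log)
print(single["DIMM CHA-1"], multi["DDR SLOT1"])  # 1 1
```

Tallying per DIMM, rather than per SP, also shows at a glance whether the errors cluster on one module, which is the useful detail when deciding what to send back for failure analysis.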


FLARE Centric Log Error Reporting Information
The messages below are culled from the TRiiAGELog and displayed together for quick reference.

A18 CRU Drive Causing Loop Failure
Description: This message is reported against a drive when that drive is bypassed from the loop because it was causing the loop discovery mechanism to fail. In a loop topology, the drive reporting this message might not actually be the bad drive. One or more other drives on the loop might be generating noise that makes the reporting drive appear bad.
Recommendation: Look for other supporting evidence in the logs that would indicate the reporting drive is actually the problem. This might include numerous bad block or hardware messages indicating hardware replacement. Run FBI polling every few minutes for a period of fifteen minutes to a half hour and review the FBI logs. Review “Disk Sense Data” for the bus and enclosure in question; sense data might point to the drive that is experiencing problems.
Note: This type of problem can cause a data unavailable (DU) situation if a raid group is seen as faulted from both SPs or failover software is not configured properly.

A17 CRU Unformatted
Description: The drive is not factory formatted to CLARiiON specifications and is not supported by the CLARiiON. This error can also be reported when the drive is faulty in some way and is showing signs of inconsistencies on both SPs.
Recommendation: For example, there was a case where drives intended for JBOD use (not formatted to the 520-byte sectors required for FC drives) were supplied and inserted into a CLARiiON. The easiest course of action is to replace the drive with one that is properly formatted. An alternative is to attempt to use the setdisk command in fcli to format the drives to the required format; this does not always work properly.

A09 CRU Drive too Small
Description: A replacement disk drive was inserted into an existing raid group, but this disk has a smaller capacity than the other drives. The rebuild operation cannot begin until the smaller disk is replaced with a disk of the correct size. This message is also reported in the ulogs when a drive fails and a hot spare of lower capacity kicks in for the failed drive, and when the array is rebooted after a core software upgrade and the LUNs are inaccessible. Older versions of Core Software that do not support 18 GB or 36 GB drives will allow these drives to be bound into a LUN without “trimming” the drives to the proper size. (Core Software performs trimming to ensure interchangeability between different generations of drives and drives from different vendors.) However, when Core Software is upgraded to a revision that supports 18 GB or 36 GB drives, the upgraded revision will trim the drives.
Recommendation: If this occurs after a core software upgrade: if you downgrade to the previous revision of Core Software, the LUNs will again become accessible. At this point you must back up the data, unbind the LUNs containing the unsupported 18 GB or 36 GB drives, upgrade Core Software to the appropriate revision, rebind the LUNs, and restore the data. Refer to primus ID 1.0.39005404.2473408.

a08 Database Sync Error
Description: The SP cannot determine the correct virtual configuration of all LUNs in the storage system, and some LUNs may be unusable. The problem occurs when a RAID Group creation fails. This could happen because of a disk write failure when Flare attempts to write the FRU signature, or when the peer SP dies as the write is occurring. Unfortunately, the RAID Group create failure isn’t cleaned up properly and leaves behind a “ghost” RAID Group, which retains the number used at creation time. If you then create another RAID Group, using one or more of the same disks, but with a lower RAID Group number than the ghost, you are exposed to the problem.
Recommendation: Contact Sustaining Engineering for directions on the Database Sync recovery procedure. This has been fixed in the latest patch releases of R14, R16, and R19.

A07 CRU Powered Down or 799 Peer Requested Drive Power Down
Description: This message indicates a drive failure. The specified disk module has been powered down by Flare, has failed, or has been removed from the chassis and is not receiving any I/O. This could be due to hard or soft SCSI bus errors, media errors, or other hardware errors. If LUNs are bound in a raid group containing the CRU (drive), the a07 error is followed by the “906 – Unit Shutdown” message. Information on the 906 message regarding hot spare shutdown can be found in Primus solution emc100036. The “0x799 peer requested drive power down” message is an informational message informing one SP that the other SP is powering down the drive. Here is a scenario in which this error is reported in the logs: a drive is causing consternation on the SP-B side of the SCSI bus in a loop enclosure. SP-B shuts down


the drive on its side and reports the “a07 CRU powered down” message. SP-B sends a message to SP-A requesting to power down the drive. Since this is the first drive in the raid group to be powered down, SP-A obliges and records the 799 Peer Requested Drive Power Down. If SP-B loses access to more than one drive, SP-A will not shut down any more drives as long as it is not experiencing connectivity issues on its side of the bus.
Recommendation: Dial in and verify the fault. If multiple drives in the same RAID group have failed and are reported by both SPs, it is a double-faulted situation and needs appropriate corrective action. If it is a single drive fault, the drive may be replaced. Before replacing any drive, it is best to check the Raid Group Health Check (RGHC) output within a triage report run against the latest SPCollects.
Reference: Primus solution emc100037 (a07 – CRU Powered Down), Primus solution emc117113 (799 – Peer Requested Drive Power Down)
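The peer arbitration in the scenario above can be sketched as a small rule, simplified for illustration (the function name and parameters are hypothetical, not FLARE internals):

```python
# Hypothetical sketch of the peer power-down arbitration described above.
# The peer SP obliges the first power-down request for a raid group (and
# logs 799); it refuses further requests while its own side of the bus
# remains healthy.
def peer_grants_power_down(drives_already_down_in_rg, peer_side_healthy):
    if drives_already_down_in_rg == 0:
        return True   # first drive in the raid group: peer obliges
    return not peer_side_healthy

print(peer_grants_power_down(0, True))   # True
print(peer_grants_power_down(1, True))   # False
```

The design point is the same one the scenario makes: a single SP seeing bus trouble is not allowed to take down a whole raid group on its own say-so.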

966 Can’t enable write cache due to cache dirty unit(s)
Description: This message is entered in the log by the Cache State Machine if the vault load did not complete successfully on one or the other SP. This message does not indicate a problem so long as the cache does eventually become enabled after both SPs have booted. The message can occur when an SP is coming up and recovering the cache image: the vault is loaded and Flare tests whether the cache can be enabled, and the SP finds that dirty units are unassigned at this time. LUNs can go unassigned because of dual SP reboots, a deficient system failover implementation, etc. Cache will not enable while there are unassigned, dirty units, so Flare issues these messages to let the user know that manual intervention may be required.
Recommendation: Check to make sure all LUNs are assigned and write cache is enabled. If this is the case, then no further action needs to be taken. If all LUNs are assigned and write cache is disabled, attempt to enable the write cache. If the write cache does not enable, check the *.drt file to be absolutely sure that there are no LUNs marked as cache dirty. Contact Sustaining Engineering for information on “Cleaning Cache Dirty using ktcons”. Reference: Primus solution emc119573.
Note: Cache dirty units indicate that there was data in the write cache that was not written to disk due to some unforeseen event (e.g. a dual SP panic). This always results in a DL situation for the customer. The customer is advised to verify data before putting those LUNs back in service.

953/957 Uncorrectable Sector
Description: An uncorrectable sector is a situation where the Flare raid engine cannot regenerate a stripe element from the other remaining stripe elements in the raid set. The stripe elements themselves are made up of several 520-byte sectors. Raid can regenerate either parity or data if one of the stripe elements is missing; this is an expected and common occurrence when one drive in a raid group fails. However, if more than one stripe element is missing or cannot be used for regeneration (i.e. invalid CRC), then a 957 Uncorrectable Sector message is recorded in the log file. Unfortunately, these messages indicate that data loss has already occurred. An Uncorrectable Sector can be reported against any drive, either Fibre Channel or ATA; however, the majority of cases involve ATA drives, which are subsequently the focus of on-going investigation.
Recommendation: The typical response to these types of errors is to perform file level isolation by requesting that the customer perform host based backup. Once the backup is complete, assess what hardware might need to be replaced or recovered for failure analysis and make that recommendation. The usual course of action is to obtain Yukon logs from BCCs, unbind LUNs in the raid group in question, replace hardware, rebind LUNs, and restore data from backup. View the Raid Group Health Check information from a triage report before proactively replacing any drive, and focus on messages associated with any one disk. For release 19.016 and beyond, CRC trace information is gathered whenever a CRC error occurs. This trace information can be utilized by skilled personnel to determine the nature of the problem and possibly where the problem originates. All of these cases should be reviewed by Sustaining Engineering and possibly ASE for further investigation. Seek help for these cases if you have doubts or questions.
Furthermore, all customers with ATA drives should be encouraged to upgrade to version 19.027 at a minimum. Also, note the ATA best practices if customers have LUNs in a given raid group tied to both SPA and SPB as default owners: ATA best practices state that customers should have all LUNs in a raid group tied to either SPA or SPB as the default owner. This minimizes cross talk across BCCs.
Note: Uncorrectable sectors mean that there is some level of data loss. Performing host based backup can help determine the extent of the problem. Please consult Primus cases emc115999 and emc116000 for more information.
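The default-owner best practice above is easy to audit from configuration data. A minimal sketch, assuming a simple mapping of raid group number to LUN default owners (the names and data shape are illustrative, not a Navisphere structure):

```python
# Hypothetical sketch of the ATA best-practice check described above: all
# LUNs in a raid group should share one default owner SP to minimize cross
# talk across BCCs. raid_groups maps RG number -> {lun_number: default_owner}.
def find_mixed_owner_raid_groups(raid_groups):
    return [rg for rg, luns in sorted(raid_groups.items())
            if len(set(luns.values())) > 1]

raid_groups = {
    0: {10: "SPA", 11: "SPA"},   # compliant: single default owner
    1: {20: "SPA", 21: "SPB"},   # mixed owners: flag for review
}
print(find_mixed_owner_raid_groups(raid_groups))  # [1]
```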


951 CRU Signature
Description: The CRU Signature message is generated during the assign process as an exception when a LUN is evaluated by the assign state machine for its potential to be enabled. CRU Signatures (a.k.a. FRU Signatures) are written to all drives in a raid group when a LUN is bound. CRU Signature messages will be generated if a drive from one raid group is placed in another raid group, or if a drive is swapped with other drives in the same raid group. This is only an issue if the swapped drive was part of a raid group with bound LUNs. CRU Signature messages can be viewed as a safety mechanism to prevent use of drives that belong to other raid groups or are placed in the wrong slot within a raid group.
Recommendation: There have been instances where the CRU Signature message has been reported by the Assign State Machine against drives that were not swapped or moved. Consult Sustaining Engineering for information on “Fixing CRU Signature Errors”.
Note: A DU situation will result if more than one drive in a raid group is exhibiting CRU Signature errors. Under certain circumstances, the CRU signature restoration tool can cause DL. Please follow all Sustaining Engineering recovery recommendations before working on the customer’s array.

941 Battery Online
Description: These events indicate that the SPS (Standby Power Supply) has lost AC power and switched over to battery backup. When this occurs, enough battery power is supplied for the SP(s) to write any information in cache to the vault and shut the SPs down until power is restored.
Example messages:
B 02/01/05 08:46:16 Bus0 Enc0 SpsA 941 Battery Online 0 0 1
A 02/01/05 08:47:13 Bus0 Enc0 SpsA 941 Battery Online 0 0 1
Recommendation: Triage logs that contain several Battery Online messages in a short time period may indicate an erratic AC power source. The customer should verify the AC power source is working properly.
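The "several messages in a short time period" heuristic above can be sketched as a sliding-window check. This is a hypothetical illustration: the ten-minute window and the count of three are example values, not thresholds from this guide.

```python
from datetime import datetime, timedelta

# Hypothetical sketch: flag a possibly erratic AC source when several 941
# Battery Online events land inside one short window. The window and count
# are illustrative values, not thresholds from the guide.
def erratic_ac(timestamps, window=timedelta(minutes=10), threshold=3):
    times = sorted(timestamps)
    for i in range(len(times)):
        j = i
        while j < len(times) and times[j] - times[i] <= window:
            j += 1
        if j - i >= threshold:
            return True
    return False

fmt = "%m/%d/%y %H:%M:%S"
events = [datetime.strptime(t, fmt) for t in
          ("02/01/05 08:46:16", "02/01/05 08:47:13", "02/01/05 08:51:02")]
print(erratic_ac(events))  # True
```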

920 Hard Media Error
Description: The 920 message is reported by the Flare Device Handler when no valid LBA is returned from the drive and the block(s) cannot be remapped. This message should rarely be seen in association with a fibre drive but is quite prevalent for ATA drives on arrays running a Flare release prior to R17. This message was changed from “920 Hard Media Error, 0x3D can’t remap” to “820 Soft Media Error, 0x3B can’t remap” for Flare releases R17 through R19.019. For Flare release R19.027 and beyond, this message was changed again to “820 Soft Media Error, 0x05 Bad Block” based on changes made to Flare and BCC FRUMON firmware to better handle media events reported by the drive.
Recommendation: For release R19.027 and above, use the same replacement guidelines for ATA drives reporting an “820 Soft Media Error, 0x05 Bad Block” as is done today for fibre drives. The 820 message indicates that an error was reported by the drive but the information in the sector was rebuilt and successfully remapped. Drive replacement should only occur if there are numerous 820 messages reported. Consult Primus emc64488 for more information regarding soft media errors and drive replacement. In all cases, check the Raid Group Health Check section of the triage report for the drive in question and review the output before taking any action. Do not replace drives for releases older than R19.027 if the 920/0x3D or 820/0x3B messages are the only evidence of a problem. If a drive is replaced under these circumstances, there is a good chance that a double fault will occur in the raid stripe, resulting in uncorrectable sectors. Please read the above Primus case thoroughly to minimize this possibility. Consider backing up data and upgrading the array to the latest patch release of Flare if core software is prior to R19.027 before taking replacement action.
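The message-code history above can be summarized as a lookup. This is a hypothetical sketch keyed on the release expressed as a (release, patch) tuple; the guide does not say what R19.020 through R19.026 report, so this sketch treats everything below R19.027 as the older 0x3B form.

```python
# Hypothetical sketch of the media-error message history above, keyed on the
# Flare release as a (release, patch) tuple, e.g. R19.027 -> (19, 27).
# Releases between R19.019 and R19.027 are not described in the guide and are
# treated here as the older form.
def ata_media_error_code(release, patch=0):
    if release < 17:
        return "920 Hard Media Error, 0x3D can't remap"
    if (release, patch) < (19, 27):
        return "820 Soft Media Error, 0x3B can't remap"
    return "820 Soft Media Error, 0x05 Bad Block"

print(ata_media_error_code(16))      # 920 Hard Media Error, 0x3D can't remap
print(ata_media_error_code(19, 19))  # 820 Soft Media Error, 0x3B can't remap
print(ata_media_error_code(19, 27))  # 820 Soft Media Error, 0x05 Bad Block
```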

90a Can’t Assign – Cache Dirty (plus Related Messages)
Description: This message means that a data unit (LUN) is inaccessible and cannot be assigned because it is cache dirty. This message originates from the Assign state machine when sanity checks are run to make sure a LUN is assignable. This is always a DL situation and may occur as the result of a dual SP panic or other sudden stop that affects both SPs.
Recommendation: Consult Sustaining Engineering for information on “Recovery Procedure for Clearing Cache Dirty LUNs” and take the appropriate action as necessary. Don’t forget to make sure write cache is enabled once all cache dirty LUNs have been cleared. If write cache does not enable, check to make sure all of the following hardware components are available:

1. Both SPs (non-AX100 single SP systems) are operating properly.
2. In CX600/CX700/CX3-xx series arrays, both LCCs in enclosure 0, bus 0 are operating properly.
3. Both power supplies are not faulted.


4. Drives 0 through 4 must be available and healthy. Bus 0 should be free from backend errors, which can impact drives.
5. At least one standby power supply (CX200/CX300) is properly cabled to the SP and DPE power supplies.

If the hardware is fine, zero the write cache, then set write cache to the appropriate size. The appropriate size is usually what the customer had cache set to before the cache dirty situation occurred. Contact the customer/CE if this information is not available in some prior SPCollects.
Note: Cache dirty units indicate that there was data in the write cache that was not written to disk due to some unforeseen event (e.g. a dual SP panic). This always results in a DL situation for the customer. The customer is advised to verify data before putting those LUNs back in service.
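The hardware checklist above can be rolled into a single pass/fail check. This is a hypothetical sketch; the status keys are illustrative names, not Navisphere fields.

```python
# Hypothetical sketch rolling the write-cache hardware checklist above into
# one check; the status keys are illustrative names, not Navisphere fields.
REQUIRED = (
    "both_sps_ok",             # 1. both SPs operating (non-AX100 single-SP systems)
    "bus0_enc0_lccs_ok",       # 2. both LCCs in enclosure 0, bus 0
    "power_supplies_ok",       # 3. neither power supply faulted
    "vault_drives_0_to_4_ok",  # 4. drives 0-4 available and healthy
    "sps_cabled_ok",           # 5. at least one SPS properly cabled
)

def write_cache_prereqs_ok(status):
    """Return True only when every prerequisite reports healthy."""
    return all(status.get(key, False) for key in REQUIRED)

healthy = {key: True for key in REQUIRED}
print(write_cache_prereqs_ok(healthy))                              # True
print(write_cache_prereqs_ok({**healthy, "sps_cabled_ok": False}))  # False
```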

906 Unit Shutdown
Description: The “906 – Unit Shutdown” message is usually generated when a Hot Spare is engaged in place of a failed drive. A CRU powered down message is usually followed by a unit shutdown message. In a redundant unit, a failure of two CRUs is needed to produce this error, which means the SP shut down the LUN and the host no longer has access.
Recommendation: If this message appears along with a 0x905 and/or 0xa06 message, replacing a defective fan module may restore access to the LUN. If the problem is with disk modules, take appropriate corrective action by replacing the defective disk modules. Reference: Primus emc100036.
Note: A DU situation exists if these messages are not associated with a hot spare LUN. Examine the array carefully and proceed with the DU recovery effort.

904 VSC Shutdown
Description: These events indicate the VSC (Power Supply) lost AC power (shutdown) or was physically removed.
B 02/01/05 08:46:16 Bus0 Enc0 PowB 904 VSC Shutdown/Removed 0 0 4
A 02/01/05 08:47:08 Bus0 Enc0 PowB 904 VSC Shutdown/Removed 0 0 4
Recommendation: If a “941 Battery Online” message follows the VSC message, this indicates an AC power failure (to both power supplies) and that the SPS is supplying power long enough to dump the cache and shut the SPs down. If a “941 Battery Online” message does not follow the VSC message, it means the power supply was physically removed, switched off, had its power cord removed, or has failed. The LEDs on the back of the power supply should be checked: a green LED means AC power is supplied; an amber LED means the power supply has failed.
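The 904/941 correlation above reduces to a simple rule. A minimal sketch (the function name and verdict strings are hypothetical, not FLARE log text):

```python
# Hypothetical sketch of the interpretation rule above: a "941 Battery
# Online" entry that follows a "904 VSC Shutdown" points to AC power loss;
# a 904 with no trailing 941 points to a removed, switched-off, or failed
# supply. The verdict strings are illustrative summaries.
def classify_904(codes):
    """codes: ordered list of message codes for one enclosure, e.g. ['904', '941']."""
    verdicts = []
    for i, code in enumerate(codes):
        if code != "904":
            continue
        if "941" in codes[i + 1:]:
            verdicts.append("AC power failure; SPS carried the load")
        else:
            verdicts.append("Power supply removed, switched off, or failed")
    return verdicts

print(classify_904(["904", "941"]))
print(classify_904(["904"]))
```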

873 CMI Connection is degraded
Description: The CMI (Communications Management Interface) is the software component that allows each storage processor’s (SP’s) instance of FLARE software to communicate with its peer. The communication takes place across a defined physical hardware path or paths between the two storage processors. The software (CMI) is a set of communications drivers used to send messages locally between two SPs or remotely to another array. It consists of several layers that manage the protocol and that act as the transport layer(s). Reference: Primus solution emc113741.
A “CMI Connection degraded” or “873 Flare’s ATM detects one CMI connection is down” message indicates that the SPs are not able to communicate with each other. This could be due to one SP being down (possibly because of a normal reboot, a reboot from timeout, or some other bugcheck), bad SP hardware, or a CMI issue. It is usually accompanied by fcdmtl messages, e.g. “The device, \Device\Scsi\fcdmtl2, did not respond within the timeout period”, and “Hard Peer Bus error” messages. Further, the CMI connection going degraded prevents write cache from enabling because one of the CMI paths is down.
Recommendation: In case of a reboot due to panic or timeout, perform the timeout or bugcheck analysis and take appropriate action.

866 LCC Mcode upgrade error
Description: An error occurred when attempting to upgrade the FRUMON software on the BCC/LCC hardware. This message can be posted for a number of reasons, all culminating in the fact that the FRUMON firmware was not updated. Reasons include: a bad firmware image; the firmware image can’t be found because the path is invalid (null); the requested revision does not match the image revision; permission for the firmware update can’t be obtained from the peer because the peer is dead; the resume PROM of the BCC/LCC can’t be read; not enough memory can be allocated to perform the update; another upgrade is already in progress; the new firmware image file can’t be found when performing an NDU; the registry can’t be accessed for upgrade information; and any other reason the download may have failed. For the BCC in particular, the firmware file is transferred to the first drive in the enclosure for staging prior to updating the BCC. If there is a problem accessing the first ATA drive, the BCC firmware update may/will fail. This is not an issue with LCCs because the firmware file is sent via the serial diplex connection.


Recommendation: Further investigation is required since the firmware update can fail for a variety of reasons. Review both the ktrace and the SP event log files and gather as much information as possible. Look for backend errors on the database disks and correct any problem found. Sometimes reseating the BCC/LCC will trigger the firmware download. Other instances might require a full reboot of both SPs in cases where one side accepts the firmware download for all enclosures and the other side does not. On very rare occasions this might occur after an array is NDU’d. Always check for 866 errors after an NDU. If in doubt, escalate to the next higher level for assistance. Obtain “YUKON logs” from ATA enclosures prior to escalation using Primus emc117216.

850 Enclosure State
Description: These messages should be considered when many backend instabilities are reported on the array. Enclosure state change events indicate the state of the enclosure has changed. The error is usually the result of component failure or component removal through some type of disconnection. Disconnection may occur through loss of power or loss of fibre channel signal. The extended codes can be interpreted by knowing what the possible state changes mean. Possible states are:
Enclosure is missing / Enclosure is in a normal state / Enclosure is degraded / Enclosure is bypassed / Enclosure has failed
If you know the possible states of the enclosure, you can decipher what the extended data means in the entry. For example, hex 0x12 indicates that the enclosure went from a normal state (1) to a degraded state (2). Similarly, hex 0x10 indicates that the enclosure went from a normal state (1) to a missing state (0). Therefore, the first digit of the extended code indicates the initial state of the enclosure and the second digit indicates the state to which it changed.
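The two-digit decoding above can be sketched as a small helper. This is a hypothetical illustration using the five states listed; TRiiAGE output prints these transitions with slightly different labels (ok, degrading, etc.).

```python
# Hypothetical sketch decoding the two-digit extended code described above:
# first hex digit = the enclosure's initial state, second = the state it
# changed to. State names follow the list above; TRiiAGE prints them as
# ok/degrading/etc.
STATES = {0: "missing", 1: "normal", 2: "degraded", 3: "bypassed", 4: "failed"}

def decode_enclosure_change(extended_code):
    """e.g. 0x12 -> 'normal -> degraded', 0x10 -> 'normal -> missing'."""
    old = (extended_code >> 4) & 0xF
    new = extended_code & 0xF
    return f"{STATES.get(old, '?')} -> {STATES.get(new, '?')}"

print(decode_enclosure_change(0x12))  # normal -> degraded
print(decode_enclosure_change(0x14))  # normal -> failed
print(decode_enclosure_change(0x40))  # failed -> missing
```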

For Flare revisions prior to R19, the entries of “Enclosure state change” have the code 850, whereas for R19 onwards the code has been changed to a25. Message code 7a8 was also introduced in R19, indicating the addition of an enclosure.

Example:
*****************************************************************************************
ENCLOSURE INFORMATION
1. All DAEs on a single bus must support point-to-point or the bus will run in loop mode.
A25/878/7a8 Enclosure changes
---------------------------------
1. Replaces 850 ENC entry for R19. 7a8 = ENC add, all other state changes generate a critical a25 entry unless it is a Stiletto ENC which also generates an 878 indicating an LCC GENERAL FAULT. An LCC reporting a GENERAL FAULT should be replaced.
*****************************************************************************************
B 12/04/05 16:40:47 Bus0 Enc2 7a8 Enclosure added. [missing->ok] 1 0 0
A 12/04/05 16:41:13 Bus0 Enc2 7a8 Enclosure added. [missing->ok] 1 0 0
B 12/04/05 16:46:37 Bus1 Enc2 7a8 Enclosure added. [missing->ok] 1 0 0
A 12/04/05 16:46:54 Bus1 Enc2 7a8 Enclosure added. [missing->ok] 1 0 0
B 12/04/05 16:51:25 Bus0 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0
A 12/04/05 16:51:29 Bus0 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0
B 12/04/05 16:53:40 Bus1 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0
A 12/04/05 16:53:45 Bus1 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0
B 12/04/05 17:01:33 Bus1 Enc3 a25 Enclosure state change [ok->missing] 10 0 0
A 12/04/05 17:02:15 Bus1 Enc3 a25 Enclosure state change [ok->missing] 10 0 0
B 12/04/05 17:03:02 Bus1 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0
A 12/04/05 17:03:47 Bus1 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0
B 01/18/06 11:03:14 Bus0 Enc1 a25 Enclosure state change [ok->degrading] 12 0 0
B 01/18/06 11:03:37 Bus0 Enc1 a25 Enclosure state change [degrading->ok] 21 0 0
A 01/18/06 11:04:16 Bus0 Enc1 a25 Enclosure state change [ok->degrading] 12 0 0
A 01/18/06 11:04:37 Bus0 Enc1 a25 Enclosure state change [degrading->ok] 21 0 0
B 01/18/06 12:04:44 Bus0 Enc1 a25 Enclosure state change [ok->degrading] 12 0 0
B 01/18/06 12:05:02 Bus0 Enc1 a25 Enclosure state change [degrading->ok] 21 0 0
A 01/18/06 12:05:44 Bus0 Enc1 a25 Enclosure state change [ok->degrading] 12 0 0
A 01/18/06 12:06:02 Bus0 Enc1 a25 Enclosure state change [degrading->ok] 21 0 0
B 01/18/06 12:08:40 Bus0 Enc1 a25 Enclosure state change [ok->degrading] 12 0 0
A 02/17/06 18:21:30 Bus1 Enc2 7a8 Enclosure added. [missing->ok] 1 0 0
A 02/17/06 18:21:30 Bus1 Enc1 7a8 Enclosure added. [missing->ok] 1 0 0
A 02/17/06 18:21:31 Bus1 Enc1 a25 Enclosure state change [ok->missing] 10 0 0
A 02/17/06 18:21:31 Bus1 Enc3 a25 Enclosure state change [ok->missing] 10 0 0
A 02/17/06 18:21:31 Bus1 Enc2 a25 Enclosure state change [ok->missing] 10 0 0
A 02/17/06 18:21:34 Bus1 Enc1 7a8 Enclosure added. [missing->ok] 1 0 0
A 02/17/06 18:21:35 Bus1 Enc2 7a8 Enclosure added.


A 02/17/06 18:21:36 Bus1 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0 A 02/17/06 18:21:47 Bus1 Enc2 a25 Enclosure state change [ok->failed] 14 0 0 A 02/17/06 18:21:47 Bus1 Enc3 a25 Enclosure state change [ok->failed] 14 0 0 A 02/17/06 18:21:47 Bus1 Enc1 a25 Enclosure state change [ok->failed] 14 0 0 A 02/17/06 18:21:54 Bus1 Enc2 a25 Enclosure state change [failed->missing] 40 0 0 A 02/17/06 18:21:54 Bus1 Enc3 a25 Enclosure state change [failed->missing] 40 0 0 A 02/17/06 18:21:54 Bus1 Enc1 a25 Enclosure state change [failed->missing] 40 0 0 A 02/17/06 18:25:57 Bus1 Enc1 7a8 Enclosure added. [missing->ok] 1 0 0 A 02/17/06 18:25:58 Bus1 Enc2 7a8 Enclosure added. [missing->ok] 1 0 0 A 02/17/06 18:26:00 Bus1 Enc3 7a8 Enclosure added. [missing->ok] 1 0 0 B 02/17/06 19:09:05 Bus0 Enc2 a25 Enclosure state change [ok->failed] 14 0 0 B 02/17/06 19:09:05 Bus0 Enc3 a25 Enclosure state change [ok->failed] 14 0 0 B 02/17/06 19:09:05 Bus0 Enc1 a25 Enclosure state change [ok->failed] 14 0 0 Recommendation: Look for other items that may contribute to backend instability including 801 Soft SCSI Bus Errors. Look for evidence of loss of power to the enclosure(s) in question. If the enclosure is an ATA, the BCC may have panicked if the time difference between [ok->missing] and [missing->ok] is approximately ~21 seconds. If you suspect a BCC panic, obtain Yukon logs for engineering analysis. LCCs in Stiletto enclosures reporting 878 messages indicating LCC General Fault should be replaced proactively. 853/854 The LCC cables are crossed or connected to the wrong input port Description: These Configuration Manager (CM) messages occur when an LCC cable has been connected improperly. For example, connecting the cable from the SP to the Expansion Port instead of the Primary Port on the first Disk Enclosure on a CX600/700 will log this message (code 854). 
These messages can occur when a new enclosure is added to an array and is cabled incorrectly, or if the cables are reconfigured improperly on a running array. Also note that incorrectly connecting certain LCC cables on a running array can cause an SP to panic.
Example: 854 LCC cables are connected to wrong input port. Make sure LCC input cable is connected to the primary port.
Recommendation: Consult the appropriate documentation to properly connect the LCC cables. If a cabling problem is discovered on a running array, it may be best to find a maintenance window to bring down the array before correcting the cable connections.

820 Soft Media Error
Description: Soft media errors (820) indicate that a condition has already been corrected, and generally no additional corrective action should be taken. Any disk drive from any manufacturer can exhibit sector read errors due to media defects. These media defects only affect the drive’s ability to read data from a specific sector; they do not indicate general unreliability of the disk drive.

When a disk drive encounters trouble reading data from a sector, the drive will automatically attempt recovery of the data through its various internal methods. Whether or not the drive is eventually successful at reading the sector, the drive will report the event to FLARE. FLARE will, in turn, log this event as a “Soft Media Error” (event code 820) and will re-allocate the sector to a spare physical location on the drive. In the event that the drive was eventually successful at reading the sector (event code 820 with sense key 22), FLARE will directly write that data into the new physical location. If the correct sector data was not available (event code 820 with sense key 05), FLARE will attempt to reconstruct the data from a mirror image (RAID 1 and RAID 1/0) or parity information stored on other drives (RAID 5 and RAID 3).

If data is available either from the drive itself or by means of RAID reconstruction, no data is lost, normal operation continues, and the original defective physical sector will never again be accessed or cause an error. Only in the event that the sector’s data cannot be read by the drive, and the data also cannot be reconstructed via RAID (or the disk drive is part of a non-redundant RAID group, such as RAID 0), will there be data loss. If this occurs, FLARE will log an “Uncorrectable Sector” with event code 957, and a “Data Sector Invalidated” with event code 840.
The disk drives shipped by EMC incorporate predictive techniques (“SMART” technology) that will measure and report when the drive is exhibiting behavior indicative of an imminent failure. When FLARE receives a SMART notification from a drive, it will log it as a “Recommend Disk Replacement” event (event code 803). The proper service practice is to replace any drives that exhibit such 803 events.

The Sniffer is another FLARE feature that reduces the probability of a latent defective sector causing a double fault. The Sniffer courses over all LUNs in the background, detecting and re-allocating latent defective sectors before they can contribute to a double fault.

Sense keys reported with 820 events:
0x05 – A bad block was detected and remapped on the drive.
0x22 – The drive encountered an error reading data, but was able to recover using its internal ECC mechanisms. The drive transferred valid data to the SP.
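The sense-key distinction above (0x22 = drive recovered the data itself, 0x05 = FLARE had to reconstruct it) can be applied mechanically when scanning SP logs. The sketch below is a hypothetical helper, not an EMC tool; the field layout is modeled on the log excerpts in this guide, not on a documented SPlog grammar.

```python
# Classify 820 Soft Media Error entries by sense key. Hypothetical sketch:
# the whitespace-delimited layout is inferred from the log samples above.

def classify_820(line):
    """Return 'recovered', 'reconstructed', 'other', or None for a log line."""
    parts = line.split()
    # Expect: SP date time BusX EncX DskX <code> <message...> <ext> 0 0
    if len(parts) < 9 or parts[6] != "820":
        return None
    sense = int(parts[-3], 16)   # first of the three trailing fields
    if sense == 0x22:
        return "recovered"       # drive's internal ECC supplied valid data
    if sense == 0x05:
        return "reconstructed"   # FLARE rebuilt the data from mirror/parity
    return "other"
```

Run against a drive's log slice, this gives a quick split of which 820 events required RAID reconstruction, which is the population that matters for the replacement thresholds discussed later in this section.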

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 152

Page 154: 58348378 CL Troubleshooting 2ndEdition B03

EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

Recommendation: Recommendations for this class of error include both preventative measures and potential corrective actions.

1. The proper handling of disk drives after leaving manufacturing is the single best way to avoid media errors. The CLARiiON Open Systems Procedure Generator and CLARiiON documentation call out specific recommendations and cautions for handling disk drives. As drive recording density continues to increase, proper drive handling during shipping and installation is critical to avoid introducing new media errors, and it is imperative that the handling procedures be followed. Some procedures pertain specifically to handling drives within the CLARiiON lab environment, but the basic physical handling guidelines apply in all circumstances.

2. The probability of discovering soft media errors during array operation can be significantly reduced by running Background Verify on each LUN after it is created and before any data is laid down on it. This action will effectively detect and re-allocate any latent error locations that may have occurred during shipment or handling. (Any soft media errors detected during this verify operation can be ignored.) Background Verify is initiated by default when a LUN is bound; this feature can be turned off at bind time if desired, but it is recommended that Background Verify be allowed to run when a LUN is bound. Background Verify does not affect the time required to actually bind a LUN.

3. Additional defective sectors can occur during normal operation of any disk drive over its lifetime. The occurrence of soft media errors over time should not necessarily be construed as an indication of a problem with the basic reliability or remaining lifetime of the drive. CLARiiON arrays provide the above-noted Sniffer function that constantly searches for and re-allocates defective sectors before one can cause an operational problem. Please consult Primus solution emc10734 for more information on Background Verify and the Sniffer.

4. Recommended action for proactive replacement of disk drives that report “Recommend Drive Replacement” (event code 803), or for customers who demand proactive action on soft media error events (event code 820): Verify that the suspect drive has not already been faulted. Run Background Verify on all LUNs associated with the suspect drive, and allow the verifies to complete before taking any other action. Running Background Verify against that drive’s LUNs first helps reduce the possibility of data loss due to latent soft media errors that might be present on other drives in the RAID group. There is no need to run Background Verify if the drive being replaced is already faulted. It is also a good idea to run the RAID Group Health Check (rghc.pl) on the drive that is to be replaced. This script reads the SPCollect files, analyzes all other drives in the RAID group, and reports whether it is safe to replace the drive in question. Run rghc.pl for help on appropriate arguments. If the drive is not safe to remove, another action, such as backing up the data first, might be an appropriate course.

5. Recommended action when soft media errors seem to be frequent or increasing: As noted above, Sniffer constantly runs in the background to detect and re-allocate any defective sectors. These re-allocations are logged by FLARE as soft media errors, and any defective sectors encountered during normal operation are also logged as soft media errors. These logged errors generally do not indicate an abnormal condition and do not require any corrective action; allowing FLARE to function in this manner, as designed, is EMC’s recommendation in most situations. However, if more than three (3) soft media errors (specifically 820 with sense key of 05, “Bad Block”) are detected on any one drive in any 30-day period, Background Verify should be run on all LUNs associated with that disk drive. If, following the Background Verify operation, that same drive logs more than two (2) soft media errors (specifically 820 with sense key of 05) in the subsequent 30-day period (excluding any errors logged during the running of Background Verify itself), the drive should be proactively replaced. Refer to Primus emc64488 and emc96028.

803 Recommend Disk Replacement
Description: Certain drive manufacturers can report when a disk has reached its failure threshold, and this event will appear in the logs. It is usually preceded by other error messages which indicate why the disk should be replaced (e.g. “820 Soft Media Error [Can’t remap after recovered error]”, “Storage Array Faulted Bus 2 Enclosure 1 Disk 13……”).
Example:
B 10/06/05 17:37:45 Bus0 Enc0 Dsk0 803 Recommend Disk Replacement [Hardware error] 9 0 0
Recommendation: Replace the disk according to Primus emc83536.


801 Soft SCSI Bus Error
Description: A lot of information is available regarding these types of messages. In fact, there is a course that covers backend fault isolation and the architecture of the backend buses for the various CLARiiON families of products; the embodiment of that course is this document, “EMC/CLARiiON Troubleshooting”. Ultimately, Soft SCSI errors indicate that there is some type of disturbance on the bus. This disturbance may be caused by a bad transmitter on a drive, an LCC, or a cable. Unfortunately, the bad component can be hard to identify.

The backend buses in a CLARiiON CX product consist of a Fibre Channel arbitrated loop. The enclosures attached to the loop are DAE, DAE2-ATA, DAE2P or DAE4P. The DAE (Katana) extends the loop from the first drive to the next, through the last drive in the enclosure. Bus disturbances tend to be reflected up and down the topology, affecting good devices. FLARE will attempt to stem these problems by shutting down drives. Unfortunately, searching for the failing drive is not an exact science, and many times good drives are excommunicated from the bus. The DAE2P and DAE4P (Stiletto) enclosures isolate drives from each other in switched mode in an attempt to prevent one drive from affecting many. One thing to keep in mind is that if a Stiletto enclosure is added to a CX (Chameleon/Fish) DPE, DAE (Katana), or DAE2-ATA (Klondike), it operates in loop mode and does not provide drive isolation.

Recommendation: There are several recommended courses of action, depending on whether logs are available and how suddenly the problem appears.

Troubleshooting with logs
1. 801 errors with extended status 0x2 “Parity Error” and 0x2A “Bad Transfer Count”. Drives on the loop before the problem drive(s) tend to report 0x2A, and drives after the problem tend to report 0x2. Identify the last drive reporting 0x2A and the first drive reporting 0x2; the suspects include those two drives and all drives in between. Use FBI to corroborate your finding and further narrow the selection. Remove any hot spares that are not engaged. If FBI information cannot be obtained or is inconclusive, remove the last drive to report 0x2A and monitor the array to see if the 801 messages stop. Hot spares can complicate the identification process if they are involved in RAID groups where the messages are being reported.
Note: The tricky part here is to keep in mind that IBM drives report “Aborted by Device” (error code 0x11) when they really mean “Parity Error” (error code 0x2).
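The 0x2A/0x2 bracketing heuristic above lends itself to a small script once the per-drive extended statuses have been tallied. A sketch under stated assumptions: `loop` lists drives in loop order, and `errors` maps each drive to its dominant extended status — both hypothetical inputs, not a real FBI interface.

```python
# Suspects are bounded by the last drive reporting 0x2A (Bad Transfer Count,
# upstream of the fault) and the first drive reporting 0x2 (Parity Error,
# downstream of the fault), inclusive.

def suspect_span(loop, errors):
    """Return the slice of `loop` bracketed by the 0x2A/0x2 pattern."""
    last_2a = max((i for i, d in enumerate(loop) if errors.get(d) == 0x2A),
                  default=None)
    first_2 = min((i for i, d in enumerate(loop) if errors.get(d) == 0x2),
                  default=None)
    if last_2a is None or first_2 is None or last_2a > first_2:
        return []          # pattern absent or inconclusive -- fall back to FBI
    return loop[last_2a:first_2 + 1]
```

An empty result means the pattern is not present and the FBI-based steps below apply instead.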

2. If the above combination is not present, run FBI and examine output. If messages occur on an enclosure boundary and on one side only then suspect an LCC. Other indicators of LCC problems are “6c2 BE Fibre Loop Hung” and “LCC Glitch” messages.

Troubleshooting without logs Unfortunately, there are times when backend problems are severe enough to prevent gathering logs. This can occur when the Navisphere agent is degraded or one of the SPs is not operating properly.

1. Try booting the failing SP with only the boot DAE attached. Monitor boot log through serial PPP connection to make sure the array successfully boots the OS.

2. If this does not work, attempt to get the failing SP booted into degraded mode.
3. If step two does not work, replace the SP.
4. If you can put the machine into degraded mode, then drives 0_0_0 through 0_0_14 are suspect. Set HFOFF, then reboot again. Start an off-array ktcons session with ktail output to a log file. Make a copy of the flareandlayeredstart.bat file, then edit the file to include a pause command after each driver starts. Step through to see which driver fails and examine ktail for clues to the reason for the failure. Keep in mind that one SP can affect the operation of the other when backend problems are severe enough.

5. If the ktcons output does not reveal much, or you are experiencing trouble getting output, you can shut down both SPs, remove all drives but the primary boot drive for the troublesome SP, and attempt to boot. Monitor progress from ktail and the boot log. If the SP boots, proceed to add disks to the enclosure one by one, slowly, monitoring output and status. This type of process may serve to isolate the problem areas.


78b/78c Drive physically removed/inserted
Description: The Configuration Manager detects that a disk drive has left or returned to the fibre channel loop or bus. Be aware that these messages can be found in abundance in the triage log files and may not necessarily mean the drive has been removed or inserted. In fact, if a 78b and a 78c occur within seconds of each other, it can be assumed that the device has not been physically removed from and inserted into the enclosure. Instead, another event on the disk in the slot may have occurred that causes the software to report the disk as removed and/or inserted. For example, if the drive is powered down because of errors, it may appear to the software that the drive has been removed. Release 19 and beyond includes a feature where the serial number of the drive is recorded in the logs, which can help determine whether the drive has actually been replaced.
Example:
B 02/14/06 14:50:29 Bus3 Enc2 DskB 78b Drive physically removed from slot 0 0 0
B 02/14/06 14:51:12 Bus3 Enc2 DskB 78c Drive physically inserted into slot 0 0 0
Recommendation: Further investigation of the reported removal/insertion must be done before any action is taken.
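The "close proximity" check above can be automated when triaging logs full of 78b/78c pairs. A minimal sketch, assuming events have been parsed into (timestamp, slot, code) tuples; the 60-second threshold is an illustrative assumption, not an EMC-specified value:

```python
# Pair each 78b (removed) with the next 78c (inserted) on the same slot.
# A short gap suggests a software power-cycle, not a physical swap; only
# long gaps are flagged as likely real removals/insertions.
from datetime import datetime

def likely_physical_swap(events, threshold_s=60):
    """Return slots whose 78b->78c gap exceeds the threshold."""
    removed, flagged = {}, []
    for ts, slot, code in sorted(events):
        if code == "78b":
            removed[slot] = ts
        elif code == "78c" and slot in removed:
            if (ts - removed.pop(slot)).total_seconds() > threshold_s:
                flagged.append(slot)
    return flagged
```

The example pair in the text (14:50:29 removed, 14:51:12 inserted, a 43-second gap) would not be flagged, matching the guidance that such pairs usually do not indicate a physical swap.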

798 The Drive Port Bypass Circuit Status changed
Description: This message was added in FLARE release 16 to indicate when a drive is leaving the loop or attempting access. An extended status of 1 indicates the drive port bypass circuit is set and the drive is leaving the loop. An extended status of 0 indicates that a drive is attempting to regain access to the loop. The PBC can be controlled by the drive itself or by the LCC. This message is output to the log when the CM performs flaky drive handling.
Recommendation: Examine the circumstances around why a drive or drives are reporting changes in their PBC. If it is a single drive, the drive might be flaky and may need to be replaced.

799 Peer Requested Drive Power Down Description: Added in R16 to indicate the reporting SP received a message from the peer requesting drive shutdown. Recommendation: Further investigation as to the reason for the shutdown must be done before any action is taken.

6a0 Disk soft media error remapped via disk ECC Description: This message indicates that the disk has successfully remapped a bad sector using its internal ECC. Today’s high density drives have thousands of sectors available for remapping bad sectors. Recommendation: See Primus article emc64488. There is no need to consider the drive defective unless the “Recommend Disk Replacement” message appears in the triage log files.

69d/69e Recovery started/completed
Description: The message “69d A bad drive or LCC is causing hardware problems” indicates that a bad disk or LCC is causing problems on the bus. The storage system will soon remove the bad drive or LCC from service, and will then generate an 0x9-level or 0xA-level “xxx removed” message. The message “69e A bad drive or LCC is causing hardware problems” indicates the storage system has finished removing the bad disk or LCC noted in message 0x69d from service. This is an informational message that follows the 0x9-level or 0xA-level “xxx removed” message.
Recommendation: Look for numerous 63e LCC port glitch entries in Navisphere log files (TRiiAGE_Splogs.txt) to isolate a faulty LCC causing backend instability, which might then need replacement. Similarly, look for drive issues (media errors, Soft SCSI Bus errors, a18 CRU Drive Causing Loop Failure messages, etc.) to isolate a faulty drive in the loop, which might also need proactive replacement.

63e A port glitch was detected by the LCC
Description: This event indicates that the LCCs are having problems communicating with the SP, which points to a problem with one of the LCCs or a cable on this back-end fibre loop. The term “port glitch” means that a short, temporary loss of sync on the loop was detected by CLARiiON software. It may or may not result in loss of connectivity to the enclosure. There can be multiple reasons for a port glitch; port glitches may be seen before or after enclosures are added or removed. Basically, a port glitch reflects a physical change in the connectivity of the loop or a change in the software on the LCC. These changes disturb the loop so that port glitches show up in the logs. Unless there are other issues associated with the bus, they can be ignored.
Recommendation: For backend instabilities seen in the logs, the LCC on the SPA or SPB side could be suspect and prove to be a candidate for replacement. Look for other items that may contribute to backend instability, including 801 Soft SCSI Bus Errors.


SP State (Advanced Lustat)

Statistics Logging: Reports whether Statistics Logging is ENABLED or DISABLED. The SP maintains a log of statistics for the LUNs, disk modules, and storage-system caching that you can turn on and off. When enabled, logging affects storage-system performance, so you may want to leave it disabled unless you have a reason to monitor performance. You can change Statistics Logging from the General tab in the storage-system properties dialog box. Note: If Navisphere Analyzer is installed and you enable statistics logging for the storage system, Analyzer logging is also enabled.

PEER SP: The status of the peer SP as seen from this SP is reported here. This field can be used to get the status of the peer SP when we are not able to connect to it or there are issues with it. Status can be REMOVED, PRESENT or NONE (in case the product contains only one SP, e.g. AX100SC).

WRITE CACHE: Displays the state of the storage system’s write cache. Write cache can be enabled or disabled from the Cache tab in the storage-system properties dialog. The size of the write cache can be set from the Memory tab in the storage-system properties dialog box. Write cache states are:

Write Cache State  Details

INITING      Cache is initializing. This is the initial value of the cache state when powering up.

SYNCING      When the peer is powering up (determined by a peer INITING event), the running SP enters the syncing state, where the two SPs (a.k.a. boards) attempt to sync the cache RAM images. The SP remains syncing until the peer either dies or transitions to another state (i.e. ENABLING, DISABLING or DISABLED).

ENABLING     Before the cache is ENABLED, each SP enters the ENABLING state. Enabling is simply a handshake between the two caches indicating that each is ready to enable.

ENABLED**    The cache is ENABLED when all necessary caching components are enabled. The components that must be operational for a viable cache include: the peer SP is up and communicating appropriately, the vault is enabled, the BBU (Battery Backup Unit, i.e. SPS) is charged, and the fans and VSCs (power supplies) are not faulty. If one or more of these components should fail, the cache is disabled. Please note: The CX550, CX750, and CX950 family of arrays can lose one fan and still maintain a viable cache.

QUIESCING    When a required cache component fails, the cache image is backed up on the vault. Before the writes to the disk can begin, all cache RAM modifications must be stopped (due to parity encoding). This stage in the cache shutdown process is referred to as “quiescing”. The QUIESCING state is when all CAQEs (Cache Queue Elements) are being stopped. Once all active CAQEs are stopped, the cache is said to be frozen.

FROZEN       All CAQEs are frozen. Before backing up the cache image, we must wait for all CMI traffic from the other SP to stop (for the same reason as above). When the peer responds that it is also frozen (or dead), the cache image backup can take place.

DUMPING      One of the SPs is dumping the cache to the vault.

DISABLING    The cache is disabling while there are component failures and the cache is dirty. When the cache is clean (no cache-dirty LUNs), the cache is said to be disabled.

DISABLED**   The cache is disabled if there are component failures (and the cache is clean) or the operator has purposefully shut down or not enabled the cache. All component failures must be rectified before the cache can be enabled.

RECOVERING   The cache is recovering if we are caching on a single SP (non-mirrored) and the CM tells us that cache recovery is needed. The RECOVERING state is similar to the INITING state, in that the Front End (host ports) is not yet turned on, and LUNs are not yet assigned. After a successful recovery, we will transition to the DUMPING state.
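The shutdown leg of the state table can be read as a linear sequence. The sketch below encodes that sequence purely as a reading aid for interpreting a cache state seen in a TRiiAGE report; it is a hypothetical helper, not a statement of the actual FLARE state machine (which also has the SYNCING/ENABLING start-up and RECOVERING paths described above).

```python
# Shutdown path from the write cache state table: a required-component
# failure drives ENABLED -> QUIESCING -> FROZEN -> DUMPING -> DISABLING
# -> DISABLED. Illustrative sketch only.
SHUTDOWN_PATH = ["ENABLED", "QUIESCING", "FROZEN", "DUMPING",
                 "DISABLING", "DISABLED"]

def states_remaining(current):
    """Stages left before the cache reaches DISABLED, per the table."""
    if current not in SHUTDOWN_PATH:
        raise ValueError("not a shutdown-path state: %s" % current)
    return SHUTDOWN_PATH[SHUTDOWN_PATH.index(current) + 1:]
```

For example, a cache observed in FROZEN still has the vault dump and disable stages ahead of it, which is why a live re-check is advised before acting on an interim state.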


**: State should be either ENABLED or DISABLED. The other states are interim conditions in which the cache can be found. If the TRiiAGE Analysis report indicates that the cache is in one of these momentary states, it is best to check the live status of the array before taking any action. The array should not be “stuck” in any of these states for an extended period of time; if it is, a detailed examination of the array is in order, and escalation to the Crisis Team may be appropriate.

READ CACHE: Displays the status of the SP read cache. Each SP has a read cache in its memory, which is either enabled or disabled; the read cache on one SP is independent of the read cache on the other SP. Storage-system read caching for an SP must be enabled before read caching can be enabled for any given LUN. You can enable or disable an SP’s read cache from the Cache tab in the storage-system properties dialog, and set its size from the Memory tab in the storage-system properties dialog box. Status can be DISABLING, ENABLED, DISABLED or UNKNOWN.

A: DP 50% TOTAL 122751 DIRTY 62251
  TOTAL: Total write cache page count on the SP
  DIRTY: Write cache dirty page count
  DP: The dirty page percentage (= DIRTY/TOTAL)
B: TOTAL 122752
  Total write cache page count on SP B
U: DP 00% TOTAL 0000
  TOTAL: Write cache unassigned page count
  DP: (Unassigned/Total) % (from the code, this looks like it will always be 0)
Requests Complete: 209382809
  Number of completed host requests.
SPS A: OK SPS B: OK
  These fields report information on the SPS (Standby Power Supply) and the SPS configuration.

SPS Status                    Reported as
SPS = SPS_BAT_OK
    OK                        OK
    Unknown Config            OK
    Invalid Power Cable1      OK
    Invalid Power Cable2      OK
    Invalid Serial Cable      OK
    Invalid Multiple Cables
SPS = SPS_TESTING             TE
SPS = UNIT_NOT_PRESENT        --
SPS = UNIT_FAILED             FLT NR

SPS status and configuration require verification if they are not reported as “OK”.

Ktcons lustat
The table below is an example of an enhanced version of the typical LUSTAT (a.k.a. Advanced LUSTAT) output reported in the SPx_cfg_info.txt file. Data is collated from three different files to form the basis for the output seen in the TRiiAGE analysis report. These input files are SPx_cfg_info.txt, *.drt, and SPx_metalun_info.txt. The *.drt file was added to SPCollects for releases 13 and beyond. For older releases (< R13), the Advanced LUSTAT output will not be produced and recorded in the TRiiAGE Analysis report, because Advanced LUSTAT uses the *.drt file to obtain the FLU to ALU mappings. Consider looking at the SPx_cfg_info.txt file for a basic lustat listing when dealing with these older releases.


Advanced Lustat version 1.53

MLU ALU FLU RGRP ENC TYPE     P LD CAPACITY CAC DEFOWN STATE    NAVIFRUS
-   0   0   7    ATA RAID-5   N -  476.0 GB RW- SP-A   ENA:PEER 0.3.0 0.3.1 0.3.2 0.3.3 0.3.4 0.3.5 0.3.6 0.3.7 0.3.8 0.3.9 0.3.10 0.3.11 0.3.12 0.3.13
-   1   1   5    FC  RAID-5   N -  100.6 GB RW- SP-A   ENA:PEER 0.2.0 0.2.1 0.2.2 0.2.3 0.2.4
-   205 2   205  ATA HotSpare Y -  297.0 GB --- SP-A   ENA      1.4.14
-   192 3   7    ATA RAID-5   N -  150.0 GB RW- SP-A   ENA      0.3.0 0.3.1 0.3.2 0.3.3 0.3.4 0.3.5 0.3.6 0.3.7 0.3.8 0.3.9 0.3.10 0.3.11 0.3.12 0.3.13
-   135 4   18   ATA RAID-5   N -  1.8 TB   RW- SP-A   ENA:PEER 1.4.0 1.4.1 1.4.2 1.4.3 1.4.4 1.4.5 1.4.6

Following are the details on the columns reported in the output. LUNs are listed along with the Navi LUN (ALU), FLARE LUN (FLU) and MetaLUN (listed if the LUN is a component LUN) numbers. Mapping of FLU<->ALU is provided via WWN matching if the .DRT file is available; otherwise TRiiAGE is forced to perform the match using the GETRG output (which can be unreliable).
MLU : MetaLUN number
ALU : Navi LUN number
FLU : FLARE LUN number
RGRP : RAID group number
ENC : Reports the enclosure type on which the LU is bound. This can be FC, ATA, ST2 (Stiletto 2G) or ST4 (Stiletto 4G).
TYPE : RAID type; this can be: Ind-Disk, RAID-0, RAID-1, RAID-10, RAID-3, RAID-5, HotSpare
P : Indicates a private LU; reports “Y” if the LU is private. For example, hot spares, MetaLUN components, Snap cache LUNs and Clone Private LUNs are all reported as private LUNs.
LD : Reports whether there are any layered drivers in the LU stack. Only the first device in the stack is listed. For example, if the stack for “LOGICAL UNIT NUMBER 46” contains K10RollBackAdmin, K10FarAdmin and K10SnapCopyAdmin in that order, only K10RollBackAdmin is reported here, as RB.

LD Abbreviation   Layered Driver
SC                SnapCopy Admin
RM                Remote Mirror Admin
AG                Aggregate (MetaLUN) Driver Admin
CL                Clone Admin
AM                Asynchronous Mirror
WIL               Write Intent Log
SCL               Snap Cache LUN
RB                Roll Back Admin

CAPACITY : Reports LUN capacity.
CAC : Reports the LU read and write cache state.
DEFOWN : Reports the default owner for the LU.
STATE : Reports the LUN status.
NAVIFRUS : Reports the drives on which the LU is bound, in B.E.D format.

Note: FRU order in the case of RAID 10 is: P1 P2 P3 S1 S2 S3
Some important notes:
1. WIL = Write Intent Log and CPL = Clone Private LUN will not assign after failure if the array FE cables are disconnected. Reference the Layered Product section for which LUNs make up the WIL and CPL.
2. ATA guideline emc95538 recommends that disks in the same RG be assigned to one SP.
3. ALUSTAT is LUSTAT with the ALU column appended. The ALU is determined using the WWN for R13 and greater; this mapping will always be correct. WARNING: The ALU is determined using Navi commands for R12 and earlier. It should usually be okay, but do not rely on it for critical analysis.
4. The default owner for a component of a MetaLUN is not important. For example, if a MetaLUN has 3 components, each component can have a different default owner assigned. The Aggregate driver ensures that all components are assigned to the same SP, i.e. the current owner for all components of a MetaLUN will be the same.
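When scripting over SPCollect data, one Advanced Lustat row can be split into its columns mechanically. The sketch below is a hypothetical parser whose column split is inferred from the sample output shown earlier, not from a documented format; NAVIFRUS is treated as the variable-length tail of B.E.D drive positions.

```python
# Parse one Advanced Lustat row (whitespace-delimited, as rendered in the
# TRiiAGE report sample). Illustrative sketch only.
def parse_lustat_row(row):
    f = row.split()
    return {
        "mlu": f[0], "alu": int(f[1]), "flu": int(f[2]), "rgrp": int(f[3]),
        "enc": f[4], "type": f[5], "private": f[6] == "Y", "ld": f[7],
        "capacity": " ".join(f[8:10]),   # value + unit, e.g. "297.0 GB"
        "cac": f[10], "defown": f[11], "state": f[12],
        "navifrus": f[13:],              # B.E.D positions, variable length
    }
```

For instance, parsing the hot spare row from the sample output yields ALU 205 bound on drive 1.4.14 and flagged private.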


Ktcons Vpstat
Reports verify information for LUNs.

> !vpstat
Summary of Verifies:
     RAID             Sniffing             Verify   Percent  Sniff  BV    Total
LUN  Group  Type      State     Capacity   Type     Complete Rate   Time  Passes
---- -----  ------    --------  ---------  -------  -------- -----  ----  ------
0    7      RAID-5    Enabled   476.0 GB   ------            10     0     24
1    5      RAID-5    Enabled   100.0 GB   ------            10     0     44
2    205    HotSpare  Enabled   297.0 GB   ------            254    0     34
3    7      RAID-5    Enabled   150.0 GB   ------            10     0     4
4    18     RAID-5    Enabled   1782.0 GB  sniff    48       10     0     24
5    19     RAID-5    Enabled   1782.0 GB  peer:sn  0        10     0     9
6    7      RAID-5    Enabled   100.0 GB   ------            10     0     4

Sniffing State: Reports whether Sniff is Enabled or Disabled for the LU. In an ideal scenario, all LUNs should report a Sniffing State of “Enabled”. This can be changed with the setsniffer navicli command.
Verify Type: Reports the type of verify running on the LU. Values can be:
  sniff – We are doing a sniff verify.
  BV – We are doing a background verify.
  peer:sn – Peer doing a sniff verify.
  peer:bv – Peer doing a background verify.
Percent Complete: Reports the verify percentage complete for the LU.
Sniff Rate: Specifies the rate at which sniffs are executed, in 100-ms units (100 ms per sniff verify IO). Valid values are 1 through 254.
BV Time: Reports the checkpoint verify time in seconds.
Total Passes: Reports the number of verify passes for a given unit.
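Because idle LUNs print "------" for Verify Type and omit the Percent Complete column, vpstat rows have a variable column count. The sketch below is a hypothetical parser handling both shapes; the layout is inferred from the sample output above, not documented.

```python
# Parse one vpstat summary row. Idle rows ("------" verify type) have no
# Percent Complete field, so the tail is parsed conditionally.
def parse_vpstat_row(row):
    f = row.split()
    lun, rgrp, rtype, state = int(f[0]), int(f[1]), f[2], f[3]
    cap = " ".join(f[4:6])                  # value + unit, e.g. "476.0 GB"
    tail = f[6:]
    if tail[0] == "------":                 # no verify running on this SP
        vtype, pct, rest = None, None, tail[1:]
    else:
        vtype, pct, rest = tail[0], int(tail[1]), tail[2:]
    rate, bv_time, passes = (int(x) for x in rest[:3])
    return dict(lun=lun, rgrp=rgrp, rtype=rtype, sniff=(state == "Enabled"),
                capacity=cap, vtype=vtype, pct=pct, rate=rate,
                bv_time=bv_time, passes=passes)
```

Filtering the parsed rows for sniff=False gives a quick list of LUNs that violate the "all LUNs should report Enabled" expectation noted above.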

Critical Issues Information
The critical issues list is taken directly from a text file generated by the CAP dispatcher and displayed in TRiiAGE. This list provides an at-a-glance overview of the health of the array, faulted components, and components that could potentially cause problems in the future. Most, if not all, of the information listed here is also reported in other sections of the TRiiAGE report. This section simply provides a concise summary of areas in the array that might need attention.
Caution: For all array problems, it is recommended that an in-depth analysis of the information provided in the SPCollect zip files be performed before taking any action recommended in this section.

FCOScan
FCOScan is a utility that was written by CLARiiON Sustaining Engineering and is run as part of TRiiAGE. Its purpose is to call out important array components that are known to be covered by an FCO. The types of information include drives that were mis-zeroed at the factory, drives that have performance issues, old revisions of FLARE that desperately need to be upgraded, etc. The problems that these FCOs address can be sensitive in nature and as such are encoded (i.e. AMSPR) and are to be used by qualified support personnel only. Deciphering the codes and determining a potential course of action can be accomplished by contacting your local CTS or escalating to engineering.


Displaying Coherency Error Count
A count of the coherency errors reported in the log file is indicated here for quick reference. In a RAID 5 RAID group, coherency errors occur when the data and parity checksums match but the written parity does not match the parity calculated from the data. The system recovers from this situation by calculating the parity from the data and then rewriting the parity information. In a RAID 1/0 RAID group, coherency errors occur when the data on the primary does not match the data on the secondary, even though the data checksums are calculated to be correct. Regardless of the RAID type, coherency errors do not bode well for the consistency and health of customer data. Cases of this nature should be escalated to engineering for review.
Example:
DISPLAYING COHERENCY ERROR COUNT
These may cause host corruption.
COH will be reported against the parity disk on R5 and R3 RGs.
COH will be reported on the secondary disk of an R1 or R10.
Look for a data disk in the same RG logging CRC errors. That disk indicates which slot is bad on the BCC SI Athena chip.
Better yet, run dsf.pl. It does all of these checks for you.
COH count: 115
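The cross-check described in the example (COH lands on the parity or secondary disk, but the truly suspect disk is a data disk in the same RG logging CRC errors) can be expressed as a small lookup. This is a hypothetical sketch over pre-tallied inputs, not a substitute for dsf.pl, which the text recommends for the full set of checks.

```python
# For each RAID group that logged COH errors, list its member disks that
# also log CRC errors -- those are the disks the COH entries implicate.
# Inputs are illustrative: coh_rgs (RGs with COH), crc_by_disk (CRC counts),
# rg_members (RG -> member disk list).
def implicated_disks(coh_rgs, crc_by_disk, rg_members):
    out = {}
    for rg in coh_rgs:
        hits = [d for d in rg_members.get(rg, []) if crc_by_disk.get(d, 0) > 0]
        if hits:
            out[rg] = hits
    return out
```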

RAID Group Error Summary Information
This section provides an at-a-glance summary of the health of various RAID groups and their drives that are reporting some combination of SCSI bus errors (communication issues) and media issues. This table can help focus attention on a specific RAID group or groups when performing initial triage. Also, refer to this table after identifying a specific drive or drives, to verify that you have covered all of the issues that might be present within the RAID group before making any recommendations. It is advisable to run the RAID Group Health Check (rghc) utility after identifying a drive or drives for replacement. RAID Group Health Check provides more details to determine whether the drive is safe to remove, whether the drive has already been replaced, or whether more specific actions must be taken. The following table lists the column headings found in the RAID Group Error Summary section of TRiiAGE and the messages from the Navi log that make up the counts displayed in the report.

Column Header | Error Message | Extended Status
Hard Media Errors | 920 Hard Media Error; 901 Hard SCSI Bus Error | Any
Soft Media Errors | 820 Soft Media Error | All except 3a – Sector remapped after recovered error
PFA & Drive HW | 801 Soft SCSI Bus Error; 803 Recommend Disk Replacement | 9 – Hardware Error; 19 – PFA Threshold Reached
Remapped Sectors | 801 Soft SCSI Bus Error; 803 Recommend Disk Replacement; 820 Soft Media Error | 4 – Remap timeout; 3b – Can't remap after recovered error; 3c – Remap failed after recovered error; 3d – Can't remap after media error; 3e – Remap failed after media error
Transfer Errors | 801 Soft SCSI Bus Error; 820 Soft Media Error | 2a – Bad Transfer Count; 2c – Data Overrun or Underrun
Timeout Errors | 801 Soft SCSI Bus Error; 820 Soft Media Error | 4 – Remap Timeout; 6 – Command Timeout; 7 – Select Timeout
Parity Errors | 801 Soft SCSI Bus Error; 820 Soft Media Error | 2 – Parity Error; 30 – Parity Error; 32 – Status Parity Error
Bad Blocks | 801 Soft SCSI Bus Error; 820 Soft Media Error | 5 – Bad Block
Invalidated Sectors | 694 Parity Invalidated; 840 Sector Invalidated; 956 Parity Invalidated | Any
Reconstructed Sectors | 684 Parity Sector Reconstructed; 689 Sector Reconstructed | Any
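As a sketch of how these counts could be assembled, the table above can be encoded as a mapping from (error message code, extended status) pairs to summary columns and tallied. Only two columns are shown, and the (code, status) tuple format is a simplification for illustration, not the actual TRiiAGE parser:

```python
# Hypothetical sketch: tallying Navi log entries into the summary columns above.
# Log entries are assumed to be (error code, extended status) string tuples.
from collections import Counter

# A partial encoding of the table above: column -> {(error code, extended status)}
COLUMNS = {
    "Timeout Errors": {("801", "4"), ("801", "6"), ("801", "7"),
                       ("820", "4"), ("820", "6"), ("820", "7")},
    "Bad Blocks":     {("801", "5"), ("820", "5")},
}

def summarize(entries):
    """Count log entries per summary column using the mapping above."""
    counts = Counter()
    for code, status in entries:
        for column, matches in COLUMNS.items():
            if (code, status) in matches:
                counts[column] += 1
    return counts
```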

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 160


DISKS SENSE DATA from SP*_System.evt files

SCSI sense data reported by the target (disk) is retrieved by the initiator (SP) and recorded in the Windows system event log. SCSI sense data consists of a "Sense key", an "Additional sense code" and an optional "Additional sense code qualifier". Consult Primus emc111551 for a detailed description of SCSI status codes and links to the appropriate T10 SCSI specifications. The instructions below describe how to format this information in a readable form to help troubleshoot drive problems.

Please keep in mind that the information in this file is cumulative and currently not sensitive to drive replacement or the passage of time. If you select a drive for replacement based on the information in this file, you must first verify that the drive in the array is really bad and has not already been replaced before making any recommendations. This information and the resulting table help identify potential disk problems at a finer granularity than what is entered in the SP logs. Consult an ANSI specification for generic sense key definitions.

The output is TRiiAGE_SP*_disks_sense_data.csv and needs some manipulation to produce a nicely formatted Excel output. There are two mutually exclusive options that can be used to format this information. Option 1 utilizes a macro, assumes access to the _support_tools directory on \\susfs\sustaining, and produces both a chart and a pivot table of SCSI sense data; Option 2 only generates a pivot table. Both options are included here in case the macro is unavailable.

Option 1:
1. Open the TRiiAGE_SP*_disks_sense_data.csv file using the Microsoft Excel spreadsheet application.
2. From the menu, click Tools->Macro->Visual Basic Editor (or press Alt-F11).
3. Set focus to the Microsoft Visual Basic editor screen. From the menu, select File->Import File.
4. In the Import File dialog box, browse to the _support_tools directory on \\susfs\submittals and select the file named SenseData.bas. Minimize the Visual Basic editor screen for now.
5. Set focus back to the Microsoft Excel spreadsheet and select Tools->Macro->Macros from the menu.
6. In the Macro dialog box, make sure "Sensedata_Formatter" is highlighted and press the Run button. At this point, one can view disk drive SCSI sense data using the chart or pivot table.
Note: This option is unavailable to non-EMC personnel. Internally, it is only available to EMC TS2 and CLARiiON LRS personnel.

Option 2:
1. Open the .csv file using Excel. Select all, then double-click the vertical bar between columns "A" and "B" for auto-fit.
2. Create a pivot table from row 2 to the end, columns A through I. Highlight the columns and all of the rows in advance.
3. Drop the 'bus', 'enc', 'fru' labels in the row fields and 'ascq', 'asc', 'sk' in the column fields. Drop in the order listed.
4. Drop the 'timestamp' field in the data fields area.
5. Save as an .xls workbook.
6. Use a disk interface specification to determine the sense data definitions.
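The pivot built in Option 2 (rows bus/enc/fru, columns sk/asc/ascq, counting timestamps) can also be reproduced outside Excel. This pure-Python sketch assumes the .csv rows have already been parsed into dicts keyed by the column headings referenced above; the record layout is an assumption for illustration:

```python
# Pure-Python sketch of the Option 2 pivot: count sense-data records per
# (bus, enc, fru) row and (sk, asc, ascq) column, as Excel does with the
# 'timestamp' field dropped into the data area.
from collections import defaultdict

def pivot(records):
    """records: dicts with bus, enc, fru, sk, asc, ascq, timestamp keys."""
    table = defaultdict(int)
    for r in records:
        row = (r["bus"], r["enc"], r["fru"])
        col = (r["sk"], r["asc"], r["ascq"])
        table[(row, col)] += 1          # count of timestamps, as in step 4
    return dict(table)
```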


FBI Error Information

General operation: The default poll is once every hour. The maximum number of polls captured is 10; the history always saves the first 5 and the last 5 deltas. The previous session is always saved, but the session before that is overwritten at the start of every new session. Both SPs log the same data in case one bus cannot be accessed; this way the peer SP can still read the RLS data from the disks.

Errors that CAN be ignored:
1. Two or three "RLSException" errors logged on all busses after an NDU operation.
2. Flare removed or faulted a drive in the hour prior to an "RLSException".
3. A hot spare was swapped in or out in the hour prior to an "RLSException".

Errors that SHOULD be looked at in more detail using the FBI utilities against the .rls files:
1. A bus has more than 5 "RLSException" errors logged against an SP bus in a 24-hour duration. This usually indicates a consistent backend issue. Obtain the SPCollects that contain the log files covering the days the "RLSException" errors were logged, and check the .rls files in detail. Multiple SPCollects might be needed to put together a timeline for diagnosability.

************ Bus 0 errors ************
---------- SPA_RLSCOLLECTORLOG.TXT
12-29-2005_23-35-43>RLSException@BUS0; Total Delta: 1942
12-30-2005_03-35-58>RLSException@BUS0; Total Delta: 731229
12-30-2005_06-36-10>RLSException@BUS0; Total Delta: 730518
01-04-2006_06-33-30>RLSException@BUS0; Total Delta: 564
01-04-2006_10-33-44>RLSException@BUS0; Total Delta: 10
---------- SPB_RLSCOLLECTORLOG.TXT
12-29-2005_23-36-46>RLSException@BUS0; Total Delta: 1879
12-30-2005_01-36-52>RLSException@BUS0; Total Delta: 63
12-30-2005_03-36-59>RLSException@BUS0; Total Delta: 731229
12-30-2005_06-37-09>RLSException@BUS0; Total Delta: 730518
01-04-2006_07-34-26>RLSException@BUS0; Total Delta: 564
01-04-2006_10-34-36>RLSException@BUS0; Total Delta: 10
************ Bus 1 errors ************
---------- SPA_RLSCOLLECTORLOG.TXT
12-24-2005_21-19-15>RLSException@BUS1; Total Delta: 341
01-01-2006_22-30-28>RLSException@BUS1; Total Delta: 728094
01-04-2006_06-33-31>RLSException@BUS1; Total Delta: 6010
---------- SPB_RLSCOLLECTORLOG.TXT
12-24-2005_21-20-30>RLSException@BUS1; Total Delta: 15
01-01-2006_22-31-24>RLSException@BUS1; Total Delta: 728094
01-04-2006_06-34-24>RLSException@BUS1; Total Delta: 3425
01-04-2006_07-34-27>RLSException@BUS1; Total Delta: 2585
************ Bus 2 errors ************
---------- SPA_RLSCOLLECTORLOG.TXT
---------- SPB_RLSCOLLECTORLOG.TXT
************ Bus 3 errors ************
---------- SPA_RLSCOLLECTORLOG.TXT
---------- SPB_RLSCOLLECTORLOG.TXT
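The 24-hour rule above can be sketched as a scan of RLSCollector log lines. The line format is taken from the example output; the sliding-window check flags any bus with more than 5 exceptions inside a 24-hour window (this is an illustrative helper, not an FBI utility):

```python
# Sketch: flag a bus when more than 5 RLSException entries fall within any
# 24-hour window, using the RLSCollector log-line format shown above.
from datetime import datetime, timedelta
import re

LINE = re.compile(r"(\d{2}-\d{2}-\d{4}_\d{2}-\d{2}-\d{2})>RLSException@(BUS\d)")

def suspect_buses(lines, threshold=5, window=timedelta(hours=24)):
    stamps = {}
    for line in lines:
        m = LINE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%m-%d-%Y_%H-%M-%S")
            stamps.setdefault(m.group(2), []).append(ts)
    flagged = set()
    for bus, times in stamps.items():
        times.sort()
        # more than `threshold` entries inside one sliding 24-hour window
        for i in range(len(times) - threshold):
            if times[i + threshold] - times[i] <= window:
                flagged.add(bus)
                break
    return flagged
```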


YUKON Log Analysis

The term "YUKON logs" refers to the log files that can be obtained from BCC controllers using commands described in Primus emc117216. YUKON logs are only available from the BCC controllers in an ATA (Klondike) enclosure. This section is primarily for engineering use to obtain and analyze diagnostic information from the BCCs. However, there are two things to look for when considering replacement of the BCCs. First, if you see the text "ECC error overflow" in the bugcheck entries (below), proactively recommend replacing both BCCs in the enclosure. In the example below, both BCCs would be replaced for enclosure 9, "SPA_yukon_log_encl_9_03_12_2006_22_06.evl". Second, replace both BCCs for a given enclosure if there are more than 50 correctable ECC errors reported for a given BCC. The YUKON log analysis will be listed in the TRiiAGE report when YUKON logs are available in the SPCollect files. If no YUKON logs are available for a given case, the report will not be seen in TRiiAGE.

****************************************************************************************
"YUKON log(s) bugcheck entries and other error entries by count:"
Yukon timestamps are based on Taiwan time, which is GMT+8, so the logs are 12 hours ahead of EDT or 13 hours ahead of EST.

a. A log will only display the last bugcheck. Replace the BCC if there is an "ECC error overflow":

YUKON\SPA_yukon_log_encl_9_03_12_2006_01_22_06.evl:<Msg Type="GPR_Event_Properties" Tag="1"><Event_Properties Number="0" Type="System" Date="3/10/06" Time="4:11:04 PM" Category="Informational" Owner="Controller A"/><Event_Data Event="Normal power up (V1.86.0)" Bugcheck_Date="3/10/06" Bugcheck_Time="4:10:48 PM" Error_Code="108" Process_Name="Event Log" Module_Name="Globals" Line_Number="1603" Information="ECC error overflow"/></Msg>

b. Entry counts indicate CORRECTABLE ECC errors. Replace a BCC if the error count is over 50:

YUKON\SPA_yukon_log_encl_9_03_12_2006_01_22_06.evl:186

- Entry counts that all indicate something wrong with the FC ports between the BCCs. These are only being tracked for now; do not replace a BCC based on this information alone:
  Imm Notify for a cmd still in REQ ring
  Duplicate CTIO received
  TIP Initiator port status, Data Underrun
  ISP ABORT acknowledged
- Entry counts that indicate [Medium Error], which translate to UNRECOVERED READ error check conditions [03/11] returned to Flare. These entries should match the SP log entries.
- Entry counts that indicate [Uncorrectable medium error], which translate to UNCORRECTABLE errors from the disk.
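The two BCC replacement checks above can be sketched as a filter over parsed YUKON log entries. The (filename, bugcheck text, correctable ECC count) record layout is an assumption for illustration, not the actual TRiiAGE format:

```python
# Sketch of the two BCC replacement checks above: an "ECC error overflow"
# bugcheck means replace both BCCs in the enclosure; a correctable ECC
# count over 50 means replace that BCC.
def bcc_recommendations(entries, ecc_limit=50):
    """entries: (filename, bugcheck_text, correctable_ecc_count) tuples."""
    actions = []
    for filename, text, count in entries:
        if "ECC error overflow" in text:
            actions.append((filename, "replace BOTH BCCs in this enclosure"))
        elif count > ecc_limit:
            actions.append((filename, "replace this BCC (correctable ECC > 50)"))
    return actions
```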


SPCollect Information

This section simply lists when an SPCollect was initiated on a given SP. This is useful when seeking information specific to a point in time that may not be in the most recent SPCollect files. This information typically includes ktraces, which wrap quite frequently and are very helpful during the RCA process. The messages do not indicate whether the actual SPCollect zip files are still on the array, but they do give the engineer the ability to request specific files from the field.

SPQ

General Description: SPQ (SP Qualifier) is a program designed to help diagnose the state of an SP that may be a target for replacement. The utility checks the overall state of the array and makes a recommendation as to whether an SP needs to be replaced. This can help avoid unnecessary SP replacements and in some cases facilitate remote diagnosis. SPQ is currently targeted for CX200 – CX700 FC arrays only. The program performs remote analysis from a management station; however, if an SP is stuck in the boot process, SPQ can be run from a CE's laptop connected serially to the SP. This tool is created and maintained by Midrange Sustaining Engineering. Report all issues to [email protected].

• Detects whether an SP is degraded, hung, dead or experienced a reboot/bugcheck and reports I/O state. • Determines whether a hardware problem or a software problem is keeping the SP from functioning normally.

In the case of an SP replacement, SPQ prompts service personnel to return the replaced SP, the SPQ report and the SPCollects to manufacturing for FA (Failure Analysis). Support personnel should submit SPQ reports to the next support level if the problem persists.

SPQ Prerequisites

• Software – A compatible revision of NaviCLI is installed on the service laptop or management station.
• Failover – SPQ assumes that failover is properly configured and there are no problems with HA. If not, SPQ will prompt the user to trespass the LUNs to the peer SP prior to further diagnosis. Also, SPQ assumes that NaviCLI is installed on the CE's laptop.
• Hardware – Serial cable(s) must be connected to the SP for a full diagnosis. SPQ performs a major part of its analysis remotely from the management station. However, under certain situations SPQ needs to be executed from a CE's laptop using serial connectivity. For this reason, and because an SP may require a reseat, a CE must be present to execute a complete SPQ test sequence.
• Connection – While SPQ is running, do not try to initiate a PPP connection manually. SPQ initiates its own PPP connection.
• Ktcons – While SPQ is checking an SP, do not start a remote Ktcons session for that SP from any machine. SPQ does not support more than one Ktcons session at a time and may hang if a Ktcons session is already established.

How SPQ Works SPQ gathers the following statistics about both of the SPs:

• Is an SP consistently pingable for a certain period of time? • Is the NaviCLI getagent command returning valid SP information? • Is the SP capable of servicing I/O?
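The first of these statistics (consistent pingability over a period of time) can be sketched as repeated probing. `probe` is a stand-in for a single ping or NaviCLI check; SPQ's actual implementation is not shown here:

```python
# Illustrative sketch of SPQ's first statistic: is the SP consistently
# pingable over a series of probes? `probe` is a hypothetical callable that
# returns True on a successful ping attempt.
def consistently_pingable(probe, attempts=10):
    results = [probe() for _ in range(attempts)]
    if all(results):
        return "pingable"
    if any(results):
        return "partially pingable"   # one of the SP states SPQ reports
    return "unpingable"
```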


SPQ relies upon the following existing array software to determine the SP state:

• Ktcons to capture remote Ktraces.
• NaviCLI commands to perform the analysis from a management station.
• SPQ captures SP output during boot time using serial connectivity to the SP. SPQ has been implemented with a PPP connection to the SP to retrieve NaviCLI information when a CE is not allowed to access a customer network.
• SPQ uses documented POST errors to detect hardware issues with the SP.

SP Issues/Problems detected by SPQ

• Unpingable SP and not servicing I/O
• Pingable SP but not servicing I/O
• Partially pingable SP
• Unpingable SP but servicing I/O
• Pingable SP, servicing I/O but Navi agent unmanaged
• ECC errors, Machine Check Exceptions and bugchecks on the SP

Single SP Operations

SPQ can also be used to analyze the state of a single SP. Input a single SP IP address; SPQ then treats the second SP as a dummy SP. If the single SP is not pingable at all then, as in the dual-SP scenario, SPQ will first prompt the user to check for customer network issues. Additional details on SPQ can be found in the Storage Processor Qualifier (SPQ) User Guide, available as SPQualifier_User_Guide.doc in the SPQ installation directory.


Section 5 General Troubleshooting and Information

This section describes some basic procedures for an "unbootable" SP on a CX Series or CX3 Series array. The same basic guidelines apply whether the array has been running normally, an SP has already been re-imaged in an attempt to get it up, or a platform upgrade (conversion) is in progress.

Note: This section contains information written for an engineering/sustaining environment, including some information that may NOT be suitable for customer environments. The information should not be distributed, and is not intended to replace or supersede official product support documentation.

Private Space Reference

The following diagram shows the locations of the Flare Boot Partitions and Utility Partitions in private space. The diagram is not drawn to scale. For more details, see the following specifications:

[Diagram: disks 0 through 4 hold the Flare Boot Partitions (SPA Primary, SPB Primary, SPA Secondary, SPB Secondary), the Utility Partitions (SPB Primary, SPA Primary, SPB Secondary, SPA Secondary), two Image Repositories (CX3 Series only), and an Image Repository (all CX and CX3).]

For CX200/400/600 and CX300/500/700 platforms


[A corresponding diagram applies for CX3-20, CX3-40, CX3-80 platforms.]

“SP Will Not Boot” When an SP is “unmanaged” or “unbootable”, there are several possible root causes. The word “boot” has many meanings, so it is important to determine exactly where the boot process failed. See further down in this section for additional information about the boot process. Note: Re-imaging does not solve all problems, and can leave the SP in a worse state than when you started. The current policy is that EMC Technical Support Level 2 or EMC Engineering or both should be contacted before any data-in-place re-imaging operations are performed on CX-series arrays.


First Steps To Try

Always start with Primus solution emc111000, which contains a link to troubleshooting trees for all CX arrays. Much of this section is derived from those trees; this section is designed only as a "quick reference".

1. Ping the SP. Use "ping <ip address> -t" at a command prompt and allow it to run for several minutes in case the array is in a reboot loop. Make sure the SP network cable is connected correctly.
   a. If the SP is pingable, the OS image for the SP is probably OK. Do NOT re-image. The SP may be degraded. Try to access the SP using EMC Remote.
      i. If you can access the SP using EMC Remote, debug the SP as a Degraded SP.
      ii. If you cannot access the SP using EMC Remote, you should always try to Force Degraded Mode.
   b. If the SP is not pingable, try to establish a serial (PPP) connection to the SP. Note that if the SP was recently re-imaged, the IP address may not have been restored from the PSM after re-imaging if the SP booted in Degraded Mode.
2. [CX3 Series only] Check the system event log on the PEER SP for "peer boot logging" messages. Those logs may indicate a hardware problem. The peer SP will also log if the local SP is in Degraded Mode. Note that Flare must be running normally on the peer SP, or these messages will not be logged.
3. Check the SP Fault light and Extended POST output. Check the amber SP Fault LED located on the air dam of the SP. If the SP is running normally, this LED is turned off. If the LED is off or on "solid", re-imaging is unlikely to solve the problem. You should also collect the Extended POST output available from HyperTerminal. This is critical information in some situations.
4. Try to get the SP into Degraded Mode (see Primus solution emc76039 for an important note about older Flare revs). If the SP can boot in Degraded Mode and becomes pingable, do not re-image the SP. Debug the SP as a Degraded SP.
5. Check the health of the PEER SP. What is the state of the peer SP? Is it running normally, in Degraded Mode, or pingable? If SPCollects from the peer SP are available, check for disk and/or backend loop issues that could be affecting the local SP. Re-imaging one SP while the peer is also having problems is NOT recommended, since this could make the situation worse!
   a. For example, if the peer SP is "degraded", the PSM may have a problem and be inaccessible, which could cause a re-image of an SP to leave the SP in an unmanaged state, because the re-imaged SP would be unable to access its IP address and other configuration information stored in the PSM.
6. Look for a hardware problem. If the SP cannot boot even in Degraded Mode, and the peer SP is running normally, there is either a hardware problem or re-imaging is needed. You must try to determine whether a hardware problem (SP/disk/cable/enclosure) exists before trying to re-image. Check that backend zero is connected properly and that there are no fault lights on the boot drives. Note that SPA boots from drives 0 and 2; SPB boots from drives 1 and 3. Try to isolate the problem by reducing the configuration (fewer enclosures and drives).
7. If the SP cannot boot in Degraded Mode, and no hardware-related problems are found, then re-imaging is the only option left. Only do so under the direction of EMC Technical Support Level 2 or EMC Engineering.
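Step 1's long-running ping can be post-processed to spot a reboot loop. This sketch (not an EMC tool) treats repeated flips between reachable and unreachable as the signature of a reboot loop; `replies` is a recorded sequence of per-probe ping successes, an assumed input format:

```python
# Sketch of step 1 above: after running "ping -t" style probing for several
# minutes, an SP in a reboot loop shows alternating runs of replies and
# timeouts rather than being steadily up or steadily down.
def looks_like_reboot_loop(replies, min_transitions=4):
    """replies: list of booleans, one per ping attempt."""
    transitions = sum(1 for a, b in zip(replies, replies[1:]) if a != b)
    return transitions >= min_transitions
```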


CX Boot Failure Modes

Note: Primus solution emc111000 contains the following matrix, which is suitable for the field and should be consulted during escalations. Primus solutions emc76039 and emc66446 also contain some tips in this area.

Additional Notes:

Both SPs failing – If both SPs of an array are failing, it is most likely not bad SP hardware but rather a software (database) problem or a backend issue. Do not swap SP hardware or re-image. The SPs are most likely in a hung or degraded state (depending on which other symptoms most closely match).

Network Connectivity – Note that the ability to ping an SP assumes that an IP address has been configured for that SP and that the SP has not been re-imaged (after re-imaging, a failing SP may need its IP address set again). If an IP address has not been set, the SP will not be pingable and will report "unmanaged" regardless of its boot state. In addition, a bad network cable or network switch may impact network connectivity. If all other indications are that the SP is healthy, it may be a cable or switch issue.

Establishing a PPP connection to an SP

If an SP is not pingable, it can be useful to try to establish a PPP connection. This will allow EMC Remote access (at a fairly slow speed) and can help in the investigation of field issues.
1. Make sure the SP serial port is connected to a COM port on your console PC.
2. Use "Dial-Up Networking" on the console to establish the connection. Make sure it is configured to use the same COM port that the SP is connected to (in ASE labs, this is normally COM1 for SPA, COM2 for SPB). Use a 115200 baud rate.
3. Once the connection is established, use EMC Remote with the standard Dial-Up address (192.168.1.1).


LAN Service Port (CX3 Series only)

If an SP is not pingable but you suspect that the SP is actually running (for example, the SP Fault LED is off, or FCLI on the peer SP shows the local SP as PRESENT), then there could be a problem on the customer's network. You may want to use the LAN Service Port to help confirm this. The LAN Service Port creates a direct-connect "virtual LAN" interface between a Service laptop and an SP.

1. Use a regular IP cable; a special cross-over cable is not needed.
2. To access an SP, you need to connect to the LAN Service Port on its peer SP. Details below.
3. Configure the Service laptop as follows:
   IP address = 128.221.1.249 or 128.221.1.254
   Subnet Mask = 255.255.255.248
   Gateway = None (leave blank; the LAN Service Port is a direct connection only)
   IMPORTANT: Do not connect the Service laptop to the Customer LAN while it is configured this way.
4. To access SPA, connect the IP cable from the Service laptop to the LAN Service Port on SPB. SPA's Service Port IP address will be 128.221.1.250.
5. To access SPB, connect the IP cable from the Service laptop to the LAN Service Port on SPA. SPB's Service Port IP address will be 128.221.1.251.
   IMPORTANT: Never connect the Corporate LAN to either LAN Service Port.
6. Attempt to ping the appropriate Service Port IP address as listed in Steps 4 and 5. If successful replies are received, the SP should be considered "pingable", and a network issue with the Customer LAN or the SP should be suspected.

EMC Remote password changes in R24 (and beyond)

1. Customers can use Navisphere to change their EMC Remote username/password.
2. If an SP is not accessible using EMC Remote, make sure you are using the correct username/password.
3. When connecting using an IP address, the only valid default username/password for EMC Remote is clariion1992/clariion1992. The clariion/clariion! username/password no longer works for EMC Remote when connecting using an IP address. Note that clariion1992/clariion1992 also worked on arrays running pre-R24 code.
4. If you establish a PPP session to an SP over the serial port, the username/password clariion/clariion! should always work.
5. You can still use clariion/clariion! or clariion1992/clariion1992 as the Windows logon password.

SP Fault LED Blink Rates

The SP Fault LED is an amber LED located on the air dam of the SP. During a normal boot of the Flare Partition, the following blink rates will be seen:

Blink Rate | Interpretation
¼ Hz | Power up and BIOS Initialization Phase
½ Hz | Extended POST Testing Phase
4 Hz | Operating System Boot Phase. Windows may or may not have started. Flare/NDUMON have not fully started, so the SP is not ready to handle Flare IO. It may be possible to connect to the SP via PPP (and EMCRemote) since the SP may be in degraded mode.
Off | Boot success, ready for Flare IO. This is the normal case after booting. Flare and NDUMON have started. The SP should be accessible via ping, Dial-Up Networking, or EMCRemote.

The following blink rates are for special cases:

Blink Rate | Interpretation
2 Hz | NMI button pressed (CX3 Series only)
1 – 3 – 3 – 1 | Bad DIMM detected
On (solid) | Flare has turned on the fault light due to a hardware issue


Summary of Boot Process

Here is a quick summary of the major steps in the boot process in each of the three boot modes (Normal, Degraded Mode*, HFOff):

Step 1 (BIOS):
  All modes: Output displayed on HyperTerminal; SP Fault LED starts blinking.**

Step 2 (Extended POST):
  All modes: "Alphabet string" displayed on HyperTerminal, followed by "INT 13" messages.
  Degraded Mode: a Degraded Mode message is also displayed.

Step 3 (OS Boot):
  All modes: Reboot count incremented; SP becomes pingable.

Step 4 (Flare):
  Normal: Flare starts.
  Degraded Mode: Flare does not start; drivers cannot be started manually.
  HFOff: Flare does not start; drivers can be started manually.

Step 5 (EMC Remote agent starts):
  All modes: SP accessible using EMC Remote.

Step 6 (NduMon starts):
  Normal: SP Fault LED turns off; reboot count cleared; front-ends opened for IO.
  Degraded Mode: SP Fault LED continues to blink; reboot count cleared.
  HFOff: NduMon does not start (reboot count is not cleared).

Step 7 (Navisphere agent starts):
  Normal: SP becomes manageable using Navi.
  Degraded Mode: Does not start.
  HFOff: Does not start.

*Reboot count tripped, or "Force Degraded Mode" flag set in Extended POST. On CX3 Series, the system event log on the peer SP should contain an informational message from flaredrv: "SPx Status: (37) In Degraded Mode".
**The SP Fault LED will continue to blink at different frequencies until it is turned off at the end of the "normal" boot process. See above for a description of the SP Fault LED blink rates.


CX200 / CX400 / CX600 POWERUP Sample Extended POST Output (NT) The following represents the normal output produced by Extended POST during booting of Windows Embedded NT. It is taken from a CX600 running Release 11 (pre-RTM) code on its SPB. The following is designed to help answer the question – “Does the Extended POST output in HyperTerminal look okay?” Always configure HyperTerminal to use a 9600 baud serial connection. Note: Author’s comments in bold type are not part of the output stream. Copyright (c) EMC Corporation , 2002 Disk Array Subsystem Controller Model: CX600 DiagName: Extended POST DiagRev: Rev. 02.60 Build Date: Thu Dec 12 13:35:12 2002 StartTime: 02/10/2003 09:13:32 SaSerialNo: LKE00023511042 AabcdeBCDabEabcdFGHabIabcJabcKabLabMabcNabOabPabQabRabSabTabUabVabWabXYZAA

[No errors in Extended POST diagnostics] Initializing back end FIBRE... PCI Config Reg: 2.4.1 0x0157 FCDMTL 0 [2.4.1] Dual Mode Fibre init - OSW DB PTR 0x20000000 FCDMTL 0 [2.4.1] Cached memory - 0xF5E67 bytes @ 0x200006A8 FCDMTL 0 [2.4.1] Noncached memory - 0xBF3BF bytes @ 0x200F650F (0x200F650F phys) FCDMTL 0 [2.4.1] DVM Initialized FCDMTL 0 [2.4.1] IMQ base ptr = 20170000; IMQ length = 8000 Dualmode fibre init completed FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x4, cmd=0x1 FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4002, info=0x0 FCDMTL 0 [2.4.1] TPM Lnk Up: state=0xA000000, flg=0x84 FCDMTL 0 [2.4.1] DVM Disc Comp- Dev List Size: 3 FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x208, cmd=0x0 Link Event: 0x00030001 Link Event: 0x00030005 FCDMTL 0 [2.4.1] TPM Notify: st=0xA000002, flg=0x200, cmd=0x4 Device Event (0xFFFFFC): 0x00030015 FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x200, cmd=0x2 FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4004, info=0x0 FCDMTL 0 [2.4.1] TPM Lnk Dwn: st=0xA000001, flg=0x201, evnt=0x4004 FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4002, info=0x0 FCDMTL 0 [2.4.1] TPM Lnk Up: state=0xA000001, flg=0x8005 FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E1 FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E2 FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E4 FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: EF FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E8 Device Event (0xE1): 0x00030012 Device Event (0xE2): 0x00030012 Device Event (0xE4): 0x00030012 Device Event (0xEF): 0x00030012 Device Event (0xE8): 0x00030012 FCDMTL 0 [2.4.1] DVM Disc Comp- Dev List Size: 8 FCDMTL 0 [2.4.1] TPM Notify: st=0xA000001, flg=0x8209, cmd=0x0 Link Event: 0x00030001 FCDMTL 0 [2.4.1] TPM Notify: st=0xA000002, flg=0x200, cmd=0x5 FCDMTL 0 [2.4.1] TPM Resume SCSI Issued: tgdb = 0x200004C4 Device Event (0xEF): 0x00030016 Target 0 is online Target 1 
is online


Target 2 is online Target 3 is online Target 4 is online

[First 5 drives are accessible by Extended POST; since this is SPB, an inaccessible drive 1 or 3 could be issue.] Relocating Data Directory Boot Service (DDBS)... Autoflash POST? POST/DIAG image located at sector LBA 0x00012048 Autoflash BIOS? BIOS image located at sector LBA 0x00011048 EndTime: 02/10/2003 09:14:06 int13 - RESET (1) DDBS: MDB read from both disks. DDBS: Chassis and disk WWN seeds match. DDBS: First disk is valid for boot. DDBS: Second disk is valid for boot.

[No DDBS errors. Note that one of the disks could require a rebuild or contain a “more recent timestamp” if a rebuild is required. This is normal.]

NT FLARE image (0x00400009) located at sector LBA 0x0002284B
Disk Set: 1 3

[Drives 1 and 3 are valid boot disks for SPB (on SPA, 0 and 2). If the DDBS indicated that drive 3 needed a rebuild, this line would be “Disk Set: 1”]

Total Sectors: 0x005821A1
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 240 Cylinders: 382
Capacity: 5775840 sectors
Total Sectors: 0x005821A1
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 240 Cylinders: 382
Capacity: 5775840 sectors
int13 - READ PARAMETERS (19)
int13 - READ PARAMETERS (22)
int13 - DRIVE TYPE (57)
int13 - READ PARAMETERS (58)
int13 - DRIVE TYPE (59)
Error : Invalid Drive ID - 0x81

[This “Error” is normal.]

int13 - CHECK EXTENSIONS PRESENT (61)
int13 - GET DRIVE PARAMETERS (Extended) (62)
int13 - READ PARAMETERS (63)
int13 - READ PARAMETERS (65)

[There is usually a pause after “READ PARAMETERS (65)” is displayed. If IO is in progress to the boot drives, this pause could be significant. The numbers displayed after this point will not always be the same.]

int13 - READ PARAMETERS (1201)
int13 - READ PARAMETERS (1240)
int13 - READ PARAMETERS (1276)
int13 - READ PARAMETERS (1312)
int13 - READ PARAMETERS (1352)
int13 - READ PARAMETERS (1496)
int13 - READ PARAMETERS (1530)
int13 - READ PARAMETERS (1565)
int13 - READ PARAMETERS (1625)
int13 - READ PARAMETERS (1658)
int13 - READ PARAMETERS (1727)

[The last number here is not as important as the number of “READ PARAMETERS” statements displayed after “READ PARAMETERS (65)” (11 in this case). This appears to be platform and image dependent. Hint: If both SPs are at the same rev, then both SPs normally display the same number of READ PARAMETERS statements after #65.]
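As a quick sanity check, a capture saved from HyperTerminal can be scanned for this count programmatically. The following is a minimal sketch, not an EMC tool; the function name and the assumption that the capture is a plain text file are illustrative only:

```python
import re

def count_reads_after(capture_path: str, marker: int = 65) -> int:
    """Count 'int13 - READ PARAMETERS' lines that appear after the given
    sequence number (default 65) in a saved serial-console capture file."""
    pattern = re.compile(r"int13 - READ PARAMETERS \((\d+)\)")
    count = 0
    seen_marker = False
    with open(capture_path, errors="replace") as f:
        for line in f:
            m = pattern.search(line)
            if not m:
                continue
            if seen_marker:
                count += 1
            elif int(m.group(1)) == marker:
                seen_marker = True
    return count
```

Running this against the captures from both SPs (at the same rev) should return the same count on each, per the hint above.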


CX300 / CX400 / CX700 POWERUP Sample Extended POST Output (Fish)

The following represents the normal output produced by Extended POST during booting of Windows Embedded XP. It is taken from a CX700 running Release 14 (pre-RTM) code on its SPB. (See note below about new output for Release 19.) The following is designed to help answer the question – “Does the Extended POST output in HyperTerminal look okay?” Always configure HyperTerminal to use a 9600 baud serial connection.

Note: Author’s comments in bold type are not part of the output stream.

Copyright (c) EMC Corporation , 2004
Disk Array Subsystem Controller
Model: CX700: SAN GBFCC4
DiagName: Extended POST
DiagRev: Rev. 01.97
Build Date: Fri Mar 26 11:03:10 2004
StartTime: 05/17/2004 17:30:42
SaSerialNo: LKE00034701753
AabcdefBCDEabFabcdGHabIabcJabcKabcLabcMabcNabOabPabQabRabSabTabUabVabWabXYZAA
Initializing back end FIBRE...

[No errors in Extended POST diagnostics]

PCI Config Reg: 2.4.1 0x0157
FCDMTL 0 [2.4.1] Dual Mode Fibre init - OSW DB PTR 0x397D4280
AG: init DMD to FC_SPEED_2_GBPS
FCDMTL 0 [2.4.1] Cached memory - 0xF77B9 bytes @ 0x396DCA60
FCDMTL 0 [2.4.1] Noncached memory - 0xC037F bytes @ 0x3961C680 (0x3961C680 phys)
FCDMTL 0 [2.4.1] DVM Initialized
FCDMTL 0 [2.4.1] IMQ base ptr = 39690000; IMQ length = 8000
Dualmode fibre init completed
FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x4, cmd=0x1
FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x397D4744, evnt=0x4002, info=0x0
FCDMTL 0 [2.4.1] TPM Lnk Up: state=0xA000000, flg=0x84
Link Event: 0x00030005
Device Event (0xE8): 0x00030012, tach_ptr: 0x3C0467DC
Device Event (0xE2): 0x00030012, tach_ptr: 0x3C0467DC
Device Event (0xEF): 0x00030012, tach_ptr: 0x3C0467DC
Device Event (0xE4): 0x00030012, tach_ptr: 0x3C0467DC
Device Event (0xE1): 0x00030012, tach_ptr: 0x3C0467DC
DL waited 1s for discovery
Target 0 is online
Target 1 is online
Target 2 is online
Target 3 is online
Target 4 is online

[The first 5 drives are accessible by Extended POST; since this is SPB, an inaccessible drive 1 or 3 could cause a problem. Also note that if the above output (FCDMTL strings and “is online” messages) appears repeatedly, this indicates that one of the first 5 drives or the loop itself has a problem. Unless you know that a drive is missing/pulled for some reason, this could be causing the problem you are looking at.]

Relocating Data Directory Boot Service (DDBS: Rev. 02.08)...

[The following is new for Release 19 – If the SP is booting in degraded mode and/or has tripped the reboot count, you will see messages and non-zero values here. Also see section 6.2.1 “Determining if an SP is in Degraded Mode”]

DDBS: K10_REBOOT_DATA: Count = 0
DDBS: K10_REBOOT_DATA: State = 0
DDBS: K10_REBOOT_DATA: ForceDegradedMode = 0


Autoflash POST?
DDBS: MDDE (Rev 2) on disk 0
POST/DIAG image located at sector LBA 0x00012048
Autoflash BIOS?
DDBS: MDDE (Rev 2) on disk 0
BIOS image located at sector LBA 0x00011048
EndTime: 05/17/2004 17:31:09
int13 - RESET (1)
DDBS: MDDE (Rev 2) on disk 1
DDBS: MDDE (Rev 2) on disk 3
DDBS: MDB read from both disks.
DDBS: Chassis and disk WWN seeds match.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.
NT FLARE image (0x00400009) located at sector LBA 0x0002284C
Disk Set: 1 3

[No DDBS errors. Note that one of the disks could require a rebuild or contain a “more recent timestamp” if a rebuild is required. This is normal and may or may not indicate that a crash dump has been written to disk.]

Total Sectors: 0x00583F29
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 255 Cylinders: 360
Capacity: 5783400 sectors
Total Sectors: 0x00583F29
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 255 Cylinders: 360
Capacity: 5783400 sectors
int13 - READ PARAMETERS (3)

[Note that DDBS is called again after an additional int13 – RESET; this is new for XP. This is normal.]

int13 - RESET (5)
DDBS: MDDE (Rev 2) on disk 1
DDBS: MDDE (Rev 2) on disk 3
DDBS: MDB read from both disks.
DDBS: Chassis and disk WWN seeds match.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.
NT FLARE image (0x00400009) located at sector LBA 0x0002284C
Disk Set: 1 3

[Still no DDBS errors.]

Total Sectors: 0x00583F29
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 255 Cylinders: 360
Capacity: 5783400 sectors
Total Sectors: 0x00583F29
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 255 Cylinders: 360
Capacity: 5783400 sectors
int13 - READ PARAMETERS (7)
int13 - READ PARAMETERS (24)
int13 - READ PARAMETERS (515)
int13 - CHECK EXTENSIONS PRESENT (516)
int13 - GET DRIVE PARAMETERS (Extended) (517)
int13 - READ PARAMETERS (520)


int13 - CHECK EXTENSIONS PRESENT (521)
int13 - GET DRIVE PARAMETERS (Extended) (522)
int13 - READ PARAMETERS (524)
int13 - CHECK EXTENSIONS PRESENT (525)
int13 - GET DRIVE PARAMETERS (Extended) (526)
int13 - READ PARAMETERS (535)
int13 - CHECK EXTENSIONS PRESENT (536)
int13 - GET DRIVE PARAMETERS (Extended) (537)
int13 - DRIVE TYPE (555)
int13 - READ PARAMETERS (556)
int13 - DRIVE TYPE (557)
int13 - CHECK EXTENSIONS PRESENT (559)
int13 - GET DRIVE PARAMETERS (Extended) (560)
int13 - READ PARAMETERS (561)
int13 - CHECK EXTENSIONS PRESENT (562)
int13 - GET DRIVE PARAMETERS (Extended) (563)
int13 - READ PARAMETERS (573)
int13 - CHECK EXTENSIONS PRESENT (574)
int13 - GET DRIVE PARAMETERS (Extended) (575)
int13 - READ PARAMETERS (578)
int13 - CHECK EXTENSIONS PRESENT (579)
int13 - GET DRIVE PARAMETERS (Extended) (580)

[There is usually a pause after “READ PARAMETERS (580)” is displayed. If IO is in progress to the boot drives, this pause could be significant. The numbers displayed after this point will not always be the same.]

int13 - READ PARAMETERS (1459)
int13 - CHECK EXTENSIONS PRESENT (1460)
int13 - GET DRIVE PARAMETERS (Extended) (1461)
int13 - READ PARAMETERS (1476)
int13 - CHECK EXTENSIONS PRESENT (1477)
int13 - GET DRIVE PARAMETERS (Extended) (1478)
int13 - READ PARAMETERS (1492)
int13 - CHECK EXTENSIONS PRESENT (1493)
int13 - GET DRIVE PARAMETERS (Extended) (1494)
int13 - READ PARAMETERS (1510)
int13 - CHECK EXTENSIONS PRESENT (1511)
int13 - GET DRIVE PARAMETERS (Extended) (1512)
int13 - READ PARAMETERS (1528)
int13 - CHECK EXTENSIONS PRESENT (1529)
int13 - GET DRIVE PARAMETERS (Extended) (1530)
int13 - READ PARAMETERS (1578)
int13 - CHECK EXTENSIONS PRESENT (1579)
int13 - GET DRIVE PARAMETERS (Extended) (1580)
int13 - READ PARAMETERS (1595)
int13 - CHECK EXTENSIONS PRESENT (1596)
int13 - GET DRIVE PARAMETERS (Extended) (1597)
int13 - READ PARAMETERS (1632)
int13 - CHECK EXTENSIONS PRESENT (1633)
int13 - GET DRIVE PARAMETERS (Extended) (1634)
int13 - READ PARAMETERS (1690)
int13 - CHECK EXTENSIONS PRESENT (1691)
int13 - GET DRIVE PARAMETERS (Extended) (1692)

[The last number here is not as important as the number of “READ PARAMETERS” displayed after the delay in “READ PARAMETERS” output (see previous note).]


CX3-20 / CX3-40 / CX3-80 Sample Extended POST Output (CX3 Series)

The following represents the normal output produced by Extended POST during boot. It is taken from a CX3-40 (aka Sledgehammer) running Release 22 (pre-RTM) code on its SPB. The following section is designed to help answer the question – “Does the Extended POST output in HyperTerminal look okay?” Always configure HyperTerminal to use a 9600 baud serial connection.

Note: Author’s comments in bold type are not part of the output stream.

Copyright (c) EMC Corporation , 2006
Disk Array Subsystem Controller
Model: Sledgehammer: SAN
DiagName: Extended POST
DiagRev: Rev. 01.02
Build Date: Thu Feb 09 09:11:16 2006
StartTime: 02/24/2006 10:25:46
SaSerialNo: LKE00053102759
ABabcdefgCDEabcdFabcGHabIabcJabKabcLabMNabcOabcPabQabRabcSTUabVabWabcdXabcdYZ
Initializing back end FIBRE...
PCI Config Reg: 9.0.0 0x0007
FCDMTL 0 [9.0.0] Dual Mode Fibre init - OSW DB PTR 0x3B877C40
No AG: init DMD to FC_SPEED_4_GBPS
FCDMTL 0 [9.0.0] Cached memory - 0xFFA14 bytes @ 0x3AB50E80
FCDMTL 0 [9.0.0] Noncached memory - 0xC045F bytes @ 0x3AA909C0 (0x3AA909C0 phys)
FCDMTL 0 [9.0.0] DVM Initialized
FCDMTL 0 [9.0.0] DIAG LOG DIAG MODULE INITIALIZED - diag gdb 3AC39D19
FCDMTL 0 [9.0.0] IMQ base ptr = 3AB10000; IMQ length = 8000
FCDMTL 0 [9.0.0] SFP sfp_monitoring INIT called
FCDMTL 0 [9.0.0] SFP INIT
FCDMTL 0 [9.0.0] SFP SFP_INIT
FCDMTL 0 [9.0.0] SFP *** Waiting for atd_ptr->api_init_state = FC_CTLR_INIT_STATE_ONLINE; ***
Dualmode fibre init completed
FCDMTL 0 [9.0.0] TPM Notify: st=0xA000000, flg=0x4, cmd=0x1
FCDMTL 0 [9.0.0] Set PCI EXP DEV CTRL MRRS to 4096 Bytes
FCDMTL 0 [9.0.0] speed negotiation not on SN1:0xCE97240C
FCDMTL 0 [9.0.0] fc_tl_write_TX_RX. TX:0x00000004 RX:0x00000010
FCDMTL 0 [9.0.0] fc_tl_set_speed_led: htd_ptr->gpio_output = 0x00000012
FCDMTL 0 [9.0.0] TPM Hndle API Event: cntx=0x3B878120, evnt=0x4015, info=0x380AB024
Link Event: 0x0003100F
FCDMTL 0 [9.0.0] SFP IS INSERTED
FC_TL_SFP_TRANSACTION_STATE_I2C_RESET_INITIAL
FCDMTL 0 [9.0.0] TPM Hndle API Event: cntx=0x3B878120, evnt=0x4002, info=0x0
FCDMTL 0 [9.0.0] TPM Lnk Up: state=0xA000000, flg=0x84
Link Event: 0x00031005
Link Event: 0x00031006
FC_TL_SFP_TRANSACTION_STATE_I2C_RESET_START
Device Event (EF/00): 0x00031012, tach_ptr: 0x3C0BF910
Device Event (E2/03): 0x00031012, tach_ptr: 0x3C0BF910
Device Event (E1/04): 0x00031012, tach_ptr: 0x3C0BF910
Device Event (E4/02): 0x00031012, tach_ptr: 0x3C0BF910
Device Event (E8/01): 0x00031012, tach_ptr: 0x3C0BF910
DL waited 1s for discovery
Target 0 is online
Target 1 is online
Target 2 is online
Target 3 is online
Target 4 is online

[The first 5 drives are accessible by Extended POST; since this is SPB, an inaccessible drive 1 or 3 could cause a problem. Also note that if the above output (FCDMTL strings and “is online” messages) appears repeatedly, this indicates that one of the first 5 drives or the loop itself has a problem. Unless you know that a drive is missing/pulled for some reason, this could be causing the problem you are looking at.]


Relocating Data Directory Boot Service (DDBS: Rev. 03.06)...
Autoflash POST?
DDBS: MDDE (Rev 200) on disk 0
POST/DIAG image located at sector LBA 0x00012048
Autoflash BIOS?
DDBS: MDDE (Rev 200) on disk 0
BIOS image located at sector LBA 0x00011048

[If the SP is booting in degraded mode and/or has tripped the reboot count, you will see messages and non-zero values here. Also see section 6.2.1 “Determining if an SP is in Degraded Mode”]

DDBS: K10_REBOOT_DATA: Count = 0
DDBS: K10_REBOOT_DATA: State = 0
DDBS: K10_REBOOT_DATA: ForceDegradedMode = 0
DDBS: SP B Normal Boot Partition
DDBS: MDDE (Rev 200) on disk 1
DDBS: MDDE (Rev 200) on disk 3
DDBS: MDB read from both disks.
DDBS: Chassis and disk WWN seeds match.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.

[No DDBS errors. Note that one of the disks could require a rebuild or contain a “more recent timestamp” if a rebuild is required. This is normal and may or may not indicate that a crash dump has been written to disk.]

FLARE image (0x00400009) located at sector LBA 0x011E804C

[Note that the FLARE Boot Partition is located at a very different LBA than on NT or Fish arrays] Disk Set: 1 3

[Drives 1 and 3 are valid boot disks for SPB (On SPA, 0 and 2). “Disk Set: 1 3” is the normal case. For example, if the DDBS indicated that drive 3 needed a rebuild, this line would read “Disk Set: 1” instead]

Total Sectors: 0x01BFDB24
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 255 Cylinders: 1827
Capacity: 29350755 sectors
Total Sectors: 0x01BFDB24
Relative Sectors: 0x0000003F
Calculated mirror drive geometry: Sectors: 63 Heads: 255 Cylinders: 1827
Capacity: 29350755 sectors
EndTime: 02/24/2006 10:26:43
int13 - RESET (1)
int13 - READ PARAMETERS (3)
int13 - RESET (5)
int13 - READ PARAMETERS (7)
int13 - READ PARAMETERS (24)
int13 - CHECK EXTENSIONS PRESENT (39)
int13 - READ PARAMETERS (516)
int13 - CHECK EXTENSIONS PRESENT (517)
int13 - GET DRIVE PARAMETERS (Extended) (518)
int13 - READ PARAMETERS (521)
int13 - CHECK EXTENSIONS PRESENT (522)
int13 - GET DRIVE PARAMETERS (Extended) (523)
int13 - READ PARAMETERS (525)
int13 - CHECK EXTENSIONS PRESENT (526)
int13 - GET DRIVE PARAMETERS (Extended) (527)
int13 - READ PARAMETERS (536)
int13 - CHECK EXTENSIONS PRESENT (537)
int13 - GET DRIVE PARAMETERS (Extended) (538)
int13 - DRIVE TYPE (557)
int13 - READ PARAMETERS (558)
int13 - DRIVE TYPE (559)
int13 - CHECK EXTENSIONS PRESENT (561)
int13 - GET DRIVE PARAMETERS (Extended) (562)
int13 - READ PARAMETERS (563)


int13 - CHECK EXTENSIONS PRESENT (564)
int13 - GET DRIVE PARAMETERS (Extended) (565)
int13 - READ PARAMETERS (577)
int13 - CHECK EXTENSIONS PRESENT (578)
int13 - GET DRIVE PARAMETERS (Extended) (579)
int13 - READ PARAMETERS (582)
int13 - CHECK EXTENSIONS PRESENT (583)
int13 - GET DRIVE PARAMETERS (Extended) (584)

[There is usually a pause at this point. If IO is in progress to the boot drives, this pause could be significant. The numbers displayed after this point will not always be the same.]

int13 - READ PARAMETERS (1400)
int13 - CHECK EXTENSIONS PRESENT (1401)
int13 - GET DRIVE PARAMETERS (Extended) (1402)
int13 - READ PARAMETERS (1418)
int13 - CHECK EXTENSIONS PRESENT (1419)
int13 - GET DRIVE PARAMETERS (Extended) (1420)
int13 - READ PARAMETERS (1437)
int13 - CHECK EXTENSIONS PRESENT (1438)
int13 - GET DRIVE PARAMETERS (Extended) (1439)
int13 - READ PARAMETERS (1455)
int13 - CHECK EXTENSIONS PRESENT (1456)
int13 - GET DRIVE PARAMETERS (Extended) (1457)
int13 - READ PARAMETERS (1474)
int13 - CHECK EXTENSIONS PRESENT (1475)
int13 - GET DRIVE PARAMETERS (Extended) (1476)
int13 - READ PARAMETERS (1493)
int13 - CHECK EXTENSIONS PRESENT (1494)
int13 - GET DRIVE PARAMETERS (Extended) (1495)
int13 - READ PARAMETERS (1554)
int13 - CHECK EXTENSIONS PRESENT (1555)
int13 - GET DRIVE PARAMETERS (Extended) (1556)
int13 - READ PARAMETERS (1586)
int13 - CHECK EXTENSIONS PRESENT (1587)
int13 - GET DRIVE PARAMETERS (Extended) (1588)
int13 - READ PARAMETERS (1626)
int13 - CHECK EXTENSIONS PRESENT (1627)
int13 - GET DRIVE PARAMETERS (Extended) (1628)
int13 - READ PARAMETERS (1676)
int13 - CHECK EXTENSIONS PRESENT (1677)
int13 - GET DRIVE PARAMETERS (Extended) (1678)

[The last number here is not as important as the number of “READ PARAMETERS” displayed after the delay in “READ PARAMETERS” output (see previous note).]


Data Sector Protection

Data protection at a high level consists of the various choices of redundant RAID levels – 1, 1/0, 3 and 5 – where the data is protected by parity or mirroring. There are other RAID Group types that are non-redundant – RAID 0, single disk and hot spare. RAID level protection is one means, but data protection begins at a much lower level. CLARiiON arrays use timestamps and parity information associated with each sector to ensure data integrity throughout the data path. In combination with the use of redundant RAID types, this adds another level of data integrity. Media errors – for example, a bad sector – may be corrected without having to restore from a backup medium. If SNiiFFER detects a bad sector, that sector will be rebuilt with the aid of the redundant information stored as part of the RAID level used for the LUN. More information is provided later regarding SNiiFFER (the FLARE verify operation).

Included is a 2-byte checksum, which covers the 512 ‘user’ bytes as well as the extra 6 bytes in the stamps. This checksum can be used to detect, but not correct, any errors in the sector. The SNiiFFER disk checking process makes extensive use of these checksum bytes. Here is a detailed look at the sector metadata, which is encoded in the last 8 bytes of each sector beginning at offset 0x200. It consists of four 2-byte fields.

Offset 0x200    Offset 0x202    Offset 0x204    Offset 0x206
Time stamp      Checksum        Shed stamp      Write stamp
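For illustration, the four metadata fields can be unpacked programmatically from a raw 520-byte sector image. This is a sketch only; the function name and the little-endian byte order of each field are assumptions, not documented FLARE behavior:

```python
import struct

def parse_sector_metadata(sector: bytes) -> dict:
    """Split a 520-byte sector's trailing 8 metadata bytes (at offset 0x200)
    into the four 2-byte fields: time stamp, checksum, shed stamp, write stamp.
    Byte order within each field is assumed little-endian for illustration."""
    assert len(sector) == 520, "CLARiiON formats sectors at 520 bytes"
    time_stamp, checksum, shed_stamp, write_stamp = struct.unpack_from("<4H", sector, 0x200)
    return {"time_stamp": time_stamp, "checksum": checksum,
            "shed_stamp": shed_stamp, "write_stamp": write_stamp}
```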

Time Stamp - the time stamp field is used on both data drives and the parity drive. The time stamp field is initialized with the constant INVALID time stamp when the LU is first bound. INVALID time stamp has a value of 0x7FFF. The time stamp field is usually updated when an MR3 write occurs. It is used with RAID 5 and RAID 3 when writing to all drives in parallel.

The MSB of the time stamp field is designated as the ALL time stamp bit. The remaining 15 bits hold the time stamp value. The time stamp value is a random number that is generated with the system time as the seed value. If the ALL time stamp bit is set on the parity drive, then the time stamp values on the data drives must match the time stamp value on the parity drive. This update usually occurs during an MR3 write. When the ALL time stamp bit is set, inconsistencies in time stamp values between parity and data drives can be used to detect errors. When a 468 write occurs the ALL time stamp bit is cleared on the parity drive, and the time stamp value on the data drive is set to INVALID time stamp.
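The ALL-time-stamp rule above can be modeled in a few lines. This is an illustrative sketch only (the constant names and the choice to mask data-drive stamps to 15 bits are assumptions):

```python
INVALID_TIMESTAMP = 0x7FFF   # lower 15 bits set, ALL bit clear
ALL_BIT = 0x8000             # MSB of the 16-bit time stamp field

def timestamps_consistent(parity_stamp: int, data_stamps: list) -> bool:
    """When the ALL time stamp bit is set on the parity drive, every data
    drive's 15-bit time stamp value must match the parity drive's value.
    When the ALL bit is clear (e.g. after a 468 write), no full-stripe
    time stamp check applies."""
    if not (parity_stamp & ALL_BIT):
        return True
    expected = parity_stamp & 0x7FFF
    return all((s & 0x7FFF) == expected for s in data_stamps)
```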


Checksum - is used on both data drives and the parity drive. The 2-byte checksum is calculated using the following algorithm.

1. XOR all 4-byte data words in the sector.
2. XOR the result with 0x0000AF76 (the seed).
3. ROTATE the result LEFT 1 bit.
4. XOR the HIGHER 2 bytes with the LOWER 2 bytes to obtain the final checksum value.
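The four steps translate directly into code. The following is a sketch of the algorithm as described; the function name and the little-endian assembly of the 4-byte words are assumptions made for illustration:

```python
import struct

CHECKSUM_SEED = 0x0000AF76  # seed value from step 2 above

def sector_checksum(data: bytes) -> int:
    """Compute the 2-byte sector checksum over 512 bytes of user data,
    following the four steps listed above."""
    assert len(data) == 512
    result = 0
    # 1. XOR all 4-byte data words in the sector (word byte order assumed
    #    little-endian here).
    for (word,) in struct.iter_unpack("<I", data):
        result ^= word
    # 2. XOR the result with the seed.
    result ^= CHECKSUM_SEED
    # 3. Rotate the 32-bit result left by 1 bit.
    result = ((result << 1) | (result >> 31)) & 0xFFFFFFFF
    # 4. XOR the upper 2 bytes with the lower 2 bytes for the final value.
    return ((result >> 16) ^ result) & 0xFFFF
```

Because every step is an XOR or a fixed rotation, flipping any single bit of user data changes the checksum, which is what lets SNiiFFER detect (but not correct) a corrupted sector.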

This checksum is used for all RAID types and is created from the 512 bytes of user data.

Shed Stamp - is used only on the parity drive. It is initialized to zero on all data drives and on the parity drive. When a drive is missing from the RAID group and a write occurs, the data is written into the associated sector on the parity drive, and the shed stamp of that sector is updated with the drive number of the missing drive in the LUN. The diagram below represents the bit positions associated with each drive number.

When the shed stamp is updated, the bit position corresponding to the missing drive number is set to 1. It is used with RAID 5 and RAID 3 to perform parity shedding, and serves as a flag to indicate that the data has been placed in the parity sector.

Write Stamp - is used on both data drives and the parity drive. It is initialized to zero on all drives. The write stamp is usually updated when a 468 write occurs. On data drives, when a write stamp is updated, the bit position corresponding to the drive number in the LUN on which the sector resides is toggled.

On the parity drive, the write stamp contains the XOR of all the write stamp fields of the data drives in the LUN. Any inconsistencies in write stamps between the data and parity drives can be used to detect errors. It is used with RAID 5, and each bit represents a drive and its associated data-parity coherency.

How do these bytes work?

In this case (a write with a drive missing), how do we store the data? We store the data in the parity segment location but mark the block as a ‘parity shed’. When the stripe is rebuilt, upon the disk being replaced, the data is copied into the correct location and then parity is generated for the parity block vacated by the parity shed block.


Write stamps are a single bit per drive. Each data drive sector has one bit, which is reflected in the corresponding position on the parity drive. Each time a write IO occurs, the corresponding data and parity bits are flipped. On the parity drive, the write stamp contains the XOR of all the write stamp fields of the data drives in the LUN.

Inconsistencies in write stamps between the data and parity drives can be used to detect errors: if they don’t match, that’s an error. Since the bit must be flipped, it must be pre-read to know its current state.
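The toggle-and-XOR relationship can be sketched as a toy model. The function names are illustrative, and single integers stand in for per-sector stamp fields:

```python
def verify_write_stamps(data_stamps: list, parity_stamp: int) -> bool:
    """The parity drive's write stamp must equal the XOR of the data drives'
    write stamps; a mismatch flags an interrupted (partial) 468 write."""
    acc = 0
    for s in data_stamps:
        acc ^= s
    return acc == parity_stamp

def do_468_write(data_stamps: list, parity_stamp: int, drive: int):
    """A 468 write toggles the bit for 'drive' on that drive's write stamp
    and on the parity drive's write stamp, keeping the XOR relation intact."""
    data_stamps[drive] ^= (1 << drive)
    parity_stamp ^= (1 << drive)
    return data_stamps, parity_stamp
```

If power is lost between the two toggles, only one side flips, and `verify_write_stamps` reports the inconsistency.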

The time stamp fixes the problem seen in standard parity RAID when data and parity were being updated.

But if a power failure causes an interruption, which is current – the parity block or the data block? One may have been written while the other is stale.

The time stamp tells the processor which block was written last, and thus holds the current data.

What can cause uncorrectable sectors? A closer examination of the parity stripe and how it works is required to fully understand what a correctable or uncorrectable sector is. The following explains in more detail how a stripe is constructed and the potential causes of sector issues.


The example shown handles a typical RAID 5 error condition, although CLARiiON does support other RAID types with an equal amount of data protection. In this scenario we examine typical RAID 5 error conditions when two active SPs are present via a CLARiiON system’s dual-porting feature. Other products that are limited to one SP or a "hot-standby" must solve the same problems. Note that the error conditions discussed represent only a few examples of what can go wrong in the real world.

In Figure 1 above, a CLARiiON SP is controlling a RAID 5 group of disks. A cross section of the same physical block on each drive shows user data on four of the drives plus parity ("10") on the fifth drive. The parity data "10" is calculated by summing the data on the data drives. If one of the drives that contain data fails, the lost data can be reconstructed on a read request by subtracting the remaining values from the parity. (Note that in actuality, parity is calculated and reconstructed in the array using exclusive-OR logic; we are using addition and subtraction here to simplify the example.)

Power Loss Scenario

A problem arises when a write update occurs to a RAID 5 group that has already suffered a drive failure (i.e., the third drive from the left). If the file is updated and a new block of data is written to the second drive from the left, changing the value from "2" to "3," then a parity update ("10" to "11") should be written to the drive holding the parity block.

If "3" is written to the second disk and the SP loses power before writing "11" to the parity drive, the array stripe is left in an incoherent state. If the user attempts to read the failed third drive, the data is reconstructed as "2" instead of the correct value of "3." When the failed drive is subsequently replaced, the same incorrect value is rebuilt and written to the replacement drive.
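This incoherence (the classic RAID 5 "write hole") can be demonstrated with the exclusive-OR parity the array actually uses. The sketch below uses small integers for whole blocks; function names are illustrative:

```python
from functools import reduce

def parity_of(blocks):
    """RAID 5 parity is the XOR of all data blocks in the stripe."""
    return reduce(lambda a, b: a ^ b, blocks)

def reconstruct(blocks, parity, failed):
    """Rebuild the failed block by XOR-ing parity with the surviving blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    return reduce(lambda a, b: a ^ b, survivors, parity)
```

With a coherent stripe, `reconstruct` returns the failed drive's true data; update one data block without updating parity, and the reconstructed value silently comes back wrong, exactly as in the scenario above.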

The traditional approach to dealing with this problem has been to use an uninterruptible power supply. A UPS holds the system up (at least) long enough for a parity update to complete and for no data to be lost. However, this strategy does not fully apply to RAID subsystems, because SP failures and drive failures are not necessarily power failures. Also, due to cost considerations, not every system is equipped with a UPS. Obviously a UPS is not a foolproof solution to the problem.

Another approach has been to use a background verify process. This is a process that runs through the entire address space of an array, fixing data/parity mismatches as it encounters them. This method was discarded because it is useless if a drive has failed or fails during the background verify procedure. Consider Figure 2. A background verify procedure has no knowledge that the value "3" exists on the third drive, so it is impossible to recalculate the proper parity value of "11" without having all of the drives present. There are no safeguards in place should the user request the data from the failed drive.


Proactive Data Integrity

CLARiiON uses a unique combination of three levels of data protection to provide a foolproof solution to the problem of returning incorrect data:

1. Standard RAID 5 Parity: allows for reconstruction of data on a read request from a failed drive.
2. Patented error-handling algorithms: ensure data and parity are always coherent, even when multiple failures occur.
3. Data Stream Checksumming: a mechanism that ensures data was written/read correctly to and/or from an individual disk.

CLARiiON has employed the standard RAID definitions in its designs; therefore, the RAID definitions are not discussed in this paper. What sets CLARiiON apart in terms of data integrity focus are the patented prevention and detection algorithms and the data stream checksumming techniques that recognize potentially dangerous data integrity scenarios and prevent them from occurring.

First, CLARiiON formats the disk sectors at 520 bytes instead of the normal 512 bytes to provide eight bytes per sector for error detection handling. These additional bytes include linear checksums and status bits for the stripe. Second, non-volatile memory (or NOVRAM) on the SP is used to ensure that in the event of an AC power loss, the consistency of the data and corresponding parity is maintained.

The NOVRAM keeps track of stripes which are being updated so that, in the event that power fails before the data and the parity have been updated, only stripes which were in use need be verified. This speeds up the verify process and minimizes the performance impact on the array. It is important to note that no user data is stored in the NOVRAM – only flags related to stripes in use. Loss of the NOVRAM, therefore, does not affect data integrity in any way.

To see how CLARiiON deals with the earlier example of data integrity (that in a previous figure resulted in undetected data corruption), let us step through CLARiiON’s patented "Parity Shedding" algorithm, which is shown below. When CLARiiON determines that a drive in a stripe has failed, the stripe will operate in a degraded mode where the focus is on avoiding the catastrophe of data/parity incoherence. While in this mode, a read request to the stripe will return the data as read from the target drive, if running, or the "exclusive OR" product if the target drive is the failed disk.

But let us assume that a write operation is requested that would change the value of the second drive from "2" to "4." Before the write request is executed, Parity Shedding will cause the data from the failed drive to be reconstructed by "exclusive OR-ing" the surviving drives. This calculated value is then written OVER the original parity value of "10," and a flag is set in one of the additional eight bytes indicating that the parity for this stripe is now data, i.e., the parity has been shed.
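The degraded-mode read and write path just described can be sketched as a toy model. XOR stands in for full-block parity, single integers stand in for blocks, and the class and method names are illustrative only:

```python
from functools import reduce

def xor_all(vals):
    return reduce(lambda a, b: a ^ b, vals, 0)

class DegradedStripe:
    """Toy model of parity shedding on a stripe with one failed drive."""
    def __init__(self, blocks, failed):
        self.blocks = list(blocks)     # blocks[failed] is unreadable on disk
        self.parity = xor_all(blocks)  # parity as it stood before the failure
        self.failed = failed
        self.shed = False              # the 'parity has been shed' flag bit

    def read(self, drive):
        if drive != self.failed:
            return self.blocks[drive]
        if self.shed:
            return self.parity         # shed data now lives in the parity block
        # Not yet shed: reconstruct via exclusive-OR of the survivors.
        survivors = [b for i, b in enumerate(self.blocks) if i != self.failed]
        return self.parity ^ xor_all(survivors)

    def write(self, drive, value):
        if not self.shed:
            # Shed first: reconstruct the failed drive's data and write it
            # OVER the parity block, setting the shed flag.
            self.parity = self.read(self.failed)
            self.shed = True
        if drive == self.failed:
            self.parity = value        # failed drive's data lives in parity
        else:
            self.blocks[drive] = value # no parity to update once shed
```

Once shed, there is no parity left to keep coherent, so no point of power loss can produce data/parity incoherence; a read of the failed drive is a single access to the former parity block.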


The data value of "2" will be replaced with the requested update of "4." Note that the stripe now has no parity, and that there is NO point at which an AC power loss could leave CLARiiON with data/parity incoherence, thus avoiding any chance of undetected data corruption. Furthermore, any request to read the data from the failed drive will be directed to the former parity drive to get the correct data in a single disk access. When the failed disk drive is replaced, the stripe will rebuild complete with parity to regain "Redundant Mode" operation.

The use of NOVRAM, also a CLARiiON patent, keeps track of the state consistency of all stripes in the array. If power is lost after a data block has been written, but before the associated calculated parity update is written, this state will be reflected in the NOVRAM entries for the stripe. When power returns, the stripe states are checked, and any stripes with inconsistent states have their parity recalculated and written to the parity sector. In the worst case, if a power failure is accompanied by a failure of a storage processor (i.e., the NOVRAM is lost), a complete Background Verify is executed to ensure that all data and parity are consistent.

CLARiiON also employs Longitudinal Redundancy Checking to assure the integrity of data being written from a storage processor to a disk and being read back from the disk on a host read request.

A portion of the eight additional bytes per sector is used to store the calculated LRC code for each data sector within the stripe. In this way, as multiple streams of data move back and forth at up to 100 megabytes per second across the two "back-end" Fibre Channel Arbitrated Loops (FC-AL), one more level of protection is effected.


Dual Active Storage Processors

CLARiiON architecture features two active storage processors, with each SP able to access all the drives independently. One benefit of this implementation is the continuous access to data that SP redundancy offers to the user. With two SPs, the user can now suffer a single point of failure at the SP, host, or front-end Fibre Channel level and still have uninterrupted access to the data. A second benefit of dual active SPs is increased throughput (I/O operations per second) as a result of load-sharing the disks between the two SPs. A third benefit arises from the fact that two SPs permit the use of mirrored write cache. This enables applications to write to protected cache memory instead of executing four disk transactions to complete a RAID 5 write operation. CLARiiON offers these benefits by handling the increased data integrity complexity that comes with two SPs.

This figure shows a CLARiiON system’s dual-porting feature. Both SPs can actively access different disks. In this diagram, there are two SPs with SP-A having performed a write of "3" to a RAID 5 stripe that it owns exclusively. SP-A has calculated a parity value of "12" but has not yet updated the parity value on disk.

In this figure, SP-A has failed. The user then transfers ownership of the RAID 5 group to SP-B using CLARiiON's "trespass" feature. This feature allows a smart host and device driver to reroute failed commands to the surviving SP without user intervention (via PowerPath software). At this point, because the parity value on disk was not updated to "12," there exists an incoherent data/parity cross-section that could prove disastrous should a drive failure occur.

Parity no longer reflects the data on the other four drives! The CLARiiON array, however, detects this situation and will invoke a Background Verify to ensure that SP-B makes the parity on disk coherent. CLARiiON arrays can do this because of the technique of maintaining status information in the additional eight bytes per sector. Of course, without a second SP, the user loses much of the value of a RAID system by becoming susceptible to a single point of failure. The permutations that exist for RAID 5 failure scenarios are numerous. With the addition of a second SP that can access the same disks, the possibilities double. Not only has CLARiiON addressed the complicated failure scenarios inherent in any RAID 5/dual-SP implementation, but it supports ACTIVE dual SPs, where both storage processors can simultaneously access different RAID groups connected via the alternate backend FC-AL to the dual-ported drives.


Stripe Access Management

An essential feature of the CLARiiON dual active SP implementation is the exclusive ownership of stripes by each storage processor. That is, for normal read/write access, only one of the two SPs may access a given stripe. Only when the loss of the owning SP occurs will the stripe become accessible to the surviving SP via the CLARiiON trespass facility (either automatically or manually). Thus, CLARiiON precludes the occurrence of a wide range of possible data integrity problems. To see why, consider what might occur if CLARiiON permitted dynamic access by SP-A and SP-B to a given stripe.

In this figure, the array contains four different files: A, B, C, and D. Parity (P) is contained on the fifth drive. Imagine that both SP-A and SP-B can access any file, with locking being done on those files at the operating system level. SP-A is writing A* as a replacement for file A. This will eventually result in a parity update (P*) to the fifth drive. SP-B is reading file C. In normal mode, this sequence of events works properly.

But what happens when drive C is missing, as in the figure below? The write of A* has already occurred to the first drive. Parity (P*) is being calculated on SP-A, but has not yet been written to disk. SP-B, sensing that the data C cannot be directly read from disk, is reading from the other four drives in order to recalculate C.

Once SP-B has read all four drives, it subtracts the three data files from the parity file in order to recalculate C. Note that the parity does not yet reflect the updated data A*. In this case, incorrect data will be returned to the host in place of file C's contents. All of this can take place even with operating system file locking: if the operating system locks file A, that does not prevent an access to file C.
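The hazard described above can be sketched numerically. The file contents below are hypothetical; the point is that reconstructing the missing drive from the survivors plus a stale parity yields data that never existed:

```python
from functools import reduce

def xor(vals):
    return reduce(lambda a, b: a ^ b, vals)

# Hypothetical stripe: files A, B, C, D plus parity P on the fifth drive.
A, B, C, D = 10, 20, 30, 40
P = xor([A, B, C, D])

# SP-A has written A* to disk but has not yet written the updated parity P*.
A_star = 99

# Drive C is missing, so SP-B reconstructs "C" from the surviving
# drives and the (stale) parity.
C_rebuilt = xor([A_star, B, D, P])
assert C_rebuilt != C   # garbage would be returned to the host
```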

One possible approach to the problem in the above figure might be to use the reserve/release commands found in the SCSI specification (with Fibre Channel, SCSI commands are actually "tunneled" onto the Fibre Channel protocol). These commands could be used at the drive level (after the SCSI commands are stripped off the Fibre Channel link) on every I/O to remove the possibility of reconstructing incorrect data. But this is undesirable because reserve/release commands significantly degrade system performance, and deadlocks would frequently occur that would need to be untangled.

Another possible scheme is for the operating system to implement stripe locking. Stripe locking would prevent multiple I/Os from occurring simultaneously to the same data/parity stripe. This would require that the operating system know the sector layout and parity rotation algorithms used by the SP firmware. However, a problem still exists when rebuilding a failed drive: if SP-A is rebuilding drive C, for example, SP-B could be writing to the same stripe and incorrect data could be rebuilt. Stripe locking also negates one of the main advantages of RAID 5: multiple independent accesses on the same stripe! CLARiiON's non-sharing SP architecture avoids these issues while still providing operating system independence (and therefore portability), high performance, and data integrity.

It is also important to remember that the CLARiiON knows when a write did not complete. A LUN will be marked dirty prior to a write operation, and a log entry is put into the nonvolatile RAM on the SP for RAID writes. This log contains the LUN and LBA information of the write operation. After the write is completed, the entry is cleared. In reality, the dirty flag gets cleared after the LUN has been idle for 5 seconds. If a failure causes the replacement of an SP, or a LUN trespasses while the dirty flag is set, a verify process must be run to check the integrity of the protection mechanisms.


How do we check the integrity of the four 2-byte fields within the eight additional bytes per sector? There is a background process called verify that runs continually for each bound LUN, in addition to any bound hot spare LUNs. In short, it reads all sectors of all disks in bound LUNs, checking for any inconsistencies. Correctable errors are corrected by use of RAID redundancy, while uncorrectable errors are just that, uncorrectable, due to a double fault within the stripe of data.

There are three types of verify: nonvol, sniff, and background verify. The general algorithms used to detect and fix errors remain the same regardless of the type of verify being used. Each type of verify is run for a different reason, but the purpose remains the same: to detect and correct potential disk media errors or RAID coherency/consistency errors. In general, verifies operate on a LUN basis. For any RAID group, only one sniff or background verify may operate at a time within that RAID group. Background verifies always take priority over sniff verifies; all background verifies required for LUNs in a RAID group are completed before sniff verifies. For example, suppose LUN 2 was doing a sniff verify, and we determined that LUN 3 needed a background verify. In this case, the sniff on LUN 2 will be stopped, and the background verify on LUN 3 will be performed.

But why is the verify process necessary? Whenever a LUN takes on write I/Os, we mark that unit as being 'dirty'. In this context, 'dirty' means that the LUN potentially has inconsistent areas on it while the writes are in progress. If these writes are interrupted by an SP failure or a LUN shutdown, then the areas where writes were outstanding need to be checked for consistency. The types of inconsistencies that may come about during a failure are typically inconsistencies in checksums or in the redundant RAID information. For example, suppose we are writing a RAID 5 unit. We may be writing the data drive and parity drive in parallel. If this is interrupted, then it is possible that the data or parity have not been updated. Verify detects and corrects this inconsistency.

The Flare Verify Operation is the basic operation performed during any verify done by Flare (sniff, background, or nonvol). This is a feature of a CLARiiON disk array which inspects and attempts to validate the data and parity of bound LUNs. It helps to detect and correct errors before they become unrecoverable. Verify enables a drive to locate/fix media errors and allows Flare to locate and correct RAID coherency/consistency errors.

Verify Algorithms

The following pertains to background and nonvol verify operations. These operations take place in 64 KB pieces. Each verify operation first reads 64 KB on each drive in the unit. Verify then checks the internal consistency of each sector by validating checksums and other RAID stamps in the sector. If a sector takes a media error, Flare will ask the drive to remap each sector in error. For soft remap errors, the data is recovered, but for hard remap errors the data is lost; in that case, the RAID algorithms will attempt to reconstruct the data from the available redundant RAID information. Each sector is also checked for coherency with its corresponding parity (R3/R5) or mirrored pair (R1/R10), and any necessary corrections are made using the available redundant data.

Sniff Verify

Sniff verify runs in the background on a bound unit to issue drive-level SCSI Verify commands. When the sniffer starts, it runs sequentially from the beginning of the bound unit, issuing SCSI drive-level Verify commands for the entire capacity of the unit. Sniff verify covers a RAID group in the order that the LUNs are bound; when the last unit in the RAID group has finished sniffing, sniffing begins again on the first LUN in the RAID group. Sniff verify keeps a checkpoint and will resume operation from the checkpoint upon a trespass or a power cycle. The SCSI Verify operation causes the drive to check the media only and does not transfer data to the SP. Thus, for sniff verify operations, the SP does not check the checksums of individual blocks or the stripe-level coherency (stamps, parity).


The sniff rate is the rate at which sniff requests are issued. Prior to R19, the sniff rate was specified on a per-LUN basis in units of 100 ms: a sniff rate of 30 issues requests once every 3 seconds, and a rate of 5 issues requests once every 500 ms, or two per second. The sniff verify rate changed in R19 and later; it is now fixed at 512 KB (1024 blocks) every second. This works out to a rate of approximately 1.8 gigabytes per hour, which would sniff verify a 300 GB drive in about a week.
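The R19 figures above can be sanity-checked with a few lines of arithmetic (a sketch assuming binary units, i.e. 1 GB = 1024³ bytes):

```python
# Sanity-checking the R19 sniff rate: 512 KB (1024 x 512-byte blocks) per second.
KB, GB = 1024, 1024**3

rate_bytes_per_sec = 512 * KB
gb_per_hour = rate_bytes_per_sec * 3600 / GB   # ~1.76, which the guide rounds to 1.8
days_for_300gb = 300 * GB / (rate_bytes_per_sec * 3600) / 24   # ~7 days, "about a week"
```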

Release 11 to Release 17 Sniff Rate Details

Below is a table of the estimated times for various sniff rate settings. These rates are valid for Release 11 to Release 17. Note that we specify size per drive: if the unit is a 4+1 RAID 5, then 1 GB per drive is really a 4 GB unit. Similarly, 72 GB per drive is really a 288 GB 4+1 RAID 5 unit. The formula for the figures shown below is: (size / 64 KB) * time per request

For a rate of 30 and 1 GB to verify: (1 GB / 64 KB) * 3 seconds = 16,384 * 3 = 49,152 seconds = 819 minutes ≈ 13.65 hours

Sniff Verify Rates (estimated)

Size Per Drive   Rate = 100   Rate = 30     Rate = 10     Rate = 5
1 GB             45.5 hours   13.65 hours   4.55 hours    2.28 hours
10 GB            18.96 days   5.69 days     1.90 days     22.76 hours
72 GB            137 days     41 days       13.65 days    6.83 days
100 GB           190 days     57 days       18.96 days    9.48 days
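The estimates above follow directly from the stated formula. A minimal sketch (assuming one 64 KB request per interval and binary GB) that reproduces the worked example and table entries:

```python
def sniff_hours(gb_per_drive, rate):
    """Pre-R19 sniff time estimate: one 64 KB request every (rate x 100 ms)."""
    requests = gb_per_drive * (1024 * 1024 // 64)   # number of 64 KB chunks
    return requests * rate * 0.1 / 3600             # seconds -> hours

sniff_hours(1, 30)        # ~13.65 hours, matching the worked example
sniff_hours(72, 5) / 24   # ~6.83 days, matching the table
```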

By default, sniff verify is enabled on all normally bound units. Sniffing is not enabled by default as part of a Non-Destructive Bind, nor on the Vault LUN. Sniffing may be disabled via the navicli 'setsniffer' command. In Release 11 and earlier, the default sniff rate was 30, and sniffing was not enabled by default on RAID 3 units. In R12 to R17, the default sniff rate is 5 and sniffing is enabled by default on RAID 3 units.

Internal LUNs

The PSM and Vault both have sniffing enabled. These two LUNs are considered "private LUNs" because they are used by the system and are not available to the user. The sniff rate on the PSM and Vault is set according to the LUN capacity; sniffing on the Vault and PSM is scheduled to complete approximately every 4 days.

Idle Hot Spare Sniffing

Hot spares that are not in use are also sniffed, starting with Release 14. Idle in this context means that the hot spare is not swapped in; hot spare sniffing stops as soon as a hot spare gets swapped in. Hot spare sniffing is enabled by default and cannot be disabled by the user in Release 19 and later. Hot spare sniffing runs at a different rate than sniffing of user LUNs: approximately 1.5 GB per hour.

Nonvol Verify

In order to explain nonvol verifies, we should first explain that whenever Flare submits writes to disk, it puts a record of the write range into the nonvolatile RAM part on the SP. If the system undergoes a power failure or if the unit shuts down, these records can be used to identify potentially inconsistent areas of a LUN. A nonvol verify is run when a unit is assigned. Units may be assigned at startup time, at trespass time, or following the restoration of a faulted LUN. Nonvol verifies are not run on RAID 0 or individual disks, because these unit types are not redundant. For example, suppose two drives are removed from a 4+1 RAID 5 which has writes outstanding. When the drives are re-inserted, the LUN will become assigned and a nonvol verify will be performed on the areas for which there are write records. Nonvol verify has the advantage of only verifying the areas affected by outstanding writes rather than verifying the entire capacity of the unit. Unfortunately, the number of failure cases where a nonvol verify may be used is somewhat limited because the nonvol part is per SP. If a LUN trespasses to the peer SP after the failure, then the nonvolatile RAM cannot be used and a full-unit background verify is necessary.
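The write-record idea can be sketched as a simple log (the structures and names below are illustrative, not FLARE's actual data layout): each in-flight write records its LUN/LBA range, and after a failure only the outstanding ranges need verification.

```python
# Hypothetical sketch of nonvol write records.
nonvol_log = set()

def begin_write(lun, lba, blocks):
    """Record the write range before issuing the disk writes."""
    nonvol_log.add((lun, lba, blocks))

def complete_write(lun, lba, blocks):
    """Clear the record once data and parity are both on disk."""
    nonvol_log.discard((lun, lba, blocks))

def regions_to_verify():
    """After a failure, only these ranges need a nonvol verify."""
    return sorted(nonvol_log)

begin_write(3, 0x1000, 128)
begin_write(3, 0x8000, 64)
complete_write(3, 0x1000, 128)
# power failure here: only the outstanding range needs verification
regions = regions_to_verify()
```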


Background Verify

If a failure occurs and a LUN fails over to the peer SP, then a nonvol verify cannot be performed, since the write records needed for the nonvol verify are still on the original SP. Thus, it is necessary to perform a background verify of the entire unit. Background verifies begin at the start of a LUN and verify the entire capacity of the LUN. Some of the failures which could cause a background verify to start are:

• Shutdown of a LUN caused by back-end cable failure
• Shutdown of a LUN caused by drive failures
• SP failure

The background verify is checkpointed, such that if an SP reboot or trespass occurs, the LUN will continue the verify operation from the checkpoint. Background verifies have a LUN-specific priority, which the user can set to ASAP, High, Medium, or Low. As of Release 11, the different background verify priorities correspond to rates of verify: for example, ASAP corresponds to a rate of 1 gigabyte per minute, and High corresponds to 1 gigabyte every 5 minutes. The table below shows estimated times to complete background verifies at the different priorities. The verify times specify a size per drive; for example, a 5-drive RAID 5 of capacity 1 GB has a 1 GB size per drive, and a 4-drive RAID 10 of capacity 2 GB has a 1 GB size per drive.

Background Verify Rates (estimated)

Size Per Drive   ASAP        High        Medium       Low
1 GB             1 min       5 min       10 min       15 min
10 GB            10 min      50 min      100 min      150 min
72 GB            72 min      6 hours     12 hours     18 hours
100 GB           1.6 hours   8.3 hours   16.7 hours   25 hours

In Release 12 to Release 22, all binds except for non-destructive binds will start a background verify as soon as the bind finishes. In R24 and later, the background verify starts after the background zeroing has finished. From Navicli and Navisphere there is an option to override this default behavior so that no initial background verify is performed.

Checkpointing

Both background verifies and sniffs are checkpointed. If the verify should be halted due to an SP reboot or trespass, the verify will resume from the checkpoint. The checkpoint is written out either every 1 minute or every 1% of the LUN, whichever is smaller.

NDB

Prior to Release 19, after a Non Destructive Bind, sniffing was not enabled on the LUN. The reason for this was so that if the NDB had been performed incorrectly, the sniffer would not “correct” any issues with the LUN. In Release 19 and later, since the sniffer does not look at parity or stamps, it was decided to allow sniffing to become enabled following an NDB.

Degraded Units

When a redundant RAID group is degraded, the verify operation on that RAID group halts. After a RAID group completes a rebuild of a drive, sniffing restarts from the beginning of the first LUN in the RAID group. Any outstanding background verifies are cleared when a redundant unit goes degraded, because the rebuild also performs a verify operation. If a rebuilding unit undergoes a failure such that a nonvol or background verify is required, then the rebuild will be restarted from the beginning of the unit. When a RAID group is shut down due to drive failures, verify operations in the RAID group halt; when the RAID group comes back from a shutdown condition, verify operations are restarted.

Trespass

In environments with trespass, we expect sniff or background verifies to make continual progress as long as LUNs are trespassed no more than once every 1.5 minutes. When a trespass does occur on a unit with a sniff or background verify, the verify operation will resume on the peer SP from the verify checkpoint once the LUN becomes enabled on the peer. If a verifying LUN is trespassed more than once every 1.5 minutes, then we cannot guarantee the verify will make progress. If this continual trespassing is a temporary state, the LUN will eventually begin to show progress again as soon as the trespassing slows to a rate of no more than once every 1.5 minutes.

Verify Results

The navicli getsniffer command allows the user to view the status and results from past or current verifies. The command returns the verify results from a single SP; those results show the history of verify activity while the LUN was assigned to that SP. For example, suppose LUN 0 is assigned to SP A and does 5 full sniff passes, then trespasses to SP B and does 20 full sniff passes. Running navicli getsniffer against SP A returns a pass count of 5 for LUN 0, while running it against SP B returns a pass count of 20. To get a coherent picture in an environment with trespassing LUNs, we need to sum the totals of both SPs' verify reports. If the verify report fetched from either SP for a given LUN shows a pass count of more than 0, then the LUN was assigned to that SP at some time in the past when a verify completed.
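The per-SP bookkeeping above means a whole-array view requires summing both reports. A hedged sketch (the dicts are hypothetical parsed results keyed by LUN number, not the actual getsniffer output format):

```python
# Combine per-SP sniff pass counts into a whole-array view.
def total_passes(spa_counts, spb_counts):
    """Sum pass counts reported by SP A and SP B, per LUN."""
    luns = set(spa_counts) | set(spb_counts)
    return {lun: spa_counts.get(lun, 0) + spb_counts.get(lun, 0) for lun in luns}

combined = total_passes({0: 5}, {0: 20})   # the LUN 0 example from the text
```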

The meanings of the verify results errors are as follows:

Checksum Error: The CRC for the sector was not correct. This indicates either a data corruption of some type, or that the sector was intentionally invalidated because of a previous error.

Write Stamp Error: The write stamp for a particular sector doesn't match between the data and parity drive. This is usually caused by an R5 write failing between writing the data and parity. These errors are usually correctable.

Time Stamp Error: The time stamp for a stripe doesn't match across all the drives. Timestamps are used on R3 and R5 (full stripe) writes. Usually caused by a failure while writing the stripe. These errors are usually correctable.

Shed Stamp Error: These are most often fatal. A shed stamp was found in an unexpected place on a unit. Could be caused by a data corruption of some type, or a software error where the rebuild checkpoint was not maintained properly.

Coherency Error: This indicates that although the stamps may match, the parity for a stripe doesn't accurately reflect the data. On R1/R10, this means that the mirrored pairs do not match. On R3/R5, this means that the XOR of all the data drives does not match the contents of the parity drive. Would be seen on R1/R10 units if a write fails before reaching both drives, or on R3/R5 if a write fails before all the data drives and the parity drive are modified. These errors are most often correctable.
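The coherency conditions for each RAID type can be expressed directly. A minimal sketch with hypothetical integer sector values:

```python
from functools import reduce

def coherency_ok_r5(data_sectors, parity_sector):
    """R3/R5: the XOR of all data drives must equal the parity drive."""
    return reduce(lambda a, b: a ^ b, data_sectors) == parity_sector

def coherency_ok_r1(primary_sector, mirror_sector):
    """R1/R10: the mirrored pair must match."""
    return primary_sector == mirror_sector

coherency_ok_r5([1, 2, 3], 0)   # True:  1 ^ 2 ^ 3 == 0
coherency_ok_r5([1, 2, 3], 5)   # False: a write failed mid-stripe
```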

The following information describes how to approach and resolve uncorrectable sector issues. Use the appropriate section depending upon the type of storage environment involved.


CLARiiON stand-alone storage environment

ID: emc48444. NOTE: Always refer to the most current Primus solution; this is provided as reference only.

Interpreting uncorrectable data and parity errors on a CLARiiON array, such as:

• 0x695 Uncorrectable Data Sector
• 0x957 Uncorrectable Data Sector
• 0x953 Uncorrectable Parity Sector
• 0x68A Uncorrectable Parity Sector
• 0x840 Data Sector Invalidated

You may see the array log a 0x695 error, followed by a 0x68A error and then a 0x840 error on multiple drives in the same RAID group. The event codes shown above are logged by FLARE when it is unable to read data from a drive and subsequent attempts to reconstruct the data from other drives in the RAID group have failed. The "Uncorrectable" messages indicate which drive(s) FLARE was unable to successfully read from, and the "Invalidated" messages indicate which drive(s) FLARE then marked as being void of valid information in a specific location. This marking is done by FLARE to ensure that no invalid data will be returned to a host system. Attempts to read from an invalidated location will result in a hard error being returned to the host. Attempts to write to an invalidated location will complete successfully and generally "fill" the void location.

To resolve these types of issues, run a Background Verify on the affected LUN(s) to determine how many uncorrectable sectors there are. If Background Verify reports any uncorrectable sectors, you should recommend that the client attempt to back up the data to determine what files were affected, and then restore any lost files. If this is not possible, or restoration of specific data files is not possible, a sequence of unbinding, rebinding, and restoring all data to the affected LUN(s) will be required.

If a CELERRA file server is attached, contact CELERRA Technical Support (TS) before performing an unbind and rebind sequence. When CLARiiON TS is asked to consult on an investigation by CELERRA TS, CLARiiON TS must provide CELERRA TS with the LUN numbers affected by the errors, the number of such errors found by running Background Verify, and any recommendations for hardware replacements during the recovery process. CLARiiON TS should let CELERRA TS offer the recovery alternatives to the customer, as CELERRA TS has a tool called Volcopy that they may choose to use as an alternative to the normal unbind/rebind option.

Note: If the Background Verify finds any uncorrectable locations, there is a possibility that a small amount of valid data within a RAID 5 or RAID 3 group is unprotected and could be lost on a future drive failure. Thus an unbind/rebind/restore operation is recommended any time a Background Verify identifies uncorrectable locations on a RAID 5 or RAID 3 group, even if the user does not report any data unavailability.

Frequently asked questions:

QUESTION: It is known that the only way to recover from invalidated sector errors is to unbind/rebind the LUN and restore from backup, or to restore/write the specific file that cannot be read. Does Engineering have other ways to recover data if both options are not possible?
ANSWER: There is no other way to recover the data other than by means of a restore operation. Since the uncorrectable data is missing, there is no way of knowing what that data should be in order to write it back out. This is why the sector is 'invalidated' and a hard error gets returned to the host; it is better to return a hard error than incorrect data. See the information regarding the BRT tool.

QUESTION: Is it possible for an invalidated sector to change locations on a disk?
ANSWER: No, an invalidated sector remains invalid at a specific location until repaired by means of a rebind, or written to by a host.

QUESTION: Is there a way of finding out the actual location of an invalidated sector?
ANSWER: It is very difficult to locate the position of an invalidated sector, due to how LUNs are mapped within RAID groups and what information is available through event logs. Even if the specific location of an invalidated sector is determined, there is no way of knowing what data to place into the sector. So any type of recovery effort short of restoring from a customer backup is not provided.

QUESTION: If the invalidated sector does not appear to impact the data area, is there a way to get rid of it without unbinding/rebinding?
ANSWER: Some success has been reported when writing temporary data to fill the LUN and then deleting the temporary data. If the invalidated area is written to with temporary data, the voided location(s) are filled, thus restoring full redundancy to the RAID group. Note: Release 19 patch 30 and later may provide an alternative; contact CLARiiON Technical Support Level 2 for assistance.


ID: emc111779. NOTE: Always refer to the most current Primus solution; this is provided as reference only. This solution provides information regarding RAID logging changes for R19.

Prior to R19 all checksum errors are treated equally. A checksum error that was caused by an intentionally invalidated sector is treated the same as an unexpected checksum error. In R19 the extended status was modified to help identify classes of checksum errors and to help identify and differentiate these scenarios. Prior to R19, the error bits indicate the exact type of either correctable or uncorrectable error. Since the event log messages also indicate correctable or uncorrectable by their event codes, it is somewhat redundant to also contain this information within the extended status. Below is the new definition of extended status codes. This table describes error bit definitions for Release 19. These bits are located in the lower 2 bytes of the second extended status. Error Error Type Meaning Bit 1 unexpected

checksum error Checksum error detected and block does not match any of the “known” invalid sector patterns.

0x0001

2 coherency error

unchanged 0x0002

3 time stamp error

unchanged 0x0004

4 write stamp error

unchanged 0x0008

5 shed stamp error

unchanged 0x0010

6

FLARE RAID invalidated sector error

Checksum error and block does match the FLARE RAID invalid sector format. This means that sometime in the past RAID invalidated this sector because of another adjacent checksum error or media error in the stripe. This could occur because of a media error or checksum error on a different drive during a degraded condition or rebuild. Or this could occur due to detection of multiple checksum media error on a non-degraded unit. Note: On redundant units, this error will only occur in the presence of errors on other drives. The errors logged against other drives will allow us to determine exactly why this drive was invalidated.

0x1000

7 Klondike invalidated sector error

Checksum error and block does match the Klondike invalid sector format. Means that sometime in the past Klondike invalidated the sector. This error is important because it allows us to determine that the block was lost due to a Klondike invalidation.

0x2000

8

FLARE DH invalidated sector error

Checksum error and block does match the FLARE DH invalid sector format. This means that sometime in the past DH invalidated this sector because a media error (fibre drives only) caused a remap (and hence invalidation by the DH). This new error is important because it allows us to determine that the block was lost due to media error returned from drive to the DH.

0x3000

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03 EMC Confidential - Internal Use Only 193

Page 195: 58348378 CL Troubleshooting 2ndEdition B03

EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

9 – FLARE media error invalidated sector (code 0x4000): A sector was invalidated due to a media error. This occurs when RAID receives a media error status from the DH for a particular drive and is unable to reconstruct the block, either because of errors on other drives or because the unit is non-redundant.

10 – FLARE intentionally invalidated sector (code 0x5000): A sector was intentionally invalidated by the array or by an external test tool such as DAQ. This type of sector is written when the corrupt CRC opcode is processed by FLARE. On a customer system, only a layered driver may need to intentionally invalidate blocks when copying data. The cache also invalidates blocks that are lost from the vault using this method.
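The reason codes above can be collected into a small lookup table, e.g. for annotating event-log output during analysis. The codes and names are taken from this guide; the helper function itself is only an illustrative sketch, not a FLARE facility.

```python
# Illustrative lookup of the sector-invalidation reason codes listed above.
# Codes/names are from this guide; the function is a sketch for log analysis.
INVALIDATION_REASONS = {
    0x0008: "write stamp error",
    0x0010: "shed stamp error",
    0x1000: "FLARE RAID invalidated sector error",
    0x2000: "Klondike invalidated sector error",
    0x3000: "FLARE DH invalidated sector error",
    0x4000: "FLARE media error invalidated sector",
    0x5000: "FLARE intentionally invalidated sector",
}

def describe_reason(code: int) -> str:
    """Return a human-readable name for an invalidation reason code."""
    return INVALIDATION_REASONS.get(code, f"unknown reason 0x{code:04x}")

print(describe_reason(0x3000))  # FLARE DH invalidated sector error
```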


New tool – BRT

Starting with R19 base software, a new tool is available to help alleviate, under specific conditions, the reporting of uncorrectable sectors. A second, host-related component of this tool helps the customer identify the corrupted files in order to make a decision on restoration. BRT stands for "Bad Blocks Reporting and Correcting Tool" and is being developed as part of a serviceability initiative to reduce the turnaround time for customer issues and to improve customer responsiveness.

Occasionally CLARiiON arrays experience situations where data from a particular LUN sector cannot be read or reconstructed. This generally happens following the failure of one disk containing part of the LUN and the subsequent discovery of a latent sector error on a second disk in that unit during a normal read or rebuild operation. Array software marks the erring sectors invalid and enters an event log entry to that effect. The log entries specify a disk LBA, generally the beginning of the stripe element containing the error. Using available configuration information, this address can be referenced back to a location in a user-visible LUN to aid in identifying the affected file(s).

The tool interprets array event log and configuration information, generating a file containing a list of affected LUNs and LBA ranges within those LUNs. This file can then be passed as input to a reverse mapping tool (still in development), which queries the file systems of connected hosts to determine which of their files are now possibly corrupt. The tool will also provide options to clean up reported invalidated sectors, preventing potential future uncorrectable conditions. As a combined solution, the typical usage scenario of this tool and the reverse mapping tool is as follows:

1. Run background verify to find all affected LUNs.
2. Generate the LBA report of the affected blocks using the Bad Blocks Reporting and Correcting Tool (BRT).
3. Use the report file as input to the reverse mapping tool.
4. The user can then decide which bad blocks have to be cleaned and edit the LBA report file.
5. Run the 'clean' command using BRT with the (possibly edited) report file as input.
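The report-editing portion of this workflow (steps 2 through 5) can be sketched as follows. The real BRT report format is not documented here, so the line layout, file contents, and both helper functions below are hypothetical illustrations only.

```python
# Illustrative sketch only: parses a hypothetical BRT-style report of affected
# LUNs and LBA ranges (the real BRT file format is not documented in this
# guide), then filters out ranges the administrator decided not to clean.
def parse_report(lines):
    """Each line: '<lun> <start_lba_hex> <end_lba_hex>' -> list of tuples."""
    ranges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        lun, start, end = line.split()
        ranges.append((int(lun), int(start, 16), int(end, 16)))
    return ranges

def keep_ranges(ranges, luns_to_clean):
    """Keep only ranges on LUNs the user chose to clean (workflow step 4)."""
    return [r for r in ranges if r[0] in luns_to_clean]

report = ["# lun start end", "5 21773bc0 21773bff", "7 001000 0010ff"]
print(keep_ranges(parse_report(report), {5}))
```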

It is important to remember that the BRT tool does not repair or restore lost data; an uncorrectable data sector is lost data. Before the introduction of this tool, the events would continually be reported in the SP event log until corrective action was taken. Normally this action was to unbind and rebind the affected LUN. Even if the affected sector was in unused data space, having to rebind and restore data is a major impact on customer operations. The normal course of events prior to R19 is:

1. Run background verify on all LUNs associated with the disk(s) reporting the uncorrectable or invalidated events.
2. Determine whether the data loss is in actual 'used' data space or in 'unused' data space. This can be done by having the customer run a full backup on the affected LUN(s). If no errors are returned, skip to step 4.
3. If there are errors, a rebind/restore operation will have to occur unless the customer can identify the affected data.
   a. If the customer can identify the affected file(s), rewriting the file(s) from a backup will correct the condition.
4. If there are no errors, do one of the following:
   a. Ignore the events; as the 'unused' data space gets written to, new WRITEs will overwrite the bad sector and clear the error condition. Note that every time the verify process attempts a read, or a rebuild occurs in the RAID group, the events will (or could) be reported in the event log. This could potentially fill up the event log and cause dial-home events to occur.
   b. A rebind/restore operation will also clear the condition.

Starting at R19 and going forward, the same steps still apply, but now you can engage Engineering early in the process to see if BRT can be applied. Since BRT will only function under specific conditions, you must engage Engineering for further guidance. For other EMC product environments with CLARiiON, see the following information on how to handle these types of events.


CELERRA storage environment

ID: emc117301
NOTE: Always refer to the most current Primus solution; this is provided as reference only.

This solution provides information on how to recover Celerra file systems with unrecoverable data on the attached backend storage array. For full details of these operations, contact the appropriate level of Technical Support, as the solution contains procedures to be used only by Celerra Technical Support. The error symptoms are as follows.

Specific DART panic:

>>PANIC in file: ../BVolumeIrp.cxx at line: 298 : IO failure despite all retries/failovers

The /nas/log/sys_log (CLARiiON passes backend events to Celerra):

<date/time> NaviEventMonitor:3:3 Backend Event Number 0x953 Host OEM-XOO25IL9VL9 Storage Array APM00041700339 SPA Device Bus 1 Enclosure 1 Disk 8 SoftwareRev 6.19.0 (4.14) Unknown Error 2.19.0.701.5.027 Description Uncorrectable Parity Sector
<date/time> NaviEventMonitor:4:2 Backend Event Number 0x840 Host OEM-XOO25IL9VL9 Storage Array APM00041700339 SPA Device Bus 1 Enclosure 1 Disk 11 SoftwareRev 6.19.0 (4.14) Unknown Error 2.19.0.701.5.027 Description Data Sector Invalidated

CLARiiON SPCollect events

<date/time> Bus1 Enc1 Dsk8 956 Parity Invalidated [vr_rd RAID] 0 21773bc0 12001000
<date/time> Bus1 Enc1 DskB 957 Uncorrectable Sector [vr_rd RAID] 0 21773bc0 12001000
<date/time> Bus1 Enc1 DskB 840 Data Sector Invalidated [vr_rd RAID] 0 21773bc0 12001000

Navicli getsniffer results:

$ ./navicli -h 192.168.1.200 getsniffer 28
Corrected Uncorrectable Checksum errors 0 103

Server log events

CamStatus 84 ScsiStatus 02 Sense 0400 00
<date/time>:CAM:3:I/O Error: c80t1l7 Irp 0x90e11084 CamStatus 0x84 ScsiStatus 0x02 Sense 0x04/0x00/0x00
<date/time>:CAM:3:camFlags 0x50 Addr 0x8d635304 Len 0x1c000
<date/time>:CAM:3:cdb: 28 00 00 c9 02 80 00 00 e0 00 00 00

A fatal event happened on the storage system attached to a Celerra which caused data loss. Such events might be double drive faults in one RAID group or a power outage without battery backup: basically, any event that could cause previously written and committed data to become invalid. On Symmetrix and CLARiiON, invalid data is marked as "bad" and host access is denied.

The problem occurs when there has been a fatal error on the backend that caused data loss. In this case the backend "knows" that the data on the affected sectors/tracks has been changed, but the data itself has been lost. Both CLARiiON and Symmetrix are designed to prevent the client from reading this old/bad data and return a read error instead. The Celerra is designed to trust the backend for data integrity; since the data integrity has been lost, the Data Mover will panic once a read (or write) error occurs. Because a file system check verifies only the logical structure of the file system, not the data within it, it will not find the corrupted data. Any affected file that has lost its data integrity needs to be restored from tape.

The only way for a host to make the data valid again is to overwrite the track/sector with "some" data. Since the NAS code (as with any other operating system) does not know what this data was, it needs to write "zero" data to the block. This is what the special NAS code provides. Once the data has been overwritten, the client is able to access the file again, but from an application point of view the data is probably bad.

ATTENTION! Before carrying out ANY procedure on the Celerra, CLARiiON, or Symmetrix, support MUST verify the backend and correct as much data as possible. Any repair needed to prevent further backend outages must be carried out BEFORE the procedure is executed. The backend needs to be in a healthy state aside from the invalid tracks: faulted drives or other bad backend hardware must be replaced first. On RAID groups, the parity rebuild should finish; on a CLARiiON array, background verify must run. The number of uncorrectable tracks needs to be known before continuing.

REQUIREMENT! The procedure requires special NAS code and a tool provided by engineering. The patch and tool will be delivered by engineering on request; any request MUST be escalated to engineering before the patch can be installed and used.

RESTRICTION! Due to the nature of the problem, there is no guarantee that all file systems and data will be recovered. The recovery procedure carries many risks and includes many manual steps that need to be executed for every affected file system and every affected block, which is time consuming. There is a maximum limit of 50 bad sectors/tracks on ATA and 100 bad sectors/tracks on FC drives per system. If more bad tracks/sectors exist, the customer is encouraged to delete the file systems and restore from tape, since the recovery will require more time than the restore would. Consult with TS2 management if there are objections. The procedure requires a TS2 person dedicated 100% of the time, since the process needs to be watched constantly. The procedure can NOT be carried out on weekends without TS2 management approval.
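The per-system limits above (50 bad sectors/tracks on ATA, 100 on FC) amount to a simple eligibility check before attempting the recovery procedure. The limits come from this guide; the function is only a sketch of that decision.

```python
# Sketch of the recovery-eligibility limit stated above: at most 50 bad
# sectors/tracks on ATA drives and 100 on FC drives per system. Beyond
# those counts, deleting the file systems and restoring from tape is the
# recommended path. Limits are from this guide; the helper is illustrative.
LIMITS = {"ATA": 50, "FC": 100}

def recovery_feasible(drive_type: str, bad_tracks: int) -> bool:
    """True if the manual recovery procedure is within the stated limit."""
    return bad_tracks <= LIMITS[drive_type]

print(recovery_feasible("FC", 80))   # True: within the FC limit
print(recovery_feasible("ATA", 80))  # False: restore from tape instead
```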


CDL storage environment

ID: emc106007
NOTE: Always refer to the most current Primus solution; this is provided as reference only.

This solution provides information on how to address uncorrectable errors on a CLARiiON Disk Library (CDL). Details are not provided for all steps; please engage appropriate Technical Support resources for more detail on usage. The basic steps are:

1. Run Background Verify on all LUNs in the affected RAID group. Refer to solution emc32911.
2. If uncorrectable errors are detected, go to Step 3. If none are detected, you are done.

Note: To perform the steps below, at a point in time before unbinding the LUN from Navisphere (Step 8) you will be required to stop all I/O to the CDL completely, which requires downtime.

3. Warning! If this is not done, there will be data loss. If you have LUN 899 and 900, you are running the old LUN scheme and must look at a getall output in the SPcollect file, for example:

   HLU/ALU Pairs:       HLU = Host LUN
   HLU  ALU             ALU = Array LUN
   ---- ---
   0    4
   1    5
   2    6

   If ALU 5 had the errors, then in Step 4 you would use HLU 1 (LUN 1). If there is no LUN 899 and 900, go to Step 4.
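The HLU-to-ALU translation in step 3 is a reverse lookup over the getall pairs. The pair values below mirror the example in the text; the function itself is just an illustrative sketch.

```python
# Sketch of the HLU -> ALU translation from the getall output in step 3.
# The pair values mirror the example above; the helper is illustrative only.
hlu_to_alu = {0: 4, 1: 5, 2: 6}

def hlu_for_alu(alu):
    """Find which host LUN (HLU) presents a given array LUN (ALU)."""
    for hlu, a in hlu_to_alu.items():
        if a == alu:
            return hlu
    raise KeyError(f"ALU {alu} is not mapped in this storage group")

# If ALU 5 had the errors, use LUN 1 in step 4:
print(hlu_for_alu(5))
```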

4. At the CDL console, check for all tapes that reside on the affected LUN. Follow this path: (Physical Resources) / Storage Devices / Fibre Channel Devices / DGC-RAID / General tab (check SCSI Address: 0:0:0 4 is LUN 4) / Layout tab (VirtualTape-02806 is VID 02806).

5. Map all VID numbers to bar codes or tapes. Follow this path: (Logical Resources) / VirtualTape Library System / pick each library (STK-L180-02789) / Tapes. This will show you the bar code to Virtual Tape ID mappings. Note all virtual tapes that are found.

6. The customer or administrator must perform this step:
   A. Back up all data off the virtual tape(s) found in Step 4.
   B. On the CDL, if a physical tape is connected to the backend, the virtual tape can be moved to vault.
   C. If there is a license for remote copy, it can be used.
   D. The only other way to achieve this is through the backup application software.
   Note: The customer must create new tapes to achieve Step 5, and must then make sure that the new tapes are not created on the affected LUN(s). When you create a virtual tape, you can "uncheck" the affected LUN.

7. The customer or administrator will be asked to perform the following step:
   o Delete all tapes to be unbound from the backup software on the backup server.

8. Delete all tapes to be unbound from the CDL.
9. Discharge the LUN in the CDL console. In the console tree, select Physical Resources / Storage Devices / Fibre Channel Devices / CLARiiON S/N / DGC:RAID 3 (LUN to be unbound), right-click and select Discharge.
10. Unbind the faulted LUN(s) from Navisphere.
11. See solution emc125981.

See solution emc48444 or emc62865 for a description of the errors described in Symptoms.


General Array and Host Attach Related Information

Binding

Binding involves taking a group of one or more disk modules and grouping them into a Logical Unit (LUN). Only after a disk module has been bound into a LUN is its storage space available for host access. A LUN is always created as part of a RAID group. You can create the RAID group explicitly or have it created when you bind the LUN. The RAID group can be any of the following RAID types:

RAID-5 (individual access array)
RAID-3 (parallel access array)
RAID-1 (mirrored pair)
RAID-0 (nonredundant individual access array)
RAID-1/0 (mirrored RAID-0 group)
Individual disk
Hot Spare disk

A LUN created by the binding process is given a unique identifying integer. There is also a RAID group ID, assigned when the RAID group is created. A LUN can be UNBOUND, upon which all knowledge of the LUN is removed from the SP's databases and all host data on the LUN is destroyed. After all LUNs in a RAID Group have been unbound, the RAID Group itself can be removed. After a LUN is unbound, the disk modules that made up that LUN are free to be bound into new LUNs (added to existing RAID groups or bound into new RAID groups of any type).

* For ‘Default Owner’ detail, see Initial Assignment below.
* For ‘Enable Auto Assign’ detail, see Auto-Assignment below.

The model describing how LUNs are assigned to and ‘owned’ by each SP is called “LUN OWNERSHIP ACCESS”. It is sometimes referred to as an ‘active/passive’ ownership model. This differs from a DMX environment, which is known as ‘active/active’ or Dual-Simultaneous Access; that design allows multiple interfaces to a logical device equal access to the logical device.


The LUN Ownership Access (CLARiiON) model allows access to LUNs through only one path at a time. This access model requires a trespass command to the LUN to move ownership from one SP to the other SP. If there are multiple interfaces to a logical device, one of them is designated as the primary route to the LUN device. Host I/O is not directed to paths connected to a non-assigned interface, meaning paths to the non-owning SP. Normal access to a device through any interface on an SP other than the assigned one is not possible. In the event of a failure (of a storage processor or of all paths to an SP), logical devices or LUNs must be moved to the other SP. If an interface card fails, logical devices are reassigned from the broken interface to another interface. External failover software (e.g., EMC PowerPath or Veritas DMP) instructs the storage system to initiate this reassignment (known as trespassing). After devices are trespassed, data is sent via the new route to an SP. To understand how ownership is handled, the following information is provided.

Assignment

Assignment is the process by which a given SP is given EXCLUSIVE ownership of a given LUN. The responsibility of ownership is primarily the maintenance of data/parity integrity; in a dual-ported environment it is essential that only ONE SP access any LUN at one time. Assignment enforces this by denying access to the LUN through the SP that does NOT have the LUN assigned. The SP does not require that an explicit assign command be issued to it; LUNs are assigned by the SP as part of the SP's power-up process. This assignment activity is called INITIAL Assign. There are two methods by which a host may alter the default Initial Assignment of any LUN: AUTO-ASSIGNMENT and TRESPASS. These methods are discussed in more detail below.

Initial Assignment

At the time a LUN is bound, one of the two SPs is identified as the default owner. At power-up time, the SP will assign all LUNs that it owns by default. This process is called Initial Assignment. The LUN will remain assigned to this SP until changed by a Trespass command from the host (discussed later), or by the fault/removal of this SP in a dual-SP cabinet (involving Auto-Assignment, discussed later), or when the default SP owner for the LUN is changed (via a serial port or SCSI command) AND the cabinet is power-cycled. Thus, Initial Assignment is the process the SP performs at power-up to assign all LUNs for which it is the default owner.

The concept of Initial Assignment, along with default ownership of a LUN, exists only to provide the means to decide at POWER-UP time which SP owns which LUNs. The default owner of a LUN is the SP that assumes ownership of the LUN when the storage system is powered up. If the storage system has two SPs, you can choose to bind some LUNs using one SP as the default owner and the rest using the other SP as the default owner, or you can select Auto, which tries to divide the LUNs equally between SPs. The primary route to a LUN is the route through the SP that is its default owner, and the secondary route is through the other SP. If you do not specifically select one of the Default Owner values, default LUN owners are assigned according to RAID Group IDs as follows:

RAID Group ID    Default LUN owner
Odd numbered     SP A
Even numbered    SP B

The default owner property is unavailable for a Hot Spare LUN.
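The default-owner rule above reduces to a parity check on the RAID group ID; a minimal sketch (the function name is illustrative, not a FLARE interface):

```python
# Sketch of the default-owner rule described above: when no Default Owner is
# explicitly selected, odd-numbered RAID groups default to SP A and
# even-numbered RAID groups to SP B. (Hot spares have no default owner.)
def default_owner(raid_group_id: int) -> str:
    return "SP A" if raid_group_id % 2 == 1 else "SP B"

print(default_owner(3))  # SP A
print(default_owner(8))  # SP B
```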

Auto-Assignment

Auto-Assignment occurs when an initiator issues an I/O request to an SP for a LUN that the SP does NOT currently have assigned. Since assignment enforces exclusive access to a LUN, the SP receiving such an I/O will attempt to assign the LUN in order to service the request. If the peer SP is present in the cabinet and has the LUN assigned, the assign attempt is denied and the I/O returns an error. Otherwise, if the assign attempt is not denied by a peer SP, this SP assigns the LUN to itself and is now the exclusive owner until power-cycled or until the LUN is trespassed. Performing an Auto-Assignment does NOT affect the default ownership of the LUN; thus its effect does not survive a power cycle of the SP. Auto-Assignment can be useful for maintaining access to LUNs that were assigned to an SP that has failed. By default, Auto-Assignment is disabled.


The auto assign option for a LUN should be enabled only if the connected host does not use failover software. The auto assign property is ignored when the storage system's failover mode for an initiator is set to 1; this property will not interfere with PowerPath's control of a LUN. The option enables or disables (default) auto assign for a LUN.

The purpose of auto assign is to control the ownership of the LUN when an SP fails in a storage system with two SPs. You enable or disable auto assign for a LUN when you bind it; you can also enable or disable it after the LUN is bound without affecting the data on it. With auto assign enabled, if the SP that owns the LUN fails and the server tries to access that LUN through the second SP, the second SP assumes ownership of the LUN to enable access. The second SP continues to own the LUN until the failed SP is replaced and the storage system is powered up; then ownership of the LUN returns to its default owner. If auto assign is disabled in this situation, the second SP does not assume ownership of the LUN, and access to the LUN does not occur.

If you are running failover software on a server connected to the LUNs in a storage system, you must disable auto assign for all LUNs that you want the software to fail over when an SP fails. In this situation the failover software, not auto assign, controls ownership of the LUN in a storage system with two SPs. The auto assign property is not available for a Hot Spare LUN.

Failover Feature (relative to Auto-Assign, not trespassing)

If an SP in a dual-SP storage system fails, any LUN that was assigned to the failed SP can be accessed through the surviving peer SP via the FAILOVER feature. All a host need do is direct all I/O to the surviving peer SP. Upon receiving an I/O for such a LUN, the peer SP will attempt an Auto-Assignment, which should be successful given that the original SP has failed and is not capable of preventing the assign. Once the Auto-Assignment is completed by the peer, the failover process is complete, leaving the LUN fully accessible via the peer SP. As with any assignment, the failover remains in force until a power-cycle occurs or a Trespass command supersedes it. Thus if the defective SP is replaced and turned on, it will NOT be able to Initial Assign the LUN because the peer SP now owns it. Therefore, when powering up a replacement SP, it is suggested that you power-cycle the peer or use Trespass commands to return the assignment scheme to its Initial Assignments.

Trespass

A Trespass operation forcibly transfers ownership of a LUN to the other SP. You can transfer ownership of one LUN, all LUNs in the storage system, or all LUNs in a RAID group. When a Trespass command is issued to an SP that does not currently have the LUN assigned, it will take the LUN away from the SP that does have it assigned. The SP that is losing ownership of the LUN will complete all I/O that has already started on the LUN and abort all I/O that has not yet started. If the SP that lost the LUN receives a Trespass command from a host for that LUN, it will take the LUN back just as it had lost it. Performing a Trespass command does NOT change the default owner of the LUN; thus a Trespass does not survive a power-cycle of the SP.

Auto-Trespass

The CLARiiON system can also be set up to perform automatic Trespass operations; changing the system parameters through Navisphere can enable this feature. When Auto-Trespass is enabled on a unit, a Trespass automatically occurs whenever an initiator issues an I/O request to an SP that does NOT currently have the unit assigned. This behaves exactly as if the SP had actually received a Trespass command. This feature allows a system to have high-availability access to disk units without implementing the Trespass command. Note that this is a feature with a very significant performance impact and should be used with extreme caution. If units are accessed repeatedly through the separate SPs, each access will cause a Trespass operation, and there will be a severe impact as the SPs go through the Trespass operations.
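The active/passive rules in this section (exclusive assignment, trespass, and the auto-trespass "ping-pong" hazard) can be modeled as a small state machine. This is purely an illustrative model, not array code; class and method names are invented for the sketch.

```python
# Minimal model of the active/passive ownership rules described above:
# I/O succeeds only on the owning SP, a trespass moves ownership, and with
# auto-trespass enabled an I/O to the non-owning SP trespasses implicitly,
# which is why alternating I/O across SPs "ping-pongs" the LUN.
class Lun:
    def __init__(self, default_owner="SPA", auto_trespass=False):
        self.owner = default_owner          # initial assignment at power-up
        self.auto_trespass = auto_trespass
        self.trespass_count = 0

    def trespass(self, to_sp):
        """Forcibly transfer ownership (does not change the default owner)."""
        if self.owner != to_sp:
            self.owner = to_sp
            self.trespass_count += 1

    def io(self, via_sp):
        if self.owner != via_sp:
            if not self.auto_trespass:
                return "error: LUN not assigned to this SP"
            self.trespass(via_sp)           # implicit trespass on I/O
        return "ok"

lun = Lun(auto_trespass=True)
for sp in ["SPA", "SPB", "SPA", "SPB"]:     # alternating access pattern
    lun.io(sp)
print(lun.trespass_count)  # 3 trespasses for 4 I/Os: the performance hazard
```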


Storage Groups

Once LUNs have been created and the correct properties applied, a Storage Group is needed to make the LUNs accessible to a host. A Storage Group (SG) is one or more LUNs (logical units) within a disk-array storage system that is reserved for one or more servers and inaccessible to other servers. Common synonyms for an SG are Virtual Array and Virtual Target; industry literature often refers to the SG capability as LUN Masking. You can define your own SGs, and there are always two predefined storage groups. To define an SG when you configure a storage system, you specify servers and the SG(s) each server can read from and/or write to. The Licensed Internal Code firmware running in each storage system enforces the server-to-SG permissions.

SGs are primarily designed for use in a Storage Area Network (SAN), where servers are connected to storage systems via Fibre Channel switches. The figure shows a simple SAN configuration consisting of one storage system with two SGs: one SG is used by a cluster of two Windows NT hosts, and the other is used by a UNIX database server. Each server is configured with two independent paths to its data, including separate HBAs, switches, and SPs, so there is no single point of failure for access to its data.

Setting Up Storage Groups (SGs)

When anyone binds a LUN, the Core Software creates a World Wide Name (WWN) and attaches the WWN to the LUN. The Set LUN Name feature provides the capability to attach a "nice" character or byte string name to a physical LUN, also known as the LUN Nice Name. After binding the desired number and types of LUNs, the user may define any number of storage groups (up to the maximum supported). Remember that a storage group is simply a structure mapping a group of physical LUNs to a set of virtual LUN numbers (VLUs). The SG is assigned a unique World Wide Name by the Core Software when it is initially set up. This name stays with the SG through its lifetime and is used to generate the mapping between initiator/port and the SG. The user may also specify a "nice" character string name for the SG for Core Software/LIC to store with the other information.

Special (predefined) Storage Groups

Two special (predefined) Storage Groups are automatically available. The Management SG provides no LUN mappings. If the LUNZ (arraycommpath) option is enabled, a "fake" LUN 0 is created to allow configuration commands to be sent to the array; this functionality is separate from the Management SG, in that you can have LUNZ with the Physical or a User-Defined SG. The Physical Storage Group provides a virtual-equals-real mapping for all LUNs as they are bound in the system.
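The storage-group concept above (a mapping from virtual LUN numbers to physical LUNs, with each initiator tied to one SG) can be sketched as two lookups. All names, WWNs, and LUN numbers below are invented example values; this models LUN masking, not the actual Core Software data structures.

```python
# Illustrative model of LUN masking with storage groups: a storage group maps
# virtual LUN numbers (what the host sees) to bound physical LUNs, and each
# registered initiator/port entity accesses exactly one SG at a time.
# All WWNs and LUN numbers are made-up example values.
storage_groups = {
    "SG_NT_Cluster": {0: 4, 1: 5},   # VLU -> physical LUN
    "SG_UnixDB":     {0: 6},
}
initiator_to_sg = {
    "20:00:00:00:c9:aa:bb:cc": "SG_NT_Cluster",
    "20:00:00:00:c9:dd:ee:ff": "SG_UnixDB",
}

def resolve(initiator_wwn, vlu):
    """Return the physical LUN an initiator reaches through a virtual LUN,
    or None if the storage group masks it (or the initiator is undefined)."""
    sg = initiator_to_sg.get(initiator_wwn)
    if sg is None:
        return None
    return storage_groups[sg].get(vlu)

print(resolve("20:00:00:00:c9:dd:ee:ff", 0))  # 6
print(resolve("20:00:00:00:c9:dd:ee:ff", 1))  # None: masked for this host
```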


Default Storage Group

A system will have a "default" Storage Group to which undefined initiators that happen to log in are mapped. The user may set up a new Storage Group and designate it as the default SG at any time; either of the two special SGs can be made the default as well.

Defining Initiators

Once Storage Groups are set up, they may be mapped for access by initiators. Information about initiators that have been previously defined or are logged in may be obtained from the storage system. The user can use this and other information to define which storage group each initiator may access. It is not necessary for an initiator to be logged in to be defined, as long as the user has enough information to generate its description block; an initiator so defined will automatically be mapped to the correct Storage Group as soon as it logs in. Along with the World Wide Name of the desired Storage Group, the Core Software/LIC stores user-provided host data, system type information, interface options required by that initiator, and the storage-system port the initiator is expected to communicate through. Each initiator/port entity may access only one Storage Group at a time.

Heterogeneous Hosts

To provide heterogeneous multiple-host support, all Core Software options have been divided into two categories:

System-Type options
Initiator-Type options

Initiator types allow each initiator to have its own set of options and independent, specific behavior. System-Type options can be set independently for storage systems and initiators. An initiator can have either a System type or an Initiator type. You can change an Initiator type; doing so aborts all I/O for that initiator. Initiator-type options are as follows:

Substitute Queue busy for Queue full
Recovered errors reporting
Mode page 0x08 support
Auto Trespass Mode Setting
Inquiry Page 80 returns Array or LUN Serial Number
LUN_Z reporting support

These options can be configured at the System level and at the Initiator level; all other options are configurable at the System level only. At the Initiator level, these options are set automatically depending on the Initiator type. However, you can also set/reset the option Substitute Queue busy for Queue full at the user level, regardless of the Initiator type.

NOTE: System-type options are generally used when AccessLogix is not in use; the default system-type setting is '3'. System-type usage and settings are not discussed here. The information being provided is for environments in which AccessLogix is in use; therefore, the following pertains to the initiator-type (registration) settings.
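The two-level scheme above (system-level defaults, with initiator-level settings taking precedence for the listed options) can be modeled as a layered lookup. The option names and WWN keys below are shortened, invented placeholders, not actual FLARE option identifiers.

```python
# Sketch of the two-level option scheme described above: an initiator-level
# setting, where present, overrides the system-level default. Option names
# and the initiator key are invented placeholders for illustration.
system_options = {"auto_trespass": False, "queue_busy_for_full": False}
initiator_options = {"hp_host_wwn": {"auto_trespass": True}}

def effective_option(initiator, name):
    """Resolve an option: initiator-level value wins over the system default."""
    per_initiator = initiator_options.get(initiator, {})
    return per_initiator.get(name, system_options[name])

print(effective_option("hp_host_wwn", "auto_trespass"))  # True (initiator level)
print(effective_option("other_host", "auto_trespass"))   # False (system default)
```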


(See PRIMUS solution emc99467 for more information on these settings.)

Initiator type (registration) values:

3 = CLARiiON Open (default SCSI-3 Initiator Type)
2 = HP Auto Trespass
10 = HP No Auto Trespass
9 = SGI
22 = Fujitsu Siemens
28 = Compaq/Tru64
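The registration values above can be kept as a lookup table, for example when reading initiator records out of an SPCollect. The codes and names are from this guide; the auto-trespass flag reflects the HP type descriptions that follow, and is assumed disabled for the other types. The helper is a sketch only.

```python
# Lookup of the initiator-type registration values listed above. Codes and
# names are from this guide; only type 2 enables auto trespass per the HP
# descriptions, and the flag is assumed False for the remaining types.
INITIATOR_TYPES = {
    3:  ("CLARiiON Open", False),
    2:  ("HP Auto Trespass", True),
    10: ("HP No Auto Trespass", False),
    9:  ("SGI", False),
    22: ("Fujitsu Siemens", False),
    28: ("Compaq/Tru64", False),
}

def describe(type_code: int) -> str:
    name, auto_trespass = INITIATOR_TYPES[type_code]
    state = "enabled" if auto_trespass else "disabled"
    return f"{name} (auto trespass {state})"

print(describe(2))
print(describe(10))
```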

The initiator setup command defines or modifies permanent initiator information held by the storage system. An initiator can have either a temporary or a permanent record. The Core Software automatically creates a temporary record for an undefined initiator that logs in (Fibre Channel). When any information is explicitly added or modified by actions associated with this command, the record is made permanent and survives power cycles and restarts. Core Software defines the operating modes collectively known as Initiator Types (in Classic FLARE, these were known as System Types). The current list of initiator types is as follows:

CLARiiON Open - This initiator type (3) extends the default initiator definition as follows:

Enables the SCSI-3 Interface Reporting definition.

HP (HP with Auto Trespass) - This initiator type (2) extends the default initiator definition as follows:

Mode Sense/Select page 0x08 is supported
A special HP Inquiry block is returned
2015 (Host Broken Unit) and 2016 (Host Bad FRU Signature) errors are returned as Selection Timeout Errors, rather than Hardware Errors (04/00/00)
02/04/03 (Not Ready – LUN Down) errors are reported as 0B/00/00 (Aborted Command) errors
Auto Assign Error codes are suppressed
SCSI-3 Interface Reporting is enabled
Volume Set Addressing is accepted by the Fibre Front End Driver
Auto Trespass is enabled for all units


Auto Trespass is enabled on a system-wide basis (for the storage group), rather than on a per-unit basis
No Unit Attention conditions will be set as a result of an Auto Trespass operation

Auto Trespass Operations

The Auto-trespass feature is required for use with HPUX PVLINKS failover feature. It provides functionality for the HPUX host to provide primary and secondary paths to the array. See PRIMUS solution emc56730 “HP Best Practices for CLARiiON arrays” for more information on this subject. HP (HP without Auto Trespass) - This initiator type (10) extends the default initiator definition as follows:

Mode Sense/Select page 0x08 is supported A Special HP Inquiry block is returned 2015 (Host Broken Unit) and 2016 (Host Bad FRU Signature) errors are returned as Selection Timeout Errors,

rather than Hardware Errors (04/00/00). 02/04/03 (Not Ready – LUN Down) errors are reported as 0B/00/00 (Aborted Command) errors. Auto Assign Error codes are suppressed SCSI-3 Interface Reporting is enabled Volume Set Addressing will be accepted by the Fibre Front End Driver.

SGI - This initiator type (9) extends the default initiator definition as follows:

• SGI Inquiry Reporting is enabled
  o Vendor Identification "SGI"
  o "K5 CONTROLLER 001"
• Fibre channel Soft Addressing is disabled

Fujitsu-Siemens - This initiator type (22) extends the default initiator definition as follows:

• Fujitsu Inquiry Reporting is enabled
  o Vendor Identification "FSC FC47" (FC4700 SPs)
  o Vendor Identification "FSC FC66" (all CX series SPs)

Compaq Tru64 - This initiator type (28) extends the default initiator definition as follows:

• Mode Sense/Select page 0x08 is supported
• Sense Data
  o For any command that requires LUN ownership, if the unit isn't assigned on the SP, a 02/04/02 (LUN Not Ready, Initialization Command Required) error will be returned.
  o For any non-Inquiry command received on a quiescing SP (during NDUs), a 02/04/02 (LUN Not Ready, Initialization Command Required) error will be returned.
• Inquiry
  o For existing LUNs, byte 0 is either 0x00 (assigned to this controller) or 0x20 (assigned to the other controller, or owned by this SP with an NDU in progress).
  o For non-existent LUNs, byte 0 is 0x7F.
  o The default Inquiry response is in standard SCSI-2 format.
  o Inquiry page 0x83 (Vital Product Data page) returns only the LUN's WWN, and not the second LUN/VLU mapping information report.
• Start/Stop Unit
  o When the Start Bit = 1, the command causes a LUN trespass operation to the controller the command was sent to.
• Report Device UUID
  o This command returns a Unique Unit Identifier for the logical unit. The Device Identifier is the unique value assigned to the logical unit (e.g. Base Address + Offset), and is used by the Tru64 OS to assist in logical unit management and by the Compaq PROM code for booting support.
  o The UUID is required to be unique across all LUNs in a cluster, which can include multiple storage systems.


Storage system type   Max Base UUID
CX3-20                31181
CX3-40                31343
CX3-80                30319
CX200                 32511
CX400                 32155
CX500                 31643
CX600                 31443
CX700                 30419
FC4700                32243

Base UUID is supported for Tru64 servers only and is required to support UUID (Unique Unit Identifier) reporting for Tru64 host systems. The Unique Unit Identifier is required to be unique across all LUNs in a cluster, which can include multiple storage systems. The UUID consists of the Base UUID and the HLU ID added together:

Base UUID - the Base Address, which can be set by the user. You determine the Base UUID for a LUN by subtracting the maximum number of LUNs supported by the storage system from the Base UUID value. The maximum Base UUID values depend on the storage system, as shown above.
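As a small worked illustration of that addition, the sketch below computes a UUID from a Base UUID and an HLU ID; `tru64_uuid` is an illustrative helper name, not an EMC function:

```python
# Hedged sketch of the UUID arithmetic described above; tru64_uuid is an
# illustrative helper, not an EMC API.

def tru64_uuid(base_uuid: int, hlu_id: int) -> int:
    """UUID = Base UUID + HLU ID (the LUN's offset in its Storage Group)."""
    return base_uuid + hlu_id

print(tru64_uuid(12000, 6))  # -> 12006
```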

Host LUN ID (HLU ID) - An offset for each logical unit in a storage system, determined by its logical unit position in the Storage Group. For example, if the Base UUID value for a LUN is determined to be 12000 and the HLU ID is 6, the UUID for this LUN is 12006.

Arraycommpath
0 or Disabled = LUN_Z disabled
1 or Enabled = LUN_Z enabled

LUNZ is an initiator option of the SP, where the SP reports a non-existent logical unit zero to the initiator. This logical unit has no physical attributes or size; instead, it provides a target to which host operating systems can send configuration commands to initialize and configure the subsystem. When the LUNZ initiator option is set and no logical unit zero exists, the following changes are made to the SCSI interface:

• An Inquiry command to LUN 0 will return a Peripheral Qualifier of zero (device ready and on-line).
• Inquiry Vital Product Data pages will return slightly different information for LUN 0.
• A Report LUNs command will always include a Logical Unit 0 in the returned data block.
• A Test Unit Ready command to LUN 0 will return a 02/04/03 error.
• A Start/Stop command to LUN 0 will return a 02/04/03 error.

Failovermode
0 = LUN-Based Trespass
1 = PNR (Passive Not Ready)
2 = DMP Mode
3 = PAR (Passive Always Ready)

The base software uses a "LUN Ownership Model", where a logical unit is "owned" by one of the SPs and the owning SP provides access to the logical unit for all initiators. However, to support various host operating systems and failover models, the LUN availability reporting from the non-owning SP can be modified, based on the "Auto Trespass" field setting.

LUN Based Trespass Mode (failovermode 0)
In this, the default mode of operation, the LUN ownership reporting mode is set on a per-LUN basis. If Auto Trespass mode is set for the logical unit, it acts as described in the Auto Trespass section. If Auto Trespass mode isn't set for the logical unit, it acts as described in the Manual Trespass section. This mode is used primarily in HP-UX and Tru64 environments, and is most commonly seen when HP-UX PVLinks failover is in use.

Passive Not Ready (PNR) Mode (failovermode 1)
In this mode of operation, the non-owning SP will report that all non-owned logical units exist and are available for access. Any media access commands directed to the non-owning SP will be rejected with a 02/04/03 error, which differs from the Manual Trespass model. Ownership of the logical unit can only be changed via a manual trespass operation. This failovermode option is used primarily with PowerPath.


With PowerPath in a CLARiiON environment, a path set includes all logical paths connected to the assigned storage processor (active paths), as well as all logical paths connected to the unassigned storage processor (passive paths). Once the path set is created, PowerPath can use any of the active paths in the set to service an I/O request. If a path fails, PowerPath can redirect an I/O request from that path to any other active path in the set. Redirection is transparent to the application, which is not aware of the error on the initial path and therefore does not receive an error. If all active paths in the set (paths to the assigned storage processor) fail, PowerPath initiates a trespass and the passive paths become active. This trespass is also transparent to the application.

With PowerPath, the application is protected from having to deal with an I/O error. The following happens:

1. PowerPath intercepts these I/O requests. (Actually, PowerPath is always intercepting I/O to these devices.)
2. It determines that a path is dead, either from knowledge obtained while processing prior I/O requests, or from having tried the latest requests down the failed path.
3. It selects an alternate path, based on the failover policy selected for this host.
4. It then retries the requests down the selected alternate path. If all goes well, the results are sent back up-stack to the application.

DMP Mode (failovermode 2)
In this mode of operation, similar to Auto Trespass mode, the non-owning SP will report that all non-owned logical units exist and are available for access. Any I/O commands directed to the non-owning SP will cause the LUN to trespass to the SP the command was received on. However, in this mode, Unit Attention conditions are not reported to the initiator when a logical unit is trespassed to or from the SP. This suppression of Unit Attention reporting is what differentiates this mode from the Auto Trespass mode of operation. I/O performance in this mode can be severely limited if the initiator is sending commands to both SPs, as the logical unit will trespass back and forth between SPs to satisfy all the I/O requests. In this mode, it is the execution of an I/O down the non-owning path that causes a LUN to change ownership, NOT a trespass command. DMP on the host side handles pathing of the I/O. The flow of event notification during I/O failure is as follows:

1. The device driver/HBA controller issues SCSI-3 commands to the disk, and a command fails.
2. The OS kernel's attempt to issue the write fails, generating an error.
3. Based on the error code, the DMP driver takes the correct action and fails the I/O over to the alternate HBA.
4. The I/O is reissued down the alternate path.
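The failover flows above, PowerPath and DMP alike, boil down to "try a path, mark it dead on error, retry down an alternate". The sketch below illustrates that idea only; class and function names are hypothetical, and real multipath drivers do this in the kernel, not like this:

```python
# Hedged sketch of the multipath retry loop described above; not PowerPath
# or DMP code. PathSet, issue_io and the path names are all hypothetical.

class PathSet:
    def __init__(self, paths):
        self.paths = list(paths)   # candidate paths, in policy order
        self.dead = set()          # paths already known to have failed

    def submit(self, io, issue_io):
        """Try each live path in turn; mark failures dead; raise if all fail."""
        for path in self.paths:
            if path in self.dead:
                continue
            try:
                return issue_io(path, io)   # success: results go back up-stack
            except IOError:
                self.dead.add(path)         # remember the failed path
        raise IOError("all paths failed")   # the disk is considered failed

# Usage: the second path succeeds after the first raises.
def issue_io(path, io):
    if path == "hba0":
        raise IOError("dead path")
    return f"{io} ok via {path}"

ps = PathSet(["hba0", "hba1"])
print(ps.submit("write", issue_io))  # -> write ok via hba1
```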

Veritas refers to a CLARiiON storage array as an "Active/Passive" device within the framework of its failover software, DMP (Dynamic Multipathing), in which controllers "own" logical units (LUNs)/disks. Only the controller that "owns" a LUN issues I/Os to it. The controller that "owns" a LUN is called the "primary" path, and the alternate controller is termed the "secondary" path to the LUN. Since accessing a disk/LUN through multiple available paths is not allowed on such disk arrays, if a LUN in this type of environment is accessed simultaneously via multiple paths, the "ownership" of the LUN shifts back and forth across the controllers, causing a "ping-pong" effect. Because changing ownership can be a time-consuming operation, this can cause immense performance degradation. For CLARiiON disk arrays, the DMP policy is to use the available primary path as long as it is accessible; DMP shifts I/Os to the secondary path only when the primary path fails. If a path fails, alternate paths are used. If all access paths have failed, the disk is considered to have failed. A disk driver failure will not cause a DMP failure.


The DMP driver itself does not take any time to switch over. However, the total time for failover depends on how long the underlying disk driver retries the command before giving up. On HP, the user can set the PowerFail timeout value for that disk. DMP allows the administrator to indicate to the DMP subsystem in Volume Manager whether a connection has been repaired or restored. This is called DMP reconfiguration, and the procedure also allows detection of newly added devices, as well as devices that are removed after the system is fully booted (if the operating system detects them properly). Note that a LUN moving between SPs in this mode is not trespassing in the technical sense, but is changing ownership due to I/O occurring down the non-owning SP path. The end result is the appearance of a trespass (606) event in older versions of base software. To keep a high number of events from 'flooding' an SP event log, DMP-induced LUN 'trespasses' are suppressed and are only reported in the ktrace file (kt_std). This mode is used primarily with Veritas/DMP failover software; beginning with VERITAS 4.1 and the supported base software versions, failovermode can be set to 1, which eliminates the high numbers of DMP-related LUN 'trespasses'. See Primus solution emc127913 for more information.

Passive Always Ready (PAR) Mode (failovermode 3)
In this mode of operation, the non-owning SP will report that all non-owned logical units exist and are available for access. A Test Unit Ready command will always return success, even on the non-owning SP, unless a Group Reservation is in effect for the logical unit; in that case, a Test Unit Ready command sent to the non-owning SP will return a RESERVATION CONFLICT status. Any media access commands directed to the non-owning SP will be rejected with a 02/68/00 error, which differs from the Manual Trespass model. Ownership of the logical unit can only be changed via a manual trespass operation. This mode is used in AIX environments where PowerPath is in use and online NDUs are required. See Primus solution emc67186 for more information.

Logical Unit Serial Number Reporting

LUN = Returns a derived value for the Serial Number
ARRAY = Returns the array Serial Number

In the SCSI-3 interface, the Unit Serial Number page is used to report the "serial number for the target or logical unit". Since this is an ambiguous definition, the standard LIC code returns the array serial number in this page. However, since some host operating systems require a unique value to be returned in this page, the LIC supports a "Report LUN Serial Number" option. When this option is enabled, the LIC code returns a unique logical unit serial number in this page. This value is a mathematical combination of the array serial number and the World Wide Name of the logical unit. This option can also be controlled from the Navisphere CLI utility, using the UnitSerialNumber parameter. The help information for Unit Serial Number indicates that it reports the serial number on a per-LUN or per-array basis to the host operating system. Unit Serial Number should always be set to Array, except for Solaris. The Solaris Unit Serial Number setting depends on the configuration. Select the proper setting as follows: For Solaris 2.6, 2.7, and 8 with DMP or PowerPath, determine if any of the following packages are present:

• Solstice DiskSuite (SUNWmdr)
• SunCluster 3.x (SUNWscr)
• VERITAS DBE/AC (VRTSdbac)
• SFRAC (VRTSdbac)

- If none of the packages listed above are present, then Unit Serial Number must be set to ARRAY.
- If any of the packages listed above are present, then set Unit Serial Number as follows:
  • VRTSvcs with I/O fencing: Unit Serial Number must be set to LUN
  • VRTSvcs without I/O fencing: Unit Serial Number must be set to ARRAY

I/O fencing is enabled when all of the following are true: the /etc/VRTSvcs/conf/config/main.cf file contains the line "UseFence = SCSI3"; the /etc/vxfentab file is not empty; and the command /sbin/vxfenadm -g all -f /etc/vxfentab returns at least one SCSI-3 key. For Solaris 9, Unit Serial Number must be set to ARRAY.
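As a rough illustration, the first two of those checks could be scripted. This is a hedged sketch only: it uses plain substring matching rather than real VCS configuration parsing, does not run vxfenadm, and `fencing_configured` is an illustrative name, not a VERITAS utility:

```python
# Hedged sketch of the first two I/O-fencing indicators listed above.
# Simple substring matching, no VCS config grammar; does not run vxfenadm.
import os

def fencing_configured(main_cf="/etc/VRTSvcs/conf/config/main.cf",
                       vxfentab="/etc/vxfentab"):
    """True if main.cf requests SCSI-3 fencing and vxfentab is non-empty."""
    try:
        with open(main_cf) as f:
            uses_fence = any("UseFence = SCSI3" in line for line in f)
    except OSError:
        return False
    try:
        return uses_fence and os.path.getsize(vxfentab) > 0
    except OSError:
        return False
```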


ID: emc99467
NOTE: Always refer to the most current Primus solution; this is provided as reference only. This solution provides the correct parameter settings.

AIX
Parameter          PowerPath                      DMP (AIX 5.1, 5.2, 5.3 only)
Initiator Type     3 (CLARiiON Open)              3 (CLARiiON Open)
Arraycommpath      0 (Disabled) or 1 (Enabled)¹   1 (Enabled)
Failovermode       1⁶                             2
UnitSerialNumber   Array                          Array

HP-UX
Parameter                            PVLinks                        No PVLinks                     PowerPath             DMP (HP-UX 11i only)
Initiator Type (AccessLogix)         HP auto trespass               HP no auto trespass            HP no auto trespass   HP no auto trespass
Array Systemtype (non-AccessLogix)   2                              decimal 10 (hex a)             decimal 10 (hex a)    N/A
Arraycommpath                        0 (Disabled) or 1 (Enabled)²   0 (Disabled) or 1 (Enabled)²   1 (Enabled)           1 (Enabled)
Failovermode                         0                              0                              1                     2
UnitSerialNumber                     LUN or Array³                  LUN or Array³                  LUN or Array³         LUN or Array³

IRIX
Parameter          Native Failover
Initiator Type     9 (SGI)
Arraycommpath      0
Failovermode       0
UnitSerialNumber   Array

NetWare
Parameter          PowerPath
Initiator Type     3 (CLARiiON Open)
Arraycommpath      1 (Enabled)
Failovermode       1
UnitSerialNumber   Array

Linux
Parameter          PowerPath           DMP                 DM-MPIO
Initiator Type     3 (CLARiiON Open)   3 (CLARiiON Open)   3 (CLARiiON Open)
Arraycommpath      1 (Enabled)         1 (Enabled)         1 (Enabled)
Failovermode       1                   2                   1
UnitSerialNumber   Array               Array               Array

Solaris
Parameter          PowerPath           DMP                 MPxIO or STMS
Initiator Type     3 (CLARiiON Open)   3 (CLARiiON Open)   3 (CLARiiON Open)
Arraycommpath      1 (Enabled)         1 (Enabled)         1 (Enabled)
Failovermode⁴      1                   1 or 2              1
UnitSerialNumber⁵  LUN or Array        LUN or Array        LUN or Array

Tru64
Parameter          Native Failover
Initiator Type     dec 28 / hex 1C
Arraycommpath      1 (Enabled)
Failovermode       0
UnitSerialNumber   Array

VMware
Parameter          Native Failover
Initiator Type     3 (CLARiiON Open)
Arraycommpath      1 (Enabled)
Failovermode       1
UnitSerialNumber   Array

Windows
Parameter          PowerPath           DMP (Windows 2000 and Windows 2003 only)
Initiator Type     3 (CLARiiON Open)   3 (CLARiiON Open)
Arraycommpath      1 (Enabled)         1 (Enabled)
Failovermode       1                   1
UnitSerialNumber   Array               Array
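For scripting host checks, a subset of the settings above can be captured as a lookup. This is a sketch only; the keys and structure are illustrative, and the authoritative values remain Primus solution emc99467:

```python
# A few rows of the settings table above as a lookup; illustrative only —
# the authoritative source is Primus solution emc99467.
RECOMMENDED = {
    ("Linux", "PowerPath"):   dict(initiatortype=3, arraycommpath=1,
                                   failovermode=1, unitserialnumber="Array"),
    ("Linux", "DMP"):         dict(initiatortype=3, arraycommpath=1,
                                   failovermode=2, unitserialnumber="Array"),
    ("Windows", "PowerPath"): dict(initiatortype=3, arraycommpath=1,
                                   failovermode=1, unitserialnumber="Array"),
}

def recommended(os_name, failover_sw):
    """Return the recommended settings row, or None if not captured here."""
    return RECOMMENDED.get((os_name, failover_sw))

print(recommended("Linux", "DMP")["failovermode"])  # -> 2
```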



APPENDIX
Flare Revision Decoder and CX/CX3 Bus Numbering Charts

For bus numbering charts for older storage system types, see the CLARiiON Troubleshooting Guide (Edition One) or contact Technical Support.



BUS 0 CX-Series Arrays

DAE#  b-e-d    fru# dec  fru# hex  b-d   al_pa  |  DAE#  b-e-d    fru# dec  fru# hex  b-d   al_pa
0     0-0-0    0        0x00      0-0   ef     |  4     0-4-0    60       0x3c      0-3c  76
      0-0-1    1        0x01      0-1   e8     |        0-4-1    61       0x3d      0-3d  75
      0-0-2    2        0x02      0-2   e4     |        0-4-2    62       0x3e      0-3e  74
      0-0-3    3        0x03      0-3   e2     |        0-4-3    63       0x3f      0-3f  73
      0-0-4    4        0x04      0-4   e1     |        0-4-4    64       0x40      0-40  72
      0-0-5    5        0x05      0-5   e0     |        0-4-5    65       0x41      0-41  71
      0-0-6    6        0x06      0-6   dc     |        0-4-6    66       0x42      0-42  6e
      0-0-7    7        0x07      0-7   da     |        0-4-7    67       0x43      0-43  6d
      0-0-8    8        0x08      0-8   d9     |        0-4-8    68       0x44      0-44  6c
      0-0-9    9        0x09      0-9   d6     |        0-4-9    69       0x45      0-45  6b
      0-0-10   10       0x0a      0-a   d5     |        0-4-10   70       0x46      0-46  6a
      0-0-11   11       0x0b      0-b   d4     |        0-4-11   71       0x47      0-47  69
      0-0-12   12       0x0c      0-c   d3     |        0-4-12   72       0x48      0-48  67
      0-0-13   13       0x0d      0-d   d2     |        0-4-13   73       0x49      0-49  66
      0-0-14   14       0x0e      0-e   d1     |        0-4-14   74       0x4a      0-4a  65
1     0-1-0    15       0x0f      0-f   ce     |  5     0-5-0    75       0x4b      0-4b  63
      0-1-1    16       0x10      0-10  cd     |        0-5-1    76       0x4c      0-4c  5c
      0-1-2    17       0x11      0-11  cc     |        0-5-2    77       0x4d      0-4d  5a
      0-1-3    18       0x12      0-12  cb     |        0-5-3    78       0x4e      0-4e  59
      0-1-4    19       0x13      0-13  ca     |        0-5-4    79       0x4f      0-4f  56
      0-1-5    20       0x14      0-14  c9     |        0-5-5    80       0x50      0-50  55
      0-1-6    21       0x15      0-15  c7     |        0-5-6    81       0x51      0-51  54
      0-1-7    22       0x16      0-16  c6     |        0-5-7    82       0x52      0-52  53
      0-1-8    23       0x17      0-17  c5     |        0-5-8    83       0x53      0-53  52
      0-1-9    24       0x18      0-18  c3     |        0-5-9    84       0x54      0-54  51
      0-1-10   25       0x19      0-19  bc     |        0-5-10   85       0x55      0-55  4e
      0-1-11   26       0x1a      0-1a  ba     |        0-5-11   86       0x56      0-56  4d
      0-1-12   27       0x1b      0-1b  b9     |        0-5-12   87       0x57      0-57  4c
      0-1-13   28       0x1c      0-1c  b6     |        0-5-13   88       0x58      0-58  4b
      0-1-14   29       0x1d      0-1d  b5     |        0-5-14   89       0x59      0-59  4a
2     0-2-0    30       0x1e      0-1e  b4     |  6     0-6-0    90       0x5a      0-5a  49
      0-2-1    31       0x1f      0-1f  b3     |        0-6-1    91       0x5b      0-5b  47
      0-2-2    32       0x20      0-20  b2     |        0-6-2    92       0x5c      0-5c  46
      0-2-3    33       0x21      0-21  b1     |        0-6-3    93       0x5d      0-5d  45
      0-2-4    34       0x22      0-22  ae     |        0-6-4    94       0x5e      0-5e  43
      0-2-5    35       0x23      0-23  ad     |        0-6-5    95       0x5f      0-5f  3c
      0-2-6    36       0x24      0-24  ac     |        0-6-6    96       0x60      0-60  3a
      0-2-7    37       0x25      0-25  ab     |        0-6-7    97       0x61      0-61  39
      0-2-8    38       0x26      0-26  aa     |        0-6-8    98       0x62      0-62  36
      0-2-9    39       0x27      0-27  a9     |        0-6-9    99       0x63      0-63  35
      0-2-10   40       0x28      0-28  a7     |        0-6-10   100      0x64      0-64  34
      0-2-11   41       0x29      0-29  a6     |        0-6-11   101      0x65      0-65  33
      0-2-12   42       0x2a      0-2a  a5     |        0-6-12   102      0x66      0-66  32
      0-2-13   43       0x2b      0-2b  a3     |        0-6-13   103      0x67      0-67  31
      0-2-14   44       0x2c      0-2c  9f     |        0-6-14   104      0x68      0-68  2e
3     0-3-0    45       0x2d      0-2d  9e     |  7     0-7-0    105      0x69      0-69  2d
      0-3-1    46       0x2e      0-2e  9d     |        0-7-1    106      0x6a      0-6a  2c
      0-3-2    47       0x2f      0-2f  9b     |        0-7-2    107      0x6b      0-6b  2b
      0-3-3    48       0x30      0-30  98     |        0-7-3    108      0x6c      0-6c  2a
      0-3-4    49       0x31      0-31  97     |        0-7-4    109      0x6d      0-6d  29
      0-3-5    50       0x32      0-32  90     |        0-7-5    110      0x6e      0-6e  27
      0-3-6    51       0x33      0-33  8f     |        0-7-6    111      0x6f      0-6f  26
      0-3-7    52       0x34      0-34  88     |        0-7-7    112      0x70      0-70  25
      0-3-8    53       0x35      0-35  84     |        0-7-8    113      0x71      0-71  23
      0-3-9    54       0x36      0-36  82     |        0-7-9    114      0x72      0-72  1f
      0-3-10   55       0x37      0-37  81     |        0-7-10   115      0x73      0-73  1e
      0-3-11   56       0x38      0-38  80     |        0-7-11   116      0x74      0-74  1d
      0-3-12   57       0x39      0-39  7c     |        0-7-12   117      0x75      0-75  1b
      0-3-13   58       0x3a      0-3a  7a     |        0-7-13   118      0x76      0-76  18
      0-3-14   59       0x3b      0-3b  79     |        0-7-14   119      0x77      0-77  17
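The chart above follows a simple pattern (15 drive slots per DAE, eight enclosures per bus), so the decimal FRU number can be computed rather than looked up. The sketch below was derived by inspection of the chart, not taken from FLARE source; note that the al_pa column has no closed-form rule and still requires the table:

```python
# Derived by inspection of the bus numbering charts (not from FLARE source):
# fru# (dec) = 120*bus + 15*enclosure + slot, and DAE# = 8*bus + enclosure.
# al_pa follows the FC-AL address sequence and is best read from the table.

def fru_number(bus: int, enclosure: int, slot: int) -> int:
    """Decimal FRU number for disk b-e-d (15 slots/DAE, 8 DAEs/bus)."""
    assert 0 <= enclosure <= 7 and 0 <= slot <= 14
    return 120 * bus + 15 * enclosure + slot

def dae_number(bus: int, enclosure: int) -> int:
    """Array-wide DAE# as shown in the chart (8 enclosures per bus)."""
    return 8 * bus + enclosure

# Spot check against the BUS 0 chart: disk 0-4-0 -> fru 60 (0x3c), DAE 4.
print(fru_number(0, 4, 0), hex(fru_number(0, 4, 0)), dae_number(0, 4))
```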


BUS 1 CX-Series Arrays (two row groups side by side; columns in each group: DAE#, b-e-d, fru # dec, fru # hex, b-d, al_pa)

DAE# b-e-d dec hex b-d al_pa DAE# b-e-d dec hex b-d al_pa 8 1-0-0 120 0x78 1-0 ef 12 1-4-0 180 0xb4 1-3c 76 1-0-1 121 0x79 1-1 e8 1-4-1 181 0xb5 1-3d 75 1-0-2 122 0x7a 1-2 e4 1-4-2 182 0xb6 1-3e 74 1-0-3 123 0x7b 1-3 e2 1-4-3 183 0xb7 1-3f 73 1-0-4 124 0x7c 1-4 e1 1-4-4 184 0xb8 1-40 72 1-0-5 125 0x7d 1-5 e0 1-4-5 185 0xb9 1-41 71 1-0-6 126 0x7e 1-6 dc 1-4-6 186 0xba 1-42 6e 1-0-7 127 0x7f 1-7 da 1-4-7 187 0xbb 1-43 6d 1-0-8 128 0x80 1-8 d9 1-4-8 188 0xbc 1-44 6c 1-0-9 129 0x81 1-9 d6 1-4-9 189 0xbd 1-45 6b 1-0-10 130 0x82 1-a d5 1-4-10 190 0xbe 1-46 6a 1-0-11 131 0x83 1-b d4 1-4-11 191 0xbf 1-47 69 1-0-12 132 0x84 1-c d3 1-4-12 192 0xc0 1-48 67 1-0-13 133 0x85 1-d d2 1-4-13 193 0xc1 1-49 66 1-0-14 134 0x86 1-e d1 1-4-14 194 0xc2 1-4a 65 9 1-1-0 135 0x87 1-f ce 13 1-5-0 195 0xc3 1-4b 63 1-1-1 136 0x88 1-10 cd 1-5-1 196 0xc4 1-4c 5c 1-1-2 137 0x89 1-11 cc 1-5-2 197 0xc5 1-4d 5a 1-1-3 138 0x8a 1-12 cb 1-5-3 198 0xc6 1-4e 59 1-1-4 139 0x8b 1-13 ca 1-5-4 199 0xc7 1-4f 56 1-1-5 140 0x8c 1-14 c9 1-5-5 200 0xc8 1-50 55 1-1-6 141 0x8d 1-15 c7 1-5-6 201 0xc9 1-51 54 1-1-7 142 0x8e 1-16 c6 1-5-7 202 0xca 1-52 53 1-1-8 143 0x8f 1-17 c5 1-5-8 203 0xcb 1-53 52 1-1-9 144 0x90 1-18 c3 1-5-9 204 0xcc 1-54 51 1-1-10 145 0x91 1-19 bc 1-5-10 205 0xcd 1-55 4e 1-1-11 146 0x92 1-1a ba 1-5-11 206 0xce 1-56 4d 1-1-12 147 0x93 1-1b b9 1-5-12 207 0xcf 1-57 4c 1-1-13 148 0x94 1-1c b6 1-5-13 208 0xd0 1-58 4b 1-1-14 149 0x95 1-1d b5 1-5-14 209 0xd1 1-59 4a 10 1-2-0 150 0x96 1-1e b4 14 1-6-0 210 0xd2 1-5a 49 1-2-1 151 0x97 1-1f b3 1-6-1 211 0xd3 1-5b 47 1-2-2 152 0x98 1-20 b2 1-6-2 212 0xd4 1-5c 46 1-2-3 153 0x99 1-21 b1 1-6-3 213 0xd5 1-5d 45 1-2-4 154 0x9a 1-22 ae 1-6-4 214 0xd6 1-5e 43 1-2-5 155 0x9b 1-23 ad 1-6-5 215 0xd7 1-5f 3c 1-2-6 156 0x9c 1-24 ac 1-6-6 216 0xd8 1-60 3a 1-2-7 157 0x9d 1-25 ab 1-6-7 217 0xd9 1-61 39 1-2-8 158 0x9e 1-26 aa 1-6-8 218 0xda 1-62 36 1-2-9 159 0x9f 1-27 a9 1-6-9 219 0xdb 1-63 35 1-2-10 160 0xa0 1-28 a7 1-6-10 220 0xdc 1-64 34 1-2-11 161 0xa1 1-29 a6 1-6-11 
221 0xdd 1-65 33 1-2-12 162 0xa2 1-2a a5 1-6-12 222 0xde 1-66 32 1-2-13 163 0xa3 1-2b a3 1-6-13 223 0xdf 1-67 31 1-2-14 164 0xa4 1-2c 9f 1-6-14 224 0xe0 1-68 2e 11 1-3-0 165 0xa5 1-2d 9e 15 1-7-0 225 0xe1 1-69 2d 1-3-1 166 0xa6 1-2e 9d 1-7-1 226 0xe2 1-6a 2c 1-3-2 167 0xa7 1-2f 9b 1-7-2 227 0xe3 1-6b 2b 1-3-3 168 0xa8 1-30 98 1-7-3 228 0xe4 1-6c 2a 1-3-4 169 0xa9 1-31 97 1-7-4 229 0xe5 1-6d 29 1-3-5 170 0xaa 1-32 90 1-7-5 230 0xe6 1-6e 27 1-3-6 171 0xab 1-33 8f 1-7-6 231 0xe7 1-6f 26 1-3-7 172 0xac 1-34 88 1-7-7 232 0xe8 1-70 25 1-3-8 173 0xad 1-35 84 1-7-8 233 0xe9 1-71 23 1-3-9 174 0xae 1-36 82 1-7-9 234 0xea 1-72 1f 1-3-10 175 0xaf 1-37 81 1-7-10 235 0xeb 1-73 1e 1-3-11 176 0xb0 1-38 80 1-7-11 236 0xec 1-74 1d 1-3-12 177 0xb1 1-39 7c 1-7-12 237 0xed 1-75 1b 1-3-13 178 0xb2 1-3a 7a 1-7-13 238 0xee 1-76 18 1-3-14 179 0xb3 1-3b 79 1-7-14 239 0xef 1-77 17


BUS 2 CX-Series Arrays (two row groups side by side; columns in each group: DAE#, b-e-d, fru # dec, fru # hex, b-d, al_pa)

DAE# b-e-d dec hex b-d al_pa DAE# b-e-d dec hex b-d al_pa 16 2-0-0 240 0xf0 2-0 ef 20 2-4-0 300 0x12c 2-3c 76 2-0-1 241 0xf1 2-1 e8 2-4-1 301 0x12d 2-3d 75 2-0-2 242 0xf2 2-2 e4 2-4-2 302 0x12e 2-3e 74 2-0-3 243 0xf3 2-3 e2 2-4-3 303 0x12f 2-3f 73 2-0-4 244 0xf4 2-4 e1 2-4-4 304 0x130 2-40 72 2-0-5 245 0xf5 2-5 e0 2-4-5 305 0x131 2-41 71 2-0-6 246 0xf6 2-6 dc 2-4-6 306 0x132 2-42 6e 2-0-7 247 0xf7 2-7 da 2-4-7 307 0x133 2-43 6d 2-0-8 248 0xf8 2-8 d9 2-4-8 308 0x134 2-44 6c 2-0-9 249 0xf9 2-9 d6 2-4-9 309 0x135 2-45 6b 2-0-10 250 0xfa 2-a d5 2-4-10 310 0x136 2-46 6a 2-0-11 251 0xfb 2-b d4 2-4-11 311 0x137 2-47 69 2-0-12 252 0xfc 2-c d3 2-4-12 312 0x138 2-48 67 2-0-13 253 0xfd 2-d d2 2-4-13 313 0x139 2-49 66 2-0-14 254 0xfe 2-e d1 2-4-14 314 0x13a 2-4a 65 17 2-1-0 255 0xff 2-f ce 21 2-5-0 315 0x13b 2-4b 63 2-1-1 256 0x100 2-10 cd 2-5-1 316 0x13c 2-4c 5c 2-1-2 257 0x101 2-11 cc 2-5-2 317 0x13d 2-4d 5a 2-1-3 258 0x102 2-12 cb 2-5-3 318 0x13e 2-4e 59 2-1-4 259 0x103 2-13 ca 2-5-4 319 0x13f 2-4f 56 2-1-5 260 0x104 2-14 c9 2-5-5 320 0x140 2-50 55 2-1-6 261 0x105 2-15 c7 2-5-6 321 0x141 2-51 54 2-1-7 262 0x106 2-16 c6 2-5-7 322 0x142 2-52 53 2-1-8 263 0x107 2-17 c5 2-5-8 323 0x143 2-53 52 2-1-9 264 0x108 2-18 c3 2-5-9 324 0x144 2-54 51 2-1-10 265 0x109 2-19 bc 2-5-10 325 0x145 2-55 4e 2-1-11 266 0x10a 2-1a ba 2-5-11 326 0x146 2-56 4d 2-1-12 267 0x10b 2-1b b9 2-5-12 327 0x147 2-57 4c 2-1-13 268 0x10c 2-1c b6 2-5-13 328 0x148 2-58 4b 2-1-14 269 0x10d 2-1d b5 2-5-14 329 0x149 2-59 4a 18 2-2-0 270 0x10e 2-1e b4 22 2-6-0 330 0x14a 2-5a 49 2-2-1 271 0x10f 2-1f b3 2-6-1 331 0x14b 2-5b 47 2-2-2 272 0x110 2-20 b2 2-6-2 332 0x14c 2-5c 46 2-2-3 273 0x111 2-21 b1 2-6-3 333 0x14d 2-5d 45 2-2-4 274 0x112 2-22 ae 2-6-4 334 0x14e 2-5e 43 2-2-5 275 0x113 2-23 ad 2-6-5 335 0x14f 2-5f 3c 2-2-6 276 0x114 2-24 ac 2-6-6 336 0x150 2-60 3a 2-2-7 277 0x115 2-25 ab 2-6-7 337 0x151 2-61 39 2-2-8 278 0x116 2-26 aa 2-6-8 338 0x152 2-62 36 2-2-9 279 0x117 2-27 a9 2-6-9 339 0x153 2-63 35 2-2-10 280 
0x118 2-28 a7 2-6-10 340 0x154 2-64 34 2-2-11 281 0x119 2-29 a6 2-6-11 341 0x155 2-65 33 2-2-12 282 0x11a 2-2a a5 2-6-12 342 0x156 2-66 32 2-2-13 283 0x11b 2-2b a3 2-6-13 343 0x157 2-67 31 2-2-14 284 0x11c 2-2c 9f 2-6-14 344 0x158 2-68 2e 19 2-3-0 285 0x11d 2-2d 9e 23 2-7-0 345 0x159 2-69 2d 2-3-1 286 0x11e 2-2e 9d 2-7-1 346 0x15a 2-6a 2c 2-3-2 287 0x11f 2-2f 9b 2-7-2 347 0x15b 2-6b 2b 2-3-3 288 0x120 2-30 98 2-7-3 348 0x15c 2-6c 2a 2-3-4 289 0x121 2-31 97 2-7-4 349 0x15d 2-6d 29 2-3-5 290 0x122 2-32 90 2-7-5 350 0x15e 2-6e 27 2-3-6 291 0x123 2-33 8f 2-7-6 351 0x15f 2-6f 26 2-3-7 292 0x124 2-34 88 2-7-7 352 0x160 2-70 25 2-3-8 293 0x125 2-35 84 2-7-8 353 0x161 2-71 23 2-3-9 294 0x126 2-36 82 2-7-9 354 0x162 2-72 1f 2-3-10 295 0x127 2-37 81 2-7-10 355 0x163 2-73 1e 2-3-11 296 0x128 2-38 80 2-7-11 356 0x164 2-74 1d 2-3-12 297 0x129 2-39 7c 2-7-12 357 0x165 2-75 1b 2-3-13 298 0x12a 2-3a 7a 2-7-13 358 0x166 2-76 18 2-3-14 299 0x12b 2-3b 79 2-7-14 359 0x167 2-77 17


BUS 3 CX-Series Arrays (two row groups side by side; columns in each group: DAE#, b-e-d, fru # dec, fru # hex, b-d, al_pa)

DAE# b-e-d dec hex b-d al_pa DAE# b-e-d dec hex b-d al_pa 24 3-0-0 360 0x168 3-0 ef 28 3-4-0 420 0x1a4 3-3c 76 3-0-1 361 0x169 3-1 e8 3-4-1 421 0x1a5 3-3d 75 3-0-2 362 0x16a 3-2 e4 3-4-2 422 0x1a6 3-3e 74 3-0-3 363 0x16b 3-3 e2 3-4-3 423 0x1a7 3-3f 73 3-0-4 364 0x16c 3-4 e1 3-4-4 424 0x1a8 3-40 72 3-0-5 365 0x16d 3-5 e0 3-4-5 425 0x1a9 3-41 71 3-0-6 366 0x16e 3-6 dc 3-4-6 426 0x1aa 3-42 6e 3-0-7 367 0x16f 3-7 da 3-4-7 427 0x1ab 3-43 6d 3-0-8 368 0x170 3-8 d9 3-4-8 428 0x1ac 3-44 6c 3-0-9 369 0x171 3-9 d6 3-4-9 429 0x1ad 3-45 6b 3-0-10 370 0x172 3-a d5 3-4-10 430 0x1ae 3-46 6a 3-0-11 371 0x173 3-b d4 3-4-11 431 0x1af 3-47 69 3-0-12 372 0x174 3-c d3 3-4-12 432 0x1b0 3-48 67 3-0-13 373 0x175 3-d d2 3-4-13 433 0x1b1 3-49 66 3-0-14 374 0x176 3-e d1 3-4-14 434 0x1b2 3-4a 65 25 3-1-0 375 0x177 3-f ce 29 3-5-0 435 0x1b3 3-4b 63 3-1-1 376 0x178 3-10 cd 3-5-1 436 0x1b4 3-4c 5c 3-1-2 377 0x179 3-11 cc 3-5-2 437 0x1b5 3-4d 5a 3-1-3 378 0x17a 3-12 cb 3-5-3 438 0x1b6 3-4e 59 3-1-4 379 0x17b 3-13 ca 3-5-4 439 0x1b7 3-4f 56 3-1-5 380 0x17c 3-14 c9 3-5-5 440 0x1b8 3-50 55 3-1-6 381 0x17d 3-15 c7 3-5-6 441 0x1b9 3-51 54 3-1-7 382 0x17e 3-16 c6 3-5-7 442 0x1ba 3-52 53 3-1-8 383 0x17f 3-17 c5 3-5-8 443 0x1bb 3-53 52 3-1-9 384 0x180 3-18 c3 3-5-9 444 0x1bc 3-54 51 3-1-10 385 0x181 3-19 bc 3-5-10 445 0x1bd 3-55 4e 3-1-11 386 0x182 3-1a ba 3-5-11 446 0x1be 3-56 4d 3-1-12 387 0x183 3-1b b9 3-5-12 447 0x1bf 3-57 4c 3-1-13 388 0x184 3-1c b6 3-5-13 448 0x1c0 3-58 4b 3-1-14 389 0x185 3-1d b5 3-5-14 449 0x1c1 3-59 4a 26 3-2-0 390 0x186 3-1e b4 30 3-6-0 450 0x1c2 3-5a 49 3-2-1 391 0x187 3-1f b3 3-6-1 451 0x1c3 3-5b 47 3-2-2 392 0x188 3-20 b2 3-6-2 452 0x1c4 3-5c 46 3-2-3 393 0x189 3-21 b1 3-6-3 453 0x1c5 3-5d 45 3-2-4 394 0x18a 3-22 ae 3-6-4 454 0x1c6 3-5e 43 3-2-5 395 0x18b 3-23 ad 3-6-5 455 0x1c7 3-5f 3c 3-2-6 396 0x18c 3-24 ac 3-6-6 456 0x1c8 3-60 3a 3-2-7 397 0x18d 3-25 ab 3-6-7 457 0x1c9 3-61 39 3-2-8 398 0x18e 3-26 aa 3-6-8 458 0x1ca 3-62 36 3-2-9 399 0x18f 3-27 a9 3-6-9 459 0x1cb 3-63 35 
3-2-10 400 0x190 3-28 a7 3-6-10 460 0x1cc 3-64 34 3-2-11 401 0x191 3-29 a6 3-6-11 461 0x1cd 3-65 33 3-2-12 402 0x192 3-2a a5 3-6-12 462 0x1ce 3-66 32 3-2-13 403 0x193 3-2b a3 3-6-13 463 0x1cf 3-67 31 3-2-14 404 0x194 3-2c 9f 3-6-14 464 0x1d0 3-68 2e 27 3-3-0 405 0x195 3-2d 9e 31 3-7-0 465 0x1d1 3-69 2d 3-3-1 406 0x196 3-2e 9d 3-7-1 466 0x1d2 3-6a 2c 3-3-2 407 0x197 3-2f 9b 3-7-2 467 0x1d3 3-6b 2b 3-3-3 408 0x198 3-30 98 3-7-3 468 0x1d4 3-6c 2a 3-3-4 409 0x199 3-31 97 3-7-4 469 0x1d5 3-6d 29 3-3-5 410 0x19a 3-32 90 3-7-5 470 0x1d6 3-6e 27 3-3-6 411 0x19b 3-33 8f 3-7-6 471 0x1d7 3-6f 26 3-3-7 412 0x19c 3-34 88 3-7-7 472 0x1d8 3-70 25 3-3-8 413 0x19d 3-35 84 3-7-8 473 0x1d9 3-71 23 3-3-9 414 0x19e 3-36 82 3-7-9 474 0x1da 3-72 1f 3-3-10 415 0x19f 3-37 81 3-7-10 475 0x1db 3-73 1e 3-3-11 416 0x1a0 3-38 80 3-7-11 476 0x1dc 3-74 1d 3-3-12 417 0x1a1 3-39 7c 3-7-12 477 0x1dd 3-75 1b 3-3-13 418 0x1a2 3-3a 7a 3-7-13 478 0x1de 3-76 18 3-3-14 419 0x1a3 3-3b 79 3-7-14 479 0x1df 3-77 17


BUS 0 CX3-Series Arrays (two row groups side by side; columns in each group: DAE#, b-e-d, fru # dec, fru # hex, b-d, al_pa, loop-id)

DAE#  b-e-d   dec  hex    b-d  al_pa loop-id | DAE#  b-e-d   dec  hex    b-d  al_pa loop-id
 0   0-0-0     0  0x00   0-0   ef    0   |   4   0-4-0    60  0x3c   0-3c  76   60
     0-0-1     1  0x01   0-1   e8    1   |       0-4-1    61  0x3d   0-3d  75   61
     0-0-2     2  0x02   0-2   e4    2   |       0-4-2    62  0x3e   0-3e  74   62
     0-0-3     3  0x03   0-3   e2    3   |       0-4-3    63  0x3f   0-3f  73   63
     0-0-4     4  0x04   0-4   e1    4   |       0-4-4    64  0x40   0-40  72   64
     0-0-5     5  0x05   0-5   e0    5   |       0-4-5    65  0x41   0-41  71   65
     0-0-6     6  0x06   0-6   dc    6   |       0-4-6    66  0x42   0-42  6e   66
     0-0-7     7  0x07   0-7   da    7   |       0-4-7    67  0x43   0-43  6d   67
     0-0-8     8  0x08   0-8   d9    8   |       0-4-8    68  0x44   0-44  6c   68
     0-0-9     9  0x09   0-9   d6    9   |       0-4-9    69  0x45   0-45  6b   69
     0-0-10   10  0x0a   0-a   d5   10   |       0-4-10   70  0x46   0-46  6a   70
     0-0-11   11  0x0b   0-b   d4   11   |       0-4-11   71  0x47   0-47  69   71
     0-0-12   12  0x0c   0-c   d3   12   |       0-4-12   72  0x48   0-48  67   72
     0-0-13   13  0x0d   0-d   d2   13   |       0-4-13   73  0x49   0-49  66   73
     0-0-14   14  0x0e   0-e   d1   14   |       0-4-14   74  0x4a   0-4a  65   74
 1   0-1-0    15  0x0f   0-f   ce   15   |   5   0-5-0    75  0x4b   0-4b  63   75
     0-1-1    16  0x10   0-10  cd   16   |       0-5-1    76  0x4c   0-4c  5c   76
     0-1-2    17  0x11   0-11  cc   17   |       0-5-2    77  0x4d   0-4d  5a   77
     0-1-3    18  0x12   0-12  cb   18   |       0-5-3    78  0x4e   0-4e  59   78
     0-1-4    19  0x13   0-13  ca   19   |       0-5-4    79  0x4f   0-4f  56   79
     0-1-5    20  0x14   0-14  c9   20   |       0-5-5    80  0x50   0-50  55   80
     0-1-6    21  0x15   0-15  c7   21   |       0-5-6    81  0x51   0-51  54   81
     0-1-7    22  0x16   0-16  c6   22   |       0-5-7    82  0x52   0-52  53   82
     0-1-8    23  0x17   0-17  c5   23   |       0-5-8    83  0x53   0-53  52   83
     0-1-9    24  0x18   0-18  c3   24   |       0-5-9    84  0x54   0-54  51   84
     0-1-10   25  0x19   0-19  bc   25   |       0-5-10   85  0x55   0-55  4e   85
     0-1-11   26  0x1a   0-1a  ba   26   |       0-5-11   86  0x56   0-56  4d   86
     0-1-12   27  0x1b   0-1b  b9   27   |       0-5-12   87  0x57   0-57  4c   87
     0-1-13   28  0x1c   0-1c  b6   28   |       0-5-13   88  0x58   0-58  4b   88
     0-1-14   29  0x1d   0-1d  b5   29   |       0-5-14   89  0x59   0-59  4a   89
 2   0-2-0    30  0x1e   0-1e  b4   30   |   6   0-6-0    90  0x5a   0-5a  49   90
     0-2-1    31  0x1f   0-1f  b3   31   |       0-6-1    91  0x5b   0-5b  47   91
     0-2-2    32  0x20   0-20  b2   32   |       0-6-2    92  0x5c   0-5c  46   92
     0-2-3    33  0x21   0-21  b1   33   |       0-6-3    93  0x5d   0-5d  45   93
     0-2-4    34  0x22   0-22  ae   34   |       0-6-4    94  0x5e   0-5e  43   94
     0-2-5    35  0x23   0-23  ad   35   |       0-6-5    95  0x5f   0-5f  3c   95
     0-2-6    36  0x24   0-24  ac   36   |       0-6-6    96  0x60   0-60  3a   96
     0-2-7    37  0x25   0-25  ab   37   |       0-6-7    97  0x61   0-61  39   97
     0-2-8    38  0x26   0-26  aa   38   |       0-6-8    98  0x62   0-62  36   98
     0-2-9    39  0x27   0-27  a9   39   |       0-6-9    99  0x63   0-63  35   99
     0-2-10   40  0x28   0-28  a7   40   |       0-6-10  100  0x64   0-64  34  100
     0-2-11   41  0x29   0-29  a6   41   |       0-6-11  101  0x65   0-65  33  101
     0-2-12   42  0x2a   0-2a  a5   42   |       0-6-12  102  0x66   0-66  32  102
     0-2-13   43  0x2b   0-2b  a3   43   |       0-6-13  103  0x67   0-67  31  103
     0-2-14   44  0x2c   0-2c  9f   44   |       0-6-14  104  0x68   0-68  2e  104
 3   0-3-0    45  0x2d   0-2d  9e   45   |   7   0-7-0   105  0x69   0-69  2d  105
     0-3-1    46  0x2e   0-2e  9d   46   |       0-7-1   106  0x6a   0-6a  2c  106
     0-3-2    47  0x2f   0-2f  9b   47   |       0-7-2   107  0x6b   0-6b  2b  107
     0-3-3    48  0x30   0-30  98   48   |       0-7-3   108  0x6c   0-6c  2a  108
     0-3-4    49  0x31   0-31  97   49   |       0-7-4   109  0x6d   0-6d  29  109
     0-3-5    50  0x32   0-32  90   50   |       0-7-5   110  0x6e   0-6e  27  110
     0-3-6    51  0x33   0-33  8f   51   |       0-7-6   111  0x6f   0-6f  26  111
     0-3-7    52  0x34   0-34  88   52   |       0-7-7   112  0x70   0-70  25  112
     0-3-8    53  0x35   0-35  84   53   |       0-7-8   113  0x71   0-71  23  113
     0-3-9    54  0x36   0-36  82   54   |       0-7-9   114  0x72   0-72  1f  114
     0-3-10   55  0x37   0-37  81   55   |       0-7-10  115  0x73   0-73  1e  115
     0-3-11   56  0x38   0-38  80   56   |       0-7-11  116  0x74   0-74  1d  116
     0-3-12   57  0x39   0-39  7c   57   |       0-7-12  117  0x75   0-75  1b  117
     0-3-13   58  0x3a   0-3a  7a   58   |       0-7-13  118  0x76   0-76  18  118
     0-3-14   59  0x3b   0-3b  79   59   |       0-7-14  119  0x77   0-77  17  119


EMC / CLARiiON Troubleshooting – 2nd Edition Strictly Confidential

Copyright © 2007 EMC Corporation. All rights reserved. Revision B03. EMC Confidential - Internal Use Only

BUS 1 CX3-Series Arrays (fru # shown in dec and hex)

DAE#  b-e-d   dec  hex    b-d  al_pa loop-id | DAE#  b-e-d   dec  hex    b-d  al_pa loop-id
 8   1-0-0   120  0x78   1-0   ef    0   |  12   1-4-0   180  0xb4   1-3c  76   60
     1-0-1   121  0x79   1-1   e8    1   |       1-4-1   181  0xb5   1-3d  75   61
     1-0-2   122  0x7a   1-2   e4    2   |       1-4-2   182  0xb6   1-3e  74   62
     1-0-3   123  0x7b   1-3   e2    3   |       1-4-3   183  0xb7   1-3f  73   63
     1-0-4   124  0x7c   1-4   e1    4   |       1-4-4   184  0xb8   1-40  72   64
     1-0-5   125  0x7d   1-5   e0    5   |       1-4-5   185  0xb9   1-41  71   65
     1-0-6   126  0x7e   1-6   dc    6   |       1-4-6   186  0xba   1-42  6e   66
     1-0-7   127  0x7f   1-7   da    7   |       1-4-7   187  0xbb   1-43  6d   67
     1-0-8   128  0x80   1-8   d9    8   |       1-4-8   188  0xbc   1-44  6c   68
     1-0-9   129  0x81   1-9   d6    9   |       1-4-9   189  0xbd   1-45  6b   69
     1-0-10  130  0x82   1-a   d5   10   |       1-4-10  190  0xbe   1-46  6a   70
     1-0-11  131  0x83   1-b   d4   11   |       1-4-11  191  0xbf   1-47  69   71
     1-0-12  132  0x84   1-c   d3   12   |       1-4-12  192  0xc0   1-48  67   72
     1-0-13  133  0x85   1-d   d2   13   |       1-4-13  193  0xc1   1-49  66   73
     1-0-14  134  0x86   1-e   d1   14   |       1-4-14  194  0xc2   1-4a  65   74
 9   1-1-0   135  0x87   1-f   ce   15   |  13   1-5-0   195  0xc3   1-4b  63   75
     1-1-1   136  0x88   1-10  cd   16   |       1-5-1   196  0xc4   1-4c  5c   76
     1-1-2   137  0x89   1-11  cc   17   |       1-5-2   197  0xc5   1-4d  5a   77
     1-1-3   138  0x8a   1-12  cb   18   |       1-5-3   198  0xc6   1-4e  59   78
     1-1-4   139  0x8b   1-13  ca   19   |       1-5-4   199  0xc7   1-4f  56   79
     1-1-5   140  0x8c   1-14  c9   20   |       1-5-5   200  0xc8   1-50  55   80
     1-1-6   141  0x8d   1-15  c7   21   |       1-5-6   201  0xc9   1-51  54   81
     1-1-7   142  0x8e   1-16  c6   22   |       1-5-7   202  0xca   1-52  53   82
     1-1-8   143  0x8f   1-17  c5   23   |       1-5-8   203  0xcb   1-53  52   83
     1-1-9   144  0x90   1-18  c3   24   |       1-5-9   204  0xcc   1-54  51   84
     1-1-10  145  0x91   1-19  bc   25   |       1-5-10  205  0xcd   1-55  4e   85
     1-1-11  146  0x92   1-1a  ba   26   |       1-5-11  206  0xce   1-56  4d   86
     1-1-12  147  0x93   1-1b  b9   27   |       1-5-12  207  0xcf   1-57  4c   87
     1-1-13  148  0x94   1-1c  b6   28   |       1-5-13  208  0xd0   1-58  4b   88
     1-1-14  149  0x95   1-1d  b5   29   |       1-5-14  209  0xd1   1-59  4a   89
10   1-2-0   150  0x96   1-1e  b4   30   |  14   1-6-0   210  0xd2   1-5a  49   90
     1-2-1   151  0x97   1-1f  b3   31   |       1-6-1   211  0xd3   1-5b  47   91
     1-2-2   152  0x98   1-20  b2   32   |       1-6-2   212  0xd4   1-5c  46   92
     1-2-3   153  0x99   1-21  b1   33   |       1-6-3   213  0xd5   1-5d  45   93
     1-2-4   154  0x9a   1-22  ae   34   |       1-6-4   214  0xd6   1-5e  43   94
     1-2-5   155  0x9b   1-23  ad   35   |       1-6-5   215  0xd7   1-5f  3c   95
     1-2-6   156  0x9c   1-24  ac   36   |       1-6-6   216  0xd8   1-60  3a   96
     1-2-7   157  0x9d   1-25  ab   37   |       1-6-7   217  0xd9   1-61  39   97
     1-2-8   158  0x9e   1-26  aa   38   |       1-6-8   218  0xda   1-62  36   98
     1-2-9   159  0x9f   1-27  a9   39   |       1-6-9   219  0xdb   1-63  35   99
     1-2-10  160  0xa0   1-28  a7   40   |       1-6-10  220  0xdc   1-64  34  100
     1-2-11  161  0xa1   1-29  a6   41   |       1-6-11  221  0xdd   1-65  33  101
     1-2-12  162  0xa2   1-2a  a5   42   |       1-6-12  222  0xde   1-66  32  102
     1-2-13  163  0xa3   1-2b  a3   43   |       1-6-13  223  0xdf   1-67  31  103
     1-2-14  164  0xa4   1-2c  9f   44   |       1-6-14  224  0xe0   1-68  2e  104
11   1-3-0   165  0xa5   1-2d  9e   45   |  15   1-7-0   225  0xe1   1-69  2d  105
     1-3-1   166  0xa6   1-2e  9d   46   |       1-7-1   226  0xe2   1-6a  2c  106
     1-3-2   167  0xa7   1-2f  9b   47   |       1-7-2   227  0xe3   1-6b  2b  107
     1-3-3   168  0xa8   1-30  98   48   |       1-7-3   228  0xe4   1-6c  2a  108
     1-3-4   169  0xa9   1-31  97   49   |       1-7-4   229  0xe5   1-6d  29  109
     1-3-5   170  0xaa   1-32  90   50   |       1-7-5   230  0xe6   1-6e  27  110
     1-3-6   171  0xab   1-33  8f   51   |       1-7-6   231  0xe7   1-6f  26  111
     1-3-7   172  0xac   1-34  88   52   |       1-7-7   232  0xe8   1-70  25  112
     1-3-8   173  0xad   1-35  84   53   |       1-7-8   233  0xe9   1-71  23  113
     1-3-9   174  0xae   1-36  82   54   |       1-7-9   234  0xea   1-72  1f  114
     1-3-10  175  0xaf   1-37  81   55   |       1-7-10  235  0xeb   1-73  1e  115
     1-3-11  176  0xb0   1-38  80   56   |       1-7-11  236  0xec   1-74  1d  116
     1-3-12  177  0xb1   1-39  7c   57   |       1-7-12  237  0xed   1-75  1b  117
     1-3-13  178  0xb2   1-3a  7a   58   |       1-7-13  238  0xee   1-76  18  118
     1-3-14  179  0xb3   1-3b  79   59   |       1-7-14  239  0xef   1-77  17  119




BUS 2 CX3-Series Arrays (fru # shown in dec and hex)

DAE#  b-e-d   dec  hex    b-d  al_pa loop-id | DAE#  b-e-d   dec  hex    b-d  al_pa loop-id
16   2-0-0   240  0xf0   2-0   ef    0   |  20   2-4-0   300  0x12c  2-3c  76   60
     2-0-1   241  0xf1   2-1   e8    1   |       2-4-1   301  0x12d  2-3d  75   61
     2-0-2   242  0xf2   2-2   e4    2   |       2-4-2   302  0x12e  2-3e  74   62
     2-0-3   243  0xf3   2-3   e2    3   |       2-4-3   303  0x12f  2-3f  73   63
     2-0-4   244  0xf4   2-4   e1    4   |       2-4-4   304  0x130  2-40  72   64
     2-0-5   245  0xf5   2-5   e0    5   |       2-4-5   305  0x131  2-41  71   65
     2-0-6   246  0xf6   2-6   dc    6   |       2-4-6   306  0x132  2-42  6e   66
     2-0-7   247  0xf7   2-7   da    7   |       2-4-7   307  0x133  2-43  6d   67
     2-0-8   248  0xf8   2-8   d9    8   |       2-4-8   308  0x134  2-44  6c   68
     2-0-9   249  0xf9   2-9   d6    9   |       2-4-9   309  0x135  2-45  6b   69
     2-0-10  250  0xfa   2-a   d5   10   |       2-4-10  310  0x136  2-46  6a   70
     2-0-11  251  0xfb   2-b   d4   11   |       2-4-11  311  0x137  2-47  69   71
     2-0-12  252  0xfc   2-c   d3   12   |       2-4-12  312  0x138  2-48  67   72
     2-0-13  253  0xfd   2-d   d2   13   |       2-4-13  313  0x139  2-49  66   73
     2-0-14  254  0xfe   2-e   d1   14   |       2-4-14  314  0x13a  2-4a  65   74
17   2-1-0   255  0xff   2-f   ce   15   |  21   2-5-0   315  0x13b  2-4b  63   75
     2-1-1   256  0x100  2-10  cd   16   |       2-5-1   316  0x13c  2-4c  5c   76
     2-1-2   257  0x101  2-11  cc   17   |       2-5-2   317  0x13d  2-4d  5a   77
     2-1-3   258  0x102  2-12  cb   18   |       2-5-3   318  0x13e  2-4e  59   78
     2-1-4   259  0x103  2-13  ca   19   |       2-5-4   319  0x13f  2-4f  56   79
     2-1-5   260  0x104  2-14  c9   20   |       2-5-5   320  0x140  2-50  55   80
     2-1-6   261  0x105  2-15  c7   21   |       2-5-6   321  0x141  2-51  54   81
     2-1-7   262  0x106  2-16  c6   22   |       2-5-7   322  0x142  2-52  53   82
     2-1-8   263  0x107  2-17  c5   23   |       2-5-8   323  0x143  2-53  52   83
     2-1-9   264  0x108  2-18  c3   24   |       2-5-9   324  0x144  2-54  51   84
     2-1-10  265  0x109  2-19  bc   25   |       2-5-10  325  0x145  2-55  4e   85
     2-1-11  266  0x10a  2-1a  ba   26   |       2-5-11  326  0x146  2-56  4d   86
     2-1-12  267  0x10b  2-1b  b9   27   |       2-5-12  327  0x147  2-57  4c   87
     2-1-13  268  0x10c  2-1c  b6   28   |       2-5-13  328  0x148  2-58  4b   88
     2-1-14  269  0x10d  2-1d  b5   29   |       2-5-14  329  0x149  2-59  4a   89
18   2-2-0   270  0x10e  2-1e  b4   30   |  22   2-6-0   330  0x14a  2-5a  49   90
     2-2-1   271  0x10f  2-1f  b3   31   |       2-6-1   331  0x14b  2-5b  47   91
     2-2-2   272  0x110  2-20  b2   32   |       2-6-2   332  0x14c  2-5c  46   92
     2-2-3   273  0x111  2-21  b1   33   |       2-6-3   333  0x14d  2-5d  45   93
     2-2-4   274  0x112  2-22  ae   34   |       2-6-4   334  0x14e  2-5e  43   94
     2-2-5   275  0x113  2-23  ad   35   |       2-6-5   335  0x14f  2-5f  3c   95
     2-2-6   276  0x114  2-24  ac   36   |       2-6-6   336  0x150  2-60  3a   96
     2-2-7   277  0x115  2-25  ab   37   |       2-6-7   337  0x151  2-61  39   97
     2-2-8   278  0x116  2-26  aa   38   |       2-6-8   338  0x152  2-62  36   98
     2-2-9   279  0x117  2-27  a9   39   |       2-6-9   339  0x153  2-63  35   99
     2-2-10  280  0x118  2-28  a7   40   |       2-6-10  340  0x154  2-64  34  100
     2-2-11  281  0x119  2-29  a6   41   |       2-6-11  341  0x155  2-65  33  101
     2-2-12  282  0x11a  2-2a  a5   42   |       2-6-12  342  0x156  2-66  32  102
     2-2-13  283  0x11b  2-2b  a3   43   |       2-6-13  343  0x157  2-67  31  103
     2-2-14  284  0x11c  2-2c  9f   44   |       2-6-14  344  0x158  2-68  2e  104
19   2-3-0   285  0x11d  2-2d  9e   45   |  23   2-7-0   345  0x159  2-69  2d  105
     2-3-1   286  0x11e  2-2e  9d   46   |       2-7-1   346  0x15a  2-6a  2c  106
     2-3-2   287  0x11f  2-2f  9b   47   |       2-7-2   347  0x15b  2-6b  2b  107
     2-3-3   288  0x120  2-30  98   48   |       2-7-3   348  0x15c  2-6c  2a  108
     2-3-4   289  0x121  2-31  97   49   |       2-7-4   349  0x15d  2-6d  29  109
     2-3-5   290  0x122  2-32  90   50   |       2-7-5   350  0x15e  2-6e  27  110
     2-3-6   291  0x123  2-33  8f   51   |       2-7-6   351  0x15f  2-6f  26  111
     2-3-7   292  0x124  2-34  88   52   |       2-7-7   352  0x160  2-70  25  112
     2-3-8   293  0x125  2-35  84   53   |       2-7-8   353  0x161  2-71  23  113
     2-3-9   294  0x126  2-36  82   54   |       2-7-9   354  0x162  2-72  1f  114
     2-3-10  295  0x127  2-37  81   55   |       2-7-10  355  0x163  2-73  1e  115
     2-3-11  296  0x128  2-38  80   56   |       2-7-11  356  0x164  2-74  1d  116
     2-3-12  297  0x129  2-39  7c   57   |       2-7-12  357  0x165  2-75  1b  117
     2-3-13  298  0x12a  2-3a  7a   58   |       2-7-13  358  0x166  2-76  18  118
     2-3-14  299  0x12b  2-3b  79   59   |       2-7-14  359  0x167  2-77  17  119




BUS 3 CX3-Series Arrays (fru # shown in dec and hex)

DAE#  b-e-d   dec  hex    b-d  al_pa loop-id | DAE#  b-e-d   dec  hex    b-d  al_pa loop-id
24   3-0-0   360  0x168  3-0   ef    0   |  28   3-4-0   420  0x1a4  3-3c  76   60
     3-0-1   361  0x169  3-1   e8    1   |       3-4-1   421  0x1a5  3-3d  75   61
     3-0-2   362  0x16a  3-2   e4    2   |       3-4-2   422  0x1a6  3-3e  74   62
     3-0-3   363  0x16b  3-3   e2    3   |       3-4-3   423  0x1a7  3-3f  73   63
     3-0-4   364  0x16c  3-4   e1    4   |       3-4-4   424  0x1a8  3-40  72   64
     3-0-5   365  0x16d  3-5   e0    5   |       3-4-5   425  0x1a9  3-41  71   65
     3-0-6   366  0x16e  3-6   dc    6   |       3-4-6   426  0x1aa  3-42  6e   66
     3-0-7   367  0x16f  3-7   da    7   |       3-4-7   427  0x1ab  3-43  6d   67
     3-0-8   368  0x170  3-8   d9    8   |       3-4-8   428  0x1ac  3-44  6c   68
     3-0-9   369  0x171  3-9   d6    9   |       3-4-9   429  0x1ad  3-45  6b   69
     3-0-10  370  0x172  3-a   d5   10   |       3-4-10  430  0x1ae  3-46  6a   70
     3-0-11  371  0x173  3-b   d4   11   |       3-4-11  431  0x1af  3-47  69   71
     3-0-12  372  0x174  3-c   d3   12   |       3-4-12  432  0x1b0  3-48  67   72
     3-0-13  373  0x175  3-d   d2   13   |       3-4-13  433  0x1b1  3-49  66   73
     3-0-14  374  0x176  3-e   d1   14   |       3-4-14  434  0x1b2  3-4a  65   74
25   3-1-0   375  0x177  3-f   ce   15   |  29   3-5-0   435  0x1b3  3-4b  63   75
     3-1-1   376  0x178  3-10  cd   16   |       3-5-1   436  0x1b4  3-4c  5c   76
     3-1-2   377  0x179  3-11  cc   17   |       3-5-2   437  0x1b5  3-4d  5a   77
     3-1-3   378  0x17a  3-12  cb   18   |       3-5-3   438  0x1b6  3-4e  59   78
     3-1-4   379  0x17b  3-13  ca   19   |       3-5-4   439  0x1b7  3-4f  56   79
     3-1-5   380  0x17c  3-14  c9   20   |       3-5-5   440  0x1b8  3-50  55   80
     3-1-6   381  0x17d  3-15  c7   21   |       3-5-6   441  0x1b9  3-51  54   81
     3-1-7   382  0x17e  3-16  c6   22   |       3-5-7   442  0x1ba  3-52  53   82
     3-1-8   383  0x17f  3-17  c5   23   |       3-5-8   443  0x1bb  3-53  52   83
     3-1-9   384  0x180  3-18  c3   24   |       3-5-9   444  0x1bc  3-54  51   84
     3-1-10  385  0x181  3-19  bc   25   |       3-5-10  445  0x1bd  3-55  4e   85
     3-1-11  386  0x182  3-1a  ba   26   |       3-5-11  446  0x1be  3-56  4d   86
     3-1-12  387  0x183  3-1b  b9   27   |       3-5-12  447  0x1bf  3-57  4c   87
     3-1-13  388  0x184  3-1c  b6   28   |       3-5-13  448  0x1c0  3-58  4b   88
     3-1-14  389  0x185  3-1d  b5   29   |       3-5-14  449  0x1c1  3-59  4a   89
26   3-2-0   390  0x186  3-1e  b4   30   |  30   3-6-0   450  0x1c2  3-5a  49   90
     3-2-1   391  0x187  3-1f  b3   31   |       3-6-1   451  0x1c3  3-5b  47   91
     3-2-2   392  0x188  3-20  b2   32   |       3-6-2   452  0x1c4  3-5c  46   92
     3-2-3   393  0x189  3-21  b1   33   |       3-6-3   453  0x1c5  3-5d  45   93
     3-2-4   394  0x18a  3-22  ae   34   |       3-6-4   454  0x1c6  3-5e  43   94
     3-2-5   395  0x18b  3-23  ad   35   |       3-6-5   455  0x1c7  3-5f  3c   95
     3-2-6   396  0x18c  3-24  ac   36   |       3-6-6   456  0x1c8  3-60  3a   96
     3-2-7   397  0x18d  3-25  ab   37   |       3-6-7   457  0x1c9  3-61  39   97
     3-2-8   398  0x18e  3-26  aa   38   |       3-6-8   458  0x1ca  3-62  36   98
     3-2-9   399  0x18f  3-27  a9   39   |       3-6-9   459  0x1cb  3-63  35   99
     3-2-10  400  0x190  3-28  a7   40   |       3-6-10  460  0x1cc  3-64  34  100
     3-2-11  401  0x191  3-29  a6   41   |       3-6-11  461  0x1cd  3-65  33  101
     3-2-12  402  0x192  3-2a  a5   42   |       3-6-12  462  0x1ce  3-66  32  102
     3-2-13  403  0x193  3-2b  a3   43   |       3-6-13  463  0x1cf  3-67  31  103
     3-2-14  404  0x194  3-2c  9f   44   |       3-6-14  464  0x1d0  3-68  2e  104
27   3-3-0   405  0x195  3-2d  9e   45   |  31   3-7-0   465  0x1d1  3-69  2d  105
     3-3-1   406  0x196  3-2e  9d   46   |       3-7-1   466  0x1d2  3-6a  2c  106
     3-3-2   407  0x197  3-2f  9b   47   |       3-7-2   467  0x1d3  3-6b  2b  107
     3-3-3   408  0x198  3-30  98   48   |       3-7-3   468  0x1d4  3-6c  2a  108
     3-3-4   409  0x199  3-31  97   49   |       3-7-4   469  0x1d5  3-6d  29  109
     3-3-5   410  0x19a  3-32  90   50   |       3-7-5   470  0x1d6  3-6e  27  110
     3-3-6   411  0x19b  3-33  8f   51   |       3-7-6   471  0x1d7  3-6f  26  111
     3-3-7   412  0x19c  3-34  88   52   |       3-7-7   472  0x1d8  3-70  25  112
     3-3-8   413  0x19d  3-35  84   53   |       3-7-8   473  0x1d9  3-71  23  113
     3-3-9   414  0x19e  3-36  82   54   |       3-7-9   474  0x1da  3-72  1f  114
     3-3-10  415  0x19f  3-37  81   55   |       3-7-10  475  0x1db  3-73  1e  115
     3-3-11  416  0x1a0  3-38  80   56   |       3-7-11  476  0x1dc  3-74  1d  116
     3-3-12  417  0x1a1  3-39  7c   57   |       3-7-12  477  0x1dd  3-75  1b  117
     3-3-13  418  0x1a2  3-3a  7a   58   |       3-7-13  478  0x1de  3-76  18  118
     3-3-14  419  0x1a3  3-3b  79   59   |       3-7-14  479  0x1df  3-77  17  119
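The four bus tables above all follow the same fixed rule: each DAE holds 15 drive slots, each bus holds 8 DAEs, so the fru # is bus*120 + enclosure*15 + disk, the loop-id is enclosure*15 + disk, and the al_pa is the standard FC-AL loop-ID-to-AL_PA mapping (identical on every bus). The following Python sketch is not part of the original manual; the helper name `disk_address` is ours, and the AL_PA list is transcribed from the BUS 0 table.

```python
# Sketch: derive a table row for a CX3-series disk address (b-e-d triple).
# Formulas are inferred from the tables above; AL_PA values are the
# standard FC-AL loop-ID -> AL_PA mapping as printed in the BUS 0 table.

AL_PA = (
    "ef e8 e4 e2 e1 e0 dc da d9 d6 d5 d4 d3 d2 d1 "
    "ce cd cc cb ca c9 c7 c6 c5 c3 bc ba b9 b6 b5 "
    "b4 b3 b2 b1 ae ad ac ab aa a9 a7 a6 a5 a3 9f "
    "9e 9d 9b 98 97 90 8f 88 84 82 81 80 7c 7a 79 "
    "76 75 74 73 72 71 6e 6d 6c 6b 6a 69 67 66 65 "
    "63 5c 5a 59 56 55 54 53 52 51 4e 4d 4c 4b 4a "
    "49 47 46 45 43 3c 3a 39 36 35 34 33 32 31 2e "
    "2d 2c 2b 2a 29 27 26 25 23 1f 1e 1d 1b 18 17"
).split()  # 120 entries, indexed by loop-id 0..119

def disk_address(bus, enclosure, disk):
    """Return (fru_dec, fru_hex, b_d, al_pa, loop_id) for a b-e-d triple."""
    loop_id = enclosure * 15 + disk       # 15 drive slots per DAE
    fru = bus * 120 + loop_id             # 8 DAEs x 15 disks per bus
    b_d = f"{bus}-{loop_id:x}"            # bus, dash, loop-id in hex (no 0x)
    return fru, f"0x{fru:02x}", b_d, AL_PA[loop_id], loop_id

# Example: disk 2_3_9 (the row garbled as "2-3--9" in the original BUS 2 table)
print(disk_address(2, 3, 9))  # -> (294, '0x126', '2-36', '82', 54)
```

This is handy when reading SP event logs that report a drive only by fru # or loop-id: invert the formula (enclosure = loop_id // 15, disk = loop_id % 15) to get back to the physical slot.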




CX3-Series Array Port Numbering

CX3-10c

SP: SPA  Port ID: 0  UID: iqn.1992-04.com.emc:cx.<serial#>.a0
SP: SPA  Port ID: 1  UID: iqn.1992-04.com.emc:cx.<serial#>.a1
SP: SPA  Port ID: 2  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 3  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPB  Port ID: 0  UID: iqn.1992-04.com.emc:cx.<serial#>.b0
SP: SPB  Port ID: 1  UID: iqn.1992-04.com.emc:cx.<serial#>.b1
SP: SPB  Port ID: 2  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 3  UID: 50:06:01:69:##:##:##:## (MirrorView)

CX3-20

SP: SPA  Port ID: 0  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 1  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPB  Port ID: 0  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 1  UID: 50:06:01:69:##:##:##:## (MirrorView)

CX3-20c

SP: SPA  Port ID: 0  UID: iqn.1992-04.com.emc:cx.<serial#>.a0
SP: SPA  Port ID: 1  UID: iqn.1992-04.com.emc:cx.<serial#>.a1
SP: SPA  Port ID: 2  UID: iqn.1992-04.com.emc:cx.<serial#>.a2
SP: SPA  Port ID: 3  UID: iqn.1992-04.com.emc:cx.<serial#>.a3
SP: SPA  Port ID: 4  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 5  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPB  Port ID: 0  UID: iqn.1992-04.com.emc:cx.<serial#>.b0
SP: SPB  Port ID: 1  UID: iqn.1992-04.com.emc:cx.<serial#>.b1
SP: SPB  Port ID: 2  UID: iqn.1992-04.com.emc:cx.<serial#>.b2
SP: SPB  Port ID: 3  UID: iqn.1992-04.com.emc:cx.<serial#>.b3
SP: SPB  Port ID: 4  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 5  UID: 50:06:01:69:##:##:##:## (MirrorView)




CX3-20f

SP: SPA  Port ID: 0  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 1  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPA  Port ID: 2  UID: 50:06:01:62:##:##:##:##
SP: SPA  Port ID: 3  UID: 50:06:01:63:##:##:##:##
SP: SPA  Port ID: 4  UID: 50:06:01:64:##:##:##:##
SP: SPA  Port ID: 5  UID: 50:06:01:65:##:##:##:##
SP: SPB  Port ID: 0  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 1  UID: 50:06:01:69:##:##:##:## (MirrorView)
SP: SPB  Port ID: 2  UID: 50:06:01:6A:##:##:##:##
SP: SPB  Port ID: 3  UID: 50:06:01:6B:##:##:##:##
SP: SPB  Port ID: 4  UID: 50:06:01:6C:##:##:##:##
SP: SPB  Port ID: 5  UID: 50:06:01:6D:##:##:##:##

CX3-40

SP: SPA  Port ID: 0  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 1  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPB  Port ID: 0  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 1  UID: 50:06:01:69:##:##:##:## (MirrorView)

CX3-40c

SP: SPA  Port ID: 0  UID: iqn.1992-04.com.emc:cx.<serial#>.a0
SP: SPA  Port ID: 1  UID: iqn.1992-04.com.emc:cx.<serial#>.a1
SP: SPA  Port ID: 2  UID: iqn.1992-04.com.emc:cx.<serial#>.a2
SP: SPA  Port ID: 3  UID: iqn.1992-04.com.emc:cx.<serial#>.a3
SP: SPA  Port ID: 4  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 5  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPB  Port ID: 0  UID: iqn.1992-04.com.emc:cx.<serial#>.b0
SP: SPB  Port ID: 1  UID: iqn.1992-04.com.emc:cx.<serial#>.b1
SP: SPB  Port ID: 2  UID: iqn.1992-04.com.emc:cx.<serial#>.b2
SP: SPB  Port ID: 3  UID: iqn.1992-04.com.emc:cx.<serial#>.b3
SP: SPB  Port ID: 4  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 5  UID: 50:06:01:69:##:##:##:## (MirrorView)





CX3-40f

SP: SPA  Port ID: 0  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 1  UID: 50:06:01:61:##:##:##:## (MirrorView)
SP: SPA  Port ID: 2  UID: 50:06:01:62:##:##:##:##
SP: SPA  Port ID: 3  UID: 50:06:01:63:##:##:##:##
SP: SPB  Port ID: 0  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 1  UID: 50:06:01:69:##:##:##:## (MirrorView)
SP: SPB  Port ID: 2  UID: 50:06:01:6A:##:##:##:##
SP: SPB  Port ID: 3  UID: 50:06:01:6B:##:##:##:##

CX3-80

SP: SPA  Port ID: 0  UID: 50:06:01:60:##:##:##:##
SP: SPA  Port ID: 1  UID: 50:06:01:61:##:##:##:##
SP: SPA  Port ID: 2  UID: 50:06:01:62:##:##:##:##
SP: SPA  Port ID: 3  UID: 50:06:01:63:##:##:##:## (MirrorView)
SP: SPB  Port ID: 0  UID: 50:06:01:68:##:##:##:##
SP: SPB  Port ID: 1  UID: 50:06:01:69:##:##:##:##
SP: SPB  Port ID: 2  UID: 50:06:01:6A:##:##:##:##
SP: SPB  Port ID: 3  UID: 50:06:01:6B:##:##:##:## (MirrorView)

NOTES:

1. iSCSI and FC ports are front-end ports; BE ports are back-end fibre channel ports.
2. MirrorView/S and MirrorView/A port usage:

• Port 5 for a CX3-20c or CX3-40c SP
• Port 3 for a CX600, CX700, CX3-80, or CX3-10c SP
• Port 1 for a CX400, CX500, CX3-20, CX3-20f, CX3-40, or CX3-40f SP

3. MirrorView/S and MirrorView/A use a front-end port on each storage processor (SP) as a communication channel between the storage systems in a remote mirror configuration. Although server I/O can share the front-end port with MirrorView/S or MirrorView/A, for performance reasons we strongly recommend that server I/O use front-end ports that MirrorView/S or MirrorView/A are not using.

4. MirrorView/S (and/or MirrorView/A) and SAN Copy software cannot share the same SP port. Before installing the MirrorView enabler, you must deselect any MirrorView ports that a SAN Copy session is using; otherwise, any SAN Copy sessions using the MirrorView port will fail.

5. LAN ports: blue designates the customer LAN; white designates the service LAN.
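A regular pattern runs through the FC port UIDs listed above: byte 4 of the WWPN is 0x60 + port ID for SPA and 0x68 + port ID for SPB on the FC-only models (CX3-20, CX3-20f, CX3-40, CX3-40f, CX3-80), while the trailing four bytes are array-specific. The sketch below is ours, not from the manual; the helper name `fc_port_uid` is hypothetical, and on the iSCSI/FC combo models (CX3-10c, CX3-20c, CX3-40c) the two FC ports use 0x60/0x61 (SPA) and 0x68/0x69 (SPB) regardless of their port IDs, so this simple formula does not apply there.

```python
# Sketch of the FC front-end port UID pattern visible in the listings above.
# Applies to the FC-only CX3 models; the "##:##:##:##" tail is array-specific
# and left as a placeholder exactly as the manual prints it.

def fc_port_uid(sp, port):
    """Return the WWPN pattern for an SP front-end FC port (sp = 'A'/'B')."""
    assert sp in ("A", "B") and 0 <= port <= 7
    byte4 = (0x60 if sp == "A" else 0x68) + port
    return f"50:06:01:{byte4:02X}:##:##:##:##"

print(fc_port_uid("B", 3))  # -> 50:06:01:6B:##:##:##:## (e.g. SPB port 3 on a CX3-40f)
```

Reading the pattern in reverse is the common troubleshooting use: given a WWPN from a switch name server, byte 4 in the 0x60-0x67 range points at SPA, 0x68-0x6F at SPB, and the low offset gives the port ID.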



