

White Paper: Implementing Sun's Enhanced DIMM Management Policy

Implementing Sun's Enhanced DIMM Management Policy for UltraSPARC® II, III, IV and T1 Systems

A Technical White Paper April 2006

Steve Chessin, Customer Advocacy

Charlie Slayman, Memory Technology Group

Darin Carlson, Product Technical Support

Approved by Legal and SSG Chief Quality Officer/VP for customers with CDA

Sun Confidential: Confidential Disclosure Agreement (CDA) Required

Sun Confidential


Copyright 2006 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.

Sun Microsystems, Inc. has intellectual property rights relating to technology that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries.

This document and the product to which it pertains are distributed under licenses restricting their use, copying, distribution, and decompilation. No part of the product or of this document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.

Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd.

Sun, Sun Microsystems, the Sun logo, AnswerBook2, docs.sun.com, Sun Fire, SunVTS, SunSoft Alliance Plus, Sun BluePrints, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and in other countries.

All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and in other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.

U.S. Government Rights-Commercial use. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.

DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.



Implementing Sun's Enhanced DIMM Management Policy

As part of Sun's ongoing effort to simplify the data center, Sun is offering its customers an enhanced DIMM (dual inline memory module) management policy and new memory diagnostic tool set aimed at reducing the complexity and frequency of dealing with memory issues on UltraSPARC® II, III, and IV systems. Specifically, the policy and tool set are aimed at reducing the overall downtime associated with memory events, reducing the frequency with which customers have to focus on memory issues, and reducing the number of memory-related service actions customers experience, to help enhance the availability of Sun™ systems and reduce customers' total cost of ownership.

Sun recognizes that DIMM maintenance has been a source of concern for customers and has been taking steps over the past year to enhance the Solaris™ Operating System's (OS's) ability to monitor and respond to system telemetry about DIMMs. (Note that Sun's focus has been on the Solaris OS and not on the DIMMs themselves, because Sun already receives the best memory components currently available in the industry. See Sun white paper “Sun UltraSPARC III Memory for RAS.”) Recent Kernel Updates (KUs) to Solaris 8 and 9 offer memory page retirement (MPR), which provides automated offlining of suspect portions of memory DIMMs (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later. See Sun white paper “Solaris Operating System Availability Features” at http://www.sun.com/blueprints/0504/817-7039.pdf.) Predictive Self-Healing (PSH), available with Solaris 10, includes this functionality and more. (See Sun article “Predictive Self-Healing in the Solaris 10 Operating System: A Technical Introduction” at http://www.sun.com/bigadmin/content/selfheal/selfheal_overview.pdf.)

The error correction code (ECC) on Sun systems—with and without MPR—distinguishes between uncorrectable memory errors (UEs), which require immediate attention, and correctable memory errors (CEs), to which monitoring policies (both manual and automated) and thresholds can be applied to help administrators effectively plan DIMM maintenance windows. Until now, Sun's recommended threshold for CEs has been an extremely conservative “greater than 2 CEs in 24 hours” rule, but that CE threshold has become outdated because it does not take into consideration recent enhancements to the Solaris OS and patch levels currently available.

The enhanced policy, while still conservative, builds on recent DIMM research at Sun and allows system administrators to adopt DIMM maintenance practices most suitable for a system's Solaris OS patch level. A full description of the enhanced policy follows.

Replace a DIMM when:

1. Power-on Self Test (POST) fails it.

2. For systems with Predictive Self-Healing (Solaris 10 and later, except on UltraSPARC II-based platforms), when the system tells you to.

3. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports a UE or disrupting UE (DUE), and investigation shows that the UE or DUE truly originated from memory, and not from a transfer from some CPU's cache, as determined by a qualified Sun Support specialist.


4. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs that are (either a or b as follows):

a. from different physical addresses on each of two or more different bit positions from the same DIMM within 24 hours of each other, and all the addresses are in the same relative checkword (that is, the asynchronous fault address registers [AFARs] are all the same modulo 64). (Note: This means at least four CEs; two from one bit position, with unique addresses, and two from another, also with unique addresses, and the lower 6 bits of all the addresses are the same.) The new memory diagnostic tool set, referred to as the “cediag tool set” checks for this pattern.

b. or from different physical addresses on each of three or more different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords. (Note: This means at least 6 CEs – two from one DRAM output signal with unique addresses, two from another output from the same DRAM, also with unique addresses, and two more from yet another output from the same DRAM, again with unique addresses, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.) The new memory diagnostic tool set, referred to as the “cediag tool set” checks for this pattern.

5. For Solaris 8 and 9 systems with memory page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 or later, when the system indicates that the page retirement limit of 0.1 percent of physical memory has been reached and denotes one and only one DIMM as suspect (that is, it has accumulated 130 or more non-intermittent CEs). If more than one DIMM is marked as suspect, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs. (Note: Determining these factors will be aided by the cediag tool set.) In the unlikely event that the system indicates that the page retirement limit has been reached but no DIMM is marked as suspect, contact a Sun Support specialist for assistance in determining any necessary action.

6. For older Solaris releases and patch levels, when Solaris reports more than 24 non-intermittent CEs in 24 hours from a single DIMM. If more than one DIMM has experienced more than 24 non-intermittent CEs in 24 hours, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.
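The pattern in rule 4a can be easier to follow as a predicate. The following Python sketch is only an illustration of the rule as worded above and is not the cediag implementation. The CE record layout and the function name are hypothetical, and "within 24 hours of each other" is read here as "all contributing CEs fall inside one 24-hour window."

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class CE(NamedTuple):
    """Hypothetical CE record for one DIMM (not a Solaris data structure)."""
    when: datetime
    afar: int  # physical address reported with the CE
    bit: int   # bit position within the checkword


WINDOW = timedelta(hours=24)


def matches_rule_4a(events: Iterable[CE]) -> bool:
    """Return True if CEs from a single DIMM show the rule 4a pattern."""
    ces = sorted(events, key=lambda ce: ce.when)
    for i, first in enumerate(ces):
        # Candidate window: CEs in the 24 hours starting at ces[i].
        in_window = [ce for ce in ces[i:] if ce.when - first.when <= WINDOW]
        # Group distinct addresses by relative checkword (AFAR mod 64),
        # then by bit position.
        groups = defaultdict(lambda: defaultdict(set))
        for ce in in_window:
            groups[ce.afar % 64][ce.bit].add(ce.afar)
        for by_bit in groups.values():
            # Bit positions that erred at two or more different addresses.
            repeating = [b for b, addrs in by_bit.items() if len(addrs) >= 2]
            if len(repeating) >= 2:
                return True
    return False
```

With this reading, the minimum match is four CEs: two addresses on one bit position and two on another, all with the same lower 6 address bits.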

Limitations:

• Prior to Solaris 10, retired pages are returned to service whenever a system is rebooted, and will be re-retired if and when Solaris encounters CEs from them again.

• POST may fail a DIMM that contained retired pages; if it does, replace the DIMM at that time.

By adopting Sun's enhanced DIMM management policy and new memory diagnostic tool set, customers can expect to reduce DIMM replacements and realize additional uptime per system annually. Fewer DIMM replacements also mean fewer human contacts with the system board and, ultimately, fewer opportunities to inadvertently introduce anomalies into the system, an added benefit for customers adopting the enhanced policy and tool set.

This paper addresses the following topics:

“Background: Memory Errors,” which describes the causes and types of memory errors and their relationship to DIMM replacement.

“Supporting Data for the Revised Policy,” which describes the internal testing and empirical research effort that Sun conducted on DIMMs in 2004.

“Solaris CE Messaging,” which provides a brief history of CE messaging with the Solaris OS.


“Policy for Solaris 10 Systems,” which briefly reviews the portions of the policy applicable to systems running the Solaris 10 OS.

“Policy for Solaris 8 and 9 Systems With MPR,” which reviews MPR and the new, applicable policy thresholds and explains how to implement the policy on these systems.

“Policy for Systems Running Older Solaris OS Versions,” which reviews the applicable policy threshold and explains how to implement the policy on UltraSPARC II, III, and IV systems running Solaris 8 and 9 at lower patch levels or older versions of the Solaris OS.

This paper assumes that you have a basic understanding of physical memory on Sun systems and an administrator's knowledge of the Solaris OS.

Background: Memory Errors

The following paragraphs provide a description of memory component failure mechanisms.

Causes of Soft Errors and Failures in DRAM

Dynamic random-access memory (DRAM) errors can occur from natural events as well as from latent defects in the integrated circuit (IC).

“Soft error” refers to a naturally occurring event caused by background radiation—either radioactive material in the IC package (alpha particle radiation) or cosmic rays that penetrate the earth’s atmosphere (neutron radiation). When an energetic alpha particle or neutron from one of these events upsets a memory cell, the stored charge is lost, resulting in an error in that bit. These events do not cause permanent damage to the DRAM chip, but the data stored in the cell is lost. (ECC protection allows recovery of that lost data.)

A “hard fail” occurs when a latent defect degrades over time and results in permanent physical damage in one or more of the IC layers. Some examples are dielectric breakdown, polysilicon bridging defects, or metal lines breaking from electromigration. These result in a permanent failure of the portion of the DRAM chip, and corrective action must be taken to eliminate the defect from operation. Replacement of the DIMM or retiring the page that contains the defect would be two examples of corrective action.

A “weak bit” is a category of hard error that looks very much like a soft error. A weak bit is a cell that tends to lose data over time (junction leakage) or is easily upset (pattern sensitivity). It might take several write/read cycles for the problem to repeat, depending on how bad the leakage or how unique the pattern is, so a weak bit will look like a soft error. But unlike soft errors, which are random in location, errors due to a weak bit will always appear in the same location. Memory vendors use various test programs to screen for these weak cells, but because latent defects degrade over time, memory that initially tests defect-free can later develop weak bit symptoms, just as it can develop hard fails.

In a normal manufacturing process, latent defects occur randomly across the IC. These defects tend to degrade randomly as a function of voltage stresses and temperature, which leads to a failure rate that is constant in time. Typical DRAM chips display a failure rate on the order of 50 FIT. (1 FIT = one device failure per about 114,000 years. 50 FIT = one device failure per about 2280 years.) Misprocessing, wafer contamination, or design errors can lead to wear-out mechanisms that drive all the latent defects to fail in a short window of time. There are many techniques to detect and screen defective DRAM chips from entering the supply chain that are beyond the scope of this paper, which focuses on DIMM management policy for normal material.

A defect-driven failure is localized to the cells and circuits connected to that defect. Any issue arising from that defect does not spread over time. However, over time, other latent defects can cause new failures. But because of their random spatial and temporal nature, observation of one hard fail is not a good predictor of future fails. This is an important consideration when developing a component (DIMM) management policy.
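The FIT figures quoted above follow from the standard definition of 1 FIT as one failure per 10^9 device-hours. The short Python sketch below reproduces the conversion; the function name is illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; leap days ignored


def years_per_failure(fit: float) -> float:
    """Mean years between failures for one device at a constant rate of `fit` FIT."""
    device_hours_per_failure = 1e9 / fit
    return device_hours_per_failure / HOURS_PER_YEAR


print(round(years_per_failure(1)))   # about 114,000 years
print(round(years_per_failure(50)))  # about 2,280 years
```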

Another consideration is that failure of other components in the system might cause what appear to be DRAM errors. For example, a defect that appears in one of the data path circuits might cause data to be read incorrectly from memory, or might cause data to be written incorrectly to memory, or both. The first case ("bad reader") will cause the receiving CPU to report a CE or UE from memory, even though both the memory and the data it contains are actually good. The second case ("bad writer") will cause a CE or UE to be reported when that memory location is subsequently accessed. Depending on the location of the defective part in the data path, these errors will appear to come either from a single DIMM or from many DIMMs, either on one system board or on many system boards, and be reported by just one CPU or by many CPUs.

On Sun's midrange and high-end servers, the data path circuitry is monitored by a service processor or system controller that produces messages that can help in distinguishing such problems from true memory failures. Describing how to do so is beyond the scope of this paper. Due to the complexity of diagnosing such problems, the enhanced DIMM management policy calls for a diagnosis from Sun Support specialists when data-path or other non-DIMM-related failures might be present.

Types of Memory Errors

Memory errors are categorized as correctable (CE) and uncorrectable (UE), depending on how many bits from the DRAM are in error and the error correction code used. UltraSPARC II, III, and IV designs use Single-bit Error Correction/Double-bit Error Detection (SEC-DED) code. So single-bit errors will be reported as CEs, and double-bit errors will be reported as UEs.
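To illustrate how a SEC-DED code separates CEs from UEs, the following Python sketch uses a toy extended Hamming (8,4) code. It is an assumption-laden teaching example, not the code used by UltraSPARC processors, which protect much wider checkwords; the function names are hypothetical.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; parity bits occupy 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code


def decode(received: list):
    """Return (status, nibble): 'OK', 'CE' (corrected), or 'UE' (nibble is None)."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd parity: treat as a single-bit error and flip the indicated bit
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Nonzero syndrome with even parity: two bits in error, detect only.
        return "UE", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

A single flipped bit is corrected and reported as a CE, while two flipped bits in the same codeword can only be detected and are reported as a UE.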

Extremely energetic neutrons can upset multiple cells in the vicinity of a cosmic ray strike. To minimize this effect, memory arrays are physically interleaved so that neighboring cells map to different words; a multi-cell upset then results in multiple single-bit errors (multiple CEs).

A hard fail that impacts only the operation of a single cell, or multiple nearest neighbor cells, will result in a single CE or multiple CEs. If the defect occurs on a metal or polysilicon line, all the cells in a row or column might fail. DIMMs are designed such that a row or column failure maps to multiple CEs. If a power distribution line fails due to electromigration, an entire sub-array might fail, which would cause UEs.

Roughly 60 percent of a DRAM chip consists of memory cells (or memory array). The remaining approximately 40 percent of the chip area is taken up by periphery circuits—bonding pads, interconnect routing, control and address circuitry to access the memory array, and amplifiers to boost the signals from the cells. However, the DRAM cells use smaller geometries (or design rules) than the other parts of the circuit. So, even though latent defects are distributed randomly across the chip, degradation of small latent defects is much more likely to affect the memory array than the periphery circuits. While the ratio of memory array-to-periphery circuitry might be 60%/40%, the ratio of memory array-to-periphery hard fails could be 95%/5% or even higher.

When to Replace Memory?

Soft Errors: Memory should not be replaced based on soft errors since these are naturally occurring events. DIMM replacement will not improve the situation, and no permanent damage has occurred. However, an abnormal soft error rate could be due to weak cell behavior, which is actually the onset of a hard fail. Soft Error Rate Discriminator (SERD) engines are designed to differentiate naturally occurring soft error events from the onset of hard failures.

Hard Fails: When a hard fail occurs, it will cause either UEs or repeating CEs, or both, and some form of corrective action is required. If the hard fail causes UEs, then the corrective action is to replace the DIMM. If it causes repeating CEs, then corrective action is required not because things will get worse over time (the defect has already occurred) but because the system is operating in a reduced ECC capacity. For systems with MPR, the appropriate action would be to retire the page that contains the defect. If the defect impacts multiple pages (for example, row failure or column failure), it will be necessary to retire multiple pages. If retirement of too many pages impacts system performance due to reduced memory capacity, it would then be necessary to replace the DIMM. But it should be noted that with the occurrence of a single-cell fail, the successful retirement of the page it resides on is sufficient corrective action, and DIMM replacement is not necessary.

Supporting Data for the Revised Policy

One of the largest contributors to service repair costs and system downtime on Sun UltraSPARC systems running Solaris is DIMM replacement. This is due to two key factors:

• ECC detects errors in any memory data transaction, and causes all of these events to be logged in the /var/adm/messages files on systems without Predictive Self-Healing (that is, Solaris 9 and earlier versions on all systems, as well as Solaris 10 and later versions on UltraSPARC II-based systems).

• The Solaris OS assigns CEs to a particular DIMM and UEs to a bank of DIMMs. It is easy to misinterpret this “error report” message as a “fault diagnosis” message, and replace a DIMM that is the innocent victim of a data transaction from another faulty component.

Various technical teams at Sun have embarked on an effort to better understand these errors and improve system uptime with respect to memory-related issues. Two of the efforts, described in the following paragraphs, have had a direct impact on Sun's enhanced DIMM management policy.

Sun's Internal Study of CEs and DIMMs

In January 2004, an internal Sun℠ Sigma team undertook the study of DIMMs returned from the field. (Note that “Sun Sigma” is Sun's own terminology for “Six Sigma.”) The purpose of the study was to investigate the behavior of systems experiencing various CE rates and to determine an optimum CE rate threshold for which DIMM replacement is appropriate.

For the study, 843 DIMMs were randomly selected from Sun's returned parts inventory and sent through a screening process. This number was selected for a full factorial design of experiment (DOE) that included defective system boards installed in Sun Fire™ 6800 and 15K systems, DRAM refresh rate, and memory diagnostic tests. MPR was disabled to determine the effects of recurring CEs on system behavior. Of these DIMMs, 576 were “good.” That is, they continued to operate throughout the screening process without displaying any errors. This ratio (576 to 843) supports Sun anecdotal data showing that the No Trouble Found (NTF) rate for DIMMs returned from the field exceeds 70 percent. So the Sigma team's initial hypothesis was that the previous DIMM replacement policy had been overly conservative and was potentially a contributing factor to the NTF rate.

Of the remaining “suspect” DIMMs, 75 either failed POST immediately or displayed UEs within eight hours of testing with the SunVTS™ technology. Because stable system operation could not be achieved with these DIMMs, they were excluded from further DOE study. Whatever failure had occurred on these DIMMs, it was not possible to observe CE behavior. The remaining 768 DIMMs (576 “good” plus 192 “suspect”) were run over the five-month period of the DOE. At no time during the DOE was a UE observed.

CE Rate Data Distribution

Figure 1 is an example of a cumulative probability plot of CEs per DIMM (during 24 hours) in one of the DOE runs. In this run, 64 DIMMs displayed at least one (1) CE in a 24-hour period. Note that each dot in the figure represents a DIMM. Of these 64:

• 6 DIMMs (approximately 9 percent) experienced CEs at a rate that fell below the old policy threshold of “greater than 2 CEs in 24 hours,” and thus would not be replaced under that policy.

• Another 15 DIMMs (approximately 23 percent) displayed CEs at a rate that fell between 3 and 24 CEs in a 24-hour period. (These DIMMs would have been replaced using the former policy but not the enhanced policy).

• The remaining 43 DIMMs (approximately 67 percent) displayed CEs at a rate greater than 24 in 24 hours.

Because none of these DIMMs displayed any UE behavior, analysis of this distribution led the Sigma team to conclude that increasing the CE threshold from “greater than 2 CEs in 24 hours” to “greater than 24 non-intermittent CEs in 24 hours” could reduce the DIMM replacement rate by approximately one third (9% + 23%). Increasing the threshold to a higher CE rate would further reduce the replacement rate, but the long tail of the distribution indicates that a very large increase in the threshold would be required to gain a small additional benefit. For example, to gain an additional 30 percent reduction in DIMM replacements, the threshold would have to be increased beyond 1000 CEs within a 24-hour period! So the Sigma team recommended “greater than 24 non-intermittent CEs in 24 hours” as a prudent change that yields a large benefit in reducing the replacement of DIMMs and offers only a modest change to the current policy.

Figure 1. Cumulative probability plot of CEs per DIMM within a 24-hour period.


Note that page retirement has been turned off in this example. For systems with page retirement implemented, many if not all of the CEs would be avoided after the third event.
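The percentages quoted for this DOE run can be rechecked from the DIMM counts. The following Python sketch uses only the counts given in the text; the variable names are illustrative.

```python
# Counts quoted in the text for one DOE run (page retirement disabled).
DIMMS_WITH_CES = 64
AT_OR_BELOW_OLD_THRESHOLD = 6   # 1 or 2 CEs in 24 hours
BETWEEN_THRESHOLDS = 15         # 3 to 24 CEs in 24 hours
ABOVE_NEW_THRESHOLD = 43        # more than 24 CEs in 24 hours


def percent(count: int) -> float:
    """Share of the 64 CE-reporting DIMMs, in percent."""
    return 100 * count / DIMMS_WITH_CES


# The three groups account for every DIMM that reported a CE.
assert (AT_OR_BELOW_OLD_THRESHOLD + BETWEEN_THRESHOLDS
        + ABOVE_NEW_THRESHOLD) == DIMMS_WITH_CES

# DIMMs that stay in service under the enhanced threshold.
not_replaced = AT_OR_BELOW_OLD_THRESHOLD + BETWEEN_THRESHOLDS

print(f"{percent(AT_OR_BELOW_OLD_THRESHOLD):.1f}%")  # 9.4%
print(f"{percent(BETWEEN_THRESHOLDS):.1f}%")         # 23.4%
print(f"{percent(ABOVE_NEW_THRESHOLD):.1f}%")        # 67.2%
print(f"{percent(not_replaced):.1f}%")               # 32.8%
```

The last figure, 21 of 64 DIMMs, is the "approximately one third" that the Sigma team's conclusion refers to.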

CEs and System Performance

During these experiments, over 200,000 CEs were observed in 24 hours on a single system, and the system experienced no degradation of performance. Specifically, the Sigma team observed no degradation in login response times, kernel command process times, or ramtest completion times. For the Sigma team, this was an indication of the robust and efficient behavior of the system's ECC and the Solaris CE-handling code.

Implications of the Study

As previously indicated, the Sigma team, based on an analysis of the study's data, recommended that Sun make relatively conservative changes in the policies governing DIMM replacement. The team indicated that with the enhanced “24 non-intermittent CEs in 24 hours” policy, Sun could be confident that its customers would experience improved uptime (due to less system maintenance time) and at least a 23 percent reduction in DIMM replacements annually—more than a 32 percent reduction if the enhanced policy guidelines were strictly adhered to (see Figure 1).

The team noted that the absence of UEs during the study was remarkable. This called into question the policy of replacing DIMMs displaying CEs in the hopes of avoiding a UE. In fact, to the contrary, even the DIMMs that displayed extremely high CE rates never displayed UE behavior.

The team recognized that there are two modes that could possibly give rise to CEs turning into UEs: “collision” of CEs or gradual degradation of the DIMM. In the first case (“collision”), there is a finite probability that two random single-bit errors could wind up on the same checkword at the same time, which is considered a UE by the UltraSPARC II, III, and IV SEC-DED ECC code. Specific calculations by the Sigma team about this probability proved its likelihood to be extremely remote: once in 123 years for a DIMM displaying an error rate of 24 CEs in 24 hours. The team's calculations also showed that an even more remote “collision” possibility is the random combination of a radiation-induced bit flip with a pre-existing stuck bit in the DRAM. Radiation-induced bit flips in DRAM chips occur at a rate of roughly 10^-15 flips per bit per hour. For a 144-bit checkword that already has a stuck bit (CE), it would take roughly 10^9 years to experience a radiation-induced hit that would lead to a UE.

The second mode—degradation over time of components on the DIMM, described in the “Background: Memory Errors” section—is discussed in the following section.
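The stuck-bit “collision” estimate can be rechecked with a few lines of arithmetic. The Python sketch below reads the quoted radiation rate as 10^-15 upsets per bit per hour, which is an interpretation of the figure in the text, and uses all 144 bits of the checkword as the text does. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; leap days ignored


def years_until_second_error(rate_per_bit_hour: float, checkword_bits: int) -> float:
    """Expected years before a radiation upset lands in one given checkword."""
    upsets_per_hour = rate_per_bit_hour * checkword_bits
    return 1 / upsets_per_hour / HOURS_PER_YEAR


years = years_until_second_error(1e-15, 144)
print(f"{years:.1e} years")  # on the order of 10^9 years
```

The result is slightly under 10^9 years, which agrees with the rough figure given by the Sigma team.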

Sun's Empirical Research With Sun Explorer Files

Because of the lack of evidence supporting the assumption that DIMMs exhibiting a high CE rate would eventually lead to a UE event, a study was undertaken to understand the history of a DIMM's error log prior to a UE event and determine if establishing a CE rate threshold as a “leading indicator” of a UE is possible. Explorer files from 2393 UltraSPARC II and UltraSPARC III systems containing UEs were reviewed to determine if there was a correlation between CE rate and UEs.

Of the 2393 UE cases reviewed (see Table 1):

• 2350 cases (98.2 percent) showed either no CEs prior to the UEs or would not have triggered the old policy or the enhanced one (that is, they displayed only 1 or 2 CEs).


• 43 cases (1.8 percent) would trigger the old policy to replace the DIMM before a UE was observed. However, only 29 cases (1.2 percent) gave an adequate warning interval (greater than six hours) to make a DIMM replacement possible.

• The breakdown of these 43 cases is as follows:

• 23 cases (9 + 14) (1.0 percent) would have triggered a DIMM replacement under both the old and enhanced policies. However, in only 14 of these (0.6 percent) did the CE pattern give an adequate warning interval (greater than six hours) to make DIMM replacement possible.

• 20 cases (43 - 23) (0.8 percent) would have triggered DIMM replacement only under the old policy and not the enhanced one. But only 15 of these (0.6 percent) would have allowed for an adequate response time (not shown in Table 1).
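The case counts in the list above can be cross-checked against one another. The following Python sketch derives the per-policy breakdown from the counts quoted in the bullets; the variable and function names are illustrative.

```python
TOTAL_UE_CASES = 2393

OLD_TRIGGERED = 43        # old policy would have called for replacement
OLD_WITH_NOTICE = 29      # ...with more than six hours of warning
BOTH_TRIGGERED = 23       # old and enhanced policies both triggered
BOTH_WITH_NOTICE = 14     # ...with more than six hours of warning


def breakdown(triggered: int, with_notice: int) -> tuple:
    """Return (no trigger, trigger with <6hr notice, trigger with >6hr notice)."""
    return (TOTAL_UE_CASES - triggered, triggered - with_notice, with_notice)


def pct(count: int) -> float:
    """Percentage of all UE cases, rounded to one decimal place."""
    return round(100 * count / TOTAL_UE_CASES, 1)


for policy, counts in (
    ("Old policy", breakdown(OLD_TRIGGERED, OLD_WITH_NOTICE)),
    ("Enhanced policy", breakdown(BOTH_TRIGGERED, BOTH_WITH_NOTICE)),
):
    print(policy, [(n, pct(n)) for n in counts])
```

The derived counts are 2350, 14, and 29 cases for the old policy and 2370, 9, and 14 cases for the enhanced policy.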

Table 1. Summary of UE Explorer File Study (Total of 2393 Cases).

                  Provided No Trigger    Trigger with        Trigger with
                  For a UE Event         <6hr Notice         >6hr Notice
Old Policy        2350 cases (98.2%)     14 cases (0.6%)     29 cases (1.2%)
Enhanced Policy   2370 cases (99.0%)     9 cases (0.4%)      14 cases (0.6%)

Table 1 summarizes the results of the Explorer file study. The percentage of cases in which CE behavior would have effectively anticipated a UE event is only 1.2 percent under the old policy and 0.6 percent under the enhanced policy. In either case, CE behavior was a useful predictor of a UE event less than 1.5 percent of the time. This analysis supports the Sigma team's earlier finding that the correlation of CEs and UEs is very weak. The hypothesis that DIMMs gradually degrade from CEs to UEs over time is, therefore, not supported by this analysis.

Solaris CE Messaging

The way the Solaris OS has reported CEs has evolved over time. Early releases of Solaris 8 logged each CE both to the console and to the /var/adm/messages files. Later, to help implement the now-obsolete “replace DIMMs with more than 2 CEs in 24 hours” rule, a “leaky bucket” SERD algorithm was implemented. (The “leaky bucket” SERD algorithm first appeared in Solaris 8 patch 108528-16 and Solaris 9 patch 112233-01.) This produced a message (both on the console and in /var/adm/messages) whenever the “more than 2 CEs in 24 hours” threshold was reached. Each individual CE was still logged in /var/adm/messages, but was no longer displayed on the console.

The following is an example of the warning message:

Nov 15 20:27:17 sample300 unix: [ID 358211 kern.warning] WARNING: [AFT0] 3 soft errors in less than 24:00 (hh:mm) detected from Memory Module Slot A: J8000

In this case, three CEs (or “soft errors”) were detected from the DIMM (or “Memory Module”) identified as J8000. This warning message would continue to be generated for each new CE on the DIMM as long as the three most recent CEs occurred within 24 hours of each other.

MPR added different CE messages: one when a page was scheduled for retirement (or, as it says in the message itself, for “removal”), and another when the page was actually retired (removed from service). (See the section in this paper: “A Brief Description of MPR.”) Following are examples of each of these messages:

Sep 16 03:57:44 sample-a unix: NOTICE: Scheduling removal of page 0x00000000.fad60000

Sep 16 03:57:47 sample-a unix: NOTICE: Page 0x00000000.fad60000 removed from service
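For administrators who process messages files with scripts, the following Python sketch shows one way to recognize the message types shown in this section. The regular expressions are derived only from the sample lines in this paper, and the function name classify_line is hypothetical. Real messages may differ by Solaris release and patch level, so treat this as a starting point rather than a complete parser.

```python
import re

# Patterns derived only from the sample lines in this paper (assumption:
# the wording is stable across the patch levels in use).
SERD_WARNING = re.compile(
    r"(\d+) soft errors in less than (\d+):(\d+) \(hh:mm\) "
    r"detected from Memory Module (.+)$"
)
PAGE_SCHEDULED = re.compile(r"Scheduling removal of page (0x[0-9a-f.]+)")
PAGE_RETIRED = re.compile(r"Page (0x[0-9a-f.]+) removed from service")


def classify_line(line):
    """Return (kind, detail) for a recognized CE-related line, else None."""
    line = line.strip()
    m = SERD_WARNING.search(line)
    if m:
        return ("serd_warning", {"count": int(m.group(1)), "module": m.group(4)})
    m = PAGE_SCHEDULED.search(line)
    if m:
        return ("page_scheduled", {"page": m.group(1)})
    m = PAGE_RETIRED.search(line)
    if m:
        return ("page_retired", {"page": m.group(1)})
    return None


sample = "Sep 16 03:57:47 sample-a unix: NOTICE: Page 0x00000000.fad60000 removed from service"
print(classify_line(sample))
```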

With the advent of Predictive Self-Healing in Solaris 10, error messages are replaced by fault messages. Instead of relying on people (or ad hoc scripts) to process error messages to figure out which part to replace and when, the system does its own analysis of errors and, when appropriate, produces a fault message indicating exactly what action the administrator should take. Each error is still logged, but only in a binary-format log file that rarely needs to be consulted.

Policy for Solaris™ 10 Systems (Except UltraSPARC II Systems)

For applicable systems running Solaris 10, replace a DIMM when one of the following occurs:

• POST fails the DIMM

• A system message indicates that you are to replace the DIMM

Additional information about messaging and maintenance on systems running Solaris 10 should be available in existing Sun documentation or will be forthcoming soon. For the latest Solaris 10 information, check the BigAdmin Web site: http://www.sun.com/bigadmin/.

Policy for Solaris 8 and 9 Systems With MPR and UltraSPARC II-based Solaris 10 Systems

For all systems at Solaris 8 and 9 patch levels capable of MPR (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 or later, replace a DIMM when one of the following occurs:

• POST fails the DIMM.

• Solaris reports a UE or DUE, and investigation shows that the event truly originated from memory.

• Solaris reports two or more CEs from two or more different physical addresses on each of (a) two or more different bit positions from the same DIMM within 24 hours of each other, with all the addresses in the same relative checkword, or (b) three or more different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords (Rule 4 of the policy).

• The system indicates that the page retirement limit of 0.1 percent of physical memory has been reached and denotes one and only one DIMM as suspect (that is, it has accumulated 130 or more non-intermittent CEs) (Rule 5 of the policy).

• If more than one DIMM has been marked as suspect, then the policy indicates that other possible causes of CEs have to be ruled out by a qualified Sun Support specialist. (Note: Determining these factors will be aided by the cediag tool set.)

• In the unlikely event that the tool indicates that the page retirement limit has been reached but no single DIMM has been marked as suspect, the policy requests that a customer contact a Sun Support specialist for assistance in determining any necessary action.

To implement this policy in your data center, you should understand:

• How the policy complements MPR

• How the thresholds support DIMM management

• How to leverage the cediag tool set

• How to work with a Sun Support specialist

• How the policy interacts with dynamic reconfiguration (DR)

• How to adjust system tunables if needed

Brief Description of MPR

Memory page retirement (MPR) is the process of logically removing a physical eight-kilobyte (8K) page of memory from service so that the retired page is not used by user applications or by the running Solaris instance. A page is retired when the number of correctable errors (CEs) detected on the DIMM that contains that page exceeds a certain threshold within a set time. In Solaris 8 and 9, pages that were retired stay retired until the system is rebooted or memory is replaced using DR. (Systems that do not support DR for memory components will require a shutdown for service to replace the memory in accordance with the policy described in the previous paragraphs.)

Generally, MPR isolates locations from a DIMM that produce at least one “sticky” CE or more than two “persistent” CEs in less than 24 hours. MPR also isolates locations that produce a UE that does not cause a system outage.

Sticky and persistent CEs are two of the three CE types identified and responded to by the Solaris CE-handling code:

• Intermittent: A CE is deemed intermittent if an immediate re-read of the location does not reproduce the CE.

• Persistent: A CE is deemed persistent if an immediate re-read of the location does reproduce the CE, and a rewrite of the location is successful in clearing the CE. A Persistent CE is a likely indication of a “soft error” or a “weak bit.”

• Sticky: A CE is deemed sticky if an immediate re-read of the location does reproduce the CE, and a rewrite of the location is unsuccessful in clearing the CE. A Sticky CE is a likely indication of a hard fail.
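The three definitions above reduce to two yes-or-no checks. The following Python sketch restates them as a small decision function. It is not the Solaris CE-handling code, and the function and parameter names are hypothetical.

```python
def classify_ce(reread_reproduces, rewrite_clears):
    """Classify a CE from the outcome of the re-read and rewrite checks.

    reread_reproduces: True if an immediate re-read shows the CE again.
    rewrite_clears:    True if rewriting the location clears the CE
                       (only meaningful when the re-read reproduced it).
    """
    if not reread_reproduces:
        return "intermittent"
    if rewrite_clears:
        return "persistent"   # likely a soft error or weak bit
    return "sticky"           # likely a hard fail


for reread, rewrite in [(False, False), (True, True), (True, False)]:
    print(reread, rewrite, classify_ce(reread, rewrite))
```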

CEs are, by definition, correctable, and Solaris can handle hundreds of thousands of them a day without difficulty. Thus you might ask, why retire a page containing correctable errors, since the system can handle them? The reasons are simple:

1. Reduce the clutter in (and size of) the /var/adm/messages files. Because MPR will retire a page that produces repeated CEs, additional CE messages from that page will no longer be produced. This helps to prevent the /var/adm/messages files from growing to an unreasonable size. Reducing the clutter in /var/adm/messages also makes it more likely that messages indicating more serious issues will be noticed.

2. Help spot patterns more quickly that may indicate DIMM or system board issues. Because MPR eliminates future CE messages from a retired page, it becomes easier (both for people and for software) to notice CE messages from other locations. This helps with the implementation of Rules 4 and 5.

3. Increase availability. It was noted earlier that the odds of a radiation-induced bit-flip striking any one particular checkword are extremely low. However, should such an event occur to a checkword that repeatedly produces CEs, a two-bit UE is likely to occur, possibly resulting in a system outage. By retiring the memory page containing the exposed checkword, MPR prevents just such an occurrence.

The enhanced DIMM policy works with both versions of MPR: “passive” and “aggressive.” The “passive” version of MPR (Solaris 8 patch level 108528-24 and Solaris 9 patch level 112233-11) immediately retires affected pages, except:

• Pages locked or otherwise unavailable

• Modified pages (that is, pages whose data has not yet been pushed to consistent storage)

In both cases, the pages are marked for retirement when they are no longer in use.

An enhanced version of MPR, “aggressive” MPR (Solaris 8 patch level 117000-03 and Solaris 9 patch level 112233-12), retires affected locked or unavailable pages as soon as they are available. It also retires modified memory pages after preserving the affected page's data (copying the data to a new page).

Note that in Solaris 8 and 9, MPR (passive or aggressive) is not persistent over reboots. That is, any retired pages become immediately available for use whenever Solaris is rebooted, and Solaris will attempt to retire those pages only when CEs are again experienced from those pages. This situation is different in Solaris 10, because the list of retired pages is preserved over reboots. Pages are added to this list as they are retired and removed from the list when the containing memory is removed from the system and/or replaced. Any pages on the list are immediately retired by Solaris 10 upon reboot.

Enhanced-Policy Thresholds

The objective of the MPR algorithm is to keep a system up and available by protecting it from specific sets of symptoms. However, its function does not extend to analyzing the cause or source of the symptoms. This is where the thresholds in Sun's enhanced policy can help.

To be more specific, the CEs that trigger retirement of a page do not always mean that a DIMM is at fault. As explained in the “Background: Memory Errors” section of this paper, sometimes there is another source for the CEs. A CPU could be acting as a "bad reader" (reading CEs that are not there) or as a "bad writer" (writing single-bit errors into memory, causing CEs when that data is later referenced). The page retirement algorithm cannot determine if CEs are due to a faulty memory location or a bad reader or a bad writer. However, it can self-regulate to prevent a bad reader or writer from causing runaway page retirement. An MPR tunable allows for a “stop loss” limit of 0.1 percent of total system memory for memory page retirement. When this limit is reached, the MPR algorithm stops retiring pages. Note that the limit of 0.1 percent was chosen because it is high enough to warrant investigation, yet small enough that overall system performance is not impacted.

If the 0.1 percent limit is reached, you can use Sun's enhanced policy and cediag tool set to begin an analysis of why 0.1 percent of memory has been retired, and to replace any parts determined to be the source (see the next section “Leveraging the cediag Tool Set”).

When the page retirement limit has been reached, one of three possible scenarios can exist, which you can determine from running the cestat tool (part of the cediag tool set):

(a) A single DIMM has been marked as suspect (more than 130 non-intermittent CEs charged against it).

(b) Multiple DIMMs have been marked as suspect.

(c) No DIMMs have been marked as suspect.

If Scenario (a) has occurred, the affected DIMM is likely responsible for most, if not all, of the retired pages, and it should be replaced.

If Scenario (b) has occurred, it is possible that a system board or some other component is acting as a bad reader or bad writer, and the DIMMs are just innocent victims of the errant board or component. (Note that the errant system board or component can cause CEs to show up as originating from DIMMs on other system boards, too.) Or it is possible that there really is more than one problematic DIMM in the system. To determine the best course of action with this scenario, you should call your Sun Support specialist. (It is beyond the scope of this paper to describe the techniques used by Sun Support specialists to determine the cause of the CEs in this case, and thus which part or parts should be replaced.)

Occurrences of Scenario (c) are unlikely, but the scenario can occur in small-memory systems or can occur if the zerocecnt tool (part of the cediag tool set) has been run after DR was used to replace a system board that had been responsible for creating CEs on other system boards. Again, to determine the best course of action with this scenario, you should call your Sun Support specialist.
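The three scenarios can be summarized as a simple decision function. The following Python sketch is only a restatement of the text above. The function name rule5_action and its inputs are hypothetical, and the sketch does not replace the cediag tool set or the judgment of a Sun Support specialist.

```python
def rule5_action(limit_reached, suspect_dimms):
    """Map the page-retirement state to the action the policy describes.

    limit_reached: True if the page retirement limit has been reached.
    suspect_dimms: list of DIMM names marked as suspect.
    """
    if not limit_reached:
        return "no Rule 5 action: page retirement limit not reached"
    if len(suspect_dimms) == 1:
        return f"scenario (a): replace DIMM {suspect_dimms[0]}"
    if len(suspect_dimms) > 1:
        return "scenario (b): contact Sun Support to rule out other causes"
    return "scenario (c): contact Sun Support for assistance"


print(rule5_action(True, ["Slot D: J3000"]))
```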

Because the threshold of 130 non-intermittent CEs plays an important role in all three scenarios, you should understand the logic behind it. Note that a DRAM chip contains 128 columns, so a completely “bad row” in a DRAM chip would produce 128 separate CE-containing pages. Because it could take at least 130 persistent CEs to retire 128 memory pages (three to retire the first page, and then one each to retire each of the next 127 pages), the threshold for non-intermittent CEs on a single DIMM (on systems running MPR) is set at 130.
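The arithmetic behind the threshold can be written out explicitly. The following Python sketch uses only the figures given in the paragraph above; the constant names are illustrative.

```python
PAGES_IN_BAD_ROW = 128   # one CE-containing page per DRAM column
SOFTERR_LIMIT = 2        # default N: retirement needs more than N persistent CEs

ces_for_first_page = SOFTERR_LIMIT + 1          # three CEs retire the first page
ces_for_remaining_pages = PAGES_IN_BAD_ROW - 1  # one CE each for the next 127 pages

threshold = ces_for_first_page + ces_for_remaining_pages
print(threshold)  # 130
```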

Leveraging the cediag Tool Set

The cediag tool identifies and displays all DIMMs that need to be replaced, based upon application of the enhanced DIMM policy rules listed at the beginning of this paper. When cediag recommends a DIMM be replaced, it also displays an indication of urgency based upon the rule that called for the DIMM's replacement.

The cediag tool has two modes of operation, “live” and “offline.” In the “live” mode, cediag applies the rules of the enhanced DIMM policy to the system on which it is running and, with the -L option, places formatted results and recommendations in the /var/adm/messages files on the system. In the "offline" mode, cediag applies the rules of the enhanced DIMM policy either to a directory containing the output of the Sun Explorer Data Collector (see http://sunsolve.sun.com/explorer) or to a collection of messages files copied from a /var/adm/ directory.

Sun recommends that cediag be run on a daily basis to provide status on the health of the system's DIMMs. The ideal method is to insert an entry similar to the following example into root's crontab(1):

0 0 * * * /opt/SUNWcest/bin/cediag -L >/dev/null # SUNWcest

With this entry, cediag runs automatically every night at midnight and places its findings and recommendations in /var/adm/messages.

Following is an example of running cediag in “live” mode and detecting that no DIMMs in the system fail Rule 5 (greater than 0.1 percent of physical memory has been retired and at least one DIMM has been marked as suspect) of the policy:

# cediag
cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: Analysed System: SunOS 5.9 with KUP 117171-05 (MPR active)
cediag: Pages Retired: 0 (0.00%)
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5

In the following example of cediag output, four DIMMs on the system fail Rule 5.

# cediag
cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: Analysed System: SunOS 5.8 with KUP 108528-29 (MPR active)
cediag: Pages Retired: 3070 (0.10%)
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 4 DIMMs with a failure pattern matching rule#5
cediag: findings: DIMM 'Slot D: J3000' matched rule#5 failure pattern
cediag: findings: DIMM 'Slot D: J3200' matched rule#5 failure pattern
cediag: findings: DIMM 'Slot D: J8000' matched rule#5 failure pattern
cediag: findings: DIMM 'Slot D: J8200' matched rule#5 failure pattern
cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs

When cediag is executed out of crontab, the preceding messages will be logged to /var/adm/messages instead of appearing at a terminal. Thus if the messages log contains "advice" messages, the system administrator should follow any advice accompanying those entries to determine the course of action to be taken, per the policy.
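Because the advice entries end up among many other log lines, a small script can help pull them out. The following Python sketch looks for the "cediag: advice:" text shown in the sample output above. The function name find_advice is hypothetical, and the sketch assumes that whatever prefix syslog adds in /var/adm/messages precedes the cediag text without altering it.

```python
import re

# "cediag: advice:LEVEL: text" is the shape shown in the sample output in
# this paper; any leading syslog prefix is ignored (assumption).
ADVICE = re.compile(r"cediag: advice:\s*([A-Z]+):\s*(.*)$")


def find_advice(lines):
    """Return (level, text) pairs for every cediag advice line found."""
    found = []
    for line in lines:
        m = ADVICE.search(line.rstrip("\n"))
        if m:
            found.append((m.group(1), m.group(2)))
    return found


sample_lines = [
    "cediag: findings: 4 DIMMs with a failure pattern matching rule#5\n",
    "cediag: advice:MEDIUM: consult Sun Support to rule out other causes "
    "of CEs before replacing any DIMMs\n",
]
for level, text in find_advice(sample_lines):
    print(level, "-", text)
```

To scan a real log, pass an open file object for a copy of the messages file to find_advice in place of sample_lines.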

A system administrator can also run the cestat tool, independently of the cediag tool, to monitor a system specifically for Rule 5. Examples of output from the cestat tool and the zerocecnt tool are in the “Example Tool Outputs” section of this paper.

Working With a Sun Support Specialist

If the policy (or the cediag tool set) advises you to contact your Sun Support specialist, provide the specialist with the messages files stored in the /var/adm directory, as well as the output of recent executions of the tool. If possible, execute Explorer, and save its output.

Enhanced Policy and DR

Dynamic reconfiguration (DR) can be used to avoid shutting down a domain if a DIMM or system board replacement is merited. System administrators should take note of how DR interacts with MPR if they manage systems on which both are used and should learn how to apply the zerocecnt tool appropriately.

If DR is used to remove a system board on which DIMMs will be replaced, any retired pages contained on that system board will be un-retired when the system board is re-inserted into the system. However, the CE counts for the DIMMs on that system board (as well as the CE counts for the DIMMs on other system boards) will be unchanged. You should use the zerocecnt tool to reset the CE counts for all DIMMs that are re-inserted into the system using DR.

If DR is used to replace a system board that is responsible for CEs reported against DIMMs on other system boards, you must manually set the CE counts for all DIMMs in the system to zero by running the zerocecnt tool. Any retired pages on the other system boards will remain retired, however, and will not be available to the system. Because it is likely that these retired pages were innocent victims of the replaced system board, you might want to recover them and make them again available for use. You can do this in one of the following two ways:

• Dynamically reconfigure each board sequentially out of and then back into the system.

• Reboot the system.

Tunables

The following kernel variables affect the MPR algorithm. They should not be changed without the guidance of a Sun Support specialist.

max_pages_retired_bps The limit on the amount of memory that can be retired, expressed in basis points (100 basis points is 1 percent). The default value is 10 (0.1 percent).

ecc_softerr_limit The N value in the "leaky bucket" SERD algorithm. (See explanatory paragraph below.) The default is 2.

ecc_softerr_interval The T value in the "leaky bucket" SERD algorithm, in minutes. The default is 1440 (24 hours).

The "leaky bucket" SERD algorithm is used to determine when repeated persistent CEs should cause a page to be retired. (A single sticky CE triggers immediate retirement.) If more than N persistent CEs occur on a DIMM in time T, then the page containing the most recent CE is retired. (This is expressed as >N in T.)
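The ">N in T" test can be illustrated with a short Python sketch. This is not the Solaris implementation: the class name SerdEngine and its method are hypothetical, one instance is assumed per DIMM, event times are assumed to arrive in nondecreasing order, and the strict "less than T" comparison is an assumption based on the "in less than 24:00" wording of the warning message.

```python
from collections import deque


class SerdEngine:
    """Fire when more than n events fall within a window of t_minutes."""

    def __init__(self, n=2, t_minutes=1440):
        self.t_minutes = t_minutes
        # Only the n + 1 most recent event times matter for ">N in T".
        self.recent = deque(maxlen=n + 1)

    def record(self, time_minutes):
        """Record one persistent CE; return True if the threshold is exceeded."""
        self.recent.append(time_minutes)
        if len(self.recent) < self.recent.maxlen:
            return False
        # The n + 1 most recent events must span less than the window.
        return self.recent[-1] - self.recent[0] < self.t_minutes


engine = SerdEngine()  # defaults match ecc_softerr_limit=2, ecc_softerr_interval=1440
for t in (0, 600, 1200, 2000, 5000):
    print(t, engine.record(t))
```

With the default values, the third CE at minute 1200 exceeds the threshold, as does the fourth at minute 2000 (the three most recent CEs span 1400 minutes). The CE at minute 5000 does not, because the three most recent CEs are then spread over more than 24 hours.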

Example Tool Outputs

Following is example output from the cestat tool:

Page Retirement Statistics:
                             number    percent

Total pages                 1384150    100.00%
Retirement limit               1384      0.10%
Pages currently retired           4      0.00%

Page retirement limit has not been reached.

Correctable Error counts by DIMM:
DIMM Location/Name      Int  Per  Stk  Total  P+S  Suspect?
SB0/P0/B0/D3 J13600       0    3    0      3    3  no
SB0/P1/B0/D3 J14600       0    3    0      3    3  no
SB0/P2/B0/D3 J15600       0    5    0      5    5  no
SB0/P3/B0/D3 J16600       0    4    0      4    4  no

Total number of Suspect DIMMs: 0

In this example, four pages have been retired, and the default limit of 0.1 percent has not been reached. The CE count by DIMM is summarized by category (Intermittent, Persistent, and Sticky). DIMMs that have Persistent + Sticky (P+S) CE counts greater than 130 are labeled “yes” under Suspect?; those with P+S counts equal to or less than 130 are labeled “no.” If the page retirement limit of 0.1 percent has been reached, any DIMMs labeled “yes” under Suspect? should be investigated further and are candidates for replacement, with the DIMM or DIMMs showing the highest P+S counts being the most likely candidates. Again, contact your Sun Support specialist for a final determination.
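The figures in the cestat example can be reproduced from the tunable described earlier. The following Python sketch is illustrative only: the function names are hypothetical, and rounding the limit down to a whole page is an assumption that happens to agree with the example output (0.1 percent of 1384150 pages shown as 1384).

```python
def retirement_limit_pages(total_pages, max_pages_retired_bps=10):
    """Pages that may be retired; 10,000 basis points equal 100 percent.

    Rounding down to a whole page is an assumption.
    """
    return total_pages * max_pages_retired_bps // 10_000


def is_suspect(persistent, sticky, threshold=130):
    """Apply the 'P+S greater than 130' labeling described above."""
    return persistent + sticky > threshold


print(retirement_limit_pages(1384150))   # 1384
print(is_suspect(persistent=5, sticky=0))
```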

The following is an example from running the zerocecnt tool:

# ./zerocecnt SB0/P2/-
Correctable Error counts by DIMM:
DIMM Location/Name      Int  Per  Stk  Total  P+S  Suspect?
SB0/P2/B0/D3 J15600       0    5    0      5    5  no

Clear these dimm counts? (y/n) y

SB0/P2/B0/D3 J15600       0    0    0      0    0  no

Policy for Systems Running Older Solaris OS Versions

For systems running older releases of the Solaris OS or at patch levels earlier than Solaris 8 KU108528-24 or Solaris 9 KU112233-11, replace the DIMM when one of the following occurs:

• POST fails the DIMM.

• Solaris reports a UE or DUE, and investigation shows that the event truly originated from memory.

• Solaris reports two or more CEs from two or more different physical addresses on each of two or more different bit positions from the same DIMM within 24 hours of each other, and all the addresses are in the same relative checkword.

• Solaris reports more than 24 non-intermittent CEs in 24 hours from a single DIMM (Rule 6 of the policy).

To implement this policy in your data center, you should understand:

• How the thresholds support DIMM management

• How to leverage the cediag tool set

• How to work with a Sun Support specialist

Enhanced-Policy Thresholds

The occurrence of a large number of CEs in a 24-hour period indicates that something other than a radiation event is happening, because such a pattern cannot easily be explained by natural phenomena. In such situations, how should a CE threshold be set for systems not running MPR so that unnecessary service action is avoided, but necessary action is not unduly delayed? The results of the Sun internal study mentioned in a previous section of this paper helped to define a threshold for such circumstances. Because of the robust and efficient behavior of the Solaris CE-handling code, the occurrence of random CEs by themselves is not a cause for concern with respect to system reliability or performance. So the threshold for CEs can be increased without exposing the system to increased risk. The “24 non-intermittent CEs in 24 hours” threshold was selected to have a maximum impact on reduction of DIMM service actions. In the study, approximately 32 percent of the DIMMs displaying CEs were below this threshold. The remaining 68 percent of the DIMMs displaying CEs were distributed over a very broad range (25 to 200,000). Therefore, increasing the threshold in minor increments above 24 would have a minimal impact on service action.

Leveraging the cediag Tool Set

As explained in a previous section of this paper, CEs on systems running older versions of the Solaris OS are logged in the /var/adm/messages files. Because processing these CEs by hand to determine if the "more than 24 non-intermittent CEs in 24 hours" threshold has been reached can be cumbersome, the cediag tool set is designed to do this processing for you. As stated earlier in this paper, the ideal situation is to insert a root crontab entry so that the cediag tool runs on a system once daily.

Following is an example of running cediag in “live” mode and detecting that no DIMMs in the system fail Rule 6 (greater than 24 non-intermittent CEs in 24 hours) of the policy:

# cediag
cediag: please install a [M]emory [P]age [R]etirement Kernel Update Patch
cediag: SunOS 5.8 requires 108528-24 (or higher) installed for MPR support
cediag: please install 'SUNWcest' package to allow access to 'cestat' data
cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: Analysed System: SunOS 5.8 with KUP 108528-22 (MPR unavailable)
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#6

In the following example of cediag output, two DIMMs in the system fail Rule 6.

# cediag
cediag: please install a [M]emory [P]age [R]etirement Kernel Update Patch
cediag: SunOS 5.8 requires 108528-24 (or higher) installed for MPR support
cediag: please install 'SUNWcest' package to allow access to 'cestat' data
cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: Analysed System: SunOS 5.8 with KUP 108528-22 (MPR unavailable)
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 2 DIMMs with a failure pattern matching rule#6
cediag: findings: DIMM 'C3/P0/B0/D1: J0602' matched rule#6 (24 in 24) failure pattern
cediag: findings: DIMM 'C3/P0/B1/D1: J0702' matched rule#6 (24 in 24) failure pattern
cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs

As with the earlier examples from systems with MPR, when cediag is executed out of crontab, the preceding messages will be logged to /var/adm/messages instead of appearing at a terminal. Thus if the messages log contains "advice" messages, the system administrator should follow any advice accompanying those entries to determine the course of action to be taken, per the policy.

The only tunable in the cediag tool is the Rule 6 CE threshold. The default setting is 24 non-intermittent CEs in 24 hours. If a DIMM exceeds this threshold, the cediag tool identifies the DIMM as showing a failure pattern that matches Rule 6. If more than one DIMM exceeds this threshold, a Sun Support specialist should be consulted. This tunable should not be changed without the guidance of a Sun Support specialist.
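To illustrate what the Rule 6 threshold means, the following Python sketch counts non-intermittent CEs per DIMM in a sliding 24-hour window. It is an illustration of the threshold only and not a description of how cediag is implemented. The function name rule6_dimms and the event tuple layout are hypothetical, and events are assumed to be sorted by time.

```python
from collections import defaultdict, deque


def rule6_dimms(events, threshold=24, window_hours=24.0):
    """Return DIMMs with more than `threshold` non-intermittent CEs in any window.

    `events` is an iterable of (time_in_hours, dimm_name, ce_type) tuples,
    sorted by time; ce_type is "intermittent", "persistent", or "sticky".
    """
    windows = defaultdict(deque)
    matched = set()
    for t, dimm, ce_type in events:
        if ce_type == "intermittent":
            continue
        w = windows[dimm]
        w.append(t)
        # Drop events that fall outside the window ending at time t.
        while t - w[0] >= window_hours:
            w.popleft()
        if len(w) > threshold:
            matched.add(dimm)
    return sorted(matched)


# 25 persistent CEs in 12 hours on one DIMM, 24 on another.
events = sorted(
    [(i * 0.5, "J0602", "persistent") for i in range(25)]
    + [(i * 0.5, "J0702", "persistent") for i in range(24)]
)
print(rule6_dimms(events))
```

In this example only J0602 exceeds the threshold, because J0702 shows exactly 24 CEs and the rule requires more than 24.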

Working With a Sun Support Specialist

The system administrator should watch for an abnormal rate of CEs coming from an individual DIMM. (Note: A high rate of CEs distributed across multiple DIMMs could be indicative of a bad reader or writer situation in which other components in the system are causing CEs; contact your Sun Support specialist for a specific determination). If the “greater than 24 non-intermittent CEs in 24 hours” threshold (Rule 6) is triggered for a single DIMM, then the DIMM should be scheduled for replacement at the most convenient time.

If the enhanced policy or the cediag tool set advises you to contact your Sun Support specialist, provide the specialist with the messages files stored in the /var/adm directory, as well as the output of recent executions of the cediag tool. If possible, execute Explorer, and save its output.

About the Author: Steve Chessin

Steve Chessin is a Sun Distinguished Engineer in the corporate-level Customer Advocacy group. Steve holds a Ph.D. in physics from the University of California (UC) at Berkeley. His area of expertise is the hardware-software interface from the software side, especially in relation to software handling of hardware errors. He has worked at Sun since 1988.

About the Author: Charlie Slayman

Charlie Slayman is a Memory Technology Engineer in Sun's Memory Technology Group (MTG), with responsibilities for memory component and module reliability, device physics, and soft error rates. He has more than 20 years' experience in fiber optic, integrated optic, and semiconductor technology. His background ranges from device and process development to foundry operations.

About the Author: Darin Carlson

Darin Carlson has been with Sun for 8 years and has more than 16 years' experience in the UNIX® industry. Currently, he works in Sun's Product Technical Support (PTS) Strategic Solutions Group. Before joining PTS, Darin worked as an NSSE/CSSE for four years, focusing on servers and operating systems. Prior to working for Enterprise Services, Darin worked for the SunSoft Alliance Plus℠ program, providing technical account management to the SunSoft OEM development partners.

Acknowledgments

The authors would like to recognize the following individuals for their contributions to this paper:

• Ron Melanson, Chief Quality Officer, Scalable Systems Group (SSG)

• Zuheir Totari, SSG Quality Office

• Debra Kahn, Sun Quality Communications Office, Customer Advocacy

• Tom Chalfant, Product Technical Support

• Tom Krehel, Product Technical Support Serviceability Engineering

• Steve Wiley, Product Technical Support Serviceability Engineering

• Jamie Riggs, Product Technical Support Serviceability Engineering

• Kenneth Gibbons, Memory Supply Engineering

• Douglas Baker, Product Technical Support

• Fred Cerauskis, SSG Quality Office

• David Jeffrey, Memory Technology Group

• William Heavlin, Sun's Customer Advocates for Reliability (SunCARE)

• Kumar Loganathan, Product Technical Support

• David Savard, Product Technical Support Serviceability Engineering

References

Chalfant, Thomas M. “Solaris Operating System Availability Features,” Sun BluePrints™ Online, May 2004. To access this paper online, go to http://www.sun.com/blueprints/0504/817-7039.pdf.

Shapiro, Michael. “Predictive Self-Healing in the Solaris 10 Operating System: A Technical Introduction,” June 2004. To access this paper online, go to http://www.sun.com/bigadmin/content/selfheal/selfheal_overview.pdf.

Storm, Shawn; Neil Duncan; Tay Wong; Charlie Slayman; and Itir Clarke. “Sun UltraSPARC III Memory for RAS,” November 2002. To obtain a PDF file of this white paper, please contact a Sun representative.

Ordering Sun Documents

The SunDocs℠ program provides more than 250 manuals from Sun Microsystems, Inc. If you live in the United States, Canada, Europe, or Japan, you can purchase documentation sets or individual manuals through this program.

Accessing Sun Documentation Online

The docs.sun.com web site enables you to access Sun technical documentation online. You can browse the docs.sun.com archive or search for a specific book title or subject. The URL is http://docs.sun.com/

To reference Sun BluePrints OnLine articles, visit the Sun BluePrints OnLine Web site at: http://www.sun.com/blueprints/online.html
