efficient scrub mechanisms for error-prone emerging memories

25
Efficient Scrub Mechanisms for Error- Prone Emerging Memories Manu Awasthi ǂ , Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran , Viji Srinivasan ǂ Micron, University of Utah, IBM Research

Upload: beate

Post on 24-Feb-2016

87 views

Category:

Documents


0 download

DESCRIPTION

Efficient Scrub Mechanisms for Error-Prone Emerging Memories. Manu Awasthi ǂ , Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran ‡ , Viji Srinivasan ‡ ǂ Micron, ⁺ University of Utah, ‡ IBM Research. Executive Summary. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

Efficient Scrub Mechanisms for Error-Prone Emerging Memories

Manu Awasthiǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran‡, Viji

Srinivasan‡

ǂMicron, ⁺University of Utah, ‡IBM Research

Page 2: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

2

Executive Summary

• Future memory hierarchies will likely include NVMs– Focus of this work: Phase Change Memory (PCM)

• Multi level cells in PCM appear imminent• A number of proposals exist to handle hard errors

and lifetime issues of PCM devices• Resistance Drift is a lesser explored phenomenon– Will become primary cause of “soft errors” -- Need to

explore holistic solutions to counter drift

Page 3: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

3

Phase Change Memory - MLC

• Chalcogenide material can exist in crystalline or amorphous states

• The material can also be programmed into intermediate states– Leads to many intermediate states, paving way for Multi

Level Cells (MLCs)

Resistance AmorphousCrystalline

(11) (00)(10) (01)

(111) (000)(101) (011) (010) (001) (110) (100)

Page 4: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

4

What is Resistance Drift?

Resistance

Time11 10 01 00

A

B

ERROR!!

T0

Tn

Page 5: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

5

Resistance Drift - Issues

• Programmed resistance drifts according to power law equation -

• R0, α usually follow a Gaussian distribution• Time to drift (error) depends on – Programmed resistance (R0), and – Drift Coefficient (α)– Is highly unpredictable!!

Rdrift(t) = R0 х (t)α

Page 6: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

6

Resistance Drift - How it happens

11 10 01 00

Median case cell• Typical R0• Typical α

Worst case cell• High R0• High α

Scrub rate will be dictated by the Worst Case R0 and Worst Case α

Drift Drift

R0 R0Rt Rt

ERROR!!

Number of Cells

Page 7: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

7

Resistance Drift DataCell Type Drift Time at Room

temperature (secs)Median 11 cell 10499

Worst 11 Case cell 1015

Median 10 cell 1024

Worst Case 10 cell 5.94

Median 01 cell 108

Worst Case 01 cell 1.81

(11) (00)(10) (01)

Page 8: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

8

Uncorrectable Errors

Page 9: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

9

Problem Severity

Page 10: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

10

Inadequacy of Device Level Solutions

• Precise writes –> write – compare – write– Can be effective, requires multiple iterations– Increases write time, total write energy, decreases

lifetime• Correcting values during read not effective• Coding techniques– Can help, but cannot eliminate the problem by

themselves

Page 11: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

11

Naïve Solution : Scrubbing

• Drift resets with every cell reprogram (write)• Leverage existing error correction mechanisms,

e.g., ECC - has its own drawbacks• A Full Refresh/Scrub (read-compare-write) is

extremely costly in PCM– Each PCM write takes 100 - 1000ns– Writing to a 2-bit cell may consume as much as 1.6nJ– Requires 600 refreshes in parallel

Refresh should be reactionary NOT precautionary!

Page 12: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

12

Solution Outline

• A three pronged approach

– Incorporate stronger ECC codes

– Incorporate low-cost error detection mechanisms

– Adapt scrub algorithms that are conservative and adaptive

Page 13: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

13

Additional ECC support

• With stronger ECC support, scrub operations can be made infrequent– If E errors can be corrected and E+1 detected, then

scrub operation can be postponed till the Eth error– Can provide support for hard error tolerance as well

• We advocate ECC-8 (E-8)– Correct eight errors, detect nine– Leads to 14.25% storage overhead, for 512 bits

Page 14: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

14

• Light Array Reads for Drift Detection– Support for E Error-correcting, E+1

error detecting codes assumed– Lines are read periodically and

checked for correctness– Only after the number of errors

reaches a threshold, scrubbing is performed

– Drift detection has to be simple, on-chip

LARDD

Read Line

Check for Errors

Errors < E

Scrub Line

True

False

After N cycles

Page 15: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

15

Read Pipeline with LARDD & Parity

• Simple error detection mechanism to bypass complex BCH circuit – 256 cell line divided up into eight 32-cell fields– Count number of drift prone states, store as parity

information– For every LARDD, check if parity has changed– If true, invoke BCH circuitry

1515(11) (00)(10) (01)

Page 16: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

16

Headroom

• Headroom-h scheme – scrub is triggered if E-h errors are detected

† Decreases probability of errors slipping through

– Increases frequency of full scrub and hence decreases life time

• Presents trade-off between Hard and Soft errors

Read Line

Check for Errors

Errors < E-h

Scrub Line

True

False

After N cycles

Page 17: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

17

Headroom Gradual

• Start with a small LARDD frequency

• Increase frequency as drift-based errors increase– Decreases energy

overhead– Might cause slight

increase in error rate

Read Line

Check for Errors

Errors <= E-h-g True

False

After N cycles

Double LARDD rate

Page 18: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

18

Adaptive Scrub

• Dynamic events can affect reliability– Temperature increases can increase α and decrease drift

time– Cell lifetime/wearout is also an issue– Soft error rate depends on prevalence of drift prone states– Hard errors show up with aging cells

• Start with headroom h policy, get rid of headroom when at least E-h-1 hard errors

• Can increase LARDD frequency when total number of hard errors increases preset threshold

Page 19: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

19

Results

Stronger ECC + LARDD support leads to decrease in unrecoverable errors

Page 20: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

20

Results - Headroom Schemes

• Compared to ECC-1, for 512 s LARDD interval– ECC-8-headroom-3 leads to 99.6% reduction in error rate, 35.4%

reduction in scrub energy– ECC-8-headroom-3-gradual-1 leads to 96.5% reduction in error rate,

37.8% reduction in scrub energy

Page 21: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

21

Device Level Solution – Non Uniform Banding

Resistance

Before

After

Mean R0

11 10 01 00

Page 22: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

22

Comparing with Device Level Solutions

Baselin

e ECC-1 (2

.75 σ)

ECC-1 +

Precise W

rite (1

.375 σ)

ECC-1 +

Non-uniform

banding

ECC-8 (2

.75 σ)

ECC-8-head

room-3 (2.75 σ)

0E+00

1E+05

2E+05

Number of Uncorrectable Errors

ECC-8-headroom-3 provides for least number of uncorrectable errors

Page 23: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

23

Conclusions

• Resistance drift will exacerbate with MLC scaling• Naïve solutions based on ECC support are costly for

PCM– Increased write energy, decreased lifetimes

• Holistic solutions need to be explored to counter drift at device, architectural and system levels– 39% reduction in energy, 4x less errors, 102x increase in

lifetime

Page 24: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

24

Thanks!!

www.cs.utah.edu/arch-research

Page 25: Efficient Scrub Mechanisms for Error-Prone Emerging Memories

25

Backup Slides