
HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK

Dependable Systems

Lecture 6

FAULT-TOLERANT AND FAULT-SECURE MEMORIES

Winter Semester 2002/03

Instructor: Prof. Dr. Miroslaw Malek

www.informatik.hu-berlin.de/~rok/ftc


Fault-tolerant and Fault-secure Memories

• Objectives:
  – To study techniques of fault-tolerant and fault-secure memory design used in memory manufacturing and applications

• Contents:
  – Fault-tolerant techniques in manufacturing
  – Replication
  – Codes
  – Reconfiguration


Fault-tolerant Technique in Memory Manufacturing

(Overhead From 2% to 10%)

• Depending on the expected failure density, a number of additional rows and/or columns are included on the chip.

• Polysilicon fuses in the decoding circuitry are selectively blown to allow addressing of the spare rows and columns.

• Two methods exist for blowing fuses:
  – By focusing a laser on a given fuse for about one second
  – By applying a 10 to 50 volt signal across a highly resistive fuse

• With the rapidly increasing chip densities, the use of redundancy is standard among memory manufacturers


Fault-tolerant Memories (Overhead From 2% to 200%)

• Identical copies of memory are used to mask erroneous results

• Replication is usually implemented at the module level to minimize the number of voters needed to determine the correct output, and may consist of static or dynamic redundancy, or a combination of both.
  – Duplex
  – Half-duplex (two halves of memory are encoded into a third half residing in a back-up module such that the original data may be recovered if one of the three modules fails)
  – N-modular redundancy (usually triple modular redundancy)

• Additional hardware includes:
  – Memory units voter
  – Or disagreement detector
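
The voter and the disagreement detector for triple modular redundancy can be sketched in a few lines. This is only an illustration: the function names `majority_vote` and `disagreement` and the 8-bit sample word are assumptions made for the example, not part of the lecture.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of the words read from three memory modules."""
    return (a & b) | (a & c) | (b & c)


def disagreement(a: int, b: int, c: int) -> bool:
    """True when at least one module returns a word that differs from the others."""
    return not (a == b == c)


good = 0b10110010
faulty = good ^ 0b00001000          # one module returns a word with one flipped bit
voted = majority_vote(good, faulty, good)
print(f"{voted:08b}", disagreement(good, faulty, good))
```

The voter masks the error without identifying the faulty module; the disagreement detector only signals that the copies differ, which is what a duplex arrangement can offer on its own.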


Fault-tolerant Memories (Continued)

Exemplary systems with replicated memories include:

• STAR
  – (Self-testing and self-repairing computer)
• FTMP
  – (Fault-tolerant multiprocessor)
• SIFT
  – (Software-implemented fault tolerance computer)
• COMTRAC
  – (Computer-aided traffic control system)
• (4,2) concept
  – (Communication controller with four processors and duplicated memory from Philips)
• Stratus
  – (Commercial fault-tolerant system)
• 3B20 from AT&T
  – (Commercial fault-tolerant system)


Memory Codes: Parity Codes

• Even parity

• Odd parity
  – (Better coverage, since all-0s or all-1s errors can be detected in any word with an even number of bits)

• Byte parity
  – (A parity bit is appended to every 7 or 8 bits)

• Interlaced parity

• Chip-wide parity

• Two-dimensional parity
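
The remark about odd parity can be checked with a small sketch. It assumes an 8-bit stored word (7 data bits plus 1 parity bit, so an even number of bits in total); the helper names `parity_bit`, `encode` and `is_valid` are hypothetical.

```python
def ones(word: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(word).count("1")


def parity_bit(data: int, odd: bool) -> int:
    """Parity bit that makes the total number of 1s odd (odd=True) or even."""
    bit = ones(data) & 1            # 1 if the data already holds an odd number of 1s
    return bit ^ 1 if odd else bit


def encode(data: int, odd: bool) -> int:
    """Append the parity bit as the least significant bit."""
    return (data << 1) | parity_bit(data, odd)


def is_valid(codeword: int, odd: bool) -> bool:
    return (ones(codeword) & 1) == (1 if odd else 0)


WIDTH = 8                            # 7 data bits + 1 parity bit
all_zeros = 0
all_ones = (1 << WIDTH) - 1

for name, word in (("all 0s", all_zeros), ("all 1s", all_ones)):
    print(name, "passes even parity:", is_valid(word, odd=False),
          "| passes odd parity:", is_valid(word, odd=True))
```

With an even word length, both the all-0s and the all-1s pattern contain an even number of 1s, so they pass an even-parity check unnoticed but fail an odd-parity check.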


Chip-wide Parity Method

[Figure: chip-wide parity. The words in chip 1, chip 2, ..., chip m (b bits each) feed the parity logic; the corresponding word in the parity chip is m bits wide.]


Two-dimensional Parity Method

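[Figure: two-dimensional parity. A block of k words with n bits per word is protected by a row parity register, a column parity register and an overall parity check bit. Each row and each column answers the question "Parity error?"; in the example, exactly one column and one row answer "Yes", and the erroneous bit lies at their intersection.]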
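
A minimal sketch of how row and column parity together locate a single-bit error. The block contents, the use of even parity and the function names are assumptions chosen for illustration; the overall parity check bit is omitted for brevity.

```python
from functools import reduce
from operator import xor


def row_parities(block):
    """Even-parity bit of every word (row)."""
    return [reduce(xor, row) for row in block]


def column_parities(block):
    """Even-parity bit of every bit position (column)."""
    return [reduce(xor, col) for col in zip(*block)]


def locate_single_error(block, stored_rows, stored_cols):
    """Return (row, column) of a single flipped data bit, or None if all checks pass."""
    bad_rows = [i for i, (p, s) in enumerate(zip(row_parities(block), stored_rows)) if p != s]
    bad_cols = [j for j, (p, s) in enumerate(zip(column_parities(block), stored_cols)) if p != s]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data-bit error")


# k = 4 words of n = 5 bits (illustrative values)
block = [
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0],
]
stored_rows = row_parities(block)       # written when the data is stored
stored_cols = column_parities(block)

corrupted = [row[:] for row in block]
corrupted[2][2] ^= 1                    # inject a single-bit error
location = locate_single_error(corrupted, stored_rows, stored_cols)
print(location)

row, col = location
corrupted[row][col] ^= 1                # correct the located bit
```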


Hamming Codes

• Hamming codes provide error detection as well as error correction in a b-bit long word. About log2(b) + 1 check bits are generated (the smallest r with 2^r ≥ b + r + 1), whose values identify the erroneous bit if a single-bit error occurs.

• As an example, a (7, 4) single-error-correcting Hamming code is shown. There are a total of seven bits, four of which are data bits.

• Even though the code requires 15 - 70% additional hardware and results in degraded memory speed (due to encoding and decoding of the check bits), it often increases the mean time between failures (MTBF) of the memory by orders of magnitude, a tradeoff which is often accepted.

• Hamming codes may be extended to provide k-error correction and 2k-error detection, but such modifications require even greater hardware and software overheads.
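
The number of check bits a single-error-correcting Hamming code needs follows from the condition that r check bits must distinguish "no error" plus every one of the b + r bit positions, i.e. 2^r ≥ b + r + 1. A small sketch (the function name is hypothetical):

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


for b in (4, 32, 64):
    sec = hamming_check_bits(b)
    # One extra overall parity bit turns SEC into SEC-DED.
    print(f"{b:2d} data bits: {sec} check bits for SEC, {sec + 1} for SEC-DED")
```

For 4 data bits this gives the 3 check bits of the (7, 4) code; with the extra overall parity bit it gives 7 check bits for a 32-bit word and 8 for a 64-bit word.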


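Single-error Correction Example for a (7, 4) Hamming Code (Bit d3 Is in Error)

Data bits are d1, d2, d3, d4

Check bits are c1, c2, c3

Equations used for syndrome generation:

$$
\begin{aligned}
s_3 &= d_1 \oplus d_2 \oplus d_4 \oplus c_1 \\
s_2 &= d_1 \oplus d_3 \oplus d_4 \oplus c_2 \\
s_1 &= d_2 \oplus d_3 \oplus d_4 \oplus c_3
\end{aligned}
$$

Parity-check matrix (PCM) applied to the data word from memory (bit order c1 c2 d1 c3 d2 d3 d4 = 1 1 1 0 0 1 0):

$$
\begin{pmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}
=
\begin{pmatrix} s_3 \\ s_2 \\ s_1 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}
\pmod 2
$$

Syndrome: s1 s2 s3 = 1 1 0. Read as the binary number 110 = 6, it points to the sixth bit of the word, d3.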
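
The syndrome computation of this example can be reproduced with a short sketch. It uses the parity-check matrix and the bit order c1 c2 d1 c3 d2 d3 d4 from the slide; the function names `syndrome` and `correct` are made up for the illustration.

```python
# Parity-check matrix from the slide; columns are c1 c2 d1 c3 d2 d3 d4.
H = [
    [1, 0, 1, 0, 1, 0, 1],   # s3
    [0, 1, 1, 0, 0, 1, 1],   # s2
    [0, 0, 0, 1, 1, 1, 1],   # s1
]


def syndrome(word):
    """Return the syndrome as a number: s1 is the most significant bit, s3 the least."""
    s3, s2, s1 = (sum(h * w for h, w in zip(row, word)) % 2 for row in H)
    return 4 * s1 + 2 * s2 + s3


def correct(word):
    """Flip the bit the syndrome points at (positions are 1-based; 0 means no error)."""
    pos = syndrome(word)
    fixed = list(word)
    if pos:
        fixed[pos - 1] ^= 1
    return fixed, pos


received = [1, 1, 1, 0, 0, 1, 0]      # data word from memory, d3 in error
fixed, pos = correct(received)
print(pos, fixed)                     # 6 [1, 1, 1, 0, 0, 0, 0]
```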


SEC-DED Memory Design

32-bit error detection and correction unit

• Corrects all single-bit errors
• Detects all double errors
• Detects some triple errors
• Detection in 32 ns, correction in 64 ns
• 7 check bits for a 32-bit word via a modified Hamming code
• May also work on 8-bit bytes
• Built-in diagnostics
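
How single-error correction and double-error detection are told apart can be sketched with a much smaller code than the 32-bit unit described above. The example below is an assumption for illustration only: it extends the (7, 4) Hamming code with one overall parity bit, giving an (8, 4) code, and does not model the modified Hamming code of the actual unit.

```python
def encode(d1, d2, d3, d4):
    """(7, 4) Hamming word in the order c1 c2 d1 c3 d2 d3 d4, plus an overall even-parity bit."""
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    word = [c1, c2, d1, c3, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(word):
    """Return (status, data) with status 'ok', 'corrected' or 'double'."""
    c1, c2, d1, c3, d2, d3, d4, _p = word
    s3 = c1 ^ d1 ^ d2 ^ d4
    s2 = c2 ^ d1 ^ d3 ^ d4
    s1 = c3 ^ d2 ^ d3 ^ d4
    pos = 4 * s1 + 2 * s2 + s3          # 1-based position inside the 7-bit Hamming word
    overall = 0
    for bit in word:
        overall ^= bit                  # 0 when the overall parity still holds
    if pos == 0 and overall == 0:
        return "ok", [d1, d2, d3, d4]
    if overall == 1:
        # Odd number of flipped bits: treated as a single error and corrected.
        fixed = list(word)
        fixed[pos - 1 if pos else 7] ^= 1   # pos == 0 means the overall parity bit itself
        return "corrected", [fixed[2], fixed[4], fixed[5], fixed[6]]
    # Non-zero syndrome with intact overall parity: two bits are wrong.
    return "double", None


clean = encode(1, 0, 0, 0)
single = list(clean)
single[5] ^= 1
double = list(single)
double[1] ^= 1
print(decode(clean), decode(single), decode(double))
```

A double error is reported rather than corrected; in this small sketch a triple error would be mistaken for a single one.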


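Block Diagram of Memory System

[Figure: the system data bus (32 bits) connects through bus buffers to the error detection and correction (EDC) unit and to the dynamic RAM, which consists of an N x 32 memory array (32 data bits) and an N x 7 check array (7 check bits), together with the dynamic RAM control.]

WRITE: DATA BUS → BUFFERS → EDC → BUFFERS → MEMORY ARRAY

READ: MEMORY ARRAY → BUFFERS → EDC → BUFFERS → DATA BUS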


EDC Unit Operation

Configuration: 32-Bit Memory Array/Data Bus, 7-Bit Check Array

Memory Read Cycle 

1. Data read from memory array to buffers and from check array to check-bit inputs 

2. EDC unit gets data from buffers 

3. EDC unit computes check bits and syndrome 

4. On non-zero syndrome, error(s) are indicated via error or multierror lines and bit correction occurs (1-bit error) 

5. EDC unit passes (corrected) data to buffers and then to data bus

Memory Write Cycle 

1. Data from data bus via buffers to EDC unit 

2. Check bits are computed 

3. Data from EDC unit via buffers to memory and check bits from EDC unit to check-bit memory array 

In a 2 Mbyte memory, the MTBF improved from 95 h to 15,000 h

Up to 35% increase in cost relative to the cost of 16K memory

Up to 40% increase in power consumption

Parity + Complement Method for Error Correction


EDC Unit Operation (Continued)

Step        Word                 Comment
1st write   1 1 0 1 0 0 1 1 0    Original data
1st read    1 1 0 1 0 1 1 1 0    PE (parity error)
D → D̄       0 0 1 0 1 0 0 0 1    Data complement
2nd write   0 0 1 0 1 0 0 0 1    Complemented data
2nd read    0 0 1 0 1 1 0 0 1    PE (parity error)
D → D̄       1 1 0 1 0 0 1 1 0    Data complement (correct data)

Hard error location: the sixth bit from the left, which reads 1 in both the 1st and the 2nd read.

This double complement method in combination with an ECC system can correct additional errors, e.g., National Semiconductor DP8400 chip (detects 100% of 2-bit errors and both errors are correctable if no more than one of them is soft)
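
The write/read/complement sequence of the example can be simulated with a small sketch. It assumes 9-bit words with odd parity (the example's original data contains five 1s) and models a hard error as a stuck-at bit; the class `StuckAtWord` and the function `read_with_complement_retry` are hypothetical names.

```python
WIDTH = 9                         # word width used in the example
MASK = (1 << WIDTH) - 1


def has_odd_parity(word: int) -> bool:
    return bin(word).count("1") % 2 == 1


class StuckAtWord:
    """One memory word in which the bits selected by stuck_mask are permanently stuck."""

    def __init__(self, stuck_mask: int = 0, stuck_values: int = 0):
        self.stuck_mask = stuck_mask
        self.stuck_values = stuck_values & stuck_mask
        self._stored = 0

    def write(self, word: int) -> None:
        self._stored = word & MASK

    def read(self) -> int:
        return (self._stored & ~self.stuck_mask & MASK) | self.stuck_values


def read_with_complement_retry(cell: StuckAtWord):
    """Return the (possibly corrected) word, or None if the error cannot be corrected."""
    first = cell.read()
    if has_odd_parity(first):
        return first                      # no parity error
    cell.write(~first & MASK)             # 2nd write: complemented data
    second = cell.read()                  # 2nd read: the stuck bit shows up again
    candidate = ~second & MASK            # complement once more
    return candidate if has_odd_parity(candidate) else None


cell = StuckAtWord(stuck_mask=0b000001000, stuck_values=0b000001000)  # sixth bit from the left stuck at 1
cell.write(0b110100110)                   # 1st write: original data
recovered = read_with_complement_retry(cell)
print(f"{recovered:09b}")
```

If the parity error is soft (no stuck bit), the second read returns the complement unchanged, the candidate equals the erroneous first read, and the sketch returns `None`: the method on its own corrects hard errors only.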


Reconfiguration

Reconfiguration involves the permutation of the address and/or data lines between an array of memory chips and the CPU to prevent the build-up of multiple hard errors

• Spare memory locations technique (spare blocks method)

• Spare switchable columns technique


Spare Blocks Method

Special-purpose hardware: Intel's iQX module using the reallocation technique

In the 2 Mbyte memory system:
• Hard error rate is 0.027% in 1000 hours
• Soft error rate is 0.1% in 1000 hours

[Figure: the memory address from the host is split into a high-order and a low-order address. The high-order address goes through the memory allocation RAM (mapping table), so that a block of main memory containing faulty data can be replaced by one of the spare memory blocks.]
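
The reallocation idea can be sketched as a mapping table that translates the high-order part of the address to a physical block. Everything in this sketch (the class `SpareBlockMemory`, its method names and the block sizes) is an assumption for illustration and does not describe the internals of the iQX module.

```python
class SpareBlockMemory:
    """Main memory blocks plus spare blocks, addressed through a mapping table."""

    def __init__(self, num_blocks: int, num_spares: int, block_size: int):
        self.block_size = block_size
        # Physical storage: main blocks first, spare blocks after them.
        self.storage = [[0] * block_size for _ in range(num_blocks + num_spares)]
        # Mapping table: logical block number -> physical block number.
        self.mapping = list(range(num_blocks))
        self.free_spares = list(range(num_blocks, num_blocks + num_spares))

    def _locate(self, address: int):
        high, low = divmod(address, self.block_size)   # high-order / low-order address
        return self.mapping[high], low

    def read(self, address: int) -> int:
        block, offset = self._locate(address)
        return self.storage[block][offset]

    def write(self, address: int, value: int) -> None:
        block, offset = self._locate(address)
        self.storage[block][offset] = value

    def retire_block(self, logical_block: int) -> int:
        """Redirect a block that contains a faulty location to a spare block."""
        if not self.free_spares:
            raise RuntimeError("no spare blocks left")
        spare = self.free_spares.pop(0)
        old = self.mapping[logical_block]
        # Copy the contents; a real system would correct the data first (e.g. via ECC).
        self.storage[spare] = list(self.storage[old])
        self.mapping[logical_block] = spare
        return spare


mem = SpareBlockMemory(num_blocks=4, num_spares=2, block_size=8)
mem.write(10, 42)                 # address 10 lies in logical block 1
mem.retire_block(1)               # block 1 is found faulty and remapped
print(mem.read(10), mem.mapping)
```

The host keeps using the same addresses; only the table entry changes.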


Spare Switchable Columns Method

If a particular memory location is faulty, the entire block is switched to a spare column. Parity is used for fault detection. (GMD-Siemens)

[Figure: the memory address register selects block i in the memory chip array, which has normal columns 1, 2, ..., n (n is the word length) and spare columns 1 and 2. Entry i of the fault status table holds Field 1 and Field 2, each consisting of a spare-column bit (SC1, SC2) and position bits P1, P2, ..., Pn. Decoders D1 and D2 drive the switching network between the memory chip array and the memory data register (bits 1, 2, ..., n).]


Fault-tolerant Memories In Commercial Systems (1)

• Intel's Series 90/iQX
  – (A SEC-DED code on the data, a parity check on the address bus, the scrubbing of memory, which is the periodic dumping and rewriting of data to prevent the build-up of multiple soft errors, and spare memory with a pointer table)

• VAX-11/780 and MicroVAXes
  – (A 7-bit SEC-DED Hamming code for 32-bit words and error logging)

• Memory systems for spaceborne computers
  – (SEC-DED with periodic scrubbing, or bit-per-chip memory organization with row/column, power isolation and error protocol data to assist reconfiguration)


Fault-tolerant Memories In Commercial Systems (2)

• IBM 30xx and 43xx
  – Use a Hamming SEC-DED code and the parity + complement method

• UNIVAC 1100/60
  – Employs SEC-DED and sends an error signal to the requesting device if a double error is detected

• VAX-11/780 and MicroVAX
  – Employ a Hamming SEC-DED code with error logging

• Cray X-MP and Y-MP
  – Use an 8-bit SEC-DED code word with each 64-bit memory word

• Sun workstations
  – Some use SEC-DED


Fault-tolerant Memories In Fault-tolerant Computers

Self-testing and repairing (STAR) computer

• 12 bits of each instruction word are stored in a 2-out-of-4 code, while the remaining 20 bits consist of 16 bits for the address field and 4 check bits.

• An inverse modulo-15 code is used to set the check bits such that the combined 20 bits represent a number that is divisible by 15.

• Operands also use the inverse modulo-15 code (28 data bits and 4 check bits in the data words).

• Critical programs can be written into multiple memory units.
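
The divisibility property of the inverse modulo-15 code can be illustrated with a short sketch. Two details are assumptions made for the example rather than taken from the slide: the check bits are placed in the four low-order positions, and they are computed as 15 - (value mod 15). Because 16 ≡ 1 (mod 15), the combined word is then divisible by 15.

```python
def check_bits(value: int) -> int:
    """4-bit inverse residue: 15 minus the residue of the value modulo 15 (range 1..15)."""
    return 15 - (value % 15)


def encode(value: int, data_bits: int = 16) -> int:
    """Append the check bits as the four low-order bits (placement is an assumption)."""
    if not 0 <= value < (1 << data_bits):
        raise ValueError("value does not fit into the data field")
    return (value << 4) | check_bits(value)


def is_valid(codeword: int) -> bool:
    """A code word is valid when the combined number is divisible by 15."""
    return codeword % 15 == 0


address = 0x1234
word = encode(address)
print(word, is_valid(word))

# Flipping any single bit changes the number by a power of two,
# which is never a multiple of 15, so the error is detected.
detected = all(not is_valid(word ^ (1 << i)) for i in range(20))
print(detected)
```

The same functions cover the 28-bit operands by calling `encode(value, data_bits=28)`.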


Examples

• Carnegie-Mellon University computers
  – C.mmp uses two parity bits (one odd, one even) in its memory.
  – Cm* employs retry and error reporting mechanisms.
  – C.vmp uses TMR.

• Electronic switching systems (Bell Labs)
  – ESS 1 uses two parity bits (one covering both address and data, the other covering just the address). The system also supports error logging, auto-retry and software error handling.
  – ESS 3A makes extensive use of totally self-checking checkers and duplication of critical processors to recover from errors.

• Fault-tolerant building blocks architecture (JPL-UCLA)
  – Uses SEC-DED and two spare switchable bits

• Other examples:
  – Tandem, Stratus, August Systems, Plessey (Great Britain), Philips (4,2) concept (the Netherlands), COMTRAC (Japan) and COPRA (France)