
1

Low Power/Low Voltage Computing

LACSS Workshop

Acknowledgement:

Alaa Alameldeen, Keith Bowman, Zeshan Chishti,

Dinesh Somasekhar, James Tschanz, Chris Wilkerson, Wei Wu

Shih-Lien Lu

Intel Labs

October 13, 2010

2

Power Limit

• Mark Seager

– 2 Petaflop -> 6 MW (1.27 PF -> 2 MW)

– Linear scaling: 1 Exaflop -> 3 GW (1.6 GW)

– With ideal technology scaling @ constant die size and frequency

• 3 generations: ~380 MW (200 MW)

• 4 generations: ~190 MW (100 MW)

– This talk addresses challenges facing aggressive voltage scaling
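The arithmetic behind these figures can be reproduced with a short sketch. It assumes, which the slide does not state outright, that "ideal technology scaling" halves the power per flop each generation; that assumption lands close to the ~380 MW and ~190 MW figures. The function name is illustrative only.

```python
# Linear scaling of the slide's data points to 1 exaflop, then ideal scaling.
# Assumption (not stated on the slide): each technology generation halves
# power per flop at constant die size and frequency.

def exaflop_power_mw(petaflops, megawatts, generations=0):
    """Power in MW for 1 exaflop (1000 PF), scaled from one measured data point."""
    linear_mw = megawatts * (1000.0 / petaflops)
    return linear_mw / (2 ** generations)

for pf, mw in [(2.0, 6.0), (1.27, 2.0)]:
    for gen in (0, 3, 4):
        power = exaflop_power_mw(pf, mw, gen)
        print(f"{pf} PF @ {mw} MW, {gen} generations: {power:.0f} MW")
```

Under this assumption the first data point gives 3000 MW, 375 MW and 187.5 MW, and the second gives roughly 1575 MW, 197 MW and 98 MW.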

3

Resiliency: A Familiar Topic

• Resilient design employs techniques that handle faults to give correct operation

• Past Focus: Increase Reliability

• Resiliency as Part of Optimization

Faults

Data Corruption OR Loss of Control

4

Reduce or Eliminate Guardbands

[Figure: guardbands for aging, Vcc/temperature, and process variations separate the intrinsic level (4 GHz, 0.6 V) from the released level (0.85 V, 3 GHz); the gap is labeled "Potential Gain"]

• Reducing guardbands: lower voltage and lower power, or higher frequency and higher performance

• Adding guardbands: higher voltage and higher power, or lower frequency and lower performance

Margins added for rare occurrences impact Power and Performance
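A rough sense of the potential gain can be sketched from the labels on this slide. The sketch assumes the usual first-order model in which dynamic power scales as V^2 * f, which is not stated on the slide. It also reads the labels as "0.85 V, 3 GHz" for the released level and "0.6 V" or "4 GHz" for the intrinsic level.

```python
# Assumption: dynamic power ~ C * V^2 * f (first-order CMOS model, leakage ignored).

def dynamic_power_ratio(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of dynamic power after a voltage and/or frequency change."""
    return (v_new / v_old) ** 2 * (f_new / f_old)

# Option 1: keep 3 GHz, drop the supply from 0.85 V to 0.6 V.
power_ratio = dynamic_power_ratio(0.6, 0.85)

# Option 2: keep 0.85 V, raise the clock from 3 GHz to 4 GHz.
freq_gain = 4.0 / 3.0 - 1.0

print(f"dynamic power at 0.6 V relative to 0.85 V: {power_ratio:.2f}")
print(f"frequency gain at 4 GHz relative to 3 GHz: {freq_gain:.0%}")
```

Under these assumptions, removing the guardband would roughly halve dynamic power at the same frequency, or raise frequency by about a third at the same voltage.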

5

Faults Caused by Vcc Reduction

• Decreases SRAM stability, more failing bits

• More timing violations, which reduces frequency

Memory:

– Vcc > Vmin : Memory fully functional

– Vcc < Vmin : A few bits fail

– Vcc << Vmin : Multiple bits fail

Logic (flip-flop to flip-flop paths):

– Vcc > Vmin : No timing violations

– Vcc < Vmin : Some violations

– Vcc << Vmin : More violations

6

Addressing On-die Memory Errors

• Cache line disabling

– Coarse grain

– Fine grain – Wilkerson et al., ISCA 2008

• Multi-segmented ECC

– Take part of the cache to store ECC bits

– Segmented protection

– OLSC: simple and modular encode/decode

• Variable Strength ECC

– Current project

50% EPI reduction with small performance loss

7

On-die Memory Fault Types

• Persistent

– Permanent defect

– Read/Write stability

– Retention

• Transient

– Particle strike

– Proximity disturbance

• Erratic

8

SRAM Failures Result from Mismatched Devices in a Single Cell

[Figure: SRAM cell schematic with two bitlines, a wordline, Vcc and Vss rails, devices labeled X0/X1, P0/P1, N0/N1 and I0/I1, and storage nodes holding "0" and "1"]

• Contention between READ and WRITE on device sizes. Ex: a weak pass device (X1) vs. a strong pull-up device (P1) can cause a write failure

• Random within-die variations primarily responsible

9

4 Types of SRAM Failures

• Write failure

– Device mismatch prevents cell from flipping

• Read failure

– Cell flips during read

• Access failure

– Insufficient differential increases latency.

• Retention failure

– Reduced margin, failures occur due to noise

10

Resiliency Techniques

• Require testing (a priori info)

– Advantages

• More information means simpler (cheaper) remedy

– Disadvantages

• Test cost

• Faults that cannot be tested are not covered

– Techniques

• Sparing – physical redundancy

• Disabling – graceful degradation

11

Spares

• Extra capacity

• Column redundancy

– Random cell failure

– Column mux

– BIST & fuse

• Row redundancy

– Word line failure

– Multi-bit failure

– Word-line segmentation

• Block redundancy


12

Disabling

• Wide dynamic range

• Graceful degradation

– Static and dynamic sizing

– Bank disabling

– Fine grain disabling

• Example

– Cache design
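Why granularity matters for disabling can be illustrated with a small capacity estimate. The sketch assumes independent bit failures and hypothetical unit sizes (512-bit lines, 32-bit words); neither the sizes nor the failure probabilities come from the slides.

```python
# Fraction of cache capacity that stays usable when any unit containing a
# failing bit is disabled. Assumes independent bit failures.

def usable_fraction(p_bit, bits_per_unit):
    """Probability that a unit of bits_per_unit bits has no failing bit."""
    return (1.0 - p_bit) ** bits_per_unit

LINE_BITS = 512  # hypothetical 64-byte cache line (coarse-grain disabling)
WORD_BITS = 32   # hypothetical 32-bit word (fine-grain disabling)

for p_bit in (1e-4, 1e-3, 1e-2):
    by_line = usable_fraction(p_bit, LINE_BITS)
    by_word = usable_fraction(p_bit, WORD_BITS)
    print(f"p_bit={p_bit:.0e}: line disabling keeps {by_line:.1%}, "
          f"word disabling keeps {by_word:.1%}")
```

At a per-bit failure probability of 1e-3, disabling whole lines keeps about 60% of the capacity while disabling words keeps about 97%, which is the motivation for fine-grain schemes at low voltage.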

13

Known Methods (2MB Cache)

[Figure: failure probability (log scale, 1.0E-11 to 1.0E+00) vs. supply voltage (0.4 to 0.8 V) for 6T Cell, ST Cell, 6T with 10-bit ECC, and 6T with 1-bit ECC. Voltages marked at the Pfail target: 0.46, 0.53, 0.67, 0.83]
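The shape of such curves follows from a simple probability model, sketched here. It assumes independent bit failures, hypothetical 64-byte lines, and correction of up to t bits per line, and it ignores the check bits themselves. The per-bit failure probabilities are illustrative, not read off the plot.

```python
import math

def line_failure_prob(p_bit, n_bits, t):
    """P(more than t of n_bits fail), independent failures, summed in log space."""
    if p_bit <= 0.0:
        return 0.0
    if p_bit >= 1.0:
        return 1.0 if t < n_bits else 0.0
    total = 0.0
    for k in range(t + 1, n_bits + 1):
        log_term = (math.lgamma(n_bits + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_bits - k + 1)
                    + k * math.log(p_bit)
                    + (n_bits - k) * math.log1p(-p_bit))
        total += math.exp(log_term)
    return min(total, 1.0)

def cache_failure_prob(p_line, n_lines):
    """P(at least one of n_lines has an uncorrectable error)."""
    if p_line >= 1.0:
        return 1.0
    return -math.expm1(n_lines * math.log1p(-p_line))

N_LINES = 2 * 1024 * 1024 // 64  # 2MB cache with hypothetical 64-byte lines
LINE_BITS = 512                  # data bits only; check bits ignored here

for p_bit in (1e-9, 1e-6, 1e-4):
    for t in (0, 1, 10):
        p_cache = cache_failure_prob(line_failure_prob(p_bit, LINE_BITS, t), N_LINES)
        print(f"p_bit={p_bit:.0e}, {t}-bit correction: P(cache fails)={p_cache:.3e}")
```

Since the per-bit failure probability rises as the supply voltage drops, stronger correction lets the cache meet the same failure target at a lower voltage.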

14

Vmin for Proposed Techniques

[Figure: failure probability (log scale, 1.0E-11 to 1.0E+00) vs. supply voltage (0.4 to 0.7 V) for ST Cell, 6T with 10-bit ECC, 6T with 1-bit ECC, 6T with Word Disable, and 6T with Bit Fix. Voltages marked: 0.46, 0.48, 0.51, 0.53, 0.67]

15

L1WDis_L2BFix Normalized to ST Cell

[Figure: normalized IPC (axis 0.80 to 1.00) per benchmark group: DH, FP2K, INT2K, GM, MM, OF, PROD, SERV, WS, GMEAN]

Performance loss ~5%

16

Voltage/Area/Energy Comparison

Scheme                   Vccmin (mV)   Norm Area              Norm EPI
6T Cell                  825           1.0                    1.0
ST Cell (circuit sol.)   530           2.0                    0.45
L1WDis_L2BFix            500           1.08 (L1), 1.00 (L2)   0.47

• Lower Voltage & EPI (Energy Per Inst) vs 6T

• Much less area overhead than ST Cell

17

Summary for this Example

• Vccmin limits energy scaling

– Ability to reduce voltage critical but limited by memory reliability

• The ability to detect/avoid failures allows low voltage operation and reduces energy

• Configurable approach that “trades off cache capacity”

– Maximizes performance at high voltage

– Enables ~50% improvement in energy per instruction when operating at low voltage

18

Resiliency Techniques (2)

• No a priori information (through testing)

– Advantages

• Lower test cost

– Disadvantages

• Overhead

• Techniques

– ECC

• Random error

• Correlated error

19

Our Observations with ECC

• Use of systematic H’ matrix instead of non-systematic

• Separate error detection from correction

• Codes with different strength can share H matrix

• Trade code density with logic complexity and modularity
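The first two observations can be illustrated on a toy code. This sketch uses a Hamming(7,4) single-error-correcting code with a systematic parity-check matrix, chosen only for size; it is not one of the codes used in the talk. Because the layout is systematic, the data bits are stored unchanged, and detection (is the syndrome zero?) is a separate, cheaper step than correction (locating the bit).

```python
# Codeword layout: [d0 d1 d2 d3 | c0 c1 c2]. H_COLUMNS lists the columns of a
# systematic parity-check matrix H = [A | I]; the last three columns are the
# identity, so the check bits are simply A * data (mod 2).
H_COLUMNS = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1),
             (1, 0, 0), (0, 1, 0), (0, 0, 1)]

def syndrome(word):
    """H * word (mod 2) as a 3-tuple."""
    return tuple(sum(bit * col[row] for bit, col in zip(word, H_COLUMNS)) % 2
                 for row in range(3))

def encode(data):
    """Append check bits; the four data bits are stored unmodified."""
    checks = syndrome(list(data) + [0, 0, 0])
    return list(data) + list(checks)

def detect(word):
    """Detection only: a non-zero syndrome means an error is present."""
    return any(syndrome(word))

def correct(word):
    """Correction: a single-bit error's syndrome equals that bit's column of H."""
    s = syndrome(word)
    if not any(s):
        return list(word)
    fixed = list(word)
    fixed[H_COLUMNS.index(s)] ^= 1
    return fixed

codeword = encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[2] ^= 1  # inject a single-bit error
print(detect(codeword), detect(damaged), correct(damaged) == codeword)
```

In the common error-free case only `detect` has to run, and the data bits can be forwarded directly from the stored word.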

20

Non-Persistent Failures

• Exhibit sporadic failing behavior

• Examples: soft errors, erratic failures

• Also exhibit supply voltage dependence

• Cannot be detected by a priori memory testing

• Both persistent and non-persistent failures affect Vccmin

21

Approach

• Key Idea:

• Adaptive cache that works at both high and low voltages

– As big as possible when performance is important (high voltages)

– Sacrifice capacity when power is most important (low voltages)

– In low voltage mode, use a portion of cache to store ECC

– Enough check bits to correct both persistent and non-persistent errors

– No additional testing to isolate defective bits

22

Trading off Code Density for Simplicity

• Traditional codes optimized for check bit overhead

• But, complexity grows rapidly with no. of corrections

• We need large number of corrections

– E.g., up to 10 corrections in each cache line for 500 mV operation

• Traditional BCH-based code too complex for such corrections

• Solution: Orthogonal Latin Square Codes (OLSC)

• Less complexity at the cost of more check bits

23

Multi-bit Segmented ECC (MS-ECC)

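[Figure: 8-way cache (Way 0 to Way 7) in low-voltage mode, with physical ways relabeled as data ways and ECC ways (e.g. "Data Way 0", "ECC Way 0", "Data Way 1", "ECC Way 1"). Read Logic drives Data Out; Write Logic takes Data In]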

24

Orthogonal Latin Square Codes (OLSC)

• Modular error correction hardware

– More regular implementation than BCH

• Based on majority voting

– Example: TMR triplicates data and uses majority function

• Instead of keeping multiple copies of data bits,

– Encode orthogonal groups of data bits to form check bits

– For t corrections in m^2 data bits, need 2tm check bits
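Two pieces of this slide are easy to make concrete: the TMR majority function and the 2tm check-bit count. The sketch below is illustrative and does not implement OLSC encoding or decoding; the function names are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three stored copies of the same word."""
    return (a & b) | (a & c) | (b & c)

def olsc_check_bits(m, t):
    """Check bits for t corrections over m*m data bits (the 2tm figure above)."""
    return 2 * t * m

# TMR: one corrupted copy is outvoted by the other two.
data = 0b10110010
corrupted = data ^ 0b00001000
print(tmr_vote(data, corrupted, data) == data)

# OLSC sizing for a 64-bit segment (m = 8) with 4 corrections.
m, t = 8, 4
data_bits = m * m
check_bits = olsc_check_bits(m, t)
print(data_bits, check_bits, data_bits / (data_bits + check_bits))
```

With m = 8 and t = 4 the check bits equal the data bits (64 each), so half of the storage holds data. That lines up with the 64-bit segments, 4 corrections per segment and 50% capacity quoted in the Methodology slide.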

25

Methodology

• Two modes of operation:

– High voltage: 1.3V, 3 GHz

– Low voltage: 0.5V, 500 MHz

• 32K 8-way L1 caches, 2M 8-way L2 cache

• Compare

– Baseline: SECDED ECC

– MS-ECC: 64-bit segments, 4 corrections per segment

• 50% capacity, 1-cycle added latency overhead

• Compare against: Bit-fix with SECDED ECC (BFXECC)

• Can correct 10-bit persistent and 1-bit non-persistent errors

26

Reliability (2MB Cache)

[Figure: probability of failure (log scale, 1.E-15 to 1.E+00) vs. Vcc (0.4 to 0.9 V) for BFXECC (Low), BFXECC (Hi), and MS-ECC (Hi/Low)]

27

Performance Overhead

[Figure: normalized IPC (axis 0.8 to 1) for the defect-free baseline and MS-ECC per benchmark group: DH, FP2K, INT2K, GM, MM, OF, PROD, SERV, WS, GMEAN]

10% IPC degradation relative to unrealistic defect-free baseline

28

Energy

SCHEME     VCCMIN (mV)   FREQUENCY (MHz)   NORM. POWER   NORM. EPI
BASELINE   725           1400              1             1
BFXECC     630           1000              0.57          0.8
MS-ECC     520           700               0.29          0.58

MS-ECC reduces energy-per-instruction by 42% relative to baseline SECDED ECC
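The EPI column appears to follow from the power and frequency columns. The sketch below assumes that normalized EPI is normalized power divided by normalized frequency, i.e. that instructions per cycle are treated as equal across schemes. That assumption is not stated on the slide, but it reproduces the table.

```python
# (Vccmin mV, frequency MHz, normalized power), copied from the table above.
SCHEMES = {
    "BASELINE": (725, 1400, 1.0),
    "BFXECC": (630, 1000, 0.57),
    "MS-ECC": (520, 700, 0.29),
}
BASE_FREQ_MHZ = SCHEMES["BASELINE"][1]

def norm_epi(norm_power, freq_mhz):
    """Energy per instruction relative to baseline, assuming equal IPC."""
    return norm_power / (freq_mhz / BASE_FREQ_MHZ)

for name, (vccmin_mv, freq_mhz, norm_power) in SCHEMES.items():
    epi = norm_epi(norm_power, freq_mhz)
    print(f"{name}: Vccmin={vccmin_mv} mV, norm EPI={epi:.2f}, "
          f"EPI reduction={1 - epi:.0%}")
```

For MS-ECC this gives 0.29 / 0.5 = 0.58, the 42% reduction quoted above, alongside a Vccmin drop of 725 - 520 = 205 mV.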

29

Summary for This Example

• Reducing supply voltage key to higher energy efficiency

• Supply voltage reduction limited by memory reliability

• MS-ECC: novel technique to mitigate bit failures

– Leverages error correction codes based on OLSC

– Does not rely on testing to isolate defects

– Reduces Vccmin by ~200 mV, EPI by 42%

30

Addressing Logic Errors

• Timing faults

• Error detection sequential

• Detect timing faults at the circuit level

• Replay pipeline at the microarchitecture level

• A research processor in 45nm

– “A 45nm resilient and adaptive microprocessor core for dynamic variation tolerance,” ISSCC 2010

31

Error-Detection Sequential (EDS)

• Contains additional scan-enabled latch for testing

– mode=0: EDS

– mode=1: FF

[Figure: EDS implementation with D and CLK inputs, a flip-flop (FF) and latches, a mode select, and Q and ERROR outputs]

32

Microprocessor Core Overview

[Figure: pipeline block diagram with stages labeled PC, IF, DE, RA, EX, MEM, X, WB]

• Adaptive clock control enables dynamic FCLK change

33

Tunable Replica Circuit (TRC)

[Figure: TRC block with tuning bits, placed between a flip-flop (FF) and an EDS that raises ERROR]

• TRC monitors critical path delays

• Non-intrusive design

J. Tschanz, et al., Symp. VLSI Circuits, 2009.

34

Tunable Replica Circuit (TRC)

[Figure: pipeline block diagram (PC, IF, DE, RA, EX, MEM, X, WB) with a TRC attached to each of five stages]

• TRC tuned to track critical paths per pipeline stage

• TRC must always fail if any critical path fails

• TRC error initiates pipeline error recovery

35

EDS & TRC Overheads

Circuit Blocks                                   EDS     TRC
Error Detection & Accumulation Area Overhead     2.2%    0.8%
ECU & Clock Control Area Overhead                1.4%    1.4%
Min-Delay Buffer Insertion Area Overhead         0.2%    -
Total Area Overhead                              3.8%    2.2%
Total Power Overhead (iso-FCLK, iso-VCC)         0.9%    0.6%

36

Error-Recovery Circuits

1) Instruction Replay at ½FCLK

– Clock divider generates ½FCLK without PLL re-lock

– Clock high-phase delay remains unchanged

2) Multiple Issue Instruction Replay at FCLK

– Does not require clock control

– Issue replica instructions to set up pipeline registers

– Last issue is a valid instruction

37

Characteristics & Measurements

• Programs compiled from C code

• Caches and settings loaded via JTAG scan

[Figure: die photo with I/O, DCache, ICache, Core, Clock, RF, and JTAG blocks labeled]

Technology    45nm CMOS
Die Area      13.64 mm2
Core Area     0.39 mm2
Core FMAX     1.45GHz at 1.0V
Core Power    135mW at 1.0V

38

Measured Throughput (TP) vs FCLK

[Figure: normalized throughput (0.8 to 1.3) and recovery cycles (%, log scale from 10^-4 to 10^1) vs. clock frequency (1.1 to 1.7 GHz) at VCC=1.0V with a 10% VCC droop. Marked points: Conventional Max TP, Resilient: TRC Max TP, Resilient: EDS Max TP. Annotations: FCLK guardband, TRC guardband + path activation, unnecessary recovery]

• 16% TP gain with EDS

• 12% TP gain with TRC
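Why throughput peaks at an intermediate frequency can be seen in a toy model: raising FCLK executes more cycles per second, but a growing share of them is spent on recovery. The recovery curve below is entirely hypothetical (its onset frequency, base rate, and steepness are made-up parameters, not measurements from the chip); only the shape of the trade-off is the point.

```python
import math

def recovery_fraction(f_ghz, f_onset_ghz=1.45, base=1e-4, steepness=40.0):
    """Hypothetical fraction of cycles spent in error recovery at a given FCLK."""
    return min(1.0, base * math.exp(steepness * (f_ghz - f_onset_ghz)))

def throughput(f_ghz):
    """Useful cycles per nanosecond: clock rate times the non-recovery share."""
    return f_ghz * (1.0 - recovery_fraction(f_ghz))

# Sweep the same frequency range as the plot's x-axis.
freqs = [1.1 + 0.01 * i for i in range(61)]
best = max(freqs, key=throughput)
print(f"best FCLK in this toy model: {best:.2f} GHz, "
      f"recovery share there: {recovery_fraction(best):.2%}")
```

In this model the best operating point sits where the recovery share is still small, which matches the qualitative picture of a resilient design running at a low but non-zero error rate.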

39

Measured Energy vs Throughput

• TRC & EDS resilient circuits enable:

– 41% throughput gain at equal energy

– 22% energy reduction at equal throughput

[Figure: total energy (mJ, 0.0 to 1.6) vs. throughput (BIPS, 0.0 to 1.2) for Conventional, EDS, and TRC with a 10% VCC droop; the 22% and 41% gaps are marked]

40

Summary of This Example

• Simple microprocessor core employs resiliency to mitigate dynamic variation guardbands

• Error-detection circuits:

– Error-detection sequential (EDS)

– Tunable replica circuit (TRC)

• Error-recovery circuits:

– Instruction replay at ½FCLK

– Multiple issue instruction replay at FCLK

• Silicon measurements indicate:

– 41% throughput gain at iso-energy

– 22% energy reduction at iso-throughput

• Resilient & adaptive circuits enable the microprocessor to adjust to operating variations for maximum efficiency

41

Networking Approach

Example: Internet Protocol

[Figure: protocol stack with Application Layer, Transport Layer, Internet Layer, and Link Layer; "Failure" and "Re-transmit" arrows run between layers]

• In networking, failures at each layer may be dealt with within the layer or passed to the layer above.

42

Unified Adaptive Design Framework

[Figure: layer stack with Software Layer, Platform Layer, Microarchitecture Layer, and Circuits Layer; "Problems" and "Adapt/Retry" arrows run between layers]

• Adaptive design proposes to handle failures in each layer by reporting failures to the next layer, which delivers a response.

• Global Optimization By Reconfiguring at the Appropriate Layer
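The escalation idea can be sketched as a chain of handlers, loosely following the networking analogy: a failure is offered to each layer from the bottom up until one of them responds. The failure names and the handlers are hypothetical placeholders; the slides do not define such an interface.

```python
# Layers from the bottom of the stack to the top, as named on the slide.
LAYERS = ["Circuits", "Microarchitecture", "Platform", "Software"]

def handle_failure(failure, handlers):
    """Offer the failure to each layer in turn; return the layer that responds."""
    for layer in LAYERS:
        handler = handlers.get(layer)
        if handler is not None and handler(failure):
            return layer
    return None  # no layer could adapt or retry

# Hypothetical responses: each handler returns True if it can deal with the failure.
handlers = {
    "Microarchitecture": lambda f: f == "timing_fault",    # e.g. replay the instruction
    "Platform": lambda f: f == "voltage_droop",            # e.g. adjust the clock or supply
    "Software": lambda f: f == "uncorrectable_data",       # e.g. restart from a checkpoint
}

for failure in ("timing_fault", "uncorrectable_data", "unknown_fault"):
    print(failure, "->", handle_failure(failure, handlers))
```

Here the circuits layer only detects and reports, so a timing fault is resolved one level up, while failures that no lower layer can absorb travel all the way to software.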

43

Conclusion

• Resiliency as part of the optimization equation for performance/energy

• Memory is easier

• Logic is much harder

– We addressed a solution for timing faults

– Other transient faults?

– Permanent faults?

– Reconfigurable logic helpful?

• Cross-layer resiliency
