1
Low Power/Low Voltage
Computing
LACSS Workshop
Acknowledgement:
Alaa Alameldeen, Keith Bowman, Zeshan Chishti,
Dinesh Somasekhar, James Tschanz, Chris Wilkerson, Wei Wu
Shih-Lien Lu
Intel Labs
October 13, 2010
2
Power Limit
• Mark Seager
– 2 Petaflop -> 6 MW (1.27 PF -> 2MW)
– Linear scaling: 1 Exaflop -> 3GW (1.6GW)
– With idea technology scaling @ constant die
size and freq
• 3 generation: ~380MW (200MW)
• 4 generation: ~190MW (100MW)
– This talk addresses challenges facing
aggressive voltage scaling
3
Resiliency: A Familiar Topic
• Resilient design employs techniques that
handle faults to give correct operation
• Past Focus: Increase Reliability
• Resiliency as Part of Optimization
FaultsFaultsFaultsFaultsData Corruption ORData Corruption ORData Corruption ORData Corruption OR
Loss of ControlLoss of ControlLoss of ControlLoss of Control
4
Intrinsic Level 4GHz0.6V
Aging
Vcc/Temp
Process Variations
0.85V 3GHz
Reduce or Eliminate Guardbands
Lower Voltage
Lower PowerHigher Frequency
Higher Performance
Released Level
Higher Voltage
Higher Power
Lower Frequency
Lower Performance
Guardband
Potential G
ain
Potentia
l Gain
Margins added for rare occurrences impact Margins added for rare occurrences impact Margins added for rare occurrences impact Margins added for rare occurrences impact PowerPowerPowerPowerPowerPowerPowerPower and and and and PerformancePerformancePerformancePerformancePerformancePerformancePerformancePerformance
5
logic
Faults Caused by Vcc Reduction
• Decreases SRAM stability, more failing bits
• More timing violations—reduces frequency
Vcc > Vmin : Memory fully functionalVcc > Vmin : Memory fully functionalVcc > Vmin : Memory fully functionalVcc > Vmin : Memory fully functional
Vcc < Vmin : A few bits failVcc < Vmin : A few bits failVcc < Vmin : A few bits failVcc < Vmin : A few bits fail
Vcc << Vmin : Multiple bits failVcc << Vmin : Multiple bits failVcc << Vmin : Multiple bits failVcc << Vmin : Multiple bits fail
Flip FlopFlip FlopFlip FlopFlip Flop Flip FlopFlip FlopFlip FlopFlip Flop
Vcc > Vmin : No timing violationsVcc > Vmin : No timing violationsVcc > Vmin : No timing violationsVcc > Vmin : No timing violations
Vcc < Vmin : Some violationsVcc < Vmin : Some violationsVcc < Vmin : Some violationsVcc < Vmin : Some violations
Vcc << Vmin : More violationsVcc << Vmin : More violationsVcc << Vmin : More violationsVcc << Vmin : More violations
6
Addressing On-die Memory Errors
• Cache line disabling
– Coarse grain
– Fine grain – Wilkerson et. al. ISCA 2008
• Multi-segmented ECC
– Take part of the cache to store ECC bits
– Segmented protection
– OLSC: simple and modular encode/decode
• Variable Strength ECC
– Current project
50% EPI reduction50% EPI reduction50% EPI reduction50% EPI reductionWith small performance lossWith small performance lossWith small performance lossWith small performance loss
7
On-die Memory Fault Types
• Persistent
– Permanent defect
– Read/Write stability
– Retention
• Transient
– Particle strike
– Proximity disturbance
• Erratic
8BitLine
BitLine
Vcc
X 0 X 1
P 0 P 1
Wordline
Vss Vss
“0” “1”
N 0 N 1
I 0 I 1
Contention between READ and WRITE on device sizesEx: Weak pass device (X1) vs. a strong pull up device (P1) can cause a write failure
�Random within-die variations primarily responsible
SRAM Failures Result from Mismatched
Devices in a Single Cell
9
4 Types of SRAM Failures
• Write failure
– Device mismatch prevents cell from flipping
• Read failure
– Cell flips during read
• Access failure
– Insufficient differential increases latency.
• Retention failure
– Reduced margin, failures occur due to noise
10
Resiliency Techniques
• Require testing (a priori info)
– Advantages
• More information means simpler (cheaper) remedy
– Disadvantages
• Test cost
• Faults cannot be tested not covered
– Techniques
• Sparring – physical redundancy
• Disabling – graceful degradation
11
Spares
• Extra capacity
• Column redundancy
– Random cell failure
– Column mux
– BIST & fuse
• Row redundancy
– Word line failure
– Multi-bit failure
– Word-line segmentation
• Block redundancy
mux
12
Disabling
• Wide dynamic range
• Graceful degradation
– Static and dynamic sizing
– Bank disabling
– Fine grain disabling
• Example
– Cache design
131.0E-11
1.0E-10
1.0E-09
1.0E-08
1.0E-07
1.0E-06
1.0E-05
1.0E-04
1.0E-03
1.0E-02
1.0E-01
1.0E+00
0.4 0.5 0.6 0.7 0.8Supply Voltage
Failu
re P
rob
.
6T Cell
ST Cell
6T - 10-bit ECC
6T - 1-bit ECC
0.46 0.53 0.67 0.83Pfailtarget
Known Methods (2MB Cache)
14
1.0E-11
1.0E-10
1.0E-09
1.0E-08
1.0E-07
1.0E-06
1.0E-05
1.0E-04
1.0E-03
1.0E-02
1.0E-01
1.0E+00
0.4 0.45 0.5 0.55 0.6 0.65 0.7Supply Voltage
Fail
ure
Pro
b.
ST Cell
6T - 10-bit ECC
6T - 1-bit ECC
0.46 0.53 0.670.48 0.51
6T – Word Disable
6T – Bit Fix
Vmin for Proposed Techniques
15
L1WDis_L2BFix Normalized to ST Cell
0.80
0.85
0.90
0.95
1.00
DH
FP2KIN
T2K GM
MM OF
PRO
DSE
RV
WS
GEM
AN
No
rmali
zed
IP
C
Performance loss ~5%
16
0.47
0.45
1.0
Norm
EPI
1.08 (L1)
1.00 (L2)
500L1WDis_L2BFix
2.0530ST Cell (circuit sol.)
1.08256T Cell
Norm
Area
Vccmin
(mV)
•Lower Voltage & EPI (Energy Per Inst) vs 6T
•Much less area overhead than ST Cell
Voltage/Area/Energy Comparison
17
Summary for this Example
• Vccmin limits energy scaling
– Ability to reduce voltage critical but limited by memory
reliability
• The ability to detect/avoid failures allows low
voltage operation and reduces energy
• Configurable approach that “trades off cache
capacity”
– Maximizes performance at high voltage
– Enables ~50% improvement in energy per instruction
when operating at low voltage
18
Resiliency Techniques (2)
• No a priori information (through testing)
– Adv
• Lower test cost
– Disadv
• Overhead
• Techniques
– ECC
• Random error
• Correlated error
19
Our Observations with ECC
• Use of systematic H’ matrix instead of non-
systematic
• Separate error detection from correction
• Codes with different strength can share H
matrix
• Trade code density with logic complexity
and modularity
20
Non-Persistent Failures
• Exhibit sporadic failing behavior
• Examples: soft errors, erratic failures
• Also exhibit supply voltage dependence
• Cannot be detected by apriori memory
testing
• Both persistent and non-persistent failures
affect Vccmin
21
Approach
• Key Idea:
• Adaptive cache that works at both high and
low voltages
– As big as possible when performance is
important (high voltages)
– Sacrifice capacity when power is most
important (low voltages)
– In low voltage mode, use a portion of cache to
store ECC
– Enough check bits to correct both persistent
and non-persistent errors
– No additional testing to isolate defective bits
22
Trading off Code Density for
Simplicity• Traditional codes optimized for check bit overhead
• But, complexity grows rapidly with no. of
corrections
• We need large number of corrections
– E.g., up to 10 corrections in each cache line for 500 mV
operation
• Traditional BCH-based code too complex for such
corrections
• Solution: Orthogonal Latin Square Codes (OLSC)
• Less complexity at the cost of more check bits
23
Multi-bit Segmented ECC (MS-ECC)
Data
Way
0
ECC
Way
0
Data
Way
1
ECC
Way
3
......
Way 0 Way 1 Way 2 Way 7
ECC
Way
1
Way 3
Read Logic
Data Out
Data In
Write Logic
24
Orthogonal Latin Square Codes
(OLSC)• Modular error correction hardware
– More regular implementation than BCH
• Based on majority voting
– Example: TMR triplicates data and uses
majority function
• Instead of keeping multiple copies of data
bits,
– Encode orthogonal groups of data bits to form
check bits
– For t-corrections in m2 data bits, need 2tm
check bits
25
Methodology
• Two modes of operation:– High voltage: 1.3V, 3 GHz
– Low voltage: 0.5V, 500 MHz
• 32K 8-way L1 caches, 2M 8-way L2 cache
• Compare– Baseline: SECDED ECC
– MS-ECC: 64-bit segments, 4 corrections per segment• 50% capacity, 1-cycle added latency overhead
• Compare against: Bit-fix with SECDED ECC (BFXECC)
• Can correct 10-bit persistent and 1-bit non-persistent errors
26
Reliability (2MB Cache)
1.E-15
1.E-12
1.E-09
1.E-06
1.E-03
1.E+00
0.4 0.5 0.6 0.7 0.8 0.9
Vcc (V)
Pro
ba
bilit
y o
f F
ailu
re
BFXECC (Low)
BFXECC (Hi)
MS-ECC (Hi/Low)
27
Performance Overhead
0.8
0.85
0.9
0.95
1
DH
FP2K
INT2K
GM
MM
OF
PROD
SERV
WS
GMEAN
No
rmali
zed
IP
C
Defect-free Baseline
MS-ECC
10% IPC degradation relative to unrealistic defect-free baseline
28
Energy
SCHEME VCCMIN
(mV) FREQUENCY
(MHZ) NORM. POWER
NORM. EPI
BASELINE 725 1400 1 1
BFXECC 630 1000 0.57 0.8
MS-ECC 520 700 0.29 0.58
MS-ECC reduces energy-per-instruction by 42% relative to baseline SECDED ECC
29
Summary for This Example
• Reducing supply voltage key to higher
energy efficiency
• Supply voltage reduction limited by memory
reliability
• MS-ECC: novel technique to mitigate bit
failures• Leverages error correction codes based on OLSC
• Does not rely on testing to isolate defects
• Reduces Vccmin by ~ 200 mV, EPI by 42%
30
Addressing Logic Errors
• Timing faults
• Error detection sequential
• Detect timing faults at the circuit level
• Replay pipeline at the microarchitecture
level
• A research processor in 45nm
– “A 45nm resilient and adaptive microprocessor
core for dynamic variation tolerance,” ISSCC
2010
31
Error-Detection Sequential (EDS)
• Contains additional scan-enabled latch for testing
� mode=0: EDS
� mode=1: FF
D
CLK
Q
ERRORFF
LATCHLATCH
mode
EDS
Implementation
32
DE
RA
EX
ME
M
X
WB
PC IF
Microprocessor Core Overview
� Adaptive clock control enables dynamic FCLK change
33
Tunable Replica Circuit (TRC)F
F
ED
S
ERROR
Tuning Bits
TRC
� TRC monitors critical path delays
� Non-intrusive design
J. Tschanz, et al., Symp. VLSI Circuits, 2009.
34
DE
RA
EX
ME
M
X
WB
PC IF
TRC TRC TRC TRC TRC
Tunable Replica Circuit (TRC)
� TRC tuned to track critical paths per pipeline stage
� TRC must always fail if any critical path fails
� TRC error initiates pipeline error recovery
35
EDS & TRC Overheads
0.8%2.2%Error Detection & Accumulation Area Overhead
1.4%1.4%ECU & Clock Control Area Overhead
L0.2%Min-Delay Buffer Insertion Area Overhead
Circuit Blocks EDS TRC
Total Area Overhead 3.8% 2.2%
Total Power Overhead (iso-FCLK, iso-VCC) 0.9% 0.6%
36
Error-Recovery Circuits
1) Instruction Replay at ½FCLK
� Clock divider generates ½FCLK without PLL re-lock
� Clock high-phase delay remains unchanged
2) Multiple Issue Instruction Replay at FCLK
� Does not require clock control
� Issue replica instructions to setup pipeline registers
� Last issue is a valid instruction
37
Characteristics & Measurements
� Programs compiled from C
code
� Caches and settings loaded
via JTAG scan
I/O
DC
ac
he
ICac
he
Co
re
Clo
ck
RFJTAG
I/O
DC
ac
he
ICac
he
Co
re
Clo
ck
RFJTAG1.45GHz at 1.0VCore FMAX
Technology 45nm CMOS
Die Area 13.64 mm2
Core Area 0.39 mm2
Core Power 135mW at 1.0V
38
Measured Throughput (TP) vs FCLK
1.1 1.2 1.3 1.4 1.5 1.6 1.7
Clock Frequency (GHz)
1.3
1.2
1.1
1.0
0.9
0.8
101
100
10-1
10-2
10-3
10-4
No
rmali
zed
Th
rou
gh
pu
t
Reco
very
Cycle
s (
%)
Conventional
Max TP
TRC Guardband
+ Path Activation
Resilient: EDS
Max TPFCLK
Guardband
Unnecessary
Recovery
VCC=1.0V
10% VCC Droop
EDS
TRC
Resilient: TRC
Max TP
� 16% TP gain with EDS � 12% TP gain with TRC
39
• TRC & EDS resilient circuits enable:
� 41% throughput gain at equal energy
� 22% energy reduction at equal throughput
Measured Energy vs Throughput
22%
41%
10% VCC Droop
0.0
0.4
0.8
1.2
1.6
0.0 0.2 0.4 0.6 0.8 1.0 1.2Throughput (BIPS)
To
tal E
ne
rgy
(m
J)
Conventional
EDS
TRC
40
• Simple microprocessor core employs resiliency to mitigate
dynamic variation guardbands
• Error-detection circuits:
� Error-detection sequential (EDS)
� Tunable replica circuit (TRC)
• Error-recovery circuits:
� Instruction replay at ½FCLK
� Multiple issue instruction replay at FCLK
• Silicon measurements indicate:
� 41% throughput gain at iso-energy
� 22% energy reduction at iso-throughput
• Resilient & adaptive circuits enable the microprocessor to
adjust to operating variations for maximum efficiency
Summary of This Example
41
Networking Approach
Application Layer
Transport Layer
Internet Layer
Link Layer
••In networking failures at each layer may be dealt In networking failures at each layer may be dealt
with within the layer or passed to layer above. with within the layer or passed to layer above.
Internet Layer
Link Layer
Transport Layer
FailureFailure ReRe--transmittransmit
Application Layer
Example: Internet Protocol
42
Unified Adaptive Design Framework
Software Layer
Platform Layer
Microarchitecture Layer
Circuits Layer
•Adaptive design proposes to handle failures in each layer by reporting failures to the next layer which delivers a response.
Microarchitecture Layer
Circuits Layer
ProblemsAdapt/Retry
Global Optimization By Reconfiguring at the Appropriate Layer
Software Layer
Platform Layer
43
Conclusion
• Resiliency as part of the optimization
equation for performance/energy
• Memory is easier
• Logic is much harder
– We addressed a solution for timing faults
– Other transient faults?
– Permanent fault?
– Reconfigurable logic helpful?
• Cross-layer resiliency