MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency

Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li

Key Laboratory of Computer System and Architecture, ICT (Institute of Computing Technology), CAS, Beijing, P.R. China

NVIDIA Corporation, USA

Outline

• What Is Path-grained Timing Adaptability (PTA)?

• Potential of PTA for Efficiency Improvement

• How to Exploit PTA

• Case Study Results

• Conclusions

Impact of DVFS on Path Delay

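[Figure: two adjacent pipeline stages, the (K-1)th and the Kth, bounded by flip-flops (FF). Paths P1 and P2 are shown alongside a critical path and a non-critical path, each measured against the cycle period T.]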

• Suppose scaling the voltage down makes P1 and P2 timing-critical. What then?

• Traditionally: scale down the frequency of all pipeline stages

Question:

• Can these emerging critical paths be salvaged to allow further voltage scaling?

• Maybe yes! By fine-grained time stealing

Timing Imbalance

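[Figure: a chain of flip-flops (FF) with path segments labeled PCP and NCP, each measured against the cycle period T.]

Flip-flops fall into four classes according to their slacks Slack_up and Slack_dn relative to a threshold TH:

• Generous Flip-flop (GFF): Slack_up > TH, Slack_dn > TH

• Backward Adaptable Flip-flop (BAFF): Slack_up > TH, Slack_dn ≤ TH

• Forward Adaptable Flip-flop (FAFF): Slack_up ≤ TH, Slack_dn > TH

• Unadaptable Flip-flop (UAFF): Slack_up ≤ TH, Slack_dn ≤ TH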
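The four-way classification above can be sketched as a small function. This is only an illustration of the threshold rules on this slide; the function name and the choice of expressing slacks and TH in the same unit (for example, fractions of the cycle period) are assumptions, not part of the original slides.

```python
def classify_flip_flop(slack_up: float, slack_dn: float, th: float) -> str:
    """Classify a flip-flop from its two slacks and the threshold TH.

    All three arguments must use the same unit (assumed here: fraction
    of the cycle period).
    """
    up_ok = slack_up > th   # Slack_up exceeds the threshold
    dn_ok = slack_dn > th   # Slack_dn exceeds the threshold
    if up_ok and dn_ok:
        return "GFF"    # Generous Flip-flop
    if up_ok:
        return "BAFF"   # Backward Adaptable Flip-flop
    if dn_ok:
        return "FAFF"   # Forward Adaptable Flip-flop
    return "UAFF"       # Unadaptable Flip-flop
```

For example, with TH = 0.2, a flip-flop whose slacks are (0.3, 0.1) would be classified as a BAFF, while one with slacks exactly equal to TH on both sides would be a UAFF.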

Intrinsic Timing Imbalance

• Case study: the FPU adopted by OpenSPARC T1

• Supports all IEEE 754 floating-point data types

• Synthesized by Synopsys Design Compiler with UMC 0.18 µm technology

• Cycle period: (1 + 10%) × T_critical

[Figure: number of flip-flops (log scale, 1 to 10000) in each class (GFF, FAFF, BAFF, UAFF) for TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 of the cycle period.]

The GFFs, FAFFs, and BAFFs make up a considerable, even dominant, proportion of the flip-flops!

Attractive Potential

DVFS Exacerbating Imbalance

• Generally, the timing margin of longer paths diminishes much faster than that of shorter ones

• Assume that a path's delay is the sum of the delays of the gates on the path

• TG: the gate delay

• Delta: the change in gate delay caused by scaling the voltage down

[Figure: a flip-flop (FF) between a path of n gates and a path of m gates, with slacks Slack_dn and Slack_up.]

Define: ΔS = |Slack_dn − Slack_up|

• Before voltage scaling down: ΔS1 = (n − m) × TG

• After voltage scaling down: ΔS2 = (n − m) × (TG + Delta)

ΔS1 < ΔS2
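A small numeric sketch may help make the growth of the imbalance concrete. The gate counts and delay values below are hypothetical and chosen only for illustration; the formula is the one stated on this slide, under its assumption that path delay is the sum of gate delays.

```python
def slack_imbalance(n: int, m: int, gate_delay: float) -> float:
    """Imbalance between two paths of n and m gates: |n - m| * gate delay."""
    return abs(n - m) * gate_delay


# Hypothetical values: a 20-gate path and a 12-gate path, unit gate delay,
# and a 25% increase in gate delay after the voltage is scaled down.
n, m = 20, 12
tg, delta = 1.0, 0.25

ds1 = slack_imbalance(n, m, tg)          # before voltage scaling
ds2 = slack_imbalance(n, m, tg + delta)  # after voltage scaling

print(ds1, ds2)
```

With these illustrative numbers the imbalance grows from 8 to 10 gate-delay units, which matches the slide's claim that ΔS1 < ΔS2 whenever Delta is positive and the two paths differ in length.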

Example

If the imbalance is utilized…

• Check the lower bound of the cycle period T

• Traditionally: T1 = n × (TG + Delta)

• From MicroFix’s perspective: T2 = (m + n)/2 × (TG + Delta) ≤ T1 − TH

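[Figure: left, the cycle period T versus the gate delay δ, with lines labeled n (without MicroFix) and (m+n)/2 (with MicroFix). Right, the frequency F versus 1/V, with curves labeled 1/n (without MicroFix) and 2/(m+n) (with MicroFix). Here F = 1/T and δ = δ(V).]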

Note: this analysis excludes the UAFFs
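The two lower bounds in this example can be compared numerically. The sketch below uses hypothetical gate counts and delays; only the two formulas, T1 = n × (TG + Delta) and T2 = (m + n)/2 × (TG + Delta), come from the slide.

```python
def cycle_lower_bounds(n: int, m: int, gate_delay: float, delta: float):
    """Return (T1, T2): the cycle-period lower bounds without and with
    balancing the n-gate and m-gate paths."""
    per_gate = gate_delay + delta      # gate delay after voltage scaling
    t1 = n * per_gate                  # bounded by the longer path alone
    t2 = (m + n) / 2 * per_gate        # slack shared between the two paths
    return t1, t2


# Hypothetical values: n = 20, m = 12, TG = 1.0, Delta = 0.25.
t1, t2 = cycle_lower_bounds(20, 12, 1.0, 0.25)
speedup = t1 / t2   # ratio of achievable frequencies, since F = 1/T

print(t1, t2, speedup)
```

With these numbers the bound drops from 25 to 20 delay units, so the same voltage could in principle support a 1.25x higher frequency. The saving equals (n − m)/2 gate delays and vanishes when the two paths are already balanced.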

How to deal with UAFFs?

• Two-supply voltage scheme [Usami, JSSC’98] [Ghosh, TCAD’07]

• Critical Isolation (CI): the critical paths that result in UAFFs

• The supply voltage of the Critical Isolation is more conservative than that of the logic outside it.

[Figure: the Critical Isolation is powered by the conservative voltage; the remaining logic, the exploitable scope of MicroFix, is powered by the aggressive voltage.]

How to “Fix”?

• Two-supply voltage scheme

• Timing sensors [Yan, DATE’09] [Agarwal, VTS’07]

• Multiple-phase Clocks (generated by a DLL)

[Figure: MicroFix architecture. Labeled elements: the logic and flip-flops of the (K-1)th and Kth stages of the target pipeline; timing sensors producing delay error prediction signals; the voltage/frequency control; the normal and conservative voltage supplies; and the clocks CLK, FCLK, and BCLK, with FCLK and BCLK offset from CLK by T×TH, driving the UAFF, FAFF, GFF, and BAFF flip-flops.]

Operational Principles

[Figure: (a) Traditional DVFS. To reduce power, the frequency and the voltage are both reduced; to increase performance, the voltage and the frequency are both increased, moving between fixed (V, F) operating points.

(b) MicroFix enhanced DVFS. While monitoring, the voltage is stepped down (V → V − v) as long as no error is predicted and stepped back up (V → V + v) when an error is predicted, restoring a tight margin. Likewise, the frequency is stepped up (F → F + f) as long as no error is predicted and stepped back down (F → F − f) when an error is predicted, restoring a tight margin.]

Ensure that the restored margins ‘v’ and ‘f’ are large enough to guard safe voltage and frequency tuning.
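The voltage side of the MicroFix enhanced loop can be sketched as follows. This is a simplified model, not the authors' controller: the function name, the integer-millivolt representation, the lower limit `v_min_mv`, and the `error_predicted` callback standing in for the timing sensors are all assumptions made for illustration.

```python
from typing import Callable


def tune_voltage(v_start_mv: int, v_step_mv: int, v_min_mv: int,
                 error_predicted: Callable[[int], bool]) -> int:
    """Step the supply voltage down until the sensors predict a timing
    error, then step back up once to restore a margin of one step."""
    v = v_start_mv
    while v - v_step_mv >= v_min_mv:
        v -= v_step_mv              # V -> V - v, keep monitoring
        if error_predicted(v):
            v += v_step_mv          # V -> V + v, restore a tight margin
            break
    return v


# Hypothetical sensor model: errors are predicted below 850 mV.
settled = tune_voltage(1000, 50, 600, lambda mv: mv < 850)
print(settled)
```

In this sketch the step size plays the role of the margin ‘v’: it has to be large enough that the voltage one step above the first predicted error is still safe. The frequency loop would be symmetric, stepping F up while no error is predicted and back down by ‘f’ otherwise.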

Experimental Setup

• Gate-level

• Study the adaptability and overhead with a synthesized FPU

– Timing info. -> PrimeTime

• Transistor-level

• Investigate the power-performance tradeoffs with Hspice simulations

– 32nm PTM models dedicated to HP and LP applications, respectively.

Exploring Design Tradeoffs

• ‘TH’ plays a critical role in determining the ultimate efficiency

[Figure: two diagrams of the Critical Isolation.]

Smaller ‘TH’, smaller CI, but less aggressive voltage reduction!

[Figure: number of flip-flops (log scale, 1 to 10000) in each class (GFF, FAFF, BAFF, UAFF) for TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 of the cycle period.]

Larger ‘TH’, larger CI, but more aggressive voltage reduction!

Which ‘TH’ is optimal?

Exploring Design Tradeoffs /2

• Percentage of Cells in Critical Isolation

TH                   0.1     0.15    0.2     0.25    0.3     0.35    0.4
Percentage of cells  0.00%   0.00%   0.16%   2.04%   10.82%  22.70%  33.52%

• Sensor Area Overhead

• The area of a sensor is about 8x that of a pipeline flip-flop (based on the number of transistors) [Yan, DATE’09]

• The paths in the Critical Isolation and those with ‘over-large’ slack (i.e., slack > T × TH + tmargin) do not need to be monitored by sensors

TH                    0.1     0.15    0.2     0.25    0.3     0.35    0.4
Sensor area overhead  0.00%   2.10%   3.75%   9.20%   12.34%  10.95%  9.97%
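The rule for deciding which paths need a sensor can be written down directly. The sketch below is an illustration under stated assumptions: the function and parameter names are hypothetical, slack and the cycle period T share one time unit, and TH is a fraction of T. The 8x factor is the transistor-count ratio quoted on the slide.

```python
SENSOR_TO_FF_AREA_RATIO = 8  # one sensor is about 8 pipeline flip-flops


def needs_sensor(slack: float, in_critical_isolation: bool,
                 cycle: float, th: float, t_margin: float) -> bool:
    """A path is monitored unless it lies in the Critical Isolation or
    its slack is 'over-large' (slack > T x TH + tmargin)."""
    if in_critical_isolation:
        return False
    return slack <= cycle * th + t_margin


def sensor_area_in_ff_equivalents(n_sensors: int) -> int:
    """Total sensor area expressed in pipeline flip-flop equivalents."""
    return SENSOR_TO_FF_AREA_RATIO * n_sensors
```

This also suggests why the overhead in the table is not monotonic in TH: a larger TH widens the monitored slack window, but it also moves more paths into the Critical Isolation, where no sensors are needed.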


• Sensor Power Overhead

• In the most pessimistic case (TH = 0.3, all sensors simultaneously flag timing errors): 14%

• However, this worst-case power overhead is unlikely to occur, for three reasons:

1) Sensors do not need to be always on

2) It is almost impossible for all sensors to flag impending timing errors simultaneously

3) TH = 0.3 is not actually an optimal configuration

Therefore, the pessimistic power overhead will not offset much of MicroFix’s efficiency gain!

Hspice Simulations

• Objective: investigate the detailed delay-power relation of the target pipeline

• Ideally, the transistor-level model of the target pipeline would be simulated directly with Hspice; however, that is very labor-intensive and time-consuming.

• So we took an indirect approach to the Hspice simulations

P_total(V, F) = P_comb(V, F) + P_ff(V, F)

1/F = T = t_c + t_setup + t_c-to-q
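The indirect model amounts to composing separately simulated components. A minimal sketch, assuming the combinational and sequential figures at a given voltage have already been obtained from simulation (the function names and the sample numbers are hypothetical):

```python
def cycle_period(t_c: float, t_setup: float, t_c_to_q: float) -> float:
    """T = t_c + t_setup + t_c-to-q."""
    return t_c + t_setup + t_c_to_q


def frequency(t_c: float, t_setup: float, t_c_to_q: float) -> float:
    """F = 1/T."""
    return 1.0 / cycle_period(t_c, t_setup, t_c_to_q)


def total_power(p_comb: float, p_ff: float) -> float:
    """P_total = P_comb + P_ff, both taken at the same (V, F) point."""
    return p_comb + p_ff


# Hypothetical normalized values at one operating point.
T = cycle_period(0.75, 0.125, 0.125)
F = frequency(0.75, 0.125, 0.125)
P = total_power(0.5, 0.25)
print(T, F, P)
```

The appeal of this decomposition is that the combinational and flip-flop components can each be characterized once on small circuits, then combined for any operating point of the pipeline.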

[Figure: (a) normalized power and (b) normalized delay versus supply voltage (1 V down to 0.6 V), for the High Perf. and Low Power models.]

Combinational Component

• ISCAS85 (c432, c499, c880, c1355, c1908, c2670)

• 32nm PTM models (HP and LP versions)

The normalized V-D and V-P relations hold well across all of the simulated benchmarks!

Sequential Component

• V-D

• V-P

[Figure: (a) normalized delay (t_setup + t_c-to-q) versus supply voltage (1 V down to 0.6 V) and (b) normalized power versus supply voltage for α = 1, 0.5, and 0.25, for the High Perf. and Low Power models.]

Efficiency Comparison

TH = 0.2 is an optimal choice!

Efficiency improvement: 35% EDP, 28% PDP

[Figure: sensor area overhead versus TH: 0.00%, 2.10%, 3.75%, 9.20%, 12.34%, 10.95%, and 9.97% at TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4.]
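For readers less familiar with the two metrics, the sketch below shows how a relative improvement in PDP and EDP is usually computed, using the standard definitions PDP = power × delay and EDP = power × delay². The operating points are hypothetical and are not the values behind the 35% and 28% figures reported here.

```python
def pdp(power: float, delay: float) -> float:
    """Power-delay product (energy per operation)."""
    return power * delay


def edp(power: float, delay: float) -> float:
    """Energy-delay product: energy per operation times delay."""
    return power * delay * delay


def improvement(baseline: float, improved: float) -> float:
    """Fractional reduction relative to the baseline."""
    return 1.0 - improved / baseline


# Hypothetical normalized operating points (power, delay).
base = (1.0, 1.0)
tuned = (0.75, 1.0)   # same delay at lower power

pdp_gain = improvement(pdp(*base), pdp(*tuned))
edp_gain = improvement(edp(*base), edp(*tuned))
print(pdp_gain, edp_gain)
```

Because EDP weights delay quadratically, it is the more natural metric for high-performance designs, while PDP suits low-power designs where energy per operation dominates.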

Conclusion

• MicroFix can improve DVFS efficiency by exploiting the path-grained adaptability

• The timing imbalance threshold, TH, implies a critical design tradeoff

• Efficiency improves by up to 35% in EDP for HP applications and by up to 28% in PDP for LP applications, at the expense of only 7% area overhead

Thanks!

Q&A
