On Tuning Microarchitecture for Programs

Daniel Crowell, Wenbin Fang, and Evan Samanas

Outline

Goal: Adapt µArch to meet program’s performance/energy requirement during runtime

• Motivation
• A flexible framework for µArch adaptivity
• Feasibility study on different adaptive components
• Case study on adaptive cache (selective-way/set)
• Evaluation on adaptive cache
• Conclusion

Motivation

• Optimizing for all is optimizing for nothing
• Software is more and more complex, and much of it is closed source
• S/W and H/W codesign is infeasible for legacy software

Three Questions for Microarchitecture Adaptivity

• When to adapt? => Policy
  – Interval? Context switch? Function boundary?
• What goal(s)? => Policy
  – Performance first? Performance-power ratio first?
• How to adapt? => Mechanism
  – What technique to use to allow reconfiguration during runtime?

Reference: Lee and Brooks [1], and Albonesi et al. [7]

Adaptivity Framework

Reference: Lee and Brooks [1] and Albonesi et al. [7]

Policy

• Instruction 1: adapt_advise
  – Inspired by “madvise” in OS system calls
  – When to adapt: when this instruction is executed
  – What goal: an operand (performance? energy? both?)
• Instruction 2: adapt_setup
  – Privileged, only used by the OS
  – Operand: whether user programs are allowed to use adapt_advise
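The semantics of the two proposed instructions can be sketched as a toy software model. This is only an illustration: the class `AdaptiveCore`, the `Goal` enum, the `privileged` flag, and the choice to silently ignore a disallowed `adapt_advise` (rather than trap) are assumptions, not details given in the slides.

```python
from enum import Enum


class Goal(Enum):
    """Possible operand values for adapt_advise (names are hypothetical)."""
    PERFORMANCE = 0
    ENERGY = 1
    BOTH = 2


class AdaptiveCore:
    """Toy model of a core that exposes adapt_setup and adapt_advise."""

    def __init__(self):
        self.user_advise_allowed = False   # set by the OS via adapt_setup
        self.goal = Goal.PERFORMANCE       # current adaptation goal

    def adapt_setup(self, allow_user, privileged):
        # Privileged instruction: only the OS may change the permission bit.
        if not privileged:
            raise PermissionError("adapt_setup is privileged")
        self.user_advise_allowed = allow_user

    def adapt_advise(self, goal, privileged=False):
        # Like madvise, this is a hint. Assumption: a user-mode hint is
        # ignored (not trapped) when the OS has not enabled it.
        if not privileged and not self.user_advise_allowed:
            return False
        self.goal = goal
        return True
```

In this model the OS would call `adapt_setup(allow_user=True, privileged=True)` once, after which a user program's `adapt_advise(Goal.ENERGY)` takes effect at the point where it executes.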

Reference: Ipek [5], and Clark [6]
Adding new instructions to SimpleScalar: http://ce.et.tudelft.nl/~demid/SSIAT/

Policy

Application boundary (OS)

Time interval (OS)

Context switching (OS)

User program (Compiler / User program)

[3]

[1][2]

[4]

Feasibility study

• Back up motivation: What should be configured?

• Ideal configuration differs by workload
• L1 Data Cache, TLB, Branch Predictor
• SimpleScalar, Wattch
• 6 programs from SPEC2000Int

Feasibility study (TLB cont.)

Feasibility study (TLB)

Feasibility Study (Branch Predictor)

Feasibility Study (Cache)

Feasibility Study (Cache)

What We Learned

• TLB
  – Variability with # entries
  – Fully-associative better
• Branch Predictor: Combined better
• Cache: Variability in both size and associativity
  – Size Variability > Assoc. Variability
• Cache most interesting
  – Lots of literature

Selective set (Yang et al. 2001)

• Adjust size (# of sets) of L1 cache
  – Double size
  – Shrink by half
• Goal: Decrease static power by reducing leakage
• Adjust by miss rate threshold
• Size-bound
• Focus on I-Cache
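The resize rule above (double or halve the number of sets, driven by a miss rate threshold and limited by a size bound) might look roughly like the following. This is a minimal sketch, not the scheme from Yang et al.: the function name and parameters are hypothetical, and treating the size bound as a lower limit alongside a separate upper limit (the full cache) is an assumption.

```python
def next_num_sets(num_sets, misses, accesses, miss_threshold,
                  min_sets, max_sets):
    """Return the number of enabled sets for the next interval.

    num_sets, min_sets and max_sets are assumed to be powers of two so
    that doubling and halving keep the index bits well defined.
    """
    miss_rate = misses / accesses if accesses else 0.0
    if miss_rate > miss_threshold:
        # Too many misses: double the cache, up to the full size.
        return min(num_sets * 2, max_sets)
    # Miss rate is acceptable: shrink by half, but respect the size bound.
    return max(num_sets // 2, min_sets)
```

For example, with a 1% threshold, a cache with 256 enabled sets that sees 50 misses in 1000 accesses would grow to 512 sets, while one that sees 5 misses would shrink to 128.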

Selective way (Albonesi 1999)

• Disables “unneeded” cache ways
  – Reduces cache switching activity
• When to disable: Extend ISA?
• When to enable: Performance Degradation Threshold
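One way to read the Performance Degradation Threshold is as a bound on how much IPC may be lost relative to running with all ways enabled. The sketch below picks the fewest ways that stay within that bound. The function name, the `ipc_by_ways` input format, and the IPC values in the usage note are hypothetical; the slides do not specify how the threshold is evaluated.

```python
def ways_to_enable(ipc_by_ways, pdt):
    """Pick the fewest ways whose IPC loss stays within the threshold.

    ipc_by_ways maps a number of enabled ways to a measured IPC;
    pdt is the tolerated fractional slowdown (0.05 means 5%).
    """
    full = max(ipc_by_ways)            # largest key = all ways enabled
    baseline = ipc_by_ways[full]
    for ways in sorted(ipc_by_ways):   # try the smallest configuration first
        degradation = (baseline - ipc_by_ways[ways]) / baseline
        if degradation <= pdt:
            return ways
    return full
```

With made-up measurements `{1: 0.80, 2: 0.95, 4: 1.00}` and a 6% threshold, this returns 2 ways; with a threshold of 0 it keeps all 4 ways enabled.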

Evaluation

• Simulator
  – SimpleScalar 3.0
  – Wattch
• Workload
  – 6 programs from SPEC 2000
• Case study: Adaptive Cache

SimpleScalar changes

Two methods used:
• SimpleScalar implementation of Selective Sets
  – Used timer with miss counter to determine sets to disable
  – Power down portions of cache and selectively flush dirty data
• Scripting-based method
  – Can use this same design for both selective sets and selective ways
  – Completely replaces cache when resized, flushes all values at each interval

Application-boundary policy

• Instructions Per Cycle vs Energy Delay
  – IPC: considers only performance (higher better)
  – Energy Delay: considers both performance and power (lower better)
• Smaller cache size
  – Energy delay decreases at first, but rises later
  – Want to choose point where it is smallest

Selective-set Cache

Configuration set at start of program, then remains unchanged
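Choosing the configuration where energy delay is smallest can be sketched as below. It assumes the metric is the usual energy-delay product (energy multiplied by execution time); the function name and the numbers in the usage note are made up for illustration and are not results from this study.

```python
def best_by_energy_delay(results):
    """Return the configuration with the smallest energy-delay product.

    results maps a configuration name to (energy_joules, delay_seconds).
    """
    return min(results, key=lambda cfg: results[cfg][0] * results[cfg][1])
```

For hypothetical measurements `{"64KB": (10.0, 1.0), "16KB": (7.0, 1.1), "4KB": (6.0, 1.6)}`, the products are 10.0, 7.7 and 9.6, so the 16 KB cache is chosen: shrinking helps at first, then the extra delay outweighs the energy saved.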

Application-boundary policy

Selective-way Cache

• Similar tradeoffs in IPC and Power to Selective Set

• Fewer choices
  – SimpleScalar limits associativity to powers of two
  – Unlike for the number of sets, a power-of-two limit is not normally necessary for associativity

Time-interval policy

• Reconfigurations occur every so many CPU cycles
• Why?
  – Good if program behavior not known before execution
  – Program may require fewer/more cached data later in execution
• For our cache study: Relies on % cache misses to determine reconfiguration
• Performance hit to changing too frequently
  – May oscillate between two roughly equivalent states
  – Reconfiguration requires temporarily halting, possibly flushing values from cache
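The oscillation problem can be reproduced with a small simulation of an interval policy. This is a sketch under simplifying assumptions: each cache size has one fixed miss rate, the policy doubles or halves the size at every interval, and the function name and input values are hypothetical.

```python
def run_intervals(miss_rate_by_size, start_size, threshold, intervals):
    """Simulate a miss-rate-driven interval policy.

    miss_rate_by_size maps each available cache size to its (fixed)
    miss rate. Returns the size used in each interval and the number
    of reconfigurations performed.
    """
    size, sizes, reconfigs = start_size, [], 0
    for _ in range(intervals):
        sizes.append(size)
        rate = miss_rate_by_size[size]
        if rate > threshold and size * 2 in miss_rate_by_size:
            new_size = size * 2       # too many misses: grow
        elif rate <= threshold and size // 2 in miss_rate_by_size:
            new_size = size // 2      # miss rate acceptable: shrink
        else:
            new_size = size           # no larger/smaller size available
        if new_size != size:
            reconfigs += 1            # each change costs a halt or flush
        size = new_size
    return sizes, reconfigs
```

With two sizes whose miss rates straddle the threshold, for example `{32: 0.011, 64: 0.009}` and a 1% threshold, the policy switches size at every interval and pays the reconfiguration cost each time, even though the two states are roughly equivalent.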


Selective-set Cache

• What is the minimum allowed cache miss rate? (1%, 2%, 3%, 4%? – policy choice)
• Notice positive energy delay on right graph (not good!)
  – Never resizes down, since miss rate always higher than 1%
  – So all adaptivity adds is overhead under those circumstances

[Figure: two graphs, x-axis: Cache miss rate]


Selective-way Cache

[Figure: two graphs, x-axis: Cache miss rate]

• Again, similar to selective sets
• Differences dependent upon program being executed

Cache miss rate

Program   8-way   4-way   2-way   64 KB   16 KB   4 KB
Gcc       0.71%   1.11%   1.19%   1.09%   2.64%   5.42%
Gzip      1.16%   1.68%   2.41%   1.16%   2.93%   4.02%
Mcf       0.11%   0.11%   0.13%   0.11%   0.14%   0.15%
Perlbmk   0.43%   0.51%   0.78%   0.51%   0.70%   5.87%
Vortex    0.19%   0.45%   1.34%   0.35%   1.90%   5.7%
Vpr       3.91%   4.58%   5.53%   4.55%   5.74%   8.54%

Decreasing number of ways or sets almost always increases miss rate

Problem mentioned earlier: the miss rates of Gzip and Vpr are always higher than 1%, so a dynamic reconfiguration threshold of 1% never lets them resize down
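The observation about Gzip and Vpr can be checked directly against the miss rate table. The sketch below copies the table values and lists the programs whose miss rate exceeds a given threshold in every configuration; the function name is hypothetical.

```python
# Miss rates in percent, in table order:
# 8-way, 4-way, 2-way, 64 KB, 16 KB, 4 KB
MISS_RATE_PCT = {
    "Gcc":     [0.71, 1.11, 1.19, 1.09, 2.64, 5.42],
    "Gzip":    [1.16, 1.68, 2.41, 1.16, 2.93, 4.02],
    "Mcf":     [0.11, 0.11, 0.13, 0.11, 0.14, 0.15],
    "Perlbmk": [0.43, 0.51, 0.78, 0.51, 0.70, 5.87],
    "Vortex":  [0.19, 0.45, 1.34, 0.35, 1.90, 5.7],
    "Vpr":     [3.91, 4.58, 5.53, 4.55, 5.74, 8.54],
}


def always_above(table, threshold_pct):
    """Programs whose miss rate exceeds the threshold in every configuration."""
    return [prog for prog, rates in table.items() if min(rates) > threshold_pct]


print(always_above(MISS_RATE_PCT, 1.0))  # ['Gzip', 'Vpr']
```

For these two programs a 1% threshold can never trigger a downsize, so the adaptive machinery only adds overhead.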

Conclusion

• Adaptivity is useful
  – Tune for different program requirements
  – Save power
• A flexible adaptivity framework
  – Mechanism
  – Policy
• Cache just one of many areas where this is useful

Reference

[1] B. C. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitectural adaptivity. In ASPLOS, 2008.
[2] S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. Vijaykumar. An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance i-caches. In HPCA, 2001.
[3] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In JILP, 2000.
[4] M. C. Huang, J. Renau, and J. Torrellas. Positional adaptation of processors: application to energy reduction. In ISCA, 2003.
[5] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating software diversity in chip multiprocessors. In ISCA, 2007.
[6] M. Clark and L. K. John. Performance evaluation of configurable hardware features on the AMD-K5. In ICCD, 1999.
[7] D. H. Albonesi, R. Balasubramonian, S. G. Dropsho, S. Dwarkadas, E. G. Friedman, M. C. Huang, V. Kursun, G. Magklis, M. L. Scott, G. Semeraro, P. Bose, A. Buyuktosunoglu, P. W. Cook, and S. E. Schuster. Dynamically tuning processor resources with adaptive processing. In Computer, 2003.

Questions?
