

Scaling the Real-time Capabilities of Powertrain Controllers in Automotive Systems

Background
- The complexity and size of automotive control systems are increasing.
- Such systems have hard real-time constraints, so high-performance yet predictable control systems are needed.

PI: Aviral Shrivastava, Compiler Microarchitecture Lab, Arizona State University

[Framework diagram] Inputs - Task Info: source code, task period; Architecture Info: # of cores, SPM size per core, core speed. Analysis and synthesis steps: mapping/scheduling, inter-task SPM partitioning, and WCRT estimation (Inter-task Analyzer); WCET estimation and intra-task SPM partitioning (Intra-task Analyzer); code generation and inter-task runtime management (Code Synthesizer).

[Trend figure: application size/complexity over time] From 4-bit microprocessors in the '80s, to multiple processors over an in-vehicle network in the 2000s, to vehicles communicating with other vehicles and road infrastructure by the 2020s.
- Increasing regulations on fuel efficiency & emissions
- More and more ADAS (Advanced Driver Assistance Systems)

Problem
- Current control-system design in practice relies on testing, even though correctness can only be guaranteed by static analysis at design time, not by testing.
- This is because of the unacceptable amount of pessimism in the analysis results.

[Figure: histogram of execution times vs. number of measurements] Testing observes only a subset of all execution scenarios, while the safe upper bound found by analysis lies far beyond all of them. Too much overestimation!

Caches: The main culprit
- HW uncertainties are becoming more significant due to ever-increasing application sizes and system complexity.
- Caches are the main source of these uncertainties.

SW uncertainties (loop bounds, pointer values, infeasible paths, ...) vs. microarchitectural-state uncertainties (caches, pipeline, branch predictors).

[Figure: significance over time] SW uncertainties are mitigated by MISRA C coding standards, user annotations, ...

•  Application size grows beyond on-chip memory sizes

•  Resource sharing in multi-tasking on multi-core architectures

Solution: Using scratchpad memories (SPMs)

Caches make:
- WCET analysis difficult and pessimistic
- WCRT analysis practically impossible

- Replace caches with scratchpad memories (SPMs)
- SPMs have the following benefits:
  - Tight WCET estimation thanks to explicit management
  - Better performance than caches
- [DAC 2013] SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs)
- [CODES+ISSS 2013] CMSM: An Efficient and Effective Code Management for Software Managed Multicores

SPM Management Framework

[Figure] An SPM with two regions (0 and 1), managed by explicit DMA instructions in the code: ... DMA load A to 0 ... DMA load B to 1 ... DMA load C to 0 ...

Predictable memory behavior

-  Using SPMs enables various optimizations in compiler and scheduler to improve real-time capabilities, tailored to each application (no hardwired logic).
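To illustrate why explicit management makes SPM timing analyzable, here is a minimal sketch (function names, region mapping, and the DMA cost are all hypothetical): every transfer is an explicit instruction chosen by the compiler, so the worst-case memory cost of a trace can be summed exactly.

```python
def spm_dma_cost(trace, mapping, dma_cost=100):
    """Count the cost of DMA loads for a code-overlay trace.

    trace   -- sequence of function names in execution order
    mapping -- dict: function -> SPM region id (the compiler's choice)
    A function must be (re)loaded whenever its region holds another function.
    """
    region_contents = {}          # region id -> function currently resident
    cost = 0
    for f in trace:
        r = mapping[f]
        if region_contents.get(r) != f:
            region_contents[r] = f
            cost += dma_cost      # deterministic, statically visible transfer
    return cost

# Two regions, three functions: 'a' keeps region 0; 'b' and 'c' share region 1.
trace = ["a", "b", "a", "c", "a", "b"]
mapping = {"a": 0, "b": 1, "c": 1}
print(spm_dma_cost(trace, mapping))  # loads of a, b, c, b -> 4 * 100 = 400
```

Unlike a cache, the result depends only on the trace and the mapping, never on an unknown initial hardware state.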

Results with Code Management

[Figure 5: Normalized WCET estimates, cache vs. SPM; bars show Cache C, Cache L, SPM C, SPM L, SPM M.] Benchmarks (code size, max function size) and WCET change with the SPM: matmult (1.1KB, 0.3KB): -0.1%; lms (3.9KB, 1KB): -0.2%; compress (3.1KB, 0.9KB): -24%; sha (3.2KB, 1.2KB): -43%; edn (4.6KB, 1.9KB): -8%; adpcm (8.3KB, 2.3KB): -9%; rijndael (9.3KB, 3.1KB): -81%; statemate (11KB, 3.5KB): -70%; 1REG (28KB, 7.6KB): -76%; DAP1 (36KB, 28KB): -90%; susan (51KB, 11KB): -89%; DAP3 (56KB, 42KB): -79%.

Figure 5: WCET is reduced by up to 90% compared to a 4-way set-associative cache of 4KB with LRU replacement policy.

the cache miss overhead is relatively significant due to low iteration counts (< 50), so our approach shows a more significant WCET reduction thanks to the burst-mode DMA transfers. The WCET reduction is significant for the larger benchmarks. This is mainly caused by large nested loops with if-then-else or switch-case conditions, which appear frequently in these benchmarks. Without conditional statements, once an instruction is loaded into a cache line by a cold miss, fetches of that instruction can become cache hits for the rest of the iterations. However, when there are many conditional statements in a loop, the number of instructions competing for a cache line in an abstract cache state increases and finally becomes larger than the associativity of the cache, which results in always-miss (assuming cache misses for all iterations) [4]. This effect becomes significant in nested loops, since the iteration counts can be high. With the SPM, on the other hand, DMA operations in loops can be avoided in most cases thanks to the code mapper, which can find an optimal partition-to-region mapping for the WCET in the given SPM size. Splitting functions into partitions helps greatly here, because it can shrink functions' footprints by keeping only the important parts, such as loop bodies, in the SPM. The largest functions of '1REG', 'DAP1', 'susan', and 'DAP3' are much larger than 4KB; these benchmarks cannot run on a 4KB SPM without function splitting.
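The always-miss effect described above can be put into a back-of-the-envelope cost model (all costs and counts here are invented for illustration, not measured): once the blocks competing for a cache set inside a loop exceed the associativity, the analysis must assume a miss on every iteration, while the SPM pays one burst DMA per block up front.

```python
def cache_wcet(iters, competing_blocks, assoc, hit=1, miss=10):
    """Analyzed worst-case cost of a loop body under a set-associative cache."""
    if competing_blocks > assoc:
        # Always-miss classification: every access may miss on every iteration.
        return iters * competing_blocks * miss
    # Otherwise only the cold misses, then hits for the remaining iterations.
    return competing_blocks * miss + (iters - 1) * competing_blocks * hit

def spm_wcet(iters, competing_blocks, dma=20, access=1):
    """Worst-case cost with an SPM: one burst DMA per block, then fast accesses."""
    return competing_blocks * dma + iters * competing_blocks * access

print(cache_wcet(100, 5, 4))  # 5 blocks > 4 ways: 100 * 5 * 10 = 5000
print(spm_wcet(100, 5))       # 5 * 20 + 100 * 5 * 1    =  600
```

The gap widens with the iteration count, matching the larger reductions seen on the benchmarks with big nested conditional loops.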

5. CONCLUSIONS

This paper presents a new code management technique for SMM architectures. The proposed technique splits functions into partitions, whereas all previous techniques manage code at the granularity of functions. Splitting functions can improve SPM space utilization and remove the limitation on application sizes caused by function-granularity management. The functions to split and the number of partitions are selected heuristically using a feedback loop from an optimal code mapping technique that allocates SPM space for each partition and estimates the WCET. The results on various benchmarks, including automotive control applications, show that the proposed technique can significantly reduce the WCET estimates compared to the optimal solutions found by the function-granularity technique and a 4-way set-associative cache of the same size. Our future work includes making the function splitting algorithm optimal and finding the minimal SPM size for a task to meet its deadline.

6. REFERENCES

[1] K. Bai, J. Lu, A. Shrivastava, and B. Holton. CMSM: An Efficient and Effective Code Management for Software Managed Multicores. In Proc. CODES+ISSS, 2013.
[2] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, 2010.
[3] D. Broman, P. Derler, and J. C. Eidson. Temporal issues in cyber-physical systems. JIISc, 93(3), 2013.
[4] C. Cullmann. Cache persistence analysis: Theory and practice. ACM TECS, 12(1s), 2013.
[5] H. Falk and J. C. Kleinsorge. Optimal Static WCET-Aware Scratchpad Allocation of Program Code. In Proc. DAC, 2009.
[6] H. Falk and H. Kotthaus. WCET-driven cache-aware code positioning. In Proc. CASES, 2011.
[7] Gurobi Optimization, Inc. Gurobi Optimizer Reference Manual, 2013.
[8] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen WCET Benchmarks - Past, Present and Future. In Proc. WCET, 2010.
[9] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. WWC, 2001.
[10] S. Hepp and F. Brandner. Splitting functions into single-entry regions. In Proc. CASES, 2014.
[11] S. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, 2010.
[12] Y. Kim, D. Broman, J. Cai, and A. Shrivastava. WCET-Aware Dynamic Code Management on Scratchpads for Software-Managed Multicores. In Proc. RTAS, 2014.
[13] M. Schoeberl. A time predictable instruction cache for a Java processor. LNCS, 3292, 2004.
[14] A. Schranzhofer, J.-J. Chen, and L. Thiele. Timing analysis for TDMA arbitration in resource sharing systems. In Proc. RTAS, 2010.
[15] L. Thiele and R. Wilhelm. Design for timing predictability. Real-Time Systems, 28, 2004.
[16] J. Whitham and N. Audsley. Implementing Time-Predictable Load and Store Operations. In Proc. EMSOFT, 2009.
[17] H. Wu, J. Xue, and S. Parameswaran. Optimal WCET-Aware Code Selection for Scratchpad Memory. In Proc. EMSOFT, 2010.

[Chart legend] 1REG, DAP1, and DAP3 are Toyota proprietary powertrain controller models. Cost breakdown: management overhead, DMA operations overhead, computation time, cache miss overhead.

4KB SPM vs. 4KB 4-way set-associative cache (LRU)

Acknowledgement

This work was supported in part by the Center for Hybrid and Embedded Software Systems (CHESS) at UC Berkeley, Toyota Motors, and the National Science Foundation grant CNS 1525855. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the sponsors.

- SPM space is partitioned per task, and per type of data, to privatize memory accesses.
- SPM space is allocated to minimize memory interference on the worst-case execution path.
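A toy sketch of the two-level split described above (task names, footprints, and the proportional heuristic are all hypothetical; the poster's actual allocator optimizes the worst-case path): SPM space is first divided between tasks, then each task's share is divided between code, stack, and global data.

```python
def partition_spm(spm_size, task_demands):
    """Split the SPM proportionally to each task's worst-case footprint,
    then split each task's share across its data types the same way.

    task_demands -- dict: task -> {data type -> footprint}
    Returns dict: task -> {data type -> allocated SPM bytes}.
    """
    total = sum(sum(d.values()) for d in task_demands.values())
    shares = {}
    for task, demand in task_demands.items():
        task_share = spm_size * sum(demand.values()) // total
        dt_total = sum(demand.values())
        shares[task] = {dt: task_share * sz // dt_total
                        for dt, sz in demand.items()}
    return shares

demands = {"t1": {"code": 4, "stack": 1, "global": 1},
           "t2": {"code": 2, "stack": 1, "global": 1}}
parts = partition_spm(10, demands)
print(parts["t1"]["code"])  # t1 gets 6 of 10 units; 4 of those go to code
```

Because each task (and each data type) owns a private partition, tasks cannot evict each other's contents, which is what makes the WCRT analysis tractable.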

[Figure] Four cores, each with a local SPM. Each SPM is divided into per-task partitions (τ1, τ2, τ3), and each task's partition into per-data-type partitions (code, stack, global; functions f1, f2, f3, f4). Steps: task mapping, inter-task SPM partitioning, intra-task SPM partitioning, SPM space allocation.

Caches in WCET Analysis

Jan Reineke, Saarland University

Contact Info: eMail: [email protected]; Web: http://rw4.cs.uni-sb.de/~reineke; Phone: +49 681 302 55 73; Fax: +49 681 302 30 65

Uncertainty in Cache Analysis

read z

read y

read x

write z

1. Initial cache contents unknown.

2. Need to combine information.

3. Cannot resolve address of z.

⇒ The amount of uncertainty is determined by the ability to recover information.
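Point 2 ("need to combine information") is what the join of an abstract cache analysis does. Below is a small sketch of LRU must-analysis (the standard abstract domain from Ferdinand-style cache analysis, simplified to one cache set): each block carries an upper bound on its age, and joining two paths keeps a block only if it is provably cached on both, with the worse (larger) age.

```python
def must_update(state, block, assoc):
    """Abstract LRU access: the block gets age 0; strictly younger blocks
    age by one; blocks aged out of the set are dropped."""
    old_age = state.get(block, assoc)        # assoc means "not provably cached"
    new = {}
    for b, age in state.items():
        if b != block:
            a = age + 1 if age < old_age else age
            if a < assoc:
                new[b] = a
    new[block] = 0
    return new

def must_join(s1, s2):
    """Combine two paths: keep blocks cached on both, with the maximal age."""
    return {b: max(s1[b], s2[b]) for b in s1.keys() & s2.keys()}

# Path 1 accesses x then y; path 2 accesses only x (4-way set).
s1 = must_update(must_update({}, "x", 4), "y", 4)   # {'x': 1, 'y': 0}
s2 = must_update({}, "x", 4)                        # {'x': 0}
print(must_join(s1, s2))                            # {'x': 1}: only x is a sure hit
```

After the join, only x can be classified as a guaranteed hit, and only with the pessimistic age 1; every join like this loses information, which is one source of the overestimation discussed earlier.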

Jan Reineke, Caches in WCET Analysis, November 7, 2008

Predictability Metrics

[Figure] The metrics Evict and Fill, illustrated on cache-set states evolving under the access sequence ⟨a, ..., e, f, g, h⟩: after Evict accesses, all prior contents are guaranteed evicted; after Fill accesses, the cache-set state is completely determined.


Evaluation of Policies

Policy | Evict(k)          | Fill(k)               | Evict(8) | Fill(8)
LRU    | k                 | k                     | 8        | 8
FIFO   | 2k - 1            | 3k - 1                | 15       | 23
MRU    | 2k - 2            | ∞ / 3k - 4            | 14       | ∞ / 20
PLRU   | (k/2)·log2 k + 1  | (k/2)·log2 k + k - 1  | 13       | 19
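The table's closed forms can be written out directly (MRU's Fill is omitted because the table lists two variants for it); evaluating them at k = 8 reproduces the last two columns.

```python
import math

def evict(policy, k):
    """Accesses needed to guarantee all prior contents of a k-way set are evicted."""
    return {"LRU": k,
            "FIFO": 2 * k - 1,
            "MRU": 2 * k - 2,
            "PLRU": k // 2 * int(math.log2(k)) + 1}[policy]

def fill(policy, k):
    """Accesses needed until the cache-set state is completely determined."""
    return {"LRU": k,
            "FIFO": 3 * k - 1,
            "PLRU": k // 2 * int(math.log2(k)) + k - 1}[policy]

print([evict(p, 8) for p in ("LRU", "FIFO", "MRU", "PLRU")])  # [8, 15, 14, 13]
print([fill(p, 8) for p in ("LRU", "FIFO", "PLRU")])          # [8, 23, 19]
```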

LRU is optimal with respect to the metrics; other policies are much less predictable.

⇒ Use LRU.

How can may- and must-information be obtained, within the given limits, for other policies?


Evaluation of Policies
- LRU is optimal w.r.t. the metrics; FIFO, MRU, and PLRU are much less predictable.
⇒ use LRU

Relative Competitiveness of Cache Replacement Policies

Jan Reineke, Daniel Grund, Saarland University

Introduction
In order to fulfill stringent performance requirements, caches are now also used in hard real-time systems. In such systems, upper and lower bounds on the execution times of a task have to be computed. To obtain tight bounds, timing analyses must take into account the cache architecture. However, developing cache analyses - analyses that determine whether a memory access is a hit or a miss - is a difficult problem for some cache architectures.

Goal
Determine safe bounds on the number of cache hits and misses by a task T under FIFO(k), PLRU(l), or any other replacement policy.

Approach

1. Determine the competitiveness of the desired policy P relative to a policy Q:
   m_P ≤ k · m_Q + c

2. Compute a performance prediction of task T for policy Q by cache analysis, yielding a bound on m_Q(T).

3. Calculate upper bounds on the number of misses for P using the cache-analysis results for Q and the competitiveness results of P relative to Q:
   m_P(T) ≤ k · m_Q(T) + c
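Step 3 is plain arithmetic once steps 1 and 2 are done. A minimal sketch (the miss bound of 1000 is a hypothetical LRU-analysis result; the pair (4, 3) is the FIFO-vs-LRU entry at associativity 4 from the results table below):

```python
def transfer_bound(m_q_bound, k, c):
    """Since m_P(T) <= k * m_Q(T) + c for every task T, a safe miss bound
    for policy Q transfers to a safe (but looser) miss bound for policy P."""
    return k * m_q_bound + c

lru_bound = 1000   # hypothetical result of an LRU cache analysis of task T
k, c = 4, 3        # FIFO(4) is 4-miss-competitive relative to LRU(4), c = 3
print(transfer_bound(lru_bound, k, c))  # 4003: safe miss bound for FIFO(4)
```

The bound is sound by construction; its looseness is exactly the competitiveness ratio, which is why the ratios in the tables below matter.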

Relative Competitiveness
Competitiveness (Sleator and Tarjan): worst-case performance of an online policy relative to the optimal offline policy.
Relative competitiveness: worst-case performance of an online policy relative to another online policy.

Let m_P(q, s) be the number of misses incurred by policy P, starting in cache-set state q, processing memory access sequence s.

Definition 1 (Relative Miss-Competitiveness). Policy P is k-miss-competitive relative to policy Q with additive constant c if

m_P(p, s) ≤ k · m_Q(q, s) + c

for all access sequences s ∈ S and compatible cache-set states p ∈ C_P, q ∈ C_Q.

"Policy P will incur at most k times the number of misses of policy Q plus constant c on any access sequence."

Hit-competitiveness is defined analogously. The competitive miss ratio c^m_(P,Q) is the smallest k such that P is k-miss-competitive relative to Q with some additive constant.

Computing Competitive Ratios
P and Q induce a transition system that captures how the two policies act relative to each other while processing the same memory accesses.

[Figure 1: A small part of the transition system in the computation of competitiveness results for FIFO(4) vs. LRU(4). Nodes are pairs of cache-set states, e.g. [abcd]_FIFO, [abcd]_LRU; edges are labeled with the memory access and the pair of hit/miss outcomes in the two states, e.g. e (m, m) or c (h, m).]

Competitive ratio = the maximum ratio of misses in policy P relative to the number of misses in policy Q over the transition system.

Quotient Transition System
Problem: the induced transition system is infinitely large.
Goal: construct a finite transition system with the same properties.
Observation: only the relative positions of elements matter:

[Figure] [abc]_LRU, [b_e]_FIFO and [fgl]_LRU, [g_m]_FIFO are equivalent: both pairs make the same transitions, e.g. (h, m) on accessing c resp. l.

Two pairs of cache-set states are equivalent if they can be transformed into each other by a renaming of their contents. Merging equivalent pairs of cache-set states yields a finite quotient transition system that retains the competitiveness properties.
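The equivalence test behind the quotient construction can be sketched by canonicalization (a standard trick; the poster does not prescribe this particular encoding): rename blocks to 0, 1, 2, ... in order of first appearance across both states of a pair, and two pairs are equivalent exactly when their canonical forms coincide.

```python
def canonical(pair):
    """Rename blocks to 0, 1, 2, ... in order of first appearance across
    both cache-set states of the pair; None marks an empty way."""
    names = {}
    out = []
    for state in pair:
        renamed = []
        for b in state:
            if b is not None and b not in names:
                names[b] = len(names)
            renamed.append(None if b is None else names[b])
        out.append(tuple(renamed))
    return tuple(out)

p1 = (("a", "b", "c"), ("b", None, "e"))   # [abc]_LRU, [b_e]_FIFO
p2 = (("f", "g", "l"), ("g", None, "m"))   # [fgl]_LRU, [g_m]_FIFO
print(canonical(p1) == canonical(p2))      # True: the pairs are equivalent
```

Merging states by canonical form is what collapses the infinite transition system into a finite quotient.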

[Figure 2: Equivalent states are similarly colored in Figure 1; merging equivalent states yields the quotient structure depicted here.]

Results
Miss-competitiveness (ratio, constant) (k, c) relating FIFO, PLRU, and LRU:

Associativity:  2     3     4     5     6     7     8
LRU vs FIFO     2,1   3,2   4,3   5,4   6,5   7,6   8,7
FIFO vs LRU     2,1   3,2   4,3   5,4   6,5   7,6   8,7
LRU vs PLRU     1,0   -     2,1   -     -     -     5,4
PLRU vs LRU     1,0   -     ∞     -     -     -     ∞
FIFO vs PLRU    2,1   -     4,4   -     -     -     8,8
PLRU vs FIFO    2,1   -     ∞     -     -     -     ∞

Example: LRU(4) is 2-miss-competitive relative to PLRU(4) with constant 1. PLRU(4) is not miss-competitive relative to LRU(4) at all.
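One entry can be spot-checked empirically with a toy single-set simulator (random sequences from empty initial states, so this is a sanity check, not a proof): FIFO(2) should be 2-miss-competitive relative to LRU(2) with constant 1.

```python
import random

def misses(policy, assoc, seq):
    """Misses of an LRU or FIFO cache set of the given associativity,
    starting empty. state[0] is the newest entry; state[-1] is evicted."""
    state, m = [], 0
    for x in seq:
        if x in state:
            if policy == "LRU":        # LRU refreshes on hits; FIFO does not
                state.remove(x)
                state.insert(0, x)
        else:
            m += 1
            if len(state) == assoc:
                state.pop()
            state.insert(0, x)
    return m

random.seed(0)
for _ in range(1000):
    seq = [random.choice("abcd") for _ in range(30)]
    assert misses("FIFO", 2, seq) <= 2 * misses("LRU", 2, seq) + 1
print("bound (2, 1) held on all sampled sequences")
```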

Hit-competitiveness (ratio, constant) (k, c) relating FIFO, PLRU, and LRU:

Associativity:  2         3       4         5       6         7       8
LRU vs FIFO     0,0       0,0     0,0       0,0     0,0       0,0     0,0
FIFO vs LRU     1/2,1/2   1/2,1   1/2,3/2   1/2,2   1/2,5/2   1/2,3   1/2,7/2
LRU vs PLRU     1,0       -       1/2,1     -       -         -       1/8,15/8
PLRU vs LRU     1,0       -       1/2,1     -       -         -       1/4,3/2
FIFO vs PLRU    1/2,1/2   -       1/4,5/4   -       -         -       1/11,19/11
PLRU vs FIFO    0,0       -       0,0       -       -         -       0,0

Generalizations to arbitrary k
Previously unknown relations:
- PLRU(k) is 1-competitive relative to LRU(1 + log2 k).
- FIFO(k) is 1/2-hit-competitive relative to LRU(k), whereas
- LRU(k) is not l-hit-competitive relative to FIFO(k) for any l, but
- LRU(2k - 1) is 1-competitive relative to FIFO(k) with constant 0 for all k ⇒ yields the first may-analysis for FIFO.
- PLRU(k) is not l-miss-competitive relative to LRU(k) for k ≥ 4 and any l ⇒ PLRU(k) is not l-miss-competitive (in the classical sense) for k ≥ 4.
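The LRU(2k - 1) vs FIFO(k) relation can likewise be checked empirically with a toy simulator (random sequences, both caches starting empty; an illustration, not a proof): an LRU set of associativity 2k - 1 should never miss more often than a FIFO set of associativity k.

```python
import random

def misses(policy, assoc, seq):
    """Misses of an LRU or FIFO cache set, starting empty."""
    state, m = [], 0
    for x in seq:
        if x in state:
            if policy == "LRU":        # only LRU reorders on a hit
                state.remove(x)
                state.insert(0, x)
        else:
            m += 1
            if len(state) == assoc:
                state.pop()            # evict oldest / least recently used
            state.insert(0, x)
    return m

random.seed(1)
k = 2
for _ in range(1000):
    seq = [random.choice("abcde") for _ in range(40)]
    assert misses("LRU", 2 * k - 1, seq) <= misses("FIFO", k, seq)
print("LRU(3) never exceeded FIFO(2) misses on the sampled sequences")
```

This is the relation that lets a (well-understood) LRU may-analysis stand in for the missing FIFO analysis.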

Reference: Jan Reineke and Daniel Grund. Relative Competitive Analysis of Cache Replacement Policies. In ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2008.

Contact information: http://rw4.cs.uni-sb.de/, {reineke|grund}@cs.uni-sb.de

Supported by the


Relative Competitiveness: Automatic Computation - Finite Quotient System - Results and Generalizations

Problem: cache analyses exist for LRU only.
Idea: transfer guarantees for LRU to FIFO, PLRU, etc.
Solution: relative competitiveness.

Definition - Relative Miss-Competitiveness

Notation: m_P(p, s) = number of misses that policy P incurs on access sequence s ∈ M* starting in state p ∈ C_P.

Definition (Relative miss-competitiveness). Policy P is (k, c)-miss-competitive relative to policy Q if

m_P(p, s) ≤ k · m_Q(q, s) + c

for all access sequences s ∈ M* and cache-set states p ∈ C_P, q ∈ C_Q that are compatible, p ∼ q.

Definition (Competitive miss ratio of P relative to Q). The smallest k such that P is (k, c)-miss-competitive relative to Q for some c.


Generalizations
Identified patterns and proved generalizations by hand, aided by example sequences generated by the tool.

Previously unknown facts:
- PLRU(k) is (1, 0)-competitive relative to LRU(1 + log2 k) ⇒ LRU must-analysis can be used for PLRU.
- FIFO(k) is (1/2, (k-1)/2)-hit-competitive relative to LRU(k), whereas LRU(k) is (0, 0)-hit-competitive relative to FIFO(k), but
- LRU(2k - 1) is (1, 0)-competitive relative to FIFO(k), and LRU(2k - 2) is (1, 0)-competitive relative to MRU(k)
  ⇒ LRU may-analysis can be used for FIFO and MRU; optimal with respect to the predictability metric Evict.

A FIFO may-analysis is used in the analysis of the branch target buffer of the Motorola PowerPC 56x.


6.4. RESULTS

Associativity:  2     3     4     5     6     7     8
LRU vs FIFO     2,1   3,2   4,3   5,4   6,5   7,6   8,7
LRU vs PLRU     1,0   -     2,1   -     -     -     5,4
LRU vs MRU      1,0   2,1   3,2   4,3   5,4   6,5   7,6
FIFO vs LRU     2,1   3,2   4,3   5,4   6,5   7,6   8,7
FIFO vs PLRU    2,1   -     4,4   -     -     -     8,8
FIFO vs MRU     2,1   3,3   4,4   5,5   6,6   MEM   MEM
PLRU vs LRU     1,0   -     ∞     -     -     -     ∞
PLRU vs FIFO    2,1   -     ∞     -     -     -     ∞
PLRU vs MRU     1,0   -     ∞     -     -     -     MEM
MRU vs LRU      1,0   2,1   3,2   4,3   5,4   6,5   7,6
MRU vs FIFO     2,1   4,3   6,5   8,7   10,9  MEM   MEM
MRU vs PLRU     1,0   -     4,3   -     -     -     MEM

Figure 6.6: Miss-competitiveness ratios k and additive constants c relating FIFO, PLRU, LRU, and MRU at the same associativity. PLRU is only defined for powers of two. As an example of how this should be read, LRU(4) is 2-miss-competitive relative to PLRU(4) with additive constant 1, whereas PLRU(4) is not miss-competitive relative to LRU(4) at all. ∞ indicates that there is no k such that the policy on the left is k-miss-competitive relative to the policy on the right. MEM indicates that the computation required more than 1.5GB of heap space.

Associativity:  2         3       4         5       6         7       8
LRU vs FIFO     0         0       0         0       0         0       0
LRU vs PLRU     1,0       -       1/2,1     -       -         -       1/8,15/8
LRU vs MRU      1,0       0       0         0       0         0       0
FIFO vs LRU     1/2,1/2   1/2,1   1/2,3/2   1/2,2   1/2,5/2   1/2,3   1/2,7/2
FIFO vs PLRU    1/2,1/2   -       1/4,5/4   -       -         -       1/11,19/11
FIFO vs MRU     1/2,-1/2  0       0         0       0         MEM     MEM
PLRU vs LRU     1,0       -       1/2,1     -       -         -       1/4,3/2
PLRU vs FIFO    0         -       0         -       -         -       0
PLRU vs MRU     1,0       -       0         -       -         -       MEM
MRU vs LRU      1,0       0       0         0       0         0       0
MRU vs FIFO     0         0       0         0       0         MEM     MEM
MRU vs PLRU     1,0       -       0         -       -         -       MEM

Figure 6.7: Hit-competitiveness ratios k and subtractive constants c relating FIFO, PLRU, LRU, and MRU at the same levels of associativity. Again, MEM indicates that the computation required more than 1.5GB.

Measurement-based WCET Analysis
- Run the program on a number of inputs and initial states.
- Combine measurements for basic blocks to obtain a WCET estimate.
- Sensitivity analysis demonstrates that this approach may be dramatically wrong.
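The two steps above can be sketched in miniature (all block names, timings, and paths are invented): take the maximum observed time per basic block and add the maxima up along the longest path. The sensitivity results that follow show why this can badly underestimate the true WCET when timing depends on the initial cache state.

```python
def mbta_estimate(observations, paths):
    """Measurement-based timing estimate.

    observations -- dict: basic block -> list of measured times
    paths        -- feasible paths, each a list of basic blocks
    Returns max over paths of the sum of per-block worst observations.
    """
    worst = {b: max(ts) for b, ts in observations.items()}
    return max(sum(worst[b] for b in p) for p in paths)

obs = {"B1": [10, 12], "B2": [5, 7], "B3": [20, 18]}
paths = [["B1", "B2"], ["B1", "B3"]]
print(mbta_estimate(obs, paths))  # 12 + 20 = 32
```

The estimate is only as good as the coverage of inputs and initial states; a cache state never exercised during measurement can make the real worst case far higher.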


Influence of Initial Cache State

[Figure] Execution time varies between BCET and WCET due to the initial cache state; the analysis upper bound lies beyond the WCET.

Definition (Miss sensitivity). Policy P is (k, c)-miss-sensitive if

m_P(p, s) ≤ k · m_P(p′, s) + c

for all access sequences s ∈ M* and cache-set states p, p′ ∈ C_P.


1. Measure execution times of basic blocks.
2. Combine the measurements for basic blocks to obtain an estimate of the WCET.

Influence of the initial state: how wrong can this get?

Sensitivity Results

Policy:  2     3     4     5     6     7     8
LRU      1,2   1,3   1,4   1,5   1,6   1,7   1,8
FIFO     2,2   3,3   4,4   5,5   6,6   7,7   8,8
PLRU     1,2   -     ∞     -     -     -     ∞
MRU      1,2   3,4   5,6   7,8   MEM   MEM   MEM

LRU is optimal: performance varies in the least possible way. For FIFO, PLRU, and MRU the number of misses may vary strongly. A case study based on a simple model of execution time by Hennessy and Patterson (2003) shows that the WCET may be 3 times higher than a measured execution time for 4-way FIFO.
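The (2, 2) entry for 2-way FIFO can be illustrated with a toy single-set simulator (a handful of initial states and random sequences, so an illustration rather than a proof): miss counts from any two initial states stay within a factor of 2 plus 2 of each other.

```python
import itertools
import random

def fifo_misses(init, seq, assoc=2):
    """Misses of a FIFO cache set of the given associativity,
    starting from the given initial contents (index 0 = newest)."""
    state, m = list(init), 0
    for x in seq:
        if x not in state:            # FIFO ignores hits entirely
            m += 1
            if len(state) == assoc:
                state.pop()           # evict the oldest entry
            state.insert(0, x)
    return m

random.seed(2)
states = [("a", "b"), ("b", "a"), ("c", "d"), ("a", "c")]
for _ in range(500):
    seq = [random.choice("abcd") for _ in range(25)]
    counts = [fifo_misses(s, seq) for s in states]
    for m1, m2 in itertools.permutations(counts, 2):
        assert m1 <= 2 * m2 + 2       # the (2, 2) sensitivity bound
print("FIFO(2) sensitivity bound (2, 2) held on all samples")
```

For LRU the same experiment would show miss counts differing by at most the associativity, which is what "minimal influence of the initial state" means concretely.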


Sensitivity Results
- LRU is optimal: minimal influence of the initial state.
- FIFO, MRU, PLRU: great influence of the initial state.

Predictability: limits on the precision of cache analysis.
Competitiveness: cache analyses for FIFO, PLRU, etc.
Sensitivity: caches in measurement-based WCET analysis.

References:
- Jan Reineke. Caches in WCET Analysis. PhD thesis, Universität des Saarlandes, Saarbrücken, November 2008.
- Jan Reineke and Daniel Grund. Relative competitive analysis of cache replacement policies. In LCTES 2008.
- Jan Reineke, Daniel Grund, Christoph Berg, and Reinhard Wilhelm. Timing predictability of cache replacement policies. Real-Time Systems, 37(2):99-122, November 2007.

[Courtesy of J. Reineke]

[Figure] Task schedule τ1, τ2, τ3, τ1, τ2 over time: indefinite delays due to corrupted cache states.