

Page 1: [Embedded Systems] Dynamic Reconfiguration in Real-Time Systems Volume 4 || Energy Optimization of Cache Hierarchy in Multicore Real-Time Systems

Chapter 4 Energy Optimization of Cache Hierarchy in Multicore Real-Time Systems

Computation using single-core processors has hit the power wall on its way to further performance improvement. Chip multiprocessor (CMP) architectures, which integrate multiple processing units on a single integrated circuit (IC), have been widely adopted by major vendors like Intel, AMD, IBM and ARM in both general-purpose computers (e.g., [48]) and embedded systems (e.g., [2, 83]). Multicore processors are able to run multiple threads in parallel at lower power dissipation per unit of performance. Despite these inherent advantages, energy conservation is still a primary concern in multicore system optimization. While power consumption is a key concern in designing any computing device, energy efficiency is especially critical for embedded systems. Real-time systems that run applications with timing constraints require unique considerations. Due to the ever growing demand for parallel computing, real-time systems commonly employ multicore processors nowadays [140, 148].

As discussed in Chaps. 1 and 3, the prevalence of the on-chip cache hierarchy has made it a significant contributor to the overall system energy consumption. For uniprocessor systems, DCR is an effective technique for cache energy reduction by tuning the cache configuration at runtime. In multicore systems, the L2 cache typically acts as a shared resource. Recent research has shown that the shared on-chip cache may become a performance bottleneck for CMP systems because of contention among tasks running in parallel [62, 100]. To alleviate this problem, cache partitioning (CP) techniques judiciously partition the shared cache and map a designated part of the cache to each core. CP has been designed with the aim of performance improvement [98], inter-task interference elimination [100], thread-wise fairness optimization [67], off-chip memory bandwidth minimization [151] and energy consumption reduction [101].

In this chapter, we study novel energy optimization techniques which efficiently integrate cache partitioning and dynamic reconfiguration in multicore architectures. Tasks are considered with timing constraints. We will discover that DCR in the L1 caches has a great impact on decisions of CP in the shared L2, and vice versa. Moreover, both DCR and CP play important roles in energy conservation. Our goal

W. Wang et al., Dynamic Reconfiguration in Real-Time Systems: Energy, Performance, and Thermal Perspectives, Embedded Systems, DOI 10.1007/978-1-4614-0278-7_4, © Springer Science+Business Media New York 2013



is to minimize the cache hierarchy energy consumption while meeting all timing constraints. Efficient static profiling techniques and algorithms are used to find beneficial L1 cache configurations and L2 partition factors for each task. The approach considers multiple tasks on each core and is thus more general than previous CP techniques, which assume only one application per core [62, 101, 151]. We will examine both fixed and varying CP schemes along with DCR for multicore architectures. We also study the effect on design quality of different deadline constraints, task mappings and gated-Vdd cache lines.

This chapter is organized as follows. Related work is discussed in Sect. 4.1. Section 4.2 describes the architecture model and the motivation of our work. Section 4.3 presents cache energy optimization techniques for CMPs in detail, followed by experimental results in Sect. 4.4. Finally, Sect. 4.5 concludes this chapter.

4.1 Related Work

Cache partitioning techniques have been widely studied for various design objectives in multicore processors. Initially, the majority of them focused on reducing the cache miss rate to improve performance. Suh et al. [119] utilized hardware counters to gather runtime information which is used to partition the shared cache through the replacement unit. Qureshi et al. [98] proposed a low-overhead CP technique based on online monitoring of the cache utilization of each application. Kim et al. [67] focused on fair cache sharing using both dynamic and static partitioning. Recently, CP has been employed for low-power system designs. Reddy et al. [100, 101] showed that, by eliminating inter-task cache interference, both dynamic and leakage energy can be saved. Bank structure aware CP in CMP platforms is studied in [62]. Yu et al. [151] targeted minimizing off-chip bandwidth through off-line profiling. Lin et al. [74] verified the effectiveness of CP in real systems. Meanwhile, CP is also beneficial for real-time systems to improve WCET analysis, system predictability and cache utilization [15, 101]. Nevertheless, these CP techniques only focus on the shared L2 cache and ignore the impact, as well as the optimization opportunities, of the private L1 caches.

4.2 Background and Motivation

In this section, we take a look at important features of the underlying architecture as well as an illustrative example to motivate the need and usefulness of employing DCR and CP simultaneously.


[Figure: m cores (Core 1 ... Core m), each with private IL1 and DL1 caches, sharing an on-chip L2 cache connected to memory]

Fig. 4.1 Typical multicore architecture with shared L2 cache

[Figure: one 8-way L2 cache set divided among Cores 1-4 at way granularity]

Fig. 4.2 Way-based cache partitioning example (four cores with an 8-way associative shared cache)

4.2.1 Architecture Model

Figure 4.1 illustrates a typical CMP platform with private L1 caches (IL1 and DL1) in each core and a shared on-chip L2 cache. Here, both the instruction L1 and data L1 caches associated with each core are highly reconfigurable in terms of total capacity, line size and associativity, as discussed in Sect. 3.1.2.

Unlike the traditional least recently used (LRU) replacement policy, which implicitly partitions each cache set on a demand basis, here we use way-based partitioning in the shared cache [106]. As shown in Fig. 4.2, each L2 cache set (here with 8-way associativity) is partitioned at the granularity of ways. Each core is assigned a group of ways and will only access that portion in all cache sets. LRU replacement is enforced within each individual group, which is achieved by maintaining separate sets of "age bits." It is also possible to divide the cache by sets (set-based partitioning) in

Page 4: [Embedded Systems] Dynamic Reconfiguration in Real-Time Systems Volume 4 || Energy Optimization of Cache Hierarchy in Multicore Real-Time Systems

66 4 Energy Optimization of Cache Hierarchy in Multicore Real-Time Systems

which each core is assigned a number of sets and each set retains full associativity [151]. However, since real-time embedded systems usually have a small number of cores, way-based partitioning is beneficial for exploiting energy efficiency. We refer to the number of ways assigned to each core as its partition factor.¹ For example, the L2 partition factor for Core 1 in Fig. 4.2 is 3.
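To make the way-based scheme concrete, the assignment of ways to cores can be sketched as follows. This is an illustrative Python sketch, not from the book; only Core 1's partition factor of 3 comes from Fig. 4.2, the other factors are hypothetical:

```python
def assign_way_masks(partition_factors, assoc):
    """Assign each core a contiguous group of L2 ways.

    partition_factors[k] is the number of ways for core k; together the
    factors must cover the whole associativity, mirroring constraint (4.3).
    """
    assert sum(partition_factors) == assoc
    assert all(f >= 1 for f in partition_factors)
    masks, next_way = [], 0
    for f in partition_factors:
        masks.append(set(range(next_way, next_way + f)))
        next_way += f
    return masks

# 4 cores sharing an 8-way L2; Core 1 has partition factor 3 as in Fig. 4.2
masks = assign_way_masks([3, 2, 2, 1], assoc=8)
print(masks[0])  # Core 1 may only allocate/replace lines in ways {0, 1, 2}
```

Because every core touches only its own ways in every set, LRU age bits can be kept per group without interference from the other cores.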

In this chapter, we focus on static cache partitioning. In other words, the L2 partitioning scheme for each core is pre-determined at design time and remains the same throughout the system execution. Dynamic partitioning [98] requires online monitoring, runtime analysis and sophisticated OS support, and thus is not feasible for embedded systems. Furthermore, as discussed in Sect. 1.4.2, real-time systems normally have highly deterministic characteristics (e.g., task release time, deadline, input, etc.) which make off-line analysis most suitable [97]. As we learned in Chap. 3, input variation has no or very limited impact on a task's cache behavior in real-time embedded systems. By static profiling, we can potentially search a much larger design space and thus achieve better optimization results.

4.2.2 Motivation

Figure 4.3 shows the number of L2 cache misses and the instructions per cycle (IPC) for two benchmarks (qsort from MiBench [40] and vpr from SPEC CPU2000 [116]) under different L2 cache partition factors p (with an 8-way associative L2 cache) and four randomly chosen L1 cache configurations.² Unallocated L2 cache ways remain idle. We can observe that changing the L1 cache configuration leads to a different number of L2 cache misses. This is expected because the L1 cache configuration determines the number of L2 accesses. System performance (i.e., IPC) is also largely affected by the L1 configurations. Notice that for larger partition factors (e.g., 7), the difference in L2 misses is negligible but IPC shows great diversity. This is because not only the L2 partitioning but also the L1 configurations determine the performance.

It is also interesting to see that vpr shows larger variance at the same L2 partition factor than qsort. For example, the number of L2 cache misses becomes almost identical (although there are differences in the number of L2 accesses) at p = 4 for qsort, while it starts to converge for vpr only after p = 6. The reason behind this is that, for qsort, there are almost only compulsory misses for p ≥ 4. In other words, p = 4 is a sufficient L2 partition factor for qsort in terms of performance. However, increasing p and reducing the number of accesses continue to bring benefit for vpr, as shown in Fig. 4.3c, due to the fact that vpr has more capacity and conflict misses.

More importantly, the performance as well as the energy efficiency of the L1 caches and the L2 cache depend on each other to a large extent, similar to multi-level cache tuning

¹ Note that it is different from the partition factor in Chap. 3.
² Here c18 and c9, for example, stand for the 18th and 9th configurations for IL1 and DL1, respectively.


[Figure: four panels plotting (a) qsort L2 cache misses (10⁶), (b) qsort IPC, (c) vpr L2 cache misses (10⁶) and (d) vpr IPC against L2 partition factors 1-7, each for four L1 configurations (e.g., (c18, c9) for qsort, (c16, c12) for vpr)]

Fig. 4.3 L1 DCR impact on L2 CP in performance

in single-core systems (Sect. 3.3). Considering Fig. 4.3d as an example, we can observe that increasing the L2 partition factor from 3 to 4 for vpr provides significantly more benefit when the L1 cache configuration is (c16, c12) than otherwise. On the other hand, as shown in Fig. 4.3a, L1 cache tuning makes more sense when the L2 partition factor is less than 3. Please note that Fig. 4.3 only presents performance variations. The actual optimization focuses on performance as well as energy efficiency, both of which are determined by the L1 DCR and L2 CP decisions.

Given the above observations, we see that L1 DCR has a major impact on L2 CP and there are interesting trade-offs that can be explored for optimization. Therefore, both DCR and CP should be exploited simultaneously for energy conservation in real-time multicore systems. Experimental results in Sect. 4.4.2 confirm this conjecture.


4.3 Dynamic Cache Reconfiguration and Partitioning

In this section, we first formulate the energy optimization problem based on performing dynamic reconfiguration of the private L1 caches and static partitioning of the shared L2 cache (DCR + CP). Next, we describe the static profiling strategy. The dynamic programming based algorithm which utilizes the static profiling information is then described in detail. We then investigate the effects of mapping tasks to different cores. We also explore the effectiveness of employing dynamic cache partitioning. Finally, we study the benefit of using gated-Vdd shared cache lines.

4.3.1 Problem Formulation

We extend the system model presented in Sect. 2.1 to make it suitable for our multicore architecture:

• A multicore processor with m cores P{p1, p2, . . . , pm}.
• Each core has reconfigurable IL1 and DL1 caches, both of which support h different configurations C{c1, c2, . . . , ch}.
• An α-way associative shared L2 cache with way-based partitioning enabled.
• A set of n independent tasks T{τ1, τ2, . . . , τn} with a common deadline D.³

In this chapter, the tasks do not have any application-specific interaction (e.g., data sharing or communication) except the contention for the shared cache resource. In other words, we concentrate on independent multiprogramming applications. The same assumptions are made in state-of-the-art works [101, 151]. More details about such applications' characteristics, how interprocess communication can be supported (e.g., using a shared buffer or other persistent storage), and other practical issues can be found in [101].

Suppose we are given:

• A task mapping M : T → P in which tasks are mapped to each core. Let ρk denote the number of tasks on pk.
• An L1 cache configuration assignment R : CI, CD → T in which one IL1 and one DL1 configuration are assigned to each task.
• An L2 cache partitioning scheme P{f1, f2, . . . , fm} in which core pi ∈ P is allocated fi ways.
• Task τk,i ∈ T (the ith task on core pk) has execution time tk,i(M,R,P). Let EL1(M,R,P) and EL2(M,R,P) denote the total energy consumption of all the L1 caches and the shared L2 cache, respectively.

³ The techniques described in this chapter can be easily extended to individual deadlines.


Our goal is to find M, R and P such that the overall energy consumption E of the cache subsystem:

E = EL1(M,R,P) + EL2(M,R,P)    (4.1)

is minimized, subject to:

max_{k∈[1,m]} ( ∑_{i=1}^{ρk} tk,i(M,R,P) ) ≤ D    (4.2)

∑_{i=1}^{m} fi = α ; fi ≥ 1 , ∀i ∈ [1,m]    (4.3)

Equation (4.2) guarantees that all the tasks in T finish by the deadline D. Equation (4.3) ensures that the L2 partitioning P is valid.

4.3.2 Static Profiling

For the time being, we assume that the task mapping M is given (we will discuss more about it in Sect. 4.3.4). A reasonable task mapping would be a bin packing of the tasks, using their base case execution times, onto the m cores so that the total execution time on each core is similar. Here, the base case execution time refers to the time a task takes in a system where L1 cache reconfiguration is not applied (i.e., using the base configuration) and the L2 cache is evenly partitioned. Theoretically, we could simply profile the entire task set T under all possible combinations of R and P. Unfortunately, this exhaustive exploration is not feasible due to its excessive simulation time requirement. As described in Chap. 3, the reconfigurable L1 cache contains four banks, each of which is 1 KB. Therefore, it offers total capacities of 1, 2 and 4 KB. The line size can be tuned from 16 to 64 bytes, and each set supports 1-way, 2-way and 4-way associativity. There are in total h = 18 different configurations. Even if there are four cores with only two tasks per core and an 8-way associative L2 cache, the total number of multicore architectural simulations would be ((18²)²)⁴ × 35. To be specific, 18² denotes the IL1 and DL1 cache configuration pairs for each task, (18²)² represents all possible L1 cache combinations of the two tasks in each core, and ((18²)²)⁴ denotes all the combinations across four cores. According to (4.3), the size of P, |P|,⁴ equals 35. Obviously, this simulation is infeasible even if each simulation takes only 1 min.

This problem can be greatly relieved by exploiting the independence of each dimension of the design space. It is observed that each task can be profiled individually due to the lack of application-specific interactions with other tasks. Essentially, under L2 cache partitioning, each core pi is equivalently running on a uniprocessor

⁴ The size of P can be calculated as C(α−1, m−1), where m and α are as described in Sect. 4.3.1.


with an fi-way associative L2 cache (i.e., the capacity is fi/α of the original). L1 cache activities are private to each core, while L2 activities happen independently in each core's partition. Therefore, we simulate each task in T independently under all combinations of L1 cache configurations and L2 cache partition factors (from 1 to α − 1). In other words, the total number of single-core simulations equals h² · (α − 1) · n. Using the same example as above, with 8 tasks, this is (18²) × 7 × 8. Note that this profiling process is independent of the task mapping M and the number of cores m. It therefore takes only a reasonable profiling time (e.g., at most 3 days).
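The simulation counts above can be checked with a short calculation (a sketch; the constants follow the example in the text):

```python
from math import comb

h = 18              # L1 configurations per cache (IL1 or DL1)
alpha = 8           # L2 associativity (ways)
m = 4               # cores
tasks_per_core = 2
n = m * tasks_per_core

# Number of valid partitioning schemes |P| = C(alpha-1, m-1), per (4.3)
p_size = comb(alpha - 1, m - 1)

# Exhaustive co-simulation of the whole task set: ((h^2)^2)^4 * |P|
exhaustive = ((h ** 2) ** tasks_per_core) ** m * p_size

# Independent per-task profiling: h^2 L1 pairs times (alpha - 1) factors
independent = h ** 2 * (alpha - 1) * n

print(p_size, independent)  # 35 partitioning schemes, 18144 simulations
```

At roughly one minute per simulation, 18,144 single-core runs fit in the quoted few days, while the exhaustive count is astronomically larger.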

4.3.3 DCR + CP Algorithm

Static profiling results are used to generate profile tables for each task. Each entry in the profile table records the cache energy consumption, for both L1 and the L2 partition, as well as the execution time of that task. The dynamic energy of the L2 cache is computed using (2.9) based on the statistics (accesses) from the core to which the task is assigned. The static energy, however, is estimated by treating the allocated ways as a standalone cache. There are h² · (α − 1) entries in every profile table. For task τk,i ∈ T (the ith task on core pk), let ek,i(h1, h2, fk) denote the total cache energy consumption if task τk,i is executed with (IL1, DL1) configurations (ch1, ch2) and L2 partition factor fk. Similarly, let tk,i(h1, h2, fk) denote the execution time. The problem now can be stated as minimizing:

E = ∑_{k=1}^{m} ∑_{i=1}^{ρk} ek,i(h1, h2, fk)    (4.4)

subject to:

max_{k∈[1,m]} ( ∑_{i=1}^{ρk} tk,i(h1, h2, fk) ) ≤ D    (4.5)

∑_{i=1}^{m} fi = α ; fi ≥ 1 , ∀i ∈ [1,m]    (4.6)

The DCR + CP algorithm consists of two steps. Since static partitioning is used, all the tasks on each core share the same L2 partition factor fk. This fact gives us an opportunity to simplify the algorithm without losing any precision. In the first step, we find the optimal L1 cache assignments for the tasks on each core separately under all L2 partition factors. Specifically, with k and fk fixed, we find R to minimize Ek(fk) = ∑_{i=1}^{ρk} ek,i(c1, c2, fk) constrained by ∑_{i=1}^{ρk} tk,i(c1, c2, fk) ≤ D, for ∀pk ∈ P and ∀fk ∈ [1, α − 1]. This step (sub-problem) is illustrated in Fig. 4.4 for pm with fm = 2. Similar to the uniprocessor DVS problem we will study in Chap. 5, each instance of this sub-problem can be reduced from the multiple-choice knapsack problem (MCKP) and thus is NP-hard.


Fig. 4.4 Illustration of DCR + CP algorithm

Since the sub-problem size (measured by h² and ρk) and the embedded application size (measured by energy value) are typically small, a dynamic programming algorithm can find the optimal solution quite efficiently as follows. Let e_k^max(fk) and e_k^min(fk) be defined as ∑_{i=1}^{ρk} max{ek,i(h1, h2, fk)} and ∑_{i=1}^{ρk} min{ek,i(h1, h2, fk)}, respectively. Hence, Ek(fk) is bounded by [e_k^min(fk), e_k^max(fk)]. In order to guarantee the timing constraint, the energy value is discretized in the dynamic programming algorithm. Let S_j^{Ek} denote the partial solution for the first j tasks which has an accumulative energy consumption equal to Ek while the execution time is minimized. A two-dimensional table T is created in which each element T[j][Ek] stores the execution time of S_j^{Ek}. The recursive relation for dynamic programming thus is:

T[j][Ek] = min_{h1,h2∈[1,h]} { T[j−1][Ek − ek,j(h1, h2, fk)] + tk,j(h1, h2, fk) }    (4.7)

Initially, all entries in T store some value larger than D. Based on the above recursion, we fill up the table T[j][Ek] in a row by row manner for all energy values in [e_k^min(fk), e_k^max(fk)]. During the process, all previous j − 1 rows are already filled when the jth row is being calculated. Finally, the optimal energy consumption E_k*(fk) is found by:

E_k*(fk) = min{ Ek | T[ρk][Ek] ≤ D }    (4.8)

The algorithm iterates over all tasks on core pk (1 to ρk). During each iteration, all discretized Ek values and L1 cache configurations (1 to h²) for the current task are examined. Therefore, the time complexity is O(ρk · h² · (e_k^max(fk) − e_k^min(fk))). Note that energy values (reflected in the last term of the complexity) can always be measured in a unit chosen so that they are numerically small, making the dynamic programming efficient. The size of table T determines the memory requirement, which is ρk · (e_k^max(fk) − e_k^min(fk)) · sizeof(element) bytes. In each entry of T, we can use the minimum number of bits to remember the L1 configuration index instead of the real execution time values. For calculation purposes, two two-dimensional arrays are used for temporarily storing the time values of the current and previous iterations. The above process is repeated for ∀k ∈ [1,m] and ∀fk ∈ [1, α − 1].⁵ It is possible that, for some fk, there is no feasible solution for core pk satisfying the deadline. These cases are marked as invalid. The results form a new profile table G for each core, which has α − 1 entries, each storing the corresponding optimal solution E_k*(fk), as shown in Fig. 4.4.

In the second step, the global optimal solution E* can be found by calculating the overall energy consumption for all L2 partitioning schemes in P which comply with (4.6). Given a partition factor fk for core pk, the optimal energy consumption E_k*(fk) observing D has been calculated in the first step. Invalid partitioning schemes are discarded. We have E* = min{∑_{k=1}^{m} E_k*(fk)} for {f1, f2, . . . , fm} ∈ P. Therefore, for each L2 partitioning scheme, the corresponding solution can be found in O(m) time. Since the size of P is small (e.g., 455 and 4,495 for 4 cores with a 16-way and a 32-way associative L2 cache, respectively), an exhaustive exploration is efficient enough for this step to find the minimum cache hierarchy energy consumption in O(m · |P|) time. Otherwise, a dynamic programming algorithm can be used. Note that E* is not strictly equal to the actual energy dissipation, since the L2 cache still consumes static power in its entirety after some cores finish their tasks. Therefore, in the experimental results, this portion of static energy is added to make it accurate. If L2 cache lines are powered off in those partitions using techniques such as cache decay [63] to save static power dissipation, E* is already accurate. Each core, along with its private caches, is assumed to be turned off after it finishes execution. Algorithm 3 outlines the major steps in the DCR + CP approach.
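The two-step procedure above can be sketched in Python as follows. This is a simplified illustration, not the book's implementation: energies are assumed to be discretized non-negative integers, the DP table is indexed from 0 instead of over the [e_k^min(fk), e_k^max(fk)] band, and profile tables are passed in as plain lists of (energy, time) pairs per L1 configuration:

```python
import itertools

INF = float("inf")

def core_optimal_energy(profiles, deadline):
    """Step 1: optimal L1 assignment for one core at a fixed f_k.

    profiles[i] lists (energy, time) pairs for task i, one pair per
    (IL1, DL1) configuration. Follows recursion (4.7) and returns
    E_k*(f_k) per (4.8), or None if the deadline cannot be met.
    """
    e_max = sum(max(e for e, _ in task) for task in profiles)
    T = [INF] * (e_max + 1)          # T[E] = min time at exact energy E
    T[0] = 0.0                       # empty prefix of tasks
    for task in profiles:
        nxt = [INF] * (e_max + 1)
        for E in range(e_max + 1):
            if T[E] == INF:
                continue
            for e, t in task:        # try every L1 configuration
                nxt[E + e] = min(nxt[E + e], T[E] + t)
        T = nxt
    feasible = [E for E in range(e_max + 1) if T[E] <= deadline]
    return min(feasible) if feasible else None

def global_optimal_energy(core_tables, alpha):
    """Step 2: exhaustive search over all partitioning schemes P.

    core_tables[k][f] = E_k*(f) for f in 1..alpha-1 (None if invalid).
    """
    m = len(core_tables)
    best = None
    for fs in itertools.product(range(1, alpha), repeat=m):
        if sum(fs) != alpha:                  # constraint (4.6)
            continue
        es = [core_tables[k][f] for k, f in enumerate(fs)]
        if None in es:                        # invalid scheme, discard
            continue
        total = sum(es)
        if best is None or total < best:
            best = total
    return best
```

For example, with two tasks whose configurations are [(2, 3), (3, 1)] and [(1, 4), (2, 2)] and a deadline of 5, the first step returns the minimum feasible energy 4; the second step simply sums the per-core optima over each valid scheme.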

4.3.4 Task Mapping

Since static cache partitioning is used, it is beneficial to map tasks with similar shared cache demands to the same core, instead of using the simple bin packing method described in Sect. 4.3.2, so that the shared cache is partitioned in an advantageous way for most of the execution time. In order to characterize this demand, we define the optimal partition factor (fopt), for each benchmark, as the one beyond which the improvement in overall performance is less than a pre-defined threshold (e.g., 5%).

⁵ If each core has at least one task, this scope can be reduced to ∀fk ∈ [1, α − m + 1] since the minimum partition factor for each core is 1.


Algorithm 3: DCR + CP Algorithm
1: for k = 1 to m do
2:   for fk = 1 to α − 1 do
3:     for l = e_k^min(fk) to e_k^max(fk) do
4:       for h1, h2 ∈ [1, h] do
5:         if ek,1(h1, h2, fk) == l then
6:           if tk,1(h1, h2, fk) < T[1][l] then
7:             T[1][l] = tk,1(h1, h2, fk)
8:           end if
9:         end if
10:       end for
11:     end for
12:     for i = 2 to ρk do
13:       for l = e_k^min(fk) to e_k^max(fk) do
14:         for h1, h2 ∈ [1, h] do
15:           last = l − ek,i(h1, h2, fk)
16:           if T[i−1][last] + tk,i(h1, h2, fk) < T[i][l] then
17:             T[i][l] = T[i−1][last] + tk,i(h1, h2, fk)
18:           end if
19:         end for
20:       end for
21:     end for
22:     E_k*(fk) = min{Ek | T[ρk][Ek] ≤ D}
23:   end for
24: end for
25: for all Pi{f1, f2, . . . , fm} ∈ P do
26:   E_i* = ∑_{k=1}^{m} E_k*(fk)
27: end for
28: return min{E_i*}

For example, as shown in Fig. 4.5a, we can see that fopt for the swim benchmark is around 3, since further increase in the partition factor achieves very little performance improvement. Similarly, for the parser benchmark, assigning 5 shared cache ways seems to be adequate, thus fopt = 5 for parser. However, gcc (Fig. 4.5b) requires a much larger share of the cache, since its performance keeps increasing along with the partition factor (i.e., fopt = 7). As shown in Fig. 4.5d, bitcount is an extreme case in which the performance is not affected by increasing the partition factor; therefore, its fopt is 1. Our study shows that the L1 configuration has little impact on each benchmark's optimal partition factor, with only minor exceptions. In other words, the performance trend normally remains the same across partition factors when DCR is applied. Table 4.1 lists the fopt values for various benchmarks.
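One way to extract fopt from profiled execution times is sketched below. This is a hypothetical reading of the threshold rule; the book does not give explicit pseudocode for this step, and the example curves are made up to mimic the shapes in Fig. 4.5:

```python
def optimal_partition_factor(exec_times, threshold=0.05):
    """Pick the optimal partition factor f_opt for one benchmark.

    exec_times[f - 1] is the profiled execution time with partition
    factor f (f = 1..alpha-1). f_opt is taken as the largest factor
    whose extra way still improves performance by at least `threshold`
    (5% here); beyond f_opt further ways bring negligible improvement.
    """
    f_opt = 1
    for f in range(1, len(exec_times)):
        gain = (exec_times[f - 1] - exec_times[f]) / exec_times[f - 1]
        if gain >= threshold:
            f_opt = f + 1
    return f_opt

# A swim-like curve flattens after factor 3; a bitcount-like curve is flat
print(optimal_partition_factor([760, 620, 540, 533, 528, 524, 521]))  # 3
print(optimal_partition_factor([400, 400, 399, 399, 398, 398, 398]))  # 1
```

The threshold keeps cores with flat curves (like bitcount) from hoarding ways that barely help them.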

To illustrate the effect of task mapping, we exhaustively examine all the task mappings for the first four task sets in Table 4.2 with two tasks per core (105 possible mappings in total). Figure 4.6 compares the energy consumption of the worst and best task mapping schemes for each task set. For task sets 2 and 3, there are a number of task mappings which cannot lead to a valid solution for the given deadline. We observe a 12-18% difference between the two scenarios, which suggests that task mapping has a non-negligible impact on the DCR + CP approach.


[Figure: four panels plotting execution time against L2 partition factors 1-7 for (a) swim (ms), (b) gcc (10 ms), (c) parser (ms) and (d) bitcount (ms)]

Fig. 4.5 Optimal partition factor variation with L1 caches of 4 KB with 2-way associativity and 32-byte line size

Table 4.1 Optimal partition factors for selected benchmarks

Benchmark   fopt   Benchmark      fopt
swim        3      FFT            7
qsort       4      untoast        2
ammp        4      mgrid          2
applu       2      lucas          3
vpr         6      mcf            2
bitcount    1      gcc            7
sha         2      parser         5
CRC32       2      patricia       2
dijkstra    4      stringsearch   3
toast       1      basicmath      7

Although it does not precisely indicate which partitioning scheme is the best, the optimal partition factor fopt reflects a task's shared cache requirement, which can heuristically guide the task mapping scheme. Ideally, we should make the total execution times of all cores (each running multiple tasks) as close as possible.
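A simple grouping heuristic along these lines might look like the following. This is an illustrative sketch, not the book's algorithm: it sorts tasks by fopt and splits them into contiguous groups so that co-located tasks have similar shared cache demands; balancing the per-core base execution times would need an extra refinement pass, omitted here:

```python
def group_by_fopt(tasks, m):
    """Map n tasks onto m cores (n divisible by m for simplicity).

    tasks: list of (name, base_time, f_opt) tuples. Tasks with similar
    optimal partition factors end up on the same core, so one static
    partition factor per core can suit all of its tasks reasonably well.
    """
    assert len(tasks) % m == 0
    per_core = len(tasks) // m
    ordered = sorted(tasks, key=lambda task: task[2])  # sort by f_opt
    return [[name for name, _, _ in ordered[k * per_core:(k + 1) * per_core]]
            for k in range(m)]

# f_opt values taken from Table 4.1; base times here are placeholders
tasks = [("qsort", 1.0, 4), ("vpr", 1.0, 6), ("bitcount", 1.0, 1),
         ("gcc", 1.0, 7), ("sha", 1.0, 2), ("swim", 1.0, 3),
         ("mcf", 1.0, 2), ("parser", 1.0, 5)]
print(group_by_fopt(tasks, 4))
```

With this grouping, low-demand tasks (bitcount, sha) share a core that can be given few ways, leaving more ways for the core running vpr and gcc.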


Table 4.2 Multi-task benchmark sets

       Core 1                                Core 2                            Core 3                      Core 4
Set1   qsort, vpr                            parser, toast                     untoast, swim               dijkstra, sha
Set2   mcf, sha                              gcc, bitcount                     patricia, lucas             basicmath, swim
Set3   applu, lucas                          dijkstra, swim                    ammp, FFT                   basicmath, stringsearch
Set4   mgrid, FFT                            dijkstra, parser                  CRC32, swim                 applu, bitcount
Set5   mcf, toast, sha                       gcc, parser, stringsearch         patricia, qsort, vpr        basicmath, CRC32, ammp
Set6   mgrid, parser, gcc                    toast, FFT, mcf                   bitcount, ammp, patricia    applu, dijkstra, qsort
Set7   vpr, sha, untoast                     CRC32, lucas, qsort               mgrid, bitcount, FFT        applu, parser, stringsearch
Set8   sha, mcf, untoast, basicmath          toast, gcc, bitcount, patricia    lucas, FFT, CRC32, ammp     vpr, applu, mgrid, swim
Set9   gcc, stringsearch, parser, dijkstra   untoast, mcf, ammp, bitcount      lucas, patricia, qsort, vpr basicmath, toast, applu, CRC32


[Fig. 4.6 Task mapping effect: normalized energy consumption (0.80–1.00) of the worst and best task mappings for task sets 1–4]

Similarly, we should also ensure that the optimal partition factors of the tasks assigned to each core are as close as possible. Experimental results in Sect. 4.4.4 show that more energy savings can be achieved by wisely grouping tasks.

4.3.5 Varying Cache Partitioning Scheme

Until now, as described in Sect. 4.3.1, we assume a static L2 cache partitioning scheme P{f1, f2, ..., fm} which assigns a fixed number of ways to each core (fi for core pi).

As shown in Table 4.1, different benchmarks have different preferred partition factors (fopt). Intuitively, since there are multiple tasks per core, the assigned partition factor may not be beneficial for every task of that core. For example, as shown in Fig. 4.7, the first task of each core has an fopt value of 2, 1, 6 and 3, respectively. A reasonable partitioning scheme for them with an 8-way L2 cache would be P{1,1,4,2}. However, if the next task in each core has fopt values of 7, 5, 1 and 2, respectively, P{1,1,4,2} would obviously be inferior.
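One simple way to derive such a scheme from the fopt values is proportional allocation. This is a hypothetical heuristic for illustration only, not the chapter's DP-based selection, which searches the scheme space exhaustively:

```python
import math

def proportional_scheme(fopts, alpha):
    """Scale each core's preferred factor fopt to its share of the alpha
    L2 ways, guarantee at least one way per core, then hand out any
    leftover ways by largest fractional remainder. Assumes the clamping
    to >= 1 way does not push the total above alpha."""
    total = sum(fopts)
    shares = [alpha * f / total for f in fopts]
    scheme = [max(1, math.floor(s)) for s in shares]
    leftovers = alpha - sum(scheme)
    # Give remaining ways to the cores with the largest remainders.
    order = sorted(range(len(fopts)),
                   key=lambda i: shares[i] - scheme[i], reverse=True)
    for i in order[:max(0, leftovers)]:
        scheme[i] += 1
    return scheme

print(proportional_scheme([2, 1, 6, 3], alpha=8))  # [1, 1, 4, 2]
```

For the first tasks above (fopt values 2, 1, 6 and 3) this reproduces P{1,1,4,2}; for the second set it yields a valid 8-way scheme, though not necessarily the one the optimization would pick.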

We can potentially alleviate this problem by changing the partitioning scheme at runtime for better energy efficiency and performance. For example, in Fig. 4.7, we can change the partitioning scheme to P{3,3,1,1} at some point (i.e., a CP point) that better reflects the requirements of the remaining tasks. The challenge is to decide at design time when and how to vary the partitioning scheme.

This problem can be solved in a similar way as described in Sects. 4.3.2 and 4.3.3 with the following modifications. Since the change of the shared cache partitioning scheme happens simultaneously for all cores, it is possible that the partition factor changes during one task's execution. In order to compute the energy consumption and execution time, we evenly divide each task into a number of checkpoints based on a fixed time interval. In other words, the distance between two neighboring checkpoints is fixed, say, one million clock cycles. During static profiling, the amount of energy consumed and the execution progress (in dynamic instructions) are recorded up to each checkpoint. As a result, in each task's profile table, for every L1 cache configuration and L2 partition factor, there are multiple entries for all checkpoints.

[Fig. 4.7 Varying partitioning scheme: four cores sharing an 8-way L2 cache change from scheme P{1,1,4,2} to P{3,3,1,1} at the CP point]

We describe a two-step algorithm, namely DCR + VCP (Varying CP), similar to the DCR + CP algorithm in Sect. 4.3.3. At present, we assume a set of CP points is given. In the first step, we find the optimal L1 cache configurations for the tasks on each core separately under all local partitioning schemes for that core. Each local partitioning scheme Fk here consists of one partition factor for every phase defined by the CP points (e.g., fk = 3 for the first phase from time 0 to the only CP point and fk = 1 from the CP point to the end). Note that we only change the partition factor at those CP points. There are (α − 1)^(λ+1) possible local partitioning schemes⁶ where α is the L2 cache associativity and λ is the number of CP points. The dynamic programming algorithm for this step is modified as follows:

1. We need to compute the actual energy consumption and execution time for each task given the varying partition factor, instead of simply fetching them from the profile table.

2. A task's energy consumption and execution time now depend on its start time since it determines the moment when the partition factor changes during the task's execution. Therefore, for each task τk,i with i ∈ [2, ρk], we iterate over all energy values of its previous task τk,i−1, which gives the end time of τk,i−1 (from row i − 1 of table T), and thus the start time of τk,i. Specifically, lines 12–21 of Algorithm 3 need to be replaced with Algorithm 4. Here, ek,i(h1, h2, Fk, start) and tk,i(h1, h2, Fk, start) denote the cache energy consumption and execution time for executing task τk,i using the given L1 configurations, local partitioning scheme and start time.
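The count (α − 1)^(λ+1) follows because each of the λ + 1 phases independently picks a factor between 1 and α − 1. A minimal enumeration sketch (this straightforward itertools version is an assumption, not the chapter's implementation):

```python
from itertools import product

def local_partitioning_schemes(alpha, cp_points):
    """Enumerate every local scheme Fk: one partition factor per phase,
    where lambda CP points split execution into lambda + 1 phases and
    each phase's factor ranges over 1..alpha-1."""
    phases = cp_points + 1
    return list(product(range(1, alpha), repeat=phases))

schemes = local_partitioning_schemes(alpha=8, cp_points=1)
print(len(schemes))  # (8 - 1) ** 2 = 49
```

The first step then evaluates the per-core dynamic program once per enumerated scheme.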

The second step remains the same as our DCR + CP algorithm except allpossible global varying partitioning schemes are evaluated. Each phase in a globalpartitioning scheme defines a partition factor for each core and complies with (4.6).

⁶ Similarly, it could be reduced to (α − m + 1)^(λ+1).


Algorithm 4: DCR + VCP Algorithm (replaces lines 12–21 of Algorithm 3)
1:  for i = 2 to ρk do
2:    for l = e_k^min(Fk) to e_k^max(Fk) do
3:      for h1, h2 ∈ [1, h] do
4:        start = T[i − 1][l]
5:        if start > e_k^max(Fk) then
6:          continue
7:        end if
8:        this = l + ek,i(h1, h2, Fk, start)
9:        if T[i − 1][l] + tk,i(h1, h2, Fk, start) < T[i][this] then
10:         T[i][this] = T[i − 1][l] + tk,i(h1, h2, Fk, start)
11:       end if
12:     end for
13:   end for
14: end for

Clearly, there are a total of |P|^(λ+1) global partitioning schemes, and the minimum cache hierarchy energy consumption can be found in O(m · |P|^(λ+1)) time.

Note that there is a certain error when we compute tk,i(h1, h2, Fk, start) since the CP points may not always align with tasks' checkpoints. As a result, the period between that pair of checkpoints observes two partition factors: one from the first checkpoint to the actual CP point and the other from the actual CP point to the next checkpoint. We do not have this partial static profiling information since, as described earlier, we record exactly at checkpoints only. Therefore, we estimate the length of that particular time period assuming a uniform distribution of execution time and energy consumption based on the execution progress. Since the length of the time interval (denoted by ϕ) between two checkpoints is very short compared with a task's length (i.e., there are hundreds of checkpoints in each task), the error introduced here can simply be eliminated by setting the deadline to D − λ·ϕ. In other words, we make the deadline slightly more stringent to ensure that the solution generated by DCR + VCP satisfies the original deadline. This modification is expected to perform on par with the original version since λ·ϕ is negligible compared to D.
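The uniform-distribution estimate amounts to linear interpolation between the two recorded checkpoints that bracket the CP point. A sketch, with a hypothetical profile-entry format of (time, energy, instructions):

```python
def interpolate_at(cp_time, checkpoints):
    """Estimate energy and instruction progress at an arbitrary CP point
    by linear interpolation between the two profiled checkpoints that
    bracket it. Each entry is (time, energy, instructions)."""
    for (t0, e0, n0), (t1, e1, n1) in zip(checkpoints, checkpoints[1:]):
        if t0 <= cp_time <= t1:
            frac = (cp_time - t0) / (t1 - t0)
            return (e0 + frac * (e1 - e0), n0 + frac * (n1 - n0))
    raise ValueError("CP point outside profiled range")

# Hypothetical checkpoints every 1e6 cycles: (cycles, mJ, M instructions).
profile = [(0, 0.0, 0.0), (1e6, 0.8, 1.2), (2e6, 1.5, 2.5)]
print(interpolate_at(1.5e6, profile))  # approximately (1.15, 1.85)
```

The same interpolation splits the interval's energy and time between the two partition factors in proportion to execution progress.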

4.3.6 Gated-Vdd Shared Cache Lines

Powell et al. [93] showed that cache leakage power dissipation can be reduced by gating the supply voltage of unused portions. Cache decay exploits sleep transistors at the granularity of individual cache lines [63]. Using this technique, we can switch off some L2 cache lines in each set whenever it is beneficial. It is especially helpful when the L2 cache has a larger total capacity and associativity than what all the cores actually need (i.e., the sum of the fopt values of concurrent tasks is less than α) given the deadline. Specifically, the constraint ensuring the validity of P is changed to:

    ∑_{i=1}^{m} fi ≤ α ;  fi ≥ 1 , ∀i ∈ [1, m]        (4.9)

The DCR + CP algorithm remains the same except that the number of all L2 partitioning schemes (the size of P) is increased. For example, |P| becomes 1,820 instead of 455 for a 16-way associative L2 cache on a 4-core processor. In other words, the second step of our algorithm may take longer but the complexity is still O(m · |P|).
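These counts are standard compositions: with ∑fi = α and fi ≥ 1 there are C(α−1, m−1) schemes, and relaxing the constraint to ∑fi ≤ α gives C(α, m). A quick brute-force cross-check:

```python
from itertools import product
from math import comb

def count_schemes(alpha, m, allow_unused_ways=False):
    """Count valid partitioning schemes by brute force: each of the m
    cores gets fi >= 1 ways, summing to exactly alpha, or to at most
    alpha when gated-Vdd lines allow ways to stay unused."""
    ok = (lambda s: s <= alpha) if allow_unused_ways else (lambda s: s == alpha)
    return sum(1 for f in product(range(1, alpha + 1), repeat=m) if ok(sum(f)))

# 16-way L2, 4 cores: 455 schemes normally, 1,820 with gated-Vdd lines.
print(count_schemes(16, 4), comb(15, 3))                          # 455 455
print(count_schemes(16, 4, allow_unused_ways=True), comb(16, 4))  # 1820 1820
```

This confirms the figures quoted above and shows why the second step's search space grows fourfold in the gated-Vdd case.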

4.4 Experiments

4.4.1 Experimental Setup

To evaluate the proposed approach's effectiveness, we use 20 benchmarks selected from MiBench [40] and SPEC CPU 2000 [116]. In order to make the sizes of the SPEC benchmarks comparable with MiBench, the reduced (but well verified) input sets from MinneSPEC [69] are used. Table 4.2 lists the task sets used in the experiments, which are combinations of the selected benchmarks. We choose 4 task sets where each core contains 2 benchmarks, 3 task sets where each core contains 3 benchmarks, and 2 task sets where each core contains 4 benchmarks. As mentioned in Sect. 4.3.2, the task mapping is based on the rule that the total task execution time of each core is comparable. We evaluate different task mapping strategies in Sect. 4.4.4. The deadline D is set in a way that there is a feasible L1 cache assignment for every partition factor in every core. In other words, all possible L2 partitioning schemes can be used. We will examine the effect of deadline variation in Sect. 4.4.3.

M5 [11] is adopted for the experiments. It is enhanced to support shared cache partitioning as well as different line sizes in different caches (IL1, DL1 and L2) for L1 cache reconfiguration in CMP mode. The simulated system is configured with a four-core processor in which each core runs at 500 MHz. The TimingSimpleCPU model [11] in M5 is used (representing an in-order core), which stalls during cache accesses and memory response handling. The L2 cache configuration is assumed to be 32 KB, 8-way associative with 64-byte lines. The memory size is set to 256 MB. The L1 cache, L2 cache and memory access latencies are set to 2, 20 and 200 ns, respectively.

4.4.2 Energy Savings

We now compare the following three approaches.

• CP: L2 cache partitioning only (optimal).


• DCR + UCP: L1 cache reconfiguration with a uniform L2 cache partitioning.
• DCR + CP: L1 cache reconfiguration with judicious L2 cache partitioning.

Here, the CP-only approach uses the optimal L2 partitioning scheme with all L1 caches in the base configuration. It can be achieved using the algorithm described in Sect. 4.3.3 without the first step. DCR + UCP essentially means that all cores are equally assigned α/m L2 ways while the L1 configurations are carefully selected in one shot. DCR + CP exploits optimization opportunities using DCR and CP together.

Figure 4.8 illustrates this comparison in energy consumption for all task sets. Energy values are normalized to CP. As discussed in Sect. 4.3.2, the reconfigurable L1 cache has a base size of 4 KB. Here, we examine two kinds of L1 base configurations: 4 KB with 2-way associativity and 32-byte line size (4KB_2W_32B), and 4 KB with 4-way associativity and 64-byte line size (4KB_4W_64B). In the former case, DCR + CP saves 18.35% of cache energy on average compared with CP. In the latter case, up to 33.51% energy saving (29.29% on average) can be achieved. Compared with DCR + UCP, judiciously employing CP (DCR + CP) achieves up to 14% more energy savings by carefully selecting the cache partitioning scheme P. Although results for only two L1 base configurations are shown here, similar improvements can be achieved for other base configurations (e.g., 19.30% for 2KB_2W_32B).

It is instructive to illustrate the energy reduction ability of the described approach. Using task set 4 in Fig. 4.8b as an example, CP selects the best scheme P{1,5,1,1} with an L1 configuration of 4KB_4W_64B. It consumes a total energy of 125.8 mJ and finishes all tasks in 1,600 ms. With DCR + CP, the optimal scheme is P{2,4,1,1} and the L1 caches are configured differently for each task. For example, FFT on Core 1 uses 1KB_1W_64B and 4KB_4W_16B for IL1 and DL1, respectively, while swim on Core 3 uses 4KB_4W_16B and 2KB_2W_32B. Using DCR + CP, the energy requirement is reduced to 83.6 mJ and all tasks finish in 1,788 ms.

4.4.3 Deadline Effect

It is also meaningful to see how the deadline constraint affects the effectiveness of the approach. Using the same example as above, for task set 4, we vary the deadline from 1,800 to 1,520 ms in steps of 10 ms (there is no solution for deadlines shorter than 1,520 ms). Figure 4.9 shows the results for both CP and DCR + CP. At each step, the reduced deadline negatively impacts the energy saving opportunity. In other words, the energy consumption increases since configurations that were energy efficient become invalid due to the timing constraint. We can see that DCR + CP finds efficient solutions and outperforms CP consistently at all deadline levels.


[Fig. 4.8 Cache hierarchy energy reduction: normalized energy consumption (0.5–1.0) of CP, DCR+UCP and DCR+CP for task sets 1–9, with (a) L1 base configuration 4KB_2W_32B and (b) L1 base configuration 4KB_4W_64B]

4.4.4 Task Mapping Effect

[Fig. 4.9 Deadline effect on total energy consumption: energy (70–150 mJ) of CP and DCR+CP as the deadline varies from 1,800 to 1,520 ms]

[Fig. 4.10 Task mapping heuristic effect on total energy consumption: normalized energy (0.90–1.00) of DCR+CP and DCR+CP+TM for task sets 1–9]

In this section, we evaluate a simple task mapping heuristic based on the optimal partition factor discussed in Sect. 4.3.4. For each task set in Table 4.2, benchmarks with equal or similar optimal partition factors (determined with the L1 base configuration 4KB_2W_32B) are grouped together on the same core. Ideally, we would like to minimize the average variance of the fopt values in each core. However, this is clearly a hard problem and the heuristic does the task mapping on a best-effort basis. We compare the solution quality of the DCR + CP approach with and without task mapping in Fig. 4.10. It can be observed that partition factor aware task mapping achieves more energy saving than simple bin packing based mapping. However, the improvement is not significant (up to 6.3% but on average less than 5%). One reason is that, in practice, the default bin packing normally leads to a task mapping that is already reasonable (note that Fig. 4.6 compares the best and the worst mappings). Another reason is that when tasks are grouped based on fopt values without considering their total execution times, the imbalanced workloads on the cores may lead to more idle L2 cache ways and thus more static energy consumption.
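The grouping step of such a heuristic can be sketched as follows. This is a best-effort illustration using the fopt values of Table 4.1 for the benchmarks of Set 1; the chapter does not specify the exact algorithm. Sorting by fopt and slicing consecutive runs onto the same core keeps the per-core fopt variance low:

```python
def fopt_aware_mapping(fopts, num_cores):
    """Group tasks with similar optimal partition factors on the same
    core: sort by fopt, then split into contiguous equal-size chunks."""
    ranked = sorted(fopts.items(), key=lambda kv: kv[1])
    chunk = len(ranked) // num_cores
    return [dict(ranked[i * chunk:(i + 1) * chunk]) for i in range(num_cores)]

# fopt values from Table 4.1 for the Set 1 benchmarks.
fopts = {"qsort": 4, "vpr": 6, "parser": 5, "toast": 1,
         "untoast": 2, "swim": 3, "dijkstra": 4, "sha": 2}
for core, group in enumerate(fopt_aware_mapping(fopts, 4), start=1):
    print(f"Core {core}: {group}")
```

Note how this ignores execution times entirely, which is exactly the workload-imbalance risk discussed above.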


[Fig. 4.11 Varying partitioning scheme effect on total energy consumption: normalized energy (0.84–1.00) of DCR+CP and DCR+VCP for task sets 1–9]

4.4.5 Effect of Varying Cache Partitioning

We compare DCR + CP and DCR + VCP in their energy saving ability to evaluate the varying L2 cache partitioning scheme. The number of CP points is set equal to the number of tasks per core minus one. In other words, we try to change the partition factor when a new task starts execution. However, since the partitioning scheme can only be altered simultaneously for all cores, the CP points are set based on the average start times of tasks. For example, the first CP point is at the average start time of all the second tasks on each core under the base configuration. Figure 4.11 illustrates the results. It is worth noting that more energy savings can be achieved in scenarios with more tasks per core (12.1% versus 6.4% on average). The reason is that task sets with diverse fopt values make varying the partition factor more advantageous.

The task mapping heuristic and the varying partitioning scheme actually compete with each other. As described in Sect. 4.3.4, a good task mapping scheme makes the fopt values of the tasks on each core as close as possible. This fact compromises the advantage of changing the partition factor at runtime. In this study, there are fewer additional energy savings when applying DCR + VCP on systems with ideal task mappings. On the other hand, if DCR + VCP is employed, the task mapping heuristic shows limited benefit. In practice, one of the two alternatives can be selected for further improvements.

4.4.6 Gated-Vdd Cache Lines Effect

Switching off unnecessary cache lines in our original L2 configuration (32 KB with 8-way associativity) leads to limited additional energy saving since we can hardly shut down any cache line given the timing constraint. Therefore, we evaluate the effect of gated-Vdd shared cache lines using a larger L2 configuration (64 KB with 16-way associativity). Figure 4.12 shows the results assuming a fixed partitioning scheme. It can be observed that 12% of the cache hierarchy energy consumption is saved on average.

[Fig. 4.12 Gated-Vdd cache lines effect on total energy consumption: normalized energy (0.80–1.00) without and with gating for task sets 1–9]

In this study, it is also observed that the energy saving achieved using the power gating technique decreases as the deadline becomes more stringent. This is expected because a tight timing constraint requires utilizing more shared cache resources to maximize performance.

4.5 Chapter Summary

In this chapter, we examined an efficient approach to integrate dynamic cache reconfiguration and partitioning for real-time multicore systems. We observed that there is a strong correlation between L1 DCR and L2 CP in CMPs. While CP is effective in reducing inter-task interference, DCR can further improve the energy efficiency without violating timing constraints (i.e., deadlines). The static profiling technique drastically reduces the exploration space without losing any precision. The DCR + CP algorithm, which can find the optimal L1 configurations for each task and the L2 partition factors for each core, is based on dynamic programming with discretization of energy values. Moreover, we explored varying partitioning schemes and shared cache line power gating for more energy savings. We also studied the effects of deadline variation and of task mapping based on the shared cache resource demand of each task. Extensive experimental results demonstrated the effectiveness of the described approach (29.29–36.01% average energy savings).