Region Monitoring for Local Phase Detection in Dynamic Optimization Systems∗
Abhinav Das, Jiwei Lu, Wei-Chung Hsu
University of Minnesota
{adas,jiwei,hsu}@cs.umn.edu
Abstract
Dynamic optimization relies on phase detection for two important functions: (1) to detect a change in the code working set, and (2) to detect a change in performance characteristics that can affect optimization strategy. Current prototype runtime optimization systems [12][13] compare aggregate metrics like CPI over fixed time intervals to detect a change in working set and a change in performance. While simple and cost-effective, these metrics are sensitive to sampling rate and interval size. A phase detection scheme that computes performance metrics by aggregating the performance of individually optimized regions can be misled by some regions impacting aggregate metrics adversely. In this paper, we investigate the benefits and limitations of using aggregate metrics for phase detection, which we call Global Phase Detection (GPD). We present a new model to detect change in working set and propose that the scope of phase detection be limited to within the candidate regions for optimization. By associating phase detection with individual regions we can isolate the effects of regions that are inherently unstable. This approach, which we call Local Phase Detection (LPD), shows improved performance on several benchmarks even when global phase detection is not able to detect stable phases.
1. Introduction
Over the last few years dynamic optimization systems
[2][3][12][13][15] have been explored by researchers to in-
crease the performance of native binaries or application exe-
cutables with intermediate representation at runtime. Their
idea is to exploit runtime profiles to select “hot-code” as
the target of optimization. Some dynamic optimization sys-
tems collect runtime profiles by exploiting the hardware
support for performance monitoring in modern microprocessors.

∗This work is supported by a grant from the U.S. National Science Foundation (EIA-0220021). We would also like to thank the runtime optimization team at Sun Microsystems for their help and support for this research.

Phase detection is an important component of those
dynamic optimizers. They sample hardware performance
counters to determine frequently executing instructions and
performance bottlenecks associated with those instructions.
A frequently executing region surrounding this instruction
is selected as a unit of optimization. When the working set
of the program changes, it is important to determine new
regions of execution to exploit new optimization opportu-
nities. It is also important to find out if the performance
characteristics of this region change over time as this could
affect the optimization strategy. Phase detection, as imple-
mented in current sampling-based prototype runtime opti-
mization systems [12][13], is what we call Global Phase
Detection (GPD). In GPD, global metrics like average pro-
gram counter value are used to find new code regions, and
other metrics of performance, such as CPI and DPI (Data
Cache Misses per Instruction), are used to determine if the
program performance characteristics have changed. We call
this global phase detection as program characteristics are
computed by taking into account information from all re-
gions that executed during the profiled interval. However,
most optimizations are only deployed on units of optimiza-
tion called traces, loops or regions. It seems prudent to
have a phase detection scheme that computes metrics for
these units of optimization. However, the transition into
newer regions of execution is important and must be de-
tected. This leads to the idea that region formation and
phase detection can be decoupled. The task of region forma-
tion is to look for changes in the working set and the task of
phase detection is to analyze the performance of each code
region individually and trigger phase changes when the re-
gion’s performance characteristics change. Optimizations
are based on the characteristics observed during execution
of the region. Once these optimized traces are deployed,
it is essential to track their performance for two reasons.
Firstly, the region may change behavior, affecting optimization strategy. Secondly, the optimization deployed may not
be beneficial. This is possible due to the speculative nature
of some optimizations like data pre-fetching. Thus moni-
toring the performance of a region becomes important for
detecting change in region characteristics and to determine
Proceedings of the International Symposium on Code Generation and Optimization (CGO’06) 0-7695-2499-0/06 $20.00 © 2006 IEEE
Figure 1. State transition diagram for the centroid based phase detector. Variables TH1 to TH4 are threshold values that can be set by parameters. BOS is the band of stability.
the impact of deployed optimizations.
In this paper, we present a scheme that decouples change
in working set detection and phase detection and we pro-
pose that phase detection is needed at the level of re-
gions of optimization and not at the whole program level.
Our scheme for region monitoring achieves the dual goal
of phase detection and monitoring of deployed optimiza-
tions, which we call self-monitoring. The region monitor-
ing framework also incorporates new code detection. The
outline of this paper is as follows: In section 2 we present
an analysis of the existing centroid approach as it is used
in current dynamic optimization systems and discuss its
advantages and limitations. Section 3 introduces the re-
gion monitoring framework, followed by strategies for local
phase detection and its cost/performance analysis. Section
4 discusses related work in phase detection for fast simula-
tion and runtime optimization and we conclude in section 5
with a discussion on future work.
2. The Centroid Approach for Global Phase Detection
The centroid approach of phase detection has been suc-
cessfully applied in prototype dynamic optimization sys-
tems. In this section we will discuss the scheme in greater
detail and analyze its benefits and limitations.
Figure 2. Relation between regions and phase changes for 181.mcf.
2.1. Algorithm
The premise of the centroid scheme is that the average
value of program counter obtained by sampling the program
counter at periodic time intervals does not deviate much.
When it does deviate, it often indicates a phase change.
Figure 1 shows the state transition diagram of the centroid
based phase detection scheme. Program counter samples
are obtained by periodic sampling and stored in a buffer.
On every buffer overflow, the mean (centroid) of all the
program counter samples in the buffer is computed. The
phase detector stores a history of such centroids and com-
putes the Band of Stability (BOS) using the expectation value
(E) and standard deviation value (SD) of these centroids.
The band extends from (E - SD) to (E + SD). The drift of
the new centroid from this band of stability is computed as
Δ. For example, if the current centroid is within the band
of stability then the value of Δ is 0; otherwise Δ will have a
positive value equal to the distance of the centroid from the
closer of the upper bound or the lower bound. The value of
Δ is used to transition between states. A timer is associated
with the less stable state before transitioning to the stable
state. It is used to ensure that the centroid maintains a low
Δ for some time before triggering a stable phase. Before
transitioning into the less stable phase, a check is also made to ensure that the band of stability is not too thick, by requiring that SD is less than 1/6 of E. Thresholds TH1 to TH4, as currently used in the phase detector, have been determined empirically as 1%, 5%, 10%, and 67% respectively.
2.2. Efficacy of Approach
Using the centroid is a very lightweight and effective
technique to detect phase changes, although it has some lim-
itations that are detailed in subsequent sections. To track
Figure 3. Number of phase changes for different sampling periods. Three sampling periods, 45K, 450K and 900K cycles/interrupt, were used. Short running benchmarks were excluded from this analysis.
Figure 4. Percentage of time spent in stable phase for different sampling periods.
the efficacy of the centroid scheme we used the prototype
runtime optimization system detailed in [13] and plot the
phase changes (shown as a thick line) along with the distri-
bution of cycle samples across code regions in Figure 2 for
181.mcf. This figure plots the number of program counter
samples, collected from periodic sampling, for each code
region over the length of execution of the benchmark. Each
region is assigned a different color and y-axis is the number
of samples obtained for each region. We set the buffer size
to 2032 samples so one would expect the height of stacked
area chart to be constant at 2032. However when samples
are obtained from overlapping regions, we increment coun-
ters for all overlapping regions causing the height of the area
chart to be greater than 2032. Whenever the graph shows a
shift in regions, a phase change should be observed. The
thick line indicates unstable phase at high value and stable
phase when it is 0. If the distribution of samples across re-
gions changes, the centroid should shift to indicate the cor-
responding change in behavior. However, a phase change
may not be detected due to the location of the new region
being very close to the previous region or the centroid of
the new regions may remain the same. Phase detection for
181.mcf is able to track changes in the pattern of execution.
However, we also find that the phase remains unstable for
quite some time towards the end of execution. The 181.mcf
benchmark highlights a weakness of the centroid scheme: a frequently changing but periodic code execution pattern can cause the phase detector to remain in the unstable state for a long time. Since the dynamic optimizer does not attempt to optimize during unstable phases, this can miss optimization opportunities.
2.3. Limitations
The centroid based scheme and global phase detection,
in general, is sensitive to sampling period, interval size and
thresholds used in the phase detector. Interval size is usu-
ally determined by the sampling period, but can be inde-
pendently set. We found that changing the sampling period
can have drastic effects on the centroid based phase detec-
tion. For example, a program with a periodically chang-
ing centroid will detect a phase change on periodic shifts
in centroids. However, the program is really in one phase
but periodically shifting between regions. So a larger sam-
pling period may not show shifts in centroids. The effect
of changing the sampling period was measured for selected
SPEC CPU2000 benchmark programs and we observed that
the number of phase changes was greatly increased at low
Figure 5. Region chart for 187.facerec. The thick line indicates phase change whenever it has a high value and phase stable whenever it is 0.
Figure 6. Median of percentage of samples not monitored by the region monitor. The line indicates the threshold of 30% used in this study.
sampling periods. Figure 3 shows the number of phase
changes for 3 different sampling periods. The binaries used
were compiled using baseline optimization options and ex-
ecuted on an UltraSPARC III+ machine. Figure 4 shows
the percentage of time spent in stable phase for the three
sampling periods used. Spending more time in stable phase
can translate into more opportunity for optimization. How-
ever, percentage of time spent in stable phase does not have
a correlation with the number of phase changes. For ex-
ample, 181.mcf spends more time in stable phase when the
sampling period is small and there are many phase changes.
This is due to fast response time at low sampling periods.
Since 181.mcf is quite stable within a phase, we do not see
phase changes at large sampling periods. At the other ex-
treme 187.facerec spends a large percentage of time in un-
stable phase due to frequent phase changes. Looking at the
region chart (Figure 5) for facerec, we see that there are
few actual phase changes. Facerec periodically switches between two sets of regions, which causes frequent detected phase changes. Variations caused by sampling can also result in frequent phase changes at low sampling periods.
2.4. Motivating Region Monitoring
When a set of samples (or aggregates computed from this
set) in an interval are compared to another set (or aggre-
gates), variations from sampling and periodicity of work-
ing set can influence phase detection greatly as we saw in
the earlier section. A phase is usually defined as a stable working set of instructions, basic blocks, or procedures, with or without execution frequency information. This working
set can include many regions that are independently opti-
mized. It is dependent upon the interval size selected and
will change with different window sizes. As seen earlier,
such dependency can have a negative impact on the effec-
tiveness of phase detection. This problem can be minimized
if the scope of phase detection is reduced from looking for
global phase changes to phase changes within a small code
region. In other words, phase detection will be looking for
working set changes within a bounded region and not the
whole program. In doing so, we have to associate a phase
detection mechanism with each selected region. There are
several reasons that favor detecting phase changes locally.
It is very possible that some loops exhibit highly stable be-
Figure 7. Percentage of samples in UCR for 254.gap and 186.crafty, obtained on every buffer overflow.
havior, but other loops may have haphazard behavior. Most
optimizers limit optimization scope to a code region1, thus
it is not necessary to determine change of whole-program characteristics; it suffices to determine change of characteristics within the monitored region. By doing this, regions of stable behavior are not penalized due to unstable
regions that coexist with these stable regions. In addition
to phase change detection, monitoring region characteris-
tics allows us to verify the benefit of optimizations. In the
next section, we present the region monitoring framework
and show that it can be used for phase detection.
3. Region Monitoring
Region monitoring consists of two parts viz. (1) Region
Formation and (2) Phase Detection and Self Monitoring.
The region formation algorithm is responsible for detecting
changes in the working set of the program, building regions for the new code, and adding them to the region monitor for phase detection and self monitoring.
3.1. Region Formation
Whenever the user buffer overflows, performance
counter samples are distributed across regions. There may
be samples that do not fall in any monitored region. We at-
tribute these samples to a single unmonitored region, which
we call the unmonitored code region (UCR). When the per-
centage of samples in the UCR is above a threshold, region
formation is triggered and it builds regions from these sam-
ples. For this work we use the region building mechanism in
[13]. In the current prototype systems, regions are primarily loops that have significant samples within an interval of sampling.

1 Inter-region optimizations are rarely performed due to the complexity of analysis in a runtime optimization system. However, with the help of compiler annotations, future dynamic optimization systems may deploy inter-region optimizations, such as instruction cache prefetching for the next incoming phase.

In the future, regions can also include functions
or traces. Once samples are distributed across regions, each
region can be analyzed by the local phase detector to deter-
mine locally stable phase and other performance character-
istics. It is possible that a code region cannot be built around
some frequently executing instructions. For example, a re-
gion formation algorithm that looks only for loops within
procedures may find samples in a procedure that is called in
a loop. Since procedure boundaries are crossed, no regions
are formed. Such instructions could form a large fraction
of samples and the threshold should be appropriately ad-
justed depending on the percentage of such samples. Figure
6 shows the median of the percentage of samples in the un-
monitored region. For most programs, this is below 30%.
However there are a few programs that have > 30% sam-
ples in UCR.
Figure 7 shows the percentage of samples in UCR for
two such benchmarks over time. Even after frequent region
formation triggers in 254.gap, the percentage of samples in
UCR remains high. 186.crafty triggers region formation on every buffer overflow, but the percentage of samples in UCR does not decrease. This is due to a current limitation of the region building algorithm. A better region building algorithm
can reduce the percentage of samples in the UCR signifi-
cantly. There is no fundamental limitation to building inter-
procedural regions and if such a region building algorithm
is used it can greatly reduce the number of region forma-
tion triggers. We also plan to use compiler annotations to
improve region formation in the future.
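The region-formation trigger described in this section can be sketched as follows. For simplicity, this version attributes each sample to at most one region, whereas the prototype increments counters for all overlapping regions; the names, region layout, and 30% default are illustrative assumptions (the 30% threshold itself comes from the text).

```c
/* Sketch of the region-formation trigger: on buffer overflow,
 * samples are attributed to monitored regions; the rest fall in
 * the unmonitored code region (UCR). All names are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, end; } region_t;  /* [start, end) */

/* Returns true when the fraction of samples outside every
 * monitored region exceeds `threshold` (e.g. 0.30), which would
 * trigger region formation from the UCR samples. */
bool should_form_regions(const uint64_t *pc, size_t n,
                         const region_t *regions, size_t nregions,
                         double threshold) {
    size_t ucr = 0;
    for (size_t i = 0; i < n; i++) {
        bool monitored = false;
        for (size_t j = 0; j < nregions; j++) {
            if (pc[i] >= regions[j].start && pc[i] < regions[j].end) {
                monitored = true;
                break;
            }
        }
        if (!monitored)
            ucr++;                       /* sample lands in the UCR */
    }
    return (double)ucr / (double)n > threshold;
}
```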
3.2. Local Phase Detection
Hind et al. in [14] showed that there are two main pa-
rameters that need to be defined for the abstract problem
of phase shift detection, viz. Granularity and Similarity.
Granularity is defined as the partitioning of a profile into
atomic, fixed length units of comparison. Similarity is de-
fined as a boolean function that computes if two of these
units are similar. As local phase detection works on smaller
code regions (for example loops), granularity is the small-
est number of cycles required to execute a single iteration of
the code region. Since our method is based on CPU cycle
sampling, any reasonable sampling period would be greater
than the number of cycles required to go through the code
region once. Our measure of similarity is detailed next.
3.2.1. Similarity using Pearson's Co-efficient of Correlation
In local phase detection, phase change analysis is carried
out for each region, independent of other regions. This is
needed to track deviation of region characteristics for re-
optimization or de-optimization. We define a local phase
Figure 8. r values when comparing two distributions with the original distribution. The x-axis can be thought of as instructions in a region and the height of the graph as the number of cycle samples for that instruction in a given interval. (Series: original; bottleneck shifted by one instruction, r = -0.056; more samples but similar frequencies, r = 0.998.)
Figure 9. Regions in 181.mcf (13134-133d4, 142c8-14318, and 146f0-14770).
change as a significant change in the distribution of sam-
ples for a code region. To find this change, the algorithm
computes the Pearson’s co-efficient of correlation between
the current set of samples and the stable set of samples for
the target region. It is usually symbolized as r and can have
a value anywhere between -1 and 1. It is computed as:
$$ r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}\;\sqrt{\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2}} $$
where n is the number of instructions in the region, x_i is the number of samples in the stable set for instruction i, and y_i is the number of samples in the current set for instruction i.
The larger the absolute value of r, the stronger the association between the two variables. Thus, a correlation of
1 or -1 implies the two variables are perfectly correlated,
meaning that you can predict the values of one variable from
the values of the other with perfect accuracy. However an
r value of 0 corresponds to a lack of correlation between
the sets. A positive correlation means that high values of
one variable correspond to high values of the other vari-
able, and low values of one are paired with low values of
Figure 10. Pearson's co-efficient of correlation for three regions in mcf (146f0-14770, 142c8-14318, and 13134-133d4).
the other. A negative correlation means that high values on
one are paired with low values of the other variable. For
our purpose, this anti-correlation is also a change of behav-
ior. Thus negative values and values close to zero indicate
a phase change. This metric has two important properties
that are seen from Figure 8. When the bottleneck shifts by
one instruction (for example, some other load starts missing
in the cache), the r value is close to zero indicating a phase
change. Thus this metric can detect shifts in instruction bot-
tlenecks quickly. When sampling is used to determine hot
instructions, there are inherent variations in the number of
samples obtained. However, if the behavior is still the same,
meaning the same instructions are hot but distribution of
samples across instructions has changed by a constant fac-
tor, then a phase change should not be triggered. The third
line in the graph and the corresponding r value of 0.998
shows that Pearson’s metric will not detect this as a phase
change.
The impact of local phase detection can be seen by look-
ing at the region chart for 181.mcf and the r values for these
regions. Analyzing regions in 181.mcf (Figure 9) we find
that a region 146f0-14770 (code region between address
146f0 and address 14770) takes up a large fraction of ex-
ecution time in the beginning and it diminishes towards the
end, whereas another region (142c8-14318 ) initially takes
a small fraction of execution but later executes for a larger
fraction. This application also shows a transition from non-
periodic to periodic behavior of regions. Figure 10 plots the
Pearson’s co-efficient of correlation for these regions. This
plot shows that in spite of changes in the fraction of execu-
tion time of regions, the samples show very high correlation
between intervals. Thus, local analysis suggests no phase
changes in 181.mcf, whereas globally phase changes are
seen every time the distribution of samples across regions
changes. Such analysis can detect a longer stable phase and
consequently increase the possibility of improving perfor-
mance.
Another advantage of local phase detection is that it al-
Figure 11. Regions in 254.gap (7ba2c-7ba78 and 8d25c-8d314) and stability of regions using Pearson's co-efficient of correlation.
lows us to isolate the effects of unstable regions. To illus-
trate this point, let us look at r values of some regions in
254.gap (Figure 11). 254.gap has a large number of phase
changes at low sampling periods and few phase changes as
sampling period increases. When no samples are obtained
in an interval for a region, the value of r returned is the same
as during the last interval. Initially, we see a value of 0 for
both regions, as these regions do not execute from the start.
Also the code region 7ba2c-7ba78 is more stable than the
other region. From this we can see that some regions may
be more stable than others, and isolating phase detection for
each code region can result in more stable phase detection.
A state diagram in Figure 12 explains the phase detection
mechanism employed. Initially, a phase starts in the unsta-
ble state. After two intervals, an r-value can be computed.
If this value is greater than a threshold rt, then the state
changes to less unstable. As long as the phase is unstable or
less unstable, the stable set of samples is updated to reflect
the current set of samples. Once the phase stabilizes, the
stable set of samples is frozen till the state moves to an un-
stable state. In the state diagram shown below, the stable set
of samples is denoted as the previous histogram (prev hist)
and the current set of samples is denoted by curr hist. The
dotted lines indicate the state transitions that correspond
to a phase change (moving from unstable to stable or vice
versa). For this work we have used a value of 0.8 for rt.
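One plausible reading of the transition logic can be sketched as follows. This is an assumption-laden simplification: the timer on the less unstable state and the prev_hist updates are elided, and the exact transition structure should be taken from Figure 12, not from this sketch.

```c
/* Sketch of the per-region phase state machine (cf. Figure 12).
 * The three states and the rt = 0.8 threshold come from the text;
 * the less-unstable-state timer is omitted for brevity. */
enum phase { UNSTABLE, LESS_UNSTABLE, STABLE };

/* Advance one interval: r is the correlation between the region's
 * stable (prev_hist) and current (curr_hist) histograms. While the
 * returned state is not STABLE, the caller is expected to refresh
 * the stable set of samples from the current set. */
enum phase phase_step(enum phase s, double r, double rt) {
    switch (s) {
    case UNSTABLE:
        return (r >= rt) ? LESS_UNSTABLE : UNSTABLE;
    case LESS_UNSTABLE:
        return (r >= rt) ? STABLE : UNSTABLE;
    case STABLE:
    default:
        return (r >= rt) ? STABLE : UNSTABLE;
    }
}
```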
3.2.2. Effect of Sampling Period
In section 2.3 we observed that the centroid scheme was
sensitive to change in sampling period. Local phase detec-
tion is less sensitive to change in sampling period. This
is because periodic jumps between regions observed at low
sampling periods will not affect local behavior. It may hap-
pen that during some intervals samples are not obtained for
a region, while samples are obtained for other intervals. Lo-
cal phase detection will not try to compute region character-
Figure 12. State diagram for phase detection using co-efficient of correlation. (Edges are labeled with the conditions r >= rt and r < rt; transitions marked prev_hist ← curr_hist update the stable set from the current set of samples.)
istics when no samples are obtained for the region for that
interval. Variations in the number of samples (and the corresponding deviation in the centroid) between two intervals are caused when the sampling period is not aligned to the periodicity of executing code in the program. This, too, does not affect
local phase detection as the Pearson’s metric will not trigger
phase changes due to variations in the number of samples.
We repeated the experiment of changing the sampling peri-
ods for various benchmarks to see the effect on local phase
detection. Figure 13 shows the number of phase changes for
a few code regions that contribute to a significant percentage
of program execution. Since every region has a phase detec-
tor, it is not possible to list the number of phase changes for
all regions. It is possible that some regions with few sam-
ples show repeated phase changes. However these locally
unstable regions do not affect the stability of other regions.
We observe that only a few regions change phases repeat-
edly using local phase detection. One region in 254.gap has
120 phase changes. However this is a short lived region
with few samples and is included in the graph to show that
there are some regions that are unstable while there are oth-
ers that are very stable. 188.ammp is an aberration, showing a large number of phase changes at low sampling periods. We observed that its r value lies just below the threshold. Since the region is very large, the granularity assumption breaks down. We are investigating the use
of a threshold based on the size of region. Figure 14 shows
that the percentage of time spent in stable phase is quite
high for most benchmarks and all sampling periods. Lo-
cal phase detection minimizes the dependency on sampling
period, and can be more robust for dynamic optimization.
3.2.3. Cost of Local Phase Detection
Region monitoring has a higher cost than the centroid based
phase detection scheme. The cost comes mainly from the
Figure 13. Sensitivity to sampling period for a selected set of benchmark programs using local phase detection. The graph shows selected benchmarks that have a large number of phase changes at low sampling periods using the centroid scheme. r1, r2, etc. correspond to regions 1, 2, etc. selected by the dynamic optimizer.
Figure 14. Percentage of time spent in stable phase for selected benchmarks for three sampling periods.
distribution of samples to different regions, and the phase
detection for each region. Figure 15 compares the cost of
local phase detection versus the cost of global phase detec-
tion using the centroid approach. As expected, local phase
detection is tens to hundreds of times slower than global
phase detection. Even so, for most applications, the cost is
less than 1% of execution time. Some programs like gcc,
crafty, parser, vortex, ammp and apsi have a significant per-
centage of cost for local phase detection. This cost is due
to the large number of regions monitored by these applica-
tions. However, the cost of local phase detection does not
translate into direct slowdown, as region monitoring is performed on a separate thread that may run in parallel with the main thread on a separate core. There is a lot of scope
for reducing the cost. The algorithm used for distributing samples walks a list of regions to determine the region to which each sample should be attributed. A faster way to do the same is to use interval trees [18]. This reduces the cost from O(n) to O(log(n) + k), where n is the number of regions and k is the number of regions that contain the sample. Figure 16 shows the cost of the interval tree
scheme normalized to the cost of using lists. For bench-
marks with a small number of regions, the cost is slightly
higher from the increased cost of maintaining the tree. As
the number of regions increases (e.g. gcc, crafty, fma3d,
parser and bzip) cost is significantly reduced. There are
other ways of reducing cost, such as region pruning, where infrequently executing and relatively cold regions are removed from the region monitor. These will be explored in the future.
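To illustrate why the lookup dominates the cost, a simplified version of the sample-attribution search can be sketched as follows. Unlike the interval trees of [18], this sketch assumes regions are disjoint and sorted by start address, so an ordinary O(log n) binary search suffices; handling the k overlapping regions per sample is what motivates the interval tree. All names are our assumptions.

```c
/* Sketch of sample attribution by binary search, replacing the
 * O(n) list walk. Assumes disjoint regions sorted by start. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, end; } region_t;  /* [start, end) */

/* Returns the index of the region containing pc, or -1 if the
 * sample falls in the unmonitored code region. */
long find_region(const region_t *r, size_t n, uint64_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < r[mid].start)
            hi = mid;                    /* pc lies before this region */
        else if (pc >= r[mid].end)
            lo = mid + 1;                /* pc lies after this region */
        else
            return (long)mid;            /* pc inside [start, end) */
    }
    return -1;
}
```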
3.2.4. Performance of Local Phase Detection
This section presents the potential of local phase detection
using the SPEC CPU2000 benchmark suite and the proto-
type ADORE/Sparc runtime optimization system. As re-
ported in [13] only a few applications in CPU2000 have
significant data cache misses on the UltraSPARC IV+ ma-
chine. The working set of SPEC CPU2000 benchmarks is
relatively small compared to the cache hierarchies of the latest microprocessors. Furthermore, with six years of optimization tuning, only a small number of CPU2000 programs still
suffer from serious cache misses on the latest processors.
Proceedings of the International Symposium on Code Generation and Optimization (CGO’06) 0-7695-2499-0/06 $20.00 © 2006 IEEE
[Chart: log-scale overhead (0.01% to 100%) of global and local phase detection for each SPEC CPU2000 benchmark, with a line showing how many times slower region monitoring is than global PD; data labels include 2.65%, 5.80%, 3.20% and 9.70%.]
Figure 15. Cost of region monitoring and a comparison to the centroid based global phase detector. The bars represent the overhead of the centroid and the region monitoring schemes and the line represents the factor by which region monitoring is more expensive than the centroid scheme.
[Chart: per-benchmark interval-tree overhead factor, ranging from 0.000 to 1.600.]
Figure 16. Improvement from using interval trees instead of simple lists. The bars are normalized to the overhead obtained using the simple list scheme.
However, we have observed much greater performance impact
of our work on the candidate programs for the next genera-
tion of benchmarks, and we expect to report results on them
when the new benchmarks are released. We will report per-
formance (Figure 17) on a subset of these benchmarks viz.
181.mcf, 191.fma3d, 254.gap and 172.mgrid in this section.
[13] reported speedups of 35% for mcf, 8% for mgrid, 9%
for gap and 16% for fma3d at a sampling period of 800K.
For the first test we increased the sampling rate to 100,000
cycles/interrupt and ran the benchmarks with the system
proposed in [13] on the UltraSPARC IV+ machine, which
we will call RTOORIG. RTO with local phase detection is
called RTOLPD. We observed that 254.gap shows about
9.5% performance improvement over RTOORIG using lo-
cal phase detection. This is because LPD was able to detect
a stable loop while the global scheme kept detecting phase
changes on slight shifts in centroid. 172.mgrid does not
show much performance difference as many phase changes
are not detected at high sampling rates in mgrid. Con-
versely, at low sampling rates (1,500,000 cycles/interrupt),
181.mcf stays in an unstable phase for a long time and
RTOLPD can achieve a 23.84% speedup over RTOORIG.
We observed that, during most of the execution where periodic region
changes occur, the phase remained unstable. In 254.gap, the
low sampling rate caused phase to remain unstable for some
time resulting in a 4.9% performance improvement over
RTOORIG. For mcf, the speedup obtained from LPD increases
as the sampling period is increased because, at low
sampling rates, GPD spends more time in an unstable phase.
We saw this earlier in Figure 2. For gap the reverse is true,
as we observe a decreased speedup from LPD at higher
sampling periods. GPD becomes more stable at high sampling
periods, reducing the benefit from LPD. Nevertheless,
we see that in general LPD outperforms GPD by detect-
ing fewer phase changes independent of sampling period.
The original RTO circumvents the sampling rate problems
by empirically determining a suitable sampling rate and not
unpatching traces when the phase changes. It uses phase detection
to determine change in working set and always assumes
that optimizations will be beneficial. To do a fair compar-
ison, we modified the original RTO to unpatch traces on a
phase change, so that optimizations could be re-evaluated
using performance characteristics of the original code when
the phase stabilizes. Although we have demonstrated some
benefit in the CPU2000 suite, we believe its performance
potential will be greater on the next generation benchmarks
and real applications where more performance loss due to
cache misses can be expected.
[Chart: speedup of RTOLPD over RTOORIG for 181.mcf, 172.mgrid, 254.gap and 191.fma3d, ranging from -5% to 30%, at sampling periods of 100K, 800K and 1.5M cycles/interrupt.]
Figure 17. Speedup of RTOLPD over RTOORIG, where the original RTO uses the centroid scheme and unpatches traces when the phase is unstable. Three sampling periods have been used, viz. 100K, 800K and 1.5M cycles/interrupt.
4. Related Work
A lot of research has been done over the years to detect
phases in an effort to deploy runtime optimization and to
reduce simulation time. Interpretation based systems like
Dynamo [2] and instrumentation based systems like Dy-
namoRIO [3] use simple counters associated with branches
and trace exits to quickly add new code to the code cache
and thus reduce profiling overhead. They do not perform
any computation associated with determination of stable
phase. In essence, their strategy is similar to our region
formation strategy that aims to maximize code coverage.
Dynamic optimization in virtual machines is vital to their
performance, but few systems look for stability prior
to dynamic compilation and optimization. Kistler [9]
describes a continuous optimization framework that looks
for stable phases in un-optimized code or phase changes
in previously optimized code before optimizing code. Sta-
ble phases are detected by computing a similarity value be-
tween two intervals for profile data. Profile data can be
instrumentation or sampling based and includes sampled
instructions, procedure and basic block execution counts.
Profile data is global and is not attributed to regions for sim-
ilarity computation. Adl-Tabatabai et al. in [10] describe
a hardware monitoring scheme for dynamic data prefetch-
ing in ORP Java virtual machine [11]. The system uses
metrics such as changes in delinquent loads and increases in the
rate of high-latency cache misses to detect phase changes.
[17] presents an evaluation of various global phase detec-
tion schemes by tweaking parameters affecting those algo-
rithms. It also introduces the concept of adaptive profile
window resizing and shows that it is more accurate than
constant windows.
Detecting program phases is also important for reducing
simulation time. This is achieved by simulating only those
areas of program execution that correlate with overall program
behavior. Sherwood et al. in [4] and [5] present a
scheme that uses the execution frequencies of basic blocks
in an interval to generate a signature for that interval. Com-
paring this signature with the signature obtained from whole
program execution, it is possible to find parts of program
execution that have high correlation to the whole program.
Stable phases and phase changes can be detected by com-
paring signatures from consecutive intervals. Their scheme
is well suited to offline classification of phases as it uses
expensive clustering algorithms. Although they aggregate
samples within a basic block it is different from local phase
detection as a single phase stable/unstable value is com-
puted using basic block execution frequencies from all ba-
sic blocks executed in that interval. Dhodapkar et al. [1]
[8] use working set analysis to trigger phase changes. If
the current working set of instructions, branches, or proce-
dures changes it is indicative of a phase change. The main
difference between Dhodapkar’s approach and Sherwood’s
scheme is that the latter also takes into account the fre-
quencies of execution whereas the earlier scheme only de-
termines if the instruction/branch/procedure was executed
in the current interval. Hardware mechanisms have been
proposed by Sherwood et al. [6] and Merten et al. [7]
for detecting phase changes to support runtime optimiza-
tion. Sherwood’s scheme is a translation of the basic block
vector scheme to hardware using an accumulator table to
count basic block execution frequencies. Merten et al. col-
lect branch profile information in a hardware table and use
execution frequencies of branches to determine candidate
branches. A phase change is detected if the percentage of
execution of non-candidate branches crosses a predefined
threshold. Again both these schemes are global schemes
and are effective at determining when new code is executed.
Kim et al. in [16] present a hardware structure for detecting
phase changes at the granularity of loops and procedures.
Their algorithm is similar to the working set approach but
the signature is computed using a set of stable patterns of
execution of loops and procedures. To the best of our knowledge,
there is no other work that looks at phase detection at
the local level.
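As a concrete illustration of the signature-comparison schemes above, a basic block vector comparison between consecutive intervals can be sketched as follows (a sketch only; the threshold is illustrative, not the value used by Sherwood et al.):

```python
def bbv_signature(block_counts):
    """Turn raw per-interval basic block execution counts into a
    normalized frequency vector (the interval's signature)."""
    total = sum(block_counts.values())
    if total == 0:
        return {}
    return {block: count / total for block, count in block_counts.items()}


def signature_distance(sig_a, sig_b):
    """Manhattan distance between two signatures: 0 means identical
    distributions, 2 means completely disjoint code."""
    blocks = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(b, 0.0) - sig_b.get(b, 0.0)) for b in blocks)


def phase_changed(prev_counts, curr_counts, threshold=0.5):
    """Flag a phase change when consecutive intervals differ by more
    than the (illustrative) threshold."""
    return signature_distance(
        bbv_signature(prev_counts), bbv_signature(curr_counts)) > threshold
```

Note that a single distance is computed from all basic blocks executed in the interval; this global aggregation is precisely what distinguishes these schemes from local phase detection.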
5. Conclusion and Future Work
Metrics for phase detection using global behavior show
sensitivity to sampling period, interval size and threshold
values that can result in frequent and unnecessary phase
changes. We showed that by restricting the scope of phase
detection to smaller, independent regions, we can minimize
such effects and have a more robust and effective phase
detection. Furthermore, current dynamic optimization sys-
tems do not take advantage of inter-region behaviors and
thus it is not very important to detect change in inter-region
behavior. Applying our technique of local phase detection,
we were able to reduce the number of phase changes even
at very low sampling periods. We also showed that this resulted
in a phase remaining stable for a larger percentage of
time, allowing greater opportunity for optimization. We
found that although the cost of local phase detection, and
region monitoring in general, is higher than a single met-
ric computation approach, it is still within acceptable lim-
its for most benchmarks evaluated. In addition, this cost
is not on the critical path of program execution since re-
gion monitoring can occur in a separate thread, in paral-
lel to the main program. With region monitoring, we have
shown that our prototype dynamic optimization system on
SPARC is less sensitive to the parameters and effects of
sampling. At certain sampling periods, it significantly out-
performs the existing global phase detection approach on
several SPEC CPU2000 benchmarks running on the Ultra-
SPARC IV+ system. By performing region monitoring we
can improve phase detection and create a framework for de-
veloping a feedback mechanism to monitor deployed opti-
mizations. This would allow us to undo ineffective opti-
mizations deployed to a region.
In the future, we want to investigate cheaper means of measuring
similarity, as Pearson’s metric involves time-consuming
calculations. We also want to look at other ways
of reducing the cost of region monitoring by selecting the
more important regions to be monitored and enhancing our
region search algorithms. As stated earlier, region moni-
toring allows us to implement a feedback mechanism and
we are looking at metrics and algorithms to estimate perfor-
mance impact of deployed optimizations.
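As a point of reference for the cost involved, Pearson's correlation over two equal-length per-interval sample vectors can be sketched as below (the vector contents are illustrative; the exact quantities correlated follow our similarity definition earlier in the paper):

```python
import math


def pearson_similarity(x, y):
    """Pearson correlation coefficient between two equal-length
    per-interval sample vectors: values near 1.0 indicate the two
    intervals behave alike, suggesting a stable phase."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (std_x * std_y)
```

Each comparison costs O(n) multiplications plus two square roots per vector pair, which is what motivates the search for cheaper similarity measures.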
References
[1] Dhodapkar, A. S. and Smith, J. E. Comparing Program
Phase Detection Techniques. In International Symposium on
Microarchitecture, 2003
[2] Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: a trans-
parent dynamic optimization system. In Programming Lan-
guage Design and Implementation, 2000.
[3] Bruening, D., Garnett, T., and Amarasinghe, S. An infras-
tructure for adaptive dynamic optimization. In Code Gen-
eration and Optimization: Feedback-Directed and Runtime
Optimization, 2003.
[4] Sherwood, T., Perelman, E., and Calder, B. Basic Block Dis-
tribution Analysis to Find Periodic Behavior and Simulation
Points in Applications. In Parallel Architectures and Compi-
lation Techniques, 2001
[5] Sherwood, T., Perelman, E., Hamerly, G., and Calder, B.
Automatically characterizing large scale program behavior.
In Architectural Support For Programming Languages and
Operating Systems, 2002
[6] Sherwood, T., Sair, S., and Calder, B. Phase tracking and
prediction. In International Symposium on Computer Archi-
tecture, 2003
[7] Merten, M. C., Trick, A. R., George, C. N., Gyllenhaal, J.
C., and Hwu, W. W. A hardware-driven profiling scheme for
identifying program hot spots to support runtime optimiza-
tion. In International Symposium on Computer Architecture,
1999.
[8] Dhodapkar, A. S. and Smith, J. E. Managing multi-
configuration hardware via dynamic working set analysis.
International Symposium on Computer Architecture, 2002.
[9] Kistler, T. and Franz, M. Continuous program optimization:
A case study. ACM Trans. Program. Lang. Syst. Vol. 25,
issue 4, Jul. 2003.
[10] Adl-Tabatabai, A., Hudson, R. L., Serrano, M. J., and Subra-
money, S. Prefetch injection based on hardware monitoring
and object metadata. In Programming Language Design and
Implementation, 2004.
[11] Cierniak, M., Eng, M., Glew, N., Lewis, B., and Stichnoth,
J. The Open Runtime Platform: a flexible high-performance
managed runtime environment: Research Articles. Concur-
rency and Computation: Practice and. Experience. Vol. 17,
issue 5-6, Apr. 2005.
[12] Lu, J., Chen, H., Yew P-C., Hsu, W-C. Design and Imple-
mentation of a Lightweight Dynamic Optimization System.
Journal of Instruction-Level Parallelism, Volume 6, 2004
[13] Lu, J., Das, A., Hsu, W-C., Nguyen, K., Abraham, S. G. Dy-
namic Helper Threaded Prefetching on the Sun UltraSPARC
CMP Processor, In International Symposium on Microarchi-
tecture, 2005.
[14] Hind, M.J., Rajan, V.T., Sweeney, P.F. Phase shift detection:
A problem classification, IBM Research Report RC-22887,
2003
[15] W.K. Chen, S. Lerner, R. Chaiken, and D. Gillies. Mojo:
A dynamic optimization system. In FDDO-04, pages 81-90,
2000.
[16] Kim, J. Kodakara S., Hsu W-C., Lilja D. J., Yew, P-C. Dy-
namic Code Region (DCR) Based Program Phase Tracking
and Prediction for Dynamic Optimizations, Lecture Notes in
Computer Science, Volume 3793, Oct 2005
[17] Nagpurkar P., Hind, M. J., Krintz, C., Sweeney P.F., Rajan,
V.T. Online Phase Detection Algorithms, In Code Genera-
tion and Optimization, 2006
[18] Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C. In-
troduction to Algorithms. McGraw Hill, 2003