Region Monitoring for Local Phase Detection in Dynamic Optimization Systems∗
Abhinav Das, Jiwei Lu, Wei-Chung Hsu
University of Minnesota
{adas,jiwei,hsu}@cs.umn.edu
Abstract
Dynamic optimization relies on phase detection for two important functions: (1) to detect a change in the code working set, and (2) to detect a change in performance characteristics that can affect optimization strategy. Current prototype runtime optimization systems [12][13] compare aggregate metrics like CPI over fixed time intervals to detect a change in working set and a change in performance. While simple and cost-effective, these metrics are sensitive to sampling rate and interval size. A phase detection scheme that computes performance metrics by aggregating the performance of individually optimized regions can be misled by some regions impacting aggregate metrics adversely. In this paper, we investigate the benefits and limitations of using aggregate metrics for phase detection, which we call Global Phase Detection (GPD). We present a new model to detect change in working set and propose that the scope of phase detection be limited to within the candidate regions for optimization. By associating phase detection with individual regions we can isolate the effects of regions that are inherently unstable. This approach, which we call Local Phase Detection (LPD), shows improved performance on several benchmarks even when global phase detection is not able to detect stable phases.
1. Introduction
Over the last few years dynamic optimization systems
[2][3][12][13][15] have been explored by researchers to in-
crease the performance of native binaries or application exe-
cutables with intermediate representation at runtime. Their
idea is to exploit runtime profiles to select “hot-code” as
the target of optimization. Some dynamic optimization sys-
tems collect runtime profiles by exploiting the hardware
support for performance monitoring in modern microprocessors.

∗This work is supported by a grant from the U.S. National Science Foundation (EIA-0220021). We would also like to thank the runtime optimization team at Sun Microsystems for their help and support for this research.

Phase detection is an important component of those
dynamic optimizers. They sample hardware performance
counters to determine frequently executing instructions and
performance bottlenecks associated with those instructions.
A frequently executing region surrounding this instruction
is selected as a unit of optimization. When the working set
of the program changes, it is important to determine new
regions of execution to exploit new optimization opportu-
nities. It is also important to find out if the performance
characteristics of this region change over time as this could
affect the optimization strategy. Phase detection, as imple-
mented in current sampling-based prototype runtime opti-
mization systems [12][13], is what we call Global Phase
Detection (GPD). In GPD, global metrics like average pro-
gram counter value are used to find new code regions, and
other metrics of performance, such as CPI and DPI (Data
Cache Misses per Instruction), are used to determine if the
program performance characteristics have changed. We call
this global phase detection as program characteristics are
computed by taking into account information from all re-
gions that executed during the profiled interval. However,
most optimizations are only deployed on units of optimiza-
tion called traces, loops or regions. It seems prudent to
have a phase detection scheme that computes metrics for
these units of optimization. However, the transition into
newer regions of execution is important and must be de-
tected. This leads to the idea that region formation and
phase detection can be decoupled. The task of region forma-
tion is to look for changes in the working set and the task of
phase detection is to analyze the performance of each code
region individually and trigger phase changes when the re-
gion’s performance characteristics change. Optimizations
are based on the characteristics observed during execution
of the region. Once these optimized traces are deployed,
it is essential to track their performance for two reasons.
Firstly, the region may change behavior, affecting optimization strategy. Secondly, the optimization deployed may not
be beneficial. This is possible due to the speculative nature
of some optimizations like data pre-fetching. Thus moni-
toring the performance of a region becomes important for
detecting change in region characteristics and to determine
Proceedings of the International Symposium on Code Generation and Optimization (CGO’06) 0-7695-2499-0/06 $20.00 © 2006 IEEE
Figure 1. State transition diagram for the centroid based phase detector. Variables TH1 to TH4 are threshold values that can be set by parameters. BOS is the band of stability.
the impact of deployed optimizations.
In this paper, we present a scheme that decouples change
in working set detection and phase detection and we pro-
pose that phase detection is needed at the level of re-
gions of optimization and not at the whole program level.
Our scheme for region monitoring achieves the dual goal
of phase detection and monitoring of deployed optimiza-
tions, which we call self-monitoring. The region monitor-
ing framework also incorporates new code detection. The
outline of this paper is as follows: In section 2 we present
an analysis of the existing centroid approach as it is used
in current dynamic optimization systems and discuss its
advantages and limitations. Section 3 introduces the re-
gion monitoring framework, followed by strategies for local
phase detection and its cost/performance analysis. Section
4 discusses related work in phase detection for fast simula-
tion and runtime optimization and we conclude in section 5
with a discussion on future work.
2. The Centroid Approach for Global Phase Detection
The centroid approach of phase detection has been suc-
cessfully applied in prototype dynamic optimization sys-
tems. In this section we will discuss the scheme in greater
detail and analyze its benefits and limitations.
Figure 2. Relation between regions and phase changes for 181.mcf.
2.1. Algorithm
The premise of the centroid scheme is that the average
value of program counter obtained by sampling the program
counter at periodic time intervals does not deviate much.
When it does deviate, it often indicates a phase change.
Figure 1 shows the state transition diagram of the centroid
based phase detection scheme. Program counter samples
are obtained by periodic sampling and stored in a buffer.
On every buffer overflow, the mean (centroid) of all the
program counter samples in the buffer is computed. The
phase detector stores a history of such centroids and com-
putes the Band of Stability (BOS) using the expectation value
(E) and standard deviation value (SD) of these centroids.
The band extends from (E - SD) to (E + SD). The drift of
the new centroid from this band of stability is computed as
Δ. For example, if the current centroid is within the band
of stability then the value of Δ is 0; otherwise Δ will have a
positive value equal to the distance of the centroid from the
closer of the upper bound or the lower bound. The value of
Δ is used to transition between states. A timer is associated
with the less stable state before transitioning to the stable
state. It is used to ensure that the centroid maintains a low
Δ for some time before triggering a stable phase. Before
transitioning into the less stable phase, a check is also made to ensure that the band of stability is not too thick, by requiring that SD is less than 1/6 of E. Thresholds TH1 to TH4, as currently used in the phase detector, have been determined empirically as 1%, 5%, 10%, and 67% respectively.
2.2. Efficacy of Approach
Using the centroid is a very lightweight and effective
technique to detect phase changes, although it has some lim-
itations that are detailed in subsequent sections. To track
Figure 3. Number of phase changes for different sampling periods. Three sampling periods, 45K, 450K and 900K cycles/interrupt, were used. Short running benchmarks were excluded from this analysis.
Figure 4. Percentage of time spent in stable phase for different sampling periods.
the efficacy of the centroid scheme we used the prototype
runtime optimization system detailed in [13] and plot the
phase changes (shown as a thick line) along with the distri-
bution of cycle samples across code regions in Figure 2 for
181.mcf. This figure plots the number of program counter
samples, collected from periodic sampling, for each code
region over the length of execution of the benchmark. Each
region is assigned a different color and y-axis is the number
of samples obtained for each region. We set the buffer size
to 2032 samples so one would expect the height of stacked
area chart to be constant at 2032. However when samples
are obtained from overlapping regions, we increment coun-
ters for all overlapping regions causing the height of the area
chart to be greater than 2032. Whenever the graph shows a
shift in regions, a phase change should be observed. The
thick line indicates unstable phase at high value and stable
phase when it is 0. If the distribution of samples across re-
gions changes, the centroid should shift to indicate the cor-
responding change in behavior. However, a phase change
may not be detected due to the location of the new region
being very close to the previous region or the centroid of
the new regions may remain the same. Phase detection for
181.mcf is able to track changes in the pattern of execution.
However, we also find that the phase remains unstable for
quite some time towards the end of execution. The 181.mcf
benchmark highlights a weakness of the centroid scheme: a frequently changing but periodic code execution pattern can cause the phase detector to remain in the unstable state for a long time. Since the dynamic optimizer does not attempt to optimize during unstable phases, this can miss optimization opportunities.
2.3. Limitations
The centroid based scheme and global phase detection,
in general, is sensitive to sampling period, interval size and
thresholds used in the phase detector. Interval size is usu-
ally determined by the sampling period, but can be inde-
pendently set. We found that changing the sampling period
can have drastic effects on the centroid based phase detec-
tion. For example, a program with a periodically chang-
ing centroid will detect a phase change on periodic shifts
in centroids. However, the program is really in one phase
but periodically shifting between regions. So a larger sam-
pling period may not show shifts in centroids. The effect
of changing the sampling period was measured for selected
SPEC CPU2000 benchmark programs and we observed that
the number of phase changes was greatly increased at low
Figure 5. Region chart for 187.facerec. The thick line indicates phase change whenever it has a high value and phase stable whenever it is 0.
Figure 6. Median of percentage of samples not monitored by the region monitor. The line indicates the threshold of 30% used in this study.
sampling periods. Figure 3 shows the number of phase
changes for 3 different sampling periods. The binaries used
were compiled using baseline optimization options and ex-
ecuted on an UltraSPARC III+ machine. Figure 4 shows
the percentage of time spent in stable phase for the three
sampling periods used. Spending more time in stable phase
can translate into more opportunity for optimization. How-
ever, percentage of time spent in stable phase does not have
a correlation with the number of phase changes. For ex-
ample, 181.mcf spends more time in stable phase when the
sampling period is small and there are many phase changes.
This is due to fast response time at low sampling periods.
Since 181.mcf is quite stable within a phase, we do not see
phase changes at large sampling periods. At the other ex-
treme 187.facerec spends a large percentage of time in un-
stable phase due to frequent phase changes. Looking at the
region chart (Figure 5) for facerec, we see that there are
few actual phase changes. Facerec periodically switches between two sets of regions, which causes frequent detected phase changes. Variations caused by sampling can also result in frequent phase changes at low sampling periods.
2.4. Motivating Region Monitoring
When a set of samples (or aggregates computed from this
set) in an interval are compared to another set (or aggre-
gates), variations from sampling and periodicity of work-
ing set can influence phase detection greatly as we saw in
the earlier section. A phase is usually defined as a stable working set of instructions, basic blocks, or procedures, with or without execution frequency information. This working
set can include many regions that are independently opti-
mized. It is dependent upon the interval size selected and
will change with different window sizes. As seen earlier,
such dependency can have a negative impact on the effec-
tiveness of phase detection. This problem can be minimized
if the scope of phase detection is reduced from looking for
global phase changes to phase changes within a small code
region. In other words, phase detection will be looking for
working set changes within a bounded region and not the
whole program. In doing so, we have to associate a phase
detection mechanism with each selected region. There are
several reasons that favor detecting phase changes locally.
It is very possible that some loops exhibit highly stable be-
Figure 7. Percentage of samples in UCR for 254.gap and 186.crafty, obtained on every buffer overflow.
havior, but other loops may have haphazard behavior. Most
optimizers limit optimization scope to a code region1, thus
it is not necessary to determine change of whole-program characteristics; it suffices to determine change of characteristics within the monitored region. By doing this, regions of stable behavior are not penalized due to unstable
regions that coexist with these stable regions. In addition
to phase change detection, monitoring region characteris-
tics allows us to verify the benefit of optimizations. In the
next section, we present the region monitoring framework
and show that it can be used for phase detection.
3. Region Monitoring
Region monitoring consists of two parts viz. (1) Region
Formation and (2) Phase Detection and Self Monitoring.
The region formation algorithm is responsible for detecting
changes in the working set of the program, building regions for the new code, and adding them to the region monitor for phase detection and self monitoring.
3.1. Region Formation
Whenever the user buffer overflows, performance
counter samples are distributed across regions. There may
be samples that do not fall in any monitored region. We at-
tribute these samples to a single unmonitored region, which
we call the unmonitored code region (UCR). When the per-
centage of samples in the UCR is above a threshold, region
formation is triggered and it builds regions from these sam-
ples. For this work we use the region building mechanism in
[13]. In the current prototype systems, regions are primarily loops that have significant samples within an interval of sampling.

1 Inter-region optimizations are rarely performed due to the complexity of analysis in a runtime optimization system. However, with the help of compiler annotations, future dynamic optimization systems may deploy inter-region optimizations, such as instruction cache prefetching for the next incoming phase.

In the future, regions can also include functions
or traces. Once samples are distributed across regions, each
region can be analyzed by the local phase detector to deter-
mine locally stable phase and other performance character-
istics. It is possible that a code region cannot be built around
some frequently executing instructions. For example, a re-
gion formation algorithm that looks only for loops within
procedures may find samples in a procedure that is called in
a loop. Since procedure boundaries are crossed, no regions
are formed. Such instructions could form a large fraction
of samples and the threshold should be appropriately ad-
justed depending on the percentage of such samples. Figure
6 shows the median of the percentage of samples in the un-
monitored region. For most programs, this is below 30%.
However there are a few programs that have > 30% sam-
ples in UCR.
Figure 7 shows the percentage of samples in UCR for
two such benchmarks over time. Even after frequent region
formation triggers in 254.gap, the percentage of samples in
UCR remains high. 186.crafty triggers region formation on every buffer overflow, but the percentage of samples in UCR does not decrease. This is due to a current limitation of the region building algorithm. A better region building algorithm
can reduce the percentage of samples in the UCR signifi-
cantly. There is no fundamental limitation to building inter-
procedural regions and if such a region building algorithm
is used it can greatly reduce the number of region forma-
tion triggers. We also plan to use compiler annotations to
improve region formation in the future.
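The region-formation trigger described in this section can be sketched as follows. For simplicity, this version attributes each sample to at most one region, whereas the prototype increments counters for all overlapping regions; the names, region layout, and 30% default are illustrative assumptions (the 30% threshold itself comes from the text).

```c
/* Sketch of the region-formation trigger: on buffer overflow,
 * samples are attributed to monitored regions; the rest fall in
 * the unmonitored code region (UCR). All names are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, end; } region_t;  /* [start, end) */

/* Returns true when the fraction of samples outside every
 * monitored region exceeds `threshold` (e.g. 0.30), which would
 * trigger region formation from the UCR samples. */
bool should_form_regions(const uint64_t *pc, size_t n,
                         const region_t *regions, size_t nregions,
                         double threshold) {
    size_t ucr = 0;
    for (size_t i = 0; i < n; i++) {
        bool monitored = false;
        for (size_t j = 0; j < nregions; j++) {
            if (pc[i] >= regions[j].start && pc[i] < regions[j].end) {
                monitored = true;
                break;
            }
        }
        if (!monitored)
            ucr++;                       /* sample lands in the UCR */
    }
    return (double)ucr / (double)n > threshold;
}
```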
3.2. Local Phase Detection
Hind et al. in [14] showed that there are two main pa-
rameters that need to be defined for the abstract problem
of phase shift detection, viz. Granularity and Similarity.
Granularity is defined as the partitioning of a profile into
atomic, fixed length units of comparison. Similarity is de-
fined as a boolean function that computes if two of these
units are similar. As local phase detection works on smaller
code regions (for example loops), granularity is the small-
est number of cycles required to execute a single iteration of
the code region. Since our method is based on CPU cycle
sampling, any reasonable sampling period would be greater
than the number of cycles required to go through the code
region once. Our measure of similarity is detailed next.
3.2.1. Similarity using Pearson's Co-efficient of Correlation
In local phase detection, phase change analysis is carried
out for each region, independent of other regions. This is
needed to track deviation of region characteristics for re-
optimization or de-optimization. We define a local phase
Figure 8. r values when comparing two distributions with the original distribution. The x-axis can be thought of as instructions in a region and the height of the graph as the number of cycle samples for that instruction in a given interval. (Series: original; bottleneck shifted by one instruction, r = -0.056; more samples but similar frequencies, r = 0.998.)
Figure 9. Regions in 181.mcf (13134-133d4, 142c8-14318, and 146f0-14770).
change as a significant change in the distribution of sam-
ples for a code region. To find this change, the algorithm
computes the Pearson’s co-efficient of correlation between
the current set of samples and the stable set of samples for
the target region. It is usually symbolized as r and can have
a value anywhere between -1 and 1. It is computed as:
$$ r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}\;\sqrt{\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2}} $$
where n is the number of instructions in the region, x_i is the number of samples in the stable set for instruction i, and y_i is the number of samples in the current set for instruction i.
The larger the absolute value of r, the stronger the association between the two variables. Thus, a correlation of
1 or -1 implies the two variables are perfectly correlated,
meaning that you can predict the values of one variable from
the values of the other with perfect accuracy. However an
r value of 0 corresponds to a lack of correlation between
the sets. A positive correlation means that high values of
one variable correspond to high values of the other vari-
able, and low values of one are paired with low values of
Figure 10. Pearson's co-efficient of correlation for three regions in mcf (146f0-14770, 142c8-14318, and 13134-133d4).
the other. A negative correlation means that high values on
one are paired with low values of the other variable. For
our purpose, this anti-correlation is also a change of behav-
ior. Thus negative values and values close to zero indicate
a phase change. This metric has two important properties
that are seen from Figure 8. When the bottleneck shifts by
one instruction (for example, some other load starts missing
in the cache), the r value is close to zero indicating a phase
change. Thus this metric can detect shifts in instruction bot-
tlenecks quickly. When sampling is used to determine hot
instructions, there are inherent variations in the number of
samples obtained. However, if the behavior is still the same,
meaning the same instructions are hot but distribution of
samples across instructions has changed by a constant fac-
tor, then a phase change should not be triggered. The third
line in the graph and the corresponding r value of 0.998
shows that Pearson’s metric will not detect this as a phase
change.
The impact of local phase detection can be seen by look-
ing at the region chart for 181.mcf and the r values for these
regions. Analyzing regions in 181.mcf (Figure 9) we find
that a region 146f0-14770 (code region between address
146f0 and address 14770) takes up a large fraction of ex-
ecution time in the beginning and it diminishes towards the
end, whereas another region (142c8-14318 ) initially takes
a small fraction of execution but later executes for a larger
fraction. This application also shows a transition from non-
periodic to periodic behavior of regions. Figure 10 plots the
Pearson’s co-efficient of correlation for these regions. This
plot shows that in spite of changes in the fraction of execu-
tion time of regions, the samples show very high correlation
between intervals. Thus, local analysis suggests no phase
changes in 181.mcf, whereas globally phase changes are
seen every time the distribution of samples across regions
changes. Such analysis can detect a longer stable phase and
consequently increase the possibility of improving perfor-
mance.
Another advantage of local phase detection is that it al-
Figure 11. Regions in 254.gap (7ba2c-7ba78 and 8d25c-8d314) and stability of regions using Pearson's co-efficient of correlation.
lows us to isolate the effects of unstable regions. To illus-
trate this point, let us look at r values of some regions in
254.gap (Figure 11). 254.gap has a large number of phase
changes at low sampling periods and few phase changes as
sampling period increases. When no samples are obtained
in an interval for a region, the value of r returned is the same
as during the last interval. Initially, we see a value of 0 for
both regions, as these regions do not execute from the start.
Also the code region 7ba2c-7ba78 is more stable than the
other region. From this we can see that some regions may
be more stable than others, and isolating phase detection for
each code region can result in more stable phase detection.
A state diagram in Figure 12 explains the phase detection
mechanism employed. Initially, a phase starts in the unsta-
ble state. After two intervals, an r-value can be computed.
If this value is greater than a threshold rt, then the state
changes to less unstable. As long as the phase is unstable or
less unstable, the stable set of samples is updated to reflect
the current set of samples. Once the phase stabilizes, the
stable set of samples is frozen till the state moves to an un-
stable state. In the state diagram shown below, the stable set
of samples is denoted as the previous histogram (prev hist)
and the current set of samples is denoted by curr hist. The
dotted lines indicate the state transitions that correspond
to a phase change (moving from unstable to stable or vice
versa). For this work we have used a value of 0.8 for rt.
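One plausible reading of the transition logic can be sketched as follows. This is an assumption-laden simplification: the timer on the less unstable state and the prev_hist updates are elided, and the exact transition structure should be taken from Figure 12, not from this sketch.

```c
/* Sketch of the per-region phase state machine (cf. Figure 12).
 * The three states and the rt = 0.8 threshold come from the text;
 * the less-unstable-state timer is omitted for brevity. */
enum phase { UNSTABLE, LESS_UNSTABLE, STABLE };

/* Advance one interval: r is the correlation between the region's
 * stable (prev_hist) and current (curr_hist) histograms. While the
 * returned state is not STABLE, the caller is expected to refresh
 * the stable set of samples from the current set. */
enum phase phase_step(enum phase s, double r, double rt) {
    switch (s) {
    case UNSTABLE:
        return (r >= rt) ? LESS_UNSTABLE : UNSTABLE;
    case LESS_UNSTABLE:
        return (r >= rt) ? STABLE : UNSTABLE;
    case STABLE:
    default:
        return (r >= rt) ? STABLE : UNSTABLE;
    }
}
```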
3.2.2. Effect of Sampling Period
In section 2.3 we observed that the centroid scheme was
sensitive to change in sampling period. Local phase detec-
tion is less sensitive to change in sampling period. This
is because periodic jumps between regions observed at low
sampling periods will not affect local behavior. It may hap-
pen that during some intervals samples are not obtained for
a region, while samples are obtained for other intervals. Lo-
cal phase detection will not try to compute region character-
Figure 12. State diagram for phase detection using co-efficient of correlation. (Edges are labeled with the conditions r >= rt and r < rt; transitions marked prev_hist ← curr_hist update the stable set from the current set of samples.)
istics when no samples are obtained for the region for that
interval. Variations in the number of samples (and the corresponding deviation in the centroid) between two intervals are caused when the sampling period is not aligned to the periodicity of executing code in the program. This, too, does not affect
local phase detection as the Pearson’s metric will not trigger
phase changes due to variations in the number of samples.
We repeated the experiment of changing the sampling peri-
ods for various benchmarks to see the effect on local phase
detection. Figure 13 shows the number of phase changes for
a few code regions that contribute to a significant percentage
of program execution. Since every region has a phase detec-
tor, it is not possible to list the number of phase changes for
all regions. It is possible that some regions with few sam-
ples show repeated phase changes. However these locally
unstable regions do not affect the stability of other regions.
We observe that only a few regions change phases repeat-
edly using local phase detection. One region in 254.gap has
120 phase changes. However this is a short lived region
with few samples and is included in the graph to show that
there are some regions that are unstable while there are oth-
ers that are very stable. 188.ammp is an aberration, showing a large number of phase changes at low sampling periods. We observed that its r value lies just below the threshold. Since the region is very large, the granularity assumption breaks down. We are investigating the use
of a threshold based on the size of region. Figure 14 shows
that the percentage of time spent in stable phase is quite
high for most benchmarks and all sampling periods. Lo-
cal phase detection minimizes the dependency on sampling
period, and can be more robust for dynamic optimization.
3.2.3. Cost of Local Phase Detection
Region monitoring has a higher cost than the centroid based
phase detection scheme. The cost comes mainly from the
Figure 13. Sensitivity to sampling period for a selected set of benchmark programs using local phase detection. The graph shows selected benchmarks that have a large number of phase changes at low sampling periods using the centroid scheme. r1, r2, etc. correspond to regions 1, 2, etc. selected by the dynamic optimizer.
Figure 14. Percentage of time spent in stable phase for selected benchmarks for three sampling periods.
distribution of samples to different regions, and the phase
detection for each region. Figure 15 compares the cost of
local phase detection versus the cost of global phase detec-
tion using the centroid approach. As expected, local phase
detection is tens to hundreds of times slower than global
phase detection. Even so, for most applications, the cost is
less than 1% of execution time. Some programs like gcc,
crafty, parser, vortex, ammp and apsi have a significant per-
centage of cost for local phase detection. This cost is due
to the large number of regions monitored by these applica-
tions. However, the cost of local phase detection does not
translate into direct slowdown, as region monitoring is performed on a separate thread that may run in parallel with the main thread on a separate core. There is a lot of scope
for reducing the cost. The algorithm used for distributing samples walks a list of regions to determine the region to which each sample should be attributed. A faster way to do the same is to use interval trees [18]. This reduces the cost from O(n) to O(log(n) + k), where n is the number of regions and k is the number of regions that contain the sample. Figure 16 shows the cost of the interval tree
scheme normalized to the cost of using lists. For bench-
marks with a small number of regions, the cost is slightly
higher from the increased cost of maintaining the tree. As
the number of regions increases (e.g. gcc, crafty, fma3d,
parser and bzip) cost is significantly reduced. There are
other ways of reducing cost, such as region pruning, where infrequently executing and relatively cold regions are removed from the region monitor. These will be explored in the future.
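To illustrate why the lookup dominates the cost, a simplified version of the sample-attribution search can be sketched as follows. Unlike the interval trees of [18], this sketch assumes regions are disjoint and sorted by start address, so an ordinary O(log n) binary search suffices; handling the k overlapping regions per sample is what motivates the interval tree. All names are our assumptions.

```c
/* Sketch of sample attribution by binary search, replacing the
 * O(n) list walk. Assumes disjoint regions sorted by start. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, end; } region_t;  /* [start, end) */

/* Returns the index of the region containing pc, or -1 if the
 * sample falls in the unmonitored code region. */
long find_region(const region_t *r, size_t n, uint64_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < r[mid].start)
            hi = mid;                    /* pc lies before this region */
        else if (pc >= r[mid].end)
            lo = mid + 1;                /* pc lies after this region */
        else
            return (long)mid;            /* pc inside [start, end) */
    }
    return -1;
}
```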
3.2.4. Performance of Local Phase Detection
This section presents the potential of local phase detection
using the SPEC CPU2000 benchmark suite and the proto-
type ADORE/Sparc runtime optimization system. As re-
ported in [13] only a few applications in CPU2000 have
significant data cache misses on the UltraSPARC IV+ ma-
chine. The working set of SPEC CPU2000 benchmarks is
relatively small compared to the cache hierarchies of the latest microprocessors. Furthermore, with six years of optimization tuning, only a small number of CPU2000 programs still
suffer from serious cache misses on the latest processors.
Proceedings of the International Symposium on Code Generation and Optimization (CGO’06) 0-7695-2499-0/06 $20.00 © 2006 IEEE
[Chart: log-scale overhead (0.01% to 100%) of global and local phase detection for each SPEC CPU2000 benchmark, with a line showing how many times slower region monitoring is than global PD; data labels include 2.65%, 5.80%, 3.20% and 9.70%.]
Figure 15. Cost of region monitoring and a comparison to the centroid based global phase detector. The bars represent the overhead of the centroid and the region monitoring schemes and the line represents the factor by which region monitoring is more expensive than the centroid scheme.
[Chart: per-benchmark interval-tree overhead factor, ranging from 0.000 to 1.600.]
Figure 16. Improvement from using interval trees instead of simple lists. The bars are normalized to the overhead obtained using the simple list scheme.
However, we have observed much greater performance impact
of our work on the candidate programs for the next genera-
tion of benchmarks, and we expect to report results on them
when the new benchmarks are released. We will report per-
formance (Figure 17) on a subset of these benchmarks viz.
181.mcf, 191.fma3d, 254.gap and 172.mgrid in this section.
[13] reported speedups of 35% for mcf, 8% for mgrid, 9%
for gap and 16% for fma3d at a sampling period of 800K.
For the first test we increased the sampling rate to 100,000
cycles/interrupt and ran the benchmarks with the system
proposed in [13] on the UltraSPARC IV+ machine, which
we will call RTOORIG. RTO with local phase detection is
called RTOLPD. We observed that 254.gap shows about
9.5% performance improvement over RTOORIG using lo-
cal phase detection. This is because LPD was able to detect
a stable loop while the global scheme kept detecting phase
changes on slight shifts in centroid. 172.mgrid does not
show much performance difference as many phase changes
are not detected at high sampling rates in mgrid. Con-
versely, at low sampling rates (1,500,000 cycles/interrupt),
181.mcf stays in an unstable phase for a long time and
RTOLPD can achieve a 23.84% speedup over RTOORIG.
We observed that, during most of the execution where periodic region
changes occur, the phase remained unstable. In 254.gap, the
low sampling rate caused phase to remain unstable for some
time resulting in a 4.9% performance improvement over
RTOORIG. For mcf, the speedup obtained from LPD increases
as the sampling period is increased because, at low
sampling rates, GPD spends more time in an unstable phase.
We saw this earlier in Figure 2. For gap the reverse is true,
as we observe a decreased speedup from LPD at higher
sampling periods. GPD becomes more stable at high sampling
periods, reducing the benefit from LPD. Nevertheless,
we see that in general LPD outperforms GPD by detect-
ing fewer phase changes independent of sampling period.
The original RTO circumvents the sampling rate problems
by empirically determining a suitable sampling rate and not
unpatching traces when the phase changes. It uses phase detection
to determine change in working set and always assumes
that optimizations will be beneficial. To do a fair compar-
ison, we modified the original RTO to unpatch traces on a
phase change, so that optimizations could be re-evaluated
using performance characteristics of the original code when
the phase stabilizes. Although we have demonstrated some
benefit in the CPU2000 suite, we believe its performance
potential will be greater on the next generation benchmarks
and real applications where more performance loss due to
cache misses can be expected.
[Chart: speedup of RTOLPD over RTOORIG for 181.mcf, 172.mgrid, 254.gap and 191.fma3d, ranging from -5% to 30%, at sampling periods of 100K, 800K and 1.5M cycles/interrupt.]
Figure 17. Speedup of RTOLPD over RTOORIG, where the original RTO uses the centroid scheme and unpatches traces when the phase is unstable. Three sampling periods have been used, viz. 100K, 800K and 1.5M cycles/interrupt.
4. Related Work
A lot of research has been done over the years to detect
phases in an effort to deploy runtime optimization and to
reduce simulation time. Interpretation based systems like
Dynamo [2] and instrumentation based systems like Dy-
namoRIO [3] use simple counters associated with branches
and trace exits to quickly add new code to the code cache
and thus reduce profiling overhead. They do not perform
any computation associated with determination of stable
phase. In essence, their strategy is similar to our region
formation strategy that aims to maximize code coverage.
Dynamic optimization in virtual machines is vital to their
performance, but few systems look for stability prior
to dynamic compilation and optimization. Kistler [9]
describes a continuous optimization framework that looks
for stable phases in un-optimized code or phase changes
in previously optimized code before optimizing code. Sta-
ble phases are detected by computing a similarity value be-
tween two intervals for profile data. Profile data can be
instrumentation or sampling based and includes sampled
instructions, procedure and basic block execution counts.
Profile data is global and is not attributed to regions for sim-
ilarity computation. Adl-Tabatabai et al. in [10] describe
a hardware monitoring scheme for dynamic data prefetch-
ing in ORP Java virtual machine [11]. The system uses
metrics such as changes in delinquent loads and increases in the
rate of high-latency cache misses to detect phase changes.
[17] presents an evaluation of various global phase detec-
tion schemes by tweaking parameters affecting those algo-
rithms. It also introduces the concept of adaptive profile
window resizing and shows that it is more accurate than
constant windows.
Detecting program phases is also important for reducing
simulation time. This is achieved by simulating only those
areas of program execution that correlate with overall program
behavior. Sherwood et al. in [4] and [5] present a
scheme that uses the execution frequencies of basic blocks
in an interval to generate a signature for that interval. Com-
paring this signature with the signature obtained from whole
program execution, it is possible to find parts of program
execution that have high correlation to the whole program.
Stable phases and phase changes can be detected by com-
paring signatures from consecutive intervals. Their scheme
is well suited to offline classification of phases as it uses
expensive clustering algorithms. Although they aggregate
samples within a basic block it is different from local phase
detection as a single phase stable/unstable value is com-
puted using basic block execution frequencies from all ba-
sic blocks executed in that interval. Dhodapkar et al. [1]
[8] use working set analysis to trigger phase changes. If
the current working set of instructions, branches, or proce-
dures changes it is indicative of a phase change. The main
difference between Dhodapkar’s approach and Sherwood’s
scheme is that the latter also takes into account the fre-
quencies of execution whereas the earlier scheme only de-
termines if the instruction/branch/procedure was executed
in the current interval. Hardware mechanisms have been
proposed by Sherwood et al. [6] and Merten et al. [7]
for detecting phase changes to support runtime optimiza-
tion. Sherwood’s scheme is a translation of the basic block
vector scheme to hardware using an accumulator table to
count basic block execution frequencies. Merten et al. col-
lect branch profile information in a hardware table and use
execution frequencies of branches to determine candidate
branches. A phase change is detected if the percentage of
execution of non-candidate branches crosses a predefined
threshold. Again both these schemes are global schemes
and are effective at determining when new code is executed.
Kim et al. in [16] present a hardware structure for detecting
phase changes at the granularity of loops and procedures.
Their algorithm is similar to the working set approach but
the signature is computed using a set of stable patterns of
execution of loops and procedures. To the best of our knowledge,
there is no other work that looks at phase detection at
the local level.
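As a concrete illustration of the signature-comparison schemes above, a basic block vector comparison between consecutive intervals can be sketched as follows (a sketch only; the threshold is illustrative, not the value used by Sherwood et al.):

```python
def bbv_signature(block_counts):
    """Turn raw per-interval basic block execution counts into a
    normalized frequency vector (the interval's signature)."""
    total = sum(block_counts.values())
    if total == 0:
        return {}
    return {block: count / total for block, count in block_counts.items()}


def signature_distance(sig_a, sig_b):
    """Manhattan distance between two signatures: 0 means identical
    distributions, 2 means completely disjoint code."""
    blocks = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(b, 0.0) - sig_b.get(b, 0.0)) for b in blocks)


def phase_changed(prev_counts, curr_counts, threshold=0.5):
    """Flag a phase change when consecutive intervals differ by more
    than the (illustrative) threshold."""
    return signature_distance(
        bbv_signature(prev_counts), bbv_signature(curr_counts)) > threshold
```

Note that a single distance is computed from all basic blocks executed in the interval; this global aggregation is precisely what distinguishes these schemes from local phase detection.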
5. Conclusion and Future Work
Metrics for phase detection using global behavior show
sensitivity to sampling period, interval size and threshold
values that can result in frequent and unnecessary phase
changes. We showed that by restricting the scope of phase
detection to smaller, independent regions, we can minimize
such effects and have a more robust and effective phase
detection. Furthermore, current dynamic optimization sys-
tems do not take advantage of inter-region behaviors and
thus it is not very important to detect change in inter-region
behavior. Applying our technique of local phase detection,
we were able to reduce the number of phase changes even
at very low sampling periods. We also showed that this resulted
in a phase remaining stable for a larger percentage of
time, allowing greater opportunity for optimization. We
found that although the cost of local phase detection, and
region monitoring in general, is higher than a single met-
ric computation approach, it is still within acceptable lim-
its for most benchmarks evaluated. In addition, this cost
is not on the critical path of program execution since re-
gion monitoring can occur in a separate thread, in paral-
lel to the main program. With region monitoring, we have
shown that our prototype dynamic optimization system on
SPARC is less sensitive to the parameters and effects of
sampling. At certain sampling periods, it significantly out-
performs the existing global phase detection approach on
several SPEC CPU2000 benchmarks running on the Ultra-
SPARC IV+ system. By performing region monitoring we
can improve phase detection and create a framework for de-
veloping a feedback mechanism to monitor deployed opti-
mizations. This would allow us to undo ineffective opti-
mizations deployed to a region.
In the future, we want to investigate cheaper means of measuring
similarity, as Pearson’s metric involves time-consuming
calculations. We also want to look at other ways
of reducing the cost of region monitoring by selecting the
more important regions to be monitored and enhancing our
region search algorithms. As stated earlier, region moni-
toring allows us to implement a feedback mechanism and
we are looking at metrics and algorithms to estimate perfor-
mance impact of deployed optimizations.
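As a point of reference for the cost involved, Pearson's correlation over two equal-length per-interval sample vectors can be sketched as below (the vector contents are illustrative; the exact quantities correlated follow our similarity definition earlier in the paper):

```python
import math


def pearson_similarity(x, y):
    """Pearson correlation coefficient between two equal-length
    per-interval sample vectors: values near 1.0 indicate the two
    intervals behave alike, suggesting a stable phase."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (std_x * std_y)
```

Each comparison costs O(n) multiplications plus two square roots per vector pair, which is what motivates the search for cheaper similarity measures.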
References
[1] Dhodapkar, A. S. and Smith, J. E. Comparing Program
Phase Detection Techniques. In International Symposium on
Microarchitecture, 2003
[2] Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: a trans-
parent dynamic optimization system. In Programming Lan-
guage Design and Implementation, 2000.
[3] Bruening, D., Garnett, T., and Amarasinghe, S. An infras-
tructure for adaptive dynamic optimization. In Code Gen-
eration and Optimization: Feedback-Directed and Runtime
Optimization, 2003.
[4] Sherwood, T., Perelman, E., and Calder, B. Basic Block Dis-
tribution Analysis to Find Periodic Behavior and Simulation
Points in Applications. In Parallel Architectures and Compi-
lation Techniques, 2001
[5] Sherwood, T., Perelman, E., Hamerly, G., and Calder, B.
Automatically characterizing large scale program behavior.
In Architectural Support For Programming Languages and
Operating Systems, 2002
[6] Sherwood, T., Sair, S., and Calder, B. Phase tracking and
prediction. In International Symposium on Computer Archi-
tecture, 2003
[7] Merten, M. C., Trick, A. R., George, C. N., Gyllenhaal, J.
C., and Hwu, W. W. A hardware-driven profiling scheme for
identifying program hot spots to support runtime optimiza-
tion. In International Symposium on Computer Architecture,
1999.
[8] Dhodapkar, A. S. and Smith, J. E. Managing multi-
configuration hardware via dynamic working set analysis.
International Symposium on Computer Architecture, 2002.
[9] Kistler, T. and Franz, M. Continuous program optimization:
A case study. ACM Trans. Program. Lang. Syst. Vol. 25,
issue 4, Jul. 2003.
[10] Adl-Tabatabai, A., Hudson, R. L., Serrano, M. J., and Subra-
money, S. Prefetch injection based on hardware monitoring
and object metadata. In Programming Language Design and
Implementation, 2004.
[11] Cierniak, M., Eng, M., Glew, N., Lewis, B., and Stichnoth,
J. The Open Runtime Platform: a flexible high-performance
managed runtime environment: Research Articles. Concur-
rency and Computation: Practice and. Experience. Vol. 17,
issue 5-6, Apr. 2005.
[12] Lu, J., Chen, H., Yew P-C., Hsu, W-C. Design and Imple-
mentation of a Lightweight Dynamic Optimization System.
Journal of Instruction-Level Parallelism, Volume 6, 2004
[13] Lu, J., Das, A., Hsu, W-C., Nguyen, K., Abraham, S. G. Dy-
namic Helper Threaded Prefetching on the Sun UltraSPARC
CMP Processor, In International Symposium on Microarchi-
tecture, 2005.
[14] Hind, M.J., Rajan, V.T., Sweeney, P.F. Phase shift detection:
A problem classification, IBM Research Report RC-22887,
2003
[15] W.K. Chen, S. Lerner, R. Chaiken, and D. Gillies. Mojo:
A dynamic optimization system. In FDDO-04, pages 81-90,
2000.
[16] Kim, J. Kodakara S., Hsu W-C., Lilja D. J., Yew, P-C. Dy-
namic Code Region (DCR) Based Program Phase Tracking
and Prediction for Dynamic Optimizations, Lecture Notes in
Computer Science, Volume 3793, Oct 2005
[17] Nagpurkar P., Hind, M. J., Krintz, C., Sweeney P.F., Rajan,
V.T. Online Phase Detection Algorithms, In Code Genera-
tion and Optimization, 2006
[18] Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C. In-
troduction to Algorithms. McGraw Hill, 2003