
Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction

Benjamin C. Lee, David M. Brooks

Division of Engineering and Applied Sciences, Harvard University

Cambridge, Massachusetts

{bclee,dbrooks}@eecs.harvard.edu

Abstract

We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental challenges in microarchitectural simulation cost by reducing the number of required simulations and using simulated results more effectively via statistical modeling and inference.

Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) application-specific models to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing sensitivity to the number of samples simulated for model formulation, we find fewer than 4,000 samples from a design space of approximately 22 billion points are sufficient. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.

Categories and Subject Descriptors B.8.2 [Performance Analysis and Design Aids]; I.6.5 [Model Development]: Modeling Methodologies

General Terms Design, Experimentation, Measurement, Performance

Keywords Microarchitecture, Simulation, Statistics, Inference, Regression

1. Introduction

Efficient design space exploration is constrained by the significant computational costs of cycle-accurate simulators. These simulators provide detailed insight into application performance for a wide range of microprocessor configurations, exposing performance and power trends in the microarchitectural design space. Long simulation times require designers to constrain design studies and consider only small subsets of the full design space. However, such constraints may lead to conclusions that may not generalize to the larger space. Addressing these fundamental challenges in microarchitectural simulation methodology becomes increasingly urgent as chip multiprocessors introduce additional design parameters and exponentially increase design space size.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASPLOS'06 October 21–25, 2006, San Jose, California, USA.
Copyright © 2006 ACM 1-59593-451-0/06/0010...$5.00.

We apply regression modeling to derive simulation-free statistical inference models, requiring a small number of sampled design points in a joint microarchitecture-application design space for initial simulation. Such an approach modestly reduces detail in return for significant gains in speed and tractability. Although we consider advantages in computational cost for microarchitectural design space exploration in this paper, these models may also provide increased profiling efficiency and fast performance prediction for system software and algorithms, such as thread-scheduling for heterogeneous multi-processors.

Techniques in statistical inference and machine learning are increasingly popular for approximating solutions to intractable problems. Even for applications in which obtaining extensive measurement data is feasible, efficient analysis of this data often lends itself to statistical modeling. These approaches typically require an initial set of data for model formulation or training. The model responds to predictive queries by leveraging trends and correlations in the original data set to perform statistical inference. Regression modeling follows this predictive paradigm in a relatively cost effective manner. Once domain-specific knowledge is used to specify predictors of a response, formulating the model from observed data requires numerically solving a system of linear equations and predicting the response simply requires evaluating a linear equation. Model formulation and evaluation are computationally efficient due to well optimized numerical linear algebra libraries.

We use a modest number of simulations to obtain sample observations from a large design space. In Section 3, we describe a sampling methodology to obtain 4,000 samples drawn uniformly at random from a design space with approximately 22 billion points. Each sample maps a set of architectural and application-specific predictors to observed simulator performance and power. These samples are used to formulate regression models that predict the performance and power of previously unsampled configurations based on the same predictors.

In Section 4, we summarize a statistically rigorous approach for deriving regression models that includes (1) variable clustering, (2) association testing, (3) assessing strength of response-predictor relationships, and (4) significance testing with F-tests. These techniques ensure statistically significant architectural and application-specific parameters are used as predictors. Given a baseline model that accounts for predictor interaction and non-linearity, Section 5 presents model optimizations to improve prediction accuracy by (1) stabilizing residual variance, (2) deriving application-specific models, and (3) deriving regional models with samples most similar in architectural configuration to the predictive query.

The following summarizes experimental results of Section 5 from four different performance and power regression models formulated with 4,000 samples drawn from a joint microarchitecture-application design space with nearly 1 billion microarchitectural configurations and 22 benchmarks:

1. Performance Prediction: Application-specific models predict performance with median error as low as 4.1 percent (mean error as low as 4.9 percent). 50 to 90 percent of predictions achieve error rates of less than 10 percent depending on the application. Maximum outlier error is 20 to 33 percent.

2. Power Prediction: Regional models predict power with median error as low as 4.3 percent (mean error as low as 5.6 percent). Nearly 90 percent of predictions achieve error rates of less than 10 percent and 97 percent of predictions achieve error rates of less than 15 percent. Maximum outlier error is 24.5 percent.

3. Model Optimizations: Given a single set of predictors for both performance and power models, the model may be reformulated with different sampled observations and optimized. Application-specific models are optimal for performance prediction while regional models are optimal for power prediction.

4. Sample Size Sensitivity: Although 4,000 samples are drawn from the design space, accuracy maximizing application-specific performance models do not require more than 2,000. Additional samples may improve regional power models by improving observation density and tightening regions around predictive queries, but diminishing marginal returns in accuracy advise against many more than 3,000 samples.

Collectively, these results suggest significant potential in accurate, efficient statistical inference for the microarchitectural design space via regression models. We provide an overview of the model derivation and a detailed validation of its predictive ability. Previous work details the model derivation [8, 9].

2. Regression Theory

We apply regression modeling to efficiently obtain estimates of microarchitectural design metrics, such as performance and power. We apply a general class of models in which a response is modeled as a weighted sum of predictor variables plus random noise. Since basic linear estimates may not adequately capture nuances in the response-predictor relationship, we also consider more advanced techniques to account for potential predictor interaction and non-linear relationships. Lastly, we present standard statistical techniques for assessing model effectiveness and predictive ability.

2.1 Model Formulations

For a large universe of interest, suppose we have a subset of n observations for which values of the response and predictor variables are known. Let y = y_1, ..., y_n denote the vector of observed responses. For a particular point i in this universe, let y_i denote its response variable and x_i = x_{i,1}, ..., x_{i,p} denote its p predictors. These variables are constant for a given point in the universe. Let β = β_0, ..., β_p denote the corresponding set of regression coefficients used in describing the response as a linear function of predictors plus a random error e_i, as shown in Equation (1). Mathematically, β_j may be interpreted as the expected change in y_i per unit change in the predictor variable x_{i,j}. The e_i are independent random variables with zero mean and constant variance; E(e_i) = 0 and Var(e_i) = σ². We consider a joint microarchitecture-application universe with design metrics of interest as response variables predicted by microarchitectural configurations and application characteristics.

f(y_i) = g(x_i)\beta + e_i = \beta_0 + \sum_{j=1}^{p} \beta_j g_j(x_{ij}) + e_i    (1)

Fitting a regression model to observations, by determining the p + 1 coefficients in β, enables response prediction. The method of least squares is commonly used to identify the best-fitting model by minimizing S(β), the sum of squared deviations of predicted responses given by the model from actual observed responses.

S(\beta_0, \ldots, \beta_p) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2    (2)

S(β) may be minimized by solving a system of p + 1 partial derivatives of S with respect to β_j, j ∈ [0, p]. The solutions to this system are estimates of the coefficients in Equation (1). Furthermore, solutions to this system of linear equations may often be expressed in closed form. Closed form expressions enable using the statistical properties of these estimates to identify the significance of particular response-predictor correlations (Section 2.4).
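As a concrete illustration of this least-squares step, the short R sketch below (R is the statistical environment used in Section 3.3) fits a two-predictor linear model to synthetic data; the data frame obs and its columns perf, depth, and width are hypothetical stand-ins, not the paper's actual data set.

```r
set.seed(1)
# Hypothetical sampled observations: two predictors and a noisy linear response
obs <- data.frame(depth = sample(seq(9, 36, by = 3), 50, replace = TRUE),
                  width = sample(c(4, 8, 16), 50, replace = TRUE))
obs$perf <- 0.8 - 0.01 * obs$depth + 0.05 * obs$width + rnorm(50, sd = 0.05)

# Least-squares fit: lm() solves for the p + 1 coefficients minimizing S(beta)
fit <- lm(perf ~ depth + width, data = obs)
coef(fit)   # estimates of beta_0, beta_1, beta_2 in Equation (1)
```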

2.2 Predictor Interaction

In some cases, the effect of predictors x_{i,1} and x_{i,2} on the response cannot be separated; the effect of x_{i,1} on y_i depends on the value of x_{i,2} and vice versa. This interaction may be modeled by constructing a third predictor x_{i,3} = x_{i,1} x_{i,2} to obtain y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,1} x_{i,2} + e_i. For example, pipeline depth and L2 cache size do not impact performance independently; a smaller L2 cache leads to additional memory hazards in the pipeline. These stalls affect, in turn, instruction throughput gains from pipelining. Thus, their joint impact on performance must also be modeled.

Modeling predictor interactions in this manner makes it difficult to interpret β_1 and β_2 in isolation. After simple algebraic manipulation, we find β_1 + β_3 x_{i,2} is the expected change in y_i per unit change in x_{i,1} for a fixed x_{i,2}. The difficulties of these explicit interpretations of β for more complex models lead us to prefer more indirect interpretations of the model via its predictions.
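In R's formula notation an interaction of this form can be written directly; this hedged sketch extends the hypothetical obs data frame from the earlier example with an assumed l2_size column and an injected interaction effect.

```r
# Add a hypothetical L2-size predictor (log2 of entries) that interacts with depth
obs$l2_size <- sample(11:15, nrow(obs), replace = TRUE)
obs$perf    <- obs$perf + 0.002 * obs$depth * obs$l2_size   # injected interaction

# depth * l2_size expands to depth + l2_size + depth:l2_size, i.e.
# y = b0 + b1*depth + b2*l2_size + b3*(depth*l2_size) + e
fit_int <- lm(perf ~ depth * l2_size, data = obs)
summary(fit_int)$coefficients
```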

2.3 Non-Linearity

Basic linear regression models assume the response behaves linearly in all predictors. This assumption is often too restrictive (e.g., power increases quadratically with pipeline depth for more aggressive designs) and several techniques for capturing non-linearity may be applied. The most simple of these techniques is a polynomial transformation on predictors suspected of having a non-linear correlation with the response. However, polynomials have undesirable peaks and valleys. Furthermore, a good fit in one region of the predictor's values may unduly impact the fit in another region of values. For these reasons, we consider splines a more effective technique for modeling non-linearity.

Spline functions are piecewise polynomials used in curve fitting. The function is divided into intervals defining multiple different continuous polynomials with endpoints called knots. The number of knots can vary depending on the amount of available data for fitting the function, but more knots generally lead to better fits. Relatively simple linear splines may be inadequate for complex, highly curved relationships. Splines of higher order polynomials may offer better fits, and cubic splines have been found particularly effective [4]. Unlike linear splines, cubic splines may be made smooth at the knots by forcing the first and second derivatives of the function to agree at the knots. However, cubic splines may have poor behavior in the tails before the first knot and after the last knot [13]. Restricted cubic splines that constrain the function to be linear in the tails are often better behaved and have the added advantage of fewer terms relative to cubic splines.

The choice and position of knots are variable parameters when specifying non-linearity with splines. Stone has found the location of knots in a restricted cubic spline to be much less significant than the number of knots [13]. Placing knots at fixed quantiles of a predictor's distribution is a good approach in most datasets, ensuring a sufficient number of points in each interval.

In practice, five knots or fewer are generally sufficient for restricted cubic splines [13]. Fewer knots may be required for small data sets. As the number of knots increases, flexibility improves at the risk of over-fitting the data. In many cases, four knots offer an adequate fit of the model and are a good compromise between flexibility and loss of precision from over-fitting [4]. For larger data sets with more than 100 samples, five knots may also be a good choice.
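The authors specify restricted cubic splines with Harrell's Hmisc/Design packages; as a rough, hedged stand-in, base R's splines::ns() fits natural cubic splines, which are likewise cubic with linear tails. In this sketch (continuing the hypothetical obs data), df = 3 corresponds to a four-knot spline term with knots placed at quantiles of the predictor.

```r
library(splines)
# Natural (restricted) cubic spline on depth with 4 knots (df = 3 basis terms);
# width keeps a purely linear effect, as discussed in the text
fit_spline <- lm(perf ~ ns(depth, df = 3) + width, data = obs)
```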

2.4 Significance Testing

Although T-tests are often used to assess the significance of individual terms, it is often more useful to assess a group of terms simultaneously. Consider a model y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + e. Testing the significance of x_1 requires testing the null hypothesis H_0: β_1 = β_3 = 0 with two degrees of freedom. More generally, given two nested models, the null hypothesis states the additional predictors in the larger model have no association with the response. For example, suppose x_1 and x_2 are pipeline depth and width, respectively. We must then compare a model without depth (x_1) and its width interaction (x_1 x_2) against the full model to assess depth's significance.

The F-test is a standard statistical test for comparing two nested models using their multiple correlation statistic, R². Equation (3) computes this statistic as regression error (SSE) relative to total error (SST). R² will be zero when the error from the regression model is just as large as the error from simply using the mean to predict the response. Thus, R² is the percentage of variance in the response variable captured by the predictor variables.

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} \left( y_i - \frac{1}{n} \sum_{i=1}^{n} y_i \right)^2}    (3)

Given R² for the full model and R²_* for a model constructed by dropping terms from the full model, define the F-statistic by Equation (4), where p is the number of coefficients in the full model excluding the intercept β_0 and k is the difference in degrees of freedom between the models.

F_{k, n-p-1} = \frac{R^2 - R^2_*}{k} \times \frac{n - p - 1}{1 - R^2}    (4)

The p-value is defined as 2P(X ≥ |c|) for a random variable X and a constant c. In our analyses, X follows the F distribution with parameters k, n − p − 1 and c is the F-statistic. The p-value may be interpreted as the probability that an F-statistic value greater than or equal to the value actually observed would occur by chance if the null hypothesis were true. If this probability were extremely small, either the null hypothesis holds and an extremely rare event has occurred or the null hypothesis is false. Thus, a small p-value for an F-test of two models casts doubt on the null hypothesis and suggests the additional predictors in the larger model are statistically significant in predicting the response.
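A nested-model F-test of this kind can be run in R with anova() on two lm fits; continuing the hypothetical sketch, the test below asks whether all terms involving depth add predictive value.

```r
# Full model versus a nested model with every term involving depth dropped
fit_full  <- lm(perf ~ depth * l2_size + width, data = obs)
fit_small <- lm(perf ~ l2_size + width, data = obs)

anova(fit_small, fit_full)   # F-statistic and p-value, as in Equation (4)
```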

2.5 Assessing Fit

The model's fit to the observations used to formulate the model measures how well the model captures observed trends. Fit is usually assessed by examining residuals and the degree to which regression error contributes to the total error. Residuals, defined in Equation (5), are examined to validate three assumptions: (1) the residuals are not correlated with any predictor variable or the response predicted by the model, (2) the randomness of the residuals is the same for all predictor and predicted response values, and (3) the residuals have a normal distribution. The first two assumptions are typically validated by plotting residuals against each of the predictors and predicted responses (i.e., (e_i, x_{ij}) for each j ∈ [1, p] and (e_i, ŷ_i) for each i ∈ [1, n]) since such plots may reveal systematic deviations from randomness. The third assumption is usually validated by a quantile-quantile plot in which the quantiles of one distribution are plotted against another. Practically, this means ranking the residuals e_(1), ..., e_(n), obtaining n ranked samples from the normal distribution s_(1), ..., s_(n), and producing a scatter plot of (e_(i), s_(i)) that should appear linear if the residuals follow a normal distribution.

e_i = y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}    (5)

Fit may also be assessed by the R² statistic, where a larger R² suggests better fits for the observed data. However, a value too close to R² = 1 may indicate over-fitting, a situation in which the worth of the model is exaggerated and future observations will not agree with the model's predicted values. Over-fitting typically occurs when too many predictors are used to model relatively small data sets. A regression model is likely reliable when the number of predictors p is less than n/20, where n is the sample size [4].
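Continuing the hypothetical R sketch, the usual residual and R² diagnostics look like this (plots are drawn to the active graphics device):

```r
# Residuals versus fitted values: should show no systematic pattern
plot(fitted(fit_full), resid(fit_full))

# Quantile-quantile plot: roughly linear if the residuals are normally distributed
qqnorm(resid(fit_full)); qqline(resid(fit_full))

summary(fit_full)$r.squared   # R^2 of Equation (3); values near 1 may signal over-fitting
```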

2.6 Prediction

Once β is determined, evaluating Equation (6) for a given x_i will give the expectation of y_i and, equivalently, an estimate ŷ_i for y_i. This result follows from observing the additive property of expectations, the expectation of a constant is the constant, and the random errors have mean zero.

\hat{y}_i = E[y_i] = E\left[\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right] + E[e_i] = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}    (6)
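In the running R sketch, Equation (6) corresponds to calling predict() on a previously unsampled configuration; the query values below are hypothetical.

```r
# Predict the response for an unsampled design point (Equation (6))
query <- data.frame(depth = 18, width = 8, l2_size = 13)
predict(fit_full, newdata = query)
```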

3. Experimental Methodology

Before developing regression models to predict performance and power for any configuration within the microarchitectural design space, we must first obtain a number of observations within this space via simulation. These observations are inputs to a statistical computing package used to perform the regression analysis and formulate the models.


Set                        Parameters            Measure        Range        |Si|
S1   Depth                 Depth                 FO4            9::3::36     10
S2   Width                 Width                 insn b/w       4,8,16       3
                           L/S Reorder Queue     entries        15::15::45
                           Store Queue           entries        14::14::42
                           Functional Units      count          1,2,4
S3   Physical Registers    General Purpose (GP)  count          40::10::130  10
                           Floating-Point (FP)   count          40::8::112
                           Special Purpose (SP)  count          42::6::96
S4   Reservation Stations  Branch                entries        6::1::15     10
                           Fixed-Point/Memory    entries        10::2::28
                           Floating-Point        entries        5::1::14
S5   I-L1 Cache            I-L1 Cache Size       log2(entries)  7::1::11     5
S6   D-L1 Cache            D-L1 Cache Size       log2(entries)  6::1::10     5
S7   L2 Cache              L2 Cache Size         log2(entries)  11::1::15    5
                           L2 Cache Latency      cycles         6::2::14
S8   Main Memory           Main Memory Latency   cycles         70::5::115   10
S9   Control Latency       Branch Latency        cycles         1,2          2
S10  Fixed-Point Latency   ALU Latency           cycles         1::1::5      5
                           FX-Multiply Latency   cycles         4::1::8
                           FX-Divide Latency     cycles         35::5::55
S11  Floating-Point        FPU Latency           cycles         5::1::9      5
     Latency               FP-Divide Latency     cycles         25::5::45
S12  Memory Latency        Load/Store Latency    cycles         3::1::7      5

Table 1. Parameters within a group are varied together. A range of i::j::k denotes a set of possible values from i to k in steps of j.

3.1 Simulation Framework

We use Turandot, a generic and parameterized, out-of-order, superscalar processor simulator [10]. Turandot is enhanced with PowerTimer to obtain power estimates based on circuit-level power analyses and resource utilization statistics [1]. The modeled baseline architecture is similar to the POWER4/POWER5. The simulator has been validated against both a POWER4 RTL model and a hardware implementation. We do not use any particular feature of the simulator in our models and believe our approach may be generally applied to other simulator frameworks. We evaluate performance in billions of instructions per second (bips) and power in watts.

3.2 Benchmarks

We consider SPECjbb, a Java server benchmark, and 21 compute intensive benchmarks from SPEC2k (ammp, applu, apsi, art, bzip2, crafty, equake, facerec, gap, gcc, gzip, lucas, mcf, mesa, mgrid, perl, sixtrack, swim, twolf, vpr, wupwise). We report experimental results based on PowerPC traces of these benchmarks. The SPEC2k traces used in this study were sampled from the full reference input set to obtain 100 million instructions per benchmark program. Systematic validation was performed to compare the sampled traces against the full traces to ensure accurate representation [5].

3.3 Statistical Analysis

We use R, a free software environment for statistical computing, to script and automate the statistical analyses described in Section 2. Within this environment, we use the Hmisc and Design packages implemented by Harrell [4].

3.4 Configuration Sampling

Table 1 identifies twelve groups of parameters varied simultaneously. Parameters within a group are varied together in a conservative effort to avoid fundamental design imbalances. The range of values considered for each parameter group is specified by a set of values, S_1, ..., S_12. The Cartesian product of these sets, S = \prod_{i=1}^{12} S_i, defines the entire design space. The cardinality of this product is |S| = \prod_{i=1}^{12} |S_i| = 9.38E+08, or approximately one billion, design points. Fully assessing the performance for each of the 22 benchmarks on these configurations further scales the number of simulations to well over 20 billion.

The approach to obtaining observations from a large microprocessor design space is critical to efficient formulation of regression models. Traditional techniques that sweep design parameter values to consider all points in the large design space, S, are impossible despite continuing research in reducing simulation costs via trace sampling [12, 14]. Although these techniques reduce per simulation costs by a constant factor, they do not reduce the number of required simulations. Other studies have reduced the cardinality of these sets to an upper and lower bound [6, 15], but this approach masks performance and power trends between the bounds and precludes any meaningful statistical inference. For these reasons, sampling must occur in the design space to control the exponentially increasing number of design points as the number of parameter sets and their cardinalities increase.

We propose sampling configurations uniformly at random (UAR) from S. This approach provides observations from the full range of parameter values and enables identification of trends and trade-offs between the parameter sets. We can include an arbitrarily large number of values into a given S_i since we decouple the number of simulations from the set cardinality via random sampling. Furthermore, sampling UAR does not bias the observations toward particular configurations. Prior design space studies have considered points around a baseline configuration and may be biased toward the baseline. Our approach is different from Monte Carlo methods, which generate suitable random samples and observe the frequency of samples following some property or properties. Although we perform random sampling, we formulate regression models instead of analyzing frequency.

We report experimental results for sample sizes of up to n = 4,000 samples. Each sampled configuration is simulated with a benchmark also chosen UAR, providing one set of observed responses (performance and power) for every 5 million sets of predictors (configuration-application pairs) in the design space.
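The sketch below illustrates sampling uniformly at random from a Cartesian-product design space; it uses only a small, hypothetical subset of the parameter groups in Table 1, not the full twelve-group space.

```r
set.seed(2)
# Hypothetical subset of the parameter groups in Table 1
sets <- list(depth = seq(9, 36, by = 3),     # S1: pipeline depth (FO4)
             width = c(4, 8, 16),            # S2: pipeline width
             gpr   = seq(40, 130, by = 10),  # S3: general purpose registers
             l2    = 11:15)                  # S7: L2 size, log2(entries)

prod(sapply(sets, length))   # cardinality of this reduced design space

# Draw n configurations uniformly at random; each value is sampled independently
n <- 4000
samples <- as.data.frame(lapply(sets, function(s) sample(s, n, replace = TRUE)))
head(samples)   # each row is one configuration to simulate
```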


Performance   bips
L1 Cache      I-L1 misses, D-L1 misses, I-L2 misses, D-L2 misses
Branches      branch rate, branch stalls, branch mispredictions
Stalls        inflight, cast, dmissq, reorderq, storeq, rename, resv

Table 2. Measured application characteristics on a baseline architecture.

Processor Core
  Decode Rate           4 non-branch insns/cy
  Dispatch Rate         9 insns/cy
  Reservation Stations  FXU(40), FPU(10), LSU(36), BR(12)
  Functional Units      2 FXU, 2 FPU, 2 LSU, 2 BR
  Physical Registers    80 GPR, 72 FPR
  Branch Predictor      16k 1-bit entry BHT

Memory Hierarchy
  L1 DCache Size        32KB, 2-way, 128B blocks, 1-cy latency
  L1 ICache Size        32KB, 1-way, 128B blocks, 1-cy latency
  L2 Cache Size         2MB, 4-way, 128B blocks, 9-cy latency
  Memory                77-cy latency

Pipeline Dimensions
  Pipeline Depth        19 FO4 delays per stage
  Pipeline Width        4-decode

Table 3. Baseline architecture.

4. Model Derivation

To identify potential response-predictor relationships we first examine a number of descriptive statistics for parameters in the data set. We then formulate a regression model, accounting for predictors' primary effects, second- and third-order interactions, and non-linearities. Given an initial model, we perform significance testing to prune statistically insignificant predictors from the model. The fit of the refined model is assessed by examining its residuals.

4.1 Predictors

In addition to the 12 architectural predictors (Table 1), we also consider 15 application-specific predictors drawn from an application's characteristics (Table 2) when executing on a baseline configuration (Table 3). These characteristics may be significant predictors of performance when interacting with architectural predictors and include baseline performance (base_bips), cache access patterns, branch patterns, and sources of pipeline stalls (e.g. stalls from limits on the number of inflight instructions). For example, the performance effect of increasing the data L1 cache size will have a larger impact on applications with a high data L1 miss rate. The baseline application performance may also impact the performance effects of further increasing architectural resources. An application with no bottlenecks and high baseline performance would benefit less from additional registers compared to an application experiencing heavy register pressure. These potential interactions suggest both architectural and application-specific predictors are necessary.

Figure 1. Predictor Strength

4.2 Summary Statistics

We provide a brief overview of analyses used to identify relevant predictors for an initial performance regression model. The associated figures are omitted due to space constraints. Details are available in a technical report [8].

4.2.1 Variable Clustering

A variable clustering analysis reveals key parameter interaction by computing squared correlation coefficients as similarity measures. A larger ρ² suggests a greater correlation between variables. If overfitting is a concern, redundant predictors may be eliminated by selecting only one predictor from each cluster.

Our clustering analysis indicates L1 and L2 misses due to instruction cache accesses are highly correlated. We find the absolute number of L2 cache misses from the instruction cache to be negligible and eliminate il2miss_rate. Similarly, the branch rate is highly correlated with the number of branch induced stalls and we eliminate br_stall. All other variables are correlated to a lesser degree and need not be eliminated from consideration. We find the number of observations sufficiently large to avoid over-fitting.

4.2.2 Performance Associations

Plotting each of the predictors against the response may reveal particularly strong associations or identify non-linearities. For architectural predictors, we find pipeline depth and width strong, monotonic factors. The number of physical registers may be a significant, but non-linear, predictor. Correlations between L2, but not L1, cache size and performance also appear significant. For application-specific predictors, data cache access patterns may be good predictors of performance. We also find roughly half the stall characteristics have monotonic relationships with performance (i.e., dmissq, cast, reorderq, resv). A benchmark's observed sample and baseline performance are highly correlated.

4.2.3 Strength of Marginal Relationships

We consider the squared correlation coefficient between each predictor variable and observed performance in Figure 1 to choose the number of spline knots. A lack of fit for predictors with higher ρ² will have a greater negative impact on performance prediction. For architectural predictors, a lack of fit will be more consequential (in descending order of importance) for width, depth, physical registers, functional unit latencies, cache sizes, and reservation stations. An application's baseline performance is the most significant predictor for its performance on other configurations.

Figure 2. Residual Correlations

Figure 3. Residual Distribution

Overall, application-specific predictors are most highly correlated with observed performance. An application's interaction with the microarchitecture, not the microarchitecture itself, is the primary determinant of performance. For example, the degree to which an application is memory bound, captured by baseline cache miss rates, is much more highly correlated with performance than the cache sizes.

4.3 Model Specification and Refinement

After a first pass at removing insignificant predictors, we formulate an initial performance regression model. We model non-linearities for architectural predictors using restricted cubic splines. As shown in Figure 1, the strength of predictors' marginal relationships with performance will guide our choice in the number of knots. Predictors with stronger relationships and greater correlations with performance will use 4 knots (e.g. depth, registers) and those with weaker relationships will use 3 knots (e.g. latencies, cache sizes, reservation stations). Despite their importance, width and certain latencies do not take a sufficient number of unique values to apply splines and we consider their linear effects only. With the exception of baseline performance, for which we assign 5 knots, we do not model non-linearities for application-specific predictors to control model complexity.

We draw on domain-specific knowledge to specify interactions. We expect pipeline width to interact with register file and queue sizes. Pipeline depth likely interacts with cache sizes that impact hazard rates. We also expect the memory hierarchy to interact with adjacent levels (e.g. L1 and L2 cache size interaction) and application-specific access rates. Interactions with baseline performance account for changing marginal returns in performance from changing resource sizes. Although we capture most relevant interactions, we do not attempt to capture all significant interactions via an exhaustive search of predictor combinations. The results of Section 5 suggest such a high-level representation is sufficient.

Figure 2 plots residual quartiles for 40 groups, each with 100 observations, against median modeled performance of each group, revealing significant correlations between residuals and fitted values.¹ Residuals are larger for the smallest and largest fitted values. We apply a standard variance stabilizing square root transformation on performance to reduce the magnitude of the correlations as shown in Figure 2. The resulting model predicts the square root of performance (i.e., √y instead of y). Variance stabilization also causes residuals to follow a normal distribution more closely as indicated by the linear trend in Figure 3.

To further improve model efficiency, we consider each predictor's contribution to its predictive ability. We assess the significance of predictor p by comparing, with F-tests, the initial model to a smaller model with all terms involving p removed [8]. Although we found most variables significant with p-values less than 2.2e−16, br_lat (p-value = 0.1247), stall_inflight (p-value = 0.7285), and stall_storeq (p-value = 0.4137) appear insignificant. High p-values indicate these predictors do not significantly contribute to a better fit when included in a larger model.
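In the running hypothetical R sketch, the variance-stabilizing transformation amounts to modeling sqrt(perf) and squaring predictions to return to the original scale:

```r
# Model the square root of the response to stabilize residual variance
fit_sqrt <- lm(sqrt(perf) ~ depth * l2_size + width, data = obs)

# Predictions are squared to recover the original performance scale
predict(fit_sqrt, newdata = query)^2
```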

5. Model Evaluation

5.1 Model Variants and Optimizations

We use data sets of varying size drawn from n* < n = 4,000 random observations to formulate regression models. We refer to n* as the sample size for a particular model; each model may require different sample sizes to maximize accuracy. We arbitrarily chose 4,000 points for the initial data set, reflecting our lack of prior intuition regarding the number of samples required for accurate modeling. We subsequently assess model sensitivity to sample size, demonstrating n* < 2,000 samples are sufficient.

Separately, we obtain 100 additional random samples for predictive queries and validation against the simulator. We compare the predictive ability of four regression models, differing in specification and data used to perform the fit:

1 Residuals are defined in Equation (5)


Model            Min    1st Quartile  Median  Mean    3rd Quartile  Max
B (1k)           0.571  7.369         13.059  16.359  20.870        56.881
S (2k)           0.101  5.072         10.909  13.015  17.671        51.198
S+R (1k)         0.360  4.081         8.940   10.586  15.183        35.000
S+A (1k,ammp)    0.029  1.815         4.055   4.912   7.318         20.298
S+A (1k,equake)  0.181  4.132         7.385   8.064   11.202        20.825
S+A (1k,mesa)    0.170  5.736         10.775  10.810  15.025        33.129

Table 4. Summary of performance prediction error (percent) with specified error minimizing sample and region sizes.

Figure 4. Empirical CDF of prediction error with error minimizing sample and region sizes. Equake achieves median accuracy and is plotted for S+A (L). Prediction results for S+A (R).

• Baseline (B): Model specified in Section 4 without variance stabilization and formulated with a naive subset of n_B < n observations (e.g. first 1,000 obtained).

• Variance Stabilized (S): Model specified with a square-root transformation on the response and formulated with a naive subset of n_S < n observations.

• Regional (S+R): Model is reformulated for each query by specifying a naive subset of n_{S+R} < n observations. These observations are further reduced to include the r_{S+R} < n_{S+R} designs with microarchitectural configurations most similar to the predictive query. We refer to r_{S+R} as the region size. Similarity is quantified by the normalized Euclidean distance between two vectors of architectural parameter values, d = \sqrt{\sum_{i=1}^{p} |1 - b_i/a_i|^2} (see the sketch after this list).

• Application-Specific (S+A): We use a new set of n_A = 4,000 observations for varying microarchitectural configurations, but a fixed benchmark. An application-specific model is obtained by eliminating application-specific predictors from the general model and reformulating the model with a naive subset of n_{S+A} < n_A. We consider S+A models for six benchmarks: ammp, applu, equake, gcc, gzip, and mesa.
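As noted above, a hedged sketch of the regional selection step: for a given query it ranks sampled configurations (reusing the hypothetical samples data frame from the Section 3.4 sketch) by the normalized Euclidean distance and keeps the closest r designs before refitting.

```r
# Hypothetical predictive query over the same architectural parameters
query_cfg <- c(depth = 18, width = 8, gpr = 80, l2 = 13)

# Normalized Euclidean distance d = sqrt( sum_i |1 - b_i/a_i|^2 ) to the query
d <- apply(samples[, names(query_cfg)], 1,
           function(b) sqrt(sum((1 - b / query_cfg)^2)))

region <- samples[order(d)[1:1000], ]   # the r = 1,000 most similar designs
# The regional model is then refit on `region` only, e.g.
# fit_regional <- lm(sqrt(perf) ~ ..., data = region)   # once responses are simulated
```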

5.2 Performance Prediction

Table 4 summarizes model predictive accuracy in terms of percentage error, 100 · |y_i − ŷ_i| / y_i. Each model is presented with error minimizing sample and region sizes. We present the ammp, equake, and mesa application-specific models that achieve the greatest, median, and least accuracy, respectively. Variance stabilization reduces median error from 13.1 to 10.9 percent and regional or application-specific models may further reduce median error from 10.9 to 8.9 or 4.1 percent, respectively. Application-specific models predict performance most accurately with median error ranging from 4.1 to 10.8 percent. The spread between the mean and median suggests a number of outliers.

Figure 4L plots the empirical cumulative distribution (CDF) of prediction errors, quantifying the number of predictions with less than a particular error. The legend specifies the accuracy maximizing sample size for each model. The best performance models are application-specific. The equake-specific model is a representative S+A model, achieving the median accuracy over the six benchmarks we consider. 70 and 90 percent of equake predictions have less than 10 and 15 percent error, respectively. The flattening CDF slopes also indicate the number of outliers decreases with error.

Figure 4R considers absolute accuracy by plotting observed and predicted performance for each point in the validation set. Although predictions trend very well with actual observations, the figure suggests slight systematic biases. The model tends to over-estimate performance in the range of [0.3, 0.5] BIPS. This bias is likely due to imperfect normality of the residuals. The normal distribution of residuals is an underlying assumption to regression modeling. We initially found a significant deviation from normality and attempted a correction with a square-root transformation of the response variable. This transformation significantly reduced the magnitude of, but did not eliminate, the normality deviations [8]. Other transformations on the response or predictors may further mitigate these biases.

5.3 Performance Sensitivity Analyses

Although 4,000 observations are available for model formulation, the accuracy maximizing models use at most 2,000 observations to determine regression coefficients. We performed a sensitivity analysis for both regional and application-specific models. The regional model, with the region and sample size, has two free parameters. We find regional model accuracy generally insensitive to region size r_{S+R}, only observing a few smaller outlier errors with smaller region sizes. Similarly, larger sample sizes (n_{S+R} > 2,500) reduce the magnitude of outlier errors. Larger sample sizes increase observation density in the design space and enable tighter regions around a predictive query, thereby improving accuracy. There is little additional benefit from increasing the sample size beyond 3,000 [8].

Figure 5. S+A Performance Model Sensitivity: Varying benchmarks.

For application-specific models, increasing n_{S+A} from 1,000 to 2,500 in increments of 500 provides no significant accuracy improvements. We also consider the sensitivity of application-specific models to benchmark differences in Figure 5. While the ammp-specific model performs particularly well with 91 percent of its predictions having less than 10 percent error, the mesa-specific model is least accurate with 47 percent of its predictions having less than 10 percent error and a single outlier having more than 30 percent error. The differences in accuracy across benchmarks may result from using the same predictors deemed significant for an average of all benchmarks and simply refitting their coefficients to obtain models for each benchmark. Predictors originally dropped/retained may become significant/insignificant when a particular benchmark, and not all benchmarks, is considered.

5.4 Power Prediction

The power model uses the performance model specification, replacing only the response variable. Such a model recognizes statistically significant architectural parameters for performance prediction are likely also significant for power prediction. The power model also recognizes an application's impact on dynamic power is a function of its microarchitectural resource utilization.

Table 5 summarizes the predictive accuracy of each power model. Variance stabilization has a significant impact on accuracy, reducing median error from 22.1 to 9.3 percent. Application-specific power modeling, with a median error between 10.3 and 11.3 percent, does not contribute significantly to accuracy. Regional power modeling achieves the greatest accuracy with only 4.3 percent median error.

Figure 6L plots the empirical CDFs of power prediction errors, emphasizing the differences in regional and application-specific modeling. Variance stabilization provides the first significant reduction in error and regional modeling provides the second. Application-specific modeling appears to have no impact on overall accuracy. The most accurate regional model achieves less than 10 percent error for nearly 90 percent of its predictions. The maximum error of 24.5 percent appears to be an outlier as 96 percent of predictions have less than 15 percent error. Again, the flattening CDF slopes also indicate the number of outliers decreases with error.

Figure 6R demonstrates very good absolute accuracy, especially for low power configurations of less than 30 watts. The magnitude of prediction errors tends to increase with power and is most notable for high-power configurations greater than 100 watts. Configurations in the 100 watt region are dominated by deep pipelines for which power scales quadratically. This region in the design space contains relatively few configurations, all of which are leveraged for prediction in the regional model. Regions formed at the boundaries of the design space are often constrained by the bias toward configurations away from the boundary and, hence, produce a bias toward more conservative estimates.

5.5 Power Sensitivity Analyses

Like the performance models, we find application-specific power models show no sensitivity to the sample size n_{S+A}. The application-specific power models are also insensitive to the benchmark chosen [8]. In contrast, Figure 7L suggests region size r_{S+R} affects predictive ability. As the region size decreases from 4,000 to 1,000 in increments of 1,000, the model becomes increasingly localized around the predictive query. This locality induces shifts in the error distribution toward the upper left quadrant of the plot such that a larger percentage of predictions has smaller errors. Similarly, Figure 7R indicates the sample size n_{S+R} from which the region is drawn influences accuracy. As the sample size increases from 1,000 to 4,000, outlier error is progressively reduced from a maximum of 38 percent to a maximum of 25 percent. As with region size, we see shifts in the error distribution toward the upper left quadrant. Larger sample sizes increase observation density and enable tighter regions around a query.

5.6 Performance and Power Comparison

Compared against performance models, power models realize larger gains from variance stabilization. Although the best performance and power models achieve comparable median error rates of 4 to 5 percent, the sources of accuracy gains are strikingly different. The greatest accuracy gains in performance modeling arise from eliminating application variability (S to S+A). Given a particular architecture, performance varies significantly across applications depending on their sources of bottlenecks. Keeping the application constant in the model eliminates this variance. Regional modeling is relatively ineffective since application performance is dominated by its interaction with the microarchitecture and not the microarchitecture itself.

In contrast, power models are best optimized by specifying regions around each predictive query (S to S+R), thereby reducing microarchitectural variability. This is especially true for high power ranges in which power tends to scale quadratically with aggressive deeper pipelines, but linearly for more conservative configurations. Compared to regional performance models, regional power models are much more sensitive to region and sample sizes, probably because the regional optimization is much more effective for power prediction. Application-specific models add little accuracy since architectural configurations are the primary determinants of unconstrained power. Power scaling effects for application-specific resource utilization are relatively small. Aggressive clock gating, power gating, or DVFS techniques will likely increase the significance of application-specific predictors in power prediction. Combining the approaches of application-specific and regional modeling may capture these effects.


Model            Min    1st Quartile  Median  Mean    3rd Quartile  Max
B (1k)           0.252  8.381         22.066  39.507  55.227        244.631
S (2k)           0.168  3.796         9.316   13.163  21.849        45.004
S+R (1k)         0.076  2.068         4.303   5.6038  8.050         24.572
S+A (2k,ammp)    0.117  5.109         10.753  12.672  17.508        39.651
S+A (2k,equake)  0.112  4.806         11.332  13.190  19.141        44.429
S+A (2k,mesa)    0.021  5.150         10.316  13.491  20.233        45.158

Table 5. Summary of power prediction error (percent) with specified error minimizing sample and region sizes.

Figure 6. Empirical CDF of prediction errors with error minimizing sample and region sizes (L). Prediction results for S+R (R).

Figure 7. S+R Power Model Sensitivity: Varying r_{S+R} with fixed sample size n_{S+R} = 4,000 (L). Varying n_{S+R} with fixed region size r_{S+R} = 1,000 (R).


6. Related Work

Ipek et al. consider predicting performance of memory, core and CMP design spaces with artificial neural networks (ANN) [3]. ANNs are automated and do not require statistical analyses. Training implements gradient descent to optimize network edge weights and prediction requires evaluation of nested weighted sums of non-linear sigmoids. Leveraging statistical significance testing during model formulation, regression may be more computationally efficient. Training and prediction require numerically solving and evaluating linear systems. Furthermore, we perform regression for a substantially larger design space.

6.1 Statistical Significance Ranking

Joseph et al. derive performance models using stepwise regression, an automatic iterative approach for adding and dropping predictors from a model depending on measures of significance [6]. Although commonly used, stepwise regression has several significant biases cited by Harrell [4]. In contrast, we use domain-specific knowledge of microarchitectural design to specify non-linear effects and interaction between predictors. Furthermore, the authors consider only two values for each predictor and do not predict performance, using the models only for significance testing.

Yi et al. identify statistically significant processor parameters using Plackett-Burman design matrices [15]. Given these critical parameters, they suggest fixing all non-critical parameters to reasonable constants and performing more extensive simulation by sweeping a range of values for the critical parameters. We use various statistical techniques to identify significant parameters (Section 4), but instead of performing further simulation, we rely on regression models based on these parameters to explore the design space.

6.2 Synthetic Traces

Eeckhout et al. have studied statistical simulation in the context of workloads and benchmarks for architectural simulators [2]. Nussbaum, Karkhanis, and Smith have examined similar statistical approaches for simulating superscalar and symmetric multiprocessor systems [11, 7]. Both researchers claim detailed microarchitecture simulations for specific benchmarks are not feasible early in the design process. Instead, benchmarks should be profiled to obtain relevant program characteristics, such as instruction mix and data dependencies between instructions. A smaller synthetic benchmark is then constructed with similar characteristics.

The statistical approach we propose and the approaches proposed by Eeckhout and Nussbaum are fundamentally different. Introducing statistics into simulation frameworks reduces accuracy in return for gains in speed and tractability. While Eeckhout and Nussbaum suggest this trade-off for simulator inputs (i.e. workloads), we propose this trade-off for simulator outputs (i.e. performance and power results).

7. Conclusions and Future Work

We derive and validate performance and power regression models. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a large design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Whereas application-specific models are most accurate for performance prediction, regional models are most accurate for power prediction.

Given their accuracy, our regression models may be applied to specific parameter studies. The computational efficiency of obtaining predictions also suggests more aggressive studies previously not possible via simulation. We use the same model specification for both performance and power. Similarly, application-specific models contain the same predictors regardless of benchmark. Fine-grained model customization would require additional designer effort, but may improve accuracy. Techniques in statistical inference are necessary to efficiently handle data from large scale simulation and are particularly valuable when archives of observed performance or power data are available.

Acknowledgments

We thank Patrick Wolfe at Harvard University for his valuable feedback regarding our regression strategies. This work is supported by NSF grant CCF-0048313 (CAREER), Intel, and IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, Intel or IBM.

References

[1] D. Brooks, P. Bose, V. Srinivasan, M. Gschwind, P. G. Emma, and M. G. Rosenfield. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors. IBM Journal of Research and Development, 47(5/6), Oct/Nov 2003.

[2] L. Eeckhout, S. Nussbaum, J. Smith, and K. DeBosschere. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro, Sept/Oct 2003.

[3] E. Ipek, S. A. McKee, B. de Supinski, M. Schulz, and R. Caruana. Efficiently exploring architectural design spaces via predictive modeling. In ASPLOS-XII: Architectural Support for Programming Languages and Operating Systems, October 2006.

[4] F. Harrell. Regression modeling strategies. Springer, New York, NY, 2001.

[5] V. Iyengar, L. Trevillyan, and P. Bose. Representative traces for processor models with infinite cache. In Symposium on High Performance Computer Architecture, February 1996.

[6] P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. In Symposium on High Performance Computer Architecture, Austin, Texas, February 2006.

[7] T. Karkhanis and J. Smith. A first-order superscalar processor model. In International Symposium on Computer Architecture, June 2004.

[8] B. Lee and D. Brooks. Regression modeling strategies for microarchitectural performance and power prediction. Technical Report TR-08-06, Harvard University, March 2006.

[9] B. Lee and D. Brooks. Statistically rigorous regression modeling for the microprocessor design space. In ISCA-33: Workshop on Modeling, Benchmarking, and Simulation, June 2006.

[10] M. Moudgill, J. Wellman, and J. Moreno. Environment for PowerPC microarchitecture exploration. IEEE Micro, 19(3):9–14, May/June 1999.

[11] S. Nussbaum and J. Smith. Modeling superscalar processors via statistical simulation. In International Conference on Parallel Architectures and Compilation Techniques, Barcelona, Sept 2001.

[12] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.

[13] C. Stone. Comment: Generalized additive models. Statistical Science, 1:312–314, 1986.

[14] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In International Symposium on Computer Architecture, June 2003.

[15] J. Yi, D. Lilja, and D. Hawkins. Improving computer architecture simulation methodology by adding statistical rigor. IEEE Computer, Nov 2005.
