
Robust Design Space Modeling

Qi Guo, Carnegie Mellon University

Tianshi Chen, Institute of Computing Technology, Chinese Academy of Sciences

Zhi-Hua Zhou, Nanjing University

Olivier Temam, INRIA

Ling Li, Institute of Automation, Chinese Academy of Sciences

Depei Qian, Beihang University

Yunji Chen, Institute of Computing Technology, Chinese Academy of Sciences

Architectural design spaces of microprocessors are often exponentially large with respect to the pending processor parameters. To avoid simulating all configurations in the design space, machine learning and statistical techniques have been utilized to build regression models that characterize the relationship between architectural configurations and responses (e.g., performance or power consumption). However, this paper shows that the accuracy variability of many learning techniques over different design spaces and benchmarks can be significant enough to mislead decision-making. This clearly indicates a high risk in applying a technique that worked well on previous modeling tasks (each involving a design space, benchmark, and design objective) to a new task, which may render these otherwise powerful tools impractical.

Inspired by ensemble learning in the machine learning domain, we propose a robust framework called ELSE to reduce the accuracy variability of design space modeling. Rather than employing a single learning technique as in previous investigations, ELSE employs distinct learning techniques to build multiple base regression models for each modeling task. This is not a trivial combination of different techniques (e.g., always trusting the regression model with the smallest error). Instead, ELSE carefully maintains the diversity of the base regression models and constructs a meta-model from them that can provide accurate predictions even when the base models are far from accurate. Consequently, we reduce the number of cases in which the final prediction errors are unacceptably large. Experimental results validate the robustness of ELSE: compared with the widely used Artificial Neural Network over 52 distinct modeling tasks, ELSE reduces the accuracy variability by about 62%, and reduces the number of modeling tasks whose maximal prediction errors exceed 40% from 15 to 0. Moreover, ELSE reduces the average prediction error by 27% and 85% for the investigated MIPS and POWER design spaces, respectively.

Additional Key Words and Phrases: architectural simulation, design space modeling, ensemble learning

1. INTRODUCTION

A critical issue when designing a modern microprocessor is to explore the design space of architectural configurations, where each configuration is a combination of multiple architectural parameters. Design space exploration (DSE) is well known to be time-consuming, not only because evaluating each architectural configuration with cycle-by-cycle simulation is quite slow, but also because the number of configurations in a design space is exponentially large [Yi et al. 2006]. To accelerate design space exploration, machine learning techniques have been applied to build regression models for the design space. At the cost of simulating only a small fraction of architectural configurations, a regression model is able to learn the relationship between configurations and the corresponding responses (e.g., performance or energy). Relying on the regression model, the responses of the remaining architectural configurations can be attained without cycle-accurate simulations. Beyond efficiently conducting DSE, regression models can also be used for detailed performance analysis to identify performance issues [Ould-Ahmed-Vall and Doshi 2007] or to guide system optimization [Lee and Brooks 2010].

Traditionally, when recommending a technique for Empirical Design Space Modeling (EDSM), its effectiveness was often validated on a specific design space and only a few benchmarks, under the implicit assumption that a technique that works well on one design space will work equally well on other design spaces.


While acknowledging the effectiveness of these techniques under specific modeling tasks, we claim that this assumption in fact violates the well-known No-Free-Lunch theorem [Wolpert 1997; Wolpert and Macready 1997] in the machine learning domain, which asserts that no learning technique can universally outperform another learning technique (e.g., a totally random learning algorithm) under all possible modeling tasks. In the context of EDSM, this implies that a learning technique that has been validated to perform generally well under specific modeling tasks might perform badly once the design space, the design objective, or the application changes. As illustrated in Figure 1, the prediction error of the Artificial Neural Network (ANN), the most commonly used learning technique for EDSM [Ipek et al. 2006; Khan et al. 2007; Dubach et al. 2007; Cho et al. 2007], varies significantly over modeling tasks that differ in design objective and benchmark on the same superscalar design space.

Fig. 1. The median, third quartile, and maximal prediction error (in percentage) of ANN for a superscalar design space, where 16 different modeling tasks are considered. Although ANNs can achieve good prediction accuracy on some modeling tasks (e.g., modeling the performance of benchmark gzip on this design space), they do not universally perform well on all 16 modeling tasks. For instance, under the modeling task equake-power, the maximal error is 82.15%, which is significant enough to mislead decision-making. The evaluated design space is a POWER architecture containing nearly 1 billion design options. The main parameters of the ANN are set as follows: each ANN contains one 16-unit hidden layer with a learning rate of 0.001 and a momentum value of 0.5. Besides, 10-fold cross-validation is utilized to obtain the above results. Details of the experimental setups are elaborated in later sections.

The accuracy variability of a learning technique for EDSM is so common and significant that, when facing different modeling tasks, sticking to a single technique would be rather risky. For example, in Figure 1 the third quartile percentage error of the ANN technique is about 8% when estimating the power for benchmark equake, which indicates that architects may suffer from a prediction error above 8% for about 25% of all design configurations. However, when facing another modeling task for the MIPS architecture, the third quartile prediction error can be as low as 0.55%. Furthermore, the maximal prediction errors of 10 out of 16 modeling tasks in this design space exceed 50% of the actual responses obtained by simulation, and the largest is more than 80% for the modeling task equake-power.


Due to this considerable accuracy variability, traditional learning techniques may mislead decision-making by recommending an inappropriate architectural configuration, especially when designing application-specific processors to meet specific performance/energy requirements [Hardavellas et al. 2011].

The risk of making incorrect decisions calls for robust EDSM whose predictions over different modeling tasks (e.g., different architectural design spaces, design objectives, or benchmarks) are more stable and accurate. In this paper, inspired by a machine learning methodology called ensemble learning [Zhou 2009; 2012; Dietterich 1998], we propose a variability-reduction approach called ELSE (Ensemble Learning for design Space Exploration) to address this issue. Briefly speaking, for each modeling task of EDSM, ELSE builds several distinct base regression models and uses them to construct a meta-model that is responsible for providing the final output of each prediction. ELSE is not a trivial combination of different techniques (e.g., always trusting the regression model with the smallest error). The base models are built using different data samples, different learning techniques, or the same technique with distinct learning parameters. During this process, in addition to building the base models, the diversity of the base models is carefully maintained. Consequently, the overall meta-model constructed by ensemble learning can inherit the advantages of each base model while compensating for the defects of individual base models. Thus, it can effectively reduce the number of cases in which the final prediction errors are unacceptably large, even if the base models are far from accurate.¹ By performing more stably and accurately over different modeling tasks, ensemble learning enables robust design space modeling. According to our empirical study on 52 different modeling tasks involving both MIPS and POWER architectures, ELSE can reduce the accuracy variability of the widely used Artificial Neural Network (ANN) by about 62% with respect to the maximal prediction error. More specifically, in comparison with ANN, ELSE reduces the number of modeling tasks under which the third quartile prediction errors are larger than 6% from 21 to only 1, and the number of modeling tasks under which the maximal prediction errors are larger than 40% from 15 to only 0. Meanwhile, ELSE improves the average prediction accuracy on MIPS and POWER modeling tasks by 27% and 85% respectively, compared with the average of the best accuracies obtained by three traditional learning techniques.

The contributions of this paper are briefly summarized as follows. First, this is the first time that the accuracy variability in architectural EDSM has been systematically revealed and investigated over a large number of different modeling tasks. Second, a variability-reduction approach called ELSE is proposed to offer robust yet effective architectural EDSM. ELSE is a framework that does not rely on any specific learning technique and can therefore seamlessly help various regression techniques perform robustly under distinct modeling tasks of EDSM. Third, the advantages of ELSE have been empirically validated over distinct design spaces and benchmarks.

The rest of the paper is organized as follows. Section 2 reviews related work on existing learning techniques for DSE. Section 3 introduces the detailed experimental methodology of our empirical study. Section 4 studies the prediction accuracies of traditional learning techniques on both the MIPS and POWER design spaces. Section 5 presents the framework of ELSE. Section 6 compares ELSE with traditional techniques. Section 7 concludes the paper.

¹It is noteworthy that ensemble learning has a sound theoretical foundation [Schapire 1990; Friedman et al. 2000; Kuncheva 2002]; in particular, it is well known in the machine learning community that ensemble methods are among the most powerful techniques, both theoretically and empirically, for reducing accuracy variability [Opitz and Maclin 1999; Buja and Stuetzle 2000].


2. RELATED WORK

Learning/regression techniques are an emerging class of EDSM techniques, continually inspired by various machine learning and numerical methods.

Ipek et al. [2006] utilized ANNs to capture the sophisticated relationship between design parameters and performance. At almost the same time, Lee and Brooks [2006] used spline-based regression models to predict IPC and power consumption for microarchitectural design. As stated by Lee and Brooks [2010], spline-based regression models can achieve accuracies comparable to ANNs, and they are much more computationally efficient, although they require more human intervention and domain knowledge. Another important observation in [Lee and Brooks 2006] is that no spline function is universally promising for various modeling tasks, which serves as further crucial evidence for the accuracy variability of regression models studied in Section 4. Joseph et al. [2006] proposed to use Radial Basis Function (RBF) networks to construct a nonlinear model for performance prediction. To reveal program runtime characteristics on different architectures, Cho et al. [2007] utilized a novel wavelet model to predict workload dynamics, and the model can capture complex workload dynamics across different architectures. To reduce the simulation costs of cross-program EDSM, Khan et al. [2007] and Dubach et al. [2007] independently proposed to utilize a signature, a small set of responses attained by simulating representative architectures, when building cross-program ANN-based regression models. The experiments conducted by Dubach et al. on a unicore design space showed that the percentage errors on energy-delay range from about 5% to 50% over different benchmarks, where the signature-based regression clearly exhibited accuracy variability. Lee et al. [2008] proposed a composable performance regression (CPR) model for multiprocessors. Azizi et al. [2010] investigated an architecture-circuit co-design space to meet energy and performance requirements; to facilitate marginal cost analysis, they utilized polynomial functions as the regression models to predict performance, and empirical comparisons showed that their models are comparable with cubic spline functions regarding prediction accuracy. In addition to such learning/regression techniques used for DSE, there are also studies that use analytic models to cope with DSE. Palermo et al. [2009] proposed to use Response Surface Modeling (RSM) techniques to refine design space exploration and identify feasible configurations under design constraints (such as area, energy, and throughput). Mariani et al. [2013] also leveraged RSM in design-time architectural exploration. Paone et al. [2013] utilized an ANN-based ensemble learning technique to improve simulation speed and accuracy. In contrast, our approach mainly focuses on reducing the accuracy variability of EDSM to help traditional techniques gain better robustness in processor design.

Although a number of learning/regression techniques have been proposed for specific modeling tasks of EDSM, no study has been dedicated to comprehensively evaluating the accuracy variability of a proposed technique over a large number of distinct modeling tasks, even though the accuracy variability is quite common and significant according to the observations in this paper. When facing a new modeling task with distinct characteristics, the accuracy variability may diminish the applicability of an established technique. Rather than proposing a specific EDSM technique, our study proposes a framework for reducing the accuracy variability of EDSM techniques, which can help traditional techniques gain robustness for processor design, especially for specialized or customized heterogeneous designs. In this sense, our study is substantially different from traditional ones.


Table I. MIPS Architectural Design Space

Abbr.   Variable Parameters         Values                       Number
WIDTH   Fetch/Issue/Commit Width    4, 8                         2
ROB     ROB Size                    64-160 with a step of 8      13
IQ      Instruction Queue Size      16-80 with a step of 8       9
LSQ     Load/Store Queue Size       16-80 with a step of 8       9
RF      Register File Size          64-160 with a step of 8      13
L1IC    L1 ICache                   16, 32, 64, 128 KB           4
L1DC    L1 DCache                   16, 32, 64, 128 KB           4
L2UC    L2 UCache                   256 KB - 4 MB                5
BRAN    Branches Allowed            8, 16, 24, 32                4
PRED    Branch Predictor            1, 2, 4, 8, 16, 32 K         6
BTB     Branch Target Buffer        1, 2, 4 K                    3
RAS     RAS Size                    16, 32, 48, 64               4
ALU     ALUs                        2, 4                         2
FPU     FPUs                        2, 4                         2
Total   14 parameters                                            2.52bn


3. EXPERIMENTAL METHODOLOGY

Before presenting the details of our study, we briefly introduce our platform and experimental methodology.

3.1. Simulated Architecture and Benchmarks

Regression-based EDSM requires cycle-by-cycle simulations of a small fraction of architectural configurations. Our study focuses on the design spaces of modern superscalar architectures. We first employ a heavily modified version of the simulator SESC [Renau et al. 2005] to model a MIPS superscalar architecture, as shown in Table I. This design space contains 14 different design parameters, and the number of different configurations is more than 2 billion. For this architecture, estimations of timing and power consumption are based on Wattch [Brooks et al. 2000] and CACTI 4.0 [Tarjan et al. 2006], respectively, and we assume a 65-nm manufacturing technology for the studied processors. Besides, we also consider a POWER superscalar architecture investigated by Lee and Brooks [2006]. This architecture is simulated by Turandot [Moudgill et al. 1999], and the modeled baseline architecture is similar to the POWER4/POWER5. The studied POWER design space contains nearly 1 billion design options. More details of the simulator and the design space can be found in [Lee and Brooks 2006].

In our empirical study, we employ several benchmarks to evaluate the performance of learning techniques. For the investigated MIPS architecture, we employ 18 benchmarks from SPEC CPU2000 (ammp, applu, apsi, art, bzip2, crafty, equake, gap, gcc, gzip, mesa, mgrid, parser, swim, twolf, vortex, vpr, and wupwise), and for the investigated POWER architecture, we evaluate 8 benchmarks from SPEC CPU2000 (ammp, applu, equake, gcc, gzip, mesa, twolf) and SPECjbb (jbb). The simulated results of the investigated POWER architecture are available from Lee's homepage at http://people.duke.edu/~bcl15/software.

3.2. Architectural Modeling Tasks

The performance of EDSM depends on the characteristics of the design space, the design objective, and the concrete application. For a given design space, there can be different design objectives, including maximizing performance (e.g., Instructions Per Cycle, IPC), minimizing power consumption, and so on.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 16 of 32Transactions on Design Automation of Electronic Systems

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960

For Peer Review

A:6 Q. Guo et al.

Since the optimal configuration under a given design space and design objective may still vary with the specific application, we consider a tuple of design space, benchmark, and design objective as a modeling task. For example, a concrete modeling task would be "modeling the relationship between the parameters of the MIPS design space and the IPC of the benchmark mcf". In Section 4, the accuracy variability of three learning techniques is evaluated over a number of different modeling tasks.
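To make this formulation concrete, a modeling task can be represented as a simple tuple and the task list enumerated programmatically. The following Python sketch is only an illustration; the names (ModelingTask, mips_tasks, and so on) are ours, not part of any tool used in the paper.

    from itertools import product
    from typing import NamedTuple

    class ModelingTask(NamedTuple):
        """A modeling task: (design space, benchmark, design objective)."""
        design_space: str   # e.g., "MIPS" or "POWER"
        benchmark: str      # e.g., "mcf"
        objective: str      # e.g., "IPC" or "energy"

    # The 36 MIPS tasks of Section 4: 18 benchmarks x 2 design objectives.
    mips_benchmarks = ["ammp", "applu", "apsi", "art", "bzip2", "crafty",
                       "equake", "gap", "gcc", "gzip", "mesa", "mgrid",
                       "parser", "swim", "twolf", "vortex", "vpr", "wupwise"]
    mips_tasks = [ModelingTask("MIPS", b, o)
                  for b, o in product(mips_benchmarks, ["IPC", "energy"])]
    assert len(mips_tasks) == 36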

3.3. Base Regression Models

Our empirical study employs three classic learning techniques and evaluates their accuracy variability over different modeling tasks of EDSM. The first technique is Artificial Neural Networks (ANNs), which have been frequently utilized in EDSM [Ipek et al. 2006; Khan et al. 2007; Dubach et al. 2007]. The second technique is Support Vector Machines (SVMs), a technique based on rigorous statistical learning theory [Vapnik 1995] and validated to perform well on a complicated compiler option/architecture co-design space [Dubach et al. 2008]. In addition to ANNs and SVMs, the M5P regression tree, a variant of the decision tree for regression problems that divides the input space into a tree structure and fits linear regression models at the leaves [Wang and Witten 1997], is also involved in our evaluation. The most notable merit of M5P is that it provides an interpretable model that helps designers focus on the most critical performance bottlenecks. M5P has already been employed for performance modeling and analysis of computer systems [Ould-Ahmed-Vall et al. 2007; Liao et al. 2009; Guo et al. 2011]. In short, we use ANNs, SVMs, and M5P as the base models of ELSE because they have been widely used in performance analysis and modeling of computer systems, and their abilities to predict performance and power consumption have been well demonstrated.

In our study, ANNs, SVMs, and M5P are utilized to build three regression models for each modeling task of EDSM, with the purpose of demonstrating that the accuracy variability of learning techniques can be rather significant. Note that ANNs, SVMs, and M5P regression trees are three of the most popular practical machine learning techniques [Witten and Frank 2005], so the conclusions drawn from studying them should generalize to other learning techniques. Motivated by the considerable accuracy variability, we then propose ELSE, a robust learning framework with reduced accuracy variability over different modeling tasks of EDSM.

3.4. Evaluation Methodology

3.4.1. Construction and Evaluation of Regression Models. To construct a regression model, a learning/regression technique requires a training set consisting of several design configurations that must be simulated; the size of the training set can be either fixed or dynamic [Ipek et al. 2006]. In our study, for the MIPS architecture, we sample 3000 design configurations uniformly at random (UAR) from the design space to construct the training set, as done by Lee and Brooks [2006]. For the POWER architecture, the simulated results are openly available and can be obtained directly from Lee's website.
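Uniform random sampling from the Table I grid is straightforward; the sketch below is ours, and the concrete L2UC levels (powers of two between 256 KB and 4 MB) are an assumption, since the table only gives the range and the count of 5 values.

    import random

    # Parameter domains from Table I (MIPS design space); cache sizes in KB.
    MIPS_SPACE = {
        "WIDTH": [4, 8],
        "ROB": list(range(64, 161, 8)),        # 64-160, step 8 -> 13 values
        "IQ": list(range(16, 81, 8)),          # 16-80, step 8 -> 9 values
        "LSQ": list(range(16, 81, 8)),
        "RF": list(range(64, 161, 8)),
        "L1IC": [16, 32, 64, 128],
        "L1DC": [16, 32, 64, 128],
        "L2UC": [256, 512, 1024, 2048, 4096],  # assumed levels for 256KB-4MB
        "BRAN": [8, 16, 24, 32],
        "PRED": [1, 2, 4, 8, 16, 32],          # K entries
        "BTB": [1, 2, 4],                      # K entries
        "RAS": [16, 32, 48, 64],
        "ALU": [2, 4],
        "FPU": [2, 4],
    }

    def sample_uar(space, n, seed=0):
        """Draw n configurations uniformly at random from the full grid."""
        rng = random.Random(seed)
        return [{p: rng.choice(v) for p, v in space.items()} for _ in range(n)]

    training_configs = sample_uar(MIPS_SPACE, 3000)  # each is then simulated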

For each learning technique, several parameters are crucial to its performance. In our study, the ANNs directly adopt the parameter setting utilized by Ipek et al. [2006]: each ANN contains one 16-unit hidden layer with a learning rate of 0.001 and a momentum value of 0.5. For SVMs, which are known to be more parameter-sensitive, three parameters, namely the penalty parameter, the gamma value, and the epsilon of the loss function, are very important once the radial basis function (RBF) is adopted as the kernel function.²


To obtain a promising combination of these parameters, we utilize grid search to traverse the parameter space, which implies that the parameter settings of SVMs may vary over different modeling tasks. The third technique, M5P, involves only one important parameter, which determines the minimal number of examples in each leaf of the constructed model tree and is set to 4 in our study.
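As an illustration of this tuning step, the sketch below runs a grid search over an RBF-kernel support vector regressor with scikit-learn; the grid values and the synthetic stand-in data are ours, since the paper does not list its exact search ranges.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X_train = rng.random((300, 14))        # stand-in for encoded configurations
    y_train = X_train @ rng.random(14)     # stand-in for simulated responses

    # Illustrative grid over the three SVM parameters named above.
    param_grid = {
        "C": [0.1, 1, 10, 100],            # penalty parameter
        "gamma": [1e-3, 1e-2, 1e-1, 1],    # RBF kernel width
        "epsilon": [0.01, 0.1, 0.5],       # epsilon of the loss function
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=10,
                          scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    svm_model = search.best_estimator_     # best parameter combination found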

To evaluate the accuracy of each regression model, we employ 10-fold cross validation to assess how the regression model generalizes to an independent data set containing unseen design configurations. In 10-fold cross validation, the data set of 3000 simulated design configurations is randomly partitioned into 10 subsets (folds). In each iteration, one subset is retained as the validation set for checking the prediction error of the regression model, while the remaining 9 subsets are combined into the training set used to construct the model. The entire process is repeated for 10 iterations so that each subset acts as the validation data exactly once. Finally, the results of the 10 iterations are averaged to produce an evaluation of the learning technique under the specific modeling task.

3.4.2. Performance Metrics. Performance metrics play an important role in the evaluation of the learning techniques. In this paper, we employ the Mean Squared Error (MSE) to demonstrate how well a model predicts the responses for the given design space. Formally, given a validation set consisting of N unseen design configurations, the MSE is defined by

$$\mathrm{MSE} := \frac{1}{N}\sum_{k=1}^{N}\left(f(x_k) - y_k\right)^2, \qquad (1)$$

where $f$ represents the regression model, $x_k$ represents the $k$-th configuration ($k = 1, \ldots, N$) in the validation set, and $y_k$ represents the architectural response of configuration $x_k$, obtained by cycle-by-cycle simulation. A model with a smaller MSE is more favorable. However, from the MSE alone we cannot directly deduce the accuracy of the model, unless we take the real responses $y_1, \ldots, y_N$ as references.

To address the above issue, we also consider the distribution of the Percentage Error (PE) in our empirical study, where the PE for the $k$-th configuration $x_k$ in the validation set is defined by

$$\mathrm{PE}(k) := \frac{|f(x_k) - y_k|}{y_k} \times 100\%. \qquad (2)$$

The PE explicitly demonstrates the accuracy of a regression model on a specific design configuration, and the PE distribution summarizes the general accuracy of a regression model built for a given modeling task. Concretely, our empirical study employs a set of metrics on the PE distribution, including the first quartile percentage error (1PE), the median percentage error (MPE), and the third quartile percentage error (3PE), where, for example, the MPE is formally defined by $\mathrm{MPE} := \mathrm{median}\{\mathrm{PE}(k);\; k = 1, \ldots, N\}$. Thus, "the MPE of a model is 5%" means that the prediction error exceeds 5% in half of the cases. Besides, the maximal percentage error, defined by $\max\{\mathrm{PE}(k);\; k = 1, \ldots, N\}$, is also taken into account.
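Both metrics, and the PE summary statistics, are easy to compute from 10-fold cross-validated predictions. The sketch below uses scikit-learn's MLPRegressor on synthetic data as a stand-in for the paper's ANN and simulator outputs; all names are ours.

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(1)
    X = rng.random((3000, 14))               # encoded design configurations
    y = 1.0 + X @ rng.random(14)             # simulated responses (kept positive)

    # 10-fold cross-validated predictions f(x_k) for every configuration.
    ann = MLPRegressor(hidden_layer_sizes=(16,), solver="sgd",
                       learning_rate_init=0.001, momentum=0.5, max_iter=2000)
    f_x = cross_val_predict(ann, X, y, cv=10)

    mse = np.mean((f_x - y) ** 2)            # Equation (1)
    pe = np.abs(f_x - y) / y * 100.0         # Equation (2), in percent

    pe1, mpe, pe3 = np.percentile(pe, [25, 50, 75])  # 1PE, MPE, 3PE
    max_pe = pe.max()                        # maximal percentage error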

²A kernel function maps the original data into a high-dimensional feature space where a hyperplane can be used to perform linear separation.


Given the PE-related metrics, we define the term "accuracy variability" as the standard deviation of a PE-related metric over different modeling tasks. A larger standard deviation of a PE-related metric indicates a larger accuracy variability, which is undesirable.
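In code, the accuracy variability of, say, the MPE metric over a set of tasks is simply its standard deviation; a tiny sketch (the two task values shown are the ANN endpoints quoted in Section 4.1, listed here only for illustration):

    import numpy as np

    # MPE (in %) of one technique over modeling tasks; real use covers all tasks.
    mpe_per_task = {"MIPS-gzip-IPC": 0.41, "MIPS-mgrid-IPC": 5.48}

    # Accuracy variability of the MPE metric = its standard deviation over tasks.
    variability = float(np.std(list(mpe_per_task.values())))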

4. ACCURACY VARIABILITY OF REGRESSION MODELS

In this section, we provide an in-depth study to reveal the accuracy variability of different learning techniques over different modeling tasks of EDSM.

For the MIPS design space, our study considers 36 different modeling tasks involving two design objectives (IPC and energy) and 18 benchmarks. For the POWER design space, our study considers 16 different modeling tasks involving two design objectives (IPC and power) and 8 benchmarks. For each modeling task, we construct three regression models by ANN, SVM, and M5P, respectively.

4.1. MIPS Modeling Tasks

For the MIPS design space, Figure 2 presents the MSEs of different regression models across different benchmarks, where the MSEs are normalized to the corresponding MSE of ANN. We observe that the prediction accuracy of SVM is worse than that of the other two techniques on all 36 MIPS modeling tasks. Moreover, although ANN obtains better performance than M5P on the modeling task "MIPS-applu-IPC", it cannot outperform M5P on the modeling task "MIPS-apsi-IPC", where the MSE of M5P is only 6.16% of that of ANN. In summary, ANN outperforms M5P on 8 out of 36 MIPS modeling tasks, while M5P beats ANN on the remaining 28. It is clear that the accuracy variability may influence the relative standings of learning techniques under different modeling tasks.


Fig. 2. MSEs of the investigated learning techniques on 36 different modeling tasks for the MIPS design space, which involve 18 benchmarks and 2 design objectives (IPC and energy). The results are normalized to the MSE of ANN.

To further illustrate the accuracy variability of the techniques, Figure 3 shows the boxplots of the prediction errors of ANN, SVM, and M5P (from top to bottom) over the 36 MIPS modeling tasks. Boxplots are graphical displays of location and dispersion that can indicate the symmetry of the error distribution. The central box shows the data between the 1PE (75% of the PEs are larger than this value) and the 3PE (25% of the PEs are larger than this value), so about 50% of the data falls inside the box.


The average PE is marked as a block and the MPE is marked as a furcation. The whiskers (the vertical lines extending from the box) reach the extreme values, i.e., the maximal and minimal PEs.

The accuracy variability can be measured by the standard deviation of a PE-related metric over different modeling tasks. For example, the MPEs of the SVM models range from 0.98% (Task "MIPS-parser-IPC") to 16.475% (Task "MIPS-apsi-IPC"), and the standard deviation is 3.73%. For the ANN models, the MPEs range from 0.41% (Task "MIPS-gzip-IPC") to 5.48% (Task "MIPS-mgrid-IPC"), and the corresponding standard deviation is 2.65%. For the M5P models, the standard deviation of the MPEs is 1.44%, the smallest among the three techniques.

The accuracy variability is even more evident under the maximal PE metric. As illustrated in Figure 3, the maximal PEs of the SVM models range from 4.79% (Task "MIPS-ammp-IPC") to 60.58% (Task "MIPS-art-IPC"), and the corresponding standard deviation is 19.48%. Similarly, the standard deviations of the maximal PEs of the ANN and M5P models are 7.99% and 7.16%, respectively. Although M5P has the most stable prediction accuracy among the three techniques, its maximal PEs still vary significantly across modeling tasks, i.e., from 2.02% (Task "MIPS-parser-IPC") to 33.18% (Task "MIPS-mgrid-IPC").

4.2. POWER Modeling Tasks

For the POWER architecture, Figure 4 presents the MSEs of the three techniques over 16 modeling tasks, where the prediction results are normalized to those of ANN. It can be observed that SVM performs worse than the other two techniques under most modeling tasks. However, it still achieves the best performance among the three techniques on 4 modeling tasks such as "POWER-applu-IPC", which well demonstrates the accuracy variability. The other two techniques exhibit similar behavior: M5P achieves the best performance on 11 modeling tasks, while ANN achieves the best performance on only one.

Figure 5 shows the boxplots of the error distributions of ANN, SVM, and M5P (from top to bottom), respectively. The MPEs of ANN range from 1.33% (Task "POWER-jbb-IPC") to 4.07% (Task "POWER-gcc-power"), which leads to a standard deviation of 0.76%. The MPEs of M5P range from 1.53% (Task "POWER-gzip-IPC") to 3.84% (Task "POWER-gzip-power"), with a standard deviation of 0.71%. The MPEs of SVM range from 0.80% (Task "POWER-jbb-IPC") to 2.90% (Task "POWER-mesa-IPC"), with a standard deviation of 0.65%, showing that SVM is relatively more robust than the other two techniques over the 16 POWER modeling tasks.

To gain more insight into the accuracy variability on the POWER modeling tasks, we estimate the standard deviations of the 3PEs (25% of the PEs are larger than this value) and the maximal PEs for the three techniques. Specifically, the standard deviations of the 3PEs are 1.52%, 1.16%, and 1.46% for ANN, SVM, and M5P, respectively. Under the maximal PE metric, the accuracy variability is more significant, as evidenced by the large standard deviations of 15.17%, 6.67%, and 5.75% for ANN, SVM, and M5P, respectively. Compared with the MIPS modeling tasks, the accuracy variability over the POWER modeling tasks seems less significant. The reason might be that the studied POWER design space (< 1 billion options) is smaller than the investigated MIPS design space (> 2 billion options).

5. PROPOSED METHODOLOGY

So far we have shown that the accuracy variability of the learning techniques is so significant that, when facing a new modeling task, whether a single regression model still performs well is in doubt. The reasons are two-fold [Dietterich 1998].


Fig. 3. PE distribution of different learners on 36 different modeling tasks for the MIPS design space, where the central box shows the data between the 1PE (75% of the PEs are larger than this value) and the 3PE (25% of the PEs are larger than this value); the average PE is marked as a block shape and the MPE as a furcation shape (e.g., the 1PE, MPE, and 3PE of mgrid-ipc w.r.t. ANN are 2.73%, 5.48%, and 8.89%, respectively). The design space contains more than 2 billion design options, and the size of the sample set is 3000. Each ANN contains one 16-unit hidden layer with a learning rate of 0.001 and a momentum value of 0.5. The modeled metrics are IPC and energy.


Fig. 4. MSEs of the investigated learning techniques on 16 different modeling tasks for the POWER architecture, which involve 8 benchmarks and 2 design objectives (IPC and power). The results are normalized to the MSE of ANN.

First, the training data might not always provide sufficient information for determining the single best model. Second, the learning process of a learning/regression technique is often imperfect, which can easily result in sub-optimal models.

In this paper, we employ the notion of ensemble learning to address the above concern and make regression-based EDSM more robust and practical. Unlike conventional machine learning techniques that construct one model from the training data, ensemble learning builds multiple base regression models by using different data sets, different learning techniques (heterogeneous models), or the same technique with distinct learning parameters (homogeneous models). During this process, the diversity of the models is carefully maintained so that the base models can mutually compensate for their weaknesses by interacting with each other. Consequently, the overall meta-model constructed by ensemble learning can inherit the advantages of each base model while compensating for its defects, which effectively reduces the number of cases in which the final prediction errors are unacceptably large, even if the base models are far from accurate.

5.1. ELSE

In this subsection we present the algorithmic framework of the proposed ELSE, which is derived from the stacking algorithm in the machine learning domain [Wolpert 1992]. The stacking algorithm trains several low-level base models from the training data, and lets the low-level models construct the data set from which the high-level meta-model is built. Here we use stacking as an illustrative example to demonstrate the advantage of ensemble learning, namely that it can significantly reduce the accuracy variability and thereby provide robust design space modeling techniques.

Algorithm 1 illustrates the model construction process of ELSE, where we utilize three base techniques, ANN, SVM, and M5P, as an illustration; the framework easily generalizes to cases with more base techniques. The training process is as follows. First, three regression models h1, h2, and h3 are built by the three first-level base techniques, L1 (ANN), L2 (SVM), and L3 (M5P), from the training data set D. After that, for each configuration in D, the three models produce their own predictions, which are combined to form a new example. Such examples are collected in the data set D' that is reserved for the meta-model h'.


Fig. 5. PE distribution of different regression models on 16 different modeling tasks w.r.t. the POWER design space, where for each modeling task the grey box shows the data between the 1PE (75% of the PEs are larger than this value) and the 3PE (25% of the PEs are larger than this value); the average PE is marked as a block shape and the MPE as a furcation shape. The design space contains nearly 1 billion design options, and the size of the sample set ranges from 2000 to 4000. Each ANN contains one 16-unit hidden layer with a learning rate of 0.001 and a momentum value of 0.5. The modeled metrics are IPC and power.


ALGORITHM 1: Training Algorithm of ELSE

Input: Simulated data set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
       base regression techniques L1, L2, and L3;
       meta regression technique L.

 1  L1 = ANN, L2 = SVM, L3 = M5P
 2  L = M5P
 3  for t = 1, 2, 3 do
 4      ht = Lt(D)           /* train a base model ht via technique Lt on data set D */
 5  end
 6  D' = {}
 7  for each xi in D do
 8      for t = 1, 2, 3 do
 9          zit = ht(xi)     /* use ht to obtain the response of configuration xi */
10      end
11      D' = D' ∪ {((zi1, zi2, zi3), yi)}
12  end
13  h' = L(D')               /* train the meta-model h' via technique L on data set D' */

Output: H(x) = h'(h1(x), h2(x), h3(x))

Finally, based on the data set D', ELSE constructs the meta-model h' with a specific learning technique (here, M5P), and the meta-model is responsible for producing the final prediction for each given design configuration. The main reason why we use M5P for the meta-model is that model trees have better interpretability than other learning techniques such as ANNs and SVMs: with the interpretability offered by the tree structure, we can clearly see the qualitative and quantitative impacts of the different base models on the prediction accuracy. In addition, compared with ANNs and SVMs, M5P is a lightweight learning algorithm whose model can be built in much shorter time.
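Algorithm 1 maps almost line-for-line onto the following runnable Python sketch, where a CART-style decision tree stands in for M5P (which scikit-learn does not provide) and synthetic data takes the place of simulated responses; only the structure, not the exact models, reflects the paper's setup.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)
    X = rng.random((3000, 14))               # simulated data set D: configurations
    y = 1.0 + X @ rng.random(14)             # ... and their responses

    # First-level base techniques L1, L2, L3 (tree as an M5P stand-in).
    base_models = [
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000),  # L1: ANN
        SVR(kernel="rbf"),                                      # L2: SVM
        DecisionTreeRegressor(min_samples_leaf=4),              # L3: ~M5P
    ]
    for h in base_models:
        h.fit(X, y)                          # h_t = L_t(D)

    # Build D' from the base models' predictions z_it on each x_i.
    Z = np.column_stack([h.predict(X) for h in base_models])

    # Meta-model h' = L(D'), again with the M5P stand-in.
    meta_model = DecisionTreeRegressor(min_samples_leaf=4)
    meta_model.fit(Z, y)

    def else_predict(X_new):
        """H(x) = h'(h1(x), h2(x), h3(x))."""
        Z_new = np.column_stack([h.predict(X_new) for h in base_models])
        return meta_model.predict(Z_new)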

5.2. ELSE for Constrained Design Space Exploration

Constrained design space exploration (e.g., determining the configuration with the best performance under a specific power constraint) is very common in industrial design practice, and ELSE can greatly facilitate it owing to its high accuracy and robustness. Figure 6 illustrates the overall framework for leveraging ELSE in constrained design space exploration. The process consists of two phases, the modeling phase and the exploration phase. During the modeling phase, multiple design configurations are sampled and simulated to constitute the training set, which is then used to build several response models with ELSE, such as performance and power models. The accuracy of these models is tested to determine whether they meet the pre-defined accuracy criteria: if they do, the modeling process terminates; otherwise, more design configurations are sampled, simulated, and used for training to further enhance the accuracy. During the exploration phase, processor responses of design configurations can be predicted by the ELSE models without any additional processor simulations. In the example shown in Figure 6, a design configuration is first fed to the ELSE power model. If the predicted power of this configuration violates the constraint (e.g., 50 watts), the configuration is discarded; otherwise, the configuration is fed to the ELSE performance model to predict its performance. After iterating this process over all candidate configurations, the configuration with the best performance among those that meet the power constraint is found.


Fig. 6. The overall framework to leverage ELSE for constrained design space exploration.

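The exploration phase of Figure 6 thus reduces to a filter-then-maximize loop. A minimal sketch, assuming the ELSE power and performance models expose a scikit-learn-style predict() and that candidate configurations are already encoded as feature vectors:

    def explore_constrained(candidates, power_model, perf_model, power_budget=50.0):
        """Return the candidate with the best predicted performance among
        those whose predicted power stays within the budget (in watts)."""
        best_cfg, best_perf = None, float("-inf")
        for cfg in candidates:
            if power_model.predict([cfg])[0] > power_budget:
                continue                     # violates the power constraint
            perf = perf_model.predict([cfg])[0]
            if perf > best_perf:
                best_cfg, best_perf = cfg, perf
        return best_cfg, best_perf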

6. EXPERIMENTAL EVALUATION

This section evaluates ELSE experimentally, focusing on two critical questions.

(1) Does ELSE outperform the three base learning techniques regarding the average prediction accuracy?

(2) Is ELSE more robust (i.e., does it have smaller accuracy variability) than the three base learning techniques, ANN, SVM, and M5P?

6.1. Average Accuracy

To compare the average prediction accuracy of ELSE with those of the three base techniques, we present the prediction results of both ELSE and a "PickBest" approach, where PickBest always selects, for each modeling task, the base regression model with the smallest prediction error among the three models built independently by ANN, SVM, and M5P. Figure 7 illustrates the comparison results for the MIPS and POWER design spaces. ELSE outperforms PickBest over all MIPS modeling tasks, and the MSE reduction over PickBest, which ranges from 6.76% (Task "MIPS-gzip-energy") to 117% (Task "MIPS-swim-IPC"), is 27.19% on average over the 36 MIPS modeling tasks. On the POWER modeling tasks, the MSE reduction is even more significant: the average MSE reduction is 84.74% over the 16 POWER modeling tasks, ranging from 19.20% (Task "POWER-gcc-IPC") to 207.54% (Task "POWER-twolf-power").
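Because Figure 7 normalizes every MSE to that of ELSE, a "reduction" above 100% simply means PickBest's MSE exceeds twice ELSE's. The comparison itself is a one-liner per task, sketched here on toy numbers (not the paper's data):

    import numpy as np

    rng = np.random.default_rng(3)
    ann_mse, svm_mse, m5p_mse = rng.random((3, 52)) + 0.1  # per-task MSEs (toy)
    else_mse = 0.8 * np.minimum.reduce([ann_mse, svm_mse, m5p_mse])

    # PickBest: for each task, the smallest MSE among the three base models.
    pick_best = np.minimum.reduce([ann_mse, svm_mse, m5p_mse])

    # MSE reduction of ELSE over PickBest, relative to ELSE's MSE
    # (Figure 7's normalization), so values above 100% are possible.
    reduction = (pick_best - else_mse) / else_mse * 100.0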

Furthermore, Figure 8 illustrates the PE distribution of ELSE on the MIPS and POWER modeling tasks. For the MIPS modeling tasks, the MPE of ELSE varies from 0 (Task "MIPS-art-IPC") to 3.77% (Task "MIPS-wupwise-energy"), and is 1.77% on average over the 36 modeling tasks; this average MPE is only 72.36%, 25.87%, and 93.45% of those of the ANN, SVM, and M5P techniques, respectively.


Fig. 7. MSE comparison of ELSE and the most accurate model selected from the base models (task-wise). Top: MIPS design space; bottom: POWER design space. All results are normalized to the corresponding MSE of ELSE.

For the POWER modeling tasks, the MPE of ELSE varies from 0.76% (Task "POWER-jbb-IPC") to 3.11% (Task "POWER-gcc-power"), and is 1.70% on average; this average MPE is only 57.90%, 87.72%, and 58.63% of those of ANN, SVM, and M5P, respectively.

Therefore, considering both the MSE and the PE, we conclude that ELSE significantly reduces the prediction error across different modeling tasks, offering architects more confident design decisions.

In addition, we also conduct experiments to investigate the effect of the training set size on the average accuracy. Due to the page limit, we only present the experimental results of program applu on the POWER architecture, as shown in Figure 9. The prediction accuracy generally improves as the size of the sample set increases, and we observe similar trends in other design scenarios.

6.2. Reduction of Accuracy Variability

Since we focus on the extreme cases as "outliers", Table II presents the 3PE and the maximal PE of the base learning techniques and ELSE over all 52 modeling tasks, where the corresponding standard deviations are utilized to measure the accuracy variability of these approaches.


Fig. 8. PE distribution of ELSE for the MIPS (top) and the POWER (bottom) design spaces.

Taking the 3PE (25% of the PEs are larger than this value) as the error metric, ELSE achieves the best performance on 48 of the 52 modeling tasks, and the corresponding standard deviation is 1.94%, which is only 76.79%, 30.98%, and 83.66% of those of the ANN, SVM, and M5P techniques. Taking the maximal PE as the error metric, ELSE achieves the best performance on 45 of the 52 modeling tasks, and the corresponding standard deviation is 8.36%, which is only 37.76%, 49.89%, and 58.52% of those of the ANN, SVM, and M5P techniques. In particular, compared with ANN, ELSE reduces the number of modeling tasks under which the 3PEs are larger than 6% from 21 to only 1, and the number of modeling tasks under which the maximal prediction errors are larger than 40% from 15 to only 0. According to these detailed comparisons, we conclude that ELSE is significantly more robust than traditional learning/regression techniques for EDSM.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 27 of 32 Transactions on Design Automation of Electronic Systems

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960

For Peer Review

Robust Design Space Modeling A:17

Table II. Comparison of the 3PE and maximal PE of ELSE and its base learning techniques. The corresponding standard deviations over the 52 modeling tasks are also presented. The terms in bold are the minimal values among the techniques under comparison.

Tasks                    3PE (%)                          Max PE (%)
                         ANN     SVM     M5P     ELSE     ANN     SVM     M5P     ELSE
MIPS-ammp-ipc            0.735   1.740   0.429   0.286    2.609   4.787   2.609   2.174
MIPS-ammp-energy         4.930  10.680   4.930   4.760   16.180  41.669  15.320  10.256
MIPS-applu-ipc           1.382  11.000   1.950   1.120   18.120  36.280  22.390  15.800
MIPS-applu-energy        6.667  16.730   5.410   5.150   18.750  57.810  16.340  12.660
MIPS-apsi-ipc            5.195  20.290   0.854   0.853   14.127  36.004   7.679   5.169
MIPS-apsi-energy         5.435  10.998   6.340   4.808   13.880  38.700  16.980  12.420
MIPS-art-ipc             3.330  23.610   1.000   0.001   18.000  60.580  19.500  14.500
MIPS-art-energy          4.032   6.727   3.540   3.540    9.385  16.690   7.985  13.380
MIPS-bzip2-ipc           0.860   3.474   0.816   0.792    4.526   6.787   4.210   4.900
MIPS-bzip2-energy        6.930  14.429   5.488   5.143   17.037  53.870  18.699  12.308
MIPS-crafty-ipc          3.818   6.054   2.281   1.927   13.979  14.229  10.900   8.738
MIPS-crafty-energy       8.330  13.360   5.263   5.263   25.397  54.070  13.820  13.820
MIPS-equake-ipc          1.102   5.690   1.167   0.895    5.846  13.870   5.470   4.508
MIPS-equake-energy       6.667  13.897   5.495   5.000   21.488  52.480  18.182  11.970
MIPS-gap-ipc             1.839   7.320   1.935   1.880   11.446  10.710  10.480   9.620
MIPS-gap-energy          7.107  14.286   5.560   5.330   18.788  50.100  18.550  12.970
MIPS-gcc-ipc             0.700   2.798   0.918   0.645    4.022   8.066   3.505   2.828
MIPS-gcc-energy          6.522  15.276   5.072   4.850   17.544  56.450  16.000  10.853
MIPS-gzip-ipc            0.695   3.138   0.676   0.667    3.836   6.980   3.784   2.703
MIPS-gzip-energy         6.667  13.740   5.556   5.384   18.000  54.240  16.670  17.020
MIPS-mesa-ipc            1.069   5.405   0.821   0.756    5.873  12.360   6.129   5.079
MIPS-mesa-energy         6.716  13.290   5.600   5.063   19.000  50.696  17.757  11.570
MIPS-mgrid-ipc           8.889  29.030   5.574   5.740   40.440  42.990  33.180  19.045
MIPS-mgrid-energy        6.195  12.900   4.938   4.730   17.178  41.390  15.823  11.875
MIPS-parser-ipc          0.550   1.747   0.460   0.449    2.283   5.580   2.020   1.304
MIPS-parser-energy       6.481  12.870   5.260   5.195   17.540  47.230  16.040  12.389
MIPS-swim-ipc            0.748  10.330   0.242   0.167    2.520  18.730   2.195   1.220
MIPS-swim-energy         6.820  12.370   5.880   5.400   17.650  46.460  18.180  12.610
MIPS-twolf-ipc           1.105   4.939   0.972   0.833    5.616  10.910   6.027   4.459
MIPS-twolf-energy        6.349  14.430   4.950   4.510   16.788  51.270  15.873  11.628
MIPS-vortex-ipc          1.259   7.254   1.457   1.154    9.124  18.540   8.850   6.312
MIPS-vortex-energy       6.818  17.570   4.760   4.545   18.340  59.096  18.182  11.270
MIPS-vpr-ipc             0.822   3.515   0.909   0.750    4.320   9.090   5.310   3.735
MIPS-vpr-energy          6.349  12.523   5.389   5.000   17.949  47.650  16.960  10.656
MIPS-wupwise-ipc         1.940  16.660   0.142   0.093    6.013  38.830   6.667   6.144
MIPS-wupwise-energy      7.207  11.660   6.450   6.154   18.970  47.480  21.110  18.430
POWER-ammp-ipc           4.096   3.034   4.204   2.811   48.045  36.000  31.319  20.865
POWER-ammp-power         5.303   3.058   6.058   2.466   56.006  20.116  50.244  19.317
POWER-applu-ipc          4.909   3.041   5.139   2.716   54.839  39.610  35.714  18.452
POWER-applu-power        6.802   4.873   5.994   3.663   67.231  29.452  35.481  21.025
POWER-equake-ipc         5.120   3.261   5.316   2.901   71.654  40.157  44.848  33.981
POWER-equake-power       7.930   4.446   6.187   3.493   82.145  25.825  38.789  18.400
POWER-gcc-ipc            6.716   5.344   4.372   3.902   67.460  27.807  35.897  26.790
POWER-gcc-power          7.680   4.602   6.813   5.612   59.375  26.797  45.680  32.383
POWER-gzip-ipc           3.463   2.949   3.010   2.388   25.316  28.077  30.460  20.076
POWER-gzip-power         5.745   4.210   7.503   3.249   47.493  26.305  38.012  22.374
POWER-jbb-ipc            2.445   1.729   3.707   1.600   30.435  20.956  34.107  14.154
POWER-jbb-power          5.414   2.517   7.826   2.083   66.222  20.596  42.393  12.093
POWER-mesa-ipc           5.846   5.402   4.497   4.241   40.647  35.612  45.083  35.971
POWER-mesa-power         7.074   5.070   7.347   5.164   62.406  27.707  43.399  31.001
POWER-twolf-ipc          4.128   2.545   4.743   2.335   45.736  23.991  40.566  22.464
POWER-twolf-power        5.864   2.748   7.025   2.177   62.289  19.654  45.461  16.766
AVG                      4.631   8.857   4.004   3.070   26.536  32.141  21.093  13.701
STD                      2.520   6.245   2.313   1.935   22.147  16.764  14.290   8.363
Normalized AVG           1.51X   2.89X   1.30X   1       1.94X   2.35X   1.54X   1
Normalized STD           1.30X   3.23X   1.20X   1       2.65X   2.00X   1.71X   1


In the experiments, we first identify the Pareto sets of the testing sets using different learning techniques, namely ANN, M5P, SVM, and ELSE. (Since it is impossible to simulate the entire design space, which contains billions of design options, to find its exact Pareto set, we evaluate the Pareto set of each testing set, and we use 10-fold cross-validation to enhance the reliability of the experiments.) Then, the quality of each Pareto set is evaluated with the average distance from reference set (ADRS), the metric used by Palermo et al. [2009] to measure the quality of Pareto sets. Technically, ADRS measures the distance between the exact Pareto set and an approximate Pareto set.
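To make the metric concrete, below is a minimal sketch of an ADRS computation. It is written under our own assumptions rather than taken from Palermo et al. [2009], whose exact formulation may differ in detail: all objectives are to be minimized (e.g., 1/IPC and energy), objective values are strictly positive, and the distance between two points is the worst per-objective relative degradation. The function and variable names are ours.

```python
import numpy as np

def adrs(reference_set, approx_set):
    """Average Distance from Reference Set (a sketch).

    Both arguments are arrays of shape (n_points, n_objectives), with all
    objectives minimized. For each reference (exact Pareto) point, take the
    distance to the closest approximate point, then average over the
    reference set; 0 means every reference point is matched or dominated.
    """
    total = 0.0
    for ref in reference_set:
        # Per-objective relative degradation of each approximate point
        # w.r.t. this reference point; improvements are clipped to 0.
        delta = np.maximum(0.0, (approx_set - ref) / np.abs(ref))
        # One approximate point's distance is its worst degradation;
        # the closest approximate point determines the contribution.
        total += delta.max(axis=1).min()
    return total / len(reference_set)

# Toy usage with two objectives (1/IPC, energy), both minimized:
exact  = np.array([[1.0, 10.0], [1.2, 8.0], [1.5, 7.0]])
approx = np.array([[1.1, 10.0], [1.5, 7.5]])
print(adrs(exact, approx))  # ~0.14 for this toy data
```

A lower ADRS therefore indicates that the Pareto set predicted by a model lies closer to the reference Pareto set.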

We compare the ADRS of the Pareto set produced by ELSE with that of PickBest (i.e., the Pareto set with the best ADRS among those of ANN, M5P, and SVM) on the benchmarks from both the MIPS and POWER scenarios. Experimental results show that ELSE outperforms PickBest on most benchmarks, with an average ADRS improvement of 9.64%. Therefore, ELSE can also be very helpful for finding optimal configurations in practical multi-objective DSE problems.

7. CONCLUSION

Architectural design space exploration is a very challenging task during the design of microprocessors, and learning/regression techniques are promising tools for combating time-consuming simulations. However, according to detailed evaluations of existing techniques on 52 different modeling tasks, we found that the accuracy variability of existing regression models is common and significant, and that previous techniques are not robust enough to offer reliable predictions across many different modeling tasks. To address this issue, which may potentially diminish the general applicability of learning techniques in EDSM, we propose an EDSM framework called ELSE that employs multiple base regression models to gain robustness. Like the stacking algorithm [Wolpert 1992] that inspired our investigation, ELSE is an open framework that does not rely on any specific learning technique, and thus can be utilized to enhance the robustness of traditional EDSM techniques. Experimental results show that ELSE reduces accuracy variability by about 50% in comparison with the widely used ANN technique, enabling robust EDSM that can adapt to different modeling tasks. Furthermore, ELSE significantly reduces the prediction error for different modeling tasks.
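For readers unfamiliar with stacking, the sketch below illustrates the underlying idea in Python. It is our own illustration, not the ELSE implementation (which additionally maintains the diversity of its base models): the scikit-learn MLP, SVR, and regression-tree estimators merely stand in for ANN, SVM, and M5P (scikit-learn provides no M5P), and a linear meta-model is fit on the base models' out-of-fold predictions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

def make_base_learners():
    # Stand-ins for the ANN, SVM, and M5P base models discussed above.
    return [MLPRegressor(max_iter=2000), SVR(), DecisionTreeRegressor()]

def stack_fit(X, y, n_folds=10):
    """Stacked generalization [Wolpert 1992]: the meta-model is trained on
    out-of-fold predictions of the base models, never on their training-set
    predictions, which would leak and favor overfitted base models."""
    learners = make_base_learners()
    meta_X = np.zeros((len(X), len(learners)))
    for tr, ho in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        fold_learners = make_base_learners()
        for j, learner in enumerate(fold_learners):
            learner.fit(X[tr], y[tr])
            meta_X[ho, j] = learner.predict(X[ho])
    meta = LinearRegression().fit(meta_X, y)
    # Refit the base models on all simulated configurations for prediction.
    for learner in learners:
        learner.fit(X, y)
    return learners, meta

def stack_predict(learners, meta, X):
    feats = np.column_stack([learner.predict(X) for learner in learners])
    return meta.predict(feats)

# Toy usage: 200 random 8-parameter configurations with a synthetic response.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(200)
learners, meta = stack_fit(X, y)
print(stack_predict(learners, meta, X[:5]))
```

The key design point is that the meta-model sees only out-of-fold predictions, so a base model that merely memorizes its training data gains no undue weight in the combination.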

In this paper, we mainly focus on performance modeling of a single benchmark, while multi-programmed applications are very common on existing multi-core architectures. In the future, we plan to apply ELSE to the performance modeling of multi-programmed applications on multi-core architectures.

REFERENCES

AZIZI, O., MAHESRI, A., LEE, B. C., PATEL, S. J., AND HOROWITZ, M. 2010. Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. In Proceedings of the 37th Annual International Symposium on Computer Architecture. ISCA '10. 26–36.

BROOKS, D., TIWARI, V., AND MARTONOSI, M. 2000. Wattch: a framework for architectural-level power analysis and optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture. ISCA '00. 83–94.

BUJA, A. AND STUETZLE, W. 2000. The effect of bagging on variance, bias, and mean squared error. Technical Report, AT&T Labs-Research.

CHO, C.-B., ZHANG, W., AND LI, T. 2007. Informed microarchitecture design space exploration using workload dynamics. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 40. 274–285.


DIETTERICH, T. G. 1998. Machine-learning research: Four current directions. The AI Magazine 18, 97–136.

DUBACH, C., JONES, T., AND O'BOYLE, M. 2007. Microarchitectural design space exploration using an architecture-centric approach. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 40. 262–271.

DUBACH, C., JONES, T. M., AND O'BOYLE, M. F. 2008. Exploring and predicting the architecture/optimising compiler co-design space. In Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. CASES '08. 31–40.

FRIEDMAN, J., HASTIE, T., AND TIBSHIRANI, R. 2000. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28, 2, 337–407.

GUO, Q., CHEN, T., CHEN, Y., ZHOU, Z.-H., HU, W., AND XU, Z. 2011. Effective and efficient microprocessor design space exploration using unlabeled design configurations. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two. IJCAI '11.

HARDAVELLAS, N., FERDMAN, M., FALSAFI, B., AND AILAMAKI, A. 2011. Toward dark silicon in servers. IEEE Micro 31, 4, 6–15.

IPEK, E., MCKEE, S. A., CARUANA, R., DE SUPINSKI, B. R., AND SCHULZ, M. 2006. Efficiently exploring architectural design spaces via predictive modeling. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS XII. 195–206.

JOSEPH, P. J., VASWANI, K., AND THAZHUTHAVEETIL, M. J. 2006. A predictive performance model for superscalar processors. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 39. 161–170.

KHAN, S., XEKALAKIS, P., CAVAZOS, J., AND CINTRA, M. 2007. Using predictive modeling for cross-program design space exploration in multicore systems. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques. PACT '07. 327–338.

KUNCHEVA, L. 2002. A theoretical study on six classifier fusion strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 2, 281–286.

LEE, B. C. AND BROOKS, D. 2010. Applied inference: Case studies in microarchitectural design. ACM Trans. Archit. Code Optim. 7.

LEE, B. C. AND BROOKS, D. M. 2006. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS XII. 185–194.

LEE, B. C., COLLINS, J., WANG, H., AND BROOKS, D. 2008. CPR: Composable performance regression for scalable multiprocessor models. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 41. 270–281.

LIAO, S.-W., HUNG, T.-H., NGUYEN, D., CHOU, C., TU, C., AND ZHOU, H. 2009. Machine learning-based prefetch optimization for data center applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. SC '09.

MARIANI, G., PALERMO, G., ZACCARIA, V., AND SILVANO, C. 2013. Design-space exploration and runtime resource management for multicores. ACM Trans. Embed. Comput. Syst. 13, 2, 20:1–20:27.

MOUDGILL, M., WELLMAN, J.-D., AND MORENO, J. H. 1999. Environment for PowerPC microarchitecture exploration. IEEE Micro 19, 3, 15–25.

OPITZ, D. AND MACLIN, R. 1999. Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11, 169–198.

OULD-AHMED-VALL, E., WOODLEE, J., YOUNT, C., AND DOSHI, K. A. 2007. On the comparison of regression algorithms for computer architecture performance analysis of software applications. In Proceedings of the First Workshop on Statistical and Machine learning approaches applied to ARchitectures and compilaTion. SMART.

OULD-AHMED-VALL, E., WOODLEE, J., YOUNT, C., DOSHI, K. A., AND ABRAHAM, S. 2007. Using model trees for computer architecture performance analysis of software applications. In Proceedings of the International Symposium on Performance Analysis of Systems and Software. ISPASS. 116–125.

PALERMO, G., SILVANO, C., AND ZACCARIA, V. 2009. ReSPIR: A response surface-based Pareto iterative refinement for application-specific design space exploration. Trans. Comp.-Aided Des. Integ. Cir. Sys. 28, 12.

PAONE, E., VAHABI, N., ZACCARIA, V., SILVANO, C., MELPIGNANO, D., HAUGOU, G., AND LEPLEY, T. 2013. Improving simulation speed and accuracy for many-core embedded platforms with ensemble models. In Proceedings of the Conference on Design, Automation and Test in Europe. DATE '13.

RENAU, J., FRAGUELA, B., TUCK, J., LIU, W., PRVULOVIC, M., CEZE, L., SARANGI, S., SACK, P., STRAUSS, K., AND MONTESINOS, P. 2005. SESC simulator. http://sesc.sourceforge.net.

SCHAPIRE, R. E. 1990. The strength of weak learnability. Machine Learning 5, 197–227.


TARJAN, D., THOZIYOOR, S., AND JOUPPI, N. P. 2006. CACTI 4.0. Technical Report HPL-2006-86.

VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.

WANG, Y. AND WITTEN, I. H. 1997. Induction of model trees for predicting continuous classes. In Proceedings of the European Conference on Machine Learning. ECML '97.

WITTEN, I. H. AND FRANK, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann.

WOLPERT, D. 1997. The lack of a priori distinctions between learning algorithms. Neural Computation 8, 7, 1341–1390.

WOLPERT, D. AND MACREADY, W. G. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1, 1, 67–82.

WOLPERT, D. H. 1992. Stacked generalization. Neural Networks 5, 241–259.

YI, J. J., EECKHOUT, L., LILJA, D. J., CALDER, B., JOHN, L. K., AND SMITH, J. E. 2006. The future of simulation: A field of dreams. Computer 39, 22–29.

ZHOU, Z.-H. 2009. Ensemble learning. In Encyclopedia of Biometrics. 270–273.

ZHOU, Z.-H. 2012. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC.
