
Eiger: A Framework for the Automated Synthesis of Statistical Performance Models

Andrew Kerr∗, Eric Anger∗, Gilbert Hendry† and Sudhakar Yalamanchili∗

∗School of Electrical and Computer Engineering, Georgia Institute of Technology
{akerr, eanger, sudha}@gatech.edu

†Sandia National Laboratories
[email protected]

Abstract—As processor architectures continue to evolve toward increasingly heterogeneous and asymmetric designs, the construction of accurate performance models of execution time and energy consumption has become increasingly challenging. Models that are constructed are quickly invalidated by new features in the next generation of processors, while many interactions between application and architecture parameters are often simply not obvious or even apparent. Consequently, we foresee a need for an automated methodology for the systematic construction of performance models of heterogeneous processors. The methodology should be founded on rigorous mathematical techniques yet leave room for the exploration and adaptation of a space of analytic models. Our current effort toward creating such an extensible, targeted methodology is Eiger. This paper describes the methodology implemented in Eiger, the specifics of Eiger's extensible implementation, and the results of one scenario in which Eiger has been applied: the synthesis of performance models for use in the simulation-based design space exploration of Exascale architectures.

I. INTRODUCTION

The transition to many-core computing has been accompanied by the emergence of heterogeneous architectures combining mainstream multithreaded cores with accelerators such as vector units (Intel AVX) or graphics processing units (GPUs), e.g., from AMD and NVIDIA. Critically important are performance models, which are necessary for the efficient management of these architectures, e.g., in task scheduling, as well as being central to the design space exploration of future heterogeneous processors. However, the increased complexity of these architectures challenges the construction of accurate performance models. In addition, the design-space exploration of the interaction of future hardware and software that is necessary to advance computing into the Exascale regime is highly dependent on simulation and analytical modeling. Coarse-grained simulators such as SST/macro [1] can be used to predict the performance of many thousands of tasks running on millions of heterogeneous cores, and the communication between them, using models of varying degrees of fidelity, often trading off simulation performance for accuracy. The periodic construction of accurate analytic models is central to such analyses.

In this paper, we describe an automated statistical approach for modeling program behaviors on diverse architectures and present an evaluation of the application of the infrastructure to the problem of generating performance models for graphics processing unit (GPU) accelerators. Our objective is to design and implement a methodology for discovering and synthesizing analytic performance models of the runtimes and energy consumption of applications executing on target heterogeneous processors.

Our approach discovers analytic relationships between the static and dynamic parameters of an application and performance metrics. For example, we may wish to capture the impact of the sparsity of input data structures, dynamic execution count, and number of function calls on execution time. Or we may wish to discover a relationship between the number of double-precision load operations, the number of DMA calls, and the occurrence of unconditional branches on energy consumption. These are usually complex relationships that elude manual discovery or effective application of off-the-shelf models and mathematical techniques. This complexity is magnified in modern and emerging heterogeneous processors. Broadly, our methodology comprises (1) experimental data acquisition and database construction, (2) a series of data analysis passes over the database (possibly creating new higher-order data), and (3) model selection and construction. The last phase automates the construction of the software implementations of the models, while the analysis passes can utilize a rich source of existing data analytics techniques.

Our current implementation is Eiger, which achieves our goals via statistical methods for dimensionality reduction and automated model selection. Eiger constructs coarse-grain predictive models when trained with results of similar applications on similar machines. The regression models may be as simple or complex as desired, using metrics ranging from fine-grained counts of instruction distributions to coarse-grain estimates of computation working set size. The infrastructure is easily extensible to new sources of measurement data and supports the incremental addition of new experimental data. Its modular structure supports the easy addition of new analysis and model construction passes. Consequently, we hope the infrastructure and framework will benefit other explorations in the community by lowering the barriers to entry in performance model generation or synthesis.

II. MODELING TECHNIQUE

This section describes elements of the proposed performance modeling infrastructure as well as a formal definition of the analyses and the model selection algorithm.

A. Eiger Framework

A detailed illustration of the Eiger framework is provided in Figure 1. The following components constitute an automated process in which application profiling data is collected via a standardized interface and ultimately used to construct a model of runtime, energy, or any other dependent result metric. The resulting statistical model may then be composed with other tools and applications such as simulation environments, heterogeneity-aware process schedulers, and reporting tools. This paper describes the construction of models of execution time. An application is executed on the target processor, multiple parameters are recorded, and execution time is measured. This is defined as one trial. Multiple trials for an application typically cover different parameter values.

Measurements. Because trials are mutually independent, the individual runs required to accumulate this information can be executed independently of one another. We use a relational database to manage the storage of all data accumulated during profiling runs, as it allows asynchronous insertions while allowing for rigorous relational specifications. Additionally, we provide support for the scenario where execution runs of multiple tools are required to construct a single data point (trial).


[Figure 1: block diagram. Pipeline stages: Application/Compiler → instrumentation (C, C++, assembly), static program analysis, and CPU performance counters / dynamic instrumentation → Measurements, stored in a relational database (RDBM) → Analysis passes (PCA, Varimax, and cluster analysis), producing cluster membership and projected metrics → Model construction (model estimation; regression analysis via linear and nonlinear least squares), producing performance models → Reporting (runtime and energy prediction). Intermediate results flow between stages through the RDBM.]

Fig. 1: Implementation details of the Eiger statistical model creation framework.


Analysis. The design supports the addition of analysis passes. We start with Principal Component Analysis (PCA), a well-known dimensionality reduction technique. The major computations are the construction of a correlation matrix and the computation of the eigenvectors of this correlation matrix via Singular Value Decomposition (SVD) [2]. Implementations of SVD are available in LAPACK and the libraries built on it, notably SciPy [3]. The benefits of reducing the dimensionality of the input dataset are manifold: it speeds up the model generation process, improves the clarity of the resulting model, and allows for intuition into the correlations between input metrics.
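For concreteness, this analysis pass can be sketched in a few lines of NumPy. The function name, the standardization step, and the optional explained-variance fallback below are our own illustrative choices, not Eiger's API:

```python
import numpy as np

def pca_reduce(X, p=None, var_target=0.95):
    """Project an (m trials x n metrics) dataset onto principal components.

    Follows the text: build the correlation matrix of the metrics and
    obtain its eigenvectors via SVD. The number of retained components p
    is left to the user, as the paper notes; the cumulative explained-
    variance default here is purely an illustrative fallback.
    """
    # Standardize each metric; the sample covariance of Z is then the
    # correlation matrix of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    C = np.corrcoef(X, rowvar=False)            # n x n correlation matrix

    # For a symmetric positive semi-definite matrix, SVD coincides with
    # the eigendecomposition: rows of Vt are eigenvectors, s eigenvalues.
    _, s, Vt = np.linalg.svd(C)

    if p is None:
        p = int(np.searchsorted(np.cumsum(s) / s.sum(), var_target)) + 1

    P = Vt[:p].T                                # n x p projection matrix
    U = Z @ P                                   # m x p projected trials
    return U, P, s[:p]
```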

It is important to note that PCA is an unsupervised learning technique in that it does not take the performance metric into account when choosing dimensions to eliminate. It is entirely possible that a dimension with low variance may have a larger effect on application performance than one with high variance. For example, the number of cores may have a smaller variance than memory bandwidth across a range of machines but a greater impact on runtime for compute-bound applications. PCA does not explicitly specify how many dimensions to retain; rather, it relies upon the user to make the final decision.

Model Construction. Designing processors to accelerate general sets of workloads would be significantly simpler if all workloads exhibited similar performance characteristics. In contrast, real applications demonstrate varied performance behavior. Dense linear algebra workloads with regular control properties and compute-intensive inner loops differ significantly from irregular workloads with data-dependent branch behavior and load imbalance across threads. Goswami et al. [4] analyze the diversity of CUDA workloads and present a prioritized tree of benchmark applications sorted by how much each increases the total variance of a given set of metrics. Applications exhibiting high correlation among principal components are clustered together. A single model trained from profiling data from all applications in a comprehensive benchmark is unlikely to yield high accuracy. Rather, the best model is likely to be obtained from training data gathered by a set of applications that are "similar" to the experimental application. Kerr et al. [5] provide empirical evidence that clustering and partitioning improve model accuracy.

Model Pool. The model pool defines a set of possible basis functions which may be mapped to principal components and whose linear combination yields the resulting performance model.

Functions: $x_i^{-2}$, $x_i^{-1}$, $x_i^{-1/2}$, $\log_2(x_i)$, $x_i^{1/2}$, $x_i$, $x_i^2$, $x_i \cdot x_j$

TABLE I: Model pool.

The model pool must be selected by the experimenter and should offer sufficient variety to maximize the goodness of fit of the resulting model. It should include functions that closely model the space and time complexity of dominant algorithms within the applications of interest, as well as non-linear combinations of several metrics. For example, compute-bound applications may demonstrate a very strong correlation between execution time and the product of clock frequency and dynamic instruction count. Table I describes the basis functions used for this work.
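As an illustration, the model pool of Table I can be encoded as a table of named callables to be applied to projected metrics; this representation is ours, not Eiger's internal format:

```python
import numpy as np

# Table I's model pool as named basis functions over one projected metric
# x (plus the pairwise interaction term over two metrics).
MODEL_POOL = {
    "x^-2":    lambda x: x ** -2.0,
    "x^-1":    lambda x: x ** -1.0,
    "x^-1/2":  lambda x: x ** -0.5,
    "log2(x)": np.log2,
    "x^1/2":   lambda x: x ** 0.5,
    "x":       lambda x: x,
    "x^2":     lambda x: x ** 2.0,
    # Interaction of two metrics, x_i * x_j:
    "x*y":     lambda x, y: x * y,
}
```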

Model Selection and Training. Model construction is a pass over the data to produce a model. Our first model construction pass is automated regression analysis [6], which constructs an analytic model determined from a set of training samples. In this case, the samples are taken from data projected onto the principal component basis vectors. Regression modeling yields an analytic formula for computing runtime and energy from additional signals. PCA and varimax yield orthogonal and uncorrelated principal components, crucial assumptions enabling classic regression analysis techniques. We propose the use of parametric regression models.

Each step in the stepwise procedure, shown in Algorithm 1, considers the function from the model pool that, if added to the current model, would increase the fit the most. If the $\bar{R}^2$ of this new model, including the candidate function, surpasses the $\bar{R}^2$ of the model without the new function by more than the provided threshold, the function is added to the model, removed from the model pool, and the algorithm starts again. When there are no more functions remaining in the model pool that would pass this threshold, the algorithm completes and returns the final model. This final model may not have maximally reduced squared error for the training data, but the formulation of $\bar{R}^2$ provides for a model that is more likely to predict values not present in the training set.

Reporting. Completed models consist of a set of transformation matrices from dimensionality reduction and cluster analysis as well as a vector of functions and their associated weights. Reporting passes over this data format the information in a form easily consumed by the user, including plots and statistical results.


This phase also allows for the serialization and memoization of the finished models for later consumption. Finally, this phase produces model descriptions in a format that can be imported by system simulation tools and software modules such as run-time schedulers.

B. Formal Specification

Let $m \in \mathbb{Z}$ refer to the number of trials executed for a multiplicity of applications, datasets, and machine configurations. Let $n \in \mathbb{Z}$ refer to the total number of metrics (measurements) acquired per trial. These may include static application metrics, dynamic metrics acquired during the execution of the trial, and machine configuration parameters.

We define $X \in \mathbb{R}^{m \times n}$ as an input dataset and $R \in \mathbb{R}^{m \times 1}$ as a result set. Each row in X corresponds to a trial instance with its result (e.g., execution time) in the corresponding row of the column vector R. Together, (X, R) captures sufficient data describing the application, machine, and performance characteristics to construct a model for R.

\[
X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}
\;\; (m \text{ trials} \times n \text{ metrics}),
\qquad
R = \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix}
\;\; (m \text{ results})
\]

Principal component analysis (PCA) yields a projection P from X onto $U \in \mathbb{R}^{m \times p}$, where $p \in \mathbb{Z}$, $p \leq n$ is the number of principal components, such that all axes are orthogonal and are sorted in decreasing order of variance.

Clustering analysis enables a down-selection of the trials used in model computation. This analysis yields a subset of rows such that $U' = SU$, where $U' \in \mathbb{R}^{m' \times p}$, $m' \leq m$, and S is a selection matrix. This work applies k-means clustering, which partitions a set of points into k clusters such that each point in a cluster $k_i$ is closer to the mean of $k_i$ than to any other cluster mean. The distance metric used in this work is squared Euclidean distance, which weights larger distances increasingly heavily.
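A sketch of this down-selection using SciPy's k-means implementation; realizing the selection matrix S as boolean-mask indexing is our simplification:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_trials(U, k):
    """Partition projected trials (rows of U) into k clusters.

    k-means assigns each trial the label of its nearest centroid (squared
    Euclidean distance). Instead of forming an explicit selection matrix
    S, the subset U' = S U is realized by boolean-mask indexing.
    """
    centroids, labels = kmeans2(U, k, minit="points")
    subsets = [U[labels == i] for i in range(k)]   # one U' per cluster
    return centroids, labels, subsets
```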

Model selection yields a function $f : \mathbb{R}^{1 \times p} \to \mathbb{R}$ that maps individual trials onto a predicted result value. f is the performance model, and this work yields one model per cluster. Model selection leverages linear regression, a commonly used and well-behaved form of regression that evaluates the dependent variable y as the weighted linear combination of independent variables x plus an error term ε, representing any deviation of the expected value of the model from the real value.

\[
y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon \qquad (1)
\]

The method of model estimation is least squares, in which the set of coefficients β is chosen to minimize the residual sum of squares:

\[
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \qquad (2)
\]
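Equations (1) and (2) correspond to an ordinary least-squares solve with an explicit intercept column; a minimal NumPy rendering (names are illustrative):

```python
import numpy as np

def least_squares(U, y):
    """Fit y = beta_0 + sum_j beta_j u_j by minimizing RSS (Eq. 2).

    U holds the basis-function values for one cluster's trials; y holds
    the measured results (e.g., runtimes).
    """
    A = np.column_stack([np.ones(len(U)), U])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))    # residual sum of squares
    return beta, rss
```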

Model selection itself composes a performance model as the linear combination of a set of non-linear functions on the projected profiling data. The set of possible functions is known as a model pool, which can include basis expansions, mathematical transformations, and variable interactions, among others. To allow for greater generalizability, this work can iterate over a set of model pools and select the one that minimizes squared error.

In order to manage model complexity and optimize training error rate, a forward-stepwise procedure is used to aggregate basis functions based upon the adjusted coefficient of determination, a modification of the coefficient of determination that adjusts for the number of terms in the model [7]:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \qquad (3)
\]

It is important to note that $\bar{R}^2$ does not have the same interpretation as $R^2$; while $R^2$ lies in the range $[0, 1]$, $\bar{R}^2$ lies in the range $(-\infty, 1]$. The intuition is that any value of $\bar{R}^2$ less than zero implies a fit worse than could be expected by chance.
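Equation (3) in code form, as a small self-contained helper (illustrative, not part of Eiger):

```python
import numpy as np

def adjusted_r2(y, y_hat, p):
    """Adjusted coefficient of determination, Eq. (3).

    n is the number of samples; p the number of model terms (excluding
    the intercept). Values below zero indicate a fit worse than chance.
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)              # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)         # total sum of squares
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```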

    profile, performance, modelPool = ...   // initialize training data
    threshold = ...                         // specification from user
    finalModel = []                         // empty set
    currentRsquaredAdj = −∞
    begin
        while not done do
            maximum = −∞
            foreach function left in modelPool do
                add function to finalModel
                U = apply finalModel to profile
                beta = leastSquares(U, performance)
                RsquaredAdj = ...           // calculate adjusted R² (Eq. 3)
                if RsquaredAdj > maximum then
                    maximum = RsquaredAdj
                    newFunction = function
                    newBeta = beta
                end
                remove function from finalModel
            end
            if maximum − currentRsquaredAdj > threshold then
                add newFunction to finalModel
                remove newFunction from modelPool
                currentRsquaredAdj = maximum
            end
            else
                done = True                 // there are no more useful functions
            end
        end
    end
    return finalModel, newBeta, currentRsquaredAdj

Algorithm 1: Selects a model that minimizes error over a cluster
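For readers who prefer executable form, the following is a compact Python rendering of Algorithm 1, under the assumption that the model pool has already been expanded into named candidate feature columns (one column per basis function applied to a principal component); all names are illustrative:

```python
import numpy as np

def stepwise_select(features, y, threshold):
    """Forward-stepwise model selection by adjusted R^2 (Algorithm 1).

    `features` maps a candidate name (e.g. "log2(PC0)") to its column of
    values over the training trials; `y` holds the measured results.
    """
    def fit(cols):
        # Least-squares fit (Eq. 2) and adjusted R^2 (Eq. 3).
        A = np.column_stack([np.ones(len(y))] + cols)
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        k = len(cols)
        return beta, 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - k - 1)

    pool = dict(features)
    model, cols = [], []                 # selected names and their columns
    beta, current = None, -np.inf
    while pool:
        # Score every remaining candidate when added to the current model.
        scores = {name: fit(cols + [col]) for name, col in pool.items()}
        best = max(scores, key=lambda name: scores[name][1])
        best_beta, best_adj = scores[best]
        if best_adj - current <= threshold:
            break                        # no remaining function helps enough
        model.append(best)
        cols.append(pool.pop(best))
        beta, current = best_beta, best_adj
    return model, beta, current
```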

III. EVALUATION

This section describes the profiling infrastructure developed for Eiger, including a normalized schema for storing profiling data in a relational database.

Application Interfaces. GPU Ocelot defines a functional simulator [8] for NVIDIA's PTX [9], a low-level RISC-like assembly language for GPU computing. We extended this functional simulator to capture detailed traces of dynamic instructions, SIMD utilization, and effective memory bandwidth. We leverage the compiler analysis functionality of GPU Ocelot to measure metrics such as the number of live registers, the frequency of synchronization, and registers per thread.

Statistical performance modeling requires significant profiling to properly characterize workloads and yield enough data points at various configurations to adequately capture relationships among metrics and results. Consequently, this work focuses on providing a durable and centralized data store supporting concurrent access from multiple data sources.


We selected MySQL, a commercially available open-source relational database management system, to serve as the centralized data store. Eiger's profiling database schema supports flexible, user-defined metrics to capture application behavior and machine configuration. At the top level, this work defines a trial as one execution of a GPU compute kernel. The trial references a particular application, machine configuration, and dataset, and is annotated with additional metadata capturing details about the trial's execution environment and purpose. The machine configuration is a vector of real numbers describing characteristics of the target processor.
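The paper does not reproduce the schema's DDL; the following is a hypothetical, simplified sketch of the normalized structure just described (trials referencing applications, machines, and datasets, with user-defined metrics in a separate table), using SQLite as a stand-in for MySQL:

```python
import sqlite3  # SQLite as a stand-in for MySQL in this sketch

# Hypothetical, simplified DDL for the normalized schema described in
# the text; the real Eiger schema may differ.
SCHEMA = """
CREATE TABLE applications (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE machines     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE datasets     (id INTEGER PRIMARY KEY, name TEXT);

-- One trial = one execution of a GPU compute kernel, annotated with
-- metadata about its execution environment and purpose.
CREATE TABLE trials (
    id             INTEGER PRIMARY KEY,
    application_id INTEGER REFERENCES applications(id),
    machine_id     INTEGER REFERENCES machines(id),
    dataset_id     INTEGER REFERENCES datasets(id),
    metadata       TEXT
);

-- User-defined metrics: one (name, value) row per metric per trial
-- supports flexible application and machine metrics without schema
-- changes; machine configurations are vectors of real numbers.
CREATE TABLE metrics (
    trial_id INTEGER REFERENCES trials(id),
    name     TEXT,
    value    REAL
);
"""

conn = sqlite3.connect("eiger_profile.db")
conn.executescript(SCHEMA)
```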

A. Metrics

The set of metrics under consideration is partitioned into two classes: application metrics and machine metrics. Application metrics are independent of the device upon which the application runs, whereas machine metrics describe the execution hardware itself. Many of these application metrics are consistent with metrics identified in other workload characterization studies such as [10], which describes a set of microarchitecture-independent metrics.

Application metrics characterize the runtime behavior of the trial and are partitioned into deterministic and non-deterministic metrics. Deterministic metrics are invariant across machines and are based on profiling results obtained from executing the application on a deterministic functional simulator. This work assumes that applications are race free, or that data races may be safely ignored without impacting results. Non-deterministic metrics contain the results of hardware performance counters, runtimes, and other metrics which may vary non-deterministically across trials, even while all inputs and the machine configuration remain constant. This partitioning of metrics enables profiling runs to capture only strictly necessary information for each trial, as the deterministic metrics, by definition, do not vary across executions. The deterministic application metrics are summarized in Table II.

Metric               Units         Collection method
Memory Efficiency    percentage    instrumentation
Memory Intensity     instructions  instrumentation
Memory Sharing       percentage    instrumentation
Activity Factor      percentage    instrumentation
MIMD                 speedup       instrumentation
SIMD                 speedup       instrumentation
DMA Size             bytes         static analysis
Static Integer       instructions  static analysis
Static Memory        instructions  static analysis
Static Control       instructions  static analysis
Static Parallelism   instructions  static analysis
Static Float         instructions  static analysis
Static Special       instructions  static analysis
Dynamic Integer      instructions  emulation
Dynamic Memory       instructions  emulation
Dynamic Control      instructions  emulation
Dynamic Parallelism  instructions  emulation
Dynamic Float        instructions  emulation
Dynamic Special      instructions  emulation

TABLE II: Application metrics.

This work considers machine metrics, summarized in Table III, which capture the performance characteristics of the processors under consideration and define the dimensions of possible design space explorations.

B. Validation Procedure

While there are many options for validating regression models, there is no universally agreed-upon method. $R^2$, known as the coefficient of determination, is often used in regression analysis, but it can only indicate how closely the model fits the training data, a quantity known as training error.

                                    GTX 480   GTX 560 Ti   C2070
Issue width                         2         3            2
Shader frequency (MHz)              1401      1645         1150
Memory frequency (MHz)              3696      4008         3696
Cores per Streaming Multiprocessor  32        48           32
Streaming Multiprocessors           8         8            8
Bandwidth (GB/s)                    177.4     128          144
L2 cache (kB)                       640       512          640

TABLE III: GPU processor metrics.

Suite      Application      Description
Parboil    fft              Fast Fourier transformation
           mm               Dense matrix-matrix multiply
           mri-q            Magnetic resonance image reconstruction in non-Cartesian space
Rodinia    hotspot          Microprocessor thermal modeling
           gaussian         Linear system solver using Gaussian elimination
           bfs              Breadth-first graph traversal
CUDA SDK   scalarprod       Scalar product of vector pairs
           eigenvalues      Bisection algorithm for calculating eigenvalues of tridiagonal matrices
           binomialoptions  Fair call price evaluator for European options using the binomial model
           scan             Parallel prefix sum over large vectors
           boxfilter        Box convolution filter for image processing
           mersennetwister  Random number generator

TABLE IV: Applications.

Instead, we would ideally like to know how well the model performs on test data independent of the training data, known as test error; however, this is complicated by statistically unstable methods for test-error estimation. For this paper we use K-fold cross-validation [11], a popular technique for estimating prediction error. In K-fold cross-validation, the input data set is randomly partitioned into K equally sized parts, denoted $K_1, K_2, \ldots, K_K$. For each $i \in 1 \ldots K$, the parts $K_j$, $j \neq i$, are used to train the model, and part $K_i$ is used for testing. The performance of the model is then the average over all K runs. The special case where $K = N$ is known as leave-one-out cross-validation.
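K-fold cross-validation is a few lines of NumPy; the train and error callables below are placeholders for Eiger's model construction and reporting passes:

```python
import numpy as np

def k_fold_error(X, y, K, train, error):
    """Estimate test error by K-fold cross-validation.

    `train(X, y)` returns a fitted model; `error(model, X, y)` scores it.
    Both are placeholders for the model-construction and reporting
    passes. With K == len(y) this is leave-one-out cross-validation.
    """
    folds = np.array_split(np.random.permutation(len(y)), K)
    scores = []
    for i in range(K):
        test = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train(X[train_idx], y[train_idx])
        scores.append(error(model, X[test], y[test]))
    return float(np.mean(scores))        # average over the K held-out folds
```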

To evaluate Eiger's capacity for performance prediction, the 12 benchmark applications listed in Table IV were selected and profiled in detail using GPU Ocelot's PTX functional simulator. This gathers deterministic application-dependent profiling information using the metric collection methodology laid out in Section II-B. Subsequently, the same set of applications was executed on NVIDIA GPUs via GPU Ocelot's NVIDIA backend. Lightweight profiling tools attached to GPU Ocelot capture kernel execution times. Timing measurements were made for each of the devices described in Table III. The host machine is an Intel Core i7 at 3.2 GHz.

C. Experiments

Varying Dimensionality. This experiment reduces the number of dimensions retained after principal component analysis (PCA). Reducing dimensionality is a lossy compression technique, which has the potential to alias important metrics that may have a great impact on the performance of the model. However, increasing the dimensionality of the input data has the potential to complicate the model, resulting in poor generalizability due to overfitting. As seen in Figure 2, there is indeed a minimum where the benefit of reduced dimensionality on model complexity is balanced against the loss of data.

Page 5: Eiger: A Framework for the Automated Synthesis of ...casl.gatech.edu/wp-content/uploads/2013/01/eiger.pdf · A. Eiger Framework A detailed illustration of the Eiger framework is provided

Fig. 2: Error as dimensions are varied.

Fig. 3: Error as the number of clusters is varied.

Varying Cluster Count. This experiment increases the number of clusters to demonstrate the performance benefits of generating models from only like data points. Each data point in the experiment set is predicted by the model whose cluster centroid is closest. The results, shown in Figure 3, demonstrate how segmentation into separate clusters for modeling can increase the quality of the models, although at a certain point the quality of the individual models decays due to the low number of data points in each cluster.

Varying New Function Threshold. This experiment reduces the threshold for adding new functions to the final model. As the threshold decreases, more functions that only marginally increase the quality of the regression are included. This increases the dependence on the training data and therefore the variance in the final model. The effect of varying this threshold is demonstrated in Figure 4.

Varying Applications. In this experiment each application is predicted using models created from all of the other applications. This experiment demonstrates how generalized the application metrics are; instead of relying upon previous executions of the same application to train the model, which may obscure some of the application characteristics (e.g., algorithmic complexity, communication patterns, etc.), other applications with potentially widely different algorithmic implementations are used. Results are shown in Figure 5. Each application kernel is listed separately, indicated by the number concatenated to the end of the application name.

Varying Machine Parameters. In this experiment the GTX 480 is predicted by models trained on only the GTX 560 Ti and the Tesla C2070. This experiment demonstrates how well hardware metrics can describe the performance of a given application.

Fig. 4: Error as the convergence threshold varies.

Fig. 5: Predicted versus actual runtimes for each application for models trained from all other applications, annotated by mean squared error.

Results from this experiment, shown in Figure 6, indicate that accurate predictions are possible simply by varying machine parameters.

D. Discussion

Let us consider the structure of one of these models, where a model is built from every trial except those from the second eigenvalues kernel. All but nine principal components are removed, and all trials belong to a single cluster. The threshold used in the model building algorithm is 0.01. The final model takes the form
\[
\mathrm{runtime} = \beta_0 \log_2(PC_0) + \beta_1 \log_2(PC_8) + \beta_2 \log_2(PC_1) + \beta_3 (PC_2 \cdot PC_3)
\]
where $\beta_0, \beta_2$ are positive and $\beta_1, \beta_3$ are negative. Here $PC_0$ represents problem size, including contributions from dynamic instruction counts, function call depth, and branches; $PC_8$ represents SIMD utilization, i.e., parallelization and control flow divergence; $PC_1$ represents program size, contributed to most by static instruction counts; and $PC_2 \cdot PC_3$ represents throughput as the product of threading and efficiency. Intuitively, this set of principal components makes sense: as problem size increases, so does execution time, and as SIMD utilization and throughput increase, execution time decreases. This model resulted in a 29.28% average mean absolute percent error when predicting the second eigenvalues kernel's trials.

It is surprising, however, that Eiger selects base-2 logarithms to construct the performance model. This is a consequence of automated model selection, which attempts to find the best fit possible given the available model pool and training data.


Fig. 6: Predicted versus actual runtimes on the GTX 480 when the model is trained on the GTX 560 Ti and the Tesla C2070, annotated by mean squared error.

For the given problem sizes, this performance model yields the best-fit model of runtimes. Careful analysis of the algorithm, implementation, and target hardware would almost certainly not express runtime as proportional to the logarithm of dynamic instruction counts. And yet, doing so yields the closest-fitting performance model for the problem sizes in the training data. Thus, Eiger is able to obtain the best performance model given the training data and may outperform analytic models.

IV. RELATED WORK

Jia et al. [12] present a design space exploration technique which simulates a random subset of GPU designs from a very large design space and then applies a stepwise regression modeling algorithm to construct a performance estimator. Our work presents a flexible framework for the composition and use of such existing and future regression techniques to facilitate the construction and application of performance models.

Genbrugge et al. [13] describe a method for constructing a synthetic trace of a program execution bearing the same statistical properties as a complete execution but of much shorter overall length. Both approaches are focused on modeling specific phenomena rather than enabling the construction of general models. Hong et al. [14] propose a predictive analytical performance model for GPUs. Our approach, on the other hand, does not assume a particular processor architecture or machine model and instead attempts to determine them based on measurable statistics that may change substantially as microarchitectures evolve.

Kerr et al. [5] describe a PCA-based approach to modeling GPU workloads executing on either GPUs or CPUs. That work inspired the design of Eiger, which formalizes and automates the steps of application profiling, dimensionality reduction, cluster analysis, and model selection. While much of the cluster analysis and model selection in that work was performed in a human-guided, ad hoc manner, the Eiger framework provides an automated implementation for (i) importing instrumentation data, (ii) exercising the analysis passes, and (iii) searching the (extensible) space of performance models. The models are exported in a format that can be used by simulators and software systems. Goswami et al. [4] perform detailed characterization studies of GPU applications using a detailed cycle-accurate simulator and identify redundancies in characterization metrics among kernels from different CUDA benchmark applications. The authors decompose kernel executions to provide recommendations for prioritizing the execution of benchmarks so as to exercise the most complete set of program behaviors in as few trial executions as possible.

V. CONCLUSION

This study demonstrates the Eiger modeling framework for automating the generation of statistical performance models. Eiger provides an automated method with which designers may profile and characterize workloads, automatically construct performance models, and evaluate performance sensitivity to processor configurations. Our results show that models based on as few as 5-7 principal components achieve fairly low mean squared error, but that adding principal components can increase squared error. We also verify that cluster analysis has a strong impact on model accuracy due to significant differences in workload characteristics among benchmark suites. Finally, this empirical evaluation shows that an automated statistical performance modeling framework such as Eiger can provide an effective approach to design space exploration of candidate microarchitectures. The next step for us is the application to the modeling of energy and power.

VI. ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation under grant CCF-0905459 and by Sandia National Laboratories. Sandia National Laboratories is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

REFERENCES

[1] C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky, and J. Mayo, "A simulator for large-scale parallel computer architectures," IJDST, vol. 1, no. 2, pp. 57-73, 2010.

[2] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins University Press, 1996.

[3] "SciPy: Scientific tools for Python," July 2012, http://www.scipy.org/.

[4] N. Goswami, R. Shankar, M. Joshi, and T. Li, "Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications," in Workload Characterization (IISWC), 2010 IEEE International Symposium on, Dec. 2010, pp. 1-10.

[5] A. Kerr, G. Diamos, and S. Yalamanchili, "Modeling GPU-CPU workloads and systems," in Third Workshop on General-Purpose Computation on Graphics Processing Units, Pittsburgh, PA, USA, March 2010.

[6] H. Samet, Foundations of Multidimensional and Metric Data Structures, 1st ed. Morgan Kaufmann, 2006.

[7] M. H. Kutner, C. J. Nachtsheim, and J. Neter, Applied Linear Regression Models, 4th international ed. McGraw-Hill/Irwin, Sep. 2004.

[8] A. Kerr, G. Diamos, and S. Yalamanchili, "A characterization and analysis of PTX kernels," in IISWC09: IEEE International Symposium on Workload Characterization, Austin, TX, USA, October 2009.

[9] NVIDIA, NVIDIA Compute PTX: Parallel Thread Execution, 1st ed., NVIDIA Corporation, Santa Clara, California, October 2008.

[10] K. Hoste and L. Eeckhout, "Microarchitecture-independent workload characterization," IEEE Micro, vol. 27, no. 3, pp. 63-72, May-June 2007.

[11] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," Morgan Kaufmann, 1995, pp. 1137-1143.

[12] W. Jia, K. Shaw, and M. Martonosi, "Stargazer: Automated regression-based GPU design space exploration," in Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on, April 2012, pp. 2-13.

[13] D. Genbrugge and L. Eeckhout, "Chip multiprocessor design space exploration through statistical simulation," IEEE Transactions on Computers, vol. 58, no. 12, pp. 1668-1681, Dec. 2009.

[14] S. Hong and H. Kim, "An integrated GPU power and performance model," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 280-289.