Statistical Inference for Efficient Microarchitectural and Application Analysis
Benjamin C. Lee
www.deas.harvard.edu/∼bclee
Division of Engineering and Applied Sciences
Harvard University
15 November 2006
Benjamin C. Lee 1 :: Supercomputing 2006
Acknowledgements
David Brooks, Harvard University
Bronis de Supinski, Lawrence Livermore National Laboratory
Martin Schulz, Lawrence Livermore National Laboratory
Outline
Motivation & Background
  Parameter Space Exploration
  Exploration Paradigm
  Statistical Inference
Microarchitectural Analysis
  Methodology
  Evaluation
  Case Study
Application Analysis
  Methodology
  Evaluation
  Case Study
Conclusion
Parameter Space Exploration
Cycle-Accurate Simulation
  Tracks instructions' progress through the microprocessor
  Estimates performance, power, temperature, ...

Execution-Based Profiling
  Selectively execute application with varying inputs
  Estimates performance

Exploration Costs
  Non-trivial simulation, profiling times
  Parameter space scales exponentially (m^p)
    p :: parameter count
    m :: parameter resolution
Exploration Paradigm
Comprehensively understand the parameter space
  Specify a large, high-resolution parameter space
  Consider all parameters simultaneously

Selectively measure a modest number of points
  Sample points randomly from the space for measurement
  Decouple resolution of the space and of the measurements

Efficiently leverage measured data with inference
  Reveal trends, trade-offs from sparse sampling
  Enable predictions for metrics of interest
Model Formulation
Notation
  n observations (measured samples)
  Response :: y = y1, ..., yn (e.g., performance)
  Predictors :: xi = xi,1, ..., xi,p (e.g., L2 cache)
  Regression Coefficients :: β = β0, ..., βp
  Random Error :: e = e1, ..., en where ei ∼ N(0, σ²)
  Transformations :: f, g = g1, ..., gp

Model
  f(y) = β0 + Σ_{j=1}^{p} βj gj(xj) + e
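The talk's statistical framework is R (Hmisc and Design packages); as an illustrative stand-in, here is a minimal least-squares fit of the same linear model in Python/numpy. The data, coefficients, and sizes below are invented, and the transformations f, gj are taken as identities:

```python
import numpy as np

# Minimal sketch of the slide's model f(y) = b0 + sum_j bj*gj(xj) + e,
# fit by least squares; data here is synthetic, not from the talk.
rng = np.random.default_rng(0)
n, p = 50, 2                        # n observations, p predictors
X = rng.uniform(1, 4, size=(n, p))  # e.g., scaled depth, L2 size
beta_true = np.array([0.5, 0.3, -0.2])
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.01, n)

# Design matrix with an intercept column; identity transformations g_j
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Expected response for a query point is the weighted sum of predictors
x_query = np.array([1.0, 2.0, 3.0])
print(beta_hat, x_query @ beta_hat)
```

With low noise the fitted coefficients recover the true ones closely; the Design package's `ols` plays this role in the talk's actual workflow.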
Predictor Interaction
Modeling Interaction
  Suppose the effects of predictors x1, x2 cannot be separated
  Construct predictor x3 = x1 x2

  y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + e

Example
  Let x1 be pipeline depth, x2 be L2 cache size
  Performance impact of pipelining is affected by cache size

  Speedup = Depth / (1 + Stalls/Instruction)
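Constructing the interaction predictor x3 = x1·x2 amounts to appending one column to the design matrix. A tiny sketch with invented depth and cache values:

```python
import numpy as np

# Sketch: adding an interaction column x1*x2 to the design matrix,
# as on the slide (x3 = x1*x2). Values are illustrative, not measured.
x1 = np.array([12.0, 18.0, 24.0, 30.0])   # e.g., pipeline depth (FO4)
x2 = np.array([11.0, 12.0, 13.0, 14.0])   # e.g., log2(L2 entries)
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
print(A.shape)  # intercept + two main effects + one interaction
```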
Predictor Non-Linearity
Restricted Cubic Splines
  Divide predictor domain into intervals separated by knots
  Piecewise cubic polynomials joined at knots
  Higher-order polynomials provide better fits

Stone [SS'86]
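A hedged sketch of a restricted cubic spline basis in the truncated-power form following Harrell's "Regression modeling strategies" (which the deck cites); the knot values below are invented for illustration, and the talk's actual splines come from the R Design package:

```python
import numpy as np

def rcs_basis(x, knots):
    """Nonlinear basis columns for a restricted cubic spline.

    Returns k-2 columns for k knots; a model would also include an
    intercept and the linear term x. Linear in the tails by construction.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    d = t[-1] - t[-2]
    pos3 = lambda u: np.maximum(u, 0.0) ** 3  # truncated cubic (u)_+^3
    cols = []
    for j in range(k - 2):
        cols.append(
            pos3(x - t[j])
            - pos3(x - t[-2]) * (t[-1] - t[j]) / d
            + pos3(x - t[-1]) * (t[-2] - t[j]) / d
        )
    return np.column_stack(cols)

# Beyond the last knot the basis is linear, so second differences of
# equally spaced evaluations vanish.
xs = np.array([20.0, 21.0, 22.0, 23.0])
B = rcs_basis(xs, knots=[2.0, 5.0, 8.0, 11.0])
print(np.allclose(np.diff(B[:, 0], n=2), 0.0))  # True
```

The tail-linearity constraint is what distinguishes restricted from ordinary cubic splines and tames behavior at the edges of the sampled range.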
Prediction
Expected Response
  β known from least squares
  xi,1, ..., xi,p known for a given query i
  Expected response is the weighted sum of predictor values

  E[y] = E[β0 + Σ_{j=1}^{p} βj xj] + E[e]
       = β0 + Σ_{j=1}^{p} βj xj
Tools and Benchmarks
Simulation Framework
  Turandot :: a cycle-accurate, trace-driven simulator
  PowerTimer :: power models derived from circuit analyses
  Baseline simulator models the POWER4/POWER5 architecture

Benchmarks
  SPEC2kCPU :: compute-intensive benchmarks
  SPECjbb :: Java server benchmark

Statistical Framework
  R :: software environment for statistical computing
  Hmisc and Design packages
Predictors :: Microarchitecture

Set  Parameters            Measure                         Range        |Si|
S1   Depth                 depth (FO4)                     9::3::36     10
S2   Width                 width (insn b/w)                4,8,16       3
                           L/S reorder queue (entries)     15::15::45
                           store queue (entries)           14::14::42
                           functional units (count)        1,2,4
S3   Physical Registers    general purpose (GP) count      40::10::130  10
                           floating-point (FP) count       40::8::112
                           special purpose (SP) count      42::6::96
S4   Reservation Stations  branch (entries)                6::1::15     10
                           fixed-point/memory (entries)    10::2::28
                           floating-point (entries)        5::1::14
S5   I-L1 Cache            i-L1 cache size, log2(entries)  7::1::11     5
S6   D-L1 Cache            d-L1 cache size, log2(entries)  6::1::10     5
S7   L2 Cache              L2 cache size, log2(entries)    11::1::15    5
                           L2 cache latency (cycles)       6::2::14

Parameter space of 375,000 design points
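Reading the table as one resolution |Si| per set (parameters within a set vary jointly), the quoted space size is simply the product of the set sizes:

```python
import math

# Design-space size is the product of the per-set resolutions |Si|
# from the microarchitecture predictor table.
set_sizes = {"S1": 10, "S2": 3, "S3": 10, "S4": 10, "S5": 5, "S6": 5, "S7": 5}
space = math.prod(set_sizes.values())
print(space)  # 375000
```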
Validation Approach
Framework
  Formulate models with 1,000 samples
  Obtain 100 additional random samples for validation
  Quantify percentage error, 100 × |yi − ŷi| / yi

Comparison
  Simulator-reported performance, power
  Regression-predicted performance, power
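The validation metric is per-sample percentage error, summarized by its median (the figure the Conclusion quotes). A tiny sketch with invented numbers:

```python
import numpy as np

# Percentage error per validation sample, summarized by the median;
# values are made up, not the talk's measurements.
y_true = np.array([1.20, 0.85, 1.05, 0.60])   # simulator-reported
y_pred = np.array([1.10, 0.90, 1.00, 0.66])   # regression-predicted
pct_err = 100.0 * np.abs(y_true - y_pred) / y_true
print(np.median(pct_err))
```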
Prediction Accuracy
Multiprocessor Heterogeneity I
Motivation
  Evaluate trends in chip multiprocessor design
  Mitigate penalties from a single design compromise

Objective
  Identify efficient heterogeneous design compromises

Approach
  Simulate 1K samples from the design space
  Formulate regression models for performance, power
  Identify per-benchmark optima (bips³/W) via regression
  Identify compromises via K-means clustering
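The clustering step groups per-benchmark optimal configurations so that a few cluster centers can serve as heterogeneous design compromises. The talk does not specify an implementation, so this dependency-free sketch hand-rolls k-means on invented (depth, width, log2 L2) optima; scikit-learn's KMeans would be the usual choice in practice:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Tiny Lloyd's-algorithm k-means; illustrative only."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(
            np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2),
            axis=1,
        )
        # move each center to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return centers, labels

# Hypothetical per-benchmark optima: (depth, width, log2 L2 entries)
optima = np.array([
    [18.0, 4.0, 11.0],
    [19.0, 4.0, 11.0],
    [30.0, 8.0, 14.0],
    [31.0, 8.0, 15.0],
    [18.0, 4.0, 12.0],
    [30.0, 8.0, 15.0],
])
centers, labels = kmeans(optima, k=2)
print(centers)
```

Each center is a candidate core design; benchmarks in its cluster suffer the least from adopting that compromise.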
Multiprocessor Heterogeneity II
Platforms and Workloads
Platforms
  Blue Gene/L
  Intel Xeon clusters (ALC, MCR)

Numerical Methods
  Semicoarsening Multigrid 2000 (SMG2k)
  High-Performance Linpack (HPL)

Statistical Framework
  R :: software environment for statistical computing
  Hmisc and Design packages
Predictors :: Numerical Methods

Semicoarsening Multigrid 2000
Set  Parameters  Measure                    Range       |Si|
S1   Nx (x-dim)  working set (grid points)  10:20:510   26
S2   Ny (y-dim)  working set (grid points)  10:20:510   26
S3   Nz (z-dim)  working set (grid points)  10:20:510   26
S4   Px (x-dim)  processors                 1,8,64,512  4
S5   Py (y-dim)  processors                 1,8,64,512  4
S6   Pz (z-dim)  processors                 1,8,64,512  4

Parameter space of ∼280,000 combinations

High-Performance Linpack
Set  Parameters                   Measure                   Range                    |Si|
S1   Matrix Size (N)              sq. matrix dim            1000                     1
S2   Block Size (NB)              sq. block dim             10:10:80                 8
S3   Processor Distribution       rows (P), log2(procs)     0:1:9                    10
                                  columns (Q), log2(procs)  9−P
S4   Panel Factor (PFACT)         algorithm                 L,R,C                    3
S5   Recursive Factor (RFACT)     algorithm                 L,R,C                    3
S6   Recursive Base (NBMIN)       block size                1:1:8                    8
S7   Recursive Sub-Panels (NDIV)  sub-panels                2:1:4                    3
S8   Broadcast (BCAST)            algorithm                 1rg,1rM,2rg,2rM,Lng,LnM  6

Parameter space of ∼100,000 combinations
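For HPL the quoted order of magnitude again falls out of the product of per-set resolutions (P and Q vary jointly through Q = 9−P, so S3 contributes 10):

```python
import math

# Per-set resolutions |Si| from the HPL table, S1 through S8.
hpl_sizes = [1, 8, 10, 3, 3, 8, 3, 6]
print(math.prod(hpl_sizes))  # 103680, i.e., ~100,000
```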
Validation Approach
Framework
  Formulate models with 600 samples
  Obtain 100 additional random samples for validation
  Quantify percentage error, 100 × |yi − ŷi| / yi

Comparison
  Profiled execution time
  Regression-predicted execution time
Prediction Accuracy
Performance Gradients I
Motivation
  Model performance empirically
  Circumvent analytical complexity

Objective
  Understand performance topology, bottlenecks

Approach
  Measure 600 samples from the parameter space
  Formulate regression models for performance
  Predict execution time for every point
  Compute numerical gradients with local differences
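"Numerical gradients with local differences" can be sketched as: predict the metric over a grid of parameter values, then difference neighboring grid points. The quadratic surface below is an invented stand-in for regression-predicted execution time:

```python
import numpy as np

# Predict a metric over a grid, then take local differences.
nx = np.arange(10, 111, 20, dtype=float)   # e.g., x-dim working set
ny = np.arange(10, 111, 20, dtype=float)   # e.g., y-dim working set
NX, NY = np.meshgrid(nx, ny, indexing="ij")
pred_time = 0.001 * NX * NY + 0.05 * NX    # invented predicted surface

# np.gradient uses central differences in the grid interior,
# one-sided differences at the edges
d_dnx, d_dny = np.gradient(pred_time, nx, ny)
print(d_dnx.shape, d_dny.shape)
```

Large-magnitude gradient regions mark parameters whose small changes move performance most, i.e., likely bottlenecks.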
Performance Gradients II
Conclusion
Exploration Paradigm
  Comprehensively understand the parameter space
  Selectively measure a modest number of points
  Efficiently leverage measured data with inference

Model Evaluation
  7.2%, 5.4% median errors for µ-arch performance, power
  5.1%, 3.1% median errors for SMG2k, HPL performance

Future Directions
  Chip multiprocessors and on-chip interconnect
  Additional applications and compiler parameters
  Combine microarchitecture, application models
Appendix
Further Reading
www.deas.harvard.edu/∼bclee
B.C. Lee, D.M. Brooks, B.R. de Supinski, M. Schulz, K. Singh, and S.A. McKee. Methods of inference & learning for performance modeling of parallel applications. PPoPP'07: Symposium on Principles & Practice of Parallel Programming, March 2007.

B.C. Lee and D.M. Brooks. Illustrative design space studies with microarchitectural regression models. HPCA-13: International Symposium on High Performance Computer Architecture, Feb 2007.

B.C. Lee and D.M. Brooks. Accurate, efficient regression modeling for microarchitectural performance, power prediction. ASPLOS-XII: International Conference on Architectural Support for Programming Languages & Operating Systems, Oct 2006.

B.C. Lee, M. Schulz, and B. de Supinski. Regression strategies for parameter space exploration: A case study in semicoarsening multigrid & R. Technical Report UCRL-TR-224851, Lawrence Livermore National Laboratory, Sept 2006.

B.C. Lee and D.M. Brooks. Statistically rigorous regression modeling for the microprocessor design space. MoBS-2: Workshop on Modeling, Benchmarking, & Simulation, June 2006.

B.C. Lee and D.M. Brooks. Regression modeling strategies for microarchitectural performance & power prediction. Harvard University Technical Report TR-08-06, March 2006.
References I

Y. Li, B.C. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space exploration subject to physical constraints. HPCA-12: International Symposium on High-Performance Computer Architecture, Feb 2006.

L. Eeckhout, S. Nussbaum, J. Smith, and K. De Bosschere. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro, Sept/Oct 2003.

R. Liu and K. Asanovic. Accelerating architectural exploration using canonical instruction segments. International Symposium on Performance Analysis of Systems and Software, Austin, Texas, March 2006.

T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. ASPLOS-X: Architectural Support for Programming Languages and Operating Systems, October 2002.

B.C. Lee and D.M. Brooks. Effects of pipeline complexity on SMT/CMP power-performance efficiency. ISCA-32: Workshop on Complexity Effective Design, June 2005.
References II

C. Stone. Comment: Generalized additive models. Statistical Science, 1986.

F. Harrell. Regression modeling strategies. Springer, New York, NY, 2001.

J. Yi, D. Lilja, and D. Hawkins. Improving computer architecture simulation methodology by adding statistical rigor. IEEE Computer, Nov 2005.

P. Joseph, K. Vaswani, and M.J. Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. In Proceedings of the 12th Symposium on High Performance Computer Architecture, Austin, Texas, February 2006.

S. Nussbaum and J. Smith. Modeling superscalar processors via statistical simulation. PACT 2001: International Conference on Parallel Architectures and Compilation Techniques, Barcelona, Sept 2001.
Treatment of Missing Data
Missing Completely at Random (MCAR)
  Treat unobserved design points as missing data
  Sampling uniformly at random (UAR) ensures observations are MCAR
  Data are missing for reasons unrelated to the characteristics or responses of the configuration

Informative Missing
  Data are more likely missing if their responses are systematically higher or lower
  "Missingness" is non-ignorable and must also be modeled
  Sampling UAR avoids such modeling complications
Predictor Non-Linearity I
Polynomial Transformations
  Undesirable peaks and valleys
  Differing trends across regions

Linear Splines
  Piecewise linear regions separated by knots
  Inadequate for complex, highly curved relationships

Restricted Cubic Splines
  Higher-order polynomials provide better fits
  Continuous at knots
  Linear constraint on tails
Predictor Non-Linearity II
Location of Knots
  Location of knots is less important than number of knots
  Place knots at fixed predictor quantiles

Number of Knots
  Flexibility and risk of over-fitting increase with knot count
  5 knots or fewer are often sufficient
  4 knots is a good compromise between flexibility and over-fitting
  Fewer knots required for small data sets

Stone [SS'86]
Derivation Overview

Spatial Sampling
Hierarchical Clustering
Association Analysis :: qualitative scatterplots, quantitative ρ²
Model Specification :: predictor interaction, non-linearity
Assessing Fit :: R² statistic
Residual Analysis :: normality (quantile-quantile), randomness (scatterplots)
Significance Testing :: hypothesis testing, F-statistic, p-values

Lee [ASPLOS'06], Lee [PPoPP'07]
Variable Clustering
[Figure: hierarchical clustering dendrogram of predictors and responses (stall counts, cache miss rates, issue rate, base_bips, cache sizes, latencies, depth, width, phys_reg, resv), with similarity measured by Spearman ρ²]
Performance Associations
[Figure: dot plots of mean performance (bips) across binned values of pipeline predictors (depth, width, phys_reg, resv), memory hierarchy predictors (l2cache_size, icache_size, dcache_size), and cache miss rates (il1miss_rate, dl1miss_rate, dl2miss_rate); N = 4000 samples]
Performance Associations I
[Figure: dot plots of mean bips across binned values of pipeline predictors (depth, width, phys_reg, resv), latency predictors (mem_lat, ls_lat, ctl_lat, fix_lat, fpu_lat), and memory hierarchy predictors (l2cache_size, icache_size, dcache_size); N = 4000 samples]
Performance Associations II
[Figure: dot plots of mean bips across binned branch predictors (br_rate, br_mis_rate) and stall counts (stall_inflight, stall_dmissq, stall_cast, stall_storeq, stall_reorderq, stall_resv, stall_rename); N = 4000 samples]
Performance Associations III
[Figure: dot plots of mean bips across binned cache miss rates (il1miss_rate, dl1miss_rate, dl2miss_rate) and baseline performance (base_bips); N = 4000 samples]
Assessing Fit

Multiple Correlation Statistic
  R² is the fraction of response variance captured by predictors
  Large R² suggests a better fit to observed data
  R² → 1 suggests over-fitting (less likely if p < n/20)

  R² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)²

Residual Distribution Assumptions
  Residuals are normally distributed, ei ∼ N(0, σ²)
  No correlation between residuals and response, predictors
  Validate with scatterplots and quantile-quantile plots

  ei = yi − β0 − Σ_{j=1}^{p} βj xij
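The R² statistic above is a one-liner once fitted values are in hand; a sketch with invented numbers:

```python
import numpy as np

# R^2: fraction of response variance captured by the fitted values.
y = np.array([1.0, 2.0, 3.0, 4.0])        # observed responses
y_hat = np.array([1.1, 1.9, 3.2, 3.8])    # fitted values
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(r2)  # 0.98
```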
Significance Testing I
Approach
  Given two nested models, the null hypothesis H0 states that the additional predictors in the larger model have no association with the response
  Test H0 with F-statistics and p-values

Example
  Predictor interaction requires comparing nested models
  Consider a model y = β0 + β1 x1 + β2 x2 + β3 x1 x2
  Test significance of x1 with null hypothesis H0 : β1 = β3 = 0
Significance Testing II

F-Statistic
  Compare two nested models using their R² and the F-statistic
  R² is the fraction of response variance captured by predictors

  R² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)²

  The F-statistic of two nested models follows an F distribution

  F_{k, n−p−1} = ((R² − R²_*) / k) × ((n − p − 1) / (1 − R²))

  where R²_* is the R² of the smaller model with k fewer predictors

P-Values
  Probability that an F-statistic greater than or equal to the observed value would occur under H0
  Small p-values cast doubt on H0
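The nested-model F-test can be sketched end to end: fit the model with and without an interaction term, then combine the two R² values. Data and effect sizes below are invented (the talk's tests run through R):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
# Synthetic response with a genuine interaction effect
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(0, 0.1, n)

def r_squared(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
r2_small = r_squared(np.column_stack([ones, x1, x2]), y)          # nested
r2_full = r_squared(np.column_stack([ones, x1, x2, x1 * x2]), y)  # larger

k, p = 1, 3   # 1 additional predictor; 3 predictors in the full model
F = ((r2_full - r2_small) / k) * ((n - p - 1) / (1 - r2_full))
print(F > 10.0)  # strong evidence for the interaction term
```

The p-value is then the upper tail of the F_{k, n−p−1} distribution at this statistic (e.g., `scipy.stats.f.sf(F, k, n - p - 1)`).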
Significance Testing IV
Microarchitectural Predictors
  Majority of F-tests imply significance (p-values < 2.2e−16)
  Several predictors were less significant:
    Control latency (p-value = 0.1247)
    Reservation station size (p-value = 0.1239)
    L1 instruction cache size (p-value = 0.02941)

Application-Specific Predictors
  Majority of F-tests imply significance (p-values < 2.2e−16)
  Pipeline stalls classified by structure are less significant:
    Completion and reorder queue stalls (p-values > 0.4)
Related Work
Statistical Significance Ranking
  Yi :: Plackett-Burman, effect rankings
  Joseph :: Stepwise regression, coefficient rankings
  Bound parameter values to improve tractability
  Require simulation for estimation

Synthetic Workloads
  Eeckhout :: Profile workloads to obtain synthetic traces
  Nussbaum :: Superscalar and SMP simulation
  Obtain distribution of instructions and data dependencies
  Require simulation with smaller traces for estimation