Methodologies for Performance Simulation of Super-scalar OOO processors Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project

Post on 14-Jan-2016

Page 1

Methodologies for Performance Simulation of Super-scalar OOO Processors

Srinivas Neginhal, Anantharaman Kalyanaraman

CprE 585: Survey Project

Page 2

Introduction

Diagram labels: Processor Design, Modeling, Simulation, Performance Study

Page 3

Architectural Simulators

Explore the design space

- Evaluate existing hardware, or
- Predict performance of proposed hardware

Designer has control

Functional simulators:

- Model the architecture (programmer's focus)
- E.g., sim-fast, sim-safe

Performance simulators:

- Model the microarchitecture (designer's focus)
- E.g., cycle-by-cycle (sim-outorder)

Page 4

Simulation Issues

Real applications take too long for a cycle-by-cycle simulation

Vast design space:

- Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc.
- Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc.

Find design flaws and provide design improvements

Need a "robust" simulation methodology!

Page 5

Two Methodologies

HLS

- Hybrid: Statistical + Symbolic
- REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong and M. Farrens. Proc. ISCA, pp. 71-82, 2000.

BBDA

- Basic block distribution analysis
- REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.

Page 6

HLS: An Overview

A hybrid processor simulator

Diagram labels: HLS, Statistical Model, Symbolic Execution, Performance Contours spanned by design space parameters

What can be achieved?

Explore design changes in architectures and compilers that would be impractical to simulate using conventional simulators

Page 7

HLS: Main Idea

Diagram labels:

Application code

Statistical Profiling

Instruction stream, data stream

Machine-independent characteristics:

- Basic block size
- Dynamic instruction distance
- Instruction mix

Structural simulation of FUs and issue pipeline units

Architecture metrics:

- Cache behavior
- Branch prediction accuracy

Synthetically generated code

Page 8

Statistical Code Generation

Each "synthetic instruction" contains the following parameters, based on the statistical profile:

- Functional unit requirements
- Dynamic instruction distances
- Cache behavior
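A minimal sketch of how such a synthetic instruction stream might be drawn from a statistical profile. The profile numbers, the `SyntheticInstruction` fields, and the choice of two source operands per instruction are illustrative assumptions, not values or structures taken from HLS.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticInstruction:
    functional_unit: str         # functional unit class the instruction needs
    dependence_distances: tuple  # dynamic instruction distances to its producers
    l1_hit: bool                 # cache behavior drawn from the profiled hit rate


# Hypothetical profile: the numbers are made up for illustration only.
PROFILE = {
    "instruction_mix": {"int_alu": 0.50, "load": 0.25, "store": 0.10, "branch": 0.15},
    "did_histogram": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},
    "l1_hit_rate": 0.95,
}


def weighted_choice(rng, table):
    """Pick one key of `table` with probability proportional to its value."""
    keys = list(table)
    return rng.choices(keys, weights=[table[k] for k in keys], k=1)[0]


def generate(profile, count, seed=0):
    """Draw `count` synthetic instructions from the statistical profile."""
    rng = random.Random(seed)
    stream = []
    for _ in range(count):
        unit = weighted_choice(rng, profile["instruction_mix"])
        # Assumption: every instruction has two source operands.
        distances = tuple(
            weighted_choice(rng, profile["did_histogram"]) for _ in range(2)
        )
        hit = rng.random() < profile["l1_hit_rate"]
        stream.append(SyntheticInstruction(unit, distances, hit))
    return stream
```

The generated stream carries no real opcodes or addresses, only the per-instruction parameters the slide lists, which is what lets a structural simulator consume it directly.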

Page 9

Validation of HLS against SimpleScalar

For varying combinations of design parameters:

- Run the original benchmark code on SimpleScalar (using sim-outorder)
- Run the statistically generated code on HLS
- Compare SimpleScalar IPC vs. HLS IPC
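The comparison step can be expressed as a relative error between the two IPC values. The sketch below assumes the cycle-by-cycle IPC is the reference; the function names and the configuration labels in the usage note are hypothetical.

```python
def relative_ipc_error(reference_ipc, estimated_ipc):
    """Relative error of an estimated IPC against a reference IPC."""
    return abs(estimated_ipc - reference_ipc) / reference_ipc


def worst_case_error(results):
    """Largest relative error over {config: (reference_ipc, estimated_ipc)}."""
    return max(relative_ipc_error(ref, est) for ref, est in results.values())
```

For example, `worst_case_error({"config_a": (2.0, 1.9), "config_b": (1.0, 0.8)})` reports the error of the worse of the two (made-up) configurations.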

Page 10

Validation: Single- and Multi-value Correlations

IPC vs. L1-cache hit rate

For SPECint95:

HLS errors are within 5-7% of the cycle-by-cycle results!

Page 11

Validation: L1 Instruction Cache Miss Penalty vs. Hit Rate

Correlation suggests that:

Cache hit rate should be at least 88% to dominate

Page 12

HLS: Code Properties

Basic Block Size vs. L1-Cache Hit Rate

Correlation suggests that:

Increasing block size helps only when L1 cache hit rate is >96% or <82%

Page 13

HLS: Code Properties

Dynamic Instruction Distance vs. Basic Block Size

Correlation suggests that:

Moderate DID values suffice for IPC, and high values of basic block size (>8) do not help without an increase in DID

Page 14

HLS: Value Prediction

DID vs. Value predictability

GOAL: Break True Dependency

Stall Penalty for mispredict vs. Value Prediction Knowledge

Page 15

HLS: More Multi-value Correlations

L1-cache hit rate vs. Value Predictability

DID vs. Superscalar issue width

Page 16

HLS: Discussion

Low error rate only on the SPECint95 benchmark suite.

High error rates on the SPECfp95 and STREAM benchmarks

- Findings by R. H. Bell et al., 2004
- Reason: instruction-level granularity for the workload
- Recommended improvement: basic block-level granularity

Page 17

Goals

- The end of the initialization
- The period of the program
- Ideal place to simulate, given a specific number of instructions one has to simulate
- Accurate confidence estimation of the simulation point

<Note> Revamp this slide.

Page 18

Program Behavior

Program behavior has ramifications for architectural techniques.

Program behavior is different in different parts of execution.

- Initialization
- Cyclic behavior (periodic)

Page 19

Basic Block Distribution Analysis

Each basic block gets executed a certain number of times.

Number of times each basic block executes gives a fingerprint.

Use the fingerprints to find representative areas to simulate.

<Note> How does fingerprinting help?

Page 20

Cyclic Behavior of Programs

Cyclic behavior is not representative of all programs.

Common case for compute-bound applications.

The SPEC95 wave program executes 7 billion instructions before it reaches the code that amounts to the bulk of execution.

Page 21

Basic Block Vectors

Fast profiling to determine the number of times a basic block executes.

Behavior of the program is directly related to the code that it is executing.

Profiling gives a basic block fingerprint for that particular interval of time.

The full execution of the program and the interval we choose should spend proportionally the same amount of time in the same code.

Collected in intervals of 100 million instructions.

Page 22

Basic Block Vector - BBV

A BBV is a one-dimensional array.

- There is an element for each basic block in the program.
- Each element is the count of how many times a given basic block was entered during an interval.

Varying-size intervals

- A BBV collected over an interval of N times 100 million instructions is a BBV of duration N.
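A minimal sketch of BBV collection, assuming the execution is available as a trace of `(block_id, block_size)` pairs, where `block_size` is the number of instructions in the block. The trace format and function names are hypothetical; a real profiler would gather the counts inside the simulator.

```python
def collect_bbvs(trace, num_blocks, interval_length):
    """Split a basic-block trace into duration-1 BBVs.

    Each BBV counts how many times every basic block was entered during one
    interval of roughly `interval_length` instructions.
    """
    bbvs = []
    current = [0] * num_blocks
    executed = 0
    for block_id, block_size in trace:
        current[block_id] += 1
        executed += block_size
        if executed >= interval_length:
            bbvs.append(current)
            current = [0] * num_blocks
            executed = 0
    if executed > 0:
        bbvs.append(current)  # trailing partial interval
    return bbvs


def merge_to_duration(bbvs, n):
    """Build duration-n BBVs by summing runs of n consecutive duration-1 BBVs."""
    return [
        [sum(column) for column in zip(*bbvs[i:i + n])]
        for i in range(0, len(bbvs) - n + 1, n)
    ]
```

With the interval length from the slides, `interval_length` would be 100 million; the sketch keeps it as a parameter so small traces can be tried by hand.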

Page 23

Basic Block Vectors

A BBV is normalized

- Each element is divided by the sum of all elements.

Target BBV

- The BBV for the entire execution of the program.

Objective

- Find a BBV of small duration similar to the target BBV.
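Normalization and the target BBV can be sketched as follows, assuming each BBV is a plain list of counts and that the target BBV is obtained by summing the per-interval counts before normalizing. The function names are hypothetical.

```python
def normalize(bbv):
    """Divide every element by the sum of all elements."""
    total = sum(bbv)
    if total == 0:
        raise ValueError("cannot normalize an empty interval")
    return [count / total for count in bbv]


def target_bbv(interval_bbvs):
    """Normalized BBV for the whole execution, built from per-interval counts."""
    totals = [sum(column) for column in zip(*interval_bbvs)]
    return normalize(totals)
```

After normalization the elements of a BBV sum to 1, so intervals of different lengths become directly comparable.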

Page 24

Basic Block Vector Difference

Difference between BBVs

- Element-wise subtraction, sum of absolute values.
- A number between 0 and 2.

Manhattan and Euclidean distance.
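The two distances can be sketched as below for BBVs stored as lists of floats. The bound of 2 follows because each normalized BBV sums to 1, so the sum of absolute differences is at most 1 + 1. The function names are hypothetical.

```python
import math


def manhattan(a, b):
    """Element-wise subtraction, then the sum of absolute values."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Two normalized BBVs that touch completely disjoint sets of basic blocks reach the maximum Manhattan distance of 2; identical BBVs give 0.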

Page 25

Basic Block Difference Graph

Plot of how well each individual sample in the program compares to the target BBV.

For each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV.

Used to:

- Find the initialization phase
- Find the period of the program
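A minimal sketch of the Y-values of a basic block difference graph, assuming raw per-interval count vectors as input and the Manhattan difference as the metric. The helper names are hypothetical.

```python
def difference_graph(interval_bbvs):
    """Difference of every interval's normalized BBV from the target BBV."""

    def normalize(bbv):
        total = sum(bbv)
        return [count / total for count in bbv]

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    # Target BBV: counts summed over the entire execution, then normalized.
    target = normalize([sum(column) for column in zip(*interval_bbvs)])
    return [manhattan(normalize(bbv), target) for bbv in interval_bbvs]
```

Plotting the returned list against the interval index gives the graph: low values mark intervals that resemble the program as a whole.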

Page 26

Basic Block Difference Graph

Diagram and explain

Page 27

Initialization

Initialization is not trivial.

Important to simulate representative sections of code.

Detection of the end of the initialization phase is important.

Initialization Difference Graph

- Initial Representative Signal (IRS): first quarter of the BB difference graph.
- Slide it across the BB difference graph.
- Difference calculated at each point for the first half of the BBDG.
- When the IRS reaches the end of the initialization stage on the BB difference graph, the difference is maximized.
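The sliding step can be sketched as follows. This assumes the difference at each offset is the sum of absolute differences between the IRS and the equally long window of the BBDG starting at that offset; that metric and the function names are assumptions, not details confirmed by the slides.

```python
def sliding_difference(signal, graph, max_offset):
    """Difference between `signal` and each same-length window of `graph`."""
    width = len(signal)
    return [
        sum(abs(s - graph[offset + i]) for i, s in enumerate(signal))
        for offset in range(max_offset)
        if offset + width <= len(graph)
    ]


def initialization_end(bbdg):
    """Offset in the first half of the BBDG where the IRS difference peaks."""
    irs = bbdg[: len(bbdg) // 4]  # Initial Representative Signal
    differences = sliding_difference(irs, bbdg, len(bbdg) // 2)
    # max() returns the earliest offset when several offsets tie.
    return max(range(len(differences)), key=differences.__getitem__)
```

For a toy BBDG such as `[1.0] * 4 + [0.2] * 12`, the difference grows while the window still overlaps the first four intervals and peaks once it has moved past them.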

Page 28

Initialization

Diagram and explain

Page 29

Period

Period Difference Graph

- Period Representative Signal (PRS): part of the BBDG, starting from the end of initialization, one quarter of the length of program execution.
- Slide it across half the BBDG.
- The distance between the minimum Y-axis points is the period.

Using larger durations of a BBV creates a BBDG that emphasizes the larger periods.
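A rough sketch of period detection under stated assumptions: the PRS is compared with each window by the sum of absolute differences, and "minimum points" are taken to be offsets whose difference lies within a `tolerance` of the global minimum. Both choices, and all names, are illustrative rather than taken from the slides.

```python
def period_difference_graph(bbdg, init_end):
    """Slide the PRS across half of the BBDG, starting at `init_end`."""
    width = len(bbdg) // 4
    prs = bbdg[init_end:init_end + width]  # Period Representative Signal
    span = len(bbdg) // 2
    return [
        sum(abs(p - bbdg[init_end + offset + i]) for i, p in enumerate(prs))
        for offset in range(span)
        if init_end + offset + width <= len(bbdg)
    ]


def estimate_period(pdg, tolerance=1e-9):
    """Smallest spacing between offsets whose difference is near the minimum.

    Returns None when fewer than two minimum points are found.
    """
    lowest = min(pdg)
    minima = [i for i, value in enumerate(pdg) if value - lowest <= tolerance]
    gaps = [later - earlier for earlier, later in zip(minima, minima[1:])]
    return min(gaps) if gaps else None
```

On measured data the minima will not be exactly equal, so `tolerance` would need tuning, and a flat stretch of near-minimum values would have to be collapsed to a single point first.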

Page 30

Period

Diagram and explain

Page 31

Method

SimpleScalar modified.

- Output and clear statistics counters every 100 million instructions committed.
- Graphed data: IPC, % RUU occupancy, cache miss rate, etc.

To get the most representative sample of a program, at least one full period must be simulated.

Page 32

Results

Page 33

Basic Block Similarity Matrix

A phase of program behavior can be defined as all similar sections of execution, regardless of temporal adjacency.

Similarity Matrix

- Upper-triangular N x N matrix, where N is the number of intervals in the program execution.
- An entry at (x, y) in the matrix represents the Manhattan distance between the BBV at x and the BBV at y.
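A minimal sketch of the matrix construction, assuming the BBVs are already normalized lists of floats. Entries below the diagonal are left as `None` because the distance is symmetric; that storage choice and the function names are assumptions.

```python
def manhattan(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity_matrix(bbvs):
    """Upper-triangular N x N matrix of pairwise Manhattan distances."""
    n = len(bbvs)
    matrix = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):
            matrix[x][y] = manhattan(bbvs[x], bbvs[y])
    return matrix
```

Small entries far from the diagonal are the interesting ones: they mark intervals that behave alike even though they are far apart in time.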

Page 34

Basic Block Similarity Matrix

IMAGE and explain the image.

Page 35

Finding Basic Block Similarity

Many intervals of execution are similar to each other.

It makes sense to group them together.

Analogous to clustering.

Page 36

Clustering

The goal is to divide a set of points into groups such that points within each group are similar to one another by some metric.

This problem arises in other fields such as computer vision, genomics, etc.

Two types of clustering algorithms exist:

Partitioning

- Choose an initial solution, then iteratively update it to find a better solution
- Linear time complexity

Hierarchical

- Divisive or agglomerative
- Quadratic time complexity

Page 37

Phase Finding Algorithm

- Generate BBVs with a duration of 1.
- Reduce the dimension of the BBVs to 15.
- Apply the clustering algorithm to the BBVs.
- Score the clustering and choose the most suitable.

Page 38

Random Projection

Curse of dimensionality

BBV dimensions

- Number of executed basic blocks.
- Could grow to millions.

Dimension selection

Dimension reduction

- Random linear projection.
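A minimal sketch of random linear projection: each BBV is multiplied by a fixed random matrix to map it down to 15 dimensions. Drawing the matrix entries uniformly from [-1, 1] is an assumption made here for illustration, as are the function names.

```python
import random


def random_projection_matrix(input_dim, output_dim=15, seed=0):
    """input_dim x output_dim matrix with entries drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(output_dim)]
        for _ in range(input_dim)
    ]


def project(bbv, matrix):
    """Multiply a BBV (row vector) by the projection matrix."""
    output_dim = len(matrix[0])
    return [
        sum(bbv[i] * matrix[i][j] for i in range(len(bbv)))
        for j in range(output_dim)
    ]
```

The same matrix must be reused for every BBV of a program so that distances between the projected vectors remain comparable.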

Page 39

Clustering Algorithm

K-means algorithm

- Iterative optimizing algorithm.
- Two repetitive phases that converge.
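A minimal k-means sketch showing the two repeated phases: assigning each point to its nearest centroid, then moving each centroid to the mean of its members. Picking the initial centroids at random from the data and the iteration cap are assumptions; the slides do not specify them.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Phase 1: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Phase 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In the phase-finding setting the points would be the projected BBVs, and intervals sharing a label would be treated as one phase.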

WORK IN PROGRESS