workload characteristics and representative workloads

Workload Characteristicsand Representative

WorkloadsDavid Kaeli

Department of Electrical and Computer Engineering

Northeastern UniversityBoston, MA

[email protected]

Overview

• When we want to collect profiles to be used in the design of a next-generation computing system, we need to be very careful that we capture a representative sample in our profile

• Workload characteristics allow us to better understand the content of the samples which we collect

• We need to select programs to study which represent the class of applications we will eventually run on the target system

Examples of Workload Characteristics• Instruction mix

• Static vs. dynamic instruction count• Working sets• Control flow behavior• Working set behavior• Inter-reference gap model (temporal

locality)• Database size• Address and value predictability• Application/library/OS breakdown• Heap vs. stack allocation

Benchmarks• Real or synthetic programs/applications used

to exercise a hardware system, and generally representing a range of behaviors found in a particular class of applications

• Benchmarks classes include:– Toy benchmarks: tower of hanoi, qsort, fibo– Synthetic benchmarks: dhrystone, whetstone– Embedded: EEMBC, UTDSP – CPU benchmarks: SPECint/fp– Internet benchmark: SPECjAppServ, SPECjvm– Commerical benchmark: TPC, SAP, SPECjAppServ– Supercomputing: Perfect Club, Splash, Livermore

Loops

SPEC Benchmarks Presentation

Is there a way to reduce the runtime of SPEC while maintaining representativeness?• MinneSPEC (U. of Minn) – Using gprof

statistics about the runtime of SPEC and various Simplescalar simulation results (I-mix, cache misses, etc), we can capture statistically similar, though significantly shorter, runs of the programs

• Provides three input sets will run in:– A few minutes– A few hours– A few days

• IEEE Computer Arch Letters paper

How many programs do we need?

• Does the set of application capture enough variation to be “representative” of the entire class of workload?

• Should we consider using multiple benchmark suites to factor out similarities in programming styles?

• Can we utilize workload characteristics to identify the particular programs of interest?

Example of capturing program behavior: “Quantifying Behavioral Differences Between C and C++ Programs” Calder, Grunwald and Zorn, 1995

• C++ is a programming language growing in popularity

• We need to design tomorrow’s computer architectures based on tomorrow’s software paradigms

• How do workload characteristics changes as we move to new programming paradigms?

Example of capturing program behavior: “Quantifying Behavioral Differences Between C and C++ Programs” Calder, Grunwald and Zorn, 1995

• First problem – Find a set of representative programs from both FO and OO domains– Difficult for OO in 1995

• Differences between programming models– OO relies heavily on messages and methods– Data locality will change due to the bundling together

of data structures in Objects– Size of functions will be reduced in OO Polymorphism

allows for indirect function invocation and runtime method selection

– OO programs will manage a larger number of dynamically allocated objects

Address and Value Profiling

• Lipasti observed that profiled instructions tend to repeat their behavior

• Many addresses are nearly constant• Many values do not change between

instruction execution• Can we use profiles to better

understand some of these behaviors, and the until this knowledge to optimize execution?

Address Profiling

• If an address remains unchanged, can we issue loads and store early (similar to prefetching)?

• Do we even have to issue the load or store if we have not modified memory?

• What are the implications if indirect addressing is used?

• Can we detect patterns (i.e., strides) in the address values?

• Can we do anything smart when we detect pointer chasing??

Data Value Profiling

• When we see that particular data values do not change, how can we take advantage of this?

• Lipasti noticed that a large percentage of store instructions overwrite memory with the value already stored there

• Can we avoid computing new results if we notice that our input operand have not changed?

• What can we do if we a particular operand only takes on a small set of values?

Parameter Value Profiling

• Profile the parameter values passed to functions

• If these parameters are predictable, we can exploit this fact during compilation

• We can study this on an individual function basis or a call site basis

• Compiler optimizations such as code specialization and function cloning can be used


• We have profiled a set of MS Windows NT 4.0 desktop applications– Word97– Foxpro 6.0– SQLserver 7.0– VC++ 6.0 – Excel97– Powerpoint97– Access97

• We measured the value predictability of parameter values for all non-pointer based parameters


• We look for the predictability of parameters using:– Invariance 1 – probability that the most frequent value is

passed– Invariance 4 – probability that one of the 4 most frequent

values is passed

• Parameter values are more predictable on a call site basis than on a function basis (e.g., for Word97, 8% of the functions pass highly predictable parms, where as when computed on individual call sites, over 16% of the call sites pass highly predictable parms)

• Highly predictable means that on over 90% of the calls the same value is observed

• We will discuss how to clone and specialize procedures when we discuss profile guided data transformations

How can we reduce the runtime of a single benchmark and still maintain accuracy?• Simpoint – attempt to collect a set of

trace samples that best represents the whole execution of the program– Identifies phase behavior in programs– Considers a metric that captures the

differences between two samples– Computes the difference between these two

intervals– Selects the interval that is closest to all

other intervals

Simpoint (Calder ASPLOS 2002)• Utilize basic block frequencies to build

basic block vectors (bbf0, bbf1….bbfn-1)• Each frequency is weighted by its length • Entire vector normalized by dividing by

total number of basic blocks executed• Take fixed-length samples (100M instructions)• Compare BBVs using:

– Euclidean Distance: • ED(a, b) = sqrt(sum(i->1,n) (ai-bi)2)

– Manhattan Distance:• MD(a, b) = sum(i->1,n)(|ai-bi|)

Simpoint (Calder ASPLOS 2002)

• Manhattan Distance is used to build a similarity matrix

• N x N matrix, where N is the number of sampling intervals in the program

• Element SM(x, y) is the Manhattan Distance between two 100M element BBV at sample offsets x and y

• Plot the Similarity Matrix as an upper triangle

Simpoint (Calder ASPLOS 2002)

• Basic Block Vectors can adequately capture the necessary representative characteristics of a program

• Distance metrics can help to identify the most representative samples

• Cluster analysis (k-means) can improve representativeness by selecting multiple samples

Simpoint (Calder ISPASS 2004)

• Newer work on Simpoint considers using register def-use behavior on an interval basis

• Also, tracking of loop behavior and procedure calls frequencies provides similar accuracy as using basis block vectors

Simpoint (Calder ASPLOS 2002)• Algorithm overview

– Profile program by dividing into fixed sized intervals (e.g., 1M, 10M, 100M insts)

– Collect frequency vectors (e.g., BBVs, def-use, etc.) – compute normalized frequencies

– Run k-means clustering algorithm to divide the set of intervals into k partitions/sets, for values of k from 1 to K

– Compute a “goodness-of-fit” of the data for each value of k

– Select the clustering the reduces small k and provides a reasonable “goodness-of-fit” result

– The result is a selection of representative simulation points that best “fit” the entire application execution

Simpoint Paper

Improvements to Simpoints (KDD05, ISPASS06)

• Utilize a Mixture of Multinomials instead of K-means

• Assumes data is generated by a mixture of K-component density functions

• We utilize Expectation-Maximization (EM) to find a local maximum likelihood for the parameters of the density function – iterate on E and M steps until convergence

• The number of clusters is selected using the Bayesian Information Criteria (BIC) approach to judge “goodness of fit”

“A multinomial clustering model for fast simulation of computer architecture designs”, K. Sanghai et al., Proc. of KDD 2005, Chicago, IL., pp. 808-813.

Summary• Before you begin studying a new

architecture, have a clear understanding of the target workloads for this system

• Perform a significant amount of workload characterization before you begin profiling work

• Benchmarks are very useful tools, but must be used properly to obtain meaningful results

• Value profiling is a rich area for future research

• Simpoints can be used to reduce the runtime of simulation and still maintain simulation fidelity

workload characteristics and representative workloads

Documents