The Return of Synthetic Benchmarks
Ajay M. Joshi (UT Austin), Lieven Eeckhout (Ghent University), Lizy K. John (UT Austin)
Laboratory of Computer Architecture, Department of Electrical & Computer Engineering, The University of Texas at Austin
January 28, 2008
Outline
• The Need for Synthetic Benchmarks
• BenchMaker Framework for Benchmark Synthesis
• Workload Characteristics Used in Synthesis
• Synthetic Benchmark Construction
• Evaluation of BenchMaker
• Applications
• Summary
Benchmark Spectrum
• Toy Benchmarks, e.g. Heap sort
• Microbenchmarks, e.g. STREAM
• Kernel Codes, e.g. Livermore Loops
• Application Suites, e.g. SPEC CPU
• Complete Application Code
• Synthetic Benchmarks, e.g. Dhrystone, Whetstone
Toward the toy-benchmark end: Less Development Effort, More Scalable, More Maintainable, Less Representative
Toward the complete-application end: More Development Effort, Less Scalable, Less Maintainable, More Representative
Focus on Simulation Time Reduction
[Figure labels: Benchmark Explosion, Benchmark Run Length, Microprocessor Complexity]
• Benchmark Subsetting [Eeckhout et al., PACT’02] [Vandierendonck et al., CAECW’04] [Phansalkar et al., ISPASS’05] [Eeckhout et al., IISWC’05]
• Statistical Sampling [Conte et al., ICCD’96] [Wunderlich et al., ISCA’03]
• Representative Sampling [Sherwood et al., ASPLOS’02]
• Reduced Input Set [KleinOsowski, CAN’04]
• Statistical Simulation & Synthetic Workloads [Oskin et al., ISCA’00] [Eeckhout et al., ISPASS’00] [Nussbaum et al., PACT’01] [Bell et al., ICS’05]
• Analytical Modeling [Noonburg et al., MICRO’94] [Karkhanis et al., ISCA’04]
• Speedup Simulation [Schnarr et al., ASPLOS’98] [Loh et al., SIGMETRICS’01]
Motivation: Benchmarking Challenges
• Using Real-World Applications as Benchmarks
• Proprietary Nature of Real-World Applications
• Single-Point Performance Characterization
• Application Benchmarks are Rigid
• Applications Evolve Faster than Benchmarks
• Benchmark Suites are Costly to Develop, Maintain, and Upgrade
• Studying Commercial Workload Performance
• Early Design Stage Power/Performance Studies
Usefulness of Synthetic Benchmarks Beyond Simulation Time Reduction
Resurgence of Synthetic Benchmarks…
IEEE Computer, August 2003
Workload Synthesis: Central Idea
• Application Behavior Space: Instruction Mix, Instruction Level Parallelism, Program Locality, Control Flow Behavior
• ‘Knobs’ for Changing Program Characteristics (just 40 workload characteristics)
• Workload Synthesizer: the Workload Synthesis Algorithm produces a Synthetic Benchmark
• Compile and Execute on an Execution Driven Simulator, Real Hardware, or RTL
Example synthetic instruction sequence:
    ADD R1, R2, R3
    LD R4, R1, R6
    MUL R3, R6, R7
    ADD R3, R2, R5
    DIV R10, R2, R1
    SUB R3, R5, R6
    STORE R3, R10, R20
    ADD R1, R2, R3
    LD R4, R1, R6
    MUL R3, R6, R7
    ADD R3, R2, R5
    DIV R10, R2, R1
    SUB R3, R5, R1
    BEQ R3, R6, LOOP
    SUB R3, R5, R6
    STORE R3, R10, R20
    DIV R10, R2, R1
    ………….
Modeling Real-World Applications
• Microarchitecture-Independent Workload Profiling: the Real World Proprietary Workload is run through the Workload Profiler (Binary Instrumentation OR Simulation)
• Workload Profile = Workload Attributes + Distribution of Attribute Values
• Modeling Workload Attributes into Synthetic Workload: the Workload Synthesizer produces the Synthetic Benchmark Clone
• Experiment Environment: Real Hardware, Execution Driven Simulator
Workload Characteristics as ‘Knobs’
Category | Num. | Characteristic
instruction mix | 10 | percentage of integer short latency; percentage of integer long latency; percentage of floating-point short latency; percentage of floating-point long latency; percentage of integer load; percentage of integer store; percentage of floating-point load; percentage of floating-point store; percentage of branches
instruction-level parallelism | 8 | register-dependency-distance: 8 distributions for register dependencies. The percentage of dependencies with a distance equal to 1 instruction, and the percentage of dependencies that have a distance of up to 2, 4, 6, 8, 16, 32, and greater than 32 instructions.
data locality | 1 | data footprint
data locality | 10 | distribution of local stride values
instruction locality | 1 | instruction footprint
branch predictability | 10 | distribution of branch transition rate
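As a quick consistency check on the table above, the knob counts can be held in a small structure and summed. This is only a sketch, not BenchMaker's actual profile format: the names `KNOB_COUNTS` and `total_knobs` are hypothetical, and only the category names and counts come from the slide.

```python
# Hypothetical container for the knob counts listed in the table above.
# Only the category names and the counts are taken from the slide.
KNOB_COUNTS = {
    "instruction mix": 10,
    "instruction-level parallelism": 8,
    "data locality: data footprint": 1,
    "data locality: local stride distribution": 10,
    "instruction locality": 1,
    "branch predictability": 10,
}


def total_knobs(counts):
    """Return the total number of workload characteristics."""
    return sum(counts.values())


print(total_knobs(KNOB_COUNTS))  # 40
```

The total agrees with the "just 40 workload characteristics" figure quoted on the central-idea slide.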
Capturing The Essence of Workloads
• Attributes to capture inherent workload behavior
  – Data Locality: Dominant strides of static Load/Store
  – Control Flow Predictability: Branch transition rate
• Modeling Locality & Control Flow Predictability
  – Data Locality of Integer, Scientific, and Embedded Workloads effectively modeled using circular streams
  – Replicating transition-rate of static branches
Modeling Data Access Pattern
• Identify streams of data references
• A Stream?
  – Sequence of memory addresses in an arithmetic progression
  – Elements of arrays A, B, and C form 3 streams
        for (ii = 0; ii < N; ii++)
            A[ii] = B[ii] + C[ii];
  – Streams: 200, 204, 208 ..   320, 324, 328 ..   404, 408, 412 ...
  – Issuing Sequence: 320, 404, 200, 324, 408, 204 ….
• Streams are interleaved and may contain noise
  – 4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28 …
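The loop example can be replayed in a few lines to see how three arithmetic-progression streams interleave into one issue sequence. This is a minimal sketch: the base addresses 200, 320, and 404 and the stride of 4 are read off the slide, while the mapping of bases to arrays (A at 200, B at 320, C at 404) and the function name `issue_sequence` are assumptions.

```python
def issue_sequence(n, base_a=200, base_b=320, base_c=404, stride=4):
    """Addresses issued by: for ii in range(n): A[ii] = B[ii] + C[ii].

    The base-to-array mapping is an assumption chosen to match the
    issuing sequence shown on the slide.
    """
    addresses = []
    for ii in range(n):
        addresses.append(base_b + ii * stride)  # load B[ii]
        addresses.append(base_c + ii * stride)  # load C[ii]
        addresses.append(base_a + ii * stride)  # store A[ii]
    return addresses


print(issue_sequence(2))  # [320, 404, 200, 324, 408, 204]
```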
Extracting Streams
• Reference pattern of static Load / Store Instructions
  – PC-correlated spatial locality
    - Dependence on address referenced by nearby Ld / St
    - Programs with pointer chasing codes
  – PC-correlated temporal locality
    - Dependence on previous address generated by same Ld / St
    - Programs with multidimensional arrays
• Could static Load / Store instructions be natural sources of streams?
• Profile every static Load / Store instruction
  – Number of different strides with which it accesses data
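Profiling every static load/store for its strides might look like the sketch below, which records, per instruction address (PC), how often each stride between consecutive accesses occurs. The trace format, the PC labels, and the name `profile_strides` are hypothetical; this is not the profiler used by the authors.

```python
from collections import Counter, defaultdict


def profile_strides(trace):
    """Map each static load/store (pc) to a Counter of its observed strides.

    trace is an iterable of (pc, address) pairs in execution order.
    """
    last_address = {}
    strides = defaultdict(Counter)
    for pc, address in trace:
        if pc in last_address:
            strides[pc][address - last_address[pc]] += 1
        last_address[pc] = address
    return strides


# Hypothetical trace: three static memory instructions, one per array.
trace = []
for ii in range(4):
    trace += [("ld_B", 320 + 4 * ii), ("ld_C", 404 + 4 * ii), ("st_A", 200 + 4 * ii)]

profile = profile_strides(trace)
print({pc: dict(counts) for pc, counts in profile.items()})
```

In this trace each static instruction shows a single dominant stride of 4, which is the kind of per-instruction regularity the slide asks about.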
Modeling Instruction Level Parallelism
• Dependency Distance
        ADD R1, R3, R4
        MUL R5, R3, R2
        ADD R5, R3, R6
        LD R4, (R1)
        SUB R8, R2, R1
  – Read After Write Dependency Distance = 3 (ADD writes R1; LD reads it three instructions later)
• Measure Distribution of Dependency Distances
  – Up to 1, Up to 2, Up to 4, Up to 8, Up to 16, Up to 32, >32
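A read-after-write distance can be measured by remembering, for each register, the index of the instruction that last wrote it. The sketch below applies that to the five-instruction example and maps each distance to the buckets listed on the slide. It assumes the first operand of each instruction is the destination; the tuple encoding and the names `raw_distances` and `bucket` are hypothetical.

```python
BUCKETS = [1, 2, 4, 8, 16, 32]  # "up to" limits from the slide; larger is ">32"


def raw_distances(instructions):
    """Return RAW dependency distances for a list of (dest, sources) tuples."""
    last_writer = {}
    distances = []
    for index, (dest, sources) in enumerate(instructions):
        for reg in sources:
            if reg in last_writer:
                distances.append(index - last_writer[reg])
        if dest is not None:
            last_writer[dest] = index
    return distances


def bucket(distance):
    """Name of the smallest slide bucket that contains this distance."""
    for limit in BUCKETS:
        if distance <= limit:
            return f"up to {limit}"
    return ">32"


# The slide's example, assuming the first operand is the destination.
program = [
    ("R1", ["R3", "R4"]),  # ADD R1, R3, R4
    ("R5", ["R3", "R2"]),  # MUL R5, R3, R2
    ("R5", ["R3", "R6"]),  # ADD R5, R3, R6
    ("R4", ["R1"]),        # LD  R4, (R1)
    ("R8", ["R2", "R1"]),  # SUB R8, R2, R1
]

distances = raw_distances(program)
print(distances)                       # [3, 4]
print([bucket(d) for d in distances])  # ['up to 4', 'up to 4']
```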
Modeling Control Flow Predictability
• Capture behavior of easy and difficult to predict branches
• Inherent program feature that captures branch behavior
• Transition Rate [Haungs et al., HPCA’00]
  – # of Taken-Not Taken transitions / # of times executed
• Branches with low transition-rate (easier to predict): TTTTTTTTTN, NNNNNNNNNT
• Branches with high transition-rate (easier to predict): TNTNTNTNTN
• Branches with moderate transition-rate (tougher to predict)
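The transition-rate definition on this slide translates directly into code. The following sketch counts changes between consecutive outcomes of one static branch and divides by the number of executions; the string encoding of outcomes ("T" and "N") and the name `transition_rate` are assumptions made for illustration.

```python
def transition_rate(outcomes):
    """Taken/not-taken transitions divided by the number of executions."""
    if not outcomes:
        return 0.0
    transitions = sum(
        1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur
    )
    return transitions / len(outcomes)


for pattern in ("TTTTTTTTTN", "NNNNNNNNNT", "TNTNTNTNTN"):
    print(pattern, transition_rate(pattern))
```

The first two patterns have one transition in ten executions and the third has nine, so they sit at the low and high ends of the range, the two cases the slide labels as easier to predict.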
Workload Synthesis (1) to (4)
• Workload Profile: Instruction Mix, Register Dependency Distance, Stride Pattern of Load/Store, Branch Transition Rate, Branch Transition Probabilities
• [Figure: flow graph of basic blocks A, B, C, and D, each ending in a branch (BR), with edge probabilities 0.8, 0.2, 1.0, 1.0, 0.9, 0.1]
• Synthetic Clone Generation: 1 Big Loop with basic block sequence A B D, A B D, A C D, A B D
• Steps added across the four build slides:
  (1) Synthetic Clone Generation
  (2) Memory Access Model (Strides)
  (3) Branching Model – Based on Transition Rate
  (4) Register Assignment; C code with asm & volatile constructs
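One way to picture the clone-generation step is a random walk over the basic-block flow graph, driven by the branch transition probabilities. The sketch below is an illustration only, not the BenchMaker algorithm: the probabilities come from the slide, but the assignment of each value to a specific edge, the `EXIT` node, and the name `generate_block_sequence` are assumptions.

```python
import random

# Assumed edge assignment for the flow graph on the slide. The probability
# values are from the slide; which edge carries which value is a guess.
EDGES = {
    "A": [("B", 0.8), ("C", 0.2)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [("A", 0.9), ("EXIT", 0.1)],
}


def generate_block_sequence(length, seed=0, start="A"):
    """Random walk over EDGES; an EXIT restarts at `start` (one big loop)."""
    rng = random.Random(seed)
    sequence = []
    block = start
    while len(sequence) < length:
        sequence.append(block)
        targets, weights = zip(*EDGES[block])
        block = rng.choices(targets, weights=weights)[0]
        if block == "EXIT":
            block = start
    return sequence


print(" ".join(generate_block_sequence(12)))
```

Under these assumptions every generated sequence has the shape A, then B or C, then D, repeated, which matches the A B D / A C D pattern drawn on the slide.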
Evaluation of BenchMaker
• SPEC CPU2000, SPECjbb2005, and DBT2 workloads
• Validated Sim-Alpha Performance Model of Alpha 21264

Benchmark | Input | SimPoint(s)
SPEC CPU2000 Integer
bzip2 | graphic | 553
crafty | ref | 774
eon | rushmeier | 403
gcc | 166.i | 389
gzip | graphic | 389
mcf | ref | 553
perlbmk | perfect-ref | 5
twolf | ref | 1066
vortex | lendian1 | 271
vpr | route | 476
gcc | expr | 8, 24, 47, 51, 56, 73, 87, 99
SPEC CPU95 Integer
gcc | expr | 0, 3, 5, 6, 7, 8, 9, 10, 12
Performance Correlation
[Figure: Instructions-Per-Cycle, Original Benchmark vs. Synthetic Benchmark, for bzip2, crafty, gcc, gzip, mcf, perlbmk, twolf, vortex, vpr, dbt2, dbms, SPECjbb2005]
Trade Accuracy for Flexibility – Average Error of 11%
Energy/Power Correlation
[Figure: Energy-Per-Instruction, Original Benchmark vs. Synthetic Benchmark, for bzip2, crafty, gcc, gzip, mcf, perlbmk, twolf, vortex, vpr, dbt2, dbms, SPECjbb2005]
Average Error of 13%
Altering Individual Program Characteristics
[Figure: Instructions-Per-Cycle vs. Percentage of References with Stride Value 0 (0 to 100)]
Interaction of Program Characteristics
[Figure: L1 D-cache Miss-Rate vs. Percentage of References with Stride Value 0 (0 to 100), for Data Footprint 300K, 600K, and 900K]
Modeling Impact of Benchmark Drift
• Increase in Data Footprint from SPEC CPU95 to SPEC CPU2000 for gcc (Model with 7% accuracy)
• Increase in Code Footprint (hypothetical)
[Figure: Instructions-Per-Cycle vs. Factor by which code size is increased (1 to 8)]
Summary
• Synthetic Benchmarks to Address Benchmarking Challenges
• Constructing Synthetic Benchmarks from Hardware-Independent Characteristics
• Applications of Synthetic Benchmarks
  – Altering Program Characteristics
  – Studying Interaction of Program Characteristics
  – Modeling Benchmark Drift