The Return of Synthetic Benchmarks

Ajay M. Joshi (UT Austin), Lieven Eeckhout (Ghent University), Lizy K. John (UT Austin)

Laboratory of Computer Architecture
Department of Electrical & Computer Engineering
The University of Texas at Austin

January 28, 2008


Outline

• The Need for Synthetic Benchmarks
• BenchMaker Framework for Benchmark Synthesis
• Workload Characteristics Used in Synthesis
• Synthetic Benchmark Construction
• Evaluation of BenchMaker
• Applications
• Summary


Benchmark Spectrum

• Toy Benchmarks, e.g. Heap sort
• Microbenchmarks, e.g. STREAM
• Kernel Codes, e.g. Livermore Loops
• Application Suites, e.g. SPEC CPU
• Complete Application Code
• Synthetic Benchmarks, e.g. Dhrystone, Whetstone

Toward the toy-benchmark end of the spectrum: Less Development Effort, More Scalable, More Maintainable, Less Representative

Toward the complete-application end of the spectrum: More Development Effort, Less Scalable, Less Maintainable, More Representative


Focus on Simulation Time Reduction

Figure labels: Benchmark Explosion, Benchmark Run Length, Microprocessor Complexity

• Benchmark Subsetting [Eeckhout et al., PACT’02] [Vandierendonck et al., CAECW’04] [Phansalkar et al., ISPASS’05] [Eeckhout et al., IISWC’05]

• Statistical Sampling [Conte et al., ICCD’96] [Wunderlich et al., ISCA’03]

• Representative Sampling [Sherwood et al., ASPLOS’02]

• Reduced Input Set [KleinOsowski, CAN’04]

• Statistical Simulation & Synthetic Workloads [Oskin et al., ISCA’00] [Eeckhout et al., ISPASS’00] [Nussbaum et al., PACT’01] [Bell et al., ICS’05]

• Analytical Modeling [Noonburg et al., MICRO’94] [Karkhanis et al., ISCA’04]

• Speedup Simulation [Schnarr et al., ASPLOS’98] [Loh et al., SIGMETRICS’01]


Motivation: Benchmarking Challenges

• Using Real-World Applications as Benchmarks
– Proprietary Nature of Real-World Applications

• Single-Point Performance Characterization
– Application Benchmarks are Rigid

• Applications Evolve Faster than Benchmarks
– Benchmark Suites are Costly to Develop, Maintain, and Upgrade

• Studying Commercial Workload Performance
– Early Design Stage Power/Performance Studies

Usefulness of Synthetic Benchmarks Beyond Simulation Time Reduction


Resurgence of Synthetic Benchmarks…

IEEE Computer, August 2003


Workload Synthesis: Central Idea

Figure (flow): Application Behavior Space → ‘Knobs’ for Changing Program Characteristics → Workload Synthesis Algorithm (Workload Synthesizer) → Synthetic Benchmark → Compile and Execute → Execution Driven Simulator, or Real Hardware or RTL

Knobs shown in the figure: Instruction Level Parallelism, Program Locality, Instruction Mix, Control Flow Behavior

Just 40 workload characteristics

Synthetic instruction stream shown in the figure:

    ADD R1, R2, R3
    LD R4, R1, R6
    MUL R3, R6, R7
    ADD R3, R2, R5
    DIV R10, R2, R1
    SUB R3, R5, R6
    STORE R3, R10, R20
    ADD R1, R2, R3
    LD R4, R1, R6
    MUL R3, R6, R7
    ADD R3, R2, R5
    DIV R10, R2, R1
    SUB R3, R5, R1
    BEQ R3, R6, LOOP
    SUB R3, R5, R6
    STORE R3, R10, R20
    DIV R10, R2, R1
    ………….


Modeling Real-World Applications

Figure (flow): Real World Proprietary Workload → Workload Profiler (Binary Instrumentation OR Simulation) → Workload Profile → Workload Synthesizer → Synthetic Benchmark Clone → Experiment Environment (Real Hardware, Execution Driven Simulator)

Workload Profile = Workload Attributes + Distribution of Attribute Values

• Microarchitecture-Independent Workload Profiling
• Modeling Workload Attributes into Synthetic Workload


Workload Characteristics as ‘Knobs’

Category | Num. | Characteristic
instruction mix | 10 | percentage of: integer short latency, integer long latency, floating-point short latency, floating-point long latency, integer load, integer store, floating-point load, floating-point store, branches
instruction-level parallelism | 8 | register-dependency-distance: 8 distributions for register dependencies, i.e. the percentage of dependencies that have a distance equal to 1 instruction, and of up to 2, 4, 6, 8, 16, 32, and greater than 32 instructions
data locality | 1 | data footprint
data locality | 10 | distribution of local stride values
instruction locality | 1 | instruction footprint
branch predictability | 10 | distribution of branch transition rate
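As a quick consistency check, the per-category counts in the table can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary keys are made-up labels, and reading the fused "110" entry for data locality as 1 (data footprint) plus 10 (local stride values) is an assumption, though it is the reading that makes the total match the 40 characteristics quoted earlier in the deck.

```python
# Number of 'knobs' per characteristic, as read from the table.
# Assumption: the data-locality entry "110" is 1 (footprint) + 10 (strides).
KNOB_COUNTS = {
    "instruction mix": 10,
    "register dependency distance": 8,
    "data footprint": 1,
    "local stride distribution": 10,
    "instruction footprint": 1,
    "branch transition rate distribution": 10,
}

total_knobs = sum(KNOB_COUNTS.values())
print(total_knobs)
```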


Capturing The Essence of Workloads

• Attributes to capture inherent workload behavior
– Data Locality: Dominant strides of static Load/Store
– Control Flow Predictability: Branch transition rate

• Modeling Locality & Control Flow Predictability
– Data Locality of Integer, Scientific, and Embedded Workloads effectively modeled using circular streams
– Replicating transition-rate of static branches


Modeling Data Access Pattern

• Identify streams of data references

• A Stream?
– Sequence of memory addresses in an arithmetic progression
– Elements of arrays A, B, and C form 3 streams

    for (ii = 0; ii < N; ii++)
        A[ii] = B[ii] + C[ii];

    Streams: 200, 204, 208 ..   320, 324, 328 ..   404, 408, 412 ...
    Issuing Sequence: 320, 404, 200, 324, 408, 204 ….

• Streams are interleaved and may contain noise
    4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28 …
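The interleaving of the three array streams can be reproduced with a small Python sketch. The helper names are hypothetical, and the mapping of base addresses to arrays (B at 320, C at 404, A at 200) is an assumption inferred from the issuing order, in which the two loads precede the store.

```python
def arithmetic_stream(base, stride, count):
    """Addresses base, base + stride, ... forming one stream."""
    return [base + i * stride for i in range(count)]

def interleave(*streams):
    """Issue one address from each stream per loop iteration."""
    return [addr for group in zip(*streams) for addr in group]

# Assumed mapping: loads of B and C are issued before the store to A.
stream_a = arithmetic_stream(200, 4, 3)
stream_b = arithmetic_stream(320, 4, 3)
stream_c = arithmetic_stream(404, 4, 3)

issuing_sequence = interleave(stream_b, stream_c, stream_a)
print(issuing_sequence)
```

Each stream on its own is an arithmetic progression with stride 4; only the merged sequence looks irregular.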


Extracting Streams

• Reference pattern of static Load / Store instructions
– PC-correlated spatial locality
  - Dependence on address referenced by nearby Ld / St
  - Programs with pointer chasing codes
– PC-correlated temporal locality
  - Dependence on previous address generated by same Ld / St
  - Programs with multidimensional arrays

• Could static Load / Store instructions be natural sources of streams?

• Profile every static Load / Store instruction
– Number of different strides with which it accesses data
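Profiling the strides of each static load or store can be sketched as below. This is a minimal illustration, not the profiler used by the authors: the trace format (a list of `(pc, address)` pairs) and the PC values are hypothetical, and the addresses reuse the array example from the data access pattern slide.

```python
from collections import Counter, defaultdict

def stride_profile(trace):
    """Map each static load/store (by PC) to a histogram of its strides."""
    last_address = {}
    strides = defaultdict(Counter)
    for pc, address in trace:
        if pc in last_address:
            # Stride = difference from the previous address of the SAME instruction.
            strides[pc][address - last_address[pc]] += 1
        last_address[pc] = address
    return strides

def dominant_stride(histogram):
    """Most frequent stride of one static instruction."""
    return histogram.most_common(1)[0][0]

# Hypothetical PCs: 0x10 loads B, 0x14 loads C, 0x18 stores A.
trace = [(0x10, 320), (0x14, 404), (0x18, 200),
         (0x10, 324), (0x14, 408), (0x18, 204),
         (0x10, 328), (0x14, 412), (0x18, 208)]

profile = stride_profile(trace)
distinct_strides = {pc: len(hist) for pc, hist in profile.items()}
```

Grouping by PC separates the interleaved streams: every instruction here shows a single stride of 4, even though the merged address sequence does not.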


Modeling Instruction Level Parallelism

Dependency Distance

    ADD R1, R3, R4
    MUL R5, R3, R2
    ADD R5, R3, R6
    LD  R4, (R1)
    SUB R8, R2, R1

Read After Write Dependency Distance = 3 (ADD writes R1; LD reads R1 three instructions later)

Measure Distribution of Dependency Distances: Up to 1, Up to 2, Up to 4, Up to 8, Up to 16, Up to 32, >32
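The read-after-write distance measurement on this slide can be sketched in Python as follows. The instruction encoding (a destination register plus a list of source registers) and the function names are assumptions made for illustration; the buckets are counted per bin here, so a cumulative "up to N" view would be a running sum over them.

```python
from bisect import bisect_left

# Upper bounds of the distance buckets on the slide; the last bucket is >32.
BUCKET_LIMITS = [1, 2, 4, 8, 16, 32]

def raw_distances(instructions):
    """Distance (in instructions) from each register read to its latest writer."""
    last_writer = {}
    distances = []
    for index, (dest, sources) in enumerate(instructions):
        for reg in sources:
            if reg in last_writer:
                distances.append(index - last_writer[reg])
        last_writer[dest] = index
    return distances

def bucket_counts(distances):
    """Histogram over the buckets: <=1, <=2, <=4, <=8, <=16, <=32, >32."""
    counts = [0] * (len(BUCKET_LIMITS) + 1)
    for d in distances:
        counts[bisect_left(BUCKET_LIMITS, d)] += 1
    return counts

# The five instructions from the slide, as (destination, sources).
example = [
    ("R1", ["R3", "R4"]),  # ADD R1, R3, R4
    ("R5", ["R3", "R2"]),  # MUL R5, R3, R2
    ("R5", ["R3", "R6"]),  # ADD R5, R3, R6
    ("R4", ["R1"]),        # LD  R4, (R1)
    ("R8", ["R2", "R1"]),  # SUB R8, R2, R1
]

distances = raw_distances(example)
histogram = bucket_counts(distances)
```

For the example, the LD reads R1 at distance 3 and the SUB reads it at distance 4; both land in the "up to 4" bucket.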


Modeling Control Flow Predictability

• Capture behavior of easy and difficult to predict branches

• Inherent program feature that captures branch behavior

• Transition Rate [Haungs et al., HPCA’00]
– # of Taken-Not Taken transitions / # of times executed

• Branches with low transition-rate (easier to predict): TTTTTTTTTN, NNNNNNNNNT

• Branches with high transition-rate (easier to predict): TNTNTNTNTN

• Branches with moderate transition-rate (tougher to predict)
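The transition rate of the example branch histories can be computed with a short Python sketch. It assumes that a transition is any change of direction between consecutive executions (taken to not-taken or the reverse), which is one reading of the definition on the slide; the function name is hypothetical.

```python
def transition_rate(outcomes):
    """Direction changes between consecutive executions / number of executions."""
    changes = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return changes / len(outcomes)

low = transition_rate("TTTTTTTTTN")   # one change in ten executions
high = transition_rate("TNTNTNTNTN")  # changes on nearly every execution
print(low, high)
```

Both extremes are easy for a predictor to learn; histories with rates in between give it less regularity to exploit.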


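Workload Synthesis (1)

Workload Profile: Instruction Mix, Register Dependency Distance, Stride Pattern of Load/Store, Branch Transition Rate, Branch Transition Probabilities

Figure: control-flow graph of basic blocks A, B, C, and D, each ending in a branch (BR), with edge probabilities 0.8, 0.2, 1.0, 1.0, 0.9, and 0.1.

Synthetic Clone Generation: 1 Big Loop containing the basic-block sequence A B D, A B D, A C D, A B D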
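One way to picture how a block sequence such as A B D, A B D, A C D, A B D could come out of the branch transition probabilities is a seeded random walk over the graph. This is only a sketch under stated assumptions: the slide text gives the probabilities but not which edges they label, so the mapping below (A to B with 0.8, A to C with 0.2, B and C to D with 1.0) is a guess, and D is simplified to always return to A because the 0.9/0.1 split cannot be recovered from the text. The names `EDGES` and `walk_blocks` are hypothetical.

```python
import random

# Assumed edge mapping; only the probability values come from the slide.
EDGES = {
    "A": [("B", 0.8), ("C", 0.2)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [("A", 1.0)],  # simplification: the big loop always returns to A
}

def walk_blocks(start, steps, seed=0):
    """Random walk over the block graph, returning the visited block names."""
    rng = random.Random(seed)  # seeded so the generated clone is reproducible
    block = start
    sequence = [block]
    for _ in range(steps - 1):
        successors, weights = zip(*EDGES[block])
        block = rng.choices(successors, weights=weights)[0]
        sequence.append(block)
    return sequence

sequence = walk_blocks("A", 12)
print(" ".join(sequence))
```

With this mapping every third block is A, every block after it is B or C, and D closes each pass, which is the shape of the unrolled loop on the slide.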


Memory Access Model (Strides)
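A stride-based memory access model with circular streams can be sketched as an address generator that walks a fixed number of elements and then wraps around. The wrap-around behaviour is an assumption about what "circular" means in this deck, and `circular_stream` with its parameters is a hypothetical name.

```python
def circular_stream(base, stride, length, count):
    """Generate `count` addresses: step by `stride` over `length` elements, then wrap."""
    return [base + (i % length) * stride for i in range(count)]

wrapping = circular_stream(base=200, stride=4, length=3, count=7)
repeated = circular_stream(base=200, stride=0, length=3, count=4)
print(wrapping, repeated)
```

In this sketch `length * stride` bounds the data touched by one stream, and a stride of 0 keeps hitting the same address.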

Branching Model – Based on Transition Rate
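A branch with a chosen transition rate can be imitated by flipping its direction every fixed number of executions. This is a sketch of the idea, not the BenchMaker mechanism, which the slide does not spell out; `branch_outcomes` and `count_transitions` are hypothetical helpers. The scheme reproduces a transition rate of roughly 1/period but not the taken/not-taken bias of the original branch.

```python
def branch_outcomes(period, executions):
    """T/N history of a branch that flips direction every `period` executions."""
    return "".join("T" if (i // period) % 2 == 0 else "N" for i in range(executions))

def count_transitions(outcomes):
    """Number of direction changes between consecutive executions."""
    return sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)

alternating = branch_outcomes(1, 10)  # flips on every execution
blocky = branch_outcomes(5, 10)       # flips once in ten executions
print(alternating, blocky)
```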

Register Assignment

C code with asm & volatile constructs


Evaluation of BenchMaker

• SPEC CPU2000, SPECjbb2005, and DBT2 workloads
• Validated Sim-Alpha Performance Model of Alpha 21264

Benchmark | Input | SimPoint(s)
SPEC CPU2000 Integer
bzip2 | graphic | 553
crafty | ref | 774
eon | rushmeier | 403
gcc | 166.i | 389
gzip | graphic | 389
mcf | ref | 553
perlbmk | perfect-ref | 5
twolf | ref | 1066
vortex | lendian1 | 271
vpr | route | 476
gcc | expr | 8, 24, 47, 51, 56, 73, 87, 99
SPEC CPU95 Integer
gcc | expr | 0, 3, 5, 6, 7, 8, 9, 10, 12


Performance Correlation

Figure: bar chart of Instructions-Per-Cycle (y-axis, 0 to 1.8) for the Original Benchmark and the Synthetic Benchmark across bzip2, crafty, gcc, gzip, mcf, perlbmk, twolf, vortex, vpr, dbt2, dbms, and SPECjbb2005.

Trade Accuracy for Flexibility – Average Error of 11%


Energy/Power Correlation

Figure: bar chart of Energy-Per-Instruction (y-axis, 0 to 35) for the Original Benchmark and the Synthetic Benchmark across bzip2, crafty, gcc, gzip, mcf, perlbmk, twolf, vortex, vpr, dbt2, dbms, and SPECjbb2005.

Average Error of 13%


Altering Individual Program Characteristics

Figure: Instructions-Per-Cycle (y-axis, 0 to 1.4) versus Percentage of References with Stride Value 0 (x-axis, 0 to 100).


Interaction of Program Characteristics

Figure: L1 D-cache Miss-Rate (y-axis, 0 to 0.35) versus Percentage of References with Stride Value 0 (x-axis, 0 to 100), with one series each for Data Footprint - 300K, Data Footprint - 600K, and Data Footprint - 900K.


Modeling Impact of Benchmark Drift

Figure: Instructions-Per-Cycle (y-axis, 0 to 1.2) versus Factor by which code size is increased (x-axis, 1 to 8).

Increase in Data Footprint from SPEC CPU95 to SPEC CPU2000 for gcc (Model with 7% accuracy)

Increase in Code Footprint (hypothetical)


Summary

• Synthetic Benchmarks to Address Benchmarking Challenges

• Constructing Synthetic Benchmarks from Hardware-Independent Characteristics

• Applications of Synthetic Benchmarks
– Altering Program Characteristics
– Studying Interaction of Program Characteristics
– Modeling Benchmark Drift


Questions?

Ajay’s email: [email protected]
