workload characterization: can it save computer architecture and performance evaluation? ·...
Post on 09-Mar-2020
7 Views
Preview:
TRANSCRIPT
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Workload Characterization: Can it save Computer Architecture and Performance
Evaluation?Lizy Kurian John
The Laboratory for Computer Architecture (LCA)ECE Department, UT Austin
ljohn@ece.utexas.edu(512)-232-1455
http://www.ece.texas.edu/projects/ece/lca/http://www.ece.utexas.edu/~ljohn
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
This presentation includes work by several of my students,
especially• Deepu Talla• Ramesh Radhakrishnan• Tao Li• Yue Luo• Rob Bell Jr• Aashish Phansalkar
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Basic Belief
• There are bottlenecks that exist in modern computer systems, which if precisely unveiled, will lead to appropriate architectures and architectural enhancements.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Media Workload Characterization Example –Study on effectiveness of MMX- using Discrete
Cosine Transform (DCT)
Pentium II without MMX Pentium II with MMX
Clocks Eff. Comp. Clocks Eff. Comp.
Maximum compiler optimizations 3500 0.15 2375 0.24 (6%)
Ideal case ≈512 1 ≈128 4 (100%)
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Media Workload Characterization Example –Study on effectiveness of MMX- using Discrete
Cosine Transform (DCT)
Pentium II without MMX Pentium II with MMX
Clocks IPC Eff. Comp. Clocks IPC Eff. Comp.
Maximum compileroptimizations 3500 1.47 0.15 2375 1.04 0.24 (6%)
Perfect memorySystem (prefetching) 2737 1.88 0.20 1578 1.56 0.36 (9%)
Ideal case ≈512 - 1 ≈128 - 4 (100%)
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Only the instructions shown in red are MMX computations. All other instructions are simply supporting these computations.
Pentium III – SIMD code for Discrete Cosine Transform (DCT)
lea ebx, DWORD PTR [ebp+128] load/address overheadmov DWORD PTR [esp+28], ebx load/address overhead$B1$2:xor eax, eax address overheadmove dx, ecx address overheadlea edi, DWORD PTR [ecx+16] load/address overheadmov DWORD PTR [esp+24], ecx load/address overhead$B1$3:movq mm1, MMWORD PTR [ebp] load overheadpxor mm0, mm0 initialization overheadpmaddwd mm1, MMWORD PTR [eax+esi] True Computationmovq mm2, MMWORD PTR [ebp+8] load overheadpmaddwd mm2, MMWORD PTR [eax+esi+8] True Computationadd eax, 16address overheadpaddw mm1, mm0 True Computationpaddw mm2, mm1 True Computationmovq mm0, mm2 load related overheadpsrlq mm2, 32 SIMD reduction overheadpovd ecx, mm0 SIMD load overheadmovd ebx, mm2 SIMD load overheadadd ecx, ebx SIMD conversion Overheadmov WORD PTR [edx], cx store overheadadd edx, 2 address overheadcmp edi, edx branch related overheadjg $B1$3 loop branch overhead$B1$4:move cx, DWORD PTR [esp+24] load/address overheadadd ebp, 16 address overheadadd ecx, 16 address overheadmove ax, DWORD PTR [esp+28] load/address overheadcmp eax, ebp branch related overheadjg $B1$2 loop branch overhead
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Breakdown of dynamic instructions
0%
20%
40%
60%
80%
100%
cfa-PIII cfa-SS dct-PIII dct-SS scale-PIII scale-SS motest-PIII
motest-SS
aud-PIII aud-SS g711-PIII g711-SS
memory branch integer SIMD-overhead SIMD-computation
• Approximately 75%-85% of dynamic instructions are supporting
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
The MediaBreeze architecture focuses on the parallelism in the supporting instructions rather than the actual media computations.
L1 D-cacheSIMD
computationunit
Addressgeneration
units
Addressgeneration
units
Load/Storeunits
Breeze InstructionBuffer
Hardwarelooping
Hardwarelooping
Instructionstream
InstructionDecoder
Non-SIMD pipeline
BreezeInstructionInterpreter
BreezeInstructionInterpreter
Starting address ofBreeze instruction
Normalsuperscalarexecution
L2 cache
Main memory
SIMD pipeline
IS-1
IS-2
IS-3
OS
IS - input stream
OS - output stream
Overhead
Useful computations
new hardware
existing hardware useddifferently
Data reorg. /Address trans.
Data Station
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Performance of MediaBreeze
• The MediaBreeze architecture gives up to 27X performance on DSP kernels and up to 2X on media applications over the best SIMD performance.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Area, power, and timing implications of MediaBreeze
• Area – 0.31 mm2 (overall chip area increase is 0.3%)
• Power – 430 mW at 1 GHz (less than 1% of the overall processor power
• Timing – Overall pipeline depth is not increased.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Simple Solutions
Elegant Solutions
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Power Aware Adaptive Architectures
Detects phases during program execution and adapts hardware characteristics to suit features of the phase
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
OS Energy DissipationOS Energy Dissipation
0%10%20%30%40%50%60%
pmake
SPECInt.gcc
SPECInt.vorte
xse
ndmail
fileman
Java
.dbJa
va.je
ssJa
va.ja
vac
Java
.jack
Java
.mtrt
Java
.compres
sDBMS.se
lect
DBMS.update
DBMS.join
% of OS Cycles% of OS Energy
92% 89%
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
OS Power & Performance TradeoffOS Power & Performance Tradeoff
0
0.2
0.4
0.6
0.8
pmake
gccvo
rtex
sendm
ailfile
man dbjes
sjav
ac jack
postgre
s.sele
ct
postgre
s.upd
ateosb
ootAVGN
orm
aliz
ed E
nerg
y.D
elay
Sampling based Adaptation (Window Size: 2048-cycle)Sampling based Adaptation (Window Size: 128-cycle)Routine based Adaptation
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Java Acceleration for General Purpose Processors
• What’s the biggest bottleneck to alleviate?• Object-oriented nature?• Translation• Hardware Translator for Java Bytecodes
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
The Java Hardware Interpreter
Java class file
Native executable
Fetchbytecode translator Decode Execute
bytecodes
Native machine instructions
• No changes to processor core– light-weight Java run time environment
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
HardInt Performance4-way performance
44.8
109.
3 149.
7
934.
1
911.
7
60.4
135.
9
85.2 12
7.7
492.
2
71.0
133.
7
221.
5
989.
4
867.
8
59.8
108.
8 146.
2
146.
1
321.
9
16.0
27.7
28.8
250.
2
120.
0
0
50
100
150
200
250
300
350
400
db javac jess mpeg mtrt
ex
ecu
tio
n c
ycle
s (
millio
ns)
JDK 1.1.6 Interpreter JDK 1.1.6 JIT JDK 1.2 Interpreter JDK 1.2 JIT Hard-Int
• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of 5 applications.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Simple Solutions
Elegant Solutions
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Architect’s Treasure Chest
• Perhaps our treasure chest contains simple solutions to most problems we will encounter
• We need to be able to identify which is the right solution for the right problem
• Diagnosing the problem is the issue• Workload characterization is the key to this
diagnosis
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Are computer architects becoming like doctors
prescribing medicines without diagnosing the disease?
Are we getting too excited with all the wonderful things a new medicine
can do?
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Performance Evaluation
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
If cars were benchmarked like computers
• Mileage chart might have looked like
28.3 mpgI-7527.5 mpgI-9526.6 mpgI-3524.2 mpgI-9025.3 mpgI-8028.1 mpgI-2027.2 mpgI-10
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
International Routes
28 mpgJapan’s Hwy 14230 mpgAutoBahn24 mpgMadrid’s M-3024 mpgI-9025 mpgI-8028 mpgI-2027 mpgI-10
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
And we would have asked questions like
• Did you drive on I-80 in the summer or winter?
• Was it night or day?• When you drove through Austin on I-35,
was a Longhorn football game going on?
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
And in our car conferences, we would have accepted papers that
• Benchmarking results on most number of highways
• Benchmarked from end to end• Benchmarked on highways in multiple
weather conditions
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Imagine running cars on
1790 milesI-75 (FL to MI)1920 milesI-95 (Maine-FL)1570 milesI-35 (TX to MN)3020 milesI-90 (WA to MA)2900 milesI-80 (CA to NJ)1540 milesI-20 (TX to SC)
2460 milesI-10 (CA to FL)
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Abstracting these long roads
• CITY• HIGHWAY
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
If we look at our computer performance evaluation trends,
aren’t we simply adding roads to the list?
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Reducing Redundancy in Benchmarking
• SimPoint [Sherwood et. al]• SMARTS [Wunderlich et. al]• Benchmark clustering using PCA Analysis
[Eeckhout et. al]
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
SimPoint
• Sherwood, et al. ASPLOS 2002• Sample selection:
– Clustering analysis of Basic Block Vectors to identify representative chunks of instructions
• Sampling unit size: 100 million instructions• Sample size: 3-10• Warm up: No explicit warm-up
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
SMARTS
• Wunderlich, et al. ISCA 2003• Sample selection:
– Selecting chunks evenly distributed in the instruction stream (systematic sampling)
• Sampling unit size: 1,000 instructions• Sample size:
– Depends on confidence interval requirement– Thousands to tens of thousands
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
SIMPOINT
• Analogous to realizing that no need to go all over I-10 from California to Florida, if 10 miles around Phoenix, and 10 miles from San Antonio and 10 miles from El Paso are 10 miles from the New Mexico desert are taken, that’s sufficient.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
SMARTS
• Randomly picking some miles from anywhere will do.
• No need for representative sampling
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Cluster analysis
• Linkage clustering
• K-means clustering• Iterative algorithm• Based on distance
between program-input pairs = linkage distance
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Rescaled PCA space: PC1 vs PC2 [Eeckhout]
compress
ijpeg
go
gcc
m88ksim
vortex
perlbmkxlisp
TPC-D
PC
2:
hig
h I
LP a
nd
low
bra
nch
pre
dic
tion a
ccura
cy
PC1: fewer branches and low I-cache miss rates
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Dendrogram to select representatives [Eeckhout]
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Cluster Analysis and PCA [Eeckhout]
• Analogous to realizing that I-10 and I-20 are very similar kinds of roads. Similarly I-80 and I-90 are very similar.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Plot for the scores of L1 cache access behavior of SPECint2000 and SPECjvm98 benchmark suites
-6
-4
-2
0
2
4
6
8
-20 -15 -10 -5 0 5 10 15
PC1
PC2
SPECint2000SPECjvm98jvm.compress
bzip2
gzip
jvm98
gcc
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
The Return of Synthetic Benchmarks?
A framework to generate synthetic benchmarks that are:
•Representative of applications or user specifications
•Automatically generated
•Generated and executed using user parameters
•Source code and executables
•Portable to multiple hardware & simulation systems
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Statistical Simulation
•Statistical Simulation using Synthetic Traces•Carl and Smith
•Nussbaum and Smith
•Oskin et al.: HLS
•Eeckhout et al.
•Executable code built from the workload characterization of well-correlated statistical simulation systems
•Automatic Benchmark Synthesis
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Synthetic Traces in Statistical Simulation
1. Collect global statistics• Basic block size
• Instruction Mix
• Instruction Dependencies
• Branch predictability
• L1/L2 cache statistics
2. Generate basic blocks
3. Connect them together into a graph (HLS) or generate a trace
4. Execute in order, simulating cache misses and branch mispredicts
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Improving Correlation
Track information on a per basic block basis•Basic block size
•Instruction sequences
•Merged dependency information
•Cache hit/miss information
•Branch predictability
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Summary by Benchmark Class
05
10152025303540
overa
ll
spec
int
spec
fp
tech
overa
ll_ds
p32sp
ecint_ds
p32sp
ecfp_
dsp32
tech_d
sp32
overa
ll_bp
1sp
ecint_bp
1sp
ecfp_
bp1
tech_b
p1
HLS errorBasic Blk error
Tracking characteristics on a basic block granularity reduces error
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Adding Structure: BB Maps
0
5
10
15
20
25
30
35
40
45
sdot_ssum1 sdot_sfill sscale_ssum2 ssum2_sfill scopy_sdot ssum2_sdot avg
HLS errorBasic Blk errorBB Map
•Multi-phase programs can benefit from simulation using a graph of basic block frequencies
•Programs are back-to-back two-loop combinations of the technical loops
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
IPC from Synthetic Trace
0
0.5
1
1.5
2
2.5
gcc perl m88ksim ijpeg vortex compress go li
SS IPCSynthetic IPC
Converted workload characteristics to C-code/ASM statements and run through SimpleScalar
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Use of statistical theory?
We architects – are we unwilling to use statistical theory?
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Normalized Standard Deviation (crafty)
0
0.2
0.4
0.6
0.8
1
0 10 20 30 40 50Normalized number of sampled instructions (j)
larger chunkoriginal chunk size
0.32
0.89
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
SimPoint
• Sherwood, et al. ASPLOS 2002• Sample selection:
– Clustering analysis of Basic Block Vectors to identify representative chunks of instructions
• Sampling unit size: 100 million instructions• Sample size: 3-10• Warm up: No explicit warm-up
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
SMARTS
• Wunderlich, et al. ISCA 2003• Sample selection:
– Selecting chunks evenly distributed in the instruction stream (systematic sampling)
• Sampling unit size: 1,000 instructions• Sample size:
– Depends on confidence interval requirement– Thousands to tens of thousands
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Normalized Standard Deviation (crafty)
0
0.2
0.4
0.6
0.8
1
0 10 20 30 40 50Normalized number of sampled instructions (j)
larger chunkoriginal chunk size
0.32
0.89
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Estimating speedupRequired sample size
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
art equake lucas bzip2-source
gcc-166 vpr-route gzip-random
vortex-1
Sam
ple
Size
8-way cpi16-way cpispeedup
Sample size required to achieve 2% relative error at 95% confidence level
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
(Comparing reduced data set:Test of population median
• If the populations are the same, the population median/mean should be the same
• Wilcoxon signed rank testMetrics Reduced data set p-value
Test 0.06445 Train 0.02734
CPI on 8-way machine
MinneSPEC 0.04883 Test 0.03711 Train 0.01953
CPI on 16-way machine
MinneSPEC 0.03711 Test 1Train 0.375
Speedup (16-way vs. 8-way)
MinneSPEC 0.6953
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Comparing reduced data set:Test result
• None of the reduced data sets have the same population median CPI with reference data set
• All reduced data sets show same median speedup as the reference data set
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Future Workloads
Life, Death and Games
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Future Workloads – Life, Death and Games
• Life Science - Pharmaceuticals, Drug Discovery
• Death Science – Military Applications, Weapon Simulation, Crash Analysis, Scientific Computing
• Games – Multiplayer, Natural Language Recognition/Semantic Analysis
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Risk of missing truly innovative architectures?
Inventions in Search of Applications eg: Laser
If workload characterization can lead to synthetic workloads of the future, endless future workloads can be synthesized that may lead to truly innovative architectures
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Workload Characterization: Can it save Computer Architecture and Performance
Evaluation?
Possibly Yes.
Nothing else possibly can.
Feb 15, 2004 Lizy Kurian John, LCA, UT Austin, CAECW-7
Task is on us, the workload characterization community
We need to abstract program behavior into essential attributes
which can help new architectures and performance evaluation
top related