Exploring the Potential of Performance Monitoring Hardware to
Support Run-time Optimization
Alex Shye
M.S. Thesis Defense
Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani
University of Colorado at Boulder
Department of Electrical and Computer Engineering
DRACO Architecture Research Group
Thesis Statement
• Hardware Performance Monitoring (HPM) can be utilized to provide a low-overhead alternative to current techniques for profiling run-time code behavior.
Introduction
• Profile information is critical to success of profile-based optimizations
– Point Profile - BB count, edge profile, etc.
– Path Profile - correlated points
• Off-line Path Profiling Methods:
– Use static/dynamic instrumentation to gather full path profile
• On-line Path Profiling Method:
– Interpretation and MRET
• Both incur high overhead!!
– Slowdown of 2-3x with Pin for BB counting
[Figure: example control flow graph with basic blocks A through G; edge counts shown are 80, 20, 70, and 30.]
Edge Profile: ABDFG 70-50
Path Profile: ABDFG 60 ACDFG 10 …
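The 70-50 range that an edge profile gives for path ABDFG can be reproduced with a short sketch. It assumes (the figure itself is not recoverable here) that the edge A->B is taken 80 times, the edge D->F is taken 70 times, and the graph is entered 100 times. The function name `path_count_bounds` is hypothetical.

```python
def path_count_bounds(edge_counts, total):
    """Bound a path's count from the counts of its branch edges.

    edge_counts: count of each branch edge the path takes.
    total: number of times the region entry executed.
    """
    # A path cannot execute more often than its least-taken edge.
    upper = min(edge_counts)
    # Each edge can leave the path at most (total - count) times.
    lower = max(0, total - sum(total - c for c in edge_counts))
    return lower, upper

# Assumed edge counts for path ABDFG: A->B = 80, D->F = 70.
lower, upper = path_count_bounds([80, 70], total=100)
print(lower, upper)
```

Under these assumptions the path profile count of 60 for ABDFG falls inside the bounds. The edge profile alone cannot say where in the range the true count lies.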
Performance Monitoring
• HPM through on-chip Performance Monitoring Units (PMUs)
– Itanium, Pentium 4, PowerPC
– Coarse-grained, fine-grained features
• Obstacles to PMU profiling
– Non-deterministic (sampling)
– Sample aliasing
– Less information
• Compiler analysis can extend PMU information!!!
Itanium-2 PMU Features
Feature | Description
Event Counters | Counts of coarse-grained events, e.g. CPU cycles, flushes, etc.
Branch Trace Buffer (BTB) | Records branch vector of last 4 branches executed. Filters: T/NT, predicted correct/mispredicted, etc.
Instruction Event Address Registers (IEAR) | Sample Icache/ITLB misses. Addresses and latency.
Data Event Address Registers (DEAR) | Sample Dcache, DTLB, ALAT misses. Addresses and latency.
Goal: Use sampled branch vectors on PMU to derive a path profile comparable to software path profiling techniques.
Contributions
I. Characterize the information provided by PMU sampling of branch vectors
II. Characterize the effect of compiler analysis on PMU information
III. Demonstrate the construction of a PMU-based path profiler
PMU Profiling Framework
[Figure: PMU profiling framework. Online: PMU, branch vectors, kernel buffer, interrupt on kernel buffer overflow, perfmon interface, branch vector hash table, intermediate file. Offline: address map, annotated binary, partial paths, compiler analysis (partial path extensions, dominator analysis, path profile generation), profile information.]
Terminology
Branch Vector: Series of addresses from BTB
Partial Path: Path of ops in compiler IR
PMU Configuration
• Itanium-2 PMU BTB masks
– Taken Mask (All, T, NT, None)
– Predicted Target Address Mask (All, Correct, Incorrect, None)
– Predicted Predicate Mask (All, Correct, Incorrect, None)
– Branch Type Mask (All, Indirect, Return, IP-relative)
• Configuration depends on goal
– Branch prediction performance? Building call graph?
• PMU configured to sample only taken branches for path information
– Not-taken branches can be inferred from the control flow graph
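To illustrate why sampling only taken branches can be enough, the sketch below expands a branch vector into a block sequence. It assumes a branch vector is a list of (branch address, target address) pairs and that basic blocks are laid out contiguously with known start addresses. The addresses, block names, and the helper `expand_branch_vector` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical layout: basic block start addresses in ascending order.
BLOCK_STARTS = [0x100, 0x120, 0x140, 0x160, 0x180]
BLOCK_NAMES = ["A", "B", "C", "D", "E"]

def block_index(addr):
    """Index of the block whose address range contains addr."""
    return bisect_right(BLOCK_STARTS, addr) - 1

def expand_branch_vector(vector):
    """Turn taken-branch (source, target) pairs into a block sequence.

    Between one branch's target and the next taken branch's source,
    execution fell through, so every block in between was executed and
    every branch in between was not taken.
    """
    path = [BLOCK_NAMES[block_index(vector[0][0])]]
    for (_, target), (next_source, _) in zip(vector, vector[1:]):
        first, last = block_index(target), block_index(next_source)
        path.extend(BLOCK_NAMES[first:last + 1])
    path.append(BLOCK_NAMES[block_index(vector[-1][1])])
    return path

# Branch in A jumps to C; C falls through to D; branch in D jumps to E.
print(expand_branch_vector([(0x110, 0x140), (0x170, 0x180)]))
```

In this example the branch ending block C never appears in the vector, yet the fall-through from C to D is recovered from the layout alone.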
Partial Path Extensions
• Compiler view of CFG can be used to extend paths
• Extend until point of uncertainty
– Up until Join Point
– Down until Branch Point
[Figure: CFG showing BTB branch vector 1-2-3-4, the partial path from the branch vector, and the extended partial path bounded by a join point and a branch point.]
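A minimal sketch of the extension rule, assuming the CFG is available as successor and predecessor lists. The block names and the helper `extend_partial_path` are hypothetical and not taken from the thesis tool.

```python
def extend_partial_path(path, succs, preds):
    """Extend a partial path while the neighboring block is unambiguous.

    succs / preds map each block to its list of CFG successors / predecessors.
    """
    path = list(path)
    # Extend upward until a join point (a block with several predecessors).
    while len(preds.get(path[0], [])) == 1 and preds[path[0]][0] not in path:
        path.insert(0, preds[path[0]][0])
    # Extend downward until a branch point (a block with several successors).
    while len(succs.get(path[-1], [])) == 1 and succs[path[-1]][0] not in path:
        path.append(succs[path[-1]][0])
    return path

# Hypothetical CFG: P1 and P2 join at J; J -> a -> b -> c -> Br; Br branches.
succs = {"P1": ["J"], "P2": ["J"], "J": ["a"], "a": ["b"],
         "b": ["c"], "c": ["Br"], "Br": ["x", "y"]}
preds = {}
for src, dsts in succs.items():
    for dst in dsts:
        preds.setdefault(dst, []).append(src)

print(extend_partial_path(["a", "b"], succs, preds))
```

The `not in path` checks are only a guard against looping forever on a cyclic CFG.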
Dominator Analysis
• Dominator Analysis
– Finds all blocks guaranteed to execute
• Partial Path Extensions
– Subset of dominator analysis
– Constrained to a path
[Figure: CFG showing BTB branch vector 1-2-3-4, the partial path from the branch vector, join and branch points, and the basic blocks added with dominator analysis.]
Terminology
Dominator: u dominates v if all paths from Entry to v include u
Post-dominator: u post-dominates v if all paths from v to Exit include u
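The two definitions can be computed with the textbook iterative set-intersection algorithm, sketched below. This is a generic formulation and not necessarily how the compiler used in the thesis implements it. The CFG and function names are hypothetical.

```python
def dominators(succs, entry):
    """Dominator sets: dom[v] = {v} plus the intersection of dom[p] over preds p."""
    nodes = set(succs) | {d for ds in succs.values() for d in ds}
    preds = {n: set() for n in nodes}
    for src, dsts in succs.items():
        for dst in dsts:
            preds[dst].add(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # unreachable from entry; left untouched in this sketch
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def post_dominators(succs, exit_node):
    """Post-dominators are dominators of the reversed CFG, rooted at Exit."""
    reverse = {}
    for src, dsts in succs.items():
        reverse.setdefault(src, [])
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return dominators(reverse, exit_node)

# Hypothetical diamond: Entry -> A -> {B, C} -> D -> Exit.
cfg = {"Entry": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["Exit"]}
print(sorted(dominators(cfg, "Entry")["D"]))
print(sorted(post_dominators(cfg, "Exit")["A"]))
```

In the diamond, a sample that lands in B implies that A executed before it (A dominates B) and that D executes after it (D post-dominates B), even though neither appears in the sample.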
Path Profile Generation
• Combine compiler analysis and PMU branch vectors to generate a path profile comparable to software path profiling techniques
• Issues:
– Path of a branch vector inherently different
  • Random start and end of path - path ambiguity
  • Spans boundaries compiler-based paths do not
– Number of paths increases exponentially
• Must map PMU paths to compiler paths
– Region Formation
– Split partial paths
– Path Matching
– Path Crediting
[Figure: CFG labeled with Hot Path, BTB Trace, Region 1, Region 2, and Region 3.]
Region Formation
• Use region-based paths
– Makes total # paths more manageable
• Functions can be large
• Create loop-based regions
– Programs spend most of time in loops
• Rules for Region R:
– R must be single entry
– R may not cross function boundaries
– R may not cross loop boundaries
[Figure: example CFG with basic blocks A through Y.]
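Because regions may not cross loop or function boundaries, a partial path that does cross one has to be split. The sketch below shows one way this could look. The block-to-region assignment is an assumption made for illustration (the figure's region boundaries are not recoverable here), and `split_at_region_boundaries` is a hypothetical helper.

```python
from itertools import groupby

# Assumed assignment: blocks E-K form a loop region, everything else is "outer".
REGION_OF = {b: "loop" for b in "EFGHIJK"}

def split_at_region_boundaries(partial_path):
    """Split a partial path wherever consecutive blocks lie in different regions."""
    keyed = groupby(partial_path, key=lambda b: REGION_OF.get(b, "outer"))
    return ["".join(blocks) for _, blocks in keyed]

print(split_at_region_boundaries("ABDEFHIK"))
```

This sketch only splits where the region changes. A path that takes a loop back edge and stays inside the same region would need an extra check on the edge itself.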
Path Matching and Crediting
• Path Matching
– Find list of all paths that contain partial path
• Path Crediting
– Distribute partial path weight equally among matched paths
• Ex. ABDLMOP, ABDEFHIK, OPRSUVX
Partial Path  Count  Matches        Inc    Total
ABDLMOP       100    ABDLMOPRSUVX   +25    25
                     ABDLMOPRSUWX   +25    25
                     ABDLMOPRTUVX   +25    25
                     ABDLMOPRTUWX   +25    25
ABD           160    ABDLMOPRSUVX   +10    35
                     …(14 more)     …
                     ABDLNOQRTUWX   +10    10
EFHIK         160    EFHIK          +160   160
OPRSUVX       280    ABDLMOPRSUVX   +70    105
                     ABDLNOPRSUVX   +70    80
                     ACDLMOPRSUVX   +70    70
                     ACDLNOPRSUVX   +70    70
[Figure: example CFG with basic blocks A through Y, divided into Region 1, Region 2, and Region 3.]
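The crediting arithmetic in the table can be reproduced with a short sketch. The set of region paths is an assumption inferred from the match lists: A{B|C}DL{M|N}O{P|Q}R{S|T}U{V|W}X for the outer region, and E{F|G}H{I|J}K for the loop. `credit_paths` is a hypothetical helper, and substring matching only works here because every block name is a single letter.

```python
from collections import defaultdict
from itertools import product

# Assumed region paths (inferred from the example, not from the thesis tool).
outer = ["A" + b + "DL" + m + "O" + p + "R" + s + "U" + v + "X"
         for b, m, p, s, v in product("BC", "MN", "PQ", "ST", "VW")]
loop = ["E" + f + "H" + i + "K" for f, i in product("FG", "IJ")]
REGION_PATHS = outer + loop

def credit_paths(partial_counts, region_paths):
    """Split each partial path's count equally among the region paths containing it."""
    totals = defaultdict(float)
    for partial, count in partial_counts.items():
        matches = [p for p in region_paths if partial in p]
        for p in matches:
            totals[p] += count / len(matches)
    return totals

totals = credit_paths(
    {"ABDLMOP": 100, "ABD": 160, "EFHIK": 160, "OPRSUVX": 280}, REGION_PATHS)
print(totals["ABDLMOPRSUVX"], totals["ABDLNOPRSUVX"], totals["ACDLMOPRSUVX"])
```

Equal distribution is the rule stated on the slide. A partial path with no match is simply dropped by this sketch.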
Methodology
• Experiments run on Itanium-2 with 2.6.10 kernel
• Developed tool using perfmon kernel interface and libpfm-3.1 to interface with PMU
• Benchmarks
– Set of SPEC2000 benchmarks
– Compiled with the OpenIMPACT Research Compiler
• Compared to full path profile gathered with a Pin path profiling tool
Effect of Sampling Period
[Figure: Percent Overhead vs. Sampling Period. x-axis: sampling period (CPU cycles), 50K to 10M; y-axis: percent overhead, 0 to 50.]
• Sampling overhead due to:
– Periodic interrupt, copying between buffers, hash table insertion
PMU vs Actual Instruction Distribution
• Kullback-Leibler Divergence (Relative Entropy)
– $d = \sum_{k \ge 0} p_k \log_2 (p_k / q_k)$
• Relative measure of distance between two distributions
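A minimal sketch of the divergence computation, assuming both distributions are given as equal-length lists of probabilities over the same instructions. The slide does not say which of the PMU and actual distributions plays the role of p and which of q, so the sketch leaves that open. `kl_divergence` is a hypothetical name.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits between distributions p and q.

    Terms with p_k == 0 contribute nothing; q_k must be nonzero wherever
    p_k is nonzero, otherwise the divergence is undefined.
    """
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))
```

The measure is zero for identical distributions and is not symmetric, so swapping p and q generally changes the result.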
Code Coverage
• Explore how PMU branch vectors translate to code coverage information
• Code Coverage Types
– Single BB: Simulates PC-sampling
– Branch Vectors
– Branch Vectors w/ Dom. Analysis
• Coverage percentage is the percentage of actually covered code that is discovered through compiler-aided analysis of branch vectors
Number of Instructions and Actual Code Covered
Benchmark    # Ops    # Covered Ops
164.gzip     6,466    3,063 (47%)
175.vpr      23,573   12,229 (52%)
177.mesa     89,006   7,390 (8%)
179.art      2,201    1,515 (69%)
181.mcf      1,973    1,401 (71%)
183.equake   3,033    2,265 (75%)
188.ammp     19,562   5,835 (30%)
197.parser   17,541   11,271 (64%)
256.bzip2    5,095    3,138 (62%)
300.twolf    40,490   15,705 (39%)
Code Coverage
Hot Instruction Thresholds
• For top 10-30% of instructions, code coverage does well (80-100%)
• Drops off at around 40-50% of hot instructions
[Figure: Coverage for Hot Instruction Thresholds. x-axis: percentage of instructions (sorted by execution count), 10 to 100; y-axis: percent coverage, 0 to 100. Series: 164.gzip, 175.vpr, 177.mesa, 181.mcf, 197.parser, 300.twolf.]
Stability
• Across 20 runs, PMU code coverage varies ~5-10%
Multiple Runs
• Regular sampling: gzip, parser, twolf improve greatly
• Randomized sampling may discover code that regular sampling cannot
Partial Path Characteristics
• Partial path extensions increase length ~20%
• However, splitting drastically decreases lengths
– ~30% on function boundaries, ~20% more on loop back edges
[Figure: Partial Path Lengths. x-axis: benchmark (gzip, vpr, mesa, art, mcf, equake, ammp, parser, bzip2, twolf); y-axis: length (number of IR ops), 0 to 80. Series: Initial Partial Paths, Extended Partial Paths, Split on Func. and Loop Boundaries.]
Accuracy Results
• Accuracy measured similar to Wall's weight matching scheme [Wall91]
– Threshold = .125%
[Figure: Accuracy vs. Sampling Period. x-axis: sampling period, 50K to 500M; y-axis: accuracy (%), 0 to 100. Series: 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, 300.twolf.]
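The slide does not spell out the accuracy formula, so the sketch below should be read as an assumption: one common weight-matching formulation, in which the hot paths of the actual profile are compared against the same number of top-ranked paths from the estimated profile. The function name, path names, and counts are hypothetical.

```python
def weight_matching_accuracy(actual, estimated, threshold=0.00125):
    """Fraction of actual hot-path flow captured by the estimated hot paths.

    actual / estimated map path names to counts. A path is hot when its
    actual count is at least `threshold` of the total actual flow
    (0.00125 corresponds to the .125% threshold).
    """
    total = sum(actual.values())
    hot_actual = {p for p, c in actual.items() if c >= threshold * total}
    # Pick the same number of paths, ranked by the estimated profile.
    ranked = sorted(estimated, key=estimated.get, reverse=True)
    hot_estimated = set(ranked[:len(hot_actual)])
    matched = sum(actual[p] for p in hot_actual & hot_estimated)
    return matched / sum(actual[p] for p in hot_actual)

# Hypothetical profiles; a large threshold keeps the example small.
actual = {"P1": 60, "P2": 30, "P3": 10}
estimated = {"P1": 50, "P3": 40, "P2": 10}
print(weight_matching_accuracy(actual, estimated, threshold=0.2))
```

Here the estimate ranks P3 above P2, so only P1's flow is matched and the accuracy is 60 out of the 90 units of hot flow.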
Conclusion
• Motivates and presents initial results and rationale for PMU-based profiling
• Characterizes branch vector sampling
– Improves code coverage by > 50% over PC-sampling
– Branch vector paths are inter-procedural
• Characterizes effect of compiler analysis
– Partial path extensions increase length by ~20%
– Dominator analysis on branch vectors improves code coverage by > 50%
• Demonstrates construction of a PMU-based path profiler
– ~85% accurate at 1% overhead (at sampling period of 5M)
Questions?