Memory Systems Performance Workshop 2004 © David Ryan Koes 2004 1
MSP 2004
Programmer Specified Pointer Independence
David Koes, Mihai Budiu, Girish Venkataramani, Seth Copen Goldstein
Outline
• Motivation
• #pragma independent
• Automated Annotation
• Evaluation
• Conclusion
Problem
Potentially aliasing pointers inhibit compiler optimization.
Fully determining pointer aliasing may be infeasible or expensive.
How to get the benefit without paying the cost?
Memory Dependencies
Memory dependencies inhibit optimization
• Introduce edges into dependence graph
• Limits parallelization
• Inhibits code motion
  – instruction scheduling
  – loop invariant code motion
  – partial redundancy elimination
  – register promotion
Breaking memory dependencies difficult
• compile-time analysis infeasible or expensive
• run-time analysis limited to local window
Examples

    while(len--)
    {
        *p++ = *q++;
    }

There is a real data dependence between the load and store within a single iteration.
Unroll loop to exploit parallelism

Itanium assembly from gcc:

    .L26:
        mov r24 = r33
        mov r17 = r32
        adds r22 = 8, r33
        adds r19 = 8, r32
        adds r20 = 12, r33
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r24], 4
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r17] = r14, 4
        ld4 r23 = [r24]
        ;;
        st4 [r17] = r23
        ld4 r18 = [r22]
        ;;
        st4 [r19] = r18
        ld4 r16 = [r20]
        ;;
        st4 [r21] = r16
        br.cloop .L26
        ;;

Without memory dependence:

    .L26:
        mov r18 = r33
        mov r23 = r32
        adds r25 = 8, r33
        adds r24 = 12, r33
        adds r22 = 8, r32
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r18], 4
        ld4 r19 = [r25]
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r23] = r14, 4
        ld4 r16 = [r18]
        ld4 r20 = [r24]
        ;;
        .mmb
        st4 [r23] = r16
        st4 [r22] = r19
        st4 [r21] = r20
        br.cloop .L26
        ;;
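The scheduling difference between the two assembly listings can be mimicked at the source level. The following is a minimal sketch, not the compiler's actual transformation: `copy_plain` and `copy_unrolled` are hypothetical names, and the unrolled variant issues all four loads before any store, which is only equivalent to the plain loop when the source and destination ranges do not overlap.

```c
#include <assert.h>

/* Reference loop from the slide: one load, then one store, per iteration. */
void copy_plain(int len, int *p, const int *q)
{
    assert(len >= 0);
    while (len--)
        *p++ = *q++;
}

/* Hypothetical hand-unrolled variant: four loads are issued before any
   store. This reordering is only legal if the stores through p can never
   change the values later read through q. */
void copy_unrolled(int len, int *p, const int *q)
{
    assert(len >= 0);
    while (len >= 4) {
        int t0 = q[0], t1 = q[1], t2 = q[2], t3 = q[3];  /* loads first */
        p[0] = t0; p[1] = t1; p[2] = t2; p[3] = t3;      /* stores after */
        p += 4; q += 4; len -= 4;
    }
    while (len--)          /* remainder iterations, original order */
        *p++ = *q++;
}
```

With disjoint buffers the two functions produce the same result. With overlapping buffers (for example, copying an array onto itself shifted by one element) they diverge, which is why the compiler must keep the conservative schedule unless it knows the pointers are independent.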
Examples
Original loop:

    for(i = 0; i < len; i++)
    {
        ...
        ... = *q;
        ...
        *p = ...
    }

After loop invariant code motion:

    t0 = *q;        /* if loop will be executed */
    for(i = 0; i < len; i++)
    {
        ...
        ... = t0;
        ...
        *p = ...
    }

After register promotion:

    t0 = *q;
    for(i = 0; i < len; i++)
    {
        ...
        ... = t0;
        ...
        t1 = ...
    }
    *p = t1;        /* if loop was executed */

Hardware can’t do this
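To make the two transformations concrete, here is a small sketch with a specific loop body (`*p += *q`) filled in for the elided `...`. The function names are hypothetical and the body is an assumption chosen for illustration; the point is that the hand-optimized version is only correct when `p` and `q` point to different objects.

```c
#include <assert.h>

/* Straightforward loop: reloads *q and stores to *p on every iteration. */
int accumulate_plain(int *p, const int *q, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        *p += *q;
    return *p;
}

/* Hand-optimized version. Only valid if p and q never point to the
   same object. */
int accumulate_promoted(int *p, const int *q, int len)
{
    assert(len >= 0);
    if (len > 0) {             /* guard: loop will be executed */
        int t0 = *q;           /* loop invariant code motion */
        int t1 = *p;           /* register promotion of *p */
        for (int i = 0; i < len; i++)
            t1 += t0;
        *p = t1;               /* single store after the loop */
    }
    return *p;
}
```

For independent arguments both versions return the same value. If the same address is passed for `p` and `q`, the plain loop doubles the value each iteration while the promoted version adds the original value each time, so the results differ.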
Pointer Analysis
Memory Disambiguation is important
• hardware can’t do everything
• so have compiler figure it out...

easy!

    int p[10];
    foo()
    {
        int q[10];
        ...
    }

harder.. need precise dataflow analysis

    foo()
    {
        int *p, *q;
        int a,b;
        if(...) {
            p = &a;
            q = &b;
        } else {
            p = &b;
            q = &a;
        }
        ...
    }

requires inter-procedural information

    foo(int *p, int *q)
    {
        ...
    }
Inter-procedural Pointer Analysis
• Just apply same techniques as used for intraprocedural
  • may not be possible
    – gcc -c foo.c
  • may not be feasible
    – n² analysis on source code of Microsoft Office?
• Use less precise analysis
  • still might not be possible (separate compilation, libraries)
  • still takes time (every time you compile, or at least link)
  • less precise ⇒ less optimization
Alternative: Have Programmer Do It
Programmer annotates source code
• informs compiler of pointer relationships
Previous Work
• ANSI C99 restrict keyword
  – difficult for compiler and programmer to reason about
  – non-local semantics
• MIPSpro #pragma ivdep
  – break loop carried dependence in inner loop
#pragma independent
Syntax

    #pragma independent ptr1 ptr2

Example

    int x[100];
    int y;

    void foo(int *a, int *b)
    {
    #pragma independent a b
        int arr[50];
        ...
    }

[Diagram: memory objects x, y, arr, malloc_site_1, malloc_site_2]
pointers guaranteed to always point to different objects
Examples
    void f(int len, int * p, int * q)
    {
    #pragma independent p q
        while (len--)
            *p++ = *q++;
    }

    void example(int *a, int *b, int *c)
    {
    #pragma independent a b
    #pragma independent a c
        (*b)++;
        *a = *b;
        *a = *a + *c;
    }

pragmas allow compiler to eliminate a store to *a
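The store elimination can be written out by hand. This is a sketch of what a compiler could produce under the stated independence, not output from gcc or CASH; `example_plain` and `example_optimized` are hypothetical names, and standard C has no `#pragma independent`, so the pragmas are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* As written on the slide. Without independence information the first
   store to *a must be kept, because *c might read the location it
   writes. */
void example_plain(int *a, int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    (*b)++;
    *a = *b;
    *a = *a + *c;
}

/* Hand-written equivalent assuming a is independent of c: the
   intermediate store "*a = *b" is dropped and its value is forwarded
   through a local. */
void example_optimized(int *a, int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    int tb = *b + 1;
    *b = tb;
    *a = tb + *c;          /* one store to *a instead of two */
}
```

When `a` and `c` refer to different objects both versions compute the same values. When `a` and `c` alias, the plain version reads the freshly stored value through `*c` and the optimized one reads the old value, so the optimization is only safe under the independence the pragmas assert.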
#pragma independent
Advantages
• more flexible and powerful than restrict
• relationships between pointers explicit
• easy to reason about
  – affects only listed pointers
• easy to implement in compiler
  – fewer than 100 lines of code
Possible Disadvantage
• could take programmer a lot of time to annotate existing source
Automated Annotation Toolflow
[Diagram: toolflow. *.c and *.h files → compiler → executable with runtime checks, plus candidate pointer pairs with static scores; execution on inputs → invalid pointer pairs and execution frequencies; script → pragma annotations ranked by score; programmer → source code with verified pragmas; pragma aware compiler → faster executable]
Compiler finds interesting pointer pairs
• pairs which inhibit optimization
• pairs whose aliasing is unknown
Inserts profiling code and checks
Instrumented executable run on input
• records pointers which conflict
• counts number of pointer uses
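A runtime check of this kind might look like the sketch below. The slides do not show the instrumentation itself, so everything here is an assumption: `pair_profile`, `check_pair`, and `add_into_instrumented` are hypothetical names, and the check compares raw addresses only, which is a simplification of "point to different objects".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pair profile record (not the authors' data structure). */
struct pair_profile {
    unsigned long uses;   /* how often the pair was reached at run time */
    bool conflict;        /* true once both pointers held the same address */
};

/* Hypothetical check inserted before accesses through p and q.
   Simplification: compares addresses only, not whole objects. */
static void check_pair(struct pair_profile *prof, const void *p, const void *q)
{
    assert(prof != NULL);
    prof->uses++;
    if (p == q)
        prof->conflict = true;
}

/* Instrumented version of a loop that reads *q and updates *p. */
void add_into_instrumented(struct pair_profile *prof, int *p, const int *q, int n)
{
    for (int i = 0; i < n; i++) {
        check_pair(prof, p, q);
        *p += *q;
    }
}
```

After a run, a pair whose `conflict` flag is set would be reported as invalid, and `uses` would serve as the execution frequency. A pair that never conflicted on the tested inputs is only a candidate, since other inputs could still make the pointers alias.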
Script combines static and dynamic info
• eliminates conflicting pairs
• assigns score to each pair
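The combining step can be sketched as a filter followed by a sort. The slides do not give the scoring formula, so the product of static score and execution count used below is purely an assumption for illustration, as are the `candidate` struct and the `rank_candidates` function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one candidate pointer pair. */
struct candidate {
    const char *ptr1, *ptr2;      /* names of the two pointers */
    unsigned long static_score;   /* from the compiler */
    unsigned long exec_count;     /* from the instrumented run */
    bool conflict;                /* pair was seen to alias at run time */
    unsigned long score;          /* combined score, filled in below */
};

/* qsort comparator: highest combined score first. */
static int by_score_desc(const void *a, const void *b)
{
    const struct candidate *x = a, *y = b;
    if (x->score < y->score) return 1;
    if (x->score > y->score) return -1;
    return 0;
}

/* Drops conflicting pairs, scores the rest, and sorts them in place.
   Returns the number of surviving pairs at the front of the array. */
size_t rank_candidates(struct candidate *c, size_t n)
{
    assert(c != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].conflict)
            continue;                                      /* eliminate */
        c[i].score = c[i].static_score * c[i].exec_count;  /* ASSUMED formula */
        c[kept++] = c[i];
    }
    if (kept > 1)
        qsort(c, kept, sizeof c[0], by_score_desc);
    return kept;
}
```

The sorted prefix is what would be emitted as ranked pragma annotations, so the programmer can start verifying from the top of the list.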
Programmer verifies pointer pairs
• can verify high scoring pairs only
Example Output
    void summer(int *p, int *q, int n, int *result)
    {
    #pragma independent p q /* score: 1100 */
    #pragma independent p result /* score: 15 */
    #pragma independent q result /* score: 12 */
        int i, sum = 0;
        for(i = 0; i < n; i++)
        {
            *p += *q;
            sum += *q;
        }
        *result = sum;
    }
Sample Score Distribution
[Figure: histogram of number of pairs (0 to 400) against percentile of maximum score (0% to 90%), with one series for dynamic score and one for static score]
Targets & Benchmarks
Targets
• Itanium
  • EPIC/VLIW architecture
  • instruction scheduling important for good performance
• ASH (Application Specific Hardware)
  • can take full advantage of parallelism
Benchmarks
• Mediabench
  • small, multimedia applications
  • can’t time accurately on Itanium
• Spec95, Spec2000
  • general purpose integer
  • longer running
    – sometimes days for ASH simulation
Compilers
• gcc
  • not very sophisticated optimizations
  • -funroll-loops -O2
• CASH
  • more sophisticated optimizations
  • memory dependencies are first class objects
    – token edge
    – pragma independent removes edge
Questions
Do we find a reasonable number of potential annotations?
• Yes!
Do the annotations result in faster code?
• Yes!
Does our scoring mechanism find the pointer pairs with the biggest impact on performance?
• Yes!
How much time does the programmer have to spend verifying pragmas?
• Not a lot!
Annotations Found
[Figure: bar chart of number of pointer pairs per benchmark, split into unchecked, conflict, no conflict, and useful. Benchmarks: 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 175.vpr, 181.mcf, adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, 176.gcc, 197.parser, 256.bzip2, 300.twolf, 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 183.equake, 188.ammp, 301.apsi]
Do the annotations result in faster code?
Of 19 Spec benchmarks, these were the only ones to demonstrate measurable speedup
[Figure: Itanium Speedup]
Do the annotations result in faster code?
[Figure: CASH Speedup]
Does our scoring mechanism work?
[Figure: mpeg2_e; x-axis: number of highest scoring pragmas, up to all (68)]
How much time does the programmer have to spend?
Conclusions
• We’ve performed a limit study of pointer analysis
  • gcc doesn’t fully exploit the results of pointer analysis
  • CASH and ASH can fully exploit parallelism
• Programmer specified annotations are effective
  • faster and more flexible than inter-procedural analysis
• Annotations can be automatically generated
  • automatic score successfully focuses programmer’s attention
  • manual verification does not take long
ANSI C99 restrict keyword
An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. The intended use of the restrict qualifier (like the register storage class) is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior).
ISO/IEC 9899, Second edition, 1999-12-01, 6.7.3-7
restrict Example
    void f(int len, int * restrict p, int * restrict q)
    {
        while (len--)
            *p++ = *q++;
    }
restrict tells the compiler that p and q refer to different objects, enabling optimizations