Software Group
© 2005 IBM Corporation
Compilation Technology
Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers
Priya Unnikrishnan, IBM Toronto, [email protected], October 2005
Overview
Parallelization in IBM XL compilers
Outlining
Automatic parallelization
Cost analysis
Controlled parallelization
Future work
Parallelization
IBM XL compilers support Fortran 77/90/95, C and C++
Implement both OpenMP and automatic parallelization
Both target SMP (shared memory parallel) machines
Non-threadsafe code generated by default
– Use the _r invocation (xlf_r, xlc_r … ) to generate threadsafe code
Parallelization options
-qsmp=noopt Parallelizes code with minimal optimization to allow for better debugging of OpenMP applications.
-qsmp=omp Parallelizes code containing OpenMP directives
-qsmp=auto Automatically parallelizes loops
-qsmp=noauto No auto-parallelization. Processes IBM and OpenMP parallel directives.
Outlining
Parallelization transformation
Outlining

Original source:

    int main() {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

Runtime call generated in main:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0)
        _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                  @_xlsmpEntry0, 0, 0, 0, 0, 0, 0);
      return main;
    }

Outlined routine containing the loop body:

    Subroutine void main@OL@1(unsigned @LB, unsigned @UB) {
      @CIV1 = 0;
      do {
        a[(long)@LB + @CIV1] = const;
        ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }
SMP parallel runtime
    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ...)

The runtime divides the iteration space into chunks, one per thread:

    main@OL@1(0,9)   main@OL@1(10,19)   main@OL@1(20,29)   main@OL@1(30,39)
The outlined function is parameterized – can be invoked for different ranges in the iteration space
Auto-parallelization
Integrated framework for OpenMP and auto-parallelization
Auto-parallelization is restricted to loops.
Auto-parallelization is done in the link step when possible.
This allows us to perform various interprocedural analyses and optimizations before automatic parallelization
Auto-parallelization transformation
Original source:

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

After auto-parallelization (the marked loop is then outlined as before):

    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }
We can auto-parallelize OpenMP applications, skipping user-parallel code – good thing!!

    int main() {
      for (int i = 0; i < n; i++) {   // serial loop
        a[i] = const;
        ......
      }
      #pragma omp parallel for        // user-parallel loop
      for (int j = 0; j < n; j++) {
        b[j] = a[i];
      }
    }

After auto-parallelization (then outlining), only the serial loop is marked:

    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[i];
      }
    }
Pre-parallelization phase
Loop Normalization (normalize countable loops)
Scalar privatization
Array privatization
Reduction variable analysis
Loop interchange (where it helps parallelization)
Cost Analysis
Automatic parallelization tests
– Dependence analysis: is it safe to parallelize?
– Cost analysis: is it worthwhile to parallelize?
Cost analysis: Estimates the total workload of the loop
LoopCost = ( IterationCount * ExecTimeOfLoopBody )
Cost known at compile time – trivial
Runtime cost analysis is more complex
Conditional Parallelization
Original source:

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

Generated code with runtime check:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) {
        if (loop_cost > threshold) {
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, 0);
        } else {
          main@OL@1(0, 0, (unsigned)n, 0);   // run serially
        }
      }
      return main;
    }

Outlined routine (as before):

    Subroutine void main@OL@1( ......
      @CIV1 = @CIV1 + 1;
    } while ((unsigned)@CIV1 < (@UB - @LB)); return; }
Runtime cost analysis challenges
Runtime checks should be
– Lightweight: they should not introduce large overhead in applications that are mostly serial
– Overflow-safe: an overflow leads to an incorrect decision – costly!!

    loopcost = (((c1*n1) + (c2*n2) + const)*n3)* ...

– Restricted to integer operations
– Accurate
Balancing all of the above factors is the challenge
Runtime dependence test
Original source:

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

Generated code with runtime dependence test:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) {
        if (<deptest> && loop_cost > threshold) {
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, 0);
        } else {
          main@OL@1(0, 0, (unsigned)n, 0);   // run serially
        }
      }
      return main;
    }

Outlined routine (as before):

    Subroutine void main@OL@1( ......
      @CIV1 = @CIV1 + 1;
    } while ((unsigned)@CIV1 < (@UB - @LB)); return; }
Work by Peng Zhao
[Bar chart: SPEC2000FP auto-par performance – % improvement at -O5 -qsmp for swim, wupwise, mgrid, applu, lucas, mesa, art, equake, ammp, apsi, facerec, fma3d, sixtrack]
1 Proc: -0.5%    2 Proc: 8%
Controlled parallelization
Cost analysis selects big loops
Controlled parallelization
– Selection alone is not enough
– Parallel performance depends on both the amount of work and the number of processors used
– Using a large number of processors for a small loop causes huge degradations !!
[Chart: galgel (SPECOMPM 2001) performance – execution time (sec) vs. 8, 16, 32, 48, 64 processors, measured on a 64-way Power5 processor]
Small is good !!!
Controlled parallelization
Introduce another runtime parameter IPT (minimum iterations per thread)
The IPT is passed to the SMP runtime
SMP runtime limits the number of threads working on the parallel loop based on IPT
IPT = function( loop_cost, mem access info .. )
Controlled Parallelization
Original source:

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

Generated code, with IPT passed as the last runtime parameter:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) {
        if (loop_cost > threshold) {
          IPT = func(loop_cost);
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, IPT);
        } else {
          main@OL@1(0, 0, (unsigned)n, 0);   // run serially
        }
      }
      return main;
    }

Outlined routine (as before):

    Subroutine void main@OL@1( ......
      @CIV1 = @CIV1 + 1;
    } while ((unsigned)@CIV1 < (@UB - @LB)); return; }
SMP parallel runtime
    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ..., IPT)
    {
        threadsUsed = IterCount / IPT;
        if (threadsUsed > threadsAvailable)
            threadsUsed = threadsAvailable;
        .....
    }
Controlled parallelization for OpenMP
Improves performance and scalability
Allows fine-grained control at loop-level granularity
Can be applied to OpenMP loops as well
Adjusts the number of threads when the environment variable OMP_DYNAMIC is turned on
There are issues with threadprivate data
Encouraging results on galgel
[Chart: galgel (SPECOMPM 2001) performance – execution time (sec) vs. 8, 16, 32, 48, 64 processors, with and without controlled parallelization, measured on a 64-way Power5 processor]
Future work
Improve the cost analysis algorithm and fine-tune heuristics
Implement interprocedural cost analysis
Extend cost analysis and controlled parallelization to non-loop regions in user-parallel code, for scalability
Implement interprocedural dependence analysis