High Performance on the J90 Systems
David Turner & Tom DeBoni
NERSC User Services Group
April 1999
13 April, 1999 High Performance on the J90 Systems 2
Philosophical Ramblings
Design for optimization?
Where to start?
When to stop?
J90 Potential
STREAM benchmark results
Sustainable memory bandwidth
(http://www.cs.virginia.edu/stream)
John McCalpin, SGI
Kernel  Operation          bytes/iter  FLOPS/iter
COPY    a(i)=b(i)          16          0
TRIAD   a(i)=b(i)+q*c(i)   24          2
STREAM Results
Machine        ncpus  COPY (MB/s)  TRIAD (MB/s)  MFLOPS
Cray_C90       16     105497.0     103812.0      8651.0
Cray_C90        8      55071.9      63229.6      5269.1
Cray_C90        1       6965.4       9500.7       791.7
Cray_J932      16      16298.2      14995.9      1249.7
Cray_J932       8       9995.2       8941.3       745.1
Cray_J932       1       1433.6       1270.0       105.8
Cray_T3E-900   16       7497.0       8828.0       735.7
Cray_T3E-900    8       3747.0       4471.0       372.6
Cray_T3E-900    1        484.0        568.0        47.3
SGI_Origin_2K  16       5560.0       5240.0       436.7
SGI_Origin_2K   8       2570.0       2740.0       228.3
SGI_Origin_2K   1        332.0        358.0        29.8
Sun_UE_10000   16       2371.0       2905.0       242.1
Sun_UE_10000    8       1271.0       1546.0       128.8
Sun_UE_10000    1        164.0        202.0        16.8
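The MFLOPS column appears to follow directly from the TRIAD bandwidth and the per-iteration counts given for the kernel: TRIAD moves 24 bytes and performs 2 floating-point operations per iteration, so MFLOPS is roughly TRIAD bandwidth times 2/24. The sketch below checks that arithmetic against a few rows of the table. The function name `triad_mflops` is illustrative, and the MB/s unit is an assumption based on how STREAM normally reports bandwidth.

```python
# Per-iteration costs of the TRIAD kernel a(i) = b(i) + q*c(i), from the slide.
TRIAD_BYTES_PER_ITER = 24
TRIAD_FLOPS_PER_ITER = 2

def triad_mflops(triad_mb_per_s):
    """Convert a TRIAD bandwidth (assumed to be MB/s) into MFLOPS."""
    return triad_mb_per_s * TRIAD_FLOPS_PER_ITER / TRIAD_BYTES_PER_ITER

# (machine, ncpus, TRIAD bandwidth, MFLOPS as printed in the table)
rows = [
    ("Cray_C90", 16, 103812.0, 8651.0),
    ("Cray_J932", 16, 14995.9, 1249.7),
    ("Cray_J932", 1, 1270.0, 105.8),
    ("Cray_T3E-900", 1, 568.0, 47.3),
]

for machine, ncpus, triad, printed in rows:
    print(f"{machine:14s} {ncpus:3d} {triad_mflops(triad):9.1f} (table: {printed})")
```

Read this way, the MFLOPS figures are a restatement of memory bandwidth, not an independent measurement of arithmetic speed.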
STREAM Results (cont.)
Machine COPY TRIAD MFLOPS
Cray_C90 6965.4 9500.7 791.7
Cray_J932 1433.6 1270.0 105.8
Compaq_AlphaServer_DS20 1077.0 1323.0 110.2
IBM_RS6000-397 778.8 882.4 73.5
Cray_T3E-900 484.0 568.0 47.3
SGI_Origin_2K 332.0 358.0 29.8
Generic_440BX_400 304.0 315.4 26.3
Sun_Ultra2-2200 228.5 189.9 25.9
Sun_UE_10000 164.0 202.0 16.8
Apple_Mac_G3_266 137.1 137.1 11.4
Tools
F90 (with lots of options)
ja
./name
ja -cst -n name
hpm
prof
flowview
atexpert
Program “SLOW”
PROGRAM SLOW

  IMPLICIT NONE
  INTEGER, PARAMETER :: DIMSIZE=8000000
  REAL, DIMENSION(DIMSIZE) :: X, Y, Z
  INTEGER :: I, J

  X = RANF()
  Y = RANF()
  DO J = 1, 10
    DO I = 1, DIMSIZE
      Z(I)=LOG(SIN(X(I))**2+COS(Y(I))**4)
    END DO
    PRINT *, Z(DIMSIZE-1)
  ENDDO
  STOP
END PROGRAM SLOW
No Optimization
f90 -O0 -r6 -O,msgs,negmsgs -o slow slow.f90
x = RANF()
cf90-6204 f90:VECTOR SLOW,File = slow.f90, Line=8
A loop starting at line 8 was vectorized.
y = RANF()
cf90-6204 f90:VECTOR SLOW,File = slow.f90, Line=9
A loop starting at line 9 was vectorized.
Moderate Optimization
f90 -O1 -r6 -O,msgs,negmsgs -o slow slow.f90
do j = 1, 10
cf90-6286 f90:VECTOR SLOW,File = slow.f90,Line=10
A loop starting at line 10 was not vectorized because it contains input/output operations at line 14.
DO i = 1, DIMSIZE
cf90-6204 f90:VECTOR SLOW,File = slow.f90,Line=11
A loop starting at line 11 was vectorized.
z(i) = LOG(SIN(x(i))**2 + COS(y(i))**4)
cf90-6001 f90:SCALAR SLOW,File=slow.f90,Line=12
An exponentiation was replaced by optimization. This may cause numerical differences.
High Optimization
f90 -O3 -r6 -O,msgs,negmsgs -o slow slow.f90
cf90-6502 f90:TASKING SLOW,File=slow.f90,Line=10
A loop starting at line 10 was not tasked because it contains input/output operations at line 14.
cf90-6403 f90:TASKING SLOW,File=slow.f90,Line=11
A loop starting at line 11 was tasked.
Optimization Results
Opt  NCPUS  Elapsed (s)  User (s)  Sys (s)
0    -      768.7530     583.6793  7.1886
1    -       89.0162      82.1009  1.1936
2    -      104.7003      81.5687  1.0003
3    1      107.0177      81.6185  1.2994
3    2       44.6562      81.7050  1.4069
3    3       41.3401      81.5320  1.3099
3    4       24.8146      81.8099  1.2968
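One way to read these timings is as ratios: the gain from turning optimization on, and the gain from adding CPUs at -O3. The sketch below computes both from the elapsed times in the table. The helper name `speedup` and the choice of the -O3, NCPUS=1 run as the parallel baseline are assumptions made for illustration.

```python
# Elapsed times (seconds) copied from the table above.
O0_ELAPSED = 768.7530
O1_ELAPSED = 89.0162
O3_ELAPSED = {1: 107.0177, 2: 44.6562, 3: 41.3401, 4: 24.8146}  # keyed by NCPUS

def speedup(baseline_s, measured_s):
    """Ratio of baseline time to measured time (greater than 1 means faster)."""
    return baseline_s / measured_s

print(f"-O0 -> -O1 elapsed speedup: {speedup(O0_ELAPSED, O1_ELAPSED):.2f}x")
for ncpus, elapsed in sorted(O3_ELAPSED.items()):
    s = speedup(O3_ELAPSED[1], elapsed)
    print(f"-O3 NCPUS={ncpus}: speedup {s:.2f}x, per-CPU efficiency {s / ncpus:.2f}")
```

User CPU time stays near 81.6 seconds in every optimized run, so the extra CPUs shorten the wait without reducing the work. Elapsed time is wall-clock time, so ratios above NCPUS (as for 2 and 4 CPUs here) probably reflect a noisy single-CPU baseline on a shared machine and should not be taken as true super-linear scaling.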
2 CPU Speedup
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 5.4300 = 5.4300
2 * 38.1300 = 76.2600
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
--------------- -------------- -----------
1.88 * 43.5600 = 81.6900
3 CPU Speedup
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 9.2200 = 9.2200
2 * 13.5500 = 27.1000
3 * 15.0700 = 45.2100
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
--------------- -------------- -----------
2.15 * 37.8400 = 81.5300
4 CPU Speedup
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 2.0400 = 2.0400
2 * 1.7700 = 3.5400
3 * 5.3200 = 15.9600
4 * 15.0700 = 60.2800
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
--------------- -------------- ----------
3.38 * 24.2000 = 81.8200
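The summary line in each of the 2, 3, and 4 CPU reports can be reproduced from its breakdown: total CPU seconds is the sum of (concurrent CPUs × connect seconds) over the rows, and the average concurrency is total CPU seconds divided by total connect seconds. A small sketch of that calculation follows, with the row values copied from the three reports. The function name `summarize` is illustrative.

```python
def summarize(breakdown):
    """breakdown: list of (concurrent_cpus, connect_seconds) rows.

    Returns (average concurrent CPUs, total connect seconds, total CPU seconds).
    """
    total_connect = sum(connect for _, connect in breakdown)
    total_cpu = sum(cpus * connect for cpus, connect in breakdown)
    return total_cpu / total_connect, total_connect, total_cpu

# Rows from the 2-, 3-, and 4-CPU reports, keyed by NCPUS.
runs = {
    2: [(1, 5.43), (2, 38.13)],
    3: [(1, 9.22), (2, 13.55), (3, 15.07)],
    4: [(1, 2.04), (2, 1.77), (3, 5.32), (4, 15.07)],
}

for ncpus, breakdown in runs.items():
    avg, connect, cpu = summarize(breakdown)
    print(f"NCPUS={ncpus}: {avg:.2f} * {connect:.2f} = {cpu:.2f}")
```

The average concurrency is the more honest speedup figure here, because it shows how much of the connect time actually ran on more than one CPU.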
Useful F90 Options
-e (0 or i) - initializes storage or flags use of uninitialized vars
-e n - flags nonstandard Fortran usage
-e v - makes all variables static
-g - same as -G0
-G (0 or 1) - sets debugging level to statement or block
-m (0 - 4) - message verbosity (0 gives most output)
-N (72, 80, or 132) - source line length
-O - optimization levels
    0, 1, 2, 3, aggress, fastint, msgs, negmsgs, inline(0-3), scalar(0-3), task(0-3), vector(0-3)
-r (0-6, …) - listing levels (6 is EVERYthing)
-R (a, b, c) - runtime checking: args, array bounds, indexing
Using flowtrace/flowview
f90 -O1 -ef -o slow slow.f90
./slow
flowview -Luch > slow.flow
Routine Tot Time Percentage Accum%
------------ -------- ---------- -------
SUB2 5.66E+01 69.02 69.02
SUB1 2.43E+01 29.63 98.65
SLOW 1.11E+00 1.35 100.00
Using prof
f90 -O1 -l prof -o slow slow.f90
./slow
prof -x ./slow > slow.prof
profview slow.prof
profview Output
Optimization Strategies
• First, let the compiler do it
• Vectorize and scalar optimize, then parallelize
  • Vectorization can give you a factor of 10 speedup
  • Scalar optimization can improve performance by 10-50%
  • Parallelism will give you a linear speedup, max
  • Memory contention inhibits gains from parallelism
• Let the compiler advise you
  • Add directives where appropriate
  • Be sure you tell the truth
  • Check your answers
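The point that parallelism gives a linear speedup at most can be made concrete with Amdahl's law. The slides do not name it, but it is the usual model for how any serial remainder caps the gain from extra CPUs. The sketch below is a generic illustration with made-up parallel fractions, not a measurement from the J90.

```python
def amdahl_speedup(parallel_fraction, ncpus):
    """Ideal speedup when only `parallel_fraction` of the run time is parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / ncpus)

# Hypothetical parallel fractions; 1.00 is the fully parallel (linear) limit.
for p in (0.50, 0.90, 0.99, 1.00):
    line = ", ".join(f"{n} CPUs: {amdahl_speedup(p, n):5.2f}x" for n in (2, 4, 8, 16))
    print(f"parallel fraction {p:.2f} -> {line}")
```

Only the fully parallel case reaches the linear limit. The model ignores memory contention, which would lower these numbers further.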
Scalar Optimization
Subroutine or function inlining
Fast (32-bit) integers
-Oallfastint
-Ofastint
Use INTERFACE specifications if passing array sections
Vectorization
Inhibitors to Vectorization
Function or subroutine references
Inline
Push loop
Split loop
Backwards data dependencies
Reorder loop, use temporary vector
I/O statements
Character or bit manipulations
Branches into loop or backward out of loop
Nonvectorizable Code
DO I = 1, N
CALL CALC(X(I), Y(I), Z(I))
ENDDO
...
SUBROUTINE CALC(X, Y, Z)
Z = ALOG(SQRT((SIN(X) * COS(Y)) ** X))
RETURN
END
Inlining
DO I = 1, N
Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
ENDDO
Pushing
CALL CALC(X, Y, Z, N)
...
SUBROUTINE CALC(X, Y, Z, N)
DIMENSION X(N), Y(N), Z(N)
DO I = 1, N
Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
ENDDO
RETURN
END
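Inlining and pushing both change where the loop sits relative to the call, not what is computed. The Python sketch below mirrors the three forms (call inside the loop, body written in place, loop moved into the routine) and checks that they produce identical results. It says nothing about vector performance, which only the Fortran compiler can deliver. The input values are hypothetical and chosen so that sin(x)*cos(y) stays positive.

```python
import math

def calc(x, y):
    """Per-element routine: one call per loop iteration (the form that blocks vectorization)."""
    return math.log(math.sqrt((math.sin(x) * math.cos(y)) ** x))

def calc_pushed(xs, ys):
    """'Pushed' form: the loop lives inside the routine, which receives whole arrays."""
    return [math.log(math.sqrt((math.sin(x) * math.cos(y)) ** x)) for x, y in zip(xs, ys)]

# Hypothetical inputs in (0, 1) so that sin(x)*cos(y) > 0 and the log is defined.
n = 8
xs = [0.1 * (i + 1) for i in range(n)]
ys = [0.05 * (i + 1) for i in range(n)]

called = [calc(x, y) for x, y in zip(xs, ys)]            # call inside the loop
inlined = [math.log(math.sqrt((math.sin(x) * math.cos(y)) ** x))
           for x, y in zip(xs, ys)]                        # body written in place
pushed = calc_pushed(xs, ys)                               # loop inside the routine

print(called == inlined == pushed)
```

A check like this is a cheap way to follow the "check your answers" advice after restructuring a loop by hand.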
Splitting
DO I = 1, N
A(I) = ABS(CALC(C(I)))
B(I) = A(I) ** T * SQRT(C(I))
A(I) = SIN(ALOG(C(I)))
ENDDO
Splitting (cont.)
EXTERNAL CALC
DO I = 1, N
A(I) = ABS(CALC(C(I)))
ENDDO
DO I = 1, N
B(I) = A(I) ** T * SQRT(C(I))
A(I) = SIN(ALOG(C(I)))
ENDDO
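Splitting is safe here because the second loop reads only values of A that the first loop has already finished writing. The sketch below restates the fused and split loops in Python and confirms they agree. The body of `calc` is a hypothetical stand-in, since the slides do not show the external CALC function.

```python
import math

def calc(c):
    """Hypothetical stand-in for the external CALC function on the slide."""
    return c - 2.0

def fused(c, t):
    n = len(c)
    a, b = [0.0] * n, [0.0] * n
    for i in range(n):
        a[i] = abs(calc(c[i]))
        b[i] = a[i] ** t * math.sqrt(c[i])
        a[i] = math.sin(math.log(c[i]))
    return a, b

def split(c, t):
    n = len(c)
    a, b = [0.0] * n, [0.0] * n
    for i in range(n):            # loop 1: only the external call
        a[i] = abs(calc(c[i]))
    for i in range(n):            # loop 2: intrinsics only
        b[i] = a[i] ** t * math.sqrt(c[i])
        a[i] = math.sin(math.log(c[i]))
    return a, b

c = [0.5 + 0.25 * i for i in range(12)]   # positive, so SQRT and ALOG are defined
print(fused(c, 1.5) == split(c, 1.5))
```

In the Fortran version only the first loop is held back by the external call. The second loop contains nothing but intrinsics and is a candidate for vectorization.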
Scalar Recurrence
DIMENSION A(1000), C(1000)
DO J = 1, M
S = BB
DO I = 1, N
S = S * C(I)
A(I) = A(I) + S
ENDDO
ENDDO
<cf90-8135,Scalar,Line=7> Loop starting at line 7 was unrolled 16 times.
Scalar Recurrence (cont.)
DIMENSION A(1000), C(1000), S(1000)
DO I = 1, M
S(I) = BB
ENDDO
DO I = 1, N
DO J = 1, M
S(J) = S(J) * C(I)
A(I) = A(I) + S(J)
ENDDO
ENDDO
Loop starting at line 5 was unrolled 2 times.
A loop starting at line 5 was vectorized.
A loop starting at line 9 was vectorized.
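The rewrite works by promoting the scalar S to a vector with one element per J iteration and interchanging the loops, so the inner loop no longer carries a dependence on S from one iteration to the next. The sketch below restates both versions in Python and checks that they produce the same A. The test data is hypothetical and uses small whole numbers so the comparison is exact.

```python
def original(a, c, bb, m):
    """Outer J loop, inner I loop carrying the scalar recurrence on S."""
    a = list(a)
    for _ in range(m):
        s = bb
        for i in range(len(c)):
            s = s * c[i]          # each iteration needs the previous value of S
            a[i] = a[i] + s
    return a

def interchanged(a, c, bb, m):
    """S promoted to a vector of length M; the inner J loop has no recurrence on S."""
    a = list(a)
    s = [bb] * m
    for i in range(len(c)):
        for j in range(m):
            s[j] = s[j] * c[i]
            a[i] = a[i] + s[j]
    return a

# Hypothetical test data.
a0 = [1.0, 2.0, 3.0, 4.0]
c0 = [2.0, 1.0, 3.0, 2.0]
print(original(a0, c0, bb=1.0, m=5) == interchanged(a0, c0, bb=1.0, m=5))
```

The inner loop still accumulates into A(I), but that is a sum reduction, which vectorizing compilers generally handle, and not a recurrence of the kind that blocked the original loop.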
Compiler Vector Directives
CDIR$ directive
!DIR$ directive
VECTOR, NOVECTOR
Turn vectorization on or off until end of program unit.
IVDEP
Ignore vector dependencies in next loop.
Parallel Computing
Multitasking, microtasking, autotasking, parallel processing, multiprocessing, etc.
This is “fine-grained” parallelism
Parallelism mostly comes from loop slicing
One possible goal: parallelize outer loop(s), vectorize inner loop(s)
F90 is capable of autotasking, but it can always benefit from help
Parallelism
Parallelism, cont.
Data “Scoping”
DIMENSION A(N)
SUM = 0.0
DO I = 1, N
TEMP = DEEP_THOUGHT(A,I)
SUM = SUM + TEMP * A(I)
ENDDO
A, N Shared, read-only everywhere
I, TEMP Private, read-write everywhere
SUM Shared, read-write everywhere
Compiler Tasking Directives
DIMENSION A(N)
SUM = 0.0
!MIC$ DOALL SHARED(A,N),PRIVATE(I,TEMP)
DO I = 1, N
TEMP = DEEP_THOUGHT(A,I) * A(I)
!MIC$ GUARD
SUM = SUM + TEMP
!MIC$ ENDGUARD
ENDDO
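The GUARD/ENDGUARD pair marks a region that only one task may execute at a time, which is what protects the shared SUM. A rough analogue in Python threads, offered only to illustrate the scoping idea and not how autotasking is implemented: `total` is shared, `i` and `temp` are private to each worker, and a lock plays the role of the guard. The body of `deep_thought` is a hypothetical stand-in.

```python
import threading

def deep_thought(a, i):
    """Hypothetical stand-in for DEEP_THOUGHT; any pure function of (A, I) would do."""
    return float(i + 1)

def guarded_sum(a, num_threads=4):
    total = 0.0                      # shared, read-write (SUM on the slide)
    guard = threading.Lock()         # plays the role of GUARD / ENDGUARD

    def worker(start):
        nonlocal total
        for i in range(start, len(a), num_threads):   # i and temp are private
            temp = deep_thought(a, i) * a[i]
            with guard:              # only one thread updates the shared sum at a time
                total += temp

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

a = [1.0] * 1000
print(guarded_sum(a))
```

As on the slide, the expensive work (the call and the multiply) happens outside the guarded region, so tasks serialize only on the cheap update of the sum.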
Threshold Test
DIMENSION A(N)
SUM = 0.0
!MIC$ DOALL VECTOR
!MIC$ IF(N.GT.1000)
!MIC$ SHARED(A,N),PRIVATE(I,TEMP)
DO I = 1, N
TEMP = DEEP_THOUGHT(A,I)
!MIC$ GUARD
SUM = SUM + TEMP * A(I)
!MIC$ ENDGUARD
ENDDO
Helping F90 with Parallelism
DIMENSION A(N), SUM(NumTasks)
!MIC$ DOALL SHARED(A,N),PRIVATE(J,I,TEMP)
DO J = 1, NumTasks
SUM(J) = 0.0
!MIC$ CNCALL
DO I = 1, N
SUM(J) = SUM(J) + DEEP_THOUGHT(A,I,J) * A(I)
ENDDO
ENDDO
TSUM = 0.0
DO J = 1, NumTasks
TSUM = TSUM + SUM(J)
ENDDO
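The idea is to give each task its own accumulator, SUM(J), so no guard is needed inside the parallel loop, and then to combine the partial sums serially. A Python sketch of the same pattern follows. How DEEP_THOUGHT uses J to divide the work is not shown on the slide, so the split used here (task j takes every `num_tasks`-th index) is an assumption, as is the body of `deep_thought`.

```python
from concurrent.futures import ThreadPoolExecutor

def deep_thought(a, i):
    """Hypothetical stand-in for DEEP_THOUGHT."""
    return float(i + 1)

def partial_sum(a, task, num_tasks):
    """Sum this task's share of the iterations into a private accumulator."""
    acc = 0.0
    for i in range(task, len(a), num_tasks):
        acc += deep_thought(a, i) * a[i]
    return acc

def tasked_sum(a, num_tasks=4):
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        # One slot per task, like SUM(J) on the slide; no lock is needed.
        sums = list(pool.map(lambda j: partial_sum(a, j, num_tasks), range(num_tasks)))
    tsum = 0.0
    for s in sums:                   # serial combine, like the final DO J loop
        tsum += s
    return tsum

a = [1.0] * 1000
print(tasked_sum(a))
```

Compared with guarding a single shared sum, this trades a small array and a short serial loop for the removal of all synchronization inside the parallel region.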
Helping F90 with Directives
• Useful compiler directives for tasking
  • CASE, ENDCASE
  • CNCALL
  • DOALL
  • DOPARALLEL, ENDDO
  • GUARD, ENDGUARD
  • MAXCPUS
  • NUMCPUS
  • PERMUTATION
  • PARALLEL, ENDPARALLEL
• These all begin with !MIC$
• NOTE: There are also OpenMP directives...
Helping F90 with Directives, cont.
Directive Parameters
AUTOSCOPE
IF
MAXCPUS
PRIVATE
SAVELAST
SHARED
Directive Work Distribution
CHUNKSIZE
GUIDED
NCPUS_CHUNKS
NUMCHUNKS
SINGLE
VECTOR
These all augment !MIC$ directives
NOTE: There are also OpenMP directive parameters...
atexpert
f90 -eX -O3 -r6 -o slow slow.f90
setenv NCPUS 1
./slow
atexpert
atexpert Output
atexpert Output, cont.
atexpert Output, cont.