
ECMWF 1 COM HPCF 2004: Profiling and optimisation

Serial optimisation and profiling

Computer User Training Course 2004

Carsten Maaß

User Support

This unit is derived from presentations by John Hague and Daniel Boulet


Topics

• Tuning methodology

• Timing

• Profiling

• Serial optimisation


Why optimisation?

Resources are limited/shared. With optimised code you could:

• Make more cycles available for all users

• Run more experiments

• Run larger experiments

• Get results earlier


Performance measure

• M1: ‘Science’ / time-unit using computer

• M2: Computations / time-unit

• M1/M2 = ?


Tuning options

In order of ease and preference:

1. Use an existing package tuned for pSeries (IFS etc., etc., etc.)

2. Use ESSL (and/or MASS)

3. Use another tuned library such as NAG

4. Hand tune


Tuning techniques

• minimize I/O including paging

• compile overall with "-O4 -qarch=pwr4" and with "-O3 -qarch=pwr4" (and -qstrict)

– measure performance & go with the better one

• do a Hot-Spot Analysis (profiling)

• for key routines:

– consider replacing with ESSL equivalent

– check which of -O3 and -O4 is best again

– hand tune


Tuning methodology

Define a performance target!

[Flowchart boxes: un-optimised correct code; measure / profile; tune bottlenecks; check code; optimised code. Decision points: fast enough? (Yes / No); correct results? (Yes / No)]

Don't abandon an otherwise successful tuning approach just because the program starts to generate wrong answers: it may be possible to fix the answers while still getting faster results.


Some reminders . . .

• Neither hand tuning nor the compiler's optimizer is likely to triple or even double the performance of most programs. The only realistic exceptions are

– very careful tuning of certain matrix intensive applications (check out the SMP aware ESSL library)

– parallelization techniques

• Never underestimate the potential improvements to be gained by switching to a better algorithm

– after which, hand tuning and the compiler's optimizer can make things even faster


Profiling or "Hot Spot Analysis" (1/2)

Remember why bank robbers rob banks: because that's where the money is!

Apply the same logic to performance tuning:

1. identify which parts of the program are the most expensive

2. concentrate tuning efforts on the expensive parts

REMEMBER: different input data may result in very different activity patterns


Profiling or "Hot Spot Analysis" (2/2)

Rule of thumb: Trying to double or triple the performance of a program by optimizing parts that each use less than 20% of the time is like trying to get rich by robbing lots of small grocery stores.

i.e. you have to be incredibly lucky to succeed!
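The 20% rule of thumb can be made quantitative with Amdahl's law. The slides do not name it; the symbols p (fraction of run time spent in the tuned part) and s (speed-up of that part) are introduced here only for illustration.

```latex
S(p, s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S(0.2,\, s) = \frac{1}{1 - 0.2} = 1.25
```

So even an infinitely fast replacement for a part that uses 20% of the run time gives at most a 1.25-fold overall speed-up, well short of doubling.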


eoj: end-of-job information

Can be used at any time: eoj job-ID

    …
    Node actual      : 16
    Adapter Req.     : (csss,MPI,not_shared,US)
    Resources        : ConsumableCpus(2) ConsumableMemory(1.758 gb)
    #*#* Next 3 times NOT up-to-date (TOTAL CPU TIME given later IS accurate)
    Step User Time   : 5+19:10:52.450000
    Step System Time : 00:25:27.960000
    Step Total Time  : 5+19:36:20.410000 (502580.41 secs)
    #*#* Last 3 times NOT up-to-date (TOTAL CPU TIME given later IS accurate)
    Context switches : involuntary = 180609151, voluntary = 2017625
                       per second  =     36383                  406
    Page faults      : with I/O = 41887, without I/O = 12774072
                       per second =     8                 2573
                     <--------- CPU --------> <------------- MEM ------------>
    Node     ? #T #t   secs/CPU (Eff%) (Now%) max/TSK mb (Eff%) (Now% - mb  ) Task list
    -------- - -- -- ---------- ------ ------ ---------- ------ -------------- ---------
    hpcb0302 M  4  2    4076.71 ( 82%) ( 83%)     644.91 ( 35%) ( 36% - 7680) 0:1:2:3:
    … … … … … …
    hpcb1903 .  4  2    4105.18 ( 82%) ( 95%)     627.06 ( 34%) ( 35% - 7680) 60:61:62:63:
    -------- - -- -- ---------- ------ ------ ---------- ------ -------------- ---------
               Min =    4033.09                   602.13 = Min
               Max =    4118.67                   644.91 = Max
    -------- - -- -- ---------- ------ ------ ---------- ------ -------------- ---------
    Elapsed = 4964 secs        1800 mb = ConsumableMemory
    CPU Tot = 523004.47 ( 6+01:16:44)   Average: 32688 s/node, 8172 s/task
    System Billing Units used by this jobstep = 563.363


Profiling tools

• Xprofiler (a really useful and user-friendly tool)

– compile with -g -pg (and usual optimization options) and then execute program against chosen test case(s)

– provides graphical indication of call tree

– visual indication of most active routines

– click on routine to get FORTRAN statement level profiling

– part of IBM Parallel Environment

– also available on ecgate

• prof, gprof (standard Unix tools)

– Use them if Xprofiler is unavailable


Using Xprofiler

• compile and link the program with the -g and -pg options:

$ xlf -c -g -pg -O4 main.f

$ xlf -c -g -pg -O4 qq.f

$ xlf -g -pg main.o qq.o -o prog

• run the program (creates a gmon.out file)

$ ./prog data1

• invoke Xprofiler on the binary and the gmon.out file

$ xprofiler prog gmon.out


Xprofiler - overall view

[Screenshot: call tree with function boxes start, main, recurs, par2, par1, log, transfm, tan, trans3, trans2, sin, cos, mcount]


Xprofiler - zoomed view

Hints:

To obtain a clear overview of the call tree for your executable only, use the option Filter -> Hide All Library Calls followed by Filter -> Uncluster

View -> Zoom In to see labels

right-click function box for Function menu


Interpreting Xprofiler results

The width of the box indicates the relative amount of time spent by the routine and the routine's descendents

The height of the box indicates the relative amount of time spent in the routine

The program spent 2.631 seconds of CPU time in the routine and the routine's descendents

the routine itself consumed 1.230 seconds of CPU time

the name of the routine is recurs

[4] is the index of the routine in the "Function index" report

recurs called log 1000000 times


Xprofiler source code view

A "right click" on the recurs function's box brings up the routine's source code view:

The "ticks" column is the number of times the line was "active" when the profiling clock "ticked"


gmon.out file issues

• each run of the program will create a new gmon.out file (overwriting any existing gmon.out file)

– rename the old gmon.out file first if you want to keep it

• recompiling the program invalidates all older gmon.out files

– saving the old binary before recompiling can be used to keep older gmon.out files valid

• modifying the source file invalidates the information shown in xprofiler's "Source Code" window


Timing a program

• In program:

– mclock() returns CPU time

  • INTEGER FUNCTION

  • Returns 1/100ths of seconds

– rtc() returns elapsed (wall clock) time

  • REAL*8 FUNCTION

  • Returns seconds with microsecond resolution

• AIX time(x) command gives:

– 'Real' time (elapsed)

– 'User' time (CPU)

– 'System' time (CPU)

– Total CPU time = User + System

    implicit none
    real*8 r0,rtc,cpu_secs,real_secs
    integer m0,mclock
    .
    r0=rtc()
    m0=mclock()
    .
    >code you want to time<
    .
    cpu_secs=(mclock()-m0)*0.01
    real_secs=rtc()-r0
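mclock() and rtc() are XL Fortran service routines. Where they are not available, the same measurement can be sketched with the standard intrinsics CPU_TIME and SYSTEM_CLOCK. The module, function and variable names below are illustrative and do not come from the slides.

```fortran
module timing_sketch
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
contains
  ! Wall-clock seconds from an arbitrary origin, via the standard SYSTEM_CLOCK.
  function wall_seconds() result(t)
    double precision :: t
    integer(i8) :: ticks, rate
    call system_clock(ticks, rate)
    t = dble(ticks) / dble(rate)
  end function wall_seconds
end module timing_sketch

program time_block
  use timing_sketch
  implicit none
  integer, parameter :: n = 5000000
  double precision :: c0, c1, w0, w1, s
  integer :: i
  w0 = wall_seconds()
  call cpu_time(c0)
  s = 0.0d0
  do i = 1, n                  ! stand-in for the code you want to time
     s = s + sqrt(dble(i))
  end do
  call cpu_time(c1)
  w1 = wall_seconds()
  print *, 'result    =', s    ! use the result so the loop is not optimised away
  print *, 'cpu_secs  =', c1 - c0
  print *, 'real_secs =', w1 - w0
end program time_block
```

The resolution of both intrinsics is processor dependent, so the granularity warning on the next slide applies here as well.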


MCLOCK granularity

• timed code must take significantly longer than 1/100th sec

– an AIX restriction - not FORTRAN

• wrap the loop in "time multiplier" loop to improve timing results:

          T0=MCLOCK()
    C===MMM LOOP IS TO INCREASE TOTAL CPU TIME
          DO MMM=1,10000
            DO I=1,2000
              A(I)=A(I)+S*B(I)
            ENDDO
          ENDDO
    C===DIVIDE BY THE REPEAT COUNT (MMM IS 10001 AFTER THE LOOP)
          TLOOP=(MCLOCK()-T0)/100./10000.

• BE CAREFUL: this may wrongly hide cache miss effects

The solution is to use rtc() and run CPU-bound on an otherwise quiet system.


Beware of the optimizer …

    m0=rtc()
    . . .
    ml = rtc()
    delta = ml - m0

• the optimizer might move calls to rtc() and mclock()

• insert print statements to force serialization

    call dummy(m0)
    call dummy(ml)

• use -qstrict

• flush


Hardware performance monitor (1/2)

• counters provided by processor

• can be used via libraries in /usr/local/lib/trace

– libmpihpm.a (and -lpmapi)

• see /usr/local/lib/trace/README

No. of floating point operations:

FPU_FMA + FPU0_FIN + FPU1_FIN - FPU_STF

Performance monitoring was developed for hardware engineers!


Hardware performance monitor (2/2)

Other tools:

• hpmcount

~trx/hpm/hpmcount executable

• libhpm

-L/home/ectrain/trx/hpm -lhpm -lpmapi

export HPM_GROUP=[0-60], see /usr/local/lib/trace/power4.ref

    call hpm_begt(n)   start counting block n
    call hpm_endt(n)   stop counting block n
    call hpm_prnt()    print counter values and labels

• see ~trx/hpm


The serial tuning top 10 list

• instrument the code (i.e. mclock() and rtc())

• try different optimization flags

• use optimised libraries (ESSL, MASS, (NAG))

• use stride 1

• use cache effectively

• keep pipelines and FPUs busy

• maximize: Floating Point ops / (Load+Store ops)

• replace DIVIDEs

• remove IF statements

• help the compiler

• replace ** with EXP of LOG

• recode getting fractional part of a number


FORTRAN compiler flags

-O2

– optimises, but retains order of computation

– small amount of unrolling

– better than -O3 -qhot for some routines

-O3

– optimises with reordering of computation

– more aggressive unrolling

– use -qstrict to retain order of computation

-qhot

– blocks and transforms simple loops

– good for F90 array notation instructions

– use selectively


FORTRAN compiler flags

-qarch=auto [pwr4, pwr3]

– Controls which instructions the compiler can generate. Changing the default can improve performance but might produce code that can only be run on specific machines

-O4

– shorthand for: -O3 -qhot -qipa -qarch=auto -qtune=auto -qcache=auto

– -qipa: inter procedural analysis - increases compilation time

– -qcache=auto, -qtune=auto: tune for processor doing compilation


FORTRAN compiler flags

-qessl

– will substitute Fortran intrinsic functions from ESSL library when it is safe to do so (-lessl must be specified at link time)

Try various combinations as many optimisations interfere with each other


Performance libraries

• MASS library

– Mathematical Acceleration SubSystem

• ESSL

– Engineering and Scientific Subroutine Library

• NAG

– Numerical Algorithms Group (not particularly optimised for POWER4!)

– link with $NAGLIB


MASS library

• automatically provides high-performance alternative to maths intrinsics

• re-link only

• vector versions require source code change

• some are very slightly less accurate (normally only one ULP, i.e. one bit)

• at high optimization levels (-O4), xlf may automatically use routines from MASS

• -L/usr/local/lib/mass -lmass -lmassv


MASS library

• scalar library

– no change to code (i.e. compiler uses them)

– exp, log, **, sin, cos, tan, dnint speed up by a factor of about 2

• vector library

– code change may be required

– exp, log, sin, cos, tan, dnint, dint speed up by a factor of about 6

– if exp with IF statement, create reduced vector (which may not be long enough)

http://www.rs6000.ibm.com/resource/technology/MASS


MASS library

• Examples for performance gains on POWER3 architecture

– log: 1.57, vlog: 10.4

– sin: 2.42, vsin: 10.0

– (reciprocal) vrec: 2.6

• Compiler flags:

    -qarch=pwr4   enables hardware SQRT (very important)
    -qnounroll    to be avoided for small loops
    -qhot         compiler uses vector MASS SQRT and 1/x
    -qstrict      disables vector MASS functions

• Hand coded use of MASS generally best


ESSL functionality

• BLAS (Basic Linear Algebra Subprograms)

• linear algebra

• eigensystem analysis

• Fourier transform

• etc.


BLAS

• three "levels"

– Level 1: vector-vector: e.g. dot product (DDOT), DAXPY

– Level 2: matrix-vector: e.g. matrix-vector multiply, DGEMV

– Level 3: matrix-matrix: e.g. matrix multiply, DGEMM

• standardised

– portable across systems

– hardware vendors are encouraged to supply high-performance BLAS

– IBM's high performance BLAS is ESSL


(P)ESSL

• both ESSL and PESSL have an SMP-parallel capability

    -lessl
    -lesslsmp

• the "Parallel" in PESSL refers to the use of MPI message passing, usually over the SP switch


ESSL library

particularly good for:

– FFTs

– matrix manipulation

– linear equation solvers

– sort

FORTRAN often better for level 1 BLAS (vector-vector operations): you get the benefits of inlining etc. The two library calls

    CALL DAXPY(N,A,P,1,R,1)
    S=DDOT(N,R,1,P,1)

can be written as a single loop:

    S=0.0D0
    DO I=1,N
      R(I)=R(I)+A*P(I)
      S=S+P(I)*R(I)
    ENDDO


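Use stride 1 (leftmost Fortran index in array)

Before (inner loop runs over the second index):

    DO I=1,N
      DO J=1,N
        C(I,J)=C(I,J)+A(I,J)
      ENDDO
    ENDDO

After (inner loop runs over the leftmost index):

    DO J=1,N
      DO I=1,N
        C(I,J)=C(I,J)+A(I,J)
      ENDDO
    ENDDO

Note: the compiler may do this with -qhot (or -O4)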
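The effect of loop order can be measured directly. The following is a minimal sketch, not taken from the slides: the routine names, the array size n = 2000 and the use of SYSTEM_CLOCK are all assumptions made for illustration. At high optimisation levels the compiler may interchange the loops itself, in which case the two timings will be similar.

```fortran
module stride_sketch
  implicit none
contains
  ! Inner loop over j: consecutive accesses are n elements apart in memory.
  subroutine add_inner_j(c, a, n)
    integer, intent(in) :: n
    double precision, intent(inout) :: c(n,n)
    double precision, intent(in) :: a(n,n)
    integer :: i, j
    do i = 1, n
       do j = 1, n
          c(i,j) = c(i,j) + a(i,j)
       end do
    end do
  end subroutine add_inner_j

  ! Inner loop over i, the leftmost index: stride-1 access.
  subroutine add_inner_i(c, a, n)
    integer, intent(in) :: n
    double precision, intent(inout) :: c(n,n)
    double precision, intent(in) :: a(n,n)
    integer :: i, j
    do j = 1, n
       do i = 1, n
          c(i,j) = c(i,j) + a(i,j)
       end do
    end do
  end subroutine add_inner_i

  ! Largest difference between the results of the two loop orders.
  function order_difference(n) result(diff)
    integer, intent(in) :: n
    double precision :: diff
    double precision, allocatable :: a(:,:), c1(:,:), c2(:,:)
    allocate(a(n,n), c1(n,n), c2(n,n))
    call random_number(a)
    c1 = 1.0d0
    c2 = 1.0d0
    call add_inner_j(c1, a, n)
    call add_inner_i(c2, a, n)
    diff = maxval(abs(c1 - c2))
  end function order_difference
end module stride_sketch

program stride_demo
  use stride_sketch
  implicit none
  integer, parameter :: n = 2000
  integer, parameter :: i8 = selected_int_kind(18)
  double precision, allocatable :: a(:,:), c(:,:)
  integer(i8) :: t0, t1, t2, rate
  allocate(a(n,n), c(n,n))
  a = 1.0d0
  c = 0.0d0
  call system_clock(t0, rate)
  call add_inner_j(c, a, n)
  call system_clock(t1)
  call add_inner_i(c, a, n)
  call system_clock(t2)
  print *, 'inner loop over j (strided) :', dble(t1 - t0) / dble(rate), 's'
  print *, 'inner loop over i (stride 1):', dble(t2 - t1) / dble(rate), 's'
  print *, 'checksum:', sum(c)
end program stride_demo
```

Both routines compute the same result; only the memory access pattern differs.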


Avoid stores

• stores may flush caches

– stores may have to load caches first

• don't zero array as a precaution

– mix store zeros with storing data or

– use CACHE_ZERO directive

  • zeroes whole cache line without loading from memory

  • need to put in subroutine to handle partial line zeroing

• stores only go at half speed on POWER4 (unless cache lines interleaved)


Avoid stores

Before (y is stored and then loaded again):

    do i=1,n
      y(i)=c*x(i)
    enddo
    .
    .
    do i=1,n
      z(i)=1.0+y(i)
    enddo

After (no intermediate store):

    do i=1,n
      z(i)=1.0+c*x(i)
    enddo


Remove DIVIDEs

Before:

    do i=1,N
      a(i)=x(i)/z(i)
      b(i)=y(i)/z(i)
    enddo

After:

    do i=1,N
      t=1.d0/z(i)
      a(i)=x(i)*t
      b(i)=y(i)*t
    enddo

Note: compiler usually uses reciprocal with -O3 (without -qstrict)


Remove DIVIDEs

Before:

    do i=1,N
      z(i)=a/x(i)+b/y(i)
    enddo

After:

    do i=1,N
      z(i)=(a*y(i)+b*x(i))/(x(i)*y(i))
    enddo

Remember: DIVIDEs take about 14 cycles but extra multiplies only take from 1 cycle (pipelined) to 6 cycles (totally unpipelined)
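The single-divide form is algebraically identical but not bit-for-bit identical, and x*y can overflow or underflow where the original would not. A small check along the following lines shows how far the two forms differ in practice. The function names and test values are illustrative assumptions, not part of the slides.

```fortran
module divide_sketch
  implicit none
contains
  ! Original form: two divides per element.
  elemental function two_divides(a, b, x, y) result(z)
    double precision, intent(in) :: a, b, x, y
    double precision :: z
    z = a/x + b/y
  end function two_divides

  ! Rewritten form: one divide and three multiplies per element.
  elemental function one_divide(a, b, x, y) result(z)
    double precision, intent(in) :: a, b, x, y
    double precision :: z
    z = (a*y + b*x) / (x*y)
  end function one_divide
end module divide_sketch

program divide_demo
  use divide_sketch
  implicit none
  integer, parameter :: n = 5
  double precision :: x(n), y(n), z1(n), z2(n)
  integer :: i
  do i = 1, n
     x(i) = 1.0d0 + dble(i)
     y(i) = 2.0d0 * dble(i) + 0.5d0
  end do
  z1 = two_divides(3.0d0, 7.0d0, x, y)
  z2 = one_divide(3.0d0, 7.0d0, x, y)
  print *, 'max relative difference =', maxval(abs(z1 - z2) / abs(z1))
end program divide_demo
```

Differences of the order of the machine epsilon are to be expected; whether they matter depends on the application.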


Remove DIVIDEs

• create reciprocal array if DIVIDE uses same denominator more than once

• try to use MASS library VDIV (vector divide)

– use call vdiv(out,nom,div,len)

– may need to split loop

– compiler will try to do this if -qhot, and not conditional


Remove SQRTs

• much harder to remove as there is usually no high speed alternative

• but sometimes there is:

    if ( sqrt(x) < y )

can be replaced by

    if ( x < y * y )

(the two tests are only equivalent when y is non-negative)


Remove IFs

Before:

    do j=1,N
      if(j.eq.1) then
        a(j)=1.0
      else
        a(j)=b(j)
      endif
    enddo

After:

    a(1)=1.0
    do j=2,N
      a(j)=b(j)
    enddo


Remove IFs

Before (the test does not depend on j):

    do j=1,N
      if(k(i).eq.0)x(j)=0.0
      a(j)=x(j)+c*b(j)
    enddo

After:

    if(k(i).eq.0)then
      do j=1,N
        x(j)=0.0
        a(j)=c*b(j)
      enddo
    else
      do j=1,N
        a(j)=x(j)+c*b(j)
      enddo
    endif


Remove IFs

Use MAX or MIN instead of IF:

Before:

    do j=1,N
      if(a(j).lt.0) then
        b(j)=0.0
      else
        b(j)=a(j)
      endif
    enddo

After:

    do j=1,N
      b(j)= max(0.0,a(j))
    enddo
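A quick way to gain confidence in such a rewrite is to run both forms on the same data and compare. The sketch below does that for the MAX replacement; the function names and sample values are illustrative assumptions. The comparison is only claimed for ordinary finite values, since NaN inputs can behave differently in the two forms.

```fortran
module clamp_sketch
  implicit none
contains
  ! Branching form, as in the "before" loop.
  elemental function clamp_if(a) result(b)
    double precision, intent(in) :: a
    double precision :: b
    if (a < 0.0d0) then
       b = 0.0d0
    else
       b = a
    end if
  end function clamp_if

  ! Branch-free form using the MAX intrinsic.
  elemental function clamp_max(a) result(b)
    double precision, intent(in) :: a
    double precision :: b
    b = max(0.0d0, a)
  end function clamp_max
end module clamp_sketch

program clamp_demo
  use clamp_sketch
  implicit none
  double precision :: a(6)
  a = [-3.5d0, -1.0d-12, 0.0d0, 1.0d-12, 2.0d0, 1.0d9]
  print *, 'IF  form:', clamp_if(a)
  print *, 'MAX form:', clamp_max(a)
  print *, 'identical:', all(clamp_if(a) == clamp_max(a))
end program clamp_demo
```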


Help the compiler

If -qstrict is used, put parentheses around common expressions:

Before:

    DO I=1,N
      A(I)=B(I)*C(J)*D(J)
      X(I)=Y(I)*C(J)*D(J)
    ENDDO

After:

    DO I=1,N
      A(I)=B(I)*(C(J)*D(J))
      X(I)=Y(I)*(C(J)*D(J))
    ENDDO


Help the compiler

Enable overlapping (pipelining):

Before:

    DO J=1,M
      S=S+C*X(J)
    ENDDO

After:

    DO J=1,M,6
      S1=S1+C*X(J)
      S2=S2+C*X(J+1)
      S3=S3+C*X(J+2)
      S4=S4+C*X(J+3)
      S5=S5+C*X(J+4)
      S6=S6+C*X(J+5)
    ENDDO
    S=S1+S2+S3+S4+S5+S6

Notes:

• need at least 6 independent FMAs for max MFLOPS

• compiler may unroll with -O3 (without -qstrict)

• don't forget to handle the case where M isn't a multiple of 6!
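The last note on this slide, handling a trip count that is not a multiple of 6, can be sketched as follows. This is one possible arrangement (a clean-up loop after the unrolled loop), not the slides' own code; the function name and the test data are illustrative.

```fortran
module unroll_sketch
  implicit none
contains
  ! Sum of c*x(j) using six independent partial sums plus a remainder loop.
  function scaled_sum_unrolled(x, m, c) result(s)
    integer, intent(in) :: m
    double precision, intent(in) :: x(m), c
    double precision :: s, s1, s2, s3, s4, s5, s6
    integer :: j, mmain
    s1 = 0.0d0; s2 = 0.0d0; s3 = 0.0d0
    s4 = 0.0d0; s5 = 0.0d0; s6 = 0.0d0
    mmain = m - mod(m, 6)          ! largest multiple of 6 not exceeding m
    do j = 1, mmain, 6
       s1 = s1 + c*x(j)
       s2 = s2 + c*x(j+1)
       s3 = s3 + c*x(j+2)
       s4 = s4 + c*x(j+3)
       s5 = s5 + c*x(j+4)
       s6 = s6 + c*x(j+5)
    end do
    s = s1 + s2 + s3 + s4 + s5 + s6
    do j = mmain + 1, m            ! clean-up loop for the last 0..5 elements
       s = s + c*x(j)
    end do
  end function scaled_sum_unrolled
end module unroll_sketch

program unroll_demo
  use unroll_sketch
  implicit none
  integer, parameter :: m = 1003   ! deliberately not a multiple of 6
  double precision :: x(m), c, s_ref
  integer :: j
  c = 0.5d0
  s_ref = 0.0d0
  do j = 1, m
     x(j) = dble(j)
     s_ref = s_ref + c*x(j)
  end do
  print *, 'simple loop   =', s_ref
  print *, 'unrolled loop =', scaled_sum_unrolled(x, m, c)
end program unroll_demo
```

Because the partial sums change the order of the additions, the unrolled result can differ from the simple loop in the last bits for general data; that reordering is what -qstrict prevents the compiler from doing on its own.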


Help the compiler

Unroll outer loop:

Before:

    do j=1,N
      do i=2,M
        y(i,j)=y(i,j)-c*y(i-1,j)
      enddo
    enddo

After:

    do j=1,N,4
      do i=2,M
        y(i,j  )=y(i,j  )-c*y(i-1,j  )
        y(i,j+1)=y(i,j+1)-c*y(i-1,j+1)
        y(i,j+2)=y(i,j+2)-c*y(i-1,j+2)
        y(i,j+3)=y(i,j+3)-c*y(i-1,j+3)
      enddo
    enddo


Use the cache effectively

• poor use of cache can reduce performance by a factor of 10 or more

• thorough understanding of cache enables efficient program design

• ....but remember Feynman:

– if you think you understand how the cache works … then you don't understand how the cache works


Use the cache effectively

• Block inner strided loop

• e.g. matrix transpose:

    do ii=1,N,NB
      do j=1,N
        do i=ii,ii+NB-1
          y(i,j)=x(j,i)
        enddo
      enddo
    enddo

[Figure label: 4 Kwords]
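The transpose loop on this slide assumes that N is a multiple of NB. A variant that also works for other sizes bounds the inner loop with MIN, as the matrix-multiply slide does. The sketch below is an illustration under that assumption; the function name and the small test sizes are not from the slides.

```fortran
module transpose_sketch
  implicit none
contains
  ! Blocked transpose y(i,j) = x(j,i); the block size nb need not divide n.
  function transpose_blocked(x, n, nb) result(y)
    integer, intent(in) :: n, nb
    double precision, intent(in) :: x(n,n)
    double precision :: y(n,n)
    integer :: ii, i, j
    do ii = 1, n, nb
       do j = 1, n
          do i = ii, min(ii + nb - 1, n)   ! guard the last, partial block
             y(i,j) = x(j,i)
          end do
       end do
    end do
  end function transpose_blocked
end module transpose_sketch

program transpose_demo
  use transpose_sketch
  implicit none
  integer, parameter :: n = 10, nb = 4     ! 10 is not a multiple of 4
  double precision :: x(n,n), y(n,n)
  integer :: i, j
  do j = 1, n
     do i = 1, n
        x(i,j) = dble(10*i + j)
     end do
  end do
  y = transpose_blocked(x, n, nb)
  print *, 'matches intrinsic transpose:', all(y == transpose(x))
end program transpose_demo
```

As with the matrix multiply, a good block size has to be found by experiment on the target machine.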


Use the cache effectively

Prefetch data:

• untuned copy

    X(J)=Y(J)

• tuned copy

    X(J)=Y(J)+ZERO*X(J)

• load of X(J) activates prefetch streaming

• much faster if data is not in cache

• extra load takes longer if data in L1 cache (but L1 cache is tiny so it probably won't be)


Use the cache effectively

Compiler directives to prefetch data

PREFETCH_FOR_STORE or PREFETCH_FOR_LOAD

• issues command to load cache line with specified address

• use for semi sequential access where hardware will not activate streaming

• prefetch address a few loop counts ahead

• but remember POWER4 hardware also looks ahead and can have 8 outstanding cache misses

PREFETCH_BY_LOAD

• issues load instruction

• can be used for "fast start" streaming

• but must start in first 3/4 of cache line for forward streaming

See XL Fortran User's Guide for details


Reduce cache misses

Create outer loop to limit inner loop count

    DO J1=1,N,NCHUNK
      DO J=J1,MIN(J1+NCHUNK-1,N)
        ....
      ENDDO
      DO J=J1,MIN(J1+NCHUNK-1,N)
        ....
      ENDDO
      ...
    ENDDO


Reduce cache misses - Blocking

Matrix multiply

Before:

    do i=1,n
      do j=1,n
        do k=1,n
          d(i,j)=d(i,j)+a(j,k)*b(k,i)
        enddo
      enddo
    enddo

After:

    ! 3 blocking loops
    do ii=1,n,nb
      do jj=1,n,nb
        do kk=1,n,nb
          ! In-cache loops
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                d(i,j)=d(i,j)+a(j,k)*b(k,i)
              enddo
            enddo
          enddo
          !
        enddo
      enddo
    enddo

Best value of nb needs to be determined experimentally!

See e.g. IBM Redbook "The POWER4 Processor Introduction and Tuning Guide"
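Before tuning nb it is worth confirming that the blocked loop nest still gives the right answer. The sketch below wraps the slide's blocked loops in a function and compares the result with the MATMUL intrinsic; the function name, the sizes n = 37 and nb = 8, and the integer-valued test data (chosen so that all sums are exact) are assumptions made for illustration.

```fortran
module blocking_sketch
  implicit none
contains
  ! Blocked form of d(i,j) = sum over k of a(j,k)*b(k,i), as on the slide.
  function blocked_product(a, b, n, nb) result(d)
    integer, intent(in) :: n, nb
    double precision, intent(in) :: a(n,n), b(n,n)
    double precision :: d(n,n)
    integer :: ii, jj, kk, i, j, k
    d = 0.0d0
    do ii = 1, n, nb
      do jj = 1, n, nb
        do kk = 1, n, nb
          do i = ii, min(n, ii+nb-1)
            do j = jj, min(n, jj+nb-1)
              do k = kk, min(n, kk+nb-1)
                d(i,j) = d(i,j) + a(j,k)*b(k,i)
              end do
            end do
          end do
        end do
      end do
    end do
  end function blocked_product
end module blocking_sketch

program blocking_demo
  use blocking_sketch
  implicit none
  integer, parameter :: n = 37, nb = 8     ! n is not a multiple of nb
  double precision :: a(n,n), b(n,n), d(n,n)
  integer :: i, j
  do j = 1, n
    do i = 1, n
      a(i,j) = dble(mod(i + j, 5))         ! small integers keep the sums exact
      b(i,j) = dble(mod(i * j, 3))
    end do
  end do
  d = blocked_product(a, b, n, nb)
  ! With this index pattern d is the transpose of the ordinary product a*b.
  print *, 'max difference from transpose(matmul(a,b)):', &
           maxval(abs(d - transpose(matmul(a, b))))
end program blocking_demo
```

For general floating-point data the blocked and unblocked sums are accumulated in a different order, so small rounding differences are normal.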


Fractional part of a number

Usual:

    f = x - float(int(x))

Tuned:

    data RND/z'4338000000000000'/   ! D.P.

    t = x - sign(0.5d0,x)           ! t is a temporary; x keeps its value
    f = x - ((RND+t)-RND)

For single precision:

    data RND/z'4b400000'/           ! S.P. pwr3
    data RND/z'59C00000'/           ! S.P. pwr2


Mod function

To get y(i)=mod(x(i),c):

    parameter(rnd=2d0**52+2d0**51)
    ...
    do i=1,N
      t=x(i)*(1.d0/c)
      r=t-sign(0.5d0,t)        ! scalar temporary, distinct from the array x
      t=t-((rnd+r)-rnd)
      y(i)=c*t
    enddo


Replace ** with EXP of LOG

Before:

    X(I)=Y(I)**Z

After:

    X(I)=EXP(Z*LOG(Y(I)))

• argument of LOG must be greater than zero

• can also use VEXP and VLOG
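The two forms agree only to within rounding, and only for positive arguments. A minimal sketch of a check is given below; the function name and the sample values are illustrative assumptions, not part of the slides.

```fortran
module power_sketch
  implicit none
contains
  ! y**z computed as exp(z*log(y)); only meaningful for y > 0.
  elemental function pow_via_explog(y, z) result(p)
    double precision, intent(in) :: y, z
    double precision :: p
    p = exp(z * log(y))
  end function pow_via_explog
end module power_sketch

program power_demo
  use power_sketch
  implicit none
  double precision :: y(4), z
  y = [0.5d0, 1.0d0, 2.0d0, 10.0d0]
  z = 1.7d0
  print *, 'max relative difference:', &
           maxval(abs(pow_via_explog(y, z) - y**z) / (y**z))
end program power_demo
```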


What else?

• listings

-qreport=hotlist

• inlining

-Q , -Q+names

• @process directives

– NOOPT, NOSTRICT

    #ifdef RS6K
    @process NOOPT
    #endif


Getting close to peak performance

• operate efficiently within the L1 and L2 caches

• no DIVIDEs (or SQRTs or function calls or . . .) in loops

• take advantage of Fused Multiply Add (FMA)

– FMAs must be independent and at least 12 in number to keep both FPUs' pipelines busy

– use loop unrolling and decoupling to achieve that

  • i.e. sum into six variables and then add up the six later

• loops should be FMA bound

– more FMAs than loads and stores


Unit summary

• Tuning methodology

• Timing

• Profiling

• Serial optimisation