Optimal coding practices for IBM POWER4 processors

Steve Behling

IBM Corporation

[email protected]

Getting the most out of AIX, xlf, and xlcor

Outline

• Some hardware details

• Some software discussions

• My favorite hints

• Questions

Memory Hierarchy

Levels are listed from fastest and smallest (top) to slowest and largest (bottom):

• CPU
• Register: 1 cycle
• Cache / Main Memory: cache miss costs 8-200 cycles; TLB miss costs tens to hundreds of cycles
• Disk: ~100,000 cycles
• Massive Tape Storage: don't want to know

POWER4 processor chip layout

[Figure: POWER4 chip layout. Two >1 GHz CPUs share an L2 cache. Also shown: fabric controller (distributed switch), L3 controller and L3 directory, processor local bus, and the connections to L3 cache, memory, and the I/O bus.]

• Contains two 64-bit processors (PowerPC architecture)
• POWER4 has 1.4 MB (1440 KB) L2 cache; POWER4+ has 1.5 MB L2 cache
• L3 cache directory on chip
• All chip frequencies scale with processor frequency

POWER4 Processor Features

• High-frequency, speculative, superscalar processor with out-of-order instruction execution capabilities
• Eight independent execution units, capable of executing instructions in parallel (superscalar):
  − Two identical floating-point execution units, each capable of 2 floating-point operations per cycle
  − Two load/store execution units
  − Two fixed-point execution units
  − One branch execution unit
  − One condition register unit, which performs logical operations on the condition register
  − Only one of the FPUs does divides

POWER4 Instruction Issue Block Diagram

[Figure: POWER4 instruction issue block diagram. Front end: IFAR, I-cache, BR Scan, BR Predict, Instr Q / Instr Buffer, Decode, Crack & Group Formation, GCT. Issue queues: BR/CR, FX/LD 1, FX/LD 2, FP. Execution units: BR, CR, FX1, FX2, LD1, LD2, FP1, FP2. Also shown: StQ and D-cache.]

[Figure: multi-chip module. Shown inside the multi-chip module boundary: >1 GHz cores, shared L2 caches, L3 directories, L3 and memory controllers, GX buses, and chip-to-chip communication. Shown outside the boundary: L3 and memory.]

Multi-Chip Module (MCM)

p690 Multi-Chip Module (MCM)

[Figure: four MCMs (MCM 0 to MCM 3). Each MCM holds four chips, each with two processors (P) sharing an L2, plus GX links. The MCMs are surrounded by L3 caches, memory slots, and GX slots.]

IBM 32 processor pSeries 690

Cache Organization and Size

Cache                  Organization                                         Capacity
-----                  ------------                                         --------
L1 instruction cache   Direct mapped, 128-byte line                         64 KB per processor
L1 data cache          Two-way set associative, 128-byte cache line         32 KB per processor
Shared L2 cache        POWER4 mostly eight-way, some four-way;              1.4 MB per chip POWER4;
                       POWER4+ all eight-way                                1.5 MB per chip POWER4+
L3 cache               Eight-way. Two boot modes: 1 cache line per          128 MB per MCM
                       transfer or 4 cache lines per transfer

Virtual Memory Manager

Virtual storage is the addressable memory space used by the AIX operating system.

This linear, contiguous address space is mapped, by a combination of hardware and software, onto the hardware memory of the computer and onto disk paging space(s).

Pages are 4096 bytes on POWER3 and earlier hardware. Pages on POWER4 can be 4096 bytes, 16 MB, and 256 MB (requires AIX 5.1.0.25)

Translation Lookaside Buffer (TLB)

The TLB holds the information needed to translate between virtual and physical memory addresses. If the page is in the TLB, the translation costs nothing.

The cost of a TLB miss ranges from ~25 cycles to possibly hundreds of cycles in unfavorable cases.

TLB misses are likely when using indirect addressing, for example:

    L = left_neighbor[i];
    R = right_neighbor[i];
    a[i] += b[i]*a[L] + c[i]*a[R];

Hardware data prefetch

• IBM POWER4 has 8 hardware prefetch streams.
• 2 sequential cache line accesses (forward or backward) establish a prefetch stream.
• Prefetch streams stop when they reach a page boundary.
• Prefetching can be encouraged using compiler directives or code changes.
• Prefetch streams only get established for loads.
  – Can use the PREFETCH_BY_LOAD() directive for stores:

      do 10 i=1,NCELL
!IBM$ PREFETCH_BY_LOAD(i+33)
      a(i)=0.0
   10 continue

Coding for prefetch performance

Example: Dot product. 2 prefetch streams

    double s;
    double *a, *b;
    ....
    s = 0.0;
    for (i=0; i<N; i++) s = s + a[i]*b[i];

Example: Interleaved dot product. 6 prefetch streams

    double s, s1, s2;
    double *a, *b;
    int onethird, twothird;
    ....
    s = s1 = s2 = 0.0;
    onethird = N/3;
    twothird = 2*onethird;
    for (i=0; i<onethird; i++) {
        s  = s  + a[i]*b[i];
        s1 = s1 + a[i+onethird]*b[i+onethird];
        s2 = s2 + a[i+twothird]*b[i+twothird];
    }
    for (i=3*onethird; i<N; i++) s = s + a[i]*b[i];
    s = s + s1 + s2;
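The two fragments above elide their declarations and setup. A minimal self-contained sketch of both variants, wrapped as functions so they can be compared directly, might look like the following. The function names `dot` and `dot_interleaved` are hypothetical and do not come from the slides. The count of six streams assumes each of the three regions of `a` and `b` is walked sequentially and is long enough to establish a stream.

```c
#include <stddef.h>

/* Plain dot product: a and b are each read sequentially (2 streams). */
double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Interleaved dot product: three regions of a and three regions of b
   are read sequentially at the same time (6 streams). */
double dot_interleaved(const double *a, const double *b, size_t n)
{
    size_t onethird = n / 3;
    size_t twothird = 2 * onethird;
    double s = 0.0, s1 = 0.0, s2 = 0.0;

    for (size_t i = 0; i < onethird; i++) {
        s  += a[i] * b[i];
        s1 += a[i + onethird] * b[i + onethird];
        s2 += a[i + twothird] * b[i + twothird];
    }
    /* Leftover elements when n is not a multiple of 3. */
    for (size_t i = 3 * onethird; i < n; i++)
        s += a[i] * b[i];

    return s + s1 + s2;
}
```

Because the interleaved version adds the products in a different order, its result can differ from the plain version in the last bits for general floating-point data.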

AIX Large pages

• 16 MB large pages help HPC application performance by:
  – Eliminating TLB misses
  – Enhancing prefetch, since prefetch streams get reset at page boundaries
• Typically 5 to 15% improvement
• Some start-up overhead, since each task gets a full 256 MB segment (16 pages).
  – Deadly for scripts; may be bad for fork(), execlp()
• If large pages are exhausted, jobs silently fall back to small pages
  – Watch with "vmstat -l"
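The prefetch and segment figures above follow from simple division. The sketch below works them out, assuming the 128-byte cache line listed in the cache table is also the unit a prefetch stream advances by. The function names are hypothetical.

```c
#include <stdint.h>

/* Cache lines a sequential stream can cover before it reaches a page
   boundary and has to be re-established. */
uint64_t cache_lines_per_page(uint64_t page_bytes, uint64_t line_bytes)
{
    return page_bytes / line_bytes;
}

/* Number of pages that make up one segment. */
uint64_t pages_per_segment(uint64_t segment_bytes, uint64_t page_bytes)
{
    return segment_bytes / page_bytes;
}
```

With 128-byte lines, a 4096-byte page holds 32 lines and a 16 MB page holds 131072 lines, so under this assumption a stream is interrupted 4096 times less often with large pages. A 256 MB segment holds 16 pages of 16 MB.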

AIX Large Page Administration

• AIX can set aside memory to be backed by large pages (typically 50%)
  – vmtune -g nnn -L mmm
  – bosboot -a; reboot
• Application can be large page enabled
  – ldedit -b lpdata a.out
• User must be enabled:
  – chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE userid
• Or set default in /etc/security/users

TLB coverage

POWER3:
• TLB contained 256 entries
• TLB coverage is 1 MB (smaller than L2 cache)

POWER4:
• TLB contains 1024 entries
• TLB coverage is 4 MB for small pages
• TLB coverage is 16 GB for large pages
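The coverage figures are the number of TLB entries multiplied by the page size. A small sketch of that arithmetic follows; the function name is hypothetical.

```c
#include <stdint.h>

/* TLB coverage in bytes = number of TLB entries x page size in bytes. */
uint64_t tlb_coverage_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

For example, 256 entries of 4096-byte pages give 1 MB, 1024 entries of 4096-byte pages give 4 MB, and 1024 entries of 16 MB pages give 16 GB.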

TLB example (xlf -WF,-DHPM …)

      program stand
#ifdef HPM
#include "f_hpm.h"
#endif
      parameter (NCELL=400)
      common /mystuff/ a1,a2,a3
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      real(8) time1,time2,rtc,etime,sc
      a1 = 1.0d0
      a2 = 2.0d0
#ifdef HPM
      call f_hpminit(0,"Job")
      call f_hpmstart(1,"Total_routine")
#else
      time1=rtc()
#endif
      call sub1(a1,a2,a3,NCELL)
#ifdef HPM
      call f_hpmstop(1)
      call f_hpmterminate(0)
#else
      time2=rtc()
      etime=time2-time1
      print *,'Subroutine took ', etime,' seconds'
#endif
      end

TLB subroutines and performance

      subroutine sub_nest(a1,a2,a3,n)
      parameter (NCELL=400)
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      integer(4) n
      integer(4) i,j,k
      real(8) s
!
      s=1.1d0
      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
      a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

TLB: performance, 375 MHz Power3, 4 MB L2 cache

Loop order k, j, i (i innermost):

      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
      a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

Loop order i, j, k (k innermost):

      do 10 i=1,NCELL
      do 10 j=1,NCELL
      do 10 k=1,NCELL
      a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

Loop order k, i, j (j innermost):

      do 10 k=1,NCELL
      do 10 i=1,NCELL
      do 10 j=1,NCELL
      a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

Time = 21.5 s, 329.7 LD/TLB miss
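Fortran stores arrays in column-major order, so the innermost loop index decides how far apart consecutive accesses are in memory. The sketch below computes that stride for an n x n x n real(8) array and how many consecutive innermost iterations stay within one page. The function names are hypothetical, and the code assumes an 8-byte `double`.

```c
#include <stddef.h>

/* Byte stride between consecutive innermost iterations for a column-major
   (Fortran-order) n x n x n array of 8-byte reals.
   dim is 0 when i is innermost, 1 for j, 2 for k. */
size_t inner_stride_bytes(size_t n, int dim)
{
    size_t stride = sizeof(double);
    for (int d = 0; d < dim; d++)
        stride *= n;
    return stride;
}

/* Consecutive innermost iterations that fall in the same page (at least 1). */
size_t iterations_per_page(size_t stride_bytes, size_t page_bytes)
{
    size_t it = page_bytes / stride_bytes;
    return it > 0 ? it : 1;
}
```

For NCELL=400 the strides are 8 bytes (i innermost), 3200 bytes (j innermost), and 1,280,000 bytes (k innermost). With 4096-byte pages, the i-innermost loop makes 512 accesses per array before moving to a new page, while the k-innermost loop touches a different page on every access.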



Favorite hints

• Put "export AIXTHREAD_SCOPE=S" in your .profile
• -g does not decrease optimization
• First compile: -O2 -qarch=pwr4 -qtune=pwr4 -qmaxmem=-1
  – C: use -qlibansi
  – Fortran: use xlf90 -qfixed
• Most likely to get within 5% of optimal performance using -O3
  – May need to use -qstrict
• Use -lmass if you use any intrinsics (sqrt, exp, **, etc.)
• Try -O4, -qhot, -qalias=allptrs (C), etc. on individual routines.
• OpenMP: use guided scheduling. -qsmp=omp,noauto

Favorite hints (cont.)

• MPI codes run very well on SMP systems
  – MP_SHARED_MEMORY=yes
  – MP_WAIT_MODE=poll
• (MPICH ch_shmem is pretty good, too, if you build it with -O3 -qarch=pwr4 -qtune=pwr4, at least through 8 processors)
• If you do lots of 64-bit integer arithmetic, use -q64 so you can exploit the PowerPC 64-bit integer hardware.
• Use "nmon", a low-overhead, curses-based system monitoring program.
• dbx a.out core is OK, but TotalView is awesome.
• Don't use -bmaxdata with -q64
• Use -bmaxdata:0x80000000/dsa with -q32

L3 Cache (POWER4 only)

Four POWER4 chips are combined into a multi-chip module (MCM), each of which has a 128 MB level 3 cache.

L3 cache is 8-way set associative.

L3 cache may be bypassed if busy. Consequence: data may not be where you think it is.

On p690, L3 cache is shared system wide.

Tuning Recommendation

POWER4: For optimal performance, block data for the L2 cache and structure data access for the L1 data cache.

Use FMA for best performance

A multiply/add counts as two floating point operations, so that, for example, a program doing only additions might run at half the MFlops rate of one doing alternate multiplies and adds

    /* bad code */
    for (i=0; i<N; i++) a[i] = s*a[i];
    printf("I did the multiply loop.\n");
    for (i=0; i<N; i++) a[i] = b[i]+a[i];

    /* good code */
    for (i=0; i<N; i++) a[i] = b[i] + s*a[i];

Note: C++ operator overloading could result in “bad code” – requires careful examination

How to get the most MFlops

• Operate within L1 and L2 cache via blocking
• Avoid TLB misses (stride 1 as much as possible)
• Multiplies must be paired with adds or subtracts so that each FMA is two flops
• FMAs must be independent (and at least eight in number to keep two pipes of depth four going)
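The last two points can be turned into back-of-the-envelope numbers. The sketch below does so. The function names are hypothetical, and the 1300 MHz clock used in the example is an assumed value for illustration only (the slides say only ">1 GHz").

```c
/* Peak floating-point rate in MFlops:
   clock (MHz) x number of FPUs x flops per FPU per cycle.
   An FMA counts as 2 flops per cycle, a lone add or multiply as 1. */
double peak_mflops(double clock_mhz, int fpus, int flops_per_cycle)
{
    return clock_mhz * fpus * flops_per_cycle;
}

/* Independent FMAs that must be in flight to keep every stage of every
   floating-point pipeline busy. */
int min_independent_fmas(int pipes, int depth)
{
    return pipes * depth;
}
```

At an assumed 1300 MHz, two FPUs issuing FMAs every cycle give 1300 x 2 x 2 = 5200 MFlops per processor, while a loop of additions only tops out at 2600 MFlops. Two pipes of depth four need 2 x 4 = 8 independent FMAs.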

Peak Mflops example

! Matrix multiply kernel
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    do k=kk,min(n,kk+nb-1)
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo

! Same code but scalar explicitly stated
! Good, but load/store bound
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    s =d(i,j)
    do k=kk,min(n,kk+nb-1)
      s =s +a(j,k)*b(k,i)
    enddo
    d(i,j)=s
  enddo
enddo

Peak Mflops (cont.)

! As written, this assumes the block extents are multiples of 5 (i) and 4 (j)
do i=ii,min(n,ii+nb-1),5
  do j=jj,min(n,jj+nb-1),4
    s00 =d(i+0,j+0)
    s10 =d(i+1,j+0)
    s20 =d(i+2,j+0)
    s30 =d(i+3,j+0)
    s40 =d(i+4,j+0)
    s01 =d(i+0,j+1)
    s11 =d(i+1,j+1)
    s21 =d(i+2,j+1)
    s31 =d(i+3,j+1)
    s41 =d(i+4,j+1)
    s02 =d(i+0,j+2)
    s12 =d(i+1,j+2)
    s22 =d(i+2,j+2)
    s32 =d(i+3,j+2)
    s42 =d(i+4,j+2)
    s03 =d(i+0,j+3)
    s13 =d(i+1,j+3)
    s23 =d(i+2,j+3)
    s33 =d(i+3,j+3)
    s43 =d(i+4,j+3)
    do k=kk,min(n,kk+nb-1)
      s00 =s00 +a(j+0,k)*b(k,i+0)
      s10 =s10 +a(j+0,k)*b(k,i+1)
      s20 =s20 +a(j+0,k)*b(k,i+2)
      s30 =s30 +a(j+0,k)*b(k,i+3)
      s40 =s40 +a(j+0,k)*b(k,i+4)
      s01 =s01 +a(j+1,k)*b(k,i+0)
      s11 =s11 +a(j+1,k)*b(k,i+1)
      s21 =s21 +a(j+1,k)*b(k,i+2)
      s31 =s31 +a(j+1,k)*b(k,i+3)
      s41 =s41 +a(j+1,k)*b(k,i+4)
      s02 =s02 +a(j+2,k)*b(k,i+0)
      s12 =s12 +a(j+2,k)*b(k,i+1)
      s22 =s22 +a(j+2,k)*b(k,i+2)
      s32 =s32 +a(j+2,k)*b(k,i+3)
      s42 =s42 +a(j+2,k)*b(k,i+4)
      s03 =s03 +a(j+3,k)*b(k,i+0)
      s13 =s13 +a(j+3,k)*b(k,i+1)
      s23 =s23 +a(j+3,k)*b(k,i+2)
      s33 =s33 +a(j+3,k)*b(k,i+3)
      s43 =s43 +a(j+3,k)*b(k,i+4)
    enddo
    d(i+0,j+0)=s00
    d(i+1,j+0)=s10
    d(i+2,j+0)=s20
    d(i+3,j+0)=s30
    d(i+4,j+0)=s40
    d(i+0,j+1)=s01
    d(i+1,j+1)=s11
    d(i+2,j+1)=s21
    d(i+3,j+1)=s31
    d(i+4,j+1)=s41
    d(i+0,j+2)=s02
    d(i+1,j+2)=s12
    d(i+2,j+2)=s22
    d(i+3,j+2)=s32
    d(i+4,j+2)=s42
    d(i+0,j+3)=s03
    d(i+1,j+3)=s13
    d(i+2,j+3)=s23
    d(i+3,j+3)=s33
    d(i+4,j+3)=s43
  enddo
enddo

5x4 hand unrolling to maximize FMA and register usage
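The kernels above work on a single nb x nb block selected by ii, jj, and kk; the slides do not show the enclosing block loops. A minimal C sketch of a complete blocked multiply, including those outer loops, could look like this. It is an illustration under assumptions, not the author's code: it uses row-major C arrays and the conventional c += a*b indexing rather than the slide's d(i,j) += a(j,k)*b(k,i), and the names `matmul_blocked` and `min_sz` are hypothetical.

```c
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b for n x n row-major matrices, processed in nb x nb blocks
   so that the working set of the inner three loops can stay in cache.
   nb must be greater than zero. */
void matmul_blocked(size_t n, size_t nb,
                    const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t kk = 0; kk < n; kk += nb)
            for (size_t jj = 0; jj < n; jj += nb)
                /* Kernel over one block; min_sz handles partial edge blocks. */
                for (size_t i = ii; i < min_sz(n, ii + nb); i++)
                    for (size_t k = kk; k < min_sz(n, kk + nb); k++) {
                        double aik = a[i * n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < min_sz(n, jj + nb); j++)
                            c[i * n + j] += aik * b[k * n + j];  /* stride-1 FMA */
                    }
}
```

The block size nb would be chosen so that the blocks in use fit in the target cache level; the right value depends on the machine and is best found by measurement.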

Avoid divides – only one FPU on Power4 does divides!

Untuned                 Tuned
-------                 -----
DO I=1,N                DO I=1,N
  A(I)=B(I)/C(I)          OC=1.0/C(I)
  P(I)=Q(I)/C(I)          A(I)=B(I)*OC
ENDDO                     P(I)=Q(I)*OC
                        ENDDO

For simple cases, compiler does this for you.

Untuned                 Tuned
-------                 -----
DO I=1,N                DO I=1,N
  A(I)=B(I)/C(I)          OCD=1.0/(C(I)*D(I))
  P(I)=Q(I)/D(I)          A(I)=B(I)*D(I)*OCD
ENDDO                     P(I)=Q(I)*C(I)*OCD
                        ENDDO

Clever method to replace 2 divides by 1 divide and 5 multiplies and use both FPUs
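A C rendering of the second transformation, written as two functions so the results can be compared, is sketched below. The function names are hypothetical. The tuned version is not bit-for-bit identical to the untuned one in general: results can differ in the last bits, and the product c[i]*d[i] can overflow or underflow where the separate divides would not.

```c
#include <stddef.h>

/* Untuned: two divides per iteration. */
void two_divides(size_t n, const double *b, const double *c,
                 const double *d, const double *q, double *a, double *p)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] / c[i];
        p[i] = q[i] / d[i];
    }
}

/* Tuned: one divide and five multiplies per iteration. */
void one_divide(size_t n, const double *b, const double *c,
                const double *d, const double *q, double *a, double *p)
{
    for (size_t i = 0; i < n; i++) {
        double ocd = 1.0 / (c[i] * d[i]);  /* the single divide */
        a[i] = b[i] * d[i] * ocd;          /* equals b/c up to rounding */
        p[i] = q[i] * c[i] * ocd;          /* equals q/d up to rounding */
    }
}
```

When comparing the two on real data, use a relative tolerance rather than exact equality.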

Minimize expensive intrinsic calls

Untuned                          Tuned
-------                          -----
DO I=1,N                         DIMENSION SINX(N)
  DO J=1,N                       ...
    A(J,I)=B(J,I)*SIN(X(J))      DO J=1,N
  ENDDO                            SINX(J)=SIN(X(J))
ENDDO                            ENDDO
                                 DO I=1,N
                                   DO J=1,N
                                     A(J,I)=B(J,I)*SINX(J)
                                   ENDDO
                                 ENDDO
