
Page 1: Optimal coding practices for IBM POWER4 processors

Optimal coding practices for IBM POWER4 processors

or

Getting the most out of AIX, xlf, and xlc

Steve Behling

IBM Corporation

[email protected]

Page 2: Optimal coding practices for IBM POWER4 processors

Outline

• Some hardware details

• Some software discussions

• My favorite hints

• Questions

Page 3: Optimal coding practices for IBM POWER4 processors

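Memory Hierarchy

[Figure: memory hierarchy diagram with Speed and Size axes.]

Levels, in the order shown: CPU, Register, Cache, Main Memory, Disk, Massive Tape Storage

Access costs annotated on the figure, in the order shown:
• 1 cycle
• Cache miss: 8-200 cycles; TLB miss: tens to hundreds of cycles
• ~100,000 cycles
• Don't want to know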

Page 4: Optimal coding practices for IBM POWER4 processors

POWER4 processor chip layout

[Figure: chip block diagram. Labels: two >1 GHz CPUs, each with instruction (I) and data (D) caches; shared L2 cache; fabric controller (distributed switch); L3 controller and L3 directory; L3 cache; memory; processor local bus; I/O bus.]

• Contains two 64-bit processors (PowerPC architecture)
• POWER4 has 1.4 MB (1440 KB) L2 cache; POWER4+ has 1.5 MB L2 cache
• L3 cache directory on chip
• All chip frequencies scale with processor frequency

Page 5: Optimal coding practices for IBM POWER4 processors

POWER4 Processor Features

• High-frequency, speculative, superscalar processor with out-of-order instruction execution
• Eight independent execution units, capable of executing instructions in parallel (superscalar):
  − Two identical floating-point execution units, each capable of 2 floating-point operations per cycle
  − Two load/store execution units
  − Two fixed-point execution units
  − One branch execution unit
  − One condition register unit, which performs logical operations on the condition register
  − Only one of the FPUs does divides

Page 6: Optimal coding practices for IBM POWER4 processors

POWER4 Instruction Issue Block Diagram

[Figure: instruction issue block diagram. Labels: IFAR, I-cache, BR Scan, BR Predict, Instr Q / Instr Buffer, Decode, Crack & Group Formation, GCT; issue queues BR/CR, FX/LD 1, FX/LD 2, FP; execution units BR, CR, FX1, LD1, LD2, FX2, FP1, FP2; StQ; D-cache.]

Page 7: Optimal coding practices for IBM POWER4 processors

p690 Multi-Chip Module (MCM)

[Figure: Multi-Chip Module (MCM) diagram. Inside the multi-chip module boundary: POWER4 chips with >1 GHz cores, shared L2, L3 directory, and GX bus, linked by chip-chip communication. Outside the boundary: L3 caches, memory controllers (MemCtrl), and MEMORY.]

Page 8: Optimal coding practices for IBM POWER4 processors

IBM 32 processor pSeries 690

[Figure: system diagram of four multi-chip modules (MCM 0 to MCM 3). Each MCM shows processors (P) in pairs sharing an L2, with GX buses and L3 caches; memory slots (MemSlot) and GX slots surround the MCMs.]

Page 9: Optimal coding practices for IBM POWER4 processors

Cache Organization and Size

| Cache | Organization | Capacity |
|---|---|---|
| L1 instruction cache | Direct mapped, 128-byte line | 64 KB per processor |
| L1 data cache | Two-way set associative, 128-byte cache line | 32 KB per processor |
| Shared L2 cache | POWER4: mostly eight-way, some four-way; POWER4+: all eight-way | 1.4 MB per chip (POWER4); 1.5 MB per chip (POWER4+) |
| L3 cache | Eight-way; two boot modes: 1 cache line per transfer or 4 cache lines per transfer | 128 MB per MCM |
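The organization column determines which addresses compete for the same cache lines. The sketch below works that arithmetic out for the figures in the table. The function names are hypothetical, and the simple "line number modulo number of sets" indexing is an assumption for illustration rather than a statement about the exact POWER4 hash.

```c
#include <stdint.h>

/* Number of sets = capacity / (associativity * line size). */
uint64_t cache_sets(uint64_t capacity_bytes, uint64_t ways, uint64_t line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

/* Set index of a byte address, assuming plain modulo indexing:
   line number modulo the number of sets. */
uint64_t cache_set_index(uint64_t addr, uint64_t sets, uint64_t line_bytes)
{
    return (addr / line_bytes) % sets;
}
```

With the table's numbers, `cache_sets(32 * 1024, 2, 128)` gives 128 sets for the L1 data cache. Under the modulo assumption, addresses 16 KB apart fall into the same set, so three or more arrays walked in lockstep at such offsets would compete for only two ways.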

Page 10: Optimal coding practices for IBM POWER4 processors

Virtual Memory Manager

Virtual storage is the addressable memory space used by the AIX operating system.

This linear contiguous address space is mapped, by a combination of hardware and software, onto the hardware memory of the computer and onto disk paging space(s).

Pages are 4096 bytes on POWER3 and earlier hardware. Pages on POWER4 can be 4096 bytes, 16 MB, and 256 MB (requires AIX 5.1.0.25).

Page 11: Optimal coding practices for IBM POWER4 processors

Translation Lookaside Buffer (TLB)

TLB holds the information to translate between virtual and physical memory addresses. If the page is in the TLB, translation costs nothing.

The cost of a TLB miss ranges from ~25 cycles to possibly hundreds of cycles in unfavorable cases.

TLB misses are likely when using indirect addressing:

    L = left_neighbor[i];
    R = right_neighbor[i];
    a[i] += b[i]*a[L] + c[i]*a[R];
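To see why indirect addressing invites TLB misses, the sketch below counts how often consecutive gathers `a[idx[i]]` land on a different page. Everything here is hypothetical scaffolding: the function names, the 4096-byte page size, and the assumption that `a` starts on a page boundary. It models address locality only, not the behavior of the real TLB.

```c
#include <stddef.h>

#define PAGE_BYTES 4096u  /* assumed small-page size */

/* Count how many consecutive accesses a[idx[i-1]], a[idx[i]] fall on
   different pages, treating a[] as an array of doubles that starts
   on a page boundary. */
size_t count_page_switches(const size_t *idx, size_t n)
{
    size_t switches = 0;
    for (size_t i = 1; i < n; i++) {
        size_t prev_page = idx[i - 1] * sizeof(double) / PAGE_BYTES;
        size_t this_page = idx[i] * sizeof(double) / PAGE_BYTES;
        if (prev_page != this_page)
            switches++;
    }
    return switches;
}

/* Neighbor list in natural order: idx[i] = i. */
void fill_sequential(size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
}

/* Scattered neighbor list: idx[i] = (i * step) mod n, which is a
   permutation of 0..n-1 when step and n have no common factor. */
void fill_scattered(size_t *idx, size_t n, size_t step)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (i * step) % n;
}
```

For 4096 doubles (eight 4 KB pages), the in-order list changes page 7 times, while a scattered list with step 513 changes page on every one of the 4095 consecutive accesses. Renumbering cells so that neighbors are close in memory moves a code toward the first case.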

Page 12: Optimal coding practices for IBM POWER4 processors

Hardware data prefetch

• IBM POWER4 has 8 hardware prefetch streams.
• 2 sequential cache line accesses (forward or backward) establish a prefetch stream.
• Prefetch streams stop when they reach a page boundary.
• Prefetching can be encouraged using compiler directives or code changes.
• Prefetch streams only get established for loads.
  – Can use the PREFETCH_BY_LOAD() directive for stores:

      do 10 i=1,NCELL
!IBM* PREFETCH_BY_LOAD(a(i+33))
        a(i)=0.0
 10   continue

Page 13: Optimal coding practices for IBM POWER4 processors

Coding for prefetch performance

Example: Dot product. 2 prefetch streams

    double s;
    double *a, *b;
    int i;
    ....
    s = 0.0;
    for (i = 0; i < N; i++)
        s = s + a[i]*b[i];

Example: Interleaved dot product. 6 prefetch streams

    double s, s1, s2;
    double *a, *b;
    int i, onethird, twothird;
    ....
    s = s1 = s2 = 0.0;
    onethird = N/3;
    twothird = 2*onethird;
    for (i = 0; i < onethird; i++) {
        s  = s  + a[i]*b[i];
        s1 = s1 + a[i+onethird]*b[i+onethird];
        s2 = s2 + a[i+twothird]*b[i+twothird];
    }
    for (i = 3*onethird; i < N; i++)   /* leftover elements when N is not a multiple of 3 */
        s = s + a[i]*b[i];
    s = s + s1 + s2;

Page 14: Optimal coding practices for IBM POWER4 processors

AIX Large pages

• 16 MB large pages help HPC application performance by:
  – Eliminating TLB misses
  – Enhancing prefetch, since prefetch streams get reset at page boundaries
• Typically 5 to 15% improvement
• Some startup overhead, since each task gets a full 256 MB segment (16 pages).
  – Deadly for scripts; may be bad for fork(), execlp()
• If large pages are exhausted, jobs silently fall back to small pages
  – Watch with “vmstat -l”

Page 15: Optimal coding practices for IBM POWER4 processors

AIX Large Page Administration

• AIX can set aside memory to be backed by large pages (typically 50%)
  – vmtune -g nnn -L mmm
  – bosboot -a; reboot
• Application can be large page enabled
  – ldedit -b lpdata a.out
• User must be enabled:
  – chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE userid
• Or set default in /etc/security/users

Page 16: Optimal coding practices for IBM POWER4 processors

TLB coverage

POWER3:
• TLB contained 256 entries
• TLB coverage is 1 MB (smaller than L2 cache)

POWER4:
• TLB contains 1024 entries
• TLB coverage is 4 MB for small pages
• TLB coverage is 16 GB for large pages
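The coverage figures are simply the number of TLB entries multiplied by the page size. A minimal sketch of that arithmetic follows; the function names are hypothetical.

```c
#include <stdint.h>

/* TLB coverage in bytes = number of entries * page size. */
uint64_t tlb_coverage_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}

/* Number of pages (and so TLB entries) needed to map `bytes` of data,
   rounded up to whole pages. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_bytes)
{
    return (bytes + page_bytes - 1) / page_bytes;
}
```

For example, one 400 x 400 x 400 array of 8-byte reals is 512,000,000 bytes: `pages_needed` reports 125,000 small (4096-byte) pages, far more than 1024 TLB entries, but only 31 large (16 MB) pages.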

Page 17: Optimal coding practices for IBM POWER4 processors

TLB example (xlf -WF,-DHPM …)

      program stand
#ifdef HPM
#include "f_hpm.h"
#endif
      parameter (NCELL=400)
      common /mystuff/ a1,a2,a3
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      real(8) time1,time2,rtc,etime,sc
      a1 = 1.0d0
      a2 = 2.0d0
#ifdef HPM
      call f_hpminit(0,"Job")
      call f_hpmstart(1,"Total_routine")
#else
      time1=rtc()
#endif
      call sub1(a1,a2,a3,NCELL)
#ifdef HPM
      call f_hpmstop(1)
      call f_hpmterminate(0)
#else
      time2=rtc()
      etime=time2-time1
      print *,'Subroutine took ', etime,' seconds'
#endif
      end

Page 18: Optimal coding practices for IBM POWER4 processors

TLB subroutines and performance

      subroutine sub_nest(a1,a2,a3,n)
      parameter (NCELL=400)
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      integer(4) n
      integer(4) i,j,k
      real(8) s
!
      s=1.1d0
      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
        a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
 10   continue
      end

Page 19: Optimal coding practices for IBM POWER4 processors

TLB: performance, 375 MHz POWER3, 4 MB L2 cache

Loop order k, j, i (i innermost, stride 1): Time = 21.5 s, 329.7 LD/TLB miss

      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
        a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
 10   continue
      end

Loop order i, j, k (k innermost): Time = 980.6 s, 0.667 LD/TLB miss

      do 10 i=1,NCELL
      do 10 j=1,NCELL
      do 10 k=1,NCELL
        a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
 10   continue
      end

Loop order k, i, j (j innermost): Time = 178.0 s, 0.853 LD/TLB miss

      do 10 k=1,NCELL
      do 10 i=1,NCELL
      do 10 j=1,NCELL
        a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
 10   continue
      end
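The same rule applies in C with the index roles reversed, because C stores arrays row by row: the rightmost index should be the innermost loop. The sketch below is a hypothetical C counterpart of the kernel above, with a small `N` chosen only so the example is cheap to run; the function names are made up for illustration.

```c
#define N 16  /* small illustrative size; the slides use NCELL=400 */

/* Stride-1 order for C: the rightmost index k is innermost, so
   consecutive iterations touch adjacent doubles. */
void triad_stride1(double a3[N][N][N], double a2[N][N][N],
                   double a1[N][N][N], double s)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                a3[i][j][k] = a2[i][j][k] + s * a1[i][j][k];
}

/* Worst order for C: the leftmost index i is innermost, so consecutive
   iterations are N*N doubles apart in memory. */
void triad_strided(double a3[N][N][N], double a2[N][N][N],
                   double a1[N][N][N], double s)
{
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a3[i][j][k] = a2[i][j][k] + s * a1[i][j][k];
}
```

Both functions compute the same result; only the order of memory accesses differs. At N = 400 the strided version steps 1.28 MB between consecutive accesses, which corresponds to the slowest of the three Fortran timings above.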

Page 20: Optimal coding practices for IBM POWER4 processors

Favorite hints

• Put “export AIXTHREAD_SCOPE=S” in your .profile
• -g does not decrease optimization
• First compile: -O2 -qarch=pwr4 -qtune=pwr4 -qmaxmem=-1
  – C: use -qlibansi
  – Fortran: use xlf90 -qfixed
• Most likely to get within 5% of optimal performance using -O3
  – May need to use -qstrict
• Use -lmass if you use any intrinsics (sqrt, exp, **, etc.)
• Try -O4, -qhot, -qalias=allptrs (C), etc. on individual routines.
• OpenMP: use guided scheduling. -qsmp=omp,noauto

Page 21: Optimal coding practices for IBM POWER4 processors

Favorite hints (cont.)

• MPI codes run very well on SMP systems
  – MP_SHARED_MEMORY=yes
  – MP_WAIT_MODE=poll
• (MPICH ch_shmem is pretty good, too, if you build it with -O3 -qarch=pwr4 -qtune=pwr4, at least through 8 processors)
• If you do lots of 64-bit integer arithmetic, use -q64 so you can exploit the PowerPC 64-bit integer hardware.
• Use “nmon”, a low-overhead, curses-based system monitoring program.
• dbx a.out core is OK, but TotalView is awesome.
• Don’t use -bmaxdata with -q64
• Use -bmaxdata:0x80000000/dsa with -q32

Page 22: Optimal coding practices for IBM POWER4 processors

End

Page 23: Optimal coding practices for IBM POWER4 processors

L3 Cache (POWER4 only)

Four POWER4 chips are combined into a multi-chip module (MCM), each of which has a 128 MB level 3 cache.

L3 cache is 8-way set associative.

L3 cache may be bypassed if busy. Consequence: data may not be where you think it is.

On p690, L3 cache is shared system wide.

Page 24: Optimal coding practices for IBM POWER4 processors

Tuning Recommendation

POWER4: For optimal performance it is recommended to block data for L2 cache and to structure the data access for the L1 data cache
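A common way to block data for a cache is to process a large array in tiles small enough to stay resident. The sketch below applies that idea to a matrix transpose, a kernel not taken from the slides and chosen only because one side of it is unavoidably strided. The function names and the tile parameter `nb` are hypothetical, and a suitable `nb` would have to be found by measurement.

```c
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Transpose the n x n row-major matrix src into dst, one nb x nb tile
   at a time, so the lines touched in src and dst are reused while they
   are still in cache. nb must be greater than zero. */
void transpose_blocked(size_t n, size_t nb, const double *src, double *dst)
{
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t jj = 0; jj < n; jj += nb)
            for (size_t i = ii; i < min_size(n, ii + nb); i++)
                for (size_t j = jj; j < min_size(n, jj + nb); j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The outer `ii`/`jj` loops pick a tile and the inner loops stay inside it, clipping at the matrix edge with `min_size` in the same way the Fortran kernels in this deck use `min(n,ii+nb-1)`.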

Page 25: Optimal coding practices for IBM POWER4 processors

Use FMA for best performance

A fused multiply-add (FMA) counts as two floating-point operations, so a program doing only additions might run at half the MFlops rate of one that pairs every multiply with an add.

    /* bad code */
    for (i = 0; i < N; i++) a[i] = s*a[i];
    printf("I did the multiply loop.\n");
    for (i = 0; i < N; i++) a[i] = b[i] + a[i];

    /* good code */
    for (i = 0; i < N; i++) a[i] = b[i] + s*a[i];

Note: C++ operator overloading could result in “bad code”; it requires careful examination.

Page 26: Optimal coding practices for IBM POWER4 processors

How to get the most MFlops

• Operate within L1 and L2 cache via blocking
• Avoid TLB misses (stride 1 as much as possible)
• Multiplies must be paired with adds or subtracts so that each FMA is two flops
• FMAs must be independent (and at least eight in number to keep two pipes of depth four going)
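The independence requirement can be illustrated with a dot product. A single accumulator makes every multiply-add wait for the previous one; splitting the sum across eight accumulators, the count given above, removes that dependence. This is a hypothetical sketch (the name `dot8` is made up), and because it reorders the additions the result can differ from a single-accumulator sum in the last bits.

```c
#include <stddef.h>

/* Dot product with eight independent partial sums, so eight
   multiply-adds can be in flight instead of forming one serial chain. */
double dot8(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        s4 += a[i + 4] * b[i + 4];
        s5 += a[i + 5] * b[i + 5];
        s6 += a[i + 6] * b[i + 6];
        s7 += a[i + 7] * b[i + 7];
    }

    /* Combine the partial sums, then add the 0 to 7 leftover elements. */
    double sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

An optimizing compiler may perform this transformation itself at higher optimization levels; writing it by hand makes the intent explicit.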

Page 27: Optimal coding practices for IBM POWER4 processors

Peak Mflops example

! Matrix multiply kernel
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    do k=kk,min(n,kk+nb-1)
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo

! Same code but scalar explicitly stated
! Good, but load/store bound
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    s = d(i,j)
    do k=kk,min(n,kk+nb-1)
      s = s + a(j,k)*b(k,i)
    enddo
    d(i,j) = s
  enddo
enddo

Page 28: Optimal coding practices for IBM POWER4 processors

Peak Mflops (cont.)

5x4 hand unrolling to maximize FMA and register usage

do i=ii,min(n,ii+nb-1),5
  do j=jj,min(n,jj+nb-1),4
    s00 = d(i+0,j+0)
    s10 = d(i+1,j+0)
    s20 = d(i+2,j+0)
    s30 = d(i+3,j+0)
    s40 = d(i+4,j+0)
    s01 = d(i+0,j+1)
    s11 = d(i+1,j+1)
    s21 = d(i+2,j+1)
    s31 = d(i+3,j+1)
    s41 = d(i+4,j+1)
    s02 = d(i+0,j+2)
    s12 = d(i+1,j+2)
    s22 = d(i+2,j+2)
    s32 = d(i+3,j+2)
    s42 = d(i+4,j+2)
    s03 = d(i+0,j+3)
    s13 = d(i+1,j+3)
    s23 = d(i+2,j+3)
    s33 = d(i+3,j+3)
    s43 = d(i+4,j+3)
    do k=kk,min(n,kk+nb-1)
      s00 = s00 + a(j+0,k)*b(k,i+0)
      s10 = s10 + a(j+0,k)*b(k,i+1)
      s20 = s20 + a(j+0,k)*b(k,i+2)
      s30 = s30 + a(j+0,k)*b(k,i+3)
      s40 = s40 + a(j+0,k)*b(k,i+4)
      s01 = s01 + a(j+1,k)*b(k,i+0)
      s11 = s11 + a(j+1,k)*b(k,i+1)
      s21 = s21 + a(j+1,k)*b(k,i+2)
      s31 = s31 + a(j+1,k)*b(k,i+3)
      s41 = s41 + a(j+1,k)*b(k,i+4)
      s02 = s02 + a(j+2,k)*b(k,i+0)
      s12 = s12 + a(j+2,k)*b(k,i+1)
      s22 = s22 + a(j+2,k)*b(k,i+2)
      s32 = s32 + a(j+2,k)*b(k,i+3)
      s42 = s42 + a(j+2,k)*b(k,i+4)
      s03 = s03 + a(j+3,k)*b(k,i+0)
      s13 = s13 + a(j+3,k)*b(k,i+1)
      s23 = s23 + a(j+3,k)*b(k,i+2)
      s33 = s33 + a(j+3,k)*b(k,i+3)
      s43 = s43 + a(j+3,k)*b(k,i+4)
    enddo
    d(i+0,j+0) = s00
    d(i+1,j+0) = s10
    d(i+2,j+0) = s20
    d(i+3,j+0) = s30
    d(i+4,j+0) = s40
    d(i+0,j+1) = s01
    d(i+1,j+1) = s11
    d(i+2,j+1) = s21
    d(i+3,j+1) = s31
    d(i+4,j+1) = s41
    d(i+0,j+2) = s02
    d(i+1,j+2) = s12
    d(i+2,j+2) = s22
    d(i+3,j+2) = s32
    d(i+4,j+2) = s42
    d(i+0,j+3) = s03
    d(i+1,j+3) = s13
    d(i+2,j+3) = s23
    d(i+3,j+3) = s33
    d(i+4,j+3) = s43
  enddo
enddo
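The idea behind the unrolling is easier to see at a smaller size. In the 5x4 version, each pass of the k loop loads 4 values of a and 5 values of b and feeds 20 independent multiply-adds, compared with 2 loads per multiply-add in the scalar version. The sketch below is a hypothetical 2x2 analogue in C (4 loads feeding 4 multiply-adds). It uses the conventional row-major d = d + a*b indexing rather than the slide's a(j,k)*b(k,i), and it assumes n is even so no cleanup code is needed.

```c
#include <stddef.h>

/* d += a * b for n x n row-major matrices, n assumed even.
   A 2x2 tile of d is held in four scalars across the whole k loop, so
   each k iteration does 4 loads and 4 independent multiply-adds. */
void matmul_2x2(size_t n, const double *a, const double *b, double *d)
{
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            double s00 = d[i * n + j];
            double s01 = d[i * n + j + 1];
            double s10 = d[(i + 1) * n + j];
            double s11 = d[(i + 1) * n + j + 1];
            for (size_t k = 0; k < n; k++) {
                double a0 = a[i * n + k];
                double a1 = a[(i + 1) * n + k];
                double b0 = b[k * n + j];
                double b1 = b[k * n + j + 1];
                s00 += a0 * b0;
                s01 += a0 * b1;
                s10 += a1 * b0;
                s11 += a1 * b1;
            }
            d[i * n + j]           = s00;
            d[i * n + j + 1]       = s01;
            d[(i + 1) * n + j]     = s10;
            d[(i + 1) * n + j + 1] = s11;
        }
    }
}
```

Larger tiles improve the ratio of multiply-adds to loads until the accumulators and operands no longer fit in the floating-point registers. The 5x4 tile needs 20 accumulators plus 9 operands, which fits within the 32 floating-point registers of the PowerPC architecture.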

Page 29: Optimal coding practices for IBM POWER4 processors

Avoid divides: only one FPU on POWER4 does divides!

Example 1. For simple cases, the compiler does this for you.

Untuned:
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/C(I)
      ENDDO

Tuned:
      DO I=1,N
        OC=1.0/C(I)
        A(I)=B(I)*OC
        P(I)=Q(I)*OC
      ENDDO

Example 2. Clever method to replace 2 divides by 1 divide and 5 multiplies and use both FPUs.

Untuned:
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/D(I)
      ENDDO

Tuned:
      DO I=1,N
        OCD=1.0/(C(I)*D(I))
        A(I)=B(I)*D(I)*OCD
        P(I)=Q(I)*C(I)*OCD
      ENDDO
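The two-denominator trick rests on the identities b/c = b*d/(c*d) and q/d = q*c/(c*d). A minimal C sketch of both forms follows, with hypothetical function names. One caveat that is an addition here, not a claim from the slides: the tuned form rounds differently, so results can differ from true division in the last bits, and the product c*d can overflow or underflow where c and d alone would not.

```c
/* Untuned: two divides. */
void two_divides(double b, double c, double q, double d,
                 double *a, double *p)
{
    *a = b / c;
    *p = q / d;
}

/* Tuned: one divide and five multiplies, using
   b/c = b*d/(c*d) and q/d = q*c/(c*d). */
void one_divide(double b, double c, double q, double d,
                double *a, double *p)
{
    double ocd = 1.0 / (c * d);
    *a = b * d * ocd;
    *p = q * c * ocd;
}
```

Whether bit-for-bit agreement with the divided form matters is application dependent; comparing the two on representative data is a reasonable check before adopting the tuned form.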

Page 30: Optimal coding practices for IBM POWER4 processors

Minimize expensive intrinsic calls

Untuned:
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SIN(X(J))
        ENDDO
      ENDDO

Tuned:
      DIMENSION SINX(N)
      ...
      DO J=1,N
        SINX(J)=SIN(X(J))
      ENDDO
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SINX(J)
        ENDDO
      ENDDO
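The saving can be made visible by counting calls. In the sketch below, `expensive` is a hypothetical stand-in for SIN that increments a counter, so the example needs no math library and no timing; the array layout (m columns of n elements, stored column by column) mirrors the Fortran A(J,I). All names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for an expensive intrinsic such as SIN.
   It counts its own calls so the two versions can be compared. */
static unsigned long expensive_calls = 0;

static double expensive(double x)
{
    expensive_calls++;
    return x * x + 1.0;
}

/* Untuned: expensive() is evaluated n*m times. */
void scale_untuned(size_t n, size_t m, const double *x,
                   const double *b, double *a)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] = b[i * n + j] * expensive(x[j]);
}

/* Tuned: expensive() is evaluated n times into the caller-provided
   work array fx (the SINX of the Fortran version), then reused. */
void scale_tuned(size_t n, size_t m, const double *x,
                 const double *b, double *a, double *fx)
{
    for (size_t j = 0; j < n; j++)
        fx[j] = expensive(x[j]);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] = b[i * n + j] * fx[j];
}
```

With n = 4 and m = 3 the untuned version makes 12 calls and the tuned version makes 4, at the cost of n extra doubles of work space.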