Performance Issues Application Programmers View
John CownieHPC Benchmark Engineer
2
Agenda
• Running 64-bit and 32-bit codes under AMD64 (SuSE Linux)
• FPU and memory performance issues
• OS and memory layout (1P, 2P, 4P)
• Spectrum of application needs: memory vs. CPU
• Benchmark examples: STREAM, HPL
• Some real applications
• Conclusions
3
64-bit OS & Application Interaction
32-bit Compatibility Mode
• 64-bit OS runs existing 32-bit APPs with leading edge performance
• No recompile required, 32-bit code directly executed by CPU
• 64-bit OS provides 32-bit libraries and “thunking” translation layer for 32-bit system calls.
64-bit Mode
• Migrate only where warranted, and at the user’s pace to fully exploit AMD64
• 64-bit OS requires all kernel-level programs & drivers to be ported.
• Any program that is linked or plugged in to a 64-bit program (ABI-level) must be ported to 64-bits.
[Diagram: USER/KERNEL split. The 64-bit operating system with 64-bit device drivers runs 32-bit threads (32-bit application, 4 GB expanded address space) through a translation layer, alongside native 64-bit threads (64-bit application, 512 GB or 8 TB address space).]
4
Increased Memory for 32-bit Applications
[Diagram: side-by-side memory maps. 32-bit server, 4 GB DRAM: the 32-bit OS and 32-bit apps share one 4 GB virtual memory space, all mapped onto the same 4 GB of DRAM. 64-bit server, 12 GB DRAM: each 32-bit app has its own unshared 0-4 GB virtual space within a 256 TB virtual memory, the 64-bit OS lives above 32 bits, and everything is backed by 12 GB of DRAM.]

• OS & app share a small 32-bit VM space
• 32-bit OS & applications all share 4 GB of DRAM
• Leads to small dataset sizes & lots of paging

versus:

• Each app has exclusive use of a 32-bit VM space
• The 64-bit OS can allocate each application large dedicated portions of the 12 GB DRAM
• The OS uses VM space way above 32 bits
• Leads to larger dataset sizes & reduced paging
5
AMD64 Developer Tools
• GNU compilers
  - GCC 3.2.2 - 32-bit and 64-bit
  - GCC 3.3 - optimized 64-bit
Compiler             OS                    Base   Peak
Intel C/C++ 7.0      Windows Server 2003   1095   1170
Intel C/C++ 7.0      Linux/x86-64          1081   1108
Intel C/C++ 7.0      Linux (32-bit)        1062   1100
GCC 3.3 (64-bit)     Linux/x86-64          1045
GCC 3.3 (32-bit)     Linux/x86-64           980
GCC 3.3 (32-bit)     Linux (32-bit)         960

1.8 GHz AMD Opteron processor - SPECint2000 (http://www.aceshardware.com/)

Optimized compilers are reaching production quality.

• PGI Workstation 5.0 beta - Windows and 64-bit Linux compilers
  Optimized Fortran 77/90, C, C++. Good flags: -O2 -fastsse
6
Running 32bit and 64bit codes
• With 64-bit addressing, memory can now be big (>4 GB)...
boris@quartet4:~> more /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 7209127936 285351936 6923776000 0 45547520 140636160
Swap: 1077469184 0 1077469184
MemTotal: 7040164 kB
• The OS has both 32-bit and 64-bit libraries...
boris@quartet4:/usr> ls
X11 bin games lib local sbin src x86_64-suse-linux X11R6
include lib64 pgi share tmp
• For gcc, 64-bit addressing is the default; use -m32 for 32-bit
• (Don't confuse 64-bit floating-point data operations with addressing and pointer lengths)
7
Running 32bit program
boris@quartet4:~/c> gcc -m32 -o test_32bit sizeof_test.c
boris@quartet4:~/c> ./test_32bit
char is 1
short is 2
int is 4
long is 4
long long is 8
unsigned long long is 8
float is 4
double is 8
int * is 4
8
Running 64bit program
• Pointers are now 64 bits long
boris@quartet4:~/c> gcc -o test_64bit sizeof_test.c
boris@quartet4:~/c> ./test_64bit
char is 1
short is 2
int is 4
long is 8
long long is 8
unsigned long long is 8
float is 4
double is 8
int * is 8
9
Compilers and flags
• Intel icc/ifc: for 32-bit code compiled on the 64-bit OS, use -W,-m elf_i386 to tell the linker to use the 32-bit libraries
• Intel icc/ifc: avoid -xaW (it tests the CPU id); use -xW to enable SSE
• PGI pgcc/pgf90: the vectorizer generates prefetch instructions, and -Mnontemporal generates streaming store instructions
• Absoft f90 looks promising
• GNU g77: the front end is limited but the GNU backend is good
• GNU gcc 3.3 is best (the gcc33 perf rpm has faster libraries?)
• GNU g++: good common code generator
• GNU gcc 3.2: good

The more compilers the better!
10
PGI Compiler - Additional Features
•Plans to bundle useful libraries
•See www.spec.org for SPEC numbers…
• MPI-CH - Pre-configured libraries and utilities for ethernet-based x86 and AMD64/Linux clusters
•PBS – Portable Batch System batch-queuing from NASA Ames and MRJ Technologies
•ScaLAPACK - Pre-compiled distributed-memory parallel Math Library
•ACML – The AMD Core Math Library is planned to be included
•Training – Tutorials (OSC), exercises, examples and benchmarks for MPI, OpenMP and HPF programming
The Portland Group Compiler Technology
11
Open Source Tools
64-bit Tools (Type / Available / Comments):

• ATLAS 3.5.0 Developer Release - Library - http://math-atlas.sourceforge.net/ - Optimized BLAS (Basic Linear Algebra Subroutines) library
• Blackdown Java Platform 2 Version 1.4.2 for Linux - Java - http://www.blackdown.com/java-linux/java2-status/jdk1.4-status.html - SUN Java products ported to Linux by the Blackdown group
• GNU binutils - Utilities - http://www.gnu.org/software/binutils/ - GNU collection of binary tools, including the GNU linker and GNU assembler
• GNU C++ (g++) 3.2, GNU C (gcc) 3.2, GNU C (gcc) 3.3 (optimized) - Compilers - http://gcc.gnu.org/ - The GNU Compiler Collection (gcc), a full-featured ANSI C compiler
• GNU Debugger (GDB) - Debugger - included with SuSE SLES 8 - Analysis tool for debugging programs
• GNU glibc 2.2.5, GNU glibc 2.3.2 (optimized) - C Library - http://www.gnu.org/software/libc/libc.html - GNU C Library
• Other GNU Tools - Various - included with SuSE SLES 8 - bash, csh, ksh, strace, libtool
• MPICH - Library - Open Source message-passing interface for Linux clusters
• PERL, Python, Ruby, Tcl/Tk - Languages - included with SuSE SLES 8 - Scripting languages

GNU means "GNU's Not UNIX™" and is the primary project of the Free Software Foundation (FSF), a non-profit organization committed to the creation of a large body of useful, free, source-code-available software.
12
32-bit vs 64-bit App Performance
% Speed-Up for 32-bit App Ported to 64-bit
[Chart: % speed-up (scale -60% to +80%) for each benchmark when the 32-bit app is ported to 64-bit. Benchmarks: Linpack (Rolled-Single, Unrolled-Single, Rolled-Double, Unrolled-Double); Stream (Copy, Scale, Add, Triad); BYTEmark™ ver. 2 (NUMERIC_SORT, STRING_SORT, BITFIELD, FP_EMUL, ASSIGN, IDEA, HUFFMAN, LU_DECOMP).]

32-bit app compiled using GCC 3.2 for x86; 64-bit app compiled using GCC 3.3 for AMD64. All run-times measured on SLES8 for AMD64 RC7.
Measured on AMD Opteron Model 144 (1.8GHz, 1MB L2, 128-bit Memory Controller, DDR-333, CL 2.5, 512MB)
13
BLAS libraries
• 3 different BLAS libraries support 32bit and 64bit code
1. ACML (includes FFTs)
2. ATLAS
3. Goto
• Currently Goto has the fastest DGEMM: ~88% of peak on 1P HPL
• Compare with BLASBENCH and pick best one for your application.
• For FFTs also consider FFTW
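All three libraries implement the same BLAS interface; as a reference for what DGEMM actually computes (C = alpha*A*B + beta*C), here is an unoptimized sketch. The function name and row-major square-matrix layout are illustrative only; the tuned ACML/ATLAS/Goto routines add the blocking and register tiling that get them near peak:

```c
/* Reference (unoptimized) DGEMM for row-major n x n matrices:
   C = alpha*A*B + beta*C.  Illustrative only -- not the library routine. */
#include <stddef.h>

void dgemm_ref(size_t n, double alpha, const double *A,
               const double *B, double beta, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];   /* inner dot product */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

The n^3 multiply-adds against n^2 data are why DGEMM can run CPU-bound near peak, unlike the BLAS1/BLAS2 routines.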
14
Optimized Numerical Libraries: ACML
• AMD and The Numerical Algorithms Group (NAG) jointly developed the AMD Core Math Library (ACML)
• ACML includes:
  • Basic Linear Algebra Subroutines (BLAS) levels 1, 2 and 3
  • A wide variety of Fast Fourier Transforms (FFTs)
  • Linear Algebra Package (LAPACK)
• ACML has:
  • Fortran and C interfaces
  • Highly optimized routines for the AMD64 instruction set
  • The ability to address single-, double-, single-complex and double-complex data types
  • Availability planned for commercially available OSs
• ACML is freely downloadable from www.developwithamd.com/acml
15
DGEMM relative performance
[Chart: K. Goto's DGEMM reaches 88% of peak FPU performance.]
Floating Point and Memory Performance
17
Register Differences: x86-64 / x86-32
• x86-64
  - 64-bit integer registers
  - 48-bit virtual addresses
  - 40-bit physical addresses
• REX - register extensions
  - Sixteen 64-bit integer registers
  - Sixteen 128-bit SSE registers
• SSE2 instruction set
  - Double-precision scalar and vector operations
  - 16x8- and 8x16-way vector MMX operations
  - SSE1 already added with AMD Athlon MP

[Diagram: the x86 register file (EAX with AH/AL, the eight GPRs, EIP, MMX0-MMX7, 128-bit SSE registers) and what x86-64 adds: registers widened to 64 bits (RAX), new integer registers R8-R15, and new SSE registers XMM8-XMM15.]
18
Floating point hardware
• 4-cycle-deep pipeline
• Separate multiply and add paths
• 64-bit (double precision): 2 flops/cycle (1 mul + 1 add)
• 32-bit (single precision): 4 flops/cycle (2 muls + 2 adds)
• At a 2.0 GHz clock, 1 cycle = 0.5 ns, which gives...
• Theoretical peak of 4 Gflops double precision
• SSE registers are 128 bits wide... but packed instructions only help single precision
• Pipeline depth and the separate mul/add paths mean that even register-to-register codes are helped by loop unrolling
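The loop-unrolling point can be sketched in C (function names are hypothetical). A rolled multiply-add reduction serializes on the dependent add through the 4-cycle pipeline; unrolling with independent accumulators lets the separate multiply and add paths overlap:

```c
#include <stddef.h>

/* Rolled reduction: every iteration's add depends on the previous one,
   so the 4-cycle FPU pipeline is mostly idle. */
double dot_rolled(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* 4-way unrolled: four independent accumulators keep both the multiply
   and add pipes busy (n assumed divisible by 4 for brevity). */
double dot_unrolled(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

This is the Rolled/Unrolled distinction in the Linpack bars of the 32-bit vs 64-bit chart earlier in the deck.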
19
AMD Opteron™ Processor Architecture
[Diagram: AMD Opteron™ processor. The CPU core connects through the system request queue (SRQ) and crossbar (XBAR) to the on-chip memory controller (MCT), which drives 128-bit DDR333 DRAM at 5.3 GB/s, and to three coherent HyperTransport™ links: 3.2 GB/s per direction at 800 MHz dual data rate (1600 MT/s), i.e. 6.4 GB/s per link.]
20
Main Memory hardware
• Dual on-chip memory controllers
• Remote memory is accessed via HyperTransport
• Bandwidth scales with more processors
• Latency is very good at 1P, 2P, 4P (cache probes on 2P, 4P)
• Local memory latency is lower than remote memory latency
• 2P machine: 1 hop (worst case)
• 4P machine: 2 hops (worst case)
• Memory DIMMs: 333 MHz (PC2700) or 266 MHz (PC2100)
• Can interleave memory banks (BIOS)
• Can interleave processor memory (BIOS)
21
Integrated Memory Controller Performance
– Peak Bandwidth and Latency
– Performance improves by almost 20% compared to AMD Athlon™ topology
Peak Memory Bandwidth
                 64-Bit DCT   128-Bit DCT
DDR200 PC1600    1.6 GB/s     3.2 GB/s
DDR266 PC2100    2.1 GB/s     4.2 GB/s
DDR333 PC2700    2.7 GB/s     5.33 GB/s
Idle Latencies to First Data
•1P System: <80ns
•0-Hop in DP System: <80ns
•0-Hop in 4P System: ~100ns
•1-Hop in MP System: <115ns
•2-Hop in MP System: <150ns
•3-Hop in MP System: <190ns
22
Integrated Memory Controller - Local versus Remote Memory Access

[Diagram: four processors P0-P3 connected in a square, illustrating 0-hop, 1-hop and 2-hop memory accesses.]

• 0 hop: local memory access
• 1 hop: remote-1 memory access
• 2 hops: remote-2 memory access
• Probe: coherency request between nodes
• Diameter: maximum hop count between any pair of nodes
• Average distance: average hop count between nodes
23
MP Memory Bandwidth Scalability
[Chart: memory bandwidth scalability in GB/s (0-18) versus number of processors in the system (1P, 2P, 4P), showing Local B/W and Xfire B/W; both scale with processor count.]
24
How should the OS allocate memory ?
• To maximize local accesses ?
• To get best bandwidth across all processors ?
• Different needs from different applications
• Scientific MPI codes already have a model of networked 1P machines
• Enterprise codes (databases, web servers) have lots of threads; maybe throughput with scrambled (interleaved) memory is best?
• SMP kernel plus processor interleaving
• NUMA kernel plus memory bank interleaving
• Suse NUMACTL utility allows policy choice per process.
25
MPI libraries (shmem buffers where ?)
• Argonne MPICH: compile with gcc -O2 -funroll-all-loops
• Myrinet, Quadrics and Dolphin also have MPI libraries based on MPICH
• On an MP machine, MPI uses shared memory within the box
• Where do the MPI message buffers go?
• Currently it just mallocs one chunk of space for all processors
• OK for 2P... not so good for a 4P machine (2 hops worst case, contention)
• Scope to improve MPI performance on a 4P NUMA machine with better buffer placement...
Synthetic Benchmarks
27
Spectrum of application needs
•Some codes are memory limited (BIG data)
•Others are CPU bound (kernels and SMALL data)
• ALL memory-bound codes like low memory latency!
•Example in BLAS libraries
•BLAS1 vector-vector … Memory Intensive …STREAM
•BLAS2 matrix-vector
•BLAS3 matrix-matrix … CPU Intensive … DGEMM, HPL
• A faster CPU will not help memory-bound problems
• PREFETCH can sometimes hide memory latency (on predictable memory access patterns)
28
LM Bench (memory latency)
– LMbench has published comparative numbers
– The fastest latency is in the 1P Opteron machine - no HyperTransport cache probes
– Note that HP sometimes reports a "within cache line stride" latency number of 110.6. Opteron's "within cache line stride" latencies are 25 ns for 16 bytes and 53 ns for 32 bytes.

Memory latencies in ns (smaller is better)

             Opteron 1P   Opteron 2P   Opteron 2P   HP zx6000         PC800 Xeon      SUN Blade
             1.8GHz       1.6GHz       2.0GHz       Itanium2 1.0GHz   /i850/ RDRAM    UltraSPARC III
                                                    DDR SDRAM                         900MHz
L1 cache     1.7          1.8          1.5
L2 cache     9.0          10.0         7.7
Main memory  63           99           89           156               209             172

http://www.hp.com/products1/itanium/performance/architecture/lmbench.html
29
Measured Memory latency
• Physical limit: light barely crosses the board in 1 ns
• At 2.0 GHz, clock ticks are 0.5 ns...
• L1 cache: 1.5 ns
• L2 cache: 7.7 ns
• Main memory: 89 ns
• Try to hide the big hit of main-memory access
• Codes with predictable access patterns use PREFETCH (in various flavours) to hide latency
30
High Performance Linpack
• A CPU bound problem (memory system not stressed)
• Solves A x = b using LU factorization
• Results Peak Gflops rate achieved for matrix size NxN
• Almost all time is spent in DGEMM
• Use MPI message passing
• Larger N the better – fill memory per node
• Used in www.top500.org ranking of supercomputers
• N1/2 is the problem size at which half the maximum Gflops rate is achieved - a measure of overhead
• The current number-one machine is the Earth Simulator (40 Tflops); Cray Red Storm (10,000 Opterons) has a comparable peak
31
Quartet: 4u4P AMD OpteronTM MP Processor
[Diagram: Quartet platform. Four AMD Opteron™ processors (uP 1-3 plus a boot uP), each with 200-333 MHz 144-bit registered DDR memory, linked by 16x16 coherent HyperTransport at 6.4 GB/s. AMD-8131™ HyperTransport™ PCI-X tunnels provide PCI-X slots (64 bits at 66, 100 and 133 MHz, with hot plug), dual GbE (1000BaseT) and dual U320 SCSI. An AMD-8111™ HyperTransport™ I/O hub on a 1.6 GB/s link (800 MB/s HyperTransport optional) provides legacy PCI, AC'97, 10/100 Ethernet, USB 2.0, SM Bus, EIDE, LPC/SIO, FLASH (BIOS), ATI Rage graphics, and a BMC for keyboard, mouse, hardware/fan monitoring and 10/100 LAN.]
32
MPI vs threaded BLAS ?
• BLAS libraries can use thread-level parallelism to exploit an MP node
• MPI can treat the MP node's processors as separate machines talking via shared memory
• Which is best?
• The NUMA kernel allocates memory locally for each process
• But within the box, MPI on 4P has memory-placement issues
• MPI with single-threaded BLAS performs best with the NUMA kernel
• Mixing OpenMP and MPI is possible and maybe sensible
• Generally a static up-front decomposition feels better?
33
High Performance Linpack
GOTO Library Results
AMD Opteron™ system                                  #P   Rmax      Nmax     N1/2     Rpeak     GFlops/   Rmax/
                                                          (GFlops)  (order)  (order)  (GFlops)  proc      Rpeak
4P AMD Opteron 1.8GHz, 2GB/proc PC2700, 8GB total     4   12.06     28000    1008     14.4      3.02      83.8%
2P AMD Opteron 1.8GHz, 2GB/proc PC2700, 4GB total     2    6.22     20617     672      7.2      3.11      86.4%
1P AMD Opteron 1.8GHz, 2GB PC2700                     1    3.14     15400     336      3.6      3.14      87.1%

• Optimized High-Performance BLAS by Kazushige Goto: http://www.cs.utexas.edu/users/flame/goto
GOTO results were obtained with 64-bit SuSE 8.1 Linux Professional Edition with the NUMA kernel and the Myrinet MPIch-gm-1.2.5..10 message-passing library.
34
HPL on 16P (4x4P) Opteron Cluster
• Machine: 4 x 4P 1.8 GHz, 2 GB/processor, with a single Myrinet and gigabit ethernet per box
• Goto DGEMM (single-threaded) and MPICH-GM
• Myrinet, 8 processors: N=40000, N1/2=2252, 81.3% of peak
• Myrinet, 16 processors: N=41288, N1/2=3584, 80.5% of peak
• Ethernet, 16 processors: N=54616, N1/2=5768, 78.1% of peak
• A big run to show >4 GB/processor working:
• 4P 1.8 GHz, 8 GB/processor (32 GB in all, 266 MHz memory)
• 4 processors: N=60144, N1/2=1123, 80.56% of peak
35
STREAM
• Measures sustainable memory bandwidth (in MB/s)
• Simple kernels on large vectors (~50Mbytes)
• Vectors must be much bigger than cache
• Machine balance defined as:
peak floating ops/cycle / sustained memory ops/cycle
name    kernel               bytes/iter   FLOPS/iter
COPY    a(i) = b(i)          16           0
SCALE   a(i) = q*b(i)        16           1
SUM     a(i) = b(i) + c(i)   24           1
TRIAD   a(i) = b(i) + q*c(i) 24           2
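The TRIAD kernel written out in C (a sketch of the loop only; STREAM itself wraps loops like this in timing code and OpenMP directives):

```c
#include <stddef.h>

/* STREAM TRIAD: a(i) = b(i) + q*c(i).
   Per iteration: two 8-byte loads and one 8-byte store (24 bytes moved
   under STREAM's counting rules) and 2 flops, as in the table. */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + q * c[j];
}
```

With vectors far larger than cache, the measured MB/s of this loop is the machine's sustainable memory bandwidth, not its FPU rate.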
36
Compiling STREAM
• The PGI compiler recognises OpenMP thread directives and generates prefetch instructions and streaming stores
pgf77 -Mnontemporal -O3 -fastsse -Minfo=all -Bstatic -mp -c -o stream_d_f.o stream_d.f
180, Parallel loop activated; static block iteration allocation
Generating sse code for inner loop
Generated prefetch instructions for 2 loads
37
STREAM results
• 4P Opteron 2.0 GHz with 333 MHz memory
• Compiled: pgf77 -O3 -mp -fastsse -Mnontemporal
• 64-bit word flops
• Rates in Mbytes/sec
• TRIAD: ~310 Mflops/processor (about the same as a Cray YMP?)

Kernel   1P     2P     4P
COPY     3640   7061   13142
SCALE    3542   7078   13344
SUM      3583   6916   12695
TRIAD    3610   6962   12759
Application Benchmarks
39
Greedy – Travelling Salesman Problem
• Solves the traveling salesman problem; sensitive to memory latency and bandwidth
• An example of the increasing importance of memory performance as problems grow from small to large

Greedy - Traveling Salesman Problem
Solve times in secs (smaller is better)

Problem      Pentium III 800MHz   Itanium 2 1.0GHz   Opteron 1.6GHz
1000x1000    12                   6                  3
316x3162     12                   7                  4
100x10000    18                   9                  5
32x31623     28                   11                 8
10x100000    33                   16                 12
3x316227     37                   26                 14
1x1000000    50                   45                 20
1x3162278    180                  164                98
1x10000000   665                  554                434

Results: www.research.att.com/~dsj/chtsp/speeds.html
40
OPA -Parallel Ocean Model
– A large scientific code from France
– It uses LAM-MPI
– Compiled in France with ifc 7.1; the binary was run in the USA using the same 32-bit version of LAM-MPI, ifc runtime and LD_LIBRARY_PATH settings

          Melody 6         Melody 8         McKinley          ML 570
          Opteron 1.4GHz   Opteron 1.6GHz   Itanium2 1.0GHz   Intel Xeon 2.0GHz
          Linux SMP        Linux NUMA
          2494             1772             3602              3071
41
PARAPYR
• Direct Numerical Simulation of Turbulent Combustion
• Single-processor performance on a 1P system
• SuSE AMD64 Linux, NUMA kernel
• Problem size (small): 340x340
• Mflops are double precision (64-bit)
• Ran the identical statically linked binary on an Intel P4 2.8 GHz (Dell)
• Compiled with ifc 7.1 -r8 -ip
• Opteron 1.8 GHz: 420 Mflops
• P4 2.8 GHz: 351 Mflops

Best 1.8 GHz Opteron results (vs. beta Absoft and PGI compilers):
435 Mflops: ifc -r8 -xW -static
437 Mflops: ifc -r8 -O2 -xW -ipo -static -FDO
(-FDO: pass 1 = -prof_gen, pass 2 = -prof_use)
42
43
Conclusions
– Use an AMD64 64-bit OS with NUMA support
– 32-bit compiled applications run well
– Know your application's memory and CPU needs (so you know what to expect)
– 64-bit compilers need work (as ever)
– Competitive processors (Itanium 1.5 GHz, Xeon 3.0 GHz) have higher peak FLOPs but relatively poor memory scaling
– Highly tuned benchmarks that make heavy use of floating point and either fit in cache or make low use of the memory system perform better on Itanium; the Xeon memory system does not scale well
• Opteron has excellent memory-system performance and scalability - both bandwidth and latency
– Codes that depend on memory latency or bandwidth perform better on Opteron
– Codes with a mix of integer and floating point will perform better on Opteron
– Code that is not highly tuned will likely perform better on Opteron
• More upside from MPI 4P memory layout (MPICH2?) and 64-bit compilers
44
AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, 3DNow! and combinations thereof, AMD-8100, AMD-8111 and AMD-8131 AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Pentium and MMX are registered trademarks of Intel Corporation in the U.S. and/or other jurisdictions. SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation in the U.S. and/or other jurisdictions. Other product and company names used in this presentation are for identification purposes only and may be trademarks of their respective companies.