Performance Issues Application Programmers View
John CownieHPC Benchmark Engineer
2
Agenda
• Running 64-bit and 32-bit codes under AMD64 (SuSE Linux)
• FPU and memory performance issues
• OS and memory layout (1P, 2P, 4P)
• Spectrum of application needs: memory vs. CPU
• Benchmark examples: STREAM, HPL
• Some real applications
• Conclusions
3
64-bit OS & Application Interaction
32-bit Compatibility Mode
• 64-bit OS runs existing 32-bit APPs with leading edge performance
• No recompile required, 32-bit code directly executed by CPU
• 64-bit OS provides 32-bit libraries and “thunking” translation layer for 32-bit system calls.
64-bit Mode
• Migrate only where warranted, and at the user’s pace to fully exploit AMD64
• 64-bit OS requires all kernel-level programs & drivers to be ported.
• Any program that is linked or plugged in to a 64-bit program (ABI-level) must be ported to 64-bits.
[Diagram: USER/KERNEL split. The 64-bit operating system with 64-bit device drivers runs 32-bit threads (32-bit application, 4 GB expanded address space) through a translation layer, alongside native 64-bit threads (64-bit application, 512 GB or 8 TB address space).]
4
Increased Memory for 32-bit Applications
[Diagram: side-by-side memory maps. 32-bit server, 4 GB DRAM: the 32-bit OS and 32-bit apps share one 4 GB virtual memory space, all mapped onto the same 4 GB of DRAM. 64-bit server, 12 GB DRAM: each 32-bit app has its own unshared 0-4 GB virtual space within a 256 TB virtual memory, the 64-bit OS lives above 32 bits, and everything is backed by 12 GB of DRAM.]

• OS & app share a small 32-bit VM space
• 32-bit OS & applications all share 4 GB of DRAM
• Leads to small dataset sizes & lots of paging

versus:

• Each app has exclusive use of a 32-bit VM space
• The 64-bit OS can allocate each application large dedicated portions of the 12 GB DRAM
• The OS uses VM space way above 32 bits
• Leads to larger dataset sizes & reduced paging
5
AMD64 Developer Tools
• GNU compilers
  - GCC 3.2.2 - 32-bit and 64-bit
  - GCC 3.3 - optimized 64-bit
Compiler             OS                    Base   Peak
Intel C/C++ 7.0      Windows Server 2003   1095   1170
Intel C/C++ 7.0      Linux/x86-64          1081   1108
Intel C/C++ 7.0      Linux (32-bit)        1062   1100
GCC 3.3 (64-bit)     Linux/x86-64          1045
GCC 3.3 (32-bit)     Linux/x86-64           980
GCC 3.3 (32-bit)     Linux (32-bit)         960

1.8 GHz AMD Opteron processor - SPECint2000 (http://www.aceshardware.com/)

Optimized compilers are reaching production quality.

• PGI Workstation 5.0 beta - Windows and 64-bit Linux compilers
  Optimized Fortran 77/90, C, C++. Good flags: -O2 -fastsse
6
Running 32bit and 64bit codes
• With 64-bit addressing, memory can now be big (>4 GB)...
boris@quartet4:~> more /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 7209127936 285351936 6923776000 0 45547520 140636160
Swap: 1077469184 0 1077469184
MemTotal: 7040164 kB
• The OS has both 32-bit and 64-bit libraries...
boris@quartet4:/usr> ls
X11 bin games lib local sbin src x86_64-suse-linux X11R6
include lib64 pgi share tmp
• For gcc, 64-bit addressing is the default; use -m32 for 32-bit
• (Don't confuse 64-bit floating-point data operations with addressing and pointer lengths)
7
Running 32bit program
boris@quartet4:~/c> gcc -m32 -o test_32bit sizeof_test.c
boris@quartet4:~/c> ./test_32bit
char is 1
short is 2
int is 4
long is 4
long long is 8
unsigned long long is 8
float is 4
double is 8
int * is 4
8
Running 64bit program
• Pointers are now 64 bits long
boris@quartet4:~/c> gcc -o test_64bit sizeof_test.c
boris@quartet4:~/c> ./test_64bit
char is 1
short is 2
int is 4
long is 8
long long is 8
unsigned long long is 8
float is 4
double is 8
int * is 8
9
Compilers and flags
• Intel icc/ifc: for 32-bit code compiled on the 64-bit OS, use -W,-m elf_i386 to tell the linker to use the 32-bit libraries
• Intel icc/ifc: avoid -xaW (it tests the CPU id); use -xW to enable SSE
• PGI pgcc/pgf90: the vectorizer generates prefetch instructions, and -Mnontemporal generates streaming store instructions
• Absoft f90 looks promising
• GNU g77: the front end is limited but the GNU backend is good
• GNU gcc 3.3 is best (the gcc33 perf rpm has faster libraries?)
• GNU g++: good common code generator
• GNU gcc 3.2: good

The more compilers the better!
10
PGI Compiler - Additional Features
•Plans to bundle useful libraries
•See www.spec.org for SPEC numbers…
• MPI-CH - Pre-configured libraries and utilities for ethernet-based x86 and AMD64/Linux clusters
•PBS – Portable Batch System batch-queuing from NASA Ames and MRJ Technologies
•ScaLAPACK - Pre-compiled distributed-memory parallel Math Library
•ACML – The AMD Core Math Library is planned to be included
•Training – Tutorials (OSC), exercises, examples and benchmarks for MPI, OpenMP and HPF programming
The Portland Group Compiler Technology
11
Open Source Tools
64-bit Tools (Type / Available / Comments):

• ATLAS 3.5.0 Developer Release - Library - http://math-atlas.sourceforge.net/ - Optimized BLAS (Basic Linear Algebra Subroutines) library
• Blackdown Java Platform 2 Version 1.4.2 for Linux - Java - http://www.blackdown.com/java-linux/java2-status/jdk1.4-status.html - SUN Java products ported to Linux by the Blackdown group
• GNU binutils - Utilities - http://www.gnu.org/software/binutils/ - GNU collection of binary tools, including the GNU linker and GNU assembler
• GNU C++ (g++) 3.2, GNU C (gcc) 3.2, GNU C (gcc) 3.3 (optimized) - Compilers - http://gcc.gnu.org/ - The GNU Compiler Collection (gcc), a full-featured ANSI C compiler
• GNU Debugger (GDB) - Debugger - included with SuSE SLES 8 - Analysis tool for debugging programs
• GNU glibc 2.2.5, GNU glibc 2.3.2 (optimized) - C Library - http://www.gnu.org/software/libc/libc.html - GNU C Library
• Other GNU Tools - Various - included with SuSE SLES 8 - bash, csh, ksh, strace, libtool
• MPICH - Library - Open Source message-passing interface for Linux clusters
• PERL, Python, Ruby, Tcl/Tk - Languages - included with SuSE SLES 8 - Scripting languages

GNU means "GNU's Not UNIX™" and is the primary project of the Free Software Foundation (FSF), a non-profit organization committed to the creation of a large body of useful, free, source-code-available software.
12
32-bit vs 64-bit App Performance
% Speed-Up for 32-bit App Ported to 64-bit
[Chart: % speed-up (scale -60% to +80%) for each benchmark when the 32-bit app is ported to 64-bit. Benchmarks: Linpack (Rolled-Single, Unrolled-Single, Rolled-Double, Unrolled-Double); Stream (Copy, Scale, Add, Triad); BYTEmark™ ver. 2 (NUMERIC_SORT, STRING_SORT, BITFIELD, FP_EMUL, ASSIGN, IDEA, HUFFMAN, LU_DECOMP).]

32-bit app compiled using GCC 3.2 for x86; 64-bit app compiled using GCC 3.3 for AMD64. All run-times measured on SLES8 for AMD64 RC7.
Measured on AMD Opteron Model 144 (1.8GHz, 1MB L2, 128-bit Memory Controller, DDR-333, CL 2.5, 512MB)
13
BLAS libraries
• 3 different BLAS libraries support 32bit and 64bit code
1. ACML (includes FFTs)
2. ATLAS
3. Goto
• Currently Goto has the fastest DGEMM: ~88% of peak on 1P HPL
• Compare with BLASBENCH and pick best one for your application.
• For FFTs also consider FFTW
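All three libraries implement the same BLAS interface; as a reference for what DGEMM actually computes (C = alpha*A*B + beta*C), here is an unoptimized sketch. The function name and row-major square-matrix layout are illustrative only; the tuned ACML/ATLAS/Goto routines add the blocking and register tiling that get them near peak:

```c
/* Reference (unoptimized) DGEMM for row-major n x n matrices:
   C = alpha*A*B + beta*C.  Illustrative only -- not the library routine. */
#include <stddef.h>

void dgemm_ref(size_t n, double alpha, const double *A,
               const double *B, double beta, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];   /* inner dot product */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

The n^3 multiply-adds against n^2 data are why DGEMM can run CPU-bound near peak, unlike the BLAS1/BLAS2 routines.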
14
Optimized Numerical Libraries: ACML
• AMD and The Numerical Algorithms Group (NAG) jointly developed the AMD Core Math Library (ACML)
• ACML includes:
  • Basic Linear Algebra Subroutines (BLAS) levels 1, 2 and 3
  • A wide variety of Fast Fourier Transforms (FFTs)
  • Linear Algebra Package (LAPACK)
• ACML has:
  • Fortran and C interfaces
  • Highly optimized routines for the AMD64 instruction set
  • The ability to address single-, double-, single-complex and double-complex data types
  • Availability planned for commercially available OSs
• ACML is freely downloadable from www.developwithamd.com/acml
15
DGEMM relative performance
[Chart: K. Goto's DGEMM reaches 88% of peak FPU performance.]
Floating Point and Memory Performance
17
Register Differences: x86-64 / x86-32
• x86-64
  - 64-bit integer registers
  - 48-bit virtual addresses
  - 40-bit physical addresses
• REX - register extensions
  - Sixteen 64-bit integer registers
  - Sixteen 128-bit SSE registers
• SSE2 instruction set
  - Double-precision scalar and vector operations
  - 16x8- and 8x16-way vector MMX operations
  - SSE1 already added with AMD Athlon MP

[Diagram: the x86 register file (EAX with AH/AL, the eight GPRs, EIP, MMX0-MMX7, 128-bit SSE registers) and what x86-64 adds: registers widened to 64 bits (RAX), new integer registers R8-R15, and new SSE registers XMM8-XMM15.]
18
Floating point hardware
• 4-cycle-deep pipeline
• Separate multiply and add paths
• 64-bit (double precision): 2 flops/cycle (1 mul + 1 add)
• 32-bit (single precision): 4 flops/cycle (2 muls + 2 adds)
• At a 2.0 GHz clock, 1 cycle = 0.5 ns, which gives...
• Theoretical peak of 4 Gflops double precision
• SSE registers are 128 bits wide... but packed instructions only help single precision
• Pipeline depth and the separate mul/add paths mean that even register-to-register codes are helped by loop unrolling
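The loop-unrolling point can be sketched in C (function names are hypothetical). A rolled multiply-add reduction serializes on the dependent add through the 4-cycle pipeline; unrolling with independent accumulators lets the separate multiply and add paths overlap:

```c
#include <stddef.h>

/* Rolled reduction: every iteration's add depends on the previous one,
   so the 4-cycle FPU pipeline is mostly idle. */
double dot_rolled(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* 4-way unrolled: four independent accumulators keep both the multiply
   and add pipes busy (n assumed divisible by 4 for brevity). */
double dot_unrolled(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

This is the Rolled/Unrolled distinction in the Linpack bars of the 32-bit vs 64-bit chart earlier in the deck.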
19
AMD Opteron™ Processor Architecture
[Diagram: AMD Opteron™ processor. The CPU core connects through the system request queue (SRQ) and crossbar (XBAR) to the on-chip memory controller (MCT), which drives 128-bit DDR333 DRAM at 5.3 GB/s, and to three coherent HyperTransport™ links: 3.2 GB/s per direction at 800 MHz dual data rate (1600 MT/s), i.e. 6.4 GB/s per link.]
20
Main Memory hardware
• Dual on-chip memory controllers
• Remote memory is accessed via HyperTransport
• Bandwidth scales with more processors
• Latency is very good at 1P, 2P, 4P (cache probes on 2P, 4P)
• Local memory latency is lower than remote memory latency
• 2P machine: 1 hop (worst case)
• 4P machine: 2 hops (worst case)
• Memory DIMMs: 333 MHz (PC2700) or 266 MHz (PC2100)
• Can interleave memory banks (BIOS)
• Can interleave processor memory (BIOS)
21
Integrated Memory Controller Performance
– Peak Bandwidth and Latency
– Performance improves by almost 20% compared to AMD Athlon™ topology
Peak Memory Bandwidth
                 64-Bit DCT   128-Bit DCT
DDR200 PC1600    1.6 GB/s     3.2 GB/s
DDR266 PC2100    2.1 GB/s     4.2 GB/s
DDR333 PC2700    2.7 GB/s     5.33 GB/s
Idle Latencies to First Data
•1P System: <80ns
•0-Hop in DP System: <80ns
•0-Hop in 4P System: ~100ns
•1-Hop in MP System: <115ns
•2-Hop in MP System: <150ns
•3-Hop in MP System: <190ns
22
Integrated Memory Controller - Local versus Remote Memory Access

[Diagram: four processors P0-P3 connected in a square, illustrating 0-hop, 1-hop and 2-hop memory accesses.]

• 0 hop: local memory access
• 1 hop: remote-1 memory access
• 2 hops: remote-2 memory access
• Probe: coherency request between nodes
• Diameter: maximum hop count between any pair of nodes
• Average distance: average hop count between nodes
23
MP Memory Bandwidth Scalability
[Chart: memory bandwidth scalability in GB/s (0-18) versus number of processors in the system (1P, 2P, 4P), showing Local B/W and Xfire B/W; both scale with processor count.]
24
How should the OS allocate memory ?
• To maximize local accesses ?
• To get best bandwidth across all processors ?
• Different needs from different applications
• Scientific MPI codes already have a model of networked 1P machines
• Enterprise codes (databases, web servers) have lots of threads; maybe throughput with scrambled (interleaved) memory is best?
• SMP kernel plus processor interleaving
• NUMA kernel plus memory bank interleaving
• Suse NUMACTL utility allows policy choice per process.
25
MPI libraries (shmem buffers where ?)
• Argonne MPICH: compile with gcc -O2 -funroll-all-loops
• Myrinet, Quadrics and Dolphin also have MPI libraries based on MPICH
• On an MP machine, MPI uses shared memory within the box
• Where do the MPI message buffers go?
• Currently it just mallocs one chunk of space for all processors
• OK for 2P... not so good for a 4P machine (2 hops worst case, contention)
• Scope to improve MPI performance on a 4P NUMA machine with better buffer placement...
Synthetic Benchmarks
27
Spectrum of application needs
•Some codes are memory limited (BIG data)
•Others are CPU bound (kernels and SMALL data)
• ALL memory-bound codes like low memory latency!
•Example in BLAS libraries
•BLAS1 vector-vector … Memory Intensive …STREAM
•BLAS2 matrix-vector
•BLAS3 matrix-matrix … CPU Intensive … DGEMM, HPL
• A faster CPU will not help memory-bound problems
• PREFETCH can sometimes hide memory latency (on predictable memory access patterns)
28
LM Bench (memory latency)
– LMbench has published comparative numbers
– The fastest latency is in the 1P Opteron machine - no HyperTransport cache probes
– Note that HP sometimes reports a "within cache line stride" latency number of 110.6. Opteron's "within cache line stride" latencies are 25 ns for 16 bytes and 53 ns for 32 bytes.

Memory latencies in ns (smaller is better)

             Opteron 1P   Opteron 2P   Opteron 2P   HP zx6000         PC800 Xeon      SUN Blade
             1.8GHz       1.6GHz       2.0GHz       Itanium2 1.0GHz   /i850/ RDRAM    UltraSPARC III
                                                    DDR SDRAM                         900MHz
L1 cache     1.7          1.8          1.5
L2 cache     9.0          10.0         7.7
Main memory  63           99           89           156               209             172

http://www.hp.com/products1/itanium/performance/architecture/lmbench.html
29
Measured Memory latency
• Physical limit: light barely crosses the board in 1 ns
• At 2.0 GHz, clock ticks are 0.5 ns...
• L1 cache: 1.5 ns
• L2 cache: 7.7 ns
• Main memory: 89 ns
• Try to hide the big hit of main-memory access
• Codes with predictable access patterns use PREFETCH (in various flavours) to hide latency
30
High Performance Linpack
• A CPU bound problem (memory system not stressed)
• Solves A x = b using LU factorization
• Results Peak Gflops rate achieved for matrix size NxN
• Almost all time is spent in DGEMM
• Use MPI message passing
• Larger N the better – fill memory per node
• Used in www.top500.org ranking of supercomputers
• N1/2 is the problem size at which half the maximum Gflops rate is achieved - a measure of overhead
• The current number-one machine is the Earth Simulator (40 Tflops); Cray Red Storm (10,000 Opterons) has a comparable peak
31
Quartet: 4u4P AMD OpteronTM MP Processor
[Diagram: Quartet platform. Four AMD Opteron™ processors (uP 1-3 plus a boot uP), each with 200-333 MHz 144-bit registered DDR memory, linked by 16x16 coherent HyperTransport at 6.4 GB/s. AMD-8131™ HyperTransport™ PCI-X tunnels provide PCI-X slots (64 bits at 66, 100 and 133 MHz, with hot plug), dual GbE (1000BaseT) and dual U320 SCSI. An AMD-8111™ HyperTransport™ I/O hub on a 1.6 GB/s link (800 MB/s HyperTransport optional) provides legacy PCI, AC'97, 10/100 Ethernet, USB 2.0, SM Bus, EIDE, LPC/SIO, FLASH (BIOS), ATI Rage graphics, and a BMC for keyboard, mouse, hardware/fan monitoring and 10/100 LAN.]
32
MPI vs threaded BLAS ?
• BLAS libraries can use thread-level parallelism to exploit an MP node
• MPI can treat the MP node's processors as separate machines talking via shared memory
• Which is best?
• The NUMA kernel allocates memory locally for each process
• But within the box, MPI on 4P has memory-placement issues
• MPI with single-threaded BLAS performs best with the NUMA kernel
• Mixing OpenMP and MPI is possible and maybe sensible
• Generally a static up-front decomposition feels better?
33
High Performance Linpack
GOTO Library Results
AMD Opteron™ system                                  #P   Rmax      Nmax     N1/2     Rpeak     GFlops/   Rmax/
                                                          (GFlops)  (order)  (order)  (GFlops)  proc      Rpeak
4P AMD Opteron 1.8GHz, 2GB/proc PC2700, 8GB total     4   12.06     28000    1008     14.4      3.02      83.8%
2P AMD Opteron 1.8GHz, 2GB/proc PC2700, 4GB total     2    6.22     20617     672      7.2      3.11      86.4%
1P AMD Opteron 1.8GHz, 2GB PC2700                     1    3.14     15400     336      3.6      3.14      87.1%

• Optimized High-Performance BLAS by Kazushige Goto: http://www.cs.utexas.edu/users/flame/goto
GOTO results were obtained with 64-bit SuSE 8.1 Linux Professional Edition with the NUMA kernel and the Myrinet MPIch-gm-1.2.5..10 message-passing library.
34
HPL on 16P (4x4P) Opteron Cluster
• Machine: 4 x 4P 1.8 GHz, 2 GB/processor, with a single Myrinet and gigabit ethernet per box
• Goto DGEMM (single-threaded) and MPICH-GM
• Myrinet, 8 processors: N=40000, N1/2=2252, 81.3% of peak
• Myrinet, 16 processors: N=41288, N1/2=3584, 80.5% of peak
• Ethernet, 16 processors: N=54616, N1/2=5768, 78.1% of peak
• A big run to show >4 GB/processor working:
• 4P 1.8 GHz, 8 GB/processor (32 GB in all, 266 MHz memory)
• 4 processors: N=60144, N1/2=1123, 80.56% of peak
35
STREAM
• Measures sustainable memory bandwidth (in MB/s)
• Simple kernels on large vectors (~50Mbytes)
• Vectors must be much bigger than cache
• Machine balance defined as:
peak floating ops/cycle / sustained memory ops/cycle
name    kernel               bytes/iter   FLOPS/iter
COPY    a(i) = b(i)          16           0
SCALE   a(i) = q*b(i)        16           1
SUM     a(i) = b(i) + c(i)   24           1
TRIAD   a(i) = b(i) + q*c(i) 24           2
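The TRIAD kernel written out in C (a sketch of the loop only; STREAM itself wraps loops like this in timing code and OpenMP directives):

```c
#include <stddef.h>

/* STREAM TRIAD: a(i) = b(i) + q*c(i).
   Per iteration: two 8-byte loads and one 8-byte store (24 bytes moved
   under STREAM's counting rules) and 2 flops, as in the table. */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + q * c[j];
}
```

With vectors far larger than cache, the measured MB/s of this loop is the machine's sustainable memory bandwidth, not its FPU rate.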
36
Compiling STREAM
• The PGI compiler recognises OpenMP thread directives and generates prefetch instructions and streaming stores
pgf77 -Mnontemporal -O3 -fastsse -Minfo=all -Bstatic -mp -c -o stream_d_f.o stream_d.f
180, Parallel loop activated; static block iteration allocation
Generating sse code for inner loop
Generated prefetch instructions for 2 loads
37
STREAM results
• 4P Opteron 2.0 GHz with 333 MHz memory
• Compiled: pgf77 -O3 -mp -fastsse -Mnontemporal
• 64-bit word flops
• Rates in Mbytes/sec
• TRIAD: ~310 Mflops/processor (about the same as a Cray YMP?)

Kernel   1P     2P     4P
COPY     3640   7061   13142
SCALE    3542   7078   13344
SUM      3583   6916   12695
TRIAD    3610   6962   12759
Application Benchmarks
39
Greedy – Travelling Salesman Problem
• Solves the traveling salesman problem; sensitive to memory latency and bandwidth
• An example of the increasing importance of memory performance as problems grow from small to large

Greedy - Traveling Salesman Problem
Solve times in secs (smaller is better)

Problem      Pentium III 800MHz   Itanium 2 1.0GHz   Opteron 1.6GHz
1000x1000    12                   6                  3
316x3162     12                   7                  4
100x10000    18                   9                  5
32x31623     28                   11                 8
10x100000    33                   16                 12
3x316227     37                   26                 14
1x1000000    50                   45                 20
1x3162278    180                  164                98
1x10000000   665                  554                434

Results: www.research.att.com/~dsj/chtsp/speeds.html
40
OPA -Parallel Ocean Model
– A large scientific code from France
– It uses LAM-MPI
– Compiled in France with ifc 7.1; the binary was run in the USA using the same 32-bit version of LAM-MPI, ifc runtime and LD_LIBRARY_PATH settings

          Melody 6         Melody 8         McKinley          ML 570
          Opteron 1.4GHz   Opteron 1.6GHz   Itanium2 1.0GHz   Intel Xeon 2.0GHz
          Linux SMP        Linux NUMA
          2494             1772             3602              3071
41
PARAPYR
• Direct Numerical Simulation of Turbulent Combustion
• Single-processor performance on a 1P system
• SuSE AMD64 Linux, NUMA kernel
• Problem size (small): 340x340
• Mflops are double precision (64-bit)
• Ran the identical statically linked binary on an Intel P4 2.8 GHz (Dell)
• Compiled with ifc 7.1 -r8 -ip
• Opteron 1.8 GHz: 420 Mflops
• P4 2.8 GHz: 351 Mflops

Best 1.8 GHz Opteron results (vs. beta Absoft and PGI compilers):
435 Mflops: ifc -r8 -xW -static
437 Mflops: ifc -r8 -O2 -xW -ipo -static -FDO
(-FDO: pass 1 = -prof_gen, pass 2 = -prof_use)
42
43
Conclusions
– Use an AMD64 64-bit OS with NUMA support
– 32-bit compiled applications run well
– Know your application's memory and CPU needs (so you know what to expect)
– 64-bit compilers need work (as ever)
– Competitive processors (Itanium 1.5 GHz, Xeon 3.0 GHz) have higher peak FLOPs but relatively poor memory scaling
– Highly tuned benchmarks that make heavy use of floating point and either fit in cache or make low use of the memory system perform better on Itanium; the Xeon memory system does not scale well
• Opteron has excellent memory-system performance and scalability - both bandwidth and latency
– Codes that depend on memory latency or bandwidth perform better on Opteron
– Codes with a mix of integer and floating point will perform better on Opteron
– Code that is not highly tuned will likely perform better on Opteron
• More upside from MPI 4P memory layout (MPICH2?) and 64-bit compilers
44
AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, 3DNow! and combinations thereof, AMD-8100, AMD-8111 and AMD-8131 AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Pentium and MMX are registered trademarks of Intel Corporation in the U.S. and/or other jurisdictions. SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation in the U.S. and/or other jurisdictions. Other product and company names used in this presentation are for identification purposes only and may be trademarks of their respective companies.