Intel Technologies for High Performance Computing
Leo Borges
Intel Software Conference 2014 Brazil
May 2014
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Legal Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the
baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that
correlates with the performance improvements reported.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its
customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks
are accurate and reflect performance of systems available for purchase.
Intel® Hyper-Threading Technology Available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC
manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors
support HT Technology, visit http://www.intel.com/info/hyperthreading.
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology
performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system
delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different
processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life
sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject
to change without notice.
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s
current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice.
Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo, Xeon Phi and Xeon Phi logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only
and are subject to change without notice.
*Other names and brands may be claimed as the property of others.
Building Blocks
Many Product Families – Today’s talk: HPC Focus
[Figure: product families and building blocks targeting an array of segments.
Processors (Haswell, v3): E3 (E3-1200 v3); E5, Efficient Performance (E5-1600 v3, E5-2600 v3, E5-4600 v3; E5-2400 v3 for Comms & Storage only); E7 (E7-2800 v3, E7-4800 v3, E7-8800 v3).
Other building blocks: Co-processors, Boards/PDKs, Software, SSDs, LAN, RAID, Storage, Networking.
Segments: Channel, Enterprise, HPC, Mission Critical, Big Data, Public Cloud, Cloud, Storage.]
Note: For discussion purposes only (not intended to be interpreted as portfolio recommendations or guidance)
Product families and building blocks targeting an array of Segments
Objectives of this session
• Recall of a few basics for HPC
• What to expect from your code
• What to expect from the hardware
• Review Vectorization
• Xeon + Xeon Phi Example
Review of a few HPC basics for non-ninja programmers
How it works and where are the bottlenecks
[Figure: two nodes, each with CPU, L1 / L2 / L3 caches, memory and I/O, connected through an interconnect.]
• Core count, size & perf?
• Cache size, BW & latency?
• Memory size, BW & latency?
• Intra / inter socket communications?
• Inter-node communication?
Unfortunately, you need to be aware
[Figure: memory hierarchy from the core to the I/O subsystem, levels L1, L2, L3, L4, L5, …, Ln: caches, eDRAM, MCDRAM, DDR, NVM, SSD, PCIe SSD, HDD, tapes. Bandwidth, latency and capacity change at every level.]
FLOPS and memory bandwidth impact the efficiency & scalability
• Performing Flops is not an issue
• Data movement is the issue (BW, Latency, Power)
Efficiency (= Achieved flops / Peak flops)
won’t be high enough if store / load are not fast enough (GB/s)
First approximation: Only a matter of Frequency and Bandwidth
for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i] * d[i];
Per iteration: 1 store (c[i]), 3 loads (a[i], b[i], d[i]), 1 add, 1 mul
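The operation count of this loop can be turned into a rough flop-per-byte figure. The sketch below assumes double-precision arrays (8 bytes per element) and that every load and store reaches memory, with no cache reuse and no write-allocate traffic. The function names are illustrative and not part of the slides.

```c
/* Flop and memory-traffic count for one iteration of
 *     c[i] = a[i] + b[i] * d[i];
 * Assumption: double precision, every access reaches memory. */
enum {
    FLOPS_PER_ITER  = 2,   /* 1 add + 1 mul          */
    LOADS_PER_ITER  = 3,   /* a[i], b[i], d[i]       */
    STORES_PER_ITER = 1,   /* c[i]                   */
    BYTES_PER_ELEM  = 8    /* sizeof(double) assumed */
};

/* Arithmetic intensity in flop per byte moved. */
double loop_intensity(void)
{
    double bytes = (LOADS_PER_ITER + STORES_PER_ITER) * (double)BYTES_PER_ELEM;
    return FLOPS_PER_ITER / bytes;
}

/* Upper bound on Gflop/s when the loop is limited by memory bandwidth. */
double bandwidth_bound_gflops(double bandwidth_gbs)
{
    return loop_intensity() * bandwidth_gbs;
}
```

Under these assumptions the loop performs 2 flops for 32 bytes moved, i.e. 0.0625 flop/byte, so a memory system delivering B GB/s caps it at 0.0625 * B Gflop/s regardless of the core's peak.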
Performance expectation: upper bounds
[Figure: spectrum of applications. CPU bound (“HPL”): Flops/s demanding applications. Memory bound (“Stream”): BW demanding applications. Real world applications sit in between.]
Analyzing this Flop/memory-access ratio will give a first guess for performance prediction
• Our performance metrics are Gflop/s and % of peak (efficiency)
• Elapsed time might not tell the whole story (how far from the peak performance?)
Memory Bound?
Compute Bound?
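One common way to answer the memory-bound / compute-bound question from the Flop/memory-access ratio is a roofline-style estimate. This is a named technique swapped in for illustration, not something the slides spell out. A minimal sketch, where all function names and numeric inputs are hypothetical and supplied by the caller:

```c
/* First-guess performance bound from the flop/memory-access ratio. */
typedef enum { MEMORY_BOUND, COMPUTE_BOUND } bound_kind;

/* Attainable Gflop/s = min(peak flops, intensity * peak bandwidth). */
double attainable_gflops(double peak_gflops, double peak_bw_gbs,
                         double flops_per_byte)
{
    double memory_limit = flops_per_byte * peak_bw_gbs;
    return memory_limit < peak_gflops ? memory_limit : peak_gflops;
}

bound_kind classify(double peak_gflops, double peak_bw_gbs,
                    double flops_per_byte)
{
    /* Machine balance: flop/byte needed to keep the cores busy. */
    double balance = peak_gflops / peak_bw_gbs;
    return flops_per_byte < balance ? MEMORY_BOUND : COMPUTE_BOUND;
}
```

For a hypothetical machine with 100 Gflop/s peak and 50 GB/s of bandwidth, a kernel at 0.5 flop/byte is capped at 25 Gflop/s (memory bound), while one at 4 flop/byte can reach the 100 Gflop/s peak (compute bound).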
Glossary, “High performance computing”
• Peak = number of floating-point operations per cycle * frequency
• “Flops / sec”
“Efficiency = % of the peak performance”
Same for Bandwidth (but in GBytes / sec)
Peak = (Flops / cycle) * (cycles / sec) = Flops / sec
By the way: What is the peak perf of your laptop?
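The glossary definitions translate directly into two small helpers. This is a sketch only: the function names are illustrative, and the explicit core-count factor is an assumption added here so the result covers a whole processor rather than one core.

```c
/* Peak = cores * (flops / cycle) * (Gcycles / sec), in Gflop/s. */
double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;
}

/* Efficiency = achieved / peak, as a percentage. */
double efficiency_percent(double achieved_gflops, double peak)
{
    return 100.0 * achieved_gflops / peak;
}
```

For a hypothetical 4-core, 2.0 GHz part doing 8 flops per cycle per core, the peak is 64 Gflop/s, and a code sustaining 48 Gflop/s on it runs at 75% efficiency.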
Anatomy of a Computer Platform
CPU: Core/Uncore - Designed For Modularity
[Figure: processor block diagram. The cores form the “Core” domain; the “Uncore” contains the L3 cache, the integrated memory controller (IMC) to DRAM, the QPI links, and power & clock logic.]
Differentiation in the “Uncore”: # QPI links, # mem channels, size of cache, # cores, power management, type of memory, integrated graphics
QPI: Intel® QuickPath Interconnect
Romley EP/EN Platforms
Intel® Xeon® Processor E5-2600 v2/2400 v2 Product Families
[Figure: two-socket platform. Two Intel® Xeon® processors (E5-2400/2600 product family) linked by QPI, each with DDR3 channels and PCIe* 3.0 x8 ports; the Intel® C600 series chipset attaches via DMI2 and PCIe* 2.0 x4 and provides 3Gb/s SAS, SATA.]
Ivy Bridge CPUs
• Socket R: Up to 12 cores / socket
• Socket B2: Up to 10 cores / socket
QPI
• Socket R: 2 QPI links
• Socket B2: 1 QPI link
Memory
• DDR3 & DDR3L
• RDIMMs & UDIMMs, LR DIMMs
• Socket R: 4 channels per socket, up to 3 DPC; speeds up to DDR3 1866
• Socket B2: 3 channels per socket, up to 2 DPC; speeds up to DDR3 1600
PCI Express* 3.0
• Socket R: 40 lanes per socket
• Socket B2: 24 lanes per socket
• Extra Gen 2 x4 on 2nd CPU
Intel® C600 series chipset (Patsburg PCH)
• Optimized Server & WS PCH
• Integrated Storage: Up to 8 ports 3Gb/s SAS; RAID 5 optional
IvyBridge (IVB) E5-2600 v2 family
The total benefit (at node level) is given by a combination of factors
[Figure: die diagram with 12 cores (C), LLC cache, memory controller (MC) with 4 DDR3 channels, 2 QPI links, and I/O with PCIe Gen3 x16, Gen3 x16 and Gen3 x8.]
Feature                            Xeon E5-2600 v2
Process Technology                 22 nm
Cores/Threads                      Up to 12 Cores/24 Threads
Last-level Cache                   Up to 30 MB
Max Memory Speed (MHz)             Up to 1866
Max DIMM Capacity                  12 Slots/Processor
PCIe* Lanes / Controllers / Speed  40 / 10 (PCIe* 3.0 at 8 GT/s)
TDP (W)                            150 (Workstation only), 130, 115, 95, 80, 70, 60
wstream.exe
E5-2600 v2 Product Family
Segments: Advanced, Standard, Basic, Segment Optimized, Low Power, Workstation Only SKU
Socket compatible with SNB-EP top to bottom on the SKU stack
All SKUs, frequencies and features can change without notice
Model         Cores  TDP    Frequency  Cache
E5-2620 v2    6C     80W    2.1GHz     15M
E5-2609 v2    4C     80W    2.5GHz     10M
E5-2670 v2    10C    115W   2.5GHz     25M
E5-2640 v2    8C     95W    2.0GHz     20M
E5-2603 v2    4C     80W    1.8GHz     10M
E5-2630 v2    6C     80W    2.6GHz     15M
E5-2690 v2    10C    130W   3.0GHz     25M
E5-2680 v2    10C    115W   2.8GHz     25M
E5-2650 v2    8C     95W    2.6GHz     20M
E5-2660 v2    10C    95W    2.2GHz     25M
E5-2697 v2    12C    130W   2.7GHz     30M
E5-2695 v2    12C    115W   2.4GHz     30M
E5-2667 v2    8C     130W   3.3GHz     25M
E5-2643 v2    6C     130W   3.5GHz     25M
E5-2637 v2    4C     130W   3.5GHz     15M
E5-2650L v2   10C    70W    1.7GHz     25M
E5-2630L v2   6C     60W    2.4GHz     15M
E5-2687W v2   8C     150W   3.4GHz     20M
Feature sets by SKU group:
• 8.0 GT/s QPI, DDR3-1866, Intel® HT, Intel® Turbo Boost
• 8.0 GT/s QPI, DDR3-1866 (skt R) / DDR3-1600 (skt B2), Intel® HT, Intel® Turbo Boost
• 7.2 GT/s QPI, DDR3-1600, Intel® HT, Intel® Turbo Boost
• 10C 8.0 GT/s QPI / 6C 7.2 GT/s QPI, DDR3-1600, Intel® HT, Intel® Turbo Boost
• 6.4 GT/s QPI, DDR3-1333, No Intel® HT, No Intel® Turbo
Parallel Programming for Intel® Architecture (or, in general, for normal CPUs)
4 considerations when writing an efficient, unconstrained parallel program:
• Cores: OpenMP, TBB, Cilk Plus; threads, locks
• Vectors: vector loops, vector functions, array notations; intrinsics
• Memory, caches: blocking algorithms
• Data layout and alignment: AoS → SoA library, directives for alignment; manual layout, ugly code
Performance Analysis
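The AoS → SoA item can be illustrated with a hand-written conversion. This is a sketch with hypothetical type and function names and an arbitrary fixed capacity; it is not the library the slide refers to.

```c
#include <stddef.h>

#define N_POINTS 1024   /* illustrative capacity */

/* Array of Structures: x, y, z of one point are adjacent, so a loop
 * over x alone walks memory with a stride of three floats. */
struct point_aos { float x, y, z; };

/* Structure of Arrays: each coordinate is contiguous (unit stride). */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Convert n points (n <= N_POINTS) from AoS to SoA layout. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Unit-stride loop over one coordinate: a good candidate for
 * auto-vectorization. */
void scale_x(struct points_soa *p, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= factor;
}
```

The conversion costs one pass over the data, so it pays off when several vectorizable loops run on the SoA form afterwards.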
“SIMDization”, so called Vectorization
Single Instruction Multiple Data (SIMD):
• Processing vector with a single operation
• Provides data level parallelism (DLP)
Vector:
• Consists of more than one element
• Elements are of same scalar data types (e.g. floats, integers, …)
[Figure: scalar processing computes one C = A + B per instruction; vector processing computes Ci = Ai + Bi for VL elements (the vector length) per instruction.]
Vectorization of Code
• Transform sequential code to exploit vector processing capabilities (SIMD)
– Manually by explicit syntax
– Automatically by tools like a compiler
for (i = 0; i <= MAX; i++) c[i] = a[i] + b[i];
[Figure: the scalar loop performs one addition a[i] + b[i] → c[i] per instruction; the vectorized loop adds eight elements, a[i..i+7] + b[i..i+7] → c[i..i+7], per instruction.]
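The transformation in the figure can be written out in portable C as a strip-mined loop. This is a sketch of the loop structure a vectorizer produces, not actual SIMD code: the inner lane loop stands in for one vector instruction, and VL = 8 is an assumption matching eight floats per 256-bit register.

```c
#include <stddef.h>

#define VL 8   /* vector length assumed here: 8 floats in 256 bits */

/* c[i] = a[i] + b[i], written the way a vectorizer processes it:
 * a main loop over full blocks of VL elements plus a scalar
 * remainder loop for the last n % VL elements. */
void add_strip_mined(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: each pass stands for one vector add of VL lanes. */
    for (; i + VL <= n; i += VL)
        for (size_t lane = 0; lane < VL; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];

    /* Remainder loop: leftover elements, one at a time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The remainder loop is why trip counts that are a multiple of the vector length tend to vectorize most cleanly.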
Reminder about the peak flops
Intel® SandyBridge/Ivy Bridge micro-architecture
[Figure: scheduler with ports 0 to 5 (port names as used by Intel® Architecture Code Analyzer). Ports 0, 1 and 5 carry the computational units (ALU, ALU/JMP, SSE MUL, AVX FP MUL, SSE ADD, AVX FP ADD, DIV, VI MUL, VI ADD, SSE Shuf, AVX FP Shuf, AVX FP Bool, AVX FP Blend); the remaining ports handle Load, Store Address and Store Data. Register bit positions shown: 0, 63, 127, 255.]
6 instructions / cycle:
• 3 memory ops
• 3 computational operations
Nehalem / Westmere: Two 128-bit SIMD per cycle
• 4 MUL (32b) and 4 ADD (32b): 8 Single Precision Flops / cycle
• 2 MUL (64b) and 2 ADD (64b): 4 Double Precision Flops / cycle
SandyBridge / Ivy Bridge: Two 256-bit SIMD per cycle
• 8 MUL (32b) and 8 ADD (32b): 16 Single Precision Flops / cycle
• 4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle
In the Laptop We’ll be Using for Demo…
Processor: Intel Core i5-3427U (ark.intel.com):
Processor Number: i5-3427U
# of Cores: 2
# of Threads: 4
Clock Speed: 1.8 GHz
Max Turbo Frequency: 2.8 GHz
Instruction Set Extensions: AVX
SandyBridge / Ivy Bridge: Two 256-bit SIMD per cycle
• 8 MUL (32b) and 8 ADD (32b): 16 Single Precision Flops / cycle
• 4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle
2 (cores) * 1.8 GHz * 16 Flop/cycle = 57.6 Gflop/s (single precision)
2 (cores) * 1.8 GHz * 8 Flop/cycle = 28.8 Gflop/s (double precision)
Haswell-EP vs IvyBridge-EP
The total benefit (at node level) is given by a combination of factors
• Benefit from micro-architecture optimization (IPC): 25% IPC improvement
• Benefit from the number of cores: up to 1.16x (at constant frequency)
• Benefit from AVX2: up to 2x (FMA)
• Benefit from memory bandwidth: up to 1.14x (1866 MHz to 2133 MHz)
[Figure: die diagram with 14 cores (C), LLC cache, memory controller (MC) with 4 DDR4 channels, 2 QPI links, and I/O with PCIe Gen3 x16, Gen3 x16 and Gen3 x8.]
Flops/s, AVX, AVX2 and AVX-512
[Figure: timeline 2013 to 2016 by half-year (H1 / H2) showing Ivy Bridge-EP, Haswell-EP and two future products; a “-512” label appears twice on the chart.]
Haswell Execution Unit Overview
Intel® Microarchitecture (Haswell)
[Figure: unified reservation station feeding ports 0 to 7. Units shown: Integer ALU & Shift, Integer ALU & LEA, FMA / FP Multiply, FMA / FP Mult / FP Add, Divide, Branch, Vector Shuffle, Vector Int Multiply, Vector Int ALU, Vector Logicals, Vector Shifts, Load & Store Address, Store Data, Store Address (Port 7).]
2x FMA
• Doubles peak FLOPs
• Two FP multiplies benefits legacy
4th ALU
• Great for integer workloads
• Frees Port0 & 1 for vector
New Branch Unit
• Reduces Port0 conflicts
• 2nd EU for high branch code
New AGU for Stores
• Leaves Port 2 & 3 open for loads
Extends 128-bit integer vector instructions to 256-bit
Floating Point Fused Multiply Add: A*B + C
• Increased FLOPS potential
• Increased accuracy: only a single rounding
Enhanced vectorization with Gather, Shifts and powerful permutes
Intel® AVX2 uses same 256-bit YMM registers as Intel AVX
Floating-Point Performance (Peak) per Core, in Flops/Cycle:
Instruction set  Microarchitecture        Operations per cycle  DP   SP
SSE4             Nehalem/Westmere         MUL (*) + ADD (+)     4    8
AVX (256b)       SandyBridge/Ivy Bridge   MUL (*) + ADD (+)     8    16
AVX2 (256b)      Haswell                  2 x FMA (*,+)         16   32
Each step is a 2x increase over the previous one.
Use math libs for best use of AVX1, AVX2 & AVX-512
[Figure: bar chart of speedup (axis 0.0 to 2.0), one core basis comparison, for Application Source code, Assembly / Intrinsics, MKL FFT benchmark and MKL Dgemm benchmark.]
• Use Intel® Math Kernel Library as much as possible
• Use of intrinsics or assembly for specific kernels
• Use Compiler and Intel tools to optimize your source code
Intel® Math Kernel Library: Optimized Mathematical Building Blocks
Linear Algebra
• BLAS
• LAPACK
• Sparse Solvers
• Iterative
• Pardiso*
• ScaLAPACK
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential, Log
• Power / Root
Vector RNGs
• Congruential
• Wichmann-Hill
• Mersenne Twister
• Sobol
• Niederreiter
• Non-deterministic
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Intel® MKL is an integral part of Intel® Parallel Studio XE
Many Ways to Vectorize
From ease of use (top) to programmer control (bottom):
• Compiler: Auto-vectorization (no change of code)
• Compiler: Auto-vectorization hints (#pragma simd, …)
• Compiler: Intel® Cilk™ Plus Array Notation Extensions
• SIMD intrinsic class (e.g.: F32vec, F64vec, …)
• Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)
• Assembler code (e.g.: [v]addps, [v]addss, …)
Control Vectorization!
Provides details on vectorization success & failure:
Linux*, Mac OS* X: -vec-report<n>, Windows*: /Qvec-report<n>
n   Diagnostic Messages
0   Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.
1   Tells the vectorizer to report on vectorized loops. [default if n missing]
2   Tells the vectorizer to report on vectorized and non-vectorized loops.
3   Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.
4   Tells the vectorizer to report on non-vectorized loops.
5   Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.
6*  Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.
*: First available with Intel® Parallel Studio XE
Vectorization Report II
Note:
In case inter-procedural optimization (-ipo or /Qipo) is activated and
compilation and linking are separate compiler invocations, the switch to enable
reporting needs to be added to the link step!
35: subroutine fd( y )
36:   integer :: i
37:   real, dimension(10), intent(inout) :: y
38:   do i=2,10
39:     y(i) = y(i-1) + 1
40:   end do
41: end subroutine fd
novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.
novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.
novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of vector dependence
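The flow dependence reported for this loop can sometimes be designed away. The sketch below restates the recurrence in C and gives one possible dependence-free rewrite; both function names are illustrative. Note that floating-point rounding may make the two versions differ for values that are not exactly representable, so the rewrite is only a drop-in replacement when that is acceptable.

```c
#include <stddef.h>

/* Same recurrence as the Fortran loop: y[i] needs the y[i-1] written by
 * the previous iteration (a FLOW dependence), so iterations cannot run
 * side by side in vector lanes. */
void fd_recurrence(float *y, size_t n)
{
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + 1.0f;
}

/* Dependence-free rewrite: every element is computed from y[0] only,
 * so the iterations are independent. */
void fd_closed_form(float *y, size_t n)
{
    const float y0 = (n > 0) ? y[0] : 0.0f;
    for (size_t i = 1; i < n; i++)
        y[i] = y0 + (float)i;
}
```

In the second version no iteration reads a value written by another iteration, which removes the obstacle the report complains about.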
Reasons Vectorization Fails & How to Succeed
● Most frequent reason is Dependence: Minimize dependencies among iterations by design!
● Alignment: Align your arrays/data structures
● Function calls in loop body: Use aggressive in-lining (IPO)
● Complex control flow/conditional branches: Avoid them in loops by creating multiple versions of loops
● Unsupported loop structure: Use loop invariant expressions
● Not inner loop: Manual loop interchange possible?
● Mixed data types: Avoid type conversions
● Non-unit stride between elements: Possible to change algorithm to allow linear/consecutive access?
● Loop body too complex reports: Try splitting up the loops!
● Vectorization seems inefficient reports: Enforce vectorization, benchmark!
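For the alignment item, one portable option is to allocate arrays on a vector-register boundary. The sketch below uses the standard C11 `aligned_alloc` rather than the compiler-specific alignment directives mentioned elsewhere in this talk; the 32-byte figure assumes 256-bit AVX registers, and the helper names are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 32   /* bytes: one 256-bit AVX register (assumed) */

/* Allocate n floats on a VEC_ALIGN-byte boundary with C11 aligned_alloc.
 * The byte count is rounded up because aligned_alloc expects a size
 * that is a multiple of the alignment. Release with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (padded == 0)
        padded = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, padded);
}

/* Returns 1 when p sits on a VEC_ALIGN-byte boundary. */
int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}
```

Allocating aligned memory does not by itself tell the compiler about the alignment inside a function that only sees a pointer; that is where alignment directives or hints come in.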
IVDEP vs. SIMD Pragma/Directives
Differences between IVDEP & SIMD pragmas/directives:
#pragma ivdep (C/C++) or !DIR$ IVDEP (Fortran)
- Ignore vector dependencies (IVDEP):
  Compiler ignores assumed but not proven dependencies for a loop
- Example:
void foo(int *a, int k, int c, int m)
{
#pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
#pragma simd (C/C++) or !DIR$ SIMD (Fortran):
- Aggressive version of IVDEP: Ignores all dependencies inside a loop
- It’s an imperative that forces the compiler to try everything to vectorize
- Efficiency heuristic is ignored
- Attention: This can break semantically correct code!
  However, it can vectorize code legally in some cases that wouldn’t be possible otherwise!
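Why the dependence in `foo` is only assumed: the compiler cannot know the sign of `k`. The sketch below models a vectorized loop in plain C (load a block, then store a block) and compares it with in-order execution. `VL`, `LEN`, `PAD` and all function names are illustrative choices, and the blocked loop is a model of vector execution, not real SIMD code.

```c
#include <string.h>

#define VL  4    /* illustrative vector length          */
#define LEN 16   /* elements updated by the loop        */
#define PAD 4    /* guard elements on each side, |k| <= PAD */

/* Scalar reference: iterations run strictly in order. */
void shift_scale_scalar(int *a, int k, int c, int m)
{
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

/* Model of a vectorized loop: each block first loads up to VL inputs,
 * then stores the results. */
void shift_scale_blocked(int *a, int k, int c, int m)
{
    for (int i = 0; i < m; i += VL) {
        int tmp[VL];
        int len = (m - i < VL) ? m - i : VL;
        for (int lane = 0; lane < len; lane++)   /* vector load + mul */
            tmp[lane] = a[i + lane + k] * c;
        for (int lane = 0; lane < len; lane++)   /* vector store */
            a[i + lane] = tmp[lane];
    }
}

/* Returns 1 if both execution orders give the same array for this k. */
int same_result(int k, int c)
{
    int x[LEN + 2 * PAD], y[LEN + 2 * PAD];
    for (int i = 0; i < LEN + 2 * PAD; i++)
        x[i] = y[i] = i + 1;
    shift_scale_scalar(x + PAD, k, c, LEN);
    shift_scale_blocked(y + PAD, k, c, LEN);
    return memcmp(x, y, sizeof x) == 0;
}
```

With `k = 1` every iteration reads an element that has not been overwritten yet, so both orders agree and the `ivdep` promise holds. With `k = -1` each iteration reads the value the previous iteration just wrote, the two orders diverge, and asserting `ivdep` (or `simd`) would change the result.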
Memory Subsystem
Memory Bandwidth update
For Sandy Bridge EP platform: 4 channels, 2 sockets and 1600 MHz memory
8 * 1.600 * 4 * 2 = 102.4 GB/s peak (ST: 80 GB/s)
For Ivy Bridge EP platform: 4 channels, 2 sockets and 1866 MHz memory
8 * 1.866 * 4 * 2 = 119.42 GB/s peak (ST: ~98 GB/s)
For Haswell EP platform: 4 channels, 2 sockets and 2133 MHz memory
8 * 2.133 * 4 * 2 = 136.5 GB/s peak (ST: ~114 GB/s)
Basic rule for theoretical memory BW [GBytes / second]:
8 [Bytes / channel] * Mem freq [Gcycles/sec] * nb of channels * nb of sockets
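The rule above is easy to encode. A minimal sketch with illustrative function names; the second helper, which expresses a measured figure as a fraction of the theoretical peak, is an addition for convenience and not part of the slide.

```c
/* Theoretical memory bandwidth in GB/s:
 * 8 bytes per channel * memory speed * channels * sockets.
 * mem_mhz is the memory speed as quoted on the slide (e.g. 1866). */
double peak_mem_bw_gbs(double mem_mhz, int channels, int sockets)
{
    const double bytes_per_channel = 8.0;
    return bytes_per_channel * (mem_mhz / 1000.0) * channels * sockets;
}

/* Measured bandwidth as a percentage of the theoretical peak. */
double bw_efficiency_percent(double measured_gbs, double peak_gbs)
{
    return 100.0 * measured_gbs / peak_gbs;
}
```

For the Sandy Bridge EP figures, `peak_mem_bw_gbs(1600.0, 4, 2)` reproduces the 102.4 GB/s peak, and the quoted 80 GB/s corresponds to about 78% of it.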
[Figure: two-socket Haswell (HSW) platform, Socket-R3 LGA. The sockets are linked by 2 full width QPI 1.1 links; each socket has 4 DDR3/4 channels, 40 lanes of PCIe 3.0 and DMI2.]
In the Laptop We’ll be Using for Demo…
Processor: Intel Core i5-3427U (ark.intel.com):
Memory Types: DDR3/L/-RS 1333/1600
# of Memory Channels: 2
Max Memory Bandwidth: 25.6 GB/s
Basic rule for theoretical memory BW [GBytes / second]:
8 [Bytes / channel] * Mem freq [Gcycles/sec] * nb of channels * nb of sockets
Platform: 2 channels, 1 socket and 1600 MHz memory
8 * 1.6 * 2 * 1 = 25.6 GB/s peak (ST: 20 GB/s)
Intel® Many Integrated Core Architecture
Intel® Xeon Phi™ Product Family
• Up to 61 IA cores / 1.2 GHz / 244 threads
• Up to 16 GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Open Source Linux operating system
• IP addressable
• Standard programming languages, tools, clustering
• 22 nm process
Form factors: Passive Card, Active Card
http://software.intel.com/en-us/mic-developer
Intel® Xeon Phi™ Coprocessor Product Lineup
3 Family: Outstanding Parallel Computing Solution, Performance/$ leadership
• 6GB GDDR5, 240GB/s, >1TF DP, 300W TDP
• 3120P (MM# 927501), 3120A (MM# 927500)
5 Family: Optimized for High Density Environments, Performance/Watt leadership
• 8GB GDDR5, >300GB/s, >1TF DP, 225-245W TDP
• 5110P (MM# 924044), 5120D (no thermal, MM# 927503)
7 Family: Highest Performance, Most Memory, Performance leadership
• 16GB GDDR5, 352GB/s, >1.2TF DP, 300W TDP
• 7120P (MM# 927499), 7120X (No Thermal Solution, MM# 927498), 7120A (MM# 934878), 7120D (Dense Form Factor, MM# 932330)
Optional 3-year Warranty: Extend to 3-year warranty on any Intel® Xeon Phi™ Coprocessor. Product Code: XPX100WRNTY, MM# 933057
For more information go to http://www.intel.com/performance
Core Architecture
[Figure: coprocessor core. An instruction decoder feeds a scalar unit (scalar registers) and a vector unit (vector registers), backed by L1 cache (I & D), L2 cache and the interprocessor network.]
• In order, 64-bit, 4 threads per core
• VPU: 512-wide; integer, SP, DP; 3-operand, 16-instruction
• L1 cache (I & D): 32 KB per core
• L2 cache: 512 KB slice per core, hardware prefetching, fully coherent
Spectrum of Execution Models (Offload / Native / Symmetric)
Offload:
Workload is run on host, and highly parallel phases on Coprocessor
!dir$ omp offload target(mic)
!$omp parallel do
do i=1,10
   A(i) = B(i) * C(i)
enddo
!$omp end parallel do
MPI Example on Host with offload to coprocessors
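A C counterpart of the offloaded loop might look like the sketch below. The `offload` pragma and its `in` / `out` / `length` clauses are written from memory of the Intel compiler's offload syntax and should be treated as an assumption to check against the compiler documentation; the function name is illustrative. Compilers that do not recognise the pragma ignore it, in which case the loop simply runs on the host.

```c
/* a[i] = b[i] * c[i], with the loop marked for offload.
 * ASSUMPTION: the "offload" pragma and its in/out clauses follow the
 * Intel compiler's offload syntax; other compilers ignore the pragma
 * and run the loop on the host. */
void multiply_offload(float *a, const float *b, const float *c, int n)
{
#pragma offload target(mic) in(b, c : length(n)) out(a : length(n))
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```

The explicit `in` and `out` clauses matter for pointers: the compiler cannot infer how many elements to copy to and from the coprocessor.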
Spectrum of Execution Models (Offload / Native / Symmetric)
Native (Coprocessor-only model):
Workload is run solely on coprocessor
MPI example on Coprocessor only
icc -mmic … ./bin_mic
Then
ssh mic0
./bin_mic
Or start it from host
micnativeloadex ./bin_mic
Symmetric Mode Command Line
Arslan et al. 2013. Rice HPC Conf.
Workload runs on Host AND Coprocessors
Single Node Tests – HW and SW Configuration
Isotropic RTM FD Kernel
[Figure: one node with two CPU sockets linked by QPI, each with an IOH (integrated in the processor), and four coprocessors using peer-to-peer direct DMA transfers between devices. One MPI process per device, each running OpenMP threads:]
• rank 0 in “mic0”: 244 threads
• rank 1 in “mic1”: 244 threads
• rank 2 in “cpu0”: 12 threads
• rank 3 in “cpu1”: 12 threads
• rank 4 in “mic2”: 244 threads
• rank 5 in “mic3”: 244 threads
Hardware: 4x 7120A (61 Cores, 1.238 GHz, 16GB GDDR5); 2x E5-2697v2 (12C, 2.7GHz) and 64GB DDR3-1866 MHz
Scalability study with one to four Intel® Xeon Phi™ coprocessors
1.1
4.0
9.3
14.7
20.1
24.4
0.0
5.0
10.0
15.0
20.0
25.0
30.0
0.0
0.4
0.8
1.2
1.6
TFlops
Scaling Based on Number of Coprocessors
CUDA K40c CUDA K10
High performance and scalability with Intel® Xeon Phi® coprocessor
Single Node Tests – Performance & ScalabilityIsotopic RTM FD Kernel
47
� Scaling analysis with each Intel® Xeon Phi™ coprocessor solving a 14GB subdomain and pair of Intel® Xeon® processors solving a 10GB subdomain
� 16th order 3D space and 2nd order time; 61 Flops per Cell
� 24.4 GCell/s total performance with 2 processors + 4 coprocessors
� semi-OPT measurement is an OpenMP parallel version implemented with cache-blocking and compiler directives to improve vectorization. The remaining measurements are on code with additional optimizations such as loop unrolling, non-temporal stores, tiling on Y-Z, prefetchtuning, and balance between MULs and ADDs via intrinsics
� CUDA K40c and CUDA K10 are measurements on single devices using code that extended the FDTD3d sample in the CUDA SDK5.5 to 16th order in space and further optimized to increase register reuse
4.2
GCell/s
5.1
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
1. Xeon = Intel® Xeon® processor E5-2697v2
Source: Intel Measured Results as of April 2014
Configurations (chart labels; "Xeon" as defined in note 1):
2x Xeon, semi-OPT
2x Xeon
2x Xeon + 1x 7120A
2x Xeon + 2x 7120A
2x Xeon + 3x 7120A
2x Xeon + 4x 7120A
Config. Summary: IC 14.0 U1; MPI 4.1.1.036; MPSS 6720-15; ECC off, Turbo on (Xeon & 7120A); CUDA 5.5 (875MHz Boost Enabled)
Parallel Programming for Intel® Architecture (or, in general, for normal CPUs)
Cores
Vectors
Memory, caches
Data layout and alignment
OpenMP, TBB, Cilk Plus
Vector loops
Vector functions
Blocking algorithms
Manual layout, ugly code
AoS → SoA library
4 considerations when writing an efficient, unconstrained parallel program
Array notations
Threads, locks
Intrinsics
Directives for alignment
Performance Analysis
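The "AoS → SoA" item under data layout refers to converting an array of structures into a structure of arrays, so that each field is stored contiguously and can be read with unit-stride accesses. The sketch below shows only the layout transformation, using Python's standard `array` module. The field names and helper functions are hypothetical, and Python itself will not show the vectorization benefit that a compiled language would.

```python
from array import array

def aos_to_soa(points):
    """Convert a list of (x, y, z) tuples (AoS) into three contiguous arrays (SoA)."""
    return {
        "x": array("d", (p[0] for p in points)),
        "y": array("d", (p[1] for p in points)),
        "z": array("d", (p[2] for p in points)),
    }

def soa_to_aos(soa):
    """Inverse transformation, useful for checking the round trip."""
    return list(zip(soa["x"], soa["y"], soa["z"]))

def scale_x(soa, factor):
    """Field-wise loop: touches only the contiguous x array."""
    xs = soa["x"]
    for i in range(len(xs)):
        xs[i] *= factor

points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]
soa = aos_to_soa(points)
scale_x(soa, 2.0)
print(list(soa["x"]))  # [2.0, 8.0, 14.0]
```

With the SoA layout, a loop over one field walks a single contiguous buffer, which is the access pattern that vector loops and alignment directives are meant to exploit.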
3DFD comparison: E5-2697v2 (Ivy Bridge) and Xeon Phi 7120A
Energy efficiency with multiple Intel® Xeon Phi cards
Note: 3 and 4 Xeon Phi power values are projections based on the data collected for 1 and 2 Xeon Phi.
Single Node Tests – Performance/Watt
High energy efficiency with Xeon Phi
This data was presented by Petrobras at SC13 and at the Rice 2014 Oil & Gas HPC Workshop
Source: Petrobras presentation at 2014 RICE Oil & Gas HPC: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Intel-Rice2014-RTM-XeonPhi-V3.pdf
Next Intel® Xeon Phi™ Product Family (Codenamed Knights Landing)
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
• “Knights Landing” is the code name for the 2nd generation Intel® Xeon Phi™ product
• Based on Intel’s 14 nanometer manufacturing process
• Standalone bootable processor (running the host OS) and a PCIe coprocessor (PCIe end-point device)
• Integrated on-package high-bandwidth memory
• Flexible memory modes for the on-package memory include: cache and flat
• Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
• 60+ cores, 3+ TeraFLOPS of double-precision peak performance per single socket node
• Multiple hardware threads per core with improved single-thread performance over the current generation Intel® Xeon Phi™ coprocessor
Note that the code name above is not the product name
Programming Resources
• Intel® Xeon Phi™ Coprocessor Developer’s Quick Start Guide
• Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors
• Access to webinar replays and over 50 training videos
• Beginning labs for the Intel® Xeon Phi™ Coprocessor
• Programming guides, tools, case studies, labs, code samples, forums & more
http://software.intel.com/mic-developer
Using a familiar programming model and tools means that developers don’t need to start from scratch. Many
programming resources are available to further accelerate time to solution.
Questions?
Are you ready for Multicore and ManyCore?
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804