© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice.
Dr. Fabio Baruffa
Technical Consulting Engineer
Build optimized C/C++ & Fortran Applications for Linux*, macOS* & Windows*
Agenda
Overview
New compiler options and features
What’s new in C/C++
What’s new in Fortran
New Features upcoming in OpenMP* 5.0 (non SIMD specific)
Vectorization Enhancement
AVX-512
OpenMP* SIMD
Summary
Fast, scalable, parallel code with Intel® Compilers
Deliver industry-leading C/C++ & Fortran code performance, unleash the power of the latest Intel® processors
Develop optimized and vectorized code for various Intel® architectures, including Intel® Xeon® and Xeon Phi™ processors
Leverage latest language and OpenMP* standards, and compatibility with leading compilers & IDEs
Take advantage of Priority Support, which connects you directly to Intel engineers for technical questions (paid versions only)
For Intel® C/C++ and Fortran Compilers version 18.0
New options and features
Microsoft Visual Studio 2017 support
Stack security protection changes with /GS
− full stack security checking is now provided by default
− performance of some programs may be impacted; use /GS:partial to reduce the checking
Control-Flow Enforcement Technology
enables protection against exploit vulnerabilities
-[q|Q]cf-protection[[=|:]keyword]
where keyword is shadow_stack, branch_tracking, full or none
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
Function splitting
useful to set the inline depth for functions that are too large
-[f|Q]fnsplit[=n], where 0 <= n <= 100
splits out function blocks whose execution probability is less than or equal to n
it forces the compiler to do function splitting even if there is no dynamic profiling
GCC's -freorder-blocks-and-partition is also supported (requires dynamic profiling)
New options and features
-[q|Q][no-]opt-assume-safe-padding
Allows the compiler to assume safe access of up to 64 bytes after each array/variable allocated by a program
Reintroduced this KNC-specific option for all targets supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Foundation instructions
struct {
  float f0;
  float f1;
  float f2;
} d[];
for (int i = 0; i < size; i++) {
  int j = index[i];
  tmp += d[j].f0 + d[j].f1 + d[j].f2;
}
-xCORE-AVX512:
vmovups d(index[i+0]*12), %xmm0{%k1}
vmovups d(index[i+1]*12), %xmm1{%k1}
…
vmovups d(index[i+15]*12), %xmm15{%k1}
vpermi2ps …
Sometimes allows generating unmasked loads
SVML call dispatching at compile time:
dynamic (runtime) dispatching can create an indirect jump, which causes overhead
A direct call to the CPU-specific SVML entry point is now performed by default
removes SVML dynamic dispatching overhead for programs built with -x… options
avoiding dispatching at run time improves performance
a direct call to the most suitable implementation is made. The downside is that the “best” implementation may be a different one on future hardware; by not calling the dispatcher, the program stays with the implementation that performs best today
-[f|Q]imf-force-dynamic-target[[=|:]funclist] restores the previous behavior
As software developers, we
Need value safety guarantees promised by fp-model precise
Need maximum performance through vectorization and SIMD
Problem:
LIBM and SVML use different algorithms, neither of which is correctly rounded (0.5 ulp), and hence deliver different results (the difference is in the least significant bit, but it can accumulate quickly and become visible at the application level)
FP consistency of vector vs. scalar code
for scalar math LIBM is used, which does not guarantee results that are bitwise identical to those of the vectorized math library SVML
-[f|Q]imf-use-svml changes scalar LIBM calls to “scalar” SVML calls
the compiler then vectorizes math functions under fp-model precise
New options and features
The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake
Compile with processor-specific option [-/Q]xCORE-AVX512
Many HPC apps benefit from aggressive ZMM usage
Most non-HPC apps degrade from aggressive ZMM usage
A new compiler option [-q/Q]opt-zmm-usage=low|high is added to enable a smooth transition from AVX2 to AVX-512
#include <cmath>
void foo(double *a, double *b, int size) {
  #pragma omp simd
  for (int i = 0; i < size; i++)
    b[i] = exp(a[i]);
}
icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 foo.cpp
remark #15305: vectorization support: vector length 4
…
remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
…
remark #15478: estimated potential speedup: 5.260
icpc -c -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -qopt-report:5 foo.cpp
remark #15305: vectorization support: vector length 8
…
remark #15478: estimated potential speedup: 10.110
HW based sampling PGO
Source -> Compiler with -[Q]prof-gen-sampling -> .obj -> Linker -> .exe
(Prepares for data collection) The object file contains an extended form of debug information, but does not contain instrumentation.
.exe -> VTune: Amplxe-pgo-report.sh <app> <options> -> .pgo
(Training runs) VTune collects executed branches using the processor's Last Branch Record counter.
.pgo -> profmergesampling -> .db
(Optional step) Merges 1 or more .pgo files into a database. Optional, but speeds up the next step if done.
.pgo or .db -> Compiler with -[Q]prof-use-sampling:<data_file> -> .obj
(Feedback compilation) Performs PGO optimizations. Can take a .pgo or .db file as input.
For Intel® C/C++ compiler version 18.0
Full C++11 and C++14 support
Full C11 support
Initial C++17 support
Parallel STL* for parallel and vector execution of the C++ STL
GNU compatibility
use the -std switch, e.g. -std=c++17 or -std=c11
Fully compatible with Microsoft Visual C++ 2017 with respect to C++14 and C11 functionality.
Standard support
Comparison to GNU and Microsoft can be found at: https://software.intel.com/en-us/articles/c0x-features-supported-by-intel-c-compiler
Improved SSE support
512 bit operators support
Control-Flow Enforcement Technology
__declspec(notrack)
OpenMP 4.5 and 5.0 support
taskloop, task_reduction and in_reduction clauses
C11 _Atomic keyword
Counted range based for loops
#pragma omp taskloop num_tasks(32)
for (long l = 0; l < 1024; l++)
  do_something(l);
typedef void (*fp_type)(void);
void foo(fp_type f) {
  __declspec(notrack) fp_type p = f;
  (*p)();
}
std::vector<char> v;
for (auto iter = v.begin(); iter != v.end(); iter++) …
for (char c : v) …
C/C++ new features
Nested namespace definitions
Terse static_assert
Relaxed range based for loops
Loosens requirements on type of range object, e.g. end() type does not need to match begin() type
end() type does not need to be an iterator
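The relaxed rule can be illustrated with a sentinel-terminated range. The sketch below is a minimal example, assuming a C++17 compiler; the names NullSentinel, CStrIter, CStrRange and count_chars are hypothetical and not part of any library. Here begin() returns an iterator while end() returns a different, non-iterator type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel type: marks the end of a null-terminated string.
struct NullSentinel {};

// Hypothetical iterator over the characters of a C string.
struct CStrIter {
    const char* p;
    char operator*() const { return *p; }
    CStrIter& operator++() { ++p; return *this; }
    // Comparison against the sentinel: iteration stops at the terminating '\0'.
    bool operator!=(NullSentinel) const { return *p != '\0'; }
};

// Hypothetical range whose begin() and end() return different types.
struct CStrRange {
    const char* s;
    CStrIter begin() const { return CStrIter{s}; }
    NullSentinel end() const { return NullSentinel{}; }
};

// Counts characters with a range-based for loop; only valid since C++17,
// because begin() and end() do not have the same type.
std::size_t count_chars(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (char c : CStrRange{s}) {
        (void)c;  // the character itself is not needed here
        ++n;
    }
    return n;
}
```

With a C++14 compiler the loop in count_chars would be rejected, since the older rule required begin() and end() to return the same type.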
namespace outer {
  namespace nested {
    void foo();
  }
}
namespace outer::nested { void foo() … } // shorthand
static_assert(array_index != 0);
Will print a default message
ex.cpp:4:4: error: static assertion failed
C++17 new features
Remove deprecated register keyword
Remove deprecated operator++(bool)
[[fallthrough]], [[nodiscard]], and [[maybe_unused]]
Using attribute namespaces without repetition
Standard and nonstandard attributes
__has_include
hex FP literals
0x1.999999999999ap-4
void foo() {
  int x = 42; // may warn about this
  [[maybe_unused]] int y = 42; // Warning suppressed
}
switch (c) {
case 'a':
  f(); // Warning emitted
case 'b':
  g();
  [[fallthrough]]; // Warning suppressed
case 'c':
  h();
}
#if __has_include(<optional>)
#include <optional>
#endif
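The fragments above can be combined into one small translation unit. This is only a sketch, assuming a C++17 compiler; classify, demo, tenth and HAVE_OPTIONAL are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// __has_include: test for a header before including it.
#if __has_include(<optional>)
#include <optional>
#define HAVE_OPTIONAL 1
#else
#define HAVE_OPTIONAL 0
#endif

// [[nodiscard]]: the compiler may warn when a caller ignores the result.
[[nodiscard]] int classify(char c) {
    int score = 0;
    switch (c) {
    case 'a':
        score += 1;
        [[fallthrough]];  // intentional: 'a' also receives the 'b' handling
    case 'b':
        score += 10;
        break;
    default:
        break;
    }
    return score;
}

// Hexadecimal floating-point literal: the double nearest to 0.1.
constexpr double tenth = 0x1.999999999999ap-4;

int demo() {
    [[maybe_unused]] int unused = 42;  // no unused-variable warning
    int r = classify('a');
    assert(r == 11);
    return r;
}
```

Writing the constant as a hex literal fixes its exact bit pattern, which is useful when a value must round-trip without depending on decimal conversion.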
C++17 new features
Parallel STL
Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies
#include <algorithm>
#include <execution>
void increment( float *in, float *out, int N ) {
  using namespace std;
  using namespace std::execution;
  transform( par, in, in + N, out, []( float f ) {
    return f+1;
  });
}
specify namespaces std and std::execution
For any of the implemented algorithms, pass one of the values:
seq -> sequential execution
unseq -> try to use SIMD. The policy requires SIMD-safe functions
par -> use multithreading
par_unseq -> combined effect of unseq and par
What is Intel® SDLT?
The SIMD Data Layout Template library is a C++11 template library to quickly convert an Array of Structures (AOS) to a Structure of Arrays (SOA) representation
SDLT helps vectorize your code by making memory access contiguous, which can lead to more efficient code and better performance
[Figure: AOS vs. SOA. In AOS, each element A[i+0] … A[i+3] stores its X, Y and Z members together, so lanes 0 to 3 of a vector register are filled from X values that are not adjacent in memory. In SOA, X[i+0], X[i+1], X[i+2], … are stored contiguously (likewise Y and Z) and fill the lanes directly.]
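The difference between the two layouts can be sketched by hand in plain C++. This example deliberately does not use the SDLT API; Point3, Points3SoA, to_soa, sum_x_aos and sum_x_soa are hypothetical names, and the conversion is written out manually only to show what such a library automates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS: each element keeps its x, y, z members together.
struct Point3 { float x, y, z; };

// SOA: one contiguous array per member (hand-written, not SDLT).
struct Points3SoA {
    std::vector<float> x, y, z;
};

// Converts an AOS container into the SOA layout.
Points3SoA to_soa(const std::vector<Point3>& aos) {
    Points3SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Point3& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// AOS access: consecutive x values are 3 floats apart (strided loads).
float sum_x_aos(const std::vector<Point3>& aos) {
    float s = 0.0f;
    for (std::size_t i = 0; i < aos.size(); ++i) s += aos[i].x;
    return s;
}

// SOA access: consecutive x values are adjacent (unit-stride loads).
float sum_x_soa(const Points3SoA& soa) {
    assert(soa.x.size() == soa.y.size() && soa.y.size() == soa.z.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < soa.x.size(); ++i) s += soa.x[i];
    return s;
}
```

Both functions compute the same value; only the memory access pattern differs, and the unit-stride version is the one a vectorizer can load with plain vector loads.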
For Intel® Fortran compiler version 18.0
New Fortran features
COMPLEX arguments to trigonometric and hyperbolic intrinsic functions
complex(4), parameter :: x = atan(CMPLX(2,1))
complex(4) :: a
complex(4) :: var = CMPLX(2,1)
a = atan(var)
if ( x.eq.a ) then
  print *, "PASS"
end if
Optional BACK argument to MAXLOC and MINLOC
MAXLOC((/1,2,3,3,2,1/), back = .true.) gives 4.
MINLOC((/1,2,3,3,2,1/), back = .true.) gives 6.
FINDLOC intrinsic
The value of FINDLOC ([2, 6, 4, 6], VALUE = 6) is [2]
The value of FINDLOC ([2, 6, 4, 6], VALUE = 6, BACK = .TRUE.) is [4]
type t
integer i
type(t), allocatable :: next
type(t1), allocatable :: fwd
end type
type t1
integer j
end type
Allocatable components of recursive type and forward reference
Additional INIT options
-init=keyword [, keyword] or /Qinit:keyword [, keyword]
where keyword can be one of: [no]arrays, [no]snan, [no]zero, infinity, minus_infinity, huge, minus_huge, tiny, minus_tiny
e.g. -init=zero,minus_huge,snan
program test
  real x
  integer i
  complex c
end
New Fortran features
Subroutine dummy arguments can be pointers or assumed shape arrays, e.g.:
Vectorization and array contiguity
SUBROUTINE SUB(X,Y)
REAL, DIMENSION(:) :: X !assumed shape array
REAL, DIMENSION(:), POINTER :: Y !pointer
This avoids the need to pass parameters such as array bounds explicitly. The Fortran standard allows the actual arguments to be non-contiguous array sections or pointers, e.g.:
CALL SUB(A(1:100:10)) ! non-unit stride
CALL SUB(B(2:4,:)) ! incomplete columns
Therefore, the compiler cannot assume that consecutive elements of X and Y are adjacent in memory and so cannot blindly issue efficient vector loads for several elements at once.
If you know that such dummy arguments will always be contiguous in memory, you can use the CONTIGUOUS keyword to tell the compiler and it will generate more efficient code, e.g.:
Contiguity checking
SUBROUTINE SUB(X,Y)
REAL, DIMENSION(:),CONTIGUOUS :: X !assumed shape array
REAL, DIMENSION(:),CONTIGUOUS,POINTER :: Y !pointer
The CONTIGUOUS attribute may help the compiler optimize more effectively
[-/]assume contiguous_pointer and [-/]assume contiguous_assumed_shape
assert that all assumed shape arrays and/or pointers have unit stride
The [-/]check contiguous compiler option helps to diagnose non-contiguous pointer assignment to a CONTIGUOUS pointer.
Non-SIMD specific
Recap: parallel reduction
int sum=init(), A[100];
#pragma omp parallel for reduction(+:sum)
for (int i=0; i<100; i++)
  sum += A[i];
print(sum);
Conceptually, with two threads:
sum = init();
fork
  thread 1:
    int sum1 = 0;
    for(i=0; i<50; i++)
      sum1 += A[i];
    CRIT(sum += sum1);
  thread 2:
    int sum2 = 0;
    for(i=50; i<100; i++)
      sum2 += A[i];
    CRIT(sum += sum2);
join
print(sum);
OpenMP* 5.0: task reduction
int sum=init(), A[100];
#pragma omp taskgroup task_reduction(+:sum)
{
  #pragma omp task in_reduction(+:sum)
  for (int i=0; i<50; i++) sum += A[i];
  #pragma omp task in_reduction(+:sum)
  for (int i=50; i<100; i++) sum += A[i];
} // wait here until all tasks in taskgroup are completed
print(sum);
*based on OpenMP 5.0 TR6
Task reduction
taskgroup: int sum1 = 0; int sum2 = 0; (p1=&sum1, p2=&sum2)
  task 1:
    int * p1 = getMy(sum);
    for(i=0; i<50; i++)
      *p1 += A[i];
  task 2:
    int * p2 = getMy(sum);
    for(i=50; i<100; i++)
      *p2 += A[i];
wait
end taskgroup: sum += sum1+sum2;
print(sum);
Taskgroup
task_reduction clause
#pragma omp taskgroup task_reduction(op:list)
Reduction scoping clause
Defines the region in which reduction is computed by tasks
For each list item x
taskgroup keeps a copy of x for each task with in_reduction(x)
taskgroup begin: initialize all copies of x according to op
taskgroup end: reduce all copies of x into the original x
*based on OpenMP 5.0 TR6
Taskgroup
in_reduction clause
#pragma omp task in_reduction(op:list)
Reduction participating clause
The task participates in a reduction defined by a reduction scoping clause
For each list item x
Task requests a pointer p to its copy of x from an enclosing taskgroup
If nested, the innermost taskgroup with the matching task_reduction clause fulfills the request
All accesses to x in the task’s body are replaced by *p
*based on OpenMP 5.0 TR6
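The bookkeeping described for task_reduction and in_reduction can be emulated in plain C++. This is a conceptual sketch only, not how an OpenMP runtime is implemented: SumReduction, get_my (mirroring getMy from the diagram) and reduce_in_two_tasks are hypothetical names, the operator is fixed to +, and the two "tasks" simply run one after the other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper that mimics a taskgroup with task_reduction(+:x).
struct SumReduction {
    int& original;            // the original list item x
    std::vector<int> copies;  // one private copy of x per participating task

    // Taskgroup begin: every copy starts at the identity of '+', i.e. 0.
    SumReduction(int& x, std::size_t tasks) : original(x), copies(tasks, 0) {}

    // in_reduction: a task asks for a pointer to its own copy of x.
    int* get_my(std::size_t task_id) {
        assert(task_id < copies.size());
        return &copies[task_id];
    }

    // Taskgroup end: all copies are reduced into the original x.
    void finish() {
        for (int c : copies) original += c;
    }
};

// Sums A[0..n) in two halves, each half standing in for one task.
int reduce_in_two_tasks(const int* A, int n, int init) {
    int sum = init;
    SumReduction group(sum, 2);

    int* p1 = group.get_my(0);  // task 1: accesses to sum become *p1
    for (int i = 0; i < n / 2; ++i) *p1 += A[i];

    int* p2 = group.get_my(1);  // task 2: accesses to sum become *p2
    for (int i = n / 2; i < n; ++i) *p2 += A[i];

    group.finish();
    return sum;
}
```

Because each task only touches its own copy, no critical section is needed inside the loops; the single combine step happens once, at the end of the taskgroup.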
New Features of automatic Vectorization and explicit data parallel Programming via OpenMP* SIMD
New Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
512-bit wide vectors
32 operand registers
8 64b mask registers
Embedded broadcast
Embedded rounding
7 subsets: 2 exclusive to Intel® Xeon Phi™, 2 common, 3 exclusive to Intel® Xeon®
Microarchitecture      Instruction Set   SP FLOPs per cycle   DP FLOPs per cycle
Skylake                AVX512 & FMA      64                   32
Haswell / Broadwell    AVX2 & FMA        32                   16
Sandybridge            AVX (256b)        16                   8
Nehalem                SSE (128b)        8                    4
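The per-cycle figures in the table follow from simple arithmetic: lanes per vector, times floating-point operations per lane, times execution units. The sketch below reproduces them under assumptions that are not stated on the slide: two vector execution units per core in every row, an FMA counted as 2 FLOPs per lane, and one add unit plus one multiply unit (1 FLOP per lane each) on the pre-FMA microarchitectures. The function name flops_per_cycle is hypothetical.

```cpp
// lanes = vector width / element width; each unit retires
// flops_per_lane floating-point operations per lane per cycle.
constexpr int flops_per_cycle(int vector_bits, int element_bits,
                              int flops_per_lane, int units) {
    return (vector_bits / element_bits) * flops_per_lane * units;
}

// Assumed counting rules (not stated on the slide).
constexpr int kFma = 2;    // fused multiply-add = 2 FLOPs per lane
constexpr int kPlain = 1;  // separate add or multiply = 1 FLOP per lane

// Skylake, AVX-512 & FMA, assuming two FMA units.
static_assert(flops_per_cycle(512, 32, kFma, 2) == 64, "Skylake SP");
static_assert(flops_per_cycle(512, 64, kFma, 2) == 32, "Skylake DP");
// Haswell / Broadwell, AVX2 & FMA, assuming two FMA units.
static_assert(flops_per_cycle(256, 32, kFma, 2) == 32, "Haswell SP");
static_assert(flops_per_cycle(256, 64, kFma, 2) == 16, "Haswell DP");
// Sandybridge, AVX 256-bit, assuming one add unit and one multiply unit.
static_assert(flops_per_cycle(256, 32, kPlain, 2) == 16, "Sandybridge SP");
static_assert(flops_per_cycle(256, 64, kPlain, 2) == 8, "Sandybridge DP");
// Nehalem, SSE 128-bit, assuming one add unit and one multiply unit.
static_assert(flops_per_cycle(128, 32, kPlain, 2) == 8, "Nehalem SP");
static_assert(flops_per_cycle(128, 64, kPlain, 2) == 4, "Nehalem DP");
```

These are peak numbers per core; a processor model with fewer vector units than assumed here would reach proportionally less.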
Intel® AVX-512 Instruction Types
AVX512-PF Prefetch: multi-address prefetch using gather/scatter semantics
AVX512-ER Exponential and Reciprocal: ‘wide’ approximation of Log and RCP/RSQRT
AVX512-F AVX-512 Foundation Instructions
AVX512-CD Conflict Detect : used in vectorizing loops with potential address conflicts
AVX512-VL Vector Length Orthogonality : ability to operate on sub-512 vector sizes
AVX512-BW 512-bit Byte/Word support
AVX512-DQ Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
Tuning for Skylake - Compiler options
Both Skylake and Knights Landing processors have support for Intel® AVX-512 instructions. There are three ISA options in the Intel® Compiler:
-xCORE-AVX512 : Targets Skylake, contains instructions not supported by Knights Landing
-xCOMMON-AVX512 : Targets both Skylake and Knights Landing
-xMIC-AVX512 : Targets Knights Landing, includes instructions not supported by Skylake
The Intel® Compiler is conservative in its use of ZMM (512-bit) registers, so to enable their use on Skylake the additional flag -qopt-zmm-usage=high must be set.
Ordered blocks in SIMD contexts
#pragma omp ordered simd
Semantics: The ordered construct with the simd clause specifies a structured block in the simd loop or SIMD function that will be executed in the order of the loop iterations or of the sequence of calls to SIMD functions.
Rules: #pragma omp ordered simd is only allowed inside a SIMD loop or SIMD-enabled function.
The #pragma omp ordered simd region must be a single-entry and single-exit code block
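A typical use is a loop in which most of the work vectorizes, but one update may touch the same location from several iterations. The sketch below is a minimal example, assuming an OpenMP 4.5 capable compiler (without OpenMP the pragmas are ignored and the loop runs sequentially with the same result); the function name histogram and the factor 2.0f are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Accumulates scaled weights into bins. Several iterations may hit the
// same bin, so the update is placed in an ordered simd block.
void histogram(const std::vector<int>& bin, const std::vector<float>& w,
               std::vector<float>& hist) {
    assert(bin.size() == w.size());
    const int n = static_cast<int>(bin.size());
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        float scaled = w[i] * 2.0f;  // independent per iteration, vectorizable
        #pragma omp ordered simd
        {
            hist[bin[i]] += scaled;  // executed in loop-iteration order
        }
    }
}
```

The ordered block is a single-entry, single-exit region, as the rules above require; everything outside it remains free to execute in SIMD fashion.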
Monotonic keyword for ordered
#pragma omp ordered simd monotonic([var:step]s)
Semantics: Same as for ‘omp ordered simd’, with a hint that the vars inside the structured block change monotonically with respect to execution.
Why: With this hint the compiler can generate better vector code if the CPU supports compress or expand instructions.
Monotonic keyword for ordered
OK:
#pragma omp simd
for (int i = 0; i < n; ++i) {
  if (cond) {
    #pragma omp ordered simd monotonic(j:1)
    {
      if (c[i] > 0)
        q[j++] = b[i]; // compress pattern
      a[i] = j; // expand pattern
    }
  }
  if (cond1) {
    #pragma omp ordered simd monotonic(k:1)
    if (c[i] > 0)
      b[i] = p[k++]; // expand pattern
  }
}
Not OK (the same variable j is used in two monotonic blocks; the compiler won’t complain!):
#pragma omp simd
for (int i = 0; i < n; ++i) {
  if (cond) {
    #pragma omp ordered simd monotonic(j:1)
    {
      if (c[i] > 0)
        q[j++] = b[i]; // compress pattern
      a[i] = j;
    }
  }
  if (cond1) {
    #pragma omp ordered simd monotonic(j:1)
    if (c[i] > 0)
      b[i] = p[j++]; // expand pattern
  }
}
Overlap keyword for ordered
#pragma omp ordered simd overlap(overlap_index)
Semantics: Same as for ‘omp ordered simd’, with a hint that overlap_index may have equal values in different lanes during vector execution. The compiler will resolve conflicts for indirect accesses with respect to this overlap_index.
Why: With this hint the compiler can generate better vector code if the CPU supports the vconflict instruction, for example a CPU with AVX512CD support.
overlap keyword for ordered
OK:
#pragma omp simd
for (int i = 0; i < n; ++i) {
  #pragma omp ordered simd overlap(b[i])
  { ++a[b[i]]; } // conflict pattern; compiler will resolve conflict according to b[i]
  #pragma omp ordered simd overlap(i)
  { ++a[b[i]]; } // conflict pattern; compiler will resolve conflict according to i
}
#pragma omp simd
for (int i = 0; i < n; ++i) {
  #pragma omp ordered simd overlap(b[i])
  {
    ++a[b[i]]; // conflict pattern; compiler will resolve conflict according to b[i]
    c[b[i]] = b[i] + c[b[i]]; // conflict pattern; reassociation is allowed
  }
}
Bail-out to ordered:
double *a;
int32_t *c;
#pragma omp simd
for (int i = 0; i < n; ++i) {
  #pragma omp ordered simd overlap(b[i])
  {
    ++a[b[i]]; // conflict pattern
    ++c[b[i]]; // conflict pattern
  }
}
double *a0, *a1;
#pragma omp simd
for (int i = 0; i < n; ++i) {
  #pragma omp ordered simd overlap(b[i])
  {
    ++a0[b[i]]; // conflict pattern; ok to reassociate
    a1[b[i]] = 1/a1[b[i]]; // conflict pattern; reassociation is not allowed
  }
}
Conditional scalar assignment
OK:
#pragma omp simd lastprivate(conditional: clp)
for (int i = 0; i < n; ++i) {
  if (cond)
    clp = a[i];
}
#pragma omp simd lastprivate(conditional: clp)
for (int i = 0; i < n; ++i) {
  if (cond) {
    clp = a[i];
    b[i] = clp;
  }
}
Not OK (clp is read in an iteration that may not have assigned it; the compiler won’t complain!):
#pragma omp simd lastprivate(conditional: clp)
for (int i = 0; i < n; ++i) {
  if (cond)
    clp = a[i];
  b[i] = clp;
}
#pragma omp simd lastprivate(conditional: clp)
for (int i = 0; i < n; ++i) {
  b[i] = clp;
  if (cond)
    clp = a[i];
}
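A self-contained version of the first OK pattern might look as follows. This is a sketch, assuming an OpenMP 5.0 capable compiler (without OpenMP the pragma is ignored and the loop runs sequentially with the same result); last_positive is a hypothetical name, and the example assumes that at least one element is positive so that the conditional assignment happens at least once.

```cpp
#include <cassert>
#include <vector>

// Returns the last positive element of a.
// Assumption for this sketch: a contains at least one positive element.
int last_positive(const std::vector<int>& a) {
    assert(!a.empty());
    int clp = 0;
    const int n = static_cast<int>(a.size());
    #pragma omp simd lastprivate(conditional: clp)
    for (int i = 0; i < n; ++i) {
        if (a[i] > 0) {
            clp = a[i];  // only assigned, never read, inside the loop
        }
    }
    // clp holds the value from the sequentially last iteration that assigned it
    return clp;
}
```

After the loop, clp is the value written by the sequentially last iteration that performed the assignment, which is what a scalar execution of the same loop would leave behind.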
Code that performs and outperforms
Download a free, 30-day trial of Intel® Parallel Studio XE 2018 today
https://software.intel.com/en-us/intel-parallel-studio-xe/try-buy
And don’t forget to check your inbox for the evaluation survey, which will be emailed after this presentation.
P.S. Everyone who fills out the survey will receive a personalized certificate indicating completion of the training!
Q & A
Questions?
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Visual Studio 2015 Shell for “standalone” Fortran
COMPILER_OPTIONS and COMPILER_VERSION in iso_fortran_env
Implied-do bounds expressions in a DATA statement can now be any constant expression
Passing a non-pointer data item to a pointer dummy argument
program main
  use ISO_FORTRAN_ENV
  implicit none
  print *, "Compiler version: ", compiler_version()
  print *, "Compiler options: ", compiler_options()
end program
DATA ((B (K, J), J = K + 1, size(a), size(c)-2),
K = 1, size(a)-2) / 9 * 1 /
integer, target :: j = 17 ! Must have target attribute
call foo2(j)
contains
subroutine foo2(i)
  integer, intent(in), pointer :: i
  print *,i
end subroutine
end
New Fortran features
real, allocatable :: r(:)
!dir$ attributes memkind:hbw :: r
print *, for_get_memkind(r) !prints 1 for HBW
!dir$ memkind:ddr
allocate(r(1000))
print *, for_get_memkind(r) !prints 0 for DDR
end
MEMKIND attribute and directive
!DIR$ ATTRIBUTES MEMKIND: HBW | DDR :: obj
!DIR$ MEMKIND:HBW | DDR [,ALIGN:N]
intrinsic FOR_GET_MEMKIND
Polymorphic assignment with allocatable LHS and not a coarray.
class(parent), allocatable :: var
class(parent), allocatable :: expr1
class(child), allocatable :: expr2
var = expr1
var = expr2 ! Type compatible
Pointer function reference in a variable definition context
program main
integer, target :: var(10)
storage(1) = 11
print *, var(1) !Print 11
contains
function storage(key) result(loc)
integer, intent(in) :: key
integer, pointer :: loc
loc => var(key)
end function
end
New Fortran features