
© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice.

Dr. Fabio Baruffa

Technical Consulting Engineer

Build optimized C/C++ & Fortran Applications for Linux*, macOS* & Windows*

https://software.intel.com/en-us/articles/optimization-notice#opt-en


Agenda

Overview

New compiler options and features

What’s new in C/C++

What’s new in Fortran

New Features upcoming in OpenMP* 5.0 (non SIMD specific)

Vectorization Enhancement

AVX-512

OpenMP* SIMD

Summary


Fast, scalable, parallel code with Intel® Compilers

Deliver industry-leading C/C++ & Fortran code performance, unleash the power of the latest Intel® processors

Develop optimized and vectorized code for various Intel® architectures, including Intel® Xeon® and Xeon Phi™ processors

Leverage latest language and OpenMP* standards, and compatibility with leading compilers & IDEs

Take advantage of Priority Support, which connects you directly to Intel engineers for technical questions (paid versions only)


https://software.intel.com/en-us/support/intel-premier-support

For Intel® C/C++ and Fortran Compilers version 18.0


New options and features

Microsoft Visual Studio 2017 support

Stack security protection changes with /GS

− Full stack security level checking is now provided by default

− The performance of some programs may be impacted; in that case use /GS:partial

Control-Flow Enforcement Technology

enable protection against exploit vulnerabilities

-[q|Q]cf-protection[[=|:]keyword]

where keyword is shadow_stack, branch_tracking, full or none

https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf




New options and features

Function splitting

Useful for controlling the inline depth of functions that are too large

-[f|Q]fnsplit[=n], where 0 <= n <= 100

Enables function splitting for blocks of a function whose execution probability is less than or equal to n

Forces the compiler to perform function splitting even if there is no dynamic profiling

Also supports GCC's -freorder-blocks-and-partition (requires dynamic profiling)




-[q|Q][no-]opt-assume-safe-padding

Allows the compiler to assume safe access of up to 64 bytes after each array/variable allocated by a program

Reintroduced this KNC-specific option for all targets supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Foundation instructions


    struct {
        float f0;
        float f1;
        float f2;
    } d[];

    for (int i = 0; i < size; i++) {
        int j = index[i];
        tmp += d[j].f0 + d[j].f1 + d[j].f2;
    }

-xCORE-AVX512:

    vmovups d(index[i+0]*12), %xmm0{%k1}
    vmovups d(index[i+1]*12), %xmm1{%k1}
    …
    vmovups d(index[i+15]*12), %xmm15{%k1}
    vpermi2ps …

Sometimes allows generating unmasked loads
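The padding assumption can be made true on the application side. The following is a minimal sketch, not taken from the slides: `Point3`, `kSafePadding`, `alloc_padded` and `sum_fields` are hypothetical names, and the 64 bytes are simply the figure quoted above. It allocates the records with 64 trailing bytes so that a 16-byte load of a 12-byte record stays inside the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical 12-byte record, matching the three-float struct above.
struct Point3 {
    float f0, f1, f2;
};
static_assert(sizeof(Point3) == 12, "three packed floats");

// Assumed padding size, taken from the "up to 64 bytes" wording above.
constexpr std::size_t kSafePadding = 64;

// Allocate n zero-initialized records plus 64 trailing bytes, so a
// full-width load that starts at the last record stays in bounds.
inline Point3* alloc_padded(std::size_t n) {
    void* p = std::calloc(1, n * sizeof(Point3) + kSafePadding);
    if (!p) throw std::bad_alloc();
    return static_cast<Point3*>(p);
}

// Indirect (gather-style) sum over all three fields of selected records.
inline float sum_fields(const Point3* d, const int* index, int size) {
    float tmp = 0.0f;
    for (int i = 0; i < size; i++) {
        int j = index[i];
        tmp += d[j].f0 + d[j].f1 + d[j].f2;
    }
    return tmp;
}
```

Memory obtained from `alloc_padded` is released with `std::free`.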



SVML call dispatching at compile time:

Dynamic dispatching can create an indirect jump, which causes overhead

A direct call to the CPU-specific SVML entry is now performed by default

Removes SVML dynamic dispatching overhead for programs built with -x… options

No dispatching at run time improves performance

A direct call is made to the most suitable implementation. The downside is that the "best" implementation may be different on future hardware; by not calling the dispatcher, the program stays with the current best performance

-[f|Q]imf-force-dynamic-target[[=|:]funclist] restores the previous behavior
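The difference between the two call styles can be pictured with a small sketch. This is only an illustration of the idea, not how SVML is implemented: `exp_generic`, `exp_avx512`, `cpu_has_avx512`, `exp_dispatched` and `exp_direct` are all hypothetical names, and both "implementations" simply forward to `std::exp`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for two CPU-specific implementations.
inline double exp_generic(double x) { return std::exp(x); }
inline double exp_avx512(double x) { return std::exp(x); }

// Hypothetical CPU check; a real dispatcher would query CPUID.
inline bool cpu_has_avx512() { return false; }

using exp_fn = double (*)(double);

// Dynamic dispatch: the target is chosen at run time and every call
// goes through a function pointer (an indirect jump).
inline double exp_dispatched(double x) {
    static const exp_fn target = cpu_has_avx512() ? exp_avx512 : exp_generic;
    return target(x);
}

// Direct call: the target is fixed when the program is built, so no
// pointer lookup is needed at run time.
inline double exp_direct(double x) { return exp_generic(x); }
```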





FP consistency: vector vs. scalar code

As software developers, we

Need the value safety guarantees promised by fp-model precise

Need maximum performance through vectorization and SIMD

Problem:

For scalar math LIBM is used, which does not guarantee results that are bitwise identical to those of the vectorized math library SVML

LIBM and SVML use different algorithms, neither of which is correctly rounded (to 0.5 ulp), so they deliver different results (the difference is in the least significant bit, but it can accumulate fast and become visible at the application level)

-[f|Q]imf-use-svml changes scalar LIBM calls to "scalar" SVML calls

The compiler can then vectorize math functions under fp-model precise
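To see whether a scalar build and a vectorized build agree, the two result arrays can be compared bit by bit. The sketch below is an assumption-laden helper, not part of the compiler or its libraries: `ulp_distance` and `count_mismatches` are hypothetical names, and the distance is only meaningful for finite doubles of the same sign.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Distance between two finite doubles of the same sign, measured in
// units in the last place, using their integer bit patterns.
inline std::int64_t ulp_distance(double a, double b) {
    std::int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);
    std::memcpy(&ib, &b, sizeof b);
    return ia > ib ? ia - ib : ib - ia;
}

// Count elements whose two results are not bitwise identical.
inline int count_mismatches(const double* x, const double* y, int n) {
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (ulp_distance(x[i], y[i]) != 0) ++mismatches;
    return mismatches;
}
```

A typical use would be to fill `x` from a scalar loop and `y` from a vectorized loop over the same inputs, then check that `count_mismatches` returns 0.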


The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake

Compile with processor-specific option [-/Q]xCORE-AVX512

Many HPC apps benefit from aggressive ZMM usage

Most non-HPC apps lose performance with aggressive ZMM usage

A new compiler option, [-q/Q]opt-zmm-usage=low|high, has been added to enable a smooth transition from AVX2 to AVX-512

    void foo(double *a, double *b, int size)
    {
        #pragma omp simd
        for (int i = 0; i < size; i++)
            b[i] = exp(a[i]);
    }

    icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 foo.cpp

    remark #15305: vectorization support: vector length 4
    …
    remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
    …
    remark #15478: estimated potential speedup: 5.260





With -qopt-zmm-usage=high:

    icpc -c -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -qopt-report:5 foo.cpp

remark #15305: vectorization support: vector length 8

…

remark #15478: estimated potential speedup: 10.110
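The two vector lengths in the reports follow from the register width: 4 doubles fill a 256-bit YMM register and 8 doubles fill a 512-bit ZMM register. A minimal sketch of that arithmetic, with `vector_length` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>

// Number of elements of type T that fit in a vector register of the
// given width in bits.
template <typename T>
constexpr std::size_t vector_length(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(vector_length<double>(256) == 4, "YMM holds 4 doubles");
static_assert(vector_length<double>(512) == 8, "ZMM holds 8 doubles");
```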



HW based sampling PGO

1. Compiler with -[Q]prof-gen-sampling (prepares for data collection): the object file (.obj) contains an extended form of debug information, but does not contain instrumentation. The linker then produces the executable (.exe).

2. VTune (training runs): amplxe-pgo-report.sh <app> <options>. VTune collects executed branches using the processor's Last Branch Record counter and produces .pgo files.

3. profmergesampling (optional step): merges 1 or more .pgo files into a database (.db). Optional, but speeds up the next step if done.

4. Compiler with -[Q]prof-use-sampling:<data_file> (feedback compilation): performs PGO optimizations and produces the final .obj. Can take a .pgo or .db file as input.


For Intel® C/C++ compiler version 18.0


Full C++11 and C++14 support

Full C11 support

Initial C++17 support

Parallel STL* for parallel and vector execution of the C++ STL

GNU compatibility

use the -std switch, e.g. -std=c++17 or -std=c11

Fully compatible with Microsoft Visual C++ 2017 with respect to C++14 and C11 functionality.


Standard support

Comparison to GNU and Microsoft can be found at: https://software.intel.com/en-us/articles/c0x-features-supported-by-intel-c-compiler


C/C++ new features

Improved SSE support

512 bit operators support

Control-Flow Enforcement Technology: __declspec(notrack)

    typedef void (*fp_type)(void);

    void foo(fp_type f)
    {
        __declspec(notrack) fp_type p = f;
        (*p)();
    }

OpenMP 4.5 and 5.0 support: taskloop, task_reduction and in_reduction clauses

    #pragma omp taskloop num_tasks(32)
    for (long l = 0; l < 1024; l++)
        do_something(l);

C11 _Atomic keyword

Counted range based for loops

    std::vector<char> v;
    for (auto iter = v.begin(); iter != v.end(); iter++) …
    for (char c : v) …
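A complete function around the taskloop fragment might look as follows. This is a sketch under assumptions: `squares` is a hypothetical name, and the enclosing `parallel` and `single` constructs are added here so that one thread creates the tasks and the rest of the team executes them. When OpenMP is not enabled the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// Fill out[l] = l * l using an OpenMP taskloop split into 32 tasks.
inline std::vector<long> squares(long n) {
    std::vector<long> out(n);
    long* data = out.data();
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop num_tasks(32)
    for (long l = 0; l < n; l++) {
        data[l] = l * l;  // each iteration writes its own element
    }
    return out;
}
```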



C++17 new features

Nested namespace definitions

    namespace outer {
        namespace nested {
            void foo();
        }
    }

    namespace outer::nested { void foo() … } // shorthand

Terse static_assert

    static_assert(array_index != 0);

Will print a default message

    ex.cpp:4:4: error: static assertion failed

Relaxed range based for loops

Loosens requirements on type of range object, e.g. end() type does not need to match begin() type

end() type does not need to be an iterator
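The three features can be combined in one small example. The following sketch assumes a C++17 compiler; `zero_sentinel`, `cstring_iterator`, `cstring_range` and `length` are hypothetical names chosen for illustration. The range's `end()` returns a sentinel type that is neither an iterator nor the same type as `begin()`.

```cpp
#include <cassert>
#include <cstddef>

namespace outer::nested {  // C++17 nested namespace definition

// Sentinel type: in C++17 end() does not have to return an iterator.
struct zero_sentinel {};

struct cstring_iterator {
    const char* p;
    char operator*() const { return *p; }
    cstring_iterator& operator++() { ++p; return *this; }
    // The loop ends when the terminating '\0' is reached.
    bool operator!=(zero_sentinel) const { return *p != '\0'; }
};

// Range whose begin() and end() return different types.
struct cstring_range {
    const char* s;
    cstring_iterator begin() const { return {s}; }
    zero_sentinel end() const { return {}; }
};

inline std::size_t length(const char* s) {
    std::size_t n = 0;
    for (char c : cstring_range{s}) {  // relaxed range-based for
        (void)c;
        ++n;
    }
    return n;
}

}  // namespace outer::nested

static_assert(sizeof(char) == 1);  // terse static_assert, no message
```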



C++17 new features

Remove deprecated register keyword

Remove deprecated operator++(bool)

[[fallthrough]], [[nodiscard]], and [[maybe_unused]]

    void foo()
    {
        int x = 42; // may warn about this
        [[maybe_unused]] int y = 42; // Warning suppressed
    }

    switch (c) {
    case 'a':
        f(); // Warning emitted
    case 'b':
        g();
        [[fallthrough]]; // Warning suppressed
    case 'c':
        h();
    }

Using attribute namespaces without repetition

Standard and nonstandard attributes

__has_include

    #if __has_include(<optional>)
    #include <optional>
    #endif

hex FP literals

    0x1.999999999999ap-4
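A compact sketch that exercises these features together, assuming a C++17 compiler. `HAVE_OPTIONAL`, `tenth`, `weight` and `answer` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

#if __has_include(<optional>)
#include <optional>
#define HAVE_OPTIONAL 1
#else
#define HAVE_OPTIONAL 0
#endif

// Hex floating-point literal from the list above: the double nearest to 0.1.
constexpr double tenth = 0x1.999999999999ap-4;

// [[nodiscard]] asks the compiler to warn if the result is ignored.
[[nodiscard]] inline int weight(char c) {
    int w = 0;
    switch (c) {
    case 'a':
        w += 4;
        [[fallthrough]];  // intentional: 'a' also gets the 'b' and 'c' weight
    case 'b':
        w += 2;
        [[fallthrough]];
    case 'c':
        w += 1;
        break;
    default:
        break;
    }
    return w;
}

inline int answer() {
    [[maybe_unused]] int unused = 42;  // no unused-variable warning
    return weight('b');
}
```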



Parallel STL

Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies

    #include <algorithm>
    #include <execution>

    void increment(float *in, float *out, int N)
    {
        using namespace std;
        using namespace std::execution;
        transform(par, in, in + N, out, [](float f) {
            return f + 1;
        });
    }

Specify the namespaces std and std::execution

For any of the implemented algorithms, pass one of the values:

seq -> sequential execution

unseq -> try to use SIMD. The policy requires SIMD-safe functions

par -> use multithreading

par_unseq -> combined effect of unseq and par



What is Intel® SDLT?

The SIMD Data Layout Template library is a C++11 template library to quickly convert an Array of Structures to a Structure of Arrays representation

SDLT helps vectorize your code by making memory accesses contiguous, which can lead to more efficient code and better performance

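[Figure: AOS vs. SOA. In the AOS layout, the elements A[i+0], A[i+1], A[i+2], A[i+3] each hold X, Y, Z side by side, so lanes 0 to 3 of a vector register are filled from non-adjacent X values. In the SOA layout, X[i+0] X[i+1] X[i+2] …, Y[i+0] Y[i+1] Y[i+2] … and Z[i+0] Z[i+1] Z[i+2] … are each stored contiguously, so lanes 0 to 3 are filled from adjacent memory.]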
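The two layouts can be written out by hand in plain C++. The sketch below does not use the SDLT API; `PointAoS`, `PointsSoA`, `to_soa` and `sum_x` are hypothetical names that only illustrate the conversion SDLT is designed to automate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: x, y, z of one element are adjacent in memory.
struct PointAoS {
    float x, y, z;
};

// Structure of Arrays: all x values are contiguous, likewise y and z.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Hand-written AoS to SoA conversion.
inline PointsSoA to_soa(const std::vector<PointAoS>& a) {
    PointsSoA s;
    s.x.reserve(a.size());
    s.y.reserve(a.size());
    s.z.reserve(a.size());
    for (const PointAoS& p : a) {
        s.x.push_back(p.x);
        s.y.push_back(p.y);
        s.z.push_back(p.z);
    }
    return s;
}

// Unit-stride loop over x: consecutive iterations read adjacent floats.
inline float sum_x(const PointsSoA& s) {
    float total = 0.0f;
    for (std::size_t i = 0; i < s.x.size(); ++i) total += s.x[i];
    return total;
}
```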


For Intel® Fortran compiler version 18.0


New Fortran features

COMPLEX arguments to trigonometric and hyperbolic intrinsic functions

    complex(4), parameter :: x = atan(CMPLX(2,1))
    complex(4) :: a
    complex(4) :: var = CMPLX(2,1)
    a = atan(var)
    if ( x.eq.a ) then
       print *, "PASS"
    end if

Optional BACK argument to MAXLOC and MINLOC

MAXLOC((/1,2,3,3,2,1/), back = .true.) gives 4.

MINLOC((/1,2,3,3,2,1/), back = .true.) gives 6.

FINDLOC intrinsic

The value of FINDLOC ([2, 6, 4, 6], VALUE = 6) is [2]

The value of FINDLOC ([2, 6, 4, 6], VALUE = 6, BACK = .TRUE.) is [4]



Allocatable components of recursive type and forward reference

    type t
       integer i
       type(t), allocatable :: next
       type(t1), allocatable :: fwd
    end type
    type t1
       integer j
    end type

Additional INIT options

-init=keyword [, keyword] or /Qinit:keyword [, keyword]

where keyword can be one of: [no]arrays, [no]snan, [no]zero, infinity, minus_infinity, huge, minus_huge, tiny, minus_tiny

e.g. -init=zero,minus_huge,snan

    program test
       real x
       integer i
       complex c
    end




Vectorization and array contiguity

Subroutine dummy arguments can be pointers or assumed shape arrays, e.g.:

SUBROUTINE SUB(X,Y)

REAL, DIMENSION(:) :: X !assumed shape array

REAL, DIMENSION(:), POINTER :: Y !pointer

This avoids the need to pass parameters such as array bounds explicitly. The Fortran standard allows the actual arguments to be non-contiguous array sections or pointers, e.g.:

CALL SUB(A(1:100:10)) ! non-unit stride

CALL SUB(B(2:4,:)) ! incomplete columns

Therefore, the compiler cannot assume that consecutive elements of X and Y are adjacent in memory and so cannot blindly issue efficient vector loads for several elements at once.



Contiguity checking

If you know that such dummy arguments will always be contiguous in memory, you can use the CONTIGUOUS keyword to tell the compiler, and it will generate more efficient code, e.g.:

    SUBROUTINE SUB(X,Y)
    REAL, DIMENSION(:),CONTIGUOUS :: X !assumed shape array
    REAL, DIMENSION(:),CONTIGUOUS,POINTER :: Y !pointer

The CONTIGUOUS attribute may help the compiler optimize more effectively

[-/]assume contiguous_pointer and [-/]assume contiguous_assumed_shape assert that all assumed shape arrays and/or pointers have unit stride

The [-/]check contiguous compiler option helps diagnose a non-contiguous pointer assignment to a CONTIGUOUS pointer.


Non-SIMD specific


Recap: parallel reduction

    int sum=init(), A[100];
    #pragma omp parallel for reduction(+:sum)
    for (i=0; i<100; i++)
        sum += A[i];
    print(sum);

Conceptual execution with two threads:

    sum = init();
    fork
        thread 1: int sum1 = 0;
                  for(i=0; i<50; i++) sum1 += A[i];
                  CRIT(sum += sum1);
        thread 2: int sum2 = 0;
                  for(i=50; i<100; i++) sum2 += A[i];
                  CRIT(sum += sum2);
    join
    print(sum);
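As a self-contained function, the reduction loop could be written as below. `reduce_sum` is a hypothetical name; `init` stands in for the value returned by init() on the slide. Without OpenMP enabled the pragma is ignored and the result is the same.

```cpp
#include <cassert>

// Sum an array with an OpenMP reduction: each thread accumulates into a
// private copy of sum, and the copies are combined at the join.
inline int reduce_sum(const int* A, int n, int init) {
    int sum = init;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}
```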



OpenMP* 5.0: task reduction

*based on OpenMP 5.0 TR6

    int sum=init(), A[100];
    #pragma omp taskgroup task_reduction(+:sum)
    {
        #pragma omp task in_reduction(+:sum)
        for (i=0; i<50; i++) sum += A[i];
        #pragma omp task in_reduction(+:sum)
        for (i=50; i<100; i++) sum += A[i];
    } // wait here until all tasks in taskgroup are completed
    print(sum);
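A complete version of the task reduction might look like this. It is a sketch under assumptions: `task_sum` is a hypothetical name, the array is assumed to hold 100 elements, and the `parallel` and `single` constructs are added so that one thread creates the two tasks. It requires a compiler that accepts the OpenMP 5.0 task_reduction and in_reduction clauses; without OpenMP the pragmas are ignored and the loops run serially.

```cpp
#include <cassert>

// Sum 100 elements with two tasks that take part in a task reduction.
inline int task_sum(const int* A, int init) {
    int sum = init;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:sum)
        {
            #pragma omp task in_reduction(+:sum)
            for (int i = 0; i < 50; i++) sum += A[i];

            #pragma omp task in_reduction(+:sum)
            for (int i = 50; i < 100; i++) sum += A[i];
        }  // both tasks are complete and their copies are combined into sum
    }
    return sum;
}
```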



task reduction (conceptual execution)

    taskgroup: int sum1 = 0; int sum2 = 0;
        task 1: int * p1 = getMy(sum);   // p1 = &sum1
                for(i=0; i<50; i++) *p1 += A[i];
        task 2: int * p2 = getMy(sum);   // p2 = &sum2
                for(i=50; i<100; i++) *p2 += A[i];
    wait
    end taskgroup: sum += sum1 + sum2;
    print(sum);



#pragma omp taskgroup task_reduction(op:list)


Taskgroup

task_reduction clause

Reduction scoping clause

Defines the region in which reduction is computed by tasks

For each list item x

taskgroup keeps a copy of x for each task with in_reduction(x)

taskgroup begin: initialize all copies of x according to op

taskgroup end: reduce all copies of x into the original x




#pragma omp task in_reduction(op:list)


Taskgroup

in_reduction clause

Reduction participating clause

The task participates in a reduction defined by a reduction scoping clause

For each list item x

Task requests a pointer p to its copy of x from an enclosing taskgroup

If nested, the innermost taskgroup with the matching task_reduction clause fulfills the request

All accesses to x in the task’s body are replaced by *p
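Because the number of participating tasks does not have to be known in advance, the pattern also fits irregular work such as a linked-list traversal. A minimal sketch, assuming OpenMP 5.0 clause support; `Node` and `list_sum` are hypothetical names.

```cpp
#include <cassert>

struct Node {
    int value;
    Node* next;
};

// One task per list node; every task adds into its own copy of total,
// and the taskgroup combines the copies when it ends.
inline int list_sum(Node* head) {
    int total = 0;
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskgroup task_reduction(+:total)
    {
        for (Node* n = head; n != nullptr; n = n->next) {
            #pragma omp task in_reduction(+:total) firstprivate(n)
            total += n->value;
        }
    }
    return total;
}
```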



New Features of automatic Vectorization and explicit data parallel Programming via OpenMP* SIMD


New Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

512-bit wide vectors

32 operand registers

8 64b mask registers

Embedded broadcast

Embedded rounding

7 subsets:

2 exclusive to Intel® Xeon Phi™

2 common

3 exclusive to Intel® Xeon

    Microarchitecture     Instruction Set   SP FLOPs per cycle   DP FLOPs per cycle
    Skylake               AVX512 & FMA      64                   32
    Haswell / Broadwell   AVX2 & FMA        32                   16
    Sandybridge           AVX (256b)        16                   8
    Nehalem               SSE (128b)        8                    4

Intel® AVX-512 Instruction Types

AVX512-PF Prefetch: multi-address prefetch using gather/scatter semantics

AVX512-ER Exponential and Reciprocal: ‘wide’ approximation of Log and RCP/RSQRT

AVX512-F AVX-512 Foundation Instructions

AVX512-CD Conflict Detect : used in vectorizing loops with potential address conflicts

AVX512-VL Vector Length Orthogonality : ability to operate on sub-512 vector sizes

AVX512-BW 512-bit Byte/Word support

AVX512-DQ Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
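The FLOPs-per-cycle figures in the table can be reproduced with simple arithmetic: lanes per vector, times operations per lane per instruction, times the number of vector units. The sketch below is an illustration, not Intel data: `flops_per_cycle` is a hypothetical helper, and the unit counts (two FMA units, or one add unit plus one multiply unit on the pre-FMA parts) are assumptions chosen to be consistent with the table.

```cpp
#include <cassert>

// Peak floating-point operations per cycle per core:
// (lanes per vector) * (operations per lane) * (vector units).
constexpr int flops_per_cycle(int vector_bits, int element_bits,
                              int ops_per_lane, int units) {
    return (vector_bits / element_bits) * ops_per_lane * units;
}

// An FMA counts as 2 operations per lane; two FMA units are assumed.
static_assert(flops_per_cycle(512, 32, 2, 2) == 64, "Skylake SP");
static_assert(flops_per_cycle(512, 64, 2, 2) == 32, "Skylake DP");
static_assert(flops_per_cycle(256, 32, 2, 2) == 32, "Haswell SP");
// Without FMA: one add unit and one multiply unit, 1 operation per lane each.
static_assert(flops_per_cycle(256, 32, 1, 2) == 16, "Sandy Bridge SP");
static_assert(flops_per_cycle(128, 32, 1, 2) == 8, "Nehalem SP");
```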




Tuning for Skylake - Compiler options

Both Skylake and Knights Landing processors have support for Intel® AVX-512 instructions. There are three ISA options in the Intel® Compiler:

-xCORE-AVX512 : Targets Skylake, contains instructions not supported by Knights Landing

-xCOMMON-AVX512 : Targets both Skylake and Knights Landing

-xMIC-AVX512 : Targets Knights Landing, includes instructions not supported by Skylake

The Intel® Compiler is conservative in its use of ZMM (512-bit) registers, so to enable their use on Skylake the additional flag -qopt-zmm-usage=high must be set.



Ordered blocks in SIMD contexts

#pragma omp ordered simd

Semantics: The ordered construct with the simd clause specifies a structured block in the simd loop or SIMD function that will be executed in the order of the loop iterations or of the sequence of calls to SIMD functions.

Rules:

#pragma omp ordered simd is only allowed inside a SIMD loop or SIMD-enabled function.

A #pragma omp ordered simd region must be a single-entry and single-exit code block



Monotonic keyword for ordered

#pragma omp ordered simd monotonic([var:step]s)

Semantics: Same as for 'omp ordered simd', with a hint that vars inside the structured block are monotonically changed with respect to execution.

Why: With this hint the compiler can generate better vector code if the CPU supports compress or expand instructions.



OK:

#pragma omp simd

for (int i = 0; i < n; ++i)

if (cond)

#pragma omp ordered simd monotonic(j:1)

if (c[i] > 0)

q[j++] = b[i]; // compress pattern

a[i] = j; // expand pattern

if (cond1)

#pragma omp ordered simd monotonic(k:1)

if (c[i] > 0)

b[i] = p[k++]; // expand pattern

Not OK:

#pragma omp simd

for (int i = 0; i < n; ++i)

if (cond)


if (c[i] > 0)

q[j++] = b[i]; // compress pattern

a[i] = j;

if (cond1)


if (c[i] > 0)

b[i] = p[j++]; // expand pattern

Compiler won't complain!

Monotonic keyword for ordered
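The compress pattern can be tried with the standard form of the construct. In the sketch below `compress_positive` is a hypothetical name, and the loop uses plain `#pragma omp ordered simd` so that it builds with any OpenMP 4.5 compiler; the monotonic clause described above is shown only in a comment because it is an extension.

```cpp
#include <cassert>

// Copy the elements of b whose selector c[i] is positive to the front
// of q and return how many were copied (the compress pattern).
inline int compress_positive(const int* c, const float* b, float* q, int n) {
    int j = 0;
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        // With the extension this would read:
        //   #pragma omp ordered simd monotonic(j:1)
        #pragma omp ordered simd
        {
            if (c[i] > 0)
                q[j++] = b[i];
        }
    }
    return j;
}
```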



Overlap keyword for ordered

#pragma omp ordered simd overlap(overlap_index)

Semantics: Same as for 'omp ordered simd', with a hint that overlap_index may have equal values in different lanes during vector execution. The compiler will resolve conflicts for indirect accesses with respect to this overlap_index.

Why: With this hint the compiler can generate better vector code if the CPU supports the vconflict instruction, for example a CPU with AVX512CD support.



overlap keyword for ordered

OK:

    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        #pragma omp ordered simd overlap(b[i])
        {
            ++a[b[i]]; // conflict pattern
                       // compiler will resolve conflict according to b[i]
        }
        #pragma omp ordered simd overlap(i)
        {
            ++a[b[i]]; // conflict pattern
                       // compiler will resolve conflict according to i
        }
    }

    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        #pragma omp ordered simd overlap(b[i])
        {
            ++a[b[i]]; // conflict pattern
                       // compiler will resolve conflict according to b[i]
            c[b[i]] = b[i] + c[b[i]]; // conflict pattern
                                      // reassociation is allowed
        }
    }

Bail-out to ordered:

    double *a;
    int32_t *c;
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        #pragma omp ordered simd overlap(b[i])
        {
            ++a[b[i]]; // conflict pattern
            ++c[b[i]]; // conflict pattern
        }
    }

    double *a0, *a1;
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        #pragma omp ordered simd overlap(b[i])
        {
            ++a0[b[i]]; // conflict pattern
                        // ok to reassociate
            a1[b[i]] = 1/a1[b[i]]; // conflict pattern
                                   // reassociation is not allowed
        }
    }
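A histogram update is the classic conflict pattern: several iterations may increment the same bin. The sketch below uses the standard `#pragma omp ordered simd` so that it is portable; `histogram` is a hypothetical name, and the overlap clause discussed above appears only in a comment because it is an extension.

```cpp
#include <cassert>

// Increment bin a[b[i]] for every i; equal values in b cause conflicts.
inline void histogram(const int* b, int n, int* a) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        // With the extension this would read:
        //   #pragma omp ordered simd overlap(b[i])
        #pragma omp ordered simd
        {
            ++a[b[i]];
        }
    }
}
```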



Conditional scalar assignment

OK:

    #pragma omp simd lastprivate(conditional: clp)
    for (int i = 0; i < n; ++i) {
        if (cond)
            clp = a[i];
    }

    #pragma omp simd lastprivate(conditional: clp)
    for (int i = 0; i < n; ++i) {
        if (cond) {
            clp = a[i];
            b[i] = clp;
        }
    }

Not OK (compiler won't complain!):

    #pragma omp simd lastprivate(conditional: clp)
    for (int i = 0; i < n; ++i) {
        if (cond)
            clp = a[i];
        b[i] = clp;
    }

    #pragma omp simd lastprivate(conditional: clp)
    for (int i = 0; i < n; ++i) {
        b[i] = clp;
        if (cond)
            clp = a[i];
    }
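A small function in the style of the first OK case, as a sketch: `last_negative` is a hypothetical name, and the code assumes a compiler that accepts the OpenMP 5.0 conditional modifier. The scalar is written only under the condition and never read inside the loop.

```cpp
#include <cassert>

// Return the last negative element of a, or fallback if there is none.
inline int last_negative(const int* a, int n, int fallback) {
    int clp = fallback;
    #pragma omp simd lastprivate(conditional: clp)
    for (int i = 0; i < n; ++i) {
        if (a[i] < 0)
            clp = a[i];  // assigned only under the condition
    }
    return clp;
}
```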



Download a free, 30-day trial of Intel® Parallel Studio XE 2018 today

https://software.intel.com/en-us/intel-parallel-studio-xe/try-buy

Code that performs and outperforms

And don't forget to check your inbox for the evaluation survey, which will be emailed after this presentation.

P.S. Everyone who fills out the survey will receive a personalized certificate indicating completion of the training!



Q & A


Questions?



Legal Disclaimer & Optimization Notice


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



Visual Studio 2015 Shell for “standalone” Fortran

COMPILER_OPTIONS and COMPILER_VERSION in iso_fortran_env

Implied-do bounds expressions in a DATA statement can now be any constant expression

Passing a non-pointer data item to a pointer dummy argument

    program main
       use ISO_FORTRAN_ENV
       implicit none
       print *, "Compiler version: ", compiler_version()
       print *, "Compiler options: ", compiler_options()
    end program

DATA ((B (K, J), J = K + 1, size(a), size(c)-2),

K = 1, size(a)-2) / 9 * 1 /

integer, target :: j = 17 ! Must have target attribute

call foo2(j)

contains

subroutine foo2(i)

integer, intent(in), pointer :: i

print *,i

end subroutine

end




real, allocatable :: r(:)

!dir$ attributes memkind:hbw :: r

print *, for_get_memkind(r) !prints 1 for HBW

!dir$ memkind:ddr

allocate(r(1000))

print *, for_get_memkind(r) !prints 0 for DDR

end

MEMKIND attribute and directive

!DIR$ ATTRIBUTES MEMKIND: HBW | DDR :: obj

!DIR$ MEMKIND:HBW | DDR [,ALIGN:N]

intrinsic FOR_GET_MEMKIND

Polymorphic assignment with allocatable LHS and not a coarray.

class(parent), allocatable :: var

class(parent), allocatable :: expr1

class(child), allocatable :: expr2

var = expr1

var = expr2 ! Type compatible

Pointer function reference in a variable definition context

program main

integer, target :: var(10)

storage(1) = 11

print *, var(1) !Print 11

contains

function storage(key) result(loc)

integer, intent(in) :: key

integer, pointer :: loc

loc => var(key)

end function

end


