intel parallel studio xe 2011 sp1 · 2013-10-30 · intel® sse4.1 as first introduced in intel®...

51
Software & Services Group Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Intel ® Parallel Studio XE 2011 SP1 for Advanced Performance Boost Performance. Scale Forward. Ensure Confidence.

Upload: others

Post on 21-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Intel® Parallel Studio XE 2011 SP1

for Advanced Performance

Boost Performance. Scale Forward. Ensure Confidence.

Page 2: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 7/30/2012 2

Using the Intel Compiler

Page 3: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Windows Command Prompt: Basic Compiler Usage

• Compiler drivers are ‘ifort’ for Fortran and ‘icl’ for C/C++

• “/O” switches

/O2 default optimization level

– /O* doesn’t imply the same set of opts for MSVC++ and Intel, but

similar concepts, /Od for debugging, /O2 for default optimization, /O3

for more advanced optimizations

• icl –help or ifort -help provides extensive list

7/30/2012 3

Optimization Notice

Page 4: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Linux: Basic Compiler Usage

• source <installdir>/bin/[compilervars.sh | compilervars.csh]

[intel64 | ia32]

– <=11.x version compilers, was named ‘ifortvars.sh’ or ‘iccvars.sh’

– Local installations may set up ‘module’ scripts (not default)

• Compiler drivers are ‘ifort’, ‘icc’ for C, and ‘icpc’ for C++

• “-O” switches compatible, but not identical to gcc

– -O2 default optimization level (gcc default is –O0)

– -O* doesn’t imply the same set of opts for gcc and Intel, but similar

concepts, -O0 for debugging, -O2 for default, -O3 for more advanced

optimizations

• icc –help, icpc -help or ifort -help provides extensive list

• Intel debugger: IDB

– Linux: ‘idbc’ command line, ‘idb’ X11 GUI

– Mac OS: ‘idb’ command line only ( note DDD can wrap it )

7/30/2012 4

Optimization Notice

Page 5: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

New User Essentials

• Fortran 90/95/03/08 – Stack Exhaustion Issue

– Symptom: SIGSEGV segfault, SIGBUS(Mac OS*)

– Use –heap-arrays or ulimit –s to avoid

– See: http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/

• Getting a traceback:

-g –traceback or /Zi /traceback

• “Incorrect results/numbers” (more on this later)

-fp-model precise or /fp:precise

Page 6: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Linux: Documentation, Samples, and Tutorials

• HTML- and PDF-based documentation:

– <install dir>/composerxe/Documentation/en_US/

– Release_Notes[F | C].pdf

– documentation_[f | c].htm

• Samples: <installdir>/Samples/en_US/[C++ | Fortran]

/sample.htm for an index

– Vectorization, openmp, PGO, IPO, GAP, Corray Fortran, Cilk Plus

• Tutorials: <installdir>/composerxe/Documentation/en_US/

/tutorials_[c | f]/index.htm

– Tutorials are based on the Samples included

– Steps user through new technologies

7/30/2012 6

Page 7: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Getting Help • User Forum(s)

– Web-based, publically visible

– No account needed to browse. To post, need a login

– Can send/receive files. Posts can be made “private” with Intel

•http://software.intel.com/en-us/forums/intel-fortran-

compiler-for-linux-and-mac-os-x/

•http://software.intel.com/en-us/forums/intel-visual-fortran-

compiler-for-windows/

•http://software.intel.com/en-us/forums/intel-c-compiler/

• Licensed user can access Premier Support http://premier.intel.com

– if you have your serial number, you can set up a premier account

• Webinars:

• http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/

7/30/2012 7

Page 8: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Watch for Webinars

• Keep watch on for future webinars on User Forum and … http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/

http://software.intel.com/en-us/articles/intel-vectorization-february-15-2012-webinar/

7/30/2012 8

Page 9: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Essential Compiler Optimizations

9

Page 10: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Optimization Essentials Hands-on with “Quicklabs”

• Follow instructor to open a command line window – Linux users: source compilervars.sh

– Locate your “quicklabs” directory, cd to that directory

– Choose Windows or Linux, cd to that directory

– Choose Fortran or C++, cd to that directory

Intel Confidential

7/30/2012 10

Page 11: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Common Optimization Switches

Windows* Linux* Mac OS*

Disable optimization /Od -O0

Optimize for speed (no code size increase) /O1 -O1

Optimize for speed (default) /O2 -O2

High-level optimizer, including prefetch, unroll

/O3 -O3

Create symbols for debugging /Zi -g

Inter-procedural optimization /Qipo -ipo

Profile guided optimization (muli-step build) /Qprof-gen

/Qprof-use

-prof-gen

-prof-use

Optimize for speed across the entire program **warning: -fast def’n changes over time

/fast (same as: /O3 /Qipo /Qprec-div- /QxHost)

-fast (same as: -ipo –O3 -no-prec-div -static -xHost)

OpenMP 3.0 support /Qopenmp -openmp

Automatic parallelization /Qparallel -parallel

11

Page 12: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Quick lab 1: Matrix calculation O0 and O1

• Compile the program matrix.c/matrix.f90 without optimization and with O1, compare results

• Run quicklab1.bat or quicklab1.sh, which will compile and run with … Linux/Mac C++: icc –O0 matrix.c -o noopt_matrix.exe

icc -O1 matrix.c -o O1_matrix.exe

Windows Fortran: ifort /Od /exe: noopt_matrix.exe

matrix.f90

icl /O1 /exe:O1_matrix.exe matrix.f90

• Runs and times the executable

noopt_matrix.exe >> 30.25 sec.

O1_matrix.exe >> 9.6 sec.

[Intel® Core™ i7, 4-Core 64-bit, 1.6GHz, Windows*] 12

Page 13: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

High-Level Optimizer (HLO)

• Compiler switches: -O2, -O3 (Linux*)

/O2, /O3 (Windows*),

• Loop level optimizations

– loop unrolling, cache blocking, prefetching

• More aggressive dependency analysis

– Determines whether or not it‘s safe to reorder or parallelize statements

• Scalar replacement

– Goal is to reduce memory references with register references

13

Page 14: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Quick lab 2: Optimizations -O2 -O3 • Run quicklab2.bat or quicklab2.sh, which will

compile and run with …

• Compiles the program matrix.c/matrix.f90 with optimization -O[2|3], but w/o vectorizer, compare results: icc -O2 –no-vec matrix.c -o O2_novec_matrix.exe

icc -O3 –no-vec matrix.c -o O3_novec_matrix.exe

• Run and time the executable noopt_matrix.exe >> 30.25 sec.

O1_matrix.exe >> 9.60 sec.

O2_novec_matrix.exe >> 2.12 sec.

O3_novec_matrix.exe >> 1.60 sec.

[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]

14

Page 15: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Vectorization – the ‘Make or Break’ for IA Performance

15

Page 16: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Auto-Vectorization SIMD – Single Instruction Multiple Data

7/30/2012 16

• Scalar mode – one instruction produces

one result

• SIMD processing – with SSE or AVX instructions

– one instruction can produce multiple results

+ a[i]

b[i]

a[i]+b[i]

+

c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i]

b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i]

a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i]

for (i=0;i<=MAX;i++)

c[i]=a[i]+b[i];

a

b

a+b

+

do i=1,MAX

c(i)=a(i)+b(i)

end do

Page 17: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Vectorization is Achieved through SIMD Instructions & Hardware

17

X4

Y4

X4opY4

X3

Y3

X3opY3

X2

Y2

X2opY2

X1

Y1

X1opY1

0 128 Intel® SSE Vector size: 128bit Data types: 8,16,32,64 bit integers 32 and 64bit floats VL: 2,4,8,16 Sample: Xi, Yi bit 32 int / float

Intel® AVX Vector size: 256bit Data types: 32 and 64 bit floats VL: 4, 8, 16 Sample: Xi, Yi 32 bit int or float First introduced in 2011

X4

Y4

X4opY4

X3

Y3

X3opY3

X2

Y2

X2opY2

X1

Y1

X1opY1

0 127

X8

Y8

X8opY8

X7

Y7

X7opY7

X6

Y6

X6opY6

X5

Y5

X5opY5

128 255

Page 18: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

What Kinds of Applications can use Vectorization?

18

Bioinformatics

Financial Analytics Energy

Medical Imaging & Analysis

Signal Processing

Engineering Design

Science & Research

Broadcast & Film

3D Modeling & Visualization

Digital Content Creation

GIS & Satellite Imagery

Database & Business Intelligence

Defense & Security

Game Development

Telecommunications

Page 19: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Comparison of Ways Applications can Take Advantage of Vectorization

19

Effort Required

Code Maintain-ability

Performance Potential

Scale Forward

Assembly/Intrinsics Most Least Best No

Existing libraries such as Intel® IPP, Intel® MKL

Least Most Best Yes

Compiler Vectorization Least Most Good Yes

High-level Constructs such as Intel® Cilk™ Plus, Fortran DO CONCURRENT

Moderate Most Best Yes

Page 20: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Compiler Based Vectorization Extension Specification Feature Extension

Intel® Streaming SIMD Extensions 2 (Intel® SSE2) as available in initial Pentium® 4 or compatible non-Intel processors

sse2

Intel® Streaming SIMD Extensions 3 (Intel® SSE3) as available in Pentium® 4 or compatible non-Intel processors

sse3

Supplemental Streaming SIMD Extensions 3 (SSSE3) as available in Intel® Core™2 Duo processors

ssse3

Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture

sse4.1

Intel® SSE4.2 Accelerated String and Text Processing instructions supported first by by Intel® Core™ i7 processors

sse4.2

Extensions offered by Intel® ATOM™ processor : Intel SSSE3 (!!) and MOVBE instruction

sse3_atom

Intel® Advanced Vector Extensions (Intel® AVX) as available in 2nd generation Intel Core processor family – code name Sandy Bridge

AVX

Intel® Advanced Vector Extensions (Intel® AVX) as code-name Ivy Bridge and in code-name Haswell (available only in compilers v13+, fall 2012)

CORE-AVX-I CORE-AVX2

Page 21: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Basic Vectorization – Switches [1]

{L&M} -x<extension> {W}: /Qx<extension> • Targeting Intel® processors - specific optimizations for Intel® processors

• Compiler will try to make use of all instruction set extensions up to and including <extension>; for Intel® processors only !

• Processor-check added to main-program

• Application will not start (will display message), in case feature is not available

{L&M}: -m<extension> {W}: /arch:<extension> • No Intel processor check

• Does not perform Intel-specific optimizations

• Application is optimized for and will run on both Intel and non-Intel processors

• Missing check can cause application to fail if instructions not supported

{L&M}: -ax<extension> {W}: /Qax<extension> • Multiple code paths – a ‘baseline’ and ‘optimized, processor-specific’ path(s)

• Optimized code path for Intel® processors defined by <extension>

• Baseline code path defaults to –msse2 (Windows: /arch:sse2) – The baseline code path can be modified by –m or –x (/Qx or /arch) switches

• -axavx,sse4.2 - paths for avx, sse4.2 and sse2 (default)

Page 22: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Basic Vectorization – Switches [2]

Special switch -xHost (Windows: /QxHost)

• Compiler checks compilation host processor and makes use of ‘latest’ instruction set extension available for THAT processor

The Xeon default is –msse2 (Windows: /arch:sse2)

• Activated implicitly for –O2 or higher

• Implies the need for a target processor with Intel® SSE2

• Multiple Code Paths

Multiple extensions can be used in combinations like -ax<ext1>,<ext2> ( Windows: /Qax<ext1>,<ext2> )

-axavx,sse4.2,sse3 for example, OR

-axavx –xsse4.2 (AVX path, SSE4.2 path, default SSE2 path)

Page 23: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Vectorization – More Switches and Directives

Disable vectorization • Globally via switch: {L&M}: -no-vec {W}: /Qvec-

• For a single loop: directive #pragma novector, !DIR$ novector

– Disabling vectorization means not using packed SSE/AVX instructions.

– The compiler still may use SSE instruction set extensions

Enforcing vectorization for a loop - overwriting the compiler

heuristics : #pragma vector always

– will enforce vectorization even if the compiler thinks it is not profitable to do so ( e.g due to non-unit strides or alignment issues)

– It’s a SUGGESTION: Will not enforce vectorization if the compiler fails to recognize this as a semantically correct transformation – hint to compiler to relax

– Using directive #pragma vector always assert will print error

message in case the loop cannot be vectorized and will abort compilation

Page 24: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Vectorization Switches – Some Notes

• Former vectorization switches like –xW, /QxT, /QaxP etc are considered ‘deprecated’ and will not be supported anymore in the future

– See appendix or compiler documentation for mapping between old and new names

• It is not possible anymore to generate vector code exclusively for the initial SSE (32 bit FP) instruction set (introduced by Intel® Pentium™ 3 processor)

• The instruction set extension name for Intel® Atom™ processors (SSE3_ATOM) is misleading: Since the architecture supports up to SSSE3, e.g. switch –xsse3_atom will make use of SSSE3 too

– In case the code should be optimized for Intel® Atom processors but should run too on all processors supporting up to SSSE3, add option –minstruction=nomovbe (Windows: /Qinstruction:nomovbe) to avoid the use of the Atom-specific instruction MOVBE

Page 25: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Vectorization – More Switches and Directives

Disable vectorization • Globally via switch: {L&M}: -no-vec {W}: /Qvec-

• For a single loop: directive novector

– Disabling vectorization here means not using packed SSE/AVX instructions. The compiler still might make use of the corresponding instruction set extensions

Enforcing vectorization for a loop - overwriting the compiler

heuristics : #pragma vector always

– will enforce vectorization even if the compiler thinks it is not profitable to do so ( e.g due to non-unit strides or alignment issues)

– Will not enforce vectorization if the compiler fails to recognize this as a semantically correct transformation

– Using directive #pragma vector always assert will print error

message in case the loop cannot be vectorized and will abort compilation

Page 26: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Compiler Reports – Vectorization Report

Compiler switch: /Qvec-report<n> (Windows)

-vec-report<n> (Linux)

Set diagnostic level dumped to stdout

N=0: No diagnostic information (default ) n=1: Loops successfully vectorized n=2: Loops not vectorized – and the reason why not n=3: Adds dependency Information n=4: Reports only non-vectorized loops n=5: Reports only non-vectorized loops and adds dependency info

• Note: For Linux, –opt_report_phase hpo provides additional diagnostic

information for vectorization

26

Page 27: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Quick lab 3: Compiler vectorizer • Run quicklab3.bat or quicklab3.sh which …

Compiles the program matrix.c/matrix.f90 with .. -O[2|3], vectorizer and create report: icc -O2 matrix.c -vec-report2 -o O2_matrix.exe

icc -O3 matrix.c -vec-report2 -o O3_matrix.exe

icc -O3 –xHost -vec-report2 matrix.c -o O3_xHost_matrix.exe

icc -fast matrix.c -o fast_matrix.exe

[Do all loops vectorize? Is there room for improvement?]

• Run and time the executable noopt_matrix.exe >> 30.25 sec.

O1_matrix.exe >> 9.60 sec.

O2_novec_matrix.exe >> 2.12 sec.

O3_novec_matrix.exe >> 1.60 sec.

O2_matrix.exe >> 0.98 sec.

O3_matrix.exe >> 0.83 sec.

O3_xHost_matrix.exe >> 0.78 sec.

fast_matrix.exe >> 0.78 sec.

27

[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]

Page 28: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Other Optimizations that Support or Complement Vectorization

• Vectorization relies on

– NO Dependencies in data between loop iterations

– or dependencies that can be removed by simple methods

– Regular, predictable data patterns for all operands

– No pointer chasing, indirect accesses

– Vector lengths that are large enough AND

– Data is aligned on natural boundaries ( 16, 32, or 64 bytes )

– cache line matching boundary, can use cache as staging

– Streaming stores – store optimization

• COST MODEL – is it worth it to vectorize???

• Advanced optimizations help with some of these

28

Page 29: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 29

Interprocedural Optimizations (IPO) Multi-pass Optimization

• Interprocedural optimizations performs a static, topological analysis of your application!

• ip: Enables inter-procedural optimizations for current source file compilation

• ipo: Enables inter-procedural optimizations across files

• Can inline functions in separate files

• Especially many small utility functions benefit from IPO Enabled optimizations: • Procedure inlining (reduced function call overhead) • Procedure reordering • Interprocedural dead code elimination, constant propagation and procedure

reordering • Enhances optimization when used in combination with other compiler features

Windows* Linux* Mac OS*

/Qip -ip

/Qipo -ipo

Page 30: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 30

Interprocedural Optimizations (IPO) Usage: Two-Step Process

Linking

Linux* icc -ipo main.o func1.o

func2.o

Windows* icl /Qipo main.o func1.o

func2.obj

Pass 1

Pass 2

mock object

executable

Compiling

Linux* icc -c -ipo main.c func1.c

func2.c

Windows* icl -c /Qipo main.c func1.c

func2.c

Page 31: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Quick lab 4: Include IPO Quick Lab 4-1: Show multi-file optimizations • Compiles the program matrix.c/matrix.f90 with

optimization and with including /Qipo – compare results: icc -O3 -ipo matrix.c -opt-report-phase:ipo -o O3_ipo_matrix.exe

icc -O3 –ipo -xHost matrix.c -o O3_ipo_xHost_matrix.exe

[Don’t expect much with IPO on this sample (why?)]

• Run and time the executable O2_novec_matrix.exe >> 2.12 sec.

O3_novec_matrix.exe >> 1.60 sec.

O2_matrix.exe >> 0.98 sec.

O3_matrix.exe >> 0.83 sec.

O3_ipo_matrix.exe >> 0.82 sec.

O3_ipo_xHost_matrix.exe >> 0.75 sec.

• Run quicklab4-1 to see what IPO can do for multi-file optimization

31

[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]

Page 32: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Profile-Guided Optimizations (PGO)

• Static analysis leaves many questions open for the optimizer like: – How often is x > y – What is the size of count – Which code is touched how often

• Use execution-time feedback to guide (final) optimization

• Enhancements with PGO: – More accurate branch prediction – Basic block movement to improve instruction

cache behavior – Better decision of functions to inline (help IPO) – Can optimize function ordering – Switch-statement optimization – Better vectorization decisions

32

if (x > y) do_this(); else do that();

for(i=0; i<count; ++I

do_work();

Page 33: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

PGO Usage: Three Step Process

33

Compile + link to add instrumentation icc -prof_gen prog.c

Execute instrumented program prog.exe (on a typical dataset)

Compile + link using feedback icc -prof_use prog.c

Dynamic profile: 12345678.dyn

Instrumented executable: foo.exe

Merged .dyn files: pgopti.dpi

Step 1

Step 2

Step 3

Optimized executable: foo.exe

Page 34: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Quick lab5: Use PGO

Run quicklab5.bat or quicklab5.sh which …

• Compiles the program matrix.c/matrix.f90 with /Qprof-gen /Qprof-use – compare results: icc -prof-gen matrix.c -o pgen_matrix.exe

pgen_matrix.exe [Why does this execution take so long?]

icc -prof-use -O3 -xHost matrix.c -o puse_matrix.exe

• Run and time the executable O2_novec_matrix.exe >> 2.12 sec.

O3_novec_matrix.exe >> 1.60 sec.

O2_matrix.exe >> 0.98 sec.

O3_matrix.exe >> 0.83 sec.

O3_ipo_matrix.exe >> 0.82 sec.

O3_ipo_xHost_matrix.exe >> 0.75 sec.

puse_matrix.exe >> 0.32 sec.

34

[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]

Page 35: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Parallelization

• Obviously, we now have multi-core and many-core

• How to utilize those cores?

– Treat as another system: process level parallelism

– 1 MPI process per core

– disadvantage: larger memory footprint, unneeded redundancy

– Threading: map threads to cores

– Smaller memory footprint, sharing of data possible

– Auto-parallel – let the compiler do it

– OpenMP, Pthreads, Win threads

– Higher level language abstractions and constructs

35

Page 36: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Auto-Parallelization

• Compiler automatically translates portions of serial code into equivalent multithreaded code with using these options:

-parallel /Qparallel

• The auto-parallelizer analyzes the dataflow of loops and generates multithreaded code for those loops which can safely and efficiently be executed in parallel.

• The auto-parallelizer report can provide information about

program sections that could be parallelized by the compiler. Compiler option:

-par-report0|1|2|3

/Qpar-report:0|1|2|3

0 is report disabled, 3 maximum diagnostics level

36

Page 37: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Compiler Reports – Optimization Report

Compiler option: Control the level of detail in the report: -opt-report[0|1|2|3] (Linux, Mac OS X) /Qopt-report[0|1|2|3] (Windows)

-opt-report-phase[=phase] (Linux)

/Qopt-report-phase[:phase] (Windows)

phase can be: • ipo - Interprocedural Optimization • ilo – Intermediate Language Scalar Optimization • hpo – High Performance Optimization (cache blocking,

data access) • hlo – High-level Optimization (loop transforms, etc.) • all – All optimizations (default if opt-report-phase not

used)

37

Page 38: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Optimization Report Example icc –O3 –opt-report-phase=hlo -opt-report-phase=hpo icl /O3 /Qopt-report-phase:hlo /Qopt-report-phase:hpo

LOOP INTERCHANGE in loops at line: 7 8 9

Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )

Loop at line 8 blocked by 128

Loop at line 9 blocked by 128

Loop at line 10 blocked by 128

Loop at line 10 unrolled and jammed by 4

Loop at line 8 unrolled and jammed by 4

…(10)… loop was not vectorized: not inner loop.

…(8)… loop was not vectorized: not inner loop.

…(9)… PERMUTED LOOP WAS VECTORIZED

–vec-report2 (/Qvec-report2) to get a vectorization report

38

Page 39: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Quick lab 6: Auto-parallelization • Compile the program matrix.c/matrix.f90 with /Qparallel

/Qpar-report2 with and without additional

vectorization/optimizations: icc -parallel -par-report2 –no-vec -o par_novec_matrix.exe

icc -parallel -par-report2 -o par_matrix.exe

icc -parallel -par-report2 -O3 -vec-report2 -ipo –xHost -o

par_vec_matrix.exe

icc -O3 -parallel -prof-use -O3 -ipo –xHost -o

par_vec_puse_matrix.exe

[Examine the parallel report. Is the program well parallelized? What are the possible pitfalls with thread-level parallelization?]

• Run and time the executable O3_ipo_xHost_matrix.exe >> 0.75 sec.

puse_matrix.exe >> 0.32 sec.

par_novec_matrix.exe >> 0.90 sec.

par_matrix.exe >> 0.35 sec.

par_vec_matrix.exe >> 0.35 sec.

par_vec_puse_matrix.exe >> 0.12 sec.

39

[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]

Page 40: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Profiling Function level and/or Loop-level

7/30/2012 40

Page 41: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Loop Profiler Identify Time Consuming Functions and Loops

• Compiler switch: -profile-functions /Qprofile-functions,

Insert instrumentation calls on function entry and exit points to collect the cycles spent within the function.

• Compiler switch: -profile-loops= <inner|outer|all>

/Qprofile-loops=<inner|outer|all> – Insert instrumentation calls for function entry and exit points as well as

the instrumentation before and after instrument able loops of the type listed as the option’s argument.

• Loop Profiler switches trigger generation of text (.dump) and XML (.xml) output files – Invocation of XML on command line: java -jar loopprofviewer.jar <xml datafile>

41

Page 42: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Loop Profiler Text Dump (.dump file)

42

Page 43: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Loop Profiler Data Viewer GUI (copy from sl. 46)

43

Function Profile View

Loop Profile View

Column headers allow selection

to control sort criteria

independently for function and

loop table

Menu to allow user to enable

filtering or displaying the

source code

Page 44: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Using the Intel® Math Kernel Library

Page 45: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Using MKL with Intel Compilers

• The Intel® Math Kernel Library (MKL) is

– supplied with the Intel® Composer XE products

– Easy to use with the Intel compilers 12.1 and above:

-mkl or -mkl=parallel link with standard threaded

Intel MKL

-mkl=sequential link with sequential version

-mkl=cluster link with Intel MKL cluster components

(sequential) that use Intel MPI

Intel Confidential

45

Page 46: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Using MKL, Controlling Threads

• Threaded MKL library uses OpenMP threading to use available cores

• These Intel MKL function domains are threaded:

– Direct sparse solver

– LAPACK: list of threaded routines, see MKL docs”Threaded LAPACK Routines”

– Level1 and Level2 BLAS: list of threaded routines, see MKL docs “Threaded BLAS Level1 and Level2 Routines”

– All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.

– All mathematical VML functions.

– FFT: list of FFT transforms that can be threaded, see MKL docs “Threaded FFT Problems”

Intel Confidential

46

Page 47: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Using MKL, Controlling Threads

• Set one of the OpenMP or Intel MKL environment variables:

– OMP_NUM_THREADS – applies to entire application OpenMP

– MKL_NUM_THREADS – applies just to all MKL threaded funcs

– MKL_DOMAIN_NUM_THREADS – finer control on domain-specific MKL functions:

– export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4”

• Call one of the OpenMP or Intel MKL functions:

– omp_set_num_threads()

– mkl_set_num_threads()

– mkl_domain_set_num_threads()

Intel Confidential

47

Page 48: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Using MKL, Explicit Use

• Use online “MKL Link Line Advisor”: http://software.intel

.com/sites/products/m

kl/MKL_Link_Line_Advi

sor.html

Intel Confidential

48

Page 49: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Questions?

Page 50: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2

Software & Services Group

Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

50 Intel Confidential

7/30/2012

Page 51: Intel Parallel Studio XE 2011 SP1 · 2013-10-30 · Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture sse4.1 Intel® SSE4.2