intel parallel studio xe 2011 sp1 · 2013-10-30 · intel® sse4.1 as first introduced in intel®...
TRANSCRIPT
Software & Services Group
Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Parallel Studio XE 2011 SP1
for Advanced Performance
Boost Performance. Scale Forward. Ensure Confidence.
Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 7/30/2012 2
Using the Intel Compiler
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Windows Command Prompt: Basic Compiler Usage
• Compiler drivers are ‘ifort’ for Fortran and ‘icl’ for C/C++
• “/O” switches
/O2 default optimization level
– /O* doesn’t imply the same set of opts for MSVC++ and Intel, but
similar concepts, /Od for debugging, /O2 for default optimization, /O3
for more advanced optimizations
• icl –help or ifort -help provides extensive list
7/30/2012 3
Optimization Notice
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Linux: Basic Compiler Usage
• source <installdir>/bin/[compilervars.sh | compilervars.csh]
[intel64 | ia32]
– <=11.x version compilers, was named ‘ifortvars.sh’ or ‘iccvars.sh’
– Local installations may set up ‘module’ scripts (not default)
• Compiler drivers are ‘ifort’, ‘icc’ for C, and ‘icpc’ for C++
• “-O” switches compatible, but not identical to gcc
– -O2 default optimization level (gcc default is –O0)
– -O* doesn’t imply the same set of opts for gcc and Intel, but similar
concepts, -O0 for debugging, -O2 for default, -O3 for more advanced
optimizations
• icc –help, icpc -help or ifort -help provides extensive list
• Intel debugger: IDB
– Linux: ‘idbc’ command line, ‘idb’ X11 GUI
– Mac OS: ‘idb’ command line only ( note DDD can wrap it )
7/30/2012 4
Optimization Notice
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
New User Essentials
• Fortran 90/95/03/08 – Stack Exhaustion Issue
– Symptom: SIGSEGV segfault, SIGBUS(Mac OS*)
– Use –heap-arrays or ulimit –s to avoid
– See: http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/
• Getting a traceback:
-g –traceback or /Zi /traceback
• “Incorrect results/numbers” (more on this later)
-fp-model precise or /fp:precise
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Linux: Documentation, Samples, and Tutorials
• HTML- and PDF-based documentation:
– <install dir>/composerxe/Documentation/en_US/
– Release_Notes[F | C].pdf
– documentation_[f | c].htm
• Samples: <installdir>/Samples/en_US/[C++ | Fortran]
/sample.htm for an index
– Vectorization, openmp, PGO, IPO, GAP, Corray Fortran, Cilk Plus
• Tutorials: <installdir>/composerxe/Documentation/en_US/
/tutorials_[c | f]/index.htm
– Tutorials are based on the Samples included
– Steps user through new technologies
7/30/2012 6
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Getting Help • User Forum(s)
– Web-based, publically visible
– No account needed to browse. To post, need a login
– Can send/receive files. Posts can be made “private” with Intel
•http://software.intel.com/en-us/forums/intel-fortran-
compiler-for-linux-and-mac-os-x/
•http://software.intel.com/en-us/forums/intel-visual-fortran-
compiler-for-windows/
•http://software.intel.com/en-us/forums/intel-c-compiler/
• Licensed user can access Premier Support http://premier.intel.com
– if you have your serial number, you can set up a premier account
• Webinars:
• http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
7/30/2012 7
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Watch for Webinars
• Keep watch on for future webinars on User Forum and … http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
http://software.intel.com/en-us/articles/intel-vectorization-february-15-2012-webinar/
7/30/2012 8
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Essential Compiler Optimizations
9
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Optimization Essentials Hands-on with “Quicklabs”
• Follow instructor to open a command line window – Linux users: source compilervars.sh
– Locate your “quicklabs” directory, cd to that directory
– Choose Windows or Linux, cd to that directory
– Choose Fortran or C++, cd to that directory
Intel Confidential
7/30/2012 10
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Common Optimization Switches
Windows* Linux* Mac OS*
Disable optimization /Od -O0
Optimize for speed (no code size increase) /O1 -O1
Optimize for speed (default) /O2 -O2
High-level optimizer, including prefetch, unroll
/O3 -O3
Create symbols for debugging /Zi -g
Inter-procedural optimization /Qipo -ipo
Profile guided optimization (muli-step build) /Qprof-gen
/Qprof-use
-prof-gen
-prof-use
Optimize for speed across the entire program **warning: -fast def’n changes over time
/fast (same as: /O3 /Qipo /Qprec-div- /QxHost)
-fast (same as: -ipo –O3 -no-prec-div -static -xHost)
OpenMP 3.0 support /Qopenmp -openmp
Automatic parallelization /Qparallel -parallel
11
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Quick lab 1: Matrix calculation O0 and O1
• Compile the program matrix.c/matrix.f90 without optimization and with O1, compare results
• Run quicklab1.bat or quicklab1.sh, which will compile and run with … Linux/Mac C++: icc –O0 matrix.c -o noopt_matrix.exe
icc -O1 matrix.c -o O1_matrix.exe
Windows Fortran: ifort /Od /exe: noopt_matrix.exe
matrix.f90
icl /O1 /exe:O1_matrix.exe matrix.f90
• Runs and times the executable
noopt_matrix.exe >> 30.25 sec.
O1_matrix.exe >> 9.6 sec.
[Intel® Core™ i7, 4-Core 64-bit, 1.6GHz, Windows*] 12
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
High-Level Optimizer (HLO)
• Compiler switches: -O2, -O3 (Linux*)
/O2, /O3 (Windows*),
• Loop level optimizations
– loop unrolling, cache blocking, prefetching
• More aggressive dependency analysis
– Determines whether or not it‘s safe to reorder or parallelize statements
• Scalar replacement
– Goal is to reduce memory references with register references
13
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Quick lab 2: Optimizations -O2 -O3 • Run quicklab2.bat or quicklab2.sh, which will
compile and run with …
• Compiles the program matrix.c/matrix.f90 with optimization -O[2|3], but w/o vectorizer, compare results: icc -O2 –no-vec matrix.c -o O2_novec_matrix.exe
icc -O3 –no-vec matrix.c -o O3_novec_matrix.exe
• Run and time the executable noopt_matrix.exe >> 30.25 sec.
O1_matrix.exe >> 9.60 sec.
O2_novec_matrix.exe >> 2.12 sec.
O3_novec_matrix.exe >> 1.60 sec.
[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]
14
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Vectorization – the ‘Make or Break’ for IA Performance
15
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Auto-Vectorization SIMD – Single Instruction Multiple Data
7/30/2012 16
• Scalar mode – one instruction produces
one result
• SIMD processing – with SSE or AVX instructions
– one instruction can produce multiple results
+ a[i]
b[i]
a[i]+b[i]
+
c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i]
b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i]
a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i]
for (i=0;i<=MAX;i++)
c[i]=a[i]+b[i];
a
b
a+b
+
do i=1,MAX
c(i)=a(i)+b(i)
end do
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Vectorization is Achieved through SIMD Instructions & Hardware
17
X4
Y4
X4opY4
X3
Y3
X3opY3
X2
Y2
X2opY2
X1
Y1
X1opY1
0 128 Intel® SSE Vector size: 128bit Data types: 8,16,32,64 bit integers 32 and 64bit floats VL: 2,4,8,16 Sample: Xi, Yi bit 32 int / float
Intel® AVX Vector size: 256bit Data types: 32 and 64 bit floats VL: 4, 8, 16 Sample: Xi, Yi 32 bit int or float First introduced in 2011
X4
Y4
X4opY4
X3
Y3
X3opY3
X2
Y2
X2opY2
X1
Y1
X1opY1
0 127
X8
Y8
X8opY8
X7
Y7
X7opY7
X6
Y6
X6opY6
X5
Y5
X5opY5
128 255
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
What Kinds of Applications can use Vectorization?
18
Bioinformatics
Financial Analytics Energy
Medical Imaging & Analysis
Signal Processing
Engineering Design
Science & Research
Broadcast & Film
3D Modeling & Visualization
Digital Content Creation
GIS & Satellite Imagery
Database & Business Intelligence
Defense & Security
Game Development
Telecommunications
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Comparison of Ways Applications can Take Advantage of Vectorization
19
Effort Required
Code Maintain-ability
Performance Potential
Scale Forward
Assembly/Intrinsics Most Least Best No
Existing libraries such as Intel® IPP, Intel® MKL
Least Most Best Yes
Compiler Vectorization Least Most Good Yes
High-level Constructs such as Intel® Cilk™ Plus, Fortran DO CONCURRENT
Moderate Most Best Yes
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Compiler Based Vectorization Extension Specification Feature Extension
Intel® Streaming SIMD Extensions 2 (Intel® SSE2) as available in initial Pentium® 4 or compatible non-Intel processors
sse2
Intel® Streaming SIMD Extensions 3 (Intel® SSE3) as available in Pentium® 4 or compatible non-Intel processors
sse3
Supplemental Streaming SIMD Extensions 3 (SSSE3) as available in Intel® Core™2 Duo processors
ssse3
Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture
sse4.1
Intel® SSE4.2 Accelerated String and Text Processing instructions supported first by by Intel® Core™ i7 processors
sse4.2
Extensions offered by Intel® ATOM™ processor : Intel SSSE3 (!!) and MOVBE instruction
sse3_atom
Intel® Advanced Vector Extensions (Intel® AVX) as available in 2nd generation Intel Core processor family – code name Sandy Bridge
AVX
Intel® Advanced Vector Extensions (Intel® AVX) as code-name Ivy Bridge and in code-name Haswell (available only in compilers v13+, fall 2012)
CORE-AVX-I CORE-AVX2
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Basic Vectorization – Switches [1]
{L&M} -x<extension> {W}: /Qx<extension> • Targeting Intel® processors - specific optimizations for Intel® processors
• Compiler will try to make use of all instruction set extensions up to and including <extension>; for Intel® processors only !
• Processor-check added to main-program
• Application will not start (will display message), in case feature is not available
{L&M}: -m<extension> {W}: /arch:<extension> • No Intel processor check
• Does not perform Intel-specific optimizations
• Application is optimized for and will run on both Intel and non-Intel processors
• Missing check can cause application to fail if instructions not supported
{L&M}: -ax<extension> {W}: /Qax<extension> • Multiple code paths – a ‘baseline’ and ‘optimized, processor-specific’ path(s)
• Optimized code path for Intel® processors defined by <extension>
• Baseline code path defaults to –msse2 (Windows: /arch:sse2) – The baseline code path can be modified by –m or –x (/Qx or /arch) switches
• -axavx,sse4.2 - paths for avx, sse4.2 and sse2 (default)
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Basic Vectorization – Switches [2]
Special switch -xHost (Windows: /QxHost)
• Compiler checks compilation host processor and makes use of ‘latest’ instruction set extension available for THAT processor
The Xeon default is –msse2 (Windows: /arch:sse2)
• Activated implicitly for –O2 or higher
• Implies the need for a target processor with Intel® SSE2
• Multiple Code Paths
Multiple extensions can be used in combinations like -ax<ext1>,<ext2> ( Windows: /Qax<ext1>,<ext2> )
-axavx,sse4.2,sse3 for example, OR
-axavx –xsse4.2 (AVX path, SSE4.2 path, default SSE2 path)
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Vectorization – More Switches and Directives
Disable vectorization • Globally via switch: {L&M}: -no-vec {W}: /Qvec-
• For a single loop: directive #pragma novector, !DIR$ novector
– Disabling vectorization means not using packed SSE/AVX instructions.
– The compiler still may use SSE instruction set extensions
Enforcing vectorization for a loop - overwriting the compiler
heuristics : #pragma vector always
– will enforce vectorization even if the compiler thinks it is not profitable to do so ( e.g due to non-unit strides or alignment issues)
– It’s a SUGGESTION: Will not enforce vectorization if the compiler fails to recognize this as a semantically correct transformation – hint to compiler to relax
– Using directive #pragma vector always assert will print error
message in case the loop cannot be vectorized and will abort compilation
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Vectorization Switches – Some Notes
• Former vectorization switches like –xW, /QxT, /QaxP etc are considered ‘deprecated’ and will not be supported anymore in the future
– See appendix or compiler documentation for mapping between old and new names
• It is not possible anymore to generate vector code exclusively for the initial SSE (32 bit FP) instruction set (introduced by Intel® Pentium™ 3 processor)
• The instruction set extension name for Intel® Atom™ processors (SSE3_ATOM) is misleading: Since the architecture supports up to SSSE3, e.g. switch –xsse3_atom will make use of SSSE3 too
– In case the code should be optimized for Intel® Atom processors but should run too on all processors supporting up to SSSE3, add option –minstruction=nomovbe (Windows: /Qinstruction:nomovbe) to avoid the use of the Atom-specific instruction MOVBE
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Vectorization – More Switches and Directives
Disable vectorization • Globally via switch: {L&M}: -no-vec {W}: /Qvec-
• For a single loop: directive novector
– Disabling vectorization here means not using packed SSE/AVX instructions. The compiler still might make use of the corresponding instruction set extensions
Enforcing vectorization for a loop - overwriting the compiler
heuristics : #pragma vector always
– will enforce vectorization even if the compiler thinks it is not profitable to do so ( e.g due to non-unit strides or alignment issues)
– Will not enforce vectorization if the compiler fails to recognize this as a semantically correct transformation
– Using directive #pragma vector always assert will print error
message in case the loop cannot be vectorized and will abort compilation
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Compiler Reports – Vectorization Report
Compiler switch: /Qvec-report<n> (Windows)
-vec-report<n> (Linux)
Set diagnostic level dumped to stdout
N=0: No diagnostic information (default ) n=1: Loops successfully vectorized n=2: Loops not vectorized – and the reason why not n=3: Adds dependency Information n=4: Reports only non-vectorized loops n=5: Reports only non-vectorized loops and adds dependency info
• Note: For Linux, –opt_report_phase hpo provides additional diagnostic
information for vectorization
26
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Quick lab 3: Compiler vectorizer • Run quicklab3.bat or quicklab3.sh which …
Compiles the program matrix.c/matrix.f90 with .. -O[2|3], vectorizer and create report: icc -O2 matrix.c -vec-report2 -o O2_matrix.exe
icc -O3 matrix.c -vec-report2 -o O3_matrix.exe
icc -O3 –xHost -vec-report2 matrix.c -o O3_xHost_matrix.exe
icc -fast matrix.c -o fast_matrix.exe
[Do all loops vectorize? Is there room for improvement?]
• Run and time the executable noopt_matrix.exe >> 30.25 sec.
O1_matrix.exe >> 9.60 sec.
O2_novec_matrix.exe >> 2.12 sec.
O3_novec_matrix.exe >> 1.60 sec.
O2_matrix.exe >> 0.98 sec.
O3_matrix.exe >> 0.83 sec.
O3_xHost_matrix.exe >> 0.78 sec.
fast_matrix.exe >> 0.78 sec.
27
[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Other Optimizations that Support or Complement Vectorization
• Vectorization relies on
– NO Dependencies in data between loop iterations
– or dependencies that can be removed by simple methods
– Regular, predictable data patterns for all operands
– No pointer chasing, indirect accesses
– Vector lengths that are large enough AND
– Data is aligned on natural boundaries ( 16, 32, or 64 bytes )
– cache line matching boundary, can use cache as staging
– Streaming stores – store optimization
• COST MODEL – is it worth it to vectorize???
• Advanced optimizations help with some of these
28
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 29
Interprocedural Optimizations (IPO) Multi-pass Optimization
• Interprocedural optimizations performs a static, topological analysis of your application!
• ip: Enables inter-procedural optimizations for current source file compilation
• ipo: Enables inter-procedural optimizations across files
• Can inline functions in separate files
• Especially many small utility functions benefit from IPO Enabled optimizations: • Procedure inlining (reduced function call overhead) • Procedure reordering • Interprocedural dead code elimination, constant propagation and procedure
reordering • Enhances optimization when used in combination with other compiler features
Windows* Linux* Mac OS*
/Qip -ip
/Qipo -ipo
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 30
Interprocedural Optimizations (IPO) Usage: Two-Step Process
Linking
Linux* icc -ipo main.o func1.o
func2.o
Windows* icl /Qipo main.o func1.o
func2.obj
Pass 1
Pass 2
mock object
executable
Compiling
Linux* icc -c -ipo main.c func1.c
func2.c
Windows* icl -c /Qipo main.c func1.c
func2.c
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Quick lab 4: Include IPO Quick Lab 4-1: Show multi-file optimizations • Compiles the program matrix.c/matrix.f90 with
optimization and with including /Qipo – compare results: icc -O3 -ipo matrix.c -opt-report-phase:ipo -o O3_ipo_matrix.exe
icc -O3 –ipo -xHost matrix.c -o O3_ipo_xHost_matrix.exe
[Don’t expect much with IPO on this sample (why?)]
• Run and time the executable O2_novec_matrix.exe >> 2.12 sec.
O3_novec_matrix.exe >> 1.60 sec.
O2_matrix.exe >> 0.98 sec.
O3_matrix.exe >> 0.83 sec.
O3_ipo_matrix.exe >> 0.82 sec.
O3_ipo_xHost_matrix.exe >> 0.75 sec.
• Run quicklab4-1 to see what IPO can do for multi-file optimization
31
[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Profile-Guided Optimizations (PGO)
• Static analysis leaves many questions open for the optimizer like: – How often is x > y – What is the size of count – Which code is touched how often
• Use execution-time feedback to guide (final) optimization
• Enhancements with PGO: – More accurate branch prediction – Basic block movement to improve instruction
cache behavior – Better decision of functions to inline (help IPO) – Can optimize function ordering – Switch-statement optimization – Better vectorization decisions
32
if (x > y) do_this(); else do that();
for(i=0; i<count; ++I
do_work();
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
PGO Usage: Three Step Process
33
Compile + link to add instrumentation icc -prof_gen prog.c
Execute instrumented program prog.exe (on a typical dataset)
Compile + link using feedback icc -prof_use prog.c
Dynamic profile: 12345678.dyn
Instrumented executable: foo.exe
Merged .dyn files: pgopti.dpi
Step 1
Step 2
Step 3
Optimized executable: foo.exe
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Quick lab5: Use PGO
Run quicklab5.bat or quicklab5.sh which …
• Compiles the program matrix.c/matrix.f90 with /Qprof-gen /Qprof-use – compare results: icc -prof-gen matrix.c -o pgen_matrix.exe
pgen_matrix.exe [Why does this execution take so long?]
icc -prof-use -O3 -xHost matrix.c -o puse_matrix.exe
• Run and time the executable O2_novec_matrix.exe >> 2.12 sec.
O3_novec_matrix.exe >> 1.60 sec.
O2_matrix.exe >> 0.98 sec.
O3_matrix.exe >> 0.83 sec.
O3_ipo_matrix.exe >> 0.82 sec.
O3_ipo_xHost_matrix.exe >> 0.75 sec.
puse_matrix.exe >> 0.32 sec.
34
[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Parallelization
• Obviously, we now have multi-core and many-core
• How to utilize those cores?
– Treat as another system: process level parallelism
– 1 MPI process per core
– disadvantage: larger memory footprint, unneeded redundancy
– Threading: map threads to cores
– Smaller memory footprint, sharing of data possible
– Auto-parallel – let the compiler do it
– OpenMP, Pthreads, Win threads
– Higher level language abstractions and constructs
35
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Auto-Parallelization
• Compiler automatically translates portions of serial code into equivalent multithreaded code with using these options:
-parallel /Qparallel
• The auto-parallelizer analyzes the dataflow of loops and generates multithreaded code for those loops which can safely and efficiently be executed in parallel.
• The auto-parallelizer report can provide information about
program sections that could be parallelized by the compiler. Compiler option:
-par-report0|1|2|3
/Qpar-report:0|1|2|3
0 is report disabled, 3 maximum diagnostics level
36
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Compiler Reports – Optimization Report
Compiler option: Control the level of detail in the report: -opt-report[0|1|2|3] (Linux, Mac OS X) /Qopt-report[0|1|2|3] (Windows)
-opt-report-phase[=phase] (Linux)
/Qopt-report-phase[:phase] (Windows)
phase can be: • ipo - Interprocedural Optimization • ilo – Intermediate Language Scalar Optimization • hpo – High Performance Optimization (cache blocking,
data access) • hlo – High-level Optimization (loop transforms, etc.) • all – All optimizations (default if opt-report-phase not
used)
37
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Optimization Report Example icc –O3 –opt-report-phase=hlo -opt-report-phase=hpo icl /O3 /Qopt-report-phase:hlo /Qopt-report-phase:hpo
…
LOOP INTERCHANGE in loops at line: 7 8 9
Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )
…
Loop at line 8 blocked by 128
Loop at line 9 blocked by 128
Loop at line 10 blocked by 128
…
Loop at line 10 unrolled and jammed by 4
Loop at line 8 unrolled and jammed by 4
…
…(10)… loop was not vectorized: not inner loop.
…(8)… loop was not vectorized: not inner loop.
…(9)… PERMUTED LOOP WAS VECTORIZED
…
–vec-report2 (/Qvec-report2) to get a vectorization report
38
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Quick lab 6: Auto-parallelization • Compile the program matrix.c/matrix.f90 with /Qparallel
/Qpar-report2 with and without additional
vectorization/optimizations: icc -parallel -par-report2 –no-vec -o par_novec_matrix.exe
icc -parallel -par-report2 -o par_matrix.exe
icc -parallel -par-report2 -O3 -vec-report2 -ipo –xHost -o
par_vec_matrix.exe
icc -O3 -parallel -prof-use -O3 -ipo –xHost -o
par_vec_puse_matrix.exe
[Examine the parallel report. Is the program well parallelized? What are the possible pitfalls with thread-level parallelization?]
• Run and time the executable O3_ipo_xHost_matrix.exe >> 0.75 sec.
puse_matrix.exe >> 0.32 sec.
par_novec_matrix.exe >> 0.90 sec.
par_matrix.exe >> 0.35 sec.
par_vec_matrix.exe >> 0.35 sec.
par_vec_puse_matrix.exe >> 0.12 sec.
39
[Intel® Core™ i7, 4-Core 64-bit, 1,6GHz]
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Profiling Function level and/or Loop-level
7/30/2012 40
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Loop Profiler Identify Time Consuming Functions and Loops
• Compiler switch: -profile-functions /Qprofile-functions,
Insert instrumentation calls on function entry and exit points to collect the cycles spent within the function.
• Compiler switch: -profile-loops= <inner|outer|all>
/Qprofile-loops=<inner|outer|all> – Insert instrumentation calls for function entry and exit points as well as
the instrumentation before and after instrument able loops of the type listed as the option’s argument.
• Loop Profiler switches trigger generation of text (.dump) and XML (.xml) output files – Invocation of XML on command line: java -jar loopprofviewer.jar <xml datafile>
41
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Loop Profiler Text Dump (.dump file)
42
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Loop Profiler Data Viewer GUI (copy from sl. 46)
43
Function Profile View
Loop Profile View
Column headers allow selection
to control sort criteria
independently for function and
loop table
Menu to allow user to enable
filtering or displaying the
source code
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Using the Intel® Math Kernel Library
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Using MKL with Intel Compilers
• The Intel® Math Kernel Library (MKL) is
– supplied with the Intel® Composer XE products
– Easy to use with the Intel compilers 12.1 and above:
-mkl or -mkl=parallel link with standard threaded
Intel MKL
-mkl=sequential link with sequential version
-mkl=cluster link with Intel MKL cluster components
(sequential) that use Intel MPI
Intel Confidential
45
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Using MKL, Controlling Threads
• Threaded MKL library uses OpenMP threading to use available cores
• These Intel MKL function domains are threaded:
– Direct sparse solver
– LAPACK: list of threaded routines, see MKL docs”Threaded LAPACK Routines”
– Level1 and Level2 BLAS: list of threaded routines, see MKL docs “Threaded BLAS Level1 and Level2 Routines”
– All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
– All mathematical VML functions.
– FFT: list of FFT transforms that can be threaded, see MKL docs “Threaded FFT Problems”
Intel Confidential
46
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Using MKL, Controlling Threads
• Set one of the OpenMP or Intel MKL environment variables:
– OMP_NUM_THREADS – applies to entire application OpenMP
– MKL_NUM_THREADS – applies just to all MKL threaded funcs
– MKL_DOMAIN_NUM_THREADS – finer control on domain-specific MKL functions:
– export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4”
• Call one of the OpenMP or Intel MKL functions:
– omp_set_num_threads()
– mkl_set_num_threads()
– mkl_domain_set_num_threads()
Intel Confidential
47
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Using MKL, Explicit Use
• Use online “MKL Link Line Advisor”: http://software.intel
.com/sites/products/m
kl/MKL_Link_Line_Advi
sor.html
Intel Confidential
48
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Questions?
Software & Services Group
Developer Products Division Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
50 Intel Confidential
7/30/2012