multi-core programming tools. 2 intel compilers 9.x on the intel® core duo™ processor windows...

Post on 25-Dec-2015

231 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Multi-core Programming

Tools

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

2

Topics

• General Ideas• Compiler Switches• Dual Core• Vectorization

3

General Ideas - Optimization

• Exploiting Architectural Power requires Sophisticated Compilers

• Optimal use of– Registers & functional units– Dual-Core/Multi-processor– SSE instructions– Cache architecture

4

General Ideas – Compatibility

Compatibility is a key concern for software development. Always read manuals to ensure:

• Tool compatibility with hardware revisions.• Tool compatibility with software revisions.• Tool compatibility with IDE.• Tool compatibility with native (development) and target

(deployment) operating system.

5

General Ideas – Intel C++ Compatibility with Microsoft

• Source & binary compatible with VC2003 with /Qvc71,

• Source & binary compatible with w/ VC 2005 under /Qvc8.

• Microsoft* & Intel OpenMP binaries are not compatible.

• Use the one compiler for all modules compiled with OpenMP

• For more information, refer to the User’s Guide

6

General Ideas – Tools

Never ignore or take tools for granted.

A key part of system development is initial specification and qualification of tools required to get the job done.

The wrong tool alone can destroy project success chances.

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

7

General Ideas - Use Intel Compiler in Microsoft IDE C++

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

8

Topics

• General Ideas• Compiler Switches• Dual Core• Vectorization

9Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

Compiler Switches - General Optimizations

Windows* Linux* Mac*

/Od -O0 -O0 Disables optimizations

/Zi -g -g Creates symbols

/O1 -O1 -O1 Optimize for Binary Size: Server Code

/O2 -O2 -O2 Optimizes for speed (default)

/O3 -O3 -O3 Optimize for Data Cache:

Loopy Floating Point Code

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

10

Compiler Switches - Multi-pass Optimization Interprocedural Optimizations (IPO)

ip: Enables interproceduraloptimizations for single file compilation

ipo: Enables interproceduraloptimizations across filesCan inline functions in separate files

Enhances optimization when used in combination with other compiler features

Windows* Linux* Mac*

/Qip -ip -ip

/Qipo -ipo -ipo

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

11

Compiler Switches - Multi-pass Optimization - IPOUsage: Two-Step Process

Linking

Windows* icl /Qipo main.o func1.o func2.o

Linux* icc -ipo main.o func1.o func2.o

Mac* icc -ipo main.o func1.o func2.o

Pass 1

Pass 2

virtual .o

executable

Compiling

Windows* icl -c /Qipo main.c func1.c func2.c

Linux* icc -c -ipo main.c func1.c func2.c

Mac* icc -c -ipo main.c func1.c func2.c

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

12

Compiler Switches - Profile Guided Optimizations (PGO)

• Use execution-time feedback to guide many other compiler optimizations

• Helps I-cache, paging, branch-prediction• Enabled optimizations:– Basic block ordering– Better register allocation– Better decision of functions to inline– Function ordering– Switch-statement optimization– Better vectorization decisions

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

13

Instrumented Compilation(Mac*/Linux*) icc -prof_gen[x] prog.c(Windows*) icl -Qprof_gen[x] prog.c

Instrumented ExecutionRun program on a typical dataset

Feedback Compilation(Mac/Linux) icc -prof_use prog.c(Windows) icl -Qprof_use prog.c

DYN file containingdynamic info: .dyn

Instrumented executable

Merged DYNsummary file: .dpiDelete old dyn files if you do not want the info included

Step 1

Step 2

Step 3

Compiler Switches - Multi-pass OptimizationPGO: Three-Step Process

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

14

Topics

• General Ideas• Compiler Switches• Dual Core

• Auto Parallelization• OpenMP• Threading Diagnostics

• Vectorization

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

15

Auto-parallelization

• Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives.

– Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.

Windows* Linux* Mac*

/Qparallel -parallel -parallel

/Qpar_report[n] -par_report[n] -par_report[n]

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

16

OpenMP* Threading Technology • Pragma based approach to parallelism• Usage:

– OpenMP switches: -openmp : /Qopenmp– OpenMP reports: -openmp-report : /Qopenmp-report

#pragma omp parallel for for (i=0;i<MAX;i++) A[i]= c*A[i] + B[i];

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

17

OpenMP: Workqueueing Extension Example

• Intel Compiler’s Workqueuing extension– Create Queue of tasks…Works on…• Recursive functions• Linked lists, etc.

#pragma intel omp parallel taskq shared(p){ while (p != NULL) {#pragma intel omp task captureprivate(p)

do_work1(p);     p = p->next; }}

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

18

Parallel Diagnostics• Source Instrumentation for Intel Thread Checker

• Allows thread checker to diagnose threading correctness bugs• To use tcheck/Qtcheck you must have Intel Thread Checker installed

Windows* Linux* Mac*

/Qtcheck -tcheck No support

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

19

Topics

• General Ideas• Compiler Switches• Dual Core• Vectorization

• SSE & Vectorization• Vectorization Reports• Explanations of a few specific vectorization inhibitors

20

SIMD – SSE, SSE2, SSE3 Support

16x bytes

8x words

4x dwords

2x qwords

1x dqword

4x floats

2x doubles

MMX*

SSE

SSE2SSE3

* MMX actually used the x87 Floating Point Registers - SSE, SSE2, and SSE3 use the new SSE registers

21

SIMD FP using AOS format*

Thread Synchronization

Video encoding

Complex arithmetic

FP to integer conversions

HADDPD, HSUBPD

HADDPS, HSUBPS

MONITOR, MWAIT

LDDQU

ADDSUBPD, ADDSUBPS,

MOVDDUP, MOVSHDUP,

MOVSLDUP

FISTTP

* Also benefits Complex and Vectorization

SSE3 Instructions

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

22

Using SSE3 - Your Task: Convert This…

128-bit Registers

A[0]

B[0]

C[0]

+ + + +

A[1]

B[1]

C[1]

not used not used not used

not used not used not used

not used not used not used

for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

23

… Into This …

128-bit Registers

A[3] A[2]

B[3] B[2]

C[3] C[2]

+ +

A[1] A[0]

B[1] B[0]

C[1] C[0]

+ +

for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

24Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

Compiler Based VectorizationProcessor Specific

Description Use Windows* Linux* Mac*

Generate instructions and optimize for Intel® Pentium® 4 compatible processors including MMX, SSE and SSE2.

W /QxW -xW Does not apply

Generate instructions and optimize for Intel® processors with SSE3 capability including Core Duo. These processors support SSE3 as well as MMX,SSE and SSE2.

P /QxP/QaxP

-xP,-axP

Vector-ization occurs by default

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

25

Compiler Based Vectorization Automatic Processor Dispatch – ax[?]

• Single executable – Optimized for Intel® Core Duo processors and

generic code that runs on all IA32 processors.• For each target processor it uses:– Processor-specific instructions– Vectorization

• Low overhead – Some increase in code size

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

26

Why Loops Don’t Vectorize• Independence

– Loop Iterations generally must be independent

• Some relevant qualifiers:

– Some dependent loops can be vectorized.– Most function calls cannot be vectorized. – Some conditional branches prevent vectorization. – Loops must be countable.– Outer loop of nest cannot be vectorized.– Mixed data types cannot be vectorized.

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

27

Why Didn’t My Loop Vectorize? Windows* Linux* Macintosh*

-Qvec_reportn -vec_reportn -vec_reportn

Set diagnostic level dumped to stdout

– n=0: No diagnostic information– n=1: (Default) Loops successfully vectorized – n=2: Loops not vectorized – and the reason why not– n=3: Adds dependency Information– n=4: Reports only non-vectorized loops– n=5: Reports only non-vectorized loops and adds dependency info

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

28

Why Loops Don’t Vectorize

– “Existence of vector dependence”– “Nonunit stride used”– “Mixed Data Types”– “Unsupported Loop Structure”– “Contains unvectorizable statement at line XX”– There are more reasons loops don’t vectorize but

we will disucss the reasons above

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

29

“Existence of Vector Dependency”

• Usually, indicates a real dependency between iterations of the loop, as shown here:

for (i = 0; i < 100; i++) x[i] = A * x[i + 1];

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

30

Defining Loop Independence

Iteration Y of a loop is independent of when (or whether) iteration X occurs.

int a[MAX], b[MAX];

for (j=0;j<MAX;j++) {

a[j] = b[j];

}

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

31

“Nonunit stride used”

for (I=0;I<=MAX;I++) for (J=0;J<=MAX;J++) {

c[I][J]+=1; // Unit Stridec[J][I]+=1; // Non-UnitA[J*J]+=1; // Non-unitA[B[J]]+=1; // Non-Unitif (A[MAX-J])=1 last1=J;}// Non-Unit

End Result: Loading Vector may take more cycles than executing operation sequentially.

Mem

ory

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

32

“Mixed Data Types”An example:

int howmany_close(double *x, double *y) { int withinborder=0; double dist; for(int i=0;i<MAX;i++) { dist=sqrtf(x[i]*x[i] + y[i]*y[i]); if (dist<5) withinborder++; }

}

Mixed data types are possible – but complicate thingsi.e.: 2 doubles vs 4 ints per SIMD register

Some operations with specific data types won’t work

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

33

“Unsupported Loop Structure”

Example:struct _xx {

int data;int bound; } ;

doit1(int *a, struct _xx *x) { for (int i=0; i<x->bound; i++) a[i] = 0;

An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can’t construct a run-time expression for the trip count.

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

34

“Contains unvectorizable statement”

for (i=1;i<nx;i++) {

B[i] = func(A[i]); }

128-bit Registers128-bit Registers

A[3] A[2]

B[3] B[2]

func func

A[1] A[0]

B[1] B[0]

func func

Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

35

Reference

• White papers and technical notes– www.intel.com/ids– www.intel.com/software/products

• Product support resources– www.intel.com/software/products/support

top related