
Intel® Xeon® Phi Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm

Brian Gesiak

April 16th, 2014

Research Student, The University of Tokyo

@modocache

Today

• Image blurring with a 9-point stencil algorithm
• Comparing performance
  • Intel® Xeon® Dual Processor
  • Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
  • Worst: completely serial
  • Better: adding loop vectorization
  • Best: supporting multiple threads
• Further optimizations
  • Padding arrays for improved cache performance
  • Read-less writes, i.e. streaming stores
  • Using huge memory pages

Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

typedef double real;

typedef struct {
    real center;
    real next;
    real diagonal;
} weight_t;

Each output point is a weighted sum of nine input points: the point itself (weight.center), its four horizontal and vertical neighbors (weight.next), and its four diagonal neighbors (weight.diagonal).
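As a minimal sketch (not on the slides) of what these weights mean for a single interior pixel at (x, y), assuming row-major storage with one real value per pixel:

real blur_pixel(const real *fin, int width, int x, int y, weight_t weight) {
    // Index of the pixel itself and of the rows directly above and below it.
    int center = y * width + x;
    int north  = center - width;
    int south  = center + width;
    return weight.center   *  fin[center]
         + weight.next     * (fin[north] + fin[south] + fin[center - 1] + fin[center + 1])
         + weight.diagonal * (fin[north - 1] + fin[north + 1] + fin[south - 1] + fin[south + 1]);
}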

Image Blurring
Applying a 9-Point Stencil to a Bitmap

[Figure: the 9-point stencil applied repeatedly to a bitmap, illustrating the resulting halo effect.]

Sample Application

• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
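Not stated on the slides, but useful for reading the results tables: the MegaFLOPS figures are consistent with counting 17 floating-point operations (9 multiplies and 8 adds) per interior pixel:

(5,900 - 2) x (10,000 - 2) interior pixels x 1,000 iterations x 17 FLOPs ≈ 1.0 x 10^12 floating-point operations per run.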

Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

                             Intel® Xeon® Dual Processor    Intel® Xeon® Phi Coprocessor
Processor Clock Frequency    2.6 GHz                        1.091 GHz
Number of Cores              16 (8 x 2 CPUs)                61
Memory Size/Type             63 GB / DDR3                   8 GB / GDDR5
Peak DP/SP FLOPs             345.6 / 691.2 GigaFLOP/s       1.065 / 2.130 TeraFLOP/s
Peak Memory Bandwidth        85.3 GB/s                      352 GB/s
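The coprocessor's peak figure can be sanity-checked (this arithmetic is not on the slides): 61 cores x 1.091 GHz x 16 double-precision FLOPs per cycle (8-wide vectors with fused multiply-add) ≈ 1.065 TeraFLOP/s.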

1st Comparison: Serial Execution

void stencil_9pt(real *fin, real *fout, int width, int height,
                 weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center;
        ++north;
        ++northeast;
      }
    }

    // Swap buffers for next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}

Assumed vector dependency: the compiler cannot prove that fin and fout do not overlap, so the inner loop is not vectorized.

1st Comparison: Serial Execution
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                   MegaFLOPS
Intel® Xeon® Dual Processor     244.178 seconds (4 minutes)         4,107.658
Intel® Xeon® Phi Coprocessor    2,838.342 seconds (47.3 minutes)    353.375

The Dual Processor is 11 times faster than the Phi.
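The slides elide how the neighbor indices are computed at the top of each row. A minimal sketch of what that setup might look like (hypothetical, assuming row-major storage with a stride of width elements; the padded version later in the deck does the same with kPaddingSize as the stride):

// Hypothetical reconstruction of the elided "...calculate center, east, northwest, etc." step.
int center    = y * width + 1;   // first interior pixel of row y
int north     = center - width;
int south     = center + width;
int east      = center + 1;
int west      = center - 1;
int northwest = north - 1;
int northeast = north + 1;
int southwest = south - 1;
int southeast = south + 1;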

2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}

ivdep: tells the compiler to ignore assumed dependencies

• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.

Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
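A short sketch, not from the slides, contrasting the two cases described above (the function names here are made up for illustration):

// Assumed dependency: with two distinct buffers this loop is safe, but the
// compiler cannot prove dst and src never overlap, so without the pragma it
// has to assume they might.
void scale(real *dst, const real *src, int n) {
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        dst[i] = 0.5 * src[i];
}

// Proven dependency: ivdep does not help here, because each iteration really
// does read the value written by the previous one.
void prefix_sum(real *a, int n) {
    #pragma ivdep
    for (int i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}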

2nd Comparison: Vectorization
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                 MegaFLOPS
Intel® Xeon® Dual Processor     186.585 seconds (3.1 minutes)     5,375.572
Intel® Xeon® Phi Coprocessor    623.302 seconds (10.3 minutes)    1,609.171

Vectorization makes the Dual Processor 1.3 times faster and the Phi 4.5 times faster than their serial versions. The Dual Processor is now only 4 times faster than the Phi.

3rd Comparison: Multithreading
Work Division Using Parallel For Loops

for (int i = 0; i < count; ++i) {
  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}
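The results below vary the thread count per run. The slides do not show how the counts were set; a common way (an assumption here, not taken from the deck) is the OMP_NUM_THREADS environment variable or a call to omp_set_num_threads before the timed region:

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Hypothetical setup, not from the slides: ask for 122 OpenMP threads,
    // i.e. roughly 2 threads per core on a 61-core coprocessor.
    omp_set_num_threads(122);

    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}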

3rd Comparison: Multithreading
Results

Processor                        Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Dual Proc., 16 Threads     43.862                         22,867.185
Xeon® Dual Proc., 32 Threads     46.247                         21,688.103
Xeon® Phi, 61 Threads            11.366                         88,246.452
Xeon® Phi, 122 Threads           8.772                          114,338.399
Xeon® Phi, 183 Threads           10.546                         94,946.364
Xeon® Phi, 244 Threads           12.696                         78,999.44

Compared with the vectorized versions, multithreading makes the Dual Processor about 4 times faster and the Phi about 71 times faster. The Phi is now 5 times faster than the Dual Processor.

Further Optimizations

1. Padded arrays
2. Streaming stores
3. Huge memory pages

Optimization 1: Padded Arrays
Optimizing Cache Access

• We can add extra, unused data to the end of each row.
• Doing so aligns heavily used memory addresses for efficient cache line access.
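Working through the numbers (the slides show only the formula, not the intermediate values): one row of 5,900 real (double, 8-byte) values is 47,200 bytes, or 737.5 cache lines of 64 bytes, so consecutive rows start at different offsets within a cache line. Padding each row to

((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real)) = 738 * 8 = 5,904 elements = 47,232 bytes

makes every row exactly 738 cache lines long, so each row begins on a cache-line boundary.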

Optimization 1: Padded Arrays

static const size_t kPaddingSize = 64;

int main(int argc, const char **argv) {
  int height = 10000;
  int width = 5900;
  int count = 1000;

  size_t size = sizeof(real) * width * height;
  real *fin = (real *)malloc(size);
  real *fout = (real *)malloc(size);

  weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
  stencil_9pt(fin, fout, width, height, weight, count);

  // ...save results

  free(fin);
  free(fout);
  return 0;
}

The padded version replaces the row length, the allocation, and the matching free calls (reconstructed here from the slide call-outs):

// kPaddingSize becomes the padded row length in elements (5,904 for a width of 5,900):
static const size_t kPaddingSize =
    ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));

// The allocation is sized by the padded row length and aligned with _mm_malloc:
size_t size = sizeof(real) * kPaddingSize * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);
real *fout = (real *)_mm_malloc(size, kPaddingSize);

// Aligned allocations are released with _mm_free:
_mm_free(fin);
_mm_free(fout);

Optimization 1: Padded Arrays
Accommodating for Padding

#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc.
  int center = 1 + y * kPaddingSize + 1;
  int north = center - kPaddingSize;
  int south = center + kPaddingSize;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}

Optimization 1: Padded Arrays
Results

Processor                 Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Phi, 61 Threads     11.644                         86,138.371
Xeon® Phi, 122 Threads    8.973                          111,774.803
Xeon® Phi, 183 Threads    10.326                         97,132.546
Xeon® Phi, 244 Threads    11.469                         87,452.707

Optimization 2: Streaming Stores
Read-less Writes

• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel in our program, we do not use the original value of that pixel. Therefore, enabling streaming stores should result in better performance.
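To put rough numbers on that (an estimate, not from the slides): each pass reads all of fin and writes all of fout, roughly 450 MB each. If every store first reads the destination cache line, a pass moves about 3 x 450 MB ≈ 1.35 GB of memory traffic; with streaming stores it moves about 2 x 450 MB ≈ 0.9 GB, roughly a one-third reduction.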

Optimization 2: Streaming Stores
Read-less Writes with Vector Nontemporal

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}

Optimization 2: Streaming Stores
Results

Processor                 Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Phi, 61 Threads     13.588                         73,978.915
Xeon® Phi, 122 Threads    8.491                          111,774.803
Xeon® Phi, 183 Threads    8.663                          115,773.405
Xeon® Phi, 244 Threads    9.507                          105,498.781

Optimization 3: Huge Memory Pages

• Memory pages map virtual memory used by our program to physical memory.
• Mappings are stored in a translation look-aside buffer (TLB).
• Mappings are traversed in a "page table walk".
• malloc and _mm_malloc use 4KB memory pages by default.
• By increasing the size of each memory page, traversal time may be reduced.
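As a rough illustration of why this helps (the page counts below are added arithmetic, assuming the usual 2 MB huge pages on Linux): each ~450 MB image buffer spans roughly 115,000 pages at 4 KB per page, but only about 225 pages at 2 MB per page, so far more of the working set can be covered by the TLB at once.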

size_t size = sizeof(real) * kPaddingSize * height;

// Before: cache-line-aligned allocation, backed by default 4KB pages
real *fin = (real *)_mm_malloc(size, kPaddingSize);

// After: anonymous mapping backed by huge pages (requires <sys/mman.h>)
real *fin = (real *)mmap(0, size, PROT_READ|PROT_WRITE,
                         MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1, 0);

Optimization 3: Huge Memory Pages
Results

Processor                 Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Phi, 61 Threads     14.486                         69,239.365
Xeon® Phi, 122 Threads    8.226                          121,924.389
Xeon® Phi, 183 Threads    8.749                          114,636.799
Xeon® Phi, 244 Threads    9.466                          105,955.358

Takeaways

• The key to achieving high performance is to use loop vectorization and multiple threads.
• Completely serial programs run faster on standard processors.
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor.
• Other optimizations may be used to tweak performance:
  • Data padding
  • Streaming stores
  • Huge memory pages

Sources and Additional Resources

• Today’s slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm
