
Intel® Xeon® Phi Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm

Brian Gesiak

April 16th, 2014

Research Student, The University of Tokyo

@modocache

Today

• Image blurring with a 9-point stencil algorithm
• Comparing performance
  • Intel® Xeon® Dual Processor
  • Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
  • Worst: completely serial
  • Better: adding loop vectorization
  • Best: supporting multiple threads
• Further optimizations
  • Padding arrays for improved cache performance
  • Read-less writes, i.e. streaming stores
  • Using huge memory pages

Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

typedef double real;

typedef struct {
    real center;
    real next;
    real diagonal;
} weight_t;

Each output point is a weighted sum of nine input points: the point itself (weight.center), its four horizontal and vertical neighbors (weight.next), and its four diagonal neighbors (weight.diagonal).
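As a minimal sketch (not on the slides) of what these weights mean for a single interior pixel at (x, y), assuming row-major storage with one real value per pixel:

real blur_pixel(const real *fin, int width, int x, int y, weight_t weight) {
    // Index of the pixel itself and of the rows directly above and below it.
    int center = y * width + x;
    int north  = center - width;
    int south  = center + width;
    return weight.center   *  fin[center]
         + weight.next     * (fin[north] + fin[south] + fin[center - 1] + fin[center + 1])
         + weight.diagonal * (fin[north - 1] + fin[north + 1] + fin[south - 1] + fin[south + 1]);
}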

Image Blurring
Applying a 9-Point Stencil to a Bitmap

[Figure: the 9-point stencil applied repeatedly to a bitmap, illustrating the resulting halo effect.]

Sample Application

• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
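Not stated on the slides, but useful for reading the results tables: the MegaFLOPS figures are consistent with counting 17 floating-point operations (9 multiplies and 8 adds) per interior pixel:

(5,900 - 2) x (10,000 - 2) interior pixels x 1,000 iterations x 17 FLOPs ≈ 1.0 x 10^12 floating-point operations per run.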

Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

                             Intel® Xeon® Dual Processor    Intel® Xeon® Phi Coprocessor
Processor Clock Frequency    2.6 GHz                        1.091 GHz
Number of Cores              16 (8 x 2 CPUs)                61
Memory Size/Type             63 GB / DDR3                   8 GB / GDDR5
Peak DP/SP FLOPs             345.6 / 691.2 GigaFLOP/s       1.065 / 2.130 TeraFLOP/s
Peak Memory Bandwidth        85.3 GB/s                      352 GB/s
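The coprocessor's peak figure can be sanity-checked (this arithmetic is not on the slides): 61 cores x 1.091 GHz x 16 double-precision FLOPs per cycle (8-wide vectors with fused multiply-add) ≈ 1.065 TeraFLOP/s.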

1st Comparison: Serial Execution

void stencil_9pt(real *fin, real *fout, int width, int height,
                 weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center;
        ++north;
        ++northeast;
      }
    }

    // Swap buffers for next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}

Assumed vector dependency: the compiler cannot prove that fin and fout do not overlap, so the inner loop is not vectorized.

1st Comparison: Serial Execution
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                   MegaFLOPS
Intel® Xeon® Dual Processor     244.178 seconds (4 minutes)         4,107.658
Intel® Xeon® Phi Coprocessor    2,838.342 seconds (47.3 minutes)    353.375

The Dual Processor is 11 times faster than the Phi.
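The slides elide how the neighbor indices are computed at the top of each row. A minimal sketch of what that setup might look like (hypothetical, assuming row-major storage with a stride of width elements; the padded version later in the deck does the same with kPaddingSize as the stride):

// Hypothetical reconstruction of the elided "...calculate center, east, northwest, etc." step.
int center    = y * width + 1;   // first interior pixel of row y
int north     = center - width;
int south     = center + width;
int east      = center + 1;
int west      = center - 1;
int northwest = north - 1;
int northeast = north + 1;
int southwest = south - 1;
int southeast = south + 1;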

2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}

ivdep: tells the compiler to ignore assumed dependencies

• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.

Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
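A short sketch, not from the slides, contrasting the two cases described above (the function names here are made up for illustration):

// Assumed dependency: with two distinct buffers this loop is safe, but the
// compiler cannot prove dst and src never overlap, so without the pragma it
// has to assume they might.
void scale(real *dst, const real *src, int n) {
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        dst[i] = 0.5 * src[i];
}

// Proven dependency: ivdep does not help here, because each iteration really
// does read the value written by the previous one.
void prefix_sum(real *a, int n) {
    #pragma ivdep
    for (int i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}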

2nd Comparison: Vectorization
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                 MegaFLOPS
Intel® Xeon® Dual Processor     186.585 seconds (3.1 minutes)     5,375.572
Intel® Xeon® Phi Coprocessor    623.302 seconds (10.3 minutes)    1,609.171

Vectorization makes the Dual Processor 1.3 times faster and the Phi 4.5 times faster than their serial versions. The Dual Processor is now only 4 times faster than the Phi.

3rd Comparison: Multithreading
Work Division Using Parallel For Loops

for (int i = 0; i < count; ++i) {
  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}
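The results below vary the thread count per run. The slides do not show how the counts were set; a common way (an assumption here, not taken from the deck) is the OMP_NUM_THREADS environment variable or a call to omp_set_num_threads before the timed region:

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Hypothetical setup, not from the slides: ask for 122 OpenMP threads,
    // i.e. roughly 2 threads per core on a 61-core coprocessor.
    omp_set_num_threads(122);

    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}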

3rd Comparison: Multithreading
Results

Processor                        Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Dual Proc., 16 Threads     43.862                         22,867.185
Xeon® Dual Proc., 32 Threads     46.247                         21,688.103
Xeon® Phi, 61 Threads            11.366                         88,246.452
Xeon® Phi, 122 Threads           8.772                          114,338.399
Xeon® Phi, 183 Threads           10.546                         94,946.364
Xeon® Phi, 244 Threads           12.696                         78,999.44

Compared with the vectorized versions, multithreading makes the Dual Processor about 4 times faster and the Phi about 71 times faster. The Phi is now 5 times faster than the Dual Processor.

Further Optimizations

1. Padded arrays
2. Streaming stores
3. Huge memory pages

Optimization 1: Padded Arrays
Optimizing Cache Access

• We can add extra, unused data to the end of each row.
• Doing so aligns heavily used memory addresses for efficient cache line access.
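Working through the numbers (the slides show only the formula, not the intermediate values): one row of 5,900 real (double, 8-byte) values is 47,200 bytes, or 737.5 cache lines of 64 bytes, so consecutive rows start at different offsets within a cache line. Padding each row to

((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real)) = 738 * 8 = 5,904 elements = 47,232 bytes

makes every row exactly 738 cache lines long, so each row begins on a cache-line boundary.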

Optimization 1: Padded Arrays

static const size_t kPaddingSize = 64;

int main(int argc, const char **argv) {
  int height = 10000;
  int width = 5900;
  int count = 1000;

  size_t size = sizeof(real) * width * height;
  real *fin = (real *)malloc(size);
  real *fout = (real *)malloc(size);

  weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
  stencil_9pt(fin, fout, width, height, weight, count);

  // ...save results

  free(fin);
  free(fout);
  return 0;
}

The padded version replaces the row length, the allocation, and the matching free calls (reconstructed here from the slide call-outs):

// kPaddingSize becomes the padded row length in elements (5,904 for a width of 5,900):
static const size_t kPaddingSize =
    ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));

// The allocation is sized by the padded row length and aligned with _mm_malloc:
size_t size = sizeof(real) * kPaddingSize * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);
real *fout = (real *)_mm_malloc(size, kPaddingSize);

// Aligned allocations are released with _mm_free:
_mm_free(fin);
_mm_free(fout);

Optimization 1: Padded Arrays
Accommodating for Padding

#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc.
  int center = 1 + y * kPaddingSize + 1;
  int north = center - kPaddingSize;
  int south = center + kPaddingSize;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}

Optimization 1: Padded Arrays
Results

Processor                 Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Phi, 61 Threads     11.644                         86,138.371
Xeon® Phi, 122 Threads    8.973                          111,774.803
Xeon® Phi, 183 Threads    10.326                         97,132.546
Xeon® Phi, 244 Threads    11.469                         87,452.707

Optimization 2: Streaming Stores
Read-less Writes

• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel in our program, we do not use the original value of that pixel. Therefore, enabling streaming stores should result in better performance.
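To put rough numbers on that (an estimate, not from the slides): each pass reads all of fin and writes all of fout, roughly 450 MB each. If every store first reads the destination cache line, a pass moves about 3 x 450 MB ≈ 1.35 GB of memory traffic; with streaming stores it moves about 2 x 450 MB ≈ 0.9 GB, roughly a one-third reduction.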

Optimization 2: Streaming Stores
Read-less Writes with Vector Nontemporal

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}

Optimization 2: Streaming Stores
Results

Processor                 Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Phi, 61 Threads     13.588                         73,978.915
Xeon® Phi, 122 Threads    8.491                          111,774.803
Xeon® Phi, 183 Threads    8.663                          115,773.405
Xeon® Phi, 244 Threads    9.507                          105,498.781

Optimization 3: Huge Memory Pages

• Memory pages map virtual memory used by our program to physical memory.
• Mappings are stored in a translation look-aside buffer (TLB).
• Mappings are traversed in a "page table walk".
• malloc and _mm_malloc use 4KB memory pages by default.
• By increasing the size of each memory page, traversal time may be reduced.
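As a rough illustration of why this helps (the page counts below are added arithmetic, assuming the usual 2 MB huge pages on Linux): each ~450 MB image buffer spans roughly 115,000 pages at 4 KB per page, but only about 225 pages at 2 MB per page, so far more of the working set can be covered by the TLB at once.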

size_t size = sizeof(real) * kPaddingSize * height;

// Before: cache-line-aligned allocation, backed by default 4KB pages
real *fin = (real *)_mm_malloc(size, kPaddingSize);

// After: anonymous mapping backed by huge pages (requires <sys/mman.h>)
real *fin = (real *)mmap(0, size, PROT_READ|PROT_WRITE,
                         MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1, 0);

Optimization 3: Huge Memory Pages
Results

Processor                 Elapsed Wall Time (seconds)    MegaFLOPS
Xeon® Phi, 61 Threads     14.486                         69,239.365
Xeon® Phi, 122 Threads    8.226                          121,924.389
Xeon® Phi, 183 Threads    8.749                          114,636.799
Xeon® Phi, 244 Threads    9.466                          105,955.358

Takeaways

• The key to achieving high performance is to use loop vectorization and multiple threads.
• Completely serial programs run faster on standard processors.
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor.
• Other optimizations may be used to tweak performance:
  • Data padding
  • Streaming stores
  • Huge memory pages

Sources and Additional Resources

• Today’s slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm
