Intel® Xeon® Phi Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm
Brian Gesiak
April 16th, 2014
Research Student, The University of Tokyo
@modocache
Today
• Image blurring with a 9-point stencil algorithm
• Comparing performance
  • Intel® Xeon® Dual Processor
  • Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
  • Worst: Completely serial
  • Better: Adding loop vectorization
  • Best: Supporting multiple threads
• Further optimizations
  • Padding arrays for improved cache performance
  • Read-less writes, i.e. streaming stores
  • Using huge memory pages
Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

The stencil's three weights are captured in a small struct: one weight for the center pixel (weight.center), one for the four edge-adjacent neighbors (weight.next), and one for the four diagonal neighbors (weight.diagonal).

typedef double real;

typedef struct {
    real center;
    real next;
    real diagonal;
} weight_t;
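As a concrete illustration (a sketch, not from the slides), here is how the nine weights combine for one interior pixel, assuming a row-major buffer with stride width:

real blur_pixel(const real *fin, int x, int y, int width, weight_t weight) {
    int center = y * width + x;
    int north  = center - width;
    int south  = center + width;
    // Weighted average of the pixel, its 4 edge-adjacent neighbors,
    // and its 4 diagonal neighbors.
    return weight.center   *  fin[center]
         + weight.next     * (fin[center - 1] + fin[center + 1]
                            + fin[north] + fin[south])
         + weight.diagonal * (fin[north - 1] + fin[north + 1]
                            + fin[south - 1] + fin[south + 1]);
}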
Image Blurring
Applying a 9-Point Stencil to a Bitmap

(The slides show the stencil applied repeatedly to a bitmap, producing a visible halo effect around the blurred pixels.)
Sample Application
• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

                            Intel® Xeon® Dual Processor    Intel® Xeon® Phi Coprocessor
Processor Clock Frequency   2.6 GHz                        1.091 GHz
Number of Cores             16 (8 x 2 CPUs)                61
Memory Size/Type            63 GB / DDR3                   8 GB / GDDR5
Peak DP/SP FLOPs            345.6 / 691.2 GigaFLOP/s       1.065 / 2.130 TeraFLOP/s
Peak Memory Bandwidth       85.3 GB/s                      352 GB/s
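As a sanity check on the coprocessor's peak figure: 61 cores × 1.091 GHz × 16 double-precision FLOPs per cycle per core (8-wide vector units with fused multiply-add) ≈ 1.065 TeraFLOP/s.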
1st Comparison: Serial Execution

void stencil_9pt(real *fin, real *fout,
                 int width, int height,
                 weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center;
        ++north;
        ++northeast;
      }
    }

    // Swap buffers for next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}
The compiler reports an assumed vector dependency on the inner loop: because fin and fout might alias, it refuses to vectorize it.
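The "...calculate center, east, northwest, etc." comment elides the index setup. A plausible reconstruction (mirroring the padded version shown later, but with a row stride of width):

int center = 1 + y * width + 1;
int north = center - width;
int south = center + width;
int east = center + 1;
int west = center - 1;
int northwest = north - 1;
int northeast = north + 1;
int southwest = south - 1;
int southeast = south + 1;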
1st Comparison: Serial Execution
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                  MegaFLOPS
Intel® Xeon® Dual Processor     244.178 seconds (4 minutes)        4,107.658
Intel® Xeon® Phi Coprocessor    2,838.342 seconds (47.3 minutes)   353.375

The Dual Processor is 11 times faster than the Phi.
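The -mmic build produces a binary that runs natively on the coprocessor. One common way to launch it (an assumption; the slides do not show this step) is to copy it to the card, reachable here as mic0, and run it over ssh:

$ scp stencil_phi mic0:/tmp
$ ssh mic0 /tmp/stencil_phi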
2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}

ivdep: tells the compiler to ignore assumed dependencies
• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.

Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
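A minimal sketch of the pattern ivdep addresses (hypothetical function, assuming icc): the compiler cannot prove dst and src never overlap, so without the pragma it will not vectorize the loop.

void scale(real *dst, const real *src, int n) {
    #pragma ivdep  // promise: dst and src do not alias
    for (int i = 0; i < n; ++i) {
        dst[i] = 0.5 * src[i];
    }
}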
2nd Comparison: Vectorization
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                 MegaFLOPS
Intel® Xeon® Dual Processor     186.585 seconds (3.1 minutes)     5,375.572
Intel® Xeon® Phi Coprocessor    623.302 seconds (10.3 minutes)    1,609.171

Vectorization makes the Dual Processor 1.3 times faster than its serial run, and the Phi 4.5 times faster. The Dual Processor is now only about 3.3 times faster than the Phi.
3rd Comparison: Multithreading
Work Division Using Parallel For Loops

for (int i = 0; i < count; ++i) {
  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...swap buffers for next iteration
}
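The thread counts swept in the results below can be pinned programmatically (a minimal sketch; the slides do not show how the counts were set):

#include <omp.h>

void configure_threads(void) {
    // Call before the first parallel region.
    // 122 = 2 threads per core on a 61-core coprocessor.
    omp_set_num_threads(122);
}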
3rd Comparison: Multithreading
Results

Processor                       Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Dual Proc., 16 Threads    43.862                        22,867.185
Xeon® Dual Proc., 32 Threads    46.247                        21,688.103
Xeon® Phi, 61 Threads           11.366                        88,246.452
Xeon® Phi, 122 Threads          8.772                         114,338.399
Xeon® Phi, 183 Threads          10.546                        94,946.364
Xeon® Phi, 244 Threads          12.696                        78,999.44

Multithreading makes the Dual Processor about 4 times faster than its vectorized run, and the Phi about 71 times faster. At its best thread count, the Phi is now 5 times faster than the Dual Processor.
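When launching the native binary, the same sweep can be driven from the shell; KMP_AFFINITY=balanced is an Intel OpenMP thread-placement policy intended for the coprocessor (an assumption; the slides do not show the launch commands):

$ OMP_NUM_THREADS=122 KMP_AFFINITY=balanced ./stencil_phi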
Further Optimizations
1. Padded arrays
2. Streaming stores
3. Huge memory pages
Optimization 1: Padded Arrays
Optimizing Cache Access
• We can add extra, unused data to the end of each row
• Doing so aligns heavily used memory addresses for efficient cache line access (see the arithmetic check below)
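The padded row width used in the next slide can be verified with a few lines of arithmetic (a standalone check, not from the slides): 5,900 doubles occupy 47,200 bytes, or 737.5 cache lines, so each row rounds up to 738 lines = 5,904 doubles.

#include <stdio.h>

int main(void) {
    size_t padded = ((5900 * sizeof(double) + 63) / 64) * (64 / sizeof(double));
    printf("%zu\n", padded);  // prints 5904
    return 0;
}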
Optimization 1: Padded Arrays

// Pad each row to a whole number of 64-byte cache lines:
// a 5,900-double row rounds up to 5,904 elements (738 lines).
static const size_t kPaddingSize =
    ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));

int main(int argc, const char **argv) {
  int height = 10000;
  int width = 5900;
  int count = 1000;

  // Allocate using the padded row stride, 64-byte aligned.
  size_t size = sizeof(real) * kPaddingSize * height;
  real *fin = (real *)_mm_malloc(size, 64);
  real *fout = (real *)_mm_malloc(size, 64);

  weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
  stencil_9pt(fin, fout, width, height, weight, count);

  // ...save results

  _mm_free(fin);
  _mm_free(fout);
  return 0;
}
Optimization 1: Padded Arrays
Accommodating for Padding

#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc.
  int center = 1 + y * kPaddingSize + 1;
  int north = center - kPaddingSize;
  int south = center + kPaddingSize;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}
Optimization 1: Padded Arrays
Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     11.644                        86,138.371
Xeon® Phi, 122 Threads    8.973                         111,774.803
Xeon® Phi, 183 Threads    10.326                        97,132.546
Xeon® Phi, 244 Threads    11.469                        87,452.707
Optimization 2: Streaming Stores
Read-less Writes
• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel, we never use the original value of that pixel, so enabling streaming stores should result in better performance.
Optimization 2: Streaming Stores
Read-less Writes with Vector Nontemporal

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}
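Alternatively (an assumption based on icc's documented options; the slides only show the per-loop pragma), the compiler can be asked to generate streaming stores globally at compile time:

$ icc -openmp -mmic -O3 -opt-streaming-stores always stencil.c -o stencil_phi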
Optimization 2: Streaming Stores
Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     13.588                        73,978.915
Xeon® Phi, 122 Threads    8.491                         111,774.803
Xeon® Phi, 183 Threads    8.663                         115,773.405
Xeon® Phi, 244 Threads    9.507                         105,498.781
Optimization 3: Huge Memory Pages
• Memory pages map virtual memory used by our program to physical memory
• Mappings are stored in a translation look-aside buffer (TLB)
• Mappings are traversed in a "page table walk"
• malloc and _mm_malloc use 4KB memory pages by default
• By increasing the size of each memory page, traversal time may be reduced
Optimization 3: Huge Memory Pages

// Replace the _mm_malloc allocation with an anonymous mmap
// backed by huge pages:
size_t size = sizeof(real) * kPaddingSize * height;
real *fin = (real *)mmap(0, size,
                         PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE | MAP_HUGETLB,
                         -1, 0);
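MAP_HUGETLB only succeeds if the kernel has huge pages reserved. A quick way to inspect and reserve them on Linux (a general Linux detail, not from the slides):

$ grep Huge /proc/meminfo
$ echo 128 > /proc/sys/vm/nr_hugepages   # reserve 128 huge pages (as root)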
Optimization 3: Huge Memory Pages
Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     14.486                        69,239.365
Xeon® Phi, 122 Threads    8.226                         121,924.389
Xeon® Phi, 183 Threads    8.749                         114,636.799
Xeon® Phi, 244 Threads    9.466                         105,955.358
Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads
• Completely serial programs run faster on standard processors
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
• Other optimizations may be used to tweak performance:
  • Data padding
  • Streaming stores
  • Huge memory pages
Sources and Additional Resources
• Today's slides: http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders): http://www.amazon.com/dp/0124104142
• Intel documentation:
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm