Intel® Xeon® Phi Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm
Brian Gesiak
April 16th, 2014
Research Student, The University of Tokyo
@modocache
Today
• Image blurring with a 9-point stencil algorithm
• Comparing performance
  • Intel® Xeon® Dual Processor
  • Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
  • Worst: Completely serial
  • Better: Adding loop vectorization
  • Best: Supporting multiple threads
• Further optimizations
  • Padding arrays for improved cache performance
  • Read-less writes, i.e. streaming stores
  • Using huge memory pages
Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

The stencil's three weights are captured in a small struct: one weight for the center pixel (weight.center), one for the four edge-adjacent neighbors (weight.next), and one for the four diagonal neighbors (weight.diagonal).

typedef double real;

typedef struct {
    real center;
    real next;
    real diagonal;
} weight_t;
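As a concrete illustration (a sketch, not from the slides), here is how the nine weights combine for one interior pixel, assuming a row-major buffer with stride width:

real blur_pixel(const real *fin, int x, int y, int width, weight_t weight) {
    int center = y * width + x;
    int north  = center - width;
    int south  = center + width;
    // Weighted average of the pixel, its 4 edge-adjacent neighbors,
    // and its 4 diagonal neighbors.
    return weight.center   *  fin[center]
         + weight.next     * (fin[center - 1] + fin[center + 1]
                            + fin[north] + fin[south])
         + weight.diagonal * (fin[north - 1] + fin[north + 1]
                            + fin[south - 1] + fin[south + 1]);
}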
Image Blurring
Applying a 9-Point Stencil to a Bitmap

(The slides show the stencil applied repeatedly to a bitmap, producing a visible halo effect around the blurred pixels.)
Sample Application
• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

                            Intel® Xeon® Dual Processor    Intel® Xeon® Phi Coprocessor
Processor Clock Frequency   2.6 GHz                        1.091 GHz
Number of Cores             16 (8 x 2 CPUs)                61
Memory Size/Type            63 GB / DDR3                   8 GB / GDDR5
Peak DP/SP FLOPs            345.6 / 691.2 GigaFLOP/s       1.065 / 2.130 TeraFLOP/s
Peak Memory Bandwidth       85.3 GB/s                      352 GB/s
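As a sanity check on the coprocessor's peak figure: 61 cores × 1.091 GHz × 16 double-precision FLOPs per cycle per core (8-wide vector units with fused multiply-add) ≈ 1.065 TeraFLOP/s.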
1st Comparison: Serial Execution

void stencil_9pt(real *fin, real *fout,
                 int width, int height,
                 weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center;
        ++north;
        ++northeast;
      }
    }

    // Swap buffers for next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}
The compiler reports an assumed vector dependency on the inner loop: because fin and fout might alias, it refuses to vectorize it.
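The "...calculate center, east, northwest, etc." comment elides the index setup. A plausible reconstruction (mirroring the padded version shown later, but with a row stride of width):

int center = 1 + y * width + 1;
int north = center - width;
int south = center + width;
int east = center + 1;
int west = center - 1;
int northwest = north - 1;
int northeast = north + 1;
int southwest = south - 1;
int southeast = south + 1;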
1st Comparison: Serial Execution
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                  MegaFLOPS
Intel® Xeon® Dual Processor     244.178 seconds (4 minutes)        4,107.658
Intel® Xeon® Phi Coprocessor    2,838.342 seconds (47.3 minutes)   353.375

The Dual Processor is 11 times faster than the Phi.
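The -mmic build produces a binary that runs natively on the coprocessor. One common way to launch it (an assumption; the slides do not show this step) is to copy it to the card, reachable here as mic0, and run it over ssh:

$ scp stencil_phi mic0:/tmp
$ ssh mic0 /tmp/stencil_phi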
2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}

ivdep: tells the compiler to ignore assumed dependencies
• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.

Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
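A minimal sketch of the pattern ivdep addresses (hypothetical function, assuming icc): the compiler cannot prove dst and src never overlap, so without the pragma it will not vectorize the loop.

void scale(real *dst, const real *src, int n) {
    #pragma ivdep  // promise: dst and src do not alias
    for (int i = 0; i < n; ++i) {
        dst[i] = 0.5 * src[i];
    }
}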
2nd Comparison: Vectorization
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                       Elapsed Wall Time                 MegaFLOPS
Intel® Xeon® Dual Processor     186.585 seconds (3.1 minutes)     5,375.572
Intel® Xeon® Phi Coprocessor    623.302 seconds (10.3 minutes)    1,609.171

Vectorization makes the Dual Processor 1.3 times faster than its serial run, and the Phi 4.5 times faster. The Dual Processor is now only about 3.3 times faster than the Phi.
3rd Comparison: Multithreading
Work Division Using Parallel For Loops

for (int i = 0; i < count; ++i) {
  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...swap buffers for next iteration
}
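The thread counts swept in the results below can be pinned programmatically (a minimal sketch; the slides do not show how the counts were set):

#include <omp.h>

void configure_threads(void) {
    // Call before the first parallel region.
    // 122 = 2 threads per core on a 61-core coprocessor.
    omp_set_num_threads(122);
}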
3rd Comparison: Multithreading
Results

Processor                       Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Dual Proc., 16 Threads    43.862                        22,867.185
Xeon® Dual Proc., 32 Threads    46.247                        21,688.103
Xeon® Phi, 61 Threads           11.366                        88,246.452
Xeon® Phi, 122 Threads          8.772                         114,338.399
Xeon® Phi, 183 Threads          10.546                        94,946.364
Xeon® Phi, 244 Threads          12.696                        78,999.44

Multithreading makes the Dual Processor about 4 times faster than its vectorized run, and the Phi about 71 times faster. At its best thread count, the Phi is now 5 times faster than the Dual Processor.
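When launching the native binary, the same sweep can be driven from the shell; KMP_AFFINITY=balanced is an Intel OpenMP thread-placement policy intended for the coprocessor (an assumption; the slides do not show the launch commands):

$ OMP_NUM_THREADS=122 KMP_AFFINITY=balanced ./stencil_phi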
Further Optimizations
1. Padded arrays
2. Streaming stores
3. Huge memory pages
Optimization 1: Padded Arrays
Optimizing Cache Access
• We can add extra, unused data to the end of each row
• Doing so aligns heavily used memory addresses for efficient cache line access (see the arithmetic check below)
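The padded row width used in the next slide can be verified with a few lines of arithmetic (a standalone check, not from the slides): 5,900 doubles occupy 47,200 bytes, or 737.5 cache lines, so each row rounds up to 738 lines = 5,904 doubles.

#include <stdio.h>

int main(void) {
    size_t padded = ((5900 * sizeof(double) + 63) / 64) * (64 / sizeof(double));
    printf("%zu\n", padded);  // prints 5904
    return 0;
}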
Optimization 1: Padded Arrays

// Pad each row to a whole number of 64-byte cache lines:
// a 5,900-double row rounds up to 5,904 elements (738 lines).
static const size_t kPaddingSize =
    ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));

int main(int argc, const char **argv) {
  int height = 10000;
  int width = 5900;
  int count = 1000;

  // Allocate using the padded row stride, 64-byte aligned.
  size_t size = sizeof(real) * kPaddingSize * height;
  real *fin = (real *)_mm_malloc(size, 64);
  real *fout = (real *)_mm_malloc(size, 64);

  weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
  stencil_9pt(fin, fout, width, height, weight, count);

  // ...save results

  _mm_free(fin);
  _mm_free(fout);
  return 0;
}
Optimization 1: Padded Arrays
Accommodating for Padding

#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc.
  int center = 1 + y * kPaddingSize + 1;
  int north = center - kPaddingSize;
  int south = center + kPaddingSize;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}
Optimization 1: Padded Arrays
Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     11.644                        86,138.371
Xeon® Phi, 122 Threads    8.973                         111,774.803
Xeon® Phi, 183 Threads    10.326                        97,132.546
Xeon® Phi, 244 Threads    11.469                        87,452.707
Optimization 2: Streaming Stores
Read-less Writes
• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel, we never use the original value of that pixel, so enabling streaming stores should result in better performance.
Optimization 2: Streaming Stores
Read-less Writes with Vector Nontemporal

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center;
      ++north;
      ++northeast;
    }
  }
  // ...
}
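Alternatively (an assumption based on icc's documented options; the slides only show the per-loop pragma), the compiler can be asked to generate streaming stores globally at compile time:

$ icc -openmp -mmic -O3 -opt-streaming-stores always stencil.c -o stencil_phi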
Optimization 2: Streaming Stores
Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     13.588                        73,978.915
Xeon® Phi, 122 Threads    8.491                         111,774.803
Xeon® Phi, 183 Threads    8.663                         115,773.405
Xeon® Phi, 244 Threads    9.507                         105,498.781
Optimization 3: Huge Memory Pages
• Memory pages map virtual memory used by our program to physical memory
• Mappings are stored in a translation look-aside buffer (TLB)
• Mappings are traversed in a "page table walk"
• malloc and _mm_malloc use 4KB memory pages by default
• By increasing the size of each memory page, traversal time may be reduced
Optimization 3: Huge Memory Pages

// Replace the _mm_malloc allocation with an anonymous mmap
// backed by huge pages:
size_t size = sizeof(real) * kPaddingSize * height;
real *fin = (real *)mmap(0, size,
                         PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE | MAP_HUGETLB,
                         -1, 0);
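MAP_HUGETLB only succeeds if the kernel has huge pages reserved. A quick way to inspect and reserve them on Linux (a general Linux detail, not from the slides):

$ grep Huge /proc/meminfo
$ echo 128 > /proc/sys/vm/nr_hugepages   # reserve 128 huge pages (as root)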
Optimization 3: Huge Memory Pages
Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     14.486                        69,239.365
Xeon® Phi, 122 Threads    8.226                         121,924.389
Xeon® Phi, 183 Threads    8.749                         114,636.799
Xeon® Phi, 244 Threads    9.466                         105,955.358
Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads
• Completely serial programs run faster on standard processors
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
• Other optimizations may be used to tweak performance:
  • Data padding
  • Streaming stores
  • Huge memory pages
Sources and Additional Resources
• Today's slides: http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders): http://www.amazon.com/dp/0124104142
• Intel documentation:
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm