clash of the titans - streamhpc gaburov... · clash of the titans..a personal view.. thursday, june...
TRANSCRIPT
Clash of the Titans
..a personal view..
Xeon Phi vs. K20
Evghenii Gaburov
Thursday, June 20, 13
More than 1 TFLOP/s on a desktop
K20X vs. Xeon Phi
K20X
15 SMX @ 0.73 GHz
192 fp32 cores, 64 fp64 cores, 32 SFUs, 32 LD/ST units
64 KB L1$ + shared
1.5 MB L2$
240 GB/s
32 SIMD width
255 reg/thread
2048 threads
in-order execution
hardware thread scheduling
1.4 TFLOP/s fp64
image: GK110 whitepaper
Xeon Phi (KNC)
61 Pentium cores @ 1.1 GHz
16 fp32 SIMD, 8 fp64 SIMD
512-bit SIMD register
32 SIMD reg/thread
32 KB L1$
512 KB L2$ shared
30.5 MB L2$
352 GB/s
4 threads
in-order execution
software thread scheduling
1.1 TFLOP/s fp64
image: Intel Xeon Phi programming overview
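The peak fp64 figures on the two hardware slides can be roughly reproduced from the listed unit counts and clocks. The sketch below assumes one fused multiply-add per fp64 lane per cycle, counted as 2 FLOPs; that factor is an assumption, not something the slides state, and the function name is illustrative only.

```cpp
#include <cassert>

// Peak fp64 rate in GFLOP/s:
//   units * fp64 lanes per unit * FLOPs per lane per cycle * clock in GHz.
// flops_per_lane_cycle = 2 assumes one fused multiply-add per lane per cycle
// (an assumption; the slides only quote the resulting TFLOP/s numbers).
double peak_gflops(int units, int fp64_lanes, double ghz,
                   int flops_per_lane_cycle = 2) {
    assert(units > 0 && fp64_lanes > 0 && ghz > 0.0);
    return units * fp64_lanes * flops_per_lane_cycle * ghz;
}

// Inputs taken from the slides.
const double k20x_peak = peak_gflops(15, 64, 0.73);  // 15 SMX, 64 fp64 cores each
const double knc_peak  = peak_gflops(61, 8, 1.1);    // 61 cores, 8-wide fp64 SIMD
```

With the slide values this gives about 1.40 TFLOP/s for the K20X and about 1.07 TFLOP/s for the Xeon Phi, in line with the quoted 1.4 and 1.1 TFLOP/s.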
effective # compute units:
K20: 15 SMX x 64 CUDA cores = 960
Xeon Phi: 61 core x 2 threads x 8 double = 976
Xeon E5: 8 core x 1 thread x 4 double = 32
Xeon Phi is much more parallel than Xeon E5!
above all: *algorithm* MUST scale!
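The counts above are plain products, and the ratio between them is what drives the scaling requirement. A small sketch of that arithmetic (the function name is illustrative, not from the talk):

```cpp
#include <cassert>

// "Effective compute units" as counted on the slide:
// cores (or SMX) * threads issuing per core * fp64 lanes per thread.
int effective_units(int cores, int threads_per_core, int fp64_lanes) {
    assert(cores > 0 && threads_per_core > 0 && fp64_lanes > 0);
    return cores * threads_per_core * fp64_lanes;
}
```

For example, `effective_units(61, 2, 8)` is 976 and `effective_units(8, 1, 4)` is 32, so by this count the Xeon Phi needs roughly 30x more exposed parallelism than the Xeon E5.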
image: Intel Xeon Phi programming overview
~3x for the same number of threads
~3x to get the same performance
~3x if app doesn’t scale ... or worse
.. don’t forget Amdahl’s law
P=0.99, N=1: S=1, ε=100%
P=0.99, N=32: S=24, ε=75%
P=0.99, N=960: S=91, ε=9.4%
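The numbers on this slide follow from the standard form of Amdahl's law, S = 1 / ((1 - P) + P / N), with efficiency S / N. A minimal sketch (function names are illustrative):

```cpp
#include <cassert>

// Amdahl's law: speedup on n units when a fraction p of the work is parallel.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Parallel efficiency: achieved speedup divided by the number of units.
double amdahl_efficiency(double p, int n) {
    return amdahl_speedup(p, n) / n;
}
```

For P=0.99 this gives S of about 24.4 at N=32 and about 90.7 at N=960, where the efficiency drops to about 9.4%, matching the rounded slide values.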
Xeon Phi: immature compiler; Intel only, not cheap ($$$); native/offload
K20: mature compiler; many vendors (CUDA LLVM); offload only
Xeon Phi: MPI, OpenMP, POSIX threads, Cilk++, OpenCL, etc..
K20: CUDA C/Fortran, OpenCL, OpenACC, R, Python, Matlab ...
Xeon Phi: MPI
K20: Not possible
Xeon Phi: MPI, MPI+OpenMP, MPI+OpenCL, MPI+....
K20: Not possible
Xeon Phi: software scheduling, thread affinity is important
K20: hardware scheduling, no worries about threads
    #pragma omp parallel for
    for (int j = 0; j < M; j++)
    {
      .. some code
      for (int i = 0; i < N; i++)
      {
        some code
      }
    }
say M=64, N=1024 ..
XeonE5: OMP_NUM_THREADS = 8
XeonPhi: OMP_NUM_THREADS = 240
K20X: use CUDA, it works!
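One plausible reading of this slide is that parallelizing only the outer loop exposes just M work items, which is enough for 8 threads but not for 240. The sketch below captures that arithmetic, assuming each outer iteration is an indivisible work item handed to a single thread; the function name is illustrative.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of threads that receive at least one iteration when only the
// outer loop (m iterations) is distributed over num_threads threads.
double busy_fraction(int m, int num_threads) {
    assert(m >= 0 && num_threads > 0);
    return static_cast<double>(std::min(m, num_threads)) / num_threads;
}
```

With M=64, all 8 threads on the Xeon E5 get work, while on the Xeon Phi only 64 of 240 threads (about 27%) do.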
M=64, N=1024
max # parallel units: 64x1024 = 64K
much larger than #FPUs
minimize surface-to-volume ratio
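The slides do not define the ratio. The sketch below assumes the 2D analogue: "surface" is the perimeter of a rectangular tile in grid points and "volume" is the number of points it contains. The function name is illustrative.

```cpp
#include <cassert>

// Perimeter-to-area ratio of a tile of rows x cols grid points
// (assumed 2D stand-in for "surface-to-volume").
double surface_to_volume(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    return 2.0 * (rows + cols) / (static_cast<double>(rows) * cols);
}
```

For M=64, N=1024 split into 64 tiles of 1024 points each, a 1x1024 row strip has a ratio of about 2.0, while a 32x32 tile has 0.125, so squarer tiles have far less boundary per unit of work.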
    {
      /* thread-block code */
      bid = blockIdx.x;
      nb  = gridDim.x;
      bx  = bid % nbx;
      by  = bid / nbx;
      nby = nb / nbx;
      compute ib & ie for bx
      compute jb & je for by
      for (int j = jb; j < je; j++)
      {
        .. some thread code
        for (int i = ib; i < ie; i += blockDim.x)
        {
          some thread code
        }
      }
    }
    #pragma omp parallel
    {
      /* thread-block code */
      bid = omp_get_thread_num();
      nb  = omp_get_num_threads();
      bx  = bid % nbx;
      by  = bid / nbx;
      nby = nb / nbx;
      compute ib & ie for bx
      compute jb & je for by
      for (int j = jb; j < je; j++)
      {
        .. some thread code
        for (int i = ib; i < ie; i++)
        {
          some thread code
        }
      }
    }
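The slides leave "compute ib & ie" and "compute jb & je" unspecified. One possible way to fill them in is an even split of each dimension, sketched below; the even split and the names `Tile`, `chunk_begin`, `tile_for_block` and `covered_points` are assumptions made for illustration, not the author's code.

```cpp
#include <cassert>

struct Tile { int ib, ie, jb, je; };  // half-open ranges [ib, ie) x [jb, je)

// Split n items into `parts` nearly equal chunks and return the start of chunk k.
// Chunk k covers [chunk_begin(n, parts, k), chunk_begin(n, parts, k + 1)).
int chunk_begin(int n, int parts, int k) {
    return static_cast<int>(static_cast<long long>(n) * k / parts);
}

// Map a linear block id (blockIdx.x or omp_get_thread_num()) onto an
// nbx-by-nby grid of tiles covering N columns (index i) and M rows (index j).
Tile tile_for_block(int bid, int nb, int nbx, int M, int N) {
    assert(nbx > 0 && nb % nbx == 0 && bid >= 0 && bid < nb);
    const int nby = nb / nbx;
    const int bx = bid % nbx;
    const int by = bid / nbx;
    return Tile{chunk_begin(N, nbx, bx), chunk_begin(N, nbx, bx + 1),
                chunk_begin(M, nby, by), chunk_begin(M, nby, by + 1)};
}

// Total number of points covered by all tiles; equals M * N for a valid split.
long long covered_points(int nb, int nbx, int M, int N) {
    long long total = 0;
    for (int bid = 0; bid < nb; ++bid) {
        const Tile t = tile_for_block(bid, nb, nbx, M, N);
        total += static_cast<long long>(t.ie - t.ib) * (t.je - t.jb);
    }
    return total;
}
```

With 240 threads arranged as nbx=16 by nby=15 over M=64, N=1024, each tile is 64 columns wide and 4 or 5 rows tall, and the tiles together cover all 64K points.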
CUDA programming model maps well to Xeon Phi
omp_get_thread_num() <-> blockIdx
omp_get_num_threads() <-> gridDim
omp_get_ .. what? .. <-> threadIdx, blockDim
#pragma omp simd ... not that simple
This is where I find the biggest limitation ... the TOOLS!
get_simd_lane_index() <-> threadIdx
get_simd_size() <-> blockDim
it doesn’t exist, but very important!
    #pragma omp parallel
    {
      /* “thread-block” */
      for (int j = jb; j < je; j++)
      {
        .. some code
        for (int i = ib; i < ie; i++)
        {
          simple code
        }
      }
    }
auto-vectorization usually works well for simple cases
    #pragma omp parallel
    {
      /* “thread-block” */
      for (int j = jb; j < je; j++)
      {
        .. some code
        #pragma omp simd
        for (int i = ib; i < ie; i++)
        {
          complex code
        }
      }
    }
use #pragma omp simd ..
we’re still at the mercy of the compiler...
please compiler have mercy!
do not fight the compiler
    #pragma omp parallel
    {
      /* “thread-block” */
      for (int j = jb; j < je; j++)
      {
        .. some code
        #pragma simd
        {
          simdsize = get_simd_size();
          simdlane = get_simd_lane_index();
          for (int i = ib; i < ie; i += simdsize)
          {
            complex code executed by each lane
          }
        }
      }
    }
manual vectorization, as in CUDA..
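Since `get_simd_lane_index()` and `get_simd_size()` do not exist, the lane-strided pattern can only be emulated. The sketch below replaces the hypothetical intrinsics with an explicit loop over lanes in plain C++, folding the lane index into the loop start the way CUDA code uses `threadIdx.x`. It is one possible reading of the slide, it does not vectorize anything by itself, and the function name is illustrative.

```cpp
#include <cassert>
#include <vector>

// Emulates the hypothetical get_simd_lane_index()/get_simd_size() pair with
// an explicit loop over lanes: lane l handles i = ib + l, ib + l + simd_size, ...
// Returns how many times each index in [ib, ie) was visited.
std::vector<int> lane_strided_visits(int ib, int ie, int simd_size) {
    assert(ib <= ie && simd_size > 0);
    std::vector<int> visits(ie - ib, 0);
    for (int lane = 0; lane < simd_size; ++lane) {         // stands in for get_simd_lane_index()
        for (int i = ib + lane; i < ie; i += simd_size) {  // stride stands in for get_simd_size()
            ++visits[i - ib];                              // "complex code executed by each lane"
        }
    }
    return visits;
}
```

Every index in the range is visited exactly once, regardless of whether the range length is a multiple of the lane count, which is the property a real lane-level API would need to preserve.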
“The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code.”
MySerialCode.cpp -> ./parallel_a.out
Both K20 & Xeon Phi can deliver excellent performance
  if algorithm scales and is vectorized .. either auto- or manually
It is not easy to get excellent performance
  easier to get performance for bandwidth-bound applications on Xeon Phi than K20
  have to worry about thread scheduling & vectorization on Xeon Phi
Xeon Phi can run any code natively (big plus)
  this can solve PCIe bottleneck on legacy apps w/o extensive app-modification
  minimizes start-up efforts, doesn’t require memory management, code rewrites, etc..
CUDA programming model maps well to Xeon Phi
  however, there is lack of tools to take advantage of this... (Intel SPMD compiler may help)
http://arxiv.org/abs/1302.1078
http://icl.cs.utk.edu/magma/