MANY-CORE COMPUTING (Vrije Universiteit Amsterdam, ~bal/college14/class2-2k14.pdf, 2014-10-06)
TRANSCRIPT
MANY-CORE COMPUTING
Ana Lucia Varbanescu, UvA, 6-Oct-2014
Original slides: Rob van Nieuwpoort, eScience Center
Schedule
1. Introduction and Programming Basics (2-10-2014)
2. Performance analysis (6-10-2014)
3. Advanced CUDA programming (9-10-2014)
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort (??)
GPUs @ AMD
Radeon R9; top of the line: R9 295X2.
For comparison, the R9 290X: performance 5.6 TFLOPS, memory 4 GB, bandwidth 320 GB/s.
NVIDIA GTX 980 (Maxwell): performance 5.0 TFLOPS, memory 4 GB, lower bandwidth: 224 GB/s.
NVIDIA GTX Titan Black (Kepler): performance 5.3 TFLOPS, memory 6 GB, higher bandwidth: 336 GB/s.
NVIDIA GTX Titan Z vs. R9 295X2: fairly similar numbers, higher DP performance.
Today
Revisit the VectorAdd: for GPUs, and for many-core CPUs.
Hardware revisited.
Performance analysis: hardware performance and application performance.
VectorAdd revisited
Vector add: sequential
void vector_add(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i++) {
c[i] = a[i] + b[i];
}
}
Vector add: GPU code (skeleton)

// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(int N, float* A, float* B, float* C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    // initialization code here ...
    int N = 5120;
    // launch N/256 blocks of 256 threads each
    vector_add<<< N/256, 256 >>>(N, deviceA, deviceB, deviceC);
    // cleanup code here ...
}

Device code and host code (should be in the same file).
Multi-core CPU programming
Two levels of parallelism:
- Coarse-grain: threads / processes. Instantiate the threads with Pthreads, Java threads, OpenMP, or MPI.
- Fine-grain: SIMD operations. Vectorize, either by relying on compilers or manually (vector types, intrinsics).
OpenMP
Add directives to sequential code to mark parallel sections.

// function to add two vectors
void vector_add(int n, int* a, int* b, int* c) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// main program
int main() {
    int in1[SIZE], in2[SIZE], res[SIZE];
    vector_add(SIZE, in1, in2, res);
}
OpenMP (for Xeon Phi, too)
Add directives to sequential code to mark parallel sections; the target(mic) attribute and the offload pragma are Xeon Phi-specific.

// Phi function to add two vectors
__attribute__((target(mic)))                         // for Xeon Phi
void vector_add_Phi(int n, int* a, int* b, int* c) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// main program
int main() {
    int in1[SIZE], in2[SIZE], res[SIZE];
    #pragma offload target(mic) in(in1,in2) inout(res)   // for Xeon Phi
    {
        vector_add_Phi(SIZE, in1, in2, res);
    }
}
Cilk (for Xeon Phi, too)
Add keywords to parallelize sequential code by divide-and-conquer.

cilk void VectorAdd(float *a, float *b, float *c, int n) {
    if (n < GrainSize) {
        int i;
        for (i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    }
    else {
        spawn VectorAdd(a, b, c, n/2);
        spawn VectorAdd(a+n/2, b+n/2, c+n/2, n/2);
        sync;
    }
}
Vectorization on x86 architectures

Since | Name                                           | Bits    | SP vector size | DP vector size
1996  | MultiMedia eXtensions (MMX)                    | 64 bit  | integer only   | integer only
1999  | Streaming SIMD Extensions (SSE)                | 128 bit | 4 float        | 2 double
2011  | Advanced Vector Extensions (AVX)               | 256 bit | 8 float        | 4 double
2012  | Intel Xeon Phi accelerator (was Larrabee, MIC) | 512 bit | 16 float       | 8 double
Vectorizing with SSE
Assembly instructions execute on vector registers.
In C or C++, use intrinsics: declare vector variables, name the instruction, and work on variables rather than registers.
Vectorizing with SSE: examples

float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);

// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();     // {0.0, 0.0, 0.0, 0.0}

// Load the first 4 elements of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);    // {0.0, 1.0, 2.0, 3.0}

// Load the second 4 elements of the array into my vector.
__m128 myVector2 = _mm_load_ps(data+4);  // {4.0, 5.0, 6.0, 7.0}
Vectorizing with SSE: examples

// Add vectors 1 and 2; one instruction performs 4 FLOPs.
// {0,1,2,3} + {4,5,6,7} = {4.0, 6.0, 8.0, 10.0}
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);

// Multiply vectors 1 and 2; one instruction performs 4 FLOPs.
// {0,1,2,3} * {4,5,6,7} = {0.0, 5.0, 12.0, 21.0}
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);

// Shuffle: the two lower result elements are selected from myVector1,
// the two upper ones from myVector2, by the indices in _MM_SHUFFLE.
// With _MM_SHUFFLE(2, 3, 0, 1): {1.0, 0.0, 7.0, 6.0}
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
                                  _MM_SHUFFLE(2, 3, 0, 1));
Vector add with SSE: unroll loop

void vectorAdd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        c[i+0] = a[i+0] + b[i+0];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
}
Vector add with SSE: vectorize loop

void vectorAdd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        __m128 vecA = _mm_load_ps(a + i);     // load 4 elts from a
        __m128 vecB = _mm_load_ps(b + i);     // load 4 elts from b
        __m128 vecC = _mm_add_ps(vecA, vecB); // add four elts
        _mm_store_ps(c + i, vecC);            // store four elts
    }
}
Optional assignment
Implement a vectorized version of:
- Element-wise array multiplication, with complex numbers
- Element-wise array division, with complex numbers
Compile with gcc and measure performance with/without vectorization.
Send (pseudo-)code (and performance numbers, if you have them) by email.
Hardware revisited: CPUs and NVIDIA GPUs
Generic multi-core CPU 20
Hardware threads
SIMD units (vector lanes)
L1 and L2
dedicated
caches
Shared L3/L4 cache Main memory, I/O
Peak
performance
Bandwidth
![Page 21: MANY-CORE COMPUTING - Vrije Universiteit Amsterdam › ~bal › college14 › class2-2k14.pdf · 2014-10-06 · MANY-CORE COMPUTING Ana Lucia Varbanescu, UvA 6-Oct-2014 Original slides:](https://reader034.vdocuments.site/reader034/viewer/2022042323/5f0d63397e708231d43a1a6a/html5/thumbnails/21.jpg)
Generic GPU 21
Single or SIMD execution units Hardware scheduler
Local memory/cache Units for executing
functions with high precision
Peak
performance
Bandwidth
![Page 22: MANY-CORE COMPUTING - Vrije Universiteit Amsterdam › ~bal › college14 › class2-2k14.pdf · 2014-10-06 · MANY-CORE COMPUTING Ana Lucia Varbanescu, UvA 6-Oct-2014 Original slides:](https://reader034.vdocuments.site/reader034/viewer/2022042323/5f0d63397e708231d43a1a6a/html5/thumbnails/22.jpg)
NVIDIA GPUs
Kepler: Larger SM (SMX), more registers, better scheduler, dynamic parallelism, multi-GPU
Maxwell: Modular SM (SMM), dedicated registers, dedicated schedulers, more L2 cache
Platform architecture (Fermi)
[Figure not captured in the transcript.]
Memory architecture (from Fermi)
Configurable L1 cache per SM: either 16KB L1 cache / 48KB shared memory, or 48KB L1 cache / 16KB shared memory.
Shared L2 cache.
[Diagram: host memory connects over the PCI-e bus to device memory; device memory feeds a shared L2 cache; each SM has its own registers and L1 cache / shared memory.]
Fermi
[Die diagram: host interface and GigaThread engine; four GPCs, each with a raster engine and four SMs (each SM paired with a polymorph engine); a shared L2 cache; six memory controllers.]
Consumer: GTX 480, 580
HPC: Tesla C2050 (more memory, ECC)
1.0 TFLOP SP, 515 GFLOP DP
16 streaming multiprocessors (SMs); enabled: GTX 580: 16, GTX 480: 15, C2050: 14
768 KB L2 cache
Fermi: SM
[Same die diagram, highlighting a single SM.]
32 cores per SM
64KB configurable L1 cache / shared memory
32,768 32-bit registers
Fermi: CUDA Core*
Decoupled floating-point and integer data paths.
Double-precision fused multiply-add (FMA).
Integer operations optimized for extended precision.
DP throughput is 50% of SP throughput:
DP: 256 FMA ops/clock
SP: 512 FMA ops/clock
*http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
Kepler: the new SMX
Consumer: GTX 680, GTX 780, GTX Titan
HPC: Tesla K10..K40
SMX features:
192 CUDA cores (32 in Fermi)
32 Special Function Units (SFU) (4 in Fermi)
32 Load/Store units (LD/ST) (16 in Fermi)
3x perf/Watt improvement
A comparison
[Comparison figure not captured in the transcript.]
Maxwell: the newest SMM
Consumer: GTX 970, GTX 980, …
HPC: ?
SMM features:
4 subblocks of 32 cores
Dedicated L1/LM per 64 cores
Dispatch/decode/registers per 32 cores
L2 cache: 2MB (~3x vs. Kepler)
40 texture units
Lower power consumption

Hardware performance
Hardware performance metrics
Clock frequency [GHz] = absolute hardware speed (memories, CPUs, interconnects)
Operational speed [GFLOPs] = instructions per cycle combined with frequency
Memory bandwidth [GB/s]: differs a lot between the different memories on a chip
Power [Watt]
Derived metrics: FLOPs/Byte, FLOPs/Watt
Theoretical peak performance
Peak = chips * cores * threads/core * vector_lanes * FLOPs/cycle * clock frequency

Examples from DAS-4:
Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
DRAM memory bandwidth
Throughput = memory bus frequency * transfers per cycle * bus width
Memory clock != CPU clock!
The result is in bits; divide by 8 for GB/s.

Examples:
Intel Core i7 DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s
NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s
ATI HD 6970 GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
Memory bandwidths
On-chip memory can be orders of magnitude faster: registers, shared memory, caches, …
E.g., the AMD HD 7970 L1 cache achieves 2 TB/s.
Off-chip memory speed depends on the interconnect:
Intel's technology: QPI (QuickPath Interconnect), 25.6 GB/s
AMD's technology: HT3 (HyperTransport 3), 19.2 GB/s
Accelerators: PCI-e 2.0, 8 GB/s
Power
Chip manufacturers specify the Thermal Design Power (TDP).
We can measure dissipated power for the whole system; it is typically (much) lower than the TDP.
Power efficiency: FLOPs / Watt.

Examples (with theoretical peak and TDP):
Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
NVIDIA GTX 580: 1581 / 244 = 6.3 GFLOPs/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
Summary

                      Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)
Sun Niagara 2             8            64    11.2        76
IBM BG/P                  4             8    13.6        13.6
IBM Power 7               8            32     265        68
Intel Core i7             4            16      85        25.6
AMD Barcelona             4             8      37        21.4
AMD Istanbul              6             6    62.4        25.6
AMD Magny-Cours          12            12     125        25.6
Cell/B.E.                 8             8     205        25.6
NVIDIA GTX 580           16           512    1581       192
NVIDIA GTX 680            8          1536    3090       192
AMD HD 6970             384          1536    2703       176
AMD HD 7970              32          2048    3789       264
Intel Xeon Phi 7120      61           240    2417       352
Absolute hardware performance
Peak numbers are only achieved under optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
In real life, no application behaves like this.
Can we reason about "real" performance?
Optional assignment
Compute and fill in the numbers in the table with the CPU and GPU from your machine; compute the FLOPs/BW as well.
Compute the numbers and fill in the table for your dream GPU.
Please send me your answers (just the added lines) by Thursday @ 11:00.
Performance analysis: Amdahl's Law; operational intensity and the Roofline model
Software performance metrics (3 P's)
Performance:
- Execution time
- Speed-up
- Computational throughput (GFLOP/s) and computational efficiency (i.e., utilization)
- Bandwidth (GB/s) and memory efficiency (i.e., utilization)
Productivity and Portability:
- Programmability
- Production costs
- Maintenance costs
Reason early about performance
Amdahl's law:
    speedup(p) = 1 / (s + (1 - s) / p)
where s = fraction of sequential code and p = number of processors.
The parallel part is assumed perfectly parallel!
How fast can it really be? Compute the achievable performance.
Amdahl's Law in pictures
[Figure not captured in the transcript.]
RGB to gray

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.R
                   + 0.59 * pixel.G
                   + 0.11 * pixel.B;
    }
}
Performance evaluation
Measure execution time: Tpar (absolute performance).
Calculate speed-up: S = Tseq / Tpar (relative performance; does not take the application into account!).
Execution time and speedup can be used to compare implementations of the same algorithm.
Performance measurement setup
Image sizes: select at least 7 different images and order them by increasing size.
Run the code 10 times per image; assume outliers are eliminated.
Ts = average of 10 sequential runs.
Choose different p's; Tp = average of 10 parallel runs.
Tp_par = execution time of the parallel part.
Tp_seq = execution time of the sequential part (should be the same for every p).
Report execution times & speed-ups for the full application and for the parallel section only.
An example: execution time
[Bar chart: execution times for Image 1 through Image 7; one bar each for Ts, T2, T4, T8, T16; y-axis 0-35.]
Same example: speed-up
[Line chart: speed-up for Image 1 through Image 7 at p = 2, 4, 8, 16; y-axis 0-8.]
This is strong scaling. How would you build a weak scaling experiment?
Weak scaling: keep the same work per compute node and increase the number of compute nodes.
Strong scaling: keep the total workload constant and increase the number of cores/nodes.
Derived metrics
Throughput: GFLOPs = #FLOPs / Tpar. Takes the application into account!
Compute utilization: Ec = GFLOPs / peak * 100.
Bandwidth: BW = #(RD+WR) / Tpar. Takes the application into account!
Bandwidth utilization: Ebw = BW / peak * 100.
Achieved bandwidth and throughput can be used to compare *different* algorithms.
Utilization can be used to compare *different* (application, platform) combinations.
Performance analysis
Compare real-life performance against theoretical limits to:
understand bottlenecks,
perform the correct optimizations,
… and decide when to stop fiddling with the code!!!
Computing the theoretical limits is the most difficult challenge in parallel performance analysis.
Using theoretical peak limits alone gives low accuracy; use the application characteristics together with the platform characteristics.
Arithmetic/operational intensity
The number of operations per byte of accessed memory.
Is the application compute-intensive or data-intensive?
It is an application characteristic!
Ignore "overheads": loop counters, array index calculations, branches.
RGB to gray

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];   // 3-byte structure
        gray[y][x] = 0.30 * pixel.R
                   + 0.59 * pixel.G
                   + 0.11 * pixel.B;
    }
}

2 x ADD, 3 x MUL = 5 ops
1 x RD (3-byte pixel), 1 x WR (1 byte) => 4 bytes of memory accessed
OI = 5/4 = 1.25
Many-core platforms

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2            8            64    11.2                76         0.1
IBM BG/P                 4             8    13.6              13.6         1.0
IBM Power 7              8            32     265                68         3.9
Intel Core i7            4            16      85              25.6         3.3
AMD Barcelona            4             8      37              21.4         1.7
AMD Istanbul             6             6    62.4              25.6         2.4
AMD Magny-Cours         12            12     125              25.6         4.9
Cell/B.E.                8             8     205              25.6         8.0
NVIDIA GTX 580          16           512    1581               192         8.2
NVIDIA GTX 680           8          1536    3090               192        16.1
AMD HD 6970            384          1536    2703               176        15.4
AMD HD 7970             32          2048    3789               264        14.4
Intel Xeon Phi 7120     61           240    2417               352         6.9
Compute or memory intensive?

[Chart: RGB-to-Gray's operational intensity (1.25) plotted against the
FLOPs/Byte ratio of each platform above, plus the Intel Xeon Phi 3120,
on a 0-17 scale]
"A multi-/many-core processor is a device built to turn
a compute-intensive application into a memory-intensive one."
-- Kathy Yelick, UC Berkeley
Applications' operational intensity

O(1):      SpMV, BLAS 1 & 2, stencils (PDEs), lattice methods
O(log(N)): FFTs
O(N):      dense linear algebra (BLAS3), particle methods
Attainable GFLOP/s = min(Peak Floating-Point Performance,
                         Peak Memory Bandwidth * Operational Intensity)

Peak is reached iff OI_app >= Peak_FLOPs / Peak_BW
Compute-intensive iff OI_app >= (FLOPs/Byte)_platform
Memory-intensive  iff OI_app <  (FLOPs/Byte)_platform
[Figure: roofline of attainable performance, split into a
compute-intensive region and a memory-intensive region]
Attainable GFLOP/s = min(Peak Floating-Point Performance,
                         Peak Memory Bandwidth * Operational Intensity)

Example: RGB-to-Gray, OI = 1.25
NVIDIA GTX 680:      P = min(3090, 1.25 * 192) = 240 GFLOP/s
Only 7.8% of the peak
Intel Xeon Phi 7120: P = min(2417, 1.25 * 352) = 440 GFLOP/s
Only 18.2% of the peak
[Figure: roofline of attainable performance, with the compute-intensive
and memory-intensive regions marked]
The Roofline model

AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: comparing architectures

AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ops/byte = 4.9
Roofline: computational ceilings

AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: bandwidth ceilings

AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: optimization regions
Use the Roofline model

Determine what to do first to gain performance:
Increase the memory streaming rate
Apply in-core optimizations
Increase the arithmetic intensity
Reader:
Samuel Williams, Andrew Waterman, David Patterson,
"Roofline: an insightful visual performance model
for multicore architectures",
Communications of the ACM, 2009