Kokkos Tutorial
Jeff Amelang², Christian R. Trott¹, H. Carter Edwards¹
¹Sandia National Laboratories
²Harvey Mudd College
GTC’16, April 4-7, 2016
Sandia National Laboratories is a multi-program laboratory managed and operated by
Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the
U.S. Department of Energy’s National Nuclear Security Administration under contract
DE-AC04-94AL85000.
SAND2016-2933 C
GTC NVLABS LOGISTICS (1)
SOFTWARE FOR LAB
Remote Desktop Software:
- Download NoMachine now for best performance from www.nomachine.com/download
- Alternatively, you may use a VNC client or the provided browser-based VNC option
SSH Access Software (optional):
- PuTTY for Windows can be downloaded from www.putty.org
- Alternatively, you may use the provided browser-based SSH option
GTC NVLABS LOGISTICS (2)
CONNECTION INSTRUCTIONS
- Navigate to nvlabs.qwiklab.com
- Log in or create a new account
- Select the Instructor-Led Hands-on Labs class
- Find the lab called Kokkos, ..., select it, click Select, and finally click Start
- After a short wait, lab instance connection information will be shown
- Please ask Lab Assistants for help!
GTC NVLABS ADVERTISEMENT (3)
FURTHER HANDS-ON TRAINING
Check out the Self-Paced labs at the conference.
- Deep Learning, CUDA, OpenACC, Tools, and more!
- Just grab a seat and take any available lab
- Located in the lower level outside of LL20C
You will also receive Credits to take additional labs at nvidia.qwiklab.com
- Log in using the same account you used in this lab
Prerequisites for Tutorial Exercises
Knowledge of C++: class constructors, member variables, member functions, member operators, template arguments
Using GTC'16 NVLABS:
- Kokkos pre-installed in ${HOME}/kokkos
- Exercises pre-installed in ${HOME}/GTC2016
Using your own ${HOME}:
- Git
- GCC 4.7.2 (or newer) OR Intel 14 (or newer) OR Clang 3.5.2 (or newer)
- CUDA nvcc 6.5.14 (or newer) AND NVIDIA compute capability 3.0 (or newer)
- Clone github.com/kokkos/kokkos into ${HOME}/kokkos
- Clone github.com/kokkos/kokkos-tutorials/GTC2016 into ${HOME}/GTC2016
Makefiles look for ${HOME}/kokkos
Tutorial Objectives
Understand Kokkos Programming Model Abstractions
- What, how, and why of performance portability
- Productivity and hope for future-proofing
Part One:
- Simple data parallel computations
- Deciding where code is run and where data is placed
Part Two:
- Managing data access patterns for performance portability
- Thread safety and thread scalability
- Thread teams for maximizing parallelism
Tutorial Takeaways
- High performance computers are increasingly heterogeneous; MPI-only is no longer sufficient.
- For portability: OpenMP, OpenACC, ... or Kokkos.
- Only Kokkos obtains performant memory access patterns via architecture-aware arrays and work mapping.
i.e., not just portable, but performance portable.
- With Kokkos, simple things stay simple (parallel-for, etc.).
i.e., it's no more difficult than OpenMP.
- Advanced performance-optimizing patterns are simpler with Kokkos than with native versions.
i.e., you're not missing out on advanced features.
Operating assumptions (0)
Target audience:
- Wants to use GPUs (and perhaps other accelerators)
- Is familiar with data parallelism
- Has taken "CUDA programming 101"
- Is familiar with NVIDIA GPU architecture at a high level; aware that coalesced memory access is important
- Has some familiarity with OpenMP
- Wants CUDA to be easier
- Would like portability, if it doesn't hurt performance
Operating assumptions (1)
Target machine:
[Diagram: target node architecture. A node contains NUMA domains of cores with on-package memory, DRAM and NVRAM, an accelerator with its own on-package memory, a network-on-chip, and an external network interface to the interconnect.]
Important Point: Performance Portability
Important Point
There's a difference between portability and performance portability.
Example: implementations may target particular architectures and may not be thread scalable.
(e.g., locks on a CPU won't scale to 100,000 threads on a GPU)
Goal: write one implementation which:
- compiles and runs on multiple architectures,
- obtains performant memory access patterns across architectures,
- can leverage architecture-specific features where possible.
Kokkos: performance portability across manycore architectures.
Concepts for threaded data parallelism
Learning objectives:
- Terminology of pattern, policy, and body.
- The data layout problem.
Concepts: Patterns, Policies, and Bodies
for (element = 0; element < numElements; ++element) {
  total = 0;
  for (qp = 0; qp < numQPs; ++qp) {
    total += dot(left[element][qp], right[element][qp]);
  }
  elementValues[element] = total;
}
Terminology:
- Pattern: structure of the computations (for, reduction, scan, task-graph, ...)
- Execution Policy: how computations are executed (static scheduling, dynamic scheduling, thread teams, ...)
- Computational Body: code which performs each unit of work; e.g., the loop body
⇒ The pattern and policy drive the computational body.
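To make the three terms concrete, here is a runnable, stand-alone version of the loop above with the pattern, policy, and body marked in comments. The `dot` helper and the nested-vector data are illustrative stand-ins for the tutorial's element/quadrature-point arrays, not part of the original slide.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the tutorial's dot(left[element][qp], right[element][qp]).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double d = 0;
  for (std::size_t i = 0; i < a.size(); ++i) d += a[i] * b[i];
  return d;
}

std::vector<double> elementDots(
    const std::vector<std::vector<std::vector<double>>>& left,
    const std::vector<std::vector<std::vector<double>>>& right) {
  std::vector<double> elementValues(left.size());
  // Pattern: a "for" over elements.  Policy: serial, in index order.
  for (std::size_t element = 0; element < left.size(); ++element) {
    // Body: the work performed for one iteration index.
    double total = 0;
    for (std::size_t qp = 0; qp < left[element].size(); ++qp)
      total += dot(left[element][qp], right[element][qp]);
    elementValues[element] = total;
  }
  return elementValues;
}
```

Changing the policy (serial to parallel) or the pattern (for to reduction) should not require rewriting the body; that separation is the point of the terminology.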
Threading "Parallel for"
What if we want to thread the loop?
#pragma omp parallel for
for (element = 0; element < numElements; ++element) {
  total = 0;
  for (qp = 0; qp < numQPs; ++qp) {
    total += dot(left[element][qp], right[element][qp]);
  }
  elementValues[element] = total;
}
(Change the execution policy from "serial" to "parallel.")
OpenMP is simple for parallelizing loops on multi-core CPUs, but what if we then want to do this on other architectures?
Intel MIC and NVIDIA GPU and AMD Fusion and ...
“Parallel for” on a GPU via pragmas
Option 1: OpenMP 4.0
#pragma omp target data map(...)
#pragma omp teams num_teams(...) num_threads(...) private(...)
#pragma omp distribute
for (element = 0; element < numElements; ++element) {
  total = 0;
  #pragma omp parallel for
  for (qp = 0; qp < numQPs; ++qp)
    total += dot(left[element][qp], right[element][qp]);
  elementValues[element] = total;
}
Option 2: OpenACC
#pragma acc parallel copy(...) num_gangs(...) vector_length(...)
#pragma acc loop gang vector
for (element = 0; element < numElements; ++element) {
  total = 0;
  for (qp = 0; qp < numQPs; ++qp)
    total += dot(left[element][qp], right[element][qp]);
  elementValues[element] = total;
}
Portable, but not performance portable
A standard thread-parallel programming model may give you portable parallel execution if it is supported on the target architecture.
But what about performance?
Performance depends upon the computation's memory access pattern.
Problem: memory access pattern
#pragma something, opencl, etc.
for (element = 0; element < numElements; ++element) {
  total = 0;
  for (qp = 0; qp < numQPs; ++qp) {
    for (i = 0; i < vectorSize; ++i) {
      total +=
        left[element * numQPs * vectorSize +
             qp * vectorSize + i] *
        right[element * numQPs * vectorSize +
              qp * vectorSize + i];
    }
  }
  elementValues[element] = total;
}
Memory access pattern problem: a CPU data layout reduces GPU performance by more than 10X.
Important Point
For performance, the memory access pattern must depend on the architecture.
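The layout dependence can be made concrete without any Kokkos code. The sketch below (my illustration, not from the slides) flattens the same 3-D (element, qp, i) data two ways: a CPU-friendly layout where each element's data is contiguous, and a GPU-friendly layout where the element index varies fastest so consecutive threads would touch consecutive addresses (coalescing). The loop that computes the dot products never changes; only the index map does.

```cpp
#include <cstddef>
#include <vector>

// CPU-style layout: one thread's element data is contiguous (cache lines).
inline std::size_t idxCPU(std::size_t e, std::size_t qp, std::size_t i,
                          std::size_t numQPs, std::size_t vectorSize,
                          std::size_t /*numElements*/) {
  return e * numQPs * vectorSize + qp * vectorSize + i;
}

// GPU-style layout: the element index varies fastest, so at a fixed
// (qp, i) consecutive elements (threads) touch consecutive addresses.
inline std::size_t idxGPU(std::size_t e, std::size_t qp, std::size_t i,
                          std::size_t /*numQPs*/, std::size_t vectorSize,
                          std::size_t numElements) {
  return (qp * vectorSize + i) * numElements + e;
}

// The computation is written once, against a layout-agnostic index map.
template <typename Index>
std::vector<double> computeElementValues(
    const std::vector<double>& left, const std::vector<double>& right,
    std::size_t numElements, std::size_t numQPs, std::size_t vectorSize,
    Index idx) {
  std::vector<double> values(numElements);
  for (std::size_t e = 0; e < numElements; ++e) {
    double total = 0;
    for (std::size_t qp = 0; qp < numQPs; ++qp)
      for (std::size_t i = 0; i < vectorSize; ++i)
        total += left[idx(e, qp, i, numQPs, vectorSize, numElements)] *
                 right[idx(e, qp, i, numQPs, vectorSize, numElements)];
    values[e] = total;
  }
  return values;
}
```

Both layouts produce identical results; they differ only in which accesses are contiguous, which is exactly the property that differs between CPU caching and GPU coalescing.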
Kokkos overview
How does Kokkos address performance portability?
Kokkos is a productive, portable, performant, shared-memory programming model. It:
- is a C++ library, not a new language or language extension.
- supports clear, concise, thread-scalable parallel patterns.
- lets you write algorithms once and run on many architectures, e.g., multi-core CPU, NVIDIA GPU, Xeon Phi, ...
- minimizes the amount of architecture-specific implementation details users must know.
- solves the data layout problem by using multi-dimensional arrays with architecture-dependent layouts.
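The last point is the heart of the model. As a rough sketch of the mechanism (this is my simplified illustration, not the real Kokkos::View API), a multi-dimensional array can take its memory mapping from a layout policy type, so user code that indexes the array never changes when the layout does:

```cpp
#include <cstddef>
#include <vector>

// Row-major: a single CPU thread walking j gets contiguous accesses.
struct LayoutRight {
  static std::size_t map(std::size_t i, std::size_t j,
                         std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
  }
};

// Column-major: consecutive i (e.g., GPU thread ids) get consecutive
// addresses at a fixed j, the coalesced-access pattern.
struct LayoutLeft {
  static std::size_t map(std::size_t i, std::size_t j,
                         std::size_t n0, std::size_t /*n1*/) {
    return j * n0 + i;
  }
};

// View-like 2-D array: the layout is a compile-time policy, the
// user-facing indexing operator is identical for every layout.
template <typename Layout>
class View2D {
  std::vector<double> data_;
  std::size_t n0_, n1_;
 public:
  View2D(std::size_t n0, std::size_t n1) : data_(n0 * n1), n0_(n0), n1_(n1) {}
  double& operator()(std::size_t i, std::size_t j) {
    return data_[Layout::map(i, j, n0_, n1_)];
  }
};
```

In real Kokkos, `Kokkos::View` chooses a default layout from the execution space (row-major-style LayoutRight on host CPUs, column-major-style LayoutLeft on CUDA), which is how unchanged user code gets an appropriate access pattern on each architecture.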
Data parallel patterns
Learning objectives:
- How computational bodies are passed to the Kokkos runtime.
- How work is mapped to cores.
- The difference between parallel_for and parallel_reduce.
- Start parallelizing a simple example.
Using Kokkos for data parallel patterns (0)
Data parallel patterns and work
for (atomIndex = 0; atomIndex < numberOfAtoms; ++atomIndex) {
  atomForces[atomIndex] = calculateForce(... data ...);
}
Kokkos maps work to cores
- Each iteration of a computational body is a unit of work.
- An iteration index identifies a particular unit of work.
- An iteration range identifies a total amount of work.
Important concept: Work mapping
You give an iteration range and a computational body (kernel) to Kokkos; Kokkos maps iteration indices to cores and then runs the computational body on those cores.
Using Kokkos for data parallel patterns (2)
How are computational bodies given to Kokkos?
As functors or function objects, a common pattern in C++.
Quick review: a functor is a function with data. Example:
struct ParallelFunctor {
  ...
  void operator()( a work assignment ) const {
    /* ... computational body ... */
  }
  ...
};
Using Kokkos for data parallel patterns (3)
How is work assigned to functor operators?
A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor;
Kokkos::parallel_for(numberOfIterations, functor);
and work items are assigned to functors one-by-one:
struct Functor {
  void operator()(const size_t index) const {...}
};
Warning: concurrency and order
Concurrency and ordering of parallel iterations is not guaranteed by the Kokkos runtime.
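The dispatch mechanism can be mimicked in plain, runnable C++. Here `my_parallel_for` is a hypothetical stand-in for `Kokkos::parallel_for` (the serial schedule below is just one legal schedule; per the warning above, the real runtime guarantees neither ordering nor concurrency), and `SquareFunctor` is an invented example functor:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Kokkos::parallel_for: invokes the functor
// once per iteration index in [0, numberOfIterations).
template <typename FunctorType>
void my_parallel_for(std::size_t numberOfIterations,
                     const FunctorType& functor) {
  for (std::size_t index = 0; index < numberOfIterations; ++index)
    functor(index);
}

// Example functor: carries the data it needs as a member, receives its
// work assignment (the index) through operator().
struct SquareFunctor {
  std::vector<double>* results;
  void operator()(const std::size_t index) const {
    (*results)[index] = static_cast<double>(index) * index;
  }
};
```

Because the body only ever sees its own index, a correct functor works under any schedule the runtime picks.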
Using Kokkos for data parallel patterns (4)
How is data passed to computational bodies?
for (atomIndex = 0; atomIndex < numberOfAtoms; ++atomIndex) {
  atomForces[atomIndex] = calculateForce(... data ...);
}
struct AtomForceFunctor {
  ...
  void operator()(const size_t atomIndex) const {
    atomForces[atomIndex] = calculateForce(... data ...);
  }
  ...
};
How does the body access the data?
Important concept
A parallel functor body must have access to all the data it needs through the functor's data members.
Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:

struct AtomForceFunctor {
  ForceType    _atomForces;
  AtomDataType _atomData;

  void operator()(const size_t atomIndex) const {
    _atomForces[atomIndex] = calculateForce(_atomData);
  }
};

Q/ How would we reproduce serial execution with this functor?

// Serial
for (atomIndex = 0; atomIndex < numberOfAtoms; ++atomIndex) {
  atomForces[atomIndex] = calculateForce(data);
}

// Functor
AtomForceFunctor functor(atomForces, data);
for (atomIndex = 0; atomIndex < numberOfAtoms; ++atomIndex) {
  functor(atomIndex);
}
GTC’16, April 4-7, 2016 25/121
Using Kokkos for data parallel patterns (6)
The complete picture (using functors):
1. Defining the functor (operator + data):

struct AtomForceFunctor {
  ForceType    _atomForces;
  AtomDataType _atomData;

  AtomForceFunctor(ForceType atomForces, AtomDataType data) :
    _atomForces(atomForces), _atomData(data) {}

  void operator()(const size_t atomIndex) const {
    _atomForces[atomIndex] = calculateForce(_atomData);
  }
};

2. Executing in parallel with a Kokkos pattern:

AtomForceFunctor functor(atomForces, data);
Kokkos::parallel_for(numberOfAtoms, functor);
GTC’16, April 4-7, 2016 26/121
Using Kokkos for data parallel patterns (7)
Functors are verbose ⇒ C++11 lambdas are concise

atomForces already exists
data already exists

Kokkos::parallel_for(numberOfAtoms,
  [=] (const size_t atomIndex) {
    atomForces[atomIndex] = calculateForce(data);
  }
);

A lambda is not magic; it is the compiler auto-generating a functor for you.

Warning: Lambda capture and C++ containers
For portability (e.g., to GPU) a lambda must capture by value [=]. Don’t capture containers (e.g., std::vector) by value, because this copies the container’s entire contents.
GTC’16, April 4-7, 2016 27/121
parallel_for examples
How does this compare to OpenMP?
// Serial
for (size_t i = 0; i < N; ++i) {
  /* loop body */
}

// OpenMP
#pragma omp parallel for
for (size_t i = 0; i < N; ++i) {
  /* loop body */
}

// Kokkos
parallel_for(N, [=] (const size_t i) {
  /* loop body */
});

Important concept
Simple Kokkos usage is no more conceptually difficult than OpenMP; the annotations just go in different places.
GTC’16, April 4-7, 2016 28/121

Scalar integration (0)

Riemann-sum-style numerical integration:

y = ∫ from lower to upper of function(x) dx

(figure: Riemann sum illustration, from Wikipedia)

double totalIntegral = 0;
for (size_t i = 0; i < numberOfIntervals; ++i) {
  const double x =
    lower + (double(i) / numberOfIntervals) * (upper - lower);
  const double thisIntervalsContribution = function(x);
  totalIntegral += thisIntervalsContribution;
}
totalIntegral *= dx;

How would we parallelize it? Pattern? Policy? Body?
GTC’16, April 4-7, 2016 29/121
Scalar integration (1)
An (incorrect) attempt:

double totalIntegral = 0;
Kokkos::parallel_for(numberOfIntervals,
  [=] (const size_t index) {
    const double x =
      lower + (double(index) / numberOfIntervals) * (upper - lower);
    totalIntegral += function(x);
  }
);
totalIntegral *= dx;

First problem: compiler error; cannot increment totalIntegral (lambdas capture by value and are treated as const!)
GTC’16, April 4-7, 2016 30/121
Scalar integration (2)
An (incorrect) solution to the (incorrect) attempt:

double totalIntegral = 0;
double* totalIntegralPointer = &totalIntegral;
Kokkos::parallel_for(numberOfIntervals,
  [=] (const size_t index) {
    const double x =
      lower + (double(index) / numberOfIntervals) * (upper - lower);
    *totalIntegralPointer += function(x);
  }
);
totalIntegral *= dx;

Second problem: race condition

step   thread 0    thread 1
 0     load
 1     increment   load
 2     write       increment
 3                 write
GTC’16, April 4-7, 2016 31/121
Scalar integration (3)
Root problem: we’re using the wrong pattern, for instead of reduction

Important concept: Reduction
Reductions combine the results contributed by parallel work.

How would we do this with OpenMP?

double finalReducedValue = 0;
#pragma omp parallel for reduction(+: finalReducedValue)
for (size_t i = 0; i < N; ++i) {
  finalReducedValue += ...
}

How will we do this with Kokkos?

double finalReducedValue = 0;
parallel_reduce(N, functor, finalReducedValue);
GTC’16, April 4-7, 2016 32/121
Scalar integration (4)
Example: Scalar integration

// OpenMP
double totalIntegral = 0;
#pragma omp parallel for reduction(+: totalIntegral)
for (size_t i = 0; i < numberOfIntervals; ++i) {
  totalIntegral += function (...);
}

// Kokkos
double totalIntegral = 0;
parallel_reduce(numberOfIntervals,
  [=] (const size_t i, double& valueToUpdate) {
    valueToUpdate += function (...);
  },
  totalIntegral);

I The operator takes two arguments: a work index and a value to update.
I The second argument is a thread-private value that is made and used by Kokkos; it is not the final reduced value.
![Page 61: Kokkos Tutorial - Sandia National Laboratories€¦ · Kokkos Tutorial Je Amelang 2, Christian R. Trott 1, H. Carter Edwards 1Sandia National Laboratories 2Harvey Mudd College GTC’16,](https://reader034.vdocuments.site/reader034/viewer/2022052002/60159e56b367c020ad356808/html5/thumbnails/61.jpg)
GTC’16, April 4-7, 2016 33/121
Scalar integration (5)
Warning: Parallelism is NOT free
Dispatching (launching) parallel work has non-negligible cost.

Simplistic data-parallel performance model: Time = α + (β · N) / P

I α = dispatch overhead
I β = time for a unit of work
I N = number of units of work
I P = available concurrency

Speedup = P / (1 + (α · P) / (β · N))

I Should have α · P ≪ β · N
I All runtimes strive to minimize launch overhead α
I Find more parallelism to increase N
I Merge (fuse) parallel operations to increase β
GTC’16, April 4-7, 2016 34/121
Scalar integration (6)
Results: illustrate the simple speedup model P / (1 + (α · P) / (β · N))

(figure: “Kokkos speedup over serial: Scalar Integration” — speedup over serial vs. number of intervals, log scale; curves for Kokkos Cuda K40, Kokkos OpenMP KNC, Kokkos OpenMP SNB, Native OpenMP SNB, and unity)
GTC’16, April 4-7, 2016 35/121
Recurring Exercise: Inner Product
Exercise: Inner product ⟨y, A ∗ x⟩

Details:
I y is N×1, A is N×M, x is M×1
I We’ll use this exercise throughout the tutorial
GTC’16, April 4-7, 2016 36/121
Exercise #1: include, initialize, finalize Kokkos
The first step in using Kokkos is to include, initialize, and finalize:
#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  /* ... do any necessary setup (e.g., initialize MPI) ... */
  Kokkos::initialize(argc, argv);
  /* ... do computations ... */
  Kokkos::finalize();
  return 0;
}

(Optional) Command-line arguments:

--kokkos-threads=INT  total number of threads (or threads within NUMA region)
--kokkos-numa=INT     number of NUMA regions
--kokkos-device=INT   device (GPU) ID to use
GTC’16, April 4-7, 2016 37/121
Exercise #1: Inner Product, Flat Parallelism on the CPU
Exercise: Inner product < y ,A ∗ x >
Details:
I Location: ~/GTC2016/Exercises/01/
I Look for comments labeled with “EXERCISE”
I Need to include, initialize, and finalize Kokkos library
I Parallelize loops with parallel_for or parallel_reduce
I Use lambdas instead of functors for computational bodies.
I For now, this will only use the CPU.
GTC’16, April 4-7, 2016 38/121
Exercise #1: logistics
Compiling for CPU

cd ~/GTC2016/Exercises/01/Begin
# gcc using OpenMP (default) and Serial back-ends
make -j [KOKKOS_DEVICES=OpenMP,Serial]

Running on CPU with OpenMP back-end

# Set OpenMP affinity
export OMP_NUM_THREADS=8
export GOMP_CPU_AFFINITY=0-8
# Print example command line options:
./01_Exercise.host -h
# Run with defaults on CPU
./01_Exercise.host
# Run larger problem
./01_Exercise.host -S 26
Things to try:
I Vary number of threads
I Vary problem size
I Vary number of rows (-N ...)
GTC’16, April 4-7, 2016 39/121
Exercise #1 results
(figure: “<y,Ax> Exercise01, fixed problem size” — bandwidth (GB/s) vs. number of rows; curves for KNC and HSW)
GTC’16, April 4-7, 2016 40/121
General reductions
How do you do arbitrary reductions?
Example: finding index of closest point
Point searchLocation = ...;
size_t indexOfClosest = 0;
for (size_t i = 1; i < numberOfPoints; ++i) {
  if (magnitude(searchLocation - points[i]) <
      magnitude(searchLocation - points[indexOfClosest])) {
    indexOfClosest = i;
  }
}
What information must we provide to do a reduction?
I The type of the value to reduce (“value type”)
I How to combine (“join”) two value types
I How to initialize a value type
GTC’16, April 4-7, 2016 41/121
Advanced features we haven’t covered
I Exclusive and inclusive prefix scan with the parallel_scan pattern.
I Using the tag dispatch interface to allow non-trivial functors to have multiple “operator()” functions.
I NEW: Directed acyclic graph (DAG) of tasks pattern.
I Concurrently executing parallel kernels on CPU and GPU (experimental).
I Hierarchical parallelism with team policies, covered later.
GTC’16, April 4-7, 2016 42/121
Section Summary
I Simple usage is similar to OpenMP; advanced features are also straightforward.
I Three common data-parallel patterns are parallel_for, parallel_reduce, and parallel_scan.
I A parallel computation is characterized by its pattern, policy, and body.
I User provides computational bodies as functors or lambdas which handle a single work item.
GTC’16, April 4-7, 2016 43/121
Views
Learning objectives:
I Motivation behind the View abstraction.
I Key View concepts and template parameters.
I The View life cycle.
GTC’16, April 4-7, 2016 44/121
View motivation
Example: running daxpy on the GPU:

// Lambda
double* x = new double[N]; // also y
parallel_for(N, [=] (const size_t i) {
  y[i] = a * x[i] + y[i];
});

// Functor
struct Functor {
  double *_x, *_y, _a;
  void operator()(const size_t i) {
    _y[i] = _a * _x[i] + _y[i];
  }
};

Problem: x and y reside in CPU memory.

Solution: We need a way of storing data (multidimensional arrays) which can be communicated to an accelerator (GPU). ⇒ Views
GTC’16, April 4-7, 2016 45/121
Views (0)
View abstraction
I A lightweight C++ class with a pointer to array data and a little meta-data,
I that is templated on the data type (and other things).

High-level example of Views for daxpy using a lambda:

View<double, ...> x(...), y(...);
... populate x, y ...
parallel_for(N, [=] (const size_t i) {
  // Views x and y are captured by value (copy)
  y(i) = a * x(i) + y(i);
});

Important point
Views are like pointers, so copy them when passing.
GTC’16, April 4-7, 2016 46/121
Views (1)
View overview:
I Multi-dimensional array of 0 or more dimensions: scalar (0), vector (1), matrix (2), etc.
I Number of dimensions (rank) is fixed at compile-time.
I Arrays are rectangular, not ragged.
I Sizes of dimensions set at compile-time or runtime, e.g., 2x20, 50x50, etc.

Example:

View<double***>          data("label", N0, N1, N2);  // 3 run, 0 compile
View<double**[N2]>       data("label", N0, N1);      // 2 run, 1 compile
View<double*[N1][N2]>    data("label", N0);          // 1 run, 2 compile
View<double[N0][N1][N2]> data("label");              // 0 run, 3 compile

Note: runtime-sized dimensions must come first.
GTC’16, April 4-7, 2016 47/121
Views (2)
View life cycle:
I Allocations only happen when explicitly specified; i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers), so you pass Views by value, not by reference.
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr.
Example:
View<double*> a("a", N0), b("b", N0);
a = b;               // a and b now reference the same allocation
View<double*> c(b);  // so does c
a(0) = 1;
b(0) = 2;
c(0) = 3;
print a(0);          // prints 3
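The same aliasing behavior can be sketched with std::shared_ptr in plain C++ (an analogy, not the Kokkos implementation; `Handle` is a made-up name): three handles, one allocation, and the last write wins.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Plain-C++ analogy for View's shallow-copy semantics (not Kokkos code).
// A Handle owns its array through a reference-counted pointer, so copies
// alias the same storage, and the storage is freed when the last handle dies.
struct Handle {
  std::shared_ptr<double[]> data;
  explicit Handle(std::size_t n) : data(new double[n]()) {}
  double& operator()(std::size_t i) const { return data[i]; }
};
```

Copy-constructing or assigning a `Handle` bumps the reference count instead of duplicating the array, which is exactly why Views are cheap to pass by value.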
Exercise #2 a): Inner Product, Flat Parallelism on the CPU, with Views
I Location: ~/GTC2016/Exercises/02/
I Assignment: Change data storage from arrays to Views.
I Ignore the HostMirror Views and deep copy for now! Use the Device View only!
I Compile and run on the CPU, then on the GPU with UVM
make -j KOKKOS_DEVICES=OpenMP   # CPU-only using OpenMP
make -j KOKKOS_DEVICES=Cuda \
     KOKKOS_CUDA_OPTIONS=force_uvm,enable_lambda
# Run exercise
./02_Exercise.host -S 26
./02_Exercise.cuda -S 26
# Note the warnings; set appropriate environment variables
I Vary problem size: -S #
I Vary number of rows: -N #
I Vary repeats: -nrepeat #
I Compare performance of CPU vs GPU
Advanced features we haven't covered
I Memory space in which a view's data resides; covered next.
I Deep copy of a view's data; covered later.
Note: Kokkos never hides a deep copy of data.
I Layout of the multidimensional array; covered later.
I Memory traits; covered later.
I Subview: generating a view that is a "slice" of another multidimensional array view; will not be covered today.
Execution and Memory Spaces
Learning objectives:
I Heterogeneous nodes and the space abstractions.
I How to control where parallel bodies are run: execution spaces.
I How to control where view data resides: memory spaces.
I How to avoid illegal memory accesses and manage data movement.
I The need for Kokkos::initialize and finalize.
I Where to use Kokkos annotation macros for portability.
Execution spaces (0)
Thought experiment: Consider this code:
// section 1
MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);
// section 2
Kokkos::parallel_for(numberOfSomethings,
  [=] (const size_t somethingIndex) {
    const double y = ...;
    // do something interesting
  }
);
I Where will section 1 be run? CPU? GPU?
I Where will section 2 be run? CPU? GPU?
I How do I control where code is executed?
⇒ Execution spaces
Execution spaces (1)
Execution Space: a homogeneous set of cores and an execution mechanism
(i.e., a "place to run code")
[Node diagram: two NUMA domains of cores with DRAM, on-package memory, and NVRAM, connected by a network-on-chip and an external network interface; an accelerator with its own on-package memory; an external interconnect.]
Execution spaces: Serial, Threads, OpenMP, Cuda, ...
Execution spaces (2)
// Host: ordinary serial code
MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);
// Parallel: the body passed to a Kokkos pattern
Kokkos::parallel_for(numberOfSomethings,
  [=] (const size_t somethingIndex) {
    const double y = ...;
    // do something interesting
  }
);
I Where will Host code be run? CPU? GPU?
⇒ Always in the host process.
I Where will Parallel code be run? CPU? GPU?
⇒ In the default execution space.
I How do I control where the Parallel body is executed?
⇒ Change the default execution space (at compilation), or specify an execution space in the policy.
Execution spaces (3)
Changing the parallel execution space:
// Custom: execution space specified in the policy
parallel_for(
  RangePolicy<ExecutionSpace>(0, numberOfIntervals),
  [=] (const size_t i) {
    /* ... body ... */
  });
// Default: equivalent to RangePolicy<>(0, numberOfIntervals)
parallel_for(
  numberOfIntervals,
  [=] (const size_t i) {
    /* ... body ... */
  });
Requirements for enabling execution spaces:
I Kokkos must be compiled with the execution spaces enabled.
I Execution spaces must be initialized (and finalized).
I Functions must be marked with a macro for non-CPU spaces.
I Lambdas must be marked with a macro for non-CPU spaces.
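The policy-selects-where-code-runs idea can be sketched in plain C++ with tag dispatch (hypothetical types, not the real Kokkos implementation): the overload chosen by the policy's type determines the execution mechanism, while the body stays unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical policy tags standing in for execution spaces (not Kokkos types).
struct SerialLike  {};
struct ThreadsLike {};

// The policy argument's type selects the overload, i.e., "where" the body runs.
template <class Body>
void my_parallel_for(SerialLike, std::size_t n, Body body) {
  for (std::size_t i = 0; i < n; ++i) body(i);  // one thread, in order
}

template <class Body>
void my_parallel_for(ThreadsLike, std::size_t n, Body body) {
  std::size_t half = n / 2;                     // split the range over two threads
  std::thread t([&] { for (std::size_t i = 0; i < half; ++i) body(i); });
  for (std::size_t i = half; i < n; ++i) body(i);
  t.join();
}
```

Both overloads must produce the same result for a well-formed body; only the execution mechanism differs, which is the portability contract Kokkos policies express.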
Execution spaces (5)
Kokkos function and lambda portability annotation macros:
Function annotation with the KOKKOS_INLINE_FUNCTION macro:
struct ParallelFunctor {
  KOKKOS_INLINE_FUNCTION
  double helperFunction(const size_t s) const { ... }
  KOKKOS_INLINE_FUNCTION
  void operator()(const size_t index) const {
    helperFunction(index);
  }
};
// Where Kokkos defines:
#define KOKKOS_INLINE_FUNCTION inline                      /* #if CPU-only */
#define KOKKOS_INLINE_FUNCTION inline __device__ __host__  /* #if CPU+Cuda */

Lambda annotation with the KOKKOS_LAMBDA macro (requires CUDA 7.5):
Kokkos::parallel_for(numberOfIterations,
  KOKKOS_LAMBDA (const size_t index) { ... });
// Where Kokkos defines:
#define KOKKOS_LAMBDA [=]             /* #if CPU-only */
#define KOKKOS_LAMBDA [=] __device__  /* #if CPU+Cuda */
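A compilable CPU-only illustration of the same macro pattern (the macro name here is made up, not the Kokkos one): on a host-only build the annotation expands to plain `inline`, so the annotated functor works unchanged.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for KOKKOS_INLINE_FUNCTION (hypothetical name).
// On a CUDA build the device/host attributes are added; on a CPU-only
// build the macro reduces to plain `inline`, so the code stays portable.
#if defined(__CUDACC__)
#define MY_INLINE_FUNCTION inline __host__ __device__
#else
#define MY_INLINE_FUNCTION inline
#endif

struct SquareFunctor {
  MY_INLINE_FUNCTION
  double helper(double x) const { return x * x; }  // callable from operator()

  MY_INLINE_FUNCTION
  double operator()(std::size_t i) const { return helper(double(i)); }
};
```

Annotating every function reachable from the parallel body is what lets the same source compile for both the host and the device.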
Memory Space Motivation
Memory space motivating example: summing an array
View<double*> data("data", size);
for (size_t i = 0; i < size; ++i) {
  data(i) = ... read from file ...
}
double sum = 0;
Kokkos::parallel_reduce(
  RangePolicy<ExecutionSpace>(0, size),
  KOKKOS_LAMBDA (const size_t index, double & valueToUpdate) {
    valueToUpdate += data(index);
  },
  sum);
Question: Where is the data stored? GPU memory? CPU memory? Both?
⇒ Memory Spaces
Memory spaces (0)
Memory space: an explicitly-manageable memory resource
(i.e., a "place to put data")
[Node diagram, as before: two NUMA domains of cores with DRAM, on-package memory, and NVRAM; an accelerator with its own on-package memory; external network interface and interconnect.]
Memory spaces (1)
Important concept: Memory spaces
Every view stores its data in a memory space set at compile time.
I View<double***, MemorySpace> data(...);
I Available memory spaces: HostSpace, CudaSpace, CudaUVMSpace, ... and more.
I Each execution space has a default memory space, which is used if the Space provided is actually an execution space.
I If no Space is provided, the view's data resides in the default memory space of the default execution space.
Memory spaces (2)
Example: HostSpace
View<double**, HostSpace> hostView (...);
Example: CudaSpace
View<double**, CudaSpace> view (...);
Execution and Memory spaces (0)
Anatomy of a kernel launch:
1. User declares views, allocating.
2. User instantiates a functor with views.
3. User launches a parallel operation:
   I The functor is copied to the device.
   I The kernel is run.
   I The copy of the functor on the device is released.
View<int*, Cuda> dev (...);
parallel_for(N,
  [=] (int i) {
    dev(i) = ...;
  });
Note: no deep copies of array data are performed; views are like pointers.
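The copy-the-functor-by-value step can be sketched in plain C++ (a stand-in "launch", not a real kernel launch; all names are made up): the functor's pointer member is copied, so the launched copy still writes to the original array.

```cpp
#include <cassert>
#include <cstddef>

// Plain-C++ sketch of a kernel launch (hypothetical; no device involved).
// The functor holds a raw pointer, so copying the functor copies the
// handle, not the data it points at -- just like a View member.
struct Fill {
  int* data;
  void operator()(std::size_t i) const { data[i] = int(i); }
};

// Takes the functor BY VALUE, as a launch copies it to the device.
void launch(std::size_t n, Fill f) {
  for (std::size_t i = 0; i < n; ++i) f(i);
}  // the copy of f is destroyed here; the pointed-to array survives
```

This is why capturing views by value in a lambda is cheap and correct: only the handle travels to the device, never the array data.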
Execution and Memory spaces (1)
Example: one view
View<int*, Cuda> dev (...);
parallel_for(N,
  [=] (int i) {
    dev(i) = ...;
  });
Execution and Memory spaces (2)
Example: two views
View<int*, Cuda> dev (...);
View<int*, Host> host (...);
parallel_for(N,
  [=] (int i) {
    dev(i) = ...;
    host(i) = ...;  // illegal from the Cuda execution space!
  });
Execution and Memory spaces (3)
Example (redux): summing an array with the GPU
(failed) Attempt 1: the host loop faults when touching CudaSpace data
View<double*, CudaSpace> array("array", size);
for (size_t i = 0; i < size; ++i) {
  array(i) = ... read from file ...   // fault: host access to device memory
}
double sum = 0;
Kokkos::parallel_reduce(
  RangePolicy<Cuda>(0, size),
  KOKKOS_LAMBDA (const size_t index, double & valueToUpdate) {
    valueToUpdate += array(index);
  },
  sum);
Execution and Memory spaces (4)
Example (redux): summing an array with the GPU
(failed) Attempt 2: the device kernel makes an illegal access to HostSpace data
View<double*, HostSpace> array("array", size);
for (size_t i = 0; i < size; ++i) {
  array(i) = ... read from file ...
}
double sum = 0;
Kokkos::parallel_reduce(
  RangePolicy<Cuda>(0, size),
  KOKKOS_LAMBDA (const size_t index, double & valueToUpdate) {
    valueToUpdate += array(index);   // illegal access: device touching host memory
  },
  sum);
What's the solution?
I CudaUVMSpace
I CudaHostPinnedSpace (skipping)
I Mirroring
Execution and Memory spaces (5)
CudaUVMSpace
View<double*,
     CudaUVMSpace> array;
array = ... from file ...
double sum = 0;
parallel_reduce(N,
  [=] (int i,
       double & d) {
    d += array(i);
  },
  sum);
The Cuda runtime automatically handles data movement, at a performance cost.
Views, Spaces, and Mirrors
Important concept: Mirrors
Mirrors are views of equivalent arrays residing in possibly different memory spaces.
Mirroring schematic
typedef Kokkos::View<double**, Space> ViewType;
ViewType view (...);
ViewType::HostMirror hostView =
  Kokkos::create_mirror_view(view);
Mirroring pattern
1. Create a view's array in some memory space.
   typedef Kokkos::View<double*, Space> ViewType;
   ViewType view (...);
2. Create hostView, a mirror of the view's array residing in the host memory space.
   ViewType::HostMirror hostView =
     Kokkos::create_mirror_view(view);
3. Populate hostView on the host (from file, etc.).
4. Deep copy hostView's array to view's array.
   Kokkos::deep_copy(view, hostView);
5. Launch a kernel processing the view's array.
   Kokkos::parallel_for(
     RangePolicy<Space>(0, size),
     KOKKOS_LAMBDA (...) { /* use and change view */ });
6. If needed, deep copy the view's updated array back to the hostView's array to write a file, etc.
   Kokkos::deep_copy(hostView, view);
Mirrors of Views in HostSpace
What if the View is in HostSpace too? Does it make a copy?

typedef Kokkos::View<double*, Space> ViewType;
ViewType view("test", 10);
ViewType::HostMirror hostView =
  Kokkos::create_mirror_view(view);

▶ create_mirror_view allocates data only if the host process cannot access view's data; otherwise hostView references the same data.
▶ create_mirror always allocates data.
▶ Kokkos never performs a hidden deep copy.
Exercise #2 b): Flat Parallelism on the GPU, Views and Host Mirrors
Details:
▶ Location: ~/GTC2016/Exercises/02/
▶ Add HostMirror Views and deep_copy to Exercise 02
▶ Make sure you use the correct view in initialization and in the kernel

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./02_Exercise.cuda -S 26

Things to try:
▶ Vary problem size and number of rows (-S ...; -N ...)
▶ Change number of repeats (-nrepeat ...)
▶ Compare behavior of CPU vs GPU
View and Spaces Section Summary
▶ Data is stored in Views that are "pointers" to multi-dimensional arrays residing in memory spaces.
▶ Views abstract away platform-dependent allocation, (automatic) deallocation, and access.
▶ Heterogeneous nodes have one or more memory spaces.
▶ Mirroring is used for performant access to views in host and device memory.
▶ Heterogeneous nodes have one or more execution spaces.
▶ You control where parallel code is run by a template parameter on the execution policy, or by compile-time selection of the default execution space.
Managing memory access patterns for performance portability

Learning objectives:
▶ How the View's Layout parameter controls data layout.
▶ How memory access patterns result from Kokkos mapping parallel work indices onto the layout of multidimensional array data.
▶ Why memory access patterns and layouts have such a performance impact (caching and coalescing).
▶ See a concrete example of the performance of various memory configurations.
Example: inner product (0)
Kokkos::parallel_reduce(
  RangePolicy<ExecutionSpace>(0, N),
  KOKKOS_LAMBDA (const size_t row, double& valueToUpdate) {
    double thisRowsSum = 0;
    for (size_t entry = 0; entry < M; ++entry) {
      thisRowsSum += A(row, entry) * x(entry);
    }
    valueToUpdate += y(row) * thisRowsSum;
  }, result);
How should A be laid out in memory?
Example: inner product (1)
Layout is the mapping of multi-index to memory:
▶ LayoutLeft: in 2D, "column-major"
▶ LayoutRight: in 2D, "row-major"
Layout
Important concept: Layout
Every View has a multidimensional array Layout set at compile-time.

View<double***, Layout, Space> name(...);

▶ Most common layouts are LayoutLeft and LayoutRight.
  LayoutLeft: left-most index is stride 1.
  LayoutRight: right-most index is stride 1.
▶ If no layout is specified, the default for that memory space is used: LayoutLeft for CudaSpace, LayoutRight for HostSpace.
▶ Layouts are extensible: ~50 lines.
▶ Advanced layouts: LayoutStride, LayoutTiled, ...
Exercise #3: Inner Product, Flat Parallelism
Details:
▶ Location: ~/GTC2016/Exercises/03/
▶ Replace "N" in the parallel dispatch with RangePolicy<ExecSpace>
▶ Add MemSpace to all Views and Layout to A
▶ Experiment with combinations of ExecSpace and Layout to view performance

Things to try:
▶ Vary problem size and number of rows (-S ...; -N ...)
▶ Change number of repeats (-nrepeat ...)
▶ Compare behavior of CPU vs GPU
▶ Compare using UVM vs not using UVM on GPUs
▶ Check what happens if MemSpace and ExecSpace do not match.
Exercise #3: Inner Product, Flat Parallelism
[Figure: <y,Ax> Exercise03, fixed problem size — bandwidth (GB/s, 0-180) vs. number of rows (1 to 1e9), for K40, KNC, and HSW, each with LayoutRight and LayoutLeft.]
Why?
Caching and coalescing (0)
Thread independence:

operator()(const size_t index, double& valueToUpdate) {
  const double d = _data(index);
  valueToUpdate += d;
}

Question: once a thread reads d, does it need to wait?
▶ CPU threads are independent, i.e., threads may execute at any rate.
▶ GPU threads are synchronized in groups (of 32), i.e., threads in groups must execute instructions together.

In particular, all threads in a group (warp) must finish their loads before any thread can move on.

So, how many cache lines must be fetched before threads can move on?
Caching and coalescing (1)
CPUs: few (independent) cores with separate caches.

GPUs: many (synchronized) cores with a shared cache.
Caching and coalescing (2)
Important point
For performance, accesses to views in HostSpace must be cached, while accesses to views in CudaSpace must be coalesced.

Caching: if thread t's current access is at position i, thread t's next access should be at position i+1.

Coalescing: if thread t's current access is at position i, thread t+1's current access should be at position i+1.

Warning

Uncoalesced access in CudaSpace greatly reduces performance (more than 10X).

Note: uncoalesced read-only, random access in CudaSpace is okay through Kokkos const RandomAccess views (more later).
Mapping indices to cores (0)
Consider the array summation example:

View<double*, Space> data("data", size);
... populate data ...

double sum = 0;
Kokkos::parallel_reduce(
  RangePolicy<Space>(0, size),
  KOKKOS_LAMBDA (const size_t index, double& valueToUpdate) {
    valueToUpdate += data(index);
  },
  sum);

Question: is this cached (for OpenMP) and coalesced (for Cuda)?

Given P threads, which indices do we want thread 0 to handle, and why?
▶ Contiguous: 0, 1, 2, ..., N/P (CPU)
▶ Strided: 0, N/P, 2*N/P, ... (GPU)
Mapping indices to cores (1)
Iterating for the execution space:

operator()(const size_t index, double& valueToUpdate) {
  const double d = _data(index);
  valueToUpdate += d;
}

As users we don't control how indices are mapped to threads, so how do we achieve good memory access?

Important point

Kokkos maps indices to cores in contiguous chunks on CPU execution spaces, and strided for Cuda.

Important point

Kokkos index mapping and default layouts provide efficient access if iteration indices correspond to the first index of the array.
Example: inner product (2)
Important point
Performant memory access is achieved by Kokkos mapping parallel work indices and multidimensional array layout optimally for the architecture.

Analysis: row-major (LayoutRight)
▶ HostSpace: cached (good)
▶ CudaSpace: uncoalesced (bad)
Example: inner product (3)
Important point
Performant memory access is achieved by Kokkos mapping parallel work indices and multidimensional array layout optimally for the architecture.

Analysis: column-major (LayoutLeft)
▶ HostSpace: uncached (bad)
▶ CudaSpace: coalesced (good)
Example: inner product (4)
Analysis: Kokkos architecture-dependent

View<double**, ExecutionSpace> A(N, M);
parallel_for(RangePolicy<ExecutionSpace>(0, N),
  ... thisRowsSum += A(j, i) * x(i); ...

[Figures (a) OpenMP and (b) Cuda illustrate the resulting access patterns.]

▶ HostSpace: cached (good)
▶ CudaSpace: coalesced (good)
Example: inner product (5)
Layout performance, revisited
[Figure: <y,Ax> Exercise03, fixed problem size — bandwidth (GB/s, 0-180) vs. number of rows, for K40, KNC, and HSW with LayoutRight and LayoutLeft. Annotations: LayoutLeft on K40 is coalesced, LayoutRight on HSW/KNC is cached, LayoutLeft on CPU is uncached, and LayoutRight on K40 is uncoalesced.]
Memory Access Pattern Summary
▶ Every View has a Layout set at compile-time through a template parameter.
▶ LayoutRight and LayoutLeft are most common.
▶ Views in HostSpace default to LayoutRight and Views in CudaSpace default to LayoutLeft.
▶ Layouts are extensible and flexible.
▶ For performance, memory access patterns must result in caching on a CPU and coalescing on a GPU.
▶ Kokkos maps parallel work indices and multidimensional array layout for performance-portable memory access patterns.
▶ There is nothing in OpenMP, OpenACC, or OpenCL to manage layouts. ⇒ You'll need multiple versions of code or pay the performance penalty.
Thread safety and atomic operations
Learning objectives:
I Understand that coordination techniques for low-count CPU threading are not scalable.
I Understand how atomics can parallelize the scatter-add pattern.
I Gain performance intuition for atomics on the CPU and GPU, for different data types and contention rates.
GTC’16, April 4-7, 2016 88/121
Examples: Histogram
Histogram kernel:
parallel_for(N, KOKKOS_LAMBDA(const size_t index) {
  const Something value = ...;
  const int bucketIndex = computeBucketIndex(value);
  ++_histogram(bucketIndex);
});
Problem: Multiple threads may try to write to the same location.
Solution strategies:
I Locks: not feasible on GPU
I Thread-private copies: not thread-scalable
I Atomics
http://www.farmaceuticas.com.br/tag/graficos/
GTC’16, April 4-7, 2016 89/121
Atomics
Atomics: the portable and thread-scalable solution
parallel_for(N, KOKKOS_LAMBDA(const size_t index) {
  const Something value = ...;
  const int bucketIndex = computeBucketIndex(value);
  Kokkos::atomic_add(&_histogram(bucketIndex), 1);
});
I Atomics are the only scalable solution to thread safety.
I Locks or data replication are strongly discouraged.
GTC’16, April 4-7, 2016 90/121
Performance of atomics (0)
How expensive are atomics?
Thought experiment: scalar integration
operator()(const unsigned int intervalIndex,
           double & valueToUpdate) const {
  double contribution = function(...);
  valueToUpdate += contribution;
}

Idea: what if we instead do this with parallel_for and atomics?

operator()(const unsigned int intervalIndex) const {
  const double contribution = function(...);
  Kokkos::atomic_add(&globalSum, contribution);
}
How much of a performance penalty is incurred?
GTC’16, April 4-7, 2016 91/121
Performance of atomics (1)
Two costs: (independent) work and coordination.

parallel_reduce(numberOfIntervals,
  KOKKOS_LAMBDA (const unsigned int intervalIndex,
                 double & valueToUpdate) {
    valueToUpdate += function(...);
  }, totalIntegral);
Experimental setup
operator()(const unsigned int index) const {
  Kokkos::atomic_add(&globalSums[index % atomicStride], 1);
}

I This is the most extreme case: all coordination and no work.
I Contention is captured by the atomicStride.
  atomicStride → 1 ⇒ Scalar integration
  atomicStride → large ⇒ Independent
GTC’16, April 4-7, 2016 92/121
Performance of atomics (2)
Atomics performance: 1 million adds, no work per kernel
[Plot: log10(speedup over independent) vs. log10(contention), log scale. "Slowdown from atomics: Summary for 1 million adds, mod, 0 pows." Series: cuda/omp/phi × double, size_t, float, unsigned. No penalty for low contention; high penalty for high contention.]
GTC’16, April 4-7, 2016 93/121
Performance of atomics (3)
Atomics performance: 1 million adds, some work per kernel
[Plot: log10(speedup over independent) vs. log10(contention), log scale. "Slowdown from atomics: Summary for 1 million adds, mod, 2 pows." Series: cuda/omp/phi × double, size_t, float, unsigned. No penalty for low contention; high penalty for high contention.]
GTC’16, April 4-7, 2016 94/121
Performance of atomics (4)
Atomics performance: 1 million adds, lots of work per kernel
[Plot: log10(speedup over independent) vs. log10(contention), log scale. "Slowdown from atomics: Summary for 1 million adds, mod, 5 pows." Series: cuda/omp/phi × double, size_t, float, unsigned. No penalty for low contention; high penalty for high contention.]
GTC’16, April 4-7, 2016 95/121
Advanced features
Atomics on arbitrary types:
I Atomic operations work if the corresponding operator exists, i.e., atomic_add works on any data type with "+".
I Atomic exchange works on any data type.

// Assign *dest to val, return former value of *dest
template <typename T>
T atomic_exchange(T * dest, T val);

// If *dest == comp then assign *dest to val
// Return true if it succeeds.
template <typename T>
bool atomic_compare_exchange_strong(T * dest, T comp, T val);
GTC’16, April 4-7, 2016 96/121
Memory traits
Slight detour: View memory traits:
I Beyond a Layout and Space, Views can have memory traits.
I Memory traits either provide convenience or allow for certain hardware-specific optimizations to be performed.

Example: If all accesses to a View will be atomic, use the Atomic memory trait:

View <double**, Layout, Space,
      MemoryTraits<Atomic> > forces (...);

Many memory traits exist or are experimental, including Read, Write, ReadWrite, ReadOnce (non-temporal), Contiguous, and RandomAccess.
GTC’16, April 4-7, 2016 97/121
RandomAccess memory trait
Example: RandomAccess memory trait:
On GPUs, there is a special pathway for fast read-only, random access, originally designed for textures.

How to access texture memory via CUDA:

How to access texture memory via Kokkos:

View <const double ***, Layout, Space,
      MemoryTraits<RandomAccess> > name (...);
GTC’16, April 4-7, 2016 98/121
Section Summary
I Atomics are the only thread-scalable solution to thread safety.
  I Locks or data replication are not portable or scalable.
I Atomic performance depends on the ratio of independent work and atomic operations.
I With more work, there is a lower performance penalty, because of increased opportunity to interleave work and atomics.
I The Atomic memory trait can be used to make all accesses to a View atomic.
I The cost of atomics can be negligible:
  I CPU ideal: contiguous access, integer types
  I GPU ideal: scattered access, 32-bit types
I Many programs with the scatter-add pattern can be thread-scalably parallelized using atomics without much modification.
GTC’16, April 4-7, 2016 99/121
Hierarchical parallelism
Finding and exploiting more parallelism in your computations.

Learning objectives:

I Similarities and differences between outer and inner levels of parallelism
I Thread teams (league of teams of threads)
I Performance improvement with well-coordinated teams
GTC’16, April 4-7, 2016 100/121
Example: inner product (0)
(Flat parallel) Kernel:
Kokkos::parallel_reduce(N,
  KOKKOS_LAMBDA (const int row, double & valueToUpdate) {
    double thisRowsSum = 0;
    for (int col = 0; col < M; ++col) {
      thisRowsSum += A(row,col) * x(col);
    }
    valueToUpdate += y(row) * thisRowsSum;
  }, result);

Problem: What if we don't have enough rows to saturate the GPU?

Solutions?

I Atomics
I Thread teams
GTC’16, April 4-7, 2016 101/121
Example: inner product (1)
Atomics kernel:
Kokkos::parallel_for(N,
  KOKKOS_LAMBDA (const size_t index) {
    const int row = extractRow(index);
    const int col = extractCol(index);
    atomic_add(&result, A(row,col) * x(col));
  });
Problem: Poor performance
GTC’16, April 4-7, 2016 102/121
Example: inner product (2)
Doing each individual row with atomics is like doing scalar integration with atomics.

Instead, you could envision doing a large number of parallel_reduce kernels:

for each row {
  Functor functor(row, ...);
  parallel_reduce(M, functor);
}
This is an example of hierarchical work.
Important concept: Hierarchical parallelism
Algorithms that exhibit hierarchical structure can exploit hierarchical parallelism with thread teams.
GTC’16, April 4-7, 2016 103/121
Example: inner product (3)
Important concept: Thread team
A collection of threads which are guaranteed to be executing concurrently and can synchronize.
High-level strategy:
1. Do one parallel launch of N teams of M threads.
2. Each thread processes one entry in the row.
3. The threads within teams perform a reduction.
4. The thread teams perform a reduction.
GTC’16, April 4-7, 2016 104/121
Example: inner product (4)
The final hierarchical parallel kernel:
parallel_reduce(
  team_policy(N, Kokkos::AUTO),
  KOKKOS_LAMBDA (member_type & teamMember, double & update) {
    int row = teamMember.league_rank();
    double thisRowsSum = 0;
    parallel_reduce(TeamThreadRange(teamMember, M),
      [=] (int col, double & innerUpdate) {
        innerUpdate += A(row, col) * x(col);
      }, thisRowsSum);
    if (teamMember.team_rank() == 0) {
      update += y(row) * thisRowsSum;
    }
  }, result);
GTC’16, April 4-7, 2016 105/121
TeamPolicy (0)
Important point
Using teams is changing the execution policy.
“Flat parallelism” uses RangePolicy:
We specify a total amount of work.
// total work = N
parallel_for(
  RangePolicy<ExecutionSpace>(0, N), functor);

"Hierarchical parallelism" uses TeamPolicy:
We specify a team size and a number of teams.

// total work = numberOfTeams * teamSize
parallel_for(
  TeamPolicy<ExecutionSpace>(numberOfTeams, teamSize), functor);
GTC’16, April 4-7, 2016 106/121
TeamPolicy (1)
Important point
When using teams, functor operators receive a team member.
typedef typename TeamPolicy<ExecSpace>::member_type member_type;

void operator()(const member_type & teamMember) {
  // Which team am I on?
  const unsigned int leagueRank = teamMember.league_rank();
  // Which thread am I on this team?
  const unsigned int teamRank = teamMember.team_rank();
}
Warning
There may be more (or fewer) team members than pieces of your algorithm's work per team.
GTC’16, April 4-7, 2016 107/121
TeamThreadRange (0)
First attempt at inner product exercise:
operator()(const member_type & teamMember) {
  const unsigned int row = teamMember.league_rank();
  const unsigned int col = teamMember.team_rank();
  atomic_add(&result, y(row) * A(row,col) * x(col));
}

I When team size ≠ number of columns, how are units of work mapped to the team's member threads? Is the mapping architecture-dependent?
I atomic_add performs badly under high contention; how can a team's member threads performantly cooperate on a nested reduction?
GTC’16, April 4-7, 2016 108/121
TeamThreadRange (1)
We shouldn’t be hard-coding the work mapping...
operator()(member_type & teamMember, double & update) {
  const int row = teamMember.league_rank();
  double thisRowsSum;
  "do a reduction"("over M columns",
    [=] (const int col) {
      thisRowsSum += A(row,col) * x(col);
    });
  if (teamMember.team_rank() == 0) {
    update += y(row) * thisRowsSum;
  }
}

If this were a parallel execution, we'd use Kokkos::parallel_reduce.
Key idea: this is a parallel execution.
⇒ Nested parallel patterns
GTC’16, April 4-7, 2016 109/121
TeamThreadRange (2)
TeamThreadRange:
operator() (const member_type& teamMember, double& update) {
  const int row = teamMember.league_rank();
  double thisRowsSum;
  parallel_reduce(TeamThreadRange(teamMember, M),
    [=] (const int col, double& rowUpdate) {
      rowUpdate += A(row, col) * x(col);
    }, thisRowsSum);
  if (teamMember.team_rank() == 0) {
    update += y(row) * thisRowsSum;
  }
}
I The mapping of work indices to threads is architecture-dependent.
I The amount of work given to the TeamThreadRange need not be a multiple of the team size.
I Intra-team reduction handled by Kokkos.
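The computation this kernel performs can be sketched in plain serial C++ (no Kokkos required); the outer loop plays the role of the league (one iteration per team, i.e., per row), and the inner loop is the work the TeamThreadRange reduction distributes across the team's threads. The function name and data layout here are illustrative, not part of the tutorial's exercise code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of what the team kernel computes:
// result = <y, A*x> = sum_row y[row] * (sum_col A[row][col] * x[col]).
double inner_product_yAx(const std::vector<std::vector<double>>& A,
                         const std::vector<double>& x,
                         const std::vector<double>& y) {
  double update = 0.0;                  // the outer (league-level) reduction
  for (std::size_t row = 0; row < A.size(); ++row) {
    double thisRowsSum = 0.0;           // the intra-team reduction result
    for (std::size_t col = 0; col < x.size(); ++col) {
      thisRowsSum += A[row][col] * x[col];
    }
    update += y[row] * thisRowsSum;     // done once per team (team_rank 0)
  }
  return update;
}
```

In the Kokkos version, the inner loop becomes the nested parallel_reduce shown above, and the outer accumulation into `update` is guarded so only one thread per team contributes the row's result.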
GTC’16, April 4-7, 2016 110/121
Nested parallelism
Anatomy of nested parallelism:
parallel_outer(
  TeamPolicy<ExecutionSpace>(numberOfTeams, teamSize),
  KOKKOS_LAMBDA (const member_type& teamMember [, ...]) {
    /* beginning of outer body */
    parallel_inner(
      TeamThreadRange(teamMember, thisTeamsRangeSize),
      [=] (const unsigned int indexWithinBatch [, ...]) {
        /* inner body */
      } [, ...]);
    /* end of outer body */
  } [, ...]);
I parallel_outer and parallel_inner may be any combination of for, reduce, or scan.
I The inner lambda may capture by reference, but capture-by-value is recommended.
I The policy of the inner lambda is always a TeamThreadRange.
I TeamThreadRange cannot be nested.
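Since the mapping of range indices to team threads is architecture-dependent, here is a plain C++ sketch (not Kokkos itself, helper names hypothetical) of two valid mappings: GPU-like backends typically stride (round-robin), CPU-like backends typically hand each thread a contiguous block. Either way, every index is visited exactly once, which is why the range size need not be a multiple of the team size.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Round-robin mapping: thread `rank` visits rank, rank+teamSize, ...
std::vector<int> visit_counts_strided(int teamSize, int rangeSize) {
  std::vector<int> counts(rangeSize, 0);
  for (int rank = 0; rank < teamSize; ++rank)
    for (int i = rank; i < rangeSize; i += teamSize)
      ++counts[i];
  return counts;
}

// Blocked mapping: thread `rank` visits one contiguous chunk.
std::vector<int> visit_counts_blocked(int teamSize, int rangeSize) {
  std::vector<int> counts(rangeSize, 0);
  const int chunk = (rangeSize + teamSize - 1) / teamSize;  // ceil-divide
  for (int rank = 0; rank < teamSize; ++rank) {
    const int begin = rank * chunk;
    const int end = std::min(begin + chunk, rangeSize);
    for (int i = begin; i < end; ++i) ++counts[i];
  }
  return counts;
}
```

Both mappings cover a range of 10 indices with a team of 4 exactly once per index; user code written against TeamThreadRange must not assume either ordering.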
GTC’16, April 4-7, 2016 111/121
What should the team size be?
In practice, you can let Kokkos decide:
parallel_something(
  TeamPolicy<ExecutionSpace>(numberOfTeams, Kokkos::AUTO),
  /* functor */);
NVIDIA GPU:
I Special hardware available for coordination within a team.
I Within a team, 32 threads (a warp) execute in "lock step."
I Maximum team size: 1024; Recommended team size: 256
Intel Xeon Phi:
I Recommended team size: # hyperthreads per core
I Hyperthreads share the entire cache hierarchy; a well-coordinated team avoids cache-thrashing
GTC’16, April 4-7, 2016 112/121
Exercise #4: Inner Product, Hierarchical Parallelism
Details:
I Location: ~/GTC2016/Exercises/04/
I Replace RangePolicy<Space> with TeamPolicy<Space>
I Use AUTO for the team size
I Make the inner loop a parallel_reduce with a TeamThreadRange policy
I Experiment with combinations of Layout, Space, and N to view performance
I Hint: what should the layout of A be?
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 3 for very non-square matrices
I Compare behavior of CPU vs GPU
GTC’16, April 4-7, 2016 113/121
Exercise #3: Inner Product, Flat Parallelism
Reminder: Exercise 3 results (flat parallelism)
[Plot: <y,Ax> Exercise 03, fixed problem size. Bandwidth (GB/s) vs. number of rows (1 to 1x10^9), for K40, KNC, and HSW, each with Right and Left layouts.]
GTC’16, April 4-7, 2016 114/121
Exercise #4: Inner Product, Hierarchical Parallelism
New: Exercise 4 results (hierarchical parallelism)
[Plot: <y,Ax> Exercise 04, fixed problem size. Bandwidth (GB/s) vs. number of rows (1 to 1x10^9), for K40, KNC, and HSW, each with Right and Left layouts.]
GTC’16, April 4-7, 2016 115/121
Shared memory
Learning objectives:
I Understand how shared memory can reduce global memory accesses
I Recognize when to use shared memory
I Understand how to use shared memory and why barriers are necessary
GTC’16, April 4-7, 2016 116/121
Shared memory (0)
Each team has access to a “scratch pad”.
GTC’16, April 4-7, 2016 117/121
Shared memory (1)
Shared memory (scratch pad) details:
I Accessing data in shared memory is (usually) much faster than accessing global memory.
I GPUs have separate, dedicated, small, low-latency shared memories (NOT subject to coalescing requirements).
I CPUs don't have special hardware, but programming with shared memory results in cache-aware memory access patterns.
I Roughly, it's like a user-managed L1 cache.
Important concept
When members of a team read the same data multiple times, it's better to load the data into shared memory once and read it from there.
GTC’16, April 4-7, 2016 118/121
Shared memory example: finite difference
Main idea: Load global data into shared memory and reuse
operator() (member_type teamMember) const {
  // Declare team-shared tile of memory
  View<double***,
       execution_space::scratch_memory_space>
    tile(teamMember.team_shared(), ...);
  // Copy subgrid data into tile
  teamMember.team_barrier();
  // Compute stencil using tile
}
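The tile idea can be sketched serially in plain C++ (hypothetical names, not the Kokkos API): each "team" copies its subgrid plus a one-point halo into a small scratch buffer, then computes a 3-point average stencil reading only from the buffer instead of re-reading global memory. The team_barrier() in the Kokkos version guarantees the whole tile is filled before any thread reads from it.

```cpp
#include <cassert>
#include <vector>

// 1D 3-point average stencil, computed tile by tile from a scratch buffer.
// Out-of-range neighbors are treated as zero.
std::vector<double> stencil_with_tiles(const std::vector<double>& grid,
                                       int tileSize) {
  const int n = static_cast<int>(grid.size());
  std::vector<double> out(n, 0.0);
  for (int start = 0; start < n; start += tileSize) {  // one "team" per tile
    std::vector<double> tile(tileSize + 2, 0.0);       // scratch, with halo
    for (int t = -1; t <= tileSize; ++t) {             // copy subgrid + halo
      const int g = start + t;
      if (g >= 0 && g < n) tile[t + 1] = grid[g];
    }
    // (team_barrier() would go here in the parallel version)
    for (int t = 0; t < tileSize && start + t < n; ++t)
      out[start + t] = (tile[t] + tile[t + 1] + tile[t + 2]) / 3.0;
  }
  return out;
}
```

Each interior grid point is read by up to three stencil evaluations; loading it into the tile once means those repeated reads hit the fast scratch memory rather than global memory.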
![Page 221: Kokkos Tutorial - Sandia National Laboratories€¦ · Kokkos Tutorial Je Amelang 2, Christian R. Trott 1, H. Carter Edwards 1Sandia National Laboratories 2Harvey Mudd College GTC’16,](https://reader034.vdocuments.site/reader034/viewer/2022052002/60159e56b367c020ad356808/html5/thumbnails/221.jpg)
GTC’16, April 4-7, 2016 119/121
Advanced features we haven’t covered
I There is a third level in the hierarchy below TeamThreadRange: ThreadVectorRange
I Just like with TeamThreadRange, you can perform parallel_for, parallel_reduce, or parallel_scan.
I Important for full performance of Xeon Phi and GPUs
I Restricting execution to a single member: PerTeam (one thread per team) and PerThread (one vector lane per thread)
I Multiple shared views can be made in shared memory.
![Page 222: Kokkos Tutorial - Sandia National Laboratories€¦ · Kokkos Tutorial Je Amelang 2, Christian R. Trott 1, H. Carter Edwards 1Sandia National Laboratories 2Harvey Mudd College GTC’16,](https://reader034.vdocuments.site/reader034/viewer/2022052002/60159e56b367c020ad356808/html5/thumbnails/222.jpg)
GTC’16, April 4-7, 2016 120/121
Section Summary
I Hierarchical work can be parallelized via hierarchical parallelism.
I Hierarchical parallelism is leveraged using thread teams launched with a TeamPolicy.
I Team "worksets" are processed by a team in nested parallel_for (or reduce or scan) calls with a TeamThreadRange policy.
I Teams can be used to reduce contention for global resources even in "flat" algorithms.
I Teams have access to “scratch pad” shared memory.
![Page 223: Kokkos Tutorial - Sandia National Laboratories€¦ · Kokkos Tutorial Je Amelang 2, Christian R. Trott 1, H. Carter Edwards 1Sandia National Laboratories 2Harvey Mudd College GTC’16,](https://reader034.vdocuments.site/reader034/viewer/2022052002/60159e56b367c020ad356808/html5/thumbnails/223.jpg)
GTC’16, April 4-7, 2016 121/121
Tutorial Takeaways
I High performance computers are increasingly heterogeneous; MPI-only is no longer sufficient.
I For portability: OpenMP, OpenACC, ... or Kokkos.
I Only Kokkos obtains performant memory access patterns via architecture-aware arrays and work mapping,
i.e., not just portable, but performance portable.
I With Kokkos, simple things stay simple (parallel_for, etc.),
i.e., it's no more difficult than OpenMP.
I Advanced performance-optimizing patterns are simpler with Kokkos than with native versions,
i.e., you're not missing out on advanced features.