![Page 1: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/1.jpg)
GPU Programming on CPUs Using C++AMP
Miller Lee
![Page 2: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/2.jpg)
Outline
1. Introduction to C++AMP
2. Introduction to Tiling
3. tile_static
4. barrier.wait and solutions
   a. C++11 thread
   b. setjmp/longjmp
   c. ucontext
![Page 3: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/3.jpg)
Computing example
● Simple matrix multiplication
[Figure: Matrix A (4 × 4, elements indexed (0, 0) through (3, 3)) × b (elements 0 to 3) = result (elements 0 to 3). The figure is labeled "(Homogeneous coordinates)".]
![Page 4: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/4.jpg)
C++ Version

    int A[4][4];
    int b[4];
    int result[4];
    for (int i = 0; i < 4; i++) {
        result[i] = 0;
        for (int j = 0; j < 4; j++)
            result[i] += A[i][j] * b[j];
    }
![Page 5: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/5.jpg)
C++AMP Version

    array_view<float, 2> A(4, 4);
    array_view<float, 1> b(4);
    array_view<float, 1> result(4);
    extent<1> ext(4);
    // array_view objects are captured by value, as restrict(amp) lambdas require
    parallel_for_each(ext, [=](index<1> idx) restrict(amp)
    {
        result[idx[0]] = 0;
        for (int i = 0; i < 4; i++)
            result[idx[0]] += A(idx[0], i) * b(i);
    });
![Page 6: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/6.jpg)
memory access
[Figure: processors P0 to P3 read elements 0 to 3 of b from global memory; one global memory access costs 100t.]
Total access time = 400t
![Page 7: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/7.jpg)
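shared memory
[Figure: elements 0 to 3 of b are placed in shared memory; one shared memory access costs 10t and one global memory access costs 100t.]
Total access time = 130t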
![Page 8: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/8.jpg)
    array_view<float, 2> A(4, 4);
    array_view<float, 1> b(4);
    array_view<float, 1> result(4);
    extent<1> ext(4);
    parallel_for_each(ext.tile<4>(), [=](tiled_index<4> tidx) restrict(amp)
    {
        int local = tidx.local[0];
        int global = tidx.global[0];
        tile_static float buf[4];
        buf[local] = b[global];      // each thread stages one element of b
        tidx.barrier.wait();         // wait until the whole tile has filled buf
        result[global] = 0;
        for (int i = 0; i < 4; i++)
            result[global] += A(global, i) * buf[i];
    });
![Page 9: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/9.jpg)
barrier
![Page 10: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/10.jpg)
Architecture
source: NVIDIA Tesla: A Unified Graphics and Computing Architecture
shared memory: accessible to all SPs
![Page 11: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/11.jpg)
Goal
● Implement all C++AMP functions on the CPU instead of the GPU, without any compiler modification.
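To make the goal concrete, here is a minimal sketch of how the untiled kernel could run on a CPU with a plain loop. The names `index1`, `extent1`, and this `parallel_for_each` are hypothetical stand-ins, not the real C++AMP types or the author's implementation, and `restrict(amp)` is omitted because a standard compiler does not accept it.

```cpp
#include <cassert>
#include <vector>

namespace cpu_amp_sketch {

// Hypothetical stand-ins for concurrency::index<1> and concurrency::extent<1>.
struct index1 {
    int value;
    int operator[](int) const { return value; }
};
struct extent1 {
    int size;
};

// Runs the kernel once per index, sequentially, on the calling thread.
template <typename Kernel>
void parallel_for_each(extent1 ext, Kernel kernel) {
    for (int i = 0; i < ext.size; ++i)
        kernel(index1{i});
}

// result = A * b, with A stored row-major as an n*n vector.
std::vector<float> multiply(const std::vector<float>& A,
                            const std::vector<float>& b) {
    const int n = static_cast<int>(b.size());
    assert(A.size() == b.size() * b.size());
    std::vector<float> result(n);
    parallel_for_each(extent1{n}, [&](index1 idx) {
        result[idx[0]] = 0;
        for (int i = 0; i < n; i++)
            result[idx[0]] += A[idx[0] * n + i] * b[i];
    });
    return result;
}

}  // namespace cpu_amp_sketch
```

A sequential loop is enough here because the untiled kernel never synchronizes; the tiled version needs more, since every thread must reach barrier.wait before any thread continues.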
![Page 12: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/12.jpg)
tile_static
● The limitations of C++ syntax lead to the following choices
  ○ const, volatile
  ○ __attribute__(...)
  ○ static
● Choose static
  ○ static memory can be shared among all the threads
  ○ side effect: at most one thread group can be executed at a time.
#define tile_static static
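A small sketch of what the macro does, assuming a hypothetical kernel body named `kernel_step`: after the macro expands, the buffer is an ordinary function-local static, so every call (and therefore every emulated thread) sees the same storage.

```cpp
#include <cassert>

#define tile_static static   // the definition from the slide

// Hypothetical kernel body: stores one value, then sums the shared buffer.
int kernel_step(int local, int value) {
    assert(local >= 0 && local < 4);
    // Expands to "static int buf[4];": zero-initialized, one copy for all callers.
    tile_static int buf[4];
    buf[local] = value;
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += buf[i];
    return sum;
}
```

Because there is only one `buf` in the whole program, two thread groups running at the same time would overwrite each other's data, which is the side effect noted above.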
![Page 13: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/13.jpg)
Barrier.wait
● Threads in the same thread group block at the point where “wait” is called until every thread in the group has reached it.
● The program can either
  a. perform a real barrier action, or
  b. jump out of the current execution context
![Page 14: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/14.jpg)
Approaches
● True threading
  ○ C++11 thread
● Fake threading (coroutines)
  ○ setjmp/longjmp
  ○ makecontext/getcontext/swapcontext/setcontext
![Page 15: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/15.jpg)
C++11 thread
● Launch hundreds of threads at a time.
● Implement a custom barrier using the C++11 mutex library.
→ Extremely slow.
→ The data in static memory will be corrupted.
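The slides do not show the mutex-based barrier itself, so the following is only a sketch of one common way to build it, using `std::mutex` and `std::condition_variable`. `MutexBarrier` and `run_group` are hypothetical names, and the buffer is a local vector rather than static memory.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical mutex-based barrier for a fixed number of threads.
class MutexBarrier {
public:
    explicit MutexBarrier(unsigned count)
        : count_(count), waiting_(0), generation_(0) {
        assert(count > 0);
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        const unsigned gen = generation_;
        if (++waiting_ == count_) {
            // Last arrival: reset the counter and release everyone.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            // Sleep until the last arrival advances the generation.
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_, waiting_, generation_;
};

// Each thread stages one element, waits at the barrier, then sums them all.
std::vector<int> run_group(unsigned n) {
    std::vector<int> buf(n), sums(n);
    MutexBarrier bar(n);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n; ++t)
        threads.emplace_back([&buf, &sums, &bar, t] {
            buf[t] = static_cast<int>(t) + 1;
            bar.wait();
            int s = 0;
            for (int v : buf) s += v;
            sums[t] = s;
        });
    for (auto& th : threads) th.join();
    return sums;
}
```

With one operating-system thread per kernel thread, every barrier costs a lock, a sleep, and a wake-up per thread, which is consistent with the slowdown reported on the slide.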
![Page 16: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/16.jpg)
setjmp/longjmp
● int setjmp(jmp_buf env);
  ○ setjmp() saves the stack context/environment in env for later use by longjmp().
  ○ The stack context will be invalidated if the function which called setjmp() returns.
● void longjmp(jmp_buf env, int val);
  ○ longjmp() restores the environment saved by the last call of setjmp().
![Page 17: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/17.jpg)
    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf buf;

    void wait(void) {
        printf("wait\n");       // prints
        longjmp(buf, 1);
    }

    void first(void) {
        wait();
        printf("first\n");      // does not print
    }

    int main() {
        if (!setjmp(buf))
            first();            // when executed, setjmp returns 0
        else                    // when longjmp jumps back, setjmp returns 1
            printf("main\n");   // prints
        return 0;
    }
![Page 18: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/18.jpg)
Pseudo code (1)

    void entry() {
        while (!finish)
            for (t : tasks)
                run(t)
    }

    void fun() {
        …
        wait();
        ...
    }

    void fun() {
        …
        wait();
        ...
    }
![Page 19: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/19.jpg)
Pseudo code (2)

    void entry() {
        while (!finish)
            for (t : tasks)
                run(t)
    }

    void fun() {
        …
        wait();
        ...
    }

    void fun() {
        …
        wait();
        ...
    }
![Page 20: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/20.jpg)
    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf buf, b;

    void wait(void) {
        printf("wait\n");
        if (setjmp(b) == 0)
            longjmp(buf, 1);
    }

    void first(void) {
        wait();
    }

    int main() {
        if (!setjmp(buf))
            first();
        else {
            printf("main\n");
            longjmp(b, 10);
        }
        return 0;
    }
![Page 21: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/21.jpg)
[Figure: the same code as on the previous slide, next to a stack diagram in which buf is marked.]
![Page 22: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/22.jpg)
[Figure: the same code, next to a stack diagram in which a return address, buf, and b are marked.]
![Page 23: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/23.jpg)
[Figure: the same code, next to a stack diagram in which buf and b are marked.]
![Page 24: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/24.jpg)
[Figure: the same code, next to a stack diagram in which buf and b are marked and the remaining slots read "???".]
Cannot return???
![Page 25: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/25.jpg)
Problems
● Cannot return
  ○ the return address on the stack is destroyed
● Cannot use too many static variables
  ○ spilled registers will be lost
→ can be solved by using “alloca”
http://www.codemud.net/~thinker/GinGin_CGI.py/show_id_doc/489
![Page 26: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/26.jpg)
ucontext.h
● ucontext_t
● getcontext
● makecontext
● swapcontext
● setcontext
![Page 27: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/27.jpg)
ucontext_t

    typedef struct ucontext {
        struct ucontext *uc_link;
        sigset_t         uc_sigmask;
        stack_t          uc_stack;
        mcontext_t       uc_mcontext;
        ...
    } ucontext_t;

● uc_link
  ○ points to the context that will be resumed when the current context terminates
● uc_stack
  ○ the stack used by this context
● uc_mcontext
  ○ machine-specific representation of the saved context, which includes the calling thread's machine registers
![Page 28: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/28.jpg)
Functions
● int getcontext(ucontext_t *ucp);
  ○ initializes the structure pointed at by ucp.
● int setcontext(const ucontext_t *ucp);
  ○ restores the user context pointed at by ucp
● int swapcontext(ucontext_t *oucp, const ucontext_t *ucp);
  ○ saves the current context in the structure pointed to by oucp, and then activates the context pointed to by ucp.
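As a quick illustration of getcontext and setcontext working together, here is a minimal sketch in which setcontext jumps back to the point where getcontext saved the context, forming a loop without any loop statement. The function name `count_resumes` is hypothetical; the counter is volatile so that its value lives in memory rather than in a register restored by setcontext.

```cpp
#include <cassert>
#include <ucontext.h>

// Returns how many times execution passed the getcontext() point
// before reaching the limit.
int count_resumes(int limit) {
    assert(limit >= 0);
    ucontext_t ctx;
    volatile int n = 0;   // kept in memory, so it survives the register restore
    getcontext(&ctx);     // execution continues from here after each setcontext()
    if (n < limit) {
        n = n + 1;
        setcontext(&ctx); // does not return; jumps back to the saved point
    }
    return n;
}
```

This only works because the frame of `count_resumes` is still live when setcontext runs, the same restriction that applies to setjmp/longjmp.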
![Page 29: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/29.jpg)
makecontext
● void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
  ○ glibc (x86_64) passes the arguments in registers, as the AMD64 ABI specifies, instead of pushing them on the stack
  ○ The size of the arguments passed to makecontext should be no less than sizeof(register)
![Page 30: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/30.jpg)
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t ctx[2];

    static void f1 (void) {
        puts("start f1");
        swapcontext(&ctx[1], &ctx[0]);
        puts("finish f1");
    }

    int main (void)
    {
        char st1[8192];
        getcontext(&ctx[1]);
        ctx[1].uc_stack.ss_sp = st1;
        ctx[1].uc_stack.ss_size = sizeof st1;
        ctx[1].uc_link = &ctx[0];
        makecontext(&ctx[1], f1, 0);
        swapcontext(&ctx[0], &ctx[1]);
        swapcontext(&ctx[0], &ctx[1]);
        return 0;
    }
![Page 31: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/31.jpg)
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t ctx[3];

    static void f1 (void) {
        puts("start f1");
        swapcontext(&ctx[1], &ctx[0]);
        puts("finish f1");
    }

    static void f2 (void)
    {
        puts("start f2");
        swapcontext(&ctx[2], &ctx[1]);
        puts("finish f2");
    }

    int main (void)
    {
        char st1[8192], st2[8192];
        getcontext(&ctx[1]);
        ctx[1].uc_stack.ss_sp = st1;
        ctx[1].uc_stack.ss_size = sizeof st1;
        ctx[1].uc_link = &ctx[0];
        makecontext(&ctx[1], f1, 0);

        getcontext(&ctx[2]);
        ctx[2].uc_stack.ss_sp = st2;
        ctx[2].uc_stack.ss_size = sizeof st2;
        ctx[2].uc_link = &ctx[1];
        makecontext(&ctx[2], f2, 0);
        swapcontext(&ctx[0], &ctx[2]);
        swapcontext(&ctx[0], &ctx[2]);
        return 0;
    }
![Page 32: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/32.jpg)
Fake threading (yield)

    void entry() {
        setup(fun, 2);
        while (!finish)
            switch_to();
    }

    void fun() {
        …
        wait();
        ...
    }

    void fun() {
        …
        wait();
        ...
    }
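The pseudo code above can be fleshed out with ucontext roughly as follows. This is only a sketch under assumptions, not the author's implementation: the tile size `kTasks`, the stack size `kStack`, the round-robin scheduler in `entry`, and the names `barrier_wait`, `buf`, and `result` are all hypothetical. Each coroutine runs until it reaches the barrier, yields to the scheduler, and is resumed only after every other coroutine has had its turn.

```cpp
#include <cassert>
#include <cstddef>
#include <ucontext.h>
#include <vector>

namespace sketch {

constexpr int kTasks = 4;                   // "threads" in one tile (assumed)
constexpr std::size_t kStack = 64 * 1024;   // per-coroutine stack size (assumed)

ucontext_t scheduler;      // context of entry()
ucontext_t task[kTasks];   // one context per emulated thread
bool done[kTasks];
int current = 0;           // index of the coroutine that is running

int buf[kTasks];           // plays the role of tile_static storage
int result[kTasks];

// barrier.wait(): yield to the scheduler instead of blocking.
void barrier_wait() { swapcontext(&task[current], &scheduler); }

void fun() {
    const int id = current;
    buf[id] = id + 1;      // phase 1: stage one element
    barrier_wait();        // resumed after all coroutines finished phase 1
    int sum = 0;
    for (int v : buf) sum += v;
    result[id] = sum;      // phase 2: read what the others staged
    done[id] = true;
}

void entry() {
    std::vector<std::vector<char>> stacks(kTasks, std::vector<char>(kStack));
    for (int i = 0; i < kTasks; ++i) {
        getcontext(&task[i]);
        task[i].uc_stack.ss_sp = stacks[i].data();
        task[i].uc_stack.ss_size = kStack;
        task[i].uc_link = &scheduler;   // return here when fun() finishes
        done[i] = false;
        makecontext(&task[i], fun, 0);
    }
    bool finish = false;
    while (!finish) {
        finish = true;
        // One pass runs every unfinished coroutine up to its next barrier.
        for (current = 0; current < kTasks; ++current) {
            if (done[current]) continue;
            swapcontext(&scheduler, &task[current]);
            if (!done[current]) finish = false;
        }
    }
}

}  // namespace sketch
```

Calling `sketch::entry()` runs the four coroutines on a single operating-system thread; no lock is needed because only one coroutine executes at any moment.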
![Page 33: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/33.jpg)
Problems
1. How to pass a lambda?
  ○ makecontext(&ctx, (void (*)(void))&Kernel::operator(), …);
2. How to pass non-int arguments?
  ○ What if sizeof(Type) > sizeof(int)?
  ○ What about complex structures and classes?
![Page 34: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/34.jpg)
Pass lambda
1. Use a wrapper function!!

    template <typename Ker, typename Arg>
    void fun(Ker k, Arg arg)
    {
        k(arg);
    }

    template <typename Ker, typename Arg>
    void makectx(Ker k, Arg arg)
    {
        makecontext(&ctx, (void (*)(void))fun<Ker, Arg>, 2, k, arg);
    }
![Page 35: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/35.jpg)
Pass non-int arguments
2. Pass pointer instead!!

    template <typename Ker, typename Arg>
    void fun(Ker *k, Arg *arg)
    {
        (*k)(*arg);
    }

    // k and arg are taken by reference so the pointers stay valid after makectx returns
    template <typename Ker, typename Arg>
    void makectx(Ker &k, Arg &arg)
    {
        makecontext(&ctx, (void (*)(void))fun<Ker, Arg>, 2, &k, &arg);
    }
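Putting the wrapper and the pointer trick together, a runnable sketch might look like this. It relies on the behavior described on the makecontext slide, namely that pointer-sized variadic arguments reach the entry function intact on x86_64 glibc; POSIX only specifies int arguments, so treat this as platform-specific. The names `run_in_context`, `demo`, and the 64 KB stack are hypothetical.

```cpp
#include <cassert>
#include <ucontext.h>

namespace lambda_sketch {

ucontext_t main_ctx, ctx;

// Wrapper with a plain function signature that calls the kernel object.
template <typename Ker, typename Arg>
void fun(Ker* k, Arg* arg) {
    (*k)(*arg);
}

// Runs k(arg) on a separate context and returns when the kernel finishes.
template <typename Ker, typename Arg>
void run_in_context(Ker& k, Arg& arg) {
    static char stack[64 * 1024];
    getcontext(&ctx);
    ctx.uc_stack.ss_sp = stack;
    ctx.uc_stack.ss_size = sizeof stack;
    ctx.uc_link = &main_ctx;
    void (*entry)(Ker*, Arg*) = fun<Ker, Arg>;
    // Assumption: pointer-sized arguments survive makecontext (x86_64 glibc).
    makecontext(&ctx, reinterpret_cast<void (*)()>(entry), 2, &k, &arg);
    swapcontext(&main_ctx, &ctx);
}

int demo() {
    int out = 0;
    auto kernel = [&out](int x) { out = x * 2; };   // stands in for an AMP kernel
    int arg = 21;
    run_in_context(kernel, arg);
    return out;
}

}  // namespace lambda_sketch
```

The kernel and its argument are passed by reference into `run_in_context`, so the pointers handed to makecontext refer to objects that are still alive while the new context runs.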
![Page 36: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/36.jpg)
Additional
● Use a counter so that we can spawn coroutines dynamically
● Can it be multithreaded? Yes
![Page 37: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/37.jpg)
true threading
barrier
There are 12 threads in one thread group
![Page 38: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/38.jpg)
one thread
barrier
![Page 39: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/39.jpg)
multithreading
barrier
Hardware Core = 4
![Page 40: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/40.jpg)
barrier

    #include <atomic>

    struct bar_t {
        unsigned const count;
        std::atomic<unsigned> spaces;
        std::atomic<unsigned> generation;
        bar_t(unsigned count_) : count(count_), spaces(count_), generation(0) {}
        void wait() noexcept {
            unsigned const my_generation = generation;
            if (!--spaces) {
                // last thread to arrive: reset the counter and release the others
                spaces = count;
                ++generation;
            } else {
                // spin until the last thread advances the generation
                while (generation == my_generation);
            }
        }
    };

source: C++ Concurrency in Action: Practical Multithreading
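A usage sketch for the barrier above, showing that one bar_t object can be reused across several synchronization points. The driver `run_rounds` is hypothetical, and the spin loop here calls `std::this_thread::yield()`, a small deviation from the slide so that the example behaves well on machines with fewer cores than threads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct bar_t {
    unsigned const count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;
    bar_t(unsigned count_) : count(count_), spaces(count_), generation(0) {}
    void wait() noexcept {
        unsigned const my_generation = generation;
        if (!--spaces) {
            spaces = count;
            ++generation;
        } else {
            while (generation == my_generation)
                std::this_thread::yield();   // added here; the slide spins without yielding
        }
    }
};

// Every thread passes the same barrier three times.
std::vector<int> run_rounds(unsigned n) {
    assert(n > 0);
    std::vector<int> buf(n), totals(n);
    bar_t bar(n);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back([&buf, &totals, &bar, t] {
            buf[t] = static_cast<int>(t) + 1;   // round 1: write own slot
            bar.wait();
            int s = 0;
            for (int v : buf) s += v;           // read all slots
            bar.wait();                         // nobody overwrites buf while others still read
            buf[t] = s;                         // round 2: write the sum back
            bar.wait();
            int total = 0;
            for (int v : buf) total += v;
            totals[t] = total;
        });
    for (auto& w : workers) w.join();
    return totals;
}
```

With four threads, each first-round sum is 1 + 2 + 3 + 4 = 10, so every thread ends with a total of 4 × 10 = 40.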
![Page 41: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/41.jpg)
Summary
● It works fine on AMP right now
● The importance of low level knowledge
![Page 42: GPU Programming on CPU - Using C++AMP](https://reader031.vdocuments.site/reader031/viewer/2022020110/559137451a28ab0d498b45cb/html5/thumbnails/42.jpg)