CS 698L: Parallel Architecture and Programming Models
Swarnendu Biswas
Semester 2019-2020-I
CSE, IIT Kanpur
Content influenced by many excellent references; see the References slide for acknowledgements.
How can we sum up all elements in an array?
int array[1000] = {0, 1, 34, 2, 89, -5, 67, 8, 4, 56, 23, 67, 0, 9, …}
$\text{sum} = \sum_{i=1}^{n} \text{array}[i]$
Comparing Implementations

Main Thread
long sum = 0;
for (int i = 0; i < LEN; i++) {
  sum += array[i];
}
Comparing Implementations

Main Thread
long sum = 0;
for (int i = 0; i < LEN; i++) {
  sum += array[i];
}

Main Thread
Spawn n threads
long thr_sum[n] = {0};
// after all threads have finished
for (int i = 0; i < n; i++) {
  sum += thr_sum[i];
}

Thread i
Compute CHUNK i of array[]
for (int j = CHUNK_START; j < CHUNK_END; j++) {
  thr_sum[i] += array[j];
}
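A runnable sketch of the threaded version above, using POSIX threads (compile with -pthread). The four threads and 10^6-element array mirror the measurement slide later in the deck, but the even chunking and dummy array contents are illustrative assumptions, not something the slides prescribe:

#include <pthread.h>
#include <stdio.h>

#define LEN 1000000
#define NTHREADS 4

static int array[LEN];
static long thr_sum[NTHREADS];

typedef struct { int id; } targ_t;

// Each thread sums its contiguous chunk into its own slot of thr_sum.
static void *worker(void *arg) {
  int id = ((targ_t *)arg)->id;
  int chunk = LEN / NTHREADS;
  int start = id * chunk;
  int end = (id == NTHREADS - 1) ? LEN : start + chunk;
  for (int j = start; j < end; j++)
    thr_sum[id] += array[j];
  return NULL;
}

int main(void) {
  for (int i = 0; i < LEN; i++) array[i] = 1;  // dummy data

  pthread_t tid[NTHREADS];
  targ_t args[NTHREADS];
  for (int i = 0; i < NTHREADS; i++) {
    args[i].id = i;
    pthread_create(&tid[i], NULL, worker, &args[i]);
  }

  long sum = 0;
  for (int i = 0; i < NTHREADS; i++) {
    pthread_join(tid[i], NULL);  // wait before reading thr_sum[i]
    sum += thr_sum[i];
  }
  printf("sum = %ld\n", sum);
  return 0;
}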
Serial vs Parallel Processing
https://computing.llnl.gov/tutorials/parallel_comp/
Is it Worth the Extra Complexity?
Order of magnitude improvement
Array of unsigned ints of size 10^6, and four threads
Parallel Programming Overview
Find parallelization opportunities in the problem
• Decompose the problem into parallel units
Parallel Programming Overview
Find parallelization opportunities in the problem
• Decompose the problem into parallel units

Create parallel units of execution
• Manage efficient execution of the parallel units
Parallel Programming Overview
Find parallelization opportunities in the problem
• Decompose the problem into parallel units

Create parallel units of execution
• Manage efficient execution of the parallel units

The problem may require inter-unit communication
• Communication between threads, cores, …
Inter-unit Communication
Units may be on the same processor, across processors, or across nodes
The problem logic may require inter-unit communication
What do we communicate in sequential programs?
What do we communicate in sequential programs?
• Global variables or data structures
• Function arguments and call parameters
Parallelism vs Concurrency
Parallelism vs Concurrency
Parallel programming
• Use additional resources to speed up computation
• Performance perspective
Concurrent programming
• Correct and efficient control of access to shared resources
• Correctness perspective
Distinction is not absolute
Parallel Architectures
Quick Overview
Architecture Classification
• Popular dimensions for classification
  • Instruction and data stream
  • Source of parallelism
  • Structure of the system
Flynn’s Taxonomy
• Single Instruction Single Data (SISD)
• Single Instruction Multiple Data (SIMD)
Flynn’s Taxonomy
• Multiple Instruction Single Data (MISD)
• Multiple Instruction Multiple Data (MIMD)
Sources of Parallelism in Hardware
• Instruction-Level Parallelism (ILP)
  • Pipelining, out-of-order execution, superscalar, VLIW, …
• Data parallelism
  • Increase the amount of data operated on at the same time (see the SIMD sketch below)
• Processor and resource parallelism
  • Increase the number of processing units, memory bandwidth, …
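As a concrete illustration of data parallelism, here is a hedged sketch of element-wise array addition written with x86 AVX intrinsics. The AVX-capable CPU, the -mavx compile flag, and a length that is a multiple of 8 are assumptions for this example, not details from the slides:

#include <immintrin.h>

// Adds two float arrays 8 elements at a time using 256-bit AVX registers.
// Assumes n is a multiple of 8 and the CPU supports AVX (compile with -mavx).
void vadd(const float *a, const float *b, float *c, int n) {
  for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_loadu_ps(&a[i]);   // load 8 floats from a
    __m256 vb = _mm256_loadu_ps(&b[i]);   // load 8 floats from b
    __m256 vc = _mm256_add_ps(va, vb);    // 8 additions in one instruction
    _mm256_storeu_ps(&c[i], vc);          // store 8 results
  }
}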
Source of Parallelism
Data
• Vector processors, systolic arrays, and SIMD
Control/Function
• Pipelined, superscalar, VLIW processors
• Shared-memory systems, distributed memory systems
Control Parallel Architectures
Function-parallel architectures
• Instruction-level parallel architectures (ILPs)
  • Pipelined processors
  • VLIWs
  • Superscalar processors
• Thread-level parallel architectures
• Process-level parallel architectures (MIMDs)
  • Shared memory MIMD
  • Distributed memory MIMD
Xin Yuang - Parallel Computer Architecture Classification. UFL.
Modern Classification
Uniprocessor
• Scalar processor
• Vector processor
• SIMD
Multiprocessor
• Symmetric multiprocessors (SMP)
• Distributed memory multiprocessor
• SMP clusters
  • Shared memory addressing within a node
  • Message passing between nodes
Performance Metrics of Parallel Architectures
• MIPS – million instructions per second
• MFLOPS – million floating point operations per second
• Which is a better metric?
Shared Memory Architecture
• Single address space shared by multiple cores
• Communication is implicit through memory instructions (i.e., loads and stores)
• Can share data efficiently
[Figure: shared-memory organizations (processors P connected to memory M) via multi-port memory, a shared bus, or an interconnection network]
Implementing Shared Memory
• Uniform memory access (UMA)
• Interconnection network used in the UMA can be a single bus, multiple buses, or a crossbar switch
[Figure: UMA organization (Wikipedia)]
Implementing Shared Memory
• Non-uniform memory access (NUMA)
• Memory access time depends on the distance from the core
Challenges with Shared Memory
• Caches play a key role in SMP performance
  • Reduce average data access time, reduce interconnect bandwidth
• However, private caches create the problem of data coherence
  • Copies of a variable can be present in multiple caches
Sequence of Operations
Core 1 with a private cache
X = 10
x = x + 5
x = x + 15
Sequence of Operations
Core 1 with a private cache
X = 10
x = x + 5
x = x + 15
Final value of x will be 30
Problem of Data Coherence
Core 1 with a private cache
Core 2 with a private cache
X = 10 (in memory)
Problem of Data Coherence
Core 1 with a private cache
Core 2 with a private cache
X = 10 (in memory)
Core 1: Read x
Core 2: Read x
Problem of Data Coherence
Memory: X = 10
Core 1: Read x → Core 1's cache now holds X = 10
Core 2: Read x → Core 2's cache now holds X = 10
Problem of Data Coherence
Memory: X = 10
Core 1: x = x + 5
Core 2: x = x + 15
Core 1's cache: X = 10
Core 2's cache: X = 10
Problem of Data Coherence
Memory: X = 10
Core 1: x = x + 5 → Core 1's cache: X = 15
Core 2: x = x + 15 → Core 2's cache: X = 25
Problem of Data Coherence
Memory: X = 15 (Core 1's update is written back)
Core 1's cache: X = 15
Core 2's cache: X = 25
Problem of Data Coherence
Memory: X = 25 (Core 2's update is written back)
Core 1's cache: X = 15
Core 2's cache: X = 25

Depending on the write-back order, memory ends up with 15 or 25 instead of the expected 30: the cached copies of x have become incoherent.
Challenges with Shared Memory
• Caches play a key role in SMP performance
  • Reduce average data access time, reduce interconnect bandwidth
• Private caches create the data coherence problem
  • Copies of a variable can be present in multiple caches
• Need support for cache coherence
Challenges with Shared Memory
• Access conflicts: several threads can try to access the same shared location
  • A data race occurs if the accesses are not correctly synchronized and one of the accesses is a write
• Synchronization is not cheap
• Programmer responsible for synchronized accesses to memory
• Coherence operations can become a bottleneck
  • Takes time and effort to keep shared-memory locations consistent
  • Traffic due to data and cache/memory management
• Lack of scalability
• Other performance hazards – false sharing
False Sharing
Intel. Avoiding and Identifying False Sharing Among Threads.
What is going on here?
False Sharing
Intel. Avoiding and Identifying False Sharing Among Threads.
What is going on here?
Does anyone remember the MESI protocol?
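To make the issue concrete, here is a hedged sketch (not the Intel code referenced above) in which two threads update adjacent array elements that happen to share a cache line; the padding trick mentioned in the comments is one common fix. The two-thread count, iteration count, and 64-byte line size are assumptions for this example:

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000

// sum[0] and sum[1] typically sit in the same 64-byte cache line, so the two
// threads keep invalidating each other's copy even though they touch different
// data. A common fix is to pad each element to a full cache line, e.g.
//   struct { long val; char pad[56]; } sum[2];
long sum[2];

static void *worker(void *arg) {
  int id = *(int *)arg;
  for (long i = 0; i < ITERS; i++)
    sum[id] += i;          // false sharing: different element, same cache line
  return NULL;
}

int main(void) {
  pthread_t t[2];
  int ids[2] = {0, 1};
  for (int i = 0; i < 2; i++)
    pthread_create(&t[i], NULL, worker, &ids[i]);
  for (int i = 0; i < 2; i++)
    pthread_join(t[i], NULL);
  printf("%ld %ld\n", sum[0], sum[1]);
  return 0;
}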
State Transitions in MESI
A Primer on Memory Consistency and Cache Coherence.
Distributed Memory Architecture
• Each processor has its own private memory
  • Physically separated memory address spaces
• Processors must communicate to access non-local data
  • Also called a message passing architecture
• Requires an interconnection network for communication
  • The interconnection network topology is a key design factor and determines how the system scales
• Need high bandwidth for communication
[Figure: distributed-memory architecture (Wikipedia)]
Advantages of Distributed Memory
• Memory scales with the number of processors
• Each processor can quickly access its own memory without the need for global coherence
• Can use off-the-shelf components
Be clear with uses!
Parallel computing
• Multiple tasks in a program cooperate to solve a problem efficiently
Concurrent programming
• Multiple tasks in a program can be in progress at the same time
Distributed computing
• A program needs to cooperate with other programs to solve a problem
CS 698L Swarnendu Biswas
Yonghong Yan.
Parallel Programming Models
Parallel Programming Models
• An abstraction of parallel computer architectures
• Building block to design algorithms and write programs
• Dimensions
  • Performance – how efficiently can programs run
  • Productivity – how easy is it to develop programs
Parallel Programming Models
Shared-memory
Distributed memory
Data parallel (PGAS)
Single program multiple data (SPMD)
Multiple program multiple data (MPMD)
Hybrid
Shared Memory without Threads
• Processes share a common address space
  • The notion of ownership of data is missing, which complicates matters
• Unix-like systems provide support via functions like shm_open(), shmget(), and shmctl() (see the sketch below)
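A minimal sketch of process-level shared memory using POSIX shm_open() and mmap(); the segment name "/demo_shm" and its size are illustrative assumptions, and error checking is omitted for brevity (link with -lrt on some systems):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  // Create (or open) a named shared-memory object and size it.
  int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
  ftruncate(fd, sizeof(long));

  // Map it into this process's address space; another process that
  // shm_open()s "/demo_shm" and mmap()s it sees the same memory.
  long *shared = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
  *shared = 42;                     // visible to other processes
  printf("wrote %ld\n", *shared);

  munmap(shared, sizeof(long));
  close(fd);
  shm_unlink("/demo_shm");          // remove the name when done
  return 0;
}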
Shared Memory with Threads
• A single process can be composed of multiple worker threads
• Threads are the software analog of cores
  • Each thread has its own PC, SP, registers, etc.
• All threads share the process heap and the global data structures
Shared Memory with Threads
• All threads access the shared address space
  • Threads also have private memory (e.g., their own stacks)
• Synchronization is required to access shared resources
  • Can otherwise lead to pernicious bugs (see the sketch below)
• The runtime system schedules threads to cores
  • Concurrent units of execution
  • If there are more threads than cores, the runtime will time-slice threads onto the cores
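A hedged sketch of why synchronization matters: two threads increment a shared counter, and the pthread_mutex_t serializes each read-modify-write so no update is lost. The counter, iteration count, and thread count are illustrative assumptions:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// Without the lock, concurrent counter++ operations race and updates get lost.
static void *increment(void *arg) {
  for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, increment, NULL);
  pthread_create(&t2, NULL, increment, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %ld\n", counter);  // always 2000000 with the lock
  return 0;
}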
Distributed Memory
• The problem size may not fit on a single machine
  • For example, graph analytics
• Obvious step: Go distributed!
• Distributed computing model
  • Launch multiple processes on multiple systems
  • Processes carry out work
  • Processes may communicate through message passing
  • Processes coordinate either through message passing or synchronization
Distributed Memory
• Also called the message passing programming model
• A set of tasks that use local memory for computation
  • Tasks exchange data via communication
• MPI is a popular runtime (see the sketch below)
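A hedged MPI sketch of the running array-sum example: every rank sums a private chunk and MPI_Reduce combines the partial sums on rank 0. The chunk size, the stand-in data, and the assumption that LEN divides evenly among ranks are illustrative, not from the slides:

#include <mpi.h>
#include <stdio.h>

#define LEN 1000000

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Each rank owns a contiguous chunk of the (conceptually global) array.
  int chunk = LEN / nprocs;  // assumes LEN is divisible by nprocs
  long local_sum = 0;
  for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
    local_sum += 1;  // stand-in for array[i]; no shared memory here

  // Explicit communication: combine the partial sums on rank 0.
  long global_sum = 0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum = %ld\n", global_sum);

  MPI_Finalize();
  return 0;
}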
Challenges with Distributed Memory
Often, communication turns out to be the primary bottleneck
• How do you partition the data between different nodes?
• Network topology is very important for scalability
• Non-uniform memory access

Since communication is explicit, it excludes race conditions
• It is the programmer's responsibility to synchronize between tasks
Shared Memory vs Distributed Memory Programming
Shared Memory
• Communication is implicit
• Explicit synchronization
• Requires hardware support for coherence
• Lower development effort to begin with
Distributed Memory
• Communication via explicit messages
• Synchronization implicit via messages
• Requires support for in-node coherence and network communication
• Higher development effort to begin with
Data Parallel
• Also known as the partitioned global address space (PGAS) model
  • The address space is global, and partitioned among tasks
  • Tasks operate on their own partition
  • Can exploit locality of reference
• Implementations
  • Unified Parallel C (UPC)
  • X10 from IBM
  • Chapel
Data Parallel
• No library calls for communication
• Variables can name memory locations on other machines
  • Assume y points to a remote location
• The following is equivalent to a send/receive
• x = *y
Single Program Multiple Data (SPMD)
• Tasks execute the same copy of the program on different data
  • Data parallel
• Can be threads or message passing interface
Multiple Program Multiple Data (MPMD)
• Tasks may execute different programs with different data
Hybrid
• Combine more than one of the other models
• Examples
  • Combine shared memory within nodes and exchange data over the network between nodes (see the sketch below)
  • Use GPUs for compute kernels, with CUDA for exchange between host and device and MPI for inter-node communication
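A hedged sketch of the first hybrid style: MPI between nodes, OpenMP threads within a node. It reuses the array-sum idea; the per-rank chunk size and the stand-in data are assumptions, and error handling is omitted:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define CHUNK 1000000   // elements handled per MPI rank (assumed)

int main(int argc, char **argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Shared memory within the node: OpenMP threads sum this rank's chunk.
  long local_sum = 0;
  #pragma omp parallel for reduction(+ : local_sum)
  for (int i = 0; i < CHUNK; i++)
    local_sum += 1;  // stand-in for this rank's array element

  // Message passing between nodes: combine per-rank sums on rank 0.
  long global_sum = 0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum = %ld\n", global_sum);

  MPI_Finalize();
  return 0;
}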
References
• James Demmel and Katherine Yelick – CS 267: Shared Memory Programming: Threads and OpenMP
• Keshav Pingali – CS 377P: Programming Shared-memory Machines, UT Austin.
• Blaise Barney, LLNL. Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp/