![Page 1: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/1.jpg)
Parallel ModelsDifferent ways to exploit parallelism
![Page 2: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/2.jpg)
Reusing this material
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_US
This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: You must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must
distribute your work under the same license as the original.
Acknowledge EPCC as follows: “© EPCC, The University of Edinburgh, www.epcc.ed.ac.uk”
Note that this presentation contains images owned by others. Please seek their permission before reusing these images.
2
![Page 3: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/3.jpg)
Outline
• Shared-Variables Parallelism
- threads
- shared-memory architectures
• Message-Passing Parallelism
- processes
- distributed-memory architectures
• Practicalities
- usage on real HPC architectures
3
![Page 4: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/4.jpg)
Shared Variables
Threads-based parallelism
4
![Page 5: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/5.jpg)
Shared-memory concepts
• Have already covered basic concepts
- threads can all see data of parent process
- can run on different cores
- potential for parallel speedup
5
![Page 6: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/6.jpg)
Analogy
• One very large whiteboard in a two-person office
- the shared memory
• Two people working on the same problem
- the threads running on different cores attached to the memory
• How do they collaborate?
- working together
- but not interfering
• Also need private data
my
data
shared
datamy
data
6
![Page 7: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/7.jpg)
Threads
7
PC PC PCPrivate data Private data Private data
Shared data
Thread 1 Thread 2 Thread 3
![Page 8: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/8.jpg)
Thread 1 Thread 2
mya=23
mya=a+1
23
23 24
Program
Private
data
Shared
data
a=mya
Thread Communication
8
![Page 9: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/9.jpg)
Synchronisation
• Synchronisation crucial for shared variables approach
- thread 2’s code must execute after thread 1
• Most commonly use global barrier synchronisation
- other mechanisms such as locks also available
• Writing parallel codes relatively straightforward
- access shared data as and when its needed
• Getting correct code can be difficult!
9
![Page 10: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/10.jpg)
Specific example
• Computing asum = a0+ a1 + … a7
- shared:
• main array: a[8]
• result: asum
- private:
• loop counter: i
• loop limits: istart, istop
• local sum: myasum
- synchronisation:
• thread0: asum += myasum
• barrier
• thread1: asum += myasum
loop: i = istart,istop
myasum += a[i]
end loop
asum
asum=0
10
![Page 11: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/11.jpg)
Reductions
• A reduction produces a single value from associative operations such as
addition, multiplication, max, min, and, or.
asum = 0;
for (i=0; i < n; i++)
asum += a[i];
• Only one thread at a time updating asum removes all parallelism
- each thread accumulates own private copy; copies reduced to give final result.
- if the number of operations is much larger than the number of threads, most of
the operations can proceed in parallel
• Want common patterns like this to be automated
- not programmed by hand as in previous slide
11
![Page 12: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/12.jpg)
Hardware
• Needs support of a shared-memory architecture
12
Memory
Processor
Shared Bus
Processor Processor Processor Processor
Single Operating System
![Page 13: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/13.jpg)
Thread Placement: Shared Memory
13
OS
User
TTT T T TT TTT T T TTT T
![Page 14: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/14.jpg)
Threads in HPC
• Threads existed before parallel computers
- Designed for concurrency
- Many more threads running than physical cores
• scheduled / descheduled as and when needed
• For parallel computing
- Typically run a single thread per core
- Want them all to run all the time
• OS optimisations
- Place threads on selected cores
- Stop them from migrating
14
![Page 15: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/15.jpg)
Practicalities• Threading can only operate within a single node
- Each node is a shared-memory computer (e.g. 24 cores on ARCHER)
- Controlled by a single operating system
• Simple parallelisation
- Speed up a serial program using threads
- Run an independent program per node (e.g. a simple task farm)
• More complicated
- Use multiple processes (e.g. message-passing – next)
- On ARCHER: could run one process per node, 24 threads per
process
• or 2 procs per node / 12 threads per process or 4 / 6 ...
15
![Page 16: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/16.jpg)
Threads: Summary
• Shared blackboard a good analogy for thread parallelism
• Requires a shared-memory architecture
- in HPC terms, cannot scale beyond a single node
• Threads operate independently on the shared data
- need to ensure they don’t interfere; synchronisation is crucial
• Threading in HPC usually uses OpenMP directives
- supports common parallel patterns
- e.g. loop limits computed by the compiler
- e.g. summing values across threads done automatically
16
![Page 17: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/17.jpg)
Message Passing
Process-based parallelism
17
![Page 18: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/18.jpg)
Analogy
• Two whiteboards in different single-person offices
- the distributed memory
• Two people working on the same problem
- the processes on different nodes attached to the interconnect
• How do they collaborate?
- to work on single problem
• Explicit communication
- e.g. by telephone
- no shared data
my
data
my
data
18
![Page 19: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/19.jpg)
a=23 Recv(1,b)
Process 1 Process 2
23
23
24
23
Program
Data
Send(2,a) a=b+1
Process communication
19
![Page 20: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/20.jpg)
Synchronisation
• Synchronisation is automatic in message-passing
- the messages do it for you
• Make a phone call …
- … wait until the receiver picks up
• Receive a phone call
- … wait until the phone rings
• No danger of corrupting someone else’s data
- no shared blackboard
20
![Page 21: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/21.jpg)
Communication modes
• Sending a message can either be synchronous or
asynchronous
• A synchronous send is not completed until the message
has started to be received
• An asynchronous send completes as soon as the
message has gone
• Receives are usually synchronous - the receiving process
must wait until the message arrives
21
![Page 22: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/22.jpg)
Synchronous send
• Analogy with faxing a letter.
• Know when letter has started to be received.
22
![Page 23: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/23.jpg)
Asynchronous send
• Analogy with posting a letter.
• Only know when letter has been posted, not when it has been
received.
23
![Page 24: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/24.jpg)
Point-to-Point Communications
• We have considered two processes
- one sender
- one receiver
• This is called point-to-point communication
- simplest form of message passing
- relies on matching send and receive
• Close analogy to sending personal emails
24
![Page 25: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/25.jpg)
Message Passing: Collective
communications
Process-based parallelism
25
![Page 26: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/26.jpg)
Collective Communications
• A simple message communicates between two processes
• There are many instances where communication between
groups of processes is required
• Can be built from simple messages, but often
implemented separately, for efficiency
26
![Page 27: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/27.jpg)
Broadcast: one to all communication
27
![Page 28: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/28.jpg)
Broadcast
• From one process to all others
28
8
8 8
8
8
8
![Page 29: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/29.jpg)
Scatter
• Information scattered to many processes
29
0 1 2 3 4 5
0
1
3
4
5
2
![Page 30: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/30.jpg)
Gather
• Information gathered onto one process
30
0 1 2 3 4 5
0
1
3
4
5
2
![Page 31: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/31.jpg)
Reduction Operations
• Combine data from several processes to form a single result
31
Strike?
![Page 32: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/32.jpg)
Reduction
• Form a global sum, product, max, min, etc.
32
0
1
3
4
5
2
15
![Page 33: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/33.jpg)
Hardware
• Natural map to
distributed-memory
- one process per
processor-core
- messages go over
the interconnect,
between nodes/OS’s
Processor
Processor
Processor
Processor
Processor
Processor
ProcessorProcessor
Interconnect
33
![Page 34: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/34.jpg)
Processes: Summary
• Processes cannot share memory
- ring-fenced from each other
- analogous to white boards in separate offices
• Communication requires explicit messages
- analogous to making a phone call, sending an email, …
- synchronisation is done by the messages
• Almost exclusively use Message-Passing Interface
- MPI is a library of function calls / subroutines
34
![Page 35: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/35.jpg)
Practicalities
How we use the parallel models
35
![Page 36: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/36.jpg)
Practicalities
• 8-core machine might only have 2 nodes- how do we run MPI on a real HPC
machine?
• Mostly ignore architecture- pretend we have single-core nodes
- one MPI process per processor-core
- e.g. run 8 processes on the 2 nodes
• Messages between processor-cores on the same node are fast- but remember they also share access
to the network
Interconnect
36
![Page 37: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/37.jpg)
Message Passing on Shared Memory
• Run one process per core
- don’t directly exploit shared memory
- analogy is phoning your office mate
- actually works well in practice!
my
data
my
data• Message-passing
programs run by a
special job launcher
• user specifies #copies
• some control over
allocation to nodes
37
![Page 38: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/38.jpg)
Summary
38
![Page 39: Parallel Models - Archer...How we use the parallel models 35 Practicalities •8-core machine might only have 2 nodes-how do we run MPI on a real HPC machine? •Mostly ignore architecture-pretend](https://reader033.vdocuments.site/reader033/viewer/2022042403/5f1480c5f2147818dc45f00a/html5/thumbnails/39.jpg)
Summary
• Shared-variables parallelism
- uses threads
- requires shared-memory machine
- easy to implement but limited scalability
- in HPC, done using OpenMP compilers
• Distributed memory
- uses processes
- can run on any machine: messages can go over the interconnect
- harder to implement but better scalability
- on HPC, done using the MPI library
39