SE-292 High Performance Computing: Intro. to Concurrent Programming & Parallel Architecture
TRANSCRIPT
SE-292 High Performance Computing
Intro. to Concurrent Programming & Parallel Architecture
R. Govindarajan
govind@serc
2
PARALLEL ARCHITECTURE
Parallel machine: a computer system with more than one processor. The processors interact to solve a problem; special parallel machines are designed to make this interaction overhead less.
Questions: What about multicores? What about a network of machines?
Yes, but the time involved in interaction (communication) might be high, as such a system is designed assuming that the machines are more or less independent.
3
Classification of Parallel Machines
Flynn's classification: in terms of the number of instruction streams and data streams.
Instruction stream: path to instruction memory (PC)
Data stream: path to data memory
SISD: single instruction stream, single data stream
SIMD: single instruction stream, multiple data streams
MIMD: multiple instruction streams, multiple data streams
4
PARALLEL PROGRAMMING
Recall Flynn's classification: SIMD, MIMD.
On the programming side: SPMD, MPMD.
Shared memory: programming with threads.
Message passing: programming with messages.
Speedup = T_sequential / T_parallel
5
How Much Speedup is Possible?
Amdahl's Law: let s be the fraction of the sequential execution time of a given program that cannot be parallelized.
Assume that the program is parallelized so that the remaining part (1 - s) is perfectly divided to run in parallel across n processors.
Speedup(n) = 1 / (s + (1 - s)/n)
As n → ∞, Speedup(n) → 1/s.
The maximum speedup achievable is limited by the sequential fraction of the sequential program.
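Amdahl's Law is easy to check numerically. A minimal sketch (the function name and the sample values are illustrative, not from the slides):

```python
def amdahl_speedup(s, n):
    """Speedup on n processors when fraction s of the sequential
    execution time cannot be parallelized."""
    return 1.0 / (s + (1.0 - s) / n)

# Even with only 5% sequential work, speedup saturates near 1/s = 20.
print(amdahl_speedup(0.05, 10))     # ~6.9
print(amdahl_speedup(0.05, 100))    # ~16.8
print(amdahl_speedup(0.05, 10**6))  # ~20.0
```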
6
Understanding Amdahl's Law
Example: compute min(|A[i,j] - B[i,j]|) over n×n matrices, in two phases of n² work each.
[Figure: concurrency vs. time profiles. (a) Serial: both phases run at concurrency 1, for a total time of n² + n². (b) Naïve parallel: the first phase runs at concurrency p in time n²/p, but the second phase remains serial and takes n². (c) Parallel: both phases run at concurrency p, in time n²/p each.]
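The three profiles in the figure can be compared numerically. A sketch under the figure's cost model (each phase costs n² serially; the function names are my own):

```python
def speedup_naive(n, p):
    """Profile (b): only the first phase is parallelized."""
    t_serial = n * n + n * n       # profile (a): both phases serial
    t_naive = n * n / p + n * n    # second phase still serial
    return t_serial / t_naive

def speedup_full(n, p):
    """Profile (c): both phases are parallelized."""
    t_serial = n * n + n * n
    t_full = 2 * n * n / p
    return t_serial / t_full

n, p = 1000, 100
print(speedup_naive(n, p))  # ~1.98: half the work stayed serial
print(speedup_full(n, p))   # 100.0: ideal speedup of p
```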
7
Classification 2: Shared Memory vs. Message Passing
Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory.
The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figure: (top) shared memory: processors P..P connected through an interconnect to a common main memory; (bottom) message passing: processors P..P, each with its own memory M, connected through an interconnect.]
8
Shared Memory Machines
The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.
Terms: shared vs. private; local vs. remote; centralized vs. distributed shared; NUMA (Non-Uniform Memory Access) vs. UMA (Uniform Memory Access) architecture.
© RG@SERC,IISc 9
Shared Memory Architecture
[Figure: (left) centralized shared memory, Uniform Memory Access (UMA): processors with caches (P, $) connected through a network to shared memory modules M. (right) distributed shared memory, Non-Uniform Memory Access (NUMA): nodes, each with a processor P, cache $, and local memory M, connected through a network.]
10
MultiCore Structure
[Figure: a multicore chip with eight cores C0-C7. Each core has a private L1 cache (L1$); each pair of cores (C0/C2, C4/C6, C1/C3, C5/C7) shares an L2 cache; all L2 caches connect to a common memory.]
11
NUMA Architecture
[Figure: two sockets connected by QPI links. Each socket holds four cores (C0/C2/C4/C6 and C1/C3/C5/C7), each core with private L1$ and L2$, a shared L3 cache per socket, and an integrated memory controller (IMC) attached to that socket's local memory.]
12
Distributed Memory Architecture
[Figure: nodes, each with a processor P, cache $, and private memory M, connected through a network.]
Message Passing Architecture: memory is private to each node; processes communicate by messages. Such a machine is also called a cluster.
13
Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple bus, crossbar, multistage interconnection network (MIN).
Direct interconnects: nodes are connected directly to each other.
Topology: linear, ring, star, mesh, torus, hypercube.
Routing techniques: how the route taken by a message from source to destination is decided.
14
Indirect Interconnects
[Figure: a shared bus, a multiple bus, a crossbar switch, and a multistage interconnection network built from 2x2 crossbars.]
15
Direct Interconnect Topologies
[Figure: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube) for n = 2 and n = 3.]
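In a binary n-cube, two nodes are neighbours exactly when their labels differ in one bit, so a message can be routed by correcting the differing bits one dimension at a time (dimension-order routing). A sketch, with node labels as integers (not from the slides):

```python
def hypercube_route(src, dst):
    """Node sequence from src to dst, fixing differing address
    bits from the least-significant dimension upward."""
    path = [src]
    node = src
    bit = 0
    diff = src ^ dst
    while diff:
        if diff & 1:
            node ^= 1 << bit   # hop across this dimension
            path.append(node)
        diff >>= 1
        bit += 1
    return path

# Distance equals the Hamming distance of the two labels.
print(hypercube_route(0b000, 0b101))  # [0, 1, 5]: two hops
```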
16
Shared Memory Architecture: Caches
[Figure: memory holds X = 0. P1 and P2 both read X, caching X = 0. P1 then writes X = 1, updating its own cache and memory. When P2 reads X again, it gets a cache hit on the stale value X = 0: wrong data!]
17
Cache Coherence Problem
If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem, the cache coherence problem: a shared variable is modified in one private cache while stale copies remain in others.
Objective: processes shouldn't read 'stale' data.
Solutions:
Hardware: cache coherence mechanisms
Software: compiler-assisted cache coherence
18
Example: Write Once Protocol
Assumption: a shared bus interconnect where all cache controllers monitor all bus activity; this is called snooping.
Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence across the caches. Corrective action could involve updating or invalidating a cache block.
19
Invalidation Based Cache Coherence
[Figure: memory holds X = 0. P1 and P2 both read X, caching X = 0. When P1 writes X = 1, an Invalidate is broadcast on the bus and P2's copy is invalidated. P2's next read of X misses in its cache and fetches the up-to-date value X = 1.]
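The sequence in the figure can be traced with a toy model of snooping invalidation (a sketch only; real protocols track per-line states such as MSI/MESI, and all names here are my own):

```python
class Bus:
    """Shared bus: connects caches to memory and carries invalidates."""
    def __init__(self):
        self.caches = []
        self.memory = {"X": 0}

    def attach(self, cache):
        self.caches.append(cache)

    def invalidate_others(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)  # snooped: drop any copy

class Cache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate_others(self, addr)  # invalidate on bus
        self.lines[addr] = value
        self.bus.memory[addr] = value           # write-through

bus = Bus()
p1, p2 = Cache(bus), Cache(bus)
p1.read("X"); p2.read("X")   # both cache X = 0
p1.write("X", 1)             # invalidates P2's copy
print(p2.read("X"))          # miss, refetch: prints 1, not stale 0
```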
20
Snoopy Cache Coherence and Locks
[Figure: one processor holds lock L (in its critical section) while the others spin, repeatedly executing Test&Set(L). Each Test&Set writes L = 1, invalidating the copies in the other caches, so the spinning processors generate continuous bus traffic. When the holder releases L (writes L = 0), one waiting processor's Test&Set finally succeeds.]
21
Space of Parallel Computing
Programming models: what the programmer uses in coding applications; they specify synchronization and communication.
Programming models: shared address space, message passing, data parallel.
Parallel architecture:
Shared memory: centralized shared memory (UMA), distributed shared memory (NUMA)
Distributed memory: a.k.a. message passing; e.g., clusters
22
Memory Consistency Model
The memory consistency model defines the order in which memory operations will appear to execute ⇒ What value can a read return?
It is a contract between application software and the system ⇒ it affects ease of programming and performance.
23
Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if
• operations from different processors executed in some sequential (interleaved) order, and
• the memory operations of each process appear in program order.
[Figure: processors P1, P2, P3, ..., Pn all accessing a single shared MEMORY.]
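SC can be made concrete by enumerating all interleavings of a classic two-thread litmus test (the variable and register names are my own). Under SC, at least one thread must observe the other's write, so the outcome (r1, r2) = (0, 0) never occurs:

```python
# Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
T1 = [("write", "x"), ("read", "y", "r1")]
T2 = [("write", "y"), ("read", "x", "r2")]

def merges(a, b):
    """All interleavings of a and b that preserve each thread's
    program order (the SC requirement)."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in merges(a[1:], b):
        yield [a[0]] + rest
    for rest in merges(a, b[1:]):
        yield [b[0]] + rest

outcomes = set()
for order in merges(T1, T2):
    mem = {"x": 0, "y": 0}
    regs = {}
    for op in order:
        if op[0] == "write":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    outcomes.add((regs["r1"], regs["r2"]))

print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```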
24
Distributed Memory Architecture
[Figure: as on slide 12, processing nodes ("Proc. Node"), each with a processor P, cache $, and private memory M, connected through a network.]
Message Passing Architecture: memory is private to each node; processes communicate by messages.
25
NUMA Architecture
[Figure: the two-socket NUMA node from slide 11, now with an IO-Hub and a network interface card (NIC) attached alongside the two memories, so the node can be connected into a cluster.]
26
Using MultiCores to Build Cluster
[Figure: four multicore nodes (Node 0-Node 3), each with its own memory and NIC, connected through a network switch.]
Multi-core Workshop 27
Using MultiCores to Build Cluster
[Figure: the same four-node cluster; one node performs Send A(1:N), sending array A through the network switch to another node.]