TRANSCRIPT
CSE 392/CS 378: High-performance Computing - Principles and Practice
Parallel Computer Architectures
A Conceptual Introduction for Software Developers
Parallel Computer Architectures: A Conceptual Introduction
Lecture Outline
• Why Do Applications Care about Architecture?
• Example Architecture – Ranger
• What Are the Issues?
• Why Do Application Developers/Users Need to Know about Parallel Architectures?
  – To develop efficient applications efficiently
• What Do Application Developers/Users Need to Know about Parallel Architectures?
  – Characteristics of the Memory Hierarchy
  – Parallelism in Cluster Computer Architectures
    • Homogeneous Multicore Chip Architecture
    • Homogeneous Chip Node Architecture
    • Interconnect Architecture
    • Heterogeneous Multicore Chip Architecture (not covered in this lecture)
What Are the Issues? Parallelism!
There are many different models/types of parallelism, at every layer:
• Algorithms: models/types of parallelism
• Programming model/language: models/types of parallelism
• Application program: models/types of parallelism
• Hardware architecture: models/types of parallelism
Map from the models/types of parallel computation of algorithms, programming languages and applications to the models and types of parallelism of hardware architectures:
• Core – SISD, SIMD
• Node – MIMD/shared memory
• Interconnect – MIMD/distributed memory
What are the Issues? Memory Hierarchy!
• Multiple levels of memory
– Five levels of caching
• Complex Parallelism in Memory Access
– Cache Line Fetches
• Interference among processors contending for access to shared memory
• Effects of prefetching
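The cost of these multiple levels can be sketched with the standard average-memory-access-time (AMAT) recurrence. The level names, latencies, and miss rates below are illustrative assumptions, not measured Ranger numbers:

```python
# Average memory access time through a multi-level hierarchy:
# AMAT(level) = hit_time + miss_rate * AMAT(next level).
# Latencies (cycles) and miss rates are illustrative, not measured values.
levels = [
    ("L1",  4, 0.05),   # (name, hit time in cycles, miss rate)
    ("L2", 12, 0.20),
    ("L3", 40, 0.30),
]
dram_latency = 200      # cycles, assumed

def amat(levels, memory_latency):
    """Fold the hierarchy from memory upward: each level adds its hit
    time plus the miss-rate-weighted cost of going one level further."""
    cost = memory_latency
    for name, hit, miss in reversed(levels):
        cost = hit + miss * cost
    return cost

print(round(amat(levels, dram_latency), 2))   # 5.6 cycles under these assumptions
```

Even with a 200-cycle memory, most accesses are absorbed near the top of the hierarchy, which is why cache-friendly access patterns matter so much.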
What Knowledge is Needed to Map Applications to Architectures?
• Characteristics of Memory Hierarchy
  – Core
  – Chip
  – Node
  – Interconnect
• Parallelism:
  – Algorithms/Libraries
  – Resource Control – Multicore Programming
  – Programming Models/Languages
  – Hardware Architectures
Ranger Architecture - Single Processor Architecture
Processor Core for AMD Barcelona Chip
Ranger Architecture - Chip Architecture
AMD Barcelona Chip
Ranger Architecture - Node Architecture
Ranger - Node Memory Architecture
• 32 GB per node
• Assume 32 KB pages => ~1,000,000 pages/node
• 48 pages mapped in TLB
• DRAM bank – parallel accessibility to each bank
• DRAM pages – A DRAM page (not to be confused with OS pages) represents a row of data that has been read from a DRAM bank and is cached within the DRAM for faster access. DRAM pages can be large, 32 KB in the case of Ranger, and there are typically two per DIMM. This means that a Ranger node shares 32 different 32 KB pages among 16 cores on four chips, yielding a megabyte of SRAM cache in the main memory.
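The slide's figures can be checked directly; this sketch uses only the numbers stated above:

```python
GiB = 2**30
KiB = 2**10

node_memory = 32 * GiB    # memory per Ranger node
os_page     = 32 * KiB    # OS page size assumed on the slide
tlb_entries = 48          # pages mapped in the TLB

pages_per_node = node_memory // os_page
tlb_reach      = tlb_entries * os_page

dram_pages      = 32      # open 32 KB DRAM pages shared per node
dram_page_size  = 32 * KiB
dram_page_cache = dram_pages * dram_page_size

print(pages_per_node)          # 1048576, the slide's ~1,000,000
print(tlb_reach // KiB)        # 1536: the TLB covers only 1.5 MB of 32 GB
print(dram_page_cache // KiB)  # 1024: the "megabyte" of cached DRAM rows
```

The striking ratio is the TLB reach: 48 mapped pages cover 1.5 MB out of 32 GB, so applications with scattered access patterns pay frequent TLB misses.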
Ranger Interconnect - InfiniBand
Properties: One Hop Connectivity, Packets and DMA between Nodes
Why Do Applications Care about Architecture?
• Return on Investment
– Understand What You are Doing!
– Performance
• Choice of Algorithms
• Choice of Libraries
• Optimization of Application Specific Code
– Productivity
• Personal
• Resource
What Do Application Developers/Users Need to Know about Architecture?
• Parallelism in Software
  – Algorithm
  – Resource Control
  – Language/Library/Programming Model
  – Hardware Architecture
• Parallelism in Cluster Computer Architecture
  – Processor/Core Architecture
  – Homogeneous Multicore Chips
  – Node Architecture
  – Interconnect Architecture
  – Heterogeneous Multicore Chips (not covered in this lecture)
Flynn’s Classic Parallel Hardware Taxonomy
Processor organizations:
• Single instruction, single data (SISD) stream → Uniprocessor
• Single instruction, multiple data (SIMD) stream → Vector processor; Array processor
• Multiple instruction, single data (MISD) stream
• Multiple instruction, multiple data (MIMD) stream
  – Shared memory → Symmetric multiprocessor (SMP); Nonuniform memory access (NUMA)
  – Distributed memory → Clusters
Amdahl and Gustafson’s Laws
Amdahl’s Law – speedup S for:
F = fraction parallelizable and
P = number of processors,
assuming perfect parallelism and identical problem size:
S = 1/((1-F) + F/P)
S → 1/(1-F) as P → ∞

Gustafson’s Law – speedup S for:
F(n) = fraction parallelizable as a function of problem size n, and
P = number of processors,
assuming perfect parallelism but increasing problem size:
S = (1-F(n)) + P·F(n)
S ≈ P·F(n) for P, n >> 1

In some problems, F(n) → 1 as n becomes large, so that S → P.
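Both laws are one-liners to encode; the parameter values below are illustrative:

```python
def amdahl_speedup(f, p):
    """Amdahl: fixed problem size, fraction f parallelizable, p processors."""
    return 1.0 / ((1.0 - f) + f / p)

def gustafson_speedup(f, p):
    """Gustafson: scaled problem size, parallel fraction f of the scaled run."""
    return (1.0 - f) + p * f

# With 90% parallelizable code, Amdahl caps speedup at 1/(1-0.9) = 10
print(amdahl_speedup(0.9, 16))      # ~6.4
print(amdahl_speedup(0.9, 10**6))   # just under 10, no matter how many processors
print(gustafson_speedup(0.9, 16))   # ~14.5: scaling the problem size helps
```

The contrast is the point: with the same 90% parallel fraction, Amdahl limits 16 processors to 6.4x, while Gustafson's scaled-problem view reaches 14.5x.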
Types of Parallelism
• Streaming Parallelism
• Vector Parallelism
• Asynchronous SIMD => SPMD
• Synchronous SIMD
• Asynchronous MIMD
• Shared Name Space Parallel Execution
• Distributed Name Space Parallel Execution
Taxonomy of Parallelism
Attributes:
• mechanism for control – synchronous (implicit by time), asynchronous (explicit)
• resolution of control – static, dynamic
• cardinality of control – single instruction stream, multiple instruction stream
• locality of control – single, distributed
• state data for control – central/shared, distributed
• name space for execution data – shared, distributed
• cardinality of input/output (data) stream – single, multiple
• granularity of data access – discrete, chunk, stream
Model = {control: (mechanism, resolution, cardinality, locality, state data), name space, cardinality of input/output, granularity of data access}
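The model tuples used on the following slides can be encoded directly against this attribute order. A sketch; note that SIMD's cardinality of control is written as "single" here, following the single-instruction-stream definition above:

```python
from collections import namedtuple

# Field order follows the attribute list above.
Control = namedtuple("Control", "mechanism resolution cardinality locality state_data")
Model   = namedtuple("Model", "control name_space io_cardinality granularity")

SIMD = Model(Control("synchronous", "static", "single", "single", "central"),
             "shared", "multiple", "stream")
MISD = Model(Control("synchronous", "static", "multiple", "single", "central"),
             "shared", "single", "stream")

# SIMD and MISD differ in cardinality of control (one instruction stream vs
# many) and in the cardinality of their data streams:
print(SIMD.io_cardinality, MISD.io_cardinality)   # multiple single
```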
SIMD – {(synchronous, static, single, single, central), shared, multiple, stream}
Naturally suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
MISD – {(synchronous, static, multiple, single, central), shared, single, stream}
• Natural uses include:
  – multiple frequency filters operating on a single signal stream
  – multiple cryptography algorithms attempting to crack a single coded message
MIMD – {(asynchronous, static/dynamic, multiple, single, central/distributed), multiple, discrete}
• Most Clusters, Servers and Grids
Invisible Parallelism
• Invisible parallelism – internal to a core or instruction:
  – Prefetching from memory
– Pipelining of instruction execution
– SIMD instructions
• Mostly addressed by compilers and runtime systems but compilers often need help.
• Will be addressed in later lectures on performance optimization.
Visible Parallelism
• Four Levels of Visible Parallelism
– Core
– Chip
– Node
– System
• Generally asynchronous and user controllable.
Threads – Software and Hardware
• Software
– Multiple operating-system-defined threads (e.g., pThreads) can be created on each core.
– The thread scheduler can switch among them, but only one can be active on a given core at a time.
• But some core/chip architectures have hardware-implemented threads:
– Simultaneous Multithreading
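A minimal sketch of software threading (thread and iteration counts are arbitrary): several OS-scheduled threads share one counter, and the lock keeps the shared update correct no matter how the scheduler interleaves them.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:                 # serialize the shared update
            counter += 1

# Create four OS threads; the scheduler decides where and when each runs.
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # wait for all workers to finish

print(counter)                     # 40000, regardless of interleaving
```

The same structure applies with pThreads in C; only the API names change.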
Multi-core: threads can run on separate cores
[Diagram: two cores side by side, each with BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, uCode ROM, and an L2 Cache and Control Bus. Core 1 runs Thread 1; Core 2 runs Thread 2.]
[Diagram: the same two-core pipeline; Core 1 runs two threads and Core 2 runs two threads.]
Multi-core: multiple threads can run on each core
A technique complementary to multi-core:
Simultaneous multithreading on one core
• Problem addressed: the processor pipeline can get stalled
  – waiting for the result of a long floating point (or integer) operation
  – waiting for data to arrive from memory
[Diagram: a single core's pipeline; while one unit is busy, the other execution units wait unused.]
Source: Intel
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads” on the same core
• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
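That example can be replayed in a toy issue model (an illustration only, not a real pipeline): one integer unit, one floating-point unit, and at most one operation per thread per cycle.

```python
def cycles_needed(thread_ops, smt):
    """Toy model: one INT unit and one FP unit; each op takes one cycle.
    thread_ops: per-thread lists of ops, each op 'int' or 'fp'."""
    queues = [list(ops) for ops in thread_ops]
    cycles = 0
    while any(queues):
        cycles += 1
        free_units = {"int", "fp"}
        issued = 0
        for q in queues:
            if q and q[0] in free_units:
                free_units.remove(q.pop(0))   # issue this thread's next op
                issued += 1
                if not smt and issued == 1:
                    break                     # without SMT: one thread per cycle
    return cycles

ops = [["fp", "fp"], ["int", "int"]]   # thread 1 does FP, thread 2 does INT
print(cycles_needed(ops, smt=False))   # 4: only one thread issues per cycle
print(cycles_needed(ops, smt=True))    # 2: FP and INT units are used together
```

With SMT, the two threads never contend for the same unit here, so the work finishes in half the cycles; the later "IMPOSSIBLE" slide is the case where both threads want the single integer unit.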
[Diagram: a single core's pipeline; Thread 1 occupies the floating point unit.]
Without SMT, only a single thread can run at any given time
Without SMT, only a single thread can run at any given time
[Diagram: the same pipeline; Thread 2 occupies the integer unit.]
SMT processor: both threads can run concurrently
[Diagram: an SMT core's pipeline; Thread 1 uses the floating point unit while Thread 2 uses the integer unit.]
But: Can’t simultaneously use the same functional unit
[Diagram: Thread 1 and Thread 2 both target the integer unit; the scenario is marked IMPOSSIBLE.]
This scenario is impossible with SMT on a single core (assuming a single integer unit).
SMT not a “true” parallel processor
• Enables better thread throughput (e.g., up to 30% improvement)
• OS and applications perceive each simultaneous thread as a separate “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core:each core has its own copy of resources
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can run concurrently
[Diagram: two SMT cores; Threads 1 and 3 run on Core 1 while Threads 2 and 4 run on Core 2.]
Comparison: multi-core vs SMT
• Multi-core:
  – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT:
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level parallelism
Node Level Parallelism
• Memory sharing and interference issues become even more complex.
• Ranger nodes have asymmetries which lead to additional issues of load imbalance and cores getting out of “phase.”
• I/O is at node level – Each node can be writing to the file system simultaneously and asynchronously.
The memory hierarchy
• If simultaneous multithreading only: all caches shared
• Multi-core chips:
  – L1 caches core private
  – L2 caches core private in some architectures and shared in others
  – L3 caches usually shared across cores
  – DRAM page cache usually shared
• Memory is always shared
Designs with private L2 caches
[Diagram: two designs. In the first, each dual-core chip gives every core a private L1 cache and a private L2 cache, connected to memory. In the second, L3 caches sit between the private L2 caches and memory.]
Both L1 and L2 are private – examples: AMD Opteron, AMD Athlon, Intel Pentium D
A design with L3 caches – example: Intel Itanium 2
Private vs shared caches
• Advantages of private:
– They are closer to the core, so access is faster
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the same cache data
– More cache space available if a single (or a few) high-performance thread runs on the system
The cache coherence problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213
[Diagram: a multi-core chip with four cores, each with one or more levels of cache (all empty); main memory holds x=15213.]
The cache coherence problem
Core 1 reads x
[Diagram: Core 1's cache now holds x=15213; main memory still holds x=15213.]
The cache coherence problem
Core 2 reads x
[Diagram: both Core 1's and Core 2's caches hold x=15213.]
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Diagram: Core 1's cache holds x=21660 and, with write-through caches, main memory also holds x=21660; Core 2's cache still holds x=15213.]
The cache coherence problem
Core 2 attempts to read x… gets a stale copy
[Diagram: Core 2 still sees the stale x=15213 while Core 1's cache and main memory hold x=21660.]
Solutions for cache coherence
• This is a general problem with multiprocessors, not limited just to multi-core
• There exist many solution algorithms, coherence protocols, etc.
• We will return to consistency and coherence issues in a later lecture.
• A HARDWARE solution: invalidation-based protocol with snooping
Inter-core bus
[Diagram: the four cores and their caches are connected to each other and to main memory by an inter-core bus.]
Invalidation protocol with snooping
• Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
• Snooping: All cores continuously “snoop” (monitor) the bus connecting the cores.
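The read/write sequence on the following slides can be replayed with a toy write-through, invalidation-based model. This sketches only the externally visible behavior, not any real protocol implementation:

```python
class Bus:
    """Toy snooping bus shared by write-through caches."""
    def __init__(self, memory):
        self.memory = memory       # shared main memory
        self.caches = []

class Cache:
    def __init__(self, bus):
        self.data = {}             # this core's private cache
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:              # miss: load from memory
            self.data[addr] = self.bus.memory[addr]
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.bus.memory[addr] = value          # write-through to memory
        for other in self.bus.caches:          # snooped invalidation request
            if other is not self:
                other.data.pop(addr, None)

bus = Bus(memory={"x": 15213})
core1, core2 = Cache(bus), Cache(bus)

core1.read("x"); core2.read("x")   # both caches now hold x=15213
core1.write("x", 21660)            # invalidates core2's copy
print("x" in core2.data)           # False: core2's copy was invalidated
print(core2.read("x"))             # 21660: miss, loads the new value
```

This reproduces the slide sequence: after Core 1's write, Core 2 no longer has a stale copy; its next read misses and fetches the new value.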
The cache coherence problem
Revisited: Cores 1 and 2 have both read x
[Diagram: Cores 1 and 2 each cache x=15213; main memory holds x=15213.]
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Diagram: Core 1 writes x=21660 to its cache and, write-through, to main memory, and sends an invalidation request on the inter-core bus; Core 2's copy of x is INVALIDATED.]
The cache coherence problem
After invalidation:
[Diagram: after invalidation, only Core 1's cache holds x=21660; main memory holds x=21660.]
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new copy.
[Diagram: Core 2's cache, having missed, now also holds x=21660.]
Invalidation protocols
• This was just the basic invalidation protocol
• More sophisticated protocols use extra cache state bits
• MSI and MESI (Modified, Exclusive, Shared, Invalid), to which we will return later
Another hardware solution: update protocol
Core 1 writes x=21660:
[Diagram: Core 1 writes x=21660 and broadcasts the updated value on the inter-core bus; Core 2's copy is UPDATED to x=21660 (assuming write-through caches).]
54

Invalidation vs update

• Multiple writes to the same location
  – invalidation: broadcast only on the first write
  – update: must broadcast each write (including the new value)
• Invalidation generally performs better: it generates less bus traffic
55
User Control of Multi-core Parallelism
• Programmers must use threads or processes
• Spread the workload across multiple cores
• Write parallel algorithms
• OS will map threads/processes to cores
56
Thread safety very important
• Pre-emptive context switching: a context switch can happen AT ANY TIME
• True concurrency, not just uniprocessor time-slicing
• Concurrency bugs exposed much faster with multi-core
57
However: Need to use synchronization even if only time-slicing on a uniprocessor
int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
58
Need to use synchronization even if only time-slicing on a uniprocessor
Interleaving 1 (gives counter = 2):
    temp1 = counter;
    counter = temp1 + 1;
    temp2 = counter;
    counter = temp2 + 1;

Interleaving 2 (gives counter = 1):
    temp1 = counter;
    temp2 = counter;
    counter = temp1 + 1;
    counter = temp2 + 1;
59
Assigning threads to the cores
• Each thread/process has an affinity mask
• Affinity mask specifies what cores the thread is allowed to run on
• Different threads can have different masks
• Affinities are inherited across fork()
60
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT
    core 3  core 2  core 1  core 0
       1       1       0       1

• Process/thread is allowed to run on cores 0, 2, 3, but not on core 1
61
Affinity masks when multi-core and SMT combined
• Separate bits for each simultaneous thread
• Example: 4-way multi-core, 2 threads per core

         core 3          core 2          core 1          core 0
    thread1 thread0  thread1 thread0  thread1 thread0  thread1 thread0
       1       1        0       0        1       0        1       1

• Core 2 can't run the process
• Core 1 can only use one simultaneous thread
62
Default Affinities
• Default affinity mask is all 1s:all threads can run on all processors
• Then, the OS scheduler decides what threads run on what core
• OS scheduler detects skewed workloads, migrating threads to less busy processors
![Page 63: CSE 392/CS 378: High-performance Computing - Principles ...pingali/CSE392/2011sp/... · CSE 392/CS 378: High-performance Computing - Principles and Practice Parallel Computer Architectures](https://reader033.vdocuments.site/reader033/viewer/2022042804/5f53407d67363145c425a7bc/html5/thumbnails/63.jpg)
63
Process migration is costly
• Need to restart the execution pipeline
• Cached data is invalidated
• OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
• This is called soft affinity
![Page 64: CSE 392/CS 378: High-performance Computing - Principles ...pingali/CSE392/2011sp/... · CSE 392/CS 378: High-performance Computing - Principles and Practice Parallel Computer Architectures](https://reader033.vdocuments.site/reader033/viewer/2022042804/5f53407d67363145c425a7bc/html5/thumbnails/64.jpg)
64
User Set Affinities
• The programmer can prescribe her own affinities (hard affinities)
• Rule of thumb: use the default scheduler unless you have knowledge of program behavior
• There are also memory affinity tags which can be set.
65
When to set your own affinities
• Two (or more) threads share data structures in memory
  – map them to the same core so they can share the cache
• Real-time threads. Example: a thread running a robot controller
  – must not be context switched, or else the robot can go unstable
  – dedicate an entire core just to this thread (Source: Sensable.com)
Connections to Succeeding Lectures
• Parallel programming for multicore chips and multichip nodes.
• Concurrent/parallel execution, consistency, and coherence
• Models for thinking about and formulating parallel computations
• Performance optimization – Software engineering
Assignment
• To be submitted on Friday, 9/16 by 5PM
• Go to the User Guide for Ranger:
  http://www.tacc.utexas.edu/user-services/user-guides/ranger-user-guide
• Read the sections on affinity policy and NUMA Control
• Take the program which will be distributed on Wednesday, 9/14, define the affinity policy and NUMA control policy you would use, and justify your choices.