Lecture 6: Multicore Systems
Multicore Computers (chip multiprocessors)
Combine two or more processors (cores) on a single piece of silicon
Each core consists of ALU, registers, pipeline hardware, L1 instruction and data caches
Multithreading is used
Pollack’s Rule
Performance increase is roughly proportional to the square root of the increase in complexity
performance ∝ √complexity
Power consumption increases roughly linearly with the increase in complexity
power consumption ∝ complexity
Pollack’s Rule
complexity   power   performance
    1          1          1
    4          4          2
   25         25          5
100s of low complexity cores, each operating at very low power
Ex: Four small cores
complexity   power   performance
   4×1        4×1         4
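The trade-off in the two tables above can be sketched numerically. This is a minimal illustration of Pollack's Rule, assuming a perfectly parallel workload so that the performance of small cores adds up:

```python
import math

def pollack(complexity):
    """Performance and power of a core with the given complexity,
    normalized to a baseline core of complexity 1 (Pollack's Rule)."""
    performance = math.sqrt(complexity)  # performance ∝ √complexity
    power = complexity                   # power ∝ complexity
    return performance, power

# One big core of complexity 4: power 4, performance 2
big_perf, big_power = pollack(4)
print(big_perf, big_power)            # 2.0 4

# Four small cores of complexity 1 each, workload perfectly parallel:
# same power budget (4 × 1 = 4), but aggregate performance 4 × 1 = 4
small_perf = 4 * pollack(1)[0]
small_power = 4 * pollack(1)[1]
print(small_perf, small_power)        # 4.0 4
```

At equal power, the four small cores deliver twice the performance of the single complex core, which is the argument for manycore chips made of many low-complexity cores.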
Increasing CPU Performance
Manycore Chip Composed of hybrid cores
• Some general purpose
• Some graphics
• Some floating point
Exascale Systems
Board composed of multiple manycore chips sharing memory
Rack composed of multiple boards
A room full of these racks
Millions of cores → exascale systems (10^18 FLOP/s)
Moore’s Law Reinterpreted
Number of cores per chip doubles every 2 years
Number of threads of execution doubles every 2 years
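The reinterpreted doubling rule above can be turned into a quick projection. A small sketch, where the baseline core count of 4 is an illustrative assumption:

```python
def cores_after(years, base_cores=4, doubling_period=2):
    """Projected cores per chip if the count doubles every
    `doubling_period` years (Moore's Law reinterpreted).
    The baseline of 4 cores is an assumption for illustration."""
    return base_cores * 2 ** (years // doubling_period)

print(cores_after(10))  # 4 × 2^5 = 128 cores after a decade
```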
Shared Memory MIMD
Shared memory
• Single address space
• All processes have access to the pool of shared memory
[Diagram: processors P connected by a bus to a shared Memory]
Shared Memory MIMD
Each processor executes different instructions asynchronously, using different data

[Diagram: control units (CU) issue separate instruction streams to processing elements (PE); each PE accesses its own data in the shared memory]
Symmetric Multiprocessors (SMP)
MIMD Shared memory UMA
[Diagram: processors, each with private L1 and L2 caches, connected by a system bus to Main Memory and I/O]
Symmetric Multiprocessors (SMP)
Characteristics:
Two or more similar processors
Processors share the same memory and I/O facilities
Processors are connected by bus or other internal connection scheme, such that memory access time is the same for each processor
All processors share access to I/O devices
All processors can perform the same functions
The system is controlled by an integrated operating system that provides interaction between processors and their programs
Symmetric Multiprocessors (SMP)
Operating system:
Provides tools and functions to exploit the parallelism
Schedules processes or threads across all of the processors
Takes care of
• scheduling of threads and processes on processors
• synchronization among processors
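The OS-level scheduling described above can be sketched from user code: workers share one address space and the OS places them on available processors. A minimal Python sketch (note that CPython's GIL limits CPU-bound thread parallelism; a real SMP kernel schedules native threads across cores):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Every worker reads from the same shared address space
    return sum(chunk)

def parallel_sum(data, workers=None):
    """Split work across one worker per processor and let the
    OS scheduler distribute the threads over the cores."""
    workers = workers or (os.cpu_count() or 1)
    size = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1000))))  # 499500
```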
Multicore Computers
Dedicated L1 Cache
(ARM11 MPCore)
[Diagram: CPU cores 1…n, each with dedicated L1-I and L1-D caches, sharing one L2 cache connected to Main Memory and I/O]
Multicore Computers
Dedicated L2 Cache
(AMD Opteron)
[Diagram: CPU cores 1…n, each with dedicated L1-I, L1-D, and L2 caches, connected to Main Memory and I/O]
Multicore Computers
Shared L2 Cache
(Intel Core Duo)
[Diagram: CPU cores 1…n, each with dedicated L1-I and L1-D caches, sharing a single L2 cache connected to Main Memory and I/O]
Multicore Computers
Shared L3 Cache
(Intel Core i7)
[Diagram: CPU cores 1…n, each with dedicated L1-I, L1-D, and L2 caches, sharing an L3 cache connected to Main Memory and I/O]
Multicore Computers
Advantages of a shared L2 cache:
Reduced overall miss rate
• A thread on one core may cause a frame to be brought into the cache; a thread on another core can then access the same location that has already been brought into the cache
Data shared by multiple cores is not replicated
The amount of shared cache allocated to each core may be dynamic
Interprocessor communication is easy to implement

Advantages of dedicated L2 caches:
Each core can access its private cache more rapidly

L3 cache:
As the amount of memory and the number of cores grow, an L3 cache provides better performance
Multicore Computers
On-chip interconnects: bus, crossbar
Off-chip communication (CPU-to-CPU or I/O): bus-based
Multicore Computers
Multithreading
A multithreaded processor provides a separate PC for each thread (hardware multithreading)
Implicit multithreading
• Concurrent execution of multiple threads extracted from a single sequential program
Explicit multithreading
• Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines
Multicore Computers
Explicit Multithreading
Fine-grained multithreading (interleaved multithreading)
• Processor deals with two or more thread contexts at a time
• Switches from one thread to another at each clock cycle
Coarse-grained multithreading (blocked multithreading)
• Instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs
• This event causes a switch to another thread
Simultaneous multithreading (SMT)
• Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
• Thread-level parallelism is combined with instruction-level parallelism (ILP)
Chip multiprocessing (CMP)
• Each processor of a multicore system handles separate threads

[Figure: pipeline issue slots under coarse-grained, fine-grained, and simultaneous multithreading, and CMP]
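The difference between the first two switching policies can be shown with a toy scheduler, not real hardware: fine-grained switches threads every cycle, coarse-grained runs one thread until a delay event. The thread names and miss cycles are illustrative assumptions:

```python
def fine_grained(threads, cycles):
    """Interleaved multithreading: issue from the threads
    round-robin, switching at every clock cycle."""
    return [threads[c % len(threads)] for c in range(cycles)]

def coarse_grained(threads, miss_cycles, cycles):
    """Blocked multithreading: run one thread until a delay
    event (e.g. cache miss) occurs, then switch to the next."""
    schedule, t = [], 0
    for c in range(cycles):
        schedule.append(threads[t])
        if c in miss_cycles:            # delay event: switch thread
            t = (t + 1) % len(threads)
    return schedule

print(fine_grained(["T0", "T1"], 4))          # ['T0', 'T1', 'T0', 'T1']
print(coarse_grained(["T0", "T1"], {2}, 5))   # ['T0', 'T0', 'T0', 'T1', 'T1']
```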
GPUs (Graphics Processing Units)
Characteristics of GPUs
GPUs are accelerators for CPUs
SIMD
GPUs have many parallel processors and many concurrent threads (e.g. 10 or more cores; 100s or 1000s of threads per core)
CPU-GPU combination is an example for heterogeneous computing
GPGPU (general purpose GPU): using a GPU to perform applications traditionally handled by the CPU
GPUs
Core Complexity
Out-of-order execution
Dynamic branch prediction
Deeper pipelines for higher clock rates
More circuitry
High performance
GPUs
Complex cores are preferable for:
• Highly instruction-parallel numeric applications
• Floating-point applications
A large number of simple cores is preferable when:
• The application's serial part is small
Cache Performance: Intel Core i7
[Figure: cache performance measurements for the Intel Core i7]
Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory:

Arithmetic intensity = floating-point operations / number of data bytes accessed  (FLOPs/byte)
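The definition above can be applied to a concrete kernel. A sketch using the usual textbook accounting for a DAXPY-style loop y[i] = a·x[i] + y[i], assuming no cache reuse (the kernel choice and byte counts are illustrative):

```python
def arithmetic_intensity(flops, bytes_accessed):
    """Arithmetic intensity in FLOPs per byte."""
    return flops / bytes_accessed

n = 1_000_000
flops = 2 * n            # one multiply + one add per element
bytes_moved = 3 * 8 * n  # read x, read y, write y: 8-byte doubles

print(arithmetic_intensity(flops, bytes_moved))  # 1/12 ≈ 0.083 FLOPs/byte
```

An intensity this low means the kernel's performance is bounded by memory bandwidth, not by peak floating-point throughput.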
Roofline Performance Model
Attainable GFLOPs/second = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
Roofline Performance Model
Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second)
For multicore chips, peak performance is the collective performance of all the cores on the chip, so multiply the peak per chip by the number of chips
Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second)
The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as:
Peak memory bandwidth × Arithmetic intensity
(bytes/second) × (FLOPs/byte) ⇒ FLOPs/second
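The min() formula above translates directly into code. A sketch for a hypothetical machine (the 100 GFLOP/s peak and 25 GB/s bandwidth are assumed numbers, not any real chip's specification):

```python
def attainable_gflops(peak_gflops, peak_bw_gbytes, intensity):
    """Roofline bound: min(peak floating-point performance,
    peak memory bandwidth × arithmetic intensity)."""
    return min(peak_gflops, peak_bw_gbytes * intensity)

# Hypothetical machine: 100 GFLOP/s peak, 25 GB/s memory bandwidth
print(attainable_gflops(100, 25, 0.5))  # 12.5 — memory-bound
print(attainable_gflops(100, 25, 8.0))  # 100  — compute-bound
```

Sweeping the intensity argument and plotting the result produces the characteristic "roofline": a rising bandwidth slope that flattens at the compute peak.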
Roofline Performance Model
Roofline sets an upper bound on performance
Roofline of a computer does not vary by benchmark kernel
Stream Benchmark
A synthetic benchmark
Measures the performance of long vector operations
The operations have no temporal locality, and they access arrays that are larger than the cache size
http://www.cs.virginia.edu/stream/ref.html
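A STREAM-style kernel can be sketched in Python with NumPy. This is an illustrative sketch, not the official STREAM benchmark: it times the triad operation a[i] = b[i] + s·c[i] over arrays sized well beyond typical cache sizes (so there is no temporal locality) and estimates bandwidth as bytes moved per second:

```python
import time
import numpy as np

def triad_bandwidth(n=10_000_000, scalar=3.0):
    """Estimate sustained memory bandwidth (GB/s) from a
    STREAM-style triad a = b + scalar * c over n doubles."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    t0 = time.perf_counter()
    a = b + scalar * c                 # the triad kernel
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * 8 * n            # read b, read c, write a
    return bytes_moved / elapsed / 1e9

print(f"triad bandwidth ≈ {triad_bandwidth():.1f} GB/s")
```

The reported figure depends on the machine and on Python overhead; the real STREAM benchmark (linked above) is written in C/Fortran for exactly this reason.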