Lecture 6: Multicore Systems
Multicore Computers (chip multiprocessors)
Combine two or more processors (cores) on a single piece of silicon
Each core consists of ALU, registers, pipeline hardware, L1 instruction and data caches
Multithreading is used
Pollack’s Rule
Performance increase is roughly proportional to the square root of the increase in complexity
performance ∝ √complexity
Power consumption increases roughly linearly with the increase in complexity
power consumption ∝ complexity
Pollack’s Rule
complexity   power   performance
    1          1          1
    4          4          2
   25         25          5
100s of low complexity cores, each operating at very low power
Ex: Four small cores
complexity   power   performance
   4×1        4×1         4
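The trade-off in the two tables above can be sketched numerically. This is a minimal illustration of Pollack's Rule, assuming a perfectly parallel workload so that the performance of small cores adds up:

```python
import math

def pollack(complexity):
    """Performance and power of a core with the given complexity,
    normalized to a baseline core of complexity 1 (Pollack's Rule)."""
    performance = math.sqrt(complexity)  # performance ∝ √complexity
    power = complexity                   # power ∝ complexity
    return performance, power

# One big core of complexity 4: power 4, performance 2
big_perf, big_power = pollack(4)
print(big_perf, big_power)            # 2.0 4

# Four small cores of complexity 1 each, workload perfectly parallel:
# same power budget (4 × 1 = 4), but aggregate performance 4 × 1 = 4
small_perf = 4 * pollack(1)[0]
small_power = 4 * pollack(1)[1]
print(small_perf, small_power)        # 4.0 4
```

At equal power, the four small cores deliver twice the performance of the single complex core, which is the argument for manycore chips made of many low-complexity cores.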
Increasing CPU Performance
Manycore Chip Composed of hybrid cores
• Some general purpose
• Some graphics
• Some floating point
Exascale Systems
Board composed of multiple manycore chips sharing memory
Rack composed of multiple boards
A room full of these racks
Millions of cores → exascale systems (10^18 FLOP/s)
Moore’s Law Reinterpreted
Number of cores per chip doubles every 2 years
Number of threads of execution doubles every 2 years
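The reinterpreted doubling rule above can be turned into a quick projection. A small sketch, where the baseline core count of 4 is an illustrative assumption:

```python
def cores_after(years, base_cores=4, doubling_period=2):
    """Projected cores per chip if the count doubles every
    `doubling_period` years (Moore's Law reinterpreted).
    The baseline of 4 cores is an assumption for illustration."""
    return base_cores * 2 ** (years // doubling_period)

print(cores_after(10))  # 4 × 2^5 = 128 cores after a decade
```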
Shared Memory MIMD
Shared memory
• Single address space
• All processes have access to the pool of shared memory
[Diagram: processors P connected by a bus to a shared Memory]
Shared Memory MIMD
Each processor executes different instructions asynchronously, using different data

[Diagram: control units (CU) issue separate instruction streams to processing elements (PE); each PE accesses its own data in the shared memory]
Symmetric Multiprocessors (SMP)
MIMD Shared memory UMA
[Diagram: processors, each with private L1 and L2 caches, connected by a system bus to Main Memory and I/O]
Symmetric Multiprocessors (SMP)
Characteristics:
Two or more similar processors
Processors share the same memory and I/O facilities
Processors are connected by bus or other internal connection scheme, such that memory access time is the same for each processor
All processors share access to I/O devices
All processors can perform the same functions
The system is controlled by an integrated operating system that provides interaction between processors and their programs
Symmetric Multiprocessors (SMP)
Operating system:
Provides tools and functions to exploit the parallelism
Schedules processes or threads across all of the processors
Takes care of
• scheduling of threads and processes on processors
• synchronization among processors
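The OS-level scheduling described above can be sketched from user code: workers share one address space and the OS places them on available processors. A minimal Python sketch (note that CPython's GIL limits CPU-bound thread parallelism; a real SMP kernel schedules native threads across cores):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Every worker reads from the same shared address space
    return sum(chunk)

def parallel_sum(data, workers=None):
    """Split work across one worker per processor and let the
    OS scheduler distribute the threads over the cores."""
    workers = workers or (os.cpu_count() or 1)
    size = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1000))))  # 499500
```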
Multicore Computers
Dedicated L1 Cache
(ARM11 MPCore)
[Diagram: CPU cores 1…n, each with dedicated L1-I and L1-D caches, sharing one L2 cache connected to Main Memory and I/O]
Multicore Computers
Dedicated L2 Cache
(AMD Opteron)
[Diagram: CPU cores 1…n, each with dedicated L1-I, L1-D, and L2 caches, connected to Main Memory and I/O]
Multicore Computers
Shared L2 Cache
(Intel Core Duo)
[Diagram: CPU cores 1…n, each with dedicated L1-I and L1-D caches, sharing a single L2 cache connected to Main Memory and I/O]
Multicore Computers
Shared L3 Cache
(Intel Core i7)
[Diagram: CPU cores 1…n, each with dedicated L1-I, L1-D, and L2 caches, sharing an L3 cache connected to Main Memory and I/O]
Multicore Computers
Advantages of a shared L2 cache:
Reduced overall miss rate
• A thread on one core may cause a frame to be brought into the cache; a thread on another core can then access the same location that has already been brought into the cache
Data shared by multiple cores is not replicated
The amount of shared cache allocated to each core may be dynamic
Interprocessor communication is easy to implement

Advantages of dedicated L2 caches:
Each core can access its private cache more rapidly

L3 cache:
As the amount of memory and the number of cores grow, an L3 cache provides better performance
Multicore Computers
On-chip interconnects: bus, crossbar
Off-chip communication (CPU-to-CPU or I/O): bus-based
Multicore Computers
Multithreading
A multithreaded processor provides a separate PC for each thread (hardware multithreading)
Implicit multithreading
• Concurrent execution of multiple threads extracted from a single sequential program
Explicit multithreading
• Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines
Multicore Computers
Explicit Multithreading
Fine-grained multithreading (interleaved multithreading)
• Processor deals with two or more thread contexts at a time
• Switches from one thread to another at each clock cycle
Coarse-grained multithreading (blocked multithreading)
• Instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs
• This event causes a switch to another thread
Simultaneous multithreading (SMT)
• Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
• Thread-level parallelism is combined with instruction-level parallelism (ILP)
Chip multiprocessing (CMP)
• Each processor of a multicore system handles separate threads

[Figure: pipeline issue slots under coarse-grained, fine-grained, and simultaneous multithreading, and CMP]
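The difference between the first two switching policies can be shown with a toy scheduler, not real hardware: fine-grained switches threads every cycle, coarse-grained runs one thread until a delay event. The thread names and miss cycles are illustrative assumptions:

```python
def fine_grained(threads, cycles):
    """Interleaved multithreading: issue from the threads
    round-robin, switching at every clock cycle."""
    return [threads[c % len(threads)] for c in range(cycles)]

def coarse_grained(threads, miss_cycles, cycles):
    """Blocked multithreading: run one thread until a delay
    event (e.g. cache miss) occurs, then switch to the next."""
    schedule, t = [], 0
    for c in range(cycles):
        schedule.append(threads[t])
        if c in miss_cycles:            # delay event: switch thread
            t = (t + 1) % len(threads)
    return schedule

print(fine_grained(["T0", "T1"], 4))          # ['T0', 'T1', 'T0', 'T1']
print(coarse_grained(["T0", "T1"], {2}, 5))   # ['T0', 'T0', 'T0', 'T1', 'T1']
```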
GPUs (Graphics Processing Units)
Characteristics of GPUs
GPUs are accelerators for CPUs
SIMD
GPUs have many parallel processors and many concurrent threads (e.g. 10 or more cores; 100s or 1000s of threads per core)
CPU-GPU combination is an example for heterogeneous computing
GPGPU (general purpose GPU): using a GPU to perform applications traditionally handled by the CPU
GPUs
Core Complexity
Out-of-order execution
Dynamic branch prediction
Deeper pipelines for higher clock rates
More circuitry
High performance
GPUs
Complex cores are preferable for:
• Highly instruction-parallel numeric applications
• Floating-point applications
A large number of simple cores is preferable when:
• The application's serial part is small
Cache Performance: Intel Core i7
[Figure: cache performance measurements for the Intel Core i7]
Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory:

Arithmetic intensity = floating-point operations / number of data bytes accessed  (FLOPs/byte)
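The definition above can be applied to a concrete kernel. A sketch using the usual textbook accounting for a DAXPY-style loop y[i] = a·x[i] + y[i], assuming no cache reuse (the kernel choice and byte counts are illustrative):

```python
def arithmetic_intensity(flops, bytes_accessed):
    """Arithmetic intensity in FLOPs per byte."""
    return flops / bytes_accessed

n = 1_000_000
flops = 2 * n            # one multiply + one add per element
bytes_moved = 3 * 8 * n  # read x, read y, write y: 8-byte doubles

print(arithmetic_intensity(flops, bytes_moved))  # 1/12 ≈ 0.083 FLOPs/byte
```

An intensity this low means the kernel's performance is bounded by memory bandwidth, not by peak floating-point throughput.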
Roofline Performance Model
Attainable GFLOPs/second = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
Roofline Performance Model
Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second)
For multicore chips, peak performance is the collective performance of all the cores on the chip, so multiply the peak per chip by the number of chips
Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second)
The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as:
Peak memory bandwidth × Arithmetic intensity
(bytes/second) × (FLOPs/byte) ⇒ FLOPs/second
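The min() formula above translates directly into code. A sketch for a hypothetical machine (the 100 GFLOP/s peak and 25 GB/s bandwidth are assumed numbers, not any real chip's specification):

```python
def attainable_gflops(peak_gflops, peak_bw_gbytes, intensity):
    """Roofline bound: min(peak floating-point performance,
    peak memory bandwidth × arithmetic intensity)."""
    return min(peak_gflops, peak_bw_gbytes * intensity)

# Hypothetical machine: 100 GFLOP/s peak, 25 GB/s memory bandwidth
print(attainable_gflops(100, 25, 0.5))  # 12.5 — memory-bound
print(attainable_gflops(100, 25, 8.0))  # 100  — compute-bound
```

Sweeping the intensity argument and plotting the result produces the characteristic "roofline": a rising bandwidth slope that flattens at the compute peak.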
Roofline Performance Model
Roofline sets an upper bound on performance
Roofline of a computer does not vary by benchmark kernel
Stream Benchmark
A synthetic benchmark
Measures the performance of long vector operations
The operations have no temporal locality, and they access arrays that are larger than the cache size
http://www.cs.virginia.edu/stream/ref.html
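A STREAM-style kernel can be sketched in Python with NumPy. This is an illustrative sketch, not the official STREAM benchmark: it times the triad operation a[i] = b[i] + s·c[i] over arrays sized well beyond typical cache sizes (so there is no temporal locality) and estimates bandwidth as bytes moved per second:

```python
import time
import numpy as np

def triad_bandwidth(n=10_000_000, scalar=3.0):
    """Estimate sustained memory bandwidth (GB/s) from a
    STREAM-style triad a = b + scalar * c over n doubles."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    t0 = time.perf_counter()
    a = b + scalar * c                 # the triad kernel
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * 8 * n            # read b, read c, write a
    return bytes_moved / elapsed / 1e9

print(f"triad bandwidth ≈ {triad_bandwidth():.1f} GB/s")
```

The reported figure depends on the machine and on Python overhead; the real STREAM benchmark (linked above) is written in C/Fortran for exactly this reason.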