Lecture 6: Multicore Systems

Upload: joel-mays

Post on 01-Jan-2016



TRANSCRIPT

Page 1: Lecture 6: Multicore  Systems

Lecture 6: Multicore Systems

Page 2: Lecture 6: Multicore  Systems

Multicore Computers (chip multiprocessors)

Combine two or more processors (cores) on a single piece of silicon

Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches

Multithreading is used

Page 3: Lecture 6: Multicore  Systems

Pollack’s Rule

Performance increase is roughly proportional to the square root of the increase in complexity:

performance ∝ √complexity

Power consumption increase is roughly linearly proportional to the increase in complexity:

power consumption ∝ complexity

Page 4: Lecture 6: Multicore  Systems

Pollack’s Rule

complexity   power   performance
    1          1          1
    4          4          2
   25         25          5

100s of low-complexity cores, each operating at very low power

Ex: four small cores

complexity   power   performance
   4×1        4×1         4
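The tables above can be sketched in a few lines of Python (a hypothetical illustration of the arithmetic, not part of the slides), comparing one complex core against four simple ones under the same power budget:

```python
import math

def pollack_performance(complexity):
    # Pollack's Rule: performance grows as the square root of complexity,
    # while power grows roughly linearly with complexity.
    return math.sqrt(complexity)

# One core with 4x the complexity (and ~4x the power) of a baseline core:
big_core = pollack_performance(4)        # 2x baseline performance

# Four baseline cores: same total complexity and power budget,
# but (assuming perfectly parallel work) 4x baseline throughput:
four_small = 4 * pollack_performance(1)  # 4x baseline performance

print(big_core, four_small)
```

Under the same power budget, the four small cores deliver twice the throughput of the single complex core, which is the argument for manycore designs, assuming the workload parallelizes.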

Page 5: Lecture 6: Multicore  Systems

Increasing CPU Performance

Manycore chip composed of hybrid cores:

• Some general purpose

• Some graphics

• Some floating point

Page 6: Lecture 6: Multicore  Systems

Exascale Systems

Board composed of multiple manycore chips sharing memory

Rack composed of multiple boards

A room full of these racks

Millions of cores: exascale systems (10^18 Flop/s)

Page 7: Lecture 6: Multicore  Systems

Moore’s Law Reinterpreted

Number of cores per chip doubles every 2 years

Number of threads of execution doubles every 2 years

Page 8: Lecture 6: Multicore  Systems

Shared Memory MIMD

Shared memory

• Single address space

• All processes have access to the pool of shared memory

[Diagram: four processors (P) connected by a bus to a shared memory]
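The single-address-space model can be illustrated in software. In the sketch below (a software analogy using OS threads rather than hardware processors; the names are invented for illustration), several threads update one shared variable, with a lock providing the synchronization that a shared memory pool requires:

```python
import threading

counter = 0              # lives in the single shared address space
lock = threading.Lock()  # synchronizes access to the shared location

def worker(n):
    global counter
    for _ in range(n):
        with lock:       # serialize updates so no increment is lost
            counter += 1

# Four "processors" sharing one pool of memory.
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```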

Page 9: Lecture 6: Multicore  Systems

Shared Memory MIMD

Each processor executes different instructions asynchronously, using different data.

[Diagram: four control units (CU), each issuing its own instruction stream to a processing element (PE); each PE operates on its own data in a shared memory]

Page 10: Lecture 6: Multicore  Systems

Symmetric Multiprocessors (SMP)

MIMD, shared memory, UMA

[Diagram: two processors, each with dedicated L1 and L2 caches, connected by a system bus to main memory and I/O modules]

Page 11: Lecture 6: Multicore  Systems

Symmetric Multiprocessors (SMP)

Characteristics:

Two or more similar processors

Processors share the same memory and I/O facilities

Processors are connected by a bus or other internal connection scheme, such that memory access time is the same for each processor

All processors share access to I/O devices

All processors can perform the same functions

The system is controlled by an integrated operating system that provides interaction between processors and their programs

Page 12: Lecture 6: Multicore  Systems

Symmetric Multiprocessors (SMP)

Operating system:

Provides tools and functions to exploit the parallelism

Schedules processes or threads across all of the processors

Takes care of

• scheduling of threads and processes on processors

• synchronization among processors

Page 13: Lecture 6: Multicore  Systems

Multicore Computers

Dedicated L1 Cache

(ARM11 MPCore)

[Diagram: CPU cores 1 to n, each with dedicated L1-I and L1-D caches; a shared L2 cache connects to main memory and I/O]

Page 14: Lecture 6: Multicore  Systems

Multicore Computers

Dedicated L2 Cache

(AMD Opteron)

[Diagram: CPU cores 1 to n, each with dedicated L1-I/L1-D caches and a dedicated L2 cache; a bus connects them to main memory and I/O]

Page 15: Lecture 6: Multicore  Systems

Multicore Computers

Shared L2 Cache

(Intel Core Duo)

[Diagram: CPU cores 1 to n, each with dedicated L1-I/L1-D caches, sharing a single L2 cache connected to main memory and I/O]

Page 16: Lecture 6: Multicore  Systems

Multicore Computers

Shared L3 Cache

(Intel Core i7)

[Diagram: CPU cores 1 to n, each with dedicated L1-I/L1-D and L2 caches, sharing an L3 cache connected to main memory and I/O]

Page 17: Lecture 6: Multicore  Systems

Multicore Computers

Advantages of shared L2 cache:

• Reduced overall miss rate: a thread on one core may cause a frame to be brought into the cache, and a thread on another core may then access the same location that has already been brought in

• Data shared by multiple cores is not replicated

• The amount of shared cache allocated to each core may be dynamic

• Interprocessor communication is easy to implement

Advantage of dedicated L2 cache:

• Each core can access its private cache more rapidly

L3 cache:

• When the amount of memory and the number of cores grow, an L3 cache provides better performance

Page 18: Lecture 6: Multicore  Systems

Multicore Computers

On-chip interconnects: bus, crossbar

Off-chip communication (CPU-to-CPU or I/O): bus-based

Page 19: Lecture 6: Multicore  Systems

Multicore Computers (chip multiprocessors)

Combine two or more processors (cores) on a single piece of silicon

Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches

Multithreading is used

Page 20: Lecture 6: Multicore  Systems

Multicore Computers

Multithreading

A multithreaded processor provides a separate PC for each thread (hardware multithreading)

Implicit multithreading

• Concurrent execution of multiple threads extracted from a single sequential program

Explicit multithreading

• Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines

Page 21: Lecture 6: Multicore  Systems

Multicore Computers Explicit Multithreading

Fine-grained multithreading (interleaved multithreading)

• Processor deals with two or more thread contexts at a time

• Switches from one thread to another at each clock cycle

Coarse-grained multithreading (blocked multithreading)

• Instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs

• This event causes a switch to another thread

Simultaneous multithreading (SMT)

• Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor

• Thread-level parallelism is combined with instruction-level parallelism (ILP)

Chip multiprocessing (CMP)

• Each processor of a multicore system handles separate threads
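The fine-grained policy above can be mimicked by a small software simulation (a sketch with invented names, not real hardware behavior): issue one instruction per cycle, switching to the next ready thread every cycle, round-robin.

```python
from collections import deque

def fine_grained_schedule(threads):
    # threads: one instruction list per hardware thread context.
    # Issue one instruction per cycle, switching thread every cycle
    # (interleaved multithreading); finished threads drop out.
    queues = deque((tid, deque(t)) for tid, t in enumerate(threads))
    trace = []
    while queues:
        tid, q = queues.popleft()
        trace.append((tid, q.popleft()))
        if q:                       # re-queue the thread if it has work left
            queues.append((tid, q))
    return trace

trace = fine_grained_schedule([["ld", "add", "st"], ["mul", "sub"]])
print(trace)  # [(0, 'ld'), (1, 'mul'), (0, 'add'), (1, 'sub'), (0, 'st')]
```

Coarse-grained multithreading would instead stay on one thread until a simulated long-latency event (such as a cache miss) forces a switch.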

Page 22: Lecture 6: Multicore  Systems

Coarse-grained, Fine-grained, Simultaneous Multithreading (SMT), CMP

Page 23: Lecture 6: Multicore  Systems

GPUs (Graphics Processing Units)

Characteristics of GPUs

GPUs are accelerators for CPUs

SIMD

GPUs have many parallel processors and many concurrent threads (i.e. 10 or more cores; 100s or 1000s of threads per core)

The CPU-GPU combination is an example of heterogeneous computing

GPGPU (general purpose GPU): using a GPU to perform applications traditionally handled by the CPU

Page 24: Lecture 6: Multicore  Systems

GPUs

Page 25: Lecture 6: Multicore  Systems

GPUs

Core Complexity

Out-of-order execution

Dynamic branch prediction

Deeper pipelines for higher clock rates

More circuitry

High performance

Page 26: Lecture 6: Multicore  Systems

GPUs

Complex cores are preferable for:

• Highly instruction-parallel numeric applications

• Floating-point applications

A large number of simple cores is preferable when:

• The application's serial part is small

Page 27: Lecture 6: Multicore  Systems

Cache Performance: Intel Core i7

Page 28: Lecture 6: Multicore  Systems

Roofline Performance Model

Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory

Arithmetic intensity = floating-point operations / number of data bytes = FLOPs/byte
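For example, a triad-style kernel a[i] = b[i] + s * c[i] performs two floating-point operations per element while moving three 8-byte doubles, so its arithmetic intensity is low (the kernel choice here is an illustrative assumption):

```python
# Arithmetic intensity of a triad kernel a[i] = b[i] + s * c[i]:
flops_per_element = 2        # one multiply, one add
bytes_per_element = 3 * 8    # read b[i], read c[i], write a[i] (8-byte doubles)

arithmetic_intensity = flops_per_element / bytes_per_element
print(arithmetic_intensity)  # ~0.083 FLOPs/byte -> strongly memory-bound
```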

Page 29: Lecture 6: Multicore  Systems

Roofline Performance Model

Attainable GFLOPs/second = min (Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)

Page 30: Lecture 6: Multicore  Systems

Roofline Performance Model

Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second). For multicore chips, peak performance is the collective performance of all the cores on the chip, so multiply the peak per core by the number of cores.

Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second).

The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as:

Peak memory bandwidth × Arithmetic intensity

(bytes/second) × (FLOPs/byte) ⇒ FLOPs/second
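The min() formula above is short enough to sketch directly (the machine numbers below are hypothetical, chosen only to show one memory-bound and one compute-bound case):

```python
def attainable_gflops(peak_gflops, peak_bw_gb_per_s, arithmetic_intensity):
    # Roofline model: performance is capped by the lower of the flat
    # compute roof and the memory-bandwidth slope at this intensity.
    return min(peak_gflops, peak_bw_gb_per_s * arithmetic_intensity)

# Hypothetical machine: 100 GFLOP/s peak compute, 25 GB/s memory bandwidth.
print(attainable_gflops(100.0, 25.0, 0.5))  # 12.5  -> memory-bound
print(attainable_gflops(100.0, 25.0, 8.0))  # 100.0 -> compute-bound
```

The crossover (ridge point) of this hypothetical machine sits at 100 / 25 = 4 FLOPs/byte: kernels below it are limited by memory bandwidth, kernels above it by compute.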

Page 31: Lecture 6: Multicore  Systems

Roofline Performance Model

Roofline sets an upper bound on performance

Roofline of a computer does not vary by benchmark kernel

Page 32: Lecture 6: Multicore  Systems

Stream Benchmark

• A synthetic benchmark

• Measures the performance of long vector operations

• These operations have no temporal locality and access arrays that are larger than the cache size

• http://www.cs.virginia.edu/stream/ref.html
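The core idea of the benchmark can be sketched in Python (an illustrative analogue of the compiled C original; real STREAM uses much larger arrays and reports the best of several trials):

```python
import time

N = 2_000_000   # assumed large; the C original sizes arrays to exceed all caches
scalar = 3.0
b = [1.0] * N
c = [2.0] * N

start = time.perf_counter()
# STREAM "triad" kernel: each element is touched exactly once,
# so the loop has no temporal locality.
a = [b[i] + scalar * c[i] for i in range(N)]
elapsed = time.perf_counter() - start

# Triad moves three 8-byte arrays: 24 bytes per element.
gb_per_s = 24 * N / elapsed / 1e9
print(f"triad: {gb_per_s:.2f} GB/s (interpreter overhead makes this a lower bound)")
```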