Structure of Computer Systems Course 6 Multi-core systems


Upload: zachary-woolcock

Post on 14-Dec-2015


Page 1: Structure of Computer Systems Course 6 Multi-core systems

Structure of Computer Systems

Course 6

Multi-core systems

Page 2: Structure of Computer Systems Course 6 Multi-core systems

Multithreading and multi-processing

Exploiting different forms of parallelism:
• data level parallelism (DLP) – the same operations applied to a set of data – SIMD architectures, multiple ALUs
• instruction level parallelism (ILP) – instruction phases executed in parallel – pipeline architectures
• thread level parallelism (TLP) – instruction sequences/streams executed in parallel – hyper-threading, multiprocessor architectures (multi-core, GRID, cloud, parallel computers)

Thread level parallelism execution issues:
• synchronization between threads
• data consistency
• concurrent access to shared resources
• communication between threads

Page 3: Structure of Computer Systems Course 6 Multi-core systems

Multiprocessing

Limits of performance increase: Amdahl's law
• S – speedup of a parallel execution
• ts – time for sequential execution
• tp – time for parallel execution
• q – fraction of a program which can be executed in parallel
• n – number of nodes/threads

\[
S = \frac{t_s}{t_p} = \frac{t_s}{(1-q)\,t_s + q\,t_s/n} = \frac{1}{(1-q) + q/n}
\]

Examples:

q=50%, n->∞ => S=2

q=75%, n->∞ => S=4

q=95%, n->∞ => S=20
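The limits in the examples above can be checked numerically. The following is a minimal sketch of Amdahl's law as defined on this slide; the function names `amdahl_speedup` and `amdahl_limit` are illustrative choices, not part of the course material.

```python
def amdahl_speedup(q, n):
    """Speedup S = 1 / ((1 - q) + q / n) for parallel fraction q on n nodes/threads."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - q) + q / n)


def amdahl_limit(q):
    """Upper bound on the speedup as n grows without limit: 1 / (1 - q)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    return float("inf") if q == 1.0 else 1.0 / (1.0 - q)


if __name__ == "__main__":
    for q in (0.50, 0.75, 0.95):
        row = ", ".join(f"n={n}: {amdahl_speedup(q, n):.2f}" for n in (2, 4, 16, 1024))
        print(f"q={q:.2f} -> {row}, limit: {amdahl_limit(q):.2f}")
```

Running the script shows how slowly the speedup approaches its limit: the sequential part (1 - q) dominates once n is large.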

Page 4: Structure of Computer Systems Course 6 Multi-core systems

Hyper-threading

• hyper-threading – parallel execution of instruction streams on a single CPU
• Idea: when a thread is stalled because of a hazard, another thread can be executed
• Solution:
  • two threads are executed in parallel on the same pipelined CPU
  • after every stage, two buffers (registers) store the partial results of the two threads
• Speedup – approximately 30%
• The operating system will detect 2 logical CPUs!

[Figure: five-stage pipeline (IF, ID, Ex, M, Wb), shown single-threaded (one thread) and hyper-threaded (Thread 1 and Thread 2).]
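The idea of filling one thread's stall cycles with instructions from another thread can be illustrated with a toy model. This is only a sketch under strong assumptions that are not in the slides: one issue slot per cycle, each instruction described only by the number of stall cycles it causes for its own thread, and a round-robin choice between ready threads. It does not model a real pipeline, and the function names are hypothetical.

```python
def run_sequential(threads):
    """Cycles needed when threads run one after another; stalls are simply wasted."""
    return sum(1 + stall for thread in threads for stall in thread)


def run_interleaved(threads):
    """Cycles needed when a ready thread may issue while another one is stalled."""
    count = len(threads)
    next_instr = [0] * count   # index of the next instruction of each thread
    ready_at = [0] * count     # first cycle in which each thread may issue again
    cycle = 0
    last = -1                  # thread that issued most recently (for round-robin)
    while any(next_instr[i] < len(threads[i]) for i in range(count)):
        for step in range(1, count + 1):
            i = (last + step) % count
            if next_instr[i] < len(threads[i]) and ready_at[i] <= cycle:
                stall = threads[i][next_instr[i]]
                next_instr[i] += 1
                ready_at[i] = cycle + 1 + stall  # issue cycle plus its stall cycles
                last = i
                break
        cycle += 1             # the slot stays empty if no thread was ready
    return max(ready_at)


if __name__ == "__main__":
    # Each number is the stall (in cycles) caused by one instruction (assumed values).
    workload = [[0, 2, 0, 2], [0, 2, 0, 2]]
    print("sequential :", run_sequential(workload), "cycles")
    print("interleaved:", run_interleaved(workload), "cycles")
```

The gain in this model depends entirely on the assumed stall pattern, so it should not be read as a prediction of the roughly 30% figure quoted on the slide.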

Page 5: Structure of Computer Systems Course 6 Multi-core systems

Multiprocessors

Parallel execution of instruction streams on multiple CPUs

Implementations:
• multi-core architectures – multiple CPUs in a single integrated circuit (IC)
• parallel computers – multiple CPUs on different ICs, but in the same computer infrastructure
• distributed computing facilities – multiple CPUs on different computers, connected through a network
  • network of PCs
  • GRID architectures – distributed computing resources for virtual organizations (VOs), mainly for batch processing
  • cloud architectures – computing resources (execution and storage) offered as a service; they can be rented dynamically
• combination of all the above: multi-cores in parallel computers, building distributed computing facilities

Page 6: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Why multi-core:
• it is difficult to make single-core clock frequencies even higher; in the last 4-5 years the clock frequency growth saturated at 2.5-3 GHz
• power consumption and dissipation problems (higher frequency means more power)
• pipeline architectures (instruction level parallelism) reached their efficiency limits (around 20 pipeline stages)
• designing a very complex CPU (with multiple optimization schemes involved) requires coordination of very large design teams
• many new applications are multithreaded (e.g. servers that solve multiple concurrent requests, agent systems, gaming, simulation, etc.)

Page 7: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Issues (decision choices):
• same or different functionalities for CPUs (homogeneous vs. heterogeneous CPUs)
  • symmetric cores (SMP – symmetric multi-core processor) – every core has the same structure and functionality
  • asymmetric cores (ASMP) – there are coordination cores and (simpler) specialized cores
• the relation with the memory
  • symmetric memory access – SYMA (more commonly called uniform memory access, UMA)
  • non-uniform memory access – NUMA
• connection between cores
  • common bus – parallel or network-based (see network-on-chip)
  • crossbar – multiple connections controlled with a switch
  • memory hierarchy (cache) – common memory zones

Page 8: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

architectural solutions

[Figure: Symmetric multi-core with private L1 cache and shared L2 and memory. Labels: two cores, one L1 per core, L2, switch, memory.]

[Figure: Symmetric multi-core with partially shared L2 and L3. Labels: four cores, one L1 per core, two L2 caches, L3, crossbar, Memory Module 1, Memory Module 2.]

Page 9: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

architectural solutions (cont.)

[Figure: Heterogeneous multi-core with local and shared cache. Labels: core (2x SMT) with L1 and L2, further cores each with a local store, I/O, memory module.]

[Figure: Two processors with two cores and shared memory. Labels: Processor 1 and Processor 2, each with two cores, one L1 per core, L2 and a switch; memory.]

Additional label in the original figures: ring network.

Page 10: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Shared cache
• high speed memory used by a number of cores (CPUs)
• advantages:
  • efficient allocation of existing memory space
  • one core may pre-fetch data for the other core
  • sharing of common data
  • no cache coherence problems
  • fewer accesses to external memory
• drawbacks:
  • conflict between cores when allocating space in the cache; one core may replace the other core's data
  • more complex control circuit and longer latency because of the switching
  • one core may block the other core's access

Page 11: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Cache coherence of private memory
• How to keep the data consistent across caches?
• solutions:
  • write-through – every write is also made in the memory – not so efficient
  • snooping and invalidation – each core snoops the bus and invalidates its cache line if a write from another core affects its cache's content (e.g. Pentium Pro's P6 bus – snooping phase)

[Figure: core 1 to core 4, each with its own cache, connected to the memory. Labels: write, read, inconsistency.]
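The two solutions can be combined in a small software model. The sketch below assumes write-through caches on a shared bus, with every other cache invalidating its copy of a written address. The class names `Bus` and `Cache`, the `snooping` switch and the `demo` helper are illustrative inventions; real coherence protocols track more state per cache line than this.

```python
class Bus:
    """Shared bus on which attached caches observe each other's writes."""

    def __init__(self, snooping=True):
        self.snooping = snooping
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def announce_write(self, writer, address):
        # With snooping enabled, every other cache drops its copy of the line.
        if not self.snooping:
            return
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(address)


class Cache:
    """Private write-through cache of one core."""

    def __init__(self, memory, bus):
        self.memory = memory
        self.bus = bus
        self.lines = {}
        bus.attach(self)

    def read(self, address):
        if address not in self.lines:          # miss: fetch from memory
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value
        self.memory[address] = value           # write-through to memory
        self.bus.announce_write(self, address)

    def invalidate(self, address):
        self.lines.pop(address, None)


def demo(snooping):
    """Core 2 caches a value, core 1 overwrites it, core 2 reads it again."""
    memory = {0x10: 1}
    bus = Bus(snooping=snooping)
    core1, core2 = Cache(memory, bus), Cache(memory, bus)
    core2.read(0x10)
    core1.write(0x10, 42)
    return core2.read(0x10)


if __name__ == "__main__":
    print("with snooping   :", demo(True))
    print("without snooping:", demo(False))
```

With snooping disabled, core 2 keeps returning its stale copy even though memory already holds the new value, which is the inconsistency the slide refers to.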

Page 12: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Symmetric vs. asymmetric cores

Symmetric architecture
• all cores are the same
• cores can perform any task; they are interchangeable
• advantages:
  • easy to build (simple replication)
  • easy to program, to compile and to execute multithreaded programs
• examples:
  • Intel, AMD – dual-core and quad-core, Core 2
  • Sun – UltraSPARC T1 (Niagara) – 8 cores

Page 13: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Symmetric vs. asymmetric cores (cont.)

Asymmetric (heterogeneous) architecture
• some cores have different functionalities:
  • 1-2 master cores and many (simpler) slave cores
  • 1 main core and multiple specialized cores (graphics, floating-point, multimedia)
• compilation must take into account which functionalities each core can perform
• advantages:
  • many more simple cores can be integrated
• examples:
  • IBM – Cell processor – used in the PlayStation 3

Page 14: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Asymmetric (heterogeneous) architecture

IBM Cell architecture: 9 cores
• 1 PPE – Power Processor Element
  • coordination and data transfer
• 8 SPEs – Synergistic Processing Elements
  • specialized mathematical units
• applications:
  • supercomputers
  • PlayStations
  • home cinema
  • video cards

Page 15: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Advantages of multi-core processors:
• Signals between different CPUs travel shorter distances, so those signals degrade less.
• These higher quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often.
• Cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip.
• A dual-core processor uses slightly less power than two coupled single-core processors.

Page 16: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Disadvantages of multi-core processors:
• The ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.
• Most current video games will run faster on a 3 GHz single-core processor than on a 2 GHz dual-core processor (of the same core architecture).
• Two processing cores sharing the same system bus and memory bandwidth limit the real-world performance advantage.
• If a single core is close to being memory bandwidth limited, going to dual-core might only give a 30% to 70% improvement.
• If memory bandwidth is not a problem, a 90% improvement can be expected.
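The 3 GHz single-core versus 2 GHz dual-core comparison can be worked through with Amdahl's law from earlier in the course. The sketch below rests on simplifying assumptions that are not in the slides: performance scales linearly with clock frequency, the parallel fraction q follows Amdahl's law exactly, and memory bandwidth is not a bottleneck. The function names are hypothetical.

```python
def relative_throughput(freq_ghz, cores, q):
    """Clock frequency scaled by the Amdahl speedup for parallel fraction q."""
    return freq_ghz / ((1.0 - q) + q / cores)


def break_even_fraction(single_ghz, multi_ghz, cores):
    """Parallel fraction q at which the slower multi-core matches the faster single core.

    Derived from single_ghz = multi_ghz / ((1 - q) + q / cores).
    """
    return (1.0 - multi_ghz / single_ghz) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    q_star = break_even_fraction(3.0, 2.0, 2)
    print(f"break-even parallel fraction: {q_star:.3f}")
    for q in (0.3, 0.5, 0.9):
        single = relative_throughput(3.0, 1, q)
        dual = relative_throughput(2.0, 2, q)
        print(f"q={q:.1f}: single-core {single:.2f}, dual-core {dual:.2f}")
```

Under these assumptions the 2 GHz dual-core only overtakes the 3 GHz single-core when more than two thirds of the work can run in parallel, which is consistent with the slide's remark about games that make little use of multiple threads.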

Page 17: Structure of Computer Systems Course 6 Multi-core systems

Multi-core processors

Thread affinity
• we can specify whether a thread may be executed on any core or only on a specific core
• soft affinity – controlled by the operating system
  • an interrupted thread should continue on the same core
• hard affinity – flags associated with a thread that indicate on which core(s) it may be executed
  • useful for real-time and control applications – to reduce the load on a core on which critical threads are executed
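Hard affinity can be tried out from a script. The sketch below uses `os.sched_getaffinity` and `os.sched_setaffinity` from the Python standard library, which are only available on some Unix platforms (notably Linux); the helper name `pin_to_cores` is a hypothetical choice, and other operating systems expose affinity through different interfaces.

```python
import os


def pin_to_cores(cores):
    """Restrict the calling process to the given core indices (hard affinity).

    Returns the affinity set that is in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity calls are not available on this platform")
    allowed = os.sched_getaffinity(0)        # pid 0 means the calling process
    wanted = set(cores) & allowed            # ignore cores we are not allowed to use
    if not wanted:
        raise ValueError("none of the requested cores is available to this process")
    os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        before = os.sched_getaffinity(0)
        print("allowed cores before:", sorted(before))
        print("allowed cores after :", sorted(pin_to_cores({min(before)})))
        os.sched_setaffinity(0, before)      # restore the original mask
    else:
        print("CPU affinity calls are not available on this platform")
```

Without such a call, placement is left to the operating system scheduler, which corresponds to the soft affinity case above.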