multi core 15213 sp07
TRANSCRIPT
-
7/30/2019 Multi Core 15213 Sp07
Multi-core architectures
Jernej Barbic
15-213, Spring 2007
May 3, 2007
-
Single-core computer
-
Single-core CPU chip
(figure: single-core CPU chip, with the single core highlighted)
-
Multi-core architectures
This lecture is about a new trend in computer architecture:
replicating multiple processor cores on a single die.
(figure: Multi-core CPU chip — Core 1, Core 2, Core 3, Core 4)
-
Multi-core CPU chip
The cores fit on a single processor socket
Also called CMP (Chip Multi-Processor)
(figure: core 1, core 2, core 3, and core 4 side by side on one chip)
-
The cores run in parallel
(figure: thread 1, thread 2, thread 3, and thread 4 running on core 1 through core 4, respectively)
-
Interaction with the Operating System
The OS perceives each core as a separate processor
The OS scheduler maps threads/processes to different cores
Most major OSes support multi-core today:
Windows, Linux, Mac OS X, ...
-
Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits:
- heat problems
- speed-of-light problems
- difficult design and verification
- large design teams necessary
- server farms need expensive air-conditioning
Many new applications are multithreaded
General trend in computer architecture (shift towards more parallelism)
-
Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can re-order, pipeline
instructions, split them into
microinstructions, do aggressive branch
prediction, etc.
Instruction-level parallelism enabled rapid
increases in processor speeds over the
last 15 years
-
Thread-level parallelism (TLP)
This is parallelism on a coarser scale
A server can serve each client in a separate thread (Web server, database server)
A computer game can do AI, graphics, and physics in three separate threads
Single-core superscalar processors cannot fully exploit TLP
Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
-
Multiprocessor memory types
Shared memory:
In this model, there is one (large) common
shared memory for all processors
Distributed memory:
In this model, each processor has its own
(small) local memory, and its content is not
replicated anywhere else
-
A multi-core processor is a special kind of multiprocessor:
All processors are on the same chip
Multi-core processors are MIMD:
Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
Multi-core is a shared-memory multiprocessor:
All cores share the same memory
-
What applications benefit from multi-core?
Database servers
Web servers (Web commerce)
Compilers
Multimedia applications
Scientific applications, CAD/CAM
(each of these can run on its own core)
In general: applications with thread-level parallelism (as opposed to instruction-level parallelism)
-
More examples
Editing a photo while recording a TV show
through a digital video recorder
Downloading software while running an
anti-virus program
Anything that can be threaded today will map efficiently to multi-core
BUT: some applications are difficult to parallelize
-
A technique complementary to multi-core:
Simultaneous multithreading
Problem addressed: the processor pipeline can get stalled:
Waiting for the result of a long floating-point (or integer) operation
Waiting for data to arrive from memory
Other execution units wait unused
(figure: processor pipeline — BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, uCode ROM, BTB, L2 Cache and Control, Bus. Source: Intel)
-
Simultaneous multithreading (SMT)
Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
Weaving together multiple threads
on the same core
Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units
-
Without SMT, only a single thread can run at any given time
(figure: pipeline occupied only by Thread 1, a floating-point thread)
-
Without SMT, only a single thread can run at any given time
(figure: pipeline occupied only by Thread 2, an integer thread)
-
SMT processor: both threads can run concurrently
(figure: pipeline with Thread 1 in the floating-point unit and Thread 2 in the integer unit at the same time)
-
But: can't simultaneously use the same functional unit
(figure: Thread 1 and Thread 2 both contending for the single integer unit — IMPOSSIBLE)
This scenario is impossible with SMT on a single core (assuming a single integer unit)
-
SMT not a true parallel processor
Enables better threading (e.g. up to 30%)
OS and applications perceive each
simultaneous thread as a separate
virtual processor
The chip has only a single copy
of each resource
Compare to multi-core:
each core has its own copy of resources
-
Multi-core:
threads can run on separate cores
(figure: two complete pipelines, one per core — Thread 1 runs on the first core, Thread 2 on the second)
-
Multi-core:
threads can run on separate cores
(figure: two complete pipelines, one per core — Thread 3 runs on the first core, Thread 4 on the second)
-
Combining Multi-core and SMT
Cores can be SMT-enabled (or not)
The different combinations:
Single-core, non-SMT: standard uniprocessor
Single-core, with SMT
Multi-core, non-SMT
Multi-core, with SMT: our fish machines
The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
Intel calls them hyper-threads
-
SMT Dual-core: all four threads can
run concurrently
(figure: two SMT pipelines — Threads 1 and 3 share the first core, Threads 2 and 4 share the second)
-
Comparison: multi-core vs SMT
Advantages/disadvantages?
-
Comparison: multi-core vs SMT
Multi-core:
Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
However, great with thread-level parallelism
SMT:
Can have one large and fast superscalar core
Great performance on a single thread
Mostly still only exploits instruction-level parallelism
-
The memory hierarchy
If simultaneous multithreading only:
all caches shared
Multi-core chips:
L1 caches private
L2 caches private in some architectures
and shared in others
Memory is always shared
-
Fish machines
Dual-core Intel Xeon processors
Each core is hyper-threaded
Private L1 caches
Shared L2 caches
(figure: CORE 0 and CORE 1, each with hyper-threads and a private L1 cache, sharing one L2 cache and main memory)
-
Designs with private L2 caches
(figure, left: CORE 0 and CORE 1, each with private L1 and L2 caches, sharing main memory)
Both L1 and L2 are private
Examples: AMD Opteron, AMD Athlon, Intel Pentium D
(figure, right: the same design with a private L3 cache added per core)
A design with L3 caches
Example: Intel Itanium 2
-
Private vs shared caches?
Advantages/disadvantages?
-
Private vs shared caches
Advantages of private:
They are closer to the core, so access is faster
Reduces contention
Advantages of shared:
Threads on different cores can share the same cache data
More cache space is available if a single (or a few) high-performance thread runs on the system
-
The cache coherence problem
Since we have private caches: how do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores
-
The cache coherence problem
Suppose variable x initially contains 15213
(figure: Core 1 through Core 4, each with one or more levels of cache; main memory holds x=15213; all on one multi-core chip)
-
The cache coherence problem
Core 1 reads x
(figure: Core 1's cache now holds x=15213; main memory holds x=15213)
-
The cache coherence problem
Core 2 reads x
(figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213)
-
The cache coherence problem
Core 1 writes to x, setting it to 21660
(figure: Core 1's cache holds x=21660 and main memory holds x=21660, assuming write-through caches; Core 2's cache still holds x=15213)
-
The cache coherence problem
Core 2 attempts to read x and gets a stale copy
(figure: Core 2's cache still holds x=15213 while main memory holds x=21660)
-
Solutions for cache coherence
This is a general problem with
multiprocessors, not limited just to multi-core
There exist many solution algorithms,
coherence protocols, etc.
A simple solution:
invalidation-based protocol with snooping
-
Inter-core bus
(figure: Core 1 through Core 4, each with one or more levels of cache, connected by an inter-core bus; main memory below)
-
Invalidation protocol with snooping
Invalidation:
If a core writes to a data item, all other
copies of this data item in other caches
are invalidated
Snooping:
All cores continuously snoop (monitor)
the bus connecting the cores.
-
The cache coherence problem
Revisited: Cores 1 and 2 have both read x
(figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213)
-
The cache coherence problem
Core 1 writes to x, setting it to 21660
(figure: Core 1 writes x=21660 and sends an invalidation request over the inter-core bus; Core 2's copy of x is INVALIDATED; main memory holds x=21660, assuming write-through caches)
-
The cache coherence problem
After invalidation:
(figure: only Core 1's cache still holds x=21660; Core 2's copy is gone; main memory holds x=21660)
-
The cache coherence problem
Core 2 reads x. Cache misses, and it loads the new copy.
(figure: Core 2's cache now holds x=21660, matching Core 1's cache and main memory)
-
Alternative to invalidate protocol:
update protocol
Core 1 writes x=21660:
(figure: Core 1 broadcasts the updated value over the inter-core bus; Core 2's copy is UPDATED to x=21660; main memory holds x=21660, assuming write-through caches)
-
Which do you think is better?
Invalidation or update?
-
Invalidation vs update
Multiple writes to the same location
invalidation: only the first time
update: must broadcast each write
(which includes new variable value)
Invalidation generally performs better:
it generates less bus traffic
-
Invalidation protocols
This was just the basic
invalidation protocol
More sophisticated protocols
use extra cache state bits
MSI, MESI
(Modified, Exclusive, Shared, Invalid)
-
Programming for multi-core
Programmers must use threads or
processes
Spread the workload across multiple cores
Write parallel algorithms
OS will map threads/processes to cores
-
Thread safety very important
Pre-emptive context switching: a context switch can happen AT ANY TIME
True concurrency, not just uniprocessor time-slicing
Concurrency bugs are exposed much faster with multi-core
-
However: Need to use synchronization
even if only time-slicing on a uniprocessor

int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
-
Need to use synchronization even if only time-slicing on a uniprocessor

temp1 = counter;
counter = temp1 + 1;
temp2 = counter;
counter = temp2 + 1;
gives counter = 2

temp1 = counter;
temp2 = counter;
counter = temp1 + 1;
counter = temp2 + 1;
gives counter = 1
-
Assigning threads to the cores
Each thread/process has an affinity mask
Affinity mask specifies what cores thethread is allowed to run on
Different threads can have different masks
Affinities are inherited across fork()
-
Affinity masks are bit vectors
Example: 4-way multi-core, without SMT

   1      1      0      1
core 3 core 2 core 1 core 0

Process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
-
Affinity masks when multi-core and SMT combined
Separate bits for each simultaneous thread
Example: 4-way multi-core, 2 threads per core

   1     1  |   0     0  |   1     0  |   1     1
thread thread|thread thread|thread thread|thread thread
   1     0  |   1     0  |   1     0  |   1     0
  core 3    |  core 2    |  core 1    |  core 0

Core 2 can't run the process
Core 1 can only use one simultaneous thread
-
Default Affinities
The default affinity mask is all 1s: all threads can run on all processors
Then, the OS scheduler decides what threads run on which cores
The OS scheduler detects skewed workloads, migrating threads to less busy processors
-
Process migration is costly
Need to restart the execution pipeline
Cached data is invalidated
The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
This is called soft affinity
-
Hard affinities
The programmer can prescribe her own
affinities (hard affinities)
Rule of thumb: use the default scheduler unless there is a good reason not to
-
When to set your own affinities
Two (or more) threads share data structures in memory:
map them to the same core so that they can share the cache
Real-time threads. Example: a thread running a robot controller:
- must not be context-switched, or else the robot can go unstable
- dedicate an entire core just to this thread
(Source: Sensable.com)
-
Kernel scheduler API

#include <sched.h>

int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process pid and stores it into the space pointed to by mask.
len is the system word size: sizeof(unsigned long)
-
Kernel scheduler API

#include <sched.h>

int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Sets the current affinity mask of process pid to *mask.
len is the system word size: sizeof(unsigned long)

To query the affinity of a running process:
[barbic@bonito ~]$ taskset -p 3935
pid 3935's current affinity mask: f
-
Windows Task Manager
(figure: Task Manager performance graphs showing CPU usage for core 1 and core 2)
-
Legal licensing issues
Will software vendors charge a separate license for each core, or only a single license per chip?
Microsoft, Red Hat Linux, and Suse Linux will license their OS per chip, not per core
-
Conclusion
Multi-core chips are an important new trend in computer architecture
Several new multi-core chips are in the design phase
Parallel programming techniques are likely to gain importance