multi core 15213 sp07
TRANSCRIPT
-
7/30/2019 Multi Core 15213 Sp07
Multi-core architectures
Jernej Barbic
15-213, Spring 2007
May 3, 2007
-
Single-core computer
-
Single-core CPU chip
(figure: single-core CPU chip, with the single core highlighted)
-
Multi-core architectures
This lecture is about a new trend in computer architecture:
replicating multiple processor cores on a single die.
(figure: Multi-core CPU chip — Core 1, Core 2, Core 3, Core 4)
-
Multi-core CPU chip
The cores fit on a single processor socket
Also called CMP (Chip Multi-Processor)
(figure: core 1, core 2, core 3, and core 4 side by side on one chip)
-
The cores run in parallel
(figure: thread 1, thread 2, thread 3, and thread 4 running on core 1 through core 4, respectively)
-
Interaction with the Operating System
The OS perceives each core as a separate processor
The OS scheduler maps threads/processes to different cores
Most major OSes support multi-core today:
Windows, Linux, Mac OS X, ...
-
Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits:
- heat problems
- speed-of-light problems
- difficult design and verification
- large design teams necessary
- server farms need expensive air-conditioning
Many new applications are multithreaded
General trend in computer architecture (shift towards more parallelism)
-
Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can re-order, pipeline
instructions, split them into
microinstructions, do aggressive branch
prediction, etc.
Instruction-level parallelism enabled rapid
increases in processor speeds over the
last 15 years
-
Thread-level parallelism (TLP)
This is parallelism on a coarser scale
A server can serve each client in a separate thread (Web server, database server)
A computer game can do AI, graphics, and physics in three separate threads
Single-core superscalar processors cannot fully exploit TLP
Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
-
Multiprocessor memory types
Shared memory:
In this model, there is one (large) common
shared memory for all processors
Distributed memory:
In this model, each processor has its own
(small) local memory, and its content is not
replicated anywhere else
-
A multi-core processor is a special kind of multiprocessor:
All processors are on the same chip
Multi-core processors are MIMD:
Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
Multi-core is a shared-memory multiprocessor:
All cores share the same memory
-
What applications benefit from multi-core?
Database servers
Web servers (Web commerce)
Compilers
Multimedia applications
Scientific applications, CAD/CAM
(each of these can run on its own core)
In general: applications with thread-level parallelism (as opposed to instruction-level parallelism)
-
More examples
Editing a photo while recording a TV show
through a digital video recorder
Downloading software while running an
anti-virus program
Anything that can be threaded today will map efficiently to multi-core
BUT: some applications are difficult to parallelize
-
A technique complementary to multi-core:
Simultaneous multithreading
Problem addressed: the processor pipeline can get stalled:
Waiting for the result of a long floating-point (or integer) operation
Waiting for data to arrive from memory
Other execution units wait unused
(figure: processor pipeline — BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, uCode ROM, BTB, L2 Cache and Control, Bus. Source: Intel)
-
Simultaneous multithreading (SMT)
Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
Weaving together multiple threads
on the same core
Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units
-
Without SMT, only a single thread can run at any given time
(figure: pipeline occupied only by Thread 1, a floating-point thread)
-
Without SMT, only a single thread can run at any given time
(figure: pipeline occupied only by Thread 2, an integer thread)
-
SMT processor: both threads can run concurrently
(figure: pipeline with Thread 1 in the floating-point unit and Thread 2 in the integer unit at the same time)
-
But: can't simultaneously use the same functional unit
(figure: Thread 1 and Thread 2 both contending for the single integer unit — IMPOSSIBLE)
This scenario is impossible with SMT on a single core (assuming a single integer unit)
-
SMT not a true parallel processor
Enables better threading (e.g. up to 30%)
OS and applications perceive each
simultaneous thread as a separate
virtual processor
The chip has only a single copy
of each resource
Compare to multi-core:
each core has its own copy of resources
-
Multi-core:
threads can run on separate cores
(figure: two complete pipelines, one per core — Thread 1 runs on the first core, Thread 2 on the second)
-
Multi-core:
threads can run on separate cores
(figure: two complete pipelines, one per core — Thread 3 runs on the first core, Thread 4 on the second)
-
Combining Multi-core and SMT
Cores can be SMT-enabled (or not)
The different combinations:
Single-core, non-SMT: standard uniprocessor
Single-core, with SMT
Multi-core, non-SMT
Multi-core, with SMT: our fish machines
The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
Intel calls them hyper-threads
-
SMT Dual-core: all four threads can
run concurrently
(figure: two SMT pipelines — Threads 1 and 3 share the first core, Threads 2 and 4 share the second)
-
Comparison: multi-core vs SMT
Advantages/disadvantages?
-
Comparison: multi-core vs SMT
Multi-core:
Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
However, great with thread-level parallelism
SMT:
Can have one large and fast superscalar core
Great performance on a single thread
Mostly still only exploits instruction-level parallelism
-
The memory hierarchy
If simultaneous multithreading only:
all caches shared
Multi-core chips:
L1 caches private
L2 caches private in some architectures
and shared in others
Memory is always shared
-
Fish machines
Dual-core Intel Xeon processors
Each core is hyper-threaded
Private L1 caches
Shared L2 caches
(figure: CORE 0 and CORE 1, each with hyper-threads and a private L1 cache, sharing one L2 cache and main memory)
-
Designs with private L2 caches
(figure, left: CORE 0 and CORE 1, each with private L1 and L2 caches, sharing main memory)
Both L1 and L2 are private
Examples: AMD Opteron, AMD Athlon, Intel Pentium D
(figure, right: the same design with a private L3 cache added per core)
A design with L3 caches
Example: Intel Itanium 2
-
Private vs shared caches?
Advantages/disadvantages?
-
Private vs shared caches
Advantages of private:
They are closer to the core, so access is faster
Reduces contention
Advantages of shared:
Threads on different cores can share the same cache data
More cache space is available if a single (or a few) high-performance thread runs on the system
-
The cache coherence problem
Since we have private caches: how do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores
-
The cache coherence problem
Suppose variable x initially contains 15213
(figure: Core 1 through Core 4, each with one or more levels of cache; main memory holds x=15213; all on one multi-core chip)
-
The cache coherence problem
Core 1 reads x
(figure: Core 1's cache now holds x=15213; main memory holds x=15213)
-
The cache coherence problem
Core 2 reads x
(figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213)
-
The cache coherence problem
Core 1 writes to x, setting it to 21660
(figure: Core 1's cache holds x=21660 and main memory holds x=21660, assuming write-through caches; Core 2's cache still holds x=15213)
-
The cache coherence problem
Core 2 attempts to read x and gets a stale copy
(figure: Core 2's cache still holds x=15213 while main memory holds x=21660)
-
Solutions for cache coherence
This is a general problem with
multiprocessors, not limited just to multi-core
There exist many solution algorithms,
coherence protocols, etc.
A simple solution:
invalidation-based protocol with snooping
-
Inter-core bus
(figure: Core 1 through Core 4, each with one or more levels of cache, connected by an inter-core bus; main memory below)
-
Invalidation protocol with snooping
Invalidation:
If a core writes to a data item, all other
copies of this data item in other caches
are invalidated
Snooping:
All cores continuously snoop (monitor)
the bus connecting the cores.
-
The cache coherence problem
Revisited: Cores 1 and 2 have both read x
(figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213)
-
The cache coherence problem
Core 1 writes to x, setting it to 21660
(figure: Core 1 writes x=21660 and sends an invalidation request over the inter-core bus; Core 2's copy of x is INVALIDATED; main memory holds x=21660, assuming write-through caches)
-
The cache coherence problem
After invalidation:
(figure: only Core 1's cache still holds x=21660; Core 2's copy is gone; main memory holds x=21660)
-
The cache coherence problem
Core 2 reads x. Cache misses, and it loads the new copy.
(figure: Core 2's cache now holds x=21660, matching Core 1's cache and main memory)
-
Alternative to invalidate protocol:
update protocol
Core 1 writes x=21660:
(figure: Core 1 broadcasts the updated value over the inter-core bus; Core 2's copy is UPDATED to x=21660; main memory holds x=21660, assuming write-through caches)
-
Which do you think is better?
Invalidation or update?
-
Invalidation vs update
Multiple writes to the same location
invalidation: only the first time
update: must broadcast each write
(which includes new variable value)
Invalidation generally performs better:
it generates less bus traffic
-
Invalidation protocols
This was just the basic
invalidation protocol
More sophisticated protocols
use extra cache state bits
MSI, MESI
(Modified, Exclusive, Shared, Invalid)
-
Programming for multi-core
Programmers must use threads or
processes
Spread the workload across multiple cores
Write parallel algorithms
OS will map threads/processes to cores
-
Thread safety very important
Pre-emptive context switching: a context switch can happen AT ANY TIME
True concurrency, not just uniprocessor time-slicing
Concurrency bugs are exposed much faster with multi-core
-
However: Need to use synchronization
even if only time-slicing on a uniprocessor

int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
-
Need to use synchronization even if only time-slicing on a uniprocessor

temp1 = counter;
counter = temp1 + 1;
temp2 = counter;
counter = temp2 + 1;
gives counter = 2

temp1 = counter;
temp2 = counter;
counter = temp1 + 1;
counter = temp2 + 1;
gives counter = 1
-
Assigning threads to the cores
Each thread/process has an affinity mask
Affinity mask specifies what cores thethread is allowed to run on
Different threads can have different masks
Affinities are inherited across fork()
-
Affinity masks are bit vectors
Example: 4-way multi-core, without SMT

   1      1      0      1
core 3 core 2 core 1 core 0

Process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
-
Affinity masks when multi-core and SMT combined
Separate bits for each simultaneous thread
Example: 4-way multi-core, 2 threads per core

   1     1  |   0     0  |   1     0  |   1     1
thread thread|thread thread|thread thread|thread thread
   1     0  |   1     0  |   1     0  |   1     0
  core 3    |  core 2    |  core 1    |  core 0

Core 2 can't run the process
Core 1 can only use one simultaneous thread
-
Default Affinities
The default affinity mask is all 1s: all threads can run on all processors
Then, the OS scheduler decides what threads run on which cores
The OS scheduler detects skewed workloads, migrating threads to less busy processors
-
Process migration is costly
Need to restart the execution pipeline
Cached data is invalidated
The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
This is called soft affinity
-
Hard affinities
The programmer can prescribe her own
affinities (hard affinities)
Rule of thumb: use the default scheduler unless there is a good reason not to
-
When to set your own affinities
Two (or more) threads share data structures in memory:
map them to the same core so that they can share the cache
Real-time threads. Example: a thread running a robot controller:
- must not be context-switched, or else the robot can go unstable
- dedicate an entire core just to this thread
(Source: Sensable.com)
-
Kernel scheduler API

#include <sched.h>

int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process pid and stores it into the space pointed to by mask.
len is the system word size: sizeof(unsigned long)
-
Kernel scheduler API

#include <sched.h>

int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Sets the current affinity mask of process pid to *mask.
len is the system word size: sizeof(unsigned long)

To query the affinity of a running process:
[barbic@bonito ~]$ taskset -p 3935
pid 3935's current affinity mask: f
-
Windows Task Manager
(figure: Task Manager performance graphs showing CPU usage for core 1 and core 2)
-
Legal licensing issues
Will software vendors charge a separate license for each core, or only a single license per chip?
Microsoft, Red Hat Linux, and Suse Linux will license their OS per chip, not per core
-
Conclusion
Multi-core chips are an important new trend in computer architecture
Several new multi-core chips are in the design phase
Parallel programming techniques are likely to gain importance