Modern Processor Architectures (II)
Hung-Wei Tseng
• How many of the following would happen given the modern processor microarchitecture?
① The branch predictor will predict not taken for branch A
② The cache may contain the content of array2[array1[16] * 512]
③ temp can potentially become the value of array2[array1[16] * 512]
④ The program will raise an exception
A. 0
B. 1
C. 2
D. 3
E. 4
2
Putting it all together
#include <stdint.h>
#include <stdlib.h>

unsigned int array1_size = 16;
uint8_t array1[160] = { 1, 2, 3, 4, 5, 6, 7, 8,
                        9, 10, 11, 12, 13, 14, 15, 16 };
uint8_t array2[256 * 512];
uint8_t temp = 0;

void bar(size_t x) {
  if (x < array1_size) { // Branch A: taken if the statement is not going to be executed
    temp &= array2[array1[x] * 512];
  }
}

void foo(size_t x) {
  int j = 0;
  for (j = 0; j < 10000; j++)
    bar(rand() % 17);
}
— very likely
— possibly
— impossible
— not really, as x < array1_size
— where the security issues come from
Spectre and Meltdown
3
• Exceptions and incorrect branch prediction can cause “rollback” of transient instructions
• Old register states are preserved, can be restored
• Memory writes are buffered, can be discarded
• Cache modifications are not restored!
4
What happens when mis-speculation is detected
• Execution without speculation is safe
  • CPU will never read array1[x] for any x ≥ array1_size
• Execution with speculation can be exploited
  • Attacker sets up some conditions
    • train the branch predictor to assume the ‘if’ is likely true
    • make array1_size and array2[] uncached
  • Invokes code with an out-of-bounds x such that array1[x] is a secret
  • Processor recognizes its error when array1_size arrives, restores its architectural state, and proceeds with the ‘if’ false
  • Attacker detects the cache change (e.g. basic FLUSH+RELOAD or EVICT+RELOAD)
    • E.g. the next read to array2[i*256] will be fast if i = array1[x], since this got cached
5
Speculative execution on the following code:
if (x < array1_size)
    y = array2[array1[x] * 256];
• The compiler can only see/optimize static instructions, the instructions in the compiled binary
• The compiler cannot optimize dynamic instructions, the real instruction sequence when executing the program
  • The compiler cannot re-order 3, 5 or 4, 5
  • The compiler cannot predict cache misses
• Compiler optimization is constrained by false dependencies due to the limited number of registers (even worse for x86)
  • Instructions lw $t1, 0($a0) and addi $a0, $a0, 4 do not depend on each other
• Compiler optimizations do not work for all architectures
  • The code optimization in the previous example works for a single pipeline, but not for a superscalar processor
6
Why OoO instead of compilers

Static instructions
LOOP: lw   $t1, 0($a0)
      addi $a0, $a0, 4
      add  $v0, $v0, $t1
      bne  $a0, $t0, LOOP
      lw   $t0, 0($sp)
      lw   $t1, 4($sp)

Dynamic instructions
1: lw   $t1, 0($a0)
2: addi $a0, $a0, 4
3: add  $v0, $v0, $t1
4: bne  $a0, $t0, LOOP
5: lw   $t1, 0($a0)
6: addi $a0, $a0, 4
7: add  $v0, $v0, $t1
8: bne  $a0, $t0, LOOP
• Modern OoO processors have 3–6-wide issue
• Keeping the instruction window filled is hard
  • Branches occur every 4–5 instructions
  • If the instruction window is 32 instructions, the processor has to predict 6–8 consecutive branches correctly to keep the IW full
• The ILP within an application is low
  • Usually 1–2 per thread
  • ILP is even lower if data depends on memory operations (if cache misses) or long-latency operations
• Demo
7
Problems with OOO+Superscalar
• Consider the following dynamic instructions:
1: lw  $t1, 0($a0)
2: lw  $a0, 4($a0)
3: add $v0, $v0, $t1
4: bne $a0, $zero, LOOP
5: lw  $t1, 0($a0)
6: lw  $t2, 4($a0)
7: add $v0, $v0, $t1
8: bne $t2, $zero, LOOP
Assume a superscalar processor with unlimited issue width & physical registers that can fetch up to 4 instructions per cycle and takes 2 cycles to execute a memory instruction. How many cycles does it take to issue all the instructions?
8
Dynamic execution with register renaming + ROB
A. 1
B. 2
C. 3
D. 4
E. 5
[Cycle-by-cycle issue diagram over cycles #1–#5 showing how the eight dynamic instructions are scheduled; several issue slots are wasted]
Your code looks like this when performing “linked list” traversal
Simultaneous multithreading
9
• Fetch instructions from different threads/processes to fill the underutilized parts of the pipeline
• Exploit “thread-level parallelism” (TLP) to solve the problem of insufficient ILP in a single thread
• You need to create the illusion of multiple processors for the OS
10
Simultaneous multithreading
• To create an illusion of a multi-core processor and allow the core to run instructions from multiple threads concurrently, how many of the following units in the processor must be duplicated?
① Program counter
② Register file
③ ALUs
④ Data cache
⑤ Reorder buffer
A. 1
B. 2
C. 3
D. 4
E. 5
11
Illusion of a multi-core processor
• Fetch instructions from different threads/processes to fill the underutilized parts of the pipeline
• Exploit “thread-level parallelism” (TLP) to solve the problem of insufficient ILP in a single thread
• You need to create the illusion of multiple processors for the OS
• Keep separate architectural states for each thread
  • PC
  • Register files
  • Reorder buffer
• The rest of the superscalar processor hardware is shared
• Invented by Dean Tullsen — my thesis advisor
12
Simultaneous multithreading
Simplified SMT-OOO pipeline
13
[Pipeline diagram: per-thread Instruction Fetch units (T0–T3) and per-thread ROBs (T0–T3) surround a shared Instruction Decode stage, register renaming logic, scheduler, execution units, and data cache]
• Fetch 2 instructions from each thread/process at each cycle to fill the underutilized parts of the pipeline
• Issue width is still 2, commit width is still 4
14
SMT
T1 1: lw   $t1, 0($a0)
T1 2: lw   $a0, 0($t1)
T2 1: sll  $t0, $a1, 2
T2 2: add  $t1, $a0, $t0
T1 3: addi $a1, $a1, -1
T1 4: bne  $a1, $zero, LOOP
T2 3: lw   $v0, 0($t1)
T2 4: addi $t1, $t1, 4
T2 5: add  $v0, $v0, $t2
T2 6: jr   $ra
[Cycle-by-cycle SMT pipeline diagram: instructions from threads T1 and T2 interleave through the IF, ID, Ren, Sch, EXE/MEM, and ROB stages]
• How many of the following about SMT are correct?
① SMT makes processors with deep pipelines more tolerant of mis-predicted branches
② SMT can improve the throughput of a single-threaded application
③ SMT processors can better utilize hardware during cache misses compared with superscalar processors with the same issue width
④ SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications
A. 0
B. 1
C. 2
D. 3
E. 4
15
SMT
• Improve the throughput of execution
• May increase the latency of a single thread
  • hurt, b/c you are sharing resources with other threads
• Less branch penalty per thread
• Increase hardware utilization
• Simple hardware design: only need to duplicate PC/register files
• Real cases:
  • Intel Hyper-Threading (supports up to two threads per core)
    • Intel Pentium 4, Intel Atom, Intel Core i7
  • AMD Ryzen
16
SMT
Pentium 4
17
Intel Skylake
18
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
2.1 THE SKYLAKE MICROARCHITECTURE The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.
The Skylake microarchitecture offers the following enhancements:• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.• Improved front end throughput.• Improved branch predictor.• Improved divider throughput and latency.• Lower power consumption.• Improved SMT performance with Hyper-Threading Technology.• Balanced floating-point ADD, MUL, FMA throughput and latency.
The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.
Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture
[Figure 2-1 contents: a 32K L1 instruction cache feeds the MSROM, decoded icache (DSB), and legacy decode pipeline into the instruction decode queue (IDQ, or micro-op queue); an Allocate/Rename/Retire/Move Elimination/Zero Idiom stage; a scheduler dispatching to ports 0–7 (integer and vector ALUs on ports 0/1/5/6, load/store-address on ports 2/3 and store-address on port 7, store-data on port 4, branches on ports 0 and 6); a 32K L1 data cache backed by a 256K unified L2 cache; a 4-issue integer pipeline and a 4-issue memory pipeline]
19
20
Parallel architectures (I)
Hung-Wei Tseng
Von Neumann architecture
22
[Diagram: CPU connected to memory]
The CPU is a dominant factor in performance since we heavily rely on it to execute programs
By pointing the “PC” to different parts of your memory, we can perform different functions!
23
History of Processor Performance
CPU performance scaled well before 2002
52%/year
• Using PCIe P2P to eliminate CPU/DRAM from the data plane
  • Reduce the CPU load
  • Avoid redundant data copies in DRAM
  • Improve the latency
24
The slowdown of CPU scaling
[Plot of SPEC Rate results from Sep-06 to Sep-17: about 5x improvement in the first 67 months, then only 1.5x in the next 67 months]
25
Intel P4 (2000): 1 core
Intel Nehalem (2010): 4 cores
Nvidia Tegra 3 (2011): 5 cores
SPARC T3 (2010): 16 cores
AMD Zambezi (2011): 16 cores
AMD Athlon 64 X2 (2005): 2 cores
Die photo of a CMP processor
26
• Each processor has its own local cache
27
Memory hierarchy on CMP
[Diagram: Core 0–Core 3, each with its own local cache, connected by a bus to a shared cache]
• Assume both application X and application Y have a similar instruction mix, say 60% ALU, 20% load/store, and 20% branches. Consider two processors:
P1: CMP with a 2-issue pipeline on each core. Each core has a private 32KB L1 D-cache
P2: SMT with a 4-issue pipeline and a 64KB L1 D-cache
Which one do you think is better?
A. P1
B. P2
28
Comparing SMT and CMP
Parallelism
29
• Instruction-level parallelism
• Data-level parallelism
• Thread-level parallelism
30
Parallelism in modern computers
• SISD — single instruction stream, single data
  • Pipelining instructions within a single program
  • Superscalar
• SIMD — single instruction stream, multiple data
  • Vector instructions
  • GPUs
• MIMD — multiple instruction streams (e.g. multiple threads, multiple processes), multiple data
  • Multicore processors
  • Multiple processors
  • Simultaneous multithreading
31
Processing models
32
Processing models
We will focus on this today
• How can we create programs to utilize these cores?
A. Parallel programming is easy; programmers will just write parallel programs from now on
B. Parallel programming was hard, but architects have generally solved this problem in the 10 years since we saw the problem
C. You don’t need to write parallel code; Intel’s new compilers know how to extract thread-level parallelism
D. Intel (and everyone else) is just building the chips; it’s on you to figure out how to use them
33
How to utilize these cores?