COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, RISC-V Edition
Chapter 4 — The Processor
[Figure: Register File write port. RegWrite and the rd field feed a Write Address Decoder whose outputs (00, 01, 10, 11) are the per-register load enables (ldEn); the 64-bit WriteData bus is broadcast to all registers (register 01 (x1) shown), each clocked (ck); the rs1 and rs2 fields select the 64-bit ReadData1 and ReadData2 outputs through the read multiplexers.]
Chapter 4 — The Processor — 6
Control
Copyright 2020 University of Crete − https://www.csd.uoc.gr/~hy225/20a/copyright.html
Pipelined Datapath & Control Operation (without data or control dependencies, yet)
University of Crete
Dept. of Computer Science
CS-225 (HY-225) Computer Organization
Spring 2020 semester
Slides for §9.3 − 9.5:
§9.3 Pipelined Datapath Operation
§9.4 Control for the Pipelined Datapath
§9.5 Graphical representation: time−work
[Figure: pipeline state at Cycle 5 for the sequence
60: ld x10, 40(x1)
64: sub x11, x2, x3
68: add x12, x3, x4
72: ld x13, 48(x1)
76: add x14, x5, x6
The datapath (PC, +4, br/jmp addr., IM, IR, RF with rs1/rs2/rd, Imm, A/B registers, ALU, DM, Control) is drawn with the values in flight; the time−work diagram below it shows the five instructions overlapped, each one cycle behind the previous, through the stages Instr. Fetch → Reg.Rd; Op → ALU → Data Mem. → Write Back.]
[Figure: the same pipeline at Cycle 5, now with a store in the sequence
60: ld x10, 40(x1)
64: sub x11, x2, x3
68: add x12, x3, x4
72: sd x13, 48(x1)
76: add x14, x5, x6
annotated with the per-stage control-signal values (we/RegWrite bits 1/0, mux selects) and example data values in flight: x1 = 100, x13 = 130, the loaded value 700, ALU operands 400 and 300, results −100.]
Data Dependences (Hazards) in Pipelines

An earlier instruction, I1, and a later instruction, I2, both access the same word in some Memory or the Register File:
RAW (Read after Write) − true dependence: I2 needs the new data written by I1, hence must wait for I1 to write − or at least to generate − the new data
RAR (Read after Read) − not a dependence, can freely reorder the reads
WAR (Write after Read) − "antidependence": if you want to do the write (I2) early, just keep a copy of the old data and have I1 read that copy
WAW (Write after Write) − if you want to reorder them, simply abort the write of I1 (if no one reads this word between I1 and I2)
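The four cases above can be sketched as a tiny lookup; the function name and description strings are illustrative, not from the slides:

```python
# Classify the dependence between an earlier access (by I1) and a later
# access (by I2) to the same register or memory word. I1 precedes I2 in
# program order; the labels paraphrase the taxonomy above.
def classify(i1_access, i2_access):
    kinds = {
        ("write", "read"):  "RAW: true dependence, I2 must wait for I1's new data",
        ("read",  "read"):  "RAR: not a dependence, reads may be freely reordered",
        ("read",  "write"): "WAR: antidependence, keep a copy of the old data",
        ("write", "write"): "WAW: reorderable by aborting I1's write",
    }
    return kinds[(i1_access, i2_access)]

print(classify("write", "read"))  # prints the RAW description
```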
No Memory Data Hazards in our simple Pipeline

Data Memory accesses are performed ‘in-order’ in our simple pipeline, i.e. they are not reordered relative to what the program specifies; thus, no dependences of memory word accesses are ever violated.

[Figure: time vs. Program Order diagram − successive instructions, each one cycle behind the previous, all passing through Instr. Fetch → Reg.Rd; Op → ALU → Data Mem. → Write Back; the Data Mem. stages occur strictly in program order.]
Register Accesses reordered, with Pipelining

In the initial Program Order, each instruction's register read and register write complete before the next instruction's accesses begin. With pipelining, the Reg. Write of an instruction is reordered in time relative to the Register Accesses of the next 2 or 3 instructions, thus causing potential Dependence Hazards.

For each instruction that writes a destination register, if the next 2 or 3 instructions read that same register, i.e. need its result, we have to do something about it...

[Figure: two time diagrams − register accesses in initial program order vs. the overlapped pipeline (Instr. Fetch, Reg. Read, ALU, Data Mem., Reg. Write), where the Reg. Read of each of the next 2 or 3 instructions precedes the earlier instruction's Reg. Write.]
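The "2 or 3 instructions" window follows directly from the stage timing; here is a minimal sketch, assuming one instruction issued per cycle and the five stages named above:

```python
# In a 5-stage pipeline (Fetch, Reg.Read, ALU, Data Mem., Reg.Write), an
# instruction issued at cycle t reads registers at t+1 and writes at t+4.
# A later instruction j that reads what instruction i writes sees the old
# value whenever its Reg.Read comes at or before i's Reg.Write.
def reads_before_write(i, j):
    """i, j: issue cycles in program order (j > i), one instruction/cycle."""
    read_cycle = j + 1    # Reg.Read stage of the later instruction
    write_cycle = i + 4   # Reg.Write stage of the earlier instruction
    return read_cycle <= write_cycle

# The next 3 instructions after i are exposed (hence "2 or 3", depending
# on whether a same-cycle write-then-read through the RF is allowed):
print([d for d in (1, 2, 3, 4) if reads_before_write(0, d)])  # [1, 2, 3]
```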
Actual Need−Produce Time vs. from/in−Register Time

ALU Instructions: inputs are read from their ‘official’ Register File position in the Reg. Read stage, but are actually needed only at the ALU; the output result is produced by the actual computation at the end of the ALU stage, well before it is written into its ‘official’ register position at Reg. Write.

Load Instructions: likewise, the output result is produced at the end of the Data Mem. stage, one full stage before it is officially written back.

We can ‘Bypass’ the ‘official’ loop through the Register File for immediate-use Results: all we care about is actual Results ‘Forwarded’ from the Producer to the Consumer instruction.

[Figure: the five stages (Instr. Fetch, Reg. Read, ALU, Data Mem., Reg. Write) on a time axis, marking for each case where inputs are read vs. actually needed, and where the result is produced vs. officially written.]
ALU result to next I; Load result to next−after−next I

From ALU Instructions: the ALU op. result can be forwarded to any subsequent instruction − even the one immediately succeeding − without it having to wait one extra cycle.

From a Load Instruction: loaded data canNOT be used by the immediately succeeding instruction; the earliest consumer that avoids a stall is the next-after-next instruction.

ALU instructions never stall the pipeline, but Load instructions will do so when immediately followed by a dependent instruction.

[Figure: pipeline timing diagrams − an ALU result, available at the end of the ALU stage, forwarded into the ALU stage of any following instruction; a load result, available only at the end of the Data Mem. stage, arriving too late for the immediately following instruction's ALU stage.]
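A minimal stall counter capturing this rule (full forwarding, so only the load-use case at distance 1 costs a bubble; the instruction encoding is made up for illustration):

```python
# Count pipeline bubbles assuming full forwarding: an ALU result reaches
# any later instruction in time, but a load's data is late for the very
# next instruction (one bubble). Instructions are (op, dest, sources).
def count_stalls(instrs):
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        op, dest, _srcs = prev
        if op == "ld" and dest in cur[2]:   # load-use at distance 1
            stalls += 1
    return stalls

prog = [
    ("ld",  "x10", ["x1"]),
    ("sub", "x11", ["x10", "x3"]),  # needs x10 right after the load: stall
    ("add", "x12", ["x3", "x4"]),   # no dependence on the load: no stall
]
print(count_stalls(prog))  # 1
```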
[Figure: pipelined datapath with forwarding. A Forwarding Control unit compares the rs1 and rs2 fields of the instruction at the ALU inputs against the destination registers and write enables of the instructions ahead of it (rrd3/rwe3, rrd4/rwe4, rrd5/rwe5 for register write-back; mrd3/mwe3, mrd4/mwe4 for memory), and drives the fwd.ctrl selects of the bypass multiplexers feeding the A and B ALU inputs. The rest of the datapath (PC, +4, br/jmp addr., IM, IR, Op. Dec., RF, Imm, ALU, DM) is as before.]
Distance 1 dependence on LOAD: Wait!

60: ld x10, 40(x1)
64: sub x11, x10, x3
68: add x12, x3, x4
72: sd x13, 48(x1)
76: add x14, x5, x6

The instruction immediately after a Load wants to use the load’ed data. Impossible without losing one cycle: force this instruction to wait (repeat itself on the next cycle). Simple, in-order pipeline: the next instruction has to wait too!

[Figure: cycle-by-cycle trace − the ld reads x1, computes x1 + 40, reads M[140] in Data Mem., and writes x10. Hazard Detection (‘H’zrd D’tct!’) sees the dependence, aborts the sub’s first pass, and makes it Wait/Repeat (fetch 64 again; no-ops fill the bubble). The sub then computes x10 − x3 with the forwarded load data and writes x11; the following add computes x3 + x4 and writes x12; the fetch of 68 is likewise repeated.]
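The Wait condition can be sketched as a predicate, in the spirit of the Hazard Detection unit (signal names borrowed from the figure; the function itself is an illustrative model, not the real control logic):

```python
# Load-use hazard check: stall (Wait) when the instruction in the ALU
# stage is a load whose destination matches a source register that the
# decode-stage instruction actually needs (need.rs1/need.rs2 qualify
# which source fields are meaningful for this opcode).
def must_wait(ex_is_load, ex_rd, need_rs1, rs1, need_rs2, rs2):
    return ex_is_load and ((need_rs1 and rs1 == ex_rd) or
                           (need_rs2 and rs2 == ex_rd))

# ld x10, 40(x1) in the ALU stage; sub x11, x10, x3 in decode:
print(must_wait(True, "x10", True, "x10", True, "x3"))  # True
# add x12, x3, x4 does not read x10, so no stall:
print(must_wait(True, "x10", True, "x3", True, "x4"))   # False
```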
[Figure: the datapath with Forwarding Control and Hazard Detection. The ld (destination x10, offset 40, A = 100) is in the ALU stage while sub x11, x10, x3 sits in decode; Hazard Detection compares the load's rd against the decode instruction's rs1/rs2 (qualified by need.rs1/need.rs2) and asserts Wait: the PC and IR hold their values (their ldEn inputs forced to 0) and the instruction in decode is turned into a bubble. The Forwarding Control signals (fwd.ctrl, rrd3−rrd5, rwe3−rwe5, mrd3/mrd4, mwe3/mwe4) are as before; the PC+4 value 68 is held back.]
Instruction Scheduling

This is ‘Static’ Scheduling, at Compile Time.

Source program: a = b + c; e = b − f; with the variables at offsets from gp: +0: a, +8: b, +16: c, +24: e, +32: f.

Unscheduled code − 2 extra clock cycles lost (both the add and the sub use a value loaded by the immediately preceding ld):

ld t0, 8(gp)
ld t1, 16(gp)
add t1, t0, t1
sd t1, 0(gp)
ld t1, 32(gp)
sub t1, t0, t1
sd t1, 24(gp)

Two temporary registers (t0, t1) suffice.

Scheduled code − no extra clock cycle lost (the three loads are hoisted together):

ld t0, 8(gp)
ld t1, 16(gp)
ld t2, 32(gp)
add t1, t0, t1
sd t1, 0(gp)
sub t1, t0, t2
sd t1, 24(gp)

Three temporary registers (t0, t1, t2) are needed: the more things you have ‘up in the air’ (in parallel), the more temporary registers you need in order to ‘name’ those ‘pending’ values.

What if the program is: a[i] = b + c; e = b − a[j]; ? Is there a RAW dependence through memory, from the sd into a[i] to the ld of a[j]? Does the compiler know for sure if i != j (OK to reorder sd−ld) or i == j (forward in a register)? If unknown to the compiler, static scheduling is impossible => dynamic scheduling at runtime (out-of-order pipe).
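Counting the load-use bubbles in the two schedules confirms the "2 extra cycles" vs. "no extra cycle" claim; the tuple encoding below is made up for illustration:

```python
# Load-use bubble count for the two schedules: a load followed
# immediately by a consumer of its destination costs one bubble.
# Tuples are (op, dest, sources); gp stands in for the base register.
def stalls(prog):
    return sum(1 for prev, cur in zip(prog, prog[1:])
               if prev[0] == "ld" and prev[1] in cur[2])

naive = [
    ("ld",  "t0", ["gp"]),
    ("ld",  "t1", ["gp"]),
    ("add", "t1", ["t0", "t1"]),  # t1 loaded on the previous line: bubble
    ("sd",  None, ["t1", "gp"]),
    ("ld",  "t1", ["gp"]),
    ("sub", "t1", ["t0", "t1"]),  # t1 loaded on the previous line: bubble
    ("sd",  None, ["t1", "gp"]),
]
scheduled = [
    ("ld",  "t0", ["gp"]),
    ("ld",  "t1", ["gp"]),
    ("ld",  "t2", ["gp"]),        # third load hoisted above the add
    ("add", "t1", ["t0", "t1"]),  # t1 loaded two slots earlier: no bubble
    ("sd",  None, ["t1", "gp"]),
    ("sub", "t1", ["t0", "t2"]),
    ("sd",  None, ["t1", "gp"]),
]
print(stalls(naive), stalls(scheduled))  # 2 0
```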
Control Dependences (branch/jump) in Pipelines
Copyright 2020 University of Crete − https://www.csd.uoc.gr/~hy225/20a/copyright.html (slides 1−9); Elsevier (slides 10−11)
‘Data Dependence’ = next instruction uses data (register/memory) from previous
‘Control Dependence’ = which instruction comes next depends on the previous one
Control Dependences arise from ‘Control Transfer Instructions (CTI)’
Control Transfer Instructions (CTI) are: Jump and Branch Instructions
‘Jumps’ are Unconditional CTI’s: they always transfer control
‘Branches’ are Conditional CTI’s: whether or not they transfer control depends on the result of a data comparison that they have to perform
Statistics (rough numbers, in a majority of programs, but NOT always so):
− Branches are about 15−16% of all (‘dynamically’) executed instructions in a program
− about 2/3 of executed branches are ‘taken’ (successful) = ~10% of all instr.
− about 1/3 of executed branches are not taken (unsuccessful) = ~5% of all instr.
− most backwards branches appear in loops, and they are about 90% taken
− Jumps are about 4−5% of all executed instructions in a program
− procedure calls are about 1%, and returns another ~1%, of all executed instr.
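Plugging in these rough numbers, the average cost of taken branches in a pipeline with no prediction comes out to about 0.3 extra cycles per instruction (using the 3-cycle example branch latency from these slides):

```python
# Expected extra cycles per instruction from taken branches, with no
# branch prediction: ~15% of instructions are branches, ~2/3 of those
# are taken, and each taken branch loses 3 cycles (example latency).
branch_frac = 0.15
taken_frac = 2 / 3
taken_penalty = 3
extra_cpi = branch_frac * taken_frac * taken_penalty
print(round(extra_cpi, 2))  # 0.3
```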
Branch Taken example

36: add ...
40: beq ..., goto 72
44: sd ...
48: and ...
52: or ...
... ...
72: ld ...
76: xor ...

[Figure: pipeline diagram − the add at 36 completes normally (fetch, ALU, DM, WB). While the beq at 40 resolves, fetch continues speculatively at PC+4: 44 (sd), 48 (and), 52 (or). The branch is taken, so the speculative execution must be aborted before it causes permanent damage − the three instructions are turned into no-ops before the DM and WB stages − and fetch redirects to 72 (ld) and 76 (xor).]

In our simple pipeline, branch latency is 2 cycles (read registers; compare); with MIPS-style comparisons (beq/bne only) it could even be 1 cycle. In modern processors, branch latency is quite long; the example here uses a 3-cycle branch latency, so each taken branch causes the loss of 3 extra clock cycles. About 2/3 of all executed branches are taken, so this is a heavy loss.
Branch Target Buffer (BTB)

A small table − a cache, like a hash table − containing pairs of (instruction) addresses for which there is statistical evidence that their next-PC is something other than PC+4: the PC of a jump or branch-likely instruction, and the Target PC to which this instruction usually went, in the past.

Example BTB contents (PC → predicted target): 40 → 72, 260 → 200, 88 → 120, 180 → 160, ...

A ‘best approximation’ − not necessarily correct information. In parallel with each Fetch, search the fetched instruction’s PC value in the BTB. Branches that are believed not-taken are NOT entered into the BTB. Like IM − the Instruction Cache − this will oftentimes ‘overflow’: old pairs are removed to make room for more recent ones.

May be complemented with a small hardware stack:
− on every call (jal ra, ...), push the return address;
− on every return (jr ra), pop an address and predict jumping to that one.
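A toy software model of the BTB described above (the capacity and the evict-oldest policy are illustrative assumptions; a real BTB is a small hardware cache):

```python
# A toy Branch Target Buffer: (PC -> predicted target) pairs with a
# small capacity; on overflow, the oldest pair is evicted.
from collections import OrderedDict

class BTB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.table = OrderedDict()

    def predict(self, pc):
        """BTB hit: use the stored target; miss: predict PC + 4."""
        return self.table.get(pc, pc + 4)

    def update(self, pc, target):
        """Record a taken branch/jump; believed-not-taken branches are
        NOT entered, so they keep predicting PC + 4."""
        if pc not in self.table and len(self.table) >= self.capacity:
            self.table.popitem(last=False)  # evict the oldest pair
        self.table[pc] = target

btb = BTB()
for pc, target in [(40, 72), (260, 200), (88, 120), (180, 160)]:
    btb.update(pc, target)
print(btb.predict(40))  # 72  (hit: predict taken to 72)
print(btb.predict(44))  # 48  (miss: fall through to PC + 4)
```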
When the BTB prediction is Correct

36: add ...
40: beq ..., goto 72
44: sd ...
48: and ...
52: or ...
... ...
72: ld ...
76: xor ...

In parallel with each fetch, the PC is looked up in the BTB. When a matching BTB entry is found, use its Prediction; else, fetch from PC+4.

[Figure: the fetches at 36 and 40 miss in the BTB (next PC = PC+4). At 40 the BTB hits and predicts 72; fetch continues at 72 (ld), 76 (xor), 80, 84, 88. The branch resolves taken, as predicted: the speculative execution simply continues and commits, every instruction flowing through fetch, op.dec, ALU, DM, WB with no bubbles.]

When Prediction is Correct, NO extra clock cycles are lost!
When the BTB prediction is Wrong

The Prediction says: after fetching from 40, fetch from 72. But this time, the branch ends up going the other way: to 44.

[Figure: fetch proceeds speculatively at 72 (ld) and 76 (xor) following the BTB prediction. When the branch resolves not taken − mispredicted − the speculatively executed instructions are aborted and the pipeline is flushed (they become no-ops); fetch restarts at 44 (sd), then 48 (and), ...]

When Mispredicted, branches cost 3 extra clock cycles in this pipeline.
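With a predictor, only mispredicted branches pay the 3-cycle flush; a quick estimate (the 90% prediction accuracy is an assumed illustrative figure, not from the slides):

```python
# Average branch cost with a BTB predictor: only mispredictions pay
# the 3-cycle flush. 15% branches, assumed 10% misprediction rate.
branch_frac = 0.15
mispredict_rate = 0.10
flush_penalty = 3
extra_cpi = branch_frac * mispredict_rate * flush_penalty
print(round(extra_cpi, 3))  # 0.045
```

Compare with ~0.3 extra cycles per instruction when every taken branch pays the full penalty: prediction recovers most of the loss.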
Chapter 1 — Computer Abstractions and Technology — 28
Relative Performance

Define Performance = 1 / Execution Time. "X is n times faster than Y" means:

Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

Example: time taken to run a program is 10 s on A and 15 s on B:
Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
so A is 1.5 times faster than B.
Performance Summary

The BIG Picture:

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

Performance depends on:
− Algorithm: affects IC, possibly CPI
− Programming language: affects IC, CPI
− Compiler: affects IC, CPI
− Instruction set architecture: affects IC, CPI, Tc
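The Big Picture formula as a one-line computation (the example operand values are illustrative):

```python
# CPU Time = (Instructions/Program) x (Clock cycles/Instruction)
#            x (Seconds/Clock cycle)
def cpu_time(instruction_count, cpi, clock_period_s):
    return instruction_count * cpi * clock_period_s

# e.g. 1 billion instructions, CPI = 2, 1 ns clock cycle:
print(cpu_time(10**9, 2.0, 1e-9))  # prints the CPU time in seconds (~2.0)
```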
CPI in More Detail

If different instruction classes take different numbers of cycles:

Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i

Weighted average CPI:

CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)

where the last factor is the relative frequency of instruction class i.
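A worked weighted-CPI example (the instruction classes and counts are made up for illustration):

```python
# Weighted average CPI = sum_i CPI_i * (count_i / total count).
classes = {             # class: (CPI_i, instruction count_i)
    "alu":    (1, 50),
    "load":   (2, 20),
    "store":  (2, 10),
    "branch": (3, 20),
}
total_count = sum(count for _cpi, count in classes.values())
clock_cycles = sum(cpi * count for cpi, count in classes.values())
cpi = clock_cycles / total_count
print(cpi)  # 1.7
```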
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 38
Average Access Time

Hit time is also important for performance. The average memory access time (AMAT) is:

AMAT = Hit time + Miss rate × Miss penalty

Example: CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%:
AMAT = 1 + 0.05 × 20 = 2 ns, i.e. 2 cycles per instruction.
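The slide's AMAT example, checked numerically:

```python
# AMAT = Hit time + Miss rate * Miss penalty. All times are in cycles
# of the 1 ns clock, so cycles and nanoseconds coincide here.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))  # 2.0  (= 2 ns, i.e. 2 cycles)
```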
Measuring Cache Performance (§5.4 Measuring and Improving Cache Performance)

Components of CPU time:
− Program execution cycles (includes cache hit time)
− Memory stall cycles (mainly from cache misses)

With simplifying assumptions:

Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
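The stall-cycle formula with illustrative numbers (the counts are chosen arbitrarily):

```python
# Memory stall cycles = Instructions/Program x Misses/Instruction
#                       x Miss penalty. Example: 1,000,000 instructions,
# 20,000 cache misses (0.02 misses/instruction), 50-cycle miss penalty.
instructions = 1_000_000
misses = 20_000
miss_penalty = 50
stall_cycles = misses * miss_penalty          # integer form of the formula
misses_per_instruction = misses / instructions
print(stall_cycles, misses_per_instruction)   # 1000000 0.02
```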