Source: american.cs.ucdavis.edu/academic/ecs201a/fred/l1.pdf (University of California, Davis)
TRANSCRIPT
Page 1
FTC.W99 1
Lecture 1: Cost/Performance, DLX, Pipelining, Caches, Branch Prediction
Prof. Fred Chong
ECS 250A Computer Architecture
Winter 1999
(Slides based upon Patterson UCB CS252 Spring 1998)
FTC.W99 2
Computer Architecture Is …
the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.
Amdahl, Blaauw, and Brooks, 1964
SOFTWARE
FTC.W99 3
Computer Architecture's Changing Definition
• 1950s to 1960s: Computer Architecture Course: Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISA appropriate for compilers
• 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors
FTC.W99 4
Computer Architecture Topics
[Diagram of topic areas:]
Instruction Set Architecture: Addressing, Protection, Exception Handling
Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP
Memory Hierarchy: L1 Cache, L2 Cache, DRAM (Coherence, Bandwidth, Latency; Interleaving, Bus protocols; Emerging Technologies; VLSI)
Input/Output and Storage: Disks, WORM, Tape; RAID
FTC.W99 5
Computer Architecture Topics
[Diagram: processor-memory (P-M) pairs connected through an interconnection network S]
Processor-Memory-Switch: Multiprocessors, Networks and Interconnections
Interconnection Network: Topologies, Routing, Bandwidth, Latency, Reliability
Network Interfaces
Shared Memory, Message Passing, Data Parallelism
FTC.W99 6
ECS 250A Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
[Diagram: Computer Architecture (• Instruction Set Design • Organization • Hardware) at the intersection of Technology, Programming Languages, Operating Systems, History, Applications, Interface Design (ISA), Measurement & Evaluation, and Parallelism]
Page 2
FTC.W99 7
Topic Coverage
Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed., 1996.
• Performance/Cost, DLX, Pipelining, Caches, Branch Prediction
• ILP, Loop Unrolling, Scoreboarding, Tomasulo, Dynamic Branch Prediction
• Trace Scheduling, Speculation
• Vector Processors, DSPs
• Memory Hierarchy
• I/O
• Interconnection Networks
• Multiprocessors
FTC.W99 8
ECS250A: Staff
Instructor: Fred Chong
Office: EUII-3031, chong@cs
Office Hours: Mon 4-6pm or by appt.
TA: Diana Keen
Office: EUII-2239, keend@cs
TA Office Hours: Fri 1-3pm
Class: Mon 6:10-9pm
Text: Computer Architecture: A Quantitative Approach, Second Edition (1996)
Web page: http://arch.cs.ucdavis.edu/~chong/250A/
Lectures available online before 1PM day of lecture
Newsgroup: ucd.class.cs250a{.d}
FTC.W99 9
Grading
• Problem Sets 35%
• 1 in-class exam (prelim simulation) 20%
• Project Proposals and Drafts 10%
• Project Final Report 25%
• Project Poster Session (CS colloquium) 10%
FTC.W99 10
VLSI Transistors
[Schematic cross-sections of MOS transistors: gate G controls conduction between terminals A and B]
FTC.W99 11
CMOS Inverter
[Schematic and symbol: complementary PMOS/NMOS pair; input In, output Out]
FTC.W99 12
CMOS NAND Gate
[Schematic: inputs A and B, output C; two PMOS transistors in parallel, two NMOS transistors in series]
Page 3
FTC.W99 13
Integrated Circuits Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer × Die yield)

Dies per wafer = π × (Wafer_diam / 2)² / Die_Area − π × Wafer_diam / √(2 × Die_Area) − Test dies

Die yield = Wafer yield × (1 + Defects_per_unit_area × Die_Area / α)^(−α)

Die cost goes roughly with (die area)⁴
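As a quick check, the formulas above can be sketched in a few lines of Python. The wafer diameter (15 cm) and α = 3 are assumptions, not values stated on the slide; the 486DX2 inputs ($1200 wafer, 81 mm² die, 1.0 defect/cm²) come from the table on the next slide.

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Gross dies: wafer area over die area, minus an edge-loss term
    and any test dies (the slide's formula)."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss - test_dies

def die_yield(defects_per_cm2, die_area_cm2, alpha=3.0, wafer_yield=1.0):
    """Die yield = wafer yield * (1 + defect density * area / alpha)^-alpha."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2, defects_per_cm2):
    good_dies = dies_per_wafer(wafer_diam_cm, die_area_cm2) * \
                die_yield(defects_per_cm2, die_area_cm2)
    return wafer_cost / good_dies

# Roughly the 486DX2 row: $1200 wafer, 81 mm^2 = 0.81 cm^2, 1.0 defect/cm^2.
print(round(dies_per_wafer(15, 0.81)))       # close to the table's 181 dies
print(round(die_cost(1200, 15, 0.81, 1.0), 2))
```

With these assumed parameters the dies-per-wafer count matches the table almost exactly; the die cost lands near the table's $12, with the residual difference due to the exact α used.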
FTC.W99 14
Real World Examples

Chip         Metal   Line    Wafer   Defect  Area    Dies/   Yield   Die
             layers  width   cost    /cm²    (mm²)   wafer           cost
386DX        2       0.90    $900    1.0     43      360     71%     $4
486DX2       3       0.80    $1200   1.0     81      181     54%     $12
PowerPC 601  4       0.80    $1700   1.3     121     115     28%     $53
HP PA 7100   3       0.80    $1300   1.0     196     66      27%     $73
DEC Alpha    3       0.70    $1500   1.2     234     53      19%     $149
SuperSPARC   3       0.70    $1700   1.6     256     48      13%     $272
Pentium      3       0.80    $1500   1.5     296     40      9%      $417
– From "Estimating IC Manufacturing Costs,” by Linley Gwennap,Microprocessor Report, August 2, 1993, p. 15
FTC.W99 15
Cost/Performance
What is the Relationship of Cost to Price?
• Component Costs
• Direct Costs (add 25% to 40%): recurring costs: labor, purchasing, scrap, warranty
• Gross Margin (add 82% to 186%): nonrecurring costs: R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes
• Average Discount to get List Price (add 33% to 66%): volume discounts and/or retailer markup
[Stacked-bar diagram: Component Cost (15% to 33% of list) + Direct Cost (6% to 8%) + Gross Margin (34% to 39%) = Avg. Selling Price; adding the Average Discount (25% to 40%) gives the List Price]
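The markup chain above can be sketched as a small Python helper. The specific percentages used as defaults (33% direct cost, 100% gross margin, 50% discount) are illustrative picks from the ranges on the slide, not figures the slide itself commits to.

```python
def list_price(component_cost, direct_pct=0.33, gross_margin_pct=1.00,
               discount_pct=0.50):
    """Build list price from component cost via the slide's markup chain.
    Defaults are arbitrary picks within the slide's stated ranges:
    direct +25-40%, gross margin +82-186%, discount +33-66%."""
    direct = component_cost * (1 + direct_pct)
    avg_selling_price = direct * (1 + gross_margin_pct)
    return avg_selling_price * (1 + discount_pct)

# $100 of components -> $133 direct -> $266 ASP -> $399 list.
print(round(list_price(100), 2))  # → 399.0
```

Note how the multiplications compound: even mid-range markups turn $100 of parts into a list price approaching 4x the component cost, consistent with the multipliers in the chart that follows.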
FTC.W99 16
Chip Prices (August 1993)
• Assume purchase of 10,000 units

Chip         Area    Mfg.    Price   Multi-  Comment
             (mm²)   cost            plier
386DX        43      $9      $31     3.4     Intense competition
486DX2       81      $35     $245    7.0     No competition
PowerPC 601  121     $77     $280    3.6
DEC Alpha    234     $202    $1231   6.1     Recoup R&D?
Pentium      296     $473    $965    2.0     Early in shipments
FTC.W99 17
Summary: Price vs. Cost
[Two bar charts comparing Mini, W/S, and PC: the breakdown of list price into Component Costs, Direct Costs, Gross Margin, and Average Discount, shown both as percentages of list price (0%–100%) and as multiples of component cost (values in the range 1.5 to 4.7)]
FTC.W99 18
Technology Trends: Microprocessor Capacity (Moore's Law)
[Log-scale plot of transistors per chip (1,000 to 100,000,000) vs. year (1970–2000): i4004, i8080, i8086, i80286, i80386, i80486, Pentium]
CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs
Transistor counts: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million
Page 4
FTC.W99 19
Memory Capacity (Single Chip DRAM)
[Log-scale plot of DRAM size vs. year, 1970–2000]

year    size (Mb)   cycle time
1980    0.0625      250 ns
1983    0.25        220 ns
1986    1           190 ns
1989    4           165 ns
1992    16          145 ns
1996    64          120 ns
2000    256         100 ns
FTC.W99 20
Technology Trends (Summary)

        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years
FTC.W99 21
Processor Performance Trends
[Log-scale performance (0.1 to 1000) vs. year (1965–2000) for supercomputers, mainframes, minicomputers, and microprocessors]
FTC.W99 22
Processor Performance (1.35X/year before, 1.55X/year now)
[Performance (0–1200) vs. year (1987–97), trend 1.54X/yr: Sun-4/260, MIPS M/120, MIPS M 2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600]
FTC.W99 23
Performance Trends (Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
• Improvement in cost performance estimated at 70% per year
FTC.W99 24
Computer Engineering Methodology
Technology Trends
Page 5
FTC.W99 25
Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks
Technology Trends
Benchmarks
FTC.W99 26
Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks
Simulate New Designs and Organizations
Technology Trends
Benchmarks
Workloads
FTC.W99 27
Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks
Simulate New Designs and Organizations
Implement Next Generation System
Technology Trends
Benchmarks
Workloads
Implementation Complexity
FTC.W99 28
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: cost, delay, area, power estimation
• Simulation (many levels): ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental "Laws"/Principles
FTC.W99 29
The Bottom Line: Performance (and Cost)
• Time to run the task (ExTime): execution time, response time, latency
• Tasks per day, hour, week, sec, ns, … (Performance): throughput, bandwidth

Plane             Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747        610 mph    6.5 hours     470          286,700
BAC/Sud Concorde  1350 mph   3 hours       132          178,200
FTC.W99 30
The Bottom Line:Performance (and Cost)
"X is n times faster than Y" means
ExTime(Y) Performance(X)
--------- = ---------------
ExTime(X) Performance(Y)
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
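The airliner comparison can be checked with a few lines of Python. Here `pmph` (passenger-miles per hour) is the throughput measure from the table; the dictionary layout is mine.

```python
planes = {
    "Boeing 747":       {"speed_mph": 610,  "hours_dc_paris": 6.5, "passengers": 470},
    "BAC/Sud Concorde": {"speed_mph": 1350, "hours_dc_paris": 3.0, "passengers": 132},
}

for name, p in planes.items():
    pmph = p["speed_mph"] * p["passengers"]   # passenger-miles per hour
    print(f"{name}: throughput = {pmph:,} pmph")

# "X is n times faster than Y" means ExTime(Y)/ExTime(X):
print(round(6.5 / 3.0, 1))           # → 2.2  (Concorde, on latency)
print(round(286700 / 178200, 2))     # → 1.61 (747, on throughput)
```

The point of the example survives the arithmetic: which plane is "faster" depends on whether you measure latency (the Concorde wins) or throughput (the 747 wins).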
Page 6
FTC.W99 31
Amdahl's Law
Speedup due to enhancement E:

Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
FTC.W99 32
Amdahl's Law

ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
FTC.W99 33
Amdahl's Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

ExTime_new = ?

Speedup_overall = ?
FTC.W99 34
Amdahl's Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

Speedup_overall = 1 / 0.95 = 1.053
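The calculation generalizes into a one-line Python helper (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the task is sped up by S."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# The slide's FP example: 10% of the work made 2x faster.
print(round(amdahl_speedup(0.10, 2), 3))   # → 1.053

# Even an infinite speedup of that 10% caps overall speedup at 1/0.9:
print(round(amdahl_speedup(0.10, 1e12), 3))  # → 1.111
```

The second call illustrates the law's main lesson: the unimproved 90% bounds the achievable speedup, which is why the slide says to invest resources where time is spent.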
FTC.W99 35
Metrics of Performance
[Layered diagram, from application down to hardware, with the measure used at each level:]
Application: answers per month, operations per second
Programming Language / Compiler
ISA: (millions of) instructions per second: MIPS; (millions of) FP operations per second: MFLOP/s
Datapath, Control: megabytes per second
Function Units: cycles per second (clock rate)
Transistors, Wires, Pins
FTC.W99 36
Aspects of CPU Performance

CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X
Page 7
FTC.W99 37
Cycles Per Instruction

CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count  ("average cycles per instruction")

CPU time = CycleTime × Σ_{i=1}^{n} (CPI_i × I_i)

CPI = Σ_{i=1}^{n} (CPI_i × F_i),  where F_i = I_i / Instruction Count  ("instruction frequency")

Invest resources where time is spent!
FTC.W99 38
Example: Calculating CPI

Base Machine (Reg / Reg)
Op           Freq   Cycles   CPI(i)   (% Time)
ALU          50%    1        0.5      (33%)
Load         20%    2        0.4      (27%)
Store        10%    2        0.2      (13%)
Branch       20%    2        0.4      (27%)
Typical Mix                  1.5
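The table's arithmetic can be reproduced in a few lines of Python; the `mix` dictionary simply encodes the rows of the table.

```python
# (frequency, cycles) per instruction class, from the slide's base machine.
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

# CPI = sum over classes of CPI_i * F_i.
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # → 1.5

# Fraction of execution time per class: freq * cycles / CPI.
for op, (freq, cycles) in mix.items():
    print(f"{op}: {freq * cycles / cpi:.0%} of time")
```

Note that loads and branches are each 20% of the instructions but 27% of the time, which is exactly the "invest resources where time is spent" point from the previous slide.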
FTC.W99 39
SPEC: System Performance Evaluation Cooperative
• First Round 1989
– 10 programs yielding a single number ("SPECmarks")
• Second Round 1992
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
» Compiler flags unlimited. March 93 DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
– New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
– "Benchmarks useful for 3 years"
– Single flag setting for all programs: SPECint_base95, SPECfp_base95
FTC.W99 40
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean) tracks execution time: Σ(Ti)/n or Σ(Wi × Ti)
• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n / Σ(1/Ri) or n / Σ(Wi/Ri)
• Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
• But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π xi)^(1/n)
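A minimal Python sketch of the three means; the sample times and rates below are made up for illustration.

```python
import math

def arithmetic_mean(times):
    return sum(times) / len(times)

def harmonic_mean(rates):
    return len(rates) / sum(1 / r for r in rates)

def geometric_mean(ratios):
    return math.prod(ratios) ** (1 / len(ratios))

times = [2.0, 8.0]      # seconds per program (hypothetical)
rates = [100.0, 25.0]   # e.g., MFLOPS on the same two programs

print(arithmetic_mean(times))             # → 5.0 (tracks total time)
print(round(harmonic_mean(rates), 1))     # → 40.0 (rate that tracks total time)
print(geometric_mean([2.0, 0.5]))         # → 1.0 (normalized ratios cancel)
```

The last line shows why the geometric mean is used for normalized results: a 2x win on one program and a 2x loss on another average to exactly 1.0, whereas their arithmetic mean (1.25) would misleadingly favor one machine.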
FTC.W99 41
SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
[Bar chart of SPECratio (0–800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
FTC.W99 42
Impact of Means on SPECmark89 for IBM 550

             Ratio to VAX     Time             Weighted Time
Program      Before  After    Before  After    Before  After
gcc          30      29       49      51       8.91    9.22
espresso     35      34       65      67       7.64    7.86
spice        47      47       510     510      5.69    5.69
doduc        46      49       41      38       5.81    5.45
nasa7        78      144      258     140      3.43    1.86
li           34      34       183     183      7.86    7.86
eqntott      40      40       28      28       6.68    6.68
matrix300    78      730      58      6        3.43    0.37
fpppp        90      87       34      35       2.97    3.07
tomcatv      33      138      20      19       2.01    1.94
Mean         54      72       124     108      54.42   49.99

Geometric ratio: 1.33    Arithmetic ratio: 1.16    Weighted arith. ratio: 1.09
Page 8
FTC.W99 43
Performance Evaluation
• "For better or worse, benchmarks shape a field"
• Good products are created when we have:
– Good benchmarks
– Good ways to summarize performance
• Since sales depend in part on performance relative to the competition, vendors invest in improving the product as reported by the performance summary
• If the benchmarks/summary are inadequate, then vendors must choose between improving the product for real programs vs. improving the product to get more sales; sales almost always wins!
• Execution time is the measure of computer performance!
FTC.W99 44
Instruction Set Architecture (ISA)
[Diagram: the instruction set as the interface between software and hardware]
FTC.W99 45
Interface Design

A good interface:
• Lasts through many implementations (portability, compatibility)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
[Timeline diagram: many uses above one Interface; implementations imp 1, imp 2, imp 3 below, succeeding each other over time]
FTC.W99 46
Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation:
High-level Language Based (B5000 1963); Concept of a Family (IBM 360 1964)
General Purpose Register Machines:
Complex Instruction Sets (Vax, Intel 432 1977-80); Load/Store Architecture (CDC 6600, Cray 1 1963-76)
RISC (Mips, Sparc, HP-PA, IBM RS6000, … 1987)
FTC.W99 47
Evolution of Instruction Sets
• Major advances in computer architecture are typically associated with landmark instruction set designs
– Ex: Stack vs. GPR (System 360)
• Design decisions must take into account:
– technology
– machine organization
– programming languages
– compiler technology
– operating systems
• And they in turn influence these
FTC.W99 48
A "Typical" RISC
• 32-bit fixed format instructions (3 formats)
• 32 32-bit GPRs (R0 contains zero; DP takes a pair)
• 3-address, reg-reg arithmetic instructions
• Single address mode for load/store: base + displacement
– no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Page 9
FTC.W99 49
Example: MIPS
[Instruction formats (bit positions 31–26, 25–21, 20–16, 15–0):]
Register-Register:  Op | Rs1 | Rs2 | Rd | Opx
Register-Immediate: Op | Rs1 | Rd | immediate
Branch:             Op | Rs1 | Rs2/Opx | immediate
Jump / Call:        Op | target (bits 25–0)
FTC.W99 50
Summary, #1
• Designing to last through trends:

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

• 6 yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task: execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …: throughput, bandwidth
• "X is n times faster than Y" means:
  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
FTC.W99 51
Summary, #2
• Amdahl's Law:
  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
• CPI Law:
  CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
• Execution time is the REAL measure of computer performance!
• Good products are created when we have good benchmarks and good ways to summarize performance
• Die cost goes roughly with (die area)⁴
• Can the PC industry support engineering/research investment?
FTC.W99 52
Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
[Figure: four loads, A B C D]
FTC.W99 53
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Gantt chart: loads A–D done one after another (30 + 40 + 20 minutes each), 6 PM to midnight]
FTC.W99 54
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Gantt chart: loads A–D overlapped; timeline 30 + 40 + 40 + 40 + 40 + 20 minutes, 6 PM to about 9:30 PM]
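The laundry numbers can be reproduced with a small Python model (function names are mine). The pipelined formula assumes one copy of each resource, so successive loads are spaced by the slowest stage, the 40-minute dryer.

```python
def sequential_minutes(loads, stages):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages):
    """First load takes sum(stages); each later load finishes one
    max(stages) interval after the previous one."""
    return sum(stages) + (loads - 1) * max(stages)

stages = [30, 40, 20]   # washer, dryer, folder (minutes)

print(sequential_minutes(4, stages) / 60)  # → 6.0 hours
print(pipelined_minutes(4, stages) / 60)   # → 3.5 hours
```

This matches the slide's timeline (30 + 40 + 40 + 40 + 40 + 20 = 210 minutes) and previews the lessons that follow: throughput improves, but the pipeline rate is limited by the slowest stage.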
Page 10
FTC.W99 55
Pipelining Lessons
• Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
[Gantt chart: loads A–D overlapped, 6 PM to about 9:30 PM]
FTC.W99 56
Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• DLX desirable features: all instructions the same length, registers located in the same place in the instruction format, memory operands only in loads or stores
FTC.W99 57
5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Datapath diagram, stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back; registers IR and LMD]
FTC.W99 58
Pipelined DLX Datapath (Figure 3.4, page 137)
[Datapath diagram, stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
• Data stationary control: local decode for each instruction phase / pipeline stage
FTC.W99 59
Visualizing Pipelining (Figure 3.3, Page 133)
[Diagram: instructions in order down the page vs. time in clock cycles across]
FTC.W99 60
It's Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: pipelining of branches & other instructions that change the PC; the common solution is to stall the pipeline until the hazard is resolved, inserting "bubbles" in the pipeline
Page 11
FTC.W99 61
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Pipeline diagram: Load followed by Instr 1–4 vs. clock cycles; with one memory port, Load's data access conflicts with a later instruction fetch]
FTC.W99 62
One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Pipeline diagram: Load, Instr 1, Instr 2, a stall bubble, then Instr 3]
FTC.W99 63
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

Speedup = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)

If the ideal CPI is 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
FTC.W99 64
Example: Dual-port vs. Single-port
• Machine A: dual-ported memory
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedUp_A = Pipeline Depth / (1 + 0) × (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUp_B = Pipeline Depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 1.4) × 1.05 = 0.75 × Pipeline Depth
SpeedUp_A / SpeedUp_B = Pipeline Depth / (0.75 × Pipeline Depth) = 1.33
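The same arithmetic in Python (function and variable names are mine); the final ratio is independent of whatever pipeline depth is plugged in.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) * (clock_unpipe / clock_pipe),
    assuming an ideal CPI of 1."""
    return depth / (1 + stall_cpi) * clock_ratio

depth = 5  # any depth gives the same A/B ratio

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)
# Machine B: 40% loads each stall 1 cycle, but 1.05x faster clock.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))  # → 1.33
```

The takeaway mirrors the slide: a modest clock-rate advantage (5%) does not make up for a structural hazard that stalls 40% of the instructions.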
• Machine A is 1.33 times faster
FTC.W99 65
Data Hazard on R1 (Figure 3.9, page 147)
[Pipeline diagram (IF ID/RF EX MEM WB) for: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11; the instructions after add need r1 before add has written it]
FTC.W99 66
Three Generic Data Hazards
Instr_I followed by Instr_J
• Read After Write (RAW): Instr_J tries to read an operand before Instr_I writes it
Page 12
FTC.W99 67
Three Generic Data Hazards
Instr_I followed by Instr_J
• Write After Read (WAR): Instr_J tries to write an operand before Instr_I reads it
– Gets wrong operand
• Can't happen in DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
FTC.W99 68
Three Generic Data Hazards
Instr_I followed by Instr_J
• Write After Write (WAW): Instr_J tries to write an operand before Instr_I writes it
– Leaves wrong result (Instr_I's, not Instr_J's)
• Can't happen in DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes
FTC.W99 69
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Pipeline diagram: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with the ALU result forwarded to the later instructions]
FTC.W99 70
HW Change for Forwarding (Figure 3.20, Page 161)
[Datapath diagram with forwarding paths from the pipeline latches back to the ALU inputs]
FTC.W99 71
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Pipeline diagram: lw r1, 0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9; the load's data is not available in time for sub's EX stage]
FTC.W99 72
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Pipeline diagram: the same sequence with a one-cycle stall inserted before sub]
Page 13
FTC.W99 73
Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e – f;
assuming a, b, c, d, e, and f are in memory.

Slow code:          Fast code:
  LW  Rb,b            LW  Rb,b
  LW  Rc,c            LW  Rc,c
  ADD Ra,Rb,Rc        LW  Re,e
  SW  a,Ra            ADD Ra,Rb,Rc
  LW  Re,e            LW  Rf,f
  LW  Rf,f            SW  a,Ra
  SUB Rd,Re,Rf        SUB Rd,Re,Rf
  SW  d,Rd            SW  d,Rd
FTC.W99 74
Control Hazard on Branches: Three-Stage Stall
FTC.W99 75
Branch Stall Impact
• If CPI = 1 and 30% branches with a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
– Determine branch taken or not sooner, AND
– Compute taken-branch address earlier
• DLX branch tests if register = 0 or ≠ 0
• DLX solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
– 1 clock cycle penalty for branch versus 3
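A two-line Python check of the CPI impact (the helper name is mine):

```python
def cpi_with_branches(base_cpi, branch_freq, stall_cycles):
    """Effective CPI when a fraction of instructions stall for some cycles."""
    return base_cpi + branch_freq * stall_cycles

print(round(cpi_with_branches(1.0, 0.30, 3), 2))  # → 1.9 (3-cycle branch stall)
print(round(cpi_with_branches(1.0, 0.30, 1), 2))  # → 1.3 (zero test moved to ID/RF)
```

Cutting the penalty from 3 cycles to 1 recovers most of the loss: the machine goes from nearly 2x slower than ideal to only 30% slower.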
FTC.W99 76
Pipelined DLX Datapath (Figure 3.22, page 163)
[Datapath diagram, stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back, with the zero test and branch-target adder moved into ID/RF]
This is the correct 1-cycle-latency implementation!
FTC.W99 77
Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– "Squash" instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% of DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken
– 53% of DLX branches taken on average
– But haven't calculated the branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
FTC.W99 78
Four Branch Hazard Alternatives

#4: Delayed Branch
– Define branch to take place AFTER a following instruction

  branch instruction
  sequential successor_1
  sequential successor_2      } branch delay of length n
  ........
  sequential successor_n
  branch target if taken

– 1-slot delay allows proper decision and branch target address in 5-stage pipeline
– DLX uses this
Page 14
FTC.W99 79
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before the branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
FTC.W99 80
Evaluating Branch Alternatives

Scheduling          Branch    CPI    Speedup v.    Speedup v.
scheme              penalty          unpipelined   stall
Stall pipeline      3         1.42   3.5           1.0
Predict taken       1         1.14   4.4           1.26
Predict not taken   1         1.09   4.5           1.29
Delayed branch      0.5       1.07   4.6           1.31

Conditional & unconditional branches = 14% of instructions; 65% of them change the PC

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)
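The CPI column can be reproduced in Python under the stated assumptions (14% branch frequency, a depth-5 pipeline). For predict-not-taken I model the effective penalty as 0.65 × 1 cycle, since only the 65% of branches that change the PC pay it; the results match the table up to rounding.

```python
# Effective branch penalty (cycles) per scheme.
schemes = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 0.65,   # 65% of branches pay the 1-cycle penalty
    "Delayed branch":    0.5,
}

branch_freq, depth = 0.14, 5

for name, penalty in schemes.items():
    cpi = 1 + branch_freq * penalty
    speedup = depth / cpi
    print(f"{name}: CPI = {cpi:.2f}, speedup vs. unpipelined = {speedup:.1f}")
```

The ordering matters more than the decimals: each refinement shaves the effective penalty, and delayed branch gets closest to the ideal CPI of 1.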
FTC.W99 81
Pipelining Summary
• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline Depth; if ideal CPI is 1, then:

  Speedup = Pipeline Depth / (1 + Pipeline stall CPI) × (Clock Cycle Unpipelined / Clock Cycle Pipelined)

• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction