high performance computer architecture
TRANSCRIPT
High Performance Computer Architecture
[Figure: die photos of processor chips, with process node, die area and transistor count]
• Willamette Core (0.18µm), Intel-Pentium-4 (11/2000): 217 mm²/42Mtr
• Northwood Core (0.13µm), Intel-Pentium-4 (01/2002): 145 mm²/55Mtr
• Dothan Core (0.09µm), Intel-Pentium-M (05/2004): 84 mm²/140Mtr
• Conroe Core (0.065µm), Intel-Core2-Duo (07/2006): 143 mm²/291Mtr
• Penryn Core (0.045µm), Intel-Core2-Duo (01/2008): 107 mm²/410Mtr
• Bloomfield Core (0.045µm), Intel-Core-i7 (11/2008): 263 mm²/731Mtr
• Fermi 512G (0.040µm), NVIDIA GF100 (09/2009): 467 mm²/3000Mtr
• Sandy Bridge (0.032µm), Intel-Core-i7-2920XM (01/2011): 216 mm²/995Mtr
• BlueGene/Q, 18 cores (0.045µm), IBM (08/2011): 360 mm²/1470Mtr
• Llano 4C/400G (0.032µm), AMD A8-3850 (06/2011): 228 mm²/1450Mtr
(scale bar: 10 mm)
Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56
[Figure: more die photos]
• First Commercial Dual-Core Chip (0.18µm), IBM Power4 (12/2001): 412 mm²/174Mtr
• Haswell-EP, 18 Cores, 2-SMT (0.022µm), Xeon E5-2600 v3 (06/2015, 7000$): 661 mm²/5560Mtr
• 12 Cores, 8-SMT (0.022µm), IBM Power8 (12/2014): 362 mm²/2100Mtr
• A8X, 11 Cores (0.020µm), Apple-iPad Air2 (11/2014): 128 mm²/3000Mtr
• Ivy Bridge tri-gate (0.022µm), Intel-Core-i7-3770 (04/2012): 160 mm²/1400Mtr
• Knights Landing, 72 Cores, 4-SMT (0.014µm), Xeon Phi (06/2015, estimated): 413 mm²/7100Mtr
• Fiji XT, 2048 GPU-Cores (0.028µm), AMD Radeon R9 Fury-X (06/2015): 567 mm²/8900Mtr
(scale bar: 20 mm)
Where are High Performance Computers?
• Among users: where you need to make this happen: "I have a limited battery and need to… take a picture, share it with my friends, ..."
• In the Internet infrastructure: where you need to connect anybody with anything
• In electronic devices: every electronic device has a computer inside
• In the datacenters: where you need to store and retrieve YOUR data
• In cars, which may have as many as 50+ computers (California approved a bill for autonomous vehicles)
[Figures: AMD Opteron 6200 ARCHITECTURE; AMD Opteron 6200 CORE ("Bulldozer"); AMD Opteron 6272; HP Proliant DL585 G7]
Computer Architects
• Computer architects UNDERSTAND and CAN BUILD the computing infrastructure… and almost ALL details of it! :-)
[Figures: AMD Opteron 6200 CHIP; AMD Opteron 6200 characteristics]
Objectives of this course
• This course constitutes a deeper study of current computers and aims to provide:
  • Principles of high-performance microprocessors (superscalar, VLIW)
  • An understanding of the basic mechanisms for programming applications that take advantage of the parallelism made available by the system
  • Principles of multi-core / multi-processor systems
  • Tools for programming parallel machines
Course Administration
• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office hours: Monday 16:30-19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2
• Adopted textbook:
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN 978-0-521-88675-8
• Other reference textbooks:
  • J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A HW/SW Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1558603433
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, Inc., 1995, ISBN 0867202041
Rules for exams, dates, slides, tools
• Check out the course website:
http://www.dii.unisi.it/~giorgi/teaching/hpca2
Computer Architecture
"The term ARCHITECTURE is used here to describe the attributes of a system as seen by the programmer*, i.e., its conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation."
-- Gene Amdahl, IBM Journal of R&D, Apr. 1964
*programmer == system programmer (OS engineer) or the compiler
Architecture: an overloaded term
• In the strict sense: the hardware/software interface
  • Set of instructions
  • Memory management and protection
  • Interrupts and exceptions (traps)
  • Data formats (for example, IEEE 754 floating point)
• Organization: also called "microarchitecture"
  • In this sense, it is "the implementation" of the architecture (the part that Gene Amdahl had excluded)
  • Specifies the functional units and their connections
  • Configuration of the pipeline
  • Position and configuration of the cache memory
• As a discipline, "Computer Architecture" also includes the microarchitecture
  • To avoid confusion, for the HW/SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006
Levels of Computer Architecture
[Figure: layered view of a computer system. Software side: Application Programs; Libraries; Operating System (Drivers, Memory Manager, Scheduler). Hardware side, below the ISA: Execution Hardware; Memory Translation; Main Memory; System Interconnect (bus); Controllers; I/O devices and Networking. Numbered interfaces (1-14) connect the layers.]
Interfaces:
1: User Interface
2: API
3,7: ABI
4,5,6: internal interfaces of the Operating System
7,8: ISA
9: Memory architecture
10: I/O architecture
11,12: RTL architecture
13,14: Bus architecture
API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level
What technology provides: Moore's Law
• "The number of TRANSISTORS doubles every 18 months" (later revised to "24 months"); this is due to:
  - higher density (transistors / area)
  - availability of bigger chips
[Chart: transistor counts (Mtr) over time, using the data from slide 1]
Moore's Law is purely PSYCHOLOGICAL!
What the market demands: Applications
• Application trends:
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous, "small"
    - mainframes -> minis -> microprocessors -> handheld, embedded
  • FROM little TO big memory storage (primary and secondary)
  • FROM single thread TO multiple threads
  • FROM standalone TO networked (cloud computing)
  • FROM character-oriented TO multimedia (graphics and sound)
  • FROM personal data TO "BIG DATA"
Main Applications
• Numerical/Scientific
  • computational fluid dynamics, weather prediction, ECAD
  • long word length, floating-point arithmetic
• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage
• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important; low power; cost critical
• Home Computing
  • multimedia, entertainment
  • high-bandwidth data movement, graphics
  • cryptography, compression/decompression
App. Trends: Multimedia, Networked, Web servers
• A large choice of multimedia devices with:
  • graphic displays (LCD, etc.)
  • high-definition audio
  • large secondary-storage capacity for images, sound, etc.
• Services via the Web and high-performance networks require:
  • many independent threads
  • wide-band communication
MICROPROCESSOR ARCHITECTURE
• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
• 1970s - serial CPUs, 1 bit for integers
• 1980s - 32-bit RISC with a pipeline
  - The simplicity of the ISA allows the integration of the entire processor on a single chip
• 1990s - bigger CPUs, superscalar
  - also for CISC
• 2000s - multiprocessors on a chip...
Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processors
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for programming parallel machines (Cilk, OpenMP, MPI, CUDA, ...)
EVALUATING COMPUTERS
POWER
• TOTAL POWER = DYNAMIC + STATIC (LEAKAGE)

  Pdynamic = α·C·V²·f
  Pstatic = V·Isub ≈ V·e^(-K·Vt/T)

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATES
  • DYNAMIC POWER IS ROUGHLY PROPORTIONAL TO f³ (V MUST SCALE ROUGHLY WITH f, SO α·C·V²·f GROWS AS f³)
  • TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP AND 4X POWER
  • TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!
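As a quick sanity check of the two design points above, here is a small sketch (the activity factor, capacitance, voltage and frequency values are illustrative, not real chip parameters), assuming V scales linearly with f:

```python
# Dynamic power model from the slide: P_dynamic = alpha * C * V^2 * f.
# All parameter values below are made up for illustration.

def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f

alpha, C = 0.5, 1e-9     # hypothetical activity factor and switched capacitance
V, f = 1.0, 2e9          # hypothetical supply voltage (V) and clock (Hz)

base = dynamic_power(alpha, C, V, f)

# Option 1: replicate the core 4 times at the same V and f -> 4x power
four_cores = 4 * dynamic_power(alpha, C, V, f)

# Option 2: clock one core 4x faster; V scales with f, so power grows as f^3
fast_clock = dynamic_power(alpha, C, 4 * V, 4 * f)

print(four_cores / base)   # 4.0
print(fast_clock / base)   # 64.0
```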
• STATIC POWER MATTERS TOO, BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS
• POWER/ENERGY ARE CRITICAL PROBLEMS
  • POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
    • OTHERWISE THE TEMPERATURE GOES UP (AFFECTING PERFORMANCE AND CORRECTNESS, AND POSSIBLY DESTROYING THE CIRCUIT, SHORT TERM OR LONG TERM)
    • IT ALSO AFFECTS THE SUPPLY OF POWER TO THE CHIP
  • ENERGY (DEPENDS ON POWER AND SPEED)
    • COSTLY; A GLOBAL PROBLEM
    • CRITICAL FOR BATTERY-OPERATED DEVICES
RELIABILITY
• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C × V
  • IF C AND V DECREASE, IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • THE DEVICE IS STILL OPERATIONAL BUT A VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES
• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER; DUE TO:
    • TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    • INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION
• PERMANENT FAILURES
  • MEAN THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY A SPARE
PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES
PERFORMANCE METRICS (MEASURES)
• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY
  • "X IS N TIMES FASTER THAN Y" MEANS Texe(Y)/Texe(X) = N
  • THE MAJOR METRIC USED IN THIS COURSE
• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS
  • THE THROUGHPUT OF X IS N TIMES HIGHER THAN THAT OF Y IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (E.G., ON MULTIPROCESSORS)
• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING-POINT OPERATIONS PER SECOND
THE EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE

BENCHMARKING
WHICH PROGRAM TO CHOOSE?
• REAL PROGRAMS:
  • PORTING PROBLEMS; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS
• KERNELS:
  • COMPUTATIONALLY INTENSE PIECES OF REAL PROGRAMS
• TOY BENCHMARKS (E.G., QUICKSORT, MATRIX MULTIPLY)
• SYNTHETIC BENCHMARKS (NOT REAL)
• BENCHMARK SUITES:
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    • SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    • INTEGER AND FLOATING POINT
    • A NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS:
    • FOR COMMERCIAL SYSTEMS
    • TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS
REPORTING PERFORMANCE FOR A SET OF PROGRAMS
LET Ti BE THE EXECUTION TIME OF PROGRAM i (OUT OF N PROGRAMS):
1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

   AM = (1/N)·Σi Ti    OR    WAM = Σi Wi·Ti

   THE PROBLEM HERE IS THAT THE PROGRAMS WITH THE LONGEST EXECUTION TIMES DOMINATE THE RESULT
2. DEALING WITH SPEEDUPS
   • SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (LET TR,i BE THE EXECUTION TIME ON THE REFERENCE MACHINE):

     Si = TR,i / Ti

   • ARITHMETIC MEAN OF SPEEDUPS: (1/N)·Σi Si
   • HARMONIC MEAN: N / Σi (1/Si)
REPORTING PERFORMANCE FOR A SET OF PROGRAMS
• GEOMETRIC MEAN OF SPEEDUPS:

  GM = (Πi Si)^(1/N)

  - MEAN SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
  - EASILY COMPOSABLE
  - USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT
Example 1 - Quantitative comparison depends on the reference machine

              Program A   Program B   Arithmetic Mean   Speedup (ref 1)   Speedup (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8              10
Machine 2     1 sec       200 sec     100.5 sec         50.2              5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec
Example 2 - Contrasting results with the arithmetic and harmonic means

              Program A   Program B
Machine 1     10 sec      100 sec
Machine 2     1 sec       200 sec
Reference 1   100 sec     10000 sec
Reference 2   100 sec     1000 sec

In terms of speedup:

                  Program A   Program B   Arithmetic   Harmonic   Geometric
Wrt Reference 1
  Machine 1       10          100         55           18.2       31.6
  Machine 2       100         50          75           66.7       70.7
Wrt Reference 2
  Machine 1       10          10          10           10         10
  Machine 2       100         5           52.5         9.5        22.4

GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME!!
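The two example slides above can be reproduced with a few lines of Python; the helper names below are ours, and the execution times are the ones in the tables:

```python
from math import prod

def am(xs): return sum(xs) / len(xs)                 # arithmetic mean
def hm(xs): return len(xs) / sum(1 / x for x in xs)  # harmonic mean
def gm(xs): return prod(xs) ** (1 / len(xs))         # geometric mean

# Execution times (seconds) for programs A and B, from the example tables.
times = {
    "M1":   [10, 100],
    "M2":   [1, 200],
    "Ref1": [100, 10000],
    "Ref2": [100, 1000],
}

def speedups(machine, ref):
    """Per-program speedups of `machine` over reference machine `ref`."""
    return [tr / t for tr, t in zip(times[ref], times[machine])]

# The GM ratio between the two machines does not depend on the reference:
for ref in ("Ref1", "Ref2"):
    s1, s2 = speedups("M1", ref), speedups("M2", ref)
    print(ref, round(gm(s1) / gm(s2), 4))   # 0.4472 in both cases
```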
FUNDAMENTAL PERFORMANCE EQUATION FOR CPUs (ALSO KNOWN AS THE "IRON LAW")

Texe = IC × CPI × Tc

• IC: DEPENDS ON PROGRAM, COMPILER AND ISA
• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI (CLOCKS PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME
• WHEN THE PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK, USE IPC (INSTRUCTIONS PER CLOCK):

Texe = (IC × Tc) / IPC
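The iron law is easy to turn into a two-line helper; the program parameters below (instruction count, CPI, clock) are invented for illustration:

```python
# Iron law: Texe = IC * CPI * Tc, and the superscalar form Texe = IC * Tc / IPC.

def texe(ic, cpi, tc):
    return ic * cpi * tc

def texe_ipc(ic, tc, ipc):
    return ic * tc / ipc

# hypothetical program: 1e9 instructions, CPI = 1.25, 2 GHz clock (Tc = 0.5 ns)
print(texe(1e9, 1.25, 0.5e-9))       # execution time in seconds
# the same machine expressed with IPC = 1/CPI = 0.8
print(texe_ipc(1e9, 0.5e-9, 0.8))    # same result
```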
AMDAHL'S LAW
• AN ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S

  without E:  |---- 1-F ----|------ F ------|
                    (apply enhancement)
  with E:     |---- 1-F ----|-- F/S --|

  Texe(withE) = Texe(withoutE) × ((1 - F) + F/S)

  Speedup(E) = Texe(withoutE) / Texe(withE) = 1 / ((1 - F) + F/S)
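The speedup formula above is a one-liner; here is a small sketch (the F and S values are illustrative):

```python
# Amdahl's law: Speedup(E) = 1 / ((1 - F) + F/S).

def amdahl_speedup(F, S):
    """Speedup when a fraction F of the execution time is accelerated by S."""
    return 1.0 / ((1.0 - F) + F / S)

print(amdahl_speedup(0.5, 10))    # half the program made 10x faster: ~1.82 overall
print(amdahl_speedup(0.5, 1e9))   # S -> infinity: approaches the 1/(1-F) = 2 bound
```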
LESSONS FROM AMDAHL'S LAW
1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED:

   Speedup(E) < 1 / (1 - F)

• LAW OF DIMINISHING RETURNS - MARGINAL SPEEDUP
  • THE DIFFERENCE BETWEEN SPEEDUP(k+1) AND SPEEDUP(k) GETS SMALLER AND SMALLER AS S GOES FROM k TO k+1
2) OPTIMIZE THE COMMON CASE
  • EXECUTE THE RARE CASE IN SOFTWARE (E.G., EXCEPTIONS)
[Chart, for F=0.5: Amdahl's Law speedup versus S, with Amdahl's maximum 1/(1-F), the remaining speedup and the marginal speedup]
PARALLEL SPEEDUP
• WITH P PROCESSORS APPLIED TO THE PARALLEL FRACTION F:

  Speedup(P) = 1 / ((1 - F) + F/P) < 1 / (1 - F)

• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??
• OVERALL, NOT VERY HOPEFUL
[Chart, for F=0.95: Amdahl's Law "mortar shot" curve versus the ideal speedup, with Amdahl's maximum 1/(1-F)]
GUSTAFSON'S LAW
• REDEFINE SPEEDUP
• THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING
• START WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:

  TP = s + p

  • s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• THE EXECUTION TIME ON ONE PROCESSOR IS:

  T1 = s + p·P

• LET F = p/(s+p). THEN:

  SP = (s + p·P)/(s + p) = (s + p - p + p·P)/(s + p) = 1 - F + F·P = 1 + F·(P - 1)

Gustafson observes that even if the single algorithm/program completes faster only when the parallel portion is dominant (Amdahl), the same algorithm will complete faster and faster as we add processors (P), compared to a purely sequential execution that just repeats the parallel portion (p) P times.
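The derivation above reduces to a one-line function; a sketch with illustrative values of F and P:

```python
# Gustafson's law: SP = 1 + F*(P - 1), where F is the fraction of time the
# parallel machine spends in parallel code.

def gustafson_speedup(F, P):
    return 1 + F * (P - 1)

# Unlike Amdahl's bound, the scaled speedup keeps growing with P:
print(gustafson_speedup(0.95, 10))
print(gustafson_speedup(0.95, 100))
```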
[Chart: scaled speedup SP versus the parallel fraction F]
Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processors
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for programming parallel machines (Cilk, OpenMP, MPI, CUDA, ...)
PIPELINING
Pipelining
• Pipelining principles
• Simple pipeline
• Structural hazards
• Data hazards
• Control hazards
Pipelining principles
• Consider instructions composed of n phases of equal duration
• Let T be the time to execute an instruction
• Without pipelining:
  • Latency = T
  • Throughput_seq = 1 / T
• With an ideal n-stage pipeline:
  • Latency = T
  • Throughput_pipe = n / T
• Speedup = Throughput_pipe / Throughput_seq = n
[Diagram: the n phases (1, 2, ..., n) of successive instructions executed back-to-back without pipelining, versus overlapped in an n-stage pipeline]
The (ideal) speedup obtainable from an ideal pipeline is equal to n
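The latency/throughput relations above can be checked with a short sketch (the T and n values are illustrative):

```python
# Ideal n-stage pipeline: latency stays T, throughput grows from 1/T to n/T.

def throughput_seq(T):
    return 1.0 / T

def throughput_pipe(T, n):
    return n / T          # one instruction completes every T/n

def pipeline_speedup(T, n):
    return throughput_pipe(T, n) / throughput_seq(T)

print(pipeline_speedup(10e-9, 5))   # equal to n for an ideal 5-stage pipeline
```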
Implementation of a Simple Pipeline
• Simple 5-stage pipeline:
  • F - Instruction Fetch
  • D - Instruction Decode + Operand Fetch
  • X - Execution and Effective Address
  • M - Memory Access
  • W - Write-back Results
[Diagram: the stages F, D, X, M, W separated by latches, all driven by the same clock]
5-STAGE PIPELINE
INSTRUCTIONS GO THROUGH EVERY STAGE IN PROCESS ORDER, EVEN IF THEY DON'T USE THE STAGE
• NOTE: CONTROL IMPLEMENTATION
  • THE INSTRUCTION CARRIES ITS CONTROL
  • THIS IS A GENERAL APPROACH: "THE INSTRUCTION CARRIES ITS BAGGAGE"

Notation
5-stage pipeline (F = inst. fetch, D = inst. decode, X = execute, M = memory access, W = write-back):

        1  2  3  4  5  6  7  8  9
  i     F  D  X  M  W
  i+1      F  D  X  M  W
  i+2         F  D  X  M  W
  i+3            F  D  X  M  W
  i+4               F  D  X  M  W
Pipeline Hazards
• Conditions that lead to a malfunction if certain countermeasures are not taken
1) Structural Hazards
  • Two instructions want to use the same hardware resource in the same cycle (conflict over resources, e.g., instruction memory and data memory)
2) Data Hazards
  • Two instructions use the same data: accesses must happen in the order defined by the programmer, even though execution overlaps parts of the instructions (see RAW, WAW, WAR)
3) Control Hazards
  • An instruction (branch, jump, call) determines which instructions are executed next, but the pipeline has already fetched the instructions that sequentially follow it, even if the branch is taken
1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle. Example:
  • a load/store uses the same memory that is used by the instruction fetch

  i     F  D  X  M  W               <-- load instruction
  i+1      F  D  X  M  W
  i+2         F  D  X  M  W
  i+3            *  F  D  X  M  W   <-- i-fetch stalls
  i+4                  F  D  X  M  ...
Resolving structural hazards
• Stall one of the involved instructions
  + Cost-effective and simple
  - Reduces performance
  - Used for some rare events
• Pipeline the resource
  • Useful if possible (e.g., for resources that require more cycles)
  + Good performance
  - In some cases too complex to do (e.g., RAM)
• Replicate the resource
  + Good performance
  - Costly
  - Probably introduces delays
  - Used for cheap (or indivisible) resources
[Diagram: de-mux/mux pair around a replicated resource]
Guidelines to reduce structural hazards
• Structural hazards can be avoided if each instruction uses the resource:
  • At most once
    - (e.g., separate instruction memory and data memory)
  • Always in the same pipeline stage
    - (e.g., I-fetch in stage F, memory R/W in stage M)
  • For a single cycle
    - (e.g., a HIT in the data or instruction cache)
• Many RISC processor ISAs were designed with this in mind
• Example of a problematic situation:
  • a MISS in the cache: the pipeline stalls
2) Data Hazards
• Two instructions use the same data: accesses must happen in the order indicated by the programmer, even though execution overlaps parts of the instructions. Example:

  R1 <- R2 + R3
  R2 <- R1 - R7
  R1 <- R5 OR R6
Data Hazards -- examples

  i    add r1, r2, r3   F  D  X  M  W             (writing r1)
  i+1  sub r2, r1, r7      F  D  X  M  W          (reading r1)
       r1 ?? Read-After-Write (RAW) hazard

  i+1  sub r2, r1, r7   F  *  *  *  *  *  D  X  M  W   (reading r1)
  i+2  or  r1, r5, r6      F  D  X  M  W               (writing r1)
       r1 ?? Write-After-Read (WAR) hazard
       Note: PURELY HYPOTHETICAL SITUATION - it cannot happen in this pipeline, by construction

  i    add r1, r2, r3   F  *  *  *  D  X  M  W    (writing r1)
  i+1  sub r2, r1, r7   F  D  X  M  W
  i+2  or  r1, r5, r6      F  D  X  M  W          (writing r1)
       r1 ?? Write-After-Write (WAW) hazard
       Note: PURELY HYPOTHETICAL SITUATION - it cannot happen in this pipeline, by construction
Dependency and Hazard
Dependency: a situation in the code that can potentially create hazards
• Read-After-Write (RAW, true dependence)
  • There is a real "data exchange" from one instruction to another
• Write-After-Read (WAR, anti-dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Write-After-Write (WAW, output dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Read-After-Read (RAR)
  • Does not cause problems
HAZARDS
• Dependencies may turn into hazards depending on the hardware
True Dependence and MIPS Data Hazards
• Read After Write (RAW): the instruction J tries to read an operand before the instruction I writes it

  I: add r1,r2,r3
  J: sub r4,r1,r3

• Caused by a "dependence" (in the terminology of compiler theory) called "true dependence"
• The hazard results from a real need for data communication
• In the MIPS processor a true dependence normally generates a hazard
Anti-Dependence and MIPS Data Hazards
• Write After Read (WAR): the instruction J tries to write an operand before the instruction I reads it

  I: sub r4,r1,r3
  J: add r1,r2,r3

• Also called "anti-dependence" (in the terminology of compiler theory)
• It results from having reused the name "r1", while another register could easily have been used (*)
• It does not cause a hazard in the 5-stage MIPS pipeline because:
  • all instructions take 5 stages, and
  • the reads from registers occur in stage 2 (D), and
  • all writes always occur in stage 5 (W)
(*) This, however, is not always possible in software; see Lesson 2
Output Dependence and MIPS Data Hazards
• Write After Write (WAW): the instruction J tries to write an operand before the instruction I writes it

  I: mul r1,r4,r3
  J: add r1,r2,r3

• Also called "output dependence" (in the terminology of compiler theory)
• Also in this case, it results from having reused the name "r1", while another register could easily have been used (*)
• It does not cause a hazard in the 5-stage MIPS pipeline because:
  • all instructions take 5 stages, and
  • all writes always occur in stage 5 (W)
(*) This, however, is not always possible in software; see Lesson 2
Simple resolution of the RAW hazard
• The hardware detects the RAW hazard, and then...
• generates a stall to allow the "producer" instruction to finish

                  F  D  X  M  W
  R1 <- R2 + R3   F  D  X  M  W
  R2 <- R1 - R7      F  *  *  D  X  M  W

+ Cost-effective and simple
- Reduces performance
NOTE: it is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)
Implementation: the stall control network
• Add latches to remember the RS1/RS2/RD register identifiers at each stage
• The stall is detected with the following comparison:
  if (RS1(D)==RD(X) || RS1(D)==RD(M)) then STALL (generate a stall in F)
• Similarly for RS2
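The comparison above can be sketched as a small Python predicate (representing register names as strings and using None for "no destination register" are our modeling choices, not part of the slide):

```python
# Stall condition from the slide: the instruction in stage D reads a register
# (RS1 or RS2) that an older instruction still in stage X or M will write (RD).

def must_stall(rs1_d, rs2_d, rd_x, rd_m):
    pending = {rd for rd in (rd_x, rd_m) if rd is not None}
    return rs1_d in pending or rs2_d in pending

# add r1,r2,r3 in stage X while sub r2,r1,r7 is in stage D: stall on r1
print(must_stall("r1", "r7", "r1", None))    # True
# no register overlap: no stall
print(must_stall("r5", "r6", "r2", "r3"))    # False
```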
[Diagram: register file and execution unit with the D-Cache; the RS1/RD identifiers are latched along stages D, X and M and feed the Stall Control, which generates the stall signal]
Inserting Stalls (detail)
• Relative to the instruction that creates the stall, it is necessary:
  • on the previous stages: to hold all the inter-stage latches
  • on the next stages: to turn off the valid bit associated with the inter-stage latch, so that the "bubble" can proceed through the pipeline without creating problems
[Diagram: the STALL signal drives the HOLD input of the flip-flops of the previous stage and of the stalling instruction's stage, and clears the valid bit (V=0) entering the next stage, while later stages keep V=1]
Reduction of the RAW stalls
• Bypass/Forward/Short-circuit network
  • Idea: use the data before it is written into the registers
  + Reduces (potentially avoids) stalls
  - Additional complexity
[Diagram: IF ID EX ME WB pipeline with bypass paths feeding back into EX]
Bypass
• Additional hardware:
  • multiplexers to select the input value to the ALU: either from the registers or from the bypass network
  • hazard-detection logic (called interlock) that controls these multiplexers
[Diagram: register file and operand latches feeding two MUXes on the ALU inputs; the MUXes are driven by the bypass control and also receive the bypass paths; the execution unit (ALU) output goes to the result latch]
Network to detect the possibility of Bypass
• Add latches to remember the RS1/RS2/RD register names at each stage (similar to the network for hazard detection)
• E.g., on the input A of the ALU, the network will act like this:
  if RS1(D)==RD(X) then select ALU-OUT(X)
  else if RS1(D)==RD(M) then select D-CACHE-OUT(M)
  else select A
• ...and similarly on the B input...
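The selection rule above, written as a Python sketch of the mux on input A (the function and argument names are ours, and the values are illustrative):

```python
# Bypass mux for ALU input A: prefer the youngest in-flight producer.

def select_alu_input_a(rs1_d, rd_x, rd_m, reg_value, alu_out_x, dcache_out_m):
    if rs1_d is not None and rs1_d == rd_x:
        return alu_out_x        # result just computed in stage X
    if rs1_d is not None and rs1_d == rd_m:
        return dcache_out_m     # value arriving from stage M
    return reg_value            # no bypass: use the register-file value

# r1 is being produced in X: forward 42 instead of the stale register value 0
print(select_alu_input_a("r1", "r1", "r1", 0, 42, 7))   # 42
```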
[Diagram: register file and MUXes on the ALU inputs A and B, selecting among the register value, ALU-OUT(X) and D-CACHE-OUT(M), driven by the Bypass Control that compares RS1 with the RD latched in stages X and M]
Interaction between the Stall and Bypass control networks
• The stall logic is aware of the presence of the bypass logic
• The bypass logic is activated independently at each stall condition
[Diagram: the Stall Control and the Bypass Control both observe the RS1/RD identifiers latched along the stages; the bypass MUXes feed the execution unit while the stall signal freezes the front end]
Pipeline Scheduling
• Scheduling of instructions at compile time (reorder the instructions to reduce the stalls caused by loads):

BEFORE:                          AFTER:
a = b + c;   R1 <- mem(b)        R1 <- mem(b)
             R2 <- mem(c)        R2 <- mem(c)
             stall               R4 <- mem(e)
             R3 <- R1 + R2       R3 <- R1 + R2
             mem(a) <- R3        R5 <- mem(f)
d = e - f;   R4 <- mem(e)        mem(a) <- R3
             R5 <- mem(f)        R6 <- R4 - R5
             stall               mem(d) <- R6
             R6 <- R4 - R5
             mem(d) <- R6
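A small sketch that counts the load-use stalls in the two sequences above (the tuple encoding and the one-bubble load-use model, with full bypassing otherwise, are our simplifying assumptions):

```python
# Each instruction is (dest, sources, is_load); a stall occurs when an
# instruction consumes the result of the load immediately preceding it.

def count_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        dest, _, is_load = prev
        _, srcs, _ = curr
        if is_load and dest in srcs:
            stalls += 1
    return stalls

before = [
    ("R1", (), True),             # R1 <- mem(b)
    ("R2", (), True),             # R2 <- mem(c)
    ("R3", ("R1", "R2"), False),  # R3 <- R1 + R2   (uses R2 right after its load)
    (None, ("R3",), False),       # mem(a) <- R3
    ("R4", (), True),             # R4 <- mem(e)
    ("R5", (), True),             # R5 <- mem(f)
    ("R6", ("R4", "R5"), False),  # R6 <- R4 - R5   (uses R5 right after its load)
    (None, ("R6",), False),       # mem(d) <- R6
]

after = [
    ("R1", (), True),
    ("R2", (), True),
    ("R4", (), True),
    ("R3", ("R1", "R2"), False),
    ("R5", (), True),
    (None, ("R3",), False),       # mem(a) <- R3
    ("R6", ("R4", "R5"), False),
    (None, ("R6",), False),       # mem(d) <- R6
]

print(count_stalls(before), count_stalls(after))   # 2 0
```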