high performance computer architecturegiorgi/didattica/lezioni/lezioni220/c220... · • principles...

56
High Performance Computer Architecture Dothan Core (0.09m) 145 mm 2 /55Mtr 84 mm 2 /140Mtr 217 mm 2 m/42Mtr 143 mm 2 /291Mtr Conroe Core (0.065m) Intel-Pentium-4 (11/2000) Intel-Pentium-4 (01/2002) Intel-Pentium-M (05/2004) Intel-Core2-Duo (07/2006) Willamette Core (0.18m) Northwood Core (0.13m) Penryn Core (0.045m) 107 mm 2 /410Mtr Bloomfield Core (0.045m) 263 mm 2 /731Mtr Intel-Core-i7 (11/2008) Fermi 512G (0.040m) Sandy Bridge (0.032m) 216 mm 2 /995Mtr 467 mm 2 /3000Mtr Intel-Core2-Duo (01/2008) IBM (08/2011) Intel-Core-i7-2920XM (01/2011) BlueGene/Q – 18 cores (0.045m) NVIDIA GF100 (09/2009) 360 mm 2 /1470Mtr Llano 4C/400G (0.032m) 228 mm 2 /1450Mtr AMD A8-3850 (06/2011) 1 Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56 10 mm

Upload: others

Post on 30-Apr-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

High Performance Computer Architecture

Dothan Core(0.09m)

145 mm2/55Mtr 84 mm2/140Mtr

217 mm2m/42Mtr

143 mm2/291MtrConroe Core (0.065m)

Intel-Pentium-4 (11/2000)

Intel-Pentium-4(01/2002)

Intel-Pentium-M(05/2004)

Intel-Core2-Duo(07/2006)Willamette Core (0.18m)

Northwood Core (0.13m)

Penryn Core(0.045m)

107 mm2/410Mtr

Bloomfield Core (0.045m)263 mm2/731Mtr

Intel-Core-i7 (11/2008)

Fermi 512G (0.040m)

Sandy Bridge (0.032m) 216 mm2/995Mtr

467 mm2/3000Mtr

Intel-Core2-Duo(01/2008)

IBM (08/2011)

Intel-Core-i7-2920XM (01/2011)

BlueGene/Q – 18 cores (0.045m)NVIDIA GF100 (09/2009) 360 mm2/1470Mtr

Llano 4C/400G (0.032m) 228 mm2/1450Mtr AMD A8-3850 (06/2011)1Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

10 mm

Page 2: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

2Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

412 mm2m/174MtrFirst Commercial Dual-Core Chip (0.18m)

20 mm

661 mm2m/5560MtrXeon E5-2600 v3 (06/2015 7000$)

Haswell-EP – 18 Cores – 2SMT (0.022 m)

IBM Power4 (12/2001)

362 mm2m/2100Mtr12 Cores – 8SMT (0.022m)

IBM Power8 (12/2014)

128 mm2m/3000MtrApple-iPad Air2 (11/2014)

A8X 11 Cores (0.020m)Ivy Bridge (0.022m) tri-gate

160 mm2/1400MtrIntel-Core-i73770 (04/2012)

413 mm2m/7100MtrXeon Phi (06/2015)(estimated)

Knights Landing – 72 Cores – 4SMT (0.014 m)567 mm2m/8900Mtr

AMD Radeon R9 Fury-X (06/2015)

Fuji XT – 2048 GPU-Cores (0.028 m)

Page 3: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

3Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

Page 4: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Where are High Performance Computers ?

4Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

Among users

Where you need to make this happen:“I have a limited battery and need to… take a picture, share it with my friends, ...”

In the Internet Infrastructure

Where you need to connect anybody with anything

Every electronic devicehas a Computers inside

Electronic Devices

In the Datacenters Where you need to Storeand Retrieve YOUR data

Cars may have many as 50+ computers:(California approved a billfor autonomous vehicles)

Page 5: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

AMD Opteron 6200 ARCHITECTURE AMD Opteron 6200 CORE (“Bulldozer”)

AMD Opteron 6272HP Proliant DL 585 G7

Computer Architects• Computer Architects UNDERSTAND and CAN BUILDthe Computing Infrastructure… and almost ALL details of it ! :-)

5Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

AMD Opteron 6200 CHIP

AMD Opteron 6200 characteristics

Page 6: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Objectives of this course• This course constitutes a deeper study of current computers and aims to provide:

• Principles of high-performance microprocessors (superscalar, VLIW)

• An understanding of the basic mechanisms for the programming of applications that take advantage of the parallelism made available by the system

• Principles of Multi-Core / Multi-Processor Systems

• Tools for programming Parallel Machines

6Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

Page 7: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Course Administration• Teacher: Roberto Giorgi ( [email protected] )• Telephone: 0577-191-5182• Office-hours: Monday 16:30/19:00• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2

• Adopted Textbook:• M. Dubois, M. Annavaram, P. Stenstrom,

"Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN: 978-0-521-88675-8

• Other Reference Textbooks• Hennessy and Patterson,

“Computer Architecture: A Quantitative Approach” 5th Ed.,Morgan Kauffman, 2012,ISBN 978-0-12-383872-8

• D. Culler, J.P. Singh, A. Gupta,"Parallel Computer Architecture: A Hw/Sw Approach",Morgan Kaufman/Elsevier, 1998, ISBN 1558603433

• M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design",Jones and Bartlett Publishers, Inc., 1995, ISBN 0867202041

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 567

Page 8: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Rules for exams, dates, slides, tools• Check out the course website:

8Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

http://www.dii.unisi.it/~giorgi/teaching/hpca2

Page 9: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Computer Architecture“The term ARCHITECTURE is used here to describe the set of attributes of a system, as these appears to the programmer*, i.e., its conceptual structure and its operation, with a distinctive organization of the networks that manage the flow of data and control networks, as compared to the logical design and physical implementation”

-- Gene Amdahl, IBM Journal of R&D, Apr. 1964

*programmer == system programmer (OS) engineer or the compiler

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 569

Page 10: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Architecture: an overloaded term• In the strict sense: Interface Hardware / Software

• Set of instructions• Memory management and protection• Interruptions and exceptions (traps)• Data formats (for example, IEEE 754 floating point)

• Organization: also called "Microarchitecture"• In this sense, it is "the implementation" of architecture

(this is a part that Gene Amdahl had excluded)• Specifies the functional units and connections• Configuration of the pipeline• Position and configuration of cache memory

• As a discipline, "Architecture of Computers" also includes the microarchitecture• To avoid confusion when it comes to interface HW / SW we use "Instruction Set

Architecture" (ISA)• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5610

Page 11: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Levels of Computer Architecture

I/O devicesand

Networking

Controllers

System Interconnect(bus)

Controllers

MemoryTranslation

Execution Hardware

Drivers MemoryManager Scheduler

Operating System

Libraries

ApplicationPrograms

MainMemory

1

2

33

4 5 6

7 78888

9

10 10

1111 12

13 14

ISA

Software

Hardware

1: User Interface2: API3,7: ABI4,5,6: internal interface of

the Operating System7,8: ISA9: Memory architecture10: I/O architecture11,12: RTL architecture13,14: Bus architecture

API=Application Program InterfaceABI=Application Binary InterfaceISA=Instruction Set ArchitectureRTL=Register Transfer Level

Interfaces:

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5611

Page 12: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

What technology provides: Moore's Law• “The number of TRANSISTORS doubles every 18 months”

(Later revised to "24 months"), this is due to:- higher density (transistors / area)- availability of bigger chips

DATA FROM SLIDE 1

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5612Moore's Law and purely PSYCHOLOGICAL!

Mtr

Page 13: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

What the market demands: Applications• Application Trend

• FROM numerical, scientific TO commercial, entertainment• FROM few "big" TO ubiquitous, "small“

- mainframes minis microprocessors handheld, embedded• FROM little TO big memory storage (primary and secondary)• FROM single-thread TO multiple-threads• FROM standalone TO networked (cloud computing)• FROM character-oriented TO multimedia (graphics and sound)• FROM personal data TO “BIG DATA”

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5613

Page 14: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Main Applications• Numerical/Scientific

• Computational Fluid Dynamics, Weather Prediction, ECAD• Long word length, floating point arithmetic

• Commercial• inventory control, billing, payroll, decision support• byte oriented, fixed point, high I/O, large secondary storage

• Real-Time/Embedded• control, some communications• predictable performance• interrupt architecture important, low power, cost critical

• Home Computing• multimedia, entertainment• high bandwidth data movement, graphics• cryptography, compression/decompression

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5614

Page 15: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

App. Trends: Multimedia, Networked, Web-servers• A large choice of multimedia devices with

• Graphic displays (LCD, etc.).• High Definition Audio• Large capacity of secondary storage for images, sound, etc…

• Services via the Web and high-performance networks require• Many independent threads• Wide band communication

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5615

Page 16: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

MICROPROCESSOR ARCHITECTURE• The increasing number of transistors (cheaper and faster)has fueled the demand for higher performance CPU

• 1970s – Serial CPU, 1-bit for integers

• 1980s – 32-bit RISC with a pipeline- The ISA simplicity allows the integration

of the entire processor chip

• 1990s – bigger CPUs, superscalar- Also for CISC

• 2000s – Multiprocessors on a chip...

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5616

Page 17: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Course Structure

1. High Performance Pipelining 2. Branch Prediction3. Superscalar processor4. Media Processing: VLIW processors5. Multiprocessors and related problems6. TLP: Thread Level Parallelism7. Evaluation of High Performance Architectures8. Tools for Parallel programming machines

(Cilk, OpenMP, MPI, CUDA, ...)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5617

Page 18: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

EVALUATING COMPUTERS

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5618

Page 19: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

POWER

• TOTAL POWER: DYNAMIC + STATIC(LEAKAGE)

Pdynamic = αCV2f

Pstatic = VIsub ≈ Ve-KVt/T

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATE• DYNAMIC POWER ROUGHLY PROPORTIONAL TO f3

• TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER• TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!

• STATIC POWER • BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS.

• POWER/ENERGY ARE CRITICAL PROBLEMS• POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED

• OTHERWISE TEMPERATURE GOES UP (AFFECTS PERFORMANCE, CORRECTNESS AND MAY POSSIBLY DESTROY THE CIRCUIT, SHORT TERM OR LONG TERM)

• EFFECT ON THE SUPPLY OF POWER TO THE CHIP

• ENERGY (DEPENDS ON POWER AND SPEED)• COSTLY; GLOBAL PROBLEM• BATTERY OPERATED DEVICES

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5619

Page 20: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

RELIABILITY

• TRANSIENT FAILURES (OR SOFT ERRORS)• CHARGE Q = C X V

• IF C AND V DECREASE THEN IT IS EASIER TO FLIP A BIT• SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING

FROM THE PACKAGING MATERIAL• DEVICE IS STILL OPERATIONAL BUT VALUE HAS BEEN CORRUPTED• SHOULD DETECT/CORRECT AND CONTINUE EXECUTION • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES

• INTERMITTENT/TEMPORARY FAILURES• LAST LONGER• DUE TO

• TEMPORARY: ENVIRONMENTAL VARIATIONS (EG, TEMPERATURE)• INTERMITTENT: AGING

• SHOULD TRY TO CONTINUE EXECUTION• PERMANENT FAILURES

• MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN• MUST BE ISOLATED AND REPLACED BY SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5620

Page 21: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

PERFORMANCE METRICS (MEASURE)

• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY• “X IS N TIMES FASTER THAT Y” MEANS Texe(Y)/Texe(X) = N• THE MAJOR METRIC USED IN THIS COURSE

• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS• THE THROUGHPUT FOR X IS N TIMES HIGHER THAN Y IF

THROUGHPUT(X)/THROUGHPUT(Y) = N• NOT THE SAME AS LATENCY (EXAMPLE OF MULTIPROCESSORS)

• EXAMPLES OF UNRELIABLE METRICS:• MIPS: MILLION OF INSTRUCTIONS PER SECOND• MFLOPS/GFLOPS: MILLION/BILLION OF FLOATING POINT OPERATIONS

PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE BENCHMARKING

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5621

Page 22: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

WHICH PROGRAM TO CHOOSE?

• REAL PROGRAMS: • PORTING PROBLEM; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF

RESULTS

• KERNELS• COMPUTATIONALLY INTENSE PIECE OF REAL PROGRAM

• TOY BENCHMARKS (E.G. QUICKSORT, MATRIX MULTIPLY)

• SYNTHETIC BENCHMARKS (NOT REAL)

• BENCHMARK SUITES• SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION

• SCIENTIFIC/ENGINEEING/GENERAL PURPOSE• INTEGER AND FLOATING POINT• NEW SET EVERY SO MANY YEARS (95,98,2000,2006)

• TPC BENCHMARKS: • FOR COMMERCIAL SYSTEMS• TPC-B, TPC-C, TPC-H, AND TPC-W

• EMBEDDED BENCHMARKS• MEDIA BENCHMARKS

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5622

Page 23: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

LET Ti BE THE EXECUTION TIME OF PROGRAM i (out of N progams):1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

OR

THE PROBLEM HERE IS THAT THE PROGRAMS WITH LONGEST EXECUTION TIMES DOMINATE THE RESULT

2. DEALING WITH SPEEDUPS• SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE

MACHINE FOR A PROGRAM i (let TR,i be the execution time on the reference machine)

• ARITHMETIC MEAN OF SPEEDUPS

• HARMONIC MEAN

T i Ni

T i W ii

SiTR iTi

-----------=

= 1= ∑ 1

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5623

Page 24: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

• GEOMETRIC MEANS OF SPEEDUPS

- MEAN SPEEDUP COMPARIONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE

- EASILY COMPOSABLE- USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5624

=

Page 25: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Example1 – Quantative comparison depends on the reference machine

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 56

Program A Program B Arithmetic Mean Speedup (ref 1) Speedup (ref 2)Machine 1 10 sec 100 sec 55 sec 91.8 10Machine 2 1 sec 200 sec 100.5 sec 50.2 5.5Reference 1 100 sec 10000 sec 5050 secReference 2 100 sec 1000 sec 550 sec

25

Page 26: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Example 2 – contrasting results with Arithmetic and harmonic mean

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5626

Program A Program BMachine 1 10 sec 100 secMachine 2 1 sec 200 secReference 1 100 sec 10000 secReference 2 100 sec 1000 sec

Program A Program B Arithmetic Harmonic GeometricWrt Reference 1

Machine 1 10 100 55 18.2 31.6Machine 2 100 50 75 66.7 70.7

Wrt Reference 2

Machine 1 10 10 10 10 10Machine 2 100 5 52.5 9.5 22.4

In terms of speedup:

GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME !!

Page 27: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

FUNDAMENTAL PERFORMANCE EQUATIONS FOR CPUs(also known as “IRON LAW”)

Texe = IC X CPI X Tc

• IC: DEPENDS ON PROGRAM, COMPILER AND ISA• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND

IMPLEMENTATION• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND

TECHNOLOGY

CPI (CLOCK PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME

• WHEN PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK USE IPC (INSTRUCTIONS PER CLOCK)

Texe = (IC X Tc)/IPC

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5627

Page 28: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

AMDAHL’S LAW

• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S

1-F F

Apply enhancement

1-F F/S

without E

with E

Texe withE Texe withoutE X 1 F– FS--+=

Speedup E Texe withoutE Texe withE -------------------------- 1

1 F– FS--+

---------------= =

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5628

Page 29: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

LESSONS FROM AMDAHL’S LAW

1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED

• LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP• The difference between SPEEDUPk+1 and SPEEDUPk is smaller and smaller as S goes from k to k+1

2) OPTIMIZE THE COMMON CASE• EXECUTE THE RARE CASE IN SOFTWARE (E.G. EXCEPTIONS)

F=0.5

SPEEDUP E 11 F–-------<

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5629

Amdhal’s maximum

Amdhal’s Law

Remaining Speedup

Marginal Speedup

Page 30: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

PARALLEL SPEEDUP

• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??

OVERALL NOT VERY HOPEFUL

F=0.95

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5630

= = 11 − + / = + 1 − < 11 −“mortar shot”

Amdhal’s Law

Amdhal’s maximum

Ideal speedup

Page 31: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

GUSTAFSON’S LAW

• REDEFINE SPEEDUP• THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON

CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING• STARTS WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P

PROCESSORS:

• s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE

• EXECUTION TIME ON ONE PROCESSOR IS

• Let F=p/(s+p). Then SP = (s+pP)/(s+p) = (s+p–p+pP)/(s+p)=1-F+FP = 1+F(P-1)

TP s p+=

T1 s pP+=

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5631

Gustafson observes that even if the single algorithm/program completes faster only if the parallel portion is dominant (Amdhal), the same algorithm will complete more and more faster as we add processors (P) compared to a purely sequential execution that just repeats the parallel portion (p) for P times.

F

Sp

Page 32: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Course Structure

1. High Performance Pipelining 2. Branch Prediction3. Superscalar processor4. Media Processing: VLIW processors5. Multiprocessors and related problems6. TLP: Thread Level Parallelism7. Evaluation of High Performance Architectures8. Tools for Parallel programming machines

(Cilk, OpenMP, MPI, CUDA, ...)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5632

Page 33: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

PIPELINING

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5633

Page 34: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Pipelining

• Pipelining principles• Simple Pipeline• Structural Hazards• Data Hazards• Control Hazards

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5634

Page 35: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Pipelining principles

• Let T be the time to execute an instruction• Without pipelining

• Latency = T• Throughput seq = 1 / T

• With an ideal n-stage pipeline• Latency = T• Throughput pipe = n / T

• Speedup = Throughput pipe /Throughput seq = n

T

1 2 n. . .

1 2 n. . .

1 n. . .1 2 n. . .

The (ideal) speedup obtainable from an ideal pipeline is equal to n

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5635

• Consider instructions composed of n phases of equal duration

2

Page 36: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Implementation of a Simple Pipeline• Simple 5-stage pipeline

• F -- Instruction Fetch• D -- Instruction Decode + Operand Fetch• X -- Execution and Effective Address • M -- Memory Access• W – Write-back Results

latch

clock

latchlatchlatchlatchlatch

F D X M W

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5636

Page 37: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

5-STAGE PIPELINE

INSTRUCTIONS GO THROUGH EVERY STAGE IN PROCESS ORDER, EVEN IF THEY DON’T USE THE STAGE• NOTE: CONTROL IMPLEMENTATION

• INSTRUCTION CARRIES CONTROL• THIS IS A GENERAL APPROACH: “INSTRUCTION CARRIES ITS BAGGAGE”

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5637

Page 38: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Notation

5-stage pipeline1 2 3 4 5 6 7 8 9

i F D X M Wi+1 F D X M Wi+2 F D X M Wi+3 F D X M Wi+4 F D X M W

accessexecute backwrite

M WXF Dmemory

inst. fetchinst.decode

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5638

Page 39: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Pipeline Hazards• Conditions that lead to a malfunction if certain countermeasures are not taken

1) Structural Hazards• Two instructions want to use the same hardware resource in the same

cycle (conflict over resources, e.g. Instruction Mem. and Data Mem.)2) Data Hazards

• Two instructions use the same data: must happen in the order defined by the programmer, even if the execution overlaps parts of the instruction execution (see RAW, WAW, WAR)

3) Control Hazards• An instruction (branch, jump, call) can irrevocably determine which

instructions are executed next, because the pipeline has already taken instructions from the initial branch even if there is a jump

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5639

Page 40: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

1) Structural Hazards• Two instructions want to use the same hardware resource in the same cycle Example:• A load / store uses the same memory location that is used by the

instruction fetch

i F D X M W <-- load instructioni+1 F D X M W i+2 F D X M W i+3 * F D X M W <-- i-fetch stallsi+4 F D X M . . .

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5640

Page 41: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Resolving structural hazards• Stall one of the involved instructions

+ Cost-effective and simple- Reduces the performance- Used for some rare events

• Pipelining the resource• Useful if possible (e.g.,

for resources that require more cycles)+ Good performance- In some cases too complex to do

(e.g. RAM)• Replicate the resource

+ Good performance- Costly- Probably introduces delays- Used for cheap resources (or indivisible) De-mux Mux

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5641

Page 42: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Guidelines to reduce the structural hazards• The structural hazards can be avoided if each instruction uses the resource:• At most once:

- (e.g., Separated Instruction memory and Data memory)• In the same pipeline cycle

- (e.g. I-Fetch in stage F, R/W in stage M)• For a single cycle

- (e.g. HIT in the data or instruction cache)

• Many RISC processor ISAs were designed with this in mind• Example of problematic situation:

• MISS in cache: pipeline stalls

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5642

Page 43: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

2) Data Hazards• Two instructions use the same data: this must happen in the order indicated by the programmer, even if the execution overlaps parts of the instruction execution. Example :

R1 <- R2 + R3R2 <- R1 - R7R1 <- R5 OR R6

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5643

Page 44: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Reading r1

Writing r1

Data Hazards -- examples

i add r1, r2, r3 F D X M Wi+1 sub r2, r1, r7 F D X M W

r1 ?? Read-After-Write (RAW) Hazard

r1 ?? Write-After-Read (WAR) Hazard

i+1 sub r2, r1, r7 F * * * * * D X M Wi+2 or r1, r5, r6 F D X M W

Writing r1

Reading r1

Note: PURELY HYPOTHETICAL SITUATION it can not happen in this pipeline, by construction

i add r1, r2, r3 F * * * D X M Wi+1 sub r2, r1, r7 F D X M W i+2 or r1, r5, r6 F D X M W

Writing r1

Writing r1r1 ?? Write-After-Write (WAW) Hazard

Note: PURELY HYPOTHETICAL SITUATION it can not happen in this pipeline, by construction

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5644

Page 45: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Dependency and HazardDependency: situation in the code that can potentially create hazards

• Read-After-Write (RAW, true-dependence)• There is a real “data exchange" from an instruction to another

• Write-After-Read (WAR, anti-dependence)• An artificial dependence that comes from a bad assignment of registers

• Write-After-Write (WAW, output-dependence)• An artificial dependence that comes from a bad assignment of registers

• Read-After-Read (RAR)• Will not cause problems

HAZARDS• The dependencies can be translated into hazards depending on the hardware

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5645

Page 46: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

True Dependence and MIPS Data Hazards• Read After Write (RAW) The instruction J tries to read an operand before the instruction I writes it

• Caused by a “dependence” (in the terminology of the theory of compilers) called “true dependence"• The hazard results from a real need of data communication

• In the MIPS processor True Dependence normally generates a hazard

I: add r1,r2,r3J: sub r4,r1,r3

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5646

Page 47: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Anti-Dependence e MIPS Data Hazards• Write After Read (WAR) The instruction J tries to write an operand before the instruction I reads it

• Also called "anti-dependence" (in the terminology of the theory of compilers)

• It results from having reused the name "r1", while I could easily use another register (*)

• Does not conflict in case of a 5-stage pipeline MIPS because:• All instructions take 5 stages, and• the reads from registers occur in stage 2 (D) and• all writes are always in stage 5 (W)

I: sub r4,r1,r3 J: add r1,r2,r3

(*) This, however, is not always possible to sw, v. Lesson-2Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5647

Page 48: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Output Dependence e MIPS Data Hazards• Write After Write (WAW) The instruction J tries to write an operand before the instruction I writes it

• Also called "output dependence" (in the terminology of the theory of compilers)

• Also in this case, it results from having reused the name "r1", while I could easily use another register (*)

• Does not conflict in case of a 5-stage pipeline MIPS because:• All instructions take 5 stages, and• all writes are always in stage 5 (W)

I: mul r1,r4,r3 J: add r1,r2,r3

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5648 (*) This, however, is not always possible to sw, v. Lesson2

Page 49: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Simple resolution of the RAW hazard

• The hardware detects the RAW hazard, and then...• Generates a stall to allow the "producer" instruction to finish

F D X M W R1<-R2+R3 F D X M WR2<-R1-R7 F * * D X M W

+ Cost-effective and simple- Reduces the performance

NOTE: It is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5649

Page 50: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Implementation: stall control network• Add latches to remember the RS1/RS2/RD register identifiers at each stage

• The stall is detected by making the following comparisonif (RS1(D)==RD(X) || RS1(D) == RD(M)) then STALL (generates stall in F)• Similarly for RS2

unitExecution

fileRegister

D-CacheA

B

RS1

RD RD

StallControl

stall

Stage D Stage X Stage M

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5650

Page 51: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Inserting Stalls (detail)• Related to the instruction that creates the stall, it is necessary:

• On the previous stages: block all "inter-stage latch"• On the next stages:

- Turn off the valid bit associated with inter-stage latch, so that the "bubble" in the pipeline can continue to proceed without creating problems

Previous stageStage of the instructionwhich STALLS Next Stage Next Stage

V VValid Bit = 0 Valid Bit = 1To Flip-Flop HOLD signal

STALL SIGNAL

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5651

Page 52: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Reduction of the RAW stalls• Bypass/Forward/Short-Circuit network

• Idea: use the data before it is written in the registers+ Reduces (potentially avoid) stalls- Additional Complexity

bypasses

ME WBEXIF ID

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5652

Page 53: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Bypass• Additional Hardware

• Multiplexer to select the input value to the ALU:or from Registers or from Bypass network

• Hazard detection logic (called interlock) that controls these multiplexers

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5653

bypass control

Unit(ALU)

Execution

operandlatches

bypass control

bypass

MUX

MUX

fileRegister

Resultlatch

Page 54: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Network to detect the possibility of Bypass• Add latches to remember RS1/RS2/RD register names at each stage (similar to the network for hazard detection)

• E.g. on the input A of the ALU, the network will act like this:if RS1(D)==RD(X) then select ALU-OUT(X)else if RS1(D)==RD(M) then D-CACHE-OUT(M)else select (A)

…similarlyon B input…

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5654

Unit(ALU)

ExecutionMUX

MUX

fileRegister

Resultlatch

D-Cache

RS1

RD RD

BypassControl

A

B

ALU-OUT(X)

D-CACHE-OUT(M)

Page 55: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Interaction between control networks Stall/Bypass• The stall logic is aware of the presence of the bypass logic• The bypass logic is activated independently at each stall condition

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5655

unitExecution

MUX

MUX

fileRegister

D-Cache

RS1

RD RD

BypassControl

A

B

StallControl

Page 56: High Performance Computer Architecturegiorgi/didattica/lezioni/lezioni220/c220... · • Principles of high-performance microprocessors (superscalar, VLIW) • An understanding of

Pipeline Scheduling• Scheduling of instructions at compile-time (Reorder

instructions to reduce stalls caused by instruction load):

BEFORE:a= b + c; R1 <- mem(b)

R2 <- mem(c)stall

R3 <- R1 + R2mem(a) <- R3

d = e - f; R4 <- mem(e)R5 <- mem(f)

stallR6 <- R4 - R5mem(d) <- R6

AFTER:R1 <- mem(b)R2 <- mem(c)R4 <- mem(e)R3 <- R1 + R2R5 <- mem(f)mem(a) <- R3R6 <- R4 - R5mem(d) <- R6

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL di 5656