
High Performance Computer Architecture

Processor (date)                  Core / chip (process)                Die area / transistors
Intel-Pentium-4 (11/2000)         Willamette Core (0.18 µm)            217 mm² / 42 Mtr
Intel-Pentium-4 (01/2002)         Northwood Core (0.13 µm)             145 mm² / 55 Mtr
Intel-Pentium-M (05/2004)         Dothan Core (0.09 µm)                84 mm² / 140 Mtr
Intel-Core2-Duo (07/2006)         Conroe Core (0.065 µm)               143 mm² / 291 Mtr
Intel-Core2-Duo (01/2008)         Penryn Core (0.045 µm)               107 mm² / 410 Mtr
Intel-Core-i7 (11/2008)           Bloomfield Core (0.045 µm)           263 mm² / 731 Mtr
NVIDIA GF100 (09/2009)            Fermi 512G (0.040 µm)                467 mm² / 3000 Mtr
Intel-Core-i7-2920XM (01/2011)    Sandy Bridge (0.032 µm)              216 mm² / 995 Mtr
AMD A8-3850 (06/2011)             Llano 4C/400G (0.032 µm)             228 mm² / 1450 Mtr
IBM (08/2011)                     BlueGene/Q – 18 cores (0.045 µm)     360 mm² / 1470 Mtr

Mtr = millions of transistors. Scale bar of the die photos: 10 mm.

Roberto Giorgi, Universita' degli Studi di Siena, C216LEZ01-SL 1 of 53

Processor (date)                     Core / chip (process)                           Die area / transistors
IBM Power4 (12/2001)                 First Commercial Dual-Core Chip (0.18 µm)       412 mm² / 174 Mtr
Intel-Core-i7-3770 (04/2012)         Ivy Bridge (0.022 µm) tri-gate                  160 mm² / 1400 Mtr
Apple-iPad Air2 (11/2014)            A8X 11 Cores (0.020 µm)                         128 mm² / 3000 Mtr
IBM Power8 (12/2014)                 12 Cores – 8SMT (0.022 µm)                      362 mm² / 2100 Mtr
Xeon E5-2600 v3 (06/2015, ~7000$)    Haswell-EP – 18 Cores – 2SMT (0.022 µm)         661 mm² / 5560 Mtr
Xeon Phi (06/2015) (estimated)       Knights Landing – 72 Cores – 4SMT (0.014 µm)    413 mm² / 7100 Mtr
AMD Radeon R9 Fury-X (06/2015)       Fiji XT – 2048 GPU-Cores (0.028 µm)             567 mm² / 8900 Mtr

Scale bar of the die photos: 20 mm.

Where are High Performance Computers?

• Among users
  • Where you need to make this happen: “I have a limited battery and need to… take a picture, share it with my friends, ...”
• In the Internet Infrastructure
  • Where you need to connect anybody with anything
• In Electronic Devices
  • Every electronic device has a computer inside
  • Cars may have as many as 50+ computers (California approved a bill for autonomous vehicles)
• In the Datacenters
  • Where you need to store and retrieve YOUR data

Computer Architects

• Computer Architects UNDERSTAND and CAN BUILD the Computing Infrastructure… and almost ALL details of it! :-)

[Figures: AMD Opteron 6272; HP ProLiant DL585 G7; AMD Opteron 6200 chip, architecture, core (“Bulldozer”) and characteristics]

Objectives of this course

• This course constitutes a deeper study of current computers and aims to provide:

• Principles of high-performance microprocessors (superscalar, VLIW)

• An understanding of the basic mechanisms for the programming of applications that take advantage of the parallelism made available by the system

• Principles of Multi-Core / Multi-Processor Systems

• Tools for programming Parallel Machines


Course Administration

• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office-hours: Monday 16:30/19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2
• Adopted Textbook:
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN: 978-0-521-88675-8
• Other Reference Textbooks:
  • Hennessy and Patterson, "Computer Architecture: A Quantitative Approach", 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hw/Sw Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1558603433
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, Inc., 1995, ISBN 0867202041

Rules for exams, dates, slides, tools

• Check out the course website: http://www.dii.unisi.it/~giorgi/teaching/hpca2

Computer Architecture

“The term ARCHITECTURE is used here to describe the set of attributes of a system, as this appears to the programmer*, i.e., its conceptual structure and its operation, with a distinctive organization of the networks that manage the flow of data and control networks, as compared to the logical design and physical implementation”

-- Gene Amdahl, IBM Journal of R&D, Apr. 1964

*programmer == system programmer (OS) engineer or the compiler

Architecture: an overloaded term

• In the strict sense: the Hardware / Software interface
  • Instruction set
  • Memory management and protection
  • Interrupts and exceptions (traps)
  • Data formats (for example, IEEE 754 floating point)
• Organization: also called "Microarchitecture"
  • In this sense, it is "the implementation" of the architecture (this is the part that Gene Amdahl had excluded)
  • Specifies the functional units and their connections
  • Configuration of the pipeline
  • Position and configuration of cache memory
• As a discipline, "Computer Architecture" also includes the microarchitecture
  • To avoid confusion, when referring to the HW / SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006

Levels of Computer Architecture

[Figure: layered diagram. Software: Application Programs, Libraries, Operating System (Drivers, Memory Manager, Scheduler). Hardware: Execution Hardware, Memory Translation, System Interconnect (bus), Controllers, I/O devices and Networking, Main Memory. The ISA sits at the boundary between Software and Hardware; the numbers 1 to 14 mark the interfaces listed below.]

Interfaces:
• 1: User Interface
• 2: API
• 3, 7: ABI
• 4, 5, 6: internal interface of the Operating System
• 7, 8: ISA
• 9: Memory architecture
• 10: I/O architecture
• 11, 12: RTL architecture
• 13, 14: Bus architecture

API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level

What technology provides: Moore's Law

• “The number of TRANSISTORS doubles every 18 months” (later revised to "24 months"). This is due to:
  - higher density (transistors / area)
  - availability of bigger chips

[Plot: transistor count (Mtr) over time, DATA FROM SLIDE 1]

Moore's Law is purely PSYCHOLOGICAL!
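
As a rough check of the doubling period, the sketch below takes two data points from the die gallery at the start of these slides (Willamette, 42 Mtr, 11/2000 and Haswell-EP, 5560 Mtr, 06/2015) and assumes steady exponential growth between them. The choice of these two chips and the function name `doubling_time_months` are illustrative assumptions; a desktop and a server part are not strictly comparable.

```python
import math

def doubling_time_months(tr_start, tr_end, months_elapsed):
    """Months per doubling, assuming steady exponential growth (an assumption)."""
    return months_elapsed / math.log2(tr_end / tr_start)

# (millions of transistors, (year, month)) taken from the die gallery
willamette = (42, (2000, 11))    # Intel Pentium 4
haswell_ep = (5560, (2015, 6))   # Xeon E5-2600 v3

months = (haswell_ep[1][0] - willamette[1][0]) * 12 + (haswell_ep[1][1] - willamette[1][1])
period = doubling_time_months(willamette[0], haswell_ep[0], months)
print(f"{months} months elapsed, one doubling every {period:.1f} months")
```

With these two points the estimate lands close to the revised "24 months" figure.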

What the market demands: Applications

• Application Trend
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous, "small"
    - mainframes → minis → microprocessors → handheld, embedded
  • FROM little TO big memory storage (primary and secondary)
  • FROM single-thread TO multiple-threads
  • FROM standalone TO networked (cloud computing)
  • FROM character-oriented TO multimedia (graphics and sound)
  • FROM personal data TO “BIG DATA”

Main Applications

• Numerical/Scientific
  • Computational Fluid Dynamics, Weather Prediction, ECAD
  • Long word length, floating point arithmetic
• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage
• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important, low power, cost critical
• Home Computing
  • multimedia, entertainment
  • high bandwidth data movement, graphics
  • cryptography, compression/decompression

App. Trends: Multimedia, Networked, Web-servers

• A large choice of multimedia devices with
  • Graphic displays (LCD, etc.)
  • High Definition Audio
  • Large capacity of secondary storage for images, sound, etc.
• Services via the Web and high-performance networks require
  • Many independent threads
  • Wide band communication

MICROPROCESSOR ARCHITECTURE

• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
  • 1970s – Serial CPU, 1-bit for integers
  • 1980s – 32-bit RISC with a pipeline
    - The simplicity of the ISA allows the entire processor to be integrated on a chip
  • 1990s – bigger CPUs, superscalar
    - Also for CISC
  • 2000s – Multiprocessors on a chip...

Course Structure

1. High Performance Pipelining

2. Branch Prediction

3. Superscalar processor

4. Media Processing: VLIW processors

5. Multiprocessors and related problems

6. TLP: Thread Level Parallelism

7. Evaluation of High Performance Architectures

8. Tools for programming parallel machines (Cilk, OpenMP, MPI, CUDA, ...)

EVALUATING COMPUTERS


POWER

• TOTAL POWER: DYNAMIC + STATIC (LEAKAGE)

$P_{dynamic} = \alpha C V^2 f$

$P_{static} = V I_{sub} \approx V e^{-K V_t / T}$

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATE
  • DYNAMIC POWER IS ROUGHLY PROPORTIONAL TO $f^3$ (THE SUPPLY VOLTAGE V HAS TO SCALE ROUGHLY WITH f)
  • TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER
  • TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!
• STATIC POWER
  • BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS
• POWER/ENERGY ARE CRITICAL PROBLEMS
  • POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
    - OTHERWISE TEMPERATURE GOES UP (AFFECTS PERFORMANCE, CORRECTNESS AND MAY POSSIBLY DESTROY THE CIRCUIT, SHORT TERM OR LONG TERM)
    - EFFECT ON THE SUPPLY OF POWER TO THE CHIP
  • ENERGY (DEPENDS ON POWER AND SPEED)
    - COSTLY; GLOBAL PROBLEM
    - BATTERY OPERATED DEVICES
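
The 4X versus 64X comparison above can be reproduced with a few lines. This is only a sketch: it uses normalized units (alpha = C = V = f = 1 for the baseline core) and assumes that the supply voltage has to scale linearly with the clock frequency, which is the assumption behind the cubic rule of thumb.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic power: alpha * C * V^2 * f."""
    return alpha * c * v ** 2 * f

# Baseline core in normalized units (illustrative values, not measurements).
base = dynamic_power(alpha=1.0, c=1.0, v=1.0, f=1.0)

# Option A: replicate the core 4 times at the same voltage and frequency.
replicated = 4 * base

# Option B: clock one core 4 times faster; assume V must scale with f.
overclocked = dynamic_power(alpha=1.0, c=1.0, v=4.0, f=4.0)

print(f"4 cores:  {replicated / base:.0f}x dynamic power")
print(f"4x clock: {overclocked / base:.0f}x dynamic power")
```

Both options aim at the same ideal 4X speedup, but only the replicated design keeps the power growth linear.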

RELIABILITY

• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C X V
    - IF C AND V DECREASE THEN IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • DEVICE IS STILL OPERATIONAL BUT VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES
• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER
  • DUE TO
    - TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    - INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION
• PERMANENT FAILURES
  • MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES

PERFORMANCE METRICS (MEASURE)

• METRIC #1: TIME TO COMPLETE A TASK ($T_{exe}$): EXECUTION TIME, RESPONSE TIME, LATENCY
  • “X IS N TIMES FASTER THAN Y” MEANS $T_{exe}(Y)/T_{exe}(X) = N$
  • THE MAJOR METRIC USED IN THIS COURSE
• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS
  • THE THROUGHPUT FOR X IS N TIMES HIGHER THAN Y IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (EXAMPLE OF MULTIPROCESSORS)
• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING POINT OPERATIONS PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE

BENCHMARKING

WHICH PROGRAM TO CHOOSE?

• REAL PROGRAMS:
  • PORTING PROBLEM; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS
• KERNELS
  • COMPUTATIONALLY INTENSE PIECE OF REAL PROGRAM
• TOY BENCHMARKS (E.G. QUICKSORT, MATRIX MULTIPLY)
• SYNTHETIC BENCHMARKS (NOT REAL)
• BENCHMARK SUITES
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    - SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    - INTEGER AND FLOATING POINT
    - NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS:
    - FOR COMMERCIAL SYSTEMS
    - TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

LET $T_i$ BE THE EXECUTION TIME OF PROGRAM i (out of N programs):

1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

$\sum_i T_i / N$ OR $\sum_i T_i \times W_i$

THE PROBLEM HERE IS THAT THE PROGRAMS WITH LONGEST EXECUTION TIMES DOMINATE THE RESULT

2. DEALING WITH SPEEDUPS
• SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (let $T_{R,i}$ be the execution time on the reference machine):

$S_i = \frac{T_{R,i}}{T_i}$

• ARITHMETIC MEAN OF SPEEDUPS

$\bar{S} = \frac{1}{N} \sum_i S_i$

• HARMONIC MEAN

$\bar{S} = \frac{N}{\sum_i 1/S_i}$

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

• GEOMETRIC MEAN OF SPEEDUPS

$\bar{S} = \sqrt[N]{\prod_i S_i}$

- MEAN SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
- EASILY COMPOSABLE
- USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

Execution times:

             Program A  Program B  Arithmetic Mean  Speedup (ref 1)  Speedup (ref 2)
Machine 1    10 sec     100 sec    55 sec           91.8             10
Machine 2    1 sec      200 sec    100.5 sec        50.2             5.5
Reference 1  100 sec    10000 sec  5050 sec
Reference 2  100 sec    1000 sec   550 sec

Speedups per program and their means:

                 Program A  Program B  Arithmetic  Harmonic  Geometric
Wrt Reference 1
Machine 1        10         100        55          18.2      31.6
Machine 2        100        50         75          66.7      70.7
Wrt Reference 2
Machine 1        10         10         10          10        10
Machine 2        100        5          52.5        9.5       22.4

In terms of speedup (Machine 2 over Machine 1):

                 Arithmetic  Harmonic  Geometric
Wrt Reference 1  x1.36       x3.66     x2.2
Wrt Reference 2  x5.25       x0.95     x2.2

→ GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME !!
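
The comparison of the three means can be recomputed from the execution times given for Machine 1, Machine 2 and the two reference machines. The sketch below is one possible way to do it; the function names (`arithmetic_mean`, `harmonic_mean`, `geometric_mean`, `relative_speed`) are illustrative choices, not part of the slides.

```python
from math import prod

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

def geometric_mean(xs):
    return prod(xs) ** (1 / len(xs))

# Execution times in seconds for (Program A, Program B), from the example table.
times = {"Machine 1": [10, 100], "Machine 2": [1, 200]}
references = {"Reference 1": [100, 10000], "Reference 2": [100, 1000]}

def speedups(ref_times, machine_times):
    """Per-program speedup S_i = T_R,i / T_i."""
    return [r / t for r, t in zip(ref_times, machine_times)]

def relative_speed(mean, ref_times):
    """Mean speedup of Machine 2 divided by mean speedup of Machine 1."""
    s1 = speedups(ref_times, times["Machine 1"])
    s2 = speedups(ref_times, times["Machine 2"])
    return mean(s2) / mean(s1)

for ref_name, ref_times in references.items():
    for mean in (arithmetic_mean, harmonic_mean, geometric_mean):
        print(f"{ref_name}, {mean.__name__}: Machine 2 is x{relative_speed(mean, ref_times):.2f}")
```

Only the geometric mean gives the same ratio for both reference machines; the arithmetic and harmonic means even disagree on which machine is faster.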

FUNDAMENTAL PERFORMANCE EQUATIONS FOR CPUs (also known as “IRON LAW”)

$T_{exe} = IC \times CPI \times T_c$

• IC (INSTRUCTION COUNT): DEPENDS ON PROGRAM, COMPILER AND ISA
• CPI (CLOCKS PER INSTRUCTION): DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• $T_c$ (CLOCK CYCLE TIME): DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI IS OFTEN USED INSTEAD OF EXECUTION TIME

• WHEN THE PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK, USE IPC (INSTRUCTIONS PER CLOCK)

$T_{exe} = (IC \times T_c) / IPC$
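
A small numeric sketch of the two forms of the equation follows. The workload figures (one billion instructions, a 2 GHz clock, CPI of 1.5, IPC of 2) are made-up illustrative values, not data from the course.

```python
def exec_time_cpi(ic, cpi, tc):
    """T_exe = IC x CPI x Tc."""
    return ic * cpi * tc

def exec_time_ipc(ic, ipc, tc):
    """T_exe = (IC x Tc) / IPC, the same law written with IPC = 1 / CPI."""
    return ic * tc / ipc

ic = 1_000_000_000   # instructions executed (hypothetical program)
tc = 0.5e-9          # clock cycle time in seconds (2 GHz clock)

scalar = exec_time_cpi(ic, cpi=1.5, tc=tc)       # 1.5 clocks per instruction
superscalar = exec_time_ipc(ic, ipc=2.0, tc=tc)  # 2 instructions per clock

print(f"CPI = 1.5: {scalar:.2f} s")
print(f"IPC = 2.0: {superscalar:.2f} s")
```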

AMDAHL’S LAW

• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S

[Figure: without E the execution time is split into 1-F and F; applying the enhancement gives 1-F and F/S with E]

$T_{exe}(\text{with } E) = T_{exe}(\text{without } E) \times \left[ (1-F) + \frac{F}{S} \right]$

$Speedup(E) = \frac{T_{exe}(\text{without } E)}{T_{exe}(\text{with } E)} = \frac{1}{(1-F) + \frac{F}{S}}$
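
The speedup formula translates directly into a one-line function. The name `amdahl_speedup` and the sample values of F and S are illustrative.

```python
def amdahl_speedup(f, s):
    """Speedup when a fraction f of the task is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Accelerating half of the task by 2x gives far less than 2x overall.
print(f"F=0.5, S=2:   {amdahl_speedup(0.5, 2):.3f}")
# Even a huge S on half of the task cannot reach 2x.
print(f"F=0.5, S=1e6: {amdahl_speedup(0.5, 1e6):.3f}")
```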

LESSONS FROM AMDAHL’S LAW

1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED

$Speedup(E) < \frac{1}{1-F}$

• LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP
  • The difference between $SPEEDUP_{k+1}$ and $SPEEDUP_k$ gets smaller and smaller as S goes from k to k+1

2) OPTIMIZE THE COMMON CASE
• EXECUTE THE RARE CASE IN SOFTWARE (E.G. EXCEPTIONS)

[Plot, F=0.5, with labels: Amdahl’s Law, Amdahl’s maximum, Remaining Speedup, Marginal Speedup]
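
The law of diminishing returns can be seen by tabulating the marginal speedup for F = 0.5, the value used in the plot. This is a sketch; `marginal_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

def marginal_speedup(f, k):
    """Extra speedup gained when the enhancement factor S goes from k to k+1."""
    return amdahl_speedup(f, k + 1) - amdahl_speedup(f, k)

F = 0.5
maximum = 1.0 / (1.0 - F)  # Amdahl's maximum: the bound reached only as S grows without limit

marginals = [marginal_speedup(F, k) for k in range(1, 9)]
for k, m in enumerate(marginals, start=1):
    remaining = maximum - amdahl_speedup(F, k + 1)
    print(f"S {k}->{k + 1}: marginal {m:.3f}, remaining {remaining:.3f}")
```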

PARALLEL SPEEDUP

$S_P = \frac{T_1}{T_P} = \frac{1}{1 - F + F/P} = \frac{P}{F + P(1-F)} < \frac{1}{1-F}$

[Plot, F=0.95, with labels: Ideal speedup, Amdahl’s Law, Amdahl’s maximum, “mortar shot”]

OVERALL NOT VERY HOPEFUL

• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??
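
For F = 0.95, a quick tabulation shows how far Amdahl's curve falls below the ideal speedup P as processors are added. The processor counts are arbitrary sample values.

```python
def parallel_speedup(f, p):
    """Amdahl's law with P processors: S_P = P / (F + P * (1 - F))."""
    return p / (f + p * (1.0 - f))

F = 0.95
bound = 1.0 / (1.0 - F)  # Amdahl's maximum for F = 0.95

for p in (1, 4, 16, 64, 256, 1024):
    print(f"P={p:5d}: ideal {p:5d}, Amdahl {parallel_speedup(F, p):6.2f}, bound {bound:.0f}")
```

Even with 1024 processors the speedup stays just below 20, which is the sense in which the outlook is "not very hopeful".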

GUSTAFSON’S LAW

• REDEFINE SPEEDUP
  • THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING
• STARTS WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:

$T_P = s + p$

• s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• EXECUTION TIME ON ONE PROCESSOR IS

$T_1 = s + pP$

• Let $F = p/(s+p)$. Then $S_P = (s + pP)/(s + p) = 1 - F + FP = 1 + F(P-1)$

Gustafson observes that, although a single fixed-size program gets much faster only when its parallel portion is dominant (Amdahl), the same algorithm completes faster and faster as processors (P) are added, when compared with a purely sequential execution that has to repeat the parallel portion (p) P times.
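
The derivation can be checked numerically. In this sketch s and p are arbitrary sample times (5 and 95 time units, so F = 0.95), and the two function names are illustrative.

```python
def gustafson_speedup(f, processors):
    """Scaled speedup: S_P = 1 + F * (P - 1)."""
    return 1.0 + f * (processors - 1)

def speedup_from_times(s, p, processors):
    """Same quantity computed from T_1 = s + p*P and T_P = s + p."""
    return (s + p * processors) / (s + p)

s, p = 5.0, 95.0      # serial and parallel time on the parallel machine (sample values)
F = p / (s + p)       # parallel fraction, 0.95 here

for P in (4, 16, 64):
    print(f"P={P:2d}: S_P = {gustafson_speedup(F, P):.2f}")
```

Unlike the fixed-size case, the scaled speedup keeps growing linearly with P.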


Course Structure

1. High Performance Pipelining

2. Branch Prediction

3. Superscalar processor

4. Media Processing: VLIW processors

5. Multiprocessors and related problems

6. TLP: Thread Level Parallelism

7. Evaluation of High Performance Architectures

8. Tools for programming parallel machines (Cilk, OpenMP, MPI, CUDA, ...)

PIPELINING


Pipelining

• Pipelining principles

• Simple Pipeline

• Structural Hazards

• Data Hazards

• Control Hazards


Pipelining principles

• Let T be the time to execute an instruction
• Consider instructions composed of n phases of equal duration
• Without pipelining
  • Latency = T
  • Throughput_seq = 1 / T
• With an ideal n-stage pipeline
  • Latency = T
  • Throughput_pipe = n / T
• Speedup = Throughput_pipe / Throughput_seq = n

[Figure: successive instructions, each made of phases 1, 2, ..., n with total duration T, overlapped in time]

The (ideal) speedup obtainable from an ideal pipeline is equal to n
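
The ideal speedup n is a steady-state figure. The sketch below adds one assumption that is not on the slide: it also counts the n-1 cycles needed to fill the pipeline, so that k instructions take n + k - 1 stage times. With that assumption the speedup approaches n only for long instruction streams.

```python
def sequential_time(k, T):
    """k instructions executed one after the other, each taking T."""
    return k * T

def pipelined_time(k, n, T):
    """k instructions on an ideal n-stage pipeline (stage time T/n), fill time included."""
    return (n + k - 1) * (T / n)

def pipeline_speedup(k, n):
    return sequential_time(k, 1.0) / pipelined_time(k, n, 1.0)

for k in (1, 5, 100, 10_000):
    print(f"n=5, k={k:6d}: speedup {pipeline_speedup(k, 5):.3f}")
```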

Implementation of a Simple Pipeline

• Simple 5-stage pipeline
  • F -- Instruction Fetch
  • D -- Instruction Decode + Operand Fetch
  • X -- Execution and Effective Address
  • M -- Memory Access
  • W -- Write-back Results

[Figure: stages F, D, X, M, W separated by latches, all driven by the clock]

5-STAGE PIPELINE

INSTRUCTIONS GO THROUGH EVERY STAGE IN PROCESS ORDER, EVEN IF THEY DON’T USE THE STAGE

• NOTE: CONTROL IMPLEMENTATION
  • INSTRUCTION CARRIES CONTROL
  • THIS IS A GENERAL APPROACH: “INSTRUCTION CARRIES ITS BAGGAGE”

Notation

5-stage pipeline

      1 2 3 4 5 6 7 8 9
i     F D X M W
i+1     F D X M W
i+2       F D X M W
i+3         F D X M W
i+4           F D X M W

F = inst. fetch, D = inst. decode, X = execute, M = memory access, W = write back
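
The notation is regular enough to be generated: instruction i+k enters stage F in cycle k+1 and then advances one stage per cycle. A minimal sketch, with `pipeline_rows` and `total_cycles` as illustrative helper names:

```python
STAGES = ["F", "D", "X", "M", "W"]

def pipeline_rows(num_instructions):
    """One text row per instruction, shifted one cycle to the right of the previous one."""
    rows = []
    for k in range(num_instructions):
        label = "i" if k == 0 else f"i+{k}"
        cells = "  " * k + " ".join(STAGES)
        rows.append(f"{label:<6}{cells}")
    return rows

def total_cycles(num_instructions):
    """Cycles needed to drain the pipeline when there are no stalls."""
    return len(STAGES) + num_instructions - 1

header = " " * 6 + " ".join(str(c) for c in range(1, total_cycles(5) + 1))
print(header)
print("\n".join(pipeline_rows(5)))
```

Five instructions complete in 9 cycles, which matches the cycle numbers 1 to 9 in the diagram.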

Pipeline Hazards

• Conditions that lead to a malfunction if certain countermeasures are not taken

1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle (conflict over resources, e.g. Instruction Mem. and Data Mem.)

2) Data Hazards
• Two instructions use the same data: the accesses must happen in the order defined by the programmer, even though the pipeline overlaps parts of the execution of the instructions (see RAW, WAW, WAR)

3) Control Hazards
• An instruction (branch, jump, call) determines which instructions are executed next, but the pipeline has already fetched the instructions that follow it sequentially, even if the jump is taken

1) Structural Hazards

• Two instructions want to use the same hardware resource in the same cycle

Example:
• A load / store uses the same memory that is used by the instruction fetch

i     F D X M W              <-- load instruction
i+1     F D X M W
i+2       F D X M W
i+3         * F D X M W      <-- i-fetch stalls
i+4             F D X M . . .

Resolving structural hazards

• Stall one of the involved instructions
  + Cost-effective and simple
  - Reduces the performance
  - Used for some rare events
• Pipeline the resource
  • Useful if possible (e.g., for resources that require more cycles)
  + Good performance
  - In some cases too complex to do (e.g. RAM)
• Replicate the resource
  + Good performance
  - Costly
  - Probably introduces delays
  - Used for cheap (or indivisible) resources

[Figure: replicated resource placed between a De-mux and a Mux]

Guidelines to reduce the structural hazards

• Structural hazards can be avoided if each instruction uses the resource:
  • At most once
    - (e.g., separate Instruction memory and Data memory)
  • Always in the same pipeline stage
    - (e.g. I-Fetch in stage F, R/W in stage M)
  • For a single cycle
    - (e.g. HIT in the data or instruction cache)
• Many RISC processor ISAs were designed with this in mind
• Example of a problematic situation:
  • MISS in cache: pipeline stalls

2) Data Hazards

• Two instructions use the same data: the accesses must happen in the order indicated by the programmer, even though the pipeline overlaps parts of the execution of the instructions. Example:

R1 <- R2 + R3
R2 <- R1 - R7
R1 <- R5 OR R6

Data Hazards -- examples

r1 ?? → Read-After-Write (RAW) Hazard

i    add r1, r2, r3    F D X M W              <-- Writing r1
i+1  sub r2, r1, r7      F D X M W            <-- Reading r1

r1 ?? → Write-After-Read (WAR) Hazard

i+1  sub r2, r1, r7    F * * * * * D X M W    <-- Reading r1
i+2  or  r1, r5, r6      F D X M W            <-- Writing r1

Note: PURELY HYPOTHETICAL SITUATION → it cannot happen in this pipeline, by construction

r1 ?? → Write-After-Write (WAW) Hazard

i    add r1, r2, r3    F * * * D X M W        <-- Writing r1
i+1  sub r2, r1, r7      F D X M W
i+2  or  r1, r5, r6        F D X M W          <-- Writing r1

Note: PURELY HYPOTHETICAL SITUATION → it cannot happen in this pipeline, by construction

Dependency and Hazard

Dependency: situation in the code that can potentially create hazards

• Read-After-Write (RAW, true dependence)
  • There is a real "data exchange" from one instruction to another
• Write-After-Read (WAR, anti-dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Write-After-Write (WAW, output dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Read-After-Read (RAR)
  • Will not cause problems

HAZARDS

• Dependencies may or may not turn into hazards, depending on the hardware
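
The three kinds of dependence can be detected just by comparing register names. The sketch below models an instruction as a destination register plus its source registers; the `Instr` type and the `dependences` function are hypothetical names introduced for illustration, and RAR is deliberately not reported since it causes no problems.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction

def dependences(first, second):
    """Dependences of `second` on `first`, where `first` comes earlier in program order."""
    found = []
    if first.dest in second.srcs:
        found.append("RAW")  # true dependence: second reads what first writes
    if second.dest in first.srcs:
        found.append("WAR")  # anti-dependence: second overwrites what first reads
    if first.dest == second.dest:
        found.append("WAW")  # output dependence: both write the same register
    return found

add = Instr("r1", ("r2", "r3"))    # add r1, r2, r3
sub = Instr("r2", ("r1", "r7"))    # sub r2, r1, r7
or_ = Instr("r1", ("r5", "r6"))    # or  r1, r5, r6

print(dependences(add, sub))  # sub reads r1 (RAW) and overwrites r2 (WAR)
print(dependences(sub, or_))  # or overwrites r1, which sub reads (WAR)
print(dependences(add, or_))  # both write r1 (WAW)
```

Whether each of these dependences becomes a hazard still depends on the pipeline, as noted above.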

True Dependence and MIPS Data Hazards

• Read After Write (RAW): the instruction J tries to read an operand before the instruction I writes it

I: add r1,r2,r3
J: sub r4,r1,r3

• Caused by a “dependence” (in the terminology of the theory of compilers) called "true dependence"
  • The hazard results from a real need of data communication
• In the MIPS processor, a True Dependence normally generates a hazard

Anti-Dependence and MIPS Data Hazards

• Write After Read (WAR): instruction J tries to write an operand before instruction I reads it

• Also called "anti-dependence" (in the terminology of compiler theory)

• It results from having reused the name "r1", although another register could easily have been used (*)

• Does not cause a conflict in a 5-stage MIPS pipeline because:
  • all instructions take 5 stages, and
  • the reads from registers occur in stage 2 (D), and
  • all writes occur in stage 5 (W)

I: sub r4,r1,r3
J: add r1,r2,r3

(*) However, this is not always possible in software; see Lesson 2

Page 45

Output Dependence and MIPS Data Hazards

• Write After Write (WAW): instruction J tries to write an operand before instruction I writes it

• Also called "output dependence" (in the terminology of compiler theory)

• In this case too, it results from having reused the name "r1", although another register could easily have been used (*)

• Does not cause a conflict in a 5-stage MIPS pipeline because:
  • all instructions take 5 stages, and
  • all writes occur in stage 5 (W)

I: mul r1,r4,r3
J: add r1,r2,r3

(*) However, this is not always possible in software; see Lesson 2

Page 46

Simple resolution of the RAW hazard

• The hardware detects the RAW hazard, and then...
  • Generates a stall to allow the "producer" instruction to finish

             F D X M W
R1<-R2+R3      F D X M W
R2<-R1-R7        F * * D X M W

+ Cost-effective and simple

- Reduces the performance

NOTE: It is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)
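The two stall cycles in the diagram follow from the NOTE: the consumer may be in D no earlier than the cycle in which the producer is in W. A minimal sketch of that arithmetic, assuming no bypass network and the 5-stage F D X M W pipeline (`raw_stalls` and `timeline` are hypothetical helper names):

```python
STAGES = ["F", "D", "X", "M", "W"]


def raw_stalls(distance: int) -> int:
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    Assumes the producer is fetched in cycle 1, so it is in W in cycle 5, and
    that the register file is written in the first half of a cycle and read in
    the second half, so D may coincide with the producer's W.
    """
    producer_w = len(STAGES)        # cycle in which the producer is in W
    consumer_d = 2 + distance       # cycle of the consumer's D without stalls
    return max(0, producer_w - consumer_d)


def timeline(issue_slot: int, stalls: int) -> str:
    """Text row of a pipeline diagram; '.' pads empty cycles, '*' marks stalls."""
    return " ".join(["."] * issue_slot + ["F"] + ["*"] * stalls + STAGES[1:])
```

With these assumptions a consumer that immediately follows its producer needs `raw_stalls(1)` = 2 stall cycles, as in the diagram, and a consumer three or more instructions later needs none.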


Page 47

Implementation: stall control network

• Add latches to remember the RS1/RS2/RD register identifiers at each stage

• The stall is detected by making the following comparison:
    if (RS1(D) == RD(X) || RS1(D) == RD(M)) then STALL (generates a stall in F)

• Similarly for RS2
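The comparison above, extended to RS2, can be sketched as a single predicate. The function name and the use of `None` for "this slot holds no register" are assumptions made for the example, not part of the slide.

```python
from typing import Optional


def must_stall(rs1_d: Optional[str], rs2_d: Optional[str],
               rd_x: Optional[str], rd_m: Optional[str]) -> bool:
    """Stall condition without bypass.

    rs1_d, rs2_d: source registers of the instruction in stage D.
    rd_x, rd_m:   destination registers of the instructions in stages X and M.
    None means the stage holds a bubble or the instruction has no such register.
    """
    # Destinations whose values have not yet been written back.
    pending = {rd for rd in (rd_x, rd_m) if rd is not None}
    return rs1_d in pending or rs2_d in pending
```

For the pair `add r1,r2,r3` / `sub r4,r1,r3`, `must_stall("r1", "r3", "r1", None)` is true while the `add` is in X, and `must_stall("r1", "r3", None, "r1")` is still true one cycle later while it is in M.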

[Figure: stall control network across Stage D, Stage X and Stage M. Labels: Register file (outputs A, B), Execution unit, D-Cache, RS1, RD, RD, Stall Control, stall.]


Page 48

Inserting Stalls (detail)

• Relative to the stage of the instruction that stalls, it is necessary to:
  • In the previous stages: block (hold) all inter-stage latches
  • In the next stages:
    - Turn off the valid bit associated with the inter-stage latch, so that the "bubble" can continue to proceed through the pipeline without creating problems
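A minimal model of this mechanism, assuming the 5-stage F D X M W pipeline of the previous slides. The function `step`, the dictionary `pipe`, and the instruction names are hypothetical; a `None` entry stands for a latch whose valid bit is 0 (a bubble).

```python
STAGES = ["F", "D", "X", "M", "W"]


def step(pipe, next_instr, stall_stage=None):
    """Advance the pipeline by one cycle.

    pipe maps each stage name to the instruction it holds, or None for a bubble.
    If stall_stage is given, that stage and all earlier ones hold their content,
    and a bubble is injected into the stage right after it.
    """
    hold_upto = STAGES.index(stall_stage) if stall_stage else -1
    new = {}
    for k, stage in enumerate(STAGES):
        if k <= hold_upto:
            new[stage] = pipe[stage]             # HOLD: the latch keeps its content
        elif hold_upto >= 0 and k == hold_upto + 1:
            new[stage] = None                    # valid bit = 0: a bubble enters
        elif k == 0:
            new[stage] = next_instr              # normal fetch
        else:
            new[stage] = pipe[STAGES[k - 1]]     # normal advance
    return new
```

Starting from `{"F": "i3", "D": "sub", "X": "add", "M": None, "W": None}` and calling `step(..., stall_stage="D")` twice leaves `sub` in D and `i3` in F, while `add` moves on to M and then W, followed by two bubbles.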

[Figure: Previous stage | Stage of the instruction which STALLS | Next stage | Next stage. Labels: STALL SIGNAL, "To Flip-Flop HOLD signal", V (valid bit), Valid Bit = 0, Valid Bit = 1.]


Page 49

Reduction of the RAW stalls

• Bypass/Forward/Short-Circuit network

• Idea: use the data before it is written in the registers

+ Reduces (and potentially avoids) stalls

- Additional Complexity

[Figure: pipeline stages IF, ID, EX, ME, WB with the bypasses.]


Page 50

Bypass

• Additional hardware
  • Multiplexers to select the input value to the ALU: either from the registers or from the bypass network
  • Hazard detection logic (called interlock) that controls these multiplexers


[Figure: bypass hardware. Labels: Register file, operand latches, MUX, MUX, Execution Unit (ALU), Result latch, bypass, bypass control.]

Page 51

Network to detect the possibility of Bypass

• Add latches to remember the RS1/RS2/RD register names at each stage (similar to the network for hazard detection)

• E.g. on the input A of the ALU, the network will act like this:
    if RS1(D) == RD(X) then select ALU-OUT(X)
    else if RS1(D) == RD(M) then select D-CACHE-OUT(M)
    else select (A)

…similarly on the B input…
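The selection rule for input A can be written as a small function. This is a sketch only: the function and parameter names are invented for the example, and `None` is assumed to mean "no register in this slot".

```python
from typing import Optional


def select_alu_input_a(rs1_d: Optional[str], rd_x: Optional[str], rd_m: Optional[str],
                       reg_a: int, alu_out_x: int, dcache_out_m: int) -> int:
    """Value driven onto ALU input A by the bypass multiplexer.

    The comparison with stage X comes first: if both X and M are going to
    write the same register, the instruction in X is the more recent producer.
    """
    if rs1_d is not None and rs1_d == rd_x:
        return alu_out_x        # forward the result just computed in X
    if rs1_d is not None and rs1_d == rd_m:
        return dcache_out_m     # forward the value available in M
    return reg_a                # no match: use the value read from the register file
```

The B input would use the same function with RS2 and the B value read from the register file.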


[Figure: bypass detection network. Labels: Register file (A, B), MUX, MUX, Execution Unit (ALU), Result latch, D-Cache, RS1, RD, RD, Bypass Control, ALU-OUT(X), D-CACHE-OUT(M).]

Page 52

Interaction between control networks Stall/Bypass

• The stall logic must be aware of the presence of the bypass logic

• The bypass logic operates independently, whatever the stall condition
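One way to read the first point: once a bypass network exists, the stall logic should stall only for the hazards that the bypass cannot resolve. The sketch below assumes that the only such case is a load in stage X whose result is needed by the instruction in stage D (the value is not available before the end of M). This assumption is not stated on the slide, and the function name is hypothetical.

```python
from typing import Optional


def must_stall_with_bypass(rs1_d: Optional[str], rs2_d: Optional[str],
                           rd_x: Optional[str], x_is_load: bool) -> bool:
    """Stall condition when a bypass network is present (assumed load-use case only)."""
    if not x_is_load or rd_x is None:
        return False            # ALU results are forwarded, so no stall is needed
    return rd_x in (rs1_d, rs2_d)
```

Compared with the stall network without bypass, the comparison against RD(M) disappears, since under this assumption a value in M can always be forwarded.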


[Figure: combined stall and bypass networks. Labels: Register file (A, B), MUX, MUX, Execution unit, D-Cache, RS1, RD, RD, Bypass Control, Stall Control.]

Page 53

Pipeline Scheduling

• Scheduling of instructions at compile time (reorder instructions to reduce the stalls caused by load instructions):

BEFORE:
a = b + c;    R1 <- mem(b)
              R2 <- mem(c)
              stall
              R3 <- R1 + R2
              mem(a) <- R3
d = e - f;    R4 <- mem(e)
              R5 <- mem(f)
              stall
              R6 <- R4 - R5
              mem(d) <- R6

AFTER:
              R1 <- mem(b)
              R2 <- mem(c)
              R4 <- mem(e)
              R3 <- R1 + R2
              R5 <- mem(f)
              mem(a) <- R3
              R6 <- R4 - R5
              mem(d) <- R6
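The effect of the reordering can be checked by counting stalls in both sequences. The sketch assumes a simple model in which a stall occurs only when an instruction uses the register loaded by the instruction immediately before it. The tuple encoding `(op, dest, sources)` and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count stalls, assuming one stall per load immediately followed by a user of its result."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "load" and prev_dest in cur[2]:
            stalls += 1
    return stalls


BEFORE = [
    ("load", "R1", ()), ("load", "R2", ()),
    ("add", "R3", ("R1", "R2")), ("store", None, ("R3",)),
    ("load", "R4", ()), ("load", "R5", ()),
    ("sub", "R6", ("R4", "R5")), ("store", None, ("R6",)),
]

AFTER = [
    ("load", "R1", ()), ("load", "R2", ()), ("load", "R4", ()),
    ("add", "R3", ("R1", "R2")), ("load", "R5", ()),
    ("store", None, ("R3",)), ("sub", "R6", ("R4", "R5")),
    ("store", None, ("R6",)),
]
```

Under this model `load_use_stalls(BEFORE)` gives 2, matching the two `stall` rows of the slide, and `load_use_stalls(AFTER)` gives 0.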
