
Page 1: ECE8833 Polymorphous and Many-Core Computer Architecture

ECE8833 Polymorphous and Many-Core Computer Architecture

Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering

Lecture 1 Early ILP Processors and Performance Bound Model

Page 2: ECE8833 Polymorphous and Many-Core Computer Architecture

ECE8833 H.-H. S. Lee 2009

Decoupled Access/Execute Computer Architectures

James E. Smith, ACM TOCS, 1984

(an earlier version was published at ISCA 1982)

Page 3: ECE8833 Polymorphous and Many-Core Computer Architecture

Background of DAE, circa 1982

• Written at a time when vector machines were dominant

LV v1, mem[a1]
MULV v3, v2, v1
ADDV v5, v4, v3

[Figure: time line of LV, MULV, and ADDV executed one after another, compared with the same three instructions overlapped by vector chaining (Cray-1). Each vector register holds 64 elements (0 to 63) of 64 bits, 4096 bits in total.]
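The saving from chaining in the time line can be estimated with a very small timing model. The sketch below is only an illustration under stated assumptions: each vector unit delivers one element per cycle after a startup latency, and the three startup latencies in the usage lines are hypothetical numbers, not Cray-1 figures.

```python
def unchained_cycles(startups, vector_length):
    """Dependent vector ops run back to back: each one waits until the
    previous op has written its last element."""
    return sum(startup + vector_length for startup in startups)


def chained_cycles(startups, vector_length):
    """With chaining, a dependent op starts as soon as the first element of
    its source is available, so only the startup latencies accumulate and
    the element streams overlap."""
    return sum(startups) + vector_length


# Hypothetical startup latencies (cycles) for LV, MULV, ADDV.
startups = [12, 7, 6]
no_chain = unchained_cycles(startups, 64)   # 217
with_chain = chained_cycles(startups, 64)   # 89
```

Under these assumptions the chained sequence costs little more than a single 64-element operation, which is the point of the second time line.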

Page 4: ECE8833 Polymorphous and Many-Core Computer Architecture

Background of DAE, circa 1982

• Written at a time when vector machines were dominant

LV v1, mem[a1]
MULV v3, v2, v1
ADDV v5, v4, v3

[Figure: chained datapath. Memory feeds v1; v1 and v2 feed MUL, which writes v3; v3 and v4 feed ADD, which writes v5.]

What about modern SIMD ISAs?

Page 5: ECE8833 Polymorphous and Many-Core Computer Architecture

Today’s State of the Art?

• Intel AVX

• Intel Larrabee NI

Page 6: ECE8833 Polymorphous and Many-Core Computer Architecture

DAE, circa 1982

• Fine-grained parallelism: vector vs. superscalar

• What about scalar performance?
  – Remember what Flynn’s bottleneck is?

Page 290

Page 7: ECE8833 Polymorphous and Many-Core Computer Architecture

Flynn’s Bottleneck

• ILP of 1.86
  – Programs on IBM 7090
  – Basically, he said one cannot execute more than one instruction per cycle
  – ILP exploited within basic blocks

• [Riseman & Foster ’72]
  – Breaking control dependency
  – A perfect machine model
  – Benchmarks include numerical programs, an assembler, and a compiler

Jumps passed | 0    | 1    | 2    | 8    | 32   | 128  | ∞
Average ILP  | 1.72 | 2.72 | 3.62 | 7.21 | 14.8 | 24.2 | 51.2

[Figure: control-flow graph of basic blocks BB0 to BB4]

Page 8: ECE8833 Polymorphous and Many-Core Computer Architecture

DAE, circa 1982, 1984

• Issues in CDC 6600 & IBM 360/91
  – Overlapping instructions via OoO needs complex control; the resulting slower clock offsets the benefit
  – Complex issue methods were abandoned by their manufacturers
    • Less determinism
    • Problems in HW debugging
    • Errors may not be reproducible
  – Complexity can be shifted to system software

Page 9: ECE8833 Polymorphous and Many-Core Computer Architecture

Decoupled Access/Execute Architecture

• An architecture with two instruction streams to break Flynn’s bottleneck
  – Access processor
  – eXecute processor
  – Hey, this was the 1980s

• Separate RFs (A0, A1, A2, ..., An-1 and X0, X1, X2, ..., Xm-1), which can be totally incompatible
  – Synchronization issue?

Page 10: ECE8833 Polymorphous and Many-Core Computer Architecture


DAE

Page 11: ECE8833 Polymorphous and Many-Core Computer Architecture

Data Movement

[Figure labels: Data In, Data Out, paired]

XLQ and XSQ are specified as registers at the ISA level
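The queue-based data movement can be illustrated with a toy software model. This is a minimal sketch, not the paper's hardware: XLQ and XSQ are the queues named on the slide, while `store_addr_q`, the kernel `y[i] = scale * x[i]`, and the unbounded queue depth are assumptions made for illustration.

```python
from collections import deque


def run_dae(x, scale):
    """Toy decoupled access/execute run of y[i] = scale * x[i]."""
    memory = {("x", i): value for i, value in enumerate(x)}
    xlq = deque()           # load data travelling from memory to the X side
    xsq = deque()           # store data travelling from the X side to memory
    store_addr_q = deque()  # hypothetical name: store addresses from the A side

    # Access program: issues memory requests and addresses, no data arithmetic.
    for i in range(len(x)):
        xlq.append(memory[("x", i)])   # load x[i]; the value lands in XLQ
        store_addr_q.append(("y", i))  # address of the future store to y[i]

    slip = len(xlq)  # how many loads the A side ran ahead of the X side

    # Execute program: reads XLQ and writes XSQ as if they were registers.
    while xlq:
        xsq.append(scale * xlq.popleft())

    # Memory side: pair each store address with its store data in FIFO order.
    while store_addr_q:
        memory[store_addr_q.popleft()] = xsq.popleft()

    return [memory[("y", i)] for i in range(len(x))], slip


result, slip = run_dae([1, 2, 3], 2)   # ([2, 4, 6], 3)
```

Letting the access loop finish before the execute loop starts is the extreme case of slip; real queues are finite, so the A-processor would stall once XLQ fills.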

Page 12: ECE8833 Polymorphous and Many-Core Computer Architecture


Register-to-Register Synch

Xi Aj

Page 13: ECE8833 Polymorphous and Many-Core Computer Architecture

Branch Synch-up

• One runs ahead
• One executes an unconditional jump (BFQ instruction)

Branch outcomes in XBQ can be used to reduce I-fetch from the X-processor.

Page 14: ECE8833 Polymorphous and Many-Core Computer Architecture


DAE Code Example

Page 15: ECE8833 Polymorphous and Many-Core Computer Architecture

Modern Issue Consideration

• Despite being an ’82/’84 paper, it considers

Page 16: ECE8833 Polymorphous and Many-Core Computer Architecture

Precise Exception

• Simple approach: force the instructions to complete in order
• In DAE, applied to each of the streams separately
• Example of imprecise exception issues
• Requires caution when coding A and E programs

Page 17: ECE8833 Polymorphous and Many-Core Computer Architecture


Requirement for Precise Exception

Page 18: ECE8833 Polymorphous and Many-Core Computer Architecture

Why (and How) It Works?

• Avg. speedup = 1.58 for LFK (Livermore Fortran Kernels)
• Execution on the 2 processors is somewhat balanced

• Why?
  – Works nicely as shown in LFK
  – X-processor’s computation is not as fast
    • 6-cycle FP add
    • 7-cycle FP multiply
  – A-processor takes care of
    • Memory (11-cycle load)
    • Branch resolution

Page 19: ECE8833 Polymorphous and Many-Core Computer Architecture

Disadvantages of DAE Architecture

1. Writing 2 separate programs
   • What high-level language?
   • Who should do it?

2. Certain duplication in hardware
   • Instruction memory/cache
   • Instruction fetch unit
   • Decoder

Page 20: ECE8833 Polymorphous and Many-Core Computer Architecture

Interleaving Instruction Streams

• Use a bit to tag streams
• No split branch instruction

(1) X7 is XLQ or XSQ; (2) once loaded, it is used once; (3) it must be stored after the X-processor writes to it

(A)X
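Tagging a single interleaved program with one bit per instruction can be sketched as a simple splitter. The tag encoding (0 for the A stream, 1 for the X stream) and the instruction mnemonics below are hypothetical; only the idea of a per-instruction stream bit and the three X7 rules come from the slide.

```python
A_STREAM, X_STREAM = 0, 1  # assumed encoding of the stream tag bit


def split_streams(tagged_program):
    """Route each (tag_bit, instruction) pair to its processor, keeping the
    original order within each stream."""
    a_stream, x_stream = [], []
    for tag_bit, instruction in tagged_program:
        if tag_bit == A_STREAM:
            a_stream.append(instruction)
        elif tag_bit == X_STREAM:
            x_stream.append(instruction)
        else:
            raise ValueError(f"tag bit must be 0 or 1, got {tag_bit!r}")
    return a_stream, x_stream


# Hypothetical interleaved program; X7 stands for the queue register.
program = [
    (A_STREAM, "X7 <- load (A1)"),   # load data arrives through XLQ, named X7
    (X_STREAM, "X2 <- X7 * X3"),     # the loaded value is consumed exactly once
    (X_STREAM, "X7 <- X2 + X4"),     # writing X7 queues store data in XSQ
    (A_STREAM, "store (A2) <- X7"),  # the store follows the X-processor's write
    (A_STREAM, "A1 <- A1 + 1"),
]
a_stream, x_stream = split_streams(program)
```

Because the programmer or compiler writes one program, the "two separate programs" disadvantage goes away, at the cost of a splitter stage in front of the two instruction buffers.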

Page 21: ECE8833 Polymorphous and Many-Core Computer Architecture

Summary of DAE Architecture

• 2-wide issue per cycle

• Allows a constrained type of OoO
  – Data accesses can be done well in advance (i.e., “slip” ahead)
  – Enables a certain level of data prefetching

• Was novel in 1982!

Page 22: ECE8833 Polymorphous and Many-Core Computer Architecture


The ZS-1 Central Processor

James E. Smith, et al. in ASPLOS-II, 1987

Page 23: ECE8833 Polymorphous and Many-Core Computer Architecture

Astronautics ZS-1 Central Processor

• A realization of DAE (by the same author)

• Decouples the instruction stream into
  – Fixed-point/memory operations
  – Floating-point operations

• Communicates via architectural queues

• Is extensively pipelined

• 22.5 MFLOPS, 45 MIPS

Page 24: ECE8833 Polymorphous and Many-Core Computer Architecture

ZS-1 Central Processor

[Figure: ZS-1 central processor block diagram. Labels: communicate with memory; hold 24 insts; hold 4 insts]

31 A (and X) registers + 1 queue entry = 5-bit encoded operands

Page 25: ECE8833 Polymorphous and Many-Core Computer Architecture

ZS-1 Central Processor

+ An instruction cannot be issued until its dependencies are resolved

+ A load may bypass independent stores

+ Load-load and store-store order is maintained

Page 26: ECE8833 Polymorphous and Many-Core Computer Architecture

Can Load Bypass Load?

• Why not?

Core 1:
(1) Load R1, (A)
(2) Load R2, (A)

Core 2:
(3) Store (A), R3

Initially (A)=100, R3=25

• What’s wrong with (2)(3)(1)?
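The problem with the order (2)(3)(1) shows up when the three operations are replayed against a single shared location. The sketch below assumes, as a reading of the slide, that (1) and (2) are the two loads on Core 1 and (3) is the store on Core 2; the function name and the dictionary-based memory are illustrative only.

```python
def run_order(order, initial_a=100, r3=25):
    """Replay the three memory operations in the given global order and
    return the values that end up in (R1, R2)."""
    memory = {"A": initial_a}
    regs = {"R3": r3}
    for step in order:
        if step == 1:        # Core 1: Load R1, (A)
            regs["R1"] = memory["A"]
        elif step == 2:      # Core 1: Load R2, (A)
            regs["R2"] = memory["A"]
        elif step == 3:      # Core 2: Store (A), R3
            memory["A"] = regs["R3"]
        else:
            raise ValueError(f"unknown step {step!r}")
    return regs["R1"], regs["R2"]


# Orders that keep load (1) before load (2):
in_order = [run_order(o) for o in [(1, 2, 3), (1, 3, 2), (3, 1, 2)]]
# -> [(100, 100), (100, 25), (25, 25)]

# Load (2) bypasses load (1):
bypassed = run_order((2, 3, 1))
# -> (25, 100)
```

With the loads kept in program order, R1 never holds a newer value than R2. In the bypassed case the older load observes the new value 25 while the younger load has already observed the old value 100, an outcome that no in-order interleaving can produce.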

Page 27: ECE8833 Polymorphous and Many-Core Computer Architecture

ZS-1: Processing of Two Iterations

S: splitter, B: inst buffer read, D: decoded, I: issued, E: execution

Page 28: ECE8833 Polymorphous and Many-Core Computer Architecture

IBM RS/6000 and POWER

• Evolved from IBM ACS and 801

• Foundation of the POWER architecture (Performance Optimization With Enhanced RISC)
  – 10 discrete chips in the early POWER1 system
  – Single-chip solutions in RSC and in a subsequent POWER2 version called P2SC

Page 29: ECE8833 Polymorphous and Many-Core Computer Architecture

POWER2 Processor Node

• 8 discrete chips on MCM
• 66.7 MHz, 6-issue (2 reserved for br/comp)
• 2 FXUs
  – Memory, INT, logical
  – 2 per cycle
• 3 dual-pipe FPUs can perform
  – 2 DP fma
  – 2 FP loads
  – 2 FP stores

[Figure: POWER2 processor node block diagram. Instruction Cache Unit: I-Cache (32KB), dispatch, dual branch processors. Fixed-Point Unit (FXU): instruction buffer, execution unit w/o mult/div, execution unit w/ mult/div. Floating-Point Unit (FPU): instruction buffer, arithmetic execution unit, store execution unit, load execution unit. Sync. Data Cache Unit (DCU): 4 separate chips (32KB each). Storage Control Unit. Memory Unit (64MB – 512MB). Optional secondary cache (1 or 2MB).]

Page 30: ECE8833 Polymorphous and Many-Core Computer Architecture

MACS Performance Bound Model

[Figure: hierarchy of bounds from M Bound through MA Bound, MAC Bound, and MACS Bound up to the physically measured actual run time, separated by GAP A, GAP C, GAP S, and GAP P]

• To analyze achievable performance (mostly FP) in scientific applications

Page 31: ECE8833 Polymorphous and Many-Core Computer Architecture

MACS Performance Bound Model

• GAP A (keeps you from attaining peak performance)
  – Excessive loads/stores (more than the essential ones, e.g., a[i] = b[i])
  – Loop bookkeeping

• GAP C (reason we may want to have 432?)
  – Hardware restrictions (architectural registers)
  – Redundant instructions
  – Load/store overhead in function calls

• GAP S
  – Weak scheduling algorithm
  – Resource conflicts preventing a tighter schedule
  – Sol: modulo scheduling to compact the code

• GAP P
  – Cache misses, inter-core communication, system effects (e.g., context switches)
  – Sol: prefetch, loop blocking, loop fusion, loop interchange, etc.
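Once the four bounds and a measured run time are known, the gaps are simple differences between neighbouring levels of the hierarchy. The sketch below assumes GAP A sits between the M and MA bounds, GAP C between MA and MAC, GAP S between MAC and MACS, and GAP P between MACS and the measurement; the numbers in the usage line are made up and expressed in cycles per flop.

```python
def macs_gaps(m, ma, mac, macs, measured):
    """Return the four gaps of the bound hierarchy.

    All arguments must use the same unit (for example cycles per flop) and
    must be non-decreasing, since each level adds constraints to the last.
    """
    levels = [m, ma, mac, macs, measured]
    if any(lower > upper for lower, upper in zip(levels, levels[1:])):
        raise ValueError("expected M <= MA <= MAC <= MACS <= measured")
    return {
        "A": ma - m,           # application workload vs. machine peak
        "C": mac - ma,         # compiler-generated instruction overhead
        "S": macs - mac,       # schedule quality
        "P": measured - macs,  # everything the static model does not see
    }


gaps = macs_gaps(0.25, 0.5, 0.75, 1.0, 1.5)  # hypothetical CPF values
```

The largest gap indicates where tuning effort is likely to pay off first, for example scheduling for GAP S or prefetching and blocking for GAP P.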

Page 32: ECE8833 Polymorphous and Many-Core Computer Architecture

POWER2 M Bound (Ideal, Ideal)

M Bound Peak = 1 fma per cycle to each of the 2 FPU pipelines = 0.25 CPF (cycles per flop)

[Figure: Floating-Point Unit (FPU) with dispatch, instruction buffer, arithmetic execution unit, store execution unit, and load execution unit]
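The 0.25 CPF figure follows from counting flops per cycle: two pipelines, one fma each, two flops per fma. A short sketch of that arithmetic, using the 66.7 MHz clock quoted for the POWER2 node; the function names are illustrative.

```python
def m_bound_cpf(fpu_pipelines=2, flops_per_fma=2):
    """Machine peak in cycles per flop when every pipeline retires one
    fused multiply-add per cycle."""
    return 1 / (fpu_pipelines * flops_per_fma)


def peak_mflops(clock_mhz, cpf):
    """Convert a cycles-per-flop bound into MFLOPS at a given clock rate."""
    return clock_mhz / cpf


cpf = m_bound_cpf()              # 0.25
peak = peak_mflops(66.7, cpf)    # about 266.8 MFLOPS
```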

Page 33: ECE8833 Polymorphous and Many-Core Computer Architecture

POWER2 MA Bound (Ideal compiler and rest)

MA Bound
1. Given the visible workload of the high-level application
2. Calculate the essential operations that must be performed

$$t_{MA} = \frac{\mathrm{MAX}(t_{fl},\ t_{fx},\ t_m,\ t_i,\ t_d)}{f_a + f_m + 2 f_{ma} + 4 f_{div} + 4 f_{sqrt}}$$

• Time bound for all FP operations
• Essential, minimum FP operations to complete the computation
• A factor of 4 for div and sqrt is a common choice to reflect their relative weight to other computations

Page 34: ECE8833 Polymorphous and Many-Core Computer Architecture

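POWER2 MA Bound (Ideal compiler and rest)

$$t_{fl} = \mathrm{MAX}\left(\frac{f_a + f_m + f_{ma} + 17 f_{div} + 27 f_{sqrt}}{2},\ \frac{l_{fl}}{2},\ \frac{s_{fl}}{2}\right)$$

$$t_i = \frac{f_a + f_m + f_{ma} + f_{div} + f_{sqrt} + l_{fl} + s_{fl}}{4}$$

$$t_{fx} = \frac{l_{fl} + s_{fl}}{2}$$

$$t_m = \mathrm{MAX}(l_{fl}\ \square\ s_{fl},\ l_{fx}\ \square\ s_{fx})$$

(□ marks an operator that is illegible in the transcript)

$$t_d = \mathrm{MAX}_{\text{recurrence cycles } r}\left(\frac{L_r}{I_r}\right)$$

L_r: total latency of the loop-carried dependency
I_r: # of iterations in recurrence r

• 2 pipelines
• Max 4 dispatches to FPU and FXU (t_i)
• Other fixed-point considered irrelevant (t_fx)
• Simplified memory model (t_m)
• Non-pipelined FP ops (the factors 17 and 27 in t_fl)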
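The MA bound terms can be evaluated per loop iteration with a few lines of code. This is a sketch of the formulas as read from the slide, under two assumptions: the operands inside t_i and t_fx are summed, and the memory term t_m is passed in by the caller because its formula is not fully legible in this transcript. The function name, parameter names, and the example loop are illustrative.

```python
def ma_bound_cpf(f_a=0, f_m=0, f_ma=0, f_div=0, f_sqrt=0,
                 l_fl=0, s_fl=0, recurrences=(), t_m=0.0):
    """MA bound in cycles per flop for one loop iteration.

    f_*: essential FP adds, multiplies, fmas, divides, square roots
    l_fl, s_fl: essential FP loads and stores
    recurrences: (L_r, I_r) pairs, latency and iteration span per recurrence
    t_m: memory time bound, supplied by the caller (assumption, see lead-in)
    """
    flops = f_a + f_m + 2 * f_ma + 4 * f_div + 4 * f_sqrt
    if flops == 0:
        raise ValueError("the loop has no essential FP operations")
    # Two FP pipelines; divide and sqrt are not pipelined (17 and 27 cycles).
    t_fl = max((f_a + f_m + f_ma + 17 * f_div + 27 * f_sqrt) / 2,
               l_fl / 2, s_fl / 2)
    # Two FXUs handle the FP loads and stores.
    t_fx = (l_fl + s_fl) / 2
    # At most 4 instructions dispatched to FPU and FXU per cycle.
    t_i = (f_a + f_m + f_ma + f_div + f_sqrt + l_fl + s_fl) / 4
    # Loop-carried dependence: latency per iteration of the worst recurrence.
    t_d = max((L_r / I_r for L_r, I_r in recurrences), default=0.0)
    return max(t_fl, t_fx, t_m, t_i, t_d) / flops


# Example loop y[i] = a * x[i] + y[i]: 1 fma, 2 FP loads, 1 FP store.
daxpy_cpf = ma_bound_cpf(f_ma=1, l_fl=2, s_fl=1)   # 0.75
```

For this loop the fixed-point term dominates (1.5 cycles for three memory operations), so the MA bound is 0.75 CPF, three times the 0.25 CPF machine peak.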

Page 35: ECE8833 Polymorphous and Many-Core Computer Architecture

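POWER2 MAC Bound

MAC Bound
Similar to computing the MA Bound, but using the actual, generated instruction counts

$$t'_{fl} = \mathrm{MAX}\left(\frac{n_{FPU}}{2},\ \frac{l'_{fl}}{2},\ \frac{s'_{fl}}{2}\right), \quad \text{where } n_{FPU} = f'_a + f'_m + f'_{ma} + 17 f'_{div} + 27 f'_{sqrt} + \text{others}$$

$$t'_{fx} = \mathrm{MAX}\left(\frac{n_{FXU}}{2},\ n_{FXMD}\right), \quad \text{where } n_{FXU} = l'_{fx} + s'_{fx} + \text{other},\ n_{FXMD} = \text{number of FXU mul and div}$$

$$t'_b = \frac{n_{BC}}{2}, \quad \text{where } n_{BC} = \text{number of branch and compare}$$

$$t'_i = \frac{\text{compiled code length} - n_{BC}}{4}$$

$$t_{MAC} = \frac{\mathrm{MAX}(t'_{fl},\ t'_{fx},\ t'_m,\ t'_b,\ t'_i,\ t'_d)}{f_a + f_m + 2 f_{ma} + 4 f_{div} + 4 f_{sqrt}}$$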
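The MAC bound uses the same structure as the MA bound but counts the instructions the compiler actually generated. The sketch below follows the slide's terms as far as they can be read; t'_m and t'_d are not spelled out on the slide, so they are passed in as inputs, and all names and example counts are hypothetical.

```python
def weighted_fpu_count(f_a=0, f_m=0, f_ma=0, f_div=0, f_sqrt=0, others=0):
    """n_FPU: generated FPU instructions, with non-pipelined divide and
    square root weighted by their 17- and 27-cycle occupancy."""
    return f_a + f_m + f_ma + 17 * f_div + 27 * f_sqrt + others


def mac_bound_cpf(essential_flops, code_length, n_fpu, l_fl, s_fl,
                  n_fxu, n_fxmd, n_bc, t_m=0.0, t_d=0.0):
    """MAC bound in cycles per essential flop for one loop iteration.

    essential_flops: weighted essential flop count (the MA denominator)
    code_length: number of compiled instructions in the loop body
    n_fxu: FXU instructions (loads + stores + other fixed-point)
    n_fxmd: FXU multiplies and divides (only one FXU executes them)
    n_bc: branch and compare instructions
    t_m, t_d: memory and dependence bounds, supplied by the caller
    """
    if essential_flops <= 0:
        raise ValueError("essential_flops must be positive")
    t_fl = max(n_fpu / 2, l_fl / 2, s_fl / 2)
    t_fx = max(n_fxu / 2, n_fxmd)
    t_b = n_bc / 2
    t_i = (code_length - n_bc) / 4   # branch/compare use the 2 reserved slots
    return max(t_fl, t_fx, t_m, t_b, t_i, t_d) / essential_flops


# Hypothetical compiled loop: 1 fma, 2 FP loads, 1 FP store,
# 1 extra fixed-point op for the index update, 1 branch.
n_fpu = weighted_fpu_count(f_ma=1)
loop_cpf = mac_bound_cpf(essential_flops=2, code_length=6, n_fpu=n_fpu,
                         l_fl=2, s_fl=1, n_fxu=4, n_fxmd=0, n_bc=1)  # 1.0
```

In this made-up example the extra fixed-point instruction raises the FXU term from 1.5 to 2 cycles, so the bound grows from 0.75 to 1.0 CPF; that difference is what GAP C measures.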

Page 36: ECE8833 Polymorphous and Many-Core Computer Architecture

POWER2 MACS Bound

MACS Bound
Similar to computing the MAC Bound, but the numerator comes from the actual compiler-scheduled code

Page 37: ECE8833 Polymorphous and Many-Core Computer Architecture

IBM SP2 Performance Bound

• Later expanded to include an inter-processor communication bound