
Page 1: On-chip Parallelism

Alvin R. Lebeck
CPS 221
Week 13, Lecture 2
© Alvin R. Lebeck 1999

Page 2: Administrivia

• Today: simultaneous multithreading, MP on a chip

• Project presentations (10-15 minutes)

• Midterm II, Wed April 29, in class

• Project write-up due Friday May 1, noon
  – approximately 8 pages

Page 3: Review: Software Coherence Protocols

Requires

• Access Control

• Messaging System
  – small control messages
  – large bulk transfers

• Programmable Processor
  – support for protocol operations

Questions

• Kernel-based vs. User-Level?

• Integration of processor with other requirements?

Page 4: Review: Typhoon

• Fully Integrated (processor, access control, NI)

[Block diagram: processors (P) with caches ($) and memory (Mem) on each node, plus an integrated unit containing the RTLB and network interface (NI).]

Page 5: Software Fine-Grain Access Control

• Low cost, can run on network of workstations

• Flexibility of Software protocol processing

• Like SW Dirty Bits, but more general

• For each load/store, check access bits (sketched below)
  – if the access faults, invoke the fault handler

• Lookup Options
  – table lookup (Blizzard-S)
  – magic cookie (Shasta, Blizzard-COW)

• Instrumentation Options
  – compiler
  – executable editing
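To make the check concrete, here is a minimal, self-contained C sketch (my own illustration with assumed block size, table layout, and handler names, not the actual Blizzard or Shasta code): a byte table holds each block's access state, the inserted check indexes it before a load, and a software handler runs the protocol on a fault.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative fine-grain access-control check (assumed names and sizes). */
    #define BLOCK_SHIFT 7                          /* assume 128-byte coherence blocks */
    #define NUM_BLOCKS  1024

    enum access_state { INVALID = 0, READ_ONLY, READ_WRITE };

    static uint8_t access_table[NUM_BLOCKS];       /* one state byte per block */
    static long    shared[NUM_BLOCKS * 16];        /* toy "shared" segment */

    /* Stand-in for the software protocol: fetch the block, then mark it readable. */
    static void protocol_read_fault(size_t block)
    {
        printf("read fault on block %zu: run protocol, fetch data\n", block);
        access_table[block] = READ_ONLY;
    }

    /* The check a compiler or executable editor would insert before each load. */
    static long checked_load(long *addr)
    {
        size_t block = ((uintptr_t)addr - (uintptr_t)shared) >> BLOCK_SHIFT;
        if (access_table[block] == INVALID)        /* access fault */
            protocol_read_fault(block);
        return *addr;                              /* ordinary load once access is legal */
    }

    int main(void)
    {
        printf("value = %ld\n", checked_load(&shared[200]));  /* first touch faults */
        return 0;
    }

Blizzard-S inlines this kind of table lookup via executable editing, while Shasta's magic-cookie scheme (later slides) avoids the table access on the common load path.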

Page 6: Blizzard-S

• Supports Tempest Interface

• Executable Editing (EEL)

• Fast Table Lookup
  – mask, shift, add (see the sketch below)
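The "mask, shift, add" lookup can be pictured as the short C sketch below (assumed mask and shift constants, not the actual EEL-inserted instruction sequence): mask the address down to its offset within the shared segment, shift the offset to a block number, then add the table base and load the state byte.

    #include <stdint.h>

    #define OFFSET_MASK 0x0fffffffUL   /* mask: keep the offset within the shared segment (assumed) */
    #define BLOCK_SHIFT 7              /* shift: byte offset -> block number (assumed 128-byte blocks) */

    static inline uint8_t lookup_state(uintptr_t addr, const uint8_t *table_base)
    {
        uintptr_t idx = (addr & OFFSET_MASK) >> BLOCK_SHIFT;   /* mask, then shift */
        return *(table_base + idx);                            /* add the base, load the state byte */
    }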

Page 7: Shasta

• Executable Editing (variant of ATOM)

• Magic Cookie

    ld   r1, r2[300]
    if r1 == magic_cookie
        do_out_of_line_check(x);
    add  r3, r1, r4

  – invalid lines hold a distinguished flag value, so most loads need only this inline compare; the slower out-of-line check runs only when the loaded value matches the cookie

• Incorporates several optimizations
  – code scheduling
  – batching checks (refs to the same cache line)
  – 3% overhead on uniprocessor code

• Multiple coherence granularities

• Supports Release Consistency

Page 8: Future Directions

• Simultaneous Multithreading

• Single-Chip MP

• MultiScalar Processors (Wednesday)

Page 9: Multithreaded Processors

• Exploit thread-level parallelism to improve performance
  – multiple program counters (see the context sketch below)

• Thread
  – independent programs (multiprogramming)
  – threads from the same program
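A hardware thread context amounts to little more than a private program counter plus architectural state; the C sketch below (my own illustration, with an assumed register and thread count) shows what a multithreaded core keeps resident per thread.

    #include <stdint.h>

    /* Per-thread hardware context: each resident thread has its own PC and
     * architectural registers (32 assumed here). */
    struct hw_context {
        uint64_t pc;          /* private program counter */
        uint64_t regs[32];    /* private architectural registers */
        uint8_t  runnable;    /* ready to issue this cycle? */
    };

    /* A multithreaded core holds several contexts and selects among them each
     * cycle, rather than one program's state only (8 threads assumed, HEP-like). */
    struct mt_core {
        struct hw_context ctx[8];
    };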

Page 10: Denelcor HEP

• General purpose scientific computer

• Organized as an MP
  – up to 16 processors
  – each processor multithreaded
  – up to 128 memory modules
  – up to 4 I/O cache modules
  – three-input switches and chaotic routing

Page 11: HEP Processor Organization

• Multiple contexts (threads)
  – each has its own Program Status Word (PSW)

• PSWs circulate in a control loop
  – control and data loops pipelined 8 deep
  – a PSW in the control loop can circulate no faster than its data in the data loop
  – the PSW at the queue head fetches and starts execution of its next instruction

• Clock period: 100 ns
  – 8 PSWs in the control loop => 10 MIPS (worked out below)
  – each thread gets 1/8 of the processor
  – maximum performance per thread => 1.25 MIPS
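The performance numbers follow directly from the clock period and the loop depth (my restatement of the slide's arithmetic):

    $\frac{1\ \text{instruction}}{100\ \text{ns}} = 10\ \text{MIPS peak}, \qquad \frac{10\ \text{MIPS}}{8\ \text{PSWs}} = 1.25\ \text{MIPS per thread}$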

(And they tried to sell it as a supercomputer.)

Page 12: Simultaneous Multithreading

• Goal: use hardware resources more efficiently
  – especially for superscalar processors

• Assume a 4-issue superscalar

[Figure: issue slots per cycle for a single thread; slots left empty within an issuing cycle are horizontal waste, and cycles in which nothing issues are vertical waste.]

Page 13: Operation of Simultaneous Multithreading

• Standard multithreading can reduce vertical waste

• Issue from multiple threads in the same clock cycle

• Eliminates both horizontal and vertical waste (a toy model follows below)

[Figure: side-by-side issue-slot diagrams of Standard Multithreading and Simultaneous Multithreading, showing which thread's instructions fill the slots in each cycle.]
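Here is a small, self-contained C model of the three issue policies (a toy of my own construction, not from the lecture or the SMT papers): each thread offers some number of ready instructions per cycle, and the policies fill a 4-wide issue window differently, so the slot counts show where horizontal and vertical waste go.

    #include <stdio.h>

    /* Toy issue-slot model (illustrative only). ready[t][c] is how many
     * instructions thread t could issue in cycle c; 0 models a stall. */
    #define WIDTH    4
    #define THREADS  2
    #define CYCLES   6

    static const int ready[THREADS][CYCLES] = {
        {3, 0, 1, 4, 0, 2},   /* thread 0 */
        {2, 3, 0, 1, 4, 1},   /* thread 1 */
    };

    static int min(int a, int b) { return a < b ? a : b; }

    int main(void)
    {
        int ss = 0, mt = 0, smt = 0;

        for (int c = 0; c < CYCLES; c++) {
            /* Superscalar: one thread only, so stalls leave whole cycles empty. */
            ss += min(ready[0][c], WIDTH);

            /* Standard multithreading: one thread owns the whole width each cycle,
             * but the core switches threads to cover stalls, trimming vertical
             * waste while horizontal waste remains. */
            int issued = 0;
            for (int i = 0; i < THREADS && issued == 0; i++)
                issued = min(ready[(c + i) % THREADS][c], WIDTH);
            mt += issued;

            /* SMT: fill the same WIDTH slots from all threads in one cycle,
             * attacking horizontal and vertical waste together. */
            int slots = WIDTH;
            for (int t = 0; t < THREADS && slots > 0; t++) {
                int take = min(ready[t][c], slots);
                smt += take;
                slots -= take;
            }
        }

        printf("issue slots used out of %d:\n", WIDTH * CYCLES);
        printf("  4-issue superscalar, 1 thread: %d\n", ss);
        printf("  standard multithreading:       %d\n", mt);
        printf("  simultaneous multithreading:   %d\n", smt);
        return 0;
    }

With this particular ready pattern the superscalar fills 10 of 24 slots, standard multithreading 13, and SMT 19; the exact numbers are meaningless, but the ordering is the point of the slide.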

Page 14: Limitations of SuperScalar Architectures

Instruction Fetch
  – branch prediction
  – alignment of the fetched instruction packet

Dynamic Instruction Issue

• Need to identify ready instructions

• Rename Table
  – no compares
  – large number of ports (operands x width)

• Reorder Buffer
  – n x Q x O x W 1-bit comparators (source and destination tags); see the note below
  – issue-queue complexity grows quadratically with issue width
  – PA-8000: 20% of die area for the issue queue (56-instruction window)
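One way to unpack the n x Q x O x W shorthand (my reading of it, not spelled out on the slide): each of the W result tags broadcast per cycle must be compared against the O source-operand tags of every one of the Q queue entries, and each tag is n bits wide, so the wakeup logic needs roughly

    $n \times Q \times O \times W \ \text{1-bit comparators}$

Since the useful window size Q tends to be scaled up along with the issue width W, the comparator count, and with it the issue-queue area and delay, grows roughly quadratically in W.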

Page 15: SuperScalar Limitations (Continued)

Instruction Execute

• Register File
  – more rename registers
  – more access ports
  – complexity quadratic with issue width

• Bypass logic
  – complexity quadratic with issue width (see the note below the list)
  – wire delays

• Functional Units
  – replicate
  – add ports to the data cache (complexity adds to access time)
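A similar back-of-the-envelope count explains the quadratic bypass claim (again my gloss, not from the slide): with W functional units each producing one result per cycle, and each consuming O source operands, full forwarding needs on the order of

    $O \times W^{2} \ \text{bypass paths per forwarding stage}$

i.e., about 2W^2 for two-source instructions, and each of those paths is a long wire across the datapath, which is where the wire-delay concern comes from.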

Page 16: Why Single Chip MP?

• Technology Push
  – benefits of wide issue are limited
  – decentralized microarchitecture: easier to build several simple, fast processors than one complex processor

• Application Pull
  – applications exhibit parallelism at different grains
  – < 10 instructions per cycle (integer codes)
  – > 40 instructions per cycle (FP loops)

Page 17: A 6-Way SuperScalar Processor

[Die floorplan, 21 mm x 21 mm: Integer Unit, Floating Point Unit, Instruction Fetch, Instruction Decode & Rename, Reorder Buffer / Instruction Queues / Out-of-Order Logic, I-Cache (32 KB), D-Cache (32 KB), TLB, L2 Cache (256 KB), External Interface, Clocking & Pads.]

Page 18: A 4 x 2 Single Chip Multiprocessor

[Die floorplan, 21 mm x 21 mm: four processors (#1-#4), each with its own I-cache and D-cache, connected through an L2 communication crossbar to the L2 Cache (256 KB), plus External Interface and Clocking & Pads.]

Page 19: Performance Comparison

[Bar chart comparing the 6-way SS and 4x2 MP designs on compress, eqntott, m88ksim, MPsim, applu, apsi, swim, tomcatv, and pmake; y-axis runs from 0 to 4.]

Page 20: Summary of Performance

• 4 x 2 MP works well for coarse-grain apps
  – How well would a message-passing architecture do?
  – Can SUIF handle pointer-intensive codes?

• For “tough” codes the 6-way does slightly better, but neither is > 60% better than a 2-issue processor