TRANSCRIPT
On-chip Parallelism
Alvin R. Lebeck
CPS 221
Week 13, Lecture 2
CPS 221 © Alvin R. Lebeck 1999
Administrivia
• Today: simultaneous multithreading, MP on a chip
• Project presentations (10-15 minutes)
• Midterm II, Wed April 29, in class
• Project write-up due Friday May 1, noon
  – approximately 8 pages
Review: Software Coherence Protocols
Requires
• Access Control
• Messaging System
  – small control messages
  – large bulk transfer
• Programmable Processor
  – support for protocol operations
Questions
• Kernel-based vs. User-Level?
• Integration of processor with other requirements?
Review: Typhoon
• Fully Integrated (processor, access control, NI)
[Figure: Typhoon node organization: a processor (P) with cache ($), memory (Mem), reverse TLB (RTLB), and network interface (NI), alongside other P/$ pairs.]
Software Fine-Grain Access Control
• Low cost, can run on network of workstations
• Flexibility of Software protocol processing
• Like SW Dirty Bits, but more general
• For each load/store, check access bits
  – if access fault, invoke fault handler
• Lookup Options
  – table lookup (Blizzard-S)
  – magic cookie (Shasta, Blizzard-COW)
• Instrumentation Options
  – compiler
  – executable editing
Blizzard-S
• Supports Tempest Interface
• Executable Editing (EEL)
• Fast Table Lookup
  – mask, shift, add
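The mask-shift-add lookup can be sketched in C. The block size, region size, state encoding, and names below are illustrative assumptions, not Blizzard-S's actual parameters.

```c
#include <stdint.h>

#define BLOCK_SHIFT 7                 /* 128-byte coherence blocks (assumed) */
#define REGION_BYTES (1u << 24)       /* 16 MB shared region (assumed) */

enum { ST_INVALID, ST_READONLY, ST_READWRITE };

/* One state byte per coherence block in the shared region. */
static uint8_t state_table[REGION_BYTES >> BLOCK_SHIFT];

/* The check inserted before each load/store is just mask, shift, index:
 * isolate the region offset, shift to a block number, load the state. */
static int block_state(uintptr_t addr) {
    uintptr_t idx = (addr & (REGION_BYTES - 1)) >> BLOCK_SHIFT;
    return state_table[idx];
}

/* A store to a block that is not read-write would branch to the
 * protocol's fault handler at this point. */
static int store_faults(uintptr_t addr) {
    return block_state(addr) != ST_READWRITE;
}
```

The whole fast path is a handful of ALU operations plus one table load, which is why the inline check is cheap relative to a protocol action.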
Shasta
• Executable Editing (variant of ATOM)
• Magic Cookie
      ld   r1, r2[300]
      if r1 == magic_cookie
          do_out_of_line_check(x);
      add  r3, r1, r4
• Incorporates several optimizations
  – code scheduling
  – batching checks (refs to same cache line)
  – 3% overhead on uniprocessor code
• Multiple coherence granularity
• Supports Release Consistency
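The pseudo-assembly above can be sketched in C. The sentinel value and handler below are invented for illustration; Shasta's actual encoding differs.

```c
#include <stdint.h>

#define MAGIC_COOKIE 0xDEADBEEFu   /* hypothetical invalid-line sentinel */

static int slow_checks;            /* counts out-of-line checks taken */

/* Stand-in for Shasta's out-of-line check, which consults the real
 * per-line protocol state and may fetch the line. */
static void do_out_of_line_check(const uint32_t *addr) {
    (void)addr;
    slow_checks++;
}

/* Instrumented load: the common case costs one extra compare-and-branch.
 * Only a loaded value equal to the sentinel (an invalid line, or a rare
 * false match on legitimate data) takes the slow path. */
static uint32_t checked_load(const uint32_t *addr) {
    uint32_t v = *addr;
    if (v == MAGIC_COOKIE)
        do_out_of_line_check(addr);
    return v;
}
```

Because valid data almost never equals the sentinel, the check adds only a compare in the common case; batching lets consecutive references to the same cache line share one check.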
Future Directions
• Simultaneous Multithreading
• Single-Chip MP
• MultiScalar Processors (Wednesday)
Multithreaded Processors
• Exploit thread-level parallelism to improve performance
– Multiple Program Counters
• Thread
  – independent programs (multiprogramming)
– threads from same program
Denelcor HEP
• General purpose scientific computer
• Organized as MP
  – up to 16 processors
– each processor multithreaded
– up to 128 memory modules
– up to 4 I/O cache modules
– Three-input switches and chaotic routing
HEP Processor Organization
• Multiple contexts (threads)
  – each has its own Program Status Word (PSW)
• PSWs circulate in control loop
  – control and data loops pipelined 8 deep
  – PSW in control loop can circulate no faster than data in data loop
  – PSW at queue head fetches and starts execution of next instruction
• Clock period: 100 ns
  – 8 PSWs in control loop => 10 MIPS
  – each thread gets 1/8 of the processor
  – maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer.)
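The throughput numbers above follow from simple arithmetic, sketched below for a round-robin pipeline like HEP's.

```c
/* One instruction completes per clock, so peak rate is 1/clock.
 * With clock_ns in nanoseconds: (1e9 / clock_ns) instr/s, i.e.
 * 1000 / clock_ns MIPS. */
static double peak_mips(double clock_ns) {
    return 1000.0 / clock_ns;
}

/* Each of the PSWs interleaved in the control loop gets at most an
 * equal share of the issue slots. */
static double per_thread_mips(double clock_ns, int psws) {
    return peak_mips(clock_ns) / psws;
}
```

With a 100 ns clock and 8 PSWs this reproduces the slide's 10 MIPS peak and 1.25 MIPS per thread: each thread runs an eighth as fast as a single-threaded machine with the same clock.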
Simultaneous Multithreading
• Goal: use hardware resources more efficiently
  – especially for superscalar processors
• Assume 4-issue superscalar
[Figure: issue-slot diagram for one thread on a 4-issue machine, showing horizontal waste (unused slots within an issuing cycle) and vertical waste (entirely idle cycles).]
Operation of Simultaneous Multithreading
• Standard multithreading can reduce vertical waste
• Issue from multiple threads in the same clock cycle
• Eliminate both horizontal and vertical waste
[Figure: issue-slot diagrams contrasting standard multithreading with simultaneous multithreading.]
Limitations of SuperScalar Architectures
Instruction Fetch
• branch prediction
• alignment of packet of instructions
Dynamic Instruction Issue
• Need to identify ready instructions
• Rename Table
  – no compares
  – large number of ports (Operands x Width)
• Reorder Buffer
  – n x Q x O x W 1-bit comparators (src and dest)
  – quadratic increase in queue size with issue width
  – PA-8000: 20% of die area for issue queue (56-instruction window)
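The slide's comparator formula can be checked numerically; the parameter values below are illustrative, not the PA-8000's actual design parameters.

```c
/* Comparator count from the slide: n (tag bits) x Q (queue entries)
 * x O (operands per instruction) x W (issue width) 1-bit comparators. */
static long rob_comparators(long n, long Q, long O, long W) {
    return n * Q * O * W;
}
```

If the window Q is scaled in proportion to the issue width W, the product grows with W squared, which is the quadratic blowup the slide points at.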
SuperScalar Limitations (Continued)
Instruction Execute
• Register File
  – more rename registers
  – more access ports
  – complexity quadratic with issue width
• Bypass logic
  – complexity quadratic with issue width
  – wire delays
• Functional Units
  – replicate
  – add ports to data cache (complexity adds to access time)
Why Single Chip MP?
• Technology Push
  – benefits of wide issue are limited
  – decentralized microarchitecture: easier to build several simple, fast processors than one complex processor
• Application Pull
  – applications exhibit parallelism at different grains
  – < 10 instructions per cycle (integer codes)
  – > 40 instructions per cycle (FP loops)
A 6-Way SuperScalar Processor
[Figure: 21 mm x 21 mm die floorplan of the 6-way superscalar processor: integer unit, floating-point unit, instruction fetch, instruction decode & rename, reorder buffer / instruction queues / out-of-order logic, 32 KB I-cache, 32 KB D-cache, TLB, 256 KB L2 cache, external interface, and clocking & pads.]
A 4 x 2 Single Chip Multiprocessor
[Figure: 21 mm x 21 mm die floorplan of the 4 x 2 single-chip multiprocessor: four 2-issue processors, each with its own I-cache and D-cache, a shared 256 KB L2 cache reached through an L2 communication crossbar, external interface, and clocking & pads.]
Performance Comparison
[Figure: bar chart of relative performance (0 to 4) comparing the 6-way SS and 4x2 MP designs on compress, eqntott, m88ksim, MPsim, applu, apsi, swim, tomcatv, and pmake.]
Summary of Performance
• 4 x 2 MP works well for coarse-grain apps
  – How well would a Message Passing Architecture do?
  – Can SUIF handle pointer-intensive codes?
• For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue