TRANSCRIPT
On-chip Parallelism
Alvin R. Lebeck
CPS 221
Week 13, Lecture 2
CPS 221 © Alvin R. Lebeck 1999
Administrivia
• Today: simultaneous multithreading, MP on a chip
• Project presentations (10-15 minutes)
• Midterm II, Wed April 29, in class
• Project write-up due Friday May 1, noon
  – approximately 8 pages
Review: Software Coherence Protocols
Requires
• Access Control
• Messaging System
  – small control messages
  – large bulk transfer
• Programmable Processor
  – support for protocol operations
Questions
• Kernel-based vs. User-Level?
• Integration of processor with other requirements?
Review: Typhoon
• Fully Integrated (processor, access control, NI)
[Figure: Typhoon node organization: a processor (P) with cache ($), memory (Mem), reverse TLB (RTLB), and network interface (NI), alongside other P/$ pairs.]
Software Fine-Grain Access Control
• Low cost, can run on network of workstations
• Flexibility of Software protocol processing
• Like SW Dirty Bits, but more general
• For each load/store, check access bits
  – if access fault, invoke fault handler
• Lookup Options
  – table lookup (Blizzard-S)
  – magic cookie (Shasta, Blizzard-COW)
• Instrumentation Options
  – compiler
  – executable editing
Blizzard-S
• Supports Tempest Interface
• Executable Editing (EEL)
• Fast Table Lookup
  – mask, shift, add
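The mask-shift-add lookup can be sketched in C. The block size, region size, state encoding, and names below are illustrative assumptions, not Blizzard-S's actual parameters.

```c
#include <stdint.h>

#define BLOCK_SHIFT 7                 /* 128-byte coherence blocks (assumed) */
#define REGION_BYTES (1u << 24)       /* 16 MB shared region (assumed) */

enum { ST_INVALID, ST_READONLY, ST_READWRITE };

/* One state byte per coherence block in the shared region. */
static uint8_t state_table[REGION_BYTES >> BLOCK_SHIFT];

/* The check inserted before each load/store is just mask, shift, index:
 * isolate the region offset, shift to a block number, load the state. */
static int block_state(uintptr_t addr) {
    uintptr_t idx = (addr & (REGION_BYTES - 1)) >> BLOCK_SHIFT;
    return state_table[idx];
}

/* A store to a block that is not read-write would branch to the
 * protocol's fault handler at this point. */
static int store_faults(uintptr_t addr) {
    return block_state(addr) != ST_READWRITE;
}
```

The whole fast path is a handful of ALU operations plus one table load, which is why the inline check is cheap relative to a protocol action.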
Shasta
• Executable Editing (variant of ATOM)
• Magic Cookie
      ld   r1, r2[300]
      if r1 == magic_cookie
          do_out_of_line_check(x);
      add  r3, r1, r4
• Incorporates several optimizations
  – code scheduling
  – batching checks (refs to same cache line)
  – 3% overhead on uniprocessor code
• Multiple coherence granularity
• Supports Release Consistency
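The pseudo-assembly above can be sketched in C. The sentinel value and handler below are invented for illustration; Shasta's actual encoding differs.

```c
#include <stdint.h>

#define MAGIC_COOKIE 0xDEADBEEFu   /* hypothetical invalid-line sentinel */

static int slow_checks;            /* counts out-of-line checks taken */

/* Stand-in for Shasta's out-of-line check, which consults the real
 * per-line protocol state and may fetch the line. */
static void do_out_of_line_check(const uint32_t *addr) {
    (void)addr;
    slow_checks++;
}

/* Instrumented load: the common case costs one extra compare-and-branch.
 * Only a loaded value equal to the sentinel (an invalid line, or a rare
 * false match on legitimate data) takes the slow path. */
static uint32_t checked_load(const uint32_t *addr) {
    uint32_t v = *addr;
    if (v == MAGIC_COOKIE)
        do_out_of_line_check(addr);
    return v;
}
```

Because valid data almost never equals the sentinel, the check adds only a compare in the common case; batching lets consecutive references to the same cache line share one check.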
Future Directions
• Simultaneous Multithreading
• Single-Chip MP
• MultiScalar Processors (Wednesday)
Multithreaded Processors
• Exploit thread-level parallelism to improve performance
– Multiple Program Counters
• Thread
  – independent programs (multiprogramming)
– threads from same program
Denelcor HEP
• General purpose scientific computer
• Organized as MP
  – up to 16 processors
– each processor multithreaded
– up to 128 memory modules
– up to 4 I/O cache modules
– Three-input switches and chaotic routing
HEP Processor Organization
• Multiple contexts (threads)
  – each has its own Program Status Word (PSW)
• PSWs circulate in control loop
  – control and data loops pipelined 8 deep
  – PSW in control loop can circulate no faster than data in data loop
  – PSW at queue head fetches and starts execution of next instruction
• Clock period: 100 ns
  – 8 PSWs in control loop => 10 MIPS
  – each thread gets 1/8 of the processor
  – maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer.)
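The throughput numbers above follow from simple arithmetic, sketched below for a round-robin pipeline like HEP's.

```c
/* One instruction completes per clock, so peak rate is 1/clock.
 * With clock_ns in nanoseconds: (1e9 / clock_ns) instr/s, i.e.
 * 1000 / clock_ns MIPS. */
static double peak_mips(double clock_ns) {
    return 1000.0 / clock_ns;
}

/* Each of the PSWs interleaved in the control loop gets at most an
 * equal share of the issue slots. */
static double per_thread_mips(double clock_ns, int psws) {
    return peak_mips(clock_ns) / psws;
}
```

With a 100 ns clock and 8 PSWs this reproduces the slide's 10 MIPS peak and 1.25 MIPS per thread: each thread runs an eighth as fast as a single-threaded machine with the same clock.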
Simultaneous Multithreading
• Goal: use hardware resources more efficiently
  – especially for superscalar processors
• Assume 4-issue superscalar
[Figure: issue-slot diagram for one thread on a 4-issue machine, showing horizontal waste (unused slots within an issuing cycle) and vertical waste (entirely idle cycles).]
Operation of Simultaneous Multithreading
• Standard multithreading can reduce vertical waste
• Issue from multiple threads in the same clock cycle
• Eliminate both horizontal and vertical waste
[Figure: issue-slot diagrams contrasting standard multithreading with simultaneous multithreading.]
Limitations of SuperScalar Architectures
Instruction Fetch
• branch prediction
• alignment of packet of instructions
Dynamic Instruction Issue
• Need to identify ready instructions
• Rename Table
  – no compares
  – large number of ports (Operands x Width)
• Reorder Buffer
  – n x Q x O x W 1-bit comparators (src and dest)
  – quadratic increase in queue size with issue width
  – PA-8000: 20% of die area for issue queue (56-instruction window)
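The slide's comparator formula can be checked numerically; the parameter values below are illustrative, not the PA-8000's actual design parameters.

```c
/* Comparator count from the slide: n (tag bits) x Q (queue entries)
 * x O (operands per instruction) x W (issue width) 1-bit comparators. */
static long rob_comparators(long n, long Q, long O, long W) {
    return n * Q * O * W;
}
```

If the window Q is scaled in proportion to the issue width W, the product grows with W squared, which is the quadratic blowup the slide points at.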
SuperScalar Limitations (Continued)
Instruction Execute
• Register File
  – more rename registers
  – more access ports
  – complexity quadratic with issue width
• Bypass logic
  – complexity quadratic with issue width
  – wire delays
• Functional Units
  – replicate
  – add ports to data cache (complexity adds to access time)
Why Single Chip MP?
• Technology Push
  – benefits of wide issue are limited
  – decentralized microarchitecture: easier to build several simple, fast processors than one complex processor
• Application Pull
  – applications exhibit parallelism at different grains
  – < 10 instructions per cycle (integer codes)
  – > 40 instructions per cycle (FP loops)
A 6-Way SuperScalar Processor
[Figure: 21 mm x 21 mm die floorplan of the 6-way superscalar processor: integer unit, floating-point unit, instruction fetch, instruction decode & rename, reorder buffer / instruction queues / out-of-order logic, 32 KB I-cache, 32 KB D-cache, TLB, 256 KB L2 cache, external interface, and clocking & pads.]
A 4 x 2 Single Chip Multiprocessor
[Figure: 21 mm x 21 mm die floorplan of the 4 x 2 single-chip multiprocessor: four 2-issue processors, each with its own I-cache and D-cache, a shared 256 KB L2 cache reached through an L2 communication crossbar, external interface, and clocking & pads.]
Performance Comparison
[Figure: bar chart of relative performance (0 to 4) comparing the 6-way SS and 4x2 MP designs on compress, eqntott, m88ksim, MPsim, applu, apsi, swim, tomcatv, and pmake.]
Summary of Performance
• 4 x 2 MP works well for coarse-grain apps
  – How well would a Message Passing Architecture do?
  – Can SUIF handle pointer-intensive codes?
• For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue