Decoupled Architectures for Complexity-Effective General Purpose Processors
Ronny Krashinsky and Mike Sung
6.893 Term Project Presentation
MIT Laboratory for Computer Science
12-7-2000
Motivation
out-of-order superscalar designs are inefficient and hard to scale
decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner
in previous work, decoupled architectures have been investigated for scientific apps
superscalar architectures are used universally for general purpose computing requirements
why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling
Proposal
use decoupled architectures for complexity-effective general purpose computing
multithreading can be used to hide loss of decoupling latency
potentially get the best out of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations
we will present a survey of prior work and our proposed architectural innovations; a more detailed investigation would require a lot of infrastructure (e.g. a compiler)
Decoupled Access/Execute Architecture
Decoupled Access/Execute Computer Architectures, Smith, 1982
AP and EP process separate instruction streams: the EP is used for computation (floating point), and the two streams together provide ILP
data values communicated via queues
slip: AP runs ahead of EP, giving memory latency hiding and dynamic scheduling
head of AEQ can be used as an instruction operand in EP; the EP blocks if the data isn’t available; this takes the place of register renaming
store addresses wait in WAQ until the corresponding data arrives from EP; loads can bypass stores (after an address check)
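The slip mechanism above can be illustrated with a toy cycle-level model. This is only a sketch under stated assumptions, not the design from Smith's paper: the AP issues at most one load per cycle, every load has the same fixed latency, the EP consumes one AEQ entry per cycle, and the function name `run_dae` and its parameters are hypothetical.

```python
from collections import deque

def run_dae(n_loads, mem_latency, queue_depth):
    """Toy model: return (total cycles, EP stall cycles) for n_loads loads.

    Assumptions (not from the slides): one load issued per cycle by the AP,
    a fixed memory latency, and one AEQ entry consumed per cycle by the EP.
    """
    aeq = deque()              # each entry: cycle at which the load data arrives
    issued = consumed = 0
    cycle = ep_stalls = 0
    while consumed < n_loads:
        # AP: issue the next load if the queue has a free slot.
        if issued < n_loads and len(aeq) < queue_depth:
            aeq.append(cycle + mem_latency)
            issued += 1
        # EP: use the head of the AEQ as an operand, or block if it is not ready.
        if aeq and aeq[0] <= cycle:
            aeq.popleft()
            consumed += 1
        else:
            ep_stalls += 1
        cycle += 1
    return cycle, ep_stalls

# A deep queue lets the AP slip ahead; a one-entry queue serializes every load.
print(run_dae(n_loads=100, mem_latency=10, queue_depth=16))  # (110, 10)
print(run_dae(n_loads=100, mem_latency=10, queue_depth=1))   # (1100, 1000)
```

With enough queue depth the EP pays the memory latency once, at startup; with no slip it pays it on every load.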
program control flow implemented with corresponding conditional branch in each stream
branch condition queues allow AP to hide branch latency from EP
loss of decoupling occurs if AP depends on a branch condition from EP (not discussed in early works)
implemented in the Astronautics ZS-1 processor: a single interleaved instruction stream is split to feed the instruction queues, and control flow instructions are executed in the splitter
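The splitter idea can be sketched in a few lines. This is an illustration only, not the ZS-1 implementation: the instruction encoding (a `(unit, payload)` tuple with units `"A"`, `"X"`, and `"JMP"`) and the function name `split_stream` are hypothetical, and only unconditional jumps are modeled. Conditional branches would also need values from the branch condition queues.

```python
from collections import deque

def split_stream(program, max_steps=1000):
    """Split one interleaved stream into access (A) and execute (X) queues.

    Hypothetical encoding: ("A", text), ("X", text), or ("JMP", target index).
    Jumps are resolved in the splitter and never reach either queue.
    """
    a_queue, x_queue = deque(), deque()
    pc = steps = 0
    while pc < len(program) and steps < max_steps:
        unit, payload = program[pc]
        if unit == "A":
            a_queue.append(payload)
            pc += 1
        elif unit == "X":
            x_queue.append(payload)
            pc += 1
        elif unit == "JMP":
            pc = payload           # control flow handled here, in the splitter
        else:
            raise ValueError(f"unknown unit {unit!r}")
        steps += 1
    return list(a_queue), list(x_queue)

program = [
    ("A", "load r1"),
    ("X", "fadd f1"),
    ("JMP", 4),
    ("X", "fmul f2"),      # skipped by the jump
    ("A", "store r2"),
]
print(split_stream(program))
```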
Simultaneous Multithreading with DAE
The Synergy of Multithreading and Access/Execute Decoupling, Parcerisa and Gonzalez, 1998
observation that functional unit latencies and true data dependencies in EP hinder performance
use SMT and thread level parallelism to better utilize functional units (same as with SMT in superscalars)
few threads are required
decoupling provides memory latency tolerance, SMT hides functional unit latencies
Decoupled Control/Access/Execute Architecture
The Effectiveness of Decoupling, Bird et al., 1993
further optimization: control decoupling, with three instruction streams and dynamic slip
CP processes the control flow graph and sends directives to AP and EP to execute basic blocks
limited control capabilities in AP and EP: loop count and predication
fetch engines fill queues with valid instructions: dynamic loop unrolling, control latency hidden (without speculation), “stream units”
CU can operate in stand-alone mode; implemented as a 21064, ran the OS
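A minimal sketch of how a fetch engine might expand CP directives into a flat instruction queue (the "dynamic loop unrolling" above). The directive format, a `(block name, loop count)` pair, and the names `expand_directives` and `blocks` are assumptions made for illustration; the slides do not specify an encoding.

```python
def expand_directives(directives, blocks):
    """Expand (block name, loop count) directives into an instruction queue.

    Hypothetical model: the CP has already decided the control flow, so the
    fetch engine only repeats each basic block the requested number of times.
    """
    queue = []
    for name, count in directives:
        for _ in range(count):
            queue.extend(blocks[name])
    return queue

blocks = {
    "init": ["ld a"],
    "body": ["ld x", "ld y"],
}
# The CP asks for the init block once and the loop body three times.
queue = expand_directives([("init", 1), ("body", 3)], blocks)
print(queue)
```

Because the loop count arrives with the directive, the AP and EP never wait on a branch outcome inside the loop, which is how control latency is hidden without speculation in this sketch.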
Decoupled Control/Access/Execute Architecture
loss of decoupling events cause breakdown
The Performance of Decoupled Architectures, Parcerisa et al., 1996
[Animation frames for this slide; one frame is marked “LOD!” (loss of decoupling)]
Decoupled vs. Superscalar Architectures
Dynamic “out-of-order” execution with less complexity
Allows non-speculative instruction and data prefetching. We can shrink data structures like first-level caches, potentially reducing critical paths as well as power
Inherent tolerance of long memory latency: provides a performance advantage for streaming applications etc., where lack of locality reduces the benefit of caches
Simplified issue logic, which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)
Better resource utilization by partitioning between CP/AP/DP; processors can have specialized ISAs
Scalability: a direct consequence of simplified logic
For superscalar processors, need to increase IW, which does not scale (Palacharla/Agarwal papers)
Decoupled machines alleviate centralized resource bottlenecks
Queue-based structure is amenable to tiled architectures with on-chip networks
Decoupled Architectures for General Purpose Computing
So why haven’t decoupled machines taken over the world?
Because superscalar architectures took over the world first
Primary drawback of decoupled architectures comes from LOD events: “twisty” C code can cause severe performance degradation
Inability of compilers to program effectively for separate instruction streams: lack of research/development in the area of programming/compiling analysis
Wheel of Reincarnation: no such thing as a new idea…
If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can feasibly be used for general purpose computing
Leverage existing ideas to augment decoupling: Multithreading and Auxiliary Processing
Multithreading on a DCAE Architecture
[Figure: DCAE block diagram. Labels: Data Memory Interface, Instruction Memory Interface, CP, AP, DP, IFE, RF, IFB, param, cond, LAQ, SAQ, SDQ, LDQ, RD]
Multithreading hides the latency of LOD events
LOD events result in very long latencies (hundreds of cycles) to reestablish decoupling
Motivation is to hide LOD events to prevent the need to resynchronize
SMT hides functional unit latencies
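A back-of-the-envelope model of how many threads it takes to cover LOD latency. This is the usual idealized multithreading estimate, not a result from the presentation: it assumes zero-cost context switches, that every thread runs for the same number of cycles between LOD events, and that every LOD costs the same number of cycles. The function names and the numbers in the example are hypothetical.

```python
def utilization(n_threads, run_cycles, lod_cycles):
    """Fraction of cycles doing useful work with n_threads contexts.

    Assumes free context switches and identical threads: each thread does
    run_cycles of work, then waits lod_cycles to reestablish decoupling.
    """
    return min(1.0, n_threads * run_cycles / (run_cycles + lod_cycles))

def threads_to_saturate(run_cycles, lod_cycles):
    """Smallest thread count that reaches full utilization in this model."""
    total = run_cycles + lod_cycles
    return -(-total // run_cycles)   # ceiling division on integers

# Illustrative numbers only: 50 useful cycles between LODs, 150-cycle LOD.
for n in (1, 2, 4):
    print(n, utilization(n, run_cycles=50, lod_cycles=150))
print(threads_to_saturate(run_cycles=50, lod_cycles=150))
```

In this example one thread keeps the machine busy a quarter of the time, and four contexts are enough to hide the LOD latency completely.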
Multithreading in Access/execute units:
Multiple contexts (IP/RF) for fast context-switching during LOD event
Interleaved SMT to hide horizontal as well as vertical waste within execute processor
With multithreading, utilization of CP/AP/EP by different threads is pipelined, analogous to instruction pipelining in a CPU datapath
[Animation frames repeating the DCAE block diagram; one frame is marked “LOD!”]
Auxiliary Decoupled Access/Execute Streaming Units
Implement control processor as fully functional high-performance microprocessor. Compiler can avoid decoupling control intensive code.
When decoupling is possible (e.g. streaming computations), the decoupled access/execute engines provide a high-performance complexity-effective alternative.
Analogous to vector coprocessors or SIMD array coprocessors. Basic idea is to utilize specialized hardware when possible and have a fallback plan when the Achilles heel is exposed.
Extensions for Improved Performance
Wider issue access/execute processors
Speculative Multithreading
Control processor can spawn speculative threads when only a single thread of control is available
Misspeculation detection can be performed by checking accessed memory addresses (in queues) for collisions
Kill a speculative thread by simply flushing its queues/context
Can merge concepts, with multithreaded decoupled execution under the auxiliary access/execute units paradigm
Use decoupling/multithreading when possible, and fall back on the high-performance control processor otherwise
Tiled architectures: extend decoupled architectures to scalable multiprocessor systems such as RAW
Queue-based structure is a good fit for incorporating communication from other tiles
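The misspeculation check in the speculative multithreading item above could look roughly like the following. This is a sketch under assumptions, not a mechanism specified in the slides: a speculative thread is treated as misspeculated when one of its queued load addresses matches a store address from an older thread, and its context is modeled as a dictionary of queues. The names `collides`, `resolve_speculation`, and the queue contents are hypothetical.

```python
def collides(spec_load_addrs, older_store_addrs):
    """True if any speculative load address matches an older store address."""
    return not set(spec_load_addrs).isdisjoint(older_store_addrs)

def resolve_speculation(spec_context, older_store_addrs):
    """Keep the speculative thread, or kill it by flushing all its queues."""
    if collides(spec_context["LAQ"], older_store_addrs):
        for queue in spec_context.values():
            queue.clear()          # killing a thread is just a flush
        return "killed"
    return "kept"

# Hypothetical speculative context: load addresses, store addresses, store data.
ctx = {"LAQ": [0x100, 0x108], "SAQ": [0x200], "SDQ": [7]}
print(resolve_speculation(ctx, older_store_addrs=[0x300]))  # no shared address
print(resolve_speculation(ctx, older_store_addrs=[0x108]))  # 0x108 is shared
print(ctx)
```

Because all speculative state lives in the queues and the thread context, no separate rollback structure is needed in this sketch.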
Summary
Decoupled architectures represent a complexity-effective and scalable way to provide dynamic scheduling, hide latency, and exploit ILP
To enable general purpose computation, we can augment decoupling with multithreading to hide the latency of LODs
By using decoupled access and execute units as auxiliary processors, we can leverage the benefits of both decoupling for streaming computations and out-of-order superscalars for control-flow-intensive computations