Decoupled Architectures for Complexity-Effective General Purpose Processors Ronny Krashinsky and Mike Sung 6.893 Term Project Presentation MIT Laboratory for Computer Science 12-7-2000

Page 1: Decoupled Architectures for  Complexity-Effective General Purpose Processors

Decoupled Architectures for Complexity-Effective General Purpose Processors

Ronny Krashinsky and Mike Sung

6.893 Term Project Presentation

MIT Laboratory for Computer Science

12-7-2000

Page 2

Motivation

out-of-order superscalar designs are inefficient and hard to scale

decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner

in previous work, decoupled architectures have been investigated for scientific apps

superscalar architectures are used universally for general purpose computing requirements

why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling

Page 3

Proposal

use decoupled architectures for complexity-effective general purpose computing

multithreading can be used to hide the latency of loss-of-decoupling events

potentially get the best of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations

we present a survey of prior work and our proposed architectural innovations; a more detailed investigation would require a lot of infrastructure (e.g. a compiler)

Page 4

Decoupled Access/Execute Architecture

Decoupled Access/Execute Computer Architectures, Smith, 1982

the access processor (AP) and execute processor (EP) process separate instruction streams; the EP is used for computation (floating point); running the two streams in parallel provides ILP

data values are communicated via queues

slip: the AP runs ahead of the EP, which provides memory latency hiding and dynamic scheduling

the head of the AEQ (access-to-execute queue) can be used as an instruction operand in the EP; the EP blocks if the data isn’t available; this takes the place of register renaming

store addresses wait in the WAQ (write address queue) until the corresponding data arrives from the EP; loads can bypass stores (after an address check)
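
The slip mechanism described above can be sketched as a toy cycle-level model. Everything in it is an assumption made for illustration rather than a detail of Smith's design: the function name `run_decoupled`, a fixed memory latency, one load issued and one operand consumed per cycle, and an AEQ whose capacity also bounds the number of loads in flight.

```python
from collections import deque


def run_decoupled(n_items, mem_latency, aeq_capacity):
    """Return the cycle in which the EP consumes its last operand.

    Toy model (all parameters are illustrative assumptions):
    the AP issues at most one load per cycle, memory answers after
    mem_latency cycles, and the EP consumes at most one operand per cycle.
    """
    in_flight = deque()  # arrival cycles of loads issued by the AP, in order
    aeq = deque()        # access-to-execute queue: data waiting for the EP
    issued = consumed = cycle = 0
    while consumed < n_items:
        cycle += 1
        # Memory: loads whose latency has elapsed deliver into the AEQ.
        while in_flight and in_flight[0] <= cycle:
            aeq.append(in_flight.popleft())
        # EP: uses the head of the AEQ as an operand, blocks if it is empty.
        if aeq:
            aeq.popleft()
            consumed += 1
        # AP: runs ahead, issuing the next load while a queue slot is free.
        if issued < n_items and len(in_flight) + len(aeq) < aeq_capacity:
            in_flight.append(cycle + mem_latency)
            issued += 1
    return cycle


if __name__ == "__main__":
    print(run_decoupled(100, 10, 16))  # AP slips ahead of the EP
    print(run_decoupled(100, 10, 1))   # no slip: every load latency is exposed
```

Under these assumptions, 100 items with a 10-cycle memory latency take 110 cycles with a 16-entry AEQ, because the latency is paid once and then overlapped. With `aeq_capacity` set to 1 the AP cannot run ahead and the same work takes 1001 cycles.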

Page 5

Decoupled Access/Execute Architecture

Decoupled Access/Execute Computer Architectures, Smith, 1982

program control flow is implemented with a corresponding conditional branch in each stream

branch condition queues allow the AP to hide branch latency from the EP

loss of decoupling occurs if the AP depends on a branch condition from the EP; this is not discussed in the early work

implemented in the Astronautics ZS-1 processor: a single interleaved instruction stream is split to feed the instruction queues, and control flow instructions are executed in the splitter
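
The paired-branch idea can be sketched in a few lines. This is an illustration only: the function names, the unbounded Python deques standing in for hardware queues, and the summation loop are all assumptions, not the ZS-1 mechanism. The AP resolves the loop branch itself and records each outcome, and the matching branch in the EP stream simply pops that outcome.

```python
from collections import deque


def access_stream(values, branch_q, aeq):
    """AP side: resolves the loop branch and forwards each outcome to the EP."""
    i = 0
    while True:
        taken = i < len(values)  # loop-continue condition, known to the AP
        branch_q.append(taken)   # branch condition queue, AP to EP
        if not taken:
            break
        aeq.append(values[i])    # stands in for a load result sent to the EP
        i += 1


def execute_stream(branch_q, aeq):
    """EP side: its corresponding branch reads the queued outcome."""
    total = 0.0
    while branch_q.popleft():    # "branch from queue": no condition to compute
        total += aeq.popleft()
    return total


if __name__ == "__main__":
    bq, aeq = deque(), deque()
    access_stream([1.0, 2.0, 3.5], bq, aeq)  # the AP runs the whole loop ahead
    print(execute_stream(bq, aeq))
```

Because the AP never waits on the EP here, it can finish the entire loop before the EP starts. If the branch condition instead had to come from the EP (for example, a test on a computed value), the AP would have to wait at every iteration, which is the loss-of-decoupling case.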

Page 6

Simultaneous Multithreading with DAE

The Synergy of Multithreading and Access/Execute Decoupling, Parcerisa and Gonzalez, 1998

observation that functional unit latencies and true data dependencies in EP hinder performance

use SMT and thread level parallelism to better utilize functional units (same as with SMT in superscalars)

few threads are required

decoupling provides memory latency tolerance, SMT hides functional unit latencies

Page 7

Decoupled Control/Access/Execute Architecture

The Effectiveness of Decoupling, Bird et al., 1993

further optimization: control decoupling, with three instruction streams and dynamic slip

the CP processes the control flow graph and sends directives to the AP and EP to execute basic blocks; the AP and EP have limited control capabilities: loop count and predication

fetch engines fill the queues with valid instructions, giving dynamic loop unrolling and hiding control latency (without speculation); “stream units”

the CU can operate in stand-alone mode; it was implemented as a 21064 and ran the OS

Page 8

Decoupled Control/Access/Execute Architecture

loss-of-decoupling (LOD) events cause breakdown

The Performance of Decoupled Architectures, Parcerisa et al., 1996

Pages 9 to 24

Decoupled Control/Access/Execute Architecture

[Animation frames of the architecture diagram; no slide text survives in the transcript apart from the marker “LOD!” (loss of decoupling) on page 14.]

Page 25

Decoupled vs. Superscalar Architectures

Dynamic “out-of-order” execution with less complexity

Allows non-speculative instruction and data prefetching. We can shrink data structures like first-level caches, potentially reducing critical paths as well as power

Inherent tolerance of long memory latency: provides a performance advantage for streaming applications, etc., where lack of locality mitigates the performance advantages of caches

Simplified issue logic, which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)

Better resource utilization by partitioning between CP/AP/DP; the processors can have specialized ISAs

Scalability: a direct consequence of simplified logic

For superscalar processors, the IW needs to grow, which does not scale (Palacharla/Agarwal papers)

Decoupled machines alleviate centralized resource bottlenecks

Queue-based structure is amenable to tiled architectures with on-chip networks

Page 26

Decoupled Architectures for General Purpose Computing

So why haven’t decoupled machines taken over the world?

Because superscalar architectures took over the world first

The primary drawback of decoupled architectures comes from LOD events: “twisty” C code can cause severe performance degradation

Inability of compilers to generate code effectively for separate instruction streams: lack of research/development in programming and compiler analysis for them

Wheel of Reincarnation: no such thing as a new idea…

If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can feasibly be used for general purpose computing

Leverage existing ideas to augment decoupling: multithreading and auxiliary processing

Page 27

Multithreading on a DCAE Architecture

[Figure: block diagram of the DCAE architecture. Labels recoverable from the transcript: Instruction Memory Interface, CP, AP, DP, IFE (x2), RF (x2), IFB (x2), param (x2), cond (x2), LAQ, SAQ, SDQ, LDQ, RD, Data Memory Interface.]

Multithreading hides the latency of LOD events.

LOD events result in very long latencies (hundreds of cycles) to reestablish decoupling

The motivation is to hide LOD events, avoiding the need to resynchronize

SMT hides functional unit latencies.
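
A back-of-the-envelope model of this latency hiding, offered as a sketch under stated assumptions: each thread alternates between `run_cycles` of decoupled execution and an LOD stall of `lod_cycles`, context switches are free, and the machine switches to another ready thread whenever one stalls. The function names and the numbers in the example are illustrative and are not measurements from the presentation.

```python
def utilization(n_threads, run_cycles, lod_cycles):
    """Fraction of cycles doing useful work with switch-on-LOD multithreading.

    Assumed model: a single thread is busy run_cycles out of every
    run_cycles + lod_cycles; extra threads fill the stall until saturation.
    """
    per_thread = run_cycles / (run_cycles + lod_cycles)
    return min(1.0, n_threads * per_thread)


def threads_to_hide(run_cycles, lod_cycles):
    """Smallest thread count for which this model reports full utilization."""
    total = run_cycles + lod_cycles
    return -(-total // run_cycles)  # ceiling division on integers


if __name__ == "__main__":
    print(utilization(1, 50, 150))   # single thread
    print(utilization(4, 50, 150))   # four hardware contexts
    print(threads_to_hide(50, 150))
```

With 50 cycles of useful work between LOD events and a 150-cycle resynchronization, one thread keeps the machine busy 25% of the time, and four contexts are enough to reach full utilization in this model.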

Page 28

Multithreading on a DCAE Architecture

Multithreading in access/execute units:

Multiple contexts (IP/RF) for fast context switching during an LOD event

Interleaved SMT to hide horizontal as well as vertical waste within the execute processor

Page 29

Multithreading on a DCAE Architecture

With multithreading, utilization of the CP/AP/EP by different threads is pipelined, analogous to instruction pipelining in a CPU datapath

Pages 30 to 46

Multithreading on a DCAE Architecture

[Animation frames repeating the block diagram from page 27; no slide text survives in the transcript apart from the marker “LOD!” (loss of decoupling) on page 35.]

Page 47

Auxiliary Decoupled Access/Execute Streaming Units

Implement the control processor as a fully functional high-performance microprocessor. The compiler can avoid decoupling control-intensive code.

When decoupling is possible (e.g. streaming computations), the decoupled access/execute engines provide a high-performance, complexity-effective alternative.

Analogous to vector coprocessors or SIMD array coprocessors. The basic idea is to use specialized hardware when possible and have a fallback plan when the Achilles' heel is exposed.

Page 48

Extensions for Improved Performance

Wider issue access/execute processors

Speculative multithreading

The control processor can spawn speculative threads when only a single thread of control is available

Misspeculation detection can be performed by checking accessed memory addresses (in queues) for collisions

Kill a speculative thread by simply flushing its queues/context

Can merge concepts, with multithreaded decoupled execution under the auxiliary access/execute units paradigm. Use decoupling/multithreading when possible, and fall back on the high-performance control processor otherwise

Tiled architectures: extend decoupled architectures to scalable multiprocessor systems such as RAW. The queue-based structure is a good fit for incorporating communication from other tiles
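
The collision check for speculative threads can be sketched as follows. This is a hypothetical illustration, not the authors' design: the `ThreadContext` class, its field names, and the choice to compare only an older thread's store addresses against a speculative thread's load addresses (at whole-address granularity) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Hypothetical per-thread state: addresses held in the address queues."""
    load_addrs: list = field(default_factory=list)
    store_addrs: list = field(default_factory=list)
    killed: bool = False

    def flush(self):
        """Kill the thread by discarding its queued addresses."""
        self.load_addrs.clear()
        self.store_addrs.clear()
        self.killed = True


def check_speculation(older, speculative):
    """Flush `speculative` if it loaded an address that `older` stores to.

    Returns the set of colliding addresses (empty if speculation is safe).
    """
    collided = set(older.store_addrs) & set(speculative.load_addrs)
    if collided:
        speculative.flush()
    return collided


if __name__ == "__main__":
    older = ThreadContext(store_addrs=[0x100, 0x108])
    spec = ThreadContext(load_addrs=[0x200, 0x108])
    print(check_speculation(older, spec), spec.killed)
```

The appeal of this scheme is that the addresses are already sitting in queues, so detection is a comparison over existing state and recovery is a flush rather than a rollback of committed results.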

Page 49

Summary

Decoupled architectures represent a complexity-effective and scalable way to provide dynamic scheduling, hide latency, and exploit ILP

To enable general purpose computation, we can augment decoupling with multithreading to hide the latency of LOD events

By using decoupled access and execute units as auxiliary processors, we can leverage the benefits of both: decoupling for streaming computations, and out-of-order superscalars for control-flow-intensive computations