monarch: a morphable networked micro-architecture

John Granacki, USC/Information Sciences Institute

Mike Vahey, Raytheon

MONARCH:A Morphable Networked micro-ARCHitecture

RayHieoi^ WfT — ^

Uonmittixi Sdmcss tetlMi

OjmputerSystvn^inc

MONARC

Report Documentation Page Form ApprovedOMB No. 0704-0188

Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.

1. REPORT DATE 21 MAY 2003

2. REPORT TYPE N/A

3. DATES COVERED -

4. TITLE AND SUBTITLE MONARCH: A Morphable Networked micro-ARCHitecture

5a. CONTRACT NUMBER

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S) 5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) USC/Information Sciences Institute and Raytheon

8. PERFORMING ORGANIZATIONREPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)

11. SPONSOR/MONITOR’S REPORT NUMBER(S)

12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release, distribution unlimited

13. SUPPLEMENTARY NOTES The original document contains color images.

14. ABSTRACT

15. SUBJECT TERMS

16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT

UU

18. NUMBEROF PAGES

33

19a. NAME OFRESPONSIBLE PERSON

a. REPORT unclassified

b. ABSTRACT unclassified

c. THIS PAGE unclassified

Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18

Outline

♦ MONARCH Team, Goals & Approach

♦ DIVA (Data Intensive Architecture) Leverage: The Chip

♦ Raytheon HPPS (High Performance Processor System)

♦ MONARCH Architecture & Applications

♦ Summary & Conclusions

MONARC

SCIENCESSCIENCES

USCUSCINFORMATIONINFORMATION

INSTITUTEINSTITUTE

Principal InvestigatorJohn Granacki

RESEARCH STAFFJeff DraperPedro DinizJeff LaCoss

Co-Principal InvestigatorMichael Vahey

RESEARCH STAFFJohn “Chip” Bodenschatz

Frank BrandonReagan BranstetterCharles Channell

Phil RosenMike Walker

Arlan pool

RESEARCH STAFFVlad Kaufman

The Team

Raytheon Computj^Sy^f^ms^fnc

MONARC

GOALS

♦ To support multiple classes of military missionswith a single morphable architecture

♦ To eliminate processing system redundanciesthrough rapid dynamic reconfiguration of front-end filtering and data-reduction processing

♦ To reduce application development costs by allowing the hardware to be mapped to the algorithms both statically and dynamically

♦ To develop an architecture that can quickly and efficiently adapt to changing situations - internal (fault tolerance, sensors configurations) and external (threats change, mission phasing, environment)

MONARC

Key Ideas

Combines fine, medium and coarse grain processingCombines fine, medium and coarse grain processingresources on a single chip

Matches hardware to the algorithmsMatches hardware to the algorithms and the control flow mechanisms

Configures memory structuresConfigures memory structures for efficient front-end and back-end processing

Provides flexible gigabyte I/O channelsProvides flexible gigabyte I/O channels for direct interface to sensors and inter-chip communication

Supports all systems processing requirementsSupports all systems processing requirements with a single MONARCH chip type

MONARC

Approach

♦ Leverage DARPA-sponsored DIVA Project results, Raytheon IRAD-sponsored HPPS and Mercury Stream Co-procesing Engine

♦ Use DoD missions to drive micro-architecture and morphing concepts and implementation

♦ Determine the “sweet spot” for mixing large, small- to-medium and fine-grained elements

♦ Through experiments and simulations demonstrate a “single chip” VLSI processing architecture based on DIVA and HPPS

MONARC

DIVA Leverage: The Chip

MONARC

Exploiting The Bandwidth in a System

ProcessorProcessor MemoryMemory

TheProblem

HostHostProcessorProcessor

M

P

M

P

M

P

M

P

M

P

M

P

M

P

M

P

ASolution

DIVA Solutions:•Move concurrent processing on-chip•More bandwidth and less latency on chip•Added bandwidth between memories •Lower latencies throughout system

MONARC

OS (Linux)OS (Linux)Page Placement,Page Placement,

Paging for Host & Paging for Host & PIMsPIMs,,Scheduling, PIM InitializationScheduling, PIM Initialization

DIVA Software/Hardware

Host Runtime LayerHost Runtime LayerSynchronization, FlushingSynchronization, Flushing

Thread Management, Host ParcelsThread Management, Host Parcels

CompilerCompilerData Placement, Parallelism,Data Placement, Parallelism,

HostHost--PIM Mapping,PIM Mapping,Parcels, CoherenceParcels, Coherence

ApplicationApplication

PIMApplications

Tools & Applications

PIM Runtime KernelPIM Runtime KernelParcel Management,Parcel Management,

Address Translation Faults,Address Translation Faults,PIM Context Switches,PIM Context Switches,

SynchronizationSynchronization

HostHostSystemSystem

PIM VLSI DevicesPIM VLSI DevicesProcessor, Memory ArrayProcessor, Memory Array

MemoryMemoryControllerController

SystemManagement

PhysicalPhysicalHardwareHardware

RuntimeCoordination

PhysicalPhysicalHardwareHardware

Host Runtime LayerHost Runtime LayerSynchronization, Flushing,Synchronization, Flushing,

Thread Management, Host ParcelsThread Management, Host Parcels

PIM Backend CompilerPIM Backend CompilerCode generation for scalarCode generation for scalar

and WideWord and WideWord

MONARC

DIVA Node ArchitectureDIVA Node Architecture

ICacheICache InstructionInstructionPipelinePipeline

WideWordWideWordDatapathDatapath

MEMORYMEMORYCONTROLCONTROL

&&ARBITERARBITER

DATA DATA REGISTERS

Node MainNode MainData BusData Bus

CTL

Node Memory Requests

HostMemory

Requests

ICache Mem Requests

HEADER REGISTERS

PARCEL BUFFER (“PBUF”)HOST “DRAM” INTERFACE

DRAMDRAMMEMORYMEMORY

32b

WideWord

256b

ScalarScalarDatapathDatapath

256b256bRegister Files

Scalar

Instrs.

Inter-data path Register Data

DatapathControl

SRAM

Node Processing LogicNode Processing Logic

WideWord ALU Data Flow

Source 1 Source 2

WideWordWideWordRegister FileRegister File

(256 bits)

Arithmetic(8, 16, 32)

Logical(8, 16, 32, 256)

Permutation(8, 16, 32)

Multipliers(8 x 8, 16 x 16)

(256 bits)

KISS: More compromises in architecture to enable early prototype

DIVA PIM Chip Floorplan

MemoryInterconnect

ParcelInterconnect

SSRRAAMM

WideWord

SRAM(4Mbit)

SRAM(4Mbit)

PiR

CP

buf

PL

L

Hos

t IfcIcache

Scal

arNode Processing Logic & I/O Node Processing Logic & I/O InterfaxcesInterfaxces

MONARC

DIVA PIM Chip

♦ Current lab measurements– 640 MOPs (peak, 32-bit ops)– 0.8 Watts at 80MHz on cornerturn core

loop

♦ Purpose– Demonstrate bandwidth advantages of

PIM technology

♦ Key architectural components– High memory bandwidth– 256-bit WideWord processing– PIM routing component

♦ Chip statistics– 9.8mm X 9.8mm in TSMC 0.18µµm– ~200K logic cells plus 8Mbit SRAM– 352 pins (241 signal pins)

♦ Projected performance for 2nd

prototype– 1.6 GOPs– 2.5 Watts at 200MHz

SRAMSRAM

Node Processing LogicNode Processing Logic

MONARC

HPPS & FPCA ARCHITECTURES

MONARC

CAFTIC2 GB DRAM

PPC1 v reg

CAFTIC2 GB DRAM

PPC1 v reg

CAFTIC2 GB DRAM

PPC1 v reg

CAFTICDRAM

PPC1 V reg.

FT

net

wor

k

High Performance Processor System

♦ Multinode Processor

♦ One custom ASIC

♦ Innovative voting

♦ Inputs for high bandwidth A/D receiver channels or FPCA

MONARC

HPPS Node Architecture

♦Fault-Tolerant Intf. Core

– I/O Capability– Intra-node

– Inter-node

– Distrib. Crossbar

– Fault Detection– Memory EDAC

– Processor voting

– Node configuration controls

Packetizer

Packetizer

Voter

Async I/F

I / O Crossbar

Asy

nc I

/F

Packet Buffer

Packet Buffer

Packet Buffer

PacketizerInternal Crossbar Input

DMA

Scrub Engine

GP/DSP I/F

GP/DSP

EDAC

Memory

FPCA

Mem

ory

Acc

ess

Con

trol

ler

Output DMA

Configuration Memory

Node Control • Self test • Initialization • Fault Mgmt • Reconfiguration

Vote Status

Configuration Control

Configuration & Control

Poin

t-T

o-Po

int

Inte

rcon

nect

Fa

bric

Interface for COTS µPto Leverage Commercial Advances

Field Programmable Computing Array

12 ea 4GByteI/O paths

MONARC

MONARCH: Node Architecture

♦Fault-Tolerant Intf. Core

– I/O Capability– Intra-node

– Inter-node

– Distrib. Crossbar

– Fault Detection– Memory EDAC

– Processor voting– Node configuration controls

Packetizer

Packetizer

Voter

Async I/F

I / O Crossbar

Asy

nc I

/F

Packet Buffer

Packet Buffer

Packet Buffer

PacketizerInternal Crossbar Input

DMA

Scrub Engine

GP/DSP I/F

GP/DSP

EDAC

Memory

FPCA

Mem

ory

Acc

ess

Con

trol

ler

Output DMA

Configuration Memory

Node Control • Self test • Initialization • Fault Mgmt • Reconfiguration

Vote Status

Configuration Control

Configuration & Control

Poin

t-T

o-Po

int

Inte

rcon

nect

Fa

bric

Interface for COTS µPFor Legacy Systems

Field Programmable Computing Array

12 ea 4GByteI/O paths

Virtual WideWord Unit/Data Flow Unit

DFIM

Note: Not to Scale

MONARC

Math Cluster Memory Cluster

Virtual WideWord Unit/DFIM

VWWUVWWU VWWUVWWU

VWWUVWWU VWWUVWWU

VWWUVWWU VWWUVWWU

VWWUVWWU VWWUVWWU

MONARC

MONARCH ARCHITECTURE

MONARC

CAFTIC2 GB DRAM

PPC1 v reg

CAFTIC2 GB DRAM

PPC1 v reg

CAFTIC2 GB DRAM

PPC1 v reg

MONARCHDRAM/DIVA

1 V reg.F

T n

etw

ork

MONARCH Processor System

♦ Multinode Processor

♦ One MONARCH chip

♦ Innovative voting

♦ Inputs for high bandwidth A/D receiver channels or direct chip-to-chip data transfer

MONARC

EDGEEDGEMEMORYMEMORY

GENERAL INTERFACE LOGICGENERAL INTERFACE LOGIC(Fine(Fine--grain programmable)grain programmable)

µµµµCONTROLLERCONTROLLERDatapathDatapath

I/OI/OControlControl

&&DataData

IntegrityIntegrity

ICacheICache

MONARCH Chip Overview

InstructionInstructionPipelinePipeline

MEMORY CONTROL&

ARBITER

DRAMDRAMMEMORYMEMORY

Register Load/Store Requests

ICache Reqs.

256b Instrs.

MultichannelSensor Input

COMPUTING RESOURCESCOMPUTING RESOURCESALUs/MACsALUs/MACs

STORAGESTORAGEMemory ClustersMemory Clusters

Register FilesRegister Files

INTERCONNECTINTERCONNECTCrossbars/Crossbars/DDEsDDEs

DATA/CONTROLREGISTERS

PARCELBUFFER

Inter-ChipCommunication

Fabric

256b

12

MONARC

M

M

M

M

A

A

FPU

ParcelLogic

I/O Adaptor(“FPGA”)

HighSpeed

I/OM

M

M

M

A

A

M

M

M

M

A

A

M

M

M

M

A

A

FPU FPU FPUµController

DRAM DRAM DRAM DRAM

M

M

M

M

ExternalMemoryInterface

µController µController µController

Native “Stream” Mode

MONARC

RegFile

“ALU”

M

M

M

M

A

A

FPU

ParcelLogic

I/O Adaptor(“FPGA”)

HighSpeed

I/OM

M

M

M

A

A

M

M

M

M

A

A

M

M

M

M

A

A

FPU FPU FPUµController

DRAM DRAM DRAM DRAM

M

M

M

M


µController µController µController

Native Threaded Mode

MONARC

MONARCH Single Chip Architecture

M

M

M

M

A

AParcelLogic

M

M

M

M

A

A

M

M

M

M

A

A

M

M

M

M

A

A

DRAM2 MB

DRAM2 MB

DRAM2 MB

DRAM2 MB

M

M

M

M


♦ 800 MHz Clock♦ 512 ops/clock

Mem Ctl Mem Ctl Mem Ctl Mem Ctl

µµControllerI Cache

FPU


FPU


FPU


FPU

XBar

I/OAdaptor

(“FPGA”)

Sel

ecta

ble

Bu

ffer

♦ 12 GFLOPS♦ 400 GOPS

♦ 32 MBytes DRAM♦ 320KBytes SRAM

♦ 256 MALU♦ 36 Watts

Hig

h S

pee

dI/O

4-16 slices

12ports

24 GBytes/secI/O rate

3.2 GBytes/secI/O rate

MONARC

MONARCH Application Processor

#1 #2 #3

#5

StreamProcessing

ThreadedThreadedProcessingProcessing

#4

MCMC--TMTM

MCMC--SMSMMCMC--SMSMMCMC--SMSM

Inter-chip

MemoryTransfer

ControlControl

ParcelInterface

Inter-chip I/O

(crossbar)

MCMC--TMTM

MONARC

MONARCH Architecture Features

♦ Dual native mode, high throughput computing– Multiple wide word threaded (instruction flow) processors/chip

– Highly parallel reconfigurable (data flow) processor♦ Large on chip, multiport memories

– High bandwidth access to memory

– Extensible with off chip memory♦ High speed, distributed cross bar I/O

– Integrated with chip processing

– Scalable I/O bandwidth - multiple topologies– Direct connect to high speed I/O devices, e.g., A/D’s

♦ Rich on chip interconnect

– Supports on chip topology morphing and fault tolerance– Supports multiple computation models (SISD, SIMD, DF, SPMD,…)

♦ On chip Morph - Program bus and microcontrollers

MONARC

Architecture Merger* Features- Mostly a complementary match and enhancement -

As aboveRetain - switch mode bitData flow control mode - streaming

Improved I/O performanceIncorporate dist. xbar and use for parcel com

High speed, multiple channel I/O

Little impactRetain and map onto other physical protocol

Parcel communications

Performance boostSimilar to Edge Memory

Now can have on chip

Large On-chip memory

Some hardware growth, but more control modes

Retain, and mux decoded signals with DF signals

5 State pipeline, instruction flow decoder

Performance boostRetainOn-chip micro controllers

Little impactBasic functions sameNeed to add some insts

Instruction Set Mapping

1 AC provides same width as WW unit

Each Arithmetic Cluster has 8 32 bit units

256 bit wide word processing unit

BENEFITAPPROACHISSUE

* Merger of features from DIVA and HPPS processors MONARC

Architecture Merger Issues

AreaSimulationMinimum I Cache size

Area, InterconnectTBDData exchange: WàS / SàW

NoneMemory MapI/O: Memory map or program?

Small area increaseExtend array arithmetic clusters

3-port WW registerfile implementation

Interconnect, CompilerTBD / SimulationWideWord pipelinelength / bypass

Small areaincrease

Enhance arrayx-bars for 8 bit data

Permute implementation

Design complexityTBD (modify array)WideWord shifter implementation

TBDSwitch RISC pipelinecontrol into array

Thread control for array / WideWord

Negligibledelay

Modify arraycarry-chain logic

WideWord 8-bit math

IMPACTAPPROACHISSUE

MONARC

FPCA Changes for WideWord

ALU8

ALU8

ALU8

ALU8

ALU8

ALU8

TA

TB

uOPu Data valid (default)u Consume (default)u Token (participation)

0 32

“Breakable” carry chains(Actually still 32 bit processing elements, but condition codes and carries controllable at 8 bit boundaries)

8 32-bit ALUs become 32 8-bit elements32 word register file added

Inst Decode

Mu

x

XBar

Multiplexor for dual mode instruction set control of elements

PLA

MONARC

MONARCH Pin Estimate

MONARCH I/O Summarynumber of ports

Wires per port

Total Wires Type

Clock Rate

High speed ports 12 50 600 LVDS 1 GHzInter FPCA Links 4 52 208 LVDS 1-2 GHzExternal memory 1 160 160 CMOS 500 MHzStandard I/O 2 60 120 variable 100+ MHz Total 1088

MONARC

Need to Select Preferred Parameters for 1st MONARCH Chip

Th

rou

gh

pu

t(G

OP

S)

Memory(Mbytes)

8 32 128100

400

1000

HighYield

Power

Devic

eSize

MONARCHProcessor/Memory

Trade Space

Could design chip for anywhere inside this

space

u Design choice will balance throughput, memory, and I/O needs for representative applications

MONARCH Processing Card- 6Ux160 double euro card form factor -

Bac

kpla

ne

Inte

rco

nn

ect

MONARCHchip

EDGEMEMORY

PowerConditioning

Mez

zani

ne C

onne

ctor

Mez

zani

ne C

onne

ctor

OpticalOptical

I/O

MONARCHchip

MONARCHchip

Dis

trib

ute

dC

ross

bar

EPROM

MONARCHchip

MONARCHchip

EDGEMEMORY

MONARCHchip

Multi-channelSensor &

PacketData

u 75 GFLOPSu 2.4 TOPS

u 192 MBytes on-chip DRAMu 2 MBytes on-chip SRAM

u 1 GBytes on-board memoryu 6 MONARCH chips + memory and power

conditioning MONARC

Summary & Conclusions

♦MONARCH features very attractive for multiple applications

♦Merger of two existing architectures shows good fit– “Complementary” but compatible features– Rich experience base allows quick design trades

♦“The devil is in the details” --- a lot more work– On-chip DRAM organization and access– Support for “morphing”– Simulation results at application-level– Trade offs for FPU capability

MONARC

monarch: a morphable networked micro-architecture

Documents