monarch: a morphable networked micro-architecture
TRANSCRIPT
John Granacki, USC/Information Sciences Institute
Mike Vahey, Raytheon
MONARCH:A Morphable Networked micro-ARCHitecture
RayHieoi^ WfT — ^
Uonmittixi Sdmcss tetlMi
OjmputerSystvn^inc
MONARC
Report Documentation Page Form ApprovedOMB No. 0704-0188
Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.
1. REPORT DATE 21 MAY 2003
2. REPORT TYPE N/A
3. DATES COVERED -
4. TITLE AND SUBTITLE MONARCH: A Morphable Networked micro-ARCHitecture
5a. CONTRACT NUMBER
5b. GRANT NUMBER
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S) 5d. PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) USC/Information Sciences Institute and Raytheon
8. PERFORMING ORGANIZATIONREPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)
11. SPONSOR/MONITOR’S REPORT NUMBER(S)
12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release, distribution unlimited
13. SUPPLEMENTARY NOTES The original document contains color images.
14. ABSTRACT
15. SUBJECT TERMS
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT
UU
18. NUMBEROF PAGES
33
19a. NAME OFRESPONSIBLE PERSON
a. REPORT unclassified
b. ABSTRACT unclassified
c. THIS PAGE unclassified
Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18
Outline
♦ MONARCH Team, Goals & Approach
♦ DIVA (Data Intensive Architecture) Leverage: The Chip
♦ Raytheon HPPS (High Performance Processor System)
♦ MONARCH Architecture & Applications
♦ Summary & Conclusions
MONARC
SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Principal InvestigatorJohn Granacki
RESEARCH STAFFJeff DraperPedro DinizJeff LaCoss
Co-Principal InvestigatorMichael Vahey
RESEARCH STAFFJohn “Chip” Bodenschatz
Frank BrandonReagan BranstetterCharles Channell
Phil RosenMike Walker
Arlan pool
RESEARCH STAFFVlad Kaufman
The Team
Raytheon Computj^Sy^f^ms^fnc
MONARC
GOALS
♦ To support multiple classes of military missionswith a single morphable architecture
♦ To eliminate processing system redundanciesthrough rapid dynamic reconfiguration of front-end filtering and data-reduction processing
♦ To reduce application development costs by allowing the hardware to be mapped to the algorithms both statically and dynamically
♦ To develop an architecture that can quickly and efficiently adapt to changing situations - internal (fault tolerance, sensors configurations) and external (threats change, mission phasing, environment)
MONARC
Key Ideas
Combines fine, medium and coarse grain processingCombines fine, medium and coarse grain processingresources on a single chip
Matches hardware to the algorithmsMatches hardware to the algorithms and the control flow mechanisms
Configures memory structuresConfigures memory structures for efficient front-end and back-end processing
Provides flexible gigabyte I/O channelsProvides flexible gigabyte I/O channels for direct interface to sensors and inter-chip communication
Supports all systems processing requirementsSupports all systems processing requirements with a single MONARCH chip type
MONARC
Approach
♦ Leverage DARPA-sponsored DIVA Project results, Raytheon IRAD-sponsored HPPS and Mercury Stream Co-procesing Engine
♦ Use DoD missions to drive micro-architecture and morphing concepts and implementation
♦ Determine the “sweet spot” for mixing large, small- to-medium and fine-grained elements
♦ Through experiments and simulations demonstrate a “single chip” VLSI processing architecture based on DIVA and HPPS
MONARC
DIVA Leverage: The Chip
MONARC
Exploiting The Bandwidth in a System
ProcessorProcessor MemoryMemory
TheProblem
HostHostProcessorProcessor
M
P
M
P
M
P
M
P
M
P
M
P
M
P
M
P
ASolution
DIVA Solutions:•Move concurrent processing on-chip•More bandwidth and less latency on chip•Added bandwidth between memories •Lower latencies throughout system
MONARC
OS (Linux)OS (Linux)Page Placement,Page Placement,
Paging for Host & Paging for Host & PIMsPIMs,,Scheduling, PIM InitializationScheduling, PIM Initialization
DIVA Software/Hardware
Host Runtime LayerHost Runtime LayerSynchronization, FlushingSynchronization, Flushing
Thread Management, Host ParcelsThread Management, Host Parcels
CompilerCompilerData Placement, Parallelism,Data Placement, Parallelism,
HostHost--PIM Mapping,PIM Mapping,Parcels, CoherenceParcels, Coherence
ApplicationApplication
PIMApplications
Tools & Applications
PIM Runtime KernelPIM Runtime KernelParcel Management,Parcel Management,
Address Translation Faults,Address Translation Faults,PIM Context Switches,PIM Context Switches,
SynchronizationSynchronization
HostHostSystemSystem
PIM VLSI DevicesPIM VLSI DevicesProcessor, Memory ArrayProcessor, Memory Array
MemoryMemoryControllerController
SystemManagement
PhysicalPhysicalHardwareHardware
RuntimeCoordination
PhysicalPhysicalHardwareHardware
Host Runtime LayerHost Runtime LayerSynchronization, Flushing,Synchronization, Flushing,
Thread Management, Host ParcelsThread Management, Host Parcels
PIM Backend CompilerPIM Backend CompilerCode generation for scalarCode generation for scalar
and WideWord and WideWord
MONARC
DIVA Node ArchitectureDIVA Node Architecture
ICacheICache InstructionInstructionPipelinePipeline
WideWordWideWordDatapathDatapath
MEMORYMEMORYCONTROLCONTROL
&&ARBITERARBITER
DATA DATA REGISTERS
Node MainNode MainData BusData Bus
CTL
Node Memory Requests
HostMemory
Requests
ICache Mem Requests
HEADER REGISTERS
PARCEL BUFFER (“PBUF”)HOST “DRAM” INTERFACE
DRAMDRAMMEMORYMEMORY
32b
WideWord
256b
ScalarScalarDatapathDatapath
256b256bRegister Files
Scalar
Instrs.
Inter-data path Register Data
DatapathControl
SRAM
Node Processing LogicNode Processing Logic
WideWord ALU Data Flow
Source 1 Source 2
WideWordWideWordRegister FileRegister File
(256 bits)
Arithmetic(8, 16, 32)
Logical(8, 16, 32, 256)
Permutation(8, 16, 32)
Multipliers(8 x 8, 16 x 16)
(256 bits)
KISS: More compromises in architecture to enable early prototype
DIVA PIM Chip Floorplan
MemoryInterconnect
ParcelInterconnect
SSRRAAMM
WideWord
SRAM(4Mbit)
SRAM(4Mbit)
PiR
CP
buf
PL
L
Hos
t IfcIcache
Scal
arNode Processing Logic & I/O Node Processing Logic & I/O InterfaxcesInterfaxces
MONARC
DIVA PIM Chip
♦ Current lab measurements– 640 MOPs (peak, 32-bit ops)– 0.8 Watts at 80MHz on cornerturn core
loop
♦ Purpose– Demonstrate bandwidth advantages of
PIM technology
♦ Key architectural components– High memory bandwidth– 256-bit WideWord processing– PIM routing component
♦ Chip statistics– 9.8mm X 9.8mm in TSMC 0.18µµm– ~200K logic cells plus 8Mbit SRAM– 352 pins (241 signal pins)
♦ Projected performance for 2nd
prototype– 1.6 GOPs– 2.5 Watts at 200MHz
SRAMSRAM
Node Processing LogicNode Processing Logic
MONARC
HPPS & FPCA ARCHITECTURES
MONARC
CAFTIC2 GB DRAM
PPC1 v reg
CAFTIC2 GB DRAM
PPC1 v reg
CAFTIC2 GB DRAM
PPC1 v reg
CAFTICDRAM
PPC1 V reg.
FT
net
wor
k
High Performance Processor System
♦ Multinode Processor
♦ One custom ASIC
♦ Innovative voting
♦ Inputs for high bandwidth A/D receiver channels or FPCA
MONARC
HPPS Node Architecture
♦Fault-Tolerant Intf. Core
– I/O Capability– Intra-node
– Inter-node
– Distrib. Crossbar
– Fault Detection– Memory EDAC
– Processor voting
– Node configuration controls
Packetizer
Packetizer
Voter
Async I/F
I / O Crossbar
Asy
nc I
/F
Packet Buffer
Packet Buffer
Packet Buffer
PacketizerInternal Crossbar Input
DMA
Scrub Engine
GP/DSP I/F
GP/DSP
EDAC
Memory
FPCA
Mem
ory
Acc
ess
Con
trol
ler
Output DMA
Configuration Memory
Node Control • Self test • Initialization • Fault Mgmt • Reconfiguration
Vote Status
Configuration Control
Configuration & Control
Poin
t-T
o-Po
int
Inte
rcon
nect
Fa
bric
Interface for COTS µPto Leverage Commercial Advances
Field Programmable Computing Array
12 ea 4GByteI/O paths
MONARC
MONARCH: Node Architecture
♦Fault-Tolerant Intf. Core
– I/O Capability– Intra-node
– Inter-node
– Distrib. Crossbar
– Fault Detection– Memory EDAC
– Processor voting– Node configuration controls
Packetizer
Packetizer
Voter
Async I/F
I / O Crossbar
Asy
nc I
/F
Packet Buffer
Packet Buffer
Packet Buffer
PacketizerInternal Crossbar Input
DMA
Scrub Engine
GP/DSP I/F
GP/DSP
EDAC
Memory
FPCA
Mem
ory
Acc
ess
Con
trol
ler
Output DMA
Configuration Memory
Node Control • Self test • Initialization • Fault Mgmt • Reconfiguration
Vote Status
Configuration Control
Configuration & Control
Poin
t-T
o-Po
int
Inte
rcon
nect
Fa
bric
Interface for COTS µPFor Legacy Systems
Field Programmable Computing Array
12 ea 4GByteI/O paths
Virtual WideWord Unit/Data Flow Unit
DFIM
Note: Not to Scale
MONARC
Math Cluster Memory Cluster
Virtual WideWord Unit/DFIM
VWWUVWWU VWWUVWWU
VWWUVWWU VWWUVWWU
VWWUVWWU VWWUVWWU
VWWUVWWU VWWUVWWU
MONARC
MONARCH ARCHITECTURE
MONARC
CAFTIC2 GB DRAM
PPC1 v reg
CAFTIC2 GB DRAM
PPC1 v reg
CAFTIC2 GB DRAM
PPC1 v reg
MONARCHDRAM/DIVA
1 V reg.F
T n
etw
ork
MONARCH Processor System
♦ Multinode Processor
♦ One MONARCH chip
♦ Innovative voting
♦ Inputs for high bandwidth A/D receiver channels or direct chip-to-chip data transfer
MONARC
EDGEEDGEMEMORYMEMORY
GENERAL INTERFACE LOGICGENERAL INTERFACE LOGIC(Fine(Fine--grain programmable)grain programmable)
µµµµCONTROLLERCONTROLLERDatapathDatapath
I/OI/OControlControl
&&DataData
IntegrityIntegrity
ICacheICache
MONARCH Chip Overview
InstructionInstructionPipelinePipeline
MEMORY CONTROL&
ARBITER
DRAMDRAMMEMORYMEMORY
Register Load/Store Requests
ICache Reqs.
256b Instrs.
MultichannelSensor Input
COMPUTING RESOURCESCOMPUTING RESOURCESALUs/MACsALUs/MACs
STORAGESTORAGEMemory ClustersMemory Clusters
Register FilesRegister Files
INTERCONNECTINTERCONNECTCrossbars/Crossbars/DDEsDDEs
DATA/CONTROLREGISTERS
PARCELBUFFER
Inter-ChipCommunication
Fabric
256b
12
MONARC
M
M
M
M
A
A
FPU
ParcelLogic
I/O Adaptor(“FPGA”)
HighSpeed
I/OM
M
M
M
A
A
M
M
M
M
A
A
M
M
M
M
A
A
FPU FPU FPUµController
DRAM DRAM DRAM DRAM
M
M
M
M
ExternalMemoryInterface
µController µController µController
Native “Stream” Mode
MONARC
RegFile
“ALU”
M
M
M
M
A
A
FPU
ParcelLogic
I/O Adaptor(“FPGA”)
HighSpeed
I/OM
M
M
M
A
A
M
M
M
M
A
A
M
M
M
M
A
A
FPU FPU FPUµController
DRAM DRAM DRAM DRAM
M
M
M
M
ExternalMemoryInterface
µController µController µController
Native Threaded Mode
MONARC
MONARCH Single Chip Architecture
M
M
M
M
A
AParcelLogic
M
M
M
M
A
A
M
M
M
M
A
A
M
M
M
M
A
A
DRAM2 MB
DRAM2 MB
DRAM2 MB
DRAM2 MB
M
M
M
M
ExternalMemoryInterface
♦ 800 MHz Clock♦ 512 ops/clock
Mem Ctl Mem Ctl Mem Ctl Mem Ctl
µµControllerI Cache
FPU
µµControllerI Cache
FPU
µµControllerI Cache
FPU
µµControllerI Cache
FPU
XBar
I/OAdaptor
(“FPGA”)
Sel
ecta
ble
Bu
ffer
♦ 12 GFLOPS♦ 400 GOPS
♦ 32 MBytes DRAM♦ 320KBytes SRAM
♦ 256 MALU♦ 36 Watts
Hig
h S
pee
dI/O
4-16 slices
12ports
24 GBytes/secI/O rate
3.2 GBytes/secI/O rate
MONARC
MONARCH Application Processor
#1 #2 #3
#5
StreamProcessing
ThreadedThreadedProcessingProcessing
#4
MCMC--TMTM
MCMC--SMSMMCMC--SMSMMCMC--SMSM
Inter-chip
MemoryTransfer
ControlControl
ParcelInterface
Inter-chip I/O
(crossbar)
MCMC--TMTM
MONARC
MONARCH Architecture Features
♦ Dual native mode, high throughput computing– Multiple wide word threaded (instruction flow) processors/chip
– Highly parallel reconfigurable (data flow) processor♦ Large on chip, multiport memories
– High bandwidth access to memory
– Extensible with off chip memory♦ High speed, distributed cross bar I/O
– Integrated with chip processing
– Scalable I/O bandwidth - multiple topologies– Direct connect to high speed I/O devices, e.g., A/D’s
♦ Rich on chip interconnect
– Supports on chip topology morphing and fault tolerance– Supports multiple computation models (SISD, SIMD, DF, SPMD,…)
♦ On chip Morph - Program bus and microcontrollers
MONARC
Architecture Merger* Features- Mostly a complementary match and enhancement -
As aboveRetain - switch mode bitData flow control mode - streaming
Improved I/O performanceIncorporate dist. xbar and use for parcel com
High speed, multiple channel I/O
Little impactRetain and map onto other physical protocol
Parcel communications
Performance boostSimilar to Edge Memory
Now can have on chip
Large On-chip memory
Some hardware growth, but more control modes
Retain, and mux decoded signals with DF signals
5 State pipeline, instruction flow decoder
Performance boostRetainOn-chip micro controllers
Little impactBasic functions sameNeed to add some insts
Instruction Set Mapping
1 AC provides same width as WW unit
Each Arithmetic Cluster has 8 32 bit units
256 bit wide word processing unit
BENEFITAPPROACHISSUE
* Merger of features from DIVA and HPPS processors MONARC
Architecture Merger Issues
AreaSimulationMinimum I Cache size
Area, InterconnectTBDData exchange: WàS / SàW
NoneMemory MapI/O: Memory map or program?
Small area increaseExtend array arithmetic clusters
3-port WW registerfile implementation
Interconnect, CompilerTBD / SimulationWideWord pipelinelength / bypass
Small areaincrease
Enhance arrayx-bars for 8 bit data
Permute implementation
Design complexityTBD (modify array)WideWord shifter implementation
TBDSwitch RISC pipelinecontrol into array
Thread control for array / WideWord
Negligibledelay
Modify arraycarry-chain logic
WideWord 8-bit math
IMPACTAPPROACHISSUE
MONARC
FPCA Changes for WideWord
ALU8
ALU8
ALU8
ALU8
ALU8
ALU8
TA
TB
uOPu Data valid (default)u Consume (default)u Token (participation)
0 32
“Breakable” carry chains(Actually still 32 bit processing elements, but condition codes and carries controllable at 8 bit boundaries)
8 32-bit ALUs become 32 8-bit elements32 word register file added
Inst Decode
Mu
x
XBar
Multiplexor for dual mode instruction set control of elements
PLA
MONARC
MONARCH Pin Estimate
MONARCH I/O Summarynumber of ports
Wires per port
Total Wires Type
Clock Rate
High speed ports 12 50 600 LVDS 1 GHzInter FPCA Links 4 52 208 LVDS 1-2 GHzExternal memory 1 160 160 CMOS 500 MHzStandard I/O 2 60 120 variable 100+ MHz Total 1088
MONARC
Need to Select Preferred Parameters for 1st MONARCH Chip
Th
rou
gh
pu
t(G
OP
S)
Memory(Mbytes)
8 32 128100
400
1000
HighYield
Power
Devic
eSize
MONARCHProcessor/Memory
Trade Space
Could design chip for anywhere inside this
space
u Design choice will balance throughput, memory, and I/O needs for representative applications
MONARCH Processing Card- 6Ux160 double euro card form factor -
Bac
kpla
ne
Inte
rco
nn
ect
MONARCHchip
EDGEMEMORY
PowerConditioning
Mez
zani
ne C
onne
ctor
Mez
zani
ne C
onne
ctor
OpticalOptical
I/O
MONARCHchip
MONARCHchip
Dis
trib
ute
dC
ross
bar
EPROM
MONARCHchip
MONARCHchip
EDGEMEMORY
MONARCHchip
Multi-channelSensor &
PacketData
u 75 GFLOPSu 2.4 TOPS
u 192 MBytes on-chip DRAMu 2 MBytes on-chip SRAM
u 1 GBytes on-board memoryu 6 MONARCH chips + memory and power
conditioning MONARC
Summary & Conclusions
♦MONARCH features very attractive for multiple applications
♦Merger of two existing architectures shows good fit– “Complementary” but compatible features– Rich experience base allows quick design trades
♦“The devil is in the details” --- a lot more work– On-chip DRAM organization and access– Support for “morphing”– Simulation results at application-level– Trade offs for FPU capability
MONARC