CGRA QUIZ
Quiz
• What is the fundamental drawback of fine-grained architecture that led to exploration of coarse grained reconfigurable architectures? (Max of 5 words!)
• Give two examples for each coarse grained architecture type: Mesh, Linear Array, and Crossbar
• Indicate whether the given architecture supports some form of partial reconfiguration or not: PipeRench, KressArray, CHESS
COARSE GRAINED RECONFIGURABLE ARCHITECTURES
04/21/2014
Aditi Sharma
Dhiraj Chaudhary
Pruthvi Gowda
Rachana Raj Sunku
DAY - 2
Outline
• Coarse Grained Reconfigurable Architectures
  • RAW
  • CHESS
• Basics of Network On Chip (NoC)
• Project Overview
Raw Architecture Workstation (RAW)
• Developed at MIT
• It fully exposes low-level hardware architectural details to the compiler
• It lacks hardware for register renaming and dynamic instruction issue
• A Raw architecture seeks to execute pipelined applications (like signal processing) efficiently.
Motivation ???
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
Change Is Around the Corner
• Processor performance not scaling as before
• Wire delay and power
[Figure: chip size compared with the distance a signal can travel in 1 cycle]
• Old view: chip looks small to a wire
• New view: chip looks much bigger to a wire, communication is expensive even on chip!
Raw Architecture
How do we arrive at this design???
Problems with Monolithic Designs
• Super-wide general purpose processors are no longer practical
[Figure: monolithic datapath with wide fetch (16 inst), PC, unified load/store queue, RF, control, and a row of ALUs on a shared bypass net]
• Centralized control with global operand routing
• Area, power, and frequency concerns
Spatial Architectures
[Figure: three successive diagrams. First, the monolithic datapath (wide fetch of 16 inst, unified load/store queue, PC, RF, control, ALUs, bypass net) with + and >> operations marked. Second, the ALUs arranged as an array around the RF and bypass net. Third, the ALU array with the RF only.]
Exploiting Locality
[Figure: ALU array around the RF, with the >> and + operations mapped onto ALUs]
Distribute the Register File
[Figure: the single central RF gives way to one RF next to each ALU]
Distribute the Rest
[Figure: the centralized control, wide fetch (16 inst), and unified load/store queue give way to an I$, PC, and D$ next to each ALU and RF]
Tiled-Processor Architecture
[Figure: array of tiles, each holding an ALU, RF, I$, PC, and D$]
• Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer
• Tile abstraction is quite powerful
  – e.g., power → resources used as necessary
• Easily scalable
• All signals registered at tile boundaries, no global signals
• Easier to tune the frequency
• Easier to do the physical design
• Easier to verify
Raw On-Chip Networks
• 2 Static Networks
  • Provide low-latency communication between tiles.
  • Routing decisions are made at compile time.
• 2 Dynamic Networks
  • Header encodes destination.
  • Transport unpredictable operations like interrupts and cache misses.
[Figure: a Raw tile, labeled Computation Resources and Switch Processor]
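The difference between the two network types can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: tiles are addressed as (row, col) on a mesh, a message moves along the row before it moves along the column, and all function names are hypothetical. For the static network the whole path is fixed ahead of time, as a compiler would do; for the dynamic network each switch looks only at the destination carried in the message header.

```python
def next_hop(here, dest):
    """One routing step on a mesh: fix the column first, then the row (assumed order)."""
    (r, c), (dr, dc) = here, dest
    if c != dc:
        return (r, c + (1 if dc > c else -1))
    if r != dr:
        return (r + (1 if dr > r else -1), c)
    return here  # already at the destination


def static_route(src, dest):
    """Static network: the full path is computed before the program runs."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path


def dynamic_send(src, message):
    """Dynamic network: each switch reads the header and picks the next hop."""
    here, hops = src, 0
    while here != message["header"]["dest"]:
        here = next_hop(here, message["header"]["dest"])
        hops += 1
    return here, hops
```

Calling `static_route((0, 0), (2, 1))` yields the list of tiles a compiler could turn into per-switch routing instructions, while `dynamic_send` only needs the header at run time, which suits events such as cache misses whose timing is not known in advance.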
Inside the Compute Processor
[Figure: compute processor pipeline from IF to WB with a local bypass network. Registers r24 to r27 are tied to the input FIFOs from the static router and to the output FIFOs to the static router.]
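A small Python model may help show what it means for registers to be tied to network FIFOs. It is a sketch under assumptions: the class and method names are hypothetical, only an add instruction is modeled, and an empty input FIFO raises an error here where real hardware would be expected to wait. Reading one of r24 to r27 takes a value from the matching input FIFO, and writing one of them places the value in the matching output FIFO.

```python
from collections import deque

# Register names taken from the figure labels.
NETWORK_REGS = ("r24", "r25", "r26", "r27")


class ComputeProcessor:
    def __init__(self):
        self.regs = {}  # ordinary registers
        self.in_fifo = {r: deque() for r in NETWORK_REGS}   # from static router
        self.out_fifo = {r: deque() for r in NETWORK_REGS}  # to static router

    def read(self, reg):
        if reg in NETWORK_REGS:
            # Consumes the oldest value; raises IndexError if the FIFO is empty.
            return self.in_fifo[reg].popleft()
        return self.regs.get(reg, 0)

    def write(self, reg, value):
        if reg in NETWORK_REGS:
            self.out_fifo[reg].append(value)
        else:
            self.regs[reg] = value

    def add(self, dst, a, b):
        """dst = a + b, where any operand may be a network register."""
        self.write(dst, self.read(a) + self.read(b))
```

With this model, `add("r25", "r24", "r1")` receives an operand from a neighboring tile, adds a local register, and sends the result onward in a single instruction, with no separate send or receive step.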
Raw Compiler Example
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
….
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands and streams between tiles.
[Slide Source: Michael B. Taylor]
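The assignment step can be illustrated with a simple greedy placement in Python. This is not the Raw compiler's actual algorithm; it is a minimal sketch that assumes operations arrive in dependency order, that each tile holds a fixed number of operations, and that an operation prefers the tile already holding most of its operand producers. The function names and the capacity parameter are hypothetical. Every operand that still crosses a tile boundary is one that would need static router instructions.

```python
from collections import Counter


def assign_tiles(ops, n_tiles, capacity):
    """ops: list of (name, operand_names) in dependency order. Returns {name: tile}."""
    placement = {}
    load = [0] * n_tiles
    for name, operands in ops:
        # Count, per tile, how many of this op's operands are produced there.
        votes = Counter(placement[o] for o in operands if o in placement)
        candidates = [t for t, _ in votes.most_common() if load[t] < capacity]
        # Prefer the tile with the most producers; otherwise the emptiest tile.
        tile = candidates[0] if candidates else min(range(n_tiles), key=lambda t: load[t])
        placement[name] = tile
        load[tile] += 1
    return placement


def cross_tile_transfers(ops, placement):
    """Operand transfers that would need static router instructions."""
    return [(o, name) for name, operands in ops for o in operands
            if o in placement and placement[o] != placement[name]]


# A subset of the operations listed above; inputs such as seed.0 are omitted.
OPS = [
    ("pval5", []), ("pval4", ["pval5"]), ("tmp3.6", ["pval4"]),
    ("pval2", []), ("tmp1.3", ["pval2"]),
    ("pval3", []), ("tmp2.5", ["pval3"]),
    ("pval6", ["tmp1.3", "tmp2.5"]), ("v2.7", ["pval6"]),
    ("v3.10", ["tmp3.6", "v2.7"]),
]
```

For two tiles of five operations each, `cross_tile_transfers(OPS, assign_tiles(OPS, 2, 5))` reports a single transfer, from `pval6` to `v2.7`; every other operand stays on the tile that produced it.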
Architectural Comparison
[Figure: side-by-side comparison of RAW, superscalar, and multiprocessor architectures]
Application Mapping on RAW
[Figure: one Raw chip running several workloads on different groups of tiles, with these labels]
• Four-way parallelized scalar code
• Two-way threaded Java program
• httpd
• Video data stream
• Frame buffer and screen
• Custom data path pipeline (by compiler)
• Sleep mode (power saving): Zzzz..
• Fast inter-tile ALU forwarding: 3 cycles
RAW - Performance
Taylor, Michael Bedford, et al. "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society, 2004.
CHESS - A Reconfigurable Arithmetic Array For Multimedia Applications
• Designed by Hewlett Packard Laboratories in the year 1999
• Aims at speeding up arithmetic operations for multimedia applications and tries to improve memory density
• Principal goals of CHESS
  • Increased arithmetic computational density
  • Increased memory bandwidth
  • Increased capacity of internal memories
  • Enhanced flexibility
  • Rapid reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
CHESS - Architecture
• 4 bit ALUs
• 4 bit bus wiring
• Switchboxes
• Chessboard layout
• Embedded block RAMs
• Speed and hierarchical line lengths
• Small configuration memories
• No run-time reconfiguration
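To give a feel for working with 4 bit ALUs and 4 bit buses, the Python sketch below builds a wider addition out of 4 bit slices. It assumes that neighboring slices can pass a carry to one another, which the slides do not state, and both function names are hypothetical. It is an illustration of the arithmetic only, not a model of the CHESS ALU design.

```python
def alu4_add(a, b, carry_in=0):
    """Add two 4-bit values; return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def wide_add(a, b, width=16):
    """Add two width-bit values by chaining 4-bit slices, low nibble first."""
    result, carry = 0, 0
    for shift in range(0, width, 4):
        nibble, carry = alu4_add(a >> shift, b >> shift, carry)
        result |= nibble << shift
    return result, carry
```

A 16 bit addition uses four slices here, so `wide_add(0x1234, 0x0FFF)` returns `(0x2233, 0)`, and an overflowing sum such as `wide_add(0xFFFF, 1)` returns `(0, 1)` with the carry out set.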
CHESS - Components
[Figures: ALU logic design; switchbox]
CHESS - Routing Structure
CHESS - Performance
• High computational density
• Efficient multiplies due to embedded ALU
• Issues:
  • No reported software or application results
  • No run-time reconfiguration
Comparison: CHESS and MATRIX
• Both use a 2D array of ALUs
• For both, instructions can be generated within the array
• Both architectures are flexible
• CHESS is 4 bit whereas MATRIX is 8 bit
• CHESS does not support run-time reconfiguration but has very fast configuration as few bits are required
• CHESS has high computational density
• CHESS is aimed at arithmetic operations whereas MATRIX is more general purpose
Network-On-Chip (NoC)
Project Overview
• Implementing Coarse Grained and Hybrid Reconfigurable Architecture
• NoC interconnection between processing elements
• Supports Variable Block Size Motion Estimation
• Motion Estimation Algorithms
  • Full Search
  • Diamond Search
Verma, Ruchika, and Ali Akoglu. "A coarse grained and hybrid reconfigurable architecture with flexible NoC router for variable block size motion estimation." Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008.
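As a software reference for the Full Search algorithm named in the project overview, the Python sketch below compares a block of the current frame against every candidate position in a search window of the reference frame and keeps the one with the lowest sum of absolute differences (SAD). It is a plain illustration, not the project's hardware design: frames are assumed to be lists of rows of pixel values, the function and parameter names are hypothetical, and the block size `n` is passed in as a parameter as one simple way to cover variable block sizes.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the n x n block of ref at (rx, ry). Frames are lists of rows."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement within +/- search_range.
    Returns (best_dx, best_dy, best_sad), or None if no candidate fits."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(cur, ref, bx, by, rx, ry, n)
                if best is None or cost < best[2]:
                    best = (dx, dy, cost)
    return best
```

Full Search evaluates all (2 * search_range + 1)^2 candidates, which is why cheaper patterns such as Diamond Search, which checks only a few points around the current best position at each step, are also of interest.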
[Figure: 4 x 4 array of elements CPE(1,1) through CPE(4,4) linked by c_d and r_d connections, together with PE 2(1) through PE 2(4), PE 3, a Memory Interface (MI), and Main Memory. Labeled signals: data_load_control (16 bits), reference_block_id (5 bits), c_d_(x,y) (32 bits), r_d_(x,y) (32 bits). Other links are marked 32 bits, 14 bits, and 12 bits.]
QUESTIONS??