implementation characterstics of interconnect mechanisms …anup/homepage/files/presentatio… ·...
Post on 18-Apr-2018
231 Views
Preview:
TRANSCRIPT
Implementation Characterstics of InterconnectMechanisms in Clustered VLIW Architectures
Anup Gangwar
Embedded Systems Group,Department of Computer Science and Engineering,
Indian Institute of Technology Delhihttp://embedded.cse.iitd.ernet.in
August 12, 2004
(Joint work with M. Balakrishnan, Preeti R. Panda and Anshul Kumar)
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 1/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 2/32
VLIW Architecture and Features
Compiler extracts parallelism, these have evolved fromhorizontal microcoded architectures
Latest industry coined acronym, EPIC for Explicitly ParallelInstruction Computing
Commercial Architectures:
General Purpose Computing: Intel Itanium
Embedded Computing: TriMedia, TiC6x, Sun’s MAJC etc.
RISC SuperScalar VLIW (4 issue)ADD r1, r2, r3 ADD r1, r2, r3 ADD r1, r2, r3 | SUB r4, r2, r3 | NOP | NOP
SUB r4, r2, r3 SUB r4, r2, r3 MUL r5, r1, r4 | NOP | NOP | NOP
MUL r5, r1, r4 MUL r5, r1, r4
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 3/32
VLIW Architecture and Features (contd)
Advantages:Simplified hardware: Suitable for customization
Less power consumption as compared to SuperScalar processors
High performance
Disadvantages:Complicated compiler: limits retargetability
Code size blow up due to explicit NOPs.
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 4/32
Typical Organization of VLIW Processor
ALU LD/ST ALU ALU LD/ST ALU
R.F.
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 5/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 6/32
Clustered VLIW Processors
For N functional units connected to a RF the Area grows as
N3, Delay as N32 and Power as N3 (Rixner et. al, HPCA 2000)
Solutions is to break up the RF into a set of smaller RFs
....Register File 1 Register File 2 Register File 3
Memory System
Interconnection Network
Cluster 1 Cluster 2 Cluster 3
FU1 FU2 FU3 FU1 FU2 FU3 FU1 FU2 FU3
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 7/32
Clustered VLIW Architectures
RF-to-RF
LD/ST ALU LD/ST ALUALU ALU
R.F. R.F.
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Clustered VLIW Architectures
Write Across (1)
ALU ALU
R.F. R.F.
ALULD/ST LD/ST ALU
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Clustered VLIW Architectures
Read Across (1)
ALU ALU
R.F. R.F.
LD/ST ALU LD/ST ALU
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Clustered VLIW Architectures
Write/Read Across (1)
ALU ALU
R.F. R.F.
LD/ST ALU LD/ST ALU
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Clustered VLIW Architectures
Write Across (2)
ALU ALU
R.F. R.F.
ALULD/ST LD/ST ALU
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Clustered VLIW Architectures
Read Across (2)
ALU ALU
R.F. R.F.
LD/ST ALU LD/ST ALU
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Clustered VLIW Architectures
Write/Read Across (2)
ALU ALU
R.F. R.F.
LD/ST ALU LD/ST ALU
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32
Set of Evaluated Benchmarks
Primary benchmark source is DSPStone and MediaBench
Pick up common set of benchmarks from proposedMediaBench II
Benchmarks:
DSPStone Kernels MediaBench
Matrix Initialization JPEG Decoder
IDCT JPEG Encoder
Biquad MPEG2 Decoder
Lattice MPEG2 Encoder
Matrix Multiplication G721 Decoder
Insert Sort G721 Encoder
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 9/32
Evaluation Results
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16 18
ILP
as
Frac
tion
of P
ure
VLI
W (%
)
No. of Clusters
RFWA.1RA.1WR.1WA.2RA.2WR.2
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 10/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 11/32
Objectives of Clock-period Evaluation
Excercise
For a comprehensive performance estimate both thenumber of cycles and cycle time itself are necessary
How does clock-period vary for the various types ofInterconnects
What is the relative increase in total chip interconnect areabetween different interconnects
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 12/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 13/32
Instruction Set Architecture of Processors
Mnemonic Explanation Cycles
NOP No Operation 1ADD Rdest ⇐ Rsrc1 +Rsrc2 1SUB Rdest ⇐ Rsrc1 −Rsrc2 1MUL Rdest ⇐ Rsrc1 ∗Rsrc2 3DIV Rdest ⇐ Rsrc1/Rsrc2 3MOV Rdest ⇐ Rsrc 1LD Rdest ⇐ MEM[Rsrc1] 3ST MEM[Rdest ] ⇐ Rsrc1 3CMP Rdest ⇐ Rsrc1 > Rsrc2 1BRL Branch if less to Address 3BRA Branch always to Address 3MVIH Load (Immediate) higher order bytes of Rdest 1MVIL Load (Immediate) lower order bytes of Rdest 1RFMV Inter-cluster move for RF-to-RF Architectures 1
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 14/32
Pipeline of Processors
InstructionFetch
InstructionDecode
InstructionExecute
Put Away
ALU Operations
InstructionFetch
InstructionDecode Put Away
Memory
Access
Memory Operations
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 15/32
Instruction Encoding of Processors
4 Bits log2(RF Size) log2(RF Size)
InstructionSource Source
Register 2Destination
RegisterRegister 1
ALU Operations for WA
log2(RF Size)+1
4 Bits
Instruction
4 Bits log2(RF Size)
InstructionSource Source
Register 2Destination
RegisterRegister 1
ALU Operations for RA
log2(RF Size)log2(RF Size)+1
RFMV Operation
log2(RF Size) 4 Bits log2(RF Size)
SourceRegister 1
DestinationRegisterCluster
Destination
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 16/32
Detailed Architecture for WA.1
ALULD/STALU
Register File
ALULD/STALU
Register File
Dec
odin
g Lo
gic
Full Bypass Within a Cluster
Dec
odin
g Lo
gic
Full Bypass Within a Cluster
Condition CodeCondition CodeInstruction Instruction
One ClusterOne Cluster
To/From Off−Chip Data Memory
From Off−Chip Instruction Memory
To/From Off−Chip Data Memory
From
Pre
viou
s C
lust
er
To N
ext C
lust
er
Instruction Fetch and Branch Handling Logic
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 17/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 18/32
Evaluation Flow for ASIC Technologies
ParameterizedRTL VHDL
Libr
arie
sA
SIC
Clu
ster
Mod
el G
ener
atio
n
Synopsys DC
CadenceSoC Encounter
Synopsys DC
CadenceSoC Encounter
Static TimingAnalysis
InterconnectClock−periodArea
Cluster Black−box Timingand P&R Models
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 19/32
Pin Organization for Clusters
RF-to-RF Cluster
Clk
iram
_dat
a
CC
Res
etn
ldst
_ram
_(al
l)
ldst_ld_stn
ldst_enldst_rf_adr
ldst_adr_rf_adr
alu0_(all others)alu1_(all)
DATAPATH DECODER
rf_da
ta_b
usrf_
adr_
bus
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 20/32
Pin Organization for Clusters
WA.1 Cluster
Clk
iram
_dat
a
CC
Res
etn
alu0_res_out
ldst
_ram
_(al
l)
rfl_w
r_en
_nex
tldst_ld_stn
ldst_enldst_rf_adr
ldst_adr_rf_adr
alu0_(all others)alu1_(all)
alu0_res_adr
DATAPATH DECODER
rf_w
r_(a
ll ot
hers
)
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 20/32
Obtained Floorplans for RF-to-RF
Two Cluster Configuration
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 21/32
Obtained Floorplans for RF-to-RF
Four Cluster Configuration
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 21/32
Obtained Floorplans for RF-to-RF
Six Cluster Configuration
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 21/32
Obtained Floorplans for RF-to-RF
Ten Cluster Configuration
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 21/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 22/32
Variation of Clock-period for UMC 0.13µ
NClusters WA.1 RA.1 WA.2 RA.2 WR.2 RF
2 0.851 0.859 1.036 0.894 0.891 1.1524 0.883 0.944 1.030 1.001 0.956 1.9876 0.929 0.977 1.104 1.015 1.051 2.7608 1.005 1.065 1.077 1.108 1.112 3.89110 1.026 1.064 1.126 1.091 0.995 5.110
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 23/32
Variation of Clock-period for XC2V3000 Device
0
2
4
6
8
10
12
14
0 2 4 6 8 10 12
Clo
ck P
erio
d fo
r Xili
nx V
irtex
-II (n
s)
No. of Clusters
WA.1RA.1WA.2RA.2WR.2
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 24/32
Effective Variation in ILP for UMC 0.13µ
ILPe f f ective, x = ILPoriginal, x ∗Clk PeriodWA.1
Clk Periodx
0
20
40
60
80
100
0 2 4 6 8 10 12
ILP
as
Frac
tion
of P
ure
VLI
W (%
)
No. of Clusters
RFWA.1RA.1WA.2RA.2WR.2
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 25/32
Variation of Interconnect Area for UMC 0.13µ
NClusters WA.1 RA.1 WA.2 RA.2 WR.2 RF
2 39.45 39.34 39.42 39.16 39.33 43.434 36.71 36.65 36.63 36.53 36.65 36.676 35.69 35.66 35.73 35.61 35.55 35.718 34.90 34.62 35.02 34.82 34.84 34.9810 34.89 34.84 34.95 34.73 34.81 34.89
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 26/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 27/32
Conclusions
The increase in clock period for RF-to-RF architectures isvery high
For other interconnect architectures, the variation acrossdifferent interconnect mechanisms is negligible.
There is a small increase in clock period with an increase innumber of clusters for interconnect mechanisms other thanRF-to-RF.
Bidirectional connectivity architectures, WA.2 and RA.2,lose out somewhat in terms of clock period as compared totheir unidirectional connectivity architectures and this impactis more in smaller cluster configurations.
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 28/32
Conclusions (contd...)
While the interconnect lengths do vary in case ofunidirectional and bidirectional architectures this increase isin proportion to the increase in logic area.
The variation in interconnect area even with an increase innumber of cluster is not much. This is due to acorresponding increase in the logic area. Basically theinterconnect area per cluster is more of less constant.
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 29/32
Outline
VLIW Features and Typical Datapath Organization
Clustered VLIW Processors and Performance Evaluation of
Interconnects
Objectives of Clock-period Evaluation Excercise
Architecture of Modeled Processors
Clock-period Evaluation Flow
Experimental Results
Conclusions
Future work
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 30/32
Future Work
What could be done to bus based architectures to improveperformance: Multicycle and Pipelined buses
Development of fast performance estimation techniques
Search for an architecture which on the average gives goodperformance
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 31/32
Thank You
Thank You
Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 32/32
top related