nikolaos vassiliadis, george theodoridis and spiridon nikolaidis
DESCRIPTION
Aristotle University of Thessaloniki. Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support. Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis - PowerPoint PPT PresentationTRANSCRIPT
1
Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support
Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
E-mail: [email protected]
Aristotle University of Thessaloniki
2
Outline
Introduction
Target Architecture Overview
Partial Predicated Execution Enhancement
Virtual Opcode Enhancement
Development Framework
Experimental Results
Conclusions
3
Introduction
Characteristics of modern embedded applications Diversity of algorithms Rapid evolution of standards High performance demands
To amortize cost over high production volumes embedded systems must: Exhibit high levels of flexibility => fast Time-to-Market Exhibit high levels of adaptability => increased reusability
An appealing option => couple a reconfigurable hardware (RH) to a typical processor Processor => bulk of the flexibility RH => adaptation to the target application
Support by a development framework that hides RH related issues Maintain flexibility Continue to target software-oriented group of users
4
Target Architecture
Reconfigurable Instruction Set Processor (RISP)
Core processor 32-bit single issue RISC 5 pipeline stages
Reconfigurable Functional Unit (RFU) 1-D array of coarse-grain
processing elements (PEs)
An interface that tightly couples the RFU to the core Explicit communication
CONTROL LOGICR
EG
IS
TE
R F
IL
E
ALUM
UX
PIP
EL
IN
E R
EG
IS
TE
R
PIP
EL
IN
E R
EG
IS
TE
R
PIP
EL
IN
E R
EG
IS
TE
R
PIP
EL
IN
E R
EG
IS
TE
R
MULTIPLIER
SHIFTERDATA
MEMORY
CORE / RFU INTERFACE
PROCESSING & INTERCONNECT LAYERSCONFIGURATION LAYER
WRITE BACK DATA
CONTROL SIGNALS
I_DATA_INBUS
OPERANDS
1ST STAGE RESULT 2ND STAGE RESULT
Re OPCODE
STATUS SIGNALS
CONFIGURATION BITS
5
Target Architecture - ISA
Re=‘0’ => Standard Instruction Set Flexibility to execute any program
Re=‘1’ => Reconfigurable Instruction Set Extensions Offers the adaptation to the target application
Three types of Reconfigurable Instructions Complex computational operations Complex addressing modes Complex control flow operations
Re OpCode Source 1 Source 2 Destination Source 3 Source 4
32-Bit Instruction Word Format
6
Target Architecture - RFU
1-D Array of coarse-grain PEs Executes Reconfigurable Instructions
Multiple-Input-Single-Output (MISO) clusters of primitive operations Un-registered output
Chain of operations in the same clock cycle Registered output
Chain of pipelined operations Floating PEs => Can operate in both core pipeline stages on
demand Better utilization of the available hardware
PE REGISTER
MU
X
Operand1
Operand2
Function Sel Spatial-Temporal Sel
Result
INPUT NETWORK
OUTPUT NETWORK
PE BASIC STRUCTURE
OPERAND SELECT
OPERAND1
OPERAND2
PE RESULT
PE BASIC STRUCTURE
OPERAND SELECT
OPERAND1
OPERAND2
PE RESULT
1ST STAGE RESULT
2ND STAGE RESULT
FEEDBACK NETWORK
1ST STAGE OPERANDS
2ND STAGE OPERANDS
OPERANDS
CONSTANTS
7
Target Architecture – Configuration
Local configuration memory Multi-context No overhead to select a
context
Array of coarse-grain PEs => Small number of
configuration bit-stream per instruction
EXTERNAL CONFIGURATION
MEMORY
CO
NF
IGU
RA
TIO
N
CO
NT
RO
LL
ER
CONFIGURATION BITS LOCAL STORAGE
CONFIGURATION 0
CONFIGURATION 1
CONFIGURATION 2
CONFIGURATION 3
CONFIGURATION BITS
8
Target Architecture – Synthesis Results
A hardware model (VHDL) was designed
Synthesis results with STM 0.13um
Reasonable area overhead No overhead to core critical path
Configuration Value
Granularity32-bits
(16x16Multiplier)
Number of Processing Elements 8
Processing Elements FunctionalityALU, Shifter,
Multiplier
Configuration Contexts 16 words of 134 bits
Local Memory Size 8 constants of 32-bits
Number of Provided Local Operands 4
Component Area (mm2)
Processor Core 0.134
RFU Processing Layer 0.186
RFU Interconnection Layer
0.125
RFU Configuration Layer 0.137
RFU Total 0.448
9
Enhancement with Partial Predicated Execution
Predication Eliminate branches from an instruction stream Conditional execution of an instruction Utilized to expose Instruction Level Parallelism
Our approach => partial predicated execution to eliminate the branch in an “if-then-else” statement
Large clusters of operations => increased performance
-+CMP MUX
b c d
0a
f
g
SELECT Instructionif a<0 then g=b+c;
else g=d-f;
10
Support of Partial Predicated Execution
The available output network can be utilized
Extensions Two configuration bits Two multiplexers Hardwired connections to PEs
Selection of the RFU output Controlled by configuration bits => no
predication Controlled by comparison result =>
predicated execution Comparison => implemented in a PE
Output Network
1st Stage Result
2nd Stage Result
1st PE Result
n PE Result
1st CMP Result
1st Output Config.
1st Sel Bit MUX
2nd Output Config.
2nd Sel BitMUX
2nd CMP Result
11
Enhancement with Virtual Opcode
Explicitly communication between Core and RFU Opcode explosion problem
Proposed solution => “Virtual” opcode Virtual opcode = Natural opcode + code region
Overhead => Configuration memory size Coarse grain => Small configuration size => 136 bits/per instr.
In general Virtual opcode can performed by flushing and reload the whole local memory Large performance overhead Applicable for different applications
12
Support of Virtual Opcode
Local Configuration memory => extended with extra level of contexts First level = K contexts of locally available
reconfigurable instructions Second level = L copies of the first level for
different code regions
For each code region only one of L contexts is active The same natural opcode in different region
context forms a virtual opcode Partitioning of regions and issue of activation
performed by the compiler One cycle overhead to activate a context
Configuration memory size = K*L*Conf. Bits per Instr.
Instruction 1
Instruction K
Context 1
Instruction 1
Instruction K
Context L
ContextSelect
Set Active Context
ConfigBits
OpCode
13
Development Framework
Automated framework for the development of applications in the architecture
Transparent incorporation of the reconfigurable instructions set extensions
Based on the SUIF/MachSUIF compiler infrastructure
C/C++
Front-EndMachSUIF
Optimized IR in CDFG form
Basic Block Profiling Results
Instruction Selection
Back-EndStatistics
Executable Code
Profiling
Instrumentation
m2cInstr.Gen.
Pattern Gen.
Mapping
Instr. Extens.
User Defined Parameters
14
Dev. Framework – Front End / Profiling
Application source code translated in CDFG (SUIFvm operations)
Perform machine independent optimizations
If-conversion for partial predicated execution can be applied
CDFG instrumented with profiling annotations translated to equivalent C code compiled and executed in the host
Profiling information are collected Regions execution frequency
Application Source Code
DFG #1
DFG #2
+
-
+
a b
e
dc
...dfg1++; //profiling codevr1=a+b;vr2=c+d;e=vr1+vr2;...
15
Dev. Framework – Instruction Generation
First step = Pattern Generation In-house tool for the identification of MISO
cluster of operations based on the MaxMISO algorithm
Second step = Mapping of MISO in the RFU1. Place the SUIFvm nodes in PEs / Route
the 1-D array
2. Analyze paths and set the output of a PE (reg./unreg.) to minimize delay
3. Report candidate instruction semantics
ADD
SUB
NEG SHIFT
NEG
SHIFT
register register constant
register
register
register
Candidate2
Candidate1
PE1 PE2
PE3
Candidate2 src1: $vr1 src2: $vr1 src3: $vr3 dst: $vr4{ region: func1 – dfg1 PE1: sub, output: reg PE2: neg, output: un-reg ……………………………………… edg1: in1-PE1, in2-PE1…………. ………………………………………. latency: 1 cycle type: comp static gain: 2}
16
Dev. Framework – Instruction Selection (1/2)
No Virtual opcode
Consider the whole application space
Perform pair-wise graph isomorphism to identify identical candidate instructions
Calculate dynamic gain offered by each candidate Dynamic = Static x Frequency
Rank candidate instructions based on dynamic gain
Select best L instructions L defined by the number of supported instructions
17
Dev. Framework – Instruction Selection (2/2)
With Virtual opcode enabled
Partition application code into regions Currently supporting only procedures
Perform Graph isomorphism per region
Calculate dynamic gain offered by each candidate for each region
Calculate overhead to set active the region contexts
Rank regions and candidate instructions based on dynamic gain
Select best K regions and best L instructions from each region L, K defined by the supported contexts and instructions per context
18
Experimental Results
Prove the performance improvements offered by the proposed architecture
Evaluate the efficiency of the enhancements
A complete MPEG-2 encoding application is used Source code from MediaBench benchmark suite Input data => a video sequence consisting of 12 frames with resolution
of 144x176 pixels
19
Exp. Results – SpeedUp Analysis
Speedup analysis for the most timing consuming functions of MPEG2 enc.
Accelerate only critical regions => small overall speedup (Amdahl)
Our approach accelerates the whole application’s space => overall speedup is preserved
Instr.Count (106)(No RFU)
SpeedUpSpeedUp
(Incremental)
SAD 589.0 6.6 1.5
dist1 1206.0 3.4 2.3
fullsearch 73.5 2.0 2.5
bdist1 18.0 2.0 2.5
putbits 16.3 2.3 2.6
fdct 15.6 2.3 2.6
quant 13.1 2.6 2.7
idctcol 11.4 2.4 2.7
dct 10.4 2.3 2.7
pred_comp 10.1 1.9 2.7
iquant 9.9 1.8 2.8
add_pred 8.0 2.0 2.8
bdist2 7.3 1.8 2.8
idctrow 7.0 2.2 2.8
putnonintrablk 6.9 1.8 2.8
sub_pred 6.6 1.8 2.9
Overall 1448.7 2.9 2.9
20
Exp. Results – Evaluation of predication
Example of four instructions derived using if conversion and partial predicated execution These instructions implement the SAD function
Significant performance improvements are offered
+
+
>>
-
+ -CMP
MUX
Reg. Reg.
Reg.
Reg.
Const.
Const.
Const.
Reg.
-
-+CMP
MUX
Reg. Reg.
Reg.
Reg.
Const.
+ +
+
+>>
Reg. Reg. Reg. Reg.
Reg.
Const.
Const.
-
-+CMP
MUX
Reg. Reg.
Reg.
Reg.
Const.
Instruction 1 Instruction 2 Instructions 3a & 3b
SAD
Speedup
Overall
Speedup
No predic.
1.7 1.7
Predic. 6.6 2.9
21
Exp. Results – Evaluation of Virtual Opcode
Virtual opcode can be used to preserve speedups for architectures with limited opcode space
Reasonable overhead for the local configuration memory size
Finer partitioning of regions could result to more impressive results
Memory
Organization
(inst.Xcont.)
Speedup Memory
Size (KB)
4x8 1.7 0.5
8x8 2.0 1.1
16x12 2.8 3.2
32x12 3.0 8.7
Unconstr. 3.1 -
Contexts
1,5
1,7
1,9
2,1
2,3
2,5
2,7
2,9
3,1
4 8 16 32 64
Instructions per Context
Sp
ee
dU
p
16 12 8 4 2 Unified
22
Conclusions
Two enhancements to a previously proposed RISP architecture have been proposed Partial predicated execution => increase performance Virtual opcode => relaxes opcode space pressure
An automated development framework have been presented Hides the reconfigurable hardware from the user Supports the two enhancements
The efficiency of the RISP and enhancements have been proved using an MPEG2 encoding application
Future research Support full predication for further performance improvements Support finer partitioning of regions for better utilization of virtual opcode
23
Thank you !!!
Questions ??