
1

Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support

Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis

Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

E-mail: [email protected]

Aristotle University of Thessaloniki

2

Outline

Introduction

Target Architecture Overview

Partial Predicated Execution Enhancement

Virtual Opcode Enhancement

Development Framework

Experimental Results

Conclusions

3

Introduction

Characteristics of modern embedded applications:
  - Diversity of algorithms
  - Rapid evolution of standards
  - High performance demands

To amortize cost over high production volumes, embedded systems must:
  - Exhibit high levels of flexibility => fast time-to-market
  - Exhibit high levels of adaptability => increased reusability

An appealing option => couple reconfigurable hardware (RH) to a typical processor:
  - Processor => bulk of the flexibility
  - RH => adaptation to the target application

Support by a development framework that hides RH-related issues:
  - Maintain flexibility
  - Continue to target a software-oriented group of users

4

Target Architecture

Reconfigurable Instruction Set Processor (RISP)

Core processor:
  - 32-bit single-issue RISC
  - 5 pipeline stages

Reconfigurable Functional Unit (RFU):
  - 1-D array of coarse-grain processing elements (PEs)

An interface that tightly couples the RFU to the core:
  - Explicit communication

[Figure: RISP block diagram. Core blocks: control logic, register file, ALU, MUX, multiplier, shifter, data memory, pipeline registers. RFU blocks: core/RFU interface, processing and interconnect layers, configuration layer. Labeled signals: operands, 1st stage result, 2nd stage result, Re, opcode, status signals, control signals, configuration bits, write back data, I_DATA_INBUS.]

5

Target Architecture - ISA

Re='0' => Standard instruction set
  - Flexibility to execute any program

Re='1' => Reconfigurable instruction set extensions
  - Offers the adaptation to the target application

Three types of reconfigurable instructions:
  - Complex computational operations
  - Complex addressing modes
  - Complex control flow operations

32-bit instruction word format (fields in order):
  Re | OpCode | Source 1 | Source 2 | Destination | Source 3 | Source 4
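The instruction word layout can be illustrated with a small packing and unpacking sketch. The slides give only the field order, not the field widths, so the widths below (1-bit Re, 6-bit opcode, five 5-bit register fields) are purely hypothetical values chosen so that the total is 32 bits. The names `FIELDS`, `encode`, `decode`, and `is_reconfigurable` are also illustrative.

```python
# Field order follows the slide; the WIDTHS are an assumption, not taken
# from the presentation: 1 + 6 + 5*5 = 32 bits.
FIELDS = [("re", 1), ("opcode", 6), ("src1", 5), ("src2", 5),
          ("dst", 5), ("src3", 5), ("src4", 5)]


def encode(**values):
    """Pack named field values into one 32-bit word, first field in the MSBs."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack a 32-bit word into a dict of field values."""
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def is_reconfigurable(word):
    """Re = 1 marks a reconfigurable instruction, Re = 0 a standard one."""
    return decode(word)["re"] == 1
```

With these assumed widths, a word built by `encode(re=1, opcode=3, src1=1, src2=2, dst=3, src3=4, src4=5)` decodes back to the same field values, and `is_reconfigurable` reports it as an RFU instruction.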

6

Target Architecture - RFU

1-D array of coarse-grain PEs:
  - Executes reconfigurable instructions: Multiple-Input-Single-Output (MISO) clusters of primitive operations
  - Un-registered output => chain of operations in the same clock cycle
  - Registered output => chain of pipelined operations
  - Floating PEs => can operate in both core pipeline stages on demand
  - Better utilization of the available hardware

[Figure: RFU organization. PE basic structure: operand select (Operand1, Operand2), function select, PE, register, and a MUX driven by the spatial-temporal select that produces the PE result. Surrounding blocks: input network (operands, constants, 1st stage operands, 2nd stage operands), output network (1st stage result, 2nd stage result), feedback network.]

7

Target Architecture – Configuration

Local configuration memory:
  - Multi-context
  - No overhead to select a context

Array of coarse-grain PEs => small number of configuration bits per instruction

[Figure: configuration path. Blocks: external configuration memory, configuration controller, local storage of configuration bits (Configuration 0 to Configuration 3), configuration bits output.]

8

Target Architecture – Synthesis Results

A hardware model (VHDL) was designed

Synthesis results with STM 0.13 um:
  - Reasonable area overhead
  - No overhead to core critical path

Configuration                        Value
Granularity                          32 bits (16x16 multiplier)
Number of Processing Elements        8
Processing Elements Functionality    ALU, Shifter, Multiplier
Configuration Contexts               16 words of 134 bits
Local Memory Size                    8 constants of 32 bits
Number of Provided Local Operands    4

Component                     Area (mm^2)
Processor Core                0.134
RFU Processing Layer          0.186
RFU Interconnection Layer     0.125
RFU Configuration Layer       0.137
RFU Total                     0.448

9

Enhancement with Partial Predicated Execution

Predication:
  - Eliminates branches from an instruction stream
  - Conditional execution of an instruction
  - Utilized to expose instruction-level parallelism

Our approach => partial predicated execution to eliminate the branch in an "if-then-else" statement

Large clusters of operations => increased performance

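Example statement mapped to a SELECT instruction:

  if a<0 then g=b+c;
  else g=d-f;

[Figure: data-flow graph of the SELECT instruction. Nodes: + (inputs b, c), - (inputs d, f), CMP (inputs a, 0), MUX (output g).]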
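The effect of if-conversion on the example statement can be sketched in software as follows. This is only a behavioral illustration, not the hardware mapping: both arms are computed unconditionally and a select picks the result, which is what allows the whole statement to form one branch-free cluster. The function names are illustrative.

```python
def with_branch(a, b, c, d, f):
    """Original control flow: only one arm is evaluated."""
    if a < 0:
        g = b + c
    else:
        g = d - f
    return g


def select(predicate, if_true, if_false):
    """Models the MUX: forwards one of two already-computed values."""
    return if_true if predicate else if_false


def if_converted(a, b, c, d, f):
    """Partial predication: compute both arms, then select on the comparison."""
    then_value = b + c      # '+' node
    else_value = d - f      # '-' node
    predicate = a < 0       # CMP node
    return select(predicate, then_value, else_value)  # MUX node
```

Both versions return the same value for every input; the if-converted form trades one extra arithmetic operation for the removal of the branch.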

10

Support of Partial Predicated Execution

The available output network can be utilized

Extensions:
  - Two configuration bits
  - Two multiplexers
  - Hardwired connections to PEs

Selection of the RFU output:
  - Controlled by configuration bits => no predication
  - Controlled by comparison result => predicated execution
  - Comparison => implemented in a PE

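[Figure: output network with predication support. Inputs: 1st PE result to n-th PE result. Outputs: 1st stage result, 2nd stage result. Per output, a MUX with a select bit and two inputs, the output configuration and the CMP result (1st output config., 1st sel bit, 1st CMP result; 2nd output config., 2nd sel bit, 2nd CMP result).]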

11

Enhancement with Virtual Opcode

Explicit communication between core and RFU:
  - Opcode explosion problem

Proposed solution => "virtual" opcode:
  - Virtual opcode = natural opcode + code region

Overhead => configuration memory size:
  - Coarse grain => small configuration size => 136 bits per instruction

In general, a virtual opcode can be implemented by flushing and reloading the whole local memory:
  - Large performance overhead
  - Applicable for different applications

12

Support of Virtual Opcode

Local configuration memory => extended with an extra level of contexts:
  - First level = K contexts of locally available reconfigurable instructions
  - Second level = L copies of the first level for different code regions

For each code region only one of the L contexts is active:
  - The same natural opcode in a different region context forms a virtual opcode
  - Partitioning of regions and issue of activation performed by the compiler
  - One cycle overhead to activate a context

Configuration memory size = K*L*Conf. Bits per Instr.

[Figure: two-level configuration memory. Context 1 to Context L, each holding Instruction 1 to Instruction K. Labels: Set Active Context, Context Select, OpCode, Config Bits.]
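A minimal behavioral sketch of the two-level configuration memory is given below. It assumes K instruction slots per context and L contexts, uses the 136 bits per instruction quoted earlier as the default entry width, and treats configuration words as opaque values. The class and method names are illustrative and do not come from the presentation.

```python
class ConfigMemory:
    """Toy model: L contexts, each with K instruction slots."""

    def __init__(self, instrs_per_context, num_contexts, bits_per_instr=136):
        self.instrs_per_context = instrs_per_context   # K
        self.num_contexts = num_contexts               # L
        self.bits_per_instr = bits_per_instr
        self.table = [[None] * instrs_per_context for _ in range(num_contexts)]
        self.active = 0

    def load(self, context, opcode, config_word):
        """Store the configuration for a natural opcode in one context."""
        self.table[context][opcode] = config_word

    def set_active_context(self, context):
        """Models the activation issued by the compiler on region entry."""
        if not 0 <= context < self.num_contexts:
            raise ValueError("no such context")
        self.active = context

    def lookup(self, opcode):
        """Virtual opcode = natural opcode resolved in the active context."""
        return self.table[self.active][opcode]

    def size_bits(self):
        """Configuration memory size = K * L * bits per instruction."""
        return self.instrs_per_context * self.num_contexts * self.bits_per_instr
```

In this model the same natural opcode returns different configuration words depending on which context is active, which is the behavior the virtual opcode relies on. For example, 4 instructions in each of 8 contexts at 136 bits per entry occupy 4 * 8 * 136 = 4352 bits.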

13

Development Framework

Automated framework for the development of applications for the architecture

Transparent incorporation of the reconfigurable instruction set extensions

Based on the SUIF/MachSUIF compiler infrastructure

[Figure: development flow from C/C++ source to executable code. Blocks: Front-End (MachSUIF), Optimized IR in CDFG form, Profiling (Instrumentation, m2c), Basic Block Profiling Results, Instr. Gen. (Pattern Gen., Mapping), Instr. Extens., Instruction Selection, User Defined Parameters, Back-End, Statistics, Executable Code.]

14

Dev. Framework – Front End / Profiling

Application source code translated into a CDFG (SUIFvm operations)

Perform machine-independent optimizations

If-conversion for partial predicated execution can be applied

The CDFG is instrumented with profiling annotations, translated to equivalent C code, then compiled and executed on the host

Profiling information is collected:
  - Execution frequency of regions

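[Figure: application source code partitioned into DFG #1 and DFG #2, shown as operation nodes (+, -, +) over the values a, b, c, d and e.]

Instrumented code for DFG #1:

  ...
  dfg1++; //profiling code
  vr1=a+b;
  vr2=c+d;
  e=vr1+vr2;
  ...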

15

Dev. Framework – Instruction Generation

First step = Pattern generation
  - In-house tool for the identification of MISO clusters of operations, based on the MaxMISO algorithm

Second step = Mapping of MISO clusters in the RFU
  1. Place the SUIFvm nodes in PEs / route the 1-D array
  2. Analyze paths and set the output of a PE (reg./unreg.) to minimize delay
  3. Report candidate instruction semantics

[Figure: a DFG with ADD, SUB, NEG and SHIFT nodes over register and constant inputs, partitioned into Candidate1 and Candidate2 and mapped to PE1, PE2 and PE3.]

Candidate2 src1: $vr1 src2: $vr1 src3: $vr3 dst: $vr4
{
  region: func1 – dfg1
  PE1: sub, output: reg
  PE2: neg, output: un-reg
  ...
  edg1: in1-PE1, in2-PE1 ...
  ...
  latency: 1 cycle
  type: comp
  static gain: 2
}
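The pattern generation step can be illustrated with a simplified sketch in the spirit of MaxMISO clustering. It is not the authors' in-house tool. It assumes the DFG is given as a dictionary that maps every operation node to the list of operation nodes feeding it, that every node appears as a key, and that all operations are eligible for the RFU. Nodes are visited from the sinks upward, and each cluster absorbs a predecessor only when all of that predecessor's consumers are already inside the cluster, which keeps the cluster single-output.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def max_misos(preds):
    """Partition a DFG into maximal single-output clusters.

    preds: dict mapping each operation node to the operation nodes feeding it.
    Returns a list of sets of nodes.
    """
    # Build the consumer (successor) sets from the predecessor lists.
    succs = {node: set() for node in preds}
    for node, inputs in preds.items():
        for producer in inputs:
            succs[producer].add(node)

    # Visit consumers before producers (reverse topological order).
    order = list(TopologicalSorter(preds).static_order())[::-1]

    unassigned = set(preds)
    clusters = []
    for root in order:
        if root not in unassigned:
            continue
        cluster = {root}
        worklist = [root]
        while worklist:
            node = worklist.pop()
            for producer in preds[node]:
                # Absorb a producer only if every consumer is in the cluster.
                if (producer in unassigned and producer not in cluster
                        and succs[producer] <= cluster):
                    cluster.add(producer)
                    worklist.append(producer)
        unassigned -= cluster
        clusters.append(cluster)
    return clusters
```

For a graph where `add` feeds `sub`, and `sub` feeds both `neg` and `shift`, the sketch yields three clusters: `{sub, add}`, `{neg}` and `{shift}`, because the result of `sub` is needed by two different consumers and therefore has to leave its cluster.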

16

Dev. Framework – Instruction Selection (1/2)

No Virtual opcode

Consider the whole application space

Perform pair-wise graph isomorphism to identify identical candidate instructions

Calculate dynamic gain offered by each candidate:
  - Dynamic = Static x Frequency

Rank candidate instructions based on dynamic gain

Select best L instructions:
  - L defined by the number of supported instructions
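The ranking step can be sketched as follows. The sketch assumes that the isomorphism check has already reduced each candidate to a canonical pattern key, so identical candidates share a key and their gains add up. The tuple layout and the function name are illustrative.

```python
from collections import defaultdict


def select_instructions(candidates, max_instructions):
    """Rank patterns by accumulated dynamic gain and keep the best ones.

    candidates: iterable of (pattern_key, static_gain, frequency) tuples,
    one per occurrence in the application.
    Returns a list of (pattern_key, dynamic_gain), best first.
    """
    dynamic_gain = defaultdict(int)
    for pattern, static_gain, frequency in candidates:
        # Dynamic = Static x Frequency, summed over identical candidates.
        dynamic_gain[pattern] += static_gain * frequency
    ranked = sorted(dynamic_gain.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_instructions]
```

A candidate with a small static gain can still rank first when it sits in a frequently executed block, which is why the profiling data drives the selection rather than the static gain alone.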

17

Dev. Framework – Instruction Selection (2/2)

With Virtual opcode enabled

Partition application code into regions:
  - Currently supporting only procedures

Perform graph isomorphism per region

Calculate dynamic gain offered by each candidate for each region

Calculate the overhead of activating the region contexts

Rank regions and candidate instructions based on dynamic gain

Select best K regions and best L instructions from each region:
  - L, K defined by the supported contexts and instructions per context
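A possible shape for the region-aware selection is sketched below. Several details are assumptions made for illustration: dynamic gains are expressed in saved cycles, each region activation costs one cycle (the figure quoted for context activation), the number of activations per region comes from profiling, and regions whose net gain is not positive are dropped. The data layout and names are hypothetical.

```python
def select_with_virtual_opcode(regions, num_contexts, instrs_per_context,
                               activation_cycles=1):
    """Pick the most profitable regions and the best instructions in each.

    regions: dict mapping region name to
             {"activations": int, "candidates": {pattern_key: dynamic_gain}}
    Returns a list of (region, [pattern_key, ...], net_gain), best first.
    """
    scored = []
    for name, info in regions.items():
        # Best instructions of this region, limited by the context capacity.
        best = sorted(info["candidates"].items(),
                      key=lambda item: item[1], reverse=True)[:instrs_per_context]
        gain = sum(value for _, value in best)
        # Overhead of setting the region context active on every entry.
        overhead = activation_cycles * info["activations"]
        scored.append((name, [pattern for pattern, _ in best], gain - overhead))

    scored.sort(key=lambda entry: entry[2], reverse=True)
    return [entry for entry in scored[:num_contexts] if entry[2] > 0]
```

In this sketch a region that is entered very often but offers little gain can lose its context to a less frequently entered region, since the activation overhead is charged on every entry.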

18

Experimental Results

Prove the performance improvements offered by the proposed architecture

Evaluate the efficiency of the enhancements

A complete MPEG-2 encoding application is used:
  - Source code from the MediaBench benchmark suite
  - Input data => a video sequence consisting of 12 frames with a resolution of 144x176 pixels

19

Exp. Results – SpeedUp Analysis

Speedup analysis for the most time-consuming functions of the MPEG-2 encoder

Accelerate only critical regions => small overall speedup (Amdahl)

Our approach accelerates the whole application’s space => overall speedup is preserved

Function          Instr. Count (10^6, No RFU)    SpeedUp    SpeedUp (Incremental)
SAD               589.0                          6.6        1.5
dist1             1206.0                         3.4        2.3
fullsearch        73.5                           2.0        2.5
bdist1            18.0                           2.0        2.5
putbits           16.3                           2.3        2.6
fdct              15.6                           2.3        2.6
quant             13.1                           2.6        2.7
idctcol           11.4                           2.4        2.7
dct               10.4                           2.3        2.7
pred_comp         10.1                           1.9        2.7
iquant            9.9                            1.8        2.8
add_pred          8.0                            2.0        2.8
bdist2            7.3                            1.8        2.8
idctrow           7.0                            2.2        2.8
putnonintrablk    6.9                            1.8        2.8
sub_pred          6.6                            1.8        2.9
Overall           1448.7                         2.9        2.9
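The Amdahl effect mentioned above can be checked with a short calculation. The sketch assumes that instruction count is a fair proxy for execution time, which the slides do not state, and the function name is illustrative.

```python
def overall_speedup(total_count, accelerated):
    """Amdahl-style overall speedup.

    total_count: instruction count of the whole application without the RFU.
    accelerated: list of (count, speedup) pairs for the accelerated functions.
    """
    untouched = total_count - sum(count for count, _ in accelerated)
    accelerated_cost = sum(count / speedup for count, speedup in accelerated)
    return total_count / (untouched + accelerated_cost)


# SAD alone: 589.0 of 1448.7 million instructions, accelerated 6.6 times.
sad_only = overall_speedup(1448.7, [(589.0, 6.6)])
```

Under that assumption, accelerating SAD alone gives an overall speedup of about 1.5, which agrees with the first entry of the incremental column: a 6.6x gain on one function is diluted by the rest of the application.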

20

Exp. Results – Evaluation of predication

Example of four instructions derived using if-conversion and partial predicated execution:
  - These instructions implement the SAD function

Significant performance improvements are offered

[Figure: data-flow graphs of Instruction 1, Instruction 2 and Instructions 3a & 3b, built from +, -, >>, CMP and MUX nodes over register and constant inputs.]

              SAD Speedup    Overall Speedup
No predic.    1.7            1.7
Predic.       6.6            2.9

21

Exp. Results – Evaluation of Virtual Opcode

Virtual opcode can be used to preserve speedups for architectures with limited opcode space

Reasonable overhead for the local configuration memory size

Finer partitioning of regions could lead to better results

Memory Organization (inst. x cont.)    Speedup    Memory Size (KB)
4x8                                    1.7        0.5
8x8                                    2.0        1.1
16x12                                  2.8        3.2
32x12                                  3.0        8.7
Unconstr.                              3.1        -

[Figure: speedup (y axis, 1.5 to 3.1) versus instructions per context (x axis: 4, 8, 16, 32, 64), one series per number of contexts (16, 12, 8, 4, 2) plus a unified configuration.]

22

Conclusions

Two enhancements to a previously proposed RISP architecture have been proposed:
  - Partial predicated execution => increases performance
  - Virtual opcode => relaxes opcode space pressure

An automated development framework has been presented:
  - Hides the reconfigurable hardware from the user
  - Supports the two enhancements

The efficiency of the RISP and its enhancements has been demonstrated using an MPEG-2 encoding application

Future research:
  - Support full predication for further performance improvements
  - Support finer partitioning of regions for better utilization of virtual opcode

23

Thank you !!!

Questions ??
