
Trimmed VLIW: Moving Application Specific Processors Towards High Level Synthesis

Janarbek Matai, Jason Oberg, Ali Irturk, Taemin Kim, and Ryan Kastner

Dept. of Computer Science and Engineering

University of California, San Diego

Intel Labs, Portland

The 2012 Electronic System Level Synthesis Conference

June 02, 2012

Overview

Background: Motivation; Problem Definition; High-Level Synthesis; ASIP

Simulate and Eliminate (S&E) Approach: Trimming; Flexibility

Related Work: HLS; Application Specific Processor Design

Trimmed VLIW: Trimmed VLIW Overview; Base Architecture Generation; Case 1: Trimmed VLIW with register file; Case 2: Trimmed VLIW with discrete registers

Results: Case 1 Results; Case 2 Results; Flexibility Results

Conclusion and Future Work

Motivation

Custom hardware has been widely used in various fields (e.g., embedded systems).

Custom hardware is evaluated on performance, power, and flexibility.

Automating custom hardware design

ASIP (Application-Specific Instruction-Set Processor)

High-level synthesis

Problem Definition

1. High-level Synthesis

Pros: Time-to-market; Reasonable performance

Cons: Flexibility & scalability; Code size>500 (700); Decreased control on HDL

Flow: Application -> HLS Tool -> HDL

2. ASIP

Pros: Flexibility; Performance > CPU

Cons: Manual design of ISA and compiler; Area

Flow: Application -> ASIP Designer Tool -> HDL, ISA, Compiler

High-Level Synthesis

Most current HLS design tools work as follows:

The datapath is pieced together by scheduling and binding the operations.

The controller is generated as an FSM, which often gets unwieldy for large designs.

[Figure: initial application dataflow graph; scheduled dataflow graph; final datapath and controller (MUL, ADD, Registers 1 to 3, FSM controller).]

ASIP (Application-Specific Instruction-Set Processor)

Application & Application Analysis: understand application characteristics.

Architectural Design Space Exploration: processor architectural parameters (functional units, register file, number of registers, number of read/write ports).

ISA Design: define the ISA and the instruction set size.

Datapath/Control Microarchitecture Design

Compiler Design: compiler for the ISA.

ASIP: a better trade-off between efficiency and flexibility.

Simulate and Eliminate (S&E): Start with General Purpose Processor and Customize

S&E can design a flexible ASIP and further customize it by trimming.

Overview


S&E Approach Trimming Flexibility






What is “Trimming”?

S&E starts with an ASIP and further customizes it by trimming.

Trimming is the removal of unused resources: functional units, registers, and interconnects.

In this work, we focus on interconnect and multiplexer trimming.

“Wires” are trimmed in the IxC and in the register file.

Trimming unused functional units and registers is straightforward
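
The wire-trimming idea can be sketched in a few lines. This is a minimal illustration, not the authors' tool: it assumes each wire is modeled as a (source port, destination mux) pair and that a simulator reports the wires exercised in each cycle. The function names and the example trace are hypothetical.

```python
from itertools import product

def full_interconnect(read_ports, mux_inputs):
    """Base architecture: every read port is wired to every mux input."""
    return set(product(read_ports, mux_inputs))

def trim(wires, trace):
    """Keep only the wires that the simulated schedule actually exercised.

    `trace` is a list with one collection of (source, destination) pairs per cycle.
    """
    used = set()
    for cycle in trace:
        used.update(cycle)
    unknown = used - wires
    if unknown:
        raise ValueError(f"trace uses wires missing from the base design: {sorted(unknown)}")
    return wires & used

# Hypothetical example: 4 read ports feeding the 2 input muxes of one functional unit.
base = full_interconnect(["RD1", "RD2", "RD3", "RD4"], ["M3", "M4"])
trace = [
    [("RD1", "M3"), ("RD2", "M4")],  # cycle 0
    [("RD1", "M3"), ("RD3", "M4")],  # cycle 1
]
trimmed = trim(base, trace)
removed_fraction = 1 - len(trimmed) / len(base)
```

In this toy trace, RD4 is never read, so both of its wires are removed along with three others that the schedule never uses.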

IxC Trimming

[Figure: Trimmed VLIW with register file: control, register file (R1, R2, ..., RN) with read data ports RD1 to RD4 and write ports, IxC with muxes M1 to M9, and functional units FU1 to FU3 with outputs out1 to out3. Callouts: 1. IxC Trimming; 2. IxC Trimming.]

IxC Trimming Example

Register read port to functional unit input port trimming

[Figure: three stages showing read ports RD1 to RD4 of the RF connected through muxes M3 and M4 to FU2.]

Functional unit output port to register file write port trimming

[Figure: three stages showing outputs out1 to out3 connected through muxes M7 and M8 to write ports Wp1 and Wp2; the last stage shows only M8.]

Register File Trimming

[Figure: Trimmed VLIW with register file (control, write ports, IxC, FU1 to FU3), with a detail of the RF: registers R1 to R3, muxes M10 to M13, and read ports RD1 to RD4.]

S&E (Simulate and Eliminate) Approach

Start with a general-purpose processor and customize it by trimming unused components.

In this work, we study the application of S&E to VLIW-like architectures.

[Figure: two qualitative plots, performance versus flexibility and area versus flexibility, each positioning HLS, ASIP, and S&E.]

Flexibility

The flexibility of an application-specific processor (customized hardware) is its ability to run a set of applications.

E.g., an ASP (Application Specific Processor) that runs both edge detection and corner detection.

Flexible Customized Architecture (FCA): a customized architecture that is still flexible enough to run a set of applications.

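Flexibility (cont’d)

Suppose we have customized architectures for applications A and B.

[Figure: architecture A: register file with R1, R2; read ports RD1, RD2; mux M1; multiplier mul1; output Out 1.]

[Figure: architecture B: register file with R1, R2, R3; read ports RD1, RD2, RD3; muxes M2, M3, M4; multiplier mul2 and adder add; outputs Out 2, Out3.]

[Figure: architecture C: register file with R1, R2, R3; read ports RD1, RD2, RD3; muxes M2, M3, M4, M5; multiplier mul and adder add; outputs Out3/Out1, Out 2.]

C can run both A and B.

1. C is flexible

2. C is application specific

Architectural parameters of C:

1. RD1, RD2, RD3

2. R1, R2, R3

3. RD1M1, RD1M4

4. #mul=1 #add=1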
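
One way to picture how C is obtained from A and B is as a merge of their resources. The sketch below is an assumption about how such a merge could be modeled, not the authors' implementation: registers and read ports are merged by set union, functional-unit counts by maximum, and `can_run` only checks resource coverage. The dictionary layout and function names are hypothetical; the resource lists follow the slide.

```python
def merge(a, b):
    """Build an architecture that offers at least the resources of both inputs."""
    kinds = set(a["units"]) | set(b["units"])
    units = {k: max(a["units"].get(k, 0), b["units"].get(k, 0)) for k in kinds}
    return {
        "registers": a["registers"] | b["registers"],
        "read_ports": a["read_ports"] | b["read_ports"],
        "units": units,
    }

def can_run(arch, app):
    """True if `arch` provides every resource that `app` was customized for."""
    return (
        app["registers"] <= arch["registers"]
        and app["read_ports"] <= arch["read_ports"]
        and all(arch["units"].get(k, 0) >= n for k, n in app["units"].items())
    )

A = {"registers": {"R1", "R2"}, "read_ports": {"RD1", "RD2"}, "units": {"mul": 1}}
B = {
    "registers": {"R1", "R2", "R3"},
    "read_ports": {"RD1", "RD2", "RD3"},
    "units": {"mul": 1, "add": 1},
}
C = merge(A, B)
```

With these inputs, C has registers R1 to R3, read ports RD1 to RD3, one multiplier, and one adder, which agrees with the parameters listed for C on the slide. A real merge would also have to combine the mux and wire sets.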

Overview



Related Work HLS Application Specific Processor Design (GUSTO, Tensilica, ISA Subsetting, NISC, PICO)





High Level Synthesis: HLS

AutoESL (Xilinx HLS tool)

Creates an RTL implementation from C-level source code.

Implements the design based on defaults and user-applied directives.

Many implementations are possible from the same source description: smaller designs, faster designs, optimal designs.

Enables design exploration.

Other HLS tools: e.g., Catapult C, Forte Cynthesizer, C2V, Impulse, BlueSpec.

[Figure: AutoESL flow: C, C++, SystemC plus a script with constraints/directives go into AutoESL, which emits VHDL, Verilog, or SystemC for RTL synthesis. Courtesy of Xilinx.]

GUSTO (General architecture design Utility and Synthesis Tool for Optimization)

Resource Trimming for Hardware Optimization

Irturk et al., GUSTO [TECS’00]: simulates the architecture to determine the usage of hardware resources, then trims away the unused components together with their interconnects.

[Figure: multiplier, adder, and memory with full connectivity versus required connectivity.]

GUSTO targets RISC; in this work, we focus on VLIW.

Tensilica

XTENSA: a configurable and extensible processor [IEEE MICRO’00].

Configurability: e.g., the designer can choose to have execution units such as a 16x16 multiplier, a floating-point unit, a barrel shifter, etc.

Extensibility: enables the designer to add application-specific functionality.

TIE (Tensilica Instruction Extension): allows configurability and extensibility.

Tensilica flexibility: by adding new instructions.

S&E flexibility: by trimming from the base architecture.

Tensilica is a good complement to S&E.

ISA Subsetting

VESPA [CASES’08], SPREE [IEEE TCAD’07]

Main idea: extract the usage of instructions from the binary or by simulation (ASIPs and soft processors have a fixed ISA), then remove unused instructions from the ISA along with their architectural support.

Source: SPREE [IEEE TCAD’07]

ISA subsetting: by trimming instructions.

S&E flexibility: by trimming muxes (wires).

ISA subsetting is orthogonal to S&E.

NISC (No-Instruction-Set-Computer) [DAC’08]

Main idea: an application is compiled directly onto a given datapath, which removes the compiler design step of an ASIP.

NISC: trims unused functional units; no flexibility.

S&E: trims muxes in the IxC and in the register file; flexibility with efficiency.

Source: http://www.ics.uci.edu/~nisc/

[Figure labels: ALAP; trimming redundant hardware.]

PICO (Synopsys Synphony C Compiler) [ISSS’99, JVLSI’02]

PICO (Program In/Chip Out): VLIW + NPA (nonprogrammable accelerator).

S&E can target any architecture and trim it. S&E provides flexibility.

Overview








Trimmed VLIW Overview

[Figure: Trimmed VLIW flow, with blocks: Application (C code); Compiler; DFG; Base Architecture Generation by DSE (Scheduler, Allocation) under user constraints; architectural parameters (# of Mul=2, # of Add=2, ...); HDL generation; HDL (Verilog); Trimmed VLIW Simulator, which also takes the application; Trimmed Architecture (HDL).]

Trimmed VLIW is the implementation of S&E on VLIW-like processors.

Flow of the Trimmed VLIW framework:

1. The C code is converted to a DFG.

2. The DFG is scheduled on a general VLIW processor with HLS scheduling algorithms such as the Force Directed Scheduler (FDS).

3. The base architecture is generated from the schedule.

4. The base architecture is trimmed to obtain the trimmed architecture (Trimmed VLIW).
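
The scheduling step of the flow can be illustrated with a small resource-constrained list scheduler. This is a simplified sketch rather than the framework's scheduler: it assumes unit-latency operations, uses dictionary insertion order as the priority, and the DFG and unit counts below are hypothetical.

```python
def list_schedule(ops, deps, units):
    """Resource-constrained list scheduling with unit-latency operations.

    ops:   {op_name: kind}, insertion order doubles as the priority list
    deps:  {op_name: [predecessor names]}
    units: {kind: number of functional units of that kind}
    Returns {op_name: cycle}.
    """
    for kind in set(ops.values()):
        if units.get(kind, 0) < 1:
            raise ValueError(f"no functional unit available for kind {kind!r}")
    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        free = dict(units)
        # An op is ready once all its predecessors finished in an earlier cycle.
        ready = [
            o for o in ops
            if o not in schedule
            and all(p in schedule and schedule[p] < cycle for p in deps.get(o, []))
        ]
        placed = 0
        for o in ready:
            if free[ops[o]] > 0:
                free[ops[o]] -= 1
                schedule[o] = cycle
                placed += 1
        if placed == 0:
            raise ValueError("no operation could be scheduled; the DFG may contain a cycle")
        cycle += 1
    return schedule

# Hypothetical DFG: a2 = (m1 + m2) + m3, with 2 multipliers and 1 adder.
ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
deps = {"a1": ["m1", "m2"], "a2": ["a1", "m3"]}
schedule = list_schedule(ops, deps, {"mul": 2, "add": 1})
latency = max(schedule.values()) + 1
```

The unit counts passed to the scheduler play the role of the architectural parameters from which the base architecture is generated.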

Trimmed VLIW Base Architecture Creation

Scheduler defines architectural parameters, e.g., number of registers, register read/write ports, functional units.

[Figure: the DFG feeds the scheduler (scheduler + register allocation), which passes parameters to the base HDL generator.]

The HDL generator creates the VLIW architecture.

[Figure: Case 1: Trimmed VLIW with register file (control, register file with write ports, IxC, FU1 to FU3).]

[Figure: Case 2: Trimmed VLIW without register file (control, discrete registers R1, R2, ..., RN, IxC, FU1 to FU3).]

Algorithms in the Trimmed VLIW framework

Instruction Scheduling

List Scheduling

Force Directed Scheduling

Ant Scheduler based on Ant Colony Optimization (ACO) [IEEE TCAD’07]

Register Allocation

Left Edge Algorithm
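
For reference, a compact sketch of the left-edge algorithm for register allocation follows. It assumes variable lifetimes are given as half-open intervals [start, end) in schedule cycles; the variable names and lifetimes are hypothetical and are not taken from the benchmarks.

```python
def left_edge(lifetimes):
    """Left-edge register allocation.

    lifetimes: {variable: (start, end)} with half-open intervals [start, end).
    Returns {variable: register index}.
    """
    # Sort variables by the left edge (start) of their lifetime.
    pending = sorted(lifetimes, key=lambda v: lifetimes[v])
    assignment = {}
    reg = 0
    while pending:
        last_end = None
        remaining = []
        for v in pending:
            start, end = lifetimes[v]
            # Pack the variable into the current register if it does not overlap.
            if last_end is None or start >= last_end:
                assignment[v] = reg
                last_end = end
            else:
                remaining.append(v)
        pending = remaining
        reg += 1
    return assignment

lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5), "e": (0, 1)}
allocation = left_edge(lifetimes)
num_registers = max(allocation.values()) + 1
```

Here five variables fit into two registers, which equals the maximum number of simultaneously live variables in this example.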


Overview








Benchmarks

DFG and C code are provided in ExpressDFG

Benchmark   Name                              Number of Nodes
cosine2     Cosine                            82
ewf         Elliptic Wave Filter              34
fir1        Finite Impulse Response Filter    44
matmul      Matrix (Vector) multiplication    28
mcm         DSP program                       98

Case 1 Results: Base Architecture vs. Trimmed VLIW

Results are obtained with Design Compiler.

IxC trimming: 8.38% area saving by trimming 20.54% of wires.

IxC + register file trimming: 19.2% area saving by trimming 49.8% of wires.

                     cosine2   ewf     fir1    matmul   mcm     Average
IxC Trimming
  Trimmed Wires      19%       27%     20%     16.9%    19.8%   20.54%
  Area Savings       10.8%     6.7%    9.9%    6.1%     8.4%    8.38%
Register Trimming
  Trimmed Wires      56%       40%     49%     64%      40%     49.8%
  Area Savings       30%       10%     22%     18%      16%     19.2%
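
The Average column is the arithmetic mean over the five benchmarks. The short check below recomputes it from the per-benchmark percentages in the table; the dictionary keys are illustrative names only.

```python
# Per-benchmark percentages in the order cosine2, ewf, fir1, matmul, mcm.
rows = {
    "ixc_trimmed_wires": [19, 27, 20, 16.9, 19.8],
    "ixc_area_savings": [10.8, 6.7, 9.9, 6.1, 8.4],
    "reg_trimmed_wires": [56, 40, 49, 64, 40],
    "reg_area_savings": [30, 10, 22, 18, 16],
}

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

averages = {name: round(mean(values), 2) for name, values in rows.items()}
```

The recomputed means agree with the reported averages of 20.54%, 8.38%, 49.8%, and 19.2%.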


The area, delay and performance comparisons between base, IxC trimmed and register trimmed architectures.

[Figure: three bar charts over cosine2, ewf, fir1, matmul, mcm, and Average, comparing Base, Trimmed (only IxC trimming), and TrimmedVLIW: normalized delay, latency, and normalized area.]
Case 1 Results: Trimmed VLIW vs. HLS Tools

Trimmed VLIW uses on average 25.6% less area than C2V and achieves on average 20.5% better performance than C2V.

C2V and AutoESL are forced to use the same number of functional units with the “set allocate” pragma.

[Figure: three bar charts comparing TrimmedVLIW, C2V, and AutoESL: normalized area, normalized delay, and latency.]

Case 2 Results

IxC trimming: 22.16% area saving by trimming 43.4% of wires.

IxC + register file trimming: 35% area saving by trimming 63.8% of wires.

[Figure: three bar charts comparing Base, Trimmed (only IxC), and TrimmedVLIW: Area (Case2), Delay (Case2), and Latency (Case2).]

Case 1 Trimmed VLIW vs. Case 2 Trimmed VLIW vs. AutoESL

Area: Trimmed VLIW (Case2) is better than Trimmed VLIW (Case1).

Performance: AutoESL has better performance than Trimmed VLIW (Case1) and Trimmed VLIW (Case2).

[Figure: three bar charts over cosine2, ewf, fir1, matmul, mcm, and Average, comparing TrimmedVLIW (Case1), TrimmedVLIW (Case2), and AutoESL: normalized area, normalized delay, and latency.]

Results: Flexibility

1. Create a flexible architecture.

2. Make it an FCA by trimming for a given set of applications.

                        #Adder   #Mul   #Registers   #Read Ports   #Write Ports
matmul                  2        3      20           6             3
fir1                    3        2      22           6             3
ewf                     2        2      8            4             2
General Architecture    3        3      22           6             3

Trimmed 87 out of 222 wires (39%)

Area saving: 13%
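
The General Architecture row coincides with the per-parameter maximum over the three benchmarks. The check below assumes that this is how the row was derived, which the slides do not state explicitly, and also recomputes the trimmed-wire percentage.

```python
PARAMS = ("adders", "multipliers", "registers", "read_ports", "write_ports")

# Architectural parameters per benchmark, as listed in the table.
benchmarks = {
    "matmul": (2, 3, 20, 6, 3),
    "fir1": (3, 2, 22, 6, 3),
    "ewf": (2, 2, 8, 4, 2),
}

# Assumption: the general architecture takes the maximum of each parameter.
general = dict(zip(PARAMS, (max(column) for column in zip(*benchmarks.values()))))

trimmed_percent = round(100 * 87 / 222)
```

Under that assumption the result is 3 adders, 3 multipliers, 22 registers, 6 read ports, and 3 write ports, and 87 of 222 wires rounds to 39%.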

Results: Flexibility (cont’d)

A fixed architecture for each benchmark vs. A common FCA for all three benchmarks.

[Figure: bar chart "Area of Fixed vs Flexible Architectures" with bars for matmul fixed, fir1 fixed, ewf fixed, and Flexible; area axis from 0 to 160000; bar annotations 17%, 28%, and 92%.]


Conclusion and Future Work

Applied S&E to two different types of VLIW architectures: with a register file (Case 1) and with discrete registers (Case 2).

Presented preliminary results.

Presented the notion of flexibility in S&E.

Extend S&E to other architectures (e.g., superscalar).

Evaluate our approach with larger benchmarks and different architectures.


Thank you!
