Trimmed VLIW: Moving Application Specific Processors Towards High Level Synthesis
Janarbek Matai, Jason Oberg, Ali Irturk, Taemin Kim, and Ryan Kastner
Dept. of Computer Science and Engineering
University of California, San Diego
Intel Labs, Portland
The 2012 Electronic System Level Synthesis Conference
June 02, 2012
Overview
Background: Motivation, Problem Definition, High-Level Synthesis, ASIP
Simulate and Eliminate (S&E) Approach: Trimming, Flexibility
Related Work: HLS, Application Specific Processor Design
Trimmed VLIW: Trimmed VLIW Overview; Base Architecture Generation; Case 1: Trimmed VLIW with register file; Case 2: Trimmed VLIW with discrete registers
Results: Case 1 Results, Case 2 Results, Flexibility Results
Conclusion and Future Work
Motivation
Custom hardware is widely used in various fields (e.g., embedded systems).
Custom hardware is evaluated by its performance, power, and flexibility.
Approaches to automating custom hardware design:
ASIP (Application-Specific Instruction-Set Processor)
High-level synthesis
Problem Definition
1. High-Level Synthesis (Application -> HLS Tool -> HDL)
Pros: time-to-market; reasonable performance
Cons: flexibility and scalability; code size > 500 (700); decreased control over the HDL
2. ASIP (Application -> ASIP Designer Tool -> HDL, ISA, Compiler)
Pros: flexibility; performance > CPU
Cons: manual design of the ISA and compiler; area
High-Level Synthesis
Most current HLS design tools work as follows:
The datapath is pieced together by scheduling and binding the operations.
The controller is generated as an FSM, which often becomes unwieldy for large designs.
[Figure: initial application dataflow graph; scheduled dataflow graph; final datapath and controller (MUL, ADD, Registers 1-3, FSM controller).]
ASIP (Application-Specific Instruction-Set Processor)
ASIP design flow:
Application and application analysis: understand the application characteristics.
Architectural design space exploration: choose the processor architectural parameters (functional units, register file, number of registers, number of read/write ports).
ISA design: define the ISA and the instruction set size.
Datapath/control microarchitecture design.
Compiler design: build a compiler for the ISA.
ASIP: a better trade-off between efficiency and flexibility.
Simulate and Eliminate (S&E): start with a general-purpose processor and customize it.
S&E can design a flexible ASIP and further customize it by trimming.
What is “Trimming”?
S&E starts with an ASIP and further customizes it by trimming.
Trimming is the removal of unused resources; resources are functional units, registers, and interconnects.
In this work, we focus on interconnect and multiplexer trimming.
“Wires” are trimmed in the IxC and in the register file.
Trimming unused functional units and registers is straightforward.
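A minimal sketch of the trimming idea, assuming a hypothetical trace format in which simulation records which source drives which multiplexer. The names `full_ixc`, `trace`, `trim_mux_inputs`, and `count_wires` are illustrative and do not come from the Trimmed VLIW framework; the connection data is made up.

```python
def trim_mux_inputs(full_ixc, trace):
    """Keep only the multiplexer inputs that the simulation trace exercised.

    full_ixc: dict mapping mux name -> set of sources wired to it (hypothetical format)
    trace:    iterable of (mux, source) pairs observed during simulation
    """
    used = {mux: set() for mux in full_ixc}
    for mux, source in trace:
        if source not in full_ixc[mux]:
            raise ValueError(f"{source} is not wired to {mux}")
        used[mux].add(source)
    return used


def count_wires(ixc):
    """Total number of mux input wires in an interconnect description."""
    return sum(len(sources) for sources in ixc.values())


# Fully connected base: every read port reaches every FU input mux.
full_ixc = {m: {"RD1", "RD2", "RD3", "RD4"} for m in ("M1", "M2", "M3", "M4")}
# Connections seen while simulating the application (illustrative data).
trace = [("M1", "RD1"), ("M2", "RD2"), ("M3", "RD1"),
         ("M3", "RD3"), ("M4", "RD4"), ("M1", "RD1")]

trimmed = trim_mux_inputs(full_ixc, trace)
removed = count_wires(full_ixc) - count_wires(trimmed)
print(f"trimmed {removed} of {count_wires(full_ixc)} wires")
```

A mux left with a single input after trimming could be replaced by a plain wire; that step is omitted here.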
IxC Trimming
[Figure: Trimmed VLIW with register file. A control block, a register file (R1, R2, …, RN) with read data ports RD1-RD4 and write ports, the IxC built from multiplexers M1-M9, and functional units FU1-FU3 with outputs out1-out3. Two callouts, both labeled “IxC Trimming”, mark where trimming is applied.]
IxC Trimming Example
Register read port to functional unit input port trimming
[Figure: three successive views of read ports RD1-RD4 feeding multiplexers M3 and M4 at the inputs of FU2.]
Functional unit output port to register file write port trimming
[Figure: three successive views of outputs out1-out3 feeding multiplexers M7 and M8 at write ports Wp1 and Wp2; M7 is absent from the last view.]
Register File Trimming
[Figure: Trimmed VLIW with register file (control, register file R1 … RN with read data ports and write ports, IxC, FU1-FU3), with a detail of the register file showing registers R1-R3 connected to read ports RD1-RD4 through multiplexers M10-M13.]
S&E (Simulate and Eliminate) Approach
Start with a general-purpose processor and customize it by trimming unused components.
In this work, we study the application of S&E to a VLIW-like architecture.
[Figure: two qualitative plots, performance vs. flexibility and area vs. flexibility, each placing HLS, ASIP, and S&E.]
Flexibility
The flexibility of an application specific processor (customized hardware) is its ability to run a set of applications.
E.g., an ASP (Application Specific Processor) for:
Edge detection
Corner detection
Flexible Customized Architecture (FCA): a customized architecture that is still flexible enough to run a set of applications.
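Flexibility (cont’d)
Suppose we have customized architectures for applications A and B.
[Figure: architecture A (registers R1, R2; read ports RD1, RD2; multiplexer M1; functional unit mul1; output Out1) and architecture B (registers R1-R3; read ports RD1-RD3; multiplexers M2-M4; functional units mul2 and add; outputs Out2, Out3).]
C can run both A and B:
1. C is flexible
2. C is application specific
Architectural parameters of C:
1. RD1, RD2, RD3
2. R1, R2, R3
3. RD1M1, RD1M4
4. #mul=1 #add=1
[Figure: architecture C (registers R1-R3; read ports RD1-RD3; multiplexers M2-M5; functional units mul and add; outputs Out3/Out1 and Out2).]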
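One way to read the A/B/C example is that C is the smallest parameter set covering both A and B. The sketch below assumes a hypothetical dictionary representation; the resource sets for A and B are read off the figure labels and are an assumption, and `merge_architectures` and `covers` are illustrative names, not part of the S&E tooling.

```python
def merge_architectures(a, b):
    """Return the smallest parameter set that covers both inputs."""
    kinds = set(a["fus"]) | set(b["fus"])
    return {
        "read_ports": a["read_ports"] | b["read_ports"],
        "registers": a["registers"] | b["registers"],
        # A shared FU kind needs only as many units as the more demanding app.
        "fus": {k: max(a["fus"].get(k, 0), b["fus"].get(k, 0)) for k in kinds},
    }


def covers(arch, app):
    """True if `arch` provides every resource that `app` needs."""
    return (app["read_ports"] <= arch["read_ports"]
            and app["registers"] <= arch["registers"]
            and all(arch["fus"].get(k, 0) >= n for k, n in app["fus"].items()))


# Parameters assumed from the figure: A uses one multiplier, B a multiplier and an adder.
arch_a = {"read_ports": {"RD1", "RD2"}, "registers": {"R1", "R2"},
          "fus": {"mul": 1}}
arch_b = {"read_ports": {"RD1", "RD2", "RD3"}, "registers": {"R1", "R2", "R3"},
          "fus": {"mul": 1, "add": 1}}

arch_c = merge_architectures(arch_a, arch_b)
print(sorted(arch_c["read_ports"]), sorted(arch_c["registers"]), arch_c["fus"])
print(covers(arch_c, arch_a), covers(arch_c, arch_b))
```

The sketch ignores the multiplexer connections (item 3 in the parameter list), which would be merged the same way as the port and register sets.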
High Level Synthesis: HLS
AutoESL: Xilinx HLS tool
Creates an RTL implementation from C-level source code
Implements the design based on defaults and user-applied directives
Many implementations are possible from the same source description: smaller designs, faster designs, optimal designs
Enables design exploration
Other HLS tools: e.g., Catapult C, Forte Cynthesizer, C2V, Impulse, BlueSpec
[Figure: AutoESL flow. Inputs: C, C++, SystemC plus constraints/directives. Outputs: VHDL, Verilog, SystemC plus a script with constraints, passed on to RTL synthesis. Courtesy of Xilinx.]
GUSTO (General architecture design Utility and Synthesis Tool for Optimization)
Resource trimming for hardware optimization: Irturk et al.'s GUSTO [TECS’00] simulates the architecture to determine the usage of hardware resources, then trims away the unused components along with their interconnects.
[Figure: multiplier, adder, and memory with full connectivity vs. with only the required connectivity.]
GUSTO targets RISC; in this work, we focus on VLIW.
Tensilica
XTENSA: a configurable and extensible processor [IEEE MICRO’00].
Configurability: e.g., the designer can choose execution units such as a 16x16 multiplier, a floating-point unit, a barrel shifter, etc.
Extensibility: enables the designer to add application-specific functionality.
TIE (Tensilica Instruction Extension): allows configurability and extensibility.
Tensilica flexibility: by adding new instructions.
S&E flexibility: by trimming from a base architecture.
Tensilica is a good complement to S&E.
ISA Subsetting
VESPA [CASES’08], SPREE [IEEE TCAD’07]
Main idea: extract the usage of instructions from the binary or by simulation (an ASIP or soft processor has a fixed ISA), then remove the unused instructions from the ISA along with their architectural support.
Source: SPREE [IEEE TCAD’07]
ISA subsetting flexibility: by trimming instructions.
S&E flexibility: by trimming muxes (wires).
ISA subsetting is orthogonal to S&E.
NISC (No-Instruction-Set-Computer) [DAC’08]
Main idea: an application is compiled directly onto a given datapath, which removes the compiler design step of an ASIP.
NISC trims unused functional units; S&E trims muxes in the IxC and in the register file.
S&E offers flexibility with efficiency; NISC offers no flexibility.
Source: http://www.ics.uci.edu/~nisc/
[Figure: NISC flow, with labels “ALAP” and “Trimming redundant hardware”.]
PICO (Synopsys Synphony C Compiler) [ISSS’99, JVLSI’02]
PICO (Program In/Chip Out): VLIW + NPA (nonprogrammable accelerator).
S&E can target any architecture and trim it. S&E provides flexibility.
Trimmed VLIW Overview
Trimmed VLIW is the implementation of S&E on VLIW-like processors.
[Figure: framework block diagram with blocks Application (C code), Compiler, DFG, Base Architecture Generation by DSE (Scheduler, Allocation), User constraints, Architectural parameters (# of Mul=2, # of Add=2, ..), HDL generation, HDL (Verilog), Trimmed VLIW Simulator, and Trimmed Architecture (HDL).]
Flow of the Trimmed VLIW framework:
The C code is converted to a DFG.
The DFG is scheduled on a general VLIW processor with HLS scheduling algorithms such as the Force Directed Scheduler (FDS).
The base architecture is generated from the schedule.
The base architecture is trimmed to obtain the trimmed architecture (Trimmed VLIW).
Trimmed VLIW Base Architecture Creation
The scheduler defines the architectural parameters, e.g., number of registers, register read/write ports, and functional units.
The HDL generator creates the VLIW architecture.
[Figure: DFG, Scheduler + Register Allocation, Parameters, Base HDL Generator.]
[Figure: Case 1: Trimmed VLIW with register file (control, register file R1 … RN with read data ports and write ports, IxC, FU1-FU3).]
[Figure: Case 2: Trimmed VLIW without register file (control, discrete registers R1 … RN, IxC, FU1-FU3).]
Algorithms in the Trimmed VLIW framework
Instruction Scheduling
List Scheduling
Force Directed Scheduling
Ant Scheduler based on Ant Colony Optimization (ACO) [IEEE TCAD’07]
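As an illustration of the first scheduler in the list, here is a minimal sketch of resource-constrained list scheduling. It assumes unit-latency operations, an acyclic DFG, and a longest-path-to-sink priority; the function name, data layout, and example DFG are hypothetical and are not taken from the Trimmed VLIW framework.

```python
def list_schedule(ops, deps, resources):
    """Resource-constrained list scheduling with unit-latency operations.

    ops:       dict op name -> functional unit kind, e.g. {"m1": "mul"}
    deps:      dict op name -> list of predecessor op names (must be acyclic)
    resources: dict kind -> number of available functional units
    Returns dict op name -> control step (starting at 0).
    """
    for kind in set(ops.values()):
        if resources.get(kind, 0) < 1:
            raise ValueError(f"no functional unit of kind {kind!r}")

    succs = {op: [] for op in ops}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    # Priority: length of the longest path from the op to any sink.
    priority = {}

    def longest_path(op):
        if op not in priority:
            priority[op] = 1 + max((longest_path(s) for s in succs[op]), default=0)
        return priority[op]

    for op in ops:
        longest_path(op)

    schedule, step = {}, 0
    while len(schedule) < len(ops):
        free = dict(resources)
        # An op is ready once all predecessors finished in an earlier step.
        ready = [op for op in ops if op not in schedule
                 and all(p in schedule and schedule[p] < step
                         for p in deps.get(op, []))]
        for op in sorted(ready, key=lambda o: (-priority[o], o)):
            if free[ops[op]] > 0:
                schedule[op] = step
                free[ops[op]] -= 1
        step += 1
    return schedule


# Illustrative DFG: a2 = (m1 + m2) + m3 with two multipliers and one adder.
ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
deps = {"a1": ["m1", "m2"], "a2": ["a1", "m3"]}
schedule = list_schedule(ops, deps, {"mul": 2, "add": 1})
print(schedule)
```

The number of functional units passed in `resources` corresponds to the kind of architectural parameter the scheduler fixes for the base architecture.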
Register Allocation
Left Edge Algorithm
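A minimal sketch of left-edge register allocation, assuming each value's lifetime is given as a half-open interval of control steps. Values are processed in order of start time and placed in the first register that is already free, which yields the same assignment as filling one register at a time. The name `left_edge` and the example lifetimes are illustrative.

```python
def left_edge(lifetimes):
    """Assign values to registers with the left-edge algorithm.

    lifetimes: dict value name -> (start, end), treated as the half-open
               interval [start, end) of control steps (an assumption).
    Returns dict value name -> register index.
    """
    order = sorted(lifetimes, key=lambda v: (lifetimes[v][0], lifetimes[v][1], v))
    register_free_at = []  # register_free_at[r] = end of the last lifetime in r
    assignment = {}
    for value in order:
        start, end = lifetimes[value]
        for reg, free_at in enumerate(register_free_at):
            if free_at <= start:
                # Lifetimes do not overlap, so the register can be reused.
                register_free_at[reg] = end
                assignment[value] = reg
                break
        else:
            # No existing register is free: allocate a new one.
            register_free_at.append(end)
            assignment[value] = len(register_free_at) - 1
    return assignment


lifetimes = {"a": (0, 2), "b": (0, 3), "c": (2, 4), "d": (3, 5), "e": (4, 6)}
assignment = left_edge(lifetimes)
print(assignment, "registers used:", len(set(assignment.values())))
```

The number of distinct registers in the result is the register count the base architecture would need for this schedule.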
Benchmarks
The DFGs and C code are provided in ExpressDFG.
Benchmark   Name                              Number of Nodes
cosine2     Cosine                            82
ewf         Elliptic Wave Filter              34
fir1        Finite Impulse Response Filter    44
matmul      Matrix (Vector) Multiplication    28
mcm         DSP program                       98
Case 1 Results: Base Architecture vs. Trimmed VLIW
Results are obtained with Design Compiler.
IxC trimming: 8.38% area saving by trimming 20.54% of wires.
IxC + register file trimming: 19.2% area saving by trimming 49.8% of wires.
                     cosine2   ewf     fir1    matmul   mcm     Average
IxC Trimming
  Trimmed Wires      19%       27%     20%     16.9%    19.8%   20.54%
  Area Savings       10.8%     6.7%    9.9%    6.1%     8.4%    8.38%
Register Trimming
  Trimmed Wires      56%       40%     49%     64%      40%     49.8%
  Area Savings       30%       10%     22%     18%      16%     19.2%
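The averages quoted for Case 1 can be reproduced from the per-benchmark values. The short check below copies the numbers from the table; the dictionary layout is only an illustration.

```python
from statistics import mean

# Per-benchmark percentages copied from the Case 1 table
# (order: cosine2, ewf, fir1, matmul, mcm).
case1 = {
    ("IxC", "trimmed wires"): [19, 27, 20, 16.9, 19.8],
    ("IxC", "area savings"): [10.8, 6.7, 9.9, 6.1, 8.4],
    ("IxC + register file", "trimmed wires"): [56, 40, 49, 64, 40],
    ("IxC + register file", "area savings"): [30, 10, 22, 18, 16],
}

averages = {key: round(mean(values), 2) for key, values in case1.items()}
for (stage, metric), avg in averages.items():
    print(f"{stage:20s} {metric:14s} {avg}%")
```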
Case 1 Results: Base Architecture vs. Trimmed VLIW
The area, delay, and performance comparisons between the base, IxC-trimmed, and register-trimmed architectures.
[Figure: three bar charts per benchmark (cosine2, ewf, fir1, matmul, mcm, Average): Delay (normalized), Latency, and Area (normalized); series: Base, Trimmed (only IxC trimming), TrimmedVLIW.]
Case 1 Results: Trimmed VLIW vs. HLS Tools
Trimmed VLIW uses on average 25.6% less area than C2V and has on average 20.5% better performance than C2V.
C2V and AutoESL are forced to use the same number of functional units with the “set allocate” pragma.
[Figure: three bar charts per benchmark (cosine2, ewf, fir1, matmul, mcm, Average): Area (normalized), Delay (normalized), and Latency; series: TrimmedVLIW, C2V, AutoESL.]
Case 2 Results: Base Architecture vs. Trimmed VLIW
IxC trimming: 22.16% area saving by trimming 43.4% of wires.
IxC + register file trimming: 35% area saving by trimming 63.8% of wires.
[Figure: three bar charts per benchmark (cosine2, ewf, fir1, matmul, mcm, Average): Area (Case2), Delay (Case2), and Latency (Case2); series: Base, Trimmed (only IxC trimming), TrimmedVLIW.]
Case 1 Trimmed VLIW vs. Case 2 Trimmed VLIW vs. AutoESL
Area: Trimmed VLIW (Case2) is better than Trimmed VLIW (Case1).
Performance: AutoESL has better performance than Trimmed VLIW (Case1) and Trimmed VLIW (Case2).
[Figure: three bar charts per benchmark (cosine2, ewf, fir1, matmul, mcm, Average): Area (normalized), Delay (normalized), and Latency; series: TrimmedVLIW (Case1), TrimmedVLIW (Case2), AutoESL.]
Results: Flexibility
1. Create a flexible architecture.
2. Make it an FCA by trimming for a given set of applications.
                       #Adder   #Mul   #Registers   #Read Ports   #Write Ports
matmul                 2        3      20           6             3
fir1                   3        2      22           6             3
ewf                    2        2      8            4             2
General Architecture   3        3      22           6             3
Trimmed 87 out of 222 wires (39%).
Area saving: 13%
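The General Architecture row is consistent with taking the per-column maximum over the three benchmarks, and the 39% figure with 87 of 222 wires. The sketch below checks both using the table values; reading the row as a column-wise maximum is an interpretation, and the variable names are illustrative.

```python
params = ("adders", "multipliers", "registers", "read_ports", "write_ports")

# Per-benchmark architectural parameters copied from the table.
per_benchmark = {
    "matmul": (2, 3, 20, 6, 3),
    "fir1": (3, 2, 22, 6, 3),
    "ewf": (2, 2, 8, 4, 2),
}

# Column-wise maximum: enough of every resource for the most demanding benchmark.
general = tuple(max(column) for column in zip(*per_benchmark.values()))
print(dict(zip(params, general)))

trimmed_wires, total_wires = 87, 222
trimmed_percent = 100 * trimmed_wires / total_wires
print(f"trimmed {trimmed_percent:.1f}% of wires")
```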
Results: Flexibility (cont’d)
A fixed architecture for each benchmark vs. a common FCA for all three benchmarks.
[Figure: bar chart “Area of Fixed vs Flexible Architectures” with bars matmul fixed, fir1 fixed, ewf fixed, and Flexible; area axis from 0 to 160000; bar annotations 17%, 28%, 92%.]
Conclusion and Future Work
Applied S&E to two different types of VLIW architectures: with a register file and with discrete registers.
Presented preliminary results.
Presented a notion of flexibility in S&E.
Future work: extend S&E to other architectures (e.g., superscalar).
Future work: evaluate our approach with larger benchmarks and different architectures.