VEAL: Virtualized Execution Accelerator for Loops
Nate Clark (1), Amir Hormati (2), Scott Mahlke (2)
(1) Georgia Tech, (2) U. Michigan
How to get Efficiency?
• Microarchitecture changes
• Multi- / many-core
• Heterogeneity
[Figure labels: Core2 Duo, STI Cell]
How is Heterogeneity Used?
[Diagram labels: Engineer/Compiler, Program, GPP, Hetero.]
Control Statically Placed in Binary
Problem With Static Control
Not forward/backward compatible
[Diagram labels: Engineer/Compiler, Program, CPU (x3), Hetero. (x2)]
Solution: Virtualization
• Abstract accelerator features
  – Reexamine compiler algorithms
• Key: do the hard stuff offline
[Diagram labels: Engineer/Compiler, Program, DynComp. (x3), CPU (x3), Hetero. (x2); regions: Offline, Online]
This Paper:
• Examines loops as heterogeneity target
  – ASICs often implement loops
• Designs a generalized loop accelerator
  – Not covered in this talk
• Explores how to virtualize loop accelerators
  – I.e., abstract the accelerator interface
Loop Accelerator Template
Why More Efficient Than GPP?
• Simple control flow
• Decoupled memory accesses
• I-Cache unnecessary
• Customize execution resources for loops
Proposed Loop Accelerator
• 1 CCA
• 2 Int units
• 16 regs
• Memory (4x)
  – 16 Input streams
  – 8 Output streams
• 0.8 mm², 90 nm
Modulo Scheduling
+ High quality software pipelining technique
+ Simple control structure (low HW cost)
- Can be slow, i.e., hard to do dynamically
- Loops: no side exits, no while loops, must be if-convertible
Benchmark Execution Time
[Chart: dynamic execution time (0% to 100%), broken down into Modulo Schedulable, Speculation Support, Subroutine, and Acyclic]
Modulo Scheduling Basics
[Figure labels: Kernel; FU columns CCA, Int, Int]
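The kernel idea behind modulo scheduling can be sketched with a modulo reservation table: an operation issued at cycle t occupies its function unit in slot t mod II of the kernel, so two operations conflict whenever their cycles are congruent modulo II. The class and helper names below are hypothetical, and the unit counts (1 CCA, 2 Int) are taken from the proposed accelerator slide. This is a minimal sketch, not the scheduler used in the paper.

```python
from typing import Dict, Optional


class ModuloReservationTable:
    """Tracks how many units of each kind are claimed in each kernel slot."""

    def __init__(self, ii: int, units: Dict[str, int]) -> None:
        self.ii = ii                      # initiation interval (kernel length)
        self.units = units                # available units per kind
        self.used = {kind: [0] * ii for kind in units}

    def try_place(self, kind: str, cycle: int) -> bool:
        """Claim a unit of `kind` at `cycle`; the slot is cycle mod II."""
        slot = cycle % self.ii
        if self.used[kind][slot] < self.units[kind]:
            self.used[kind][slot] += 1
            return True
        return False


def first_free_cycle(mrt: ModuloReservationTable, kind: str,
                     earliest: int) -> Optional[int]:
    """Scan II consecutive cycles; after that every slot has been tried."""
    for cycle in range(earliest, earliest + mrt.ii):
        if mrt.try_place(kind, cycle):
            return cycle
    return None  # no slot left at this II: the caller would retry with II + 1


# Hypothetical example: II = 2 with 1 CCA and 2 Int units.
mrt = ModuloReservationTable(ii=2, units={"CCA": 1, "Int": 2})
placed = [first_free_cycle(mrt, "Int", 0) for _ in range(5)]
print(placed)  # [0, 0, 1, 1, None]
```

With II = 2 and two Int units there are only four Int slots in the kernel, so the fifth Int operation cannot be placed and the scheduler would have to grow II.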
Modulo Scheduling Example
1. CCA Mapping
2. II Calculation
3. Priority
4. Scheduling
5. Reg. assignment/communication
Priority: 2, 4, 6, 3, 5, 7
[Figure: schedule grid with a Time axis (0, 1, 2) showing operations 2, 4, 6, 3, 5, 7]
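The II calculation step in the list above can be illustrated with the textbook definition of the minimum initiation interval: the larger of a resource bound (ResMII) and a recurrence bound (RecMII). The function names, the operation counts, and the recurrence list below are hypothetical. Only the unit counts (1 CCA, 2 Int) come from the proposed accelerator, and memory streams are ignored for brevity.

```python
import math
from typing import Dict, List, Tuple


def res_mii(op_counts: Dict[str, int], unit_counts: Dict[str, int]) -> int:
    """Resource bound: each unit kind needs ceil(ops / units) kernel slots."""
    return max(
        (math.ceil(n / unit_counts[kind]) for kind, n in op_counts.items()),
        default=1,
    )


def rec_mii(recurrences: List[Tuple[int, int]]) -> int:
    """Recurrence bound over (total latency, iteration distance) cycles."""
    return max(
        (math.ceil(latency / distance) for latency, distance in recurrences),
        default=1,
    )


def min_ii(op_counts: Dict[str, int], unit_counts: Dict[str, int],
           recurrences: List[Tuple[int, int]]) -> int:
    """The scheduler starts at this II and increases it if placement fails."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# Hypothetical loop body: 1 CCA op and 5 integer ops, with two
# dependence cycles given as (latency, distance).
units = {"CCA": 1, "Int": 2}
ops = {"CCA": 1, "Int": 5}
cycles = [(4, 1), (3, 2)]

print(res_mii(ops, units))         # 3
print(rec_mii(cycles))             # 4
print(min_ii(ops, units, cycles))  # 4
```

Here the recurrence with latency 4 and distance 1 dominates, so no schedule with II below 4 exists for this made-up loop regardless of how many function units are added.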
Measured Scheduling Overhead
[Chart: overhead in 1000s of instructions (0 to 500), broken down into CCA Subgraph ID, ResMII, RecMII, Priority, Scheduling, and Register Assignment]
70% Priority, 19% CCA
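Supporting Hybrid Compilation
Original loop:
  Loop: 1 ld, 2 add, 3 sub, and, sub, xor, 5 or, 6 or, 7 add, 8 str
CCA subgraph moved into its own procedure:
  Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or, 6 or, 7 add, 8 str
  CCA: and, sub, xor, ret
Data placed ahead of the loop:
  Data: 01463…
  Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or…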
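The transformation shown on this slide, where the and/sub/xor group is replaced by `brl CCA` and moved into a separate `CCA:` procedure ending in `ret`, can be sketched as a simple outlining pass. The function name and the list-of-strings instruction representation are hypothetical, and the subgraph is assumed to be a contiguous run of instructions, which need not hold in general.

```python
from typing import List, Tuple


def outline_subgraph(loop: List[str], start: int, length: int,
                     label: str = "CCA") -> Tuple[List[str], List[str]]:
    """Replace loop[start:start+length] with a branch-and-link to `label`.

    Returns the rewritten loop and the outlined procedure, which is the
    removed instructions followed by a return.
    """
    body = loop[start:start + length]
    new_loop = loop[:start] + [f"brl {label}"] + loop[start + length:]
    procedure = body + ["ret"]
    return new_loop, procedure


# The loop from the slide, with the and/sub/xor group at positions 3..5.
loop = ["ld", "add", "sub", "and", "sub", "xor", "or", "or", "add", "str"]
new_loop, cca = outline_subgraph(loop, start=3, length=3)

print(new_loop)  # ['ld', 'add', 'sub', 'brl CCA', 'or', 'or', 'add', 'str']
print(cca)       # ['and', 'sub', 'xor', 'ret']
```

A plausible reading of the slide is that a core without a CCA can still run the rewritten binary by treating `brl CCA` as an ordinary call, while a dynamic compiler targeting a CCA can recognize the procedure and map it to the accelerator without repeating the subgraph search online.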
Speedups
[Chart: speedup (1 to 5) for three configurations: No Overhead, Full Dynamic, and CCA/Priority Offline]
Summary
• Virtualization key to heterogeneity
• VEAL speedup: 2.54
  – 2.63 w/o translation (i.e., not binary compatible)
  – 2.17 fully dynamic
• CCA and priority: 89% of overhead
  – mpeg2dec 2.1 vs. 1.15
Thank you!
Questions?
http://www.cc.gatech.edu/~ntclark
http://cccp.eecs.umich.edu/