VEAL: Virtualized Execution Accelerator for Loops

Nate Clark¹, Amir Hormati², Scott Mahlke²
¹ Georgia Tech, ² U. Michigan


Post on 05-Jan-2016



TRANSCRIPT

Page 1: VEAL: Virtualized Execution Accelerator for Loops

VEAL: Virtualized Execution Accelerator for Loops

Nate Clark¹, Amir Hormati², Scott Mahlke²

¹ Georgia Tech, ² U. Michigan

Page 2: VEAL: Virtualized Execution Accelerator for Loops

How to get Efficiency?

• Microarchitecture changes

• Multi- / many-core

• Heterogeneity

[Figure: Core2 Duo and STI Cell]

Page 3: VEAL: Virtualized Execution Accelerator for Loops

How is Heterogeneity Used?

[Figure: diagram with labels Engineer/Compiler, Program, GPP, and Hetero.]

Control Statically Placed in Binary

Page 4: VEAL: Virtualized Execution Accelerator for Loops

Problem With Static Control

Not forward/backward compatible

[Figure: diagram with labels Engineer/Compiler, Program, CPU (three times), and Hetero. (twice)]

Page 5: VEAL: Virtualized Execution Accelerator for Loops

Solution: Virtualization

• Abstract accelerator features
    – Reexamine compiler algorithms

• Key: do the hard stuff offline

[Figure: diagram with labels Engineer/Compiler, Program, DynComp. (three times), CPU (three times), Hetero. (twice), Offline, and Online]

Page 6: VEAL: Virtualized Execution Accelerator for Loops

This Paper:

• Examines loops as a heterogeneity target
    – ASICs often implement loops

• Designs a generalized loop accelerator
    – Not covered in this talk

• Explores how to virtualize loop accelerators
    – I.e., abstract the accelerator interface

Page 7: VEAL: Virtualized Execution Accelerator for Loops

Loop Accelerator Template


Page 8: VEAL: Virtualized Execution Accelerator for Loops

Why More Efficient Than GPP?

• Simple control flow

• Decoupled memory accesses

• I-Cache unnecessary

• Customize execution resources for loops


Page 9: VEAL: Virtualized Execution Accelerator for Loops

Proposed Loop Accelerator

• 1 CCA

• 2 Int units

• 16 regs

• Memory (4x)
    – 16 Input streams
    – 8 Output streams

• 0.8 mm², 90 nm
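The resource counts above can be written down as a small configuration record. This is a sketch only: the class names, field names, and the `fits` check are hypothetical and not from the talk, and the check covers just registers and memory streams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopAcceleratorConfig:
    """Resource counts from the slide; the field names are hypothetical."""
    cca_units: int = 1
    int_units: int = 2
    registers: int = 16
    input_streams: int = 16
    output_streams: int = 8


@dataclass(frozen=True)
class LoopDemand:
    """Hypothetical summary of what one loop body needs."""
    live_registers: int
    load_streams: int
    store_streams: int


def fits(loop: LoopDemand,
         cfg: LoopAcceleratorConfig = LoopAcceleratorConfig()) -> bool:
    """Return True if the loop's register and stream needs fit the accelerator."""
    return (loop.live_registers <= cfg.registers
            and loop.load_streams <= cfg.input_streams
            and loop.store_streams <= cfg.output_streams)


# Made-up demands: the first fits, the second needs too many load streams.
small_loop = LoopDemand(live_registers=8, load_streams=3, store_streams=1)
wide_loop = LoopDemand(live_registers=8, load_streams=17, store_streams=1)
```

A loop that fails such a check would presumably stay on the general-purpose core.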


Page 10: VEAL: Virtualized Execution Accelerator for Loops

Modulo Scheduling

+ High quality software pipelining technique

+ Simple control structure (low HW cost)

- Can be slow, i.e., hard to do dynamically

- Loops: no side exits, no while loops, must be if-convertible
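A minimal sketch of why modulo scheduling counts as software pipelining. It assumes a fixed initiation interval (II, the number of cycles between the starts of consecutive iterations) and a fixed schedule length per iteration. The function names and the sample numbers are illustrative and not from the talk.

```python
def issue_cycle(op_time: int, iteration: int, ii: int) -> int:
    """Cycle at which an op scheduled at `op_time` runs in a given iteration."""
    return op_time + iteration * ii


def kernel_slot(op_time: int, ii: int) -> int:
    """Row of the repeating kernel that the op occupies (time modulo II)."""
    return op_time % ii


def pipelined_cycles(iterations: int, ii: int, schedule_length: int) -> int:
    """Total cycles when a new iteration starts every `ii` cycles."""
    if iterations <= 0:
        return 0
    return (iterations - 1) * ii + schedule_length


def sequential_cycles(iterations: int, schedule_length: int) -> int:
    """Total cycles when iterations run back to back with no overlap."""
    return max(iterations, 0) * schedule_length


# Made-up example: 100 iterations, 6-cycle schedule, II of 2.
overlapped = pipelined_cycles(100, ii=2, schedule_length=6)
serial = sequential_cycles(100, schedule_length=6)
```

Because every iteration follows the same schedule shifted by II, the hardware only needs to replay the kernel, which is one way to read the "simple control structure" point above.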


Page 11: VEAL: Virtualized Execution Accelerator for Loops

Benchmark Execution Time

[Figure: bar chart; y-axis: Dynamic Execution Time, 0% to 100%; series: Modulo Schedulable, Speculation Support, Subroutine, Acyclic]

Page 12: VEAL: Virtualized Execution Accelerator for Loops

Modulo Scheduling Basics

[Figure: modulo scheduling diagram; surviving labels: Kernel, FU C]

Page 13: VEAL: Virtualized Execution Accelerator for Loops

Modulo Scheduling Example

[Figure: scheduling grids with columns CCA, Int, Int and a Time axis (0, 1, 2), placing operations 2, 4, 6, 3, 5, 7]

Priority: 2, 4, 6, 3, 5, 7

1. CCA Mapping

2. II Calculation

3. Priority

4. Scheduling

5. Reg. assignment/communication
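Step 2 (II calculation) is conventionally the larger of two lower bounds: a resource bound (ResMII) and a recurrence bound (RecMII). The sketch below follows that textbook formulation and is not necessarily the exact procedure used in VEAL. The unit counts match the proposed accelerator (1 CCA, 2 Int units); the operation counts and recurrences are made up for illustration.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource bound: each unit type needs ceil(ops / units) cycles per iteration."""
    return max((ceil(n / unit_counts[kind]) for kind, n in op_counts.items()),
               default=1)


def rec_mii(recurrences):
    """Recurrence bound over dependence cycles.

    Each recurrence is (total_latency, iteration_distance); a cycle of
    latency L spanning D iterations forces II >= ceil(L / D).
    """
    return max((ceil(latency / distance) for latency, distance in recurrences),
               default=1)


def min_ii(op_counts, unit_counts, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


units = {"cca": 1, "int": 2}      # from the proposed accelerator
ops = {"cca": 1, "int": 5}        # hypothetical loop body
cycles = [(4, 1), (3, 2)]         # hypothetical (latency, distance) pairs
mii = min_ii(ops, units, cycles)
```

The scheduler would then try to place operations at II = MII and increase II if no legal schedule is found.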

Page 14: VEAL: Virtualized Execution Accelerator for Loops

Measured Scheduling Overhead

[Figure: bar chart; y-axis: Overhead (1000s of Instructions), 0 to 500; series: CCA Subgraph ID, ResMII, RecMII, Priority, Scheduling, Register Assignment]

70% Priority, 19% CCA

Page 15: VEAL: Virtualized Execution Accelerator for Loops

Supporting Hybrid Compilation


Loop: 1 ld, 2 add, 3 sub, and, sub, xor, 5 or, 6 or, 7 add, 8 str

Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or, 6 or, 7 add, 8 str
CCA: and, sub, xor, ret

Data: 01463…
Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or…
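One reading of the listings above is that the CCA subgraph (and, sub, xor) is moved out of the loop into a small subroutine, and the loop calls it with `brl CCA`. A minimal sketch of that outlining step follows. The function name, the list-of-strings instruction format, and the index arguments are all hypothetical.

```python
def outline_subgraph(loop_ops, start, length, label="CCA"):
    """Move loop_ops[start:start+length] into a subroutine.

    Returns (new_loop, subroutine): the loop with the span replaced by a
    single 'brl <label>' call, and the span followed by 'ret'.
    """
    if start < 0 or length <= 0 or start + length > len(loop_ops):
        raise ValueError("subgraph span is outside the loop body")
    body = loop_ops[start:start + length]
    new_loop = loop_ops[:start] + [f"brl {label}"] + loop_ops[start + length:]
    subroutine = body + ["ret"]
    return new_loop, subroutine


# Opcodes from the slide, without their instruction numbers.
loop = ["ld", "add", "sub", "and", "sub", "xor", "or", "or", "add", "str"]
new_loop, cca = outline_subgraph(loop, start=3, length=3)
```

With the subgraph already marked this way in the binary, a dynamic compiler would not need to rediscover it at run time, which fits the "do the hard stuff offline" point earlier in the talk.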

Page 16: VEAL: Virtualized Execution Accelerator for Loops

Speedups

[Figure: bar chart; y-axis: Speedup, 1 to 5; series: No Overhead, Full Dynamic, CCA/Priority Offline]

Page 17: VEAL: Virtualized Execution Accelerator for Loops

Summary

• Virtualization key to heterogeneity

• VEAL speedup: 2.54
    – 2.63 w/o translation (i.e., not binary compatible)
    – 2.17 fully dynamic

• CCA and priority: 89% of overhead
    – mpeg2dec 2.1 vs. 1.15

Page 18: VEAL: Virtualized Execution Accelerator for Loops

Thank you!

Questions?

http://www.cc.gatech.edu/~ntclark

http://cccp.eecs.umich.edu/
