Vectorization and Mapping of Software Defined Radio Applications on GPU Platforms GPU Summit at the UMD/NVIDIA CUDA Center for Excellence, College Park MD, October 27, 2014 Shuvra S. Bhattacharyya Professor, ECE and UMIACS University of Maryland at College Park [email protected] , http://www.ece.umd.edu/~ssb With Contributions from G. Zaki, W. Plishker, C. Clancy, and J. Kuykendall




Outline

•  Introduction

•  Contribution: Novel Vectorization and Mapping Workflow.

•  Evaluation

•  Summary


DSPCAD Methodologies: Computer-Aided Design (CAD) for Digital Signal Processing (DSP) Systems

•  Platforms: FPGA, programmable DSP, GPU, microcontroller

•  Applications and Tasks [Bhattacharyya 2013]

–  Image: medical, computer vision, feature detection, etc.

–  Video: coding, compression, etc.

–  Audio: sample rate conversion, speech, etc.

–  Wireless communication systems

[Figure: example processing chains. Video: color processing → prediction → transformation & quantization → entropy coding. Imaging: imaging device → data preprocessing → image reconstruction → post reconstruction → advanced image analysis → image visualization. Audio: audio device → data preprocessing → feature extraction → data postprocessing. Wireless: source encoding → channel encoding → digital modulation → D/A conversion → RF back-end.]


Motivation

•  Diversity of platforms:

–  ASICs, FPGAs, DSPs, GPUs, GPPs

•  Complex application environments

–  GNU Radio

•  Exposing parallelism

–  Task, data, and pipeline

•  Difficult mapping problem

•  Multi-objective (throughput, latency)


Background: GNU Radio

•  A software development framework that provides software defined radio (SDR) developers a rich library and a customized runtime engine to design and test radio applications [Blossom 2004]


DSP-oriented Dataflow Models of Computation

•  Application is modeled as a directed graph

–  Nodes (actors) represent functions of arbitrary complexity

–  Edges represent communication channels between functions

–  Nodes produce and consume data from edges

–  Edges buffer data (logically) in a FIFO (first-in, first-out) fashion

•  Data-driven execution model

–  An actor can execute whenever it has sufficient data on its input edges.

–  The order in which actors execute is not part of the specification.

–  The order is typically determined by the compiler, the hardware, or both.

•  Iterative execution

–  Body of loop to be iterated a large or infinite number of times
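The data-driven rule above can be illustrated with a small simulator. This is only a sketch: the function name `run_data_driven`, the chain X → Y → Z, and the rates (2, 3) and (1, 1) are assumptions chosen for illustration, and the "prefer downstream actors" policy is just one of many orders that the model allows.

```python
from collections import deque


def run_data_driven(actors, edges, max_firings):
    """Fire actors whenever their input edges hold enough tokens.

    actors: actor names listed from upstream to downstream.
    edges: dict mapping (src, dst) -> (tokens produced, tokens consumed) per firing.
    Returns the firing order chosen here and the tokens left on each edge.
    """
    fifo = {e: deque() for e in edges}  # one logical FIFO buffer per edge
    order = []

    def enabled(actor):
        # An actor is enabled when every input edge holds enough tokens.
        return all(len(fifo[e]) >= cons
                   for e, (_, cons) in edges.items() if e[1] == actor)

    for _step in range(max_firings):
        # Policy (an assumption): prefer downstream actors so buffers drain first.
        ready = [a for a in reversed(actors) if enabled(a)]
        if not ready:
            break  # nothing can fire: deadlock
        actor = ready[0]
        for e, (prod, cons) in edges.items():
            if e[1] == actor:
                for _token in range(cons):
                    fifo[e].popleft()
            if e[0] == actor:
                fifo[e].extend([actor] * prod)
        order.append(actor)
    return order, {e: len(q) for e, q in fifo.items()}


if __name__ == "__main__":
    edges = {("X", "Y"): (2, 3), ("Y", "Z"): (1, 1)}
    order, left = run_data_driven(["X", "Y", "Z"], edges, max_firings=7)
    print("".join(order), left)
```

A different policy would produce a different but equally legal order, which is the point of leaving the order out of the specification.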


DSP-oriented Dataflow Graphs

•  Vertices (actors) represent computational modules

•  Edges represent FIFO buffers

•  Edges may have delays, implemented as initial tokens

•  Tokens are produced and consumed on edges

•  Different models have different rules for production (SDF → fixed, CSDF → periodic, BDF → dynamic)

[Figure: chain dataflow graph with actors X, Y, Z and edges e1, e2; production rates p1,i and p2,i, consumption rates c1,i and c2,i, and the label 5 on one edge.]


Dataflow Production and Consumption Rates

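[Figure: chain dataflow graph with actors X, Y, Z and edges e1, e2; production rates p1,i and p2,i, consumption rates c1,i and c2,i, and the label 5 on one edge.]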


Dataflow Graph Scheduling

•  Assigning actors to processors, and ordering actor subsets that share common processors

•  Here, a “processor” means a hardware resource for actor execution on which assigned actors are time-multiplexed

•  Scheduling objectives include

–  Exploiting parallelism

–  Buffer management

–  Minimizing power/energy consumption


Background: Contemporary Architectures

Graphics Processing Units (GPUs)

Vector Operations in General Purpose Processors (GPPs)


Primary Contribution

A novel workflow for scheduling SDF graphs while taking into account

–  Actor execution times.

–  Efficient vectorization.

–  Heterogeneous multiprocessors.

Demonstration system

–  Applications described in a domain specific language.

–  Systematic integration of precompiled libraries.

–  Targeted to architectures consisting of GPPs and GPUs.


Previous Work

•  Automatic SIMDization [Hormati 2010]: Based on the StreamIt compiler.

•  Hierarchical models for SDR [Lin 2007]: Targeted towards special architectures.

•  Multiprocessor scheduling [Stuijk 2007]: Formulation towards special objectives.

•  Vectorization [Ritz 1992]: Single-processor block processing optimization.


DIF-GR-GPU Workflow

•  Start from a model-based application description.

•  Use tools to optimize scheduling, assignment.

•  Generate an accelerated system.

[Figure: DIF-GR-GPU workflow block diagram. Blocks and labels: application graph; throughput and latency constraints; dataflow scheduler; dataflow graph; platform description; actor profiles; multiprocessor scheduler; mapping and ordering schedule; library of actor implementations; GNU Radio engine; final implementation.]


Workflow Goals

•  Adequately make use of all sources of parallelism in order to utilize the underlying architecture.

Sources of parallelism:

–  Data Parallelism (prod and cons in SDF)

–  Task Parallelism (implicit in DFG)

–  Pipeline Parallelism (Looped schedules)

[Figures: a two-actor graph A → B with rates 100 and 1, and a three-actor graph with actors A, B, and C.]


SDF Scheduling Preliminaries

•  An SDF graph G = (V,E) has a valid (periodic) schedule if it is deadlock-free and is sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).

•  For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule.

[Figure: two-actor graph A → B; A produces 2 tokens per firing and B consumes 3.]

q(A) = 3, q(B) = 2

Some possible schedules: (1) AABAB (2) AAABB
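The repetition counts and the two schedules above can be checked mechanically. The sketch below solves the balance equations q(src) × prod = q(dst) × cons for a connected graph and then replays a schedule while counting tokens. The function names `repetitions` and `is_valid_schedule` are illustrative assumptions, and edges are assumed to carry no initial delays.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions(actors, edges):
    """Smallest positive integer solution of the SDF balance equations.

    edges: list of (src, dst, produced per firing, consumed per firing).
    Assumes the graph is connected.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, prod, cons = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("graph is not sample rate consistent")
            else:
                continue  # neither endpoint reached yet; retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational rates to the smallest integer vector.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


def is_valid_schedule(schedule, edges, q):
    """True if the firing sequence never underflows a buffer, fires each
    actor q[actor] times, and leaves every edge with its initial (zero) tokens."""
    tokens = {(s, d): 0 for s, d, _, _ in edges}
    fired = {a: 0 for a in q}
    for actor in schedule:
        for s, d, _prod, cons in edges:
            if d == actor:
                if tokens[(s, d)] < cons:
                    return False
                tokens[(s, d)] -= cons
        for s, d, prod, _cons in edges:
            if s == actor:
                tokens[(s, d)] += prod
        fired[actor] += 1
    return fired == q and all(t == 0 for t in tokens.values())


if __name__ == "__main__":
    edges = [("A", "B", 2, 3)]
    q = repetitions(["A", "B"], edges)
    print(q, is_valid_schedule("AABAB", edges, q), is_valid_schedule("AAABB", edges, q))
```

A sequence such as ABAAB is rejected because B would fire with only two tokens available.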


DIF-GR-GPU: Dataflow Scheduler

Objective:

–  Optimize exploitation of data and pipeline parallelism → higher throughput.

Flat Schedule: Executes an SDF graph as a cascade of distinct loops with no inter-actor nesting of loops.

Vectorization of a schedule S: A unique positive integer B, called the blocking factor of S, such that S invokes each actor v exactly B × q(v) times.

[Figure: an original SDF graph and the corresponding BPDAG for B = 10, annotated with data parallelism and pipeline parallelism; edge labels include 10, 20, and 60.]
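As a rough illustration of a flat, vectorized schedule, the sketch below builds one loop per actor with B × q(v) firings and reports the peak number of tokens each edge must buffer. The helper names and the two-actor example (A produces 2, B consumes 3, so q(A) = 3 and q(B) = 2) are assumptions for illustration, not the tool's actual interface.

```python
def flat_vectorized_schedule(order, q, blocking_factor):
    """One loop per actor, in the given (topological) order.

    Each actor v is fired blocking_factor * q[v] times back to back,
    so there is no nesting of loops across actors.
    """
    if blocking_factor < 1:
        raise ValueError("blocking factor must be a positive integer")
    return [(v, blocking_factor * q[v]) for v in order]


def peak_buffer_sizes(loops, edges):
    """Peak token count per edge when the loops run one after another.

    edges: list of (src, dst, produced per firing, consumed per firing).
    """
    tokens = {(s, d): 0 for s, d, _, _ in edges}
    peak = dict(tokens)
    for actor, count in loops:
        for s, d, prod, cons in edges:
            if d == actor:
                tokens[(s, d)] -= cons * count
                if tokens[(s, d)] < 0:
                    raise ValueError("loop order is not a valid topological order")
            if s == actor:
                tokens[(s, d)] += prod * count
                peak[(s, d)] = max(peak[(s, d)], tokens[(s, d)])
    return peak


if __name__ == "__main__":
    q = {"A": 3, "B": 2}
    loops = flat_vectorized_schedule(["A", "B"], q, blocking_factor=10)
    print(loops, peak_buffer_sizes(loops, [("A", "B", 2, 3)]))
```

The example shows the cost side of vectorization: buffer requirements grow linearly with the blocking factor.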


Heterogeneous Multiprocessor Scheduler

•  Objective: Utilize available multiprocessors in the platform.

•  Architecture description: The platform is described by a set P of processors and a set B of all-to-all communication buses.

•  Execution times depend on the blocking factor.

•  Every processor is assumed to have a shared memory.

[Figure: task parallelism on the target platform: units labeled 0, 1, …, N and GPU0, GPU1, …, GPU N connected by a communication bus.]


Scheduler Inputs

•  Architecture description: set P of processors and a set B of communication buses.

•  Application description: The application model (input BPDAG) consists of a set T of tasks, and a set E of edges.

•  Task and edge profiles: These profiles are described by two functions:

- RTP(t ∈ T, p ∈ P) → R defines the execution time of task t on processor p,

- REB(e ∈ E, b ∈ B) → R defines the execution time of edge e on bus b.

•  Dependency analysis: Task t1 is said to be dependent on task t2 if there is a path that starts at t1 and ends at t2 . If no such path exists between t1 and t2, then they are called parallel tasks. A similar concept can be applied to edges.
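The dependency analysis can be sketched as a reachability test on the task graph: two tasks are parallel when neither can reach the other. The function names and the four-task example (one source feeding two filters that join at a sink) are assumptions for illustration.

```python
from itertools import combinations


def reachable(tasks, edges):
    """Map each task to the set of tasks reachable from it along directed edges."""
    succ = {t: set() for t in tasks}
    for src, dst in edges:
        succ[src].add(dst)
    reach = {}
    for t in tasks:
        seen, stack = set(), [t]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach[t] = seen
    return reach


def parallel_pairs(tasks, edges):
    """Pairs of tasks with no path between them in either direction."""
    reach = reachable(tasks, edges)
    return [(a, b) for a, b in combinations(tasks, 2)
            if b not in reach[a] and a not in reach[b]]


if __name__ == "__main__":
    tasks = ["SRC", "F1", "F2", "SNK"]
    edges = [("SRC", "F1"), ("SRC", "F2"), ("F1", "SNK"), ("F2", "SNK")]
    print(parallel_pairs(tasks, edges))
```

Only the parallel pairs need an explicit ordering decision when they share a processor; dependent tasks are already ordered by the graph.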


Multiprocessor Scheduler

•  The basic scheduler functionality is to

–  Map every task to a given processor.

–  Order the execution of parallel actors assigned to the same processor.

–  “Zero” out the communication cost of collocated dependent actors.

•  The scheduler objective is: Minimize the latency L_B of B graph iterations.
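To make the three scheduler functions concrete, here is a deliberately simple exhaustive search, swapped in for the optimization-based scheduler that the talk describes. It tries every task-to-processor mapping, runs tasks on each processor in a fixed topological order (so ordering is not optimized), charges a transfer time only when dependent tasks sit on different processors, and ignores bus contention. All names and all timing numbers are hypothetical.

```python
from itertools import product


def best_mapping(tasks, edges, procs, run_time, comm_time):
    """Return (latency, mapping) with the smallest finish time of the last task.

    tasks: task names in topological order.
    edges: list of (src, dst) dependencies.
    run_time[(task, proc)]: execution time of task on proc.
    comm_time[(src, dst)]: transfer time if src and dst are on different processors.
    """
    best_latency, best_where = float("inf"), None
    for choice in product(procs, repeat=len(tasks)):
        where = dict(zip(tasks, choice))
        proc_free = {p: 0.0 for p in procs}
        finish = {}
        for t in tasks:
            p = where[t]
            # Collocated dependent tasks pay zero communication cost.
            data_ready = max(
                (finish[s] + (0.0 if where[s] == p else comm_time[(s, d)])
                 for s, d in edges if d == t),
                default=0.0,
            )
            start = max(proc_free[p], data_ready)
            finish[t] = start + run_time[(t, p)]
            proc_free[p] = finish[t]
        latency = max(finish.values())
        if latency < best_latency:
            best_latency, best_where = latency, where
    return best_latency, best_where


if __name__ == "__main__":
    tasks = ["SRC", "F1", "F2", "SNK"]
    edges = [("SRC", "F1"), ("SRC", "F2"), ("F1", "SNK"), ("F2", "SNK")]
    procs = ["GPP", "GPU"]
    # Hypothetical profile: filters are faster on the GPU, other tasks cost the same.
    cost = {"SRC": (1.0, 1.0), "F1": (3.0, 2.0), "F2": (3.0, 2.0), "SNK": (1.0, 1.0)}
    run_time = {(t, p): cost[t][i] for t in tasks for i, p in enumerate(procs)}
    comm_time = {e: 0.5 for e in edges}
    print(best_mapping(tasks, edges, procs, run_time, comm_time))
```

With these made-up numbers the all-GPP mapping takes 8.0, the all-GPU mapping 6.0, and a mixed mapping 5.0. The search space grows as |P| raised to the number of tasks, so this brute-force approach is only practical for very small graphs.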


MLP formulation

•  Why?

–  Offline analysis of SDF graphs.

–  Coarse grain nature of SDF graphs.

–  The solver gives a bound on the distance from the optimal solution.

•  Basic variables:

–  Mapping: XT[t, p] = 1 if task t is assigned to processor p; XT[t, p] = 0 otherwise.

–  Ordering: For all parallel tasks t1, t2 that are assigned to the same processor, YT[t1, t2] = 1 if t1 is ordered before t2; YT[t1, t2] = 0 otherwise.

–  Running time: RT[t] = actual (platform dependent) execution time of the task t depending on its mapping.

–  Start time: ST[t] is the start time for execution of task t.


MLP formulation (continued)

•  Constraints:

–  Assignment

–  Dataflow dependency

–  Zero cost communication

•  Objective: Minimize M
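The constraint equations themselves did not survive in this transcript. For orientation only, one generic way to write constraints of these three kinds with the variables XT, RT, ST and the profiles RTP, REB defined earlier is sketched below. It is an assumption-laden illustration, not the formulation from the talk: RE[e] (the communication time actually charged to edge e) is an invented name, a single bus b is assumed, and the ordering constraints on YT are omitted.

```latex
% Illustrative sketch only; not the equations from the original slide.
% RE[e] is an assumed auxiliary variable; a single bus b is assumed.
\begin{align*}
\text{Assignment:}\quad
  & \sum_{p \in P} XT[t,p] = 1
  && \forall t \in T \\
\text{Running time:}\quad
  & RT[t] = \sum_{p \in P} RTP(t,p)\, XT[t,p]
  && \forall t \in T \\
\text{Dataflow dependency:}\quad
  & ST[t_2] \ge ST[t_1] + RT[t_1] + RE[e]
  && \forall e = (t_1, t_2) \in E \\
\text{Zero cost communication:}\quad
  & RE[e] \ge REB(e,b)\,\bigl(XT[t_1,p] - XT[t_2,p]\bigr),\quad RE[e] \ge 0
  && \forall e = (t_1, t_2) \in E,\ p \in P \\
\text{Objective:}\quad
  & M \ge ST[t] + RT[t] \quad \forall t \in T
  && \text{minimize } M
\end{align*}
```

In this sketch the right-hand side of the communication constraint is zero for every p when both endpoints of e share a processor, and equals REB(e, b) for the processor of t1 otherwise, which is how collocated tasks end up with zero communication cost.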


DIF-GR-GPU Workflow

•  Start from a model-based application description using the dataflow interchange format (DIF)

•  Use tools to optimize scheduling, assignment.

•  Generate an accelerated system.


GPU interface GRGPU [Plishker 2011]

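[Figure: GRGPU build flow. A GPU kernel written in CUDA (.cu) passes through the CUDA synthesizer (nvcc) to give a GPU kernel in C++ (.cc). That file and a C++ block with a call to the GPU kernel (.cc) are built with libtool against the CUDA libraries (libcudart, libcutil) into a standalone Python package for GNU Radio. The block's work is done in device_work(). Example flowgraph: source → H2D → FIR (GPU accelerated) → D2H → sink.]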


MP-Sched Benchmark (GNU Radio)

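[Figure: MP-Sched benchmark graph. A source (SRC) feeds several parallel pipelines of FIR filters that join at a sink (SNK); the grid is parameterized by the number of stages and the number of pipelines.]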


Realization of a 2x5 MP-Sched Graph

Application: 2x5 mp-sched graph

Platform: 1 GPP (Intel Xeon CPU, 3 GHz), 1 GPU (an NVIDIA GTX 260), and a PCI bus

Blocking factor B: 2048

Amount of improvement    All GPP    All GPU
Analytical               55%        19%
Empirical                39%        21%


Latency vs. Throughput Trade-offs

[Figure: latency per iteration (primary axis) and overall latency of B iterations (secondary axis) versus blocking factor B, for B from 1K to 64K; the axis units are microseconds and milliseconds.]

Each point is an optimized assignment for the corresponding blocking factor.


MLP Solver Running Time

•  Problem written in MathProg.

•  Solved using the IBM ILOG CPLEX optimizer.

•  Run on an Intel Core 2 Duo processor at 3 GHz.


Summary

•  Diversity of platforms:

–  ASICs, FPGAs, DSPs, GPUs, GPPs

•  Complex application environments

–  GNU Radio

•  Exposing parallelism

–  Task, data, and pipeline

•  Difficult mapping problem

•  Multi-objective (throughput, latency)


To Probe Further …

•  G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Vectorization and mapping of software defined radio applications on heterogeneous multi-processor platforms. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 31-36, Beirut, Lebanon, October 2011.

•  G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Integration of dataflow-based heterogeneous multiprocessor scheduling techniques in GNU radio. Journal of Signal Processing Systems, 70(2):177-191, February 2013. DOI:10.1007/s11265-012-0696-0.


To Probe Even Further

•  Foreword by S. Y. Kung

•  Part 1: Applications

•  Part 2: Architectures

•  Part 3: Programming and Simulation Tools

•  Part 4: Design Methods

First edition, 2010; second edition, 2013


Acknowledgements

•  This research was supported in part by the Laboratory for Telecommunications Sciences.

•  For more details on this project, other projects in the Maryland DSPCAD Research Group, and associated publications: http://www.ece.umd.edu/DSPCAD/home/dspcad.htm.


References 1

•  [Bhattacharyya 2013] S. S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, editors. Handbook of Signal Processing Systems. Springer, second edition, 2013. ISBN: 978-1-4614-6858-5 (Print); 978-1-4614-6859-2 (Online).

•  [Plishker 2011] W. Plishker, G. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Applying graphics processor acceleration in a software defined radio prototyping environment. In Proceedings of the International Symposium on Rapid System Prototyping, pages 67-73, Karlsruhe, Germany, May 2011.

•  [Hormati 2010] A. H. Hormati, Y. Choi, M. Woh, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. MacroSS: macro-SIMDization of streaming applications. In Symposium on Architectural Support for Programming Languages and Operating Systems, pages 285-296, 2010.

•  [Lin 2007] Y. Lin, M. Kudlur, S. Mahlke, and T. Mudge. Hierarchical coarse-grained stream compilation for software defined radio. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems, pages 115-124, 2007.


References 2

•  [Stuijk 2007] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proceedings of the Design Automation Conference, 2007.

•  [Blossom 2004] E. Blossom. GNU radio: tools for exploring the radio frequency spectrum. Linux Journal, June 2004.

•  [Ritz 1992] S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.