
A Computational Stack for Cross-Domain Acceleration

Sean Kinzer  Joon Kyung Kim  Soroush Ghodrati  Brahmendra Yatham  Alric Althoff*  Divya Mahajan†  Sorin Lerner  Hadi Esmaeilzadeh

Alternative Computing Technologies (ACT) Lab, University of California San Diego   *Tortuga Logic   †Microsoft

{skinzer, jkkim, soghodra, byatham}@eng.ucsd.edu   alric@tortugalogic.com   divya.mahajan@microsoft.com   lerner@eng.ucsd.edu   hadi@eng.ucsd.edu

Abstract—Domain-specific accelerators obtain performance benefits by restricting their algorithmic domain. These accelerators utilize specialized languages constrained to particular hardware, thus trading off expressiveness for high performance. The pendulum has swung from one hardware for all domains (general-purpose processors) to one hardware per individual domain. The middle-ground on this spectrum–which provides a unified computational stack across multiple, but not all, domains–is an emerging and open research challenge. This paper sets out to explore this region and its associated tradeoff between expressiveness and performance by defining a cross-domain stack, dubbed PolyMath. This stack defines a high-level cross-domain language (CDL), called PMLang, that in a modular and reusable manner encapsulates mathematical properties to be expressive across multiple domains–Robotics, Graph Analytics, Digital Signal Processing, Deep Learning, and Data Analytics. PMLang is backed by a recursively-defined intermediate representation, called srDFG, allowing simultaneous access to all levels of operation granularity. Accelerator-specific or domain-specific IRs commonly capture operations in the granularity that best fits a set of Domain-Specific Architectures (DSAs). In contrast, the recursive nature of the srDFG enables simultaneous access to all the granularities of computation for every operation, thus forming an ideal bridge for converting to various DSA-specific IRs across multiple domains. Our stack unlocks multi-acceleration for end-to-end applications that cross the boundary of multiple domains, each comprising different data and compute patterns.

Evaluations show that by using PolyMath it is possible to harness accelerators across the five domains to realize an average speedup of 3.3× over a Xeon CPU along with an 18.1× reduction in energy. In comparison to Jetson Xavier and Titan Xp, cross-domain acceleration offers 1.7× and 7.2× improvement in performance-per-watt, respectively. We measure the cross-domain expressiveness and performance tradeoff by comparing each benchmark against its hand-optimized implementation, achieving 83.9% and 76.8% of the optimal performance for single-domain algorithms and end-to-end applications, respectively. For the two case studies of end-to-end applications (comprising algorithms from multiple domains), results show that accelerating all kernels offers an additional 2.0× speedup over CPU, 6.1× improvement in performance-per-watt over Titan Xp, and 2.8× speedup over Jetson Xavier compared to only the one most effective single-domain kernel being accelerated. Finally, we examine the utility and expressiveness of PolyMath through a user study, which shows that, on average, PolyMath requires 1.9× less time to implement algorithms from two different domains with 2.5× fewer lines of code relative to Python.

Index Terms—Accelerator design, Machine learning systems, Algorithms/System Co-design, Domain Specific Language

Fig. 1: Emerging tradeoff: Expressiveness vs. Performance. (Figure content: axes of Expressiveness and Performance; General-Purpose Processors at high expressiveness and low performance; Domain-Specific Accelerators such as DeCO, TABLA, RoboX, TVM, Darwin, Graphicionado, and BCP Acc at high performance within single domains; PolyMath spanning Robotics, DSP, Data Analytics, Graph Analytics, and Deep Learning; Genomics and SAT Solving marked as open research.)

I. INTRODUCTION

End-to-end applications ranging from delivery drones [1] to smart speakers [2] cross multiple domains. One such application senses the environment, (1) pre-processes the sensory data, and feeds it to a (2) perception module that in turn invokes a (3) decision-making process to determine actions. Perception is currently reigned by Machine Learning (ML), which has attracted significant attention, but applications are not just ML. Sensory data processing relies on algorithms from Digital Signal Processing (DSP), while Control Theory and Robotics bring forth the final action that may also feed the perception module. Even though these domains work in tandem to realize an entire application, they are becoming isolated by the current push towards Domain-Specific Accelerators (DSAs). On the one hand, these accelerators trade off generality for performance and energy efficiency by restricting programmability to a single domain [3]. On the other hand, the traditional general-purpose computational stack cannot meet the computational demands of emerging applications [4–6]. Although DSAs bridge this performance gap, they make implementation an arduous task of dealing with isolated programming interfaces. Thus, expressiveness is becoming limited, making the composition of an end-to-end application a major challenge for execution on accelerators.


As a consequence, users seeking to create compute-intensive applications composed of algorithms from different domains must either use a lower-performance, general-purpose processor or bear the burden of manually stitching together various domain-specific accelerators.

This emerging challenge creates a new tradeoff between performance and expressiveness, illustrated in Figure 1. On one extreme, we have General-Purpose Processors that allow expressing all domains at the cost of performance and/or efficiency [3]. On the other extreme are domain-specific accelerators [7–13] that can only support a single domain executed on one particular specialized architecture, and are thus very performant. Even though certain DSAs offer computational stacks, composing an end-to-end application that crosses the boundary of many domains requires intimate knowledge of multiple different interfaces and various hardware accelerators to obtain high performance. While recent efforts such as Graphicionado [14], RoboX [15], TVM [16], and TABLA [17] aim to unify high-level coding within a single domain, a cross-domain stack for accelerators still remains an open challenge (Figure 1).

As Figure 1 illustrates, by addressing this challenge, PolyMath defines a new point in the Expressiveness vs. Performance design space. The ingredients of our approach are:
1) Exploiting the mathematical similarities across domains to design a modular and reusable language, PMLang, that is expressive across Robotics, Graph Analytics, DSP, Data Analytics, and Deep Learning. It offers a one-to-one mapping between code and mathematical formulation while retaining modularity, thus making it familiar to both domain experts and software engineers. PMLang offers light-weight type modifiers based on domain semantics to enable accelerators to handle on-chip and off-chip data allocation, storage, and transfer. PMLang is not an abstraction over existing domain-specific languages; rather, it explores a novel dimension of designing a unified, stand-alone language across multiple domains. This unique dimension defines a class of languages that we refer to as Cross-Domain Languages (CDLs). To enable cross-domain acceleration, PMLang and its associated compilation stack take an initial step toward bridging CDLs with domain-specific architectures that are constrained to a single domain.

2) To preserve expressiveness and provide flexibility for compilation to different accelerators, we devise an intermediate representation that is a recursively-defined dataflow graph providing simultaneous access to all levels of operation granularity (srDFG). The compute granularity of kernels is not uniform across different accelerators, which must be accommodated in cross-domain settings. Thus, we define the srDFG recursively, in that its nodes are also srDFGs. As such, the srDFG uniquely offers simultaneous access to various levels of computation granularity within a single program, thus enabling the use of different accelerators. This capability enables cross-domain multi-acceleration: acceleration of a cross-domain application on different accelerators. The flexible, recursive nature of our IR is shown in Figure 2; its edges preserve the type modifier metadata.

3) PolyMath uses a modular compilation framework that conveniently enables creation and application of pipelined compilation passes on the srDFG. To convert srDFGs to executable accelerator code, PolyMath offers a graph lowering algorithm with a conversion strategy that uses metadata embedded in the srDFG edges to flexibly translate the graph nodes. The lowering algorithm applies transformations which produce a new srDFG made up of compute kernels at the same granularity as the target accelerator. Once lowered, the metadata associated with the srDFG edges is translated to the accelerator's own IR for final binary generation through its own scheduling and mapping framework.

4) Last but not least, PolyMath will be the very first extensible, modular, and open-source computation stack to enable the community to innovate and explore the impending challenges of cross-domain acceleration at a time when domain-specific compilation stacks, except for DNNs, are elusive. We have made the entire codebase available in a public repository (https://github.com/he-actlab/polymath). Although there are many accelerators in the literature, many of their compilation stacks are not available. By making PolyMath open-source and extensible, the community can add other domains which align with the core mathematical constructs in PMLang.

We show PolyMath's balance of expressiveness and performance by compiling twelve different algorithms across robotics, graph analytics, digital signal processing, data analytics, and deep learning. These workloads achieve an overall speedup of 3.3× over a Xeon CPU along with an 18.1× reduction in energy. In comparison to Titan Xp and Jetson Xavier GPUs, cross-domain acceleration offers 7.2× and 1.7× energy reduction. We next measure the tradeoff of cross-domain algorithm expression and find that PolyMath achieves 83.9% of the performance of the same algorithms implemented in their native stack's language. We also study two end-to-end applications that cross multiple domains. Accelerating all the kernels offers an additional 2.0× speedup over CPU, 6.1× additional improvement in energy over Titan Xp, and 2.8× speedup over Jetson Xavier versus when only the single most effective kernel is accelerated. The results show that when only a part is accelerated, the slower non-accelerated kernels dictate the overall improvement, whereas when all the algorithms in the application are accelerated, Amdahl's burden reduces and the improvement of all the domains is magnified. To evaluate the usability and expressiveness of PMLang relative to Python, we conducted a user study and found that, on average, PMLang required 1.9× less time to implement algorithms with 2.5× fewer lines of code. Finally, end-to-end performance of PolyMath is 76.8% of the performance of two manually implemented end-to-end applications. Given that PolyMath offers greater ease of programming compared to Python, the automation overhead of 23.1% (=100%-76.8%) is a fair bargain.


Fig. 2: Visual representation of PolyMath's srDFG. (Figure content: a dataflow graph of scalar operations such as +, -, /, sin, ln, and abs, nested at multiple levels of granularity.)

II. PMLANG: MATHEMATICAL PROGRAMMING INTERFACE

PMLang is designed to encapsulate the mathematical properties of its target domains, as they are tied together by similar operations on multi-dimensional data, include minimal control flow, and share use-cases such as cyber-physical systems. Consider the following application in its entirety: an end-to-end neuroscience application requires multiple domains to study the impact of deep brain stimulation on movement disorders and goes through the following steps: (1) convert raw electrocorticographic (ECoG) brain signals to the frequency domain using the fast Fourier transform (FFT); (2) apply logistic regression to classify these frequency-domain signals into various biomarkers; (3) based on the classification, use model predictive control to send an optical stimulation back to the brain. This application crosses three domains, DSP, Data Analytics, and Control Theory, in each iteration to generate deep brain stimulation signals.

There are numerous domain-specific architectures for each of these algorithms/domains individually; however, using them for this application would require writing each part of the application in a different DSL, compiling them separately, and manually joining their executables. Instead, PMLang allows users to write their application as a single program, thus eliminating the overhead of stitching together stacks to execute the program across multiple domain-specific architectures.

Keeping the properties of the target domains in mind, PMLang is designed to reduce the time to code a mathematical expression into a formula-based textual format, enabled by language constructs for modularity and light-weight type modifiers. Moreover, for code organization and reduction in implementation time, PMLang includes reusable execution code blocks called components that perform operations on flows of data. These components encapsulate a task comprised of other components and/or mathematical expressions which use traditional, imperative syntax to facilitate familiarity for experienced programmers. For modularity and reusability, components have distinct boundaries and arguments which are distinguished by type modifiers consisting of input, output, state, and param, each of which is associated with how the component will use the argument, as shown in Table I. By using type modifiers in component arguments to explicitly identify data semantics, PMLang binds operations to the data being operated on, which accelerators can take advantage of.

Fig. 3: Visual of trajectory tracking for a wheeled robot.

The remainder of the section delves into the details of PMLang constructs through an example. For brevity, we show the PMLang program for Model Predictive Control (MPC) from Control Theory used for Robotics. MPC attempts to solve a constrained optimization problem over a finite sequence of inputs. MPC can also be used for the aforementioned brain stimulation application. We present MPC in the context of a mobile two-wheeled robot performing trajectory tracking, shown in Figure 3. Here, the sensors send the current state of the robot as inputs to a model that predicts the next location, then a sequence of control signals for moving the robot to the predicted location is optimized. The program finds the optimal sequence of control signals, ctrl_mdl, over a finite period of time to match a reference trajectory, pos_ref, that specifies the position and orientation of the robot. At each point in time, the actual position and orientation of the robot is input to the algorithm, which optimizes ctrl_mdl and sends the next control signal, ctrl_sgnl, back to the robot. This process is summed up in the following steps:

Input: pos → the (x, y, θ) orientation of the robot.
Output: ctrl_sgnl → (ν, ω) control consisting of the velocity and angular velocity to be sent to the robot.
State: u → (ν, ω) linear and angular velocities across a pre-determined time horizon h.
Step 1: Make a prediction. Using input pos and cost matrices P and H, predict the position and angle of the robot across horizon h.
Step 2: Compute the gradient of the objective function. Calculate the error on the predicted position and orientation, pos_pred, using pre-computed gradient coefficient matrices HQ_g and R_g.
Step 3: Update the control model and send the output signal. Using the gradient, g, update the control model and send the output control signal, ctrl_sgnl.

A. Components

Components form the building blocks of PMLang and are used to delineate different parts of the program into multiple levels of execution. To delineate the access semantics for each argument of a component, PMLang uses type modifiers (input, output, state, param). Using type modifiers relieves programmers from concern about the underlying accelerator-specific mechanisms for data exchange between different components. An input argument is used to feed data into the component and is read-only; an output argument is used to return data from a component and can only be written to; a state argument can be read or written to, and represents data that is part of the state of the component, thus preserved across invocations/iterations; and a param argument is a constant that is used to parameterize the component.


1  predict_trajectory(input float pos[a],
2                     input float ctrl_mdl[b],
3                     param float P[c][a],
4                     param float H[c][b],
5                     output float pred[c])
6      index i[0:a-1], j[0:b-1], k[0:c-1];
7      pred[k] = sum[i](P[k][i]*pos[i]);
8      pred[k] = pred[k] + sum[j](H[k][j]*ctrl_mdl[j]);
9
10 update_ctrl_model(input float ctrl_prev[b],
11                   input float g[b],
12                   output float ctrl_mdl[b],
13                   output float ctrl_sgnl[s],
14                   param int h)
15     index i[0:b-2], j[0:s-1];
16     ctrl_sgnl[j] = ctrl_mdl[h*j];
17     ctrl_mdl[(h-1)*j] = 0;
18     ctrl_mdl[i] = ctrl_prev[(i+1)*h] - g[(i+1)*h];
19
20 mvmul(input float A[m][n],
21       input float B[n],
22       output float C[m])
23     index i[0:n-1], j[0:m-1];
24     C[j] = sum[i](A[j][i]*B[i]);
25
26 compute_ctrl_grad(input float pos_pred[c],
27                   input float ctrl_mdl[b],
28                   input float pos_ref[c],
29                   param float HQ_g[b][c], // Input Cost Gradient
30                   param float R_g[b][b],  // Cost Inverse Hessian
31                   output float g[b])
32     index i[0:b-1], j[0:c-1];
33     float P_g[b], H_g[b];
34     err[j] = pos_ref[j] - pos_pred[j];
35     mvmul(HQ_g, err, P_g);
36     mvmul(R_g, ctrl_mdl, H_g);
37     g[i] = P_g[i] + H_g[i];
38
39 main(input float pos[3],
40      state float ctrl_mdl[20],
41      param float pos_ref[30],
42      param float P[30][3],
43      param float HQ_g[20][30],
44      param float H[30][20],
45      param float R_g[20][20],
46      output float ctrl_sgnl[2])
47     float pos_pred[30], g[20];
48     index i[0:9], j[0:1];
49     RBT: predict_trajectory(pos, ctrl_mdl, P, H, pos_pred);
50     RBT: compute_ctrl_grad(pos_pred,ctrl_mdl,pos_ref,HQ_g,R_g,g);
51     RBT: update_ctrl_model(ctrl_mdl, g, ctrl_mdl, ctrl_sgnl,10);

Fig. 4: MPC for MobileRobot trajectory tracking in PMLang.

These type modifiers describe whether or not the data will be re-used (state), kept unchanged (param), or used once and discarded (input/output). As an example, line 1 shows that argument pos is an input to the predict_trajectory component. As another example, line 41 shows a state argument named ctrl_mdl, which indicates that ctrl_mdl is used, updated, and shared across invocations of main, matching the MPC semantics of optimizing the control model over a series of time steps. Type modifiers also enable custom accelerators to place input data such as pos in read-only FIFO buffers to reduce data communication overhead and hardware memory logic, or to store state data such as ctrl_mdl on-chip on the accelerator for fast repeated data accesses. Using a single set of type modifiers to describe data semantics across multiple domains unifies program implementation for end-to-end applications. This is exemplified in robotics and deep learning, where one domain uses "model" and the other "weight" to describe the same data semantics, both of which are described as state data in PMLang.
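To make the mapping from modifiers to storage decisions concrete, the following Python sketch pairs each modifier with a plausible on-accelerator storage strategy. It is illustrative only, not PolyMath's actual backend logic; the placement choices are assumptions drawn from the description above.

    # Illustrative sketch (not PolyMath's backend): mapping PMLang type
    # modifiers to hypothetical on-accelerator storage decisions.
    from enum import Enum

    class Modifier(Enum):
        INPUT = "input"    # read-only, streamed in once
        OUTPUT = "output"  # write-only, streamed out once
        STATE = "state"    # read/write, persists across invocations
        PARAM = "param"    # compile-time constant

    def placement(modifier: Modifier) -> str:
        """Pick a storage strategy for an argument based on its modifier."""
        return {
            Modifier.INPUT: "read-only input FIFO",
            Modifier.OUTPUT: "write-only output FIFO",
            Modifier.STATE: "persistent on-chip buffer",
            Modifier.PARAM: "constant baked into the configuration",
        }[modifier]

    print(placement(Modifier.STATE))  # persistent on-chip buffer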

In addition to being reusable, these components allow users to conceive their program as a collection of sub-steps at varying levels of granularity, making it adaptable for compilation to different accelerators. To instantiate a component, the programmer specifies its name and arguments. An example of component instantiation is shown in lines 49-51, where predict_trajectory, compute_ctrl_grad, and update_ctrl_model are instantiated. Each instantiation creates a copy of the component, as if it were inlined. This is in contrast to conventional languages that rely on a function call stack, which is sequential in nature. Instead, inlining enables the program to be mapped to our srDFG IR, which preserves opportunities for parallelism based on dataflow dependencies.

B. Index Variables

PMLang is based on mathematical notations that do not use for loops and instead use indices (e.g., $\sum_{i=0}^{n-1} A_i$). To simplify programming based on formulae, PMLang uses index variables to concisely specify operations performed over ranges of multi-dimensional data without using explicit for loops. In its most basic form, an index variable represents a range of integers, specified by its lower and upper bounds. Line 48 shows two such index variable declarations: i and j. This approach reveals the inherent parallelism in mathematical formulae since operations expressed this way are naturally vectorizable without performing any loop transformations. For example, below is a PMLang statement iterating over all js, each of which can be performed in parallel.

index j[0:s-1];
err[j] = pos_ref[j] - pos_pred[j];

Strided indexing. To support strided/non-sequential indexing (e.g., convolution), PolyMath also supports arithmetic operations on index variables, as shown below.

ctrl_mdl[i] = ctrl_prev[(i+1)*h] - g[(i+1)*h];

Boolean conditional over indexing. Unlike a domain-specific language such as TABLA [17] that focuses solely on data analytics, PMLang allows Boolean conditionals to be applied to indices, which provides support for other domains such as graph analytics and robotics. For instance, the following computes the sum of the non-diagonal parts of the matrix A:

index i[0:N-1], j[0:M-1];
res = sum[i][j: j != i](A[i][j]);
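A NumPy analogue of this conditional reduction (again, only an illustration of the semantics, not PMLang itself) masks out the diagonal before summing:

    # Sum A over all (i, j) with j != i, i.e. everything except the diagonal.
    import numpy as np

    N, M = 4, 4
    A = np.arange(N * M, dtype=float).reshape(N, M)

    i = np.arange(N)[:, None]
    j = np.arange(M)[None, :]
    res = A[j != i].sum()         # sum over i, j with j != i of A[i][j]

    assert np.isclose(res, A.sum() - np.trace(A))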

Support for Boolean conditionals and non-sequential index variables flexibly incorporates common as well as domain-specific characteristics of algorithms across robotics, graph analytics, DSP, data analytics, and deep learning. These features distinguish PMLang from DSLs which either (1) use concise operations expressed through a fixed API (e.g., named functions such as "dense" in TVM [16]), or (2) simply do not support these constructs since the specific target domain does not require them (e.g., TABLA [17]).

C. Mathematical Operations

Index variables allow for a nearly one-to-one mapping between mathematical notation and PMLang code. PMLang offers standard mathematical operators to be used with multi-dimensional data, expressed in a single statement by using index variables.


PMLang's syntax for the math expression $C_j = \sum_{i=0}^{n} A_{j,i} \times B_i$ is:

C[j] = sum[i](A[j][i]*B[i]);
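For readers who prefer Python, the same expression corresponds to an ordinary matrix-vector product; the NumPy sketch below is only a semantic analogue, not PMLang.

    # C[j] = sum_i A[j][i] * B[i] is a standard matrix-vector product.
    import numpy as np

    m, n = 3, 5
    A = np.random.rand(m, n)
    B = np.random.rand(n)

    C = np.einsum("ji,i->j", A, B)    # explicit index form of the sum
    assert np.allclose(C, A @ B)      # identical to the matrix-vector product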

Non-linear operations. PMLang includes a set of built-in functions commonly used in math expressions across the multiple target domains, including non-linear operations such as cosine/sine (DSP, robotics), gaussian (robotics, DSP, data analytics), sigmoid/ReLU (deep learning, data analytics), etc. Including non-linear operations as part of PMLang simplifies algorithm expression and allows PolyMath to leverage the performance benefits of non-linear compute units in custom accelerators.
Reduction operations. PMLang is also equipped with built-in group reduction operations such as sum, prod, and max to calculate the summation ($\sum$), product ($\Pi$), or maximum value of a sequence of numbers. These group reduction operations are converted to srDFG with two levels of granularity: (1) the outer group DFG node that (2) encapsulates the scalar inner operations (nodes). This multi-granular representation enables the compiler to map the outer encompassing node to a dedicated unit if the accelerator harbors it. Otherwise, the inner basic nodes are mapped to individual ALUs. This crucial flexibility is a unique feature of PolyMath and enables it to be cross-domain and target different accelerators.
Custom reduction operations. PMLang also supports custom group reduction operations, as they are commonly used in graph analytics and DSP algorithms. Custom reduction operations can be defined in PolyMath by specifying the arithmetic for a given set of input arguments. Below is an example of the definition of the min reduction function and its use to find the minimum value of the matrix A:

reduction min(a,b) = a < b ? a : b;
res = min[i][j](A[i][j]);
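The built-in and custom reductions above have straightforward Python analogues; the sketch below mirrors their semantics with NumPy and functools and is illustrative only.

    # NumPy / functools analogues of PMLang's built-in and custom reductions.
    import numpy as np
    from functools import reduce

    A = np.random.rand(4, 6)

    total = A.sum()               # sum[i][j](A[i][j])
    product = A.prod()            # prod[i][j](A[i][j])
    largest = A.max()             # max[i][j](A[i][j])

    # Custom reduction: reduction min(a, b) = a < b ? a : b;
    res = reduce(lambda a, b: a if a < b else b, A.ravel())
    assert np.isclose(res, A.min())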

D. Domain Annotations

PMLang uniquely targets multiple domains, each of which is eventually accelerated with a Domain-Specific Architecture. As such, PMLang offers a light-weight mechanism to specify the target domain for only top-level component instantiations without tying it to a specific accelerator. All of the code within a component also inherits the same domain, which alleviates the programmer from having to annotate all component instances in their program. This is done by simply adding one of the five keywords: RBT (Robotics), GA (Graph Analytics), DSP (Digital Signal Processing), DA (Data Analytics), and DL (Deep Learning), as demonstrated below:

RBT: predict_trajectory(pos, ctrl_mdl, P, H, pos_pred);

III. SIMULTANEOUS-RECURSIVE DATAFLOW GRAPH

Accelerator-specific or domain-specific IRs commonly capture operations in the granularity that best fits the target architecture. One of the major challenges that PolyMath faces is targeting multiple domains whose accelerators each operate on different granularities of computation. Even within a single domain, various architectures accept computation in different granularities.

TABLE I: A subset of PMLang's keywords and definitions.

Language Construct | Keyword                       | Description
Component          | string name                   | Takes input, produces output, and reads/writes to state arguments
Domain             | RBT, GA, DSP, DA, DL          | Specifies a component's target domain
Type Modifiers     | input                         | Flow of data; can be exclusively read from within a component scope
Type Modifiers     | output                        | Flow of data; can be exclusively written to within a component scope
Type Modifiers     | state                         | Flow of data; can be written to or read from within a component scope
Type Modifiers     | param                         | Constant parameter used to parameterize a component
Index Types        | index                         | Specifies ranges of operations
Types              | bin, int, float, str, complex | Data types used for variable declarations

To address this challenge, we designed srDFG, an intermediate representation that is a recursively-defined dataflow graph and provides simultaneous access to each level of recursion. Our srDFG enables the compiler to simultaneously access all the granularities of computation for every component, thus forming the ideal bridge for converting to various accelerator-specific IRs. Furthermore, the srDFG enables multi-acceleration for end-to-end applications that cross the boundary of multiple domains with different data and compute patterns across Robotics, Graph Analytics, DSP, Data Analytics, and Deep Learning. Next, we describe the srDFG structure using Figure 5, a visual representation of the MobileRobot algorithm described in Section II.

A. srDFG Definitions

An srDFG is defined as a pair (N, E) of nodes N representing PMLang operations and edges E representing input or output operands. An srDFG node n ∈ N is a pair (name, srdfg) of a string representing the name of an operation and an srDFG composing its lower-granularity operations. Each numbered box in Figure 5 represents the srdfg for different nodes at varying granularities within the MobileRobot algorithm. As shown in box 2, both the subtraction operation and the mvmul component are nodes. An edge e ∈ E is a tuple (src, dst, md) of source and destination nodes and the edge metadata md. Edge metadata consists of the type, type modifier, and shape of the operand associated with the edge. For math operations, input and output edges represent operands and results, whereas adjacent edges in component instantiations represent state, input, and output arguments. This is illustrated in box 1, where pos is the input argument, ctrl_sgnl is the output argument, and ctrl_mdl is the state argument that creates a cycle for multiple iterations. Given a node n, we denote the name and srdfg as n.name and n.srdfg, and similarly denote src, dst, and md in an edge e as e.src, e.dst, and e.md. Lastly, the domain annotations previously described are translated to the srdfg.domain attribute for each srdfg.
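The following Python sketch captures these definitions as data structures; the field names follow Section III-A, while the class layout itself is our illustration rather than PolyMath's actual implementation.

    # A minimal Python sketch of the srDFG structures defined above.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class EdgeMetadata:
        dtype: str                   # operand type (e.g. "float")
        modifier: str                # input / output / state / param
        shape: tuple                 # operand shape

    @dataclass
    class Node:
        name: str                    # operation name (e.g. "mvmul", "+")
        srdfg: Optional["SrDFG"]     # lower-granularity definition (None for scalar ops)

    @dataclass
    class Edge:
        src: Node
        dst: Node
        md: EdgeMetadata

    @dataclass
    class SrDFG:
        nodes: List[Node] = field(default_factory=list)
        edges: List[Edge] = field(default_factory=list)
        domain: Optional[str] = None  # RBT, GA, DSP, DA, or DL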

B. srDFG Semantics

As an example, the srDFG shown in box 3 of Figure 5 will begin operations when the data in edges R_g and ctrl_mdl are ready.


Fig. 5: Overview of the srDFG of the MobileRobot algorithm, including zoomed-in views of its multiple levels of recursion. (Figure content: (1) the main component with predict_trajectory, compute_grad, and update_ctrl nodes connected through pos, pos_pred, g, ctrl_mdl, and ctrl_sgnl; (2) the compute_grad component with mvmul nodes over HQ_g, R_g, err, and ctrl_mdl; (3) the mvmul component; (4) its element-wise multiplications of R_g and ctrl_mdl; (5) the sum operation.)

Each srDFG is a statically defined graph representing a single instantiation on its input values, with each component instantiation or operation getting its own srDFG, which allows for computing context-sensitive information. As an example, box 2 in Figure 5 shows two unique nodes and pairs of input edges for the mvmul component; as a result, each instantiation of mvmul gets its own node and srDFG. Box 2 also shows how edges propagate their metadata to the lower-granularity nodes, as the shapes of R_g and ctrl_mdl determine the number of element-wise multiplication nodes in box 4. The type modifier included in edge metadata can change depending on its srDFG, as shown in box 1, where the ctrl_mdl edge is a reusable state but is an input edge in box 2.

C. Enabling Different Accelerators

Custom accelerators support a unique set of operations performed on a variety of differently typed and shaped inputs and outputs. To ensure flexibility, each srDFG includes operations as nodes, n, as well as the more fine-grained operations that define each node, n.srdfg. To illustrate this point, each srDFG in Figure 5 represents a possible operation supported by an accelerator. If box 2 is supported by the target accelerator, the srDFG can be transformed to consist only of the operations represented by each node in box 1. If box 2 is not supported but box 3 is supported, then the operations in both box 1 and box 2 will be selected for compilation. PolyMath's base unit of lowering is a node, and if the nodes in the srDFG cannot be lowered to a specific hardware because of unsupported nodes, the compilation fails for that accelerator.

Each of these accelerator operations is closely tied to the types and shapes of its operands. The srDFG uses the edge metadata to specify the operand information when compiling a node to an accelerator operation. For example, an accelerator might support the element-wise multiplication in box 4, but require the number of elements being multiplied to perform the operation. Each input edge to box 3 includes the shape as part of its metadata, which allows for compilation of box 4. Domain-specific accelerators differentiate how data is stored by receiving information from the programmer on how the variables are used. For example, the ctrl_mdl edge in box 1 has the state type modifier, which causes the accelerator to store the data local to the component.

Fig. 6: Graph Analytics algorithm compilation starting from (a) a PMLang program:

v_temp[V] = reduce[u: u == V][v](process(e_w[u][v], v_p[u]));
v_p[V] = apply(v_temp[V], v_p[V]);

compiled to (b) an srDFG, which is lowered and converted to (c) Graphicionado [14] pipeline block IR.

IV. COMPILATION FRAMEWORK

PolyMath performs compilation in three steps: (1) compilation from PMLang to an srDFG; (2) lowering the srDFG to the granularity of the different domains and converting it to the accelerator's IR; (3) invoking the accelerator's provided compiler to generate the final binaries. Figure 6 illustrates these phases for a PMLang graph analytics implementation which is compiled to an srDFG, and then lowered and converted to GRAPHICIONADO's pipeline IR.

A. srDFG Generation

For each PMLang program, compilation forms an Abstract Syntax Tree (AST) using syntax analysis. The program AST is then traversed and a symbol table S is created, storing information contained in each component. The component information consists of variable names and variable metadata md (e.g., edge type, type modifier, and shape). For each component, a DFG is formed by stitching statements together using static single assignment. Lower-level srDFG operations are formed for element-wise operations or group operations, which means edges may represent both scalar and multi-dimensional values.

After completing the AST traversal that generates srDFGs from PMLang statements, a single srDFG is generated starting with the highest-level component, main. Previously, component statements were skipped because all component srDFGs had not yet been created. When the main srdfg is traversed, a component node nc is created using the previously skipped component statements and their argument type modifiers, preserving the domain annotation as the nc.srdfg.domain attribute. Edges adjacent to nc are added to the srdfg by using the type modifiers for arguments in the component signature of nc, where type modifiers become (1) in/out edges for input/output arguments, and (2) edges e = (src, dst, md) where src = dst for state arguments. This process is repeated for each component statement recursively, creating nodes and edges that generate the lower levels of operation granularity for each component, which inherit the top-level domains.
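The edge-creation rule for component arguments can be sketched as follows in Python, reusing the srDFG classes from the sketch in Section III-A; the helper function and its argument encoding are hypothetical, with argument names taken from the MPC example.

    # Illustrative sketch of the edge-creation rule: input and output arguments
    # become in/out edges of the component node, while a state argument becomes
    # a self-edge (src == dst) that persists across invocations.
    def add_component_node(graph, name, args):
        """args: list of (arg_name, modifier, dtype, shape) tuples."""
        nc = Node(name=name, srdfg=None)
        graph.nodes.append(nc)
        for arg_name, modifier, dtype, shape in args:
            md = EdgeMetadata(dtype=dtype, modifier=modifier, shape=shape)
            var = Node(name=arg_name, srdfg=None)
            if modifier == "input":
                graph.edges.append(Edge(src=var, dst=nc, md=md))
            elif modifier == "output":
                graph.edges.append(Edge(src=nc, dst=var, md=md))
            elif modifier == "state":
                graph.edges.append(Edge(src=nc, dst=nc, md=md))  # cycle across invocations
        return nc

    g = SrDFG(domain="RBT")
    add_component_node(g, "update_ctrl_model",
                       [("g", "input", "float", (20,)),
                        ("ctrl_sgnl", "output", "float", (2,)),
                        ("ctrl_mdl", "state", "float", (20,))])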

B. Example srDFG Passes

PolyMath implements a modular framework and set of APIs that enable custom, target-independent passes over the IR. These passes take an srDFG as input and produce a transformed srDFG. This feature conveniently enables applying pipelines of passes on the same IR. Traditional passes such as constant propagation, constant folding, etc. are also supported via this PolyMath pass infrastructure. We cover one such compiler pass below.
Algebraic combination. Transformation passes in PolyMath benefit from simultaneous access to all levels of operation granularity for a program. This bolsters traditional compiler passes such as algebraic simplification that are typically limited by single-granularity IRs, which hide opportunities for simplification. In contrast, a PolyMath pass can identify hidden algebraic simplifications which span multiple levels of granularity and would remain obscured in other, flat IRs. As an example, if an srDFG with a top-level matrix-vector multiplication is added to the output of another matrix-vector operation contained in another node's subgraph, the matrix-vector operations can be fused together by concatenating their inputs. This transformation opportunity remains unidentified in flat IRs, but PolyMath uniquely reveals these transformation prospects by preserving a program's multi-granularity in the srDFG and supporting transformations that cross granular boundaries.
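A minimal sketch of such a pass pipeline is shown below, reusing the SrDFG class from the earlier sketch. The pass names and the pipeline helper are hypothetical, and the pass bodies are placeholders that only indicate where a real transformation would act (e.g., fusing P·pos + H·ctrl_mdl into a single product over concatenated operands).

    # A pass maps an srDFG to a transformed srDFG; a pipeline is composition.
    from typing import Callable, List

    Pass = Callable[[SrDFG], SrDFG]

    def run_pipeline(graph: SrDFG, passes: List[Pass]) -> SrDFG:
        for p in passes:
            graph = p(graph)
        return graph

    def constant_fold(graph: SrDFG) -> SrDFG:
        # Placeholder: a real pass would evaluate nodes whose inputs are all params.
        return graph

    def algebraic_combine(graph: SrDFG) -> SrDFG:
        # Placeholder: a real pass would walk node.srdfg subgraphs across
        # granularity levels to find two matrix-vector products feeding an add
        # and fuse them by concatenating their inputs.
        return graph

    lowered = run_pipeline(SrDFG(domain="RBT"), [constant_fold, algebraic_combine])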

C. Compilation from srDFG to Accelerator IR

Algorithm 1: srDFG Lowering Algorithm

function Lower(srdfg, Om)
    let (N, E) = srdfg.subDfg
    let Ot = Om[srdfg.domain]
    for each n ∈ N do
        if n.name ∉ Ot then
            let subDfg = Lower(n, Om)
            srdfg ← srdfg[n ↦ subDfg]
    end
    return srdfg

Compiling an srDFG to a domain-specific architecture consists of (1) lowering srDFG operations to those supported by the target accelerator (Algorithm 1) and (2) forming valid accelerator IR by translating and combining each srDFG node (Algorithm 2). Algorithm 1 is a function Lower which takes as input a dataflow graph srdfg, a pair (N, E) of nodes N and edges E as defined in Section III-A.

Algorithm 2: Compilation Algorithm

function CompileProgram(srdfg, AccSpec)
    let πd ← ∅ for d ∈ Domains
    let (N, E) = srdfg
    for each n ∈ N do
        let (+d, md) = AccSpec[n.domain]
        let t = md[n.name]
        πd = πd +d t(srdfg, n)
        for each in_edge ∈ n do
            if n.domain ≠ in_edge.src.domain then
                πd = πd +d tload(in_edge, n)
        end
        for each out_edge ∈ n do
            if n.domain ≠ out_edge.dst.domain then
                πd = πd +d tstore(n, out_edge)
        end
    end
    return πd1, ..., πdn

PolyMath lowers srDFG operations to different domains with accelerator targets that support different granularity operations. Lower uses a map, Om, with domain names as keys and lists of domain-specific accelerator operation names, Ot, as values to lower srDFG nodes to the correct granularity.

The algorithm first uses the srdfg.domain attribute as a key to determine the correct granularity of operations for lowering, storing the set of supported operations in Ot. For each node n, if n.name is not included in Ot, then n.srdfg inherits the srdfg domain, and n is lowered in srdfg by replacing it with an srDFG comprised of only supported operations. Replacing n in the srdfg consists of substituting src or dst in adjacent edges (src, dst, md) with a node in the subDfg if src = n or dst = n. Once each n ∈ N has been replaced with supported operations based on Ot, the lowered srdfg is returned. By preserving different levels of granularity in the srDFG, the same srDFG is capable of generating DFGs with operations supported by a variety of custom accelerators. For instance, the hierarchy of ROBOX begins at the System level, followed by finer-grained Task computations, all the way down to varying operation granularities in its macro dataflow graph, such as Vector, Scalar, and Group operations.
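A simplified Python rendering of this lowering recursion is given below, reusing the Node/SrDFG classes from the earlier sketch. It lowers each unsupported node's subgraph in place rather than splicing the subgraph's nodes and edges into the parent as Algorithm 1 does, so it is a sketch of the idea rather than the paper's algorithm.

    # om maps a domain name to the set of operation names its accelerator supports.
    # Assumes every unsupported node carries a finer-granularity definition (n.srdfg).
    def lower(srdfg, om):
        supported = om[srdfg.domain]
        new_nodes = []
        for n in srdfg.nodes:
            if n.name in supported:
                new_nodes.append(n)               # maps directly to an accelerator op
            else:
                sub = n.srdfg
                sub.domain = srdfg.domain         # subgraph inherits the parent domain
                sub = lower(sub, om)              # recurse to a supported granularity
                new_nodes.append(Node(name=n.name, srdfg=sub))
        srdfg.nodes = new_nodes
        return srdfg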

Once a srdfg has been lowered, it is compiled to an IR suitable for the accelerator using Algorithm 2. The accelerator IR for a domain d, denoted πd, is comprised of accelerator IR fragments, each of which is a basic operator and its arguments. To generate each πd for the different targets, Algorithm 2 takes as input a lowered srDFG produced by Algorithm 1 and accelerator specifications for the targets corresponding to each domain. Accelerator specifications for each domain are stored in AccSpec, and define how srDFG nodes are translated and merged to form accelerator IR. A specification for domain d is a pair (md, +d) where:
• md is a map from operator names to a translation function for that operator. The translation functions, t, work as follows: given a srdfg and a node n, t(srdfg, n) returns the accelerator IR fragment πd representing the accelerator operation for n.
• +d is an operator that combines an accelerator IR πd with an accelerator IR fragment produced by t.


Having defined the necessary variables, we can describe Algorithm 2. The algorithm extracts the nodes and edges in the srDFG, then applies a translation function to each node, creating an accelerator IR fragment. The IR fragments for each of the program's domains are separately accumulated into complete representations of an accelerator program IR, which are returned by the algorithm.

The most complicated part of the compilation is the translation functions, t. A translation function does two things: (1) identifies the correct accelerator IR operation, and (2) assigns the correct arguments for that operator. Assigning the correct arguments uses these steps:
1) Convert types to the equivalent accelerator type.
2) Use edges with the input type modifier as input arguments.
3) Use edges with the output type modifier as output arguments.
4) Initialize IR variables for edges with the state type modifier.
5) Add constants for arguments with the param type modifier.
If the accelerator IR fragment requires the shape of operation arguments, it is also included as part of the arguments to the operator or declaration operation. To ensure data is transferred across domain boundaries, load and store IR fragments are created when a node has sources or destinations in different domains. As a final step, accelerator-provided compilers are used to create binaries from the generated IR.
Compilation flexibility. The combination of Algorithms 1 and 2 enables compilation to different types of domain-specific accelerators because of two key properties. First, simultaneous access to each level of srDFG recursion allows supported accelerator operations to be translated: unsupported srDFG nodes on a particular accelerator are refined and transformed to the appropriate level of granularity through recursion, which enables identification of the accelerator-supported operations. Second, the metadata stored in the srDFG allows the IR generation to be parameterized based on the target accelerator. As a result, users can create different accelerator specifications for different accelerators and these same algorithms will do the appropriate mapping. Each algorithm can be instantiated for a number of different mappings without changes to the high-level algorithm.
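The five steps above can be sketched as a translation function in Python, reusing the edge and node classes from the earlier srDFG sketch; the IR-fragment format, the type table, and the AccSpec encoding are hypothetical stand-ins for an accelerator's real interface.

    # Illustrative translation function following the five steps above.
    TYPE_MAP = {"float": "fp32", "int": "i32"}               # step 1: type conversion

    def translate_node(srdfg, node):
        """Return an IR fragment (here just a dict) for one srDFG node."""
        ins, outs, states, params = [], [], [], []
        for e in srdfg.edges:
            if e.dst is node and e.md.modifier == "input":
                ins.append((TYPE_MAP[e.md.dtype], e.md.shape))       # step 2
            elif e.src is node and e.md.modifier == "output":
                outs.append((TYPE_MAP[e.md.dtype], e.md.shape))      # step 3
            elif e.src is node and e.dst is node and e.md.modifier == "state":
                states.append((TYPE_MAP[e.md.dtype], e.md.shape))    # step 4
            elif e.dst is node and e.md.modifier == "param":
                params.append((TYPE_MAP[e.md.dtype], e.md.shape))    # step 5
        return {"op": node.name, "inputs": ins, "outputs": outs,
                "state": states, "params": params}

    # An AccSpec entry for one domain: (operator-name -> translator, combine op).
    acc_spec = {"RBT": ({"mvmul": translate_node, "+": translate_node},
                        lambda ir, frag: ir + [frag])}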

V. EVALUATION

Table II illustrates the difference between conventional general-purpose stacks, domain-specific stacks in the literature, and PolyMath. As shown, PolyMath represents a middle ground between domain-specificity and generality, enabling cross-domain multi-acceleration.

A. Experimental Setup

1) Algorithms and Datasets: Table III shows workloads from the Robotics, Graph Analytics, DSP, Data Analytics, and Deep Learning domains and the lines of code (LOC) of their PMLang implementations. Table IV shows a breakdown of the end-to-end application domains, algorithms, configurations, and PMLang LOC.
Single-domain workloads. In the Robotics domain, we have two benchmarks, MobileRobot [21] and Hexacopter [22]. Section II discusses the two-wheeled MobileRobot in detail.

TABLE II: A comparison of computational stacks (✓ = supported, ✗ = not supported; GPP = General-Purpose Processors).

Domain          | GPP | Graphicionado [14] | Darwin [18] | DNNWeaver [19] | TVM [16] | TABLA [17] | RoboX [15] | DeCO [7] | BCP Acc [20] | PolyMath
Robotics        | ✓   | ✗                  | ✗           | ✗              | ✗        | ✗          | ✓          | ✗        | ✗            | ✓
Graph Analytics | ✓   | ✓                  | ✗           | ✗              | ✗        | ✗          | ✗          | ✗        | ✗            | ✓
DSP             | ✓   | ✗                  | ✗           | ✗              | ✗        | ✗          | ✗          | ✓        | ✗            | ✓
Data Analytics  | ✓   | ✗                  | ✗           | ✗              | ✓        | ✓          | ✗          | ✗        | ✗            | ✓
Deep Learning   | ✓   | ✗                  | ✗           | ✓              | ✓        | ✗          | ✗          | ✗        | ✗            | ✓
Genomics        | ✓   | ✗                  | ✓           | ✗              | ✗        | ✗          | ✗          | ✗        | ✗            | ✗
SAT Solvers     | ✓   | ✗                  | ✗           | ✗              | ✗        | ✗          | ✗          | ✗        | ✓            | ✗

Hexacopter is a six-rotor micro UAV that uses motion planning and orientation control to determine its trajectory. For both of these workloads, the physical robot and task specification are expressed in PMLang. For Data Analytics, we have Low Rank Matrix Factorization (LRMF) and K-means clustering. LRMF converts a large matrix into two smaller matrices whose product represents the original matrix. For LRMF we use two MovieLens [23] datasets. K-means clustering partitions data into k clusters. One K-means workload clusters handwritten digits using the MNIST [24] dataset. The second benchmark uses data from the UCI repository [25] to cluster households with similar electricity consumption. In the Digital Signal Processing domain we have four benchmarks, two each for the Fast Fourier Transform (FFT) and the Discrete Cosine Transform (DCT). The FFT implementation uses fine-grained butterfly and bit-reversal operations to transform a signal to the frequency domain. The DCT algorithm applies a filter kernel to an input image and is used for compression. For Deep Learning, we use two popular convolutional neural networks, ResNet-18 [26] and MobileNet [27], for object classification. For Graph Analytics, we implement and apply Breadth-First Search on two graphs, one of Twitter users and followers [28] and another of Wikipedia links [29].
End-to-end cross-domain applications. Table IV shows the end-to-end applications for our case study, the different domains they comprise, and the specification of each algorithm. The brain stimulation application, BrainStimul, is described in Section II. The stock market application, called OptionPricing, predicts the call option price in the stock market and uses two data analytics algorithms. This application first performs sentiment analysis through logistic regression on news articles to understand market signals and then applies Black-Scholes to predict the price.

2) Optimized CPU and GPU implementations: Tables V and VI show the optimized CPU and GPU frameworks and their specifications. The robotics CPU implementation uses the ACADO Toolkit [30] to generate optimized, self-contained C code and uses cuBLAS [31] libraries for GPUs. For graph analytics, we used Intel GraphMat [32] for CPU implementations and Enterprise [33] for GPU.


TABLE III: Benchmarks and workloads used to evaluate PolyMath.

Domain          | Benchmark         | Algorithm                     | Config/Dataset                             | PMLang LOC
Robotics        | MobileRobot       | Model Predictive Control      | Trajectory Tracking, Horizon = 1024        | 52
Robotics        | Hexacopter        | Model Predictive Control      | Altitude Control, Horizon = 1024           | 197
Graph Analytics | Twitter Followers | Breadth-First Search          | #Vertices=61.57M, #Edges=1468.36M          | 14
Graph Analytics | Wikipedia Links   | Breadth-First Search          | #Vertices=3.56M, #Edges=84.75M             | 14
Graph Analytics | LiveJournal       | Single Source Shortest Path   | #Vertices=4.84M, #Edges=68.99M             | 14
Data Analytics  | MovieL (100k)     | Low Rank Matrix Factorization | 1682 movies, 943 users; 100000 ratings     | 43
Data Analytics  | MovieL (20M)      | Low Rank Matrix Factorization | 40110 movies, 259137 users; 244096 ratings | 43
Data Analytics  | DigitCluster      | K-Means Clustering            | 784 features; 120000 images; K=10          | 41
Data Analytics  | ElecUse           | K-Means Clustering            | 4 features; 2075259 data points; K=12      | 41
DSP             | FFT-8192          | Fast-Fourier Transform        | 1D FFT-real; 8192x1 input                  | 12
DSP             | FFT-16384         | Fast-Fourier Transform        | 1D FFT-real; 16834x1 input                 | 12
DSP             | DCT-1024          | Discrete Cosine Transform     | 1024x1024 image; 8x8 kernel, stride=8      | 31
DSP             | DCT-2048          | Discrete Cosine Transform     | 2048x2048 image; 8x8 kernel, stride=8      | 31
Deep Learning   | ResNet-18         | Deep Neural Network           | Batch Size = 1, ImageNet                   | 117
Deep Learning   | MobileNet         | Deep Neural Network           | Batch Size = 1, ImageNet                   | 102

TABLE IV: Algorithmic composition of end-to-end applications.

Benchmark     | Algorithm                      | Domain         | Config/Dataset     | LOC
BrainStimul   | Fast-Fourier Transform (FFT)   | DSP            | 1D FFT, 4096 input | 12
BrainStimul   | Logistic Regression (LR)       | Data Analytics | 4096 features      | 8
BrainStimul   | Model Predictive Control (MPC) | Robotics       | Horizon = 1024     | 64
OptionPricing | Black-Scholes (BLKS)           | Data Analytics | 8192 options       | 10
OptionPricing | Logistic Regression (LR)       | Data Analytics | 129549 words       | 8

TABLE V: Domains and accelerators used for evaluations.

Domain          | PolyMath Accelerator               | Baseline Framework
Robotics        | ROBOX (ASIC)                       | ACADO/cuBLAS
Graph Analytics | GRAPHICIONADO (ASIC)               | Intel GraphMat/Enterprise
Data Analytics  | HyperStreams (FPGA) / TABLA (FPGA) | MLPack/OpenBLAS/CUDA
DSP             | DECO (FPGA)                        | FFTW3/cuFFT/NVIDIA-DCT
Deep Learning   | TVM-VTA (FPGA)                     | TVM/Tensorflow

TABLE VI: CPU, FPGA, and ASIC specifications.

              | CPU          | FPGA               | ASIC                  | GPU
Chip          | Xeon E-2176G | UltraScale KCU1500 | ROBOX / GRAPHICIONADO | Titan Xp / Jetson AGX Xavier
Cores         | 6            | -                  | -                     | 3,840 / 512
Memory/BRAM   | 128GB        | 75MB               | 512KB / 64MB          | 12GB / 32GB
Power         | 80W          | 35W                | 3.4W / 7W             | 250W / 30W
Frequency     | 3.7GHz       | 150MHz             | 1GHz / 1GHz           | 1.5GHz / 1.3GHz
Logic Tables  | -            | 1,451              | -                     | -
Compute Units | -            | 5,520              | 256 / 8               | -

The DSP workloads use C subroutine libraries [34, 35] for CPU and Nvidia implementations [36, 37] for GPU. For data analytics, we use mlpack [38], a fast and flexible C++ ML library built on top of OpenBLAS [39], NVBLAS [40], and Armadillo [41]. The Deep Learning workloads are compiled using optimized Tensorflow [42] for CPU and GPU. All of our experiments were performed on an Intel Xeon E7, a Titan Xp GPU, and a low-power Jetson Xavier AGX.

3) Domain-Specific Accelerators: Table V shows the accelerator used for each domain. To evaluate each benchmark on domain-specific accelerators, PolyMath was used to compile programs to the target accelerator IR, and the target accelerator's compiler was used to generate executable binaries. For the Robotics workloads, we compile the srDFG to RoBoX's [15] Macro DFG, as RoBoX offers a programmable ASIC for the system, tasks, and penalties of control algorithms optimized using MPC. We use GRAPHICIONADO [14], an ASIC accelerator for graph analytics algorithms expressed as vertex programs, as the target for the Graph Analytics workloads. We compile the FFT and DCT srDFG representations to DECO [7], a DSP-block-based FPGA accelerator, by translating to its DFG, which is then compiled to executable binaries. For LRMF and K-means, we convert the srDFG to TABLA [43], an open-source template-based FPGA accelerator for machine learning, by compiling the srDFG to a DFG. We use TVM-VTA [44], a programmable deep learning FPGA accelerator, as the target for the Deep Learning workloads, as it is state-of-the-art and open-source. Each DSA requires specific levels of operation granularity: single operations [7, 17], coarse DNN layers [44], and coarse time snapshots [15]; the srDFG's multiple granularities enable mapping of kernels to each accelerator.
Multi-acceleration. For BrainStimul, we compile parts to the DECO [7] (FFT-4096), TABLA [17] (logistic regression), and RoBoX [15] accelerators. For OptionPricing, we execute logistic regression-based sentiment analysis on TABLA [17] and Black-Scholes on HyperStreams [45]. All accelerators are cascaded as a single System on Chip (SoC), comprised of memory and a host. A light-weight manager executes on the host, ensuring data dependencies between different accelerators and initiating DMA transfers between DRAM and local accelerator memory. This setting is similar to prior work [46] that also uses an array of micro-accelerators.
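The host-side manager can be pictured as a small dependency-driven scheduler; the Python sketch below is purely illustrative (the kernel-to-accelerator assignments follow the text above, but the scheduling loop and DMA modeling are assumptions, not the paper's implementation).

    # Kernels run on their assigned accelerators once their dependencies finish,
    # with DMA transfers modeled as explicit steps.
    from collections import deque

    kernels = {
        "FFT": {"accel": "DECO",  "deps": []},
        "LR":  {"accel": "TABLA", "deps": ["FFT"]},
        "MPC": {"accel": "RoboX", "deps": ["LR"]},
    }

    done = set()
    ready = deque(k for k, v in kernels.items() if not v["deps"])
    while ready:
        k = ready.popleft()
        print(f"DMA inputs of {k} to {kernels[k]['accel']}; launch; DMA results back")
        done.add(k)
        for name, v in kernels.items():
            if name not in done and name not in ready and all(d in done for d in v["deps"]):
                ready.append(name)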


Fig. 7: Runtime and energy improvement of PolyMath over the CPU across the Robotics, Graph Analytics, Data Analytics, DSP, and Deep Learning benchmarks.

[Figure 8 plot: grouped bars of Runtime and Performance-per-Watt improvement over the Titan Xp and Jetson GPUs (y-axis: Improvement/GPU) for the same benchmarks and domains as Figure 7.]

Fig. 8: Runtime and Performance-per-Watt improvement of PolyMath over GPU.

B. Experimental Results

The goal of PolyMath is to facilitate the use of the wide variety of custom accelerators across end-to-end, cross-domain applications. It does so by abstracting away hardware-level details through a versatile, extensible, and modular stack that maintains the levels of kernel granularity best suited to each design. In this section we compare domain-specific accelerators executing PMLang code with optimized CPU and GPU implementations to better understand the portability of PolyMath. We then perform case studies on two end-to-end applications and observe that PolyMath enables cross-domain multi-acceleration.

1) Performance and Energy Comparisons: Single-kernel comparisons. Figure 7 and Figure 8 show the speedup of the PolyMath-compiled programs listed in Table III on domain-specific accelerators over a Xeon E-2176G CPU, a Titan Xp GPU, and a Jetson Xavier AGX as baselines, respectively. On average, PolyMath-translated implementations outperform the Xeon E-2176G and Jetson Xavier by 3.3× and 1.2× in runtime and offer 7.2× and 1.7× more performance-per-watt than the Titan Xp and Jetson Xavier GPUs. Smaller benchmarks such as MovieLens-100K and ElecUse cannot fully utilize the Titan Xp, so they obtain smaller runtime benefits relative to Jetson but still achieve higher performance-per-watt. The averages still demonstrate an increase in both performance and energy efficiency in the cross-domain setting. PolyMath implementations also offer 18.1× better energy efficiency than the CPU but, because many accelerator backends are low-power designs, reach only 40% of the GPU performance. This is especially true for the discrete cosine transform and deep learning benchmarks: DCT relies on coarse-grained matrix multiplications, for which DECO, a programmable FPGA accelerator, is not as effective as the Titan Xp, and the deep learning models target VTA, which is designed as a low-power accelerator yet is compared against a high-end GPU for uniformity. Note that PolyMath does not add any overhead specific to deep learning acceleration because it converts the srDFG directly to TVM nodes.

Optimal performance comparison. The accelerators used to execute PolyMath implementations also offer custom stacks built for their target architectures. We compare PolyMath to implementations written in each native stack, which represent optimal executions, to quantify the cross-domain overhead of our stack relative to native implementations. Figure 9 compares the performance of PolyMath implementations with the optimal performance reached by programs written by experts for each accelerator. The figure shows that PolyMath achieves 83.9% of the optimal runtime.
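For clarity on how runtime and power combine into the performance-per-watt comparisons above, the snippet below shows the arithmetic for a single hedged data point; the runtimes are invented placeholders, and only the 250 W (Titan Xp) and 35 W (FPGA) power figures come from the platform table.

def ppw_improvement(runtime_accel_s, power_accel_w, runtime_gpu_s, power_gpu_w):
    """Performance-per-watt improvement of an accelerator over a GPU baseline.

    Performance is taken as 1/runtime, so
    PPW improvement = (runtime_gpu / runtime_accel) * (power_gpu / power_accel).
    """
    speedup = runtime_gpu_s / runtime_accel_s
    power_ratio = power_gpu_w / power_accel_w
    return speedup * power_ratio

# Placeholder runtimes; an accelerator can be 2x slower yet still win on PPW.
print(ppw_improvement(runtime_accel_s=2.0, power_accel_w=35.0,
                      runtime_gpu_s=1.0, power_gpu_w=250.0))   # -> ~3.6x better PPW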

[Figure 9 plot: percent of optimal performance (0-100%) per benchmark with the overall Average, grouped into Robotics, Graph Analytics, Data Analytics, DSP, and Deep Learning.]

Fig. 9: Percent of optimal runtime for PolyMath-translated implementations compared to hand-tuned implementations.

The performance of PolyMath relative to the optimal implementations depends on the domain and the algorithm, because each stack differs in the level of data semantics that a more expressive language must compile down to. For instance, accelerators specializing in Deep Learning often utilize three primary type modifiers for variables in neural network graphs (input, output, and weights), each of which maps directly to a type modifier in PMLang and thus incurs zero overhead. In contrast, Robotics algorithms contain unique data semantics, such as task penalties, constraint variables, and time-varying references, which are not differentiated by PolyMath type modifiers, so those implementations do not reach optimal performance. Accelerators such as DECO require specific topologies for their graph-based IR, i.e., balanced DFGs, because they rely on stage-based computation; expert-built DFGs therefore achieve lower execution time than the PolyMath translations. In the case of Data Analytics, we see a low percentage of optimal performance for ElecUse because the benchmark is small, so any extra operations included in the srDFG have a more significant impact on performance relative to the optimal implementation. In contrast, DigitCluster uses the same algorithm on a larger dataset and can therefore amortize the overhead more effectively. It is important to note that the optimal performance is reached by a specialized program on each accelerator written by an expert, whereas PolyMath supports multiple domains that constitute end-to-end applications, expressed as a single comprehensive program and coupled with the srDFG to pave the way for compilation to multiple accelerators.
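The contrast between the zero-overhead Deep Learning mapping and the lossy Robotics mapping can be illustrated with a small sketch of a srDFG-to-backend lowering step; the node naming, the backend type table, and the robotics-role list below are illustrative assumptions, not the actual PolyMath, VTA, or RoBoX interfaces.

# Hypothetical lowering of srDFG variable annotations to a backend-specific IR.
DL_BACKEND_TYPES = {"input": "vta_input", "output": "vta_output", "weight": "vta_weight"}

# Robotics roles a hand-written RoBoX program can express but a generic
# type modifier cannot distinguish (illustrative list).
ROBOTICS_ROLES = {"task_penalty", "constraint", "time_varying_reference"}

def lower_variable(name, role, backend):
    if backend == "deep_learning":
        # 1-to-1 mapping of PMLang-style modifiers onto DL accelerator node types: no overhead.
        return (name, DL_BACKEND_TYPES[role])
    if backend == "robotics":
        # Domain-specific roles collapse to a generic tensor, so the backend compiler
        # loses information that a native implementation would exploit.
        return (name, "generic_tensor" if role in ROBOTICS_ROLES else role)
    raise ValueError(f"unknown backend: {backend}")

print(lower_variable("conv1_w", "weight", "deep_learning"))         # ('conv1_w', 'vta_weight')
print(lower_variable("obstacle_margin", "constraint", "robotics"))  # ('obstacle_margin', 'generic_tensor')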

[Figure 10a plot: end-to-end Runtime and Energy improvement over CPU for FFT, LR, MPC, FFT+LR, FFT+MPC, LR+MPC, and FFT+LR+MPC, covering single-domain acceleration and 2- and 3-domain cross-domain acceleration.]

(a) BrainStimul brain stimulation application

[Figure 10b plot: end-to-end Runtime and Energy improvement over CPU for BLKS, LR, and BLKS+LR, covering single-domain and 2-domain cross-domain acceleration.]

(b) OptionPricing finance application

Fig. 10: Runtime and energy improvement over CPU of end-to-end applications for different combinations of accelerated domains.

2) End-to-End Application Case Study: PolyMath offers the means to express cross-domain applications as a single program that can be compiled to multiple accelerators, one for each constituent domain. Figure 10 shows the runtime and energy improvement of the end-to-end applications over the CPU. Figure 11 illustrates the runtime and performance-per-watt improvement over the Titan Xp and Jetson. Figure 10a and Figure 11a show these results for BrainStimul, and Figure 10b and Figure 11b show them for the OptionPricing application. In these graphs we report whole-application improvement for all possible acceleration combinations, from a single accelerated domain up to the cross-domain case where all algorithms are accelerated. Each end-to-end result incorporates the data-communication overhead of transfers between hardware units. Stand-alone kernel acceleration, as shown in Figures 7 and 8, can offer very high speedups. However, when these kernels are incorporated within a more comprehensive application, those speedups do not manifest in the entire application because the non-accelerated kernels become the bottleneck. For instance, the gap between the highest benefit obtained from the best single-domain acceleration and cross-domain end-to-end acceleration is 1.85× for BrainStimul and 2.06× for OptionPricing (Figure 10). Every kernel that is added for acceleration not only benefits from specialized execution itself but also reduces the Amdahl's-law burden and magnifies the impact of the other accelerated components. The benefits of individual kernel acceleration persist despite end-to-end communication runtime overheads of 23.4% and 17.0% and energy overheads of 21.8% and 12.4% for BrainStimul and OptionPricing, respectively.
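The Amdahl's-law effect described above can be made concrete with a short calculation; the kernel time fractions, per-kernel speedups, and communication overhead used below are invented placeholders, not measured values from the paper.

def end_to_end_speedup(fractions, speedups, comm_overhead=0.0):
    """Amdahl's-law style end-to-end speedup for an application split into kernels.

    fractions[i]: share of baseline CPU runtime spent in kernel i (sums to 1).
    speedups[i]:  speedup of kernel i on its accelerator (1.0 = not accelerated).
    comm_overhead: extra runtime, as a fraction of the baseline, for inter-accelerator transfers.
    """
    accelerated_time = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / (accelerated_time + comm_overhead)

# Hypothetical 3-kernel pipeline (e.g. DSP, analytics, control) with made-up numbers.
fractions = [0.4, 0.35, 0.25]
print(end_to_end_speedup(fractions, [10.0, 1.0, 1.0]))                      # ~1.6x: one kernel accelerated
print(end_to_end_speedup(fractions, [10.0, 8.0, 6.0], comm_overhead=0.05))  # ~5.7x: all kernels accelerated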

[Figure 11a plot: end-to-end Runtime and Performance-per-Watt improvement over the Titan Xp and Jetson for FFT, LR, MPC, and their 2- and 3-domain combinations.]

(a) BrainStimul brain stimulation application

[Figure 11b plot: end-to-end Runtime and Performance-per-Watt improvement over the Titan Xp and Jetson for BLKS, LR, and BLKS+LR.]

(b) OptionPricing finance application

Fig. 11: Runtime and Performance-per-Watt improvement over GPU for combinations of accelerated domains for end-to-end applications.

Figures 11a and 11b show the BrainStimul and OptionPricing results for both the Titan Xp and Jetson. Figure 11a shows that PolyMath offers a 1.2× runtime improvement over the Titan Xp compared to 1.8× over Jetson. In contrast, PolyMath improves performance-per-watt by 8.3× over the Titan Xp compared to 2.8× over Jetson, owing to Jetson's lower power consumption. As Figure 11b shows, the OptionPricing benchmark underutilizes the Titan Xp, offering only 1.5× and 9.2× improvements in performance and performance-per-watt, compared to 1.4× and 1.9× over Jetson. This is caused by the difference in the levels of coarse-grained parallelism in the algorithms. Overall, the PolyMath implementation of OptionPricing still outperforms both GPUs in both runtime and performance-per-watt.

Lastly, Figure 12 shows that the end-to-end PolyMath implementations achieve 76.7% of the optimal performance for BrainStim and 76.9% for OptionPricing compared to entirely manual implementations of each application. Given that PolyMath offers greater ease of programming than Python (Figure 13), the automation overhead of 23.1% is a fair tradeoff.


[Figure 12 plot: percent of optimal performance (0-100%) for FFT, LR(STIM), MPC, and their combinations (Brain Stimulation) and for BLKS, LR(PRC), and BLKS+LR(PRC) (OptionPricing), with the overall Average.]

Fig. 12: Percent of optimal performance for BrainStim and OptionPricing compared to hand-tuned implementations.

[Figure 13 plots: (a) Lines of Code Reduction of 3.3× for Kmeans, 1.8× for DCT, and 2.5× on Average; (b) Coding Time Reduction of 2.6× for Kmeans, 1.2× for DCT, and 1.9× on Average.]

Fig. 13: Reduction in Lines of Code (LOC) and coding time with PMLang over Python for Kmeans and DCT.

3) User Study: To determine the usability of PMLang, we conducted a user study with 20 programmers who are either professional software engineers or PhD students in computer science. The goal of the user study is to measure the expressiveness of PMLang by comparing it to Python, an intuitive programming language that is commonly used in three of the focus domains of PolyMath: DSP, Data Analytics, and Deep Learning. The study tasked each user with implementing either a DSP or an Analytics algorithm in Python or PMLang. The users were divided into four groups: Kmeans in Python, Kmeans in PMLang, DCT in Python, and DCT in PMLang. For fairness and ease in expressing tensor algebraic operations, we allowed the users to import Python modules such as NumPy [47]. Each participant has varying levels of expertise (from beginner to proficient) in the target domains and is proficient in Python. Every participant went through the following process:

1) Participants were introduced to PMLang with a short, six-minute video that walked through the language and small examples.

2) To avoid any algorithm-knowledge bias, participants were randomly assigned either the DSP or the ML algorithm to implement in either Python or PMLang.

3) To minimize variation in algorithm understanding, users were not allowed to begin their first implementation before having read and confirmed their understanding of the algorithm.

4) We timed participants during their implementations and measured their Lines of Code (LOC) after completion.

Results. Figure 13 compares the LOC of Python and PMLang as a ratio of Python LOC to PMLang LOC in (a), and the implementation times of Python and PMLang as a ratio of Python implementation time to PMLang implementation time in (b). The results show that PMLang required 2.5× fewer lines of code and 1.9× less implementation time on average. The Kmeans implementation averaged 3.3× fewer LOC, whereas for DCT the average reduction in LOC is 1.8×. In general, the Kmeans algorithm is more verbose than DCT and on average required 47.6% more lines of Python code than DCT. Because there were more lines of code in Kmeans, there was more opportunity to reduce multi-line operations to a single PMLang statement, which explains the difference in LOC reduction.

The greater complexity of Kmeans appears to affect the reduction in implementation time as well: the average reduction for Kmeans was 2.6× and the average reduction for DCT was 1.2×. Part of this reduction can be attributed to having fewer lines of code to type in PMLang, but it is also a result of being able to directly translate mathematical notation into the equivalent PMLang statement. These results indicate that the more complicated the mathematical program is, the more the programmer will benefit from implementing the program in PMLang. Further, PMLang is expressive enough for programmers unfamiliar with the language to write algebraic expressions more efficiently than they would write the same expressions in a language they are already familiar with, Python.
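As a rough illustration of why math-like notation shrinks code, the snippet below contrasts a loop-based Kmeans assignment step in plain Python with a vectorized NumPy one-liner that mirrors the underlying formula; this is an analogy for the effect reported in the user study, not actual PMLang code.

import numpy as np

X = np.random.rand(100, 2)   # 100 points in 2-D
C = np.random.rand(3, 2)     # 3 cluster centroids

# Loop-based assignment step: several lines of index bookkeeping.
assign_loop = []
for x in X:
    best, best_dist = 0, float("inf")
    for j, c in enumerate(C):
        d = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        if d < best_dist:
            best, best_dist = j, d
    assign_loop.append(best)

# Vectorized form: one statement matching the math argmin_j ||x - c_j||^2.
assign_vec = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)

assert list(assign_vec) == assign_loop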

VI. RELATED WORK

DSLs for custom architectures. There are various domain-specific languages designed to facilitate the use of hardware accelerators. These languages are mostly designed for a single domain [48–51] or, like Spatial [52], focus on conveniently expressing lower-level, hardware-centric information. Another language, Halide [50], allows the expression of image-processing pipelines and contains constructs for filter-based algorithms. Lime [53] focuses on high-level synthesis from Java, thereby enabling execution of Java programs on both FPGAs and CPUs. Instead, PolyMath offers a Cross-Domain Language (CDL) and compute stack to explore the emerging tradeoff between expressiveness and performance while leveraging currently isolated, domain-specific accelerators.

Mathematical and scientific computing environments. There are numerous scientific and numerical programming environments [54–57] and frameworks [47, 58]. PolyMath uniquely provides a 1-to-1 mapping of mathematical expressions to its statements and leverages the natural parallelism in formulas without any explicit annotations for vectorization. In contrast, MATLAB [54], Julia [55], and R require manual effort from the user to identify the parallelism across different computations, vectorize their code, and determine column/row arrangements for matrix operations. Moreover, these languages do not delineate the semantics of data in their programs and do not offer a multi-granular representation, as PolyMath does, to enable the use of various domain-specific accelerators.

Intermediate Representations. A number of intermediate representations [59, 60] provide abstractions to enable program analysis using virtual resources. Both LLVM and the JVM operate at the granularity of a single CPU instruction, which is highly inefficient for domain-specific architectures. Some works [61] have adapted LLVM to guarantee independence between parallel operation threads by using a dataflow graph structure intended for heterogeneous platforms. A number of other works [16, 62–66] focus on the domain of machine learning and implement an end-to-end approach for optimization on heterogeneous platforms after performing optimizations from a high-level language. These works support limited algorithm domains [16, 62–65] or rely on C/C++ or other general-purpose programming languages [61, 66], requiring the programmer to express complex mathematical expressions in unintuitive ways. MLIR [66] is a hierarchical, high-level IR, but it is general-purpose and thus sits at the expressiveness end of the curve (Figure 1), whereas PolyMath is restricted by its mathematical language (PMLang) to a limited set of domains and falls in the middle of the expressiveness spectrum. Furthermore, MLIR does not provide a compilation stack that supports a variety of accelerators from different domains, as PolyMath does and practically demonstrates in the evaluations. Tiramisu [67] introduces a scheduling language with novel commands to explicitly manage the complexities that arise when targeting multicores, GPUs, and distributed machines. Tiramisu offers an IR based on the polyhedral model to allow fine-grained optimization. As such, Tiramisu could serve as a potential backend for PolyMath, which deals with the higher-level complexity of expressing cross-domain applications rather than low-level fine-grained optimization.

Acceleration frameworks and toolchains. TensorFlow [42] is an end-to-end open-source platform for expressing ML algorithms in Python. Deep learning accelerators (e.g., TPU [13]) leverage TensorFlow. Similarly, a variety of deep learning frameworks [19, 68, 69] allow users to run their DNNs on FPGA-based hardware designs. Full-stack solutions such as TABLA [17] and RoboX [15] support classical supervised machine learning and model predictive control in robotics, respectively. Other toolchains [70–73] aim to simplify running deep neural networks on hardware accelerators by performing design-space exploration to find the best configuration for their particular design. These solutions, however, are bound to their own custom architectures for particular platforms (FPGAs or ASICs). In contrast, the srDFG offers a flexible hook that can be translated to these toolchains and frameworks as well as to future accelerator designs and platforms. The cross-domain nature of PolyMath, which supports Robotics, Graph Analytics, DSP, Data Analytics, and Deep Learning, sets it apart from these domain-specific stacks.

VII. CONCLUSION

As domain-specific accelerators become prevalent, there is an emerging tradeoff between expressiveness and performance. This paradigm, a pendulum swing from general-purpose processing to the opposite extreme, creates implicit programming silos between different domains. This paper set out to explore the region between these extremes and the new expressiveness-performance tradeoff it exposes. To that end, we defined a cross-domain computational stack, PolyMath, that bridges the expressiveness gap across multiple domains: Robotics, Graph Analytics, DSP, Deep Learning, and Data Analytics. The results from the user study and performance evaluations showed that PolyMath strikes an effective balance between expressiveness and performance while enabling cross-domain multi-acceleration. It is time to look beyond the timely, yet temporary, success of domain-specific accelerators and devise a future that enables end-to-end applications. The current approach towards acceleration forgoes significant opportunities by restricting the domain. To harness these untapped opportunities, a new paradigm needs to emerge that breaks the boundaries of domains while preserving the benefits of domain specificity. PolyMath takes the initial step in breaking this new ground.

VIII. ACKNOWLEDGMENTS

This work was in part supported by generous gifts from Qualcomm, Google, Microsoft, Xilinx, and Leidos, as well as the National Science Foundation (NSF) awards CNS#1703812, ECCS#1609823, CCF#1553192, Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) award #FA9550-17-1-0274, National Institute of Health (NIH) award #R01EB028350, and Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement numbers #FA8650-20-2-7009 and #HR0011-18-C-0020. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Qualcomm, Google, Microsoft, Xilinx, NSF, AFOSR, NIH, AFRL, DARPA, or the U.S. Government.

REFERENCES

[1] S. Murray, W. Floyd-Jones, Y. Qi, G. Konidaris, and D. J. Sorin. The microarchitecture of a real-time robot motion planning accelerator. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, Oct 2016. doi: 10.1109/MICRO.2016.7783748.

[2] Texas Instruments C6000 DSP, 2007. URL http://www.ti.com/processors/digital-signal-processors/c6000-floating-point-dsp/overview.html.

[3] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ISCA, 2010.

[4] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, July–Aug. 2011.

[5] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In ISCA, 2011.


[6] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010.

[7] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy. DeCO: A DSP block based FPGA accelerator overlay with low overhead interconnect. In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 1–8, May 2016. doi: 10.1109/FCCM.2016.10.

[8] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In ISCA, 2016.

[9] Mohammad Samragh, Mojan Javaheripi, and Farinaz Koushanfar. EncoDeep: Realizing bit-flexible encoding for deep neural networks. ACM Transactions on Embedded Computing Systems (TECS), 19(6):1–29, 2020.

[10] Soroush Ghodrati, Hardik Sharma, Sean Kinzer, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Doug Burger, and Hadi Esmaeilzadeh. Mixed-signal charge-domain acceleration of deep neural networks through interleaved bit-partitioned arithmetic. In PACT, 2020.

[11] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

[12] Soroush Ghodrati, Byung Hoon Ahn, Joon Kyung Kim, Sean Kinzer, Brahmendra Yatham, Navateja Alla, Hardik Sharma, Mohammad Alian, Eiman Ebrahimi, Nam Sung Kim, Cliff Young, and Hadi Esmaeilzadeh. Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks. In MICRO, October 2020.

[13] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, Richard C. Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017. URL http://arxiv.org/abs/1704.04760.

[14] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13, Oct 2016. doi: 10.1109/MICRO.2016.7783759.

[15] Jacob Sacks, Divya Mahajan, Richard C. Lawson, and Hadi Esmaeilzadeh. RoboX: An end-to-end solution to accelerate autonomous control in robotics. In Proceedings of the 45th Annual International Symposium on Computer Architecture, ISCA '18, pages 479–490, Piscataway, NJ, USA, 2018. IEEE Press. ISBN 978-1-5386-5984-7. doi: 10.1109/ISCA.2018.00047. URL https://doi.org/10.1109/ISCA.2018.00047.

[16] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, Carlsbad, CA, 2018. USENIX Association. ISBN 978-1-931971-47-8. URL https://www.usenix.org/conference/osdi18/presentation/chen.

[17] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. March 2016.

[18] Yatish Turakhia, Gill Bejerano, and William J. Dally. Darwin: A genomics co-processor provides up to 15,000x acceleration on long read assembly. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 199–213, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-4911-6. doi: 10.1145/3173162.3173193. URL http://doi.acm.org/10.1145/3173162.3173193.

[19] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to FPGAs. October 2016.

[20] John D. Davis, Zhangxi Tan, Fang Yu, and Lintao Zhang. A practical reconfigurable hardware accelerator for Boolean satisfiability solvers. In Proceedings of the 45th Annual Design Automation Conference, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605581156. doi: 10.1145/1391469.1391669. URL https://doi.org/10.1145/1391469.1391669.

[21] Felipe Kuhne, Joao Manoel Gomes da Silva Jr, and Walter Fetter Lages. Mobile robot trajectory tracking using model predictive control. In Latin American Robotics Symposium, 2005.


[22] M. Kamel, K. Alexis, M. Achtelik, and R. Siegwart. Fast nonlinear model predictive control for multicopter attitude tracking on SO(3). In IEEE Multi-Conference on Systems and Control, 2015.

[23] GroupLens. MovieLens dataset. URL http://grouplens.org/datasets/movielens/.

[24] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[25] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[27] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. ArXiv, abs/1704.04861, 2017.

[28] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 591–600, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605587998. doi: 10.1145/1772690.1772751. URL https://doi.org/10.1145/1772690.1772751.

[29] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1), December 2011. ISSN 0098-3500. doi: 10.1145/2049662.2049663. URL https://doi.org/10.1145/2049662.2049663.

[30] Boris Houska, Hans Joachim Ferreau, and Moritz Diehl. ACADO Toolkit—an open-source framework for automatic control and dynamic optimization. Optimal Control Applications and Methods, 32(3):298–312, 2011. doi: 10.1002/oca.939. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/oca.939.

[31] Nvidia. Dense linear algebra on GPUs. URL https://developer.nvidia.com/cublas.

[32] Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. GraphMat: High performance graph analytics made productive. Proc. VLDB Endow., 8(11):1214–1225, July 2015. ISSN 2150-8097. doi: 10.14778/2809974.2809983. URL https://doi.org/10.14778/2809974.2809983.

[33] H. Liu and H. H. Huang. Enterprise: Breadth-first graph traversal on GPUs. In SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2015.

[34] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on "Program Generation, Optimization, and Platform Adaptation".

[35] Nvidia. Nvidia toolkit. URL https://developer.nvidia.com/cuda-toolkit.

[36] Nvidia. Nvidia CUDA fast Fourier transform library. URL https://developer.nvidia.com/cufft.

[37] Nvidia. Nvidia CUDA SDK - image/video processing and data compression, 2008. URL https://www.nvidia.com/content/cudazone/cuda_sdk/Image_Video_Processing_and_Data_Compression.html.

[38] Ryan R. Curtin, James R. Cline, Neil P. Slagle, William B. March, P. Ram, Nishant A. Mehta, and Alexander G. Gray. MLPACK: A scalable C++ machine learning library. Journal of Machine Learning Research, 14:801–805, 2013.

[39] Zhang Xianyi, Wang Qian, and Zhang Yunquan. Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS '12, pages 684–691, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4903-3. doi: 10.1109/ICPADS.2012.97. URL http://dx.doi.org/10.1109/ICPADS.2012.97.

[40] Nvidia. Nvidia NVBLAS library. URL http://docs.nvidia.com/cuda/nvblas.

[41] Conrad Sanderson and Ryan Curtin. Armadillo: A template-based C++ library for linear algebra. In Journal of Open Source Software, 2016.

[42] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015. URL http://download.tensorflow.org/paper/whitepaper2015.pdf.

[43] ACTLab. TABLA source code, 2017. http://www.act-lab.org/artifacts/tabla/.

[44] Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Lianmin Zheng, Eddie Yan, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. A hardware-software blueprint for flexible deep learning specialization. IEEE Micro, July 2019.

[45] G. W. Morris and M. Aubury. Design space exploration of the European option benchmark using HyperStreams. In 2007 International Conference on Field Programmable Logic and Applications, pages 5–10, 2007.

[46] N. Chandramoorthy, G. Tagliavini, K. Irick, A. Pullini, S. Advani, S. A. Habsi, M. Cotter, J. Sampson, V. Narayanan, and L. Benini. Exploring architectural heterogeneity in intelligent vision systems. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 1–12, 2015.

[47] Stefan Van der Walt, S. Chris Colbert, and Gael Varoquaux. The NumPy array: A structure for efficient numerical computation. CoRR, abs/1102.1523, 2011. URL http://arxiv.org/abs/1102.1523.

[48] Marco Frigerio, Jonas Buchli, and Darwin G. Caldwell. A domain specific language for kinematic models and fast implementations of robot dynamics algorithms. In International Workshop on Domain-Specific Languages and Models for Robotic Systems, 2015.

[49] Mirko Bordignon, Kasper Stoy, and Ulrik Pagh Schultz. Generalized programming of modular robots through kinematic configurations. In International Conference on Intelligent Robots and Systems, 2011.

[50] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Fredo Durand. Halide: Decoupling algorithms from schedules for high-performance image processing. Commun. ACM, 61(1):106–115, December 2017. ISSN 0001-0782. doi: 10.1145/3150211. URL http://doi.acm.org/10.1145/3150211.

[51] Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Anand R. Atreya, Kunle Olukotun, Tiark Rompf, and Martin Odersky. OptiML: An implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 609–616, USA, 2011. Omnipress. ISBN 978-1-4503-0619-5. URL http://dl.acm.org/citation.cfm?id=3104482.3104559.

[52] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Spatial: A language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 296–311, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5698-5. doi: 10.1145/3192366.3192379. URL http://doi.acm.org/10.1145/3192366.3192379.

[53] Joshua Auerbach, David F. Bacon, Perry Cheng, and Rodric Rabbah. Lime: A Java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 89–108, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0203-6. doi: 10.1145/1869459.1869469. URL http://doi.acm.org/10.1145/1869459.1869469.

[54] MATLAB version 9.3.0.713579 (R2017b). The Mathworks, Inc., Natick, Massachusetts, 2017.

[55] Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. Julia: A fast dynamic language for technical computing. CoRR, abs/1209.5145, 2012. URL http://arxiv.org/abs/1209.5145.

[56] Guy L. Steele, Jr. Parallel programming and code selection in Fortress. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, pages 1–1, New York, NY, USA, 2006. ACM. ISBN 1-59593-189-9. doi: 10.1145/1122971.1122972. URL http://doi.acm.org/10.1145/1122971.1122972.

[57] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017. URL https://www.R-project.org/.

[58] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[59] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis and transformation. In CGO, 2004.

[60] Tim Lindholm, Frank Yellin, Gilad Bracha, and Alex Buckley. The Java Virtual Machine Specification, Java SE 8 Edition. Addison-Wesley Professional, 1st edition, 2014. ISBN 013390590X, 9780133905908.

[61] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. HPVM: Heterogeneous parallel virtual machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '18, pages 68–80, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-4982-6. doi: 10.1145/3178487.3178493. URL http://doi.acm.org/10.1145/3178487.3178493.

[62] Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. VTA: An open hardware-software stack for deep learning. CoRR, abs/1807.04188, 2018. URL http://arxiv.org/abs/1807.04188.

[63] Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: A new IR for machine learning frameworks. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, pages 58–68, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5834-7. doi: 10.1145/3211346.3211348. URL http://doi.acm.org/10.1145/3211346.3211348.

[64] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018. URL http://arxiv.org/abs/1802.04730.

[65] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, 2018. URL http://arxiv.org/abs/1805.00907.

[66] Chris Lattner and Jacques Pienaar. MLIR primer: A compiler infrastructure for the end of Moore's law, 2019.

[67] E. Del Sozzo, R. Baghdadi, S. Amarasinghe, and M. D. Santambrogio. A unified backend for targeting FPGAs from DSLs. In 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 1–8, July 2018. doi: 10.1109/ASAP.2018.8445108.

[68] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Adrian Caulfield, Todd Massengill, Ming Liu, Mahdi Ghandi, Daniel Lo, Steve Reinhardt, Shlomi Alkalay, Hari Angepat, Derek Chiou, Alessandro Forin, Doug Burger, Lisa Woods, Gabriel Weisz, Michael Haselman, and Dan Zhang. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro, 38:8–20, March 2018. URL https://www.microsoft.com/en-us/research/publication/serving-dnns-real-time-datacenter-scale-project-brainwave/.

[69] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steve Reinhardt, Adrian Caulfield, Eric Chung, and Doug Burger. A configurable cloud-scale DNN processor for real-time AI. ACM, June 2018. URL https://www.microsoft.com/en-us/research/publication/a-configurable-cloud-scale-dnn-processor-for-real-time-ai/.

[70] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G. Wei, and D. Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 267–278, 2016. doi: 10.1109/ISCA.2016.32.

[71] Byung Hoon Ahn, Prannoy Pilligundla, and Hadi Esmaeilzadeh. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020.

[72] Byung Hoon Ahn, Jinwon Lee, Jamie Menjay Lin, Hsin-Pai Cheng, Jilei Hou, and Hadi Esmaeilzadeh. Ordering chaos: Memory-aware scheduling of irregularly wired neural networks for edge devices. In MLSys, 2020.

[73] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. Pages 97–108, 06 2014. ISBN 978-1-4799-4394-4. doi: 10.1109/ISCA.2014.6853196.