[acm press the 34th annual conference - anaheim, california, united states (1997.06.09-1997.06.13)]...

Hardware/Software Partitioning and Pipelining

Smita Bakshiy

Daniel D. Gajski

Department of Electrical & Computer Engineering Department of Information & Computer Science

University of California University of California

Davis, CA 95616 Irvine, CA 92697

Abstract

For a given throughput constrained system-level speci�ca-

tion, we present a design ow and an algorithm to select soft-ware (general purpose processors) and hardware components,

and then partition and pipeline the speci�cation amongst

the selected components. This is done so as to best satisfythe throughput constraint at minimal hardware cost. Our

ability to pipeline the design at several levels, enables us to

attain high throughput designs, and also distinguishes ourwork from previously proposed hardware/software partition-

ing algorithms.

1 Introduction

Digital systems, especially in the domain of digital pro-cessing and telecommunications, are immensely complex.

In order to deal with the high complexity, increased time-

to-market pressures, and a set of possibly con icting con-straints, it is now imperative to involve design automation

at the highest possible level. This \highest possible level"

may vary on the design, but given the fact that an increas-ing number of designs now contain a combination of di�er-

ent component types, such as general-purpose processors,

DSP (digital signal processing) cores, and custom designedASICs (application speci�c integrated circuits), we consider

the highest level of design to be the one involving the selec-

tion and interconnection of such components. We refer tothis as system-level design.

The reason why a system is best composed of di�er-

ent component types is due to the di�erent characteris-tics of the components, which may be targeted at satisfy-

y This work was performed while the author was at UC Irvine.

The authors gratefully acknowledge SRC (Grant #93-DJ-146) fortheir support.

Permission to make digital/hard copy of all or part of this work

for personal or classroom use is granted without fee provided thatcopies are not made or distributed for pro�t or commercial ad-

vantage, the copyright notice, the title of the publication and itsdate appear and notice is given that copying is by permission ofACM, Inc. To copy otherwise, to republish, to post on servers or

to redistribute to lists, requires prior speci�c permission and/or afee.DAC 97, Anaheim, Californiac 1997 ACM 0-89791-920-3/97/06..$3.50

ing di�erent constraints. O�-the-shelf processors o�er high-

programmability, lower design time, and a comparativelylower cost and lower performance than an equivalent ASIC

implementation. On the other hand ASICs are more expen-

sive to design and fabricate, but o�er comparatively higherperformance. Thus, for a given system, these components

are selected such that the performance critical sections are

performed rapidly on ASICs, and the less critical sections,or the sections that require higher programmability, are per-

formed on the processors.

Our work addresses throughput constrained systems.Given a speci�cation of such a system, we select proces-

sors and hardware resources to implement the system and

then partition and pipeline it amongst the selected compo-nents so as to best satisfy the given throughput constraint at

minimal hardware cost. The throughput of a system is the

rate at which it processes input data, and this is often theprime constraint on digital signal processing systems includ-

ing most image processing applications. In order to meet the

throughput constraints of these systems, it is not only suf-�cient to perform the critical sections in hardware, but it is

also necessary to pipeline the design. Pipelining divides the

design into concurrently executing stages, thus increasing itsdata rate. Our work supports pipelining at four di�erent

levels of the design, namely the system, behavior, loop and

operation level.

Over the past �ve years, several co-design systems [1] [2][3] [4] for hardware/software partitioning have been devel-

oped. However, these tools assume that the tasks in the

system execute sequentially, in a non-pipelined fashion. Fur-thermore, they also assume that the operations within each

task execute sequentially. In comparing our work with the

synthesis systems mentioned above, we have, in general, ex-tended the design space explored by these systems by allow-

ing designs to be pipelined at the system and the task level.

Hence, our work extends current algorithms by performingpipelining at several levels, and by so doing, it is capable

of achieving partitions with high throughput values that are

unattainable without pipelining.

2 Problem de�nition

Our problem of hardware/software partitioning andpipelining may be de�ned as follows:

Given:

1. A speci�cation of the system as a control ow graph

(CFG) of behaviors or tasks.

2. A hardware library containing functional units charac-

terized by a three-tuple <type, cost, delay>.

3. A software (processor) library containing a list of

processors characterized by a four-tuple <type, clock

speed, dollar cost, metrics �le>.

4. A clock constraint and a throughput constraint for the

complete speci�cation.

Determine:

1. An implementation type (either software or hardware)

for every behavior.

2. The estimated area for each hardware behavior, as well

as the total hardware area for the complete speci�ca-tion.

3. The processor to be used for each software behavior, aswell as the total number of processors for the complete

speci�cation

4. A division of the control ow graph into pipe stages of

delay no more than the given throughput constraint.

Such that:

1. Constraints on throughput are satis�ed, and

2. Total hardware area (for the given clock) is minimized.

The throughput constraint speci�es the di�erence in the

arrival time (in nanoseconds) of two consecutive input sam-ples. We also refer to this time as the PS (pipe stage) delay,

since this would be the required delay of a pipe stage in the

design, if it were to be pipelined.

InputControl Flow Graph

A

B C

D

E

Output

Control Flow GraphPipelined & Partitioned

Software Library

Type Clock $ Cost Metrics File

Pentium

PowerPC68000

10

5010

90

7560

pentium.metrics

powerpc.metricsmot68000.metrics

(ns)

Constraints:

Aim:

Minimize hardware area

pentium

HW

68000

pentium

HW

time

System Pipeline

A A A

C C C

B B

D D

E

0 3100 4000

4000 4000 4000

Hardware Library

Name Delay(ns) (gates)

Type

* Mpy1Mpy2

3050

10070

Mpy3 70 60

+ 4530Add1+ 3042Add2

> Cmp1 18 12= Cmp2 14 8

**

Area

3 Mpy1, 1 Mpy22 Add2

A

B C

D

E

1 Mpy1,2 Add28x16 mem

stage 1

stage 2

stage 3

68000

ASIC

pentium

Node: behavior

Arc: control flow

Clock = 10 nsPS delay = 4000 ns

Satisfy PS delay

stage 1

stage 2

stage 3

Figure 1: Inputs and outputs of the algorithm.

The example in Figure 1 illustrates the problem de�ned

above. As input we have a control ow graph of behaviors,

a hardware and a software library, and a PS delay (through-put) and clock constraint. The nodes in the control ow

graph represent behaviors and the arcs represent control de-

pendencies. Each behavior contains a sequence of VHDLstatements that represents computation done on variables.

The hardware library consists of a set of functional units

with their corresponding delay (in ns) and area (in gates)

and the software library contains a list of processors withtheir corresponding clock speeds, dollar cost and metrics �le.

The metrics �le gives the number of instruction cycles andthe number of bytes required to execute each of a list of

3-address generic instructions on that processor. This infor-

mation characterizes a processor and is required to estimatethe execution time of a behavior on a speci�c processor [5].

The output consists of a pipelined and partitioned CFGwhere every behavior has been mapped to either hardware

or software and the graph has been divided into pipe stagesof delay no more than 4000 ns. Every hardware behavior

is associated with an estimate of its execution time and the

number and type of components (selected from the hardwarelibrary) needed to obtain that execution time. For instance,

behavior E has a throughput of 4000 ns and requires 3 in-

stances ofMpy1, 1 instance ofMpy2 and 2 instances of Add2,bringing the total area to 430 gates. Similarly, every software

behavior is associated with a processor from the software li-

brary, and its execution time on that processor. For instance,behavior A implemented on the Pentium processor, has an

execution time of 3100 ns.

Finally, the CFG has also been partitioned into three pipe

stages such that the throughput of the system is 4000 ns, thatis each pipe stage has a delay of no more than 4000 ns. The

hardware and software partitioning and pipelining has been

done with the aim of satisfying the throughput constraintat minimal hardware cost. This scheduling and pipelining

information is represented in the System Pipeline diagram,

which depicts all the pipe stages and the execution scheduleof behaviors within each pipe stage.

Before we describe our algorithm, note some assumptions

of our model:

1. The system architecture contains processors, ASICs

(application speci�c integrated circuits), and memory

chips that all communicate over buses. The memorystores data that needs to be transferred between pipe

stages as well as any globally de�ned data that may

be accessed by multiple processors and/or ASICs. Inthis paper, we assume that all hardware behaviors are

mapped onto 1 ASIC. After the pipelining and the par-

titioning, this ASIC may be further partitioned into

smaller ASICs [6] [7] .

2. Two software behaviors may share the same processor,

irrespective of the pipe stages they execute in.

3. Two hardware behaviors may only share resources if

they execute sequentially in the same pipe stage.

3 Algorithm

An overview of our algorithm for hardware/software par-

titioning and pipelining is presented in Figure 2. Given aSpecChart [1] speci�cation, hardware and software libraries,

a throughput and clock constraint, the �rst step consists of

deriving the control ow graph from the given speci�cation.We then estimate [5] the execution time of all behaviors on

all the available processors in the software library. This gives

us the PET (processor execution time) table, an example of

which is shown in Figure 3 for the given CFG and softwarelibrary in Figure 1. Based on the assumption that a soft-

ware implementation is always less costly than an equivalenthardware implementation for a given behavior, our algorithm

attempts to execute as many as possible behaviors on pro-

cessors. Thus, any behavior that has an execution time lessthan the given throughput constraint on at least one pro-

cessor can be executed in software and only those behaviors

that have an execution time greater than the throughputconstraint on all processors need be executed in hardware.

For instance, for the example in Figure 3 behaviors A, B

and D can be executed on a processor and behaviors C andE need to be executed in hardware.

Determine hardware/softwarepartition

Modify processor allocation

Build control flow graph fromSpecChart specification

Pipeline and schedulecontrol flow graph

Throughput satisfied?

Throughput satisfied?

Determine area of resourcesfor all hardware behaviors

Step 1

Step2

Step 3Step 4

Step 5

Step 6

Yes

Yes

No

Resources not fast enough. Stop.

No

Minimal hw area pipeline &schedule achieved. Stop.

Determine initial cheapestprocessor allocation

Figure 2: Partitioning and pipelining algorithm ow.

Once we have built the PET table and have determined

the hardware/software partition, Step 3 of our algorithm se-

lects an initial processor allocation for the system. Our al-

gorithm's primary goal is to perform the partitioning and

pipelining at minimal hardware cost; however, its secondary

goal is to minimize the total cost of processors. Thus, we

start with the cheapest processor allocation and then in-

crease the cost of the processor allocation till throughputconstraints are satis�ed. This cheapest processor allocation

consists of one instance of the cheapest processor on which

all the software behaviors have an execution time that is lessthan the throughput constraint. For our example CFG and

PET in Figure 3, the initial processor set consists of one

instance of the PowerPC. Though 68000 is the cheapest pro-cessor, it cannot execute behavior A in less than 4000 ns;

hence, it does not satisfy our criteria.

Next, in step 4, we estimate the area of all the hard-

ware behaviors when constrained with a throughput of thegiven PS delay constraint. The estimation algorithm per-

Processor Execution−Time Table

(a)

A

B C

D

Esoftwarehardware

(b)

Behavior Processor Execution Time

AA

pentiumpowerPC

A

BBBCC

pentium

pentiumpowerPC

powerPC

68000

68000

(ns)

pentium

pentiumpowerPC

powerPC

68000

68000

CDDDEE

68000

6000

1400220028008400

12000

E

18900

10001200

102301487021080

Throughput constraint = 4000 nsInitial processor allocation = powerPC

3800

Hardware/software Partition

3100

900

Figure 3: Step 2: hardware/software partition.

forms component selection, scheduling and pipelining, to de-

termine the total number and types of components beingused, the number and position of the pipe stages, as well as

the schedule (clock state divisions) within each pipe stage.

This area and performance information is then entered intoa Hardware Execution Time (HET) Table, to be used in

the next step. Note that if throughput constraints cannot

be satis�ed at this step, the only alternatives are to includefaster resources in the hardware library, or to rewrite the

speci�cation by manually partitioning critical behaviors.

In Step 5, we schedule and pipeline the control ow graph,

that is, we determine a pipe stage and a time slot within the

pipe stage for each behavior. For each software behavior, wealso determine the processor with which it will be executed.

We use a version of the well known list-scheduling algorithm.

We �rst make a prioritized list of behaviors whose predeces-sors have already been assigned to a pipe stage and time slot.

This is known as the ready list. If the behavior at the top

of the ready list is a hardware behavior, we assign it to theearliest feasible pipe stage and time slot. If the behavior is

a software behavior, we determine the processor using which

we can complete the execution of the behavior at the earliestpossible time, and then schedule the behavior in that pipe

stage and time slot. We repeat this step, until all behaviorshave been assigned to a pipe stage and a time slot.

If we can not determine a valid schedule and pipeline,we increase the speed and/or the number of processors and

repeat the scheduling and pipelining step. We use a sim-

ple, almost exhaustive method of modifying the processor

allocation. We �rst replace the allocated processor with a

faster processor from the library, if available. When we have

tried scheduling with the fastest processor, we start with

2 instances of the slowest processor and then increase the

processor's speed, one at a time, in every iteration of Step 5.

This is repeated till constraints are satis�ed, and in the worstcase, this may be repeated till there are as many processors

as software behaviors in the CFG.

4 Experimental results

We have integrated the hardware/software partitioning

and pipelining algorithm within SpecSyn [1], a system syn-

thesis tool. Our experiments are designed to evaluate thequality of the hardware estimation and the quality of the

hardware/software partitioning and pipelining algorithm.

We present a synopsis of our results.

The quality of the hardware estimation was evaluatedby comparing manually obtained designs of the IDCT, the

FFT, and 6 blocks from the MPEG II decoder, against thoseobtained by the estimation algorithm. In general, for a

given throughput constraint, our estimates were within 10%

of the area of the manual designs. The estimation errorswere mainly because our algorithm does not handle multi-

functional units, and is currently incapable of using multiple

bitwidth implementations of the same component type.The quality of the hardware/software partitioning and

pipelining algorithm was evaluated by comparing the man-

ual design exploration process against the results obtained byour algorithm, for the MPEG II decoder example. The man-

ual design process started with an all software non-pipelined

design and then moved the critical behaviors to hardware, aswell as pipelined the system, till its throughput was within

about 3000 ns (this represents a decoding rate of 30 frames

per second). Similarly, we ran our algorithm for a range ofPS delay constraints (700,000 ns to 3,000 ns), starting from

an all software solution and moving towards an all hardware

solution. Results of the comparison are shown graphically(in part) in Figure 4.

The results indicate that the design exploration con-

ducted by our algorithm closely matches the manual explo-ration. The consistent di�erence in area between designs on

the two curves is because our designs do not include the con-

troller area, whereas the manual designs do. Despite the in-accuracy of our estimates, its �delity is high, indicating that,

with further improvements, it will be feasible to replace the

manual exploration process by our algorithm.The results also indicate that the designs obtained by our

algorithm, were in some cases able to share the same proces-

sor amongst 2 behaviors, hence requiring a fewer number ofprocessors. (Note that the Pentium processor was selected to

be the best from a library of about 6 processors, including the

Sparc, PowerPC, and 68000 processors). More importantly,the results show that the fastest design attainable by our al-

gorithm (2980 ns) is about 25% faster than that obtained by

the manual design. This is because our algorithm performspipelining at the system, behavior, loop and operation levels,

which is di�cult for designers to perform manually.

It is important to note that while the manual design andexploration took approximately 6 man months, our algo-

rithm took approximately 3 minutes per design, with a total

of about 30 minutes on a SUN SPARC 5 workstation. (De-tails of these experiments are provided in [8] and [9]). Given

the similarity of both results, this is a signi�cant saving in

design time.

5 Conclusion

Our design ow and algorithms may be improved in sev-

eral ways. At the behavior level, our resource estimationalgorithm can be improved by allowing the use of multi-

functional components, by allowing multiple bitwidths of

the same component type by extending our algorithms topipeline in the presence of loop-carried dependencies, and by

100000 200000 300000 400000 500000Area (gates)

0

5000

10000

15000

20000

PS

del

ay (

ns)

MPEG: Area (gates) vs. PS delay (ns)

Our algorithmManual exploration

Pentium x 3 Pentium x 4

Pentium x 4 Pentium x 2

Pentium x 2Pentium x 1

Pentium x 1

Figure 4: PS delay vs. hardware area for MPEG

designs obtained by manual and automatic means.

providing estimates for the cost of multiplexers and buses.

Our design model and algorithms can support these exten-sions and incorporating these extensions is part of ongoing

work. At the system level, we can improve the pipelining

and partitioning algorithms by incorporating some measureof processor cost into our cost metric, by taking interface

cost and delay into account, as well as by partitioning the

system amongst multiple ASICs.Our current results indicate that, with these modi�ca-

tions, our algorithm will serve as a practical solution to the

hardware/software partitioning problem. We believe thiswill greatly increase design quality and reduce design time.

References

[1] D. Gajski, F. Vahid, S. Narayan, and J. Gong, Speci�cationand Design of Embedded Systems. Englewood Cli�s, New Jer-sey 07632: Prentice Hall, Inc, 1994.

[2] R. Ernst, J. Henkel, and T. Benner, \Hardware-softwarecosynthesis for microcontrollers," in IEEE Design and Testof Computers, pp. 64{75, 1994.

[3] R. Gupta and G. D. Micheli, \Hardware-software cosynthesisfor digital systems," IEEE Design and Test of Computers,vol. 10, no. 3, pp. 29{41, 1993.

[4] A. Kalavade, System-Level Codesign Of Mixed Hardware-Software Systems. PhD thesis, University of California, Berke-ley, 1995.

[5] J. Gong, D. Gajski, and S. Narayan, \Software estimationfrom executable speci�cations," in The Journal of Computerand Software Engineering, 1994.

[6] R. Gupta and G. D. Micheli, \Partitioning of functional mod-els of synchronous digital systems," in Proceedings of theIEEE International Conference on Computer Aided Design,pp. 216{219, 1990.

[7] F. Vahid and D. Gajski, \Speci�cation partitioning for sys-tem design," in Proceedings of the 29th Design AutomationConference, 1992.

[8] D. Gajski, P. Grun, W. Pan, and S. Bakshi, \Design explo-ration for pipelined IDCT," Tech. Rep. 96-41, Dept. of Infor-mation and Computer Science, University of California, Irvine,1996.

[9] A. B. Thordarson, \Comparison of manual and automatic be-havioral synthesis on MPEG-algorithm,"Master's thesis, Uni-versity of California, Irvine, 1995.

[acm press the 34th annual conference - anaheim, california, united states (1997.06.09-1997.06.13)]...

Documents