Hardware/Software Partitioning and Pipelining
Smita Bakshi†
Department of Electrical & Computer Engineering
University of California
Davis, CA 95616
Daniel D. Gajski
Department of Information & Computer Science
University of California
Irvine, CA 92697
Abstract
For a given throughput-constrained system-level specification, we present a design flow and an algorithm to select software (general-purpose processors) and hardware components, and then partition and pipeline the specification amongst the selected components. This is done so as to best satisfy the throughput constraint at minimal hardware cost. Our ability to pipeline the design at several levels enables us to attain high-throughput designs, and also distinguishes our work from previously proposed hardware/software partitioning algorithms.
1 Introduction
Digital systems, especially in the domain of digital processing and telecommunications, are immensely complex. In order to deal with the high complexity, increased time-to-market pressures, and a set of possibly conflicting constraints, it is now imperative to involve design automation at the highest possible level. This "highest possible level" may vary with the design, but given that an increasing number of designs now contain a combination of different component types, such as general-purpose processors, DSP (digital signal processing) cores, and custom-designed ASICs (application-specific integrated circuits), we consider the highest level of design to be the one involving the selection and interconnection of such components. We refer to this as system-level design.
A system is best composed of different component types because the components have different characteristics, which may be targeted at satisfying different constraints. Off-the-shelf processors offer high programmability, lower design time, and a comparatively lower cost and lower performance than an equivalent ASIC implementation. On the other hand, ASICs are more expensive to design and fabricate, but offer comparatively higher performance. Thus, for a given system, these components are selected such that the performance-critical sections are performed rapidly on ASICs, and the less critical sections, or the sections that require higher programmability, are performed on the processors.
† This work was performed while the author was at UC Irvine. The authors gratefully acknowledge SRC (Grant #93-DJ-146) for their support.
Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 97, Anaheim, California. © 1997 ACM 0-89791-920-3/97/06..$3.50
Our work addresses throughput-constrained systems. Given a specification of such a system, we select processors and hardware resources to implement the system and then partition and pipeline it amongst the selected components so as to best satisfy the given throughput constraint at minimal hardware cost. The throughput of a system is the rate at which it processes input data, and this is often the prime constraint on digital signal processing systems, including most image processing applications. In order to meet the throughput constraints of these systems, it is not sufficient to perform the critical sections in hardware; it is also necessary to pipeline the design. Pipelining divides the design into concurrently executing stages, thus increasing its data rate. Our work supports pipelining at four different levels of the design, namely the system, behavior, loop and operation levels.
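As a small illustration of why pipelining raises the data rate, the following sketch compares the sample period of a design whose stages run back to back with that of the same design when pipelined. The stage delays and function names are hypothetical, chosen only for illustration, and register and communication overheads are ignored.

```python
def unpipelined_period_ns(stage_delays_ns):
    """Without pipelining, a new sample is accepted only after all
    stages have finished, so the period is the sum of the delays."""
    return sum(stage_delays_ns)


def pipelined_period_ns(stage_delays_ns):
    """With pipelining, stages work on different samples concurrently,
    so the period is set by the slowest stage."""
    return max(stage_delays_ns)


# Hypothetical three-stage design (delays in ns).
stage_delays_ns = [3100, 4000, 3900]
print(unpipelined_period_ns(stage_delays_ns))  # 11000
print(pipelined_period_ns(stage_delays_ns))    # 4000
```

In this made-up case the pipelined design accepts a sample every 4000 ns instead of every 11000 ns, at the price of the storage needed between stages.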
Over the past five years, several co-design systems [1] [2] [3] [4] for hardware/software partitioning have been developed. However, these tools assume that the tasks in the system execute sequentially, in a non-pipelined fashion. Furthermore, they also assume that the operations within each task execute sequentially. Compared with these systems, our work extends the explored design space by allowing designs to be pipelined at the system and the task level. By pipelining at several levels, it is capable of achieving partitions with high throughput values that are unattainable without pipelining.
2 Problem definition
Our problem of hardware/software partitioning and pipelining may be defined as follows:
Given:
1. A specification of the system as a control flow graph (CFG) of behaviors or tasks.
2. A hardware library containing functional units characterized by a three-tuple <type, cost, delay>.
3. A software (processor) library containing a list of processors characterized by a four-tuple <type, clock speed, dollar cost, metrics file>.
4. A clock constraint and a throughput constraint for the complete specification.
Determine:
1. An implementation type (either software or hardware) for every behavior.
2. The estimated area for each hardware behavior, as well as the total hardware area for the complete specification.
3. The processor to be used for each software behavior, as well as the total number of processors for the complete specification.
4. A division of the control flow graph into pipe stages of delay no more than the given throughput constraint.
Such that:
1. Constraints on throughput are satisfied, and
2. Total hardware area (for the given clock) is minimized.
The throughput constraint specifies the difference in the arrival time (in nanoseconds) of two consecutive input samples. We also refer to this time as the PS (pipe stage) delay, since this would be the required delay of a pipe stage in the design, if it were to be pipelined.
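The inputs and the throughput check described above can be written down compactly. The sketch below is only one possible encoding: the class names, field names, and helper functions are assumptions made for illustration, and the clock is stored as a period in nanoseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalUnit:
    """Hardware library entry: the three-tuple <type, cost, delay>."""
    type: str       # operation type, e.g. "*" or "+"
    cost: int       # area in gates
    delay_ns: int


@dataclass(frozen=True)
class Processor:
    """Software library entry: <type, clock speed, dollar cost, metrics file>."""
    type: str
    clock_ns: int   # assumed here to be a clock period in ns
    dollar_cost: int
    metrics_file: str


def meets_throughput(stage_delays_ns, ps_delay_ns):
    """A pipeline satisfies the constraint if no pipe stage takes
    longer than the PS delay."""
    return all(d <= ps_delay_ns for d in stage_delays_ns)


def samples_per_second(ps_delay_ns):
    """Input rate implied by a PS delay given in nanoseconds."""
    return 1e9 / ps_delay_ns


print(meets_throughput([3100, 4000, 3900], 4000))  # True
print(samples_per_second(4000))                    # 250000.0
```

A PS delay of 4000 ns therefore corresponds to 250,000 input samples per second.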
[Figure 1 (diagram not reproduced). Inputs: a control flow graph of behaviors A to E (nodes are behaviors, arcs are control flow); a software library listing the Pentium, PowerPC, and 68000 by type, clock (ns), dollar cost, and metrics file; a hardware library listing the functional units Mpy1, Mpy2, Mpy3, Add1, Add2, Cmp1, and Cmp2 by type, name, delay (ns), and area (gates); and the constraints clock = 10 ns and PS delay = 4000 ns. Aim: satisfy the PS delay and minimize hardware area. Outputs: the pipelined and partitioned control flow graph, divided into three stages with each behavior mapped to a processor or to hardware, and the system pipeline schedule.]
Figure 1: Inputs and outputs of the algorithm.
The example in Figure 1 illustrates the problem defined above. As input we have a control flow graph of behaviors, a hardware and a software library, and a PS delay (throughput) and clock constraint. The nodes in the control flow graph represent behaviors and the arcs represent control dependencies. Each behavior contains a sequence of VHDL statements that represents computation done on variables. The hardware library consists of a set of functional units with their corresponding delay (in ns) and area (in gates), and the software library contains a list of processors with their corresponding clock speeds, dollar cost and metrics file. The metrics file gives the number of instruction cycles and the number of bytes required to execute each of a list of 3-address generic instructions on that processor. This information characterizes a processor and is required to estimate the execution time of a behavior on a specific processor [5].
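To make the role of the metrics file concrete, the sketch below estimates an execution time from per-instruction cycle counts. It assumes a simple model (total cycles equals the sum of instruction counts weighted by cycles per instruction, multiplied by the clock period); the estimator of [5] is not described here and may differ. The instruction names, counts, and cycle figures are hypothetical.

```python
def estimate_execution_time_ns(instr_counts, cycles_per_instr, clock_ns):
    """Estimate the execution time of one behavior on one processor.

    instr_counts:     generic instruction -> number of times executed
    cycles_per_instr: generic instruction -> cycles on this processor
                      (the kind of data a metrics file provides)
    clock_ns:         processor clock period in ns
    """
    total_cycles = sum(
        count * cycles_per_instr[op] for op, count in instr_counts.items()
    )
    return total_cycles * clock_ns


# Hypothetical metrics and behavior profile.
cycles = {"add": 1, "mul": 4, "load": 2}
counts = {"add": 50, "mul": 20, "load": 30}

# 50*1 + 20*4 + 30*2 = 190 cycles at 10 ns per cycle.
print(estimate_execution_time_ns(counts, cycles, clock_ns=10))  # 1900
```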
The output consists of a pipelined and partitioned CFG where every behavior has been mapped to either hardware or software and the graph has been divided into pipe stages of delay no more than 4000 ns. Every hardware behavior is associated with an estimate of its execution time and the number and type of components (selected from the hardware library) needed to obtain that execution time. For instance, behavior E has a throughput of 4000 ns and requires 3 instances of Mpy1, 1 instance of Mpy2 and 2 instances of Add2, bringing the total area to 430 gates. Similarly, every software behavior is associated with a processor from the software library, and its execution time on that processor. For instance, behavior A, implemented on the Pentium processor, has an execution time of 3100 ns.
Finally, the CFG has also been partitioned into three pipe stages such that the throughput of the system is 4000 ns, that is, each pipe stage has a delay of no more than 4000 ns. The hardware and software partitioning and pipelining has been done with the aim of satisfying the throughput constraint at minimal hardware cost. This scheduling and pipelining information is represented in the System Pipeline diagram, which depicts all the pipe stages and the execution schedule of behaviors within each pipe stage.
Before we describe our algorithm, note some assumptions of our model:
1. The system architecture contains processors, ASICs (application-specific integrated circuits), and memory chips that all communicate over buses. The memory stores data that needs to be transferred between pipe stages as well as any globally defined data that may be accessed by multiple processors and/or ASICs. In this paper, we assume that all hardware behaviors are mapped onto one ASIC. After the pipelining and the partitioning, this ASIC may be further partitioned into smaller ASICs [6] [7].
2. Two software behaviors may share the same processor, irrespective of the pipe stages they execute in.
3. Two hardware behaviors may only share resources if they execute sequentially in the same pipe stage.
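The third assumption affects how the total number of functional units is counted. The sketch below is a simplified reading of it: it assumes that all hardware behaviors placed in one pipe stage execute sequentially, so that within a stage the requirement for each unit is the maximum over its behaviors, while requirements of different stages add up. The function name and the unit counts are hypothetical.

```python
from collections import Counter


def shared_unit_counts(stages):
    """Count functional units needed under the sharing rule.

    stages: list of pipe stages; each stage is a list with one dict per
            hardware behavior, mapping unit name -> instances needed.
    Assumes the behaviors within a stage run one after another.
    """
    total = Counter()
    for behaviors in stages:
        stage_need = Counter()
        for need in behaviors:
            for unit, n in need.items():
                # Sequential behaviors in the same stage reuse units.
                stage_need[unit] = max(stage_need[unit], n)
        # Different stages run concurrently, so their needs add up.
        total.update(stage_need)
    return dict(total)


# Hypothetical example: one behavior in the first stage, two in the second.
stages = [
    [{"Mpy1": 1, "Add2": 2}],
    [{"Mpy1": 3, "Mpy2": 1, "Add2": 2}, {"Mpy1": 2}],
]
print(shared_unit_counts(stages))  # {'Mpy1': 4, 'Add2': 4, 'Mpy2': 1}
```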
3 Algorithm
An overview of our algorithm for hardware/software partitioning and pipelining is presented in Figure 2. Given a SpecChart [1] specification, hardware and software libraries, and a throughput and clock constraint, the first step consists of deriving the control flow graph from the given specification. We then estimate [5] the execution time of all behaviors on all the available processors in the software library. This gives us the PET (processor execution time) table, an example of which is shown in Figure 3 for the CFG and software library given in Figure 1. Based on the assumption that a software implementation is always less costly than an equivalent hardware implementation for a given behavior, our algorithm attempts to execute as many behaviors as possible on processors. Thus, any behavior that has an execution time less than the given throughput constraint on at least one processor can be executed in software, and only those behaviors that have an execution time greater than the throughput constraint on all processors need be executed in hardware. For instance, for the example in Figure 3, behaviors A, B and D can be executed on a processor and behaviors C and E need to be executed in hardware.
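The partitioning rule of Step 2 amounts to a single pass over the PET table. In the sketch below, the table layout and function name are assumptions, and the execution times are illustrative: only the 3100 ns of behavior A on the Pentium is quoted in the text, and the remaining entries were chosen to be consistent with the partition described for Figure 3.

```python
def partition(pet, ps_delay_ns):
    """Split behaviors into software and hardware candidates.

    pet: behavior -> {processor: execution time in ns}
    A behavior goes to software if at least one processor executes it
    in less than the PS delay; otherwise it must go to hardware.
    """
    software, hardware = [], []
    for behavior, times in pet.items():
        if min(times.values()) < ps_delay_ns:
            software.append(behavior)
        else:
            hardware.append(behavior)
    return software, hardware


# Illustrative PET table (ns); see the caveat above.
pet = {
    "A": {"pentium": 3100, "powerpc": 3800, "m68000": 6000},
    "B": {"pentium": 1400, "powerpc": 2200, "m68000": 2800},
    "C": {"pentium": 8400, "powerpc": 12000, "m68000": 18900},
    "D": {"pentium": 900, "powerpc": 1000, "m68000": 1200},
    "E": {"pentium": 10230, "powerpc": 14870, "m68000": 21080},
}

software, hardware = partition(pet, ps_delay_ns=4000)
print(software)  # ['A', 'B', 'D']
print(hardware)  # ['C', 'E']
```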
[Figure 2 (flowchart not reproduced). Step 1: build the control flow graph from the SpecChart specification. Step 2: determine the hardware/software partition. Step 3: determine the initial cheapest processor allocation. Step 4: determine the area of resources for all hardware behaviors; if the throughput is not satisfied, the resources are not fast enough and the algorithm stops. Step 5: pipeline and schedule the control flow graph; if the throughput is satisfied, a minimal hardware area pipeline and schedule has been achieved and the algorithm stops. Step 6: otherwise, modify the processor allocation and repeat Step 5.]
Figure 2: Partitioning and pipelining algorithm flow.
Once we have built the PET table and have determined the hardware/software partition, Step 3 of our algorithm selects an initial processor allocation for the system. Our algorithm's primary goal is to perform the partitioning and pipelining at minimal hardware cost; however, its secondary goal is to minimize the total cost of processors. Thus, we start with the cheapest processor allocation and then increase the cost of the processor allocation until throughput constraints are satisfied. This cheapest processor allocation consists of one instance of the cheapest processor on which all the software behaviors have an execution time that is less than the throughput constraint. For our example CFG and PET table in Figure 3, the initial processor set consists of one instance of the PowerPC. Though the 68000 is the cheapest processor, it cannot execute behavior A in less than 4000 ns; hence, it does not satisfy our criterion.
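The choice of the initial allocation in Step 3 can be sketched as a filter followed by a minimum. The function name, dollar costs, and execution times below are assumptions for illustration; they are chosen only so that the 68000 is the cheapest processor and fails on behavior A, as stated in the text.

```python
def cheapest_feasible_processor(pet, software, costs, ps_delay_ns):
    """Return the cheapest processor on which every software behavior
    runs in less than the PS delay, or None if there is none.

    pet:      behavior -> {processor: execution time in ns}
    software: behaviors assigned to software
    costs:    processor -> dollar cost
    """
    feasible = [
        p for p in costs
        if all(pet[b][p] < ps_delay_ns for b in software)
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda p: costs[p])


# Illustrative data (see the caveat above).
pet = {
    "A": {"pentium": 3100, "powerpc": 3800, "m68000": 6000},
    "B": {"pentium": 1400, "powerpc": 2200, "m68000": 2800},
    "D": {"pentium": 900, "powerpc": 1000, "m68000": 1200},
}
costs = {"pentium": 90, "powerpc": 75, "m68000": 60}

print(cheapest_feasible_processor(pet, ["A", "B", "D"], costs, 4000))  # powerpc
```

Note that this test considers each behavior in isolation; whether a single processor can actually run all software behaviors within the pipeline is only decided later, during scheduling.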
[Figure 3 (not reproduced). Panels: the processor execution-time table, giving the execution time (ns) of each behavior A to E on the Pentium, PowerPC, and 68000, and the resulting hardware/software partition of the control flow graph. Throughput constraint = 4000 ns; initial processor allocation = PowerPC.]
Figure 3: Step 2: hardware/software partition.
Next, in Step 4, we estimate the area of all the hardware behaviors when constrained to the given PS delay. The estimation algorithm performs component selection, scheduling and pipelining, to determine the total number and types of components being used, the number and position of the pipe stages, as well as the schedule (clock state divisions) within each pipe stage. This area and performance information is then entered into a Hardware Execution Time (HET) table, to be used in the next step. Note that if throughput constraints cannot be satisfied at this step, the only alternatives are to include faster resources in the hardware library, or to rewrite the specification by manually partitioning critical behaviors.
In Step 5, we schedule and pipeline the control flow graph, that is, we determine a pipe stage and a time slot within the pipe stage for each behavior. For each software behavior, we also determine the processor on which it will be executed. We use a version of the well-known list-scheduling algorithm. We first make a prioritized list of behaviors whose predecessors have already been assigned to a pipe stage and time slot. This is known as the ready list. If the behavior at the top of the ready list is a hardware behavior, we assign it to the earliest feasible pipe stage and time slot. If the behavior is a software behavior, we determine the processor on which we can complete the execution of the behavior at the earliest possible time, and then schedule the behavior in that pipe stage and time slot. We repeat this step until all behaviors have been assigned to a pipe stage and a time slot.
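The scheduling loop can be illustrated with a much simplified list scheduler. The sketch below is not the implementation used in the paper: it assumes an acyclic graph, uses alphabetical order as a stand-in for the priority function, lets every hardware behavior use its own resources, requires each behavior to fit inside one pipe stage, and treats a processor as busy in the same time slot of every stage (since all stages run concurrently). All names and execution times are hypothetical.

```python
def fits(start, dur, busy):
    """True if [start, start + dur) overlaps no busy interval."""
    return all(start + dur <= b0 or start >= b1 for b0, b1 in busy)


def earliest_slot(ready, dur, ps, busy):
    """Earliest absolute start >= ready that keeps the behavior inside
    one pipe stage and in a free slot (offsets are taken modulo ps)."""
    if dur > ps:
        return None
    stage, offset = divmod(ready, ps)
    for _ in range(2):  # rest of the current stage, then a full stage
        candidates = sorted({offset} | {b1 for _, b1 in busy if b1 >= offset})
        for c in candidates:
            if c + dur <= ps and fits(c, dur, busy):
                return stage * ps + c
        stage, offset = stage + 1, 0
    return None


def list_schedule(preds, hw_time, sw_time, processors, ps):
    """Return behavior -> (pipe stage, offset in stage, resource),
    or None if some behavior cannot be placed."""
    finish, placed = {}, {}
    busy = {p: [] for p in processors}
    remaining = set(preds)
    while remaining:
        ready = sorted(b for b in remaining
                       if all(p in finish for p in preds[b]))
        b = ready[0]  # placeholder priority: alphabetical order
        t_ready = max((finish[p] for p in preds[b]), default=0)
        if b in hw_time:
            options = [(earliest_slot(t_ready, hw_time[b], ps, []),
                        hw_time[b], "HW")]
        else:
            options = [(earliest_slot(t_ready, sw_time[b][p], ps, busy[p]),
                        sw_time[b][p], p) for p in processors]
        options = [(s + d, s, d, r) for s, d, r in options if s is not None]
        if not options:
            return None
        end, start, dur, res = min(options)  # earliest completion wins
        if res != "HW":
            busy[res].append((start % ps, start % ps + dur))
        finish[b] = end
        placed[b] = (start // ps, start % ps, res)
        remaining.remove(b)
    return placed


# Hypothetical five-behavior graph and execution times (ns).
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}
hw_time = {"C": 2500, "E": 4000}
sw_time = {
    "A": {"pentium": 3100, "m68000": 6000},
    "B": {"pentium": 1400, "m68000": 2800},
    "D": {"pentium": 900, "m68000": 1200},
}
schedule = list_schedule(preds, hw_time, sw_time, ["pentium", "m68000"], 4000)
print(schedule)
```

With these made-up numbers the sketch places A in the first stage, B, C and D in the second, and E in the third. With the Pentium alone it returns None, because A already occupies 3100 ns of every 4000 ns period on that processor.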
If we cannot determine a valid schedule and pipeline, we increase the speed and/or the number of processors and repeat the scheduling and pipelining step. We use a simple, almost exhaustive method of modifying the processor allocation. We first replace the allocated processor with a faster processor from the library, if available. When we have tried scheduling with the fastest processor, we start with 2 instances of the slowest processor and then increase the processors' speed, one at a time, in every iteration of Step 5. This is repeated until constraints are satisfied, and in the worst case, this may be repeated until there are as many processors as software behaviors in the CFG.
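One possible reading of this enumeration order is sketched below. The text fixes the order only loosely, so the order used within a given processor count (all multisets of processors, from slowest to fastest) is an interpretation, and the function name and the speed ranking of the three processors are assumptions.

```python
from itertools import combinations_with_replacement


def candidate_allocations(by_speed, initial, max_processors):
    """Yield processor allocations to try, in order.

    by_speed:       processor names ordered from slowest to fastest
    initial:        processor chosen as the initial allocation
    max_processors: upper bound, e.g. the number of software behaviors
    """
    # One processor: the initial choice, then each faster alternative.
    for p in by_speed[by_speed.index(initial):]:
        yield (p,)
    # Two or more processors: start from all-slowest and speed up.
    for n in range(2, max_processors + 1):
        yield from combinations_with_replacement(by_speed, n)


by_speed = ["m68000", "powerpc", "pentium"]  # assumed speed ranking
for allocation in candidate_allocations(by_speed, "powerpc", 2):
    print(allocation)
```

In use, each yielded allocation would be passed to the scheduling step, and the loop would stop at the first allocation for which a valid schedule is found.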
4 Experimental results
We have integrated the hardware/software partitioning and pipelining algorithm within SpecSyn [1], a system synthesis tool. Our experiments are designed to evaluate the quality of the hardware estimation and the quality of the hardware/software partitioning and pipelining algorithm. We present a synopsis of our results.
The quality of the hardware estimation was evaluated by comparing manually obtained designs of the IDCT, the FFT, and 6 blocks from the MPEG II decoder, against those obtained by the estimation algorithm. In general, for a given throughput constraint, our estimates were within 10% of the area of the manual designs. The estimation errors arose mainly because our algorithm does not handle multi-functional units, and is currently incapable of using multiple bitwidth implementations of the same component type.
The quality of the hardware/software partitioning and pipelining algorithm was evaluated by comparing the manual design exploration process against the results obtained by our algorithm, for the MPEG II decoder example. The manual design process started with an all-software non-pipelined design and then moved the critical behaviors to hardware, as well as pipelined the system, until its throughput was within about 3000 ns (this represents a decoding rate of 30 frames per second). Similarly, we ran our algorithm for a range of PS delay constraints (700,000 ns to 3,000 ns), starting from an all-software solution and moving towards an all-hardware solution. Results of the comparison are shown graphically (in part) in Figure 4.
The results indicate that the design exploration conducted by our algorithm closely matches the manual exploration. The consistent difference in area between designs on the two curves arises because our designs do not include the controller area, whereas the manual designs do. Despite the inaccuracy of our estimates, their fidelity is high, indicating that, with further improvements, it will be feasible to replace the manual exploration process by our algorithm.
The results also indicate that the designs obtained by our algorithm were in some cases able to share the same processor amongst 2 behaviors, hence requiring fewer processors. (Note that the Pentium processor was selected as the best from a library of about 6 processors, including the Sparc, PowerPC, and 68000 processors.) More importantly, the results show that the fastest design attainable by our algorithm (2980 ns) is about 25% faster than that obtained by the manual design. This is because our algorithm performs pipelining at the system, behavior, loop and operation levels, which is difficult for designers to perform manually.
It is important to note that while the manual design and exploration took approximately 6 man-months, our algorithm took approximately 3 minutes per design, with a total of about 30 minutes on a SUN SPARC 5 workstation. (Details of these experiments are provided in [8] and [9].) Given the similarity of both results, this is a significant saving in design time.
[Figure 4 (plot not reproduced). MPEG: PS delay (ns) versus area (gates), with one curve for our algorithm and one for manual exploration; design points are labeled with the number of Pentium processors used (Pentium x 1 to Pentium x 4).]
Figure 4: PS delay vs. hardware area for MPEG designs obtained by manual and automatic means.
5 Conclusion
Our design flow and algorithms may be improved in several ways. At the behavior level, our resource estimation algorithm can be improved by allowing the use of multi-functional components, by allowing multiple bitwidths of the same component type, by extending our algorithms to pipeline in the presence of loop-carried dependencies, and by providing estimates for the cost of multiplexers and buses. Our design model and algorithms can support these extensions, and incorporating them is part of ongoing work. At the system level, we can improve the pipelining and partitioning algorithms by incorporating some measure of processor cost into our cost metric, by taking interface cost and delay into account, as well as by partitioning the system amongst multiple ASICs.
Our current results indicate that, with these modifications, our algorithm will serve as a practical solution to the hardware/software partitioning problem. We believe this will greatly increase design quality and reduce design time.
References
[1] D. Gajski, F. Vahid, S. Narayan, and J. Gong, Specification and Design of Embedded Systems. Englewood Cliffs, New Jersey 07632: Prentice Hall, Inc, 1994.
[2] R. Ernst, J. Henkel, and T. Benner, "Hardware-software cosynthesis for microcontrollers," in IEEE Design and Test of Computers, pp. 64-75, 1994.
[3] R. Gupta and G. D. Micheli, "Hardware-software cosynthesis for digital systems," IEEE Design and Test of Computers, vol. 10, no. 3, pp. 29-41, 1993.
[4] A. Kalavade, System-Level Codesign Of Mixed Hardware-Software Systems. PhD thesis, University of California, Berkeley, 1995.
[5] J. Gong, D. Gajski, and S. Narayan, "Software estimation from executable specifications," in The Journal of Computer and Software Engineering, 1994.
[6] R. Gupta and G. D. Micheli, "Partitioning of functional models of synchronous digital systems," in Proceedings of the IEEE International Conference on Computer Aided Design, pp. 216-219, 1990.
[7] F. Vahid and D. Gajski, "Specification partitioning for system design," in Proceedings of the 29th Design Automation Conference, 1992.
[8] D. Gajski, P. Grun, W. Pan, and S. Bakshi, "Design exploration for pipelined IDCT," Tech. Rep. 96-41, Dept. of Information and Computer Science, University of California, Irvine, 1996.
[9] A. B. Thordarson, "Comparison of manual and automatic behavioral synthesis on MPEG-algorithm," Master's thesis, University of California, Irvine, 1995.