

43.5

Hardware/Software Partitioning and Pipelining

Smita Bakshi†
Department of Electrical & Computer Engineering
University of California, Davis, CA 95616

Daniel D. Gajski
Department of Information & Computer Science
University of California, Irvine, CA 92697

Abstract

For a given throughput-constrained system-level specification, we present a design flow and an algorithm to select software (general-purpose processors) and hardware components, and then partition and pipeline the specification amongst the selected components. This is done so as to best satisfy the throughput constraint at minimal hardware cost. Our ability to pipeline the design at several levels enables us to attain high-throughput designs, and also distinguishes our work from previously proposed hardware/software partitioning algorithms.

1 Introduction

Digital systems, especially in the domain of digital processing and telecommunications, are immensely complex. In order to deal with the high complexity, increased time-to-market pressures, and a set of possibly conflicting constraints, it is now imperative to involve design automation at the highest possible level. This "highest possible level" may vary depending on the design, but given the fact that an increasing number of designs now contain a combination of different component types, such as general-purpose processors, DSP (digital signal processing) cores, and custom-designed ASICs (application-specific integrated circuits), we consider the highest level of design to be the one involving the selection and interconnection of such components. We refer to this as system-level design.

The reason why a system is best composed of different component types is due to the different characteristics of the components, which may be targeted at satisfying different constraints.

† This work was performed while the author was at UC Irvine.

The authors gratefully acknowledge SRC (Grant #93-DJ-146) for their support.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 97, Anaheim, California © 1997 ACM 0-89791-920-3/97/06..$3.50


Off-the-shelf processors offer high programmability, lower design time, and a comparatively lower cost and lower performance than an equivalent ASIC implementation. On the other hand, ASICs are more expensive to design and fabricate, but offer comparatively higher performance. Thus, for a given system, these components are selected such that the performance-critical sections are performed rapidly on ASICs, and the less critical sections, or the sections that require higher programmability, are performed on the processors.

Our work addresses throughput-constrained systems. Given a specification of such a system, we select processors and hardware resources to implement the system and then partition and pipeline it amongst the selected components so as to best satisfy the given throughput constraint at minimal hardware cost. The throughput of a system is the rate at which it processes input data, and this is often the prime constraint on digital signal processing systems, including most image processing applications. In order to meet the throughput constraints of these systems, it is not sufficient to perform only the critical sections in hardware; it is also necessary to pipeline the design. Pipelining divides the design into concurrently executing stages, thus increasing its data rate. Our work supports pipelining at four different levels of the design, namely the system, behavior, loop and operation level.
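To make the effect of pipelining on data rate concrete, consider a hypothetical specification (the numbers below are invented for illustration and are not from the paper): if a 12,000 ns computation is divided into three balanced stages of 4,000 ns each, a new input sample can be accepted every 4,000 ns, although the latency of any single sample remains 12,000 ns. The following Python sketch expresses that relationship.

def pipeline_metrics(stage_delays_ns):
    """Return (throughput period, latency), both in ns, for the given stage delays."""
    throughput_period = max(stage_delays_ns)  # a new sample enters once the slowest stage finishes
    latency = sum(stage_delays_ns)            # time for one sample to pass through every stage
    return throughput_period, latency

# A 12,000 ns computation split into three balanced 4,000 ns stages:
print(pipeline_metrics([4000, 4000, 4000]))   # -> (4000, 12000)
# The same computation left unpipelined accepts a new sample only every 12,000 ns:
print(pipeline_metrics([12000]))              # -> (12000, 12000)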

Over the past five years, several co-design systems [1][2][3][4] for hardware/software partitioning have been developed. However, these tools assume that the tasks in the system execute sequentially, in a non-pipelined fashion. Furthermore, they also assume that the operations within each task execute sequentially. In comparing our work with the synthesis systems mentioned above, we have, in general, extended the design space explored by these systems by allowing designs to be pipelined at the system and the task level. Hence, our work extends current algorithms by performing pipelining at several levels, and by so doing, it is capable of achieving partitions with high throughput values that are unattainable without pipelining.

2 Problem definition

Our problem of hardware/software partitioning and pipelining may be defined as follows:

Given:

1. A specification of the system as a control flow graph (CFG) of behaviors or tasks.

2. A hardware library containing functional units characterized by a three-tuple <type, cost, delay>.

3. A software (processor) library containing a list of processors characterized by a four-tuple <type, clock speed, dollar cost, metrics file>.

4. A clock constraint and a throughput constraint for the complete specification.

Determine:

1. An implementation type (either software or hardware) for every behavior.

2. The estimated area for each hardware behavior, as well as the total hardware area for the complete specification.

3. The processor to be used for each software behavior, as well as the total number of processors for the complete specification.

4. A division of the control flow graph into pipe stages of delay no more than the given throughput constraint.

Such that:

1. Constraints on throughput are satisfied, and

2. Total hardware area (for the given clock) is minimized.

The throughput constraint specifies the difference in the arrival time (in nanoseconds) of two consecutive input samples. We also refer to this time as the PS (pipe stage) delay, since this would be the required delay of a pipe stage in the design, if it were to be pipelined.

Figure 1: Inputs and outputs of the algorithm. (Left: the input control flow graph, where nodes are behaviors and arcs are control flow. Right: the pipelined and partitioned control flow graph and the system pipeline.)

The example in Figure 1 illustrates the problem defined above. As input we have a control flow graph of behaviors, a hardware and a software library, and a PS delay (throughput) and clock constraint. The nodes in the control flow graph represent behaviors and the arcs represent control dependencies. Each behavior contains a sequence of VHDL statements that represents computation done on variables.

The hardware library consists of a set of functional units with their corresponding delay (in ns) and area (in gates), and the software library contains a list of processors with their corresponding clock speeds, dollar cost and metrics file. The metrics file gives the number of instruction cycles and the number of bytes required to execute each of a list of 3-address generic instructions on that processor. This information characterizes a processor and is required to estimate the execution time of a behavior on a specific processor [5].
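As an illustration of how these libraries and the metrics file fit together, the following Python sketch shows one plausible representation and a simple execution-time estimate derived from a behavior's generic instruction mix. The structures, field names, and numbers are assumptions made for this sketch; the actual file formats and the estimation technique of [5] are more detailed.

# Hedged sketch of the library data and a simple software execution-time estimate.
# All names and numbers are assumptions for illustration only.
hardware_library = [
    # (type, cost in gates, delay in ns) -- the three-tuple described above
    ("Add2", 60, 50),
    ("Mpy1", 110, 160),
    ("Mpy2", 90, 200),
]

software_library = [
    # (type, clock speed in MHz, dollar cost, metrics file) -- the four-tuple
    ("PowerPC", 66, 300, "powerpc.metrics"),
    ("Pentium", 100, 450, "pentium.metrics"),
]

# A metrics file maps each generic 3-address instruction to
# (instruction cycles, bytes); here it is shown inline instead of parsed from disk.
pentium_metrics = {"add": (1, 3), "mul": (4, 3), "load": (2, 3), "store": (2, 3)}

def estimate_sw_time_ns(instruction_counts, metrics, clock_mhz):
    """Estimate the execution time of a behavior from its generic instruction mix."""
    cycles = sum(count * metrics[op][0] for op, count in instruction_counts.items())
    return cycles * 1000.0 / clock_mhz   # total cycles times the clock period in ns

# e.g. a behavior with 500 adds, 200 multiplies and 300 loads on a 100 MHz processor
print(estimate_sw_time_ns({"add": 500, "mul": 200, "load": 300}, pentium_metrics, 100))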

The output consists of a pipelined and partitioned CFG where every behavior has been mapped to either hardware or software and the graph has been divided into pipe stages of delay no more than 4000 ns. Every hardware behavior is associated with an estimate of its execution time and the number and type of components (selected from the hardware library) needed to obtain that execution time. For instance, behavior E has a throughput of 4000 ns and requires 3 instances of Mpy1, 1 instance of Mpy2 and 2 instances of Add2, bringing the total area to 430 gates. Similarly, every software behavior is associated with a processor from the software library, and its execution time on that processor. For instance, behavior A, implemented on the Pentium processor, has an execution time of 3100 ns.

Finally, the CFG has also been partitioned into three pipe stages such that the throughput of the system is 4000 ns, that is, each pipe stage has a delay of no more than 4000 ns. The hardware and software partitioning and pipelining have been done with the aim of satisfying the throughput constraint at minimal hardware cost. This scheduling and pipelining information is represented in the System Pipeline diagram, which depicts all the pipe stages and the execution schedule of behaviors within each pipe stage.
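The validity of such an output can be checked mechanically: no pipe stage's schedule may exceed the PS delay. A minimal sketch of this check follows, under the simplifying assumption that the behaviors within a stage execute back-to-back; the stage contents and most delays are invented for illustration.

# Hedged sketch: verify that each pipe stage meets the PS delay (throughput) constraint.
def stages_meet_constraint(stages, ps_delay_ns):
    """stages: list of pipe stages, each a list of (behavior, delay_ns) executed back-to-back."""
    for number, stage in enumerate(stages, start=1):
        stage_delay = sum(delay for _, delay in stage)
        if stage_delay > ps_delay_ns:
            print("pipe stage %d violates the PS delay: %d ns > %d ns"
                  % (number, stage_delay, ps_delay_ns))
            return False
    return True

example_stages = [
    [("A", 3100)],               # e.g. a software behavior occupying a stage on its own
    [("B", 1200), ("C", 2500)],
    [("D", 900), ("E", 3000)],
]
print(stages_meet_constraint(example_stages, 4000))   # -> True: every stage fits in 4000 ns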

Before we describe our algorithm, note some assumptions of our model:

1. The system architecture contains processors, ASICs (application-specific integrated circuits), and memory chips that all communicate over buses. The memory stores data that needs to be transferred between pipe stages as well as any globally defined data that may be accessed by multiple processors and/or ASICs. In this paper, we assume that all hardware behaviors are mapped onto one ASIC. After the pipelining and the partitioning, this ASIC may be further partitioned into smaller ASICs [6][7].

2. Two software behaviors may share the same processor, irrespective of the pipe stages they execute in.

3. Two hardware behaviors may only share resources if they execute sequentially in the same pipe stage.
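The following sketch shows how assumptions 2 and 3 could be encoded as a resource-sharing test; the behavior representation is an assumption made purely for illustration.

# Hedged sketch of the sharing rules implied by assumptions 2 and 3.
from dataclasses import dataclass

@dataclass
class Behavior:
    name: str
    impl: str          # "software" or "hardware"
    pipe_stage: int
    start_ns: int      # start of its time slot within the pipe stage
    end_ns: int

def may_share_resource(a, b):
    """Return True if behaviors a and b may share a processor or hardware resources."""
    if a.impl == "software" and b.impl == "software":
        # Assumption 2: two software behaviors may share a processor,
        # irrespective of the pipe stages they execute in.
        return True
    if a.impl == "hardware" and b.impl == "hardware":
        # Assumption 3: two hardware behaviors may share resources only if
        # they execute sequentially in the same pipe stage.
        return a.pipe_stage == b.pipe_stage and (a.end_ns <= b.start_ns or b.end_ns <= a.start_ns)
    return False       # a software and a hardware behavior never share a resource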

3 Algorithm

An overview of our algorithm for hardware/software partitioning and pipelining is presented in Figure 2. Given a SpecChart [1] specification, hardware and software libraries, and a throughput and clock constraint, the first step consists of deriving the control flow graph from the given specification. We then estimate [5] the execution time of all behaviors on all the available processors in the software library. This gives us the PET (processor execution time) table, an example of which is shown in Figure 3 for the CFG and software library given in Figure 1. Based on the assumption that a software implementation is always less costly than an equivalent hardware implementation for a given behavior, our algorithm attempts to execute as many behaviors as possible on processors. Thus, any behavior that has an execution time less than the given throughput constraint on at least one processor can be executed in software, and only those behaviors that have an execution time greater than the throughput constraint on all processors need be executed in hardware. For instance, for the example in Figure 3, behaviors A, B and D can be executed on a processor and behaviors C and E need to be executed in hardware.
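A sketch of this partitioning rule is given below. The PET values are illustrative and do not reproduce the entries of Figure 3, but the rule itself follows the text: a behavior remains in software if at least one processor can execute it within the PS delay, and goes to hardware otherwise.

# Hedged sketch of Step 2: the initial hardware/software partition from the PET table.
pet_table = {
    # behavior: {processor: estimated execution time in ns} -- values are illustrative
    "A": {"PowerPC": 2800, "Pentium": 2200},
    "B": {"PowerPC": 900, "Pentium": 700},
    "C": {"PowerPC": 12000, "Pentium": 9500},
    "D": {"PowerPC": 1200, "Pentium": 1000},
    "E": {"PowerPC": 61000, "Pentium": 48000},
}

def initial_partition(pet, ps_delay_ns):
    """A behavior stays in software if some processor meets the PS delay; otherwise hardware."""
    return {b: ("software" if min(times.values()) <= ps_delay_ns else "hardware")
            for b, times in pet.items()}

print(initial_partition(pet_table, 4000))
# -> {'A': 'software', 'B': 'software', 'C': 'hardware', 'D': 'software', 'E': 'hardware'}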

Figure 2: Overview of the hardware/software partitioning and pipelining algorithm (build the control flow graph from the SpecChart specification, partition into hardware and software, estimate hardware resources, pipeline and schedule the control flow graph, and modify the processor allocation when resources are not fast enough).

Figure 3: Step 2: hardware/software partition. (a) Processor execution-time table (throughput constraint = 4000 ns, initial processor allocation = PowerPC); (b) resulting hardware/software partition.

Next, for each hardware behavior, our hardware estimation algorithm performs component selection, scheduling and pipelining, to determine the total number and types of components being used, the number and position of the pipe stages, as well as the schedule (clock state divisions) within each pipe stage. This area and performance information is then entered into a Hardware Execution Time (HET) table, to be used in the next step. Note that if throughput constraints cannot be satisfied at this step, the only alternatives are to include faster resources in the hardware library, or to rewrite the specification by manually partitioning critical behaviors.
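As a rough illustration of what an HET table entry might hold (an assumed structure, not the paper's actual format, with the component counts borrowed from the behavior E example of Figure 1):

# Hedged sketch of a Hardware Execution Time (HET) table entry: for each hardware
# behavior, the components selected from the hardware library, the resulting area,
# and the estimated execution time.
het_table = {
    "E": {
        "components": {"Mpy1": 3, "Mpy2": 1, "Add2": 2},   # functional-unit instances
        "area_gates": 430,
        "exec_time_ns": 4000,
    },
    # ... one entry per hardware behavior (e.g. "C": {...})
}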

In Step 5, we schedule and pipeline the control flow graph, that is, we determine a pipe stage and a time slot within the pipe stage for each behavior. For each software behavior, we also determine the processor on which it will be executed. We use a version of the well-known list-scheduling algorithm. We first make a prioritized list of behaviors whose predecessors have already been assigned to a pipe stage and time slot. This is known as the ready list. If the behavior at the top of the ready list is a hardware behavior, we assign it to the earliest feasible pipe stage and time slot. If the behavior is a software behavior, we determine the processor on which we can complete the execution of the behavior at the earliest possible time, and then schedule the behavior in that pipe stage and time slot. We repeat this step until all behaviors have been assigned to a pipe stage and a time slot.
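A condensed Python sketch of this list-scheduling loop is given below. It is an approximation rather than the exact procedure of the paper: it keeps only stage-level time budgets (no detailed clock-state schedule within a stage), assumes hardware behaviors already fit within the PS delay after the estimation step, and ignores inter-stage data-transfer delays. All names and structures are illustrative.

# Hedged sketch of Step 5: list scheduling of behaviors into pipe stages.
def list_schedule(priority, preds, hw_delay, sw_delay, processors, ps_delay_ns):
    """priority: behavior names in priority order; preds: name -> set of predecessor names;
    hw_delay: hardware behavior -> delay in ns (assumed <= ps_delay_ns);
    sw_delay: software behavior -> {processor type: delay in ns};
    processors: list of allocated processor types (duplicates = multiple instances)."""
    stage_used = {}                              # pipe stage -> time consumed on the ASIC
    proc_used = [dict() for _ in processors]     # per processor instance: {stage: time consumed}
    placement, done = {}, set()

    while len(done) < len(priority):
        # ready list: unscheduled behaviors whose predecessors have all been placed
        ready = [b for b in priority if b not in done and preds[b] <= done]
        b = ready[0]
        # never place a behavior in an earlier pipe stage than its predecessors
        earliest = max([placement[q][0] for q in preds[b]], default=1)
        if b in hw_delay:
            # hardware behavior: earliest pipe stage with enough room left
            stage = earliest
            while stage_used.get(stage, 0) + hw_delay[b] > ps_delay_ns:
                stage += 1
            stage_used[stage] = stage_used.get(stage, 0) + hw_delay[b]
            placement[b] = (stage, "ASIC")
        else:
            # software behavior: the processor instance that finishes it earliest
            best = None                          # (stage, finish time, instance index)
            for i, ptype in enumerate(processors):
                d = sw_delay[b].get(ptype)
                if d is None or d > ps_delay_ns:
                    continue                     # this processor can never meet the PS delay
                stage = earliest
                while proc_used[i].get(stage, 0) + d > ps_delay_ns:
                    stage += 1
                finish = proc_used[i].get(stage, 0) + d
                if best is None or (stage, finish) < (best[0], best[1]):
                    best = (stage, finish, i)
            if best is None:
                return None                      # no valid schedule: modify the allocation (Step 6)
            stage, finish, i = best
            proc_used[i][stage] = finish
            placement[b] = (stage, processors[i])
        done.add(b)
    return placement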

If we cannot determine a valid schedule and pipeline, we increase the speed and/or the number of processors and repeat the scheduling and pipelining step. We use a simple, almost exhaustive method of modifying the processor allocation. We first replace the allocated processor with a faster processor from the library, if available. Once we have tried scheduling with the fastest processor, we start with two instances of the slowest processor and then increase the processors' speed, one at a time, in every iteration of Step 5. This is repeated till constraints are satisfied; in the worst case, this may be repeated till there are as many processors as software behaviors in the CFG.
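This allocation-modification strategy can be pictured as enumerating candidate allocations in order of increasing capability and re-running Step 5 for each. The sketch below captures that ordering under the stated assumptions (a library sorted slowest to fastest, homogeneous allocations, and the list_schedule sketch above as the scheduling step); the processor names are examples only.

# Hedged sketch of Step 6: enumerate processor allocations to retry scheduling with.
def candidate_allocations(processor_library, max_processors):
    """processor_library: processor types ordered slowest to fastest.
    Yields allocations: one instance of each type (slowest first), then two instances,
    and so on, up to max_processors (at most one processor per software behavior)."""
    for count in range(1, max_processors + 1):
        for ptype in processor_library:
            yield [ptype] * count

# Usage, tying Steps 5 and 6 together:
# for alloc in candidate_allocations(["68000", "PowerPC", "Pentium"], max_processors=3):
#     placement = list_schedule(priority, preds, hw_delay, sw_delay, alloc, ps_delay_ns)
#     if placement is not None:
#         break   # constraints satisfied with this allocation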

4 Experimental results

We have integrated the hardware/software partitioning and pipelining algorithm within SpecSyn [1], a system synthesis tool. Our experiments are designed to evaluate the quality of the hardware estimation and the quality of the hardware/software partitioning and pipelining algorithm. We present a synopsis of our results.

The quality of the hardware estimation was evaluated by comparing manually obtained designs of the IDCT, the FFT, and 6 blocks from the MPEG II decoder against those obtained by the estimation algorithm. In general, for a given throughput constraint, our estimates were within 10% of the area of the manual designs. The estimation errors arose mainly because our algorithm does not handle multi-functional units, and is currently incapable of using multiple-bitwidth implementations of the same component type.

The quality of the hardware/software partitioning and pipelining algorithm was evaluated by comparing the manual design exploration process against the results obtained by our algorithm for the MPEG II decoder example. The manual design process started with an all-software, non-pipelined design and then moved the critical behaviors to hardware, as well as pipelined the system, till its throughput was within about 3000 ns (this represents a decoding rate of 30 frames per second). Similarly, we ran our algorithm for a range of PS delay constraints (700,000 ns to 3,000 ns), starting from an all-software solution and moving towards an all-hardware solution. Results of the comparison are shown graphically (in part) in Figure 4.

Figure 4: PS delay (ns) vs. hardware area (gates) for MPEG designs obtained by manual and automatic means.

The results indicate that the design exploration conducted by our algorithm closely matches the manual exploration. The consistent difference in area between designs on the two curves is because our designs do not include the controller area, whereas the manual designs do. Despite the inaccuracy of our estimates, their fidelity is high, indicating that, with further improvements, it will be feasible to replace the manual exploration process with our algorithm.

The results also indicate that the designs obtained by our algorithm were in some cases able to share the same processor amongst two behaviors, hence requiring fewer processors. (Note that the Pentium processor was selected to be the best from a library of about 6 processors, including the Sparc, PowerPC, and 68000 processors.) More importantly, the results show that the fastest design attainable by our algorithm (2980 ns) is about 25% faster than that obtained by the manual design. This is because our algorithm performs pipelining at the system, behavior, loop and operation levels, which is difficult for designers to perform manually.

It is important to note that while the manual design and exploration took approximately 6 man-months, our algorithm took approximately 3 minutes per design, with a total of about 30 minutes on a SUN SPARC 5 workstation. (Details of these experiments are provided in [8] and [9].) Given the similarity of both results, this is a significant saving in design time.

5 Conclusion

Our design flow and algorithms may be improved in several ways. At the behavior level, our resource estimation algorithm can be improved by allowing the use of multi-functional components, by allowing multiple bitwidths of the same component type, by extending our algorithms to pipeline in the presence of loop-carried dependencies, and by providing estimates for the cost of multiplexers and buses.

Our design model and algorithms can support these extensions, and incorporating these extensions is part of ongoing work. At the system level, we can improve the pipelining and partitioning algorithms by incorporating some measure of processor cost into our cost metric, by taking interface cost and delay into account, as well as by partitioning the system amongst multiple ASICs.

Our current results indicate that, with these modifications, our algorithm will serve as a practical solution to the hardware/software partitioning problem. We believe this will greatly increase design quality and reduce design time.

References

[1] D. Gajski, F. Vahid, S. Narayan, and J. Gong, Specification and Design of Embedded Systems. Englewood Cliffs, New Jersey: Prentice Hall, 1994.

[2] R. Ernst, J. Henkel, and T. Benner, "Hardware-software cosynthesis for microcontrollers," IEEE Design and Test of Computers, pp. 64-75, 1994.

[3] R. Gupta and G. D. Micheli, "Hardware-software cosynthesis for digital systems," IEEE Design and Test of Computers, vol. 10, no. 3, pp. 29-41, 1993.

[4] A. Kalavade, System-Level Codesign of Mixed Hardware-Software Systems. PhD thesis, University of California, Berkeley, 1995.

[5] J. Gong, D. Gajski, and S. Narayan, "Software estimation from executable specifications," The Journal of Computer and Software Engineering, 1994.

[6] R. Gupta and G. D. Micheli, "Partitioning of functional models of synchronous digital systems," in Proceedings of the IEEE International Conference on Computer Aided Design, pp. 216-219, 1990.

[7] F. Vahid and D. Gajski, "Specification partitioning for system design," in Proceedings of the 29th Design Automation Conference, 1992.

[8] D. Gajski, P. Grun, W. Pan, and S. Bakshi, "Design exploration for pipelined IDCT," Tech. Rep. 96-41, Dept. of Information and Computer Science, University of California, Irvine, 1996.

[9] A. B. Thordarson, "Comparison of manual and automatic behavioral synthesis on MPEG-algorithm," Master's thesis, University of California, Irvine, 1995.
