

Stream Computations Organized for Reconfigurable Execution (SCORE): Introduction and Tutorial

Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy, André DeHon, John Wawrzynek

    U.C. Berkeley BRASS group

    August 25, 2000 / Version 1.0

    Abstract

A primary impediment to widespread exploitation of reconfigurable computing is the lack of a unifying computational model which allows application portability and longevity without sacrificing a substantial fraction of the raw capabilities. We introduce SCORE (Stream Computation Organized for Reconfigurable Execution), a stream-based compute model which virtualizes reconfigurable computing resources (compute, storage, and communication) by dividing a computation up into fixed-size "pages" and time-multiplexing the virtual pages on available physical hardware. Consequently, SCORE applications can scale up or down automatically to exploit a wide range of hardware sizes. We hypothesize that the SCORE model will ease development and deployment of reconfigurable applications and expand the range of applications which can benefit from reconfigurable execution. Further, we believe that a well-engineered SCORE implementation can be efficient, wasting little of the capabilities of the raw hardware. In this paper, we introduce the key components of the SCORE system.

    1 Introduction

A large body of evidence exists documenting the raw advantages of reconfigurable hardware such as FPGAs over conventional microprocessor-based systems on selected applications. Yet reconfigurable computing remains in limited use, popular primarily in application-specific domains (e.g. [23] [32] [38]) or as a replacement for ASICs for rapid prototyping and fast time-to-market. This limited popularity is not due to any lack of raw hardware capability, as million-gate devices are readily available [37] [2], and we have seen recent advances in high clock rates [30] [28], rapid reconfiguration [29] [16] [14], and high-bandwidth memory access [24] [19] [25]. Rather, we believe that the limited applicability of reconfigurable technology derives largely from the lack of any unifying compute model to abstract away the fixed resource limits of devices which, otherwise, restrict software expressibility as well as longevity across device generations.

A significantly reduced extended abstract of this article appears in FPL'2000 and is copyright Springer-Verlag. This expanded version can be found at <http://www.cs.berkeley.edu/projects/brass/documents/score_tutorial.html>.

Existing targets are non-portable. Software for reconfigurable hardware is typically tied to a particular device (or set of devices), with limited source compatibility and no binary compatibility even across a vendor-specific family of devices. Redeploying a program to bigger, next-generation devices, or alternatively to a smaller, cheaper, or lower-power device, typically requires substantial human effort. At best, it requires a potentially expensive pass through mapping tools. At worst, it requires a significant rewrite to fully exploit new device features and sizes. In contrast, a program written for microprocessor systems can automatically run and benefit from additional resources on any ISA-compatible device, without recompilation.

Existing targets expose fixed resource limitations. The exposure of fixed resource limitations in existing programming models tends to impair their expressiveness and broad applicability. In such programming models, an application's choice of algorithm and spatial structure is restricted by the size of available hardware. Furthermore, a computation's structure and size must be fixed at compile time, with no allowance for dynamic resource allocation. Hence algorithms with data-dependent structures or potentially unbounded resource usage cannot be easily mapped to reconfigurable hardware.1

Virtualize resources. The SCORE compute model introduced in this paper addresses the issue of fixed resource limits by virtualizing the computational, communication, and memory resources of reconfigurable hardware. FPGA configurations are partitioned into fixed-size, communicating pages which, in analogy to virtual memory, are "paged in" or loaded into hardware on demand. Streaming communication between pages which are not simultaneously in hardware may be transparently buffered through memory. This scheme allows a partitioned program to run on arbitrarily many physical pages and to automatically exploit more available physical pages without recompilation. With proper hardware design, this scheme permits binary compatibility and scalability across an architectural family of page-compatible devices.

Convenient and Efficient Model. For software to benefit from additional physical resources (pages), the programming model should expose (page-level) parallelism and permit spatial scaling. SCORE's programming model is a natural abstraction of the communication which occurs between spatial hardware blocks. That is, the data flow communication graph captures the blocks of computation (operators) and the communication (streams) between them. Once captured, we can exploit a wealth of well-known techniques for efficiently mapping these computational graphs to arbitrary-sized hardware. Furthermore, run-time composition of graphs is supported, enabling data-driven program structure, dynamic resource allocation, and the integration of separately compiled or developed library components.

Section 2 of this paper discusses other systems and compute models which have influenced the formulation of SCORE. Section 3 presents the key components of the SCORE model. Section 4 discusses the hardware requirements for a SCORE implementation and why they are reasonable in today's technology. Section 5 gives a brief introduction to programming constructs for SCORE. Section 6 shows an execution sample, and Section 7 describes the basic architecture for our current implementation of the SCORE run-time system. Section 8 shows results from a JPEG encoder in SCORE as an example of our early experience implementing a SCORE system.

1 Data-dependent computational structures can be constructed via specialization and recompilation, as in [38], but this requires a complete pass through mapping tools.

    2 Related Work

The technique of time-multiplexing a large spatial design onto a small reconfigurable system was demonstrated by Villasenor et al. [31]. By hand-partitioning a particular design (a motion-wavelet video coder) into a graph of FPGA-sized "pages" and manually reconfiguring each device with those pages, they were able to run the design on one third as many devices (i.e. physical pages) as were originally required, with only 10% performance overhead. The key to this approach's efficiency was to amortize the cost of reconfiguration by having each page process a sizable stream of data (buffered through memory) before reconfiguring. SCORE aims to automate the partitioning and efficient dynamic reconfiguration performed manually by Villasenor et al.

The ease and success of such automation depends on appropriate models for program description and dynamic reconfiguration. In this regard, SCORE builds on prior art developing ISA, data flow, distributed, and streaming computation models. In the remainder of this section, we discuss the relation of SCORE to these prior models.

ISA Models

The first attempts to define a "compute model" for reconfigurable computing devices were focussed on augmenting a traditional processor ISA with reconfigurable instructions. PRISC [27] (and later Chimæra [15]) allowed the definition of single-cycle, Programmable Function Unit (PFU) operations, using a TLB-like management and replacement scheme to virtualize the space of PFU instructions, exploiting local, dynamic reuse of PFU instructions. The size of the PFUOP itself, however, was fixed by the architecture, and PFUOPs are constrained by the sequential ISA to execute sequentially. Hence, the model does not directly allow the architecture to scale and exploit additional parallel hardware.

DISC [36] and GARP [16] expand the PRISC model to allow variable-size and multiple-cycle array configurations. These architectures can pack multiple configurations (instructions) into the available array and, in the case of GARP, support an implementation-dependent number of cached array configurations. However, like PRISC, each array configuration must be smaller than the available physical logic, and reconfigurable instructions can only be composed sequentially in the ISA. Consequently, these architectures also prevent one from scaling array size and automatically exploiting the additional parallel hardware.

OneChip [19] expands the ISA extension model further by allowing scoreboarded operations from memory to memory in the ISA. While still based on a sequential ISA computing model, this potentially facilitates the use of multiple, parallel RFUs. As long as each RFUOP operates on independent memory banks, the RFU operations will not interlock and may proceed in parallel. This technique, however, exposes the memory buffers between pipelined and chainable operations. It forces the user or compiler to pick blocking factors and to schedule blocked operations in parallel in the processor's instruction dispatch window. In fact, this approach prevents direct pipeline assembly of chained operations. Furthermore, the ISA forces the compiler to schedule the invocation of RFU operations, limiting the opportunity to schedule components of the reconfigurable computation in a data-driven manner.

Dynamic Reconfiguration

Ling and Amano [22] describe the Multi-Processor WASMII, a scalable FPGA-based architecture which partitions and time-multiplexes large applications as FPGA-sized pages, much like SCORE. The primary limitation to WASMII's performance is that page communication is buffered through a small, fixed set of device registers (the "token router"). With such a small communication buffer, a page can operate for only a short time before depleting available inputs or output space and, if the page is time-multiplexed, triggering reconfiguration. Hence, when running a design which is larger than available hardware, execution time may be dominated by reconfiguration time. Brebner [5] [6] proposes a similar demand-paged, reconfigurable system based on arbitrary-sized swappable logic units (SLUs) which communicate through periphery registers2 and are subject to the same inefficiency as WASMII when time-multiplexed. SCORE avoids this inefficiency by allowing large (unbounded) communication buffers, enabling longer page execution between reconfigurations.

CMU's PipeRench [14] defines a reconfigurable fabric paged into horizontal stripes which communicate vertically as a pipeline. The execution model fully virtualizes stripes and enables hardware scaling to any number of physical stripes. Although stripes communicate through input-output registers as in WASMII, PipeRench's stripe-sequential, pipelined reconfiguration scheme hides the excessive reconfiguration overhead seen in WASMII. This sequential reconfiguration scheme is well suited to simple, feed-forward pipelines. However, this scheme does not support computation graphs with feedback loops, and it may waste available parallelism when squeezing wide graphs into a linear sequence of stripes. In particular, when virtualizing a computation with more parallelism than is available in a single architected stripe, non-communicating stripes which simultaneously fit into hardware must still load in sequence, incurring added latency and an area cost for buffering stripe I/O. SCORE makes no such restrictions on execution order, allowing parallel reconfiguration of physical pages.

2 In Brebner's "parallel harness" model, SLUs are arranged in a mesh and communicate with nearest neighbors via periphery registers. In the data-parallel "sea of accelerators" model, SLUs do not communicate with each other and so would not incur the same virtualization overhead discussed above.

Data Flow

The original Dennis formulation of data flow [11] [10] described a processor ISA which represented data flow graphs directly, each instruction being an operator. The execution model included only a single result register per instruction, allowing an instruction to execute only once at a time before its successors must execute. While this restriction on instruction ordering is reasonable for a microprocessor where large instruction store and fast instruction issue are available, it is not reasonable for a reconfigurable device where reconfiguring on each instruction issue is too costly. Iannucci's hybrid data flow [18] and Berkeley's TAM [9] define operators by straight-line blocks of instructions, relaxing the frequency of inter-instruction synchronization to only the entry and exit points of blocks. Nevertheless, these models inherit the same problem of fixed communication buffers as Dennis data flow and thus face the same inefficiency as WASMII in a time-multiplexed reconfigurable implementation.

Streaming formulations of data flow remove the limitation of fixed input-output buffers, allowing arbitrarily many tokens to queue up along an arc of a data flow graph. This generalization allows a time-multiplexed implementation to fire an operator many times in succession before reconfiguring, amortizing the cost of reconfiguration over a large data set. Lee's synchronous data flow (SDF) [21] [3] incorporates streaming for the restricted case of static flow rates. Although this model of computation is not Turing-complete (it lacks data-dependent control flow), it guarantees that conforming graphs can be statically scheduled to run with bounded stream buffers.

Buck's integer-controlled data flow (IDF) [7] incorporates data-dependent control flow by adding to SDF a set of canonical dynamic-rate operators (e.g. switch, select). SCORE permits a dynamic-rate model, allowing data-dependent control flow inside any operator. As such, SCORE programs are essentially equivalent to IDF in expressiveness, since a SCORE operator is equivalent to an IDF graph containing dynamic operators.

SCORE shares a gross similarity to heterogeneous systems which use streaming data flow to tie together arbitrary processors (conventional, special-purpose, and/or reconfigurable), including MIT's Cheops [4], MagicEight [35] [34], and Berkeley's Pleiades [26] [1]. The programming model of these systems is more restricted than SCORE, typically based on a pre-defined set of streaming operations. Furthermore, SCORE provides a stronger abstract model, allowing pages (processors) to be swapped as needed and hiding implementation limitations like buffer sizes.

Streaming APIs

Virginia Tech [20] defines a streaming API for processor-controlled networks of reconfigurable devices (e.g. SLAAC [8]). While standardizing the form in which applications may be written, this API does not, in and of itself, virtualize the size or fabric of compute resources and hence does not allow the definition of portable and scalable designs. Rather, it serves only as a hardware interface layer to manually reconfigure devices on the network.

Maya Gokhale [13] defines a C-based streaming programming model, Streams-C. Like SCORE, Streams-C exploits the fact that reconfigurable hardware is efficiently organized as a collection of spatial pipelines and that streams provide a natural abstraction for the hardware linkage between two separate design components. Nevertheless, Streams-C only serves as a convenient way to compactly describe spatial designs. No virtualization is performed, and the burden of handling placement and fixed buffer size restrictions is placed entirely on the programmer. In these regards, SCORE attempts to provide a much higher level programming model, providing semantics which are decoupled from hardware artifacts, like buffer sizes and physical hardware size, and automatically filling in these lower level details at compile and run time.

CSP

In many ways the SCORE computational model is similar to Hoare's Communicating Sequential Processes (CSP) [17]. Each SCORE operator can be viewed as a single process. These operators communicate with each other via designated stream connections, somewhat like CSP's named ports. Unlike CSP ports, SCORE streams are buffered and offer an unbounded stream abstraction. Significantly, SCORE operators, unlike CSP processes, are not allowed to be non-deterministic. Composition of SCORE operators always yields deterministic, observable results. In fairness, most of CSP's non-determinism is to facilitate modeling of unpredictable, dynamic effects in real systems, and most of SCORE could be modeled on top of CSP. SCORE also allows dynamic construction of computational graphs, which was not in the original CSP formulation but could of course be added.

    3 SCORE Computational Models

A compute model defines the computational semantics that a developer expects the physical machine to provide. The compute model itself is abstract but captures the essence of how computation proceeds, defining the meaning of any computation. The compute model is given a more concrete embodiment in one or more programming models. The programming model provides a high-level view of application composition and execution, adding a number of practical conveniences for the programmer. Ultimately, both models are grounded in an execution model which defines the way the computation is actually described to the physical hardware and the meaning associated with any such description.

The execution model, programming model, and abstract computational model are all consistent views of computation. What differs among them is the level of detail which they expose or abstract (see Figures 1 and 2). The execution model abstracts the number of key resources (e.g. ALUs, pages) to allow scaling across different hardware platforms. The programming model abstracts architectural characteristics found in the execution model (e.g. ISA details, limited resource sizes exposed at the architectural level). The compute model abstracts away the concrete syntax and primitives provided by a particular programming language or system.

    3.1 Compute Model

A SCORE computation is a graph of computation nodes (operators) and memory blocks linked together by streams. Streams provide node-to-node communication and are simply single-source, single-sink FIFO queues with unbounded length. Graph nodes (operators) are of two forms: (1) Finite-State Machine (FSM) nodes, which interact with the rest of the graph only through their stream links; and (2) Turing-complete (TM) nodes, which support resource allocation in addition to stream operations.

SCORE FSMs have the property that the present state identifies a set of inputs to be read from the input streams. Once a full set of inputs is present, the FSM consumes the inputs from the appropriate set of input FIFOs and may conditionally emit outputs or close input or output streams. As with any standard FSM, SCORE FSMs transition to a new state based on their inputs and present state. Each SCORE FSM has a distinguished done state into which it may enter to signal its completion and to remove itself from the running computation.

A SCORE TM node is similar to a SCORE FSM node but adds the ability to allocate memory and to create new graph nodes (FSM or TM operators) and edges (streams) in the SCORE compute graph.
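To make these semantics concrete, the following C++ sketch (our own illustrative model with invented names, not a SCORE API) represents a stream as an unbounded FIFO and an FSM operator that can fire only when the full input set named by its current state is present:

#include <cstdio>
#include <deque>

// Illustrative model of SCORE compute-model semantics (hypothetical
// names).  A stream is a single-source, single-sink FIFO of
// unbounded length.
template <typename T>
struct Stream {
    std::deque<T> q;
    bool closed = false;
    void put(T v)    { q.push_back(v); }
    bool has_token() { return !q.empty(); }
    T    get()       { T v = q.front(); q.pop_front(); return v; }
};

// A two-input adder as a one-state FSM operator: its single state
// requires a token on each input before it can fire.
struct AddOp {
    Stream<int> &a, &b, &out;
    bool done = false;
    AddOp(Stream<int>& a_, Stream<int>& b_, Stream<int>& o_)
        : a(a_), b(b_), out(o_) {}
    // Fire once if the current state's full input set is present;
    // otherwise the operator simply stalls (returns false).
    bool step() {
        if (done || !a.has_token() || !b.has_token()) return false;
        out.put(a.get() + b.get());
        if (a.closed && !a.has_token()) done = true;  // enter "done"
        return true;
    }
};

int main() {
    Stream<int> a, b, o;
    a.put(1); a.put(2); b.put(10);
    AddOp add(a, b, o);
    while (add.step()) {}           // fires once, then stalls on b
    std::printf("%d\n", o.get());   // prints 11
}

Note how the timing of token arrival cannot change the result: the operator either fires on a complete input set or stalls, which is the source of the determinism discussed below.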

Memory is allocated in finite-sized blocks called segments. Each segment may be owned by a single operator at a time. A SCORE TM may allocate new segments and pass them on to an FSM or TM node that it creates. Upon termination, when a TM or FSM node enters the done state, it returns ownership of any received segments back to the operator that created it. If an operator attempts to access a memory segment that it does not presently own, that access is blocked (i.e. the operator stalls) until the operator regains ownership of the memory segment.

Figure 1: The three models. Compute model: abstract model capturing essential semantics of computation. Programming model: particular set of programming constructs providing a convenient way to express computations in the compute model. Execution model: low-level (executable) description of the computation and the semantics which the hardware is expected to provide when interpreting this description. The programming model is abstracted from certain details that arise in the execution model, like architectural page size or number of registers. The execution model is abstracted from certain hardware size details like number of resources.

Figure 2: Example model instances. Sequential execution: programming models C+Unix and C+WinAPI; execution models MIPS-ISA + single global memory + Unix-ABI and x86-ISA + WinABI. SCORE: programming model C++ + TDF; execution model MIPS-ISA + linux + linux-ABI + ScoreRT + SCORE with 256 4-LUT CPs.

The operational semantics of the SCORE compute model are fully deterministic. This follows from the determinism of individual operators, the timing-independent communication discipline, and the fact that operators cannot side-effect each other's state. In particular, (1) operators communicate with each other only through streams, whose token flow semantics guarantee a timing-independent order of execution; and (2) memory segments have a single, unique owner at any time and thus do not suffer from multiple-accessor, read/write-ordering hazards. Thus, the observable results of a SCORE computation are completely independent of the timing of any operator or the delay along any stream.

    Appendix B defines the compute model more precisely.

    3.2 Programming Model

A programming model gives the programmer a framework for describing a computation in a manner independent of device limits, along with guidelines for efficient execution on any hardware implementation. It can be more abstract than the execution model because the compiler will take care of translating the higher level description provided by the programmer into the details needed for execution. The key abstractions of the SCORE programming model are operators, streams, and memory segments.

3.2.1 Basic Components

Operators

An operator represents a particular algorithmic transformation of input data to produce output data. Operators are the computational building blocks for a computation (e.g. multiplier, FIR filter, FFT). Operators may be behavioral primitives or hierarchical graph compositions of other operators. Figure 3 shows an example video processing operator composed as a pipeline of transformations, including a motion estimation operator, an image transformation operator, a data quantization operator, and a coding operator. The size of an operator in hardware is implementation dependent and is in no way limited in the programming model. Operators may need to be partitioned to fit onto an architectural compute page. Partitioning is an integral, automated part of the compilation process.

Streams

Inter-operator communication uses a streaming data flow discipline. When the programmer needs to connect operators together, he links the producer to the consumer operator using a stream link. The link both serves to define where data is logically routed and acts as an unbounded-length queue for data tokens. Operators signal both when they are producing data and when they need to consume data. This signalling translates into data presence signals on the stream links which synchronize all communication between operators.

Memory Segments

A memory segment is a contiguous block of memory and serves as the basic unit for memory management. Memory segments may be any size, up to an architecturally defined maximum. A memory segment may be used in a SCORE computation by giving it a specific operating mode (e.g. sequential read, random-access read-write, FIFO) with an appropriate stream interface, then linking it into a data flow graph like any other operator (see Figure 5).

    3.2.2 Dynamic Features

On top of these basic components, SCORE supports a number of important dynamic features:

• Dynamic rate operators
• Dynamic graph composition and instantiation
• Dynamic handling of uncommon events

Dynamic rate operators

An operator may consume and produce tokens at data-dependent rates. This expressive power allows SCORE to describe efficient operators for tasks such as data compression, decompression, and searching or filtering. Section 5.2 shows a possible set of linguistic constructs for supporting dynamic rate consumption and production. To exploit dynamic rates, scheduling decisions should be made at run time, when the dynamic rates and actual data availability are known.

Dynamic composition and instantiation

SCORE allows run-time instantiation of operators and data flow graphs. That is, the computational graph may be created, extended, or modified during execution. Extending the graph means creating new graph nodes and edges which may be defined in a data-dependent manner. An operating node may terminate during execution, and existing stream links may be shut down by their attached operators.

This mechanism has several benefits over describing a computation strictly by a static graph at compile time. It gives the programmer an opportunity to postpone or avoid allocating resources for parts of the computation which are not used immediately or whose resource requirements cannot be bound until run time. It also enables the creation of data-dependent computational structures, for instance, to exploit dynamically-unrolled parallelism. Finally, it creates a framework in which aggressive implementations may dynamically specialize operators around instantiation parameters. That is, an operator may have parameters bound at instantiation time, i.e. when the operator is composed into a data flow graph. This mechanism allows operators to be initialized with unchanging or slowly changing scalar data or to be specialized around parameter values. Examples in Section 5.2 show one set of linguistic constructs to support composition and instantiation.

Figure 3: Video Processing Operator (a pipeline of Motion Estimation, Transformation, Quantization, and Coding operators)

Exception handling

Exception handling falls naturally out of the data flow discipline of SCORE. When an unusual condition occurs, the operator may raise an exception. At this point, the operator stops rather than producing output data. Dependent, downstream operators may have to stall waiting for this operator to resume and produce an output, but the data flow discipline guarantees that they wait properly for the operator to handle the exception and produce a result. When the exception is handled, the raising operator resumes operation, producing data and allowing the downstream operators to resume in turn.

    3.3 Execution Model

The key idea of a computer architecture is that it defines the computational description that a machine will run and the semantics for running it (e.g. the x86 ISA is a popular architectural definition for processors). Someone building a conforming device is then free to implement any detailed computer organization that reads and executes this computational description (e.g. the i80286, i80386, i80486, Pentium, and K6 are all different implementations that run the same x86 computational description). Following this technique, the execution model for SCORE defines the run-time computational description for an architecture family and the semantics for executing this description.

The SCORE execution model defines all computation in terms of three key components:

• A compute page (CP) is a fixed-size block of reconfigurable logic which is the basic unit of virtualization and scheduling.

• A memory segment is a contiguous block of memory which is the basic unit for data page management.

• A stream link is a logical connection between the output of one page (CP, segment, processor, or I/O) and the input of another page. Stream implementations will be physically bounded, but the execution model provides a logically unbounded stream abstraction.

A computational description in this execution model is independent of the size of the reconfigurable array, admitting architectural implementations with anywhere from one to a large number of compute pages and memories. The model provides the semantics of an unlimited number of independently operating physical compute pages and memory segments. Compute pages and segments operate on stream data tagged with input presence and produce output data to streams in a similar manner. The use of data presence tags provides an operational semantics that is independent of the timing of any particular SCORE-compatible computing platform.

Fixed Compute-Page Sizes

Compute pages are the basic unit of virtualization, scheduling, reconfiguration, and relocation. In analogy with a virtual memory page, a compute page is the minimum unit of hardware which is mapped onto physical hardware and is managed as an atomic entity. Each compute page represents a fixed-size piece of reconfigurable hardware (e.g. 64 4-LUTs). Compute pages differ from the operators of the compute model in that pages have architecturally imposed resource limitations such as size and maximum number of streams.

The decomposition of a computation into compute pages takes the stand that it is neither feasible nor desirable to manage every primitive computational building block (e.g. 4-LUT) as an independent entity, just as it is generally not desirable to manage every bit of memory as an independent block. Rather, by grouping together a larger block of resources, management and overhead can be amortized over the larger number of computational blocks. This grouping also allows hard problems, like placement and routing, to be performed offline within each page. Note that it is necessary that the page size be fixed across an architecture family so that all family members can run from the same run-time (binary) description. Otherwise, page (re-)packing, placement, and routing would need to be performed online. The fixed page discipline requires that compilers partition (or pack) more abstract computational operators into these fixed-size pages. Figure 4 shows an example decomposition of an operator graph into pages.

Figure 4: Example of Page Decomposition: (a) original operator as seen in programming model (two multiplies by constants C1 and C2 feeding a 32b add, with 16b/32b streams), (b) mapped to logic elements (LEs) in target architecture (multiply-by-constant: 108 LEs each; 32b add: 32 LEs), (c) partitioning into 64-LE pages to match execution model page size, (d) final graph used by execution model
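As a rough illustration of the packing step (a hypothetical greedy pass with invented names, not the actual BRASS tool flow, and one that does not split operators larger than a page as Figure 4 does), the following C++ sketch groups operators into fixed-capacity pages in topological order:

#include <cstdio>
#include <vector>

// Hypothetical greedy page packing: walk operators in topological
// order and start a new fixed-size page whenever the current one
// cannot absorb the next operator.  Real tools would also weigh
// inter-page stream counts and routability.
struct Op { const char* name; int les; };

std::vector<std::vector<Op>> pack(const std::vector<Op>& topo_order,
                                  int page_les) {
    std::vector<std::vector<Op>> pages;
    int used = page_les;                    // force a first page
    for (const Op& op : topo_order) {
        if (used + op.les > page_les) {     // open a fresh page
            pages.push_back({});
            used = 0;
        }
        pages.back().push_back(op);
        used += op.les;
    }
    return pages;
}

int main() {
    // LE counts echo Figure 4: two 108-LE multipliers, one 32-LE add.
    std::vector<Op> ops = {{"mpyC1",108}, {"mpyC2",108}, {"add32",32}};
    auto pages = pack(ops, 128);            // 128-LE pages (assumed)
    for (size_t p = 0; p < pages.size(); ++p)
        for (const Op& op : pages[p])
            std::printf("page %zu: %s\n", p, op.name);
}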

Compute pages may contain internal state which must be saved and restored when the page is swapped onto or off of a physical compute page. Swapping may be necessary in a time-multiplexed implementation and is key to supporting the semantics of an unbounded number of compute pages.

Memory Segments and Configurable Memory Blocks

A memory segment is a contiguous block of memory which is managed as a single, atomic memory block for the purposes of swapping and relocation. A memory segment may be used in one of several modes (e.g. sequential read, random-access read-write, FIFO). When configured into a particular mode, a segment has logical stream ports to connect it to the graph of pages (e.g. streams for random access: address input, data input, data output, control input). Figure 5 shows an example graph connecting pages and segments.

Figure 5: Data Flow Computation Graph with both Compute Pages and Segments (virtual compute pages VCP0, VCP1, and VCP2 linked through Segment0 and Segment1)

To use a memory segment, the run-time system will map it into a configurable memory block (CMB). The CMB is a physical memory block inside the reconfigurable array (see, for example, Figure 11) with active stream links and interconnect to connect the memory segment into the active computation. In addition to holding user-specified segments, CMBs are also used to hold segments containing CP configurations, segments containing CP state, and segments associated with stream buffers. A single CMB may hold any number of each of these types of segments as long as their aggregate memory requirement does not exceed the CMB's capacity (see Figure 6 for a sample memory layout in a CMB). In our current vision, only a single such segment may actually be active in each CMB at any point in time, but there is nothing in the SCORE definition that prevents an implementation from being designed to handle multiple, active segments in the same CMB.

Physically Finite, Logically Unbounded Streams

Streams form the data flow links between pages. A page (CP or segment) indicates when it is producing a valid data output with an out-of-band data present bit. The valid data value with its associated presence bit is termed a token. The token is transported to the destination input of the consuming operator. The stream delivers all data items generated by the producer, in order, to the consumer, storing each until the consumer indicates it has consumed it from the head of its input queue (see Figure 7). The data presence tag in a token serves a similar role to a stall signal in a conventional virtual memory or cache architecture; that is, it lets the processing unit know if data is available and it can continue processing, or if the processing unit must wait for data to arrive.

When a stream is empty, the downstream operator will stall waiting for more input data. This discipline hides the detailed timing of operations from the programming model, guaranteeing correct behavior while allowing variations between implementations of the computing architecture.

Even at the run-time level, streams provide the abstraction of unbounded capacity links between producers and consumers.3 In practice, however, the streams are finite, with an implementation-dependent buffer capacity. To implement the semantics of unbounded, FIFO stream links, an implementation will use backpressure (see Figure 7) to stall production of data items, and the run-time system will allocate additional buffer space in FIFO segments as needed (see Figure 8 for an example of stream buffer expansion).

3 See Appendix A for a discussion of why unbounded buffers are necessary.

Physically, a virtual stream may be realized in one of two ways:

• When both the producer and the consumer of a virtual stream are loaded on the physical hardware, the stream link can be implemented as a spatial connection through the inter-page routing network between the two pages.4 (See Figure 9.)

• When one of the ends of the stream is not resident, the stream data can be sinked (or sourced) from a stream buffer segment active in some CMB on the component. (See Figure 10.)

This allows efficient, pipelined chaining of connected operators when space permits, as well as deep, intermediate data buffering when a computation must be sequentialized.

4 An implementation could choose to implement this link as a statically configured path as in FPGAs, a time-switched path, or even a dynamically routed path.
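The following C++ sketch (illustrative only; all names are invented) models how unbounded stream semantics can be layered over a finite hardware FIFO in the spirit of Figure 8: backpressure marks the FIFO full, and the runtime spills further tokens to a buffer segment while preserving token order:

#include <cstdio>
#include <deque>

// Hypothetical model of unbounded stream semantics built from a
// finite FIFO plus runtime buffer expansion (after Figure 8).
struct VirtualStream {
    static const size_t HW_CAP = 4;   // finite hardware FIFO depth
    std::deque<int> hw_fifo;          // on-array queue
    std::deque<int> spill;            // FIFO segment in a CMB

    bool backpressure() const { return hw_fifo.size() >= HW_CAP; }

    // Producer side: once the hardware FIFO asserts backpressure,
    // the runtime redirects tokens into a buffer segment; it keeps
    // spilling while the segment is non-empty to preserve order.
    void produce(int tok) {
        if (backpressure() || !spill.empty()) spill.push_back(tok);
        else                                  hw_fifo.push_back(tok);
    }

    // Consumer side: drain the hardware FIFO first, refilling it
    // from the spill segment so tokens arrive in order.
    bool consume(int* tok) {
        if (hw_fifo.empty()) return false;        // consumer stalls
        *tok = hw_fifo.front(); hw_fifo.pop_front();
        if (!spill.empty()) {
            hw_fifo.push_back(spill.front());
            spill.pop_front();
        }
        return true;
    }
};

int main() {
    VirtualStream s;
    for (int i = 0; i < 10; ++i) s.produce(i);   // producer never blocks
    for (int v; s.consume(&v); ) std::printf("%d ", v);  // 0 1 ... 9
}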

Hardware Virtualization

Compute pages, segments, and streams are the fundamental units for allocation, virtualization, and management of the hardware resources. At run time, an operating system manager schedules virtual pages and streams onto the available physical resources, including page assignment and migration and inter-page routing.

If there are enough physical resources, every page of a computation graph may be simultaneously loaded on the reconfigurable hardware, enabling maximum-speed, fully-spatial computation. Figure 9 shows this case for the video processing operator of Figure 3. If hardware resources are limited, a computation graph will be time-multiplexed onto the hardware. Streams between virtual pages that are not simultaneously loaded will be transparently buffered through on-chip memory. Figure 10 shows this case for the video processing operator. Each component operator is loaded into hardware in sequence, taking its input from one memory buffer and producing its output to another.
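A toy C++ sketch of such time-multiplexing (a hypothetical round-robin policy, not the actual SCORE scheduler) might cycle runnable virtual pages through the physical compute pages one timeslice at a time:

#include <cstdio>
#include <vector>

// Hypothetical round-robin page scheduler: with a fixed number of
// physical compute pages, resident virtual pages run for a
// timeslice; stalled pages are skipped.
struct VPage { const char* name; bool runnable; };

void schedule(std::vector<VPage>& vpages, int phys_pages, int slices) {
    size_t next = 0;
    for (int t = 0; t < slices; ++t) {
        std::printf("timeslice %d:", t);
        int loaded = 0;
        for (size_t i = 0; i < vpages.size() && loaded < phys_pages; ++i) {
            VPage& vp = vpages[(next + i) % vpages.size()];
            if (!vp.runnable) continue;     // skip stalled pages
            std::printf(" %s", vp.name);    // reconfigure and run
            ++loaded;
        }
        next = (next + loaded) % vpages.size();
        std::printf("\n");
    }
}

int main() {
    std::vector<VPage> g = {{"A",true},{"B",true},{"C",true}};
    schedule(g, 1, 3);   // one physical CP, as in Section 6
}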

    3.4 Model Implications

    3.4.1 Advice for Programmers

One goal of the compute model is, at a high level, to focus the developer on the style of computation which is efficient for the hardware and execution model. To better utilize scalable reconfigurable hardware, SCORE developers should:

• Describe computations as spatial pipelines with multiple, independent computational paths. A hardware implementation will attempt to concurrently execute as many of the specified, parallel paths as possible.

• Avoid or minimize feedback cycles. Cyclic dependencies introduce delays which cannot be pipelined away and hence increase the total run time or lead to page-thrashing in small hardware implementations.

• Expose large data streams to SCORE operators. Large data sets help amortize the overhead of loading computation into reconfigurable hardware, especially into small, time-multiplexed hardware implementations.

Figure 6: Segments and Other Data mapped onto a CMB (a sample CMB layout holding a configuration segment, two user segments, saved page state, a FIFO stream buffer, and unused space; the active segment descriptor carries a relocation base, bound, and access mode, e.g. Base: 0x01000, Bound: 0x02000, Mode: Read-only)

Figure 7: Stream Signals (data, data-present, and backpressure signals connect a source page through pipelined network stages to a sink page's input queue; a full queue asserts backpressure)

Figure 8: Expansion of Finite Stream Buffer to provide Unbounded Stream Buffer Semantics:
1. Op A sends data to Op B.
2. Op B becomes full.
3. Backpressure from B stops Op A sending data.
4. No data being sent.
5. Runtime allocates new segment to buffer stream.
6. Runtime disassembles direct connection A->B.
7. Runtime creates new links A->segment and segment->B.
8. Op A sends data to stream buffer segment.
9. When Op B can take data, it takes it from the segment.
10. Segment serves as large buffer between A and B.
11. Runtime may decide to replace Op A.
12. Op B may continue taking data from the stream buffer.

Figure 9: Fully Spatial Implementation of Video Processing Operator on Abstract SCORE Hardware

Figure 10: Capacity-Limited, Temporal Implementation of Video Processing Operator (each operator is swapped into hardware in turn, reading from and writing to CMB stream buffers)

    3.4.2 Generality

While we have described the SCORE hardware model here in terms of a single processor and homogeneous computational pages and memories, the model itself admits a number of extensions. SCORE can accommodate heterogeneous and specialized computational pages, as seen in Pleiades [26] and Cheops [4]. Using specialized pages most efficiently makes the scheduling problem more interesting, since some operators may run on multiple kinds of specialized pages. Also, there is nothing which prohibits SCORE from using multiple conventional processors for executing sequential operators and/or the run-time scheduler. Conventional techniques for multiprocessing and distributed scheduling would be relevant in this case.


4 Hardware Requirements

SCORE assumes a combination of a sequential processor and a reconfigurable device. The reconfigurable array must be divided into a number of equivalent and independent compute pages.5 Multiple, distributed memory blocks are required to store intermediate data, page state, and page configurations.

The interconnect among pages is critical to achieving high performance and supporting run-time page placement. It should support high bandwidth, low latency communication among compute pages and memory, allowing memory pages to be used concurrently. The interconnect must buffer and pipeline data as well as provide backpressure signals to stall upstream computation when network buffer capacity is exceeded. Routing resources should be sufficiently rich to facilitate rapid, online routing.

The compute pages themselves may use any reconfigurable fabric that supports rapid reconfiguration, with provision to save and restore array state quickly. The BRASS HSRA subarray design [30] is a feasible, concrete implementation for a compute page. It provides microsecond reconfiguration and high-speed, pipelined computation.

Each configurable memory block (CMB) is a self-contained unit with its own stream-based memory port and an address generator (see Figure 6). CMBs may be accessed independently and concurrently in a scalable system. The memory fabric may use external RAM or on-chip memory banks (e.g. BRASS Embedded DRAM [25]), with additional logic to tie into the data flow synchronization used by the interconnect network. The memory controllers need to support a simple, paged segment model including address relocation within a memory block and segment bounds. Streaming data support obviates the need for external addressing during reconfiguration and stream buffering.
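A minimal C++ sketch of the segment model a CMB controller must support (a hypothetical structure; the base/bound values echo the segment descriptor of Figure 6) shows relocation and bounds checking on each access:

#include <cstdint>
#include <cstdio>

// Hypothetical segment descriptor: a base for relocation within the
// memory block, a bound for protection, and an access mode.
struct SegmentDesc {
    uint32_t base;     // e.g. 0x01000
    uint32_t bound;    // e.g. 0x02000 (exclusive)
    bool     read_only;
};

// Translate a segment-relative offset to a physical CMB address;
// a bounds or mode violation blocks the access (here: returns false).
bool translate(const SegmentDesc& s, uint32_t offset, bool is_write,
               uint32_t* phys) {
    uint32_t addr = s.base + offset;
    if (addr >= s.bound || (is_write && s.read_only)) return false;
    *phys = addr;
    return true;
}

int main() {
    SegmentDesc seg = {0x01000, 0x02000, true};  // values from Figure 6
    uint32_t p;
    if (translate(seg, 0x0800, false, &p))
        std::printf("phys addr = 0x%05X\n", p);  // 0x01800
}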

The sequential processor plays an important part in the SCORE system. It runs the page scheduler needed to virtualize computation on the array, and it executes SCORE operators that would not run efficiently in a reconfigurable implementation. Consequently, the processor must be able to control and communicate with the array efficiently. A single-chip SCORE system (e.g. see Figure 11) integrating a processor, reconfigurable fabric, and memory blocks could provide tight, efficient coupling of components.

Although a single-chip SCORE implementation offers benefits for performance and design efficiency, the SCORE model permits a wide range of implementations, including one using conventional, commercial components.

5 In a degenerate case, there can be only one page, but this sacrifices many of the strengths of the SCORE model.

Figure 11: Hypothetical, single-chip SCORE system (a microprocessor (uP) with L1 instruction and data caches and an L2 cache, coupled through an on-chip network to an array of compute pages (CPs) and configurable memory blocks (CMBs))

    5 Language Instantiations

As a computational model, any number of languages which obey the SCORE semantics could be defined to describe SCORE computations. One could define subsets of conventional HDLs (e.g. Verilog, VHDL) with stylized input/output primitives to describe SCORE operators and operator composition. Similarly, one could define subsets of conventional programming languages (e.g. C++, Java) to perform these tasks. To focus on the necessary semantics, we have defined an intermediate register-transfer level (RTL) language to describe SCORE operators and their composition for our initial development work. We view our intermediate language, TDF, as a device-independent, assembly-language target on the way to architecture-specific executable operators.

    5.1 SCORE Language Requirements

As indicated by the semantics of the SCORE compute model, SCORE operators are synchronous, single-clock entities with their own state. Operators communicate only through designated I/O streams. Operation is gated by data presence on the I/O streams. As such, each operator can be viewed as a finite-state machine with associated data path (i.e. FSMD [12]). In a multithreaded language, such as Java or C++ with an appropriate thread package, a SCORE operator would be an independent thread which communicates with the rest of the program only through single-reader, single-writer I/O streams. Specifically, SCORE does not have a global, shared-memory abstraction among operators. An operator may own a chunk of the address space (a memory segment) during operation and return it after it has completed, but no two operators may simultaneously own a piece of memory.

fir4(param signed[8] w0, param signed[8] w1,
     param signed[8] w2, param signed[8] w3,
     // param's bound at instantiation time
     input unsigned[8] x,
     output unsigned[20] y)
{
  state only(x):   // ``fire'' when x present
  {
    y = w0*x + w1*x@1 + w2*x@2 + w3*x@3;
    // x@n notation picks out the nth previous
    // value for x on the input stream
    // (this notation is patterned after Silage)
    goto only;     // loop in this state
  }
}

Figure 12: TDF Specification of 4-TAP FIR (a static rate operator)

    5.2 TDF

TDF is basically an RTL description with special syntax for handling input and output data streams from the operator. Common data path operators can be described using a C-like syntax. For example, Figure 12 shows how an FIR computation might be implemented in TDF. Operators may have parameters whose values are bound at operator instantiation time; parameters are identified with the keyword param. In the FIR example, the coefficient weights are parameters; these are specified when the operator is created, and the values persist as long as the operator is used. The FIR defines a single input stream (x) and produces a single output stream (y). The behavior of the state is gated on the arrival of the next x input value, producing a new y output for each such input.
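For illustration, the x@n history semantics of fir4 can be mimicked in C++ with a small shift register of past input tokens (our own sketch, not generated TDF output):

#include <cstdio>
#include <initializer_list>

// C++ mimic of the fir4 operator's x@n semantics: x@n denotes the
// token received n firings earlier, so the operator keeps a small
// shift register of past inputs.
struct Fir4 {
    int w0, w1, w2, w3;        // param's, bound at instantiation
    int hist[3] = {0, 0, 0};   // x@1, x@2, x@3
    Fir4(int a, int b, int c, int d) : w0(a), w1(b), w2(c), w3(d) {}
    int fire(int x) {          // one firing of state "only"
        int y = w0*x + w1*hist[0] + w2*hist[1] + w3*hist[2];
        hist[2] = hist[1]; hist[1] = hist[0]; hist[0] = x;  // shift
        return y;
    }
};

int main() {
    Fir4 fir(1, 2, 3, 4);      // weights bound "at instantiation"
    for (int x : {1, 1, 1, 1})
        std::printf("%d ", fir.fire(x));   // 1 3 6 10
}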

To allow dynamic rate operators, the basic form of a behavioral TDF operator is that of a finite-state machine. Each state specifies the inputs which must be present before it can fire. Once the inputs arrive, the operator consumes the inputs, and the FSM may choose to change states based on the data consumed from the inputs. A simple merge operator is shown in Figure 13, demonstrating how the state machine can also be used to allow data-dependent consumption of input values. Output value production can be conditioned as shown in Figure 14. Together, these allow the user to specify arbitrary, deterministic, dynamic-rate operators.

N.B. The merge operator below has been simplified for illustration; it does not properly handle the end-of-stream condition.

signed[w] merge(param unsigned[6] w,
                // can use parameters to define data width
                input signed[w] a,
                input signed[w] b)
{
  signed[w] tmpA;
  signed[w] tmpB;
  // states used here to show dynamic data consumption
  state start(a,b):
  {
    tmpA=a; tmpB=b;
    if (tmpA<tmpB) { merge=tmpA; goto getA; }
    else           { merge=tmpB; goto getB; }
    // NOTE: the source text is truncated after ``if (tmpA''; the
    // states below are reconstructed from context (emit the smaller
    // head value, then refill that input).
  }
  state getA(a):   // consume only from a
  {
    tmpA=a;
    if (tmpA<tmpB) { merge=tmpA; goto getA; }
    else           { merge=tmpB; goto getB; }
  }
  state getB(b):   // consume only from b
  {
    tmpB=b;
    if (tmpA<tmpB) { merge=tmpA; goto getA; }
    else           { merge=tmpB; goto getB; }
  }
}

Figure 13: TDF Specification of merge Operator (a dynamic input rate operator)

// uniq behaves like the unix command of the same
// name; it filters an input stream, removing any
// adjacent, duplicate entries before passing them
// on to the output stream.
signed[w] uniq(param unsigned[6] w,
               input signed[w] x)
{
  signed[w] lastx;
  state start(x):
    { lastx=x; uniq=x; goto loop; }
  state loop(x):
  {
    if (x!=lastx)
      { lastx=x; uniq=x; }
    goto loop;
  }
}

Figure 14: TDF Specification of uniq Operator (a dynamic output rate operator)

merge3uniq(param unsigned[6] n,
           input signed[n] a,
           input signed[n] b,
           input signed[n] c,
           output signed[n] o)
{
  signed[n] t;
  t=merge(n,merge(n,a,b),c);
  o=uniq(n,t);
}

    Figure 15: TDF Compositional Operator

Of course, the FSM gives the user the semantic power to describe heavily sequential and complex, control-oriented operators. Nonetheless, the programmer should avoid sequentialization and complex control when possible, as operators with many states are less likely to use spatial computing resources efficiently.

Larger operators can be composed from smaller operators in a straightforward manner, as shown in Figure 15.

    5.3 C++ Integration and Composition

With a suitable stream implementation and interface code, SCORE operators can be instantiated by and used with a conventional, multithreaded programming language. Figure 16 shows an example C++ program which uses the merge and uniq operators defined here. Note that SCORE operator instantiation and composition can be performed from the C++ code. Once created, the SCORE operators behave as independently running threads, operating in parallel with the main C++ execution thread. In general, a SCORE operator will run until its input streams are closed or its output streams are freed.

Once primitive behavioral (or leaf) operators are defined (e.g. in TDF or some other suitable form) and compiled into their page-level implementation, large programs can be composed entirely in a programming language as shown here. If one thinks of TDF as a portable assembly language for critical computational building blocks, then this language binding allows a high-level language to compose these building blocks in much the same way that assembly-language kernels have been composed using high-level languages in order to efficiently program early DSPs and supercomputers. The instantiation parameters for TDF operators allow the definition of generic operators which can be highly customized to the needs of the application.
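The threading view described above can be sketched in portable C++ (hypothetical types, not the actual SCORE C++ binding): an operator is a thread that reads and writes single-reader, single-writer blocking streams and runs until its input stream is closed:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical single-reader, single-writer blocking stream.
template <typename T>
class BlockingStream {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
public:
    void write(T v) { { std::lock_guard<std::mutex> g(m); q.push(v); } cv.notify_one(); }
    void close()    { { std::lock_guard<std::mutex> g(m); closed = true; } cv.notify_one(); }
    bool read(T* v) {                        // false at end-of-stream
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&]{ return !q.empty() || closed; });
        if (q.empty()) return false;
        *v = q.front(); q.pop();
        return true;
    }
};

int main() {
    BlockingStream<int> in, out;
    // A "scale by 2" operator running as an independent thread;
    // it runs until its input stream is closed.
    std::thread op([&]{
        for (int v; in.read(&v); ) out.write(2 * v);
        out.close();
    });
    for (int i = 1; i <= 3; ++i) in.write(i);
    in.close();
    for (int v; out.read(&v); ) std::printf("%d ", v);  // 2 4 6
    op.join();
}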

    6 Execution Example

The following example demonstrates execution of the design in Figure 16. It shows array compute page reconfiguration, execution of scheduled behavioral code, and some fundamental control signals.

To ground this explanation to a particular hardware configuration and its constraints, we make the following assumptions about the reconfigurable array parameters and the TDF design in the user application:

• The design consists of three behavioral operators. Full implementation of each operator requires only one compute page.

• The reconfigurable array contains one compute page (CP) and three configurable memory blocks (CMBs).


#include "Score.h"
#include "merge.h"
#include "uniq.h"

int main()
{
  char data0[] = { 3, 5, 7, 7, 9 };
  char data1[] = { 2, 2, 6, 8, 10 };
  char data2[] = { 4, 7, 7, 10, 11 };
  // declare streams
  SIGNED_SCORE_STREAM i0,i1,i2,t1,t2,o;
  // create 8-bit wide input streams
  i0=NEW_SIGNED_SCORE_STREAM(8);
  i1=NEW_SIGNED_SCORE_STREAM(8);
  i2=NEW_SIGNED_SCORE_STREAM(8);
  // instantiate operators
  // note: instantiation passes parameters
  //       and streams to the SCORE operators
  t1=merge(8,i0,i1);
  t2=merge(8,t1,i2);
  o=uniq(8,t2);
  // alternately, we could use:
  //   new merge3uniq(8,i0,i1,i2,o);
  // write data into streams
  // (for demonstration purposes;
  //  real streams would be much longer
  //  and probably not come from main)
  for (int i = 0; i < 5; i++) {
    STREAM_WRITE(i0, data0[i]);
    STREAM_WRITE(i1, data1[i]);
    STREAM_WRITE(i2, data2[i]);
  }
  STREAM_CLOSE(i0); // close input
  STREAM_CLOSE(i1); // streams
  STREAM_CLOSE(i2);
  // output results
  // (for demonstration purposes only)
  for (int cnt=0; !STREAM_EOS(o); cnt++) {
    cout << STREAM_READ(o) << endl;
    // (the source text is truncated at ``cout''; this loop body is
    //  reconstructed on the assumption that it prints each output token)
  }
}

Figure 16: Example C++ program composing the merge and uniq operators

Table 1: Step-by-Step Execution Example

[The "Physical Array View" column of Table 1 consists of diagrams of CP0 and CMB0-CMB2, showing each CMB's active segment and mode, the CP's configuration and mode, stream FIFO connections, and buffer occupancy. Only the per-timeslice descriptions are reproduced here.]

Time I. Initially, assume that the contents of streams i0, i1, and i2 have been loaded by the main processor into segments CMB0 S0, CMB1 S0, and CMB0 S1, respectively. In addition, the configuration and initial state of pages (operators) A, B, and C have been loaded into segments CMB0 S2, CMB1 S2, and CMB2 S2, respectively.
Reconfiguration. Page A (merge) is scheduled to run for the first timeslice. First, CP0 is configured with the contents of CMB0 S2. Then the streams are set up between the CMBs and CP0.

Time II. Array Status. CP0 is running the behavioral code of operator A (merge). The CMB controller has set the appropriate active segment and operation mode for each CMB: CMB0 and CMB1 act as SeqSrc (sequential source) and CMB2 as SeqSink (sequential sink) relative to the connected streams. At this time, approximately half of the tokens have been consumed from sources CMB0 S0 and CMB1 S0 and sunk into CMB2 S1.

Time III. End of the first timeslice.
Array Status. All tokens from both sources CMB0 S0 and CMB1 S0 have been consumed by CP0 and sunk into CMB2 S1. If the source node of a stream is not producing any tokens (e.g. the now-empty segment CMB0 S0), a sink node can stall due to the unavailability of input tokens (e.g. CP0 is stalled, since operator A requires a token on at least one input to fire). Such streams are identified as EMPTY; the scheduler uses the information about EMPTY streams to optimize the schedule for the next timeslice.
Reconfiguration. CP0 reconfiguration consists of two logically sequential steps, which could be parallelized if the array implementation permits:
1. Save the current configuration and state of CP0 in CMB0 S2, which is allocated for A.
2. Load the configuration and state for page B into CP0 from CMB1 S2.
After CP0 has been configured, the streams are created between compute nodes, and CP0 is ready to run.

Time IV. Array Status. CP0 is running the behavioral code of operator B (merge). Approximately half of the tokens in CMB0 S1 and CMB2 S1 have been consumed by CP0 and sunk into CMB1 S1.

Time V. End of the second timeslice.
Array Status. All tokens from both sources CMB0 S1 and CMB2 S1 have been consumed by CP0 and sunk into CMB1 S1. If the sink node of a stream is not consuming tokens (e.g. the 100% full segment CMB1 S1), a source node can stall on a stream write. Such streams are identified as FULL; the scheduler uses the information about FULL streams to optimize the schedule for the next timeslice.
Reconfiguration. The two main steps of reconfiguration are similar to those at time III. CP0 is loaded with the configuration and state of page C (uniq).

Time VI. Array Status. CP0 is running the behavioral code of operator C (uniq). Approximately half of the tokens in CMB1 S1 have been consumed by CP0 and sunk into CMB2 S0.

Time VII. End of the third timeslice.
Array Status. All tokens from CMB1 S1 have been consumed by CP0 and sunk into CMB2 S0.
Reconfiguration. The current configuration and state of CP0 are saved in CMB2 S2. Note: for the application demonstrated here, saving the configuration and state of CP0 was not actually necessary. A, B, and C were each scheduled only once, so after each one runs on CP0 for a timeslice, its state is no longer needed; the save is shown for completeness only. Should any page be scheduled to run in several non-consecutive timeslices, its state must be saved every time it is preempted and restored when it is next scheduled. This is analogous to context switching in traditional operating systems.

Figure 17: Timeline for Execution Example. [The figure shows timeslices I-VII along a time axis, with the page-execution intervals Run A, Run B, and Run C separated by reconfiguration intervals.]
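The save/restore sequence of times III, V, and VII is easy to express in ordinary code. The following is a minimal sketch, not the SCORE implementation; the types and names (CMBSegment, ComputePage, reconfigure) are hypothetical stand-ins for a page image swapped between a CP and a CMB segment.

    #include <cstdint>
    #include <vector>

    // Hypothetical model of one CMB segment: a raw image holding either
    // buffered stream tokens or a page's configuration-plus-state.
    struct CMBSegment { std::vector<uint8_t> bits; };

    // Hypothetical compute page whose configuration and FSM state can be
    // swapped in and out as a single image, as in Table 1.
    struct ComputePage { std::vector<uint8_t> image; };

    // The two-step context switch: save the outgoing page's image,
    // then load the incoming page's image.
    void reconfigure(ComputePage& cp, CMBSegment& saveSeg, const CMBSegment& loadSeg) {
        saveSeg.bits = cp.image;   // step 1: save current configuration and state
        cp.image     = loadSeg.bits; // step 2: load next page's configuration and state
    }

    int main() {
        ComputePage cp0;
        CMBSegment a, b, c;        // segments holding the images of pages A, B, C
        cp0.image = a.bits;        // time I:   initial load of page A
        reconfigure(cp0, a, b);    // time III: save A, load B
        reconfigure(cp0, b, c);    // time V:   save B, load C
        return 0;
    }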

to produce master files (merge.cc and uniq.cc) and also parameterized instance files (merge_8.cc and uniq_8.cc). The next step is to compile all C++ sources, including the driver code in main.C and the tdfc-generated sources. The build process terminates when the driver code is linked with the master files to produce the user application executable (a.out), and the instance objects are linked with the run-time system libraries to produce dynamically linked shared object libraries (merge_8.so and uniq_8.so) containing the instance code. The purpose of this process should become clear in the next section as we describe the SCORE run-time environment in more detail.

    7.2 Run-time Environment

The run-time system consists of the scheduler and the simulator processes that execute under Linux, as shown in Figure 19. In a real system, the OS kernel will contain the scheduler, and a reconfigurable hardware array will replace the simulator. These components are connected by a pair of streams that permit bidirectional communication, carrying scheduler commands to the array and resource state back from it. The scheduler consists of instantiation and scheduling engines.

Instantiation engine. Being an independent process, the scheduler has no knowledge of user applications' compute graphs. The run-time system, together with the shared object files built with a user application, provides a way to communicate the structure of compute graphs from a user application to the scheduler:

1. Upon invocation, a user application places a series of requests to the scheduler to instantiate its compute graph nodes. This is accomplished by the code in the master files, produced by tdfc and linked into the user executable. The code contains a sequence of operations to connect to the scheduler through an IPC channel and request instantiation of an operator. For example, in Figure 19 the code in the invoked merge() routine requests instantiation of the merge operator with inputs t1 and i2.

2. With the request, the scheduler receives a pointer to the shared object file which contains the behavioral code and the attributes of a parameterized instance of an operator. The run-time system dynamically links with that shared object file (here, merge_8.so), and the scheduler instantiates the operator and places it on a waiting list to be scheduled. Note that the shared object is necessary here in order to get the user's application code loaded into the address space of the scheduler, which, of course, was built without any knowledge of the user code it might be asked to run.
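This dynamic-linking step can be illustrated with the standard POSIX dlopen/dlsym interface. The sketch below shows only the mechanism; the entry-point symbol name instantiate_merge_8 is hypothetical, not the actual symbol exported by tdfc-generated shared objects.

    #include <dlfcn.h>
    #include <cstdio>

    int main() {
        // Load the instance shared object into this process's address space.
        // (Build with -ldl on Linux.)
        void* handle = dlopen("./merge_8.so", RTLD_NOW);
        if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

        // Look up the operator's behavioral entry point (hypothetical name).
        using InstantiateFn = void (*)();
        auto instantiate =
            reinterpret_cast<InstantiateFn>(dlsym(handle, "instantiate_merge_8"));
        if (instantiate) instantiate();  // e.g. place the operator on the waiting list

        dlclose(handle);
        return 0;
    }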

The array simulator executes the behavioral code for each resident compute node.

Scheduling engine. The scheduling engine is invoked every timeslice and is responsible for resource allocation and utilization, placement, and routing on the array. It acts as a resource manager capable of enforcing a variety of policies, from fair sharing of compute resources among multiple user applications to favoring a particular application to meet its real-time constraints.
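As a rough illustration of such a timeslice policy, the sketch below moves operators between a waiting list and a resident list each timeslice (the names waitList and residentList echo Figure 19). The selection heuristic shown, skipping operators whose streams were flagged EMPTY or FULL, is a simplification of our own, not the actual SCORE policy.

    #include <cstddef>
    #include <deque>
    #include <vector>

    struct Operator {
        bool inputsEmpty;   // all input streams flagged EMPTY last timeslice
        bool outputsFull;   // some output stream flagged FULL last timeslice
    };

    // One scheduling decision: pick up to numCPs runnable operators.
    std::vector<Operator*> schedule_timeslice(std::deque<Operator*>& waitList,
                                              std::size_t numCPs) {
        std::vector<Operator*> residentList;
        for (std::size_t n = waitList.size();
             n > 0 && residentList.size() < numCPs; --n) {
            Operator* op = waitList.front();
            waitList.pop_front();
            if (op->inputsEmpty || op->outputsFull)
                waitList.push_back(op);      // cannot make progress; retry later
            else
                residentList.push_back(op);  // load onto a physical CP this timeslice
        }
        return residentList;
    }

    int main() {
        Operator a{false, false}, b{true, false};
        std::deque<Operator*> waitList{&a, &b};
        auto resident = schedule_timeslice(waitList, 1);  // selects a; b keeps waiting
        return static_cast<int>(resident.size());
    }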

Array simulator. The simulator provides a cycle-accurate simulation by executing compute-node behavioral code found in the corresponding dynamically linked shared object files (e.g. merge_8.so). As noted earlier, it communicates with the scheduler through a pair of streams. Implemented using shared memory, streams also provide direct communication between a user application and the array simulator. In the example in Figure 16 these streams are i0, i1, i2, and o.

    8 Example: JPEG

As described in the previous section, we have implemented a complete SCORE run-time system and simulator on top of Linux and are beginning to develop several applications to guide our further understanding of critical design issues for these systems. As an early exercise and demonstration vehicle, we have implemented a complete JPEG (Joint Photographic Experts Group) image compression algorithm [33] in TDF and C++ and performed basic scaling experiments in which we vary the number of computational pages in the system.

    8.1 Application

The JPEG compressor mathematically decomposes the input data into high- and low-frequency components. The image is first segmented into 8x8 pixel blocks, and the decomposition is performed on every individual block via the DCT (Discrete Cosine Transform), a unitary transform that takes the pixel block as input and returns another 8x8 block of coefficients, most of which are close to zero. The coefficients are then scalar quantized and scanned into a one-dimensional stream via a zigzag scan. Quantized coefficients are subsequently compacted with zero-length encoding, after which runs and lengths are Huffman encoded. (See Figure 20.)
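Composed with the same stream API as the example of Figure 16, the pipeline could be instantiated along these lines. The operator names (dct2d, quantzle, huff) and their parameter lists are hypothetical stand-ins for the actual TDF operators of Figure 20, and the memory-segment operators are omitted.

    #include "Score.h"
    // Hypothetical headers for the JPEG operators of Figure 20.
    #include "dct2d.h"
    #include "quantzle.h"
    #include "huff.h"

    int main() {
        SIGNED_SCORE_STREAM pixels, coeffs, runs, bits;
        pixels = NEW_SIGNED_SCORE_STREAM(8);  // 8-bit image samples in

        coeffs = dct2d(8, pixels);            // 2D-DCT on 8x8 blocks
        runs   = quantzle(8, coeffs);         // quantize + zero-length encode
        bits   = huff(8, runs);               // Huffman-code runs and lengths
        // ... feed pixels with STREAM_WRITE and drain bits as in Figure 16
    }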

Our TDF implementation uses 13 512-LUT pages to realize a fully spatial JPEG compressor capable of processing one image sample per cycle.


Figure 19: SCORE Run-Time Structure and Interaction with the User Application. [The figure shows three Linux user processes: the user application (a.out), the scheduler with its waitList and residentList, and the array simulator with CP0 and CMB0-CMB2. Calls such as merge() and uniq() in the application send instantiate requests to the scheduler over IPC; the instance shared objects (merge_8.so, uniq_8.so) are linked in at run time; the scheduler issues commands (start/stop CP, start/stop CMB, transfer CMB to CP, ...) and receives array status; shared-memory streams connect the application directly to the simulated array.]

Figure 20: JPEG Data Flow including Page and Segment Decomposition. [The pipeline comprises 2D-DCT (4 pages), QuantZLE (3 pages), MixToHuff (1 page), BitstreamAssembly (2 pages), a muxer (1 page), and writeseg (2 pages), together with read-only and read/write memory segments. N.B. large numbers on computational blocks indicate the number of 512-LUT pages required.]


Figure 18: Build Process for User Application from Fig. 16. [tdfc compiles merge.tdf and uniq.tdf into master files (merge.cc, uniq.cc) and parameterized instance files (merge_8.cc, uniq_8.cc); c++ compiles these along with main.C; ld links main.o with the master objects into a.out, while ld -shared links the instance objects into merge_8.so and uniq_8.so. Each operator is described in a separate TDF source file. Tools used are tdfc (the TDF compiler), c++ (a standard C++ compiler), and ld (a standard linker capable of building both stand-alone executables and shared dynamically linked libraries).]

    Simulator Parameter                    Value Assumed
    Reconfiguration Time                   5,000 cycles
    Schedule Time Slice                    250,000 cycles
    Compute Page (CP) size                 512 LUTs
    Configurable Memory Block (CMB) size   2 Mbits
    External Memory Bandwidth              2 GB/s

Table 2: System Parameters for Experiment.
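For reference, these assumptions could be collected into a simulator configuration record along the following lines. This is a hypothetical sketch of our own; the struct and field names are not the simulator's actual interface.

    // Hypothetical simulator configuration mirroring Table 2.
    struct SimParams {
        unsigned reconfigCycles  = 5000;    // reconfiguration time (cycles)
        unsigned timesliceCycles = 250000;  // schedule time slice (cycles)
        unsigned cpSizeLUTs      = 512;     // compute page size (LUTs)
        unsigned cmbSizeMbits    = 2;       // CMB size (Mbits)
        double   extMemGBps      = 2.0;     // external memory bandwidth (GB/s)
    };

    int main() {
        SimParams params;                   // defaults match Table 2
        return params.cpSizeLUTs == 512 ? 0 : 1;
    }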

For smaller hardware, the SCORE scheduler automatically manages, at run time, the reconfiguration necessary to share the physical CPs among the 13 virtual CPs.

    8.2 System Assumptions

For these experiments, we assume a single-chip system as described in Section 4, with external memory as needed for the application. Table 2 summarizes the parameters we assume for the system, based on our experience with the HSRA [30] and embedded DRAM memory [25]. For these experiments, page decomposition is performed manually. The scheduler is list based and operates in a time-sliced fashion like a conventional operating-system scheduler; it makes all decisions about where to place CPs and CMBs and manages all reconfiguration and data transfer, including the data movement on and off the component necessitated by the finite on-chip memory capacity. We assume scheduling time is overlapped with computation and takes 50,000 cycles. We do not currently model any limitations on routability among pages. The simulator accounts for all time required to reconfigure pages, store state, and transfer data between memories on the chip.

    8.3 Results

To study the scalability, performance, and efficiency of SCORE, we ran our JPEG implementation on a series of simulated, architecture-compatible SCORE systems with varying numbers of physical compute pages. Figure 21 plots the total run time (makespan) of each system versus the number of physical pages in that system. In this particular experiment we do not scale memory, so the results shown reflect (1) a fixed memory of 26 CMBs, and (2) unlimited memory. For comparison, we show a native x86-MMX implementation using Intel's reference ijpeg library.

The curves demonstrate that SCORE can automatically run the JPEG application on less hardware with graceful performance degradation. Thus SCORE can automatically realize an area-time performance tradeoff. Further, the curves show that this application can be automatically virtualized onto half the compute pages of the fully spatial implementation without incurring a substantial performance penalty. This kind of result is common when the load on compute operators varies widely; the lightly loaded operators can time-share a compute page without increasing overall runtime.

Figure 21: JPEG CP vs. Makespan. [The plot, titled "JPEG Encode Performance," shows total time (makespan, in millions of cycles, 0-25) versus physical compute pages (0-14), comparing a Pentium II MMX reference against SCORE with 26 CMBs and SCORE with unlimited CMBs.]

The experiment exhibits some anomalies of our present scheduler. The CP-makespan curves are not strictly monotonic due to heuristics in the list-based page selection approach. Also, the scheduler is not optimized to minimize memory usage while buffering streams. In fact, it is not possible to scale down the number of CMBs together with the physical CPs on very small hardware, because there would not be enough CMBs to virtualize the streams of the presently loaded pages. Hence, this experiment assumes a fixed memory availability of 26 CMBs (twice the number of CPs in the application). To factor out the effect of unoptimized stream buffering, we also performed the experiment with unlimited memory. The results exhibit a speedup of up to twofold over the limited-memory case, suggesting that there is room for improvement in scheduling and memory management.

This experiment represents a single set of SCORE system parameters. As ongoing work, we are exploring many system parameters to gain insight into the regions of operation where SCORE scheduling is most robust and to determine the parameters that provide the most efficient and balanced system design. Such parameters include compute page size, page I/O bandwidth, memory block size, and reconfiguration times.

    9 Summary

Reconfigurable computation, defined simply as computation performed on a collection of FPGA or FPGA-like hardware, has shown remarkable promise on point applications but has not achieved wide-spread acceptance and usage. One must make a large commitment to a particular FPGA-based system to develop an application. However, as we can now readily predict, the industry produces newer, larger, and faster hardware at a steady pace. Unfortunately, without a unifying computational model which transcends the particular FPGA implementation on which the application is first developed, one is stuck redoing significant work to port the application to newer hardware. This is particularly onerous when the established alternative technology, the microprocessor, offers users steady performance improvements with little or no time investment to adapt to new hardware.

Overcoming this liability requires a computational model which abstracts computational resources, allowing application performance to scale automatically, adapting to new hardware as it becomes available. The computational model must expose and exploit the strengths of reconfigurable hardware and help users understand how to optimize applications for reconfigurable execution. Further, the computational model must allow programs to deal efficiently with dynamic and unbounded resource requirements and dynamic program characteristics. Finally, the model must support the efficient composition of solutions from abstract building blocks.

In this paper, we have introduced a particular computational model which attempts to address these needs. SCORE uses a paging model to virtualize all hardware resources, including computation, storage, and communication. It allows dynamic instantiation of dynamically sized computational operators and supports dynamic-rate applications. A page partitioner and compiler, along with a run-time scheduler, take care of automatically mapping the unbounded and dynamically unfolding computational graph onto the fixed resources of a particular hardware platform. We have outlined the hardware requirements for such a model as well as the kind of programming languages needed to describe and integrate SCORE computations. We have implemented a complete SCORE run-time system and simulator. Initial experiments suggest that we can achieve the desired scalability on sample applications. With this initial success, we are now attempting to broaden the range of applications, automate more of the SCORE tool flow, and systematically explore the design space for SCORE-compatible architectures.

    Acknowledgements

This research is part of the Berkeley Reconfigurable Architectures, Software, and Systems (BRASS) effort supported by the Defense Advanced Research Projects Agency under contract number DABT63-C-0048 and by the California MICRO Program.

    References

[1] Arthur Abnous and Jan Rabaey. Ultra-Low-Power Domain-Specific Multimedia Processors. In Proceedings of the IEEE VLSI Signal Processing Workshop (VSP'96), October 1996.

[2] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020. APEX Device Family, March 1999.

[3] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Software Synthesis from Dataflow Graphs, chapter Synchronous Dataflow. Kluwer Academic Publishers, 1996.

[4] Vincent Michael Bove, Jr. and John A. Watlington. Cheops: A Reconfigurable Data-Flow System for Video Processing. IEEE Transactions on Circuits and Systems for Video Technology, 5(2):140-149, April 1995.

[5] Gordon Brebner. A Virtual Hardware Operating System for the Xilinx XC6200. In Proceedings of the 6th International Workshop on Field-Programmable Logic and Applications (FPL'96), pages 327-336, 1996.

[6] Gordon Brebner. The Swappable Logic Unit: a Paradigm for Virtual Hardware. In Proceedings of the 5th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'97), pages 77-86, April 1997.

[7] Joseph T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model. PhD thesis, University of California, Berkeley, 1993. ERL Technical Report 93/69.

[8] Stephen P. Crago, Brian Schott, and Robert Parker. SLAAC: a Distributed Architecture for Adaptive Computing. In Proceedings of the 1998 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'98), pages 286-287, April 1998.

[9] David E. Culler, Seth C. Goldstein, Klaus E. Schauser, and Thorsten von Eicken. TAM — A Compiler Controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing, June 1993.

[10] Jack B. Dennis. Data Flow Supercomputers. Computer, 13:48-56, November 1980.

[11] Jack B. Dennis and David P. Misunas. A Preliminary Architecture for a Basic Data-Flow Processor. In Proceedings of the 2nd Annual Symposium on Computer Architecture, January 1975.

[12] Daniel Gajski and Loganath Ramachandran. Introduction to High-Level Synthesis. IEEE Design and Test of Computers, 11(4):44-54, 1994.

[13] Maya Gokhale, Janice Stone, Jeff Arnold, and Mirek Kalinowski. Stream-Oriented FPGA Computing in the Streams-C High Level Language. In Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'00), April 2000.

[14] Seth C. Goldstein, Herman Schmit, Matthew Moe, Mihai Budiu, Srihari Cadambi, R. Reed Taylor, and Ronald Laufer. PipeRench: a Coprocessor for Streaming Multimedia Acceleration. In Proceedings of the 26th International Symposium on Computer Architecture (ISCA'99), pages 28-39, May 1999.

[15] Scott Hauck, Thomas Fry, Matthew Hosler, and Jeffery Kao. The Chimaera Reconfigurable Functional Unit. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 87-96, April 1997.

[16] John R. Hauser and John Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In Proceedings of the IEEE Symposium on Field-Programmable Gate Arrays for Custom Computing Machines, pages 12-21, April 1997.

[17] C. A. R. Hoare. Communicating Sequential Processes. International Series in Computer Science. Prentice-Hall, 1985.

[18] R. A. Iannucci. Toward a Dataflow/Von Neumann Hybrid Architecture. In Proceedings of the 15th International Symposium on Computer Architecture, pages 131-140, May 1988.

[19] Jeffery A. Jacob and Paul Chow. Memory Interfacing and Instruction Specification for Reconfigurable Processors. In Proceedings of the 1999 International Symposium on Field Programmable Gate Arrays (FPGA'99), pages 145-154, February 1999.

[20] Mark Jones, Luke Scharf, Jonathan Scott, Chris Twaddle, Matthew Yaconis, Kuan Yao, and Peter Athanas. Implementing an API for Distributed Adaptive Computing Systems. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'99), pages 222-230, April 1999.

[21] Edward A. Lee. Advanced Topics in Dataflow Computing, chapter Static Scheduling of Data-Flow Programs for DSP. Prentice Hall, 1991.

[22] X. P. Ling and H. Amano. WASMII: a Data Driven Computer on a Virtual Hardware. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM'93), pages 33-42, April 1993.

[23] Bruce Newgard. Signal Processing with Xilinx FPGAs, June 1996.

[24] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: a Model of Computation for Intelligent Memory. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA'98), June 1998.

[25] Stylianos Perissakis, Yangsung Joo, Jinhong Ahn, André DeHon, and John Wawrzynek. Embedded DRAM for a Reconfigurable Array. In Proceedings of the 1999 Symposium on VLSI Circuits, June 1999.

[26] Jan Rabaey. Reconfigurable Computing: The Solution to Low Power Programmable DSP. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), April 1997.

[27] Rahul Razdan and Michael D. Smith. A High-Performance Microarchitecture with Hardware-Programmable Functional Units. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 172-180, November 1994.

[28] Paul T. Sasaki. A Fast FPGA (FFPGA) Using Active Interconnect. In Proceedings of the 1998 International Symposium on Field-Programmable Gate Arrays (FPGA'98), page 255, February 1998.

[29] Edward Tau, Ian Eslick, Derrick Chen, Jeremy Brown, and André DeHon. A First Generation DPGA Implementation. In Proceedings of the Third Canadian Workshop on Field-Programmable Devices, pages 138-143, May 1995.

[30] William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani, Varghese George, John Wawrzynek, and André DeHon. HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array. In Proceedings of the International Symposium on Field Programmable Gate Arrays, pages 125-134, February 1999.

[31] John Villasenor, Chris Jones, and Brian Schoner. Video Communications using Rapidly Reconfigurable Hardware. IEEE Transactions on Circuits and Systems for Video Technology, 5:565-567, December 1995.

[32] John Villasenor, Brian Schoner, Kang-Ngee Chia, and Charles Zapata. Configurable Computer Solutions for Automatic Target Recognition. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 70-79, April 1996.

[33] Gregory K. Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, 34(4):30-44, April 1991.

[34] John A. Watlington. MagicEight: An Architecture for Media Processing and an Implementation. Thesis proposal, MIT Media Laboratory, January 1999.

[35] John A. Watlington and V. Michael Bove, Jr. A System for Parallel Media Processing. In Proceedings of the Workshop on Parallel Processing in Multimedia, April 1997.

[36] Michael J. Wirthlin and Brad L. Hutchings. DISC: the Dynamic Instruction Set Computer. In Proceedings of the SPIE Reconfigurable Computing Conference: Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, pages 92-103, October 1995.

[37] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. Virtex Series FPGAs, 1999.

[38] Peixin Zhong, Margaret Martonosi, Pranav Ashar, and Sharad Malik. Accelerating Boolean Satisfiability with Configurable Hardware. In Proceedings of the 1998 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'98), pages 186-195, April 1998.

Figure 22: switch-select Example Motivating Unbounded Stream Buffers. [A switch node routes its Data-In to a True-Side or False-Side output under a Switch-Control stream; a select node reads from one of the two sides under an independent Select-Control stream and forwards the chosen token to Data-Out.]

    A Unbounded Stream Buffers

All levels of the SCORE computational model provide the abstraction of unbounded stream buffers. Picking any finite buffer size for streams would introduce an artifact into the model which is very difficult to reason about. In particular, programs would be prone to deadlock whenever the number of tokens which the application needed to queue up on a stream between a pair of operators was data dependent.

The canonical instance of this deadlock hazard is exemplified by a pair of nodes, switch and select. switch takes two inputs, a boolean control stream and a data stream, and sends each data token along one of two output streams according to the value of the corresponding control input. select takes three inputs, a control stream and two data streams. However, it does not read all three tokens on each cycle. Rather, it first reads the control token; based on the value of the control token, it then reads from one of the two input streams and passes that token along to its single output stream.

These two nodes can now be hooked up directly to each other, with the two outputs of the switch node connected to the two inputs of the select node as shown in Figure 22. We provide separate control streams for the switch and select nodes. Now, if there is ever a prefix of the switch-control stream which contains n more TRUEs than FALSEs (or vice versa) relative to the select-control stream, the stream between the TRUE sides of the switch and select nodes will have to hold n tokens. If the streams were limited to some fixed-size buffer m and m < n, this subgraph would deadlock. Without loss of generality, consider the case in which the switch node receives n TRUEs followed by one FALSE, while the select node initially receives one FALSE control token followed by n TRUEs. The TRUE-side stream would fill up with m tokens. The switch node would not be able to perform any more operations because it cannot write data onto the TRUE-side stream. The select node, however, must process a token from the FALSE side in order to continue, but there are no tokens on the FALSE side to consume. The select node cannot make forward progress until it is given a token on the FALSE side, and the switch node cannot make any progress until the downstream operator (the select) consumes a token on the TRUE side. The two operators are now deadlocked on a cyclic dependence. Note that if m > n (or m is unbounded), this deadlock does not occur and processing can proceed.
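This argument can be checked with a small simulation. The following self-contained sketch is illustrative only, not part of the SCORE tools; the choices n = 5, m = 3, and the round-robin firing order are arbitrary assumptions. With buffer capacity m < n, both nodes stop firing once the TRUE-side buffer fills.

    #include <cstddef>
    #include <cstdio>
    #include <queue>
    #include <vector>

    int main() {
        const int n = 5;
        const std::size_t m = 3;                    // m < n forces deadlock
        std::vector<bool> swCtl, selCtl;
        for (int i = 0; i < n; ++i) swCtl.push_back(true);
        swCtl.push_back(false);                     // switch: n TRUEs, then FALSE
        selCtl.push_back(false);                    // select: FALSE, then n TRUEs
        for (int i = 0; i < n; ++i) selCtl.push_back(true);

        std::queue<int> trueSide, falseSide;        // bounded streams, capacity m
        std::size_t swPos = 0, selPos = 0;
        int data = 0;
        bool progress = true;
        while (progress) {
            progress = false;
            // switch fires if it has a control token and room on the chosen side
            if (swPos < swCtl.size()) {
                std::queue<int>& out = swCtl[swPos] ? trueSide : falseSide;
                if (out.size() < m) { out.push(data++); ++swPos; progress = true; }
            }
            // select fires if it has a control token and a token on the chosen side
            if (selPos < selCtl.size()) {
                std::queue<int>& in = selCtl[selPos] ? trueSide : falseSide;
                if (!in.empty()) { in.pop(); ++selPos; progress = true; }
            }
        }
        // Reports: switch stuck at 3, select stuck at 0, TRUE side holds 3 of 3.
        std::printf("deadlocked: switch at %zu, select at %zu, TRUE side holds %zu of %zu\n",
                    swPos, selPos, trueSide.size(), m);
        return 0;
    }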

Since, in general, these control streams can be completely independent, it is not possible to say that they will have any particular property between them. If these control streams come from outside the system, we certainly have no control over or knowledge of their relationship. Even if they are generated inside the system, the general question of whether a given computation produces a particular token value after a finite number of operations is equivalent to the halting problem.

Therefore, in order to provide reasonable semantics to the programmer, we accept the unbounded buffer-size abstraction and include support in the execution model to expand finite buffers as necessary to meet this abstraction (up to the limit of the amount of memory available in the system).

    B SCORE Compute Model

Section 3.1 described the compute model informally. This section defines it more precisely.

    B.1 SCORE

A SCORE computation is a graph G:

    G = {V, E}
    E = {e1, e2, ...}, where each ei is an SFIFO
    V = Vf ∪ Vt ∪ Vi ∪ Vo, where
        vi ∈ Vf is an SFSM
        vi ∈ Vt is an STM
        vi ∈ Vi is a SIN
        vi ∈ Vo is a SOUT

Notes:

• G will typically be initialized with at least one node vs ∈ Vt to start the computation.

• G may start with many nodes in V.

    B.2 SFIFO

An SFIFO is a two-ended, unbounded FIFO which may be created, closed, and freed.

    e = {psrc, psink, Q, EOS, FREE}
    EOS   = Boolean
    FREE  = Boolean
    psrc  is a PORT
    psink is a PORT
    Q     is a QUEUE

Operations (v ∈ V denotes the vertex from which the operation is invoked):

    Op            Requirement                       Action
    write(e,t)    e.psrc ∈ outs(v), t ∈ Tdata,      Q→add(t)
                  ¬Q→eos()
    close(e)      e.psrc ∈ outs(v), ¬Q→eos()        Q→add(TEOS)
    t=present(e)  e.psink ∈ ins(v), FREE=false      t = ¬Q→empty()
    t=read(e)     e.psink ∈ ins(v), FREE=false,     t = Q→rm();
                  EOS=false                         if (t ≡ TEOS) EOS=true
    t=eos(e)      e.psink ∈ ins(v), FREE=false      t = EOS
    free(e)       e.psink ∈ ins(v), FREE=false      FREE=true

Notes:

• When an SFIFO is both closed and freed, it is removed from the SCORE graph: (e.Q→eos() ∧ e.FREE) ⇒ E = E − {e}.

    B.3 QUEUE

A QUEUE is an unbounded queue.

    Q.data = ordered list of T = {q0, q1, q2, ..., qn−1}
           = {} when first created
    T      = Tdata ∪ {TEOS}
    Tdata  = finite alphabet

    Q→empty() ≡ value = (|Q.data| ≡ 0)
    Q→add(t)  ≡ Q.data = {q0, q1, q2, ..., qn−1, qn = t}
    Q→rm()    ≡ if ¬Q→empty(): value = q0; Q.data = {q1, q2, ..., qn−1}
                if Q→empty():  ERROR
    Q→eos()   ≡ value = (qn−1 ≡ TEOS)
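For concreteness, the QUEUE and SFIFO semantics above translate directly into code. The following is a minimal sketch under the simplifying assumptions that tokens are integers and TEOS is a reserved sentinel value.

    #include <cassert>
    #include <deque>

    const int TEOS = -1;                  // reserved end-of-stream token (assumption)

    struct Queue {                        // unbounded QUEUE of Appendix B.3
        std::deque<int> data;
        bool empty() const { return data.empty(); }
        void add(int t)    { data.push_back(t); }
        int  rm()          { assert(!empty());
                             int t = data.front(); data.pop_front(); return t; }
        bool eos() const   { return !empty() && data.back() == TEOS; }
    };

    struct SFIFO {                        // SFIFO of Appendix B.2
        Queue q;
        bool EOS = false, FREE = false;
        void write(int t) { assert(!q.eos() && t != TEOS); q.add(t); }
        void close()      { assert(!q.eos()); q.add(TEOS); }
        bool present()    { assert(!FREE); return !q.empty(); }
        int  read()       { assert(!FREE && !EOS);
                            int t = q.rm(); if (t == TEOS) EOS = true; return t; }
        bool eos()        { assert(!FREE); return EOS; }
        void free()       { FREE = true; }
    };

    int main() {
        SFIFO e;
        e.write(7); e.close();
        while (!e.eos()) e.read();        // consume 7, then the TEOS marker
        e.free();                         // now both closed and freed
        return 0;
    }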

    B.4 SFSM

An SFSM is an FSM with stream (SFIFO) I/O operations.

    vf = {sc, sd, d, Sc, Sd, Sres, Pin, Pout, B}
    Sc = {s1, s2, ..., sn}
    Sd = Sres × Slocal (also finite)
    sc ∈ Sc
    sd ∈ Sd
    d  ∈ Sc
    B  = {b1, b2, ..., bn}
    bi = {Ii, Ai, Fci, Fdi}
    Ii ⊂ Pin
    Ai = {ai,0, ai,1, ..., ai,mi}
    ai,j ∈ {fi,j, wi,j, ci,j}
    csd = current data state ∈ Sd
    wi,j = if (gi,j(csd)) write(pout ∈ Pout, vi,j(csd))
    ci,j = if (gi,j(csd)) close(pout ∈ Pout)
    fi,j = if (gi,j(csd)) free(p ∈ Pin)
    vi,j = F : Sd → Tdata
    gi,j = F : Sd → Boolean
    Fci  = F : Sd → Sc
    Fdi  = F : Sd → Sd
    Id   = {}
    Ad   = {}
    Fcd  = F : Sd → {d}
    Fdd  = csd (identity)
    d    = {Id, Ad, Fcd, Fdd}

Operation:

1. Read: in state si, if all inputs in Ii are present, read the inputs into the present state [7]; otherwise, do nothing (stay in state, perform no actions or transitions).

2. Action: perform all guarded writes, closes, and frees in Ai whose guards are true.

3. Transition: update state according to Fci and Fdi.

Notes:

• The SFIFO operations present and read are available only to the read mechanism; they are not available for arbitrary use within the SFSM.

• d is the done state; an SFSM is done when it enters d.

[7] N.B. the values of the present state, csd, actually change to reflect the input values.


• A well-formed SFSM will close all output streams and free all input streams before making its final transition to d.

• A properly specified SFSM will specify both the EOS and data transitions; on EOS of an input, it should transition only to a state that does not read that input and from which no state that reads that input is reachable.

• In a properly formed SFSM, after performing a close action on an output, the machine will transition to a state from which that output is never written or closed again.