streamit cookbook

33
StreamIt Cookbook [email protected] November 2, 2004 1 StreamIt Overview Most data-flow or signal-processing algorithms can be broken down into a number of simple blocks with connections between them. In StreamIt parlance, the smallest block is a filter; it has a single input and a single output, and its body consists of Java-like code. Filters are then connected by placing them into one of three composite blocks: pipelines, split-joins, and feedback loops. Each of these structures also has a single input and a single output, so these blocks can be recursively composed. A typical streaming application might be a software FM radio, as shown in Figure 1. The program receives its input from an antenna, and its output is connected to a speaker. The main program is a pipeline with a band-pass filter for the desired frequency, a demodulator, and an equalizer; the equal- izer in turn is made up of a split-join, where each child adjusts the gain over a particular frequency range, followed by a filter that adds together the outputs of each of the bands. Our goal with choosing these constructs was to create a language with most of the expressiveness of a general data-flow graph structure, but to keep the block-level abstraction that modern programming languages of- fer. Allowing arbitrary graphs makes scheduling and partitioning difficult for the compiler. The hierarchical graph structure allows the implementa- tion of blocks to be “hidden” from users of the block; for example, an FFT could be implemented as a single filter or as multiple filters, but so long as there is a stream structure named “FFT” somewhere in the program the actual implementation is irrelevant to other modules that use it. Since most graphs can be readily transformed into StreamIt structures, StreamIt is suit- able for working on a wide range of signal-processing applications. 1

Upload: others

Post on 03-Feb-2022

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: StreamIt Cookbook

StreamIt Cookbook

[email protected]

November 2, 2004

1 StreamIt Overview

Most data-flow or signal-processing algorithms can be broken down intoa number of simple blocks with connections between them. In StreamItparlance, the smallest block is a filter; it has a single input and a singleoutput, and its body consists of Java-like code. Filters are then connectedby placing them into one of three composite blocks: pipelines, split-joins,and feedback loops. Each of these structures also has a single input and asingle output, so these blocks can be recursively composed.

A typical streaming application might be a software FM radio, as shownin Figure 1. The program receives its input from an antenna, and its outputis connected to a speaker. The main program is a pipeline with a band-passfilter for the desired frequency, a demodulator, and an equalizer; the equal-izer in turn is made up of a split-join, where each child adjusts the gainover a particular frequency range, followed by a filter that adds togetherthe outputs of each of the bands.

Our goal with choosing these constructs was to create a language withmost of the expressiveness of a general data-flow graph structure, but tokeep the block-level abstraction that modern programming languages of-fer. Allowing arbitrary graphs makes scheduling and partitioning difficultfor the compiler. The hierarchical graph structure allows the implementa-tion of blocks to be “hidden” from users of the block; for example, an FFTcould be implemented as a single filter or as multiple filters, but so longas there is a stream structure named “FFT” somewhere in the program theactual implementation is irrelevant to other modules that use it. Since mostgraphs can be readily transformed into StreamIt structures, StreamIt is suit-able for working on a wide range of signal-processing applications.

1

Page 2: StreamIt Cookbook

LowPass

Demod

BandPass ... BandPass

DUP

1 1RR

Equalizer

Adder

FMRadio

Figure 1: Stream graph for a software FM radio

2 Programming in StreamIt

2.1 A Minimal Program

void−>void pipeline Minimal {add IntSource;add IntPrinter ;

}void−>int filter IntSource {

int x;init { x = 0; }work push 1 { push(x++); }

}int−>void filter IntPrinter {

work pop 1 { print(pop()); }}

IntSource

IntPrinter

Minimal

This is the minimal interesting StreamIt program. Minimal is a StreamItpipeline: the output of its first child is connected to the input of its secondchild, and so on. It has two children, a source and a sink. Each of these are

2

Page 3: StreamIt Cookbook

implemented as StreamIt filter objects.A filter has two special functions, an init function and a work function.

Both of these are present in IntSource. The init function runs once at thestart of the program; the work function runs repeatedly forever. If the initfunction is omitted, as it is in IntPrinter, it is assumed to be empty. Workfunctions have static data rates. The source here declares that each iterationof the work function pushes a single item on to its output; the sink declaresthat it pops a single item from its input.

Every StreamIt structure has a single input and a single output. Thefilter and pipeline declarations here show the types of these inputs andoutputs. C-like int and float types are available, along with bit for one-bitdata and complex for complex floating-point data. void is used as a specialtype to indicate the boundary of the program: “the program” in StreamItis defined as a stream structure with both void input and output types. Afilter that takes no input at all should also be declared to take void as itsinput type, and similarly a void output can be used if a filter produces nooutput.

How to Compile and Run

The StreamIt compiler script strc can be used to compile and execute StreamItprograms. If you are using the StreamIt release, you can find all of the cook-book examples in the following directory:

cd $STREAMIT_HOME/apps/examples/cookbook

The minimal example is stored in Minimal.str, and the following com-mand will compile it for the uniprocessor backend:

strc Minimal.str -o minimal

The resulting binary is stored in minimal, and it can be executed for 5iterations as follows:

minimal -i 5

Doing so will print the integers from 0 to 4, in increasing order.During the course of compilation, a number of stream graphs are ouput

to dot files in the current directory. The dot format can be displayed andconverted to other formats using the Graphviz software, which is availableonline1. Running the following command will draw the stream graph forthe program, as pictured to the right of the source code above:

1http://www.research.att.com/sw/tools/graphviz/

3

Page 4: StreamIt Cookbook

dotty first-sir-tree.dot

There are many otherdotfiles that are output by the compiler; see Section 3of this document for more details.

The Java Library. In addition to using the StreamIt compiler, it is pos-sible to convert StreamIt programs into equivalent Java programs that canbe executed using any Java VM. This is particularly convenient for testingand debugging, as well as for cases when the compiler might encounter abug.

To run the Minimal program for 5 iterations in the Java library, do asfollows:

strc --library Minimal.str -i 5

This command will output a Minimal.java file, compile it with a Javacompiler, and run it using java. The output should always be identicalto that obtained using the compiler. In addition, the library will output aMinimal.dot file that can be visualized using Graphviz.

For more details on the StreamIt compiler and execution environment,including instructions for compiling to the Raw architecture, please con-sult Section 3.

2.2 A Moving Average Filter

void−>void pipeline MovingAverage {add IntSource();add Averager(10);add IntPrinter ();

}int−>int filter Averager(int n) {

work pop 1 push 1 peek n {int sum = 0;for ( int i = 0; i < n; i++)

sum += peek(i);push(sum/n);pop();

}}

IntSource

Averager

IntPrinter

MovingAverage

Most of a typical StreamIt program consists of filters that produce someoutput from their input. The Averager filter shown here is such a filter.

4

Page 5: StreamIt Cookbook

Like the filters shown before, Averager has a work function with staticallydeclared input and output rates.

In addition to peeking and popping, Averager peeks at its input stream.The peek() operator returns a particular item off of the input stream, withpeek(0) returning the next item that would normally be popped. The workfunction must declare a peek rate if it peeks at all, but this peek rate is amaximum, rather than an exact, rate; it would be valid for the Averagerfilter to peek(n−2) and never peek(n−1), but peek(n) is illegal. Note thatmixing peeking and popping is valid, but that popping an item shifts theindex of future peeks.

Averager also has a stream parameter. The number n is the number ofitems to average. This is passed like a normal function parameter fromthe add statement that creates the filter. Within the filter, the parameteris a constant: it is illegal for code to modify the parameter. This allowsparameter values to be used in expressions for e.g. I/O rates, as in the peekrate here.

This program also provides a basic demonstration of StreamIt’s filterscheduler. There is a guarantee that the Averager filter is not run until itsinput rates can be met, and in particular, that there are 10 inputs avail-able so peeking can happen. For this to happen, the source needs to runnine additional times at the start of the program; there can then be steady-state exections of source, averager, printer. The StreamIt compiler handlesthis automatically. While all of the examples so far have had filters withmatched I/O rates, the compiler also automatically schedules the execu-tion of adjacent filters whose push and pop rates are different.

2.3 A Low-Pass Filter

float−>float filter LowPassFilter(float rate , float cutoff ,int taps , int decimation) {

float [ taps ] coeff ;init {

int i ;float m = taps − 1;float w = 2 ∗ pi ∗ cutoff / rate ;for ( i = 0; i < taps ; i++) {

if ( i − m/2 == 0)coeff [ i ] = w/pi ;

else

5

Page 6: StreamIt Cookbook

coeff [ i ] = sin(w∗(i−m/2)) / pi / ( i−m/2) ∗(0.54 − 0.46 ∗ cos(2∗pi∗i /m));

}}work pop 1+decimation push 1 peek taps {

float sum = 0;for ( int i = 0; i < taps ; i++)

sum += peek(i) ∗ coeff[ i ];push(sum);for ( int i =0; i<decimation; i++)

pop();pop();

}}

The work function for a low-pass filter looks much like the work func-tion of the moving-average filter; however, it has extensive initializationcode. From the sampling rate, cutoff frequency, and number of taps, coef-ficients for an FIR filter can be statically calculated. This is done once, inthe init function, and saved in the coeff array; the work function then ef-fectively does a convolution. StreamIt provides a number of built-in math-ematical functions, such as the call to sin () here, along with the constantpi.

StreamIt’s array syntax is more C-like than Java-like. Every array has afixed length; this length can be a numeric constant or stream parameter, orother value that can be statically evaluated. In the declaration syntax, thelength of the array comes between the base type and the variable name.

The coefficient array here is defined as a field in the filter. If the namecoeff were used as a local variable in the init or work function, it wouldshadow the field, as in other languages. Otherwise, uses in both the initand work functions reference the field. If multiple low-pass filters existed,each would have its own coefficient array.

6

Page 7: StreamIt Cookbook

2.4 A Band-Pass Filter

float−>float pipeline BandPassFilter(float rate , float low, float high , int taps) {

add BPFCore(rate, low, high, taps);add Subtracter();

}float−>float splitjoin BPFCore

(float rate , float low,float high , int taps) {

split duplicate;add LowPass(rate, low, taps, 0);add LowPass(rate, high, taps, 0);join roundrobin;

}float−>float filter Subtracter {

work pop 2 push 1 {push(peek(1) − peek(0));pop(); pop();

}}

LowPass LowPass

DUP

1 1RR

BPFCore

Subtracter

BandPassFilter

We implement a band-pass filter using two low-pass filters in a StreamItstructure called a split-join. This structure contains a splitter, some numberof children that run in parallel, and a joiner. It overall has a single inputand a single output, and its children each have a single input and a singleoutput.

This split-join has a duplicating splitter; thus, each incoming item issent to both of the children. The joiner is a round-robin joiner, such thatoutputs are taken from the first child, then the second, in alternating order.There may be any number of children, in which case a round-robin joinertakes inputs from each of them in series. The order of the children is theorder in which they are added.

roundrobin can be used as a splitter, as well as a joiner; the meaning issymmetric. Other syntaxes are valid: roundrobin(2) reads two inputs fromeach child in turn, and roundrobin(1,2,1) requires exactly three childrenand reads one input from the first, two from the second, and one from thethird.

A typical use of a split-join is to duplicate the input, perform some com-putation, and then combine the results. In this case, the desired output isthe difference between the two filters; the Subtracter filter is placed in a

7

Page 8: StreamIt Cookbook

pipeline after the split-join, and finds the desired difference. In general, achild can be any StreamIt construct, not just a filter.

The implementation of pop() in the compiler and runtime system doesnot allow multiple pops to occur in the same statement. This is reflected inthe implementation of Subtracter here.

2.5 An Equalizer

float−>float pipeline Equalizer(float rate , int bands, float[bands] cutoffs ,float [bands] gains, int taps) {

add EqSplit(rate , bands, cutoffs , gains, taps);add float−>float filter {

work pop bands−1 push 1 {float sum = 0;for ( int i = 0; i < bands−1; i++)

sum += pop();push(sum);

}};

}float−>float splitjoin EqSplit(float rate , int bands, float[bands] cutoffs ,

float [bands] gains, int taps) {split duplicate;for ( int i = 1; i < bands; i++)

add pipeline {add BandPassFilter(rate, cutoffs [ i−1], cutoffs [ i ], taps);add float−>float filter {

work pop 1 push 1 { push(pop() ∗ gains[i]); }};

};join roundrobin;

}

This equalizer works by having a series of band-pass filters running inparallel, with their outputs added together. The caller provides arrays ofcutoff frequency and respective gains.

In the implmentation here, the output of EqSplit is a series of bands−1outputs from the respective low-pass filters. An inline filter is used to sumthe results together. This is akin to an anonymous class in Java; the filterdeclaration does not have an explicit name, but otherwise has syntax al-

8

Page 9: StreamIt Cookbook

BandPass

Scale

...BandPass

Scale

DUP

1 1RR

EqSplit

+

Equalizer

Figure 2: Stream graph for an equalizer

most identical to a top-level filter. In general, inline filters should only beused for very simple filters, such as this or the inlined amplifier in EqSplit.

EqSplit is a normal split-join, as shown previously. Its body consistsof a set of near-identical inlined pipelines; for pipelines and split-joins, theinput and output type declarations may be omitted on anonymous streams.Since the children are so similar, they are added within a normal for loop.The compiler is able to examine the loop provided that the loop bounds areexpressions of constants and stream parameters.

9

Page 10: StreamIt Cookbook

2.6 An Echo

float−>float feedbackloop Echo( int n, float f ) {

join roundrobin(1,1);body FloatAdderBypass();loop float−>float filter {

work pop 1 push 1 {push(pop() ∗ f);

}};split roundrobin;for ( int i = 0; i < n; i++)

enqueue(0);}float−>float filter FloatAdderBypass {

work pop 2 push 2 {push(peek(0) + peek(1));push(peek(0));pop();pop();

}}

AddBypass Scale

RR11

RR

Echo

This example uses a StreamIt feedback loop to implement an echo effect.In a sense, a feedback loop is like an inverted split-join: it has a joiner atthe top and a splitter at the bottom. A feedback loop has exactly two chil-dren, which are added using the body and loop statements. Thus, thisimplementation takes an input from the loop input and an input from thefeedback path, adds them, and outputs the result. The result is also scaledby the value f and sent back to the top of the loop.

Feedback loops have a specialized push-like statement, enqueue. Eachenqueue statement pushes a single value on to the input to the joiner fromthe feedback path. There must be enough values enqueued to preventdeadlock of the loop components; values enqueued delay data from thefeedback path.

10

Page 11: StreamIt Cookbook

2.7 Fibonacci

void−>int feedbackloop Fib {join roundrobin(0,1);body int−>int filter {

work pop 1 push 1 peek 2 { push(peek(0) + peek(1)); pop(); }};loop Identity<int>;split duplicate;enqueue(0);enqueue(1);

}

Using a feedback loop for a Fibonacci number generator is slightly un-usual but possible. The joiner reads no items from the stream input (alsodeclared of type void), but reads items continuously from the feedbackpath. Within a feedback loop, round-robin splitters and joiners addressthe external path first and the feedback path second. This loop also usesthe special Identity filter on the loop path; this is equivalent to an emptyfilter that copies its input to its output, but occurs frequently enough that ashorthand is useful to both the programmer and the compiler.

3 Using the StreamIt Compiler

This section walks through a sample session with the compiler and runtimesystem. We will use the FMRadio example from the StreamIt release as arunning example. To get started, change to the following directory:

% cd $STREAMIT_HOME/apps/examples/cookbook

The example is in FMRadio.str. The following sections describe the com-pilation of FMRadio using the uniprocessor backend, the Java library, andthe Raw backend. A summary of the compiler’s command-line options canbe found in Appendix B, or by typing strc -help at the command line.

3.1 Compiling for a Uniprocessor

There are two ways to compile a StreamIt program for execution on a general-purpose processor. Both methods compile StreamIt to a C program thatcan be further compiled with a C compiler. The first method (the default)preserves the hierarchical structure of the original program and relies on

11

Page 12: StreamIt Cookbook

a C runtime library to do buffer management. The second method (“stan-dalone”) produces a self-contained file where the entire stream graph hasbeen collapsed into a single function, with buffer management embeddedinto the code. The default output is more readable and provides some flex-ibility (by exposing the runtime library interface). The standalone outputmight be useful for groups interested in compiling C programs to new ar-chitectures; however, the size of the main work function might grow verylarge. We recommend using the default backend with the C runtime library.

Compiling for C library. To compile FMRadio using the uniprocessorbackend and C runtime library, issue the following command (the compileroutput is shown):

% strc FMRadio.str -o fmRunning Constant Prop and Unroll... done.Raising variable declarations... done.Propagating constant fields... done.Flattening blocks... done.Raising variables... done.Raising variable declarations... done.Moving initial assignments... done.Structuring... done.Scheduling... got schedule, interpreting... done.Annotating IR for uniprocessor... done.Generating code...

This will create a C file named FMRadio.c and a binary named fm. Thebinary can be executed for 5 steady-state iterations as follows:

% ./fm -i 5278074.000000278074.750000278075.437500278075.968750278076.437500

During the compilation process, several dot graphs are generated. Thedot format can be displayed and converted to other formats using theGraphviz software, which is available online2. For example, we can ex-amine a stream graph for the FM application as follows:

% dotty first-sir-tree.dot

The result appears in Figure 3. A complete list of the dot graphs thatare produced on the normal uniprocessor path are shown in Figure 4.

2http://www.research.att.com/sw/tools/graphviz/

12

Page 13: StreamIt Cookbook

FMRadio

FMRadioCore

Equalizer

EqSplit

Pipeline

BandPassFilter

BPFCore

Pipeline

BandPassFilter

BPFCore

Pipeline

BandPassFilter

BPFCore

Pipeline

BandPassFilter

BPFCore

FloatOneSource__3

LowPassFilter__24

FMDemodulator__38

duplicate

duplicate duplicate duplicate duplicate

roundrobin(1,1,1,1)

Filter__97

LowPassFilter__59 LowPassFilter__80

roundrobin(1,1)

Subtracter__87

Filter__91

LowPassFilter__59 LowPassFilter__80

roundrobin(1,1)

Subtracter__87

Filter__91

LowPassFilter__59 LowPassFilter__80

roundrobin(1,1)

Subtracter__87

Filter__91

LowPassFilter__59 LowPassFilter__80

roundrobin(1,1)

Subtracter__87

Filter__91

FloatPrinter__101

Figure 3: first-sir-tree.dot for the FMRadio example.

Domain-specific optimizations. It turns out that our version of the FM-Radio has a lot of redundant computation the way in which it is written.For example, each BandPassFilter could be implemented as a singleFIR filter rather than a composition of LowPassFilter’s; in fact, the en-tire equalizer could be collapsed to a single FIR filter. Further, some of theseoperations are more efficient if executed in the frequency domain, with anFFT/IFFT being used to translate to and from the time domain.

The StreamIt compiler includes a set of domain-specific optimizationsthat will automatically perform the transformations described above. Theanalysis considers all filters that are “linear”—that is, each of their outputsis an affine combination of their inputs. The compiler automatically de-tects linear filters by analyzing the code in their work functions. Then, itperforms algebraic simplification of adjacent linear filters, as well as auto-matic translation to the frequency domain. Since these transformations can

13

Page 14: StreamIt Cookbook

Filename Description

first-sir-tree.dotOriginal stream graph, as written by pro-grammer.

before-partition.dotCanonical version of stream graph, be-fore any stream transformations are applied.Nodes are annotated with their I/O rates.

after-partition.dotCanonical version of stream graph, after anystream transformations are applied. Nodesare annotated with their I/O rates.

schedule.dotFinal stream graph, annotated with I/O ratesand the number of times each node executesin the initial and steady-state schedule.

Figure 4: dot graphs produced on the uniprocessor path.

sometimes hamper performance, the compiler also does a global cost/ben-efit analysis to determine the best set of transformations for a given streamgraph.

The linearpartition option to strc will enable linear analysis andoptimizations3:

% strc -linearpartition FMRadio.str -o fmRunning Constant Prop and Unroll... done.Raising variable declarations... done.Propagating constant fields... done.Flattening blocks... done.Raising variables... done.Raising variable declarations... done.Moving initial assignments... done.Running linear analysis...WARNING: Assuming method call expression non linear(atan). Also

removing all field mappings.WARNING: Insufficient pushes detected in filterdone with linear analysis.Linear partitioner took 1 secs to calculate partitions.Structuring... done.Scheduling... got schedule, interpreting... done.

3In contrast, the linearreplacement and frequencyreplacement options willperform maximal algebraic simplification and frequency translation, respectively, even incases where it is not beneficial.

14

Page 15: StreamIt Cookbook

FMRadio

EqSplit

Pipeline

BPFCore

Pipeline

BPFCore

Pipeline

BPFCore

Pipeline

BPFCore

FloatOneSource

LowPassFilter

FMDemodulator

DUPLICATE(1,1,1,1)

DUPLICATE(1,1) DUPLICATE(1,1) DUPLICATE(1,1) DUPLICATE(1,1)

WEIGHTED_ROUND_ROBIN(1,1,1,1)

Filter

LowPassFilter LowPassFilter

WEIGHTED_ROUND_ROBIN(1,1)

Subtracter

Filter

LowPassFilter LowPassFilter

WEIGHTED_ROUND_ROBIN(1,1)

Subtracter

Filter

LowPassFilter LowPassFilter

WEIGHTED_ROUND_ROBIN(1,1)

Subtracter

Filter

LowPassFilter LowPassFilter

WEIGHTED_ROUND_ROBIN(1,1)

Subtracter

Filter

FloatPrinter

Figure 5: linear-simple.dot, which illustrates the linear sections ofFMRadio. Linear filters are shaded blue, while linear containers are shadedpink.

Annotating IR for uniprocessor... done.Generating code...

The linear analysis produces its own set of dot files that we can use toinspect the results of the optimizations. For example, the following com-mand will display the stream graph with the linear sections highlighted:

% dotty linear-simple.dot

As shown in Figure 5, FMRadio contains many linear components, in-cluding the first LowPassFilter and the equalizer. To see the stream graphafter linear optimizations have been applied, we can issue the followingcommand:

% dotty after-linear.dot

As illustrated in Figure 6, this stream graph shows that the equalizerwas collapsed into a single filter and then was translated to the frequencydomain (by virtue of the “Freq” prefix in the filter’s name.) However, the

15

Page 16: StreamIt Cookbook

FMRadio_Hier_211

FloatOneSource_96push0=1pop0=0

peek0 =0

LowPassFilter_97push0=1pop0=5

peek0 =64

FMDemodulator_98push0=1pop0=1

peek0 =2

TwoStageFreqFMRadio_child1_child1_child1_child0_219push0=193pop0=193

peek0 =193initPush0=130initPop0=193

initPeek0 =193

FloatPrinter_116push0=0pop0=1

peek0 =1

Figure 6: Final stream graph (after-linear.dot) for the FMRadio, com-piling with the -linearpartition option.

LowPassFilter at the top was left unmodified; this is because it has a largepop rate that degrades the performance of the frequency transformation. Inthis case, the linear optimizations lead to a 6.5X improvement in through-put.

The linear optimizations produce additional dot graphs; see Figure 7for details. For more information on the linear analysis and optimization,please refer to http://cag.lcs.mit.edu/linear.

Compiling as standalone. To compile FMRadio to a standalone file thatcan execute without the C runtime library, use the -standalone option:

% strc -standalone FMRadio.str -o fm/*Out of Kopi2SIR.Out of semantic checker.

16

Page 17: StreamIt Cookbook

Filename Description

ldp-partition-input.dotThe stream graph as it was input to the linearpartitioning algorithm.

linear.dotThe stream graph with linear filters high-lighted and each node annotated with its I/Orates.

linear-simple.dot Same as linear.dot but without the I/Orates.

after-linear.dotThe stream graph after linear transformationsare complete.

Figure 7: dot graphs produced by linear optimizations.

*/Entry to RAW BackendMoving initializers into init functions... done.Running Constant Prop and Unroll...Done Constant Prop and Unroll...Running Constant Field Propagation...Done Constant Field Propagation...Running Partitioning...

Found 26 tiles.Building stream config...Calculating partition info...Tracing back...Work Estimates:

Fused_Flo_Low_FMD_Low_Pre_E... 10240 (100%)Done Partitioning...Flattener Begin...Filters in Graph: 1Flattener End.Simulated Annealing AssignmentTiles layout.assigned: 1Initial Cost: 0.0Assign End.Switch Code Begin...FineGrainSimulator Running...End of init simulationEnd of steady-state simulation

17

Page 18: StreamIt Cookbook

sw0.s writtenSwitch Code End.Generating Raw Code:

Fused_Flo_Low_FMD_Low_Pre_EqS_Pos_Fil_Flo_172 (no buffer)Tile Code begin...Optimizing Fused_Flo_Low_FMD_Low_Pre_EqS_Pos_Fil_Flo_172...Code for Fused_Flo_Low_FMD_Low_Pre_EqS_Pos_Fil_Flo_172

written to tile0.cTile Code End.Creating Makefile.Exiting

As is evident in the compiler output above, the standalone option uti-lizes the Raw backend and stores the self-contained output file in tile0.c(as well as the binary fm). Also, due to an implementation detail, it is notpossible to control the number of runtime iterations for standalone pro-grams (they will run forever.)

3.2 Using the Java Library

A convenient aspect of the StreamIt compilation toolchain is that all StreamItprograms are first translated to Java files that can be executed against a Javaruntime library using a normal Java Virtual Machine. This is especially use-ful for testing and debugging applications, as well as validating the outputof the compiler.

The library can be invoked with the -library flag. Since strc willboth compile and execute the file in the library, you can specify the numberof iterations to execute with the -i flag. For example, to compile FMRadioand run for 5 iterations in the library, do as follows4:

% strc -library -i 5 FMRadio.str278073.94278074.75278075.38278075.94278076.4

You can also inspect the FMRadio.java file, which was generated forexecution in the library. It can be compiled and run with a standard Javacompiler and JVM. The library also produces a dot graph of the program;

4In this case, the library’s output is marginally different from the compiler’s due to nu-merical precision issues.

18

Page 19: StreamIt Cookbook

it is given the same name as the StreamIt file, but with a dot extension (i.e.,it is FMRadio.dot in this case.)

There are a few additional options available in the library. For instance,you can direct the library not to execute the program, but to instead justprint the schedule of filter firings:

% strc -library -norun -printsched FMRadio.strinit = [$0 = [email protected]$1 = [email protected]$2 = [email protected]$3 = [email protected]$4 = [email protected]$5 = [email protected]$6 = [email protected]$7 = [email protected]$8 = { {379 $0} {64 $1} {63 $2} {63 $3}

{63 $4} {63 $5} {63 $6} {63 $7} }]steady = [$9 = [email protected]$10 = [email protected]$11 = [email protected]$12 = [email protected]$13 = [email protected]$14 = [email protected]$15 = [email protected]$16 = [email protected]$17 = [email protected]$18 = [email protected]$19 = [email protected]$20 = [email protected]$21 = [email protected]$22 = [email protected]$23 = [email protected]$24 = [email protected]$25 = [email protected]$26 = [email protected]$27 = [email protected]$28 = [email protected]$29 = [email protected]$30 = [email protected]$31 = [email protected]$32 = { {5 $0} $1 $2 $3 $4 $9 $10 $11 $12 $13 $5 $14 $15 $16 $17 $18

$6 $19 $20 $21 $22 $23 $7 $24 $25 $26 $27 $28 $29 $30 $31 }

19

Page 20: StreamIt Cookbook

]!ml sched size = 39!ml buff size = 1299

Currently, the default scheduler is a minimal latency scheduler that usesphases to compress the code size. The schedule listed above has two com-ponents: an initialization schedule (to initialize buffers for filters that peek)and a steady-state schedule (that can loop infinitely). Each filter and split-ter in the graph is given a number for easy reference, and then the scheduleis printed at the bottom. A loop nest in the schedule is denoted by (N F),where the filter F executes N times. The schedule size and buffer size re-quired are printed at the end of the listing.

Additional options for the library can be found in Appendix B.

3.3 Compiling for Raw

To compile for an NxN Raw machine, use the -raw N option5. In mostcases, you will want to use -raw 4, since the actual chip has 4 tiles on eachside. When compiling to Raw, we also recommend using the -O1 flag toenable common optimizations; the default is -O0 (no optimizations), andthe most aggressive is -O2 (which includes slow and potentially unstabletransformations). More information about specific optimizations can befound in Appendix B.

Compiling FMRadio looks like this:

% strc -raw 4 -O1 FMRadio.str/*Out of Kopi2SIR.Out of semantic checker.*/Entry to RAW BackendMoving initializers into init functions... done.Running Constant Prop and Unroll...Done Constant Prop and Unroll...Running Constant Field Propagation...Done Constant Field Propagation...Running Partitioning...

Found 26 tiles.Building stream config...Calculating partition info...Tracing back...

5You can also compile for an NxM machine by using -raw N -rawcol M.

20

Page 21: StreamIt Cookbook

Work Estimates:LowPassFilter__24 1444 (10%)LowPassFilter__59 1420 (10%)LowPassFilter__80 1420 (10%)LowPassFilter__59 1420 (10%)LowPassFilter__80 1420 (10%)LowPassFilter__59 1420 (10%)LowPassFilter__80 1420 (10%)LowPassFilter__59 1420 (10%)LowPassFilter__80 1420 (10%)Fused_Pre_EqS_Pos_Fil_Flo 386 (2%)FMDemodulator__38 224 (1%)FMDemodulator__38 224 (1%)FloatOneSource__3 110 (0%)

Done Partitioning...Flattener Begin...Filters in Graph: 13Flattener End.Simulated Annealing AssignmentTiles layout.assigned: 15Initial Cost: 60060.0Final Cost: 29055.0 Min Cost : 21119.0 in 451 iterations.Assign End.Cannot perform Rate Matching.Switch Code Begin...WorkBasedSimulator Running...End of init simulationEnd of steady-state simulationsw0.s written

. . .

sw14.s writtenSwitch Code End.Generating Raw Code: FloatOneSource__3_96 (Direct Communication)

. . .

Generating Raw Code: FMDemodulator__38_178 (simple)Tile Code begin...Optimizing FloatOneSource__3_96...Code for FloatOneSource__3_96 written to tile0.c

. . .

Optimizing FMDemodulator__38_178...Code for FMDemodulator__38_178 written to tile2.c

21

Page 22: StreamIt Cookbook

FMRadio_Hier_175

FMDemodulator__38_Fiss_182

EqSplit_0_Hier_183

FloatOneSource__3_96push0=1pop0=0

peek0 =0

LowPassFilter__24_97push0=1pop0=5

peek0 =64

DUPLICATE(1,1)

FMDemodulator__38_177push0=1pop0=2

peek0 =2initPush0=0initPop0=0

initPeek0 =0

FMDemodulator__38_178push0=1pop0=2

peek0 =2initPush0=0initPop0=1

initPeek0 =1

WEIGHTED_ROUND_ROBIN(1,1)

DUPLICATE(1,1,1,1,1,1,1,1)

LowPassFilter__59_99push0=1pop0=1

peek0 =64

LowPassFilter__80_100push0=1pop0=1

peek0 =64

LowPassFilter__59_103push0=1pop0=1

peek0 =64

LowPassFilter__80_104push0=1pop0=1

peek0 =64

LowPassFilter__59_107push0=1pop0=1

peek0 =64

LowPassFilter__80_108push0=1pop0=1

peek0 =64

LowPassFilter__59_111push0=1pop0=1

peek0 =64

LowPassFilter__80_112push0=1pop0=1

peek0 =64

WEIGHTED_ROUND_ROBIN(1,1,1,1,1,1,1,1)

Fused_Pre_EqS_Pos_Fil_Flo_173push0=0pop0=8

peek0 =8

Figure 8: Load-balanced stream graph (after-partition.dot) that is outputby partitioner for executing FMRadio on a 4x4 Raw machine.

Code written to tile10.cTile Code End.Creating Makefile.Exiting

A number of dot files are produced during the compilation to Raw;they are listed in Figure 9. For example, since there are more filters thanRaw tiles, the compiler has to partition the graph into a set of load-balancedexecution units. The output of the partitioning stage can be viewed as fol-lows (see Figure 8):

% dotty after-partition.dot

In this case, the partitioner detected that the LowPassFilter’s in theequalizer were on the critical path, so it allocated a tile for each one. Thebottom part of the equalizer was fused with the end of the pipeline. Also,the FMDemodulator was fissed into two data-parallel sections. The steady-state work estimates for before and after partitioning can be found in work-before-partition.dotand work-after-partition.dot. There arealso corresponding text files.

After partitioning, the compiler maps the partitioned stream graph tothe Raw tiles. The layout can be viewed in layout.dot:

22

Page 23: StreamIt Cookbook

Filename Description

dp-partition-input.dotThe stream graph in the format used by thepartitioning algorithm.

work-before-partition.dot

The stream graph before partitioning, anno-tated with estimates of the steady-state workwithin each node. Nodes with the sameamount of work are given the same color (al-though the colors themselves are meaning-less.)

work-before-partition.txtText listing of the work estimates for filters inthe graph, before load balancing.

work-after-partition.dot The stream graph after partitioning, anno-tated with work estimates as above.

work-after-partition.txtText listing of the work estimates for filters inthe graph, after load balancing.

flatgraph.dotThe stream graph with structure eliminated,before it is mapped onto Raw.

layout.dot

The final layout of filters onto Raw tiles, witharrows between communicating filters. Notethat the arrows do NOT indicate the routesthat items take.

initial-layout.dotThe initial layout, before the simulated an-nealing algorithm.

Figure 9: Files produced by the Raw backend, above and beyond thoseproduced by the uniprocessor backend.

23

Page 24: StreamIt Cookbook

��� � � � ��� �� � � � ��� � � � �����LowPassFilter__24_97_32

LowPassFilter__80_108_44

FMDemodulator__38_178_36 FMDemodulator__38_177_35

LowPassFilter__59_111_45 WEIGHTED_ROUND_ROBIN_Joiner_179_34LowPassFilter__80_100_40

LowPassFilter__59_103_41 WEIGHTED_ROUND_ROBIN_Joiner_181_38 tile10 LowPassFilter__80_112_46

LowPassFilter__59_107_43 LowPassFilter__80_104_42 LowPassFilter__59_99_39Fused_Pre_EqS_Pos_Fil_Flo_173_47

Figure 10: Final layout (layout.dot) for the FMRadio on a 4x4 Raw machine.

% dotty layout.dot

This layout is shown in Figure 10. Note that the arrows represent com-munication channels between filters, but are unrelated to the routes as-signed for these channels.

The final outputs of the Raw backend are: 1) the tile code, which runson the compute processors (tile*.c), 2) the switch code, which routesitems between processors (sw*.s), and 3) Makefile.streamit, whichinterfaces with the Raw simulation infrastructure and orchestrates the exe-cution.

Using the Raw simulator. To execute the output of the Raw backend,you will need the Raw simulator and “starsearch” infrastructure; for moreinformation, see the Raw website at http://cag.lcs.mit.edu/raw.The StreamIt compiler makes use of the cycle-accurate BTL (“Beetle”) sim-ulator for Raw.

To run the simulator, do as follows:

make -f Makefile.streamit run

24

Page 25: StreamIt Cookbook

This command will cause the simulator to boot up, execute the startupcode and then run for a bunch of cycles. If you want to quit before thesimulation is done, use CTL-C.

Once the simulator is executing the program, the output should shouldlook as follows:

[31: 00001384b]: 278073.937500[31: 000013d53]: 278074.750000[31: 000013fc4]: 278075.375000[31: 000014278]: 278075.937500[31: 0000144e4]: 278076.406250[31: 000014798]: 278076.812500[31: 000014a04]: 278077.156250[31: 000014cb8]: 278077.468750[31: 000014f24]: 278077.750000[31: 0000151d8]: 278078.000000...

The first number (31) represents the fact that the tile at row 3, column 1issued the print command. The next number (e.g., 00001384b) is the cyclecount (in hexadecimal) at which the third number was produced. The thirdnumber is the output value that is produced from FMRadio.

Running in debug mode. To run the simulator in interactive debug mode,you should run:

make -f Makefile.streamit debug

This will change the terminal window to a text layout of the Raw tilesbeing executed, showing the active instruction stream of each. An extra“shunt” window will appear to control the actions of the debugger; it willlook something like this:

// welcome to beetle// try:

help(); help(‘‘help’’);

Configuration:

[THREAD_C THREAD_E DEVICE NAME RESET ROUTINE CALC ROUTINEPARAM ]

[081c0f10 00000000 Serial Rom dev_serial_rom_reset dev_serial_rom_calc0844c9d0][0815fcf0 00000000 Print Service dev_print_service_rese dev_print_service_calc0844c410]

. . .

25

Page 26: StreamIt Cookbook

[08205508 00000000 streaming_dram dev_streaming_dram_res dev_streaming_dram_cal08444220]

// Use VMShowPositionOfThread(THREAD #); to inspect execution of device

/*[0] 0*/

To start the simulator, type go();. This will boot up the Raw proces-sor, bring the program into memory, and perform some other initializationtasks. After initialization, there will be a message and a prompt such as:

running...

. . .

### PASSED: -0000000001 ffffffff nan [x,y] = [1, 3]// [ serial rom : finished

tile.B79414E0.0004-0004.rbf-15 --> static port 15 ]// *** interrupted [135033/inf]stopped./*[34576982] 2*/

You are now ready to start doing useful things with the debugger. Thefollowing are the most useful commands:

• go(); starts the simulator executing, without any limit as to how manycycles it will run.

If you want to stop the simulator, you can hit CTL-C in the mainwindow (not the shunt window), which will give you the commandprompt back in the shunt window.

• step(N); steps the simulator forward N cycles. At the end of N cy-cles, the simulator will stop executing and return you to a commandprompt.

By stepping the simulator N cycles, each processor (of the 16) is steppedforward N cycles.

• sv(N); produces a graphical execution trace over N cycles. This traceshows the activity of each processor over time; we call it a “blood-graph” because red indicates that a processor is blocked and perform-ing no useful work. For FMRadio, the result of running sv(5000);appears in Figure 11.

26

Page 27: StreamIt Cookbook

Figure 11: Execution trace for FMRadio on a 4x4 Raw machine. The verticalaxis represents processors, while the horizontal axis represents time. Note that theentirely white processor is not being used by the configuration (you can verify thisby inspecting layout.dot, which shows the tile to be empty).

The complete guide to the colors in the bloodgraph is as follows:

Color Meaningwhite useful workpurple floating point operationgreen pipeline stall, e.g., for a hazardred blocked while receiving an item from the switchblue blocked while sending an item to the switch

• quit(); will exit the interactive debugger. Incidentally, typing CTL-Dwill close the shunt window.

Gathering numbers. The compiler provides automatic support for gath-ering throughput, MFLOPS, and utilization numbers, as well as automatedgeneration of bloodgraphs. To collect these performance statistics, use the-numbersoption, which takes an argument indicating the number of steady-state cycles to execute:

% strc -raw 4 -O1 -numbers 15 FMRadio.str

27

Page 28: StreamIt Cookbook

Compilation will proceed as before, except that the code will be instru-mented to gather numbers using hooks in the Raw simulator. The programshould then be executed in batch mode, and number gathering messages(one per steady-state) will replace the standard output:

% make -f Makefile.streamit run

. . .

running...

### PASSED: 1073741901 4000004d 2.00001836 [x,y] = [0, 0]

### PASSED: -0000000001 ffffffff nan [x,y] = [0, 0]Interrupted tile 0, pc = 0x3b8stopped.running...Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 464Cycles: 1312, MFLOPS: 460Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 464Cycles: 1312, MFLOPS: 461Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Cycles: 1312, MFLOPS: 463Generating results.outecho ‘date‘ cagfarm-42 ’end running BTL’ ‘pwd‘ >> /u/thies/research/streams/streams/misc/raw/stardata/logs/cagfarm-42.log; echo ‘date‘cagfarm-42 ’end running BTL’ ‘pwd‘ >> /u/thies/research/streams/streams/misc/raw/stardata/logs/all.log; rm /u/thies/research/streams/streams/misc/raw/stardata/logs/cagfarm-42:BTL::home:bits6:thies:streams:streams:apps:examples:cookbook;rm tile1.s tile8.s tile3.s tile11.s tile5.s tile13.s tile0.s tile7.s tile15.s tile2.s tile9.s tile10.stile4.s tile12.s tile6.s tile14.s 70.540u 18.180s 1:30.55 97.9% 0+0k 0+0io 58180pf+0w

The simulator will terminate naturally, leaving a detailed accounts sum-mary in the file results.out. Here is an excerpt from the results file:

28

Page 29: StreamIt Cookbook

Summmary:Steady State Executions: 15Total Cycles: 19680Total Steady State Outputs: 30Avg Cycles per Steady-State: 1312Thruput per 10ˆ5: 152Total Non-Blocked Cycles: 195438

Instruction Mix:FPU: 36484MEM: 37071BRANCH: 21584ADMIN: 0ALU: 100299

workCount* = 195438 / 314880Avg MFLOPS: 463

Also, a blood graph will be saved in bloodgraph.ppm.More options for the Raw backend are described in Appendix B.

29

Page 30: StreamIt Cookbook

A Keyword Review

Stream object types:

filter Declares a filter with a work function

pipeline Declares a series of stream objects, with the output of the firstconnected to the input of the second, etc.

splitjoin Declares a parallel set of stream objects, with a splitter and ajoiner distributing and collecting data

feedbackloop Declares a feedback loop with two children, with a joinercombining input data and the output of the loop and a splitter dis-tributing the output of the body to the output and the input of theloop

Filter work functions:

push Pushes an item on to the output of the filter. Must be called the exactnumber of times as in the rate declaration.

pop Retrieves and removes the first item from the input of the filter. Mustbe called the exact number of times as in the rate declaration.

peek(k) Retrieves the k + 1-th item from the input of the filter, withoutremoving it. If n items have been popped, k +n must be less than thedeclared peek rate.

Composite stream declarations:

add Adds a child after the existing children. (pipeline, splitjoin)

body Adds a child as the body part of a feedback loop.

loop Adds a child as the loop part of a feedback loop.

enqueue Pushes an item on to the input of the joiner coming from the looppart of a feedback loop.

split Declares the type and weights of the splitter. (splitjoin, feedbackloop)

join Declares the type and weights of the joiner. (splitjoin, feedbackloop)

30

Page 31: StreamIt Cookbook

duplicate Splitter type that takes each input item and copies it to the inputof each child.

roundrobin Splitter or joiner type that takes a specified number of itemsfrom the input (or output) and copies it to the input (or output) ofeach child.

B Options

--help Displays a summary of common options.

--more-help Displays a summary of advanced options (which are not de-scribed below).

--output 〈filename〉, -o 〈filename〉 Places the resulting binary in 〈filename〉.

--raw 〈n〉, -r 〈n〉 Compile for an 〈n〉-by-〈n〉 Raw processor.

--standalone, -S Generate C file without library dependencies.

--library Produce a Java file compatible with the StreamIt Java library, andcompile and run it.

--memory 〈size〉, -Mmx 〈size〉 Set maximum Java runtime heap size. Thedefault heap size is 1700Mb.

--verbose Show intermediate commands as they are executed.

Options available for all backends

-O0 Do not optimize (default).

-O1 Perform basic optimizations that should improve performance in mostcases. Adds--unroll 16 --destroyfieldarray --partition--ratematch --wbs.

-O2 Perform extended optimizations that should improve performance inmost cases, but may also cause the compiler to become unstable. Adds--unroll 256 --destroyfieldarray --partition --ratematch--removeglobals --wbs.

31

Page 32: StreamIt Cookbook

--linearreplacement, -l Domain-specific optimization: combine adjacent “lin-ear” filters in the program into a single matrix multiplication opera-tion wherever possible. Corresponds to the “linear” option in thePLDI’03 paper.

--unroll 〈n〉, -u〈n〉 Specify loop unrolling limit. The default value is 0.

Options specific to Raw backend

--partition, -p Partition the stream graph, using a dynamic programmingalgorithm to divide the graph into load balanced units for executionon separate tiles. This is turned on by default if the number of filtersin the expanded stream graph is greater than the number of tiles.

--magic net, -M Compile to a “magic” network in which the buffer sizesbetween Raw tiles are conceptually unbounded (the sender neverblocks), and the communication time has a constant overhead andper-hop latency. These parameters can be varied by editing the auto-generated Makefile.streamit file, which contains a string -magic crossbarC1H1. The number after the “C” refers to the constant overhead; afterthe “H” refers to the per-hop latency.

--numbers 〈n〉, -N〈n〉 Instrument code to gather performance statistics onsimulated code over 〈n〉 steady-state cycles. The results are placed inresults.out in the current directory.

--rawcol 〈m〉, -c〈m〉 Specify number of columns in Raw processor; –rawspecifies number of rows.

--wbs When laying out communication instructions, use the work-basedsimulator to estimate exactly when items will be produced and con-sumed. This improves the scheduling of routing instructions.

Options specific to uniprocessor backend

--frequencyreplacement, -F Domain-specific optimization: combine adja-cent “linear” filters in the program and convert them to the frequencydomain wherever possible. Corresponds to the “freq” option in thePLDI’03 paper.

--linearpartition, -L Domain-specific optimization: perform linear replace-ment and frequency replacement selectively, based on an estimate of

32

Page 33: StreamIt Cookbook

where it is most beneficial. Corresponds to the “autosel” option inthe PLDI’03 paper.

--profile Compile C code with profiling information. This requires thatyour system have profiling versions of the standard C and math li-braries (libc p.a and libm p.a), along with a profiling version of theStreamIt library; invoke make profile in $STREAMIT HOME/library/c.Profiling results can be analyzed using gprof.

Options specific to the Java library

--iterations 〈n〉, -i〈n〉 Run the program for 〈n〉 steady-state iterations. De-faults to infinity.

--marksteady Print “*” to standard output after each steady-state execu-tion.

--norun Perform the library setup and schedule the stream graph, but don’tactually run the program.

--nosched Don’t run the scheduler; instead, run in a pull mode, where thebuffer lengths are examined at run time and a preceding filter is runif needed to provide enough data to run the current filter.

--printsched Print out the minimal-latency schedule for the program.

33