hardware-software codesign lab report

Tallinn University of Technology (Tallinna Tehnikaülikool)
Department of Computer Engineering (Arvutitehnika instituut)

Labs 1, 2, 3, 4

IAY0070 Hardware/Software Co-design (Riist- ja tarkvara koosdisain)

Valentin Tihhomirov 971081

Tallinn 2008

Table of Contents

Lab1: VGA generator
Lab2: The Timer
Screen Buffer development
Lab3: The Game (Xonix)
Lab4: Doing more in PicoBlaze
  1. PS/2
  2. Video address computation
  3. Delay
The acquired experience


The goal of this laboratory work was to co-design and co-implement a video game on an FPGA development board. This document describes all four completed parts of the assignment:

1. VGA generator synthesis
2. Integration with PicoBlaze for a timer application
3. The video game
4. Minimizing the FPGA area by moving some functions from configurable logic to PicoBlaze.

The bit streams and sources are available at http://www.fileden.com/files/2006/10/20/303866/codesign/labs.rar along with the game trailer at http://www.fileden.com/files/2006/10/20/303866/codesign/XSA3S1000_video_game.AVI

FPGA technology provides a massive number of fine-grained gates. These elements are configured and combined into specific functions. Every gate computes and outputs a result at every clock cycle, resulting in high computing power. However, not all functions require such critical response time and determinism.1 Sometimes it makes sense to execute the slow processes in a time-multiplexed manner on a universal processor. Tiny RISC processors save expensive FPGA resources, energy and, occasionally, development time. This report describes how we succeeded in doing this.

Lab1: VGA generator

I started by optimizing the "VGA Generator for the XSA Boards" demo VHDL, found on the board supplier's official web site at http://old.xess.com/appnotes/an-101204-vgagen.pdf. Everything beyond the 640x480 mode (800x600 including the invisible pixels) on the 100 MHz Spartan3 board was pruned away so that the result could be used throughout the rest of the work. The SDRAM video buffer was also removed because I had no plan to use it. The resulting VGA generator was tested by a (pixel_cnt, line_cnt) => pixel color mapping:

1 Response time of a processor running many tasks is less deterministic than that of a dedicated FU.


Figure 1: VGA generator

The color map, which converts the location of the pixel currently under scan into its color, can be substituted, which gives the user the freedom to implement many different functions. In this particular case, it simply paints different parts of the screen in fixed colors in order to test the operation.
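The role of the color map block can be illustrated with a small software model. This is only a sketch: the function name `color_map`, the 4x2 split of the screen and the 3-bit RGB encoding are assumptions made for illustration, not the mapping used in the lab design.

```python
# Hypothetical software model of a test color map.
# Assumption: the visible area is split into a 4x2 grid and every region
# gets one of the eight 3-bit RGB colors. The real test pattern may differ.
WIDTH, HEIGHT = 640, 480  # visible area of the VGA mode used in this report


def color_map(pixel_cnt, line_cnt):
    """Return an (r, g, b) triple of 0/1 bits for the pixel under scan."""
    if pixel_cnt >= WIDTH or line_cnt >= HEIGHT:
        return (0, 0, 0)  # invisible pixels are driven black
    # Region index 0..7: four columns, two rows.
    region = (pixel_cnt * 4 // WIDTH) + 4 * (line_cnt * 2 // HEIGHT)
    return ((region >> 2) & 1, (region >> 1) & 1, region & 1)
```

Any other function with the same signature could be substituted, which is the freedom the block is meant to provide.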

Lab2: The Timer

The second assignment was to implement a timer using a PicoBlaze (pB for short from here on). At first, the time-delay counter and the decoder for the seven-segment indicator were implemented purely in HW. Two counters and a decoder do that:

Figure 2: Timer

Counters are ubiquitous in HW design. The frequency divider here is a pretty large counter, and I experimented with several alternative implementations: 1) VHDL limited-range integer type, 2) unsigned type, 3) CoreGen-generated binary counter, and, finally, 4) DCM pre-scaling:


Figure 3: The timer with DCM prescaling

The DCM divides the clock frequency by 32, thus reducing the counter size. Here we compare the implementation alternatives: whether the function is implemented in general-purpose configurable logic blocks or fitted into dedicated special-purpose devices. The timer synthesis results are the following:

Counter design          Integer     Unsigned    CoreGen     DCM+Unsigned
FF                      33          33          31          36
Slices                  38          38          36          41
LUT = logic + routing   71 (45+26)  71 (45+26)  69 (44+25)  82 (51+31)

Table 1: The alternatives to implement the frequency division counter

This shows that CoreGen provides the best binary counter, while DCM pre-scaling results in even more logic than the full-range counter. This is a bit frustrating.i

Finally, the pB was incorporated into the design, replacing the Hex Digit counter. This is complete overkill, but it was done to formally fulfill the assignment and to acquire the pB interrupt processing skill.

Screen Buffer development

Since a one-digit indicator is rather limiting for showing the timer value, I moved on to displaying an array of characters on a VGA monitor. This task matches well with the video buffer development that is useful for our goal, the video game. To show hexadecimal digits on the screen, their glyphs are described by 8x8 black-and-white bitmaps in a VHDL 16x8-element array of 8-bit vectors, one vector per glyph line. This constant array is depicted as FONT in the diagram below.

Figure 5: Displaying the text

This video buffer display is an implementation of the color map block in Figure 1. It maps the current scan pixel location within the 640x480 VGA screen, the (pixel_cnt, line_cnt) pair, onto an 80x60 video buffer location2, which contains a digit pointing to a glyph in the FONT ROM, within which the current pixel determines the output color. This cascade of mappings results in a very long combinatorial path. The resulting maximum frequency of 100.04 MHz satisfies the timing constraint only barely. The additional suspicion that introducing the PicoBlaze would add more delay motivated me to pipeline this design.

The content of the video buffer, the text to display, is submitted by the application. To provide this flexibility (I know that embedded design is opposed to flexibility, but it gives freedom to experiment here), the video buffer content was decoupled from the design: only its interface (VHDL entity) was specified. The interface is very simple: an application provides a parallel array, a bus of digits, up to 4800 characters wide. The details in the diagram in Fig. 6 complete the design description. The figures show how the application submits the screen buffer and that the buffer may provide fewer than 4800 characters to display.

Figure 6: The screen buffer details.

(a) Application provides the content.

(b) Blanking the screen beyond the screen buffer size
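The cascade of mappings from the scan position to the output pixel can be modelled in software as follows. This is a minimal sketch under stated assumptions: the name `pixel_on`, the most-significant-bit-is-leftmost glyph layout and the sample glyph for digit 1 are illustrative and not taken from the VHDL sources.

```python
COLS, ROWS = 80, 60  # text buffer size: 640/8 columns by 480/8 rows

# Assumed FONT layout: FONT[digit][glyph_line] is one 8-bit glyph line,
# with the most significant bit being the leftmost pixel.
FONT = [[0x00] * 8 for _ in range(16)]
FONT[1] = [0x18, 0x38, 0x18, 0x18, 0x18, 0x18, 0x7E, 0x00]  # illustrative "1"


def pixel_on(pixel_cnt, line_cnt, buffer):
    """Return 1 if the pixel under scan is lit, else 0."""
    row, col = line_cnt >> 3, pixel_cnt >> 3   # 8x8 pixels per character
    addr = COLS * row + col                    # video buffer location
    if addr >= len(buffer):
        return 0                               # blank beyond the buffer size
    digit = buffer[addr]                       # digit selects the glyph
    glyph_line = FONT[digit][line_cnt & 7]     # line within the glyph
    return (glyph_line >> (7 - (pixel_cnt & 7))) & 1
```

Each line of the function corresponds to one stage of the combinatorial path, which shows why registering the intermediate values (row base, address, digit) shortens the critical path.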

Two applications were used during the screen buffer development: 1) a Fixed Buffer submitting constant characters; and 2) a 4-digit Timer. The Timer application was used as a benchmark to monitor the improvements from pipelining.

In pipelining, the row base, that is, the 80xROW signal, was registered first. This gave a speedup from 100.04 to 120.4 MHz. The number of registers increased, as expected, while the amount of logic hardly changed. In a second pipeline experiment, the screen buffer address was registered in addition to the row base. This increased performance to 127.3 MHz; surprisingly, the amount of logic decreased and, just as surprisingly, the number of slices increased at the same time. Finally, the digit leaving the FONT was registered, peaking at 147 MHz. This seems to be the maximum for this block, since the critical path has now moved to the pixel_cnt computation in the h_sync block.

Below is the summary table (each pipeline stage is added to the previous one). Both XST and P&R timings are included to show that the XST estimations are pretty conservative, yet relatively good if we consider that the actual frequency produced by P&R always lies between the two adjacent pipeline estimations made by XST. The quality of estimation is important because estimation is the preferred method in co-design space exploration.

Pipeline                                  FFs  Slices  LUTs  XST freq [MHz]  P&R freq [MHz]
None (multiplier)                         95   136     257   74              94
None (shifting)                           95   141     266   87.4            100.04
Row_Addr (row*80 -> r)                    104  141     267   115             120
Addr (80 row+col -> r)                    123  148     262   125             127
Digit (FONT output -> r) (superpipeline)  127  148     262   143             147
Alt. design (buf_off, row, col -> r)      124  161     301   106             105
Alt. design (digit -> r)                  148  176     316   122             120

2 That is, 8x8 times smaller than the VGA screen.

Another curious fact discovered about the Xilinx tool set is that using a dedicated multiplier for x80 results in a smaller yet slower circuit than multiplying by shift-and-add: x*80 = x*5*16 = x*(1+4)*16.

Having finished this design, I realized that the expensive multiplication is used only for deriving the text video buffer location from the graphic screen pixel location, and that it can be eliminated completely if we run a similar video buffer scan in parallel with the 2D pixel scan instead of deriving the second from the first. The buffer column is incremented every 8 pixels and the row is incremented every 8 scan lines. However, the high expectations were not borne out in practice. As the last two rows of the previous table show, the design is much larger without any performance gain.

Obviously, the 4800-character multiplexer selecting the digit from the application is not an efficient video buffer implementation. Moreover, nobody is going to access all the characters of the buffer simultaneously. Usually, characters are stored in video memory and are accessed serially by both the VGA generator and the application rendering into the video memory. This demands a two-port RAM. In FPGA designs, BRAMs are used for this purpose. These are low-level Xilinx library components that have different size/width parameters in different FPGAs. Curiously, in the case of the fixed buffer application the smart XST synthesizer infers the multiplexer as a ROM! Furthermore, P&R automatically translates it into BRAM! Therefore, there is no need to deal with BRAM components explicitly, which 1) saves us from going into unnecessary details and 2) avoids tying the design to a specific FPGA chip. This allows us to use high-level VHDL array constructs instead of assembler-style instantiation of Xilinx BRAM components. The XST user guide confirms this intuitive guess: to infer a memory block, no more than two simultaneous read/write accesses to the array are allowed.

This discovery helped to complete the video buffer design. The game is based on the superpipeline (see the table) color map, which uses a video buffer similar to the one depicted in Fig. 7 but a little more advanced, because the game needs to read the video content occasionally.


Figure 7: A memory-based video buffer.
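The two address-generation schemes compared in this section, multiplication by shift-and-add and a buffer scan running in parallel with the pixel scan, can be checked against each other in software. The sketch below is illustrative only: the function names are hypothetical, and the counters model the idea rather than the actual VHDL.

```python
def times80_shift_add(x):
    """x*80 = x*(1+4)*16, using one addition and two shifts."""
    return (x + (x << 2)) << 4


def parallel_scan(width=640, height=480, cols=80):
    """Yield (pixel_cnt, line_cnt, addr), keeping addr with counters only."""
    row_base = 0                      # plays the role of buf_off
    for line_cnt in range(height):
        col = 0
        for pixel_cnt in range(width):
            yield pixel_cnt, line_cnt, row_base + col
            if pixel_cnt & 7 == 7:    # column advances every 8 pixels
                col += 1
        if line_cnt & 7 == 7:         # row advances every 8 scan lines
            row_base += cols          # add 80 instead of multiplying
```

Both schemes produce the same address for every pixel; they differ only in the hardware they cost, which is what the table above measures.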

Lab3: The Game (Xonix)

This text-mode game was my first contact with a computer, back in 1986 or so when it had just appeared. The computer's abilities looked like magic, and I firmly committed to becoming a programmer.

The rules are simple: the bugs either move on the white border ground or fly in the internal black air space. The player must grab as much air as possible while avoiding the bugs. The player is also killed if a fly touches his trace. The processor moves the bugs along wall-reflecting diagonals. The player controls his moves with a PS/2 keyboard.

The 10 lowest digits of the previously developed 16-hex-digit font are used for displaying the score, the amount of captured air. The rest, the six most significant glyphs, have been modified to depict the background and the moving objects. This coding results in 100% efficient video memory use (but leaves no room for displaying any ASCII text).

The rules are pretty straightforward to implement. An instance of pB carries most of the game complexity. It draws the moving objects, namely the player and the bugs, and idly waits for the 100 ms time event. The event starts a new game round, in which pB computes new locations for the objects. It then cleans the objects by rendering the background, restarts the main loop by drawing the objects in their new positions, and waits for the next 100 ms time flag.

The only hard thing is the air capture algorithm, which runs when the player reaches the ground after tracing in the air. I had no better idea than the following:

1. Fill the fly-containing air with gas cells.
   a. Replace the flies by gas cells.
   b. Iterate over the screen, replacing the air in contact with gas by gas, until the amount of gas stops increasing.
2. Replace the rest of the air with ground.
3. Restore the air from the gas.

No double buffering is used, but, because the processor is very fast, cleaning the objects in their old places and redrawing them all in new places takes about 400 instructions.
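The air-capture steps listed above can be sketched on a small character grid. This is a model under assumptions, not the pB program: the cell symbols, the 4-neighbour meaning of "in contact", and the premise that the player's trace has already been turned into ground are all illustrative choices.

```python
GROUND, AIR, FLY, GAS = "#", ".", "F", "g"  # hypothetical cell encoding


def capture_air(grid):
    """Run the three capture steps in place; return the number of captured cells."""
    rows, cols = len(grid), len(grid[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]

    # Step 1a: replace the flies by gas cells.
    for r, c in cells:
        if grid[r][c] == FLY:
            grid[r][c] = GAS

    # Step 1b: sweep the screen until the amount of gas stops increasing.
    grew = True
    while grew:
        grew = False
        for r, c in cells:
            if grid[r][c] != AIR:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == GAS:
                    grid[r][c] = GAS
                    grew = True
                    break

    # Step 2: the air that no gas reached is captured and becomes ground.
    captured = 0
    for r, c in cells:
        if grid[r][c] == AIR:
            grid[r][c] = GROUND
            captured += 1

    # Step 3: restore the air from the gas.
    for r, c in cells:
        if grid[r][c] == GAS:
            grid[r][c] = AIR
    return captured
```

In this model the fly cells come back as air, on the assumption that the main loop redraws the flies from their stored positions afterwards. The repeated full-screen sweeps in step 1b explain why the capture is the most computation-intensive part of a round.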

Those 400 instructions last as long as 200 VGA pixels, or 1/4 of a scan line. Therefore, no flickering is visible. Even the computation-intensive air capture, which takes place between cleaning and redrawing the objects, is unnoticeable. In my childhood experience, the capture froze the game for many seconds. Now I know what caused the computer to think for so long. The PicoBlaze running at 100 MHz accomplishes this instantly.

The VGA block displaying the video memory was described earlier in this paper. Its superpipelined version is used in this game. Let us diagram the rest of the hardware, which helps the PicoBlaze to receive the 100 ms timer events, read the keyboard and access the video memory:

The video buffer is accessed through three ports: two writable ports, COL and ROW, which select the digit address in the video buffer, and the data port for writing and reading the digit at the selected address. The timer raises the Event flag every 100 ms; the flag is reset when pB reads it. The PS/2 block triggers on the falling edge of the 8-sample-filtered ps2_clk. The trigger shifts in one bit of ps2_data. As soon as 9 bits have been received, the 8 data bits (parity is ignored) are shifted into the key buffer register. The stop bit interrupts the pB and resets the bit counter. The pB reads the received key code in the interrupt handler routine.

The main loop synchronizes with real time once per game round. Since a game round is supposed to be faster than the 100 ms events (this is not true when an area is captured), the dedicated timer provides precisely 100 ms rounds. I avoided using an interrupt for time events because

1. Having finished the round, the processor has nothing to do other than polling the timer anyway.
2. If the computation in some round takes longer than 100 ms, it still has to finish, and the interrupt will not improve this situation.

As with the timer, we read the pressed keyboard key only once per round. However, the PS/2 keyboard events may come at any moment and are therefore easier to miss. One might think that the difference is insignificant because only the last key press matters and those that arrive during a lengthy computation may safely be discarded. However, the keyboard events are not that simple. We must ignore the key-up events and receive the extended codes. Therefore, to avoid missing keyboard events, I decided to interrupt the processor when one arrives. Furthermore, I planned to pull more PS/2 functionality, namely the data shifting and bit counting, from the HW into the processor. There, timing becomes even more critical because of the negative consequences of missing a bit and the order-of-magnitude higher frequency of the events. These considerations determined what should be read by polling and what by interrupt (the scheduling decisions).

The game is controlled by the extended keys UP, LEFT, RIGHT and DOWN. The fact that their extended codes do not conflict with the basic codes allowed me to simplify the keyboard processing by ignoring the extended-code attribute. Additionally, the program filters out any PS/2 key-up events, which requires the participation of both the ISR and the read routine in the main loop. Incidentally, the ISR and the main loop may be considered two separate control threads. The first produces key presses by setting the INT_CHAR register, while the other copies the value and resets the register to 00 (invalid key). This producer-consumer ITC pattern must be guarded by some synchronization section. Since only the ISR may interrupt the main loop, disabling the interrupt during the consumption suffices for synchronization3.
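The producer-consumer exchange through INT_CHAR, together with the key-up filtering, can be modelled as below. This is a sketch under assumptions: the class and method names are hypothetical, the prefixes 0xF0 (key released) and 0xE0 (extended code) are those of PS/2 scan code set 2, and all filtering is placed in the ISR for brevity, whereas the text says the real program splits it between the ISR and the read routine.

```python
BREAK_PREFIX = 0xF0     # scan code set 2: precedes the code of a released key
EXTENDED_PREFIX = 0xE0  # scan code set 2: precedes an extended code
NO_KEY = 0x00           # the "invalid key" value mentioned in the text


class Keyboard:
    def __init__(self):
        self.int_char = NO_KEY          # models the INT_CHAR register
        self.skip_next = False          # set after a break prefix
        self.interrupts_enabled = True  # models the pB interrupt enable flag

    def isr(self, scan_code):
        """Producer: called once for every byte received from the keyboard."""
        if scan_code == EXTENDED_PREFIX:
            return                      # the extended attribute is ignored
        if scan_code == BREAK_PREFIX:
            self.skip_next = True       # the next byte belongs to a key-up event
            return
        if self.skip_next:
            self.skip_next = False      # drop the released key's code
            return
        self.int_char = scan_code       # publish the key press

    def read_key(self):
        """Consumer: called once per game round from the main loop."""
        self.interrupts_enabled = False  # critical section: copy, then reset
        key = self.int_char
        self.int_char = NO_KEY
        self.interrupts_enabled = True
        return key
```

For example, pressing and releasing the UP arrow sends E0 75 followed by E0 F0 75; the model reports 0x75 once and then NO_KEY. The flag only marks where the real program disables interrupts; a single-threaded model cannot actually be preempted.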

Lab4: Doing more by SW

The fact that the pB runs the game very quickly while using only 1/3 of the code space, along with the lab requirements, motivated me to move some functionality up into the controller. Below are the summary table and the description.

Design         Program size     FFs  Slices  LUTs (functional,  XST speed  P&R timing
               [instructions]                route-through)     [MHz]      [MHz]
Original       381              250  314     542 (390, 84)      83         101.7
1. PS/2        393              202  251     440 (319, 83)      83         91
2. Video Addr  435              199  240     426 (305, 53)      83         97.3
3. Delay       444              174  212     368 (268, 32)      88         103

3 The synchronization avoids losing key data when the consumer writes the null while the producer simultaneously writes a newly received scan code. Maybe this level of gracefulness is paranoid overkill for a game, but it trains doing things right.


1. PS/2

The pB running at 100 MHz executes 50 million instructions per second. Therefore, a small interrupt routine should be quite able to handle the PS/2 bytes that occasionally arrive in bursts of about 11 bits at 20 kHz. In the HW, we leave only the ps2_clk filter and the falling-edge detector, which now raises the Ready flag that directly interrupts the processor. The bit shifting and counting functions are moved to the pB. Only the ISR required a minimal code increase in the program, but this negligible code increase consumed 3 additional pB registers.

The resulting HW is far more compact. It is slower, however. This is a surprising result, because eliminating the HW should remove the key reading logic from the quite critical pB read path and leave more room for placement. I believe the reason is that the HW optimization took another route and that, with a higher P&R effort, the result should be better. Although the frequency falls below 100 MHz, the overclocked game runs properly. In addition to the FPGA savings, the SW implementation of the PS/2 protocol theoretically allows much more flexibility in processing parity and communication break conditions.
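The bit shifting and counting that move into the ISR can be modelled as below, together with the instruction budget implied by the figures in the text. This is a sketch under assumptions: the class and helper names are hypothetical, and the frame layout (start bit, 8 data bits least significant first, odd parity, stop bit) is the standard PS/2 device-to-host frame rather than something read from the pB sources.

```python
CPU_INSTR_PER_SEC = 50_000_000  # from the text: pB at 100 MHz
PS2_BIT_RATE_HZ = 20_000        # from the text: about 20 kHz
# Instructions the pB can execute between two consecutive PS/2 bits.
INSTR_PER_BIT = CPU_INSTR_PER_SEC // PS2_BIT_RATE_HZ


class Ps2Receiver:
    """Software model of the ISR invoked on every falling edge of ps2_clk."""

    def __init__(self):
        self.shift = 0    # shift register collecting the data bits
        self.count = 0    # bit counter within the 11-bit frame
        self.key = None   # last completely received byte

    def on_falling_edge(self, data_bit):
        self.count += 1
        if 2 <= self.count <= 9:        # data bits D0..D7, LSB first
            self.shift = (self.shift >> 1) | (data_bit << 7)
        elif self.count == 11:          # stop bit: byte complete
            self.key = self.shift
            self.shift = 0
            self.count = 0
        # bit 1 (start) and bit 10 (parity) are ignored in this model


def frame(byte):
    """Helper for experiments: build the 11-bit frame for one byte."""
    data = [(byte >> i) & 1 for i in range(8)]
    parity = (sum(data) + 1) & 1        # odd parity over data + parity bit
    return [0] + data + [parity, 1]
```

With about 2500 instructions available per bit, an ISR of a few dozen instructions leaves ample time, which agrees with the observation that the overall game still runs properly.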

2. Video address computation

The success of the PS/2 migration prompted further relocations. Since the pB is an 8-bit controller, it is convenient to operate on values in the range 0..255. While rendering the picture, the pB constantly moves the screen buffer cursor to read/write a digit at its position. It is convenient, therefore, to control the cursor location with a row/col pair of registers for the 80x60 text screen. Originally, dedicated logic helped the pB to compute the 13-bit video buffer address: ADDRESS = 80 x ROW + COL. In this design, I decided to move the computation into the pB.

This time, the program had to undergo substantial rewriting. All outputs to row/col had to be replaced by a call to the address recomputation routine. Every time a row or column changes, the low and high address parts are computed and output to external registers. The row change is expensive even though the multiplication is implemented by shifting; the pB address routine therefore memoizes it. In addition, the program uses more registers. The available registers were exhausted, and the scratch-pad memory is used to cache the current row base address. In the computation of the next object positions, the scarcity of registers had to be compensated for by more code.4 Finally, the program runtime was adversely affected: the air-capture lag becomes visible. Some aesthetes may not agree, but I maintain that this flickering does not degrade the gameplay. It does not appear in other situations.

4 It could be reasonable to evict some PS/2 interrupt registers into scratch-pad memory instead. The deterministic slowdown of the rarely used PS/2 routine would not harm the keyboard reception, while it would boost the intensely exploited video address computation.

Whereas the PS/2 migration brought a fair HW benefit at a relatively minimal program cost, the video address migration has the opposite qualities: a high programming load brings very small HW savings.
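The software address computation described in this subsection, ADDRESS = 80 x ROW + COL built from 8-bit operations with a memoized row base, can be modelled as follows. This is only a sketch: the helper names are hypothetical, and the particular sequence of shifts and add-with-carry steps is an assumption, not a transcript of the pB routine.

```python
def shl16(hi, lo):
    """Shift a 16-bit value held in two 8-bit registers left by one bit."""
    carry = lo >> 7
    return ((hi << 1) | carry) & 0xFF, (lo << 1) & 0xFF


def add8(hi, lo, value):
    """Add an 8-bit value to the low byte and propagate the carry (ADD, ADDCY)."""
    total = lo + value
    return (hi + (total >> 8)) & 0xFF, total & 0xFF


def row_base(row):
    """Return 80*row as (hi, lo) using 80*row = (4*row + row) * 16."""
    hi, lo = 0, row
    for _ in range(2):            # 4*row
        hi, lo = shl16(hi, lo)
    hi, lo = add8(hi, lo, row)    # 5*row
    for _ in range(4):            # 80*row
        hi, lo = shl16(hi, lo)
    return hi, lo


class Cursor:
    """Memoizes the row base so that only row changes pay for the multiply."""

    def __init__(self):
        self.cached_row = None
        self.base = (0, 0)
        self.recomputed = 0       # how many times the row base was rebuilt

    def address(self, row, col):
        if row != self.cached_row:
            self.base = row_base(row)
            self.cached_row = row
            self.recomputed += 1
        return add8(*self.base, col)   # (hi, lo) parts for the external registers
```

Scanning the whole 80x60 screen row by row rebuilds the row base only 60 times for 4800 addresses, which is the saving the memoization is meant to provide.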

3. Delay

Finally, I realized that there is no need for a precise round-time period. In other words, the real-time timer is not necessary to implement the delay. The fact that every round iteration takes almost the same time allows a constant delay between rounds to be implemented in SW. A considerable amount of logic was saved and timing improved by removing the timer, while the program complexity grew minimally: no extra registers were used and no runtime delays were incurred. The frequency returns above 100 MHz in this version. This optimization would hardly be feasible starting from a system-level description because of the looseness of the wait-cycle constraint.

The acquired experience

This work does not give the important skill of developing a system from the system level, e.g. in the SystemC language, with the corresponding automated design space exploration. Nevertheless, relocating processes from dedicated logic to a universal processor by hand, and greatly increasing the power of a tiny uP by extending it with a little special HW, e.g. the 13-bit video buffer address computer evaluated in Lab4.2, gives a rather good feeling for the codesign subject at a low level.

This laboratory experiment shows that a whole range of functions can be jammed into a tiny processor. All registers, most of the scratch-pad memory, half of the instruction set5 and less than half of the code ROM still were unused. I feel, however, that these are not all the optimizations possible. The fixed processor architecture suggests that the PicoBlaze is a hard core. Some of the unconnected blocks might indeed be trimmed during synthesis, thus meeting the design requirements to some extent. However, the synthesizer does not look into the program to figure out which registers, scratch-pad memory, instructions and code space are unused and, therefore, they all persist. This might be justified for ASIC RISCs produced in millions but looks questionable from the FPGA and codesign point of view, which teaches and enables configurable processors. Pruning unnecessary parts is the minimum one expects of a soft core, let alone the more advanced ASIP facilities. Some configuration options are available for MicroBlaze. Probably, this is too much for such a simple thing as the PicoBlaze. As our example shows, a fair portion of the pB resources is utilized. There would not be much benefit in removing the rest.

5 Notably, the negation was implemented using knowledge from the course on computational algorithms: -x = not(x) + 1 = (XOR x, FF) + 1. The arithmetic shifts and ADDCY were extremely useful in the last version of the game for the 13-bit video buffer address computation. The not-so-RISC conditional calls were sadly overlooked and not exploited.

What I really miss in any PicoBlaze design is the combination of the registers with the I/O ports. To output a value into an external register, it must first be loaded into a pB register. This duplication wastes instruction set, code ROM, time, energy and LUTs. Furthermore, it requires keeping a copy of the value inside the pB in case we want to read it later6. A similar hassle repeats with input ports: a value must first be loaded into an internal register before it can be used by the pB. To avoid this duplication, it seems very desirable that the register file be accessible from both the pB and external devices. The sharing may be achieved by merely letting the user instantiate the registers. This also lets the user control their number. Like the I/O ports, the registers can be virtual: they may feed a HW combinatorial path. The idea is depicted in Figure 8.

Figure 8: Register duplication elimination with a flexible register file.

(a) The typical duplication of registers

(b) User-instantiated registers would open access to the uP registers for external HW.

My above criticism of the pB optimization and register file was discredited by my own experience of building a minimal-instruction-set (LOAD, ADD, AND, XOR, JZ and JUMP) RISC processor in FPGA based on the presented idea of combined I/O registers. Eight-bit data and a 1024x18-bit instruction BROM made it an analog of the pB. Supplemented with a couple of I/O registers, this custom processor was several times simpler than the pB at the logic level. Nevertheless, the implementation consumed more Spartan3 logic. Now, having developed the screen buffer from the initial multiplexer design, I may guess that sharing the registers between the uP and external logic results in inefficient multiplexing, whereas the pB register file is implemented as a compact distributed RAM. Whatever the cause, the general impression is that the pB is a core that was carefully tuned by hand by competent Xilinx personnel for their FPGAs. This means that users would have to configure the uP for their application with great care, which is an additional argument against doing this.

This work was implemented in Xilinx WebPack 10 with the default options. Perhaps it is this low effort that causes the speed anomaly in the summary table of the previous chapter. However, the default options were used for easier reproduction of the results and for faster synthesis, favoring more agile development iterations. Since the timing is close to and sometimes below 100 MHz, I at one point ended up with an overly long pipeline around the pB. This resulted in a demand for a delay slot: without it, some screen buffer I/O failed immediately after the address was written. Later, this was abandoned. Yet this is a part of the co-design process, and there remains plenty of runtime and code room for delay slots if necessary.

6 Or add HW to read the external register.

Simulation is considered a way to estimate performance in co-design methodologies. In this work, it was used just to tune the VGA generator and to debug the PicoBlaze I/O timing. The rest of the project, as soon as the pB was incorporated, was implemented directly on the development board and estimations were made in vivo, because the board was available while simulation was impractical: the simulator is unbearably slow when the pB is involved, and a PS/2 test bench would be unreasonably complex to develop. The facility to reprogram the BRAM in the bit stream file, avoiding re-synthesis, was extremely useful.


i Later, I discovered that different styles of VHDL result in different synthesis. The advice to split FSMs into combinatorial and register processes, although it conveys the concept to VHDL students, seems not that good for implementing real counters. The Xilinx tools demand the single-process template to recognize a dedicated binary counter, which is more compact and faster than one built from general logic. The fading in Fig. 4(b) shows the part taken up by the dedicated counter.

Figure 4: Different VHDL templates result in different counter implementation.

(a) Two-process template infers the counter made of general-purpose logic.

(b) Single process infers a dedicated counter. Only reset condition remains in the general logic.
