
Institut für Technische Informatik und Kommunikationsnetze Computer Engineering and Networks Laboratory

Departement Elektrotechnik Professur für Technische Informatik Professor Dr. Lothar Thiele

Matthias Dyer, Marco Wirz

Reconfigurable System on FPGA

Diploma Thesis DA-2002.14

Winter Term 2001/2002

Tutors:

Ch. Plessl, H. Walder

Supervisor:

Prof. Dr. Lothar Thiele



Abstract

FPGAs keep getting larger and faster. They have reached a level where a whole 32-bit CPU fits into a single FPGA without even filling it, so FPGAs can house quite large logic circuits. Another development branch leads to dynamically reconfigurable FPGAs: certain areas within the FPGA can be reconfigured while the rest continues to run unaffected.

The next step is to combine these two abilities. In this thesis we show how we implemented a CPU on an FPGA and combined it with additional cores which can be dynamically exchanged while the CPU continues to run unaffected. In doing so, we want to use a flow which allows us to implement the entire design with mainstream synthesis tools.

We explain the steps it took to build the whole CPU on the FPGA, to add a network card to transport data to the FPGA, and to get two different audio codecs to work. These audio codecs are the dynamic units on the FPGA; they are replaced on demand by dynamic reconfiguration. We also describe the installation of the operating system that runs on the CPU, including the development of the necessary network driver and the application program.

We then present the techniques used to create bitstreams for partial reconfiguration using the JBits SDK, and the difficulties that arise because FPGA implementation tools offer no routing constraints.




Preamble

In summer 2001, when we decided to do this project as our diploma thesis, both of us could hardly program anything in VHDL. We had done an introductory course in VLSI, but that was about all. Neither did we know much about the internals of an FPGA. Of course we had heard about this topic in different lectures here at the Swiss Federal Institute of Technology, and we had occasionally played with FPGAs, but only with graphical design tools. In short, we didn’t really know what to expect.

So in late October, when we finally started, we first had to read a great deal about all these tools, but we actually got LEON to run within two weeks. This was mainly because LEON had already been configured for the development board we were using, and we were by far not the first to try this, but it was good for our motivation to go on to the harder parts.

Writing a PCM codec was one of the simpler tasks. It gave us our first experience with VHDL, and we soon got into it. For the network card, we found a project from the University of Queensland, Australia, which implemented a complete IP stack in VHDL.

But the network interface in that project seemed quite complicated, so we tried to do it in a simpler way, and we can say that we at least partially succeeded. Our solution is surely not as simple as it could be and is far from finished, but it is an intuitive design and the implemented part works.

It took us quite a while to implement the whole card, and it was the day before the Christmas break when it finally performed as it should. So after the break we could start writing the software driver.

This was another field where we had little to no experience, so this too took quite a while to reach its final form.

In the meantime, the experiments with configuration on the FPGA had reached a stage where we could start using JBits to create a partial bitstream for dynamic reconfiguration. Once more, this was a field in which we were absolute newbies. After a lot of trial and error we managed to get the reconfiguration working, but the newly configured audio codec would only produce a loud whistling noise.

The big breakthrough only came one day before the final presentation. On the JBits mailing list, someone remarked that for him some feature worked with JBits version 2.7 but not with 2.8. This gave us the idea to try the older version 2.7 ourselves. And what a surprise: suddenly we could reconfigure successfully. So at the presentation we could at least say "it works!"

During the whole thesis, we both learnt a lot. We gained insight into fields as different as VHDL programming, FPGA configuration, CPU design, operating system architecture, the Ethernet protocol and audio codecs.

It was a very interesting time. If we had to choose, both of us would do the same project again.

Finally, we would like to thank a few people who helped us in one or the other way:

Our tutor Christian Plessl: He gave us great support with ideas. We could always bug him with our questions, which he answered helpfully. He was also right at hand for questions concerning presentation and documentation.

Our co-tutor Herbert Walder: His experience with JBits and dynamic reconfiguration in particular was very fruitful for our work.

Prof. Lothar Thiele: We thank him for letting us do this thesis in his research group. The provided infrastructure was an important key to our success.

Michael Lerjen: Another student doing his diploma thesis. We could always ask him when we had yet another problem with VHDL. Without him, our network card would not have improved so much. We’d especially like to thank him for the improved CRC checker code from the University of Queensland, which he adapted for his own project in such a way that it was also usable for ours.

Other students in our lab: Finally, we don’t want to forget all the other students in our lab doing their own theses. We had a lot of interesting discussions, especially during lunch hours, not only about our projects but about nearly everything, though mostly related in one way or another to computers.

Zürich, March 15, 2002

Marco Wirz Matthias Dyer




Contents

Abstract
Preamble
Contents
Figures
Tables

1: Preface
1.1 Motivation
1.2 Problem Task (in German)

2: Dynamic Reconfiguration
2.1 Introduction
2.2 Virtex FPGA Architecture Overview
2.3 Dynamic Reconfiguration for the Virtex Series FPGA
2.4 Design Flows
2.4.1 Flow 1: Without JBits
2.4.2 Flow 2: JBits Only
2.4.3 Flow 3: Combined Flow
2.4.4 Flow 4: Use JBits to merge Cores

3: Development Platform
3.1 Xilinx Virtex XCV800 FPGA
3.2 XSV800
3.3 The LEON Processor
3.3.1 VHDL Configuration
3.3.2 Booting
3.3.3 Top Level Design
3.3.4 UARTs 1 and 2
3.3.5 Synthesis with Synopsys
3.3.6 Implementation
3.4 The Operating System RTEMS
3.4.1 Program Structure

4: Network
4.1 Overview
4.2 Ethernet
4.3 Prerequisites
4.3.1 Hardware
4.3.2 Software
4.4 Network Hardware
4.4.1 Architecture
4.4.2 Address Decoder
4.4.3 FIFOs
4.4.4 CRC
4.4.5 Receiver
4.4.6 Sender
4.4.7 Possible Improvements
4.5 Software
4.5.1 Streaming Data to LEON
4.5.2 Driver
4.5.3 Application on LEON
4.5.4 Application on PC

5: Implementation of a dynamic reconfigurable System
5.1 Partitioning
5.2 Interface
5.3 Virtual Components
5.3.1 VC Interface
5.3.2 PCM Player
5.3.3 ADPCM Player
5.4 Constraining the Design
5.4.1 Floorplanning
5.4.2 Guided Routing
5.4.3 CLB Macros
5.5 Bitstream Manipulation with JBits
5.5.1 Introduction
5.5.2 Function Blocks
5.6 Partial Reconfiguration
5.7 Implementation Results
5.7.1 Dynamic Routing Flow
5.7.2 Direct Copy Flow
5.7.3 Network
5.7.4 Design Facts

Conclusions
Future Work

A: LEON VHDL files
B: UCF Constraint File
C: Installing and Compiling RTEMS
D: Miscellaneous

Bibliography



Figures

2-1 Example of Dynamic Reconfiguration
2-2 Basic architecture of a Virtex FPGA
2-3 Virtex 2-Slice CLB
2-4 Dynamic Reconfiguration (Flow 1)
2-5 Flow 4: Use of JBits to directly copy a module
2-6 Flow 4: Use of JBits and dynamic routing
3-1 Block Diagram of XSV800-Board
3-2 Block diagram
4-1 Architecture of the network card
4-2 State machine of the receiver
4-3 State machine of the sender
5-1 Application of a dynamic reconfigurable System
5-2 Partitioning of our reconfigurable system
5-3 Detailed view of the interface
5-4 Virtual Component Entity Schematic Symbol
5-5 Timing for AK4520A Stereo Codec
5-6 ADPCM Player Stages
5-7 ADPCM Splitter FSM
5-8 Control Path State Diagram
5-9 ADPCM Decoder Architecture
5-10 Floorplanning, Guided Routing and CLB Macros
5-11 Guided Routing
5-12 Internal view of a pass-through CLB
5-13 Double Stage CLB Macro
5-14 JBits Design Flow for partial Reconfiguration



Tables

3-1 Main configuration of LEON
4-1 Ethernet frame
4-2 Memory locations of the network card
4-3 Possible hardware improvements
5-1 VC Input and Output Signals
5-2 Sequence to produce serial audio data
5-3 ADPCM word format
5-4 ADPCM Control Sequence (in pseudo VHDL)
5-5 Step Shift Register States



1 Preface

1.1 Motivation

For many years much of the research activity in computer architecture was focussed on designing fast general purpose CPUs. Driven by new applications, particularly from the multimedia domain, general purpose CPUs were enhanced with additional functional units that better support the peculiarities of these applications. It turned out that, in spite of all advances in computer architecture, the computing power of general purpose CPUs is not sufficient for certain applications, e.g. real-time video compression. Usually these kinds of applications are enabled by dedicated hardware based on application specific integrated circuits (ASICs). While an ASIC is an appropriate solution for a fixed application, it has the inherent disadvantage that its functionality cannot be changed, so it cannot be used for a different purpose.

Reconfigurable computing is likewise based on the idea of accelerating the computing intensive parts of algorithms with application specific circuits. But in contrast to the ASIC approach, the circuits are implemented in reconfigurable logic. Usually, a reconfigurable computing system consists of a general purpose CPU coupled to a reconfigurable device, for instance a field programmable gate array (FPGA). While the reconfigurable unit takes care of the computing intensive kernels of the applications, the CPU is used for the rest of the computations. The ability to change the functionality of the circuit simply by reprogramming the reconfigurable unit adds greatly to the flexibility of such a system.

In the last couple of years a new generation of high-density and high-speed FPGAs has emerged, for instance the Xilinx Virtex series. The capacity of these devices is sufficient for implementing a whole reconfigurable computing system consisting of a 32-bit CPU core and a reconfigurable unit. Implementing the complete system in an FPGA enables an arbitrary coupling of CPU core and reconfigurable units.

Xilinx Virtex devices provide an advanced reconfiguration feature called ’partial reconfiguration’. Partial reconfiguration makes it possible to reconfigure only parts of the FPGA, while the parts of the circuit that are not subject to this reconfiguration keep working without interruption.

The availability of free CPU IP cores and high-density, partially reconfigurable FPGAs is the foundation of this work. The idea of this thesis is to investigate how a reconfigurable computing system based on a 32-bit CPU core can be implemented on a Xilinx Virtex FPGA. To support the replacement of reconfigurable units at runtime, a generic interface for components needs to be defined. To make the system really usable, a method for generating FPGA configurations for the system and the reconfigurable units is needed. As none of the existing mainstream FPGA circuit synthesis tools supports partial reconfiguration, a framework which provides this functionality is required. Ideally this framework builds on the ordinary, well-tried design tools and adds support for partial reconfiguration on top of them.

1.2 Problem Task (translated from the German original)

DA-2002.14: Reconfigurable System on FPGA

Winter semester 2001/02

Christian Plessl
Institut für Technische Informatik und Kommunikationsnetze
ETH Zürich

23 October 2001

Supervisor: Christian Plessl
Students: Matthias Dyer, Marco Wirz
Duration: 23 October 2001 - 1 March 2002

1 Background

For many years, research activity in CPU architecture concentrated mainly on designing ever faster general purpose CPUs. Great progress was made, and the computing power of high-performance CPUs has been rising for years through further advances in architecture. This is achieved, for example, by adding specialised execution units, e.g. data processing units for multimedia applications.

It has turned out that, despite all progress in computer architecture, some applications demand still more computing power. These include in particular applications from cryptography and multimedia, such as audio, video and image processing. To provide the computing power these algorithms need, one usually resorts to application specific integrated circuits (ASICs), which are integrated into the computer system to accelerate a particular task, e.g. plug-in cards for MPEG video compression.

Reconfigurable computing takes the idea of accelerating time-critical parts of algorithms with problem-specific hardware one step further. The basic idea is to implement the problem-specific hardware with programmable hardware devices. Moving away from static hardware makes it possible to reprogram the hardware during operation (so-called reconfiguration) and allows the available hardware resources to be used dynamically for different applications. There is by now a multitude of reconfigurable computing research projects that examine this idea more closely for different application scenarios. Such a system is usually based on an ordinary CPU coupled with configurable logic that is used only for the time-intensive kernels of the applications. At our institute, Zippy is a research project in this field as well.

One way to implement application-specific hardware in reconfigurable computing systems is field-programmable gate arrays (FPGAs). They are the most advanced programmable logic devices available today and are used in a wide variety of applications. Today's high-density FPGAs can hold circuits with a complexity of up to several million gates.

Thanks to the large capacity of FPGAs, integrating a complete CPU on an FPGA is quite possible. It is therefore feasible to implement a complete reconfigurable computing system on a single device instead of using a dedicated CPU. The advantage of such a system lies in the very close coupling of the CPU and the user-configurable logic. This allows different kinds of coupling between the CPU and the reconfigurable logic to be investigated, or the CPU itself to be extended with reconfigurable computing units. Implementing the CPU itself on the FPGA means that the CPU is slower than a dedicated ASIC implementation. An open question is whether the flexibility gained in integrating CPU and logic can make up for the loss in CPU speed.

2 Problem Statement

In recent years several projects have emerged that implement a complete CPU including peripherals in an FPGA. While the first of these projects were rather simple 8- or 16-bit CPUs [4], the increased capacity and speed of current FPGAs also allow more powerful 32-bit CPUs to be implemented, e.g. the 32-bit CPU LEON [3], which implements the SPARC V8 instruction set.

Today various CPU components exist for FPGAs, both as free designs and as commercial products. These components are called IP cores. A distinction is made between two [...]

[...] implement test circuits.

2. Fundamentals of LEON / SPARC: Get an overview of LEON [3] and the SPARC V8 architecture [7].

3. Implementation of the CPU core: Draw up a concept for how the CPU is to be implemented on the board. Implement and test your CPU.

4. Development environment for the CPU core: To allow comfortable software development for the board, you need a boot loader, an assembler and a linker. Use the GNU cross compiler and the GNU Binutils for SPARC for this purpose. A specially adapted LEON cross compiler kit (LECCS) [2] is available. Investigate whether a simple embedded operating system such as eCOS [6] or RTEMS [1] is useful for your purposes and whether it can be adapted to this platform. The XESS board offers a large number of peripheral interfaces; implement the interfaces that are important for debugging and testing [5].

5. Integration of user-configurable logic into the CPU core: Investigate different ways in which user-configurable logic can be accessed from the CPU (memory mapped, AMBA bus, reconfigurable functional units) and implement a suitable variant.

6. Concept for dynamic reconfiguration of the user-configurable logic: Familiarise yourself with the partial reconfiguration mechanisms of the Xilinx Virtex [10]. Investigate how a generic component that can be exchanged for another one at runtime can be integrated into the CPU core design.

7. Design flow for the reconfigurable computing system: Investigate how your concepts for dynamically reconfiguring the user-specific units can be put into practice. Examine the capabilities of the JBits tool from Xilinx. Define and implement a demonstrator that shows the principles of dynamic reconfiguration.

4 Organisation

• Schedule: At the beginning of your work, draw up a realistic schedule together with your supervisor. Keep a running record of your progress.

• Documentation: Document your work carefully. Pay particular attention to describing your considerations and design decisions.

References

[1] OAR Corp. RTEMS homepage. http://www.rtems.com.

[2] Jiri Gaisler. LECCS LEON/ERC32 cross compilation system. WWW. http://www.gaisler.com/leccs.html.

[3] Jiri Gaisler. The LEON processor user’s manual. Gaisler Research, version 2.3.7 edition, August 2001.

[4] Jan Gray. Building a RISC system in an FPGA. Circuit Cellar, 116:26–33, March 2000.

[5] James O. Hamblen and Michael D. Furman. Rapid Prototyping of Digital Systems. Kluwer Academic Publishers, 2000.

[6] RedHat. eCos homepage. http://sources.redhat.com/ecos.

[7] SPARC International Inc., 535 Middlefield Road, Suite 210, Menlo Park, CA 94025, USA. The SPARC Architecture Manual, Version 8, sav080si9308 edition, 1992.

[8] XESS corporation, 2608 Sweetgum Drive, Apex NC 27502, USA. XSV Board V1.1 Manual, version 1.1 edition, September 2001.

[9] Xilinx. Xilinx Virtex 2.5V Field Programmable Gate Arrays, v2.5 edition, April 2001.

[10] Xilinx Inc. Xilinx Application Note XAPP138: Virtex FPGA Series Configuration and Readback, v2.4 edition, July 2001. http://www.xilinx.com/xapp/xapp138.pdf.






2 Dynamic Reconfiguration

2.1 Introduction

Dynamic Reconfiguration is the ability to update only a portion of the configuration memory in an FPGA with a new configuration without stopping the functionality of the unchanged section of the FPGA [13]. Dynamic Reconfiguration enlarges the design space for developers. Different logic functions can be stored in memory until the need arises for them to be configured into the FPGA.

Recent advances in the manufacturing process promise 50 million gates of reconfigurable logic by 2005 at substantially lower costs. The increased gate count, along with richer embedded feature sets, has greatly improved the economics of using reconfigurable technology. A single FPGA can simultaneously carry various complex cores such as processors, decoders and filters, to name just a few. Dynamic Reconfiguration makes it possible to replace a specific core when a new function is required. The situation resembles a computer with a large hard drive, where applications are stored for days before they are loaded into memory.

Imagine a system which uses five different cores over time, but never more than three simultaneously. Without dynamic reconfiguration you would need either a huge FPGA which can carry all cores at once, or three individual FPGAs which are each fully reconfigured. The first case is a waste of FPGA area. The second case implies increased hardware costs and power consumption.

With dynamic reconfiguration, an FPGA large enough to carry three cores (for all occurring combinations) will suffice. Figure 2-1 points out this advantage. All cores are stored in memory. On demand, an unused core can be replaced with a new core by a partial bitstream. The difference to a full reconfiguration is that the other cores aren’t affected by the reconfiguration and keep their state.
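The area argument above can be made concrete with a small calculation. The following Python sketch is only an illustration: the core sizes are hypothetical values chosen for the example, and the model ignores placement and fragmentation effects. It compares the area needed when all five cores are resident with the area needed when only the largest occurring combination of three must fit.

```python
from itertools import combinations

# Hypothetical core sizes in CLBs (illustrative values, not taken from the thesis).
CORE_SIZE = {1: 400, 2: 250, 3: 300, 4: 150, 5: 350}


def static_area(core_size):
    """Area needed when every core is resident at the same time."""
    return sum(core_size.values())


def dynamic_area(core_size, combos):
    """Area needed when only one combination of cores is resident at a time."""
    return max(sum(core_size[c] for c in combo) for combo in combos)


# Assumption: any three of the five cores may be active together.
ALL_TRIPLES = list(combinations(CORE_SIZE, 3))

print("all cores resident :", static_area(CORE_SIZE), "CLBs")
print("three at a time    :", dynamic_area(CORE_SIZE, ALL_TRIPLES), "CLBs")
```

If only some combinations actually occur, passing just those to `dynamic_area` gives a tighter bound.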

Figure 2-1: Example of Dynamic Reconfiguration

The following list points out some advantages of dynamically reconfigurable systems:

Rapid Prototyping: With dynamic reconfiguration a modular design can be implemented. A team of engineers can independently work on different pieces of a design and later merge these modules into one FPGA design. This parallel development saves time and allows for independent timing closure on each module. Dynamic reconfiguration also allows you to modify a module while leaving other, more stable modules intact.

Reconfiguration Speed: The time needed for a dynamic reconfiguration is proportional to the size of the configuration bitstream, which depends on the area to change. If only 20% of an FPGA is reconfigured, the programming time will also be 20% of that of a full reconfiguration.

FPGA Size: If the design uses some of the modules only temporarily, the FPGA area can be shared with dynamic reconfiguration. You can therefore use a smaller FPGA.

Glue Logic: Having the modules together on a single FPGA instead of on separate components allows flexible, complex and high-speed connections between the cores.
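The proportionality stated under Reconfiguration Speed can be written down in a few lines. This Python sketch assumes a purely linear model, as the text does; the 100 ms full configuration time is a hypothetical figure, and any fixed per-bitstream overhead is ignored.

```python
def partial_reconfig_time(full_time_ms, fraction_reconfigured):
    """Estimate partial reconfiguration time under a purely linear model.

    full_time_ms: time for a full reconfiguration (hypothetical input).
    fraction_reconfigured: share of the device being rewritten, 0.0 to 1.0.
    """
    if not 0.0 <= fraction_reconfigured <= 1.0:
        raise ValueError("fraction_reconfigured must be between 0 and 1")
    return full_time_ms * fraction_reconfigured


# Assumed full configuration time of 100 ms, 20% of the device rewritten.
print(partial_reconfig_time(100.0, 0.20), "ms")
```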

2.2 Virtex FPGA Architecture Overview

Figure 2-2: Basic architecture of a Virtex FPGA (labelled elements: Block RAM, IOB, CLB)

Figure 2-2 shows an architecture overview of a Virtex FPGA. Virtex FPGAs [1] are composed of an array of Configurable Logic Blocks (CLBs) surrounded by a ring of Input/Output Blocks (IOBs). On the east and west edges are Block RAMs (BRAMs).

The CLBs are the primary building blocks; they contain elements for implementing customizable gates, flip-flops, and wiring for connectivity. The IOBs provide circuitry for communicating signals with external devices. The BRAMs allow for synchronous or asynchronous storage of kilobits of data, though each CLB can also implement synchronous/asynchronous 32-bit RAMs.

Each CLB contains two slices (see figure 2-3). Each slice implements two 4-input Look-Up Tables (LUTs), two D-type flip-flops, and some carry logic. The general routing allows data to be passed to or received from other CLBs.
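A 4-input LUT can be thought of as a 16-entry truth table whose contents are set by configuration bits. The Python sketch below models this idea; the bit ordering (input F1 as the least significant address bit) is an assumption made for illustration and is not meant to reproduce the actual Virtex bitstream encoding.

```python
def make_lut4(func):
    """Build a 16-bit configuration word for a 4-input boolean function.

    Bit number `addr` of the result holds the output for the input
    combination whose binary encoding is `addr` (assumed ordering).
    """
    config = 0
    for addr in range(16):
        inputs = [(addr >> i) & 1 for i in range(4)]
        if func(*inputs):
            config |= 1 << addr
    return config


def lut4_eval(config, f1, f2, f3, f4):
    """Look up the output of a configured LUT for the given inputs."""
    addr = f1 | (f2 << 1) | (f3 << 2) | (f4 << 3)
    return (config >> addr) & 1


and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
parity4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(and4), hex(parity4))
```

Changing the function of the LUT means changing only these 16 bits, which is what reconfiguration of logic amounts to in this model.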

Figure 2-3: Virtex 2-Slice CLB (Slice 0 and Slice 1, each with two LUTs fed by inputs F1-F4 and G1-G4, carry and control logic, and two flip-flops)

2.3 Dynamic Reconfiguration for the Virtex Series FPGA

The Virtex series FPGA supports dynamic reconfiguration. The configuration logic is separated from the user logic and does not require the use of normal resources, which allows sections that do not change to continue operating. The configuration write sequence is a glitchless operation, so that only the memory bits that were modified are toggled. The one exception is the Block RAM: the configuration logic requires the use of the read/write ports of the Block SelectRAMs when the memory contents must be read or written [13].

The smallest amount of configuration memory that can be written to or read from is a frame. A frame spans from the bottom of the device to the top of the device, including the IOBs and CLBs, and contains a section of the data needed for each row. While an entire frame must be written into the device, only the bits that have changed will be toggled. This makes it possible to change a single bit without affecting the rest of the device operation.
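The frame granularity described above can be illustrated with a toy model. In the Python sketch below, the configuration memory is represented as a list of frames, each a short byte string; frame count and frame length are hypothetical and far smaller than on a real device. The first function finds the frames that must be rewritten as a whole, the second counts how many bits actually toggle within a rewritten frame.

```python
def frames_to_write(old_frames, new_frames):
    """Return the indices of frames that differ and must be rewritten whole."""
    if len(old_frames) != len(new_frames):
        raise ValueError("both configurations must have the same frame count")
    return [i for i, (old, new) in enumerate(zip(old_frames, new_frames))
            if old != new]


def toggled_bits(old_frame, new_frame):
    """Count the bits that actually change when a frame is rewritten."""
    return sum(bin(o ^ n).count("1") for o, n in zip(old_frame, new_frame))


# Toy configuration: three frames of two bytes each (hypothetical sizes).
old = [bytes([0x00, 0xFF]), bytes([0xAA, 0x55]), bytes([0x0F, 0xF0])]
new = [bytes([0x00, 0xFF]), bytes([0xAB, 0x55]), bytes([0x0F, 0xF0])]

changed = frames_to_write(old, new)
print("frames to rewrite:", changed)
print("bits toggled     :", toggled_bits(old[1], new[1]))
```

Here a single changed bit forces one complete frame to be written, while the remaining bits of that frame keep their value.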

The routing configuration memory is modified in the same manner as the logic configuration. Modification of routing connectivity may cause contention. This will not damage the device as long as it is short (<30 ms). A signal which passes through a section of change will continue to pass data during the reconfiguration, provided that the reconfiguration does not intentionally change connections to the signal wire.

Although some FPGAs like the Virtex series support dynamic reconfiguration, there is so far no design environment for developing complex dynamically reconfigurable systems in an integrated flow. The manufacturers are aware of this deficiency and are working on enhancements to be included in future versions of the implementation software. Until then, developers have to descend to the low level of bitstream manipulation.

With Xilinx’s JBits SDK [14], a Java program to dynamically create or manipulate Virtex bitstreams, a first powerful tool is available. But it is still far from the comfort of high-level language hardware compilers.

2.4 Design Flows

We can distinguish four different design flows for dynamic reconfiguration with today's tools:

Flow 1: Design all modules with common implementation tools and extract the partial reconfiguration directly from the bitstreams.

Flow 2: Design all modules with JBits, with the restriction of working on a low abstraction level.

Flow 3: Design an initial version with common implementation tools and apply dynamic changes with JBits.

Flow 4: Design all modules with common implementation tools and merge them with JBits.

Which of the flows is to be preferred depends on the targeted application. None of them seems optimal. Due to the lack of an elaborate design environment, developers have to accept restrictions.

2.4.1 Flow 1: Without JBits

Without JBits, partial reconfiguration usually consists of a manual (or script based) manipulation of bitstreams. All non-partial bitstreams are generated with common synthesis and implementation tools. From these bitstreams the modules are extracted by copying only the frames containing the modules into a new partial bitstream. The corresponding frames on the FPGA device are then replaced with the ones from the partial bitstream by dynamic reconfiguration (see figure 2-4). A complete study of this flow can be found in [4].
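The extract-and-paste step described above can be sketched in C. This is only an illustration under simplifying assumptions: both configurations are modelled as flat arrays of equally sized frames, the frame size FRAME_BYTES is a made-up value, and the function names are hypothetical. Real Virtex bitstreams additionally contain headers and configuration commands that are not modelled here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical frame size in bytes; the real value depends on the device. */
#define FRAME_BYTES 8

/*
 * Copy frames [first, first + count) from a full configuration image into a
 * partial image. Both images are modelled as flat arrays of frames.
 * Returns the number of bytes written to `partial`.
 */
size_t extract_frames(const unsigned char *full, size_t first, size_t count,
                      unsigned char *partial)
{
    memcpy(partial, full + first * FRAME_BYTES, count * FRAME_BYTES);
    return count * FRAME_BYTES;
}

/* Paste a partial image over an existing configuration, starting at frame `first`. */
void apply_frames(unsigned char *config, size_t first,
                  const unsigned char *partial, size_t partial_bytes)
{
    memcpy(config + first * FRAME_BYTES, partial, partial_bytes);
}
```

Because whole frames are copied, everything else that happens to live in the same frames is overwritten as well, which is the granularity restriction discussed next.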

Restrictions

A restriction that comes along with this flow is that the granularity of changes is one frame. This implies a vertical segmentation of the design, which is not always applicable.

Because no dynamic routing is done, the connections to other components have to be exactly the same for every module. This requires designing a hard-constrained interface to ensure that every module uses the same routing resources for the connections.

Figure 2-4: Flow 1: Dynamic reconfiguration without JBits. Frames are extracted from bitstream 1 into a partial bitstream, which is pasted over an existing configuration (bitstream 2, on the FPGA) by dynamic reconfiguration.

2.4.2 Flow 2: JBits Only

If you design all modules in JBits, you can easily use the dynamic reconfiguration ability of this program. At any time the program can write changes in the design as a partial bitstream to the device. The JBits SDK also includes an automatic router, which can dynamically route and unroute connections.

Restrictions

This flow seems only applicable for small or data flow oriented applications. There are no tools in JBits to implement state machines or other helpful modeling constructs.

JBits is still in development and incomplete. For example, the abilities of the autorouter are still limited. Not all resources can be used (e.g. long lines).

2.4.3 Flow 3: Combined Flow

An initial design is created with common synthesis and implementation tools and configured to the FPGA device. Then JBits applies all further changes dynamically. The changes are not taken from another bitstream but created by the JBits program itself. This flow is suitable for complex applications which need only minor dynamic changes, like changing parameters of an algorithm or changing connectivity.

Restrictions

Now the restrictions of Flow 2 only apply to the dynamic changes and not to the whole design anymore. The difficulty with a combined flow is to design the initial circuit in a way that the JBits program can find specific resources again. Therefore the initial design needs exact floorplanning.

Another restriction concerns signal timing. The implementation tool usually uses a delay based routing algorithm to create low skew circuits. If these connections are removed and rebuilt with JBits, the timing is not guaranteed anymore, unless the program explicitly verifies the delays of the new routes.

2.4.4 Flow 4: Use JBits to merge Cores

This flow is comparable with Flow 1. The difference is that we use JBits to extract the modules out of the bitstreams and to merge them together with dynamic reconfiguration. There are two main advantages over Flow 1. First, to extract a module, JBits is not limited to the granularity of frames. An individual CLB can be read with all linked resources and be written as a partial bitstream to a device. Second, a module can be connected to the interface with JBits' autorouter.

We can realize dynamic reconfiguration in two ways within this flow. Figure 2-5 shows a flow which directly copies the module from one bitstream to another. A flow which uses the routing function of JBits is illustrated in Figure 2-6.

Figure 2-5: Flow 4: Use of JBits to directly copy a module. A first bitstream (bitstream 1) is loaded into memory. It contains a module, which is connected via a hard route to the interface. The reconfigurable area is then copied and pasted over the module of the actual configuration (bitstream 2, on the FPGA). Since all modules use the same (hard) connections, the correct connection of the new core is guaranteed.

Restrictions

The restriction on signal timing (cf. Flow 3) also concerns the flow which uses JBits' routing function. This is not important for the flow which directly copies the module, since it does not use the autorouter. But this flow uses a hard interface like Flow 1. Hard interfaces often need manual editing.

(a) Load a first bitstream into memory containing the module to configure. The module is connected to a fixed interface.
(b) Unroute the connections to the interface and remember the source and sink locations.
(c) Load a second bitstream into memory with the actual configuration of the FPGA.
(d) Unroute the connections to the interface in the second bitstream.
(e) Copy the module from the first bitstream from step (b) and insert it in the second bitstream from step (d).
(f) Reconnect the new module with the information from step (b).

Figure 2-6: Flow 4: Use of JBits and dynamic routing. The result of this flow is a partial bitstream containing all changes from step (d) to step (f), which will be written to the device.




3 Development Platform

In this chapter we describe the development platform which we set up and used for our thesis. We will first describe the components of the development board we used and then go on to the platform itself, the hardware used and finally the software.

3.1 Xilinx Virtex XCV800 FPGA

The Xilinx Virtex XCV800 FPGA is a high-density FPGA with 800k gate equivalents. The version used on the board is in a 240-pin HQFP package.

The XCV800 FPGA contains 84 columns and 56 rows of CLBs. Additionally, the FPGA contains a total of 28 block RAMs with a capacity of 4096 bits each. These block RAMs are fully synchronous dual ported RAMs with independent control signals for each port.

An overview of the architecture of the FPGA is given in section 2.2.

3.2 XSV800

The XSV800 prototyping board from XESS is a versatile platform for developing FPGA circuits. Its Xilinx Virtex XCV800 FPGA is connected to different interfaces for communication with the outside world. There are two serial ports, one parallel port, an Xchecker interface, a USB port, PS/2 mouse and keyboard ports and a 10/100 Mbit Ethernet physical layer interface (Ethernet PHY).

Further, there is a 110 MHz RAMDAC for video signal generation and an audio driver which can process audio signals with a resolution of up to 20 bits and a bandwidth of 50 kHz.

For local data storage the board provides two RAM banks with 512k × 16 bits capacity. A separate 16 Mbit Flash EEPROM can be used either to save the configuration of the FPGA or to store data for use by the FPGA after configuration is complete.

Finally, there are some local controls on the board like 4 push buttons, a row of 8 DIP switches, two 7 segment LEDs and 10 universal LEDs.


A block diagram of the XSV board is shown in figure 3-1. The dotted elements were not used for our project.

Figure 3-1: Block diagram of the XSV800 board. [Block diagram: the Virtex FPGA (XCV800HQ240) surrounded by four 512K×8 SRAMs, the XC95108 CPLD, the 16 Mbit Flash, the Ethernet PHY with RJ45 connector, the RAMDAC with VGA output, the video decoder, the 20 bit stereo codec with stereo in and out, the PDIUSBP11A USB transceiver, the MAX232A, the PS/2 port, LEDs, DIP switches and pushbuttons.]

3.3 The LEON Processor

The LEON is a freely available VHDL model of a 32 bit processor conforming to the SPARC V8 architecture. Originally developed by the European Space Agency, it is now available under the GNU Lesser General Public License (LGPL)¹. It is being maintained and further developed by Gaisler Research[3] in Göteborg, Sweden. A simple block diagram of the LEON processor is shown in figure 3-2.

3.3.1 VHDL Configuration

The VHDL model of LEON is fully configurable to permit synthesis for different cache sizes, multiplier units and target architectures. The main configuration is done in the file target.vhd. The basic configuration record consists of entries as shown in table 3-1.

The descriptions cover some of the most interesting or important issues of each configuration option. A complete description of the options can be found either in the VHDL source or in The LEON processor User's Manual[3].

¹ http://www.gnu.org/licenses/lgpl.html

Figure 3-2: Block diagram of the LEON processor. [Block diagram: the LEON SPARC V8 integer unit with I-cache and D-cache, FPU and co-processor on the AMBA AHB; the AHB controller, the memory controller for PROM, SRAM and I/O on the 8/16/32-bit memory bus, and the AHB/APB bridge; timers, interrupt controller, UARTs and I/O port on the AMBA APB; PCI and user I/O. Elements not used in our project are marked in the figure.]

Option     Configuration description
synthesis  Target technology is Xilinx Virtex, block proms will be used.
iu         Multiplier is optimized for use on an FPGA, no MAC, no FPU, no co-proc
fpu        Type of FPU (none)
cp         Type of co-processor (none)
cache      2k instr. cache, 2k data cache.
ahb        One master on AHB bus.
apb        1 interrupt contr., no PCI.
mctrl      32 bit memory.
boot       Boot from prom, clock 20 MHz, baudrate 38400
debug      Enable disassembling, but no other debugging
pci        No PCI interface
peri       Enable configuration register, no watchdog timer, no second interrupt controller

Table 3-1: Main configuration of LEON





The processor can actually be run at 25 MHz on the XSV board, but then placing and routing takes quite long. Since 20 MHz is more than fast enough for our project, we reduced the frequency, which resulted in a significant speedup of the implementation process.

3.3.2 Booting

When LEON starts, it boots from an internal boot ROM according to our configuration.

For the boot prom it is possible to either use block RAMs or distributed logic. When using block RAMs, the boot prom is built with the Logic Core Generator from a template (virtex_prom256.xco). The contents of the ROM are taken from the file virtex_prom256.mif, which contains just the bare bit code. When using standard logic cells, the boot program can be coded in a VHDL file as a large memory lookup table. This is done in the file bprom.vhd.

We first defined two block RAMs with a total of 256 × 32 bits as a block prom for the boot process.

Later it turned out to be better to use standard logic cells, so that the number of used block RAMs could be reduced to 14 and LEON only uses block RAMs on one side of the FPGA. Together with the area constraints this led to fewer disturbing lines (cf. section 5.4.2) through the reconfigurable area.

Pmon

As a simple boot program we used the pmon monitor that comes with the LEON distribution.

Pmon is a small boot program which, after initializing the processor, performs some memory checks, activates the serial port, says hello to the world (LEON-1: 1*2048K 32-bit memory, rmw) and then waits for a program to be downloaded on the serial interface. This program is in the S-Record format, which can be generated from an executable with the GNU objcopy program.
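Each line of an S-Record file ends with a checksum byte that a loader can use to detect transmission errors on the serial line. The following C sketch computes that checksum according to the general S-Record convention. It is not taken from pmon, and the function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Checksum of one S-Record: the one's complement of the low byte of the sum
 * of the count, address and data bytes (everything after the "Sn" type
 * field, excluding the checksum byte itself).
 */
uint8_t srec_checksum(const uint8_t *bytes, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += bytes[i];
    return (uint8_t)(~sum & 0xFFu);
}
```

A receiver would recompute this value over the decoded bytes of each record and compare it with the last byte of the line.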

We always downloaded programs at 38400 bps, although it would be preferable to increase this baud rate, since a compiled RTEMS program takes more than 3 minutes to load over the serial interface. But we have not tested whether it works with higher rates.

Rdbmon

Another boot program is called rdbmon. It allows the debugger gdb to attach to the processor. Rdbmon provides support for setting breakpoints, single stepping and all the typical debugging tasks like reading memory addresses, disassembling code etc.

The debugger connection runs over the second serial interface, while stdin and stdout of LEON remain on the first serial interface.

So with the appropriate cable (cf. section 3.3.4) we started minicom, a simple terminal program, on the first serial interface to capture all the output of LEON. After downloading rdbmon, gdb is started with the corresponding executable. The connection to the board is established with

(gdb) set remote-baud 38400
(gdb) target extended-remote /dev/ttyS1

and the program downloaded and started:

(gdb) load

(gdb) run

One of the advantages of rdbmon over the simpler pmon is that when plugging in with the debugger the boot program can be recycled. So when the program that was run on LEON has terminated without error, the loader can be resumed with a simple

(gdb) jump *0x401f0000

in gdb, whereas with pmon the monitor has to be downloaded again.

After restarting, a new program can be loaded with

(gdb) file newprog.exe

The board has to be contacted again (target ...) and the new program can be downloaded.

3.3.3 Top Level Design

For our top level design we took the file xsv800.vhd which was posted to the LEON mailing list by Stephan Schirrmann[15].

The main task of the top level design file is to connect LEON correctly to the two RAM banks on the XSV board. We later expanded it to also connect LEON to our extensions like the audio codec and the network interface.

3.3.4 UARTs 1 and 2

To be able to use both serial ports of LEON we reprogrammed the CPLD on the XSV board. The CPLD is connected to 4 signals (RxD, TxD, RTS, CTS) of the DB9 connector on the board. Since we don't use flow control, we could route the signals of the second serial port over the RTS/CTS lines of this connector. We had to build a special Y-cable which splits these signals to two connectors on the other end.

3.3.5 Synthesis with Synopsys

To synthesize LEON we used the fc2_shell of Synopsys FPGA Compiler 2. It is scriptable, but the commands can also be entered manually.

The steps to create a chip from the source VHDL files are:





(1) create_project leon
(2) add_file -library WORK -format VHDL path/to/file.vhd
(3) analyze_file
(4) create_chip -name leon -target VIRTEX -device V800HQ240 -speed -4 -frequency 20 -preserve xsv800
(5) current_chip leon
(6) optimize_chip -name leon_x
(7) export_chip -root leon_x

With (1), the project is created. Then with (2) the source files are added. This command has to be executed once for each source file. A correct order in which to add the files can be seen in appendix A.

The command (3) analyzes all the added VHDL files and checks for syntax errors.

The next step (4) is to create a chip and specify the hardware parameters.

Synopsys does not automatically make the new chip the current chip, so this has to be done manually (5).

Optimizing the chip (6) is the step which takes most of the time. When this is finished, the chip can be written to a file with (7) for further processing.

So much for the first creation of the chip. When the VHDL source has been updated (i.e. any number of files have been changed), it is not necessary to perform all these steps again.

(1) current_chip leon
(2) analyze_file
(3) update_chip
(4) current_chip leon
(5) optimize_chip -name leon_(x+1)
(6) export_chip -root leon_(x+1)

After switching the current chip back to the unoptimized version (1), all the modified files are re-analyzed (2). Now the chip must be updated (3), optimized as a new version (5) (again the updated chip is not automatically set as current: (4)) and exported again (6).

Old designs can be deleted with

delete_chip name

where name is not the current chip.

3.3.6 Implementation

The implementation was done with Xilinx Foundation, first with version 3.3, later also with version 4i.

It is important to specify the correct constraints file so that the pins of the FPGA are assigned correctly. One pitfall here is that Foundation complains if there are too many pins constrained in the file, but if some pins are left unassigned, it just places them somewhere.

This way, we once nearly ruined our board when reprogramming the CPLD. The CPLD is connected so that some freely useable pins are connected with the programming pins. So when one of these pins is tied to a constant logic level, it is no longer possible to reprogram the CPLD.

We were somewhat lucky because between the pin which was set to a constant logic level and the programming pin there was a jumper which could be removed to reprogram the CPLD again.

3.4 The Operating System RTEMS

When we were looking for an operating system for the LEON processor, we found that RTEMS from the OAR Corporation[8] was already ported to this architecture. RTEMS is an Open Source Real Time Operating System, and since it has a small kernel it was exactly what we were looking for.

So after only a short time our first "Hello World" program was successfully tested on the XSV board.

Programs are compiled with the LECCS tool set, which is actually the port of the GNU GCC compiler for the LEON architecture.

To load the programs on to the LEON, the executable has to be converted to an S-Record file first. This is done with

sparc-rtems-objcopy -O srec program.exe program.srec

We denoted all LEON executables with the extension exe so that they wouldn't be confused with normal Linux ELF executables.

After powerup, LEON executes the pmon application which waits for the download of an S-Record file on the serial interface. So this file is sent via the serial interface to the LEON:

cat program.srec > /dev/ttySx

The serial interface has to be configured to the correct baudrate and transmission parameters (8n1). This is done by starting minicom and configuring the interface. minicom can then be left running, so it will show the standard output of the program running on LEON. The program doesn't have to be started, this is already done by the S-Record loader.

3.4.1 Program Structure

RTEMS is not an operating system like, for example, Linux, which runs on the processor and into which new programs can then be loaded and executed. Instead, the user program is written, and then the operating system is linked as a library to the program, so that one large program results which includes both the OS and the user program.

In an RTEMS program some basic constants have to be defined for certain features to be enabled during compilation. In our case this was

/* we need a console for communication, mostly debugging */
#define CONFIGURE_APPLICATION_NEEDS_CONSOLE_DRIVER

/* we use the clock as a 'still alive' indicator */
#define CONFIGURE_APPLICATION_NEEDS_CLOCK_DRIVER
#define CONFIGURE_TICKS_PER_TIMESLICE 50

/* we have a timer for network timeouts */
#define CONFIGURE_APPLICATION_NEEDS_TIMER_DRIVER
#define CONFIGURE_MAXIMUM_TIMERS 5

/* to start several separate tasks */
#define CONFIGURE_EXTRA_TASK_STACKS (4 * RTEMS_MINIMUM_STACK_SIZE)
#define CONFIGURE_MAXIMUM_TASKS 8
#define CONFIGURE_RTEMS_INIT_TASKS_TABLE

/* this is needed for the TCP/IP stack */
#define CONFIGURE_USE_MINIIMFS_AS_BASE_FILESYSTEM
#define CONFIGURE_LIBIO_MAXIMUM_FILE_DESCRIPTORS 10

An RTEMS program has no main() function like a normal C program. Instead, after starting, the operating system sets up the whole environment and then starts the function Init(rtems_task_argument) as a new thread. This is where the user can start setting up the environment for the user program.

Usually, some new threads are started here, and probably some interrupt vectors and timers get installed. We also set up the network stack here. This means we have to give it an Ethernet address, an IP address with netmask, standard gateway etc.

After everything has been initialized, the init task is terminated. The system runs on with the different started tasks.

Tasks

When a new task is started, it has to tell the operating system that it may be preempted. That means it is possible to create preemptible as well as non-preemptible tasks. This is done with the rtems_task_mode function:

rtems_mode old_mode;
rtems_task_mode(RTEMS_PREEMPT | RTEMS_TIMESLICE,
                RTEMS_PREEMPT_MASK | RTEMS_TIMESLICE_MASK,
                &old_mode);

A task in RTEMS is basically a function which is called when the task is invoked. It then runs until either the function returns or the task terminates itself by calling rtems_task_delete(RTEMS_SELF). It can also actively call the scheduler with the function rtems_task_wake_after(timeout). If the argument timeout is non-zero, the task is put to sleep for the specified time. Otherwise, only a rescheduling is done and the task is put back into the ready queue immediately.

Interrupts and Timers

Interrupt and timer functions are implemented the same way as tasks: they are functions which are called when the interrupt is triggered or the timer expires.

The definition of a timer function is as follows:

rtems_timer_service_routine timer_routine(rtems_id timer_id,
                                          void *user_data)

A timer is created with

rtems_timer_create(t_name, &timer_id)

and started with

rtems_timer_fire_after(timer_id, TICKS_PER_SECOND / 25,
                       timer_routine, &timer_data);

The pointer user_data, which is an argument of the timer routine, can be used to pass some data to the timer function. The last argument of the call to rtems_timer_fire_after(), here &timer_data, is passed on for this purpose.

The definition of an interrupt handler is this:

static rtems_isr eth_interrupt_handler(rtems_vector_number v)

Interrupt handlers should only perform a minimal set of actions so that they return soon. They also must not call blocking functions.

One function which is allowed to be called is rtems_event_send. So an interrupt handler can check the cause for its invocation and then send a message to a specific task to handle the cause.

In the case of the network card, the interrupt handler could check whether there is a new frame in the input buffer or whether the frame in the outgoing buffer has been successfully sent. Depending on the outcome of this check, it could then wake either the sending or the receiving task. This feature has not yet been implemented. Instead, the interrupt handler always wakes the receiving task.



4 Network

4.1 Overview

As one of the main goals of our thesis was to show an application of a reconfigurable system on an FPGA, we decided to implement two different audio codecs which could be exchanged at runtime while the rest of the system on the FPGA continued to run uninterrupted.

To be able to practically demonstrate this scheme, we had to play sound on each audio codec. But to play sound, a large quantity of data is necessary. So we had to somehow transport this data to the LEON. One possibility would be to use the serial interface, which is the standard input and standard output of the LEON processor. The data rate on the serial interface is 38400 bps, which with 8n1 framing (10 bits on the line per byte) is about 3840 bytes/s.

But sound at an acceptable quality has a sampling rate of at least 11.025 kHz with a sample size of 8 bits mono. This results in

$$11025\ \mathrm{Hz} \cdot 8\ \mathrm{bits} = 88200\ \mathrm{bps}$$

which is more than twice what the serial interface can deliver.

So we decided to implement a network interface for LEON. The processor itself only occupies about half of the FPGA, so there is still enough room for a small network interface.

On a 10 Mbps Ethernet interface, the useable data rate is about

$$\frac{10\ \mathrm{Mbit/s}}{1520\ \mathrm{bytes/frame}} \approx 820\ \mathrm{frames/s}$$

The maximal payload of an Ethernet frame is 1500 bytes. But with overhead from IP of 20 bytes and UDP of 8 bytes, the useable data size is 1472 bytes/frame.

So we can transmit data with no more than

$$1472\ \mathrm{bytes/frame} \cdot 820\ \mathrm{frames/s} \approx 1.15\ \mathrm{Mbytes/s}$$

Audio data with a sampling rate of 44.1 kHz, 16 bit stereo has a data rate of

$$44100\ \mathrm{Hz} \cdot 16\ \mathrm{bit} \cdot 2\ \mathrm{channels} \approx 172\ \mathrm{kbytes/s}$$

So the data rate on the network interface should be more than enough to stream music with CD quality and even implement a simple flow control.
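The arithmetic of this section can be reproduced with a few helper functions. The following C sketch is only a worked calculation. The function names are made up, and the assumption of 10 line bits per byte on the serial link follows from the 8n1 setting used with pmon.

```c
#include <stdint.h>

/* Payload bytes per second on a serial line with 8n1 framing:
 * every byte costs 10 bit times (start bit, 8 data bits, stop bit). */
uint32_t serial_bytes_per_s(uint32_t baud)
{
    return baud / 10u;
}

/* Raw PCM audio data rate in bits per second. */
uint32_t audio_bits_per_s(uint32_t sample_hz, uint32_t bits, uint32_t channels)
{
    return sample_hz * bits * channels;
}

/* Frames per second on a link when every frame occupies bytes_per_frame bytes. */
uint32_t frames_per_s(uint32_t link_bps, uint32_t bytes_per_frame)
{
    return link_bps / (bytes_per_frame * 8u);
}

/* UDP payload of a full-size frame: 1500 bytes minus IP and UDP headers. */
enum { UDP_PAYLOAD_MAX = 1500 - 20 - 8 };
```

With integer division the frame rate comes out as 822 frames/s, which the text rounds to about 820. CD quality audio needs 176400 bytes/s, that is roughly 172 kbytes/s when 1 kbyte is taken as 1024 bytes, well below the roughly 1.2 million bytes/s the link can carry.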

4.2 Ethernet

The structure of an Ethernet frame is shown in table 4-1. Each frame starts with a preamble of 7 bytes with the bit pattern 10101010. This allows the receiver to synchronize its clock to the sender clock. The next byte is the start of frame delimiter (SFD). It signifies the end of the preamble and hence the start of the actual frame. Its value is 10101011.

The header of the frame starts with two addresses. Ethernet addresses are 48 bits long. The first one is the destination address, followed by the source address.

The type field denotes what the frame contains. Alternatively, it can carry the length of the data field. If the value is at most 1500, it denotes the length of the following data. Otherwise (values of 0x0600 and above), it indicates what the data contains. A value of 0x0800 denotes an IP packet, whereas 0x0806 stands for an ARP packet.

In case the payload of the Ethernet frame is an IP or ARP packet, no length information is needed, since both the IP and the ARP header contain the information about the length of the packet.

A more detailed description of the Ethernet protocol can be found in [12] (chapter 4.3.1, IEEE Standard 802.3 and Ethernet).

It is actually possible to transport more data in an Ethernet frame than the size of an IP packet. The additional bytes are then just ignored at the receiver side. Our implementation uses this feature, because currently every packet that is passed to the network card is automatically stuffed to a length which is a multiple of 4 bytes.

As said above, an Ethernet frame can contain up to 1500 bytes of data. But the whole frame (without preamble) must have a length of at least 64 bytes. This means that the data portion must be between 46 and 1500 bytes. If the packet in the frame is smaller than the required size, the frame can be filled with the optional pad field.

Finally, the CRC is calculated over the whole frame from the destination address to the padding.
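The rules for the type field and the padding can be written down as a small C sketch. The names are hypothetical and this is not the code of our driver. The sketch treats 1500 itself as a valid length and values from 0x0600 upwards as types, as IEEE 802.3 does; the values in between are not defined.

```c
#include <stdint.h>

enum eth_kind { ETH_LENGTH, ETH_TYPE, ETH_INVALID };

/* Interpret the 16-bit type/length field of an Ethernet header. */
enum eth_kind eth_classify(uint16_t type_len)
{
    if (type_len <= 1500)
        return ETH_LENGTH;      /* length of the following data */
    if (type_len >= 0x0600)
        return ETH_TYPE;        /* e.g. 0x0800 = IP, 0x0806 = ARP */
    return ETH_INVALID;         /* 1501..1535 is undefined */
}

/* Pad bytes needed so that the data field reaches the 46 byte minimum. */
unsigned eth_pad_bytes(unsigned data_len)
{
    return data_len < 46 ? 46 - data_len : 0;
}

/* Frame length without preamble and SFD: 14 header + data + pad + 4 CRC. */
unsigned eth_frame_bytes(unsigned data_len)
{
    return 14 + data_len + eth_pad_bytes(data_len) + 4;
}
```

For example, a 28 byte ARP packet needs 18 pad bytes to reach the 64 byte minimum frame length, while a full 1500 byte payload yields a frame of 1518 bytes.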

Field         Preamble  SFD  Dst Addr  Src Addr  Type  Data    Pad   CRC
Size (bytes)  7         1    6         6         2     0–1500  0–46  4

Header (Dst Addr, Src Addr, Type): 14 bytes. Body (Data and Pad): 46–1500 bytes.

Table 4-1: Ethernet frame

4.3 Prerequisites

4.3.1 Hardware

The XSV board already contains an Ethernet PHY[10]. The LXT970A chip on the board can drive either a 10 or a 100 Mbps Ethernet line. But since the master clock on the board is only 20 MHz, we could not implement a 100 Mbps network card (100 Mbps corresponds to 10 ns per bit, whereas one period of the 20 MHz clock lasts 50 ns).

The chip is a line driver and delivers the data in nibbles¹. For this purpose, it generates its own clock of approx. 2.5 MHz.

The processing of these nibbles to reassemble the Ethernet frames is usually done by the network card. Thus we wrote the network card for the LEON ourselves in VHDL.

The main state machines of the network card, the sender and the receiver process, both run with this slow network clock from the PHY. Therefore, certain synchronization mechanisms are needed between the PHY-clocked processes and the processor core, for example to handle processor signals that are too short to be noticed by the slow running processes or, in the other direction, signals that would be far too long if sent directly to the processor.

4.3.2 Software

On the side of the operating system, there already exists a complete implementation of a BSD network stack. This stack contains all the usual protocols such as IP, ARP and ICMP on the network layer and below, or TCP and UDP on the transport layer. But we still had to write the network card driver which actually communicates with the hardware of the network card.

4.4 Network Hardware

4.4.1 Architecture

The whole network card basically consists of three parts. These are the address decoder, which communicates with the CPU, and the sender and receiver processes, which communicate with the line driver chip.

Then there are some additional components such as the CRC checker and calculator (it is the same for both the sender and receiver, but instantiated twice, once for each).

¹ One nibble is half a byte (4 bits).


There also needs to be some buffer space to hold the Ethernet frames which are currently being processed. This space is implemented in the form of two FIFOs, one in each direction. They are both large enough to hold one Ethernet frame.

These FIFOs are implemented using block RAMs on the FPGA. They are not placed in the external RAM, because in that case it would have been necessary to write a memory controller which acts as an arbiter between the network card and the CPU when accessing memory. In addition, the operating system would have to hand over some of the memory to the network card, which could then not be used otherwise.

The three parts of the network card each run with a different clock. The address decoder uses the CPU clock, whereas the clocks for the sender and the receiver are provided by the line driver. That is why there are some synchronization circuits in the design.

The overall architecture of the network card is depicted in figure 4-1.

Figure 4-1: Architecture of the network card. [Block diagram showing LEON, the address decoder, FIFO A and FIFO B, the synchronization circuits, the sender and the receiver with their CRC units, and the PHY, together with the three clock domains (cpu-clock, tx-clock, rx-clock).]

4.4.2 Address Decoder

The address decoder attaches to the I/O port of LEON's memory bus. Depending on the address the processor reads or writes, the address decoder performs different operations. A list of all possible actions is shown in table 4-2. The addresses mentioned here address 32 bit words. So for the actual address, the number given here has to be multiplied by 4. For addresses 3, 4 and 5 it doesn't matter what value is written, the actual value is not even used in the address decoder. It is just the fact that they are written that is important.

All addresses are relative to the base address of the network card, which we set to be the first address of the I/O memory, 0x20000000.

Addr     Dir  Function  Description
0 (000)  w    write     Data written to this address is saved in the sender FIFO. The network card doesn't read this data until it gets the signal to start sending. But then the whole FIFO is read and sent. So no more than one full packet can be written to the FIFO before invoking the sender.
1 (001)  r    read      When a new packet has arrived and successfully been placed in the FIFO, it can be read from this address. Data contains the full Ethernet header starting with the destination address. At the end, after the actual data, there is some stuffing to the next 4 byte boundary, followed by the CRC of the Ethernet frame and the result from the CRC checker.
2 (010)  r    status    The status of the two FIFOs is coded as follows: bit 0 = FIFO A empty, bit 1 = FIFO A full, bit 2 = FIFO B empty, bit 3 = FIFO B full.
3 (011)  w    signal    When this address is written, the network card starts sending the data in the sender FIFO.
4 (100)  w    reset A   Flush FIFO A.
5 (101)  w    reset B   Flush FIFO B.
6 (110)  r    #Bytes    After receiving a packet from the network card, reading this address results in the number of bytes that the last packet consists of. This number includes the CRC and the calculated CRC value at the end of the packet.
7 (111)                 Not used.

Table 4-2: Detailed description of the memory locations of the network card
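From the point of view of driver software, table 4-2 can be captured in a few constants. The following C sketch uses made-up names and takes the base address as a parameter instead of hard-coding it. It only computes addresses and decodes status bits and does not access any hardware.

```c
#include <stdint.h>

/* Word indices from table 4-2 (hypothetical names). */
enum nic_reg {
    NIC_WRITE     = 0,  /* w: data for the sender FIFO */
    NIC_READ      = 1,  /* r: data of the received frame */
    NIC_STATUS    = 2,  /* r: FIFO status bits */
    NIC_SEND      = 3,  /* w: start sending, value ignored */
    NIC_RESET_A   = 4,  /* w: flush FIFO A, value ignored */
    NIC_RESET_B   = 5,  /* w: flush FIFO B, value ignored */
    NIC_NUM_BYTES = 6   /* r: length of the last received frame */
};

/* Bits of the status word. */
#define NIC_FIFO_A_EMPTY (1u << 0)
#define NIC_FIFO_A_FULL  (1u << 1)
#define NIC_FIFO_B_EMPTY (1u << 2)
#define NIC_FIFO_B_FULL  (1u << 3)

/* Byte address of a register: the table counts 32 bit words, so multiply by 4. */
uint32_t nic_reg_addr(uint32_t base, enum nic_reg reg)
{
    return base + 4u * (uint32_t)reg;
}

/* Returns 1 if the given flag is set in a status word, 0 otherwise. */
int nic_status_has(uint32_t status, uint32_t flag)
{
    return (status & flag) != 0;
}
```

On the target, a driver would turn such an address into a pointer to a volatile 32 bit word before reading or writing it.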


4.4.3 FIFOs

The two FIFOs are dual ported FIFOs[6] with a capacity of 511 × 32 bit each. Thus they both can hold one full Ethernet frame. This buffer capacity is actually not enough to provide reliable network service in all cases. When frames arrive in quick succession, the processor is not fast enough to read the first frame out of the FIFO before the next one is written into it, and the FIFO quickly reaches its capacity. And since our implementation does not check whether there is free space in the FIFO, this results in loss of data.
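The capacity argument can be checked with a short calculation, sketched here in C with made-up function names. It only counts the frame bytes rounded up to whole 32 bit words; any additional words the receiver appends (such as the CRC check result) are ignored, so the real numbers can only be lower.

```c
enum { FIFO_WORDS = 511 };  /* capacity of one FIFO in 32 bit words */

/* 32 bit words needed for `bytes` bytes, rounded up to a 4 byte boundary. */
unsigned words_for_bytes(unsigned bytes)
{
    return (bytes + 3u) / 4u;
}

/* Number of frames of the given size that fit into the FIFO at once. */
unsigned frames_that_fit(unsigned frame_bytes)
{
    unsigned words = words_for_bytes(frame_bytes);
    return words == 0 ? 0 : FIFO_WORDS / words;
}
```

A maximum-size frame of 1518 bytes occupies 380 words, so exactly one such frame fits into 511 words, whereas a burst of minimum-size 64 byte frames could be buffered many times over.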

There is still room for improvement there. Either the FIFO has to be enlarged, sothat it can hold more than two full ethernet frames, or there has to be more thanone buffer to hold the incoming frames, where they can be stored in a round robinfashion.

The version with the larger FIFO should work, as the network card counts the num-ber of bytes of each frame, and the software can ask the card about this number. Soit is actually possible to have more than one frame in the FIFO but the softwareonly reads exactly one frame.

One problem with this solution arises in the case of frames with a wrong CRC. Atthe moment when there is never more than one frame in the FIFO at a time thereceiver process can just flush the FIFO if the incoming CRC is not correct. Butwhen there is the possibility that previous packets are still in the FIFO it is notpossible anymore to just flush it to get rid of the last frame. So with this solutionthe CRC check always has to be done in software.

We did not use larger FIFOs because there were not enough block RAMs availableon the FPGA to resize the FIFOs.

So the correct implementation would be to use several different buffers for the in-coming frames. So each new incoming frame gets written to a different buffer. Whenthe CRC check shows that the current frame is corrupted the according buffer can

just be flushed to get rid of the frame.
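
The multi-buffer idea can be sketched as a small software model. This is only an illustration of the proposed improvement, which was not implemented; the class name, the number of buffers and the policy of dropping a frame when all buffers are busy are assumptions.

```python
class FrameRing:
    """Model of several receive buffers that are filled in round-robin order."""

    def __init__(self, n_buffers):
        self.slots = [None] * n_buffers
        self.write_idx = 0
        self.read_idx = 0
        self.count = 0

    def receive(self, frame, crc_ok):
        """Store a frame; a corrupted frame only flushes its own buffer."""
        if not crc_ok:
            return False  # current buffer is flushed, earlier frames survive
        if self.count == len(self.slots):
            return False  # all buffers busy: the frame is lost
        self.slots[self.write_idx] = frame
        self.write_idx = (self.write_idx + 1) % len(self.slots)
        self.count += 1
        return True

    def read(self):
        """Hand the oldest stored frame to the CPU, or None if there is none."""
        if self.count == 0:
            return None
        frame = self.slots[self.read_idx]
        self.slots[self.read_idx] = None
        self.read_idx = (self.read_idx + 1) % len(self.slots)
        self.count -= 1
        return frame
```

In this model a frame with a wrong CRC never disturbs frames that are already waiting in other buffers, which is exactly what a single shared FIFO cannot guarantee.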

The two ports of the FIFOs have different clocks. This poses no difficulty, as the asynchronous FIFOs are read-safe and each FIFO is only written from one side (and only read from the other). So no extra synchronization circuit is necessary.

4.4.4 CRC

The design of the CRC generator was taken from the VHDL XSV Board Interface Projects of the University of Queensland, Australia[16].

It was then improved by Michael Lerjen at the Computer Engineering and Networks Laboratory, ETH Zurich, to process one byte each clock cycle. This is necessary since we are running the CRC generator with the slow network clock, compared to the Queensland project where it runs on the fast processor clock.

Whenever the signal CRCNewByte is asserted on a rising clock edge, a new byte is processed. In most cases, though, this signal is de-asserted between two consecutive bytes, as they are not read that fast from the FIFOs. But in the sender, when the last word is being transmitted, the speed has to be increased, and one byte is fed to the generator each cycle (cf. section 4.4.6).
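
For reference, a byte-at-a-time CRC can be written in software as follows. This is a sketch of the common reflected formulation of the Ethernet CRC-32 and not a transcription of the VHDL generator, whose internal bit ordering and null-byte flush are not shown here; the function names are hypothetical.

```python
def crc32_update(crc, byte):
    """Advance the CRC register by one data byte (reflected CRC-32)."""
    crc ^= byte
    for _ in range(8):
        crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    return crc


def crc32_frame(data):
    """CRC-32 over a whole byte string, one update per byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = crc32_update(crc, b)
    return crc ^ 0xFFFFFFFF
```

Each call to crc32_update plays the role of one assertion of CRCNewByte: the register advances by exactly one byte.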

4.4.5 Receiver

The basic ideas for the receiver (as well as for the sender) process have also been taken from the project at the University of Queensland.

The receiver waits until the PHY signals that new data has arrived. It then reads the data into the FIFO. At the current state of development, data is written to the FIFO in any case, i.e. there is no check whether the destination address matches the MAC address of the interface. In short, the interface is in promiscuous mode.

The state machine of the receiver is shown in figure 4-2.

[Figure: state diagram with the states Idle, ReceiveSFD, ReceiveDestMac, ReceiveSrcMac, ReceiveTypeField, ReceiveData, SignalData and SendCRC; loop counts 2, 12, 12 and 4; labels rx_dv = 1, rx_dv = 0, /resetCnt = 1, CRCNewFrame = 1 and intr = 1]

Figure 4-2: State machine of the receiver

Most of the states are more or less self-explanatory.

The signal rx_dv (receiver data valid) comes from the network PHY and is asserted as long as valid data is being sent to the card.

The numbers in the loops of the different states depict the number of clock cycles the state machine remains in these states before proceeding. There is an internal counter to implement this. Since the state machine receives one nibble of data from the PHY in each clock cycle, these numbers are twice the length in bytes of the corresponding field of the Ethernet frame.

The state SignalData is used to align the data at the end of the frame. In the FIFO, only 32 bit words are saved. But it is not specified that the length of an Ethernet frame has to be a multiple of 4 bytes. So when the signal rx_dv is de-asserted by the PHY, the remaining bytes are written to the FIFO and the rest of the word is filled with null bytes.
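
The padding performed in SignalData can be illustrated with a short software model. This is a sketch only: the function name is hypothetical, and placing the first received byte in the most significant byte of the word is an assumption, since the byte order inside a FIFO word is not specified here.

```python
def pack_words(frame):
    """Pack received bytes into 32 bit words, padding the last word with nulls."""
    padded = frame + bytes(-len(frame) % 4)
    return [int.from_bytes(padded[i:i + 4], "big") for i in range(0, len(padded), 4)]
```

A frame of 5 bytes thus occupies two FIFO words, the second of which carries three null bytes.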

While receiving, data is also sent byte-wise to the CRC generator. After all data has been received, the CRC could be checked and, in case it is wrong, the frame could be thrown away. This is not yet implemented. But the calculated CRC is also written to the FIFO at the end (after the data has been padded with null bytes), so it is also possible to do the CRC check in software.

As soon as all data is received, the receiver generates an interrupt to the CPU. This interrupt is synchronized so that it can be adjusted to the CPU’s needs.

Since all data is written to a FIFO, the CPU, while reading data from the FIFO, cannot tell when the end of a frame is reached. For this reason the receiver counts the number of bytes it sends to the FIFO. So when the receiver task is triggered by the interrupt handler, it can ask the receiver about the length of the frame so it can perform the correct number of reads on the FIFO to remove just the whole Ethernet frame.
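
The read sequence described above can be sketched as follows. The two callables stand in for accesses to the length register and to the read address of the card and are hypothetical; rounding the byte count up to whole 32 bit words is an assumption of this sketch.

```python
def read_frame(read_length, read_word):
    """Read exactly one frame from the receive FIFO.

    read_length() returns the byte count reported by the card,
    read_word() returns the next 32 bit word from the FIFO.
    """
    n_bytes = read_length()
    n_words = (n_bytes + 3) // 4  # round up to whole words (assumption)
    return [read_word() for _ in range(n_words)], n_bytes
```

Because the number of reads is derived from the reported length, words belonging to a following frame stay in the FIFO.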

As stated in the section about the FIFOs (cf. section 4.4.3), this design leads to problems as soon as there is more than one frame in the FIFO at a time.

While the case where frames arrive too fast can be dealt with in software, it gets harder in the case of a collision.

When a collision occurs and the jamming sequence is sent, only parts of the frame have been transmitted. So the frame is actually invalid, but parts of it have already been written to the FIFO. Now it is up to the software to recognize the corrupted frame and discard it.

4.4.6 Sender

The sender has to take data from the FIFO and send it to the PHY. It gets a signal from the CPU when there is data to be sent. The sender then reads this data, calculates the CRC and sends the data and CRC to the PHY.

The sender is implemented as a state machine as shown in figure 4-3.

[Figure: state diagram with the states Idle, SendPreamble, SendData, ShiftNibbles, CalcCRC, SendCRC, Wait and Interrupt; loop counts 9, 8, 15 and 16; labels doSendFrame = 1, fifo_empty = 1 and intr = 1]

Figure 4-3: State machine of the sender

Again, most states are self-explanatory. In the state SendData the sender reads a 32 bit word from the FIFO. The first nibble is immediately sent and the state machine proceeds to the state ShiftNibbles. There, the second nibble of the current byte is sent.

While sending, every byte is sent to the CRC generator. This is done in the state SendData. But at the end, before getting the final CRC, we have to send 4 null bytes to the CRC generator. So after the last 32 bit word has been read from the FIFO (the signal fifo_empty goes to 1), the state is changed to CalcCRC. There the speed at which bytes are sent to the CRC generator is doubled, so it is possible to send 4 additional null bytes while still sending the last 4 data bytes. Thus, when the last nibble is sent, the CRC is calculated and can be transmitted right away. This is then done in the state SendCRC.

The Ethernet specification says that after a successful send, the line has to be quiet for at least 12 byte times (24 cycles). To ensure this, the state Wait was introduced.

At the end, it is possible to generate an interrupt to tell the CPU that we have finished sending the current packet. This interrupt is generated, but it is not forwarded to the CPU.

This interrupt can be the same as the receive interrupt, but then it must be possible for the CPU to distinguish these two events. One idea is to use the still free memory address for this purpose. When the CPU receives the interrupt, this memory location states which of the two events really happened.

4.4.7 Possible Improvements

The most important possible improvements, as stated in the previous sections, are summarized in table 4-3.

Description: CRC
Problem: The receiver calculates the CRC of the incoming frame, but it delivers this frame regardless of the correctness of the CRC.
Possible Solution: The hardware checks for a correct CRC and delivers the frame only if the CRC matches. Optionally, this could be configurable. It was quite helpful for debugging purposes to be able to catch all frames, even those with a wrong CRC.

Description: Destination Address
Problem: The receiver does not perform address checking on incoming frames.
Possible Solution: Usually, network cards only accept frames which have the correct destination address (the address of the card, a multicast or the broadcast address). To deliver all frames, the card can be set to promiscuous mode.

Description: Collisions
Problem: Neither the sender nor the receiver care about collisions on the Ethernet. The sender just sends its frame and signals success. And the receiver receives data as long as the PHY signals valid data.
Possible Solution: Collision detection must be implemented in the sender and receiver. They are alerted by the PHY when a collision occurs. The sender should then wait and retransmit the frame, the receiver should delete the already received data.

Description: Memory
Problem: The network card does not have enough memory to store more than one Ethernet frame. With fast transmissions, this is not enough.
Possible Solution: The internal buffer space of the network card should be increased. There should be several buffers which are used in a circular fashion. Then the CPU has time to process one frame until this particular buffer space is needed again.

Description: FIFO
Problem: The received frames are stored in a FIFO. To read a frame, the driver has to read the same memory address again and again. It could be implemented more efficiently if the buffer was addressable in a linear way.
Possible Solution: Instead of taking FIFOs as buffer space between the address decoder and the sender/receiver, linearly addressable memory blocks should be used. Best would be to reserve a certain area in the RAM of the processor. This way the card can write data directly, and the driver doesn’t have to copy each frame again.

Description: Interrupt
Problem: The sender does not know when the network card has sent the frame because it just doesn’t get informed about that.
Possible Solution: The sender should generate an interrupt when a frame has been successfully sent. This interrupt tells the driver that there is a send buffer free for new data. (This also works with several send buffers. Then the driver knows about these buffers and fills them all. Then it waits for an interrupt before sending the next frame.)

Table 4-3: Possible hardware improvements

4.5 Software

The software that had to be written consists of a network driver for the network card described in section 4.4 and an application program that runs on the LEON processor. This program reads data from the network and sends it to the audio codec. That’s about all there is to do on the LEON processor. But to show that LEON is still alive, we implemented an additional simple task which prints a running clock on the console.

And finally we wrote a program on a Linux box which reads audio data from a file and sends it on to the network in the format the audio codec on the FPGA needs.

4.5.1 Streaming Data to LEON

To send data over the Ethernet network to LEON, a streaming protocol is needed. For this task, we used the UDP protocol to transport data to LEON. It is not necessary to take TCP with its handshake and retransmission features, because audio data must be ready at a certain point in time. If a packet is lost it makes little sense to retransmit it, as it will arrive too late, and there will be an audible break in the music anyway.

We actually tried to implement the stream using TCP. But the stack is not controllable by the software, meaning it sends packets on its own which cannot be prevented by any means except changing the protocol stack. After the handshake has taken place, the stack tries to figure out the maximum window size as fast as possible. When the first packet with data has been delivered it immediately sends another empty packet with just the ACK bit set. In the current configuration the network driver is thrown off track by this second packet (cf. section 4.4.3).

With UDP, on the other hand, a lost packet is just omitted when sending to the audio codec. In this case, the music just coughs once and then continues with the next data packet.


Even so, we implemented a simple flow control mechanism. The PC sends a packet with data to LEON, which writes it to the audio FIFO. Then the audio codec plays this data. As soon as LEON has written all data to the FIFO, it sends a packet back to the PC confirming the data just written. Then the PC sends the next chunk. With this protocol, it is simple to prevent LEON from being flooded with packets, and a possible buffer underrun is also efficiently handled.
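
The effect of this confirmation scheme can be shown with a small in-memory model. This is a sketch and not the actual sound program or LEON application: the function name and the queue standing in for the network are hypothetical, and packet loss is not modelled.

```python
from collections import deque


def run_stop_and_wait(chunks):
    """Simulate the PC sending one chunk per confirmation from LEON."""
    pending = deque(chunks)   # data still waiting on the PC
    link = deque()            # packets on their way to LEON
    played = []               # chunks LEON has written to the audio FIFO
    confirmed = True          # the PC may send the first chunk right away
    max_in_flight = 0
    while pending or link:
        if confirmed and pending:
            link.append(pending.popleft())
            confirmed = False
        max_in_flight = max(max_in_flight, len(link))
        if link:
            played.append(link.popleft())  # LEON writes the chunk to the FIFO
            confirmed = True               # and sends its confirmation
    return played, max_in_flight
```

The model shows the key property of the scheme: there is never more than one unconfirmed packet on its way to LEON.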

Another possibility would be to make the PC send data at exactly specified times. But then it would be necessary to calculate these intervals and stick very tightly to them. If the interval is only slightly too short, LEON will eventually be forced to drop data to prevent being overrun. With too long an interval LEON will eventually run out of data, leading to nasty breaks in the music.

4.5.2 Driver

The driver basically consists of the functions interrupt handler, sender daemon, receiver daemon and initialization.

The driver was written according to the RTEMS network manual[9]. Thus it resembles in large parts the example driver ’Generic 68360’.

The sender and the receiver are both RTEMS tasks. The sender gets a signal from the stack if it has to send data, the receiver gets a signal from the interrupt handler when new data has arrived.

During the whole processing in the network stack, the frame is saved in a special data structure called mbuf. The structure is quite complicated and contains different unions and other structs, but the important fields are a buffer to save the whole Ethernet frame, a pointer to the start of the data (which doesn’t have to be the start of the buffer), and a pointer to the start of the IP packet.

Receiver

The receiver first allocates a small pool of mbufs. Each mbuf gets external storage for exactly one Ethernet frame. It then waits for the interrupt. Once this has arrived, it first reads from the network card the number of 32 bit words the packet consists of. Then the actual data is read.

The data is written to the next available mbuf from the chain. There is a small pitfall here, since the Ethernet header has a size of 14 bytes. When the whole frame is just copied into the buffer, the IP packet starts on a non-aligned memory address, which causes LEON to trap. That’s why the header and the body are copied separately into the mbuf with a space of two bytes in between.
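
The alignment trick can be illustrated as follows. This is a sketch with a plain byte buffer instead of a real mbuf; the function name is hypothetical and the buffer is assumed to start on a 4 byte boundary.

```python
ETH_HEADER_LEN = 14  # destination MAC, source MAC and type field
GAP = 2              # padding so that the IP packet starts word aligned


def copy_into_buffer(frame):
    """Copy header and body separately, leaving a two byte gap in between."""
    header, body = frame[:ETH_HEADER_LEN], frame[ETH_HEADER_LEN:]
    body_offset = ETH_HEADER_LEN + GAP
    buf = bytearray(body_offset + len(body))
    buf[:ETH_HEADER_LEN] = header
    buf[body_offset:] = body
    return buf, body_offset
```

With the gap, the IP packet begins at offset 16, which is a multiple of 4, instead of offset 14.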

The network card delivers the Ethernet frame with the CRC field and the result of the CRC check included. So it is actually possible to program a packet sniffer which captures all packets, including such with a wrong CRC if the hardware doesn’t do CRC checking. This feature was particularly useful during development of the driver, because on the PC, the network card rejects packets with wrong CRC field, so the software never even gets these packets, and thus they are not visible. The only indication for their existence was the blinking LED on the hub in the test network.


After having copied the packet and marked the header and the body, the mbuf is passed on to the next higher layer of the stack.

Finally the used mbuf must be replaced and the counter shifted on to the next mbuf in line to receive a frame.

Sender

The sender waits for the stack to pass mbufs containing frames to send. Once the signal from the next layer has come, it first clears the sender FIFO.

The first mbuf from the packet is dequeued and interpreted. If it is the first mbuf (the flag M_PKTHDR is set) the length of the whole frame is saved for later use. The data is then copied into a sender buffer after it has been checked that there won’t be a buffer overflow. Finally, the mbuf is freed.

When all mbufs have been read, the accumulated length of the frame is checked against the advertised length from the first mbuf.

There is again a minor pitfall here. It is possible that the higher network layers produce mbufs with no data and the length field set to zero. But with the implementation here this should not pose a problem.

Now it is time to adjust the length. First we have to ensure that the minimum length requirement of an Ethernet frame is fulfilled. Second, we are sending words of 32 bits, but the length information is in bytes, so the length has to be converted to a number of words.
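
The length adjustment can be sketched as follows. The function name is hypothetical, and the minimum of 60 bytes assumes that the 4 byte CRC, which the card appends itself, is not counted; the standard minimum Ethernet frame is 64 bytes including the CRC.

```python
ETH_MIN_LEN = 60  # minimum frame length without the CRC (assumption, see above)


def words_to_send(length_bytes):
    """Number of 32 bit words to write to the sender FIFO for one frame."""
    length = max(length_bytes, ETH_MIN_LEN)  # pad short frames
    return (length + 3) // 4                 # round up to whole words
```

A 42 byte frame is therefore sent as 15 words, and a 61 byte frame as 16 words.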

As soon as all the data has been written to the FIFO the network card is signaled to begin sending.

The sender task has to adjust statistics and reset certain variables, before it can begin waiting for the next frame to send.

Actually, at that time it should wait for the interrupt the card generates when the frame is really sent.

4.5.3 Application on LEON

The application program on LEON is quite simple. It just reads data from the Ethernet and then writes the sound samples to the FIFO.

The sound transport protocol runs over UDP. In the packet, there is just raw data, as produced by sox (and probably the ADPCM coder). LEON takes these samples and sends them unaltered to the FIFO of the audio codec.

The only difficulty here is to use the correct byte order. The host computer is a PC running Linux and therefore byte order is little-endian. LEON is a SPARC implementation which is big-endian.
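
The byte order conversion for 16 bit samples can be sketched as follows. Where the swap is actually performed (on the PC or on LEON) is not stated here, so this is only an illustration with a hypothetical function name.

```python
import struct


def to_big_endian(samples_le):
    """Convert little-endian 16 bit signed samples to big-endian byte order."""
    n = len(samples_le) // 2
    values = struct.unpack("<%dh" % n, samples_le[:2 * n])
    return struct.pack(">%dh" % n, *values)
```

For each sample the two bytes simply change places, while the sample value itself stays the same.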

In spite of our flow control mechanism, there are still some packet losses. To improve the program and recover after a packet loss, the program starts a timer just before calling the blocking receiver function recvfrom(). So when no packet arrives in a certain time period, the timer function flushes the network FIFOs, resends the confirmation packet and resets the timer to a longer period (the interval is actually doubled). With this modification it is usually possible to simply recover from a packet loss.
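
The recovery logic can be sketched independently of RTEMS. The callables and the attempt limit are hypothetical stand-ins: try_receive(timeout) models a recvfrom() call guarded by a timer, and on_timeout() models flushing the FIFOs and resending the confirmation.

```python
def receive_with_backoff(try_receive, on_timeout, initial_timeout, max_attempts):
    """Wait for a packet, doubling the timeout after every miss."""
    timeout = initial_timeout
    for _ in range(max_attempts):
        packet = try_receive(timeout)
        if packet is not None:
            return packet, timeout
        on_timeout()   # flush the network FIFOs, resend the confirmation
        timeout *= 2   # the interval is doubled
    return None, timeout
```

Doubling the interval keeps the number of resent confirmations small if the PC stays silent for a longer time.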

Of course the data in the lost packet is also lost, but with audio streaming this is not really a problem since in this case the played music just coughs once before going on.

Even when sending ADPCM data a packet loss should not pose a problem. After such a loss the signal is off the correct values by a constant. This should make no difference in the played sound as long as the signal does not saturate.

When it finally saturates, it should produce some strange noise, but at the same time the offset by which it is off the correct signal diminishes. So when the signal leaves saturation again it is closer to the original than before. That means it is actually self-adjusting after a packet loss, with only minor implications on the sound quality.

In practice, playing ADPCM produced some strange results. After a packet loss, the sound sometimes grew louder, sometimes quieter, sometimes only one channel was heard afterwards.

The problem here is that the network card doesn’t work correctly on a packet loss.

To show that LEON is still alive, we wrote a task showing a running clock on the standard output. In between it checks standard input for user commands. That way it is possible to add arbitrary user controlled functionality to the program easily. We implemented some functions for debugging (such as ’show FIFO status’, ’flush FIFOs’, ’show network card statistics’).

A volume control was also implemented. When pressing ’+’, the volume is increased, when pressing ’-’ it is decreased.

The volume control is implemented by bitwise shifting the samples to the left (volume is positive) or to the right (volume is negative). So the amplitude of the samples is multiplied or divided by 2 with each level of volume.

This volume control only works for PCM data. When sending ADPCM the volume has to be set to 0, so that the samples are not changed. If the volume is not 0 when sending ADPCM data the result is only loud noise.
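
The shift-based volume control for PCM samples can be sketched as follows. The function name is hypothetical, and the clamping to the 16 bit range is an addition of this sketch; whether the original program clips or wraps on overflow is not stated.

```python
def apply_volume(sample, volume):
    """Scale a signed 16 bit sample by 2**volume using shifts."""
    if volume >= 0:
        shifted = sample << volume
    else:
        shifted = sample >> -volume  # arithmetic shift keeps the sign
    return max(-32768, min(32767, shifted))  # clamp (assumption of this sketch)
```

Each volume step changes the amplitude by a factor of 2, i.e. by about 6 dB.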

4.5.4 Application on PC

The program on the PC, which is called sound, has to stream the data to LEON. It opens two UDP connections, one to send data, the other to receive confirmations from LEON.

It then reads a chunk of data, sends this to LEON and waits for the confirmation.

For the two audio codecs we have to create two different formats of audio data, PCM and ADPCM data.

The sound application doesn’t distinguish between the two formats, so the distinction has to be made beforehand.

PCM data is just plain audio samples, 16 bit signed, stereo, with a sampling rate of 39063 Hz (this sampling rate is a fraction of the processor clock of 20 MHz: 39.0625 kHz · 2^9 = 20 MHz). But to create ADPCM data we need raw data which can then be converted using the ADPCM library[11]. Thus the first task is to create raw data.

This is something the sound program sox[17] can do. It reads wav files and converts them to raw data while at the same time adjusting the sample rate, including the application of the necessary lowpass filter. Since the function lowp of sox is only a lowpass of first order, the edge frequency is chosen at 85% of the new bandwidth.

sox -t wav infile.wav -r 39063 -s -w -t raw outfile.raw lowp 16600

Now we are able to play wav files on LEON. On the internet there are some radio stations that stream mp3 music². To be able to play these streams we first had to be able to send mp3 files to LEON. There are basically two possible solutions to this: to send mp3 data to LEON and decode it in hardware, or to decode it in software on the host and send raw data.

Playing raw data was already implemented, so we looked for a program to convert mp3 files to raw data.

The program mpg123 can either write wav files:

mpg123 -w outfile.wav infile.mp3

or, to be able to directly stream data through, it can also write raw data:

mpg123 -s infile.mp3 |
sox -r 44100 -s -c 2 -w -t raw - -r 39063 -s -c 1 -w -t raw - lowp 16602 |
sound -

The final step is to load the stream from the Internet and pipe it into mpg123. For this we ’misused’ the text based web browser lynx.

lynx -useragent='xmms/1.2.4' -source http://IP:PORT |
mpg123 -y -s - |
sox -r 44100 -s -c 2 -w -t raw - -r 39063 -s -c 1 -w -t raw - lowp 16602 |
sound - 1460

² http://www.shoutcast.com


5 Implementation of a dynamic reconfigurable System

In this chapter, we describe the implementation of the reconfigurable system we have developed in our thesis. We have targeted a specific application, which should demonstrate a partially reconfigurable system and the problems associated with it. Figure 5-1 illustrates this application.

[Figure: the CPU sends audio in format 1 to core 1, which drives the DAC (sound); after partial dynamic reconfiguration the CPU sends audio in format 2 to core 2, which drives the DAC (sound)]

Figure 5-1: Application of a dynamic reconfigurable System

We have a CPU core and a sound core on an FPGA. The sound core is connected to the CPU and receives audio data in a specific format. It then decodes the audio data and sends it to the DAC on the development board. The sound core can only decode and play one specific audio format. If we need to play another format, we replace the sound core by partial dynamic reconfiguration. One key point is that the CPU isn’t affected by the partial reconfiguration and continues to run.

To realize dynamic reconfiguration, we have used flow 4 described in section 2.4.4. We have implemented both versions of flow 4. To distinguish the two, we named the flow from figure 2-6 the Dynamic Routing Flow and the flow from figure 2-5 the Direct Copy Flow.

5.1 Partitioning

Our reconfigurable system consists of a static and a dynamic part (see Figure 5-2).

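[Figure: block diagram with LEON, the Interface and the I/O-Pads in the static part and the Virtual Component in the dynamic part; LEON and the Interface are connected by the Data Bus and the Address Bus, the Interface and the Virtual Component by the VC Signals]

Figure 5-2: Partitioning of our reconfigurable system: A static part with LEON and the Interface and a dynamic part with the Virtual Component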

LEON: The LEON SPARC Processor is the central component. Refer to chapters 3 and 4.

Interface: The interface is the key part for a successful reconfiguration. It has to be designed and physically constrained well, in order to allow the replacement of the Virtual Components.

Virtual Component: The Virtual Component is the only reconfigurable unit. It temporarily provides a certain function to the system like playing a specific audio format. It can be replaced on demand with another Virtual Component.

5.2 Interface

We had three choices to attach an extra component to the LEON processor (cf. figure 3-2):

• Connect to AMBA AHB bus:

The LEON processor uses the AMBA AHB bus for high-speed data transfers. There are currently two slaves attached to the AHB bus: the memory controller and the APB bridge.

• Connect to AMBA APB bus:

The AMBA APB bus is connected via the AHB/APB bridge to the AHB bus. Data transfers are slower than with the AHB bus but less complex.

• Connect to memory bus:

The LEON processor supports a special address space for memory mapped I/O. Memory mapped I/O devices can be attached to the address and data bus and be accessed the same way as memory.

Writing AHB and APB peripherals is quite complex, so we decided to use the simplest option, which is memory mapped I/O.

To avoid contention, we had to insert tristate buffers to and from the data bus. An address controller sets the control signals, depending on the specified address. To loosen the dependency of the Virtual Component on the processor, we inserted two FIFO queues (see Figure 5-3).

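[Figure: block diagram showing the Databus, FIFO a, FIFO b and the FIFO Handshake on the processor side, the CLB Macros in the middle, and the Virtual Component and the I/O-Pads on the other side; the signal groups VC Data In, VC Data Out, Control, Status, Handshake and Audio Signals are annotated with the bus widths 32, 32, 8, 8, 4 and 4]

Figure 5-3: Detailed view of the interface between the processor, external IOBs and the Virtual Component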

We wanted to design the interface to the Virtual Component as flexibly as possible. This means we should also allow applications other than just playing audio. You could think of a general computation unit which receives data from the processor, manipulates it and sends it back. The VC Data In and VC Data Out buses are intended for this purpose and lead through the FIFOs. Another pair of buses provides instant access to the Virtual Component (Control) and the possibility to read back a status (Status) without passing the FIFOs. For a more detailed explanation of the signals to the Virtual Component, refer to section 5.3.1.

As explained in section 2.4, the points where the Virtual Component is connected to the interface must be fixed and well defined, so that the JBits manipulation program can reconnect the replaced Virtual Component. For this reason, all signals to and from the Virtual Component lead through the CLB Macros. A CLB Macro does not add any logic to the system, it only lets the signal pass through, but it can be physically placed at a certain location on the FPGA. In section 5.4 we describe how we constrained the CLB Macros.

5.3 Virtual Components

In our application, a Virtual Component has to decode and play audio in a specific format. We have designed two of them, one playing PCM data and another one playing ADPCM. With these two replaceable cores, we show the mechanism of partial dynamic reconfiguration. There are only two components for now, but once you can replace the first one with the second, you can also replace it with a third or a fourth core. As long as all cores use the same interface and behave similarly concerning signal timing, the Virtual Components could implement any desired function.

5.3.1 VC Interface

The Virtual Components form a new level in the VHDL hierarchy. Figure 5-4 shows the schematic symbol of the entity and table 5-1 describes the input and output signals.

[Figure: schematic symbol of the entity with the ports clk, reset, datain (32), dataout (32), regin (8), regout (8), empty, full, rd_en, wr_en, mclk, lrck, sclk and sdin]

Figure 5-4: Virtual Component Entity Schematic Symbol

Signal   Type    Width  Description

datain   input   32     Data input from FIFO a
dataout  output  32     Data output to FIFO b
regin    input   8      Control input (direct)
regout   output  8      Status output (direct)
empty    input   1      input FIFO empty flag
full     input   1      output FIFO full flag
rd_en    output  1      input FIFO read enable
wr_en    output  1      output FIFO write enable
mclk     output  1      master clock (stereo codec)
lrck     output  1      left/right channel (stereo codec)
sclk     output  1      audio serial data clock (stereo codec)
sdin     output  1      audio serial data (stereo codec)
reset    input   1      reset signal
clk      input   1      clock signal

Table 5-1: Input and Output Signals of the Virtual Component Entity

Our two Virtual Components are read-only cores. They don’t need to return any data. Therefore the dataout, regout, full and wr_en signals are not connected inside the component.

5.3.2 PCM Player

The PCM Player is a simple yet well-sounding audio component. PCM means that the incoming data consists of uncompressed raw audio samples. So all that has to be done is to deliver this data in the right form to the stereo codec.

On our XSV800 board there is an AK4520A 20 bit Stereo ADC & DAC¹. Only four signals are needed in our case:

MCLK: Master Clock Input
LRCK: Left/Right Channel Input
SCLK: Audio Serial Data Clock
SDTI: Audio Serial Data Input

Figure 5-5: Timing for AK4520A Stereo Codec

Figure 5-5 shows the signal timing. LRCK and SCLK are derived from MCLK by division. The relation is:

f_MCLK = 256 · f_S

f_LRCK = f_S

f_SCLK = 64 · f_S

where f_S is the sampling frequency.

To generate these signals, we let a counter run with the system clock (20 MHz). We then assign:

mclk <= count(0);

sclk <= count(2);

lrck <= count(8);

This results in a sampling frequency of approximately 39.063 kHz. We could have achieved a more standard sampling rate (e.g. 44.1 kHz) by applying an external oscillator. But since we can pre-convert any audio streams to this rate, there is no need for it for now.
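
The resulting frequencies can be checked with a few lines of arithmetic. This sketch only assumes that the counter is a free-running binary counter incremented on every cycle of the 20 MHz system clock, so that bit n toggles with a frequency of f_clk / 2^(n+1); the function name is hypothetical.

```python
F_SYS = 20_000_000  # system clock in Hz


def bit_frequency(f_clk, bit):
    """Frequency of bit `bit` of a free-running binary counter clocked at f_clk."""
    return f_clk / 2 ** (bit + 1)


f_mclk = bit_frequency(F_SYS, 0)  # mclk <= count(0)
f_sclk = bit_frequency(F_SYS, 2)  # sclk <= count(2)
f_lrck = bit_frequency(F_SYS, 8)  # lrck <= count(8)
```

With these assignments f_lrck is 39062.5 Hz, and the ratios of 256 and 64 required by the codec hold exactly.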

To produce the serial audio output, we use the same counter to encode five states, which are described in table 5-2. Whenever a certain counter value is reached, the corresponding action is triggered.

¹ Analog Digital Converter, Digital Analog Converter.

Counter (binary)  Procedure

000000000         Latch right sample from FIFO into shift register: The 16 bit shift register is loaded with the lower 16 bits of the 32 bit input word.

x0xxxx001         Output Serial Audio Bit: For left and right channel 16 times, output the most significant bit of the shift register.

x0xxxx010         Shift left shift register: After the output, left shift register to process the next bit.

100000000         Latch left sample from FIFO into shift register: The 16 bit shift register is loaded with the higher 16 bits of the 32 bit input word.

110000000         Generate FIFO read pulse: After left and right sample are read from the input word, generate a read pulse to tell the FIFO to output the next word.

Table 5-2: Sequence to produce serial audio data
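
The sequence of table 5-2 can be replayed in a small software model of the 9 bit counter. This is a behavioural sketch and not the VHDL of the PCM Player; the function name is hypothetical and the counter patterns are taken from the table.

```python
def simulate_sample_period(word):
    """Run the counter through all 512 values for one 32 bit FIFO word."""
    shift = 0
    serial_bits = []
    read_pulses = 0
    for count in range(512):
        if count == 0b000000000:
            shift = word & 0xFFFF            # latch right sample (lower 16 bits)
        elif count == 0b100000000:
            shift = (word >> 16) & 0xFFFF    # latch left sample (higher 16 bits)
        elif count == 0b110000000:
            read_pulses += 1                 # tell the FIFO to output the next word
        # Patterns x0xxxx001 and x0xxxx010: only bit 7 and bits 2..0 matter.
        if count & 0b010000111 == 0b000000001:
            serial_bits.append((shift >> 15) & 1)  # output most significant bit
        elif count & 0b010000111 == 0b000000010:
            shift = (shift << 1) & 0xFFFF          # shift left for the next bit
    return serial_bits, read_pulses
```

Per sample period the model outputs 32 serial bits, 16 for each channel, and generates exactly one FIFO read pulse.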

5.3.3 ADPCM Player

The ADPCM Player is more complex than the PCM Player. Adaptive Differential Pulse Code Modulation (ADPCM) codecs are waveform codecs which, instead of quantizing the speech signal directly like PCM codecs, quantize the difference between the speech signal and a prediction that has been made of the speech signal. If the prediction is accurate then the difference between the real and predicted speech samples will have a lower variance than the real speech samples, and will be accurately quantized with fewer bits than would be needed to quantize the original speech samples.

The ADPCM Player can be divided into three stages (see Figure 5-6):

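[Figure: the split stage turns the 32 bit input into 4 bit nibbles, two decode stages turn the 4 bit nibbles into 16 bit samples, and the serialize stage (PCM Player) produces the 1 bit serial output]

Figure 5-6: ADPCM Player Stages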

1. Read a 32 bit word from the FIFO and split it up in 4 bit nibbles.

2. Decode ADPCM for left and right channel.

3. Serialize left and right PCM data to Stereo-Codec and synchronize left and right channel decoder.

Splitter

The 32 bit word from the FIFO has the format shown in Table 5-3. ’n1 (l)’ means that the first nibble is for the left channel.

31 27 0

n1 (l) n2 (r) n3 (l) n4 (r) n5 (l) n6 (r) n7 (l) n8 (r)

Table 5- 3: ADPCM word format

When there are words in the FIFO (indicated by the inactive empty signal), the left decoder is enabled for reading. On its request we load a word from the FIFO into a shift register, whose four highest bits (28 to 31) are directly connected to the left and the right decoder. Now the left decoder reads the first nibble and we enable the right decoder. On the right decoder’s request, we shift the register left by four bits so that the second nibble can be read. This continues in the same way until all eight nibbles are read (see the state diagram in Figure 5-7). Note that this does not synchronize the decoders. We presume that the decoders send their read requests alternately (see next section).

Figure 5-7: Finite State Machine to feed two ADPCM decoders with 4 bit nibbles (states 0 to 9: read the FIFO and load the shift register when it is not empty, then alternately enable the left and the right decoder and shift the register left by four bits after each read request)
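The data movement of the Splitter can be sketched in software as follows. This Java model only illustrates the shift register behaviour described above; it leaves out the handshake with the decoders, and the class and method names are hypothetical.

```java
// Software model of the Splitter's shift register (illustration only).
final class NibbleSplitter {

    /** Returns the eight 4 bit nibbles of a 32 bit word, n1 (bits 31 to 28) first. */
    static int[] split(int word) {
        int[] nibbles = new int[8];
        int shiftReg = word;
        for (int i = 0; i < 8; i++) {
            nibbles[i] = (shiftReg >>> 28) & 0xF; // the decoders see the four highest bits
            shiftReg <<= 4;                        // shift register << 4 after each read
        }
        return nibbles;
    }

    /** Nibbles n1, n3, n5, n7 (even array index) belong to the left channel. */
    static boolean isLeft(int nibbleIndex) {
        return nibbleIndex % 2 == 0;
    }

    public static void main(String[] args) {
        int[] n = split(0x12345678);
        for (int i = 0; i < n.length; i++) {
            System.out.println("n" + (i + 1) + (isLeft(i) ? " (l) = " : " (r) = ") + n[i]);
        }
    }
}
```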

Serializer

The Serializer is more or less the same as the PCM Player (see Section 5.3.2). The difference is that in this case the left and right channel samples are not read from the FIFO but directly from the two decoders. Since the decoders have a FIFO interface (see next section), the Serializer has to provide a full flag to tell the decoders when they can output the decoded samples. This signal is used to synchronize the two decoders. Both full signals (for the left and right decoder) are high most of the time. The two decoders have decoded a sample (they are much faster) and are waiting for the full signal to become low. When the Serializer wants to load its shift register with the left sample, it briefly deactivates the full signal for this decoder. Half a period later the full signal for the right decoder is deactivated, resulting in an exact alternation of the two decoders.





ADPCM Decoder

In the beginning we intended to insert an IP-core ADPCM decoder. But the only freely available ones were either too simple (low quality) or too complex. So we designed one ourselves.

Our ADPCM decoder is a hardware realization of the software decoder from Stichting Mathematisch Centrum in Amsterdam [11] and is fully compatible with their software encoder. The implemented standard is called Intel/DVI ADPCM, which is a 16 bit PCM to 4 bit ADPCM coder and decoder.

For details on the algorithm used, refer to the C source code. The code is freely available (see appendix D for the license). The following five steps are needed to convert ADPCM to 16 bit PCM.

1. Get new nibble (delta).

2. Update index: index = index + indexTable[delta]

3. Compute difference (vpdiff):

$vpdiff = \left(|delta| + \frac{1}{2}\right) \cdot \frac{step}{4}$

4. and new predicted value (pred):

$pred = pred + \mathrm{sign}(delta) \cdot vpdiff$

5. Update step value: step = stepTable[index]

Since |delta| has only three bits, the difference vpdiff (in step 3) is easily computed with one assignment and three summations:

1. vpdiff = step >> 3

2. if (delta(0) = 1) vpdiff = vpdiff + (step >> 2)

3. if (delta(1) = 1) vpdiff = vpdiff + (step >> 1)

4. if (delta(2) = 1) vpdiff = vpdiff + step
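The five steps and the shift-and-add computation of vpdiff can be summarized in a software sketch. The Java class below is a hypothetical illustration, not our hardware description. The table contents are the values commonly published for Intel/DVI (IMA) ADPCM and are quoted here from memory of the reference decoder, so they should be checked against the C source [11]. The initial state (pred = 0, index = 0) is an assumption.

```java
// Software sketch of one Intel/DVI ADPCM decoding step (illustration only).
final class DviAdpcmDecoder {

    // Index adjustment, addressed by the three magnitude bits of delta.
    static final int[] INDEX_TABLE = {-1, -1, -1, -1, 2, 4, 6, 8};

    // 89 step sizes (assumed standard Intel/DVI values).
    static final int[] STEP_TABLE = {
        7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
        19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
        50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
        130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
        337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
        876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
        2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
        5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
        15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
    };

    private int pred = 0;              // predicted value (16 bit signed range)
    private int index = 0;             // address into STEP_TABLE
    private int step = STEP_TABLE[0];  // current step size

    /** Decodes one 4 bit nibble (bit 3 = sign, bits 2..0 = magnitude). */
    int decode(int nibble) {
        boolean negative = (nibble & 0x8) != 0;
        int magnitude = nibble & 0x7;

        // Step 2: update index and keep it within 0..88.
        index = Math.max(0, Math.min(88, index + INDEX_TABLE[magnitude]));

        // Step 3: vpdiff = (|delta| + 1/2) * step / 4 with shifts and additions.
        int vpdiff = step >> 3;
        if ((magnitude & 1) != 0) vpdiff += step >> 2;
        if ((magnitude & 2) != 0) vpdiff += step >> 1;
        if ((magnitude & 4) != 0) vpdiff += step;

        // Step 4: new predicted value, saturated to the 16 bit range.
        pred = negative ? pred - vpdiff : pred + vpdiff;
        pred = Math.max(-32768, Math.min(32767, pred));

        // Step 5: update step value for the next nibble.
        step = STEP_TABLE[index];
        return pred;
    }

    public static void main(String[] args) {
        DviAdpcmDecoder decoder = new DviAdpcmDecoder();
        for (int nibble : new int[] {0x7, 0xF, 0x0}) {
            System.out.println(decoder.decode(nibble));
        }
    }
}
```

Note that the shifted terms are truncated individually, so the result can differ slightly from the exact product in step 3. This matches the behaviour of the shift-and-add scheme.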

With one Adder/Subtracter and five special purpose registers (see Figure 5-9), we can process one sample in five cycles. A Finite State Machine sets all the control signals needed in the data path. Figure 5-8 shows an abstraction of this state machine. In normal operation, it cycles through states one to five. Table 5-4 describes these states.

Our ADPCM decoder was designed to handle a FIFO interface, both for reading the delta nibbles and for writing out the PCM samples. There are four FIFO handshake signals to control the interaction between the decoder and the input and the output FIFO respectively. The empty and the full signal inform the decoder whether it can




Procedure                     Control Signals

1. delta <= fifo              deltaload <= '1'
   pred <= pred +/- vpdiff    predload <= '1'
                              add/sub <= not sign
                              amux <= "01"
                              bmux <= '1'
                              vmux <= '1'

2. index += indexTable        indload <= '1'
   vpdiff <= step >> 3        vpload <= '1'
   output enable              stepshift <= '1'
                              amux <= "00"
                              bmux <= '0'
                              vmux <= '0'
                              output_en <= '1'

3. vpdiff += step >> 2        vpload <= '1'
                              stepshift <= '1'
                              deltashift <= '1'
                              amux <= "1x"
                              bmux <= '1'
                              vmux <= '1'

4. vpdiff += step >> 1        vpload <= '1'
                              stepshift <= '1'
                              deltashift <= '1'
                              amux <= "1x"
                              bmux <= '1'
                              vmux <= '1'

5. vpdiff += step             vpload <= '1'
   step = stepTable           stepload <= '1'
   read request               amux <= "1x"
                              bmux <= '1'
                              vmux <= '1'
                              input_en <= '1'

Table 5-4: ADPCM Control Sequence (in pseudo VHDL)





Figure 5-8: Control Path State Diagram with Initial States (init, 5i, 1i and 2i), Normal Run States (1 to 5) and Wait States (empty and full).

read from the input FIFO (if it’s not empty) or write to the output FIFO (if it’s not full). With the input_en signal the decoder reads a new nibble from the FIFO, and with the output_en signal it writes the computed sample to the output FIFO. As can be seen from Figure 5-8, the decoder checks the empty signal before entering state five (where it sends the read request), and if the input FIFO is empty it enters the empty wait state. Similarly, when the output FIFO is full the decoder enters the full wait state.

Although the decoder can handle a FIFO interface, it is directly connected to other components without a FIFO in our design. This is possible as long as the handshake signals are handled.

The states init, 5i, 1i and 2i are initial states. States 5i, 1i and 2i have the same functionality as states 5, 1 and 2 except that they don’t process values from a previous cycle.

Adder/Subtracter

The Adder/Subtracter in the decoder is an IP module from LogiCORE. It has two 16 bit unsigned inputs and a 16 bit unsigned output. The add/sub signal controls whether a summation or a subtraction is performed. A high ofl signal indicates an overflow, i.e. the result exceeds the 16 bit bounds. In case of a subtraction, ofl is normally high and goes low if an underflow occurs.

Delta Shift Register

This shift register is loaded with the 4 bit ADPCM nibble. The first bit, the sign bit, will not be shifted and remains in the register until the next loading. The other three bits, the magnitude, are available for the Index Table until the signal deltashift goes high for the first time. Then the three magnitude bits will shift right. The delta0 signal is needed to control amux in the computation of vpdiff in states three to five. Before the first shift it has the value of delta(0). After the first shift it has the value of delta(1) and finally the value of delta(2). If delta0 is ’0’ in these states, the left multiplexer (amux) has a null vector as output, making the summation ineffective.





Figure 5-9: ADPCM Decoder Architecture (Data Path)





Step Shift Register

This register also has a special architecture for the computation of vpdiff in states three to five. The register has a 16 bit input and output, but the internal width is 19 bits. On stepload, the 16 bit output from the Step Table is loaded into the lower end of the shift register (see Table 5-5). The output consists of the 16 higher bits of the shift register. So the first value which appears on the output after loading the register is step >> 3. On each stepshift the register shifts left by one bit.

Operation   Register contents (19 bit, input loaded at the lower end)   Output
stepload    0 0 0 step                                                    step >> 3
stepshift   0 0 step 0                                                    step >> 2
stepshift   0 step 0 0                                                    step >> 1
stepshift   step 0 0 0                                                    step

Table 5-5: Step Shift Register States
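The behaviour of Table 5-5 can be reproduced with a few lines of software. The following Java model is only an illustration with hypothetical names; it assumes that the loaded step occupies the lowest 16 of the 19 register bits and that the output taps the highest 16 bits.

```java
// Software model of the 19 bit Step Shift Register (illustration only).
final class StepShiftRegister {

    private int reg; // only the lowest 19 bits are used

    /** stepload: load the 16 bit Step Table output into the lower end. */
    void load(int step) {
        reg = step & 0xFFFF;
    }

    /** stepshift: shift the register left by one bit. */
    void shift() {
        reg = (reg << 1) & 0x7FFFF;
    }

    /** The output consists of the 16 higher bits of the 19 bit register. */
    int output() {
        return (reg >>> 3) & 0xFFFF;
    }

    public static void main(String[] args) {
        StepShiftRegister r = new StepShiftRegister();
        r.load(34);
        for (int i = 0; i < 4; i++) {
            System.out.println(r.output()); // step >> 3, step >> 2, step >> 1, step
            r.shift();
        }
    }
}
```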

Index Table and Step Table

The Index Table and the Step Table are combinational ROM look-up tables. They simply output the defined value for an applied address. The Step Table only holds 89 values although its input width is seven bits. The other values are zero. The output format of the Step Table is 16 bit unsigned and that of the Index Table is 8 bit 2’s complement (values from −1 to 8).

Saturation and Number Format

We have to pay attention to two special summations:

index = index + indexTable

pred = pred ± vpdiff

Let us consider the first summation. We have to restrict index to a range of 0 to 88. So we must saturate if the summation exceeds this bound. Another problem is that the Index Table can have a negative output (−1) which we want to add to the unsigned index. The summation with −1 actually isn’t a problem if we only take 8 bits from the result. To prevent the index from going below zero or above 88, we compare the result with 88. If it’s greater than this value, there was either an overflow or an underflow. The highest overflow is 88 + 8 = 96. An underflow appears when 0 + (−1) = 255 (unsigned result). So we can inspect the eighth bit of the result. If it’s ’0’ then there was an overflow and we set index to 88; if it’s ’1’ we set index to zero.

The second summation, the computation of the output sample, also needs a special saturation arithmetic. The sample output is a 16 bit 2’s complement value with a range from −32768 to +32767. We can’t use this format for the summation because of the unsigned adder/subtracter. We shifted the whole scale up by 32768, mapping the lowest value to zero and the highest to 65535. We can thus detect an overflow or an underflow by inspecting ofl and add/sub. If both are equal and high, there was an overflow and we saturate on 65535; if they are both low, we saturate on zero. The conversion back to the 2’s complement format is simply done by inverting the highest bit.
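Both saturation rules can be checked with a short software sketch. The Java code below is a hypothetical model: it uses plain integer arithmetic to stand in for the ofl and add/sub signals of the adder/subtracter, and all names are chosen for this illustration only.

```java
// Software model of the two saturating summations (illustration only).
final class SaturationModel {

    /** index + indexTable, evaluated on the 8 bit sum as described in the text. */
    static int saturateIndex(int index, int tableValue) {
        int sum = (index + tableValue) & 0xFF;   // keep only 8 bits of the result
        if (sum > 88) {
            // highest bit '0': overflow (at most 96); highest bit '1': underflow (255)
            return ((sum & 0x80) == 0) ? 88 : 0;
        }
        return sum;
    }

    /** Converts between 2's complement and the scale shifted up by 32768. */
    static int toOffset(int sample) {
        return (sample ^ 0x8000) & 0xFFFF;       // invert the highest of the 16 bits
    }

    static short toTwosComplement(int offset) {
        return (short) (offset ^ 0x8000);
    }

    /** pred +/- vpdiff on the unsigned scale, saturated to 0..65535. */
    static int addSaturated(int predOffset, int vpdiff, boolean add) {
        int result = add ? predOffset + vpdiff : predOffset - vpdiff;
        if (result > 0xFFFF) return 0xFFFF;      // overflow during an addition
        if (result < 0) return 0;                // underflow during a subtraction
        return result;
    }

    public static void main(String[] args) {
        System.out.println(saturateIndex(88, 8));
        System.out.println(saturateIndex(0, -1));
        System.out.println(toTwosComplement(addSaturated(toOffset(32760), 100, true)));
    }
}
```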

5.4 Constraining the Design

Constraining the design is essential for partial reconfiguration. If we want to cut a block from one design and paste it into another one, we have to make sure that nothing in the static part changes (see section 2.4.4). This implies that the static and the dynamic part have to be spatially separated. The separation doesn’t only concern the placement of logic blocks but also the routing. The routes of the static part must not pass through the dynamic area, otherwise they may be disconnected after the reconfiguration.

We applied three different steps to constrain our design: floorplanning, guided routing and the insertion of CLB macros. The following sections describe these steps. Figure 5-10 shows a schematic of the result.

Figure 5-10: Constraining the Design: Floorplanning, Guided Routing and CLB Macros (for the Direct Copy Flow).

5.4.1 Floorplanning

With floorplanning we have defined a static and a reconfigurable area. We have used Xilinx’s Floorplanner and the UCF-Flow². This allowed us to write all settings in a UCF file. See appendix B for the UCF file of our design.

• Area constraints:

If the synthesis tool preserves the VHDL hierarchy, you can apply an area constraint to any entity within the design. We did this for the CPU, the Interface and for the Virtual Component.

• Manual placement:

After verifying the effects of floorplanning, we saw that some components didn’t obey the area constraints. These components were all tristate buffers (TBUFs), and it looked like they were placed arbitrarily over the FPGA and also in the reconfigurable area. The explanation of this behavior is that all TBUFs driving the same net have to be horizontally aligned (in the same FPGA row). We utilize TBUFs to connect the external RAM and our FIFO interface to LEON’s data bus. What we had to do then was to find all TBUFs which are connected to the data bus, including the ones inside the memory controller of LEON. Then we had to place them manually to fulfill both the horizontal alignment and our floorplanning ideas.

² UCF: User Constraints File.

5.4.2 Guided Routing

We have successfully constrained the placement of logic blocks with floorplanning. Unfortunately there aren’t any similar methods to constrain the routing. In particular, there are no constraints in current tools which prevent a route of the static part from passing through our reconfigurable area. If you have a net connecting a component in the static part on the left side with an IOB on the right side, the route may cause problems. We call the routes which could possibly pass through the reconfigurable area Disturbing Lines. One method to avoid Disturbing Lines is to arrange the IOB locations. If all pads of the static part and the static part itself are on the same half, no routes will pass through the other half. This method isn’t applicable in our case. We can’t freely choose IOB locations due to the fixed wiring on the development board.

We have evaluated four potential solutions for the Disturbing Line problem:

• Dynamic routing:

With the dynamic reconfiguration tool JBits, we first detect the Disturbing Line and remember its start and end points. Before we replace the Virtual Component, we unroute the Disturbing Line. After the replacement, we dynamically reconnect the line.

But this method causes other problems. Since the Disturbing Line is also in the static part, the partial reconfiguration will also undesirably affect this part.

Another problem is that a disconnection of a static route conflicts with the idea of dynamic reconfiguration. The static part should continue running during the reconfiguration. The third problem is that the routing algorithm of JBits is much simpler than the one of the conventional tools. For instance, no long lines are used. We have rejected this idea because of these problems.

• Manual routing:

Once we have implemented the design, we open it with the FPGA Editor and manually reroute all Disturbing Lines. We saw that sometimes it suffices to reroute a net with the autorouter. With luck, the new route doesn’t pass through the reconfigurable area.

Manual routing is extremely time-consuming. You have to assemble the route from short segments. If there is only one Disturbing Line this method would be an option, but not for more. We desire a method which can be integrated into an automatic flow.

• Anti core:

An anti core acts as a placeholder for the later dynamically inserted Virtual Component. One can instantiate the anti core in VHDL as a black box with the same connections as the real core. The anti core occupies most of the resources within a CLB, with the result that no other logic elements can be placed there. More information on anti cores can be found in [14].

It is not clear how many routing resources an anti core reserves for itself and whether this is enough to keep other routes from passing through this area. Creating an anti core actually needs an existing JBits core. But our Virtual Components are not JBits cores. They are written in VHDL and synthesized and implemented with conventional tools. There may exist a workaround to produce an anti core out of a netlist, but we didn’t pursue this approach.

• Guided routing:

The idea of guided routing is illustrated in Figure 5-11. We cannot constrain a route itself, but we can constrain any components connected to that route. We thus insert a pass–through CLB in each Disturbing Line and place it in a reasonable area. A pass–through CLB is actually only a look–up–table with one input and without any logic function. It simply passes the signal through. One can route up to four lines through one CLB (two slices with two look–up–tables each). This CLB can be placed with the usual area constraints in the UCF file. We have combined these CLBs into three routing blocks located around the reconfigurable area (see Figure 5-10). In section 5.4.3 we describe how to build a pass–through CLB macro.

To decide where to place the pass–through CLBs we still have to open the initial design in the FPGA Editor and inspect the Disturbing Lines. But we have to do this only once and not after every new implementation as in manual routing. The CLBs will add a short delay to the route. This is negligible for most designs, but should be taken into account for high–speed designs.

Figure 5-11: Guided Routing. (a) Initial Design with a Disturbing Line. (b) End Result after the Insertion of a pass–through CLB.




5.4.3 CLB Macros

To connect the Virtual Component with the static interface, we have also inserted pass–through CLBs. This is important for two main reasons:

1. With location constraints, we can place the pass–through CLBs at a fixed location. This allows the JBits manipulation program to find them and (re-)connect the replaced Virtual Component.

2. With pass–through CLBs we can split up a connection to the Virtual Component into a static and a dynamic part. Only the dynamic part (from the pass–through CLB to the Virtual Component) will be affected by the reconfiguration. The following situation explains the need for this. The Virtual Component has connections to IOBs on the left side. A direct connection would imply that the whole net will be reconfigured. But the route crosses the static part, which would also be affected by the reconfiguration. If a pass–through CLB is inserted, it can be placed close to the Virtual Component, keeping the dynamic part of the net short.

We have implemented two different connection types with pass–through CLBs: a single stage CLB Macro and a double stage CLB Macro. We have inserted the single stage CLB Macros with the Dynamic Routing Flow and the double stage CLB Macros with the Direct Copy Flow.

Single Stage CLB Macro

A single stage CLB macro is actually the same as a feed–through macro. It consists of only two slices and can feed four routes through it. We have used the FPGA Editor to create a macro which we could instantiate in our VHDL architecture. The following steps describe the procedure:

1. Open the FPGA Editor and choose File->New. Select Macro and enter a file name (e.g. nvc2.nmc). Select a part with the same architecture (size and package are unimportant).

2. Zoom in the Array Window and select the left slice in an arbitrary CLB. Choose Edit->Add to add a slice component. Do the same with the right slice.

3. Double–click the left slice to open the Block Window. Click the Begin Editing icon and edit the slice by clicking on the desired resources. Look–up–table functions can be added by clicking the F= icon. When finished, don’t forget to save the changes. Figure 5-12 shows the final slice. Edit the right slice in the same manner.

4. We now have to add macro pins to all the inputs and outputs. In the Array Window click on a pin and choose Edit->Add Macro External Pin. Give the external pin a meaningful name (without a preceding ’$’ as in the default name). We named the inputs and outputs from and to the interface iin<0>, iin<1>, iout<0> and iout<1>. The pins to the Virtual Component bear the names vin<0>, vin<1>, vout<0> and vout<1>.

5. Save the macro.





6. Instantiate the macro in the VHDL code with

    component nvc2
      port (vin  : in  std_logic_vector(1 downto 0);
            vout : out std_logic_vector(1 downto 0);
            iin  : in  std_logic_vector(1 downto 0);
            iout : out std_logic_vector(1 downto 0));
    end component;

7. For the implementation, place the .nmc file in the same directory as the synthesis netlist of the design (.edf).

Figure 5-12: Block Window of the FPGA Editor with a pass–through CLB slice.

The single stage CLB macros can be placed with normal location constraints in the UCF file.

Double Stage CLB Macros

The idea of the double stage CLB macros is to achieve a hard connection between the Virtual Component and the interface (see figure 2-6). Hard means that the inserted macro has prerouted nets (hard routed macro) which remain unaltered in the design. Normally a macro has only soft routes, which will be routed together with the whole design. The advantage of using hard routed macros as the connection between the Virtual Component and the interface is that the JBits manipulation program doesn’t have to use the routing function. The connection is guaranteed because all modules use the same macros and therefore the same routing resources. If an extracted module is pasted over an existing one, the new module uses the same connections as the old module.

A Double Stage Macro consists of four pass–through slices and four hard wired routes (see figure 5-13). A signal from the interface to the Virtual Component, for example, will go from iin over the top CLB pass–through slice and the hard wired route to the bottom CLB pass–through slice and to vout.

Figure 5-13: Double Stage CLB Macro (top CLB and bottom CLB, each with Slice 1 and Slice 0; pins iin<0>, iin<1>, iout<0>, iout<1>, vin<0>, vin<1>, vout<0> and vout<1>)

The procedure to create the Double Stage CLB Macro is very similar to the one for the Single Stage CLB Macro. What is new is the creation of hard wired routes. In the FPGA Editor this is done with the following steps:

1. Select the source pin of the route.

2. Press the Shift–Key and hold it while selecting the sink of the route.

3. Choose Edit->Add.

4. If Automatic Routing was enabled (in Main Properties) the net is already routed. If not, select the unrouted net and choose Tools->Route->Auto Route.

Because the routes are hard, the placement of the Double Stage CLB Macros must be done with caution. An error will occur if two misplaced macros use the same routing resource. In our case, we saw that no more than six macros can be placed one above the other. For the datain and dataout buses, for example (see section 5.3.1), we need 16 macros. We have arranged them in a 4 × 4 array.

5.5 Bitstream Manipulation with JBits

5.5.1 Introduction

JBits SDK is an Application Program Interface to the Xilinx configuration bitstream. This API permits Java applications to dynamically modify Xilinx Virtex bitstreams.

Figure 5-14 shows the design flow for partial reconfiguration. The whole idea behind partial reconfiguration is to make only the changes to a device that are necessary to bring it into a desired configuration. The partial reconfiguration model performs this function by determining the changes made between the last configuration sent to the device and the present configuration in memory. Then it must create a sequence of packets that will partially reconfigure the device. After all that, the model will mark the device and memory as synchronized and the process will start over again.
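The bookkeeping behind this model can be pictured with a small conceptual sketch. The Java class below is not JBits code and does not use the JBits API; it is a hypothetical model that assumes the configuration is organized in frames and that a frame is rewritten as a whole whenever any of its words differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual model of the partial reconfiguration bookkeeping (not the JBits API).
final class FrameDiffModel {

    private final int[][] device; // last configuration sent to the device
    private final int[][] memory; // present configuration in memory

    FrameDiffModel(int[][] initial) {
        device = copy(initial);
        memory = copy(initial);   // after reading a bitstream both are synchronized
    }

    /** Modifies one configuration word in memory only. */
    void setWord(int frame, int word, int value) {
        memory[frame][word] = value;
    }

    /** Determines which frames differ between device and memory. */
    List<Integer> dirtyFrames() {
        List<Integer> dirty = new ArrayList<>();
        for (int f = 0; f < memory.length; f++) {
            if (!Arrays.equals(device[f], memory[f])) {
                dirty.add(f);
            }
        }
        return dirty;
    }

    /** "Sends" the changed frames and marks device and memory as synchronized. */
    List<Integer> writePartial() {
        List<Integer> dirty = dirtyFrames();
        for (int f : dirty) {
            device[f] = memory[f].clone();
        }
        return dirty;
    }

    private static int[][] copy(int[][] src) {
        int[][] dst = new int[src.length][];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i].clone();
        }
        return dst;
    }

    public static void main(String[] args) {
        FrameDiffModel model = new FrameDiffModel(new int[4][2]);
        model.setWord(2, 1, 0xFF);
        System.out.println("frames to send: " + model.writePartial());
    }
}
```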

Figure 5-14: JBits Design Flow for partial Reconfiguration (Virtex bitstream from Xilinx tools, Java application with JBits SDK, partial bitstream, Virtex hardware)

A brief introduction into JBits is given in the JBits Tutorial [14].

5.5.2 Function Blocks

In this section, we explain the functions we have used in our JBits program for the Dynamic Routing Flow and the Direct Copy Flow.

Reading and Writing Bitstreams

To read a bitstream you have to execute the following commands:

jbits.read(<BitstreamFile>);

jbits.enableCrc(true);

jbits.clearDirtyFrames();

The last command tells JBits that the FPGA contains the same configuration and is therefore synchronized.

To write all the changes since the last clearDirtyFrames() to a partial bitstream, you can use:

jbits.writePartial(<partialFile>);

A full configuration can be written with:

jbits.write(<BitstreamOutFile>);

Using JRoute

JRoute uses a database called the Resource Factory. It stores whether a routing resource is used by a route or is available for new routes. This database has to be filled explicitly when you read in a new bitstream. This is done with:




ResourceFactory rf = ResourceFactory.getResourceFactory(jbits);

rf.fillResourceFactory();

Only JRoute uses and updates this database. If a connection is made with the set or the makeConnection command the routing resource has to be marked as used in the database.

Routing Table

We have used the object NetPinsList for our routing table. As the name says, it is a list of NetPins. These two classes are both from the RTP package. Although we don’t have RTP–cores, we use these helpful constructs as follows:

NetPinsList rtable = new NetPinsList();

NetPins net = new NetPins(null);

To add a source to a net:

net.addSource(<Pin>);

To add a sink to a net (the net can have multiple sinks):

net.addSink(<Pin>);

To add the net to the routing table:

rtable.add(net);

To add a complete net to the routing table, we use the trace function of JBits. If rtree is the traced Route Tree of a source, we can add the complete net with:

net = new NetPins(null);

net.addSource(rtree.getPin());

for (int k = 0; k < rtree.getBottom().length; k++) {
    net.addSink(rtree.getBottom()[k].getPin());
}

rtable.add(net);

Copy–Paste a Module

The extraction of a module from a bitstream is done with a special function, which reads all available resources of a source CLB and writes this configuration to a target CLB. The extraction–function was initially written by Phil James Roxby, one of the inventors of JBits. Our co–tutor Herbert Walder added the set–function to this class.

The set–function does not reserve the routing resources in the Resource Factory. If JRoute is used after this function, the database has to be updated with the fillResourceFactory() method.

Clock Distribution

The Xilinx Tools activate only those connections of the clock–net to the CLBs which are needed. The clock routing will not be copied with the above function, therefore we have to do our own clock distribution.

In the com.xilinx.tools package, there are functions which connect all CLBs within an area or the entire chip to a specific clock–net.

Finding LUT inputs

We saw that the input pins for the Single Stage CLB Macro are not the same for all macros (e.g. F1 or G1). The reason for this is that the place&route function in the Xilinx Tools rearranges the LUT inputs if a better routing can be achieved. To find the single input of a LUT in the Single Stage CLB Macro, we test the input muxes of the LUT. For example, to test if the F1 input of slice 1 is used, we do:

int[] S1_F1 = jbits.get(<row>,<col>,S1F1.S1F1);

if (!Util.Compare(S1_F1,S1F1.OFF)) {
    // used=true
}

Unused Connections

The unused connections are tied to ground. For the Dynamic Routing Flow we saw that an additional slice with a zero look–up–table outputs a ground–net, which is connected to the macro input. But this additional slice is not in the reconfigurable area, but right beside the macro. These connections need a special treatment, because they will not be copied by the copy–paste function.

We can detect such connections with the route tracer. If a route from a macro does not end in the reconfigurable area, it must be a ground–net. Instead of copying this net, we first remember which macro has this unused input. Then we directly set the corresponding look–up–table in the target bitstream to zero. The sequence to set the F LUT of slice 0 to zero is:

int[] nullLut = Util.IntToIntArray(0,16); // create 0 LUT output

nullLut = Util.InvertIntArray(nullLut); // has to be inverted

jbits.set(<row>,<col>, LUT.SLICE0_F, nullLut);

5.6 Partial Reconfiguration

The partial reconfiguration is done with a special PCI–I/O–Card. We had to reprogram the CPLD on the XSV800 to connect the parallel port to the SelectMap interface of the FPGA, since partial reconfiguration is not supported for the Slave Serial Mode [13]. With a user program which runs on the configuration PC, we can download full and partial bitstreams to the device.

5.7 Implementation Results

5.7.1 Dynamic Routing Flow

As a first approach we only downloaded the full configuration bitstreams produced with JBits. With this we wanted to show the functionality of our JBits program.

The result of the Dynamic Routing Flow was not overwhelming. We could download the full bitstream with the ADPCM–Player replaced by the PCM–Player, but it did not sound clean. Though the module played our PCM music, there was loud noise interference. We supposed that the datain bus was only partially connected, so we went back to the JBits program and did some signal tracing. But there, we could not find any inconsistency.

The opposite direction (from the PCM–Player to the ADPCM–Player) was even worse. We couldn’t even reconnect all the signals of the replaced module. Without apparent cause, the routing function claimed that a certain input pin was already in use and that it could therefore not route our net. But we had previously made a reverseUnroute call for this pin, and the tracing function did not detect any routing resource for this pin either. For these reasons we preferred the Direct Copy Flow, which does not use JRoute at all.

5.7.2 Direct Copy Flow

For the Direct Copy Flow we had no problems building the bitstreams, since the JBits program for this flow is much simpler. We also started by testing the full bitstreams. The bitstream with the ADPCM–Player replaced by the PCM–Player was almost perfect. We could not hear the noise any more. Sometimes there was still a crackling on the loudspeakers.

The bitstream with the PCM–Player replaced by the ADPCM–Player was not working correctly. When LEON sends the audio samples to the FIFO, it seems that they immediately disappear. The FIFO never got filled, even if we didn’t tell the module to start playing. So far we haven’t found the bug, which is why we concentrated on the other, working bitstream.

To replace the ADPCM–Player with the PCM–Player we have a partial bitstream which should apply only the changes needed, leaving the other part intact. We wrote the full bitstream with the ADPCM–Player to the device, loaded the software and verified it by playing ADPCM music. Then we sent the partial bitstream to the device. We saw that the reconfiguration was much faster than the full configuration. The result was the following behavior: The application on LEON did not crash, which was good news. But if we then started playing something (PCM or ADPCM) we heard only some sort of sawtooth sound. With this we showed that it did reconfigure something, but not the right way.

We couldn’t actually solve this puzzle. But from the JBits mailing list we heard that there are differences between version 2.8, which we used, and the older version 2.7 concerning the generation of partial bitstreams. We then ran our program with version 2.7 and the big surprise was that it worked! The reconfiguration successfully replaced the ADPCM–Player with the PCM–Player. Although only in one direction, we had a proof of concept for our methods.

5.7.3 Network

The network connection for our system (cf. chapter 4) was developed simultaneously with the reconfigurable part described in this chapter. We targeted an end system with both network connection and reconfiguration ability. We did not manage to integrate the network into the reconfigurable flow within the time limit of our thesis. We have two versions now: one version with network connection and streaming ability, but not reconfigurable, and one version without network connection, but reconfigurable. For the reconfigurable version we had to pre–load the audio data into the SRAM on the development board.

To integrate the network into the reconfigurable system, we will have to apply the constraints of section 5.4. Since the network uses eight Blockrams, which will inevitably be on the right side of the FPGA, floorplanning and guided routing will be quite demanding if Disturbing Lines are to be avoided.

5.7.4 Design Facts

LEON:

• Size: ∼ 3865 slices (41.1 % CLB usage with Virtex XCV800)

• Blockrams: 14 (50 % BRAM usage)

Virtual Components:

• PCM–Player: Size: ∼ 35 slices (0.4 %)

• ADPCM–Player: Size: ∼ 430 slices (4.5 %)

Interface:

• Size: ∼ 105 slices (1.1 %)

• Blockrams: 4 (14.3 %)
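The percentages above can be reproduced with a short calculation. The sketch assumes the XCV800 resource totals from the Virtex data sheet [1], namely a 56 x 84 CLB array with two slices per CLB and 28 Blockrams. These totals are not stated in this section and should be checked against the data sheet. Since the slice counts are approximate, a computed value may differ from the listed one in the last digit.

```python
CLB_ROWS, CLB_COLS = 56, 84   # assumed XCV800 CLB array size
SLICES_PER_CLB = 2            # assumed for the Virtex family
TOTAL_SLICES = CLB_ROWS * CLB_COLS * SLICES_PER_CLB
TOTAL_BRAMS = 28              # assumed XCV800 Blockram count


def percent(used: int, total: int) -> float:
    """Utilization in percent, rounded to one decimal place."""
    return round(100.0 * used / total, 1)


slice_usage = {"LEON": 3865, "PCM-Player": 35, "ADPCM-Player": 430, "Interface": 105}
bram_usage = {"LEON": 14, "Interface": 4}

for name, slices in slice_usage.items():
    print(f"{name}: {slices} slices = {percent(slices, TOTAL_SLICES)} %")
for name, brams in bram_usage.items():
    print(f"{name}: {brams} Blockrams = {percent(brams, TOTAL_BRAMS)} %")
```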

Hardware/Software Versions used

• Xilinx Foundation Series 3.1i (3.3–sp7) / 4.1i

• FPGA Express Xilinx Edition 3.6.0

• LEON1–2.3.7

• RTEMS–4.5.0–jg–2

• JBits2.7 / JBits2.8




Conclusions

We have built a dynamically reconfigurable system on an FPGA. The system consists of a static part with the LEON 32-bit CPU IP core and a dynamic part with a reconfigurable unit.

To implement the LEON CPU on a Virtex FPGA on the XSV800 development board, we have successfully adapted the VHDL configuration to our target architecture. We have also enabled important features of the development platform and attached them to the CPU.

As an operating system, RTEMS proved to be a good choice. It allowed us to write user applications in a convenient manner. We have written a network device driver, which connects our Ethernet interface to the operating system's network stack. The network interface we have implemented on the FPGA gives the processor a faster connection to an attached computer. An application on LEON can successfully establish a UDP connection and receive and send data over it. Although the network card is not yet capable of handling every possible cause of error, in a simple environment with no collisions it works flawlessly.

On top of this development environment, we have built a demonstrator of a dynamically reconfigurable system. We have implemented two dynamically reconfigurable units, which are both audio–codecs. The objective was to selectively stream PCM or ADPCM audio data from a PC over Ethernet to our FPGA, where depending on the incoming data either the PCM or the ADPCM codec would play the data stream. For the ADPCM module we have developed our own ADPCM decoder unit.

We successfully tested the static system (without reconfiguration). Both modules played the appropriate audio stream clearly. The network connection and the application on LEON also worked as we expected.

The next step was to dynamically replace the modules on the FPGA. To this end we have worked out a flow which allows us to design the entire system with the mainstream synthesis tools and then to use the bitstream manipulation tool JBits for the dynamic replacement. To enable this flow, we had to constrain the designs with floorplanning and the insertion of hard macros. We finally managed to dynamically replace one audio–codec with the other and could thereby demonstrate that our concept works.

Due to the time limit of our thesis, our system is not yet fully developed. For example, we did not manage to integrate the network interface into the dynamic flow. The dynamic reconfiguration itself could only be demonstrated in one direction (from ADPCM to PCM). Thus, we cannot consecutively replace the two modules as we envisioned. Since our flow depends heavily on JBits, which is still in development, we attribute some of the problems to bugs in this software.

In spite of these problems, we believe we have attained our goals. On the one hand we have built a versatile development platform containing the LEON CPU with various interfaces. On the other hand we could demonstrate a dynamically reconfigurable system.




Future Work

As mentioned before, our system is not yet fully developed. In this section we give some ideas to improve the current version.

LEON CPU

• Speed: The current version runs at 20 MHz. With dedicated floorplanning and minor changes in the VHDL code, it should be possible to achieve a noticeable speed-up for the CPU.

• New Versions: LEON is continuously improved by Gaisler Research [3]. We use LEON1-2.3.7. At the end of our thesis LEON2-1.0.2 was released. New versions could bring valuable improvements for our system, e.g. a DMA controller.

• Boot–Loader: Our version currently boots from a simple monitor program held in Blockrams or distributed memory on the FPGA. We then have to download the user application over the slow serial interface. A new boot concept would be very helpful. For example, the use of the on-board FLASH-PROM could be evaluated. It is also possible to download the user application over Ethernet into memory.

Network Interface

• Collisions: The current version does not handle collisions. Collided frames are still partially received. Therefore collision detection should be implemented.

• Destination Address checking: Currently the receiver does not check the destination address and accepts every frame.

• Frame Buffer: Instead of Blockram FIFOs, a more suitable buffer for the incoming and outgoing Ethernet frames should be evaluated. It would be helpful if more than one frame at a time could be stored.

• Interface to the CPU: It could be advantageous to attach the network interface as an AMBA AHB device to the CPU, to allow high speed data transfers, possibly also with a DMA controller.
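The destination address check proposed above could behave like the following sketch. It is written in Python for illustration only, since the actual receiver is a VHDL design. The function name accept_frame and the example MAC addresses are assumptions. The sketch relies on the standard Ethernet frame layout, in which the first six bytes after the preamble hold the destination address.

```python
BROADCAST = bytes([0xFF] * 6)
HEADER_LENGTH = 14  # destination (6) + source (6) + type/length (2)


def accept_frame(frame: bytes, own_mac: bytes) -> bool:
    """Accept a frame only if it is addressed to us or to everybody."""
    if len(own_mac) != 6:
        raise ValueError("a MAC address has exactly six bytes")
    if len(frame) < HEADER_LENGTH:
        return False  # too short to carry a complete Ethernet header
    destination = frame[:6]
    return destination == own_mac or destination == BROADCAST


# Hypothetical addresses for the FPGA board and for some other station.
board_mac = bytes.fromhex("020000000001")
other_mac = bytes.fromhex("020000000002")
header_tail = other_mac + b"\x08\x00"  # source address and type field

print(accept_frame(board_mac + header_tail, board_mac))
print(accept_frame(other_mac + header_tail, board_mac))
```

A hardware implementation would compare the address bytes as they arrive and could discard a foreign frame before it occupies the receive FIFO.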

Virtual Components

• Decoder State for ADPCM: If a chunk of the ADPCM stream gets lost, the decoder loses its state, resulting in strange effects like increasing or decreasing volume. If the decoder state could be set explicitly, this state could be sent at the beginning of every Ethernet frame.

• PCM Stereo Player: The PCM–Player is still mono and could be enhanced to stereo.

• Other Virtual Components: Our interface allows virtual components other than audio–codecs. For example, different dynamic accelerators could be implemented.
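The proposal to send the decoder state with every frame can be sketched in software. The decoder below follows the widely used Intel/DVI (IMA) ADPCM scheme on which the reference coder [11] is based. The three-byte frame header (16-bit predicted sample, 8-bit step index), the nibble order and all function names are assumptions made for this illustration and are not taken from our hardware design.

```python
import struct
from typing import List, Tuple

INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8]
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544,
    598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878,
    2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894,
    6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818,
    18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
State = Tuple[int, int]            # (predicted sample, step index)
HEADER = struct.Struct("<hB")      # hypothetical per-frame state header


def decode_nibble(code: int, state: State) -> State:
    """Decode one 4-bit code and return the updated decoder state."""
    predictor, index = state
    step = STEP_TABLE[index]
    diff = step >> 3
    if code & 4:
        diff += step
    if code & 2:
        diff += step >> 1
    if code & 1:
        diff += step >> 2
    predictor += -diff if code & 8 else diff
    predictor = max(-32768, min(32767, predictor))
    index = max(0, min(88, index + INDEX_TABLE[code]))
    return predictor, index


def decode_payload(payload: bytes, state: State) -> Tuple[List[int], State]:
    """Decode two samples per byte, high nibble first (an assumption)."""
    samples = []
    for byte in payload:
        for code in (byte >> 4, byte & 0x0F):
            state = decode_nibble(code, state)
            samples.append(state[0])
    return samples, state


def pack_frame(state: State, payload: bytes) -> bytes:
    return HEADER.pack(*state) + payload


def unpack_frame(frame: bytes) -> Tuple[State, bytes]:
    predictor, index = HEADER.unpack_from(frame)
    return (predictor, index), frame[HEADER.size:]


# Decoding the second frame alone gives the same samples as decoding the
# whole stream, because the frame carries the decoder state with it.
first, second = bytes([0x7F, 0x12]), bytes([0x89, 0x34])
all_samples, _ = decode_payload(first + second, (0, 0))
_, state_after_first = decode_payload(first, (0, 0))
state, payload = unpack_frame(pack_frame(state_after_first, second))
resumed, _ = decode_payload(payload, state)
print(resumed == all_samples[len(first) * 2:])
```

With such a header, a lost frame costs only the samples it contained, since the next frame restores the predictor and step index.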

Constraining

• Anti–Cores: We have not tried to use anti–cores to avoid disturbing lines. This method should also be taken into account.

• Other Development Board: The main reason for disturbing lines is the development board we use. If we could freely assign the IOBs of the FPGA, we could prevent most of the disturbing lines.

Dynamic Reconfiguration

• Debugging: We still do not know why the dynamic reconfiguration works with one design and not with the other. There is a difference in bitstream generation between JBits versions 2.7 and 2.8. Knowledge of this difference and of other bugs may be the missing key to a successful reconfiguration.

• Alternatives to JBits: It may be possible to replace the modules with tools other than JBits for the direct copy flow.




A LEON VHDL files

The order in which the files have to be added in Synopsys is as follows:

add_file -library WORK -format VHDL ../leon/amba.vhd

add_file -library WORK -format VHDL ../leon/target.vhd

add_file -library WORK -format VHDL ../leon/device.vhd

add_file -library WORK -format VHDL ../leon/config.vhd

add_file -library WORK -format VHDL ../leon/sparcv8.vhd

add_file -library WORK -format VHDL ../leon/iface.vhd

add_file -library WORK -format VHDL ../leon/macro.vhd

add_file -library WORK -format VHDL ../leon/bprom.vhd

add_file -library WORK -format VHDL ../leon/multlib.vhd

add_file -library WORK -format VHDL ../leon/tech_generic.vhd

add_file -library WORK -format VHDL ../leon/tech_virtex.vhd

add_file -library WORK -format VHDL ../leon/tech_atc25.vhd

add_file -library WORK -format VHDL ../leon/tech_atc35.vhd

add_file -library WORK -format VHDL ../leon/tech_fs90.vhd

add_file -library WORK -format VHDL ../leon/tech_umc18.vhd

add_file -library WORK -format VHDL ../leon/tech_map.vhd

add_file -library WORK -format VHDL ../leon/cachemem.vhd

add_file -library WORK -format VHDL ../leon/icache.vhd

add_file -library WORK -format VHDL ../leon/dcache.vhd

add_file -library WORK -format VHDL ../leon/acache.vhd

add_file -library WORK -format VHDL ../leon/cache.vhd

add_file -library WORK -format VHDL ../leon/apbmst.vhd

add_file -library WORK -format VHDL ../leon/ahbstat.vhd

add_file -library WORK -format VHDL ../leon/ahbtest.vhd

add_file -library WORK -format VHDL ../leon/ambacomp.vhd

add_file -library WORK -format VHDL ../leon/ahbarb.vhd

add_file -library WORK -format VHDL ../leon/lconf.vhd

add_file -library WORK -format VHDL ../leon/fpulib.vhd

add_file -library WORK -format VHDL ../leon/fp1eu.vhd

add_file -library WORK -format VHDL ../leon/ioport.vhd

add_file -library WORK -format VHDL ../leon/irqctrl.vhd

add_file -library WORK -format VHDL ../leon/clkgen.vhd


add_file -library WORK -format VHDL ../leon/mctrl.vhd

add_file -library WORK -format VHDL ../leon/rstgen.vhd

add_file -library WORK -format VHDL ../leon/timers.vhd

add_file -library WORK -format VHDL ../leon/uart.vhd

add_file -library WORK -format VHDL ../leon/div.vhd

add_file -library WORK -format VHDL ../leon/mul.vhd

add_file -library WORK -format VHDL ../leon/iu.vhd

add_file -library WORK -format VHDL ../leon/proc.vhd

add_file -library WORK -format VHDL ../leon/wprot.vhd

add_file -library WORK -format VHDL ../leon/mcore.vhd

add_file -library WORK -format VHDL ../leon/leon.vhd

Next are the files of the network interface:

add_file -library WORK -format VHDL ../leon/crcgenerator.vhdl

add_file -library WORK -format VHDL ../leon/ether_recv.vhdl

add_file -library WORK -format VHDL ../leon/ether_send.vhdl

add_file -library WORK -format VHDL ../leon/fifo.vhdl

For the audio codec, either the plain PCM codec is added:

add_file -library WORK -format VHDL ../leon/adpcm/vcaudio.vhd

or the ADPCM codec:

add_file -library WORK -format VHDL ../leon/adpcm/amux.vhd

add_file -library WORK -format VHDL ../leon/adpcm/control.vhd

add_file -library WORK -format VHDL ../leon/adpcm/deltareg.vhd

add_file -library WORK -format VHDL ../leon/adpcm/indexsat.vhd

add_file -library WORK -format VHDL ../leon/adpcm/indextable.vhd

add_file -library WORK -format VHDL ../leon/adpcm/mux.vhd

add_file -library WORK -format VHDL ../leon/adpcm/reg16.vhd

add_file -library WORK -format VHDL ../leon/adpcm/reg7.vhd

add_file -library WORK -format VHDL ../leon/adpcm/stepreg.vhd

add_file -library WORK -format VHDL ../leon/adpcm/steptable.vhd

add_file -library WORK -format VHDL ../leon/adpcm/predreg.vhd

add_file -library WORK -format VHDL ../leon/adpcm/decoder.vhd

add_file -library WORK -format VHDL ../leon/adpcm/vcadpcm.vhd

And finally the top designs:

add_file -library WORK -format VHDL ../leon/top.vhdl

add_file -library WORK -format VHDL ../leon/xsv800.vhd
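Since the listing differs only in the audio codec that is selected, it could be generated by a small helper script. The following sketch is an assumption about how one might do this and was not part of our flow. The function names are illustrative, and the file lists are deliberately abbreviated; the complete lists above would have to be filled in, in the same order.

```python
from typing import Dict, List


def add_file_commands(files: List[str], library: str = "WORK",
                      base: str = "../leon") -> List[str]:
    """Format one add_file command per file, preserving the given order."""
    return [f"add_file -library {library} -format VHDL {base}/{name}"
            for name in files]


# Abbreviated lists for illustration only; the order matters.
COMMON: List[str] = ["amba.vhd", "target.vhd", "leon.vhd"]
CODECS: Dict[str, List[str]] = {
    "pcm": ["adpcm/vcaudio.vhd"],
    "adpcm": ["adpcm/decoder.vhd", "adpcm/vcadpcm.vhd"],
}
TOP: List[str] = ["top.vhdl", "xsv800.vhd"]


def build_script(codec: str) -> str:
    """Return the add_file script for the selected audio codec."""
    if codec not in CODECS:
        raise ValueError(f"unknown codec: {codec}")
    return "\n".join(add_file_commands(COMMON + CODECS[codec] + TOP))


print(build_script("pcm"))
```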




B UCF Constraint File

# Timing constraints: (clock to 20 or 25 MHz)

NET "clk" TNM_NET = "clk";

TIMESPEC "TS_clk" = PERIOD "clk" 25 MHz HIGH 50 %;

# IOB Locations

...

# Manual Placed TBUFs (ext RAM data -> databus)

INST "TRIBUF0_31" LOC = "TBUF_R16C40.1" ;

































# Manual Placed TBUFs (VC Status / FIFO Handshake -> databus)

INST "bififo0/TRIBUF2_31" LOC = "TBUF_R16C45.1" ;














INST "bififo0/TRIBUF2_17" LOC = "TBUF_R9C45.1" ;

INST "bififo0/TRIBUF2_16" LOC = "TBUF_R9C45.0" ;
















# Manual Placed TBUFs (FIFO b -> databus)




































# Manual Placed TBUFs (databus -> FIFO a)































# Manual Placed BlockRAMs (FIFO)

INST "bififo0/fifob/B7" LOC = "RAMB4_R2C1" ;

INST "bififo0/fifob/B11" LOC = "RAMB4_R3C1" ;

INST "bififo0/fifoa/B7" LOC = "RAMB4_R0C1" ;

INST "bififo0/fifoa/B11" LOC = "RAMB4_R1C1" ;

# FIFO Area Constraints

AREA_GROUP "AG_bififo0" RANGE = CLB_R1C43:CLB_R16C50 ;

AREA_GROUP "AG_bififo0" RANGE = TBUF_R1C43:TBUF_R16C50 ;

INST bififo0 AREA_GROUP = AG_bififo0 ;

AREA_GROUP "AG_bififo0/fifoa" RANGE = CLB_R1C45:CLB_R8C50 ;

INST bififo0/fifoa AREA_GROUP = AG_bififo0/fifoa ;

AREA_GROUP "AG_bififo0/fifob" RANGE = CLB_R9C45:CLB_R16C50 ;

INST bififo0/fifob AREA_GROUP = AG_bififo0/fifob ;

# LEON Area Constraints

AREA_GROUP "AG_leon0" RANGE = CLB_R1C1:CLB_R56C40 ;

AREA_GROUP "AG_leon0" RANGE = TBUF_R1C1:TBUF_R56C40 ;


AREA_GROUP "AG_leon0" RANGE = RAMB4_R0C0:RAMB4_R13C0 ;

INST leon0 AREA_GROUP = AG_leon0 ;

# Virtual Component Area Constraints

AREA_GROUP "AG_vcaudio" RANGE = CLB_R24C55:CLB_R40C78 ;

INST "vctop0/vcaudio_1" AREA_GROUP = AG_vcaudio ;

# Guided Routing (get rid of disturbing lines)

INST "FTRAM2ADR_1" LOC = CLB_R1C41.*:CLB_R15C45.*;



INST "FTBAR_2" LOC = CLB_R1C41.*:CLB_R15C45.*;




INST "FTBAR_1" LOC = CLB_R51C41.*:CLB_R56C43.*;

INST "FTSER" LOC = CLB_R51C41.*:CLB_R56C43.*;

INST "TBUFRXD_1" LOC = TBUF_R52C82:TBUF_R56C84;

#INST "TBUFRXD_2" LOC = TBUF_R52C82:TBUF_R56C84;

# Manual Placement of Double Stage CLB Macros

# VC Data In / VC Data Out

INST "vctop0/VCIDATA_0" LOC = CLB_R6C78.*:CLB_R24C78.*;





INST "vctop0/VCIDATA_5" LOC = CLB_R7C77.*:CLB_R25C77.*;

INST "vctop0/VCIDATA_6" LOC = CLB_R7C76.*:CLB_R25C76.*;










# Control / Status / Handshake

INST "vctop0/VCIHS0" LOC = CLB_R7C56.*:CLB_R25C56.*;



INST "vctop0/VCIREG_0" LOC = CLB_R10C56.*:CLB_R28C56.*;
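The floorplan keeps the static and the dynamic part apart only if the top-level area groups do not overlap. The following sketch checks this for the CLB ranges given above. It is an illustrative helper and not part of the Xilinx tools. The parser covers only the simple CLB_RxCy:CLB_RxCy form used here, and the nested FIFO groups are left out of the check because they lie inside AG_bififo0 by design.

```python
import re
from itertools import combinations
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (first row, first column, last row, last column)
RANGE_RE = re.compile(r"CLB_R(\d+)C(\d+):CLB_R(\d+)C(\d+)")


def parse_range(text: str) -> Rect:
    """Parse a UCF CLB range such as 'CLB_R1C1:CLB_R56C40'."""
    match = RANGE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a CLB range: {text!r}")
    r1, c1, r2, c2 = map(int, match.groups())
    return min(r1, r2), min(c1, c2), max(r1, r2), max(c1, c2)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two inclusive rectangles share at least one CLB."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


AREA_GROUPS = {
    "AG_leon0": "CLB_R1C1:CLB_R56C40",
    "AG_bififo0": "CLB_R1C43:CLB_R16C50",
    "AG_vcaudio": "CLB_R24C55:CLB_R40C78",
}

rects = {name: parse_range(rng) for name, rng in AREA_GROUPS.items()}
conflicts = [(a, b) for a, b in combinations(rects, 2)
             if overlaps(rects[a], rects[b])]
print(conflicts)  # an empty list means the regions are disjoint
```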







C Installing and Compiling RTEMS

This is a short introduction to compiling and installing the RTEMS operating system and adding the network driver code.

We first installed LECCS, the compiler suite, into /opt/rtems/. It is necessary to add this path (/opt/rtems/bin) to the environment.

We then installed the RTEMS source into ~/rtems-4.5.0-jg-2. This directory is called the source directory in the following instructions.

For the installation, we created a separate working directory, ~/rtems_build/. It is important that these two directories share the same parent directory (in our case the home directory). Otherwise the compilation will fail.

We then configured the source. This is done in the working directory.

../rtems-4.5.0-jg-2/configure \

--prefix=/opt/rtems \

--target=sparc-rtems \

--enable-gcc28 \

--enable-posix \

--enable-networking \

--enable-cxx \

--disable-multiprocessing \

--enable-rtemsbsp=leon1 \

--disable-tests
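The configure step and the directory requirement can be captured in a small helper. This is a sketch under assumptions and not part of the RTEMS distribution: the function name is hypothetical, and it only assembles the command shown above after checking that the source and working directories share the same parent. Running the command (for example with subprocess.run in the working directory) is left to the reader.

```python
import os
from typing import List

CONFIGURE_OPTIONS = [
    "--prefix=/opt/rtems",
    "--target=sparc-rtems",
    "--enable-gcc28",
    "--enable-posix",
    "--enable-networking",
    "--enable-cxx",
    "--disable-multiprocessing",
    "--enable-rtemsbsp=leon1",
    "--disable-tests",
]


def configure_command(source_dir: str, build_dir: str) -> List[str]:
    """Build the configure call, to be executed inside build_dir."""
    source_dir = os.path.abspath(os.path.expanduser(source_dir))
    build_dir = os.path.abspath(os.path.expanduser(build_dir))
    if os.path.dirname(source_dir) != os.path.dirname(build_dir):
        raise ValueError("source and working directory need the same parent")
    script = os.path.join("..", os.path.basename(source_dir), "configure")
    return [script] + CONFIGURE_OPTIONS


print(" ".join(configure_command("~/rtems-4.5.0-jg-2", "~/rtems_build/")))
```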

Then the whole system is built by invoking make.

After this step, we have a complete version of the whole RTEMS operating system, but still without our network card driver.

We then patched the source directory with the patch patch_source and the working directory with the patch patch_work:

~/rtems-4.5.0-jg-2/ # patch -p1 < patch_source

~/rtems_build/ # patch -p1 < patch_work


Then the whole system is compiled again. This step is quite short as only the network card driver gets compiled and included in the libraries.

Finally, the system needs to be installed with make install, which requires root privileges since it installs into the directory /opt/rtems/.

Whenever the network driver has been updated, the source must be compiled again with make, followed by make install as root.




D Miscellaneous

Intel/DVI ADPCM coder/decoder Copyright

Copyright 1992 by Stichting Mathematisch Centrum, Amsterdam, The

Netherlands.

All Rights Reserved

Permission to use, copy, modify, and distribute this software and its

documentation for any purpose and without fee is hereby granted,

provided that the above copyright notice appear in all copies and that

both that copyright notice and this permission notice appear in

supporting documentation, and that the names of Stichting Mathematisch

Centrum or CWI not be used in advertising or publicity pertaining to

distribution of the software without specific, written prior permission.

STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO

THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND

FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE

FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES

WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN

ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT

OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.




Bibliography

[1] Xilinx Inc. Virtex 2.5V FPGAs. Data Sheets. http://www.xilinx.com/partinfo/ds003-1.pdf

[2] Xilinx Inc. Virtex Series Configuration Architecture User Guide. Application Note, XAPP151 (v1.5), September 27, 2000. http://www.xilinx.com/xapp/xapp151.pdf

[3] Gaisler Research. The LEON Processor User’s Manual. Version 2.3.7, August 2001. http://www.gaisler.com

[4] Andreas Haase. Untersuchungen zur dynamischen Rekonfigurierbarkeit von FPGA. Diploma Thesis, September 2001. Technische Universität Chemnitz-Zwickau.

[5] Asahi Kasei Microsystems Co., Ltd. (AKM). AK4520A – 100dB 20-Bit Stereo CODEC. Data Sheet. http://www.asahi-kasei.co.jp/akm/usa/product/ak4520a/ek4520a.pdf

[6] Xilinx Inc., LogiCore. Asynchronous FIFO. IP for CORE–Generator. http://www.xilinx.com/ipcenter/catalog/logicore/docs/async_fifo.pdf

[7] Xilinx Inc., LogiCore. Adder/Subtractor. IP for CORE–Generator. http://www.xilinx.com/ipcenter/catalog/logicore/docs/addsub.pdf

[8] Oar Online Applications Research Corporation. http://www.oarcorp.com

[9] Oar Online Applications Research Corporation. RTEMS Network Supplement. Edition 1, for RTEMS 4.5.0, 6 September 2000. http://www.oarcorp.com/rtems/releases/4.5.0/rtemsdoc-4.5.0/share/rtemsdoc/pdf/networking.pdf

[10] Intel Corp. Dual-Speed Fast Ethernet Transceiver. http://courses.ece.uiuc.edu/ece311/docs/datasheets/ethernet.pdf

[11] Jack Jansen, Centre for Mathematics and Computer Science. Simple 16-bit ADPCM coder and decoder. ftp://ftp.cwi.nl/pub/audio/adpcm.shar

[12] Andrew S. Tanenbaum. Computer Networks, Third Edition. Prentice Hall.

[13] Xilinx Inc. Xilinx Online Partial Reconfiguration FAQ. http://www.xilinx.com/xilinxonline/partreconfaq.htm

[14] Xilinx Inc. JBits Tutorial. JBits version [email protected]


[15] Stephan Schirrmann. Re: [leon_sparc] Xess board implementation. LEON mailing list. http://groups.yahoo.com/group/leon_sparc/message/1259

[16] Peter Sutton. VHDL XSV Board Interface Project. University of Queensland, Australia. http://www.itee.uq.edu.au/~peters/xsvboard/

[17] Chris Bagwell. Sound eXchanger, Swiss Army Knife of Sound Processing Programs. http://sox.sourceforge.net
