An Operating System
for Reconfigurable Computing
Research Thesis for the Degree of Doctor of Philosophy
By Grant Brian Wigley Bachelor of Engineering in Computer Systems Engineering (Hons), University of South Australia
Adelaide, South Australia
April 2005
Reconfigurable Computing Lab
School of Computer and Information Science
Division of Information Technology, Engineering and Environment
The University of South Australia
Abstract
Field programmable gate arrays are a class of integrated circuit that enable logic functions and
interconnects to be programmed in almost real time. They can implement fine grained parallel
computing architectures and algorithms in hardware that were previously the domain of
custom VLSI. Field programmable gate arrays have shown themselves useful at exploiting
concurrency in a range of applications such as text searching, image processing and
encryption. When coupled with a microprocessor, which is more suited to computation
involving complex control flow and non time critical requirements, they form a potentially
versatile platform commonly known as a Reconfigurable Computer.
Reconfigurable computing applications have traditionally had exclusive use of the field
programmable gate array, primarily because the logic density of the available devices has
been comparable to the size of a single application. But with modern FPGAs
expanding beyond 10 million system gates, and through the use of dynamic reconfiguration, it
has become feasible for several applications to share a single high density device. However,
developing applications that share a device is difficult as the current design flow assumes the
exclusive use of the FPGA resources. As a consequence, the designer must ensure that
resources have been allocated for all possible combinations of loaded applications at design
time. If the sequence of application loading and unloading is not known in advance, resource
allocation cannot be completed at design time because the availability of resources
changes dynamically.
The use of a runtime resource allocation environment modelled on a classical software
operating system would allow the full benefits of dynamic reconfiguration on high density
FPGAs to be realised. In addition to runtime resource allocation, other services provided by
an operating system, such as abstraction of I/O and inter-application communication, would
provide additional benefits to the users of a reconfigurable computer, potentially reducing
the difficulty of application development and deployment.
In this thesis, an operating system for reconfigurable computing that supports dynamically
arriving applications is presented. This is achieved by firstly developing the abstractions with
which designers implement their applications and a set of algorithm requirements that specify
the resource allocation and logic partitioning services. By combining these, an architecture of
an operating system for reconfigurable computing can be specified. A prototype
implementation on one platform with multiple applications is then presented which enables an
exploration of how the resource allocation algorithms interact amongst themselves and with
typical applications.
Results obtained from the prototype include measurements of the performance loss in
applications and of the time overheads introduced by the use of the operating system.
Comparisons are made with programmable logic applications run with and without the
operating system. The results show that the overheads are reasonable given the current state of
the technology of FPGAs. Formulas for predicting the user response time and application
throughput based on the fragmentation of an FPGA are then derived. Weaknesses that must be
rectified if an operating system is to become mainstream are highlighted in the current
design flows and in the architecture of current FPGAs. For the tool flows these include the
ability to pre-place and pre-route cores and to perform high speed runtime routing. For the
FPGAs they include an optimised network, a memory management core, and a separate layer
to handle dynamic routing of the network.
Contents
1 INTRODUCTION .................................................................................................1
2 RUNTIME SUPPORT FOR RECONFIGURABLE COMPUTING ........................6
2.1 Field programmable technology.............................................................................................................8
2.1.1 Introduction ..........................................................................................................................................8
2.1.2 Reconfigurable computing architectures ............................................................................................10
2.1.3 FPGA architectures.............................................................................................................................15
2.1.4 Conclusion ..........................................................................................................................................19
2.2 Abstractions, services and runtime systems ........................................................................................20
2.2.1 Services...............................................................................................................................................20
2.2.2 Prototypes ...........................................................................................................................................24
2.2.3 Evaluation...........................................................................................................................................26
2.3 Allocation and partitioning...................................................................................................................29
2.3.1 Allocation ...........................................................................................................................................29
2.3.2 Partitioning .........................................................................................................................................32
2.4 Reconfigurable computing design flow................................................................................................35
2.4.1 Traditional design flow.......................................................................................................................35
2.4.2 Runtime application design flow ........................................................................................................38
2.5 Applications and benchmarks for reconfigurable computers ...........................................40
2.6 Conclusion..............................................................................................................................42
3 METHODOLOGY ..............................................................................................43
3.1 Abstractions, architecture and design flow .........................................................................47
3.2 Resource allocation and application partitioning ...............................................................47
3.3 Operating system prototype and metrics.............................................................................48
3.4 Performance evaluation ........................................................................................................48
3.5 Conclusion..............................................................................................................................49
4 ABSTRACTIONS, ARCHITECTURE AND DESIGN FLOW .............................50
4.1 Abstractions ...........................................................................................................................................52
4.1.1 Process abstraction..............................................................................................................................52
4.1.2 Address space .....................................................................................................................................59
4.1.3 Inter-process communication..............................................................................................................61
4.1.4 Conclusion ..........................................................................................................................................67
4.2 Operating system architecture .............................................................................................................68
4.2.1 Previous reconfigurable computing runtime system architectures......................................................68
4.2.2 Proposed reconfigurable computing runtime system architecture ......................................................70
4.2.3 Sample application execution .............................................................................................................73
4.2.4 Conclusion ..........................................................................................................................................74
4.3 Algorithm specifications........................................................................................................................75
4.3.1 Runtime requirements for algorithms .................................................................................................75
4.3.2 Allocation ...........................................................................................................................................76
4.3.3 Partitioning .........................................................................................................................................76
4.3.4 Conclusion ..........................................................................................................................................77
4.4 New application design flow .................................................................................................78
4.5 Conclusion..............................................................................................................................80
5 RESOURCE ALLOCATION AND APPLICATION PARTITIONING..................81
5.1 Allocation ...............................................................................................................................................82
5.1.1 Survey of allocation literature.............................................................................................................82
5.1.2 Algorithm 1 – Greedy based...............................................................................................................84
5.1.3 Algorithm 2 – Bottom left ..................................................................................................................85
5.1.4 Algorithm 3 – Minkowski Sum ..........................................................................................................87
5.1.5 Algorithm performance.......................................................................................................................90
5.1.6 Algorithm selection ..........................................................................................................................100
5.2 Partitioning ..........................................................................................................................................101
5.2.1 Survey of partitioning literature........................................................................................................101
5.2.2 Algorithm 1 – Temporal partitioning................................................................................................103
5.2.3 Algorithm performance.....................................................................................................................104
5.3 Conclusion............................................................................................................................................107
6 OPERATING SYSTEM PROTOTYPE & METRICS ........................................108
6.1 Operating system prototype ...............................................................................................................110
6.1.1 Hardware platform............................................................................................................................111
6.1.2 Application architecture....................................................................................................................113
6.1.3 Primitive architecture........................................................................................................................114
6.1.4 ReConfigME implementation architecture .......................................................................................115
6.1.5 Sample application execution ...........................................................................................................121
6.1.6 Applications for ReConfigME..........................................................................................................124
6.1.7 Implementation issues ......................................................................................................................129
6.2 Metrics..................................................................................................................................................130
6.2.1 Response time...................................................................................................................................130
6.2.2 Throughput .......................................................................................................................................131
6.3 Conclusion............................................................................................................................................132
7 PERFORMANCE EVALUATION.....................................................................133
7.1 Experimental environment .................................................................................................................135
7.1.1 Benchmark application .....................................................................................................................135
7.1.2 Experimental configuration ................................................................................................136
7.2 Performance results.............................................................................................................................144
7.2.1 User response time............................................................................................................................144
7.2.2 Application throughput .....................................................................................................................150
7.2.3 Conclusion ........................................................................................................................................156
7.3 Predictor metrics .................................................................................................................................157
7.3.1 Response time...................................................................................................................................157
7.3.2 Application throughput .....................................................................................................................160
7.3.3 Comparison of fragmentation measure.............................................................................................162
7.3.4 Chance of allocation .........................................................................................................................163
7.4 Conclusion............................................................................................................................................167
8 CONCLUSION AND FUTURE WORK ............................................................168
8.1 Research contributions........................................................................................................................169
8.1.1 Summary of major contributions ......................................................................................................172
8.2 Suggestions for future work................................................................................................................173
9 REFERENCES ................................................................................................174
List of Figures
Figure 1: General FPGA Structure .............................................................................................9
Figure 2: Reconfigurable computer with a reconfigurable ALU..............................................11
Figure 3: Reconfigurable computer with a reconfigurable coprocessor...................................12
Figure 4: Loosely coupled reconfigurable computer ................................................................13
Figure 5: FPGA granularity examples ......................................................................................16
Figure 6: Granularity of an FPGA architecture ........................................................................17
Figure 7: Various FPGA logic allocation mechanisms ............................................................21
Figure 8: Two Dimensional Bin Packing .................................................................................30
Figure 9: Hardware Circuit Design Methodology ....................................................................36
Figure 10: A Summary of the methodology used in this thesis................................................46
Figure 11: The previous work, methodology and deliverables associated with this chapter ...50
Figure 12: Software Operating System Process .......................................................................53
Figure 13: Data flow graph.......................................................................................................55
Figure 14: Reconfigurable computing process abstraction.......................................................58
Figure 15: Classical operating system address space ...............................................................59
Figure 16: Reconfigurable computing address space abstraction.............................................61
Figure 17: Software inter-process communication abstraction ................................................62
Figure 18: Possible inter-process communication mechanisms ...............................................63
Figure 19: Processes of fixed size arranged in a fixed mesh topology orientated network......65
Figure 20: The on-chip network used in the reconfigurable computing inter-process
communication abstraction ...............................................................................................67
Figure 21: Client-Server model architecture ............................................................................69
Figure 22: The RAGE System Dataflow Architecture .............................................................69
Figure 23: Architecture of the operating system ......................................................................70
Figure 24: Allocation service....................................................................................................76
Figure 25: Hardware partitioning .............................................................................................77
Figure 26: The previous work, methodology and deliverables associated with this chapter ...81
Figure 27: Greedy based allocation ..........................................................................................85
Figure 28: The bottom left allocation algorithm process..........................................................86
Figure 29: The heuristic used to calculate the remaining rectangles........................................87
Figure 30: Minkowski Sum example........................................................................................88
Figure 31: Bottom left heuristic used with the Minkowski Sum..............................................90
Figure 32: The execution runtime of the greedy, bottom left and Minkowski Sum allocation
algorithms .........................................................................................................................93
Figure 33: A Fragmented FPGA...............................................................................................96
Figure 34: Fragmentation recorded for the typical, large and small sized applications ...........98
Figure 35: Temporal partitioning proposed by Purna.............................................................103
Figure 36: The execution runtime obtained from the partitioning algorithm.........................105
Figure 37: Previous work, methodology and deliverables associated with this chapter ........108
Figure 38: ReConfigME implementation architecture ...........................................................110
Figure 39: RC1000pp Block Diagram....................................................................................112
Figure 40: The RC1000pp ......................................................................................................113
Figure 41: Application architecture for ReConfigME............................................................113
Figure 42: Operating system primitive architecture ...............................................................115
Figure 43: Platform tier architecture.......................................................................................117
Figure 44: Architecture of ReConfigME’s Colonel ...............................................................118
Figure 45: Operating system tier ............................................................................................120
Figure 46: User tier architecture .............................................................................................120
Figure 47: Complete sample application in data flow graph format ......................................121
Figure 48: Handel-C code listing for add one data graph flow node......................................121
Figure 49: Java class file defining data flow graph structure .................................................122
Figure 50: ReConfigME data flow graph class structure .......................................................123
Figure 51: Status displayed before the allocation of the application......................................124
Figure 52: Status displayed after the allocation of the application.........................................124
Figure 53: Screen capture of the blob tracking application executing on ReConfigME........125
Figure 54: Allocation status of the FPGA when the blob tracking is loaded onto the FPGA by
ReConfigME...................................................................................................................126
Figure 55: Screen capture of the edge enhancement application executing on ReConfigME126
Figure 56: Allocation status of the FPGA when the edge enhancement is loaded onto the
FPGA by ReConfigME...................................................................................................127
Figure 57: Allocation status of the FPGA when the edge enhancement and the blob tracking
are loaded onto the FPGA by ReConfigME ...................................................................128
Figure 58: Screen capture from Xilinx Floorplanner verifying the locations of the applications
on the FPGA ...................................................................................................................128
Figure 59: Previous work, methodology, and deliverables associated with this chapter .......133
Figure 60: DES block architecture..........................................................................................136
Figure 61: Test case 1 floor-plan ............................................................................................139
Figure 62: Test case 2, set 1 (typical sized) floor-plans .........................................................139
Figure 63: Test case 2, set 2 (large sized) floor-plans ............................................................140
Figure 64: Test case 3 floor-plans...........................................................................................141
Figure 65: Test case 4, set 1 (typical sized) floor-plans .........................................................142
Figure 66: Test case 4, set 2 (large sized) floor-plans ............................................................143
Figure 67: The response time versus the number of partitions the application is divided into
for sets 1 (typical) and 2 (large) in test case 4 ................................................................149
Figure 68: The response time versus the number of applications already allocated onto the
FPGA for sets 1 (typical) and 2 (large) in test case 2 and 4 ...........................................150
Figure 69: Possible worst case signal delay...........................................................................151
Figure 70: The application throughput versus the number of partitions the application is
divided into for sets 1 (typical) and 2 (large) in test case 4 ............................................155
Figure 71 : The application throughput versus the number of applications already allocated
onto the FPGA for sets 1 (typical) and 2 (large) in test case 2 and 4 .............................155
Figure 72: The user response time versus the fragmentation percentage for both sets and all
test cases .........................................................................................................................158
Figure 73: User response time versus the adjusted fragmentation for both sets and all test
cases ................................................................................................................................159
Figure 74: A graph of application throughput versus fragmentation......................................160
Figure 75: User response time versus the adjusted fragmentation for both sets and all test
cases ................................................................................................................................161
Figure 76: Response time versus Walder fragmentation measure..........................................163
Figure 77: Signal delay versus Walder fragmentation measure .............................................163
Figure 78: A graph of the percentage success and failed allocation of applications in test case
2 and 3.............................................................................................................................166
List of Tables
Table 1: A Summary of Characteristics of Reconfigurable Computing Architectures ............14
Table 2: A summary of the characteristics relating to the use of an operating system for
reconfigurable computing .................................................................................................19
Table 3: Services Provided by Runtime System Prototypes.....................................................27
Table 4: Summary of common reconfigurable computing applications...................................40
Table 5: A research methodology suggested by Crnkovic .......................................................44
Table 6: Methodology paths used in this thesis ........................................................................45
Table 7: Evaluation of network topologies...............................................................................66
Table 8: A summary of the well-known allocation algorithms that appear in the research
literature ............................................................................................................................83
Table 9: Parameters of the applications used to measure the execution runtime of the
allocation and partitioning algorithms ..............................................................................92
Table 10: Number of applications allocated onto the FPGA....................................................94
Table 11: The average percentage increase in fragmentation for the algorithms compared to
each other..........................................................................................................................99
Table 12: Summary of partitioning algorithm runtime complexities .....................................102
Table 13: A summary of the metrics designed for reconfigurable computing operating systems
........................................................................................................................................130
Table 14: Parameters of the applications used in the response time and throughput
experiments.....................................................................................................................138
Table 15: User response time for test case 1 ..........................................................................146
Table 16: User response time for test case 2 ..........................................................................146
Table 17: User response time for test case 3 ..........................................................................147
Table 18: User response time for test case 4 ..........................................................................147
Table 19: Application throughput for the worst case and when the application was not under
the operating system control ...........................................................................................152
Table 20: Application throughput for test case 1....................................................................152
Table 21: Application Throughput for test case 2 ..................................................................152
Table 22: Application throughput for test case 3....................................................................153
Table 23: Application throughput for test case 4....................................................................153
Table 24: Results from an experiment to verify the allocation success formula ...................165
List of Equations
Equation 1: Minkowski Sum ....................................................................................................88
Equation 2: Walder Fragmentation Grade ................................................................................96
Equation 3: Example of Fragmentation Grade .........................................................................97
Equation 4: Fragmentation percentage .....................................................................................97
Equation 5: Fragmentation percentage ...................................................................................157
Equation 6: Linear equations for the response time versus fragmentation percentage for both
sets ..................................................................................................................................158
Equation 7: Adjusted fragmentation percentage for predicting user response time ...............159
Equation 8: Linear equations for the signal delay versus fragmentation percentage for both
sets ..................................................................................................................................161
Equation 9: Adjusted fragmentation percentage for predicting signal delay..........................162
Equation 10: The percentage chance of allocating a process .................................................164
Acknowledgements
I’d firstly like to thank my supervisor Dr David Kearney. I met David about six years ago
when he offered to supervise me in an honours project. Towards the end of the year it was
very clear that the project could turn into a great PhD, and he suggested I should seriously
consider it. Over the next five years and after much blood, sweat and tears, a PhD dissertation
was written. Throughout this period David was always there to discuss research ideas, proof
read my papers, discuss the weekly football results on a Monday morning, and of course edit
the dissertation. I am very grateful for the time he spent with me as it was far beyond what
was expected. I would also like to thank him for his contribution to my travel expenses which
allowed me to attend several international conferences where I was able to meet fellow
researchers who I could discuss my ideas with. Neither of us will ever forget the 2 weeks we
spent in La Grande Motte and Montpellier at FPL 2003. I would also like to thank my
associate supervisor Dr Oliver Diessel for all his help and guidance in the first few years of
my PhD.
Over the past five years I have worked with a great group of people within the Reconfigurable
Computer Lab including Martyn George, John Hopf, Mark Jasiunas, Ross Smith, Matthew
Pink, and Maria Dahlquist. I would like to thank you all for the many research discussions we
have had and friendship you have shown me over the years. I would also like to thank the
financial support provided to both myself and the lab by the Sir Ross and Keith Smith Trust
fund.
I would also like to thank the School of Computer and Information Science for the financial
support they have provided to both me and the Reconfigurable Computing Lab. Since I began
my PhD, the school has had three Heads of School: Andy Koronios, Brenton Dansie, and
David Kearney, who have always provided me with any equipment I needed to conduct my
research. Other academic staff members within the school I would also like to thank include
Sue Tyerman and Rebecca Witt for proof reading the thesis, and Jill Slay for giving me the
opportunity to teach parts of our offshore program in Hong Kong and Malaysia. Although this
did not directly contribute to my thesis, I gained valuable experience from it which I will take
into my professional academic career. I would also like to thank the general staff who manage
our office; without you guys the place would come to a standstill. A special mention goes to
Malcolm Bowes and Greg Warner for their brilliant system administration assistance and
friendship.
Completing a PhD can be very stressful at times, especially during the write up. But with the
help of great friends, I managed to find the necessary support when needed. I’d like to thank
Ali and Shania Darling, Stewart Itzstein, Hayley Reynolds, and Kate Tidswell who did just
that. Special thanks go to Wayne Piekarski and Hannah Slay. Wayne; thanks for your
friendship, advice, all those stories, the Hong Kong showroom trips and the thousands of
lunches we have had over the past five or so years. Completing a PhD at the same time as
you, made it that much more enjoyable. Hannah; what can I say? Work colleague, dive buddy,
boat captain, but most of all true friend. Thank you so much for your support, advice, and
friendship you have given me over the past 4 years. All those dive trips, the hundreds of
lunches, and all the gossip sessions we had really did give me a chance to forget about the
PhD just when I needed to.
Most importantly of all, I would like to thank my mum Glenda, dad Brian, my sister Kelly,
and my grandparents, for all the love, support, and financial assistance they have given me
over the past 27 years; without it I would not be writing this section in my PhD thesis. Mum,
thanks for all your help around home and all those late night dinners. Dad, thanks for getting
me to help you around the house, it made for a nice distraction when I needed it. Kelly, thanks
for telling me to keep at it all those times when I thought I’d had enough. Finally, my late
grandpa Jack Wigley always told me that one of the most important things in life is your
education. I never forgot that and it turned out that he was right; thanks Pa.
Grant Wigley
Declaration
I declare that this thesis does not incorporate without acknowledgment any material
previously submitted for a degree or diploma in any university and that to the best of my
knowledge it does not contain any materials previously published or written by another
person except where due reference is made in the text or published in my paper list below.
Grant Wigley
Adelaide April 2005
Author publications
1. Piekarski, W, Smith, R, Wigley, G, Thiele, N, Thomas, B, and Kearney, D., “Mobile Hand Tracking
using FPGAs for Low Powered Augmented Reality.” In 8th International Symposium on Wearable
Computers, Arlington, VA, Nov 2004.
2. Smith, R, Piekarski, W, and Wigley, G., “Hand Tracking for Low Powered Mobile AR User
Interfaces.” In 6th Australasian User Interface Conference, Newcastle, Australia, 2005.
3. Jasiunas, M, Kearney, D, Hopf J, and Wigley, G., “Fusion for Uninhabited Airborne Vehicles.” In 2nd
Field Programmable Technology (FPT’02), Hong Kong, China, 2002, IEEE Computer Society.
4. Wigley, G, Kearney, D, and Warren, D., “Introducing ReConfigME: An Operating System for
Reconfigurable Computing.” In 12th International Conference on Field Programmable Logic and
Applications (FPL’02), Montpellier, France, 2002, Springer.
5. Warren, D, Wigley, G, and Kearney, D., “Hardware Implementation of Geometric Hashing.” In 2nd
Field Programmable Technology (FPT’02), Hong Kong, China, 2002, IEEE Computer Society.
6. Wigley, G, and Kearney, D. “The Management of Applications for Reconfigurable Computing using an
Operating System”. In 7th Asia-Pacific Computer Systems Architecture Conference, 2002, ACS Press.
7. Wigley, G, and Kearney, D., “Research Issues in Operating Systems for Reconfigurable Computing.” In
International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA’02), 2002,
CSREA Press.
8. Wigley, G, Hopf, J, and Kearney, D., “Using Software Techniques when Developing Hardware
Applications”. In International Conference on Engineering of Reconfigurable Systems and Algorithms
(ERSA’02), 2002, CSREA Press.
9. Warren, D, Kearney, D, and Wigley, G., “Field Programmable Technology Implementation of Target
Recognition Using Geometric Hashing”. In International Conference on Engineering of Reconfigurable
Systems and Algorithms (ERSA’02), 2002, CSREA Press.
10. George, M, Pink, M, Kearney, D, and Wigley, G., “Efficient Allocation of FPGA Area to Multiple
Users in an Operating System for Reconfigurable Computing”. International Conference on
Engineering of Reconfigurable Systems and Algorithms (ERSA’02), June 2002, CSREA Press.
11. Wigley, G, and Kearney, D., “The Development of an Operating System for Reconfigurable
Computing”. In IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'01), Napa
Valley, CA, USA, April 2001, IEEE Press.
12. Wigley, G, and Kearney, D., “The First Real Operating System for Reconfigurable Computers”. In 6th
Australasian Computer Science Week (ACSAC’01), Gold Coast, Australia, January 2001. IEEE
Computer Society.
13. Diessel, O, and Wigley, G., “Opportunities for Operating Systems Research in Reconfigurable
Computing”. School Technical Report, March 2000.
14. Diessel, O, Kearney, D, and Wigley, G. “A Web-Based Multi-user Operating System for
Reconfigurable Computing”. 6th Reconfigurable Architectures Workshop (RAW’99). Springer,
December 1999.
Chapter 1 – Introduction
1 Introduction
In any electronic product development a choice must be made between using general-purpose
hardware, such as a microprocessor, and special purpose hardware, such as application
specific integrated circuits (ASICs). If a microprocessor is used as the target hardware, the application
will be written in software. For many applications such as word processing this is suitable.
However, there are many applications that require complex computations to be performed in
real time, and no microprocessor can achieve this. Such applications include automatic target
recognition (ATR) [72], active networks [127] and image processing [8]. These applications
are instead transferred into a custom hardware implementation, as it is well accepted that this
type of solution produces higher performance with a lower unit cost of production, but at the
price of high non-recurring engineering costs and often long system development cycles. For
many applications, producing an application specific mask is not economically viable as only
a few units may be required.
An alternative to a pure hardware implementation that avoids the disadvantages described
above is through the use of a field programmable gate array (FPGA). The key difference
between an FPGA and other chip technologies such as ASIC is that it can be configured by
the end user in the field. There is less risk involved when configuring applications in the field
because if a mistake is made, it is not necessary to wait weeks and spend a large sum of
money to fabricate a new device. The FPGA can simply be reconfigured with the corrected
application. Although an FPGA may not accelerate an algorithm as much as an ASIC would,
it has been shown that significant performance increases over software can be achieved. The
current popularity of these devices is reflected in the sales figures of FPGA vendors: in 2003,
Xilinx, which produces only FPGAs, was ranked third in overall sales among semiconductor
manufacturers [121].
Although FPGAs have provided considerable speedup to numerous algorithms, they are not
ideally suited to all types of applications. Some algorithms, such as floating point arithmetic,
cannot be efficiently implemented in an FPGA in the absence of “hardware” multipliers.
Other algorithms are best implemented with a combination of hardware and software to
optimise performance. To accommodate this type of application, FPGAs have been coupled
with microprocessors to form what is commonly known as a Reconfigurable Computer. This
provides the ability to exploit both the general purpose processor’s flexibility and the FPGA’s
capability to implement application specific computations.
The research domain of reconfigurable computing has become very popular over the past ten
years. At present there are at least five international conferences held each year which discuss
various topics of reconfigurable computing. The topic of operating systems for reconfigurable
computing has had increased interest over the past few years with several special sessions and
focus groups hosted at these conferences.
There have been many reconfigurable computing platforms proposed and built. Early attempts
include Garp [66], SPACE [63] and Pam [18]. These platforms primarily consisted of low
density FPGAs, small amounts of on-board memory, and a low bandwidth between the FPGA
and microprocessor. Due to limited resources, many of these types of reconfigurable
computers could only accommodate one application, resulting in a single task environment.
These applications were then able to be completely designed prior to loading onto the
reconfigurable computer as their execution order was known in advance.
With the development of reconfigurable computers containing FPGAs with in excess of 6
million system-gates, such as the RC2000 [32] and Bioler 3 [20], it is now feasible to consider
the possibility of sharing the FPGA between multiple concurrently executing applications.
This could potentially increase the resource usage of the expensive FPGA logic and decrease
response times so users will not have to wait for the FPGA to be completely available. The
multiple use of an FPGA depends on some form of runtime reconfiguration.
All SRAM-based FPGAs can be runtime reconfigured; that is, the context of the device can be
completely changed whilst the controlling software application continues to run. Partial
reconfiguration is an optimisation of runtime reconfiguration where only a specific part of the
FPGA is changed. This minimises the total reconfiguration time but requires all of the
applications to be stopped. Dynamic reconfiguration allows part of the FPGA to be stopped
and reconfigured whilst the remainder continues to operate unaffected. This allows
applications to be configured onto the FPGA without having to wait until the entire FPGA is
free or stop other executing applications. However none of these innovations provide
hardware resource allocation, a mechanism that is essential if an FPGA is to be shared
amongst applications. The use of dynamic runtime reconfiguration has gone part way to
providing this sharing support that would be needed in an operating system like environment.
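To make the missing mechanism concrete, the kind of runtime area allocation an operating system would layer on top of dynamic reconfiguration can be sketched as follows. This is a hypothetical toy model, not any of the allocation algorithms evaluated later in this thesis: the FPGA is treated as a grid of CLBs, each arriving application requests a rectangular region, and a simple first-fit scan places it.

```python
# Toy model of runtime FPGA area allocation (illustrative only): the device
# is a grid of CLBs, each application requests a w x h rectangle of cells,
# and a first-fit scan finds the first free position for it.

class FpgaAllocator:
    def __init__(self, cols, rows):
        self.cols, self.rows = cols, rows
        self.used = [[False] * cols for _ in range(rows)]

    def _fits(self, x, y, w, h):
        return all(not self.used[y + dy][x + dx]
                   for dy in range(h) for dx in range(w))

    def allocate(self, w, h):
        """Return (x, y) of the placed rectangle, or None if no space."""
        for y in range(self.rows - h + 1):
            for x in range(self.cols - w + 1):
                if self._fits(x, y, w, h):
                    for dy in range(h):
                        for dx in range(w):
                            self.used[y + dy][x + dx] = True
                    return (x, y)
        return None

    def free(self, x, y, w, h):
        """Release a previously allocated rectangle when its application unloads."""
        for dy in range(h):
            for dx in range(w):
                self.used[y + dy][x + dx] = False

fpga = FpgaAllocator(cols=8, rows=8)
print(fpga.allocate(4, 4))   # → (0, 0): first application placed
print(fpga.allocate(4, 4))   # → (4, 0): second application shares the device
print(fpga.allocate(8, 8))   # → None: no room while others are resident
```

A real allocator must also contend with fragmentation of the remaining free area as applications load and unload at unpredictable times, which is precisely the effect the fragmentation metrics discussed in later chapters quantify.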
Consider the following application. An unpiloted aerial vehicle (UAV) is an aircraft without
an on-board pilot that is flown autonomously or via radio link. These aircraft are usually small
and have limited power and payload capabilities. During a flight, a UAV may require several
diverse algorithms including data encryption and target recognition. Traditionally these
algorithms would have all been configured onto the FPGA at one time even if they were not
used concurrently. This results in the use of an FPGA with a much higher logic density than
would be required if it could be shared amongst the multiple algorithms. Through the use of
an operating system, a much lower logic density FPGA can be used as the algorithms can be
configured onto the FPGA when needed, resulting in smaller and less demanding
reconfigurable computers.
Such a scenario raises several interesting research questions that have not been deeply
investigated. How is logic area to be allocated to FPGA applications at runtime? Are there
any suitable algorithms that already exist in other research domains that support this? Can an
application be divided into a number of parts to better fit the FPGA surface without
significant impact on application performance? If so, what is a suitable partitioning
algorithm? What abstractions are necessary for designers to write applications for such an
operating system? What would the application design flow be, and would the existing one be
modifiable to suit? Are popular FPGA architectures suitable for an operating system and if
not what is needed for them to support it? These are the major questions that will be addressed
in this thesis. The remainder of this thesis is structured as follows.
Chapter 2 organises the previous work in the field into several major themes. Initially, a
review of reconfigurable computing and FPGA architectures is given with the aim of
determining the most suitable FPGA and reconfigurable computer to be used with an
operating system. Previous attempts at sharing the FPGA amongst multiple applications are
then outlined. The current allocation of FPGA resources, application partitioning techniques
and previous attempts at describing and building runtime systems for reconfigurable computers
are detailed. The current design flow used in reconfigurable computing application
development is then reviewed in relation to its suitability to support dynamic resource
allocation. Finally, suitable applications and benchmarks that appear in the research literature
that can be used in conjunction with an operating system environment will be investigated.
In chapter 3, the methodology used in this thesis is presented. The methodology consists of
four phases. Firstly, an operating system for a reconfigurable computer is conceptualised. This
results in a set of abstractions and operating system architecture for a reconfigurable
computer. Secondly, a set of suitable algorithms that might be used to implement the
architecture is specified. The top performing algorithms according to selected criteria will
then be implemented. Thirdly, a prototype and set of metrics that can be used to measure its
performance are described. The prototype is described by implementing all of the previously
suggested abstractions, architecture and algorithm specification. The set of metrics are
selected by reviewing application literature to determine what hardware designers perceive
application performance to be. Finally, the performance of the prototype is evaluated and
characterised. This is achieved by selecting a benchmark application, building a test
environment with which to use the prototype, and performing a series of experiments.
Following this, a series of correlations are made between the results.
In chapter 4, the operating system for reconfigurable computing is conceptualised. The
abstractions, algorithm specifications, operating system architecture and application design
flow are presented. The abstractions that make the conceptual framework of the operating
system are developed by making comparisons to software operating systems. The algorithm
specifications that will ultimately implement these abstractions are presented. From this, an
architecture of an operating system for a reconfigurable computer is outlined. Finally, an
investigation is conducted into whether the current reconfigurable computing application
design flow can be used or modified to suit the proposed operating system architecture.
In chapter 5, suitable algorithms that perform FPGA area allocation and application
partitioning are presented. These algorithms are selected from a range of algorithms initially
evaluated for approximate complexity to rule out those that are unsuitable. Selected
algorithms are then implemented and their performance measured against particular micro-
metrics. From these results, the most suitable allocation and partitioning algorithms can be
determined.
Chapter 6 presents a prototype operating system known as ReConfigME. This consists of the
previously described abstractions, algorithms, and architecture and is demonstrated running
three applications at runtime. A set of metrics that can be used to measure the performance of
the operating system and its applications will then be presented. This is achieved by
investigating what application designers perceive application performance to be and
determining if any software metrics can be transferred into the reconfigurable computing
domain.
Chapter 7 involves using the prototype to evaluate its performance. This is achieved by
selecting a suitable benchmark application, building a set of test cases for use in a series of
experiments, and carrying out the experiments to measure the impact the operating system has on
the user response time and application performance. From the results obtained from the
experiments, correlations are made to determine if there are any connections between the
measured metrics. A series of formulas are then derived to predict the likely performance
before an application is loaded.
Chapter 8 concludes this thesis by summarising the key contributions made by the author and
presents possible future work.
Chapter 2 – Runtime support for reconfigurable computing
2 Runtime support for reconfigurable computing
From the early beginnings of development with programmable hardware, such as
programmable logic devices, the goal has always been to capture some of the flexibility in
conventional software based systems, while retaining the algorithmic speedup that hardware
provides [79]. A Field Programmable Gate Array (FPGA) [26] is one such programmable
logic device. An FPGA is a silicon chip made of an array of configurable logic blocks. Once
programmed, these configurable logic blocks form hardware circuits. FPGAs have been
coupled with general purpose processors
and memory to form a potentially versatile platform commonly known as a Reconfigurable
Computer [40].
A unique feature of some reconfigurable computers is dynamic reconfiguration [71]. This
allows a hardware application to be configured onto the FPGA without having to stop the
execution of other resident applications. As FPGA density increases beyond 10 million
configurable gates [108], and with the use of dynamic reconfiguration, it is becoming more
feasible for several applications that once required one FPGA each, to share a single high
density device. However, the current design flow has almost no support for allocating FPGA
resources to dynamically arriving applications. As a consequence the designer must ensure
that resources have been allocated for all possible combinations of loaded applications at
design time. However if the types of applications to be loaded are not known at design time, it
is not feasible for all resource allocation to be performed as the availability of resources
changes dynamically over time.
The use of a runtime environment loosely modelled on the traditional software operating
system [126] may overcome this problem. As reconfigurable computing applications arrive
(consisting of both software and hardware), the runtime system would need to provide
services such as FPGA, microprocessor and memory resource allocation, communication
between the software and the hardware parts of the application, and general housekeeping
duties. While dynamic reconfiguration is necessary for an operating system, it is not
sufficient on its own to support the sharing of a reconfigurable computer.
This chapter reviews previous work associated with these issues and is structured into themes.
The first theme is configurable hardware. In this section an investigation into the numerous
types of reconfigurable computing platforms and FPGA architectures proposed in the
literature is presented. From all of these proposals, the most suitable platform and device
architecture for use in an operating system environment will be selected. The second theme is
prototype runtime systems for reconfigurable computing applications. In this section it will be
shown that although there are many runtime systems previously developed, they are usually
only responsible for trivial FPGA configuration and data transfer between the host and FPGA.
It is shown that the issues of resource allocation have not been deeply explored in published
prototypes. The third theme is algorithms for allocation and partitioning that might be used in
the proposed operating system. In this section the use of allocation and partitioning algorithms
for further increasing the usage of the FPGA are detailed. It is shown that there have been
very few attempts to apply resource allocation to the reconfigurable computing domain. The
fourth theme is design flow. In this section, it is shown that the current design flow used for
reconfigurable computing applications has limitations when applied to a dynamic runtime
reconfigurable environment. It will also be shown that of the few alternative design flows
proposed, none fully address all the issues of designing reconfigurable computing applications
for a shared environment. The final theme is on applications and benchmarks for
reconfigurable computers. In this section it will be shown that few benchmarks have been
developed for use with an operating system. It will also present the types of applications that
are best suited to an operating system environment.
2.1 Field programmable technology
In this section both field programmable technologies (FPT) and reconfigurable computers
(RC) will be discussed with the aim of determining the most suitable platform for use with an
operating system. This will be achieved by firstly classifying reconfigurable computers
according to their degree of coupling between the FPT and the microprocessor, and secondly
classifying FPT according to their logic granularity. From this the most suitable category of
reconfigurable computer and FPT for use in an operating system is identified. It will be shown
that the most suitable platform consists of a large number of FPT logic cells arranged in
medium granularity on a commercial FPGA chip, coupled to a custom von Neumann
processor using a high speed general purpose bus and via a local bus to bulk commodity
RAM. Such a platform will be shown to have the capacity to run multiple applications of
general interest with acceptable performance and to have a justified need and capability for
runtime resource allocation.
2.1.1 Introduction
An FPT device [34] can be configured in the field by the end user [95] to create a digital logic
hardware solution. These devices have become very popular over the past decade as they do
not require the high engineering costs and long manufacturing lead times that application
specific integrated circuits (ASICs) [21] do. In this thesis we are interested in them because
they allow reuse and sharing of the device by different applications. Field programmable
technology devices primarily include programmable logic devices [115] and FPGAs [26].
Programmable logic devices (such as CPLDs and PALs) will not be considered in this thesis
as their density is relatively small compared with the modern FPGA and there is little
advantage in sharing a low density device amongst multiple applications.
An FPGA consists of an array of uncommitted circuit elements and interconnect resources
and is configured by the end user through a form of hardware programming. Figure 1 gives an
overview of the general structure of an FPGA. The logic cell on an FPGA is often referred to
as a configurable logic block (CLB) and performs the logic operations of the application,
usually implemented via a k-way lookup table and a flip flop for state storage. The routing
matrix connects the CLBs together using a specific structure that may consist of local and
chip length wires. The I/O cells, often referred to as I/O blocks, connect directly to the pins of
the device and are used to read and write signals from outside the chip.
Figure 1: General FPGA Structure
FPGA manufacturers have developed a variety of hardware technologies for programming.
Some make chips with fuses that are programmed by passing a large current through them.
These types of FPGAs are called one-time programmable (OTP) because they cannot be
rewritten internally once the fuses are blown [36]. Other FPGAs make the connections using
pass transistors controlled by configuration memory. One type of FPGA resembles an
EPROM or EEPROM: it is erasable and must be placed in a special programming socket to be
reprogrammed. Most manufacturers now use static RAM to control the pass transistors for each
interconnection, thus allowing the FPGA to be rapidly reconfigured any number of times.
With this rapid reconfigurability, some FPGAs provide support for partial and/or runtime
reconfiguration [71]. This is the idea of changing the configuration of an FPGA whilst its
computation is still in progress. Some FPGAs even support dynamic runtime reconfiguration
where only a portion of the FPGA is reconfigured while the remaining part continues to
execute. In this thesis we will only consider SRAM reprogrammable FPGAs, because if an
FPGA cannot support dynamic runtime reconfiguration, as is the case for fuse-based or
EPROM type FPGAs, there is little advantage in using an operating system to manage it as its resources can
only be allocated at compile time. This prevents applications from sharing the FPGA at
runtime.
Although the volatility of its configuration storage was initially considered a weakness, the
in-system reprogrammability of SRAM-based FPGAs led to the new Reconfigurable
Computing paradigm [132]. It is generally agreed in the literature that a reconfigurable
computer (RC) is a computing machine that incorporates programmable logic devices to
create a hardware architecture that may be modified at runtime [122] [95] [130]. In common
with the original conception [55] of a machine that can have a fixed architecture driven by
software and a variable architecture via programmable interconnect of user definable logic
cells, a reconfigurable computing machine should include a general purpose processor. This
provides the ability to exploit both the general purpose processor’s flexibility and the
reconfigurable processor fabric’s capability to implement application specific computations.
2.1.2 Reconfigurable computing architectures
In this section there is a review of reconfigurable computing platform architectures with the
aim of selecting the most suitable for use in applied research for an operating system
environment. The platforms are divided into three categories according to the coupling
between the microprocessor and reconfigurable processing unit. The three categories are then
evaluated against a set of criteria to determine the most suitable one for use in an operating
system environment. These criteria include:
• The availability of parallelism
• The need for custom design tools not currently available
• The potential for runtime resource allocation
• The overall density and speed of the programmable logic, and
• The commercial availability of platforms.
There have been many attempts to categorise reconfigurable computing platforms. A
commonly accepted classification is either a tightly-coupled or a loosely-coupled platform
[49]. A tightly-coupled reconfigurable computer has the reconfigurable processing unit
integrated into the general purpose processor internal buses, whereas a loosely-coupled
reconfigurable computer is connected via a general purpose bus. Wolinski [135] extended this
to include new coupling architectures. This led to three categories: a reconfigurable ALU in
which arithmetic operations can be customised, a coprocessor where special instructions can
be diverted from the integer unit to the attached coprocessor, and a loosely-coupled
configuration where the processor transfers the data for the programmable logic via an I/O
bus such as PCI.
Reconfigurable processing unit as a replacement ALU
Reconfigurable instruction set processors (see Figure 2) have been used in a variety of
applications, but now attract renewed interest as they have been shown to increase the
performance of some multimedia applications [103]. In this type of reconfigurable computer
the arithmetic operations can be customised by reconfiguring the ALU; for example, an add
instruction can be replaced with add modulo 3. Due to the customised instruction set, general
interest applications would need to be recast in terms of the specific instruction set. There is a
lack of commercially available platforms and design tools with which to build the
applications and a prototype. There are only a few offerings, including Elixent [53] and
PACT [15], with several more proposed in academia, including GARP [66] and Kress [85].
The ability to perform runtime resource allocation on such architectures is very limited. The
instruction sets could be swapped at the same time as switching the application in a multi-
threaded way. However this could be incorporated into the context of a traditional operating
system by increasing its size to include the custom instruction set in use for each thread. This
would still result in sequential processing so the advantage of real hardware parallelism is still
lost.
Figure 2: Reconfigurable computer with a reconfigurable ALU
Reconfigurable processing unit as a coprocessor
It has been shown in the past that an increase in performance can be gained when a von
Neumann based architecture is coupled with a coprocessor, often a floating point processor.
Similarly, in reconfigurable computing a performance increase can be achieved
when a microprocessor is coupled with a reconfigurable processing unit (see Figure 3). Unlike
a reconfigurable ALU, where the instruction set can be customised, in a coprocessor
architecture the special instructions are diverted from the integer unit to the attached
reconfigurable processing unit for execution in programmable logic.
The microprocessor either exists within the same fabric as the reconfigurable processing unit,
commonly known as a hard core microprocessor, or is configured onto the programmable
logic, known as a soft core. Hard core microprocessors have been demonstrated through
several commercially available platforms containing devices such as Xilinx Virtex II Pro
(PowerPC) [141] and Altera Excalibur (ARM) [3]. Soft core microprocessors include the
Xilinx Microblaze [137] and Altera Nios [5]. However, as the fabric is shared between the
microprocessor and the programmable logic, the processor clock speed is usually lower than
that of a modern von Neumann processor, and less logic is available to the user as some is
consumed by the hard or soft core microprocessor.
Figure 3: Reconfigurable computer with a reconfigurable coprocessor
Reconfigurable processing unit coupled to the I/O system bus
A reconfigurable processing unit coupled to a general purpose processor via an I/O bus is
commonly known as a loosely coupled architecture (see Figure 4). This type of architecture is
very common and commercially available (for example the RC2000 [32], BenOne [105] and
Virtual Computer [29]). There are also research versions of these including PRISM-I [9],
SPACE 2 [63] and Pilchard [92]. The main advantage of this type of architecture is the ease
of constructing a system using recent technology high speed microprocessors and large gate
count reconfigurable processing units [14]. Although it has been argued in the past that the
bandwidth between the attached processor and the reconfigurable processing unit is too low
[69], improvements to the performance of standard buses (for example, the 64 bit, 133 MHz
PCI-X bus provides bandwidth in excess of 1000 MB/s) have increased the effective data
transfer rate. The loosely coupled architecture also offers good prospects for runtime resource
allocation due to the use of a high density FPGA.
Figure 4: Loosely coupled reconfigurable computer
Summary
To determine which of the three reconfigurable computing platform architectures should be
used in an operating system environment, a set of criteria was chosen to rank them according
to particular characteristics. These criteria are as follows.
1. The availability of the platform to support parallelism; without this, most of the
applications would need to be redesigned, resulting in a possible drop in performance.
2. The commercial availability; part of this thesis involves measuring the actual
performance of the operating system and the applications executing under it.
3. The need for developing custom design tools; there is a better chance of the operating
system being accepted within the programmable logic domain if the current design
flow and tool set can be used. This also alleviates the need for any tool set
development.
4. The overall density and speed of the programmable logic and microprocessor; the
density of the programmable logic must be able to have modern hardware applications
configured onto it and the microprocessor must have the necessary clock speed to
execute the software part of the application.
5. The possibility that the architecture can support runtime resource allocation; as the
goal of the operating system is to support multiple runtime applications, the ability to
perform resource allocation is essential.
6. The suitability of the platform to support the general interest applications that will be
used under an operating system environment; the architecture must minimise the
performance loss imposed on applications by the introduction of the operating system.
Table 1 shows a comparison of each of these three architectures, ranking them on how well
they perform in each of the criteria described above.
                                                Reconfigurable   Reconfigurable    Loosely
                                                ALU              Computer with     Coupled
                                                                 Coprocessor
Availability for parallelism                    Low              Medium            High
Commercial availability of platforms            Medium           Medium            High
Need for custom design tools not
currently available                             Medium           Medium            Low
Overall density and speed of programmable
logic and processor                             Low              Medium            High
Possibilities for runtime resource allocation   Low              Medium            High
Supports general interest reconfigurable
computing applications                          Low              Medium            High

Table 1: A Summary of Characteristics of Reconfigurable Computing Architectures
The architecture of the reconfigurable computer that appears most attractive for an operating
system is the one based on the loosely-coupled microprocessor and reconfigurable processing
unit. A platform with such architecture is readily commercially available with a fast clock rate
microprocessor and high density FPGA, can have its applications designed using the standard
design flow and tool set, supports applications with parallelism, provides possibilities for
runtime resource allocation and supports general interest reconfigurable computing
applications. Neither the reconfigurable coprocessor nor the reconfigurable ALU is ideally
suited, as each performs less well in one or more of the criteria.
2.1.3 FPGA architectures
In this section architectures for FPGAs are reviewed with the objective of evaluating their
suitability to work together with an operating system that allocates resources at runtime and a
loosely-coupled reconfigurable computing architecture. Two major facets of FPGA
architectures are of interest: the granularity of the configurable logic elements, and the
availability of dynamic reconfiguration support, either as a capability for dynamic
reconfiguration driven from the attached host or as built-in hardware support that implements
context switching through autonomous self reconfiguration. It will be shown that a medium
grain FPGA which has dynamic reconfiguration but not necessarily self reconfiguration is the
closest match for the proposed operating system.
Granularity
FPGA devices are commonly categorized according to the granularity of the architecture. A
typical architecture may be considered as fine, medium or coarse grain. Fine grain
architectures (see Figure 5 (a)) usually contain a simple configurable logic block, often only
two-input gate logic, and are best suited to applications that require very fine grain bit
manipulation. Well-known academic fine-grain reconfigurable devices include Montage [65],
Triptych [22], 3D-FPGA [37], GARP [66] and BRASS [28]; commercial fine-grain
architectures include the Xilinx 6200 [142] and Altera Flex 10K [4]. Fine grained
architectures require more routing resources than coarser grained ones, and given the cost of
routing in terms of area and delay, the optimal architecture is unlikely to be the finest grain.
Fine grain architectures are also probably now obsolete, with very few commercially
available versions released in recent years.
Figure 5: FPGA granularity examples: (a) Xilinx 6200 FPGA (fine grain); (b) Xilinx Virtex
(medium grain); (c) PACT (coarse grain)
FPGAs with larger lookup tables and CLBs, more flip flops and CLB to CLB direct carry
chains are commonly known as medium grain architectures. These include LP_PGAII [60],
LEGO [38], Xilinx 4000 and Xilinx Virtex [139]. This type of architecture looks promising
for use in an operating system environment as it has a high enough density to support multiple
applications, is commercially available, and supports general interest applications. Figure 5
(b) is an example of a medium grained architecture functional unit and shows the increase in
complexity as compared with the fine grain architecture.
Coarse grained architectures for FPGAs have more complex logic and routing elements that
are domain specific. This implies that there are components that optimise a domain specific
function that could be implemented in other ways on a medium grained architecture. They
also have a routing structure that is suited to a particular application domain. Devices with
such an architecture include RaPiD [52] and Chameleon [33]. Domain specific architectures
often provide little performance increase for applications that cannot be customised to suit the
architecture. Figure 5 (c) is an example of a coarse grained architecture functional unit and
shows the increase in complexity when compared with the medium grain architecture. A
summary of what constitutes each type of granularity is shown in Figure 6.
Figure 6: Granularity of an FPGA architecture
1. Fine grain: 2 input NAND gate CLBs and local routing
2. Medium grain: reconfigurable logic blocks with lookup tables, flip flops and a routing
hierarchy
3. Medium grain: reconfigurable logic blocks with fast carry and other inter-block
communication, and a routing hierarchy
4. Coarse grain: reconfigurable logic block including a multiplier and a routing hierarchy
5. Coarse grain: reconfigurable logic block including a state machine and internal RAM, and
a specialist routing structure
Dynamic reconfigurability
Another facet of the architecture that is of interest is the ability of the FPGA to support
dynamic reconfiguration from either the attached host or via single cycle self reconfiguration.
This is necessary so applications can be configured onto the FPGA at runtime whilst not
affecting the already configured ones. An early attempt at dynamic reconfiguration driven
from the attached host was demonstrated on the Xilinx 6200 series FPGA. This fine grained
FPGA performed random access partial reconfiguration which allowed selected CLBs of the
FPGA to be reconfigured while the remainder of them continued to operate unaffected, thus
allowing applications to be configured at runtime. This type of dynamic reconfiguration was
inherited by the Xilinx Virtex, Virtex II and Virtex II pro series of FPGAs.
A single cycle self-reconfigurable FPGA is a device that can toggle a particular part of its
own configuration, causing part of, or all of, the FPGA to reconfigure itself within a single
clock cycle. Early research into this was conducted by DeHon [47], who presented
the Dynamically Programmable Gate Array (DPGA) which was able to rapidly switch among
several pre-programmed configurations. This rapid reconfiguration allows DPGA array
elements to be reused in time without significant overhead. Trimberger [128] presented a
time-multiplexed FPGA architecture based on an extension of the Xilinx 4000 product series.
It contained eight configurations of the FPGA that were stored in on-chip memory and could
reconfigure the FPGA in a single clock cycle of the memory. Scalera [114] presented the first
design and implementation of a context switching reconfigurable computer (CSRC). CSRC
was designed to be a 4 bit DSP dataflow engine that is simultaneously capable of efficiently
implementing glue logic. However Sidhu [119] stated that for efficient self reconfiguration,
the device should perform fast logic switching and fast random access of the configuration
memory. CSRC could switch contexts in a single clock cycle but provided only serial
configuration memory access. Therefore Sidhu proposed the Self-Reconfigurable Gate Array
architecture, which supports both single cycle context switching and single cycle random
memory access. More recently, two commercial reconfigurable devices with coarse grain
architectures that support self reconfiguration have appeared in the literature. These include
XPP [15] from PACT and ACM [110] from Quicksilver. These devices have specialist
architectures that can execute domain specific applications more efficiently.
Although an FPGA architecture that supports single cycle self reconfiguration would be of
interest for an operating system environment, the current devices that support it are either too
domain specific or are research based projects without commercially available products.
Conclusion
The FPGA architecture that is most suited to an operating system would be one that can
support self reconfiguration, is not too domain specific, and is commercially available.
However, as shown in Table 2, none of the three FPGA architectures has all of these
characteristics. The architecture that looks the most promising for use in this operating system
is a medium grained FPGA. Although self reconfiguration is desirable, in commercially
available FPGAs it is only supported in coarse grained architectures; such architectures are
too domain specific and it could be difficult to map general interest applications to them. Fine
grained architectures, although not domain specific and supporting runtime reconfiguration
(Xilinx 6200), are not suited either, as they are no longer commercially available and
therefore a prototype cannot be built.
                              Fine Grain     Medium Grain     Coarse Grain
Commercial availability       Low            High             Medium
Support for reconfiguration   Dynamic        Dynamic          Dynamic or Self
Domain specific               No             No               Yes

Table 2: A summary of the characteristics relating to the use of an operating system
for reconfigurable computing
2.1.4 Conclusion
In this section it has been shown that medium grained FPGAs arranged as an I/O attachment
to a standard microprocessor present the most appropriate platform for an operating system
for reconfigurable computing. While devices that self reconfigure are emerging, they are
excessively domain specific, which limits their use for an operating system. Hence FPGAs
which are externally reconfigurable under the control of a host are a more appropriate
platform for general purpose reconfigurable computing even though their reconfiguration
times are longer and the flexibility of the geometry of reconfiguration modules is limited.
2.2 Abstractions, services and runtime systems
In the previous section it was determined that the most suitable reconfigurable computing
platform for use with an operating system is a medium grained FPGA that supports runtime
reconfiguration, coupled to a microprocessor via a standard I/O interface. As a reconfigurable
computer has little support for automated resource allocation, the scheduling of incoming
applications, or the management of the transfer of I/O, the use of a runtime system may be
necessary.
In this section various services and runtime systems that have been proposed to support
reconfigurable computing applications are reviewed. What is implied by the services of
resource allocation, hardware partitioning, application scheduling and I/O in a reconfigurable
computing environment are initially examined. The prototypes that have been published are
then summarised and compared in relation to these services. An evaluation shows that none of
the prototypes have included all the services.
2.2.1 Services
The structure of an operating system for a traditional von Neumann based architecture has
been well defined [120] [126] [125] as comprising services (e.g. application launching,
multi-tasking, and file and memory management) and abstractions (e.g. process and file). This
is not the case in reconfigurable computing. However there is a growing trend in the
reconfigurable computing research literature to investigate services for a runtime system. To
date these include resource allocation, hardware partitioning, application scheduling and I/O.
Resource allocation
If a reconfigurable computer has applications competing for hardware resources, mechanisms
and policies are required to allocate these resources in a way that will not interfere with other
executing applications. These resources include the FPGA logic area, routing matrix, I/O pins
and external memory. Most of the reconfigurable computing resource allocation literature has
avoided discussion on the allocation of the routing matrix, I/O pins and external memory and
has concentrated on the FPGA logic area [25].
The concept of allocating the FPGA logic area to reconfigurable computing applications
within an operating system like environment was first suggested by Brebner (Virtual
Hardware Operating System [24]). He suggested that the FPGA area be divided into square
segments of equal size which the applications could be allocated to (Figure 7 (b)). This
increased the number of applications from one per FPGA to one per square segment. As the
FPGA was only divided into a small number of fixed segments, this kept the complexity of
the allocation algorithm to a minimum. However, modifications were needed to the current
design flow so that applications could be expressed in these fixed sized segments [24], or
existing applications had to be broken down at runtime. The process required to break a
reconfigurable computing application into segments, or swappable logic units (SLU) as they
are defined by Brebner, is too time consuming to be performed at runtime [23] in an operating
system. To avoid this overhead he then suggested that SLUs have various rectangular
dimensions (Figure 7 (c)). Although there was no deep investigation, a two-dimensional
recursive bisection technique was suggested as an algorithm to perform this allocation.
RACE [124] and the Dynamically Reconfigurable System [73] suggested that the FPGA logic
area was allocated by configuring one application per FPGA device (see Figure 7 (a)). This
allows the current design flow to be used in application development but reduces the
allocation complexity as only one application is configured per FPGA. The limitation with
this policy is the number of applications on the platform at one time is restricted to the
number of FPGA devices. Burns [27] extended this idea to allow applications to have various
geometric shapes (Figure 7 (d)). Although this gives the runtime system the most freedom in
the shape selection, allocation algorithms that could perform such a task with a minimal
overhead were not presented.
Figure 7: Various FPGA logic allocation mechanisms
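As a sketch of how variable sized rectangle allocation (Figure 7 (b)-(d)) might operate, the following Python fragment implements a naive first-fit placement over a grid of logic cells. This is an illustration only, not one of the published algorithms; the grid dimensions and class names are invented for the example.

```python
# Illustrative first-fit allocation of variable sized rectangular applications
# onto a CLB grid. A real allocator would also consider routing and I/O.

class LogicAreaAllocator:
    def __init__(self, cols, rows):
        self.cols, self.rows = cols, rows
        self.used = [[False] * cols for _ in range(rows)]

    def _fits(self, x, y, w, h):
        # True if the w x h region with origin (x, y) is entirely free.
        return all(not self.used[y + j][x + i]
                   for j in range(h) for i in range(w))

    def allocate(self, w, h):
        # Scan left-to-right, top-to-bottom for the first free w x h region.
        for y in range(self.rows - h + 1):
            for x in range(self.cols - w + 1):
                if self._fits(x, y, w, h):
                    for j in range(h):
                        for i in range(w):
                            self.used[y + j][x + i] = True
                    return (x, y)      # origin of the allocated region
        return None                    # no free region: allocation fails

fpga = LogicAreaAllocator(8, 8)
print(fpga.allocate(4, 4))   # -> (0, 0)
print(fpga.allocate(4, 4))   # -> (4, 0)
print(fpga.allocate(8, 8))   # -> None (insufficient free area)
```

Even this toy version shows why the chapter stresses allocation cost: the scan is quadratic in the grid size per request, and fragmentation can cause a request to fail although enough total area remains free.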
If the resource allocation policies shown in Figure 7 (b), (c) and (d) were to be used, there is a
possibility that routing resources beyond the logic area allocation may be needed to allow
inter application communication or external I/O. These routes will breach the logic area
allocation and could require a global router to manage them [45]. A runtime routing
package for the Xilinx Virtex FPGAs (JRoute) has been designed by Keller [82] and can be
configured as a global router. As applications are routed, JRoute is able to manage what
resources have been used and what resources are available. This gives JRoute the ability to
perform routing at runtime. However, the routing algorithm used in JRoute does not
guarantee that a route can be found, which limits its usability.
Although there is limited literature on I/O pin allocation, Babb [12] proposed a technique that
allows more than one application to use a single I/O pin. He suggested that the FPGA pins be
multiplexed amongst multiple logical wires and be clocked at the maximum frequency of the
FPGA. This would allow more than one application to use a selected pin. The decision of
whether to share an I/O pin was made at compile time through the use of a static scheduler. If
this mechanism were to be transferred to the operating system, a dynamic scheduler would be
needed, as applications would be arriving at runtime.
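The pin multiplexing idea can be illustrated with a small simulation: several logical wires share one physical pin, one wire per fast clock slot, with the pin clocked at a multiple of the logical wire rate. The round-robin slot framing below is an assumption made for the example, not Babb's exact scheme.

```python
# Sketch of time-division multiplexing of logical wires onto one FPGA pin.

def multiplex(logical_wires, cycles):
    """Serialise several logical wire streams onto one pin, slot by slot."""
    pin_stream = []
    for t in range(cycles):
        for wire in logical_wires:      # one fast-clock slot per logical wire
            pin_stream.append(wire[t])
    return pin_stream

def demultiplex(pin_stream, n_wires):
    """Recover the logical wire streams from the shared pin."""
    return [pin_stream[i::n_wires] for i in range(n_wires)]

wires = [[0, 1, 0], [1, 1, 1]]          # two applications' output bits
pin = multiplex(wires, 3)               # pin clocked at 2x the logical rate
print(pin)                              # -> [0, 1, 1, 1, 0, 1]
print(demultiplex(pin, 2) == wires)     # -> True
```

The sketch makes the limitation in the text concrete: with n logical wires on one pin, the pin must be clocked n times faster than the applications, so the achievable sharing is bounded by the maximum pin frequency of the FPGA.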
Application partitioning and scheduling
The concept of partitioning logic circuits is well published and is defined as the process of
dividing an application into multiple parts when it cannot fit onto the target device in its
entirety. Once partitioned, these parts can then either be configured onto a reconfigurable
computer in time [77, 78] or configured at the same time onto a single or multi-FPGA
reconfigurable computer. If partitioned in time, a scheduler can be used to determine which
parts should be loaded and removed at what times [117]. Scheduling applications for a
reconfigurable computer is not the same as with a von Neumann based architecture. There are
no obvious ways to pre-empt a reconfigurable application due to the absence of a well defined
instruction fetch, decode, and execute cycle. If all the application parts are configured onto
the FPGA at one time and are not removed until the application has completed, pre-emption is
not necessary.
The idea of partitioning reconfigurable computing applications in time under runtime system
control was introduced in both the XOS [86] and ACE runtime [45] systems. In these papers
it was proposed that if a reconfigurable computing application was larger than the target
device, it should be partitioned into equal sizes matching the target device and then these parts
be swapped onto the FPGA through the use of a pre-emptive scheduler. Partitioning
reconfigurable computing applications across multiple FPGAs was demonstrated by the
Dynamically Reconfigurable System [73] and RACE [124]. In both these systems the
reconfigurable computer had multiple FPGAs and a single application was partitioned and
configured at one time onto them. The process of partitioning the application across multiple
FPGAs was performed statically using well known partitioning algorithms. Performing the
partitioning statically avoids an unwanted runtime overhead while providing the ability to
execute applications larger than the target FPGA. The order in which to load the partitioned
application can then be determined through the use of a static scheduler, several of which
[112] [104] have been proposed in the research literature.
Instead of partitioning an application across multiple FPGAs, Brebner [24] proposed
partitioning reconfigurable computing applications into parts known as swappable logic units
(SLUs). SLUs could be swapped onto an FPGA into a number of fixed sized locations. The
advantage of this is the ability of the FPGA to support multiple concurrent
complex and time consuming and it was noted that it could not be performed at runtime
without a significant time overhead. Therefore all applications were statically partitioned into
the fixed sized SLU structure. It appears from the literature that very few researchers have
considered the idea of partitioning applications into variable sized parts at runtime whilst
under operating system control.
Input and output
In almost all of the runtime systems proposed and built, researchers have seen the need to
manage the transfer of I/O between the reconfigurable computing application and the
microprocessor. As most of the runtime systems only support one concurrent application, it
has been assumed that the I/O would simply be transferred via the FPGA pins. However if the
runtime system was to support more than one concurrent application, this may not be possible
as sharing I/O pins amongst applications on an FPGA can be difficult.
To avoid contention on the FPGA pins, Babb [12] suggested that multiple signals be
multiplexed onto one pin (as discussed under resource allocation above). This would then allow more than one application to
use a single pin at one time. However the number of applications that can share the FPGA pin
is limited by the clock speed of the reconfigurable application and FPGA. An alternative to
multiplexing the pins is to use an on-chip network and arbitrator. This involves configuring a
network and arbitrator onto the FPGA before any applications are loaded. As applications are configured
onto the FPGA, they are connected to the on-chip network which is responsible for creating a
route between the application and the network arbitrator. The arbitrator then negotiates with
all of the applications giving each one exclusive use of the network and I/O pins at a specific
time. To avoid contention on the FPGA pins only the network arbitrator is directly connected
to them and all of the applications transfer their I/O via the network. Although this has been
proposed [81], it has not been used in conjunction with any runtime system for reconfigurable
computing.
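The arbitration step above can be sketched as follows. This assumes a round-robin policy and invented application names purely for illustration; it is not the design proposed in [81].

```python
# Sketch of an on-chip network arbitrator: applications request the shared
# I/O pins, and the arbitrator grants exclusive use to one at a time.

class NetworkArbitrator:
    def __init__(self, apps):
        self.apps = list(apps)   # applications connected to the on-chip network
        self.next = 0            # round-robin pointer

    def grant(self, requests):
        """Grant the pins to the next requesting application, round-robin."""
        n = len(self.apps)
        for offset in range(n):
            candidate = self.apps[(self.next + offset) % n]
            if candidate in requests:
                self.next = (self.apps.index(candidate) + 1) % n
                return candidate
        return None              # no application is requesting the pins

arb = NetworkArbitrator(["filter", "crypto", "search"])
print(arb.grant({"crypto", "search"}))   # -> 'crypto'
print(arb.grant({"crypto", "search"}))   # -> 'search'
print(arb.grant({"crypto"}))             # -> 'crypto'
```

Because only the arbitrator drives the physical pins, applications never contend for them directly; the trade-off is that each application sees only a time slice of the full pin bandwidth.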
2.2.2 Prototypes
Operating systems traditionally provide run time support for applications. Surprisingly, in
view of the number of reconfigurable platforms and architectures proposed and built [18, 31,
61, 63], very few of these projects have included an investigation of run time support.
Every group that has built a platform has seen the need for a single user loader [63] [124],
often in the guise of interface software between the reconfigurable computing platform and
the host system. Some researchers have gone further and seen the need for a run time environment.
Smith developed a system he called RACE [124], the Reconfigurable and Adaptive
Computing Environment. In RACE the allocation of FPGA logic area was on a one
application per FPGA basis. This dramatically reduces the complexity of the allocation
algorithm, as it is much easier to determine an available FPGA than an available section of
area on one FPGA. It is worth
noting that although RACE can support as many applications as it has FPGAs, as FPGA
density increases, the need for multiple FPGA based platforms diminishes as numerous
applications can fit onto one very dense FPGA.
Similar to Smith, with respect to the allocation policy, Jean reported on a resource manager
for a dynamically reconfigurable system [73]. He defined a dynamically reconfigurable
system to be one that allows part of the hardware to be reconfigured whilst another part
continues to execute. If implemented on one FPGA, it would be similar to partial
reconfiguration, however in this paper it was implemented by the use of multiple FPGAs.
Therefore the resource manager allocated and de-allocated whole FPGAs when required. Jean
reported on the performance of the resource manager when used with several applications and
the impact of supporting concurrency and preloading in reducing application execution time.
Davis [45] proposed a method for developing reconfigurable hardware object class libraries
and a runtime environment to manage these, with a conceptual development of layered
abstractions. At the top is a hardware object scheduler which manages precompiled cores, in
the middle is a place and route layer, and at the bottom is a virtualization of the FPGA to
make the system portable. However, the authors missed the need for resource allocation,
which suggests they did not have a vision for a full scale runtime system, as a loader does not
need the allocation step. There is also no evidence that the authors attempted to implement
the layered architecture, as the details for each level of abstraction are very brief.
Moseley [104] proposed a runtime management system called Reconnetics. The runtime
system provides an environment of high-level control over the logic gates which requires little
knowledge of the underlying hardware technology. The user supplies the circuits; these are
captured and placed in an archive for later use and an engine directed by a high-level user
program loads, places, and interacts with these circuits. Although this provides the ability
to dynamically load applications, and performance results were given showing that the
runtime system has been implemented, the author does not mention area allocation, a
fundamental process that is required in an operating system.
Rakhmatov [112] proposed AMORE, an adaptive multi-user online reconfigurable engine, an
architecture consisting of FPGAs with multiple attached microprocessors. This
runtime system is of interest as they address the issues of FPGA logic area allocation and
application scheduling. It supports variable sized FPGA logic area allocation through the use
of a two dimensional bin packing algorithm. However the authors do not explain why the
particular bin packing algorithm was chosen and do not investigate other alternative allocation
algorithms. Application scheduling is performed by the use of a dynamic scheduler [111]. The
order in which the applications are loaded onto the FPGA is determined by the
communication constraints, making sure all communicating applications are resident on the
FPGA at one time. Although named a dynamic scheduler, the scheduling decisions are all
made offline when a complete list of applications to be loaded onto the FPGA is known.
A commercial reconfigurable computing operating system, the Field Upgradeable Systems
Environment (FUSE), was developed by Nallatech [106]. FUSE is an interface between the
platform operating system and the hardware circuit design language. It allows data to be
communicated from software directly to the FPGA based applications running on the
hardware. Although named a “Reconfigurable Computer Operating System”, it really
only provides an API between the user and the hardware platform. It provides very few of the
services described in the section above that are necessary for it to be an operating system.
Although this thesis will concentrate on operating systems for reconfigurable computers with
an attached microprocessor, it is worth noting runtime system research undertaken on the
Xputer [85] and AMORE architectures. Kress outlined an operating system for custom
computing machines (XOS) based on the Xputer paradigm in [86]. Instead of having
concurrently executing applications, multi-tasking was performed on the Xputer through a
dynamic scheduler that swapped the application on the reconfigurable logic unit after a set
amount of time.
2.2.3 Evaluation
A summary of the services that each prototype incorporates is shown in Table 3. From this
table it can clearly be seen that no runtime system provides support for even the minimal
number of services that have already been proposed in the literature. In fact, of the eight
prototypes discussed, only five of them perform any type of hardware partitioning, only one
provides any method of data input and output, two implement a pre-emptive scheduler, one
implements a static scheduler and all eight implement some form of FPGA logic area
allocation. Of the prototypes that implement some form of hardware partitioning, four of them
partition across multiple FPGAs and one partitions into fixed sized SLUs. None of the
prototypes partition applications at runtime into variable sized partitions. Although all of the
prototypes perform logic area allocation, only two allocate applications of variable sizes while
the remaining six allocate on a per FPGA basis. The per FPGA basis allocation was primarily
used as the current design flow could be used to produce applications for such a system and
the allocation algorithm has a low computational complexity. Neither of the two prototypes
that performed variable sized allocation gave any performance results for their algorithms,
and neither compared them with alternatives.
Name of                       Allocation of        Application        Scheduling of       I/O
Prototype                     resources            Partitioning       applications

AMORE [112]                   Logic area,          Not mentioned      Static              Not mentioned
                              variable sized                          scheduling
                              rectangles
Dynamically                   Logic area,          Across multiple    Not mentioned       Not mentioned
Reconfigurable System [73]    per FPGA             FPGAs
Virtual Hardware              Logic area,          Static, into       Not mentioned       Bus addressable
Operating System [24]         variable sized       SLUs                                   registers
                              rectangles
RACE [124]                    Logic area,          Across multiple    None                Not mentioned
                              per FPGA             FPGAs
ACE runtime [45]              Logic area,          Across multiple    Pre-emptive         Not mentioned
                              per FPGA             FPGAs              scheduler
Reconnetics [104]             Not mentioned        Not mentioned      Demand and          Not mentioned
                                                                      static scheduling
XOS [86]                      Logic area,          Across multiple    Pre-emptive         Not mentioned
                              per FPGA             FPGAs              scheduler
FUSE (Nallatech)              Logic area,          Not mentioned      Not mentioned       Not mentioned
                              per FPGA

Table 3: Services Provided by Runtime System Prototypes
This section outlined what is implied by resource allocation, hardware partitioning,
application scheduling and I/O management as discussed in reconfigurable computing
runtime system literature. It was shown that there is a lack of a concise set of services that
must be provided by a runtime system for a reconfigurable computer. Of the prototypes
proposed and built no one runtime system has demonstrated a complete set of these services.
In most of the published research, resource allocation and hardware partitioning appear to be
the defining services of runtime systems for reconfigurable computing. However there
appears to have been little investigation and performance evaluation into optimal allocation
and partitioning algorithms for variable sized applications.
2.3 Allocation and partitioning
In the previous section, runtime systems for reconfigurable computers that appear in the
research literature were presented. It was shown that although the services of resource
allocation and hardware partitioning were suggested, none of the prototypes had included any
systematic evaluation of algorithms that can operate on variable sized dynamically arriving
applications. In this section, published resource allocation and hardware partitioning
algorithms that are suitable and appear in either the reconfigurable computing or other
research domains will be presented. The review will focus on algorithms which have a
potential to be transferred into the reconfigurable computing domain.
2.3.1 Allocation
In any system where there are applications competing for limited hardware resources, a
resource manager is required. Without one, an application can take resources that are already
occupied by another application, ultimately causing both to exhibit undefined behaviour. The
task of a resource manager is to monitor the availability of the hardware resources and
allocate them as requested to arriving applications. In a reconfigurable computer with
applications competing for the hardware, the major resource that needs management is the
FPGA logic area. Allocating reconfigurable computing applications onto the FPGA logic area
can be simply defined as translating an area (the application) onto another area (FPGA) so
that the translated area does not overlap any existing used area but fits within the boundary of
the total usable area.
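The allocation check just described can be expressed directly. The sketch below (a minimal illustration; the integer coordinate model, with a region represented as a tuple (x, y, width, height), is an assumption) tests whether a translated application area stays within the device boundary without overlapping any used area:

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)

def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def can_allocate(app: Rect, used: List[Rect], fpga_w: int, fpga_h: int) -> bool:
    """An application fits if it lies within the device boundary and
    overlaps no region already occupied by another application."""
    x, y, w, h = app
    inside = x >= 0 and y >= 0 and x + w <= fpga_w and y + h <= fpga_h
    return inside and not any(overlaps(app, u) for u in used)
```

A resource manager would run this test for each candidate position before committing a configuration.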
The general resource allocation problem has been investigated in various research domains
including allocating data to computer file systems [96] and memory [98], allocating
processors to multi-processor computers [107], and block placement within mesh connected
parallel computers [144]. However, many of these previous attempts are unable to be adapted
for reconfigurable computing as there is a fundamental difference. In traditional von
Neumann computing, the data is moved to the processor. In multiprocessors for example,
programs are moved to the processors, as in the classical load balancing problem. Another
example is the well explored problem of allocating applications to a mesh of processors. This
is not really relevant to FPGAs either because the location of the hardware modules in an
FPGA is not necessarily predetermined by the location of existing hardware. In the case of
existing hardware on the FPGA (such as might be provided for memory access) the task is not
to load a program onto this hardware but to locate the FPGA core in an appropriate position in
relation to the existing hardware. In reconfigurable computers, the circuits (“processors”) are
moved to connect with memory under area constraints, and therefore these general resource
allocation algorithms cannot be adapted.
The problem of determining where to allocate incoming reconfigurable computing
applications onto an FPGA is very similar to that of calculating fabric cutting plans in the
manufacture of clothing. In clothing manufacturing, the goal is to minimise the amount of
fabric wasted when a particular pattern of clothing is cut out. Milenkovic [102] used a
well-known geometric construction, the Minkowski Sum [58], to calculate all
the possible locations where the clothing pieces could be laid out. As the Minkowski Sum
only reports on the possible locations in which the pattern can be laid out, a greedy based
algorithm was used to select which of them resulted in the optimal placement. He reported
that current software packages performing this task are wasting approximately 20% of the
fabric and it was shown that through the use of the Minkowski Sum and the greedy algorithm
this dropped to around 9.5%.
The problem of packing reconfigurable computing applications onto an FPGA is also similar
to the well-studied problem of bin packing. The traditional bin-packing problem involves
packing a list L = {a1, a2, …, an} of items into a set of bins B1, B2, …, subject to the constraint
that the set of items in any bin fits within that bin’s capacity. Most literature on bin-packing
concentrates on classical one dimensional bin-packing, such as First Fit and Best Fit. One
dimensional bin-packing is not suitable for area allocation on an FPGA as the surface of the
FPGA is two dimensional. However, as shown in Figure 8, two dimensional bin packing is.
Figure 8: Two Dimensional Bin Packing
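The classical one dimensional First Fit heuristic mentioned above can be sketched in a few lines (a toy illustration; the item-index bookkeeping is an assumption of this sketch):

```python
def first_fit(items, capacity):
    """Classical one-dimensional First Fit: put each item into the
    first bin with enough remaining capacity, opening a new bin
    when none fits. Returns the item indices assigned to each bin."""
    bins = []     # remaining capacity of each open bin
    packing = []  # item indices placed in each bin
    for i, size in enumerate(items):
        for b, free in enumerate(bins):
            if size <= free:
                bins[b] -= size
                packing[b].append(i)
                break
        else:
            bins.append(capacity - size)
            packing.append([i])
    return packing
```

The two dimensional variants discussed next generalise this idea to rectangular items and areas.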
The two-dimensional finite bin packing problem consists of determining the minimum
number of large identical rectangular bins required to allocate, without overlapping, a
given set of rectangular items. Two dimensional bin-packing algorithms fall into two
categories: offline bin packing, which can use full knowledge of all items in the list L, and
online bin packing, in which packed items cannot be repacked at a later time and the full list
of items to be packed is not known at the start. Offline bin packing is not relevant in this
thesis as it can only be used when all the applications that are required to be packed onto the
FPGA surface are known in advance. As the application arrival rate into the runtime system
proposed in this thesis will be random, the process of bin packing cannot be performed
offline. Online bin packing algorithms are able to accept applications arriving after the
packing process has begun and are thus more suitable to the proposed runtime system.
Baker, Coffman, and Rivest [13] introduced an implementation of two dimensional online bin
packing. In this implementation, the items are rectangles and the goal is to pack them into a
unit-width semi-infinite strip so as to minimize the total length of the strip spanned by the
packing. Packed rectangles can not overlap each other or the boundaries of the strip. Each
successive item is placed as near the bottom of the strip as possible and then as far left at that
height as possible. The problem with this particular implementation, and most other online
two dimensional bin packing algorithms, is that the strip of area is assumed to be infinite,
whereas an FPGA’s area is finite, and the runtime complexity is at best O(n³) [39], where n is
the number of rectangles being allocated.
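The bottom-up, left-justified placement rule can be sketched as follows. This is a simplified illustration, not the exact algorithm of Baker, Coffman and Rivest: candidate positions are restricted to the corners induced by already-placed rectangles, a common approximation.

```python
def bottom_left_pack(rects, strip_width):
    """Place each (w, h) rectangle as low as possible in a fixed-width
    strip, then as far left as possible at that height. Candidate
    positions are the strip origin plus corners of placed rectangles."""
    placed = []  # (x, y, w, h)

    def fits(x, y, w, h):
        if x < 0 or y < 0 or x + w > strip_width:
            return False
        return all(not (x < px + pw and px < x + w and
                        y < py + ph and py < y + h)
                   for px, py, pw, ph in placed)

    for w, h in rects:
        candidates = [(0, 0)]
        for px, py, pw, ph in placed:
            candidates += [(px + pw, py), (px, py + ph)]
        # lowest first, then leftmost: minimise (y, x)
        best = min(((y, x) for x, y in candidates if fits(x, y, w, h)),
                   default=None)
        if best is None:
            raise ValueError("no feasible position found")
        y, x = best
        placed.append((x, y, w, h))
    return placed
```

Note the strip height is unbounded here, which is exactly the mismatch with a finite FPGA surface that the text identifies.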
Chazelle [35] implemented a modified two dimensional online bin packing algorithm in an
attempt to reduce the runtime complexity but retain the quality of allocation. It was found that
using poorly ordered lists can lead to arbitrarily bad packings and long runtimes [13].
However this could be avoided by simply ordering the lists in decreasing widths and
allocating all tasks in the bottom left corner (a heuristic commonly known as bottom-left).
Chazelle reported that the algorithm had only quadratic complexity in terms of the number of
rectangles. Although a reduction in runtime was achieved, the algorithm still assumed an
infinite height bin.
In an attempt to adapt a bin packing algorithm that can allocate FPGA logic area, Bazargan
[17] made modifications to the best fit and first fit algorithms. His algorithm involved
dividing the remaining free FPGA area into empty regions (sometimes referred to as holes)
and if the incoming application is able to fit into an empty region, that region is marked as a
candidate for allocation. A best fit, lowest bottom side, or bottom left heuristic is then used
to determine into which empty region the application is allocated. The advantage of this
implementation is that, by not considering every possible place in which an application could
be allocated, the time complexity is reduced to O(log n) per allocation. However, this resulted
in an average loss of around 7% in allocation quality.
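The empty-region (hole) idea can be illustrated with a minimal best fit allocator over a list of free rectangles. This is a sketch only: the guillotine-style split below is a simplification, and Bazargan's O(log n) bound relies on a far more careful hole representation than a flat list.

```python
def best_fit_allocate(app_w, app_h, free_rects):
    """Pick the empty region (x, y, w, h) with the least leftover area
    that can hold the application, then split that region into the two
    remaining free rectangles. Returns (placement, new_free_list), or
    (None, free_rects) when no region fits."""
    candidates = [r for r in free_rects if r[2] >= app_w and r[3] >= app_h]
    if not candidates:
        return None, free_rects
    x, y, w, h = min(candidates, key=lambda r: r[2] * r[3])
    remaining = [r for r in free_rects if r != (x, y, w, h)]
    if w - app_w > 0:                       # hole to the right
        remaining.append((x + app_w, y, w - app_w, app_h))
    if h - app_h > 0:                       # hole above
        remaining.append((x, y + app_h, w, h - app_h))
    return (x, y, app_w, app_h), remaining
```

Repeatedly loading and removing applications through such an allocator makes the fragmentation problem discussed later in this chapter concrete: the free list degenerates into many small holes.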
An attempt to develop a resource allocation management tool for reconfigurable computing
was made by Eatmon [51]. He introduced the Dynamic Resource Allocation and Management
framework (DREAM). DREAM is a tool that evaluates placement (defined as allocation in
this thesis to avoid confusion with traditional logic placement) algorithms for configurable
logic devices. Incorporated into DREAM are three placement algorithms: best fit, first fit and
random placement. However, the execution times the author reports for these algorithms are
unacceptable for use in a runtime system. The total execution time averaged around 11
seconds for the first fit algorithm, around 6 seconds for best fit, and around 4 seconds for
random placement. Such execution times would place too much overhead on any runtime
system that used these algorithms.
In this section it was shown that there are some promising algorithms for FPGA logic area
allocation in both the reconfigurable computing and other research domains. However high
complexity and the quality of the allocation in terms of wasted area on the FPGA is a problem
for some algorithms. There appears to be a trade-off between the quality of allocation and the
execution runtime, and no one has yet attempted this type of analysis. The allocation problem
has so far been considered in isolation; the ability to partition applications to overcome
blocking (where an application cannot be allocated unless split up) has not been considered.
Hardware partitioning for reconfigurable computing is reviewed in the next section.
2.3.2 Partitioning
Logic partitioning is traditionally used to split an application into equal sized parts when it
can not fit onto target devices. In a runtime environment, it is envisaged that logic partitioning
would be used to divide an application into numerous parts of different sizes. This is because
in a runtime system as applications are loaded and removed from the FPGA, the area becomes
fragmented (distributed non-contiguously). Instead of waiting for contiguously available area
to configure the application onto, it is partitioned into specified sizes that match what is
currently available. The benefit of this is that it should overcome the possibility that large or
odd shaped applications are starved of execution time because they do not fit, in their
original form, into the allocated space. As logic partitioning has been an active area of
research for at least the last 25 years, there have been numerous proposed solutions.
According to Alpert [2], logic partitioning algorithms can be divided into four major
categories: move-based approaches, geometric representations, combinatorial formulations,
and clustering approaches. However move-based approaches dominate the research literature
primarily because the algorithms are very intuitive and simple to describe and implement.
Those that have the potential to be used in conjunction with a runtime system will be outlined
below.
Generally, an algorithm is move-based if it iteratively constructs a new solution based on the
previous history. In 1970, Kernighan and Lin (KL) [83] described an algorithm that involves
iteratively swapping pairs of neighbourhood modules with an objective function of
minimising the cut-size, that being the number of nets connected to nodes in both partitions.
A simple implementation of KL requires O(n³) time per pass. Fiduccia and Mattheyses (FM)
[57] modified the KL algorithm and reduced the time per pass to linear in the size of the
netlist. The key to the speed up was the bucket data structure used to find the best node to
move. Instead of using the greedy improvement approaches described above to minimise the
cut-size, Kirkpatrick et al. [84] introduced Simulated Annealing (SA). This involves picking a
random neighbour of the current solution, moving to it if it represents an improvement, and
occasionally accepting a worse solution with a probability that decreases over time. It can be
shown that the algorithm will converge over time to a globally optimal solution, given an
infinite number of moves and a temperature schedule that cools to zero slowly.
competitive approach when compared to KL in specific circumstances; however multiple runs
of KL with random starting solutions may be preferable in others.
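A minimal move-based bipartitioner in the SA style might look like the following. This is a toy sketch: real KL/FM implementations maintain balance constraints and gain buckets, and the net model here (a list of node collections) is an assumption of the sketch.

```python
import math
import random

def cut_size(nets, side):
    """Number of nets with nodes in both partitions."""
    return sum(len({side[n] for n in net}) == 2 for net in nets)

def anneal_bipartition(nodes, nets, t0=2.0, cooling=0.95, steps=200, seed=0):
    """Repeatedly move a random node to the other side, keeping
    improving moves and accepting worsening ones with probability
    exp(-delta / T) under a geometric cooling schedule."""
    rng = random.Random(seed)
    side = {n: rng.randint(0, 1) for n in nodes}
    t = t0
    for _ in range(steps):
        n = rng.choice(nodes)
        before = cut_size(nets, side)
        side[n] ^= 1                      # tentative move
        delta = cut_size(nets, side) - before
        if delta > 0 and rng.random() >= math.exp(-delta / t):
            side[n] ^= 1                  # reject the uphill move
        t *= cooling
    return side
```

Recomputing the full cut-size per move is what FM's bucket structure avoids, which is the source of its linear time per pass.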
However FPGA partitioning poses different challenges than Min-Cut partitioning due to the
hard size and pin constraints in mapping onto these devices. FPGAs also have variable sized
partitions, so partitioning with the sole objective of minimising the number of communication
wires may not be adequate, as the resulting partition may not fit into the desired location.
Woo and Kim [136] proposed a k-way extension to the FM algorithm which has the
objective function of minimising the maximum number of I/O pins used in the device. This
algorithm is similar to FM in that modules are swapped until an objective function is reached,
but many more modules may need to be examined before finding a feasible solution. Kuznar
et al [88] applied FM bi-partitioning to address the common multiple FPGA device
partitioning problem. In this algorithm, given a number of devices and modules in the circuit,
an integer linear program can be solved to find a set of devices that yields a lower bound on a
cost. It is possible that this solution can be mapped onto the devices while still satisfying the
I/O pin constraints. However logic partitioning algorithms with these objective functions are
not suited to a runtime environment because they partition into fixed sizes. In a runtime
environment the application may need to be divided in numerous different geometrical
dimensions.
A possible way to represent a circuit is to describe it according to a directed acyclic graph.
The nodes in the graph represent computation while the edges represent the communication
between the nodes. Purna [109] introduced the concept of temporal partitioning of directed
acyclic graphs. Temporal partitioning partitions and schedules a data flow graph into
temporally interconnected subtasks. Given the logic capacity of the configurable computing
unit, temporal partitioning will partition the circuit k-way such that each partitioned segment
will not exceed the capacity of the configurable unit. Scheduling then assigns an execution
order to the partitioned segments so as to ensure correct execution. This algorithm might be
suitable for a runtime environment if the temporal dimension was replaced by a geometric
one.
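The temporal partitioning idea can be sketched as a greedy pass over a topological order of the graph: close the current segment whenever adding the next node would exceed the device capacity. This is an illustration only; the edge-list graph encoding and per-node areas are assumptions, and published temporal partitioners also optimise the inter-segment cuts.

```python
def temporal_partition(dag, areas, capacity):
    """Split a DAG into an ordered list of segments, each within the
    device capacity. Nodes are taken in topological order so every
    edge runs from an earlier (or the same) segment to a later one."""
    # Kahn's algorithm for a topological order
    indeg = {n: 0 for n in areas}
    for u, v in dag:
        indeg[v] += 1
    ready = [n for n, d in indeg.items() if d == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for u, v in dag:
            if u == n:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    # Greedily fill fixed-capacity segments in that order
    segments, current, used = [], [], 0
    for n in order:
        if used + areas[n] > capacity and current:
            segments.append(current)
            current, used = [], 0
        current.append(n)
        used += areas[n]
    if current:
        segments.append(current)
    return segments
```

Replacing the temporal dimension with a geometric one, as suggested above, would amount to matching segment capacities to the currently available holes rather than to a single fixed device capacity.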
In this section it was shown that although there have been numerous allocation and logic
partitioning algorithms proposed and implemented, very few of them are suitable for a
runtime environment for a reconfigurable computer. As applications arrive dynamically, the
two dimensional FPGA area must be allocated to incoming applications so as not to affect
the other executing applications. When applications are loaded and removed, holes of various
shapes and sizes are created and therefore the partitioning algorithm must be able to divide an
application into the various sized parts.
2.4 Reconfigurable computing design flow
This section reviews the current application design environment for reconfigurable computing
platforms that consist of a medium grained FPGA loosely coupled to a standard
microprocessor. It will be shown that at present this environment assumes that the designer
will carry out resource allocation on the FPGA at design time. Where dynamic
reconfiguration is supported, the designer must still do resource allocation because the
reconfiguration involves interchanging cores with identical resource requirements. Very little
support exists for the adaptive integration of software and reconfigurable hardware as would
be needed if runtime reconfiguration were to be widely used in reconfigurable applications.
2.4.1 Traditional design flow
An application that targets a reconfigurable computer with an I/O system bus style coupling
consists of two parts: a hardware circuit, or bitstream that configures the FPGA, and a
software host program that interacts with the platform. These two files combined form a
reconfigurable computing application. The hardware circuit design methodology for a
reconfigurable computing application has been adopted from the VLSI domain and is shown
in Figure 9. It consists of three major stages: design entry, technology mapping and place and
route.
Design entry primarily involves describing how the circuit will behave but as a consequence
also involves allocating the necessary FPGA I/O pins. There are two common ways to do this,
using schematic capture or using a hardware description language. Schematic capture uses a
computer aided drawing package to describe the circuit. It involves the selection of
components from a library, connection of component’s input and output wires, and naming
and commenting of the components. As shown in Figure 9, schematic capture results in an
implicit entry of a netlist and does not require synthesis. However, a disadvantage is it may
require an early selection of the target technology [101]. Schematic capture was popular a
decade ago, when hardware circuits were several thousand gates in size. As circuit gate counts
increased, schematic capture became very difficult to use because it was too time consuming
to lay out large circuits at the gate level.
[Figure 9 depicts the flow: design entry, either as hardware description language code
describing gate-level or register transfer level functionality, or as schematic capture in a CAD
package; logic synthesis and target library mapping to produce a gate level netlist; technology
mapping of the netlist to the vendor specific architecture; placement and routing with an
automatic tool; and finally bitstream creation, producing the file that will configure the
FPGA.]
Figure 9: Hardware Circuit Design Methodology
A hardware description language (HDL) can be a more abstract design entry method. It
includes many of the elements known from programming languages, such as data operations
(addition, subtraction, etc.) and control operations (if, case, etc.). Two traditional HDLs are Very High
Speed Integrated Circuits Hardware Description Language (VHDL) [7] and Verilog. These
languages allow circuits to be described at either the behavioural or structural level.
Behavioural descriptions involve abstract definitions of system functionality at the register
transfer level (RTL), whereas structural descriptions involve gate level connections. The
description then undergoes the process of synthesis, which involves mapping the circuit to a
netlist.
A new set of HDLs to create hardware circuits has recently become popular. These HDLs
contain subsets of common software programming languages such as C [16]. They use similar
syntax and are extended to support hardware circuits. Examples of such languages include
Handel-C [30], System-C [19], and Hardware Join Java [68]. An advantage of these
languages as compared with the traditional HDLs is the ability to reduce the design time.
Traditional design involves prototyping the algorithm in software or behavioural VHDL and
then translating it into register transfer level VHDL or Verilog; a process that can introduce
errors and requires extensive debugging. These new HDLs may avoid these problems as there
is no need to prototype in software or behavioural VHDL because the language is already
software-like. Another advantage of these new HDLs is that designers with limited or no
VHDL experience are able to develop hardware circuits. Loo [93] showed that students with
limited VHDL experience were able to develop hardware applications, such as a parallel
filter, within weeks. However, a disadvantage of Handel-C, for example, is that it often
requires more area than a VHDL implementation would: Loo [93] showed that a DES
encryption circuit in Handel-C required five times more area than the corresponding VHDL
implementation.
Once a detailed gate description of the circuit has been created, it must be translated to the
actual logic elements of the FPGA. This stage is known as technology mapping and is
dependent on the exact target architecture. With a lookup table (LUT) based architecture, the
circuit is divided into a number of small sub-functions, each of which can be mapped to a
single LUT. The resultant blocks are then allocated to a specific location within the hardware,
often close to communicating blocks to minimise routing delays, in a process known as
placement. The communicating blocks are allocated and wired together by configuring the
appropriate routing matrices in a process known as routing. These two processes are often
combined and are known as place and route as the placement of the circuit will directly affect
the quality of routes made. Floor-planning is often used as part of the current design flow as it
can reduce the time required to complete the place and route phase. However, floor-planning
assumes the designer knows in advance which resources will be available [54]. At this stage in the design
methodology, the timing and behaviour of the circuit can be analysed to verify that it meets
the minimal operating speed constraints. After successful timing verification, the hardware
design process is complete and a bit sequence (commonly known as a bitstream) is generated.
Once a bitstream has been produced, a software host program must be developed to load it
onto the surface of the FPGA. A typical flow of this host program would be to set the clock
rate, configure the FPGA and perform the necessary I/O between the host computer and
reconfigurable computing platform. It uses a combination of the platform device driver and
associated application programming interface to perform these required management tasks.
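The host-program flow just described can be sketched as follows. The `Board` class and its methods are hypothetical stand-ins for a vendor's device driver and API, not a real interface; each platform exposes its own equivalents.

```python
class Board:
    """Stand-in for a reconfigurable-platform driver wrapper
    (hypothetical API, for illustration only)."""
    def __init__(self):
        self.clock_mhz = None
        self.bitstream = None
        self.state = "closed"
    def set_clock(self, mhz):
        self.clock_mhz = mhz
    def configure(self, bitstream):
        self.bitstream = bitstream
        self.state = "configured"
    def write(self, data):
        return len(data)            # bytes streamed to the FPGA
    def read(self, n):
        return bytes(n)             # bytes streamed back to the host
    def close(self):
        self.state = "closed"       # leave the device in a known state

def run_application(bitstream, data):
    board = Board()
    board.set_clock(50)             # 1. set the clock rate
    board.configure(bitstream)      # 2. configure the FPGA
    board.write(data)               # 3. stream input to the application
    result = board.read(len(data))  #    ...and collect the output
    board.close()                   # 4. close, leaving a known state
    return result
```

The exclusive-use assumption the text goes on to describe corresponds to the driver rejecting any second `configure` call while an application holds the device.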
After both the hardware circuit and software host program have been developed, the software
host program is used to load the hardware circuit onto the FPGA, perform the desired input
and output between the FPGA application and host computer, and finally close the
reconfigurable computer after use, leaving it in a known state. To guarantee the application
will not be interrupted by other applications modifying the hardware, the software host
program and platform device driver block subsequent attempts to reconfigure the platform.
However, with dynamic runtime reconfiguration becoming a feature of most modern FPGAs,
additional bitstreams need to be configured onto the FPGA at the same time. This has
resulted in changes to the traditional design flow.
2.4.2 Runtime application design flow
There has been considerable research conducted into improvements in design methodologies
for reconfigurable computing applications so they can enable the use of runtime and dynamic
reconfiguration. Hadley and Hutching [64] described a methodology for implementing
runtime systems that partially reconfigure FPGA devices. It involved maximising the static
circuitry and minimising the dynamically changing circuits. Shirazi et al. [116] described a
method that automates the identification and mapping of reconfigurable regions in runtime
reconfigurable designs. This involved identifying possible components for reconfiguration, a
sequence of conditions for activating an appropriate component, and optimising the
successive components based upon reconfiguration time, operation speed, and design size.
Vasilko et al. [129] introduced a design methodology for partial runtime reconfiguration
composed of two phases: the design of the static portion using the traditional design flow, and
the partitioning, scheduling and allocation of the runtime part of the application. They stated
that their methodology reduced the number of design flow iterations, resulting in shorter
design times and high-quality results.
An application note [138] from Xilinx (a major manufacturer of dynamic runtime
reconfigurable FPGAs) outlined how the traditional design flow had been modified to support
module based partial reconfiguration. The design methodology is as follows. The design is
firstly described with a traditional HDL in conformance to a set of partial reconfiguration
guidelines. As noted earlier in this thesis, for partial reconfiguration to be successful on this
architecture, numerous constraints must be placed on the designer, the major one being that
partial reconfiguration is column based: modules should extend column wise and not row
wise. Further constraints are also placed on the HDL coding,
including having a top level design that is limited to I/O, clock logic, and the instantiation for
the bus macro, defining each reconfigurable module as a self-contained block, only using bus
macros for communication between modules, defining all clock resources to use dedicated
global routing and not allowing modules to directly share signals with other modules except
clocks. Following design entry a floor-plan is then constructed describing the position on the
FPGA where each of the modules in the application will be configured. The standard place
and route tools are then run on each of the modules as well as each configuration of a
particular module in the application. An initial bitstream for the full design is then created,
followed by individual bitstreams for each reconfigurable module. The bitstreams are then
configured onto the FPGA by a software host program through the SelectMAP interface.
All of these reported methodologies are built around the traditional sequential design flow
described in section 2.4.1 and can only be used when the order of application execution is
known prior to runtime. Cores cannot be pre-placed and pre-routed and then relocated at
runtime. However Dyer [50] proposed that through the use of direct bitstream manipulation
and the Xilinx JBits SDK [62] applications could be relocated and connections could be
rerouted online, opening up future dynamic applications. The current SDK of JBits does not
support combinatorial and sequential synthesis, timing-driven placement, or advanced
routing. In our experiments, JBits also does not scale to circuits beyond a very small size.
In this section it was shown that the traditional design flow is not suitable for developing
applications that use dynamic resource allocation. However through various academic and
commercial design methodologies it is now possible to develop applications that use dynamic
reconfiguration if all of the applications are known prior to configuring the FPGA. The
literature lacks a design methodology that supports applications that arrive dynamically.
2.5 Applications and benchmarks for reconfigurable computers
Many reconfigurable computing applications are composed of a combination of hardware and
software. This has the advantage of being able to exploit the algorithmic speedup that
hardware can provide as well as the flexibility software gives. Since the introduction of the
first commercially available FPGA by Xilinx in 1984, there have been numerous applications
proposed and built. A selection of the many published reconfigurable computing applications
that have been implemented on a medium grained FPGA loosely-coupled reconfigurable
computer are shown in Table 4.
Cryptography            | Signal/Image processing              | Communications             | Other
DES [80]                | Cordic [99]                          | IPsec [43]                 | Searching and Sorting
AES [97]                | Automatic target recognition [131]   | Reconfigurable Router [67] | Boolean Satisfiability (SAT) [145]
IDEA [91]               | Edge detection [10]                  | LZ Data compression [70]   |
                        | Convolution [113]                    |                            |
Table 4: Summary of common reconfigurable computing applications
These selected applications can be broadly classified into the applications domains of
cryptography, demonstrated through various implementations of the commonly used single
and triple Data Encryption Standard (DES) [80], Rijndael or Advanced Encryption Standard
(AES) [97], the International Data Encryption Algorithm (IDEA) [91] and LZ data compression [70];
communications, through the implementations of the IP security protocol [43] and
reconfigurable routers [67]; signal processing, with implementations of cordic [99], automatic
target recognition (ATR) [131], edge detection [10], convolution [113], and software radios
[48]; and searching, through the Boolean satisfiability (SAT) algorithm [145]. All of these
applications are well suited to reconfigurable computing as the algorithms can be heavily
parallelised resulting in considerable speedups in execution runtime. Most of them can also be
written with a pipeline like structure which allows data to be streamed from the input device,
usually the microprocessor, into the application and then streamed out of it. This type of I/O
architecture minimises the amount of onboard or on-chip memory needed to store and hold
large amounts of input data. This is seen as an advantage as most reconfigurable computers
have limited onboard memory (often less than 128 MB) and on-chip memory (often less than
10 Mbits). It also suits the loosely coupled reconfigurable computing architecture selected for
use in an operating system for a reconfigurable computer, because of the high bandwidth
coupling between the microprocessor and FPGA.
Unlike in the software community, the reconfigurable computing domain lacks benchmarks
that can be used to compare the performance of these applications. Although benchmarks for
general purpose computers have been deeply investigated, there still appear to be very few
that are specifically designed for the area of reconfigurable computing. Two examples,
however, are the Adaptive Computer System (ACS) benchmark suite [87] and the
Reconfigurable Architecture Workstation (RAW) [11] benchmark suite. The ACS benchmark
suite has been designed for evaluating a configurable computing system’s architecture and
tools. Instead of using functional benchmarks, ACS uses stress-marks, or benchmarks that
focus on a specific characteristic of a reconfigurable system such as versatility, capacity, timing
sensitivity, scalability and interfacing. The RAW benchmark suite consists of 12 programs
representing general purpose algorithms including binary heap, bubble sort, merge sort, DES,
Fast Fourier Transform (FFT), game of life, and matrix multiply. The size of each benchmark
program is adjusted depending upon the capacity of the target reconfigurable hardware.
2.6 Conclusion
This chapter has presented a review of the major themes in the published literature on runtime
systems for reconfigurable computing. The important outcomes of the survey have been to
highlight the absence of operating system-like software for a reconfigurable computer, the
lack of resource allocation and partitioning algorithms that are suitable for use in an operating
system, the inability of the current design flow to produce applications to be used in
conjunction with such a system, and the need for metrics to measure the performance of the
applications and the associated system.
This raises several research questions which will be addressed in this thesis:
• Is it feasible to implement an operating system with low overheads that supports
dynamically arriving applications for a reconfigurable computer?
• What abstractions and services need to be provided by such an operating
system?
• What constraints will be placed on applications if resource allocation and application
partitioning have to be completed at runtime?
• Are there suitable algorithms for these services and how will they affect the operating
system?
• What modifications will need to be made to the current design flow to adapt it so it
can produce runtime applications?
• What benchmark and metrics are necessary to measure the performance of the
applications under operating system control, and of the operating system itself?
It is the purpose of the remainder of this work to provide theoretical and experimental
foundations to show that an operating system can be described and implemented for a
reconfigurable computer. Each of these questions will in turn be answered throughout this
thesis.
3 Methodology
In the previous chapter it was shown that there are quite significant gaps in the literature
regarding the runtime management of reconfigurable computing applications. A summary of
these gaps is given below; in this thesis, research contributions will be made to address
them.
1. There is no agreed list of abstractions that should be used in an operating system for
reconfigurable computing (section 2.2.1).
2. Current design flows have little support for dynamic reconfiguration with resource
allocation (section 2.4).
3. Algorithms for runtime resource allocation and runtime application partitioning have
not been deeply investigated in the reconfigurable computing domain (section 2.3).
4. There is no prototype runtime system for reconfigurable computing that demonstrates
runtime resource allocation and partitioning (section 2.2.2).
5. There has been little discussion of metrics that might be used to evaluate the
performance of an operating system for reconfigurable computing (section 2.5).
6. There have been no evaluations of the effect an operating system environment will
have on reconfigurable computing application performance (section 2.5).
In this chapter the methodology that will be used to address these gaps will be outlined. It is
based on the previous work of Crnkovic [42], in which he suggested that a methodology consists of
following a path of categorising the research question, selecting a strategy that will result in
the question being answered, and choosing a validation technique to verify the results
obtained. For each stage he suggested five different types of research questions and these are
summarised in Table 5. As an example, a type of question being proposed is “how to do X?”;
a strategy is then selected to provide an answer to the question; in this case it could be any of
the five shown in Table 5. In the final stage, a validation technique is selected to verify the
results obtained from the strategy. The selected validation technique depends upon the
strategy used.
Question: Feasibility. Does X exist and what is it? Is it possible to do X at all?
Strategy/Result: Qualitative model. Report interesting observations; generalise from examples; structure a problem area.
Validation: Persuasion. “I thought hard about this, and I believe . . .”

Question: Characterisation. What are the characteristics of X? What exactly do we mean by X? What are the varieties of X and how are they related?
Strategy/Result: Technique. Invent new ways to do some tasks, including implementation techniques; develop ways to select from alternatives.
Validation: Implementation. “Here is a prototype of a system that . . .”

Question: Method/means. How can we do X? What is a better way to do X? How can we automate doing X?
Strategy/Result: System. Embody the result in a system, using the system both for insight and as a carrier of results.
Validation: Evaluation. “Given these criteria, the object rates as . . .”

Question: Generalisation. Is X always true of Y? Given X, what will Y be?
Strategy/Result: Empirical model. Develop empirical predictive models from observed data.
Validation: Analysis. “Given the facts, here are the consequences.”

Question: Selection. How do I decide whether X or Y?
Strategy/Result: Analytic model. Develop structural models that permit formal analysis.
Validation: Experience. Report on use in practice.

Table 5: A research methodology suggested by Crnkovic
The research undertaken in this thesis has been divided into four chapters: abstractions,
architecture and design flow; resource allocation and application partitioning; operating
system prototype and metrics; and performance evaluation. For each chapter, a methodology
that will address the associated research questions is constructed from Crnkovic’s [42] work
above and shown in Table 6. Figure 10 summarises the four methodologies used in this
thesis. For each methodology, the previous work that is drawn upon and the expected
into four sections, with each directly relating to a future chapter in this thesis. In each section,
the research questions that will be addressed are stated, the methodology that will be used to
derive answers for them will be presented, and the resultant deliverables will be outlined.
Chapter 4 (Abstractions, architecture and design flow): question of feasibility; strategy of qualitative model; validation by persuasion and implementation.
Chapter 5 (Resource allocation and application partitioning): question of method; strategy of technique; validation by evaluation.
Chapter 6 (Operating system prototype and metrics): question of method; strategy of technique; validation by implementation.
Chapter 7 (Performance evaluation): question of characterisation; strategy of system; validation by evaluation.

Table 6: Methodology paths used in this thesis
3.1 Abstractions, architecture and design flow
When designing abstractions, an architecture and a design flow for use in conjunction with an
operating system, there are two questions that need to be addressed. These are as follows:
1. Is it feasible to define abstractions and an architecture to support runtime resource
allocation for reconfigurable computing?
2. Is it feasible to design applications for an operating system using the current tools and
design flow?
Both these questions are categorised as feasibility and the methodology chosen to address
them involves developing a qualitative model that will be validated through a combination of
persuasion and implementation. In the first question there will be two parts in developing this
qualitative model. A uniqueness and analogy investigation between reconfigurable computing
and the software based architecture will be conducted. This will result in a list of abstractions
for a reconfigurable computing operating system. Based on these abstractions the architecture
for an operating system will be derived. This architecture will then define the requirements of
the algorithms that provide these abstractions. The list of abstractions will be validated
through the use of persuasion and the architecture will be validated via an implementation.
The second question will involve investigating whether it is possible to modify the current
design flow and tools to develop applications for use in conjunction with an operating system
environment. This will be validated through a qualitative discussion.
3.2 Resource allocation and application partitioning
As a result of the research carried out in the previous section, a set of algorithm specifications
for both area allocation and hardware partitioning will have been derived. An area allocation
algorithm and a partitioning algorithm that satisfy those specifications and meet selected
criteria will then be chosen for use in the operating system. This stimulates the following question:
1. How is area allocation and application partitioning performed in conjunction with an
operating system for reconfigurable computing?
This type of question is categorised as method; the strategy of technique was chosen to
provide a solution to the question and an evaluation will be used to validate the results. The
strategy of technique involves initially undertaking a survey of the research literature in other
domains to see if area allocation or hardware partitioning algorithms that meet the
specifications have already been proposed. From this, both allocation and partitioning
algorithms that might be suitable will be selected. These algorithms will then be sorted based
on their complexity and runtime performance. The higher ranked algorithms will then be
adapted to suit the operating system architecture proposed in the previous section. The best
performing allocation and partitioning algorithm will then be selected for use in the operating
system. This research will be validated through evaluation where the most suitable algorithms
will be evaluated against criteria to determine the allocation and partitioning algorithm that
performs the best.
3.3 Operating system prototype and metrics
Once the architecture and algorithms of the operating system have been determined, a set of
metrics are needed to measure the performance of the applications. This raises the following
question:
1. How can the performance of the applications under operating system control be
measured?
This question can be categorised as method; the strategy of technique will be used to address
it, and it will be validated through implementation. The strategy of technique will involve
selecting a set of metrics that will measure the impact of the introduction of an operating
system on the user and application performance. This will be achieved by reviewing the
current research literature to determine exactly what application designers perceive the
performance characteristics of their applications to be. The most important performance
characteristics will then become the metrics that measure the performance associated with the
operating system. The research results will be validated through an implementation prototype
of the operating system and by executing some popular applications on the operating system.
3.4 Performance evaluation
The result of the research conducted in the previous section will be a set of metrics that can be
used to characterise the performance of a prototype operating system. The research questions
that need to be addressed in this section are:
1. What effect does this operating system have on application performance?
2. How quickly does the prototype respond to user interaction?
3. Are there any relationships between the results obtained from the experiments?
All of these questions are categorised as characterisation; the strategy of system will be used
to address them, and the results will be validated through an evaluation. To address these
questions via the strategy of system involves initially creating a test environment so the effect
the operating system has on application performance and user interaction can be measured.
This test environment will incorporate a selected benchmark application, a series of test cases,
and a prototype implementation as a test bed. To verify the result from the strategy of system,
an evaluation to determine if there are any relationships between any of the measured metrics
will be performed. If so, an attempt to derive formulas for predicting the correlation will be
undertaken.
3.5 Conclusion
In this chapter it was explained how the research conducted in this thesis will be divided into
four chapters. For each chapter, the research questions being proposed were presented,
methodologies to address these questions were put forward and the associated deliverables
were outlined. In the remainder of this thesis, these methodologies will be executed with the
aim of filling the research gaps that were exposed in chapter 2.
4 Abstractions, architecture and design flow
It was highlighted in the literature review that there appears to be no agreed set of
abstractions, architecture or design flow for a reconfigurable computing operating system.
Therefore a set of abstractions, an architecture, algorithm specifications and a new design flow will
be presented in this chapter. A summary of the previous work, methodologies and
deliverables associated with this chapter is shown in Figure 11.
[Figure 11 shows, for this chapter, the previous work drawn upon (software operating systems, FPGA technology, existing algorithms and the existing design flow), the methodology applied (analogous and uniqueness analysis, an architecture design process and a design flow analysis), and the resulting deliverables (operating system abstractions, an architecture design, specifications of algorithms and a new design flow).]
Figure 11: The previous work, methodology and deliverables associated with this chapter
The chapter is divided into four sections, each associated with a specific deliverable.
In the first section, a set of abstractions that suit a reconfigurable computing operating system
will be presented. This will be achieved through a survey of software operating systems and
reconfigurable computing technology, combined with a qualitative approach based on analogy
and uniqueness. The selected abstractions will define what the architecture of the operating
system must implement; the architecture is therefore presented in the second section. In the third section,
the specifications of the algorithms that implement the resource allocation and partitioning
components of the architecture will be defined. In the final section, the implications of the
new operating system architecture and its underlying abstractions for the design of
reconfigurable computing applications are investigated. This is combined with previously
published design flow research to result in a new design flow for application development
within an operating system environment.
4.1 Abstractions
Abstraction is a design technique that focuses on the essential aspects of an entity and ignores
or conceals less important ones [76]. It is an important tool for simplifying a complex
situation to a level where analysis, experimentation, or understanding can take place. It has
long been associated with classical software operating systems. A widely accepted set of
software operating system abstractions includes the process, the address space, and inter-
process communication [120]. These abstractions define the architecture and algorithm
specifications that will ultimately be implemented in the operating system. In an operating
system for reconfigurable computing, a generally agreed set of abstractions is yet to appear
(see 2.2.1). Therefore before any architecture or algorithm specification can take place, a set
of abstractions and resulting services needs to be selected.
In this section, a set of abstractions for a reconfigurable computing operating system will be
defined. This will be achieved by drawing an analogy from the software operating system
domain and examining unique features of a reconfigurable computer. For each abstraction that
already exists in the software domain (process, address space and inter-process
communication), the investigation aims to find out whether it can be transferred to the
reconfigurable computing domain. If there are unique features preventing direct transfer, this
work attempts to accommodate them.
4.1.1 Process abstraction
Early software computer systems allowed only one program to be executed at a time. This
program had complete control of the system and sole access to the subset of the system’s
resources which it was authorised to access. This resulted in the notion of a process shown in
Figure 12. A process is defined to be a sequential program in execution and is composed of
the object program (or code), the data on which the program will execute, any resources
required by the program and the status of the program’s execution.
Figure 12: Software Operating System Process
Analogy and uniqueness
As was the case in the microprocessor computer system, reconfigurable computing needs to
evolve into a multiple application environment to better utilise the hardware resources. If
multiple applications are loaded onto the reconfigurable computer, the hardware circuit, the
application data, and any resources the application is using all need to be associated with the
particular application. However the software process abstraction can not be transferred
directly to a reconfigurable computing operating system as there are three unique features of
reconfigurable computing applications preventing it.
Firstly, there is no exact counterpart of the software program code. In the software
process abstraction, the program code is a set of sequential instructions which can arbitrarily
be divided into equal sized parts. However, in a reconfigurable computer, the “program”
consists of a two-dimensional logic circuit that is commonly loaded in its entirety onto an
FPGA for execution. Partitioning hardware is far more computationally complex than
partitioning sequential software, unless the circuit has been arranged to facilitate this at design
time.
Secondly, maintaining the process state of a reconfigurable computing application is much
more complex than a software one. When a software process is swapped off a microprocessor,
the operating system performs a context switch to ensure the current state of the process is
maintained. This involves saving the values of a fixed number of registers, often including the
process number and a program counter. This procedure ensures the program can be loaded
with the same state at a later date. However if a circuit is swapped off an FPGA, there are not
a fixed set of registers that can be saved to ensure it can be reloaded with the same state at a
later date. This can only be achieved by saving all state holding elements of the circuit,
thereby resulting in a variable sized process state, very different from that of a software context.
The traditional software process abstraction is unable to hold a variable sized state and thus
can not be used for the reconfigurable computing process abstraction.
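The contrast between the two context models above can be sketched in software. This is a hypothetical illustration (the function and register names are invented): a software context snapshot always has the same fixed size, whereas a hardware snapshot must capture every state holding element and so varies in size with the circuit being swapped out.

```python
# Hypothetical sketch contrasting the two context-switch models described above.

# Software process: a fixed set of registers is saved on every context switch.
def save_software_context(registers):
    # always the same, fixed-size snapshot (e.g. program counter, stack pointer)
    return dict(registers)

# Reconfigurable process: every state-holding element of the circuit must be
# saved, so the snapshot size varies with the circuit being swapped off.
def save_hardware_context(state_elements):
    # state_elements maps each flip-flop/BRAM location to its current value;
    # its size differs from circuit to circuit
    return {name: value for name, value in state_elements.items()}

sw = save_software_context({"pc": 0x100, "sp": 0x7FF0})
hw_small = save_hardware_context({"ff0": 1, "ff1": 0})
hw_large = save_hardware_context({f"ff{i}": 0 for i in range(1000)})
```

The software snapshot is always two entries here, while the hardware snapshots differ in size, which is precisely why the fixed-size software context abstraction does not carry over.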
Thirdly, in a software process, the program data associated with the object code is always
contained in a separate part within that process. However, in a reconfigurable computing
application, instructions are circuits and the division between data and computational
elements is less clear. Three commonly used locations where data can be stored in a
reconfigurable computing application are the lookup tables [41], the block memory [140]
commonly distributed around the edge of the FPGA, and external data RAM [32]. Most
modern reconfigurable computing platforms have memory attached directly
to the I/O pins of the FPGA and data can be streamed into the circuit via an on-chip memory
controller. The classical software process abstraction is unable to represent all these forms of
data storage.
Due to these three unique features of reconfigurable computing applications, the software
process abstraction can not be transferred to the reconfigurable computing domain without
modification. An investigation into what has previously been documented in the research
literature that potentially modifies the software based abstraction to overcome these unique
features will now be undertaken.
Survey of literature
Although dividing hardware circuits and swapping parts during execution is usually avoided
because of the difficulty of saving state and the loss in performance, logic partitioning can be
performed with minimal loss in performance if an application is designed to support it. There
have been several suggested ways to structure a circuit so that the operating system can
perform such partitioning.
Firstly, the circuit could have a fixed structure composed of smaller equally sized circuits that
when arranged in a particular geometric alignment would be logically equivalent to one large
circuit. An example of where this type of structure is implemented is in the Virtual Hardware
Operating System [24]. Brebner introduced the idea of describing a circuit as a collection of
swappable logic units (SLUs), or “hardware pages”. This allows parts of the circuit or SLUs
to be more easily swapped in and out of the hardware on demand, similar to how a software
page is swapped in and out of memory. However, there are problems with an SLU like
structure. Partitioning a significantly large circuit into many small partitions can impact
performance, while partitioning it into a few large SLUs can lead to internal fragmentation of
the area. It would seem more appropriate to partition the application into variable sized parts
that follow the natural structure of the application.
Secondly, the circuit could be structured according to a data flow graph (DFG) [44] as was
demonstrated in the adaptive multi-user online reconfigurable engine AMORE [112]. A data
flow graph can be viewed as an abstract circuit representation without clock signals or timing
information, where the nodes represent operations and the edges represent data paths. In
particular the nodes of the graph can be either simple operations such as adders, bit shifts, or
memory read and writes; or complex operations such as floating point division or multipliers.
An example is shown in Figure 13. The advantage of modelling a circuit according to a data
flow graph is that the nodes of the graph do not have a fixed size area. A data flow graph based
circuit can be partitioned if there is insufficient area available on the FPGA for the entire
circuit to be loaded in one location.
[Figure 13 shows a data flow graph computing Y = (1 + 2) x 3: constant inputs 1 and 2 feed an adder node, whose output, together with the constant 3, feeds a multiplier node producing Y.]
Figure 13: Data flow graph
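The DFG of Figure 13 can be sketched as a small data structure. This is a hypothetical illustration (the Node class and operation names are invented, not part of any tool described in this thesis): nodes hold operations, edges are recorded as input references, and evaluation simply follows the flow of data.

```python
# Hypothetical sketch of the data flow graph of Figure 13: nodes are
# operations, edges are data paths, and evaluation follows the data flow.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op            # operation performed by this node
        self.inputs = inputs    # edges: nodes whose outputs feed this node

    def evaluate(self):
        if self.op == "const":
            return self.value
        values = [n.evaluate() for n in self.inputs]
        if self.op == "add":
            return values[0] + values[1]
        if self.op == "mul":
            return values[0] * values[1]
        raise ValueError(self.op)

def const(v):
    n = Node("const")
    n.value = v
    return n

# Y = (1 + 2) x 3: an adder node feeding a multiplier node
adder = Node("add", (const(1), const(2)))
y = Node("mul", (adder, const(3)))
```

Note that no node carries a fixed area or clock information, which is the property that makes the DFG a convenient abstraction for runtime partitioning.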
However, algorithms that partition a DFG require some intelligence to maintain the circuit’s
performance. Most of these algorithms [2] attempt to minimise the number of communication
channels required between partitions as these can increase the circuit delay. In an operating
system, not only does the number of communication links need to be minimised but the
application has to be partitioned into specified sizes that match the available space on the
FPGA. This means that some nodes may need to be aggregated and others separated. If
the DFG contains a feedback loop that is partitioned, a state holding element needs to be
inserted. A more detailed evaluation of logic partitioning is deferred until later in this thesis
(see section 5.2).
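The twin requirements above, packing nodes into partitions of a specified size while keeping the number of communication channels low, can be sketched with a simple greedy procedure. This is a hypothetical illustration only, not the partitioning algorithm evaluated in chapter 5; the node names and area figures are invented.

```python
# Hypothetical sketch: partition a DFG into size-limited parts and count
# the communication channels (cut edges) the partitioning introduces.

def partition(nodes, areas, max_area):
    """Greedily pack nodes (taken in data-flow order) into partitions
    whose total area does not exceed max_area."""
    partitions, current, used = [], [], 0
    for node in nodes:
        if used + areas[node] > max_area and current:
            partitions.append(current)   # close the full partition
            current, used = [], 0
        current.append(node)
        used += areas[node]
    if current:
        partitions.append(current)
    return partitions

def cut_edges(partitions, edges):
    """Count edges whose endpoints fall into different partitions."""
    where = {n: i for i, part in enumerate(partitions) for n in part}
    return sum(1 for a, b in edges if where[a] != where[b])

nodes = ["in", "add", "mul", "out"]
areas = {"in": 1, "add": 4, "mul": 6, "out": 1}
edges = [("in", "add"), ("add", "mul"), ("mul", "out")]
parts = partition(nodes, areas, max_area=7)
```

A real allocator would trade partition sizes against the cut count; the sketch only shows why both quantities matter at once.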
If any part of the reconfigurable computing application is to be swapped off the FPGA before
it has completed execution (pre-emption), a context switch similar to that associated with a
software program has to be performed: the status of the hardware circuit needs to be
preserved when it is removed from the FPGA surface part way through its execution.
Previous research in task switching for reconfigurable computing applications has been
conducted by Simmler [123]. He outlined that to successfully perform task switching, the
current state of all registers and internal memories must be able to be extracted from the
circuit, all registers and memory bits must be able to be preset or reset when a circuit is
restored, the position of all state holding elements must be known in advance to perform state
extraction, and the platform clock has to be able to be completely stopped. He also described
limitations of FPGA designs used in a task switching environment. Firstly, if latches or
registers are implemented by means of combinatorial logic, their storage can neither
be read nor initialised on most FPGAs. Secondly, the design must indicate when it is safe to
stop the clock and switch the task. For example, allowing a task switch at any time can lead
to a switch right after an addressing phase of the external memory; when the task is restored
it will read invalid data, as the valid data would have already been presented at the memory output. Due to
these design limitations, it is felt that pre-emption in the initial prototype would place too
many restrictions onto the designer and will not be considered.
Unlike in a classical software operating system, where a process with an extremely long
execution time typically blocks all other waiting processes, on a reconfigurable computer
such a process only blocks a portion of the FPGA area; other waiting applications can be
loaded onto the remaining available area. Pre-emption is therefore not as critical for a
reconfigurable computing multi-user operating system.
Reconfigurable computing process abstraction
The reconfigurable computing process abstraction that will be used in the operating system in
this thesis will consist of the hardware circuit being described as a data flow graph with data
source and sink nodes inserted for simplified I/O access. Hardware circuits can be designed
according to a DFG model [94], which simplifies I/O access since a DFG is modelled as a flow of
data. A DFG also provides support for efficient partitioning.
There are several research tools described in the literature that are able to convert a typical
hardware circuit into a data flow graph if not initially designed with one [74] [134] [89].
These tools could be used in conjunction with the standard FPGA design flow to assist in
developing applications so they can fit a DFG structure. If, however, the circuit can not be
modelled as a data flow graph, it can still be used with the process abstraction, although the
operating system will not be able to partition it unless the application comes with a
custom partitioning algorithm.
The use of a DFG with inserted data source and sink nodes simplifies I/O: as a DFG is
modelled as a flow of data, input can be loaded at one end of the graph and the output can be
obtained from the other (see Figure 14 (c)).
external resources is passed via the data source and sink nodes. These nodes are then
interfaced to a standard communication module that will be attached to every process that
requires I/O. This provides the basis for virtual I/O in the operating system. The associated
data will be streamed into the circuit from the external memory or other processes. This could
quite easily be extended to support streaming from both block RAM (BRAM) and lookup
table configured memory in future operating system prototypes. The reconfigurable process
abstraction in this thesis will not have the ability to store the process state, as the operating
system will not perform pre-emption.
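The process abstraction as defined above can be summarised in a short sketch. This is hypothetical (the class and field names are invented): the circuit is held as a DFG wrapped in data source and sink nodes, all I/O passes through a standard communication module, and no execution state is kept because the prototype does not pre-empt processes.

```python
# Hypothetical sketch of the reconfigurable computing process abstraction
# described above: a circuit held as a DFG with data source and sink nodes,
# I/O passing through a standard communication module, and no saved
# execution state (the prototype does not pre-empt processes).

class ReconfigurableProcess:
    def __init__(self, pid, dfg_nodes):
        self.pid = pid
        # source and sink nodes inserted around the application's DFG
        self.dfg = ["source"] + list(dfg_nodes) + ["sink"]
        self.state = None  # no pre-emption, so no state is ever saved

    def stream(self, data, communication_module):
        # all I/O between the process and external resources is passed
        # through the standard communication module
        return [communication_module(item) for item in data]

proc = ReconfigurableProcess(pid=1, dfg_nodes=["add", "mul"])
out = proc.stream([1, 2, 3], communication_module=lambda x: x)
```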
Figure 14: Reconfigurable computing process abstraction
As described so far, the reconfigurable computing process abstraction performs all I/O via a
specially constructed standard communication module attached to every process (Figure
14(a)). However, a special type of process abstraction that can directly access the FPGA pins
(Figure 14(b)) for application performance reasons is now introduced. This process
abstraction is very similar to the one described above, although it can have an alternative
source of I/O instead of, or as well as, the external memory. How the process connects to the
I/O pins will be left to the designer and abstracted away from the process abstraction. Such an
extension to the process abstraction is required for applications to avoid any loss in
performance due to the introduced memory latency. This may include applications that
require large amounts of streaming data such as real-time video images in an image fusion
application. The major reason why not all processes have direct pin access is that it places
a constraint on the allocation. Ideally, for routing efficiency, processes that have direct I/O pin
access should be placed as close to the pins as possible. Placing this restriction on all
processes may increase the complexity of the associated allocation algorithms.
4.1.2 Address space
Most modern software operating systems support multiple pseudo concurrent processes. This
is partly achieved by processes sharing the main memory and being swapped back and forth
to the microprocessor for execution. For the main memory to be safely shared amongst
multiple processes, it must be allocated to them according to operating system policies and
then have mechanisms put in place to prevent illegal access by other processes. This led to
what is known as the address space abstraction, which in a classical operating system is a linear
set of locations used by a process to reference the primary memory locations, operating
system services, and resources (see Figure 15). The address space stores all the logical entities
used by a process and specifies an address by which they can be referenced without kernel
involvement. A process can only reference memory that has been mapped into its address
space.
Figure 15: Classical operating system address space
Analogy and uniqueness
In a reconfigurable computing environment, if there are multiple applications on the FPGA at
one time, an address space abstraction will be required to prevent hardware circuits from
accessing or modifying parts of the FPGA that may affect other executing circuits. If the
application data is stored separately from the circuit, in on-board memory for example, a
mechanism to address and protect it will also be required. These requirements are analogous
to what the software address space abstraction can provide. However, there are three unique
features of a reconfigurable computer that prevent the software address space abstraction from
being transferred without modification into the reconfigurable computing domain.
Firstly, in a software operating system a process consists of sequential instructions and data
are allocated into memory and accessed through a linear address space. In a reconfigurable
computer, a process will consist of a two-dimensional logic circuit that needs to be loaded
onto an FPGA and possibly data that needs to be stored in on-board memory. In the
software address space abstraction there is no concept of a two-dimensional hardware
resource.
Secondly, there are other resources apart from external RAM that need to be allocated to processes
on a reconfigurable computer. In a software operating system, memory is a major resource
that requires allocation. In a reconfigurable computer, CLBs, routing wires, BRAM,
multipliers and I/O pins are just some of the resources that could require allocation to
processes. Address space allocation algorithms need to be modified to suit this complex
environment.
Three, sharing a logic circuit on a reconfigurable computer is much more
difficult than sharing a software program stored in memory. The software address space
abstraction allows software located in memory to be shared between processes for read
access. This has been well demonstrated through examples such as shared libraries. However
in a reconfigurable computer, not all circuits can be shared. For example, if a circuit has the
associated data embedded into it, sharing the circuit is not possible. For a circuit to be shared
it must be able to have data streamed in externally and it must be able to be time multiplexed.
An example of a shared circuit is a memory controller that is responsible for reading and
writing a single bank of off-chip RAM for several processes. Only one process at a time can
read or write to the memory and once it is complete the next process can use the shared
circuit. These three unique features of a reconfigurable computer prevent the software based
address space abstraction from being transferred from the software operating system domain
without modification.
Reconfigurable computing address space abstraction
The reconfigurable computing address space abstraction that will be used in this operating
system will consist of a two-dimensional address space for the FPGA and a single dimension
address space for the on-board memory as shown in Figure 16. The FPGA address space will
be represented in two-dimensions with each cell corresponding to a configurable logic block
(CLB). Each cell in the address space will initially only hold a value to represent whether the
CLB is available for allocation. This abstraction, in combination with an allocation algorithm
will provide protection as it will prevent other circuits from being allocated to occupied
FPGA locations. The on-board memory will be represented by a linear address space in
conjunction with conventional memory allocation algorithms. If the data used is not
embedded into the logic it must come from an external source and hence the external data
needs to be included with the logic in a single process concept.
Figure 16: Reconfigurable computing address space abstraction
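The two-dimensional address space just described can be illustrated with a short sketch. The class and method names below are assumptions for illustration only, not part of the thesis prototype.

```python
class FpgaAddressSpace:
    """Two-dimensional address space: one cell per CLB.

    Illustrative sketch only; the names are assumptions, not the
    thesis implementation.
    """

    def __init__(self, rows, cols):
        # Each cell initially holds only whether the CLB is available.
        self.free = [[True] * cols for _ in range(rows)]

    def region_free(self, row, col, height, width):
        """True if every CLB in the bounding box is unallocated."""
        return all(self.free[r][c]
                   for r in range(row, row + height)
                   for c in range(col, col + width))

    def allocate(self, row, col, height, width):
        """Mark a rectangular region as occupied; refuse overlaps."""
        if not self.region_free(row, col, height, width):
            return False
        for r in range(row, row + height):
            for c in range(col, col + width):
                self.free[r][c] = False
        return True
```

Refusing to allocate over occupied cells is what gives the abstraction its protective role: a new circuit can never be mapped onto CLBs that belong to a resident process.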
The local routing resources and I/O pins will not be separately represented in this address
space but will be considered to be included within the area used exclusively by the process, as
they will be assumed to be part of a primitive architecture (to be discussed in the inter-process
communication abstraction). The process can use any local routing resources within the
bounding box of area allocated to it. However, the operating system must be able to reserve
some routing resources for inter-process communication that do not constrain where the
process can be allocated.
4.1.3 Inter-process communication
In early operating systems, processes were able to communicate only through the use of
shared memory. In shared memory, the user implements code that accesses a special part of
the computer’s address space which more than one process can access. Data is placed in this
part of the address space by one process and other processes subsequently read and use it.
However, if two processes do not share the same address space, the operating system
kernel must manage access. Inter-process communication (IPC) provides a mechanism to
allow processes to communicate and to synchronise their actions without necessarily sharing
some part of the address space. IPC abstractions enable a process to copy information from its
own address space, form it into a message and send the message to a receiving process which
will copy it into its own address space. This is shown in Figure 17.
Figure 17: Software inter-process communication abstraction
Analogy and uniqueness
Processes in an operating system for reconfigurable computing also need to be able to
communicate with other processes that do not share the same address space. When an
application is partitioned, each partition becomes a new process and as these processes do not
share the same address space, a communication mechanism is needed. This concept is
analogous to the software inter-process communication abstraction, in which messages are
formed in packet-like capsules of data and passed between the communicating processes.
However, in a reconfigurable computer, instead of passing the packets via special files known
as ports as is the case in software, the data can be transferred via memory, non-shared
direct hardware channels, abutment, or an on-chip network.
Performing inter-process communication via memory often involves connecting the two
communicating processes to the memory via an arbitrator and memory controller (see Figure
18). This reduces the need for the memory access circuitry to be configured into each process.
When communication between the two processes is required, one process would indicate to
the arbitrator it wishes to write data into memory. The arbitrator would grant it access,
allocate it a memory location and the data would then be loaded into memory at that location
via the memory controller. Once the write was completed, the arbitrator would indicate to the
communicating process that data is available and it would be passed onto it via the memory
controller.
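The arbitrated memory scheme described above can be sketched in software as follows. The class name, the bump-pointer allocation and the mailbox signalling are illustrative assumptions, not the thesis design.

```python
class MemoryArbitrator:
    """Serialises the access of several processes to one memory bank.

    Sketch under assumed names: `memory` stands in for the off-chip
    RAM behind the memory controller, and `mailbox` models the
    arbitrator signalling the receiver that data is available.
    """

    def __init__(self, size):
        self.memory = [None] * size
        self.next_free = 0            # naive bump-pointer allocation
        self.mailbox = {}             # receiver id -> (location, length)

    def write(self, sender, receiver, data):
        """Grant the sender access, allocate a location, store the data."""
        if self.next_free + len(data) > len(self.memory):
            raise MemoryError("bank full")
        loc = self.next_free
        self.memory[loc:loc + len(data)] = data
        self.next_free += len(data)
        # Indicate to the communicating process that data is available.
        self.mailbox[receiver] = (loc, len(data))
        return loc

    def read(self, receiver):
        """Pass the pending data on to the receiving process."""
        loc, n = self.mailbox.pop(receiver)
        return self.memory[loc:loc + n]
```

Because every transfer goes through the one arbitrator, only one process at a time touches the bank, which is exactly the serialisation the text describes.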
Higher performance processes typically need to communicate directly to other processes
without the overhead introduced by the memory controller and arbitrator. This can be
achieved through non-shared hardware channels or via abutment (see Figure 18). Non-shared
hardware channels involve the use of a runtime router that can dynamically route channels
between the two communicating processes. Abutment involves collocating two processes
with a particular geometric alignment so that a standard interface creates a communication
channel just because they are placed next to each other. Once the channels have been routed,
the data can be transferred between the two processes with greater performance than if it
were transferred via memory.
Another mechanism that supports inter-process communication in reconfigurable computing
applications is the use of an on-chip network (see Figure 18). This involves configuring a
communication infrastructure onto the FPGA separate from all processes and then a process
can connect to the shared network. The advantage of using an on-chip network is that similar
performance may be achieved when compared to the non-shared hardware channels, but there
is more flexibility in where the processes can be allocated. There are a variety of network
topologies that can be used for the shared network, ranging from a bus network to a star
network, and these will be discussed in more detail in the next section.
Figure 18: Possible inter-process communication mechanisms: (a) Process 1 and 2 via a
hardware channel; (b) Process 1 and 3 via abutment; (c) Process 2 and 3 via onboard
memory; (d) Process 4 and 5 via a shared on-chip network
Although the software inter-process communication abstraction can be transferred to the
reconfigurable computing domain with little modification, an alternative to ports must be used
to transfer the data between communicating processes. Previously published communication
interfaces will now be surveyed.
Survey
An early attempt at inter-process communication with FPGAs via abutment was described by
Brebner [24]. He stated that for inter-process communication to be possible, each swappable
logic unit (SLU) had to be fixed in size and have a communication interface built in. Once the
SLUs were placed onto the FPGA in pre-determined locations, they could communicate with the
SLUs directly above, below, left and right of themselves via the communication interface. If an
SLU wanted to communicate with another SLU that was not in one of these locations, the
communication was either not possible, had to pass through intermediate SLUs until it reached
the required one, or had to wait until the desired location became available.
Mignolet [100] and Yi-Ran [143] avoided some of the problems that are faced by the
abutment style of communication. They proposed the use of a shared fixed uniform mesh
packet forwarding network as shown in Figure 19. This is the most widely suggested
architecture that is described in the literature for the sharing of a single programmable logic
chip among applications that are loaded at user demand. However, it does not avoid the
difficulty of fixed sized circuits. The problems with using equal sized processes in inter-
process communication are the restriction on the size of process that can be loaded,
and the possible increase in internal area fragmentation when a process and its fixed size
area segment are not exactly the same size. It can also result in a loss in application
performance because the application is automatically partitioned into multiple processes so it
can fit into the area requirements. However, this approach has the advantage that the network
is at a fixed location on the FPGA and does not need to be altered at runtime.
Figure 19: Fixed size processes arranged in a fixed mesh topology network
A consequence of the relaxation of the regular fixed size constraints is that a shared on-chip
network must be dynamically re-routable. Since routing at runtime involves online algorithms
that must have execution times that are not excessive in comparison with application
execution time, the complexity of the runtime routing must be restrained. In [81], Kearney
presented an evaluation of network topologies including bus, star, mesh, ring and tree that
might be suitable for such runtime re-routable shared networks. This is shown in Table 7. His
criteria were ease of implementation, wire routing cost (some topologies require many wires to
be run over large distances on the chip), concurrency (the ability to support multiple memory
banks), latency, and scalability (how the topology performs with a substantial number of
applications connected to it); all important criteria for a reconfigurable computing
operating system. He concluded that although the bus topology was clearly not the
best performer based on the criteria, the poor concurrency and latency of the topology could
be overcome through the use of multiple buses.
       Ease of          Wire routing
       implementation   cost           Concurrency   Latency   Scalability
Bus    ++               -              --            -         --
Star   ++               --             ++            ++        --
Mesh   --               --             +             +         +
Ring   ++               +              -             +/-       +/-
Tree   -                -              +             +         +

Table 7: Evaluation of network topologies
(+ favourable; - unfavourable; +/- neutral)
Reconfigurable computing inter-process communication abstraction
The software inter-process communication abstraction will be transferred to the
reconfigurable computing domain as there are no real unique features of FPGAs preventing it.
It will consist of the formation of messages and the passing of these between communicating
processes. However, instead of the use of ports, it will be supported by a pre-configured
primitive architecture. A primitive architecture is an FPGA logic design that is shared by
several applications and remains on the FPGA as applications are allocated and de-allocated. The
primitive architecture may be runtime reconfigured in minor ways as the needs of applications
change. The primitive architecture proposed here consists of a memory controller and shared
on-chip re-routable bus network (see Figure 20). A memory based inter-process
communication style was initially selected because it would be the easiest to implement for
the first operating system, as most platforms have onboard memory. However, there appears
to be no real reason why the other, more direct forms of inter-process communication cannot be
implemented in future prototypes. Processes wishing to communicate data can be placed
anywhere on the FPGA and the network will be re-routed to connect to them.
Figure 20: The on-chip network used in the reconfigurable computing inter-process
communication abstraction
4.1.4 Conclusion
In this section, through an analogy and uniqueness survey, a set of abstractions consisting of
the process, address space and inter-process communication were selected for use with a
reconfigurable computer. It was outlined that the process abstraction will consist of the
hardware circuit being described as a data flow graph with inserted source and sink nodes.
This gives the operating system the ability to stream I/O data into the application as well as
support application partitioning. The address space abstraction will consist of a two-
dimensional address space to represent the FPGA and a traditional one-dimensional address
space to address the attached memory. The inter-process communication abstraction will
consist of processes forming messages and passing them to other processes via an on-chip
network and memory controller. This type of abstraction best supports dynamically arriving
variable sized processes.
4.2 Operating system architecture
In the previous section it was demonstrated that an operating system for reconfigurable
computing should consist of three abstractions: process, address space and inter-process
communication. Although all of these abstractions exist in the software operating system
domain, they had to be modified in order to suit a reconfigurable computer. These newly
defined abstractions influence the structure and components of the operating system
architecture. In this section, the architecture for the operating system proposed in this thesis
will be developed. This will be achieved by summarising previous attempts at reconfigurable
computing operating system architectures; from this previous knowledge, any ideas that fit
the proposed abstractions will form the basis of the new architecture. The
new architecture will then be completed by describing all the relevant components and
interactions between them.
4.2.1 Previous reconfigurable computing runtime system architectures
In chapter 2, previous research on operating system like artefacts was reviewed. In this
section, lower level implementation details of runtime systems are reviewed. Several runtime
systems have been developed for reconfigurable hardware (see 2.2.2), and these have
resulted in a few simple customised architectures. The most primitive of these is the
client/server like architecture described by Simmler et al. [123]. This architecture is composed
of only three sub-systems: a client application, a hardware management unit, and the
reconfigurable hardware itself, as shown in Figure 21. In this architecture the client
application communicates with the hardware management unit, which in turn converts the
request into the specific platform API and then passes it directly to the hardware. This type of
architecture is very primitive and not really suited to the operating system proposed in this
thesis as it only performs simple tasks such as the configuration of the FPGA and
management of I/O. It does not provide support for dynamic resource allocation, application
partitioning or inter-process communication.
Figure 21: Client-Server model architecture
Burns [27] extended this architecture to include two new sub-systems: a transformation
manager and a configuration manager, as presented in Figure 22. The transformation manager
is responsible for transforming the circuits to improve area usage. For example, if the position
that the circuit was designed to be placed onto is occupied, the transformation manager will
rotate, mirror or scale the circuit so it can be placed in a different location. However, very
few details were
provided on how to actually perform the transformations and it is felt that it would be too
computationally expensive to do so at runtime. The configuration manager is simply a
hardware abstraction layer so the architecture can be ported to any type of reconfigurable
hardware. Although the architecture itself is not suitable for the operating system in this
thesis, as it also does not provide support for the proposed abstractions, the concept of a
hardware abstraction layer will be utilised.
Figure 22: The RAGE System Dataflow Architecture
4.2.2 Proposed reconfigurable computing runtime system architecture
As no operating system architecture that suits the selected process, address space, and
inter-process communication abstractions is described in the literature, the components and
the interactions between them that result in an architecture will be presented
here. The architecture shown in Figure 23 consists of seven components responsible for user
input, service providing, resource allocation, logic partitioning, bitstream compilation and on-
chip network configuration, hardware abstraction, and the on-chip network itself. Each of
these components will now be described in more detail, followed by a description of the path
a sample application would take to be executed under this architecture.
Figure 23: Architecture of the operating system
Shell
The shell in this operating system will be very similar to a traditional one. It will provide an
interface between the user, the operating system and the hardware. Users will input
commands and execute applications via the interface. The operating system service provider
is then responsible for converting them into appropriate calls to the operating system
application programming interface. Applications will be delivered to the Allocator, which will
begin the process of preparing them for execution. The operating system service provider will
also report back to the shell on the status of the requested operations. Possible commands
that the shell could provide include loading an application, accessing on-board
memory, and controlling the clock.
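A shell of this kind amounts to a dispatch from command strings onto operating system services. The sketch below illustrates this; the command names and the `services` interface are hypothetical, not part of the thesis prototype.

```python
def shell_dispatch(command, services):
    """Map a shell command onto an operating system service call.

    `services` is an assumed object exposing the calls the section
    lists (application loading, memory access, clock control); the
    command vocabulary is illustrative.
    """
    verb, *args = command.split()
    if verb == "load":
        # Hand the named application over for allocation and execution.
        return services.load_application(args[0])
    if verb == "read":
        # Read n words of on-board memory starting at an address.
        return services.read_memory(int(args[0]), int(args[1]))
    if verb == "clock":
        return services.set_clock(int(args[0]))
    return f"unknown command: {verb}"
```

In the architecture described here, the `services` object would be the operating system service provider, which in turn translates each call into the hardware abstraction layer API.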
Operating system service provider
The operating system service provider is responsible for interpreting the user commands from
the shell, converting them into the appropriate application programming interface calls, and
then passing them directly to the hardware abstraction layer. The services the component would
provide include on-board memory reading and writing, and platform specific configuration
information. The advantage of using such a component is that if the hardware abstraction layer
is altered at any point in time, only the operating system service provider has to be modified.
the architecture did not have such a component and the hardware abstraction layer was
changed, all of the user interfaces would need to be modified. As the operating system may
have several different shells in the future, this will reduce the amount of maintenance that
needs to be performed.
Allocator
The Allocator is responsible for finding a section of vacant area on the FPGA that is large
enough to accommodate the new application, and for allocating the necessary on-board
memory. This requires the Allocator to keep track of where the free FPGA area is and where
the previously allocated processes have been put. When an application is passed to the
Allocator from the operating system service provider, it will initially calculate an estimate of
the amount of free area that will be required to accommodate it. If the area is available, the
Allocator selects the best place to locate the application. If there is not enough area available
for the complete application, it will either partition the application or block it.
If there is enough area available but it is not in one contiguous block, the application will need
to be partitioned. The Allocator will then determine the largest segment of free area available
and pass that and the application to the Partitioner. This process may be repeated numerous
times until the complete application has been successfully allocated or partitioning fails and
the application is blocked. All of the allocated and partitioned parts will then be passed to the
loader for the next stage of processing. The full specifications of the allocation algorithm will
be outlined in section 4.3.2.
Partitioner
The Partitioner is responsible for partitioning the application logic into multiple parts if it
cannot fit onto the available area in its current geometric dimensions. If the Partitioner is called, it
would have already been determined by the Allocator that there is enough area available for
the application, but it is not in one contiguous block. The Allocator will inform the Partitioner
of the amount of area it has to partition the application into, and the Partitioner will return
the aggregation of data flow graph nodes that has been allocated to that particular segment
of area. This process will be repeated until the entire application has been allocated and
partitioned. If the Partitioner is unable to divide the application into a small enough partition
to fit into the specified area, it will inform the Allocator of this. The Allocator will then decide
whether to search for a larger portion of area or block and place the application in the ready
queue. More details on the partitioning algorithm will be given in section 4.3.3.
Loader
The loader is primarily responsible for creating the bitstreams that will configure the FPGA,
and for configuring the on-chip physical network. Once the application has been
partitioned into a process and allocated a place on the FPGA, the loader determines how the
communication network must be configured to incorporate the new process. It combines this
network configuration information with the process itself to produce the specific FPGA
bitstream. It then passes the bitstream and configuration information to the hardware
abstraction layer that will actually load it onto the target FPGA.
Hardware abstraction layer
It is commonly agreed that a hardware abstraction layer (HAL) is a layer of programming that
allows an operating system to interact with a hardware device at a more abstract level. Unlike
modern personal computer hardware such as the hard disk or memory, there is no standard
interface for an FPGA or reconfigurable computer. Every manufacturer has a different
application programming interface (API) and for the user to access it, they need to have a
good understanding of the platform. The hardware abstraction layer in this operating system
should provide a standard API that abstracts away the underlying reconfigurable computing
hardware. It should provide an API to connect to the platform, configure the FPGA, control
the clock rates and access any onboard memory. All of the hardware specific code can then be
hidden, not only from the user but from the operating system itself. The advantage of using
this type of HAL is the target platform can be changed without significant redevelopment of
the operating system except for the hardware abstraction layer itself.
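The HAL described here can be sketched as an abstract interface with one concrete implementation per platform. The method names below are assumptions for illustration, not any vendor's API.

```python
from abc import ABC, abstractmethod

class ReconfigurableHal(ABC):
    """Standard API abstracting the reconfigurable computing hardware.

    A sketch of the interface the section describes: connect to the
    platform, configure the FPGA, control the clock, access on-board
    memory. Method names are assumptions.
    """

    @abstractmethod
    def connect(self): ...
    @abstractmethod
    def configure(self, bitstream): ...
    @abstractmethod
    def set_clock(self, hz): ...
    @abstractmethod
    def read_memory(self, addr, n): ...
    @abstractmethod
    def write_memory(self, addr, data): ...

class SimulatedBoard(ReconfigurableHal):
    """In-memory stand-in for one concrete platform implementation."""

    def __init__(self):
        self.ram = {}
        self.clock_hz = 0
        self.loaded = None

    def connect(self):
        return True

    def configure(self, bitstream):
        self.loaded = bitstream

    def set_clock(self, hz):
        self.clock_hz = hz

    def read_memory(self, addr, n):
        return [self.ram.get(addr + i, 0) for i in range(n)]

    def write_memory(self, addr, data):
        for i, word in enumerate(data):
            self.ram[addr + i] = word
```

Porting the operating system to a different board then means replacing only the concrete subclass, which is exactly the redevelopment saving the section claims for the HAL.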
Network and network manager
The network and network manager together form the primitive architecture which is primarily
responsible for supporting inter-process communication and I/O. It consists of the inter-
process communication network and associated hardware to support it. The primitive
architecture is configured onto the FPGA by the loader before any user processes are. Then,
as processes are configured onto the FPGA, the loader informs the network manager via the
hardware abstraction layer of what must be changed in the network to support the incoming
process. Once the changes have been made, the network manager is responsible for arbitrating
between the processes to decide which has access to the network, guaranteeing that the
data being transferred is not damaged or corrupted.
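The network manager's arbitration role can be sketched as follows. Round-robin is an assumed policy here, since the text does not fix one, and the class and method names are illustrative.

```python
class BusArbiter:
    """Arbitrates which process may drive the shared on-chip network.

    Illustrative sketch; round-robin is an assumption, not the thesis
    policy.
    """

    def __init__(self):
        self.attached = []    # processes currently connected to the bus
        self.turn = 0

    def attach(self, pid):
        """Called as the loader connects a new process to the network."""
        self.attached.append(pid)

    def detach(self, pid):
        """Called when a process is de-allocated from the FPGA."""
        self.attached.remove(pid)
        self.turn %= max(len(self.attached), 1)

    def grant(self):
        """Grant the bus to exactly one process at a time, so that
        concurrent writers cannot corrupt in-flight data."""
        if not self.attached:
            return None
        pid = self.attached[self.turn % len(self.attached)]
        self.turn = (self.turn + 1) % len(self.attached)
        return pid
```

Granting the bus to a single process per cycle is what provides the corruption guarantee described above; the choice of rotation order is a policy detail.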
4.2.3 Sample application execution
The execution of an application begins with the user informing the operating system via the
shell that there is an application awaiting execution. The application, described in a data flow
graph format, is then passed via the operating system service provider to the Allocator. The
Allocator begins by calculating if there is enough vacant FPGA area for the entire application
to be configured onto the FPGA in its current geometric dimensions. If there is, the allocated
location and the application will be passed on to the loader. If not, the application will be
blocked and put into a ready queue until more area becomes available, and the user will be
informed via the operating system service provider. If there is area available, but it is not in one contiguous
block, the largest amount of free area and the application will be passed to the Partitioner. The
Partitioner will then attempt to partition the application into a process that can fit into the
allocated area. This process of partitioning and allocating is repeated until either the
application has been fully partitioned and allocated or all the area has been used. If all the area
is used and the application has not been fully allocated, the application will be blocked and
placed into the ready queue.
The completed application will be passed to the loader so an FPGA bitstream can be
produced. The loader will also configure the on-chip network for the processes that require
inter-process communication and off-chip I/O. Once the bitstream has been created, it will be
configured onto the FPGA via the hardware abstraction layer.
Once the application is in execution, the user can interact with any of the processes through
the operating system service provider. Via the shell, the user can request an I/O operation, an
alteration to the clock speed, or even the termination and removal of selected processes. The
user operations will be translated into the hardware abstraction API by the operating system
service provider, which will in turn pass the commands directly on to it.
4.2.4 Conclusion
In this section, the architecture for a reconfigurable computing operating system that suits the
process, address space, and inter-process communication abstractions has been presented. This
was achieved by firstly surveying the previous literature to investigate whether one had
already been proposed that could be used. It was shown that few architectures for
reconfigurable computing runtime systems have been presented and as such, a new one that
suits the selected abstractions was developed. This architecture consists of seven components
with the Allocator supporting the address space abstraction, the Partitioner supporting the
process abstraction and the network and network manager supporting the inter-process
communication abstraction. The other four components include a shell as a user interface, an
operating system service provider to support hardware configuration, a loader to generate the
FPGA bitstreams, and a traditional hardware abstraction layer to abstract the low level
platform programming from the user and operating system.
4.3 Algorithm specifications
It was highlighted in the architecture described in the previous section that the defining
components of the operating system, which will ultimately support two of the proposed
abstractions (process and address space), are the allocation of the FPGA area (Allocator) and
the logic partitioning of the applications (Partitioner). Although several algorithms have been proposed
in the previous literature to solve both (see section 2.3), very few of them are suitable for this
environment because of the need to allocate area to variable sized applications, partition
applications into predefined sizes, and to carry out both of these at runtime.
As these algorithms now need to be performed at runtime, because the status of the FPGA
cannot be predicted at design time, a trade-off between execution runtime and the quality of
allocation and partitioning needs to be made. In this section the requirements and algorithm
specifications of the allocation and partitioning components will be presented.
4.3.1 Runtime requirements for algorithms
The traditional design stages of partitioning, and placement and routing use stochastic
algorithms to produce high performance applications at the expense of the total execution run
time. This type of algorithm is ideal for the offline design flow because the resulting
performance of the application is far more important than the total execution runtime of the
algorithms. The designer is able and happy to wait for an extended period of time for the
stages to produce such a high performance result. However, in an operating system
environment this is neither necessary nor possible. A near-linear runtime complexity and an
actual execution time in the order of milliseconds are required.
If the execution runtime of either the allocation or partitioning algorithms is reduced, the
resulting performance of the application can suffer. In the case of allocation, the applications
may use more FPGA area than if a stochastic based allocation algorithm were used.
However, as FPGA logic density will reach 20 million system gates in the foreseeable
future, logic area is becoming much less of a restriction than it once was. As the applications
used in the operating system will be made up of larger granularity processes which have
already been pre-placed and pre-routed using the high performance commercial design tools,
the number of iterations a partitioning algorithm would need to make in order to perform a
successful partition would be reduced.
4.3.2 Allocation
The specifications of the FPGA allocation algorithm are as follows. When a reconfigurable
computing application dynamically arrives into the operating system via the ready queue, or
part of an application arrives via the Partitioner, the Allocator, given the application’s
geometrical dimensions, must first determine whether the total amount of area the application
requires is less than what is currently available on the FPGA. If not, the application will be
placed back into the ready queue. If there is enough area available the Allocator must return
the position and size of the area where the application can be put so it will not overlap or
interfere with any other already resident applications. If there are several positions where
the application can be put, the one that is returned must be the best in terms of some criteria. This
will be discussed in more detail in the next chapter.
If the total sum of the available area is greater than what is required by the incoming
application but there is not one contiguous block of area larger than what is needed by the
application, the position and size of the largest available area will be returned. This will then
be used by the Partitioner so it can attempt to assemble a subset of connected data flow graph
nodes that match the available area. This allocation process is shown in Figure 24.
Figure 24: Allocation service
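A minimal sketch of this allocation service follows. The naive first-fit scan below stands in for the best-of-several-positions criteria deferred to the next chapter, and the function names are assumptions.

```python
def find_placement(free, height, width):
    """Return (row, col) of the first free height x width region, else None.

    `free` is a 2-D occupancy map (True = CLB available). First-fit
    is an illustrative stand-in for the thesis's placement criteria.
    """
    rows, cols = len(free), len(free[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            if all(free[rr][cc]
                   for rr in range(r, r + height)
                   for cc in range(c, c + width)):
                return (r, c)
    return None

def largest_free_square(free):
    """Side length and origin of the largest free square region.

    Used when no single block fits the whole application and the
    Partitioner must be given the largest available area.
    """
    best = (0, None)
    rows, cols = len(free), len(free[0])
    for side in range(1, min(rows, cols) + 1):
        pos = find_placement(free, side, side)
        if pos is None:
            break
        best = (side, pos)
    return best
```

Note that restricting the "largest area" search to squares is a simplification; the thesis speaks of the largest available area in general.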
4.3.3 Partitioning
To avoid possible lengthy delays in user response time, logic partitioning will be used to
divide applications into geometric dimensions that match what is available on the FPGA.
Unlike most traditional logic partitioning algorithms, which perform bi-partitioning or min-cut
(see section 2.3.2), maximising the area utilisation of the FPGA requires that an application be
partitioned into a specified size. The partitioning algorithm must therefore be able to accept a
particular size constraint from the Allocator and fill the area with as many data flow graph
nodes as possible without affecting the application's performance or integrity, as shown in
Figure 25. It should also avoid partitioning feedback loops and minimise the amount of inter-
process communication where possible. Once the specified area is full, the Partitioner should
indicate to the Allocator that it requires another block of area, and then continue partitioning.
This process is repeated until the entire application has been partitioned.
Figure 25: Hardware partitioning
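The filling loop described above can be sketched as follows. This is an illustration under stated assumptions: nodes are taken in data flow order, each area grant from the Allocator can hold at least one node, and the feedback-loop and inter-process communication heuristics are omitted.

```python
# Sketch of size-constrained partitioning: fill each block of area granted by
# the Allocator with data flow graph nodes until full, then request the next.
def partition(node_areas, block_areas):
    """node_areas: ordered {name: area}; block_areas: area grants issued by
    the Allocator, in order. Returns a list of partitions (node name lists)."""
    partitions = []
    blocks = iter(block_areas)
    current, remaining = [], next(blocks)
    for name, area in node_areas.items():
        if area > remaining:            # block is full: take the next grant
            partitions.append(current)
            current, remaining = [], next(blocks)
        current.append(name)
        remaining -= area
    if current:
        partitions.append(current)
    return partitions
```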
4.3.4 Conclusion
In this section the runtime and functional requirements of both the allocation and partitioning
algorithms were defined. It was specified that, as the algorithms will be executing at runtime,
their runtime complexity must be approximately linear and they must execute in milliseconds
on a typical software platform. It was outlined that the allocation algorithm must calculate a
location on the FPGA that can accommodate the size of an incoming application, and that the
partitioning algorithm must be able to divide an application structured as a data flow graph
into any number of variously sized partitions. These runtime and functional specifications will
be used to guide the selection of the algorithms that will be incorporated into the operating
system prototype in chapter 6.
4.4 New application design flow
It was outlined in the literature review in section 2.4 that the traditional design flow does not
support the development of partial bitstreams for use with dynamic runtime reconfiguration.
Recent attempts have been made to overcome this limitation through the use of
methodologies that support module based application design [138]. However, dynamic
allocation and partitioning will require further modifications to suit the suggested operating
system environment. In this section the limitations of the current design flow methodologies
that prevent them from being used in this environment will be outlined. This will be followed
by the new design flow methodology that will be used to develop applications for execution
under the proposed operating system architecture.
The current design flow as outlined in the literature review is used for describing hardware
circuits that are to be loaded onto a reconfigurable computer whose entire surface is
configured at once. However, with many FPGAs now supporting dynamic runtime
reconfiguration, modifications to the design flow have had to be made to support it. A recent
suggestion is that dynamic runtime reconfiguration would benefit if applications were
designed as module-like components. These modules are then swapped in and out of the
same location on the FPGA through the use of dynamic runtime reconfiguration. This is
achieved through the compilation of an initial bitstream and numerous partial bitstreams that
are ordered and then configured onto the FPGA over time. This can only be performed if all
of the application modules are known prior to the compilation of the bitstreams. In the
architecture proposed above however, applications will dynamically arrive into the system
and through the use of the suggested operating system can be arbitrarily placed anywhere on
the FPGA. Applications that are designed with the current design flow are unable to be used
in such a system.
Firstly, as FPGA area allocation is performed at runtime because the availability of hardware
resources cannot be predicted at compile time, all of the application modules need to be
relocatable. However, in the current design flow these modules are pre-placed and pre-routed
and can only be relocated through the use of device specific APIs such as JBits.
Experimentation with JBits by the author, however, shows that it is unable to arbitrarily
relocate and reconnect pre-placed cores of a practical size. Secondly, as dynamic partitioning
will be used to divide an application into a more suitable geometric size, applications need to
be designed with a data flow graph structure. As logic partitioning is most commonly
performed manually by the application designer, the current design flow has no support for
designing applications with any such structure. Thirdly, a runtime router is required to finalise
any routes that are not local to any pre-routed core. In fact, pre-routed cores must be designed
to avoid some of the routing resources on the FPGA so these resources are available for
runtime global routing. This is because all current commercial FPGA architectures do not
have a separate global routing architecture. Once the position of the application has been
determined, the runtime router will then connect it to either another application or directly to
I/O pins. Fourthly, if the operating system is used with an architecture that does not support
dynamic reconfiguration, checkpoints need to be inserted into the design so the application
can be paused whilst a reconfiguration of the whole FPGA is carried out. There are no
provisions in the current design flow for any type of checkpoint insertion.
All of the limitations described above have led to the development of a new design flow
methodology. It initially involves describing the application through the use of a traditional
hardware description language. In order to utilise the dynamic partitioning of the operating
system, it must be designed with a data flow graph structure. The nodes of the graph will be
the computational elements of the application such as the adders and multipliers. The arcs of
the graph describe which nodes need communications between each other. Following the
design entry, the application is then synthesised and technology mapped. Each of the data
flow graph nodes is then internally pre-placed and pre-routed through the use of traditional
placement and routing algorithms. This reduces the amount of placement and routing that
needs to be performed at runtime. As there are no external routing connections made between
the nodes, they will be relocatable at runtime. Once all of the modules have been completed,
the application is ready to be loaded into the operating system. After the operating system has
determined the location of all the data flow graph nodes, the runtime router will be used to
connect all the communicating nodes together.
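A possible in-memory form of an application prepared under this design flow is sketched below: each node is a pre-placed, pre-routed core exposing only its footprint and ports, and the arcs name the connections the runtime router must make. All field and type names here are illustrative, not taken from the prototype.

```python
# Hypothetical representation of a data-flow-graph-structured application.
from dataclasses import dataclass, field

@dataclass
class CoreNode:
    name: str
    width: int                          # footprint in CLBs
    height: int
    bitstream: bytes = b""              # relocatable partial bitstream
    ports: list = field(default_factory=list)

@dataclass
class Application:
    nodes: dict                         # name -> CoreNode
    arcs: list                          # (src_node, src_port, dst_node, dst_port)

# Example: a multiplier feeding an accumulator, joined by one arc that the
# runtime router must realise once both cores have been placed.
app = Application(
    nodes={"mul": CoreNode("mul", 4, 3, ports=["a", "b", "out"]),
           "acc": CoreNode("acc", 2, 2, ports=["in0", "out"])},
    arcs=[("mul", "out", "acc", "in0")],
)
```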
4.5 Conclusion
This chapter resulted in four major deliverables. First, through a qualitative analogy and
uniqueness survey between software and reconfigurable computing operating systems, three
newly defined abstractions were described. These were:
1. The reconfigurable computing process abstraction consisting of the hardware circuit
described as a data flow graph model with data source and sink nodes inserted for
virtualised I/O.
2. The reconfigurable computing address space abstraction consisting of a two-
dimensional address space for the FPGA and a single dimensional address space for
the external memory.
3. The reconfigurable computing inter-process communication consisting of the
formation of messages and passing them between processes via a pre-configured
primitive architecture of a memory controller and on-chip re-routable bus network.
Secondly, these abstractions were combined with particular features from previously
suggested reconfigurable computing runtime systems to form the reconfigurable computing
operating system architecture. Its major components include a shell for user interaction, an
Allocator for resource allocation, a Partitioner for application partitioning, a loader for
configuring and managing the platform, and a primitive architecture that supports
inter-process communication.
Thirdly, the specifications of the algorithms that will implement the Allocator and Partitioner
components of the architecture were defined. It was determined that for the runtime
requirements of these algorithms, their quality of allocation or partitioning needed to be
traded for a reduced execution runtime. For the functional requirements, the Allocator must be
able to load an incoming application on a vacant part of the FPGA that will not interfere with
any other executing applications. The Partitioner must be able to divide a data flow graph
structured application into partitions of specified sizes.
Finally, a modified design flow for application development that suits the newly defined
abstractions and operating system architecture was proposed. This was achieved by
investigating the limitations of the current design flow and surveying the literature to
determine if any other application development environments could be used.
Chapter 5 – Resource allocation and application partitioning
5 Resource allocation and application partitioning
In the previous chapter, a set of abstractions, the operating system architecture and resulting
algorithm specifications for the components of allocation and partitioning were presented. In
this chapter the most suitable algorithms that meet the specifications of the allocation and
partitioning components will be selected, implemented and their performance measured
against a set of selected metrics. This will then enable the most suitable allocation and
partitioning algorithm to be selected. A summary of the previous work, methodologies and
deliverables associated with this chapter is shown in Figure 26.
Figure 26: The previous work, methodology and
deliverables associated with this chapter
The chapter is divided into two sections: algorithms for resource allocation and algorithms for
application partitioning. In each section there are three tasks undertaken that result in the
selection of the most suitable algorithm. This initially involves surveying the research
literature with the aim of listing all the algorithms that suit the functional and runtime
specifications outlined in the previous chapter. These algorithms are then sorted based on
their complexity and runtime performance. The higher ranked algorithms will be adapted and
implemented to suit the operating system architecture. The performance of the implemented
algorithms will then be measured through the use of selected metrics. The algorithm that is
judged to perform the best will then be selected to be used in the associated component within
the operating system prototype.
5.1 Allocation
When the FPGA surface is shared amongst multiple applications, an address space abstraction
will be used to define which resources each application has been allocated. It will also prevent
applications from corrupting each other by using already taken resources. In order to support
this address space abstraction, an Allocator, which is responsible for allocating hardware
resources to incoming applications, has been defined. In section 4.3.2, the functional
specifications of the allocation algorithm within the Allocator component were presented.
These are summarised below.
1. To determine the size and position of a vacant segment of area into which an incoming
application can fit without interfering with already allocated applications.
2. If there is enough vacant area available on the FPGA for the incoming application but
not in one contiguous segment, the largest segment that is available should be
determined.
3. If there is not enough vacant area on the FPGA for the incoming application to be
allocated onto, the application should be added to a ready queue until more area
becomes available.
4. If there is more than one possible location to place the application, choose the location
that maximises the usage of the FPGA area among all present and future applications.
In this section the most suitable allocation algorithm for use in the operating system prototype
will be selected. This will be achieved through an initial survey of the previous allocation
literature that appears in either the reconfigurable computing or other research domains. These
algorithms will be ranked according to their runtime complexity and the lower complexity
ones that meet the absolute runtime limits will be adapted to suit this environment. The
performance of these adapted algorithms will then be measured using selected metrics with
the aim of determining the most suitable allocation algorithm for use in the operating system
prototype.
5.1.1 Survey of allocation literature
Shown in Table 8 is a summary of the allocation algorithms presented in this thesis that have
some potential for use in the proposed operating system. They are ranked in order of runtime
complexity from least to most, where n is the number of possible locations the application
can be allocated onto and m is the size of the application being allocated.
Algorithm                           Runtime complexity   Satisfies functional specifications
Bottom Left [17]                    O(log n)             Yes
Minkowski Sum [58]                  O(n + m)             Yes
One Dimensional Bin Packing [90]    O(n log n)           No
Two Dimensional Bin Packing [13]    O(n³)                Yes
DREAM [51]                          O(n²)                Yes
Table 8: A summary of the well-known allocation
algorithms that appear in the research literature
The traditional bin-packing problem is similar to the allocation of FPGA area. Most bin-
packing algorithms concentrate on the classical one dimensional bin-packing problem but
these are not suitable for the allocation of FPGA area as the FPGA surface is two
dimensional. Two dimensional bin-packing algorithms can be adapted to suit the operating
system Allocator and a particular implementation of one that suits the functional
specifications has been described by Baker, Coffman and Rivest [13]. The problem associated
with this algorithm is that its stated runtime complexity of O(n³) far exceeds the linear
requirement defined in section 4.3.1. Bazargan [17] presented a modified two dimensional
bin-packing algorithm that, by not considering every possible place to which the application
could be allocated, reduced the runtime complexity to O(log n) and so met the requirement.
Although the quality of allocation is affected, this is offset by the significant decrease in
runtime complexity. Eatmon [51] presented a modified algorithm based on the Bazargan
proposal which improved the quality of the overall allocation, but at the expense of an
increased runtime complexity of O(n²).
The Minkowski Sum [58] has been shown to improve the utilisation of material when applied
to the problem of fabric cutting plans. This problem is very similar to the allocation of
applications onto an FPGA. However, the runtime complexity of determining the fabric
cutting plans, O(n⁵), far exceeds the linear requirement defined in section 4.3.1. This is
because non-convex polygons are allowed in the most general Minkowski Sum and a very
slow greedy algorithm is used to select the optimal location. The problem of allocating
applications onto an FPGA can be simplified by restricting it to convex polygons, as will be
shown in section 5.1.4. This reduces the runtime complexity of the Minkowski Sum to linear
time. In addition, another linear time algorithm, rather than a greedy based one, is used to
select between the multiple possible locations where an application can be allocated.
Only the Bottom Left and Minkowski Sum algorithms meet both the functional and runtime
specifications required for the Allocator. These two algorithms will now be described in more
detail, outlining any modifications that were needed to suit the operating system architecture.
5.1.2 Algorithm 1 – Greedy based
The first algorithm described and implemented for allocating the FPGA area to incoming
applications is based on a traditional greedy style algorithm. This algorithm was implemented
before either of the two described above because its runtime complexity is linear (see below),
it is very easy to implement, and it allowed a better understanding of the problem to be gained
before the more complex algorithms were implemented.
As the applications arrive into the operating system, they are queued in a standard first in first
out (FIFO) queue (shown in Figure 27 (a)). When there are one or more applications in the
queue, the algorithm will take the application at the front of the queue and begin the
allocation. This involves searching a list of areas that match the known minimum area
requirements of the application (shown in Figure 27 (b)). The minimum area requirements of
the application are pre-calculated at compile time and are known as a virtual rectangle. The
FPGA area in this algorithm is represented as a list, analogous to a list of disk blocks on a
disk drive. To calculate the location of the vacant area within which the application will be
allocated, the algorithm initially determines if there is enough vacant area on the FPGA for
the application to be allocated. If not, the application will be placed back into the queue until
more area becomes available.
Figure 27: Greedy based allocation
If there is enough vacant area, the bottom left corner of the virtual rectangle will be placed
over the first CLB in the list. If it overlaps with other previously allocated applications, the
virtual rectangle will progressively and deterministically be moved through the list until a
location can be found (shown in Figure 27 (c)) where the application can be allocated and not
interfere with other applications. Once the algorithm finds a successful location, it will stop
searching and mark the area as used (shown in Figure 27 (d)). The location details will then
be passed onto the next stage in the architecture for further processing. If the algorithm
searches the entire list of CLBs without finding a successful allocation, it will return the
location and size of the segment of free area that was closest to the application's area
requirements. The details of this segment are then passed onto the Partitioner for further processing.
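The scan just described can be sketched as follows. This is an illustrative first-fit scan only: the grid dimensions, the occupancy encoding as a set of CLB coordinates, and the function names are assumptions for the sketch.

```python
# Sketch of the greedy scan: the FPGA is a grid of CLBs, and the application's
# virtual rectangle is tested at each position in list order until a position
# is found where it overlaps no occupied CLB.
def greedy_allocate(occupied, cols, rows, w, h):
    """occupied: set of (x, y) CLBs already in use. Returns the bottom-left
    corner (x, y) of the first free w x h window, or None."""
    for y in range(rows - h + 1):
        for x in range(cols - w + 1):
            window = {(x + i, y + j) for i in range(w) for j in range(h)}
            if not window & occupied:
                return (x, y)           # first fit: stop searching here
    return None                         # caller falls back to the Partitioner
```

Because every candidate position may be examined, the scan is slow relative to the rectangle-list approaches that follow, which matches the measured results later in the chapter.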
5.1.3 Algorithm 2 – Bottom left
The second allocation algorithm chosen to be adapted and implemented for the Allocator is a
variation modelled on the bottom left algorithm proposed by Bazargan. It consists of two
parts: an empty space manager for the insertion and deletion of applications, similar to the
greedy based algorithm, and a set of heuristics for dividing up the free space. In the empty
space manager, the free FPGA area is represented by rectangles, where each rectangle may
span multiple CLBs (Figure 28 (c)). When an application arrives in the queue (Figure 28 (a)), the
algorithm searches the list of available rectangles, looking for a rectangle with dimensions
that are equal to or larger than the size of the application, remembering the size of the
application has been pre-calculated at compile time. When the first suitable rectangle is
located, the algorithm will allocate the application to the bottom left-hand corner of the
selected rectangle, assuming the application size is less than the rectangle it is being allocated
into (Figure 28 (d)). The location of where the application is to be placed is then passed on to
the placer for further processing. If there are no rectangles that can accommodate the
application, the application will be blocked and placed back into the ready queue.
Figure 28: The bottom left allocation algorithm process
In the second part of the bottom left allocation algorithm, the remaining free space is divided
into two more rectangles according to a heuristic. Initially, the remaining area is partitioned
into three new rectangles. These rectangles are defined by two segments that intersect with the
corner of the allocated application and the edge of the rectangle it is being allocated into (see
Sa and Sb in Figure 29 (a)). Depending upon the heuristic chosen when the algorithm is
started, either the shortest (SRS) (Figure 29 (c)) or longest (LRS) (Figure 29 (b)) remaining
segment is used to divide the remaining area into two rectangles. The details of the original
rectangle are then removed from the list and replaced with the size and location of the two
new ones.
Figure 29: The heuristic used to calculate the remaining rectangles
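The split can be sketched as follows, under stated assumptions: the application sits at the bottom-left corner of the free rectangle, the two segments run from its top-right corner to the right and top edges, and which of these Figure 29 labels Sa and Sb is assumed here. The function name and coordinate convention are illustrative.

```python
# Sketch of the SRS/LRS heuristic: extend either the shorter or the longer of
# the two segments through the placed application's corner, cutting the
# leftover L-shaped area into two rectangles.
def split(w, h, aw, ah, use_shorter=True):
    """Free rectangle w x h with an aw x ah application placed at (0, 0).
    Returns the two remaining rectangles as (x, y, width, height)."""
    horiz_seg = w - aw                  # segment running to the right edge
    vert_seg = h - ah                   # segment running to the top edge
    extend_vertical = (vert_seg <= horiz_seg) == use_shorter
    if extend_vertical:                 # full-height cut at x = aw
        return [(aw, 0, w - aw, h), (0, ah, aw, h - ah)]
    return [(0, ah, w, h - ah), (aw, 0, w - aw, ah)]   # full-width cut at y = ah
```

In either case the two returned rectangles together cover exactly the free area minus the application's footprint, so the empty space manager replaces one list entry with two.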
There were two changes made to the algorithm originally proposed by Bazargan. Firstly, the
algorithm can now dynamically choose between the heuristics that divide the remaining area
into rectangles; Bazargan made no mention in his paper of which strategy should be chosen
for which application. The Allocator can now select the shortest or longest segment in an
attempt to keep the available free rectangles as square as possible. For example, a rectangle
1 CLB wide x 12 CLBs high is much harder to allocate applications into than a rectangle
3 CLBs wide x 4 CLBs high. Secondly, the algorithm was
integrated into the operating system architecture so it could be called iteratively from the
Partitioner.
The time complexity of the bottom left algorithm is linear in the number of rectangles stored
in the list. As the algorithm creates at most two new rectangles every time a new application
is allocated, the number of rectangles is at most twice the number of applications on the
FPGA. Not only does this minimise the average total runtime of the allocation function, it will
also produce a more predictable runtime because the total number of areas stored in the list is
predictable.
5.1.4 Algorithm 3 – Minkowski Sum
The final allocation algorithm implemented for possible use in the operating system prototype
is based on the Minkowski Sum. The Minkowski Sum algorithm is often used in motion
planning to determine the free space among a set of obstacles so that an optimal path may be
planned for traversal by some physical entity between two points in the space [58]. The
specifications required by the allocation algorithm are similar to the motion planning problem
in that the identification of free space is needed.
The Minkowski Sum based allocation algorithm consists of calculating all of the possible
locations where the incoming application could be placed. A second step is needed to
determine which of those locations the application should be allocated to, to optimise
performance. The Minkowski Sum can be defined as the set of all points that are the sum of a
point in one set together with a point in another set. This is shown in Equation 1.
P ⊕ Q = { p + q | p ∈ P, q ∈ Q }
Equation 1: Minkowski Sum
For the allocation problem described in this thesis, it is assumed that the FPGA area can be
depicted as two polygons, U and F, where the used area is denoted by U and the vacant area
by F. Given that T is a rectangle depicting the area required by the incoming application, and
p is the centre point within T, all the possible locations into which it can be allocated are
found by calculating the Minkowski Sum of the polygons T and U. Figure 30 below further
illustrates the Minkowski Sum.
Figure 30: Minkowski Sum example
The polygon or application labelled T with the centre point p inside it represents a new
application in the queue waiting to be allocated onto the FPGA (Figure 30 (a)). The
crosshatched polygons represent the used space or other applications that have already been
allocated on the FPGA (Figure 30 (b)). Figure 30 (c) shows the Minkowski Sum of the
applications and is represented by the dotted area (S) surrounding those applications. This
area combined with the application area (U) is the location where the centre point p inside
polygon T cannot be placed, indicated as prohibited in the figure. The rest of the area (F) is
considered to be safe to allocate the application T into. If polygon T is translated around the
boundary of the dotted area S (Figure 30 (c)), with the centre point p of T fixed to the
boundary's edge, it is clear that T will always be touching but never intersecting with the shaded polygon.
For the Minkowski Sum algorithm to have a runtime complexity of O(U + T), as needed by
the Allocator, the polygons U and T must both be convex. As incoming applications are
rectangular in shape, T will always be convex. However, depending upon the previous
allocation history, U may be either convex or non-convex. To ensure the polygons are
convex, De Berg [46] proposed the following method, which avoids using any non-convex
polygons in the Minkowski Sum.
1. Decompose the non-convex polygon U into discrete polygons u1, u2, …, un.
2. For each of the polygons ui in the set of polygons u1, u2, …, un, find the Minkowski
Sum si of T and ui, giving s1, s2, …, sn.
3. Use the elementary polygon union operator to combine polygons s1, s2, …, sn.
This results in both of the polygons in the Minkowski Sum being convex and therefore meets
the runtime requirements of the Allocator previously defined.
The area where the incoming application can be allocated has now been calculated. However,
within this area, there are numerous locations where the application could be allocated. A
simple bottom left corner heuristic is used to determine the exact location of where to allocate
the incoming application. This involves calculating the location of all the corners of the
available area (see Figure 31 (a)) and then allocating the application into the segment whose
corner is closest to the bottom left of the FPGA and is large enough to accommodate the
application (see Figure 31 (b)). If there is no single segment large enough to accommodate the
application but the total amount of free area is greater than the pre-compiled estimate of the
application, the algorithm will determine the largest segment and pass its details onto the
Partitioner for further processing. If the FPGA does not have enough total area to
accommodate the application, it will be blocked and put back into the ready queue until more
area becomes available.
Figure 31: Bottom left heuristic used with the Minkowski Sum
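The corner selection can be sketched as follows. This is an illustrative sketch: the distance measure (Manhattan distance from the FPGA's bottom-left corner) and the function name are assumptions.

```python
# Sketch of the bottom-left tie-break: among candidate segments large enough
# for the application, pick the one whose corner is closest to the FPGA's
# bottom-left corner.
def choose_segment(segments, app_w, app_h):
    """segments: list of (x, y, w, h) free rectangles. Returns the chosen
    segment, or None if no segment can accommodate the application."""
    fitting = [s for s in segments if s[2] >= app_w and s[3] >= app_h]
    if not fitting:
        return None                     # caller falls back to the Partitioner
    return min(fitting, key=lambda s: s[0] + s[1])
```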
5.1.5 Algorithm performance
The performance of a hardware resource allocation algorithm can be measured by how much
of the resource cannot be utilised by any application. For example, if a microprocessor is
being shared, a measure could be how many of the instructions it processes over a set period
of time are not related to any application. To measure the performance of an FPGA allocation
algorithm, the raw area utilisation of the FPGA surface could be used. However, such a
measurement ignores the amount of area that is wasted because the algorithm does not tightly
pack all incoming applications, possibly preventing any application from being loaded there.
FPGA area is a finite and expensive resource and a trade-off between execution runtime and
area utilisation needs to be made. As each of the algorithms allocate the applications
according to a different set of rules, the overall utilisation of the FPGA area will vary for each
one. To measure the overall usage of the FPGA area, a metric known as fragmentation will be
introduced.
As the allocation algorithm in this operating system will be used at runtime, the amount of
time it takes to execute will also be a measure of its performance. To gain a more accurate
measurement of the expected runtime of each algorithm, an experiment will be performed to
measure it when each algorithm is used with various sizes and numbers of applications. As it
is currently unknown what type and size of applications will be used in conjunction with the
operating system, three sets of varying sized applications are proposed. The allocation
algorithms will then be tested using all three sets of applications. These sets of applications
have been selected to produce the worst degradation in runtime performance that would be
expected in the actual operating system environment.
In this section, an experiment that measures both the execution runtime and the amount of
area that is wasted due to poor allocations is performed. Initially, the experiment test bed is
described which includes the number and size of applications used in the experiment. The first
part of the experiment will be to measure the runtime consumed by all three algorithms under
various conditions. In the second part of the experiment, the same set of applications will be
loaded onto the FPGA and the fragmentation will be measured after each application has been
allocated. In both parts, results and graphs will be presented and conclusions will then be
drawn from the results obtained.
Experimental test bed
In order to create an environment in which realistic results could be generated, an assumption
on the size of the applications had to be made. As the operating system can accommodate
multiple concurrent applications, sets of incoming applications had to be generated. As it is
currently unclear what type of application the operating system will primarily be used in
conjunction with, it is difficult to estimate the size, arrival rate and execution time of the
incoming applications. As such, three categories of different means and standard deviation
area usage of the applications were used. These values were chosen to represent small, typical
and large applications with respect to the target size of the FPGA and the variations were
modelled on a Gaussian distribution. The application arrival and execution time were selected
for each category so there would always be several applications waiting in the queue. The
details are shown in Table 9.
Application set             Mean area (% of FPGA)   Std dev. area (% of FPGA)   Mean inter-arrival time (units of time)   Mean execution time (units of time)
Typical size applications   4%                      2%                          20                                        200
Large size applications     8%                      4%                          30                                        200
Small size applications     2%                      1%                          6                                         200
Table 9: Parameters of the applications used to measure the
execution runtime of the allocation and partitioning algorithms
Execution runtime
The experiment to measure the execution runtime of each allocation algorithm involved
generating three sets of applications, allocating them onto the FPGA with each of the
algorithms, and then measuring how long it took to complete the allocation for each
application (algorithm execution runtime). The execution time in this experiment was
measured as wall clock time on an otherwise idle Celeron 1.2 GHz microprocessor with
256 MB of RAM, running Microsoft Windows XP.
The experiment began by generating fifty applications for each of the sets of typical, small
and large sized applications. Each of these application sets was then allocated onto the FPGA
by all three algorithms, resulting in nine iterations of the experiment. A single iteration of the
experiment involved the following. Initially, the first application in the selected set was
allocated onto an empty FPGA and the execution runtime it took to do so was recorded. This
should result in the minimum runtime requirement associated with the particular allocation
algorithm. Each remaining application in the set was then allocated onto the FPGA, until no
further applications could fit; for each allocation, the number of other applications resident on
the FPGA was recorded and the execution runtime of the algorithm was measured. For each
set of applications (typical, small and large), a graph summarising the
number of applications resident on the FPGA versus the execution runtime for each algorithm
is presented in Figure 32.
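The measurement loop of a single iteration can be sketched as below; `allocate` is a hypothetical stand-in for any of the three allocation algorithms, whose implementations are not reproduced here.

```python
import time

def measure_allocation_runtime(allocate, applications, fpga):
    """Allocate applications one by one, recording (resident count, runtime in ms,
    success) until an application no longer fits. `allocate` returns True on success."""
    samples, resident = [], 0
    for app in applications:
        start = time.perf_counter()
        ok = allocate(fpga, app)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        samples.append((resident, elapsed_ms, ok))
        if not ok:
            break          # the FPGA is full; stop as in the experiment
        resident += 1
    return samples
```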
[Figure 32 comprises three graphs, "Allocation Execution Runtime", for the typical, small and large application sets. Each plots the number of applications already resident on the FPGA (x-axis) against time in ms on an Intel Celeron 1.2 GHz microprocessor (y-axis, 0 to 900 ms) for the Greedy, Bottom Left and Minkowski Sum with best fit allocation algorithms.]
Figure 32: The execution runtime of the greedy, bottom left
and Minkowski Sum allocation algorithms
The graphs in Figure 32 can be summarised as follows. The greedy based algorithm consumed the most execution runtime, averaging between 40ms when the FPGA was empty and up to 812ms when the incoming application could not be successfully allocated. The bottom left algorithm consumed the second least execution runtime, averaging 173% less than the greedy algorithm and ranging between 8ms when the FPGA was empty and 356ms depending upon the size and number of applications already resident on the FPGA. The algorithm that consumed the least execution runtime was the Minkowski Sum with best fit allocation, averaging 140% less than the bottom left and 598% less than the greedy algorithm; it ranged between 18ms when the FPGA was empty and 156ms depending upon the size and number of applications already resident on the FPGA. There are several points worth noting regarding these results.
1. The number of applications that could be allocated onto the FPGA varied depending
upon the algorithm used. Shown in Table 10 is the total number of applications that
could be allocated onto the FPGA by each algorithm for all three sets of applications.
The least number of applications were allocated onto the FPGA when the Minkowski
Sum algorithm was used, followed by the Bottom Left algorithm, and the most
applications were able to be allocated when the Greedy algorithm was used. This was
consistent across all three application sets. There was also little variation between the
numbers of applications allocated onto the FPGA by each algorithm indicating the
amount of area wasted due to fragmentation would be similar. This is shown later in
this section.
Algorithm     | Small | Typical | Large
Minkowski Sum | 40    | 23      | 14
Bottom Left   | 43    | 25      | 16
Greedy        | 45    | 28      | 18
Table 10: Number of applications allocated onto the FPGA
2. The minimum execution times for each of the algorithms were very similar. Across all
three application sets, the minimum execution runtime averaged 24ms for the
Minkowski Sum algorithm, 26ms for the Bottom Left algorithm and 28ms for the
Greedy Based algorithm. If the operating system only operated with few resident
applications on the FPGA at one time, there would be little advantage in using any
particular allocation algorithm.
3. In each of the graphs, the greedy based algorithm’s execution runtime peaked at
approximately 805ms. This situation occurred because in each case the attempt at
allocating the application failed. The number of times the virtual rectangle had to be
progressively moved through the FPGA until the allocation failed was similar for each
set, even though the applications were of various sizes. This is because the relative
sizes of the applications are quite small as compared to the size of the FPGA
(approximately 2% – 8%). This resulted in hundreds of moves and altering the number
of moves by 10 or 15 made very little difference to the execution runtime. However, if
larger sized applications were used, greater than 20% of the FPGA area, a reduction in
the overall execution runtime would be experienced.
4. The greedy based allocation algorithm consumed far more execution runtime when
used with large sized applications, averaging 259% more than the bottom left and
1123% more than Minkowski Sum. This can be explained by the way it performs its
allocation. For each application allocated, a search of the FPGA area begins in the
bottom left corner. When available area is found it immediately allocates the
application to it. Therefore, as the FPGA fills, searching from the bottom left corner becomes increasingly ineffective, as the search repeatedly re-checks area that has already been allocated.
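The bottom-left first-fit scan described in this point can be illustrated over a CLB occupancy grid; the grid representation and function names are illustrative assumptions rather than the thesis implementation.

```python
def greedy_first_fit(grid, w, h):
    """Scan a 2D occupancy grid from the bottom-left corner (row 0, col 0)
    and return the first (row, col) where a w x h application fits, else None.
    As the grid fills, the scan keeps re-checking already-occupied cells,
    which is why this algorithm's runtime grows with occupancy."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            if all(not grid[r + dr][c + dc] for dr in range(h) for dc in range(w)):
                return (r, c)
    return None

def place(grid, r, c, w, h):
    """Mark a w x h region starting at (r, c) as occupied."""
    for dr in range(h):
        for dc in range(w):
            grid[r + dr][c + dc] = True
```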
5. The fluctuations shown in the graphs of the bottom left allocation algorithm (17 to 25
typical; 30 to 43 small; 10 to 17 large) are due to the algorithmic complexity of
recalculating the free area when applications are removed. The process of
recalculating this free area involves combining all the surrounding available area and
then dividing it into the minimum number of rectangular segments so it can more
easily be reallocated. The sharp increase on the graph reflects whether an application
was removed before or after the incoming application was allocated.
Fragmentation
During the experiment to measure the allocation algorithms, it was found that each of the
algorithms allocated the incoming applications in different locations, as would be expected.
As a result it was noted that in some of the iterations of the experiment, more applications
could be allocated onto the FPGA when particular algorithms were used. This resulted in a
better usage of the available FPGA area. For example, more FPGA area would be wasted if a
small application was allocated in the middle of a large area. To measure this efficiency, a
metric known as fragmentation will be introduced.
Fragmentation is a term usually associated with secondary disk storage. Resources are
typically represented as linear arrays of equal sized blocks. Intuitively, disk fragmentation
exists when a large file is not able to be stored as a contiguous set of blocks and thus must be
partitioned. An FPGA is similar, as it is broken up into CLB block units, but the FPGA area is
two dimensional. There is no agreed definition of what area fragmentation on an FPGA is. It
is defined here that an FPGA is fragmented if an application is unable to be allocated to the
FPGA but the FPGA has adequate non-contiguous free FPGA area, as shown in Figure 33.
Figure 33: A Fragmented FPGA
Walder and Platzner [133] gave a method to measure fragmentation to quantify allocation
situations. Their fragmentation grade (or method for measuring it) is shown in Equation 2.
F = 1 − √( Σᵢ nᵢ·aᵢ² ) / ( Σᵢ nᵢ·aᵢ )    where nᵢ is the number of rectangles and aᵢ is their size
Equation 2: Walder Fragmentation Grade
The problem with this measure of fragmentation is that it does not work well in the case where every second CLB is used, the checkerboard case. For example, if the FPGA has 32 CLBs, of which 16 are used and 16 are available, laid out in a checkerboard pattern, the fragmentation grade is as shown in Equation 3.
F = 1 − √(16 · 1²) / 16 = 0.75
Equation 3: Example of Fragmentation Grade
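Equation 2 and the checkerboard example can be checked numerically; the list-of-pairs representation of the free rectangles is an assumption for illustration.

```python
import math

def walder_fragmentation(rects):
    """Walder/Platzner fragmentation grade (Equation 2).

    rects: list of (n_i, a_i) pairs, i.e. n_i free rectangles each of area a_i."""
    num = sum(n * a * a for n, a in rects)
    den = sum(n * a for n, a in rects)
    return 1.0 - math.sqrt(num) / den

# Checkerboard case: 16 free CLBs as 16 one-CLB rectangles gives
# 1 - sqrt(16)/16 = 0.75, even though no multi-CLB application fits.
```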
In this case the fragmentation grade should be 100%, as every application except those one CLB in size would need to be partitioned. To overcome this limitation, a new fragmentation measure was developed, expressed as a percentage in terms of the number of holes left between the previously allocated applications and the number of remaining free CLBs. It is shown in Equation 4. A hole is defined as a contiguous portion of free FPGA area. For example, if the FPGA has five holes and there are 50 free CLB blocks, the fragmentation percentage is approximately 8%. The measure also satisfies the checkerboard case that exposes the limitation of the Walder fragmentation formula. This measure will be used to rank the allocation algorithms in terms of fragmentation performance. It was also shown in section 7.3 that, when adjusted for the mean size of the applications on the FPGA, it is an excellent predictor of the user response time and application throughput associated with the operating system.
F = 0                                  if A = 1
F = (1 − (A − h) / (A − 1)) × 100      if A > 1

where h is the number of holes and A is the total free area. A has units of the minimum unit of allocation (CLBs in most cases).
Equation 4: Fragmentation percentage
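The new measure, and both worked examples above, can be checked with a few lines; the function name is illustrative.

```python
def fragmentation_pct(holes, free_area):
    """Fragmentation percentage from Equation 4.

    holes: number of contiguous free regions on the FPGA;
    free_area: total free area in minimum units of allocation (CLBs)."""
    if free_area <= 1:
        return 0.0
    return (1.0 - (free_area - holes) / (free_area - 1.0)) * 100.0
```

Five holes over 50 free CLBs gives roughly 8%, matching the worked example, and the 16-hole checkerboard over 16 free CLBs gives 100%, as required.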
The second part of the experiment measures the fragmentation generated by each algorithm
when the FPGA is at various capacities. This initially involved generating over a hundred
applications in each of the previously described application sets. These applications were then
allocated onto the FPGA by all three algorithms and the fragmentation was calculated. For
each of the application sets (typical, small and large), a graph summarising the fragmentation
versus FPGA capacity is shown in Figure 34.
[Figure 34 comprises three graphs, "Fragmentation", for the typical, large and small application sets. Each plots the amount of area used (x-axis, 20% to 70%) against the fragmentation measure (y-axis) for the Greedy, Bottom Left and Minkowski Sum with best fit allocation algorithms.]
Figure 34: Fragmentation recorded for the typical,
large and small sized applications
The graphs in Figure 34 can be summarised as follows. Averaging all of the 14 fragmentation measures taken in each application set, the Greedy algorithm produced the least fragmentation. This was followed by the Bottom Left algorithm, which created on average 16% more fragmentation than the Greedy algorithm. The Minkowski Sum algorithm with bottom left corner allocation produced the most fragmentation, averaging 34% more than the Greedy algorithm and 16% more than the Bottom Left. A summary of the increase in fragmentation between the algorithms is shown in Table 11. Several points can be drawn from the table and the graphs; they are discussed below.
Category | Greedy to Bottom Left | Greedy to Minkowski | Bottom Left to Minkowski
Typical  | 15.7%                 | 33.1%               | 15.2%
Large    | 15.2%                 | 30.3%               | 13.7%
Small    | 20.8%                 | 46.2%               | 18.0%
Total    | 16.0%                 | 34.2%               | 16.0%
Table 11: The average percentage increase in fragmentation
for the algorithms compared to each other
1. A higher percentage of fragmentation was generated when the small sized application set was allocated onto the FPGA, as compared to the typical and large sized application sets. This was consistent across all three allocation algorithms, with the maximum fragmentation recorded at 1.65% for Minkowski Sum, 1.42% for Bottom Left, and 1.23% for Greedy. This can be explained by the fact that, for the same percentage of area usage, allocating many small applications creates more small holes between them, which ultimately increases the fragmentation measure.
2. The maximum area usage of the FPGA varied depending upon the size of the application set used. For the typical sized applications, the maximum percentage of the FPGA occupied by applications was approximately 58%. This increased by 10 percentage points to 68% for the large sized applications, and by a further percentage point to 69% for the small sized applications. As the execution runtime of the applications was selected so that the FPGA would fill up and a queue of applications would form after some time, these results reflect the maximum usage area that can be obtained with applications of
the specified size. The remaining area not consumed by the applications is taken up by
fragmentation.
3. Although not conclusive, there appears to be a connection between the amount of fragmentation generated and the FPGA area usage. Within each application set, in most cases the fragmentation increased as the amount of area consumed on the FPGA increased. There are several measurements where this is not the case, including 36% and 43% for the small sized set; 64% for the large sized set; and 46% and 47% for the typical sized set. However, these outliers could have been generated either by the size of an incoming application exactly matching a hole, significantly reducing the fragmentation, or by an application having to be allocated in a location that generated several new holes. As FPGA usage increases, fragmentation would be expected to increase, as more holes are created by applications not being allocated in ideal locations.
5.1.6 Algorithm selection
From the results gained in the experiment described above, the algorithm based on the
Minkowski Sum with the bottom left corner heuristic was selected as the most suitable of the
three allocation algorithms to be used in the operating system prototype. The justification is as
follows. Although the fragmentation generated by the Greedy algorithm was the least, its absolute execution time was far too great compared with the Bottom Left and Minkowski Sum algorithms, especially for larger sized applications. An excessive execution runtime could ultimately result in a much longer response time, a factor that has to be minimised. Of the remaining two algorithms, the Bottom Left generated the least fragmentation, averaging 16% less, but consumed more execution runtime, approximately 140% more.
How area and response time are traded off depends on the particular situation, but as extra area can easily be purchased, it was decided to use the Minkowski Sum allocation algorithm in the operating system prototype: the increase in fragmentation it causes is not great, while the extra execution time of the Bottom Left algorithm is significant. Although FPGA area is a valuable resource and needs to be managed, execution runtime was valued more highly in this situation.
5.2 Partitioning
Once a reconfigurable computing application is under execution on the FPGA, it will be
known as a process. This process consists of an application, or part thereof, that is structured
in a data flow graph format with data source and sink nodes inserted for easier I/O transfer. In
an attempt to reduce the user response time and increase the FPGA usage, applications will be
broken down into multiple processes of specified sizes so they can fit into particular locations
on the FPGA. This requires a logic partitioning algorithm that has the following functional
specifications as were defined in section 4.3.3.
1. Partition an application structured as a data flow graph into various specified area constraints.
2. Partition the application in a way that does not affect the integrity of its operation.
3. Minimise the effect of the partitioning on the application's performance.
In this section the most suitable partitioning algorithm for use in the operating system
prototype will be selected. This will be achieved through an initial survey of the previous
partitioning literature that appears in either the reconfigurable computing or other research
domains. These algorithms will then be ranked according to their complexity and runtime
performance and the highest ranked ones that meet the runtime requirements will be adapted
to suit this environment. The performance of these adapted algorithms will then be measured
using selected metrics with the aim of determining the most suitable partitioning algorithm for
use in the operating system prototype.
5.2.1 Survey of partitioning literature
Logic partitioning has been an active area of research for at least the last 25 years, resulting in numerous algorithms being proposed and implemented. Logic partitioning has traditionally been used to divide an application into equal sized parts when it cannot fit onto the target device. However, in the proposed operating system, logic partitioning will be used to divide an application into a particular size and geometrical configuration. Shown in Table 12 is a summary of the partitioning algorithms presented in this thesis, ranked in order of runtime complexity from least to most; n is the number of nodes in the application to be partitioned.
Algorithm                   | Runtime complexity                     | Satisfies functional specifications
Temporal Partitioning [109] | O(V + E), V = vertices, E = edges      | Yes
FM [57]                     | O(n)                                   | No
Simulated Annealing [84]    | O(V^(1/2)·E), V = vertices, E = edges  | No
MP2 [136]                   | O(n²)                                  | No
KL [83]                     | O(n² log n)                            | No
Table 12: Summary of partitioning algorithm runtime complexities
Three of the most well-known partitioning algorithms are Kernighan and Lin (KL) [83], Fiduccia and Mattheyses (FM) [57] and Simulated Annealing [84]. These algorithms are based on iterative min-cut heuristics for partitioning networks and have runtime complexities of O(n² log n), O(n) per pass and O(V^(1/2)·E) respectively. Although very common, these algorithms are not suited to the proposed operating system, as they do not meet the previously defined partitioning specifications: they are unable to partition applications into varying specific sizes. They are primarily used to partition an application so that the number of communication channels required between the partitions is minimised. KL and Simulated Annealing also have runtime complexities that far exceed the linear specification discussed previously. Although FM is stated as being linear per pass, it is commonly accepted that several passes of the algorithm are required before an acceptable result is obtained.
There have been several partitioning algorithms targeted for use with FPGAs that consider the
hard size and I/O pin constraints associated with such devices. Woo and Kim [136] proposed
an extension to the FM algorithm that minimised the maximum number of I/O pins used on
the device. Kuznar et al [88] also modified the FM algorithm to address the problem of
partitioning applications onto multiple FPGAs. However again, these algorithms are not
suited to the proposed operating system as they partition applications into fixed sizes and their
runtime complexities are far too high. Purna [109] introduced the concept of temporal
partitioning a directed acyclic graph. Given the size of the FPGA, the algorithm will partition
an application into k-way equal sized parts in linear time. Although this algorithm does not
meet the variable partition size specification set for the Partitioner, it does meet the linear
runtime complexity constraint and could be adapted to support variable sized partitioning.
Only the temporal partitioning [109] algorithm proposed by Purna meets the runtime
requirements and partially meets the functional specifications needed by the Partitioner. This
algorithm will now be described, outlining the modifications that were made to adapt it to the
operating system environment.
5.2.2 Algorithm 1 – Temporal partitioning
Although the temporal partitioning algorithm proposed by Purna meets the linear runtime complexity required by the Partitioner, it needs modification. The major change is to move from partitioning applications in time to partitioning them in space. The original temporal partitioning algorithm initially assigns each node in the data flow graph an 'As Soon As Possible' (ASAP) execution level, or depth level (Figure 35 (b)). This level guarantees that a node can only execute once all of its predecessors have, thereby respecting the data flow graph's node dependencies. The algorithm then uses these ASAP levels to partition the nodes of the data flow graph into k equal sized partitions (Figure 35 (a)) with a time complexity of O(V + E), where V is the number of vertices and E is the number of edges in the data flow graph.
Figure 35: Temporal partitioning proposed by Purna
Instead of having to partition an application into equal sizes, the operating system requires the
partitioning algorithm to be able to divide the application into a number of predefined sized
partitions that match the current FPGA layout. For example if the incoming application was
30 CLBs in size (10 x 3) and there were two segments of free area of 20 CLBs (4 x 5) and 12
CLBs (4 x 3), the application would need to be partitioned to fit into these two segments.
This was achieved by adding a target partition size and a monitor that keeps track of how much of the application has been partitioned. The process of partitioning an application in the
operating system begins with the target partition size being calculated by the Allocator and
passing it to the Partitioner along with the application. As the application arrives at the
Partitioner, it calculates the ASAP levels for each of the data flow graph nodes. It then begins
to partition the data flow graph nodes according to their ASAP levels, starting at the lower
levels first. Once the combined area required by the nodes exceeds the target size of the
partition, the Partitioner records the last node that was put into the partition and returns back
to the Allocator the details of which nodes have been allocated into the partition. If the entire
application does not fit into the allocated segment, it will request another from the Allocator.
The Partitioner will then repeat the process described above, however it will begin with the
last node that was partitioned instead of the first node. This entire process is repeated until the
entire data flow graph has been partitioned into segments of vacant area.
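The modified procedure can be sketched as follows, assuming a simple dictionary-and-edge-list representation of the data flow graph. The names and the exact fit rule (a node is added only while it still fits within the target size) are illustrative assumptions, not the thesis implementation.

```python
from collections import defaultdict

def asap_levels(areas, edges):
    """ASAP (depth) level per node of a DAG.

    areas: {node: CLB area}; edges: list of (src, dst) dependencies."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = 0 if not preds[n] else 1 + max(depth(p) for p in preds[n])
        return memo[n]
    return {n: depth(n) for n in areas}

def partition_dfg(areas, edges, target_sizes):
    """Pack nodes in ASAP order into the allocator-supplied segment sizes.

    Returns one node list per segment used; raises if the application
    cannot fit into the available free segments."""
    levels = asap_levels(areas, edges)
    order = sorted(areas, key=lambda n: levels[n])
    partitions, i = [], 0
    for size in target_sizes:
        part, used = [], 0
        while i < len(order) and used + areas[order[i]] <= size:
            part.append(order[i])
            used += areas[order[i]]
            i += 1
        partitions.append(part)
        if i == len(order):
            return partitions
    raise ValueError("application does not fit in the available segments")
```

Using the 30-CLB example above with two free segments of 20 and 12 CLBs, a three-node chain of 10-CLB nodes is split across the two segments while respecting the dependency order.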
The changes made to the algorithm have not affected its runtime complexity. To gain an
accurate measure of the actual time the partitioning algorithm took to execute, an experiment
described in the next section was conducted to measure it under various conditions.
5.2.3 Algorithm performance
The performance of a partitioning algorithm is usually measured by the effect it has on the
performance of the application it partitions. If the clock speed or throughput is significantly
reduced because it has been partitioned, the algorithm is usually considered to be poor.
However the performance of a partitioning algorithm within an operating system must
consider both the effect it has on application performance and the amount of execution
runtime it consumes. In this section the amount of execution runtime that the partitioning
algorithm consumes under various conditions will be measured. The experiment to measure
the loss in application performance will be held off until the operating system prototype has
been described.
To measure the execution runtime the partitioning algorithm consumes under various
conditions, a trivial application structured as a data flow graph consisting of 40 nodes was
partitioned 40 times. Each time the application was partitioned, the number of parts it was
divided into increased. It was initially divided into two parts, then three, and so on until all 40
nodes of the data flow graph were divided into their own partition. Each time the algorithm
completed an iteration of the partitioning algorithm, its execution runtime was recorded. To
make sure only the runtime of the partitioning algorithm was being measured, an array of target partition sizes had been determined in advance, so as to prevent the measured times being distorted by the runtime of the Allocator. An application consisting of 40 nodes was used across all iterations of the partitioning experiment because this would produce the worst possible execution runtime in each test; for the same number of partitioned nodes, the algorithm would likely consume less runtime if the application had, for example, only 20 nodes in total. The results are expected to be very similar for other applications because, although the application only performed trivial computation, the partitioning algorithm does not consider what computation is performed within the nodes, only the connections between them.
The results from this experiment are graphed in Figure 36 and there are several points that can
be drawn from the results.
[Figure 36 plots the execution runtime of the partitioning algorithm: the number of cores the application is partitioned into (x-axis, 0 to 40) against time in ms on an Intel Celeron 1.2 GHz microprocessor (y-axis, 0 to 1200 ms).]
Figure 36: The execution runtime obtained from the partitioning algorithm
1. The minimum execution time consumed by the algorithm was approximately 85ms.
The major part of this execution time was consumed calculating the ASAP levels of
each node in the data flow graph as the application had 40 nodes at various levels.
This was evident because even though the number of partitions the application was
being divided into went from 2 to 5, the execution time only increased by 18ms. The
minimum execution time could be reduced if an application with fewer nodes was
partitioned.
2. The graph shows an approximately linear relationship between the number of partitions the application is divided into and the execution runtime in the range of 5 to 20 partitions. At 20 and 35 partitions the graph breaks from this linear relationship and increases significantly. There does not appear to be any obvious explanation for this.
3. There would be few situations where dividing an application into any more than 20
partitions would be suitable because the loss in performance would likely far outweigh
the benefits of squeezing the application into the last few percent of spare area on the
FPGA.
In summary, it is felt the execution runtime of the modified temporal partitioning algorithm
would not introduce too much of an overhead if integrated into the operating system. The
average application would only be partitioned into between 2 and 15 partitions and as such
would introduce approximately 85ms to 250ms of delay. This was considered to be an
acceptable overhead. Therefore, the modified temporal partitioning algorithm will be
integrated into the prototype operating system described in the next chapter. This concludes
the experiments conducted into the performance measurement of the allocation and
partitioning algorithms.
5.3 Conclusion
This chapter resulted in two major deliverables: an algorithm for the Allocator and an algorithm for the Partitioner. This was achieved by first creating a list of algorithms that
matched the runtime and functional specifications of the Allocator and Partitioner from either
the reconfigurable computing or non-reconfigurable computing domains. These algorithms
were then sorted based on their runtime complexity and the most promising were modified to
suit the architecture and then implemented for further experimentation. An experiment to
measure the execution runtime of the algorithms was then performed to determine whether it
was acceptable. From this experiment it was judged the best performing allocation algorithm
was the Minkowski Sum with bottom left heuristic which recorded a maximum execution
runtime of 100ms. It was also determined that the modified temporal partitioning algorithm
also meets the runtime requirements with a maximum execution runtime of approximately
200ms. Both of these algorithms will now be integrated into the operating system prototype
described in the next chapter.
6 Operating system prototype and metrics
In the previous chapters a set of abstractions, an architecture, algorithm specifications, and
specific allocation and partitioning algorithms for a reconfigurable computing operating
system were all defined. This chapter describes ReConfigME; the prototype of a
reconfigurable computing operating system. The chapter also reports on the experience
running applications on the operating system and introduces the metrics that will be used to
assess its performance in chapter 7. Figure 37 illustrates the methodology that is associated
with this chapter.
Figure 37: Previous work, methodology and
deliverables associated with this chapter
This chapter is divided into two sections and each section is associated with a deliverable. The
first section details how the prototype operating system known as ReConfigME was
constructed according to the architecture and algorithm specifications that were previously
defined in this thesis. This section will include a discussion on the prototype’s target platform,
application and primitive architecture, operating system structure, the applications developed
for use with ReConfigME, and issues that were faced during the implementation. Previous
research literature from both the software and reconfigurable computing operating system
domains will be used to influence the construction of ReConfigME. In the second section, a
set of metrics will be selected that will be used in the following chapter to measure the
associated performance of the operating system prototype. These metrics will be selected by
reviewing previous literature to determine what application designers perceive reconfigurable
computing application performance to be. These will be combined with any metrics that can
be transferred from the software operating system domain that measure important operating
system performance characteristics.
The actual programming of ReConfigME was carried out by the author, Martyn George,
Maria Dahlquist, and Mark Jasiunas based on the detailed design demonstrated here under the
direction of the author and his supervisor. The work was funded by the Sir Ross and Sir Keith
Smith Trust Fund and acknowledgements are made here to the programmers and the funding
authority that supported them over several years.
6.1 Operating system prototype
In the previous chapters, the process, address space, and inter-process communication
abstractions, an architecture, and allocation and partitioning algorithms were all discussed and
decisions were made on which were the most suitable for use in a reconfigurable computing
operating system. In this section all of these details and decisions are combined into the
construction of a prototype operating system known as ReConfigME. The purpose of
ReConfigME is to manage applications on the FPGA but ReConfigME does not run on the
FPGA. In theory it could run on the same FPGA if that FPGA had a suitable hard or soft core processor supporting high level languages and the FPGA supported self reconfiguration. However, current commercially available FPGAs do not have a fast enough hard core processor and do not have self reconfiguration capabilities. The algorithms comprising the operating system could also be adapted to run in hardware, but since this is the first prototype, the emphasis is on an easy implementation platform. ReConfigME is therefore a set of Java applications executing in software on a standard PC. Shown in Figure 38 is the internal structure of
ReConfigME. This implementation architecture is more complex than the original
architecture proposed in chapter 4 as that architecture only maps to the Colonel component of
the operating system.
Figure 38: ReConfigME implementation architecture
The ReConfigME implementation is structured into three tiers consisting of user, platform
and operating system which are connected via a standard TCP/IP network. Users connect to
ReConfigME through a custom built client interface which enables them to load applications,
transfer application data and configuration information, and monitor the reconfigurable
computing platform status. ReConfigME enforces a strict FPGA application architecture
consisting of a data flow graph structure, memory based I/O, EDIF application file format,
and the associated software only components. It supports multiple applications through the
use of FPGA hardware resource allocation, application logic partitioning, runtime bitstream
generation, and runtime reconfiguration. For easier implementation and due to technology
limitations, ReConfigME has a limit on the number of concurrent applications and uses static
application memory allocation. The current FPGAs and their design tools do not support
dynamic runtime reconfiguration of arbitrarily sized applications, so ReConfigME simulates
dynamic runtime reconfiguration. When ReConfigME needs to allocate a new application to
the FPGA, all running applications are checkpointed, the FPGA clock is stopped, and a new
bitstream including the new and all the existing applications is downloaded. The existing and
new applications are then started or restarted.
This section is structured as follows. The reconfigurable computing platform used and the
factors affecting its selection are described first. In section 6.1.2, the restrictions placed on the
application’s design are outlined as an application architecture. The primitive architecture
used to support the inter-process communication abstraction is then detailed. In section 6.1.4,
ReConfigME’s software implementation structure is described which includes the use of a
three-tier networked communication architecture. A detailed listing of the procedure involved
in executing an application under ReConfigME is then described through the use of a sample
application. The applications that were implemented to test the correct functionality of
ReConfigME are then detailed. The section concludes by reviewing why the implementation
did not entirely match the proposed architecture and the implementation issues that were
faced during construction.
6.1.1 Hardware platform
The prototype of ReConfigME was developed on a standard PC with a Celoxica RC1000pp
development board, in a typical co-processor configuration. The RC1000pp is a standard PCI
bus card equipped with a Xilinx Virtex XCV1000 part with 1 million system gates. It has
8Mb of SRAM directly connected to the FPGA in four 32-bit wide memory banks. The
memory is dual ported to the host CPU across the PCI bus accessible by DMA transfer or as a
virtual address. Figure 39 is a block diagram showing the connections between the
components of the RC1000pp development board.
Figure 39: RC1000pp Block Diagram
This platform was selected for the operating system prototype for several reasons. Firstly, the
platform consists of a medium grained FPGA, loosely coupled to a modern high performance
microprocessor via a standard PCI bus. This configuration was determined in section 2.1.2 to
best suit an operating system environment. The medium grained FPGA has ample resources
to be shared amongst multiple concurrent applications and the PCI bus has sufficient I/O
bandwidth to support the streaming of data into the applications. Secondly, the platform has
four banks of high capacity dual port memory. As was described in the inter-process
communication abstraction, processes will communicate with each other and the external
microprocessor via the platform’s on-board memory. External I/O data will be loaded into a
memory bank via the PCI bus and then passed into the process via the FPGA pins and
memory controller. This type of I/O transfer requires dual-port memory so that both the host
and the FPGA can access the memory bank directly. Finally, the platform supports runtime
reconfiguration via SelectMAP over PCI.
Figure 40: The RC1000pp
6.1.2 Application architecture
The applications used in conjunction with ReConfigME need to be designed with the
following four characteristics. Firstly, the applications should be structured according to the
data flow graph model defined in section 4.1.1 and shown in Figure 41. This enables the
Partitioner to divide the application into several partitions that better match the geometric
dimensions of the vacant FPGA area. However, applications that are not structured as a data
flow graph model can still be used with ReConfigME but no attempt to partition them will be
made. This may result in an extended application response time, as not being able to partition
the application will increase the chance of it being blocked if the particular size of vacant area
needed is not available.
Figure 41: Application architecture for ReConfigME
Secondly, data source and sink nodes are inserted into the applications at points where input
or output data is required. As all inter-process communication is conducted via on-board
memory, these nodes provide the interface between the application and the on-chip memory
controller. The applications in the prototype have access to 1Mb of memory each starting at a
virtual address of 0x00. The memory controller will then convert this virtual address to a real
address based on the static allocation of external memory. The on-chip memory controller in
conjunction with the ReConfigME server is then responsible for reading and writing the data
to and from the applications and the appropriate memory location. This interface allows
applications to be programmed in both VHDL and Handel-C.
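The static virtual-to-real translation just described can be sketched as follows; the 1 Mb window and the 0x00 base follow the text, while the class itself and the block-index scheme are illustrative assumptions.

```java
// Hypothetical sketch of the memory controller's static address translation:
// each application sees a 1 Mb window starting at virtual address 0x00, which
// is mapped onto a fixed 1 Mb block of the board's memory chosen at load time.
public class AddressTranslator {
    static final long BLOCK_SIZE = 1L << 20; // 1 Mb per application

    // blockIndex is assigned statically when the application is loaded.
    public static long toReal(int blockIndex, long virtualAddr) {
        if (virtualAddr < 0 || virtualAddr >= BLOCK_SIZE)
            throw new IllegalArgumentException("outside the 1 Mb window");
        return blockIndex * BLOCK_SIZE + virtualAddr;
    }
}
```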
Thirdly, each of the nodes of the data flow graph must be relocatable on the FPGA as
ReConfigME will determine where to allocate the nodes at runtime. The current tools do
not support runtime routing of pre-placed and pre-routed applications, which prevents them
from being arbitrarily relocated. Each node of an application is therefore synthesised into an
intermediate file format at compile time, but not placed and routed to a bitstream. The
intermediate file format chosen for ReConfigME is EDIF. This format has advantages over
many of the others: almost all design entry methods can generate it, it is not specific to a
particular FPGA vendor or company, it has an open specification, and multiple EDIF files can
easily be merged to produce a single FPGA bitstream. The EDIF files are combined by the
operating system with an area constraint file that specifies the location of each node, and the
complete FPGA is then placed and routed.
Finally, each of the nodes in the data flow graph model and the entire model itself must have
an estimate of the geometric dimensions of the FPGA area they will require when passed into
ReConfigME (see Figure 41). It is therefore necessary at design time to execute the place and
route tools over each of the nodes to gain a size estimate. This estimate cannot be expected to
be entirely accurate, especially if the aspect ratio has to be changed, and as such a margin for
error is added to the area estimate used in ReConfigME.
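The padding step can be illustrated with a one-line sketch; the margin fraction below is an assumption, as the thesis does not state the value used.

```java
// Hypothetical sketch of padding a design-time area estimate: place and route
// gives a raw CLB count, and a safety margin is added because the estimate
// degrades if the node's aspect ratio must change at allocation time.
public class AreaEstimate {
    public static int padded(int rawClbs, double marginFraction) {
        return (int) Math.ceil(rawClbs * (1.0 + marginFraction));
    }
}
```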
6.1.3 Primitive architecture
The primitive architecture of ReConfigME is that part of the hardware that is configured onto
the FPGA before any user applications and remains there. The primitive architecture is used
to support the previously defined inter-process communication abstraction. It consists of a
memory controller and network terminators. The memory controller is responsible for
granting access to the memory when requested by an application, and managing the transfer
of the I/O to the particular application. As the RC1000 consists of four 2Mb memory banks,
accessible either via the host computer or FPGA, the memory controller has to negotiate with
the platform memory arbitrator to ensure both the host and FPGA applications do not write to
the same memory bank simultaneously. For ease of implementation, the memory controller
logically divides the memory into fixed-sized blocks, each of which is then allocated to a
single process requiring I/O. Although this limits the total number of processes resident on the
FPGA, it will not impact the results gained from the set of experiments that will be
conducted on the prototype. Shown in Figure 42 are the memory and the primitive
architecture associated with ReConfigME.
Figure 42: Operating system primitive architecture
As I/O arrives at the memory controller from a process, it negotiates with the memory
arbitrator to ensure it has exclusive access to the particular memory bank. Once access has
been granted, it then has to convert the local addressing scheme that each process is using into
the global addressing scheme to ensure the data is loaded into the correct location in memory.
The memory controller then either reads or writes the data into the calculated memory
position.
Each process is connected to the memory controller via a single network terminator. A
network terminator simply provides the matching interface for the data source and sink
nodes so processes can easily connect to it. This currently consists of a custom bus of 21
address lines, 32 data lines, 4 single-bit control lines, and a single-bit clock line. Processes can
then read or write anywhere within the 1 Mb range allocated to them.
6.1.4 ReConfigME implementation architecture
The overall architecture of the operating system is component based with each operation
separated into small independent components which communicate via a simple message based
mechanism. As there are many issues relating to reconfigurable computing operating systems
that have not been fully researched, this type of architecture was chosen over the more
traditional monolithic operating system architecture. As the requirements of a modern
traditional operating system have been well defined, the implementation of a monolithic
architecture is relatively straightforward. However, in a reconfigurable computing operating
system the requirements are still unclear, so a monolithic architecture was avoided because
such systems are difficult to maintain as the detailed requirements emerge.
A simple multi-client server arrangement was chosen to structure the prototype's
inter-component communication. This involved one client-server connection between
the user and bitstream generation components, and another between the bitstream generation
components and the reconfigurable computing platform. This allows the user to be remotely
located from the majority of the operating system components, possibly via a remote web
front end, and the reconfigurable computer to be remotely located from the bitstream
generation tools. Another benefit of this design is that ReConfigME can manage multiple
FPGA cards which can be physically located within the same machine or in separate
machines making it easily scalable. Such an arrangement allows maximum flexibility with
respect to location of the user, platform and ReConfigME’s bitstream generation components.
The inter-component communication structure has ReConfigME divided into three tiers; user,
operating system, and platform. Although there is no general agreement about what
constitutes a tier [6], a machine separated by network communication is considered a tier
in this prototype. The user tier primarily handles interaction between the operating
system tier and the user by providing a shell as an interface. The operating system tier
contains the operating system architecture that consists of the resource allocation, application
partitioning and bitstream generation. The platform tier consists of the reconfigurable
computing platform and the components needed to access it. ReConfigME’s components
were then separated into these three tiers, as can be seen in Figure 38. The rounded
rectangles indicate components constructed specifically for the prototype, while rectangular
components represent off-the-shelf products. This figure is very similar to a protocol stack;
data enters the tier via the bottom component which is connected to the others via a physical
network. Data progresses through the tier until it reaches the destination component. Likewise
data that needs to be transferred to another tier will progress down through the tier until it
reaches the physical network. Each of the components and tiers will now be discussed in more
detail.
Platform tier
The platform tier consists of seven components and is primarily responsible for the
communications to and from the reconfigurable computing platform. All of the components
except the reconfigurable computer and the network are resident in software on a PC that hosts
the reconfigurable computing platform. The top level component is the hardware abstraction
layer (HAL) server, which is responsible for hiding the platform-specific API. It is a simple API
written in Java that can be used with various platforms to offer access and control over the
hardware. It provides methods for reading and writing bitstreams to the FPGA, reading and
writing to the on-board memory, and clock management. As the RC1000 used in
ReConfigME is shipped with C++ libraries, Java native method calls were used to connect the
hardware abstraction layer API to the corresponding RC1000 library method. The advantage
of the hardware abstraction layer is the same API can be used to communicate to any number
of different target platforms.
The hardware abstraction layer also supports a client/server paradigm so the reconfigurable
computing platform can be remotely located (see Figure 43). Connections are made to the
HAL server via standard TCP/IP sockets from the HAL client, located in the operating system
tier. Bitstream files, input and output data, and clock configurations are then passed back and
forth between the client and server.
Figure 43: Platform tier architecture
The other components in the platform tier are used to support the HAL server. Java was
chosen as the implementation language because of its ease of internetworking, its object-oriented
semantics, and its portability across different hosts, operating systems, and
hardware. The PC operating system component, in this case Windows XP, is needed to
manage the hardware resources of the host computer and the TCP/IP and network components
are required to provide the connectivity between the HAL server and the HAL client.
Operating system tier
The operating system tier consists of seven components and is responsible for allocating and
partitioning applications, the generation of the FPGA bitstreams, and the transfer of
application data and configuration information between the platform and user tiers. The top
level component of the operating system tier is dubbed the “Colonel” (analogous to the kernel
of a software operating system, but spelt differently to avoid confusion). The Colonel does
everything except the transfer of data between the other tiers. It consists of three
sub-components and the
bitstream generation tools (see Figure 44) in a structure that reflects the architecture of the
operating system that was described in section 4.2.2.
Figure 44: Architecture of ReConfigME’s Colonel
As a user connects to ReConfigME, their application and configuration information is passed
into the Colonel via the user server. The application and its pre-compiled geometric
dimensions are then passed to the Allocator, which, in conjunction with the Partitioner,
determines whether the application can be configured onto the FPGA or is blocked and put
into a queue because of a lack of vacant area. The Allocator consists of the Minkowski Sum with
bottom left fit algorithm that was described in section 5.1.4 and the Partitioner consists of the
modified temporal partitioning algorithm that was described in section 5.2.2.
Once the locations of all the application's partitions have been determined, the Allocator
creates a constraints file which ensures that the absolute placement details it has calculated
are followed when the FPGA bitstream is generated. In ReConfigME, the constraints
file is in the standard vendor format. The main control loop then creates and calls a script
that executes the place and route tools. This generates an FPGA bitstream that includes all
of the loaded applications in their correct locations. The bitstream is then passed to the HAL
client, which is responsible for connecting to the platform and configuring the new bitstream
onto the FPGA.
The Colonel also manages the transfer of application data, which involves capturing the input
data from the user and loading it into the on-board memory, and reading the output data from
the on-board memory and passing it back to the user. This task primarily consists of an address
translation. The local addressing scheme is translated into the platform’s global addressing
scheme to ensure the correct location in the platform’s memory is accessed for either reading
or writing. The Colonel also passes specific clock and platform information between the HAL
client and user server.
The second level components of the operating system tier are the HAL client and the user
server (see Figure 45). The HAL client component is responsible for creating a connection to the desired
platform and passing all of the I/O, bitstreams and configuration information between the two.
It allows the platform to be remotely located from the Colonel. The advantage of this is
that ReConfigME can target numerous different platforms without having them all located
in the same machine as the Colonel.
The user server handles all the communications between the user client in the user tier and the
Colonel. This includes input and output of application data, incoming applications, and
platform configuration information such as clock settings. The user server accepts
connections via standard TCP/IP sockets from numerous remotely located clients located in
the user tier. Once a connection has been established, it is responsible for passing the data to
the Colonel and then responding to the client with the associated response. The advantage of
having the communication component separate from the Colonel is that if the network
protocol or client/server API is altered, only those components need to be modified, not the
complex Colonel itself.
Figure 45: Operating system tier
User tier
The user tier contains five components and is primarily responsible for providing a user
interface and connection to the operating system tier. The top level component is the user
interface (see Figure 46) and consists of a combination of a simple command line interface for
user input and a graphical user interface for displaying the geometric layout of the currently
executing applications on the reconfigurable computing platform. Via the command line
interface, users are able to load applications, stream I/O data to the platform’s on-board
memory and configure particular platform settings such as clock values. The graphical user
interface displays the results of the allocation and partitioning of applications as they are
loaded into ReConfigME (see Figure 51).
The user client is the second level component in the user tier and provides an interface to the
Colonel via the user server. It communicates via standard TCP/IP sockets to the user server
located in the operating system tier and simply converts user requests from the command line
interface into the API defined for use between the user client and server components. The
advantage in using the user client and server is other user interfaces can easily be added with
little or no change to the Colonel.
Figure 46: User tier architecture
6.1.5 Sample application execution
Two types of files need to be created for an application to be loaded onto
the reconfigurable computer via ReConfigME: the application itself, with an EDIF file for
each data flow graph node, and a Java class file that defines how these EDIF files are
connected together in the data flow graph model. The first stage in developing an application for
use with ReConfigME is the generation of the series of EDIF files that describes the
behaviour of the application. This procedure initially involves the designer determining how
the application will be structured. Shown in Figure 47 is the complete sample application
structured as a data flow graph model with the ADD 1, XOR, and AND representing one node
each.
Figure 47: Complete sample application in data flow graph format
Each of these nodes will then result in a single EDIF file. Almost any design entry method
can be used to create these nodes but in this example the hardware description language
developed by Celoxica known as Handel-C [30] was used. Shown in Figure 48 is a code
listing of the first node in a sample application.
Figure 48: Handel-C code listing for the add one data flow graph node
It simply reads a 32 bit number from the first location in memory, adds one to the number,
and then writes the result back into the second location in memory. As can be seen from the
code, the data is loaded into the memory from the host via the readMem() and writeMem()
methods. These methods insert the data source and sink nodes into the application so it can be
connected to the memory controller. The Handel-C source code for the other nodes in the data
flow graph looks very similar, except that instead of adding one to the number, the second node
performs a logical XOR against a set mask and the third node performs a logical AND against
another set mask. All the Handel-C source files are then compiled and an EDIF file is
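Since the Handel-C listings are reproduced only as figures, the data path of the three nodes can be modelled in Java as below; the mask values are illustrative assumptions, as the thesis does not give them.

```java
// Java model of the three nodes' data path (the real nodes are Handel-C
// circuits that read and write on-board memory). Mask values are assumed.
public class PipelineModel {
    static final int XOR_MASK = 0x0000FFFF; // hypothetical mask
    static final int AND_MASK = 0x00FF00FF; // hypothetical mask

    static int addOne(int x)  { return x + 1; }        // node 1: add one
    static int xorNode(int x) { return x ^ XOR_MASK; } // node 2: XOR mask
    static int andNode(int x) { return x & AND_MASK; } // node 3: AND mask

    // The graph edges chain the nodes, with memory between each stage.
    public static int run(int input) {
        return andNode(xorNode(addOne(input)));
    }
}
```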
generated for each node. As is shown in the code in Figure 49, three new cores and their
dimensions which represent each node in the graph are added into the instance tg.
Figure 49: Java class file defining data flow graph structure
The code initially involves creating an instance of the class TaskGraph which represents a
data flow graph, and initialising the parameters defining its geometric dimensions, name and
whether it should be partitioned. An application can be prevented from being partitioned by
ReConfigME if the designer believes it has strict performance constraints. Each of the EDIF
filenames and the area they will consume are then added into the structure of the data flow
graph as nodes by simply adding a new Vertex into an array within the instance. In this
sample execution, the data flow graph consists of three nodes or EDIF files: add_one_core,
XOR_core, and AND_core. The edges which represent the communication links between the
nodes in the data flow graph are created in the instance by calling the method addEdge and
passing the core numbers of the communicating nodes. In the sample application, the
add_one_core connects to the XOR_Core, which connects to the AND_core. Shown in Figure
50 is a static class diagram of the complete data flow graph application.
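As Figure 49 appears only as an image, a hedged reconstruction of the class file is sketched below using the names the text mentions (TaskGraph, Vertex, addEdge); the exact signatures, constructor arguments, and node dimensions are assumptions, not the prototype's real API.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged reconstruction of the data flow graph definition described in the
// text. Signatures and the 8x8 node dimensions are illustrative assumptions.
public class GraphSketch {
    public static class Vertex {
        public final String edifFile;
        public final int width, height; // estimated dimensions in CLBs
        public Vertex(String f, int w, int h) { edifFile = f; width = w; height = h; }
    }

    public static class TaskGraph {
        public final String name;
        public final boolean partitionable; // designer may forbid partitioning
        public final List<Vertex> vertices = new ArrayList<>();
        public final List<int[]> edges = new ArrayList<>();
        public TaskGraph(String name, boolean partitionable) {
            this.name = name; this.partitionable = partitionable;
        }
        public int addVertex(Vertex v) { vertices.add(v); return vertices.size() - 1; }
        public void addEdge(int from, int to) { edges.add(new int[]{from, to}); }
    }

    public static TaskGraph buildSample() {
        TaskGraph tg = new TaskGraph("sample", true);
        int add1 = tg.addVertex(new Vertex("add_one_core.edf", 8, 8));
        int xor  = tg.addVertex(new Vertex("XOR_core.edf", 8, 8));
        int and  = tg.addVertex(new Vertex("AND_core.edf", 8, 8));
        tg.addEdge(add1, xor); // add_one_core -> XOR_core
        tg.addEdge(xor, and);  // XOR_core -> AND_core
        return tg;
    }
}
```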
Figure 50: ReConfigME data flow graph class structure
The next part in the Java class file is to define the connection to the ReConfigME server and
pass the TaskGraph object containing all the data flow graph details. This is simply performed
by creating an instance of the class RC1000 with the parameters of the TaskGraph, IP address
and port number of the server. This results in the instance of the data flow graph and all
associated EDIF files being loaded into ReConfigME so the generation of the bitstream can
begin. Once the bitstream has been generated and dynamically configured onto the FPGA, the
necessary reads and writes to and from the memory are performed.
Shown in Figure 51 and Figure 52 are screen captures of ReConfigME processing the sample
application. Figure 51 (a) shows the primitive architecture consisting of the memory
controller (shown in grey) and two network terminators (shown in black and white)
configured onto the FPGA when the prototype is started. Once the user client connects to
ReConfigME, it begins processing the application and a log of this is shown in Figure 51 (b).
This involves allocating the application onto the FPGA, connecting the data source and sink
nodes onto the network terminator, and generating and configuring the bitstream onto the
FPGA. Once the application is configured onto the FPGA, the user interface shown in Figure
52 (a) is updated to reflect the new allocation layout. The final stage in the sample application
execution is to read and write the I/O data with the output data being stored in a local file. The
status of these actions is reported via the client interface and is shown in Figure 52 (b). Once
the client disconnects from the ReConfigME server, the application is removed from the
FPGA and the new bitstream is generated and configured onto the FPGA. This completes the
sample application execution listing.
(a) User interface (b) log file
Figure 51: Status displayed before the allocation of the application
(a) User interface (b) log file
Figure 52: Status displayed after the allocation of the application
6.1.6 Applications for ReConfigME
There were four applications implemented to verify the correct functionality of ReConfigME.
The first one was described in the previous section and was a simple mathematically based
application used to demonstrate the process of designing and executing applications under
ReConfigME. The next two applications were implemented to show that ReConfigME can be
used with real applications that require large amounts of I/O to be transferred between the
hardware circuit and software part of the application. These two applications are described
here. The last application implemented for use with ReConfigME is based on encryption and
was used to measure the performance of the prototype including the allocation and
partitioning algorithms as reported in the next chapter.
In this section, the applications of blob tracking and edge enhancement implemented for use
with ReConfigME will be detailed. This will include a description of the application, the
application’s specifications such as area consumption, and the output generated by
ReConfigME to show its allocation details.
Blob tracking
Blob tracking, a term used in the vision tracking research community, is the process of
finding the location of a known object in a series of images. In the application described here,
the object of interest is an orange coloured ball and a series of images were taken as the ball
was randomly moved. The first step in the blob tracking algorithm implemented for ReConfigME
was to separate the orange coloured ball from the rest of the image. This is achieved by
performing a threshold operation on the image, based on a colour value that matched the
orange ball. Each pixel in the image was examined to determine if it matched the colour of
interest. If the pixel matched the colour, in the output image it was set to white whereas if it
did not match, it was set to black. This procedure was repeated for every pixel in a frame.
Once the known colour had been separated from the image, the centre of these pixels had to
be calculated. This was achieved by simply calculating the mean location of all the pixels that
matched the threshold colour of interest. This point was then indicated by the use of red
crosshairs. Shown in Figure 53 is a screen capture of the blob tracking application.
Figure 53: Screen capture of the blob tracking application executing on ReConfigME
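The threshold-and-centroid data path described above can be modelled in Java (the real circuit is written in Handel-C); the colour encoding and the exact-match threshold are illustrative assumptions.

```java
// Java model of the blob tracking data path: threshold each pixel against a
// target colour value, then return the mean (x, y) location of the matching
// pixels. An exact-match threshold is used here for simplicity.
public class BlobTrack {
    // image[y][x] holds a packed colour value; target is the "orange" value.
    public static int[] centroid(int[][] image, int target) {
        long sumX = 0, sumY = 0, count = 0;
        for (int y = 0; y < image.length; y++)
            for (int x = 0; x < image[y].length; x++)
                if (image[y][x] == target) { sumX += x; sumY += y; count++; }
        if (count == 0) return null; // no matching pixels: no blob found
        return new int[] { (int) (sumX / count), (int) (sumY / count) };
    }
}
```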
The application consists of two parts: the hardware circuit containing the blob tracking
algorithm which performs the threshold and calculation of the centre location written in
Handel-C, and the software application responsible for transferring the I/O to ReConfigME,
capturing the video in real time via a camera, and displaying the threshold image and location
of the crosshairs. The hardware circuit of the blob tracking application consumes
approximately 400 CLBs or 7% of the target FPGA, has pre-defined dimensions calculated to
be 20 CLBs by 20 CLBs and is 70 lines of non-commented Handel-C. Shown in Figure 54 is
a screen capture of the location of where the application was allocated on the FPGA when
loaded by ReConfigME. The light blue rectangle is the memory controller, the pink rectangle
is the network terminator, and the red rectangle is the blob tracking hardware circuit. The
remaining blue area is the available FPGA area for allocation.
Figure 54: Allocation status of the FPGA when the blob tracking is loaded onto the
FPGA by ReConfigME
Edge enhancement
Edge enhancement is another well-known image processing algorithm and involves
identifying the edges of objects in an image. This algorithm is often the first stage in template
matching or target recognition. The algorithm firstly involves performing a threshold of the
intensity change across a window of pixels. If the intensity change exceeds the selected
threshold, the pixel is marked as an edge. The window is moved across the entire image in
both a horizontal and a vertical direction. The output from the edge enhancement application
is shown in Figure 55.
Figure 55: Screen capture of the edge enhancement application
executing on ReConfigME
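The windowed intensity-threshold step can likewise be modelled in Java; the neighbour-difference window and the threshold value are illustrative assumptions.

```java
// Java model of the edge enhancement step: a pixel is marked as an edge when
// the intensity change to its horizontal or vertical neighbour exceeds the
// threshold, approximating the sliding window described in the text.
public class EdgeEnhance {
    public static boolean[][] edges(int[][] gray, int threshold) {
        int h = gray.length, w = gray[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int dx = (x + 1 < w) ? Math.abs(gray[y][x + 1] - gray[y][x]) : 0;
                int dy = (y + 1 < h) ? Math.abs(gray[y + 1][x] - gray[y][x]) : 0;
                out[y][x] = dx > threshold || dy > threshold;
            }
        return out;
    }
}
```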
As was the case in the blob tracking application, the edge enhancement application consists of
two parts: the hardware circuit that executes the edge detection algorithm, and the software
part that transfers the I/O to ReConfigME and displays the resultant edge detection. The
hardware circuit consumes 480 CLBs, has pre-calculated dimensions of 40 CLBs wide by 12
CLBs high and is 111 lines of non-commented Handel-C code. Shown in Figure 56 is a
screen capture of the location of where the edge enhancement application was allocated on the
FPGA when loaded by ReConfigME as the sole application. The red rectangle is the memory
controller, the pink rectangle is the network terminator, and the grey rectangle is the edge
enhancement hardware circuit. The remaining blue area is the available FPGA area for allocation
to other applications. The colours differ from those in the blob tracking example because
ReConfigME assigns random colours to incoming applications.
Figure 56: Allocation status of the FPGA when the edge enhancement
is loaded onto the FPGA by ReConfigME
Multiple concurrent applications with ReConfigME
Shown in Figure 57 is the allocation status when both the blob tracking and edge
enhancement applications were allocated onto the FPGA at the same time. The edge
enhancement application was loaded first (shown in grey), and next the blob tracking
application was loaded (shown in white). The memory controller is shown in pink.
Figure 57: Allocation status of the FPGA when the edge enhancement and the blob
tracking are loaded onto the FPGA by ReConfigME
With both applications allocated onto the FPGA and the clock set to 25MHz, both
applications executed correctly and there was no noticeable difference in the frame rate of
either application as compared to running them separately. The output from both applications
was identical when compared to the output generated when each application had exclusive use
of the FPGA. The edge enhancement application was removed by the operating system, the
network terminator was re-allocated, and the blob tracking application continued to execute
correctly. Finally, the blob tracking application was removed from ReConfigME and the
FPGA was re-configured with no applications. Shown in Figure 58 is the output from the
Xilinx Floorplanner design tool, which shows both applications configured onto the FPGA. The
blob tracking application is shown in yellow, the edge enhancement in green and the memory
controller in light grey. This figure reflects the allocation constraints placed onto the
applications by ReConfigME.
Figure 58: Screen capture from Xilinx Floorplanner verifying
the locations of the applications on the FPGA
In this section it has been shown that ReConfigME can correctly manage real reconfigurable
computing applications. Both of these applications were designed for use with the operating
system; equally, existing applications could either be loaded with very little modification,
or re-designed according to the operating system application architecture so as to take
advantage of application partitioning.
6.1.7 Implementation issues
To minimise the implementation complexity of ReConfigME, and because of several
technology limitations, selected characteristics that were defined in the architecture of an
operating system in chapter 5 were not carried through into the construction of the prototype. The
most noticeable is the omission of a shared bus network to support inter-process
communication. In section 4.1.3 it was determined that inter-process communication could be
optimised through the use of such a network. However the runtime routing of the bus proved
impractical with the available tools. Although direct bitstream manipulation is possible
through the use of the JBits API, it was found through experimentation that the JBits runtime
router was unable to achieve a successful route for cases of the complexity requested by the
operating system. Dynamic reconfiguration, where one application is running whilst another
is being loaded, was also not implemented because the FPGA architecture provides only
column based partial reconfiguration. Together with limits on the location and configuration
of tri-state buffers, the FPGA was found to be too inflexible to support dynamic
reconfiguration under the operating system. Even if this limitation on dynamic
reconfiguration did not exist, the tool flow’s inability to support relocatable pre-placed and
pre-routed nodes and runtime routability is currently the main limitation to its practical use
due to the extra time required to re-place and re-route applications each time a context switch
occurs.
6.2 Metrics
With any set of new abstractions, metrics are required to define particular characteristics of a
system’s performance. Two metrics that are commonly used to measure and compare the
performance of traditional software operating systems are response time and throughput, as
shown in Table 13. As many of the goals of a reconfigurable computing operating system are
similar to that of the traditional operating system, it was felt that these metrics should also be
used to measure the performance of the prototype ReConfigME. In this section response time
and throughput will be outlined in more detail with the aim of using them in the next chapter
for a performance evaluation.
Metric Definition
Response time The amount of time the operating system takes
to respond to a user request
Throughput The amount of processing on user level tasks per unit of time
Table 13: A summary of the metrics designed for
reconfigurable computing operating systems
6.2.1 Response time
Response time is a well defined metric commonly described as the amount of time an
operating system takes to respond to user requests. It is heavily influenced by scheduling
policies, context switch time slices, and I/O latency. It is considered a very important measure
in an operating system that runs a significant number of applications with real-time
user interaction. In a reconfigurable computer, the response time, also known in this case as
latency, is the amount of time between the user loading the application and the first results
arriving back. In the prototype reconfigurable computing operating system, this response
time primarily consists of the execution runtime of the allocation and partitioning
algorithms, the commercial place and route tools, and the FPGA reconfiguration time, as
these are all completed after the application execution request has been made.
6.2.2 Throughput
Throughput is a commonly used metric in software operating systems to measure the amount
of output an application can generate over a specific time. In a reconfigurable computer, the
definition of throughput is no different. To measure the throughput of a hardware
application, however, a characteristic commonly known as circuit delay is measured. Circuit
delay is, by definition, the amount of time between successive outputs of the circuit; its
inverse is usually the circuit's clock speed. When comparing two identical
applications, the larger the circuit delay, the lower the throughput.
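The reciprocal relationship between circuit delay and throughput can be illustrated with a small sketch. The 25MHz figure matches the clock rate used earlier in the chapter; the assumption of one output per clock cycle is made here purely for illustration:

```python
def throughput_per_second(circuit_delay_s: float) -> float:
    """Outputs per second, given the delay between successive outputs."""
    return 1.0 / circuit_delay_s

# A circuit clocked at 25 MHz producing one output per cycle has a
# circuit delay of 40 ns, giving 25 million outputs per second.
delay = 1.0 / 25e6
rate = throughput_per_second(delay)
```

Doubling the circuit delay halves the throughput, which is why the partitioning-induced increase in signal delay measured later translates directly into a throughput loss.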
It is expected that the throughput of the individual reconfigurable computing applications will
be reduced when used in conjunction with an operating system. As previously stated, the
traditional design flow tools and algorithms have been implemented to maximise throughput.
However, in the operating system the allocation and partitioning algorithms have traded this
throughput for a decrease in execution time. Therefore, in the next chapter an experiment to
measure the loss in throughput of applications used in conjunction with ReConfigME will be
performed.
6.3 Conclusion
This chapter resulted in two deliverables. The first is a prototype operating system known as
ReConfigME, based on the architecture described in chapter 4. This included details
on the selected platform and its implementation. It was discussed how the
applications are created so they are structured according to the data flow graph model, the
primitive architecture that is used to support inter-process communication, the networked tier
architecture used to implement the prototype itself, a sample application execution listing, and
implementation issues that arose during its construction. The second deliverable was a set of
metrics including response time and application throughput that will be used in the next
chapter to measure the performance of the operating system and its applications.
Chapter 7 – Performance evaluation
7 Performance evaluation
In the previous chapter, a prototype operating system for reconfigurable computing known as
ReConfigME was presented and an example of concurrently executing applications was
given. Along with this prototype, a set of commonly used metrics from the software operating
system domain that measure an operating system’s impact on user response and application
performance were identified. In this chapter, these metrics will be transferred into the
reconfigurable computing operating system domain and will be used in a series of
experiments to measure the performance of applications under ReConfigME control.
Correlations between throughput, response time, and fragmentation will then be presented,
and a formula for predicting the likelihood of the operating system allocating a particular
application onto an FPGA will be developed. A summary of the previous work,
methodologies and deliverables associated with this chapter are shown in Figure 59.
Figure 59: Previous work, methodology, and deliverables
associated with this chapter
The chapter is divided into three sections with each reflecting a particular deliverable. In the
first section, the test environment, benchmark application, and test cases that will be used in
the experiments to measure the response time and throughput of ReConfigME will be
outlined. In the second section there is a detailed description of the three experiments
conducted on the operating system, the results obtained, and the conclusions that can be
drawn from the results. In the first experiment, the average time the user has to wait for their
application to have its resources allocated on the FPGA will be measured. This will be
performed by loading a series of different sized applications onto the FPGA under varying
conditions. In the second experiment, the throughput of a reconfigurable computing
application executing in conjunction with ReConfigME will be compared to the throughput
when executed in the reconfigurable computing environment without an operating system. In
the third section, all the results from these experiments will be compared to determine
whether any correlation between the user response time, throughput, and area usage can be
established through the introduction of a new fragmentation metric.
7.1 Experimental environment
In the previous chapter, the operating system architecture defined in section 4.2.2 was
implemented into a prototype known as ReConfigME and the execution of some
non-partitionable applications was demonstrated. To measure ReConfigME's performance, such
as user response time or application performance, a suitable partitionable benchmark
application and a test environment generating simulated workloads had to be developed. In this
section, the benchmark application and the reasons why it was selected to be used in the
experiments, as well as the test cases used to generate the performance results will be
outlined.
7.1.1 Benchmark application
Although benchmarks for general purpose computers have been deeply investigated, there
still appear to be very few that are specifically designed for reconfigurable computing
applications running under an operating system. The Adaptive Computer System (ACS)
benchmark suite [87] and the Reconfigurable Architecture Workstation (RAW) [11]
benchmark suite provide a set of benchmarks in the form of commonly implemented
reconfigurable computing applications. These benchmark applications have been commonly
used to evaluate the performance of placement and routing algorithms by measuring the
characteristics of versatility, capacity, timing sensitivity, and scalability. However many of
these applications cannot be applied immediately to the operating system because they need to
be extensively re-designed to suit the new application architecture.
An application that can be readily re-designed according to the operating system application
architecture is the Data Encryption Algorithm (DEA). DEA or the ANSI equivalent Data
Encryption Standard (DES) [56] is a widely used method of encryption that enciphers and
deciphers blocks of data consisting of 64 bits under control of a 56-bit key. The algorithm
consists of three different procedures which encrypt the text: an initial permutation (IP), a
complex key-dependent computation and another permutation which is the inverse of the
initial one, as shown in Figure 60. The key-dependent computation is repeated 16 times to
produce the encrypted or decrypted text. To reduce the chance of the text being decrypted by
unauthorised parties, a stronger version of DES called Triple DES is used. This simply
involves three copies of the DES application used in sequence.
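The structure just described can be sketched as function composition. In the sketch below the permutations and the round function are placeholders (identity and XOR) standing in for the real DES tables, so it shows only the shape of the computation, not the cipher itself:

```python
# Structural sketch of DES and Triple DES as described in the text.
# The permutations and round function are illustrative placeholders,
# NOT the real DES tables.

def ip(block):            # initial permutation (placeholder: identity)
    return block

def ip_inverse(block):    # inverse of the initial permutation (placeholder)
    return block

def round_fn(block, subkey):
    # placeholder for the complex key-dependent computation
    return block ^ subkey

def des(block, subkeys):
    """IP, then the key-dependent rounds (16 in real DES), then IP^-1."""
    state = ip(block)
    for k in subkeys:
        state = round_fn(state, k)
    return ip_inverse(state)

def triple_des(block, k1, k2, k3):
    """Triple DES: three copies of the DES application used in sequence."""
    return des(des(des(block, k1), k2), k3)
```

Structuring each DES pass as three data flow graph nodes, as the modified Free-DES implementation does, then yields the nine independently partitionable parts mentioned below.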
Figure 60: DES block architecture
There are several hardware implementations of the DES algorithm previously published [1,
118]. The Free-DES [59] implementation was chosen as the basis for the benchmark
application because it is able to fit onto the target FPGA (approximately 16% of the target
area on a Xilinx Virtex 1000) and has a high performance in terms of clock speed and
throughput. The Free-DES implementation was modified to suit the application architecture
by structuring it as the data flow graph model with three nodes. Three copies of the modified
Free-DES application were then combined in sequence to form a triple DES application. A
data source node and a data sink node were then inserted on the first and last nodes
respectively. The resultant triple DES application consists of nine individual parts that can be independently
partitioned.
7.1.2 Experimental configuration
To ensure ReConfigME's performance is measured in situations that approximate a realistic
operating environment, a set of four test cases was developed for use in each of the
experiments. These test cases were generated as follows. A workload generator produces
simulated applications and the actual DES application which is then placed in a ready queue.
The operating system ready queue is a first in first out (FIFO) where applications are stored
prior to being allocated to the FPGA. The simulated tasks are assigned a size and execution
time, each of which is selected from different and independent Gaussian distributions. When a
simulated application has been allocated onto the FPGA, it is also connected to the memory
controller via an I/O bus to generate realistic routing resource usage. The workload
generator also samples an inter-arrival time for each simulated application from a random
distribution which determines when each application is entered onto the ready queue.
Depending on the arrival and execution time chosen, the ready queue can be empty which
implies that the FPGA is not fully loaded with applications. If the ready queue contains one or
more applications, the FPGA is considered, for the purposes of this chapter, to be fully loaded.
In the experiments, the mean and variance of the inter-arrival time was selected so that at least
one application was on average in the queue so that all experiments represented a fully loaded
FPGA in the context of using the operating system. Note that a fully loaded FPGA does not
mean that all the area was used by executing applications because fragmentation prevents this
from occurring. A fully loaded FPGA was used in the experiment as it is expected to result in
the worst case degradation in performance.
The experiments were divided into four cases and two sets. Cases two, three and four
involve allocating the DES application after all the simulated applications. Case one, shown
in Figure 61 is special in that the DES application is allocated first and no other simulated
applications are allocated at all. Case one was generated to provide the expected best case
when using the operating system. These values will then be used as the baseline to compare
the subsequent cases. For cases two, three, and four, the state of the tasks laid out on the FPGA
just before the DES core is placed in the ready queue is denoted the initial floor-plan for the
case.
In case two the operating system is run with applications being allocated and then, as they
complete, being removed and replaced with new tasks. It is possible to run the operating
system indefinitely but we expect that after a certain number of application completions some
steady state will be reached. To examine if this happened, case two has a number of runs of
different lengths n. A run consists of the operating system allocating n tasks to the FPGA and
then placing the DES application in the queue. In case two n was chosen to be 100, 200, 300,
400, and 500. The initial floor-plans for this test case are shown in Figure 62 and Figure 63.
In case three a run for n = 100 was repeated 5 times. This allowed an examination of
variability in the measured mean performance values due to the random nature of the
selection of simulated tasks (the random nature of the initial floor plan). It was then assumed
that all data collected from the experiment has this amount of random variability. The floor-
plans for this test case are shown in Figure 64.
In case four a run of 100 applications was generated. Simulated applications were then added
continually after that until an initial floor plan was found that required the partitioning of the
DES application to exactly m partitions where m ranged from one to nine. The DES
application was then partitioned and allocated onto the FPGA using the OS and performance
measured. The floor-plans for this test case are shown in Figure 65 and Figure 66.
In test cases two and four all the experiments were duplicated for the two sets. Set one,
denoted "typical", corresponded to a mean application size of 4% of the FPGA area for the
simulated tasks. Set two, denoted "large", corresponded to a mean size of 8%. In both sets
the standard deviation of the size was set to half the mean. The mean and standard deviation of the area and the
mean inter-arrival time and execution time are all shown in Table 14. The DES application
used approximately 16% of the FPGA.
                       Mean area     Std. dev. of area  Mean inter-arrival  Mean execution
                       (% of FPGA)   (% of FPGA)        time (time units)   time (time units)
Set 1 (typical size)   4%            2%                 20                  200
Set 2 (large size)     8%            4%                 30                  200
Table 14: Parameters of the applications used in the
response time and throughput experiments
In all of the initial floor-plans, the thin red rectangles located around the edges of the FPGA
are the memory controllers, the blue rectangles are the simulated applications and the green
rectangles are the triple DES benchmark application. If there is more than one green
rectangle, it means ReConfigME was unable to allocate the DES application without
partitioning it. If there are nine green rectangles it means every node in the data flow graph
was partitioned into a single process. In some cases, although there should be n number of
applications, it may appear as though there as less because some partitions were allocated next
to others, thus making it appear as though the application was not partitioned into the correct
number of parts. The red lines connecting the processes are the routes that would be used to
connect the applications to the memory controllers.
Figure 61: Test case 1 floor-plan
(a) n = 100 (b) n = 200
(c) n = 300 (d) n = 400
(e) n = 500
Figure 62: Test case 2, set 1 (typical sized) floor-plans
(a) n = 100 (b) n = 200
(c) n = 300 (d) n = 400
(e) n = 500
Figure 63: Test case 2, set 2 (large sized) floor-plans
(a) run 1 (b) run 2
(c) run 3 (d) run 4
(e) run 5
Figure 64: Test case 3 floor-plans
(a) m = 1 (b) m = 2
(c) m = 3 (d) m = 4
(e) m = 5 (f) m = 6
(g) m = 7 (h) m = 8
(i) m = 9
Figure 65: Test case 4, set 1 (typical sized) floor-plans
(a) m = 1 (b) m = 2
(c) m = 3 (d) m = 4
(e) m = 5 (f) m = 6
(g) m = 7 (h) m = 8
(i) m = 9
Figure 66: Test case 4, set 2 (large sized) floor-plans
7.2 Performance results
The use of software operating systems on von Neumann based architectures saw the
introduction of extra unwanted overheads such as context switching. However, the average
user was prepared to accept these overheads as long as they were outweighed by the
advantages of increased accessibility and ease of use of the hardware platform that an
operating system can provide. A similar situation seems likely to occur with the introduction
of an operating system for reconfigurable hardware.
In the previous chapter, it was identified that the execution time of the operating system
causes applications to incur extra latency at start up, and that the application architecture
and partitioning can cause a loss in application performance. These are the major
contributors to overhead in a reconfigurable computing operating system. The
metrics of user response time and application throughput were selected to measure these
overheads. To judge if the increased accessibility and ease of use provided by the operating
system outweigh the overheads, a series of experiments to measure the user response time and
application throughput have been carried out.
7.2.1 User response time
The user response time in an operating system performing real-time user interaction should be
kept to a minimum. Users are only prepared to wait a certain length of time for a response
to their input. In a software operating system, the majority of the user response time is
consumed when a context switch is performed, and when other applications have use of the
microprocessor. However, in the reconfigurable computing environment proposed in this
thesis, a context switch is performed far less often. When an application is loaded into the
reconfigurable computing operating system, other applications can continue to execute while
the new application is allocated and partitioned. When the application is about to be loaded
onto the FPGA, assuming it fits, the other applications can continue to execute if the new
application is loaded via dynamic reconfiguration. This is unlike a software operating
system, where the application using the microprocessor would need to be stopped.
Therefore context switches are not as expensive in terms of response time
for reconfigurable computing. The user response time in the prototype operating system is
only the latency experienced when loading the application onto the FPGA. The majority of
this latency is the execution runtime of the allocation and partitioning algorithms.
The user response time in this experiment will be calculated by measuring the execution
runtime of the Colonel component of ReConfigME. This execution time includes the runtime
consumed by the allocation and partitioning algorithms, the interactions between them, and
the generation of the placement constraints file.
The execution runtime of the entire bitstream generation is excluded in the user response time
because it was performed at runtime in ReConfigME only due to tool limitations. With the
future development of new tool flows which support relocatable pre-placed and pre-routed
cores, and a hierarchical routing structure where only the top level needs to be routed at
runtime, there will be no need to place and route the entire bitstream at runtime. All of the
nodes in the data flow graph would be pre-placed and routed at compile time. ReConfigME
would then determine the location of these nodes at runtime and simply relocate them to the
allocated positions. After this, a runtime router would connect all of the communicating nodes
together via a special routing layer on the FPGA reserved for inter-process communication.
This would only require a few new routes as compared to routing the entire application and
thus could be completed in a short amount of execution time. As the current place and route
takes approximately five minutes for the triple DES application, there appears to be no
reason why a new tool flow with these features could not reduce this to times comparable
with the rest of the operating system algorithms.
The actual reconfiguration time is also excluded from the user response time because it is
likely that future development of new FPGA architectures will better support dynamic
reconfiguration because of its increased interest within the reconfigurable computing
community. The majority of current FPGAs supporting dynamic reconfiguration only
allow column based partial reconfiguration, which is too restrictive in the current environment,
so the entire device is reconfigured every time a change is made. As complete reconfiguration
of a modern FPGA such as a Virtex II Pro currently takes less than a second,
architectures supporting true dynamic reconfiguration, whilst making a contribution to
operating system performance, will be secondary to the impact of improved tool flows that
allow relocatable pre-placed and pre-routed cores.
The aim of the user response time experiment is thus to measure only the operating system
allocation and partitioning latency when the benchmark application is partitioned into a range
of processes (test case 4), and when it is allocated onto the FPGA when a number of
applications are already allocated onto the FPGA (test case 2). The variance in user response
time will also be measured through the use of test case 3 as an estimate of experimental
variability.
Results
In test cases three and four, the number of applications already allocated on the FPGA, the
user response time, and the remaining FPGA area were measured for each initial floor-plan. In
addition, in test case two, the number of partitions the application was divided into was also
recorded. The results from these experiments are shown in Table 15, Table 16, Table 17 and
Table 18. In the tables, n is the number of previously allocated applications and m is the
number of partitions the application is divided into.
Test case 1
        Applications already resident   User response
        on the FPGA                     time (ms)
m = 1          0                        58
Table 15: User response time for test case 1
Test case 2, set 1 (typical)
        Applications already   No. of       User response   Remaining
        resident on the FPGA   partitions   time (ms)       area (CLBs)
n = 100          8             3            332             2851
n = 200          8             5            493             2739
n = 300          9             4            400             2953
n = 400          6             5            444             2861
n = 500         11             7            675             2842

Test case 2, set 2 (large)
        Applications already   No. of       User response   Remaining
        resident on the FPGA   partitions   time (ms)       area (CLBs)
n = 100          4             4            267             2551
n = 200          7             7            490             2802
n = 300          6             5            358             2775
n = 400          3             3            211             2421
n = 500          7             5            362             2587

Table 16: User response time for test case 2
Chapter 7 – Performance evaluation
Test case 3 (n = 100, m = 3)
        Applications resident   User response   Remaining
        on the FPGA             time (ms)       area (CLBs)
Run 1          9                339             3118
Run 2          9                337             2912
Run 3          8                330             2835
Run 4         10                345             2651
Run 5          6                299             2897
Table 17: User response time for test case 3
Test case 4, set 1 (typical)
        Applications resident   User response   Remaining
        on the FPGA             time (ms)       area (CLBs)
m = 1         10                 98             3118
m = 2          6                238             2268
m = 3          7                306             3054
m = 4         10                391             2891
m = 5          8                478             3231
m = 6         10                596             2778
m = 7          7                615             2665
m = 8         10                818             2779
m = 9          9                899             2685

Test case 4, set 2 (large)
        Applications resident   User response   Remaining
        on the FPGA             time (ms)       area (CLBs)
m = 1          3                 60             2059
m = 2          2                175             2097
m = 3          4                217             2268
m = 4          4                259             2984
m = 5          5                343             2551
m = 6          3                375             2954
m = 7          4                412             2891
m = 8          3                503             2054
m = 9          2                560             2211

Table 18: User response time for test case 4

Shown in Table 15 is the user response time when the DES application was allocated without
partitioning onto an empty FPGA. As expected this generated the lowest user response time
of all the experiments, at 58ms. This is because it did not need to be partitioned, there were no
other applications on the FPGA, and only one partition needed to be allocated. In Table 16,
given the number of applications already allocated onto the FPGA, the user response time
varied from 332ms for three partitions to 675ms for seven partitions in set 1, and from
211ms for three partitions to 490ms for seven partitions in set 2. It also shows that the
amount of remaining area in CLBs was fairly consistent for all values of n. This indicates
the system had reached a steady state after 100 applications had been allocated.
As it was determined from the previous test case that the system had reached a steady state
after 100 applications had been completed for both sets, the variation in the user response
time was calculated from the five runs of n = 100 and m = 3 in Table 17, and was found to be
approximately 16ms. Only five runs were used because the variation in user response time
appeared almost constant between them. This variation will be used on all graphs and
response time calculations to indicate the range of error in any one measurement.
The results of the final response time experiment, in which m (the number of partitions)
ranged from 1 to 9, are shown in Table 18. In set 2 (large) of test cases 2 and 4, a reduction
of approximately 40% in the user response time was recorded when compared to set 1. For
example, in test case 4, when the application was partitioned into two partitions, the user
response time was 238ms in set 1 as compared to 175ms in set 2.
Relationships
Shown in Figure 67 is a graph of the response time versus the number of partitions the
application is divided into, for both sets in test case 4. There appears to be a roughly linear
relationship between the response time and the number of partitions: each additional
partition increases the response time by about the same amount. This is an expected result
because the allocation and partitioning algorithms that allocate and divide the application
have a linear runtime complexity, as defined in chapter 5.
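As a rough cross-check of this linearity, an ordinary least-squares line can be fitted to the set 1 response times from Table 18. The fit itself is not part of the thesis; it is a pure-Python sketch over the published figures:

```python
# Response times (ms) for m = 1..9 partitions: test case 4, set 1 (Table 18).
m = list(range(1, 10))
t = [98, 238, 306, 391, 478, 596, 615, 818, 899]

# Ordinary least-squares fit t ~ a*m + b.
n = len(m)
mean_m = sum(m) / n
mean_t = sum(t) / n
a = sum((mi - mean_m) * (ti - mean_t) for mi, ti in zip(m, t)) / \
    sum((mi - mean_m) ** 2 for mi in m)
b = mean_t - a * mean_m

print(f"slope ~ {a:.1f} ms per partition, intercept ~ {b:.1f} ms")
```

The slope comes out at roughly 96 ms per additional partition, consistent with the linear runtime complexity claimed for the allocation and partitioning algorithms.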
Shown in Figure 68 is a graph of the response time versus the number of applications already
allocated onto the FPGA before the DES application was placed, for both sets in test cases 2
and 4. There appears to be very little correlation between the user response time and the
number of applications already allocated on the FPGA. This is evident in both sets: for
example, the user response time ranged from 60ms to 500ms for three applications allocated
in set 1, and from 100ms to 820ms for 10 applications allocated in set 2.
Figure 67: The response time versus the number of partitions the application is
divided into for sets 1 (typical) and 2 (large) in test case 4
Figure 68: The response time versus the number of applications already allocated
onto the FPGA for sets 1 (typical) and 2 (large) in test cases 2 and 4
7.2.2 Application throughput
Application throughput is a commonly used metric that measures the amount of output an
application can produce per unit of time. The use of an operating system to manage the
hardware resources can result in a drop in application throughput because of the introduced
application architecture and operating system policies. Users may be prepared to trade some
of the application throughput for an increase in platform accessibility and ease of application
design but too large a loss in throughput could make the operating system unattractive.
In ReConfigME, designing an application using a data flow graph structure will not reduce
the concurrency (that is, it does not change the number of pipeline stages); however, it might
be expected to reduce the clock rate due to the introduced design methodology. The
application throughput of a reconfigurable computing application will be measured in this
experiment by calculating the benchmark application's signal delay (clock rate). Signal delay,
by definition, is the amount of time between successive outputs. When comparing two
identical applications, the larger the signal delay, the lower the throughput.
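Since throughput varies inversely with signal delay, the performance loss figures reported in the tables below follow directly from the two delays. A minimal sketch of this calculation (the function name is illustrative; the loss is expressed as the fraction of throughput lost, i.e. 1 − baseline/delay):

```python
def performance_loss(delay_ms: float, baseline_delay_ms: float) -> float:
    """Percentage of throughput lost relative to a baseline signal delay.

    Throughput is inversely proportional to signal delay, so the fraction
    of throughput lost is 1 - baseline/delay, expressed as a percentage.
    """
    return (1.0 - baseline_delay_ms / delay_ms) * 100.0

# Example with the 30.016 ms baseline measured outside the operating
# system: a delay of 36.529 ms corresponds to a loss of about 17.8%.
loss = performance_loss(36.529, 30.016)
```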
The aim of this experiment is to measure the effect on application throughput when the
number of partitions the application is divided into (test case 4) and the number of
applications already allocated on the FPGA (test case 2) are varied. The worst case
application throughput was measured on a floor-plan generated when the benchmark
application was partitioned into the maximum of nine individual partitions with all
communicating partitions separated by the maximum wire length. This floor-plan is shown in
Figure 69. The benchmark application's best case throughput will be measured when not
under the control of the operating system, by placing no allocation constraints onto it and
using the standard design flow and place and route tools alone. These test cases should
generate the two extreme values and will be used as a baseline against which to compare the
application throughput measured in the other tests.
Figure 69: Possible worst case signal delay
Results
In all of the test cases, the signal delay measured in milliseconds, the performance loss
compared to the application throughput when the benchmark was not under operating system
control, measured as a percentage, and the fragmentation calculated from Equation 4,
measured as a percentage, were calculated for each of the final floor-plans. The results from
these experiments are shown in Table 19, Table 20, Table 21, Table 22 and Table 23. In the
tables, n is the number of previously allocated applications and m is the number of partitions
the application is divided into.
Test | Signal delay (ms) | Performance loss (%) | Fragmentation (%) | Remaining area (CLBs)
Expected worst case (Figure 69) | 46.047 | 33.14 | 0.1239 | 4479
Expected best case (not under OS control) | 30.016 | 0 | Not applicable | 4719

Table 19: Application throughput for the worst case and when the application was
not under operating system control
Test | Signal delay (ms) | Performance loss (%) | Fragmentation (%)
Best case under operating system control (Test case 1) | 30.785 | 2.49 | 0.0214

Table 20: Application throughput for test case 1
Test case 2, Set 1 (typical):
n | Signal delay (ms) | Performance loss (%) | Fragmentation (%)
100 | 31.475 | 4.64 | 0.7213
200 | 36.529 | 17.83 | 1.2402
300 | 32.949 | 8.9 | 0.8810
400 | 34.836 | 13.84 | 0.9448
500 | 34.305 | 12.5 | 1.9143

Test case 2, Set 2 (large):
n | Signal delay (ms) | Performance loss (%) | Fragmentation (%)
100 | 36.341 | 17.41 | 0.6778
200 | 34.205 | 12.25 | 0.7631
300 | 27.676 | -8.45 | 0.5509
400 | 40.221 | 25.37 | 0.7851
500 | 35.902 | 16.39 | 1.1310

Table 21: Application throughput for test case 2
Test case 3, n = 100:
Run | Signal delay (ms) | Performance loss (%) | Fragmentation (%)
Run 1 | 32.572 | 7.85 | 1.1286
Run 2 | 33.102 | 9.32 | 1.1637
Run 3 | 30.768 | 2.44 | 0.8449
Run 4 | 32.132 | 6.59 | 1.3728
Run 5 | 31.056 | 3.35 | 0.7119

Table 22: Application throughput for test case 3
Test case 4, Set 1 (typical):
m (number of partitions) | Signal delay (ms) | Performance loss (%) | Fragmentation (%)
1 | 30.514 | 1.63 | 0.8151
2 | 30.675 | 2.15 | 0.7619
3 | 31.578 | 4.95 | 0.8912
4 | 34.254 | 12.38 | 1.0253
5 | 35.102 | 14.49 | 1.0477
6 | 37.501 | 19.96 | 1.6852
7 | 40.258 | 25.44 | 1.7526
8 | 43.024 | 30.32 | 2.0140
9 | 44.023 | 31.82 | 2.1542

Test case 4, Set 2 (large):
m (number of partitions) | Signal delay (ms) | Performance loss (%) | Fragmentation (%)
1 | 30.972 | 3.08 | 0.4753
2 | 30.578 | 1.85 | 0.4284
3 | 31.279 | 4.04 | 0.5207
4 | 33.927 | 11.51 | 0.5732
5 | 35.087 | 14.46 | 0.7248
6 | 38.047 | 21.12 | 0.8012
7 | 38.482 | 21.99 | 0.7218
8 | 39.973 | 24.9 | 0.8212
9 | 42.731 | 29.76 | 0.9096

Table 23: Application throughput for test case 4
Shown in Table 19 is the signal delay for the worst case test; as expected, it was 33%
higher than in any other experiment. The use of longer routes for the inter-process
communication significantly affects the amount of throughput the application could generate
in this floor-plan. Also reported in this table is the signal delay when the application was not
executed under the operating system. This value is used as a baseline against which all of the
other signal delays are compared. The fragmentation was not calculated in this part of the
experiment because the operating system did not perform the allocation; it was performed
automatically by the commercial place and route tools.
The signal delay recorded in the expected best case test and shown in Table 20 was 2.5%
higher than when the application was not under operating system control. This can be
explained because, when the application was not used with the operating system, the place
and route algorithms were able to further optimise the application's performance as there
were no allocation restrictions placed onto it.
Shown in Table 21 is the signal delay recorded when a different number of applications
were already allocated onto the FPGA (test case 2). The signal delay varied from 31.475ms
for three applications to 36.529ms for five applications in set 1, and from 27.676ms for five
applications to 40.221ms for three applications in set 2. When n = 300 in both sets 1 and 2
there was minimal impact on performance due to the partitioning. This can be explained
because in each test case the communicating partitions were allocated next to each other, thus
minimising signal delay. From the results shown in Table 22, and the finding in the user
response time experiment that the system reached a state of stability after 100 applications
had been completed, the variance in signal delay was calculated to be 0.906ms. In test case 4,
shown in Table 23, the signal delay when m (the number of partitions) ranged from 1 to 9
varied from 30.514ms for m = 1 to 44.023ms for m = 9 in set 1, and from 30.972ms for m = 1
to 42.731ms for m = 9 in set 2.
Relationships
Shown in Figure 70 is a graph of the application throughput versus the number of partitions
the application was divided into, for both sets in test case 4. There appears to be a linear
relationship between the signal delay and the number of partitions beyond three partitions;
that is, the rate at which the signal delay increases is proportional to the number of partitions
the application is divided into. The flat part of the graph between one and three partitions
shows that partitioning the application into up to three partitions has little effect on its
application throughput. The size and number of the applications already on the FPGA appear
to have little effect on the application throughput, as shown in Figure 71.
Figure 70: The application throughput versus the number of partitions the application
is divided into for sets 1 (typical) and 2 (large) in test case 4
Figure 71: The application throughput versus the number of applications already
allocated onto the FPGA for sets 1 (typical) and 2 (large) in test cases 2 and 4
7.2.3 Conclusion
In this section, experiments were conducted to measure the impact on the user response time
and application performance due to the introduction of the prototype operating system
ReConfigME. The user response time ranged from 58ms recorded in the expected best case
(test case 1) to 899ms when the benchmark application was partitioned into a maximum of
nine partitions in set 1 of test case 4. The signal delay ranged from 30.016ms in the expected
best case not under the operating system control to 46.047ms in the expected worst case test.
From these experiments it was found that both the user response time and signal delay only
significantly increased when the application was partitioned. The Partitioner itself did not
consume all of the runtime; the Allocator used most of it, as it had to be called for every
partition. The signal delay increased with the number of partitions because of the extra wire
length introduced by the inter-process communication. It was also found that the number of
applications already allocated onto the FPGA did not significantly affect either the user
response time or the signal delay. Overall, it was concluded that the extra user response time
and lower application throughput introduced by the prototype operating system were not
excessive and would be outweighed by the advantages that an operating system provides.
7.3 Predictor metrics
In the previous experiments, the user response time of ReConfigME and the application
throughput were measured. A common factor that appears likely to cause a loss in both
response time and throughput is the fragmentation of the FPGA (the formula proposed for
calculating the fragmentation is repeated from chapter 5 and is shown in Equation 5). It
would be expected that response time and signal delay would increase with the
fragmentation, because the more fragmented the FPGA becomes, the more chance there is
that the application will need to be partitioned. Once the application is partitioned, the
response time and signal delay will increase because the Allocator has to be called multiple
times and more inter-process communication routes are used. However, the amount of
increase is currently unknown. In this section, an investigation will verify whether there is a
correlation between response time, signal delay, and fragmentation, and quantify it.
F = 0,                          when h = 1
F = (1 − 1/h) × (h/A) × 100,    when h > 1

where h is the number of holes and A is the total free area; A has units of the minimum unit of allocation (CLBs in most cases)

Equation 5: Fragmentation percentage
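Equation 5 can be computed directly from the hole count and the free area. The following is a minimal sketch, assuming the piecewise reading in which a single hole counts as zero fragmentation; the function name is illustrative:

```python
def fragmentation(h: int, A: int) -> float:
    """Fragmentation percentage of the free FPGA area (Equation 5).

    h: number of holes between the allocated applications
    A: total free area, in the minimum unit of allocation (CLBs)
    A single hole is treated as an unfragmented free region.
    """
    if h <= 1:
        return 0.0
    return (1.0 - 1.0 / h) * (h / A) * 100.0  # equivalently (h - 1) / A * 100
```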
7.3.1 Response time
From the experiments that measured the response time and application throughput (the
signal delay), the fragmentation percentage was also recorded. A graph of the response time
versus the fragmentation percentage for all of these test cases is shown in Figure 72. From
this graph, there appears to be a linear-like relationship between the response time and the
fragmentation, which is highlighted by the two linear regression lines, one for each of sets 1
and 2. However, the rate at which the response time increases with the fragmentation differs
between the two sets of applications. For the same response time, the large sized applications
in set 2 have a lower fragmentation percentage compared with the smaller sized applications
in set 1.
Figure 72: The user response time versus the fragmentation
percentage for both sets and all test cases
Before the fragmentation percentage can be used as a predictor for response time, the
fragmentation percentages for the larger sized applications need to be adjusted. To determine
the amount of adjustment, the linear equations for the response time versus the fragmentation
percentage for both sets were calculated from the linear regression graphs shown in Figure 72.
These equations are shown in Equation 6. From these equations it was determined that the
fragmentation percentage for set 1 (typical size) had to be adjusted by multiplying it by a
factor of 1.9. This value has been calculated with a limited amount of experimental data and
further investigation would be required before applying it to a wider set of situations. A graph
of the response time versus the adjusted fragmentation that can be used as a predictor for
response time is shown in Figure 73.
Yt = 470x − 150
Yl = 895x − 250

where Yt = set 1 (typical sized applications), Yl = set 2 (large sized applications), and x is the fragmentation percentage

Equation 6: Linear equations for the response time versus
fragmentation percentage for both sets
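The adjustment factor of 1.9 corresponds to the ratio of the two regression slopes. The slopes themselves can be recovered with an ordinary least-squares fit; the following is a minimal sketch with no external libraries (function names are illustrative):

```python
def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def adjustment_factor(steep_slope, shallow_slope):
    """Factor by which the shallower set's fragmentation is multiplied
    so that both sets fall on a single combined regression line."""
    return steep_slope / shallow_slope

# For example, slopes of 895 and 470 give a factor of about 1.9.
factor = adjustment_factor(895, 470)
```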
Figure 73: User response time versus the adjusted fragmentation
for both sets and all test cases
A linear regression data analysis was performed on these adjusted values and the R² value
was calculated at 0.766, meaning that the regression explained 76.6% of the variation. An
adjusted fragmentation formula was thus developed that can be used as a predictor for
response time; it is shown in Equation 7. It is simply the original fragmentation formula
shown in Equation 4 with an adjustment for the mean size of the applications already
allocated onto the FPGA, represented as a percentage of the entire FPGA area, multiplied by
the adjustment value.
Fr = (1 − 1/h) × (h/A) × 100 × 23.5 × (M/F)

where
Fr = adjusted fragmentation percentage for response time
h = number of holes on the FPGA
A = number of free CLBs on the FPGA
M = mean size of the applications on the FPGA in CLBs
F = total size of the FPGA in CLBs

Equation 7: Adjusted fragmentation percentage for predicting user response time
7.3.2 Application throughput
When examining the results obtained from the experiment which measured application
throughput, it was noticed that there appeared to be a connection between the fragmentation
percentage and the signal delay. Shown in Figure 74 is a graph of the fragmentation versus the
measured signal delay for each of the floor-plans in both sets for all test cases. Similar to the
response time versus fragmentation, there is a linear-like relationship between the
fragmentation and signal delay highlighted by the linear regression plots. This is an expected
result: as the fragmentation of the FPGA increases, so does the number of vacant holes on the
FPGA, which leads to a higher chance that the application will need to be partitioned. Once
the application has been partitioned, wires need to be routed
between the processes for inter-process communication. However, the rate at which the signal
delay increases is different between the two sets representing different application sizes. For
the same fragmentation percentage, the larger sized applications in set 2 have a higher signal
delay.
Figure 74: A graph of application throughput versus fragmentation
Again, the two linear equations were calculated from the graph and are shown in Equation 8.
From these equations it was determined that the fragmentation percentage for set 1 (typical
size) had to be adjusted by multiplying it by a factor of 2.55. A graph of the signal delay
versus the adjusted fragmentation, which can be used as a predictor for signal delay, is shown
in Figure 75.
Yt = 10.5x + 24
Yl = 21x + 19

where Yt = set 1 (typical sized applications), Yl = set 2 (large sized applications), and x is the fragmentation percentage

Equation 8: Linear equations for the signal delay versus
fragmentation percentage for both sets
Figure 75: Signal delay versus the adjusted fragmentation
for both sets and all test cases
A linear regression data analysis was performed on these adjusted values and the R² value
was calculated at 0.6749. This resulted in an adjusted fragmentation formula that can be used
as a predictor for signal delay, shown in Equation 9.
Fs = (1 − 1/h) × (h/A) × 100 × 25 × (M/F)

where
Fs = adjusted fragmentation percentage for signal delay
h = number of holes on the FPGA
A = number of free CLBs on the FPGA
M = mean size of the applications on the FPGA in CLBs
F = total size of the FPGA in CLBs

Equation 9: Adjusted fragmentation percentage for predicting signal delay
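Equations 7 and 9 differ only in their fitted constant, so both predictors can share one implementation. The following is a minimal sketch, assuming the reading of Equation 5 in which a single hole counts as zero fragmentation; the function name is illustrative:

```python
def adjusted_fragmentation(h: int, A: int, M: float, F: int, k: float) -> float:
    """Adjusted fragmentation percentage (Equations 7 and 9).

    h: number of holes on the FPGA
    A: number of free CLBs on the FPGA
    M: mean size, in CLBs, of the applications already on the FPGA
    F: total size of the FPGA in CLBs
    k: fitted constant, 23.5 for response time (Fr) or 25 for signal delay (Fs)
    """
    if h <= 1:
        return 0.0
    base = (1.0 - 1.0 / h) * (h / A) * 100.0  # Equation 5 fragmentation
    return base * k * (M / F)
```

With k = 23.5 the result is the response time predictor Fr; with k = 25 it is the signal delay predictor Fs.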
From the response time and signal delay correlation analysis, adjusted fragmentation
formulas that predict the user response time and signal delay were derived. This result
showed that the mean size of the applications already allocated on the FPGA affects both the
response time under the operating system and the application throughput for a given
fragmentation: the larger the application size, the more signal delay and response time will be
experienced for the same fragmentation value. To minimise the response time and signal
delay, it is proposed to separate the FPGA into regions where large and small sized
applications are allocated. This will result in higher application throughputs and lower user
response times for the smaller sized applications; that is, the large sized applications will not
have as much effect on the performance of the smaller sized applications.
7.3.3 Comparison of fragmentation measure
There is only one other fragmentation formula for calculating the amount of fragmentation on
an FPGA (see chapter 5) that appears in the research literature. Walder proposed the formula
shown in Equation 2, which is derived from a histogram of free rectangular areas. He states
that the lower the value of the fragmentation, the higher the probability that a future
application can be mapped. However, as shown in the graphs in Figure 76 and Figure 77,
there is very little correlation between the Walder fragmentation percentage and either
response time or signal delay. It can be concluded that the Walder fragmentation measure
cannot be used as a good predictor of either operating system response time or application
throughput.
Figure 76: Response time versus Walder fragmentation measure
Figure 77: Signal delay versus Walder fragmentation measure
7.3.4 Chance of allocation
The fragmentation of the FPGA area in this thesis has been defined as a ratio of the number of
holes left between the previously allocated applications and the amount of remaining area. It
is calculated from the formula shown in Equation 4. Although the percentage of
fragmentation existing on an FPGA at a particular point in time is useful for measuring the
performance of the allocation algorithm and for predicting the possible user response time
and application throughput, it gives no indication to the user of the likelihood that their
application will be successfully allocated.
being allocated successfully are likely to include available area on the FPGA, the number of
holes that have been created by previously allocated applications, and the size of the incoming
application. These three characteristics have been combined into a formula shown in Equation
10 that predicts the chance that an incoming application will be successfully allocated.
S = A / (h × PA) × 100

where
S = percentage chance of successful allocation of the application
A = number of free CLBs available
h = number of holes on the FPGA
PA = number of CLBs of the application being allocated

Equation 10: The percentage chance of allocating a process
This formula considers the size of the application attempting to be allocated, PA, the number
of holes h that exist on the FPGA due to previous allocations, and the amount of area still
available, A. The size of the incoming application and the amount of available area must both
be measured in the same unit. The larger S is, the more chance the application has of being
allocated; the closer the value gets to zero, the less chance there is of it being allocated. If the
value is greater than 100%, the application will almost always be allocated.
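Equation 10 is straightforward to compute. A minimal sketch (the function name is illustrative), using free area and hole counts of the kind reported later in the verification experiment as an example:

```python
def allocation_chance(A: int, h: int, Pa: int) -> float:
    """Percentage chance that an incoming application can be allocated
    (Equation 10).

    A:  number of free CLBs available on the FPGA
    h:  number of holes created by previously allocated applications
    Pa: size of the incoming application, in CLBs
    """
    return A / (h * Pa) * 100.0

# e.g. 3919 free CLBs spread over 23 holes:
chance_des = allocation_chance(3919, 23, 420)         # ~40.6% for 420 CLBs
chance_triple_des = allocation_chance(3919, 23, 930)  # ~18.3% for 930 CLBs
```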
The results obtained in the experiments described in section 7.2 have been used to verify the
formula for predicting the success of allocating a particular sized application onto an already
occupied FPGA. The process of verifying the formula was conducted as follows. For all of
the initial floor-plans generated in test case 3, and four of the initial floor-plans generated in
test case 2, the amount of free area in CLBs and the number of holes present in each were
calculated. An initial floor-plan has previously been defined to be the resultant arrangement of
applications by ReConfigME processing all of the input applications prior to allocating the
benchmark application. The two test cases were chosen for the verification because the
application was partitioned into the same number of processes in each, and a different sized
incoming application was successfully allocated in each. These results are shown in Table 24.
Test case 3:
Run | Free area (CLBs) | Holes | S (%), 420 CLBs (success) | S (%), 930 CLBs (failed)
Run 1 | 3919 | 23 | 40.6 | 18.3
      | 3499 | 25 | 33.3 | 15.0
      | 3079 | 28 | 26.2 | 11.8
Run 2 | 3668 | 22 | 39.7 | 17.9
      | 3248 | 24 | 32.2 | 14.6
      | 2828 | 28 | 24.1 | 10.9
Run 3 | 3864 | 19 | 48.4 | 21.9
      | 3444 | 20 | 41 | 18.5
      | 3024 | 21 | 34.29 | 15.5
Run 4 | 3592 | 23 | 37.2 | 16.8
      | 3172 | 28 | 27 | 12.2
      | 2752 | 29 | 23 | 10.2
Run 5 | 4211 | 19 | 52.8 | 23.8
      | 3791 | 21 | 43 | 19.4
      | 3371 | 22 | 36.4 | 16.5

Test case 2:
n | Free area (CLBs) | Holes | S (%), 306 CLBs (success) | S (%), 420 CLBs (failed)
200 | 3193 | 22 | 47.4 | 34.5
    | 2887 | 24 | 39.3 | 28.6
300 | 3380 | 21 | 52.6 | 38.2
    | 3074 | 24 | 41.9 | 30.5
400 | 3294 | 20 | 53.8 | 39.2
    | 2988 | 23 | 42.5 | 30.9
500 | 3229 | 29 | 36.4 | 26.5
    | 2923 | 32 | 29.9 | 21.7
    | 2617 | 35 | 24.4 | 17.8

Table 24: Results from an experiment to verify the allocation success formula
In all of the floor-plans in test case 3, the triple DES benchmark application which consumes
930 CLBs (30 x 31) of area was unable to be allocated by ReConfigME without being
partitioned, i.e. it was not successfully allocated in one process. This resulted in the triple
DES application being partitioned into three single DES processes, each consuming 420
CLBs (20 x 21) of area which were able to be successfully allocated. This is shown in the
floor-plans in Figure 64, as there are three green rectangles in each. Similarly, in the
floor-plans in test case 2, the single DES process could not be allocated onto the FPGA. This
resulted in ReConfigME having to further partition the benchmark application into two
smaller processes, each consuming 306 CLBs (17 x 18) of area, which were able to be
successfully allocated onto the FPGA. This is shown in the floor-plans in Figure 62. For each
of the floor-plans, the percentage chance of successfully allocating the incoming application
was calculated for both the allocated and non-allocated process.
From the graph shown in Figure 78, applications with S calculated above 39% were
successfully allocated, whereas S less than 18% resulted in the application not being allocated.
For S between 18 and 39, some applications were allocated and some were not. These results
verify that the formula can be used to predict the likelihood that a particular sized application
will or will not be successfully allocated to an FPGA floor-plan.
Figure 78: A graph of the percentage success and
failed allocation of applications in test cases 2 and 3
This formula can also be used for comparing the performance of FPGA area allocation
algorithms. For example, if an identical series of applications is loaded onto two FPGAs by
different allocation algorithms, the formula can then be used to predict the chance of
allocating the next application onto each. If the percentage is lower for one FPGA than for
the other, it can be concluded that the allocation algorithm used on that FPGA has performed
worse than the other.
7.4 Conclusion
This chapter resulted in three major deliverables: an experimental test environment and
benchmark application; user response time and application throughput performance results;
and correlation factors between fragmentation, throughput, and user response time, including
a formula for predicting the successful allocation of an incoming application. The
experimental test environment consisted of a series of initial floor-plans and a benchmark
application, which were used to generate the user response time and application throughput
for a variety of situations. The results of these experiments were then correlated through the
use of a linear regression data analysis, and a formula for predicting the user response time
and signal delay based upon the amount of fragmentation in the floor-plan was derived. It
was discovered that, as well as the fragmentation measured by Equation 4, the mean size of
the applications already allocated onto the FPGA affects both the user response time and the
signal delay of the next application to be allocated. It was concluded that to minimise the
effect of different sized applications, the FPGA should be segmented into regions where large
and small applications are allocated separately. From the results obtained in the experiments,
a formula for predicting the successful allocation of an application was developed and
verified. It was determined that a variation of the fragmentation reported in Equation 4
predicts the chance of an application being allocated.
Chapter 8 – Conclusion
8 Conclusion and Future Work
The focus in this thesis was to design, build, and evaluate the performance of a resource
allocating operating system for a reconfigurable computer. This was achieved by firstly
describing a set of reconfigurable computing abstractions, defining a reconfigurable
computing operating system architecture that suits these abstractions, and outlining the
algorithm specifications that are needed in the components of the architecture. The algorithms
for resource allocation and application partitioning in the operating system were then selected
by ranking previously published algorithms according to their runtime complexity. The
algorithms with the least runtime complexity were then modified to suit the operating system
environment and implemented so experiments could be performed to measure performance.
The best performing resource allocation and partitioning algorithms were then selected to be
part of the prototype operating system, ReConfigME. This was implemented using a
commercially available reconfigurable computing platform and consists of a three-tiered
network architecture. Users can load their applications onto the user tier, have them processed
by the Colonel on the operating system tier, and then configured onto an FPGA located in a
computer in the platform tier. A series of experiments was performed on the operating
system to measure its effect on user response time and application throughput. From these
experimental results, predictors for user response time and signal delay based upon the
fragmentation of the FPGA were derived. Finally, a formula for estimating the percentage
chance of a successful allocation of an incoming application was also derived. In this chapter,
a summary of the contributions made in this thesis is outlined, followed by suggestions for
future work in the field.
8.1 Research contributions
In chapter 2 it was shown that there are significant gaps in the literature regarding the runtime
management of reconfigurable computing applications. In this thesis, contributions were
made in the following areas.
1. There is no agreed list of abstractions that should be used in an operating system for
reconfigurable computing (see section 4.1).
Before any operating system architecture or prototype could be built, a set of abstractions for
a reconfigurable computing operating system had to be defined. Through comparison with
the analogous abstractions of a software operating system, reconfigurable computing
abstractions for a process, an address space, and inter-process communication were defined.
The process abstraction consists of an application in execution, structured according to a data
flow graph model with data source and sink nodes inserted for simplified I/O access.
then streamed into or out of the application via these nodes. A process abstraction was defined
that had direct access to the FPGA pins if an application had hard performance limits. All
connections to these pins are handled by the designer and not the operating system. The
address space abstraction consists of a two dimensional address space that represents the
FPGA at CLB granularity, and a one dimensional address space that represents the on-board
memory. This abstraction prevents processes from accessing resources that are not allocated
to them. The inter-process communication abstraction consists of creating and passing
messages between processes, similar to what happens in a software operating system. All
inter-process communication is conducted via memory and, for performance reasons, all
processes and the memory are connected together via a shared multiple-bus topology.
2. Current design flows have little support for dynamic reconfiguration with resource
allocation (see section 4.4).
A modified application design flow was developed for use with the proposed operating
system architecture; it consists of structuring the application according to a data flow graph
model. This structure allows the operating system to use an application partitioning
algorithm to divide the application into smaller parts. Once the application has been defined,
each node of the data flow graph is then synthesized, technology mapped, and place and
routed using the commercially available design tools. However each of the nodes must still be
relocatable as their position on the FPGA is not finalised until runtime.
3. Algorithms for runtime resource allocation and runtime application partitioning have
not been deeply investigated in the reconfigurable computing domain (see chapter 5).
Before algorithms for resource allocation and application partitioning could be selected,
their required functionality had to be defined. For online FPGA area allocation, the
functional specifications were defined as:
• Determine the size and position of a vacant segment of area into which an application
can fit without interfering with already allocated applications
• If there is not enough contiguous vacant area, find the largest segment of
contiguous area
• If there is not enough total area, block the application and place it in a ready queue
• Minimise the amount of area wasted due to poor allocation choices
For an online application partitioning algorithm the functional specifications were defined as:
• Repeatedly divide an application structured as a data flow graph into partitions of
varying specified sizes until the entire application has been partitioned
• Avoid partitioning across feedback loops, to minimise the impact on performance
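The partitioning behaviour specified above can be sketched as a simple greedy pass: walk the nodes in data flow order and cut a new partition whenever the next node would exceed the requested size. This is only an illustration of the specification, with hypothetical names; the handling of feedback loops (keeping a cycle's nodes within one partition) is omitted.

```python
def partition(node_areas, max_area):
    """Greedily split a data-flow-ordered list of (name, clb_area) nodes
    into partitions whose total area does not exceed max_area."""
    partitions, current, used = [], [], 0
    for name, area in node_areas:
        if current and used + area > max_area:
            # The next node would not fit: close the current partition.
            partitions.append(current)
            current, used = [], 0
        current.append(name)
        used += area
    if current:
        partitions.append(current)
    return partitions
```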
It was then determined that, as both algorithms would be executed at runtime, their runtime
complexity had to be linear with respect to the number of CLBs for allocation, or the number
of nodes in the graph for partitioning. A survey of both the reconfigurable and non-
reconfigurable computing literature was then made, which resulted in a list of promising
allocation and partitioning algorithms. These algorithms, along with a greedy-based
allocation algorithm, were then implemented so that their performance could be evaluated in
an experiment that measured their execution runtime and fragmentation. From this
experiment it was concluded that the Minkowski Sum allocation algorithm with a bottom-left
heuristic, and the temporal partitioning algorithm, were suitable for use in an operating
system for reconfigurable computing.
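The bottom-left heuristic can be illustrated with a brute-force sketch over a CLB occupancy grid: scan candidate positions from the bottom row upwards and from left to right, returning the first position where the requested rectangle is free. The Minkowski Sum formulation computes the feasible region geometrically rather than by scanning; this version only shows the heuristic's preference order, and all names are illustrative.

```python
def bottom_left_place(grid, w, h):
    """grid: 2D list of booleans (True = occupied), indexed grid[y][x]
    with y = 0 at the bottom. Returns the bottom-left corner (x, y) of
    the first free w-by-h rectangle, or None if no contiguous area fits."""
    rows, cols = len(grid), len(grid[0])
    for y in range(rows - h + 1):          # lowest rows first
        for x in range(cols - w + 1):      # then leftmost columns
            if all(not grid[y + dy][x + dx]
                   for dy in range(h) for dx in range(w)):
                return (x, y)
    return None

def mark(grid, x, y, w, h):
    """Record an allocation by marking its CLBs occupied."""
    for dy in range(h):
        for dx in range(w):
            grid[y + dy][x + dx] = True
```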
4. There is no prototype runtime system for reconfigurable computing that demonstrates
runtime area resource allocation and partitioning (see section 4.2.2 and section 6.1).
Before a prototype reconfigurable computing operating system was constructed, its
architecture was proposed. This consists of six major components: an interface, the
Partitioner, the Allocator, the Loader, a hardware abstraction layer and a primitive on-chip
architecture. A prototype called ReConfigME was then constructed, consisting of all of the
components described in the architecture separated into three tiers, all connected and
communicating via standard TCP/IP message passing. The Allocator and Partitioner were
combined into the Colonel component of ReConfigME, which is the core of the operating
system. Users connect to ReConfigME via a standard command line interface, where they
upload their application and, once it is executing, stream the I/O data to it. An application
architecture was also defined, consisting of a data flow graph structure with data source and
sink nodes for I/O.
5. There has been little discussion of metrics that might be used to evaluate the
performance of an operating system for reconfigurable computing (see section 6.2).
To evaluate the performance of the prototype operating system and of applications executing
under its control, a set of metrics including the user response time and application throughput
was selected. The user response time consists of measuring the execution runtime of the
operating system when processing applications under various conditions. The application
throughput consists of calculating the signal delay of the application once it has been
allocated and partitioned. Partitioning the application without co-locating the processes can
potentially increase the total wire length, which is the primary reason for a decrease in
application throughput.
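The throughput metric can be illustrated by estimating the worst-case inter-process wire delay from the placements: the further apart two communicating processes are placed, the longer the wire and the lower the achievable throughput. The Manhattan-distance model and the delay-per-CLB constant below are invented placeholders, not figures from the experiments.

```python
def manhattan(a, b):
    """Wire length between two CLB coordinates, measured in CLBs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def signal_delay(placements, edges, delay_per_clb=0.1):
    """placements: process name -> (x, y) CLB coordinate.
    edges: list of (src, dst) inter-process connections.
    Returns the worst-case wire delay, which limits throughput."""
    if not edges:
        return 0.0
    return max(manhattan(placements[s], placements[d])
               for s, d in edges) * delay_per_clb
```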
6. There have been few evaluations of the effect an operating system environment will
have on reconfigurable computing application performance (see chapter 7).
Experiments were conducted first to measure the effect the operating system has on user
response time, and second the effect it has on application throughput. It was determined that
the Allocator was the main contributor to the execution runtime of ReConfigME. The
Allocator consumed far more time when the application was partitioned into multiple
processes, as each process required the Allocator to be called at least once. The throughput of
an application executing under ReConfigME was only significantly affected when it was
divided into multiple partitions and the partitions were allocated in locations that resulted in
longer wire lengths. From the experimental results, good correlations between response time
and fragmentation, and between application throughput and fragmentation, were determined.
By performing a linear regression analysis on the data set, a formula for predicting the user
response time and signal delay based upon the fragmentation was derived. It was determined
that the user response time and signal delay of smaller applications were lower than those of
larger applications for the same amount of fragmentation. From this result, it was concluded
that the FPGA should be segmented into regions in which similarly sized applications are
allocated. This will minimise the impact larger applications have on the performance of
smaller ones. Finally, a formula for predicting the chance of allocating an application of a
specified size onto an FPGA with a particular amount of fragmentation was proposed. This
formula was verified against the experimental data.
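The regression analysis described above can be sketched with an ordinary least squares fit of response time against fragmentation; the fitted line then serves as the predictor. The sample measurements below are invented for illustration and do not come from the experimental data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical (fragmentation, response time in ms) measurements:
frag = [0.1, 0.2, 0.3, 0.4]
resp = [12.0, 14.0, 16.0, 18.0]
m, c = fit_line(frag, resp)

def predict(f):
    """Predicted user response time for a given fragmentation level."""
    return m * f + c
```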
8.1.1 Summary of major contributions
Listed below is a summary of the major research contributions that were made in this thesis.
• Abstractions for a reconfigurable computing operating system.
• An architecture of an operating system for reconfigurable computing.
• The creation of allocation and partitioning algorithms with ‘low’ execution runtime.
• The implementation of a prototype including multiple applications.
• A performance evaluation of the prototype.
• Two new fragmentation metrics which can predict signal delay and application
throughput.
• A proposal to segment the FPGA into regions in which similarly sized applications are
allocated, reducing fragmentation.
• The abstraction of the commercial place and route tools away from the traditional
user experience.
• A proposal that, for an FPGA to take full advantage of an operating system, its
architecture should have a separate routing layer and support true dynamic
reconfiguration.
8.2 Suggestions for future work
The research results reported in this thesis give some quite definite directions for future
research in operating systems for reconfigurable computers. Most importantly, significant
modifications are needed in the design tools before the full benefits of an operating system
can be realised. These include the ability to relocate pre-placed and pre-routed cores, and a
runtime router with minimal execution time that can route inter-process communication wires
after an application has been loaded into the operating system by the user. A global routing
architecture, separate from the routing hierarchy used in the pre-place and pre-route phase,
could improve the execution time of a runtime router and would potentially make applications
easier to allocate, as the Allocator would not have to allocate around inter-process
communication wires. The development of an FPGA architecture that can perform dynamic
partial reconfiguration without the column-based restriction would also improve performance,
as applications would not need to be check-pointed and stopped between reconfigurations.
Other future work could include a dynamic bus network that can be routed at runtime to
accommodate multiple reconfigurable computing applications. For example, once the location
of a new application had been determined, the bus would dynamically extend to that location,
connect to the application and then arbitrate the communications between it, the memory and
the other applications. Finally, to improve the performance of inter-process communication, a
mechanism for routing channels directly between communicating applications, and between
applications and the I/O pins of the FPGA, could be developed to minimise any bottlenecks
associated with the platform memory. These types of communication could be used with
performance-oriented applications or applications with significant amounts of I/O.