Page 1: Design and analysis of FPGA based self-timed system with ...  · Web viewFig 20 Architecture of the memory arbiter block 13.1.3 Synchronous Interface Block (SIB) 13.1.3.1 Features

Design and analysis of FPGA based self-timed system with specific focus to Xilinx FPGAs

M. SRIRAMAN
Electronics and Communication Department

Anna University
CEG, Anna University, Guindy, Chennai-25

INDIA

Abstract:- ASIC based self-timed systems use custom cells, which are not commercially available to the design community. Hence, asynchronous self-timed design has been restricted to the few organizations around the world that can afford the cost and effort. To make the methodology widely available, it should be possible to build these systems using basic components available to the entire design community. This dissertation work focuses on building a self-timed system using the components available in a basic Xilinx FPGA, which is the most widely used FPGA in the world. The flow is based on designing basic macro modules, which are self-consistent in timing, and building bigger designs from such macro modules. At the end of the dissertation work, a set of guidelines and a design flow will be available to the design community for designing self-timed systems using FPGAs. An example design of an arbitration based memory access module is given to illustrate the credibility and desirability of the methodology. The module is part of a complex design in a larger chip, and meeting the 125 MHz timing in a Xilinx Virtex FPGA proved extremely difficult. It also has performance bottlenecks in delivering the required throughput. The self-timed design is used to show that both the timing requirement and the performance requirement can be comfortably met in the same FPGA.

Keywords:-Self timed systems, macro modules, Delay insensitive, Dual rail signaling, Bundled data signaling.

16. Introduction

The current day VLSI systems predominantly employ clock based synchronous design owing to its simplicity and predictability. Though work on them started as early as the 1950s, asynchronous design methodologies have not gained much ground for reasons such as the lack of proper methodologies, possible design hazards and the non-availability of synthesis tools. The indisputable popularity of the synchronous design methodology was also a key reason why the design community paid little attention to asynchronous design methodologies. Clock-based synchronous systems have been able to cater to the growing needs of the design community so far. But as the complexity, size and speed of systems increase, they bring with them major issues such as speed limited by the worst-case timing delay, global synchronization with low skew, and power consumption due to the always-active clock, all of them centered around the clock signal.

The topic of this thesis work is to analyze the gradual applicability of asynchronous designs to certain parts of synchronous designs, which can significantly boost the performance. The resulting systems are mixed synchronous-asynchronous systems. The type of asynchronous circuits dealt with in this work is also called self-timed systems since they don’t need any external signal to synchronize or time their operations. There are many issues that face the self-timed design community:

1. Implementing reliable handshake protocols

2. Synchronization of the control and data paths, which can be delay based or completion detection based

3. Interfacing with the external/synchronous world

4. Avoiding design hazards/glitches, which can propagate through the design

This work aims at evolving a set of guidelines that will help make self-timed systems realizable using commercially available FPGAs.


16. Scope of Dissertation Work

The dissertation work will restrict itself to:

1. Evolving a design methodology for self-timed systems, which comprises the flow, the usage of the standard macro modules detailed in this work, and a set of dos and don’ts

2. An example design of a memory access arbiter, which will be implemented and verified using simulation

16. Objectives

1. Expanding the reach of self-timed systems by making their design easily accessible to the entire design community

2. Coming up with a flow/guidelines for the FPGA based self-timed systems so that the success is repeatable

3. Demonstrating the advantage of mixed synchronous/self-timed design with an example

16. Background of the Dissertation work

The idea of asynchronous design has been there since the birth of VLSI design. The design community had to choose between asynchronous and synchronous design in the early 1950s. Owing to its simplicity and ease of design, synchronous design won over asynchronous design. Still, background work on asynchronous clockless designs has been going on in various universities and in some VLSI design powerhouses like Sun Microsystems, Philips Semiconductor and Intel. In fact, companies like Philips have commercial products that don’t use any clock.

The work in asynchronous design dates back to the mid 1950s, with Huffman’s theoretical work on asynchronous circuits. A major breakthrough came from Muller, who proposed a class of circuits that are closer to the modern asynchronous circuits. He introduced a request-ack based 4-way protocol, which aids the control/data flow through the design.

In 1969, Stephen Unger published his classic textbook on asynchronous circuit design, which presented a comprehensive look into asynchronous design methods. Following this, many popular and practical asynchronous design projects have been executed. To name some:

1. Use of speed-independent circuits in the design of the ILLIAC and ILLIAC-II computers by Muller et al at the University of Illinois

2. Mainframe computers like MU-5 and Atlas constructed using fully asynchronous circuits

3. Design of the asynchronous AMULET processor in Manchester University

4. Also in the 1970s, asynchronous techniques were used at the University of Utah in the design of the first operational dataflow computer [102, 103] and at Evans and Sutherland in the design of the first commercial graphics system.

5. Design of an asynchronous 80C51 microcontroller by Philips

6. The RAPPID project at Intel, which demonstrated that a fully asynchronous instruction length decoder for the x86 instruction set could achieve a threefold improvement in speed and power compared to a synchronous design

Asynchronous design is gaining popularity owing to advantages like:

1. Low power

2. Circuit performance that depends on the average-case delay rather than the worst-case delay, as in synchronous designs

3. Easy local scalability and flexibility to change components. When a component is to be replaced, only that part needs to change so long as its interface remains the same, without regard to the clock frequency/speed of the adjacent blocks

4. No pain of routing and balancing low-skew clock trees

In spite of these advantages, asynchronous design faces many disadvantages, which are a primary reason for it not gaining the popularity over synchronous design that it deserves. Some of them are:

1. Non-availability of a standard and clear-cut design flow accessible to all

2. Non-availability of commercial tools for verification and synthesis of asynchronous circuits


3. Usage of custom VLSI components for asynchronous ASIC design by the players in the industry makes it difficult for others to use them for wide manufacturing of commercial products

4. Robustness of the design is a major challenge to be overcome by the design community

5. Testability of the asynchronous chips

Though this work cannot solve all the issues, it intends to solve the first and the third.

16. Issues/Challenges with Synchronous design

Synchronous design has enjoyed a monopoly in the VLSI design field since its birth. The key reasons behind this are:

1. Simplicity

2. Accessibility to the entire design community, without the need for any special components that are proprietary to certain groups around the world

3. Robustness

4. Extensive tool support

5. Maturity gained over the years, with a well-defined flow and methodology

Synchronous design is a ‘ready to jump’ methodology with which anyone can start designing and taping out chips without much difficulty. The millions of chips taped out around the globe have yielded a well-defined pathway to success and immense confidence among the design community in making successful chips. But with increasing demands on speed and complexity, a severe bottleneck to the synchronous design methodology is imminent. The bottleneck comes from multiple angles. The next few subsections focus on the key issues/challenges limiting the applicability of the synchronous design methodology to future chips.

13.2 The killer clock

The speed of a circuit is directly related to how quickly the underlying components can switch. In synchronous designs the speed of the circuit is limited by the maximum frequency at which the clock can switch. The clock is the heartbeat of a synchronous circuit. To date, synchronous circuits have been able to scale well in terms of speed thanks to advances in manufacturing processes. But even with advanced sub-micron processes, the single-point dependency on the clock brings a set of limitations that will be hard to overcome.

1. Clock skew issues

1.1. Designs with more than half a million flops have become common these days

1.2. With so many flops, the clock must be balanced so that the skew between the instants at which it reaches these flops is kept to less than tens of ps

1.3. This requirement is becoming more and more of a nightmare for backend teams due to loading and non-uniform distances

2. Clock switching and EM radiation

2.1. With increasing clock speeds, EM radiation is a serious issue to tackle.

13.2 Power factor

Since all the flops in the design are clocked all the time, the chip draws power continuously. Though clock gating methods have evolved, the majority of flops in the critical design path are not gated even though they may not be in use all the time. These flops contribute a significant amount of power consumption that could otherwise be saved.

13.2 Portability/Scalability

In a synchronous design all components that interact with each other are tightly coupled in timing, on the assumption that they all operate with the same clock. Suppose one of the following requirements comes into the picture:

1. A particular block has to be scaled to a higher frequency for performance improvement

1.1. This is not directly possible, since the other components interacting with this block would also need to be scaled to the higher frequency. A solution would be to pipeline the design and increase the frequency. Performance will then not scale in proportion to the frequency; the gain will be limited by the amount of pipelining.

2. The entire design needs to be ported to a different technology library

2.1. Here we cannot be sure that the design will close timing in the new technology library, particularly if the new library is inferior in timing, e.g. an ASIC to FPGA conversion

13.2 Performance – Worst case!

In synchronous designs the highest achievable clock speed is limited by the largest combinatorial delay between two flops in the design. This means that the design performance is limited by the worst-case delay, or rather by the weakest link in the design.

16. Self-Timed systems

Asynchronous designs have existed since the invention of the transistor. Unfortunately they did not gain enough popularity owing to their complexity and the absence of a big community to aid their progress. The asynchronous design methodology holds a lot of promise for overcoming the challenges that face synchronous designs. [9] gives a good account of the comparison of current day synchronous and asynchronous designs.

13.2 Clockless designs

Self-timed designs get rid of clocks altogether. Hence, there is no central timing signal that synchronizes all the events. This is a major relief from the nightmares of clock balancing and EM radiation. The entire design operates with a set of asynchronous handshaking protocols between the blocks.

13.2 Cool chips – Power saving

Due to the absence of a central clock, not all the components are active all the time. Only those areas of the design that are actively processing at a given instant dissipate power. This saves significant power compared to synchronous designs.

13.2 Portability

Since all the components of the design are inter-connected using an asynchronous request-ack protocol, porting to a different technology to improve performance, or changing a particular part of the design for a performance improvement, is naturally possible.

13.2 Performance – Average case!

The performance of an asynchronous design varies between the best-case and worst-case delays depending on the input. In effect the design operates at the average-case delay, unlike synchronous designs, which are always bound by the worst-case delay.
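The difference between worst-case and average-case operation can be illustrated with a small numerical sketch. The delay values and the per-operation handshake overhead below are hypothetical figures chosen only for illustration; they are not measurements from this work.

```python
# Hypothetical combinational delays (ns) of five successive operations.
# These numbers are illustrative assumptions, not measured values.
delays_ns = [2.0, 3.5, 8.0, 2.5, 4.0]


def synchronous_time(delays):
    """Clocked design: every operation costs one clock period,
    and the period is set by the worst-case delay."""
    return len(delays) * max(delays)


def self_timed_time(delays, handshake_overhead_ns=0.0):
    """Self-timed design: each operation costs its own delay
    plus an assumed fixed handshake overhead."""
    return sum(d + handshake_overhead_ns for d in delays)


sync_total = synchronous_time(delays_ns)
async_total = self_timed_time(delays_ns, handshake_overhead_ns=0.5)
print(sync_total, async_total)
```

With these assumed numbers the clocked version pays the 8 ns worst case on every operation, while the self-timed version pays only what each operation needs plus the handshake cost.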

16. Self-Timed designs – Issues/Limitations

The previous sections listed the promises that asynchronous designs hold for overcoming the hurdles facing synchronous designs in the near future. From the previous section it may appear that self-timed design will be a clear winner over synchronous design and will be the favorite choice of the design community in future. But it is not as simple as it sounds. Self-timed designs have a set of serious challenges that will cause their wide applicability to be postponed by more than a decade. Unless these challenges are met with an appropriate scalable solution, self-timed design will remain a distant dream for the VLSI design community. Some of the challenges are as follows:

1. Availability of design components:

1.1. Most of the current day self-timed designs are based on custom VLSI circuits, which are accessible only to certain organizations/groups around the globe. An easy to follow methodology with simple design components should be available so that the design community can adopt it easily


1.2. The EDA companies for synthesis and backend do not show enough interest towards self-timed systems

1.3. This work focuses on making self-timed systems independent of any EDA tool

2. Design Hazards

2.1. Since the designs are clockless, combinatorial loops are inevitable. Handling the combinatorial loops and other design hazards is essential. Some works rely on special synthesis tools to handle these issues. Techniques for achieving hazard-free asynchronous state machines are given in [12]. Also, [18] lists some techniques for robust asynchronous designs.

3. Memory

3.1. Networking designs are memory intensive designs. There is no clear-cut memory based design available in literature.

3.2. This work gives a way to realize memories in self-timed systems, which will make them applicable to networking designs

4. Ease of design

4.1. Self-timed designs are definitely not as simple as synchronous designs. This is primarily because of the special care required in designing these circuits and the lack of exposure to these methodologies.

4.2. With a well-defined design/implementation approach the resistance will come down. This is one of the objectives of this work.

5. Completion detection

5.1. Since there is no central timing to mark the completion of operations, a different method must be adopted to detect that signals have settled to proper values before an operation can be started. Similarly, some method is required to mark the completion of an operation so as to kick-start the following operation. A comparison of various completion detection circuits is given in [10]. [13] gives a good account of speculative completion detection for adders

6. External interfacing

6.1. Since most chips are synchronous in nature, a fully asynchronous chip will be impractical in the near future. Attempting a fully asynchronous chip would slow down the progress and widespread applicability of self-timed designs.

7. Testability

7.1. This will be a new area to explore for ASIC self-timed designs. But for FPGAs this does not matter, since FPGAs are pre-tested hardware components. Some work in this direction can be seen in [11] and [19].

The thesis work focuses on evolving a design methodology for self-timed designs in Xilinx FPGAs, so that it is widely accessible for the design community to try out without the need for any special custom VLSI circuits.

16. Past work on self-timed designs

There is quite a lot of work in the self-timed design area. But the majority of the work focuses on using custom VLSI circuits, special algorithms, special synthesis tools or hand routing for delay control. [1] and [4] give a good account of the various available asynchronous design methodologies.

On a broader basis the various works in this area can be categorized under the following heads:

1. Using custom VLSI circuits:

1.1. [10], [13] use custom VLSI circuits for asynchronous designs

1.2. [14] gives a custom FPGA to achieve a self-timed system

2. Algorithmic

2.1. Petri net based designs/special synthesis tools

2.2. [15], [17], [12] use an algorithmic level approach to asynchronous designs

2.3. [4] has a chapter dedicated to this

3. Globally Asynchronous and Locally Synchronous (GALS)

3.1. [20] analyses locally synchronous module design in a GALS system

4. Macro-module based design methodology

4.1. These works are closer to the current day synchronous design methodology. They approach self-timed design by designing multiple macro-modules and putting them together to achieve a bigger design

4.2. [3] is heavily referenced in this thesis

4.3. [2] gives an FPGA example for this

Some of the issues not addressed in these works, and which are the focus of the current thesis work, are:


1. Synchronous-asynchronous approach for control path designs with immediate applicability in current day designs

2. Memory based design

3. Reducing the number of specially designed components so that the designer has to spend less time on creating these components

4. Increasing the ease of design so as to bring self-timed designs closer to current day synchronous designs

5. Reducing the hazard areas through designs and not by tools

The rest of the paper deals with self-timed circuits and their design.

16. Synchronous-asynchronous mixed design approach

In the previous sections the various issues facing synchronous design were explained, followed by the promise of self-timed design to solve some of these issues. At the same time, self-timed systems have their own set of limitations that keep them from being widely accepted by the design community. A fully asynchronous chip is a distant reality given the current level of maturity of the self-timed design methodologies. In this thesis work a mixed synchronous-asynchronous design is attempted. In a bigger synchronous design, certain parts that are performance critical and struggle to meet the required timing can be replaced with a self-timed equivalent. The self-timed circuit will help in meeting the required performance of the overall design. With the increasing complexity and speed of current day designs, some requirements of synchronous design stand in the way of meeting the required performance:

1. Increased pipelining to meet the frequency requirement

1.1. To keep the frequency of operation of the chip high, designs are highly pipelined. These pipelines become a headache in latency sensitive designs like arbiters. Pipelining also increases the power consumption.

1.2. Increasing the frequency of the design to improve the performance requires reconsideration of the pipelining

2. Highly complex designs have a lot of design blocks. In order to predictably meet the inter-module timing, affinity flopping is required for the inputs and outputs. All inputs should be clocked once before being used in the module and all outputs should be clocked once before being driven out of the module. This adds to latency

3. Memory interfaces should have one affinity flop stage near the module and one near the memory. This makes the memory access latency 5-6 cycles. Memory access arbiters suffer greatly due to this.

Every networking design has the above issues, which become a critical bottleneck for designers in meeting the required performance. Let us consider an example design of a memory access arbiter, which serves multiple clients in accessing the memory. Assume that each client is of the read-modify-write type. Each client will then perform the following operations:

1. Read the memory

2. Modify the read data based on some conditions

3. Write back the modified contents

Suppose the clients are to be chosen on a priority basis. In this case there are a few pitfalls for the designer:

1. The client’s operations are limited by the speed at which the read can happen, since only then can it proceed with further operations on the read data and write it back.

2. This read latency in turn depends on the memory access latency and the arbitration latency

3. If the number of clients is large, the arbiter will become complex and will require appropriate pipelining to meet the timing.
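The behaviour expected of such an arbiter can be sketched in software as follows. This is only a behavioural illustration under assumptions of my own: client 0 is taken to have the highest priority, each client uses its own index as the memory address, and the function names are hypothetical. It says nothing about the hardware structure or timing.

```python
def priority_arbiter(requests):
    """Return the index of the highest-priority requesting client,
    or None when nobody is requesting. Index 0 is assumed to be
    the highest priority (an illustrative choice)."""
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None


def read_modify_write(memory, address, modify):
    """Perform the three client steps: read, modify, write back."""
    data = memory[address]          # 1. read the memory
    new_data = modify(data)         # 2. modify the read data
    memory[address] = new_data      # 3. write back the modified contents
    return new_data


memory = [0] * 4
requests = [False, True, True]      # clients 1 and 2 are requesting
winner = priority_arbiter(requests)
if winner is not None:
    read_modify_write(memory, address=winner, modify=lambda d: d + 1)
print(winner, memory)
```

A client cannot begin step 2 until step 1 returns, which is why the read latency dominates the achievable throughput.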

Fig 1 Synchronous memory arbiter (blocks: clients, arbiter with arbitration latency, affinity flopping, memory)

In the above figure a synchronous memory arbiter is shown. Here the arbitration latency can be 2-3 cycles and the memory access latency due to affinity flopping would be 5-6 cycles. Hence, the total access latency becomes 7-9 cycles. Each client would suffer a minimum of this much latency for memory accesses. This would be a major limiting factor for the performance of the design. A self-timed system would fit neatly into such a scenario and save a lot of access cycles.

Fig 2 Self-timed design in a synchronous system (blocks: clients, sync to async interface, self-timed arbiter, self-timed memory, asynchronous interface)

In the above example the arbiter and memory have been replaced with a self-timed equivalent. Now there is no pipelining and the only delays are combinatorial and routing delays. This is surely much less than the 7-9 cycles of the synchronous design. The total power of the circuit will also reduce due to the self-timed modules. One disadvantage that is immediately visible in the example is the latency due to synchronization at the asynchronous interface, which will be 4 cycles. Since the synchronous design takes 7-9 cycles, the self-timed modules are advantageous if their own total latency is less than 3-5 cycles. As the system grows more complex the synchronous arbiter will have more latency, and the self-timed system will be of sure advantage.
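The latency budget quoted above can be reproduced with a few lines of arithmetic. The cycle counts come from the text; the conversion of a self-timed delay in nanoseconds into cycles assumes the 125 MHz clock mentioned in the abstract, and the function names are illustrative only.

```python
# Cycle counts taken from the text, as (minimum, maximum) pairs.
ARBITRATION_CYCLES = (2, 3)
MEMORY_ACCESS_CYCLES = (5, 6)
SYNC_INTERFACE_CYCLES = 4


def synchronous_latency():
    """Total access latency of the fully synchronous arbiter."""
    return (ARBITRATION_CYCLES[0] + MEMORY_ACCESS_CYCLES[0],
            ARBITRATION_CYCLES[1] + MEMORY_ACCESS_CYCLES[1])


def self_timed_budget():
    """Cycles left for the self-timed logic once the synchronizer
    cost at the asynchronous interface has been paid."""
    low, high = synchronous_latency()
    return (low - SYNC_INTERFACE_CYCLES, high - SYNC_INTERFACE_CYCLES)


def self_timed_is_faster(self_timed_delay_ns, clock_mhz=125.0):
    """Compare against the best-case synchronous latency.
    The 125 MHz default is an assumption borrowed from the abstract."""
    period_ns = 1000.0 / clock_mhz
    total_cycles = SYNC_INTERFACE_CYCLES + self_timed_delay_ns / period_ns
    return total_cycles < synchronous_latency()[0]


print(synchronous_latency(), self_timed_budget())
```

Under these assumptions a self-timed path of 16 ns (two clock periods) would beat the synchronous arbiter, while one of 32 ns would not.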

13.2 Delay insensitive control flow

Self-timed circuits apply a different type of structure to circuit design. Rather than let signals flow through the circuit whenever they are able, as with an unstructured asynchronous circuit, or require that the entire system be synchronized to a single global timing signal, as with clocked systems, self-timed circuits avoid clock-related timing problems by enforcing a simple communication protocol between circuit elements. This is quite different from traditional synchronous signaling conventions, where signal events occur at specific times and may remain asserted for specific time intervals. In asynchronous systems it is important only that the correct sequence of signals be maintained. The timing of these signals is an issue of performance that can be handled separately. If this protocol is insensitive to delays through circuit components or the wires that are used to connect them, it is known as a delay-insensitive protocol [1].

Self-timed protocols are often defined in terms of a pair of signals that request an action, and acknowledge that the requested action has been completed. One module, the sender, sends a request event (Req) to another module, the receiver. Once the receiver has completed the requested action, it sends an acknowledge event (Ack) back to the sender to complete the transaction. In order to maintain this sequence of events, the communicating modules must obey the following rules:

Rule 1 The sender must not produce a new request event until the previous request event has been acknowledged.

Rule 2 The receiver must not produce an acknowledge event unless it has received a request event.

Rule 3 After initialization, the sender may produce a new request event.

Fig 3 Delay insensitive control flow

These rules define the operation of the modules, which follows the common idea of passing a token of some sort back and forth between two participants. Imagine that a single token is owned by the sending module. To issue a request event it passes that token to the receiver. When the receiver is finished with its processing it produces an acknowledge event by passing that token back to the sender. The sequence of events in this communication transaction is an alternating sequence of request and ack events. This sequence of events in a communication transaction is called the protocol. In this case the protocol is simply for request and ack to alternate, although in general a protocol may be much more complicated and involve many interface signals. Proper operation of a protocol requires the cooperation of more than one module; a single module cannot enforce all the rules for orderly communication. Notice that in the rules for this simple protocol Rule 1 must be observed by the sender, and Rule 2 observed by the receiver. If either module breaks the rules, the protocol is also broken.
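The token-passing view of the rules can be captured by a small checker that watches a stream of events. This is a minimal sketch; the class name, the event labels "req" and "ack", and the helper function are assumptions introduced for illustration.

```python
class HandshakeMonitor:
    """Tracks who holds the token and flags violations of Rules 1 and 2."""

    def __init__(self):
        # Rule 3: after initialization the sender holds the token,
        # so no request is outstanding and a new one may be issued.
        self.outstanding = False

    def request(self):
        if self.outstanding:
            raise RuntimeError("Rule 1 violated: previous request not acknowledged")
        self.outstanding = True     # token passes to the receiver

    def acknowledge(self):
        if not self.outstanding:
            raise RuntimeError("Rule 2 violated: acknowledge without a request")
        self.outstanding = False    # token returns to the sender


def trace_is_legal(events):
    """Return True when a list of 'req'/'ack' events obeys the protocol."""
    monitor = HandshakeMonitor()
    try:
        for event in events:
            if event == "req":
                monitor.request()
            elif event == "ack":
                monitor.acknowledge()
            else:
                raise ValueError("unknown event: " + repr(event))
    except RuntimeError:
        return False
    return True


print(trace_is_legal(["req", "ack", "req", "ack"]))
print(trace_is_legal(["req", "req"]))
```

The first trace alternates correctly; the second breaks Rule 1 because the sender issues a second request while the token is still with the receiver.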

13.2 Request-ack protocol and timing

Following the rules for the delay insensitive protocol defined in the previous section, there are two variations of the protocol:

1. Two-way handshake

2. Four-way handshake

13.1.1 Two-way handshake

Every transition (high to low and low to high) of the request and acknowledge signals has a meaning. This can be called transition signalling. As shown in fig 4, a transaction is triggered by the low to high edge of the request signal. This is acknowledged by a low to high transition on ack, which completes the transaction. The second transaction starts with a high to low transition on request, which is acknowledged by a high to low transition on ack.

Fig 4 Two-phase protocol (waveforms: Request and Ack over Transactions 1 to 3)

Though this protocol looks more efficient and simple, it is complex to implement. The four-way handshake is more appropriate for self-timed circuits.
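A short sketch of transition signalling may help: every toggle of ack closes one transaction, and a request is pending whenever the two signal levels differ. The sampled-level representation and function names are illustrative assumptions, and both signals are assumed to start low.

```python
def two_phase_pending(req, ack):
    """In transition signalling a request is outstanding
    whenever the request and ack levels differ."""
    return req != ack


def two_phase_transactions(samples):
    """Count completed transactions in a list of (req, ack) level samples.
    Every ack transition, rising or falling, completes one transaction."""
    completed = 0
    previous_ack = samples[0][1]
    for _req, ack in samples[1:]:
        if ack != previous_ack:
            completed += 1
        previous_ack = ack
    return completed


# Two transactions: req rises, ack rises, req falls, ack falls.
samples = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(two_phase_transactions(samples))
```

Four signal transitions carry two transactions here, which is the efficiency advantage the text refers to.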

13.1.1 Four-way handshake

In this protocol only a low to high transition on the request signal starts a transaction. A transaction has to go through the following steps:

1. Request going high

2. Ack going high

3. Request going low

4. Ack going low

This requires four changes, two each on the request and ack signals, for a transaction to complete. This is inefficient compared to the two-way handshake in the sense that transactions happen at half the speed. But it is simpler to implement in self-timed circuits.

Fig 5 Four-way handshake (waveforms: Request and Ack over Transactions 1 and 2)
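The four steps can be checked mechanically by walking a list of sampled (request, ack) levels through the expected order. This is a sketch under the assumption that both signals start low; the names used are hypothetical.

```python
# Expected (req, ack) levels after each of the four steps of a transaction.
FOUR_PHASE_ORDER = [(1, 0), (1, 1), (0, 1), (0, 0)]


def four_phase_transactions(samples):
    """Count completed transactions in a list of (req, ack) samples.
    Raises ValueError if the signals change out of order."""
    completed = 0
    step = 0
    state = (0, 0)                      # both signals assumed low at start
    for sample in samples:
        if sample == state:
            continue                    # no change since the last sample
        if sample != FOUR_PHASE_ORDER[step]:
            raise ValueError("out-of-order handshake at " + repr(sample))
        state = sample
        step = (step + 1) % 4
        if step == 0:
            completed += 1              # ack has returned low
    return completed


one = [(1, 0), (1, 1), (0, 1), (0, 0)]
print(four_phase_transactions([(0, 0)] + one + one))
```

Eight signal transitions complete two transactions here, compared with four transactions for the same number of transitions under the two-way handshake.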

13.1.1 Bundled data signaling

In self-timed systems the four-way handshake is used for communication between any two blocks. Hence, every data unit that moves through the system is associated with request-ack signals. This combination of request-ack and data communication is called bundled data signaling.

13.2 Dual rail signaling

The success of the request-ack protocol lies in how well the timing of the data and request signals can be synchronized. Synchronized here means that the data signals should be ready before the request signal is asserted. In practice this is very difficult to achieve. Such a system also becomes delay sensitive, since it relies purely on the delay of the data and request signals from source to destination.

In order to make the circuit delay independent, the data signals are encoded in such a way that the destination can know when the data is ready and kick-start its operations. The most popular encoding is dual rail signaling. In this signaling each data signal is associated with 2 lines, one for encoding logic 1 and the other for logic 0. Out of the four states of the 2 lines only the states 2’b01 and 2’b10 are valid. The state 2’b00 is the idle state, into which the lines go between every transaction, signifying the completion of the transaction. The state 2’b11 is invalid and never happens. With this type of encoding there is no need for an associated request signal; the data with all lines in a valid state is itself the request.

In Fig 6 a transaction involving two data lines and an ack from the destination is shown. The two data lines D[1:0] are encoded in {D0, D0’} and {D1, D1’} pairs. D0 and D1 are for logic 1 encoding and D0’ and D1’ are for logic 0 encoding. Initially, in cycles 1 and 2, there is no request and all the lines are at zero. In cycle 3 D1 alone becomes 1, indicating a request with a value of logic 1 in D[1]. In cycle 3 D0 and D0’ are still zero. Hence, this is an intermediate state where the request is yet to stabilize. The destination will not take action in this state. In cycle 4 D0 settles at 1, meaning a logic 1 in D[0]. Now the request is stable. The destination performs the operation and acks in cycle 6. In response the request is de-asserted by making all lines zero in cycle 9. Cycle 8 is another intermediate cycle, in which the request is in the process of de-assertion. Once all the signals have become zero, which indicates request de-assertion, the ack signal de-asserts in cycle 10. This completes the transaction. A new transaction starts in cycle 13. Since the encoded data itself acts as the request signal, this encoding helps in making the circuit totally delay insensitive.

Fig 6 Dual Rail Handshaking (timing diagram over cycles 1 to 13: rails D1, D1’, D0, D0’ and Ack, with the phases marked Idle, INT (intermediate) and Valid Request (VR))
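The encoding can be made concrete with a short sketch. The Python below is an illustration only; the names encode, word_status and decode are hypothetical, and each signal is represented as a pair (logic-1 rail, logic-0 rail), so that 2’b10 is logic 1 and 2’b01 is logic 0.

```python
IDLE = (0, 0)      # 2'b00: both rails low between transactions
ZERO = (0, 1)      # 2'b01: logic 0
ONE = (1, 0)       # 2'b10: logic 1


def encode(bit):
    """Dual rail encode one single-rail bit."""
    return ONE if bit else ZERO


def word_status(pairs):
    """Classify a group of dual rail signals as the destination sees it."""
    if any(p == (1, 1) for p in pairs):
        return "invalid"              # 2'b11 must never occur
    if all(p == IDLE for p in pairs):
        return "idle"
    if all(p in (ZERO, ONE) for p in pairs):
        return "valid"                # complete value: acts as the request
    return "intermediate"             # some rails still settling


def decode(pairs):
    """Recover the single-rail bits once the whole word is valid."""
    if word_status(pairs) != "valid":
        raise ValueError("word is not a complete dual rail value")
    return [1 if p == ONE else 0 for p in pairs]
```

For example, word_status([ONE, IDLE]) corresponds to an intermediate cycle such as cycle 3 in Fig 6, where one data line has settled and the other has not.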

13.2 Muller-C Element

Similar to flops in synchronous designs, Muller-C elements act as timing reference points in self-timed systems. They play a key role in bundled data signaling by moving the request-ack control and the data through the system. The output of a Muller-C element becomes 1 only when both inputs are 1 and goes back to 0 only when both inputs are 0; otherwise it holds its previous value. Below is the symbol and circuit realization of a Muller-C element.

Fig7 Muller-C Element
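The behaviour described above can be captured in a few lines. This is a behavioural sketch in Python only, not the gate-level or flip-flop realization of the figures; the class name MullerC and the method update are assumptions for illustration.

```python
class MullerC:
    """Behavioural model of a two-input Muller-C element."""

    def __init__(self):
        self.z = 0                    # output after reset

    def update(self, x, y):
        if x == 1 and y == 1:
            self.z = 1                # both inputs high: output sets
        elif x == 0 and y == 0:
            self.z = 0                # both inputs low: output clears
        # inputs differ: output holds its previous value
        return self.z


if __name__ == "__main__":
    c = MullerC()
    for x, y in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print((x, y), c.update(x, y))
```

The hold case is what makes the element usable as a timing reference: the output changes only once both inputs agree.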

There is a strict timing requirement in realizing a Muller-C element using the combinational circuit shown above: the feedback from z back into the circuit should be faster than any other feedback to the inputs X and Y. This requires careful timing constraints on this path. In order to do away with this requirement, an alternate, flip-flop based way of realizing the Muller-C element is given here. The circuit is shown in Fig 8.

Fig 8 Realization of Muller-C element using a Set/Reset flip-flop

In future FPGAs the AND and NOR gates can be integrated into the FPGA itself with special pins instead of wasting 2 LUTs per component.


13.2 Memory element

A memory element should be able to store a value based on the input data and retain it indefinitely. This can be achieved simply with a set/reset flip-flop by connecting d to the set pin and d’ to the reset pin. With dual rail signaling at the input the data is automatically stored properly.

Fig 9 Memory element using a flip-flop
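Why the dual rail input stores the data "automatically" can be seen from a behavioural sketch. The Python below is an illustration under stated assumptions: the class name MemoryElement is hypothetical, and the invalid input 2’b11 is simply treated as a hold, since the text says that state never occurs.

```python
class MemoryElement:
    """Set/reset storage driven by one dual rail signal (d to set, d' to reset)."""

    def __init__(self):
        self.q = 0

    def update(self, d, d_n):
        if d == 1 and d_n == 0:
            self.q = 1                # logic-1 rail high: set
        elif d == 0 and d_n == 1:
            self.q = 0                # logic-0 rail high: reset
        # idle (0, 0) keeps the stored value; (1, 1) is assumed never to occur
        return self.q


if __name__ == "__main__":
    m = MemoryElement()
    for rails in [(1, 0), (0, 0), (0, 1), (0, 0)]:
        print(rails, m.update(*rails))
```

The return to the idle state between transactions leaves both set and reset inactive, so the stored value is retained.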

Multiple such elements have to be put together with associated decoding logic to build a memory block. This is given in the design approach section.

13.2 Register element

A register element is one that stores the input data when the load signal is one and retains the value when the load signal has become 1’b0. The logic for this is taken from [3]. It is similar to a latch.

Fig 10 Symbol and circuit realization for REG Module

Alternately REG module can also be realized using flip-flop as in the circuit below.

Fig 11 Realization of REG module using set/reset flip-flop

13.2 Examples of self-timed basic components using dual rail signaling

Self-timed components should fully follow the request-ack protocol requirement, namely that the operation should start only after all the input signals are in the valid state. This means that the output of a module should become valid only after all the inputs to the module are valid. Also, the output of the module should go to the idle state only after all the inputs have gone to the idle state. The AND gate, OR gate, MUX and REGISTER modules are taken as examples to demonstrate this function. The AND/OR and REGISTER modules are taken from [3].

13.1.1 Dual Rail AND/OR/MUX Gate

Fig 12 Dual rail AND Gate design from [3]


In this design output lines will become valid only when the inputs are valid as per the dual rail signaling and outputs will go to idle only when the inputs have gone to idle. This is possible due to the Muller-C elements. The same applies to OR gate below.

Fig 13 Circuit for a dual rail OR gate and its symbol
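The input/output discipline of these gates can be sketched behaviourally. The Python below is not the gate-level circuit from [3]; it is a hypothetical model (class name DualRailGate is an assumption) that only reproduces the stated property: the output becomes valid when both inputs are valid, returns to idle when both inputs are idle, and otherwise holds, which is the role played by the Muller-C elements.

```python
IDLE, ZERO, ONE = (0, 0), (0, 1), (1, 0)   # (logic-1 rail, logic-0 rail)


class DualRailGate:
    """Two-input dual rail gate with Muller-C style hold behaviour."""

    def __init__(self, function):
        self.function = function      # single-rail function, e.g. AND or OR
        self.out = IDLE

    def update(self, a, b):
        valid = (ZERO, ONE)
        if a in valid and b in valid:
            result = self.function(a == ONE, b == ONE)
            self.out = ONE if result else ZERO
        elif a == IDLE and b == IDLE:
            self.out = IDLE
        # mixed idle/valid inputs: keep the previous output
        return self.out


if __name__ == "__main__":
    and_gate = DualRailGate(lambda x, y: x and y)
    for a, b in [(ONE, IDLE), (ONE, ZERO), (IDLE, ZERO), (IDLE, IDLE)]:
        print(a, b, and_gate.update(a, b))
```

An OR gate is obtained by passing `lambda x, y: x or y` instead.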

Below is the 2x1 Mux module realized using AND and OR gates.

Fig 14 2x1 Dual Rail Mux using Dual Rail AND-OR Gates

13.1.1 Register module

The REGISTER module is a set of REGISTER elements that follows the request-ack protocol as a whole. REGISTER modules are used as state holding elements in self-timed designs. More details of this module are available in [3]. Only the schematic is given here.

Fig 15 Schematics of a REGISTER module

The design requires the module to latch the incoming data once all the lines have become valid. The ‘D’ elements qualify the validity of the data signals in terms of dual rail protocol.

13.1.1 Achieving state storage registers

The REGISTER modules require the data at the input to become zero (idle state) before latching the next set of valid data. Because of this a single REGISTER module cannot act as a state storing register. In a state machine realized as shown in Fig 16, a single REGISTER module would clear the state bits as soon as they are latched, as per the request-ack protocol. Hence, a series of 2 REGISTER modules is required in order to store the state and present it back to the state machine in the next operation cycle.

Fig 16 State machine using REGISTER Module

The state combo starts its operation only when both the state variables and the inputs are in valid state.


16. A self-timed design flow for FPGAs

Before venturing into the design work of this thesis, a proper flow is established for self-timed systems in FPGAs. The flow should take into account that the design has to be realizable in FPGAs, which have only limited types of components. The flowchart in Fig 17 depicts the design flow for self-timed systems in FPGAs.

Fig 17 Self-timed system design flow


13.2 Design/Implementation Approach

This section gives in detail the approach for a mixed synchronous-asynchronous design. Certain guidelines are given, which will be useful for anyone performing a self-timed design independently.

10.1.1 Synchronous-asynchronous domain isolation

This is an important step in which the designer has to identify the part of the synchronous design that is critical in meeting the given performance. This part of the design has to be carefully isolated from the rest of the design. Isolation means that a clear protocol should be defined between the performance critical part and the rest of the design, so that any dependency between these parts is restricted to the interface protocol. An asynchronous protocol is to be used while interfacing with the self-timed island of the design. Double synchronization of the control signals followed by latching of the data signals would be ideal. The synchronous module should pass the data to the self-timed design following the request-ack dual rail protocol only. This means that the data has to be dual rail encoded and should go to the zero state between every transaction. This might add some overhead. While moving functionality to the self-timed domain, the limitations of the self-timed domain listed earlier have to be borne in mind. To list some:

1. The data should not change once it reaches a valid dual rail value. Such a change causes unpredictable behavior in the self-timed domain. [3] gives an arbitration module design that handles closely changing signals. But this design is prone to unpredictable behavior depending on the timing of the combinatorial loop.

2. A proper strategy should be in place to bring a state based self-timed system out of reset. Since all the components of a self-timed system will be under reset initially, the state REGISTER modules will hold all zeros. The state combo will not trigger from this state. Hence an external special pattern is required to bring the state machine out of reset. A suggestion is given in this work for a state based round-robin scheduler.

3. Resetting the self-timed module is not possible through conventional circuits.

4. Self-timed memories are not available. Using synchronous memories is not possible for reasons such as:

4.1. They would need a clock for the write operation even if the read is asynchronous.

4.2. Asserting the clock/write enable signal after the data to the RAM is valid and de-asserting the clock/write enable before the data becomes invalid require tight timing control, which would make the design delay sensitive.

4.3. A self-timed RAM module is suggested in this work as a solution to this issue.

10.1.2 Breaking the design into request-ack islands

This work is very similar to structuring a synchronous module into multiple design groups, each having a specific goal. Each module has a set of inputs following the dual rail protocol and drives an ack back to the driver of the data. The module has an output that is generated from the current valid data. The output stays in the idle state as long as not all the inputs have become valid. Once all the inputs have become valid the module generates valid dual rail data and waits for the ack to get de-asserted. It is easier to break the design into multiple task specific groups of logic and implement each of them in the request-ack protocol using dual rail encoding.

10.1.3 Violation of dual rail encoding

There may be conditions where one has to control the dual rail protocol using non-dual rail logic (for example using non-dual rail AND/OR gates). Some instances can be seen in the example design. The designer has to take this decision carefully, based on the following points:

1. The output should not glitch, and it should not intermittently go to a valid state and fall back to the idle state

2. The output should not go to the invalid state 2’b11

3. Delay variations across the various signals should not cause the circuit to misbehave


10.1.4 Design of individual Modules

In the current work the design of individual modules is done at gate level, since no defined behavioral constructs are available for self-timed designs. Hence decomposition of the designs into gate structures was necessary. Once a design is decomposed to gate level logic it has to be realized using the basic dual rail components defined before. In the design process it may be necessary to violate the dual rail component usage in certain instances, which should be done carefully.

10.1.5 Muller-C element

As already discussed, the Muller-C element can be realized in two ways:

1. Using combo logic

2. Using a flip-flop

Using combo logic is a straightforward coding of the component in Verilog. Using a flip-flop requires schematic level creation of the component, since a flip-flop with set/reset has to be inferred. It is recommended to build a wrapper around these basic components before using them. This makes it easy to switch between multiple realizations of the component and compare the results.

10.1.6 Memory realization

As defined before, a memory component can be realized using a flip-flop. The associated decoding components should also be embedded along with the memory element. The design of a memory element along with the decoding logic is given in this section.

The memory is a two dimensional array of basic memory elements. Each row stores a word of the memory. The row to be accessed is selected based on the address. The selection happens only when the address is a valid value as per the self-timed protocol. A write to an element happens only when the write enable and the row enable signals are valid dual rail signals. A read happens only when the row enable signal is a valid dual rail signal.

Fig 18 Memory element design

The element_we signal is the ANDed version of ce for that row and we. The write to the memory element happens only if element_we[1] is asserted. This ensures that the write data does not corrupt the memory element during a read, where element_we will be 2’b01. Similarly, the read data is passed out only for a read operation, which is qualified by ANDing ce and the inverted version of we. Inversion is achieved by simply connecting the dual rail signal the reverse way ([1] to [0] and [0] to [1]).
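The write and read qualification can be sketched as follows. This Python fragment is an idealised illustration (it ignores the hold behaviour of the real dual rail gates); the names dr_and, dr_invert, element_controls and the signal name element_re are assumptions for the example. Each signal is a pair (rail [1], rail [0]).

```python
IDLE, ZERO, ONE = (0, 0), (0, 1), (1, 0)   # (rail [1], rail [0])


def dr_invert(signal):
    """Dual rail inversion: swap the two rails."""
    return (signal[1], signal[0])


def dr_and(a, b):
    """Idealised dual rail AND: valid only when both inputs are valid."""
    if a in (ZERO, ONE) and b in (ZERO, ONE):
        return ONE if (a == ONE and b == ONE) else ZERO
    return IDLE


def element_controls(ce, we):
    """Return (element_we, element_re) for one row of the memory."""
    element_we = dr_and(ce, we)               # write only if row selected and we = 1
    element_re = dr_and(ce, dr_invert(we))    # read only if row selected and we = 0
    return element_we, element_re


if __name__ == "__main__":
    print(element_controls(ONE, ONE))    # selected row, write
    print(element_controls(ONE, ZERO))   # selected row, read
```

During a read of the selected row element_we evaluates to 2’b01, so rail [1] stays low and the stored value is not disturbed.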

11 Example design

To illustrate the advantage of the mixed synchronous-asynchronous approach, an example design is taken here, with which we traverse the entire flow given above. The design involves a memory access arbiter with multiple clients contending to access a memory element. The challenges that a synchronous designer would face in designing this system are:

1. Client to arbiter interfacing latency

1.1. For better timing the signals across the interfaces are clocked once before input and once before output. These clocking stages introduce latency, effectively affecting the performance

2. Arbitration latency in case of multiple clients

2.1. When multiple clients are present the arbitration logic involves a huge combinatorial path, which cannot complete in one cycle

2.2. Usually this path is pipelined, which adds to the latency when requests are spaced apart and therefore gain nothing from any pipeline efficiency

3. Memory access latency


3.1. A memory access latency higher than the clock period is resolved in multiples of clock cycles. For a safe margin an additional clock cycle is provided. This adds to the wastage

4. Memory affinity flopping

4.1. Since memory elements are not placed close to the logic, affinity flopping is done near the logic and near the memory

4.2. This adds significantly to the performance degradation, which sometimes becomes very difficult to manage in synchronous designs

First the design is realized using the synchronous approach and its performance bottlenecks are studied. Later the same design is realized using the synchronous-asynchronous approach and the performance improvement is studied. The logic and power data will also be compared at the end.

13.2 Requirements of the Memory Arbiter

1. Multiple clients requiring access to a memory element

1.1. The access can be a read or a write access

1.2. Read access

1.2.1. A request is asserted along with the address being stable

1.2.2. Request is a level signal till ack is received from the arbiter

1.2.3. The memory access arbiter provides the read data along with the ack

1.2.4. An access completes with request going low followed by ack going low

1.3. Write access

1.3.1. A request is asserted along with the address and write data being stable

1.3.2. The memory access arbiter performs the write into the given location with the data and asserts ack back to the client

1.3.3. The access completes with request going low followed by ack going low

2. The memory access arbiter should be scalable to handle multiple clients

3. The arbitration scheme is a priority based arbitration scheme

4. The memory can be placed anywhere in the chip and hence the access latency can vary significantly due to routing delay

Fig 19 System picture with external interfaces


16. Full Synchronous approach

13.2 Architecture

The full synchronous design has an arbiter that takes the requests from all the clients. The requests are detected at the rising edge and kept as sticky bits. A priority encoder chooses the next port to serve. Once a port is chosen, the selection logic is disabled till the memory access for the current request completes. A 4x1 multiplexer selects the signals of the selected port and the appropriate access is performed. After the memory access delay an ack is generated to the clients.

13.1.1 Arbiter Module

12.1.1.1 Features

1. Latching of requests

2. Priority encoding and latching the selected port

3. Blocking any further port change till the selected access completes

4. Generating an ack after the memory access latency

12.1.1.2 Design

The design of the arbiter module is shown in the figure below.


To achieve predictable timing closure the ASIC design guidelines strictly enforce affinity flopping at inter-module level and at memory interfaces. This affinity flopping greatly affects the performance of the design. The resulting timing diagram for the above module shows that it takes 15 cycles for the access to complete and for the client to get an ack for a request. This is 150 ns with a 100 MHz clock. The read-modify-write example given previously would be badly affected by this latency.

12.1.2 Performance bottleneck

It is evident from the previous section that the arbiter and affinity flopping become major bottlenecks in arbiter based control path designs where memory access becomes the core of the issue. This introduces unnecessary latency issues, which can bring down the performance well below the maximum possible. In the next section the synchronous-asynchronous combination will be tried out to see the improvement in performance.

16. Synchronous-Asynchronous approach

13.2 Architecture

13.1.1 Separation of synchronous and self-timed domains

The primary bottleneck in the memory arbiter design is the arbiter and the memory access latency. These areas require some performance tweaking. Increasing the frequency would have a lot of impact from the system point of view. It would also require more pipelining, which can increase the latency. Hence, the arbiter and memory access logic can be replaced with self-timed logic in order to improve the performance. This approach will be tried to see if there is any performance improvement and the extent of the improvement. Along with performance improvement, other factors like logic increase will also be studied to give a neutral picture of the effect of the mixed synchronous-asynchronous approach.

It must be evident by now that the synchronous part of the design will be the clients (which are not considered in this design) and an interface module that converts the client request protocol to the request-ack protocol using dual rail signaling. This synchronous interface module will consist of synchronizers and single rail to dual rail converters for interfacing with the self-timed modules.

13.1.2 Architecture

The architecture of the synchronous interface block and the self-timed arbiter is shown in Fig 20. The architecture shown is an obvious one for a typical arbiter accessing a memory.

The functionality of the synchronous interface block (SIB) has been seen in the previous section. The Arbiter block can be a priority based or round-robin based arbitration module. For every valid dual rail request from the SIB the arbiter generates port information to the MUX module. The port information is in dual rail format. The port becomes the location of access for the memory. The MUX module receives the per port data and read/write command signals from the SIB. It uses the port information from the ARB to multiplex the data and read/write signals from the SIB. The multiplexed data, the read/write information and the port information are passed to the memory module and the ACKGEN module.

The MEM module performs the memory access based on the commands from the MUX module. The ACKGEN waits for the command completion before generating an ack to the SIB. In case of a read, ACKGEN waits for the read data from the MEM module to become valid before asserting ack along with the read data and port information to the SIB. In case of a write, ACKGEN generates an ack to the SIB as soon as the information from the MUX is valid. Here it is assumed that the write information will reach the memory before the ack reaches the SIB module and the SIB deasserts the data and read/write command signals. This requirement will surely be met, since the SIB synchronizes the ack signal before deasserting the signals.

The design of each of the modules will be given in detail in the following sections. The enforcement of the self-timed design requirements will be done along with the description of the design of the blocks.


Fig 20 Architecture of the memory arbiter block

13.1.3 Synchronous Interface Block (SIB)

13.1.3.1 Features

1. Latch the request from the clients

2. Pass the latched request to ARBITER and the data, read/write signals to the MUX module in dual rail protocol. Keep the signals stable till ack is received from ACKGEN

3. Double synchronize the ack from ACKGEN. On detecting the double synchronized ack from ACKGEN assert ack to the client along with latched read data

4. Reset the request, data and read/write signals asynchronously with assertion of ack signal from ACKGEN. This helps in saving a few cycles during the transaction

13.1.3.2 Design

1. Single to Dual rail converter

1.1. This takes the single rail signals and converts them to dual rail by connecting the input to [1] and the inverted value to [0]

2. ARB-MUX interface registers and the reset/latch control logic

2.1. These registers should follow the dual rail based request-ack protocol

2.2. Once a valid request is present and the ag_sync_ack signal is 1’b0, the logic latches the request once and waits for the ack

2.3. The request/ack module checks the input request signal in order to disable the latching once a valid request is found. Since it is possible that ag_sync_ack is still present, in which case the registers may not latch the inputs, it also checks whether the requests are reflected at the output of the registers before disabling the latching

2.4. The latching happens in dual rail format directly. This is required since all the dual rail signals should reset to zero in between accesses

2.5. When ag_sync_ack gets asserted it asynchronously resets the registers. This saves the double synchronization time in the entire handshake.

3. Double synchronization and data/port latching logic

3.1. This is a simple double synchronization module for synchronizing the ag_sync_ack signal. A rising edge is detected on the double synchronized signal and this pulse is used to latch the read data and port. The port directly becomes the ack signal.

3.2. It is assumed that the read data will have been latched by the time the zero propagates back to the read data

Fig 21 Design of the SIB block
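The single to dual rail conversion and the return to zero on ack can be sketched as below. This Python fragment is a simplified, combinational view for illustration only; the real block uses registers and asynchronous reset as described above, and the function names single_to_dual and drive_word are hypothetical.

```python
IDLE = (0, 0)                         # all rails low between accesses


def single_to_dual(bit):
    """Connect the input to rail [1] and its inverse to rail [0]."""
    bit &= 1
    return (bit, bit ^ 1)             # (rail [1], rail [0])


def drive_word(bits, request_latched, ag_sync_ack):
    """Dual rail word presented to the self-timed side.

    The word is driven while a request is latched and no ack has arrived;
    asserting ag_sync_ack forces every pair back to the all-zero state.
    """
    if request_latched and not ag_sync_ack:
        return [single_to_dual(b) for b in bits]
    return [IDLE] * len(bits)


if __name__ == "__main__":
    print(drive_word([1, 0, 1], True, False))
    print(drive_word([1, 0, 1], True, True))
```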


The interactions are shown in the figure below:

Fig 22 SIB to ARB/ACKGEN interactions (timing diagram over cycles 1 to 17; signals shown: External Req, Sync_arb_req, Sync_arb_data, Ag_sync_ack, Ag_sync_data, Double Synchronization, Rdata to client; the dual rail buses alternate between DUAL RAIL ZERO and their valid REQ, DATA/RW and RDATA/PORT values, ending with LATCHED RDATA/ACK to the client)

In Fig 22 the delay from the deassertion of sync_arb_data in cycle 9 to the latching of read data and port in cycle 13 is important. The delay should be higher than 3 clock cycles for the read data to be latched properly. This timing will be easily met considering the amount of logic in the arbiter, mux, memory and ackgen modules.


13.1.4 Arbiter module design

13.1.4.1 Features

1. Performing priority based scheduling on the request signals from the SIB module

2. The priority encoder is structured in such a way that it can be scaled to a higher number of ports

3. Safely returning a fast ack to the SIB module in case all the requests are zeroes due to an error in the SIB. This ensures that the MUX and the following modules don’t get mistriggered when an all zero request is generated by the SIB.

13.1.4.2 Design

The design of the arbiter block is shown in the figure below. The arbiter block is primarily the 4x1 priority encoder with additional logic for returning an ack on a zero request. The AND logic on the port number ensures that the MUX is not activated if the PRI encoder does not find any valid request. The PRI encoder indicates the presence of any valid request by making pri_req[1:0] dual rail 1 (2’b10). If all the requests are dual rail zero (2’b01), pri_req is also dual rail zero (2’b01).

Fig 20 Design of the ARBITER block

Note that the AND gates used here are not dual rail AND gates. There is a violation of the dual rail encoding in these places. But, none of the logic will spuriously generate any glitches. The possible transitions on the pri_req signal are:

1. 00 → 10 → 00

2. 00 → 01 → 00

In these only the first transition will enable the outputs to MUX module. The transition from 10 to 00 will follow from the requests becoming zero and the intermediate gates in the PRI encoder are all dual rail gates. Hence, this transition is glitch free.

Similarly, the second transition is also a dual rail transition and is glitch free. During this transition the output AND gates will continue to drive zeros without any issues. Hence, this is a safe violation.

The design of the 4x1 PRI encoder involves two levels of 2x1 PRI encoders. The highest priority goes to port 0 and the lowest to port 3. The first two 2x1 PRI encoders choose between ports 0/1 and 2/3. The second level chooses between the outputs of these two priority encoders.

Each of the 2x1 priority encoders has an OR logic to check if any request is present. The selected port is just the inversion of the high priority port request. This works as follows:

1. If port zero has a request then the selected port is zero

2. If port zero has no request then the selected port is 1

3. Since the ORed request stays as the qualifier, it will qualify whether port 1 has a request

4. Hence, the selected port is just the inversion of the request for port 0.

In dual rail terminology:

5. The selected port is the cross connection of the dual rail request of port 0

The circuit realization of the 2x1 priority encoder is shown in the figure below.

Fig 21 2x1 Priority Encoder

The req_out signals from the two first-level 2x1 priority encoders are taken as the requests to the second-level 2x1 priority encoder. This priority encoder generates the final selected port bit [1]. It also generates the req_out signal, which goes to the arbiter. The port bit [0] is generated by muxing the selected_port outputs of the two first-level 2x1 priority encoders, using the selected_port[1:0] from the second-level 2x1 priority encoder as the select signal. The design is shown in the figure below.
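The two-level structure can be checked with a small idealised model. The Python below is a sketch under stated assumptions: the gates are modelled combinationally (valid output only when all inputs are valid, no Muller-C hold), and the names dr_or, dr_mux, pri2 and pri4 are hypothetical. Each signal is a pair (rail [1], rail [0]); a request of ONE means the port is requesting.

```python
IDLE, ZERO, ONE = (0, 0), (0, 1), (1, 0)   # (rail [1], rail [0])


def dr_valid(s):
    return s in (ZERO, ONE)


def dr_invert(s):
    """Cross connection of the two rails."""
    return (s[1], s[0])


def dr_or(a, b):
    if dr_valid(a) and dr_valid(b):
        return ONE if ONE in (a, b) else ZERO
    return IDLE


def dr_mux(sel, a, b):
    """Pick a when sel is dual rail 0, b when sel is dual rail 1."""
    if dr_valid(sel) and dr_valid(a) and dr_valid(b):
        return b if sel == ONE else a
    return IDLE


def pri2(req_high, req_low):
    """2x1 priority encoder: returns (req_out, selected_port)."""
    return dr_or(req_high, req_low), dr_invert(req_high)


def pri4(r0, r1, r2, r3):
    """4x1 priority encoder built from three 2x1 encoders and one mux."""
    req_a, sel_a = pri2(r0, r1)            # ports 0/1
    req_b, sel_b = pri2(r2, r3)            # ports 2/3
    req_out, port1 = pri2(req_a, req_b)    # second level gives port bit [1]
    port0 = dr_mux(port1, sel_a, sel_b)    # port bit [0] from the chosen pair
    return req_out, (port1, port0)


if __name__ == "__main__":
    print(pri4(ZERO, ZERO, ONE, ONE))      # port 2 wins over port 3
```

When no port is requesting, req_out is dual rail zero and the port value carries no meaning, which is why the arbiter qualifies the port with the request.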


13.1.5 MUX Block

13.1.5.1 Features

1. 4x1 muxing of the data and the read/write enable signal using the port signal from the arbiter block

2. Ensure dual rail completeness

3. Generate the write data, port and read/enable signals to the memory block

Fig 22 Design of the 4x1 priority encoder from 2x1 priority encoders

13.1.5.2 Design

Instead of performing a direct 4x1 muxing, the 2-bit select signal is decoded into 4 outputs and an AND-OR plane is then used for muxing. The logic is replicated for each of the data signals and the read/write signals. The logic for one of the data signals is shown in the schematic below.

Fig 23 Design of 4x1 MUX module

The OR gate is a 4x1 OR gate, which is again built out of 3 2x1 dual rail OR gates. The decoder is built out of four dual rail AND gates. This is a fully dual rail design.
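The decode plus AND-OR structure for one data bit can be sketched as follows. This Python model is idealised (combinational gates that are valid only when all inputs are valid) and the names dr_and, dr_or, decode2 and mux4 are assumptions for illustration. Signals are pairs (rail [1], rail [0]).

```python
IDLE, ZERO, ONE = (0, 0), (0, 1), (1, 0)   # (rail [1], rail [0])


def dr_valid(s):
    return s in (ZERO, ONE)


def dr_invert(s):
    return (s[1], s[0])


def dr_and(a, b):
    if dr_valid(a) and dr_valid(b):
        return ONE if (a == ONE and b == ONE) else ZERO
    return IDLE


def dr_or(a, b):
    if dr_valid(a) and dr_valid(b):
        return ONE if ONE in (a, b) else ZERO
    return IDLE


def decode2(port1, port0):
    """Decode the 2-bit port into 4 select lines using four AND gates."""
    n1, n0 = dr_invert(port1), dr_invert(port0)
    return [dr_and(n1, n0), dr_and(n1, port0),
            dr_and(port1, n0), dr_and(port1, port0)]


def mux4(port, data):
    """AND-OR plane for one data bit; the 4-input OR uses three 2-input ORs."""
    selects = decode2(*port)
    terms = [dr_and(s, d) for s, d in zip(selects, data)]
    return dr_or(dr_or(terms[0], terms[1]), dr_or(terms[2], terms[3]))


if __name__ == "__main__":
    print(mux4((ONE, ZERO), [ZERO, ZERO, ONE, ZERO]))   # port 2 selected
```

In this model the output stays idle until the port and all four data inputs are valid, which is one way to read the dual rail completeness requirement listed in the features.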

13.1.6 Memory module design

13.1.6.1 Features

1. 4x8 memory element with read and write facility

2. Read data to become dual rail valid only for a read operation. It should be in the all zeros state during a write

3. Read data not to glitch during a read operation

13.1.6.2 Design

The memory module is built out of the memory elements described earlier (Fig 18). From the implementation point of view, 8 basic memory elements are put together to form a single word location. Common element_we and re signals are generated for the entire word, saving Muller-C elements. 4 such words can be stacked to form a memory block. This architecture is highly scalable to deeper and wider memory blocks. The design of a memory word is shown in the figure below.

Fig 24 Design of Memory word


The memory block is constructed by putting together multiple such rows and ORing the read data from each of the rows.

13.1.7 ACKGEN module

13.1.7.1 Features

1. In the read case, generating an ack only after the read data from memory and the port from the mux module are dual rail valid

2. In the write case, generating an ack after the write data signal and the port from the mux module are dual rail valid

3. Ensuring glitch free assertion and deassertion of the ack signal

4. Careful violation of the dual rail protocol to generate the single rail ack signal

13.1.7.2 Design

The ackgen module uses the D element to detect the dual rail validity of various signals in order to generate the read and write acks. The D element design is shown in the figure below. It is taken from [3]. The element gives an output of 1 once all the inputs are dual rail valid. The output becomes zero only when all the inputs have returned to the dual rail zero state.

Fig 25 Design of D element taken from[3]

In the write case the write data and port are given to the D element. Its output and rw_n[0] (1 means write) are given to a Muller-C element to generate the write ack. This logic gets set only after the write data and port signals are valid and the access is a write access. In the read case the read data and port are given to the D element. Its output and rw_n[1] are given to another Muller-C element to generate the read ack. This logic gets set only after the read data and port signals are valid and the access is a read access. The read and write acks are ORed to generate ag_sync_ack. Since the acks are finally controlled by the rw_n[1:0] signal, only one of them will be asserted, and because of the D and Muller-C elements there is no possibility of a glitch. The design is shown below.

Fig 26 Design of ACKGEN module
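The behaviour described for the ACKGEN module can be sketched as a small Python model. It is a behavioural illustration only, not the gate-level design of Fig 25 or Fig 26. It assumes that a dual rail bit is a (rail, rail) pair that is valid when exactly one rail is 1 and in the zero state when both rails are 0, and that the Muller-c element follows its inputs when they agree and otherwise holds its output. The class names CElement, DElement and AckGen are hypothetical.

```python
class CElement:
    """Muller-c element: output follows the inputs when they agree, else holds."""

    def __init__(self):
        self.out = 0

    def step(self, a, b):
        if a == b:
            self.out = a
        return self.out


class DElement:
    """Validity detector with hysteresis, following the description in the text."""

    def __init__(self):
        self.out = 0

    def step(self, bits):
        bits = list(bits)
        if all(t != f for t, f in bits):               # every bit dual rail valid
            self.out = 1
        elif all((t, f) == (0, 0) for t, f in bits):   # every bit in the zero state
            self.out = 0
        return self.out                                # otherwise hold


class AckGen:
    """Write and read ack paths ORed into a single rail ack (ag_sync_ack)."""

    def __init__(self):
        self.d_wr, self.d_rd = DElement(), DElement()
        self.c_wr, self.c_rd = CElement(), CElement()

    def step(self, wr_data, rd_data, port, rw_n):
        # rw_n[0] = 1 marks a write access, rw_n[1] = 1 marks a read access.
        wr_ack = self.c_wr.step(self.d_wr.step(list(wr_data) + list(port)), rw_n[0])
        rd_ack = self.c_rd.step(self.d_rd.step(list(rd_data) + list(port)), rw_n[1])
        return wr_ack | rd_ack


# Hypothetical 2-bit write: the ack rises once data, port and rw_n[0] are all set.
NULL2 = [(0, 0), (0, 0)]
ackgen = AckGen()
ack_idle = ackgen.step(NULL2, NULL2, [(0, 0)], (0, 0))
ack_write = ackgen.step([(1, 0), (0, 1)], NULL2, [(0, 1)], (1, 0))
```

In this model the ack stays asserted while the inputs are only partly returned to zero, and falls once the data, the port and rw_n are all back in the zero state, which is the hysteresis that keeps the single rail ack free of glitches.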

13.2 Performance measurement

The synchronous-asynchronous architecture was synthesized and fitted in the Xilinx Virtex device XCV50. The post-route netlist was simulated with delay back-annotation using the SDF file provided by the tool. The simulation timing results are shown in the figure below.

1. It can be seen from the figure below that the delay from request assertion to ack is 18+62+30 = 110ns

2. Due to additional clocking at the Client, it finally becomes 120ns

3. This is 30ns, or 3 clock cycles, less than the fully synchronous design. In many designs the performance is affected by a margin of 2 to 3 cycles, where this improvement will be a big saving


4. The timing diagram also shows that the requirement that the read data and ack should not get deasserted before the read data is latched is met

5. The saving in the number of clock cycles is not a big number. This number is just an indicator of possible savings in bigger designs where the pipelining requirement would be even more.

6. Every pipeline stage is going to increase latency by one clock cycle. But adding logic in a self-timed design is going to increase the latency only by the amount of logic delay and routing delay
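The latency figures in the list above can be checked with a short calculation. The sketch assumes a 100 MHz clock, i.e. a 10 ns period, which is the operating point used for the access latency comparison in this report. The 150 ns value is the fully synchronous access latency reported in the summary. The 4 ns delay used in the last two lines is a purely hypothetical number, included only to contrast one extra clocked stage with one extra block of self-timed logic.

```python
CLOCK_MHZ = 100
CLOCK_PERIOD_NS = 1000 // CLOCK_MHZ          # 10 ns per clock cycle

# The three delay segments read from the timing diagram (item 1).
handshake_segments_ns = [18, 62, 30]
req_to_ack_ns = sum(handshake_segments_ns)

mixed_latency_ns = 120                       # after additional clocking at the Client (item 2)
sync_latency_ns = 150                        # fully synchronous design

saving_ns = sync_latency_ns - mixed_latency_ns
saving_cycles = saving_ns // CLOCK_PERIOD_NS


def added_latency_sync(extra_stages):
    """Each extra pipeline stage costs one full clock cycle."""
    return extra_stages * CLOCK_PERIOD_NS


def added_latency_self_timed(extra_delays_ns):
    """Extra self-timed logic costs only its own logic and routing delay."""
    return sum(extra_delays_ns)


HYPOTHETICAL_DELAY_NS = 4                    # assumed value, not a measurement
sync_cost_ns = added_latency_sync(1)
self_timed_cost_ns = added_latency_self_timed([HYPOTHETICAL_DELAY_NS])
```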

16. Summary

Parameters                  Fully         Synchronous-       Synchronous-       Synchronous-
                            Synchronous   Asynchronous       asynchronous       asynchronous
                                          including memory   excluding memory   excluding memory
                                          and Muller         element            element and with
                                          element logic                         special muller-c
                                                                                flops

Access Latency at 100 MHz   150ns         120ns              120ns              120ns
LUTs                        96            968                780                238
Flip-flops                  188           496                444                444
Logic Power (mW)            14            13                 13                 13

Table 1 Comparative figures


Figure 31 Comparative Charts

Unit      LUTs   Flip-flops
SIB       116    149
ARB       57     24
ACKGEN    38     4
MUX       576    268
MEM       188    52

Table 2 Unit-wise Logic breakup

The above tables give the comparative figures for the fully synchronous and synchronous-asynchronous designs of the memory arbiter. It is evident from Table 2 that the MUX module is the major contributor to the logic. It consumes more than 50% of the total logic. The main reason behind this is the dual rail protocol. An alternate encoding scheme would surely reduce the area to a great extent. Also, there is not much power saving. The only advantage as of now is the latency improvement. Hence, the self-timed with synchronous logic approach would be appropriate in those areas where the cost of performance is much higher than the cost of area. Surely, there are designs which have this requirement, and the synchronous-asynchronous approach can be tried in those areas. This approach anyway has to wait for some improvement in area before it will be widely accepted by the design community.
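The share of the MUX module can be worked out directly from the unit-wise figures in Table 2. The short sketch below does the arithmetic. The dictionary names and the share helper are hypothetical. Note that the unit-wise figures add up to 975 LUTs and 497 flip-flops, slightly different from the totals of 968 and 496 given in Table 1, so the percentages are approximate.

```python
# Unit-wise figures copied from Table 2.
luts = {"SIB": 116, "ARB": 57, "ACKGEN": 38, "MUX": 576, "MEM": 188}
flip_flops = {"SIB": 149, "ARB": 24, "ACKGEN": 4, "MUX": 268, "MEM": 52}


def share(table, unit):
    """Percentage of the summed resource count that one unit consumes."""
    return 100 * table[unit] / sum(table.values())


mux_lut_share = round(share(luts, "MUX"), 1)         # about 59.1 %
mux_ff_share = round(share(flip_flops, "MUX"), 1)    # about 53.9 %
```

Both shares are above 50%, which is consistent with the observation that the MUX module dominates the logic cost.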

16. Future scope

From the summary, quite a few improvement areas can be identified, which would open the gates towards future work in this line:

1. Reduction in area by coming up with a newer encoding scheme other than dual rail

2. An interpreter that can translate synchronous design to self-timed design

3. Direct self-timed memory and Muller-c flop realization in FPGAs and ASICs


16. Conclusion

It has been shown that it is possible to realize a successful mixed synchronous-asynchronous design that can help overcome performance bottlenecks in control path designs like arbiters. But, with the huge logic cost, self-timed systems have a long way to go before they gain wide acceptance among the design community. As of now, the applicability of such mixed designs is restricted to areas where performance is more critical than the logic increase. Anyway, this thesis work has proven the feasibility of mixed synchronous-asynchronous design, which itself can act as a major encouragement for improvements in areas like area reduction and the realization of special asynchronous components in ASICs and FPGAs.

17. Reference

[1] Scott Hauck, Asynchronous design methodologies, Proceedings of the IEEE, Vol. 83, No. 1, pp. 69-93, January 1995

[2] Erik Brunvand, Using FPGAs to Implement Self-Timed Systems, Computer Science Dept., University of Utah, January 8, 1992

[3] Narinder Pal Singh, A Design Methodology for Self-timed systems, Massachusetts Institute of Technology

[4] Chris Myers, Asynchronous Circuit Design, John Wiley & Sons

[5] D. A. Edwards, W. B. Toms, The Status of Asynchronous Design in Industry, Information Society Technologies (IST) Programme Concerted Action Thematic Network Contract IST-1999-29119, 2nd Edition, Feb 2003

[6] Ivan E. Sutherland, Micropipelines, Communications of the ACM, Volume 32, Number 6, June 1989

[7] Al Davis and Steven M. Nowick, An Introduction to Asynchronous Circuit Design, September 19, 1997

[8] Erik Brunvand, Introduction to Asynchronous Circuits and Systems, University of Utah, USA

[9] H. (Kees) van Berkel, Mark B. Josephs, and Steven M. Nowick, Scanning the Technology: Applications of Asynchronous Circuits, C

[10] Fu-Chiung Cheng, Practical Design and Performance Evaluation of Completion Detection Circuits, Department of Computer Science, Columbia University

[11] David A. Rennels and Hyeongil Kim, Concurrent Error Detection in Self-Timed VLSI, Computer Science Department, USC, LA. Using Custom VLSI components. Gives a good account on testability of self-timed designs.

[12] Robert M. Fuhrer (Dept. of Computer Science, Columbia University, New York, NY 10027), Bill Lin (IMEC Laboratory, Kapeldreef 75, B-3001 Leuven, Belgium), Steven M. Nowick (Dept. of Computer Science, Columbia University, New York, NY 10027), Symbolic Hazard-Free Minimization and Encoding of Asynchronous Finite State Machines

[13] Speculative Completion for the Design of High-Performance Asynchronous Dynamic Adders, IEEE Async97

[14] Scott Hauck, Steven Burns, Gaetano Borriello, Carl Ebeling, An FPGA for Implementing Asynchronous Circuits, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195

[15] Christian D. Nielsen and Alain J. Martin, A Delay Insensitive Multiply-Accumulate Unit, Computer Science Department, Cal Tech, Feb 12, 1992

[16] Karl M. Fant, Scott A. Brandt, Null Convention Logic, Theseus Logic, Inc.

[17] Tak Kwan Lee, A General Approach to Performance Optimization of Asynchronous Circuits, Thesis, Cal Tech, May 1995

[18] Alain J. Martin, Robustness in Asynchronous VLSI Techniques, Department of Computer Science, Cal Tech, June 1994

[19] Henrik Hulgaard, Steven M. Burns and Gaetano Borriello, Testing the Asynchronous Circuits: A Survey, Dept. of Computer Science and Engineering, March 1994

[20] Kip C. Killpack, Analysis and Characterization of a Locally Clocked Module, Thesis, University of Utah, May 2002

[21] Virtex-II, Field Programmable Gate Arrays, Advance Product Specification, V1.9, September 26, 2002
