fpga implementation of soft decision low power convolutional decoder using viterbi algorithm


FPGA IMPLEMENTATION OF SOFT

DECISION LOW POWER CONVOLUTIONAL

DECODER USING VITERBI ALGORITHM


CHAPTER 1

INTRODUCTION

1.1. OVERVIEW

"Viterbi Algorithm (VA) decoders are very popular. They are

currently used in about one billion cell phones. This is probably one of the largest numbers in any application. However, the largest current consumer of VA processor cycles is probably digital video broadcasting. A recent estimate at Qualcomm is that approximately 10^15 bits per second are now being decoded by the VA in digital TV sets around the world, every second of every day."

1.2. OBJECTIVE

The purpose of this project is to introduce the reader to a forward error correction technique known as convolutional coding with Viterbi decoding. It gives a detailed description of the algorithms for generating random binary data, convolutionally encoding the data, passing the encoded data through a noisy channel, quantizing the received channel symbols, and performing Viterbi decoding on the quantized channel symbols to recover the original binary data.

The purpose of forward error correction (FEC) is to improve the

capacity of a channel by adding some carefully designed redundant

information to the data being transmitted through the channel. The

process of adding this redundant information is known as channel

coding. Convolutional coding and block coding are the two major

forms of channel coding. Convolutional codes operate on serial data,

one or a few bits at a time. Block codes operate on relatively large

(typically, up to a couple of hundred bytes) message blocks. There are

a variety of useful convolutional and block codes, and a variety of

algorithms for decoding the received coded information sequences to

recover the original data.


Convolutional encoding with Viterbi decoding is a FEC

technique that is particularly suited to a channel in which the

transmitted signal is corrupted mainly by additive white Gaussian

noise (AWGN). You can think of AWGN as noise whose voltage

distribution over time has characteristics that can be described using

a Gaussian, or normal, statistical distribution, i.e. a bell curve. This

voltage distribution has zero mean and a standard deviation that is a

function of the signal-to-noise ratio (SNR) of the received signal. Let's

assume for the moment that the received signal level is fixed. Then if

the SNR is high, the standard deviation of the noise is small, and vice-

versa. In digital communications, SNR is usually measured in terms of

Eb/N0, which stands for energy per bit divided by the one-sided noise

density.

Let's take a moment to look at a couple of examples. Suppose

that we have a system where a '1' channel bit is transmitted as a

voltage of -1V, and a '0' channel bit is transmitted as a voltage of +1V.

This is called bipolar non-return-to-zero (bipolar NRZ) signaling. It is

also called binary "antipodal" (which means the signaling states are

exact opposites of each other) signaling. The receiver comprises a

comparator that decides the received channel bit is a '1' if its voltage is

less than 0V and a '0' if its voltage is greater than or equal to 0V. One

would want to sample the output of the comparator in the middle of

each data bit interval. Let's see how our example system performs,

first, when the Eb/N0 is high, and then when the Eb/N0 is lower.

Figure 1.1 shows the results of a channel simulation

where one million (1 x 10^6) channel bits are transmitted through an

AWGN channel with an Eb/N0 level of 20 dB (i.e. the signal voltage is

ten times the rms noise voltage). In this simulation, a '1' channel bit

is transmitted at a level of -1V, and a '0' channel bit is transmitted at

a level of +1V. The x axis of this figure corresponds to the received

signal voltages, and the y axis represents the number of times each

voltage level was received.


Figure 1.1 Results of a channel simulation

Our simple receiver detects a received channel bit as a '1' if its

voltage is less than 0V and as a '0' if its voltage is greater than or equal

to 0V. Such a receiver would have little difficulty correctly receiving a

signal as depicted in Figure 1.1. Very few (if any) channel bit

reception errors would occur. In this example simulation with the

Eb/N0 set at 20 dB, a transmitted '0' was never received as a '1', and a

transmitted '1' was never received as a '0'.

Figure 1.2 shows the results of a similar channel simulation

when 1 x 10^6 channel bits are transmitted through an AWGN channel

where the Eb/N0 level has decreased to 6 dB (i.e. the signal voltage is

two times the rms noise voltage):


Figure 1.2 Results of a channel simulation

Now observe how the right-hand side of the red curve in

Figure 1.2 crosses 0V, and how the left-hand side of the blue curve also

crosses 0V. The points on the red curve that are above 0V represent

events where a channel bit that was transmitted as a one (-1V) was

received as a zero. The points on the blue curve that are below 0V

represent events where a channel bit that was transmitted as a zero

(+1V) was received as a one. Obviously, these events correspond to

channel bit reception errors in our simple receiver. In this example

simulation with the Eb/N0 set at 6 dB, a transmitted '0' was received

as a '1' 1,147 times, and a transmitted '1' was received as a '0' 1,207

times, corresponding to a bit error rate (BER) of about 0.235%. That's

not so good, especially if you're trying to transmit highly compressed

data, such as digital television. I will show you that by using

convolutional coding with Viterbi decoding, you can achieve a BER of

better than 1 x 10^-7 at the same Eb/N0, 6 dB.
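As a rough cross-check of the uncoded error rates quoted above, the following sketch evaluates the standard textbook expression for antipodal signaling over an AWGN channel, BER = 0.5 erfc(sqrt(Eb/N0)). The function name is illustrative and is not taken from the simulation described in this chapter.

```python
import math

def uncoded_bpsk_ber(eb_n0_db):
    """Theoretical BER of uncoded antipodal signaling over AWGN."""
    eb_n0 = 10.0 ** (eb_n0_db / 10.0)   # convert dB to a linear ratio
    return 0.5 * math.erfc(math.sqrt(eb_n0))

for level_db in (6.0, 20.0):
    print(level_db, uncoded_bpsk_ber(level_db))
```

At 6 dB this evaluates to roughly 0.24%, in line with the simulated value of about 0.235%, and at 20 dB the result is vanishingly small, which agrees with the absence of errors in the first simulation.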

Convolutional codes are usually described using two

parameters: the code rate and the constraint length. The code rate,


k/n, is expressed as a ratio of the number of bits into the

convolutional encoder (k) to the number of channel symbols output by

the convolutional encoder (n) in a given encoder cycle. The constraint

length parameter, K, denotes the "length" of the convolutional

encoder, i.e. how many k-bit stages are available to feed the

combinatorial logic that produces the output symbols. Closely related

to K is the parameter m, which indicates how many encoder cycles an

input bit is retained and used for encoding after it first appears at the

input to the convolutional encoder. The m parameter can be thought

of as the memory length of the encoder. This project focuses on rate 1/2

convolutional codes.

1.3. LITERATURE SURVEY

As part of the literature survey carried out before implementing the proposed work, the following research papers were consulted.

Research Paper 1:

F. Chan and D. Haccoun, “Adaptive Viterbi decoding of

convolutional codes over memoryless channels,” IEEE Trans. Commun.,

vol. 45, no. 11, pp. 1389–1400, Nov. 1997.

Objective

In this paper, an adaptive decoding algorithm for convolutional

codes, which is a modification of the Viterbi algorithm (VA) is

presented. For a given code, the proposed algorithm yields nearly the

same error performance as the VA while requiring a substantially

smaller average number of computations. Unlike most of the other

suboptimum algorithms, this algorithm is self-synchronizing. If the

transmitted path is discarded, the adaptive Viterbi algorithm (AVA)

can recover the state corresponding to the transmitted path after a few

trellis depths. Using computer simulations over hard and soft 3-bit

quantized additive white Gaussian noise channels, it is shown that


codes with a constraint length K up to 11 can be used to improve the

bit-error performance over the VA with K=7 while maintaining a

similar average number of computations. Although a small variability

of the computational effort is present with our algorithm, this

variability is exponentially distributed, leading to a modest size of the

input buffer and, hence, a small probability of overflow.

Research Paper 2:

S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson, “A

dynamically reconfigurable adaptive Viterbi decoder,” in Proc.

FPGA’02, 2002.

Objective

The use of error-correcting codes has proven to be an effective

way to overcome data corruption in digital communication channels.

Although widely-used, the most popular communications decoding

algorithm, the Viterbi algorithm, requires an exponential increase in

hardware complexity to achieve greater decode accuracy. In this

paper, we describe the analysis and implementation of a reduced-

complexity decode approach, the adaptive Viterbi algorithm (AVA). Our

AVA design is implemented in reconfigurable hardware to take full

advantage of algorithm parallelism and specialization. Run-time

dynamic reconfiguration is used in response to changing channel

noise conditions to achieve improved decoder performance. Implementation parameters for the decoder have been determined through simulation and the decoder has been implemented on a Xilinx XC4036-based PCI board. An overall decode performance improvement of 7.5X for AVA has been achieved versus algorithm implementation on a Celeron-processor based system. The use of dynamic reconfiguration leads to a 20% performance improvement over a static implementation with no loss of decode accuracy.


Research Paper 3:

R. Henning and C. Chakrabarti, “Low-power approach for

decoding convolutional codes with adaptive Viterbi algorithm

approximations,” in Proc. ISLPED’02, Monterey, CA, Aug. 12–14, 2002.

Objective

Significant power reduction can be achieved by exploiting real-time variation in system characteristics while decoding convolutional codes. The approach proposed herein adaptively approximates Viterbi decoding by varying truncation length and pruning threshold of the T-algorithm while employing trace-back memory management. Adaptation is performed according to variations in signal-to-noise ratio, code rate, and maximum acceptable bit error rate. Potential energy reduction of 70 to 97.5% compared to Viterbi decoding is demonstrated. Superiority of adaptive T-algorithm decoding compared to fixed T-algorithm decoding is studied. General conclusions about when applications can particularly benefit from this approach are given.

1.4. AIM OF THE PROJECT

This project aims to develop a Viterbi decoder with strongly connected trellis decoding, which is proposed to reduce the number of add-compare-select computations in the Viterbi decoding process. The idea is to develop a low-power, low-delay Viterbi decoder.

1.5. ADVANTAGES / APPLICATIONS

Viterbi decoding is one of two types of decoding algorithms used

with convolutional encoding; the other type is sequential decoding.

Sequential decoding has the advantage that it can perform very well


with long-constraint-length convolutional codes, but it has a variable

decoding time.

Viterbi decoding has the advantage that it has a fixed decoding

time. It is well suited to hardware decoder implementation. But its

computational requirements grow exponentially as a function of the

constraint length, so it is usually limited in practice to constraint

lengths of K = 9 or less. Stanford Telecom produces a K = 9 Viterbi

decoder that operates at rates up to 96 kbps, and a K = 7 Viterbi

decoder that operates at up to 45 Mbps. Advanced Wireless

Technologies offers a K = 9 Viterbi decoder that operates at rates up to

2 Mbps. NTT has announced a Viterbi decoder that operates at 60

Mbps. Moore's Law applies to Viterbi decoders as well as to

microprocessors, so consider the rates mentioned above as a snapshot

of the state-of-the-art taken in early 1999.

For years, convolutional coding with Viterbi decoding has been

the predominant FEC technique used in space communications,

particularly in geostationary satellite communication networks, such

as VSAT (very small aperture terminal) networks. I believe the most

common variant used in VSAT networks is rate 1/2 convolutional

coding using a code with a constraint length K = 7. With this code,

you can transmit binary or quaternary phase-shift-keyed (BPSK or

QPSK) signals with at least 5 dB less power than you'd need without

it. That's a reduction in Watts of more than a factor of three! This is

very useful in reducing transmitter and/or antenna cost or permitting

increased data rates given the same transmitter power and antenna

sizes.

But there's a tradeoff: the same data rate with rate 1/2

convolutional coding takes twice the bandwidth of the same signal

without it, given that the modulation technique is the same. That's

because with rate 1/2 convolutional encoding, you transmit two

channel symbols per data bit. However, if you think of the tradeoff as


a 5 dB power savings for a 3 dB bandwidth expansion, you can see

that you come out ahead. Remember: if the modulation technique

stays the same, the bandwidth expansion factor of a convolutional

code is simply n/k.

Many radio channels are AWGN channels, but many,

particularly terrestrial radio channels, also have other impairments,

such as multipath, selective fading, interference, and atmospheric

(lightning) noise. Transmitters and receivers can add spurious signals

and phase noise to the desired signal as well. Although convolutional

coding with Viterbi decoding might be useful in dealing with those

other problems, it may not be the best technique.

In the past several years, convolutional coding with Viterbi

decoding has begun to be supplemented in the geostationary satellite

communication arena with Reed-Solomon coding. The two coding

techniques are usually implemented as serially concatenated block

and convolutional coding. Typically, the information to be transmitted

is first encoded with the Reed-Solomon code, then with the

convolutional code. On the receiving end, Viterbi decoding is

performed first, followed by Reed-Solomon decoding. This is the

technique that is used in most if not all of the direct-broadcast

satellite (DBS) systems, and in several of the newer VSAT products as

well. At least, that's what the vendors are advertising.

In 1993, a new parallel-concatenated convolutional

coding technique known as turbo coding emerged. Initial

hardware encoder and decoder implementations of turbo coding have

already appeared on the market. This technique achieves substantial

improvements in performance over concatenated Viterbi and Reed-

Solomon coding. A variant in which the codes are product codes has

also been developed, along with hardware implementations.


Applications:

A Viterbi decoder is commonly used to decode convolutional codes in applications such as:

- Satellite communications.

- Computer storage devices such as hard disk drives.

- Digital video broadcasting.

- Mobile broadband applications.

1.6 THESIS ORGANISATION

The present study is based on the implementation of the soft

decision low power convolutional encoder and decoder Using Viterbi

Algorithm using VHDL. Chapter one provides an introduction to the

project as a whole. Chapter two gives a detailed account of Viterbi

Algorithm. Chapter three provides insights into the logical aspects of

Viterbi encoder and decoder. Chapter four deals with the

specifications and the design aspects involved in the implementation

of VHDL. Chapter five contains simulation results of the proposed

work. Chapter six deals with the VHDL implementation, which

contains the synthesis reports, RTL views, port diagrams and timing

analysis. Chapter seven presents the concluding remarks and future

scope of the work.


CHAPTER 2

VITERBI ALGORITHM

2.1 INTRODUCTION

The steps involved in simulating a communication channel using Soft

Decision Viterbi decoding are as follows:

- Generate the data to be transmitted through the channel; the result is binary data bits.

- Convolutionally encode the data; the result is channel symbols.

- Map the one/zero channel symbols onto an antipodal baseband signal, producing transmitted channel symbols.

- Add noise to the transmitted channel symbols; the result is received channel symbols.

- Quantize the received channel levels. One-bit quantization is called hard-decision, and two- to n-bit quantization is called soft-decision (n is usually three or four).

- Perform Viterbi decoding on the quantized received channel symbols; the result is again binary data bits.

- Compare the decoded data bits to the transmitted data bits and count the number of errors.

You will notice that I left out the steps of modulating the channel symbols onto a transmitted carrier and then demodulating the received carrier to recover the channel symbols. That is deliberate: we can accurately model the effects of AWGN even though we bypass those steps.

2.2 GENERATING THE DATA

Generating the data to be transmitted through the channel can be accomplished quite simply by using a random number generator that produces a uniform distribution of numbers on the interval from 0 to a maximum value. Using this function, we can say that any value less than half of the maximum value is a zero, and any value greater than or equal to half of the maximum value is a one.
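The thresholding rule described above can be sketched as follows. The function name, the seed parameter, and the maximum value of 32767 are assumptions chosen for illustration; any uniform generator with a known maximum would serve.

```python
import random

def generate_bits(count, seed=None):
    """Return `count` random data bits using the half-of-maximum rule."""
    rng = random.Random(seed)
    max_value = 32767                      # assumed generator maximum
    bits = []
    for _ in range(count):
        value = rng.randint(0, max_value)  # uniform on 0..max_value
        # Values below half the maximum map to 0, the rest map to 1.
        bits.append(1 if value >= max_value / 2 else 0)
    return bits

print(generate_bits(16, seed=1))
```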

2.3 CONVOLUTIONALLY ENCODING THE DATA

Convolutionally encoding the data is accomplished using a shift

register and associated combinatorial logic that performs modulo-two

addition. (A shift register is merely a chain of flip-flops wherein the

output of the nth flip-flop is tied to the input of the (n+1)th flip-flop.

Every time the active edge of the clock occurs, the input to the flip-flop

is clocked through to the output, and thus the data are shifted over

one stage.) The combinatorial logic is often in the form of cascaded

exclusive-or gates. As a reminder, exclusive-or gates are two-input,

one-output gates often represented by the logic symbol shown in

Figure 2.1,

Figure 2.1. XOR gate

which implements the following truth table:

Table 2.1 Truth table of the exclusive-or gate

Input A   Input B   Output
0         0         0
0         1         1
1         0         1
1         1         0

The exclusive-or-gate performs modulo-two addition of its

inputs. When you cascade q two-input exclusive-or gates, with the


output of the first one feeding one of the inputs of the second one, the

output of the second one feeding one of the inputs of the third one,

etc., the output of the last one in the chain is the modulo-two sum of

the q + 1 inputs.
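The cascade property can be checked exhaustively for a small case. This is a minimal sketch; the function name is illustrative.

```python
from functools import reduce
from itertools import product

def xor_cascade(inputs):
    """Chain two-input XOR gates: each gate's output feeds the next gate."""
    return reduce(lambda acc, bit: acc ^ bit, inputs)

# Exhaustive check for q = 3 gates, i.e. q + 1 = 4 inputs.
for bits in product((0, 1), repeat=4):
    assert xor_cascade(bits) == sum(bits) % 2
print("cascade output equals the modulo-two sum for all 16 input patterns")
```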

Another way to illustrate the modulo-two adder, and the way

that is most commonly used in textbooks, is as a circle with a +

symbol inside, thus:

Now that we have the two basic components of the

convolutional encoder (flip-flops comprising the shift register and

exclusive-or gates comprising the associated modulo-two adders)

defined, let's look at a picture of a convolutional encoder for a rate

1/2, K = 3, m = 2 code:

Figure 2.2 convolutional encoder

In this encoder, data bits are provided at a rate of k bits per

second. Channel symbols are output at a rate of n = 2k symbols per

second. The input bit is stable during the encoder cycle. The encoder

cycle starts when an input clock edge occurs. When the input clock


edge occurs, the output of the left-hand flip-flop is clocked into the

right-hand flip-flop, the previous input bit is clocked into the left-hand

flip-flop, and a new input bit becomes available. Then the outputs of

the upper and lower modulo-two adders become stable. The output

selector (SEL A/B block) cycles through two states: in the first state, it

selects and outputs the output of the upper modulo-two adder; in the

second state, it selects and outputs the output of the lower modulo-

two adder.

The encoder shown above encodes the K = 3, (7, 5) convolutional

code. The octal numbers 7 and 5 represent the code generator

polynomials, which when read in binary (111_2 and 101_2) correspond to

the shift register connections to the upper and lower modulo-two

adders, respectively. This code has been determined to be the "best"

code for rate 1/2, K = 3. It is the code I will use for the remaining

discussion and examples, for reasons that will become readily

apparent when we get into the Viterbi decoder algorithm.

Let's look at an example input data stream, and the

corresponding output data stream:

Let the input sequence be 010111001010001_2.

Assume that the outputs of both of the flip-flops in the shift

register are initially cleared, i.e. their outputs are zeroes. The first

clock cycle makes the first input bit, a zero, available to the encoder.

The flip-flop outputs are both zeroes. The inputs to the modulo-two

adders are all zeroes, so the output of the encoder is 00_2.

The second clock cycle makes the second input bit available to

the encoder. The left-hand flip-flop clocks in the previous bit, which

was a zero, and the right-hand flip-flop clocks in the zero output by

the left-hand flip-flop. The inputs to the top modulo-two adder are

100_2, so the output is a one. The inputs to the bottom modulo-two

adder are 10_2, so the output is also a one. So the encoder outputs 11_2

for the channel symbols.

The third clock cycle makes the third input bit, a zero, available

to the encoder. The left-hand flip-flop clocks in the previous bit, which

was a one, and the right-hand flip-flop clocks in the zero from two bit-

times ago. The inputs to the top modulo-two adder are 010_2, so the

output is a one. The inputs to the bottom modulo-two adder are 00_2,

so the output is zero. So the encoder outputs 10_2 for the channel

symbols.

And so on. The timing diagram shown below illustrates the process:

Figure 2.3. Timing diagram of Encoder

After all of the inputs have been presented to the encoder, the output

sequence will be:

00 11 10 00 01 10 01 11 11 10 00 10 11 00 11_2.


Notice that I have paired the encoder outputs: the first bit in

each pair is the output of the upper modulo-two adder; the second bit

in each pair is the output of the lower modulo-two adder.

You can see from the structure of the rate 1/2 K = 3

convolutional encoder and from the example given above that each

input bit has an effect on three successive pairs of output symbols.

That is an extremely important point and that is what gives the

convolutional code its error-correcting power. The reason why will

become evident when we get into the Viterbi decoder algorithm.

Now if we are only going to send the 15 data bits given above, in

order for the last bit to affect three pairs of output symbols, we need

to output two more pairs of symbols. This is accomplished in our

example encoder by clocking the convolutional encoder flip-flops two

(= m) more times, while holding the input at zero. This is called

"flushing" the encoder, and results in two more pairs of output

symbols. The final binary output of the encoder is thus 00 11 10 00

01 10 01 11 11 10 00 10 11 00 11 10 11_2. If we don't perform the

flushing operation, the last m bits of the message have less error-

correction capability than the first through (m - 1)th bits had. This is a

pretty important thing to remember if you're going to use this FEC

technique in a burst-mode environment. So is the step of clearing the

shift register at the beginning of each burst. The encoder must start in

a known state and end in a known state for the decoder to be able to

reconstruct the input data sequence properly.
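The encoding and flushing procedure described above can be reproduced with a short sketch. The function and variable names are illustrative; the shift register connections follow the (7, 5) generators given earlier, with the upper adder taking the input and both flip-flops and the lower adder taking the input and the right-hand flip-flop.

```python
def conv_encode(bits, flush=True):
    """Rate 1/2, K = 3, (7, 5) convolutional encoder."""
    left = right = 0                       # shift register cleared at start
    data = list(bits) + ([0, 0] if flush else [])  # m = 2 flushing zeroes
    pairs = []
    for u in data:
        upper = u ^ left ^ right           # generator 7 = 111 in binary
        lower = u ^ right                  # generator 5 = 101 in binary
        pairs.append(f"{upper}{lower}")    # upper adder output comes first
        left, right = u, left              # shift by one stage
    return " ".join(pairs)

message = [int(c) for c in "010111001010001"]
print(conv_encode(message, flush=False))   # 15 symbol pairs
print(conv_encode(message))                # 17 symbol pairs after flushing
```

Running the sketch on the 15-bit example message yields the same 15 symbol pairs listed above, followed by the two extra pairs 10 and 11 when flushing is enabled.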

Now, let's look at the encoder from another perspective. You can

think of the encoder as a simple state machine. The example encoder

has two bits of memory, so there are four possible states. Let's give the

left-hand flip-flop a binary weight of 2^1, and the right-hand flip-flop a binary weight of 2^0. Initially, the encoder is in the all-zeroes state. If the first input bit is a zero, the encoder stays in the all-zeroes state at the next clock edge. But if the input bit is a one, the encoder transitions to the 10_2 state at the next clock edge. Then, if the next input bit is zero, the encoder transitions to the 01_2 state; otherwise, it transitions to the 11_2 state. The following table gives the next state given the current state and the input, with the states given in binary:

Table 2.2 Next state table of encoder

The above table is often called a state transition table. We'll refer

to it as the next state table. Now let us look at a table that lists the

channel output symbols, given the current state and the input data,

which we'll refer to as the output table:

Table 2.3 Output table


You should now see that with these two tables, you can

completely describe the behavior of the example rate 1/2, K = 3

convolutional encoder. Note that both of these tables have 2^(K - 1) rows

and 2^k columns, where K is the constraint length and k is the number

of bits input to the encoder for each cycle. These two tables will come

in handy when we start discussing the Viterbi decoder algorithm.
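Both tables can be derived directly from the encoder structure. The sketch below assumes the state numbering introduced above (left-hand flip-flop weighted 2^1, right-hand flip-flop weighted 2^0); the function and table names are illustrative.

```python
def step(state, u):
    """Return (next_state, output) for state = 2*left + right and input bit u."""
    left, right = state >> 1, state & 1
    upper = u ^ left ^ right               # generator 7 = 111 in binary
    lower = u ^ right                      # generator 5 = 101 in binary
    next_state = (u << 1) | left           # input enters the left-hand flip-flop
    return next_state, (upper << 1) | lower

# Rows are current states 00..11, columns are input bit 0 and input bit 1.
next_state_table = [[step(s, u)[0] for u in (0, 1)] for s in range(4)]
output_table = [[step(s, u)[1] for u in (0, 1)] for s in range(4)]

for s in range(4):
    row_next = " ".join(f"{v:02b}" for v in next_state_table[s])
    row_out = " ".join(f"{v:02b}" for v in output_table[s])
    print(f"state {s:02b} | next: {row_next} | output: {row_out}")
```

Each table has 2^(K - 1) = 4 rows and 2^k = 2 columns, as stated above.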

MAPPING THE CHANNEL SYMBOLS TO SIGNAL LEVELS

Mapping the one/zero output of the convolutional encoder onto

an antipodal baseband signaling scheme is simply a matter of

translating zeroes to +1s and ones to -1s. This can be accomplished

by performing the operation y = 1 - 2x on each convolutional encoder

output symbol.

2.4 ADDING NOISE TO THE TRANSMITTED SYMBOLS

Adding noise to the transmitted channel symbols produced by

the convolutional encoder involves generating Gaussian random

numbers, scaling the numbers according to the desired energy per

symbol to noise density ratio, Es/N0, and adding the scaled Gaussian

random numbers to the channel symbol values.

For the uncoded channel, Es/N0 = Eb/N0, since there is one channel symbol per bit. However, for the coded channel, Es/N0 = Eb/N0 + 10 log10(k/n), with all quantities expressed in dB. For example, for rate 1/2 coding, Es/N0 = Eb/N0 + 10 log10(1/2) = Eb/N0 - 3.01 dB. Similarly, for rate 2/3 coding, Es/N0 = Eb/N0 + 10 log10(2/3) = Eb/N0 - 1.76 dB.
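The conversion from Eb/N0 to Es/N0 is a one-line calculation. In this sketch the function name is illustrative and both ratios are expressed in dB.

```python
import math

def es_n0_db(eb_n0_db, k, n):
    """Es/N0 in dB for a rate k/n code at the given Eb/N0 in dB."""
    return eb_n0_db + 10.0 * math.log10(k / n)

print(round(es_n0_db(6.0, 1, 2), 2))   # rate 1/2 at Eb/N0 = 6 dB
print(round(es_n0_db(6.0, 2, 3), 2))   # rate 2/3 at Eb/N0 = 6 dB
```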

The Gaussian random number generator is the only interesting

part of this task. C only provides a uniform random number

generator, rand(). In order to obtain Gaussian random numbers, we

take advantage of relationships between uniform, Rayleigh, and

Gaussian distributions:


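Given a uniform random variable U, a Rayleigh random variable R can be obtained by:

$$R = \sqrt{2\sigma^2 \ln\left(\frac{1}{1-U}\right)}$$

where $\sigma^2$ is the variance parameter of the Rayleigh random variable (and the variance of each Gaussian variable derived from it), and given R and a second uniform random variable V, taken over the interval 0 to $2\pi$, two Gaussian random variables G and H can be obtained by

G = R cos V and H = R sin V.

In the AWGN channel, the signal is corrupted by additive noise, n(t), which has the power spectrum N0/2 watts/Hz. The variance $\sigma^2$ of this noise is equal to $N_0/2$. If we set the energy per symbol Es equal to 1, then $E_s/N_0 = 1/(2\sigma^2)$.

So $\sigma = \sqrt{1/(2\,E_s/N_0)}$.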
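The noise generation steps can be sketched as follows, assuming Es = 1 so that the noise standard deviation is sqrt(1 / (2 Es/N0)), and assuming the second uniform variable is scaled to the interval 0 to 2*pi. The function names and the seed parameter are illustrative. The antipodal mapping y = 1 - 2x is applied inside the channel function.

```python
import math
import random

def noise_sigma(es_n0_db):
    """Noise standard deviation for Es = 1: sigma = sqrt(1 / (2 Es/N0))."""
    es_n0 = 10.0 ** (es_n0_db / 10.0)      # convert dB to a linear ratio
    return math.sqrt(1.0 / (2.0 * es_n0))

def gaussian_pair(sigma, rng):
    """Two independent zero-mean Gaussian samples via a Rayleigh variable."""
    u = rng.random()                       # uniform on [0, 1)
    v = 2.0 * math.pi * rng.random()       # uniform phase on [0, 2*pi)
    r = sigma * math.sqrt(2.0 * math.log(1.0 / (1.0 - u)))
    return r * math.cos(v), r * math.sin(v)

def add_awgn(symbols, es_n0_db, seed=None):
    """Map 0/1 symbols to +1/-1 (y = 1 - 2x) and add Gaussian noise."""
    rng = random.Random(seed)
    sigma = noise_sigma(es_n0_db)
    received = []
    for x in symbols:
        g, _ = gaussian_pair(sigma, rng)   # second sample unused for brevity
        received.append((1 - 2 * x) + g)
    return received

print(add_awgn([0, 1, 1, 0], es_n0_db=3.0, seed=1))
```

A production simulation would normally use both samples of each pair; discarding one keeps the sketch short at the cost of an extra pair of uniform draws per symbol.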

2.5 QUANTIZING THE RECEIVED CHANNEL SYMBOLS

An ideal Viterbi decoder would work with infinite precision, or at

least with floating-point numbers. In practical systems, we quantize

the received channel symbols with one or a few bits of precision in

order to reduce the complexity of the Viterbi decoder, not to mention

the circuits that precede it. If the received channel symbols are

quantized to one-bit precision (< 0V = 1, > 0V = 0), the result is called

hard-decision data. If the received channel symbols are quantized with

more than one bit of precision, the result is called soft-decision data.

A Viterbi decoder with soft decision data inputs quantized to three or

four bits of precision can perform about 2 dB better than one working

with hard-decision inputs. The usual quantization precision is three

bits. More bits provide little additional improvement.


The selection of the quantizing levels is an important design

decision because it can have a significant effect on the performance of

the link. The following is a very brief explanation of one way to set

those levels. Let's assume our received signal levels in the absence of

noise are -1V = 1, +1V = 0. With noise, our received signal has mean

+/- 1 and standard deviation $\sigma = \sqrt{1/(2\,E_s/N_0)}$. Let's use a uniform,

three-bit quantizer having the input/output relationship shown in the

figure below, where D is a decision level that we will calculate shortly:

Figure 2.4. Quantizer input/output relationship

The decision level, D, can be calculated according to the formula

Where Es/N0 is the energy per symbol to noise density ratio.
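A minimal sketch of a uniform three-bit quantizer follows. It assumes decision thresholds at integer multiples of D centered on 0V, and an output code running from 0 for the strongest '0' (most positive input) to 7 for the strongest '1' (most negative input). Both conventions, the function name, and the example value of D are assumptions made for illustration; D is passed in as a parameter rather than computed.

```python
def quantize_3bit(level, d):
    """Uniform 3-bit quantizer with decision levels at -3d, -2d, ..., +3d.

    Assumed output convention: 0 = strongest '0' (most positive input),
    7 = strongest '1' (most negative input).
    """
    thresholds = [3 * d, 2 * d, d, 0.0, -d, -2 * d, -3 * d]
    for code, threshold in enumerate(thresholds):
        if level >= threshold:
            return code
    return 7

for level in (1.2, 0.3, -0.1, -1.0):
    print(level, quantize_3bit(level, d=0.25))
```

With this coding the most significant bit of the output equals the hard decision (0 for inputs at or above 0V, 1 below), and the remaining two bits indicate how confident that decision is.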


2.6 PERFORMING VITERBI DECODING

The Viterbi decoder itself is the primary focus of this chapter.

Perhaps the single most important concept to aid in understanding

the Viterbi algorithm is the trellis diagram. The figure below shows the

trellis diagram for our example rate 1/2 K = 3 convolutional encoder,

for a 15-bit message:

The four possible states of the encoder are depicted as four rows

of horizontal dots. There is one column of four dots for the initial state

of the encoder and one for each time instant during the message. For

a 15-bit message with two encoder memory flushing bits, there are 17

time instants in addition to t = 0, which represents the initial

condition of the encoder. The solid lines connecting dots in the

diagram represent state transitions when the input bit is a one. The

dotted lines represent state transitions when the input bit is a zero.

Notice the correspondence between the arrows in the trellis diagram

and the state transition table discussed above. Also notice that since

the initial condition of the encoder is State 00_2, and the two memory

flushing bits are zeroes, the arrows start out at State 00_2 and end up

at the same state.

The following diagram shows the states of the trellis that are

actually reached during the encoding of our example 15-bit message:



The encoder input bits and output symbols are shown at the

bottom of the diagram. Notice the correspondence between the

encoder output symbols and the output table discussed above. Let's

look at that in more detail, using the expanded version of the

transition between one time instant to the next shown below:

The two-bit numbers labeling the lines are the corresponding

convolutional encoder channel symbol outputs. Remember that dotted

lines represent cases where the encoder input is a zero, and solid lines

represent cases where the encoder input is a one. (In the figure above,

the two-bit binary numbers labeling dotted lines are on the left, and

the two-bit binary numbers labeling solid lines are on the right.)

OK, now let's start looking at how the Viterbi decoding

algorithm actually works. For our example, we're going to use hard-

decision symbol inputs to keep things simple. (The example source

code uses soft-decision inputs to achieve better performance.)

Suppose we receive the above encoded message with a couple of bit


errors:

Each time we receive a pair of channel symbols, we're going to

compute a metric to measure the "distance" between what we received

and all of the possible channel symbol pairs we could have received.

Going from t = 0 to t = 1, there are only two possible channel symbol

pairs we could have received: 00_2 and 11_2. That's because we know

the convolutional encoder was initialized to the all-zeroes state, and

given one input bit = one or zero, there are only two states we could

transition to and two possible outputs of the encoder. These possible

outputs of the encoder are 00_2 and 11_2.

The metric we're going to use for now is the Hamming distance

between the received channel symbol pair and the possible channel

symbol pairs. The Hamming distance is computed by simply counting

how many bits are different between the received channel symbol pair

and the possible channel symbol pairs. The results can only be zero,

one, or two. The Hamming distance (or other metric) values we

compute at each time instant for the paths between the states at the

previous time instant and the states at the current time instant are

called branch metrics. For the first time instant, we're going to save

these results as "accumulated error metric" values, associated with

states. For the second time instant on, the accumulated error metrics

will be computed by adding the previous accumulated error metrics to

the current branch metrics.
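The branch metric computation described above can be sketched in a few lines of Python. This is only an illustration, not the project's source code; the function name `hamming_distance` and the tuple representation of a channel symbol pair are assumptions made for the example.

```python
def hamming_distance(received, expected):
    """Count the bit positions in which two channel symbol pairs differ."""
    return sum(r != e for r, e in zip(received, expected))


# Received pair 00, compared against the two encoder outputs that are
# possible when leaving the all-zeroes state.
received = (0, 0)
branch_metrics = {expected: hamming_distance(received, expected)
                  for expected in [(0, 0), (1, 1)]}
print(branch_metrics)  # {(0, 0): 0, (1, 1): 2}
```

For a two-bit symbol pair the result can only be zero, one, or two, as stated above.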


At t = 1, we received 00₂. The only possible channel symbol pairs

we could have received are 00₂ and 11₂. The Hamming distance

between 00₂ and 00₂ is zero. The Hamming distance between 00₂ and

11₂ is two. Therefore, the branch metric value for the branch from

State 00₂ to State 00₂ is zero, and for the branch from State 00₂ to

State 10₂ it's two. Since the previous accumulated error metric values

are equal to zero, the accumulated metric values for State 00₂ and for

State 10₂ are equal to the branch metric values. The accumulated

error metric values for the other two states are undefined. The figure

below illustrates the results at t = 1:

Note that the solid lines between states at t = 1 and the state at

t = 0 illustrate the predecessor-successor relationship between the

states at t = 1 and the state at t = 0 respectively. This information is

shown graphically in the figure, but is stored numerically in the actual

implementation. To be more specific, or maybe clear is a better word,

at each time instant t, we will store the number of the predecessor

state that led to each of the current states at t.

Now let's look at what happens at t = 2. We received a 11₂ channel

symbol pair. The possible channel symbol pairs we could have

received in going from t = 1 to t = 2 are 00₂ going from State 00₂ to

State 00₂, 11₂ going from State 00₂ to State 10₂, 10₂ going from State

10₂ to State 01₂, and 01₂ going from State 10₂ to State 11₂. The

Hamming distance between 00₂ and 11₂ is two, between 11₂ and 11₂

is zero, and between 10₂ or 01₂ and 11₂ is one. We add these branch

metric values to the previous accumulated error metric values

associated with each state that we came from to get to the current

states. At t = 1, we could only be at State 00₂ or State 10₂. The

accumulated error metric values associated with those states were 0

and 2 respectively. The figure below shows the calculation of the

accumulated error metric associated with each state, at t = 2.

That's all the computation for t = 2. What we carry forward to t

= 3 will be the accumulated error metrics for each state, and the

predecessor states for each of the four states at t = 2, corresponding to

the state relationships shown by the solid lines in the illustration of

the trellis.

Now look at the figure for t = 3. Things get a bit more

complicated here, since there are now two different ways that we could

get from each of the four states that were valid at t = 2 to the four

states that are valid at t = 3. So how do we handle that? The answer

is, we compare the accumulated error metrics associated with each

branch, and discard the larger one of each pair of branches leading

into a given state. If the members of a pair of accumulated error

metrics going into a particular state are equal, we just save that value.

The other thing that's affected is the predecessor-successor history

we're keeping. For each state, the predecessor that survives is the one

with the lower accumulated error metric. If the two accumulated error metrics are


equal, some people use a fair coin toss to choose the surviving

predecessor state. Others simply pick one of them consistently, i.e. the

upper branch or the lower branch. It probably doesn't matter which

method you use. The operation of adding the previous accumulated

error metrics to the new branch metrics, comparing the results, and

selecting the smaller (smallest) accumulated error metric to be

retained for the next time instant is called the add-compare-select

operation. The figure below shows the results of processing t = 3:

Note that the third channel symbol pair we received had a one-

symbol error. The smallest accumulated error metric is a one, and

there are two of these.
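The add-compare-select operation can be sketched as follows. This is a minimal illustration under stated assumptions: the `NEXT_STATE` and `OUTPUT` tables are written out by hand so that they agree with the transitions quoted above for the rate 1/2, K = 3 example (states 00₂, 01₂, 10₂, 11₂ numbered 0 to 3), and the third received pair is assumed to be 11₂. The names `acs_step` and `hamming` are hypothetical.

```python
INF = float("inf")

# Assumed tables for the example code; rows are current states,
# columns are the encoder input bit (0 or 1).
NEXT_STATE = [[0, 2], [0, 2], [1, 3], [1, 3]]
OUTPUT = [[(0, 0), (1, 1)], [(1, 1), (0, 0)],
          [(1, 0), (0, 1)], [(0, 1), (1, 0)]]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def acs_step(prev_metrics, received):
    """Run one add-compare-select step over all four states.

    Returns the new accumulated error metrics and, for each state, the
    surviving predecessor (None if the state is not reachable yet).
    """
    new_metrics = [INF] * 4
    predecessor = [None] * 4
    for state in range(4):
        for bit in (0, 1):
            nxt = NEXT_STATE[state][bit]
            # Add: previous accumulated metric plus the new branch metric.
            candidate = prev_metrics[state] + hamming(received, OUTPUT[state][bit])
            # Compare and select: keep the smaller; a tie keeps the first found.
            if candidate < new_metrics[nxt]:
                new_metrics[nxt] = candidate
                predecessor[nxt] = state
    return new_metrics, predecessor


metrics = [0, INF, INF, INF]               # encoder starts in the all-zeroes state
metrics, pred = acs_step(metrics, (0, 0))  # t = 1
print(metrics)                             # [0, inf, 2, inf]
metrics, pred = acs_step(metrics, (1, 1))  # t = 2
print(metrics)                             # [2, 3, 0, 3]
metrics, pred = acs_step(metrics, (1, 1))  # t = 3 (assumed received pair)
print(metrics, pred)                       # [3, 1, 2, 1] [1, 2, 0, 2]
```

Under these assumptions the sketch gives two states with an accumulated error metric of one at t = 3, in line with the note above. Ties are resolved here by consistently keeping the first branch examined, which is one of the two options mentioned.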

Let's see what happens now at t = 4. The processing is the same

as it was for t = 3. The results are shown in the figure:


Notice that at t = 4, the path through the trellis of the actual

transmitted message, shown in bold, is again associated with the

smallest accumulated error metric. Let's look at t = 5:

At t = 5, the path through the trellis corresponding to the actual

message, shown in bold, is still associated with the smallest

accumulated error metric. This is the thing that the Viterbi decoder

exploits to recover the original message.

Perhaps you're getting tired of stepping through the trellis. I

know I am. Let's skip to the end.

At t = 17, the trellis looks like this, with the clutter of the

intermediate state history removed:


The decoding process begins with building the accumulated

error metric for some number of received channel symbol pairs, and

the history of what states preceded the states at each time instant t

with the smallest accumulated error metric. Once this information is

built up, the Viterbi decoder is ready to recreate the sequence of bits

that were input to the convolutional encoder when the message was

encoded for transmission. This is accomplished by the following steps:

First, select the state having the smallest accumulated error

metric and save the state number of that state.

Iteratively perform the following step until the beginning of the

trellis is reached: Working backward through the state history

table, for the selected state, select a new state which is listed in

the state history table as being the predecessor to that state.

Save the state number of each selected state. This step is called

trace back.

Now work forward through the list of selected states saved in

the previous steps. Look up what input bit corresponds to a

transition from each predecessor state to its successor state.

That is the bit that must have been encoded by the

convolutional encoder.
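The three steps above can be sketched as a small hard-decision decoder that builds the state history for a whole message and then traces back. It is a sketch under assumptions rather than the project's implementation: the two tables are written by hand for the rate 1/2, K = 3 example, the test message is arbitrary, and the names `encode` and `decode` are hypothetical.

```python
INF = float("inf")

# Assumed tables for the example code; states 00, 01, 10, 11 are numbered
# 0..3, rows are current states, columns are the encoder input bit.
NEXT_STATE = [[0, 2], [0, 2], [1, 3], [1, 3]]
OUTPUT = [[(0, 0), (1, 1)], [(1, 1), (0, 0)],
          [(1, 0), (0, 1)], [(0, 1), (1, 0)]]


def encode(bits):
    """Convolutionally encode a bit list, starting from the all-zeroes state."""
    state, symbols = 0, []
    for bit in bits:
        symbols.append(OUTPUT[state][bit])
        state = NEXT_STATE[state][bit]
    return symbols


def decode(symbols):
    """Hard-decision Viterbi decoding with traceback over the whole message."""
    metrics = [0, INF, INF, INF]
    history = []  # history[t][s] = surviving predecessor of state s
    for received in symbols:
        new_metrics, pred = [INF] * 4, [None] * 4
        for state in range(4):
            for bit in (0, 1):
                nxt = NEXT_STATE[state][bit]
                branch = sum(r != e for r, e in zip(received, OUTPUT[state][bit]))
                if metrics[state] + branch < new_metrics[nxt]:
                    new_metrics[nxt] = metrics[state] + branch
                    pred[nxt] = state
        metrics = new_metrics
        history.append(pred)

    # Step 1: select the state with the smallest accumulated error metric.
    state = metrics.index(min(metrics))
    path = [state]
    # Step 2: trace back through the state history table.
    for pred in reversed(history):
        state = pred[state]
        path.append(state)
    path.reverse()
    # Step 3: work forward and look up the input bit for each transition.
    return [NEXT_STATE[a].index(b) for a, b in zip(path, path[1:])]


message = [0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1]  # arbitrary test message
received = encode(message + [0, 0])                      # two flushing bits
received[2] = (received[2][0], received[2][1] ^ 1)       # inject one bit error
decoded = decode(received)
print(decoded[:-2] == message)                           # flushing bits discarded
```

The lookup `NEXT_STATE[a].index(b)` plays the role of the table that maps a state transition to the input bit that caused it.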

The following table shows the accumulated metric for the full 15-bit

(plus two flushing bits) example message at each time t:


Table 2.4 Accumulated error metric for 15 bit message

It is interesting to note that for this hard-decision-input Viterbi

decoder example, the smallest accumulated error metric in the final

state indicates how many channel symbol errors occurred.

The following state history table shows the surviving

predecessor states for each state at each time t:

Table 2.5 surviving predecessor states for each state at each time t

The following table shows the states selected when tracing the

path back through the survivor state table shown above:

Table 2.6 States selected when tracing the path back


Using a table that maps state transitions to the inputs that

caused them, we can now recreate the original message. Here is what

this table looks like for our example rate 1/2 K = 3 convolutional code:

Figure 2.7 State transition Maps to the inputs

Note: In the above table, x denotes an impossible transition from one

state to another state:

So now we have all the tools required to recreate the original

message from the message we received:

Figure 2.8 Recreating the original message

The two flushing bits are discarded.

Here's an insight into how the traceback algorithm eventually

finds its way onto the right path even if it started out choosing the

wrong initial state. This could happen if more than one state had the

smallest accumulated error metric, for example. I'll use the figure for

the trellis at t = 3 again to illustrate this point:


See how at t = 3, both States 01₂ and 11₂ had an accumulated

error metric of 1. The correct path goes to State 01₂; notice that the

bold line showing the actual message path goes into this state. But

suppose we choose State 11₂ to start our traceback. The predecessor

state for State 11₂, which is State 10₂, is the same as the

predecessor state for State 01₂! This is because at t = 2, State 10₂ had

the smallest accumulated error metric. So after a false start, we are

almost immediately back on the correct path.

For the example 15-bit message, we built the trellis up for the

entire message before starting traceback. For longer messages, or

continuous data, this is neither practical nor desirable, due to

memory constraints and decoder delay. Research has shown that a

traceback depth of K x 5 is sufficient for Viterbi decoding with the type

of codes we have been discussing. Any deeper traceback increases

decoding delay and decoder memory requirements, while not

significantly improving the performance of the decoder. The exception

is punctured codes, which I'll describe later. They require deeper

traceback to reach their final performance limits.

To implement a Viterbi decoder in software, the first step is to

build some data structures around which the decoder algorithm will

be implemented. These data structures are best implemented as

arrays. The six primary arrays that we need for the Viterbi decoder are

as follows:

A copy of the convolutional encoder next state table, the state
transition table of the encoder. The dimensions of this table
(rows x columns) are 2^(K - 1) x 2^k. This array needs to be
initialized before starting the decoding process.

A copy of the convolutional encoder output table. The
dimensions of this table are 2^(K - 1) x 2^k. This array needs to be
initialized before starting the decoding process.

An array (table) showing for each convolutional encoder current
state and next state, what input value (0 or 1) would produce
the next state, given the current state. We'll call this array the
input table. Its dimensions are 2^(K - 1) x 2^(K - 1). This array needs
to be initialized before starting the decoding process.

An array to store state predecessor history for each encoder
state for up to K x 5 + 1 received channel symbol pairs. We'll
call this table the state history table. The dimensions of this
array are 2^(K - 1) x (K x 5 + 1). This array does not need to be
initialized before starting the decoding process.

An array to store the accumulated error metrics for each state
computed using the add-compare-select operation. This array
will be called the accumulated error metric array. The
dimensions of this array are 2^(K - 1) x 2. This array does not need
to be initialized before starting the decoding process.

An array to store a list of states determined during traceback
(the procedure described above). It is called the state sequence
array. The dimensions of this array are (K x 5) + 1. This array
does not need to be initialized before starting the decoding
process.
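As a rough sketch, the six arrays could be set up as shown below for the K = 3 example. The generator polynomials (7 and 5 in octal), the bit ordering inside the shift register, and all variable names are assumptions made for illustration; they are chosen so that the resulting tables agree with the transitions used in the worked example.

```python
K = 3                          # constraint length of the example code
GENERATORS = (0b111, 0b101)    # assumed generator polynomials (7, 5 octal)
NUM_STATES = 2 ** (K - 1)
DEPTH = K * 5 + 1              # traceback depth K x 5, plus one


def parity(x):
    return bin(x).count("1") % 2


# The three tables that must be initialized before decoding starts.
next_state = [[0, 0] for _ in range(NUM_STATES)]
output = [[None, None] for _ in range(NUM_STATES)]
input_table = [[None] * NUM_STATES for _ in range(NUM_STATES)]

for state in range(NUM_STATES):
    for bit in (0, 1):
        # Shift register contents with the new input bit in the top position.
        reg = (bit << (K - 1)) | state
        nxt = reg >> 1
        next_state[state][bit] = nxt
        output[state][bit] = tuple(parity(reg & g) for g in GENERATORS)
        input_table[state][nxt] = bit   # None marks an impossible transition

# The three working arrays, which need no particular initial contents.
state_history = [[0] * DEPTH for _ in range(NUM_STATES)]
accumulated_error = [[0, 0] for _ in range(NUM_STATES)]
state_sequence = [0] * DEPTH

print(next_state)   # [[0, 2], [0, 2], [1, 3], [1, 3]]
print(output[2])    # [(1, 0), (0, 1)]
```

With one input bit per step (k = 1), the next state and output tables have 2^(K - 1) rows and 2 columns, matching the dimensions listed above.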

2.7 CONCLUSION

This chapter gave a detailed description of the Viterbi
algorithm, which is based on coding theory.


CHAPTER 3

VITERBI DECODER

3.1 INTRODUCTION

This chapter discusses the main concepts of convolutional encoding,
the decoding of encoded data with the Viterbi algorithm, and the
different blocks of the Viterbi decoder. Convolutional
encoding with Viterbi decoding is a powerful method for forward error
correction. It has been widely deployed in many wireless
communication systems to improve the limited capacity of the
communication channels. The Viterbi algorithm is the most
extensively employed decoding algorithm for convolutional codes. It
was developed as an asymptotically optimal decoding algorithm for
convolutional codes and is well suited to hardware decoder
implementation. Viterbi decoding of convolutional codes has been
found to be efficient and robust.

Viterbi decoding is the best-known implementation of the

maximum-likelihood decoding. Here we narrow the options

systematically at each time tick. The Viterbi decoder examines an

entire received sequence of a given length. The decoder computes a

metric for each path and makes a decision based on this metric. All

paths are followed until two paths converge on one node. Then the

path with the higher metric is kept and the one with lower metric is

discarded. The most common metric used is the Hamming distance

metric.


3.2 CONVOLUTIONAL ENCODER

Figure 3.1 block diagram for convolutional encoder

Figure 3.1 shows the block diagram of the convolutional encoder.
The encoder takes a 1-bit input and, after encoding, generates two
outputs, c1 and c2. To calculate c1, it XORs the input bit with the
F6, F5, F4, and F1 outputs, whereas to calculate c2, it XORs the
input bit with the F5, F4, F2, and F1 outputs.
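The XOR taps named above can be modelled in software as follows. This is a sketch under one loudly stated assumption: F6 is taken to be the flip-flop fed directly by the input and F1 the last one in the chain, which the text does not confirm. Under that assumption the taps correspond to the widely used K = 7 generator polynomials 171 and 133 (octal). The function name `conv_encode` is hypothetical.

```python
def conv_encode(bits):
    """Rate 1/2, K = 7 convolutional encoder built from the taps in the text."""
    # Assumed ordering: f[0] is F6 (newest stored bit), f[5] is F1 (oldest).
    f = [0] * 6
    symbols = []
    for b in bits:
        f6, f5, f4, f3, f2, f1 = f
        c1 = b ^ f6 ^ f5 ^ f4 ^ f1
        c2 = b ^ f5 ^ f4 ^ f2 ^ f1
        symbols.append((c1, c2))
        f = [b] + f[:5]          # shift the input bit into the register
    return symbols


# The input sequence "0110001" is the one quoted later for the encoder block.
print(conv_encode([0, 1, 1, 0, 0, 0, 1]))
```

Each input bit produces one (c1, c2) pair, which is the 2-bit symbol later presented to the decoder.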

3.3 DECODER

This project presents a configurable 3-bit soft decision Viterbi

decoder implementation that meets the requirements for WLAN and

broadband applications. The programmable design supports a

constraint length K=7 soft decision Viterbi decoder (SDVD) realization

with a code rate (R) of 1/2. To assure a high throughput, an

architecture incorporating 64 Add Compare Select (ACS) units

operating in parallel has been selected. Further, a low-power adaptive

Viterbi decoder is developed, which uses strongly connected trellis

decoding.

To implement the Viterbi algorithm in reconfigurable hardware such as
an FPGA with the help of a hardware description language, it is
essential to translate the algorithm into digital blocks. The algorithm
demands computation of the path and branch metrics and storage of the
path metric decisions for trace back. Based on these requirements, the
blocks are arranged as shown in figure 3.2.

Figure 3.2 Block Diagram for Decoder

3.3.1 Branch Metric Unit (BMU)

A branch metric unit's function is to calculate branch metrics,

which are normed distances between every possible symbol in the

code alphabet, and the received symbol.

There are hard decision and soft decision Viterbi decoders. A

hard decision Viterbi decoder receives a simple bit stream on its input,

and a Hamming distance is used as a metric. A soft decision Viterbi

decoder receives a bitstream containing information about the

reliability of each received symbol.

The squared Euclidean distance is used as a metric for soft

decision decoders.
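A soft-decision branch metric of this kind can be sketched as below. The quantization convention is an assumption made for the example: each received symbol is a 3-bit level from 0 to 7, where 0 stands for the most confident zero and 7 for the most confident one. The name `soft_branch_metric` is hypothetical.

```python
MAX_LEVEL = 7  # 3-bit soft decision: levels 0..7 (assumed convention)


def soft_branch_metric(received, codeword):
    """Squared Euclidean distance between a soft pair and an ideal codeword."""
    return sum((r - MAX_LEVEL * c) ** 2 for r, c in zip(received, codeword))


received = (1, 6)  # a fairly confident zero followed by a fairly confident one
metrics = {cw: soft_branch_metric(received, cw)
           for cw in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(metrics)  # {(0, 0): 37, (0, 1): 2, (1, 0): 72, (1, 1): 37}
```

The codeword 01 gets the smallest metric, and the size of the gap to the other candidates reflects how reliable the received levels are, which a hard-decision Hamming distance cannot express.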


Figure 3.3 block diagram for BMU

Figure 3.3 shows the block diagram of the BMU. V1V2 represents a
received symbol, where V1 and V2 are the two bits of the symbol. The
normed (Hamming) distance is calculated with the help of two XOR
gates and one half adder.

3.3.2 Path Metric Unit (PMU)

Figure 3.4 block diagram for PMU

A path metric unit summarizes branch metrics to get metrics for
2^(K − 1) paths, one of which can eventually be chosen as optimal. Every
clock cycle it makes 2^(K − 1) decisions, discarding paths that are known
to be nonoptimal. The results of these decisions are written to the
memory of a traceback unit.

In the path metric unit, eight add-compare-select units are
arranged in a particular fashion based on the formation of the trellis
diagram.

Add Compare Select Unit

Figure 3.5 block diagram for ACSU

The core elements of a PMU are ACS (add-compare-select) units.

The way in which they are connected between themselves is defined

by a specific code's trellis diagram.

Comparator

After the computation of the path metrics, we have to find the ACSU
that holds the maximum path metric.

This block is given the 2-bit encoded output data. For the encoder
block we gave in1 as an input in the sequence “0110001”. After
decoding, the decoder has to give the same sequence but in the reverse
order, i.e. we get a 7-bit decoded_sequence output.

3.3.3 Trace Back Unit (TBU)

In the FPGA-based implementation, the memory for storing the

survivor paths is divided into four memory banks, each of which


stores 64 survivor paths. The trace back length of the design is chosen

as D=8 that is equivalent to a trace back length of 8(K-1)=64 low-

connectivity trellis stages. As mentioned earlier, this is large enough to

ensure the convergence of the survivor paths. Since a 2-D circular

memory addressing scheme is used in this design, the size of each

memory bank should be equal to 64 × 2D = 1024, and its word length is 8

bits. The survivor path memory is organized as shown in Fig. 3.6.

Figure 3.6 Architecture of the trace back unit.

There are three operation phases in the management of the

memory, write new data (WR), trace back read (TB), and decode read

(DR), which occupy memory with depths of D/2, D, and D/2,

respectively. These three operations access the three logical blocks of

the memory, respectively, by using 2-D circular memory addressing

scheme, in which the trace back read and the decode read operations

are performed by using one pointer instead of multiple ones. In view of

the memory depths of WR, TB, and DR, it is noted that the read rate

should be three times that of the write rate, so that the two read-

phases and one write-phase can be performed simultaneously within

the same time period. The write pointer advances forward from stage

to stage in the trellis, and the decisions made by the ACS

computations at all the trellis states regarding the survivors are

written into the memory block that is just freed by the decode read

operation. The read pointer traces back from stage to stage so that the


retrieved data after each read operation can be treated as a pointer to

indicate the corresponding state number of the previous stage. When

the decisions are written into the memory for one stage, the read

pointer will trace back the memory by three stages. After D such trace

backs, all the survivor paths will converge and the actual decoding

takes place. There are D/2 × 8 = 32 bits of data decoded at each trace

back iteration, and these retrieved data should be rearranged into their

original order.

The architecture of the trace back unit is shown in Fig. 3.6. The

four block select RAMs, each of size 1024 × 8 bits, provided by the

FPGA chip are used to implement the four memory banks,

RAM1024_0 to RAM1024_3. At each stage, the survivor decisions

made by the four pairs of the systolic arrays in Fig. 4.6 are written into

the four corresponding memory banks. At each trace back read, the

previously retrieved data (the state number) points to the current data

to be read. The most significant two bits, s7 and s6, of the previously

retrieved data in the register are sent to the 2/4 Decoder and

BUS_MUX. These bits are used to select the memory bank to be

accessed and to have its stored data retrieved. The least significant 6

bits, s5 to s0, of the previously retrieved data in the register are used to

address the 64 decisions in one of the memory banks selected at each

read operation. In other words, at every trace back operation, one of

the 64 decisions (corresponding to the 64 survivor paths) stored in the

four memory banks can be read by using the previously retrieved data

as a pointer. After D trace backs, corresponding to the eight strongly

connected trellis stages, the decoded data is loaded into the 8-bit shift

register and then shifted to the first-in-last-out (FILO) register. Finally,

the decoded data is sent out in the reverse order from the FILO

register.
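The memory sizing and the way the retrieved state number is used as a pointer can be sketched as follows. This is an illustration only; the function name `split_pointer` and the constant names are assumptions, and the sketch models the addressing arithmetic rather than the RAM timing.

```python
D = 8                   # trace back length chosen in the text
STATES_PER_BANK = 64    # survivor decisions stored per stage in each bank

# WR, TB and DR occupy D/2, D and D/2 stages, so the circular buffer is 2D deep.
depth = D // 2 + D + D // 2
bank_size = STATES_PER_BANK * depth
print(depth, bank_size)  # 16 1024


def split_pointer(state):
    """Split an 8-bit retrieved state number into (bank select, address)."""
    bank = (state >> 6) & 0b11      # two most significant bits: 2/4 decoder and BUS_MUX
    address = state & 0b111111      # six least significant bits: decision within a stage
    return bank, address


print(split_pointer(0b10110101))  # (2, 53)
```

The two select bits choose one of RAM1024_0 to RAM1024_3, and the remaining six bits pick one of the 64 decisions stored for that stage.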

3.4 CONCLUSION

This chapter described the basic Viterbi decoder, which uses the
Viterbi algorithm described in Chapter 2. This gives the basis for the
more complex Viterbi decoders developed in the next chapter.

CHAPTER 4

DESIGN CONSIDERATIONS

4.1 INTRODUCTION

In this chapter, we describe the design considerations of the

Adaptive Viterbi algorithm and propose a low-power design of Viterbi

decoders. We also describe the top-down design approach employed to

implement both the adaptive Viterbi decoder and the non-adaptive Viterbi

decoder for the constraint length of K=7 and the code rate of r=1/2.

4.2 MODULAR DESCRIPTION

In order to design and implement the modified adaptive Viterbi

decoder, we choose the following specifications that have been used

for hardware implementation of the Viterbi decoder:

1) Constraint length: K=7(64 states);

2) Code rate: r=1/2

3) Survivor path length: L=64

4) Decision level: 3-bit soft decision.

Figure 4.1 Block diagram of the adaptive Viterbi decoder.

The block diagram of the modified adaptive Viterbi decoder is

shown in Fig. 4.1. In this decoder, the composite branch metric

generation (CBM) unit is designed to collect the two soft input

sequences and to compute all the possible branch metrics

corresponding to the eight low-connectivity trellis stages. The path

metric update unit is designed to generate the composite branch

metrics and to process the matrix–vector ACS computation. The trace

back unit can retrieve the decoded sequence from the survivor path

memory through a trace back strategy.

Figure 4.2 input buffer for the two soft inputs

4.3 COMPOSITE BRANCH METRIC UNIT

4.3.1 Logic Diagram

Figure 4.3 Unit to compute the eight one-stage branch metrics

corresponding to a composite branch metric.

The input buffer is designed to gather the two soft input

sequences corresponding to a given strongly connected trellis stage so

that the composite branch metrics of the stage can be generated. As

shown in Fig. 4.2, the two soft inputs are sent to the corresponding 3-bit

registers depending on the enable signal generated by the 3/8

decoder. The 3/8 decoder outputs the eight enable signals in a

circular way so that the two soft input sequences corresponding to a

strongly connected trellis stage are loaded into the 3-bit registers. The

sequences are then buffered into two 24-bit registers and used to

compute the corresponding eight one-stage branch metrics. Each of

the eight one-stage branch metrics has four possible results

depending on the corresponding two-bit codeword in the original low

connectivity trellis stage. Table I gives the expressions for the

evaluation of a one-stage branch metric under the four possible values

of the two-bit codeword, where in1 and in2 represent the two soft

inputs at the low-connectivity trellis stage. The eight one-stage branch

metrics corresponding to a composite branch metric are computed by

the unit shown in Fig. 4.3, in which each one-stage branch metric is

computed under the four possible values of the two-bit codeword, and

is output to the array processors in BM_4s shown in Fig. 4.6.

4.3.2 Source Code Description

In this module, composite branch metrics are calculated by
combining the 8 one-stage branch metrics corresponding to a
composite branch metric. The inputs a, b, and reset are applied to the
one-stage branch metric units to generate the 9-bit BMU output.

4.4 ADD-COMPARE-SELECT

4.4.1 Logic Diagram

Figure 4.4 Arithmetic pipelined processor for ACS-4

The arithmetic-pipelining processor for an ACS_4 is shown in

Fig. 4.4. This processor is implemented with two pipeline stages. The

adder in the first pipeline stage is used to compute the 64 path

metrics at any of the corresponding 16 states of a given stage, whereas

the adder in the second pipeline stage is used to make comparisons

amongst the 64 path metrics computed by the first pipeline stage. The

length of the adders including a sign bit in the two pipeline stages

should be 10 bits in order to perform the modulo arithmetic. In the

second pipeline stage, the comparison between the path metrics

computed at the current and previous clock cycles at a state of a given

stage are carried out by adding the path metric at the current clock

cycle to the complement of the path metric at the previous clock cycle,

increased by unity. In this way, the sign-bit of the adder can

represent the result of the comparison between the two path metrics.

If the sign-bit is “1,” the path metric at the current clock cycle is

smaller than the path metric at the previous clock cycle, and if the

sign-bit is “0,” then the former is larger than the latter. In the


meantime, the path metric with the smaller metric at the state of a

given stage is selected by the multiplexer (Mux1) and sent to the

Register. In addition, the 8-bit counter in the second pipeline stage is

used to represent the 64 possible states of the previous stage from

which the paths originate and terminate at the various states of the

present stage. One of these 64 paths would be the survivor path

depending on the sign bit of the adder in the second pipeline stage,

and is selected by the multiplexer (Mux2) at each clock cycle. Thus, at

a given stage, the path metric and the survivor corresponding to any

of the 16 states can be updated through 64 clock cycles. It should be

mentioned that the critical path of the design as determined from the

Xilinx report is the second pipeline stage and consists of the 10-bit

adder and the multiplexer. The synchronization between ACS_4 and

BM_4 can be achieved by simply designing a timing scheme that

makes them start at different time instants. This means that at a

given stage the processors in an ACS_4 will not start their

computations until the results from the processors in the

corresponding BM_4 have been generated. In this way, throughput of

the design will not be changed, but a latency between ACS_4 and

BM_4 is introduced. This is in fact the latency of any processor in

BM_4.
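The comparison performed in the second pipeline stage, adding the current path metric to the complement of the previous one plus unity and then reading the sign bit, can be sketched as below. The 10-bit width follows the text; the function name `is_smaller` is hypothetical, and the sketch assumes the true difference between the two metrics is always smaller than half the modulus (512), which is the usual condition for modulo comparison to be valid.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1


def is_smaller(current, previous):
    """Return True when the sign bit says `current` is the smaller metric."""
    # current + (complement of previous) + 1, kept to WIDTH bits
    diff = (current + ((~previous) & MASK) + 1) & MASK
    sign = (diff >> (WIDTH - 1)) & 1
    return sign == 1


print(is_smaller(5, 9))      # True: 5 is smaller than 9
print(is_smaller(9, 5))      # False
# A metric that has wrapped around (1027 stored as 3) still compares as larger
# than 1020, because only the difference modulo 2**WIDTH matters.
print(is_smaller(3, 1020))   # False
```

Because only the sign bit of the 10-bit result is inspected, the path metrics can be left to wrap around instead of being renormalized.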

Systolic Array Architecture Based Branch and Path metric

Updating

The modified adaptive Viterbi algorithm can be implemented
using the systolic array architecture shown in Fig. 4.5, in which time
multiplexing and arithmetic pipelining are exploited. For K=7, the
matrix-vector ACS computation can be processed by the 64×64 adjacency
matrix Bq and the 1×64 path metric vector Pq-1. In this design, the
adjacency matrix Bq is partitioned into four 64×16 submatrices, and the
four submatrices along with the path metric vector Pq-1 are used to
update the corresponding 64 path metrics of the given stage. This is
achieved by the four pairs of interconnected systolic arrays formed as
shown in Fig. 4.5, where the top four interconnected processors
represent the systolic array BM_4 and the bottom four interconnected
processors represent the systolic array ACS_4.

Figure 4.5 Architecture of a pair of systolic arrays.

In the top systolic array, the composite branch metric is
generated by using equations (2) and (3). The codeword vector C0j is
divided into four 1×64 subvectors to be stored in four ROMs, ROM0 to
ROM3, and the codeword Ci0 is stored in the ROM. The codeword C0j
stays inside each processor, and Ci0 moves to the right. In the bottom
systolic array, the path metric Pq(j) of stage q is computed according
to (10) and decisions on the survivor paths are made. The
intermediate ACS result Rq(j) stays inside the j-th
processor, whereas Pq-1(j) of stage q-1 moves to the right. All data
movements in the two systolic arrays are synchronized. This design
can thus be viewed as a “result stay” systolic array architecture.

Figure 4.6 Systolic array based architecture for adaptive viterbi

decoder

As shown in Fig. 4.6, for computing the corresponding four

subsets of Bq at each stage, the codeword vector C0j is divided into

four 1×64 subvectors to be stored in four ROMs, ROM0 to ROM3,

and the codeword vector Ci0 is stored in the ROM. Globally, the four

submatrix-vector ACS computations are carried out simultaneously

by the corresponding four pairs of systolic arrays. Locally, inside each

pair of the systolic arrays, the corresponding submatrix-vector ACS

computation is time multiplexed. For a given K, the number of trellis

states at a given stage is equal to 2^(K−1). If the time-multiplexing

technique is not employed, the number of pairs of systolic arrays

(each array having 4 processors) is 2^(K−1)/4 = 2^(K−3). For K=7 this is 64,


and the total number of array processors is 512. As K increases, the

number of array processors will increase exponentially, which is the

same as in the case of the conventional butterfly architecture-based

parallel Viterbi decoder. However, we employ a time-multiplexing

technique. In this technique, when K=7, each pair of systolic arrays

(each array having four processors) processes 64 out of 256 states

sequentially in 16 iterations. For a given K, the more systolic array

pairs are used, the fewer iterations are needed.

Thus, as K increases, our time-multiplexing approach provides the

flexibility to trade off the number of iterations with the number of

systolic array pairs.

4.4.2 Source code Description

In this module, the arithmetic-pipeline-based add-compare-select
unit is developed. The two 9-bit inputs from the BMU are given as
inputs along with decode_in, m, n, p, q, r, s, clk_sm, en, en_main, en2,
and reset. This module generates 9-bit outputs, which will be inputs
for the trace back operation.

4.5 TRACE BACK UNIT

4.5.1 Logic Diagram

Figure 4.7 Architecture of the trace back unit.


In the FPGA-based implementation, the memory for storing the

survivor paths is divided into four memory banks, each of which

stores 64 survivor paths. The trace back length of the design is chosen

as D=8 that is equivalent to a trace back length of 8(K-1)=64 low-

connectivity trellis stages. As mentioned earlier, this is large enough to

ensure the convergence of the survivor paths. Since a 2-D circular

memory addressing scheme is used in this design, the size of each

memory bank should be equal to 64 × 2D = 1024, and its word length is 8

bits. The survivor path memory is organized as shown in Fig. 4.7.

There are three operation phases in the management of the

memory, write new data (WR), trace back read (TB), and decode read

(DR), which occupy memory with depths of D/2, D, and D/2,

respectively. These three operations access the three logical blocks of

the memory, respectively, by using 2-D circular memory addressing

scheme, in which the trace back read and the decode read operations

are performed by using one pointer instead of multiple ones. In view of

the memory depths of WR, TB, and DR, it is noted that the read rate

should be three times that of the write rate, so that the two read-

phases and one write-phase can be performed simultaneously within

the same time period. The write pointer advances forward from stage

to stage in the trellis, and the decisions made by the ACS

computations at all the trellis states regarding the survivors are

written into the memory block that is just freed by the decode read

operation. The read pointer traces back from stage to stage so that the

retrieved data after each read operation can be treated as a pointer to

indicate the corresponding state number of the previous stage. When

the decisions are written into the memory for one stage, the read

pointer will trace back the memory by three stages. After D such trace

backs, all the survivor paths will converge and the actual decoding

takes place. There are D/2X8 = 32 bits of data decoded at each trace

back iteration, and these retrieved data should be rearranged in its

original order
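The memory figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the names (`D`, `PATHS_PER_BANK`, `BITS_PER_STAGE`, `phase_depth`) are chosen for this example and do not come from the project source code, and the value of 8 decoded bits per stage is simply taken from the expression (D/2) × 8 in the text.

```python
# Illustrative check of the survivor-memory figures quoted in the text.
D = 8                  # trace back length (strongly connected stages)
PATHS_PER_BANK = 64    # survivor paths stored per memory bank
BITS_PER_STAGE = 8     # assumption: bits decoded per stage, from (D/2) x 8

# Depth (in stages) occupied by each operation phase.
phase_depth = {"WR": D // 2, "TB": D, "DR": D // 2}

# The three logical blocks together form a circular buffer of 2D stages.
total_depth = sum(phase_depth.values())
words_per_bank = PATHS_PER_BANK * total_depth

# Stages read (TB + DR) for every stage written (WR).
read_to_write_ratio = (phase_depth["TB"] + phase_depth["DR"]) // phase_depth["WR"]

decoded_bits_per_iteration = phase_depth["DR"] * BITS_PER_STAGE

print(total_depth, words_per_bank, read_to_write_ratio, decoded_bits_per_iteration)
```

With D = 8 the sketch reproduces the bank size of 1024 words, the 3:1 read-to-write ratio, and the 32 decoded bits per trace back iteration.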


The architecture of the trace back unit is shown in Fig. 4.8. The
four block select RAMs provided by the FPGA chip, each of size
1024 × 8 bits, are used to implement the four memory banks,
RAM1024_0 to RAM1024_3. At each stage, the survivor decisions
made by the four pairs of systolic arrays in Fig. 4.7 are written
into the four corresponding memory banks. At each trace back read,
the previously retrieved data (the state number) points to the current
data to be read. The two most significant bits of the previously
retrieved data in the register are sent to the 2/4 decoder and
BUS_MUX. These bits select the memory bank to be accessed and to
have its stored data retrieved. The six least significant bits of the
previously retrieved data in the register are used to address the
entry within the selected bank for the read operation. In other words,
at every trace back operation, one of the 64 decisions (corresponding
to the 64 survivor paths) stored in the four memory banks can be read
by using the previously retrieved data as a pointer. After D trace
backs, corresponding to the eight strongly connected trellis stages,
the decoded data is loaded into the 8-bit shift register and then
shifted to the first-in-last-out (FILO) register. Finally, the decoded
data is sent out in the reverse order from the FILO register.
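The pointer-chasing read described above can be sketched in software. The Python below is a simplified model under stated assumptions, not the project's VHDL: the function names `split_pointer` and `trace_back`, and the layout `banks[bank][stage][offset]`, are hypothetical and chosen only for illustration.

```python
def split_pointer(word):
    """Split an 8-bit retrieved word into (bank, offset)."""
    bank = (word >> 6) & 0x3    # two MSBs: drive the 2/4 decoder and BUS_MUX
    offset = word & 0x3F        # six LSBs: one of 64 entries in the bank
    return bank, offset


def trace_back(banks, start_word, depth):
    """Follow `depth` pointers backwards through the survivor memory.

    banks[bank][stage][offset] holds the 8-bit word written for that stage
    (assumed layout). Each retrieved word is the pointer for the next read.
    """
    word = start_word
    retrieved = []
    for stage in range(depth - 1, -1, -1):
        bank, offset = split_pointer(word)
        word = banks[bank][stage][offset]
        retrieved.append(word)
    # Model of the FILO register: output in reverse order of retrieval.
    return retrieved[::-1]
```

Because each read needs the result of the previous one, the reads are inherently sequential, which is consistent with the text's requirement that the read rate exceed the write rate.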

4.5.2 Source code Description

In this module, the source code is developed based on the
inputs from the BMU and ACSU. This module retraces the Hamming
distances. It uses the previously retrieved data in the register as a
pointer for the next read operation.


4.6 ADAPTIVE VITERBI DECODER

4.6.1 Logic Diagram

Figure 4.8 Block diagram of the adaptive Viterbi decoder.

Here all the sub modules, i.e., the CBMU, the systolic architecture
of the AVA, and the trace back unit, are integrated to decode the
encoded information shown in 5.11 for the constraint length K = 7 and
the code rate r = 1/2.

In this decoder, the composite branch metric generation (CBM)
unit is designed to collect the two soft input sequences and to
compute all the possible branch metrics corresponding to the eight
low-connectivity trellis stages. The path metric update unit is
designed to generate the composite branch metrics and to process the
matrix-vector ACS computation. The trace back unit retrieves the
decoded sequence from the survivor path memory through a trace back
strategy.

4.6.2 Source code Description

Here all the sub module codes are integrated into the top module.
Decoder_in, clk_sm, en, and reset are given as inputs to the top
module, and a 12-bit count_div_2_out, onethird_clk_out, and viterbi_out
are generated, giving error-free decoded data. The convolutional
decoder using the adaptive Viterbi algorithm is an improved version.


4.7. CONCLUSION

In this chapter, the three sub modules and the top module are
described in detail. Based on this, the source code implementation is
taken up in the next chapters.


CHAPTER 5

SIMULATION RESULTS

5.1 INTRODUCTION

This chapter focuses on the program flow description of each
module. It gives the simulation results of the encoder, composite
branch metric unit, add compare select unit, and survivor memory unit
modules and of the adaptive Viterbi decoder (top module). All these
modules are synthesized using the Xilinx ISE Navigator tool.

Simulations are usually divided into the following five categories:

• Behavioral simulation
• Functional simulation
• Gate-level simulation or post-synthesis simulation
• Switch-level simulation
• Transistor-level or circuit-level simulation

This list is ordered from high-level to low-level simulation
(high-level being more abstract and low-level being more detailed).
Proceeding from high-level to low-level, the simulations become more
accurate, but they also become progressively more complex and take
longer to run. While it is possible to perform a behavioral-level
simulation of the whole system, it is impractical to perform a
circuit-level simulation of more than a few hundred transistors.

Behavioral Simulation

This method models large pieces of a system as black boxes
with inputs and outputs. This is often done using VHDL or Verilog.

Functional Simulation

This simulation ignores timing and includes unit-delay
simulation, which sets the delays to a fixed value. Once a behavioral
or functional simulation verifies that the system works, the next step
is to check the timing performance.

Logic Simulation or Gate-level Simulation

This simulation is used to check the timing performance of an
ASIC. In gate-level simulation, a logic cell is treated as a black box
modeled by a function whose variables are the input signals. The
function may also model the delay through the logic cell. Setting all
the delays to unit values is the equivalent of functional simulation.

Switch-level Simulation

This simulation can provide more accurate timing predictions

than Gate-level simulation.

Transistor-level Simulation

This is the most accurate, but also the most complex and
time-consuming, of all the simulations. It requires models of the
transistors that describe their nonlinear voltage and current
characteristics.

Simulation can also be divided on the basis of layout into two
categories:

• Pre-layout simulation
• Post-layout simulation

Simulation is used at many stages during the design. Initial
pre-layout simulation includes logic-cell delays but no interconnect
delays. Estimates of capacitance may be included after completing
logic synthesis, but post-layout simulation can be performed only
after the physical design is complete. In post-layout simulation, an
SDF (Standard Delay Format) file is included in the simulation
environment.


5.2 ENCODER

5.2.1 Simulation results

Figure 5.1 Simulation results of Encoder

5.2.2 Signal Description

S.No.  Signal name  Type    Description
1.     Clk          Input   Triggers output
2.     Data_in      Input   1-bit stream
3.     Reset        Input   Reset encoder
4.     V2v1         Output  Convolution output


Port diagram

Figure 5.2 port diagram of encoder

5.2.3 Logical description

The encoder block takes 1-bit input data and, after encoding,
generates two outputs, V1 and V2. To calculate V1, it performs an XOR
operation on the input bit and the F6, F5, F4, and F1 outputs, whereas
to calculate V2, it performs an XOR operation on the input bit and the
F5, F4, F2, and F1 outputs.

Figure 5.1 shows the simulation result for the convolutional
encoder. Depending on the input, data_in, it generates the output
v2v1 of 6-bit data.
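The XOR equations for V1 and V2 can be tried out in a short behavioural model. The Python sketch below is illustrative and is not the project's VHDL. It assumes that F6 holds the most recent input bit and F1 the oldest, which the text does not state; under that assumed ordering the taps match the widely used generator pair (171, 133) in octal. The function name `conv_encode` is hypothetical.

```python
def conv_encode(bits):
    """Rate-1/2, K = 7 convolutional encoder (behavioural sketch).

    Assumption: the six flip-flops F1..F6 form a shift register in which
    F6 holds the most recent input bit and F1 the oldest.
    Returns a list of (v2, v1) pairs, one pair per input bit.
    """
    f1 = f2 = f3 = f4 = f5 = f6 = 0   # encoder reset state
    out = []
    for b in bits:
        v1 = b ^ f6 ^ f5 ^ f4 ^ f1    # V1 = input xor F6, F5, F4, F1
        v2 = b ^ f5 ^ f4 ^ f2 ^ f1    # V2 = input xor F5, F4, F2, F1
        out.append((v2, v1))
        # Shift: the new bit enters F6, everything else moves toward F1.
        f1, f2, f3, f4, f5, f6 = f2, f3, f4, f5, f6, b
    return out


# Impulse response: a single 1 followed by zeros exposes the tap pattern.
impulse = conv_encode([1, 0, 0, 0, 0, 0, 0])
print([v1 for _, v1 in impulse])
print([v2 for v2, _ in impulse])
```

Feeding a single 1 followed by zeros is a convenient way to check a tap assignment, since each output stream then reads out its own generator polynomial bit by bit.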


5.3 COMPOSITE BRANCH METRIC UNIT

5.3.1 Simulation Results

Figure 5.3 simulation results of composite branch metric unit


5.3.2 Signal Description

S.No.  Signal name  Type    Description
1.     a            Input   Load 2-bit input data
2.     b            Input   Load 2-bit input data
3.     Reset        Input   Reset pin is used to restart the BMU
4.     c            Output  Generates 9-bit data


Port diagram

Figure 5.4 port diagram of composite branch metric unit

5.3.3 Logical Description

The two soft inputs are sent to the corresponding 3-bit registers

depending on the enable signal generated by the 3/8 decoder. The 3/8

decoder outputs the eight enable signals in a circular way so that the

two soft input sequences corresponding to a strongly connected trellis

stage are loaded into the 3-bit registers.
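The circular enabling of the input registers can be modelled as a modulo-8 counter feeding a one-hot decoder. The Python sketch below is only a behavioural illustration under that assumption; the names `one_hot_enables` and `load_soft_inputs` are hypothetical, and the soft inputs are represented as plain tuples.

```python
def one_hot_enables(count):
    """3/8 decoder: a 3-bit count selects exactly one of eight enables."""
    return [1 if i == (count & 0x7) else 0 for i in range(8)]


def load_soft_inputs(pairs):
    """Load successive soft-input pairs into eight register slots.

    The enable advances circularly, so the ninth pair overwrites slot 0.
    """
    registers = [None] * 8
    for t, pair in enumerate(pairs):
        enables = one_hot_enables(t % 8)
        for slot, en in enumerate(enables):
            if en:
                registers[slot] = pair
    return registers


print(one_hot_enables(3))
```

After eight consecutive loads every slot holds one pair of soft inputs, which is the point at which a full strongly connected trellis stage is available.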

5.4 ADD COMPARE SELECT UNIT

5.4.1 Simulation Results

Figure 5.5 Simulation result of Add Compare Select Unit



5.4.2 Signal Description

S.No.  Signal name  Type    Description
1.     decoder_in   Input   Load 2-bit input data, say '11'
2.     m            Input   Load 2-bit input data, '00'
3.     n            Input   Reset pin is used to restart the BMU
4.     p            Input   Load 2-bit data, say '01'
5.     q            Input   Load 2-bit input data, say '00'
6.     r            Input   Load 9-bit input data from BMU
7.     s            Input   Load 9-bit input data from BMU
8.     clk_sm       Input   If clk_sm = '1' then the ACS process starts
9.     en           Input   en input '1'
10.    en_main      Input   en_main input '1'
11.    en2          Input   2nd enable
12.    reset        Input   For resetting ACSU
13.    x            Output  9-bit output for TBU
14.    y            Output  9-bit output for TBU
15.    e            Output  Input for TBU
16.    f            Output  Input for TBU


Port diagram

Figure 5.6 port diagram of ACSU


5.4.3 Logical Description

The length of the adders, including a sign bit, in the two pipeline
stages should be 10 bits in order to perform the modulo arithmetic. In
the second pipeline stage, the comparison between the path metrics
computed at the current and previous clock cycles at a state of a given
stage is carried out by adding the path metric at the current clock
cycle to the complement of the path metric at the previous clock cycle,
increased by unity.
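Adding the complement of a value, increased by one, is two's-complement subtraction, and the sign bit of the result decides the comparison. The Python sketch below illustrates this for 10-bit words; it is a generic model of modulo comparison rather than the project's code, the names are hypothetical, and it assumes the true difference between the two path metrics is smaller than half the 10-bit range.

```python
WIDTH = 10                  # adder length including the sign bit
MASK = (1 << WIDTH) - 1     # keeps results to 10 bits (modulo 1024)


def modulo_subtract(current, previous):
    """current + (complement of previous) + 1, truncated to WIDTH bits."""
    return (current + ((~previous) & MASK) + 1) & MASK


def current_is_smaller(current, previous):
    """True when the sign bit of the 10-bit difference is set.

    Valid as long as the two metrics differ by less than 2**(WIDTH - 1),
    even if one of them has wrapped around.
    """
    return bool(modulo_subtract(current, previous) >> (WIDTH - 1))


print(current_is_smaller(5, 9))      # plain case
print(current_is_smaller(1020, 3))   # previous metric has wrapped past 1023
```

The second call shows why the modulo form is useful: a metric that has wrapped around from 1027 to 3 is still correctly treated as larger than 1020, so the path metrics never need to be rescaled.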


5.5 TRACE BACK UNIT

5.5.1 Simulation results

Figure 5.7 Simulation Result of Trace Back Unit.

5.5.2 Signal description

The trace back operation involves 64 survivor memory paths and a
large number of signals.

5.5.3 Logical description

The memory for storing the survivor paths is divided into four memory
banks, each of which stores 64 survivor paths. The trace back length
of the design is chosen as D = 8, which is equivalent to a trace back
length of 8(K-1) = 64 low-connectivity trellis stages.

5.6 VITERBI DECODER

Here all the sub modules, i.e., the CBMU, the systolic architecture
of the AVA, and the trace back unit, are integrated to decode the
encoded information shown in 5.11 for the constraint length K = 7 and
the code rate r = 1/2. The simulation results are as follows.


5.6.1 Simulation Results

Figure 5.8 Simulation Result of complete Adaptive Viterbi Decoder


5.6.2 Signal Description

S.No.  Signal name       Type    Description
1.     Decoder_in        Input   Convolutional input
2.     Clk_sm            Input   Clock for triggering
3.     En                Input   Enables Viterbi decoder
4.     Reset             Input   For resetting decoder
5.     Count_div_2_out   Output  11-bit decoder output
6.     Onethird_clk_out  Output  Clock reduction
7.     Viterbi_out       Output  Viterbi output

5.6.3 Logical explanation

In this decoder, the composite branch metric generation (CBM) unit is
designed to collect the two soft input sequences and to compute all
the possible branch metrics corresponding to the eight
low-connectivity trellis stages. The path metric update unit is
designed to generate the composite branch metrics and to process the
matrix-vector ACS computation. The trace back unit retrieves the
decoded sequence from the survivor path memory through a trace back
strategy.

5.7 CONCLUSION

The focus of the work in the low-power design of Viterbi decoders
at the logic level is the reduction of dynamic power dissipation in the
standard cell design environment. We considered two methods in this
low-power design: strongly connected trellis decoding and systolic
array processing. These methods were applied to the survivor path
storage block to reduce the number of ACS computations.

The design flow that was followed is similar to the one used in
industry. The behavior of the Viterbi decoder is described in VHDL.
The design is then synthesized to generate a gate-level circuit, with
the Virtex-II Pro xc2vp30-7-ff896 as the target device in the Xilinx
tool.


CHAPTER 6

FPGA IMPLEMENTATION

6.1 INTRODUCTION

This chapter describes the FPGA implementation of the
proposed work using hardware description languages. This includes
the synthesis of the top module, the Viterbi decoder. It describes the
RTL view, port diagram, timing analysis, floor plan, chip view, etc.
of the implemented work.

6.2 SYNTHESIS REPORT

Figure 6.1 Synthesis report of Viterbi decoder

HDL Synthesis Report of Adaptive Viterbi Decoder in Virtex-II Pro
xc2vp30-7-ff896

Macro statistics

# RAMs : 4

1024x8-bit single-port RAM : 4

# Adders/Subtractors : 161

10-bit adder : 48

11-bit adder : 1

4-bit adder carry out : 16

5-bit adder : 16


6-bit adder : 16


7-bit adder : 16


# Counters : 16

8-bit up counter : 16

# Registers : 529

1-bit register : 16

10-bit register : 17

11-bit register : 1

16-bit register : 6


4-bit register : 336

5-bit register : 32

6-bit register : 32

7-bit register : 32

8-bit register : 57

# Latches : 1

4-bit latch : 1

# Multiplexers : 138

10-bit 4-to-1 multiplexer : 4




# Xors : 4

16-bit xor2 : 4

=========================================================================

* Final Report *

=========================================================================

Final Results

RTL Top Level Output File Name : viterbi_full.ngr

Top Level Output File Name : viterbi_full


Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 19

Cell Usage :

# BELS : 7147

# BUF : 9

# GND : 1

# INV : 4

# LUT1 : 779

# LUT2 : 1799

# LUT2_D : 1

# LUT3 : 812

# LUT3_L : 6

# LUT4 : 993

# LUT4_D : 14

# MUXCY : 1627

# MUXF5 : 110

# MUXF6 : 48

# MUXF7 : 24


# MUXF8 : 12

# VCC : 1

# XORCY : 907

# FlipFlops/Latches : 2854

# FDCE : 2461

# FDE : 384

# FDPE : 1

# FDPE_1 : 1

# LD : 6

# LDCE_1 : 1

# RAMS : 384

# RAM32X1S : 384

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 18

# IBUF : 4

# OBUF : 14

=========================================================================

Device utilization summary:

---------------------------

Selected Device : 2vp30ff896-7

Number of Slices: 2778 out of 13696 20%

Number of Slice Flip Flops: 2854 out of 27392 10%

Number of 4 input LUTs: 5176 out of 27392 18%

Number used as logic: 4408

Number used as RAMs: 768

Number of IOs: 19

Number of bonded IOBs: 19 out of 556 3%

Number of GCLKs: 1 out of 16 6%
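The percentages in the device utilization summary can be reproduced from the used and available counts. The Python sketch below is a cross-check only; the dictionary keys and the function name `percent` are made up for this example, and the use of truncation rather than rounding is an assumption inferred from the figures in the report.

```python
def percent(used, available):
    """Integer percentage, truncated (assumed to match the report)."""
    return used * 100 // available


# (used, available) pairs copied from the device utilization summary.
utilization = {
    "slices": (2778, 13696),
    "slice_flip_flops": (2854, 27392),
    "lut4": (5176, 27392),
    "bonded_iobs": (19, 556),
    "gclks": (1, 16),
}

percentages = {name: percent(u, a) for name, (u, a) in utilization.items()}

# LUTs used as logic plus LUTs used as RAM should equal the LUT total.
lut_total = 4408 + 768

# Flip-flop and latch primitives from the cell usage list.
flip_flop_total = 2461 + 384 + 1 + 1 + 6 + 1

print(percentages, lut_total, flip_flop_total)
```

The two sums agree with the totals printed in the report (5176 four-input LUTs and 2854 flip-flops/latches), which suggests the listing is internally consistent.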

Observation / comments

The synthesis report gives the gate-level netlist generated from a
model described in VHDL. It gives complete information about the top
module, the Viterbi decoder, in the form of the number of logic gates,
tri-state buffers, flip-flops, and registers used. The design is
synthesized using the Xilinx ISE Navigator tool. The device utilized is
xc2vp30-7-ff896.


6.3 RTL VIEW

The RTL view of Viterbi decoder is shown below.

Figure 6.2 RTL Schematic of Adaptive Viterbi Decoder

The RTL view gives a detailed account of the number of registers,
multiplexers, buffers, etc. in the implementation. The adaptive Viterbi
decoder takes a little more area than the non-adaptive Viterbi decoder,
but it offers better power performance and a smaller delay.


6.4 PORT DIAGRAM

Figure 6.3 Port diagram of Viterbi Decoder

The port diagram gives an overall view of the inputs and outputs.
It shows how a convolutionally encoded input passing through the
Viterbi decoder yields an error-free reproduction of the output.

6.5 TIMING ANALYSIS

Timing Summary:

---------------

Speed Grade: -7

Minimum period: 5.176ns (Maximum Frequency: 193.192MHz)

Minimum input arrival time before clock: 6.206ns

Maximum output required time after clock: 3.514ns

Maximum combinational path delay: No path found
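The maximum frequency in the timing summary is the reciprocal of the minimum clock period. The short Python sketch below shows the conversion; the function name `max_frequency_mhz` is made up for this example. The result agrees with the reported 193.192 MHz to within the rounding of the printed period.

```python
def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    return 1000.0 / min_period_ns


f_max = max_frequency_mhz(5.176)
print(round(f_max, 1))
```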


6.6 FLOOR PLAN

Figure 6.4 Floor plan of Viterbi decoder

The floor plan shows the layout on the IC and gives a clear idea
of the area of the device that is utilized.

6.7 CHIP VIEW

Figure 6.5 Chip view of Viterbi decoder

The chip view gives an idea of the chip layout, which helps
before making the final ASIC.

6.8 POWER DATA

Figure 6.6 power data of viterbi Decoder

The power data gives the final description of the power
calculation. In our implementation, 28.1 mW of power is consumed. We
measured the dynamic power dissipation of the circuit based on the
switching activity from gate-level simulation. Static power
dissipation of the cells is not considered in the measurement. This
method is a compromise between power and area, and it is sufficient
for our purpose.

6.9 CONCLUSION

Comparison of Synthesis Results

Parameter            Adaptive Viterbi Decoder   Non-Adaptive Viterbi Decoder
Power                28.1 mW                    103 mW
Delay                1.13 ns                    2.71 ns
Area (No. of IOBs)   223                        205

The design flow that was followed is similar to the one used in
industry.

CHAPTER 7

CONCLUSION AND FUTURE SCOPE OF WORK

7.1 CONCLUSION

- Convolutional coding is a coding scheme often employed in
deep space communications and, more recently, in digital wireless
communications.

- Viterbi decoders are used to decode convolutional codes. Viterbi
decoders employed in digital wireless communications are complex to
implement and dissipate considerable power. We proposed a Viterbi
decoder design for low power dissipation in this work.

- The focus of the work in the low-power design of Viterbi decoders
at the logic level is the reduction of dynamic power dissipation in the
standard cell design environment.

- We considered two methods in this low-power design: strongly
connected trellis decoding and systolic array processing. These
methods were applied to the survivor path storage block to reduce the
number of ACS computations.

- The design flow that was followed is similar to the one used in
industry.

- We measured the dynamic power dissipation of the circuit based
on the switching activity from gate-level simulation. Static power
dissipation of the cells is not considered in the measurement. This
method is a compromise between power and area, and it is sufficient
for our purpose.

- The systolic array processing with a strongly connected trellis
for the adaptive Viterbi decoder dissipates 28.1 mW and has a delay of
1.13 ns. The power saving is 66% compared with non-adaptive Viterbi
decoding, with a small increase in area.

- The data rates achievable can be very high compared to
decoders of constraint lengths 3 or 4.

7.2 FUTURE SCOPE

- As VLSI technology advances rapidly, Viterbi decoders with higher
constraint lengths of 9 or 11 can be implemented in the future for
higher data rates.


APPENDIX-A

VLSI DESIGN FLOW

INTRODUCTION

The word digital has made a dramatic impact on our society.
More significant is a continuous trend towards digital solutions in all
areas, from electronic instrumentation, control, data manipulation,
signal processing, and telecommunications to consumer electronics.
Development of such solutions has been possible due to good digital
system design and modelling techniques.

CONVENTIONAL APPROACH TO DIGITAL DESIGN

Digital ICs of SSI and MSI types have become universally

standardized and have been accepted for use. Whenever a designer

has to realize a digital function, he uses a standard set of ICs along

with a minimal set of additional discrete circuitry.

Consider a simple example of realizing a function as

$Q_{n+1} = Q_n + (A \cdot B)$

Here $Q_n$, A, and B are Boolean variables, with $Q_n$ being the value
of Q at the nth time step. Here $A \cdot B$ signifies the logical AND
of A and B; the '+' symbol signifies the logical OR of the logic
variables on either side. A circuit to realize the function is shown in
Figure 1. The circuit can be realized in terms of two ICs: an A-O-I
gate and a flip-flop. It can be directly wired up, tested, and used.
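The behaviour of this function over successive time steps can be illustrated with a small Python sketch. It models only the logic, with 0/1 integers standing in for signal levels and one loop iteration standing in for one clock edge; the function names `next_q` and `simulate` are chosen for this example.

```python
def next_q(q, a, b):
    """One clock step: next Q = Q OR (A AND B), on 0/1 values."""
    return q | (a & b)


def simulate(q0, inputs):
    """Return the sequence of Q values for a list of (A, B) input pairs."""
    states = [q0]
    for a, b in inputs:
        states.append(next_q(states[-1], a, b))
    return states


print(simulate(0, [(0, 1), (1, 1), (0, 0)]))
```

Once A and B are both 1 at the same step, Q is set and stays set, since the OR with the previous value of Q can never clear it.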

Figure 1 A simple digital circuit

With comparatively larger circuits, the task mostly reduces to one of

identifying the set of ICs necessary for the job and interconnecting;

rarely does one have to resort to a micro level design [Wakerly]. The

accepted approach to digital design here is a mix of the top-down and

bottom-up approaches as follows [Hill & Peterson]:

• Decide the requirements at the system level and translate them to

circuit requirements.

• Identify the major functional blocks required, such as a timer, DMA
unit, register file, etc., as in the design of a processor.

• Whenever a function can be realized using a standard IC, use the

same –for example programmable counter, mux, demux, etc.

• Whenever the above is not possible, form the circuit to carry out the

block functions using standard SSI – for example gates, flip-flops, etc.

• Use additional components like transistor, diode, resistor, capacitor,

etc., wherever essential.

Once the above steps are gone through, a paper design is ready.
Starting with the paper design, one has to do a circuit layout. The
physical location of all the components is tentatively decided; they are
interconnected and the 'circuit-on-paper' is made ready. Once a paper
design is done, a layout is carried out and a net-list prepared. Based
on this, the PCB is fabricated and populated and all the populated
cards tested and debugged.

Figure 2 Sequence of steps in conventional electronic circuit design

At the debugging stage one may encounter three types of problems:

• Functional mismatch: The realized and expected functions are

different. One may have to go through the relevant functional block

carefully and locate any error logically. Finally the necessary

correction has to be carried out in hardware.

• Timing mismatch: The problem can manifest in different forms. One

possibility is due to the signal going through different propagation

delays in two paths and arriving at a point with a timing mismatch.

This can cause faulty operation. Another possibility is a race condition

in a circuit involving asynchronous feedback. This kind of problem

may call for elaborate debugging. The preferred practice is to do

debugging at smaller module stages and ensuring that feedback

through larger loops is avoided: It becomes essential to check for the

existence of long asynchronous loops.


• Overload: Some signals may be overloaded to such an extent that the

signal transition may be unduly delayed or even suppressed. The

problem manifests as reflections and erratic behaviour in some cases

(The signal has to be suitably buffered here.). In fact, overload on a

signal can lead to timing mismatches.

The above have to be carried out after completion of the prototype PCB

manufacturing; it involves cost, time, and also a redesigning process

to develop a bug free design.

VLSI DESIGN

The complexity of the VLSI circuits being designed and used today
makes the manual approach to design impractical. Design automation is
the order of the day. With the rapid technological developments in the
last two decades, the status of VLSI technology is characterized by the
following [Wai-kai, Gopalan]:

• A steady increase in the size and hence the functionality of the ICs.

• A steady reduction in feature size and hence increase in the speed of

operation as well as gate or transistor density.

• A steady improvement in the predictability of circuit behaviour.

• A steady increase in the variety and size of software tools for VLSI

design.

The above developments have resulted in a proliferation of approaches

to VLSI design. We briefly describe the procedure of automated design

flow [Rabaey, Smith MJ]. The aim is more to bring out the role of a

Hardware Description Language (HDL) in the design process. An

abstraction based model is the basis of the automated design.


Abstraction Model

The model divides the whole design cycle into various domains

(see Figure 3). With such an abstraction through a division process

the design is carried out in different layers. The designer at one layer

can function without bothering about the layers above or below. The

thick horizontal lines separating the layers in the figure signify the

compartmentalization. As an example, let us consider design at the

gate level. The circuit to be designed would be described in terms of

truth tables and state tables. With these as available inputs, the designer has to

express them as Boolean logic equations and realize them in terms of

gates and flip-flops. In turn, these form the inputs to the layer

immediately below. Compartmentalization of the approach to design in

the manner described here is the essence of abstraction; it is the basis

for development and use of CAD tools in VLSI design at various levels.

The design methods at different levels use the respective aids

such as Boolean equations, truth tables, state transition table, etc.

But the aids play only a small role in the process. To complete a

design, one may have to switch from one tool to another, raising the

issues of tool compatibility and learning new environments.

ASIC DESIGN FLOW

As with any other technical activity, development of an ASIC

starts with an idea and takes tangible shape through the stages of

development as shown in Figure 4 and shown in detail in Figure 5.

The first step in the process is to expand the idea in terms of

behaviour of the target circuit. Through stages of programming, the

same is fully developed into a design description – in terms of well

defined standard constructs and conventions.

Figure 3 Design domain and levels of abstraction

Figure 4 Major activities in ASIC design

The design is tested through a simulation process; it is to check,

verify, and ensure that what is wanted is what is described.

Simulation is carried out through dedicated tools. With every

simulation run, the simulation results are studied to identify errors in

the design description. The errors are corrected and another

simulation run carried out. Simulation and changes to design


description together form a cyclic iterative process, repeated until an

error-free design is evolved.

Design description is an activity independent of the target technology

or manufacturer. It results in a description of the digital circuit. To

translate it into a tangible circuit, one goes through the physical

design process. The same constitutes a set of activities closely linked

to the manufacturer and the target Technology.

Design Description

The design is carried out in stages. The process of transforming

the idea into a detailed circuit description in terms of the elementary

circuit components constitutes design description. The final circuit of

such an IC can have up to a billion such components; it is arrived at

in a step-by-step manner. The first step in evolving the design

description is to describe the circuit in terms of its behaviour. The

description looks like a program in a high level language like C. Once

the behavioural level design description is ready, it is tested

extensively with the help of a simulation tool; it checks and confirms

that all the expected functions are carried out satisfactorily. If

necessary, this behavioural level routine is edited, modified, and rerun

– all done manually. Finally, one has a design for the expected system

– described at the behavioural level. The behavioural design forms the

input to the synthesis tools, for circuit synthesis. The behavioural

constructs not supported by the synthesis tools are replaced by data

flow and gate level constructs. To summarize, the designer has to develop

synthesizable code for the design.

Figure 5 ASIC design and development flow

The design at the behavioural level is to be elaborated in terms

of known and acknowledged functional blocks. It forms the next

detailed level of design description. Once again the design is to be

tested through simulation and iteratively corrected for errors. The

elaboration can be continued one or two steps further. It leads to a

detailed design description in terms of logic gates and transistor

switches.


Optimization

The circuit at the gate level – in terms of the gates and flip-flops

– can be redundant in nature. The same can be minimized with the

help of minimization tools. The step is not shown separately in the

figure. The minimized logical design is converted to a circuit in terms

of the switch level cells from standard libraries provided by the

foundries. The cell based design generated by the tool is the last step

in the logical design process; it forms the input to the first level of

physical design [Micheli].

Simulation

The design descriptions are tested for their functionality at every

level – behavioural, data flow, and gate. One has to check here

whether all the functions are carried out as expected and rectify them.

All such activities are carried out by the simulation tool. The tool also

has an editor to carry out any corrections to the source code.

Simulation involves testing the design for all its functions, functional

sequences, timing constraints, and specifications. Normally testing

and simulation at all the levels – behavioural to switch level – are

carried out by a single tool; the same is identified as “scope of

simulation tool” in Figure 5.

Synthesis

With the availability of design at the gate (switch) level, the

logical design is complete. The corresponding circuit hardware

realization is carried out by a synthesis tool. Two common approaches

are as follows:

• The circuit is realized through an FPGA [Oldfield]. The gate level

design description is the starting point for the synthesis here. The

FPGA vendors provide an interface to the synthesis tool. Through the

interface the gate level design is realized as a final circuit. With many

synthesis tools, one can directly use the design description at the data


flow level itself to realize the final circuit through an FPGA. The FPGA

route is attractive for limited volume production or a fast development

cycle.

• The circuit is realized as an ASIC. A typical ASIC vendor will have

his own library of basic components like elementary gates and flip-

flops. Eventually the circuit is to be realized by selecting such

components and interconnecting them conforming to the required

design. This constitutes the physical design. Being an elaborate and

costly process, a physical design may call for an intermediate

functional verification through the FPGA route. The circuit realized

through the FPGA is tested as a prototype. It provides another

opportunity for testing the design closer to the final circuit.

Physical Design

A fully tested and error-free design at the switch level can be the

starting point for a physical design [Baker & Boyce, Wolf]. It is to be

realized as the final circuit using (typically) a million components in

the foundry's library. The step-by-step activities in the process are

described briefly as follows:

• System partitioning: The design is partitioned into convenient

compartments or functional blocks. Often this will already have been

done at an earlier stage, with the software design prepared in terms

of such blocks. Interconnection of the blocks is part of the partition

process.

• Floor planning: The positions of the partitioned blocks are planned

and the blocks are arranged accordingly. The procedure is analogous

to the planning and arrangement of domestic furniture in a residence.

Blocks with I/O pins are kept close to the periphery; those which

interact frequently or through a large number of interconnections are

kept close together, and so on. Partitioning and floor planning may

have to be carried out and refined iteratively to yield best results.

• Placement: The selected components from the ASIC library are

placed in position on the “Silicon floor.” It is done with each of the

blocks above.

• Routing: The components placed as described above are to be

interconnected to the rest of the block. This is done for each of the

blocks by suitably routing the interconnects. Once the routing is

complete, the physical design can be taken as complete. The final

mask for the design can be made at this stage and the ASIC

manufactured in the foundry.
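
To illustrate why blocks that interact frequently are kept close together, the sketch below scores a placement by its half-perimeter wirelength (HPWL). HPWL is a commonly used estimate of routing length, chosen here for illustration; the text does not prescribe it. The block names, coordinates, and nets are hypothetical.

```python
def hpwl(placement, nets):
    """Total half-perimeter wirelength of all nets.

    placement: dict mapping block name -> (x, y) position
    nets:      list of nets, each a list of the block names it connects
    """
    total = 0
    for net in nets:
        xs = [placement[name][0] for name in net]
        ys = [placement[name][1] for name in net]
        # Half the perimeter of the bounding box enclosing the net's blocks
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical design: four blocks and three nets
nets = [["A", "B"], ["B", "C"], ["A", "B", "D"]]
spread = {"A": (0, 0), "B": (3, 3), "C": (0, 3), "D": (3, 0)}
compact = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (0, 1)}
```

Here hpwl(spread, nets) evaluates to 15 and hpwl(compact, nets) to 4, so the compact arrangement leaves less interconnect to be routed. An iterative floor planning or placement procedure could use such a figure to compare candidate arrangements.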

Post Layout Simulation

Once the placement and routing are completed, the performance

specifications like silicon area, power consumed, path delays, etc., can

be computed. An equivalent circuit can be extracted at the component

level and performance analysis carried out. This constitutes the final

stage called “verification.” One may have to go through the placement

and routing activity once again to improve performance.
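
One of the quantities computed at this stage, the worst-case path delay, can be sketched as a longest-path search over the extracted circuit. The example below is an illustrative assumption rather than the method of any particular tool: it treats the circuit as an acyclic graph, folds interconnect delay into each component's delay, and uses made-up component names and delay values.

```python
from functools import lru_cache


def critical_path_delay(delays, fanin):
    """Return the longest input-to-output delay of an acyclic circuit.

    delays: dict mapping component -> its delay (for example in ns)
    fanin:  dict mapping component -> list of components driving it
            (an empty list for components fed directly by primary inputs)
    """
    @lru_cache(maxsize=None)
    def arrival(node):
        # A component's output settles after its latest input plus its own delay
        latest_input = max((arrival(d) for d in fanin.get(node, [])), default=0)
        return latest_input + delays[node]

    return max(arrival(node) for node in delays)


# Hypothetical extracted circuit
delays = {"g1": 2, "g2": 3, "g3": 1, "out": 2}
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "out": ["g3", "g1"]}
```

For this circuit critical_path_delay(delays, fanin) gives 6, along the path g2, g3, out. If the figure exceeds the timing specification, the placement and routing would have to be revisited as described above.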

Critical Subsystems

The design may have critical subsystems. Their performance

may be crucial to the overall performance; in other words, to improve

the system performance substantially, one may have to design such

subsystems afresh. The design here may imply redefinition of the

basic feature size of the component, component design, placement of

components, or routing done separately and specifically for the

subsystem. A set of masks used in the foundry may have to be made

afresh for the purpose.

ROLE OF HDL

An HDL provides the framework for the complete logical design

of the ASIC. All the activities coming under the purview of an HDL are

shown enclosed in bold dotted lines in Figure 1.4. Verilog and VHDL

are the two most commonly used HDLs today. Both have constructs

with which the design can be fully described at all the levels. There

are additional constructs available to facilitate setting up of the test

bench, spelling out test vectors for them and “observing” the outputs

from the designed unit. IEEE has brought out Standards for the

HDLs, and the software tools conform to them. Verilog as an HDL was

introduced by Gateway Design Automation, which was later acquired by

Cadence Design Systems; Cadence placed it into the public domain in

1990. It was established as a formal IEEE Standard (IEEE 1364) in

1995. A revised version was brought out in 2001. However,

most of the simulation tools available today conform only to the 1995

version of the standard. VHDL, used by a substantial number of

VLSI designers today, is the HDL used in this project for modelling the

design.