fpga implementation of soft decision low power convolutional decoder using viterbi algorithm
FPGA IMPLEMENTATION OF SOFT
DECISION LOW POWER CONVOLUTIONAL
DECODER USING VITERBI ALGORITHM
CHAPTER – 1
INTRODUCTION
1.1. OVERVIEW
"Viterbi Algorithm (VA) decoders are very popular. They are
currently used in about one billion Cell phones. This is probably one
of the largest number in any application. However, the largest current
consumer of VA processor cycles is probably digital video
broadcasting. A recent estimate at Qualcomm is that approximately
10^15 bits per second are now being decoded by the VA in digital TV
sets around the world, every second of every day."
1.2. OBJECTIVE
The purpose of this project is to introduce the reader to a forward
error correction technique known as convolutional coding with Viterbi
decoding. It gives a detailed description of the algorithms for
generating random binary data, convolutionally encoding the data,
passing the encoded data through a noisy channel, quantizing the
received channel symbols, and performing Viterbi decoding on the
quantized channel symbols to recover the original binary data.
The purpose of forward error correction (FEC) is to improve the
capacity of a channel by adding some carefully designed redundant
information to the data being transmitted through the channel. The
process of adding this redundant information is known as channel
coding. Convolutional coding and block coding are the two major
forms of channel coding. Convolutional codes operate on serial data,
one or a few bits at a time. Block codes operate on relatively large
(typically, up to a couple of hundred bytes) message blocks. There are
a variety of useful convolutional and block codes, and a variety of
algorithms for decoding the received coded information sequences to
recover the original data.
Convolutional encoding with Viterbi decoding is a FEC
technique that is particularly suited to a channel in which the
transmitted signal is corrupted mainly by additive white Gaussian
noise (AWGN). You can think of AWGN as noise whose voltage
distribution over time has characteristics that can be described using
a Gaussian, or normal, statistical distribution, i.e. a bell curve. This
voltage distribution has zero mean and a standard deviation that is a
function of the signal-to-noise ratio (SNR) of the received signal. Let's
assume for the moment that the received signal level is fixed. Then if
the SNR is high, the standard deviation of the noise is small, and vice-
versa. In digital communications, SNR is usually measured in terms of
Eb/N0, which stands for energy per bit divided by the one-sided noise
density.
Let's take a moment to look at a couple of examples. Suppose
that we have a system where a '1' channel bit is transmitted as a
voltage of -1V, and a '0' channel bit is transmitted as a voltage of +1V.
This is called bipolar non-return-to-zero (bipolar NRZ) signaling. It is
also called binary "antipodal" (which means the signaling states are
exact opposites of each other) signaling. The receiver comprises a
comparator that decides the received channel bit is a '1' if its voltage is
less than 0V and a '0' if its voltage is greater than or equal to 0V. One
would want to sample the output of the comparator in the middle of
each data bit interval. Let's see how our example system performs,
first, when the Eb/N0 is high, and then when the Eb/N0 is lower.
Figure 1.1 shows the results of a channel simulation in which
one million (1 x 10^6) channel bits are transmitted through an
AWGN channel with an Eb/N0 level of 20 dB (i.e. the signal voltage is
about 14 times the rms noise voltage). In this simulation, a '1' channel
bit is transmitted at a level of -1V, and a '0' channel bit is transmitted
at a level of +1V. The x axis of Figure 1.1 corresponds to the received
signal voltages, and the y axis represents the number of times each
voltage level was received.
Figure 1.1 Results of a channel simulation
Our simple receiver detects a received channel bit as a '1' if its
voltage is less than 0V and as a '0' if its voltage is greater than or equal
to 0V. Such a receiver would have little difficulty correctly receiving a
signal as depicted in Figure 1.1. Very few (if any) channel bit
reception errors would occur. In this example simulation with the
Eb/N0 set at 20 dB, a transmitted '0' was never received as a '1', and a
transmitted '1' was never received as a '0'.
Figure 1.2 shows the results of a similar channel simulation
in which 1 x 10^6 channel bits are transmitted through an AWGN channel
where the Eb/N0 level has decreased to 6 dB (i.e. the signal voltage is
about 2.8 times the rms noise voltage):
Figure 1.2 Results of a channel simulation
Now observe how the right-hand side of the red curve in
Figure 1.2 crosses 0V, and how the left-hand side of the blue curve also
crosses 0V. The points on the red curve that are above 0V represent
events where a channel bit that was transmitted as a one (-1V) was
received as a zero. The points on the blue curve that are below 0V
represent events where a channel bit that was transmitted as a zero
(+1V) was received as a one. Obviously, these events correspond to
channel bit reception errors in our simple receiver. In this example
simulation with the Eb/N0 set at 6 dB, a transmitted '0' was received
as a '1' 1,147 times, and a transmitted '1' was received as a '0' 1,207
times, corresponding to a bit error rate (BER) of about 0.235%. That's
not so good, especially if you're trying to transmit highly compressed
data, such as digital television. I will show you that by using
convolutional coding with Viterbi decoding, you can achieve a BER of
better than 1 x 10^-7 at the same Eb/N0, 6 dB.
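The uncoded experiment described above can be reproduced with a short sketch. It assumes unit bit energy, the same '0' to +1V and '1' to -1V mapping, and a noise standard deviation of sqrt(1 / (2 Eb/N0)); the function names are illustrative and do not come from the original simulation.

```python
import math
import random

def uncoded_ber_theory(ebn0_db):
    """Theoretical bit error rate of uncoded antipodal signaling over AWGN."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))

def simulate_uncoded_ber(ebn0_db, n_bits=200_000, seed=1):
    """Monte Carlo estimate of the BER seen by the simple comparator receiver."""
    rng = random.Random(seed)
    ebn0 = 10 ** (ebn0_db / 10)
    sigma = math.sqrt(1 / (2 * ebn0))    # noise std dev for unit bit energy
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        tx = 1 - 2 * bit                 # '0' -> +1 V, '1' -> -1 V
        rx = tx + rng.gauss(0, sigma)    # AWGN channel
        decided = 1 if rx < 0 else 0     # comparator threshold at 0 V
        errors += decided != bit
    return errors / n_bits
```

At 6 dB the theoretical value is close to the 0.235% measured in the simulation quoted above; a Monte Carlo run will scatter around it depending on the seed and the number of bits.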
Convolutional codes are usually described using two
parameters: the code rate and the constraint length. The code rate,
k/n, is expressed as a ratio of the number of bits into the
convolutional encoder (k) to the number of channel symbols output by
the convolutional encoder (n) in a given encoder cycle. The constraint
length parameter, K, denotes the "length" of the convolutional
encoder, i.e. how many k-bit stages are available to feed the
combinatorial logic that produces the output symbols. Closely related
to K is the parameter m, which indicates how many encoder cycles an
input bit is retained and used for encoding after it first appears at the
input to the convolutional encoder. The m parameter can be thought
of as the memory length of the encoder. I focussed on rate 1/2
convolutional codes in this project.
1.3. LITERATURE SURVEY
As part of the literature survey carried out before implementing the
proposed work, the following research papers were consulted and
their contents considered.
Research Paper 1:
F. Chan and D. Haccoun, “Adaptive Viterbi decoding of
convolutional codes over memoryless channels,” IEEE Trans. Commun.,
vol. 45, no. 11, pp. 1389–1400, Nov. 1997.
Objective
In this paper, an adaptive decoding algorithm for convolutional
codes, which is a modification of the Viterbi algorithm (VA) is
presented. For a given code, the proposed algorithm yields nearly the
same error performance as the VA while requiring a substantially
smaller average number of computations. Unlike most of the other
suboptimum algorithms, this algorithm is self-synchronizing. If the
transmitted path is discarded, the adaptive Viterbi algorithm (AVA)
can recover the state corresponding to the transmitted path after a few
trellis depths. Using computer simulations over hard and soft 3-bit
quantized additive white Gaussian noise channels, it is shown that
codes with a constraint length K up to 11 can be used to improve the
bit-error performance over the VA with K=7 while maintaining a
similar average number of computations. Although a small variability
of the computational effort is present with our algorithm, this
variability is exponentially distributed, leading to a modest size of the
input buffer and, hence, a small probability of overflow.
Research Paper 2:
S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson, “A
dynamically reconfigurable adaptive Viterbi decoder,” in Proc.
FPGA’02, 2002.
Objective
The use of error-correcting codes has proven to be an effective
way to overcome data corruption in digital communication channels.
Although widely-used, the most popular communications decoding
algorithm, the Viterbi algorithm, requires an exponential increase in
hardware complexity to achieve greater decode accuracy. In this
paper, we describe the analysis and implementation of a reduced-
complexity decode approach, the adaptive Viterbi algorithm (AVA). Our
AVA design is implemented in reconfigurable hardware to take full
advantage of algorithm parallelism and specialization. Run-time
dynamic reconfiguration is used in response to changing channel
noise conditions to achieve improved decoder performance.
Implementation parameters for the decoder have been determined
through simulation and the decoder has been implemented on a Xilinx
XC4036-based PCI board. An overall decode performance improvement
of 7.5X for AVA has been achieved versus algorithm implementation
on a Celeron-processor based system. The use of dynamic
reconfiguration leads to a 20% performance improvement over a static
implementation with no loss of decode accuracy.
Research Paper 3:
R. Henning and C. Chakrabarti, “Low-power approach for
decoding convolutional codes with adaptive Viterbi algorithm
approximations,” in Proc. ISLPED’02, Monterey, CA, Aug. 12–14, 2002.
Objective
Significant power reduction can be achieved by exploiting real-
time variation in system characteristics while decoding convolutional
codes. The approach proposed herein adaptively approximates Viterbi
decoding by varying truncation length and pruning threshold of the T-
algorithm while employing trace-back memory management.
Adaptation is performed according to variations in signal-to-noise
ratio, code rate, and maximum acceptable bit error rate. Potential
energy reduction of 70 to 97.5% compared to Viterbi decoding is
demonstrated. Superiority of adaptive T-algorithm decoding compared
to fixed T-algorithm decoding is studied. General conclusions about
when applications can particularly benefit from this approach are
given.
1.4. AIM OF THE PROJECT
This project aims to develop a Viterbi decoder with strongly
connected trellis decoding, which is proposed to reduce the
number of add-compare-select computations in the Viterbi decoding
process. The idea is to develop a low-power, low-delay
Viterbi decoder.
1.5. ADVANTAGES / APPLICATIONS
Viterbi decoding is one of two types of decoding algorithms used
with convolutional encoding; the other type is sequential decoding.
Sequential decoding has the advantage that it can perform very well
with long-constraint-length convolutional codes, but it has a variable
decoding time.
Viterbi decoding has the advantage that it has a fixed decoding
time. It is well suited to hardware decoder implementation. But its
computational requirements grow exponentially as a function of the
constraint length, so it is usually limited in practice to constraint
lengths of K = 9 or less. Stanford Telecom produces a K = 9 Viterbi
decoder that operates at rates up to 96 kbps, and a K = 7 Viterbi
decoder that operates at up to 45 Mbps. Advanced Wireless
Technologies offers a K = 9 Viterbi decoder that operates at rates up to
2 Mbps. NTT has announced a Viterbi decoder that operates at 60
Mbps. Moore's Law applies to Viterbi decoders as well as to
microprocessors, so consider the rates mentioned above as a snapshot
of the state-of-the-art taken in early 1999.
For years, convolutional coding with Viterbi decoding has been
the predominant FEC technique used in space communications,
particularly in geostationary satellite communication networks, such
as VSAT (very small aperture terminal) networks. I believe the most
common variant used in VSAT networks is rate 1/2 convolutional
coding using a code with a constraint length K = 7. With this code,
you can transmit binary or quaternary phase-shift-keyed (BPSK or
QPSK) signals with at least 5 dB less power than you'd need without
it. That's a reduction in Watts of more than a factor of three! This is
very useful in reducing transmitter and/or antenna cost or permitting
increased data rates given the same transmitter power and antenna
sizes.
But there's a tradeoff: the same data rate with rate 1/2
convolutional coding takes twice the bandwidth of the same signal
without it, given that the modulation technique is the same. That's
because with rate 1/2 convolutional encoding, you transmit two
channel symbols per data bit. However, if you think of the tradeoff as
a 5 dB power savings for a 3 dB bandwidth expansion, you can see
that you come out ahead. Remember: if the modulation technique
stays the same, the bandwidth expansion factor of a convolutional
code is simply n/k.
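As a quick check of the trade-off, the two figures quoted above (a 5 dB coding gain and a rate 1/2 code) can be converted between decibels and linear ratios. This is only a worked calculation on those two numbers, not an additional result:

```latex
\begin{align*}
\text{power ratio for 5 dB} &= 10^{5/10} \approx 3.16 \\
\text{bandwidth expansion} &= \frac{n}{k} = 2, \qquad 10\log_{10}(2) \approx 3.01\ \text{dB}
\end{align*}
```

So the transmitter power drops by a factor of roughly 3.16 while the occupied bandwidth doubles.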
Many radio channels are AWGN channels, but many,
particularly terrestrial radio channels, also have other impairments,
such as multipath, selective fading, interference, and atmospheric
(lightning) noise. Transmitters and receivers can add spurious signals
and phase noise to the desired signal as well. Although convolutional
coding with Viterbi decoding might be useful in dealing with those
other problems, it may not be the best technique.
In the past several years, convolutional coding with Viterbi
decoding has begun to be supplemented in the geostationary satellite
communication arena with Reed-Solomon coding. The two coding
techniques are usually implemented as serially concatenated block
and convolutional coding. Typically, the information to be transmitted
is first encoded with the Reed-Solomon code, then with the
convolutional code. On the receiving end, Viterbi decoding is
performed first, followed by Reed-Solomon decoding. This is the
technique that is used in most if not all of the direct-broadcast
satellite (DBS) systems, and in several of the newer VSAT products as
well. At least, that's what the vendors are advertising.
In 1993, a new parallel-concatenated convolutional
coding technique known as turbo coding emerged. Initial
hardware encoder and decoder implementations of turbo coding have
already appeared on the market. This technique achieves substantial
improvements in performance over concatenated Viterbi and Reed-
Solomon coding. A variant in which the codes are product codes has
also been developed, along with hardware implementations.
Applications:
The Viterbi decoder is commonly used to decode convolutional codes in
communication and storage systems such as:
- Satellite communications.
- Computer storage devices such as hard disk drives.
- Digital video broadcasting.
- Mobile broadband applications.
1.6 THESIS ORGANISATION
The present study is based on the implementation of a soft-
decision, low-power convolutional encoder and Viterbi decoder
in VHDL. Chapter one provides an introduction to the
project as a whole. Chapter two gives a detailed account of the Viterbi
algorithm. Chapter three provides insights into the logical aspects of
the Viterbi encoder and decoder. Chapter four deals with the
specifications and the design aspects involved in the VHDL
implementation. Chapter five contains simulation results of the proposed
work. Chapter six deals with the VHDL implementation, which
contains the synthesis reports, RTL views, port diagrams and timing
analysis. Chapter seven presents the concluding remarks and future
scope of the work.
CHAPTER 2
VITERBI ALGORITHM
2.1 INTRODUCTION
The steps involved in simulating a communication channel using soft-
decision Viterbi decoding are as follows:
- Generate the data to be transmitted through the channel; the
result is binary data bits.
- Convolutionally encode the data; the result is channel symbols.
- Map the one/zero channel symbols onto an antipodal
baseband signal, producing transmitted channel symbols.
- Add noise to the transmitted channel symbols; the result is received
channel symbols.
- Quantize the received channel levels; one-bit quantization is
called hard-decision, and two- to n-bit quantization is called soft-
decision (n is usually three or four).
- Perform Viterbi decoding on the quantized received channel
symbols; the result is again binary data bits.
- Compare the decoded data bits to the transmitted data bits and
count the number of errors.
Many of you will notice that I left out the steps of modulating the
channel symbols onto a transmitted carrier, and then demodulating
the received carrier to recover the channel symbols. You're right, but
we can accurately model the effects of AWGN even though we bypass
those steps.
2.2 GENERATING THE DATA
Generating the data to be transmitted through the channel can
be accomplished quite simply by using a random number generator
that produces a uniform distribution of numbers on the interval 0
to a maximum value. Using this function, we can say that any value
less than half of the maximum value is a zero, and any value greater
than or equal to half of the maximum value is a one.
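The thresholding rule just described can be sketched as follows. The maximum value of 32767 and the function name are assumptions chosen for illustration (32767 is a common value of RAND_MAX in C libraries); any uniform integer generator would do.

```python
import random

def generate_bits(n, max_value=32767, seed=None):
    """Return n random data bits by thresholding uniform integers at half the maximum."""
    rng = random.Random(seed)
    half = max_value / 2
    # Values below half of the maximum map to 0, the rest map to 1.
    return [0 if rng.randint(0, max_value) < half else 1 for _ in range(n)]
```

For example, `generate_bits(15, seed=0)` yields a reproducible 15-bit test message.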
2.3 CONVOLUTIONALLY ENCODING THE DATA
Convolutionally encoding the data is accomplished using a shift
register and associated combinatorial logic that performs modulo-two
addition. (A shift register is merely a chain of flip-flops wherein the
output of the nth flip-flop is tied to the input of the (n+1)th flip-flop.
Every time the active edge of the clock occurs, the input to the flip-flop
is clocked through to the output, and thus the data are shifted over
one stage.) The combinatorial logic is often in the form of cascaded
exclusive-or gates. As a reminder, exclusive-or gates are two-input,
one-output gates often represented by the logic symbol shown in
Figure 2.1,
Figure 2.1. XOR gate
which implements the following truth table:
Table 2.1 Truth table of XOR gate
A   B   A XOR B
0   0   0
0   1   1
1   0   1
1   1   0
The exclusive-or-gate performs modulo-two addition of its
inputs. When you cascade q two-input exclusive-or gates, with the
output of the first one feeding one of the inputs of the second one, the
output of the second one feeding one of the inputs of the third one,
etc., the output of the last one in the chain is the modulo-two sum of
the q + 1 inputs.
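The claim about cascaded gates is easy to check in a few lines. The sketch below chains two-input XOR operations exactly as described and compares the result with an ordinary sum reduced modulo two; both function names are illustrative.

```python
from functools import reduce
from operator import xor

def cascaded_xor(bits):
    """Feed each gate's output into the next gate, as in a chain of XOR gates."""
    return reduce(xor, bits)

def modulo_two_sum(bits):
    """Add the inputs as integers and keep only the least significant bit."""
    return sum(bits) % 2
```

With q = 2 gates and the three inputs 1, 1, 1, both functions return 1; with inputs 1, 0, 1 both return 0.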
Another way to illustrate the modulo-two adder, and the way
that is most commonly used in textbooks, is as a circle with a +
symbol inside, thus:
Now that we have the two basic components of the
convolutional encoder (flip-flops comprising the shift register and
exclusive-or gates comprising the associated modulo-two adders)
defined, let's look at a picture of a convolutional encoder for a rate
1/2, K = 3, m = 2 code:
Figure 2.2 convolutional encoder
In this encoder, data bits are provided at a rate of k bits per
second. Channel symbols are output at a rate of n = 2k symbols per
second. The input bit is stable during the encoder cycle. The encoder
cycle starts when an input clock edge occurs. When the input clock
edge occurs, the output of the left-hand flip-flop is clocked into the
right-hand flip-flop, the previous input bit is clocked into the left-hand
flip-flop, and a new input bit becomes available. Then the outputs of
the upper and lower modulo-two adders become stable. The output
selector (SEL A/B block) cycles through two states: in the first state, it
selects and outputs the output of the upper modulo-two adder; in the
second state, it selects and outputs the output of the lower modulo-
two adder.
The encoder shown above encodes the K = 3, (7, 5) convolutional
code. The octal numbers 7 and 5 represent the code generator
polynomials, which when read in binary (111₂ and 101₂) correspond to
the shift register connections to the upper and lower modulo-two
adders, respectively. This code has been determined to be the "best"
code for rate 1/2, K = 3. It is the code I will use for the remaining
discussion and examples, for reasons that will become readily
apparent when we get into the Viterbi decoder algorithm.
Let's look at an example input data stream, and the
corresponding output data stream:
Let the input sequence be 010111001010001₂.
Assume that the outputs of both of the flip-flops in the shift
register are initially cleared, i.e. their outputs are zeroes. The first
clock cycle makes the first input bit, a zero, available to the encoder.
The flip-flop outputs are both zeroes. The inputs to the modulo-two
adders are all zeroes, so the output of the encoder is 00₂.
The second clock cycle makes the second input bit, a one, available
to the encoder. The left-hand flip-flop clocks in the previous bit, which
was a zero, and the right-hand flip-flop clocks in the zero output by
the left-hand flip-flop. The inputs to the top modulo-two adder are
100₂, so the output is a one. The inputs to the bottom modulo-two
adder are 10₂, so the output is also a one. So the encoder outputs 11₂
for the channel symbols.
The third clock cycle makes the third input bit, a zero, available
to the encoder. The left-hand flip-flop clocks in the previous bit, which
was a one, and the right-hand flip-flop clocks in the zero from two bit-
times ago. The inputs to the top modulo-two adder are 010₂, so the
output is a one. The inputs to the bottom modulo-two adder are 00₂,
so the output is zero. So the encoder outputs 10₂ for the channel
symbols.
And so on. The timing diagram shown below illustrates the process:
Figure 2.3. Timing diagram of Encoder
After all of the inputs have been presented to the encoder, the output
sequence will be:
00 11 10 00 01 10 01 11 11 10 00 10 11 00 11₂.
Notice that I have paired the encoder outputs: the first bit in
each pair is the output of the upper modulo-two adder; the second bit
in each pair is the output of the lower modulo-two adder.
You can see from the structure of the rate 1/2 K = 3
convolutional encoder and from the example given above that each
input bit has an effect on three successive pairs of output symbols.
That is an extremely important point and that is what gives the
convolutional code its error-correcting power. The reason why will
become evident when we get into the Viterbi decoder algorithm.
Now if we are only going to send the 15 data bits given above, in
order for the last bit to affect three pairs of output symbols, we need
to output two more pairs of symbols. This is accomplished in our
example encoder by clocking the convolutional encoder flip-flops two (
= m) more times, while holding the input at zero. This is called
"flushing" the encoder, and results in two more pairs of output
symbols. The final binary output of the encoder is thus 00 11 10 00
01 10 01 11 11 10 00 10 11 00 11 10 11₂. If we don't perform the
flushing operation, the last m bits of the message have less error-
correction capability than the first through (m - 1)th bits had. This is a
pretty important thing to remember if you're going to use this FEC
technique in a burst-mode environment. So's the step of clearing the
shift register at the beginning of each burst. The encoder must start in
a known state and end in a known state for the decoder to be able to
reconstruct the input data sequence properly.
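The encoder walk-through, including the flushing step, can be expressed as a short sketch. It models the two flip-flops as the variables `left` and `right` and takes the adder connections from the (7, 5) generators described above; the function name and the `flush` parameter are illustrative.

```python
def conv_encode(bits, flush=True):
    """Rate 1/2, K = 3, (7, 5) encoder; returns a list of (upper, lower) output pairs."""
    left = right = 0                      # both flip-flops start cleared
    m = 2                                 # encoder memory length
    stream = list(bits) + ([0] * m if flush else [])
    pairs = []
    for b in stream:
        upper = b ^ left ^ right          # generator 7 = 111 in binary
        lower = b ^ right                 # generator 5 = 101 in binary
        pairs.append((upper, lower))
        left, right = b, left             # shift the register by one stage
    return pairs

message = [int(c) for c in "010111001010001"]
encoded = " ".join(f"{u}{l}" for u, l in conv_encode(message))
```

For the 15-bit example message, `encoded` reproduces the 17 symbol pairs listed above; calling `conv_encode(message, flush=False)` gives only the first 15 pairs.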
Now, let's look at the encoder from another perspective. You can
think of the encoder as a simple state machine. The example encoder
has two bits of memory, so there are four possible states. Let's give the
left-hand flip-flop a binary weight of 2^1, and the right-hand flip-flop a
binary weight of 2^0. Initially, the encoder is in the all-zeroes state. If
the first input bit is a zero, the encoder stays in the all zeroes state at
the next clock edge. But if the input bit is a one, the encoder
transitions to the 10₂ state at the next clock edge. Then, if the next
input bit is zero, the encoder transitions to the 01₂ state; otherwise, it
transitions to the 11₂ state. The following table gives the next state
given the current state and the input, with the states given in binary:
Table 2.2 Next state table of encoder
Current state   Next state, input = 0   Next state, input = 1
00              00                      10
01              00                      10
10              01                      11
11              01                      11
The above table is often called a state transition table. We'll refer
to it as the next state table. Now let us look at a table that lists the
channel output symbols, given the current state and the input data,
which we'll refer to as the output table:
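Table 2.3 Output table
Current state   Output symbols, input = 0   Output symbols, input = 1
00              00                          11
01              11                          00
10              10                          01
11              01                          10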
You should now see that with these two tables, you can
completely describe the behavior of the example rate 1/2, K = 3
convolutional encoder. Note that both of these tables have 2^(K - 1) rows,
and 2^k columns, where K is the constraint length and k is the number
of bits input to the encoder for each cycle. These two tables will come
in handy when we start discussing the Viterbi decoder algorithm.
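Both tables can be derived mechanically from the encoder structure. In the sketch below a state is the integer whose two bits are the left-hand and right-hand flip-flop outputs (weights 2^1 and 2^0, as above); the function and table names are illustrative.

```python
def next_state(state, bit):
    """The new input bit enters the left flip-flop; the old left bit moves right."""
    return (bit << 1) | (state >> 1)

def output_symbols(state, bit):
    """Two-bit channel symbol pair: upper adder output in the high bit."""
    left, right = state >> 1, state & 1
    upper = bit ^ left ^ right            # generator 7
    lower = bit ^ right                   # generator 5
    return (upper << 1) | lower

# Rows are indexed by current state, columns by input bit.
NEXT_STATE = [[next_state(s, b) for b in (0, 1)] for s in range(4)]
OUTPUT = [[output_symbols(s, b) for b in (0, 1)] for s in range(4)]
```

Each table has 2^(K - 1) = 4 rows and 2^k = 2 columns, matching the dimensions stated above.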
MAPPING THE CHANNEL SYMBOLS TO SIGNAL LEVELS
Mapping the one/zero output of the convolutional encoder onto
an antipodal baseband signaling scheme is simply a matter of
translating zeroes to +1s and ones to -1s. This can be accomplished
by performing the operation y = 1 - 2x on each convolutional encoder
output symbol.
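A minimal sketch of this mapping, with an illustrative function name:

```python
def map_to_antipodal(symbols):
    """Apply y = 1 - 2x so that 0 becomes +1 and 1 becomes -1."""
    return [1 - 2 * x for x in symbols]
```

For instance, the symbol pair sequence 0, 0, 1, 1 maps to the levels +1, +1, -1, -1.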
2.4 ADDING NOISE TO THE TRANSMITTED SYMBOLS
Adding noise to the transmitted channel symbols produced by
the convolutional encoder involves generating Gaussian random
numbers, scaling the numbers according to the desired energy per
symbol to noise density ratio, Es/N0, and adding the scaled Gaussian
random numbers to the channel symbol values.
For the uncoded channel, Es/N0 = Eb/N0, since there is one
channel symbol per bit. However, for the coded channel, Es/N0 =
Eb/N0 + 10 log10(k/n), with both ratios expressed in dB. For example,
for rate 1/2 coding, Es/N0 = Eb/N0 + 10 log10(1/2) = Eb/N0 - 3.01 dB.
Similarly, for rate 2/3 coding, Es/N0 = Eb/N0 + 10 log10(2/3) =
Eb/N0 - 1.76 dB.
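The conversion between the two ratios is a one-line calculation. The helper below is a sketch with an illustrative name; it takes Eb/N0 in dB and the code rate as k and n.

```python
import math

def es_n0_db(eb_n0_db, k, n):
    """Energy per channel symbol to noise density ratio, in dB, for a rate k/n code."""
    return eb_n0_db + 10 * math.log10(k / n)
```

With Eb/N0 = 6 dB and rate 1/2 coding this gives roughly 2.99 dB, i.e. 6 dB minus 3.01 dB.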
The Gaussian random number generator is the only interesting
part of this task. C only provides a uniform random number
generator, rand(). In order to obtain Gaussian random numbers, we
take advantage of relationships between uniform, Rayleigh, and
Gaussian distributions:
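Given a uniform random variable U, a Rayleigh random variable R can
be obtained by:
$$R = \sqrt{2\sigma^2 \ln\left(\frac{1}{1-U}\right)} = \sigma\sqrt{2\ln\left(\frac{1}{1-U}\right)}$$
where $\sigma^2$ is the variance parameter of the Rayleigh random
variable (the variance of each Gaussian component). Given
R and a second uniform random variable V, scaled to the interval
0 to 2π, two Gaussian random variables G and H can be obtained by
G = R cos V and H = R sin V.
In the AWGN channel, the signal is corrupted by additive noise,
n(t), which has the power spectrum N0/2 watts/Hz. The variance
$\sigma^2$ of this noise is equal to $N_0/2$. If we set the energy per
symbol Es equal to 1, then $E_s/N_0 = 1/(2\sigma^2)$.
So
$$\sigma = \sqrt{\frac{1}{2(E_s/N_0)}}$$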
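The noise generation step can be sketched as below. It assumes unit symbol energy, a noise standard deviation of sqrt(1 / (2 Es/N0)) with Es/N0 converted from dB to a linear ratio, and a second uniform variable scaled to the range 0 to 2π before the cosine and sine are taken. All function names are illustrative.

```python
import math
import random

def noise_sigma(es_n0_db):
    """Noise standard deviation for unit symbol energy."""
    es_n0 = 10 ** (es_n0_db / 10)
    return math.sqrt(1 / (2 * es_n0))

def gaussian_pair(sigma, rng=random):
    """Turn two uniform variates into two independent Gaussian variates."""
    u = rng.random()                      # uniform on [0, 1)
    v = rng.random()
    r = sigma * math.sqrt(2 * math.log(1 / (1 - u)))   # Rayleigh variate
    angle = 2 * math.pi * v               # uniform phase on [0, 2*pi)
    return r * math.cos(angle), r * math.sin(angle)

def add_noise(symbols, es_n0_db, seed=None):
    """Add scaled Gaussian noise to each transmitted channel symbol."""
    rng = random.Random(seed)
    sigma = noise_sigma(es_n0_db)
    return [s + gaussian_pair(sigma, rng)[0] for s in symbols]
```

Only one of the two Gaussian values is used per symbol here to keep the sketch short; an implementation concerned with speed would use both.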
2.5 QUANTIZING THE RECEIVED CHANNEL SYMBOLS
An ideal Viterbi decoder would work with infinite precision, or at
least with floating-point numbers. In practical systems, we quantize
the received channel symbols with one or a few bits of precision in
order to reduce the complexity of the Viterbi decoder, not to mention
the circuits that precede it. If the received channel symbols are
quantized to one-bit precision (< 0V = 1, ≥ 0V = 0), the result is called
hard-decision data. If the received channel symbols are quantized with
more than one bit of precision, the result is called soft-decision data.
A Viterbi decoder with soft decision data inputs quantized to three or
four bits of precision can perform about 2 dB better than one working
with hard-decision inputs. The usual quantization precision is three
bits. More bits provide little additional improvement.
The selection of the quantizing levels is an important design
decision because it can have a significant effect on the performance of
the link. The following is a very brief explanation of one way to set
those levels. Let's assume our received signal levels in the absence of
noise are -1V = 1, +1V = 0. With noise, our received signal has mean
+/- 1 and standard deviation $\sigma = \sqrt{1/(2E_s/N_0)}$. Let's use a uniform,
three-bit quantizer having the input/output relationship shown in the
figure below, where D is a decision level that we will calculate shortly:
Figure 2.4. Quantiser input/output relationship
The decision level, D, can be calculated according to the formula
$$D = 0.5\,\sigma = 0.5\sqrt{\frac{1}{2(E_s/N_0)}}$$
where Es/N0 is the energy per symbol to noise density ratio.
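A uniform three-bit quantizer of this kind can be sketched as follows. The decision level is passed in as the parameter `d`. The placement of the thresholds at multiples of `d` from -3d to +3d, and the output coding (0 for the most confident zero, 7 for the most confident one), are assumptions made for illustration, since the figure is not reproduced here.

```python
import math

def quantize_3bit(value, d):
    """Map a received voltage to an integer 0..7 using steps of width d.

    Assumed coding: 0 = strongest '0' (most positive voltage),
    7 = strongest '1' (most negative voltage).
    """
    step = math.floor(value / d)          # index of the step containing value
    step = max(-4, min(3, step))          # clamp to the eight available levels
    return 3 - step
```

With this coding the most significant bit of the output equals the hard decision: it is 1 exactly when the received voltage is below 0V.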
2.6 PERFORMING VITERBI DECODING
The Viterbi decoder itself is the primary focus of this tutorial.
Perhaps the single most important concept to aid in understanding
the Viterbi algorithm is the trellis diagram. The figure below shows the
trellis diagram for our example rate 1/2 K = 3 convolutional encoder,
for a 15-bit message:
The four possible states of the encoder are depicted as four rows
of horizontal dots. There is one column of four dots for the initial state
of the encoder and one for each time instant during the message. For
a 15-bit message with two encoder memory flushing bits, there are 17
time instants in addition to t = 0, which represents the initial
condition of the encoder. The solid lines connecting dots in the
diagram represent state transitions when the input bit is a one. The
dotted lines represent state transitions when the input bit is a zero.
Notice the correspondence between the arrows in the trellis diagram
and the state transition table discussed above. Also notice that since
the initial condition of the encoder is State 00₂, and the two memory
flushing bits are zeroes, the arrows start out at State 00₂ and end up
at the same state.
The following diagram shows the states of the trellis that are
actually reached during the encoding of our example 15-bit message:
The encoder input bits and output symbols are shown at the
bottom of the diagram. Notice the correspondence between the
encoder output symbols and the output table discussed above. Let's
look at that in more detail, using the expanded version of the
transition between one time instant to the next shown below:
The two-bit numbers labeling the lines are the corresponding
convolutional encoder channel symbol outputs. Remember that dotted
lines represent cases where the encoder input is a zero, and solid lines
represent cases where the encoder input is a one. (In the figure above,
the two-bit binary numbers labeling dotted lines are on the left, and
the two-bit binary numbers labeling solid lines are on the right.)
OK, now let's start looking at how the Viterbi decoding
algorithm actually works. For our example, we're going to use hard-
decision symbol inputs to keep things simple. (The example source
code uses soft-decision inputs to achieve better performance.)
Suppose we receive the above encoded message with a couple of bit
errors:
Each time we receive a pair of channel symbols, we're going to
compute a metric to measure the "distance" between what we received
and all of the possible channel symbol pairs we could have received.
Going from t = 0 to t = 1, there are only two possible channel symbol
pairs we could have received: 00₂ and 11₂. That's because we know
the convolutional encoder was initialized to the all-zeroes state, and
given one input bit = one or zero, there are only two states we could
transition to and two possible outputs of the encoder. These possible
outputs of the encoder are 00₂ and 11₂.
The metric we're going to use for now is the Hamming distance
between the received channel symbol pair and the possible channel
symbol pairs. The Hamming distance is computed by simply counting
how many bits are different between the received channel symbol pair
and the possible channel symbol pairs. The results can only be zero,
one, or two. The Hamming distance (or other metric) values we
compute at each time instant for the paths between the states at the
previous time instant and the states at the current time instant are
called branch metrics. For the first time instant, we're going to save
these results as "accumulated error metric" values, associated with
states. For the second time instant on, the accumulated error metrics
will be computed by adding the previous accumulated error metrics to
the current branch metrics.
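The hard-decision branch metric is a bit count on the XOR of two symbol pairs. A minimal sketch, with each pair held as a two-bit integer and an illustrative function name:

```python
def hamming_distance(received, expected):
    """Number of bit positions in which two channel symbol pairs differ."""
    return bin(received ^ expected).count("1")
```

For two-bit symbol pairs the result can only be zero, one, or two, as stated above.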
At t = 1, we received 00₂. The only possible channel symbol pairs
we could have received are 00₂ and 11₂. The Hamming distance
between 00₂ and 00₂ is zero. The Hamming distance between 00₂ and
11₂ is two. Therefore, the branch metric value for the branch from
State 00₂ to State 00₂ is zero, and for the branch from State 00₂ to
State 10₂ it's two. Since the previous accumulated error metric values
are equal to zero, the accumulated metric values for State 00₂ and for
State 10₂ are equal to the branch metric values. The accumulated
error metric values for the other two states are undefined. The figure
below illustrates the results at t = 1:
Note that the solid lines between states at t = 1 and the state at
t = 0 illustrate the predecessor-successor relationship between the
states at t = 1 and the state at t = 0 respectively. This information is
shown graphically in the figure, but is stored numerically in the actual
implementation. To be more specific, or maybe clear is a better word,
at each time instant t, we will store the number of the predecessor
state that led to each of the current states at t.
Now let's look at what happens at t = 2. We received a 11₂ channel
symbol pair. The possible channel symbol pairs we could have
received in going from t = 1 to t = 2 are 00₂ going from State 00₂ to
State 00₂, 11₂ going from State 00₂ to State 10₂, 10₂ going from State
10₂ to State 01₂, and 01₂ going from State 10₂ to State 11₂. The
Hamming distance between 00₂ and 11₂ is two, between 11₂ and 11₂
is zero, and between 10₂ or 01₂ and 11₂ is one. We add these branch
metric values to the previous accumulated error metric values
associated with each state that we came from to get to the current
states. At t = 1, we could only be at State 002 or State 102. The
accumulated error metric values associated with those states were 0
and 2 respectively. The figure below shows the calculation of the
accumulated error metric associated with each state, at t = 2.
That's all the computation for t = 2. What we carry forward to t
= 3 will be the accumulated error metrics for each state, and the
predecessor states for each of the four states at t = 2, corresponding to
the state relationships shown by the solid lines in the illustration of
the trellis.
Now look at the figure for t = 3. Things get a bit more
complicated here, since there are now two different ways that we could
get from each of the four states that were valid at t = 2 to the four
states that are valid at t = 3. So how do we handle that? The answer
is, we compare the accumulated error metrics associated with each
branch, and discard the larger one of each pair of branches leading
into a given state. If the members of a pair of accumulated error
metrics going into a particular state are equal, we just save that value.
The other thing that's affected is the predecessor-successor history
we're keeping. For each state, the predecessor that survives is the one
with the lower accumulated error metric. If the two accumulated error
metrics are
equal, some people use a fair coin toss to choose the surviving
predecessor state. Others simply pick one of them consistently, i.e. the
upper branch or the lower branch. It probably doesn't matter which
method you use. The operation of adding the previous accumulated
error metrics to the new branch metrics, comparing the results, and
selecting the smaller (smallest) accumulated error metric to be
retained for the next time instant is called the add-compare-select
operation. The figure below shows the results of processing t = 3:
Note that the third channel symbol pair we received had a one-
symbol error. The smallest accumulated error metric is a one, and
there are two of these.
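As a rough sketch of the add-compare-select operation just described, the Python below processes one trellis step for a four-state code. The generator taps (111 and 101 in binary) and the state numbering (newest input bit in the most significant position) are assumptions chosen to match the transitions quoted in the walkthrough; ties are resolved by consistently keeping the first candidate found.

```python
import math

def step(state, bit):
    """Assumed K = 3, rate 1/2 encoder: return (next_state, output_pair)."""
    s1, s0 = state >> 1, state & 1
    output = (bit ^ s1 ^ s0, bit ^ s0)
    return (bit << 1) | s1, output

def acs(prev_metrics, received):
    """One add-compare-select pass over all four states."""
    new_metrics = [math.inf] * 4
    predecessor = [None] * 4
    for state in range(4):
        if prev_metrics[state] == math.inf:
            continue                      # state not yet reachable
        for bit in (0, 1):
            nxt, output = step(state, bit)
            branch = sum(r != o for r, o in zip(received, output))
            candidate = prev_metrics[state] + branch     # add
            if candidate < new_metrics[nxt]:             # compare
                new_metrics[nxt] = candidate             # select
                predecessor[nxt] = state
    return new_metrics, predecessor

metrics = [0, math.inf, math.inf, math.inf]   # encoder starts in State 00
for pair in [(0, 0), (1, 1)]:                 # pairs received at t = 1 and t = 2
    metrics, predecessor = acs(metrics, pair)
print(metrics, predecessor)
```

The `predecessor` list returned at each step is one column of the state history that the traceback later walks through.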
Let's see what happens now at t = 4. The processing is the same
as it was for t = 3. The results are shown in the figure:
Notice that at t = 4, the path through the trellis of the actual
transmitted message, shown in bold, is again associated with the
smallest accumulated error metric. Let's look at t = 5:
At t = 5, the path through the trellis corresponding to the actual
message, shown in bold, is still associated with the smallest
accumulated error metric. This is the thing that the Viterbi decoder
exploits to recover the original message.
Perhaps you're getting tired of stepping through the trellis. I
know I am. Let's skip to the end.
At t = 17, the trellis looks like this, with the clutter of the
intermediate state history removed:
The decoding process begins with building the accumulated
error metric for some number of received channel symbol pairs, and
the history of what states preceded the states at each time instant t
with the smallest accumulated error metric. Once this information is
built up, the Viterbi decoder is ready to recreate the sequence of bits
that were input to the convolutional encoder when the message was
encoded for transmission. This is accomplished by the following steps:
1) First, select the state having the smallest accumulated error
   metric and save the state number of that state.
2) Iteratively perform the following step until the beginning of the
   trellis is reached: Working backward through the state history
   table, for the selected state, select a new state which is listed in
   the state history table as being the predecessor to that state.
   Save the state number of each selected state. This step is called
   trace back.
3) Now work forward through the list of selected states saved in
   the previous steps. Look up what input bit corresponds to a
   transition from each predecessor state to its successor state.
   That is the bit that must have been encoded by the
   convolutional encoder.
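The three traceback steps above can be sketched as follows. The function name `traceback`, the layout of `history` (one list of predecessor states per time instant), and the three-step example values are all illustrative assumptions; the transition-to-input mapping assumes the newest input bit becomes the most significant bit of the next state.

```python
def traceback(history, final_metrics, input_for):
    """Recover the input bits from a survivor (state history) table."""
    # Step 1: start from the state with the smallest accumulated error metric.
    state = min(range(len(final_metrics)), key=final_metrics.__getitem__)
    states = [state]
    # Step 2: walk backward, following the stored predecessor of each state.
    for predecessors in reversed(history):
        state = predecessors[state]
        states.append(state)
    states.reverse()
    # Step 3: walk forward and look up the input bit for each transition.
    return [input_for[(a, b)] for a, b in zip(states, states[1:])]

# Assumed transition map for a four-state code: next = (bit << 1) | (state >> 1).
input_for = {(s, (b << 1) | (s >> 1)): b for s in range(4) for b in (0, 1)}

# Hand-made three-step history (None marks a state that was not yet reachable).
history = [[0, None, 0, None],
           [0, 2, 0, 2],
           [1, 2, 0, 2]]
final_metrics = [3, 1, 2, 1]

decoded = traceback(history, final_metrics, input_for)
print(decoded)
```

When two states tie for the smallest metric, `min` simply picks the lower-numbered one, which corresponds to the "pick one consistently" policy.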
The following table shows the accumulated metric for the full 15-bit
(plus two flushing bits) example message at each time t:
Table 2.4 Accumulated error metric for 15 bit message
It is interesting to note that for this hard-decision-input Viterbi
decoder example, the smallest accumulated error metric in the final
state indicates how many channel symbol errors occurred.
The following state history table shows the surviving
predecessor states for each state at each time t:
Table 2.5 surviving predecessor states for each state at each time t
The following table shows the states selected when tracing the
path back through the survivor state table shown above:
Table 2.6 States selected when tracing the path back
Using a table that maps state transitions to the inputs that
caused them, we can now recreate the original message. Here is what
this table looks like for our example rate 1/2 K = 3 convolutional code:
Figure 2.7 State transition Maps to the inputs
Note: In the above table, x denotes an impossible transition from one
state to another state.
So now we have all the tools required to recreate the original
message from the message we received:
Figure 2.8 Recreating the original message
The two flushing bits are discarded.
Here's an insight into how the traceback algorithm eventually
finds its way onto the right path even if it started out choosing the
wrong initial state. This could happen if more than one state had the
smallest accumulated error metric, for example. I'll use the figure for
the trellis at t = 3 again to illustrate this point:
See how at t = 3, both States 01₂ and 11₂ had an accumulated
error metric of 1. The correct path goes to State 01₂; notice that the
bold line showing the actual message path goes into this state. But
suppose we choose State 11₂ to start our traceback. The predecessor
state for State 11₂, which is State 10₂, is the same as the
predecessor state for State 01₂! This is because at t = 2, State 10₂ had
the smallest accumulated error metric. So after a false start, we are
almost immediately back on the correct path.
For the example 15-bit message, we built the trellis up for the
entire message before starting traceback. For longer messages, or
continuous data, this is neither practical nor desirable, due to
memory constraints and decoder delay. Research has shown that a
traceback depth of K x 5 is sufficient for Viterbi decoding with the type
of codes we have been discussing. Any deeper traceback increases
decoding delay and decoder memory requirements, while not
significantly improving the performance of the decoder. The exception
is punctured codes, which I'll describe later. They require deeper
traceback to reach their final performance limits.
To implement a Viterbi decoder in software, the first step is to
build some data structures around which the decoder algorithm will
be implemented. These data structures are best implemented as
arrays. The six primary arrays that we need for the Viterbi decoder are
as follows:
1) A copy of the convolutional encoder next state table, the state
   transition table of the encoder. The dimensions of this table
   (rows x columns) are 2^(K - 1) x 2^k. This array needs to be
   initialized before starting the decoding process.
2) A copy of the convolutional encoder output table. The
   dimensions of this table are 2^(K - 1) x 2^k. This array needs to be
   initialized before starting the decoding process.
3) An array (table) showing for each convolutional encoder current
   state and next state, what input value (0 or 1) would produce
   the next state, given the current state. We'll call this array the
   input table. Its dimensions are 2^(K - 1) x 2^(K - 1). This array needs
   to be initialized before starting the decoding process.
4) An array to store state predecessor history for each encoder
   state for up to K x 5 + 1 received channel symbol pairs. We'll
   call this table the state history table. The dimensions of this
   array are 2^(K - 1) x (K x 5 + 1). This array does not need to be
   initialized before starting the decoding process.
5) An array to store the accumulated error metrics for each state
   computed using the add-compare-select operation. This array
   will be called the accumulated error metric array. The
   dimensions of this array are 2^(K - 1) x 2. This array does not need
   to be initialized before starting the decoding process.
6) An array to store a list of states determined during traceback
   (the procedure described above). It is called the state sequence
   array. The dimensions of this array are (K x 5) + 1. This array
   does not need to be initialized before starting the decoding
   process.
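A minimal sketch of how these arrays might be set up for the K = 3, rate 1/2 example is shown below. The generator polynomials (111 and 101 in binary) are an assumption chosen because they reproduce the transitions and outputs quoted in the walkthrough; all variable names are hypothetical.

```python
K = 3                          # constraint length
NUM_STATES = 1 << (K - 1)      # 2^(K-1) = 4 states
NUM_INPUTS = 2                 # 2^k with k = 1 input bit per step
DEPTH = K * 5 + 1              # number of stored trellis columns
G = (0b111, 0b101)             # assumed generator polynomials

def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1

next_state = [[0] * NUM_INPUTS for _ in range(NUM_STATES)]
output = [[0] * NUM_INPUTS for _ in range(NUM_STATES)]
# None marks an impossible transition (the "x" entries of the input table).
input_table = [[None] * NUM_STATES for _ in range(NUM_STATES)]

for state in range(NUM_STATES):
    for bit in range(NUM_INPUTS):
        reg = (bit << (K - 1)) | state        # newest bit in the top position
        nxt = reg >> 1                        # oldest bit falls out
        next_state[state][bit] = nxt
        output[state][bit] = (parity(reg & G[0]) << 1) | parity(reg & G[1])
        input_table[state][nxt] = bit

# The remaining three arrays only need to be allocated, not initialized.
state_history = [[0] * DEPTH for _ in range(NUM_STATES)]
accumulated_error = [[0] * 2 for _ in range(NUM_STATES)]
state_sequence = [0] * DEPTH
```

The two columns of `accumulated_error` hold the previous and the current metrics, so the add-compare-select step can read one column while writing the other.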
2.7 CONCLUSION
This chapter gave a detailed description of the Viterbi
algorithm, which is based on coding theory.
CHAPTER 3
VITERBI DECODER
3.1 INTRODUCTION
This chapter discusses the main concepts of convolutional encoding,
the decoding of the encoded data with the Viterbi algorithm, and the
different blocks of the Viterbi decoder. Convolutional
encoding with Viterbi decoding is a powerful method for forward error
correction. It has been widely deployed in many wireless
communication systems to improve the limited capacity of the
communication channels. The Viterbi algorithm is the most
extensively employed decoding algorithm for convolutional codes. It
was developed as an asymptotically optimal decoding
algorithm for convolutional codes, and it is well suited to hardware
decoder implementation. Viterbi decoding of convolutional codes
has been found to be efficient and robust.
Viterbi decoding is the best-known implementation of
maximum-likelihood decoding. Here we narrow the options
systematically at each time tick. The Viterbi decoder examines an
entire received sequence of a given length. The decoder computes a
metric for each path and makes a decision based on this metric. All
paths are followed until two paths converge on one node. Then the
path with the better metric is kept and the other one is
discarded. The most common metric used is the Hamming distance
metric, for which the better path is the one with the smaller distance.
3.2 CONVOLUTIONAL ENCODER
Figure 3.1 block diagram for convolutional encoder
Figure 3.1 shows the block diagram of the convolutional encoder.
The encoder takes a 1-bit input and, after encoding, generates two
outputs, c1 and c2. c1 is computed as the XOR of the input bit with
the F6, F5, F4 and F1 outputs, whereas c2 is computed as the XOR of
the input bit with the F5, F4, F2 and F1 outputs.
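The two XOR equations above can be modelled in software as a six-stage shift register. In this sketch it is assumed that F1 holds the most recent previous input bit and F6 the oldest, and that all stages start cleared; neither detail is stated in the text, so the bit ordering may need to be mirrored to match Figure 3.1. The function name `encode` is hypothetical.

```python
def encode(bits):
    """Rate 1/2, K = 7 convolutional encoder following the tap list in the text."""
    f = [0] * 6                      # f[0] is F1 ... f[5] is F6 (assumed order)
    encoded = []
    for b in bits:
        c1 = b ^ f[5] ^ f[4] ^ f[3] ^ f[0]   # input, F6, F5, F4, F1
        c2 = b ^ f[4] ^ f[3] ^ f[1] ^ f[0]   # input, F5, F4, F2, F1
        encoded.append((c1, c2))
        f = [b] + f[:5]              # shift: the input enters F1, F6 drops out
    return encoded

# Encode the 7-bit test sequence mentioned later in this chapter.
symbols = encode([0, 1, 1, 0, 0, 0, 1])
print(symbols)
```

Each input bit produces one (c1, c2) pair, so the output stream carries twice as many bits as the input, which is what a code rate of 1/2 means.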
3.3 DECODER
This project presents a configurable 3-bit soft decision Viterbi
decoder implementation that meets the requirements for WLAN and
broadband applications. The programmable design supports a
constraint length K=7 soft decision Viterbi decoder (SDVD) realization
with a code rate (R) of 1/2. To assure a high throughput, an
architecture incorporating 64 Add Compare Select (ACS) units
operating in parallel has been selected. In addition, a low power
adaptive Viterbi decoder is developed, which uses strongly connected
trellis decoding.
To implement the Viterbi algorithm in reconfigurable hardware such as
an FPGA with the help of a hardware description language, it is
essential to translate the algorithm into digital blocks. The algorithm
demands computation of the path and branch metrics and storage of the
path metric decisions for trace back. Based on these requirements, the
blocks are arranged as shown in Figure 3.2.
Figure 3.2 Block Diagram for Decoder
3.3.1 Branch Metric Unit (BMU)
A branch metric unit's function is to calculate branch metrics,
which are normed distances between every possible symbol in the
code alphabet, and the received symbol.
There are hard decision and soft decision Viterbi decoders. A
hard decision Viterbi decoder receives a simple bit stream on its input,
and a Hamming distance is used as a metric. A soft decision Viterbi
decoder receives a bitstream containing information about the
reliability of each received symbol.
The squared Euclidean distance is used as a metric for soft
decision decoders.
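As an illustration of the soft decision metric, the sketch below computes a squared Euclidean distance for 3-bit quantized inputs. The mapping of a transmitted 0 to level 0 and a transmitted 1 to level 7 is an assumption made for the example, as is the function name `soft_branch_metric`.

```python
MAX_LEVEL = 7   # 3-bit quantization: 0 = strongest "0", 7 = strongest "1" (assumed)

def soft_branch_metric(received, codeword):
    """Squared Euclidean distance between received levels and an ideal codeword."""
    return sum((r - MAX_LEVEL * c) ** 2 for r, c in zip(received, codeword))

# A received pair that is close to the codeword 10 but slightly noisy.
received = (6, 1)
for codeword in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(codeword, soft_branch_metric(received, codeword))
```

Unlike the Hamming distance, this metric distinguishes a confidently received bit from a marginal one, which is where the soft decision decoder gets its performance advantage.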
Figure 3.3 block diagram for BMU
Figure 3.3 shows the block diagram of the BMU. The BMU (branch
metric unit) calculates branch metrics, which are normed distances
between every possible symbol in the code alphabet and the received
symbol. V1V2 represents a symbol, where V1 and V2 are the two bits of
the symbol. The normed (Hamming) distance is calculated with the help
of two XOR gates and one half adder.
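The gate-level structure mentioned above (two XOR gates feeding one half adder) can be mimicked bit by bit in Python. The function names and the use of c1 and c2 for the candidate codeword bits are assumptions for this sketch.

```python
def half_adder(a, b):
    """Return (sum, carry) of two single bits."""
    return a ^ b, a & b

def bmu_hamming(v1, v2, c1, c2):
    """Hamming distance between received bits V1V2 and codeword bits c1c2."""
    d1 = v1 ^ c1                    # first XOR gate: does bit 1 differ?
    d2 = v2 ^ c2                    # second XOR gate: does bit 2 differ?
    s, carry = half_adder(d1, d2)   # add the two difference bits
    return (carry << 1) | s         # 2-bit result: 0, 1, or 2

print(bmu_hamming(1, 1, 0, 0))
```

The half adder is enough because the two difference bits can sum to at most 2, which fits in the carry and sum outputs.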
3.3.2 Path Metric Unit (PMU)
Figure 3.4 block diagram for PMU
A path metric unit summarizes branch metrics to get metrics for
2^(K − 1) paths, one of which can eventually be chosen as optimal. Every
clock cycle it makes 2^(K − 1) decisions, discarding paths that are known
to be non-optimal. The results of these decisions are written to the
memory of a traceback unit.
In the path metric unit, eight add-compare-select units are
arranged in a particular fashion based on the trellis diagram.
Add Compare Select Unit
Figure 3.5 block diagram for ACSU
The core elements of a PMU are ACS (add-compare-select) units.
The way in which they are connected to one another is defined
by the trellis diagram of the specific code.
Comparator
After the path metrics have been computed, we have to find the ACSU
that holds the maximum path metric.
The input to this block is the 2-bit encoded output.
The encoder block was given the input sequence “0110001” on in1.
After decoding, the same sequence must be recovered, but in
reverse order; it appears as the 7-bit decoded_sequence
output.
3.3.3 Trace Back Unit (TBU)
In the FPGA-based implementation, the memory for storing the
survivor paths is divided into four memory banks, each of which
stores 64 survivor paths. The trace back length of the design is chosen
as D=8, which is equivalent to a trace back length of 8(K-1)=64
low-connectivity trellis stages. As mentioned earlier, this is large
enough to ensure the convergence of the survivor paths. Since a 2-D
circular memory addressing scheme is used in this design, the size of
each memory bank should be equal to 64 x 2D = 1024 words, with a word
length of 8 bits. The survivor path memory is organized as shown in
Fig. 3.6.
Figure 3.6 Architecture of the trace back unit.
There are three operation phases in the management of the
memory, write new data (WR), trace back read (TB), and decode read
(DR), which occupy memory with depths of D/2, D, and D/2,
respectively. These three operations access the three logical blocks of
the memory, respectively, by using 2-D circular memory addressing
scheme, in which the trace back read and the decode read operations
are performed by using one pointer instead of multiple ones. In view of
the memory depths of WR, TB, and DR, it is noted that the read rate
should be three times that of the write rate, so that the two read-
phases and one write-phase can be performed simultaneously within
the same time period. The write pointer advances forward from stage
to stage in the trellis, and the decisions made by the ACS
computations at all the trellis states regarding the survivors are
written into the memory block that is just freed by the decode read
operation. The read pointer traces back from stage to stage so that the
retrieved data after each read operation can be treated as a pointer to
indicate the corresponding state number of the previous stage. When
the decisions are written into the memory for one stage, the read
pointer will trace back the memory by three stages. After D such trace
backs, all the survivor paths will converge and the actual decoding
takes place. There are D/2 x 8 = 32 bits of data decoded at each trace
back iteration, and these retrieved data should be rearranged into their
original order.
The architecture of the trace back unit is shown in Fig. 3.6. The
four block select RAMs, each of size 1024 x 8 bits, provided by the
FPGA chip are used to implement the four memory banks,
RAM1024_0 to RAM1024_3. At each stage, the survivor decisions
made by the four pairs of systolic arrays in Fig. 4.6 are written into
the four corresponding memory banks. At each trace back read, the
previously retrieved data (the state number) points to the current data
to be read. The two most significant bits, s7 and s6, of the previously
retrieved data in the register are sent to the 2/4 Decoder and
BUS_MUX. These bits are used to select the memory bank to be
accessed and to have its stored data retrieved. The six least significant
bits, s5 to s0, of the previously retrieved data in the register are used to
address the 64 decisions in one of the memory banks selected at each
read operation. In other words, at every trace back operation, one of
the 64 decisions (corresponding to the 64 survivor paths) stored in the
four memory banks can be read by using the previously retrieved data
as a pointer. After D trace backs, corresponding to the eight strongly
connected trellis stages, the decoded data is loaded into the 8-bit shift
register and then shifted to the first-in-last-out (FILO) register. Finally,
the decoded data is sent out in the reverse order from the FILO
register.
3.4 CONCLUSION
This chapter described the basic Viterbi decoder, which uses the
Viterbi algorithm described in Chapter 2. It gives the basis for the more
complex Viterbi decoders developed in the next chapter.
CHAPTER 4
DESIGN CONSIDERATIONS
4.1 INTRODUCTION
In this chapter, we describe the design considerations of the
adaptive Viterbi algorithm and propose a low-power design of Viterbi
decoders. We also describe the top-down design approach employed to
implement both the adaptive Viterbi decoder and the non-adaptive Viterbi
decoder for a constraint length of K=7 and a code rate of r=1/2.
4.2 MODULAR DESCRIPTION
In order to design and implement the modified adaptive Viterbi
decoder, we choose the following specifications that have been used
for hardware implementation of the Viterbi decoder:
1) Constraint length: K=7(64 states);
2) Code rate: r=1/2
3) Survivor path length: L=64
4) Decision level: 3-bit soft decision.
Figure 4.1 Block diagram of the adaptive Viterbi decoder.
The block diagram of the modified adaptive Viterbi decoder is
shown in Fig. 4.1. In this decoder, the composite branch metric
generation (CBM) unit is designed to collect the two soft input
sequences and to compute all the possible branch metrics
corresponding to the eight low-connectivity trellis stages. The path
metric update unit is designed to generate the composite branch
metrics and to process the matrix–vector ACS computation. The trace
back unit can retrieve the decoded sequence from the survivor path
memory through a trace back strategy.
Figure 4.2 input buffer for the two soft inputs
4.3 COMPOSITE BRANCH METRIC UNIT
4.3.1 Logic Diagram
Figure 4.3 Unit to compute the eight one-stage branch metrics
corresponding to a composite branch metric.
The input buffer is designed to gather the two soft input
sequences corresponding to a given strongly connected trellis stage so
that the composite branch metrics of the stage can be generated. As
shown in Fig. 4.2, the two soft inputs are sent to the corresponding
3-bit registers depending on the enable signal generated by the 3/8
decoder. The 3/8 decoder outputs the eight enable signals in a
circular way so that the two soft input sequences corresponding to a
strongly connected trellis stage are loaded into the 3-bit registers. The
sequences are then buffered into two 24-bit registers and used to
compute the corresponding eight one-stage branch metrics. Each of
the eight one-stage branch metrics has four possible results
depending on the corresponding two-bit codeword in the original
low-connectivity trellis stage. Table I gives the expressions for the
evaluation of a one-stage branch metric under the four possible values
of the two-bit codeword, where in1 and in2 represent the two soft
inputs at the low-connectivity trellis stage. The eight one-stage branch
metrics corresponding to a composite branch metric are computed by
the unit shown in Fig. 4.3, in which each one-stage branch metric is
computed under the four possible values of the two-bit codeword, and
is output to the array processors in the BM_4s shown in Fig. 4.6.
4.3.2 Source Code Description
In this module, the composite branch metric is calculated by
combining the eight one-stage branch metrics that correspond to it.
The inputs a, b and reset are applied to the one-stage branch metric
logic to generate the 9-bit BMU output.
4.4 ADD-COMPARE-SELECT
4.4.1 Logic Diagram
Figure 4.4 Arithmetic pipelined processor for ACS-4
The arithmetic-pipelining processor for an ACS_4 is shown in
Fig. 4.4. This processor is implemented with two pipeline stages. The
adder in the first pipeline stage is used to compute the 64 path
metrics at any of the corresponding 16 states of a given stage, whereas
the adder in the second pipeline stage is used to make comparisons
amongst the 64 path metrics computed by the first pipeline stage. The
length of the adders, including a sign bit, in the two pipeline stages
should be 10 bits in order to perform the modulo arithmetic. In the
second pipeline stage, the comparison between the path metrics
computed at the current and previous clock cycles at a state of a given
stage is carried out by adding the path metric at the current clock
cycle to the complement of the path metric at the previous clock cycle,
increased by unity (that is, its two's complement). In this way, the
sign bit of the adder represents the result of the comparison between
the two path metrics. If the sign bit is “1”, the path metric at the
current clock cycle is smaller than the path metric at the previous
clock cycle, and if the sign bit is “0”, then the former is larger than
or equal to the latter. In the
meantime, the path metric with the smaller metric at the state of a
given stage is selected by the multiplexer (Mux1) and sent to the
Register. In addition, the 8-bit counter in the second pipeline stage is
used to represent the 64 possible states of the previous stage from
which the paths originate and terminate at the various states of the
present stage. One of these 64 paths would be the survivor path
depending on the sign bit of the adder in the second pipeline stage,
and is selected by the multiplexer (Mux2) at each clock cycle. Thus, at
a given stage, the path metric and the survivor corresponding to any
of the 16 states can be updated through 64 clock cycles. It should be
mentioned that the critical path of the design as determined from the
Xilinx report is the second pipeline stage and consists of the 10-bit
adder and the multiplexer. The synchronization between ACS_4 and
BM_4 can be achieved by simply designing a timing scheme that
makes them start at different time instants. This means that at a
given stage the processors in an ACS_4 will not start their
computations until the results from the processors in the
corresponding BM_4 have been generated. In this way, throughput of
the design will not be changed, but a latency between ACS_4 and
BM_4 is introduced. This is in fact the latency of any processor in
BM_4.
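The sign-bit comparison performed in the second pipeline stage can be sketched in Python as below. The 10-bit width follows the text; the function names are hypothetical, and the sketch assumes, as modulo arithmetic requires, that the true difference between the two path metrics is smaller than half the adder range.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1           # keep results within 10 bits

def sign_of_difference(current, previous):
    """Add current to the two's complement of previous; return the sign bit."""
    diff = (current + ((~previous) & MASK) + 1) & MASK
    return diff >> (WIDTH - 1)    # 1 means current < previous

def select(current, previous):
    """Mux1 behaviour: keep the smaller of the two path metrics."""
    return current if sign_of_difference(current, previous) else previous

print(select(5, 9), select(9, 5))
# A metric that has wrapped past 1023 (for example 1027 stored as 3) still
# compares as larger than 1020, because only the difference matters.
print(sign_of_difference(3, 1020))
```

This is why the path metrics never need to be rescaled: as long as their spread stays within half the range, wrap-around does not change the comparison result.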
Systolic Array Architecture Based Branch and Path metric
Updating
The modified adaptive Viterbi algorithm can be implemented
using the systolic array architecture shown in Fig. 4.5, in which time
multiplexing and arithmetic pipelining are exploited. For K=7, the
matrix-vector ACS computation can be processed with the 64×64
adjacency matrix B_q and the 1×64 path metric vector P_{q-1}. In this
design, the adjacency matrix B_q is partitioned into four 64×16
submatrices, and the four submatrices along with the path metric
vector P_{q-1} are used to update the corresponding 64 path metrics of
the given stage. This is achieved by the four pairs of interconnected
systolic arrays formed as shown in Fig. 4.5, where the top four
interconnected processors represent the systolic array BM-4 and the
bottom four interconnected processors represent the systolic array
ACS-4.
Figure 4.5 Architecture of a pair of systolic arrays.
In the top systolic array, the composite branch metric is
generated by using equations (2) and (3). The codeword vector C_0j is
divided into four 1×64 sub vectors to be stored in four ROMs, ROM0 to
ROM3, and the codeword C_i0 is stored in the ROM. The codeword C_0j stays
inside each processor, and C_i0 moves to the right. In the bottom
systolic array, the path metric P_q(j) of stage q is computed according
to (10) and decisions on the survivor paths are made. The
intermediate ACS result R_q(j) stays inside the j-th
processor, whereas P_{q-1}(j) of stage q-1 moves to the right. All data
movements in the two systolic arrays are synchronized. This design
can thus be viewed as a “result stay” systolic array architecture.
Figure 4.6 Systolic array based architecture for adaptive viterbi
decoder
As shown in Fig. 4.6, for computing the corresponding four
subsets of B_q at each stage, the codeword vector C_0j is divided into
four 1×64 sub vectors to be stored in four ROMs, ROM0 to ROM3,
and the codeword vector C_i0 is stored in the ROM. Globally, the four
submatrix-vector ACS computations are carried out simultaneously
by the corresponding four pairs of systolic arrays. Locally, inside each
pair of the systolic arrays, the corresponding submatrix-vector ACS
computation is time multiplexed. For a given K, the number of trellis
states at a given stage is equal to 2^(K−1). If the time-multiplexing
technique is not employed, the number of pairs of systolic arrays
(each array having 4 processors) is 2^(K−1)/4 = 2^(K−3). For K=7 this is 16,
and the total number of array processors is 128. As K increases, the
number of array processors will increase exponentially, which is the
same as in the case of the conventional butterfly architecture-based
parallel Viterbi decoder. However, we employ a time-multiplexing
technique. In this technique, when K=7, each pair of systolic arrays
(each array having four processors) processes 16 out of 64 states
sequentially in 4 iterations. For a given K, the more systolic array
pairs are used, the fewer iterations are needed.
Thus, as K increases, our time-multiplexing approach provides the
flexibility to trade off the number of iterations with the number of
systolic array pairs.
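The trade-off between systolic array pairs and iterations can be checked with a short calculation. The sketch below simply evaluates the counting rules from the paragraph above for a given K; the function names are hypothetical, and it assumes two arrays per pair with four processors each, as in the text.

```python
PROCESSORS_PER_ARRAY = 4
ARRAYS_PER_PAIR = 2               # one BM_4 array and one ACS_4 array

def pairs_without_multiplexing(K):
    """Pairs needed when every state has its own processor: 2^(K-1) / 4."""
    return (1 << (K - 1)) // PROCESSORS_PER_ARRAY

def total_processors(pairs):
    """Array processors contained in the given number of pairs."""
    return pairs * ARRAYS_PER_PAIR * PROCESSORS_PER_ARRAY

def iterations_needed(K, pairs):
    """Sequential iterations per stage when only `pairs` pairs are built."""
    states_per_pair = (1 << (K - 1)) // pairs
    return states_per_pair // PROCESSORS_PER_ARRAY

K = 7
full = pairs_without_multiplexing(K)
print(full, total_processors(full), iterations_needed(K, pairs=4))
```

Halving the number of pairs doubles the number of iterations per stage, so the same functions can be used to explore other area and throughput combinations.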
4.4.2 Source code Description
In this module, the arithmetic-pipeline-based add-compare-select
unit is developed. The two 9-bit inputs from the BMU are given as
inputs along with decode_in, m, n, p, q, r, s, clk_sm, en, en_main, en2 and
reset. This module generates 9-bit outputs, which are the inputs for
the trace back operation.
4.5 TRACE BACK UNIT
4.5.1 Logic Diagram
Figure 4.7 Architecture of the trace back unit.
In the FPGA-based implementation, the memory for storing the
survivor paths is divided into four memory banks, each of which
stores 64 survivor paths. The trace back length of the design is chosen
as D=8, which is equivalent to a trace back length of 8(K-1)=64
low-connectivity trellis stages. As mentioned earlier, this is large
enough to ensure the convergence of the survivor paths. Since a 2-D
circular memory addressing scheme is used in this design, the size of
each memory bank should be equal to 64 x 2D = 1024 words, with a word
length of 8 bits. The survivor path memory is organized as shown in
Fig. 4.7.
There are three operation phases in the management of the
memory, write new data (WR), trace back read (TB), and decode read
(DR), which occupy memory with depths of D/2, D, and D/2,
respectively. These three operations access the three logical blocks of
the memory, respectively, by using 2-D circular memory addressing
scheme, in which the trace back read and the decode read operations
are performed by using one pointer instead of multiple ones. In view of
the memory depths of WR, TB, and DR, it is noted that the read rate
should be three times that of the write rate, so that the two read-
phases and one write-phase can be performed simultaneously within
the same time period. The write pointer advances forward from stage
to stage in the trellis, and the decisions made by the ACS
computations at all the trellis states regarding the survivors are
written into the memory block that is just freed by the decode read
operation. The read pointer traces back from stage to stage so that the
retrieved data after each read operation can be treated as a pointer to
indicate the corresponding state number of the previous stage. When
the decisions are written into the memory for one stage, the read
pointer will trace back the memory by three stages. After D such trace
backs, all the survivor paths will converge and the actual decoding
takes place. There are D/2 x 8 = 32 bits of data decoded at each trace
back iteration, and these retrieved data should be rearranged into their
original order.
The architecture of the trace back unit is shown in Fig. 4.7. The
four block select RAMs, each of size 1024 x 8 bits, provided by the
FPGA chip are used to implement the four memory banks,
RAM1024_0 to RAM1024_3. At each stage, the survivor decisions
made by the four pairs of systolic arrays in Fig. 4.6 are written
into the four corresponding memory banks. At each trace back read,
the previously retrieved data (the state number) points to the current
data to be read. The two most significant bits, s7 and s6, of the previously
retrieved data in the register are sent to the 2/4 Decoder and
BUS_MUX. These bits are used to select the memory bank to be
accessed and to have its stored data retrieved. The six least significant
bits, s5 to s0, of the previously retrieved data in the register are used to
address the 64 decisions in one of the memory banks selected at each
read operation. In other words, at every trace back operation, one of
the 64 decisions (corresponding to the 64 survivor paths) stored in the
four memory banks can be read by using the previously retrieved data
as a pointer. After D trace backs, corresponding to the eight strongly
connected trellis stages, the decoded data is loaded into the 8-bit shift
register and then shifted to the first-in-last-out (FILO) register. Finally,
the decoded data is sent out in the reverse order from the FILO
register.
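The bank-select and address split described above can be illustrated with a small behavioural model. This is a Python sketch, not the project's VHDL; the function names `split_pointer` and `one_hot_bank_enable` are hypothetical, and the 8-bit pointer layout (2 bank bits followed by 6 address bits) is taken from the description.

```python
def split_pointer(state):
    """Split an 8-bit trace back pointer into (bank, address).

    Assumed layout, following the description: the two most significant
    bits select one of the four memory banks and the six least
    significant bits address one of the 64 decisions in that bank.
    """
    if not 0 <= state <= 0xFF:
        raise ValueError("pointer must fit in 8 bits")
    bank = (state >> 6) & 0b11      # input to the 2/4 decoder and BUS_MUX
    address = state & 0b111111      # selects one of 64 decisions
    return bank, address


def one_hot_bank_enable(bank):
    """Model the 2/4 decoder: exactly one of four enable lines is high."""
    return [1 if i == bank else 0 for i in range(4)]
```

For example, the pointer value 0b10000101 would select bank 2 and decision 5 within that bank.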
4.5.2 Source code Description
The source code for this module is developed based on the
inputs from the BMU and ACSU. The module retraces the Hamming
distances. The previously retrieved data in the register is used to
address the 64 decisions in the memory bank selected at each
read operation.
4.6 ADAPTIVE VITERBI DECODER
4.6.1 Logic Diagram
Figure 4.8 Block diagram of the adaptive Viterbi decoder.
Here all the sub-modules, i.e. the CBMU, the systolic architecture
of the AVA, and the Trace Back Unit, are integrated to
decode the encoded information shown in 5.11 for the constraint
length K=7 and the code rate r=1/2.
In this decoder, the composite branch metric (CBM) generation
unit is designed to collect the two soft input sequences and to
compute all the possible branch metrics corresponding to the eight
low-connectivity trellis stages. The path metric update unit is designed
to generate the composite branch metrics and to process the
matrix-vector ACS computation. The trace back unit can retrieve the
decoded sequence from the survivor path memory through a trace back
strategy.
4.6.2 Source code Description
Here all the sub-module codes are integrated into the top module. The
inputs to the top module are Decoder_in, clk_sm, en and reset, and the
outputs are a 12-bit count_div_2_out, onethird_clk_out and viterbi_out,
which give the error-free decoded data. The convolutional decoder
based on the adaptive Viterbi decoder is an improved version.
4.7. CONCLUSION
In this chapter the three sub-modules and the top-level module are
described in detail. Based on this, the source code implementation is
taken up in the next chapters.
CHAPTER 5
SIMULATION RESULTS
5.1 INTRODUCTION
This chapter focuses on the program flow description of each
module. It gives the simulation results of the Encoder, Composite
Branch Metric Unit, Add Compare Select Unit, Survivor Memory Unit
modules and the Adaptive Viterbi Decoder (top-level module). All these
modules are synthesized using the Xilinx ISE Project Navigator tool.
Simulations are usually divided into the following 5 categories.
• Behavioral simulation
• Functional simulation
• Gate-level simulation or post-synthesis simulation
• Switch-level simulation
• Transistor-level or circuit-level simulation
The list is ordered from high-level to low-level simulation (high
level being more abstract and low level being more detailed).
Proceeding from high level to low level, the simulations become more
accurate, but they also become progressively more complex and take
longer to run. While it is possible to perform a behavioral-level
simulation of the whole system, it is impractical to perform a
circuit-level simulation of more than a few hundred transistors.
Behavioral Simulation
This method models large pieces of a system as black boxes
with inputs and outputs. This is often done using VHDL or Verilog.
Functional Simulation
This simulation ignores timing and includes unit-delay
simulation, which sets the delays to a fixed value. Once a behavioral
or functional simulation verifies that the system works, the next step is
to check the timing performance.
Logic Simulation or Gate-level Simulation
This simulation is used to check the timing performance of an
ASIC. In gate-level simulation, a logic cell is treated as a black box
modeled by a function whose variables are the input signals. The
function may also model the delay through the logic cell. Setting all
the delays to unit values is the equivalent of functional simulation.
Switch-level Simulation
This simulation can provide more accurate timing predictions
than Gate-level simulation.
Transistor-level Simulation
This is the most accurate, but at the same time the most
complex and time-consuming, of all the simulations. It
requires models of transistors that describe their nonlinear voltage and
current characteristics.
Simulation can also be divided on the basis of layout into two
categories.
• Pre-layout Simulation
• Post-layout Simulation
Simulation is used at many stages during the design. Initial
pre-layout simulation includes logic-cell delays but no interconnect
delays. Estimates of capacitance may be included after completing
logic synthesis, but post-layout simulation can be performed only
after physical design is complete. In post-layout simulation, an SDF
(Standard Delay Format) file is included in the simulation
environment.
5.2 ENCODER
5.2.1 Simulation results
Figure 5.1 Simulation results of Encoder
5.2.2 Signal Description
S.No.  Signal name  Type    Description
1.     Clk          Input   Triggers output
2.     Data_in      Input   '1' bit stream
3.     Reset        Input   Reset encoder
4.     V2v1         Output  Convolution output
Port diagram
Figure 5.2 port diagram of encoder
5.2.3 Logical description
The encoder block takes a 1-bit input and, after encoding, generates
the two outputs V1 and V2. V1 is calculated as the XOR of the input
bit and the F6, F5, F4 and F1 outputs, whereas V2 is calculated as the
XOR of the input bit and the F5, F4, F2 and F1 outputs.
Figure 5.1 shows the simulation result for the convolutional
encoder. Depending upon the input, i.e. data_in, the encoder
generates the output, i.e. v2v1, of 6-bit data.
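The XOR structure described above can be sketched as a short behavioural model. This is a Python illustration rather than the project's VHDL; the tap sets for V1 and V2 come from the text, while the function name `encode` and the shift direction (the input enters F6 and moves towards F1) are assumptions made for this sketch.

```python
def encode(bits):
    """Rate-1/2 convolutional encoder sketch with six flip-flops F1..F6.

    Tap sets follow the text: V1 = in ^ F6 ^ F5 ^ F4 ^ F1 and
    V2 = in ^ F5 ^ F4 ^ F2 ^ F1. The shift direction (input enters F6,
    data moves towards F1) is an assumption of this sketch.
    """
    f = {i: 0 for i in range(1, 7)}   # F1..F6, cleared as after reset
    out = []
    for b in bits:
        v1 = b ^ f[6] ^ f[5] ^ f[4] ^ f[1]
        v2 = b ^ f[5] ^ f[4] ^ f[2] ^ f[1]
        out.append((v2, v1))          # ordered as in the signal name v2v1
        # shift: F1 <- F2 <- ... <- F6 <- input
        for i in range(1, 6):
            f[i] = f[i + 1]
        f[6] = b
    return out
```

Feeding a single 1 followed by zeros exposes the tap pattern of each output; under the assumed shift direction these patterns correspond to the generator pair (171, 133) in octal.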
5.3 COMPOSITE BRANCH METRIC UNIT
5.3.1 Simulation Results
Figure 5.3 Simulation results of composite branch metric unit
5.3.2 Signal Description
S.No.  Signal Name  Type    Description
1.     a            Input   Load 2-bit input data
2.     b            Input   Load 2-bit input data
3.     Reset        Input   Reset pin is used to restart the BMU
4.     c            Output  Generates 9-bit data
Port diagram
Figure 5.4 port diagram of composite branch metric unit
5.3.3 Logical Description
The two soft inputs are sent to the corresponding 3-bit registers
depending on the enable signal generated by the 3/8 decoder. The 3/8
decoder outputs the eight enable signals in a circular way so that the
two soft input sequences corresponding to a strongly connected trellis
stage are loaded into the 3-bit registers.
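The circular loading scheme described above can be sketched as follows. This is a Python behavioural model under stated assumptions: a 3-bit counter drives the 3/8 decoder, each decoder output enables one pair of 3-bit registers, and the function name `load_soft_inputs` is hypothetical.

```python
def load_soft_inputs(pairs):
    """Load soft-input pairs into eight register pairs in circular order.

    A 3-bit counter drives a modelled 3/8 decoder; the one-hot enable
    chooses which register pair captures the current pair of inputs.
    The register bank size (eight pairs of 3-bit values) follows the
    text; everything else is illustrative.
    """
    registers = [(0, 0)] * 8
    counter = 0
    for a, b in pairs:
        if not (0 <= a < 8 and 0 <= b < 8):
            raise ValueError("soft inputs must fit in 3 bits")
        enable = [1 if i == counter else 0 for i in range(8)]  # 3/8 decoder
        registers = [(a, b) if en else reg
                     for en, reg in zip(enable, registers)]
        counter = (counter + 1) % 8   # wrap around after eight stages
    return registers
```

After eight pairs every register pair holds one stage of soft inputs; a ninth pair wraps around and overwrites the first register pair.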
5.4 ADD COMPARE SELECT UNIT
5.4.1 Simulation Results
Figure 5.5 Simulation result of Add Compare Select Unit
5.4.2 Signal Description
S.No.  Signal Name  Type    Description
1.     decoder_in   Input   Load 2-bit input data, say '11'
2.     m            Input   Load 2-bit input data '00'
3.     n            Input   Reset pin is used to restart the BMU
4.     p            Input   Load 2-bit input data, say '01'
5.     q            Input   Load 2-bit input data, say '00'
6.     r            Input   Load 9-bit input data from BMU
7.     s            Input   Load 9-bit input data from BMU
8.     clk_sm       Input   If clk_sm = '1', the ACS process starts
9.     en           Input   en input '1'
10.    en_main      Input   en_main input '1'
11.    en2          Input   Second enable
12.    reset        Input   Resets the ACSU
13.    x            Output  9-bit output for TBU
14.    y            Output  9-bit output for TBU
15.    e            Output  Input for TBU
16.    f            Output  Input for TBU
Port diagram
Figure 5.6 port diagram of ACSU
5.4.3 Logical Description
The length of the adders, including a sign bit, in the two pipeline
stages should be 10 bits in order to perform the modulo arithmetic. In
the second pipeline stage, the comparison between the path metrics
computed at the current and previous clock cycles at a state of a given
stage is carried out by adding the path metric at the current clock
cycle to the complement of the path metric at the previous clock cycle,
increased by unity.
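The comparison described above is a two's complement subtraction in 10-bit modulo arithmetic. The following Python sketch shows the idea; the function names are hypothetical, and it assumes that the survivor is the smaller path metric and that the true difference between the two metrics stays below 2^9, the usual condition for modulo normalisation.

```python
WIDTH = 10                      # adder width including the sign bit
MASK = (1 << WIDTH) - 1


def modulo_less_than(current, previous):
    """Return True if `current` is the smaller path metric.

    The difference is formed as current + (~previous + 1) in 10-bit
    two's complement arithmetic; the sign bit of the result decides.
    This is valid as long as the true difference between the metrics
    stays below 2**(WIDTH - 1) (an assumption of this sketch).
    """
    diff = (current + ((~previous + 1) & MASK)) & MASK
    return bool(diff >> (WIDTH - 1))      # sign bit set -> current < previous


def select_survivor(current, previous):
    """Keep the smaller of two path metrics (both stored modulo 2**WIDTH)."""
    return current if modulo_less_than(current, previous) else previous
```

Because only the sign of the difference is used, the comparison remains correct when one metric has already wrapped around: a stored value of 3 that represents 1027 is still treated as larger than a stored 1020.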
5.5 TRACE BACK UNIT
5.5.1 Simulation results
Figure 5.7 Simulation Result of Trace Back Unit.
5.5.2 Signal description
The trace back operation involves 64 survivor memory paths and
a large number of signals.
5.5.3 Logical Description
The memory for storing the survivor paths is divided into four memory
banks, each of which stores 64 survivor paths. The trace back length
of the design is chosen as D=8 that is equivalent to a trace back
length of 8(K-1)=64 low-connectivity trellis stages.
5.6 VITERBI DECODER
Here all the sub-modules, i.e. the CBMU, the systolic architecture
of the AVA, and the Trace Back Unit, are integrated to
decode the encoded information shown in 5.11 for the constraint
length K=7 and the code rate r=1/2. The simulation results are as
follows.
5.6.1 Simulation Results
Figure 5.8 Simulation Result of complete Adaptive Viterbi Decoder
5.6.2 Signal Description
S.No.  Signal Name       Type    Description
1.     Decoder_in        Input   Convolutional input
2.     Clk_sm            Input   Clock for triggering
3.     En                Input   Enables Viterbi decoder
4.     Reset             Input   Resets the decoder
5.     Count_div_2_out   Output  11-bit decoder output
6.     Onethird_clk_out  Output  Clock reduction
7.     Viterbi_out       Output  Viterbi output
5.6.3 Logical explanation
In this decoder, the composite branch metric (CBM) generation unit is
designed to collect the two soft input sequences and to compute all
the possible branch metrics corresponding to the eight
low-connectivity trellis stages. The path metric update unit is designed to
generate the composite branch metrics and to process the
matrix-vector ACS computation. The trace back unit can retrieve the decoded
sequence from the survivor path memory through a trace back
strategy.
5.7 CONCLUSION
The focus of this work on the low-power design of Viterbi decoders at
the logic level is the reduction of dynamic power dissipation in the
standard cell design environment. We considered two methods in our
low-power design: strongly connected trellis decoding and systolic
array processing. These methods were applied to the survivor
path storage block to reduce the number of ACS computations.
The design flow that was followed is similar to the one used in
industry. The behavior of the Viterbi decoder is described in VHDL.
The design is then synthesized to generate a gate-level circuit, with
the Virtex-II Pro xc2vp30-7-ff896 as the target device in the Xilinx tool.
CHAPTER 6
FPGA IMPLEMENTATION
6.1 INTRODUCTION
This chapter describes the FPGA implementation of the
proposed work using hardware description languages. This includes
the synthesis of the top module, the Viterbi decoder. It describes the
RTL view, port diagram, timing analysis, chip view, floor plan, etc. of
the implemented work.
6.2 SYNTHESIS REPORT
Figure 6.1 Synthesis report of Viterbi decoder
HDL Synthesis Report of Adaptive Viterbi Decoder on Virtex-II Pro
xc2vp30-7-ff896
Macro statistics
# RAMs : 4
1024x8-bit single-port RAM : 4
# Adders/Subtractors : 161
10-bit adder : 48
11-bit adder : 1
4-bit adder carry out : 16
5-bit adder : 16
5-bit adder carry out : 16
6-bit adder : 16
6-bit adder carry out : 16
7-bit adder : 16
7-bit adder carry out : 16
# Counters : 16
8-bit up counter : 16
# Registers : 529
1-bit register : 16
10-bit register : 17
11-bit register : 1
16-bit register : 6
4-bit register : 336
5-bit register : 32
6-bit register : 32
7-bit register : 32
8-bit register : 57
# Latches : 1
4-bit latch : 1
# Multiplexers : 138
10-bit 4-to-1 multiplexer : 4
11-bit 4-to-1 multiplexer : 1
4-bit 4-to-1 multiplexer : 128
8-bit 4-to-1 multiplexer : 5
# Xors : 4
16-bit xor2 : 4
=========================================================================
* Final Report *
=========================================================================
Final Results
RTL Top Level Output File Name : viterbi_full.ngr
Top Level Output File Name : viterbi_full
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 19
Cell Usage :
# BELS : 7147
# BUF : 9
# GND : 1
# INV : 4
# LUT1 : 779
# LUT2 : 1799
# LUT2_D : 1
# LUT3 : 812
# LUT3_L : 6
# LUT4 : 993
# LUT4_D : 14
# MUXCY : 1627
# MUXF5 : 110
# MUXF6 : 48
# MUXF7 : 24
# MUXF8 : 12
# VCC : 1
# XORCY : 907
# FlipFlops/Latches : 2854
# FDCE : 2461
# FDE : 384
# FDPE : 1
# FDPE_1 : 1
# LD : 6
# LDCE_1 : 1
# RAMS : 384
# RAM32X1S : 384
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 18
# IBUF : 4
# OBUF : 14
=========================================================================
Device utilization summary:
---------------------------
Selected Device : 2vp30ff896-7
Number of Slices: 2778 out of 13696 20%
Number of Slice Flip Flops: 2854 out of 27392 10%
Number of 4 input LUTs: 5176 out of 27392 18%
Number used as logic: 4408
Number used as RAMs: 768
Number of IOs: 19
Number of bonded IOBs: 19 out of 556 3%
Number of GCLKs: 1 out of 16 6%
Observation / comments
The synthesis report gives the gate-level netlist generated from a model
described in VHDL. It gives complete information about the top
module, the Viterbi decoder, in terms of the number of logic gates,
tri-state buffers, flip-flops and registers used. The design is
synthesized using the Xilinx ISE Project Navigator tool. The device
utilized is xc2vp30-7-ff896.
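As a quick consistency check, the percentages in the device utilization summary can be recomputed from the used and available counts. The short Python sketch below does this; the figures are copied from the report, and the use of truncating integer division is an assumption that happens to match the reported values.

```python
# Figures copied from the device utilization summary above.
used_and_available = {
    "Slices": (2778, 13696),
    "Slice Flip Flops": (2854, 27392),
    "4 input LUTs": (5176, 27392),
    "bonded IOBs": (19, 556),
    "GCLKs": (1, 16),
}


def utilization_percent(used, available):
    """Whole-number percentage, truncated as the report appears to do."""
    return used * 100 // available


percentages = {name: utilization_percent(u, a)
               for name, (u, a) in used_and_available.items()}

# LUT breakdown: logic LUTs plus LUTs used as RAM should equal the total.
lut_total = 4408 + 768
```

The LUT breakdown also adds up: 4408 LUTs used as logic plus 768 used as RAMs give the reported total of 5176.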
6.3 RTL VIEW
The RTL view of the Viterbi decoder is shown below.
Figure 6.2 RTL Schematic of Adaptive Viterbi Decoder
The RTL view gives a detailed account of the number of registers,
multiplexers, buffers, etc. in the implementation. The adaptive Viterbi
decoder occupies slightly more area than the non-adaptive Viterbi
decoder, but it consumes less power and has a shorter delay.
6.4 PORT DIAGRAM
Figure 6.3 Port diagram of Viterbi Decoder
The port diagram shows the overall input and output structure.
It illustrates how a convolutionally encoded input passing through the
Viterbi decoder yields an error-free reproduction of the original data.
6.5 TIMING ANALYSIS
Timing Summary:
---------------
Speed Grade: -7
Minimum period: 5.176ns (Maximum Frequency: 193.192MHz)
Minimum input arrival time before clock: 6.206ns
Maximum output required time after clock: 3.514ns
Maximum combinational path delay: No path found
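The maximum frequency in the timing summary is simply the reciprocal of the minimum period. A minimal Python check, with a hypothetical helper name, is given below; the small difference from the reported figure is presumably due to rounding of the period in the report.

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a minimum period in ns."""
    return 1000.0 / min_period_ns


f_max = max_frequency_mhz(5.176)   # close to the reported 193.192 MHz
```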
6.6 FLOOR PLAN
Figure 6.4 Floor plan of Viterbi decoder
The floor plan shows the layout on the IC. It gives a clear idea of
the area utilization of the device.
6.7 CHIP VIEW
Figure 6.5 Chip view of Viterbi decoder
The chip view gives an idea of the chip layout, which is helpful
before the final ASIC is made.
6.8 POWER DATA
Figure 6.6 power data of viterbi Decoder
Power data gives the final description of the power calculation. In
our implementation, 28.1 mW of power is consumed. We measured the
dynamic power dissipation of the circuit based on the switching activity
from gate-level simulation. Static power dissipation of the cells is not
considered in the measurement. This method is a compromise
between power and area, and it is sufficient for our purpose.
6.9 CONCLUSION
Comparison of Synthesis Results
Parameter            Adaptive Viterbi Decoder   Non-Adaptive Viterbi Decoder
Power                28.1 mW                    103 mW
Delay                1.13 ns                    2.71 ns
Area (No. of IOBs)   223                        205
The design flow that was followed is similar to the one used in
industry. The behavior of the Viterbi decoder is described in VHDL.
The design is then synthesized to generate a gate-level circuit, with
the Virtex-II Pro xc2vp30-7-ff896 as the target device in the Xilinx tool.
CHAPTER 7
CONCLUSION AND FUTURE SCOPE OF WORK
7.1 CONCLUSION
- Convolutional coding is a coding scheme often employed in
deep space communications and, more recently, in digital wireless
communications.
- Viterbi decoders are used to decode convolutional codes. Viterbi
decoders employed in digital wireless communications are complex in
their implementation and dissipate a large amount of power. We
proposed a Viterbi decoder design for low power dissipation in this work.
- The focus of this work on the low-power design of Viterbi decoders
at the logic level is the reduction of dynamic power dissipation in the
standard cell design environment.
- We considered two methods in our low-power design: strongly
connected trellis decoding and systolic array processing. These
methods were applied to the survivor path storage block
to reduce the number of ACS computations.
- The design flow that was followed is similar to the one used in
industry. The behavior of the Viterbi decoder is described in VHDL.
The design is then synthesized to generate a gate-level circuit, with
the Virtex-II Pro xc2vp30-7-ff896 as the target device in the Xilinx tool.
- We measured the dynamic power dissipation of the circuit based
on the switching activity from gate-level simulation. Static power
dissipation of the cells is not considered in the measurement. This
method is a compromise between power and area, and it is sufficient
for our purpose.
- The systolic array processing with strongly connected trellis
for the adaptive Viterbi decoder dissipates 28.1 mW and has a delay of
1.13 ns. The power saving is 66% compared with non-adaptive
Viterbi decoding, with a small increase in area.
- The data rates achievable can be very high compared to
decoders of constraint lengths 3 or 4.
7.2 FUTURE SCOPE
- As VLSI technology is advancing rapidly, in future
Viterbi decoders with higher constraint lengths of 9 or 11 can be
implemented for higher data rates.
BIBLIOGRAPHY
1. Abdul Rafeeq, Abdul-Shakoor and Valek Szwarc, "A high
performance soft decision Viterbi decoder for WLAN and broadband
application," IEEE Trans. Commun., May 2006.
2. F. Chan and D. Haccoun, "Adaptive Viterbi decoding of convolutional
codes over memoryless channels," IEEE Trans. Commun., vol. 45, no.
11, pp. 1389–1400, Nov. 1997.
3. C.-Y. Chang and K. Yao, "Systolic array processing of the Viterbi
algorithm," IEEE Trans. Inf. Theory, vol. 35, no. 1, pp. 76–86, Jan.
1989.
4. Y.-N. Chang, H. Suzuki, and K. K. Parhi, "A 2-Mb/s 256-state
10-mW rate-1/3 Viterbi decoder," IEEE J. Solid-State Circuits, vol. 35,
no. 6, pp. 826–834, Jun. 2000.
5. I. Kang and A. N. Willson Jr., "Low-power Viterbi decoder for CDMA
mobile terminals," IEEE J. Solid-State Circuits, vol. 33, no. 3,
pp. 473–482, Mar. 1998.
6. C.-Y. Chang and K. Yao, "Systolic array processing of the Viterbi
algorithm," IEEE Trans. Inf. Theory, vol. 35, no. 1, pp. 76–86, Jan. 1989.
7. C. G. Caraiscos and K. Z. Pekmestzi, "Low-latency bit-parallel
systolic VLSI implementation of FIR digital filters," IEEE Trans.
Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 7,
pp. 529–537, Jul. 1996.
8. S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson, "A
dynamically reconfigurable adaptive Viterbi decoder," in Proc.
FPGA'02, 2002.
9. R. Henning and C. Chakrabarti, "Low-power approach for decoding
convolutional codes with adaptive Viterbi algorithm approximations,"
in Proc. ISLPED'02, Monterey, CA, Aug. 12–14, 2002.
Books:
1. "Digital Logic and Design" – Morris Mano, PHI Publications, 3rd
Edition.
2. "VHDL Primer" – J. Bhasker, 3rd Edition.
3. Douglas L. Perry, "VHDL Programming by Example", Fourth
Edition, Tata McGraw-Hill.
4. Charles H. Roth, Jr., "Digital Design using VHDL", Thomson
Brooks/Cole.
5. Peter J. Ashenden, "The Designer's Guide to VHDL", Second
Edition, Morgan Kaufmann Publishers, 2001.
6. "Coding and decoding with convolutional codes" by Charan
Langton, editor.
Websites:
1. www.wikipedia.com
2. www.xilinx.com
3. www.opencores.org
4. www.complextoreal.com
APPENDIX-A
VLSI DESIGN FLOW
INTRODUCTION
The word digital has made a dramatic impact on our society.
More significant is a continuous trend towards digital solutions in all
areas – from electronic instrumentation, control, data manipulation,
signals processing, telecommunications etc., to consumer electronics.
Development of such solutions has been possible due to good digital
system design and modelling techniques.
CONVENTIONAL APPROACH TO DIGITAL DESIGN
Digital ICs of SSI and MSI types have become universally
standardized and have been accepted for use. Whenever a designer
has to realize a digital function, he uses a standard set of ICs along
with a minimal set of additional discrete circuitry.
Consider a simple example of realizing the function
Q_{n+1} = Q_n + (A · B)
Here Q_n, A, and B are Boolean variables, with Q_n being the value of Q
at the nth time step. Here A · B signifies the logical AND of A and B; the
'+' symbol signifies the logical OR of the logic variables on either side.
A circuit to realize the function is shown in Figure 1. The circuit can
be realized in terms of two ICs – an A-O-I gate and a flip-flop. It can be
directly wired up, tested, and used.
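The behaviour of this function over successive clock steps can be illustrated with a few lines of Python. This is a behavioural sketch only; the names `next_q` and `run` are hypothetical and the inputs are taken to be single bits (0 or 1).

```python
def next_q(q, a, b):
    """One clock step of Q(n+1) = Q(n) OR (A AND B)."""
    return q | (a & b)


def run(inputs, q0=0):
    """Apply a sequence of (A, B) pairs and return Q after each step."""
    q, trace = q0, []
    for a, b in inputs:
        q = next_q(q, a, b)
        trace.append(q)
    return trace
```

Once A and B are both 1 in the same step, Q becomes 1 and stays at 1 for all later steps, since the OR feeds the stored value back.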
Figure 1 A simple digital circuit
With comparatively larger circuits, the task mostly reduces to one of
identifying the set of ICs necessary for the job and interconnecting;
rarely does one have to resort to a micro level design [Wakerly]. The
accepted approach to digital design here is a mix of the top-down and
bottom-up approaches as follows [Hill & Peterson]:
• Decide the requirements at the system level and translate them to
circuit requirements.
• Identify the major functional blocks required, such as a timer, DMA unit,
register file, etc., as in the design of a processor.
• Whenever a function can be realized using a standard IC, use the
same – for example a programmable counter, mux, demux, etc.
• Whenever the above is not possible, form the circuit to carry out the
block functions using standard SSI – for example gates, flip-flops, etc.
• Use additional components like transistor, diode, resistor, capacitor,
etc., wherever essential.
Once the above steps are gone through, a paper design is ready.
Starting with the paper design, one has to do a circuit layout. The
physical location of all the components is tentatively decided; they are
interconnected and the 'circuit-on-paper' is made ready. Once a paper
design is done, a layout is carried out and a net-list prepared. Based
on this, the PCB is fabricated and populated, and all the populated
cards are tested and debugged.
Figure 2 Sequence of steps in conventional electronic circuit design
At the debugging stage one may encounter three types of problems:
• Functional mismatch: The realized and expected functions are
different. One may have to go through the relevant functional block
carefully and locate any error logically. Finally the necessary
correction has to be carried out in hardware.
• Timing mismatch: The problem can manifest in different forms. One
possibility is due to the signal going through different propagation
delays in two paths and arriving at a point with a timing mismatch.
This can cause faulty operation. Another possibility is a race condition
in a circuit involving asynchronous feedback. This kind of problem
may call for elaborate debugging. The preferred practice is to do
debugging at smaller module stages and to ensure that feedback
through larger loops is avoided: it becomes essential to check for the
existence of long asynchronous loops.
• Overload: Some signals may be overloaded to such an extent that the
signal transition may be unduly delayed or even suppressed. The
problem manifests as reflections and erratic behaviour in some cases
(The signal has to be suitably buffered here.). In fact, overload on a
signal can lead to timing mismatches.
The above have to be carried out after completion of the prototype PCB
manufacturing; it involves cost, time, and also a redesigning process
to develop a bug free design.
VLSI DESIGN
The complexity of VLSIs being designed and used today makes
the manual approach to design impractical. Design automation is the
order of the day. With the rapid technological developments in the last
two decades, the status of VLSI technology is characterized by the
following [Wai-kai, Gopalan]:
• A steady increase in the size and hence the functionality of the ICs.
• A steady reduction in feature size and hence increase in the speed of
operation as well as gate or transistor density.
• A steady improvement in the predictability of circuit behaviour.
• A steady increase in the variety and size of software tools for VLSI
design.
The above developments have resulted in a proliferation of approaches
to VLSI design. We briefly describe the procedure of automated design
flow [Rabaey, Smith MJ]. The aim is more to bring out the role of a
Hardware Description Language (HDL) in the design process. An
abstraction based model is the basis of the automated design.
Abstraction Model
The model divides the whole design cycle into various domains
(see Figure 3). With such an abstraction through a division process
the design is carried out in different layers. The designer at one layer
can function without bothering about the layers above or below. The
thick horizontal lines separating the layers in the figure signify the
compartmentalization. As an example, let us consider design at the
gate level. The circuit to be designed would be described in terms of
truth tables and state tables. With these as available inputs, the
designer has to express them as Boolean logic equations and realize
them in terms of gates and flip-flops. In turn, these form the inputs to the layer
immediately below. Compartmentalization of the approach to design in
the manner described here is the essence of abstraction; it is the basis
for development and use of CAD tools in VLSI design at various levels.
The design methods at different levels use the respective aids
such as Boolean equations, truth tables, state transition table, etc.
But the aids play only a small role in the process. To complete a
design, one may have to switch from one tool to another, raising the
issues of tool compatibility and learning new environments.
ASIC DESIGN FLOW
As with any other technical activity, development of an ASIC
starts with an idea and takes tangible shape through the stages of
development as shown in Figure 4 and shown in detail in Figure 5.
The first step in the process is to expand the idea in terms of
behaviour of the target circuit. Through stages of programming, the
same is fully developed into a design description – in terms of well
defined standard constructs and conventions.
Figure 3 Design domain and levels of abstraction
Figure 4 Major activities in ASIC design
The design is tested through a simulation process; it is to check,
verify, and ensure that what is wanted is what is described.
Simulation is carried out through dedicated tools. With every
simulation run, the simulation results are studied to identify errors in
the design description. The errors are corrected and another
simulation run carried out. Simulation and changes to design
description together form a cyclic iterative process, repeated until an
error-free design is evolved.
Design description is an activity independent of the target technology
or manufacturer. It results in a description of the digital circuit. To
translate it into a tangible circuit, one goes through the physical
design process. The same constitutes a set of activities closely linked
to the manufacturer and the target Technology.
Design Description
The design is carried out in stages. The process of transforming
the idea into a detailed circuit description in terms of the elementary
circuit components constitutes design description. The final circuit of
such an IC can have up to a billion such components; it is arrived at
in a step-by-step manner. The first step in evolving the design
description is to describe the circuit in terms of its behaviour. The
description looks like a program in a high level language like C. Once
the behavioural level design description is ready, it is tested
extensively with the help of a simulation tool; it checks and confirms
that all the expected functions are carried out satisfactorily. If
necessary, this behavioural level routine is edited, modified, and rerun
– all done manually. Finally, one has a design for the expected system
– described at the behavioural level. The behavioural design forms the
input to the synthesis tools, for circuit synthesis. The behavioural
constructs not supported by the synthesis tools are replaced by data
flow and gate level constructs. To summarize, the designer has to develop
synthesizable code for the design.
Figure 5 ASIC design and development flow
The design at the behavioural level is to be elaborated in terms
of known and acknowledged functional blocks. It forms the next
detailed level of design description. Once again the design is to be
tested through simulation and iteratively corrected for errors. The
elaboration can be continued one or two steps further. It leads to a
detailed design description in terms of logic gates and transistor
switches.
Optimization
The circuit at the gate level – in terms of the gates and flip-flops
– can be redundant in nature. The same can be minimized with the
help of minimization tools. The step is not shown separately in the
figure. The minimized logical design is converted to a circuit in terms
of the switch level cells from standard libraries provided by the
foundries. The cell based design generated by the tool is the last step
in the logical design process; it forms the input to the first level of
physical design [Micheli].
Simulation
The design descriptions are tested for their functionality at every
level – behavioural, data flow, and gate. One has to check here
whether all the functions are carried out as expected and rectify any that are not.
All such activities are carried out by the simulation tool. The tool also
has an editor to carry out any corrections to the source code.
Simulation involves testing the design for all its functions, functional
sequences, timing constraints, and specifications. Normally testing
and simulation at all the levels – behavioural to switch level – are
carried out by a single tool; the same is identified as “scope of
simulation tool” in Figure 5.
Synthesis
With the availability of design at the gate (switch) level, the
logical design is complete. The corresponding circuit hardware
realization is carried out by a synthesis tool. Two common approaches
are as follows:
• The circuit is realized through an FPGA [Oldfield]. The gate level
design description is the starting point for the synthesis here. The
FPGA vendors provide an interface to the synthesis tool. Through the
interface the gate level design is realized as a final circuit. With many
synthesis tools, one can directly use the design description at the data
flow level itself to realize the final circuit through an FPGA. The FPGA
route is attractive for limited volume production or a fast development
cycle.
• The circuit is realized as an ASIC. A typical ASIC vendor will have
his own library of basic components like elementary gates and flip-
flops. Eventually the circuit is to be realized by selecting such
components and interconnecting them conforming to the required
design. This constitutes the physical design. Being an elaborate and
costly process, a physical design may call for an intermediate
functional verification through the FPGA route. The circuit realized
through the FPGA is tested as a prototype. It provides another
opportunity for testing the design closer to the final circuit.
Physical Design
A fully tested and error-free design at the switch level can be the
starting point for a physical design [Baker & Boyce, Wolf]. It is to be
realized as the final circuit using (typically) a million components in
the foundry's library. The step-by-step activities in the process are
described briefly as follows:
• System partitioning: The design is partitioned into convenient
compartments or functional blocks. Often it would have been done at
an earlier stage itself and the software design prepared in terms of
such blocks. Interconnection of the blocks is part of the partition
process.
• Floor planning: The positions of the partitioned blocks are planned
and the blocks are arranged accordingly. The procedure is analogous
to the planning and arrangement of domestic furniture in a residence.
Blocks with I/O pins are kept close to the periphery; those which
interact frequently or through a large number of interconnections are
kept close together, and so on. Partitioning and floor planning may
have to be carried out and refined iteratively to yield best results.
• Placement: The selected components from the ASIC library are
placed in position on the “silicon floor.” This is done for each of the
blocks above.
• Routing: The components placed as described above are to be
interconnected to the rest of the block. This is done for each of the
blocks by suitably routing the interconnects. Once the routing is
complete, the physical design can be taken as complete. The final
mask for the design can be made at this stage and the ASIC
manufactured in the foundry.
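The placement and routing steps above can be illustrated with a minimal sketch. It is not the procedure used by any particular ASIC tool. It assumes a simple grid model of the silicon floor, and the function names `hpwl` and `route` and the data layout are hypothetical. The first function estimates the wire length of a placement as the half-perimeter of each net's bounding box, a common figure of merit for comparing placements. The second finds a shortest rectilinear route between two cells with a breadth-first search that avoids blocked cells.

```python
from collections import deque


def hpwl(placement, nets):
    """Half-perimeter wire length of a placement (hypothetical model).

    placement: component name -> (x, y) grid position
    nets: list of nets, each a list of component names to be connected
    """
    total = 0
    for net in nets:
        xs = [placement[name][0] for name in net]
        ys = [placement[name][1] for name in net]
        # Bounding-box half perimeter approximates the wire needed for this net.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def route(grid, src, dst):
    """Shortest rectilinear path from src to dst on a grid, or None.

    grid[r][c] is True where the cell is already occupied (assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}          # visited cells and the cell each was reached from
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            path = []
            while cell is not None:   # walk back from dst to src
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not grid[nr][nc] and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                 # no route exists around the obstacles
```

A placement with a smaller `hpwl` value generally leaves the router less work, which is one reason placement and routing are often iterated.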
Post Layout Simulation
Once the placement and routing are completed, the performance
specifications like silicon area, power consumed, path delays, etc., can
be computed. An equivalent circuit can be extracted at the component
level and a performance analysis carried out. This constitutes the final
stage called “verification.” One may have to go through the placement
and routing activity once again to improve performance.
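As a rough illustration of the path-delay computation mentioned above, the sketch below finds the longest delay through a small network of components. It assumes the extracted circuit is an acyclic graph with one fixed delay per component. The name `critical_path_delay`, the dictionary layout and the delay values are hypothetical and are not taken from any tool.

```python
def critical_path_delay(delays, fanout):
    """Longest accumulated delay along any path in an acyclic circuit.

    delays: component name -> delay of that component (assumed units, e.g. ns)
    fanout: component name -> list of components driven by it
    """
    memo = {}

    def longest_from(node):
        # Delay of this component plus the slowest path that continues from it.
        if node not in memo:
            downstream = (longest_from(n) for n in fanout.get(node, []))
            memo[node] = delays[node] + max(downstream, default=0.0)
        return memo[node]

    return max(longest_from(node) for node in delays)


# Hypothetical example: an input feeding two gates that both drive a flip-flop.
example_delays = {"in": 0.0, "and": 1.5, "xor": 2.0, "ff": 0.5}
example_fanout = {"in": ["and", "xor"], "and": ["ff"], "xor": ["ff"]}
worst = critical_path_delay(example_delays, example_fanout)
```

If the worst-case delay exceeds the timing budget, the components on that path are the natural candidates for another round of placement and routing.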
Critical Subsystems
The design may have critical subsystems. Their performance
may be crucial to the overall performance; in other words, to improve
the system performance substantially, one may have to design such
subsystems afresh. Redesign here may imply redefinition of the
basic feature size of the component, component design, placement of
components, or routing done separately and specifically for the
subsystem. A set of masks used in the foundry may have to be made
afresh for the purpose.
ROLE OF HDL
An HDL provides the framework for the complete logical design
of the ASIC. All the activities coming under the purview of an HDL are
shown enclosed in bold dotted lines in Figure 1.4. Verilog and VHDL
are the two most commonly used HDLs today. Both have constructs
with which the design can be fully described at all the levels. There
are additional constructs available to facilitate setting up of the test
bench, spelling out test vectors for them and “observing” the outputs
from the designed unit. IEEE has brought out Standards for the
HDLs, and the software tools conform to them. Verilog as an HDL was
developed at Gateway Design Automation, which was later acquired by
Cadence Design Systems; Cadence placed it in the public domain in
1990. It was established as a formal IEEE Standard (IEEE 1364) in
1995, and a revised version was brought out in 2001. However,
most of the simulation tools available today conform only to the 1995
version of the standard. VHDL, used by a substantial number of
VLSI designers today, is the HDL used in this project for modelling the
design.