UNIVERSIDAD POLITECNICA DE MADRID
ESCUELA TECNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACION

PROYECTO FIN DE CARRERA (Final Degree Project)

TURBO DECODER IMPLEMENTATION BASED ON THE SOVA ALGORITHM

Carlos Arrabal Azzalini

Madrid, April 2007



PROYECTO FIN DE CARRERA

TURBO DECODER IMPLEMENTATION BASED ON THE SOVA ALGORITHM

Author: Carlos Arrabal Azzalini
Tutor: Pablo Ituero Herrero

DEPARTAMENTO DE INGENIERIA ELECTRONICA
ESCUELA TECNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACION
UNIVERSIDAD POLITECNICA DE MADRID

Madrid, April 2007


PROYECTO FIN DE CARRERA: Turbo Decoder Implementation Based on the SOVA Algorithm

AUTHOR: Carlos Arrabal Azzalini

TUTOR: Pablo Ituero Herrero

The committee appointed to judge the Project named above, composed of the following members:

PRESIDENT: D. Carlos Alberto Lopez Barrio

MEMBER: Dña. María Luisa Lopez Vallejo

SECRETARY: D. Jose Luis Ayala Rodrigo

SUBSTITUTE: D. Gabriel Caffarena Fernandez

agree to award it the grade of:

Madrid, ____ of ____, 2007

The Secretary of the Committee


To my parents


Acknowledgements

First of all I would like to thank Marisa for assigning this project and the scholarship to me. I have enjoyed working on it all along.

I would like to give special thanks to my mentor and friend Pablo for his advice and support. I had a great time working with him.

Thanks to my friends at the Lab for the fantastic environment.

Finally I would like to thank Sandra for all her support and patience and for being there all the time.


Abstract

Today's most common architectures for implementing the SOVA algorithm are affected by two parameters: the trace back depth and the reliability updating depth. These parameters play an important role in the BER performance, power consumption, area and system throughput trade-offs. In this work, we present a new approach to SOVA decoding that is not limited by the mentioned parameters and leads to an optimum SOVA algorithm execution. Besides, the architecture is built from recursive units which consume less power since the number of employed registers is reduced. We also present a new scheme to improve the SOVA BER performance which is based on an approximation to the BR-SOVA algorithm. With this scheme the BER achieved is within 0.1 dB of the one obtained with a Max-Log-MAP algorithm.


Contents

1 Introduction

2 Turbo Codes
  2.1 Binary Phase Shift Keying Communication System Model
  2.2 Soft Information and Log-Likelihood Ratios in Channel Coding
  2.3 Convolutional Encoders
  2.4 Trellis Diagrams
  2.5 Turbo Codes Encoders
  2.6 Trellis Termination

3 Decoding Turbo Codes: Soft Output Viterbi Algorithm
  3.1 Turbo Codes decoding process
  3.2 SISO Unit: SOVA
    3.2.1 Viterbi Algorithm Decoding Example
    3.2.2 Soft Output extension for the VA
    3.2.3 Improving the soft output information of the SOVA algorithm

4 Hardware Implementation of a Turbo Decoder based on SOVA
  4.1 Turbo Decoder RAM buffers
  4.2 Interleaving/Deinterleaving unit of the turbo decoder
  4.3 SOVA as the core of the SISO
  4.4 Branch Metric Unit
  4.5 Add Compare Select Unit
  4.6 Survival Memory Unit
    4.6.1 Register Exchange Survival Memory Unit
    4.6.2 Systolic Array Survival Memory Unit
    4.6.3 Two Step approach for the Survival Memory Unit
    4.6.4 Other Architectures
    4.6.5 Fusion Points Survival Memory Unit
  4.7 Fusion Points based Reliability Updating Unit
  4.8 Control Unit
  4.9 Improvements

5 Methodology

6 Measures and Results
  6.1 Quantization Scheme
  6.2 Synthesis Results
  6.3 Bit Error Rate Results
  6.4 Throughput Results
  6.5 Power Results

7 Conclusions and future work

Bibliography

List of Figures

2.1 Simplified communication system model
2.2 Discrete AWGN Channel
2.3 NSC encoder of rate 1/2
2.4 RSC encoder of rate 1/2
2.5 RSC encoder used in the UMTS standard. Pfb = [1011], Pg = [1101]
2.6 Trellis example of an RSC encoder with Pfb = [111], Pg = [101]
2.7 Serial concatenated Turbo encoder
2.8 Parallel concatenated Turbo encoder. RSC encoder with Pfb = [111], Pg = [101]
2.9 Turbo Encoder with trellis termination in one encoder. Pfb = [111], Pg = [101]
3.1 Turbo Decoder generic scheme
3.2 Output during state transition for a given trellis
3.3 Trellis diagram for VA. Code given by Pfb = [111], Pg = [101]
3.4 Soft Output extension example for the Viterbi Algorithm. Code given by Pfb = [111], Pg = [101]
4.1 Hardware implementation of a turbo decoder
4.2 Overall system states diagram
4.3 Data-in RAM
4.4 Data-out RAM
4.5 RAM La/Le and RAM Le/La connections
4.6 Interleaving/Deinterleaving Unit
4.7 Viterbi and SOVA decoder schemes
4.8 BMU for the RSC encoder
4.9 Add Compare Select Unit for the SOVA. Pfb = [111], Pg = [101]
4.10 Modular representation of the path metrics. Each path metric register has a width of nb bits
4.11 Merging of paths in the traceback
4.12 Register Exchange SMU for the SOVA. Pfb = [111], Pg = [101]
4.13 Register Exchange processing elements
4.14 Systolic Array for the Viterbi Algorithm
4.15 Survival unit for the Systolic Array
4.16 Two Step idea. First tracing back, and then reliability updating
4.17 Fusion Points based SMU
4.18 Possibility of fusion points
4.19 Fusion Point detection algorithm
4.20 Sequence of the Fusion Point algorithm
4.21 FPU architecture for a code with constraint length K = 3
4.22 Reliability updating problem
4.23 One possible solution to the problem of bit reliabilities releasing
4.24 Solution adopted for the bit reliabilities releasing problem
4.25 Fusion Points based Reliability updating unit
4.26 Recursive Updating Unit
4.27 Recursive Updating Process
4.28 Control Unit General Scheme
4.29 Control Unit State Diagram
4.30 Reliability Updating Unit with BR-SOVA approximation
4.31 Recursive Update with BR-SOVA approximation
5.1 Project Work Flow
5.2 Hardware-in-the-loop approach
5.3 Hardware-in-the-loop verification procedure
6.1 Δ quantization effect on the system BER performance. BR-SOVA approximation scheme. Simulation with quantization. MCF. Pfb = [111], Pg = [101]
6.2 HR-BRapprox comparison. Infinite precision simulations. MCF interleaver. Pfb = [111], Pg = [101]
6.3 HR-SOVA HIL results. MCF interleaver. Pfb = [111], Pg = [101]
6.4 BR-SOVA approximation HIL results. MCF interleaver. Pfb = [111], Pg = [101]
6.5 HR-BRapprox HIL comparison. MCF interleaver. Pfb = [111], Pg = [101]
6.6 HR-BRapprox comparison. Infinite precision simulations. RAND interleaver. Pfb = [1011], Pg = [1101]
6.7 BR-SOVA approximation HIL results. RAND interleaver. Pfb = [1011], Pg = [1101]
6.8 Throughput statistics. f = 25 MHz, fRUU = 25 MHz. Pfb = [111], Pg = [101]
6.9 Throughput statistics. f = 25 MHz, fRUU = 50 MHz. Pfb = [111], Pg = [101]
6.10 Throughput statistics. f = 16.66 MHz, fRUU = 25 MHz. Pfb = [111], Pg = [101]
6.11 Throughput statistics. f = 25 MHz, fRUU = 25 MHz. Pfb = [1011], Pg = [1101]
6.12 Throughput statistics. f = 25 MHz, fRUU = 50 MHz. Pfb = [1011], Pg = [1101]
6.13 Throughput statistics. f = 16.66 MHz, fRUU = 50 MHz. Pfb = [1011], Pg = [1101]

Chapter 1

Introduction

The goal of any communication system is to achieve highly reliable communications with a reduced transmitted power and to reach data rates as high as possible. All these parameters usually represent a trade-off that designers have to deal with. Bandwidth is also a limited resource in communication systems. Error-detecting and error-correcting techniques are used in digital communication systems in order to obtain higher spectral and power efficiencies. This is based on the fact that with these techniques more channel errors can be tolerated, and so the communication system can operate with a lower transmitted power, transmit over longer distances, tolerate more interference, use smaller antennas, and transmit at higher data rates.

One of the most widespread of these techniques is Forward Error Correction (FEC). On the transmitter side, an FEC encoder adds redundancy to the data in the form of parity information. Then at the receiver, an FEC decoder is able to exploit the redundancy in such a way that a reasonable number of channel errors can be corrected. Claude Shannon, the father of Information Theory, showed that if long random codes are used, reliable communications can take place at the minimum required Signal to Noise Ratio (SNR). However, truly random codes are not practical to implement. Codes must possess some structure in order to have computationally tractable encoding and decoding algorithms.

Turbo Codes were introduced by Berrou, Glavieux and Thitimajshima in 1993 [3]. These codes exhibit an astonishing performance close to the theoretical Shannon limit, in addition to good feasibility of VLSI (Very Large Scale Integration) implementation. Turbo Codes are used in the two most widely adopted third-generation cellular standards (UMTS and CDMA2000). They are also incorporated into standards used by NASA for deep space communications (CCSDS) and digital video broadcasting (DVB-T).

Decoding in Turbo Codes is carried out by a soft-output decoding algorithm: an algorithm that provides a measure of reliability for each bit that it decodes. Specifically, two of the component decoding algorithms used in Turbo Codes are known as MAP (Maximum a Posteriori) and SOVA (Soft Output Viterbi Algorithm). The high computational complexity of the MAP algorithm makes its implementation expensive and power-hungry. This is why most implementations perform a simplified version of the algorithm. The most common simplifications are the Log-MAP and Max-Log-MAP algorithms, which work in the logarithmic domain. Regardless, these algorithms are still more complex and power-hungry compared to the SOVA algorithm, which presents the drawback of a worse BER (Bit Error Rate) performance.

This work deals with a SOVA algorithm implementation. Today's most common architectures for implementing the SOVA algorithm are affected by two parameters: the trace back depth and the reliability updating depth. These parameters play an important role in the BER performance, power consumption, area and system throughput trade-offs. In this work, we present a new approach to SOVA decoding that is not limited by the mentioned parameters and leads to an optimum SOVA algorithm execution. Besides, the architecture is built from recursive units which consume less power since the number of employed registers is reduced. We also present a new scheme to improve the SOVA BER performance. With this scheme the BER achieved is within 0.1 dB of the one obtained with the Max-Log-MAP algorithm.

The design was implemented on a low-cost Spartan III FPGA (Field Programmable Gate Array). The system was tested for two major polynomials and the system BER was measured for input messages at different SNRs. Throughput measurements were also taken, while power estimations were carried out by simulations.

The key points of this work can be summarized in the following list:

• A complete Turbo Decoder implementation based on the SOVA algorithm has been achieved:

– A two step approach for the SOVA decoding has been adopted [9].

– A new algorithm that does not depend on the trace back depth of the survival path has been introduced for the SOVA decoding.

– A new architecture for the previous algorithm has been designed.

– A new architecture for updating bit reliabilities according to the HR-SOVA algorithm has been designed.

– A novel updating process that approximates the BR-SOVA algorithm for binary RSC codes has been presented. With this scheme the BER performance is less than 0.1 dB from the Max-Log-MAP approach.

• The system has been described with generic VHDL code.

• The system has been thoroughly tested.

– BER curves have been measured for the HR-SOVA and the BR-SOVA approximation with different codes. (Real System).

– Throughput estimations have been obtained for different codes. (Real System).

– Power estimations have been obtained with simulation tools. (VHDL Post-Place and Route model).

The structure of this document is the following. The second chapter introduces Turbo Codes and sets the environment where this work resides. The third chapter describes the SOVA algorithm in depth and sets out the main ideas for the fourth chapter, which describes today's most common architectures and introduces the SOVA implementation proposed in this work. It is inside the fourth chapter where the new algorithm, in conjunction with the new architectures, is presented. The fifth chapter illustrates the practical design, from implementation to verification. Finally, the sixth chapter presents the results and measures carried out on the real system, while the seventh chapter gives the conclusions and establishes the basis of the future work.


Chapter 2

Turbo Codes

Turbo Codes were presented by Berrou, Glavieux and Thitimajshima [3] in 1993. They had a tremendous impact on the discipline of channel coding. They are, along with LDPC (Low Density Parity Check) codes, the closest approximation ever to the codes that Claude Shannon proved to exist in the mid-20th century, which are able to achieve error-free communications. Since their introduction, they have been intensively studied. The first commercial application was presented in 1997 [1] and today they are already part of the UMTS (Universal Mobile Telecommunication System) standards. They have become the first choice when working at low SNRs (Signal to Noise Ratio), such as in wireless applications and deep space communications.

In this chapter we first introduce the communication system model which has been employed in this work as the scenario for channel coding tests. Next we introduce the concept of soft information, which is the key to Turbo Codes. We then describe Turbo Code encoders and finally we discuss trellis termination. The decoding process is left to the next chapter.

2.1 Binary Phase Shift Keying Communication System Model.

In order to explain the soft information concept and the log-likelihood ratio, we will develop a simplified communication model that will be the base example for the following concepts. This communication model is shown in Figure 2.1. On the transmitter side there is a source of information that we assume to provide equally likely symbols. There is a block for channel coding, which is the main subject of this work and is carried out by a Turbo Code. The modulation scheme is BPSK (Binary Phase-Shift Keying) and the channel is assumed to be AWGN (Additive White Gaussian Noise). On the receiver side, the complementary blocks for those in the transmitter are found. Also, there is a matched filter which maximizes the SNR before sampling the received data. Note that we have omitted the synchronization recovery subsystem, which will be assumed to be ideal.

As a starting point, the source provides message bits m_i at a rate of 1/T bits/sec, which are fed into the channel coding block. In a Turbo Code context, these bits are grouped to form a frame of size L bits. The channel coding block outputs a coded frame of size 2L. So, for each message bit m_i there is a symbol made of two bits, x_i = {x_i^s, x_i^p}; then the code rate is r = 1/2 (one input bit, two output bits). The modulator generates the waveform signals from the input bits and transmits them through the AWGN channel.

Figure 2.1: Simplified communication system model.

Figure 2.2: Discrete AWGN Channel

The matched filter filters the received signals which, at the corresponding time instants, are sampled, and so the y_i symbols are obtained. The AWGN channel, in conjunction with the matched filter and the sampling unit, can be modeled as a discrete AWGN channel as shown in figure 2.2. Modeling a discrete channel is desirable, since computer simulations are simplified and the computing time is reduced. The equation that governs the behavior of this channel is the following:

y_i = a √E_s (2x_i − 1) + n_G    (2.1)

where a is a fading amplitude which is assumed to be 1. If a fading channel were under the scope of study, then a would be assumed to be a random variable with a Rayleigh distribution. E_s is the energy of the transmitted symbol and it relates to the energy per bit of information as E_s = r E_b. Finally, n_G represents white Gaussian noise with zero mean and a power spectral density of N_0/2. For simulation purposes equation 2.1 is rewritten as:

y_i = a (2x_i − 1) + n'_G    (2.2)

where the variance of n'_G becomes σ² = N_0 / (2E_s).
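Equation 2.2 maps directly onto a simulation routine. The following is a minimal, illustrative Python sketch (the function name `awgn_channel` is our own, not part of the thesis); it assumes the normalized model above with a = 1 by default, so that σ² = N_0/(2E_s) = 1/(2r·Eb/N0):

```python
import math
import random

def awgn_channel(bits, ebn0_db, rate=0.5, a=1.0):
    """Discrete AWGN channel of equation 2.2: y_i = a(2x_i - 1) + n'_G,
    where the noise variance is sigma^2 = N_0/(2 E_s) = 1/(2 r Eb/N0)."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)                  # Eb/N0 as a linear ratio
    sigma = math.sqrt(1.0 / (2.0 * rate * ebn0))     # noise standard deviation
    return [a * (2 * x - 1) + random.gauss(0.0, sigma) for x in bits]

received = awgn_channel([1, 0, 1, 1], ebn0_db=2.0)   # four noisy soft samples
```

At low Eb/N0 the Gaussian noise occasionally flips the sign of a sample, which is exactly the error event the decoder must correct.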


2.2 Soft Information and Log-Likelihood Ratios in ChannelCoding.

Whenever a symbol y_i is received at the decoder, the following test rule helps us determine what the transmitted symbol was, based only on the observation y_i and without the help of the code:

P(x_i = 1 | y_i) > P(x_i = 0 | y_i) ⇒ x_i = 1
P(x_i = 1 | y_i) < P(x_i = 0 | y_i) ⇒ x_i = 0

This rule is known as MAP (Maximum a Posteriori) since P(x_i = 1 | y_i) and P(x_i = 0 | y_i) are the a posteriori probabilities. Using Bayes' theorem, the previous rule can be rewritten as:

P(y_i | x_i = 1) P(x_i = 1) / P(y_i) > P(y_i | x_i = 0) P(x_i = 0) / P(y_i) ⇒ x_i = 1
P(y_i | x_i = 1) P(x_i = 1) / P(y_i) < P(y_i | x_i = 0) P(x_i = 0) / P(y_i) ⇒ x_i = 0

and rewriting the inequalities as ratios yields:

[P(y_i | x_i = 1) P(x_i = 1)] / [P(y_i | x_i = 0) P(x_i = 0)] > 1 ⇒ x_i = 1
[P(y_i | x_i = 1) P(x_i = 1)] / [P(y_i | x_i = 0) P(x_i = 0)] < 1 ⇒ x_i = 0

Applying the natural logarithm to the previous inequalities does not alter the result of the test, so we obtain:

ln [P(y_i | x_i = 1) / P(y_i | x_i = 0)] + ln [P(x_i = 1) / P(x_i = 0)] > 0 ⇒ x_i = 1
ln [P(y_i | x_i = 1) / P(y_i | x_i = 0)] + ln [P(x_i = 1) / P(x_i = 0)] < 0 ⇒ x_i = 0

The previous ratios in the log domain are the LLR (Log-Likelihood Ratio) metrics, which are a useful way to represent the soft decisions of receivers or decoders. We can summarize the previous steps with a single equation:

L(x_i | y_i) = L(y_i | x_i) + L(x_i)

where L(x_i | y_i) = ln [P(x_i = 1 | y_i) / P(x_i = 0 | y_i)], L(y_i | x_i) = ln [P(y_i | x_i = 1) / P(y_i | x_i = 0)] and L(x_i) = ln [P(x_i = 1) / P(x_i = 0)]. The previous equation is usually rewritten as:

Λ'_i = L_c(y_i) + La_i

where La_i is the LLR of the a priori information and L_c(y_i) is related to a measure of the channel reliability. Note that the sign of Λ'_i indicates the hard decision.
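For the BPSK/AWGN model of section 2.1 the channel term works out in closed form: expanding the two Gaussian densities centered at ±1, the quadratic terms cancel and L(y_i | x_i) = 2y_i/σ², i.e. L_c = 2/σ². The sketch below (function names are ours, for illustration only) computes Λ'_i under that assumption:

```python
def channel_llr(y, sigma2):
    """L(y_i | x_i) for BPSK (+/-1) over AWGN: expanding
    ln[p(y|x=1)/p(y|x=0)], the quadratic terms cancel, leaving
    2*y/sigma^2; the channel reliability is Lc = 2/sigma^2."""
    return 2.0 * y / sigma2

def soft_decision(y, sigma2, la=0.0):
    """Lambda'_i = Lc(y_i) + La_i; the sign gives the hard decision."""
    lam = channel_llr(y, sigma2) + la
    return lam, (1 if lam > 0 else 0)

# A received value of +0.8 with sigma^2 = 0.5 and no a priori information:
lam, hard = soft_decision(0.8, 0.5)   # Lc * y = 2*0.8/0.5 = 3.2 -> bit 1
```

Note how a strong negative a priori LLR can overturn the channel observation, which is exactly the mechanism iterative decoding exploits.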


Figure 2.3: NSC encoder of rate 1/2

So far we have introduced the equations of soft information based on the received symbol at the input of the decoder, without the aid of the underlying code. Using channel coding in the communication system lets us improve the LLR of the a posteriori probability. This is shown in [3]. The LLR of the a posteriori information at the output of the decoder is:

Λ_i = Λ'_i + Le_i = L_c(y_i) + La_i + Le_i    (2.3)

The term Le_i is known as the extrinsic information, which is the actual improvement on the soft information achieved by the decoder and the decoding process. The extrinsic information is the data fed as a priori information to the other decoder in a concatenated decoding scheme. It is important to remark that all terms in equation 2.3 can be added because they are statistically independent [3]. Statistical independence of the terms is essential to allow iterative decoding, and this is the reason for the interleavers in the concatenation schemes of Turbo encoders and Turbo decoders.

2.3 Convolutional Encoders.

Turbo Code encoders are mainly based on convolutional encoders. In these encoders the output signals are typically generated by convolving the input signal with itself in several different configurations, consequently adding redundancy to the code. Convolutional codes can be either Non-Systematic Convolutional (NSC) codes, when the input word is not among the outputs, or Recursive Systematic Convolutional (RSC) codes, when the input word is one of the outputs [8]. Figure 2.3 illustrates an example of an NSC encoder while figure 2.4 shows an RSC encoder. A set of registers and modulo-two adders can be appreciated in the figures. The connections among those registers and the modulo-two adders determine the output sequence of the encoder. Dividing the number of inputs I by the number of outputs O results in the code rate I/O. The examples cited throughout this work will always use an RSC encoder with rate 1/2.

To define a convolutional encoder we need a set of polynomials which represent the connections among the registers and the modulo-two adders. For an NSC, two code generator polynomials define the encoder of rate 1/2 (see figure 2.3). On the other hand, an RSC encoder is defined by both feedback and generator polynomials (see figure 2.4).

Figure 2.4: RSC encoder of rate 1/2

Figure 2.5: RSC encoder used in the UMTS standard. Pfb = [1011], Pg = [1101]

The status of the set of registers represents the state of the encoder. Input bits m_i make the encoder memory elements change and move into another state while producing the output bits x_i^s, x_i^p, for the case of the RSC encoder. Convolutional encoders are characterized by the constraint length K. An encoder with constraint length K has K − 1 memory elements, which allows the encoder to move through 2^(K−1) states.

RSC encoders are mostly used in Turbo Code schemes rather than NSC encoders, since better BER performance has been achieved with them. For instance, the encoder used in UMTS is the one depicted in Figure 2.5.
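The encoding rule described above can be sketched in a few lines. The following Python function (a hypothetical helper of ours, not the thesis's VHDL design) implements the rate-1/2 RSC encoder of figure 2.4, with Pfb = [111] and Pg = [101] as defaults:

```python
def rsc_encode(bits, pfb=(1, 1, 1), pg=(1, 0, 1)):
    """Rate-1/2 RSC encoder. pfb and pg are the feedback and generator
    polynomials: pfb[0] taps the input, and pfb[1:] / pg[1:] tap the
    K-1 shift-register stages (newest first). Returns the systematic
    and parity bit streams."""
    state = [0] * (len(pfb) - 1)                 # K-1 memory elements
    systematic, parity = [], []
    for m in bits:
        fb = m                                   # recursive feedback bit
        for tap, s in zip(pfb[1:], state):
            fb ^= tap & s
        p = pg[0] & fb                           # parity from fb and state
        for tap, s in zip(pg[1:], state):
            p ^= tap & s
        systematic.append(m)
        parity.append(p)
        state = [fb] + state[:-1]                # shift the register
    return systematic, parity

# The message m = <110> yields the symbols <11 10 00> (see figure 2.6)
sys_bits, par_bits = rsc_encode([1, 1, 0])
```

Tracing a few inputs by hand against the trellis of the next section is a good way to convince oneself that the feedback makes the code recursive while the systematic stream passes through untouched.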

2.4 Trellis Diagrams.

A trellis diagram is a graphical representation of the states of the encoder. It is a powerful tool since it not only allows us to see state transitions, but also their time evolution. The MAP (Maximum a Posteriori) and SOVA (Soft Output Viterbi Algorithm) algorithms used to decode Turbo Codes base their calculations on the trellis branches in order to reduce computation, and this is the reason why we explain trellis diagrams.


Figure 2.6: Trellis example of an RSC encoder with Pfb = [111], Pg = [101]. The input message m = <110...> produces the coded sequence x = <11 10 00 ...>.

Figure 2.6 shows the trellis for the RSC encoder of figure 2.4. The figure also shows an example of an input message and how this input message represents a path in the trellis diagram. This path is colored in blue and is known as the state sequence s.

In order to find the trellis representation of an encoder we follow these steps:

• The trellis will have 2^(K−1) states at each time instant.

• The memory elements of the encoder are set to represent a given state. Usually the first state is 0. Then we want to calculate the connections between the present state and the subsequent states.

– An input bit m_i equal to zero is assumed. Then the output symbol is calculated by operating with the adders and the value of the registers. Also, the next state is calculated by shifting the register inputs at the clock edge. For example, in figure 2.6, we see that at state s0 an input message bit m_i = 0 produces a transition to state s0. In contrast, a bit m_i = 1 produces a transition to state s2.

– An input bit m_i equal to one is assumed. Again, the output symbol is calculated by operating with the adders and the value of the registers, and the next state is calculated by shifting the register inputs at the clock edge. Note that whenever a transition is due to a zero input bit, that transition is drawn as a solid line. In contrast, whenever the transition is due to a one input bit, that transition is drawn as a dashed line.

• Repeat the previous steps with the rest of the states, s1-s3 in the example.

The previous trellis diagram is given by the polynomials and therefore it is the same for all the stages. The encoded message can be thought of as a particular path within the trellis diagram, as shown in the example of figure 2.6.
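The steps above can be tabulated mechanically. This illustrative Python sketch (our own helper, named `build_trellis` here) enumerates, for the default code Pfb = [111], Pg = [101], the next state and output symbol of every branch; states are numbered with the newest register as the most significant bit, matching figure 2.6:

```python
def build_trellis(pfb=(1, 1, 1), pg=(1, 0, 1)):
    """For each of the 2^(K-1) states and each input bit m, compute the
    feedback, the output symbol (m, parity) and the next state of the
    RSC encoder; the resulting table is identical at every stage."""
    k = len(pfb)                                   # constraint length K
    trellis = {}
    for state in range(2 ** (k - 1)):
        # unpack the state integer into register bits, newest first
        regs = [(state >> (k - 2 - i)) & 1 for i in range(k - 1)]
        for m in (0, 1):
            fb = m
            for tap, s in zip(pfb[1:], regs):
                fb ^= tap & s                      # recursive feedback
            p = pg[0] & fb
            for tap, s in zip(pg[1:], regs):
                p ^= tap & s                       # parity output
            nxt = 0
            for b in [fb] + regs[:-1]:             # shifted register
                nxt = (nxt << 1) | b
            trellis[(state, m)] = (nxt, (m, p))
    return trellis

trellis = build_trellis()
# From s0, input 1 moves to s2 with output {1,1}; input 0 stays in s0.
```

Solid versus dashed lines in figure 2.6 correspond to the `m = 0` and `m = 1` entries of this table.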



Figure 2.7: Serial concatenated Turbo encoder


Figure 2.8: Parallel concatenated Turbo encoder. RSC encoder with Pfb = [111], Pg = [101].

2.5 Turbo Codes Encoders.

As we mentioned in 2.3, Turbo Code encoders are mainly based on convolutional encoders. However, Turbo encoders also include one or more interleavers for shuffling data. Figure 2.7 shows a serial concatenated Turbo encoder, while figure 2.8 shows a parallel concatenated Turbo encoder of rate 1/2, which is the one used in our communication system model. Many combinations can be achieved by concatenating different convolutional encoders with interleavers. The purpose of the interleavers is to decorrelate the data streams, so that iterative decoding can take place at the decoder. In figure 2.8 there is a block known as the puncturer, which composes the parity bit of the resulting encoder by selecting one parity bit from each convolutional encoder at a time. If no puncturing were done, the rate of the entire Turbo encoder would be 1/3; the rate of the resulting Turbo encoder can differ from the rate of its component convolutional encoders.
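The parallel concatenation of figure 2.8 can be sketched as follows. This is an illustrative Python fragment (function names `rsc_parity` and `turbo_encode` are ours); it assumes the (7,5) RSC of figure 2.4 for both branches and an alternating puncturing pattern, one plausible choice among several:

```python
def rsc_parity(bits):
    # Minimal RSC of figure 2.4: feedback 1+D+D^2, generator 1+D^2
    s1 = s2 = 0
    parity = []
    for m in bits:
        fb = m ^ s1 ^ s2           # recursive feedback
        parity.append(fb ^ s2)     # generator taps: fb and s2
        s1, s2 = fb, s1            # shift the register
    return parity

def turbo_encode(bits, interleaver):
    """Parallel concatenation: one RSC sees the data in natural order,
    the other sees it interleaved; puncturing alternates between the two
    parity streams, raising the overall rate from 1/3 to 1/2."""
    parity1 = rsc_parity(bits)
    parity2 = rsc_parity([bits[j] for j in interleaver])
    punctured = [parity1[i] if i % 2 == 0 else parity2[i]
                 for i in range(len(bits))]
    return list(bits), punctured   # systematic stream + punctured parity

sys_bits, par_bits = turbo_encode([1, 1, 0, 0], [3, 1, 0, 2])
```

Without the `punctured` selection, transmitting both full parity streams alongside the systematic bits would give the rate-1/3 code mentioned above.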

2.6 Trellis Termination

Before getting into the decoding process, it is important to mention the trellis termination of the convolutional encoders, since it affects the BER performance of the code. The trellis termination is basically the final state that the memory elements of the convolutional encoders adopt when the end of the frame being encoded is reached. Since there is an interleaver between both convolutional encoders, terminating both trellises is not a trivial task [16]. For the purpose of this work, we will choose to terminate the first encoder and leave the second encoder open. Figure 2.9 shows the resulting Turbo encoder. The system works as follows: at the beginning, switch s1 is closed and switch s2 is open. A data frame of size L − 2 is encoded; then switch s1 is opened and s2 is closed, and the remaining two bits are encoded, which leads the first convolutional encoder to state 0. Note that the data frame, in this case L − 2 bits long, together with the remaining two bits, is used to terminate the trellis.

Figure 2.9: Turbo Encoder with trellis termination in one encoder. Pfb = [111], Pg = [101].
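The effect of closing switch s2 can be sketched numerically. In this illustrative Python fragment (our own helper, `terminate_trellis`), the encoder input is taken from its own feedback path, so each tail bit equals s1 XOR s2 for the Pfb = [111] code, forcing the feedback to zero and flushing the K − 1 = 2 registers:

```python
def terminate_trellis(s1, s2):
    """Tail bits that drive the first RSC encoder of figure 2.9
    (Pfb = [111], Pg = [101]) back to state 0. Each tail bit m = s1 ^ s2
    makes the feedback fb = m ^ s1 ^ s2 = 0, so zeros shift in."""
    tail = []
    for _ in range(2):             # K-1 termination steps
        m = s1 ^ s2                # tail bit that cancels the feedback
        tail.append(m)
        s1, s2 = 0, s1             # fb = 0 is shifted into the register
    return tail, (s1, s2)

# Whatever the state after the L-2 data bits, two tail bits reach state 0:
tail, final_state = terminate_trellis(1, 1)
```

Because the code is recursive, simply appending zeros would not terminate the trellis; the tail bits must cancel the feedback, which is exactly what the switch arrangement of figure 2.9 realizes.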


Chapter 3

Decoding Turbo Codes: Soft Output Viterbi Algorithm

In this chapter we will introduce a general scheme of a Turbo decoder for a parallel concatenated code. We will go step by step through the entire decoding process and describe in depth one of the algorithms used in the SISO (Soft Input Soft Output) unit: the SOVA algorithm.

3.1 Turbo Codes decoding process.

In the previous chapter we presented Turbo Codes and the encoding process. Now it is time to talk about the decoding process. Turbo codes are asymmetrical: while the encoding process is relatively easy and straightforward, the decoding process is complex and time-consuming.

The power of Turbo Codes resides in the decoding process, which, unlike other techniques, is performed iteratively. Figure 3.1 shows a general scheme of a turbo decoder. As we can see, the decoding is carried out by two SISO decoders. Signals arriving at the receiver are sampled and processed with the aid of the channel reliability before becoming the soft information "parity info 1,2" and "systematic info" shown in figure 3.1.

Figure 3.1: Turbo Decoder generic scheme.

We can see the output of one SISO decoder becoming the input of the other decoder and vice versa, forming a feedback loop. The name "turbo code" is due to this feedback loop and its resemblance to a turbo engine.

Final decoding is achieved by an iterative process. Soft input information is processed and, as a result, soft output information is obtained. The second decoder takes this soft information as input and produces new soft output information that the first decoder will use as input. This process continues until the system makes a hard decision. The BER obtained improves drastically over the first iterations until it begins to converge asymptotically [3]. A trade-off exists between the decoding delay and the bit error rate achieved. Even though eight iterations are enough to obtain a reasonable BER, decoders do not always run them all; instead, they may check the parity of the message header and then decide whether to keep iterating or not.

Note that between the decoders there is an interleaver or deinterleaver, depending on the data flow. As we mentioned in chapter 2, the interleaver/deinterleaver unit is a key element in turbo coding. This unit reorders the soft information so that a priori data, parity data and systematic data are all time-coherent at the moment of processing.

Figure 3.1 also shows how the soft input information is subtracted from the output, in order to avoid the positive feedback that would degrade the BER performance of the system.
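The iterative exchange of extrinsic information described above can be sketched as a small software model. This is a simplified illustration under stated assumptions: `siso` stands for any SISO decoder returning soft outputs, the subtraction follows the scheme of figure 3.1, and all names are hypothetical:

```python
def turbo_decode(sys_llr, par1, par2, interleave, deinterleave, siso,
                 n_iter=8):
    """Sketch of the iterative loop of figure 3.1 (not the thesis HW)."""
    L = len(sys_llr)
    La = [0.0] * L                       # a priori LLRs, zero at start
    for _ in range(n_iter):
        # First decoder works with systematic data and parity 1.
        Lam = siso(sys_llr, par1, La)
        # Extrinsic info = soft output minus systematic and a priori parts.
        Le = [Lam[i] - sys_llr[i] - La[i] for i in range(L)]
        # Second decoder works in the interleaved domain with parity 2.
        sys_i, La_i = interleave(sys_llr), interleave(Le)
        Lam2 = siso(sys_i, par2, La_i)
        Le2 = [Lam2[i] - sys_i[i] - La_i[i] for i in range(L)]
        La = deinterleave(Le2)           # a priori for the next iteration
    # Hard decision on the sign of the final (deinterleaved) soft output.
    return [1 if v > 0 else 0 for v in deinterleave(Lam2)]
```

With identity interleavers and a toy additive SISO, the loop already produces hard decisions from the sign of the accumulated soft information.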

3.2 SISO Unit: SOVA.

Even though the SOVA algorithm and the MAP algorithm are both trellis based (they take advantage of the trellis diagram to reduce computations), they differ in the final estimation they obtain. MAP performs better at low SNR, and both perform about the same at high SNR. MAP finds the most probable individual symbols of a message, while SOVA finds the most probable sequence of states associated with a path within the trellis. Nevertheless, MAP is computationally much heavier than SOVA.

SOVA stands for Soft Output Viterbi Algorithm. It is, in fact, a modification of the Viterbi Algorithm [7]. We will introduce the Viterbi Algorithm based on the explanation given in [16] and then add the soft output extension. The VA is widely used because it finds the most probable sequence within a trellis, and a trellis diagram can represent any finite-state Markov process.

Recalling our communication model, let s = (s0, s1, . . . , sL) be the sequence we want to estimate and let y be the received sequence of symbols. The VA finds:

    ŝ = arg max_s { P[s | y] }    (3.1)

where y is the noisy set of symbols we have at the decoder after sampling; to be more precise, y is the observation. From Bayes' theorem we have:

    ŝ = arg max_s { P[y | s] P[s] / P[y] }    (3.2)

since P [y] does not change with s, we can rewrite equation 3.2 as:


    ŝ = arg max_s { P[y | s] P[s] }    (3.3)

In order to compute equation 3.3, we could try all sequences s and find the one that maximizes the expression. However, this brute-force search does not scale when the frame size is large.

Since there is a first order Markov process involved, we can take advantage of two ofits properties to simplify the search for s. These properties are:

    P[si+1 | s0 . . . si] = P[si+1 | si]    (3.4)

    P[yi | s] = P[yi | si → si+1]    (3.5)

Equation 3.4 establishes that the probability of the next state does not depend on the entire past sequence; it only depends on the last state. Equation 3.5 states that the observation symbol yi, received through white noise, depends only on the state transition taking place at time i.

Using these properties we can work on 3.3:

    P[y | s] = ∏_{i=0}^{L−1} P[yi | si → si+1],

    P[s] = ∏_{i=0}^{L−1} P[si+1 | si],

    ŝ = arg max_s { ∏_{i=0}^{L−1} P[yi | si → si+1] P[si+1 | si] }    (3.6)

A hardware implementation of an adder requires fewer resources than a hardware implementation of a multiplier. So, if we apply the natural logarithm to 3.6, we can replace multiplications with additions without altering the final result. Thus it yields:

    ŝ = arg max_s { ∑_{i=0}^{L−1} ln P[yi | si → si+1] + ln P[si+1 | si] }    (3.7)

Introducing λ(si → si+1) = ln P[yi | si → si+1] + ln P[si+1 | si], we can rewrite equation 3.7 as:

    ŝ = arg max_s { ∑_{i=0}^{L−1} λ(si → si+1) }    (3.8)

λ (si → si+1) is known as the branch metric associated with transition si → si+1.

The observation yi during state transition si → si+1 is actually the output of the encoder observed through white noise during that transition.

Figure 3.2: Output during state transition for a given trellis.

For our BPSK model this observation is related to the systematic and parity bit pair (figure 3.2). Thus, assuming noise independence, we can express the conditional probability of yi during a state transition as follows:

    P[yi | si → si+1] = P[ysi | usi] P[ypi | upi]    (3.9)

where usi and upi are the systematic and parity bits respectively after BPSK modulation, and

    P[ysi | usi] = 1/(σ√(2π)) · exp[ −(1/2) ((ysi − usi)/σ)² ] dysi,

    P[ypi | upi] = 1/(σ√(2π)) · exp[ −(1/2) ((ypi − upi)/σ)² ] dypi,

since we are dealing with white Gaussian noise of variance σ². In addition, it is more convenient to express P[si+1 | si] in terms of the message bit mi, since state transitions are due to this bit. Then,

P [si+1 | si] = P [mi] (3.10)

This is our a priori probability. For turbo decoding it is easier to work with log-likelihood ratios, so:

    Lai = ln( P[mi = 1] / P[mi = 0] )

    P[mi] = { e^Lai / (1 + e^Lai)   if mi = 1
            { 1 / (1 + e^Lai)       if mi = 0

    ⇒ ln P[mi] = Lai·mi − ln(1 + e^Lai)

It is important to remark that for the first iteration all message bits are assumed to be equally likely, so P[mi = 1] = P[mi = 0] = 0.5 → Lai = 0. For successive iterations, Lai is the extrinsic information provided by the other decoder through the interleaver. Replacing equation 3.9 and the above expression in the branch metric equation, we have:


    λ(si → si+1) = ln( dysi dypi / (2πσ²) ) − (ysi − usi)²/(2σ²) − (ypi − upi)²/(2σ²) + Lai·mi − ln(1 + e^Lai)

    = −(1/(2σ²)) [ (ysi − usi)² + (ypi − upi)² ] + Lai·mi

    = −(1/(2σ²)) [ (ysi² − 2·ysi·usi + usi²) + (ypi² − 2·ypi·upi + upi²) ] + Lai·mi

    = (1/σ²) [ ysi·usi + ypi·upi ] + Lai·mi

Note that, in order to simplify the equations, we have neglected the terms that do not change when varying the sequence s. From chapter 2 we know that σ² = N0/(2Es) and Es = r·Eb, where r = 1/2 is the code rate. So finally we obtain:

    λ(si → si+1) = (Eb/N0) [ ysi·usi + ypi·upi ] + Lai·mi    (3.11)

It is more common to express equation 3.11 as shown below, since the channel reliability is Lc = 4a·Es/N0 (a = 1 for our model),

    λ(si → si+1) = Lc·ysi·xsi + Lc·ypi·xpi + Lai·mi    (3.12)

then 3.8 becomes:

    ŝ = arg max_s { ∑_{i=0}^{L−1} Lc·ysi·xsi + Lc·ypi·xpi + Lai·mi }    (3.13)

where xsi, xpi are the raw bits at the output of the channel encoder before BPSK modulation. Also, mi = xsi for our RSC encoder.

It is important to remark that, according to [11], Lc can be assumed to be equal to 1 for the SOVA algorithm, which means that there is no need to estimate the SNR of the channel. This is possible because at the first iteration Lai = 0, so the resulting extrinsic information is simply weighted by Lc. This extrinsic information becomes Lai for the next SISO decoder, making all the terms in equation 3.13 weighted by the same factor Lc; therefore Lc has no influence on the decoding process. The fact that the SOVA does not need a channel estimate avoids a lot of difficulties and represents a big advantage over the MAP algorithm.

Summarizing, table 3.1 shows the relevant equations for applying the SOVA algorithm.


Element              Equation

Branch Metric        λ(si → si+1) = ysi·xsi + ypi·xpi + Lai·mi    (3.14)

Sequence Estimator   ŝ = arg max_s { ∑_{i=0}^{L−1} ysi·xsi + ypi·xpi + Lai·mi }    (3.15)

Here {xsi, xpi} is the encoder output symbol when the input message bit is mi; {ysi, ypi} is the received symbol when the encoder output symbol is BPSK-modulated and transmitted through an AWGN channel. Finally, Lai represents the LLR of the message bit mi.

Table 3.1: Equations summary.

In the next subsection we will develop an example in order to show how expression3.15 and the trellis diagram are applied in the decoding process.

3.2.1 Viterbi Algorithm Decoding Example

Figure 3.3 shows a trellis diagram example for a code with Pfb = [111], Pg = [101], and illustrates the decoding process.

• As shown in figure 3.3.a, the process begins at time i = 0 from state 0, because that is the state the encoder takes when initialized. Thus, the probability of being at state 0 is one, and the probability of being at any other state is zero. We assign these probabilities, as path metrics in the log domain, to each state:

    pm_{0,0} = 0,    pm_{0,k} = −∞  ∀ k ≠ 0

• Then, the branch metrics are computed at each state for message bits 0 and 1 and the corresponding parity bits.

Figure 3.3: Trellis diagram for the VA; code given by Pfb = [111], Pg = [101]. (a) Computing branch metrics. (b) Surviving branches. (c) Continuing at i = 1. (d) Tracing back from the last state.


    λ(s_{0,0} → s_{1,0}) = (ysi + Lai)·0 + ypi·0
    λ(s_{0,0} → s_{1,2}) = (ysi + Lai)·1 + ypi·1
    λ(s_{0,1} → s_{1,2}) = (ysi + Lai)·0 + ypi·0
    λ(s_{0,1} → s_{1,0}) = (ysi + Lai)·1 + ypi·1
    λ(s_{0,2} → s_{1,3}) = (ysi + Lai)·0 + ypi·1
    λ(s_{0,2} → s_{1,1}) = (ysi + Lai)·1 + ypi·0
    λ(s_{0,3} → s_{1,1}) = (ysi + Lai)·0 + ypi·1
    λ(s_{0,3} → s_{1,3}) = (ysi + Lai)·1 + ypi·0

• The incoming path metrics for each state at time i = 1 are calculated by adding the incoming branch metrics to the corresponding path metrics of the states at time i = 0 (figure 3.3.b).

• For each state at time i = 1, the incoming branch with the greater incoming path metric is kept. The new path metrics of these states are the surviving incoming path metrics:

    pm_{1,0} = max( pm_{0,0} + λ(s_{0,0} → s_{1,0}), pm_{0,1} + λ(s_{0,1} → s_{1,0}) )
    pm_{1,1} = max( pm_{0,3} + λ(s_{0,3} → s_{1,1}), pm_{0,2} + λ(s_{0,2} → s_{1,1}) )
    pm_{1,2} = max( pm_{0,1} + λ(s_{0,1} → s_{1,2}), pm_{0,0} + λ(s_{0,0} → s_{1,2}) )
    pm_{1,3} = max( pm_{0,2} + λ(s_{0,2} → s_{1,3}), pm_{0,3} + λ(s_{0,3} → s_{1,3}) )

In figures 3.3.b and 3.3.c the surviving branches are drawn thicker.

• This procedure is repeated from the second step until time i = L − 1. Note that the final states will be at i = L.

• In order to find ŝ at this point, there are two possibilities: if the encoder was terminated, the system should trace back from the state at which the encoder was terminated (usually state 0) through all surviving linked branches. If the encoder was not terminated, the system should choose the state with the greatest path metric and trace back from there. Each branch within the trellis has an associated message bit mi; the set of those bits is the most probable message. This step is shown in figure 3.3.d, where the survival path is colored in green.
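The steps of the example can be condensed into a short software model. The transition table below follows the trellis of figure 3.3 (code Pfb = [111], Pg = [101]) and the branch metric is that of equation 3.14; this is a didactic sketch, not the hardware implementation:

```python
NEG = float("-inf")
# (next_state, parity bit) for each (state, message bit); trellis of fig. 3.3.
TRELLIS = {0: {0: (0, 0), 1: (2, 1)}, 1: {0: (2, 0), 1: (0, 1)},
           2: {0: (3, 1), 1: (1, 0)}, 3: {0: (1, 1), 1: (3, 0)}}

def viterbi(ys, yp, La=None):
    """Hard-output VA with the branch metric of equation 3.14."""
    L = len(ys)
    La = La or [0.0] * L
    pm = [0.0, NEG, NEG, NEG]           # encoder starts in state 0
    hist = []                           # surviving (prev_state, bit) per state
    for i in range(L):
        new_pm, back = [NEG] * 4, [(0, 0)] * 4
        for s in range(4):
            if pm[s] == NEG:
                continue
            for m, (nxt, p) in TRELLIS[s].items():
                metric = pm[s] + (ys[i] + La[i]) * m + yp[i] * p
                if metric > new_pm[nxt]:
                    new_pm[nxt], back[nxt] = metric, (s, m)
        hist.append(back)
        pm = new_pm
    s = max(range(4), key=lambda k: pm[k])  # unterminated: best final state
    bits = []
    for i in range(L - 1, -1, -1):          # trace back the survival path
        s, m = hist[i][s]
        bits.append(m)
    return bits[::-1]
```

For instance, the message [1, 0, 1] encoded on this trellis yields parities [1, 1, 0]; decoding the noiseless BPSK samples ys = [1, −1, 1], yp = [1, 1, −1] recovers [1, 0, 1].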

3.2.2 Soft Output extension for the VA.

The Viterbi Algorithm is able to find the most probable sequence within the trellis and hence its associated bits. Turbo coding techniques also demand that the SISO unit supply soft output information. There are two well-known extensions of the Viterbi Algorithm that produce soft output [11]. One was proposed by Battail [2] and is known as BR-SOVA. The other was proposed by Hagenauer [7] and is known as HR-SOVA. The latter is used more often than the former, even though BR-SOVA performs better in terms of BER, because HR-SOVA allows an easier hardware implementation. We will explain the HR-SOVA extension and remark on its main idea.


Soft output information represents a measure of the bit reliabilities. As a startingpoint for the algorithm, a reliability ρ of infinity is assumed for every bit in the frame,thus ρi = ∞ ∀ i. The remaining steps proceed as follows:

• As shown in the example of figure 3.4.a, the trace back of the survival path starts at time i = L and state k = 0. The survival path is colored in green, as indicated in the legend of the figure. In order to find the bit reliabilities, the competing path also needs to be traced back from time i = L and state k = 0 to the time instant at which it merges with the survival path. This competing path is colored in orange and, for the example of figure 3.4.a, the time where both paths merge is im = L − 4. The difference between both incoming path metrics at time i and state k also has to be found. In figure 3.4 this value is represented as:

    ∆_{i,k} = | ( pm_{i−1,k′} + λ(s_{i−1,k′} → s_{i,k}) ) − ( pm_{i−1,k′′} + λ(s_{i−1,k′′} → s_{i,k}) ) |    (3.16)

where k is the next state of k′ and k′′ for a message bit mi ∈ {0, 1}, respectively. See figure 3.4.a for references.

• Let j be a new time index in the range im < j ≤ i. At every time instant j, the system compares the message bit of the survival path with the message bit of the competing path. If they differ, the reliability ρj has to be updated according to

    ρj ⇐ min(ρj, ∆_{i,k})    (3.17)

In figure 3.4 a red square is placed on the branches that differ in the message bit. The BR-SOVA also has an updating rule for the case where the message bit of the survival path does not differ from the message bit of the competing path:

    ρj ⇐ min(ρj, ∆_{i,k} + ρj^c)    (3.18)

This is the main difference between HR-SOVA and BR-SOVA. Nevertheless, this updating rule implies knowledge of the bit reliabilities ρj^c of the competing paths [11].

• Once the system reaches the state where the survival path and the competing path merge, it moves one time instant back, from i to i − 1, along the survival path and traces back the competing path at that state once again. This process is shown in figure 3.4.b. In the example, the system now starts at time i = L − 1 and the corresponding state k = 0; the competing path and the survival path now merge at time im = L − 5.

• This algorithm continues from step 2 until time i = 1, thus allowing all the bit reliabilities to be updated. Figure 3.4.c shows one more iteration with the aim of clarifying the process.

• Finally, the soft output information is obtained in terms of LLRs (Log-Likelihood Ratios) as follows:

    Λi = (2mi − 1)·ρi,    0 ≤ i ≤ L − 1    (3.19)

Figure 3.4: Soft output extension example for the Viterbi Algorithm; code given by Pfb = [111], Pg = [101]. (a) Survival path and competing path at time i = L, state k = 0. (b) Survival path and competing path at time i = L−1, state k = 0. (c) Survival path and competing path at time i = L−2, state k = 1.

where mi is the estimated message bit, mi ∈ {0, 1}. Note that (2mi − 1) only gives the sign of Λi; its magnitude is provided by ρi.

After explaining the previous algorithm, it is important to remark on the main idea of the process. At a given time 0 ≤ i ≤ L − 1, the question to ask is: how reliable is the message bit mi? The soft output extension indicates that the correctness of bit mi can only be as good as the decision to choose the most likely path over the "closest" competing path.
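The HR-SOVA updating rule of equation 3.17 can be sketched as a single trace-back step, assuming the survival and competing message bits and the metric difference ∆ have already been obtained (names are ours, not from the thesis):

```python
def hr_sova_update(rho, delta, surv_bits, comp_bits, i, i_merge):
    """One HR-SOVA reliability update (eq. 3.17): walking back from time i
    to the merge point i_merge, lower rho[j] with min(rho[j], delta)
    wherever the competing message bit differs from the survivor's."""
    for j in range(i_merge + 1, i + 1):     # i_merge < j <= i
        if surv_bits[j] != comp_bits[j]:
            rho[j] = min(rho[j], delta)
    return rho
```

Starting from ρ = ∞ everywhere, repeated calls with decreasing i progressively tighten the reliabilities, exactly as in the step-by-step description above.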

3.2.3 Improving the soft output information of the SOVA algorithm.

The soft output generated by the HR-SOVA turns out to be overoptimistic [12]: the HR-SOVA algorithm produces an LLR that is greater in magnitude than the LLR produced by the BR-SOVA or by the MAP algorithm. These overoptimistic LLR values lead the HR-SOVA to worse performance in terms of BER.

In [12], two problems associated with the output of the HR-SOVA are described. One is due to the correlation between extrinsic and intrinsic information when the HR-SOVA is used in a turbo code scheme. The other is due to the fact that the output of the HR-SOVA is biased. The first problem is not easy to solve, and most hardware implementations do not deal with it. In contrast, for the second problem there have been several proposals based on a normalization method. The idea behind a normalization method can be shown by assuming that the output of the HR-SOVA, given a message bit mi, is a random variable with a Gaussian distribution; then:

    P[Λi | mi = 1] = 1/(√(2π)·σΛ) · exp( −(Λi − µΛ)² / (2σΛ²) ) dΛi,    (3.20)

    P[Λi | mi = 0] = 1/(√(2π)·σΛ) · exp( −(Λi + µΛ)² / (2σΛ²) ) dΛi,    (3.21)

where µΛ is the expectation of Λi and σΛ = √( E{Λi²} − µΛ² ) is the standard deviation. In order to find the LLR of the message bit mi given the output of the HR-SOVA, we can define:

    Λ′i = ln( P[mi = 1 | Λi] / P[mi = 0 | Λi] ),    (3.22)

Using Bayes' theorem, assuming P[mi = 1] = P[mi = 0], and working on the previous expression with 3.20 and 3.21 yields:

    Λ′i = 2µΛ·Λi / σΛ²,    (3.23)

which indicates that the HR-SOVA output should be multiplied by the factor c = 2µΛ/σΛ² to obtain the LLR.

The factor c, according to [12], depends on the BER of the decoder output. Some schemes try to estimate the factor c, while others set a fixed value for it. In our hardware implementation we will use a fixed scaling factor, since it has been reported in [10] that the BER performance achieved with a fixed scaling factor is better than with a variable one.
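Both options can be sketched as follows: a rough estimate of c = 2µΛ/σΛ² from the output magnitudes (assuming the symmetric Gaussian model above), and the fixed-factor scaling the implementation uses. The value 0.75 below is an arbitrary illustration, not the value chosen in the thesis:

```python
def scaling_factor(llrs):
    """Rough estimate of c = 2*mu/sigma^2 (eq. 3.23) from the magnitudes
    of the HR-SOVA outputs, assuming a symmetric Gaussian model."""
    mags = [abs(v) for v in llrs]
    mu = sum(mags) / len(mags)
    var = sum(m * m for m in mags) / len(mags) - mu * mu
    return 2.0 * mu / var

def normalize(llrs, c=0.75):
    """Apply a fixed scaling factor (c = 0.75 is only an example value)."""
    return [c * v for v in llrs]
```

A fixed c avoids the extra statistics-gathering hardware that a per-frame estimate would require, which is why the implementation fixes it.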


Chapter 4

Hardware Implementation of a Turbo Decoder based on SOVA

In the previous chapter we introduced the general ideas of a turbo decoder and presented the HR-SOVA algorithm (from now on we will refer to it simply as SOVA) as the active part of the SISO unit. In this chapter we deal with the implementation issues and analyze today's most commonly used hardware architectures. Next, we introduce a new algorithm for finding points of the survival path and present the architecture that implements it. We then describe the unit that updates the bit reliabilities and, finally, present the improvements that allow the decoder to boost its BER performance.

As a general scheme we present figure 4.1. There are two blocks of RAM used as input and output buffers, and two more blocks of RAM used to store temporary data such as a priori and extrinsic information. Then there is a unit that deals with the interleaving process, a unit to control the system and interact with the user, and finally the SISO unit that implements the SOVA algorithm. Note that we only use one SISO unit. This is possible because the interleaver/deinterleaver does not allow concurrent processing, so a frame has to be completed by one decoder before it can be processed by the other. In the proposed architecture, this processing is always done by the same unit.

Data arriving at the receiver is processed and fed into the data-in RAM buffer; then a start command is delivered to the control unit. The states the system goes through are shown in figure 4.2. The system processes the interleaved data first and, at the last iteration, ends up with the deinterleaved data. This is done in order to save an access through the interleaver at the end of the decoding process, which also saves power and allows a simpler control unit. However, the system has to wait until the entire frame is received before decoding can take place.

Even though the same unit is used as decoder 1 and decoder 0, its behavior changes slightly depending on the role the unit is playing. We can summarize the following tasks for each role:

• SOVA unit is acting as decoder 1:


– When the SOVA unit addresses the data-in RAM buffer, its addresses belong to the interleaved domain.

– Since its addresses belong to the interleaved domain, it has to go through the deinterleaver in order to get the systematic data.

– It can address “parity data 2” directly.

– If the first iteration is running, the a priori information is assumed to be 0. Otherwise, it fetches the a priori information through the deinterleaver from RAM La/Le.

– It writes the extrinsic information directly to the RAM Le/La. This entails that, when acting as decoder 0, it has to access the a priori information through the interleaver.

• SOVA unit is acting as decoder 0:

– Its addresses belong to the deinterleaved domain, i.e. the domain where the information bits are in order.

– It can access the systematic data and “parity data 1” directly from the data-in RAM buffer.

– The a priori information is accessed through the interleaver, since each word was written to an address in RAM Le/La that belongs to the interleaved domain.

– It writes extrinsic information directly to the RAM La/Le.

– It writes the hard output directly to the data-out RAM buffer. This can be done at each iteration, allowing the user to check for a frame header, or only when running the last iteration, with the aim of saving power.


Figure 4.1: Hardware implementation of a turbo decoder



Figure 4.2: Overall system states diagram

4.1 Turbo Decoder RAM buffers.

All the RAM buffers are based on dual-port RAMs. Figure 4.3 shows the scheme of the data-in RAM. Since the systematic data and parity data 2 belong to different time domains, two dual-port RAMs are used to store them. Figure 4.4 shows the scheme of the data-out RAM. Finally, figure 4.5 presents the RAM La/Le and the RAM Le/La, which are equivalent.



Figure 4.3: Data-in RAM


Figure 4.4: Data-out RAM


Figure 4.5: RAM La/Le and RAM Le/La connections

4.2 Interleaving/Deinterleaving unit of the turbo decoder

There have been several proposals to design an area-efficient interleaver. In [14], contention-free interleavers that allow concurrent processing are studied. In our case, for the sake of simplicity and versatility, a ROM is used to carry out the interleaving/deinterleaving functions as look-up tables. Figure 4.6 shows the interleaving/deinterleaving unit along with some control signals. The signal named “deco” indicates the role the SOVA unit is playing.

Figure 4.6: Interleaving/Deinterleaving Unit

Note that when working with “deco=1”, the address of the “parity data 2” is delayed one cycle, while the address of the systematic data goes through the deinterleaver. Also, the a priori data is fetched from the RAM La/Le and the extrinsic information is written directly to the RAM Le/La. In contrast, when working with “deco=0”, there is no need to access the “parity data 2” RAM, since the “parity data 1” and the systematic data are stored in the same RAM position. In this case, the a priori information is accessed through the interleaver and the extrinsic information is written directly to the RAM La/Le.
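The look-up-table approach can be modeled in a few lines. Here a random permutation stands in for the ROM contents, which in the real design are fixed; the function name is ours:

```python
import random

def make_luts(L, seed=0):
    """Build matching interleaver/deinterleaver look-up tables, as the
    unit stores them in ROM (a random permutation here, for illustration)."""
    rng = random.Random(seed)
    inter = list(range(L))
    rng.shuffle(inter)                  # inter[a] = interleaved address of a
    deinter = [0] * L
    for a, b in enumerate(inter):
        deinter[b] = a                  # inverse permutation
    return inter, deinter

inter, deinter = make_luts(8)
# Deinterleaving undoes interleaving for every address.
assert all(deinter[inter[a]] == a for a in range(8))
```

Storing both tables trades ROM area for the ability to translate any address in a single look-up, in either direction, as the unit of figure 4.6 requires.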

4.3 SOVA as the core of the SISO.

Before getting into our hardware implementation of the SOVA algorithm, it is important to comment on some of today's most commonly used hardware architectures.

Since the SOVA algorithm is an extension of the Viterbi Algorithm, most of its main units are based on implementations developed for the Viterbi Algorithm. These architectures are complemented with reliability updating units to produce the soft output.

Figure 4.7 shows a comparison between Viterbi decoders and SOVA decoders. Both decoders have a BMU (Branch Metric Unit), an ACSU (Add Compare Select Unit) and an SMU (Survival Memory Unit). However, the SOVA ACSU also has to provide the ∆ difference between path metrics, and the SMU includes an RUU (Reliability Updating Unit) that provides the soft output information. In the next subsections we will discuss the issues related to the SOVA components.



Figure 4.7: Viterbi and SOVA decoder schemes

4.4 Branch Metric Unit.

As its name suggests, this unit calculates the branch metrics. According to equation 3.14, the possible branch metrics depend on the bits xsi, xpi and mi. When working with an RSC encoder of rate 1/2, xsi = mi and there is only one parity bit xpi, which means that there are four possible branch metrics at each time instant i:

• (xsi , xpi) = (0, 0) → λ0 = 0

• (xsi , xpi) = (0, 1) → λ1 = ypi

• (xsi , xpi) = (1, 0) → λ2 = ysi + Lai

• (xsi , xpi) = (1, 1) → λ3 = ysi + Lai + ypi

The BMU for an RSC encoder of rate 1/2 is shown in figure 4.8.
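The four metrics listed above can be written directly as a behavioral sketch of the BMU (an illustration of figure 4.8, not the RTL):

```python
def branch_metrics(ys, yp, La):
    """The four branch metrics of the rate-1/2 BMU:
    lambda_k for (x_s, x_p) = (0,0), (0,1), (1,0), (1,1)."""
    return [0.0,                # (0,0): lambda_0 = 0
            yp,                 # (0,1): lambda_1 = yp
            ys + La,            # (1,0): lambda_2 = ys + La
            ys + La + yp]       # (1,1): lambda_3 = ys + La + yp
```

Only two adders are needed in hardware, since λ3 reuses the sum ys + La computed for λ2, which matches the adder sharing visible in figure 4.8.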



Figure 4.8: BMU for the RSC encoder.

4.5 Add Compare Select Unit.

Applying equation 3.15 to the trellis diagram yields the following expression:

    pm_{i,k} = pm_{i−1,k′} + λ(s_{i−1,k′} → s_{i,k})

where k′ is the predecessor of state k that produces the higher incoming path metric. The previous expression suggests that the path metric pm_{i,k} can be obtained by recursion. In figure 4.9 an ACSU for the SOVA unit is presented.

The set of registers holds the previous path metrics. The branch metrics are mapped to the corresponding adders, according to the outputs during the state transitions, to produce the incoming path metrics. These incoming path metrics are then connected to the selectors, which choose the higher incoming path metric and produce the decision vector along with the ∆ difference between incoming path metrics. The connections between adders and selectors represent the trellis butterfly.
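One add-compare-select operation of the butterfly can be sketched as follows; it returns the surviving metric, the decision bit and the ∆ difference the SOVA needs (a behavioral model, names ours):

```python
def acs(pm0, lam0, pm1, lam1):
    """One Add-Compare-Select step for a state with two incoming branches
    (message bits 0 and 1). Returns (new path metric, decision bit, Delta)."""
    ipm0 = pm0 + lam0           # incoming metric along the bit-0 branch
    ipm1 = pm1 + lam1           # incoming metric along the bit-1 branch
    delta = abs(ipm0 - ipm1)    # Delta difference used by the SOVA RUU
    if ipm0 >= ipm1:
        return ipm0, 0, delta
    return ipm1, 1, delta
```

One such unit per trellis state, iterated once per received symbol, implements the recursion pm_{i,k} = pm_{i−1,k′} + λ.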

One problem that might arise is the overflow of the path metrics after a certain amount of time. Since the relevant information is the difference between path metrics, a normalization method can be adopted. Many normalization methods have been proposed since the introduction of Viterbi decoders. We find the modulo technique reported in [13] to be a good solution, since it actually allows the overflow.

The idea behind the modulo technique is that the maximum difference ∆B between path metrics over all states is bounded. Figure 4.10 shows the mapping of all the numbers representable by the path metric register of nb bits onto a circumference.

Let ipm′_{i,k} and ipm_{i,k} be two incoming path metrics at a given time i and state k. It is shown in [13] that ipm′_{i,k} > ipm_{i,k} if ipm′_{i,k} − ipm_{i,k} > 0 in a two's-complement representation. The number of bits nb relates to the bound as follows:

    C = 2^nb = 2∆B
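The modulo comparison can be sketched with two's-complement wraparound arithmetic. `NB` below is a hypothetical register width, and the comparison is only valid while the true metric difference stays within the bound ∆B:

```python
NB = 8                          # hypothetical path-metric register width
MASK = (1 << NB) - 1            # keep results to NB bits (modulo 2^NB)

def mod_greater(a, b):
    """Compare two wrapped NB-bit path metrics (modulo technique of [13]):
    a > b iff (a - b) mod 2^NB is positive as a signed NB-bit number."""
    d = (a - b) & MASK
    return d != 0 and d < (1 << (NB - 1))
```

For example, a metric that has wrapped past 255 back to 5 still compares as greater than 250, because the wrapped difference (5 − 250) mod 256 = 11 is a small positive signed number.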



Figure 4.9: Add Compare Select Unit for the SOVA. Pfb = [111], Pg = [101]

This means that, even though the path metrics may grow in different ways, they all remain within one half of the representation space provided by C. An appropriate bound is ∆B = 2nB, where n is the minimum number of stages that ensures complete trellis connectivity among all trellis states, and B is the upper bound on the branch metrics [13].


Figure 4.10: Modular representation of the path metrics. Each path metric register has awidth of nb bits.


Figure 4.11: Merging of paths in the traceback.

4.6 Survival Memory Unit.

The remaining SOVA units should obtain the soft output information for every bit in the frame, along with the maximum likelihood path. One way to do so is to store all the data the ACSU provides; then, when the last time instant is reached, the data is traced back and the bit reliabilities are updated according to the SOVA algorithm. However, most hardware architectures do not do it that way, because the latency is high and the amount of memory grows considerably with the frame size, the number of states of the encoder and the quantization width of ∆_{i,k}.

Most SMUs take advantage of a trellis property to solve this problem. This property is illustrated in Figure 4.11, which shows a trellis diagram from a decoding process. If all the paths are traced back from all the states at a given time i, they merge at a time instant i_FP. Therefore, from time instant i_FP down to i = 1, the only path remaining in the traceback started at time i is the survival path. We define the time instant, together with the state, where the paths merge as a FP (Fusion Point). In the example of figure 4.11, for time instant i there is a FP at (i_FP, s3). Simulations have shown that the distance between the time instant i and the FP i_FP is a random variable. It is also observed that the probability of the paths merging increases with the traceback depth and is proportional to the constraint length of the code. A traceback depth of 10 times the constraint length of the code therefore allows the paths to merge with high probability.

Figure 4.12: Register Exchange SMU for the SOVA. Pfb = [111], Pg = [101]

Below we describe the most widely used architectures based on the previous property.

4.6.1 Register Exchange Survival Memory Unit.

The RE (Register Exchange) SMU for an RSC encoder of rate 1/2 is shown in figure 4.12. This scheme is reported in [9]. It is an array of PEs (Processing Elements) of n rows and D columns, where n is the number of states of the encoder and D is the traceback depth. The connection topology between PEs is given by the trellis of the encoder. In figure 4.12 two types of PEs can be distinguished. The first U PEs (red outline), besides tracing back the paths, update the bit reliabilities; figure 4.13.a shows a PE with updating capability, and figure 4.13.b shows a normal PE. The system traces back all the paths from the states at time instant i. The ACSU provides the data that enters the RE from the left, and the first U units update the bit reliabilities of each path according to the SOVA algorithm. Each row of the array holds the information of one path: the first row holds the information of the path traced back from state 0 at time i, the second row the path traced back from state 1 at time i, and so on. After D clock cycles, if D is large enough to allow the paths to merge, the message bit and its reliability are obtained. Note that if the paths merge before D, the data coming out of all rows is the same, since the tails of all the paths belong to the survival path; therefore data from only one row is selected.
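The row-exchange rule described above can be sketched in software. This is an illustrative model only: `prev_state` is a hypothetical 4-state shift-register trellis, not the exact Pfb = [111], Pg = [101] trellis of the figure, and the appended bit is taken to be the decision bit itself, a simplification of the real decoded-bit mapping:

```python
# Minimal register-exchange sketch for a hypothetical 4-state rate-1/2 code.

N_STATES, D = 4, 15   # D = traceback depth (5x constraint length K = 3)

def prev_state(state: int, decision: int) -> int:
    # Hypothetical backward transition: shift the decision bit into the state.
    return ((state << 1) | decision) & (N_STATES - 1)

def register_exchange(decisions):
    """decisions: list over time of per-state decision bits v[i][k]."""
    rows = [[] for _ in range(N_STATES)]          # one survivor sequence per state
    for v in decisions:
        new_rows = []
        for s in range(N_STATES):
            p = prev_state(s, v[s])               # row s copies its predecessor row
            new_rows.append(rows[p][-(D - 1):] + [v[s]])  # exchange + shift in new bit
        rows = new_rows
    return rows

# Once the paths have merged, the oldest bits of all rows agree,
# so any single row can be read out.
```

Feeding identical decision vectors every cycle makes all rows converge immediately, which mirrors the "tails of all paths belong to the survival path" observation in the text.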

Parameters U and D represent a trade-off. Some architectures set U in a range from two to five times the constraint length of the code, while D is set between five and ten times the constraint length. If U and D are large, the BER performance improves, but so do the power consumption and the area. The area increase is also due to the resources spent in the connections, which becomes a serious problem as the number of encoder states grows. If U is large and D is not, resources are wasted, since the BER performance does not improve; the same happens if D is large while U is short, or when both are short. The decoding latency of this scheme is D clock cycles and, as can be observed, the pipelined style of the architecture implies high activity and hence a relatively high dynamic power consumption.

Figure 4.13: Register Exchange processing elements. (a) PE with updating capability; (b) normal PE (traceback only).

4.6.2 Systolic Array Survival Memory Unit.

The RE scheme presents one major problem that leads to high power consumption: all the paths are traced back D steps. The idea behind the SA (Systolic Array) is to trace back only one path; after D steps this path will have merged with the survival path and becomes the path we are looking for. The SA is presented in [15].

Figure 4.14.a introduces the scheme of the SA for an RSC encoder of rate 1/2 with four states. The figure only shows the SMU for the VA. It is composed of an array of elements arranged in n rows and 2D columns, where n is the number of states of the encoder and D is the traceback depth. There is also one additional row, with D TB (traceback) elements, which holds the sequence of states belonging to the survival path. It can be observed that the connections between the elements of the array are much simpler than in the RE scheme.

Figure 4.14: Systolic Array for the Viterbi Algorithm. (a) Systolic Array for the VA; (b) traceback element of the Systolic Array.

The system works as follows: the selection unit feeds the decision bits v_{i,k} provided by the ACSU into the left of the array. After D clock cycles, the SA is half full and the selection unit begins to feed the state s_{i,k} with the highest path metric accumulated in the ACSU registers into the leftmost TB element. The system also works if the selection unit feeds any other state; however, the state with the highest path metric is the most likely to belong to the survival path. Once the most likely state is fed, the TB elements, together with the decision vectors, trace that state back during D more cycles. Figure 4.14.b shows the details of the TB cell. Finally, after 2D cycles, the SU (Survival Unit), shown in figure 4.15, provides the most likely message bit. Note that the latency of this scheme is twice that of the RE scheme, although the traceback depth is only D. This structure also implies high activity and a relatively high dynamic power consumption.
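Per stage, the chain of TB elements reduces to computing the previous state from the current state and that state's decision bit. A sketch of the D traceback cycles, again using the hypothetical 4-state prev-state rule (not the thesis trellis) and assuming, for illustration only, that the LSB of the oldest state is the message bit:

```python
# Sketch of the Systolic Array traceback phase: after D filling cycles,
# a single state is traced back through D stages of decision vectors.

def systolic_traceback(decisions, best_state: int) -> int:
    """decisions: v[i][k] for the last D instants (oldest first).
    Returns the message bit reached after tracing best_state back."""
    state = best_state
    for v in reversed(decisions):                 # one TB element per stage
        state = ((state << 1) | v[state]) & 3     # hypothetical prev-state rule
    return state & 1                              # assumed message-bit mapping

print(systolic_traceback([[0, 0, 0, 0]] * 5, 3))  # 0
```

Total latency is 2D cycles (D of filling plus D of traceback), but only one path is traced, unlike the RE scheme where all n paths move every cycle.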

So far the SA deals with the VA. The SOVA extension of the SA presents some major problems, which were cited in [6]:

• SOVA requires the path metric differences of every state,

• traceback must occur on two paths (survivor and competitor),

• each state must have access to all the information about the path metric differences and decision vectors for that particular time instant.

Figure 4.15: Survival unit for the Systolic Array.

Figure 4.16: Two Step idea. First tracing back, then reliability updating.

These issues make the SA a poor choice for a complete SOVA-based decoder. However, the SA has been used in [17] as a reliability updating unit in a Two Step configuration.

4.6.3 Two Step approach for the Survival Memory Unit.

This scheme was proposed in [9] with the intention of discarding all the operations that do not affect the output. The idea is to postpone the updating process until the survival path is found; figure 4.16 shows this concept. The first D steps find the survival path, while the remaining U steps update the bit reliabilities. A FIFO (First In, First Out) memory is usually employed to delay the path metric differences, along with the decision vectors, until the updating process begins. The SMU we propose in this document is actually a Two Step configuration; however, we introduce a new scheme for finding the survival path.

Figure 4.17: Fusion Points based SMU.

4.6.4 Other Architectures.

Many architectures and schemes have been proposed in recent years. In [4], different SMUs for the VA are studied and compared. In [6], a traceback architecture based on an orthogonal memory is presented. However, all these schemes deal with a finite traceback depth D and a finite updating length U, which leads to a non-optimal execution of the algorithm. In the next subsection we introduce a new architecture for the SOVA algorithm that does not depend on the D-U trade-off.

4.6.5 Fusion Points Survival Memory Unit.

So far, two of the most common schemes have been studied: the RE and the SA. Both carry out a traceback with the aid of a pipeline architecture, whose size has an impact on the area, the power consumption and the BER performance. One of the contributions of this work is a new type of architecture based on a new algorithm, together with the development of the architecture that implements it. The major advantage of this new scheme is that it is independent of the D-U trade-off and allows recursive processing, which lessens the register activity.

The new architecture we propose to implement the SOVA algorithm, as its name suggests, deals with the Fusion Points. Figure 4.17 shows the general scheme. It consists of a FPU (Fusion Point Unit), which finds the time instant and the state where the survival paths merge; it is inside this unit that the new algorithm is implemented. There is a dual-port RAM to store the data the ACSU provides, and finally a RUU that updates the bit reliabilities based on the information provided by the FPU.

The unit works as follows: the data the ACSU provides is stored in the dual-port RAM, and the decision bits v_{i,k} are also used by the FPU to implement the FP search algorithm.

Figure 4.18: Possibility of fusion points.

Whenever a FP is found, it is indicated to the RUU, which updates the bit reliabilities by a traceback method aided by the data fetched through the second port of the dual-port RAM.

4.6.5.1 Fusion Points Unit

This unit finds the Fusion Points along the trellis for a code of rate 1/2 by means of a new algorithm¹. The algorithm is based on the idea that a fusion point for a code of rate 1/2 always resides at the merging point of two paths; figure 4.18 shows these possible fusion points. The following reasoning explains the idea: whenever a traceback operation takes place, the system traces back from a given time instant i; while tracing back, paths merge, at different time instants, in groups of two. The last of these two-path merging points is a Fusion Point. Therefore a FP always resides at the merging point of two paths.

The following steps, together with the example of figure 4.19, introduce part of the algorithm:

• Decision vectors coming from the ACSU are used to identify the merging paths, or possible fusion points (Figure 4.19.a).

• Each possible fusion point is marked. Whenever a mark is set, the mark time and state are held in registers (Figure 4.19.a).

• This mark is propagated along the branches to the next states (Figure 4.19.b).

• The mark is propagated at every clock cycle.

• If a mark propagates to all the states at a given time, then the origin of that mark is a fusion point. The fusion point coordinate is held by the register and can be recalled immediately (Figure 4.19.c).

After introducing the mark movements, figure 4.20 shows a sequence example where more

¹We develop the algorithm for a code of rate 1/2; however, it can be extended to any code rate.

Figure 4.19: Fusion Point detection algorithm. (a) Possible fusion point detection; (b) mark propagation; (c) mark propagation and fusion point detection.

than one mark is handled at the same time. The figure shows two columns: the left column indicates the time instant the system is processing together with the system status, and the right column shows the sequence from time i = 0 to time i = 5. The status is composed of three pointers able to hold the times and states of FPs; the first two pointers hold the possible FPs detected, while the third pointer indicates a confirmed FP.

Figure 4.20: Sequence of the Fusion Point algorithm.

The algorithm proceeds as follows:

• i = 0: a possible fusion point is detected at (0, s0). A green mark is set and propagated to the states (1, s0) and (1, s3). Its coordinate (0, s0) is held in the pointer 0 register.

• i = 1: a possible fusion point is detected at (1, s0). A blue mark is set and propagated to the states (2, s0) and (2, s3). Another possible fusion point is detected at (1, s2); a fuchsia mark is set and propagated to the states (2, s1) and (2, s3). Since the green mark propagates to all the states at i = 2, its origin becomes a fusion point: the fusion point register is set with the data of pointer 0, which holds the coordinate of the green mark, and pointer 0 is freed. A green straight line across all the states at time i = 2 indicates the time the FP is detected. Note that even though the actual time instant is i = 1, the detection line of the FP is at i = 2. Before moving to the next time instant, the coordinates of the blue and fuchsia marks are stored in the pointer 0 and pointer 1 registers respectively. Note that the pointer 0 register was free when the fusion point was detected.

• i = 2: a possible fusion point is detected at (2, s0). A red mark is set and propagated to the states (3, s1) and (3, s3). The fuchsia mark is propagated to the state (3, s2) and its pointer is freed; the reason will be explained later. The blue mark propagates to the states (3, s0), (3, s1) and (3, s3). Before moving to the next time instant, the coordinate of the red mark is stored in pointer 1, since it is the only free pointer available.

• i = 3: a possible fusion point is detected at (3, s0). A yellow mark is set and propagated to the states (4, s0) and (4, s2). The red mark is propagated to the state (4, s3) and its pointer is freed, for the same reason as the fuchsia mark pointer at the previous instant. The blue mark is propagated to (4, s0), (4, s2) and (4, s3). Before moving to the next time instant, the coordinate of the yellow mark is stored in pointer 1, since it is the only free pointer available.

• i = 4: a possible fusion point is detected at (4, s0). A turquoise mark is set and propagated to the states (5, s0) and (5, s2). Another possible fusion point is detected at (4, s2); a brown mark is set and propagated to the states (5, s1) and (5, s3). Both the blue mark and the yellow mark propagate to all the states at i = 5. This means that the origins of the blue mark and of the yellow mark are both fusion points. However, the point we are looking for is the FP closest to the time being processed, which in this case is the origin of the yellow mark at (3, s0). The reason lies in the definition of a FP: if the system traces back from time i = 5, it finds that all paths merge at (3, s0), so (3, s0) is the point where all paths merge in a traceback operation from i = 5. The point (1, s0), corresponding to the origin of the blue mark, belongs to the survival path, but it does not represent a merging point for a traceback operation starting at time i = 5. We can extend this observation as follows: suppose that two marks propagate to the same states. Then, in the future, their propagations will always be the same and they will have the same possibilities of becoming fusion points. However, the FP closest to the time being processed is the true FP, so it is not necessary to propagate and process both marks; the relevant mark is the one whose origin is closest to the time being processed. Therefore, we can enunciate the following rule:

whenever two marks coincide, the one with the latest origin is kept.

Finally, before getting back to the algorithm, it is time to explain why the red mark pointer and the fuchsia mark pointer were freed in the previous steps. We saw that either mark propagated to only one state. If the system keeps propagating such a mark then, in the best case, it will coincide in the future with a possible fusion point, and whenever two marks coincide the one with the latest origin is kept; in that case the mark to be kept is the future possible fusion point. Summarizing, this last rule becomes:

whenever a mark propagates to only one state, it has no chance of becoming a FP in the future, so its pointer can be freed.

Now that we have set out the main ideas and rules, we return to the algorithm. The fusion point register is set with the pointer 1 data. Pointer 1 and pointer 0 are freed, and the coordinates of the turquoise mark and the brown mark are stored in them.

• i = 5: the algorithm is executed, but there are no possible fusion point detections, only mark propagations.

Figure 4.21: FPU architecture for a code with constraint length K = 3.

Figure 4.21 presents a design of the FPU for a code with constraint length K = 3. It consists of a Mark Detection unit, which uses the decision bits v_{i,k} provided by the ACSU to detect possible FPs according to the trellis butterfly. There is a Mark Propagation block, which propagates the new marks and the stored marks along the trellis. There is also a processing unit, which compares all the marks at the input and proceeds as follows:

• if there are two equal marks, then the one with the latest address is kept.

• if there is a mark with only one bit set, then its corresponding register is freed, sinceit has no chance to become a FP in the future.

• if there is a mark with all bits set, then a FP is indicated with its address and state.

Finally there is a set of registers used to hold marks, addresses, and state codes.
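The detection and propagation rules can be sketched as follows. This is a simplified software model: marks are n-bit occupancy vectors keyed by their origin, `next_states` is a hypothetical 4-state connectivity (not the trellis butterfly of the code), and the rule that two coinciding marks keep only the latest origin is left out for brevity:

```python
# Sketch of the FPU mark-propagation rules for a hypothetical 4-state trellis.

N = 4

def next_states(k):
    # Hypothetical forward connectivity (mirror of a shift-register trellis).
    return [k >> 1, (k >> 1) | (N >> 1)]

def step(marks, new_origins):
    """marks: dict origin -> n-bit occupancy vector. Returns (marks, fp)."""
    fp = None
    out = {}
    for origin, occ in marks.items():
        nxt = 0
        for k in range(N):
            if occ >> k & 1:
                for s in next_states(k):
                    nxt |= 1 << s
        if nxt == (1 << N) - 1:            # mark reached all states -> FP found
            fp = origin if fp is None or origin > fp else fp   # latest origin wins
        elif bin(nxt).count("1") > 1:      # single-state marks can never be FPs
            out[origin] = nxt              # (this branch frees their pointer)
    for origin, occ in new_origins.items():   # possible FPs detected this cycle
        out[origin] = occ
    return out, fp
```

For example, a mark occupying states s0 and s3 propagates to all four states in one step and its origin is reported as a FP, which is the "detection line" behavior of the text.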

It is important to point out some major concerns:

• The algorithm can be computed by recursion.

• There are at most n/2 new possible FPs at each time instant, where n is the number of states of the encoder.

• Simulations have shown that, for an RSC encoder of rate 1/2 with n = 2^{K−1} states, the number of registers the FPU needs is:

– n − 2 registers of n + 1 bits to hold marks (the remaining bit indicates whether the register is empty),

– n − 2 registers of K − 1 bits to hold state codes,

– n − 2 registers of A bits to hold addresses, where A is the number of bits used to code the frame size.

• Since the processing unit compares all marks at the same time to see if there are equal marks, the number of XOR gates increases drastically with the constraint length of the code. However, it has been observed that Turbo Code schemes with short-constraint-length encoders have better BER performance than those with large constraint lengths [18].

Comparing our approach with the previous implementations, we obtain the results of table 4.1 for an RSC code of rate 1/2, K = 3 and a message frame size of 1024. For a code with constraint length K = 3, a frame size 2^A = 2^10 = 1024 bits and a traceback depth of D = 5K, the RE SMU needs (5 · 3) · 4 = 60 one-bit registers, while the FPU needs (4 − 2) · (4 + 1) + (4 − 2) · 2 + (4 − 2) · 10 = 34 one-bit registers. Moreover, the FPU will always find the correct FP, while the RE SMU might produce wrong results if the paths do not merge within the traceback pipeline. Another difference is that the RE outputs the symbol sequence of the survival path, while the FPU outputs the sequence of FPs spread along the trellis. However, in a turbo code context, the RUU can take advantage of these FPs, as we show in the next subsection.
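The register counts follow directly from the formulas in the list above; a quick check with K = 3, A = 10 and D = 5K for the RE scheme:

```python
# Quick check of the register-count comparison (K = 3, A = 10, D = 5K).

K, A = 3, 10
n = 2 ** (K - 1)                    # number of encoder states -> 4

re_regs = (5 * K) * n               # RE: D one-bit registers per state
fpu_regs = ((n - 2) * (n + 1)       # mark registers
            + (n - 2) * (K - 1)     # state-code registers
            + (n - 2) * A)          # address registers

print(re_regs, fpu_regs)            # 60 34
```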


Observation         REU                              FPU
One-bit registers   60                               34
Reliability         depends on the traceback depth   optimal
Output rate         one state per clock cycle        random

Table 4.1: Comparison between the REU and the FPU for a code with rate 1/2, K = 3 and a frame size 2^A = 2^10 = 1024.

Figure 4.22: Reliability updating problem.

4.7 Fusion Points based Reliability Updating Unit.

Before getting into the hardware issues it is important to highlight the main problem we face when updating bit reliabilities. Figure 4.22 illustrates one example. While processing data at time instant i = 4, a FP is found at (3, s0); this FP is colored in green. The example shows the survival path and the competing path traced back from the FP until they merge. The blue branches indicate possible future branches of the survival path, while the red paths indicate possible future competing paths. The RUU could start to update bit reliabilities as soon as a FP is detected. However, figure 4.22 shows how the reliability of bits i = 2 and i = 3 might depend on ∆_{4,0} or ∆_{4,2}. The earlier release of those bit reliabilities leads to a non-optimal execution of the SOVA algorithm.

One solution to this problem is illustrated in figure 4.23. The idea is to trace back U steps, to allow all the competing paths that start after time i to merge; after U steps, the remaining bit reliabilities can be released. However, this solution introduces the U factor, which is a trade-off between BER performance and power consumption. It has no impact on the area since, as we show later, bit reliabilities are updated recursively. In any case, the introduction of the U factor leads to a non-optimal execution of the SOVA algorithm.

The solution we adopted is introduced by the example of figure 4.24. By time i, two FPs have been detected. Since the second FP resides after the detection line of the first one, the updating process starts from the second FP. Once the first FP is reached, the system continues updating and releasing the bit reliabilities. The fact that the second FP needs to reside after the detection line of the first one is due to the concept that any path traced back from after the detection line will merge at the FP of that detection line. Therefore any future competing path of the survival path will merge, at the earliest, at the first FP, and will not affect the bit reliabilities before the first FP.

Figure 4.23: One possible solution to the problem of releasing bit reliabilities.

Figure 4.24: Solution adopted for the bit reliability releasing problem.

We can generalize this solution into the following algorithm:

• Wait for the first FP provided by the FPU.

• Wait for the second FP.

• If the second FP is detected after the detection line of the first one, then proceed with the updating process.

• If the second FP is detected before the detection line of the first FP, then wait for one more FP:

– If the third FP resides after the detection line of the second FP, then proceed with the updating process using the information of the second and third FPs.

– If the third FP does not reside after the detection line of the second FP, but does reside after the detection line of the first one, then the updating process proceeds with the information of the first and third FPs.


– If the third FP does not reside after the detection line of either of the other two FPs, then the third FP is discarded and the RUU continues from step 4.

• When the updating process finishes, the last FP becomes the first FP and the process is repeated from step 2.

• If the end of the frame is reached by the ACSU, the RUU is interrupted and begins to update the bit reliabilities from the end.
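The core of the pairing rule (release only when a new FP lies beyond the detection line of the current one) can be sketched as follows. This simplified model drops the three-FP fallback cases and the end-of-frame flush; each FP is assumed to arrive as a (position, detection_line) pair, and `update_between` is a hypothetical callback standing in for the traceback/updating hardware:

```python
# Simplified sketch of the FP-pairing schedule of the RUU.

def ruu_schedule(fps, update_between):
    """fps: iterable of (position, detection_line) pairs from the FPU."""
    first = None
    for fp in fps:
        if first is None:
            first = fp                    # wait for the first FP
            continue
        pos, _line = fp
        if pos > first[1]:                # new FP lies after first's detection line
            update_between(fp, first)     # update from the new FP down to 'first'
            first = fp                    # the last FP becomes the new 'first'
        # otherwise the FP is discarded and we wait for the next one
```

A usage example: with FPs (3,5), (8,10), (9,12), (12,15), the pair (9,12) is skipped because position 9 does not pass detection line 10, while the other FPs trigger updates.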

Figure 4.25.a presents the general scheme of the RUU. A state machine controls the unit and carries out the previous algorithm. The registers at the left of the figure hold FP state codes, FP addresses and FP detection lines, which are used to address the RAM block and to control the updating process. The lastState unit calculates the previous state in the trellis from the current state and the decision bit of that state; this unit actually performs, at each clock cycle, the traceback of the survival path. The current state drives the multiplexers that select the message bit associated with the survival path and the difference ∆ between the metric of the survival path and that of a competing path. These elements are fed into the recursive updating unit, which calculates the reliability magnitudes ρ_i of the bits.

The term Lep_i is stored in the RAM block together with the decision bits v_{i,k} and ∆_{i,k}. This term is equivalent to:

Lep_i = Lc·y^s_i + La_i

and is used to calculate the final extrinsic information Le_i:

Le_i = Λ_i − Lc·y^s_i − La_i = Λ_i − (Lc·y^s_i + La_i) = Λ_i − Lep_i

Lep_i is calculated when y^s_i and La_i are available, at branch metric time, because this saves clock cycles when Le_i is computed. Not doing it at that time would require accessing the data-in RAM buffer and the RAM La/Le-Le/La again; besides, the access would have to go through the interleaving/deinterleaving unit, which might be in use. Le_i is calculated in the following way: the recursive unit outputs ρ_i, which is the magnitude of Λ_i, and the bit m_i gives Λ_i its sign. Since a two's-complement representation is used, m_i indicates whether ρ_i has to be complemented or not:

Le_i = ρ_i + (0 − Lep_i)        if m_i = 1
Le_i = not(ρ_i) + (1 − Lep_i)   if m_i = 0

The operation in parentheses is done first and its result is delayed until ρ_i comes out of the recursive unit. This distributes the combinational delays among the registers. The resulting Le_i is stored in the RAM La/Le or Le/La, depending on the decoder.
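The two cases can be checked with a small two's-complement model; the W = 8 bit width is an assumed value for illustration:

```python
# Two's-complement model of the final Le_i computation (assumed width W = 8).

W = 8
MASK = (1 << W) - 1

def le(rho: int, m: int, lep: int) -> int:
    if m == 1:
        r = rho + ((0 - lep) & MASK)              # Le = rho - Lep
    else:
        r = (~rho & MASK) + ((1 - lep) & MASK)    # Le = -rho - Lep (not + 1 trick)
    r &= MASK
    return r - (1 << W) if r >= 1 << (W - 1) else r   # reinterpret as signed

print(le(20, 1, 5), le(20, 0, 5))   # 15 -25
```

With ρ = 20 and Lep = 5: m = 1 gives Λ = 20 and Le = 20 − 5 = 15; m = 0 gives Λ = −20 and Le = −20 − 5 = −25, matching the case equations above.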

The recursive updating unit is shown in figure 4.26. This unit updates the bit reliabilities by managing all the competing paths at once. In the scheme there is a set of registers holding the different ∆ of each state. These ∆ are propagated to the corresponding

Figure 4.25: Fusion Points based Reliability Updating Unit.

Figure 4.26: Recursive Updating Unit.

Figure 4.27: Recursive Updating Process.

previous states through the pairs of multiplexers and the reverse trellis connection topology, according to the trellis decision vector. The movement of the ∆ is actually the traceback of the competing paths; its similarity to the recursive procedure of the ACSU can be observed. Whenever two competing paths merge, the one with the minimum ∆ is kept. At each stage, the decision bits, together with the estimated message bit, drive the multiplexers that select the relevant ∆, and the minimum among these relevant ∆ is the resulting bit reliability.

To clarify how the recursive unit works, we introduce the example of figure 4.27. The set of registers of figure 4.26 will hold the colored ∆ of figure 4.27. When the updating process is launched, the registers are set to ∆_MAX.


• The unit begins at the time instant i = 10. The orange ∆ is fed into the systemthrough the multiplexer from state 1. A the same time a minimizing process isstarted with this orange ∆ and the remaining ∆ of the registers. The orange ∆ issent to state 3.

• At time i = 9, the blue ∆ is fed into the system through the multiplexer of state 2. The orange ∆ from state 3 and the blue ∆ from state 2 participate in the minimizing process. The blue ∆ is sent to state 1, while the orange ∆ is sent to state 3 again.

• At time i = 8, the fuchsia ∆ is fed into the system through the multiplexer of state 0. Now there are three ∆ participating in the minimizing process. Finally, the orange, blue and fuchsia ∆ are sent to state 2, state 3 and state 0 respectively.

• The remaining steps proceed in the same way. Note that, at time i = 6, in state 2, two competing paths merge. For this example the blue ∆ is assumed to be less than the turquoise ∆, which is why it is kept.
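The register exchange described above can be modelled with a small behavioral sketch (a Python toy, not the VHDL unit; the ∆ values and the per-step routing maps are made up for illustration, and the relevance selection driven by the decision bits is omitted):

```python
import math

DELTA_MAX = math.inf  # registers start "empty", playing the role of the preset to ∆MAX

def recursive_update(events, n_states=4):
    """Toy model of the recursive unit: per-state registers hold the Delta of
    the competing path currently traced through that state.

    events: one (inject, route) pair per backward time step, where inject is
    an optional (state, delta) fed in through that state's multiplexer and
    route maps each state to its previous state (the reverse-trellis
    connection for that step). Returns the minimum Delta seen, i.e. the
    resulting bit reliability.
    """
    regs = [DELTA_MAX] * n_states
    rho = DELTA_MAX
    for inject, route in events:
        if inject is not None:
            state, delta = inject
            regs[state] = min(regs[state], delta)  # a merge keeps the smaller Delta
        rho = min(rho, *regs)                      # the running minimizing process
        nxt = [DELTA_MAX] * n_states
        for state, delta in enumerate(regs):
            prev = route[state]
            nxt[prev] = min(nxt[prev], delta)      # trace back; min on merges
        regs = nxt
    return rho

# First three steps of the figure 4.27 example (the Delta values are made up):
# the orange Delta enters at state 1 and moves to state 3, then the blue Delta
# enters at state 2 and moves to state 1, then the fuchsia Delta enters at state 0.
events = [
    ((1, 2.0),  {0: 0, 1: 3, 2: 2, 3: 3}),
    ((2, 0.75), {0: 0, 1: 1, 2: 1, 3: 3}),
    ((0, 1.5),  {0: 0, 1: 3, 2: 1, 3: 2}),
]
print(recursive_update(events))  # 0.75
```

The point of the sketch is the merge rule: whenever two traced-back paths land on the same previous state, only the smaller ∆ survives, and the reliability is the running minimum over all injected ∆.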

Before moving on to the next section, it is important to discuss some throughput issues. In figure 4.24 we can see that the RUU unit updates over some distance before it can release the final bit reliabilities. If we think of the time distance between fusion points as a random variable with mean D, then the RUU processes 2D time instants for each FP detected by the FPU. This means that the FIFO input data rate will be higher than the FIFO output data rate and the FIFO will get full. If the FIFO gets full, the RUU misses some FPs; however, this is not as bad as it seems, since the algorithm that manages the FPs remains valid.

Let DR denote the number of bits remaining to be updated when the ACSU unit reaches the end of the frame. Then the throughput of the SOVA SISO can be estimated by

THSISO = L / (L + DR) · f  [bps]        (4.1)

where L is the frame size and f is the frequency of the system. It is straightforward that, if we want to increase the throughput of the system, DR should be reduced. This can be achieved by increasing the working frequency of the RUU so that it processes more FPs per time unit and, at the end of the frame, fewer bits remain to be updated.
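Equation 4.1 can be sketched as a small helper (Python, for illustration only; the function name and the example figures are ours, not measured values):

```python
def siso_throughput(frame_len, d_r, f_clk):
    """Estimated SOVA SISO throughput [bps], per equation 4.1.

    frame_len: frame size L in bits; d_r: bits remaining to be updated when
    the ACSU reaches the end of the frame; f_clk: system clock frequency [Hz].
    """
    return frame_len / (frame_len + d_r) * f_clk

# A 1024-bit frame with 256 bits left to update, at a 25 MHz clock:
print(siso_throughput(1024, 256, 25e6))  # 20000000.0, i.e. 20 Mbps
```

With d_r = 0 the SISO reaches the full clock rate, which is why reducing DR (for instance by running the RUU faster) raises the throughput.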

4.8 Control Unit

We finally present the design of the control unit, which is basically a finite state machine that delays and synchronizes modules. Figure 4.28 shows the scheme. There are two counters: one is responsible for the frame address count, and the other for the iteration count. The iteration counter is first loaded with the number of iterations that the user indicates. Figure 4.29 shows the state diagram that the entire system goes through. Once the user drives the go signal high, the system begins to work. It first initializes the units and progressively activates the corresponding modules before settling


[Figure: state machine with two counters: an iteration counter, loaded from the requested number of iterations and compared against zero to raise Iters Finished, and a bit counter compared against the frame length to raise Frame Finished.]

Figure 4.28: Control Unit General Scheme.

[Figure: state diagram with four states: Idle, Initializing Modules, Decoding and Finishing. The go signal leaves Idle; Frame Finished moves from Decoding to Finishing; from Finishing, Iters Finished returns to Idle, while /Iters Finished goes back to Decoding.]

Figure 4.29: Control Unit State Diagram.

down in the decoding state. Once the end of the decoding process is reached, the system checks whether there is an iteration left or not.
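The state diagram of figure 4.29 can be sketched behaviorally as follows (a Python model for illustration; the state strings and keyword names mirror the figure, not the actual VHDL signals):

```python
# Behavioral sketch of the control FSM of figure 4.29.
def next_state(state, go=False, frame_finished=False, iters_finished=False):
    if state == "Idle":
        return "Initializing Modules" if go else "Idle"
    if state == "Initializing Modules":
        return "Decoding"  # modules activated, settle down into decoding
    if state == "Decoding":
        return "Finishing" if frame_finished else "Decoding"
    if state == "Finishing":
        # an iteration left? go back to decoding, otherwise return to idle
        return "Idle" if iters_finished else "Decoding"
    raise ValueError(state)

s = next_state("Idle", go=True)          # "Initializing Modules"
s = next_state(s)                        # "Decoding"
s = next_state(s, frame_finished=True)   # "Finishing"
s = next_state(s, iters_finished=False)  # "Decoding" (another iteration)
```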


4.9 Improvements

The most common implementation of the SOVA decoder only updates bit reliabilities by the HR-SOVA rule that was described in 3.2.2. A BR-SOVA updating rule would be desirable, since it has been proved in [5] that the max-log-map algorithm and the BR-SOVA are equivalent and that the max-log-map algorithm performs better in terms of BER than the HR-SOVA. However, the BR-SOVA updating rule requires knowledge of the bit reliabilities of the competing paths, which implies a higher complexity in the decoder. This is the reason why we do not implement a strict BR-SOVA; instead, we approximate its behavior by introducing a bound for the bit reliability of the competing path, as shown below.

The BR-SOVA updating rule and the HR-SOVA updating rule are the same when the estimated bit and the competing bit are different. In contrast, the following equations recall the updating rules for each algorithm when the estimated bit and the competing bit are equal.

ρj^BR ⇐ min(ρj, ∆i,k + ρj^c)        ρj^HR ⇐ ρj        (4.2)

If we assume ρj^c = ∞, equation 4.2 can be rewritten as:

ρj ⇐ min(ρj, ∆i,k + ρj^c)

That is why we can think of the HR-SOVA as a BR-SOVA with an unbounded ρj^c. The improvement proposed in this work is to bound ρj^c to a known value. When working with an RSC binary code, the two incoming branches, at any state of a trellis diagram, are associated with different message bits. Therefore, the ∆ difference between the path metrics is actually a bound for the reliability of those message bits. The resulting updating rule becomes:

ρj ⇐ min(ρj, ∆i,k)             if mi ≠ ci
ρj ⇐ min(ρj, ∆i,k + ∆j^c)      if mi = ci

where ρj is the reliability of bit j; ∆i,k is the path metric difference between the competing path and the survival path; mi is the estimated message bit; ci is the estimated message bit associated with the competing path; and finally ∆j^c is the path metric difference, at each state at time j, that belongs to the competing path.
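The two rules above can be sketched in one function (a Python illustration with made-up values; the decoder itself is VHDL). Leaving the bound at infinity recovers the HR-SOVA behaviour, while passing the competing path's own metric difference gives the BR-SOVA approximation:

```python
import math

def update_reliability(rho_j, delta_ik, m_i, c_i, delta_c_j=math.inf):
    """One SOVA reliability update step.

    With delta_c_j left at infinity this is exactly the HR-SOVA rule: the
    reliability is untouched when the message bits agree. Passing the
    competing path's own metric difference as delta_c_j gives the BR-SOVA
    approximation described above.
    """
    if m_i != c_i:                           # bits differ: both rules agree
        return min(rho_j, delta_ik)
    return min(rho_j, delta_ik + delta_c_j)  # bits equal: bounded update

print(update_reliability(3.0, 1.0, 0, 1))                 # 1.0 (both rules)
print(update_reliability(3.0, 1.0, 0, 0))                 # 3.0 (HR: no change)
print(update_reliability(3.0, 1.0, 0, 0, delta_c_j=1.5))  # 2.5 (BR approximation)
```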

Figures 4.30 and 4.31 show the modified RUU and the modified Recursive Updating Unit, respectively. They allow the previous rule to be executed. Note that the main difference is the handling of all the ∆, since they represent the bounds for the competing bit reliabilities.


[Figure: modified Reliability Updating Unit datapath; all ∆ need to be available for the recursive update.]

Figure 4.30: Reliability Updating Unit with BR-SOVA approximation


[Figure: modified recursive updating datapath; the ∆i+1,k is the bound for ρi+1^c.]

Figure 4.31: Recursive Update with BR-SOVA approximation


Chapter 5

Methodology

The whole practical design process was carried out with the aid of powerful software tools. Three tools were mainly employed in this thesis:

• Matlab 7.1. The mathematics software package Matlab was extensively used in the simulation and verification of the design. It was employed to model the whole communication system: encoder, channel, receiver and decoder. We also used Matlab for the HIL (Hardware In the Loop) verification of the design, which was carried out by establishing a serial port communication with an interface circuit specifically developed for testing purposes.

• Xilinx ISE 8.2. The Xilinx synthesis software package, ISE 8.2, was used in all the tasks related to the implementation, specifically the mapping, translation, placement and routing, along with the back annotation and the static timing analysis. The FPGA programmer iMPACT is also included in this package; it was used to download our design into the Xilinx Spartan III FPGA.

• ModelSim 6.1. VHDL code and Post-Place and Route models were simulated with this tool.

Figure 5.1 summarizes the work flow. Five stages have taken place, with some feedback between them. On the rightmost part we have the fundamental stages of this process, whereas on the leftmost part the verification tasks associated with each stage are displayed. The blue boxes show the main tool employed in the related task. We now give a description of the stages of the process:

• Information gathering. A considerable number of papers and journal articles were collected. They allowed us to understand the main problem and to focus our main concerns on some aspects of the subject.

• Specification. The specification of this work consisted of the design and implementation of a SOVA-based Turbo Decoder.

• High Level Design. A high level model was programmed using the software tool Matlab 7.1. This model allowed us to try the system in different environments and also to fine tune the design specifications cited in step two.


[Figure: work flow with five stages (Information Gathering, Design Specifications, High Level Design Implementation, VHDL Implementation and VHDL Synthesis) and their associated verification tasks: High Level Design Verification (Matlab), Behavioral Verification (ModelSim), FPGA Post-Place & Route Model Verification (ModelSim) and In-Circuit Verification (Matlab).]

Figure 5.1: Project Work Flow.

• VHDL Implementation. Once we were familiar with all the concepts related to the decoding algorithm, we started to work on the structure of the datapath. It was described in VHDL code and all the combinational modules were verified by appropriate test benches in ModelSim. After the datapath was totally defined, we began to specify the control needs of our system and the way it would communicate with the exterior; subsequently, we gradually defined the whole system.

• VHDL Synthesis. After a functional VHDL model was achieved, the synthesis was carried out. The targeted device was a Spartan 3 X3S200FT256. The system was first verified with a Post-Place and Route model. Later, the FPGA was programmed with the iMPACT tool for In-Circuit verification. Figure 5.2 illustrates the approach employed for this purpose, while figure 5.3 shows the procedure followed. The serial port baud rate was set to 115200 bps.


[Figure: Matlab interface unit connected through an RS232 serial port to the decoder unit on the Spartan 3 X3S200FT256 FPGA.]

Figure 5.2: Hardware-in-the-loop approach

[Figure: source, channel coding, BPSK modulation, discrete AWGN channel, channel decoding and sink. The channel decoding runs on the Spartan 3 X3S200FT256 FPGA; the rest of the chain and the BER calculation run in Matlab.]

Figure 5.3: Hardware-in-the-loop verification procedure


Chapter 6

Measures and Results

The system presented in chapter 4 was described using VHDL (Very High Speed Integrated Circuit Hardware Description Language). A generic and parameterizable VHDL code was written. A VHDL package includes the frame size, quantization scheme, polynomials of the code, and the SOVA algorithm mode (HR-SOVA or BR-SOVA approximation). The system can be configured through this package before the synthesis is performed. The targeted device was a general purpose Xilinx FPGA, the Spartan 3 X3S200FT256.

All the tests have been done for two major polynomial pairs. One is the pair we have been using throughout this work, Pfb = [111], Pg = [101]. The other pair is the UMTS polynomial pair, Pfb = [1011], Pg = [1101]. The size of the data frame has been set to 1024 bits and is the same for all simulations and syntheses. The depth of the RUU FIFO has been set to 16 FPs. We have employed two types of interleavers. One of the interleavers is given in [14] —from now on, MCF. It is described by the following equations:

α(x) = 31x + 64x²  mod 1024,   Deinterleaver;
β(x) = 991x + 64x²  mod 1024,   Interleaver.

The other interleaver was randomly generated —from now on, RAND. Its function is described by a look-up table. The tests with the normalization by fixed scaling factors have not been carried out yet and are left as future work. We will present the results in the following subsections according to their nature.
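The MCF pair can be sanity-checked with a few lines (a Python sketch; the real design evaluates these permutation polynomials in hardware):

```python
# MCF permutation-polynomial pair for a frame of L = 1024 bits.
L = 1024
alpha = [(31 * x + 64 * x * x) % L for x in range(L)]   # deinterleaver
beta = [(991 * x + 64 * x * x) % L for x in range(L)]   # interleaver

# Both maps are permutations of {0, ..., L-1} ...
assert sorted(alpha) == list(range(L))
assert sorted(beta) == list(range(L))
# ... and each undoes the other.
assert all(alpha[beta[x]] == x for x in range(L))
```

The inverse property holds because α(β(x)) reduces to x + 1024x² mod 1024, which is x.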

6.1 Quantization Scheme

The quantization scheme is presented in table 6.1. The same quantization scheme is used in all the tests. This scheme has been adopted from [18].

Element                  Word width : Fractional Part
Received Symbols yi      4:2
Extrinsic Information    7:2
Path Metrics             10:2
∆s                       4:2

Table 6.1: Quantization Scheme Summary



Figure 6.1: ∆ quantization effect on the system BER performance. BR-SOVA approximation scheme. Simulation with quantization. MCF. Pfb = [111], Pg = [101]

The only quantization study that has been carried out is related to the path metric difference ∆, which has a significant impact on the system BER performance. Figure 6.1 shows the BER curve against the received signal SNR. It is observed that, for the current example, the 4:2 scheme is better than the 6:2 and 8:2 schemes. This behavior has been reported in [11] as a method of improving the system BER performance. Since quantization saturates the ∆, the overoptimistic values of the bit reliabilities are lessened and consequently the system BER performance increases. Note that adopting the reduced quantization scheme yields further benefits. First, the RAM that stores the data the ACSU provides is reduced. Furthermore, the logic related to the RUU is also reduced.
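The saturation effect can be illustrated with a small fixed-point model (the exact hardware number format is an assumption of this sketch; ∆ is taken as non-negative and unsigned):

```python
# Illustrative unsigned fixed-point quantizer with saturation for the Delta
# values; "total" bits in all, "frac" of them fractional.
def quantize(value, total=4, frac=2):
    step = 1.0 / (1 << frac)      # resolution: 0.25 for a :2 fractional part
    max_code = (1 << total) - 1   # saturation: the 4:2 scheme tops out at 3.75
    code = min(int(value / step), max_code)
    return code * step

print(quantize(1.3))             # 1.25
print(quantize(10.0))            # 3.75 (an overoptimistic Delta is clipped)
print(quantize(10.0, total=6))   # 10.0 (the 6:2 scheme lets it through)
```

The narrow 4:2 scheme clips large ∆, which is exactly the mechanism that tempers overoptimistic reliabilities in the plot above.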

6.2 Synthesis Results

Tables 6.2 and 6.3 present the synthesis results for the short pair of polynomials and the UMTS polynomials, respectively. Both pairs of polynomials were synthesized with the quantization scheme given in table 6.1.


Observation                                        HR           BRap         Resources
Logic Utilization
  Number of Slice Flip Flops                       720 (18%)    752 (19%)    3840
  Number of 4 input LUTs                           776 (20%)    803 (20%)    3840
Logic Distribution
  Number of occupied Slices                        677 (35%)    674 (35%)    1920
  Number of Slices containing only related logic   677 (100%)   674 (100%)
  Number of Slices containing unrelated logic      0            0
Total Number of 4 input LUTs                       789 (20%)    816 (21%)    3840
  Number used as logic                             776          803
  Number used as a route-thru                      13           13
Number of Block RAMs                               10 (83%)     10 (83%)     12
Number of MULT18X18s                               1 (8%)       1 (8%)       12
Number of GCLKs                                    4 (50%)      4 (50%)      8
Total equivalent gate count for design             671207       671658

Table 6.2: Synthesis Results for Pfb = [111], Pg = [101]

Observation                                        HR           BRap         Resources
Logic Utilization
  Number of Slice Flip Flops                       1045 (27%)   1108 (28%)   3840
  Number of 4 input LUTs                           2067 (53%)   2096 (54%)   3840
Logic Distribution
  Number of occupied Slices                        1256 (65%)   1329 (69%)   1920
  Number of Slices containing only related logic   1256 (100%)  1329 (100%)
  Number of Slices containing unrelated logic      0            0
Total Number of 4 input LUTs                       2082 (54%)   2111 (54%)   3840
  Number used as logic                             2067         2096
  Number used as a route-thru                      15           15
Number of Block RAMs                               11 (91%)     11 (91%)     12
Number of MULT18X18s                               1 (8%)       1 (8%)       12
Number of GCLKs                                    4 (50%)      4 (50%)      8
Total equivalent gate count for design             748769       749432

Table 6.3: Synthesis Results for Pfb = [1011], Pg = [1101]

Note that the BR-SOVA approximation spends almost the same amount of resources as the HR implementation. In contrast, the amount of used resources increases significantly when working with the UMTS polynomials. This is due to the fact that the UMTS encoder has twice the number of states.

Table 6.4 shows the maximum frequencies that the system can attain. When working with the pair of short polynomials, the system can reach up to 85 MHz. The critical path is located in the ACSU unit and is related to the add, compare, select and ∆ quantization delays. On the other hand, when working with the UMTS pair of polynomials, the maximum clock frequency suffers a considerable degradation. This is due to the excessive



Figure 6.2: HR-BRapprox comparison. Infinite precision simulations. MCF interleaver. Pfb = [111], Pg = [101]

combinational logic that the FPU requires for an eight-state code. The optimization of these units should be considered as future work.

Polynomials                          Maximum clock frequency   Critical Path
Short: Pfb = [111], Pg = [101]       85 MHz                    ACSU
UMTS: Pfb = [1011], Pg = [1101]      29 MHz                    FPU

Table 6.4: Maximum clock frequencies

6.3 Bit Error Rate Results

Before getting into the HIL results, we will discuss the BR-SOVA approximation BER performance that is shown in figure 6.2. These results were obtained by simulation with a floating point numeric representation. We observe that for an error probability of 10^−4 the BR-SOVA approximation gains 0.3 dB over the HR-SOVA at the eighth iteration. For an error probability of 10^−5 the BR-SOVA approximation gains only 0.23 dB over the HR-SOVA at the eighth iteration. We also observe that, for higher SNRs, the curves begin to converge and the distance between them gets shorter.

Figure 6.3 exhibits the real system BER performance when implementing the HR-SOVA for the short pair of polynomials. The figure illustrates the comparison between the hardware implemented HR-SOVA and the floating point simulations. Note that the real HR-SOVA performs better. This is due to the ∆ quantization effect that was explained in 6.1.



Figure 6.3: HR-SOVA HIL results. MCF interleaver. Pfb = [111], Pg = [101]

Figure 6.4 shows the real system BER performance when implementing the BR-SOVA approximation for the short pair of polynomials. For low SNRs, we see that the real decoder performs worse than the floating point simulation. For high SNRs the opposite situation is observed. Note that, for the BR-SOVA approximation, the BER performance of the real decoder is about the same as that of the floating point simulation. The ∆ quantization does not improve the BER as much as in the HR implementation.

The comparison between the HR-SOVA implementation and the BR-SOVA approximation implementation is shown in figure 6.5. The figure also shows a partial plot of a quantized max-log-map algorithm with the following quantization scheme:

Element                  Word width : Fractional Part
Received Symbols yi      4:2
Extrinsic Information    7:2
γ                        7:2
α                        9:2
β                        9:2

Table 6.5: Quantization Scheme Summary

We observe that, in the worst case, the HR-SOVA is 0.14 dB from the BR-SOVA approximation, and the latter is only 0.1 dB from the quantized implementation of the max-log-map.

Finally, figures 6.6 and 6.7 show some partial results of the BER performance with the UMTS polynomials and the randomly generated interleaver.



Figure 6.4: BR-SOVA approximation HIL results. MCF interleaver. Pfb = [111], Pg = [101]


Figure 6.5: HR-BRapprox HIL comparison. MCF interleaver. Pfb = [111], Pg = [101]



Figure 6.6: HR-BRapprox comparison. Infinite precision simulations. RAND interleaver. Pfb = [1011], Pg = [1101]


Figure 6.7: BR-SOVA approximation HIL results. RAND interleaver. Pfb = [1011], Pg = [1101]


6.4 Throughput Results

In this section we investigate the effect of running the RUU at higher frequencies and its impact on the system throughput. A DCM (Digital Clock Manager) was used in order to generate the corresponding frequencies.

Figures 6.8, 6.9 and 6.10 show the throughput histogram statistics for the frequency relations fRUU = f, fRUU = 2f and fRUU = 3f, respectively, for the short pair of polynomials. The statistics were generated with 50000 samples. We observe that the throughput increases with the RUU working frequency, as expected.

In a real application context, the system has to guarantee a constant throughput, so it could be set to one of the minimum intervals observed in the histograms. These values are summarized in table 6.6. We can also think of a power saving benefit since the system, according to the figures, will usually work faster than the guaranteed throughput. Therefore, when the system finishes the execution it goes to an idle state until a new set of data arrives; during this idle state no activity is performed in the circuit, which represents an important reduction in the power consumption.

Figures 6.11, 6.12 and 6.13 show the same throughput histogram statistics, but this time for the UMTS pair of polynomials. We observe the same effect as with the short pair. However, we notice a slight difference in the statistics between them. This is due to the frequency of appearance of FPs, which is higher for higher constraint lengths.

Observation   Short Polynomials   UMTS Polynomials
fRUU = f      0.5259f [bps]       0.5270f [bps]
fRUU = 2f     0.8258f [bps]       0.8308f [bps]
fRUU = 3f     0.9543f [bps]       0.9399f [bps]

Table 6.6: Minimum estimated throughput.



Figure 6.8: Throughput statistics. f = 25MHz, fRUU = 25MHz. Pfb = [111], Pg = [101]


Figure 6.9: Throughput statistics. f = 25MHz, fRUU = 50MHz. Pfb = [111], Pg = [101]



Figure 6.10: Throughput statistics. f = 16.66MHz, fRUU = 25MHz. Pfb = [111], Pg = [101]


Figure 6.11: Throughput statistics. f = 25MHz, fRUU = 25MHz. Pfb = [1011], Pg = [1101]



Figure 6.12: Throughput statistics. f = 25MHz, fRUU = 50MHz. Pfb = [1011], Pg = [1101]


Figure 6.13: Throughput statistics. f = 16.66MHz, fRUU = 50MHz. Pfb = [1011], Pg = [1101]


6.5 Power Results

The power consumption has been estimated by simulations. Table 6.7 summarizes the results. The system frequencies were set to f = 25MHz, fRUU = 50MHz. The simulation test bench was carefully designed in order to guarantee a SISO throughput of 0.8f = 20Mbps. This throughput is feasible according to figures 6.9 and 6.12. We observe a dynamic power consumption of (22 − 12) = 10mW for the short pair of polynomials. The dynamic power consumption rises up to (29 − 12) = 17mW when working with the UMTS polynomials. This effect was expected since the area increase is about 50% when jumping from four states to eight. Table 6.7 only shows the power consumption of the BR-SOVA approximation, since the difference between the BR-SOVA approximation scheme and the HR-SOVA scheme is negligible.
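The dynamic figures quoted above are simply the Vccint totals of table 6.7 minus the quiescent Vccint part; as a small sketch of that arithmetic (the function name is ours):

```python
# Dynamic power = total Vccint consumption minus its quiescent part (mW).
VCCINT_QUIESCENT_MW = 12  # quiescent Vccint, from table 6.7

def dynamic_power_mw(vccint_total_mw):
    return vccint_total_mw - VCCINT_QUIESCENT_MW

print(dynamic_power_mw(22))  # 10 mW, short polynomials
print(dynamic_power_mw(29))  # 17 mW, UMTS polynomials
```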

Observation                               Short Polynomials   UMTS Polynomials
Total estimated power consumption [mW]    47                  54
  Vccint 1.20V                            22                  29
  Vccaux 2.50V                            25                  25
  Vcco25 2.50V                            0                   0
  Clocks                                  6                   6
  Inputs                                  1                   1
  Outputs                                 2                   4
  Signals                                 2                   5
  Quiescent Vccint 1.20V                  12                  12
  Quiescent Vccaux 2.50V                  25                  25

Table 6.7: Estimated Power consumption. BRapprox. f = 25MHz, fRUU = 50MHz


Chapter 7

Conclusions and future work

We have designed a complete Turbo Decoder based on the SOVA algorithm. For this purpose we have introduced a new algorithm for performing the SOVA decoding and we have designed the architecture that implements it. The resulting design is not affected by the D-U trade-off and it achieves an optimum SOVA execution. We have also introduced a modification to the previous architecture that approximates the BR-SOVA. The resulting BER of this last scheme is within 0.1 dB of a comparable Max-Log-Map algorithm.

As future work, the following key points are proposed:

• The system throughput is affected by the management of the fusion points. Different schemes should be studied with the aim of improving the resulting throughput. For example, a LIFO memory could be employed instead of a FIFO at the input of the RUU.

• The power consumption of the system could be reduced by properly selecting the FPs that launch the reliability updating process. This way, a long updating-without-releasing process can be avoided.

• The critical path of the system, for the UMTS polynomials, resides inside the FPU. Optimization strategies should be analyzed in order to reduce the combinational delays.

• A non-optimum SOVA execution should be adopted by properly reducing and managing the RAM buffer that is used to store the data the ACSU units provide. Taking into account other implementations, this memory could be reduced by more than 50% without major BER performance degradation.

• A complete BR-SOVA should be carefully studied for implementation. This could probably be achieved by replicating the recursive updating unit: one of these units traces back and updates the survival path, while the others do the same with the competing paths.


Bibliography

[1] Sorin Adrian Barbulescu. What a wonderful turbo world ... E-book, 2004.

[2] G. Battail. Ponderation des symboles decodes par l'algorithme de Viterbi. Ann. Telecommun., 42:31–38, January 1987.

[3] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes. Proceedings of the IEEE International Conference on Communications, Geneva, Switzerland, May 1993.

[4] Gennady Feygin and P. G. Gulak. Architectural Tradeoffs for Survivor Sequence Memory Management in Viterbi Decoders. IEEE Transactions on Communications, 41:425–429, March 1993.

[5] Marc P. C. Fossorier, Frank Burkert, Shu Lin, and Joachim Hagenauer. On the Equivalence Between SOVA and Max-Log-Map Decoding. IEEE Communications Letters, 2(5), May 1998.

[6] David Garrett and Mircea Stan. Low Power Architecture of the Soft-Output Viterbi Algorithm. Proceedings of the 1998 International Symposium on Low Power Electronics and Design, pages 262–267, August 1998.

[7] Joachim Hagenauer and Peter Hoeher. A Viterbi Algorithm with Soft-Decision Outputs and its Applications. Proc. IEEE GLOBECOM, 3:1680–1686, November 1989.

[8] Pablo Ituero Herrero. Implementation of an ASIP for Turbo Decoding. Master's thesis, KTH, May 2005.

[9] Olaf Joeressen, Martin Vaupel, and Heinrich Meyr. High-Speed VLSI Architectures for Soft-Output Viterbi Decoding. Proc. IEEE ICASAP'92, Oakland, California, pages 373–384, August 1992.

[10] D. W. Kim, T. W. Kwon, J. R. Choi, and J. J. Kong. A modified two-step SOVA-based turbo decoder with a fixed scaling factor. Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS 2000), Geneva, 4:37–40, May 2000.

[11] Lang Lin and Roger S. Cheng. Improvements in SOVA-Based Decoding for Turbo Codes. Proceedings of the 1997 IEEE International Conference on Communications (ICC '97), Montreal, 3:1473–1478, June 1997.


[12] Lutz Papke and Patrick Robertson. Improved Decoding with the SOVA in a Parallel Concatenated (Turbo-code) Scheme. IEEE International Conference on Communications, Conference Record, 1:102–106, June 1996.

[13] C. B. Shung, P. H. Siegel, G. Ungerboeck, and H. K. Thapar. VLSI Architectures for Metric Normalization in the Viterbi Algorithm. Proceedings of the IEEE International Conference on Communications (ICC '90), 4:1723–1728, April 1990.

[14] Oscar Y. Takeshita. On Maximum Contention-Free Interleavers and Permutation Polynomials over Integer Rings. Submitted as a correspondence to the IEEE Transactions on Information Theory, April 2005.

[15] T. K. Truong, Ming-Tang Shih, Irving S. Reed, and E. H. Satorius. A VLSI Design for a Trace-Back Viterbi Decoder. IEEE Transactions on Communications, 40:616–624, March 1992.

[16] Matthew C. Valenti. Iterative Detection and Decoding for Wireless Communications. PhD thesis, Virginia Polytechnic Institute and State University, July 1999.

[17] Yan Wang, Chi-Ying Tsui, and Roger S. Cheng. A Low Power VLSI Architecture of SOVA-based Turbo-code Decoder Using Scarce State Transition Scheme. IEEE International Symposium on Circuits and Systems, Geneva, Switzerland, May 2000.

[18] Zhongfeng Wang. High Performance, Low Complexity VLSI Design of Turbo Decoders. PhD thesis, University of Minnesota, September 2000.