
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Hardware Implementation and Assessment of a Soft MIMO Detector Based On SUMIS

Master's thesis carried out in Communication Systems at the Institute of Technology, Linköping University

by

Tomas Frostensson

LiTH-ISY-EX--13/4664--SE

Linköping 2013

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet, SE-581 83 Linköping, Sweden


Handledare / Supervisors: Mirsad Čirkić, ISY, Linköpings universitet
                          Sven Holmquist, Synective Labs

Examinator / Examiner: Daniel Persson, ISY, Linköpings universitet

Linköping, 20 May 2013

Avdelning, Institution / Division, Department:
Division of Communication Systems
Department of Electrical Engineering
SE-581 83 Linköping

Datum / Date: 2013-05-20

Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete
URL för elektronisk version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-92627

ISBN: —
ISRN: LiTH-ISY-EX--13/4664--SE
Serietitel och serienummer / Title of series, numbering: —
ISSN: —

Titel / Title: Hardware Implementation and Assessment of a Soft MIMO Detector Based On SUMIS

Författare / Author: Tomas Frostensson


Nyckelord / Keywords: FPGA, MIMO, soft detection, SUMIS

Abstract

To allow faster and more reliable wireless communication, a technique is to use multiple antennas in the transmitter and receiver. This technique is called MIMO. The usage of MIMO adds complexity to the receiver, which must determine what the transmitter actually sent. This thesis focuses on a hardware implementation, suitable for an FPGA, of a detection algorithm called SUMIS.

A background to detection, and SUMIS in particular, is given as a theoretical aid for a better understanding of how an algorithm like this can be implemented. An introduction to hardware and digital design is also presented.

A subset of the operations in the SUMIS algorithm, such as matrix inversion and sum of logarithmic values, are analyzed, and suitable hardware architectures are presented. These operations are implemented in RTL hardware using VHDL, targeted for a Virtex-6 FPGA from Xilinx.

The accuracy of the implemented operations is investigated, showing promising results, alongside a presentation of the necessary resource usage.

Finally, other approaches to hardware implementation of detection algorithms are discussed, and more suitable approaches for a future implementation of SUMIS are commented on. The key aspects are flexibility through software reprogrammability and area efficiency by designing a custom processor architecture.


Acknowledgments

I would like to thank my examiner Daniel Persson and my supervisor Mirsad Čirkić at ISY for examining and providing feedback during this master's thesis. It has been interesting to hear about the problems associated with the subject from another point of view rather than just my own.

I would like to acknowledge everyone at Synective Labs in Gothenburg for the friendly atmosphere and the possibility for discussions. I also appreciate the feedback from my opponent Emelie Nilsson, which led to a better report.

Gothenburg, May 2013
Tomas Frostensson


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Goal
  1.3 Limitations
  1.4 Outline

2 Theory
  2.1 MIMO
  2.2 Detection
      2.2.1 Soft Detection
  2.3 SUMIS
      2.3.1 First Stage
      2.3.2 Second Stage
      2.3.3 Complexity Selection
  2.4 Number Representation
  2.5 Hardware Introduction
  2.6 Programmable Hardware
      2.6.1 Hardware Flow
      2.6.2 Reusable Modules

3 Problem Analysis
  3.1 Overview
  3.2 Matrix multiplication
  3.3 Matrix Inversion
      3.3.1 LDLT Decomposition
      3.3.2 Reciprocal
      3.3.3 Forward Substitution
      3.3.4 Final Steps
  3.4 Log Sum of Exponentials

4 Methodology and Equipment
  4.1 Modeling
  4.2 VHDL
  4.3 RTL
  4.4 Hardware

5 Implementation
  5.1 Overview
  5.2 Matrix Multiplication
      5.2.1 IP Block Trade-offs
      5.2.2 Interface
      5.2.3 Example Implementation
  5.3 Matrix Inversion
      5.3.1 LDLT Decomposition
      5.3.2 Reciprocal Unit
      5.3.3 Forward Substitution
  5.4 Jacobi Logarithm

6 Result and Analysis
  6.1 Testing and Measurements
      6.1.1 Matrix Multiplication
      6.1.2 LDLT Decomposition
      6.1.3 Forward Substitution
      6.1.4 Jacobi Logarithm
  6.2 Resource Usage
      6.2.1 Matrix Multiplication
      6.2.2 Matrix Inversion
      6.2.3 Jacobi Logarithm
  6.3 Remaining Work
      6.3.1 Hyperbolic Tangent
      6.3.2 Exponential Function
      6.3.3 Additional Matrix Operations
      6.3.4 Control Structure
  6.4 Improvements
      6.4.1 Hardware Time-Multiplexing and Control
      6.4.2 Wordlength Optimization or Floating Point Implementation
      6.4.3 Design Space Exploration using High Level Synthesis
  6.5 Alternative Approaches and Comparison
  6.6 Insights from Alternative Approaches
      6.6.1 Number Representation
      6.6.2 Processor Architecture
      6.6.3 Flexibility
      6.6.4 Integration
  6.7 Final Conclusions

Bibliography

Notation

Number sets

  Notation   Meaning
  R          Set of real numbers
  C          Set of complex numbers

Abbreviations

  Abbreviation   Meaning
  ASIC           Application-Specific Integrated Circuit
  BRAM           Block RAM
  CORDIC         Coordinate Rotation Digital Computer
  FFT            Fast Fourier Transform
  FPGA           Field Programmable Gate Array
  HDL            Hardware Description Language
  IEEE           Institute of Electrical and Electronics Engineers
  IP             Intellectual Property
  JTAG           Joint Test Action Group
  LLR            Log-Likelihood Ratio
  LUT            Lookup Table
  MAC            Multiply and Accumulate
  MIMO           Multiple-Input and Multiple-Output
  OFDM           Orthogonal Frequency-Division Multiplexing
  QAM            Quadrature Amplitude Modulation
  RAM            Random Access Memory
  RTL            Register Transfer Level
  SIMD           Single Instruction Multiple Data
  SNR            Signal-to-Noise Ratio
  SUMIS          Subspace Marginalization with Interference Suppression
  VHDL           VHSIC Hardware Description Language
  VHSIC          Very High Speed Integrated Circuit

1 Introduction

One technique to improve wireless communication reliability, as well as performance, is to use multiple antennas in the transmitter and receiver; this technique is called MIMO.

Unfortunately, this technique adds increased complexity to the receiver, since the receiver has to determine what was actually sent given the overlapping input from multiple antennas. Since this is a complex problem, efficient methods must be developed to cope with this complexity, given the strict real-time demands of a communication system.

1.1 Background

The main area of this thesis is the implementation aspect of detection algorithms in the receiver used in a MIMO system.

The background for this thesis is a detection algorithm described in the conference paper [Čirkić and Larsson, 2012] and in more detail in the longer article [Čirkić and Larsson, 2012]. These papers present a detection algorithm called SUMIS (subspace marginalization with interference suppression), which has shown promising results compared to other detection algorithms with a lower complexity.

The high-level description of the mathematics involved in the detection, given in the mentioned papers, does not disclose how this could efficiently be implemented in hardware for use in a real wireless system. Therefore this thesis will examine the implementation aspects of the proposed algorithm.


1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be discussed to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real-time implementation in contemporary and future wireless systems.

Implementation-wise, it should serve as a proof of concept, with discussion about possible improvements, rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and limit the work load associated with this thesis to a reasonable amount. The number of antennas supported is considered constant, and the modulation is chosen as 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation; these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The actual hardware implementation is presented in Chapter 5, where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

Figure 2.1: A MIMO system using N_t transmit antennas (T_1, ..., T_{N_t}) and N_r receive antennas (R_1, ..., R_{N_r}).

A real-valued MIMO channel can be seen as

    y = Hs + e    (2.1)


where H ∈ R^{N_r × N_t}. The matrix H denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver; therefore it contains N_r × N_t elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector s ∈ S^{N_t} contains the modulated symbols that the transmitter will try to send, where S is the set containing the possible symbols. The vector e ∈ R^{N_r} is the noise vector, e ∼ N(0, (N_0/2)I), containing additive Gaussian noise with zero mean and N_0/2 variance. Finally, y ∈ R^{N_r} is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. It is more common with a complex channel, but as described in [Larsson and Jaldén, 2008], every complex channel, given a few prerequisites, can be posed as a real model. This is straightforward since C^n is isomorphic to R^{2n}. A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex-valued model.

2.2 Detection

The principle of detection in MIMO systems is to determine s given y, as described in Equation 2.1. The channel matrix H is assumed to be known to the receiver, and in practice it often is, through estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of s without additional information, while soft detectors provide both an estimate of s and probability information for each bit in the symbols in s. This means that the detector provides information on how accurate the estimated s is on bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. This ratio tells how many times more likely one case is than the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign will show the hard detection, thus whether the detector detected a zero or a one, while the magnitude of the ratio will tell how accurate this detection is. The log-likelihood ratio is

    l(s_i | y) = log [ Σ_{∀s ∈ s : s_i = 1} exp(−(1/N_0) ||y − Hs||^2) / Σ_{∀s ∈ s : s_i = 0} exp(−(1/N_0) ||y − Hs||^2) ]    (2.2)

given that the symbols are uniformly distributed, thus that it is equally probable that a zero or a one is being sent.

The sums in Equation 2.2 are over the sets {s : s_i = x}, which means all possible vectors s where the i-th bit is x = 0 or x = 1, respectively.

The computational effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols of the constellation and exponentially with the number of transmit antennas N_t. If |S| is the number of possible symbols s can contain, the complexity of the calculation will be proportional to |S|^{N_t}. This is the big limitation when it comes to MIMO detectors: with the constellation size growing, as well as the number of antennas, the computational effort becomes impractical to deal with.
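To make this growth concrete, the following Python sketch evaluates Equation 2.2 exactly by brute force, under the simplifying assumption of one bit per real symbol (S = {−1, +1}); the function and variable names are illustrative, not taken from the thesis. Even in this smallest case the loop runs |S|^{N_t} = 2^{N_t} times, and larger constellations make it far worse.

    import itertools
    import numpy as np

    def exact_llrs(y, H, N0, symbols=(-1.0, 1.0)):
        """Exact LLRs per Equation 2.2, brute force over |S|^Nt candidates."""
        Nt = H.shape[1]
        num = np.full(Nt, -np.inf)   # log of numerator sums (bit = 1)
        den = np.full(Nt, -np.inf)   # log of denominator sums (bit = 0)
        for cand in itertools.product(symbols, repeat=Nt):
            s = np.asarray(cand)
            metric = -np.linalg.norm(y - H @ s) ** 2 / N0
            for i in range(Nt):
                if s[i] > 0:
                    num[i] = np.logaddexp(num[i], metric)
                else:
                    den[i] = np.logaddexp(den[i], metric)
        return num - den   # l(s_i | y) for every bit i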

Numerous methods to deal with this complexity by introducing approximations exist, such as the sphere decoding in [Chu and McAllister, 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jaldén, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

    y = Hs + H̄s̄ + e    (2.3)

The partitioning can be used to group together H̄s̄ + e and treat it as interference and noise.

The partition in Equation 2.3 depends on the parameter n_s ∈ {1, ..., N_t}, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: H ∈ R^{N_r × n_s}, H̄ ∈ R^{N_r × (N_t − n_s)}, s ∈ S^{n_s} and finally s̄ ∈ S^{N_t − n_s}.

The partitioning must be chosen so that the interesting bit s_i is contained in s. To be able to cover all of the available bits, it is necessary to have N_t different partitions, so that there is at least one partition that contains each interesting bit.


If n_s = 1 it is easy to choose a partition for bit s_i, since there exists only one, but for n_s > 1 it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product H^T H. The goal is to minimize the impact of H̄s̄ + e on the selected columns that will be contained in H. This is achieved by selecting the column in H^T H that contains the interesting bit, alongside the n_s − 1 columns that contain the largest values intersecting the chosen column. This leaves the remaining columns to H̄, and the impact is minimized.
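A rough NumPy sketch of this selection heuristic (the function name and the exact tie-breaking are illustrative assumptions, not taken from the papers):

    import numpy as np

    def select_partition(H, i, ns):
        """Choose ns columns for H (the rest go to H-bar), based on row i of H^T H."""
        G = H.T @ H
        row = np.abs(G[i]).astype(float)
        row[i] = np.inf                       # force column i to be included
        chosen = np.argsort(row)[::-1][:ns]   # column i plus the ns-1 strongest
        return np.sort(chosen)

    # Columns not in `chosen` form the interference part H-bar of Equation 2.3.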

2.3.1 First Stage

Given Equation 2.3, it is possible to choose an approximate model

    y ≈ Hs + n    (2.4)

where n ∼ N(0, Q) and Q = H̄H̄^T + (N_0/2)I.

The key point of Equation 2.4 is that the computations can be simplified by assuming that the interference from H̄s̄ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

    λ_k = log [ Σ_{∀s ∈ s : s_k = 1} exp(−(1/2) (y − Hs)^T Q^{-1} (y − Hs)) / Σ_{∀s ∈ s : s_k = 0} exp(−(1/2) (y − Hs)^T Q^{-1} (y − Hs)) ]    (2.5)

followed by

    E{s_k | y} = tanh(λ_k / 2)    (2.6)

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector s̄. The first step is defining a new model to suppress this vector, and this model is

    y' ≈ Hs + n'    (2.7)

where n' ∼ N(0, Q') and Q' = H̄ΦH̄^T + (N_0/2)I. The matrix Φ is the conditional covariance matrix of s̄ and is described as

    Φ = E{S̄^2 | y} − E{S̄ | y}^2    (2.8)

In Equation 2.8 the matrix S̄ is a diagonal matrix with the diagonal consisting of the elements from s̄. With all of these computations performed, the model can be assumed to be purified, and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning n_s dimensions instead of the original N_t dimensions. This computation is performed for each bit and is described by

    l(s_i | y) ≈ log [ Σ_{∀s ∈ s : s_i = 1} exp(−(1/2) (y' − Hs)^T Q'^{-1} (y' − Hs)) / Σ_{∀s ∈ s : s_i = 0} exp(−(1/2) (y' − Hs)^T Q'^{-1} (y' − Hs)) ]    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, n_s is the complexity parameter of the algorithm and can be assumed to be much smaller than N_t. With n_s = N_t the benefits of SUMIS are nonexistent, since the partition H equals the full channel matrix H and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations to minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^{-1} can be computed for all of the partitions by inverting a larger matrix of dimension N_t, followed by smaller inverses of dimension n_s.

2.4 Number Representation

Throughout the thesis, a fixed-point number representation is used for the hardware implementation. A fixed-point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

    X = −x_{N−1} · 2^{N−1} + Σ_{i=0}^{N−2} x_i · 2^i    (2.10)

which denotes the value of a number X represented by N bits x_{N−1}, ..., x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range −2^{N−1} ≤ X ≤ 2^{N−1} − 1 can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

    X = 2^{−f} · (−x_{N−1} · 2^{N−1} + Σ_{i=0}^{N−2} x_i · 2^i)    (2.11)

which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^{−f}, which means that f bits have been allocated for the fractional part, and the remaining N − f bits represent the integer part and sign.

The number can be in the range −2^{N−1−f} ≤ X ≤ 2^{N−1−f} − 2^{−f}, in steps of 2^{−f}. One big difference compared to a floating-point representation is that the resolution is constant over the whole number range.
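As a small numeric illustration of Equations 2.10 and 2.11, the following Python sketch stores a real number as an N-bit two's complement word with f fractional bits and interprets it back; the function names are illustrative:

    def to_fixed(x, N, f):
        """Quantize x to an N-bit two's complement word with f fractional bits."""
        w = round(x * 2 ** f)                       # apply the 2^f scaling factor
        lo, hi = -2 ** (N - 1), 2 ** (N - 1) - 1    # representable integer range
        return max(lo, min(hi, w))                  # saturate instead of wrapping

    def from_fixed(w, f):
        """Value represented by the stored word w, i.e. Equation 2.11."""
        return w * 2 ** (-f)

    # sfixed(2 downto -15), as used in Chapter 5, corresponds to N = 18, f = 15:
    w = to_fixed(0.6931, 18, 15)
    print(w, from_fixed(w, 15))   # resolution is a constant 2^-15 across the range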

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform Boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output depends only on the provided input. Given the ability to express Boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are stateless, because of the lack of memory. Sequential logic, on the other hand, groups combinatorial circuits together with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop only changes its output value on the rising edge of the clock; otherwise it retains the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value remains regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value is stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address becomes available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory, by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, Boolean functions are represented by their truth tables. Each truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the Boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any Boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement Boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are sufficient for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is, to verify the behaviour, even if not everything that can be simulated can be transformed into hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. Such blocks can be anything from a simple counter to a complete processor and can be seen, in analogy to the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand, and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, described in Chapter 3.1, that are needed for the implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm; both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

    AB = C    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the


real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

    for i = 1 → M do
        for j = 1 → N do
            sum = 0
            for k = 1 → L do
                sum = sum + A[i][k] * B[k][j]
            end for
            C[i][j] = sum
        end for
    end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced, since the result is symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed-form formula for calculating the inverse does not exist.

Common ways to calculate the inverse of a larger matrix use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^{-1}, of an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, in order to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

    Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudocode for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q, and the output matrix is L, along with the vector d, which is the diagonal of D.

    v = zeros(N, 1)
    d = zeros(N, 1)
    L = zeros(N, N)
    for i = 1 → N do
        sum = 0
        for j = 1 → i − 1 do
            v[j] = L[i][j] * d[j]
            sum = sum + L[i][j] * v[j]
        end for
        v[i] = d[i] = Q[i][i] − sum
        rec = 1 / v[i]
        for j = i + 1 → N do
            sum = 0
            for k = 1 → i − 1 do
                sum = sum + L[j][k] * v[k]
            end for
            L[j][i] = (Q[j][i] − sum) * rec
        end for
    end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.


3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

    f(x) = 1/x − d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

    x_{i+1} = x_i − f(x_i) / f′(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

    x_{i+1} = x_i(2 − d * x_i) = 2x_i − d * x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 used for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
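In Python, the table-seeded iteration could be modeled as follows, assuming the input is already scaled to 0.5 ≤ d < 1 (the hardware scaling is described in Chapter 5.3.2); the table size and iteration count are illustrative choices:

    def reciprocal(d, table_bits=6, iterations=2):
        """Approximate 1/d for 0.5 <= d < 1 using Equation 3.5."""
        # Seed lookup: index by the first table_bits bits after the leading one,
        # storing 1/midpoint of each sub-interval (precomputed in hardware).
        index = int((d - 0.5) * 2 ** (table_bits + 1))
        midpoint = 0.5 + (index + 0.5) / 2 ** (table_bits + 1)
        x = 1.0 / midpoint
        for _ in range(iterations):
            x = x * (2.0 - d * x)   # Newton-Raphson step; precision roughly doubles
        return x

    print(reciprocal(0.75))   # ~1.333333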

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

    L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the i-th column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

    for i = 1 → N do
        for j = 1 → N do
            sum = 0
            for k = 1 → j − 1 do
                sum = sum + L[j][k] * x[k][i]
            end for
            x[j][i] = (e[j][i] − sum) / L[j][j]
        end for
    end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, is a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

    for i = 1 → N do
        x[i][i] = 1
        for j = i + 1 → N do
            sum = L[j][i]
            for k = i + 1 → j − 1 do
                sum = sum + L[j][k] * x[k][i]
            end for
            x[j][i] = −sum
        end for
    end for

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

    Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.
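The whole inversion chain of Chapter 3.3 can be prototyped in a few lines of NumPy and verified against a library inverse; this is a sketch of the data flow, not of the hardware:

    import numpy as np

    def ldlt(Q):
        """Algorithm 3.2: Q = L D L^T with unit lower triangular L."""
        N = Q.shape[0]
        L, d = np.eye(N), np.zeros(N)
        for i in range(N):
            v = L[i, :i] * d[:i]
            d[i] = Q[i, i] - L[i, :i] @ v
            L[i + 1:, i] = (Q[i + 1:, i] - L[i + 1:, :i] @ v) / d[i]
        return L, d

    def invert_spd(Q):
        """Equation 3.7: Q^{-1} = L^{-T} D^{-1} L^{-1}."""
        L, d = ldlt(Q)
        Linv = np.linalg.inv(L)   # the hardware uses forward substitution (Alg. 3.4)
        return Linv.T @ np.diag(1.0 / d) @ Linv

    A = np.random.randn(8, 8)
    Q = A @ A.T + 8 * np.eye(8)   # symmetric positive definite test input
    print(np.allclose(invert_spd(Q), np.linalg.inv(Q)))   # True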

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.


    Operation    Log space
    log(a * b)   log(a) + log(b)
    log(a / b)   log(a) − log(b)
    log(a^b)     b * log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

    log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored; instead their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including an exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

    log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} * (1 + e^{−|log(a) − log(b)|}))
                                 = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two and adding to it the remaining logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
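In Python, Equation 3.9 is a one-liner; math.log1p supplies the bounded correction term, which is exactly what goes into the table in a hardware implementation:

    import math

    def jacobi_log(log_a, log_b):
        """log(a + b) computed from log(a) and log(b) via Equation 3.9."""
        big, small = max(log_a, log_b), min(log_a, log_b)
        return big + math.log1p(math.exp(-(big - small)))   # correction in (0, log 2]

    print(jacobi_log(math.log(0.25), math.log(0.5)), math.log(0.75))   # both -0.2877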

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed-point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed-point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point integers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics, instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by HiTech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18-bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

    Name of resource     Number of resource units
    Slice                37,680
    Block RAM (36 Kb)    416
    DSP48E1              768
    PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), denoting A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc, 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc, 2011a].

For each of the data inputs there are three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

The example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order; thus it maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
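For an 8×8 row-major matrix this address LUT is just a transposition of the linear index, so its contents could be generated as in the following short sketch:

    ROWS = COLS = 8

    # Counter value i (row-order address) maps to the column-order address:
    lut = [(i % COLS) * ROWS + i // COLS for i in range(ROWS * COLS)]
    print(lut[:9], lut[-2:])   # [0, 8, 16, 24, 32, 40, 48, 56, 1] ... [55, 63]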

Everything in the implementation is controlled by a control FSM, which contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimal amount of control logic, by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i, of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously, while also being able to write an individual element. This can be achieved using a dual-port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of a number of smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDLT unit.

The input and output ports are described in Table 5.1.

    Name       Dir   Type                            Comment
    clk        in    std_logic                       Input clock
    rst_n      in    std_logic                       Reset, active low
    start      in    std_logic                       Start computation
    addr_in    in    std_logic_vector(5 downto 0)    Input address
    data_in    in    sfixed(5 downto -12)            Data input
    we         in    std_logic                       Write enable
    ready      out   std_logic                       Ready for input
    done       out   std_logic                       Computation done
    addr_out   in    std_logic_vector(5 downto 0)    Output address
    L_data     out   sfixed(2 downto -15)            L matrix output
    D_data     out   sfixed(2 downto -15)            D^{-1} matrix output

Table 5.1: Input and output ports of the LDLT decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant one bit of the input number must reside in position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps, until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps, to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
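A bit-level Python model of this normalize-and-index step; the wordlengths are illustrative assumptions (a 12-bit fraction and a 6-bit table index), not the values used in the actual unit:

    FRAC = 12   # fractional bits of the input word (illustrative)
    IDX = 6     # width of the table index (illustrative)

    def scale_and_index(w):
        """w is the raw input word (w > 0); its value is w * 2**-FRAC.
        Returns (shift, index) for the lookup described above."""
        msb = w.bit_length() - 1            # position of the leading one
        shift = (FRAC - 1) - msb            # left shifts until the MSB sits at bit -1
        scaled = w << shift if shift >= 0 else w >> -shift
        # Drop the always-set MSB and keep the next IDX bits as table index:
        index = (scaled - (1 << (FRAC - 1))) >> (FRAC - 1 - IDX)
        return shift, index

    # d = 0.75: no shift needed, index 32 (the middle of the table).
    print(scale_and_index(int(0.75 * 2 ** FRAC)))
    # The approximated reciprocal must afterwards be shifted by `shift` steps
    # as well, to undo the normalization, as described in the text above.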

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency, as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

    Name     Dir   Type                   Comment
    clk      in    std_logic              Input clock
    load     in    std_logic              Load new d
    d        in    ufixed(5 downto -12)   d input
    result   out   ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

    c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

[Figure 5.6: Block diagram of the multiply-and-accumulate module: a Mult feeding an Add/Sub whose output is stored in a register, with a mux that selects 0 to clear the accumulator.]

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware cost and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.
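As an illustration of what the control words have to sequence, this Python sketch solves L y = b by forward substitution using a single MAC unit performing c = c − a·b. The unit diagonal of L (from the LDL^T factorization) is an assumption here, and the comments map each step to the control signals in Table 5.3.

def forward_substitution(L, b):
    """Solve L y = b for unit lower-triangular L with one MAC unit."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        c = b[i]                      # clear + load via the input mux ('sel', 'clr')
        for j in range(i):
            # one control word per step: fetch L[i][j] (L_x/L_y) and
            # y[j] (X_x/X_y), then accumulate
            c = c - L[i][j] * y[j]    # MAC operation: c = c - a*b
        y[i] = c                      # write into the X memory (W_x/W_y, 'we')
    return y

# Computing the columns of L^-1 amounts to running this with b set to
# each unit vector, which is one way an output matrix X can be produced.
L = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.25, -0.75, 1.0]]
print(forward_substitution(L, [1.0, 0.0, 0.0]))  # first column of L^-1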


[Figure 5.7: Block diagram of the forward substitution unit: a control counter addresses the control memory, which drives the input BRAM (L), the output BRAM (X), the input mux, and the MAC unit.]

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^(−x))    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(−x)). A graph of this function can be seen in Figure 5.8.

[Figure 5.8: The function log(1 + e^(−x)) on the interval 0 ≤ x < 8.]

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^(−8).
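A behavioural Python model of the unit, under the wordlength choices above (2048-entry table, 3 integer and 8 fractional index bits) and with the sign-bit-controlled mux written as a comparison:

import math

STEP = 2.0 ** -8                       # table step size
TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(8 * 2 ** 8)]

def jacobi_log(log_a, log_b):
    """Compute log(a + b) from log(a) and log(b): max + log(1 + e^-x)."""
    diff = log_a - log_b
    larger = log_b if diff < 0 else log_a   # mux controlled by the sign bit
    x = abs(diff)
    idx = int(x / STEP)                     # quantize to 8 fractional bits
    if idx >= len(TABLE):                   # saturate: correction ~ 0 for x >= 8
        return larger
    return larger + TABLE[idx]

# log(e^1 + e^2) = 2.3133...; the model should agree within table precision
print(jacobi_log(1.0, 2.0))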

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

[Figure 5.9: Block diagram of the Jacobi logarithm unit: a Sub computes log(a) − log(b), whose sign bit (MSB) controls the selection mux, while the absolute value indexes the lookup table whose output is added to the selected term.]

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation: both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.
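The mapping used in the following sections between a measured maximum error and "accuracy in fractional bits" is simply the negative base-2 logarithm of the error, as this small sketch illustrates:

import math

def effective_fractional_bits(max_abs_error):
    # An error of 2^-b corresponds to b accurate fractional bits,
    # so b = -log2(error).
    return -math.log2(max_abs_error)

print(effective_fractional_bits(0.0002))   # ~12.3, i.e. about 12 bits
print(effective_fractional_bits(0.001))    # ~10.0 bits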

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^(−8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource           Used   Total    Percentage
Flip-flops         3024   301440   1.0 %
LUTs               1459   150720   1.0 %
Block RAM (36 Kb)  10     416      2.4 %
DSP48E1            8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not quite as optimized in an FPGA as a regular adder.
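As a rough illustration of that step, the sketch below rounds and saturates a wide two's-complement value to a narrower format; the split of 22 discarded LSBs for the 40-to-18-bit narrowing is an assumption for the example, not the implementation's exact alignment.

def round_saturate(value, drop_bits, out_bits):
    """Round a wide two's-complement integer to out_bits by discarding
    drop_bits LSBs with round-to-nearest, saturating on overflow."""
    rounded = (value + (1 << (drop_bits - 1))) >> drop_bits  # round half up
    hi = (1 << (out_bits - 1)) - 1      # largest representable value
    lo = -(1 << (out_bits - 1))         # smallest representable value
    return max(lo, min(hi, rounded))

# e.g. narrowing a 40-bit product to 18 bits by dropping 22 LSBs (assumed)
print(round_saturate(0x12_3456_789A, 22, 18))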

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used   Total    Percentage
Flip-flops         831    301440   < 1 %
LUTs               1802   150720   1.2 %
Block RAM (36 Kb)  9      416      2.2 %
DSP48E1            19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource           Used   Total    Percentage
Flip-flops         30     301440   < 1 %
LUTs               124    150720   < 1 %
Block RAM (36 Kb)  2      416      < 1 %
DSP48E1            1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used   Total    Percentage
Flip-flops         180    301440   < 1 %
LUTs               156    150720   < 1 %
Block RAM (36 Kb)  1      416      < 1 %
DSP48E1            0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the widest signal in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
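A minimal floating-point sketch of hyperbolic CORDIC in rotation mode, assuming the classic repeated iterations at i = 4 and i = 13 for convergence (inputs up to about |x| ≈ 1.1). Since tanh is the ratio sinh/cosh, the constant CORDIC gain cancels and needs no compensation:

import math

def tanh_cordic(z, iters=16):
    """Rotation-mode hyperbolic CORDIC: drives z to 0 while x, y converge
    to K*cosh(z0) and K*sinh(z0); the gain K cancels in y/x = tanh(z0)."""
    seq = []
    i = 1
    while len(seq) < iters:
        seq.append(i)
        if i in (4, 13):        # these indices must be executed twice
            seq.append(i)
        i += 1
    x, y = 1.0, 0.0
    for i in seq:
        sigma = 1.0 if z >= 0 else -1.0          # rotation direction
        x, y, z = (x + sigma * y * 2.0 ** -i,    # shifts and adds in hardware
                   y + sigma * x * 2.0 ** -i,
                   z - sigma * math.atanh(2.0 ** -i))
    return y / x

print(tanh_cordic(0.5), math.tanh(0.5))   # agree to about four decimals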

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · (1/ln(2))) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) · 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table with y − floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be further investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed-form expressions for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(-1) = (1/(ad − bc)) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
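For completeness, a direct transcription of Equation 6.5 (a hypothetical helper, not part of the implementation):

def inv2x2(a, b, c, d):
    """Invert [[a, b], [c, d]] via the closed-form expression in Eq. 6.5."""
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    r = 1.0 / det                     # one reciprocal, four multiplications
    return [[d * r, -b * r], [-c * r, a * r]]

print(inv2x2(2.0, 1.0, 1.0, 1.0))    # [[1.0, -1.0], [-1.0, 2.0]]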

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit that implements the Jacobi logarithm, and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be accommodated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided among eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions and instead uses QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, making a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individually interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321, Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för icke-kommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson

Page 2: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

Hardware Implementation and Assessment of a SoftMIMO Detector Based On SUMIS

Examensarbete utfoumlrt i Kommunikationssystemvid Tekniska houmlgskolan vid Linkoumlpings universitet

av

Tomas Frostensson

LiTH-ISY-EX--134664--SE

Handledare Mirsad Čirkićisy Linkoumlpings universitet

Sven HolmquistSynective Labs

Examinator Daniel Perssonisy Linkoumlpings universitet

Linkoumlping 20 maj 2013

Avdelning InstitutionDivision Department

Division of Communication SystemsDepartment of Electrical EngineeringSE-581 83 Linkoumlping

DatumDate

2013-05-20

SpraringkLanguage

SvenskaSwedish

EngelskaEnglish

RapporttypReport category

Licentiatavhandling

Examensarbete

C-uppsats

D-uppsats

Oumlvrig rapport

URL foumlr elektronisk version

httpurnkbseresolveurn=urnnbnseliudiva-92627

ISBN

mdash

ISRN

LiTH-ISY-EX--134664--SE

Serietitel och serienummerTitle of series numbering

ISSN

mdash

TitelTitle Hardware Implementation and Assessment of a Soft MIMO Detector Based On SUMIS

FoumlrfattareAuthor

Tomas Frostensson

SammanfattningAbstract

To allow faster and more reliable wireless communication a technique is to use multipleantennas in the transmitter and receiver This technique is called MIMO The usage of MIMOadds complexity to the receiver that must determine what the transmitter actually sent Thisthesis focuses on hardware implementation suitable for an FPGA of a detection algorithmcalled SUMIS

A background to detection and SUMIS in particular is given as a theoretical aid for a bet-ter understanding of how an algorithm like this can be implemented An introduction tohardware and digital design is also presented

A subset of the operations in the SUMIS algorithm such as matrix inversion and sum oflogarithmic values are analyzed and suitable hardware architectures are presented Theseoperations are implemented in RTL hardware using VHDL targeted for an FPGA Virtex-6from Xilinx

The accuracy of the implemented operations is investigated showing promising resultsalongside of a presentation of the necessary resource usage

Finally other approaches to hardware implementation of detection algorithms are discussedand more suitable approaches for a future implementation of SUMIS are commented onThe key aspects are flexibility through software reprogrammability and area efficiency bydesigning a custom processor architecture

NyckelordKeywords FPGA MIMO soft detection SUMIS

Abstract

To allow faster and more reliable wireless communication a technique is to usemultiple antennas in the transmitter and receiver This technique is called MIMOThe usage of MIMO adds complexity to the receiver that must determine whatthe transmitter actually sent This thesis focuses on hardware implementationsuitable for an FPGA of a detection algorithm called SUMIS

A background to detection and SUMIS in particular is given as a theoretical aidfor a better understanding of how an algorithm like this can be implemented Anintroduction to hardware and digital design is also presented

A subset of the operations in the SUMIS algorithm such as matrix inversion andsum of logarithmic values are analyzed and suitable hardware architectures arepresented These operations are implemented in RTL hardware using VHDL tar-geted for an FPGA Virtex-6 from Xilinx

The accuracy of the implemented operations is investigated showing promisingresults alongside of a presentation of the necessary resource usage

Finally other approaches to hardware implementation of detection algorithmsare discussed and more suitable approaches for a future implementation of SUMISare commented on The key aspects are flexibility through software reprogramma-bility and area efficiency by designing a custom processor architecture

iii

Acknowledgments

I would like to thank my examiner Daniel Persson and my supervisor MirsadČirkić at ISY for examining and providing feedback during this masterrsquos thesisIt has been interesting to hear about the problems associated with the subjectfrom another point of view rather than just my own

I would like to acknowledge everyone at Synective Labs in Gothenburg for thefriendly atmosphere and the possibility for discussions I also appreciate thefeedback from my opponent Emelie Nilsson which led to a better report

Gothenburg May 2013Tomas Frostensson

v

Contents

Notation ix

1 Introduction 111 Background 112 Goal 213 Limitations 214 Outline 2

2 Theory 321 MIMO 322 Detection 4

221 Soft Detection 423 SUMIS 5

231 First Stage 6232 Second Stage 6233 Complexity Selection 7

24 Number Representation 725 Hardware Introduction 826 Programmable Hardware 9

261 Hardware Flow 9262 Reusable Modules 10

3 Problem Analysis 1131 Overview 1132 Matrix multiplication 1133 Matrix Inversion 12

331 LDLT Decomposition 12332 Reciprocal 14333 Forward Substitution 14334 Final Steps 16

34 Log Sum of Exponentials 16

4 Methodology and Equipment 19

vii

viii CONTENTS

41 Modeling 1942 VHDL 1943 RTL 2044 Hardware 20

5 Implementation 2351 Overview 2352 Matrix Multiplication 24

521 IP Block Trade-offs 24522 Interface 24523 Example Implementation 24

53 Matrix Inversion 26531 LDLT Decomposition 26532 Reciprocal Unit 28533 Forward Substitution 30

54 Jacobi Logarithm 33

6 Result and Analysis 3561 Testing and Measurements 35

611 Matrix Multiplication 35612 LDLT Decomposition 36613 Forward Substitution 36614 Jacobi Logarithm 36

62 Resource Usage 36621 Matrix Multiplication 36622 Matrix Inversion 37623 Jacobi Logarithm 38

63 Remaining Work 38631 Hyperbolic Tangent 38632 Exponential Function 39633 Additional Matrix Operations 39634 Control Structure 40

64 Improvements 40641 Hardware Time-Multiplexing and Control 40642 Wordlength Optimization or Floating Point Implementation 40643 Design Space Exploration using High Level Synthesis 41

65 Alternative Approaches and Comparison 4166 Insights from Alternative Approaches 42

661 Number Representation 42662 Processor Architecture 43663 Flexibility 43664 Integration 43

67 Final Conclusions 44

Bibliography 45

Notation

Number sets

Notation Meaning

R Set of real numbersC Set of complex numbers

Abbreviations

Abbreviation Meaning

ASIC Application-Specific Integrated CircuitBRAM Block RAM

CORDIC Coordinate Rotation Digital ComputerFFT Fast Fourier Transform

FPGA Field Programmable Gate ArrayHDL Hardware Description LanguageIEEE Institute of Electrical and Electronics Engineers

IP Intellectual PropertyJTAG Joint Test Action GroupLLR Log-Likelihood RatioLUT Lookup TableMAC Multiply and Accumulate

MIMO Multiple-Input and Multiple-OutputOFDM Orthogonal Frequency-Division MultiplexingQAM Quadrature Amplitude ModulationRAM Random Access MemoryRTL Register Transfer Level

SIMD Single Instruction Multiple DataSNR Signal-to-Noise Ratio

SUMIS Subspace Marginalization with Interference SuppressionVHDL VHSIC Hardware Description LanguageVHSIC Very High Speed Integrated Circuit

ix

1Introduction

One technique to improve wireless communication reliability as well as perfor-mance is to use multiple antennas in the transmitter and receiver and this tech-nique is called MIMO

Unfortunately this technique adds increased complexity to the receiver since thereceiver has to determine what was actually sent given the overlapping inputfrom multiple antennas Since this is a complex problem efficient methods mustbe developed to cope with this complexity given strict real time demands from acommunication system

11 Background

The main area of this thesis is the implementation aspect of detection algorithmsin the receiver used in a MIMO system

The background for this thesis is a detection algorithm described in the con-ference paper [Čirkić and Larsson 2012] and more detailed in the longer ar-ticle [Čirkić and Larsson 2012] These papers presents a detection algorithmcalled SUMIS (subspace marginalization with interference suppression) whichhas shown promising results compared to other detection algorithms with a lowercomplexity

The given high level description in the mentioned papers of the mathematicsinvolved in the detection does not disclose how this could efficiently be imple-mented in hardware for use in a real wireless system Therefore this thesis willexamine the implementation aspects of the proposed algorithm

1

2 1 Introduction

12 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures forthe implementation of a soft MIMO detector based on the SUMIS algorithm onan FPGA

The selected operations described in Chapter 3 of the SUMIS algorithm will beimplemented in hardware and discussed The implementation aspects of the al-gorithm will be discussed to see what must be taken into consideration whenimplementing such a detection algorithm

The algorithm will be evaluated to determine how suitable this algorithm is forreal time implementation in contemporary and future wireless systems

Implementation-wise it should serve as a proof of concept with discussion aboutpossible improvements rather than providing a solution ready for production

13 Limitations

Limitations have been made to reduce the complexity and limit the work loadassociated with this thesis to a reasonable amount The number of antennas sup-ported is considered constant and also the modulation chosen as 16-QAM sinceit affects the size of the numbers involved

The main limitation is that only a subset of the operations involved in the SUMISalgorithm has been considered for hardware implementation and these are de-scribed in Chapter 3

14 Outline

The thesis is divided in several chapters Chapter 2 describes the backgroundtheory that is useful for the understanding of the succeeding chapters

The selected problems that must be solved are described in Chapter 3 with ac-companying algorithms and possible solutions to the problems The hardwarethat was utilized and the methodology used for the implementation is describedin Chapter 4

The step of actual hardware implementation is presented in Chapter 5 where theindividual modules are described

Finally the results of the implementation measurements and comparisons withother implementations can be seen in Chapter 6 The chapter also contains dis-cussions about future work and implementation aspects of the SUMIS algorithm

2Theory

This chapter describes the background theory that is necessary to comprehendother sections of this thesis

21 MIMO

A MIMO communication system is a communication system that uses multipleantennas for transmission as well as for reception A basic setup of a MIMOsystem can be seen in Figure 21

R1

R2

RNr

Receiver

T1

T2

TNt

Transm

itter

Figure 21 A MIMO system using Nt transmit and Nr receive antennas

A real valued MIMO channel can be seen as

y = Hs + e (21)

3

4 2 Theory

where H isin RNrtimesNt The matrix H denotes the channel matrix Each entry of

the matrix is a possible path from the transmitter to the receiver Therefore itcontains Nr times Nt elements which are all the possible paths from the transmittingantennas to the receiving antennas The vector s isin SNt contains the modulatedsymbols that the transmitter will try to send where S is the set containing thepossible symbols The vector e isin RNr is the noise vector e sim N (0 N0

2 I) containingadditive Gaussian noise with zero mean and N0

2 variance Finally y isin RNr is the

vector with the received symbols as seen by the receiver

As mentioned before the MIMO channel described in Equation 21 is real valuedIt is more common with a complex channel but as described in [Larsson andJalden 2008] every complex channel given a few prerequisites can be posed as areal model This is straightforward since C

n is isomorphic to R2n A real model

is used since it simplifies the explanation of the SUMIS algorithm and this modelcan easily be derived from a complex valued model

22 Detection

The principle of detection in MIMO systems is to determine s given y describedin Equation 21 The channel matrix H is assumed to be known to the receiverand is often so in practice by estimation

Detection can be divided in two subcategories hard detection and soft detectionHard detectors give an estimate of s without additional information while softdetectors provide both an estimate of s and probability information for each bitin the symbols in s This means that the detector provide information of howaccurate the estimated s is on bit level

Since detectors in communication systems are commonly used together with acoding scheme this probability information is useful when trying to decode thereceived symbol If it is known to the decoder that a specific bit in the receivedsymbol has lower probability of being correct it can be possible to achieve a lowererror rate by inverting that bit

As the title of this thesis describes the focus lies mainly on soft detectors

221 Soft Detection

The information that the detector can provide the decoder with is the log-likelihoodratio LLR which is the logarithm of the likelihood ratio Likelihood ratio is a sta-tistical test to compare the fit of two models in this case if a zero or one wastransmitted given the received data This ratio tells how many more times likelyone case is over the other

With this ratio expressed for each of the received bits the decoder can use thisknowledge to decode the received data correctly With the ratio expressed in thelogarithmic domain the sign will show the hard detection thus if the detectordetected a zero or one while the magnitude of the ratio will tell how accurate this

23 SUMIS 5

detection is The log-likelihood ratio is

l(si |y) = log

sum

forallsisinssi=1exp

(minus 1N0y minusHs2

)sum

forallsisinssi=0exp

(minus 1N0y minusHs2

) (22)

given that the symbols are uniformly distributed thus equally probable that azero or one is being sent

The sums in Equation 22 are over the set s si = x which means all possiblevectors s where the ith bit is x = 0 or x = 1 respectively

The computation effort needed to calculate the log-likelihood ratio will growpolynomial with the number of possible symbols of the constellation and expo-nential with the number of transmitter antennas Nt If |S| is all of the possiblesymbols s can contain the complexity of the calculation will be proportional to|S|Nt This is the big limitation when it comes to MIMO detectors with the con-stellation size growing as well as the number of antennas the computation effortwill be impractical to deal with

Numerous methods to deal with this complexity by introducing approximationsexists such as sphere decoding in [Chu and McAllister 2012] The method thatis investigated further in this thesis is SUMIS which is introduced in [Čirkić andLarsson 2012] SUMIS is based upon a mix of two approaches partial marginal-ization and soft interference cancellation Partial marginalization is further de-scribed in [Larsson and Jalden 2008] [Čirkić et al 2011] [Persson and Larsson2011] and [Persson et al 2012] Soft interference cancellation is described in[Lampe and Huber 1999] and [Choi et al 2000]

23 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 21into

y = Hs + Hs + e (23)

The partitioning can be used to group together Hs + e and treat it as interferenceand noise

The partition in Equation 23 is dependent on the parameter ns isin 1 Ntwhich can be seen as a complexity parameter This complexity parameter deter-mines how much effort that will be put in to the detection algorithm The dimen-sions of the partitioned matrices will be as follows H isin R

Nrtimesns H isin RNrtimes(Ntminusns)

s isin Sns and finally s isin SNtminusns

The partitioning must be chosen so that the interesting bit si is contained by sTo be able to cover all of the available bits it means that it is necessary to haveNt different partitions to have at least one partition that contains each interestingbit

6 2 Theory

If ns = 1 it is easy to choose a partition for bit si since there exists only one but forns gt 1 it is a more complex problem In [Čirkić and Larsson 2012 Section 3C] asuitable approach to perform this selection is presented The approach is to basethe selection on the matrix product HTH The goal is to minimize the impact ofHs + e on the selected columns that will be contained in H This is achieved byselecting the column in HTH that contains the interesting bit along side with thens minus 1 columns that contains the largest values intersecting the chosen columnThis will leave the remaining columns to H and the impact will be minimized

231 First Stage

Given Equation 23 it is possible to choose an approximate model

y asymp Hs + n (24)

where n sim N (0Q) and Q = HHT + N02 I

The key point of Equation 24 is that computations can be simplified by assumingthat the interference from Hs can be seen as Gaussian noise With these assump-tions made it is possible to perform the first step of the SUMIS algorithm whichhas the purpose of reducing the impact of the interfering terms This is achievedby computing the conditional expected value of each bit approximately and thiscomputation is performed symbol-wise by first computing

λk = log

sum

forallsisinssk=1exp

(minus1

2 (y minusHs)TQminus1(y minusHs))

sumforallsisinssk=0

exp(minus1

2 (y minusHs)TQminus1(y minusHs)) (25)

followed by

Esk |y = tanh(λk

2

) (26)

232 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the inter-fering vector s The first step is defining a new model to suppress this vector andthis model is

yprime asymp Hs + nprime (27)

where nprime sim N (0Qprime) and Qprime = HΦHT + N02 I The matrix Φ is the conditional

covariance matrix of s and is described as

Φ = ES2|y minus ES|y2 (28)

In Equation 28 the matrix S is a diagonal matrix with the diagonal consisting ofthe elements from s With all of these computations performed the model canbe assumed to be purified and it is possible to calculate the desired LLRs Themain difference from Equation 22 is that these computations in SUMIS are overthe space spanning ns dimensions instead of the original Nt dimensions This

24 Number Representation 7

computation is performed for each bit and is described by

l(si |y) asymp log

sum

forallsisinssi=1exp

(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs))

sumforallsisinssi=0

exp(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs)) (29)

Since the LLRs are the information desired by the decoder the SUMIS algorithmhas completed its task

233 Complexity Selection

As can be seen in the previous sections ns is the complexity parameter of thealgorithm and can be assumed to be much smaller than Nt With ns = Nt thebenefits of SUMIS are non existing since H = H and the complete computation inEquation 22 will be performed The work in [Čirkić and Larsson 2012] furtherdescribes optimizations possible to minimize the computations needed and theseresults have been used when selecting the operations to be analysed One aspectis that the inverse Qminus1 can be computed for all of the partitions by inverting alarger matrix of dimension Nt followed by smaller inverses of dimension ns

2.4 Number Representation

Throughout the thesis a fixed point number representation is used for the hardware implementation. A fixed point representation expresses a fractional number using a limited number of bits; the wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

$$X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i \qquad (2.10)$$

which denotes the value of a number X represented by N bits $x_{N-1}, \dots, x_0$.

With an N-bit binary number as described in Equation 2.10, any integer in the range $-2^{N-1} \le X \le 2^{N-1} - 1$ can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to fractional numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

$$X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right) \qquad (2.11)$$


which also features an N-bit binary number like the one in Equation 2.10, but this time representing a fractional number.

The number represented by Equation 2.11 is scaled by $2^{-f}$, which means that f bits have been allocated for the fractional part, while the remaining $N - f$ bits represent the integer part and sign.

The number can be in the range $-2^{N-1-f} \le X \le 2^{N-1-f} - 2^{-f}$, in steps of $2^{-f}$. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
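The conversion between real numbers and this fixed point format can be sketched in Python as follows; the wordlengths N and f are illustrative defaults, not a prescription from the thesis.

    def to_fixed(value, N=18, f=12):
        # round to the nearest multiple of 2^-f, then saturate to the
        # representable two's complement range of an N-bit word
        q = round(value * 2**f)
        return max(-2**(N - 1), min(2**(N - 1) - 1, q))

    def from_fixed(q, f=12):
        # reinterpret the stored integer with the implicit scaling 2^-f
        return q * 2**-f

For example, to_fixed(0.75) stores the integer 3072 and from_fixed(3072) recovers 0.75 exactly, while values between the $2^{-12}$ steps are rounded.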

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output depends only on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less, because of the lack of memory. Sequential logic, on the other hand, groups combinatorial circuits together with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop only changes its output value on the rising edge of the clock; otherwise it retains the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value remains regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value is stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when greater amounts of data need to be stored, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address becomes available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory, by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. Each truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4 input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are sufficient for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of code, even though the code is a description of hardware and is not meant to be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. Such blocks can be anything from a simple counter to a complete processor, and can be seen as the hardware analogy of a software library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, described in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$$AB = C \qquad (3.1)$$

where $A \in \mathbb{R}^{M \times L}$, $B \in \mathbb{R}^{L \times N}$ and $C \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the


real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

    for i = 1 → M do
        for j = 1 → N do
            sum = 0
            for k = 1 → L do
                sum = sum + A[i][k] * B[k][j]
            end for
            C[i][j] = sum
        end for
    end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $H^T H$, some of the operations could be eliminated, since the result is symmetric around the diagonal. The drawback of such reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed form formula exists for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots.


This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, in order to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$$Q = LDL^T \qquad (3.2)$$

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and $L^T$ is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower loop bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

    v = zeros(N, 1)
    d = zeros(N, 1)
    L = zeros(N, N)
    for i = 1 → N do
        sum = 0
        for j = 1 → i − 1 do
            v[j] = L[i][j] * d[j]
            sum = sum + L[i][j] * v[j]
        end for
        v[i] = d[i] = Q[i][i] − sum
        rec = 1 / v[i]
        for j = i + 1 → N do
            sum = 0
            for k = 1 → i − 1 do
                sum = sum + L[j][k] * v[k]
            end for
            L[j][i] = (Q[j][i] − sum) * rec
        end for
    end for

In Algorithm 3.2 a temporary vector, denoted v, is required to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
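A direct Python transcription of Algorithm 3.2 (0-indexed) can serve as a reference model for the hardware; the final check, assuming numpy, verifies the decomposition on a random symmetric positive definite matrix.

    import numpy as np

    def ldlt(Q):
        # Algorithm 3.2, with the unit diagonal stored explicitly in L
        N = Q.shape[0]
        v = np.zeros(N)
        d = np.zeros(N)
        L = np.eye(N)
        for i in range(N):
            acc = 0.0
            for j in range(i):
                v[j] = L[i, j] * d[j]
                acc += L[i, j] * v[j]
            v[i] = d[i] = Q[i, i] - acc
            rec = 1.0 / v[i]          # the division handled by the reciprocal unit
            for j in range(i + 1, N):
                acc = 0.0
                for k in range(i):
                    acc += L[j, k] * v[k]
                L[j, i] = (Q[j, i] - acc) * rec
        return L, d

    A = np.random.randn(8, 8)
    Q = A @ A.T + 8 * np.eye(8)       # symmetric and positive definite
    L, d = ldlt(Q)
    assert np.allclose(L @ np.diag(d) @ L.T, Q)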


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

$$f(x) = \frac{1}{x} - d \qquad (3.3)$$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \qquad (3.4)$$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$$x_{i+1} = x_i (2 - d \cdot x_i) = 2 x_i - d \cdot x_i^2 \qquad (3.5)$$

The performance of this algorithm depends on how good the guess $x_0$ for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
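A behavioural Python sketch of this scheme is shown below; the table resolution and iteration count are assumptions chosen for illustration, not values from the implementation.

    def reciprocal(d, lut_bits=6, iterations=2):
        # scale d into [0.5, 1), look up an initial guess, refine with
        # Newton-Raphson (Equation 3.5), then undo the scaling
        assert d > 0
        shift = 0
        while d >= 1.0:
            d /= 2.0
            shift += 1
        while d < 0.5:
            d *= 2.0
            shift -= 1
        index = int((d - 0.5) * 2**(lut_bits + 1))            # bits after the leading one
        x = 1.0 / (0.5 + (index + 0.5) * 2**-(lut_bits + 1))  # midpoint initial guess
        for _ in range(iterations):
            x = x * (2.0 - d * x)                             # error squares each pass
        return x * 2**-shift

Since the relative error squares in every iteration, a guess accurate to roughly $2^{-7}$ reaches about $2^{-28}$ after two iterations, which motivates keeping the table small.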

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$$L x_i = e_i \qquad (3.6)$$

for $i = 1, \dots, n$, where $e_i$ is the ith column of the unit matrix and n is the dimension of L. The resulting vectors $x_1, \dots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

    for i = 1 → N do
        for j = 1 → N do
            sum = 0
            for k = 1 → j − 1 do
                sum = sum + L[j][k] * x[k][i]
            end for
            x[j][i] = (e[j][i] − sum) / L[j][j]
        end for
    end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \dots, x_n)$ and $e = (e_1, \dots, e_n)$. If L is of dimension 8, this algorithm needs 224 multiply-and-add, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits of the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations is greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

    for i = 1 → N do
        x[i][i] = 1
        for j = i + 1 → N do
            sum = L[j][i]
            for k = i + 1 → j − 1 do
                sum = sum + L[j][k] * x[k][i]
            end for
            x[j][i] = −sum
        end for
    end for
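Transcribed to Python (0-indexed), Algorithm 3.4 becomes the following sketch, which computes the inverse of a unit lower triangular matrix without any divisions; the assertion assumes numpy and the matrix L from the LDLT sketch above.

    import numpy as np

    def invert_unit_lower(L):
        # Algorithm 3.4: column i of X = L^-1, exploiting that L is
        # unitriangular, X is lower triangular, and e is the unit matrix
        N = L.shape[0]
        X = np.zeros_like(L)
        for i in range(N):
            X[i, i] = 1.0
            for j in range(i + 1, N):
                acc = L[j, i]
                for k in range(i + 1, j):
                    acc += L[j, k] * X[k, i]
                X[j, i] = -acc
        return X

    assert np.allclose(invert_unit_lower(L) @ L, np.eye(L.shape[0]))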

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus $D^{-1}$ consists of the reciprocal values of D.

The matrix inverse $Q^{-1}$ can now be obtained by

$$Q^{-1} = L^{-T} D^{-1} L^{-1} \qquad (3.7)$$

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications, the inverse $Q^{-1}$ has been calculated.
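Tying the steps together, a sketch of the complete inversion, reusing the ldlt and invert_unit_lower functions from the sketches above, could look as follows.

    import numpy as np

    def invert_spd(Q):
        # Q^-1 = L^-T D^-1 L^-1 (Equation 3.7); D^-1 is simply the
        # element-wise reciprocal of d, as delivered by the reciprocal unit
        L, d = ldlt(Q)
        Linv = invert_unit_lower(L)
        return Linv.T @ np.diag(1.0 / d) @ Linv

    # e.g. assert np.allclose(invert_spd(Q) @ Q, np.eye(Q.shape[0]))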

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation      Log space
log(a * b)     log(a) + log(b)
log(a / b)     log(a) - log(b)
log(a^b)       b * log(a)

Table 3.1 Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log(e^{\log(a)} + e^{\log(b)}) \qquad (3.8)$$

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

$$\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a), \log(b))}\left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right) \qquad (3.9)$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
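In Python, the Jacobi logarithm is essentially a one-liner around Equation 3.9; the helper below is a sketch, with math.log1p used for the correction term.

    import math
    from functools import reduce

    def jacobi_log(log_a, log_b):
        # log(a + b) computed from log(a) and log(b), per Equation 3.9
        x = abs(log_a - log_b)
        return max(log_a, log_b) + math.log1p(math.exp(-x))

    # summing many log-domain probabilities without leaving log space:
    # log_total = reduce(jacobi_log, log_values)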

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the numbers were in the different sections of the algorithm and therefore how many bits were needed to represent them. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named


fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured easily by using constants or generics, instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part that only stores the next state into the state registers.

Records have been used heavily, since when registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also


has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1 An overview of the interesting resources available in the XC6VLX240T

Even though the end result would not be a complete implementation, it is suitable to target an FPGA platform, with the limitations it entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices are of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc, 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc, 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1 Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation.


One of the first matrix multiplications that has to be calculated is $H^T H$, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and $H^T$ simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus, the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
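The address mapping of this lookup table is easy to sketch; the following hypothetical Python snippet generates the column-order address for an 8×8 row-major matrix.

    def transpose_addr(i, n=8):
        # the i-th element in column order sits at row (i mod n),
        # column (i div n); its row-major address is (i mod n)*n + (i div n)
        return (i % n) * n + (i // n)

    print([transpose_addr(i) for i in range(8)])  # 0, 8, 16, 24, 32, 40, 48, 56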

Everything in the implementation is controlled by a control FSM, which contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2 Block diagram of the matrix multiplication implementation


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q, and thus the new elements of both v and d have been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3 Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while also being able to write an individual element. This can be achieved using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of a number of smaller memories, together with some logic that performs the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4 Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1 Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one must reside in position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps, until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps, to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations, to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

Figure 5.5 Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2.


This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDLT decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

$$c = c \pm a \times b \qquad (5.1)$$

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6 Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus


the only computation unit needed is this multiply-and-accumulate unit, performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc, 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name        Purpose
sel         Control input mux to MAC unit
clr         Clear accumulator register
L_x, L_y    X, Y coordinate in L matrix
X_x, X_y    X, Y coordinate in X matrix
W_x, W_y    X, Y coordinate in X matrix for write
we          Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4 Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

$$\mathrm{result} = \max(\log(a), \log(b)) + \log(1 + e^{-x}) \qquad (5.2)$$

Since log(a) − log(b) must be calculated anyway, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the subtraction result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{-x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8 The function log(1 + exp(−x)) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to contain only log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^-8.
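The table generation and the saturated indexing can be sketched as follows; this mirrors the quantization described above, with floating point arithmetic standing in for the fixed point hardware.

    import math

    FRAC_BITS = 8                                  # 8 fractional bits, step 2^-8
    TABLE = [math.log1p(math.exp(-i * 2**-FRAC_BITS)) for i in range(2048)]

    def jacobi_log_lut(log_a, log_b):
        x = abs(log_a - log_b)
        index = min(int(x * 2**FRAC_BITS), 2047)   # saturate to 11 address bits
        return max(log_a, log_b) + TABLE[index]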

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5 Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality, and to be able to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength.


It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns farther to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^-8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes $H^T H$, as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1 Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not as well suited to an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2 Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3 Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4 Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the most bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles.


Unfortunately, it is not possible to calculate tanh directly, but since

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \qquad (6.1)$$

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
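A behavioural Python sketch of this idea is given below. It runs hyperbolic CORDIC in rotation mode, where iterations 4 and 13 are repeated as required for convergence; since both outputs carry the same gain, the gain cancels in the ratio and tanh needs no compensation, though the final division would still need dedicated hardware. The iteration count is an assumption, and the scheme converges only for arguments up to roughly 1.1 without range extension.

    import math

    def tanh_cordic(z, n=16):
        # build the iteration index sequence 1, 2, 3, 4, 4, 5, ..., 13, 13, ...
        seq = []
        i = 1
        while len(seq) < n:
            seq.append(i)
            if i in (4, 13) and seq.count(i) < 2:
                continue                      # repeat this index once
            i += 1
        x, y = 1.0, 0.0                       # x -> K*cosh(z), y -> K*sinh(z)
        for k in seq:
            d = 1.0 if z >= 0 else -1.0       # rotate to drive the residual z to 0
            x, y, z = (x + d * y * 2.0**-k,
                       y + d * x * 2.0**-k,
                       z - d * math.atanh(2.0**-k))
        return y / x                          # the common gain K cancels here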

6.3.2 Exponential Function

In the algorithm it is necessary to compute $e^x$, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

$$e^x = e^{x \cdot \frac{\ln(2)}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}} \qquad (6.2)$$

where $\frac{1}{\ln(2)}$ can be precalculated. This rewrite can be further refined with

$$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \qquad (6.3)$$

If $y = x \cdot \frac{1}{\ln(2)}$ is defined, Equation 6.3 becomes

$$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor} \qquad (6.4)$$

where $2^{\lfloor y \rfloor}$ can be implemented with a simple binary decoder, while $2^{y - \lfloor y \rfloor}$ can be precomputed and stored in a lookup table, with $y - \lfloor y \rfloor$ ranging from 0 to 1.
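A Python sketch of this decomposition, with an assumed 8-bit table for the fractional power of two, could look as follows; in hardware the multiplication by $2^{\lfloor y \rfloor}$ is only a shift.

    import math

    INV_LN2 = 1.0 / math.log(2.0)      # the precalculated constant 1/ln(2)
    FRAC_BITS = 8                      # assumed table resolution
    POW2_TABLE = [2.0**(i * 2**-FRAC_BITS) for i in range(2**FRAC_BITS)]

    def exp_approx(x):
        # e^x = 2^floor(y) * 2^(y - floor(y)) with y = x / ln(2), Equation 6.4
        y = x * INV_LN2
        n = math.floor(y)
        frac_index = int((y - n) * 2**FRAC_BITS)
        return POW2_TABLE[frac_index] * 2.0**n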

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension $n_s$ are also needed. If $n_s$ is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \qquad (6.5)$$


iff $ad - bc \neq 0$, as explained in [Strang, 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of the LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure, including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level $N_0$, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach, so if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware,

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
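As a minimal sketch of this argument (plain Python, names illustrative), a complex multiplication decomposes into four real multiplications and two additions or subtractions, all independent enough to be evaluated in parallel and pipelined by a hardware datapath:

def complex_mult(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) via four real multiplications."""
    # Four real multiplications, independent of each other ...
    p1, p2, p3, p4 = ar * br, ai * bi, ar * bi, ai * br
    # ... followed by one subtraction and one addition.
    return p1 - p2, p3 + p4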

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC. 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson



Keywords: FPGA, MIMO, soft detection, SUMIS


Acknowledgments

I would like to thank my examiner Daniel Persson and my supervisor Mirsad Čirkić at ISY for examining and providing feedback during this master's thesis. It has been interesting to hear about the problems associated with the subject from another point of view rather than just my own.

I would like to acknowledge everyone at Synective Labs in Gothenburg for the friendly atmosphere and the possibility for discussions. I also appreciate the feedback from my opponent Emelie Nilsson, which led to a better report.

Gothenburg, May 2013
Tomas Frostensson

Contents

Notation

1 Introduction
  1.1 Background
  1.2 Goal
  1.3 Limitations
  1.4 Outline

2 Theory
  2.1 MIMO
  2.2 Detection
    2.2.1 Soft Detection
  2.3 SUMIS
    2.3.1 First Stage
    2.3.2 Second Stage
    2.3.3 Complexity Selection
  2.4 Number Representation
  2.5 Hardware Introduction
  2.6 Programmable Hardware
    2.6.1 Hardware Flow
    2.6.2 Reusable Modules

3 Problem Analysis
  3.1 Overview
  3.2 Matrix multiplication
  3.3 Matrix Inversion
    3.3.1 LDLT Decomposition
    3.3.2 Reciprocal
    3.3.3 Forward Substitution
    3.3.4 Final Steps
  3.4 Log Sum of Exponentials

4 Methodology and Equipment
  4.1 Modeling
  4.2 VHDL
  4.3 RTL
  4.4 Hardware

5 Implementation
  5.1 Overview
  5.2 Matrix Multiplication
    5.2.1 IP Block Trade-offs
    5.2.2 Interface
    5.2.3 Example Implementation
  5.3 Matrix Inversion
    5.3.1 LDLT Decomposition
    5.3.2 Reciprocal Unit
    5.3.3 Forward Substitution
  5.4 Jacobi Logarithm

6 Result and Analysis
  6.1 Testing and Measurements
    6.1.1 Matrix Multiplication
    6.1.2 LDLT Decomposition
    6.1.3 Forward Substitution
    6.1.4 Jacobi Logarithm
  6.2 Resource Usage
    6.2.1 Matrix Multiplication
    6.2.2 Matrix Inversion
    6.2.3 Jacobi Logarithm
  6.3 Remaining Work
    6.3.1 Hyperbolic Tangent
    6.3.2 Exponential Function
    6.3.3 Additional Matrix Operations
    6.3.4 Control Structure
  6.4 Improvements
    6.4.1 Hardware Time-Multiplexing and Control
    6.4.2 Wordlength Optimization or Floating Point Implementation
    6.4.3 Design Space Exploration using High Level Synthesis
  6.5 Alternative Approaches and Comparison
  6.6 Insights from Alternative Approaches
    6.6.1 Number Representation
    6.6.2 Processor Architecture
    6.6.3 Flexibility
    6.6.4 Integration
  6.7 Final Conclusions

Bibliography

Notation

Number sets

Notation   Meaning
R          Set of real numbers
C          Set of complex numbers

Abbreviations

Abbreviation   Meaning
ASIC           Application-Specific Integrated Circuit
BRAM           Block RAM
CORDIC         Coordinate Rotation Digital Computer
FFT            Fast Fourier Transform
FPGA           Field Programmable Gate Array
HDL            Hardware Description Language
IEEE           Institute of Electrical and Electronics Engineers
IP             Intellectual Property
JTAG           Joint Test Action Group
LLR            Log-Likelihood Ratio
LUT            Lookup Table
MAC            Multiply and Accumulate
MIMO           Multiple-Input and Multiple-Output
OFDM           Orthogonal Frequency-Division Multiplexing
QAM            Quadrature Amplitude Modulation
RAM            Random Access Memory
RTL            Register Transfer Level
SIMD           Single Instruction Multiple Data
SNR            Signal-to-Noise Ratio
SUMIS          Subspace Marginalization with Interference Suppression
VHDL           VHSIC Hardware Description Language
VHSIC          Very High Speed Integrated Circuit

1 Introduction

One technique to improve wireless communication reliability as well as performance is to use multiple antennas in the transmitter and receiver; this technique is called MIMO.

Unfortunately, this technique adds increased complexity to the receiver, since the receiver has to determine what was actually sent given the overlapping input from multiple antennas. Since this is a complex problem, efficient methods must be developed to cope with this complexity given the strict real time demands of a communication system.

1.1 Background

The main area of this thesis is the implementation aspect of detection algorithms in the receiver used in a MIMO system.

The background for this thesis is a detection algorithm described in the conference paper [Čirkić and Larsson, 2012] and in more detail in the longer article [Čirkić and Larsson, 2012]. These papers present a detection algorithm called SUMIS (subspace marginalization with interference suppression), which has shown promising results compared to other detection algorithms, at a lower complexity.

The high level description of the mathematics involved in the detection given in the mentioned papers does not disclose how it could be efficiently implemented in hardware for use in a real wireless system. Therefore this thesis will examine the implementation aspects of the proposed algorithm.


1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be discussed to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real time implementation in contemporary and future wireless systems.

Implementation-wise it should serve as a proof of concept, with discussion about possible improvements, rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and limit the work load associated with this thesis to a reasonable amount. The number of antennas supported is considered constant, and the modulation is chosen as 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation, and these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions to the problems. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The step of actual hardware implementation is presented in Chapter 5, where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

[Figure 2.1: A MIMO system using $N_t$ transmit and $N_r$ receive antennas.]

A real valued MIMO channel can be seen as

$$y = Hs + e \qquad (2.1)$$


where $H \in \mathbb{R}^{N_r \times N_t}$. The matrix $H$ denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver. Therefore it contains $N_r \times N_t$ elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector $s \in \mathbb{S}^{N_t}$ contains the modulated symbols that the transmitter will try to send, where $\mathbb{S}$ is the set containing the possible symbols. The vector $e \in \mathbb{R}^{N_r}$ is the noise vector, $e \sim \mathcal{N}(0, \frac{N_0}{2} I)$, containing additive Gaussian noise with zero mean and $\frac{N_0}{2}$ variance. Finally, $y \in \mathbb{R}^{N_r}$ is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. It is more common with a complex channel, but as described in [Larsson and Jalden, 2008], every complex channel, given a few prerequisites, can be posed as a real model. This is straightforward since $\mathbb{C}^n$ is isomorphic to $\mathbb{R}^{2n}$. A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex valued model.

2.2 Detection

The principle of detection in MIMO systems is to determine $s$ given $y$, as described in Equation 2.1. The channel matrix $H$ is assumed to be known to the receiver, which is often the case in practice through estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of $s$ without additional information, while soft detectors provide both an estimate of $s$ and probability information for each bit in the symbols in $s$. This means that the detector provides information about how accurate the estimated $s$ is on bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. This ratio tells how many times more likely one case is over the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign shows the hard detection, thus whether the detector detected a zero or a one, while the magnitude of the ratio tells how accurate this detection is. The log-likelihood ratio is

$$l(s_i \mid y) = \log \frac{\sum_{\forall s:\, s_i = 1} \exp\left(-\frac{1}{N_0}\|y - Hs\|^2\right)}{\sum_{\forall s:\, s_i = 0} \exp\left(-\frac{1}{N_0}\|y - Hs\|^2\right)} \qquad (2.2)$$

given that the symbols are uniformly distributed, thus it is equally probable that a zero or a one is being sent.

The sums in Equation 2.2 are over the set $\{s : s_i = x\}$, which means all possible vectors $s$ where the $i$th bit is $x = 0$ or $x = 1$, respectively.

The computational effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols in the constellation and exponentially with the number of transmit antennas $N_t$. If $|\mathbb{S}|$ is the number of possible symbols $s$ can contain, the complexity of the calculation will be proportional to $|\mathbb{S}|^{N_t}$. This is the big limitation when it comes to MIMO detectors: with the constellation size growing as well as the number of antennas, the computational effort becomes impractical to deal with.
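To make the growth concrete, the following brute-force sketch evaluates Equation 2.2 directly for a toy real-valued model. The symbol set, the index-based bit labeling and all names are illustrative assumptions; the point is the loop over all $|\mathbb{S}|^{N_t}$ symbol vectors.

import itertools
import numpy as np

def llr_brute_force(y, H, N0, S=(-3.0, -1.0, 1.0, 3.0)):
    """Brute-force LLRs per Equation 2.2; cost grows as |S|**Nt."""
    Nt = H.shape[1]
    bits = int(np.log2(len(S)))                  # bits per real symbol
    llrs = []
    for i in range(Nt * bits):
        num = den = 0.0
        # Enumerate every possible transmitted vector s.
        for idx in itertools.product(range(len(S)), repeat=Nt):
            sv = np.array([S[k] for k in idx])
            metric = np.exp(-np.linalg.norm(y - H @ sv) ** 2 / N0)
            bit = (idx[i // bits] >> (i % bits)) & 1   # illustrative labeling
            if bit:
                num += metric
            else:
                den += metric
        llrs.append(np.log(num / den))
    return llrs

With four real amplitudes and Nt = 8 this inner loop already runs 4**8 = 65536 times per received vector, which is the impracticality the text describes.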

Numerous methods exist that deal with this complexity by introducing approximations, such as the sphere decoding in [Chu and McAllister, 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jalden, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

$$y = \bar{H}\bar{s} + \tilde{H}\tilde{s} + e \qquad (2.3)$$

The partitioning can be used to group together $\tilde{H}\tilde{s} + e$ and treat it as interference and noise.

The partition in Equation 2.3 depends on the parameter $n_s \in \{1, \ldots, N_t\}$, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: $\bar{H} \in \mathbb{R}^{N_r \times n_s}$, $\tilde{H} \in \mathbb{R}^{N_r \times (N_t - n_s)}$, $\bar{s} \in \mathbb{S}^{n_s}$ and finally $\tilde{s} \in \mathbb{S}^{N_t - n_s}$.

The partitioning must be chosen so that the interesting bit $s_i$ is contained in $\bar{s}$. To be able to cover all of the available bits, it is necessary to have $N_t$ different partitions, so that at least one partition contains each interesting bit.


If $n_s = 1$ it is easy to choose a partition for bit $s_i$, since there exists only one, but for $n_s > 1$ it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product $H^TH$. The goal is to minimize the impact of $\tilde{H}\tilde{s} + e$ on the selected columns that will be contained in $\bar{H}$. This is achieved by selecting the column in $H^TH$ that contains the interesting bit, alongside the $n_s - 1$ columns that contain the largest values intersecting the chosen column. This leaves the remaining columns to $\tilde{H}$, and the impact is minimized.

2.3.1 First Stage

Given Equation 2.3 it is possible to choose an approximate model

$$y \approx \bar{H}\bar{s} + n \qquad (2.4)$$

where $n \sim \mathcal{N}(0, Q)$ and $Q = \tilde{H}\tilde{H}^T + \frac{N_0}{2} I$.

The key point of Equation 2.4 is that computations can be simplified by assuming that the interference from $\tilde{H}\tilde{s}$ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

$$\lambda_k = \log \frac{\sum_{\forall \bar{s}:\, \bar{s}_k = 1} \exp\left(-\frac{1}{2}(y - \bar{H}\bar{s})^T Q^{-1}(y - \bar{H}\bar{s})\right)}{\sum_{\forall \bar{s}:\, \bar{s}_k = 0} \exp\left(-\frac{1}{2}(y - \bar{H}\bar{s})^T Q^{-1}(y - \bar{H}\bar{s})\right)} \qquad (2.5)$$

followed by

$$E\{\bar{s}_k \mid y\} = \tanh\left(\frac{\lambda_k}{2}\right) \qquad (2.6)$$

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector $\tilde{s}$. The first step is defining a new model that suppresses this vector, and this model is

$$y' \approx \bar{H}\bar{s} + n' \qquad (2.7)$$

where $n' \sim \mathcal{N}(0, Q')$ and $Q' = \tilde{H}\Phi\tilde{H}^T + \frac{N_0}{2} I$. The matrix $\Phi$ is the conditional covariance matrix of $\tilde{s}$ and is described as

$$\Phi = E\{\tilde{S}^2 \mid y\} - E\{\tilde{S} \mid y\}^2 \qquad (2.8)$$

In Equation 2.8 the matrix $\tilde{S}$ is a diagonal matrix with the diagonal consisting of the elements from $\tilde{s}$. With all of these computations performed, the model can be assumed to be purified and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning $n_s$ dimensions instead of the original $N_t$ dimensions. This computation is performed for each bit and is described by

$$l(s_i \mid y) \approx \log \frac{\sum_{\forall \bar{s}:\, s_i = 1} \exp\left(-\frac{1}{2}(y' - \bar{H}\bar{s})^T Q'^{-1}(y' - \bar{H}\bar{s})\right)}{\sum_{\forall \bar{s}:\, s_i = 0} \exp\left(-\frac{1}{2}(y' - \bar{H}\bar{s})^T Q'^{-1}(y' - \bar{H}\bar{s})\right)} \qquad (2.9)$$

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, $n_s$ is the complexity parameter of the algorithm and can be assumed to be much smaller than $N_t$. With $n_s = N_t$ the benefits of SUMIS are nonexistent, since $\bar{H} = H$ and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations that minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse $Q^{-1}$ can be computed for all of the partitions by inverting one larger matrix of dimension $N_t$ followed by smaller inverses of dimension $n_s$.

2.4 Number Representation

Throughout the thesis a fixed point number representation is used for the hardware implementation. A fixed point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

$$X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i \qquad (2.10)$$

which denotes the value of a number $X$ represented by the $N$ bits $x_{N-1}, \ldots, x_0$.

With an $N$-bit binary number as described in Equation 2.10, any integer in the range $-2^{N-1} \le X \le 2^{N-1} - 1$ can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

$$X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right) \qquad (2.11)$$

which also features an $N$-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by $2^{-f}$, which means that $f$ bits have been allocated for the fractional part while the remaining $N - f$ bits represent the integer part and the sign.

The number can be in the range $-2^{N-1-f} \le X \le 2^{N-1-f} - 2^{-f}$, in steps of $2^{-f}$. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
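A small sketch of Equation 2.11 (illustrative helper names; N = 16 and f = 8 are arbitrary choices): quantize a real number to an N-bit two's complement word with f fractional bits and decode it back.

def to_fixed(x, N=16, f=8):
    """Quantize x to an N-bit two's complement word with f fraction bits."""
    word = int(round(x * (1 << f)))              # apply the 2**f scaling
    lo, hi = -(1 << (N - 1)), (1 << (N - 1)) - 1
    return max(lo, min(hi, word))                # saturate at the range ends

def from_fixed(word, f=8):
    """Decode per Equation 2.11: constant resolution of 2**-f."""
    return word * 2.0 ** -f

# Example: 1.7304 is rounded to the nearest multiple of 2**-8.
assert from_fixed(to_fixed(1.7304)) == 443 / 256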

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output depends only on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are stateless, because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will keep the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port after a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. A truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4 input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.
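The LUT principle can be modeled in a few lines (an illustrative sketch, not vendor tool output): store the 16 truth-table bits of any 4-input boolean function and index them with the inputs as the address.

def make_lut(func, n_inputs=4):
    """Store func's truth table; evaluation is then a pure table lookup."""
    table = [func(addr) & 1 for addr in range(1 << n_inputs)]
    return lambda addr: table[addr]

# Example: a majority-of-four function fits in a single 4-input LUT.
majority = make_lut(lambda a: int(bin(a).count("1") >= 3))
assert majority(0b1110) == 1 and majority(0b0100) == 0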

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else could be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA and then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, outlined in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there are problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$$AB = C \qquad (3.1)$$

where $A \in \mathbb{R}^{M \times L}$, $B \in \mathbb{R}^{L \times N}$ and $C \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If $N = M = L = 8$, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $H^TH$, some of the operations could be reduced, since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it is possible to reuse it for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed form formula exists for calculating the inverse.

Common ways to calculate the inverse of a larger matrix use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix $Q$, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the $LDL^T$ decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the $LDL^T$ decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The $LDL^T$ decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, in order to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$$Q = LDL^T \qquad (3.2)$$

where $L$ is a lower triangular matrix, $D$ is a diagonal matrix containing only positive elements, and $L^T$ is the transpose of $L$. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the $LDL^T$ decomposition can be seen in Algorithm 3.2, where the matrix $Q$ is of dimension $N$. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2 a temporary vector, denoted v, is required to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
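For reference, a direct transcription of Algorithm 3.2 into Python/NumPy follows (a modeling aid in the spirit of Chapter 4.1, not the hardware implementation itself):

import numpy as np

def ldlt(Q):
    """LDL^T decomposition of a symmetric positive definite Q."""
    N = Q.shape[0]
    v = np.zeros(N)
    d = np.zeros(N)
    L = np.eye(N)
    for i in range(N):
        acc = 0.0
        for j in range(i):
            v[j] = L[i, j] * d[j]
            acc += L[i, j] * v[j]
        v[i] = d[i] = Q[i, i] - acc
        rec = 1.0 / v[i]        # the reciprocal unit of Section 3.3.2
        for j in range(i + 1, N):
            acc = 0.0
            for k in range(i):
                acc += L[j, k] * v[k]
            L[j, i] = (Q[j, i] - acc) * rec
    return L, d

# Sanity check: L @ np.diag(d) @ L.T should reconstruct Q.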


3.3.2 Reciprocal

In the $LDL^T$ decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number $n$ by $d$, the reciprocal $\frac{1}{d}$ is calculated and the operation $n \cdot \frac{1}{d}$ is subsequently performed.

The reciprocal $\frac{1}{d}$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = \frac{1}{d}$ and using Newton's method to approximate the root. A suitable function is

$$f(x) = \frac{1}{x} - d \qquad (3.3)$$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \qquad (3.4)$$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$$x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2 \qquad (3.5)$$

The performance of this algorithm depends on how good the guess of $x_i$ for the first iteration, thus $x_0$, is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
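A sketch of this scheme in Python (the table size and iteration count are illustrative choices, and the input is assumed pre-normalized to [1, 2), as is common for hardware reciprocal units):

LUT_BITS = 4
# Initial guesses for d in [1, 2), indexed by the top LUT_BITS fraction bits.
RECIP_LUT = [1.0 / (1.0 + (i + 0.5) / (1 << LUT_BITS))
             for i in range(1 << LUT_BITS)]

def reciprocal(d, iterations=2):
    """Approximate 1/d via Equation 3.5 with a table-based first guess."""
    assert 1.0 <= d < 2.0
    x = RECIP_LUT[int((d - 1.0) * (1 << LUT_BITS))]
    for _ in range(iterations):
        x = x * (2.0 - d * x)   # x_{i+1} = x_i (2 - d x_i)
    return x

Since the iteration roughly doubles the number of correct bits each time, a small table plus two iterations already yields a close approximation.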

3.3.3 Forward Substitution

When the lower triangular matrix $L$ has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$$Lx_i = e_i \qquad (3.6)$$

for $i = 1, \ldots, n$, where $e_i$ is the $i$th column of the unit matrix and $n$ is the dimension of $L$. The resulting vectors $x_1, \ldots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \ldots, x_n)$ and $e = (e_1, \ldots, e_n)$. If $L$ is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives the fact that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = -sum
    end for
end for
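Algorithm 3.4 transcribes directly into Python (again a modeling sketch; in hardware these multiply-and-add operations map onto the forward substitution unit of Chapter 5.3.3):

import numpy as np

def invert_unit_lower(L):
    """Invert a unit lower triangular L per Algorithm 3.4."""
    N = L.shape[0]
    x = np.zeros((N, N))
    for i in range(N):
        x[i, i] = 1.0                  # assumption 1: unit diagonal
        for j in range(i + 1, N):
            acc = L[j, i]              # the lifted k = i term
            for k in range(i + 1, j):
                acc += L[j, k] * x[k, i]
            x[j, i] = -acc
    return x

# Sanity check: invert_unit_lower(L) @ L should be the identity matrix.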

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the $LDL^T$ decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector; since $D$ is diagonal, $D^{-1}$ consists of the reciprocal values of $D$.

The matrix inverse $Q^{-1}$ can now be obtained by

$$Q^{-1} = L^{-T} D^{-1} L^{-1} \qquad (3.7)$$

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications the inverse $Q^{-1}$ has been calculated.
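Combining the sketches above, the whole inversion chain of Section 3.3 can be checked against a reference implementation (hypothetical glue code, reusing ldlt() and invert_unit_lower() from the earlier sketches):

import numpy as np

def invert_spd(Q):
    """Q^{-1} = L^{-T} D^{-1} L^{-1}, per Equation 3.7."""
    L, d = ldlt(Q)
    Linv = invert_unit_lower(L)
    return Linv.T @ np.diag(1.0 / d) @ Linv

# Check against NumPy on a random symmetric positive definite matrix:
A = np.random.randn(4, 4)
Q = A @ A.T + 4 * np.eye(4)
assert np.allclose(invert_spd(Q), np.linalg.inv(Q))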

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation      Log Space
log(a * b)     log(a) + log(b)
log(a / b)     log(a) - log(b)
log(a^b)       b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log\left(e^{\log(a)} + e^{\log(b)}\right) \qquad (3.8)$$

Note that $a$ and $b$ are not actually stored, but instead their logarithmic counterparts $\log(a)$ and $\log(b)$.

Apart from requiring several operations, including an exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities $a$ or $b$ is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

$$\log\left(e^{\log(a)} + e^{\log(b)}\right) = \log\left(e^{\max(\log(a), \log(b))}\left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right) \qquad (3.9)$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is $\log(2) \approx 0.69$, and it approaches 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
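A sketch of such a table-based Jacobi logarithm follows (the step size and table length are illustrative; the correction term is clamped to zero beyond the table range, where it is negligible):

import math

STEP = 1.0 / 64                     # table resolution, illustrative choice
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(512)]

def jacobi_log(log_a, log_b):
    """Compute log(a + b) from log(a) and log(b) per Equation 3.9."""
    diff = abs(log_a - log_b)
    idx = int(diff / STEP)
    corr = TABLE[idx] if idx < len(TABLE) else 0.0
    return max(log_a, log_b) + corr

# Example: jacobi_log(math.log(0.2), math.log(0.5)) approximates math.log(0.7).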

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier way to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed-point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed-point package included in the VHDL-2008 standard [IEEE 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point integers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since grouping the registers together makes it easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18-bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one-dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation: one of the first matrix multiplications that has to be calculated, H^T H.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise. This can be solved by using the same counter as for the original matrix, but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
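The contents of the address lookup table can be generated with a one-line Matlab expression, shown here as an equivalent closed form of the mapping just described:

    % Address LUT contents: counter value k reads element k of H through
    % one port, while the LUT maps k to the column-order address for H^T.
    k   = 0:63;
    lut = mod(k, 8) * 8 + floor(k / 8);   % yields 0, 8, 16, ..., 55, 63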

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation (an input BRAM feeding ports a and b of the matrix multiplication IP block, a control FSM with an address LUT, and an output BRAM collecting the result).


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit in Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree (data is read from the input BRAM, passed through the multipliers, adder tree, subtract and reciprocal stages, and stored in the L BRAM and the V/D registers).
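The column-by-column procedure described above can also be sketched behaviorally in Matlab. This sketch is assumed to correspond to Algorithm 3.2 (not repeated here) and ignores the parallel hardware scheduling:

    % Behavioral LDL^T sketch: d(i) and column i of L produced per iteration.
    n = 8; A = randn(n); Q = A*A' + n*eye(n);    % example SPD input
    L = eye(n); d = zeros(n, 1);
    for i = 1:n
        v = (L(i, 1:i-1) .* d(1:i-1)').';        % pair-wise products
        d(i) = Q(i, i) - L(i, 1:i-1) * v;        % adder tree + subtraction
        L(i+1:n, i) = (Q(i+1:n, i) - L(i+1:n, 1:i-1) * v) / d(i);
    end                                          % /d(i): reciprocal multiply
    max(max(abs(L * diag(d) * L' - Q)))          % check: close to 0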

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDL^T unit (an input BRAM for Q, the computation unit, V and D registers, a control FSM, and an output BRAM for L).

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^{-1} matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit (find MSB index, normalizing shift, lookup table for the initial guess, square, multiply and subtract stages, and denormalizing shifts producing 1/d from d).
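The behavior of the unit can be sketched in Matlab as follows. The Newton-Raphson form x1 = 2*x0 - d*x0^2 and the 128-entry table size are assumptions made for illustration, based on the structure in Figure 5.5 (Equation 3.5 is not repeated in this chapter):

    % Behavioral sketch of the reciprocal unit.
    d   = 13.7;                            % example input
    n   = floor(log2(d)) + 1;              % find MSB index
    ds  = d * 2^(-n);                      % normalize: 0.5 <= ds < 1
    idx = floor((ds - 0.5) * 2^7);         % drop bit -1, index the table
    x0  = 1 / (0.5 + (idx + 0.5) / 2^7);   % table entry: midpoint guess
    x1  = 2*x0 - ds * x0^2;                % one Newton-Raphson iteration
    recip = x1 * 2^(-n);                   % denormalize: approximates 1/d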

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c \pm a \times b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module (a multiplier feeding an adder/subtracter whose other input is taken through a mux from either the accumulator register or zero, for clearing).

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.
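The procedure can be sketched behaviorally in Matlab; this sketch is assumed to correspond to Algorithm 3.4, with L unit lower triangular as produced by the LDL^T decomposition:

    % Inverting unit lower triangular L by solving L*X = I, using only
    % the MAC operation c = c - a*b on an accumulator c.
    n = 8; L = eye(n) + tril(randn(n), -1);  % example unit lower triangular
    X = eye(n);                              % becomes inv(L)
    for j = 1:n                              % independent column equations
        for i = j+1:n
            c = 0;                           % clear accumulator
            for k = j:i-1
                c = c - L(i, k) * X(k, j);   % MAC: c = c - a*b
            end
            X(i, j) = c;
        end
    end
    max(max(abs(L * X - eye(n))))            % check: close to 0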

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X/Y coordinate in L matrix
X_x, X_y  X/Y coordinate in X matrix
W_x, W_y  X/Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit (input BRAM for L and output BRAM for X feeding the MAC unit through an input mux, with a control counter addressing a control memory that drives the we, sel and clr signals).

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = \max(\log(a),\,\log(b)) + \log(1 + e^{-x})    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected; otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{-x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{-x}) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^{-8}.
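The table contents and the index computation can be expressed in a few lines of Matlab, mirroring the quantization just described:

    % Precompute the 2048-entry table and look up the correction term.
    step  = 2^-8;
    table = log(1 + exp(-(0:step:8-step)));  % 2048 entries over 0 <= x < 8
    x     = 1.37;                            % example |log(a) - log(b)|
    xq    = min(x, 8 - step);                % saturate to 3 integer bits
    corr  = table(floor(xq / step) + 1);     % 11-bit index, approximates
                                             % log(1 + exp(-x))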

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure, for clarity, are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit (a subtracter forming log(a) − log(b), a mux controlled by the sign bit selecting the maximum, absolute value and bit selection indexing the lookup table, and a final adder producing the result).

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab, where they were compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns farther right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^{-8}.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note at how high a frequency the modules can operate. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

\tanh(x) = \frac{\sinh(x)}{\cosh(x)}    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
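A behavioral Matlab sketch of hyperbolic CORDIC in rotation mode is given below. The iteration list (with the conventional repeats at i = 4 and i = 13) and the convergence range |z| < about 1.1 are properties of the general algorithm in [Muller 1997], not of the thesis implementation. Note that the x and y results are K*cosh(z) and K*sinh(z) with the same gain K, so their ratio yields tanh(z) without gain compensation:

    function t = tanh_cordic(z)
        % Hyperbolic CORDIC, rotation mode; valid for |z| < ~1.1.
        iters = [1:4, 4, 5:13, 13];          % repeat i = 4 and i = 13
        x = 1; y = 0;                        % gain K cancels in y/x
        for i = iters
            d = sign(z); if d == 0, d = 1; end
            xn = x + d * y * 2^(-i);         % shift-and-add updates
            yn = y + d * x * 2^(-i);
            z  = z - d * atanh(2^(-i));      % table of atanh(2^-i) values
            x = xn; y = yn;
        end
        t = y / x;                           % K*sinh / K*cosh = tanh
    end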

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^{x \cdot \ln(2) \cdot \frac{1}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}}    (6.2)

where \frac{1}{\ln(2)} can be precalculated. This rewrite can be further refined with

2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor}    (6.3)

If y = x \cdot \frac{1}{\ln(2)} is defined, Equation 6.3 becomes

2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor}    (6.4)

where 2^{\lfloor y \rfloor} can be implemented with a simple binary decoder, while 2^{y - \lfloor y \rfloor} can be precomputed and stored in a lookup table, with y − \lfloor y \rfloor ranging from 0 to 1.
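A Matlab sketch of the whole computation follows; the 256-entry table for the fractional power of two is an illustrative assumption:

    % Behavioral sketch of exp(x) via Equations 6.2-6.4.
    x   = -3.7;                              % example input
    y   = x * (1 / log(2));                  % 1/ln(2) precalculated
    yi  = floor(y);                          % 2^yi: simple binary decoder
    yf  = y - yi;                            % fractional part, 0 <= yf < 1
    tbl = 2 .^ ((0:255) / 256);              % precomputed table of 2^f
    e   = 2^yi * tbl(floor(yf * 256) + 1);   % approximates exp(x)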

If this approach does not provide enough accuracy, Tang's method, described in [Muller 1997], can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}    (6.5)

iff ad − bc ≠ 0, as explained in [Strang 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1 with minimized control logic is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N_0, and with a floating-point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach. If software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al 2011] uses coarse-grained parallelism, with the detection divided into eight units that can operate simultaneously, and this is very helpful in providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements, represented by fixed-point numbers with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor, capable of performing for instance division or square root calculations.

Since the design in [Chu and McAllister 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed-point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution to the detection problem, in [Eilert et al 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling as used in [Kim et al 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister 2012] and [Eilert et al 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices using wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate it into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jaldén. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/.

© Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 4: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

Abstract

To allow faster and more reliable wireless communication a technique is to usemultiple antennas in the transmitter and receiver This technique is called MIMOThe usage of MIMO adds complexity to the receiver that must determine whatthe transmitter actually sent This thesis focuses on hardware implementationsuitable for an FPGA of a detection algorithm called SUMIS

A background to detection and SUMIS in particular is given as a theoretical aidfor a better understanding of how an algorithm like this can be implemented Anintroduction to hardware and digital design is also presented

A subset of the operations in the SUMIS algorithm such as matrix inversion andsum of logarithmic values are analyzed and suitable hardware architectures arepresented These operations are implemented in RTL hardware using VHDL tar-geted for an FPGA Virtex-6 from Xilinx

The accuracy of the implemented operations is investigated showing promisingresults alongside of a presentation of the necessary resource usage

Finally other approaches to hardware implementation of detection algorithmsare discussed and more suitable approaches for a future implementation of SUMISare commented on The key aspects are flexibility through software reprogramma-bility and area efficiency by designing a custom processor architecture

iii

Acknowledgments

I would like to thank my examiner Daniel Persson and my supervisor MirsadČirkić at ISY for examining and providing feedback during this masterrsquos thesisIt has been interesting to hear about the problems associated with the subjectfrom another point of view rather than just my own

I would like to acknowledge everyone at Synective Labs in Gothenburg for thefriendly atmosphere and the possibility for discussions I also appreciate thefeedback from my opponent Emelie Nilsson which led to a better report

Gothenburg May 2013Tomas Frostensson

v

Contents

Notation

1 Introduction
    1.1 Background
    1.2 Goal
    1.3 Limitations
    1.4 Outline

2 Theory
    2.1 MIMO
    2.2 Detection
        2.2.1 Soft Detection
    2.3 SUMIS
        2.3.1 First Stage
        2.3.2 Second Stage
        2.3.3 Complexity Selection
    2.4 Number Representation
    2.5 Hardware Introduction
    2.6 Programmable Hardware
        2.6.1 Hardware Flow
        2.6.2 Reusable Modules

3 Problem Analysis
    3.1 Overview
    3.2 Matrix Multiplication
    3.3 Matrix Inversion
        3.3.1 LDLT Decomposition
        3.3.2 Reciprocal
        3.3.3 Forward Substitution
        3.3.4 Final Steps
    3.4 Log Sum of Exponentials

4 Methodology and Equipment
    4.1 Modeling
    4.2 VHDL
    4.3 RTL
    4.4 Hardware

5 Implementation
    5.1 Overview
    5.2 Matrix Multiplication
        5.2.1 IP Block Trade-offs
        5.2.2 Interface
        5.2.3 Example Implementation
    5.3 Matrix Inversion
        5.3.1 LDLT Decomposition
        5.3.2 Reciprocal Unit
        5.3.3 Forward Substitution
    5.4 Jacobi Logarithm

6 Result and Analysis
    6.1 Testing and Measurements
        6.1.1 Matrix Multiplication
        6.1.2 LDLT Decomposition
        6.1.3 Forward Substitution
        6.1.4 Jacobi Logarithm
    6.2 Resource Usage
        6.2.1 Matrix Multiplication
        6.2.2 Matrix Inversion
        6.2.3 Jacobi Logarithm
    6.3 Remaining Work
        6.3.1 Hyperbolic Tangent
        6.3.2 Exponential Function
        6.3.3 Additional Matrix Operations
        6.3.4 Control Structure
    6.4 Improvements
        6.4.1 Hardware Time-Multiplexing and Control
        6.4.2 Wordlength Optimization or Floating Point Implementation
        6.4.3 Design Space Exploration using High Level Synthesis
    6.5 Alternative Approaches and Comparison
    6.6 Insights from Alternative Approaches
        6.6.1 Number Representation
        6.6.2 Processor Architecture
        6.6.3 Flexibility
        6.6.4 Integration
    6.7 Final Conclusions

Bibliography

Notation

Number sets

Notation   Meaning
R          Set of real numbers
C          Set of complex numbers

Abbreviations

Abbreviation   Meaning
ASIC           Application-Specific Integrated Circuit
BRAM           Block RAM
CORDIC         Coordinate Rotation Digital Computer
FFT            Fast Fourier Transform
FPGA           Field Programmable Gate Array
HDL            Hardware Description Language
IEEE           Institute of Electrical and Electronics Engineers
IP             Intellectual Property
JTAG           Joint Test Action Group
LLR            Log-Likelihood Ratio
LUT            Lookup Table
MAC            Multiply and Accumulate
MIMO           Multiple-Input and Multiple-Output
OFDM           Orthogonal Frequency-Division Multiplexing
QAM            Quadrature Amplitude Modulation
RAM            Random Access Memory
RTL            Register Transfer Level
SIMD           Single Instruction Multiple Data
SNR            Signal-to-Noise Ratio
SUMIS          Subspace Marginalization with Interference Suppression
VHDL           VHSIC Hardware Description Language
VHSIC          Very High Speed Integrated Circuit

1 Introduction

One technique to improve wireless communication reliability as well as performance is to use multiple antennas in the transmitter and receiver. This technique is called MIMO.

Unfortunately this technique adds increased complexity to the receiver, since the receiver has to determine what was actually sent given the overlapping input from multiple antennas. Since this is a complex problem, efficient methods must be developed to cope with this complexity, given the strict real time demands of a communication system.

1.1 Background

The main area of this thesis is the implementation aspect of the detection algorithms used in the receiver of a MIMO system.

The background for this thesis is a detection algorithm described in the conference paper [Čirkić and Larsson 2012] and in more detail in the longer article [Čirkić and Larsson 2012]. These papers present a detection algorithm called SUMIS (subspace marginalization with interference suppression), which has shown promising results compared to other detection algorithms, at a lower complexity.

The high level description of the mathematics involved in the detection given in the mentioned papers does not disclose how it could efficiently be implemented in hardware for use in a real wireless system. Therefore this thesis will examine the implementation aspects of the proposed algorithm.

1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be discussed to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real time implementation in contemporary and future wireless systems.

Implementation-wise, the work should serve as a proof of concept, with discussion about possible improvements, rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and limit the work load associated with this thesis to a reasonable amount. The number of antennas supported is considered constant, and the modulation is fixed to 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation, and these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions to the problems. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The actual hardware implementation is presented in Chapter 5, where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

Figure 2.1: A MIMO system using Nt transmit and Nr receive antennas

A real valued MIMO channel can be seen as

y = Hs + e    (2.1)


where H ∈ R^(Nr×Nt). The matrix H denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver; therefore it contains Nr × Nt elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector s ∈ S^Nt contains the modulated symbols that the transmitter will try to send, where S is the set containing the possible symbols. The vector e ∈ R^Nr is the noise vector, e ~ N(0, (N0/2) I), containing additive Gaussian noise with zero mean and N0/2 variance. Finally, y ∈ R^Nr is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. A complex channel is more common, but as described in [Larsson and Jaldén 2008], every complex channel can, given a few prerequisites, be posed as a real model. This is straightforward since C^n is isomorphic to R^(2n). A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex valued model.

2.2 Detection

The principle of detection in MIMO systems is to determine s given y, as described in Equation 2.1. The channel matrix H is assumed to be known to the receiver, and in practice it often is, by estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of s without additional information, while soft detectors provide both an estimate of s and probability information for each bit in the symbols in s. This means that the detector provides information about how accurate the estimated s is on bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. This ratio tells how many times more likely one case is over the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign shows the hard detection, that is, whether the detector detected a zero or a one, while the magnitude of the ratio tells how accurate this detection is. The log-likelihood ratio is

    l(s_i \mid y) = \log \frac{\sum_{\forall s \in s : s_i = 1} \exp\left(-\frac{1}{N_0} \|y - Hs\|^2\right)}{\sum_{\forall s \in s : s_i = 0} \exp\left(-\frac{1}{N_0} \|y - Hs\|^2\right)}    (2.2)

given that the symbols are uniformly distributed, that is, it is equally probable that a zero or a one is sent.

The sums in Equation 2.2 are over the sets {∀s ∈ s : s_i = x}, which means all possible vectors s where the ith bit is x = 1 or x = 0 respectively.

The computational effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols in the constellation and exponentially with the number of transmit antennas Nt. If |S| is the number of possible symbols that s can contain, the complexity of the calculation will be proportional to |S|^Nt. This is the big limitation when it comes to MIMO detectors: with the constellation size growing as well as the number of antennas, the computational effort becomes impractical to deal with.
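
As an illustration of this growth, the following Python sketch evaluates Equation 2.2 by brute-force enumeration for a toy real-valued system with the two-symbol constellation S = {-1, +1}. The snippet is executable pseudo code for illustration only; it is not part of the implementation, and all names are made up for this example.

import itertools
import numpy as np

def exact_llrs(y, H, N0):
    # Brute-force soft detection: evaluate Equation 2.2 by enumerating
    # every candidate vector in S^Nt. Feasible only for tiny systems.
    Nt = H.shape[1]
    symbols = (-1.0, 1.0)            # toy constellation: bit 0 -> -1, bit 1 -> +1
    llrs = np.zeros(Nt)
    for i in range(Nt):
        num = den = 0.0              # sums over s_i = 1 and s_i = 0 respectively
        for cand in itertools.product(symbols, repeat=Nt):
            cand = np.asarray(cand)
            metric = np.exp(-np.linalg.norm(y - H @ cand) ** 2 / N0)
            if cand[i] > 0:
                num += metric
            else:
                den += metric
        llrs[i] = np.log(num) - np.log(den)
    return llrs

rng = np.random.default_rng(0)
Nt = Nr = 4
H = rng.standard_normal((Nr, Nt))
sent = rng.choice([-1.0, 1.0], size=Nt)
y = H @ sent + np.sqrt(0.1 / 2) * rng.standard_normal(Nr)   # N0 = 0.1
print(exact_llrs(y, H, 0.1))   # the signs should typically match the sent vector

Already with |S| = 2 and Nt = 4 the inner loop runs over 16 candidate vectors per bit; with 16-QAM and more antennas the enumeration becomes infeasible, which is precisely the motivation for SUMIS.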

Numerous methods that deal with this complexity by introducing approximations exist, such as the sphere decoding in [Chu and McAllister 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jaldén 2008], [Čirkić et al 2011], [Persson and Larsson 2011] and [Persson et al 2012]. Soft interference cancellation is described in [Lampe and Huber 1999] and [Choi et al 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

    y = Hs + \bar{H}\bar{s} + e    (2.3)

The partitioning can be used to group \bar{H}\bar{s} + e together and treat it as interference and noise.

The partition in Equation 2.3 is dependent on the parameter ns ∈ {1, ..., Nt}, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: H ∈ R^(Nr×ns), H̄ ∈ R^(Nr×(Nt−ns)), s ∈ S^ns and finally s̄ ∈ S^(Nt−ns).

The partitioning must be chosen so that the interesting bit si is contained in s. To be able to cover all of the available bits, Nt different partitions are necessary, so that at least one partition contains each interesting bit.

If ns = 1 it is easy to choose a partition for bit si, since only one exists, but for ns > 1 it is a more complex problem. In [Čirkić and Larsson 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product HᵀH. The goal is to minimize the impact of H̄s̄ + e on the selected columns that will be contained in H. This is achieved by selecting the column in HᵀH that contains the interesting bit, alongside the ns − 1 columns that contain the largest values intersecting the chosen column. This leaves the remaining columns to H̄, and the impact will be minimized.

2.3.1 First Stage

Given Equation 2.3 it is possible to choose an approximate model

    y \approx Hs + n    (2.4)

where n ~ N(0, Q) and Q = \bar{H}\bar{H}^T + \frac{N_0}{2} I.

The key point of Equation 2.4 is that the computations can be simplified by assuming that the interference from H̄s̄ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first stage of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

    \lambda_k = \log \frac{\sum_{\forall s \in s : s_k = 1} \exp\left(-\frac{1}{2} (y - Hs)^T Q^{-1} (y - Hs)\right)}{\sum_{\forall s \in s : s_k = 0} \exp\left(-\frac{1}{2} (y - Hs)^T Q^{-1} (y - Hs)\right)}    (2.5)

followed by

    E\{s_k \mid y\} = \tanh\left(\frac{\lambda_k}{2}\right)    (2.6)

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector s̄. The first step is defining a new model in which this vector is suppressed:

    y' \approx Hs + n'    (2.7)

where n' ~ N(0, Q') and Q' = \bar{H}\Phi\bar{H}^T + \frac{N_0}{2} I. The matrix Φ is the conditional covariance matrix of s̄ and is described as

    \Phi = E\{\bar{S}^2 \mid y\} - E\{\bar{S} \mid y\}^2    (2.8)

In Equation 2.8 the matrix S̄ is a diagonal matrix with the diagonal consisting of the elements from s̄. With all of these computations performed, the model can be assumed to be purified and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning ns dimensions instead of the original Nt dimensions. This computation is performed for each bit and is described by

    l(s_i \mid y) \approx \log \frac{\sum_{\forall s \in s : s_i = 1} \exp\left(-\frac{1}{2} (y' - Hs)^T Q'^{-1} (y' - Hs)\right)}{\sum_{\forall s \in s : s_i = 0} \exp\left(-\frac{1}{2} (y' - Hs)^T Q'^{-1} (y' - Hs)\right)}    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, ns is the complexity parameter of the algorithm and can be assumed to be much smaller than Nt. With ns = Nt the benefits of SUMIS are nonexistent, since the partition H then covers the whole channel matrix and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson 2012] further describes possible optimizations to minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^(-1) can be computed for all of the partitions by inverting one larger matrix of dimension Nt, followed by smaller inverses of dimension ns.

2.4 Number Representation

Throughout the thesis a fixed point number representation is used for the hardware implementation. A fixed point representation uses a limited number of bits to represent a decimal number. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

    X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i    (2.10)

which denotes the value of a number X represented by the N bits x_{N-1}, ..., x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range −2^(N−1) ≤ X ≤ 2^(N−1) − 1 can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as seen in

    X = 2^{-f} \cdot \left( -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i \right)    (2.11)

which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^(-f), which means that f bits have been allocated for the fractional part while the remaining N − f bits represent the integer part and sign.

The number can be in the range −2^(N−1−f) ≤ X ≤ 2^(N−1−f) − 2^(−f), in steps of 2^(−f). One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
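
As a small executable illustration of Equation 2.11, the Python sketch below quantizes a few numbers to a two's complement word; the 18-bit wordlength with 12 fractional bits mirrors the format used later in the implementation, but the helper names are made up for this example.

def to_fixed(x, N, f):
    # Quantize x to an N-bit two's complement word with f fractional bits,
    # returning the (signed) integer bit pattern after saturation.
    lo, hi = -(1 << (N - 1)), (1 << (N - 1)) - 1
    return max(lo, min(hi, round(x * (1 << f))))

def from_fixed(i, f):
    # Interpret the signed integer i as a value scaled by 2^-f.
    return i * 2.0 ** (-f)

N, f = 18, 12
for x in (0.5, -1.375, 3.14159):
    q = to_fixed(x, N, f)
    print(x, '->', q, '->', from_fixed(q, f))   # resolution is 2**-12 everywhere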

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two main areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less, because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will keep the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory, by selecting the correct address, providing data on the data port and asserting the write signal.

A more detailed introduction to digital design can, if necessary, be obtained from [Danielsson and Bengtsson 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. The truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs, but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are all that is necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks creates overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed into hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.

The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. Such blocks can be anything from a simple counter to a complete processor and can, in an analogy with the software world, be seen as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of the subset of operations, introduced in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

    AB = C    (3.1)

where A ∈ R^(M×L), B ∈ R^(L×N) and C ∈ R^(M×N).

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but instead introduce several additions and subtractions, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1: Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as HᵀH, some of the operations could be reduced, since the result will be symmetric around the diagonal. The drawback of these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that are necessary to compute.
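
A direct Python transcription of Algorithm 3.1 confirms the operation count; the code is for illustration only.

def matmul_naive(A, B):
    # Algorithm 3.1: one multiply-and-add per innermost iteration.
    M, L, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    macs = 0
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(L):
                acc += A[i][k] * B[k][j]
                macs += 1
            C[i][j] = acc
    return C, macs

I8 = [[float(i == j) for j in range(8)] for i in range(8)]   # 8x8 identity
C, macs = matmul_naive(I8, I8)
print(macs)   # 512 multiply-and-add operations for N = M = L = 8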

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed form formula exists for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^(-1), given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described by [Golub and Van Loan 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, in order to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson 2012].

The decomposition can be described by

    Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2: Algorithm for LDLT decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2 a temporary vector, denoted v, is required to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
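
The decomposition can be prototyped directly from Algorithm 3.2; the Python sketch below mirrors the pseudo code and checks the result against the definition Q = LDL^T. Setting the unit diagonal of L explicitly is done only for the check, since the algorithm itself never stores or reads it.

import numpy as np

def ldlt(Q):
    # Algorithm 3.2: LDL^T decomposition of a symmetric positive definite Q.
    # Returns the unit lower triangular L and the diagonal d of D.
    N = Q.shape[0]
    L = np.zeros((N, N))
    d = np.zeros(N)
    v = np.zeros(N)
    for i in range(N):
        acc = 0.0
        for j in range(i):
            v[j] = L[i, j] * d[j]
            acc += L[i, j] * v[j]
        v[i] = d[i] = Q[i, i] - acc
        rec = 1.0 / v[i]          # realized by the reciprocal unit in hardware
        for j in range(i + 1, N):
            acc = sum(L[j, k] * v[k] for k in range(i))
            L[j, i] = (Q[j, i] - acc) * rec
        L[i, i] = 1.0             # unit diagonal, implicit in the algorithm
    return L, d

A = np.random.default_rng(1).standard_normal((8, 8))
Q = A @ A.T + 8 * np.eye(8)       # symmetric positive definite test input
L, d = ldlt(Q)
print(np.allclose(L @ np.diag(d) @ L.T, Q))   # True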

3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

    f(x) = \frac{1}{x} - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

    x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

    x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 used for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
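
A short numerical experiment with Equation 3.5 illustrates why a small lookup table suffices: the iteration converges quadratically, roughly doubling the number of correct bits each step. The values below are chosen only for demonstration.

def reciprocal(d, x0, iterations):
    # Equation 3.5: x_{i+1} = x_i * (2 - d * x_i), converging towards 1/d.
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

d = 0.7                            # already scaled to 0.5 <= d < 1
for n in range(4):
    print(n, abs(reciprocal(d, x0=1.5, iterations=n) - 1.0 / d))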

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^(-1), since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^(-1) by solving the matrix equation

    L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^(-1).

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.

Algorithm 3.3: Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits of the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.

Algorithm 3.4: Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = -sum
    end for
end for
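
The optimized algorithm can be exercised directly in Python; the sketch below inverts a random unit lower triangular matrix and verifies that LX = I. It is a model for verification, not the hardware implementation.

import numpy as np

def invert_unit_lower(L):
    # Algorithm 3.4: forward substitution exploiting that e is the unit
    # matrix and that L has a unit diagonal (no divisions needed).
    N = L.shape[0]
    X = np.zeros((N, N))
    for i in range(N):
        X[i, i] = 1.0
        for j in range(i + 1, N):
            acc = L[j, i]
            for k in range(i + 1, j):
                acc += L[j, k] * X[k, i]
            X[j, i] = -acc
    return X

L = np.tril(np.random.default_rng(2).standard_normal((8, 8)), -1) + np.eye(8)
X = invert_unit_lower(L)
print(np.allclose(L @ X, np.eye(8)))   # True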

3.3.4 Final Steps

As of now, L^(-1) has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^(-1). This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector; since D is diagonal, D^(-1) consists of the reciprocal values of D.

The matrix inverse Q^(-1) can now be obtained by

    Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^(-T) is the transpose of L^(-1). With these final matrix multiplications, the inverse Q^(-1) has been calculated.

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.

Operation    Log Space
log(a * b)   log(a) + log(b)
log(a / b)   log(a) - log(b)
log(a^b)     b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

    \log(a + b) = \log(e^{\log(a)} + e^{\log(b)})    (3.8)

Note that a and b are not actually stored; instead their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might be very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the largest of the two probabilities. The rewrite yields

    \log(e^{\log(a)} + e^{\log(b)}) = \max(\log(a), \log(b)) + \log(1 + e^{-|\log(a) - \log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two and adding a logarithmic correction term to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
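
The rewrite can be verified directly in Python; the example confirms that Equation 3.9 reproduces log(a + b) while only ever handling quantities of moderate size.

import math

def jacobi_log(log_a, log_b):
    # Equation 3.9: log(a + b) computed from log(a) and log(b)
    # without ever leaving log space.
    return max(log_a, log_b) + math.log1p(math.exp(-abs(log_a - log_b)))

log_a, log_b = math.log(3e-8), math.log(5e-9)
print(jacobi_log(log_a, log_b))   # equals log(3e-8 + 5e-9)
print(math.log(3e-8 + 5e-9))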

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed point package included in the VHDL-2008 standard [IEEE 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics, instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations it entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, with the decimal point implicitly located just below position 0.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to perform trade-offs regarding performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be seen in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is HᵀH, and this case was chosen as the example.

The example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and Hᵀ simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
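
The content of such an address lookup table can be generated with a one-line mapping; the Python sketch below reproduces the stated sequence for an 8 × 8 row-major matrix and is shown only to make the addressing concrete.

N = 8
# Row-major address of element (addr % N, addr // N), i.e. column order.
lut = [(addr % N) * N + addr // N for addr in range(N * N)]
print(lut[:6], lut[-2:])   # [0, 8, 16, 24, 32, 40] ... [55, 63]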

Everything in the implementation is controlled by a control FSM, which contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit in Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a multiple of smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant one-bit of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps, to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a one-place left shift when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations, to allow for a higher operating frequency as well as to balance the paths. These are not shown in Figure 5.5, for clarity.

Figure 5.5: Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

    c = c \pm a \times b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit, performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be representable using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

    result = \max(\log(a), \log(b)) + \log(1 + e^{-x})    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(−x)). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^(−x)) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging from 0 to 8 in steps of 2^(−8).
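
The quantization described above can be modeled as follows; the table holds 2048 precomputed values of log(1 + e^(−x)) with x in steps of 2^(−8), and the index is saturated exactly as x would be saturated in the hardware. The Python model is a sketch of the intended behavior, not generated from the VHDL.

import math

FRAC_X = 8                                    # fractional bits of the index
TABLE = [math.log1p(math.exp(-i / 2 ** FRAC_X)) for i in range(2 ** 11)]

def jacobi_hw(log_a, log_b):
    # Figure 5.9: the sign of the difference selects the maximum, and the
    # absolute difference, saturated to 11 bits, addresses the table.
    diff = log_a - log_b
    bigger = log_b if diff < 0 else log_a
    index = min(int(abs(diff) * 2 ** FRAC_X), 2 ** 11 - 1)
    return bigger + TABLE[index]

a, b = 0.2, 0.15
print(jacobi_hw(math.log(a), math.log(b)))    # close to log(a + b)
print(math.log(a + b))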

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is also compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.
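As an illustration of this measurement, the mapping from a maximum absolute error to an equivalent number of fractional bits can be sketched as follows (a hypothetical Python helper mirroring the verification flow, not the scripts actually used in the thesis):

import math

def fractional_bit_accuracy(reference, hardware):
    # Largest individual error over all output elements
    max_err = max(abs(r - h) for r, h in zip(reference, hardware))
    # An error of about 2^-n corresponds to n accurate fractional bits
    return -math.log2(max_err) if max_err > 0 else float('inf')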

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits (2^(−12) ≈ 0.00024), the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and reuses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^(−8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)     10      416   2.4 %
DSP48E1                8      768   1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops           831   301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)      9      416   2.2 %
DSP48E1               19      768   2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops            30   301440   < 1 %
LUTs                 124   150720   < 1 %
Block RAM (36 Kb)      2      416   < 1 %
DSP48E1                1      768   < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops           180   301440   < 1 %
LUTs                 156   150720   < 1 %
Block RAM (36 Kb)      1      416   < 1 %
DSP48E1                0      768   0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
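A hedged sketch of the idea is given below as a floating point Python model (not a design decision from this thesis): the iteration count, the repeated indices 4 and 13 required for convergence of hyperbolic CORDIC, and the limited input range |x| < ~1.1 are assumptions that a real implementation must handle, for instance with range extension for larger arguments. One rotation-mode iteration produces sinh and cosh together, and the CORDIC gain cancels in the quotient; in hardware the two quantities may instead be computed in separate blocks as described above.

import math

def tanh_cordic(z, iterations=16):
    # Hyperbolic CORDIC, rotation mode: indices start at 1 and
    # indices 4 and 13 are repeated once to guarantee convergence
    indices = []
    k = 1
    while len(indices) < iterations:
        indices.append(k)
        if k in (4, 13):
            indices.append(k)
        k += 1
    x, y = 1.0, 0.0            # after the loop: x ~ K*cosh(z), y ~ K*sinh(z)
    for k in indices:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0**-k, y + d * x * 2.0**-k
        z -= d * math.atanh(2.0**-k)
    return y / x               # the gain K cancels in sinh/cosh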

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · (1/ln(2))) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) · 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
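A minimal Python model of this decomposition is sketched below, assuming an 8-bit fractional part for y; the table size and wordlength are illustrative assumptions, not design decisions from this thesis:

import math

FRAC = 8
INV_LN2 = 1.0 / math.log(2.0)              # precalculated 1/ln(2)

# Precomputed table of 2^f for f = 0, 2^-8, ..., 1 - 2^-8
table = [2.0 ** (i * 2.0**-FRAC) for i in range(2**FRAC)]

def exp_approx(x):
    y = x * INV_LN2                        # rebase from e to 2, Equation 6.2
    k = math.floor(y)                      # a binary decoder/shift in hardware
    f_index = int((y - k) * 2**FRAC)       # truncated index, 0 <= y - k < 1
    return math.ldexp(table[f_index], k)   # 2^k * 2^(y - floor(y))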

If this approach does not provide enough accuracy, Tang's method, described in [Muller 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[ a  b ]^(−1)                 [  d  −b ]
[ c  d ]        = 1/(ad − bc) [ −c   a ]    (6.5)

if and only if ad − bc ≠ 0, as explained in [Strang 2009].
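As a worked example, Equation 6.5 maps directly to a handful of multiplications and one reciprocal, which fits the reciprocal unit already implemented (a Python sketch with an illustrative function name; in a fixed point design the reciprocal would come from the Newton-Raphson unit):

def inverse_2x2(a, b, c, d):
    # Inverse of [[a, b], [c, d]] via Equation 6.5; requires ad - bc != 0
    det = a * d - b * c
    r = 1.0 / det              # one reciprocal, then four multiplications
    return (d * r, -b * r,
            -c * r, a * r)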

6.3.4 Control Structure

So far, separate modules have been described that solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al 2011] uses coarse grained parallelism, with the detection divided in eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, relying on QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, making a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
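To make the cost argument concrete, a schoolbook complex multiplication can be written out as below (an illustrative Python sketch; in hardware the four real multiplications can execute in parallel and the additions can be pipelined):

def complex_multiply(a_re, a_im, b_re, b_im):
    # Four real multiplications and two additions/subtractions
    return (a_re * b_re - a_im * b_im,
            a_re * b_im + a_im * b_re)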

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister 2012] and [Eilert et al 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as additions of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC. Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson

Page 5: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

Acknowledgments

I would like to thank my examiner Daniel Persson and my supervisor MirsadČirkić at ISY for examining and providing feedback during this masterrsquos thesisIt has been interesting to hear about the problems associated with the subjectfrom another point of view rather than just my own

I would like to acknowledge everyone at Synective Labs in Gothenburg for thefriendly atmosphere and the possibility for discussions I also appreciate thefeedback from my opponent Emelie Nilsson which led to a better report

Gothenburg May 2013Tomas Frostensson

v

Contents

Notation ix

1 Introduction 111 Background 112 Goal 213 Limitations 214 Outline 2

2 Theory 321 MIMO 322 Detection 4

221 Soft Detection 423 SUMIS 5

231 First Stage 6232 Second Stage 6233 Complexity Selection 7

24 Number Representation 725 Hardware Introduction 826 Programmable Hardware 9

261 Hardware Flow 9262 Reusable Modules 10

3 Problem Analysis 1131 Overview 1132 Matrix multiplication 1133 Matrix Inversion 12

331 LDLT Decomposition 12332 Reciprocal 14333 Forward Substitution 14334 Final Steps 16

34 Log Sum of Exponentials 16

4 Methodology and Equipment 19

vii

viii CONTENTS

41 Modeling 1942 VHDL 1943 RTL 2044 Hardware 20

5 Implementation 2351 Overview 2352 Matrix Multiplication 24

521 IP Block Trade-offs 24522 Interface 24523 Example Implementation 24

53 Matrix Inversion 26531 LDLT Decomposition 26532 Reciprocal Unit 28533 Forward Substitution 30

54 Jacobi Logarithm 33

6 Result and Analysis 3561 Testing and Measurements 35

611 Matrix Multiplication 35612 LDLT Decomposition 36613 Forward Substitution 36614 Jacobi Logarithm 36

62 Resource Usage 36621 Matrix Multiplication 36622 Matrix Inversion 37623 Jacobi Logarithm 38

63 Remaining Work 38631 Hyperbolic Tangent 38632 Exponential Function 39633 Additional Matrix Operations 39634 Control Structure 40

64 Improvements 40641 Hardware Time-Multiplexing and Control 40642 Wordlength Optimization or Floating Point Implementation 40643 Design Space Exploration using High Level Synthesis 41

65 Alternative Approaches and Comparison 4166 Insights from Alternative Approaches 42

661 Number Representation 42662 Processor Architecture 43663 Flexibility 43664 Integration 43

67 Final Conclusions 44

Bibliography 45

Notation

Number sets

Notation Meaning

R Set of real numbersC Set of complex numbers

Abbreviations

Abbreviation Meaning

ASIC Application-Specific Integrated CircuitBRAM Block RAM

CORDIC Coordinate Rotation Digital ComputerFFT Fast Fourier Transform

FPGA Field Programmable Gate ArrayHDL Hardware Description LanguageIEEE Institute of Electrical and Electronics Engineers

IP Intellectual PropertyJTAG Joint Test Action GroupLLR Log-Likelihood RatioLUT Lookup TableMAC Multiply and Accumulate

MIMO Multiple-Input and Multiple-OutputOFDM Orthogonal Frequency-Division MultiplexingQAM Quadrature Amplitude ModulationRAM Random Access MemoryRTL Register Transfer Level

SIMD Single Instruction Multiple DataSNR Signal-to-Noise Ratio

SUMIS Subspace Marginalization with Interference SuppressionVHDL VHSIC Hardware Description LanguageVHSIC Very High Speed Integrated Circuit

ix

1Introduction

One technique to improve wireless communication reliability as well as perfor-mance is to use multiple antennas in the transmitter and receiver and this tech-nique is called MIMO

Unfortunately this technique adds increased complexity to the receiver since thereceiver has to determine what was actually sent given the overlapping inputfrom multiple antennas Since this is a complex problem efficient methods mustbe developed to cope with this complexity given strict real time demands from acommunication system

11 Background

The main area of this thesis is the implementation aspect of detection algorithmsin the receiver used in a MIMO system

The background for this thesis is a detection algorithm described in the con-ference paper [Čirkić and Larsson 2012] and more detailed in the longer ar-ticle [Čirkić and Larsson 2012] These papers presents a detection algorithmcalled SUMIS (subspace marginalization with interference suppression) whichhas shown promising results compared to other detection algorithms with a lowercomplexity

The given high level description in the mentioned papers of the mathematicsinvolved in the detection does not disclose how this could efficiently be imple-mented in hardware for use in a real wireless system Therefore this thesis willexamine the implementation aspects of the proposed algorithm

1

2 1 Introduction

12 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures forthe implementation of a soft MIMO detector based on the SUMIS algorithm onan FPGA

The selected operations described in Chapter 3 of the SUMIS algorithm will beimplemented in hardware and discussed The implementation aspects of the al-gorithm will be discussed to see what must be taken into consideration whenimplementing such a detection algorithm

The algorithm will be evaluated to determine how suitable this algorithm is forreal time implementation in contemporary and future wireless systems

Implementation-wise it should serve as a proof of concept with discussion aboutpossible improvements rather than providing a solution ready for production

13 Limitations

Limitations have been made to reduce the complexity and limit the work loadassociated with this thesis to a reasonable amount The number of antennas sup-ported is considered constant and also the modulation chosen as 16-QAM sinceit affects the size of the numbers involved

The main limitation is that only a subset of the operations involved in the SUMISalgorithm has been considered for hardware implementation and these are de-scribed in Chapter 3

14 Outline

The thesis is divided in several chapters Chapter 2 describes the backgroundtheory that is useful for the understanding of the succeeding chapters

The selected problems that must be solved are described in Chapter 3 with ac-companying algorithms and possible solutions to the problems The hardwarethat was utilized and the methodology used for the implementation is describedin Chapter 4

The step of actual hardware implementation is presented in Chapter 5 where theindividual modules are described

Finally the results of the implementation measurements and comparisons withother implementations can be seen in Chapter 6 The chapter also contains dis-cussions about future work and implementation aspects of the SUMIS algorithm

2Theory

This chapter describes the background theory that is necessary to comprehendother sections of this thesis

21 MIMO

A MIMO communication system is a communication system that uses multipleantennas for transmission as well as for reception A basic setup of a MIMOsystem can be seen in Figure 21

R1

R2

RNr

Receiver

T1

T2

TNt

Transm

itter

Figure 21 A MIMO system using Nt transmit and Nr receive antennas

A real valued MIMO channel can be seen as

y = Hs + e (21)

3

4 2 Theory

where H isin RNrtimesNt The matrix H denotes the channel matrix Each entry of

the matrix is a possible path from the transmitter to the receiver Therefore itcontains Nr times Nt elements which are all the possible paths from the transmittingantennas to the receiving antennas The vector s isin SNt contains the modulatedsymbols that the transmitter will try to send where S is the set containing thepossible symbols The vector e isin RNr is the noise vector e sim N (0 N0

2 I) containingadditive Gaussian noise with zero mean and N0

2 variance Finally y isin RNr is the

vector with the received symbols as seen by the receiver

As mentioned before the MIMO channel described in Equation 21 is real valuedIt is more common with a complex channel but as described in [Larsson andJalden 2008] every complex channel given a few prerequisites can be posed as areal model This is straightforward since C

n is isomorphic to R2n A real model

is used since it simplifies the explanation of the SUMIS algorithm and this modelcan easily be derived from a complex valued model

22 Detection

The principle of detection in MIMO systems is to determine s given y describedin Equation 21 The channel matrix H is assumed to be known to the receiverand is often so in practice by estimation

Detection can be divided in two subcategories hard detection and soft detectionHard detectors give an estimate of s without additional information while softdetectors provide both an estimate of s and probability information for each bitin the symbols in s This means that the detector provide information of howaccurate the estimated s is on bit level

Since detectors in communication systems are commonly used together with acoding scheme this probability information is useful when trying to decode thereceived symbol If it is known to the decoder that a specific bit in the receivedsymbol has lower probability of being correct it can be possible to achieve a lowererror rate by inverting that bit

As the title of this thesis describes the focus lies mainly on soft detectors

221 Soft Detection

The information that the detector can provide the decoder with is the log-likelihoodratio LLR which is the logarithm of the likelihood ratio Likelihood ratio is a sta-tistical test to compare the fit of two models in this case if a zero or one wastransmitted given the received data This ratio tells how many more times likelyone case is over the other

With this ratio expressed for each of the received bits the decoder can use thisknowledge to decode the received data correctly With the ratio expressed in thelogarithmic domain the sign will show the hard detection thus if the detectordetected a zero or one while the magnitude of the ratio will tell how accurate this

23 SUMIS 5

detection is The log-likelihood ratio is

l(si |y) = log

sum

forallsisinssi=1exp

(minus 1N0y minusHs2

)sum

forallsisinssi=0exp

(minus 1N0y minusHs2

) (22)

given that the symbols are uniformly distributed thus equally probable that azero or one is being sent

The sums in Equation 22 are over the set s si = x which means all possiblevectors s where the ith bit is x = 0 or x = 1 respectively

The computation effort needed to calculate the log-likelihood ratio will growpolynomial with the number of possible symbols of the constellation and expo-nential with the number of transmitter antennas Nt If |S| is all of the possiblesymbols s can contain the complexity of the calculation will be proportional to|S|Nt This is the big limitation when it comes to MIMO detectors with the con-stellation size growing as well as the number of antennas the computation effortwill be impractical to deal with

Numerous methods to deal with this complexity by introducing approximationsexists such as sphere decoding in [Chu and McAllister 2012] The method thatis investigated further in this thesis is SUMIS which is introduced in [Čirkić andLarsson 2012] SUMIS is based upon a mix of two approaches partial marginal-ization and soft interference cancellation Partial marginalization is further de-scribed in [Larsson and Jalden 2008] [Čirkić et al 2011] [Persson and Larsson2011] and [Persson et al 2012] Soft interference cancellation is described in[Lampe and Huber 1999] and [Choi et al 2000]

23 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 21into

y = Hs + Hs + e (23)

The partitioning can be used to group together Hs + e and treat it as interferenceand noise

The partition in Equation 23 is dependent on the parameter ns isin 1 Ntwhich can be seen as a complexity parameter This complexity parameter deter-mines how much effort that will be put in to the detection algorithm The dimen-sions of the partitioned matrices will be as follows H isin R

Nrtimesns H isin RNrtimes(Ntminusns)

s isin Sns and finally s isin SNtminusns

The partitioning must be chosen so that the interesting bit si is contained by sTo be able to cover all of the available bits it means that it is necessary to haveNt different partitions to have at least one partition that contains each interestingbit

6 2 Theory

If ns = 1 it is easy to choose a partition for bit si since there exists only one but forns gt 1 it is a more complex problem In [Čirkić and Larsson 2012 Section 3C] asuitable approach to perform this selection is presented The approach is to basethe selection on the matrix product HTH The goal is to minimize the impact ofHs + e on the selected columns that will be contained in H This is achieved byselecting the column in HTH that contains the interesting bit along side with thens minus 1 columns that contains the largest values intersecting the chosen columnThis will leave the remaining columns to H and the impact will be minimized

231 First Stage

Given Equation 23 it is possible to choose an approximate model

y asymp Hs + n (24)

where n sim N (0Q) and Q = HHT + N02 I

The key point of Equation 24 is that computations can be simplified by assumingthat the interference from Hs can be seen as Gaussian noise With these assump-tions made it is possible to perform the first step of the SUMIS algorithm whichhas the purpose of reducing the impact of the interfering terms This is achievedby computing the conditional expected value of each bit approximately and thiscomputation is performed symbol-wise by first computing

λk = log

sum

forallsisinssk=1exp

(minus1

2 (y minusHs)TQminus1(y minusHs))

sumforallsisinssk=0

exp(minus1

2 (y minusHs)TQminus1(y minusHs)) (25)

followed by

Esk |y = tanh(λk

2

) (26)

232 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the inter-fering vector s The first step is defining a new model to suppress this vector andthis model is

yprime asymp Hs + nprime (27)

where nprime sim N (0Qprime) and Qprime = HΦHT + N02 I The matrix Φ is the conditional

covariance matrix of s and is described as

Φ = ES2|y minus ES|y2 (28)

In Equation 28 the matrix S is a diagonal matrix with the diagonal consisting ofthe elements from s With all of these computations performed the model canbe assumed to be purified and it is possible to calculate the desired LLRs Themain difference from Equation 22 is that these computations in SUMIS are overthe space spanning ns dimensions instead of the original Nt dimensions This

24 Number Representation 7

computation is performed for each bit and is described by

l(si |y) asymp log

sum

forallsisinssi=1exp

(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs))

sumforallsisinssi=0

exp(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs)) (29)

Since the LLRs are the information desired by the decoder the SUMIS algorithmhas completed its task

233 Complexity Selection

As can be seen in the previous sections ns is the complexity parameter of thealgorithm and can be assumed to be much smaller than Nt With ns = Nt thebenefits of SUMIS are non existing since H = H and the complete computation inEquation 22 will be performed The work in [Čirkić and Larsson 2012] furtherdescribes optimizations possible to minimize the computations needed and theseresults have been used when selecting the operations to be analysed One aspectis that the inverse Qminus1 can be computed for all of the partitions by inverting alarger matrix of dimension Nt followed by smaller inverses of dimension ns

24 Number Representation

Throughout the thesis a fixed point number representation is being used for thehardware implementation A fixed point number representation is used to repre-sent a decimal number using a limited number of bits The wordlength denotesthe number of bits used

To be able to understand how the number representation works it is possible tostart with how a regular integer is represented using tworsquos complement This canbe exemplified by

X = minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i (210)

which denotes the value of a number X represented by N bits xNminus1 x0

With a N -bit binary number as described in Equation 210 any integer in therange minus2Nminus1 le X le 2Nminus1 minus 1 can be represented

With the knowledge of how to represent whole numbers it is possible to move onto decimal numbers These numbers can be represented by allocating a numberof bits for the integer part of the number and the rest for the fractional part Thisis achieved by applying a scaling factor to the number and this can be seen in

X = 2minusf lowast (minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i) (211)

8 2 Theory

which also features a N -bit binary number like the one in Equation 210 but thistime representing a decimal number

The number represented by Equation 211 is scaled by 2minusf which means thatf bits has been allocated for the fractional part and the remaining N minus f bitsrepresent the integer part and sign

The number can be in the range minus2Nminus1minusf le X le 2Nminus1minusf minus2minusf in steps of 2minusf Onebig difference compared to a floating point representation is that the resolutionis constant over the whole number range

25 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis an intro-duction to digital design and hardware is necessary

Digital circuits can mainly be divided in two main areas combinatorial and se-quential Combinatorial circuits perform boolean algebra on a given set of inputto produce one or multiple output signals It has no memory and thus the outputis only dependent on the provided input Given the ability to express booleanalgebra many different kind of circuits can be constructed some examples areadders which can add two numbers and multiplexers that work as switches withmultiple inputs and one output

The drawback with purely combinatorial circuits is that they are state-less be-cause of the lack of memory Sequential logic on the other hand groups togethercombinatorial circuits with memory elements that allows the circuit to not onlytake into account the input signals but also the current state The basic memoryelement of a sequential circuit is called a flip-flop A common D-type flip-flophas a data input data output and a clock input The flip-flop will only changeits output value on the rising edge of the clock otherwise it will contain the oldvalue

With sequential logic it is possible to create more advanced circuits such as finitestate machines counters and registers A register is constructed using a flip-flopand a multiplexer and it has a load signal When the load signal is low the oldvalue will remain regardless of the clock signal When the load signal is high andthere is a rising clock edge a new value will be stored in the register

Random access memories are very important in digital circuits and heavily usedin this thesis Such memories are much more suitable than flip-flops when thereis a need to store greater amounts of data since they are more area efficient Thememories have an address port a data port and a write signal With an addressprovided the data stored at that particular address will be available on the dataport with a certain delay Using the write signal it is possible to store new datainto the memory by selecting the correct address provide data on the data portand asserting the write signal

26 Programmable Hardware 9

A more detailed introduction to digital design if necessary can be obtained from[Danielsson and Bengtsson 1996]

26 Programmable Hardware

When it comes to programmable hardware the current choice is often to use anFPGA An FPGA is a field-programmable gate array that can be configured toimplement almost any digital design

An FPGA is build up of small logic blocks that can be configured and connectedto each other to implement different functions Instead of using logic gates suchas AND OR and NOT boolean functions are represented by their truth tableThis truth table is stored in a small component called LUT The LUT is a lookuptable with the input variables to the boolean function connected as an addressand the output is the value stored in the truth table This allows a 4 input LUTto implement any boolean function with at maximum 4 inputs Additional LUTscan be interconnected to implement boolean functions with more inputs

An FPGA does not only contain LUTs but also flip-flops that can be connectedto the output of a LUT which makes it possible to implement sequential circuitsmentioned in Chapter 25 All of these small components can be connected al-most arbitrarily using a pre-existing routing network in the FPGA

These components are necessary for a simple FPGA to function but contempo-rary devices often include more hardware Since the interconnection betweenthe building blocks provide overhead the manufacturers often add additionalbuilding blocks that the customers are likely to use such as multipliers and ran-dom access memories If a memory were to be implemented using only flip-flopsthe overhead would be substantial and this would limit what else that can be im-plemented at the same time The same reasoning is valid for multipliers sincemultiplication is complex to implement with the aid of only LUTs Since multi-plication is a common operation the manufacturers are likely to include prefabri-cated blocks

261 Hardware Flow

From the designerrsquos point of view the hardware is described using a hardware de-scription language such as VHDL or Verilog The hardware is described in termsof software even though the code is supposed to be a description of hardwareand not be executed on the hardware itself The written code can be simulated asit is to verify the behaviour even if not everything that can be simulated can betransformed to hardware

The source code that describes the hardware can be synthesised into a netlist ofbuilding blocks such as LUTs and flip-flops appropriate for the targeted FPGAdevice This can be seen as an analogy to how a compiler compiles softwarewritten in a high-level language into a low-level language

10 2 Theory

The synthesised netlist can then be analysed by a tool referred to as place-and-route which organizes the building blocks into a structure suitable for the FPGAThe place-and-route then attempts to connect them using the routing networkavailable in the FPGA The result is a configuration file that can be loaded intothe FPGA using a configuration interface such as JTAG

262 Reusable Modules

With increasing demands on a fast time-to-market it has become more commonto reuse existing building blocks as much as possible These blocks are commonlyreferred to as IP cores or IP blocks where IP stands for intellectual propertyThese blocks can be anything from a simple counter to a complete processor andcan be seen in analogy to the software world as a library

This allows for a shorter implementation cycle since each IP blockrsquos functionalitycan be verified beforehand and the block can often easily be integrated with therest of the design

It is common for FPGA manufacturers to provide a collection of simpler IP coresthat can be used on their devices The form the IP block is delivered in varies itcan be for example readable VHDL code or an already synthesised netlist

3Problem Analysis

This chapter provide an analysis of a subset of the operations described in Chap-ter 31 that are needed for implementation of the SUMIS algorithm

31 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for fur-ther analysis and hardware implementation Since the algorithm relies heavilyon matrix operations such as matrix multiplication and matrix inversion thesesubproblems are described further in Chapter 32 and Chapter 33

Since probabilities are handled in the log-domain there exist problems that hasto be accounted for when summarizing them This is described in Chapter 34

32 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm Both matrix-matrix and matrix-vector multiplications are used heavily A standard matrixmultiplication is described by

AB = C (31)

where A isin RMtimesL B isin RLtimesN and C isin RMtimesN

A naive algorithm for matrix multiplication can be seen in Algorithm 31 Otheralgorithms exists that will reduce the number of multiplications but introduceseveral additions and subtractions instead that will affect the constant that isusually left out when discussing asymptotic complexity This implies that the

11

12 3 Problem Analysis

real benefit from a clever algorithm is only present when operating on very largematrices

Algorithm 31 Matrix multiplication - naive algorithm

for i = 1rarr M dofor j = 1rarr N do

sum = 0for k = 1rarr L do

sum = sum + A[i][k] lowast B[k][j]end forC[i][j] = sum

end forend for

If N = M = L = 8 the number of multiply-and-add will be 512 In some ofthe matrix multiplications such as HTH some of the operations could be reducedsince the result will be symmetric around the diagonal The drawback with thesereductions is that the same matrix-multiply unit could not as easily be shared be-tween the different operations The advantage of a general matrix multiplicationimplementation is that it is possible to reuse for all of the matrix multiplicationsof the same dimension that are necessary to compute

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed form formula exists for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition that factors the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q⁻¹, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLᵀ Decomposition

The chosen method of decomposition is the LDLᵀ decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLᵀ decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLᵀ decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites and thus be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDLᵀ    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and Lᵀ is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLᵀ decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the upper limit.

Algorithm 3.2 Algorithm for LDLᵀ decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
  sum = 0
  for j = 1 → i − 1 do
    v[j] = L[i][j] * d[j]
    sum = sum + L[i][j] * v[j]
  end for
  v[i] = d[i] = Q[i][i] − sum
  rec = 1 / v[i]
  for j = i + 1 → N do
    sum = 0
    for k = 1 → i − 1 do
      sum = sum + L[j][k] * v[k]
    end for
    L[j][i] = (Q[j][i] − sum) * rec
  end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.

3.3.2 Reciprocal

In the LDLᵀ decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n · (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x − d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_(i+1) = x_i − f(x_i)/f′(x_i)    (3.4)

where x_(i+1) is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_(i+1) = x_i(2 − d·x_i) = 2x_i − d·x_i²    (3.5)

The performance of this algorithm depends on how good the guess of x_i for the first iteration, that is x_0, is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
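A minimal C model of this scheme is sketched below (not from the thesis; the 64-entry seed table and the two iterations are illustrative assumptions). Each Newton-Raphson iteration roughly doubles the number of correct bits, so a seed that is good to a few bits quickly reaches the target precision:

/* Newton-Raphson reciprocal, Equation 3.5: x_{i+1} = x_i(2 - d*x_i).
   The seed is taken from a small table indexed by the bits of d below
   the leading one, assuming d is already scaled to 0.5 <= d < 1
   (the scaling itself is covered in Chapter 5.3.2). */
#define SEED_BITS 6

static double seed_table[1 << SEED_BITS];

void init_seed_table(void)
{
    for (int i = 0; i < (1 << SEED_BITS); i++) {
        /* midpoint of the interval of inputs that map to index i */
        double d = 0.5 + (i + 0.5) / (double)(1 << (SEED_BITS + 1));
        seed_table[i] = 1.0 / d;
    }
}

double reciprocal(double d)   /* requires 0.5 <= d < 1 */
{
    int idx = (int)((d - 0.5) * (1 << (SEED_BITS + 1)));
    double x = seed_table[idx];
    for (int i = 0; i < 2; i++)      /* each iteration roughly doubles */
        x = x * (2.0 - d * x);       /* the number of correct bits     */
    return x;
}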

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L⁻¹, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate L⁻¹ by solving the matrix equation

Lx_i = e_i    (3.6)

for i = 1, …, n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, …, x_n are the column vectors of L⁻¹.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.

Algorithm 33 Forward substitution - general algorithm

for i = 1rarr N dofor j = 1rarr N do

sum = 0for k = 1rarr j minus 1 do

sum = sum + L[j][k] lowast x[k][i]end forx[j][i] = (e[j][i] minus sum)L[j][j]

end forend for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, …, x_n) and e = (e_1, …, e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.

Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
  x[i][i] = 1
  for j = i + 1 → N do
    sum = L[j][i]
    for k = i + 1 → j − 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = −sum
  end for
end for

3.3.4 Final Steps

As of now, L⁻¹ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D⁻¹. This matrix can be obtained for free from the LDLᵀ decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D⁻¹ consists of the reciprocal values of D.

The matrix inverse Q⁻¹ can now be obtained by

Q⁻¹ = L⁻ᵀ D⁻¹ L⁻¹    (3.7)

where the matrix L⁻ᵀ is the transpose of L⁻¹. With these final matrix multiplications, the inverse Q⁻¹ has been calculated.
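The final multiplications can also be fused into one pass, since D⁻¹ is diagonal and L⁻¹ is triangular. The following C sketch (an illustration, not the thesis' implementation, which uses the general matrix multiplication unit) shows the computation element by element:

#define N 8

/* Qinv = Linv^T * Dinv * Linv (Equation 3.7).  Linv is the lower
   triangular inverse from the forward substitution and dinv holds the
   reciprocals of the diagonal of D, taken from the reciprocal unit.
   Element (i,j) is sum_k Linv[k][i] * dinv[k] * Linv[k][j]; since Linv
   is lower triangular, the sum can start at k = max(i, j). */
void assemble_inverse(const double Linv[N][N], const double dinv[N],
                      double Qinv[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = (i > j ? i : j); k < N; k++)
                sum += Linv[k][i] * dinv[k] * Linv[k][j];
            Qinv[i][j] = sum;
        }
}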

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.

Operation     Log space
log(a · b)    log(a) + log(b)
log(a / b)    log(a) − log(b)
log(a^b)      b · log(a)

Table 3.1 Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^log(a) + e^log(b))    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^log(a) + e^log(b)) = log(e^max(log(a), log(b)) · (1 + e^(−|log(a) − log(b)|)))
                         = max(log(a), log(b)) + log(1 + e^(−|log(a) − log(b)|))    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
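As a reference model, the whole operation is only a few lines of C (the direct log1p/exp evaluation here is what the table lookup replaces in hardware):

#include <math.h>

/* Jacobi logarithm, Equation 3.9: log(a + b) computed from log(a) and
   log(b) without leaving log space. */
double jacobi_log(double log_a, double log_b)
{
    double m = log_a > log_b ? log_a : log_b;
    double x = fabs(log_a - log_b);
    return m + log1p(exp(-x));   /* correction term is in [0, log(2)] */
}

Summing many log-domain probabilities then reduces to folding this operation over the values, which is the pattern the hardware unit in Chapter 5.4 is built around.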

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named


fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.
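The implicit binary point idea can be illustrated in software terms with a small C sketch (not from the thesis); the Q3.15 format below mirrors the sfixed(2 downto -15) type used in later chapters, with the point position kept by convention rather than carried by the type:

#include <stdint.h>
#include <stdio.h>

/* A signed Q3.15 number: 3 integer bits and 15 fraction bits stored in
   a plain integer, with the binary point implicit.  An arithmetic right
   shift is assumed for signed values. */
#define FRAC 15
typedef int32_t q3_15;

static q3_15  from_double(double v) { return (q3_15)(v * (1 << FRAC)); }
static double to_double(q3_15 v)    { return v / (double)(1 << FRAC); }

static q3_15 q_mul(q3_15 a, q3_15 b)
{
    /* full 64-bit product, then move the binary point back */
    return (q3_15)(((int64_t)a * b) >> FRAC);
}

int main(void)
{
    q3_15 a = from_double(1.25), b = from_double(-0.5);
    printf("%f\n", to_double(q_mul(a, b)));   /* prints -0.625000 */
    return 0;
}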

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and to be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAM blocks, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1 An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs simultaneously would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1 Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is HᵀH, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and Hᵀ simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus, the lookup table maps the sequence 0, 1, 2, …, 62, 63 onto 0, 8, 16, …, 55, 63.
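Expressed in software, the contents of this lookup table correspond to the following index mapping (transpose_addr is a hypothetical helper for illustration only; in hardware the mapping is stored as a table):

/* Port B address generation for the transposed read: the counter
   sequence 0, 1, ..., 63 is mapped to column order 0, 8, 16, ..., 55, 63
   for an 8x8 matrix stored row-wise. */
unsigned transpose_addr(unsigned counter)   /* counter in 0..63 */
{
    unsigned row = counter / 8;   /* becomes the column index after transpose */
    unsigned col = counter % 8;   /* becomes the row index after transpose    */
    return col * 8 + row;
}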

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2 Block diagram of the matrix multiplication implementation.

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLᵀ decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLᵀ Decomposition

The LDLᵀ decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row of L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3 Computation unit used in the LDLᵀ unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while still being able to write an individual element. This can be achieved using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLᵀ unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4 Block diagram of the LDLᵀ unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D⁻¹ matrix output

Table 5.1 Input and output ports of the LDLᵀ decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0, the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
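A C model of the scaling and indexing steps is sketched below (illustrative only; the 18-bit input width and 6-bit index are assumptions, not the implemented parameters):

#include <stdint.h>

/* The leading one of d is shifted up to the top bit position,
   corresponding to bit -1 of the scaled number; the bits directly
   below it form the table index.  The same shift count must be applied
   again to the table output to undo the scaling. */
#define WIDTH      18
#define INDEX_BITS 6

typedef struct { uint32_t scaled; int shift; } scaled_t;

scaled_t scale_input(uint32_t d)            /* d != 0, WIDTH bits */
{
    int msb = WIDTH - 1;
    while (!((d >> msb) & 1)) msb--;        /* locate the leading one   */
    int shift = (WIDTH - 1) - msb;          /* left shifts to normalize */
    scaled_t r = { d << shift, shift };
    return r;
}

unsigned table_index(scaled_t s)
{
    /* drop the always-set top bit, keep the next INDEX_BITS bits */
    return (s.scaled >> (WIDTH - 1 - INDEX_BITS)) & ((1u << INDEX_BITS) - 1);
}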

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a shift left by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

Figure 5.5 Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load and is used as a component in the LDLᵀ decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLᵀ decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6 Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7 Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4 Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^(−x))    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(−x)). A graph of this function can be seen in Figure 5.8.

Figure 5.8 The function log(1 + e^(−x)) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x so that it contains only log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2⁻⁸.
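A C sketch of how such a table could be precomputed (the quantization of the output to 16 bits is omitted; double precision is used for clarity):

#include <math.h>

/* Precomputation of the 2048-entry correction table: the 11-bit address
   is the saturated x in steps of 2^-8 over [0, 8). */
#define TABLE_SIZE 2048
static double corr_table[TABLE_SIZE];

void init_corr_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        double x = i / 256.0;                /* x = i * 2^-8 */
        corr_table[i] = log1p(exp(-x));      /* log(1 + e^-x) */
    }
}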

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9 Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5 Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLᵀ Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns farther right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2⁻⁸.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes HᵀH as described in Chapter 5.2.3.

Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1 Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLᵀ Decomposition

The resource usage of the LDLᵀ decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2 Resource usage of the LDLᵀ decomposition unit.

The maximum operating frequency of the LDLᵀ decomposition is 101 MHz. The reason for this rather low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3 Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4 Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6, the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x)/cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm, it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) / ln(2)) = 2^(x · 1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
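A C model of this scheme might look as follows (a sketch under assumed parameters: a 256-entry fraction table and a truncating index):

#include <math.h>

/* e^x via Equations 6.2-6.4: y = x / ln(2), then
   2^y = 2^floor(y) * 2^(y - floor(y)).  The first factor is a binary
   shift (a decoder in hardware); the second comes from a table over [0, 1). */
#define EXP_BITS 8
static double exp2_frac[1 << EXP_BITS];

void init_exp2_table(void)
{
    for (int i = 0; i < (1 << EXP_BITS); i++)
        exp2_frac[i] = pow(2.0, i / (double)(1 << EXP_BITS));
}

double exp_approx(double x)
{
    double y = x * 1.4426950408889634;       /* precalculated 1/ln(2) */
    double f = floor(y);
    int idx = (int)((y - f) * (1 << EXP_BITS));
    return ldexp(exp2_frac[idx], (int)f);    /* multiply by 2^floor(y) */
}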

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated further instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]⁻¹ = 1/(ad − bc) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
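A direct translation of Equation 6.5 into C (a sketch, not a proposed hardware structure) makes the cost visible: one reciprocal, which the unit from Chapter 5.3.2 could provide, followed by four multiplications:

/* Closed-form 2x2 inverse, Equation 6.5.  Returns -1 when the
   determinant ad - bc is zero and no inverse exists. */
int inv2x2(const double m[2][2], double inv[2][2])
{
    double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    if (det == 0.0)
        return -1;               /* singular matrix */
    double r = 1.0 / det;        /* one reciprocal, then four multiplies */
    inv[0][0] =  m[1][1] * r;
    inv[0][1] = -m[0][1] * r;
    inv[1][0] = -m[1][0] * r;
    inv[1][1] =  m[0][0] * r;
    return 0;
}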

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable cost of interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations could allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such minimization is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be usable for all of the modules.

The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined, and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section, these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

As of now, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
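The expansion referred to is, in C terms (a plain reference model):

/* (a + bi)(c + di) = (ac - bd) + (ad + bc)i: four real multiplications
   and two additions/subtractions.  In hardware the four products are
   independent and can be formed by parallel, pipelined multipliers. */
typedef struct { double re, im; } cplx;

cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im,    /* real part */
               a.re * b.im + a.im * b.re };  /* imaginary part */
    return r;
}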

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture, as described in Chapter 6.6.2, were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger system on chips are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger system on chips.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that need to be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Goal
  1.3 Limitations
  1.4 Outline

2 Theory
  2.1 MIMO
  2.2 Detection
    2.2.1 Soft Detection
  2.3 SUMIS
    2.3.1 First Stage
    2.3.2 Second Stage
    2.3.3 Complexity Selection
  2.4 Number Representation
  2.5 Hardware Introduction
  2.6 Programmable Hardware
    2.6.1 Hardware Flow
    2.6.2 Reusable Modules

3 Problem Analysis
  3.1 Overview
  3.2 Matrix multiplication
  3.3 Matrix Inversion
    3.3.1 LDLT Decomposition
    3.3.2 Reciprocal
    3.3.3 Forward Substitution
    3.3.4 Final Steps
  3.4 Log Sum of Exponentials

4 Methodology and Equipment
  4.1 Modeling
  4.2 VHDL
  4.3 RTL
  4.4 Hardware

5 Implementation
  5.1 Overview
  5.2 Matrix Multiplication
    5.2.1 IP Block Trade-offs
    5.2.2 Interface
    5.2.3 Example Implementation
  5.3 Matrix Inversion
    5.3.1 LDLT Decomposition
    5.3.2 Reciprocal Unit
    5.3.3 Forward Substitution
  5.4 Jacobi Logarithm

6 Result and Analysis
  6.1 Testing and Measurements
    6.1.1 Matrix Multiplication
    6.1.2 LDLT Decomposition
    6.1.3 Forward Substitution
    6.1.4 Jacobi Logarithm
  6.2 Resource Usage
    6.2.1 Matrix Multiplication
    6.2.2 Matrix Inversion
    6.2.3 Jacobi Logarithm
  6.3 Remaining Work
    6.3.1 Hyperbolic Tangent
    6.3.2 Exponential Function
    6.3.3 Additional Matrix Operations
    6.3.4 Control Structure
  6.4 Improvements
    6.4.1 Hardware Time-Multiplexing and Control
    6.4.2 Wordlength Optimization or Floating Point Implementation
    6.4.3 Design Space Exploration using High Level Synthesis
  6.5 Alternative Approaches and Comparison
  6.6 Insights from Alternative Approaches
    6.6.1 Number Representation
    6.6.2 Processor Architecture
    6.6.3 Flexibility
    6.6.4 Integration
  6.7 Final Conclusions

Bibliography

Notation

Number sets

Notation   Meaning
R          Set of real numbers
C          Set of complex numbers

Abbreviations

Abbreviation   Meaning
ASIC           Application-Specific Integrated Circuit
BRAM           Block RAM
CORDIC         Coordinate Rotation Digital Computer
FFT            Fast Fourier Transform
FPGA           Field Programmable Gate Array
HDL            Hardware Description Language
IEEE           Institute of Electrical and Electronics Engineers
IP             Intellectual Property
JTAG           Joint Test Action Group
LLR            Log-Likelihood Ratio
LUT            Lookup Table
MAC            Multiply and Accumulate
MIMO           Multiple-Input and Multiple-Output
OFDM           Orthogonal Frequency-Division Multiplexing
QAM            Quadrature Amplitude Modulation
RAM            Random Access Memory
RTL            Register Transfer Level
SIMD           Single Instruction Multiple Data
SNR            Signal-to-Noise Ratio
SUMIS          Subspace Marginalization with Interference Suppression
VHDL           VHSIC Hardware Description Language
VHSIC          Very High Speed Integrated Circuit

1 Introduction

One technique to improve wireless communication reliability as well as performance is to use multiple antennas in the transmitter and receiver, and this technique is called MIMO.

Unfortunately, this technique adds increased complexity to the receiver, since the receiver has to determine what was actually sent given the overlapping input from multiple antennas. Since this is a complex problem, efficient methods must be developed to cope with this complexity, given the strict real time demands of a communication system.

1.1 Background

The main area of this thesis is the implementation aspect of detection algorithms in the receiver used in a MIMO system.

The background for this thesis is a detection algorithm described in the conference paper [Čirkić and Larsson, 2012] and in more detail in the longer article [Čirkić and Larsson, 2012]. These papers present a detection algorithm called SUMIS (subspace marginalization with interference suppression), which has shown promising results at a lower complexity compared to other detection algorithms.

The high level description of the mathematics given in the mentioned papers does not disclose how the detection could be implemented efficiently in hardware for use in a real wireless system. Therefore this thesis will examine the implementation aspects of the proposed algorithm.


1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be discussed to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real time implementation in contemporary and future wireless systems.

Implementation-wise, it should serve as a proof of concept, with discussion about possible improvements, rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and limit the workload associated with this thesis to a reasonable amount. The number of antennas supported is considered constant, and the modulation is fixed to 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation, and these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions to the problems. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The actual hardware implementation is presented in Chapter 5, where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

Figure 2.1: A MIMO system using $N_t$ transmit and $N_r$ receive antennas

A real valued MIMO channel can be seen as

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{e} \qquad (2.1)$$

where $\mathbf{H} \in \mathbb{R}^{N_r \times N_t}$. The matrix $\mathbf{H}$ denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver; therefore it contains $N_r \times N_t$ elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector $\mathbf{s} \in \mathbb{S}^{N_t}$ contains the modulated symbols that the transmitter will try to send, where $\mathbb{S}$ is the set containing the possible symbols. The vector $\mathbf{e} \in \mathbb{R}^{N_r}$ is the noise vector, $\mathbf{e} \sim \mathcal{N}(0, \frac{N_0}{2}\mathbf{I})$, containing additive Gaussian noise with zero mean and variance $\frac{N_0}{2}$. Finally, $\mathbf{y} \in \mathbb{R}^{N_r}$ is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. It is more common with a complex channel, but as described in [Larsson and Jalden, 2008], every complex channel, given a few prerequisites, can be posed as a real model. This is straightforward since $\mathbb{C}^n$ is isomorphic to $\mathbb{R}^{2n}$. A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex valued model.

2.2 Detection

The principle of detection in MIMO systems is to determine $\mathbf{s}$ given $\mathbf{y}$, described in Equation 2.1. The channel matrix $\mathbf{H}$ is assumed to be known to the receiver, which is often the case in practice through estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of $\mathbf{s}$ without additional information, while soft detectors provide both an estimate of $\mathbf{s}$ and probability information for each bit in the symbols in $\mathbf{s}$. This means that the detector provides information about how accurate the estimated $\mathbf{s}$ is at bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. This ratio tells how many times more likely one case is than the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign will show the hard detection, thus whether the detector detected a zero or a one, while the magnitude of the ratio will tell how accurate this detection is. The log-likelihood ratio is

$$l(s_i|\mathbf{y}) = \log \frac{\sum_{\forall \mathbf{s}\,:\,s_i=1} \exp\left(-\frac{1}{N_0}\|\mathbf{y}-\mathbf{H}\mathbf{s}\|^2\right)}{\sum_{\forall \mathbf{s}\,:\,s_i=0} \exp\left(-\frac{1}{N_0}\|\mathbf{y}-\mathbf{H}\mathbf{s}\|^2\right)} \qquad (2.2)$$

given that the symbols are uniformly distributed, thus it is equally probable that a zero or a one is sent.

The sums in Equation 2.2 are over the set $\{\mathbf{s} : s_i = x\}$, which means all possible vectors $\mathbf{s}$ where the $i$th bit is $x = 0$ or $x = 1$, respectively.

The computational effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols in the constellation and exponentially with the number of transmit antennas $N_t$. If $|\mathbb{S}|$ is the number of possible symbols an element of $\mathbf{s}$ can take, the complexity of the calculation is proportional to $|\mathbb{S}|^{N_t}$. This is the big limitation when it comes to MIMO detectors: with the constellation size growing as well as the number of antennas, the computational effort becomes impractical to deal with.
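To make this concrete, the following Python sketch evaluates Equation 2.2 by exhaustive enumeration for a small real-valued model. It is a model of the equation only, not of the implementation: for illustration it treats the sign of each symbol as the single bit of interest, and the constellation, dimensions and noise level are arbitrary example choices.

```python
import itertools
import numpy as np

def exact_llrs(y, H, N0, S=(-3.0, -1.0, 1.0, 3.0)):
    """Brute-force LLRs per Equation 2.2 by enumerating all |S|^Nt
    symbol vectors. The "bit" of symbol i is illustrated by its sign."""
    Nt = H.shape[1]
    num = np.zeros(Nt)  # accumulates terms with bit i = 1
    den = np.zeros(Nt)  # accumulates terms with bit i = 0
    for s in itertools.product(S, repeat=Nt):  # |S|^Nt iterations
        s = np.asarray(s)
        w = np.exp(-np.linalg.norm(y - H @ s) ** 2 / N0)
        for i in range(Nt):
            (num if s[i] > 0 else den)[i] += w
    return np.log(num) - np.log(den)

# Already at Nt = 4 with four real amplitudes this costs 4^4 = 256 terms;
# larger setups quickly exceed any real time budget.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))
y = rng.standard_normal(4)
print(exact_llrs(y, H, N0=2.0))
```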

Numerous methods exist to deal with this complexity by introducing approximations, such as sphere decoding in [Chu and McAllister, 2012]. The method investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based on a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jalden, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

$$\mathbf{y} = \bar{\mathbf{H}}\bar{\mathbf{s}} + \tilde{\mathbf{H}}\tilde{\mathbf{s}} + \mathbf{e} \qquad (2.3)$$

The partitioning can be used to group together $\tilde{\mathbf{H}}\tilde{\mathbf{s}} + \mathbf{e}$ and treat it as interference and noise.

The partition in Equation 2.3 is dependent on the parameter $n_s \in \{1, \ldots, N_t\}$, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: $\bar{\mathbf{H}} \in \mathbb{R}^{N_r \times n_s}$, $\tilde{\mathbf{H}} \in \mathbb{R}^{N_r \times (N_t - n_s)}$, $\bar{\mathbf{s}} \in \mathbb{S}^{n_s}$ and finally $\tilde{\mathbf{s}} \in \mathbb{S}^{N_t - n_s}$.

The partitioning must be chosen so that the interesting bit $s_i$ is contained in $\bar{\mathbf{s}}$. To be able to cover all of the available bits, it is necessary to have $N_t$ different partitions, so that at least one partition contains each interesting bit.


If $n_s = 1$ it is easy to choose a partition for bit $s_i$, since only one exists, but for $n_s > 1$ it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product $\mathbf{H}^T\mathbf{H}$. The goal is to minimize the impact of $\tilde{\mathbf{H}}\tilde{\mathbf{s}} + \mathbf{e}$ on the selected columns that will be contained in $\bar{\mathbf{H}}$. This is achieved by selecting the column in $\mathbf{H}^T\mathbf{H}$ that contains the interesting bit, alongside the $n_s - 1$ columns that contain the largest values intersecting the chosen column. This leaves the remaining columns to $\tilde{\mathbf{H}}$, and the impact is minimized.
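A minimal sketch of this selection rule is shown below, under the assumption that one partition is formed per interesting column; indexing details beyond what the papers state are illustrative choices.

```python
import numpy as np

def select_partition(H, i, ns):
    """Sketch of the column selection: always keep column i, then add
    the ns - 1 columns whose entries in column i of G = H^T H have the
    largest magnitude."""
    G = H.T @ H
    score = np.abs(G[:, i]).astype(float)
    score[i] = -np.inf                        # column i is always chosen
    extra = np.argsort(score)[::-1][:ns - 1]  # largest intersecting values
    chosen = np.sort(np.concatenate(([i], extra)))
    rest = np.setdiff1d(np.arange(H.shape[1]), chosen)
    return H[:, chosen], H[:, rest]           # H-bar and H-tilde

rng = np.random.default_rng(1)
H = rng.standard_normal((8, 8))
H_bar, H_tilde = select_partition(H, i=2, ns=3)
print(H_bar.shape, H_tilde.shape)             # (8, 3) (8, 5)
```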

2.3.1 First Stage

Given Equation 2.3 it is possible to choose an approximate model

$$\mathbf{y} \approx \bar{\mathbf{H}}\bar{\mathbf{s}} + \mathbf{n} \qquad (2.4)$$

where $\mathbf{n} \sim \mathcal{N}(0, \mathbf{Q})$ and $\mathbf{Q} = \tilde{\mathbf{H}}\tilde{\mathbf{H}}^T + \frac{N_0}{2}\mathbf{I}$.

The key point of Equation 2.4 is that computations can be simplified by assuming that the interference from $\tilde{\mathbf{H}}\tilde{\mathbf{s}}$ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

$$\lambda_k = \log \frac{\sum_{\forall \bar{\mathbf{s}}\,:\,\bar{s}_k=1} \exp\left(-\frac{1}{2}(\mathbf{y}-\bar{\mathbf{H}}\bar{\mathbf{s}})^T\mathbf{Q}^{-1}(\mathbf{y}-\bar{\mathbf{H}}\bar{\mathbf{s}})\right)}{\sum_{\forall \bar{\mathbf{s}}\,:\,\bar{s}_k=0} \exp\left(-\frac{1}{2}(\mathbf{y}-\bar{\mathbf{H}}\bar{\mathbf{s}})^T\mathbf{Q}^{-1}(\mathbf{y}-\bar{\mathbf{H}}\bar{\mathbf{s}})\right)} \qquad (2.5)$$

followed by

$$\mathrm{E}\{\bar{s}_k\,|\,\mathbf{y}\} = \tanh\left(\frac{\lambda_k}{2}\right) \qquad (2.6)$$

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector $\tilde{\mathbf{s}}$. The first step is defining a new model that suppresses this vector:

$$\mathbf{y}' \approx \bar{\mathbf{H}}\bar{\mathbf{s}} + \mathbf{n}' \qquad (2.7)$$

where $\mathbf{n}' \sim \mathcal{N}(0, \mathbf{Q}')$ and $\mathbf{Q}' = \tilde{\mathbf{H}}\boldsymbol{\Phi}\tilde{\mathbf{H}}^T + \frac{N_0}{2}\mathbf{I}$. The matrix $\boldsymbol{\Phi}$ is the conditional covariance matrix of $\tilde{\mathbf{s}}$ and is described as

$$\boldsymbol{\Phi} = \mathrm{E}\{\tilde{\mathbf{S}}^2|\mathbf{y}\} - \mathrm{E}\{\tilde{\mathbf{S}}|\mathbf{y}\}^2 \qquad (2.8)$$

In Equation 2.8 the matrix $\tilde{\mathbf{S}}$ is a diagonal matrix with the diagonal consisting of the elements from $\tilde{\mathbf{s}}$. With all of these computations performed, the model can be assumed to be purified, and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning $n_s$ dimensions instead of the original $N_t$ dimensions. This computation is performed for each bit and is described by

$$l(s_i|\mathbf{y}) \approx \log \frac{\sum_{\forall \bar{\mathbf{s}}\,:\,\bar{s}_i=1} \exp\left(-\frac{1}{2}(\mathbf{y}'-\bar{\mathbf{H}}\bar{\mathbf{s}})^T\mathbf{Q}'^{-1}(\mathbf{y}'-\bar{\mathbf{H}}\bar{\mathbf{s}})\right)}{\sum_{\forall \bar{\mathbf{s}}\,:\,\bar{s}_i=0} \exp\left(-\frac{1}{2}(\mathbf{y}'-\bar{\mathbf{H}}\bar{\mathbf{s}})^T\mathbf{Q}'^{-1}(\mathbf{y}'-\bar{\mathbf{H}}\bar{\mathbf{s}})\right)} \qquad (2.9)$$

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, $n_s$ is the complexity parameter of the algorithm and can be assumed to be much smaller than $N_t$. With $n_s = N_t$ the benefits of SUMIS are nonexistent, since $\bar{\mathbf{H}} = \mathbf{H}$ and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations to minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse $\mathbf{Q}^{-1}$ can be computed for all of the partitions by inverting a larger matrix of dimension $N_t$, followed by smaller inverses of dimension $n_s$.

2.4 Number Representation

Throughout the thesis, a fixed point number representation is used for the hardware implementation. A fixed point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

$$X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i \qquad (2.10)$$

which denotes the value of a number $X$ represented by $N$ bits $x_{N-1}, \ldots, x_0$.

With an $N$-bit binary number as described in Equation 2.10, any integer in the range $-2^{N-1} \le X \le 2^{N-1} - 1$ can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, and this can be seen in

$$X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right) \qquad (2.11)$$

which also features an $N$-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by $2^{-f}$, which means that $f$ bits have been allocated for the fractional part, and the remaining $N - f$ bits represent the integer part and sign.

The number can be in the range $-2^{N-1-f} \le X \le 2^{N-1-f} - 2^{-f}$, in steps of $2^{-f}$. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
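The following sketch mirrors Equation 2.11: a real number is quantized to an N-bit two's complement word with f fractional bits, saturating at the range limits; the word sizes in the example are arbitrary.

```python
def to_fixed(x, N, f):
    """Value representable by an N-bit two's complement word with f
    fractional bits (Equation 2.11), with saturation at the extremes."""
    lo = -2 ** (N - 1)             # most negative stored integer
    hi = 2 ** (N - 1) - 1          # most positive stored integer
    stored = max(lo, min(hi, round(x * 2 ** f)))
    return stored * 2.0 ** -f

# N = 18, f = 12: resolution 2^-12, range -32 .. 32 - 2^-12
print(to_fixed(3.14159, N=18, f=12))   # 3.1416015625
print(to_fixed(100.0, N=18, f=12))     # saturates at 31.999755859375
```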

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform Boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express Boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are stateless, because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements that allow the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will retain the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal, it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, Boolean functions are represented by their truth table. This truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the Boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4 input LUT to implement any Boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement Boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in code form, but even though it looks like software, the code is a description of hardware and is not executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed into hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.

The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, outlined in Chapter 3.1, that are needed for implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$$\mathbf{A}\mathbf{B} = \mathbf{C} \qquad (3.1)$$

where $\mathbf{A} \in \mathbb{R}^{M \times L}$, $\mathbf{B} \in \mathbb{R}^{L \times N}$ and $\mathbf{C} \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1: Matrix multiplication - naive algorithm

for i = 1 → M do
  for j = 1 → N do
    sum = 0
    for k = 1 → L do
      sum = sum + A[i][k] * B[k][j]
    end for
    C[i][j] = sum
  end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $\mathbf{H}^T\mathbf{H}$, the number of operations could be reduced, since the result is symmetric around the diagonal. The drawback of such reductions is that the same matrix multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.
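As a reference model for such a unit, a direct transcription of Algorithm 3.1 is shown below, together with the multiply-and-add count; the 8×8 size matches the matrices in this thesis.

```python
import numpy as np

def matmul_naive(A, B):
    """Direct transcription of Algorithm 3.1. Returns C = AB and the
    number of multiply-and-add operations performed."""
    M, L = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    macs = 0
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(L):
                acc += A[i, k] * B[k, j]
                macs += 1
            C[i, j] = acc
    return C, macs

H = np.arange(64.0).reshape(8, 8)
C, macs = matmul_naive(H.T, H)       # e.g. the H^T H product
print(macs)                          # 512 for N = M = L = 8
print(np.allclose(C, H.T @ H))       # True
```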

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the original sought inverse matrix.

The following sections describe the steps involved to calculate the inverse, denoted $\mathbf{Q}^{-1}$, given an original positive definite matrix $\mathbf{Q}$, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of LDLT decomposition compared to Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$$\mathbf{Q} = \mathbf{L}\mathbf{D}\mathbf{L}^T \qquad (3.2)$$

where $\mathbf{L}$ is a lower triangular matrix, $\mathbf{D}$ is a diagonal matrix containing only positive elements, and $\mathbf{L}^T$ is the transpose of $\mathbf{L}$. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix $\mathbf{Q}$ is of dimension $N$. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2: Algorithm for LDLT decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
  sum = 0
  for j = 1 → i - 1 do
    v[j] = L[i][j] * d[j]
    sum = sum + L[i][j] * v[j]
  end for
  v[i] = d[i] = Q[i][i] - sum
  rec = 1 / v[i]
  for j = i + 1 → N do
    sum = 0
    for k = 1 → i - 1 do
      sum = sum + L[j][k] * v[k]
    end for
    L[j][i] = (Q[j][i] - sum) * rec
  end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
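For reference, a Python transcription of Algorithm 3.2 is given below; it keeps the separate v and d vectors exactly as the pseudocode does, sets the implicit unit diagonal of L explicitly, and is checked against Q = L·diag(d)·L^T.

```python
import numpy as np

def ldlt(Q):
    """LDL^T decomposition per Algorithm 3.2: unit lower triangular L
    and positive diagonal d, with an explicit temporary vector v."""
    N = Q.shape[0]
    L = np.zeros((N, N))
    d = np.zeros(N)
    v = np.zeros(N)
    for i in range(N):
        acc = 0.0
        for j in range(i):
            v[j] = L[i, j] * d[j]
            acc += L[i, j] * v[j]
        v[i] = d[i] = Q[i, i] - acc
        L[i, i] = 1.0                    # implicit in the pseudocode
        rec = 1.0 / v[i]                 # the reciprocal unit's task
        for j in range(i + 1, N):
            acc = 0.0
            for k in range(i):
                acc += L[j, k] * v[k]
            L[j, i] = (Q[j, i] - acc) * rec
    return L, d

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
Q = A @ A.T + 8 * np.eye(8)              # symmetric positive definite
L, d = ldlt(Q)
print(np.allclose(L @ np.diag(d) @ L.T, Q))   # True
```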


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number $n$ by $d$, the reciprocal $\frac{1}{d}$ is calculated and the operation $n \cdot \frac{1}{d}$ is subsequently performed.

The reciprocal $\frac{1}{d}$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = \frac{1}{d}$ and using Newton's method to approximate the root. A suitable function is

$$f(x) = \frac{1}{x} - d \qquad (3.3)$$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \qquad (3.4)$$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$$x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2 \qquad (3.5)$$

The performance of this algorithm depends on how good the guess $x_0$ used for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that can be correct up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
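A sketch of the iteration is given below. For a self-contained example it replaces the lookup table with the classic linear initial approximation 48/17 − 32/17·d, valid for divisors scaled to 0.5 ≤ d < 1; the table-based initial guess used by the actual unit is covered in Chapter 5.3.2.

```python
def reciprocal(d, iterations=3):
    """Newton-Raphson reciprocal, Equation 3.5: x <- x * (2 - d * x).
    Assumes the divisor is already scaled so that 0.5 <= d < 1."""
    assert 0.5 <= d < 1.0
    x = 48.0 / 17.0 - 32.0 / 17.0 * d   # linear initial guess for 1/d
    for _ in range(iterations):
        x = x * (2.0 - d * x)           # error roughly squares each pass
    return x

print(reciprocal(0.75))                 # ~1.33333333
print(abs(reciprocal(0.51) - 1 / 0.51)) # ~0 after three iterations
```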

3.3.3 Forward Substitution

When the lower triangular matrix $\mathbf{L}$ has been acquired, it is necessary to calculate $\mathbf{L}^{-1}$, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate $\mathbf{L}^{-1}$ by solving the matrix equation

$$\mathbf{L}\mathbf{x}_i = \mathbf{e}_i \qquad (3.6)$$

for $i = 1, \ldots, n$, where $\mathbf{e}_i$ is the $i$th column of the unit matrix and $n$ is the dimension of $\mathbf{L}$. The resulting vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are the column vectors of $\mathbf{L}^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3: Forward substitution - general algorithm

for i = 1 → N do
  for j = 1 → N do
    sum = 0
    for k = 1 → j - 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = (e[j][i] - sum) / L[j][j]
  end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and $\mathbf{e} = (\mathbf{e}_1, \ldots, \mathbf{e}_n)$. If $\mathbf{L}$ is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives the fact that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes, the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4: Forward substitution - optimized for this particular case

for i = 1 → N do
  x[i][i] = 1
  for j = i + 1 → N do
    sum = L[j][i]
    for k = i + 1 → j - 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = -sum
  end for
end for

3.3.4 Final Steps

As of now, $\mathbf{L}^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $\mathbf{D}^{-1}$. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector, since $\mathbf{D}$ is diagonal and thus $\mathbf{D}^{-1}$ consists of the reciprocal values of $\mathbf{D}$.

The matrix inverse $\mathbf{Q}^{-1}$ can now be obtained by

$$\mathbf{Q}^{-1} = \mathbf{L}^{-T}\mathbf{D}^{-1}\mathbf{L}^{-1} \qquad (3.7)$$

where the matrix $\mathbf{L}^{-T}$ is the transpose of $\mathbf{L}^{-1}$. With these final matrix multiplications, the inverse $\mathbf{Q}^{-1}$ has been calculated.
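Continuing the sketches above (this snippet assumes the ldlt() and invert_unit_lower() models from the previous sections are in scope), the whole inversion chain of Equation 3.7 can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 8))
Q = A @ A.T + 8 * np.eye(8)               # symmetric positive definite

L, d = ldlt(Q)                            # decomposition model
Linv = invert_unit_lower(L)               # forward substitution model
Qinv = Linv.T @ np.diag(1.0 / d) @ Linv   # Equation 3.7
print(np.allclose(Qinv @ Q, np.eye(8)))   # True
```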

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division maps to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.

Operation   Log Space
log(a * b)  log(a) + log(b)
log(a / b)  log(a) - log(b)
log(a^b)    b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log\left(e^{\log(a)} + e^{\log(b)}\right) \qquad (3.8)$$

Note that $a$ and $b$ are not actually stored, but instead their logarithmic counterparts $\log(a)$ and $\log(b)$.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities $a$ or $b$ is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

$$\log\left(e^{\log(a)} + e^{\log(b)}\right) = \log\left(e^{\max(\log(a),\log(b))}\left(1 + e^{-|\log(a)-\log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a)-\log(b)|}\right) \qquad (3.9)$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is $\log(2) \approx 0.69$, and it approaches 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
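A sketch of Equation 3.9 with the correction term drawn from a precomputed table, as the hardware would do; the table step and range below are example values, not those of the implemented module.

```python
import math

STEP = 1.0 / 64                      # example table resolution
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(512)]

def jacobi_log(log_a, log_b):
    """log(a + b) computed from log(a) and log(b) via Equation 3.9."""
    diff = abs(log_a - log_b)
    idx = int(diff / STEP)
    corr = TABLE[idx] if idx < len(TABLE) else 0.0   # term vanishes
    return max(log_a, log_b) + corr

print(jacobi_log(math.log(0.5), math.log(0.25)))  # ~log(0.75)
print(math.log(0.75))                             # -0.2876820724...
```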

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point integers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. This PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource    Number of resource units
Slice               37680
Block RAM (36 Kb)   416
DSP48E1             768
PCI-Express block   2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.

5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but it is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances of the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is $\mathbf{H}^T\mathbf{H}$, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
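The contents of such an address lookup table can be generated offline; a sketch for the 8×8 case is shown below.

```python
# Address LUT turning row-order read addresses into column-order ones
# for an 8x8 matrix stored row-wise (used to read out the transpose).
N = 8
lut = [(addr % N) * N + (addr // N) for addr in range(N * N)]
print(lut[:4], lut[-2:])   # [0, 8, 16, 24] [55, 63]
```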

Everything in the implementation is controlled by a control FSM, which contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit in Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously, while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input $d$ can be scaled to $0.5 \le d < 1$, it follows that $1 < \frac{1}{d} \le 2$, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position $-1$, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled the issue of how to index into thelookup table remains By investigating the scaled number following conclusionscan be made

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

Figure 5.5 Block diagram of the reciprocal unit
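The scaling and indexing steps can be sketched as follows. The sketch assumes 18-bit unsigned data with 12 fractional bits, assumes d is non-zero, and stops before the table lookup and the refinement of Equation 3.5; the reported shift count must be reapplied to the approximated reciprocal as described above.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity recip_norm is
      port (
        d      : in  unsigned(17 downto 0);   -- input, value d * 2^-12
        d_norm : out unsigned(17 downto 0);   -- scaled so 0.5 <= d_norm < 1
        index  : out unsigned(7 downto 0);    -- lookup table address
        nshift : out integer range -6 to 11); -- shift to reapply to 1/d_norm
    end entity recip_norm;

    architecture rtl of recip_norm is
    begin
      process(d)
        variable msb : integer range 0 to 17;
        variable dn  : unsigned(17 downto 0);
      begin
        msb := 0;
        for i in 0 to 17 loop              -- find the most significant set bit
          if d(i) = '1' then
            msb := i;
          end if;
        end loop;
        if msb <= 11 then                  -- move that bit to position -1
          dn := shift_left(d, 11 - msb);   -- (bit 11 carries weight 2^-1 here)
        else
          dn := shift_right(d, msb - 11);
        end if;
        d_norm <= dn;
        index  <= dn(10 downto 3);         -- the 8 bits below the constant MSB
        nshift <= 11 - msb;                -- 1/d = (1/d_norm) shifted by nshift
      end process;
    end architecture rtl;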

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir   Type                   Comment
clk     in    std_logic              Input clock
load    in    std_logic              Load new d
d       in    ufixed(5 downto -12)   d input
result  out   ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a · b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6 Block diagram of the multiply-and-accumulate module
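A minimal synchronous sketch of this structure could look as follows, with 18-bit operands and a 48-bit accumulator assumed to match the DSP48E1 width, and with the subtracting variant c = c − a·b that is used later in this chapter:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mac is
      port (
        clk   : in  std_logic;
        clear : in  std_logic;                 -- selects 0 instead of c
        a, b  : in  signed(17 downto 0);
        c     : out signed(47 downto 0));
    end entity mac;

    architecture rtl of mac is
      signal acc : signed(47 downto 0) := (others => '0');
    begin
      c <= acc;
      process(clk)
        variable mux : signed(47 downto 0);
      begin
        if rising_edge(clk) then
          if clear = '1' then
            mux := (others => '0');            -- clear path of the input mux
          else
            mux := acc;
          end if;
          acc <= mux - resize(a * b, 48);      -- c = c - a*b as in the algorithm
        end if;
      end process;
    end architecture rtl;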

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a · b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be fully absorbed inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware cost and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.
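One possible layout of a control-memory word is sketched below as a VHDL record, using the signals in Table 5.3; the 3-bit coordinate widths are an assumption made for the example:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package fs_ctrl_pkg is
      -- One control memory word, matching the signals in Table 5.3
      type ctrl_word is record
        sel      : std_logic;             -- input mux select for the MAC unit
        clr      : std_logic;             -- clear the accumulator register
        l_x, l_y : unsigned(2 downto 0);  -- X, Y coordinate in the L matrix
        x_x, x_y : unsigned(2 downto 0);  -- X, Y coordinate in the X matrix
        w_x, w_y : unsigned(2 downto 0);  -- write coordinate in the X matrix
        we       : std_logic;             -- write signal for the X matrix
      end record;
      -- the control counter steps through a constant array of these words,
      -- which then acts as a small microprogram
      type ctrl_mem_t is array (natural range <>) of ctrl_word;
    end package fs_ctrl_pkg;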

Figure 5.7 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir   Type                           Comment
clk       in    std_logic                      Input clock
rst_n     in    std_logic                      Reset, active low
start     in    std_logic                      Start computation
addr_in   in    std_logic_vector(5 downto 0)   Input address
data_in   in    sfixed(2 downto -15)           Data input
we        in    std_logic                      Write enable
done      out   std_logic                      Computation done
addr_out  in    std_logic_vector(5 downto 0)   Output address
data_out  out   sfixed(2 downto -15)           X matrix output

Table 5.4 Input and output ports of the forward substitution module

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^(−x))    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative then log(b) is the largest term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(−x)). A graph of this function can be seen in Figure 5.8.

Figure 5.8 The function log(1 + e^(−x)) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^(−8).

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9 Block diagram of the Jacobi logarithm unit
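As an illustration, a behavioural VHDL-2008 sketch of the unit is given below. It is a simulation model rather than the pipelined implementation: the registers after each operation are omitted, and the table contents are computed with ieee.math_real at elaboration time. Widths follow Table 5.5, and the saturation described above limits the table address to 11 bits.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.fixed_pkg.all;
    use ieee.math_real.all;

    entity jacobi_log_sketch is
      port (
        clk    : in  std_logic;
        log_a  : in  sfixed(5 downto -12);
        log_b  : in  sfixed(5 downto -12);
        result : out sfixed(5 downto -12));
    end entity jacobi_log_sketch;

    architecture rtl of jacobi_log_sketch is
      type rom_t is array (0 to 2047) of real;
      function init_rom return rom_t is
        variable r : rom_t;
      begin
        for i in rom_t'range loop
          r(i) := log(1.0 + exp(-real(i) * 2.0 ** (-8)));  -- 0 <= x < 8, step 2^-8
        end loop;
        return r;
      end function;
      constant ROM : rom_t := init_rom;
    begin
      process(clk)
        variable diff : sfixed(6 downto -12);
        variable big  : sfixed(5 downto -12);
        variable idx  : integer;
      begin
        if rising_edge(clk) then
          diff := log_a - log_b;
          if diff(6) = '1' then          -- sign bit selects the larger input
            big := log_b;
          else
            big := log_a;
          end if;
          idx := integer(floor(abs(to_real(diff)) * 2.0 ** 8));
          if idx > 2047 then             -- saturate x to the table range [0, 8)
            idx := 2047;
          end if;
          result <= resize(big + to_sfixed(ROM(idx), 0, -12), 5, -12);
        end if;
      end process;
    end architecture rtl;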

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir   Type                   Comment
clk     in    std_logic              Input clock
log_a   in    sfixed(5 downto -12)   log(a) input
log_b   in    sfixed(5 downto -12)   log(b) input
result  out   sfixed(5 downto -12)   Result output

Table 5.5 Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be whether the module where the results are used utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns farther right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^(−8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note at how high a frequency the modules can operate. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1 Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2 Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid the excessive bit growth that would otherwise result in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.

Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3 Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4 Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that its rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · (1/ln(2))) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) · 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
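A behavioural sketch of how Equations 6.2 to 6.4 could map to such a structure is given below. It is an assumption-laden illustration, not part of the implemented design: the input format, the 256-entry table, and the modelling of the final shift as a multiplication by 2^k are all choices made here for the example.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.fixed_pkg.all;
    use ieee.math_real.all;

    entity exp_sketch is
      port (
        x      : in  sfixed(5 downto -12);
        result : out sfixed(5 downto -12));
    end entity exp_sketch;

    architecture rtl of exp_sketch is
      type rom_t is array (0 to 255) of real;
      function init_rom return rom_t is
        variable r : rom_t;
      begin
        for i in rom_t'range loop
          r(i) := 2.0 ** (real(i) / 256.0);   -- 2^frac for frac in [0, 1)
        end loop;
        return r;
      end function;
      constant ROM : rom_t := init_rom;
    begin
      process(x)
        variable y   : real;
        variable k   : integer;
        variable idx : integer range 0 to 255;
      begin
        y   := to_real(x) * MATH_LOG2_OF_E;           -- Eq. 6.2: y = x / ln(2)
        k   := integer(floor(y));                     -- Eq. 6.4: shift amount
        idx := integer(floor((y - real(k)) * 256.0)); -- Eq. 6.4: table address
        -- in hardware the factor 2^k would be a shift; it is modelled
        -- here as a multiplication, and x is assumed to keep the result
        -- within the output range
        result <= to_sfixed(ROM(idx) * 2.0 ** k, 5, -12);
      end process;
    end architecture rtl;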

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

    [ a  b ]^(-1)        1      [  d  -b ]
    [ c  d ]       =  -------   [ -c   a ]    (6.5)
                      ad - bc

iff ad − bc ≠ 0, as explained in [Strang, 2009].
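As a small numerical check of Equation 6.5 (an illustrative example, not taken from the original text): the matrix with rows (2, 1) and (1, 1) has ad − bc = 2·1 − 1·1 = 1, so its inverse has rows (1, −1) and (−1, 2), and multiplying the two matrices gives the identity, as expected.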

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. This need not be a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but could instead be a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.

The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating-point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, yet necessary to be able to evaluate a design approach. If software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided into eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions and instead uses QR decompositions only. It uses a fixed-point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights for an implementation can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
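As a sketch of this point, the following pipelined complex multiplier (operand widths are assumptions) performs the four real multiplications in parallel in one stage and the two additions in the next, so a new product can be accepted every clock cycle after the initial latency:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- (ar + j*ai)(br + j*bi) = (ar*br - ai*bi) + j*(ar*bi + ai*br)
    entity cmul is
      port (
        clk            : in  std_logic;
        ar, ai, br, bi : in  signed(17 downto 0);
        pr, pi         : out signed(36 downto 0));
    end entity cmul;

    architecture rtl of cmul is
      signal m1, m2, m3, m4 : signed(35 downto 0);
    begin
      process(clk)
      begin
        if rising_edge(clk) then
          m1 <= ar * br;  m2 <= ai * bi;          -- stage 1: four multipliers
          m3 <= ar * bi;  m4 <= ai * br;
          pr <= resize(m1, 37) - resize(m2, 37);  -- stage 2: two adders
          pi <= resize(m3, 37) + resize(m4, 37);
        end if;
      end process;
    end architecture rtl;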

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, which is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate it into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321, Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 7: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

viii CONTENTS

41 Modeling 1942 VHDL 1943 RTL 2044 Hardware 20

5 Implementation 2351 Overview 2352 Matrix Multiplication 24

521 IP Block Trade-offs 24522 Interface 24523 Example Implementation 24

53 Matrix Inversion 26531 LDLT Decomposition 26532 Reciprocal Unit 28533 Forward Substitution 30

54 Jacobi Logarithm 33

6 Result and Analysis 3561 Testing and Measurements 35

611 Matrix Multiplication 35612 LDLT Decomposition 36613 Forward Substitution 36614 Jacobi Logarithm 36

62 Resource Usage 36621 Matrix Multiplication 36622 Matrix Inversion 37623 Jacobi Logarithm 38

63 Remaining Work 38631 Hyperbolic Tangent 38632 Exponential Function 39633 Additional Matrix Operations 39634 Control Structure 40

64 Improvements 40641 Hardware Time-Multiplexing and Control 40642 Wordlength Optimization or Floating Point Implementation 40643 Design Space Exploration using High Level Synthesis 41

65 Alternative Approaches and Comparison 4166 Insights from Alternative Approaches 42

661 Number Representation 42662 Processor Architecture 43663 Flexibility 43664 Integration 43

67 Final Conclusions 44

Bibliography 45

Notation

Number sets

Notation Meaning

R Set of real numbersC Set of complex numbers

Abbreviations

Abbreviation Meaning

ASIC Application-Specific Integrated CircuitBRAM Block RAM

CORDIC Coordinate Rotation Digital ComputerFFT Fast Fourier Transform

FPGA Field Programmable Gate ArrayHDL Hardware Description LanguageIEEE Institute of Electrical and Electronics Engineers

IP Intellectual PropertyJTAG Joint Test Action GroupLLR Log-Likelihood RatioLUT Lookup TableMAC Multiply and Accumulate

MIMO Multiple-Input and Multiple-OutputOFDM Orthogonal Frequency-Division MultiplexingQAM Quadrature Amplitude ModulationRAM Random Access MemoryRTL Register Transfer Level

SIMD Single Instruction Multiple DataSNR Signal-to-Noise Ratio

SUMIS Subspace Marginalization with Interference SuppressionVHDL VHSIC Hardware Description LanguageVHSIC Very High Speed Integrated Circuit

ix

1Introduction

One technique to improve wireless communication reliability as well as perfor-mance is to use multiple antennas in the transmitter and receiver and this tech-nique is called MIMO

Unfortunately this technique adds increased complexity to the receiver since thereceiver has to determine what was actually sent given the overlapping inputfrom multiple antennas Since this is a complex problem efficient methods mustbe developed to cope with this complexity given strict real time demands from acommunication system

11 Background

The main area of this thesis is the implementation aspect of detection algorithmsin the receiver used in a MIMO system

The background for this thesis is a detection algorithm described in the con-ference paper [Čirkić and Larsson 2012] and more detailed in the longer ar-ticle [Čirkić and Larsson 2012] These papers presents a detection algorithmcalled SUMIS (subspace marginalization with interference suppression) whichhas shown promising results compared to other detection algorithms with a lowercomplexity

The given high level description in the mentioned papers of the mathematicsinvolved in the detection does not disclose how this could efficiently be imple-mented in hardware for use in a real wireless system Therefore this thesis willexamine the implementation aspects of the proposed algorithm

1

2 1 Introduction

12 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures forthe implementation of a soft MIMO detector based on the SUMIS algorithm onan FPGA

The selected operations described in Chapter 3 of the SUMIS algorithm will beimplemented in hardware and discussed The implementation aspects of the al-gorithm will be discussed to see what must be taken into consideration whenimplementing such a detection algorithm

The algorithm will be evaluated to determine how suitable this algorithm is forreal time implementation in contemporary and future wireless systems

Implementation-wise it should serve as a proof of concept with discussion aboutpossible improvements rather than providing a solution ready for production

13 Limitations

Limitations have been made to reduce the complexity and limit the work loadassociated with this thesis to a reasonable amount The number of antennas sup-ported is considered constant and also the modulation chosen as 16-QAM sinceit affects the size of the numbers involved

The main limitation is that only a subset of the operations involved in the SUMISalgorithm has been considered for hardware implementation and these are de-scribed in Chapter 3

14 Outline

The thesis is divided in several chapters Chapter 2 describes the backgroundtheory that is useful for the understanding of the succeeding chapters

The selected problems that must be solved are described in Chapter 3 with ac-companying algorithms and possible solutions to the problems The hardwarethat was utilized and the methodology used for the implementation is describedin Chapter 4

The step of actual hardware implementation is presented in Chapter 5 where theindividual modules are described

Finally the results of the implementation measurements and comparisons withother implementations can be seen in Chapter 6 The chapter also contains dis-cussions about future work and implementation aspects of the SUMIS algorithm

2Theory

This chapter describes the background theory that is necessary to comprehendother sections of this thesis

21 MIMO

A MIMO communication system is a communication system that uses multipleantennas for transmission as well as for reception A basic setup of a MIMOsystem can be seen in Figure 21

R1

R2

RNr

Receiver

T1

T2

TNt

Transm

itter

Figure 21 A MIMO system using Nt transmit and Nr receive antennas

A real valued MIMO channel can be seen as

y = Hs + e (21)

3

4 2 Theory

where H isin RNrtimesNt The matrix H denotes the channel matrix Each entry of

the matrix is a possible path from the transmitter to the receiver Therefore itcontains Nr times Nt elements which are all the possible paths from the transmittingantennas to the receiving antennas The vector s isin SNt contains the modulatedsymbols that the transmitter will try to send where S is the set containing thepossible symbols The vector e isin RNr is the noise vector e sim N (0 N0

2 I) containingadditive Gaussian noise with zero mean and N0

2 variance Finally y isin RNr is the

vector with the received symbols as seen by the receiver

As mentioned before the MIMO channel described in Equation 21 is real valuedIt is more common with a complex channel but as described in [Larsson andJalden 2008] every complex channel given a few prerequisites can be posed as areal model This is straightforward since C

n is isomorphic to R2n A real model

is used since it simplifies the explanation of the SUMIS algorithm and this modelcan easily be derived from a complex valued model

22 Detection

The principle of detection in MIMO systems is to determine s given y describedin Equation 21 The channel matrix H is assumed to be known to the receiverand is often so in practice by estimation

Detection can be divided in two subcategories hard detection and soft detectionHard detectors give an estimate of s without additional information while softdetectors provide both an estimate of s and probability information for each bitin the symbols in s This means that the detector provide information of howaccurate the estimated s is on bit level

Since detectors in communication systems are commonly used together with acoding scheme this probability information is useful when trying to decode thereceived symbol If it is known to the decoder that a specific bit in the receivedsymbol has lower probability of being correct it can be possible to achieve a lowererror rate by inverting that bit

As the title of this thesis describes the focus lies mainly on soft detectors

221 Soft Detection

The information that the detector can provide the decoder with is the log-likelihoodratio LLR which is the logarithm of the likelihood ratio Likelihood ratio is a sta-tistical test to compare the fit of two models in this case if a zero or one wastransmitted given the received data This ratio tells how many more times likelyone case is over the other

With this ratio expressed for each of the received bits the decoder can use thisknowledge to decode the received data correctly With the ratio expressed in thelogarithmic domain the sign will show the hard detection thus if the detectordetected a zero or one while the magnitude of the ratio will tell how accurate this

23 SUMIS 5

detection is The log-likelihood ratio is

l(si |y) = log

sum

forallsisinssi=1exp

(minus 1N0y minusHs2

)sum

forallsisinssi=0exp

(minus 1N0y minusHs2

) (22)

given that the symbols are uniformly distributed thus equally probable that azero or one is being sent

The sums in Equation 22 are over the set s si = x which means all possiblevectors s where the ith bit is x = 0 or x = 1 respectively

The computation effort needed to calculate the log-likelihood ratio will growpolynomial with the number of possible symbols of the constellation and expo-nential with the number of transmitter antennas Nt If |S| is all of the possiblesymbols s can contain the complexity of the calculation will be proportional to|S|Nt This is the big limitation when it comes to MIMO detectors with the con-stellation size growing as well as the number of antennas the computation effortwill be impractical to deal with

Numerous methods to deal with this complexity by introducing approximationsexists such as sphere decoding in [Chu and McAllister 2012] The method thatis investigated further in this thesis is SUMIS which is introduced in [Čirkić andLarsson 2012] SUMIS is based upon a mix of two approaches partial marginal-ization and soft interference cancellation Partial marginalization is further de-scribed in [Larsson and Jalden 2008] [Čirkić et al 2011] [Persson and Larsson2011] and [Persson et al 2012] Soft interference cancellation is described in[Lampe and Huber 1999] and [Choi et al 2000]

23 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 21into

y = Hs + Hs + e (23)

The partitioning can be used to group together Hs + e and treat it as interferenceand noise

The partition in Equation 23 is dependent on the parameter ns isin 1 Ntwhich can be seen as a complexity parameter This complexity parameter deter-mines how much effort that will be put in to the detection algorithm The dimen-sions of the partitioned matrices will be as follows H isin R

Nrtimesns H isin RNrtimes(Ntminusns)

s isin Sns and finally s isin SNtminusns

The partitioning must be chosen so that the interesting bit si is contained by sTo be able to cover all of the available bits it means that it is necessary to haveNt different partitions to have at least one partition that contains each interestingbit

6 2 Theory

If ns = 1 it is easy to choose a partition for bit si since there exists only one but forns gt 1 it is a more complex problem In [Čirkić and Larsson 2012 Section 3C] asuitable approach to perform this selection is presented The approach is to basethe selection on the matrix product HTH The goal is to minimize the impact ofHs + e on the selected columns that will be contained in H This is achieved byselecting the column in HTH that contains the interesting bit along side with thens minus 1 columns that contains the largest values intersecting the chosen columnThis will leave the remaining columns to H and the impact will be minimized

231 First Stage

Given Equation 23 it is possible to choose an approximate model

y asymp Hs + n (24)

where n sim N (0Q) and Q = HHT + N02 I

The key point of Equation 24 is that computations can be simplified by assumingthat the interference from Hs can be seen as Gaussian noise With these assump-tions made it is possible to perform the first step of the SUMIS algorithm whichhas the purpose of reducing the impact of the interfering terms This is achievedby computing the conditional expected value of each bit approximately and thiscomputation is performed symbol-wise by first computing

λk = log

sum

forallsisinssk=1exp

(minus1

2 (y minusHs)TQminus1(y minusHs))

sumforallsisinssk=0

exp(minus1

2 (y minusHs)TQminus1(y minusHs)) (25)

followed by

Esk |y = tanh(λk

2

) (26)

232 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the inter-fering vector s The first step is defining a new model to suppress this vector andthis model is

yprime asymp Hs + nprime (27)

where nprime sim N (0Qprime) and Qprime = HΦHT + N02 I The matrix Φ is the conditional

covariance matrix of s and is described as

Φ = ES2|y minus ES|y2 (28)

In Equation 28 the matrix S is a diagonal matrix with the diagonal consisting ofthe elements from s With all of these computations performed the model canbe assumed to be purified and it is possible to calculate the desired LLRs Themain difference from Equation 22 is that these computations in SUMIS are overthe space spanning ns dimensions instead of the original Nt dimensions This

24 Number Representation 7

computation is performed for each bit and is described by

l(si |y) asymp log

sum

forallsisinssi=1exp

(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs))

sumforallsisinssi=0

exp(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs)) (29)

Since the LLRs are the information desired by the decoder the SUMIS algorithmhas completed its task

233 Complexity Selection

As can be seen in the previous sections ns is the complexity parameter of thealgorithm and can be assumed to be much smaller than Nt With ns = Nt thebenefits of SUMIS are non existing since H = H and the complete computation inEquation 22 will be performed The work in [Čirkić and Larsson 2012] furtherdescribes optimizations possible to minimize the computations needed and theseresults have been used when selecting the operations to be analysed One aspectis that the inverse Qminus1 can be computed for all of the partitions by inverting alarger matrix of dimension Nt followed by smaller inverses of dimension ns

24 Number Representation

Throughout the thesis a fixed point number representation is being used for thehardware implementation A fixed point number representation is used to repre-sent a decimal number using a limited number of bits The wordlength denotesthe number of bits used

To be able to understand how the number representation works it is possible tostart with how a regular integer is represented using tworsquos complement This canbe exemplified by

X = minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i (210)

which denotes the value of a number X represented by N bits xNminus1 x0

With a N -bit binary number as described in Equation 210 any integer in therange minus2Nminus1 le X le 2Nminus1 minus 1 can be represented

With the knowledge of how to represent whole numbers it is possible to move onto decimal numbers These numbers can be represented by allocating a numberof bits for the integer part of the number and the rest for the fractional part Thisis achieved by applying a scaling factor to the number and this can be seen in

X = 2minusf lowast (minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i) (211)

8 2 Theory

which also features a N -bit binary number like the one in Equation 210 but thistime representing a decimal number

The number represented by Equation 211 is scaled by 2minusf which means thatf bits has been allocated for the fractional part and the remaining N minus f bitsrepresent the integer part and sign

The number can be in the range minus2Nminus1minusf le X le 2Nminus1minusf minus2minusf in steps of 2minusf Onebig difference compared to a floating point representation is that the resolutionis constant over the whole number range

25 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis an intro-duction to digital design and hardware is necessary

Digital circuits can mainly be divided in two main areas combinatorial and se-quential Combinatorial circuits perform boolean algebra on a given set of inputto produce one or multiple output signals It has no memory and thus the outputis only dependent on the provided input Given the ability to express booleanalgebra many different kind of circuits can be constructed some examples areadders which can add two numbers and multiplexers that work as switches withmultiple inputs and one output

The drawback with purely combinatorial circuits is that they are state-less be-cause of the lack of memory Sequential logic on the other hand groups togethercombinatorial circuits with memory elements that allows the circuit to not onlytake into account the input signals but also the current state The basic memoryelement of a sequential circuit is called a flip-flop A common D-type flip-flophas a data input data output and a clock input The flip-flop will only changeits output value on the rising edge of the clock otherwise it will contain the oldvalue

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed from a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value remains regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value is stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store larger amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address becomes available on the data port after a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often an FPGA: a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, Boolean functions are represented by their truth tables. Each truth table is stored in a small component called a LUT, a lookup table with the input variables of the Boolean function connected as an address and the value stored in the truth table as output. This allows a 4-input LUT to implement any Boolean function of at most 4 inputs; additional LUTs can be interconnected to implement Boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are all a simple FPGA needs to function, but contemporary devices often include more hardware. Since the interconnect between the building blocks adds overhead, the manufacturers often add dedicated building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is a description of hardware and is not executed on the hardware itself. The written code can be simulated as-is to verify its behaviour, even though not everything that can be simulated can be transformed into hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler translates software written in a high-level language into a low-level language.


The synthesised netlist can then be processed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA and then attempts to connect them using the routing network available in the device. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. Such blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form in which an IP block is delivered varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, presented in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there are problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm; both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but instead introduce several additions and subtractions, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a cleverer algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be eliminated, since the result is symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.
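To make the symmetry argument concrete, the following C sketch (illustrative only; the 8x8 dimension matches the real-valued matrices used later, but the function is not part of the hardware design) computes G = H^T H and fills in only the lower triangle, mirroring it afterwards:

    #define N 8

    /* G = H^T * H. The result is symmetric, so only the lower triangle
       (j <= i) is computed: 36 dot products instead of 64. */
    void gram_lower(const double H[N][N], double G[N][N]) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += H[k][i] * H[k][j];   /* column i dot column j */
                G[i][j] = G[j][i] = sum;        /* mirror to the upper half */
            }
        }
    }

The saving is the 28 strictly-upper dot products, at the price of a less regular access pattern, which is exactly the trade-off against a general, shareable multiply unit described above.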

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed form formula exists for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition, splitting the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^{-1}, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDL^T Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites and thus be able to utilize this decomposition; these rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower loop limit is greater than the upper limit.

Algorithm 3.2 Algorithm for the LDL^T decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

Algorithm 3.2 requires a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
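For reference, a direct C transcription of Algorithm 3.2 follows (floating point, without fixed point effects; an illustrative model, not the verification code used in the thesis):

    #define N 8

    /* LDL^T decomposition of a symmetric positive definite Q (Algorithm 3.2).
       Writes the strictly lower part of L (the unit diagonal is implicit)
       and the diagonal d of D. */
    void ldlt(const double Q[N][N], double L[N][N], double d[N]) {
        double v[N];
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < i; j++) {
                v[j] = L[i][j] * d[j];
                sum += L[i][j] * v[j];
            }
            v[i] = d[i] = Q[i][i] - sum;
            double rec = 1.0 / v[i];       /* the reciprocal unit in hardware */
            for (int j = i + 1; j < N; j++) {
                sum = 0.0;
                for (int k = 0; k < i; k++)
                    sum += L[j][k] * v[k];
                L[j][i] = (Q[j][i] - sum) * rec;
            }
        }
    }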


3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply the dividend by the result. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the multiplication n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005], which consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i - f(x_i)/f'(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2    (3.5)

The performance of this algorithm depends on how good the initial guess x_0 is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
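The idea can be sketched in C as follows; the table size of 64 entries and the use of two iterations are assumptions chosen for illustration, not the parameters of the hardware unit:

    #define LUT_BITS 6                      /* 64-entry seed table, assumed size */
    static double lut[1 << LUT_BITS];       /* seed for 1/d over 0.5 <= d < 1 */

    void lut_init(void) {
        for (int i = 0; i < (1 << LUT_BITS); i++) {
            double d = 0.5 + (i + 0.5) / (1 << (LUT_BITS + 1)); /* interval midpoint */
            lut[i] = 1.0 / d;
        }
    }

    /* Reciprocal of d, 0.5 <= d < 1, refined with Equation 3.5. */
    double reciprocal(double d) {
        int idx = (int)((d - 0.5) * (1 << (LUT_BITS + 1)));  /* bits right of the MSB */
        double x = lut[idx];
        x = x * (2.0 - d * x);              /* iteration 1 */
        x = x * (2.0 - d * x);              /* iteration 2 */
        return x;
    }

Since each Newton-Raphson iteration roughly doubles the number of correct bits, a seed that is good to about 7 bits reaches well beyond 18 bits of precision after two iterations.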

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

Lx_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm solving Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following can be considered useful:

1. L is unitriangular, which means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first property effectively eliminates the divisions, since all of them will be by one. It also gives that the diagonal of x will consist of only ones.

The second property changes the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits of the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, where k = i, is a multiplication by one and can thus be eliminated and lifted out of the loop. With these changes the number of operations is greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for

3.3.4 Final Steps

At this point, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector; since D is diagonal, D^{-1} simply consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.
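Combining the pieces, Equation 3.7 can be sketched in C as follows (illustrative; Li holds L^{-1} from the forward substitution and dinv the reciprocals of d, following the naming of the earlier sketches):

    /* Q^{-1} = L^{-T} D^{-1} L^{-1}. The result is symmetric, so only
       j <= i is computed. Element (i,j) = sum_k Li[k][i]*dinv[k]*Li[k][j];
       Li[k][i] is zero for k < i, so the sum starts at k = i. */
    void inverse_from_ldlt(const double Li[8][8], const double dinv[8],
                           double Qi[8][8]) {
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = 0.0;
                for (int k = i; k < 8; k++)
                    sum += Li[k][i] * dinv[k] * Li[k][j];
                Qi[i][j] = Qi[j][i] = sum;
            }
        }
    }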

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason is that when performing calculations on small probabilities, the result is greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication is mapped to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation     Log space
log(a * b)    log(a) + log(b)
log(a / b)    log(a) - log(b)
log(a^b)      b * log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that no similarly simple mapping exists for addition. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored; instead, their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{-|log(a) - log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{-|log(a) - log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two values and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in range. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
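A floating point C model of Equation 3.9 (a reference sketch; the hardware version in Chapter 5.4 replaces the log/exp evaluation with a table lookup):

    #include <math.h>

    /* Jacobi logarithm: given log(a) and log(b), return log(a + b)
       without ever forming a or b explicitly (Equation 3.9). */
    double jacobi_log(double la, double lb) {
        double mx = (la > lb) ? la : lb;
        double x  = fabs(la - lb);
        return mx + log1p(exp(-x));   /* correction term, at most log(2) */
    }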

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, with the matrix operations implemented in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. It contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers, and it allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack flexibility but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed when transferring data from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part that only stores the next state into the state registers.

Records have been used heavily, since grouping registers together makes it easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by HiTech Global and is a PCI Express based board. The PCI Express connection allows the board to be connected to the PCI Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The relevant resource counts of the chosen part are summarized in Table 4.1.

Name of resource       Number of resource units
Slice                  37,680
Block RAM (36 Kb)      416
DSP48E1                768
PCI-Express block      2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result is not a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices are of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.



5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs simultaneously would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to route such a wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances of the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen.

The example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To access both H and H^T simultaneously, as needed to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter counting from 0 to 63 can generate the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order; thus it maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
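The contents of such an address lookup table are straightforward to generate; a small illustrative C function for the 8x8 row-major case:

    /* Address LUT that turns a linear counter into a column-order read
       of an 8x8 row-major matrix: 0, 8, 16, ..., 55, 63. */
    void make_transpose_lut(unsigned char lut[64]) {
        for (int i = 0; i < 64; i++)
            lut[i] = (unsigned char)((i % 8) * 8 + i / 8);
    }

Reading through this LUT on one port while the other port follows the plain counter then delivers one element of H^T and one element of H in the same cycle.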

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using the forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, at the cost of performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row of L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to read a complete row of the matrix L while writing an individual element. This is achieved using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B has narrow ports that allow a single element to be read or written. The asymmetric memory is constructed from a number of smaller memories together with some logic to perform the address decoding; in this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built around an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^{-1} matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 <= d < 1, it follows that 1 < 1/d <= 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside in position −1, next to the decimal point. If the current bit position is known, the number can be scaled by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that index 0 must hold the initial guess for input = 0.5, while the last index must hold the guess for input ≈ 1. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 <= d < 1 to 0 <= d < 0.5, more suitable as an index.
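In C, the normalization and table indexing can be modeled as below; the 18-bit input with 12 fractional bits matches Table 5.2, while the 6-bit index width is an assumption for illustration:

    #include <stdint.h>

    #define F_BITS   12               /* fractional bits of the input (Table 5.2) */
    #define IDX_BITS 6                /* table index width, assumed */

    /* Normalize d (unsigned fixed point, value > 0) so that 0.5 <= m < 1.
       Returns the table index taken from the bits directly to the right of
       the leading one; *shift receives the shift count, and the refined
       reciprocal must afterwards be shifted back by the same count. */
    int normalize(uint32_t d, int *shift) {
        int msb = 31;
        while (!((d >> msb) & 1)) msb--;            /* find most significant set bit */
        *shift = msb - (F_BITS - 1);                /* steps to put the MSB at bit -1 */
        uint32_t m = (*shift >= 0) ? (d >> *shift) : (d << -(*shift));
        /* drop the leading one, keep the next IDX_BITS bits as the index */
        return (int)((m >> (F_BITS - 1 - IDX_BITS)) & ((1u << IDX_BITS) - 1));
    }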

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is a multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength; underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^{-x})    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the subtraction result as control signal, can select the larger value.

The remaining term in Equation 5.2 is log(1 + e^{−x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{−x}) on the interval 0 <= x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression approaches zero, and it is only necessary to precompute a table for the interval 0 <= x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 <= x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed for x ranging over 0 <= x < 8 in steps of 2^{−8}.
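The table contents and the index computation can be modeled in C as follows (illustrative; the 16-bit data width and 11-bit address come from the discussion above, while placing 12 fractional bits in the stored values is an assumption):

    #include <math.h>
    #include <stdint.h>

    #define TBL_SIZE 2048                  /* 11-bit address, one 36 Kb BRAM */
    static uint16_t tbl[TBL_SIZE];

    /* Precompute log(1 + e^-x) for x = 0, 2^-8, ..., 8 - 2^-8; the values
       are at most log(2) < 1, stored here with 12 fractional bits. */
    void jacobi_table_init(void) {
        for (int i = 0; i < TBL_SIZE; i++) {
            double x = i / 256.0;                     /* step 2^-8 */
            tbl[i] = (uint16_t)lround(log1p(exp(-x)) * 4096.0);
        }
    }

    /* Index for a fixed point x with 12 fractional bits: saturate to x < 8,
       then keep 3 integer and 8 fractional bits. */
    int jacobi_index(uint32_t x_fx12) {
        if (x_fx12 >= (8u << 12)) return TBL_SIZE - 1; /* correction is ~0 here */
        return (int)(x_fx12 >> 4);                     /* drop 12 - 8 = 4 LSBs */
    }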

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously; a control signal such as start is therefore unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains before a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab, where they were compared with and verified against the expected output.

This was performed to ensure correct functionality and to determine how accurate the hardware is compared to ideal computations performed with double precision floating point numbers. The accuracy is presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, corresponding to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^{−8}.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total     Percentage
Flip-flops          3024   301,440   1.0 %
LUTs                1459   150,720   1.0 %
Block RAM (36 Kb)   10     416       2.4 %
DSP48E1             8      768       1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this operation is not as well suited to an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total     Percentage
Flip-flops          831    301,440   < 1 %
LUTs                1802   150,720   1.2 %
Block RAM (36 Kb)   9      416       2.2 %
DSP48E1             19     768       2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid the excessive bit growth that otherwise has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total     Percentage
Flip-flops          30     301,440   < 1 %
LUTs                124    150,720   < 1 %
Block RAM (36 Kb)   2      416       < 1 %
DSP48E1             1      768       < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipeline registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total     Percentage
Flip-flops          180    301,440   < 1 %
LUTs                156    150,720   < 1 %
Block RAM (36 Kb)   1      416       < 1 %
DSP48E1             0      768       0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
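As a sketch of the idea, the following C model implements hyperbolic CORDIC in rotation mode; the iteration count is an illustrative assumption, and the repetition of iterations 4 and 13 is the standard convergence requirement for hyperbolic CORDIC. Conveniently, the CORDIC gain cancels in the quotient of Equation 6.1, so tanh needs no scale-factor compensation:

    #include <math.h>

    /* Hyperbolic CORDIC, rotation mode: after the iterations
       x ~ K*cosh(a) and y ~ K*sinh(a), so tanh(a) = y/x with the gain K
       cancelling. Converges for roughly |a| < 1.118; larger arguments
       need range reduction first. In hardware the atanh(2^-i) constants
       come from a small table and the 2^-i factors are shifts. */
    double cordic_tanh(double a) {
        double x = 1.0, y = 0.0, z = a;
        int repeated = 0;
        for (int i = 1; i <= 16; ) {
            double d = (z >= 0.0) ? 1.0 : -1.0;
            double p = ldexp(1.0, -i);        /* 2^-i */
            double xn = x + d * y * p;
            double yn = y + d * x * p;
            z -= d * atanh(p);
            x = xn; y = yn;
            if ((i == 4 || i == 13) && !repeated) {
                repeated = 1;                  /* iterations 4 and 13 run twice */
            } else {
                repeated = 0;
                i++;
            }
        }
        return y / x;
    }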

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts by rewriting the base of the calculation from e to 2 using

e^x = e^{x \cdot (1/ln(2)) \cdot ln(2)} = 2^{x \cdot (1/ln(2))}    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined into

2^{x \cdot (1/ln(2))} = 2^{floor(x \cdot (1/ln(2)))} \cdot 2^{x \cdot (1/ln(2)) - floor(x \cdot (1/ln(2)))}    (6.3)

If y = x \cdot (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^{floor(y)} \cdot 2^{y - floor(y)}    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y - floor(y)} can be precomputed and stored in a lookup table, with y - floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but may have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, closed formulas exist for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[a b; c d]^{-1} = 1/(ad - bc) \cdot [d -b; -c a]    (6.5)


iff ad - bc ≠ 0, as explained in [Strang, 2009].
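In code, the closed formula amounts to one reciprocal and a handful of multiplications (illustrative C):

    /* Closed form inverse of a 2x2 matrix (Equation 6.5).
       Returns 0 if the matrix is singular (ad - bc == 0). */
    int inv2x2(const double m[2][2], double out[2][2]) {
        double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
        if (det == 0.0) return 0;
        double r = 1.0 / det;          /* one reciprocal, four multiplies */
        out[0][0] =  m[1][1] * r;
        out[0][1] = -m[0][1] * r;
        out[1][0] = -m[1][0] * r;
        out[1][1] =  m[0][0] * r;
        return 1;
    }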

6.3.4 Control Structure

So far, separate modules have been described that solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized; it is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnect cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier cannot be shared between sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest: perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that provides the necessary accuracy while still offering enough dynamic range to be used by all of the modules.

The reason why a floating point approach might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N_0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesised, given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software into hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to evaluate a design approach; if software can aid in this process, it is very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding, meaning that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, which is very helpful for providing high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm based on a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable; changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm is performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, relying on QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, a complete processor architecture has been developed in [Eilert et al., 2008] that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of such an approach is that the same hardware can be used for other calculations; as described in [Eilert et al., 2008], a minor addition of hardware makes it possible to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm differ between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with this representation, but if operations can easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency
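For reference, the identity behind that operation count, counting the subtraction as an addition, is

$$(a + jb)(c + jd) = (ac - bd) + j(ad + bc)$$

where the four real products are mutually independent and can therefore be computed in parallel by four multipliers, followed by a single level of adders.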

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture, as described in Chapter 6.6.2, were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In customer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, but it is also prohibiting when trying to integrate the result into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight the implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000. IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication, barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson


Notation

Number sets

Notation      Meaning
R             Set of real numbers
C             Set of complex numbers

Abbreviations

Abbreviation  Meaning
ASIC          Application-Specific Integrated Circuit
BRAM          Block RAM
CORDIC        Coordinate Rotation Digital Computer
FFT           Fast Fourier Transform
FPGA          Field Programmable Gate Array
HDL           Hardware Description Language
IEEE          Institute of Electrical and Electronics Engineers
IP            Intellectual Property
JTAG          Joint Test Action Group
LLR           Log-Likelihood Ratio
LUT           Lookup Table
MAC           Multiply and Accumulate
MIMO          Multiple-Input and Multiple-Output
OFDM          Orthogonal Frequency-Division Multiplexing
QAM           Quadrature Amplitude Modulation
RAM           Random Access Memory
RTL           Register Transfer Level
SIMD          Single Instruction Multiple Data
SNR           Signal-to-Noise Ratio
SUMIS         Subspace Marginalization with Interference Suppression
VHDL          VHSIC Hardware Description Language
VHSIC         Very High Speed Integrated Circuit

1 Introduction

One technique to improve wireless communication reliability as well as performance is to use multiple antennas in the transmitter and receiver, and this technique is called MIMO.

Unfortunately, this technique adds increased complexity to the receiver, since the receiver has to determine what was actually sent given the overlapping input from multiple antennas. Since this is a complex problem, efficient methods must be developed to cope with this complexity, given the strict real-time demands of a communication system.

1.1 Background

The main area of this thesis is the implementation aspect of detection algorithms in the receiver used in a MIMO system.

The background for this thesis is a detection algorithm described in the conference paper [Čirkić and Larsson, 2012] and in more detail in the longer article [Čirkić and Larsson, 2012]. These papers present a detection algorithm called SUMIS (subspace marginalization with interference suppression), which has shown promising results compared to other detection algorithms, at a lower complexity.

The high-level description of the mathematics involved in the detection given in the mentioned papers does not disclose how this could efficiently be implemented in hardware for use in a real wireless system. Therefore, this thesis will examine the implementation aspects of the proposed algorithm.


1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be discussed to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real-time implementation in contemporary and future wireless systems.

Implementation-wise, it should serve as a proof of concept, with discussion about possible improvements, rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and limit the workload associated with this thesis to a reasonable amount. The number of antennas supported is considered constant, and the modulation is fixed as 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation, and these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions to the problems. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The actual hardware implementation is presented in Chapter 5, where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

Figure 2.1: A MIMO system using $N_t$ transmit and $N_r$ receive antennas.

A real-valued MIMO channel can be seen as

$$y = Hs + e \tag{2.1}$$

where $H \in \mathbb{R}^{N_r \times N_t}$. The matrix $H$ denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver. Therefore it contains $N_r \times N_t$ elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector $s \in \mathcal{S}^{N_t}$ contains the modulated symbols that the transmitter will try to send, where $\mathcal{S}$ is the set containing the possible symbols. The vector $e \in \mathbb{R}^{N_r}$ is the noise vector, $e \sim \mathcal{N}(0, \frac{N_0}{2} I)$, containing additive Gaussian noise with zero mean and variance $\frac{N_0}{2}$. Finally, $y \in \mathbb{R}^{N_r}$ is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real-valued. A complex channel is more common, but as described in [Larsson and Jalden, 2008], every complex channel can, given a few prerequisites, be posed as a real model. This is straightforward since $\mathbb{C}^n$ is isomorphic to $\mathbb{R}^{2n}$. A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex-valued model.
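A common form of this reformulation, sketched here since the exact mapping is not spelled out in the text, stacks the real and imaginary parts of a complex model $y_c = H_c s_c + e_c$ (the subscript $c$ is used only in this illustration):

$$\begin{bmatrix} \operatorname{Re}(y_c) \\ \operatorname{Im}(y_c) \end{bmatrix} = \begin{bmatrix} \operatorname{Re}(H_c) & -\operatorname{Im}(H_c) \\ \operatorname{Im}(H_c) & \operatorname{Re}(H_c) \end{bmatrix} \begin{bmatrix} \operatorname{Re}(s_c) \\ \operatorname{Im}(s_c) \end{bmatrix} + \begin{bmatrix} \operatorname{Re}(e_c) \\ \operatorname{Im}(e_c) \end{bmatrix}$$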

2.2 Detection

The principle of detection in MIMO systems is to determine $s$ given $y$, as described in Equation 2.1. The channel matrix $H$ is assumed to be known to the receiver, and it often is in practice, through estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of $s$ without additional information, while soft detectors provide both an estimate of $s$ and probability information for each bit in the symbols in $s$. This means that the detector provides information about how accurate the estimated $s$ is on bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted, given the received data. This ratio tells how many times more likely one case is over the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign will show the hard detection, thus whether the detector detected a zero or a one, while the magnitude of the ratio will tell how accurate this detection is. The log-likelihood ratio is

$$l(s_i \,|\, y) = \log \frac{\sum_{\forall s \in \{s :\, s_i = 1\}} \exp\left(-\frac{1}{N_0}\|y - Hs\|^2\right)}{\sum_{\forall s \in \{s :\, s_i = 0\}} \exp\left(-\frac{1}{N_0}\|y - Hs\|^2\right)} \tag{2.2}$$

given that the symbols are uniformly distributed, thus it is equally probable that a zero or a one is being sent.

The sums in Equation 2.2 are over the set $\{s : s_i = x\}$, which means all possible vectors $s$ where the $i$th bit is $x = 0$ or $x = 1$, respectively.

The computation effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols of the constellation and exponentially with the number of transmitter antennas $N_t$. If $|\mathcal{S}|$ is the number of possible symbols each element of $s$ can take, the complexity of the calculation will be proportional to $|\mathcal{S}|^{N_t}$. This is the big limitation when it comes to MIMO detectors: with the constellation size growing, as well as the number of antennas, the computational effort becomes impractical to deal with.
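To make this growth concrete, the following C sketch computes one LLR of Equation 2.2 by exhaustive enumeration. The dimensions, the 4-point alphabet (one axis of 16-QAM) and the mapping of alphabet indices to bits are illustrative assumptions only; the essential point is that the loop visits all $|\mathcal{S}|^{N_t}$ candidate vectors.

#include <math.h>

#define NT 4   /* transmit dimension (illustrative) */
#define NR 4   /* receive dimension (illustrative)  */
#define M  4   /* alphabet size |S| */

static const double ALPHABET[M] = {-3.0, -1.0, 1.0, 3.0};

/* Brute-force LLR of bit `bit` of symbol `sym` in s, following
 * Equation 2.2: all M^NT candidate vectors are enumerated. */
double llr_bruteforce(const double H[NR][NT], const double y[NR],
                      int sym, int bit, double N0)
{
    long total = 1, div = 1;
    for (int i = 0; i < NT; i++) total *= M;
    for (int i = 0; i < sym; i++) div *= M;

    double sum1 = 0.0, sum0 = 0.0;
    for (long c = 0; c < total; c++) {
        /* Decode candidate number c into a symbol vector s. */
        double s[NT];
        long idx = c;
        for (int i = 0; i < NT; i++) { s[i] = ALPHABET[idx % M]; idx /= M; }

        /* dist2 = ||y - Hs||^2 */
        double dist2 = 0.0;
        for (int r = 0; r < NR; r++) {
            double diff = y[r];
            for (int t = 0; t < NT; t++) diff -= H[r][t] * s[t];
            dist2 += diff * diff;
        }
        double w = exp(-dist2 / N0);

        /* Assumed labeling: the bits of the alphabet index are the bits
         * of the symbol. */
        int sidx = (int)((c / div) % M);
        if ((sidx >> bit) & 1) sum1 += w; else sum0 += w;
    }
    return log(sum1) - log(sum0);
}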

Numerous methods exist to deal with this complexity by introducing approximations, such as the sphere decoding in [Chu and McAllister, 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jalden, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

$$y = Hs + \bar{H}\bar{s} + e \tag{2.3}$$

The partitioning can be used to group together $\bar{H}\bar{s} + e$ and treat it as interference and noise.

The partition in Equation 2.3 is dependent on the parameter $n_s \in \{1, \ldots, N_t\}$, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: $H \in \mathbb{R}^{N_r \times n_s}$, $\bar{H} \in \mathbb{R}^{N_r \times (N_t - n_s)}$, $s \in \mathcal{S}^{n_s}$ and finally $\bar{s} \in \mathcal{S}^{N_t - n_s}$.

The partitioning must be chosen so that the interesting bit $s_i$ is contained in $s$. To be able to cover all of the available bits, it is necessary to have $N_t$ different partitions, so that there is at least one partition containing each interesting bit.

If $n_s = 1$ it is easy to choose the partition for bit $s_i$, since only one exists, but for $n_s > 1$ it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product $H^T H$. The goal is to minimize the impact of $\bar{H}\bar{s} + e$ on the selected columns that will be contained in $H$. This is achieved by selecting the column in $H^T H$ that contains the interesting bit, alongside the $n_s - 1$ columns that contain the largest values intersecting the chosen column. This will leave the remaining columns to $\bar{H}$, and the impact will be minimized.

2.3.1 First Stage

Given Equation 2.3 it is possible to choose an approximate model

$$y \approx Hs + n \tag{2.4}$$

where $n \sim \mathcal{N}(0, Q)$ and $Q = \bar{H}\bar{H}^T + \frac{N_0}{2} I$.

The key point of Equation 2.4 is that computations can be simplified by assuming that the interference from $\bar{H}\bar{s}$ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by computing the conditional expected value of each bit approximately, and this computation is performed symbol-wise by first computing

$$\lambda_k = \log \frac{\sum_{\forall s \in \{s :\, s_k = 1\}} \exp\left(-\frac{1}{2}(y - Hs)^T Q^{-1}(y - Hs)\right)}{\sum_{\forall s \in \{s :\, s_k = 0\}} \exp\left(-\frac{1}{2}(y - Hs)^T Q^{-1}(y - Hs)\right)} \tag{2.5}$$

followed by

$$\mathrm{E}\{s_k \,|\, y\} = \tanh\left(\frac{\lambda_k}{2}\right) \tag{2.6}$$

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector $\bar{s}$. The first step is defining a new model to suppress this vector, and this model is

$$y' \approx Hs + n' \tag{2.7}$$

where $n' \sim \mathcal{N}(0, Q')$ and $Q' = \bar{H}\Phi\bar{H}^T + \frac{N_0}{2} I$. The matrix $\Phi$ is the conditional covariance matrix of $\bar{s}$ and is described as

$$\Phi = \mathrm{E}\{\bar{S}^2 \,|\, y\} - \mathrm{E}\{\bar{S} \,|\, y\}^2 \tag{2.8}$$

In Equation 2.8, the matrix $\bar{S}$ is a diagonal matrix with the diagonal consisting of the elements from $\bar{s}$. With all of these computations performed, the model can be assumed to be purified, and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning $n_s$ dimensions, instead of the original $N_t$ dimensions. This computation is performed for each bit and is described by

$$l(s_i \,|\, y) \approx \log \frac{\sum_{\forall s \in \{s :\, s_i = 1\}} \exp\left(-\frac{1}{2}(y' - Hs)^T Q'^{-1}(y' - Hs)\right)}{\sum_{\forall s \in \{s :\, s_i = 0\}} \exp\left(-\frac{1}{2}(y' - Hs)^T Q'^{-1}(y' - Hs)\right)} \tag{2.9}$$

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, $n_s$ is the complexity parameter of the algorithm and can be assumed to be much smaller than $N_t$. With $n_s = N_t$ the benefits of SUMIS are non-existent, since $H$ becomes the whole channel matrix and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations that minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse $Q^{-1}$ can be computed for all of the partitions by inverting a larger matrix of dimension $N_t$, followed by smaller inverses of dimension $n_s$.

2.4 Number Representation

Throughout the thesis, a fixed-point number representation is used for the hardware implementation. A fixed-point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

$$X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i \tag{2.10}$$

which denotes the value of a number $X$ represented by $N$ bits $x_{N-1}, \ldots, x_0$.

With an $N$-bit binary number as described in Equation 2.10, any integer in the range $-2^{N-1} \leq X \leq 2^{N-1} - 1$ can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, and this can be seen in

$$X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right) \tag{2.11}$$

which also features an $N$-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by $2^{-f}$, which means that $f$ bits have been allocated for the fractional part and the remaining $N - f$ bits represent the integer part and sign.

The number can be in the range $-2^{N-1-f} \leq X \leq 2^{N-1-f} - 2^{-f}$, in steps of $2^{-f}$. One big difference compared to a floating-point representation is that the resolution is constant over the whole number range.
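To make the scaling concrete, the following C sketch encodes a real number into a 16-bit fixed-point word and decodes it again. The wordlength N = 16 and the choice f = 8 fractional bits are illustrative assumptions, and the helper names are hypothetical:

#include <stdint.h>
#include <stdio.h>

#define F 8   /* fractional bits (illustrative choice) */

/* Encode: round(x * 2^F) stored in a 16-bit two's complement word.
 * Overflow handling is omitted for brevity. */
static int16_t fix_encode(double x)
{
    return (int16_t)(x * (1 << F) + (x >= 0 ? 0.5 : -0.5));
}

/* Decode: apply the scaling factor 2^-F. */
static double fix_decode(int16_t w)
{
    return (double)w / (1 << F);
}

int main(void)
{
    int16_t w = fix_encode(-3.141592);
    /* The resolution is constant: 2^-8 = 0.00390625 over the whole range. */
    printf("stored %d, decoded %f\n", w, fix_decode(w));
    return 0;
}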

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform Boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express Boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are stateless, because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will keep the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory, by selecting the correct address, providing data on the data port and asserting the write signal.

A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, Boolean functions are represented by their truth tables. The truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the Boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any Boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement Boolean functions with more inputs.
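The principle can be illustrated with a small C model, a sketch not tied to any specific FPGA: the 16-bit mask is the stored truth table and the four inputs form the address.

#include <stdint.h>
#include <stdio.h>

/* Model of a 4-input LUT: `mask` holds the truth table and the four
 * input bits form the address into it. */
static unsigned lut4(uint16_t mask, unsigned a, unsigned b,
                     unsigned c, unsigned d)
{
    unsigned addr = (d << 3) | (c << 2) | (b << 1) | a;
    return (mask >> addr) & 1u;
}

int main(void)
{
    /* Truth table for f = a AND b (c and d unused): ones at the
     * addresses where both a and b are set, i.e. 3, 7, 11 and 15. */
    uint16_t and_mask = 0x8888;
    printf("%u %u\n", lut4(and_mask, 1, 1, 0, 0),
                      lut4(and_mask, 1, 0, 0, 0)); /* prints 1 0 */
    return 0;
}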

An FPGA does not only contain LUTs, but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.

The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. Such blocks can be anything from a simple counter to a complete processor, and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand, and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, introduced in Chapter 3.1, that are needed for the implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$$AB = C \tag{3.1}$$

where $A \in \mathbb{R}^{M \times L}$, $B \in \mathbb{R}^{L \times N}$ and $C \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $H^T H$, some of the operations could be reduced, since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it is possible to reuse it for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed-form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition that decomposes the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the originally sought inverse matrix.

The following sections will describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix $Q$, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described by [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$$Q = LDL^T \tag{3.2}$$

where $L$ is a lower triangular matrix, $D$ is a diagonal matrix containing only positive elements, and $L^T$ is the transpose of $L$. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix $Q$ is of dimension $N$. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1/v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
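As a cross-check of the pseudocode, a direct C translation for the 8 × 8 case, using zero-based indexing, might look as follows; this is a plain software sketch, separate from the hardware architecture in Chapter 5.3.1:

#define N 8

/* LDL^T decomposition following Algorithm 3.2: Q is symmetric positive
 * definite; L receives the lower triangular factor (unit diagonal
 * implied) and d the diagonal of D. */
void ldlt(const double Q[N][N], double L[N][N], double d[N])
{
    double v[N];
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < i; j++) {
            v[j] = L[i][j] * d[j];
            sum += L[i][j] * v[j];
        }
        v[i] = d[i] = Q[i][i] - sum;
        double rec = 1.0 / v[i];      /* reciprocal, see Chapter 3.3.2 */
        for (int j = i + 1; j < N; j++) {
            sum = 0.0;
            for (int k = 0; k < i; k++)
                sum += L[j][k] * v[k];
            L[j][i] = (Q[j][i] - sum) * rec;
        }
    }
}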

3.3.2 Reciprocal

In the LDLT decomposition described in Chapter 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number $n$ by $d$, the reciprocal $\frac{1}{d}$ is calculated and the operation $n \cdot \frac{1}{d}$ is subsequently performed.

The reciprocal $\frac{1}{d}$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = \frac{1}{d}$ and using Newton's method to approximate the root. A suitable function is

$$f(x) = \frac{1}{x} - d \tag{3.3}$$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \tag{3.4}$$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$$x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2 \tag{3.5}$$

The performance of this algorithm depends on how good the guess of $x_i$ for the first iteration, thus $x_0$, is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that can be correct up to a few decimals. To store a complete table with the desired final precision is not feasible, since this table would be very large.
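A software sketch of the iteration is shown below. The linear initial guess and the assumption that d has been pre-scaled into [0.5, 1), as is common in hardware dividers, are illustrative choices standing in for the lookup table:

/* Approximate 1/d with the Newton-Raphson iteration of Equation 3.5,
 * x_{i+1} = x_i * (2 - d * x_i). Assumes d is pre-scaled into [0.5, 1);
 * each iteration roughly doubles the number of correct bits. */
double reciprocal(double d, int iterations)
{
    double x = 2.9142 - 2.0 * d;   /* crude linear guess for [0.5, 1) */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);
    return x;
}
/* Example: reciprocal(0.75, 4) is approximately 1.3333333333. */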

3.3.3 Forward Substitution

When the lower triangular matrix $L$ has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the originally sought inverse described in Chapter 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$$L x_i = e_i \tag{3.6}$$

for $i = 1, \ldots, n$, where $e_i$ is the $i$th column of the unit matrix and $n$ is the dimension of $L$. The resulting vectors $x_1, \ldots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves the equation described in Equation 3.6 can be seen in Algorithm 3.3.

Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum)/L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \ldots, x_n)$ and $e = (e_1, \ldots, e_n)$. If $L$ is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives the fact that the diagonal of $x$ will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of $x$ will be zero.

Since $e$ is a unit matrix, the first multiply-and-add operation, when $k = i$, will be a multiplication by one, and it can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If $L$ is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.

Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
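A C rendering of Algorithm 3.4 for dimension 8, as a plain software sketch that makes the modified index ranges explicit:

#define N 8

/* Invert a unit lower triangular matrix L by the specialized forward
 * substitution of Algorithm 3.4. X receives L^{-1}, which is also unit
 * lower triangular; the upper triangle of X is assumed zero-initialized. */
void invert_unit_lower(const double L[N][N], double X[N][N])
{
    for (int i = 0; i < N; i++) {
        X[i][i] = 1.0;
        for (int j = i + 1; j < N; j++) {
            double sum = L[j][i];            /* the lifted k = i term */
            for (int k = i + 1; k < j; k++)
                sum += L[j][k] * X[k][i];
            X[j][i] = -sum;
        }
    }
}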

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the $d$ vector; since $D$ is diagonal, $D^{-1}$ consists of the reciprocal values of $D$.

The matrix inverse $Q^{-1}$ can now be obtained by

$$Q^{-1} = L^{-T} D^{-1} L^{-1} \tag{3.7}$$

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications, the inverse $Q^{-1}$ has been calculated.
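Combining the factors can also be written out directly; in the sketch below, the triangular structure of $L^{-1}$ is exploited so that the sum over $k$ starts at $\max(i, j)$:

#define N 8

/* Assemble Q^{-1} = L^{-T} D^{-1} L^{-1}: Linv holds L^{-1} (unit lower
 * triangular) and drec the reciprocals 1/d[k]. Entry (i, j) equals the
 * sum over k of Linv[k][i] * drec[k] * Linv[k][j]. */
void assemble_inverse(const double Linv[N][N], const double drec[N],
                      double Qinv[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int k0 = i > j ? i : j;   /* Linv[k][i] = 0 for k < i */
            double sum = 0.0;
            for (int k = k0; k < N; k++)
                sum += Linv[k][i] * drec[k] * Linv[k][j];
            Qinv[i][j] = sum;
        }
    }
}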

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is the fact that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication will be mapped to addition, division to subtraction, and exponentiation will be mapped to multiplication. A summary of these identities can be seen in Table 3.1.

Operation    Log space
log(a * b)   log(a) + log(b)
log(a/b)     log(a) − log(b)
log(a^b)     b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log(e^{\log(a)} + e^{\log(b)}) \tag{3.8}$$

Note that $a$ and $b$ are not actually stored, but instead their logarithmic counterparts $\log(a)$ and $\log(b)$.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities $a$ or $b$ is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the largest of the two probabilities. The rewrite yields

$$\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a), \log(b))}\left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right) \tag{3.9}$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space will be performed by selecting the maximum of the two probabilities and adding it to the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be $\log(2) \approx 0.69$, and it will approach 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
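A software sketch of the operation: the correction term is read from a precomputed table, as suggested above. The table size of 64 entries and the cut-off at a difference of 8 are illustrative assumptions:

#include <math.h>

#define TABLE_SIZE 64
#define CUTOFF 8.0   /* beyond this difference the correction is ~0 */

static double corr_table[TABLE_SIZE];

/* Precompute log(1 + e^-delta) for delta in [0, CUTOFF). */
void init_jacobi_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        corr_table[i] = log1p(exp(-(CUTOFF * i) / TABLE_SIZE));
}

/* Jacobi logarithm per Equation 3.9: given la = log(a) and lb = log(b),
 * return log(a + b) as max(la, lb) plus the bounded correction term. */
double jacobi_log(double la, double lb)
{
    double m = la > lb ? la : lb;
    double diff = fabs(la - lb);
    if (diff >= CUTOFF)
        return m;                         /* correction negligible */
    return m + corr_table[(int)(diff * TABLE_SIZE / CUTOFF)];
}
/* Example: jacobi_log(0.0, 0.0) is approximately log(2) = 0.693. */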

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This allowed for an easier way to transform the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed-point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This is an approach suitable only for very simple designs, and it is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed-point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point integers. The package allows the wordlengths, both integer and fractional, to be configured more easily, by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been heavily used, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. This PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18-bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource      Number of resource units
Slice                 37680
Block RAM (36 Kb)     416
DSP48E1               768
PCI-Express block     2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values of dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to perform trade-offs regarding performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, a bus of width 18 × 64 × 2 = 2304 bits would be necessary to provide the two matrix inputs. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be seen in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is $H^T H$, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and its transpose simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix, but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus, the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
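The contents of that lookup table are straightforward to generate; a small C sketch of the mapping for an 8 × 8 row-major matrix:

#include <stdio.h>

#define DIM 8

int main(void)
{
    /* Column-order read address for a row-major 8x8 matrix: counter
     * value i maps to (i % 8) * 8 + i / 8, producing the sequence
     * 0, 8, 16, ..., 55, 63. */
    for (int i = 0; i < DIM * DIM; i++)
        printf("%d -> %d\n", i, (i % DIM) * DIM + i / DIM);
    return 0;
}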

Everything in the implementation will be controlled by a control FSM, which will contain the address counter and drive the control signals shown in Figure 5.1. It will observe the status signals from the IP block and also store the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, which includes the reciprocal unit described in Chapter 5.3.2. The resulting matrix L is then inverted using the forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimal amount of control logic by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need to be calculated, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.
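Since Algorithm 3.2 is not reproduced in this chapter, the reference model below follows the standard LDL^T recurrence that matches the description above: pair-wise products form v, an adder tree reduces them, and the column below the diagonal is scaled by the reciprocal of d[i]. It is a sketch under that assumption, not the thesis code.

```python
# Reference model of the LDL^T recurrence: Q = L*D*L^T with L unit lower
# triangular and D diagonal (standard form, assumed to match Algorithm 3.2).
def ldlt(Q):
    n = len(Q)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for i in range(n):
        L[i][i] = 1.0
        v = [d[j] * L[i][j] for j in range(i)]                  # pair-wise mults
        d[i] = Q[i][i] - sum(L[i][j] * v[j] for j in range(i))  # adder tree
        r = 1.0 / d[i]                                          # reciprocal unit
        for k in range(i + 1, n):
            L[k][i] = (Q[k][i] - sum(L[k][j] * v[j] for j in range(i))) * r
    return L, d
```

Note that only 1/d[i] is ever needed, which is why the module outputs the reciprocals (D^-1) directly.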

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while also being able to write an individual element. This can be achieved using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic that performs the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built around an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name       Dir   Type                          Comment
clk        in    std_logic                     Input clock
rst_n      in    std_logic                     Reset, active low
start      in    std_logic                     Start computation
addr_in    in    std_logic_vector(5 downto 0)  Input address
data_in    in    sfixed(5 downto -12)          Data input
we         in    std_logic                     Write enable
ready      out   std_logic                     Ready for input
done       out   std_logic                     Computation done
addr_out   in    std_logic_vector(5 downto 0)  Output address
L_data     out   sfixed(2 downto -15)          L matrix output
D_data     out   sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one of the input number must reside in position −1, next to the binary point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a shift left by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are omitted from Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.
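A behavioral sketch ties the pieces together. It assumes that Equation 3.5 is the Newton-Raphson step x1 = x0 · (2 − d · x0); the 64-entry table size is an assumption for illustration, not the implemented value.

```python
LUT_BITS = 6   # table with 2**6 = 64 initial guesses (assumed size)
LUT = [1.0 / (0.5 + (i + 0.5) / 2 ** (LUT_BITS + 1)) for i in range(2 ** LUT_BITS)]

def reciprocal(d):
    assert d > 0
    n = 0                                        # dynamic scaling to 0.5 <= d < 1
    while d >= 1.0:
        d /= 2.0; n += 1
    while d < 0.5:
        d *= 2.0; n -= 1
    idx = int((d - 0.5) * 2 ** (LUT_BITS + 1))   # bits to the right of the MSB
    x0 = LUT[min(idx, 2 ** LUT_BITS - 1)]        # initial guess, 1 < x0 <= 2
    x1 = x0 * (2.0 - d * x0)                     # refinement: mult, square, sub
    return x1 / 2 ** n                           # undo the input scaling
```

The final division by 2^n corresponds to the output shifter in Figure 5.5; in hardware all scaling steps are plain shifts.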

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name     Dir   Type                   Comment
clk      in    std_logic              Input clock
load     in    std_logic              Load new d
d        in    ufixed(5 downto -12)   d input
result   out   ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.
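As Algorithm 3.4 itself is not reproduced here, the sketch below shows the standard forward substitution it is assumed to describe: inverting the unit lower-triangular L column by column, where every arithmetic step is exactly the MAC operation c = c − a · b.

```python
# Invert a unit lower-triangular L by forward substitution, solving the
# independent systems L*x = e_j; each inner step is one MAC: c = c - a*b.
def invert_unit_lower(L):
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            c = 1.0 if i == j else 0.0     # clear accumulator to identity entry
            for k in range(j, i):
                c -= L[i][k] * X[k][j]     # MAC unit
            X[i][j] = c                    # L[i][i] = 1, so no division needed
    return X
```

Combined with the decomposition, the inverse follows as Q^-1 = (L^-1)^T · D^-1 · L^-1, which is where the matrix multiplication unit of Chapter 5.2 comes back in.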

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name       Purpose
sel        Control input mux to MAC unit
clr        Clear accumulator register
L_x, L_y   X/Y coordinate in L matrix
X_x, X_y   X/Y coordinate in X matrix
W_x, W_y   X/Y coordinate in X matrix for write
we         Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name       Dir   Type                          Comment
clk        in    std_logic                     Input clock
rst_n      in    std_logic                     Reset, active low
start      in    std_logic                     Start computation
addr_in    in    std_logic_vector(5 downto 0)  Input address
data_in    in    sfixed(2 downto -15)          Data input
we         in    std_logic                     Write enable
done       out   std_logic                     Computation done
addr_out   in    std_logic_vector(5 downto 0)  Output address
data_out   out   sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x)    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.

Since the expression takes on interesting values only on a small interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, so it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
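The quantization above is easy to validate with a small model. The code below mirrors the 2048-entry table and the saturating index computation; it is written in floating point for clarity, whereas the hardware works on fixed point values.

```python
import math

STEP = 2 ** -8                      # table step size
TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(2048)]

def jacobi_log(log_a, log_b):
    diff = log_a - log_b
    larger = log_b if diff < 0 else log_a     # mux controlled by the sign bit
    idx = min(int(abs(diff) / STEP), 2047)    # saturate to 0 <= x < 8
    return larger + TABLE[idx]

# jacobi_log approximates log(a + b) from log(a) and log(b), so chaining it
# sums many probabilities in the log domain without overflow or underflow.
```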

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name     Dir   Type                   Comment
clk      in    std_logic              Input clock
log_a    in    sfixed(5 downto -12)   log(a) input
log_b    in    sfixed(5 downto -12)   log(b) input
result   out   sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, covering both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported back into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not quite as well optimized in an FPGA as a regular adder.
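For reference, the rounding step amounts to the following operation; the shift amount below is an illustrative assumption, as only the 40-to-18-bit widths are given in the text.

```python
# Round-to-nearest and saturate a wide fixed-point accumulator to 18 bits.
def round_saturate(acc, shift=12, out_bits=18):
    v = (acc + (1 << (shift - 1))) >> shift            # round to nearest
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, v))                         # saturate on overflow
```

The adder for the rounding constant and the comparisons against both saturation bounds span the full 40-bit word, which is what makes this path slower than the multiplier pipeline.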

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and from the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipeline registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
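A software model of the hyperbolic CORDIC illustrates the idea; this is the standard textbook formulation, not code from the thesis. In rotation mode the x and y paths converge to scaled cosh and sinh, and the scaling constant cancels in the quotient. Note that the basic iteration only converges for arguments up to about 1.1, so larger inputs would need range reduction first.

```python
import math

def tanh_cordic(z, iters=16):
    # Hyperbolic CORDIC requires iterations 4, 13, 40, ... to be repeated.
    seq, i, repeat = [], 1, 4
    while len(seq) < iters:
        seq.append(i)
        if i == repeat:
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    x, y = 1.0, 0.0
    for i in seq:
        d = 1.0 if z >= 0 else -1.0        # rotate towards z = 0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)     # angles come from a small LUT
    return y / x                           # CORDIC gain cancels in sinh/cosh
```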

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2, with

e^x = e^(x · ln(2) / ln(2)) = 2^(x · 1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
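The scheme is compact enough to prototype in a few lines; the 8-bit table resolution below is an assumption for illustration, not a value from the thesis.

```python
import math

INV_LN2 = 1.0 / math.log(2.0)                  # precalculated constant
FRAC_BITS = 8
TABLE = [2.0 ** (i / 2 ** FRAC_BITS) for i in range(2 ** FRAC_BITS)]

def exp_approx(x):
    y = x * INV_LN2
    k = math.floor(y)                          # drives the binary decoder
    idx = int((y - k) * 2 ** FRAC_BITS)        # fractional part indexes the LUT
    return TABLE[idx] * 2.0 ** k               # 2**k is just a shift in hardware
```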

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[ a  b ]^(-1)      1     [  d  -b ]
[ c  d ]       = ------- [ -c   a ]        (6.5)
                 ad - bc

if ad − bc ≠ 0, as explained in [Strang, 2009].
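In code, Equation 6.5 is a handful of operations; that the single division could map onto the reciprocal unit already present in the design is my assumption, not a statement from the thesis.

```python
# Direct 2x2 inversion per Equation 6.5 (matrix given as four scalars).
def inv2x2(a, b, c, d):
    det = a * d - b * c
    assert det != 0                    # Equation 6.5 requires ad - bc != 0
    r = 1.0 / det                      # candidate job for the reciprocal unit
    return (d * r, -b * r, -c * r, a * r)
```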

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback is that this limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
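The operation count behind that argument is easy to see when the complex product is written out; in hardware the four multiplications can run in parallel, with the additions forming a later pipeline stage.

```python
# One complex multiply = four real multiplies + two additions
# (one of them performed as a subtraction).
def cmul(ar, ai, br, bi):
    return (ar * br - ai * bi,   # real part
            ar * bi + ai * br)   # imaginary part
```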

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors as in [Chu and McAllister, 2012] and [Eilert et al., 2008] and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, which is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules to highlight implementation details that need to be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 9: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

1Introduction

One technique to improve wireless communication reliability as well as perfor-mance is to use multiple antennas in the transmitter and receiver and this tech-nique is called MIMO

Unfortunately this technique adds increased complexity to the receiver since thereceiver has to determine what was actually sent given the overlapping inputfrom multiple antennas Since this is a complex problem efficient methods mustbe developed to cope with this complexity given strict real time demands from acommunication system

11 Background

The main area of this thesis is the implementation aspect of detection algorithmsin the receiver used in a MIMO system

The background for this thesis is a detection algorithm described in the con-ference paper [Čirkić and Larsson 2012] and more detailed in the longer ar-ticle [Čirkić and Larsson 2012] These papers presents a detection algorithmcalled SUMIS (subspace marginalization with interference suppression) whichhas shown promising results compared to other detection algorithms with a lowercomplexity

The given high level description in the mentioned papers of the mathematicsinvolved in the detection does not disclose how this could efficiently be imple-mented in hardware for use in a real wireless system Therefore this thesis willexamine the implementation aspects of the proposed algorithm

1

2 1 Introduction

12 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures forthe implementation of a soft MIMO detector based on the SUMIS algorithm onan FPGA

The selected operations described in Chapter 3 of the SUMIS algorithm will beimplemented in hardware and discussed The implementation aspects of the al-gorithm will be discussed to see what must be taken into consideration whenimplementing such a detection algorithm

The algorithm will be evaluated to determine how suitable this algorithm is forreal time implementation in contemporary and future wireless systems

Implementation-wise it should serve as a proof of concept with discussion aboutpossible improvements rather than providing a solution ready for production

13 Limitations

Limitations have been made to reduce the complexity and limit the work loadassociated with this thesis to a reasonable amount The number of antennas sup-ported is considered constant and also the modulation chosen as 16-QAM sinceit affects the size of the numbers involved

The main limitation is that only a subset of the operations involved in the SUMISalgorithm has been considered for hardware implementation and these are de-scribed in Chapter 3

14 Outline

The thesis is divided in several chapters Chapter 2 describes the backgroundtheory that is useful for the understanding of the succeeding chapters

The selected problems that must be solved are described in Chapter 3 with ac-companying algorithms and possible solutions to the problems The hardwarethat was utilized and the methodology used for the implementation is describedin Chapter 4

The step of actual hardware implementation is presented in Chapter 5 where theindividual modules are described

Finally the results of the implementation measurements and comparisons withother implementations can be seen in Chapter 6 The chapter also contains dis-cussions about future work and implementation aspects of the SUMIS algorithm

2Theory

This chapter describes the background theory that is necessary to comprehendother sections of this thesis

21 MIMO

A MIMO communication system is a communication system that uses multipleantennas for transmission as well as for reception A basic setup of a MIMOsystem can be seen in Figure 21

R1

R2

RNr

Receiver

T1

T2

TNt

Transm

itter

Figure 21 A MIMO system using Nt transmit and Nr receive antennas

A real valued MIMO channel can be seen as

y = Hs + e (21)

3

4 2 Theory

where H isin RNrtimesNt The matrix H denotes the channel matrix Each entry of

the matrix is a possible path from the transmitter to the receiver Therefore itcontains Nr times Nt elements which are all the possible paths from the transmittingantennas to the receiving antennas The vector s isin SNt contains the modulatedsymbols that the transmitter will try to send where S is the set containing thepossible symbols The vector e isin RNr is the noise vector e sim N (0 N0

2 I) containingadditive Gaussian noise with zero mean and N0

2 variance Finally y isin RNr is the

vector with the received symbols as seen by the receiver

As mentioned before the MIMO channel described in Equation 21 is real valuedIt is more common with a complex channel but as described in [Larsson andJalden 2008] every complex channel given a few prerequisites can be posed as areal model This is straightforward since C

n is isomorphic to R2n A real model

is used since it simplifies the explanation of the SUMIS algorithm and this modelcan easily be derived from a complex valued model

22 Detection

The principle of detection in MIMO systems is to determine s given y describedin Equation 21 The channel matrix H is assumed to be known to the receiverand is often so in practice by estimation

Detection can be divided in two subcategories hard detection and soft detectionHard detectors give an estimate of s without additional information while softdetectors provide both an estimate of s and probability information for each bitin the symbols in s This means that the detector provide information of howaccurate the estimated s is on bit level

Since detectors in communication systems are commonly used together with acoding scheme this probability information is useful when trying to decode thereceived symbol If it is known to the decoder that a specific bit in the receivedsymbol has lower probability of being correct it can be possible to achieve a lowererror rate by inverting that bit

As the title of this thesis describes the focus lies mainly on soft detectors

221 Soft Detection

The information that the detector can provide the decoder with is the log-likelihoodratio LLR which is the logarithm of the likelihood ratio Likelihood ratio is a sta-tistical test to compare the fit of two models in this case if a zero or one wastransmitted given the received data This ratio tells how many more times likelyone case is over the other

With this ratio expressed for each of the received bits the decoder can use thisknowledge to decode the received data correctly With the ratio expressed in thelogarithmic domain the sign will show the hard detection thus if the detectordetected a zero or one while the magnitude of the ratio will tell how accurate this

23 SUMIS 5

detection is The log-likelihood ratio is

l(si |y) = log

sum

forallsisinssi=1exp

(minus 1N0y minusHs2

)sum

forallsisinssi=0exp

(minus 1N0y minusHs2

) (22)

given that the symbols are uniformly distributed thus equally probable that azero or one is being sent

The sums in Equation 22 are over the set s si = x which means all possiblevectors s where the ith bit is x = 0 or x = 1 respectively

The computation effort needed to calculate the log-likelihood ratio will growpolynomial with the number of possible symbols of the constellation and expo-nential with the number of transmitter antennas Nt If |S| is all of the possiblesymbols s can contain the complexity of the calculation will be proportional to|S|Nt This is the big limitation when it comes to MIMO detectors with the con-stellation size growing as well as the number of antennas the computation effortwill be impractical to deal with

Numerous methods to deal with this complexity by introducing approximationsexists such as sphere decoding in [Chu and McAllister 2012] The method thatis investigated further in this thesis is SUMIS which is introduced in [Čirkić andLarsson 2012] SUMIS is based upon a mix of two approaches partial marginal-ization and soft interference cancellation Partial marginalization is further de-scribed in [Larsson and Jalden 2008] [Čirkić et al 2011] [Persson and Larsson2011] and [Persson et al 2012] Soft interference cancellation is described in[Lampe and Huber 1999] and [Choi et al 2000]

23 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 21into

y = Hs + Hs + e (23)

The partitioning can be used to group together Hs + e and treat it as interferenceand noise

The partition in Equation 23 is dependent on the parameter ns isin 1 Ntwhich can be seen as a complexity parameter This complexity parameter deter-mines how much effort that will be put in to the detection algorithm The dimen-sions of the partitioned matrices will be as follows H isin R

Nrtimesns H isin RNrtimes(Ntminusns)

s isin Sns and finally s isin SNtminusns

The partitioning must be chosen so that the interesting bit si is contained by sTo be able to cover all of the available bits it means that it is necessary to haveNt different partitions to have at least one partition that contains each interestingbit

6 2 Theory

If ns = 1 it is easy to choose a partition for bit si since there exists only one but forns gt 1 it is a more complex problem In [Čirkić and Larsson 2012 Section 3C] asuitable approach to perform this selection is presented The approach is to basethe selection on the matrix product HTH The goal is to minimize the impact ofHs + e on the selected columns that will be contained in H This is achieved byselecting the column in HTH that contains the interesting bit along side with thens minus 1 columns that contains the largest values intersecting the chosen columnThis will leave the remaining columns to H and the impact will be minimized

231 First Stage

Given Equation 23 it is possible to choose an approximate model

y asymp Hs + n (24)

where n sim N (0Q) and Q = HHT + N02 I

The key point of Equation 24 is that computations can be simplified by assumingthat the interference from Hs can be seen as Gaussian noise With these assump-tions made it is possible to perform the first step of the SUMIS algorithm whichhas the purpose of reducing the impact of the interfering terms This is achievedby computing the conditional expected value of each bit approximately and thiscomputation is performed symbol-wise by first computing

λk = log

sum

forallsisinssk=1exp

(minus1

2 (y minusHs)TQminus1(y minusHs))

sumforallsisinssk=0

exp(minus1

2 (y minusHs)TQminus1(y minusHs)) (25)

followed by

Esk |y = tanh(λk

2

) (26)

232 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the inter-fering vector s The first step is defining a new model to suppress this vector andthis model is

yprime asymp Hs + nprime (27)

where nprime sim N (0Qprime) and Qprime = HΦHT + N02 I The matrix Φ is the conditional

covariance matrix of s and is described as

Φ = ES2|y minus ES|y2 (28)

In Equation 28 the matrix S is a diagonal matrix with the diagonal consisting ofthe elements from s With all of these computations performed the model canbe assumed to be purified and it is possible to calculate the desired LLRs Themain difference from Equation 22 is that these computations in SUMIS are overthe space spanning ns dimensions instead of the original Nt dimensions This

24 Number Representation 7

computation is performed for each bit and is described by

l(s_i | y) ≈ log [ Σ_{∀s∈S: s_i=1} exp(−(1/2)(y′ − Hs)^T Q′^−1 (y′ − Hs)) / Σ_{∀s∈S: s_i=0} exp(−(1/2)(y′ − Hs)^T Q′^−1 (y′ − Hs)) ]    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has now completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, ns is the complexity parameter of the algorithm and can be assumed to be much smaller than Nt. With ns = Nt the benefits of SUMIS are nonexistent, since H = H and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations that minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^−1 can be computed for all of the partitions by inverting a larger matrix of dimension Nt followed by smaller inverses of dimension ns.

2.4 Number Representation

Throughout the thesis a fixed point number representation is used for the hardware implementation. A fixed point representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

X = −x_{N−1} · 2^(N−1) + Σ_{i=0}^{N−2} x_i · 2^i    (2.10)

which denotes the value of a number X represented by the N bits x_{N−1}, ..., x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range −2^(N−1) ≤ X ≤ 2^(N−1) − 1 can be represented.

With the knowledge of how to represent whole numbers it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, and this can be seen in

X = 2^−f · (−x_{N−1} · 2^(N−1) + Σ_{i=0}^{N−2} x_i · 2^i)    (2.11)


which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^−f, which means that f bits have been allocated for the fractional part and the remaining N − f bits represent the integer part and sign.

The number can be in the range −2^(N−1−f) ≤ X ≤ 2^(N−1−f) − 2^−f, in steps of 2^−f. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
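As a small worked example, the mapping of Equation 2.11 can be modeled in Python (an illustrative sketch; the default wordlengths N = 18 and f = 12 are chosen here to match values used later in the thesis, not mandated by the representation itself):

    def to_fixed(x, N=18, f=12):
        # Quantize x to an N-bit two's complement word with f fractional bits.
        word = max(-2**(N - 1), min(2**(N - 1) - 1, round(x * 2**f)))
        return word & (2**N - 1)              # two's complement bit pattern

    def from_fixed(word, N=18, f=12):
        # Interpret an N-bit word according to Equation 2.11.
        if word >= 2**(N - 1):                # sign bit set
            word -= 2**N
        return word * 2**-f

    assert from_fixed(to_fixed(-1.25)) == -1.25   # exactly representable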

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two main areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less because of the lack of memory. Sequential logic on the other hand groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will keep the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. Each truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy to the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, described in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^(M×L), B ∈ R^(L×N) and C ∈ R^(M×N).

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.
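For the symmetric case mentioned above, roughly half of the multiply-and-add operations can be skipped by computing only the lower triangle and mirroring it, as in this illustrative Python model:

    import numpy as np

    def gram_lower(H):
        # G = H^T H computed over the lower triangle only, then mirrored.
        n = H.shape[1]
        G = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1):        # j <= i
                G[i, j] = H[:, i] @ H[:, j]
                G[j, i] = G[i, j]
        return G

    H = np.random.randn(8, 8)
    assert np.allclose(gram_lower(H), H.T @ H)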

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition that decomposes the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the sought inverse of the original matrix.

The following sections will describe the steps involved in calculating the inverse, denoted Q^−1, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described by [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the upper limit.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q and the output matrix is L, along with the vector d which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1/v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
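A direct Python transcription of Algorithm 3.2 (useful as a floating point reference model when verifying the fixed point hardware; indices are 0-based here) could look as follows:

    import numpy as np

    def ldlt(Q):
        # LDL^T decomposition of a symmetric positive definite Q (Algorithm 3.2).
        N = Q.shape[0]
        v, d = np.zeros(N), np.zeros(N)
        L = np.zeros((N, N))
        for i in range(N):
            acc = 0.0
            for j in range(i):
                v[j] = L[i, j] * d[j]
                acc += L[i, j] * v[j]
            v[i] = d[i] = Q[i, i] - acc
            L[i, i] = 1.0                 # L is unitriangular
            rec = 1.0 / v[i]              # the reciprocal of Chapter 3.3.2
            for j in range(i + 1, N):
                acc = 0.0
                for k in range(i):
                    acc += L[j, k] * v[k]
                L[j, i] = (Q[j, i] - acc) * rec
        return L, d

    A = np.random.randn(8, 8)
    Q = A @ A.T + 8 * np.eye(8)           # symmetric positive definite test input
    L, d = ldlt(Q)
    assert np.allclose(L @ np.diag(d) @ L.T, Q)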


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n · (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x − d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i − f(x_i)/f′(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i(2 − d · x_i) = 2x_i − d · x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
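The iteration of Equation 3.5 seeded from a small lookup table can be prototyped as below (an illustrative Python model; the 6-bit table and two iterations are assumptions, not the exact parameters of the hardware unit):

    # Initial guesses of 1/d for 0.5 <= d < 1, indexed by the bits after the MSB
    TABLE_BITS = 6
    TABLE = [1.0 / (0.5 + (i + 0.5) / 2**(TABLE_BITS + 1))
             for i in range(2**TABLE_BITS)]

    def reciprocal(d, iterations=2):
        # Approximate 1/d for 0.5 <= d < 1 using Equation 3.5.
        index = int((d - 0.5) * 2**(TABLE_BITS + 1))
        x = TABLE[index]                  # initial guess x0
        for _ in range(iterations):
            x = x * (2.0 - d * x)         # Newton-Raphson step
        return x

    assert abs(reciprocal(0.75) - 4.0 / 3.0) < 1e-8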

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^−1, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^−1 by solving the matrix equation

Lx_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the i-th column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^−1.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum)/L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
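Transcribed to Python with 0-based indices, Algorithm 3.4 can be checked directly against a reference inverse (an illustrative model; counting the inner loop iterations for N = 8 also reproduces the 56 multiply-and-add operations and 28 subtractions stated above):

    import numpy as np

    def invert_unit_lower(L):
        # Invert a unit lower triangular matrix (Algorithm 3.4, 0-based).
        N = L.shape[0]
        X = np.zeros((N, N))
        for i in range(N):
            X[i, i] = 1.0
            for j in range(i + 1, N):
                acc = L[j, i]
                for k in range(i + 1, j):
                    acc += L[j, k] * X[k, i]
                X[j, i] = -acc          # columns above the diagonal stay zero
        return X

    L = np.tril(np.random.randn(8, 8), -1) + np.eye(8)
    assert np.allclose(invert_unit_lower(L) @ L, np.eye(8))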

3.3.4 Final Steps

As of now, L^−1 has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^−1. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^−1 consists of the reciprocal values of D.

The matrix inverse Q^−1 can now be obtained by

Q^−1 = L^−T D^−1 L^−1    (3.7)

where the matrix L^−T is the transpose of L^−1. With these final matrix multiplications the inverse Q^−1 has been calculated.
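Putting the pieces together, Equation 3.7 can be modeled by chaining the earlier sketches (this reuses the hypothetical ldlt and invert_unit_lower functions defined above):

    import numpy as np

    def invert_spd(Q):
        # Q^-1 = L^-T D^-1 L^-1 per Equation 3.7.
        L, d = ldlt(Q)                    # Chapter 3.3.1
        Linv = invert_unit_lower(L)       # Chapter 3.3.3
        return Linv.T @ np.diag(1.0 / d) @ Linv   # reciprocals from Chapter 3.3.2

    A = np.random.randn(8, 8)
    Q = A @ A.T + 8 * np.eye(8)
    assert np.allclose(invert_spd(Q) @ Q, np.eye(8))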

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is the fact that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication will be mapped to addition, division to subtraction, and exponentiation will be mapped to multiplication. A summary of these identities can be seen in Table 3.1.


Operation    Log space
log(a · b)   log(a) + log(b)
log(a / b)   log(a) − log(b)
log(a^b)     b · log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^log(a) + e^log(b))    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might be very large.

With these limitations in mind it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^log(a) + e^log(b)) = log(e^max(log(a), log(b)) · (1 + e^−|log(a) − log(b)|))
                         = max(log(a), log(b)) + log(1 + e^−|log(a) − log(b)|)    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two log probabilities and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be log(2) ≈ 0.69, and it will approach 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
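In software the complete operation is compact; the following Python sketch mirrors Equation 3.9, with math.log1p standing in for the small precomputed table used in hardware:

    import math

    def jacobi_log(log_a, log_b):
        # log(a + b) computed from log(a) and log(b), Equation 3.9.
        diff = abs(log_a - log_b)
        return max(log_a, log_b) + math.log1p(math.exp(-diff))

    # Example: log(0.3 + 0.4) computed entirely in log space
    assert abs(jacobi_log(math.log(0.3), math.log(0.4)) - math.log(0.7)) < 1e-12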

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier way to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part will only store the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. This PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register so it can accumulate the calculated result. This block can perform numerous operations and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), where it denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.



5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to perform trade-offs regarding performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, a bus of width 18 × 64 × 2 = 2304 bits would be necessary to provide the two matrix inputs. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be seen in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63, as the snippet below illustrates.
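The contents of that lookup table are just the row-major indices permuted into column order; for the 8×8 case they can be generated with a one-liner (illustrative Python snippet):

    # Counter value i maps to element (row = i mod 8, column = i div 8),
    # i.e. the matrix is read out column by column.
    lut = [(i % 8) * 8 + i // 8 for i in range(64)]
    assert lut[:4] == [0, 8, 16, 24] and lut[-2:] == [55, 63]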

Everything in the implementation will be controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It will observe the status signals from the IP block and also store the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L will be inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDLT unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^−1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2 and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
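Expressed at the bit level, the normalization and index extraction could look like the following sketch (illustrative Python; the 12-bit input fraction and 6-bit index width are assumptions, and the returned shift is what the approximated reciprocal must later be shifted by):

    def normalize_and_index(d_bits, frac_bits=12, index_bits=6):
        # Scale d so its most significant set bit lands at position -1,
        # then use the bits to the right of that bit as the table index.
        msb = d_bits.bit_length() - 1
        shift = (frac_bits - 1) - msb          # left shift count (may be negative)
        scaled = d_bits << shift if shift >= 0 else d_bits >> -shift
        index = (scaled - (1 << (frac_bits - 1))) >> (frac_bits - 1 - index_bits)
        return shift, index

    # d = 3.0 in Q12 is scaled to 0.75; index 32 is halfway into the table
    assert normalize_and_index(3 << 12) == (-2, 32)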

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the use of the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x)    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer with the sign bit of the result as control signal to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
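Generating the 2048-entry table then amounts to a few lines (an illustrative Python sketch; storing the values with 15 fractional bits in the 16-bit words is an assumption that fits the range 0 to log(2)):

    import math

    # 11-bit index i encodes x = i * 2^-8, covering 0 <= x < 8
    table = [round(math.log1p(math.exp(-i / 2**8)) * 2**15) for i in range(2**11)]

    assert table[0] == round(math.log(2) * 2**15)   # largest entry, at x = 0
    assert table[-1] < 16                           # tail is nearly zero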

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is also compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note at how high a frequency the modules can operate. This is described alongside a description of the critical path of each module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operation frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operation frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operation frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2)/ln(2)) = 2^(x · 1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
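A software model of this decomposition could look as follows (illustrative Python; the 8-bit table resolution is an assumed parameter):

    import math

    FRAC_BITS = 8
    POW2 = [2.0 ** (i / 2**FRAC_BITS) for i in range(2**FRAC_BITS)]  # 2^t, 0 <= t < 1

    def exp_approx(x):
        # e^x via Equation 6.4: integer part as a shift, fraction from a table.
        y = x * (1.0 / math.log(2.0))        # precalculated constant 1/ln(2)
        n = math.floor(y)                    # 2^n: binary decoder / shift in hardware
        frac_index = int((y - n) * 2**FRAC_BITS)
        return math.ldexp(POW2[frac_index], n)

    assert abs(exp_approx(1.0) - math.e) / math.e < 2**-FRAC_BITS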

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated further instead. The drawback is that that method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[a b; c d]^−1 = (1/(ad − bc)) · [d −b; −c a]    (6.5)


iff ad − bc ≠ 0, as explained in [Strang, 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1 with minimized control logic is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this approach is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, a complete processor architecture has been developed in [Eilert et al., 2008] that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], a minor addition of hardware makes it possible to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. The following sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design with fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations, but it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
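To make the cost argument concrete, the following Python sketch (an illustration, not part of the thesis implementation) spells out one complex multiplication as four real multiplications and two real additions/subtractions. In hardware the four multiplications are mutually independent, which is what allows them to be performed in parallel and pipelined.

# Sketch: one complex multiplication in real arithmetic.
def complex_mult(ar, ai, br, bi):
    # (ar + j*ai) * (br + j*bi) = (ar*br - ai*bi) + j*(ar*bi + ai*br)
    re = ar * br - ai * bi   # two multiplications, one subtraction
    im = ar * bi + ai * br   # two multiplications, one addition
    return re, im

print(complex_mult(1.0, 2.0, 3.0, 4.0))   # (-5.0, 10.0), i.e. (1 + 2j)*(3 + 4j)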

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as additions of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In customer appliances it is more common with larger system on chips than with individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the result into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger system on chips.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson


1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be discussed to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real time implementation in contemporary and future wireless systems.

Implementation-wise, it should serve as a proof of concept with discussion about possible improvements, rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and limit the workload associated with this thesis to a reasonable amount. The number of antennas supported is considered constant, and the modulation is fixed to 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation, and these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The actual hardware implementation is presented in Chapter 5, where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

Figure 2.1: A MIMO system using Nt transmit and Nr receive antennas.

A real valued MIMO channel can be seen as

y = Hs + e    (2.1)

where H ∈ R^{Nr×Nt}. The matrix H denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver; therefore it contains Nr × Nt elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector s ∈ S^{Nt} contains the modulated symbols that the transmitter will try to send, where S is the set containing the possible symbols. The vector e ∈ R^{Nr} is the noise vector, e ~ N(0, (N0/2)I), containing additive Gaussian noise with zero mean and variance N0/2. Finally, y ∈ R^{Nr} is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. A complex channel is more common, but as described in [Larsson and Jalden, 2008], every complex channel can, given a few prerequisites, be posed as a real model. This is straightforward since C^n is isomorphic to R^{2n}. A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex valued model.
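As an illustration, the Python sketch below shows one common way to construct such a real model; the exact block structure is an assumption for illustration, since the thesis does not spell it out. The real and imaginary parts of the signal vectors are stacked, and the complex channel matrix becomes a real matrix of twice the dimensions.

import numpy as np

# Sketch: real-valued decomposition of a complex MIMO model y_c = H_c s_c + e_c.
def to_real_model(Hc, sc, ec):
    H = np.block([[Hc.real, -Hc.imag],
                  [Hc.imag,  Hc.real]])     # 2Nr x 2Nt real channel matrix
    s = np.concatenate([sc.real, sc.imag])  # stacked real symbol vector
    e = np.concatenate([ec.real, ec.imag])  # stacked real noise vector
    return H, s, e

rng = np.random.default_rng(0)
Hc = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
sc = rng.normal(size=4) + 1j * rng.normal(size=4)
ec = rng.normal(size=4) + 1j * rng.normal(size=4)
H, s, e = to_real_model(Hc, sc, ec)
yc = Hc @ sc + ec
# The real model reproduces the complex model exactly:
assert np.allclose(H @ s + e, np.concatenate([yc.real, yc.imag]))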

2.2 Detection

The principle of detection in MIMO systems is to determine s given y, as described in Equation 2.1. The channel matrix H is assumed to be known to the receiver, which is often the case in practice through estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of s without additional information, while soft detectors provide both an estimate of s and probability information for each bit in the symbols in s. This means that the detector provides information about how accurate the estimated s is on bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis indicates, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. This ratio tells how many times more likely one case is than the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign will show the hard detection, thus whether the detector detected a zero or a one, while the magnitude of the ratio will tell how accurate this detection is. The log-likelihood ratio is

l(si | y) = log [ Σ_{∀s: si=1} exp(−(1/N0)‖y − Hs‖²) / Σ_{∀s: si=0} exp(−(1/N0)‖y − Hs‖²) ]    (2.2)

given that the symbols are uniformly distributed, thus that it is equally probable that a zero or a one is being sent.

The sums in Equation 2.2 are over the sets {s : si = x}, which means all possible vectors s where the ith bit is x = 0 or x = 1, respectively.

The computational effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols of the constellation and exponentially with the number of transmitter antennas Nt. If |S| is the number of possible symbols s can contain, the complexity of the calculation will be proportional to |S|^Nt. This is the big limitation when it comes to MIMO detectors: with the constellation size growing as well as the number of antennas, the computational effort becomes impractical to deal with.
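The scaling can be made concrete with a small brute-force reference model. The Python sketch below (hypothetical, and reduced to the real alphabet S = {−1, +1} so that bit i maps directly to the sign of si) evaluates Equation 2.2 by enumerating all |S|^Nt candidate vectors per bit, which is exactly the growth that makes the exact computation impractical.

import itertools
import numpy as np

# Sketch: exact LLRs of Equation 2.2 by exhaustive enumeration.
def exact_llr(y, H, N0, S=(-1.0, 1.0)):
    Nt = H.shape[1]
    llr = np.zeros(Nt)
    for i in range(Nt):
        num = den = 0.0
        for cand in itertools.product(S, repeat=Nt):   # |S|^Nt candidates
            s = np.asarray(cand)
            metric = np.exp(-np.linalg.norm(y - H @ s) ** 2 / N0)
            if s[i] > 0:          # bit i interpreted as 1
                num += metric
            else:                 # bit i interpreted as 0
                den += metric
        llr[i] = np.log(num / den)
    return llr

rng = np.random.default_rng(1)
H = rng.normal(size=(8, 8))
y = H @ np.ones(8) + 0.1 * rng.normal(size=8)   # all-ones vector transmitted
print(exact_llr(y, H, N0=0.2))                  # all LLRs should come out positive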

Numerous methods exist that deal with this complexity by introducing approximations, such as the sphere decoding in [Chu and McAllister, 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jalden, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

y = H̄s̄ + H̃s̃ + e    (2.3)

The partitioning can be used to group together H̃s̃ + e and treat it as interference and noise.

The partition in Equation 2.3 is dependent on the parameter ns ∈ {1, ..., Nt}, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: H̄ ∈ R^{Nr×ns}, H̃ ∈ R^{Nr×(Nt−ns)}, s̄ ∈ S^{ns} and finally s̃ ∈ S^{Nt−ns}.

The partitioning must be chosen so that the interesting bit si is contained in s̄. To be able to cover all of the available bits, it is necessary to have Nt different partitions, so that there is at least one partition containing each interesting bit.

If ns = 1 it is easy to choose a partition for bit si, since only one exists, but for ns > 1 it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product H^T H. The goal is to minimize the impact of H̃s̃ + e on the selected columns that will be contained in H̄. This is achieved by selecting the column in H^T H that contains the interesting bit, alongside the ns − 1 columns that contain the largest values intersecting the chosen column. This leaves the remaining columns to H̃, and the impact is minimized.

2.3.1 First Stage

Given Equation 2.3, it is possible to choose an approximate model

y ≈ H̄s̄ + n    (2.4)

where n ~ N(0, Q) and Q = H̃H̃^T + (N0/2)I.

The key point of Equation 2.4 is that computations can be simplified by assuming that the interference from H̃s̃ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

λk = log [ Σ_{∀s̄: s̄k=1} exp(−(1/2)(y − H̄s̄)^T Q^{−1} (y − H̄s̄)) / Σ_{∀s̄: s̄k=0} exp(−(1/2)(y − H̄s̄)^T Q^{−1} (y − H̄s̄)) ]    (2.5)

followed by

E{s̄k | y} = tanh(λk/2)    (2.6)

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector s̃. The first step is defining a new model to suppress this vector, and this model is

y′ ≈ H̄s̄ + n′    (2.7)

where n′ ~ N(0, Q′) and Q′ = H̃ΦH̃^T + (N0/2)I. The matrix Φ is the conditional covariance matrix of s̃ and is described as

Φ = E{S̃² | y} − E{S̃ | y}²    (2.8)

In Equation 2.8 the matrix S̃ is a diagonal matrix with the diagonal consisting of the elements from s̃. With all of these computations performed, the model can be assumed to be purified, and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning ns dimensions instead of the original Nt dimensions. This computation is performed for each bit and is described by

l(si | y) ≈ log [ Σ_{∀s̄: si=1} exp(−(1/2)(y′ − H̄s̄)^T Q′^{−1} (y′ − H̄s̄)) / Σ_{∀s̄: si=0} exp(−(1/2)(y′ − H̄s̄)^T Q′^{−1} (y′ − H̄s̄)) ]    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, ns is the complexity parameter of the algorithm and can be assumed to be much smaller than Nt. With ns = Nt the benefits of SUMIS are nonexistent, since H̄ = H and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations that minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^{−1} can be computed for all of the partitions by inverting a larger matrix of dimension Nt, followed by smaller inverses of dimension ns.

2.4 Number Representation

Throughout the thesis, a fixed point number representation is used for the hardware implementation. A fixed point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

X = −x_{N−1}·2^{N−1} + Σ_{i=0}^{N−2} x_i·2^i    (2.10)

which denotes the value of a number X represented by the N bits x_{N−1}, ..., x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range −2^{N−1} ≤ X ≤ 2^{N−1} − 1 can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

X = 2^{−f}·(−x_{N−1}·2^{N−1} + Σ_{i=0}^{N−2} x_i·2^i)    (2.11)

which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^{−f}, which means that f bits have been allocated for the fractional part, while the remaining N − f bits represent the integer part and sign.

The number can be in the range −2^{N−1−f} ≤ X ≤ 2^{N−1−f} − 2^{−f}, in steps of 2^{−f}. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
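A short Python sketch of this representation may clarify the mechanics; the wordlengths below are illustrative and not the ones chosen in Chapter 5. A real number is scaled by 2^f, rounded to an integer code, saturated to the N-bit two's complement range, and the code is then interpreted back as a value.

# Sketch: quantization to an N-bit fixed point format with f fractional bits.
def to_fixed(x, N, f):
    lo, hi = -2 ** (N - 1), 2 ** (N - 1) - 1   # two's complement code range
    code = int(round(x * 2 ** f))              # scale and round
    code = max(lo, min(hi, code))              # saturate
    return code * 2 ** -f                      # value represented by the code

print(to_fixed(3.14159, N=8, f=4))   # 3.125; the resolution is 2^-4 = 0.0625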

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will keep the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. The truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are enough for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks adds overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA and then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, listed in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log-domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.
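The saving from symmetry can be illustrated with a small Python reference model (a sketch, not the hardware structure used in this thesis): only the lower triangle of H^T H is computed explicitly, and the upper triangle is mirrored, which roughly halves the number of multiply-and-add operations.

import numpy as np

# Sketch: forming G = H^T H using only the lower triangle.
def gram_lower(H):
    N = H.shape[1]
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1):                 # j <= i: lower triangle only
            G[i, j] = np.dot(H[:, i], H[:, j])
            G[j, i] = G[i, j]                  # mirror to the upper triangle
    return G

H = np.random.default_rng(2).normal(size=(8, 8))
assert np.allclose(gram_lower(H), H.T @ H)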

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^{−1}, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites and thus be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1/v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
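For verification purposes, Algorithm 3.2 can be transcribed directly into a Python reference model, as sketched below with 0-based indices. The unit diagonal of L is filled in explicitly here so that the result can be checked against Q; the pseudocode leaves it implicit.

import numpy as np

# Sketch: reference model of the LDL^T decomposition in Algorithm 3.2.
def ldlt(Q):
    N = Q.shape[0]
    v = np.zeros(N)
    d = np.zeros(N)
    L = np.zeros((N, N))
    for i in range(N):
        acc = 0.0
        for j in range(i):
            v[j] = L[i, j] * d[j]
            acc += L[i, j] * v[j]
        v[i] = d[i] = Q[i, i] - acc
        rec = 1.0 / v[i]              # provided by the reciprocal unit in hardware
        L[i, i] = 1.0                 # L is unitriangular
        for j in range(i + 1, N):
            acc = 0.0
            for k in range(i):
                acc += L[j, k] * v[k]
            L[j, i] = (Q[j, i] - acc) * rec
    return L, d

A = np.random.default_rng(3).normal(size=(8, 8))
Q = A @ A.T + 8 * np.eye(8)           # symmetric positive definite test matrix
L, d = ldlt(Q)
assert np.allclose(L @ np.diag(d) @ L.T, Q)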


3.3.2 Reciprocal

In the LDL^T decomposition described in Chapter 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x − d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i − f(x_i)/f′(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i(2 − d·x_i) = 2·x_i − d·x_i²    (3.5)

The performance of this algorithm depends on how good the guess of x_i for the first iteration, thus x_0, is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
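The scheme can be sketched in Python as below. The input is assumed to be normalized to [1, 2), which is a common assumption for this type of unit, and the table resolution (4 address bits) and iteration count are illustrative choices rather than the parameters of the implemented unit.

# Sketch: reciprocal via Equation 3.5 with a lookup table seed.
LUT_BITS = 4

def reciprocal(d, iterations=3):
    assert 1.0 <= d < 2.0                          # assumed input range
    seg = int((d - 1.0) * 2 ** LUT_BITS)           # table segment the input falls in
    x = 1.0 / (1.0 + (seg + 0.5) / 2 ** LUT_BITS)  # seed: 1/d at the segment midpoint
    for _ in range(iterations):
        x = x * (2.0 - d * x)                      # each iteration roughly doubles the precision
    return x

print(reciprocal(1.5))   # ~0.6666666667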

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{−1}, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^{−1} by solving the matrix equation

Lx_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{−1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum)/L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
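A direct Python transcription of Algorithm 3.4 (a sketch, with 0-based indices instead of the 1-based pseudocode) confirms that the division-free formulation reproduces L^{−1} for a unitriangular L:

import numpy as np

# Sketch: inversion of a unitriangular L by the optimized forward substitution.
def invert_unitriangular(L):
    N = L.shape[0]
    X = np.zeros((N, N))
    for i in range(N):
        X[i, i] = 1.0                 # assumption 1: unit diagonal
        for j in range(i + 1, N):     # assumption 2: lower triangle only
            acc = L[j, i]             # assumption 3: first term is L[j][i] * 1
            for k in range(i + 1, j):
                acc += L[j, k] * X[k, i]
            X[j, i] = -acc
    return X

L = np.tril(np.random.default_rng(4).normal(size=(8, 8)), -1) + np.eye(8)
assert np.allclose(invert_unitriangular(L) @ L, np.eye(8))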

3.3.4 Final Steps

As of now, L^{−1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{−1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{−1} consists of the reciprocal values of D.

The matrix inverse Q^{−1} can now be obtained by

Q^{−1} = L^{−T} D^{−1} L^{−1}    (3.7)

where the matrix L^{−T} is the transpose of L^{−1}. With these final matrix multiplications the inverse Q^{−1} has been calculated.
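The pieces of Chapter 3.3 can be combined into a single reference model of Equation 3.7 and checked against a library inverse. The sketch below assumes the ldlt() and invert_unitriangular() functions from the earlier sketches are in scope.

import numpy as np

# Sketch: assembling Q^-1 = L^-T D^-1 L^-1 from the intermediate results.
def invert_spd(Q):
    L, d = ldlt(Q)                       # Q = L D L^T
    Linv = invert_unitriangular(L)       # forward substitution
    Dinv = np.diag(1.0 / d)              # reciprocals already computed in the LDL^T stage
    return Linv.T @ Dinv @ Linv

A = np.random.default_rng(5).normal(size=(8, 8))
Q = A @ A.T + 8 * np.eye(8)
assert np.allclose(invert_spd(Q), np.linalg.inv(Q))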

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication is mapped to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation    Log Space
log(a * b)   log(a) + log(b)
log(a / b)   log(a) − log(b)
log(a^b)     b * log(a)

Table 3.1 Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including an exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{−|log(a) − log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding a logarithmic correction term to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
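A Python sketch of such a unit is shown below; the table step and the clamped input range are illustrative choices, not the parameters of the implemented module.

import math

# Sketch: Jacobi logarithm of Equation 3.9 with a precomputed correction table.
TABLE_STEP = 1.0 / 32
TABLE = [math.log1p(math.exp(-i * TABLE_STEP)) for i in range(8 * 32)]

def jacobi_log(log_a, log_b):
    diff = abs(log_a - log_b)
    idx = min(int(diff / TABLE_STEP), len(TABLE) - 1)   # clamp large differences
    return max(log_a, log_b) + TABLE[idx]               # max plus correction term

# log(e^-3 + e^-4), exact versus table-based:
print(math.log(math.exp(-3) + math.exp(-4)), jacobi_log(-3.0, -4.0))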

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs, and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction level used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource       Number of resource units
Slice                  37680
Block RAM (36 Kb)      416
DSP48E1                768
PCI-Express block      2

Table 4.1 An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

522 Interface

The interface to the IP block is called AMBA AXI4-Stream which is originallydeveloped by ARM but adopted by Xilinx as described in [Xilinx Inc 2011a]

For each of the data inputs there exists three signals valid last and data Whendata is transferred to the module valid must be held high and a new elementmust be available each clock cycle When the last element of the matrix is presentthe signal last must be held high during that clock cycle A figure of this behaviorcan be seen in Figure 51

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this is the case chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads the input data from the dual-port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the readout must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus, the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
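For illustration, the contents of such a lookup table can be generated in software; this Python sketch (a model, not part of the VHDL design) shows the row-order to column-order address translation for an 8 × 8 matrix:

    N = 8  # matrix dimension

    # Address i walks the matrix in row order; lut[i] gives the address of the
    # same position when the matrix is traversed in column order instead.
    lut = [(i % N) * N + i // N for i in range(N * N)]

    print(lut[:4], lut[-2:])  # [0, 8, 16, 24] ... [55, 63]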

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.
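To make the data flow concrete, the following Python sketch gives a reference model of the decomposition as described above (Algorithm 3.2 itself appears earlier in the thesis; this standard square-root-free LDL^T formulation is an assumption based on the surrounding description):

    def ldl_decompose(Q):
        """Reference model of an LDL^T decomposition: Q = L * diag(d) * L^T.

        Q is a symmetric positive definite matrix given as a list of rows.
        Returns (L, d) with L unit lower triangular.
        """
        n = len(Q)
        L = [[0.0] * n for _ in range(n)]
        d = [0.0] * n
        for i in range(n):
            # v: pair-wise product of d and the current row of L (columns < i)
            v = [L[i][j] * d[j] for j in range(i)]
            # New diagonal element: subtract the adder-tree sum from Q[i][i]
            d[i] = Q[i][i] - sum(L[i][j] * v[j] for j in range(i))
            L[i][i] = 1.0
            # Remaining column entries use a multiplication by the reciprocal of d[i]
            for k in range(i + 1, n):
                L[k][i] = (Q[k][i] - sum(L[k][j] * v[j] for j in range(i))) * (1.0 / d[i])
        return L, d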


Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual-port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name       Dir   Type                            Comment
clk        in    std_logic                       Input clock
rst_n      in    std_logic                       Reset, active low
start      in    std_logic                       Start computation
addr_in    in    std_logic_vector(5 downto 0)    Input address
data_in    in    sfixed(5 downto -12)            Data input
we         in    std_logic                       Write enable
ready      out   std_logic                       Ready for input
done       out   std_logic                       Computation done
addr_out   in    std_logic_vector(5 downto 0)    Output address
L_data     out   sfixed(2 downto -15)            L matrix output
D_data     out   sfixed(2 downto -15)            D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a one-bit left shift when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.
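As a software model, the following Python sketch captures the behavior described above: dynamic scaling of d into 0.5 ≤ d < 1, a table lookup indexed by the bits below the leading one, and one refinement step. It is assumed here that Equation 3.5 is the classic Newton-Raphson step x(2 − dx), and the table size of 2^6 entries is chosen only for the example:

    INDEX_BITS = 6                       # assumed table size of 2**6 entries
    TABLE = [1.0 / (0.5 + (i + 0.5) * 2.0 ** -(INDEX_BITS + 1))
             for i in range(2 ** INDEX_BITS)]   # guesses for 0.5 <= d < 1

    def reciprocal(d):
        """Model of the reciprocal unit for d > 0."""
        # Dynamic scaling: shift until the leading one sits at position -1,
        # i.e. until 0.5 <= scaled < 1; n counts the shift steps.
        n = 0
        scaled = d
        while scaled >= 1.0:
            scaled /= 2.0
            n += 1
        while scaled < 0.5:
            scaled *= 2.0
            n -= 1
        # Index: drop the always-set bit at position -1 (subtract 0.5) and
        # use the next INDEX_BITS bits as the table address.
        index = int((scaled - 0.5) * 2 ** (INDEX_BITS + 1))
        x0 = TABLE[index]
        x1 = x0 * (2.0 - scaled * x0)    # assumed Newton-Raphson refinement
        # Undo the input scaling: 1/d = (1/scaled) * 2**-n
        return x1 * 2.0 ** -n

    print(reciprocal(3.0))  # approximately 0.3333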

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name     Dir   Type                   Comment
clk      in    std_logic              Input clock
load     in    std_logic              Load new d
d        in    ufixed(5 downto -12)   d input
result   out   ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit, performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.
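As a reference for the control flow, the following Python sketch (an assumption-based reconstruction; Algorithm 3.4 itself appears earlier in the thesis) solves L·X = I column by column using only the multiply-and-accumulate operation; the unit lower triangular L produced by the LDL^T stage makes division unnecessary:

    def forward_substitute_inverse(L):
        """Compute X = L^-1 for a unit lower triangular matrix L by solving
        L * X = I column by column, using only multiply-accumulate."""
        n = len(L)
        X = [[0.0] * n for _ in range(n)]
        for j in range(n):                      # independent equation per column
            for i in range(j, n):
                c = 1.0 if i == j else 0.0      # accumulator starts at I[i][j]
                for k in range(j, i):
                    c = c - L[i][k] * X[k][j]   # MAC: c = c - a * b
                X[i][j] = c                     # unit diagonal: no division needed
        return X

    # Usage: invert a small unit lower triangular matrix
    L = [[1.0, 0.0, 0.0],
         [2.0, 1.0, 0.0],
         [0.5, 3.0, 1.0]]
    print(forward_substitute_inverse(L))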

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name       Purpose
sel        Control input mux to MAC unit
clr        Clear accumulator register
L_x, L_y   X/Y coordinate in L matrix
X_x, X_y   X/Y coordinate in X matrix
W_x, W_y   X/Y coordinate in X matrix for write
we         Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name       Dir   Type                            Comment
clk        in    std_logic                       Input clock
rst_n      in    std_logic                       Reset, active low
start      in    std_logic                       Start computation
addr_in    in    std_logic_vector(5 downto 0)    Input address
data_in    in    sfixed(2 downto -15)            Data input
we         in    std_logic                       Write enable
done       out   std_logic                       Computation done
addr_out   in    std_logic_vector(5 downto 0)    Output address
data_out   out   sfixed(2 downto -15)            X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x)    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.

Since the expression takes interesting values only on a small interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
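To make the table indexing concrete, the following Python sketch is a software model of the unit as described above (the floating point quantization here is illustrative; the hardware uses the fixed point formats from Table 5.5):

    import math

    FRAC_BITS = 8                                   # fractional bits of the index
    TABLE = [math.log(1.0 + math.exp(-i * 2.0 ** -FRAC_BITS))
             for i in range(8 << FRAC_BITS)]        # 2048 entries, 0 <= x < 8

    def jacobi_log(log_a, log_b):
        """Model of the Jacobi logarithm: log(a + b) from log(a) and log(b)."""
        diff = log_a - log_b
        larger = log_b if diff < 0 else log_a       # mux controlled by the sign bit
        x = abs(diff)
        index = min(int(x * 2 ** FRAC_BITS), len(TABLE) - 1)  # saturate x to < 8
        return larger + TABLE[index]

    print(jacobi_log(1.0, 2.0))  # log(e**1 + e**2), approximately 2.3133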

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name     Dir   Type                   Comment
clk      in    std_logic              Input clock
log_a    in    sfixed(5 downto -12)   log(a) input
log_b    in    sfixed(5 downto -12)   log(b) input
result   out   sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · 1/ln(2)) = 2^(x · 1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
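As an illustration of this decomposition, the following Python sketch models the computation in Equations 6.2 to 6.4 (the table resolution of 2^8 entries is an assumption for the example, not a figure from the thesis):

    import math

    FRAC_BITS = 8                                # assumed table resolution
    TABLE = [2.0 ** (i * 2.0 ** -FRAC_BITS)      # 2**frac for 0 <= frac < 1
             for i in range(2 ** FRAC_BITS)]

    def exp_approx(x):
        """Approximate e**x via 2**floor(y) * 2**(y - floor(y)), y = x/ln(2)."""
        y = x * (1.0 / math.log(2.0))            # precalculated constant 1/ln(2)
        ipart = math.floor(y)                    # binary decoder in hardware
        frac = y - ipart                         # table index, 0 <= frac < 1
        return 2.0 ** ipart * TABLE[int(frac * 2 ** FRAC_BITS)]

    print(exp_approx(1.0), math.exp(1.0))  # approximately 2.718 in both cases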

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

    [ a  b ]^-1       1      [  d  -b ]
    [ c  d ]     =  -------  [ -c   a ]    (6.5)
                    ad - bc


iff ad − bc ≠ 0, as explained in [Strang, 2009].
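A direct transcription of Equation 6.5 as a Python sketch:

    def inv2x2(a, b, c, d):
        """Closed-form inverse of [[a, b], [c, d]], valid iff ad - bc != 0."""
        det = a * d - b * c
        assert det != 0, "matrix is singular"
        return [[d / det, -b / det],
                [-c / det, a / det]]

    # Usage: the inverse of [[4, 7], [2, 6]] is [[0.6, -0.7], [-0.2, 0.4]]
    print(inv2x2(4.0, 7.0, 2.0, 6.0))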

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations could allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this approach is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be seen in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, instead using only QR decompositions. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, [Eilert et al., 2008] develops a complete processor architecture capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations, such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high an accuracy as possible given the constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, and these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
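For reference, the decomposition mentioned above, four real multiplications and two additions/subtractions, can be sketched as follows in Python (in hardware the four products can be formed in parallel and pipelined):

    def complex_mult(ar, ai, br, bi):
        """(ar + j*ai) * (br + j*bi) using four real multiplications
        and two additions/subtractions."""
        pr = ar * br - ai * bi   # real part
        pi = ar * bi + ai * br   # imaginary part
        return pr, pi

    print(complex_mult(1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0)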

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In customer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321, Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication, barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson

Page 11: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

2Theory

This chapter describes the background theory that is necessary to comprehendother sections of this thesis

21 MIMO

A MIMO communication system is a communication system that uses multipleantennas for transmission as well as for reception A basic setup of a MIMOsystem can be seen in Figure 21

R1

R2

RNr

Receiver

T1

T2

TNt

Transm

itter

Figure 21 A MIMO system using Nt transmit and Nr receive antennas

A real valued MIMO channel can be seen as

y = Hs + e (21)

3

4 2 Theory

where H isin RNrtimesNt The matrix H denotes the channel matrix Each entry of

the matrix is a possible path from the transmitter to the receiver Therefore itcontains Nr times Nt elements which are all the possible paths from the transmittingantennas to the receiving antennas The vector s isin SNt contains the modulatedsymbols that the transmitter will try to send where S is the set containing thepossible symbols The vector e isin RNr is the noise vector e sim N (0 N0

2 I) containingadditive Gaussian noise with zero mean and N0

2 variance Finally y isin RNr is the

vector with the received symbols as seen by the receiver

As mentioned before the MIMO channel described in Equation 21 is real valuedIt is more common with a complex channel but as described in [Larsson andJalden 2008] every complex channel given a few prerequisites can be posed as areal model This is straightforward since C

n is isomorphic to R2n A real model

is used since it simplifies the explanation of the SUMIS algorithm and this modelcan easily be derived from a complex valued model

22 Detection

The principle of detection in MIMO systems is to determine s given y describedin Equation 21 The channel matrix H is assumed to be known to the receiverand is often so in practice by estimation

Detection can be divided in two subcategories hard detection and soft detectionHard detectors give an estimate of s without additional information while softdetectors provide both an estimate of s and probability information for each bitin the symbols in s This means that the detector provide information of howaccurate the estimated s is on bit level

Since detectors in communication systems are commonly used together with acoding scheme this probability information is useful when trying to decode thereceived symbol If it is known to the decoder that a specific bit in the receivedsymbol has lower probability of being correct it can be possible to achieve a lowererror rate by inverting that bit

As the title of this thesis describes the focus lies mainly on soft detectors

221 Soft Detection

The information that the detector can provide the decoder with is the log-likelihoodratio LLR which is the logarithm of the likelihood ratio Likelihood ratio is a sta-tistical test to compare the fit of two models in this case if a zero or one wastransmitted given the received data This ratio tells how many more times likelyone case is over the other

With this ratio expressed for each of the received bits the decoder can use thisknowledge to decode the received data correctly With the ratio expressed in thelogarithmic domain the sign will show the hard detection thus if the detectordetected a zero or one while the magnitude of the ratio will tell how accurate this

23 SUMIS 5

detection is The log-likelihood ratio is

l(si |y) = log

sum

forallsisinssi=1exp

(minus 1N0y minusHs2

)sum

forallsisinssi=0exp

(minus 1N0y minusHs2

) (22)

given that the symbols are uniformly distributed thus equally probable that azero or one is being sent

The sums in Equation 22 are over the set s si = x which means all possiblevectors s where the ith bit is x = 0 or x = 1 respectively

The computation effort needed to calculate the log-likelihood ratio will growpolynomial with the number of possible symbols of the constellation and expo-nential with the number of transmitter antennas Nt If |S| is all of the possiblesymbols s can contain the complexity of the calculation will be proportional to|S|Nt This is the big limitation when it comes to MIMO detectors with the con-stellation size growing as well as the number of antennas the computation effortwill be impractical to deal with

Numerous methods to deal with this complexity by introducing approximationsexists such as sphere decoding in [Chu and McAllister 2012] The method thatis investigated further in this thesis is SUMIS which is introduced in [Čirkić andLarsson 2012] SUMIS is based upon a mix of two approaches partial marginal-ization and soft interference cancellation Partial marginalization is further de-scribed in [Larsson and Jalden 2008] [Čirkić et al 2011] [Persson and Larsson2011] and [Persson et al 2012] Soft interference cancellation is described in[Lampe and Huber 1999] and [Choi et al 2000]

23 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 21into

y = Hs + Hs + e (23)

The partitioning can be used to group together Hs + e and treat it as interferenceand noise

The partition in Equation 23 is dependent on the parameter ns isin 1 Ntwhich can be seen as a complexity parameter This complexity parameter deter-mines how much effort that will be put in to the detection algorithm The dimen-sions of the partitioned matrices will be as follows H isin R

Nrtimesns H isin RNrtimes(Ntminusns)

s isin Sns and finally s isin SNtminusns

The partitioning must be chosen so that the interesting bit si is contained by sTo be able to cover all of the available bits it means that it is necessary to haveNt different partitions to have at least one partition that contains each interestingbit

6 2 Theory

If ns = 1 it is easy to choose a partition for bit si since there exists only one but forns gt 1 it is a more complex problem In [Čirkić and Larsson 2012 Section 3C] asuitable approach to perform this selection is presented The approach is to basethe selection on the matrix product HTH The goal is to minimize the impact ofHs + e on the selected columns that will be contained in H This is achieved byselecting the column in HTH that contains the interesting bit along side with thens minus 1 columns that contains the largest values intersecting the chosen columnThis will leave the remaining columns to H and the impact will be minimized

231 First Stage

Given Equation 23 it is possible to choose an approximate model

y asymp Hs + n (24)

where n sim N (0Q) and Q = HHT + N02 I

The key point of Equation 24 is that computations can be simplified by assumingthat the interference from Hs can be seen as Gaussian noise With these assump-tions made it is possible to perform the first step of the SUMIS algorithm whichhas the purpose of reducing the impact of the interfering terms This is achievedby computing the conditional expected value of each bit approximately and thiscomputation is performed symbol-wise by first computing

λk = log

sum

forallsisinssk=1exp

(minus1

2 (y minusHs)TQminus1(y minusHs))

sumforallsisinssk=0

exp(minus1

2 (y minusHs)TQminus1(y minusHs)) (25)

followed by

Esk |y = tanh(λk

2

) (26)

232 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the inter-fering vector s The first step is defining a new model to suppress this vector andthis model is

yprime asymp Hs + nprime (27)

where nprime sim N (0Qprime) and Qprime = HΦHT + N02 I The matrix Φ is the conditional

covariance matrix of s and is described as

Φ = ES2|y minus ES|y2 (28)

In Equation 28 the matrix S is a diagonal matrix with the diagonal consisting ofthe elements from s With all of these computations performed the model canbe assumed to be purified and it is possible to calculate the desired LLRs Themain difference from Equation 22 is that these computations in SUMIS are overthe space spanning ns dimensions instead of the original Nt dimensions This

24 Number Representation 7

computation is performed for each bit and is described by

l(si |y) asymp log

sum

forallsisinssi=1exp

(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs))

sumforallsisinssi=0

exp(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs)) (29)

Since the LLRs are the information desired by the decoder the SUMIS algorithmhas completed its task

233 Complexity Selection

As can be seen in the previous sections ns is the complexity parameter of thealgorithm and can be assumed to be much smaller than Nt With ns = Nt thebenefits of SUMIS are non existing since H = H and the complete computation inEquation 22 will be performed The work in [Čirkić and Larsson 2012] furtherdescribes optimizations possible to minimize the computations needed and theseresults have been used when selecting the operations to be analysed One aspectis that the inverse Qminus1 can be computed for all of the partitions by inverting alarger matrix of dimension Nt followed by smaller inverses of dimension ns

24 Number Representation

Throughout the thesis a fixed point number representation is being used for thehardware implementation A fixed point number representation is used to repre-sent a decimal number using a limited number of bits The wordlength denotesthe number of bits used

To be able to understand how the number representation works it is possible tostart with how a regular integer is represented using tworsquos complement This canbe exemplified by

X = minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i (210)

which denotes the value of a number X represented by N bits xNminus1 x0

With a N -bit binary number as described in Equation 210 any integer in therange minus2Nminus1 le X le 2Nminus1 minus 1 can be represented

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, and this can be seen in

$X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right) \quad (2.11)$


which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by $2^{-f}$, which means that f bits have been allocated for the fractional part and the remaining $N - f$ bits represent the integer part and sign.

The number can be in the range $-2^{N-1-f} \le X \le 2^{N-1-f} - 2^{-f}$, in steps of $2^{-f}$. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
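As a small illustration of Equations 2.10 and 2.11, the following Python sketch quantizes a real number into two's complement fixed point and interprets it back; the wordlengths N = 18 and f = 12 are example values only.

    def to_fixed(x, N=18, f=12):
        # Quantize x to an N-bit two's complement code with f fractional
        # bits (Equation 2.11), saturating to the representable range.
        code = int(round(x * 2 ** f))
        lo, hi = -2 ** (N - 1), 2 ** (N - 1) - 1
        return max(lo, min(hi, code))

    def from_fixed(code, f=12):
        # Interpret the stored integer as a real number scaled by 2^-f.
        return code * 2.0 ** -f

    # The resolution is a constant 2^-12 over the whole range:
    print(from_fixed(to_fixed(0.7)))   # 0.699951171875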

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two main areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements that allow the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will retain the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth table. This truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables to the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks provides overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy to the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations described in Chapter 3.1 that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log-domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$AB = C \quad (3.1)$

where $A \in \mathbb{R}^{M \times L}$, $B \in \mathbb{R}^{L \times N}$ and $C \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the


real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $H^T H$, some of the operations could be reduced since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it is possible to reuse it for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

Common ways to calculate the inverse of a larger matrix use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the original sought inverse matrix.

The following sections will describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the $LDL^T$ decomposition described by [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the $LDL^T$ decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The $LDL^T$ decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$Q = LDL^T \quad (3.2)$

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements and $L^T$ is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the $LDL^T$ decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower loop bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q and the output matrix is L, along with the vector d which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1/v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
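A direct Python transcription of Algorithm 3.2 may clarify the data flow; the unit diagonal of L, which the pseudo code leaves implicit, is made explicit here so that the result can be reused directly:

    import numpy as np

    def ldlt(Q):
        # LDL^T decomposition of a symmetric positive definite Q per
        # Algorithm 3.2: Q = L D L^T with L unit lower triangular, d = diag(D).
        N = Q.shape[0]
        v = np.zeros(N)
        d = np.zeros(N)
        L = np.eye(N)                  # unit diagonal made explicit
        for i in range(N):
            s = 0.0
            for j in range(i):
                v[j] = L[i, j] * d[j]
                s += L[i, j] * v[j]
            v[i] = d[i] = Q[i, i] - s
            rec = 1.0 / v[i]           # computed by the reciprocal unit in hardware
            for j in range(i + 1, N):
                s = 0.0
                for k in range(i):
                    s += L[j, k] * v[k]
                L[j, i] = (Q[j, i] - s) * rec
        return L, d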


3.3.2 Reciprocal

In the $LDL^T$ decomposition described in Section 3.3.1 some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal $1/d$ is calculated and the operation $n \cdot \frac{1}{d}$ is subsequently performed.

The reciprocal $1/d$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = 1/d$ and using Newton's method to approximate the root. A suitable function is

$f(x) = \frac{1}{x} - d \quad (3.3)$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \quad (3.4)$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2 \quad (3.5)$

The performance of this algorithm depends on how good the guess of $x_i$ for the first iteration, thus $x_0$, is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that can be correct for up to a few decimals. To store a complete table with the desired final precision is not feasible, since this table would be very large.
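A minimal Python sketch of the iteration in Equation 3.5, assuming the input has already been scaled to $0.5 \le d < 1$ (the scaling is discussed in Chapter 5.3.2); the 64-entry seed table is an assumption made for illustration:

    def reciprocal(d, iterations=2, seed_bits=6):
        # Newton-Raphson reciprocal: x_{i+1} = x_i * (2 - d * x_i) (Equation 3.5).
        assert 0.5 <= d < 1.0
        # Seed from a small lookup table addressed by the bits after bit -1;
        # each entry holds the exact reciprocal of its interval midpoint.
        index = int((d - 0.5) * 2 ** (seed_bits + 1))
        x = 1.0 / (0.5 + (index + 0.5) * 2 ** -(seed_bits + 1))
        for _ in range(iterations):
            x = x * (2.0 - d * x)      # the error roughly squares each iteration
        return x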

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$L x_i = e_i \quad (3.6)$

for $i = 1, \ldots, n$, where $e_i$ is the ith column of the unit matrix and n is the dimension of L. The resulting vectors $x_1, \ldots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum)/L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \ldots, x_n)$ and $e = (e_1, \ldots, e_n)$. If L is of dimension 8, this algorithm needs 224 multiply-and-add, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = -sum
    end for
end for
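A Python transcription of Algorithm 3.4; the function name is chosen here for clarity, and the result can be verified with numpy:

    import numpy as np

    def lower_unitriangular_inverse(L):
        # Algorithm 3.4: invert a unit lower triangular L. The inverse
        # X = L^-1 is also unit lower triangular, so only the elements
        # below the diagonal need to be computed.
        N = L.shape[0]
        X = np.eye(N)
        for i in range(N):
            for j in range(i + 1, N):
                s = L[j, i]
                for k in range(i + 1, j):
                    s += L[j, k] * X[k, i]
                X[j, i] = -s
        return X

    # Check: L @ X should be the identity matrix.
    # assert np.allclose(L @ lower_unitriangular_inverse(L), np.eye(L.shape[0]))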

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the $LDL^T$ decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus $D^{-1}$ consists of the reciprocal values of D.

The matrix inverse $Q^{-1}$ can now be obtained by

$Q^{-1} = L^{-T} D^{-1} L^{-1} \quad (3.7)$

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications the inverse $Q^{-1}$ has been calculated.
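Reusing the ldlt and lower_unitriangular_inverse sketches from the previous sections, the assembly of Equation 3.7 can be checked against numpy on a random positive definite test matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    Q = A @ A.T + 8 * np.eye(8)                  # symmetric positive definite input

    L, d = ldlt(Q)
    L_inv = lower_unitriangular_inverse(L)
    Q_inv = L_inv.T @ np.diag(1.0 / d) @ L_inv   # Equation 3.7

    assert np.allclose(Q_inv, np.linalg.inv(Q))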

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is the fact that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication is mapped to addition, division to subtraction, and exponentiation is mapped to multiplication. A summary of these identities can be seen in Table 3.1.


Operation     Log Space
log(a * b)    log(a) + log(b)
log(a / b)    log(a) - log(b)
log(a^b)      b * log(a)

Table 3.1 Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$\log(a + b) = \log(e^{\log(a)} + e^{\log(b)}) \quad (3.8)$

Note that a and b are not actually stored, but instead their logarithmic counterparts $\log(a)$ and $\log(b)$.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the largest of the two probabilities. The rewrite yields

$\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a), \log(b))}\left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right) \quad (3.9)$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum value of the two probabilities and adding it to the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be $\log(2) \approx 0.69$, and it will approach 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
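A direct Python sketch of Equation 3.9; in the hardware described in Chapter 5.4 the correction term $\log(1 + e^{-x})$ is read from a precomputed table instead:

    import math

    def jacobi_log(log_a, log_b):
        # log(a + b) computed from log(a) and log(b) (Equation 3.9).
        x = abs(log_a - log_b)
        return max(log_a, log_b) + math.log1p(math.exp(-x))

    # Example: log(0.3 + 0.2) recovered from the individual log values.
    print(jacobi_log(math.log(0.3), math.log(0.2)), math.log(0.5))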

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier way to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs, and not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named


fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part only stores the next state into the state registers.

Records have been heavily used since, if the registers are grouped together, it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. This PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also


has a register so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource       Number of resource units
Slice                  37680
Block RAM (36 Kb)      416
DSP48E1                768
PCI-Express block      2

Table 4.1 An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc, 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to perform trade-offs between performance and hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed each time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc, 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1 Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation: one of the first matrix multiplications that has to be calculated, $H^T H$.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and $H^T$ simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
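The contents of such an address lookup table can be generated with a simple index calculation; a Python sketch for the row-stored 8×8 case:

    # Counter value i -> block RAM address of the i:th element in column order.
    lut = [(i % 8) * 8 + i // 8 for i in range(64)]
    print(lut[:3], lut[-2:])   # [0, 8, 16] ... [55, 63]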

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2 Block diagram of the matrix multiplication implementation


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3 Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of a multiple of smaller memories together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4 Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name       Dir   Type                           Comment
clk        in    std_logic                      Input clock
rst_n      in    std_logic                      Reset, active low
start      in    std_logic                      Start computation
addr_in    in    std_logic_vector(5 downto 0)   Input address
data_in    in    sfixed(5 downto -12)           Data input
we         in    std_logic                      Write enable
ready      out   std_logic                      Ready for input
done       out   std_logic                      Computation done
addr_out   in    std_logic_vector(5 downto 0)   Output address
L_data     out   sfixed(2 downto -15)           L matrix output
D_data     out   sfixed(2 downto -15)           D^-1 matrix output

Table 5.1 Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to $0.5 \le d < 1$, it follows that $1 < 1/d \le 2$, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position -1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval $0.5 \le d < 1$ to $0 \le d < 0.5$, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.
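A behavioral Python sketch of the scaling around the Newton-Raphson core, reusing the reciprocal sketch from Chapter 3.3.2; the while loops stand in for the find-MSB-index and shift blocks of Figure 5.5:

    def reciprocal_full(d_in):
        # Scale d_in into [0.5, 1) by shifting, approximate 1/d with
        # Newton-Raphson, then shift the result back.
        assert d_in > 0
        d, n = d_in, 0
        while d >= 1.0:
            d, n = d / 2.0, n + 1          # shift needed for large inputs
        while d < 0.5:
            d, n = d * 2.0, n - 1          # shift needed for small inputs
        return reciprocal(d) * 2.0 ** -n   # 1/d_in = 2^-n * (1/d)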

Figure 5.5 Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name     Dir   Type                   Comment
clk      in    std_logic              Input clock
load     in    std_logic              Load new d
d        in    ufixed(5 downto -12)   d input
result   out   ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

$c = c \pm a \times b \quad (5.1)$

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6 Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing $c = c - a \times b$. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc, 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name       Purpose
sel        Control input mux to MAC unit
clr        Clear accumulator register
L_x, L_y   X, Y coordinate in L matrix
X_x, X_y   X, Y coordinate in X matrix
W_x, W_y   X, Y coordinate in X matrix for write
we         Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address into the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name       Dir   Type                           Comment
clk        in    std_logic                      Input clock
rst_n      in    std_logic                      Reset, active low
start      in    std_logic                      Start computation
addr_in    in    std_logic_vector(5 downto 0)   Input address
data_in    in    sfixed(2 downto -15)           Data input
we         in    std_logic                      Write enable
done       out   std_logic                      Computation done
addr_out   in    std_logic_vector(5 downto 0)   Output address
data_out   out   sfixed(2 downto -15)           X matrix output

Table 5.4 Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If $\log(a)$ and $\log(b)$ are available as input, x can be defined as $x = |\log(a) - \log(b)|$. With x defined, the computation that has to be performed is

$\mathrm{result} = \max(\log(a), \log(b)) + \log(1 + e^{-x}) \quad (5.2)$

Since $\log(a) - \log(b)$ must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative then $\log(b)$ is the largest term and shall be selected, otherwise $\log(a)$. This means that it is possible to use a simple multiplexer with the sign bit of the result as control signal to select the largest value.

The remaining term in the expression presented in Equation 5.2 is $\log(1 + e^{-x})$. A graph of this function can be seen in Figure 5.8.

Figure 5.8 The function $\log(1 + e^{-x})$ on the interval $0 \le x < 8$

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval $0 \le x < 8$ to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover $0 \le x < 8$, it is possible to saturate x to only contain $\log_2(8) = 3$ integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over $0 \le x < 8$ in steps of $2^{-8}$.
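The table contents can be generated offline; a Python sketch of the quantization described above:

    import math

    STEP = 2.0 ** -8                    # 8 fractional bits of x
    TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(2048)]  # 11-bit index

    def correction(x):
        # Saturate x to the covered range [0, 8) and index with 11 bits.
        idx = min(int(x / STEP), 2047)
        return TABLE[idx]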

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name     Dir   Type                   Comment
clk      in    std_logic              Input clock
log_a    in    sfixed(5 downto -12)   log(a) input
log_b    in    sfixed(5 downto -12)   log(b) input
result   out   sfixed(5 downto -12)   Result output

Table 5.5 Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation: both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and were compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, and this corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns far right. The accuracy in the leftmost columns would correspond to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of $2^{-8}$.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes $H^T H$ as described in Chapter 5.2.3.


Resource             Used   Total    Percentage
Flip-flops           3024   301440   1.0 %
LUTs                 1459   150720   1.0 %
Block RAM (36 Kb)    10     416      2.4 %
DSP48E1              8      768      1.0 %

Table 6.1 Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource             Used   Total    Percentage
Flip-flops           831    301440   < 1 %
LUTs                 1802   150720   1.2 %
Block RAM (36 Kb)    9      416      2.2 %
DSP48E1              19     768      2.4 %

Table 6.2 Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource             Used   Total    Percentage
Flip-flops           30     301440   < 1 %
LUTs                 124    150720   < 1 %
Block RAM (36 Kb)    2      416      < 1 %
DSP48E1              1      768      < 1 %

Table 6.3 Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource             Used   Total    Percentage
Flip-flops           180    301440   < 1 %
LUTs                 156    150720   < 1 %
Block RAM (36 Kb)    1      416      < 1 %
DSP48E1              0      768      0 %

Table 6.4 Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the most bits in the Jacobi logarithm unit, it is natural that this is the rounding with the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \quad (6.1)$

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
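As a hedged behavioral sketch, the following Python model runs a hyperbolic CORDIC in rotation mode; in this software view a single rotation sequence yields both sinh and cosh, and the constant CORDIC gain cancels in the quotient of Equation 6.1. The iteration count and the repeated iterations at indices 4 and 13 follow the standard convergence requirements for hyperbolic CORDIC and are not taken from the thesis.

    import math

    def tanh_cordic(z, n_iter=16):
        # Hyperbolic CORDIC, rotation mode: z is driven toward 0 while
        # x -> K*cosh(z0) and y -> K*sinh(z0); the gain K cancels in y/x.
        # Converges for |z| up to roughly 1.118; larger arguments need
        # a separate range reduction.
        x, y = 1.0, 0.0
        i, repeated = 1, set()
        while i <= n_iter:
            d = 1.0 if z >= 0.0 else -1.0
            e = 2.0 ** -i
            x, y = x + d * y * e, y + d * x * e
            z -= d * math.atanh(e)
            if i in (4, 13) and i not in repeated:
                repeated.add(i)        # iterations 4 and 13 run twice
            else:
                i += 1
        return y / x

    print(tanh_cordic(0.5), math.tanh(0.5))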

6.3.2 Exponential Function

In the algorithm it is necessary to compute $e^x$, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

$e^x = e^{x \cdot \frac{\ln(2)}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}} \quad (6.2)$

where $\frac{1}{\ln(2)}$ can be precalculated. This rewrite can be further refined with

$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \quad (6.3)$

If $y = x \cdot \frac{1}{\ln(2)}$ is defined, Equation 6.3 becomes

$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor} \quad (6.4)$

where $2^{\lfloor y \rfloor}$ can be implemented with a simple binary decoder, while $2^{y - \lfloor y \rfloor}$ can be precomputed and stored in a lookup table, with $y - \lfloor y \rfloor$ ranging from 0 to 1.
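A behavioral Python sketch of Equations 6.2 to 6.4, with an assumed 8-bit fractional index into the $2^{y - \lfloor y \rfloor}$ table:

    import math

    INV_LN2 = 1.0 / math.log(2.0)      # the precalculated constant 1/ln(2)
    FRAC_BITS = 8                      # assumed table resolution
    POW2_TABLE = [2.0 ** (i / 2.0 ** FRAC_BITS) for i in range(2 ** FRAC_BITS)]

    def exp_approx(x):
        # e^x = 2^(x / ln 2), split per Equation 6.4 into an integer power
        # of two (a binary decoder/shift in hardware) and a table lookup.
        y = x * INV_LN2
        n = math.floor(y)
        idx = int((y - n) * 2 ** FRAC_BITS)    # 0 <= y - n < 1
        return 2.0 ** n * POW2_TABLE[idx]

    print(exp_approx(1.0), math.e)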

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated further instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension $n_s$ are also needed. If $n_s$ is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \quad (6.5)$


iff $ad - bc \neq 0$, as explained in [Strang, 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of the LLRs, require additional work, where there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. A more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations could allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of these simulations is that they limit the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level $N_0$, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, yet necessary to be able to evaluate a design approach. If software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions and instead using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

As of now, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible with the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
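As a concrete illustration of the operation count, the following sketch (plain C, not taken from the thesis implementation) expands a complex multiplication into its four real multiplications and two additions/subtractions; in hardware the four products could be formed in parallel and merged in a pipelined adder stage.

#include <stdio.h>

typedef struct { double re, im; } cplx;

/* (a + bi)(c + di) = (ac - bd) + (ad + bc)i:
   four real multiplications and two additions/subtractions */
static cplx cplx_mul(cplx x, cplx y)
{
    cplx r;
    r.re = x.re * y.re - x.im * y.im;
    r.im = x.re * y.im + x.im * y.re;
    return r;
}

int main(void)
{
    cplx a = {1.0, 2.0}, b = {3.0, -1.0};
    cplx c = cplx_mul(a, b);
    printf("(%g, %g)\n", c.re, c.im); /* prints (5, 5) */
    return 0;
}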

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors as in [Chu and McAllister, 2012] and [Eilert et al., 2008] and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple subcarriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances it is more common with larger systems on chip than with individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the result in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation in hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsrätt

This document is held available on the Internet, or by its possible future replacement, for a period of 25 years from the date of publication, barring exceptional circumstances.

Access to the document implies permission for anyone to read, to download, to print out single copies for individual use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the copyright holder. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be mentioned as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in such a form or context as is offensive to the author's literary or artistic reputation or individuality.

For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet, or by its possible replacement, for a period of 25 years from the date of publication, barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson



where H ∈ R^(Nr×Nt). The matrix H denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver. Therefore it contains Nr × Nt elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector s ∈ S^(Nt) contains the modulated symbols that the transmitter will try to send, where S is the set containing the possible symbols. The vector e ∈ R^(Nr) is the noise vector e ~ N(0, (N0/2)I), containing additive Gaussian noise with zero mean and N0/2 variance. Finally, y ∈ R^(Nr) is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. It is more common with a complex channel, but as described in [Larsson and Jalden, 2008], every complex channel, given a few prerequisites, can be posed as a real model. This is straightforward since C^n is isomorphic to R^(2n). A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex valued model.

2.2 Detection

The principle of detection in MIMO systems is to determine s given y, as described in Equation 2.1. The channel matrix H is assumed to be known to the receiver, which is often the case in practice through estimation.

Detection can be divided in two subcategories: hard detection and soft detection. Hard detectors give an estimate of s without additional information, while soft detectors provide both an estimate of s and probability information for each bit in the symbols in s. This means that the detector provides information about how accurate the estimated s is on bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. This ratio tells how many times more likely one case is over the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign will show the hard detection, thus whether the detector detected a zero or a one, while the magnitude of the ratio will tell how accurate this detection is. The log-likelihood ratio is

l(s_i \mid y) = \log \frac{\sum_{\forall s \in \mathbb{S} : s_i = 1} \exp\left(-\frac{1}{N_0}\|y - Hs\|^2\right)}{\sum_{\forall s \in \mathbb{S} : s_i = 0} \exp\left(-\frac{1}{N_0}\|y - Hs\|^2\right)}    (2.2)

given that the symbols are uniformly distributed, thus that it is equally probable that a zero or a one is being sent.

The sums in Equation 2.2 are over the set {s : s_i = x}, which means all possible vectors s where the ith bit is x = 0 or x = 1, respectively.

The computational effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols of the constellation and exponentially with the number of transmit antennas Nt. If |S| is the number of possible symbols that s can contain, the complexity of the calculation will be proportional to |S|^Nt. For instance, in the real model of a 4×4 MIMO system with 16-QAM, |S| = 4 and Nt = 8, which already gives 4^8 = 65536 terms to evaluate. This is the big limitation when it comes to MIMO detectors: with the constellation size growing as well as the number of antennas, the computational effort becomes impractical to deal with.

Numerous methods to deal with this complexity by introducing approximations exist, such as the sphere decoding in [Chu and McAllister, 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jalden, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

y = Hs + \bar{H}\bar{s} + e    (2.3)

The partitioning can be used to group together H̄s̄ + e and treat it as interference and noise.

The partition in Equation 2.3 depends on the parameter ns ∈ {1, ..., Nt}, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices are as follows: H ∈ R^(Nr×ns), H̄ ∈ R^(Nr×(Nt−ns)), s ∈ S^(ns) and finally s̄ ∈ S^(Nt−ns).

The partitioning must be chosen so that the interesting bit s_i is contained in s. To be able to cover all of the available bits, it is necessary to have Nt different partitions, so that there is at least one partition containing each interesting bit.


If ns = 1 it is easy to choose a partition for bit s_i, since there exists only one, but for ns > 1 it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product H^T H. The goal is to minimize the impact of H̄s̄ + e on the selected columns that will be contained in H. This is achieved by selecting the column in H^T H that contains the interesting bit, alongside the ns − 1 columns that contain the largest values intersecting the chosen column. This leaves the remaining columns to H̄, and the impact is minimized.
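As a software illustration of this selection step, the C sketch below (an assumption-laden simplification of the criterion in [Čirkić and Larsson, 2012], comparing magnitudes directly) marks the ns columns kept in H for the partition around column i, given the Gram matrix G = H^T H:

#include <stdio.h>
#include <math.h>

#define NT 8  /* number of columns of H in the real model */

/* Mark the ns columns kept in H for the partition around column i:
   column i itself plus the ns-1 columns j with the largest |G[j][i]|,
   where G = H^T H. The unmarked columns form the interference term. */
static void select_partition(const double G[NT][NT], int i, int ns,
                             int keep[NT])
{
    for (int j = 0; j < NT; j++) keep[j] = 0;
    keep[i] = 1;
    for (int n = 1; n < ns; n++) {
        int best = -1;
        for (int j = 0; j < NT; j++) {
            if (keep[j]) continue;
            if (best < 0 || fabs(G[j][i]) > fabs(G[best][i]))
                best = j;
        }
        keep[best] = 1;
    }
}

int main(void)
{
    double G[NT][NT] = {{0}};
    /* toy Gram matrix: column 0 couples strongest to rows 3 and 5 */
    G[3][0] = 0.9; G[5][0] = -0.8; G[1][0] = 0.1;
    int keep[NT];
    select_partition(G, 0, 3, keep);
    for (int j = 0; j < NT; j++) printf("%d", keep[j]);
    printf("\n"); /* prints 10010100 */
    return 0;
}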

2.3.1 First Stage

Given Equation 2.3 it is possible to choose an approximate model

y \approx Hs + n    (2.4)

where n ~ N(0, Q) and Q = H̄H̄^T + (N0/2)I.

The key point of Equation 2.4 is that the computations can be simplified by assuming that the interference from H̄s̄ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

\lambda_k = \log \frac{\sum_{\forall s \in \mathbb{S} : s_k = 1} \exp\left(-\frac{1}{2}(y - Hs)^T Q^{-1}(y - Hs)\right)}{\sum_{\forall s \in \mathbb{S} : s_k = 0} \exp\left(-\frac{1}{2}(y - Hs)^T Q^{-1}(y - Hs)\right)}    (2.5)

followed by

E\{s_k \mid y\} = \tanh\left(\frac{\lambda_k}{2}\right)    (2.6)

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector s̄. The first step is defining a new model that suppresses this vector, and this model is

y' \approx Hs + n'    (2.7)

where n' ~ N(0, Q') and Q' = H̄ΦH̄^T + (N0/2)I. The matrix Φ is the conditional covariance matrix of s̄ and is described as

\Phi = E\{\bar{S}^2 \mid y\} - E\{\bar{S} \mid y\}^2    (2.8)

In Equation 2.8 the matrix S̄ is a diagonal matrix with the diagonal consisting of the elements from s̄. With all of these computations performed, the model can be assumed to be purified and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning ns dimensions instead of the original Nt dimensions. This computation is performed for each bit and is described by

l(s_i \mid y) \approx \log \frac{\sum_{\forall s \in \mathbb{S} : s_i = 1} \exp\left(-\frac{1}{2}(y' - Hs)^T Q'^{-1}(y' - Hs)\right)}{\sum_{\forall s \in \mathbb{S} : s_i = 0} \exp\left(-\frac{1}{2}(y' - Hs)^T Q'^{-1}(y' - Hs)\right)}    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, ns is the complexity parameter of the algorithm and can be assumed to be much smaller than Nt. With ns = Nt the benefits of SUMIS are nonexistent, since H then covers the whole channel matrix and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations to minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^-1 can be computed for all of the partitions by inverting a larger matrix of dimension Nt, followed by smaller inverses of dimension ns.

2.4 Number Representation

Throughout the thesis a fixed point number representation is used for the hardware implementation. A fixed point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i    (2.10)

which denotes the value of a number X represented by the N bits x_{N-1}, ..., x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range −2^(N−1) ≤ X ≤ 2^(N−1) − 1 can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right)    (2.11)

which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^(−f), which means that f bits have been allocated for the fractional part, and the remaining N − f bits represent the integer part and the sign.

The number can be in the range −2^(N−1−f) ≤ X ≤ 2^(N−1−f) − 2^(−f), in steps of 2^(−f). One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
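As a small software illustration of Equation 2.11, the C sketch below encodes a real number into a fixed point integer and decodes it back; the format N = 18, f = 15 is an assumed example, chosen to match the sfixed(2 downto -15) signals used later in the implementation.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Assumed example format: N = 18 bits total, f = 15 fractional bits. */
#define F 15
#define N 18

/* Encode: scale by 2^f and round to the nearest representable step. */
static int32_t to_fixed(double x)   { return (int32_t)lround(x * (1 << F)); }
/* Decode: the interpretation is just a scaling by 2^-f (Equation 2.11). */
static double  to_double(int32_t x) { return (double)x / (1 << F); }

int main(void)
{
    double step = ldexp(1.0, -F);                /* resolution 2^-f    */
    double lo   = -ldexp(1.0, N - 1 - F);        /* -2^(N-1-f)         */
    double hi   =  ldexp(1.0, N - 1 - F) - step; /* 2^(N-1-f) - 2^-f   */
    printf("range [%f, %f], step %g\n", lo, hi, step);
    printf("0.7 -> %d -> %f\n", to_fixed(0.7), to_double(to_fixed(0.7)));
    return 0;
}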

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided in two areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less, because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will keep the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal, it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. Such a truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks causes overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as-is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand, and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, introduced in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log-domain, there are problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^(M×L), B ∈ R^(L×N) and C ∈ R^(M×N).

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that are necessary to compute.
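For reference, a direct C translation of Algorithm 3.1 (a software model only; the hardware instead maps the inner loop onto parallel multipliers and an adder tree):

#include <stdio.h>

#define M 8
#define N 8
#define L 8

/* Naive matrix multiplication, C = A * B (Algorithm 3.1).
   With M = N = L = 8 the loop body executes 8*8*8 = 512 times,
   i.e. 512 multiply-and-add operations. */
static void matmul(const double A[M][L], const double B[L][N],
                   double C[M][N])
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < L; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}

int main(void)
{
    double A[M][L] = {{0}}, B[L][N] = {{0}}, C[M][N];
    for (int i = 0; i < M; i++) { A[i][i] = 2.0; B[i][i] = 3.0; }
    matmul(A, B, C);
    printf("%g\n", C[0][0]); /* prints 6 */
    return 0;
}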

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed form formula exists for calculating the inverse.

Common ways to calculate the inverse of a larger matrix use some sort of decomposition, which decomposes the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^-1, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q, and the output matrix is L, along with the vector d which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
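A double precision C reference model of Algorithm 3.2 can be useful when verifying a fixed point implementation; the version below is a sketch that follows the pseudo code directly, with the unitriangular diagonal of L stored explicitly.

#include <stdio.h>

#define N 8

/* Reference model of Algorithm 3.2: Q = L*D*L^T, with L unit lower
   triangular and d the diagonal of D. The reciprocal rec corresponds
   to the hardware reciprocal unit. */
static void ldlt(const double Q[N][N], double L[N][N], double d[N])
{
    double v[N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            L[i][j] = (i == j) ? 1.0 : 0.0;  /* unitriangular diagonal */
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < i; j++) {
            v[j] = L[i][j] * d[j];
            sum += L[i][j] * v[j];
        }
        v[i] = d[i] = Q[i][i] - sum;
        double rec = 1.0 / v[i];
        for (int j = i + 1; j < N; j++) {
            sum = 0.0;
            for (int k = 0; k < i; k++)
                sum += L[j][k] * v[k];
            L[j][i] = (Q[j][i] - sum) * rec;
        }
    }
}

int main(void)
{
    double Q[N][N], L[N][N], d[N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Q[i][j] = (i == j) ? 8.0 : 1.0;  /* diagonally dominant, SPD */
    ldlt(Q, L, d);
    printf("d[0] = %g, L[1][0] = %g\n", d[0], L[1][0]); /* 8 and 0.125 */
    return 0;
}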


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = \frac{1}{x} - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct up to a few decimals. To store a complete table with the desired final precision is not feasible, since such a table would be very large.
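A C model of the iteration in Equation 3.5, assuming the divisor has already been scaled to 0.5 ≤ d < 1 and that a 64-entry table (an assumed size) would supply the initial guess from the fraction bits; the hardware scaling itself is covered in Chapter 5.3.2.

#include <stdio.h>

/* Newton-Raphson reciprocal: x_{i+1} = x_i * (2 - d * x_i).
   Assumes 0.5 <= d < 1; the initial guess is what a 64-entry
   lookup table indexed by the 6 bits below the leading one would
   hold (computed here instead of stored in a ROM). */
static double reciprocal(double d, int iterations)
{
    int idx = (int)((d - 0.5) * 128.0);           /* 6 fraction bits */
    double x = 1.0 / (0.5 + (idx + 0.5) / 128.0); /* midpoint guess  */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);                    /* quadratic convergence */
    return x;
}

int main(void)
{
    double d = 0.7;
    for (int it = 0; it <= 3; it++)
        printf("it=%d: 1/%g ~ %.12f\n", it, d, reciprocal(d, it));
    return 0;
}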

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^-1, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^-1 by solving the matrix equation

L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^-1.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives the fact that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = -sum
    end for
end for

3.3.4 Final Steps

As of now, L^-1 has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^-1. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^-1 consists of the reciprocal values of D.

The matrix inverse Q^-1 can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^-T is the transpose of L^-1. With these final matrix multiplications, the inverse Q^-1 has been calculated.

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division maps to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.


Operation    Log space
log(a * b)   log(a) + log(b)
log(a / b)   log(a) - log(b)
log(a^b)     b * log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

\log(a + b) = \log(e^{\log(a)} + e^{\log(b)})    (3.8)

Note that a and b are not actually stored; instead their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including an exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a), \log(b))}\left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right)    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two log values and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
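A C sketch of the Jacobi logarithm with such a precomputed correction table; the table size and quantization step are assumptions chosen for illustration.

#include <stdio.h>
#include <math.h>

#define TBL_SIZE 64
#define TBL_STEP 0.125   /* assumed quantization of |log(a) - log(b)| */

static double corr_tbl[TBL_SIZE];

/* Precompute log(1 + exp(-delta)) for quantized differences delta. */
static void init_jacobi(void)
{
    for (int i = 0; i < TBL_SIZE; i++)
        corr_tbl[i] = log1p(exp(-i * TBL_STEP));
}

/* log(a + b) given la = log(a) and lb = log(b), per Equation 3.9. */
static double jacobi_log(double la, double lb)
{
    double mx = la > lb ? la : lb;
    double diff = fabs(la - lb);
    int idx = (int)(diff / TBL_STEP);
    return idx >= TBL_SIZE ? mx : mx + corr_tbl[idx]; /* correction ~ 0 beyond table */
}

int main(void)
{
    init_jacobi();
    double la = log(0.3), lb = log(0.2);
    printf("table: %f  exact: %f\n", jacobi_log(la, lb), log(0.5));
    return 0;
}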

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, thinking of the decimal point as implicit. This is an approach suitable only for very simple designs, and it is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics, instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction level used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided in blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to perform trade-offs regarding performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but it is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs simultaneously would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix and inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
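The contents of such an address lookup table are easy to generate: for a row-wise stored 8×8 matrix, column-order access is the index permutation (i mod 8) * 8 + (i div 8), as the C sketch below prints.

#include <stdio.h>

/* Address permutation for column-order readout of a row-wise 8x8
   matrix: counter value i maps to (i mod 8) * 8 + (i div 8),
   giving the sequence 0, 8, 16, ..., 55, 63. */
int main(void)
{
    for (int i = 0; i < 64; i++)
        printf("%2d%c", (i % 8) * 8 + i / 8, (i % 8 == 7) ? '\n' : ' ');
    return 0;
}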

Everything in the implementation is controlled by a control FSM, which contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation (the input BRAM is read via ports a and b, with an address LUT on one port, feeding the matrix multiplication IP block under a control FSM; the result is written to an output BRAM).


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing a pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i, of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree (operands are taken from the input BRAM, the L-BRAM and the V/D registers; results are stored in the L-BRAM and the V/D registers, with the reciprocal unit in the datapath).

The data path described in Figure 53 also contains the reciprocal unit which isdescribed in detail in Chapter 532 To be able to fully utilize the computationunit it must be possible to access a complete row of the matrix L simultaneouslywhile being able to write an individual element This is possible to achieve usinga dual port block RAM created using CoreGen since it allows for asymmetricaccess ports The dual port block RAM has two sides A and B Side A has a wideread and write port that allows a complete matrix row to be read or written atonce Side B on the other hand has narrow ports that allow a single element to beread or written This asymmetric memory is constructed of a multiple of smallermemories together with some logic to perform the address decoding In this casethe block RAM for L is composed of eight smaller memories as building blocks

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


[Figure: input BRAM (Q) and output BRAM (L) connected to the computation unit and the V/D registers, all coordinated by the control FSM through address, control and load signals.]

Figure 5.4: Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are left out of Figure 5.5 for clarity.

[Figure: data path that finds the MSB index of d, shifts d into range, looks up the initial guess in the table, then squares, multiplies, subtracts and shifts to produce 1/d.]

Figure 5.5: Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

[Figure: a multiplier for inputs a and b feeds an adder/subtracter whose output is stored in a register; a mux controlled by the clear signal selects between 0 and the register for the accumulation, and the register drives the c output.]

Figure 5.6: Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.
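As a behavioural reference, the following Python sketch performs the substitution using only this clear/MAC operation; it assumes that L has a unit diagonal, as produced by the LDLT decomposition, so no division is needed:

    def forward_substitution(L, b):
        N = len(b)
        y = [0.0] * N
        for i in range(N):
            c = b[i]                    # clear accumulator and load b[i]
            for k in range(i):
                c = c - L[i][k] * y[k]  # one multiply-and-accumulate
            y[i] = c                    # store element of the result
        return y

Inverting L then amounts to running this routine once per column of the identity matrix, which is why the independent matrix equations could also be distributed over several MAC units.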

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware cost and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X/Y coordinate in L matrix
X_x, X_y  X/Y coordinate in X matrix
W_x, W_y  X/Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


[Figure: input BRAM (L) and output BRAM (X) feed the MAC unit, with a mux selecting between the constant 1 and data; addresses, write, sel and clr signals come from the control memory, which is addressed by the control counter.]

Figure 5.7: Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in the log domain, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^-x)    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^-x). A graph of this function can be seen in Figure 5.8.

[Figure: plot of log(1 + e^-x) decaying from about 0.7 towards zero over the interval 0 ≤ x < 8.]

Figure 5.8: The function log(1 + e^-x) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, so it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^-8.
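A Python model of the complete unit, using the table dimensions derived above (2048 entries in steps of 2^-8), could look as follows; the names are illustrative:

    import math

    STEP = 2.0 ** -8
    TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(2048)]

    def jacobi_log(log_a, log_b):
        diff = log_a - log_b
        big = log_a if diff >= 0 else log_b     # mux controlled by the sign bit
        idx = min(int(abs(diff) / STEP), 2047)  # saturate x to the table range
        return big + TABLE[idx]                 # max(...) + log(1 + e^-x)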

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

[Figure: log(a) and log(b) enter a subtracter; the sign bit (MSB) controls a mux that selects the larger input, the absolute difference (with selected bits) indexes the lookup table, and an adder combines the two into the result.]

Figure 5.9: Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab, where they were compared against and verified with the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but this is only useful if the module where the results are used utilizes the extra bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and reuses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^-8.

6.2 Resource Usage

The following sections describe the resource usage of each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note the maximum frequency at which each module can operate. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.
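For illustration, a round-to-nearest and saturate step of this kind can be modelled in Python as below; the shift amount and widths are assumptions, not the exact parameters of the IP block:

    def round_saturate(acc, shift=12, out_bits=18):
        rounded = (acc + (1 << (shift - 1))) >> shift  # round to nearest
        hi = (1 << (out_bits - 1)) - 1                 # largest signed code
        lo = -(1 << (out_bits - 1))                    # smallest signed code
        return max(lo, min(hi, rounded))               # saturate on overflow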

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage of the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2)/ln(2)) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) · 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
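A rough software model of Equations 6.2 to 6.4 is sketched below; the table size of 256 entries is an assumption made for illustration:

    import math

    FRAC = 8
    TABLE = [2.0 ** (i * 2.0 ** -FRAC) for i in range(2 ** FRAC)]  # 2^f for 0 <= f < 1
    INV_LN2 = 1.0 / math.log(2.0)                                  # precalculated 1/ln(2)

    def exp_approx(x):
        y = x * INV_LN2                  # change of base, Equation 6.2
        n = math.floor(y)                # 2^n: binary decoder / shift
        f = y - n                        # fractional part in [0, 1)
        return TABLE[int(f * 2 ** FRAC)] * 2.0 ** n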

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed-form formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]^-1 = (1/(ad − bc)) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
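A direct software rendering of Equation 6.5 shows why the closed formula is attractive; only one reciprocal and a handful of multiplications are needed:

    def inv2x2(a, b, c, d):
        r = 1.0 / (a * d - b * c)       # reciprocal of the determinant
        return [[ d * r, -b * r],
                [-c * r,  a * r]]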

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this approach is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions instead. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, making a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
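The cost argument can be made concrete with a one-line model of the complex multiplication; the four products below can be formed in parallel in hardware:

    def cmul(ar, ai, br, bi):
        return (ar * br - ai * bi,  # real part
                ar * bi + ai * br)  # imaginary part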

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as additions of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, but it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 91-44-00110-X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0-8018-5414-8.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet — eller dess framtida ersättare — under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för icke-kommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson

Page 13: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

23 SUMIS 5

detection is The log-likelihood ratio is

l(si |y) = log

sum

forallsisinssi=1exp

(minus 1N0y minusHs2

)sum

forallsisinssi=0exp

(minus 1N0y minusHs2

) (22)

given that the symbols are uniformly distributed thus equally probable that azero or one is being sent

The sums in Equation 22 are over the set s si = x which means all possiblevectors s where the ith bit is x = 0 or x = 1 respectively

The computation effort needed to calculate the log-likelihood ratio will growpolynomial with the number of possible symbols of the constellation and expo-nential with the number of transmitter antennas Nt If |S| is all of the possiblesymbols s can contain the complexity of the calculation will be proportional to|S|Nt This is the big limitation when it comes to MIMO detectors with the con-stellation size growing as well as the number of antennas the computation effortwill be impractical to deal with

Numerous methods to deal with this complexity by introducing approximationsexists such as sphere decoding in [Chu and McAllister 2012] The method thatis investigated further in this thesis is SUMIS which is introduced in [Čirkić andLarsson 2012] SUMIS is based upon a mix of two approaches partial marginal-ization and soft interference cancellation Partial marginalization is further de-scribed in [Larsson and Jalden 2008] [Čirkić et al 2011] [Persson and Larsson2011] and [Persson et al 2012] Soft interference cancellation is described in[Lampe and Huber 1999] and [Choi et al 2000]

23 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 21into

y = Hs + Hs + e (23)

The partitioning can be used to group together Hs + e and treat it as interferenceand noise

The partition in Equation 23 is dependent on the parameter ns isin 1 Ntwhich can be seen as a complexity parameter This complexity parameter deter-mines how much effort that will be put in to the detection algorithm The dimen-sions of the partitioned matrices will be as follows H isin R

Nrtimesns H isin RNrtimes(Ntminusns)

s isin Sns and finally s isin SNtminusns

The partitioning must be chosen so that the interesting bit si is contained by sTo be able to cover all of the available bits it means that it is necessary to haveNt different partitions to have at least one partition that contains each interestingbit

6 2 Theory

If ns = 1 it is easy to choose a partition for bit si since there exists only one but forns gt 1 it is a more complex problem In [Čirkić and Larsson 2012 Section 3C] asuitable approach to perform this selection is presented The approach is to basethe selection on the matrix product HTH The goal is to minimize the impact ofHs + e on the selected columns that will be contained in H This is achieved byselecting the column in HTH that contains the interesting bit along side with thens minus 1 columns that contains the largest values intersecting the chosen columnThis will leave the remaining columns to H and the impact will be minimized

231 First Stage

Given Equation 23 it is possible to choose an approximate model

y asymp Hs + n (24)

where n sim N (0Q) and Q = HHT + N02 I

The key point of Equation 24 is that computations can be simplified by assumingthat the interference from Hs can be seen as Gaussian noise With these assump-tions made it is possible to perform the first step of the SUMIS algorithm whichhas the purpose of reducing the impact of the interfering terms This is achievedby computing the conditional expected value of each bit approximately and thiscomputation is performed symbol-wise by first computing

λk = log

sum

forallsisinssk=1exp

(minus1

2 (y minusHs)TQminus1(y minusHs))

sumforallsisinssk=0

exp(minus1

2 (y minusHs)TQminus1(y minusHs)) (25)

followed by

Esk |y = tanh(λk

2

) (26)

232 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the inter-fering vector s The first step is defining a new model to suppress this vector andthis model is

yprime asymp Hs + nprime (27)

where nprime sim N (0Qprime) and Qprime = HΦHT + N02 I The matrix Φ is the conditional

covariance matrix of s and is described as

Φ = ES2|y minus ES|y2 (28)

In Equation 28 the matrix S is a diagonal matrix with the diagonal consisting ofthe elements from s With all of these computations performed the model canbe assumed to be purified and it is possible to calculate the desired LLRs Themain difference from Equation 22 is that these computations in SUMIS are overthe space spanning ns dimensions instead of the original Nt dimensions This

24 Number Representation 7

computation is performed for each bit and is described by

l(si |y) asymp log

sum

forallsisinssi=1exp

(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs))

sumforallsisinssi=0

exp(minus1

2 (yprime minusHs)TQprimeminus1(yprime minusHs)) (29)

Since the LLRs are the information desired by the decoder the SUMIS algorithmhas completed its task

233 Complexity Selection

As can be seen in the previous sections ns is the complexity parameter of thealgorithm and can be assumed to be much smaller than Nt With ns = Nt thebenefits of SUMIS are non existing since H = H and the complete computation inEquation 22 will be performed The work in [Čirkić and Larsson 2012] furtherdescribes optimizations possible to minimize the computations needed and theseresults have been used when selecting the operations to be analysed One aspectis that the inverse Qminus1 can be computed for all of the partitions by inverting alarger matrix of dimension Nt followed by smaller inverses of dimension ns

24 Number Representation

Throughout the thesis a fixed point number representation is being used for thehardware implementation A fixed point number representation is used to repre-sent a decimal number using a limited number of bits The wordlength denotesthe number of bits used

To be able to understand how the number representation works it is possible tostart with how a regular integer is represented using tworsquos complement This canbe exemplified by

X = minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i (210)

which denotes the value of a number X represented by N bits xNminus1 x0

With a N -bit binary number as described in Equation 210 any integer in therange minus2Nminus1 le X le 2Nminus1 minus 1 can be represented

With the knowledge of how to represent whole numbers it is possible to move onto decimal numbers These numbers can be represented by allocating a numberof bits for the integer part of the number and the rest for the fractional part Thisis achieved by applying a scaling factor to the number and this can be seen in

X = 2minusf lowast (minusxNminus1 lowast 2Nminus1 +Nminus2sumi=0

xi lowast 2i) (211)

8 2 Theory

which also features a N -bit binary number like the one in Equation 210 but thistime representing a decimal number

The number represented by Equation 211 is scaled by 2minusf which means thatf bits has been allocated for the fractional part and the remaining N minus f bitsrepresent the integer part and sign

The number can be in the range minus2Nminus1minusf le X le 2Nminus1minusf minus2minusf in steps of 2minusf Onebig difference compared to a floating point representation is that the resolutionis constant over the whole number range

25 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis an intro-duction to digital design and hardware is necessary

Digital circuits can mainly be divided in two main areas combinatorial and se-quential Combinatorial circuits perform boolean algebra on a given set of inputto produce one or multiple output signals It has no memory and thus the outputis only dependent on the provided input Given the ability to express booleanalgebra many different kind of circuits can be constructed some examples areadders which can add two numbers and multiplexers that work as switches withmultiple inputs and one output

The drawback with purely combinatorial circuits is that they are state-less be-cause of the lack of memory Sequential logic on the other hand groups togethercombinatorial circuits with memory elements that allows the circuit to not onlytake into account the input signals but also the current state The basic memoryelement of a sequential circuit is called a flip-flop A common D-type flip-flophas a data input data output and a clock input The flip-flop will only changeits output value on the rising edge of the clock otherwise it will contain the oldvalue

With sequential logic it is possible to create more advanced circuits such as finitestate machines counters and registers A register is constructed using a flip-flopand a multiplexer and it has a load signal When the load signal is low the oldvalue will remain regardless of the clock signal When the load signal is high andthere is a rising clock edge a new value will be stored in the register

Random access memories are very important in digital circuits and heavily usedin this thesis Such memories are much more suitable than flip-flops when thereis a need to store greater amounts of data since they are more area efficient Thememories have an address port a data port and a write signal With an addressprovided the data stored at that particular address will be available on the dataport with a certain delay Using the write signal it is possible to store new datainto the memory by selecting the correct address provide data on the data portand asserting the write signal

26 Programmable Hardware 9

A more detailed introduction to digital design if necessary can be obtained from[Danielsson and Bengtsson 1996]

26 Programmable Hardware

When it comes to programmable hardware the current choice is often to use anFPGA An FPGA is a field-programmable gate array that can be configured toimplement almost any digital design

An FPGA is build up of small logic blocks that can be configured and connectedto each other to implement different functions Instead of using logic gates suchas AND OR and NOT boolean functions are represented by their truth tableThis truth table is stored in a small component called LUT The LUT is a lookuptable with the input variables to the boolean function connected as an addressand the output is the value stored in the truth table This allows a 4 input LUTto implement any boolean function with at maximum 4 inputs Additional LUTscan be interconnected to implement boolean functions with more inputs

An FPGA does not only contain LUTs but also flip-flops that can be connectedto the output of a LUT which makes it possible to implement sequential circuitsmentioned in Chapter 25 All of these small components can be connected al-most arbitrarily using a pre-existing routing network in the FPGA

These components are necessary for a simple FPGA to function but contempo-rary devices often include more hardware Since the interconnection betweenthe building blocks provide overhead the manufacturers often add additionalbuilding blocks that the customers are likely to use such as multipliers and ran-dom access memories If a memory were to be implemented using only flip-flopsthe overhead would be substantial and this would limit what else that can be im-plemented at the same time The same reasoning is valid for multipliers sincemultiplication is complex to implement with the aid of only LUTs Since multi-plication is a common operation the manufacturers are likely to include prefabri-cated blocks

261 Hardware Flow

From the designerrsquos point of view the hardware is described using a hardware de-scription language such as VHDL or Verilog The hardware is described in termsof software even though the code is supposed to be a description of hardwareand not be executed on the hardware itself The written code can be simulated asit is to verify the behaviour even if not everything that can be simulated can betransformed to hardware

The source code that describes the hardware can be synthesised into a netlist ofbuilding blocks such as LUTs and flip-flops appropriate for the targeted FPGAdevice This can be seen as an analogy to how a compiler compiles softwarewritten in a high-level language into a low-level language

10 2 Theory

The synthesised netlist can then be analysed by a tool referred to as place-and-route which organizes the building blocks into a structure suitable for the FPGAThe place-and-route then attempts to connect them using the routing networkavailable in the FPGA The result is a configuration file that can be loaded intothe FPGA using a configuration interface such as JTAG

262 Reusable Modules

With increasing demands on a fast time-to-market it has become more commonto reuse existing building blocks as much as possible These blocks are commonlyreferred to as IP cores or IP blocks where IP stands for intellectual propertyThese blocks can be anything from a simple counter to a complete processor andcan be seen in analogy to the software world as a library

This allows for a shorter implementation cycle since each IP blockrsquos functionalitycan be verified beforehand and the block can often easily be integrated with therest of the design

It is common for FPGA manufacturers to provide a collection of simpler IP coresthat can be used on their devices The form the IP block is delivered in varies itcan be for example readable VHDL code or an already synthesised netlist

3Problem Analysis

This chapter provide an analysis of a subset of the operations described in Chap-ter 31 that are needed for implementation of the SUMIS algorithm

31 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for fur-ther analysis and hardware implementation Since the algorithm relies heavilyon matrix operations such as matrix multiplication and matrix inversion thesesubproblems are described further in Chapter 32 and Chapter 33

Since probabilities are handled in the log-domain there exist problems that hasto be accounted for when summarizing them This is described in Chapter 34

32 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm Both matrix-matrix and matrix-vector multiplications are used heavily A standard matrixmultiplication is described by

AB = C (31)

where A isin RMtimesL B isin RLtimesN and C isin RMtimesN

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but instead introduce several additions and subtractions, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be avoided, since the result is symmetric around the diagonal; a sketch of this shortcut is shown below. The drawback of these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.
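As an illustration of the symmetry shortcut mentioned above, the following C sketch computes only the lower triangle of G = H^T H and mirrors the result. It is a software model under assumed names and floating-point arithmetic, not the VHDL implementation:

#include <stddef.h>

#define N 8

/* Computes G = H^T * H for a real N x N matrix H. Since G is symmetric,
 * only the lower triangle (36 of 64 elements for N = 8) is computed;
 * the upper triangle is mirrored. */
static void gram_lower(const double H[N][N], double G[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j <= i; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)  /* (H^T H)_ij = sum_k H_ki * H_kj */
                sum += H[k][i] * H[k][j];
            G[i][j] = G[j][i] = sum;
        }
}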

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed-form formula exists for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^{-1}, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites and thus utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the upper limit.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1/v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2, a temporary vector denoted v is required to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
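For reference, a minimal C model of Algorithm 3.2 is sketched below; it uses floating-point arithmetic and assumed helper names, whereas the hardware version described later uses fixed-point numbers and a dedicated reciprocal unit:

#define N 8

/* LDL^T decomposition of a symmetric positive definite Q = L*D*L^T.
 * L is unit lower triangular (the upper triangle is left untouched, so
 * the caller should zero-initialize it); d holds the diagonal of D. */
static void ldlt_decompose(const double Q[N][N], double L[N][N], double d[N])
{
    double v[N];
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < i; j++) {
            v[j] = L[i][j] * d[j];        /* pair-wise product with row i of L */
            sum += L[i][j] * v[j];
        }
        v[i] = d[i] = Q[i][i] - sum;      /* new diagonal element */
        double rec = 1.0 / v[i];          /* reciprocal, see Section 3.3.2 */
        L[i][i] = 1.0;
        for (int j = i + 1; j < N; j++) { /* fill column i below the diagonal */
            sum = 0.0;
            for (int k = 0; k < i; k++)
                sum += L[j][k] * v[k];
            L[j][i] = (Q[j][i] - sum) * rec;
        }
    }
}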


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * 1/d is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = \frac{1}{x} - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i(2 - d x_i) = 2x_i - d x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
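A C sketch of the method is given below. The table size (64 entries) and the three iterations are illustrative assumptions, chosen so that a seed error of roughly 2^-6 is squared down to double precision:

/* Newton-Raphson reciprocal (Equation 3.5) for 0.5 <= d < 1, seeded from a
 * small lookup table. Each iteration squares the relative error, i.e.
 * roughly doubles the number of correct bits. */
static double reciprocal(double d)
{
    static double seed[64];
    static int init = 0;
    if (!init) {                          /* seed[i] ~ 1/midpoint of bucket i */
        for (int i = 0; i < 64; i++)
            seed[i] = 1.0 / (0.5 + (i + 0.5) / 128.0);
        init = 1;
    }
    int idx = (int)((d - 0.5) * 128.0);   /* the bits right of the leading one */
    double x = seed[idx];
    for (int it = 0; it < 3; it++)
        x = x * (2.0 - d * x);            /* x_{i+1} = x_i(2 - d*x_i) */
    return x;
}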

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions, and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following facts can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also implies that the diagonal of x will consist of only ones.

The second assumption changes the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits of the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes, the number of operations is greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
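A direct C translation of Algorithm 3.4 is sketched below (floating point, assumed names); it computes X = L^{-1} for a unit lower triangular L, writing only the lower triangle:

#define N 8

/* Inverts a unit lower triangular matrix: X = L^{-1}. Only the lower
 * triangle of X is written; the caller should zero the upper part. */
static void invert_unitriangular(const double L[N][N], double X[N][N])
{
    for (int i = 0; i < N; i++) {
        X[i][i] = 1.0;                    /* diagonal of the inverse is ones */
        for (int j = i + 1; j < N; j++) {
            double sum = L[j][i];         /* the k = i term, since X[i][i] = 1 */
            for (int k = i + 1; k < j; k++)
                sum += L[j][k] * X[k][i];
            X[j][i] = -sum;
        }
    }
}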

3.3.4 Final Steps

At this point, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.
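The combination step can be expressed compactly; the following C sketch (assumed names) forms Q^{-1} directly from L^{-1} and the reciprocal diagonal values, instead of materializing L^{-T} and D^{-1} as separate matrices:

#define N 8

/* Q^{-1} = L^{-T} * D^{-1} * L^{-1}, where Linv = L^{-1} and dinv holds the
 * reciprocals of the diagonal of D. Element-wise this is
 * Qinv[i][j] = sum_k Linv[k][i] * dinv[k] * Linv[k][j]. */
static void combine_inverse(const double Linv[N][N], const double dinv[N],
                            double Qinv[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += Linv[k][i] * dinv[k] * Linv[k][j];
            Qinv[i][j] = sum;
        }
}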

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation     Log space
log(a * b)    log(a) + log(b)
log(a / b)    log(a) - log(b)
log(a^b)      b * log(a)

Table 3.1 Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

\log(a + b) = \log(e^{\log(a)} + e^{\log(b)})    (3.8)

Note that a and b are not actually stored; instead, their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including an exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a),\log(b))}\left(1 + e^{-|\log(a)-\log(b)|}\right)\right)
                               = \max(\log(a), \log(b)) + \log(1 + e^{-|\log(a)-\log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
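In C, the whole operation can be modeled in a few lines; a hardware version replaces the log/exp pair with the lookup table discussed above (the function name is an assumption):

#include <math.h>

/* Jacobi logarithm (Equation 3.9): returns log(a + b) given log(a) and
 * log(b). The correction term log(1 + e^{-x}) is at most log(2). */
static double jacobi_log(double log_a, double log_b)
{
    double max = (log_a > log_b) ? log_a : log_b;
    double x = fabs(log_a - log_b);
    return max + log1p(exp(-x));
}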

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers were in the different sections of the algorithm and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL, when working with fixed-point numbers, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed-point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point numbers. The package allows the wordlengths, both integer and fractional, to be configured easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack flexibility but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together, it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18-bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource      Number of resource units
Slice                 37680
Block RAM (36 Kb)     416
DSP48E1               768
PCI-Express block     2

Table 4.1 An overview of interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules forthe SUMIS algorithm

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), where this denotes A + 1 integer bits and B fractional bits, with position 0 implicitly being the decimal point.



5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc, 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304. It is not feasible to route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc, 2011a].

For each of the data inputs there exist three signals: valid, last, and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1 Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances of the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus, the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63, as sketched below.
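A C sketch of how such a table can be generated (names are assumptions):

#define DIM 8

/* Entry i holds the row-major address of the i-th element when the matrix
 * is traversed in column order, i.e. 0, 8, 16, ..., 55, 63 for DIM = 8. */
static void build_transpose_lut(unsigned lut[DIM * DIM])
{
    for (unsigned i = 0; i < DIM * DIM; i++)
        lut[i] = (i % DIM) * DIM + i / DIM;
}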

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2 Block diagram of the matrix multiplication implementation


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the matrix multiplication unit previously described in Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row of L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3 Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual-port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from multiple smaller memories, together with some logic that performs the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built from an FSM that controls the memories, the computation unit, and the registers.


Figure 5.4 Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^{-1} matrix output

Table 5.1 Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant one-bit of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0, the initial guess for input = 0.5 must be stored, while at the last index, the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
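A C sketch of the scaling and index extraction is shown below. The 18-bit ufixed(5 downto -12) input format is taken from Table 5.2, while the 256-entry table size and the use of GCC's __builtin_clz are illustrative assumptions:

#include <stdint.h>

/* Normalizes a non-zero ufixed(5 downto -12) value (bit 0 = 2^-12) so that
 * its leading one lands at the weight-2^-1 position (bit 11), returns the
 * 8 bits below the leading one as a table index, and reports the shift so
 * the approximated reciprocal can be shifted back correspondingly. */
static unsigned reciprocal_index(uint32_t d_fix, int *shift)
{
    int msb = 31 - __builtin_clz(d_fix);   /* position of the leading one */
    *shift = msb - 11;                     /* 11 = position of weight 2^-1 */
    uint32_t scaled = (*shift >= 0) ? (d_fix >> *shift)
                                    : (d_fix << -*shift);
    return (scaled >> 3) & 0xFF;           /* drop the implicit one, keep 8 bits */
}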

One additional adaptation of Equation 3.5 is that a multiplication by 2 corresponds to shifting the binary point one place, i.e. a one-bit shift of the number. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5 Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c \pm a \times b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6 Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit, performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc, 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name        Purpose
sel         Control input mux to MAC unit
clr         Clear accumulator register
L_x, L_y    X, Y coordinate in L matrix
X_x, X_y    X, Y coordinate in X matrix
W_x, W_y    X, Y coordinate in X matrix for write
we          Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4 Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

\mathrm{result} = \max(\log(a), \log(b)) + \log(1 + e^{-x})    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected; otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{-x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8 The function log(1 + e^{-x}) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression approaches zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^{-8}.
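A C sketch of this quantization (the function name is an assumption):

/* Saturates x to the table range [0, 8) and quantizes it in steps of 2^-8,
 * yielding an 11-bit address (3 integer + 8 fractional bits, 2048 entries). */
static unsigned jacobi_index(double x)
{
    if (x < 0.0) x = 0.0;
    if (x >= 8.0) x = 8.0 - 1.0 / 256.0;   /* saturate; max index is 2047 */
    return (unsigned)(x * 256.0);
}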

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5 Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to determine how accurate the hardware was compared to ideal computations performed with double-precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^{-8}.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note at how high a frequency the modules can operate. This is described alongside a description of the critical path of each module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H, as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1 Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage of the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2 Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3 Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4 Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6, the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

\tanh(x) = \frac{\sinh(x)}{\cosh(x)}    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm, it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea for implementing this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts by rewriting the base of the calculations from e to 2, with

e^x = e^{x \ln(2) / \ln(2)} = 2^{x \cdot \frac{1}{\ln(2)}}    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor}    (6.3)

If y = x \cdot \frac{1}{\ln(2)} is defined, Equation 6.3 becomes

2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor}    (6.4)

where 2^{\lfloor y \rfloor} can be implemented with a simple binary decoder, while 2^{y - \lfloor y \rfloor} can be precomputed and stored in a lookup table, with y - \lfloor y \rfloor ranging from 0 to 1.
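A C model of the whole scheme is sketched below; here the 2^f table lookup is replaced by a pow call for clarity, and the power of two is applied with ldexp, which corresponds to the hardware shift or decoder:

#include <math.h>

/* e^x via Equations 6.2-6.4: y = x / ln(2) is split into an integer part
 * (a shift / binary decoder in hardware) and a fractional part 2^f that
 * would be read from a lookup table with f in [0, 1). */
static double exp_via_pow2(double x)
{
    double y = x * (1.0 / 0.69314718055994531);  /* x * 1/ln(2), precomputed */
    double f = y - floor(y);                     /* 0 <= f < 1 */
    return ldexp(pow(2.0, f), (int)floor(y));    /* 2^floor(y) * 2^f */
}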

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}    (6.5)

iff ad - bc \neq 0, as explained in [Strang, 2009].
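A C sketch of this closed form (assumed names; the caller must guarantee a non-zero determinant):

/* Closed-form 2x2 inverse, Equation 6.5; valid iff ad - bc != 0. */
static void inv2x2(const double A[2][2], double Ainv[2][2])
{
    double rdet = 1.0 / (A[0][0] * A[1][1] - A[0][1] * A[1][0]);
    Ainv[0][0] =  A[1][1] * rdet;
    Ainv[0][1] = -A[0][1] * rdet;
    Ainv[1][0] = -A[1][0] * rdet;
    Ainv[1][1] =  A[0][0] * rdet;
}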

6.3.4 Control Structure

So far, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, as there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations could allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback is that differing wordlengths limit the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N_0, and with a floating-point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high-level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High-level synthesis is not a quick solution that automatically transforms software into hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high-level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section, these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided among eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built from a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions. It uses a fixed-point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex-valued model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations, such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex-valued model instead. For a pure software implementation, this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, and these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices, and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture, as described in Chapter 6.6.2, were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable, since the workload of detection increases if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate it into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that need to be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321, Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson


If n_s = 1 it is easy to choose a partition for bit s_i, since there exists only one, but for n_s > 1 it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product H^T H. The goal is to minimize the impact of H̄s̄ + e on the selected columns that will be contained in H̃. This is achieved by selecting the column in H^T H that contains the interesting bit, alongside the n_s - 1 columns that contain the largest values intersecting the chosen column. This will leave the remaining columns to H̄, and the impact will be minimized.

2.3.1 First Stage

Given Equation 2.3, it is possible to choose an approximate model

y ≈ H̃s̃ + ñ    (2.4)

where ñ ~ N(0, Q) and Q = H̄H̄^T + (N_0/2) I.

The key point of Equation 2.4 is that computations can be simplified by assuming that the interference from H̄s̄ can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by computing the conditional expected value of each bit approximately, and this computation is performed symbol-wise by first computing

λ_k = log [ Σ_{∀s̃∈S̃: s_k=1} exp( -½ (y - H̃s̃)^T Q^{-1} (y - H̃s̃) ) /
            Σ_{∀s̃∈S̃: s_k=0} exp( -½ (y - H̃s̃)^T Q^{-1} (y - H̃s̃) ) ]    (2.5)

followed by

E{s_k | y} = tanh(λ_k / 2)    (2.6)

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector s̄. The first step is defining a new model to suppress this vector, and this model is

y′ ≈ H̃s̃ + n′    (2.7)

where n′ ~ N(0, Q′) and Q′ = H̄ΦH̄^T + (N_0/2) I. The matrix Φ is the conditional covariance matrix of s̄ and is described as

Φ = E{S̄² | y} - E{S̄ | y}²    (2.8)

In Equation 2.8, the matrix S̄ is a diagonal matrix with the diagonal consisting of the elements from s̄. With all of these computations performed, the model can be assumed to be purified and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning n_s dimensions instead of the original N_t dimensions. This computation is performed for each bit and is described by

l(s_i | y) ≈ log [ Σ_{∀s̃∈S̃: s_i=1} exp( -½ (y′ - H̃s̃)^T Q′^{-1} (y′ - H̃s̃) ) /
               Σ_{∀s̃∈S̃: s_i=0} exp( -½ (y′ - H̃s̃)^T Q′^{-1} (y′ - H̃s̃) ) ]    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has hereby completed its task.
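As a side note, the ratios in Equations 2.5 and 2.9 are in practice evaluated in log space. The following Python sketch (a modeling aid, not part of the thesis implementation; the function name and test values are illustrative) computes such an LLR given the exponents -½ (y′ - H̃s̃)^T Q′^{-1} (y′ - H̃s̃) for the candidate vectors s̃:

import math

def llr(exponents_one, exponents_zero):
    # log( sum(exp(e)) over bit = 1 ) - log( sum(exp(e)) over bit = 0 ),
    # evaluated stably by normalizing with the largest exponent
    # (cf. the Jacobi logarithm in Chapter 3.4).
    def log_sum_exp(es):
        m = max(es)
        return m + math.log(sum(math.exp(e - m) for e in es))
    return log_sum_exp(exponents_one) - log_sum_exp(exponents_zero)

print(llr([-1.2, -3.0], [-2.5, -2.8]))   # positive, so the bit is more likely 1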

2.3.3 Complexity Selection

As can be seen in the previous sections, n_s is the complexity parameter of the algorithm and can be assumed to be much smaller than N_t. With n_s = N_t the benefits of SUMIS are nonexistent, since H̃ = H and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations that minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^{-1} can be computed for all of the partitions by inverting a larger matrix of dimension N_t, followed by smaller inverses of dimension n_s.

2.4 Number Representation

Throughout the thesis, a fixed point number representation is used for the hardware implementation. A fixed point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

X = -x_{N-1} · 2^{N-1} + Σ_{i=0}^{N-2} x_i · 2^i    (2.10)

which denotes the value of a number X represented by N bits x_{N-1}, ..., x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range -2^{N-1} ≤ X ≤ 2^{N-1} - 1 can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, which can be seen in

X = 2^{-f} · ( -x_{N-1} · 2^{N-1} + Σ_{i=0}^{N-2} x_i · 2^i )    (2.11)

which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^{-f}, which means that f bits have been allocated for the fractional part and the remaining N - f bits represent the integer part and the sign.

The number can be in the range -2^{N-1-f} ≤ X ≤ 2^{N-1-f} - 2^{-f}, in steps of 2^{-f}. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
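To make the representation concrete, the following Python sketch (a modeling aid with illustrative wordlengths, not part of the thesis implementation) quantizes a real number to the format of Equation 2.11 and converts it back:

def to_fixed(value, N=18, f=12):
    # Quantize to an N-bit two's complement word with f fractional bits,
    # saturating at the range limits stated above.
    word = round(value * 2**f)
    lo, hi = -2**(N - 1), 2**(N - 1) - 1
    return max(lo, min(hi, word))

def from_fixed(word, f=12):
    # Interpret the stored integer as the real number it represents.
    return word * 2**-f

w = to_fixed(0.69314718)
print(w, from_fixed(w))   # 2839 0.693115234375, an error below 2^-12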

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two main areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements that allow the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will retain the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.

A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. The truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks provides overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.

The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, described in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log-domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
  for j = 1 → N do
    sum = 0
    for k = 1 → L do
      sum = sum + A[i][k] * B[k][j]
    end for
    C[i][j] = sum
  end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced, since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it is possible to reuse it for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula does not exist for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the originally sought inverse matrix.

The following sections will describe the steps involved to calculate the inverse, denoted Q^{-1}, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDL^T Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q and the output matrix is L, along with the vector d which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
  sum = 0
  for j = 1 → i - 1 do
    v[j] = L[i][j] * d[j]
    sum = sum + L[i][j] * v[j]
  end for
  v[i] = d[i] = Q[i][i] - sum
  rec = 1/v[i]
  for j = i + 1 → N do
    sum = 0
    for k = 1 → i - 1 do
      sum = sum + L[j][k] * v[k]
    end for
    L[j][i] = (Q[j][i] - sum) * rec
  end for
end for

In Algorithm 3.2, it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
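For reference, a direct Python transcription of Algorithm 3.2 is given below, using zero-based indexing instead of the one-based indexing of the pseudo code (a modeling sketch in the spirit of Chapter 4.1, not the hardware implementation):

def ldlt(Q):
    # LDL^T decomposition of a symmetric positive definite matrix Q.
    # Returns the unit lower triangular L and the diagonal d such that
    # Q = L * diag(d) * L^T.
    N = len(Q)
    L = [[0.0] * N for _ in range(N)]
    v = [0.0] * N
    d = [0.0] * N
    for i in range(N):
        s = 0.0
        for j in range(i):
            v[j] = L[i][j] * d[j]
            s += L[i][j] * v[j]
        v[i] = d[i] = Q[i][i] - s
        L[i][i] = 1.0                    # L is unitriangular
        rec = 1.0 / v[i]                 # the reciprocal of Section 3.3.2
        for j in range(i + 1, N):
            s2 = sum(L[j][k] * v[k] for k in range(i))
            L[j][i] = (Q[j][i] - s2) * rec
    return L, d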

3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i - f(x_i) / f'(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i (2 - d * x_i) = 2 x_i - d x_i²    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the originally sought inverse described in Chapter 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.

Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
  for j = 1 → N do
    sum = 0
    for k = 1 → j - 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = (e[j][i] - sum) / L[j][j]
  end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives the fact that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes, the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.

Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
  x[i][i] = 1
  for j = i + 1 → N do
    sum = L[j][i]
    for k = i + 1 → j - 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = -sum
  end for
end for

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.
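Putting the pieces from the previous sections together, the whole inversion chain can be modeled in a few lines of Python, reusing the sketched helper functions ldlt and invert_unit_lower from above (again a model, not the hardware):

def invert_spd(Q):
    # Q^{-1} = L^{-T} D^{-1} L^{-1}, so element (i, j) of the inverse is
    # the sum over k of Linv[k][i] * (1 / d[k]) * Linv[k][j].
    L, d = ldlt(Q)                       # Chapter 3.3.1
    Linv = invert_unit_lower(L)          # Chapters 3.3.2 and 3.3.3
    N = len(Q)
    return [[sum(Linv[k][i] * Linv[k][j] / d[k] for k in range(N))
             for j in range(N)] for i in range(N)]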

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is the fact that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication is mapped to addition, division to subtraction, and exponentiation is mapped to multiplication. A summary of these identities can be seen in Table 3.1.

Operation    Log space
log(a * b)   log(a) + log(b)
log(a / b)   log(a) - log(b)
log(a^b)     b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might be very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{-|log(a) - log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{-|log(a) - log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two log probabilities and adding to it the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precomputed and stored in a table to allow faster computations.
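A Python sketch of Equation 3.9, with the correction term taken from a precomputed table, is shown below; the table granularity of 2^{-8} anticipates the choice made in Chapter 5.4, and the names are illustrative:

import math

STEP = 2**-8
CORRECTION = [math.log1p(math.exp(-n * STEP)) for n in range(8 * 2**8)]

def jacobi_log(log_a, log_b):
    # Compute log(a + b) from log(a) and log(b) via Equation 3.9.
    x = abs(log_a - log_b)
    n = int(x / STEP)
    corr = CORRECTION[n] if n < len(CORRECTION) else 0.0   # table covers 0 <= x < 8
    return max(log_a, log_b) + corr

print(jacobi_log(math.log(0.3), math.log(0.4)))   # approximately log(0.7)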

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This is an approach suitable only for very simple designs, and it is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by HiTech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations it entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.

5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to perform trade-offs regarding performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 * 64 * 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus, the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
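The contents of that address lookup table are easy to generate. A short Python sketch of the row-to-column address mapping for an 8x8 matrix (the hardware table simply stores these 64 constants):

# Address i walks the matrix row-wise; table[i] is the address of the same
# sequence position when walking column-wise instead.
table = [(i % 8) * 8 + i // 8 for i in range(64)]
print(table[:4], table[-2:])             # [0, 8, 16, 24] ... [55, 63]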

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimal amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDL^T unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^{-1} matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position -1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
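The scaling and index derivation can be sketched in Python, operating on the raw integer bit pattern of the fixed point word (the wordlength follows the unit's d input with 12 fractional bits; the helper and the 4-bit index width are assumptions for illustration):

def normalize(d_word, f=12, index_bits=4):
    # Scale a fixed point word to [0.5, 1) and derive the shift and the
    # table index formed by the bits directly after the always-set bit -1.
    msb = d_word.bit_length() - 1        # position of the leading one
    shift = (f - 1) - msb                # steps needed to move it to bit -1
    scaled = d_word << shift if shift >= 0 else d_word >> -shift
    index = (scaled >> (f - 1 - index_bits)) & ((1 << index_bits) - 1)
    return shift, index

# d = 3.25 is scaled to 0.8125; the approximated reciprocal is later
# shifted back by the same number of steps.
print(normalize(int(3.25 * 2**12)))      # (-2, 10)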

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a one-place left shift when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a * b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing c = c - a * b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) - log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^{-x})    (5.2)

Since log(a) - log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{-x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{-x}) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^{-8}.
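Generating the 2048-entry table is straightforward. A Python sketch follows, quantizing each entry to a 16-bit word with, as an assumption here, 15 fractional bits (the exact storage format of the entries is not specified above):

import math

# 2048 entries covering 0 <= x < 8 in steps of 2^-8, matching the 3 integer
# and 8 fractional address bits. 15 fractional bits suffice for the data,
# since log(1 + e^-x) never exceeds log(2).
table = [round(math.log1p(math.exp(-n * 2**-8)) * 2**15) for n in range(2048)]
print(table[0] / 2**15)                  # approximately log(2) = 0.6931 for x = 0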

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but this is of limited use if the module where the results are used utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^{-8}.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes $H^T H$ as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the widest word in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \quad (6.1)$$

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
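As a rough illustration of the idea, the following double-precision C sketch runs hyperbolic CORDIC in rotation mode; it is a behavioral model only, not the fixed-point hardware, and the iteration count and naming are assumptions. Both x and y converge with the same gain K, so K cancels in the ratio and tanh needs no gain compensation.

#include <math.h>

/* Hyperbolic CORDIC in rotation mode. In hardware, 2^-k is a wired
 * shift and atanh(2^-k) a small constant table. Iterations k = 4 and
 * k = 13 are repeated, as convergence requires. Valid for |z| < ~1.1. */
static double cordic_tanh(double z)
{
    double x = 1.0, y = 0.0;
    int k = 1, repeat = 4;
    for (int iter = 0; iter < 20; iter++) {
        double t = ldexp(1.0, -k);            /* 2^-k: a shift      */
        double d = (z >= 0.0) ? 1.0 : -1.0;   /* rotation direction */
        double xn = x + d * y * t;
        double yn = y + d * x * t;
        z -= d * atanh(t);                    /* table entry        */
        x = xn;
        y = yn;
        if (k == repeat)                      /* redo this k once   */
            repeat = (repeat == 4) ? 13 : -1;
        else
            k++;
    }
    return y / x;                             /* sinh / cosh        */
}

In this model, cordic_tanh(0.5) agrees with tanh(0.5) ≈ 0.4621 to roughly five decimal places; a fixed-point version would replace the final division with the reciprocal-and-multiply scheme from Chapter 3.3.2.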

6.3.2 Exponential Function

In the algorithm it is necessary to compute $e^x$ to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from $e$ to 2 with

$$e^x = e^{x \cdot \ln(2) \cdot \frac{1}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}} \quad (6.2)$$

where $\frac{1}{\ln(2)}$ can be precalculated. This rewrite can be further refined with

$$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \quad (6.3)$$

If $y = x \cdot \frac{1}{\ln(2)}$ is defined, Equation 6.3 becomes

$$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor} \quad (6.4)$$

where $2^{\lfloor y \rfloor}$ can be implemented with a simple binary decoder, while $2^{y - \lfloor y \rfloor}$ can be precomputed and stored in a lookup table, with $y - \lfloor y \rfloor$ ranging from 0 to 1.
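A small C model of this decomposition might look as follows; the 8-bit table index is an assumption chosen only to mirror the constrained-lookup idea, and all names are hypothetical.

#include <math.h>

#define TBITS 8                       /* table indexed by 8 fraction bits */
static const double LN2 = 0.6931471805599453;
static double pow2_frac[1 << TBITS];  /* 2^(i/256) for i = 0..255         */

static void init_pow2(void)
{
    for (int i = 0; i < (1 << TBITS); i++)
        pow2_frac[i] = pow(2.0, (double)i / (1 << TBITS));
}

/* e^x = 2^(x/ln 2): the integer part of the exponent becomes a binary
 * decoder (a shift), the fractional part a table lookup (Eq. 6.2-6.4). */
static double exp_model(double x)
{
    double y  = x * (1.0 / LN2);                 /* 1/ln(2) precalculated  */
    double fl = floor(y);
    int idx   = (int)((y - fl) * (1 << TBITS));  /* quantized y - floor(y) */
    return ldexp(pow2_frac[idx], (int)fl);       /* 2^floor(y) * 2^frac(y) */
}

The table quantization bounds the relative error at roughly $2^{-\text{TBITS}} \cdot \ln 2$, so the table depth would have to be chosen from the accuracy requirements of the LLR computation.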

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated further instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension $n_s$ are also needed. If $n_s$ is small, for instance 2, there exist closed-form formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \quad (6.5)$$

iff $ad - bc \neq 0$, as explained in [Strang, 2009].
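Expressed in C, the closed form reduces to one reciprocal and a handful of multiplications; this sketch uses doubles and hypothetical names, whereas a hardware version would reuse the Newton-Raphson reciprocal unit from Chapter 3.3.2.

/* Closed-form inverse of a 2x2 matrix (Equation 6.5); returns 0 if singular. */
static int inv2x2(const double m[2][2], double out[2][2])
{
    double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    if (det == 0.0) return 0;            /* ad - bc must be non-zero */
    double r = 1.0 / det;                /* one reciprocal           */
    out[0][0] =  m[1][1] * r;
    out[0][1] = -m[0][1] * r;
    out[1][0] = -m[1][0] * r;
    out[1][1] =  m[0][0] * r;
    return 1;
}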

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such tailoring is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level $N_0$, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and the software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built from a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

As of now, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors as in [Chu and McAllister, 2012] and [Eilert et al., 2008] and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances it is more common with larger systems on chip than with individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, but it is also prohibitive when trying to integrate it into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight the implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321, Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet — eller dess framtida ersättare — under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson



computation is performed for each bit and is described by

$$l(s_i \,|\, y) \approx \log \frac{\sum_{\forall s \in \mathbb{S}: s_i=1} \exp\left(-\frac{1}{2}(y' - Hs)^T Q'^{-1}(y' - Hs)\right)}{\sum_{\forall s \in \mathbb{S}: s_i=0} \exp\left(-\frac{1}{2}(y' - Hs)^T Q'^{-1}(y' - Hs)\right)} \quad (2.9)$$

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has thereby completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, $n_s$ is the complexity parameter of the algorithm and can be assumed to be much smaller than $N_t$. With $n_s = N_t$ the benefits of SUMIS are nonexistent, since the partition then equals the full channel matrix $H$ and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations that minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse $Q^{-1}$ can be computed for all of the partitions by inverting a larger matrix of dimension $N_t$, followed by smaller inverses of dimension $n_s$.

2.4 Number Representation

Throughout the thesis, a fixed point number representation is used for the hardware implementation. A fixed point representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

$$X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i \quad (2.10)$$

which denotes the value of a number $X$ represented by the $N$ bits $x_{N-1}, \ldots, x_0$.

With an $N$-bit binary number as described in Equation 2.10, any integer in the range $-2^{N-1} \le X \le 2^{N-1} - 1$ can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

$$X = 2^{-f} \cdot \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right) \quad (2.11)$$


which also features an $N$-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by $2^{-f}$, which means that $f$ bits have been allocated for the fractional part and the remaining $N - f$ bits represent the integer part and sign.

The number can be in the range $-2^{N-1-f} \le X \le 2^{N-1-f} - 2^{-f}$, in steps of $2^{-f}$. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
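As a concrete illustration, the following C sketch decodes a raw two's complement word under this interpretation. The bit pattern is arbitrary, and the 18-bit format with 12 fraction bits is borrowed from the implementation chapters purely for concreteness.

#include <stdio.h>

int main(void)
{
    const int N = 18, f = 12;           /* wordlength and fraction bits */
    int raw = 0x2ACAB & ((1 << N) - 1); /* arbitrary 18-bit pattern     */

    /* sign-extend the N-bit field, then apply the scaling factor 2^-f */
    int v = (raw & (1 << (N - 1))) ? raw - (1 << N) : raw;
    double value = (double)v / (1 << f);

    printf("raw = 0x%05X -> %f\n", raw, value);
    printf("range [%f, %f] in steps of %g\n",
           -(double)(1 << (N - 1 - f)),               /* -2^(N-1-f)     */
           ((double)((1 << (N - 1)) - 1)) / (1 << f), /* 2^(N-1-f)-2^-f */
           1.0 / (1 << f));                           /* 2^-f           */
    return 0;
}

With N = 18 and f = 12 the representable range is [-32, 31.99975...] in steps of about 0.000244, matching the constant-resolution property described above.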

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two main areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are state-less because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements that allow the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will retain the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port after a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. Such a truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs, but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are all that is necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else can be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be processed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor, and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, introduced in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$$AB = C \quad (3.1)$$

where $A \in \mathbb{R}^{M \times L}$, $B \in \mathbb{R}^{L \times N}$ and $C \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit of a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication – naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If $N = M = L = 8$, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $H^T H$, the number of operations could be reduced, since the result will be symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it is possible to reuse it for all of the matrix multiplications of the same dimension that have to be computed.
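As an illustration of the symmetry argument, here is a C sketch of the Gram computation $G = H^T H$ (the names and the double type are choices made here, not the fixed-point hardware): only the lower triangle is computed and then mirrored, roughly halving the multiply count at the cost of a structure specific to this one operation.

#define N 8

/* G = H^T * H for a real N x N H, exploiting the symmetry of the result. */
static void gram(const double H[N][N], double G[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += H[k][i] * H[k][j];   /* column i dot column j */
            G[i][j] = G[j][i] = sum;        /* mirror to upper half  */
        }
}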

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the originally sought inverse matrix.

The following sections will describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix $Q$, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described by[Golub and Van Loan 1996] The decomposition is closely related to Choleskydecomposition also described by the previously mentioned authors

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$$Q = LDL^T \quad (3.2)$$

where $L$ is a lower triangular matrix, $D$ is a diagonal matrix containing only positive elements, and $L^T$ is the transpose of $L$. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix $Q$ is of dimension $N$. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
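For reference, a direct C translation of Algorithm 3.2 might look as follows; this is a double-precision sketch where the dimension, the names, and the explicit unit diagonal are choices made here, not part of the hardware design.

#define N 8

/* LDL^T decomposition following Algorithm 3.2. Q must be symmetric
 * positive definite; L becomes unit lower triangular, d holds the
 * diagonal of D. */
static void ldlt(const double Q[N][N], double L[N][N], double d[N])
{
    double v[N];
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < i; j++) {
            v[j] = L[i][j] * d[j];
            sum += L[i][j] * v[j];
        }
        v[i] = d[i] = Q[i][i] - sum;
        double rec = 1.0 / v[i];        /* the reciprocal unit in hardware */
        L[i][i] = 1.0;
        for (int j = i + 1; j < N; j++) {
            sum = 0.0;
            for (int k = 0; k < i; k++)
                sum += L[j][k] * v[k];
            L[j][i] = (Q[j][i] - sum) * rec;
        }
    }
}

The single division per column, 1/v[i], is the operation that the hardware replaces with the Newton-Raphson reciprocal described in Chapter 3.3.2.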


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number $n$ by $d$, the reciprocal $\frac{1}{d}$ is calculated and the operation $n \cdot \frac{1}{d}$ is subsequently performed.

The reciprocal $\frac{1}{d}$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = \frac{1}{d}$ and using Newton's method to approximate the root. A suitable function is

$$f(x) = \frac{1}{x} - d \quad (3.3)$$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \quad (3.4)$$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$$x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2 \quad (3.5)$$

The performance of this algorithm depends on how good the guess $x_0$ for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
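The following C sketch combines a small seed table with Equation 3.5; the 6-bit table, the normalization of d to [1, 2), and the use of exactly two iterations are illustrative assumptions. Each iteration roughly doubles the number of correct bits, so a seed accurate to about 7 bits reaches well beyond 18-bit precision after two iterations.

/* Newton-Raphson reciprocal: x_{i+1} = x_i * (2 - d * x_i).
 * d is assumed normalized to [1, 2), as a hardware unit would ensure. */
static double recip_nr(double d)
{
    /* hypothetical 6-bit seed table: 1/d sampled at interval midpoints */
    static double seed[64];
    static int init = 0;
    if (!init) {
        for (int i = 0; i < 64; i++)
            seed[i] = 1.0 / (1.0 + (i + 0.5) / 64.0);
        init = 1;
    }
    int idx = (int)((d - 1.0) * 64.0);  /* top fraction bits of d      */
    double x = seed[idx];
    x = x * (2.0 - d * x);              /* iteration 1: ~14 good bits  */
    x = x * (2.0 - d * x);              /* iteration 2: ~28 good bits  */
    return x;
}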

3.3.3 Forward Substitution

When the lower triangular matrix $L$ has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the originally sought inverse described in Section 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$$L x_i = e_i \quad (3.6)$$

for $i = 1, \ldots, n$, where $e_i$ is the $i$th column of the unit matrix and $n$ is the dimension of $L$. The resulting vectors $x_1, \ldots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution – general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \ldots, x_n)$ and $e = (e_1, \ldots, e_n)$. If $L$ is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of $x$ will consist of only ones.

The second assumption changes the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits of the innermost loop, since the upper triangular part of $x$ will be zero.

Since $e$ is a unit matrix, the first multiply-and-add operation, when $k = i$, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If $L$ is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution – optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since $D$ is diagonal and thus $D^{-1}$ consists of the reciprocal values of $D$.

The matrix inverse $Q^{-1}$ can now be obtained by

$$Q^{-1} = L^{-T} D^{-1} L^{-1} \quad (3.7)$$

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications, the inverse $Q^{-1}$ has been calculated.

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.


Operation      Log Space
log(a * b)     log(a) + log(b)
log(a / b)     log(a) − log(b)
log(a^b)       b * log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log(e^{\log(a)} + e^{\log(b)}) \quad (3.8)$$

Note that $a$ and $b$ are not actually stored, but instead their logarithmic counterparts $\log(a)$ and $\log(b)$.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities $a$ or $b$ is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

$$\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a),\log(b))}\left(1 + e^{-|\log(a)-\log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a)-\log(b)|}\right) \quad (3.9)$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding it to the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be $\log(2) \approx 0.69$, and it will approach 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
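As a small worked example (with values chosen here purely for illustration), take $\log(a) = -10$ and $\log(b) = -12$:

$$\log(a + b) = \max(-10, -12) + \log(1 + e^{-|(-10)-(-12)|}) = -10 + \log(1 + e^{-2}) \approx -10 + 0.1269 = -9.8731$$

The correction term $0.1269$ is exactly the kind of value that the precomputed table would store, indexed by the difference 2.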

4 Methodology and Equipment

This chapter describes the methodology and technology used in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This is an approach suitable only for very simple designs, and it is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed when transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource      Number of resource units
Slice                 37680
Block RAM (36 Kb)     416
DSP48E1               768
PCI-Express block     2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc, 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs simultaneously would require a bus of width 18 × 64 × 2 = 2304. It is not feasible to route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc, 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.
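The handshake can be sketched behaviorally as follows; this is an illustrative Python model that assumes the downstream side is always ready, which the real interface does not require:

def stream_matrix(elements):
    # Drive one element per clock cycle with valid high; assert last
    # together with the final element of the matrix.
    n = len(elements)
    for i, data in enumerate(elements):
        yield {"data": data, "valid": 1, "last": 1 if i == n - 1 else 0}

matrix = list(range(64))               # an 8x8 matrix in row-major order
for signals in stream_matrix(matrix):
    pass                               # a testbench would apply these signals here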

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
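The contents of this lookup table are straightforward to generate; the following illustrative snippet reproduces the mapping for an 8 × 8 matrix:

N = 8
# Counter value i is mapped to the address of element (i mod 8, i div 8),
# i.e. the matrix is traversed in column order instead of row order.
lut = [(i % N) * N + (i // N) for i in range(N * N)]
print(lut[0:3], lut[62], lut[63])      # [0, 8, 16] 55 63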

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L will be inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDLT unit with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of a multiple of smaller memories together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, computation unit and registers.

Figure 5.4: Block diagram of the LDLT unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one of the input number must reside in position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a one-bit left shift when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.
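The data flow of the unit can be summarized with an illustrative Python model: dynamic scaling to 0.5 ≤ d < 1, a table lookup indexed by the bits after the always-set leading one, and a refinement step. Equation 3.5 is not repeated here, so the refinement below assumes the Newton-Raphson update x1 = x0(2 − d·x0), and the table size of 256 entries is likewise an assumed value:

TABLE_BITS = 8                          # assumed table size: 2^8 entries
TABLE = [1.0 / (0.5 + (i + 0.5) / 2 ** (TABLE_BITS + 1))
         for i in range(2 ** TABLE_BITS)]

def reciprocal(d):
    assert d > 0
    shifts = 0
    while d >= 1.0:                     # scale down until d < 1
        d /= 2.0
        shifts += 1
    while d < 0.5:                      # scale up until d >= 0.5
        d *= 2.0
        shifts -= 1
    index = int((d - 0.5) * 2 ** (TABLE_BITS + 1))  # drop the constant MSB
    x0 = TABLE[index]                   # initial guess, 1 < x0 <= 2
    x1 = x0 * (2.0 - d * x0)            # assumed Newton-Raphson refinement
    return x1 / 2 ** shifts             # undo the input scaling

print(reciprocal(3.0), 1.0 / 3.0)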

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc, 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.
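Algorithm 3.4 is not reproduced in this chapter, so as a hedged illustration of what the unit computes, the following Python sketch inverts a unit lower triangular matrix L by forward substitution, where every inner step is exactly the MAC operation c = c − a × b from Figure 5.6:

def invert_unit_lower(L):
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):                  # solve L * x_j = e_j, one column at a time
        X[j][j] = 1.0                   # the unit diagonal of L gives X[j][j] = 1
        for i in range(j + 1, n):
            c = 0.0                     # clear the accumulator register
            for k in range(j, i):
                c = c - L[i][k] * X[k][j]   # the MAC operation c = c - a*b
            X[i][j] = c
    return X

L = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.25, 0.125, 1.0]]
X = invert_unit_lower(L)                # multiplying L by X gives the identity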


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x)    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer with the sign bit of the result as control signal to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.

Since the expression takes on significant values only on a small interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging from 0 to 8 in steps of 2^−8.
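Putting the pieces together, the behavior of the module can be sketched as follows; this is an illustrative floating point model, while the real unit uses fixed point arithmetic and a block RAM for the table:

import math

FRAC = 8                                # 8 fractional bits of the index
TABLE = [math.log(1.0 + math.exp(-i / 2 ** FRAC)) for i in range(8 * 2 ** FRAC)]

def jacobi_log(log_a, log_b):
    diff = log_a - log_b
    larger = log_b if diff < 0 else log_a        # mux steered by the sign bit
    x = abs(diff)
    index = min(int(x * 2 ** FRAC), len(TABLE) - 1)  # saturate x to [0, 8)
    return larger + TABLE[index]

# log(e^1 + e^2) evaluated in the log domain, compared with the exact value:
print(jacobi_log(1.0, 2.0), math.log(math.exp(1.0) + math.exp(2.0)))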

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
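As an illustrative sketch of the idea (a hardware version would use fixed point numbers and a precomputed atanh table), the following Python model runs hyperbolic CORDIC in rotation mode. Iterations 4 and 13 are repeated, as the hyperbolic variant requires for convergence, and since tanh is the quotient sinh/cosh, the constant CORDIC gain cancels without any separate compensation:

import math

def cordic_tanh(z, rotations=16):       # converges for roughly |z| < 1.11
    x, y = 1.0, 0.0
    i, done = 1, 0
    repeats = {4, 13}                   # iterations that must be performed twice
    while done < rotations:
        sigma = 1.0 if z >= 0.0 else -1.0
        e = 2.0 ** -i                   # the shift amount 2^-i
        x, y = x + sigma * y * e, y + sigma * x * e
        z -= sigma * math.atanh(e)      # angle table entry atanh(2^-i)
        if i in repeats:
            repeats.discard(i)          # stay on i once more
        else:
            i += 1
        done += 1
    return y / x                        # the gain in x and y cancels in the quotient

print(cordic_tanh(0.5), math.tanh(0.5))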

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · (1/ln(2))) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) × 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) × 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table with y − floor(y) ranging from 0 to 1.
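A small sketch (with an assumed table size of 256 entries) shows that Equations 6.2 to 6.4 translate directly into a shift and a table lookup:

import math

FRAC = 8                                 # assumed index width: 2^8 table entries
TABLE = [2.0 ** (i / 2 ** FRAC) for i in range(2 ** FRAC)]   # 2^f for 0 <= f < 1
INV_LN2 = 1.0 / math.log(2.0)            # the precalculated constant 1/ln(2)

def exp_approx(x):
    y = x * INV_LN2
    k = math.floor(y)                    # 2^k is just a binary decoder / shift
    frac = y - k                         # fractional part in [0, 1)
    return TABLE[int(frac * 2 ** FRAC)] * 2.0 ** k

print(exp_approx(-3.7), math.exp(-3.7))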

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed-form formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[a b; c d]^−1 = 1/(ad − bc) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
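For illustration, Equation 6.5 maps directly to a handful of operations; in hardware the division would be performed with a reciprocal unit like the one in Chapter 5.3.2:

def inv2x2(a, b, c, d):
    det = a * d - b * c
    assert det != 0, "matrix is singular"
    r = 1.0 / det                        # reciprocal of the determinant
    return [[d * r, -b * r],
            [-c * r, a * r]]

print(inv2x2(4.0, 7.0, 2.0, 6.0))        # [[0.6, -0.7], [-0.2, 0.4]]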

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable cost of interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of these simulations is that they limit the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming and necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, instead using QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that will solve the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In customer appliances larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, but it is also prohibitive when trying to integrate it in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson



which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^−f, which means that f bits have been allocated for the fractional part and the remaining N − f bits represent the integer part and sign.

The number can be in the range −2^(N−1−f) ≤ X ≤ 2^(N−1−f) − 2^−f, in steps of 2^−f. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspects of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can mainly be divided into two areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback of purely combinatorial circuits is that they are stateless because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements that allow the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will retain the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer, and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and are heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port after a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. The truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables to the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks adds overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else could be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not to be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor, and can be seen, in analogy to the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations described in Chapter 3.1 that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^(M×L), B ∈ R^(L×N) and C ∈ R^(M×N).

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be avoided, since the result will be symmetric around the diagonal. The drawback of these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that are necessary to compute.
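As an illustration of the possible reduction, the product H^T H can be formed by computing only the lower triangle and mirroring it, saving almost half of the multiply-and-add operations; the sketch below is illustrative and not one of the implemented modules:

def gram_lower(H):
    m, n = len(H), len(H[0])
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):          # only the lower triangle, j <= i
            s = 0.0
            for k in range(m):
                s += H[k][i] * H[k][j]
            G[i][j] = G[j][i] = s       # mirror into the upper triangle
    return G

H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(gram_lower(H))                    # [[35.0, 44.0], [44.0, 56.0]]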

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed-form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can then be combined into the sought inverse of the original matrix.

The following sections will describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix $Q$, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites and thus utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$$Q = LDL^T \tag{3.2}$$

where $L$ is a lower triangular matrix, $D$ is a diagonal matrix containing only positive elements, and $L^T$ is the transpose of $L$. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the upper limit.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
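For reference, a direct Python transcription of Algorithm 3.2 (a modeling aid in the spirit of Chapter 4.1, not the hardware itself) can be verified against the identity $Q = LDL^T$:

    import numpy as np

    def ldlt(Q):
        # LDL^T decomposition of a symmetric positive definite Q, following
        # Algorithm 3.2; returns unit lower triangular L and the vector d,
        # which is the diagonal of D.
        N = Q.shape[0]
        v = np.zeros(N)
        d = np.zeros(N)
        L = np.eye(N)              # the diagonal of L is implicitly one
        for i in range(N):
            s = 0.0
            for j in range(i):
                v[j] = L[i, j] * d[j]
                s += L[i, j] * v[j]
            v[i] = d[i] = Q[i, i] - s
            rec = 1.0 / v[i]       # provided by the reciprocal unit in hardware
            for j in range(i + 1, N):
                s = 0.0
                for k in range(i):
                    s += L[j, k] * v[k]
                L[j, i] = (Q[j, i] - s) * rec
        return L, d

    A = np.random.randn(8, 8)
    Q = A @ A.T + 8 * np.eye(8)    # a symmetric positive definite test matrix
    L, d = ldlt(Q)
    assert np.allclose(L @ np.diag(d) @ L.T, Q)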


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number $n$ by $d$, the reciprocal $1/d$ is calculated and the operation $n \cdot 1/d$ is subsequently performed.

The reciprocal $1/d$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = 1/d$ and using Newton's method to approximate the root. A suitable function is

$$f(x) = \frac{1}{x} - d \tag{3.3}$$

The Newton-Raphson method is an iterative method, and each iteration can be described by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \tag{3.4}$$

where $x_{i+1}$ is the next approximation, closer to the root, while $x_i$ is the value from the previous iteration.

Combining Equation (3.3) and Equation (3.4) gives

$$x_{i+1} = x_i(2 - d \, x_i) = 2x_i - d \, x_i^2 \tag{3.5}$$

The performance of this algorithm depends on how good the guess of $x_i$ for the first iteration, thus $x_0$, is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
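A small Python sketch illustrates how a coarse table seeds the iteration of Equation (3.5); the table size and the two iterations are assumptions chosen for illustration, not the parameters used in the thesis:

    def reciprocal(d, table_bits=6, iterations=2):
        # Approximate 1/d with Newton-Raphson, Equation (3.5),
        # after scaling d into [0.5, 1) so one small table suffices.
        assert d > 0
        shift = 0
        while d >= 1.0:
            d /= 2.0
            shift += 1
        while d < 0.5:
            d *= 2.0
            shift -= 1
        index = int((d - 0.5) * 2 ** (table_bits + 1))  # bits after the leading one
        x = 1.0 / (0.5 + (index + 0.5) / 2 ** (table_bits + 1))  # table entry: midpoint guess
        for _ in range(iterations):
            x = x * (2.0 - d * x)                       # Equation (3.5)
        return x / 2 ** shift                           # undo the input scaling

    print(reciprocal(3.0), 1 / 3.0)   # agrees to roughly eight digits

Because the error roughly squares each iteration, two iterations from a 64-entry table already give far more fractional bits than the wordlengths used in this design.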

3.3.3 Forward Substitution

When the lower triangular matrix $L$ has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$$L x_i = e_i \tag{3.6}$$

for $i = 1, \dots, n$, where $e_i$ is the $i$th column of the unit matrix and $n$ is the dimension of $L$. The resulting vectors $x_1, \dots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation (3.6) can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \dots, x_n)$ and $e = (e_1, \dots, e_n)$. If $L$ is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions, and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also implies that the diagonal of $x$ will consist of only ones.

The second assumption will change the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits of the innermost loop, since the upper triangular part of $x$ will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes, the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
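A Python transcription of Algorithm 3.4 (again a model, not the RTL) makes the optimization easy to verify against a general matrix inverse:

    import numpy as np

    def invert_unit_lower(L):
        # Invert a unit lower triangular matrix following Algorithm 3.4;
        # only multiply-and-adds and negations are needed, no divisions.
        N = L.shape[0]
        X = np.zeros((N, N))
        for i in range(N):
            X[i, i] = 1.0
            for j in range(i + 1, N):
                s = L[j, i]
                for k in range(i + 1, j):
                    s += L[j, k] * X[k, i]
                X[j, i] = -s
        return X

    L = np.tril(np.random.randn(8, 8), -1) + np.eye(8)  # unit lower triangular
    assert np.allclose(invert_unit_lower(L), np.linalg.inv(L))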

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since $D$ is diagonal and thus $D^{-1}$ consists of the reciprocal values of $D$.

The matrix inverse $Q^{-1}$ can now be obtained by

$$Q^{-1} = L^{-T} D^{-1} L^{-1} \tag{3.7}$$

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications, the inverse $Q^{-1}$ has been calculated.
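Combining the sketches above (reusing the ldlt and invert_unit_lower helpers defined in the earlier illustrative examples), the whole inversion chain can be modeled in a few lines:

    import numpy as np

    def invert_spd(Q):
        # Q^{-1} = L^{-T} D^{-1} L^{-1}, Equation (3.7); 1/d is what the
        # reciprocal unit already produces during the decomposition.
        L, d = ldlt(Q)
        Linv = invert_unit_lower(L)
        return Linv.T @ np.diag(1.0 / d) @ Linv

    A = np.random.randn(8, 8)
    Q = A @ A.T + 8 * np.eye(8)
    assert np.allclose(invert_spd(Q) @ Q, np.eye(8))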

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division maps to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.


Operation      In log space
log(a * b)     log(a) + log(b)
log(a / b)     log(a) - log(b)
log(a^b)       b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log\left(e^{\log(a)} + e^{\log(b)}\right) \tag{3.8}$$

Note that $a$ and $b$ are not actually stored, but instead their logarithmic counterparts $\log(a)$ and $\log(b)$.

Apart from requiring several operations, including an exponentiation and a subsequent logarithm, Equation (3.8) has additional drawbacks. If one of the probabilities $a$ or $b$ is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation (3.8) and normalize the calculations using the larger of the two probabilities. The rewrite yields

$$\log\left(e^{\log(a)} + e^{\log(b)}\right) = \log\left(e^{\max(\log(a),\,\log(b))}\left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right) \tag{3.9}$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation (3.9), the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in range. Its maximum value is $\log(2) \approx 0.69$, and it approaches 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computation.
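The identity is straightforward to model; the sketch below replaces the correction term with a precomputed table, as the hardware implementation in Chapter 5.4 does (the table resolution here is the 2^-8 step used there):

    import math

    STEP = 2 ** -8                       # table step size
    TABLE = [math.log(1 + math.exp(-i * STEP)) for i in range(2048)]  # covers 0 <= x < 8

    def jacobi_log(log_a, log_b):
        # log(a + b) computed from log(a) and log(b) via Equation (3.9).
        diff = log_a - log_b
        larger = log_a if diff >= 0 else log_b            # max(log(a), log(b))
        idx = min(int(abs(diff) / STEP), len(TABLE) - 1)  # saturate at x = 8
        return larger + TABLE[idx]

    a, b = 1e-12, 3e-12                  # tiny probabilities, safe in log space
    print(jacobi_log(math.log(a), math.log(b)), math.log(a + b))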

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs, and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily, by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAM blocks, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations it entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), where it denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last, and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

[Figure 5.1: Control signals for the AMBA AXI4-Stream interface.]

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is $H^T H$, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
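The address mapping is simple enough to state directly; a short sketch of the lookup table contents for the 8×8 case (a model of the addressing scheme, not the VHDL):

    N = 8
    # Row-order address i maps to column-order address (i mod N) * N + (i div N),
    # so the second read port streams out H^T while the first streams out H.
    transpose_lut = [(i % N) * N + (i // N) for i in range(N * N)]

    assert transpose_lut[:3] == [0, 8, 16]
    assert transpose_lut[62:] == [55, 63]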

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

[Figure 5.2: Block diagram of the matrix multiplication implementation: the input BRAM (read ports a and b), the matrix multiplication IP block, the control FSM with the address LUT, and the output BRAM.]


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


[Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree. Operands come from the input BRAM, the L BRAM, and the v/d registers via multiplexers; the adder tree feeds a subtracter and the reciprocal unit, and results are stored in the L BRAM and the v/d registers.]

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously, while still being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories, together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit, and the registers.


[Figure 5.4: Block diagram of the LDLT unit: the input BRAM for Q, the computation unit, the v and d registers, the control FSM, and the output BRAM for L.]

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation (3.5). One problem is that the lookup table must be limited in size, while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant one bit of the input number must reside in position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps, until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps, to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0, the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
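A bit-level Python model illustrates the scaling and indexing; the 12 fractional input bits and the 6-bit table index here are assumptions for illustration (cf. Table 5.2 for the actual port widths):

    FRAC = 12                  # assumed fractional bits of the fixed point input
    INDEX_BITS = 6             # assumed lookup table index width

    def scale_and_index(d_fixed):
        # Normalize a positive fixed point number so its most significant
        # one bit sits at position -1 (0.5 <= d < 1), then derive the table
        # index from the bits just below it.
        shift = 0
        while d_fixed >= (1 << FRAC):        # d >= 1: shift right
            d_fixed >>= 1
            shift += 1
        while d_fixed < (1 << (FRAC - 1)):   # d < 0.5: shift left
            d_fixed <<= 1
            shift -= 1
        # drop the always-set bit at position -1 (the subtraction by 0.5)
        index = (d_fixed - (1 << (FRAC - 1))) >> (FRAC - 1 - INDEX_BITS)
        return index, shift    # the reciprocal must be shifted by 'shift' as well

    index, shift = scale_and_index(3 << FRAC)   # d = 3.0
    assert shift == 2                           # 3.0 / 2^2 = 0.75
    assert index == 32                          # 0.75 lands halfway into the table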

One additional adaptation of Equation (3.5) is that a multiplication by 2 is equivalent to a shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations, to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

[Figure 5.5: Block diagram of the reciprocal unit: the input d is normalized by a find-MSB-index stage and a shifter, the lookup table provides the initial guess, a multiply/square/subtract datapath performs the Newton-Raphson iteration, and a final shift restores the scaling to produce 1/d.]

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and is used as a component in the LDLT decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

$$c = c \pm a \times b \tag{5.1}$$

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

[Figure 5.6: Block diagram of the multiply-and-accumulate module: a multiplier for the inputs a and b feeds an adder/subtracter whose second operand is the accumulator register output, multiplexed with 0 for the clear operation.]

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit, performing $c = c - a \times b$. The main problem that has to be solved is how to control these units, provide them with the input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinates in L matrix
X_x, X_y  X, Y coordinates in X matrix
W_x, W_y  X, Y coordinates in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


[Figure 5.7: Block diagram of the forward substitution unit: a control counter addresses the control memory, which drives the input BRAM for L, the MAC unit with its input mux, and the output BRAM for X.]

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation (3.9), especially the second term. If $\log(a)$ and $\log(b)$ are available as input, $x$ can be defined as $x = |\log(a) - \log(b)|$. With $x$ defined, the computation that has to be performed is

$$\text{result} = \max(\log(a), \log(b)) + \log\left(1 + e^{-x}\right) \tag{5.2}$$

Since $\log(a) - \log(b)$ must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then $\log(b)$ is the larger term and shall be selected, otherwise $\log(a)$. This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation (5.2) is $\log(1 + e^{-x})$. A graph of this function can be seen in Figure 5.8.

[Figure 5.8: The function $\log(1 + e^{-x})$ on the interval $0 \le x < 8$; the curve decreases from $\log(2)$ at $x = 0$ towards zero.]

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x so that it only contains log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging from 0 to 8 in steps of 2^-8.
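The index computation itself is a one-liner; a sketch of the quantization with the 3 + 8 bit split described above:

    def table_index(x):
        # Quantize x = |log(a) - log(b)| to an 11 bit table address:
        # 8 fractional bits (step 2^-8) with saturation at x = 8.
        idx = int(x * 2 ** 8)
        return min(idx, 2 ** 11 - 1)

    assert table_index(0.0) == 0
    assert table_index(1.0) == 256
    assert table_index(100.0) == 2047   # saturated; the stored value there is ~0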

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

[Figure 5.9: Block diagram of the Jacobi logarithm unit: log(a) and log(b) are subtracted, the sign bit (MSB) controls the mux that selects the larger input, the absolute value of the difference indexes the lookup table, and a final adder produces the result.]

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality, and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but this is of limited use if the module where the results are used utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns further right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^-8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes $H^T H$, as described in Chapter 5.2.3.


Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage of the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipeline registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation (2.6) the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \tag{6.1}$$

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
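As a sanity check of the approach, a floating point model of hyperbolic CORDIC in rotation mode is sketched below; the repeated iteration indices 4, 13, 40, ... are a property of the hyperbolic variant, and the iteration count is an assumption for illustration:

    import math

    def cordic_sinh_cosh(theta, n=16):
        # Hyperbolic CORDIC, rotation mode; converges for |theta| < ~1.118.
        # Indices 4, 13, 40, ... must be repeated for convergence.
        idx, i, repeat = [], 1, 4
        while len(idx) < n:
            idx.append(i)
            if i == repeat:
                idx.append(i)             # repeat this index
                repeat = 3 * repeat + 1
            i += 1
        gain = 1.0
        for i in idx:
            gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))
        x, y, z = 1.0 / gain, 0.0, theta  # pre-scale so the CORDIC gain cancels
        for i in idx:
            s = 1.0 if z >= 0 else -1.0
            x, y = x + s * y * 2.0 ** -i, y + s * x * 2.0 ** -i
            z -= s * math.atanh(2.0 ** -i)
        return y, x                       # sinh(theta), cosh(theta)

    sh, ch = cordic_sinh_cosh(0.5)
    print(sh / ch, math.tanh(0.5))        # tanh from the two CORDIC results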

6.3.2 Exponential Function

In the algorithm it is necessary to compute $e^x$, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculation from e to 2 with

$$e^x = e^{x \cdot \frac{\ln(2)}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}} \tag{6.2}$$

where $\frac{1}{\ln(2)}$ can be precalculated. This rewrite can be further refined with

$$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \tag{6.3}$$

If $y = x \cdot \frac{1}{\ln(2)}$ is defined, Equation (6.3) becomes

$$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor} \tag{6.4}$$

where $2^{\lfloor y \rfloor}$ can be implemented with a simple binary decoder, while $2^{y - \lfloor y \rfloor}$ can be precomputed and stored in a lookup table, with $y - \lfloor y \rfloor$ ranging from 0 to 1.
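A quick model of Equations (6.2) to (6.4) (the table resolution is an assumption for illustration):

    import math

    INV_LN2 = 1.0 / math.log(2.0)    # the precalculated constant 1/ln(2)
    FRAC_BITS = 8                    # assumed table resolution
    TABLE = [2.0 ** (i / 2 ** FRAC_BITS) for i in range(2 ** FRAC_BITS)]

    def exp_via_pow2(x):
        # e^x = 2^floor(y) * 2^(y - floor(y)) with y = x / ln(2), Equation (6.4).
        y = x * INV_LN2
        fy = math.floor(y)
        frac = int((y - fy) * 2 ** FRAC_BITS)   # 0 <= y - floor(y) < 1
        return 2.0 ** fy * TABLE[frac]          # 2^floor(y): a shift / binary decoder

    print(exp_via_pow2(-3.7), math.exp(-3.7))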

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension $n_s$ are also needed. If $n_s$ is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{6.5}$$


iff $ad - bc \neq 0$, as explained in [Strang, 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, where there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths, while still providing enough precision for good performance. The drawback of such optimizations is that they limit the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used, as well as the noise level $N_0$, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor, capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection, among other tasks. The processor architecture contains multiple floating point arithmetic units, capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable, since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate it into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.


IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson


A more detailed introduction to digital design, if necessary, can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth tables. The truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables of the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs.
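
As a software analogy (not part of the thesis), the behavior of a 4-input LUT can be modeled as a 16-entry truth table indexed by the concatenated input bits; a minimal Python sketch with hypothetical helper names:

# Model of a 4-input LUT: a list of 16 output bits indexed by the inputs.
def make_lut(f):
    # Precompute the truth table for an arbitrary 4-input boolean function.
    return [f((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(16)]

def lut_eval(table, x3, x2, x1, x0):
    # The four inputs form the address, just as in the FPGA fabric.
    return table[(x3 << 3) | (x2 << 2) | (x1 << 1) | x0]

# Example: a 4-input AND implemented purely as a table lookup.
and4 = make_lut(lambda a, b, c, d: a & b & c & d)
assert lut_eval(and4, 1, 1, 1, 1) == 1 and lut_eval(and4, 1, 0, 1, 1) == 0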

An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks provides overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else could be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks, such as LUTs and flip-flops, appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.

The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market, it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy with the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, described in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there are problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

$AB = C$    (3.1)

where $A \in \mathbb{R}^{M \times L}$, $B \in \mathbb{R}^{L \times N}$ and $C \in \mathbb{R}^{M \times N}$.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as $H^T H$, some of the operations could be avoided since the result is symmetric around the diagonal. The drawback of these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all matrix multiplications of the same dimension that have to be computed.
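
For reference, a direct and runnable transcription of Algorithm 3.1 in Python (a modeling aid in the spirit of the Matlab workflow described later, not the hardware implementation itself) could look as follows:

# Naive matrix multiplication; one multiply-and-add per innermost step.
# For M = N = L = 8 the loop body executes 8 * 8 * 8 = 512 times, as noted above.
def matmul(A, B):
    M, L, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            s = 0.0
            for k in range(L):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C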

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that no closed-form formula exists for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the sought inverse of the original matrix.

The following sections describe the steps involved in calculating the inverse, denoted $Q^{-1}$, given an original positive definite matrix $Q$, starting with the chosen method of decomposition.

3.3.1 LDL^T Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites and thus be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

$Q = LDL^T$    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and $L^T$ is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudocode for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the upper limit.

Algorithm 3.2 Algorithm for the LDL^T decomposition. The input matrix is Q, and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2, it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
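
A runnable model of Algorithm 3.2 (0-based indexing; Q is assumed symmetric and positive definite) may clarify the data dependencies; it mirrors the verification flow rather than the hardware:

# LDL^T decomposition following Algorithm 3.2 (0-based indexing).
# Returns L (unit diagonal left implicit, as in the pseudocode) and the
# vector d holding the diagonal of D.
def ldlt(Q):
    N = len(Q)
    v, d = [0.0] * N, [0.0] * N
    L = [[0.0] * N for _ in range(N)]
    for i in range(N):
        s = 0.0
        for j in range(i):
            v[j] = L[i][j] * d[j]
            s += L[i][j] * v[j]
        v[i] = d[i] = Q[i][i] - s
        rec = 1.0 / v[i]            # in hardware: the reciprocal unit
        for j in range(i + 1, N):
            s = 0.0
            for k in range(i):
                s += L[j][k] * v[k]
            L[j][i] = (Q[j][i] - s) * rec
    return L, d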

3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number $n$ by $d$, the reciprocal $1/d$ is calculated and the operation $n \cdot 1/d$ is subsequently performed.

The reciprocal $1/d$ can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function $f(x)$ that is zero at $x = 1/d$ and using Newton's method to approximate the root. A suitable function is

$f(x) = \frac{1}{x} - d$    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$    (3.4)

where $x_{i+1}$ is the next approximation, closer to the root, and $x_i$ is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

$x_{i+1} = x_i(2 - d \cdot x_i) = 2x_i - d \cdot x_i^2$    (3.5)

The performance of this algorithm depends on how good the initial guess $x_0$ is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
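
As an illustration, the iteration in Equation 3.5 converges quadratically from a table-based starting guess. A small Python sketch; the 16-entry table and the two iterations are illustrative choices, not values from the thesis:

# Newton-Raphson reciprocal, x_{i+1} = x_i * (2 - d * x_i), for 0.5 <= d < 1.
# The initial guess comes from a coarse 16-entry table over that interval.
TABLE = [1.0 / (0.5 + (i + 0.5) / 32.0) for i in range(16)]

def reciprocal(d, iterations=2):
    assert 0.5 <= d < 1.0, "input assumed pre-scaled (see Chapter 5.3.2)"
    x = TABLE[int((d - 0.5) * 32.0)]   # index taken from the bits below the MSB
    for _ in range(iterations):
        x = x * (2.0 - d * x)          # Equation 3.5
    return x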

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate $L^{-1}$, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate $L^{-1}$ by solving the matrix equation

$Lx_i = e_i$    (3.6)

for $i = 1, \ldots, n$, where $e_i$ is the $i$th column of the unit matrix and $n$ is the dimension of L. The resulting vectors $x_1, \ldots, x_n$ are the column vectors of $L^{-1}$.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.

Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices $x = (x_1, \ldots, x_n)$ and $e = (e_1, \ldots, e_n)$. If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one, and it can thus be eliminated and lifted outside of the loop. With these changes the number of operations is greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.

Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = -sum
    end for
end for
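
A runnable model of Algorithm 3.4 (0-based indexing), exploiting that L is unitriangular and that the result is lower triangular, could be:

# Inversion of a unitriangular L via the optimized forward substitution of
# Algorithm 3.4 (0-based indexing). Returns X = L^{-1}; the diagonal of L is
# implicitly one and is never read.
def invert_unitriangular(L):
    N = len(L)
    X = [[0.0] * N for _ in range(N)]
    for i in range(N):
        X[i][i] = 1.0
        for j in range(i + 1, N):
            s = L[j][i]
            for k in range(i + 1, j):
                s += L[j][k] * X[k][i]
            X[j][i] = -s
    return X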

3.3.4 Final Steps

As of now, $L^{-1}$ has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: $D^{-1}$. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus $D^{-1}$ consists of the reciprocal values of D.

The matrix inverse $Q^{-1}$ can now be obtained by

$Q^{-1} = L^{-T} D^{-1} L^{-1}$    (3.7)

where the matrix $L^{-T}$ is the transpose of $L^{-1}$. With these final matrix multiplications, the inverse $Q^{-1}$ has been calculated.

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason is that when performing calculations on small probabilities, the result is greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division maps to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.

Operation     Log space
log(a * b)    log(a) + log(b)
log(a / b)    log(a) - log(b)
log(a^b)      b * log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$\log(a + b) = \log(e^{\log(a)} + e^{\log(b)})$    (3.8)

Note that a and b are not actually stored; instead, their logarithmic counterparts $\log(a)$ and $\log(b)$ are.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might be very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

$\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a),\log(b))}\left(1 + e^{-|\log(a)-\log(b)|}\right)\right)$
$= \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a)-\log(b)|}\right)$    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is $\log(2) \approx 0.69$, and it approaches 0 when the difference between $\log(a)$ and $\log(b)$ grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computation.
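
In software, Equation 3.9 is a one-liner; a form like the following can also serve as a reference model for the hardware unit in Chapter 5.4:

import math

# Jacobi logarithm: log(a + b) computed from log(a) and log(b) (Equation 3.9).
# The correction term is bounded by log(2) and vanishes for large differences.
def jacobi_log(log_a, log_b):
    return max(log_a, log_b) + math.log(1.0 + math.exp(-abs(log_a - log_b)))

# Example: log(0.25 + 0.25) equals log(0.5).
assert abs(jacobi_log(math.log(0.25), math.log(0.25)) - math.log(0.5)) < 1e-12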

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the needed precision.
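
Such a wordlength study can be emulated with a simple quantizer; a sketch in Python (standing in for the Matlab models; the round-to-nearest and saturation behavior are assumptions, as fixed-point packages offer several modes):

# Quantize to a signed fixed-point format with `frac` fractional bits and
# `total` bits overall, roughly mirroring sfixed(total-frac-1 downto -frac).
def quantize(value, total=18, frac=12):
    scale = 1 << frac
    q = round(value * scale)                 # assumed rounding: to nearest
    lo, hi = -(1 << (total - 1)), (1 << (total - 1)) - 1
    q = max(lo, min(hi, q))                  # saturate to representable range
    return q / scale

# The quantization error of one in-range value is bounded by 2**-(frac + 1).
assert abs(quantize(0.123456) - 0.123456) <= 2 ** -13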

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL, when working with fixed-point numbers, it is common to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed-point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily since, if the registers are grouped together, it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI Express based board. The PCI Express connection allows the board to be connected to the PCI Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAM blocks, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource    Number of resource units
Slice               37680
Block RAM (36 Kb)   416
DSP48E1             768
PCI Express block   2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is $H^T H$, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and $H^T$ simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
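
The contents of such an address lookup table can be generated mechanically; a hypothetical Python generator for the 8×8 case (only the resulting table would be stored in hardware):

# Address LUT converting a linear (row-order) counter into column-order
# addresses for an 8x8 matrix stored row-wise.
N = 8
lut = [(i % N) * N + (i // N) for i in range(N * N)]
# The counter 0, 1, 2, ..., 63 is mapped onto 0, 8, 16, ..., 55, 63 as stated.
assert lut[:3] == [0, 8, 16] and lut[62:] == [55, 63]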

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need to be calculated, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while also being able to write an individual element. This can be achieved using a dual-port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories, together with some logic that performs the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built around an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to $0.5 \le d < 1$, it follows that $1 < 1/d \le 2$, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant non-zero bit of the input number must reside at position -1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval $0.5 \le d < 1$ to $0 \le d < 0.5$, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a one-bit shift when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.
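
A bit-level sketch of this scaling and table indexing, under assumed wordlengths (12 fractional bits, a 256-entry table), may make the data flow concrete; the shift amount is reapplied to the result exactly as stated above:

# Fixed-point reciprocal front end: normalize d to [0.5, 1), use the bits
# below the always-set MSB as table index, then undo the scaling.
FRAC = 12
GUESS = [round(4096 / (0.5 + (i + 0.5) / 512.0)) for i in range(256)]

def reciprocal_guess(d_fixed):
    # d_fixed is an unsigned integer interpreted as d * 2**FRAC, with d > 0.
    d, shift = d_fixed, 0
    while d < (1 << (FRAC - 1)):      # MSB below position -1: shift left
        d, shift = d << 1, shift + 1
    while d >= (1 << FRAC):           # MSB above position -1: shift right
        d, shift = d >> 1, shift - 1
    index = (d - (1 << (FRAC - 1))) >> (FRAC - 1 - 8)   # drop the implicit MSB
    guess = GUESS[index]              # approximates (1/d) * 2**FRAC for scaled d
    return guess << shift if shift >= 0 else guess >> -shift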

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

$c = c \pm a \times b$    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing $c = c - a \times b$. The main problem that has to be solved is how to control these units, provide them with the input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name       Purpose
sel        Control input mux to MAC unit
clr        Clear accumulator register
L_x, L_y   X, Y coordinate in L matrix
X_x, X_y   X, Y coordinate in X matrix
W_x, W_y   X, Y coordinate in X matrix for write
we         Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If $\log(a)$ and $\log(b)$ are available as input, x can be defined as $x = |\log(a) - \log(b)|$. With x defined, the computation that has to be performed is

$\mathrm{result} = \max(\log(a), \log(b)) + \log(1 + e^{-x})$    (5.2)

Since $\log(a) - \log(b)$ must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then $\log(b)$ is the larger term and shall be selected, otherwise $\log(a)$. This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is $\log(1 + e^{-x})$. A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function $\log(1 + e^{-x})$ on the interval $0 \le x < 8$.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval $0 \le x < 8$ to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers $0 \le x < 8$, it is possible to saturate x to only contain $\log_2(8) = 3$ integer bits. This leaves $11 - 3 = 8$ bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over $0 \le x < 8$ in steps of $2^{-8}$.
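
Precomputing that table is straightforward; a Python sketch of the generation and the corresponding quantized lookup (the 15-bit fractional scaling of the 16-bit entries is an illustrative choice):

import math

# log(1 + exp(-x)) precomputed for x = 0, 2**-8, ..., 8 - 2**-8 (2048 entries),
# sized to fit a single 36 Kb block RAM with 16-bit data.
STEP = 2 ** -8
TABLE = [round(math.log(1.0 + math.exp(-i * STEP)) * (1 << 15)) for i in range(2048)]

def correction(x):
    # Quantized log(1 + exp(-x)); x >= 0 is saturated to the range [0, 8).
    index = min(int(x / STEP), 2047)      # 3 integer + 8 fractional index bits
    return TABLE[index] / (1 << 15)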

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of $2^{-8}$.

6.2 Resource Usage

The following sections describe the resource usage of each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes $H^T H$ as described in Chapter 5.2.3.

Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and from the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and it operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

$\tanh(x) = \frac{\sinh(x)}{\cosh(x)}$    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute $e^x$ to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

$e^x = e^{x \cdot \frac{\ln(2)}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}}$    (6.2)

where $\frac{1}{\ln(2)}$ can be precalculated. This rewrite can be further refined with

$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor}$    (6.3)

If $y = x \cdot \frac{1}{\ln(2)}$ is defined, Equation 6.3 becomes

$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor}$    (6.4)

where $2^{\lfloor y \rfloor}$ can be implemented with a simple binary decoder, while $2^{y - \lfloor y \rfloor}$ can be precomputed and stored in a lookup table, with $y - \lfloor y \rfloor$ ranging from 0 to 1.
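
A software model of Equations 6.2 to 6.4, using floating point for clarity and a hypothetical 256-entry table in place of the block RAM:

import math

# exp(x) via base-2 decomposition: 2**y = 2**floor(y) * 2**(y - floor(y)),
# with y = x / ln(2); the fractional factor comes from a lookup table.
INV_LN2 = 1.0 / math.log(2.0)                           # precalculated constant
FRAC_TABLE = [2.0 ** (i / 256.0) for i in range(256)]   # 2**f for 0 <= f < 1

def exp_approx(x):
    y = x * INV_LN2
    n = math.floor(y)                     # in hardware: a simple binary decoder
    f = y - n                             # table index, 0 <= f < 1
    return math.ldexp(FRAC_TABLE[int(f * 256)], n)      # 2**n * 2**f

# Accuracy is limited by the table resolution (relative error around 2**-8).
assert abs(exp_approx(1.0) - math.e) / math.e < 0.005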

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that this method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension $n_s$ are also needed. If $n_s$ is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$    (6.5)

iff $ad - bc \neq 0$, as explained in [Strang, 2009].
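
As a sketch, the closed formula translates directly into a handful of multiplications and one reciprocal (a hypothetical helper, not from the thesis):

# Closed-form inverse of a 2x2 matrix (Equation 6.5); requires ad - bc != 0.
def inv2x2(a, b, c, d):
    det = a * d - b * c
    rec = 1.0 / det          # in hardware: one reciprocal, then four multiplies
    return [[ d * rec, -b * rec],
            [-c * rec,  a * rec]]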

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of the LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be accommodated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of minimized wordlengths is that they limit the reuse of components, since a multiplier cannot be shared between sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why a floating point approach might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised, given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

As of now, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. The following sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits: since a complex multiplication, for instance, requires four real multiplications and two additions, a complex model would be comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the detection workload will increase if the detection is performed in an OFDM system, because of the multiple subcarriers.

6.6.4 Integration

Since devices that use wireless technology keep shrinking, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that need to be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321, Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson



The synthesised netlist can then be analysed by a tool referred to as place-and-route, which organizes the building blocks into a structure suitable for the FPGA. The place-and-route tool then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks, where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen, in analogy to the software world, as a library.

This allows for a shorter implementation cycle, since each IP block's functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form in which an IP block is delivered varies; it can be, for example, readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, outlined in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be saved since the result is symmetric around the diagonal. The drawback of these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition, to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the original sought inverse matrix.

The following sections describe the steps involved in calculating the inverse, denoted Q^{-1}, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, in order to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the higher limit.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q, and the output matrix is L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.

3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i - f(x_i) / f'(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i (2 - d * x_i) = 2 * x_i - d * x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
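The following C model sketches the idea; the 6-bit table size is an assumption chosen for the example, not the size used in the implementation, and the input is assumed to be prescaled to 0.5 ≤ d < 1 as described in Chapter 5.3.2.

#include <stdio.h>

#define LUT_BITS 6                 /* assumed table size: 64 initial guesses */
#define LUT_SIZE (1 << LUT_BITS)

static double lut[LUT_SIZE];       /* initial guesses of 1/d for d in [0.5, 1) */

static void init_lut(void) {
    for (int i = 0; i < LUT_SIZE; i++) {
        double d = 0.5 + (i + 0.5) / (2.0 * LUT_SIZE);  /* interval midpoint */
        lut[i] = 1.0 / d;
    }
}

/* Approximate 1/d for 0.5 <= d < 1 with the iteration of Equation 3.5. */
static double reciprocal(double d, int iterations) {
    int idx = (int)((d - 0.5) * 2.0 * LUT_SIZE);   /* bits after the leading one */
    double x = lut[idx];
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);     /* error roughly squares every iteration */
    return x;
}

int main(void) {
    init_lut();
    printf("1/0.7 ~ %.9f (exact %.9f)\n", reciprocal(0.7, 2), 1.0 / 0.7);
    return 0;
}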

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8 this algorithm needs 224 multiply-and-adds, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives the fact that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations is greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-adds and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = -sum
    end for
end for
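A direct C transcription of Algorithm 3.4, shown here as a reference model rather than the VHDL implementation, makes the loop bounds explicit:

#define N 8

/* Invert the unit lower-triangular matrix L into X, as in Algorithm 3.4:
   no divisions are needed, and only the lower triangle of X is written.
   The caller is assumed to have zero-initialized the upper triangle of X. */
static void invert_unit_lower(const double L[N][N], double X[N][N]) {
    for (int i = 0; i < N; i++) {
        X[i][i] = 1.0;                    /* the diagonal of X is all ones */
        for (int j = i + 1; j < N; j++) {
            double sum = L[j][i];         /* the k = i term, since x[i][i] = 1 */
            for (int k = i + 1; k <= j - 1; k++)
                sum += L[j][k] * X[k][i];
            X[j][i] = -sum;
        }
    }
}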

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector; since D is diagonal, D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications the inverse Q^{-1} has been calculated.
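Written out element-wise, Equation 3.7 becomes a sum over a single index, as the following C sketch shows (dense loops for clarity; dinv holds the diagonal of D^{-1}, and Linv is assumed to be zero above the diagonal):

#define N 8

/* Assemble Qinv = Linv^T * Dinv * Linv from the inverted triangular
   factor and the reciprocals of the diagonal of D. */
static void assemble_inverse(const double Linv[N][N], const double dinv[N],
                             double Qinv[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            /* (Linv^T)[i][k] = Linv[k][i]; dinv[k] scales each term */
            for (int k = 0; k < N; k++)
                sum += Linv[k][i] * dinv[k] * Linv[k][j];
            Qinv[i][j] = sum;
        }
}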

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is the fact that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication is mapped to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation    Log Space
log(a * b)   log(a) + log(b)
log(a / b)   log(a) - log(b)
log(a^b)     b * log(a)

Table 3.1 Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored; instead their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might be very large.

With these limitations in mind it is possible to rewrite Equation 3.8 and normalize the calculations using the largest value of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{-|log(a) - log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{-|log(a) - log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum value of the two probabilities and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
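The identity in Equation 3.9 is easy to check numerically. The C sketch below evaluates the correction term directly with log1p for clarity; the hardware described in Chapter 5.4 instead reads it from a precomputed table.

#include <math.h>
#include <stdio.h>

/* Jacobi logarithm: compute log(a + b) from log(a) and log(b), as in
   Equation 3.9, without ever forming the small numbers a and b. */
static double jacobi_log(double log_a, double log_b) {
    double x   = fabs(log_a - log_b);               /* always >= 0         */
    double max = (log_a > log_b) ? log_a : log_b;   /* sign-based select   */
    return max + log1p(exp(-x));                    /* log1p(z) = log(1+z) */
}

int main(void) {
    double la = log(1e-12), lb = log(3e-12);
    printf("%.12f vs %.12f\n", jacobi_log(la, lb), log(1e-12 + 3e-12));
    return 0;
}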

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This is an approach suitable only for very simple designs, and it is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by HiTech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1 An overview of interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations it entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1 Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
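The contents of this lookup table follow a simple closed form, illustrated by the C snippet below (shown for clarity; the hardware stores the precomputed values in a table rather than computing them):

/* Address generation for reading a row-major 8x8 matrix in column
   order: counter value i (0..63) maps to element (i mod 8, i div 8),
   giving the sequence 0, 8, 16, ..., 55, 63. */
static unsigned transpose_addr(unsigned i) {
    return (i % 8) * 8 + (i / 8);
}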

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2 Block diagram of the matrix multiplication implementation (input BRAM, address LUT, control FSM, matrix multiplication IP block and output BRAM)


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit in Chapter 5.2.

5.3.1 LDLT Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3 Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of a number of smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, computation unit and registers.


Figure 5.4 Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^-1 matrix output

Table 5.1 Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant one bit of the input number must reside in position -1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
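The C sketch below models the scaling and indexing on a fixed point integer; the format matches the unit's ufixed(5 downto -12) d input, while treating all eleven remaining bits as table index is a simplification made here (the real table is smaller).

#include <stdint.h>
#include <stdio.h>

#define FRAC 12   /* ufixed(5 downto -12): 6 integer bits, 12 fractional bits */

/* Scale d (> 0) so its leading one ends up in position -1, i.e. the
   mantissa m satisfies 0.5 <= m < 1. Returns the shift count N such
   that d = m * 2^N; the reciprocal must then be shifted back N steps. */
static int normalize(uint32_t d, uint32_t *m) {
    int shift = 0;
    while (d >= (1u << FRAC))       { d >>= 1; shift++; }  /* too large */
    while (d <  (1u << (FRAC - 1))) { d <<= 1; shift--; }  /* too small */
    *m = d;
    return shift;
}

int main(void) {
    uint32_t m;
    int n = normalize(3u << FRAC, &m);           /* d = 3.0 -> m = 0.75, n = 2 */
    unsigned idx = m & ((1u << (FRAC - 1)) - 1); /* drop the always-set bit -1 */
    printf("shift = %d, table index = %u\n", n, idx);
    return 0;
}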

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5 Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6 Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit, performing c = c - a × b. The main problem that has to be solved is how to control these units, provide them with the input in correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4 Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) - log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^{-x})    (5.2)

Since log(a) - log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{-x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8 The function log(1 + e^{-x}) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging from 0 to 8 in steps of 2^-8.
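The table contents can be generated offline, for instance as in the following C sketch; the 12-bit output scaling is an assumption matching the module's sfixed(5 downto -12) ports, within the 16-bit data width stated above.

#include <math.h>
#include <stdint.h>

#define ADDR_BITS 11     /* 2048 entries fit one 36 Kb BRAM at 16 bits/entry */
#define X_FRAC    8      /* 3 integer + 8 fractional bits cover 0 <= x < 8   */
#define OUT_FRAC  12     /* assumed output scaling: 12 fractional bits       */

/* Precompute log(1 + e^-x) for x = 0, 2^-8, 2*2^-8, ..., 8 - 2^-8. */
static void fill_table(uint16_t table[1 << ADDR_BITS]) {
    for (int i = 0; i < (1 << ADDR_BITS); i++) {
        double x = (double)i / (1 << X_FRAC);          /* step size 2^-8 */
        double v = log1p(exp(-x));                     /* at most log(2) */
        table[i] = (uint16_t)lround(v * (1 << OUT_FRAC));
    }
}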

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

[Figure 5.9: Block diagram of the Jacobi logarithm unit: a subtraction of log(a) and log(b), an absolute-value stage feeding the lookup table, a mux that selects the largest input using the sign bit (MSB) of the difference, and a final adder producing the result.]

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit
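
To make the fixed-point behavior concrete, the following is a minimal Python model of the unit (the thesis models were written in Matlab; Python is used here only as illustration, and the names FRAC, TABLE and jacobi_log_hw are invented for this sketch). It mirrors the structure above: a subtraction, a mux steered by the sign of the difference, and a 2048-entry table addressed by the saturated difference.

import math

FRAC = 8                                   # fractional bits of the table index
TABLE = [math.log(1.0 + math.exp(-k * 2.0 ** -FRAC))
         for k in range(8 << FRAC)]        # covers 0 <= x < 8 in steps of 2^-8

def jacobi_log_hw(log_a, log_b):
    diff = log_a - log_b
    largest = log_a if diff >= 0 else log_b   # mux controlled by the sign bit
    x = abs(diff)
    if x >= 8.0:                              # saturate: the table ends at x = 8
        return largest                        # the correction term is ~0 out here
    return largest + TABLE[int(x * (1 << FRAC))]

For example, jacobi_log_hw(math.log(0.5), math.log(0.25)) returns approximately log(0.75), as expected for log(0.5 + 0.25).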

6 Result and Analysis

This chapter describes the result from the implementation, both the accuracy of the computations as well as the resource usage. The result is discussed, and the approach taken in this thesis is also compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the result was obtained. The results of these computations were then imported into Matlab and were compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates towards the rightmost columns. The accuracy in the leftmost columns would correspond to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^{−8}.

6.2 Resource Usage

The following sections will describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
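
As a sketch of how such a block operates, the following Python model implements hyperbolic CORDIC in rotation mode (an illustration only; the function and parameter names are invented here, and a hardware version would use fixed point values with the atanh constants in a small table). The scale factor of the rotations multiplies sinh and cosh equally, so it cancels in the quotient and no gain compensation is needed for tanh.

import math

def tanh_cordic(t, n=20):
    # Build the iteration index sequence; indices 4, 13, 40, ... must be
    # executed twice for the hyperbolic variant to converge.
    idx, i, rep = [], 1, 4
    while len(idx) < n:
        idx.append(i)
        if i == rep:
            idx.append(i)
            rep = 3 * rep + 1
        i += 1
    x, y, z = 1.0, 0.0, t          # after the loop: x ~ K*cosh(t), y ~ K*sinh(t)
    for i in idx:
        d = 1.0 if z >= 0.0 else -1.0
        e = 2.0 ** -i
        x, y = x + d * y * e, y + d * x * e
        z -= d * math.atanh(e)     # atanh(2^-i) comes from a lookup table in hardware
    return y / x                   # tanh(t) = sinh(t)/cosh(t); the gain K cancels

print(tanh_cordic(0.5), math.tanh(0.5))   # ~0.4621 for both

Note that the basic scheme converges only for roughly |t| < 1.118, so larger arguments would require range reduction, a step not covered by the thesis text.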

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^{x · ln(2) / ln(2)} = 2^{x · 1/ln(2)}    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^{x · 1/ln(2)} = 2^{floor(x · 1/ln(2))} · 2^{x · 1/ln(2) − floor(x · 1/ln(2))}    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^{floor(y)} · 2^{y − floor(y)}    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be further investigated instead. The drawback is that the method is described for floating point numbers.
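
A Python sketch of the table-based approach could look as follows (illustrative only; the constant names and the choice of 8 fractional index bits are assumptions, chosen to mirror the table in Chapter 5.4):

import math

FRAC = 8                                  # fractional bits of y used as table index
POW2 = [2.0 ** (k * 2.0 ** -FRAC) for k in range(1 << FRAC)]
INV_LN2 = 1.0 / math.log(2.0)             # the precalculated constant 1/ln(2)

def exp_approx(x):
    y = x * INV_LN2                       # rebase from e to 2, Equation 6.2
    n = math.floor(y)                     # 2^n: a binary decoder / shifter
    f = y - n                             # fractional part in [0, 1)
    return (2.0 ** n) * POW2[int(f * (1 << FRAC))]

For example, exp_approx(1.0) gives about 2.7156 against e ≈ 2.7183; the error is bounded by the table step, so more index bits buy more accuracy.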

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
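
As a sketch, the closed formula translates directly into a handful of multiplications and a single reciprocal (Python used for illustration; the function name is invented):

def inv2x2(a, b, c, d):
    # Equation 6.5: inverse of [[a, b], [c, d]], valid iff ad - bc != 0
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    r = 1.0 / det           # one reciprocal, reusable from the reciprocal unit
    return [[d * r, -b * r],
            [-c * r, a * r]]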

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided among eight units that can operate simultaneously, and this is very helpful to provide a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor, capable of performing for instance division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions and instead uses QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that will solve the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other problems. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations, such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations, but it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as the addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson


3 Problem Analysis

This chapter provides an analysis of a subset of the operations, described in Chapter 3.1, that are needed for an implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations, such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix multiplication is described by

AB = C    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications, but they introduce several additions and subtractions instead, which affects the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
    for j = 1 → N do
        sum = 0
        for k = 1 → L do
            sum = sum + A[i][k] * B[k][j]
        end for
        C[i][j] = sum
    end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced, since the result will be symmetric around the diagonal. The drawback of these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it is possible to reuse it for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula does not exist for calculating the inverse.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition, to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the original sought inverse matrix.

The following sections will describe the steps involved in calculating the inverse, denoted Q^{-1}, given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDLT Decomposition

The chosen method of decomposition is the LDLT decomposition described by [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDLT decomposition compared to the Cholesky decomposition is that the latter requires the evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDLT decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDLT decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower limit is greater than the higher limit.

Algorithm 3.2 Algorithm for LDLT decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i − 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] − sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i − 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] − sum) * rec
    end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
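
For reference, the following is a direct Python transcription of Algorithm 3.2 (standing in for the Matlab models used in the thesis; the unit diagonal of L, which the algorithm leaves implicit, is set explicitly here), checked against a random symmetric positive definite matrix:

def ldlt(Q):
    N = len(Q)
    L = [[0.0] * N for _ in range(N)]
    d = [0.0] * N
    v = [0.0] * N
    for i in range(N):
        s = 0.0
        for j in range(i):
            v[j] = L[i][j] * d[j]          # pair-wise products with d
            s += L[i][j] * v[j]
        v[i] = d[i] = Q[i][i] - s
        L[i][i] = 1.0                      # L is unitriangular
        rec = 1.0 / v[i]                   # handled by the reciprocal unit in hardware
        for j in range(i + 1, N):
            s = 0.0
            for k in range(i):
                s += L[j][k] * v[k]
            L[j][i] = (Q[j][i] - s) * rec
    return L, d

import numpy as np
A = np.random.randn(8, 8)
Q = A @ A.T + 8 * np.eye(8)                # symmetric positive definite test input
L, d = ldlt(Q.tolist())
assert np.allclose(np.array(L) @ np.diag(d) @ np.array(L).T, Q)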


3.3.2 Reciprocal

In the LDLT decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive operation of the four basic math operations, in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n · (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d, and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x − d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i − f(x_i) / f'(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i(2 − d · x_i) = 2x_i − d · x_i^2    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
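
A Python model of the combined scaling, table lookup and iteration might look like this (illustrative; the 6 index bits and three iterations are assumptions, not values from the thesis):

import math

INDEX_BITS = 6                              # 2^6 = 64 table entries, an assumption

def reciprocal(d, n_iter=3):
    assert d > 0
    e = math.floor(math.log2(d)) + 1        # d = m * 2^e with 0.5 <= m < 1
    m = d * 2.0 ** -e
    # The bits below the always-set leading one form the table index.
    k = int((m - 0.5) * 2 ** (INDEX_BITS + 1))
    x = 1.0 / (0.5 + (k + 0.5) * 2.0 ** -(INDEX_BITS + 1))   # midpoint seed
    for _ in range(n_iter):
        x = x * (2.0 - m * x)               # Newton-Raphson step, Equation 3.5
    return x * 2.0 ** -e                    # undo the input scaling

For example, reciprocal(3.0) returns 0.3333..., and since each iteration roughly doubles the number of correct bits, a seed good to a few decimals converges in two or three iterations.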

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

L x_i = e_i    (3.6)

for i = 1, …, n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, …, x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, …, x_n) and e = (e_1, …, e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It will also change the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is the fact that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation    Log space
log(a · b)   log(a) + log(b)
log(a / b)   log(a) − log(b)
log(a^b)     b · log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might be very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the largest of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{−|log(a) − log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding it to the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be log(2) ≈ 0.69, and it will approach 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
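
In floating point the operation is only a couple of lines, which also makes the limited range of the correction term easy to see (Python for illustration; the function names are invented):

import math

def jacobi_log_add(log_a, log_b):
    # log(a + b) from log(a) and log(b), Equation 3.9
    x = abs(log_a - log_b)
    return max(log_a, log_b) + math.log1p(math.exp(-x))

def log_sum(log_ps):
    # fold the pairwise operation over a list of log probabilities
    acc = log_ps[0]
    for lp in log_ps[1:]:
        acc = jacobi_log_add(acc, lp)
    return acc

print(log_sum([math.log(0.1)] * 10))   # ~0.0, i.e. log(1.0)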

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This is an approach suitable only for very simple designs, and not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point integers. The package allows the wordlengths, both integer and fractional, to be configured more easily, by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is the register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part will only store the next state into the state registers.

Records have been heavily used, since if the registers are grouped together it is easier to add additional registers without much rewrite. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. This PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource    Number of resource units
Slice               37680
Block RAM (36 Kb)   416
DSP48E1             768
PCI-Express block   2

Table 4.1: An overview of interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), where it denotes A + 1 integer bits and B fractional bits, with position 0 implicitly at the decimal point.



5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, a bus of width 18 × 64 × 2 = 2304 would be necessary to provide the two matrix inputs. It is not feasible to route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be seen in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this was the case chosen.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix, but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, …, 62, 63 onto 0, 8, 16, …, 55, 63.
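
The content of such a lookup table is easy to generate: entry k must hold the row-major address of the element at row k mod 8, column k div 8. A one-line Python sketch (for illustration only) is:

# column-order read addresses for an 8x8 row-major matrix
lut = [(k % 8) * 8 + k // 8 for k in range(64)]
assert lut[:4] == [0, 8, 16, 24] and lut[-2:] == [55, 63]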

Everything in the implementation is controlled by a control FSM, which contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

[Figure 5.2: Block diagram of the matrix multiplication implementation: an input BRAM with read ports a and b feeding the matrix multiplication IP block, a control FSM with an address LUT generating input and output addresses and control signals, and an output BRAM collecting the result.]


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L will be inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


[Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree. Operands are read from the input BRAM and, through muxes, from the L BRAM or the V/D registers; results are stored in the L BRAM and the V/D registers, with the reciprocal unit in the data path.]

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously, while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of a multiple of smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 54 The control unit is builtof an FSM that controls the memories computation unit and registers


[Figure: input block RAM (Q) and output block RAM (L) connected to the computation unit and the v/d registers, with a control FSM driving addresses, load and control signals; the unit outputs L and D.]

Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name     | Dir | Type                         | Comment
clk      | in  | std_logic                    | Input clock
rst_n    | in  | std_logic                    | Reset, active low
start    | in  | std_logic                    | Start computation
addr_in  | in  | std_logic_vector(5 downto 0) | Input address
data_in  | in  | sfixed(5 downto -12)         | Data input
we       | in  | std_logic                    | Write enable
ready    | out | std_logic                    | Ready for input
done     | out | std_logic                    | Computation done
addr_out | in  | std_logic_vector(5 downto 0) | Output address
L_data   | out | sfixed(2 downto -15)         | L matrix output
D_data   | out | sfixed(2 downto -15)         | D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one of the input number must reside on position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps, until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps, to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index to the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.

[Figure: a find-MSB-index stage with an input shifter scaling d, the lookup table providing the seed, a squaring multiplier, multiplication and subtraction implementing Equation 3.5, and output shifters producing 1/d from d.]

Figure 5.5: Block diagram of the reciprocal unit.
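The behavior of the complete unit can be sketched in Matlab as below. This is a double precision model, not the fixed point implementation; the 256-entry seed table and its midpoint initialization are assumptions made for illustration:

function r = reciprocal_model(d)
    % dynamic scaling: shift d (assumed > 0) into 0.5 <= d < 1
    n = 0;
    while d >= 1,  d = d / 2; n = n + 1; end
    while d < 0.5, d = d * 2; n = n - 1; end
    idx = floor((d - 0.5) * 512);        % drop the implicit leading one, 0..255
    x   = 1 / (0.5 + (idx + 0.5) / 512); % table seed: reciprocal of bin midpoint
    x   = x * (2 - d * x);               % one Newton-Raphson step, Equation 3.5
    r   = x * 2^(-n);                    % undo scaling: 1/(d*2^n) = (1/d)*2^(-n)
end

A single iteration suffices in this sketch because the seed is already accurate to roughly 2^(-9), and the quadratic convergence of Equation 3.5 then gives on the order of 18 correct bits.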

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and is used as a component in the LDL^T decomposition unit.

Name   | Dir | Type                 | Comment
clk    | in  | std_logic            | Input clock
load   | in  | std_logic            | Load new d
d      | in  | ufixed(5 downto -12) | d input
result | out | ufixed(5 downto -12) | 1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b   (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

[Figure: a multiplier for inputs a and b feeding an add/sub unit whose output register feeds back through a mux that selects 0 on clear, forming the accumulator c.]

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit, performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc, 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name     | Purpose
sel      | Control input mux to MAC unit
clr      | Clear accumulator register
L_x, L_y | X/Y coordinate in L matrix
X_x, X_y | X/Y coordinate in X matrix
W_x, W_y | X/Y coordinate in X matrix for write
we       | Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


[Figure: input block RAM (L) and output block RAM (X) connected to the MAC unit through an input mux, with a control counter addressing a control memory that drives sel, clr, the L/X/write addresses and we.]

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name     | Dir | Type                         | Comment
clk      | in  | std_logic                    | Input clock
rst_n    | in  | std_logic                    | Reset, active low
start    | in  | std_logic                    | Start computation
addr_in  | in  | std_logic_vector(5 downto 0) | Input address
data_in  | in  | sfixed(2 downto -15)         | Data input
we       | in  | std_logic                    | Write enable
done     | out | std_logic                    | Computation done
addr_out | in  | std_logic_vector(5 downto 0) | Output address
data_out | out | sfixed(2 downto -15)         | X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^(−x))   (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(−x)). A graph of this function can be seen in Figure 5.8.

[Figure: plot of log(1 + e^(−x)) for 0 ≤ x < 8, decreasing from log(2) ≈ 0.69 at x = 0 towards zero.]

Figure 5.8: The function log(1 + e^(−x)) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^(−8).
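A rough Matlab model of the table and the complete addition follows (the saturation at index 2047 mirrors the 11-bit address described above; the names are illustrative):

tbl = log(1 + exp(-(0:2047) * 2^-8));   % precomputed table, step 2^-8
jacobi = @(la, lb) max(la, lb) + ...
         tbl(min(floor(abs(la - lb) * 2^8), 2047) + 1);

% example: compare against the exact computation
jacobi(-1.0, -2.5) - log(exp(-1.0) + exp(-2.5))   % close to zero

The saturation also implements the observation above that the correction term is negligible for x ≥ 8.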

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

[Figure: subtraction of log(a) and log(b); the absolute value and selected bits index the lookup table, the MSB (sign bit) controls a mux that selects the larger input, and a final adder produces the result.]

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name   | Dir | Type                 | Comment
clk    | in  | std_logic            | Input clock
log_a  | in  | sfixed(5 downto -12) | log(a) input
log_b  | in  | sfixed(5 downto -12) | log(b) input
result | out | sfixed(5 downto -12) | Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality, and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^(−8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource          | Used | Total  | Percentage
Flip-flops        | 3024 | 301440 | 1.0 %
LUTs              | 1459 | 150720 | 1.0 %
Block RAM (36 Kb) | 10   | 416    | 2.4 %
DSP48E1           | 8    | 768    | 1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource          | Used | Total  | Percentage
Flip-flops        | 831  | 301440 | < 1 %
LUTs              | 1802 | 150720 | 1.2 %
Block RAM (36 Kb) | 9    | 416    | 2.2 %
DSP48E1           | 19   | 768    | 2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource          | Used | Total  | Percentage
Flip-flops        | 30   | 301440 | < 1 %
LUTs              | 124  | 150720 | < 1 %
Block RAM (36 Kb) | 2    | 416    | < 1 %
DSP48E1           | 1    | 768    | < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource          | Used | Total  | Percentage
Flip-flops        | 180  | 301440 | < 1 %
LUTs              | 156  | 150720 | < 1 %
Block RAM (36 Kb) | 1    | 416    | < 1 %
DSP48E1           | 0    | 768    | 0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the most bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)   (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
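As an illustrative Matlab sketch (not part of the thesis implementation), hyperbolic CORDIC in rotation mode produces sinh and cosh simultaneously using only shifts, adds and a small angle table; iterations 4 and 13 are repeated to guarantee convergence, and the CORDIC gain cancels in the final ratio:

function t = tanh_cordic(z)
    % valid for |z| < ~1.11, the convergence range of hyperbolic CORDIC
    x = 1; y = 0;
    seq = [1 2 3 4 4 5 6 7 8 9 10 11 12 13 13];  % indices 4 and 13 repeated
    for i = seq
        d = 1; if z < 0, d = -1; end   % rotation direction from the sign of z
        xn = x + d * y * 2^(-i);       % shift-and-add only
        y  = y + d * x * 2^(-i);
        x  = xn;
        z  = z - d * atanh(2^(-i));    % angles from a precomputed table
    end
    t = y / x;   % tanh = sinh/cosh; the gain K cancels in the ratio
end

% tanh_cordic(0.5) - tanh(0.5) is on the order of 1e-4 with 15 iterations

The final division could reuse the reciprocal unit from Chapter 5.3.2, keeping the design free from a general divider.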

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2, with

e^x = e^(x · ln(2) / ln(2)) = 2^(x · 1/ln(2))   (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))   (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))   (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
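A small Matlab sketch of Equations 6.2–6.4 follows; the 256-entry table size is an assumption chosen for illustration:

x   = 1.75;
y   = x * (1 / log(2));      % 1/ln(2) precalculated
k   = floor(y);              % 2^k: a binary decoder / shifter in hardware
f   = y - k;                 % 0 <= f < 1, indexes the lookup table
tbl = 2 .^ ((0:255) / 256);  % precomputed 2^f values
ex  = 2^k * tbl(floor(f * 256) + 1);   % approximates exp(1.75) ~ 5.755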

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(−1) = (1 / (ad − bc)) · [d −b; −c a]   (6.5)


iff ad − bc ≠ 0, as explained in [Strang, 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm, and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths, while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as number of multipliers or latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, yet necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information, to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided in eight units that can operate simultaneously, and this is very helpful to provide a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor, capable of performing for instance division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units, capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger system on chips are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block, rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibiting when trying to integrate in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger system on chips.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed, while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation in hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS, for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication, barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson



real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i = 1 → M do
  for j = 1 → N do
    sum = 0
    for k = 1 → L do
      sum = sum + A[i][k] * B[k][j]
    end for
    C[i][j] = sum
  end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be skipped, since the result is symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that are necessary to compute.

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a closed form formula for calculating the inverse does not exist.

Common ways to calculate the inverse of a larger matrix use some sort of decomposition, to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the sought inverse of the original matrix.

The following sections describe the steps involved to calculate the inverse, denoted Q^(−1), given an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDL^T Decomposition

The chosen method of decomposition is the LDL^T decomposition, described by [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that the latter requires evaluation of square roots. This is a complex operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = L D L^T   (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q, and the output matrix is L, along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
  sum = 0
  for j = 1 → i − 1 do
    v[j] = L[i][j] * d[j]
    sum = sum + L[i][j] * v[j]
  end for
  v[i] = d[i] = Q[i][i] − sum
  rec = 1 / v[i]
  for j = i + 1 → N do
    sum = 0
    for k = 1 → i − 1 do
      sum = sum + L[j][k] * v[k]
    end for
    L[j][i] = (Q[j][i] − sum) * rec
  end for
end for

In Algorithm 3.2 it is required to have a temporary vector, denoted v, to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation.
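A direct Matlab translation of Algorithm 3.2 can be checked against the definition. This is a sketch; N = 8 matches the matrix dimension used in this thesis, and the test matrix is an arbitrary symmetric positive definite one:

N = 8; A = randn(N); Q = A * A' + N * eye(N);  % symmetric positive definite
L = eye(N); d = zeros(N, 1); v = zeros(N, 1);
for i = 1:N
    for j = 1:i-1
        v(j) = L(i,j) * d(j);                  % pair-wise multiplication
    end
    d(i) = Q(i,i) - L(i,1:i-1) * v(1:i-1);     % adder tree + subtraction
    L(i+1:N,i) = (Q(i+1:N,i) - L(i+1:N,1:i-1) * v(1:i-1)) / d(i);
end
norm(L * diag(d) * L' - Q)                     % should be ~1e-13

The division by d(i) is where the hardware instead multiplies by the reciprocal from Chapter 3.3.2.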


3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive operation of the four basic math operations, in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n * (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x − d   (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_(i+1) = x_i − f(x_i) / f'(x_i)   (3.4)

where x_(i+1) is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_(i+1) = x_i (2 − d · x_i) = 2 · x_i − d · x_i^2   (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
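A quick numeric illustration of the quadratic convergence (Matlab, with illustrative values):

d = 0.75; x = 1.3;     % initial guess correct to about one decimal
x = x * (2 - d * x)    % 1.3325
x = x * (2 - d * x)    % 1.33333..., true value 1/0.75 = 4/3

Each iteration roughly doubles the number of correct bits, so a table seed with around 8-9 correct bits needs only one or two iterations to reach 16-18 bit accuracy.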

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^(−1), since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^(−1) by solving the matrix equation

L x_i = e_i   (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^(−1).

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
  for j = 1 → N do
    sum = 0
    for k = 1 → j − 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = (e[j][i] − sum) / L[j][j]
  end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one, and can thus be eliminated and lifted outside of the loop. With these changes, the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
  x[i][i] = 1
  for j = i + 1 → N do
    sum = L[j][i]
    for k = i + 1 → j − 1 do
      sum = sum + L[j][k] * x[k][i]
    end for
    x[j][i] = −sum
  end for
end for
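Translated to Matlab, Algorithm 3.4 can be verified against a general inverse (a sketch; the unit lower triangular test matrix is arbitrary):

N = 8; L = tril(randn(N), -1) + eye(N);   % unit lower triangular
x = zeros(N);
for i = 1:N
    x(i,i) = 1;
    for j = i+1:N
        s = L(j,i);                       % the lifted-out first term
        for k = i+1:j-1
            s = s + L(j,k) * x(k,i);
        end
        x(j,i) = -s;
    end
end
norm(x - inv(L))                          % small, on the order of 1e-12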

3.3.4 Final Steps

As of now, L^(−1) has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^(−1). This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^(−1) consists of the reciprocal values of D.

The matrix inverse Q^(−1) can now be obtained by

Q^(−1) = L^(−T) D^(−1) L^(−1)   (3.7)

where the matrix L^(−T) is the transpose of L^(−1). With these final matrix multiplications, the inverse Q^(−1) has been calculated.
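If the sketches above are combined (running Algorithm 3.4 on the L produced by the LDL^T sketch, so that x = L^(−1) and d holds the diagonal of D), Equation 3.7 becomes a one-liner in Matlab:

Qinv = x' * diag(1 ./ d) * x;   % L^-T * D^-1 * L^-1
norm(Qinv - inv(Q))             % should be small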

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range, where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division maps to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.


Operation  | Log space
log(a · b) | log(a) + log(b)
log(a / b) | log(a) − log(b)
log(a^b)   | b · log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^(log(a)) + e^(log(b)))   (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might be very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^(log(a)) + e^(log(b))) = log(e^(max(log(a), log(b))) · (1 + e^(−|log(a) − log(b)|)))
                             = max(log(a), log(b)) + log(1 + e^(−|log(a) − log(b)|))   (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum value of the two probabilities and adding it to the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be log(2) ≈ 0.69, and it will approach 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precomputed and stored in a table, to allow faster computations.

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and think of the decimal point as implicit. This is an approach suitable only for very simple designs, and not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily, by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim, to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate processes: one purely combinatorial process that produces the next state and the appropriate outputs, and one sequential process that only stores the next state into the state registers.

Records have been heavily used, since if the registers are grouped together, it is easier to add additional registers without much rewrite. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by HiTech Global and is a PCI Express based board. The PCI Express connection allows the board to be connected to the PCI Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware, such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of the dedicated blocks is called DSP48E1 This is a highlyoptimized building block containing a 25x18 bit multiplier and an adder It also

44 Hardware 21

has a register so it can accumulate the calculated result This block can performnumerous operations and the behavior can be modified dynamically The inclu-sion of such building blocks is described in Chapter 26

The interesting resource count of the chosen part is summarized in Table 41

Name of resource Number of resource unitsSlice 37680Block RAM (36 Kb) 416DSP48E1 768PCI-Express block 2

Table 41 An overview over interesting resources available in theXC6VLX240T

Even though the end result would not be a complete implementation it is suit-able to target an FPGA platform with the limitations it will entail when choosingsuitable structures

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one-dimensional logic signal, and an array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers, but it is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream. It was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

[Figure 5.1: Control signals for the AMBA AXI4-Stream interface.]

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation: one of the first matrix multiplications that has to be calculated, H^T H.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise. This can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. The lookup table addresses the matrix in column order instead of row order; thus it maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
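A minimal Matlab model of this address mapping (illustrative only; in hardware the 64 entries would reside in a ROM):

% Model of the transpose-address LUT for an 8x8 row-major matrix
N = 8;
c = (0:N*N-1)';                      % address counter in row order
lut = mod(c, N) * N + floor(c / N);  % column-order addresses: 0, 8, 16, ..., 55, 63
H = magic(N);                        % stand-in for the stored matrix H
mem = reshape(H', [], 1);            % block RAM image, H stored row-wise
Ht = reshape(mem(lut + 1), N, N)';   % reading through the LUT yields H transposed
assert(isequal(Ht, H'))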

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

[Figure 5.2: Block diagram of the matrix multiplication implementation: an input BRAM feeding ports a and b of the matrix multiplication IP block, a control FSM with an address LUT, and an output BRAM.]

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, at the price of performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row of L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need to be calculated, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

[Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree; operands come from the input BRAM, the L BRAM and the v/d registers, and results are stored back in the L BRAM and the v/d registers. The data path also includes the reciprocal unit.]

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while also being able to write an individual element. This can be achieved using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories together with some logic that performs the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

[Figure 5.4: Block diagram of the LDL^T unit: an input BRAM for Q, the computation unit with the v and d registers, a control FSM, and an output BRAM for L.]

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can then be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that index 0 must store the initial guess for input = 0.5, while the last index must store the guess for input ≈ 1. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity. A software model of the unit is sketched below the figure.

[Figure 5.5: Block diagram of the reciprocal unit: find the MSB index of d, shift d into the interval [0.5, 1), look up an initial guess of 1/d, refine it with the multiply, square and subtract stages, and shift the result back.]
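The complete scheme (scaling, table lookup, Newton-Raphson refinement, rescaling) can be modeled in a few lines of Matlab; the 6-bit table index and the two iterations below are assumptions for the sketch, not the implemented parameters.

% Software model of the reciprocal unit for d > 0
d = 13.7;                                  % example input
n = floor(log2(d)) + 1;                    % shift count so that 0.5 <= ds < 1
ds = d * 2^(-n);
idx_bits = 6;                              % assumed table index width
step = 2^-(idx_bits + 1);                  % spacing of the table entries in [0.5, 1)
centers = 0.5 + ((0:2^idx_bits - 1) + 0.5) * step;
lut = 1 ./ centers;                        % initial guesses for 1/ds
x = lut(floor((ds - 0.5) / step) + 1);     % index = the bits right of bit -1
for it = 1:2                               % Newton-Raphson iterations, Equation 3.5
    x = x * (2 - ds * x);
end
recip = x * 2^(-n);                        % shift back: 1/d = (1/ds) * 2^-n
% recip * d is now equal to 1 to well beyond 12 fractional bits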

The input and output ports of the reciprocal unit can be seen in Table 5.2. The unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computations, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

    c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

[Figure 5.6: Block diagram of the multiply-and-accumulate module: a multiplier for a × b feeds an adder/subtractor whose other operand is either the accumulator register or zero, selected by a mux driven by the clear signal.]

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory is used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address into the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

[Figure 5.7: Block diagram of the forward substitution unit: an input BRAM for L, an output BRAM for X, the MAC unit with an input mux selecting between the constant 1 and data, a control memory and a control counter.]

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

    result = max(log(a), log(b)) + log(1 + e^-x)    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^-x). A graph of this function can be seen in Figure 5.8.

[Figure 5.8: The function log(1 + e^-x) on the interval 0 ≤ x < 8; it starts at log(2) ≈ 0.69 for x = 0 and decays towards zero.]

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8 the expression goes towards zero, so it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^-8.
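The following Matlab fragment models the table and the selection logic (an illustration only; the hardware operates on fixed point values):

% Model of the Jacobi logarithm unit: 2048-entry table, step 2^-8
step = 2^-8;
lut = log(1 + exp(-(0:step:8-step)));      % fits one 36 Kb BRAM at 16 bits per entry
jacobi = @(la, lb) max(la, lb) + ...
    lut(min(floor(abs(la - lb) / step), numel(lut) - 1) + 1);
la = log(0.3); lb = log(0.0007);
exact = log(exp(la) + exp(lb));            % the addition performed in linear domain
err = abs(jacobi(la, lb) - exact);         % on the order of the table resolution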

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

[Figure 5.9: Block diagram of the Jacobi logarithm unit: log(a) and log(b) are subtracted, the sign bit (MSB) of the difference controls the mux that selects the maximum, and the absolute difference indexes the lookup table whose output is added to the selected value.]

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter presents the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab, where they were compared with and verified against the expected output.

This was done to ensure correct functionality and to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^-8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and of the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid the excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output of the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and it operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

    tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks. A sketch of the underlying iteration is given below.
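The following Matlab sketch shows hyperbolic CORDIC in rotation mode producing sinh and cosh; the iteration list, with indices 4 and 13 repeated, is the standard textbook choice and not a parameter taken from the thesis.

% Hyperbolic CORDIC, rotation mode: approximates cosh(t) and sinh(t), |t| < 1.1
t = 0.8;
ks = [1 2 3 4 4 5 6 7 8 9 10 11 12 13 13]; % indices 4 and 13 repeated for convergence
x = 1; y = 0; z = t;
for k = ks
    sigma = 1; if z < 0, sigma = -1; end   % drive the residual angle z towards zero
    xn = x + sigma * y * 2^(-k);           % shift-and-add datapath, no multipliers
    y  = y + sigma * x * 2^(-k);
    x  = xn;
    z  = z - sigma * atanh(2^(-k));        % the atanh values come from a small table
end
K = prod(sqrt(1 - 2.^(-2*ks)));            % constant total gain of the iterations
ch = x / K; sh = y / K;                    % ~cosh(t) and ~sinh(t)
th = sh / ch;                              % tanh(t) via Equation 6.1; the gain cancels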

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2:

    e^x = e^(x·ln(2)·(1/ln(2))) = 2^(x·(1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined:

    2^(x·(1/ln(2))) = 2^floor(x·(1/ln(2))) · 2^(x·(1/ln(2)) − floor(x·(1/ln(2))))    (6.3)

If y = x·(1/ln(2)) is defined, Equation 6.3 becomes

    2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
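A Matlab sketch of this table-based evaluation follows; the 8-bit table index is an assumption matching the rewrite above, not a verified design choice.

% e^x via rebasing to 2 and a lookup table for 2^f, f in [0, 1)
inv_ln2 = 1 / log(2);                      % the precalculated constant 1/ln(2)
fbits = 8;                                 % assumed fractional index width
lut = 2 .^ ((0:2^fbits - 1) / 2^fbits);    % precomputed 2^f values
x = -2.3;                                  % example input
y = x * inv_ln2;                           % e^x = 2^y
yi = floor(y);                             % integer part: a binary decoder or shift
f = y - yi;                                % fractional part, 0 <= f < 1
approx = 2^yi * lut(floor(f * 2^fbits) + 1);
rel_err = abs(approx - exp(x)) / exp(x);   % bounded by the table resolution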

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that this method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

    [ a  b ]^-1      1     [  d  -b ]
    [ c  d ]     = ------- [ -c   a ]    (6.5)
                   ad - bc

iff ad − bc ≠ 0, as explained in [Strang, 2009].
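In Matlab the closed formula amounts to the following; the example matrix is arbitrary.

% Closed-form 2x2 inversion per Equation 6.5
A = [2 1; 1 3];
detA = A(1,1)*A(2,2) - A(1,2)*A(2,1);      % must be non-zero
Ainv = [A(2,2), -A(1,2); -A(2,1), A(1,1)] / detA;
assert(norm(Ainv - inv(A)) < 1e-12)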

6.3.4 Control Structure

So far, separate modules have been described that solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized together. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since so far there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be accommodated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor is more suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm; this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be usable for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as on the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process, that would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be carried out.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously; this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design with fixed point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm differ between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with such a representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high an accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach is more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
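Written out in Matlab, the four-multiplication structure mentioned above looks as follows; in hardware the four products can be computed in parallel.

% A complex product from four real multiplications and two additions
a = 1.5 + 2.0i; b = -0.7 + 0.3i;
re = real(a)*real(b) - imag(a)*imag(b);
im = real(a)*imag(b) + imag(a)*real(b);
assert(abs((re + 1i*im) - a*b) < 1e-15)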

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection increases if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since the devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than to fabricate a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the result into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules to highlight the implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed to still provide more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm over a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL: http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000, IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

3.3 Matrix Inversion

The matrix inversion is an expensive operation in hardware, and it is favorable if it can be avoided. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, so that this decomposition can be utilized. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

    Q = L D L^T    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero.

Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2: Algorithm for the LDL^T decomposition. The input matrix is Q, and the output is the matrix L along with the vector d, which is the diagonal of D.

v = zeros(N, 1)
d = zeros(N, 1)
L = zeros(N, N)
for i = 1 → N do
    sum = 0
    for j = 1 → i - 1 do
        v[j] = L[i][j] * d[j]
        sum = sum + L[i][j] * v[j]
    end for
    v[i] = d[i] = Q[i][i] - sum
    rec = 1 / v[i]
    for j = i + 1 → N do
        sum = 0
        for k = 1 → i - 1 do
            sum = sum + L[j][k] * v[k]
        end for
        L[j][i] = (Q[j][i] - sum) * rec
    end for
end for

In Algorithm 3.2 a temporary vector, denoted v, is required to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is readability and ease of implementation. A runnable counterpart is given below.
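For reference, a direct Matlab transcription of Algorithm 3.2 (double precision, ignoring fixed point effects):

function [L, d] = ldlt_decompose(Q)
% LDL^T decomposition of a symmetric positive definite Q, per Algorithm 3.2
N = size(Q, 1);
v = zeros(N, 1); d = zeros(N, 1); L = eye(N);
for i = 1:N
    s = 0;
    for j = 1:i-1
        v(j) = L(i,j) * d(j);
        s = s + L(i,j) * v(j);
    end
    d(i) = Q(i,i) - s;
    v(i) = d(i);
    rec = 1 / d(i);                  % performed by the reciprocal unit in hardware
    for j = i+1:N
        s = 0;
        for k = 1:i-1
            s = s + L(j,k) * v(k);
        end
        L(j,i) = (Q(j,i) - s) * rec;
    end
end
end
% Example: A = randn(8); Q = A*A' + eye(8); [L, d] = ldlt_decompose(Q);
% norm(L*diag(d)*L' - Q) is then on the order of 1e-14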


3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the multiplication n · (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

    f(x) = 1/x - d    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

    x_{i+1} = x_i - f(x_i) / f'(x_i)    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

    x_{i+1} = x_i (2 - d·x_i) = 2·x_i - d·x_i^2    (3.5)

The performance of this algorithm depends on how good the first guess x_0 is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to up to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^-1, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^-1 by solving the matrix equation

    L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the ith column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^-1.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3: Forward substitution - general algorithm.

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j - 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] - sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following facts can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions are by one. It also gives that the diagonal of x consists of only ones.

The second assumption changes the limits of the second innermost loop, since only the lower triangular part of the result is non-zero. It also changes the limits of the innermost loop, since the upper triangular part of x is zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, is a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations is greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

    for i = 1 → N do
        x[i][i] = 1
        for j = i + 1 → N do
            sum = L[j][i]
            for k = i + 1 → j − 1 do
                sum = sum + L[j][k] * x[k][i]
            end for
            x[j][i] = −sum
        end for
    end for

3.3.4 Final Steps

As of now, L⁻¹ has been obtained from the forward substitution in Chapter 3.3.3. One additional matrix is needed for the calculation of the matrix inverse: D⁻¹. This matrix can be obtained for free from the LDLT decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D⁻¹ consists of the reciprocal values of D.

The matrix inverse Q⁻¹ can now be obtained by

    Q⁻¹ = L⁻ᵀ·D⁻¹·L⁻¹    (3.7)

where the matrix L⁻ᵀ is the transpose of L⁻¹. With these final matrix multiplications the inverse Q⁻¹ has been calculated.
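
In a software model the assembly in Equation 3.7 is a one-liner; the numpy sketch below assumes L_inv is the matrix from the forward substitution (as a numpy array) and d_inv is the vector of reciprocal diagonal values.

    import numpy as np

    def assemble_inverse(L_inv, d_inv):
        """Q^-1 = L^-T * D^-1 * L^-1 for Q = L * D * L^T."""
        return L_inv.T @ np.diag(d_inv) @ L_inv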

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result is greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division maps to subtraction and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.


Operation    Log space
log(a · b)   log(a) + log(b)
log(a / b)   log(a) − log(b)
log(a^b)     b · log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

    log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might be very large.

With these limitations in mind it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

    log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} · (1 + e^{−|log(a) − log(b)|}))
                                 = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69 and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
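
A minimal software model of Equation 3.9 is given below, where math.log1p stands in for the precomputed table used in hardware.

    import math

    def jacobi_log(log_a, log_b):
        """Compute log(a + b) given only log(a) and log(b)."""
        diff = log_a - log_b
        larger = log_a if diff >= 0 else log_b      # max() from the sign of the difference
        return larger + math.log1p(math.exp(-abs(diff)))

    # Recovers log(2 + 6) = log(8) without ever forming a = 2 or b = 6.
    print(jacobi_log(math.log(2), math.log(6)))     # ~2.0794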

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This made it easier to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. It allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one sequential part that only stores the next state into the state registers.

Records have been used heavily, since grouping registers together makes it easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register so it can accumulate the calculated result. This block can perform numerous operations and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices are of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one dimensional logic signal. An array of logic values of dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances of the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is HᵀH, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and Hᵀ simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, …, 62, 63 onto 0, 8, 16, …, 55, 63, as the sketch below illustrates.
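
The contents of such an address lookup table are straightforward to generate; the following Python sketch shows the mapping for the 8×8 case.

    # Remap a row-order read counter to column-order addresses, so that
    # H^T can be streamed from the same BRAM contents as H.
    N = 8
    transpose_lut = [(i % N) * N + (i // N) for i in range(N * N)]

    assert transpose_lut[0:3] == [0, 8, 16]
    assert transpose_lut[62:64] == [55, 63]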

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result is subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This can be achieved using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDLT unit

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D⁻¹ matrix output

Table 5.1: Input and output ports of the LDLT decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2 and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are omitted from Figure 5.5 for clarity.
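
A behavioral model of the scaling scheme is sketched below; the 64-entry table and the single refinement step are assumptions made for the illustration, not the implemented parameters.

    import math

    TABLE_BITS = 6
    # Initial guesses for 1/m with m in [0.5, 1), indexed by the bits below bit -1.
    TABLE = [1.0 / (0.5 + (i + 0.5) * 2.0 ** -(TABLE_BITS + 1))
             for i in range(2 ** TABLE_BITS)]

    def scaled_reciprocal(d):
        shift = 1 + math.floor(math.log2(d))            # MSB position: a priority encoder in hardware
        m = d * 2.0 ** -shift                           # normalize so that 0.5 <= m < 1
        index = int((m - 0.5) * 2 ** (TABLE_BITS + 1))  # drop the always-set bit -1
        x = TABLE[index]
        x = x * (2.0 - m * x)                           # one Newton-Raphson refinement
        return x * 2.0 ** -shift                        # shift back to undo the input scaling

    print(scaled_reciprocal(12.0))   # ~0.0833, close to 1/12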

Figure 5.5: Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

    c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

    result = max(log(a), log(b)) + log(1 + e⁻ˣ)    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e⁻ˣ). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e⁻ˣ) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log₂(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2⁻⁸.
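
The table generation and lookup can be modeled in a few lines of Python; the 2048-entry size and the 2⁻⁸ step follow from the reasoning above.

    import math

    STEP = 2.0 ** -8                 # resolution given by the 8 fractional bits
    SIZE = 2 ** 11                   # 3 integer + 8 fractional bits -> 2048 entries

    TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(SIZE)]

    def correction(x):
        index = min(int(x / STEP), SIZE - 1)   # saturate x into [0, 8)
        return TABLE[index]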

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter presents the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was done to ensure correct functionality and to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be if the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far to the right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2⁻⁸.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes HᵀH as described in Chapter 5.2.3.


Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the most bits of any signal in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

    tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute eˣ to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts by rewriting the base of the calculations from e to 2 with

    eˣ = e^{x · ln(2) · (1/ln(2))} = 2^{x · (1/ln(2))}    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

    2^{x · (1/ln(2))} = 2^{floor(x · (1/ln(2)))} · 2^{x · (1/ln(2)) − floor(x · (1/ln(2)))}    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

    2^y = 2^{floor(y)} · 2^{y − floor(y)}    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
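
A software model of this decomposition is sketched below, with the fractional power computed directly where the hardware would use the lookup table.

    import math

    INV_LN2 = 1.0 / math.log(2.0)     # the precalculated constant 1/ln(2)

    def exp_base2(x):
        y = x * INV_LN2
        n = math.floor(y)             # 2^n: a binary decoder / shifter in hardware
        frac = y - n                  # 0 <= frac < 1: index into the 2^frac table
        return 2.0 ** n * 2.0 ** frac

    print(exp_base2(1.0))   # ~2.71828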

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

    [ a  b ]⁻¹   =   1/(ad − bc) · [  d  −b ]
    [ c  d ]                       [ −c   a ]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
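
Equation 6.5 translates directly into code; a minimal sketch:

    def inverse_2x2(a, b, c, d):
        """Closed-form inverse of [[a, b], [c, d]] per Equation 6.5."""
        det = a * d - b * c
        if det == 0:
            raise ZeroDivisionError("matrix is singular")
        s = 1.0 / det
        return [[ s * d, -s * b],
                [-s * c,  s * a]]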

6.3.4 Control Structure

So far, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, where there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be accommodated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this approach is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution for automatically transforming software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm is performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection increases if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices using wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate it into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321, Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication, barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 22: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

14 3 Problem Analysis

3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic arithmetic operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the multiplication n · (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = 1/x − d.    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i − f(x_i) / f'(x_i),    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

x_{i+1} = x_i (2 − d · x_i) = 2 x_i − d · x_i².    (3.5)

The performance of this algorithm depends on how good the guess x_0 for the first iteration is. A good approach to avoid an excessive number of iterations is to use a lookup table with an initial guess that is correct to a few decimals. Storing a complete table with the desired final precision is not feasible, since such a table would be very large.
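
As an illustration of the iteration in Equation 3.5, the following minimal Python model uses a small lookup table for the initial guess. The table size of 64 entries, and the assumption that d has already been scaled to 0.5 ≤ d < 1 (as done in the hardware unit in Chapter 5.3.2), are illustrative choices, not values from the thesis.

```python
# Minimal model of Newton-Raphson reciprocal (Equation 3.5), assuming
# the divisor has been pre-scaled to 0.5 <= d < 1.
TABLE_SIZE = 64  # illustrative table size

# Initial guesses, one per table interval, correct to a few decimals.
lut = [1.0 / (0.5 + (i + 0.5) / (2 * TABLE_SIZE)) for i in range(TABLE_SIZE)]

def reciprocal(d, iterations=2):
    assert 0.5 <= d < 1.0
    x = lut[int((d - 0.5) * 2 * TABLE_SIZE)]  # lookup of x_0
    for _ in range(iterations):
        x = x * (2.0 - d * x)  # x_{i+1} = x_i * (2 - d * x_i)
    return x

print(reciprocal(0.75))  # approximately 1.3333
```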

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the original inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

L x_i = e_i    (3.6)

for i = 1, ..., n, where e_i is the i-th column of the identity matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm that solves Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions, and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following facts can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is an identity matrix.

The first fact effectively eliminates the divisions, since all of the divisions will be by one. It also implies that the diagonal of x will consist of only ones.

The second fact changes the limits of the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits of the innermost loop, since the upper triangular part of x will be zero.

Since e is an identity matrix, the first multiply-and-add operation, when k = i, is a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes, the number of operations is greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
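
As a cross-check of the operation counts above, the following Python sketch (an illustrative model, not code from the thesis) implements Algorithm 3.4 and counts the operations for N = 8:

```python
import numpy as np

def invert_unit_lower(L):
    """Invert a unitriangular matrix L by forward substitution (Algorithm 3.4)."""
    N = L.shape[0]
    X = np.zeros_like(L)
    macs = subs = 0
    for i in range(N):
        X[i, i] = 1.0
        for j in range(i + 1, N):
            s = L[j, i]
            for k in range(i + 1, j):
                s += L[j, k] * X[k, i]  # multiply-and-add
                macs += 1
            X[j, i] = -s                # subtraction (negation)
            subs += 1
    return X, macs, subs

L = np.tril(np.random.randn(8, 8), -1) + np.eye(8)
X, macs, subs = invert_unit_lower(L)
print(macs, subs)                     # 56 multiply-and-add, 28 subtractions
print(np.allclose(L @ X, np.eye(8)))  # True
```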

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1},    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.
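
To summarize the whole chain, the following Python sketch assembles Q^{-1} according to Equation 3.7. It is illustrative only: the decomposition loop follows the standard LDL^T formulation (cf. Algorithm 3.2 in an earlier section), and a library solver stands in for the forward substitution of Equation 3.6.

```python
import numpy as np

def ldlt(Q):
    """LDL^T decomposition of a symmetric positive definite matrix Q."""
    n = Q.shape[0]
    L, d = np.eye(n), np.zeros(n)
    for i in range(n):
        v = L[i, :i] * d[:i]           # pair-wise products
        d[i] = Q[i, i] - L[i, :i] @ v  # adder tree + subtraction
        for j in range(i + 1, n):
            L[j, i] = (Q[j, i] - L[j, :i] @ v) / d[i]  # division -> reciprocal
    return L, d

A = np.random.randn(8, 8)
Q = A @ A.T + 8 * np.eye(8)              # symmetric positive definite test matrix
L, d = ldlt(Q)
Linv = np.linalg.solve(L, np.eye(8))     # forward substitution, Equation 3.6
Qinv = Linv.T @ np.diag(1.0 / d) @ Linv  # Equation 3.7
print(np.allclose(Qinv, np.linalg.inv(Q)))  # True
```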

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when calculations are performed on small probabilities, the result is greatly affected by the precision used. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation     Log space
log(a · b)    log(a) + log(b)
log(a / b)    log(a) − log(b)
log(a^b)      b · log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)}).    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} · (1 + e^{−|log(a) − log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
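
A minimal Python sketch of Equation 3.9 (illustrative; math.log1p stands in for the precomputed table used in hardware) shows the computation:

```python
import math

def jacobi_log(log_a, log_b):
    """log(a + b) computed from log(a) and log(b) via Equation 3.9."""
    x = abs(log_a - log_b)  # always >= 0, so the correction term <= log(2)
    return max(log_a, log_b) + math.log1p(math.exp(-x))

# log(0.5 + 0.25), computed entirely in log space:
print(jacobi_log(math.log(0.5), math.log(0.25)))  # equals log(0.75)
```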

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This made it easier to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allow visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to treat the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. It allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level (RTL). This means that the VHDL source code describes registers and the operations performed when data is transferred from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily, since grouping registers together makes it easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by HiTech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The relevant resource counts of the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result is not a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices are of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one dimensional logic signal. An array of logic values of dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, a bus of width 18 × 64 × 2 = 2304 bits would be necessary to provide the two matrix inputs. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last, and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances of the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise. This can be solved by using the same counter as for the original matrix, but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus, the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
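
The contents of such an address lookup table can be generated with a one-line model (an illustrative Python sketch, not from the thesis):

```python
# Map a row-order counter to column-order reads of an 8x8 matrix stored
# row-wise: element (row, col) resides at address row * 8 + col.
lut = [(i % 8) * 8 + i // 8 for i in range(64)]
print(lut[:3], lut[-2:])  # [0, 8, 16] ... [55, 63]
```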

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation (input BRAM read on ports a and b via the address LUT, the matrix multiplication IP block, the control FSM, and an output BRAM).


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row of L as before, followed by a summation that can be performed with an adder tree. This result is subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree (operands are taken from the input BRAM, the L BRAM, and the V/D registers; the adder tree result passes through subtract, reciprocal, and multiply stages before being stored in the L BRAM and the V/D registers).

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to read a complete row of the matrix L simultaneously while being able to write an individual element. This can be achieved using a dual port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories, together with some logic that performs the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit, and the registers.


Figure 5.4: Block diagram of the LDL^T unit (input BRAM for Q, the computation unit, V and D registers, a control FSM, and an output BRAM for L).

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^{-1} matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside in position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0, the initial guess for input = 0.5 must be stored, while at the last index, the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
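
A bit-level Python model of the scaling and indexing may clarify the mechanics (an illustrative sketch; the 12 fractional bits match the port widths in Table 5.2, while the 64-entry table is an assumption):

```python
FRAC, IDX_BITS = 12, 6  # fractional bits of d; 2**IDX_BITS table entries

def scale_and_index(d_fix):
    """Scale d (an unsigned fixed point number with value d_fix / 2**FRAC)
    to 0.5 <= d < 1 and derive the table index from the bits to the right
    of the leading one, which is always set after scaling."""
    msb = d_fix.bit_length() - 1  # position of the most significant set bit
    shift = (FRAC - 1) - msb      # steps needed to move it to position -1
    scaled = d_fix << shift if shift >= 0 else d_fix >> -shift
    index = (scaled - (1 << (FRAC - 1))) >> (FRAC - 1 - IDX_BITS)
    # The table result must afterwards be shifted left by `shift` steps,
    # since 1/d = 2**shift * (1 / (d * 2**shift)).
    return shift, index

print(scale_and_index(3072))  # d = 0.75: no shift, middle of the table (0, 32)
```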

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit (find MSB index, shift d into range, lookup table for the initial guess, followed by square, multiply, subtract, and shift stages producing 1/d).

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b,    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module (a multiplier for a × b feeding an adder/subtracter whose other operand is either the register contents or 0, selected by a mux controlled by clear).

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address into the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit (a control counter addressing the control memory, an input BRAM for L, the MAC unit with an input mux selecting between 1 and the data input, and an output BRAM for X).

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^{−x}).    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected; otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{−x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{−x}) on the interval 0 ≤ x < 8 (starting at log(2) ≈ 0.69 and decaying towards zero).

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, so it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x so that it only contains log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^{−8}.
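
The table generation and the saturating index derivation can be modeled in a few lines of Python (an illustrative sketch of the scheme described above):

```python
import math

# 2048-entry table of log(1 + exp(-x)) for x in steps of 2**-8 over [0, 8).
table = [math.log1p(math.exp(-i / 2**8)) for i in range(2**11)]

def correction(x):
    """Look up log(1 + exp(-x)) for x >= 0, saturating the index to
    11 bits (3 integer, 8 fractional)."""
    index = min(int(x * 2**8), 2**11 - 1)
    return table[index]

print(table[0], math.log(2))  # the first entry equals log(2)
```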

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit (a subtracter computes log(a) − log(b); its MSB, the sign bit, controls the mux that selects the maximum, its absolute value indexes the lookup table, and an adder produces the result).

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.
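
The measurement procedure can be summarized by a small model; the sketch below is an illustrative Python equivalent (the trial count and the quantization stand-in for the RTL output are assumptions, not details from the thesis):

```python
import numpy as np

FRAC = 12  # output fractional bits of the module under test

def quantize(a):
    """Round to FRAC fractional bits, mimicking the fixed point hardware."""
    return np.round(a * 2**FRAC) / 2**FRAC

max_err = 0.0
for _ in range(1000):  # multiple randomized runs to expose the maximum error
    A = quantize(np.random.randn(8, 8))
    B = quantize(np.random.randn(8, 8))
    reference = A @ B               # ideal, double precision
    measured = quantize(reference)  # stand-in for the RTL output
    max_err = max(max_err, np.abs(measured - reference).max())
print(max_err)  # about 2**-13, i.e. the output rounding step
```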

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^{−8}.

6.2 Resource Usage

The following sections describe the resource usage of each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H, as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and of the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipeline registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output of the addition is the value with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6, the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x),    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
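
A behavioral Python sketch of hyperbolic CORDIC in rotation mode illustrates the idea (an illustrative model, not the proposed hardware; the iteration count is an assumption, and the well-known repetition of iterations 4 and 13 is required for convergence):

```python
import math

def cordic_tanh(z, n=16):
    """Approximate tanh(z) for |z| < ~1.12 with hyperbolic CORDIC.

    After the iterations, x ~ K*cosh(z) and y ~ K*sinh(z); the gain K
    cancels in the ratio y / x, so no gain compensation is needed."""
    indices = []
    for i in range(1, n + 1):
        indices.append(i)
        if i in (4, 13):  # these iterations must be repeated
            indices.append(i)
    x, y = 1.0, 0.0
    for i in indices:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0**-i, y + d * x * 2.0**-i  # shifts and adds
        z -= d * math.atanh(2.0**-i)                     # small lookup table
    return y / x

print(cordic_tanh(0.5), math.tanh(0.5))  # both approximately 0.4621
```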

6.3.2 Exponential Function

In the algorithm, it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculation from e to 2 with

e^x = e^{x · ln(2) / ln(2)} = 2^{x · 1/ln(2)},    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^{x · 1/ln(2)} = 2^{floor(x · 1/ln(2))} · 2^{x · 1/ln(2) − floor(x · 1/ln(2))}.    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^{floor(y)} · 2^{y − floor(y)},    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
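
A Python model of Equation 6.4 (an illustrative sketch; the 8-bit table resolution is an assumption):

```python
import math

INV_LN2 = 1.0 / math.log(2.0)  # precalculated constant 1/ln(2)
FRAC_BITS = 8                  # assumed table resolution
frac_table = [2.0**(i / 2**FRAC_BITS) for i in range(2**FRAC_BITS)]

def exp_via_pow2(x):
    """e^x = 2^floor(y) * 2^(y - floor(y)) with y = x / ln(2) (Equation 6.4)."""
    y = x * INV_LN2
    k = math.floor(y)  # 2**k corresponds to a simple binary decoder in hardware
    frac = y - k       # 0 <= frac < 1 indexes the lookup table
    return 2.0**k * frac_table[int(frac * 2**FRAC_BITS)]

print(exp_via_pow2(-1.5), math.exp(-1.5))  # ~0.223 for both (table-limited)
```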

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that this method is described for floating point numbers.

6.3.3 Additional Matrix Operations

Not only matrix multiplication is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. The inversion of a matrix of dimension 2 can be described by

[a b; c d]^{-1} = 1/(ad − bc) · [d −b; −c a],    (6.5)

if and only if ad − bc ≠ 0, as explained in [Strang, 2009].
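
A direct transcription of Equation 6.5 (an illustrative sketch):

```python
def inv2x2(a, b, c, d):
    """Closed-form inverse of the 2x2 matrix [a b; c d] (Equation 6.5)."""
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[d / det, -b / det],
            [-c / det, a / det]]

print(inv2x2(4.0, 7.0, 2.0, 6.0))  # [[0.6, -0.7], [-0.2, 0.4]]
```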

6.3.4 Control Structure

So far, separate modules have been described that solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of the LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm, and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why a floating point approach might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used, as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm is performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection, among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that the same hardware can be used to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with that representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as the addition of small sub-matrices, and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture, as described in Chapter 6.6.2, were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, which is favorable since the workload of detection increases if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than interconnected individual ASICs, and it would therefore be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector in a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321, Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000. IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.


IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 23: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

Algorithm 3.3: Forward substitution - general algorithm

for i = 1 → N do
    for j = 1 → N do
        sum = 0
        for k = 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = (e[j][i] − sum) / L[j][j]
    end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x1, ..., xn) and e = (e1, ..., en). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions, and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following can be considered useful:

1. L is unitriangular. This means that the diagonal consists of only ones.

2. The inverse of a lower triangular matrix is also a lower triangular matrix.

3. e is a unit matrix.

The first assumption effectively eliminates the divisions, since all of the divisions will be by one. This assumption also gives that the diagonal of x will consist of only ones.

The second assumption changes the limits on the second innermost loop, since only the lower triangular part of the result will be non-zero. It also changes the limits on the innermost loop, since the upper triangular part of x will be zero.

Since e is a unit matrix, the first multiply-and-add operation, when k = i, will be a multiplication by one and can thus be eliminated and lifted outside of the loop. With these changes the number of operations has been greatly reduced: if L is of dimension 8, the operation count is now 56 multiply-and-add operations and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.

Algorithm 3.4: Forward substitution - optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
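As a concrete model of the loop structure, the following Python sketch (standing in for the Matlab models used during development; numpy and the function name are illustrative assumptions) implements Algorithm 3.4 and checks it against the defining property that the result is the inverse of L:

import numpy as np

def invert_unit_lower(L):
    # Algorithm 3.4: invert a unit lower-triangular L by forward
    # substitution, exploiting that the right-hand side is the identity
    # and that the result is unit lower triangular as well
    N = L.shape[0]
    X = np.zeros((N, N))
    for i in range(N):
        X[i, i] = 1.0
        for j in range(i + 1, N):
            s = L[j, i]
            for k in range(i + 1, j):
                s += L[j, k] * X[k, i]
            X[j, i] = -s
    return X

L = np.tril(np.random.randn(8, 8), -1) + np.eye(8)
print(np.allclose(invert_unit_lower(L) @ L, np.eye(8)))  # True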

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1, by taking the values from the reciprocal unit instead of the values from the d vector; since D is diagonal, D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications the inverse Q^{-1} has been calculated.
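A small numpy sketch (an illustration, not part of the hardware design) verifies Equation 3.7 numerically, by building a random Q = L D L^T and reassembling its inverse from L^{-1} and the reciprocal diagonal values:

import numpy as np

N = 8
L = np.tril(np.random.randn(N, N), -1) + np.eye(N)  # unit lower triangular
d = np.abs(np.random.randn(N)) + 0.5                # positive diagonal of D
Q = L @ np.diag(d) @ L.T

L_inv = np.linalg.inv(L)                    # in hardware: forward substitution
Q_inv = L_inv.T @ np.diag(1.0 / d) @ L_inv  # Q^{-1} = L^{-T} D^{-1} L^{-1}
print(np.allclose(Q_inv, np.linalg.inv(Q)))  # True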

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation maps to multiplication. A summary of these identities can be seen in Table 3.1.

Operation     Log space
log(a * b)    log(a) + log(b)
log(a / b)    log(a) − log(b)
log(a^b)      b * log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the largest of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{−|log(a) − log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})    (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
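The method can be illustrated with a short Python sketch (the function name and the example values are illustrative assumptions):

import math

def jacobi_log(log_a, log_b):
    # log(a + b) computed from log(a) and log(b), per Equation 3.9
    return max(log_a, log_b) + math.log1p(math.exp(-abs(log_a - log_b)))

# summing several very small probabilities in log space stays well-scaled
log_p = [-1000.0, -1000.5, -999.8]
total = log_p[0]
for lp in log_p[1:]:
    total = jacobi_log(total, lp)
print(total)  # finite result, even though e^-1000 underflows to 0 directly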

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab, to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the needed precision.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed-point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed-point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics, instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction level used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one sequential part that only stores the next state into the state registers.

Records have been used heavily, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI Express based board. The PCI Express connection allows the board to be connected to the PCI Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops, with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18-bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real-valued matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one-dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block, it is possible to trade off performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, a bus of width 18 × 64 × 2 = 2304 bits would be necessary to provide the two matrix inputs. It is not feasible to route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last, and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus, the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
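The content of such a lookup table is easy to generate; as a sketch, position i in the read sequence maps to address (i mod 8) · 8 + floor(i / 8):

# address LUT for reading the row-stored 8x8 matrix H in column order (H^T)
lut = [(i % 8) * 8 + i // 8 for i in range(64)]
print(lut[:5], lut[62], lut[63])  # [0, 8, 16, 24, 32] 55 63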

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.

5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the ith element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.
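As a behavioral model of this computation order (a Python sketch standing in for the Matlab models; names are illustrative), the decomposition can be written as follows, with v reused exactly as described above:

import numpy as np

def ldlt(Q):
    # Q = L D L^T with L unit lower triangular and D diagonal (vector d)
    n = Q.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for i in range(n):
        v = L[i, :i] * d[:i]           # pair-wise product of d and row i of L
        d[i] = Q[i, i] - L[i, :i] @ v  # adder-tree sum subtracted from Q[i][i]
        for r in range(i + 1, n):      # rest of column i, scaled by 1/d[i]
            L[r, i] = (Q[r, i] - L[r, :i] @ v) / d[i]
    return L, d

A = np.random.randn(8, 8)
Q = A @ A.T + 8 * np.eye(8)            # symmetric positive definite test input
L, d = ldlt(Q)
print(np.allclose(L @ np.diag(d) @ L.T, Q))  # True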

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously, while also being able to write an individual element. This can be achieved using a dual-port block RAM created with CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories, together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit, and the registers.

Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^{-1} matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one of the input number must reside in position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps, until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations, to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5, for clarity.

Figure 5.5: Block diagram of the reciprocal unit.
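A Python model of the unit may clarify the scaling and indexing; this sketch assumes that Equation 3.5 is the Newton-Raphson step x' = x(2 − d·x) and uses a 256-entry table, both illustrative choices:

import math

TABLE_BITS = 8
# initial guesses for 1/m with m in [0.5, 1); the always-set bit at
# position -1 is dropped, so index 0 corresponds to m = 0.5
TABLE = [1.0 / (0.5 + i / 2 ** (TABLE_BITS + 1)) for i in range(2 ** TABLE_BITS)]

def reciprocal(d):
    n = 0
    m = d
    while m >= 1.0:                  # dynamic scaling: shift until
        m /= 2.0; n += 1             # 0.5 <= m < 1
    while m < 0.5:
        m *= 2.0; n -= 1
    idx = int((m - 0.5) * 2 ** (TABLE_BITS + 1))  # the "subtract 0.5" index
    x = TABLE[idx]
    x = x * (2.0 - m * x)            # one refinement step (assumed Equation 3.5)
    return x / 2 ** n                # undo the input scaling

print(reciprocal(3.7), 1 / 3.7)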

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation; it would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix, for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address into the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^{−x})    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{−x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{−x}) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, so it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^{−8}.
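A Python sketch of this quantization (table size and wordlengths as just derived; the function name is an assumption) shows the complete table-based computation:

import math

FRAC = 8                                  # fractional bits of the index
TABLE = [math.log(1.0 + math.exp(-i / 2 ** FRAC)) for i in range(2 ** 11)]

def jacobi_log_fixed(log_a, log_b):
    x = abs(log_a - log_b)
    idx = min(int(x * 2 ** FRAC), 2 ** 11 - 1)   # saturate to 0 <= x < 8
    return max(log_a, log_b) + TABLE[idx]

print(jacobi_log_fixed(math.log(0.4), math.log(0.3)), math.log(0.7))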

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double-precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but this is of limited use if the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), yet still has a step size of 2^{−8}.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H, as described in Chapter 5.2.3.

Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.
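As a sketch of what this operation does (the exact bit positions here are assumptions for illustration, chosen so that 40 − 22 = 18), round-and-saturate can be modeled as:

def round_saturate(v, drop_bits=22, out_bits=18):
    v = (v + (1 << (drop_bits - 1))) >> drop_bits          # round to nearest
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, v))                             # clamp to 18 bits

print(round_saturate((3 << 22) + (1 << 21)))  # 4: the halfway bit rounds up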

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm, described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
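A Python model of the hyperbolic CORDIC (rotation mode) illustrates the idea; note that when tanh is formed as the quotient y/x, the constant CORDIC gain cancels, so no gain compensation is needed. The iteration count and structure here are a sketch, valid for |z| up to roughly 1.1:

import math

def tanh_cordic(z, iterations=16):
    # rotate (1, 0) by the hyperbolic angle z: x -> K*cosh(z), y -> K*sinh(z);
    # iteration indices 4, 13, 40, ... must be repeated for convergence
    seq, i, repeat = [], 1, 4
    while len(seq) < iterations:
        seq.append(i)
        if i == repeat:
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    x, y = 1.0, 0.0
    for i in seq[:iterations]:
        e = math.atanh(2.0 ** -i)      # in hardware: a small lookup table
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * e
    return y / x                       # tanh = sinh/cosh; the gain K cancels

print(tanh_cordic(0.5), math.tanh(0.5))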

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^{x ln(2) / ln(2)} = 2^{x / ln(2)}    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^{x / ln(2)} = 2^{floor(x / ln(2))} · 2^{x / ln(2) − floor(x / ln(2))}    (6.3)

If y = x / ln(2) is defined, Equation 6.3 becomes

2^y = 2^{floor(y)} · 2^{y − floor(y)}    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
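The scheme can be sketched in Python (the 256-entry table size and the function name are assumptions for illustration):

import math

FRAC_BITS = 8
TABLE = [2.0 ** (i / 2 ** FRAC_BITS) for i in range(2 ** FRAC_BITS)]
INV_LN2 = 1.0 / math.log(2.0)      # the precalculated constant 1/ln(2)

def exp_approx(x):
    y = x * INV_LN2                # e^x = 2^y per Equation 6.2
    k = math.floor(y)              # 2^k: a binary decoder / shifter in hardware
    f = y - k                      # 0 <= f < 1 indexes the lookup table
    return TABLE[int(f * 2 ** FRAC_BITS)] * 2.0 ** k

print(exp_approx(-2.5), math.exp(-2.5))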

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated further instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[a b; c d]^{−1} = (1 / (ad − bc)) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
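As a sketch, Equation 6.5 translates directly into code (the function name is an assumption):

def inv2x2(a, b, c, d):
    # closed-form inverse of [[a, b], [c, d]] per Equation 6.5
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    s = 1.0 / det
    return [[d * s, -b * s],
            [-c * s, a * s]]

print(inv2x2(4.0, 7.0, 2.0, 6.0))  # [[0.6, -0.7], [-0.2, 0.4]]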

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.

The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating-point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high-level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High-level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high-level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements, represented by fixed-point numbers with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor, capable of performing for instance division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed-point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

As of now, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection, among other problems. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations, such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
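The decomposition of a complex multiplication into real operations, which a hardware datapath can evaluate in parallel, looks as follows (a small illustrative sketch):

def cmul(ar, ai, br, bi):
    # (ar + j*ai)(br + j*bi): four real multiplications and two
    # additions/subtractions, all products computable in parallel
    return (ar * br - ai * bi,     # real part
            ar * bi + ai * br)     # imaginary part

print(cmul(1.0, 2.0, 3.0, 4.0), (1 + 2j) * (3 + 4j))  # (-5.0, 10.0) (-5+10j)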

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000. IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 24: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

16 3 Problem Analysis

Algorithm 3.4 Forward substitution, optimized for this particular case

for i = 1 → N do
    x[i][i] = 1
    for j = i + 1 → N do
        sum = L[j][i]
        for k = i + 1 → j − 1 do
            sum = sum + L[j][k] * x[k][i]
        end for
        x[j][i] = −sum
    end for
end for
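As a reference model, the loop structure above can be expressed directly in Python (a minimal sketch in the spirit of the Matlab models described in Chapter 4.1; NumPy and the randomized test matrix are illustration choices, not part of the hardware design):

import numpy as np

def forward_substitution_inverse(L):
    """Invert a unit lower triangular matrix L by forward substitution.

    Mirrors Algorithm 3.4: column i of X = L^-1 is solved top-down,
    exploiting that the diagonal of L is all ones.
    """
    N = L.shape[0]
    X = np.zeros_like(L)
    for i in range(N):
        X[i, i] = 1.0
        for j in range(i + 1, N):
            s = L[j, i]
            for k in range(i + 1, j):
                s += L[j, k] * X[k, i]
            X[j, i] = -s
    return X

# Quick check against NumPy on a random unit lower triangular matrix
L = np.tril(np.random.randn(8, 8), -1) + np.eye(8)
assert np.allclose(forward_substitution_inverse(L), np.linalg.inv(L))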

334 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3.

One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1}                    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications the inverse Q^{-1} has been calculated.
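To make the final assembly concrete, the sketch below builds Q^{-1} from the pieces produced by the earlier stages (a minimal Python model; the small ldlt helper follows the standard LDL^T recurrences behind Algorithm 3.2, and the NumPy inverse stands in for the forward substitution unit):

import numpy as np

def ldlt(Q):
    """LDL^T decomposition of a symmetric positive definite matrix Q."""
    N = Q.shape[0]
    L, d = np.eye(N), np.zeros(N)
    for j in range(N):
        d[j] = Q[j, j] - np.sum(L[j, :j] ** 2 * d[:j])
        for i in range(j + 1, N):
            L[i, j] = (Q[i, j] - np.sum(L[i, :j] * L[j, :j] * d[:j])) / d[j]
    return L, d

A = np.random.randn(8, 8)
Q = A @ A.T + 8 * np.eye(8)          # a symmetric positive definite test matrix

L, d = ldlt(Q)
L_inv = np.linalg.inv(L)             # stand-in for the forward substitution unit
Q_inv = L_inv.T @ np.diag(1.0 / d) @ L_inv   # Equation 3.7
assert np.allclose(Q_inv, np.linalg.inv(Q))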

34 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used in the calculations. If the calculations are performed in log space, the quantities are scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication maps to addition, division to subtraction, and exponentiation to multiplication. A summary of these identities can be seen in Table 3.1.


Operation      Log space
log(a * b)     log(a) + log(b)
log(a / b)     log(a) − log(b)
log(a^b)       b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

log(a + b) = log(e^{log(a)} + e^{log(b)})                    (3.8)

Note that a and b are not actually stored; instead, their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might grow very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

log(e^{log(a)} + e^{log(b)}) = log(e^{max(log(a), log(b))} (1 + e^{−|log(a) − log(b)|}))
                             = max(log(a), log(b)) + log(1 + e^{−|log(a) − log(b)|})     (3.9)

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding it to the additional logarithmic expression.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
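A direct model of Equation 3.9 is short enough to state outright (a Python sketch; the table-based variant used in hardware is described in Chapter 5.4):

import math

def jacobi_log(log_a: float, log_b: float) -> float:
    """log(a + b) computed from log(a) and log(b), Equation 3.9.

    Avoids the overflow/underflow of exponentiating back to linear scale.
    """
    x = abs(log_a - log_b)
    return max(log_a, log_b) + math.log1p(math.exp(-x))

# The identity holds exactly: log(0.3 + 0.0004) via the Jacobi logarithm
print(jacobi_log(math.log(0.3), math.log(0.0004)))  # == log(0.3004)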

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, implementing the matrix operations in separate functions. This made it easier to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.
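This kind of wordlength investigation can be mimicked with a small helper that quantizes a signal to a given fixed-point format and reports the error (a Python sketch; the helper and its format convention are illustrations, not the thesis's Matlab code):

import numpy as np

def quantize(x, int_bits: int, frac_bits: int):
    """Round x to a signed fixed-point grid with the given wordlengths,
    saturating at the representable range."""
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** int_bits)
    hi = 2.0 ** int_bits - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

signal = np.random.randn(10000)
for frac_bits in (8, 12, 15):
    err = np.max(np.abs(signal - quantize(signal, 3, frac_bits)))
    print(f"3 integer / {frac_bits} fractional bits: max error {err:.6f}")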

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed-point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed-point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point numbers, and it allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard-coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack flexibility but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been used heavily since, if the registers are grouped together, it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18-bit multiplier and an adder. It also has a register so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource counts of the chosen part are summarized in Table 4.1.

Name of resource       Number of resource units
Slice                  37680
Block RAM (36 Kb)      416
DSP48E1                768
PCI-Express block      2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter is useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values of dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. A drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high, and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block, and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimal amount of control logic, at the cost of performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i of v and d, can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This can be achieved using a dual-port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories, together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built around an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDL^T unit

The input and output ports are described in Table 5.1.

Name       Dir   Type                           Comment
clk        in    std_logic                      Input clock
rst_n      in    std_logic                      Reset, active low
start      in    std_logic                      Start computation
addr_in    in    std_logic_vector(5 downto 0)   Input address
data_in    in    sfixed(5 downto -12)           Data input
we         in    std_logic                      Write enable
ready      out   std_logic                      Ready for input
done       out   std_logic                      Computation done
addr_out   in    std_logic_vector(5 downto 0)   Output address
L_data     out   sfixed(2 downto -15)           L matrix output
D_data     out   sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant one-bit of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.
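The combination of normalization, table lookup and one refinement step can be modeled compactly (a Python sketch, assuming Equation 3.5 is the single Newton-Raphson step 1/d ≈ x(2 − dx) = 2x − dx², which matches the Square, Mult, Sub and Shift blocks of Figure 5.5; the table size and wordlengths below are illustrative):

import math

FRAC = 12                       # fractional bits, illustrative
TABLE_BITS = 6                  # 64-entry initial-guess table

# table[k] holds an initial guess of 1/d for d = 0.5 + k * 2**-(TABLE_BITS+1)
table = [round((1.0 / (0.5 + k * 2 ** -(TABLE_BITS + 1))) * 2 ** FRAC)
         for k in range(2 ** TABLE_BITS)]

def reciprocal(d: float) -> float:
    """Approximate 1/d for d > 0: normalize to [0.5, 1), look up, refine."""
    n = 0
    while d < 0.5:              # shift left until bit -1 is set
        d, n = d * 2, n + 1
    while d >= 1.0:             # shift right for large inputs
        d, n = d / 2, n - 1
    k = int((d - 0.5) * 2 ** (TABLE_BITS + 1))   # bits right of bit -1
    x = table[k] / 2 ** FRAC
    x = 2 * x - d * x * x       # one Newton-Raphson refinement
    return x * 2 ** n           # undo the input scaling

print(reciprocal(3.0), 1 / 3.0)  # ~0.33333 vs 0.33333...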

Figure 5.5: Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 5.2. The unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name     Dir   Type                    Comment
clk      in    std_logic               Input clock
load     in    std_logic               Load new d
d        in    ufixed(5 downto -12)    d input
result   out   ufixed(5 downto -12)    1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b                    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is a multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation, among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name        Purpose
sel         Control input mux to MAC unit
clr         Clear accumulator register
L_x, L_y    X, Y coordinates in L matrix
X_x, X_y    X, Y coordinates in X matrix
W_x, W_y    X, Y coordinates in X matrix for write
we          Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name       Dir   Type                           Comment
clk        in    std_logic                      Input clock
rst_n      in    std_logic                      Reset, active low
start      in    std_logic                      Start computation
addr_in    in    std_logic_vector(5 downto 0)   Input address
data_in    in    sfixed(2 downto -15)           Data input
we         in    std_logic                      Write enable
done       out   std_logic                      Computation done
addr_out   in    std_logic_vector(5 downto 0)   Output address
data_out   out   sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^{−x})                    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected; otherwise log(a). This means that a simple multiplexer, with the sign bit of the difference as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{−x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{−x}) on the interval 0 ≤ x < 8

Since the expression is limited in value over a small interesting interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression approaches zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x so that it only contains log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
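Generating the table contents is a one-off offline step (a Python sketch; scaling the 16-bit entries with 15 fractional bits is an assumption about how the stored values are represented, since the entries never exceed log(2)):

import math

FRAC_OUT = 15          # fractional bits for the 16-bit table entries (assumed)
STEP = 2 ** -8         # table step size: 8 fractional bits of x
DEPTH = 2048           # one 36 Kb BRAM at 16-bit width, 11 address bits

# table[i] = log(1 + exp(-x)) for x = i * 2^-8, covering 0 <= x < 8
table = [round(math.log1p(math.exp(-i * STEP)) * 2 ** FRAC_OUT)
         for i in range(DEPTH)]

def jacobi_log_lut(log_a: float, log_b: float) -> float:
    """Table-based model of the Jacobi logarithm unit in Figure 5.9."""
    x = abs(log_a - log_b)
    idx = min(int(x * 2 ** 8), DEPTH - 1)        # saturate x to 0 <= x < 8
    return max(log_a, log_b) + table[idx] / 2 ** FRAC_OUT

print(jacobi_log_lut(math.log(0.3), math.log(0.2)), math.log(0.5))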

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name     Dir   Type                    Comment
clk      in    std_logic               Input clock
log_a    in    sfixed(5 downto -12)    log(a) input
log_b    in    sfixed(5 downto -12)    log(b) input
result   out   sfixed(5 downto -12)    Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double-precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns further right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource             Used    Total     Percentage
Flip-flops           3024    301440    1.0 %
LUTs                 1459    150720    1.0 %
Block RAM (36 Kb)    10      416       2.4 %
DSP48E1              8       768       1.0 %

Table 6.1: Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason the rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not as well optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource             Used    Total     Percentage
Flip-flops           831     301440    < 1 %
LUTs                 1802    150720    1.2 %
Block RAM (36 Kb)    9       416       2.2 %
DSP48E1              19      768       2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource             Used    Total     Percentage
Flip-flops           30      301440    < 1 %
LUTs                 124     150720    < 1 %
Block RAM (36 Kb)    2       416       < 1 %
DSP48E1              1       768       < 1 %

Table 6.3: Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource             Used    Total     Percentage
Flip-flops           180     301440    < 1 %
LUTs                 156     150720    < 1 %
Block RAM (36 Kb)    1       416       < 1 %
DSP48E1              0       768       0 %

Table 6.4: Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output of the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)                    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
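A brief model illustrates the idea (a Python sketch of hyperbolic CORDIC in rotation mode; note that the CORDIC gain cancels in the division sinh/cosh, and that iterations 4 and 13 are repeated as the hyperbolic variant requires; convergence holds for roughly |t| ≤ 1.1, so a practical unit would add range reduction):

import math

def tanh_cordic(t: float) -> float:
    """tanh via hyperbolic CORDIC: x, y converge to K*cosh(t), K*sinh(t).

    The common gain K cancels in the quotient y/x. Valid for |t| <= ~1.1.
    """
    x, y, z = 1.0, 0.0, t
    # Hyperbolic CORDIC must repeat iterations 4 and 13 to converge.
    for i in [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16]:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)
    return y / x

for t in (0.25, 0.5, 1.0):
    print(f"tanh({t}): cordic {tanh_cordic(t):.5f}, exact {math.tanh(t):.5f}")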

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2:

e^x = e^{x · ln(2) / ln(2)} = 2^{x · 1/ln(2)}                    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^{x · 1/ln(2)} = 2^{floor(x · 1/ln(2))} · 2^{x · 1/ln(2) − floor(x · 1/ln(2))}                    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^{floor(y)} · 2^{y − floor(y)}                    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
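The decomposition can be checked with a few lines (a Python sketch; the 256-entry table size is an illustrative choice):

import math

FRAC = 8                                    # index bits for y - floor(y), assumed
INV_LN2 = 1.0 / math.log(2.0)               # precalculated constant

# table[k] = 2^(k / 2^FRAC) for the fractional part in [0, 1)
table = [2.0 ** (k / 2 ** FRAC) for k in range(2 ** FRAC)]

def exp2_based(x: float) -> float:
    """e^x via Equation 6.4: a power of two (decoder/shifter) times a table value."""
    y = x * INV_LN2
    n = math.floor(y)                        # 2^n part: binary decoder
    k = int((y - n) * 2 ** FRAC)             # index the fractional table
    return math.ldexp(table[k], n)           # table[k] * 2^n

print(exp2_based(1.0), math.e)               # ~2.718 vs 2.71828...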

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

Not only matrix multiplication is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[ a  b ]^{-1}       1      [  d  -b ]
[ c  d ]       = ------- · [ -c   a ]                    (6.5)
                 ad - bc

iff ad − bc ≠ 0, as explained in [Strang, 2009].
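For completeness, the closed formula is trivial to express in code (a Python sketch):

def inv2x2(a: float, b: float, c: float, d: float):
    """Closed-form inverse of [[a, b], [c, d]] per Equation 6.5."""
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

print(inv2x2(4.0, 7.0, 2.0, 6.0))  # [[0.6, -0.7], [-0.2, 0.4]]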

6.3.4 Control Structure

So far, separate modules have been described that solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable cost in interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen throughout Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such tailored wordlengths is that they limit the reuse of components, since a multiplier cannot be shared between sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that provides the necessary accuracy while still offering enough dynamic range to be used for all of the modules.

65 Alternative Approaches and Comparison 41

The reason an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating-point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high-level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High-level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high-level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built from a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed-point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex-valued model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative implementation approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high an accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex-valued model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, making a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems-on-chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate into a complete product. It would be more cost effective and flexible to sell the detector for integration into larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 25: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

3.4 Log Sum of Exponentials

Operation     Log space
log(a · b)    log(a) + log(b)
log(a / b)    log(a) − log(b)
log(a^b)      b · log(a)

Table 3.1: Computations in log space.

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

$$\log(a + b) = \log\!\left(e^{\log(a)} + e^{\log(b)}\right) \qquad (3.8)$$

Note that a and b are not actually stored; instead their logarithmic counterparts log(a) and log(b) are.

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible since the sum might become very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

$$\log\!\left(e^{\log(a)} + e^{\log(b)}\right) = \log\!\left(e^{\max(\log(a),\,\log(b))}\left(1 + e^{-\lvert \log(a) - \log(b) \rvert}\right)\right) = \max(\log(a), \log(b)) + \log\!\left(1 + e^{-\lvert \log(a) - \log(b) \rvert}\right) \qquad (3.9)$$

and is often denoted the Jacobi logarithm.

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two values and adding the remaining logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value is log(2) ≈ 0.69, and it approaches 0 as the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computation.
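As a small numerical illustration (the values are chosen here for the example): with log(a) = −10 and log(b) = −12, the absolute difference is 2, so

$$\log(a + b) = \max(-10, -12) + \log\!\left(1 + e^{-2}\right) \approx -10 + 0.127 = -9.873,$$

which agrees with the direct computation log(e^{−10} + e^{−12}) ≈ −9.873, without the small probabilities a and b ever being formed explicitly.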

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab, with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, with the matrix operations implemented in separate functions. This allowed for an easier transformation of the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers in the different sections of the algorithm were, and therefore how many bits the numbers had to be represented by. Numeric scopes were used widely, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector, which simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers, and it allows the wordlengths, both integer and fractional, to be configured easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.
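As an illustration of this approach, the following is a minimal sketch, not taken from the actual implementation, of how fixed_pkg together with generics keeps wordlengths configurable; the entity name, generics and wordlengths are chosen here for the example.

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;            -- VHDL-2008 fixed point package [Bishop 2008]

entity mul_add is
  generic (
    IW : integer := 5;             -- integer bits (excluding the sign bit)
    FW : integer := 12             -- fractional bits
  );
  port (
    clk : in  std_logic;
    a   : in  sfixed(IW downto -FW);
    b   : in  sfixed(IW downto -FW);
    c   : in  sfixed(IW downto -FW);
    y   : out sfixed(IW downto -FW)
  );
end entity;

architecture rtl of mul_add is
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- The product grows to sfixed(2*IW+1 downto -2*FW); resize
      -- rounds and saturates back to the configured format.
      y <= resize(a * b + c, IW, -FW);
    end if;
  end process;
end architecture;

Changing the wordlengths then only requires changing the generics at instantiation, instead of reworking every operation.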

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. IP blocks might lack flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level (RTL). This means that the VHDL source code describes registers and the operations performed while data is transferred from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one sequential part that only stores the next state into the state registers.

Records have been used heavily, since grouping the registers together makes it easier to add registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.
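A minimal sketch of this coding style, with illustrative names and a hypothetical 64-cycle counter, is shown below; the actual FSMs in the implementation are larger but follow the same pattern.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fsm_example is
  port (
    clk, rst_n : in  std_logic;
    start      : in  std_logic;
    done       : out std_logic
  );
end entity;

architecture rtl of fsm_example is
  type state_t is (IDLE, BUSY, FINISHED);

  -- All registers grouped in one record; adding a register later only
  -- requires touching the record definition and the reset constant.
  type reg_t is record
    state : state_t;
    cnt   : unsigned(5 downto 0);
  end record;
  constant REG_RESET : reg_t := (state => IDLE, cnt => (others => '0'));

  signal r, rin : reg_t;
begin
  -- Combinational part: produces the next state and the outputs.
  comb : process(all)
    variable v : reg_t;
  begin
    v    := r;
    done <= '0';
    case r.state is
      when IDLE =>
        if start = '1' then
          v.cnt   := (others => '0');
          v.state := BUSY;
        end if;
      when BUSY =>
        v.cnt := r.cnt + 1;
        if r.cnt = 63 then
          v.state := FINISHED;
        end if;
      when FINISHED =>
        done    <= '1';
        v.state := IDLE;
    end case;
    rin <= v;
  end process;

  -- Sequential part: only stores the next state into the registers.
  seq : process(clk)
  begin
    if rising_edge(clk) then
      if rst_n = '0' then
        r <= REG_RESET;
      else
        r <= rin;
      end if;
    end if;
  end process;
end architecture;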

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc 2012]. The development board used is delivered by HiTech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and to be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAM blocks, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register so that it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The resources of interest in the chosen part are summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices are of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one dimensional logic signal, an array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0), and a signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits with position 0 implicitly at the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs simultaneously would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc 2011a].

For each of the data inputs there are three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances of the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

The example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To access both H and H^T simultaneously as input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter counting from 0 to 63 can generate the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. The lookup table addresses the matrix in column order instead of row order, mapping the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63, as sketched below.
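One possible way to realize such a lookup table is to compute it at elaboration time; the package and function names below are chosen for this sketch and are not taken from the implementation.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package addr_lut_pkg is
  type lut_t is array (0 to 63) of unsigned(5 downto 0);
  function init_transpose_lut return lut_t;
end package;

package body addr_lut_pkg is
  -- Maps the row-order sequence 0, 1, 2, ..., 63 onto the column-order
  -- sequence 0, 8, 16, ..., 55, 63 for an 8x8 matrix.
  function init_transpose_lut return lut_t is
    variable lut : lut_t;
  begin
    for i in 0 to 63 loop
      lut(i) := to_unsigned((i mod 8) * 8 + i / 8, 6);
    end loop;
    return lut;
  end function;
end package body;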

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.

5.3 Matrix Inversion

To perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit, described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, at the cost of performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element, in position i, of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row of L as before, followed by a summation that can be performed with an adder tree. This result is subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations of each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.
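Algorithm 3.2 is not repeated here, but for reference, the standard LDL^T recurrences for Q = LDL^T (see for instance [Golub and Van Loan 1996]) that match this description are, for each iteration i,

$$v_j = L_{i,j}\, d_j \;\; (j < i), \qquad d_i = Q_{i,i} - \sum_{j<i} L_{i,j}\, v_j, \qquad L_{k,i} = \frac{1}{d_i}\left(Q_{k,i} - \sum_{j<i} L_{k,j}\, v_j\right) \;\; (k > i),$$

where the pair-wise products, the adder-tree summation and the multiplication by the reciprocal 1/d_i can all be identified.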

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.

The data path in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to read a complete row of the matrix L while simultaneously being able to write an individual element. This can be achieved with a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from a number of smaller memories together with some logic that performs the address decoding; in this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside in position −1, next to the decimal point. If the position of that bit is known, the number can be scaled by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.
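A numerical illustration (the values are chosen here): for d = 0.15 the most significant set bit is in position −3, so the input is shifted left N = 2 steps, giving d′ = 0.6, which satisfies 0.5 ≤ d′ < 1. The unit then approximates 1/d′ ≈ 1.667, and shifting this result left N = 2 steps gives

$$\frac{1}{d} = 2^N \cdot \frac{1}{d'} \approx 4 \cdot 1.667 \approx 6.67 = \frac{1}{0.15}.$$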

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that index 0 must hold the initial guess for input = 0.5, while the last index must hold the guess for input ≈ 1. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are omitted from Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and to avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

$$c = c \pm a \times b \qquad (5.1)$$

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.
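A minimal VHDL sketch of this structure is shown below, written for illustration only with the wordlengths used by the forward substitution unit; as described next, the real structure is absorbed into a DSP48E1 block rather than built from explicit operators.

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;

entity mac is
  port (
    clk   : in  std_logic;
    clear : in  std_logic;                 -- selects 0 instead of the accumulator
    a, b  : in  sfixed(2 downto -15);
    c_out : out sfixed(2 downto -15)
  );
end entity;

architecture rtl of mac is
  signal acc : sfixed(2 downto -15) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if clear = '1' then
        acc <= (others => '0');
      else
        -- c = c - a*b, the variant used by the forward substitution unit
        acc <= resize(acc - a * b, 2, -15);
      end if;
    end if;
  end process;
  c_out <= acc;
end architecture;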

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.
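Assuming, as above, that Algorithm 3.4 solves LX = B column by column with L unit lower triangular, each element is obtained purely by such accumulations,

$$x_{i,j} = B_{i,j} - \sum_{k=j}^{i-1} L_{i,k}\, x_{k,j}, \qquad i > j,$$

initializing the accumulator with B_{i,j} and then repeatedly applying c = c − a × b.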

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully into a single DSP48E1 block [Xilinx Inc 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X/Y coordinate in L matrix
X_x, X_y  X/Y coordinate in X matrix
W_x, W_y  X/Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength; underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

$$\text{result} = \max(\log(a), \log(b)) + \log\!\left(1 + e^{-x}\right) \qquad (5.2)$$

Since log(a) − log(b) must be calculated anyway, this can be exploited in the max selection: if the result of the subtraction is negative, log(b) is the larger term and shall be selected, otherwise log(a). A simple multiplexer with the sign bit of the difference as control signal can therefore select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^{−x}). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^{−x}) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously; a control signal such as start is therefore unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation: both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. The accuracy is presented in the following sections.

The errors presented were acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits (2^−12 ≈ 0.00024), the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but this is of limited value if the module where the results are used utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns; the accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The resources of interest are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource           Used   Total    Percentage
Flip-flops         3024   301440   1.0 %
LUTs               1459   150720   1.0 %
Block RAM (36 Kb)  10     416      2.4 %
DSP48E1            8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason rounding is so expensive is that the rounding and saturation must examine all of the bits to determine a suitable rounding, and this is not as well optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used   Total    Percentage
Flip-flops         831    301440   < 1 %
LUTs               1802   150720   1.2 %
Block RAM (36 Kb)  9      416      2.2 %
DSP48E1            19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and from the multiplication with the reciprocal value. This rounding could favorably be pipelined to allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid the excessive bit growth that currently has to be accommodated and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource           Used  Total    Percentage
Flip-flops         30    301440   < 1 %
LUTs               124   150720   < 1 %
Block RAM (36 Kb)  2     416      < 1 %
DSP48E1            1     768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total    Percentage
Flip-flops         180   301440   < 1 %
LUTs               156   150720   < 1 %
Block RAM (36 Kb)  1     416      < 1 %
DSP48E1            0     768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output of the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \qquad (6.1)$$

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
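For reference, the hyperbolic CORDIC iterations in rotation mode, as given in [Muller 1997], are

$$x_{k+1} = x_k + \sigma_k 2^{-k} y_k, \qquad y_{k+1} = y_k + \sigma_k 2^{-k} x_k, \qquad z_{k+1} = z_k - \sigma_k \tanh^{-1}(2^{-k}),$$

with σ_k = sign(z_k). Starting from x_0 = 1/K_h (the hyperbolic scaling factor), y_0 = 0 and z_0 = x, the iterations converge so that x_n → cosh(x) and y_n → sinh(x); the hyperbolic variant starts at k = 1 and requires certain iterations (k = 4, 13, 40, ...) to be repeated to guarantee convergence.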

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculation from e to 2:

$$e^x = e^{x \cdot \frac{\ln(2)}{\ln(2)}} = 2^{x \cdot \frac{1}{\ln(2)}} \qquad (6.2)$$

where 1/ln(2) can be precalculated. This rewrite can be refined further:

$$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\left\lfloor x \cdot \frac{1}{\ln(2)} \right\rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \left\lfloor x \cdot \frac{1}{\ln(2)} \right\rfloor} \qquad (6.3)$$

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

$$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor} \qquad (6.4)$$

where 2^{⌊y⌋} can be implemented with a simple binary decoder, while 2^{y − ⌊y⌋} can be precomputed and stored in a lookup table with y − ⌊y⌋ ranging from 0 to 1.
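A numerical illustration (the values are chosen here): for x = 2.5,

$$y = \frac{2.5}{\ln(2)} \approx 3.607, \qquad 2^y = 2^{3} \cdot 2^{0.607} \approx 8 \cdot 1.523 \approx 12.18 \approx e^{2.5}.$$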

If this approach does not provide enough accuracy, Tang's method, described in [Muller 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

Not only matrix multiplication is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but may have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed-form expressions for the matrix inverse. The inverse of a matrix of dimension 2 can be described by

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \qquad (6.5)$$

iff ad − bc ≠ 0, as explained in [Strang 2009].
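For example, with a = 2 and b = c = d = 1 the determinant is ad − bc = 1, and Equation 6.5 gives

$$\begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}.$$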

6.3.4 Control Structure

So far, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized, and it is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since so far there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The overhead of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor is more suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm; this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier cannot be shared between sections that mandate different wordlengths.

Because of these limitations on resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE 2008], but instead a custom format that provides the necessary accuracy while still offering enough dynamic range to be used for all of the modules.

The reason a floating point approach might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software into hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it is very beneficial. More about high level synthesis can be found in [Coussy and Morawiec 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be carried out.

One of the first implementations investigated was the one described in [Studer et al 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al 2011] uses coarse grained parallelism, with the detection divided among eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister 2012] employs a detection algorithm based on a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable: changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This means that the same algorithm is performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability in software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, [Eilert et al 2008] develops a complete processor architecture capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of such an approach is that the same hardware can be used for other calculations; as described in [Eilert et al 2008], a minor addition of hardware makes it possible to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion of how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design with fixed point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm differ between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with this representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach is more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
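This operation count follows from the standard identity

$$(a + bi)(c + di) = (ac - bd) + (ad + bc)i,$$

which requires four real multiplications and two real additions/subtractions, all of which are independent and can be computed in parallel in hardware.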

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister 2012] and [Eilert et al 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is also possible to have multiple small processors working in parallel, which is favorable since the workload of detection increases when the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology keep shrinking, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3; in particular, it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is needed to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 26: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high-level matrix constructs and operations. The operations were then rewritten using lower-level abstractions, with the matrix operations implemented in separate functions. This made it easier to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. When working with fixed-point numbers in VHDL, it is common to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to treat the decimal point as implicit. This approach is suitable only for very simple designs, and it is not easy to extend or rework since the interpretation of the data type is not explicitly specified.

In this thesis a fixed-point package included in the VHDL-2008 standard [IEEE 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.
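As a minimal sketch (not taken from the thesis code), a multiplication with explicit rounding and saturation using fixed_pkg might look as follows; the entity name and wordlengths are chosen here only for illustration:

library ieee;
use ieee.fixed_pkg.all;

entity sfixed_mult is
  port (
    a : in  sfixed(5 downto -12);
    b : in  sfixed(5 downto -12);
    p : out sfixed(5 downto -12)
  );
end entity;

architecture rtl of sfixed_mult is
begin
  -- The full-precision product of two sfixed(5 downto -12) operands is
  -- sfixed(11 downto -24); resize rounds and saturates it back to the
  -- operand format using the package defaults.
  p <= resize(a * b, p'high, p'low);
end architecture;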

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part will only store the next state into the state registers.

Records have been heavily used since, if the registers are grouped together, it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C. An example of this coding style is sketched below.
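A minimal sketch of the described two-process style, with the registers grouped in a record, might look as follows (all names are illustrative, not taken from the actual implementation):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fsm_example is
  port (
    clk   : in std_logic;
    start : in std_logic
  );
end entity;

architecture rtl of fsm_example is
  type state_t is (idle, busy);
  type regs_t is record
    state : state_t;
    count : unsigned(5 downto 0);
  end record;
  signal r, r_next : regs_t := (state => idle, count => (others => '0'));
begin
  -- Combinational part: produces the next register values
  comb : process(all)
    variable v : regs_t;
  begin
    v := r;
    case r.state is
      when idle =>
        if start = '1' then
          v.state := busy;
          v.count := (others => '0');
        end if;
      when busy =>
        v.count := r.count + 1;
        if r.count = 63 then
          v.state := idle;
        end if;
    end case;
    r_next <= v;
  end process;

  -- Sequential part: only stores the next state into the registers
  seq : process(clk)
  begin
    if rising_edge(clk) then
      r <= r_next;
    end if;
  end process;
end architecture;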

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. This PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18-bit multiplier and an adder. It also has a register so it can accumulate the calculated result. This block can perform numerous operations and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource     Number of resource units
Slice                37680
Block RAM (36 Kb)    416
DSP48E1              768
PCI-Express block    2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4x4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8x8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), where it denotes A + 1 integer bits and B fractional bits, with position 0 implicitly being the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8x8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 x 64 x 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.
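At the entity level, this handshake could be exposed as sketched below; the tvalid/tlast/tdata names follow the AXI4-Stream conventions, while the entity itself is illustrative rather than taken from the implementation:

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative entity only: one AXI4-Stream input channel carrying
-- 18-bit matrix elements, using the valid/last/data signals described
-- above.
entity axis_sink_example is
  port (
    aclk          : in  std_logic;
    s_axis_tvalid : in  std_logic;  -- high while an element is presented
    s_axis_tlast  : in  std_logic;  -- high with the last element of a matrix
    s_axis_tdata  : in  std_logic_vector(17 downto 0)
  );
end entity;

architecture rtl of axis_sink_example is
begin
  -- Consumer logic (e.g. writing elements into a block RAM) would go here.
end architecture;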

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, so this case was chosen.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual-port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously, to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read-out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
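The same mapping can also be computed arithmetically instead of being stored in a lookup table; for an 8x8 matrix it amounts to swapping the two 3-bit fields of the address. A sketch, with illustrative names:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity transpose_addr is
  port (
    addr  : in  unsigned(5 downto 0);  -- row-order address 0..63
    taddr : out unsigned(5 downto 0)   -- column-order address
  );
end entity;

architecture rtl of transpose_addr is
begin
  -- Swapping the row and column bit fields transposes an 8x8 matrix:
  -- taddr = (addr mod 8)*8 + (addr/8), mapping 0, 1, 2, ... onto 0, 8, 16, ...
  taddr <= addr(2 downto 0) & addr(5 downto 3);
end architecture;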

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit from Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation and there is an additional multiplication with the reciprocal of the i:th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.
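Algorithm 3.2 is not restated in this chapter, but assuming it follows the standard column-wise LDL^T recurrence for Q = L D L^T (cf. [Golub and Van Loan 1996]), the computations described above are, for each i = 1, ..., n:

v_j = d_j * L(i,j),  for j = 1, ..., i-1
d_i = Q(i,i) - sum_{j<i} L(i,j) * v_j
L(k,i) = (Q(k,i) - sum_{j<i} L(k,j) * v_j) / d_i,  for k = i+1, ..., n

The pair-wise products form v, the adder tree evaluates the sums, and the division is what the reciprocal unit replaces with a multiplication by 1/d_i.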

Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual-port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual-port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDLT unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^(-1) matrix output

Table 5.1: Input and output ports of the LDLT decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 <= d < 1, it follows that 1 < 1/d <= 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position -1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.
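Equation 3.5 is not restated here; assuming it is the Newton-Raphson recurrence for the reciprocal, which matches the squaring, subtraction and shift stages in Figure 5.5, one iteration refining a table guess x of 1/d is

x' = x * (2 - d * x) = 2*x - d*x^2

which roughly doubles the number of correct bits per iteration, so a good initial guess from the table keeps the number of iterations small.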

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input close to 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 <= d < 1 to 0 <= d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that the multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; they are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load and is used as a component in the LDLT decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.
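A behavioral sketch of this structure, which synthesis tools can typically map onto a single DSP48E1, might look as follows (the c = c - a × b variant discussed below is shown; names are illustrative):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mac_unit is
  port (
    clk   : in  std_logic;
    clear : in  std_logic;            -- clears the accumulator register
    a, b  : in  signed(17 downto 0);
    c     : out signed(47 downto 0)   -- accumulated result
  );
end entity;

architecture rtl of mac_unit is
  signal acc : signed(47 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if clear = '1' then
        acc <= (others => '0');
      else
        -- c = c - a*b; the 36-bit product is sign-extended to the
        -- accumulator width before the subtraction
        acc <= acc - resize(a * b, acc'length);
      end if;
    end if;
  end process;
  c <= acc;
end architecture;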

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit, performing c = c - a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware amount and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) - log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^(-x))    (5.2)

Since log(a) - log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected; otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(-x)). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^(-x)) on the interval 0 <= x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 <= x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 <= x < 8, it is possible to saturate x so that it only contains log2(8) = 3 integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 <= x < 8 in steps of 2^(-8).
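A sketch of the selection and index-saturation logic described above might look as follows; the pipeline registers and the block RAM holding the table are omitted, and all names are illustrative:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.fixed_pkg.all;

entity jacobi_core is
  port (
    log_a : in  sfixed(5 downto -12);
    log_b : in  sfixed(5 downto -12);
    big   : out sfixed(5 downto -12);  -- max(log(a), log(b))
    index : out unsigned(10 downto 0)  -- address into the log(1+e^-x) table
  );
end entity;

architecture rtl of jacobi_core is
  signal diff : sfixed(6 downto -12);
  signal x    : sfixed(7 downto -12);
begin
  diff <= log_a - log_b;
  -- The sign bit of the difference selects the larger operand
  big  <= log_b when diff(diff'high) = '1' else log_a;
  x    <= abs(diff);

  -- Saturate to the table range 0 <= x < 8 and keep 3 integer plus
  -- 8 fractional bits as the 11-bit table address
  process(all)
  begin
    if x(7 downto 3) /= "00000" then
      index <= (others => '1');
    else
      index <= unsigned(to_slv(x(2 downto -8)));
    end if;
  end process;
end architecture;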

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double-precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but this is of limited value if the module where the results are used utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^(-8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.

Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the value with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x·ln(2)/ln(2)) = 2^(x·1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x·1/ln(2)) = 2^floor(x·1/ln(2)) · 2^(x·1/ln(2) - floor(x·1/ln(2)))    (6.3)

If y = x·1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y - floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y - floor(y)) can be precomputed and stored in a lookup table, with y - floor(y) ranging from 0 to 1.
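As a numeric sanity check of this decomposition (values rounded to four decimals): for x = 1, y = 1/ln(2) ≈ 1.4427, so floor(y) = 1 and 2^(y - floor(y)) = 2^0.4427 ≈ 1.3592, giving e^1 ≈ 2 · 1.3592 = 2.7184, close to e ≈ 2.7183.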

If this approach does not provide enough accuracy, Tang's method, described in [Muller 1997], can be investigated further instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(-1) = 1/(ad - bc) · [d -b; -c a]    (6.5)

iff ad - bc ≠ 0, as explained in [Strang 2009].
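As a quick check of Equation 6.5 with concrete numbers: [1 2; 3 4]^(-1) = 1/(1·4 - 2·3) · [4 -2; -3 1] = [-2 1; 1.5 -0.5], and multiplying back by [1 2; 3 4] indeed gives the identity matrix.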

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure, including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations could allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such tailoring is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.

The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating-point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high-level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High-level synthesis is not a quick solution for automatically transforming software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, yet necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high-level synthesis can be found in [Coussy and Morawiec 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al 2011] uses coarse-grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed-point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that solves the detection problem, in [Eilert et al 2008] a complete processor architecture has been developed that is capable of performing detection among other problems. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with this representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling, as used in [Kim et al 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
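For reference, the operation count follows directly from the identity

(a + bi)(c + di) = (ac - bd) + (ad + bc)i

that is, the four real multiplications ac, bd, ad, bc and two real additions/subtractions mentioned above; in hardware the four multiplications are mutually independent and can therefore run in parallel.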

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister 2012] and [Eilert et al 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chips are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate it in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chips.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation in hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 27: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the


fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed-point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.
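A minimal sketch of this configurability, assuming a VHDL-2008 tool flow (the entity and generic names here are illustrative, not taken from the thesis code):

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;

entity scaled_mult is
  generic (
    IW : integer := 2;    -- integer bits (sign excluded)
    FW : integer := 15    -- fractional bits
  );
  port (
    a, b : in  sfixed(IW downto -FW);
    p    : out sfixed(IW downto -FW)
  );
end entity;

architecture rtl of scaled_mult is
begin
  -- The full-precision product is sfixed(2*IW+1 downto -2*FW);
  -- resize rounds and saturates it back to the configured format,
  -- so changing IW or FW in one place reconfigures the whole unit.
  p <= resize(a * b, IW, -FW);
end architecture;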

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility, but they reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinational part that produces the next state and the appropriate outputs, and one part that is sequential. The sequential part only stores the next state into the state registers.

Records have been used heavily since, if the registers are grouped together, it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.
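A minimal sketch of this coding style (state, signal and entity names are illustrative; reset is omitted for brevity):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fsm_example is
  port (clk, start : in std_logic; done : out std_logic);
end entity;

architecture rtl of fsm_example is
  type state_t is (s_idle, s_busy, s_done);
  -- Grouping the registers in a record makes it easy to add more
  -- without rewriting the sequential process.
  type regs_t is record
    state : state_t;
    count : unsigned(5 downto 0);
  end record;
  signal r, r_next : regs_t;
begin
  -- Combinational part: produces the next state and the outputs.
  comb : process(r, start)
  begin
    r_next <= r;                           -- default assignment
    done   <= '0';
    case r.state is
      when s_idle =>
        r_next.count <= (others => '0');
        if start = '1' then r_next.state <= s_busy; end if;
      when s_busy =>
        r_next.count <= r.count + 1;
        if r.count = 63 then r_next.state <= s_done; end if;
      when s_done =>
        done <= '1';
        r_next.state <= s_idle;
    end case;
  end process;

  -- Sequential part: only stores the next state into the registers.
  seq : process(clk)
  begin
    if rising_edge(clk) then
      r <= r_next;
    end if;
  end process;
end architecture;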

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc., 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAM blocks, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25x18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and the behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource       Number of resource units
Slice                  37680
Block RAM (36 Kb)      416
DSP48E1                768
PCI-Express block      2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations this entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4x4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8x8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type std_logic is a one-dimensional logic signal. An array of logic values with the dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.
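For example, a declaration inside an architecture (using the fixed_pkg from Chapter 4.2; the signal name and value are illustrative):

-- sfixed(2 downto -15): 3 integer bits (sign included) and 15
-- fractional bits; to_sfixed converts a real literal to this format,
-- so 1.5 is stored as 001.100000000000000 in binary.
signal x : sfixed(2 downto -15) := to_sfixed(1.5, 2, -15);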


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade performance against hardware usage by selecting an unroll factor. The unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8x8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 x 64 x 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table addresses the matrix in column order instead of row order; thus the lookup table maps the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63.
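Because the matrix dimension is a power of two, this lookup table degenerates to a fixed permutation of the address bits: with i = 8*row + col, swapping the two 3-bit halves of the address yields 8*col + row. A sketch (package and function names are illustrative):

library ieee;
use ieee.numeric_std.all;

package transpose_pkg is
  function transpose_addr(i : unsigned(5 downto 0)) return unsigned;
end package;

package body transpose_pkg is
  -- For an 8x8 matrix stored row-wise, index i = row*8 + col.
  -- Swapping the 3-bit row and column fields gives col*8 + row,
  -- i.e. the lookup table 0, 1, 2, ..., 63 -> 0, 8, 16, ..., 63.
  function transpose_addr(i : unsigned(5 downto 0)) return unsigned is
  begin
    return i(2 downto 0) & i(5 downto 3);
  end function;
end package body;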

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDLT decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the matrix multiplication unit previously described in Chapter 5.2.

5.3.1 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimal amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i - 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.
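For reference, the column-wise recurrence being implemented is the standard LDL^T factorization (cf. [Golub and Van Loan, 1996]); Algorithm 3.2 is not reproduced on this page, so the indexing below follows the textbook convention. The factor 1/d_j is exactly the value supplied by the reciprocal unit once per column:

% Standard LDL^T recurrence for Q = L D L^T, column j:
\begin{aligned}
v_i        &= \ell_{j,i}\,d_i, && i = 1,\dots,j-1,\\
d_j        &= q_{j,j} - \sum_{i=1}^{j-1} \ell_{j,i}\,v_i,\\
\ell_{k,j} &= \frac{1}{d_j}\Bigl(q_{k,j} - \sum_{i=1}^{j-1} \ell_{k,i}\,v_i\Bigr), && k = j+1,\dots,n.
\end{aligned}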

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.


Figure 5.3: Computation unit used in the LDLT unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLT unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDLT unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 <= d < 1, it follows that 1 < 1/d <= 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position -1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.
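In other words, with d' = 2^N * d scaled so that 0.5 <= d' < 1, the unit computes 1/d = 2^N * (1/d'). Equation 3.5 is not reproduced on this page; assuming it is the usual Newton-Raphson refinement (which the Square block in Figure 5.5 suggests, cf. [Chen et al., 2005]), one step starting from the table value x_0 ~ 1/d' is:

% One Newton-Raphson step for the reciprocal of the scaled input d':
x_1 = x_0\,(2 - d'\,x_0) = 2x_0 - d'\,x_0^2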

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ~ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 <= d < 1 to 0 <= d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a · b,    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing c = c - a · b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].
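A minimal sketch of such a unit (port widths follow Table 5.4; whether a given synthesis tool maps this exact coding style onto a single DSP48E1 should be verified):

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;

entity mac is
  port (
    clk, clr : in  std_logic;
    a, b     : in  sfixed(2 downto -15);
    c        : out sfixed(2 downto -15)
  );
end entity;

architecture rtl of mac is
  signal acc : sfixed(2 downto -15) := (others => '0');
begin
  -- Implements c = c - a*b, matching the use in the forward
  -- substitution unit; multiplier, subtracter and accumulator
  -- register form the structure of Figure 5.6.
  process(clk)
  begin
    if rising_edge(clk) then
      if clr = '1' then
        acc <= (others => '0');          -- clear accumulator
      else
        acc <= resize(acc - a * b, acc); -- accumulate
      end if;
    end if;
  end process;
  c <= acc;
end architecture;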

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) - log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^-x).    (5.2)

Since log(a) - log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^-x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^-x) on the interval 0 <= x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, so it is only necessary to precompute a table for the interval 0 <= x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 <= x < 8, it is possible to saturate x so it only contains log2(8) = 3 integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging from 0 to 8 in steps of 2^-8.
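A sketch of how such a table can be precomputed at elaboration time (the 15-fractional-bit scaling of the stored entries is an assumption; the thesis fixes only the 16-bit width, and the package and function names are illustrative):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.math_real.all;

package jacobi_rom_pkg is
  type rom_t is array (0 to 2047) of std_logic_vector(15 downto 0);
  function init_jacobi_rom return rom_t;
end package;

package body jacobi_rom_pkg is
  -- Fills the table with log(1 + exp(-x)) for x = k * 2^-8,
  -- k = 0 .. 2047, i.e. the interval 0 <= x < 8.
  function init_jacobi_rom return rom_t is
    variable rom : rom_t;
    variable x, y : real;
  begin
    for k in rom'range loop
      x := real(k) / 256.0;             -- step size 2^-8
      y := log(1.0 + exp(-x));          -- natural log from math_real
      -- Quantize assuming 15 fractional bits; the maximum value
      -- log(2) ~ 0.693 comfortably fits in 16 bits.
      rom(k) := std_logic_vector(to_unsigned(integer(y * 2.0**15), 16));
    end loop;
    return rom;
  end function;
end package body;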

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the result of the implementation, both the accuracy of the computations and the resource usage. The result is discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the result was obtained. The results of these computations were then imported into Matlab and compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^-8.

6.2 Resource Usage

The following sections describe the resource usage of each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage of the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x),    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · (1/ln 2) · ln 2) = 2^(x · (1/ln 2)),    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln 2)) = 2^floor(x · (1/ln 2)) · 2^(x · (1/ln 2) − floor(x · (1/ln 2))).    (6.3)

If y = x · (1/ln 2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y)),    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table with y − floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(-1) = 1/(ad − bc) · [d −b; −c a],    (6.5)

if ad − bc ≠ 0, as explained in [Strang, 2009].

6.3.4 Control Structure

So far, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this approach is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesized given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution for automatically transforming software into hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high an accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, making a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
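A sketch of such a pipelined complex multiplier (the entity name and wordlengths are illustrative; with two pipeline stages, a new product can start every clock cycle after an initial latency of two):

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;

entity cmul is
  port (
    clk            : in  std_logic;
    ar, ai, br, bi : in  sfixed(2 downto -15);
    pr, pi         : out sfixed(6 downto -30)
  );
end entity;

architecture rtl of cmul is
  signal m1, m2, m3, m4 : sfixed(5 downto -30);
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: the four real products, computed in parallel.
      m1 <= ar * br;  m2 <= ai * bi;
      m3 <= ar * bi;  m4 <= ai * br;
      -- Stage 2: combine into (ar + j*ai)(br + j*bi).
      pr <= m1 - m2;  -- real part
      pi <= m3 + m4;  -- imaginary part
    end if;
  end process;
end architecture;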

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture, as described in Chapter 6.6.2, were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer devices, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibiting when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is needed to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson



has a register so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The relevant resource counts of the chosen part are summarized in Table 4.1.

Name of resource      Number of resource units
Slice                 37680
Block RAM (36 Kb)     416
DSP48E1               768
PCI-Express block     2

Table 4.1: An overview of the relevant resources available in the XC6VLX240T.

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4×4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8×8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter: a signal of type std_logic is a one dimensional logic signal. An array of logic values with dimension N is denoted std_logic_vector(N-1 downto 0). A signed fractional number is described by sfixed(A downto -B), which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.
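As an illustration, the declarations below use these types in a small, hypothetical VHDL fragment; the entity and port names are invented here, and sfixed comes from the IEEE fixed-point package [Bishop, 2008]:

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;  -- sfixed in VHDL-2008 (Bishop's package for older tools)

entity notation_example is
  port (
    clk  : in std_logic;                      -- a one dimensional logic signal
    data : in std_logic_vector(17 downto 0);  -- an array of 18 logic values
    x    : in sfixed(2 downto -15)            -- 3 integer bits, 15 fractional bits
  );
end entity notation_example;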


5.2 Matrix Multiplication

To perform matrix multiplication, an IP block from Xilinx was used [Xilinx Inc., 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits, and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance against hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used.

The highest unroll factor for an 8×8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, a bus of width 18 × 64 × 2 = 2304 bits would be necessary to provide the two matrix inputs. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but has been adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.
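A minimal sketch of logic driving this handshake for one 8×8 matrix (64 elements) follows; the entity, signal names and the element source are illustrative assumptions, not the actual port list of the IP block:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity stream_matrix is
  port (
    clk           : in  std_logic;
    sending       : in  std_logic;                      -- high while streaming
    element       : in  std_logic_vector(17 downto 0);  -- current matrix element
    s_axis_tdata  : out std_logic_vector(17 downto 0);
    s_axis_tvalid : out std_logic;
    s_axis_tlast  : out std_logic
  );
end entity stream_matrix;

architecture rtl of stream_matrix is
  signal idx : unsigned(5 downto 0) := (others => '0');  -- element counter 0..63
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if sending = '1' then
        s_axis_tdata  <= element;    -- a new element every clock cycle
        s_axis_tvalid <= '1';        -- data is valid while streaming
        if idx = 63 then
          s_axis_tlast <= '1';       -- flag the last element of the matrix
        else
          s_axis_tlast <= '0';
        end if;
        idx <= idx + 1;              -- wraps to 0 after the last element
      else
        s_axis_tvalid <= '0';
        s_axis_tlast  <= '0';
      end if;
    end if;
  end process;
end architecture rtl;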

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as the example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63; a sketch of such a table is shown below.
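The address table could, for instance, be expressed as a VHDL constant as sketched here; the names are hypothetical, and the table could equally well be stored in a small ROM:

library ieee;
use ieee.numeric_std.all;

-- Maps a row-order read counter i to the column-order address
-- (i mod 8)*8 + i/8, so the second read port delivers H transposed.
type addr_lut_t is array (0 to 63) of unsigned(5 downto 0);

function init_transpose_lut return addr_lut_t is
  variable l : addr_lut_t;
begin
  for i in 0 to 63 loop
    l(i) := to_unsigned((i mod 8) * 8 + i / 8, 6);
  end loop;
  return l;
end function init_transpose_lut;

constant TRANSPOSE_LUT : addr_lut_t := init_transpose_lut;
-- usage: addr_b <= TRANSPOSE_LUT(to_integer(counter));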

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation: an input BRAM (read ports a and b), the address LUT, a control FSM, the matrix multiplication IP block, and an output BRAM.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the matrix multiplication unit previously described in Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic by performing redundant computations, in order to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated, as formalized below.
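Algorithm 3.2 itself lies outside this excerpt, but the computations described here correspond to the standard LDL^T recurrences for a decomposition Q = LDL^T (see, e.g., [Golub and Van Loan, 1996]):

$$v_j = L_{i,j}\, d_j, \qquad j = 1, \dots, i-1,$$

$$d_i = Q_{i,i} - \sum_{j=1}^{i-1} L_{i,j}\, v_j,$$

$$L_{k,i} = \frac{1}{d_i}\left( Q_{k,i} - \sum_{j=1}^{i-1} L_{k,j}\, v_j \right), \qquad k = i+1, \dots, n.$$

The pair-wise products L_{i,j} d_j feed the adder tree, and the division by d_i is realized as a multiplication with the reciprocal from Chapter 5.3.2.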

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i-th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit with 8 parallel multipliers and an adder tree; the multipliers are fed from the input BRAM, the L BRAMs and the V/D registers, and results are stored back in the L BRAM and the V/D registers, with the reciprocal unit in the data path.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.

Figure 5.4: Block diagram of the LDL^T unit: an input BRAM for Q, the computation unit with V/D registers, a control FSM, and an output BRAM for L, with separate outputs for L and D.

The input and output ports are described in Table 5.1.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(5 downto -12)           Data input
we        in   std_logic                      Write enable
ready     out  std_logic                      Ready for input
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
L_data    out  sfixed(2 downto -15)           L matrix output
D_data    out  sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one in the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.
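Equation 3.5 lies outside this excerpt; assuming it is the standard Newton–Raphson reciprocal iteration (cf. [Chen et al., 2005]), each step computes

$$x_{n+1} = x_n (2 - d\, x_n) = 2 x_n - d\, x_n^2,$$

where the initial guess x_0 is supplied by the lookup table; the square, multiply, subtract and shift blocks of Figure 5.5 match the right-hand form.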

Figure 5.5: Block diagram of the reciprocal unit: a find-MSB-index stage, input and output shifters, the lookup table, and square, multiply and subtract blocks computing 1/d from d.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

$$c = c \pm a \times b \qquad (5.1)$$

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module: a multiplier for the inputs a and b, a mux selecting between the accumulator value and 0 (clear), and an adder/subtractor feeding the accumulator register that drives the c output.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].
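A minimal sketch of such a clearable multiply-and-accumulate unit follows; the entity name and the widths are illustrative, picked so that the multiplier and the 48-bit accumulator match what a single DSP48E1 offers:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mac_unit is
  port (
    clk  : in  std_logic;
    clr  : in  std_logic;                -- clear the accumulator register
    a, b : in  signed(17 downto 0);
    c    : out signed(47 downto 0)
  );
end entity mac_unit;

architecture rtl of mac_unit is
  signal acc : signed(47 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if clr = '1' then
        acc <= (others => '0');                  -- start a new accumulation
      else
        acc <= acc - resize(a * b, acc'length);  -- c = c - a * b
      end if;
    end if;
  end process;
  c <= acc;
end architecture rtl;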

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name        Purpose
sel         Control input mux to MAC unit
clr         Clear accumulator register
L_x, L_y    X, Y coordinate in L matrix
X_x, X_y    X, Y coordinate in X matrix
W_x, W_y    X, Y coordinate in X matrix for write
we          Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address into the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit: the input BRAM (L) and output BRAM (X) feed the MAC unit through an input mux (constant 1 or data), and a control memory addressed by a control counter supplies the sel, clr, address and write signals.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

$$\text{result} = \max(\log(a), \log(b)) + \log(1 + e^{-x}) \qquad (5.2)$$

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the largest value.
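A sketch of this selection follows, with illustrative names, pipeline registers omitted, and inputs matching the sfixed(5 downto -12) ports of the unit:

-- declarations (inside an architecture):
signal diff   : sfixed(6 downto -12);  -- log_a - log_b grows one integer bit
signal maxval : sfixed(5 downto -12);
-- concurrent statements:
diff   <= log_a - log_b;
maxval <= log_a when diff(diff'high) = '0' else log_b;  -- sign bit as mux control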

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.

Since the expression takes significant values only on a small interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
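As an illustration with hypothetical names, the saturation and address formation could be expressed as follows; resize in the fixed-point package saturates and rounds by default:

-- declarations (inside an architecture):
signal x_abs : sfixed(6 downto -12);  -- |log(a) - log(b)| from the abs stage
signal x_sat : sfixed(3 downto -8);   -- saturated/rounded to 0 <= x < 8
signal addr  : std_logic_vector(10 downto 0);
-- concurrent statements:
x_sat <= resize(x_abs, x_sat);        -- saturate to 3 integer, 8 fractional bits
addr  <= to_slv(x_sat)(10 downto 0);  -- drop the sign bit, always '0' after abs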

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit: log(a) and log(b) are subtracted, the MSB (sign bit) of the difference controls the selection mux, the absolute value of the difference selects the bits indexing the lookup table, and the table output is added to the selected maximum to form the result.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to columns further right. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The resources of primary interest are LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of each module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H, as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage of the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \qquad (6.1)$$

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

$$e^x = 2^{x \cdot \frac{1}{\ln(2)}} \qquad (6.2)$$

where 1/ln(2) can be precalculated. This rewrite can be further refined with

$$2^{x \cdot \frac{1}{\ln(2)}} = 2^{\lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \cdot 2^{x \cdot \frac{1}{\ln(2)} - \lfloor x \cdot \frac{1}{\ln(2)} \rfloor} \qquad (6.3)$$

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

$$2^y = 2^{\lfloor y \rfloor} \cdot 2^{y - \lfloor y \rfloor} \qquad (6.4)$$

where 2^⌊y⌋ can be implemented with a simple binary decoder, while 2^(y − ⌊y⌋) can be precomputed and stored in a lookup table, with y − ⌊y⌋ ranging from 0 to 1.
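A sketch of the final step in hardware terms follows; the names, widths and table size are hypothetical assumptions, and in fixed point the multiplication by 2^⌊y⌋ reduces to a left shift:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- declarations (inside an architecture):
-- exp2_table holds precomputed values of 2^f for 0 <= f < 1 in steps of
-- 2^-8 (256 entries); the contents are generated offline.
type exp2_rom_t is array (0 to 255) of unsigned(15 downto 0);
signal exp2_table : exp2_rom_t;
signal frac_part  : std_logic_vector(7 downto 0);  -- y - floor(y)
signal int_part   : std_logic_vector(3 downto 0);  -- floor(y), assumed >= 0
signal mantissa   : unsigned(15 downto 0);
signal result     : unsigned(31 downto 0);
-- concurrent statements:
mantissa <= exp2_table(to_integer(unsigned(frac_part)));  -- 2^(y - floor(y))
result   <= shift_left(resize(mantissa, result'length),
                       to_integer(unsigned(int_part)));   -- times 2^floor(y)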

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \qquad (6.5)$$

iff ad − bc ≠ 0, as explained in [Strang, 2009].

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, as there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel, at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used, as well as the noise level N_0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor, capable of performing for instance division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations, such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
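As a sketch of this point, with illustrative names and widths, the four real multiplications of (ar + j·ai)(br + j·bi) can run in parallel in hardware:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- declarations (inside an architecture):
signal ar, ai, br, bi : signed(17 downto 0);
signal pr, pi         : signed(36 downto 0);
-- concurrent statements: four parallel multipliers, one subtraction and
-- one addition; pipeline registers omitted for brevity.
pr <= resize(ar * br, 37) - resize(ai * bi, 37);  -- real part
pi <= resize(ar * bi, 37) + resize(ai * br, 37);  -- imaginary part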

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector for integration into larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 29: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

5Implementation

This chapter describes the implementation of the chosen hardware modules forthe SUMIS algorithm

51 Overview

The following sections describes the implementation of a subset of the hardwaremodules needed for the SUMIS algorithm A 4times4 complex MIMO setup has beenassumed which implies that the corresponding real matrices will be of dimension8 times 8

This assumption is reflected in both the matrix multiplication and matrix inver-sion units where the input matrices are of said dimension The choice of matrixdimension also affect the wordlengths necessary to provide accurate results

A small note on the notation used in this chapter can be considered useful Asignal of type std_logic is a one dimensional logic signal An array of logicvalues with the dimension N is denoted std_logic_vector(N-1 downto 0) Asigned fractional number is described by sfixed(A downto -B) where it denotesA + 1 integer bits and B fractional bits where position 0 is implicitly the decimalpoint

23

24 5 Implementation

52 Matrix Multiplication

To perform matrix multiplication an IP block from Xilinx was used [Xilinx Inc2011b] The drawback of this IP block is that the input wordlength is limited toat most 18 bits and this limitation has affected the choice of wordlengths for theother modules as well

521 IP Block Trade-offs

When generating the IP block it is possible to perform trade-offs regarding theperformance versus hardware usage by selecting an unroll factor This unrollfactor determine how many elements of the input matrices that can be providedas input simultaneously and thus how many multipliers that will be used

The highest unroll factor for a 8times8 matrix is 64 It requires 8 multipliers and oneelement of the resulting matrix is computed each time The lowest unroll factoris 1 where all of the elements of the resulting matrix are computed in parallelThis requires 512 multipliers but naturally faster than calculating one elementeach time

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design.

For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be found in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc., 2011a].

For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. This behavior is illustrated in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface.

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation: one of the first matrix multiplications that has to be calculated, H^T H.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence 0, 1, 2, ..., 62, 63 onto 0, 8, 16, ..., 55, 63, as sketched below.
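
As an illustration, the contents of such a lookup table can be generated as in the following Python sketch (a software model of the address mapping only; the names are hypothetical and not part of the VHDL implementation):

# Index i of the LUT holds the row-major address of the i:th element in
# column-major order, so reading a row-stored 8x8 matrix through the LUT
# yields the transposed matrix.
N = 8
transpose_lut = [(i % N) * N + (i // N) for i in range(N * N)]

assert transpose_lut[:4] == [0, 8, 16, 24]   # first column, read top-down
assert transpose_lut[62] == 55               # second-to-last element
assert transpose_lut[63] == 63               # last element is unchanged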

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the matrix multiplication implementation.


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L is inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit in Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic, by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before, followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element of Q, and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i:th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left. A software sketch of the decomposition is given below for reference.
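
To make the data flow concrete, here is a minimal software model of the decomposition in Python (an illustration of the standard LDL^T recurrences that Algorithm 3.2 builds on, not the VHDL module; variable names are assumptions):

# Q must be symmetric positive definite. Returns a unit lower triangular L
# and a diagonal d such that Q = L * diag(d) * L^T.
def ldl_decompose(Q):
    n = len(Q)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for i in range(n):
        L[i][i] = 1.0
        # Pair-wise product of d and row i of L (zero beyond column i - 1).
        v = [L[i][j] * d[j] for j in range(i)]
        # New diagonal element: adder-tree sum subtracted from Q[i][i].
        d[i] = Q[i][i] - sum(L[i][j] * v[j] for j in range(i))
        # Remaining column elements, scaled by the reciprocal of d[i].
        for k in range(i + 1, n):
            L[k][i] = (Q[k][i] - sum(L[k][j] * v[j] for j in range(i))) / d[i]
    return L, d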


Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree.

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This can be achieved using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, the computation unit and the registers.


Figure 5.4: Block diagram of the LDL^T unit.

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant bit that is one of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit.

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.
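
Assuming Equation 3.5 is the standard Newton-Raphson iteration x1 = x0 · (2 − d · x0), the behavior of the unit can be sketched in software as follows (the table size is an illustrative assumption, and the model uses floating point where the hardware uses ufixed):

# Lookup table of initial guesses for 1/m, 0.5 <= m < 1; the index is the
# bit pattern to the right of the always-set bit at position -1.
LUT_BITS = 6
LUT = [1.0 / (0.5 + (i + 0.5) / (1 << (LUT_BITS + 1)))
       for i in range(1 << LUT_BITS)]

def reciprocal(d):
    assert d > 0.0
    # Normalize to 0.5 <= m < 1; remember the shift count N.
    shift, m = 0, d
    while m >= 1.0:
        m /= 2.0
        shift += 1
    while m < 0.5:
        m *= 2.0
        shift -= 1
    index = int((m - 0.5) * (1 << (LUT_BITS + 1)))  # bits after the leading one
    x0 = LUT[index]                                  # initial guess of 1/m
    x1 = x0 * (2.0 - m * x0)                         # one Newton-Raphson step
    return x1 / (2.0 ** shift)                       # shift back N steps

For example, reciprocal(3.0) normalizes to m = 0.75 with shift = 2 and returns approximately 0.3333.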

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order and store the resulting values.

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time; a software sketch of the computation is shown below.
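
The sketch below is a software model with assumed names (Algorithm 3.4 itself appears in an earlier chapter). It inverts a unit lower triangular matrix L from the LDL^T decomposition by solving L · X = I column by column, where every inner step maps onto one c = c − a × b operation in the single multiply-and-accumulate unit:

def invert_unit_lower(L):
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):                  # independent systems, one per column
        for i in range(j, n):
            c = 1.0 if i == j else 0.0  # accumulator preset (clear / identity)
            for k in range(j, i):       # MAC loop: c = c - L[i][k] * X[k][j]
                c -= L[i][k] * X[k][j]
            X[i][j] = c                 # unit diagonal: no final division
    return X

Because the column systems are independent, several multiply-and-accumulate units could work on different columns in parallel, which is exactly the hardware/execution-time trade-off mentioned above.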

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.


Figure 5.7: Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in the logarithmic domain, to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x)    (5.2)

Since log(a) − log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table covers 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8. A software sketch of the complete operation follows.
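
The sketch below is a software model of the unit using the same table dimensions as the hardware (2048 entries, step 2^−8); the fixed-point rounding of the real unit is not modeled:

import math

STEP = 2.0 ** -8
TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(2048)]

def jacobi_log(log_a, log_b):
    # Computes log(a + b) = max(log a, log b) + log(1 + exp(-|log a - log b|)).
    diff = log_a - log_b
    larger = log_a if diff >= 0 else log_b   # mux controlled by the sign bit
    x = abs(diff)
    index = min(int(x / STEP), 2047)         # saturate to 3 integer bits
    return larger + TABLE[index]

As a check, jacobi_log(0.0, 0.0) returns log(2), which equals log(e^0 + e^0).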

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter presents the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note at how high a frequency the modules can operate. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not quite as optimized in an FPGA as a regular adder.
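
As an illustration of why this step touches all bits, here is a sketch of round-to-nearest followed by saturation from a wide accumulator to a narrower output format (the exact rounding mode and bit positions of the IP-block output stage are assumptions):

def round_saturate(value, in_frac, out_bits, out_frac):
    # value is a signed integer holding a fixed-point number with in_frac
    # fractional bits; requires in_frac > out_frac.
    shift = in_frac - out_frac                       # fractional bits to drop
    rounded = (value + (1 << (shift - 1))) >> shift  # round to nearest
    hi = (1 << (out_bits - 1)) - 1                   # largest signed value
    lo = -(1 << (out_bits - 1))                      # smallest signed value
    return max(lo, min(hi, rounded))                 # saturate on overflow

For two sfixed(5 downto -12) operands, a product has 24 fractional bits, so a call could look like round_saturate(acc, in_frac=24, out_bits=18, out_frac=12).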

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks. A software sketch of the idea follows.
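
The sketch below models hyperbolic CORDIC in rotation mode (cf. [Muller, 1997]); the iteration schedule with repeated indices 4, 13, 40, ... is required for convergence. The Python formulation with floating point is only an illustration; in hardware the constants would be precomputed tables and the scalings would be shifts:

import math

def tanh_cordic(x, iters=16):
    # Build the iteration schedule 1, 2, 3, 4, 4, 5, ..., 13, 13, ...
    indices, k, repeat = [], 1, 4
    while len(indices) < iters:
        indices.append(k)
        if k == repeat:                 # repeat this index once more
            indices.append(k)
            repeat = 3 * repeat + 1     # 4, 13, 40, ...
        k += 1
    gain = 1.0
    for k in indices:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * k))
    cx, cy, z = 1.0 / gain, 0.0, x      # pre-scale so cx ends at cosh(x)
    for k in indices:
        sigma = 1.0 if z >= 0.0 else -1.0
        cx, cy = (cx + sigma * cy * 2.0 ** (-k),
                  cy + sigma * cx * 2.0 ** (-k))
        z -= sigma * math.atanh(2.0 ** (-k))
    # cx ~ cosh(x), cy ~ sinh(x), valid for |x| < ~1.118 without range extension
    return cy / cx

Note that the final division would itself need a division unit (or the reciprocal unit from Chapter 5.3.2) in hardware.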

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2:

e^x = e^(x · (1/ln(2)) · ln(2)) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined as

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) · 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1. A software sketch of the approach is shown below.

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

    [ a  b ]^-1       1     [  d  -b ]
    [ c  d ]     =  ------- [ -c   a ]    (6.5)
                    ad - bc


iff ad − bc ≠ 0, as explained in [Strang, 2009]. A sketch of this closed formula is given below.
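
Equation 6.5 translates directly into code (a hypothetical helper for illustration, not one of the implemented modules; note that it needs one reciprocal and four multiplications):

def inv2x2(a, b, c, d):
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    r = 1.0 / det
    return [[ d * r, -b * r],
            [-c * r,  a * r]]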

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the described limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach; if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability in software. As with the previously described detectors, this implementation also employs a complex-valued model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex-valued model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large system-on-chips are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger system-on-chips.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules, to highlight implementation details that need to be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 30: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

24 5 Implementation

52 Matrix Multiplication

To perform matrix multiplication an IP block from Xilinx was used [Xilinx Inc2011b] The drawback of this IP block is that the input wordlength is limited toat most 18 bits and this limitation has affected the choice of wordlengths for theother modules as well

521 IP Block Trade-offs

When generating the IP block it is possible to perform trade-offs regarding theperformance versus hardware usage by selecting an unroll factor This unrollfactor determine how many elements of the input matrices that can be providedas input simultaneously and thus how many multipliers that will be used

The highest unroll factor for a 8times8 matrix is 64 It requires 8 multipliers and oneelement of the resulting matrix is computed each time The lowest unroll factoris 1 where all of the elements of the resulting matrix are computed in parallelThis requires 512 multipliers but naturally faster than calculating one elementeach time

The problem with choosing an unroll factor of 1 apart from the need for a tremen-dous amount of multipliers is that it is hard to provide that much input in par-allel Given that each element is represented with 18 bits to provide the twomatrix inputs it would be necessary with a bus of width 18 times 64 times 2 = 2304 It isnot feasible to both route this wide bus and provide this parallel memory accesssince the matrix multiplication is just one of the modules in the whole design

For the reasons described an unroll factor of 64 is suitable which allows the use ofa single block RAM to provide the input Additional discussion about the unrollfactor can be seen in Chapter 64

522 Interface

The interface to the IP block is called AMBA AXI4-Stream which is originallydeveloped by ARM but adopted by Xilinx as described in [Xilinx Inc 2011a]

For each of the data inputs there exists three signals valid last and data Whendata is transferred to the module valid must be held high and a new elementmust be available each clock cycle When the last element of the matrix is presentthe signal last must be held high during that clock cycle A figure of this behaviorcan be seen in Figure 51

Figure 51 Control signals for the AMBA AXI4-Stream interface

523 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algo-rithm one case was chosen as an example for implementation One of the first

52 Matrix Multiplication 25

matrix multiplications that has to be calculated is HTH and this case was chosenas an example

This example implementation assumes that the matrix H has previously beenwritten to a block RAM with dual read ports It reads out the input data fromthe dual port block RAM feeds it to the IP block and gather the output in anadditional block RAM

The matrix H is stored row-wise in the block RAM To be able to access both H andHT simultaneously to provide input to the IP block both read ports of the blockRAM must be used To read out the regular matrix a counter can be used thatwill count from 0 to 63 and generate the read address To obtain the transposedmatrix the read out must be performed column-wise this can be solved by usingthe same counter as for the original matrix but insert a lookup table betweenthe counter and the address input This lookup table will address the matrix incolumn order instead of row order Thus the lookup table will map the sequence0 1 2 62 63 onto 0 8 16 55 63

Everything in the implementation will be controlled by a control FSM that willcontain the address counter and control the control signals shown in Figure 51It will observe the status signals from the IP block and also store the result inan additional block RAM A block diagram of the implementation can be seen inFigure 52

InputBRAM

OutputBRAM

IP blockMatrix mult

port a port b

ControlFSMLUT

input addr

control

c output

status

output addr

write

output

Figure 52 Block diagram of the matrix multiplication implementation

26 5 Implementation

53 Matrix Inversion

To be able to perform matrix inversion multiple modules are needed The firststep is the LDLT decomposition described in Chapter 531 including a reciprocalunit described in Chapter 532 The resulting matrix L will be inverted using aforward substitution unit described in Chapter 533

Finally the inverted matrix can be produced using the previously described ma-trix multiplication unit in Chapter 52

531 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hard-ware An effort has been made to use a minimum amount of control logic byperforming redundant computations to be able to exploit regular structures inthe computations

Recall the algorithm describing the decomposition in Algorithm 32 The vectorv can be obtained by performing pair-wise multiplication between the vector dand the current row of L The reason why the described loop only iterates to i minus 1is that the remaining elements of both L and d are still zero

The computation of the next element in position i of v and d can be interpretedas a pair-wise multiplication between the newly calculated v and the same rowfrom L as before followed by a summation that can be performed with an addertree This result can be subtracted from the current diagonal element from Q andthus the new element of both v and d has been calculated

The same structure can be seen in the remaining calculations for each iterationbut now there are more values that needs calculation and there is an additionalmultiplication with the reciprocal of the ith element of d These conclusionsresult in the computation unit presented in Figure 53 with the adder tree to theleft

53 Matrix Inversion 27

Add Sub

Mult

Reciprocal

From

inpu

t BR

AM

Stor

e in

L-B

RAM

Stor

e in

V

D re

gist

ers

Add

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Add

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Figure 53 Computation unit used in the LDLT unit with 8 parallel multi-pliers and an adder tree

The data path described in Figure 53 also contains the reciprocal unit which isdescribed in detail in Chapter 532 To be able to fully utilize the computationunit it must be possible to access a complete row of the matrix L simultaneouslywhile being able to write an individual element This is possible to achieve usinga dual port block RAM created using CoreGen since it allows for asymmetricaccess ports The dual port block RAM has two sides A and B Side A has a wideread and write port that allows a complete matrix row to be read or written atonce Side B on the other hand has narrow ports that allow a single element to beread or written This asymmetric memory is constructed of a multiple of smallermemories together with some logic to perform the address decoding In this casethe block RAM for L is composed of eight smaller memories as building blocks

The structure of the LDLT unit can be seen in Figure 54 The control unit is builtof an FSM that controls the memories computation unit and registers

28 5 Implementation

Input BRAM (Q)

Output BRAM (L)

Computation Unit

V regsD regs

ControlFSM

Input Q

Output L

Output D

new element L

new elements VD

addrctrl

addr

load

Figure 54 Block diagram of the LDLT unit

The input and output ports are described in Table 51

Name Dir Type Commentclk in std_logic Input clockrst_n in std_logic Reset active lowstart in std_logic Start computationaddr_in in std_logic_vector(5 downto 0) Input addressdata_in in sfixed(5 downto -12) Data inputwe in std_logic Write enableready out std_logic Ready for inputdone out std_logic Computation doneaddr_out in std_logic_vector(5 downto 0) Output addressL_data out sfixed(2 downto -15) L matrix outputD_data out sfixed(2 downto -15) Dminus1 matrix output

Table 51 Input and output ports of the LDLT decomposition module

532 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described inEquation 35 One problem is that the lookup table must be limited in size whilestill providing a good initial guess for all input numbers If the input d can bescaled to 05 le d lt 1 it follows that 1 lt 1

d le 2 and the lookup table can be limitedin size To perform this dynamic scaling in hardware the most significant bitthat is one of the input number must reside on position minus1 next to the decimalpoint If the current bit position is known it is possible to scale the number byshifting left or right the appropriate number of steps until the bit is in the correctposition


If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set, it can be ignored, and the remaining bits to the right can be used as an index to the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure for the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are not shown in Figure 5.5 for clarity.


Figure 5.5 Block diagram of the reciprocal unit.
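The pieces above can be tied together in a short Python model. Equation 3.5 is not reproduced in this section, so the refinement step below assumes the standard Newton-Raphson iteration x' = x(2 - d·x); the table size is likewise an illustrative assumption:

```python
def reciprocal(d, table_bits=6):
    """Model of the reciprocal unit: dynamic scaling to 0.5 <= m < 1,
    a table lookup indexed by the bits below the always-set MSB, and
    one assumed Newton-Raphson refinement step."""
    assert d > 0
    m, n = d, 0
    while m >= 1.0:           # shift right until the MSB is at position -1
        m, n = m / 2.0, n + 1
    while m < 0.5:            # or shift left for small inputs
        m, n = m * 2.0, n - 1
    # Drop the always-set bit at position -1 ("subtract 0.5") to index
    index = int((m - 0.5) * 2 ** (table_bits + 1))
    x = 1.0 / (0.5 + (index + 0.5) / 2 ** (table_bits + 1))  # table guess
    x = x * (2.0 - m * x)     # refinement; the factor 2 is a shift
    return x / 2 ** n         # shift back to undo the input scaling

print(reciprocal(3.0), 1 / 3.0)   # both approximately 0.3333
```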

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
load    in   std_logic              Load new d
d       in   ufixed(5 downto -12)   d input
result  out  ufixed(5 downto -12)   1/d output

Table 5.2 Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, an effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.


Figure 5.6 Block diagram of the multiply-and-accumulate module.

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c - a × b. The main problem that has to be solved is how to control these units, provide them with the input in correct order, and store the resulting values.
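The required ordering can be sketched in Python. Algorithm 3.4 itself is not restated in this section, so the sketch below assumes the standard column-wise forward substitution that inverts a unit lower-triangular L using only the MAC operation c = c - a·b:

```python
def forward_substitution(L):
    """Solve L X = I for X = L^-1, with L unit lower-triangular.

    Every inner step is a single MAC operation c = c - a * b, matching
    the computation unit described above; the loop order shows how the
    control memory must sequence the addresses into L and X.
    """
    n = len(L)
    X = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):                      # independent column systems
        for row in range(col + 1, n):
            acc = 0.0                         # cleared accumulator
            for k in range(col, row):
                acc -= L[row][k] * X[k][col]  # MAC: c = c - a * b
            X[row][col] = acc
    return X

L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.25, -0.5, 1.0]]
X = forward_substitution(L)                   # L times X is the identity
```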

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus easily can be solved by different computation units. This is a compromise between hardware usage and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3 Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.



Figure 5.7 Block diagram of the forward substitution unit.

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4 Input and output ports of the forward substitution module.


5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) - log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^(-x))    (5.2)

Since log(a) - log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the largest term and shall be selected, otherwise log(a). This means that it is possible to use a simple multiplexer with the sign bit of the result as control signal to select the largest value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^(-x)). A graph of this function can be seen in Figure 5.8.


Figure 5.8 The function log(1 + e^(-x)) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table with precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^(-8).
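A small Python model of the unit, using the table parameters above, could look as follows (floating-point is used in place of the sfixed arithmetic, so only the table quantization is modelled):

```python
import math

STEP = 2 ** -8                       # table step size
TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(8 * 2 ** 8)]

def jacobi_log(log_a, log_b):
    """log(a + b) from log(a) and log(b): the max() stands in for the
    sign-bit-controlled multiplexer, and the list indexing for the
    block RAM lookup; x is saturated to the table range 0 <= x < 8."""
    x = abs(log_a - log_b)
    index = min(int(x / STEP), len(TABLE) - 1)
    return max(log_a, log_b) + TABLE[index]

# Approximates log(exp(log_a) + exp(log_b)):
print(jacobi_log(math.log(0.3), math.log(0.2)), math.log(0.5))
```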

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.


Figure 5.9 Block diagram of the Jacobi logarithm unit.

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5 Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is also compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the value chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns far right. The accuracy in the leftmost columns would correspond to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^(-8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1 Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding, the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not quite as optimized in an FPGA as a regular adder.
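As an illustration of the operation on this critical path, the following Python sketch rounds and saturates a wide value to an sfixed-style format (the widths are examples; the actual implementation operates on the raw bit vectors):

```python
import math

def round_saturate(x, int_bits=5, frac_bits=12):
    """Round x to the nearest multiple of 2**-frac_bits and saturate it
    to the range of a signed fixed-point number with the given widths,
    e.g. sfixed(5 downto -12)."""
    step = 2.0 ** -frac_bits
    q = math.floor(x / step + 0.5) * step    # round half up
    hi = 2.0 ** int_bits - step              # largest representable value
    lo = -(2.0 ** int_bits)                  # most negative value
    return min(max(q, lo), hi)

print(round_saturate(1.00003), round_saturate(1234.5))   # 1.0 and the max
```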

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2 Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3 Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4 Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the value with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
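A Python sketch of the hyperbolic CORDIC approach is given below. It is a rotation-mode model with the iteration repeats the hyperbolic variant requires; note that the CORDIC gain cancels in the ratio sinh/cosh, so a single pair of accumulators suffices in this model (the hardware would still use two blocks as described above):

```python
import math

def tanh_cordic(theta, iters=16):
    """tanh via hyperbolic CORDIC in rotation mode.

    x converges to K*cosh(theta) and y to K*sinh(theta); the gain K
    cancels in y/x. Iterations 4 and 13 are repeated, as required for
    convergence, which holds for |theta| up to about 1.12."""
    seq, i = [], 1
    while len(seq) < iters:        # 1, 2, 3, 4, 4, 5, ..., 13, 13, 14, ...
        seq.append(i)
        if i in (4, 13) and seq.count(i) < 2:
            continue               # repeat this index once
        i += 1
    x, y, z = 1.0, 0.0, theta
    for i in seq:
        d = 1.0 if z >= 0 else -1.0
        x, y, z = (x + d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atanh(2.0 ** -i))
    return y / x

print(tanh_cordic(0.5), math.tanh(0.5))   # both approximately 0.4621
```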

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · 1/ln(2)) = 2^(x · 1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) - floor(x · 1/ln(2)))    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y - floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y - floor(y)) can be precomputed and stored in a lookup table, with y - floor(y) ranging from 0 to 1.
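A minimal Python model of Equations 6.2-6.4 follows; the 8-bit fractional table resolution is an assumption, chosen here to mirror the Jacobi logarithm table:

```python
import math

FRAC_BITS = 8
INV_LN2 = 1.0 / math.log(2.0)      # 1/ln(2), precalculated
EXP_TABLE = [2.0 ** (i * 2.0 ** -FRAC_BITS) for i in range(2 ** FRAC_BITS)]

def exp_approx(x):
    """exp(x) = 2**floor(y) * 2**(y - floor(y)) with y = x / ln(2):
    the power of two from the integer part is a shift/binary decoder in
    hardware, the fractional part a table lookup."""
    y = x * INV_LN2
    i = math.floor(y)
    f = y - i                      # 0 <= f < 1
    return 2.0 ** i * EXP_TABLE[int(f * 2 ** FRAC_BITS)]

print(exp_approx(1.0), math.e)     # agree to within the table step
```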

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension ns are also needed. If ns is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

    [ a  b ]^-1       1      [  d  -b ]
    [ c  d ]     = -------- · [ -c   a ]    (6.5)
                   ad - bc


iff ad - bc ≠ 0, as explained in [Strang, 2009].
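For ns = 2, Equation 6.5 translates directly into a handful of multiplications and one reciprocal, as in this sketch:

```python
def inv2(a, b, c, d):
    """Closed-form inverse of the 2x2 matrix [[a, b], [c, d]] per
    Equation 6.5; only multiplications and one reciprocal are needed."""
    det = a * d - b * c
    assert det != 0
    r = 1.0 / det                  # could reuse the reciprocal unit
    return [[d * r, -b * r],
            [-c * r, a * r]]

print(inv2(4.0, 7.0, 2.0, 6.0))    # [[0.6, -0.7], [-0.2, 0.4]]
```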

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure, including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided into eight units that can operate simultaneously, and this is very helpful to provide a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions and rather using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decompositions provide structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations, such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
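The operation count above is easy to see in code; in hardware, the four products below would be formed by parallel multipliers and the final additions pipelined:

```python
def cmul(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi): four real multiplications and two
    additions/subtractions, as counted above."""
    return ar * br - ai * bi, ar * bi + ai * br

print(cmul(1.0, 2.0, 3.0, 4.0))    # (-5.0, 10.0) = (1+2j)*(3+4j)
```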

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 31: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

52 Matrix Multiplication 25

matrix multiplications that has to be calculated is HTH and this case was chosenas an example

This example implementation assumes that the matrix H has previously beenwritten to a block RAM with dual read ports It reads out the input data fromthe dual port block RAM feeds it to the IP block and gather the output in anadditional block RAM

The matrix H is stored row-wise in the block RAM To be able to access both H andHT simultaneously to provide input to the IP block both read ports of the blockRAM must be used To read out the regular matrix a counter can be used thatwill count from 0 to 63 and generate the read address To obtain the transposedmatrix the read out must be performed column-wise this can be solved by usingthe same counter as for the original matrix but insert a lookup table betweenthe counter and the address input This lookup table will address the matrix incolumn order instead of row order Thus the lookup table will map the sequence0 1 2 62 63 onto 0 8 16 55 63

Everything in the implementation will be controlled by a control FSM that willcontain the address counter and control the control signals shown in Figure 51It will observe the status signals from the IP block and also store the result inan additional block RAM A block diagram of the implementation can be seen inFigure 52

InputBRAM

OutputBRAM

IP blockMatrix mult

port a port b

ControlFSMLUT

input addr

control

c output

status

output addr

write

output

Figure 52 Block diagram of the matrix multiplication implementation

26 5 Implementation

53 Matrix Inversion

To be able to perform matrix inversion multiple modules are needed The firststep is the LDLT decomposition described in Chapter 531 including a reciprocalunit described in Chapter 532 The resulting matrix L will be inverted using aforward substitution unit described in Chapter 533

Finally the inverted matrix can be produced using the previously described ma-trix multiplication unit in Chapter 52

531 LDLT Decomposition

The LDLT decomposition has been implemented as a separate module in hard-ware An effort has been made to use a minimum amount of control logic byperforming redundant computations to be able to exploit regular structures inthe computations

Recall the algorithm describing the decomposition in Algorithm 32 The vectorv can be obtained by performing pair-wise multiplication between the vector dand the current row of L The reason why the described loop only iterates to i minus 1is that the remaining elements of both L and d are still zero

The computation of the next element in position i of v and d can be interpretedas a pair-wise multiplication between the newly calculated v and the same rowfrom L as before followed by a summation that can be performed with an addertree This result can be subtracted from the current diagonal element from Q andthus the new element of both v and d has been calculated

The same structure can be seen in the remaining calculations for each iterationbut now there are more values that needs calculation and there is an additionalmultiplication with the reciprocal of the ith element of d These conclusionsresult in the computation unit presented in Figure 53 with the adder tree to theleft

53 Matrix Inversion 27

Add Sub

Mult

Reciprocal

From

inpu

t BR

AM

Stor

e in

L-B

RAM

Stor

e in

V

D re

gist

ers

Add

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Add

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Mult

Mult

Add

L-BR

AML-

BRAM

Mux

VD

Mux

VD

Figure 53 Computation unit used in the LDLT unit with 8 parallel multi-pliers and an adder tree

The data path described in Figure 53 also contains the reciprocal unit which isdescribed in detail in Chapter 532 To be able to fully utilize the computationunit it must be possible to access a complete row of the matrix L simultaneouslywhile being able to write an individual element This is possible to achieve usinga dual port block RAM created using CoreGen since it allows for asymmetricaccess ports The dual port block RAM has two sides A and B Side A has a wideread and write port that allows a complete matrix row to be read or written atonce Side B on the other hand has narrow ports that allow a single element to beread or written This asymmetric memory is constructed of a multiple of smallermemories together with some logic to perform the address decoding In this casethe block RAM for L is composed of eight smaller memories as building blocks

The structure of the LDLT unit can be seen in Figure 54 The control unit is builtof an FSM that controls the memories computation unit and registers

28 5 Implementation

Input BRAM (Q)

Output BRAM (L)

Computation Unit

V regsD regs

ControlFSM

Input Q

Output L

Output D

new element L

new elements VD

addrctrl

addr

load

Figure 54 Block diagram of the LDLT unit

The input and output ports are described in Table 51

Name Dir Type Commentclk in std_logic Input clockrst_n in std_logic Reset active lowstart in std_logic Start computationaddr_in in std_logic_vector(5 downto 0) Input addressdata_in in sfixed(5 downto -12) Data inputwe in std_logic Write enableready out std_logic Ready for inputdone out std_logic Computation doneaddr_out in std_logic_vector(5 downto 0) Output addressL_data out sfixed(2 downto -15) L matrix outputD_data out sfixed(2 downto -15) Dminus1 matrix output

Table 51 Input and output ports of the LDLT decomposition module

532 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described inEquation 35 One problem is that the lookup table must be limited in size whilestill providing a good initial guess for all input numbers If the input d can bescaled to 05 le d lt 1 it follows that 1 lt 1

d le 2 and the lookup table can be limitedin size To perform this dynamic scaling in hardware the most significant bitthat is one of the input number must reside on position minus1 next to the decimalpoint If the current bit position is known it is possible to scale the number byshifting left or right the appropriate number of steps until the bit is in the correctposition

53 Matrix Inversion 29

If the input is shifted N steps to provide the correct scaling then the reciprocalapproximated must also be shifted N steps to reflect the reciprocal of the originalinput number

With the practicalities of input scaling handled the issue of how to index into thelookup table remains By investigating the scaled number following conclusionscan be made

1 The smallest number is 05 which means only bit minus1 is set and all bits to theright are zero

2 The largest number is almost 1 which means that bit minus1 is set and all bitsto the right are one

Since bit minus1 is always set it can be ignored and the remaining bits to the right canbe used as an index to the lookup table This means that at index 0 the initialguess for input = 05 must be stored while at the last index the guess for inputasymp 1 must be stored This manipulation can be seen as a subtraction by 05 whichmoves the interval 05 le d lt 1 to 0 le d lt 05 more suitable as an index

One additional adaptation of Equation 35 is that a multiplication by 2 is equiva-lent to a right shift by one place when using binary numbers A block diagram ofthe complete structure for the reciprocal unit can be seen in Figure 55 Registersare placed after the operations to allow for a higher operating frequency as wellas balance the paths This is not present in Figure 55 for clarity

Mult

Lookup Table

Find MSB index

Shift

Square

Sub

Shift

Shift

1d

1

d

Figure 55 Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 52 This

30 5 Implementation

unit has no control signals except for load and is used as a component in theLDLT decomposition unit

Name Dir Type Commentclk in std_logic Input clockload in std_logic Load new dd in ufixed(5 downto -12) d inputresult out ufixed(5 downto -12) 1d output

Table 52 Input and output ports of the reciprocal unit

533 Forward Substitution

The implementation approach for the forward substitution differs from the imple-mentation of the LDLT decomposition Instead of pursuing minimized controllogic efforts has been made to utilize more efficient building blocks and avoidunnecessary computation at the cost of increased control overhead

Analyzing Algorithm 34 yields that apart from subtraction the operation used ismultiply-and-accumulate which can be described as

c = c plusmn a times b (51)

with c being an accumulator register It is also necessary to be able to clear thevalue in register c A suitable hardware structure for the multiply-and-accumulateoperation can be seen in Figure 56

Mux

AddSub

Register

Mult

ba

clear

c output

0

Figure 56 Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 34 the final subtraction can bemoved into the accumulation and thus the only necessary computation unit needed

53 Matrix Inversion 31

is this multiply-and-accumulate unit performing c = c minus atimes b The main problemthat has to be solved is how to control these units and provide them with theinput in correct order and store resulting values

The multiply-and-accumulate operation is very common in various kinds of sig-nal processing and therefore the DSP48E1 blocks in the FPGA implements thisoperation among others This means that the complete structure shown in Fig-ure 56 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc 2011c]

One multiply-and-accumulate unit was chosen for the implementation it wouldbe possible to use several units since the matrix equations described in Chap-ter 333 are independent and thus easily can be solved by different computationunits This is a compromise between much hardware and execution time

The implementation contains one multiply-and-accumulate unit one memory forthe input matrix L one memory for the output matrix X and a control FSM Sincethe FSM would involve many states a separate memory was used for the controlsignals which are summarized in Table 53

Name Purposesel Control input mux to MAC unitclr Clear accumulator registerL_x L_y X Y coordinate in L matrixX_x X_y X Y coordinate in X matrixW_x W_y X Y coordinate in X matrix for writewe Write signal for X matrix

Table 53 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memorycontaining the control signals A complete block diagram of the forward substi-tution unit can be seen in Figure 57

32 5 Implementation

Input BRAM (L)

Output BRAM (X)

MACunit

Mux

1

C

Control Memory

input data output data

a

b

write addrX addrL addr

wesel

clr

Control Counter

Figure 57 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described inTable 54

Name Dir Type Commentclk in std_logic Input clockrst_n in std_logic Reset active lowstart in std_logic Start computationaddr_in in std_logic_vector(5 downto 0) Input addressdata_in in sfixed(2 downto -15) Data inputwe in std_logic Write enabledone out std_logic Computation doneaddr_out in std_logic_vector(5 downto 0) Output addressdata_out out sfixed(2 downto -15) X matrix output

Table 54 Input and output ports of the forward substitution module

54 Jacobi Logarithm 33

54 Jacobi Logarithm

As mentioned in Chapter 34 the use of the method called Jacobi logarithm issuitable when adding probabilities in log space to avoid overflow underflow andunnecessary computations Overflow means that the resulting number is to largefor the chosen integer wordlength and underflow means that the number is of tosmall magnitude to be represent using the chosen fractional wordlength

Recall Equation 39 especially the second term If log(a) and log(b) are availableas input x can be defined as x = | log(a)minus log(b)| With x defined the computationthat has to be performed is

result = max(log(a) log(b)) + log(1 + eminusx) (52)

Since log(a)minus log(b) must be calculated it is possible to use this knowledge whenperforming the max selection If the result of the subtraction is negative thenlog(b) is the largest term and shall be selected otherwise log(a) This means thatit is possible to use a simple multiplexer with the sign bit of the result as controlsignal to select the largest value

The remaining term in the expression presented in Equation 52 is log(1 + eminusx) Agraph of this function can be seen in Figure 58

0 1 2 3 4 5 6 7 80

01

02

03

04

05

06

07

x

log(1

+ e

xp(minus

x))

Figure 58 The function log(1 + eminusx) on the interval 0 le x lt 8

Since the expression is limited in value on a small interesting interval it is suitableto use a table with precomputed values instead of implementing the complexoperations standalone As can be seen in Figure 58 the expression goes towardszero and it is only necessary to precompute a table for the interval 0 le x lt 8 andstill achieve good accuracy

A block RAM is suitable for storage of the precomputed table To avoid exces-sive hardware it is suitable to limit the table so it fits in a single 36Kb block

34 5 Implementation

RAM primitive With a data width of 16 bits this allows for 2048 elements in thelookup table or in terms of the address bus width 11 bits

To use the result x as an index into the lookup table it must some how be repre-sented by 11 bits Since the table will cover 0 le x lt 8 it is possible to saturate xto only contain log2(8) = 3 integer bits This leaves 11 minus 3 = 8 bits for the frac-tional part of x With these limitations on x the table can be precomputed with xranging 0 le x lt 8 in steps of 2minus8

A block diagram of the complete structure for the Jacobi logarithm module canbe seen in Figure 59 Not shown in the figure are the delay elements neededbefore and after the selection mux since the subtraction and table lookup has alatency

Sub

Mux

Add

Lookup Table

Abs

Select bits

MSB

log(b)log(a)

Result

10

Figure 59 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 55The lack of control signals is because this module can be seen as a computationunit that has a certain latency and is supposed to output results continuouslyTherefore a control signal such as start is unnecessary

Name Dir Type Commentclk in std_logic Input clocklog_a in sfixed(5 downto -12) log(a) inputlog_b in sfixed(5 downto -12) log(b) inputresult out sfixed(5 downto -12) Result output

Table 55 Input and output ports of the Jacobi logarithm unit

6Result and Analysis

This chapter describes the result from the implementation both the accuracy ofthe computations as well as the resource usage The result is discussed and theapproach taken in this thesis is also compared with other implementations andapproaches to see what remains until a complete implementation of SUMIS canbe obtained

61 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determinehow accurate the hardware was compared to ideal computations performed withdouble precision floating-point numbers Descriptions of the accuracy are pre-sented in the following sections

The error presented was acquired using randomized input data and observingthe largest individual error in the output elements Multiple simulations wererun to ensure that the maximum error was likely to be observed

611 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 00002which directly corresponds to an accuracy of 12 fractional bits which was chosenas the output fractional wordlength It would be possible to achieve a higher ac-

35

36 6 Result and Analysis

curacy by allowing for more bits in the result but the limiting factor might be ifthe module where the results are used only utilizes fewer bits

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns furthest to the right. The accuracy in the leftmost columns would correspond to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss, and the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^{-8}.

6.2 Resource Usage

The following sections will describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not quite as optimized in an FPGA as a regular adder.
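The cost can be illustrated with a software model of the round-and-saturate step. The 22 discarded fraction bits below are an assumed split, chosen only to match the 40-to-18-bit reduction; the actual alignment depends on the fixed-point formats involved:

    % Integer model of rounding a 40-bit value v to 18 bits with saturation;
    % every bit of v influences either the rounding decision or the
    % saturation test, which is what makes the logic deep.
    r = floor(v * 2^-22 + 0.5);            % round to nearest (assumed split)
    r = min(max(r, -2^17), 2^17 - 1);      % clamp to the signed 18-bit range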

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.

Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

    tanh(x) = sinh(x) / cosh(x)                                      (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
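A software model of such a CORDIC, following the hyperbolic rotation-mode iteration in [Muller, 1997], is sketched below; note that one rotation-mode iteration produces sinh and cosh simultaneously, so the single loop here models what the two hardware blocks would deliver. Iterations 4 and 13 are repeated to guarantee convergence, which holds only for roughly |x| < 1.12, so larger arguments would first need range reduction:

    function t = tanh_cordic(z)
      % Hyperbolic CORDIC, rotation mode: drives z towards 0 while
      % accumulating cosh(z) in x and sinh(z) in y.
      idx = [1:4, 4:13, 13];                  % indices 4 and 13 repeated
      x = 1 / prod(sqrt(1 - 2.^(-2*idx)));    % pre-scale by the CORDIC gain
      y = 0;
      for i = idx
        d = sign(z); if d == 0, d = 1; end
        [x, y] = deal(x + d*y*2^-i, y + d*x*2^-i);  % shift-and-add rotation
        z = z - d * atanh(2^-i);              % atanh(2^-i) from a small table
      end
      t = y / x;                              % tanh = sinh / cosh
    end

In hardware the multiplications by 2^-i become wire shifts, and the atanh constants come from a small lookup table.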

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

    e^x = e^{x · ln(2)/ln(2)} = 2^{x · (1/ln(2))}                    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

    2^{x · (1/ln(2))} = 2^{floor(x · (1/ln(2)))} · 2^{x · (1/ln(2)) − floor(x · (1/ln(2)))}    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

    2^y = 2^{floor(y)} · 2^{y − floor(y)}                            (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
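A Matlab sketch of this scheme, with an assumed 256-entry table for the fractional part (the thesis does not fix a table size):

    function r = exp_approx(x)
      % Sketch of Equation 6.4: e^x = 2^floor(y) * 2^(y - floor(y)).
      persistent tbl
      if isempty(tbl), tbl = 2 .^ ((0:255)' / 256); end  % 2^f for 0 <= f < 1
      y  = x * (1 / log(2));       % 1/ln(2) is a precalculated constant
      iy = floor(y);
      r  = 2^iy * tbl(floor((y - iy) * 256) + 1);
      % In hardware, 2^iy is the binary decoder / shift of the table output.
    end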

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

    [a b; c d]^{-1} = (1/(ad − bc)) · [d −b; −c a]                   (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
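For reference, Equation 6.5 in executable form, with the determinant condition made explicit:

    function Ainv = inv2x2(A)
      % Closed-form inverse of a 2-by-2 matrix, valid iff ad - bc ~= 0.
      d = A(1,1)*A(2,2) - A(1,2)*A(2,1);
      assert(d ~= 0, 'matrix is singular');
      Ainv = [A(2,2), -A(1,2); -A(2,1), A(1,1)] / d;
    end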

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit that implements the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. A more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel with a reasonable cost of the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N_0, and with a floating-point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised, given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming but necessary to be able to evaluate a design approach, so if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful to provide a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, instead using only QR decompositions. It uses a fixed-point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decompositions provide structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
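For reference, the real-operation count underlying this comparison:

    % (ar + i*ai) * (br + i*bi): four real multiplications and two real
    % additions/subtractions. The four products are independent and can
    % be formed in parallel in hardware, with the combining adders
    % pipelined behind them.
    re = ar*br - ai*bi;
    im = ar*bi + ai*br;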

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In customer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibiting when trying to integrate the design into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation in hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.


IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jaldén. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för icke-kommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson


Instead of developing a mainly fixed function hardware solution that will solvethe detection problem in [Eilert et al 2008] a complete processor architecture hasbeen developed that is capable of performing detection among other problemsThe processor architecture contains multiple floating point arithmetic units capa-ble of performing commonly used complex valued operations such as multiply-add and absolute value The great advantage of an approach like this is that it ispossible to use the same hardware to perform other calculations As described in[Eilert et al 2008] it is possible with a minor addition of hardware to calculate a512-point FFT efficiently

66 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an imple-mentation described in Chapter 65 These sections aim to provide some discus-sion about how a future implementation of SUMIS could be carried out

661 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengthsshall be minimized since the magnitude of the numbers involved in the algorithmare not equal between the operations To be able to reuse as much components

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

662 Processor Architecture

To achieve a compact solution it would be favorable to design small processorsas in [Chu and McAllister 2012] and [Eilert et al 2008] and describe the opera-tions in a program memory Since the processor architecture could be completelycustom it would be possible to choose suitable operations that it can perform Itcould be more complex operations than ordinary multiplications and additionssuch as addition of small sub matrices and so on

Each processor could be equipped with custom hardware such as the Jacobi loga-rithm unit which would accelerate operations performed on probabilities in thelogarithmic domain

663 Flexibility

If a custom processor architecture as described in Chapter 662 were designed itwould allow for a high degree of flexibility It would be possible to have differ-ent subprograms for different modulation schemes and thus support all commonmodulations with the possibility to add even more

It is possible to have multiple small processors working in parallel and this is fa-vorable since the workload of detection will increase if the detection is performedin an OFDM system because of multiple carriers

664 Integration

Since devices that are using wireless technology are shrinking more and more theneed for integrated solutions is increasing rapidly In customer appliances it ismore common with larger system on chips than individual ASICs interconnectedand therefore it would be suitable to package a SUMIS detector as an IP blockrather than fabricating a custom ASIC

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 33: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

[Figure 5.3: Computation unit used in the LDLᵀ unit, with 8 parallel multipliers and an adder tree.]

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To fully utilize the computation unit, it must be possible to read a complete row of the matrix L simultaneously while still being able to write an individual element. This can be achieved with a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has wide read and write ports that allow a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed from multiple smaller memories together with some logic that performs the address decoding; in this case, the block RAM for L is composed of eight smaller memories as building blocks.

The structure of the LDLᵀ unit can be seen in Figure 5.4. The control unit is built around an FSM that controls the memories, the computation unit, and the registers.

[Figure 5.4: Block diagram of the LDLᵀ unit.]

The input and output ports are described in Table 5.1.

Name       Dir  Type                           Comment
clk        in   std_logic                      Input clock
rst_n      in   std_logic                      Reset, active low
start      in   std_logic                      Start computation
addr_in    in   std_logic_vector(5 downto 0)   Input address
data_in    in   sfixed(5 downto -12)           Data input
we         in   std_logic                      Write enable
ready      out  std_logic                      Ready for input
done       out  std_logic                      Computation done
addr_out   in   std_logic_vector(5 downto 0)   Output address
L_data     out  sfixed(2 downto -15)           L matrix output
D_data     out  sfixed(2 downto -15)           D⁻¹ matrix output

Table 5.1: Input and output ports of the LDLᵀ decomposition module.
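As a point of reference, the column-wise LDLᵀ decomposition that the unit implements can be modeled in a few lines of floating-point Python (standing in for the Matlab golden models used during verification; the algorithm follows the standard formulation in [Golub and Van Loan, 1996]). This is only a sketch: the hardware works in fixed point and multiplies by the reciprocal of d_j from the reciprocal unit instead of dividing.

    import numpy as np

    def ldlt(Q):
        # Column-wise LDL^T decomposition of a symmetric positive
        # definite Q. Returns (L, d_inv) with Q = L * diag(1/d_inv) * L^T,
        # mirroring the unit's outputs L and D^-1.
        n = Q.shape[0]
        L = np.eye(n)
        d = np.zeros(n)
        for j in range(n):                  # columns from left to right
            d[j] = Q[j, j] - np.sum(L[j, :j] ** 2 * d[:j])
            for i in range(j + 1, n):
                L[i, j] = (Q[i, j] - np.sum(L[i, :j] * L[j, :j] * d[:j])) / d[j]
        return L, 1.0 / d

    # Verify on a random symmetric positive definite matrix
    A = np.random.randn(4, 4)
    Q = A @ A.T + 4.0 * np.eye(4)
    L, d_inv = ldlt(Q)
    assert np.allclose(L @ np.diag(1.0 / d_inv) @ L.T, Q)

The left-to-right column order, with each column reusing earlier results, is also what makes the error accumulate towards the rightmost columns, as noted in Chapter 6.1.2.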

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside at position −1, next to the binary point. If the current bit position is known, the number can be scaled by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input is shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means that only bit −1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that the initial guess for input = 0.5 must be stored at index 0, while the guess for input ≈ 1 must be stored at the last index. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these are not shown in Figure 5.5 for clarity.

[Figure 5.5: Block diagram of the reciprocal unit.]

The input and output ports of the reciprocal unit can be seen in Table 5.2. This unit has no control signals except for load and is used as a component in the LDLᵀ decomposition unit.

Name     Dir  Type                   Comment
clk      in   std_logic              Input clock
load     in   std_logic              Load new d
d        in   ufixed(5 downto -12)   d input
result   out  ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from that of the LDLᵀ decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

[Figure 5.6: Block diagram of the multiply-and-accumulate module.]

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit, performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.
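As an illustration of the order of operations, solving L·x = b for a unit lower triangular L using only the clear and multiply-and-accumulate operations might look as follows; this is a sketch, whereas the actual unit solves the independent systems of Chapter 3.3.3 with all coordinates streamed from its control memory.

    import numpy as np

    def forward_substitution(L, b):
        # Solve L x = b for unit lower triangular L using only the
        # clear / multiply-and-accumulate operations of the MAC unit.
        n = L.shape[0]
        x = np.zeros(n)
        for i in range(n):
            c = b[i]                       # load b[i] into the accumulator
            for j in range(i):
                c = c - L[i, j] * x[j]     # MAC step: c = c - a*b
            x[i] = c                       # unit diagonal, so no division
        return x

    L = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.25, -0.5, 1.0]])
    b = np.array([1.0, 2.0, 3.0])
    assert np.allclose(L @ forward_substitution(L, b), b)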

The multiply-and-accumulate operation is very common in various kinds of signal processing, and therefore the DSP48E1 blocks in the FPGA implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation. It would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units; this is a compromise between hardware cost and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name        Purpose
sel         Control input mux to MAC unit
clr         Clear accumulator register
L_x, L_y    X, Y coordinate in L matrix
X_x, X_y    X, Y coordinate in X matrix
W_x, W_y    X, Y coordinate in X matrix, for write
we          Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

[Figure 5.7: Block diagram of the forward substitution unit.]

The input and output ports of the forward substitution module are described in Table 5.4.

Name       Dir  Type                           Comment
clk        in   std_logic                      Input clock
rst_n      in   std_logic                      Reset, active low
start      in   std_logic                      Start computation
addr_in    in   std_logic_vector(5 downto 0)   Input address
data_in    in   sfixed(2 downto -15)           Data input
we         in   std_logic                      Write enable
done       out  std_logic                      Computation done
addr_out   in   std_logic_vector(5 downto 0)   Output address
data_out   out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x)    (5.2)

Since log(a) − log(b) must be calculated, this knowledge can be used when performing the max selection: if the result of the subtraction is negative, log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

[Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8.]

Since the expression is limited in value on a small interesting interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression goes towards zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x so that it only contains log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
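Putting the pieces together, the behavior of the unit can be modeled as below, assuming the 11-bit index and 2^−8 step size described above; the fixed-point rounding of the hardware inputs is not modeled.

    import math

    STEP = 2.0 ** -8                   # table step size
    TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(2048)]

    def jacobi_log(log_a, log_b):
        # Compute log(a + b) from log(a) and log(b) via Equation 5.2
        diff = log_a - log_b
        larger = log_b if diff < 0.0 else log_a   # mux on the sign bit
        x = abs(diff)
        index = min(int(x / STEP), 2047)          # saturate to 0 <= x < 8
        return larger + TABLE[index]

    # log(2 + 3) computed from log(2) and log(3)
    print(jacobi_log(math.log(2.0), math.log(3.0)), math.log(5.0))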

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

[Figure 5.9: Block diagram of the Jacobi logarithm unit.]

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously; therefore, a control signal such as start is unnecessary.

Name     Dir  Type                   Comment
clk      in   std_logic              Input clock
log_a    in   sfixed(5 downto -12)   log(a) input
log_b    in   sfixed(5 downto -12)   log(b) input
result   out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.

6 Result and Analysis

This chapter describes the results from the implementation, covering both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results were obtained. The results of these computations were then imported into Matlab and compared against the expected output.

This was performed to ensure correct functionality and to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLᵀ Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns; the accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note at how high a frequency the modules can operate. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage of the matrix multiplication implementation can be seen in Table 6.1. This implementation computes HᵀH, as described in Chapter 5.2.3.

Resource            Used   Total    Percentage
Flip-flops          3024   301440   1.0 %
LUTs                1459   150720   1.0 %
Block RAM (36 Kb)   10     416      2.4 %
DSP48E1             8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result: the result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not quite as well optimized in an FPGA as a regular adder.
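To illustrate why this path is long, a sketch of round-to-nearest with saturation is shown below. The split into 24 fractional bits in and 12 fractional bits out is an assumption consistent with the wordlengths used elsewhere in this chapter, and round-half-up is assumed since the exact rounding mode is not stated.

    def round_saturate(acc, shift=12, out_bits=18):
        # Round a wide two's complement accumulator down by 'shift'
        # fractional bits (round-half-up assumed), then saturate the
        # result to an out_bits two's complement range.
        rounded = (acc + (1 << (shift - 1))) >> shift
        hi = (1 << (out_bits - 1)) - 1          # largest positive code
        lo = -(1 << (out_bits - 1))             # most negative code
        return max(lo, min(hi, rounded))        # saturate on overflow

    # A product with 24 fractional bits rounded to 12 fractional bits
    print(round_saturate(int(3.14159 * 2 ** 24)) / 2 ** 12)   # ~3.1416

The carry of the rounding addition can ripple through the full width, and the saturation must inspect all of the discarded upper bits, which is what makes this logic slower than a plain adder.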

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLᵀ Decomposition

The resource usage of the LDLᵀ decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource            Used   Total    Percentage
Flip-flops          831    301440   < 1 %
LUTs                1802   150720   1.2 %
Block RAM (36 Kb)   9      416      2.2 %
DSP48E1             19     768      2.4 %

Table 6.2: Resource usage of the LDLᵀ decomposition unit.

The maximum operating frequency of the LDLᵀ decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage of the forward substitution unit can be seen in Table 6.3.

Resource            Used   Total    Percentage
Flip-flops          30     301440   < 1 %
LUTs                124    150720   < 1 %
Block RAM (36 Kb)   2      416      < 1 %
DSP48E1             1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipeline registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource            Used   Total    Percentage
Flip-flops          180    301440   < 1 %
LUTs                156    150720   < 1 %
Block RAM (36 Kb)   1      416      < 1 %
DSP48E1             0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x, to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2) · (1/ln(2))) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^floor(x · (1/ln(2))) · 2^(x · (1/ln(2)) − floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method, described in [Muller, 1997], can be investigated instead. The drawback is that the method is described for floating point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]⁻¹ = 1/(ad − bc) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
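As code, the closed formula is a direct transcription of Equation 6.5 and costs one reciprocal and a handful of multiplications:

    def inv2x2(a, b, c, d):
        # Invert [[a, b], [c, d]] via Equation 6.5; requires ad - bc != 0
        det = a * d - b * c
        assert det != 0.0, "matrix is singular"
        r = 1.0 / det                  # one reciprocal, reused four times
        return [[d * r, -b * r], [-c * r, a * r]]

    print(inv2x2(4.0, 7.0, 2.0, 6.0))  # [[0.6, -0.7], [-0.2, 0.4]]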

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized; it is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, where there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be more suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow the necessary wordlengths to be minimized while still providing enough precision for good performance. The drawback of this optimization is that it limits the reuse of components, since a multiplier cannot be shared between sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest: perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.

The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation does. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, yet necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be carried out.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves only the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations; as described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high an accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, making a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than interconnected individual ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321, Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000, IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 1.3.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för icke-kommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 34: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

28 5 Implementation

Figure 5.4: Block diagram of the LDLT unit (input BRAM for Q, computation unit with V and D registers, control FSM, and output BRAMs for L and D).

The input and output ports are described in Table 5.1.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(5 downto -12)          Data input
we        in   std_logic                     Write enable
ready     out  std_logic                     Ready for input
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
L_data    out  sfixed(2 downto -15)          L matrix output
D_data    out  sfixed(2 downto -15)          D^-1 matrix output

Table 5.1: Input and output ports of the LDLT decomposition module.

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside at position -1, next to the binary point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.


If the input was shifted N steps to achieve the correct scaling, the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit -1 is set and all bits to the right are zero.

2. The largest number is almost 1, which means that bit -1 is set and all bits to the right are one.

Since bit -1 is always set it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction by 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths; these registers are omitted from Figure 5.5 for clarity.

Figure 5.5: Block diagram of the reciprocal unit (MSB detection, input shift, lookup table, squaring, multiplication, subtraction, and output shift).
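As a concrete illustration of the datapath in Figure 5.5, the following Python model sketches the scaling, table lookup, and refinement steps. It assumes that Equation 3.5 is one Newton-Raphson step, 1/d ≈ 2x - d·x², which would match the square, multiply, subtract, and shift blocks in the figure; the table size and wordlengths are illustrative assumptions, not the exact parameters of the implementation.

    def reciprocal(d, table_bits=6, frac_bits=12):
        """Model of the reciprocal unit: approximate 1/d for d > 0."""
        # Dynamic scaling: shift until the MSB sits at position -1,
        # i.e. 0.5 <= d_scaled < 1; n counts the shift steps.
        n = 0
        while d < 0.5:
            d *= 2.0
            n += 1
        while d >= 1.0:
            d /= 2.0
            n -= 1
        # Table lookup: ignore the always-set bit at position -1 and use
        # the next table_bits bits as index (the subtraction by 0.5).
        step = 0.5 / (1 << table_bits)
        index = int((d - 0.5) / step)
        x = 1.0 / (0.5 + (index + 0.5) * step)              # initial guess
        x = round(x * (1 << frac_bits)) / (1 << frac_bits)  # quantize guess
        # One refinement step (assumed Equation 3.5): 2x is a left shift.
        x = 2.0 * x - d * x * x
        # Undo the input scaling by shifting the result n steps back.
        return x * 2.0 ** n

For example, reciprocal(0.75) returns approximately 1.3333 and reciprocal(6.0) approximately 0.1667.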

The input and output ports of the reciprocal unit can be seen in Table 5.2. The unit has no control signals except for load, and it is used as a component in the LDLT decomposition unit.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
load    in   std_logic             Load new d
d       in   ufixed(5 downto -12)  d input
result  out  ufixed(5 downto -12)  1/d output

Table 5.2: Input and output ports of the reciprocal unit.

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDLT decomposition. Instead of pursuing minimized control logic, effort has been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 yields that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.

Figure 5.6: Block diagram of the multiply-and-accumulate module (multiplier, add/sub unit, accumulator register, and clear mux).

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only necessary computation unit is this multiply-and-accumulate unit performing c = c - a × b. The main problem that has to be solved is how to control these units, provide them with the input in the correct order, and store the resulting values.
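For illustration, a minimal Python sketch of forward substitution built around this single operation is shown below; the loop structure is a plausible reading of Algorithm 3.4 (not reproduced here), solving L·X = B column by column for a unit lower-triangular L.

    def forward_substitution(L, B):
        """Solve L @ X = B with L unit lower triangular (lists of rows)."""
        n, m = len(L), len(B[0])
        X = [[0.0] * m for _ in range(n)]
        for col in range(m):          # the columns are independent
            for i in range(n):
                c = B[i][col]         # load the accumulator
                for j in range(i):
                    c = c - L[i][j] * X[j][col]   # one MAC: c = c - a*b
                X[i][col] = c         # unit diagonal, so no division
        return X

Because the columns are independent, several MAC units could work on different columns in parallel, which is the hardware/execution-time compromise discussed below.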

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc., 2011c].

One multiply-and-accumulate unit was chosen for the implementation; it would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and can thus easily be solved by different computation units. This is a compromise between hardware cost and execution time.

The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit.

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.

Figure 5.7: Block diagram of the forward substitution unit (input BRAM for L, output BRAM for X, MAC unit with input mux, control memory, and control counter).

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                          Comment
clk       in   std_logic                     Input clock
rst_n     in   std_logic                     Reset, active low
start     in   std_logic                     Start computation
addr_in   in   std_logic_vector(5 downto 0)  Input address
data_in   in   sfixed(2 downto -15)          Data input
we        in   std_logic                     Write enable
done      out  std_logic                     Computation done
addr_out  in   std_logic_vector(5 downto 0)  Output address
data_out  out  sfixed(2 downto -15)          X matrix output

Table 5.4: Input and output ports of the forward substitution module.

5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space, to avoid overflow, underflow, and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is of too small a magnitude to be represented using the chosen fractional wordlength.

Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) - log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^-x)    (5.2)

Since log(a) - log(b) must be calculated, it is possible to use this knowledge when performing the max selection. If the result of the subtraction is negative, then log(b) is the larger term and shall be selected; otherwise log(a) is selected. This means that it is possible to use a simple multiplexer, with the sign bit of the result as control signal, to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^-x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^-x) on the interval 0 ≤ x < 8.

Since the expression is limited in value on a small interesting interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the expression approaches zero, and it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits, this allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x so that it contains only log2(8) = 3 integer bits. This leaves 11 - 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^-8.

A block diagram of the complete structure of the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit (subtraction, absolute value, table lookup, sign-bit-controlled mux, and final addition).

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.
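To summarize the behaviour of the unit, here is a minimal Python model of Equation 5.2 using the table layout described above; the hardware operates on sfixed values, so the floating-point arithmetic is only a behavioural stand-in.

    import math

    # Correction table for log(1 + exp(-x)), 0 <= x < 8, step 2**-8
    # (2048 16-bit entries, fitting one 36 Kb block RAM).
    STEP = 2.0 ** -8
    TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(2048)]

    def jacobi_log(log_a, log_b):
        """Return log(a + b) given log(a) and log(b)."""
        diff = log_a - log_b
        larger = log_b if diff < 0 else log_a       # mux on the sign bit
        index = min(int(abs(diff) / STEP), 2047)    # saturate to [0, 8)
        return larger + TABLE[index]

Summing a whole vector of log-domain probabilities then reduces to folding jacobi_log over the elements, which matches the fully pipelined, one-result-per-cycle design of the unit.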

6 Result and Analysis

This chapter describes the results from the implementation: both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is also compared with other implementations and approaches, to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab, where they were compared with and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but this is of limited value if the module where the results are used utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table values are limited to the range from 0 to log(2) while the table still has a step size of 2^-8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The resources of primary interest are LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource           Used  Total   Percentage
Flip-flops         3024  301440  1.0 %
LUTs               1459  150720  1.0 %
Block RAM (36 Kb)  10    416     2.4 %
DSP48E1            8     768     1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why the rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not as well optimized in an FPGA as a regular adder.
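As an illustration, a sketch of round-to-nearest with saturation from a wide product down to an 18-bit sfixed(2 downto -15) value is given below; the input scaling and the rounding mode (round half up) are assumptions, as the thesis does not specify them.

    def round_saturate(acc, in_frac=30, out_int=3, out_frac=15):
        """Round and saturate a wide fixed-point integer 'acc'.

        'acc' is scaled by 2**in_frac; the result is scaled by
        2**out_frac and fits in out_int + out_frac bits, two's
        complement (sfixed(2 downto -15) when 3 + 15 = 18 bits).
        """
        shift = in_frac - out_frac
        rounded = (acc + (1 << (shift - 1))) >> shift   # round half up
        limit = 1 << (out_int + out_frac - 1)
        return max(-limit, min(limit - 1, rounded))     # saturate

The comparisons in the saturation step are what inspect the full width of the result, which is why this stage ends up on the critical path.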

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used  Total   Percentage
Flip-flops         831   301440  < 1 %
LUTs               1802  150720  1.2 %
Block RAM (36 Kb)  9     416     2.2 %
DSP48E1            19    768     2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid the excessive bit growth that has to be taken into account and that results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.

Resource           Used  Total   Percentage
Flip-flops         30    301440  < 1 %
LUTs               124   150720  < 1 %
Block RAM (36 Kb)  2     416     < 1 %
DSP48E1            1     768     < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of the pipeline registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used  Total   Percentage
Flip-flops         180   301440  < 1 %
LUTs               156   150720  < 1 %
Block RAM (36 Kb)  1     416     < 1 %
DSP48E1            0     768     0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition has the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters, and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
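A minimal floating-point model of hyperbolic CORDIC in rotation mode is sketched below. After the iterations, x and y hold K·cosh(z0) and K·sinh(z0), so the quotient y/x gives tanh(z0) with the CORDIC gain K cancelling, removing the need for gain compensation. The repeated iterations (4, 13, 40, ...) follow the standard formulation in [Muller, 1997], and convergence only holds for |z| up to about 1.118, so larger arguments would need prior range reduction.

    import math

    def tanh_cordic(z, iters=16):
        """Hyperbolic CORDIC, rotation mode: tanh(z) = y/x."""
        x, y = 1.0, 0.0
        i, repeat, done = 1, 4, 0     # iterations 4, 13, 40, ... repeat
        while done < iters:
            for _ in range(2 if i == repeat else 1):
                d = 1.0 if z >= 0.0 else -1.0
                t = 2.0 ** -i                 # a shift by i places
                x, y = x + d * y * t, y + d * x * t
                z -= d * math.atanh(t)        # angles from a small table
                done += 1
            if i == repeat:
                repeat = 3 * repeat + 1
            i += 1
        return y / x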

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2, with

e^x = e^(x · (1/ln(2)) · ln(2)) = 2^(x · (1/ln(2)))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · (1/ln(2))) = 2^(floor(x · (1/ln(2)))) · 2^(x · (1/ln(2)) - floor(x · (1/ln(2))))    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^(floor(y)) · 2^(y - floor(y))    (6.4)

where 2^(floor(y)) can be implemented with a simple binary decoder, while 2^(y - floor(y)) can be precomputed and stored in a lookup table, with y - floor(y) ranging from 0 to 1.
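A minimal sketch of this decomposition follows; the 8-bit table resolution is an illustrative assumption chosen to mirror the table dimensioning in Chapter 5.4.

    import math

    FRAC = 8                               # fractional bits of y - floor(y)
    INV_LN2 = 1.0 / math.log(2.0)          # the precalculated 1/ln(2)
    # Table of 2**f for f in [0, 1) in steps of 2**-FRAC.
    EXP2_TABLE = [2.0 ** (i * 2.0 ** -FRAC) for i in range(1 << FRAC)]

    def exp_approx(x):
        """e**x = 2**floor(y) * 2**(y - floor(y)) with y = x/ln(2)."""
        y = x * INV_LN2
        k = math.floor(y)                  # handled by a binary decoder
        f = y - k                          # fractional part in [0, 1)
        return EXP2_TABLE[int(f * (1 << FRAC))] * 2.0 ** k  # 2**k: a shift

For example, exp_approx(1.0) returns about 2.716 against e ≈ 2.71828, consistent with the 8-bit table resolution.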

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication, but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(-1) = 1/(ad - bc) · [d -b; -c a]    (6.5)

iff ad - bc ≠ 0, as explained in [Strang, 2009].
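For illustration, Equation 6.5 transcribes directly into code; in hardware the division would map onto the reciprocal unit from Chapter 5.3.2.

    def inv_2x2(a, b, c, d):
        """Closed-form inverse of [a b; c d]; valid iff ad - bc != 0."""
        det = a * d - b * c
        if det == 0:
            raise ValueError("singular matrix: ad - bc = 0")
        r = 1.0 / det                      # one reciprocal, four multiplies
        return ((d * r, -b * r), (-c * r, a * r))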

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of the LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel, with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of such per-module optimization is that it limits the reuse of components, since a multiplier could not be shared between sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format that allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating-point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming, yet necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed, to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements, represented by fixed-point numbers with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built from a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed-point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. The following sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared, it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling, as used in [Kim et al., 2008], could be utilized to achieve as high an accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, and these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be operations more complex than ordinary multiplications and additions, such as the addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individual interconnected ASICs, and it would therefore be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 35: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

53 Matrix Inversion 29

If the input is shifted N steps to provide the correct scaling then the reciprocalapproximated must also be shifted N steps to reflect the reciprocal of the originalinput number

With the practicalities of input scaling handled the issue of how to index into thelookup table remains By investigating the scaled number following conclusionscan be made

1 The smallest number is 05 which means only bit minus1 is set and all bits to theright are zero

2 The largest number is almost 1 which means that bit minus1 is set and all bitsto the right are one

Since bit minus1 is always set it can be ignored and the remaining bits to the right canbe used as an index to the lookup table This means that at index 0 the initialguess for input = 05 must be stored while at the last index the guess for inputasymp 1 must be stored This manipulation can be seen as a subtraction by 05 whichmoves the interval 05 le d lt 1 to 0 le d lt 05 more suitable as an index

One additional adaptation of Equation 35 is that a multiplication by 2 is equiva-lent to a right shift by one place when using binary numbers A block diagram ofthe complete structure for the reciprocal unit can be seen in Figure 55 Registersare placed after the operations to allow for a higher operating frequency as wellas balance the paths This is not present in Figure 55 for clarity

Mult

Lookup Table

Find MSB index

Shift

Square

Sub

Shift

Shift

1d

1

d

Figure 55 Block diagram of the reciprocal unit

The input and output ports of the reciprocal unit can be seen in Table 52 This

30 5 Implementation

unit has no control signals except for load and is used as a component in theLDLT decomposition unit

Name Dir Type Commentclk in std_logic Input clockload in std_logic Load new dd in ufixed(5 downto -12) d inputresult out ufixed(5 downto -12) 1d output

Table 52 Input and output ports of the reciprocal unit

533 Forward Substitution

The implementation approach for the forward substitution differs from the imple-mentation of the LDLT decomposition Instead of pursuing minimized controllogic efforts has been made to utilize more efficient building blocks and avoidunnecessary computation at the cost of increased control overhead

Analyzing Algorithm 34 yields that apart from subtraction the operation used ismultiply-and-accumulate which can be described as

c = c plusmn a times b (51)

with c being an accumulator register It is also necessary to be able to clear thevalue in register c A suitable hardware structure for the multiply-and-accumulateoperation can be seen in Figure 56

Mux

AddSub

Register

Mult

ba

clear

c output

0

Figure 56 Block diagram of the multiply-and-accumulate module

Given the algorithm described in Algorithm 34 the final subtraction can bemoved into the accumulation and thus the only necessary computation unit needed

53 Matrix Inversion 31

is this multiply-and-accumulate unit performing c = c minus atimes b The main problemthat has to be solved is how to control these units and provide them with theinput in correct order and store resulting values

The multiply-and-accumulate operation is very common in various kinds of sig-nal processing and therefore the DSP48E1 blocks in the FPGA implements thisoperation among others This means that the complete structure shown in Fig-ure 56 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc 2011c]

One multiply-and-accumulate unit was chosen for the implementation it wouldbe possible to use several units since the matrix equations described in Chap-ter 333 are independent and thus easily can be solved by different computationunits This is a compromise between much hardware and execution time

The implementation contains one multiply-and-accumulate unit one memory forthe input matrix L one memory for the output matrix X and a control FSM Sincethe FSM would involve many states a separate memory was used for the controlsignals which are summarized in Table 53

Name Purposesel Control input mux to MAC unitclr Clear accumulator registerL_x L_y X Y coordinate in L matrixX_x X_y X Y coordinate in X matrixW_x W_y X Y coordinate in X matrix for writewe Write signal for X matrix

Table 53 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memorycontaining the control signals A complete block diagram of the forward substi-tution unit can be seen in Figure 57

32 5 Implementation

Input BRAM (L)

Output BRAM (X)

MACunit

Mux

1

C

Control Memory

input data output data

a

b

write addrX addrL addr

wesel

clr

Control Counter

Figure 57 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described inTable 54

Name Dir Type Commentclk in std_logic Input clockrst_n in std_logic Reset active lowstart in std_logic Start computationaddr_in in std_logic_vector(5 downto 0) Input addressdata_in in sfixed(2 downto -15) Data inputwe in std_logic Write enabledone out std_logic Computation doneaddr_out in std_logic_vector(5 downto 0) Output addressdata_out out sfixed(2 downto -15) X matrix output

Table 54 Input and output ports of the forward substitution module

54 Jacobi Logarithm 33

54 Jacobi Logarithm

As mentioned in Chapter 34 the use of the method called Jacobi logarithm issuitable when adding probabilities in log space to avoid overflow underflow andunnecessary computations Overflow means that the resulting number is to largefor the chosen integer wordlength and underflow means that the number is of tosmall magnitude to be represent using the chosen fractional wordlength

Recall Equation 39 especially the second term If log(a) and log(b) are availableas input x can be defined as x = | log(a)minus log(b)| With x defined the computationthat has to be performed is

result = max(log(a) log(b)) + log(1 + eminusx) (52)

Since log(a)minus log(b) must be calculated it is possible to use this knowledge whenperforming the max selection If the result of the subtraction is negative thenlog(b) is the largest term and shall be selected otherwise log(a) This means thatit is possible to use a simple multiplexer with the sign bit of the result as controlsignal to select the largest value

The remaining term in the expression presented in Equation 52 is log(1 + eminusx) Agraph of this function can be seen in Figure 58

0 1 2 3 4 5 6 7 80

01

02

03

04

05

06

07

x

log(1

+ e

xp(minus

x))

Figure 58 The function log(1 + eminusx) on the interval 0 le x lt 8

Since the expression is limited in value on a small interesting interval it is suitableto use a table with precomputed values instead of implementing the complexoperations standalone As can be seen in Figure 58 the expression goes towardszero and it is only necessary to precompute a table for the interval 0 le x lt 8 andstill achieve good accuracy

A block RAM is suitable for storage of the precomputed table To avoid exces-sive hardware it is suitable to limit the table so it fits in a single 36Kb block

34 5 Implementation

RAM primitive With a data width of 16 bits this allows for 2048 elements in thelookup table or in terms of the address bus width 11 bits

To use the result x as an index into the lookup table it must some how be repre-sented by 11 bits Since the table will cover 0 le x lt 8 it is possible to saturate xto only contain log2(8) = 3 integer bits This leaves 11 minus 3 = 8 bits for the frac-tional part of x With these limitations on x the table can be precomputed with xranging 0 le x lt 8 in steps of 2minus8

A block diagram of the complete structure for the Jacobi logarithm module canbe seen in Figure 59 Not shown in the figure are the delay elements neededbefore and after the selection mux since the subtraction and table lookup has alatency

Sub

Mux

Add

Lookup Table

Abs

Select bits

MSB

log(b)log(a)

Result

10

Figure 59 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 55The lack of control signals is because this module can be seen as a computationunit that has a certain latency and is supposed to output results continuouslyTherefore a control signal such as start is unnecessary

Name Dir Type Commentclk in std_logic Input clocklog_a in sfixed(5 downto -12) log(a) inputlog_b in sfixed(5 downto -12) log(b) inputresult out sfixed(5 downto -12) Result output

Table 55 Input and output ports of the Jacobi logarithm unit

6Result and Analysis

This chapter describes the result from the implementation both the accuracy ofthe computations as well as the resource usage The result is discussed and theapproach taken in this thesis is also compared with other implementations andapproaches to see what remains until a complete implementation of SUMIS canbe obtained

61 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The error presented was acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.
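The accuracy figures in the following sections map a maximum absolute error to an equivalent number of fractional bits via error ≈ 2^(−n); a one-line check of that mapping:

    import numpy as np
    # An error of roughly 2^-n corresponds to n correct fractional bits.
    for err in (0.0002, 0.001, 0.00002):
        print(f"max error {err} ~ {-np.log2(err):.1f} fractional bits")
    # 0.0002 -> ~12.3, 0.001 -> ~10.0, 0.00002 -> ~15.6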

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits; this was the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but the limiting factor might be that the module where the results are used only utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving towards the rightmost columns. The accuracy in the leftmost columns would correspond to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^(−8).

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource             Used    Total     Percentage
Flip-flops           3024    301440    1.0 %
LUTs                 1459    150720    1.0 %
Block RAM (36 Kb)    10      416       2.4 %
DSP48E1              8       768       1.0 %

Table 6.1: Resource usage of the matrix multiplication unit

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself, but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable result, and this is not quite as optimized in an FPGA as a regular adder.
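A software model of such a round-and-saturate stage makes the cost visible: every output decision depends on the full input word. The bit positions in this Python sketch are hypothetical; the text only states the 40-bit to 18-bit reduction.

    def round_saturate(v: int, shift: int = 22, out_bits: int = 18) -> int:
        """Round a wide two's-complement value down by `shift` fractional
        bits and saturate it to `out_bits` bits (widths chosen for
        illustration only)."""
        v = (v + (1 << (shift - 1))) >> shift            # round half up
        lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
        return max(lo, min(hi, v))                       # clamp on overflow

    print(round_saturate(600_000_000_000))   # saturates to 131071
    print(round_saturate(12_345_678))        # 12345678 / 2^22 ~ 2.94 -> 3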

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource             Used    Total     Percentage
Flip-flops           831     301440    < 1 %
LUTs                 1802    150720    1.2 %
Block RAM (36 Kb)    9       416       2.2 %
DSP48E1              19      768       2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined and thus allow for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource             Used    Total     Percentage
Flip-flops           30      301440    < 1 %
LUTs                 124     150720    < 1 %
Block RAM (36 Kb)    2       416       < 1 %
DSP48E1              1       768       < 1 %

Table 6.3: Resource usage of the forward substitution unit

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource             Used    Total     Percentage
Flip-flops           180     301440    < 1 %
LUTs                 156     150720    < 1 %
Block RAM (36 Kb)    1       416       < 1 %
DSP48E1              0       768       0 %

Table 6.4: Resource usage of the Jacobi logarithm unit

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
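A rotation-mode hyperbolic CORDIC iteration computes sinh and cosh simultaneously, and conveniently the CORDIC gain cancels in the quotient, so tanh needs no scaling correction. Below is a floating-point Python sketch of the idea; hardware would replace the multiplications by 2^(−i) with shifts and store the atanh constants in a small table. The iteration repeats and the convergence bound are standard properties of hyperbolic CORDIC, not taken from the text.

    import math

    def tanh_cordic(t: float, iters: int = 16) -> float:
        """Rotation-mode hyperbolic CORDIC; returns tanh(t) as y/x."""
        assert abs(t) <= 1.11, "basic hyperbolic CORDIC needs |t| <~ 1.118"
        x, y, z = 1.0, 0.0, t
        i, used = 1, 0
        while used < iters:
            # iterations 4, 13, 40, ... must be repeated for convergence
            for _ in range(2 if i in (4, 13, 40) else 1):
                d = 1.0 if z >= 0 else -1.0
                x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
                z -= d * math.atanh(2.0 ** -i)
                used += 1
            i += 1
        return y / x          # K*sinh(t) / (K*cosh(t)); the gain K cancels

    print(tanh_cordic(0.8), math.tanh(0.8))   # ~0.6640 for both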

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2 with

e^x = e^(x · ln(2)/ln(2)) = 2^(x · 1/ln(2))    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))    (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))    (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table with y − floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that the method is described for floating point numbers.
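The decomposition in Equation 6.4 is straightforward to model; a minimal Python sketch, where the 8-bit fractional index into the 2^f table is an assumed parameter:

    import numpy as np

    FRAC = 8                                                # assumed index width
    frac_lut = 2.0 ** (np.arange(1 << FRAC) / (1 << FRAC))  # 2^f for f in [0, 1)

    def exp_approx(x: float) -> float:
        """e^x = 2^y with y = x/ln(2), split per Equation 6.4."""
        y = x * (1.0 / np.log(2.0))          # multiply by precomputed 1/ln(2)
        k = int(np.floor(y))                 # 2^k: binary decoder / shift
        f_idx = int((y - k) * (1 << FRAC))   # table index for 2^(y - floor(y))
        return float(np.ldexp(frac_lut[f_idx], k))

    print(exp_approx(-3.2), np.exp(-3.2))    # ~0.0408 for both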

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse. An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(−1) = 1/(ad − bc) · [d −b; −c a]    (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
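As a quick check of Equation 6.5, a direct Python transcription:

    import numpy as np

    def inv2x2(m: np.ndarray) -> np.ndarray:
        """Closed-form 2x2 inverse per Equation 6.5, valid iff ad - bc != 0."""
        (a, b), (c, d) = m
        det = a * d - b * c
        if det == 0:
            raise ZeroDivisionError("matrix is singular")
        return np.array([[d, -b], [-c, a]]) / det

    m = np.array([[2.0, 1.0], [1.0, 3.0]])
    print(inv2x2(m) @ m)    # ~ the 2x2 identity matrix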

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1 with minimized control logic is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be more suitable. This would allow, for instance, smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this approach is that it limits the reuse of components, since a multiplier could not be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accounted for without any additional changes.
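To illustrate what such a custom format buys, the sketch below quantizes values to a hypothetical sign/exponent/mantissa layout (the 6-bit exponent and 11-bit mantissa are invented for illustration) and shows that both very small and fairly large magnitudes survive with small relative error, which a single fixed point format could not cover:

    import numpy as np

    def to_custom_float(v: float, exp_bits: int = 6, man_bits: int = 11) -> float:
        """Round v to a hypothetical custom floating point format to model
        its dynamic range; not an implementation of any IEEE 754 format."""
        if v == 0.0:
            return 0.0
        e = int(np.floor(np.log2(abs(v))))
        half = 1 << (exp_bits - 1)
        e = max(-half, min(half - 1, e))                  # clamp the exponent
        m = round(abs(v) / 2.0 ** e * (1 << man_bits)) / (1 << man_bits)
        return float(np.sign(v)) * m * 2.0 ** e

    for v in (0.00037, 5.25, -123.4):
        print(v, to_custom_float(v))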

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach, so if software can aid in this process it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, and this is very helpful to provide a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, instead using QR decompositions only. It uses a fixed point representation with constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that will solve the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other problems. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since for instance a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
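The four-multiplier decomposition referred to above, written out for clarity:

    def cmul(a: float, b: float, c: float, d: float) -> tuple[float, float]:
        """(a + jb)(c + jd) using four real multiplications and two
        additions/subtractions; in hardware the four products can be
        formed in parallel and the combining adders pipelined."""
        return a * c - b * d, a * d + b * c

    print(cmul(1.0, 2.0, 3.0, -1.0))   # (5.0, 5.0)
    print((1 + 2j) * (3 - 1j))         # (5+5j)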

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as additions of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than interconnected individual ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1-94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson



54 Jacobi Logarithm

As mentioned in Chapter 34 the use of the method called Jacobi logarithm issuitable when adding probabilities in log space to avoid overflow underflow andunnecessary computations Overflow means that the resulting number is to largefor the chosen integer wordlength and underflow means that the number is of tosmall magnitude to be represent using the chosen fractional wordlength

Recall Equation 39 especially the second term If log(a) and log(b) are availableas input x can be defined as x = | log(a)minus log(b)| With x defined the computationthat has to be performed is

result = max(log(a) log(b)) + log(1 + eminusx) (52)

Since log(a)minus log(b) must be calculated it is possible to use this knowledge whenperforming the max selection If the result of the subtraction is negative thenlog(b) is the largest term and shall be selected otherwise log(a) This means thatit is possible to use a simple multiplexer with the sign bit of the result as controlsignal to select the largest value

The remaining term in the expression presented in Equation 52 is log(1 + eminusx) Agraph of this function can be seen in Figure 58

0 1 2 3 4 5 6 7 80

01

02

03

04

05

06

07

x

log(1

+ e

xp(minus

x))

Figure 58 The function log(1 + eminusx) on the interval 0 le x lt 8

Since the expression is limited in value on a small interesting interval it is suitableto use a table with precomputed values instead of implementing the complexoperations standalone As can be seen in Figure 58 the expression goes towardszero and it is only necessary to precompute a table for the interval 0 le x lt 8 andstill achieve good accuracy

A block RAM is suitable for storage of the precomputed table To avoid exces-sive hardware it is suitable to limit the table so it fits in a single 36Kb block

34 5 Implementation

RAM primitive With a data width of 16 bits this allows for 2048 elements in thelookup table or in terms of the address bus width 11 bits

To use the result x as an index into the lookup table it must some how be repre-sented by 11 bits Since the table will cover 0 le x lt 8 it is possible to saturate xto only contain log2(8) = 3 integer bits This leaves 11 minus 3 = 8 bits for the frac-tional part of x With these limitations on x the table can be precomputed with xranging 0 le x lt 8 in steps of 2minus8

A block diagram of the complete structure for the Jacobi logarithm module canbe seen in Figure 59 Not shown in the figure are the delay elements neededbefore and after the selection mux since the subtraction and table lookup has alatency

Sub

Mux

Add

Lookup Table

Abs

Select bits

MSB

log(b)log(a)

Result

10

Figure 59 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 55The lack of control signals is because this module can be seen as a computationunit that has a certain latency and is supposed to output results continuouslyTherefore a control signal such as start is unnecessary

Name Dir Type Commentclk in std_logic Input clocklog_a in sfixed(5 downto -12) log(a) inputlog_b in sfixed(5 downto -12) log(b) inputresult out sfixed(5 downto -12) Result output

Table 55 Input and output ports of the Jacobi logarithm unit

6Result and Analysis

This chapter describes the result from the implementation both the accuracy ofthe computations as well as the resource usage The result is discussed and theapproach taken in this thesis is also compared with other implementations andapproaches to see what remains until a complete implementation of SUMIS canbe obtained

61 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determinehow accurate the hardware was compared to ideal computations performed withdouble precision floating-point numbers Descriptions of the accuracy are pre-sented in the following sections

The error presented was acquired using randomized input data and observingthe largest individual error in the output elements Multiple simulations wererun to ensure that the maximum error was likely to be observed

611 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 00002which directly corresponds to an accuracy of 12 fractional bits which was chosenas the output fractional wordlength It would be possible to achieve a higher ac-

35

36 6 Result and Analysis

curacy by allowing for more bits in the result but the limiting factor might be ifthe module where the results are used only utilizes fewer bits

612 LDLT Decomposition

While testing the decomposition module a maximum error of approximately 0001was observed and this corresponds to an accuracy of 10 fractional bits Since thealgorithm operates on columns from left to right and uses the intermediate re-sults the error accumulates when moving to columns far right The accuracy inthe leftmost columns would correspond to 14 fractional bits

613 Forward Substitution

The forward substitution module had a maximum error of approximately 000002which corresponds to an accuracy of 15 fractional bits All of the computationswhere performed using 3 integer bits and 15 fractional bits so this accuracy wasexpected To allow for a higher precision the computations would need to beperformed with more bits

614 Jacobi Logarithm

The error observed when testing the Jacobi module was very small approximately00002 which indicates that the implementation has negligible accuracy loss andthe achieved accuracy corresponds to the chosen input wordlength of 12 frac-tional bits The precision of the lookup table does not affect the result in anymeaningful way mostly because the table is very limited in range from 0 to log(2)and still has a step size of 2minus8

62 Resource Usage

The following sections will describe the resource usage for each individual mod-ule that was implemented The interesting resources are primarily LUTs flip-flops DSP48E1 and block RAMs

It is also interesting to note how high frequency the modules can operate at Thisis described along side with a description of the critical path of the module whichdictates what the maximum frequency can be

621 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen inTable 61 This implementation computes HTH as described in Chapter 523

62 Resource Usage 37

Resource Used Total PercentageFlip-flops 3024 301440 10 LUTs 1459 150720 10 Block RAM (36 Kb) 10 416 24 DSP48E1 8 768 10

Table 61 Resource usage of the matrix multiplication unit

The maximum operation frequency of the matrix multiplication is 234 MHz Thecritical path is not in the matrix multiplication IP block itself but instead in therounding of the result The result from the IP block is 40 bits wide and has tobe rounded and saturated to fit in 18 bits Without this rounding the operatingfrequency would be higher since the IP block is highly optimized for a high clockfrequency The reason why rounding is so expensive is that the rounding andsaturation examines all of the bits to determine a suitable rounding and this isnot quite as optimized in an FPGA as a regular adder

622 Matrix Inversion

This section described the resource usage of the components in the matrix inver-sion

LDLT Decomposition

The resource usage for the LDLT decomposition including the reciprocal unit canbe seen in Table 62

Resource Used Total PercentageFlip-flops 831 301440 lt 1 LUTs 1802 150720 12 Block RAM (36 Kb) 9 416 22 DSP48E1 19 768 24

Table 62 Resource usage of the LDLT decomposition unit

The maximum operation frequency of the LDLT decomposition is 101 MHz Thereason for this quite low operating frequency is the rounding of the numbersfrom the adder tree and the multiplication with the reciprocal value This round-ing could favorably be pipelined and thus allow for a higher frequency It couldalso be investigated if earlier rounding is possible to avoid excessive bit growththat has to be taken into account and result in a large multiplier for the multipli-cation with the reciprocal value

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 63

38 6 Result and Analysis

Resource Used Total PercentageFlip-flops 30 301440 lt 1 LUTs 124 150720 lt 1 Block RAM (36 Kb) 2 416 lt 1 DSP48E1 1 768 lt 1

Table 63 Resource usage of the forward substitution unit

The maximum operation frequency is 1665 MHz The limiting factor in this mod-ule is the minimal use of pipelining registers inside the DSP48E1 block With anadjusted FSM and enabled pipeline registers the frequency could be increasedgreatly

623 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 64 The unitis fully pipelined and can produce a new result every clock cycle with an initiallatency of 7 clock cycles

Resource Used Total PercentageFlip-flops 180 301440 lt 1 LUTs 156 150720 lt 1 Block RAM (36 Kb) 1 416 lt 1 DSP48E1 0 768 0

Table 64 Resource usage of the Jacobi logarithm unit

The maximum frequency which the unit can operate at is 3165 MHz The criticalpath that determines the maximum operating frequency is located in the round-ing of the computed result Since the output from the addition is the one withthe most number of bits in the Jacobi logarithm unit it is natural that this is therounding that has the longest critical path

63 Remaining Work

This section contains a description of the remaining work necessary to completea proof-of-concept implementation of SUMIS given the design choices alreadymade in this thesis

631 Hyperbolic Tangent

In Equation 26 the function tanh is used One area efficient way of calculatingtanh with good accuracy is to use the CORDIC algorithm described in [Muller1997] The CORDIC algorithm only requires a small lookup table shifters andadders and operates by performing successive rotations by predefined angles Un-

63 Remaining Work 39

fortunately it is not possible to calculate tanh directly but since

tanh(x) =sinh(x)cosh(x)

(61)

where both sinh and cosh can be calculated using a CORDIC block it is possibleto produce tanh using two separate CORDIC blocks

632 Exponential Function

In the algorithm it is necessary to compute ex to be able to use a probabilitycalculated in the logarithmic domain

One idea of how to implement this is to use a similar approach as in Chapter 54with precomputations coupled with a constrained table lookup This approachstarts with rewriting the base of the calculations from e to 2 with

ex = exlowast ln(2)

ln(2) = 2xlowast1

ln(2) (62)

where 1ln(2) can be precalculated This rewrite can be further refined with

2xlowast1

ln(2) = 2f loor(xlowast1

ln(2) ) lowast 2xlowast1

ln(2)minusf loor(xlowast1

ln(2) ) (63)

If y = x lowast 1ln(2) is defined Equation 63 becomes

2y = 2f loor(y) lowast 2yminusf loor(y) (64)

where 2f loor(y) can be implemented with a simple binary decoder while 2yminusf loor(y)

can be precomputed and stored in a lookup table with y minus f loor(y) ranging from0 to 1

If this approach does not provide enough accuracy Tangrsquos method described in[Muller 1997] can be further investigated instead The drawback is that themethod is described for floating point numbers

633 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm butalso addition and subtraction These operations can be performed in the samemanner as for the matrix multiplication in Chapter 52 with a generated IP blockfrom Xilinx These modules will have the same interface as the matrix multipli-cation but might have a smaller latency since addition and subtraction are not ascomputationally demanding as multiplication

As described in Chapter 233 matrix inversions of dimension ns are also neededIf ns is small for instance 2 there exists closed formulas for the matrix inverse

An inversion of a matrix of dimension 2 can be described by[a bc d

]minus1

=1

ad minus bc

[d minusbminusc a

] (65)

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

As of now the described detectors has all been soft MIMO detectors using a fixedpoint number representation The wildcard in this section is [Eilert et al 2008]which does not perform soft detection and utilizes a custom floating point num-ber representation The reason why this detector is still interesting for this thesisis because of its programmable nature

Instead of developing a mainly fixed function hardware solution that will solvethe detection problem in [Eilert et al 2008] a complete processor architecture hasbeen developed that is capable of performing detection among other problemsThe processor architecture contains multiple floating point arithmetic units capa-ble of performing commonly used complex valued operations such as multiply-add and absolute value The great advantage of an approach like this is that it ispossible to use the same hardware to perform other calculations As described in[Eilert et al 2008] it is possible with a minor addition of hardware to calculate a512-point FFT efficiently

66 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an imple-mentation described in Chapter 65 These sections aim to provide some discus-sion about how a future implementation of SUMIS could be carried out

661 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengthsshall be minimized since the magnitude of the numbers involved in the algorithmare not equal between the operations To be able to reuse as much components

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as additions of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.
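Such a custom instruction could wrap the structure from Chapter 5.4. A behavioral sketch is given below, with signed 18-bit values, a hypothetical 256-entry correction ROM, and saturation omitted for brevity:

-- Sketch of a Jacobi logarithm instruction:
-- result = max(log_a, log_b) + log(1 + exp(-|log_a - log_b|)),
-- where the correction term comes from a precomputed table.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity jacobi_unit is
  port (
    clk          : in  std_logic;
    log_a, log_b : in  signed(17 downto 0);
    result       : out signed(17 downto 0));
end entity;

architecture rtl of jacobi_unit is
  type rom_t is array (0 to 255) of signed(17 downto 0);
  -- Hypothetical ROM holding log(1 + exp(-x)); in practice it would
  -- be initialized with precomputed values, not zeros.
  signal corr_rom : rom_t := (others => (others => '0'));
begin
  process(clk)
    variable diff : signed(18 downto 0);
    variable maxv : signed(17 downto 0);
  begin
    if rising_edge(clk) then
      diff := resize(log_a, 19) - resize(log_b, 19);
      if diff(18) = '1' then      -- sign bit selects the larger input
        maxv := log_b;
        diff := -diff;
      else
        maxv := log_a;
      end if;
      -- Index with bits of |diff| covering the table range
      -- (fixed point layout and saturation omitted in this sketch).
      result <= maxv + corr_rom(to_integer(unsigned(diff(10 downto 3))));
    end if;
  end process;
end architecture;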

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus support all common modulations with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple subcarriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than to fabricate a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC. 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 37: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

53 Matrix Inversion 31

is this multiply-and-accumulate unit performing c = c minus atimes b The main problemthat has to be solved is how to control these units and provide them with theinput in correct order and store resulting values

The multiply-and-accumulate operation is very common in various kinds of sig-nal processing and therefore the DSP48E1 blocks in the FPGA implements thisoperation among others This means that the complete structure shown in Fig-ure 56 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc 2011c]

One multiply-and-accumulate unit was chosen for the implementation it wouldbe possible to use several units since the matrix equations described in Chap-ter 333 are independent and thus easily can be solved by different computationunits This is a compromise between much hardware and execution time

The implementation contains one multiply-and-accumulate unit one memory forthe input matrix L one memory for the output matrix X and a control FSM Sincethe FSM would involve many states a separate memory was used for the controlsignals which are summarized in Table 53

Name Purposesel Control input mux to MAC unitclr Clear accumulator registerL_x L_y X Y coordinate in L matrixX_x X_y X Y coordinate in X matrixW_x W_y X Y coordinate in X matrix for writewe Write signal for X matrix

Table 53 Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memorycontaining the control signals A complete block diagram of the forward substi-tution unit can be seen in Figure 57

32 5 Implementation

Input BRAM (L)

Output BRAM (X)

MACunit

Mux

1

C

Control Memory

input data output data

a

b

write addrX addrL addr

wesel

clr

Control Counter

Figure 57 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described inTable 54

Name Dir Type Commentclk in std_logic Input clockrst_n in std_logic Reset active lowstart in std_logic Start computationaddr_in in std_logic_vector(5 downto 0) Input addressdata_in in sfixed(2 downto -15) Data inputwe in std_logic Write enabledone out std_logic Computation doneaddr_out in std_logic_vector(5 downto 0) Output addressdata_out out sfixed(2 downto -15) X matrix output

Table 54 Input and output ports of the forward substitution module

54 Jacobi Logarithm 33

54 Jacobi Logarithm

As mentioned in Chapter 34 the use of the method called Jacobi logarithm issuitable when adding probabilities in log space to avoid overflow underflow andunnecessary computations Overflow means that the resulting number is to largefor the chosen integer wordlength and underflow means that the number is of tosmall magnitude to be represent using the chosen fractional wordlength

Recall Equation 39 especially the second term If log(a) and log(b) are availableas input x can be defined as x = | log(a)minus log(b)| With x defined the computationthat has to be performed is

result = max(log(a) log(b)) + log(1 + eminusx) (52)

Since log(a)minus log(b) must be calculated it is possible to use this knowledge whenperforming the max selection If the result of the subtraction is negative thenlog(b) is the largest term and shall be selected otherwise log(a) This means thatit is possible to use a simple multiplexer with the sign bit of the result as controlsignal to select the largest value

The remaining term in the expression presented in Equation 52 is log(1 + eminusx) Agraph of this function can be seen in Figure 58

0 1 2 3 4 5 6 7 80

01

02

03

04

05

06

07

x

log(1

+ e

xp(minus

x))

Figure 58 The function log(1 + eminusx) on the interval 0 le x lt 8

Since the expression is limited in value on a small interesting interval it is suitableto use a table with precomputed values instead of implementing the complexoperations standalone As can be seen in Figure 58 the expression goes towardszero and it is only necessary to precompute a table for the interval 0 le x lt 8 andstill achieve good accuracy

A block RAM is suitable for storage of the precomputed table To avoid exces-sive hardware it is suitable to limit the table so it fits in a single 36Kb block

34 5 Implementation

RAM primitive With a data width of 16 bits this allows for 2048 elements in thelookup table or in terms of the address bus width 11 bits

To use the result x as an index into the lookup table it must some how be repre-sented by 11 bits Since the table will cover 0 le x lt 8 it is possible to saturate xto only contain log2(8) = 3 integer bits This leaves 11 minus 3 = 8 bits for the frac-tional part of x With these limitations on x the table can be precomputed with xranging 0 le x lt 8 in steps of 2minus8

A block diagram of the complete structure for the Jacobi logarithm module canbe seen in Figure 59 Not shown in the figure are the delay elements neededbefore and after the selection mux since the subtraction and table lookup has alatency

Sub

Mux

Add

Lookup Table

Abs

Select bits

MSB

log(b)log(a)

Result

10

Figure 59 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 55The lack of control signals is because this module can be seen as a computationunit that has a certain latency and is supposed to output results continuouslyTherefore a control signal such as start is unnecessary

Name Dir Type Commentclk in std_logic Input clocklog_a in sfixed(5 downto -12) log(a) inputlog_b in sfixed(5 downto -12) log(b) inputresult out sfixed(5 downto -12) Result output

Table 55 Input and output ports of the Jacobi logarithm unit

6Result and Analysis

This chapter describes the result from the implementation both the accuracy ofthe computations as well as the resource usage The result is discussed and theapproach taken in this thesis is also compared with other implementations andapproaches to see what remains until a complete implementation of SUMIS canbe obtained

61 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determinehow accurate the hardware was compared to ideal computations performed withdouble precision floating-point numbers Descriptions of the accuracy are pre-sented in the following sections

The error presented was acquired using randomized input data and observingthe largest individual error in the output elements Multiple simulations wererun to ensure that the maximum error was likely to be observed

611 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 00002which directly corresponds to an accuracy of 12 fractional bits which was chosenas the output fractional wordlength It would be possible to achieve a higher ac-

35

36 6 Result and Analysis

curacy by allowing for more bits in the result but the limiting factor might be ifthe module where the results are used only utilizes fewer bits

612 LDLT Decomposition

While testing the decomposition module a maximum error of approximately 0001was observed and this corresponds to an accuracy of 10 fractional bits Since thealgorithm operates on columns from left to right and uses the intermediate re-sults the error accumulates when moving to columns far right The accuracy inthe leftmost columns would correspond to 14 fractional bits

613 Forward Substitution

The forward substitution module had a maximum error of approximately 000002which corresponds to an accuracy of 15 fractional bits All of the computationswhere performed using 3 integer bits and 15 fractional bits so this accuracy wasexpected To allow for a higher precision the computations would need to beperformed with more bits

614 Jacobi Logarithm

The error observed when testing the Jacobi module was very small approximately00002 which indicates that the implementation has negligible accuracy loss andthe achieved accuracy corresponds to the chosen input wordlength of 12 frac-tional bits The precision of the lookup table does not affect the result in anymeaningful way mostly because the table is very limited in range from 0 to log(2)and still has a step size of 2minus8

62 Resource Usage

The following sections will describe the resource usage for each individual mod-ule that was implemented The interesting resources are primarily LUTs flip-flops DSP48E1 and block RAMs

It is also interesting to note how high frequency the modules can operate at Thisis described along side with a description of the critical path of the module whichdictates what the maximum frequency can be

621 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen inTable 61 This implementation computes HTH as described in Chapter 523

62 Resource Usage 37

Resource Used Total PercentageFlip-flops 3024 301440 10 LUTs 1459 150720 10 Block RAM (36 Kb) 10 416 24 DSP48E1 8 768 10

Table 61 Resource usage of the matrix multiplication unit

The maximum operation frequency of the matrix multiplication is 234 MHz Thecritical path is not in the matrix multiplication IP block itself but instead in therounding of the result The result from the IP block is 40 bits wide and has tobe rounded and saturated to fit in 18 bits Without this rounding the operatingfrequency would be higher since the IP block is highly optimized for a high clockfrequency The reason why rounding is so expensive is that the rounding andsaturation examines all of the bits to determine a suitable rounding and this isnot quite as optimized in an FPGA as a regular adder

622 Matrix Inversion

This section described the resource usage of the components in the matrix inver-sion

LDLT Decomposition

The resource usage for the LDLT decomposition including the reciprocal unit canbe seen in Table 62

Resource Used Total PercentageFlip-flops 831 301440 lt 1 LUTs 1802 150720 12 Block RAM (36 Kb) 9 416 22 DSP48E1 19 768 24

Table 62 Resource usage of the LDLT decomposition unit

The maximum operation frequency of the LDLT decomposition is 101 MHz Thereason for this quite low operating frequency is the rounding of the numbersfrom the adder tree and the multiplication with the reciprocal value This round-ing could favorably be pipelined and thus allow for a higher frequency It couldalso be investigated if earlier rounding is possible to avoid excessive bit growththat has to be taken into account and result in a large multiplier for the multipli-cation with the reciprocal value

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 63

38 6 Result and Analysis

Resource Used Total PercentageFlip-flops 30 301440 lt 1 LUTs 124 150720 lt 1 Block RAM (36 Kb) 2 416 lt 1 DSP48E1 1 768 lt 1

Table 63 Resource usage of the forward substitution unit

The maximum operation frequency is 1665 MHz The limiting factor in this mod-ule is the minimal use of pipelining registers inside the DSP48E1 block With anadjusted FSM and enabled pipeline registers the frequency could be increasedgreatly

623 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 64 The unitis fully pipelined and can produce a new result every clock cycle with an initiallatency of 7 clock cycles

Resource Used Total PercentageFlip-flops 180 301440 lt 1 LUTs 156 150720 lt 1 Block RAM (36 Kb) 1 416 lt 1 DSP48E1 0 768 0

Table 64 Resource usage of the Jacobi logarithm unit

The maximum frequency which the unit can operate at is 3165 MHz The criticalpath that determines the maximum operating frequency is located in the round-ing of the computed result Since the output from the addition is the one withthe most number of bits in the Jacobi logarithm unit it is natural that this is therounding that has the longest critical path

63 Remaining Work

This section contains a description of the remaining work necessary to completea proof-of-concept implementation of SUMIS given the design choices alreadymade in this thesis

631 Hyperbolic Tangent

In Equation 26 the function tanh is used One area efficient way of calculatingtanh with good accuracy is to use the CORDIC algorithm described in [Muller1997] The CORDIC algorithm only requires a small lookup table shifters andadders and operates by performing successive rotations by predefined angles Un-

63 Remaining Work 39

fortunately it is not possible to calculate tanh directly but since

tanh(x) =sinh(x)cosh(x)

(61)

where both sinh and cosh can be calculated using a CORDIC block it is possibleto produce tanh using two separate CORDIC blocks

632 Exponential Function

In the algorithm it is necessary to compute ex to be able to use a probabilitycalculated in the logarithmic domain

One idea of how to implement this is to use a similar approach as in Chapter 54with precomputations coupled with a constrained table lookup This approachstarts with rewriting the base of the calculations from e to 2 with

ex = exlowast ln(2)

ln(2) = 2xlowast1

ln(2) (62)

where 1ln(2) can be precalculated This rewrite can be further refined with

2xlowast1

ln(2) = 2f loor(xlowast1

ln(2) ) lowast 2xlowast1

ln(2)minusf loor(xlowast1

ln(2) ) (63)

If y = x lowast 1ln(2) is defined Equation 63 becomes

2y = 2f loor(y) lowast 2yminusf loor(y) (64)

where 2f loor(y) can be implemented with a simple binary decoder while 2yminusf loor(y)

can be precomputed and stored in a lookup table with y minus f loor(y) ranging from0 to 1

If this approach does not provide enough accuracy Tangrsquos method described in[Muller 1997] can be further investigated instead The drawback is that themethod is described for floating point numbers

633 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm butalso addition and subtraction These operations can be performed in the samemanner as for the matrix multiplication in Chapter 52 with a generated IP blockfrom Xilinx These modules will have the same interface as the matrix multipli-cation but might have a smaller latency since addition and subtraction are not ascomputationally demanding as multiplication

As described in Chapter 233 matrix inversions of dimension ns are also neededIf ns is small for instance 2 there exists closed formulas for the matrix inverse

An inversion of a matrix of dimension 2 can be described by[a bc d

]minus1

=1

ad minus bc

[d minusbminusc a

] (65)

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

As of now the described detectors has all been soft MIMO detectors using a fixedpoint number representation The wildcard in this section is [Eilert et al 2008]which does not perform soft detection and utilizes a custom floating point num-ber representation The reason why this detector is still interesting for this thesisis because of its programmable nature

Instead of developing a mainly fixed function hardware solution that will solvethe detection problem in [Eilert et al 2008] a complete processor architecture hasbeen developed that is capable of performing detection among other problemsThe processor architecture contains multiple floating point arithmetic units capa-ble of performing commonly used complex valued operations such as multiply-add and absolute value The great advantage of an approach like this is that it ispossible to use the same hardware to perform other calculations As described in[Eilert et al 2008] it is possible with a minor addition of hardware to calculate a512-point FFT efficiently

66 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an imple-mentation described in Chapter 65 These sections aim to provide some discus-sion about how a future implementation of SUMIS could be carried out

661 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengthsshall be minimized since the magnitude of the numbers involved in the algorithmare not equal between the operations To be able to reuse as much components

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

662 Processor Architecture

To achieve a compact solution it would be favorable to design small processorsas in [Chu and McAllister 2012] and [Eilert et al 2008] and describe the opera-tions in a program memory Since the processor architecture could be completelycustom it would be possible to choose suitable operations that it can perform Itcould be more complex operations than ordinary multiplications and additionssuch as addition of small sub matrices and so on

Each processor could be equipped with custom hardware such as the Jacobi loga-rithm unit which would accelerate operations performed on probabilities in thelogarithmic domain

663 Flexibility

If a custom processor architecture as described in Chapter 662 were designed itwould allow for a high degree of flexibility It would be possible to have differ-ent subprograms for different modulation schemes and thus support all commonmodulations with the possibility to add even more

It is possible to have multiple small processors working in parallel and this is fa-vorable since the workload of detection will increase if the detection is performedin an OFDM system because of multiple carriers

664 Integration

Since devices that are using wireless technology are shrinking more and more theneed for integrated solutions is increasing rapidly In customer appliances it ismore common with larger system on chips than individual ASICs interconnectedand therefore it would be suitable to package a SUMIS detector as an IP blockrather than fabricating a custom ASIC

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 38: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

32 5 Implementation

Input BRAM (L)

Output BRAM (X)

MACunit

Mux

1

C

Control Memory

input data output data

a

b

write addrX addrL addr

wesel

clr

Control Counter

Figure 57 Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described inTable 54

Name Dir Type Commentclk in std_logic Input clockrst_n in std_logic Reset active lowstart in std_logic Start computationaddr_in in std_logic_vector(5 downto 0) Input addressdata_in in sfixed(2 downto -15) Data inputwe in std_logic Write enabledone out std_logic Computation doneaddr_out in std_logic_vector(5 downto 0) Output addressdata_out out sfixed(2 downto -15) X matrix output

Table 54 Input and output ports of the forward substitution module

54 Jacobi Logarithm 33

54 Jacobi Logarithm

As mentioned in Chapter 34 the use of the method called Jacobi logarithm issuitable when adding probabilities in log space to avoid overflow underflow andunnecessary computations Overflow means that the resulting number is to largefor the chosen integer wordlength and underflow means that the number is of tosmall magnitude to be represent using the chosen fractional wordlength

Recall Equation 39 especially the second term If log(a) and log(b) are availableas input x can be defined as x = | log(a)minus log(b)| With x defined the computationthat has to be performed is

result = max(log(a) log(b)) + log(1 + eminusx) (52)

Since log(a)minus log(b) must be calculated it is possible to use this knowledge whenperforming the max selection If the result of the subtraction is negative thenlog(b) is the largest term and shall be selected otherwise log(a) This means thatit is possible to use a simple multiplexer with the sign bit of the result as controlsignal to select the largest value

The remaining term in the expression presented in Equation 52 is log(1 + eminusx) Agraph of this function can be seen in Figure 58

0 1 2 3 4 5 6 7 80

01

02

03

04

05

06

07

x

log(1

+ e

xp(minus

x))

Figure 58 The function log(1 + eminusx) on the interval 0 le x lt 8

Since the expression is limited in value on a small interesting interval it is suitableto use a table with precomputed values instead of implementing the complexoperations standalone As can be seen in Figure 58 the expression goes towardszero and it is only necessary to precompute a table for the interval 0 le x lt 8 andstill achieve good accuracy

A block RAM is suitable for storage of the precomputed table To avoid exces-sive hardware it is suitable to limit the table so it fits in a single 36Kb block

34 5 Implementation

RAM primitive With a data width of 16 bits this allows for 2048 elements in thelookup table or in terms of the address bus width 11 bits

To use the result x as an index into the lookup table it must some how be repre-sented by 11 bits Since the table will cover 0 le x lt 8 it is possible to saturate xto only contain log2(8) = 3 integer bits This leaves 11 minus 3 = 8 bits for the frac-tional part of x With these limitations on x the table can be precomputed with xranging 0 le x lt 8 in steps of 2minus8

A block diagram of the complete structure for the Jacobi logarithm module canbe seen in Figure 59 Not shown in the figure are the delay elements neededbefore and after the selection mux since the subtraction and table lookup has alatency

Sub

Mux

Add

Lookup Table

Abs

Select bits

MSB

log(b)log(a)

Result

10

Figure 59 Block diagram of the Jacobi logarithm unit

The input and output ports of the Jacobi logarithm unit can be seen in Table 55The lack of control signals is because this module can be seen as a computationunit that has a certain latency and is supposed to output results continuouslyTherefore a control signal such as start is unnecessary

Name Dir Type Commentclk in std_logic Input clocklog_a in sfixed(5 downto -12) log(a) inputlog_b in sfixed(5 downto -12) log(b) inputresult out sfixed(5 downto -12) Result output

Table 55 Input and output ports of the Jacobi logarithm unit

6Result and Analysis

This chapter describes the result from the implementation both the accuracy ofthe computations as well as the resource usage The result is discussed and theapproach taken in this thesis is also compared with other implementations andapproaches to see what remains until a complete implementation of SUMIS canbe obtained

61 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determinehow accurate the hardware was compared to ideal computations performed withdouble precision floating-point numbers Descriptions of the accuracy are pre-sented in the following sections

The error presented was acquired using randomized input data and observingthe largest individual error in the output elements Multiple simulations wererun to ensure that the maximum error was likely to be observed

611 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits (2^{-12} ≈ 0.00024), which was chosen as the output fractional wordlength. It would be possible to achieve a higher accuracy by allowing for more bits in the result, but this is only useful if the module where the results are used utilizes those extra bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and uses the intermediate results, the error accumulates when moving to the columns furthest to the right. The accuracy in the leftmost columns would correspond to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), and still has a step size of 2^{-8}.

6.2 Resource Usage

The following sections will describe the resource usage for each individual module that was implemented. The interesting resources are primarily LUTs, flip-flops, DSP48E1 blocks, and block RAMs.

It is also interesting to note how high a frequency the modules can operate at. This is described alongside a description of the critical path of the module, which dictates what the maximum frequency can be.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.


Resource             Used    Total     Percentage
Flip-flops           3024    301440    1.0 %
LUTs                 1459    150720    1.0 %
Block RAM (36 Kb)    10      416       2.4 %
DSP48E1              8       768       1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but instead in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examines all of the bits to determine a suitable rounding, and this is not quite as optimized in an FPGA as a regular adder.
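A bit-accurate model of such a round-and-saturate step could look like the sketch below; the input fractional wordlength of 24 bits is an assumption chosen only for illustration, so that a wide accumulator reduces to the 18-bit format with 12 fractional bits used in this design.

```python
def round_saturate(acc, in_frac=24, out_bits=18, out_frac=12):
    """Round a wide fixed-point accumulator and saturate it to out_bits bits."""
    shift = in_frac - out_frac
    v = (acc + (1 << (shift - 1))) >> shift   # round half up
    lo = -(1 << (out_bits - 1))               # smallest representable value
    hi = (1 << (out_bits - 1)) - 1            # largest representable value
    return max(lo, min(hi, v))                # saturate on overflow
```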

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDL^T Decomposition

The resource usage for the LDL^T decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource             Used    Total     Percentage
Flip-flops           831     301440    < 1 %
LUTs                 1802    150720    1.2 %
Block RAM (36 Kb)    9       416       2.2 %
DSP48E1              19      768       2.4 %

Table 6.2: Resource usage of the LDL^T decomposition unit.

The maximum operating frequency of the LDL^T decomposition is 101 MHz. The reason for this quite low operating frequency is the rounding of the numbers from the adder tree and the multiplication with the reciprocal value. This rounding could favorably be pipelined, thus allowing for a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that otherwise has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.


Resource             Used    Total     Percentage
Flip-flops           30      301440    < 1 %
LUTs                 124     150720    < 1 %
Block RAM (36 Kb)    2       416       < 1 %
DSP48E1              1       768       < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource             Used    Total     Percentage
Flip-flops           180     301440    < 1 %
LUTs                 156     150720    < 1 %
Block RAM (36 Kb)    1       416       < 1 %
DSP48E1              0       768       0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition is the signal with the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section contains a description of the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area-efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)    (6.1)

where both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
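As a rough illustration of the hyperbolic CORDIC iterations, the following floating-point model computes sinh and cosh and forms their quotient. The repeated iteration indices 4 and 13, required for convergence of the hyperbolic mode, follow [Muller, 1997]; the iteration count is an arbitrary choice, and the sketch is only valid for roughly |z| ≤ 1.1 without additional range reduction.

```python
import math

def cordic_tanh(z, n=16):
    # Build the iteration schedule; hyperbolic CORDIC must repeat
    # the indices 4, 13, 40, ... for the iterations to converge.
    idx, i, rep = [], 1, 4
    while len(idx) < n:
        idx.append(i)
        if i == rep:
            idx.append(i)          # repeat this index once
            rep = 3 * rep + 1
        i += 1
    idx = idx[:n]
    # Total gain of the hyperbolic micro-rotations over this schedule.
    gain = 1.0
    for i in idx:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))
    x, y = 1.0 / gain, 0.0         # seed so that x -> cosh(z), y -> sinh(z)
    for i in idx:
        d = 1.0 if z >= 0.0 else -1.0
        x, y, z = (x + d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atanh(2.0 ** -i))
    return y / x                   # tanh = sinh / cosh, Equation 6.1
```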

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2, with

e^x = e^{x · ln(2)/ln(2)} = 2^{x · (1/ln(2))}    (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined with

2^{x · (1/ln(2))} = 2^{floor(x · (1/ln(2)))} · 2^{x · (1/ln(2)) − floor(x · (1/ln(2)))}    (6.3)

If y = x · (1/ln(2)) is defined, Equation 6.3 becomes

2^y = 2^{floor(y)} · 2^{y − floor(y)}    (6.4)

where 2^{floor(y)} can be implemented with a simple binary decoder, while 2^{y − floor(y)} can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.
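A small model of this decomposition is sketched below; the 8-bit table resolution is an assumption made purely for illustration.

```python
import math

TABLE_BITS = 8                                # assumed table resolution
POW2_FRAC = [2.0 ** (i / 2**TABLE_BITS)       # 2^f for 0 <= f < 1
             for i in range(2**TABLE_BITS)]

def exp_via_pow2(x):
    y = x * (1.0 / math.log(2.0))             # precalculated 1/ln(2), Equation 6.2
    k = math.floor(y)                         # 2^k: a binary decoder / shift
    f = y - k                                 # fractional part, 0 <= f < 1
    return POW2_FRAC[int(f * 2**TABLE_BITS)] * 2.0 ** k   # Equation 6.4
```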

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that the method is described for floating-point numbers.

6.3.3 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as for the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[[a, b], [c, d]]^{-1} = 1/(ad − bc) · [[d, −b], [−c, a]]    (6.5)


iff ad − bc ≠ 0, as explained in [Strang, 2009].
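For illustration, the closed formula translates directly into code; only one reciprocal and a handful of multiplications are needed, which fits well with a reciprocal unit like the one already developed for the LDL^T decomposition.

```python
def inv_2x2(a, b, c, d):
    """Closed-form inverse of [[a, b], [c, d]] per Equation 6.5."""
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    r = 1.0 / det                  # single reciprocal
    return ((d * r, -b * r),
            (-c * r, a * r))
```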

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized. It is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work, since there only exists a computation unit that implements the Jacobi logarithm and not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. The following sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. More complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application-specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor might be suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel with a reasonable cost for the interconnection.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths have been chosen from simple Matlab simulations of subsections of the algorithm, and this could be refined further by running complete simulations of the algorithm.

More extensive simulations can allow for a minimization of the necessary wordlengths while still providing enough precision for good performance. The drawback of this is that it limits the reuse of components, since a multiplier cannot be shared between different sections that mandate different wordlengths.

Because of the said limitations regarding resource sharing, a floating-point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision, described in [IEEE, 2008], but instead a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used for all of the modules.


The reason why an approach with floating point might be suitable is that it allows for more variations of the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating-point implementation this dynamic range could be accounted for without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method is used to transform a higher-level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It can allow for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution to automatically transform software to hardware with good results, but rather a suitable tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming and necessary to be able to evaluate a design approach; if software can aid in this process, it would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis, a number of hardware implementations of MIMO detectors were studied. In this section these implementations will be presented, and in Chapter 6.6 the insights from these approaches will be discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided into eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using QR decompositions only. It uses a fixed-point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decompositions provide structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that will solve the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other problems. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
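As a small illustration of the arithmetic involved, a complex multiplication decomposes as below; in hardware the four products can be formed in parallel and the two additions pipelined.

```python
def cmul(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi): four real multiplications, two additions."""
    return (ar * br - ai * bi,    # real part
            ar * bi + ai * br)    # imaginary part
```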

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, but it is also prohibiting when trying to integrate into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even with very poor SNR. With high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley–Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press, se förlagets hemsida: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson


Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 40: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

34 5 Implementation

With a data width of 16 bits, the block RAM primitive allows for 2048 elements in the lookup table or, in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table, it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x so that it only contains log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.
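As a concrete illustration, the table contents could be precomputed in Matlab along the following lines. The choice of 12 fractional bits for the stored values is an assumption based on the wordlengths used elsewhere in the design, not a documented parameter.

% Precompute the Jacobi logarithm correction table f(x) = log(1 + exp(-x))
% for 0 <= x < 8 in steps of 2^-8, giving 2048 entries (11 address bits).
frac_x   = 8;                                % fractional bits of the index x
frac_val = 12;                               % assumed fractional bits of entries
x        = (0:2^11 - 1) / 2^frac_x;          % 0 <= x < 8 in steps of 2^-8
lut      = round(log(1 + exp(-x)) * 2^frac_val);  % integer table contents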

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and the table lookup have a latency.

[Figure 5.9: Block diagram of the Jacobi logarithm unit. Abs and Sub stages feed the lookup table, the MSB of the difference selects between log(a) and log(b) in the mux, and an Add stage produces the result.]

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The lack of control signals is because this module can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore, a control signal such as start is unnecessary.

Name    Dir  Type                  Comment
clk     in   std_logic             Input clock
log_a   in   sfixed(5 downto -12)  log(a) input
log_b   in   sfixed(5 downto -12)  log(b) input
result  out  sfixed(5 downto -12)  Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit.
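For reference, a bit-accurate Matlab model of the unit's data path could look like the sketch below; lut, frac_x and frac_val come from the table precomputation sketch above, and the saturation of the table index is an assumption.

function r = jacobi_log(la, lb, lut, frac_x, frac_val)
% Model of Figure 5.9: result = max(log_a, log_b) + f(|log_a - log_b|),
% where f is the precomputed correction table. With la = log(a) and
% lb = log(b), the result approximates log(a + b).
% Example: r = jacobi_log(log(2), log(3), lut, 8, 12)  % ~log(5)
d   = abs(la - lb);                        % Abs/Sub stages
idx = min(floor(d * 2^frac_x), 2^11 - 1);  % saturate index to table range
r   = max(la, lb) + lut(idx + 1) / 2^frac_val;  % select larger input, add f
end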

6 Result and Analysis

This chapter describes the results from the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab and compared and verified against the expected output.

This was performed to ensure correct functionality and to determine how accurate the hardware is compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired by feeding the modules randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.
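The principle behind translating an observed error into accuracy in bits can be illustrated with a small Matlab sketch (hypothetical data; truncation to k fractional bits bounds the error by 2^-k):

k      = 12;                            % fractional bits under test
y_ref  = randn(4);                      % stand-in for a double precision reference
y_hw   = floor(y_ref * 2^k) / 2^k;      % model of a k-fractional-bit result
maxerr = max(abs(y_hw(:) - y_ref(:)));  % approaches 2^-k, i.e. ~0.0002 for k = 12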

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor might be that the module where the results are used utilizes fewer bits.

6.1.2 LDLT Decomposition

While testing the decomposition module, a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and reuses the intermediate results, the error accumulates towards the rightmost columns. The accuracy in the leftmost columns corresponds to 14 fractional bits.
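A Matlab sketch of the column-wise computation (the standard LDL^T recurrences with an example input, not the exact fixed-point schedule of the module) illustrates why the error grows to the right: every column is built from the results of the columns before it.

n = 4;
A = randn(n); A = A*A' + n*eye(n);     % example symmetric positive definite input
L = eye(n); d = zeros(n, 1);
for j = 1:n
    v    = L(j, 1:j-1)' .* d(1:j-1);   % intermediate results from earlier columns
    d(j) = A(j,j) - L(j, 1:j-1) * v;
    for i = j+1:n
        L(i,j) = (A(i,j) - L(i, 1:j-1) * v) / d(j);
    end
end
assert(norm(L * diag(d) * L' - A) < 1e-9);   % A = L*D*L'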

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.
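The operation itself is straightforward; a Matlab sketch, assuming a unit lower-triangular system such as the factor produced by the LDL^T decomposition:

n = 4;
L = tril(randn(n), -1) + eye(n);   % example unit lower-triangular system
b = randn(n, 1);
y = zeros(n, 1);
for i = 1:n
    y(i) = b(i) - L(i, 1:i-1) * y(1:i-1);   % no division: the diagonal is 1
end
assert(norm(L * y - b) < 1e-12);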

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table covers a very limited range, from 0 to log(2), while still having a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The resources of primary interest are LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note how high a frequency each module can operate at. This is described alongside a description of the critical path of the module, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.

Resource           Used   Total    Percentage
Flip-flops         3024   301440   1.0 %
LUTs               1459   150720   1.0 %
Block RAM (36 Kb)  10     416      2.4 %
DSP48E1            8      768      1.0 %

Table 6.1: Resource usage of the matrix multiplication unit.

The maximum operating frequency of the matrix multiplication is 234 MHz. The critical path is not in the matrix multiplication IP block itself but in the rounding of the result. The result from the IP block is 40 bits wide and has to be rounded and saturated to fit in 18 bits. Without this rounding the operating frequency would be higher, since the IP block is highly optimized for a high clock frequency. The reason why rounding is so expensive is that the rounding and saturation examine all of the bits to determine a suitable rounding, and this is not as well suited to an FPGA as a regular adder.
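A behavioral Matlab model of such a round-and-saturate step could look as follows; the binary point alignment (the shift parameter) is an assumption, chosen when the wide accumulator format and the 18-bit result format are fixed.

function y = round_sat(acc, shift, wl_out)
% Round a wide accumulator to nearest after discarding 'shift' fractional
% bits, then saturate to a signed result of wl_out bits.
r   = floor(acc / 2^shift + 0.5);   % round to nearest
lim = 2^(wl_out - 1);
y   = min(max(r, -lim), lim - 1);   % saturate on overflow
end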

6.2.2 Matrix Inversion

This section describes the resource usage of the components in the matrix inversion.

LDLT Decomposition

The resource usage for the LDLT decomposition, including the reciprocal unit, can be seen in Table 6.2.

Resource           Used   Total    Percentage
Flip-flops         831    301440   < 1 %
LUTs               1802   150720   1.2 %
Block RAM (36 Kb)  9      416      2.2 %
DSP48E1            19     768      2.4 %

Table 6.2: Resource usage of the LDLT decomposition unit.

The maximum operating frequency of the LDLT decomposition is 101 MHz. The reason for this rather low operating frequency is the rounding of the numbers from the adder tree and of the multiplication with the reciprocal value. This rounding could favorably be pipelined, allowing a higher frequency. It could also be investigated whether earlier rounding is possible, to avoid excessive bit growth that has to be taken into account and results in a large multiplier for the multiplication with the reciprocal value.

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 6.3.

Resource           Used   Total    Percentage
Flip-flops         30     301440   < 1 %
LUTs               124    150720   < 1 %
Block RAM (36 Kb)  2      416      < 1 %
DSP48E1            1      768      < 1 %

Table 6.3: Resource usage of the forward substitution unit.

The maximum operating frequency is 166.5 MHz. The limiting factor in this module is the minimal use of pipelining registers inside the DSP48E1 block. With an adjusted FSM and enabled pipeline registers, the frequency could be increased greatly.

6.2.3 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 6.4. The unit is fully pipelined and can produce a new result every clock cycle, with an initial latency of 7 clock cycles.

Resource           Used   Total    Percentage
Flip-flops         180    301440   < 1 %
LUTs               156    150720   < 1 %
Block RAM (36 Kb)  1      416      < 1 %
DSP48E1            0      768      0 %

Table 6.4: Resource usage of the Jacobi logarithm unit.

The maximum frequency at which the unit can operate is 316.5 MHz. The critical path that determines the maximum operating frequency is located in the rounding of the computed result. Since the output from the addition carries the largest number of bits in the Jacobi logarithm unit, it is natural that this rounding has the longest critical path.

6.3 Remaining Work

This section describes the remaining work necessary to complete a proof-of-concept implementation of SUMIS, given the design choices already made in this thesis.

6.3.1 Hyperbolic Tangent

In Equation 2.6 the function tanh is used. One area efficient way of calculating tanh with good accuracy is to use the CORDIC algorithm described in [Muller, 1997]. The CORDIC algorithm only requires a small lookup table, shifters and adders, and operates by performing successive rotations by predefined angles. Unfortunately, it is not possible to calculate tanh directly, but since

tanh(x) = sinh(x) / cosh(x)   (6.1)

and both sinh and cosh can be calculated using a CORDIC block, it is possible to produce tanh using two separate CORDIC blocks.
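A minimal Matlab sketch of hyperbolic CORDIC in rotation mode is shown below. The iteration count is an assumption, the repetitions of iterations 4 and 13 follow the standard convergence requirement, and the valid input range is roughly |theta| < 1.1, so argument reduction would be needed in practice.

function t = tanh_cordic(theta)
% Hyperbolic CORDIC in rotation mode: drives z towards 0 while
% accumulating x -> cosh(theta) and y -> sinh(theta).
i_seq = sort([1:13, 4, 13]);                 % shift sequence with repeats
K = prod(sqrt(1 - 2.^(-2 * i_seq)));         % hyperbolic scaling factor
x = 1 / K; y = 0; z = theta;
for i = i_seq
    d  = sign(z); if d == 0, d = 1; end
    xn = x + d * y * 2^(-i);                 % shift-and-add micro-rotation
    y  = y + d * x * 2^(-i);
    z  = z - d * atanh(2^(-i));              % angle table entry
    x  = xn;
end
t = y / x;                                   % tanh = sinh/cosh
end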

6.3.2 Exponential Function

In the algorithm it is necessary to compute e^x to be able to use a probability calculated in the logarithmic domain.

One idea of how to implement this is to use a similar approach as in Chapter 5.4, with precomputations coupled with a constrained table lookup. This approach starts with rewriting the base of the calculations from e to 2:

e^x = e^(x · ln(2)/ln(2)) = 2^(x · 1/ln(2))   (6.2)

where 1/ln(2) can be precalculated. This rewrite can be further refined:

2^(x · 1/ln(2)) = 2^floor(x · 1/ln(2)) · 2^(x · 1/ln(2) − floor(x · 1/ln(2)))   (6.3)

If y = x · 1/ln(2) is defined, Equation 6.3 becomes

2^y = 2^floor(y) · 2^(y − floor(y))   (6.4)

where 2^floor(y) can be implemented with a simple binary decoder, while 2^(y − floor(y)) can be precomputed and stored in a lookup table, with y − floor(y) ranging from 0 to 1.

If this approach does not provide enough accuracy, Tang's method described in [Muller, 1997] can be investigated instead. The drawback is that this method is described for floating point numbers.
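A Matlab sketch of Equations 6.2-6.4 follows; the 8-bit table index mirrors the Jacobi table and is an assumption, not a decided wordlength.

% Evaluate e^x via 2^floor(y) * 2^(y - floor(y)), y = x * 1/ln(2).
fb  = 8;
lut = 2.^((0:2^fb - 1) / 2^fb);        % 2^f for f in [0, 1)
x   = -3.7;                            % example input
y   = x * (1 / log(2));                % multiply by precalculated 1/ln(2)
k   = floor(y);                        % integer part: binary decoder/shifter
f   = y - k;                           % fractional part in [0, 1)
e   = 2^k * lut(floor(f * 2^fb) + 1);  % 2^floor(y) * 2^(y - floor(y))
err = abs(e - exp(x));                 % table quantization dominates the error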

6.3.3 Additional Matrix Operations

Not only matrix multiplication is needed in the SUMIS algorithm, but also addition and subtraction. These operations can be performed in the same manner as the matrix multiplication in Chapter 5.2, with a generated IP block from Xilinx. These modules will have the same interface as the matrix multiplication but might have a smaller latency, since addition and subtraction are not as computationally demanding as multiplication.

As described in Chapter 2.3.3, matrix inversions of dimension n_s are also needed. If n_s is small, for instance 2, there exist closed formulas for the matrix inverse.

An inversion of a matrix of dimension 2 can be described by

[a b; c d]^(−1) = (1 / (ad − bc)) · [d −b; −c a]   (6.5)

iff ad − bc ≠ 0, as explained in [Strang, 2009].
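As a sketch in Matlab, Equation 6.5 amounts to four multiplications, one reciprocal and a sign swap:

function Ainv = inv2x2(A)
% Closed-form inverse of a 2x2 matrix, Equation 6.5.
det2 = A(1,1)*A(2,2) - A(1,2)*A(2,1);
assert(det2 ~= 0, 'matrix is singular');
Ainv = [A(2,2), -A(1,2); -A(2,1), A(1,1)] / det2;
end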

6.3.4 Control Structure

As of now, separate modules have been described that can solve subproblems of the SUMIS algorithm. To implement a working solution, these modules have to be coordinated and utilized, and it is necessary to provide an interface between the modules that are supposed to perform computations in sequence.

Some sections of the algorithm, such as the calculation of LLRs, require additional work: there only exists a computation unit implementing the Jacobi logarithm, not the complete structure including the necessary preprocessing.

6.4 Improvements

The implementation approach in this thesis can be improved in a couple of ways. These sections describe some of the possible improvements.

6.4.1 Hardware Time-Multiplexing and Control

The approach in Chapter 5.3.1, with minimized control logic, is not as hardware efficient as the approach in Chapter 5.3.3. The cost of more complex control logic could be mitigated by implementing some sort of decoding structure and storing the instructions in a block RAM. This would make the implementation behave more like an application specific instruction set processor with limited functionality.

The unroll factor discussed in Chapter 5.2.1 could be investigated further. If the matrix multiplication becomes the limiting factor in the computations, it should be investigated whether another unroll factor is more suitable. This would, for instance, allow smaller matrices to be multiplied almost completely in parallel at a reasonable interconnection cost.

6.4.2 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections of Chapter 5, different wordlengths have been used for the different modules. These wordlengths were chosen from simple Matlab simulations of subsections of the algorithm, and they could be refined further by running complete simulations of the algorithm.

More extensive simulations would allow the necessary wordlengths to be minimized while still providing enough precision for good performance. The drawback of such tuning is that it limits the reuse of components, since a multiplier cannot be shared between sections that mandate different wordlengths.

Because of these limitations on resource sharing, a floating point implementation might be of interest. Perhaps not a complete implementation of IEEE 754 double precision as described in [IEEE, 2008], but a custom format which allows for the necessary accuracy while still providing enough dynamic range to be used by all of the modules.

The reason why a floating point approach might be suitable is that it allows for more variations of the algorithm than a fixed point representation. For instance, the wordlengths needed in the matrix inversion are very much dependent on the modulation scheme used as well as the noise level N0, and with a floating point implementation this dynamic range could be accommodated without any additional changes.

6.4.3 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high level synthesis. This method transforms a higher level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High level synthesis is not a quick solution that automatically transforms software into good hardware, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to evaluate a design approach, so software that can aid in this process would be very beneficial. More about high level synthesis can be found in [Coussy and Morawiec, 2008].

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and the decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse grained parallelism, with the detection divided over eight units that can operate simultaneously, which is very helpful for providing a high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable: changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, using only QR decompositions instead. It uses a fixed point representation with a constant wordlength in the whole design but allows for higher precision through dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that the same hardware can be used to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative implementation approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design with fixed point numbers if the wordlengths are to be minimized, since the magnitudes of the numbers involved in the algorithm differ between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to handle the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations, but it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits: since a complex multiplication requires four real multiplications and two additions, these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
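For illustration, with example values:

% Complex multiplication decomposed into real operations: four
% multiplications and two additions/subtractions, independent enough
% to be computed in parallel in a hardware pipeline.
a = 1.5; b = -0.25;                    % first operand:  a + bi
c = 0.75; d = 2.0;                     % second operand: c + di
re = a*c - b*d;                        % four real multiplications feed
im = a*d + b*c;                        % two parallel add/sub stages
assert(abs((re + 1i*im) - (a + 1i*b)*(c + 1i*d)) < 1e-12);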

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors as in [Chu and McAllister, 2012] and [Eilert et al., 2008] and describe the operations in a program memory. Since the processor architecture could be completely custom, suitable operations for it to perform could be chosen freely; these could be more complex than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is also possible to have multiple small processors working in parallel, which is favorable since the workload will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is needed to provide great performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm over a simpler one are not as substantial.

If the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000. IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 41: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

6Result and Analysis

This chapter describes the result from the implementation both the accuracy ofthe computations as well as the resource usage The result is discussed and theapproach taken in this thesis is also compared with other implementations andapproaches to see what remains until a complete implementation of SUMIS canbe obtained

61 Testing and Measurements

The modules were tested with input data generated using Matlab All of themodules were simulated in ModelSim using this input data and the result wasobtained The result of these computations were then imported into Matlab andwas compared and verified with the expected output

This was performed to ensure correct functionality and to be able to determinehow accurate the hardware was compared to ideal computations performed withdouble precision floating-point numbers Descriptions of the accuracy are pre-sented in the following sections

The error presented was acquired using randomized input data and observingthe largest individual error in the output elements Multiple simulations wererun to ensure that the maximum error was likely to be observed

611 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 00002which directly corresponds to an accuracy of 12 fractional bits which was chosenas the output fractional wordlength It would be possible to achieve a higher ac-

35

36 6 Result and Analysis

curacy by allowing for more bits in the result but the limiting factor might be ifthe module where the results are used only utilizes fewer bits

612 LDLT Decomposition

While testing the decomposition module a maximum error of approximately 0001was observed and this corresponds to an accuracy of 10 fractional bits Since thealgorithm operates on columns from left to right and uses the intermediate re-sults the error accumulates when moving to columns far right The accuracy inthe leftmost columns would correspond to 14 fractional bits

613 Forward Substitution

The forward substitution module had a maximum error of approximately 000002which corresponds to an accuracy of 15 fractional bits All of the computationswhere performed using 3 integer bits and 15 fractional bits so this accuracy wasexpected To allow for a higher precision the computations would need to beperformed with more bits

614 Jacobi Logarithm

The error observed when testing the Jacobi module was very small approximately00002 which indicates that the implementation has negligible accuracy loss andthe achieved accuracy corresponds to the chosen input wordlength of 12 frac-tional bits The precision of the lookup table does not affect the result in anymeaningful way mostly because the table is very limited in range from 0 to log(2)and still has a step size of 2minus8

62 Resource Usage

The following sections will describe the resource usage for each individual mod-ule that was implemented The interesting resources are primarily LUTs flip-flops DSP48E1 and block RAMs

It is also interesting to note how high frequency the modules can operate at Thisis described along side with a description of the critical path of the module whichdictates what the maximum frequency can be

621 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen inTable 61 This implementation computes HTH as described in Chapter 523

62 Resource Usage 37

Resource Used Total PercentageFlip-flops 3024 301440 10 LUTs 1459 150720 10 Block RAM (36 Kb) 10 416 24 DSP48E1 8 768 10

Table 61 Resource usage of the matrix multiplication unit

The maximum operation frequency of the matrix multiplication is 234 MHz Thecritical path is not in the matrix multiplication IP block itself but instead in therounding of the result The result from the IP block is 40 bits wide and has tobe rounded and saturated to fit in 18 bits Without this rounding the operatingfrequency would be higher since the IP block is highly optimized for a high clockfrequency The reason why rounding is so expensive is that the rounding andsaturation examines all of the bits to determine a suitable rounding and this isnot quite as optimized in an FPGA as a regular adder

622 Matrix Inversion

This section described the resource usage of the components in the matrix inver-sion

LDLT Decomposition

The resource usage for the LDLT decomposition including the reciprocal unit canbe seen in Table 62

Resource Used Total PercentageFlip-flops 831 301440 lt 1 LUTs 1802 150720 12 Block RAM (36 Kb) 9 416 22 DSP48E1 19 768 24

Table 62 Resource usage of the LDLT decomposition unit

The maximum operation frequency of the LDLT decomposition is 101 MHz Thereason for this quite low operating frequency is the rounding of the numbersfrom the adder tree and the multiplication with the reciprocal value This round-ing could favorably be pipelined and thus allow for a higher frequency It couldalso be investigated if earlier rounding is possible to avoid excessive bit growththat has to be taken into account and result in a large multiplier for the multipli-cation with the reciprocal value

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 63

38 6 Result and Analysis

Resource Used Total PercentageFlip-flops 30 301440 lt 1 LUTs 124 150720 lt 1 Block RAM (36 Kb) 2 416 lt 1 DSP48E1 1 768 lt 1

Table 63 Resource usage of the forward substitution unit

The maximum operation frequency is 1665 MHz The limiting factor in this mod-ule is the minimal use of pipelining registers inside the DSP48E1 block With anadjusted FSM and enabled pipeline registers the frequency could be increasedgreatly

623 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 64 The unitis fully pipelined and can produce a new result every clock cycle with an initiallatency of 7 clock cycles

Resource Used Total PercentageFlip-flops 180 301440 lt 1 LUTs 156 150720 lt 1 Block RAM (36 Kb) 1 416 lt 1 DSP48E1 0 768 0

Table 64 Resource usage of the Jacobi logarithm unit

The maximum frequency which the unit can operate at is 3165 MHz The criticalpath that determines the maximum operating frequency is located in the round-ing of the computed result Since the output from the addition is the one withthe most number of bits in the Jacobi logarithm unit it is natural that this is therounding that has the longest critical path

63 Remaining Work

This section contains a description of the remaining work necessary to completea proof-of-concept implementation of SUMIS given the design choices alreadymade in this thesis

631 Hyperbolic Tangent

In Equation 26 the function tanh is used One area efficient way of calculatingtanh with good accuracy is to use the CORDIC algorithm described in [Muller1997] The CORDIC algorithm only requires a small lookup table shifters andadders and operates by performing successive rotations by predefined angles Un-

63 Remaining Work 39

fortunately it is not possible to calculate tanh directly but since

tanh(x) =sinh(x)cosh(x)

(61)

where both sinh and cosh can be calculated using a CORDIC block it is possibleto produce tanh using two separate CORDIC blocks

632 Exponential Function

In the algorithm it is necessary to compute ex to be able to use a probabilitycalculated in the logarithmic domain

One idea of how to implement this is to use a similar approach as in Chapter 54with precomputations coupled with a constrained table lookup This approachstarts with rewriting the base of the calculations from e to 2 with

ex = exlowast ln(2)

ln(2) = 2xlowast1

ln(2) (62)

where 1ln(2) can be precalculated This rewrite can be further refined with

2xlowast1

ln(2) = 2f loor(xlowast1

ln(2) ) lowast 2xlowast1

ln(2)minusf loor(xlowast1

ln(2) ) (63)

If y = x lowast 1ln(2) is defined Equation 63 becomes

2y = 2f loor(y) lowast 2yminusf loor(y) (64)

where 2f loor(y) can be implemented with a simple binary decoder while 2yminusf loor(y)

can be precomputed and stored in a lookup table with y minus f loor(y) ranging from0 to 1

If this approach does not provide enough accuracy Tangrsquos method described in[Muller 1997] can be further investigated instead The drawback is that themethod is described for floating point numbers

633 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm butalso addition and subtraction These operations can be performed in the samemanner as for the matrix multiplication in Chapter 52 with a generated IP blockfrom Xilinx These modules will have the same interface as the matrix multipli-cation but might have a smaller latency since addition and subtraction are not ascomputationally demanding as multiplication

As described in Chapter 233 matrix inversions of dimension ns are also neededIf ns is small for instance 2 there exists closed formulas for the matrix inverse

An inversion of a matrix of dimension 2 can be described by[a bc d

]minus1

=1

ad minus bc

[d minusbminusc a

] (65)

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for usage in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable, and changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, relying on QR decompositions only. It uses a fixed point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

The detectors described so far have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches to an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.
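To make the idea concrete, the sketch below models a hypothetical 18-bit custom format in C with one sign bit, a 6-bit biased exponent, and an 11-bit mantissa with a hidden leading one. The field widths are assumptions chosen to match the 18-bit datapaths used in this thesis, not a format defined by any of the cited works, and exponent overflow/underflow handling is omitted.

    #include <stdint.h>
    #include <math.h>

    /* Hypothetical 18-bit custom float: 1 sign, 6 exponent (bias 31),
       11 mantissa bits; stored in the low 18 bits of a 32-bit word. */
    typedef uint32_t cfloat18;

    cfloat18 cf18_pack(double x) {
        if (x == 0.0) return 0;
        uint32_t sign = x < 0.0;
        int e;
        double m = frexp(fabs(x), &e);                  /* x = m * 2^e, m in [0.5, 1) */
        uint32_t mant = (uint32_t)(m * 4096.0) & 0x7FF; /* drop the hidden bit */
        uint32_t exp = (uint32_t)(e + 31) & 0x3F;       /* biased, unchecked */
        return (sign << 17) | (exp << 11) | mant;
    }

    double cf18_unpack(cfloat18 f) {
        if (f == 0) return 0.0;
        double m = 0.5 + (double)(f & 0x7FF) / 4096.0;  /* restore hidden bit */
        int e = (int)((f >> 11) & 0x3F) - 31;
        double x = ldexp(m, e);
        return (f >> 17) & 1 ? -x : x;
    }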

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.

As of now the SUMIS algorithm is described with real operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable in cost to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
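The decomposition referred to above is the schoolbook complex product, shown here as a minimal C model; the fixed-point rounding a hardware version would need is left out.

    #include <stdint.h>

    /* Schoolbook complex multiplication: four real multiplications and two
       additions/subtractions. The four products are mutually independent, so
       hardware can compute them in parallel and pipeline the adder stage. */
    typedef struct { int32_t re, im; } cpx;

    cpx cmul(cpx a, cpx b) {
        cpx p;
        p.re = a.re * b.re - a.im * b.im;
        p.im = a.re * b.im + a.im * b.re;
        return p;
    }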

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small sub-matrices and so on; a sketch of what such an instruction set could look like is given below.
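Purely as a sketch of what such a program-memory encoding could look like, the C fragment below defines a hypothetical instruction word and decode loop; every opcode, field, and width is invented for illustration.

    #include <stdint.h>

    /* Hypothetical instruction word for a small matrix processor. */
    enum { OP_MAC, OP_SUBMAT_ADD, OP_JACOBI, OP_RECIP };

    typedef struct {
        uint8_t opcode;          /* operation selector, e.g. OP_MAC */
        uint8_t dst, srcA, srcB; /* register-file / local-memory indices */
    } instr;

    void dispatch(const instr *prog, int n) {
        for (int pc = 0; pc < n; pc++) {
            switch (prog[pc].opcode) {
            case OP_MAC:        /* drive the multiply-accumulate datapath */ break;
            case OP_SUBMAT_ADD: /* element-wise add of a small sub-matrix */ break;
            case OP_JACOBI:     /* forward to a Jacobi logarithm unit     */ break;
            case OP_RECIP:      /* reciprocal co-processor                */ break;
            }
        }
    }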

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.
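A behavioral C model of that operation is sketched below: the Jacobi logarithm computes ln(e^a + e^b) as max(a, b) plus a correction term ln(1 + e^-|a-b|) that hardware reads from a small lookup table. The table resolution chosen here is an assumption for illustration, not the one used in Chapter 5.4.

    #include <math.h>

    /* Correction term ln(1 + exp(-d)) for d = |a - b|, quantized with an
       assumed step of 2^-5; the term lies in (0, ln 2] and decays quickly. */
    static double lut[256];

    void jacobi_init(void) { /* call once before jacobi() */
        for (int i = 0; i < 256; i++)
            lut[i] = log(1.0 + exp(-i / 32.0));
    }

    double jacobi(double a, double b) {
        double d = fabs(a - b);
        int idx = (int)(d * 32.0);
        double corr = idx < 256 ? lut[idx] : 0.0; /* negligible for large d */
        return (a > b ? a : b) + corr;
    }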

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector as an IP block for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that need to be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm over a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J-A. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson



Page 43: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

62 Resource Usage 37

Resource Used Total PercentageFlip-flops 3024 301440 10 LUTs 1459 150720 10 Block RAM (36 Kb) 10 416 24 DSP48E1 8 768 10

Table 61 Resource usage of the matrix multiplication unit

The maximum operation frequency of the matrix multiplication is 234 MHz Thecritical path is not in the matrix multiplication IP block itself but instead in therounding of the result The result from the IP block is 40 bits wide and has tobe rounded and saturated to fit in 18 bits Without this rounding the operatingfrequency would be higher since the IP block is highly optimized for a high clockfrequency The reason why rounding is so expensive is that the rounding andsaturation examines all of the bits to determine a suitable rounding and this isnot quite as optimized in an FPGA as a regular adder

622 Matrix Inversion

This section described the resource usage of the components in the matrix inver-sion

LDLT Decomposition

The resource usage for the LDLT decomposition including the reciprocal unit canbe seen in Table 62

Resource Used Total PercentageFlip-flops 831 301440 lt 1 LUTs 1802 150720 12 Block RAM (36 Kb) 9 416 22 DSP48E1 19 768 24

Table 62 Resource usage of the LDLT decomposition unit

The maximum operation frequency of the LDLT decomposition is 101 MHz Thereason for this quite low operating frequency is the rounding of the numbersfrom the adder tree and the multiplication with the reciprocal value This round-ing could favorably be pipelined and thus allow for a higher frequency It couldalso be investigated if earlier rounding is possible to avoid excessive bit growththat has to be taken into account and result in a large multiplier for the multipli-cation with the reciprocal value

Forward Substitution

The resource usage for the forward substitution unit can be seen in Table 63

38 6 Result and Analysis

Resource Used Total PercentageFlip-flops 30 301440 lt 1 LUTs 124 150720 lt 1 Block RAM (36 Kb) 2 416 lt 1 DSP48E1 1 768 lt 1

Table 63 Resource usage of the forward substitution unit

The maximum operation frequency is 1665 MHz The limiting factor in this mod-ule is the minimal use of pipelining registers inside the DSP48E1 block With anadjusted FSM and enabled pipeline registers the frequency could be increasedgreatly

623 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 64 The unitis fully pipelined and can produce a new result every clock cycle with an initiallatency of 7 clock cycles

Resource Used Total PercentageFlip-flops 180 301440 lt 1 LUTs 156 150720 lt 1 Block RAM (36 Kb) 1 416 lt 1 DSP48E1 0 768 0

Table 64 Resource usage of the Jacobi logarithm unit

The maximum frequency which the unit can operate at is 3165 MHz The criticalpath that determines the maximum operating frequency is located in the round-ing of the computed result Since the output from the addition is the one withthe most number of bits in the Jacobi logarithm unit it is natural that this is therounding that has the longest critical path

63 Remaining Work

This section contains a description of the remaining work necessary to completea proof-of-concept implementation of SUMIS given the design choices alreadymade in this thesis

631 Hyperbolic Tangent

In Equation 26 the function tanh is used One area efficient way of calculatingtanh with good accuracy is to use the CORDIC algorithm described in [Muller1997] The CORDIC algorithm only requires a small lookup table shifters andadders and operates by performing successive rotations by predefined angles Un-

63 Remaining Work 39

fortunately it is not possible to calculate tanh directly but since

tanh(x) =sinh(x)cosh(x)

(61)

where both sinh and cosh can be calculated using a CORDIC block it is possibleto produce tanh using two separate CORDIC blocks

632 Exponential Function

In the algorithm it is necessary to compute ex to be able to use a probabilitycalculated in the logarithmic domain

One idea of how to implement this is to use a similar approach as in Chapter 54with precomputations coupled with a constrained table lookup This approachstarts with rewriting the base of the calculations from e to 2 with

ex = exlowast ln(2)

ln(2) = 2xlowast1

ln(2) (62)

where 1ln(2) can be precalculated This rewrite can be further refined with

2xlowast1

ln(2) = 2f loor(xlowast1

ln(2) ) lowast 2xlowast1

ln(2)minusf loor(xlowast1

ln(2) ) (63)

If y = x lowast 1ln(2) is defined Equation 63 becomes

2y = 2f loor(y) lowast 2yminusf loor(y) (64)

where 2f loor(y) can be implemented with a simple binary decoder while 2yminusf loor(y)

can be precomputed and stored in a lookup table with y minus f loor(y) ranging from0 to 1

If this approach does not provide enough accuracy Tangrsquos method described in[Muller 1997] can be further investigated instead The drawback is that themethod is described for floating point numbers

633 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm butalso addition and subtraction These operations can be performed in the samemanner as for the matrix multiplication in Chapter 52 with a generated IP blockfrom Xilinx These modules will have the same interface as the matrix multipli-cation but might have a smaller latency since addition and subtraction are not ascomputationally demanding as multiplication

As described in Chapter 233 matrix inversions of dimension ns are also neededIf ns is small for instance 2 there exists closed formulas for the matrix inverse

An inversion of a matrix of dimension 2 can be described by[a bc d

]minus1

=1

ad minus bc

[d minusbminusc a

] (65)

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules


The reason why a floating-point approach might be suitable is that it allows for more variation in the algorithm than a fixed-point representation. For instance, the wordlengths needed in the matrix inversion depend strongly on the modulation scheme used as well as on the noise level $N_0$, and with a floating-point implementation this dynamic range could be accommodated without any additional changes.
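
To make the idea concrete, the sketch below rounds a value to a custom floating-point format with a parameterized mantissa width, which is the property that decides whether one format can serve all modules; exponent-range handling (overflow, subnormals) is deliberately left out, so this is a model and not a format definition from this thesis.

```python
import math

def round_to_custom_float(x: float, mantissa_bits: int) -> float:
    """Keep only 'mantissa_bits' bits of significand, mimicking a custom
    floating-point format with unconstrained exponent range."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2^e with 0.5 <= |m| < 1
    scale = 2 ** mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

# round_to_custom_float(math.pi, 8) -> 3.140625: the relative error stays
# bounded regardless of magnitude, unlike with a fixed-point grid.
```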

6.4.3 Design Space Exploration using High-Level Synthesis

Since the SUMIS algorithm is computationally demanding, it is a suitable candidate for high-level synthesis. This method transforms a higher-level software model of a problem into RTL hardware that can be synthesised given a set of constraints. It allows for design space exploration, where different sets of constraints, such as the number of multipliers or the latency, are predetermined and software tries to schedule the operations to fulfill these constraints.

High-level synthesis is not a quick solution for automatically transforming software into hardware with good results, but rather a tool that can be used to test different approaches. Writing VHDL to describe the order of operations is time consuming yet necessary to be able to evaluate a design approach, so software that can aid in this process would be very beneficial. More about high-level synthesis can be found in [Coussy and Morawiec, 2008].
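
The scheduling principle behind such exploration can be illustrated with a toy list scheduler that places multiplications under a multiplier budget; this sketches the principle only, not the behavior of any particular HLS tool.

```python
def list_schedule(ops, deps, n_mults):
    """Greedy list scheduling: place each operation in the earliest cycle
    where its dependencies are done and a multiplier is free.
    'ops' must be given in topological order with respect to 'deps'."""
    cycle_of, used = {}, {}
    for op in ops:
        earliest = 1 + max((cycle_of[d] for d in deps.get(op, ())), default=-1)
        c = earliest
        while used.get(c, 0) >= n_mults:   # resource constraint per cycle
            c += 1
        used[c] = used.get(c, 0) + 1
        cycle_of[op] = c
    return cycle_of

# Four independent multiplies finish in 2 cycles with 2 multipliers but
# need 4 cycles with 1: the area/latency trade-off that design space
# exploration sweeps automatically.
```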

6.5 Alternative Approaches and Comparison

During the thesis work, a number of hardware implementations of MIMO detectors were studied. In this section these implementations are presented, and in Chapter 6.6 the insights from these approaches are discussed to show how a future implementation of SUMIS could be performed.

One of the first implementations investigated was the one described in [Studer et al., 2011]. It is not quite comparable to the SUMIS implementation considered in this thesis, since it is an ASIC implementation and supports iterative decoding. Iterative decoding means that the detector and decoder cooperate and exchange information to determine the correct data iteratively.

The detector in [Studer et al., 2011] uses coarse-grained parallelism, with the detection divided into eight units that can operate simultaneously, and this is very helpful for providing high throughput. The processing units are quite similar to the work described in Chapter 5.3.3, with a number of computation units controlled by an FSM. The design works with complex elements represented by fixed-point numbers, with various wordlengths in different parts of the detector.

The detector in [Chu and McAllister, 2012] employs a detection algorithm that consists of a tree search called sphere decoding. It is intended for use in an FPGA and is built of a large collection of small programmable processors. One great advantage of this approach is that the design is software programmable: changes can be made to the algorithm without changing the hardware, simply by replacing the software running on the small processors. The same fixed-point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors, since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, relying on QR decompositions only. It uses a fixed-point representation with a constant wordlength in the whole design, but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decomposition provides structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed-point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating-point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations; as described in [Eilert et al., 2008], a minor addition of hardware makes it possible to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.

If a fixed-point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.
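
Dynamic scaling can be modeled as block scaling, where a whole vector shares one power-of-two shift chosen from its largest element so that the available integer range is used well. The 16-bit word below is an assumed width, and the code is a behavioral sketch rather than a description of the scheme in [Kim et al., 2008].

```python
def block_scale(values, word_bits=16):
    """Quantize a block to integers sharing one power-of-two scale:
    returns (ints, shift) such that each value ~= int * 2^shift."""
    peak = max(abs(v) for v in values)
    limit = 2 ** (word_bits - 1) - 1          # 32767 for 16 bits
    shift = 0
    while peak * 2.0 ** (-shift) > limit:     # too large: scale down
        shift += 1
    while peak != 0 and peak * 2.0 ** (-(shift - 1)) <= limit:
        shift -= 1                            # room left: scale up
    ints = [round(v * 2.0 ** (-shift)) for v in values]
    return ints, shift
```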

As of now the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits, since, for instance, a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
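
The statement about complex arithmetic corresponds to the familiar decomposition below. In hardware the four real products are independent and can form one parallel multiplier stage, with the two additions as a following pipeline stage; the sketch documents the arithmetic only, not the timing.

```python
def cmul(ar: float, ai: float, br: float, bi: float):
    """(ar + j*ai)(br + j*bi): four real multiplications, two additions."""
    p1, p2, p3, p4 = ar * br, ai * bi, ar * bi, ai * br  # parallel stage
    return p1 - p2, p3 + p4                              # combining stage
```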

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as addition of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.
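
The operation such a unit accelerates is the Jacobi logarithm, $\ln(e^a + e^b) = \max(a, b) + \ln(1 + e^{-|a-b|})$, where the correction term is tabulated. The sketch below mirrors that max-plus-lookup structure; the table resolution of 1/16 is an assumption and not the wordlength used in Chapter 5.4.

```python
import math

STEP = 1.0 / 16.0                      # assumed table resolution
CORR = [math.log1p(math.exp(-i * STEP)) for i in range(128)]  # ln(1 + e^-d)

def jacobi_log(a: float, b: float) -> float:
    """ln(e^a + e^b) as max(a, b) plus a tabulated correction term."""
    d = abs(a - b)
    idx = int(d / STEP)
    corr = CORR[idx] if idx < len(CORR) else 0.0  # correction ~ 0 for large d
    return max(a, b) + corr

# jacobi_log(0.0, 0.0) -> about 0.693 (= ln 2), matching the exact value.
```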

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of its multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems-on-chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.


Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to sell the detector for integration in larger systems-on-chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules, to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology an algorithm like SUMIS is necessary to provide good performance even at very poor SNR. At high SNR the benefits of an advanced detection algorithm compared to a simpler one are not as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318-1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017-6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232-3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396-5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1-58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1-94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172-176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397-3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453-458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734-6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754-1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet, or on its possible replacement, for a period of 25 years from the date of publication, barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 44: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

38 6 Result and Analysis

Resource Used Total PercentageFlip-flops 30 301440 lt 1 LUTs 124 150720 lt 1 Block RAM (36 Kb) 2 416 lt 1 DSP48E1 1 768 lt 1

Table 63 Resource usage of the forward substitution unit

The maximum operation frequency is 1665 MHz The limiting factor in this mod-ule is the minimal use of pipelining registers inside the DSP48E1 block With anadjusted FSM and enabled pipeline registers the frequency could be increasedgreatly

623 Jacobi Logarithm

The resource usage of the Jacobi logarithm unit can be seen in Table 64 The unitis fully pipelined and can produce a new result every clock cycle with an initiallatency of 7 clock cycles

Resource Used Total PercentageFlip-flops 180 301440 lt 1 LUTs 156 150720 lt 1 Block RAM (36 Kb) 1 416 lt 1 DSP48E1 0 768 0

Table 64 Resource usage of the Jacobi logarithm unit

The maximum frequency which the unit can operate at is 3165 MHz The criticalpath that determines the maximum operating frequency is located in the round-ing of the computed result Since the output from the addition is the one withthe most number of bits in the Jacobi logarithm unit it is natural that this is therounding that has the longest critical path

63 Remaining Work

This section contains a description of the remaining work necessary to completea proof-of-concept implementation of SUMIS given the design choices alreadymade in this thesis

631 Hyperbolic Tangent

In Equation 26 the function tanh is used One area efficient way of calculatingtanh with good accuracy is to use the CORDIC algorithm described in [Muller1997] The CORDIC algorithm only requires a small lookup table shifters andadders and operates by performing successive rotations by predefined angles Un-

63 Remaining Work 39

fortunately it is not possible to calculate tanh directly but since

tanh(x) =sinh(x)cosh(x)

(61)

where both sinh and cosh can be calculated using a CORDIC block it is possibleto produce tanh using two separate CORDIC blocks

632 Exponential Function

In the algorithm it is necessary to compute ex to be able to use a probabilitycalculated in the logarithmic domain

One idea of how to implement this is to use a similar approach as in Chapter 54with precomputations coupled with a constrained table lookup This approachstarts with rewriting the base of the calculations from e to 2 with

ex = exlowast ln(2)

ln(2) = 2xlowast1

ln(2) (62)

where 1ln(2) can be precalculated This rewrite can be further refined with

2xlowast1

ln(2) = 2f loor(xlowast1

ln(2) ) lowast 2xlowast1

ln(2)minusf loor(xlowast1

ln(2) ) (63)

If y = x lowast 1ln(2) is defined Equation 63 becomes

2y = 2f loor(y) lowast 2yminusf loor(y) (64)

where 2f loor(y) can be implemented with a simple binary decoder while 2yminusf loor(y)

can be precomputed and stored in a lookup table with y minus f loor(y) ranging from0 to 1

If this approach does not provide enough accuracy Tangrsquos method described in[Muller 1997] can be further investigated instead The drawback is that themethod is described for floating point numbers

633 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm butalso addition and subtraction These operations can be performed in the samemanner as for the matrix multiplication in Chapter 52 with a generated IP blockfrom Xilinx These modules will have the same interface as the matrix multipli-cation but might have a smaller latency since addition and subtraction are not ascomputationally demanding as multiplication

As described in Chapter 233 matrix inversions of dimension ns are also neededIf ns is small for instance 2 there exists closed formulas for the matrix inverse

An inversion of a matrix of dimension 2 can be described by[a bc d

]minus1

=1

ad minus bc

[d minusbminusc a

] (65)

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

As of now the described detectors has all been soft MIMO detectors using a fixedpoint number representation The wildcard in this section is [Eilert et al 2008]which does not perform soft detection and utilizes a custom floating point num-ber representation The reason why this detector is still interesting for this thesisis because of its programmable nature

Instead of developing a mainly fixed function hardware solution that will solvethe detection problem in [Eilert et al 2008] a complete processor architecture hasbeen developed that is capable of performing detection among other problemsThe processor architecture contains multiple floating point arithmetic units capa-ble of performing commonly used complex valued operations such as multiply-add and absolute value The great advantage of an approach like this is that it ispossible to use the same hardware to perform other calculations As described in[Eilert et al 2008] it is possible with a minor addition of hardware to calculate a512-point FFT efficiently

66 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an imple-mentation described in Chapter 65 These sections aim to provide some discus-sion about how a future implementation of SUMIS could be carried out

661 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengthsshall be minimized since the magnitude of the numbers involved in the algorithmare not equal between the operations To be able to reuse as much components

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

662 Processor Architecture

To achieve a compact solution it would be favorable to design small processorsas in [Chu and McAllister 2012] and [Eilert et al 2008] and describe the opera-tions in a program memory Since the processor architecture could be completelycustom it would be possible to choose suitable operations that it can perform Itcould be more complex operations than ordinary multiplications and additionssuch as addition of small sub matrices and so on

Each processor could be equipped with custom hardware such as the Jacobi loga-rithm unit which would accelerate operations performed on probabilities in thelogarithmic domain

663 Flexibility

If a custom processor architecture as described in Chapter 662 were designed itwould allow for a high degree of flexibility It would be possible to have differ-ent subprograms for different modulation schemes and thus support all commonmodulations with the possibility to add even more

It is possible to have multiple small processors working in parallel and this is fa-vorable since the workload of detection will increase if the detection is performedin an OFDM system because of multiple carriers

664 Integration

Since devices that are using wireless technology are shrinking more and more theneed for integrated solutions is increasing rapidly In customer appliances it ismore common with larger system on chips than individual ASICs interconnectedand therefore it would be suitable to package a SUMIS detector as an IP blockrather than fabricating a custom ASIC

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 45: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

63 Remaining Work 39

fortunately it is not possible to calculate tanh directly but since

tanh(x) =sinh(x)cosh(x)

(61)

where both sinh and cosh can be calculated using a CORDIC block it is possibleto produce tanh using two separate CORDIC blocks

632 Exponential Function

In the algorithm it is necessary to compute ex to be able to use a probabilitycalculated in the logarithmic domain

One idea of how to implement this is to use a similar approach as in Chapter 54with precomputations coupled with a constrained table lookup This approachstarts with rewriting the base of the calculations from e to 2 with

ex = exlowast ln(2)

ln(2) = 2xlowast1

ln(2) (62)

where 1ln(2) can be precalculated This rewrite can be further refined with

2xlowast1

ln(2) = 2f loor(xlowast1

ln(2) ) lowast 2xlowast1

ln(2)minusf loor(xlowast1

ln(2) ) (63)

If y = x lowast 1ln(2) is defined Equation 63 becomes

2y = 2f loor(y) lowast 2yminusf loor(y) (64)

where 2f loor(y) can be implemented with a simple binary decoder while 2yminusf loor(y)

can be precomputed and stored in a lookup table with y minus f loor(y) ranging from0 to 1

If this approach does not provide enough accuracy Tangrsquos method described in[Muller 1997] can be further investigated instead The drawback is that themethod is described for floating point numbers

633 Additional Matrix Operations

It is not only matrix multiplication that is needed in the SUMIS algorithm butalso addition and subtraction These operations can be performed in the samemanner as for the matrix multiplication in Chapter 52 with a generated IP blockfrom Xilinx These modules will have the same interface as the matrix multipli-cation but might have a smaller latency since addition and subtraction are not ascomputationally demanding as multiplication

As described in Chapter 233 matrix inversions of dimension ns are also neededIf ns is small for instance 2 there exists closed formulas for the matrix inverse

An inversion of a matrix of dimension 2 can be described by[a bc d

]minus1

=1

ad minus bc

[d minusbminusc a

] (65)

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

As of now the described detectors has all been soft MIMO detectors using a fixedpoint number representation The wildcard in this section is [Eilert et al 2008]which does not perform soft detection and utilizes a custom floating point num-ber representation The reason why this detector is still interesting for this thesisis because of its programmable nature

Instead of developing a mainly fixed-function hardware solution to the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating-point arithmetic units capable of performing commonly used complex-valued operations such as multiply-add and absolute value. The great advantage of such an approach is that the same hardware can be used for other calculations; as described in [Eilert et al., 2008], a minor addition of hardware makes it possible to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design with fixed-point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm differ between the operations. To be able to reuse as many components as possible, a custom floating-point representation could be used. Each operation would require more area to handle the representation, but if operations could easily be shared it would lead to a lower overall area; a minimal sketch of what such a format could look like is given below.
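The VHDL fragment below sketches one possible custom floating-point format. The package name, the field widths and the bias are assumptions chosen for illustration only; the thesis does not prescribe a particular format.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical custom floating-point format: narrower than IEEE 754
-- single precision, but with enough dynamic range to be shared by
-- all modules. The widths below are illustrative, not mandated.
package custom_float_pkg is
  constant EXP_W  : positive := 6;   -- exponent width (assumed bias 31)
  constant FRAC_W : positive := 11;  -- fraction width (hidden one)

  type cfloat is record
    sign : std_logic;
    exp  : unsigned(EXP_W-1 downto 0);
    frac : unsigned(FRAC_W-1 downto 0);
  end record;

  -- Pack/unpack helpers so the format can travel through ordinary
  -- std_logic_vector ports and memories.
  function to_slv(x : cfloat) return std_logic_vector;
  function from_slv(v : std_logic_vector(EXP_W+FRAC_W downto 0)) return cfloat;
end package;

package body custom_float_pkg is
  function to_slv(x : cfloat) return std_logic_vector is
  begin
    return x.sign & std_logic_vector(x.exp) & std_logic_vector(x.frac);
  end function;

  function from_slv(v : std_logic_vector(EXP_W+FRAC_W downto 0)) return cfloat is
    variable r : cfloat;
  begin
    r.sign := v(v'high);
    r.exp  := unsigned(v(v'high-1 downto FRAC_W));
    r.frac := unsigned(v(FRAC_W-1 downto 0));
    return r;
  end function;
end package body;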

If a fixed-point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.
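The following sketch shows one way such dynamic scaling could work: a magnitude is normalized by shifting out leading zeros, and the applied shift is reported so it can act as a shared block exponent. The entity and port names are invented for this example.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical block-scaling helper: normalizes an unsigned
-- magnitude so its MSB becomes one and reports the left shift,
-- which can serve as a shared block exponent for a whole vector.
entity normalize is
  generic (WIDTH : positive := 16);
  port (
    x     : in  unsigned(WIDTH-1 downto 0);
    y     : out unsigned(WIDTH-1 downto 0);
    shift : out natural range 0 to WIDTH
  );
end entity;

architecture comb of normalize is
begin
  process(x)
    variable n : natural range 0 to WIDTH;
  begin
    n := WIDTH;                      -- all-zero input => maximum shift
    for i in WIDTH-1 downto 0 loop   -- locate the most significant one
      if x(i) = '1' then
        n := WIDTH-1 - i;
        exit;
      end if;
    end loop;
    y     <= shift_left(x, n);       -- normalized magnitude
    shift <= n;
  end process;
end architecture;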

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex-valued model instead. For a pure software implementation this would probably not provide any benefit, since a complex multiplication requires four real multiplications and two additions, which makes a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and pipelined, allowing a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
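A minimal VHDL sketch of such a pipelined complex multiplier is shown below, computing (a_re + j*a_im)(b_re + j*b_im) with the four real products formed in parallel in the first stage and the two sums in the second. The entity name and wordlengths are assumptions made for this illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-stage pipelined complex multiplier: stage 1
-- computes the four real products in parallel, stage 2 forms the
-- real and imaginary sums. After the two-cycle latency, one new
-- result is produced every clock cycle.
entity cmul is
  generic (W : positive := 16);
  port (
    clk        : in  std_logic;
    a_re, a_im : in  signed(W-1 downto 0);
    b_re, b_im : in  signed(W-1 downto 0);
    p_re, p_im : out signed(2*W downto 0)  -- full precision plus carry
  );
end entity;

architecture rtl of cmul is
  signal ac, bd, ad, bc : signed(2*W-1 downto 0);
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- stage 1: four real multiplications in parallel
      ac <= a_re * b_re;
      bd <= a_im * b_im;
      ad <= a_re * b_im;
      bc <= a_im * b_re;
      -- stage 2: two real additions/subtractions on the
      -- previous cycle's products
      p_re <= resize(ac, 2*W+1) - resize(bd, 2*W+1);
      p_im <= resize(ad, 2*W+1) + resize(bc, 2*W+1);
    end if;
  end process;
end architecture;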

6.6.2 Processor Architecture

To achieve a compact solution, it would be favorable to design small processors as in [Chu and McAllister, 2012] and [Eilert et al., 2008] and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform; these could be more complex than ordinary multiplications and additions, such as addition of small sub-matrices, as sketched below.
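As a purely hypothetical illustration, the package below enumerates the kind of instruction set such a processor could decode; all operation names are invented for this sketch.

-- Hypothetical instruction set for a small detector processor:
-- besides ordinary arithmetic, coarser matrix-level operations
-- are encoded as single instructions.
package detector_isa_pkg is
  type opcode_t is (
    OP_NOP,         -- no operation
    OP_MUL,         -- ordinary multiply
    OP_MAC,         -- multiply-accumulate
    OP_SUBMAT_ADD,  -- add a small sub-matrix in one instruction
    OP_FWD_SUBST,   -- one forward-substitution step
    OP_JACOBI       -- Jacobi logarithm of two log-domain values
  );
end package;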

Each processor could also be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.
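One possible structure for such a unit is sketched below, using the identity ln(e^a + e^b) = max(a, b) + ln(1 + e^{-|a-b|}): a comparison plus a small correction term read from a lookup table. The wordlengths, table size and entity name are assumptions, and the unit implemented in Chapter 5.4 may be organized differently.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.math_real.all;

-- Hypothetical Jacobi logarithm unit for log-domain probabilities:
-- jac(a,b) = max(a,b) + ln(1 + exp(-|a-b|)). Values are fixed point
-- with FRAC fractional bits; the correction term is tabulated.
entity jacobi_log is
  generic (
    W    : positive := 12;  -- total wordlength
    FRAC : natural  := 4    -- fractional bits
  );
  port (
    clk  : in  std_logic;
    a, b : in  signed(W-1 downto 0);
    y    : out signed(W-1 downto 0)
  );
end entity;

architecture rtl of jacobi_log is
  constant LUT_SIZE : positive := 32;  -- beyond this, correction ~ 0
  type lut_t is array (0 to LUT_SIZE-1) of signed(W-1 downto 0);

  -- Fill the table at elaboration time from the exact expression.
  function init_lut return lut_t is
    variable t : lut_t;
    variable d : real;
  begin
    for i in t'range loop
      d    := real(i) / 2.0**FRAC;     -- the |a-b| value for entry i
      t(i) := to_signed(integer(round(log(1.0 + exp(-d)) * 2.0**FRAC)), W);
    end loop;
    return t;
  end function;

  constant CORR : lut_t := init_lut;
begin
  process(clk)
    variable diff : signed(W downto 0);
    variable mx   : signed(W-1 downto 0);
    variable idx  : natural;
  begin
    if rising_edge(clk) then
      diff := resize(a, W+1) - resize(b, W+1);
      if diff(diff'high) = '0' then    -- a >= b
        mx := a;
      else
        mx   := b;
        diff := -diff;                 -- diff = |a-b|
      end if;
      if diff < LUT_SIZE then
        idx := to_integer(diff);
      else
        idx := LUT_SIZE - 1;           -- correction is ~ 0 out here
      end if;
      y <= mx + CORR(idx);
    end if;
  end process;
end architecture;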

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes, and thus to support all common modulations with the possibility of adding even more.

It is also possible to have multiple small processors working in parallel, which is favorable since the detection workload will increase if detection is performed in an OFDM system, because of the multiple subcarriers.

6.6.4 Integration

Since devices using wireless technology keep shrinking, the need for integrated solutions is increasing rapidly. In consumer appliances, larger systems on chip are more common than individual interconnected ASICs, and it would therefore be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the detector into a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented in VHDL. Different approaches were taken for the individual modules to highlight implementation details that must be investigated when constructing a detector.

The remaining work is described in Chapter 6.3; in particular, it would be suitable to perform further simulations to determine what accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is needed to provide good performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm over a simpler one are not as substantial.

If the advice in Chapter 6.6 is taken into account, it will in the future be possible to construct a flexible, highly efficient implementation of SUMIS for use in contemporary wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, August 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000. IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E.G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.


Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

Page 46: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

40 6 Result and Analysis

iff ad minus bc 0 as explained in [Strang 2009]

634 Control Structure

As of now separate modules has been described that can solve subproblems ofthe SUMIS algorithm To implement a working solution these modules have tobe coordinated and utilized It is necessary to provide an interface between themodules that are supposed to perform computations in sequence

Some sections of the algorithm such as the calculation of LLRs require additionalwork where there only exists a computation unit that implements the Jacobi log-arithm and not the complete structure including the necessary preprocessing

64 Improvements

The implementation approach in this thesis can be improved in a couple of waysThese sections describes some of the possible improvements

641 Hardware Time-Multiplexing and Control

The approach in Chapter 531 with minimized control logic is not as hardwareefficient as the approach in Chapter 533 A more complex control logic could bemitigated by implementing some sort of decoding structure and store the instruc-tions in a block RAM This would make the implementation behave more like anapplication specific instruction set processor with limited functionality

The unroll factor discussed in Chapter 521 could be investigated further If thematrix multiplication becomes the limiting factor in the computations it shouldbe investigated if another unroll factor might be suitable This would allow forinstance smaller matrices to be multiplied almost completely in parallel with areasonable cost of the interconnection

642 Wordlength Optimization or Floating Point Implementation

As can be seen in the different sections in Chapter 5 different wordlengths havebeen used for the different modules These wordlengths have been chosen fromsimple Matlab simulations of subsections of the algorithm and this could be re-fined further by running complete simulations of the algorithm

More extensive simulations can allow for a minimization of the necessary wordlengthswhile still providing enough precision for good performance The drawback ofthese simulations is that it limits the reuse of components since a multiplier couldnot be shared between different sections that mandates different wordlengths

Because of the said limitations regarding resource sharing a floating point im-plementation might be of interest Perhaps not a complete implementation ofIEEE 754 double precision described in [IEEE 2008] but instead a custom formatwhich allows for the necessary accuracy while still providing enough dynamicrange to be used for all of the modules

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

As of now the described detectors has all been soft MIMO detectors using a fixedpoint number representation The wildcard in this section is [Eilert et al 2008]which does not perform soft detection and utilizes a custom floating point num-ber representation The reason why this detector is still interesting for this thesisis because of its programmable nature

Instead of developing a mainly fixed function hardware solution that will solvethe detection problem in [Eilert et al 2008] a complete processor architecture hasbeen developed that is capable of performing detection among other problemsThe processor architecture contains multiple floating point arithmetic units capa-ble of performing commonly used complex valued operations such as multiply-add and absolute value The great advantage of an approach like this is that it ispossible to use the same hardware to perform other calculations As described in[Eilert et al 2008] it is possible with a minor addition of hardware to calculate a512-point FFT efficiently

66 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an imple-mentation described in Chapter 65 These sections aim to provide some discus-sion about how a future implementation of SUMIS could be carried out

661 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengthsshall be minimized since the magnitude of the numbers involved in the algorithmare not equal between the operations To be able to reuse as much components

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

662 Processor Architecture

To achieve a compact solution it would be favorable to design small processorsas in [Chu and McAllister 2012] and [Eilert et al 2008] and describe the opera-tions in a program memory Since the processor architecture could be completelycustom it would be possible to choose suitable operations that it can perform Itcould be more complex operations than ordinary multiplications and additionssuch as addition of small sub matrices and so on

Each processor could be equipped with custom hardware such as the Jacobi loga-rithm unit which would accelerate operations performed on probabilities in thelogarithmic domain

663 Flexibility

If a custom processor architecture as described in Chapter 662 were designed itwould allow for a high degree of flexibility It would be possible to have differ-ent subprograms for different modulation schemes and thus support all commonmodulations with the possibility to add even more

It is possible to have multiple small processors working in parallel and this is fa-vorable since the workload of detection will increase if the detection is performedin an OFDM system because of multiple carriers

664 Integration

Since devices that are using wireless technology are shrinking more and more theneed for integrated solutions is increasing rapidly In customer appliances it ismore common with larger system on chips than individual ASICs interconnectedand therefore it would be suitable to package a SUMIS detector as an IP blockrather than fabricating a custom ASIC

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 47: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

65 Alternative Approaches and Comparison 41

The reason why an approach with floating point might be suitable is that it allowsfor more variations of the algorithm than with a fixed point representation Forinstance the wordlengths needed in the matrix inversion is very much dependenton the modulation scheme used as well as the noise level N0 and with a floatingpoint implementation this dynamic range could be accounted for without anyadditional changes

643 Design Space Exploration using High Level Synthesis

Since the SUMIS algorithm is computationally demanding it is a suitable candi-date for high level synthesis This method is used to transform a higher levelsoftware model of a problem into RTL hardware that can be synthesised given aset of constraints It can allow for design space exploration where different setsof constraints such as number of multipliers or latency are predetermined andsoftware tries to schedule the operations to fulfill these constraints

High level synthesis is not a quick solution to automatically transform softwareto hardware with good results but rather a suitable tool that can be used to testdifferent approaches Writing VHDL to describe the order of operations is timeconsuming and necessary to be able to evaluate a design approach If a softwarecan aid in this process it would be very beneficial More about high level synthesiscan be seen in [Coussy and Morawiec 2008]

65 Alternative Approaches and Comparison

During the thesis a number of hardware implementations of MIMO detectors wasstudied In this section these implementations will be presented and in Chap-ter 66 the insights from these approaches will be discussed to show how a futureimplementation of the SUMIS could be performed

One of the first implementations investigated was the one described in [Studeret al 2011] It is not quite comparable to the SUMIS implementation consid-ered in this thesis since this is an ASIC implementation and supports iterativedecoding Iterative decoding means that the detector and decoder cooperatesand exchange information to determine the correct data iteratively

The detector in [Studer et al 2011] uses coarse grained parallelism with the de-tection divided in eight units that can operate simultaneously and this is veryhelpful to provide a high throughput The processing units are quite similarto the work described in Chapter 533 with a number of computation units con-trolled by an FSM The design works with complex elements represented by fixedpoint numbers with various wordlengths in different parts of the detector

The detector in [Chu and McAllister 2012] employs a detection algorithm thatconsists of a tree search called sphere decoding It is intended for usage in anFPGA and is built of a large collection of small programmable processors Onegreat advantage of this approach is that the design is software programmableand changes can be made to the algorithm without changing the hardware sim-

42 6 Result and Analysis

ply by replacing the software running on the small processors The same fixedpoint representation is used in all of the processors but each individual processorcan be equipped with a different co-processor capable of performing for instancedivision or square root calculations

Since the design in [Chu and McAllister 2012] is used in OFDM systems thedetection must be performed for each subcarrier this implicates that the same al-gorithm will be performed on multiple independent data streams and this makesit possible to share control logic between multiple processors since they will per-form the same operations in a SIMD fashion

Another detector described in [Kim et al 2008] avoids some of the computationalcomplexity by avoiding explicit matrix inversions and rather uses QR decompo-sitions only It uses a fixed point representation with constant wordlength in thewhole design but allows for higher precision with dynamic scaling in the differentsteps of the algorithm The detector uses a fixed architecture that does not allowfor programmability by software As with the previously described detectors thisimplementation also employ a complex model The QR decomposition providestructure in the decomposed matrices like the decomposition in Chapter 331and this has been exploited to avoid unnecessary computations

As of now the described detectors has all been soft MIMO detectors using a fixedpoint number representation The wildcard in this section is [Eilert et al 2008]which does not perform soft detection and utilizes a custom floating point num-ber representation The reason why this detector is still interesting for this thesisis because of its programmable nature

Instead of developing a mainly fixed function hardware solution that will solvethe detection problem in [Eilert et al 2008] a complete processor architecture hasbeen developed that is capable of performing detection among other problemsThe processor architecture contains multiple floating point arithmetic units capa-ble of performing commonly used complex valued operations such as multiply-add and absolute value The great advantage of an approach like this is that it ispossible to use the same hardware to perform other calculations As described in[Eilert et al 2008] it is possible with a minor addition of hardware to calculate a512-point FFT efficiently

66 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an imple-mentation described in Chapter 65 These sections aim to provide some discus-sion about how a future implementation of SUMIS could be carried out

661 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengthsshall be minimized since the magnitude of the numbers involved in the algorithmare not equal between the operations To be able to reuse as much components

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

662 Processor Architecture

To achieve a compact solution it would be favorable to design small processorsas in [Chu and McAllister 2012] and [Eilert et al 2008] and describe the opera-tions in a program memory Since the processor architecture could be completelycustom it would be possible to choose suitable operations that it can perform Itcould be more complex operations than ordinary multiplications and additionssuch as addition of small sub matrices and so on

Each processor could be equipped with custom hardware such as the Jacobi loga-rithm unit which would accelerate operations performed on probabilities in thelogarithmic domain

663 Flexibility

If a custom processor architecture as described in Chapter 662 were designed itwould allow for a high degree of flexibility It would be possible to have differ-ent subprograms for different modulation schemes and thus support all commonmodulations with the possibility to add even more

It is possible to have multiple small processors working in parallel and this is fa-vorable since the workload of detection will increase if the detection is performedin an OFDM system because of multiple carriers

664 Integration

Since devices that are using wireless technology are shrinking more and more theneed for integrated solutions is increasing rapidly In customer appliances it ismore common with larger system on chips than individual ASICs interconnectedand therefore it would be suitable to package a SUMIS detector as an IP blockrather than fabricating a custom ASIC

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

simply by replacing the software running on the small processors. The same fixed point representation is used in all of the processors, but each individual processor can be equipped with a different co-processor capable of performing, for instance, division or square root calculations.

Since the design in [Chu and McAllister, 2012] is used in OFDM systems, the detection must be performed for each subcarrier. This implies that the same algorithm will be performed on multiple independent data streams, which makes it possible to share control logic between multiple processors since they will perform the same operations in a SIMD fashion.

Another detector, described in [Kim et al., 2008], avoids some of the computational complexity by avoiding explicit matrix inversions, relying on QR decompositions instead. It uses a fixed point representation with constant wordlength in the whole design but allows for higher precision with dynamic scaling in the different steps of the algorithm. The detector uses a fixed architecture that does not allow for programmability by software. As with the previously described detectors, this implementation also employs a complex model. The QR decompositions provide structure in the decomposed matrices, like the decomposition in Chapter 3.3.1, and this has been exploited to avoid unnecessary computations.

So far, the described detectors have all been soft MIMO detectors using a fixed point number representation. The wildcard in this section is [Eilert et al., 2008], which does not perform soft detection and utilizes a custom floating point number representation. The reason why this detector is still interesting for this thesis is its programmable nature.

Instead of developing a mainly fixed-function hardware solution that solves the detection problem, in [Eilert et al., 2008] a complete processor architecture has been developed that is capable of performing detection among other tasks. The processor architecture contains multiple floating point arithmetic units capable of performing commonly used complex valued operations such as multiply-add and absolute value. The great advantage of an approach like this is that it is possible to use the same hardware to perform other calculations. As described in [Eilert et al., 2008], it is possible, with a minor addition of hardware, to calculate a 512-point FFT efficiently.

6.6 Insights from Alternative Approaches

Several insights can be gathered from the alternative approaches for an implementation described in Chapter 6.5. These sections aim to provide some discussion about how a future implementation of SUMIS could be carried out.

6.6.1 Number Representation

It is quite cumbersome to design using fixed point numbers if the wordlengths shall be minimized, since the magnitudes of the numbers involved in the algorithm are not equal between the operations. To be able to reuse as many components as possible, a custom floating point representation could be used. Each operation would require more area to work with the representation, but if operations could easily be shared it would lead to a lower overall area.
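
To make the wordlength bookkeeping concrete, the sketch below (illustrative only; the entity name and the Q4.12 format are assumptions) uses the VHDL-2008 fixed point package documented in [Bishop, 2008] and shows how every multiplication widens its result unless it is explicitly resized:

    library ieee;
    use ieee.fixed_pkg.all;

    entity wordlength_growth is
      port (
        a, b : in  sfixed(3 downto -12);  -- 16-bit operands, 12 fraction bits
        p    : out sfixed(7 downto -24)   -- full-precision product
      );
    end entity;

    architecture rtl of wordlength_growth is
    begin
      -- The sfixed product spans (a'high + b'high + 1 downto a'low + b'low),
      -- here (7 downto -24); chaining operations keeps widening the words
      -- unless each result is explicitly rounded and saturated with resize.
      p <= a * b;
    end architecture;

Keeping every such resize decision correct throughout the algorithm is exactly the bookkeeping that a shared custom floating point representation would avoid.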

If a fixed point number representation is still chosen, dynamic scaling as used in [Kim et al., 2008] could be utilized to achieve as high accuracy as possible within the given constraints. This scaling would also add area to the design, and it is necessary to evaluate which approach would be more efficient.
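
The core of such dynamic scaling is to measure how many redundant sign bits the words of a block share, shift the whole block left by that amount, and record the shift in a shared block exponent. A minimal, hypothetical helper (not taken from [Kim et al., 2008]) could look like:

    -- Counts how many bits directly below the sign bit merely repeat it;
    -- the minimum of this count over a block of words is a safe left shift
    -- that maximizes precision, with the shift kept as a block exponent.
    function redundant_sign_bits(x : signed) return natural is
      variable n : natural := 0;
    begin
      for i in x'high - 1 downto x'low loop
        exit when x(i) /= x(x'high);
        n := n + 1;
      end loop;
      return n;
    end function;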

As of now, the SUMIS algorithm is described with real-valued operations; it could be favorable to explore the use of a complex model instead. For a pure software implementation this would probably not provide any benefits since, for instance, a complex multiplication requires four real multiplications and two additions, so these operations would make a complex model comparable to a real model of larger dimensions. If SUMIS is implemented in hardware, however, these operations can be performed in parallel and be pipelined, which would allow a new complex multiplication to be carried out in the same time as a regular multiplication, apart from an initial latency.
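
Such a unit follows directly from (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The sketch below is a minimal two-stage pipelined version (the 16-bit operand width and the entity name are assumptions); after a two-cycle latency, one complex product completes every clock cycle, matching the throughput of a regular pipelined multiplier:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity complex_mult is
      port (
        clk    : in  std_logic;
        ar, ai : in  signed(15 downto 0);   -- a = ar + j*ai
        br, bi : in  signed(15 downto 0);   -- b = br + j*bi
        pr, pi : out signed(31 downto 0)    -- p = a*b
      );
    end entity;

    architecture rtl of complex_mult is
      signal m1, m2, m3, m4 : signed(31 downto 0);
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          -- Stage 1: the four real multiplications execute in parallel
          m1 <= ar * br;
          m2 <= ai * bi;
          m3 <= ar * bi;
          m4 <= ai * br;
          -- Stage 2: the two additions/subtractions (one guard bit would
          -- be needed to cover the single full-scale overflow case)
          pr <= m1 - m2;
          pi <= m3 + m4;
        end if;
      end process;
    end architecture;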

6.6.2 Processor Architecture

To achieve a compact solution it would be favorable to design small processors, as in [Chu and McAllister, 2012] and [Eilert et al., 2008], and describe the operations in a program memory. Since the processor architecture could be completely custom, it would be possible to choose suitable operations for it to perform. These could be more complex operations than ordinary multiplications and additions, such as additions of small submatrices and so on.

Each processor could be equipped with custom hardware, such as the Jacobi logarithm unit, which would accelerate operations performed on probabilities in the logarithmic domain.
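
For reference, the identity such a unit evaluates (the Jacobi logarithm behind the log sum of exponentials in Chapter 3.4) is

    ln(e^a + e^b) = max(a, b) + ln(1 + e^(-|a - b|)),

where the correction term depends only on |a - b| and can be kept in a small lookup table, reducing the unit to a comparison, a subtraction, a table lookup and an addition.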

6.6.3 Flexibility

If a custom processor architecture as described in Chapter 6.6.2 were designed, it would allow for a high degree of flexibility. It would be possible to have different subprograms for different modulation schemes and thus support all common modulations, with the possibility to add even more.

It is possible to have multiple small processors working in parallel, and this is favorable since the workload of detection will increase if the detection is performed in an OFDM system, because of the multiple carriers.

6.6.4 Integration

Since devices that use wireless technology are shrinking more and more, the need for integrated solutions is increasing rapidly. In consumer appliances, large systems on chip are more common than individual interconnected ASICs, and therefore it would be suitable to package a SUMIS detector as an IP block rather than fabricating a custom ASIC.

Not only is the development of an ASIC extremely costly, it is also prohibitive when trying to integrate the result in a complete product. It would be more cost effective and flexible to resell the detector for integration in larger systems on chip.

6.7 Final Conclusions

A subset of the operations used in SUMIS was successfully adapted for hardware and implemented using VHDL. Different approaches were taken for the individual modules to highlight implementation details that are necessary to investigate when constructing a detector.

The remaining work is described in Chapter 6.3, and it would be suitable to perform further simulations to determine what kind of accuracy is needed while still providing more than adequate detection performance.

The SUMIS algorithm still needs more work for a complete adaptation to hardware, but with the increasing use of wireless technology, an algorithm like SUMIS is necessary to provide great performance even at very poor SNR. At high SNR, the benefits of an advanced detection algorithm compared to a simpler one will not be as substantial.

In the future, if the advice in Chapter 6.6 is taken into account, it will be possible to construct a flexible, highly efficient implementation of SUMIS for usage in modern wireless systems.

Bibliography

David Bishop. Fixed point package user's guide. Packages and bodies for the IEEE 1076-2008 LRM, 2008. URL http://www.vhdl.org/fphdl/Fixed_ug.pdf.

Dongdong Chen, Bintian Zhou, Zhan Guo, and Peter Nilsson. Design and implementation of reciprocal unit. In Circuits and Systems, 2005. 48th Midwest Symposium on, pages 1318–1321 Vol. 2, Aug 2005.

Won-Joon Choi, Kok-Wui Cheong, and J. M. Cioffi. Iterative soft interference cancellation for multiple antenna systems. In Wireless Communications and Networking Conference, 2000. WCNC 2000 IEEE, volume 1, 2000.

Xuezheng Chu and J. McAllister. Software-Defined Sphere Decoding for FPGA-Based MIMO Detection. Signal Processing, IEEE Transactions on, 60(11):6017–6026, 2012.

M. Čirkić, D. Persson, E. G. Larsson, and J.-Å. Larsson. Gaussian approximation of the LLR distribution for the ML and partial marginalization MIMO detectors. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 3232–3235, 2011. doi: 10.1109/ICASSP.2011.5946710.

Mirsad Čirkić and Erik G. Larsson. SUMIS: A Near-Optimal Soft-Output MIMO Detector at Low and Fixed Complexity. CoRR, abs/1207.3316, 2012.

Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008. ISBN 978-1-4020-8587-1.

Per-Erik Danielsson and Lennart Bengtsson. Digital teknik. Studentlitteratur AB, 1996. ISBN 914400110X.

J. Eilert, Di Wu, and D. Liu. Implementation of a programmable linear MMSE detector for MIMO-OFDM. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5396–5399, 2008.

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0801854148.

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process, 2008:94:1–94:14, January 2008.

A. Lampe and J. B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E. G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkhauser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E. G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E. G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v2.4). Data Sheet 150, 2012.

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 49: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

66 Insights from Alternative Approaches 43

as possible a custom floating point representation could be used Each operationwould require more area to work with the representation but if operations couldeasily be shared it would lead to a lower overall area

If a fixed point number representation is still chosen dynamic scaling as used in[Kim et al 2008] could be utilized to achieve as high accuracy as possible withthe given constraints This scaling would also add area to the design and it isnecessary to evaluate which approach that would be more efficient

As of now the SUMIS algorithm is described with real operations it could befavorable to explore the use of a complex model instead For a pure softwareimplementation this would probably not provide any benefits since for instancea complex multiplication requires four real multiplications and two additions sothese operations would make a complex model comparable to a real model oflarger dimensions If SUMIS is implemented in hardware however these opera-tions can be performed in parallel and be pipelined which would allow a newcomplex multiplication to be carried out in the same time as a regular multiplica-tion apart for an initial latency

662 Processor Architecture

To achieve a compact solution it would be favorable to design small processorsas in [Chu and McAllister 2012] and [Eilert et al 2008] and describe the opera-tions in a program memory Since the processor architecture could be completelycustom it would be possible to choose suitable operations that it can perform Itcould be more complex operations than ordinary multiplications and additionssuch as addition of small sub matrices and so on

Each processor could be equipped with custom hardware such as the Jacobi loga-rithm unit which would accelerate operations performed on probabilities in thelogarithmic domain

663 Flexibility

If a custom processor architecture as described in Chapter 662 were designed itwould allow for a high degree of flexibility It would be possible to have differ-ent subprograms for different modulation schemes and thus support all commonmodulations with the possibility to add even more

It is possible to have multiple small processors working in parallel and this is fa-vorable since the workload of detection will increase if the detection is performedin an OFDM system because of multiple carriers

664 Integration

Since devices that are using wireless technology are shrinking more and more theneed for integrated solutions is increasing rapidly In customer appliances it ismore common with larger system on chips than individual ASICs interconnectedand therefore it would be suitable to package a SUMIS detector as an IP blockrather than fabricating a custom ASIC

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 50: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

44 6 Result and Analysis

Not only is the development of an ASIC extremely costly but also prohibitingwhen trying to integrate in a complete product It would be more cost effectiveand flexible to resell the detector for integration in larger system on chips

67 Final Conclusions

A subset of the operations used in SUMIS was successfully adopted for hardwareand implemented using VHDL Different approaches were taken for the individ-ual modules to highlight implementation details necessary to investigate whenconstructing a detector

The remaining work is described in Chapter 63 and it would be suitable to per-form further simulations to determine what kind of accuracy is needed while stillproviding more than adequate detection performance

The SUMIS algorithm still need more work for a complete adaptation in hard-ware but with increasing use of wireless technology an algorithm like SUMIS isnecessary to provide great performance even with very poor SNR With high SNRthe benefits of an advanced detection algorithm compared to a simpler one willnot be as substantial

In the future if the advises in Chapter 66 are taken into account it will be possi-ble to construct a flexible highly efficient implementation of SUMIS for usage inmodern contemporary wireless systems

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012

Upphovsraumltt

Detta dokument haringlls tillgaumlngligt paring Internet mdash eller dess framtida ersaumlttare mdashunder 25 aringr fraringn publiceringsdatum under foumlrutsaumlttning att inga extraordinaumlraomstaumlndigheter uppstaringr

Tillgaringng till dokumentet innebaumlr tillstaringnd foumlr var och en att laumlsa ladda nerskriva ut enstaka kopior foumlr enskilt bruk och att anvaumlnda det ofoumlraumlndrat foumlr icke-kommersiell forskning och foumlr undervisning Oumlverfoumlring av upphovsraumltten viden senare tidpunkt kan inte upphaumlva detta tillstaringnd All annan anvaumlndning avdokumentet kraumlver upphovsmannens medgivande Foumlr att garantera aumlkthetensaumlkerheten och tillgaumlngligheten finns det loumlsningar av teknisk och administrativart

Upphovsmannens ideella raumltt innefattar raumltt att bli naumlmnd som upphovsmani den omfattning som god sed kraumlver vid anvaumlndning av dokumentet paring ovanbeskrivna saumltt samt skydd mot att dokumentet aumlndras eller presenteras i saringdanform eller i saringdant sammanhang som aumlr kraumlnkande foumlr upphovsmannens litteraumlraeller konstnaumlrliga anseende eller egenart

Foumlr ytterligare information om Linkoumlping University Electronic Press se foumlrla-gets hemsida httpwwwepliuse

Copyright

The publishers will keep this document online on the Internet mdash or its possi-ble replacement mdash for a period of 25 years from the date of publication barringexceptional circumstances

The online availability of the document implies a permanent permission foranyone to read to download to print out single copies for hisher own use andto use it unchanged for any non-commercial research and educational purposeSubsequent transfers of copyright cannot revoke this permission All other usesof the document are conditional on the consent of the copyright owner Thepublisher has taken technical and administrative measures to assure authenticitysecurity and accessibility

According to intellectual property law the author has the right to be men-tioned when hisher work is accessed as described above and to be protectedagainst infringement

For additional information about the Linkoumlping University Electronic Pressand its procedures for publication and for assurance of document integrity pleaserefer to its www home page httpwwwepliuse

copy Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 51: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

Bibliography

David Bishop Fixed point package userrsquos guide Packages and bodies for theIEEE 1076-2008 LRM 2008 URL httpwwwvhdlorgfphdlFixed_ugpdf

Dongdong Chen Bintian Zhou Zhan Guo and Peter Nilsson Design and im-plementation of reciprocal unit In Circuits and Systems 2005 48th MidwestSymposium on pages 1318 ndash1321 Vol 2 aug 2005

Won-Joon Choi Kok-Wui Cheong and J M Cioffi Iterative soft interferencecancellation for multiple antenna systems In Wireless Communications andNetworking Conference 2000 WCNC 2000 IEEE volume 1 2000

Xuezheng Chu and J McAllister Software-Defined Sphere Decoding for FPGA-Based MIMO Detection Signal Processing IEEE Transactions on 60(11)6017ndash6026 2012

M Čirkić D Persson EG Larsson and J-A Larsson Gaussian approximationof the llr distribution for the ml and partial marginalization mimo detectorsIn Acoustics Speech and Signal Processing (ICASSP) 2011 IEEE InternationalConference on pages 3232ndash3235 2011 doi 101109ICASSP20115946710

Mirsad Čirkić and Erik G Larsson SUMIS A Near-Optimal Soft-Ouput MIMODetector at Low and Fixed Complexity CoRR abs12073316 2012

Philipe Coussy and Adam Morawiec High-Level Synthesis from Algorithm toDigital Circuit Springer 2008 ISBN 978-1-4020-8587-1

Per-Erik Danielsson and Lennart Bengtsson Digital teknik Studentlitteratur AB1996 ISBN 914400110X

J Eilert Di Wu and D Liu Implementation of a programmable linear MMSEdetector for MIMO-OFDM In Acoustics Speech and Signal Processing 2008ICASSP 2008 IEEE International Conference on pages 5396ndash5399 2008

Gene H Golub and Charles F Van Loan Matrix computations (3rd ed) JohnsHopkins University Press Baltimore MD USA 1996 ISBN 0801854148

45

46 Bibliography

IEEE IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008 pages1ndash58 2008

IEEE IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2008(Revision of IEEE Std 1076-2002) 2009

Hun Seok Kim Weijun Zhu Jatin Bhatia Karim Mohammed Anish Shah andBabak Daneshrad A practical hardware friendly MMSE detector for MIMO-OFDM-based systems EURASIP J Adv Signal Process 2008941ndash9414 Jan-uary 2008

A Lampe and JB Huber On improved multiuser detection with iterated soft de-cision interference cancellation In Communication Theory Mini-Conference1999 pages 172ndash176 1999 doi 101109CTMC1999790259

EG Larsson and J Jalden Fixed-Complexity Soft MIMO Detection via PartialMarginalization Trans Sig Proc 56(8)3397ndash3407 August 2008

Jean-Michel Muller Elementary functions algorithms and implementationBirkhauser Boston Inc Secaucus NJ USA 1997 ISBN 0-8176-3990-X

D Persson and EG Larsson Partial marginalization soft mimo detection withhigher order constellations Signal Processing IEEE Transactions on 59(1)453ndash458 2011 ISSN 1053-587X doi 101109TSP20102068293

D Persson EG Larsson and M Skoglund Joint source-channel decoding overmimo channels based on partial marginalization Signal Processing IEEETransactions on 60(12)6734ndash6739 2012 ISSN 1053-587X doi 101109TSP20122214215

Gilbert Strang Introduction to Linear Algebra (4th ed) Wellesley - CambridgePress 2009 ISBN 978-0-9802327-1-4

C Studer S Fateh and D Seethaler ASIC Implementation of Soft-InputSoft-Output MIMO Detection Using MMSE Parallel Interference CancellationSolid-State Circuits IEEE Journal of 46(7)1754ndash1765 2011

Mirsad Čirkić and Erik G Larsson Near-Optimal Soft-Output Fixed-ComplexityMIMO Detection via Subspace Marginalization and Interference SuppressionIn IEEE International Conference on AcousticsSpeech and Signal ProcessingIEEE Signal Processing Society 2012

Xilinx Inc AXI Reference Guide (v 131) User Guide 761 2011a

Xilinx Inc LogiCORE IP Linear Algebra Toolkit (v 10) Data Sheet 829 2011b

Xilinx Inc LogiCORE IP Multiply Accumulator (v 21) Data Sheet 716 2011c

Xilinx Inc Virtex-6 Family Overview (v 24) Data Sheet 150 2012


Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tomas Frostensson

  • Front Page
  • Title Page
  • Library Page
  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Background
    • 12 Goal
    • 13 Limitations
    • 14 Outline
      • 2 Theory
        • 21 MIMO
        • 22 Detection
          • 221 Soft Detection
            • 23 SUMIS
              • 231 First Stage
              • 232 Second Stage
              • 233 Complexity Selection
                • 24 Number Representation
                • 25 Hardware Introduction
                • 26 Programmable Hardware
                  • 261 Hardware Flow
                  • 262 Reusable Modules
                      • 3 Problem Analysis
                        • 31 Overview
                        • 32 Matrix multiplication
                        • 33 Matrix Inversion
                          • 331 LDLT Decomposition
                          • 332 Reciprocal
                          • 333 Forward Substitution
                          • 334 Final Steps
                            • 34 Log Sum of Exponentials
                              • 4 Methodology and Equipment
                                • 41 Modeling
                                • 42 VHDL
                                • 43 RTL
                                • 44 Hardware
                                  • 5 Implementation
                                    • 51 Overview
                                    • 52 Matrix Multiplication
                                      • 521 IP Block Trade-offs
                                      • 522 Interface
                                      • 523 Example Implementation
                                        • 53 Matrix Inversion
                                          • 531 LDLT Decomposition
                                          • 532 Reciprocal Unit
                                          • 533 Forward Substitution
                                            • 54 Jacobi Logarithm
                                              • 6 Result and Analysis
                                                • 61 Testing and Measurements
                                                  • 611 Matrix Multiplication
                                                  • 612 LDLT Decomposition
                                                  • 613 Forward Substitution
                                                  • 614 Jacobi Logarithm
                                                    • 62 Resource Usage
                                                      • 621 Matrix Multiplication
                                                      • 622 Matrix Inversion
                                                      • 623 Jacobi Logarithm
                                                        • 63 Remaining Work
                                                          • 631 Hyperbolic Tangent
                                                          • 632 Exponential Function
                                                          • 633 Additional Matrix Operations
                                                          • 634 Control Structure
                                                            • 64 Improvements
                                                              • 641 Hardware Time-Multiplexing and Control
                                                              • 642 Wordlength Optimization or Floating Point Implementation
                                                              • 643 Design Space Exploration using High Level Synthesis
                                                                • 65 Alternative Approaches and Comparison
                                                                • 66 Insights from Alternative Approaches
                                                                  • 661 Number Representation
                                                                  • 662 Processor Architecture
                                                                  • 663 Flexibility
                                                                  • 664 Integration
                                                                    • 67 Final Conclusions
                                                                      • Bibliography
                                                                      • Copyright
Page 52: MIMO Detector Based On SUMIS626323/FULLTEXT01.pdfLanguage Svenska/Swedish Engelska/English Rapporttyp Report category Licentiatavhandling Examensarbete C-uppsats ... A subset of the

Bibliography

IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.

IEEE. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), 2009.

Hun Seok Kim, Weijun Zhu, Jatin Bhatia, Karim Mohammed, Anish Shah, and Babak Daneshrad. A practical, hardware friendly MMSE detector for MIMO-OFDM-based systems. EURASIP J. Adv. Signal Process., 2008:94:1–94:14, January 2008.

A. Lampe and J.B. Huber. On improved multiuser detection with iterated soft decision interference cancellation. In Communication Theory Mini-Conference, 1999, pages 172–176, 1999. doi: 10.1109/CTMC.1999.790259.

E.G. Larsson and J. Jalden. Fixed-Complexity Soft MIMO Detection via Partial Marginalization. Trans. Sig. Proc., 56(8):3397–3407, August 2008.

Jean-Michel Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser Boston, Inc., Secaucus, NJ, USA, 1997. ISBN 0-8176-3990-X.

D. Persson and E.G. Larsson. Partial marginalization soft MIMO detection with higher order constellations. Signal Processing, IEEE Transactions on, 59(1):453–458, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2010.2068293.

D. Persson, E.G. Larsson, and M. Skoglund. Joint source-channel decoding over MIMO channels based on partial marginalization. Signal Processing, IEEE Transactions on, 60(12):6734–6739, 2012. ISSN 1053-587X. doi: 10.1109/TSP.2012.2214215.

Gilbert Strang. Introduction to Linear Algebra (4th ed.). Wellesley-Cambridge Press, 2009. ISBN 978-0-9802327-1-4.

C. Studer, S. Fateh, and D. Seethaler. ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation. Solid-State Circuits, IEEE Journal of, 46(7):1754–1765, 2011.

Mirsad Čirkić and Erik G. Larsson. Near-Optimal Soft-Output Fixed-Complexity MIMO Detection via Subspace Marginalization and Interference Suppression. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Signal Processing Society, 2012.

Xilinx Inc. AXI Reference Guide (v 13.1). User Guide 761, 2011a.

Xilinx Inc. LogiCORE IP Linear Algebra Toolkit (v 1.0). Data Sheet 829, 2011b.

Xilinx Inc. LogiCORE IP Multiply Accumulator (v 2.1). Data Sheet 716, 2011c.

Xilinx Inc. Virtex-6 Family Overview (v 2.4). Data Sheet 150, 2012.

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för icke-kommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Tomas Frostensson
