
Page 1: Slides IT2 SS2012

Lecture Course: Information Theory II

Marius Pesavento

Communication Systems Group

Institute of Telecommunications

Technische Universität Darmstadt

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 1 NTS

Page 2: Slides IT2 SS2012

COURSE ORGANIZATION

• Instructor: Dr.-Ing. Marius Pesavento, S3/06/204, [email protected], FG Nachrichtentechnische Systeme (NTS)
• Teaching assistant: Yong Cheng, S3/06/205, e-mail: [email protected]
• Website: http://www.nts.tu-darmstadt.de/
• Lecture notes and slides will be posted in TUCAN
• Office hours: on request (please send an e-mail to the TA or the instructor)
• Written final exam (closed-book)
• Examination date (presumably): Tuesday, July 31, 2012, 12.00 - 14.00

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 2 NTS

Page 3: Slides IT2 SS2012

RECOMMENDED TEXTBOOKS

1. D. Tse, Fundamentals of Wireless Communication, Cambridge University Press, 2005. (main reference)
2. A. El Gamal and Y.-H. Kim, Network Information Theory, Cambridge University Press, 2012.
3. A. Goldsmith, Wireless Communications, Cambridge University Press, 2005.
4. T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 3 NTS

Page 4: Slides IT2 SS2012

COURSE OUTLINE

• Overview of the basics of information theory
• Entropy, mutual information, capacity
• Source coding and channel coding theorems
• Memoryless Gaussian channel
• Multi-antenna channel capacity, water-filling
• Basic theory of network information theory
• Multiple-access channels
• Broadcast channels
• Relay channels
• Cyclic codes
• Convolutional codes
• Turbo codes

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 4 NTS

Page 5: Slides IT2 SS2012

Topics of the earlier basic IT course

• Information, entropy, mutual information, and their derivatives
• Basic theory of source coding: Shannon's source coding theorem, Huffman coding, Lempel-Ziv coding
• Channel capacity: Shannon's channel coding theorem, Gaussian channel, bandlimited channel, Shannon's limit, multiple Gaussian channels, multiple colored-noise channels, water-filling, ergodic and outage capacities, basics of MIMO channels
• Basic theory of channel coding: linear block codes, Reed-Muller codes, Golay code

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 5 NTS

Page 6: Slides IT2 SS2012

REVIEW OF PROBABILITY THEORY: CDF AND PDF

Let X be a continuous random variable with the cumulative distribution function (cdf)

F_X(x) = Probability{X ≤ x} = P(X ≤ x)

Probability density function (pdf):

f_X(x) = ∂F_X(x)/∂x

where

F_X(x_0) = ∫_{−∞}^{x_0} f_X(x) dx

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 6 NTS

Page 7: Slides IT2 SS2012

NORMALIZATION PROPERTY OF CDFs

Since F_X(∞) = 1, we obtain the so-called normalization property

∫_{−∞}^{∞} f_X(x) dx = 1

Simple interpretation:

f_X(x) = lim_{∆→0} P{x − ∆/2 ≤ X ≤ x + ∆/2} / ∆

(Figure: pdf f_X(x); the area under the curve between x_1 and x_2 equals Probability{x_1 < X < x_2}.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 7 NTS

Page 8: Slides IT2 SS2012

EXAMPLE 1

Let the real-valued random variable X be uniformly distributed in the interval

[0, T ].

(Figure: cdf F_X(x) rising linearly from 0 at x = 0 to 1 at x = T, and pdf f_X(x) equal to 1/T on [0, T] and zero elsewhere.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 8 NTS

Page 9: Slides IT2 SS2012

EXAMPLE 2

Let the real-valued random variable X have the so-called Gaussian (normal) distribution

f_X(x) = (1/√(2πσ_X²)) e^{−(x−µ_X)²/(2σ_X²)}

where σ_X² = var{X} is the variance and µ_X is the mean. The corresponding distribution function is given by

F_X(x) = (1/√(2πσ_X²)) ∫_{−∞}^{x} e^{−(ξ−µ_X)²/(2σ_X²)} dξ

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 9 NTS

Page 10: Slides IT2 SS2012

CDF AND PDF OF A GAUSSIAN RANDOM VARIABLE

(Figure: cdf F_X(x) increasing from 0 to 1, and pdf f_X(x) centered at x = µ_X with peak value (2πσ_X²)^{−1/2}; the spread is determined by σ_X².)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 10 NTS

Page 11: Slides IT2 SS2012

PROBABILITY MASS FUNCTION

Let X now be a discrete random variable which takes the values x_i (i = 1, ..., I) with the probabilities P(x_i) (i = 1, ..., I), respectively.

For discrete variables, we define the probability mass function

P(x_i) = Probability(X = x_i)

The normalization condition:

∑_{i=1}^{I} P(x_i) = 1

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 11 NTS

Page 12: Slides IT2 SS2012

EXTENSION TO DISCRETE VARIABLES

How to extend the concepts of pdf and cdf to discrete variables?

Define the unit step function as

u(x) = { 0, x < 0;  1, x ≥ 0 }

Define the Dirac delta function as

δ(x) = { ∞, x = 0;  0, x ≠ 0 },   with   ∫_{−∞}^{∞} δ(x) dx = 1

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 12 NTS

Page 13: Slides IT2 SS2012

EXTENSION TO DISCRETE VARIABLES

Relationships between the delta function and the unit step function:

∫_{−∞}^{x} δ(ξ) dξ = u(x),   δ(x) = ∂u(x)/∂x

Sifting property of the delta function:

∫_{−∞}^{∞} g(x) δ(x − y) dx = g(y)

Using the definition of the unit step function, we can express the cdf as

F_X(x) = ∑_{i=1}^{I} P(x_i) u(x − x_i)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 13 NTS

Page 14: Slides IT2 SS2012

EXTENSION TO DISCRETE VARIABLES

Then, the pdf can be expressed as

f_X(x) = ∑_{i=1}^{I} P(x_i) δ(x − x_i)

Using the sifting property of the delta function, we have

∫_{−∞}^{∞} f_X(x) dx = ∫_{−∞}^{∞} ∑_{i=1}^{I} P(x_i) δ(x − x_i) dx
                     = ∑_{i=1}^{I} ∫_{−∞}^{∞} P(x_i) δ(x − x_i) dx = ∑_{i=1}^{I} P(x_i) = 1

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 14 NTS

Page 15: Slides IT2 SS2012

EXAMPLE 1

Let the random variable X be an outcome of the coin tossing experiment.

(Figure: for the coin toss, the pdf f_X(x) consists of two impulses of weight 0.5 at x = 0 and x = 1, and the cdf F_X(x) is a staircase rising from 0 to 0.5 at x = 0 and to 1.0 at x = 1.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 15 NTS

Page 16: Slides IT2 SS2012

EXAMPLE 2

Let the random variable X be an outcome of the die throwing experiment.

(Figure: for the die throw, the pdf f_X(x) consists of six impulses of weight 1/6 at x = 1, ..., 6, and the cdf F_X(x) is a staircase with steps of height 1/6 rising from 0 to 1.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 16 NTS

Page 17: Slides IT2 SS2012

STATISTICAL EXPECTATION

Expected value (mean) of a continuous random variable:

µ_X = E{X} = ∫_{−∞}^{∞} x f_X(x) dx

For a discrete random variable:

µ_X = E{X} = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{−∞}^{∞} x ∑_{i=1}^{I} P(x_i) δ(x − x_i) dx
           = ∑_{i=1}^{I} ∫_{−∞}^{∞} x P(x_i) δ(x − x_i) dx = ∑_{i=1}^{I} x_i P(x_i)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 17 NTS

Page 18: Slides IT2 SS2012

STATISTICAL EXPECTATION

We can also compute the expected value of a function of a continuous random variable:

E{g(X)} = ∫_{−∞}^{∞} g(x) f_X(x) dx

For a discrete random variable:

E{g(X)} = ∑_{i=1}^{I} g(x_i) P(x_i)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 18 NTS

Page 19: Slides IT2 SS2012

VARIANCE OF A RANDOM VARIABLE

var{X} = E{(X − E{X})²} = E{X²} − E{X}² = σ_X²

where σ_X is commonly called the standard deviation.

The variance and the standard deviation can be interpreted as measures of the statistical dispersion of a random variable w.r.t. its expected value.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 19 NTS

Page 20: Slides IT2 SS2012

EXAMPLE

Compute the mean and variance of a random variable uniformly distributed in the interval [0, 1].

(Figure: pdf f_X(x) equal to 1 on [0, 1] and zero elsewhere.)

µ_X = ∫_0^1 x dx = x²/2 |_0^1 = 1/2

σ_X² = ∫_0^1 x² dx − µ_X² = x³/3 |_0^1 − 1/4 = 1/3 − 1/4 = 1/12

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 20 NTS
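The result above can be checked numerically. The following is a minimal sketch (not part of the original slides; the sample size and random seed are arbitrary assumptions) using numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # samples of X ~ U[0, 1]

print(x.mean())   # ~0.5, matching the analytic mean 1/2
print(x.var())    # ~0.0833, matching the analytic variance 1/12
```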

Page 21: Slides IT2 SS2012

JOINT DISTRIBUTION

Let us now consider two random variables X and Y jointly.

Joint distribution function:

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)

Joint pdf:

f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y) / (∂x ∂y)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 21 NTS

Page 22: Slides IT2 SS2012

JOINT DISTRIBUTION

The inverse relationship:

F_{X,Y}(x_0, y_0) = ∫_{−∞}^{x_0} ∫_{−∞}^{y_0} f_{X,Y}(x, y) dy dx

Any joint pdf satisfies the following normalization property:

∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1

Also,

∫_{−∞}^{∞} f_{X,Y}(x, y) dx = f_Y(y),   ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = f_X(x)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 22 NTS

Page 23: Slides IT2 SS2012

CONDITIONAL DISTRIBUTION

In practical problems, we are often interested in the pdf of one random variable X conditioned on the fact that a second random variable Y has some specific value y. It is obvious that

P(X ≤ x, Y ≤ y) = P(X ≤ x | Y ≤ y) P(Y ≤ y)

Then, the conditional cdf is defined as

F_X(x|y) = P(X ≤ x | Y ≤ y) = F_{X,Y}(x, y) / F_Y(y)

By symmetry, it also follows that

F_Y(y|x) = F_{X,Y}(x, y) / F_X(x)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 23 NTS

Page 24: Slides IT2 SS2012

CONDITIONAL DISTRIBUTION

The conditional pdfs are

f_X(x|y) = f_{X,Y}(x, y) / f_Y(y)

f_Y(y|x) = f_{X,Y}(x, y) / f_X(x)

From the last two equations, we obtain the Bayes rule

f_X(x|y) f_Y(y) = f_Y(y|x) f_X(x)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 24 NTS

Page 25: Slides IT2 SS2012

NORMALIZATION CONDITION

∫_{−∞}^{∞} f_X(x|y) dx = ∫_{−∞}^{∞} f_{X,Y}(x, y) / f_Y(y) dx = (1/f_Y(y)) ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = 1

Conditional expectation:

E{g(X)|y} = ∫_{−∞}^{∞} g(x) f_X(x|y) dx

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 25 NTS

Page 26: Slides IT2 SS2012

STATISTICAL INDEPENDENCE

Two random variables X and Y are statistically independent if

f_{X,Y}(x, y) = f_X(x) f_Y(y)

Substituting this equation into the conditional pdf, we obtain that statistical independence implies

f_X(x|y) = f_X(x)

That is, the variable Y does not have any influence on the variable X.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 26 NTS

Page 27: Slides IT2 SS2012

EXAMPLE

Let

f_{X,Y}(x, y) = { 4xy, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1;  0, otherwise }

Are the variables X and Y statistically dependent?

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = 4x ∫_0^1 y dy = 4x · y²/2 |_0^1 = { 2x, 0 ≤ x ≤ 1;  0, otherwise }

f_Y(y) = { 2y, 0 ≤ y ≤ 1;  0, otherwise }

Since f_{X,Y}(x, y) = f_X(x) f_Y(y), the variables are independent!

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 27 NTS

Page 28: Slides IT2 SS2012

CORRELATION AND COVARIANCE

Two fundamental characteristics of linear statistical dependence are the correlation

r_XY = E{XY}

and the covariance

cov{X, Y} = E{(X − E{X})(Y − E{Y})} = E{XY} − E{X}E{Y} = E{XY} − µ_X µ_Y

For X = Y, the covariance boils down to the variance:

cov{X, X} = E{X²} − µ_X² = var{X}

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 28 NTS

Page 29: Slides IT2 SS2012

SOME USEFUL PROPERTIES

• var{X + Y} = var{X} + var{Y} + 2 cov{X, Y}.
• If the variables X and Y are statistically independent, then for any functions h and g, E{h(X)g(Y)} = E{h(X)}E{g(Y)}.
• If the variables X and Y are statistically independent, then cov{X, Y} = 0. Therefore, covariance is sometimes used as a measure of statistical dependence. However, the reverse statement is not necessarily true!
• If the variables X and Y are statistically independent, then var{X + Y} = var{X} + var{Y}.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 29 NTS

Page 30: Slides IT2 SS2012

EXTENSION TO MULTIVARIATE DISTRIBUTIONS

We may also consider multiple (more than two) random variables X_1, ..., X_n.

Joint distribution function:

F_{X_1,...,X_n}(x_1, ..., x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n)

Joint pdf:

f_{X_1,...,X_n}(x_1, ..., x_n) = ∂^n F_{X_1,...,X_n}(x_1, ..., x_n) / (∂x_1 ∂x_2 · · · ∂x_n)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 30 NTS

Page 31: Slides IT2 SS2012

MULTIVARIATE DISTRIBUTIONS

Introducing the vectors

X = [X_1, X_2, ..., X_n]^T,   x = [x_1, x_2, ..., x_n]^T

we rewrite the previous equations in symbolic (vector) notation as

F_X(x) = P(X ≤ x)

f_X(x) = ∂^n F_X(x) / (∂x_1 ∂x_2 · · · ∂x_n)

Normalization condition:

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f_X(x) dx = 1

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 31 NTS

Page 32: Slides IT2 SS2012

MULTIVARIATE DISTRIBUTIONS

Statistical expectation can be defined as

E{g(X)} = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x) f_X(x) dx

where g(X) is some function of the random vector X.

In the particular bivariate case,

E{g(X, Y)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 32 NTS

Page 33: Slides IT2 SS2012

MULTIVARIATE GAUSSIAN DISTRIBUTIONS

Jointly Gaussian random variables have the following joint multivariate pdf:

f_X(x) = 1/((√(2π))^n det{R}^{1/2}) · e^{−(1/2)(x−µ_X)^T R^{−1}(x−µ_X)}

where the mean is

µ_X = E{X}

and the covariance matrix is

R = E{(X − E{X})(X − E{X})^T} = E{XX^T} − µ_X µ_X^T

In symbolic notation,

X ∼ N(µ_X, R)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 33 NTS
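As an illustration of this pdf, the following minimal sketch (not from the slides; the mean vector, covariance matrix, and evaluation point are assumed example values) evaluates the formula directly and compares it with scipy's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])                 # assumed mean vector mu_X
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # assumed covariance matrix R
x = np.array([0.5, 0.0])                   # assumed evaluation point

n = len(mu)
d = x - mu
pdf_manual = np.exp(-0.5 * d @ np.linalg.inv(R) @ d) / np.sqrt((2 * np.pi) ** n * np.linalg.det(R))
pdf_scipy = multivariate_normal(mean=mu, cov=R).pdf(x)
print(pdf_manual, pdf_scipy)               # both values agree
```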

Page 34: Slides IT2 SS2012

MULTIVARIATE GAUSSIAN DISTRIBUTION

In the case of a single (n = 1) random variable X = X_1, the n-variate Gaussian pdf reduces to

f_X(x) = (1/√(2πσ_X²)) e^{−(x−µ_X)²/(2σ_X²)}

which is the well-known Gaussian pdf.

In the case of two (n = 2) random variables X = X_1 and Y = X_2, we have

R = [ σ_X²        ρ σ_X σ_Y ;
      ρ σ_X σ_Y   σ_Y²       ],   ρ = E{(X − µ_X)(Y − µ_Y)} / (σ_X σ_Y)

Note that ρ = ρ_XY is nothing else than the correlation coefficient.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 34 NTS

Page 35: Slides IT2 SS2012

MULTIVARIATE GAUSSIAN DISTRIBUTION

The determinant of R is given by

det{R} = σ_X² σ_Y² (1 − ρ²)

and, therefore, the n-variate pdf reduces to the so-called bivariate pdf

f_{X,Y}(x, y) = 1/(2π σ_X σ_Y √(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) [ (x−µ_X)²/σ_X² − 2ρ(x−µ_X)(y−µ_Y)/(σ_X σ_Y) + (y−µ_Y)²/σ_Y² ] }

The maximum of this function is located at the point {x = µ_X, y = µ_Y} and the maximal value is

max{f_{X,Y}(x, y)} = f_{X,Y}(µ_X, µ_Y) = 1/(2π σ_X σ_Y √(1 − ρ²))

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 35 NTS

Page 36: Slides IT2 SS2012

MULTIVARIATE GAUSSIAN DISTRIBUTION

In the case of uncorrelated X and Y, ρ = 0 and we have

f_{X,Y}(x, y) = 1/(2π σ_X σ_Y) exp{ −(1/2) [ (x−µ_X)²/σ_X² + (y−µ_Y)²/σ_Y² ] }
             = ( (1/(√(2π) σ_X)) e^{−(x−µ_X)²/(2σ_X²)} ) ( (1/(√(2π) σ_Y)) e^{−(y−µ_Y)²/(2σ_Y²)} )
             = f_X(x) f_Y(y)

i.e., the variables X and Y become statistically independent. This is a very important result showing that uncorrelated Gaussian random variables are also statistically independent! Note that for non-Gaussian random variables this is not true in general.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 36 NTS

Page 37: Slides IT2 SS2012

MULTIVARIATE GAUSSIAN DISTRIBUTION: EXAMPLE

Contour plots of the bivariate Gaussian pdf with the parameters µ_X = µ_Y = 0 and σ_X = σ_Y = 1.

(Figures, pages 37-45: contour plots over x, y ∈ [−3, 3] for the correlation coefficients ρ = 0, 0.25, 0.5, 0.75, 0.95, −0.25, −0.5, −0.75, and −0.95; the contours change from circles at ρ = 0 to increasingly elongated ellipses as |ρ| grows, oriented along the diagonal for ρ > 0 and along the anti-diagonal for ρ < 0.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 45 NTS

Page 46: Slides IT2 SS2012

BASICS OF INFORMATION THEORY

Shannon: Information is the resolution of uncertainty about some statistical event:

• Before the event occurs, there is an amount of uncertainty.
• After the occurrence of the event, there is no uncertainty anymore, but there is a gain in the amount of information.
• Highly expected messages deliver a small amount of information, while highly unexpected ones deliver a large amount of information. Hence, the amount of information should be inversely proportional to the probability of the message.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 46 NTS

Page 47: Slides IT2 SS2012

Information and entropy

The amount of information of the symbol x with the probability P(x):

I(x) = log(1/P(x)) = −log(P(x))   with [I(x)] = bit

Considering a source with the alphabet X = {x_1, ..., x_N}, the entropy is defined as the statistically averaged amount of information (mean of I(X)):

H(X) = E{I(X)} = E{−log(P(X))} = ∑_{i=1}^{N} −P(x_i) log(P(x_i)) = ∑_{i=1}^{N} P(x_i) log(1/P(x_i))   with [H(X)] = bit/symbol

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 47 NTS
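A minimal helper for this definition can be sketched as follows (assumed example, not part of the slides; base-2 logarithms are used so the result is in bits per symbol):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i P(x_i) log2 P(x_i), ignoring zero-probability symbols."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))    # 1.0 bit/symbol (fair binary source)
print(entropy([0.9, 0.1]))    # ~0.469 bit/symbol (highly predictable source)
print(entropy([0.25] * 4))    # 2.0 bit/symbol (uniform quaternary source)
```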

Page 48: Slides IT2 SS2012

Example

Entropy of a non-symmetric binary source with the probabilities P(0) = p and P(1) = 1 − p:

H_B(X) = −p log(p) − (1 − p) log(1 − p)

(Figure: H_B(X) in bit/symbol versus p; the curve is zero at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 0.5, the point of maximum uncertainty.)

• The entropy characterizes the source uncertainty.
• The entropy is a concave function of the probability.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 48 NTS

Page 49: Slides IT2 SS2012

SOME “DERIVATIVES” OF ENTROPY: Joint Entropy

The definition of entropy can be extended to a pair of random variables X and Y (two discrete sources X = {x_1, ..., x_N} and Y = {y_1, ..., y_M}).

The joint entropy H(X, Y) is defined as:

H(X, Y) = −E{log(P(X, Y))} = −∑_{i=1}^{N} ∑_{l=1}^{M} P(x_i, y_l) log(P(x_i, y_l))

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 49 NTS

Page 50: Slides IT2 SS2012

Conditional Entropy

The conditional entropy H(Y|X) is the amount of uncertainty remaining about the random variable Y after the random variable X has been observed:

H(Y|X) = −E_{X,Y}{log(P(Y|X))}
       = −∑_{i=1}^{N} ∑_{l=1}^{M} P(x_i, y_l) log(P(y_l|x_i))
       = −∑_{i=1}^{N} P(x_i) ∑_{l=1}^{M} P(y_l|x_i) log(P(y_l|x_i))

where we use the Bayes rule

P(x_i, y_l) = P(x_i|y_l) P(y_l) = P(y_l|x_i) P(x_i)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 50 NTS

Page 51: Slides IT2 SS2012

Useful properties

Important conditional entropy property:

H(X, Y) = H(X) + H(Y|X)

Hence the entropy, the conditional entropy, and the joint entropy are related quantities.

Another important property: conditioning reduces entropy:

H(X|Y) ≤ H(X)

with equality if and only if X and Y are statistically independent.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 51 NTS

Page 52: Slides IT2 SS2012

Mutual information

Let us consider two random variables (sources). The amount of information exchanged between two symbols x_i and y_l can be defined as:

I(x_i; y_l) = log( P(x_i|y_l) / P(x_i) ) = log( P(x_i, y_l) / (P(x_i) P(y_l)) )   with [I(x_i; y_l)] = bit

where we again use the Bayes rule P(x_i, y_l) = P(x_i|y_l) P(y_l).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 52 NTS

Page 53: Slides IT2 SS2012

Mutual information

The amount of mutual information exchanged between two sources X and Y can be obtained by averaging I(x_i; y_l):

I(X; Y) = ∑_{i=1}^{N} ∑_{l=1}^{M} P(x_i, y_l) log( P(x_i|y_l) / P(x_i) )
        = ∑_{i=1}^{N} ∑_{l=1}^{M} P(x_i, y_l) log( P(x_i, y_l) / (P(x_i) P(y_l)) )   with [I(X; Y)] = bit/symbol

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 53 NTS

Page 54: Slides IT2 SS2012

Mutual information

Mutual information is the reduction in the uncertainty of X due to the knowledge of Y:

I(X; Y) = H(X) − H(X|Y)

Relation of mutual information to the entropies and the joint entropy:

I(X; Y) = H(X) + H(Y) − H(X, Y)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 54 NTS
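The identities above can be verified numerically from a joint pmf. The sketch below (assumed example, not from the slides; the 2x2 joint pmf is arbitrary) computes H(X), H(Y), H(X,Y), H(X|Y), and I(X;Y):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# assumed joint pmf P(x_i, y_l) of a toy 2x2 source pair (rows: X, columns: Y)
Pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)   # marginal pmfs

H_X, H_Y, H_XY = H(Px), H(Py), H(Pxy)
H_X_given_Y = H_XY - H_Y                    # from H(X,Y) = H(Y) + H(X|Y)
I_XY = H_X + H_Y - H_XY                     # mutual information

print(H_X, H_X_given_Y)                     # conditioning reduces entropy: H(X|Y) <= H(X)
print(I_XY, H_X - H_X_given_Y)              # I(X;Y) = H(X) - H(X|Y)
```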

Page 55: Slides IT2 SS2012

Channel capacity

The input probabilities P(x_i) are independent of the channel. We can therefore maximize the mutual information I(X; Y) w.r.t. P(x_i). The channel capacity can then be defined as the maximum mutual information in any single use of the channel, where the maximization is over P(x_i) (i = 1, ..., N):

C = max_{ {P(x_i)} } I(X; Y)   with [C] = bit/symbol

or bits per channel use (bpcu).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 55 NTS

Page 56: Slides IT2 SS2012

Example

Channel capacity of a binary symmetric channel:

(Figure: binary symmetric channel with inputs x_1, x_2 and outputs y_1, y_2; each input is received correctly with probability 1 − p and flipped with probability p.)

C_B = 1 + p log(p) + (1 − p) log(1 − p) = 1 − H_B(X)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 56 NTS
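A small sketch of this formula (assumed helper, not from the slides) computes C_B = 1 − H_B(p) in bits per channel use:

```python
import numpy as np

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def bsc_capacity(p):
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.1, 0.5):
    print(p, bsc_capacity(p))   # 1.0, ~0.531, 0.0 (a half-flipping channel carries nothing)
```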

Page 57: Slides IT2 SS2012

Entropy/capacity of a binary symmetric channel

(Figure: the binary entropy H_B(X) in bit/symbol, equal to 0 at p = 0 and p = 1 with its maximum of 1 at p = 0.5, and the BSC capacity C_B = 1 − H_B(X) in bit/symbol, equal to 1 at p = 0 and p = 1 and 0 at p = 0.5.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 57 NTS

Page 58: Slides IT2 SS2012

Channel coding/decoding

The inevitable presence of noise in a channel causes errors between the output and input data sequences of a digital communication system. To reduce these errors, we resort to channel coding.

The channel encoder maps the incoming source data into a channel input sequence. It adds redundancy to these data to protect them from errors.

The channel decoder inversely maps the channel output sequence into an output data sequence in such a way that the overall effect of the channel noise on the system is minimized.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 58 NTS

Page 59: Slides IT2 SS2012

Shannon’s Channel-Coding Theorem

Let information be transmitted through a discrete memoryless channel of capacity C. If the transmission rate satisfies

R < C

then there exists a channel coding scheme for which the source output can be transmitted over the channel with an arbitrarily small probability of error.

Conversely, if

R ≥ C

then it is impossible to transmit information over the channel with an arbitrarily small probability of error.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 59 NTS

Page 60: Slides IT2 SS2012

Joint source-channel coding theorem

If

H(X) > C

then it is impossible to transmit the source outputs over the channel with an arbitrarily small probability of error.

The latter theorem follows from the direct combination of the source-coding and channel-coding theorems.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 60 NTS

Page 61: Slides IT2 SS2012

Continuous sources

The mutual information between two continuous sources X and Y with the joint pdf f_{X,Y}(x, y) is given by

I(X; Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) log( f_X(x|y) / f_X(x) ) dx dy
        = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) log( f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ) dx dy

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 61 NTS

Page 62: Slides IT2 SS2012

What is the relationship between the discrete and continuous mutual information?

It can be shown that the definitions of mutual information in the continuous and discrete cases are essentially similar.

This property enables us to use the continuous mutual information to define the capacity in the case of continuously distributed (infinite-alphabet) sources.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 62 NTS

Page 63: Slides IT2 SS2012

Continuous-time bandlimited channel

Consider a continuous-time bandlimited channel with additive white Gaussian noise (AWGN). The output of such an AWGN channel can be described as

Y(t) = (X(t) + Z(t)) ∗ h(t)

where X(t) and Z(t) are the signal and noise waveforms, respectively, and h(t) is the impulse response of an ideal lowpass filter with the cutoff frequency B.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 63 NTS

Page 64: Slides IT2 SS2012

(Figure: bandlimited AWGN channel; the input x(t) is corrupted by AWGN n(t) with noise power spectral density N_0/2 and passed through an ideal lowpass filter H(f) of bandwidth B to produce y(t).)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 64 NTS

Page 65: Slides IT2 SS2012

Capacity

Capacity of the bandlimited channel:

C = B log(1 + P/(N_0 B))   bits per second

where it is taken into account that P_N = N_0 B.

Shannon's bound:

C_∞ = lim_{B→∞} B log(1 + P/(N_0 B)) = log(e) · P/N_0 ≈ 1.44 P/N_0

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 65 NTS
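These two formulas can be illustrated with a short sketch (assumed example, not from the slides; the power and noise density values are arbitrary) that evaluates C for growing B and compares it with Shannon's bound:

```python
import numpy as np

P = 1.0       # received signal power (W), assumed
N0 = 1e-2     # noise power spectral density (W/Hz), assumed

def capacity(B):
    """C = B log2(1 + P / (N0 B)) in bits per second."""
    return B * np.log2(1.0 + P / (N0 * B))

for B in (1e1, 1e2, 1e3, 1e4):
    print(B, capacity(B))           # capacity saturates as B grows

print(np.log2(np.e) * P / N0)       # Shannon's bound C_inf ~ 1.44 P/N0
```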

Page 66: Slides IT2 SS2012

Parallel AWGN channels

Consider multiple parallel AWGN channels

Y_i = X_i + Z_i,   i = 1, ..., K

with a common power constraint

E{ ∑_{i=1}^{K} X_i² } = ∑_{i=1}^{K} E{X_i²} = ∑_{i=1}^{K} P_i ≤ P

where Z_i ∼ N(0, P_{N,i}), the noise is statistically independent from channel to channel, and P_i = E{X_i²}.

How should the power P be distributed among the channels to maximize the total capacity?

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 66 NTS

Page 67: Slides IT2 SS2012

Water-filling

Result (water-filling): The total capacity is maximized when

P_i = (ν − P_{N,i})^+

where the value of ν is chosen such that

∑_{i=1}^{K} P_i = ∑_{i=1}^{K} (ν − P_{N,i})^+ ≤ P

and (·)^+ denotes the positive part, i.e., for any x,

(x)^+ = x if x ≥ 0, and (x)^+ = 0 if x < 0

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 67 NTS
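A minimal water-filling sketch (assumed helper, not from the slides; the water level ν is found by bisection, and the noise powers and total power are example values):

```python
import numpy as np

def waterfill(noise_powers, P_total, iters=100):
    """Return the powers (nu - P_N_i)^+ whose sum equals P_total."""
    pn = np.asarray(noise_powers, dtype=float)
    lo, hi = pn.min(), pn.max() + P_total        # the water level nu lies in this interval
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - pn, 0.0).sum() > P_total:
            hi = nu
        else:
            lo = nu
    return np.maximum(nu - pn, 0.0)

pn = np.array([0.1, 0.5, 1.0, 2.5])              # per-channel noise powers P_N,i (assumed)
p = waterfill(pn, P_total=2.0)
print(p, p.sum())                                # very noisy channels may get zero power
print(0.5 * np.log2(1.0 + p / pn).sum())         # resulting capacity in bit per channel use
```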

Page 68: Slides IT2 SS2012

Water-filling

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 68 NTS

Page 69: Slides IT2 SS2012

SKETCH OF THE PROOF

The mutual information of a system with multiple Gaussian channels can be shown to be upper-bounded by the value

(1/2) ∑_{i=1}^{K} log(1 + P_i / P_{N,i})

Equality is achieved when X = [X_1, X_2, ..., X_K]^T is a Gaussian vector:

X ∼ N(0, P)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 69 NTS

Page 70: Slides IT2 SS2012

SKETCH OF THE PROOF

The covariance matrix is

P = diag{P_1, ..., P_K}

Hence, the capacity of multiple Gaussian channels is given by

C = (1/2) ∑_{i=1}^{K} log(1 + P_i / P_{N,i})

Let us now maximize C over {P_i}_{i=1}^{K} subject to the constraints ∑_{i=1}^{K} P_i = P and P_i ≥ 0 for i = 1, ..., K.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 70 NTS

Page 71: Slides IT2 SS2012

SKETCH OF THE PROOF

We use the Lagrange multiplier method. The Lagrangian function can be written as

L(P_1, ..., P_K) = (1/2) ∑_{i=1}^{K} log(1 + P_i/P_{N,i}) + λ_0 (P − ∑_{i=1}^{K} P_i) + ∑_{i=1}^{K} λ_i P_i

where λ_0, ..., λ_K are the Lagrange multipliers. Differentiating L(P_1, ..., P_K) w.r.t. P_i, we have

∂L/∂P_i = ∂/∂P_i ( (1/2) log(e) ln(1 + P_i/P_{N,i}) + λ_0 (P − ∑_{i=1}^{K} P_i) + ∑_{i=1}^{K} λ_i P_i )
        = (1/2) log(e) · (1/P_{N,i}) / (1 + P_i/P_{N,i}) − λ_0 + λ_i
        = (log(e)/2) · 1/(P_i + P_{N,i}) − λ_0 + λ_i

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 71 NTS

Page 72: Slides IT2 SS2012

SKETCH OF THE PROOF

From the so-called Karush-Kuhn-Tucker (KKT) conditions for constrained convex optimization problems:

∑_{i=1}^{K} P_i* = P,   P_i* ≥ 0   (constraint satisfaction)

(log(e)/2) · 1/(P_i* + P_{N,i}) − λ_0* + λ_i* = 0   (zero gradient)

λ_i* P_i* = 0   (complementary slackness)

λ_i* ≥ 0,   i = 1, ..., K   (for the inequality constraints)

Thus P_i* ≥ 0 and ∑_{i=1}^{K} P_i* = P, as well as

P_i* (λ_0* − (log(e)/2) · 1/(P_i* + P_{N,i})) = 0   and   λ_0* ≥ (log(e)/2) · 1/(P_i* + P_{N,i})

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 72 NTS

Page 73: Slides IT2 SS2012

SKETCH OF THE PROOF

From the KKT conditions: P_i* ≥ 0 and ∑_{i=1}^{K} P_i* = P, as well as

P_i* (λ_0* − (log(e)/2) · 1/(P_i* + P_{N,i})) = 0   and   λ_0* ≥ (log(e)/2) · 1/(P_i* + P_{N,i})

Thus, if

λ_0* < (log(e)/2) · 1/P_{N,i},

then the inequality above can only hold with P_i* > 0, which by the complementary slackness condition implies that

λ_0* = (log(e)/2) · 1/(P_i* + P_{N,i}).

Hence, for ν* = log(e)/(2λ_0*),

P_i* = ν* − P_{N,i}.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 73 NTS

Page 74: Slides IT2 SS2012

SKETCH OF THE PROOF

From the KKT conditions: P_i* ≥ 0 and ∑_{i=1}^{K} P_i* = P, as well as

P_i* (λ_0* − (log(e)/2) · 1/(P_i* + P_{N,i})) = 0   and   λ_0* ≥ (log(e)/2) · 1/(P_i* + P_{N,i})

Conversely, if

λ_0* ≥ (log(e)/2) · 1/P_{N,i},

then P_i* > 0 is impossible, as it would imply that

λ_0* ≥ (log(e)/2) · 1/P_{N,i} > (log(e)/2) · 1/(P_i* + P_{N,i}),

which violates the complementary slackness condition. We conclude that P_i* = 0 when P_{N,i} ≥ ν*, and P_i* = ν* − P_{N,i} otherwise.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 74 NTS

Page 75: Slides IT2 SS2012

EXTENDED DEFINITIONS OF CAPACITY: Ergodic capacity

Ergodic capacity: In the case of a random Gaussian channel, it is sometimes more useful to separate the effects of the transmitted signal and the channel as

Y(i) = X(i) H(i) + Z(i)

where H(i) is the channel gain in the i-th channel use. In contrast to the noise and signal waveforms, the channel gain is usually treated as a non-random (deterministic) value.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 75 NTS

Page 76: Slides IT2 SS2012

Ergodic capacity

For this model,

P = E{X²}

can be interpreted as the transmitted signal power, whereas

E{(XH)²} = E{X²} H² = P H²

can be interpreted as the received signal power.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 76 NTS

Page 77: Slides IT2 SS2012

Ergodic capacity

In this case, the capacity formula reads

C = (1/2) log(1 + P H² / P_N)

Note that the conventional capacity is instantaneous, that is, it characterizes the maximal achievable rate for a particular given realization of the channel gain H.

How can we characterize the maximal achievable rate on average rather than for some particular channel gain?

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 77 NTS

Page 78: Slides IT2 SS2012

Ergodic capacity

In practice, wireless channels are random and, therefore, should be treated as random.

Based on this fact, the ergodic capacity is defined as the instantaneous capacity C averaged over the channel realizations:

C_E = E_H{C}

where E_H{·} denotes statistical expectation over the random channel gain.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 78 NTS

Page 79: Slides IT2 SS2012

Ergodic capacity

Assume that we know the channel gain pdf f_H(h). In this case, we can compute the ergodic capacity as

C_E = ∫_{−∞}^{∞} f_H(h) C(h) dh

Ergodic capacity provides another look at the achievable transmission rate as compared to the conventional instantaneous capacity, because it gives the average rather than the instantaneous picture.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 79 NTS
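For a concrete fading model, the ergodic capacity can be estimated by Monte Carlo averaging of the instantaneous capacity. A minimal sketch (assumed example, not from the slides; a Rayleigh-distributed gain with E{H²} = 1 and arbitrary power values are used):

```python
import numpy as np

rng = np.random.default_rng(1)
P, PN = 1.0, 0.1                                       # transmit power and noise power (assumed)
h = rng.rayleigh(scale=1/np.sqrt(2), size=200_000)     # |H| Rayleigh with E{H^2} = 1

C_inst = 0.5 * np.log2(1.0 + P * h**2 / PN)            # instantaneous capacities C(h)
print(C_inst.mean())                                   # ergodic capacity estimate C_E
print(0.5 * np.log2(1.0 + P / PN))                     # capacity of the average channel, for comparison
```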

Page 80: Slides IT2 SS2012

Outage

Outage capacity: the transmission rate C_{p_out} that exceeds the instantaneous capacity C in p_out × 100 percent of the channel realizations.

The quantity p_out is called the outage probability.

An outage is defined as the event where, for some particular channel realization, the chosen transmission rate is higher than the instantaneous capacity (that is, where no error-free transmission is possible).

For small p_out (roughly speaking, p_out ≤ 0.1), outage-induced errors can be cured by means of channel coding.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 80 NTS

Page 81: Slides IT2 SS2012

Outage

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 81 NTS

Page 82: Slides IT2 SS2012

Outage

The outage capacity can be characterized as follows. Let the pdf of the instantaneous capacity C = C(H) be f_C(c), where f_C(c) = 0 for c < 0. Then, the outage capacity is defined by the equation

p_out = P(C < C_{p_out}) = ∫_0^{C_{p_out}} f_C(c) dc

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 82 NTS

Page 83: Slides IT2 SS2012

Channel coding

Channel encoding and decoding are used to correct errors that may occur during signal transmission over the channel.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 83 NTS

Page 84: Slides IT2 SS2012

Linear block codes

Linear binary block codes: the coding/decoding operations can be described using linear algebra. Binary codes use modulo-2 arithmetic.

A code is said to be linear if the modulo-2 sum of any two codewords in the code gives another codeword of the code.

A code is denoted as an (n, k) linear block code if n is the total number of bits of the codeword and k is the number of bits containing the message.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 84 NTS

Page 85: Slides IT2 SS2012

Linear block codes

Row-vector notation:

m = [m_1, ..., m_k]
b = [b_1, ..., b_{n−k}]
c = [b_1, ..., b_{n−k}, m_1, ..., m_k] = [b, m]

Block codes use the message bits to generate parity-check bits according to the equation

b = mP

where P is the k × (n − k) coefficient matrix. Noting that c = [b, m], we get

c = [b, m] = [mP, m] = m[P, I_k] = mG

where G is the k × n generator matrix.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 85 NTS

Page 86: Slides IT2 SS2012

Hamming codes

Hamming codes: a family of codes with

n = 2^m − 1
k = 2^m − m − 1,   n − k = m

Generator matrix of the (7,4) Hamming code (n = 7, m = 3, k = 4):

G = [P, I_4] = [ 1 1 0 1 0 0 0
                 0 1 1 0 1 0 0
                 1 1 1 0 0 1 0
                 1 0 1 0 0 0 1 ]

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 86 NTS

Page 87: Slides IT2 SS2012

Message | Codeword | Hamming weight
0000    | 0000000  | 0
0001    | 1010001  | 3
0010    | 1110010  | 4
0011    | 0100011  | 3
0100    | 0110100  | 3
0101    | 1100101  | 4
...     | ...      | ...

For the given Hamming code, d_min = 3. Therefore, it is a single-error-correcting code.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 87 NTS
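The codeword table can be reproduced from the generator matrix above. A minimal sketch (assumed example, not from the slides) generates all 2^k codewords c = mG in modulo-2 arithmetic and verifies that d_min = 3:

```python
import itertools
import numpy as np

G = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1]])

codewords = []
for m in itertools.product([0, 1], repeat=4):        # all 16 messages m
    c = np.mod(np.array(m) @ G, 2)                   # c = mG (mod 2)
    codewords.append(c)
    print("".join(map(str, m)), "".join(map(str, c)), int(c.sum()))

weights = [int(c.sum()) for c in codewords if c.any()]
print("d_min =", min(weights))                       # 3 -> single-error correcting
```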

Page 88: Slides IT2 SS2012

MULTI-ANTENNA CHANNELS

Consider a multiple-input multiple-output (MIMO) channel:

(Figure: transmitter with N antennas, channel, receiver with M antennas.)

In the frequency-flat fading case, the signal at the m-th receive antenna is

Y_m(t) = ∑_{n=1}^{N} H_mn(t) X_n(t) + Z_m(t),   m = 1, ..., M

where H_mn is the channel coefficient between the m-th receive and the n-th transmit antenna, X_n is the signal sent from the n-th transmit antenna, and Z_m is the noise at the m-th receive antenna.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 88 NTS

Page 89: Slides IT2 SS2012

MIMO channel

Defining the M × N channel matrix

H = [ H_11 H_12 · · · H_1N
      H_21 H_22 · · · H_2N
       ...
      H_M1 H_M2 · · · H_MN ]

and the transmit signal, receive signal, and noise column vectors

x = [X_1, ..., X_N]^T,   y = [Y_1, ..., Y_M]^T,   z = [Z_1, ..., Z_M]^T

we can write the system input-output relationship in matrix form as

y = Hx + z

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 89 NTS

Page 90: Slides IT2 SS2012

SIMO channel

One particular case of the MIMO channel is the single-input multiple-output (SIMO) channel:

(Figure: transmitter with 1 antenna, receiver with N antennas.)

In the frequency-flat fading case, the signal at the n-th receive antenna is

Y_n(t) = H_n(t) X(t) + Z_n(t),   n = 1, ..., N

where H_n is the channel coefficient between the n-th receive antenna and the transmit antenna, and X(t) is the signal sent from the transmit antenna.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 90 NTS

Page 91: Slides IT2 SS2012

SIMO channel

Defining the N × 1 channel vector

h = [H_1, ..., H_N]^T

we can write the system input-output relationship in vector form as

y = hX + z

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 91 NTS

Page 92: Slides IT2 SS2012

MISO channel

Another particular case of the MIMO channel is the multiple-input single-output (MISO) channel:

(Figure: transmitter with N antennas, receiver with 1 antenna.)

In the frequency-flat fading case, the signal at the receive antenna is

Y(t) = ∑_{n=1}^{N} H_n(t) X_n(t) + Z(t)

where H_n is the channel coefficient between the receive antenna and the n-th transmit antenna, and X_n(t) is the signal sent from the n-th transmit antenna.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 92 NTS

Page 93: Slides IT2 SS2012

MISO channel

Defining the 1 × N channel row vector

h = [H_1, ..., H_N]

we can write the system input-output relationship in vector form as

Y = hx + Z

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 93 NTS

Page 94: Slides IT2 SS2012

Capacity in the case of an informed transmitter

Let us consider the MIMO case assuming that z ∼ N_C(0, σ²I). Then, the equation

y = Hx + z

describes a vector Gaussian channel. If the channel is known at the transmitter, the capacity can be computed by decomposing this channel into a set of parallel independent scalar Gaussian sub-channels.

Singular value decomposition (SVD) of H:

H = U Λ V^H

where the M × M matrix U and the N × N matrix V are unitary, that is, U^H U = U U^H = I and V^H V = V V^H = I.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 94 NTS

Page 95: Slides IT2 SS2012

SVD for any n × m matrix A

A = U Λ V^H = ∑_i λ_i u_i v_i^H

(Figure: the shapes of the factors U, Λ, and V^H in the decomposition A = UΛV^H, for the cases n < m and n > m; Λ has nonzero entries only on its main diagonal.)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 95 NTS

Page 96: Slides IT2 SS2012

MIMO capacity (informed transmitter)

Using the SVD of H, the MIMO model equation becomes

y = U Λ V^H x + z

Multiplying this equation by U^H from the left, and using the unitary property of U, we have

U^H y = Λ V^H x + U^H z

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 96 NTS

Page 97: Slides IT2 SS2012

MIMO capacity (informed transmitter)

Introducing the notation ỹ = U^H y, x̃ = V^H x, z̃ = U^H z, we obtain a system of parallel Gaussian channels

ỹ = Λ x̃ + z̃

where E{z̃ z̃^H} = U^H E{z z^H} U = σ²I and, therefore,

z̃ ∼ N_C(0, σ²I)

Moreover,

‖x̃‖² = x^H V V^H x = ‖x‖²

Thus, the power is preserved!

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 97 NTS

Page 98: Slides IT2 SS2012

MIMO capacity (informed transmitter)

The system of parallel channels can also be written componentwise:

Ỹ_i = λ_i X̃_i + Z̃_i,   i = 1, ..., n_o

where n_o = min{N, M}. The transition to this equivalent system corresponds to the pre-processing

x = V x̃

at the transmitter and the post-processing

ỹ = U^H y

at the receiver. Hence, the pre- and post-processing operators are V and U^H, respectively.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 98 NTS

Page 99: Slides IT2 SS2012

MIMO capacity (informed transmitter)

To implement the pre-/post-processing operations, the original vector to be transmitted is x̃. It is pre-processed at the transmitter to obtain

x = V x̃

The vector x is then sent over the channel. At the receiver (omitting the noise term for brevity), we have

y = Hx = U Λ V^H V x̃ = U Λ x̃

and after post-processing we obtain ỹ = U^H y = U^H U Λ x̃ = Λ x̃

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 99 NTS

Page 100: Slides IT2 SS2012

MIMO capacity (informed transmitter)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 100 NTS

Page 101: Slides IT2 SS2012

MIMO capacity (informed transmitter)

The capacity of the resulting system of parallel independent channels is

C = B ∑_{i=1}^{n_o} log(1 + P_i λ_i² / σ²)   bits/s

where the P_i are the water-filling power allocations

P_i = (ν − σ²/λ_i²)^+

and the water level ν is obtained from the total power constraint ∑_{i=1}^{n_o} P_i ≤ P.

Each λ_i corresponds to an eigenmode of the channel, also called an eigenchannel.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 101 NTS
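Putting the SVD and the water-filling allocation together, a minimal sketch of the informed-transmitter MIMO capacity (assumed example, not from the slides; the channel matrix, power, and noise level are arbitrary, and the water level is found by bisection):

```python
import numpy as np

def mimo_capacity_csit(H, P_total, sigma2=1.0, B=1.0, iters=100):
    lam = np.linalg.svd(H, compute_uv=False)          # singular values lambda_i
    lam = lam[lam > 1e-12]
    inv_gain = sigma2 / lam**2                        # effective noise level per eigenchannel
    lo, hi = 0.0, inv_gain.max() + P_total
    for _ in range(iters):                            # bisection for the water level nu
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - inv_gain, 0.0).sum() > P_total:
            hi = nu
        else:
            lo = nu
    p = np.maximum(nu - inv_gain, 0.0)                # water-filling powers P_i
    return B * np.sum(np.log2(1.0 + p * lam**2 / sigma2)), p

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 4))                       # assumed 4x4 real channel matrix
C, p = mimo_capacity_csit(H, P_total=10.0)
print(C, p)
```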

Page 102: Slides IT2 SS2012

Wireless MIMO channel

(Figure: wireless MIMO channel with N transmit and M receive antennas.)

Page 103: Slides IT2 SS2012

System model

Assume perfect channel state information at the transmitter.

(Figure: MIMO system model; each receive antenna observes a superposition of the transmitted signals plus additive noise.)

Page 104: Slides IT2 SS2012

MIMO channel equation in matrix notation

(Figure: N transmit and M receive antennas; the MIMO channel equation y = Hx + z in matrix notation.)

What is the optimum transmission and power allocation scheme if the channel matrix H is known at the transmitter?

Page 105: Slides IT2 SS2012

Capacity of a MIMO channel

Maximize the achievable rate subject to a sum power constraint, which is due to hardware limitations and/or regulations. (The detailed optimization problem with constraints (1) and (2) shown on the slide is not reproduced here.)

Page 106: Slides IT2 SS2012

Singular Value Decomposition of MIMO channel

H = U Λ V^H, where U and V are unitary:

• u_i and v_i are the left and right singular vectors.
• λ_i is the corresponding singular value (≥ 0).

Page 107: Slides IT2 SS2012

Decoupling the channels using linear transformation

Pre-processing with V at the transmitter and post-processing with U^H at the receiver decouple the MIMO channel into

Ỹ_i = λ_i X̃_i + Z̃_i,   for i = 1, ..., r

i.e., r independent parallel channels.

Page 108: Slides IT2 SS2012

Independent parallel channel representation

(Figure: block diagram of the r independent parallel channels, each with its own gain λ_i and additive noise.)

Page 109: Slides IT2 SS2012

Optimization problem

Capacity: maximize the sum rate of the r parallel channels subject to constraints (1) and (2) (not reproduced here), where p_i is the power assigned to the i-th input signal.

Page 110: Slides IT2 SS2012

Water-filling principle

Page 111: Slides IT2 SS2012

Water-filling principle

Page 112: Slides IT2 SS2012

Water-filling principle

Page 113: Slides IT2 SS2012

Water-filling principle

Page 114: Slides IT2 SS2012

Water-filling principle

Page 115: Slides IT2 SS2012

High SNR regime: What are the key parameters that determine the performance?

At high SNR, the water level is high and the policy of allocating equal amounts of power to each channel is asymptotically optimal. In this case,

C ≈ B ∑_{i=1}^{r} log(1 + P λ_i² / (r σ²))
  ≈ B ∑_{i=1}^{r} log(P λ_i² / (r σ²))
  ≈ r B log(SNR) + B ∑_{i=1}^{r} log(λ_i² / r)   bits/s

where r = rank{H} and SNR = P/σ².

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 115 NTS

Page 116: Slides IT2 SS2012

High SNR regime: What are the key parameters that determine the performance?

It can be proved that, among the channels with the same total power gain, the channels whose singular values are all equal result in the highest capacity.

This means that well-conditioned channel matrices are preferable in the high SNR regime.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 116 NTS

Page 117: Slides IT2 SS2012

Low SNR regime: What are the key parameters that determine the performance?

In this regime, the optimal policy is to allocate the power to the channel with the strongest eigenmode:

C ≈ B log(1 + P λ_max² / σ²)

and ill-conditioned (rank-one) channel matrices are preferable.

Using the property log(1 + x) ≈ x log(e), valid for x ≪ 1, we have

C ≈ B P λ_max² log(e) / σ²

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 117 NTS

Page 118: Slides IT2 SS2012

MIMO capacity (uninformed transmitter)

Let us now obtain the MIMO channel capacity from general considerations, assuming that H is fixed while the other quantities (x, y, and z) are random. In this case, no assumption on channel knowledge at the transmitter is made, but the receiver is assumed to know H.

Capacity via mutual information:

C = max_{p(x)} I(x; y) = max_{p(x)} [H(y) − H(y|x)]

The output covariance matrix is given by

R = E{y y^H} = H P H^H + σ²I

where P = E{x x^H}.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 118 NTS

Page 119: Slides IT2 SS2012

MIMO capacity (uninformed transmitter)

Result (Telatar, 1995; Foschini and Gans, 1998): Consider the model

y = Hx + z

where x ∼ N_C(0, P), z ∼ N_C(0, σ²I), and H is fixed. Let B be the channel bandwidth in Hz. Then, the MIMO channel capacity is equal to

C = B log det{ I + (1/σ²) H P H^H }   bits/s

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 119 NTS
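A minimal sketch of this formula (assumed helper, not from the slides; the channel realization and power values are arbitrary, and slogdet is used for numerical robustness) evaluates C for a given H and input covariance P:

```python
import numpy as np

def mimo_capacity(H, P, sigma2=1.0, B=1.0):
    M = H.shape[0]
    A = np.eye(M) + (H @ P @ H.conj().T) / sigma2
    sign, logdet = np.linalg.slogdet(A)      # log-determinant of I + HPH^H/sigma^2
    return B * logdet / np.log(2.0)          # convert from nats to bits

rng = np.random.default_rng(3)
N, M, Ptot = 4, 4, 8.0
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
P_iso = (Ptot / N) * np.eye(N)               # uninformed transmitter: equal, uncorrelated powers
print(mimo_capacity(H, P_iso))               # bit/s for B = 1 Hz (i.e., bit/s/Hz)
```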

Page 120: Slides IT2 SS2012

Result 1

Let X_1, ..., X_n have a multivariate complex circular Gaussian distribution with the mean µ_X and covariance matrix P:

f_X(x) = 1/(π^n det{P}) · e^{−(x−µ_X)^H P^{−1} (x−µ_X)}

Then

H(X) = H(X_1, ..., X_n) = log((πe)^n det{P})

Proof:

H(X) = ∫ f_X(x) (x − µ_X)^H P^{−1} (x − µ_X) dx + ln(π^n det{P})

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 120 NTS

Page 121: Slides IT2 SS2012

Result 1 (proof)

= E{(x − µ_X)^H P^{−1} (x − µ_X)} + ln(π^n det{P})
= E{tr(P^{−1} (x − µ_X)(x − µ_X)^H)} + ln(π^n det{P})
= tr(P^{−1} E{(x − µ_X)(x − µ_X)^H}) + ln(π^n det{P})
= tr(P^{−1} P) + ln(π^n det{P})
= n + ln(π^n det{P})
= ln((πe)^n det{P})   nats
= log((πe)^n det{P})   bits

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 121 NTS

Page 122: Slides IT2 SS2012

Result 2

Let the random vector x ∈ C^n have zero mean and covariance E{x x^H} = P. Then

H(X) = H(X_1, ..., X_n) ≤ log((πe)^n det{P})

with equality if and only if X ∼ N_C(0, P).

Proof: Let g(x) be a pdf with covariance [P]_ij = ∫ g(x) x_i x_j* dx, and let φ_P(x) be the complex circular Gaussian pdf N_C(0, P).

Note that the logarithm of the complex circular Gaussian pdf, log φ_P(x) ∼ −(x − µ_x)^H P^{−1} (x − µ_x), is a quadratic form in x.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 122 NTS

Page 123: Slides IT2 SS2012

Result 2 (proof)

Then the Kullback-Leibler distance D(g(x)||φ_P(x)) between the two pdfs is given as

0 ≤ D(g(x)||φ_P(x)) = ∫ g(x) log( g(x)/φ_P(x) ) dx
                    = −H_g(X) − ∫ g(x) log(φ_P(x)) dx
                    = −H_g(X) − ∫ φ_P(x) log(φ_P(x)) dx
                    = −H_g(X) + H_{φ_P}(X)

(the second integral depends on g only through the second moment of X, because log φ_P(x) is a quadratic form, so g can be replaced by φ_P)

⇒ H_{φ_P}(X) ≥ H_g(X)

The Gaussian distribution maximizes the entropy over all distributions with the same covariance.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 123 NTS

Page 124: Slides IT2 SS2012

Proof of MIMO capacity result

It has been shown that, among all random vectors with the covariance matrix R, the entropy of y is maximized when y is zero-mean circularly symmetric complex Gaussian. This is the case exactly when the input vector x is zero-mean circularly symmetric complex Gaussian, which is therefore the optimal distribution for X.

Using these facts, the capacity formula can be proved by obtaining explicit expressions for H(Y) and H(Y|X).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 124 NTS

Page 125: Slides IT2 SS2012

Proof of MIMO capacity result

Recall the signal model:

Y = HX + Z

Then the mutual information between X and Y is given as

I(X; Y) = H(Y) − H(Y|X)
        = H(Y) − H(HX + Z|X)
        = H(Y) − H(Z|X)      (given X, the term HX is deterministic)
        = H(Y) − H(Z)        (Z is independent of X)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 125 NTS

Page 126: Slides IT2 SS2012

Proof of MIMO capacity result

From Result 2 we know that the entropy H(Y) is maximized for a complex circular Gaussian input distribution; thus

max_{all pdfs with R} I(X; Y) = max_{all pdfs with R} H(Y) − H(Z)
   = log{(πe)^n det(H P H^H + σ²I)} − log{(πe)^n det(σ²I)}
   = log det(I + (1/σ²) H P H^H)   bits per channel use

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 126 NTS

Page 127: Slides IT2 SS2012

Transition to the classic Shannon's capacity result

Assuming a single-input single-output (SISO) system with N = M = 1 and a constant channel gain H, which transmits with power P, we have

H = H,   P = P,   I = 1

and, therefore,

C = B log det{ I + (1/σ²) H P H^H } = B log{ 1 + |H|² P / σ² }

This is the classical Shannon capacity formula for a bandlimited channel!

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 127 NTS

Page 128: Slides IT2 SS2012

Channel known at the transmitter

If the channel matrix H is known at the transmitter, then in general unequal powers should be chosen, and P is not a scaled identity matrix.

Eigenchannels and water-filling power allocation should be used, as discussed above.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 128 NTS

Page 129: Slides IT2 SS2012

Channel unknown at the transmitter

If the channel matrix H is unknown at the transmitter, then it follows from symmetry reasons that P should be a scaled identity matrix. Using the power constraint

tr{P} = P

we obtain that P has to be chosen as

P = (P/N) I

Indeed, the power constraint is satisfied because

tr{P} = tr{(P/N) I} = (P/N) tr{I} = P

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 129 NTS

Page 130: Slides IT2 SS2012

Channel unknown at the transmitter

Choosing P = (P/N) I, we obtain that the MIMO capacity in the uninformed transmitter case is given by

C = B log det{ I + (P/(σ²N)) H H^H }

Assuming that, although fixed, the entries of H are statistically independent random values with unit variance, and using the law of large numbers, we obtain that for a large number of transmit antennas and a fixed number of receive antennas,

(1/N) H H^H → I

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 130 NTS

Page 131: Slides IT2 SS2012

Channel unknown at the transmitter

Using the latter property, we obtain that for large N,

C = B log det{ (1 + P/σ²) I } = B log{ (1 + P/σ²)^M } = M B log(1 + P/σ²)

which is M times the SISO Shannon capacity!

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 131 NTS

Page 132: Slides IT2 SS2012

Parallel SISO channel interpretation

Consider the general MIMO channel capacity formula. Let the eigendecomposition of the positive semi-definite Hermitian matrix H P H^H be

H P H^H = ∑_{i=1}^{r} λ_i u_i u_i^H = U Λ U^H

where U^H U = I and r = rank{H P H^H}. The matrices U and Λ should not be confused with those of the SVD of the matrix H used earlier!

We will use the property

det{I + AB} = det{I + BA}

valid for any matrices A and B of conformable dimensions.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 132 NTS

Page 133: Slides IT2 SS2012

Parallel SISO channel interpretation

Setting A = U and B = ΛU^H, we obtain

C = B log det{ I + (1/σ²) H P H^H }
  = B log det{ I + (1/σ²) U Λ U^H }
  = B log det{ I + (1/σ²) Λ U^H U }
  = B log det{ I + (1/σ²) Λ }
  = B log{ ∏_{i=1}^{r} (1 + λ_i/σ²) }
  = B ∑_{i=1}^{r} log{ 1 + λ_i/σ² }

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 133 NTS
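The equality of the first and last expressions can be checked numerically. A minimal sketch (assumed example, not from the slides; the channel and input covariance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, sigma2 = 3, 5, 0.5
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
P = (1.0 / N) * np.eye(N)                       # uninformed-transmitter covariance

A = H @ P @ H.conj().T
lam = np.linalg.eigvalsh(A)                     # eigenvalues lambda_i of the Hermitian matrix HPH^H
lhs = np.linalg.slogdet(np.eye(M) + A / sigma2)[1] / np.log(2)
rhs = np.sum(np.log2(1.0 + lam / sigma2))
print(lhs, rhs, np.isclose(lhs, rhs))           # the two expressions agree
```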

Page 134: Slides IT2 SS2012

Parallel SISO channel interpretation

The latter formula interprets the capacity of the MIMO channel as the sum of the capacities of r parallel SISO channels.

Assuming the uninformed transmitter case (P = (P/N) I), r can be interpreted as the rank of H → full-rank channels are preferable!

If H is drawn randomly, then almost surely

rank{H} = min{M, N}

This leads us to the conclusion that the capacity grows nearly proportionally to min{M, N}.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 134 NTS

Page 135: Slides IT2 SS2012

Assume M = N and let the Frobenius norm of H be given. What type of channel maximizes the MIMO capacity?

Result: The capacity is maximized when H is orthogonal:

H^H H = H H^H = ζ I

where ζ is a constant. In this case (with P = (P/N) I),

C = B log det{ (1 + ζP/(σ²N)) I } = B log(1 + ζP/(σ²N))^N = N B log(1 + ζP/(σ²N))

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 135 NTS

Page 136: Slides IT2 SS2012

SIMO channel capacity

Consider a SIMO column-vector channel h with one transmit and N receive antennas. The capacity formula becomes

C = B log det{ I + (1/σ²) P h h^H } = B log(1 + (1/σ²) P h^H h) = B log(1 + (P/σ²) ‖h‖²)

Hence, the SIMO channel comprises only one spatial data pipe. The addition of receive antennas yields only a logarithmic (rather than linear) increase in capacity.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 136 NTS

Page 137: Slides IT2 SS2012

MISO channel capacity

Consider a MISO row-vector channel h with one receive and N transmit antennas. The capacity formula becomes

C = B log(1 + (1/σ²) h P h^H) = B log(1 + (1/σ²) ‖h P^{1/2}‖²)

The situation is similar to that in the SIMO case. The increase in capacity is only logarithmic (rather than linear).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 137 NTS

Page 138: Slides IT2 SS2012

Ergodic MIMO channel capacity

The channel matrix H is no longer fixed but is treated as random. The capacity formula can be averaged over H:

E_H{C} = B E_H[ log det{ I + (1/σ²) H P H^H } ]

Result (Telatar, 1999): Let H be a Gaussian random matrix with i.i.d. elements. Then, the average capacity is maximized subject to the power constraint tr{P} ≤ P when

P = (P/N) I

That is, to maximize the average capacity, the antennas should transmit uncorrelated streams with the same power, an intuitively appealing fact.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 138 NTS

Page 139: Slides IT2 SS2012

Ergodic MIMO channel capacity:Proof (sketch)

Let H be a Gaussian random matrix with i.i.d. elements.

C = max_{P: tr P ≤ P} E_H[ log det{ I + (1/σ²) H P H^H } ]

Decompose P = ∆_P + P_off, where ∆_P is the diagonal part of P and P_off collects the off-diagonal entries. Then

C = max_{P: tr P ≤ P} E_H[ log det{ I + (1/σ²) H ∆_P H^H + (1/σ²) H P_off H^H } ]
  = max_{∆_P: tr ∆_P ≤ P} E_H[ log det{ I + (1/σ²) H ∆_P H^H + (1/σ²) H P_off H^H } ]
  ≤ max_{∆_P: tr ∆_P ≤ P} log det{ E_H[ I + (1/σ²) H ∆_P H^H + (1/σ²) H P_off H^H ] }

where the last inequality follows from Jensen's inequality.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 139 NTS

Page 140: Slides IT2 SS2012

Ergodic MIMO channel capacity:Proof (sketch)

C ≤ max_{∆_P: tr ∆_P ≤ P} log det{ E_H[ I + (1/σ²) H ∆_P H^H + (1/σ²) H P_off H^H ] }
  = max_{∆_P: tr ∆_P ≤ P} log det{ E_H[ I + (1/σ²) H ∆_P H^H ] + E_H[ (1/σ²) H P_off H^H ] }

where the last term in the second equation is identically zero due to the statistical independence of the entries of H.

We conclude that restricting the transmit covariance to exhibit the diagonal structure P = ∆_P does not reduce the achievable capacity.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 140 NTS

Page 141: Slides IT2 SS2012

Ergodic MIMO channel capacity:Proof (sketch)

Thus

C = max_{∆_P: tr ∆_P ≤ P} E_H[ log det{ I + (1/σ²) H ∆_P H^H } ]

We can show that, due to the i.i.d. property of H, the objective function is symmetric w.r.t. the input variables, i.e., exchanging the order of the entries P_1, ..., P_N does not change the function value. Furthermore, the function is concave.

We conclude that the optimal power allocation strategy in this case is to distribute the power equally among the transmitted symbols, i.e., to choose P_1 = P_2 = ... = P_N.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 141 NTS

Page 142: Slides IT2 SS2012

Ergodic MIMO channel capacity

Note that the latter choice of P coincides with our earlier choice of this matrix in the case of a fixed channel and an uninformed transmitter.

Choosing P = (P/N) I, the maximal average capacity (which is commonly referred to as the ergodic capacity) becomes

C_E = B E_H[ log det{ I + (P/(σ²N)) H H^H } ]

Ergodic capacity has an important advantage w.r.t. the fixed-channel capacity, as it gives an average rather than an instantaneous picture.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 142 NTS

Page 143: Slides IT2 SS2012

Ergodic MIMO channel capacity

Using the parallel SISO channel interpretation and denoting the singular values of H as γ_i, we obtain

C_E = B E_H[ ∑_{i=1}^{r} log{ 1 + P γ_i² / (σ²N) } ] = B ∑_{i=1}^{r} E_H[ log{ 1 + P γ_i² / (σ²N) } ]

Please note the difference with the water-filling capacity: in the latter expression, equal powers are used for each eigenchannel.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 143 NTS
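The ergodic capacity can be estimated by Monte Carlo averaging over channel realizations. A minimal sketch (assumed example, not from the slides; an i.i.d. complex Gaussian channel and arbitrary SNR are used):

```python
import numpy as np

rng = np.random.default_rng(5)
N = M = 4
P, sigma2 = 10.0, 1.0
trials = 5000

caps = np.empty(trials)
for t in range(trials):
    H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    A = np.eye(M) + (P / (sigma2 * N)) * (H @ H.conj().T)
    caps[t] = np.linalg.slogdet(A)[1] / np.log(2)

print(caps.mean())                       # ergodic capacity estimate in bit/s/Hz
print(M * np.log2(1 + P / sigma2))       # large-N approximation M log2(1 + P/sigma^2)
```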

Page 144: Slides IT2 SS2012

Large antenna regime

Let us denote SNR = P/σ². Then, the capacity formula becomes

C_E = B ∑_{i=1}^{r} E_H[ log{ 1 + SNR γ_i² / N } ]

Assume M = N and i.i.d. Rayleigh fading. Then, using random matrix theory, it can be shown that for any SNR

lim_{N→∞} C_E / N = const

Therefore, the capacity grows linearly in N at any SNR in this asymptotic regime!

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 144 NTS

Page 145: Slides IT2 SS2012

Outage capacity

A value C_out which is larger than the capacity C in p_out percent of the channel realizations. In other words,

Pr(C_out > C) = p_out

If one wants to transmit with C_out bits per second, then the channel capacity is less than C_out with probability p_out. Hence, the transmission is impossible (the system is in outage) in p_out · 100 percent of the time.

Alternatively, we can write

Pr(C_out ≤ C) = 1 − p_out

and, hence, in (1 − p_out) · 100 percent of the time the transmission is possible, as the system is not in outage.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 145 NTS

Page 146: Slides IT2 SS2012

Outage capacity

1 − p_out is called the non-outage probability.

Using the instantaneous MIMO capacity formula, we can define the MIMO outage capacity by means of the following expression:

min_{tr{P} ≤ P} Pr( C_out > B log det{ I + (1/σ²) H P H^H } ) = p_out

where we additionally use the opportunity to minimize the outage probability by means of a proper choice of P. This particular choice, of course, depends on the statistics of the random channel matrix H.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 146 NTS
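For the uninformed-transmitter choice P = (P/N)I, the outage capacity can be estimated as the p_out-quantile of the instantaneous capacity over random channel realizations. A minimal sketch (assumed example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
N = M = 2
P, sigma2, p_out = 10.0, 1.0, 0.1
trials = 20000

caps = np.empty(trials)
for t in range(trials):
    H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    A = np.eye(M) + (P / (sigma2 * N)) * (H @ H.conj().T)
    caps[t] = np.linalg.slogdet(A)[1] / np.log(2)

C_out = np.quantile(caps, p_out)   # Pr(C < C_out) = p_out, i.e. outage in 10% of realizations
print(C_out, caps.mean())
```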

Page 147: Slides IT2 SS2012

Example: Rayleigh fading channel

Rayleigh fading: the channel coefficients are circularly symmetric complex Gaussian with zero mean and unit variance; a) channel known at the transmitter, b) channel unknown at the transmitter.

(Figures: outage capacity in bit/s/Hz versus SNR in dB (from −10 to 40 dB) for p_out = 0.01, 0.1, and 0.5, for the informed-transmitter case a) and the uninformed-transmitter case b).)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 147 NTS

Page 148: Slides IT2 SS2012

MULTIUSER CHANNELS

Why multiuser channels?

• Up to now, we have considered point-to-point communication links.
• Most communication systems serve multiple users. Therefore, multiuser channels are of great interest.
• In multiuser channels, one user can interfere with another user. This type of interference is called multiuser interference (MUI).

Common multiuser channel types:

• Multiple-access channels
• Broadcast channels
• Relay channels

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 148 NTS

Page 149: Slides IT2 SS2012

Multiple-access channel

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 149 NTS

Page 150: Slides IT2 SS2012

Broadcast channel

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 150 NTS

Page 151: Slides IT2 SS2012

Relay channel

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 151 NTS

Page 152: Slides IT2 SS2012

Multiple-access channels

Two-user multiple-access Gaussian channel:

Y(i) = X1(i) + X2(i) + Z(i),   Z(i) ∼ N_C(0, σ²)

In the point-to-point (single-user) case, the rate limit is the channel capacity. The achievable rate region is, therefore, given by:

R < B log(1 + P/σ²)

In the two-user case, we should extend this concept to a capacity region C, which is the set of all pairs (R1, R2) such that users 1 and 2 can simultaneously reliably communicate at rates R1 and R2, respectively.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 152 NTS

Page 153: Slides IT2 SS2012

Multiple-access channels

Since the two users share the same bandwidth, there is a tradeoff between the

rates R1 and R2: if one user wants to communicate at a higher rate, then the

other user may need to lower its rate.

Example of tradeoff: In orthogonal multiple access schemes such as OFDM, the

tradeoff can be achieved by varying the number of subcarriers allocated to each

user.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 153 NTS

Page 154: Slides IT2 SS2012

Rate region

Different scalar performance measures can be obtained from the capacity region:

I The symmetric capacity

Csym = max_{(R,R)∈C} R

is the maximum common rate at which both users can simultaneously reliably communicate.

I The sum capacity

Csum = max_{(R1,R2)∈C} (R1 + R2)

is the maximum total throughput that can be achieved.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 154 NTS

Page 155: Slides IT2 SS2012

Rate region

If we have two users with the powers P1 and P2, then the capacity region for the two-user channel is defined by the following inequalities:

R1 < B log(1 + P1/σ²)
R2 < B log(1 + P2/σ²)
R1 + R2 < B log(1 + (P1 + P2)/σ²)

The first two constraints say that the rate of each individual user cannot exceed the capacity of the point-to-point link with the other user absent.

The last constraint says that the total throughput cannot exceed the capacity of a point-to-point link with a single user whose power is the sum of the two users' powers.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 155 NTS

Page 156: Slides IT2 SS2012

Rate region

That is, not only the rates R1 and R2 are limited, but their sum is limited as well. This means that the signal of each user may be viewed as interference by the other user.

Result: The two-user capacity region is a pentagon.
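The pentagon is determined by the three bounds above; its two non-trivial corner points correspond to the two successive-cancellation orders discussed on the following slides. A small sketch (assumed powers, log base 2) that computes the corner points:

```python
# Sketch: corner points of the two-user Gaussian MAC pentagon (assumed parameters, rates in bit/s).
# Corner A: user 2 decoded first (treating user 1 as noise), so user 1 gets its single-user rate.
# Corner B: the opposite decoding order.
import numpy as np

def mac_corners(P1, P2, sigma2, B=1.0):
    r = lambda s, n: B * np.log2(1 + s / n)
    corner_A = (r(P1, sigma2), r(P2, P1 + sigma2))
    corner_B = (r(P1, P2 + sigma2), r(P2, sigma2))
    c_sum = B * np.log2(1 + (P1 + P2) / sigma2)
    return corner_A, corner_B, c_sum

A, Bpt, Csum = mac_corners(P1=10.0, P2=5.0, sigma2=1.0)
print(A, Bpt, Csum)       # R1 + R2 equals Csum at both corners (up to floating-point rounding)
```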

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 156 NTS

Page 157: Slides IT2 SS2012

Rate region: multiple-access channel

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 157 NTS

Page 158: Slides IT2 SS2012

Rate region: multiple-access channel

Remark: Surprisingly, user 1 can achieve its single-user rate bound R1 = B log(1 + P1/σ²) while, at the same time, user 2 can get a non-zero rate as high as R2 = B log(1 + P2/(P1 + σ²)). This corresponds to point A of the capacity region plot. Indeed,

R1 + R2 = B log( (1 + P1/σ²)·(1 + P2/(P1 + σ²)) )
        = B log( 1 + P1/σ² + P2/(P1 + σ²) + P1P2/(σ²(P1 + σ²)) )
        = B log( 1 + (P1² + P1σ² + P2σ² + P1P2)/(σ²(P1 + σ²)) )
        = B log( 1 + (P1 + P2)/σ² )

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 158 NTS

Page 159: Slides IT2 SS2012

Successive interference cancellation

How to achieve this?

Each user should encode its data using a capacity-achieving channel code. The receiver should decode the information of both users in two stages:

I In the first stage, the data of user 2 are decoded treating user 1 as AWGN. The maximum rate user 2 can then achieve is R2 = B log(1 + P2/(P1 + σ²)).

I In the second stage, the reconstructed (decoded) signal of user 2 is subtracted from the aggregate received signal, and then the data of user 1 are decoded. Since the signal of user 2 has already been subtracted and only the background AWGN is left in the system, the achieved rate for user 1 will be R1 = B log(1 + P1/σ²).

This two-stage decoding is called successive interference cancellation.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 159 NTS

Page 160: Slides IT2 SS2012

Successive interference cancellation

If one reverses the order of cancellation then one can achieve point B rather than

A.

All other rate points on the segment AB can be obtained by time-sharing between

the multiple-access strategies of points A and B.

The segment AB contains all the optimal operating points of the channel, in the

sense that any point in the capacity region is dominated by some point on AB.

That is, for any point within the capacity region that corresponds to the rates R1* and R2*, we can always find a point on the segment AB whose rates R1 and R2 satisfy:

R1* ≤ R1,   R2* ≤ R2

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 160 NTS

Page 161: Slides IT2 SS2012

Pareto-optimal

The points on the segment AB are called Pareto-optimal.

One can always increase the user rates to move to a point on the segment AB,

and there is no reason not to do this.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 161 NTS

Page 162: Slides IT2 SS2012

The concrete choice of the point on AB depends on our particular objectives:

I To maximize the sum capacity Csum, any point on AB is equally fine. Note that we have already computed the sum of R1 and R2 at point A. Hence,

Csum = B log(1 + (P1 + P2)/σ²)

I To maximize the symmetric capacity Csym, we should take the point on AB that gives us equal rates R1 and R2.

I Some operating points on AB may not be fair, especially if the received power of one user is much higher than that of the other user. In this case, we should consider operating at the corner point in which the stronger user is decoded first.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 162 NTS

Page 163: Slides IT2 SS2012

How does the system with successive cancellation compare to a standard CDMA system in terms of achievable rate?

The principal difference between CDMA detection and successive cancellation

detection is that:

I In the CDMA system, each user is decoded treating the other users as interference. This corresponds to the single-user receiver principle, and we immediately conclude that the performance of the CDMA system is suboptimal; i.e., it achieves a point strictly in the interior of the capacity region.

I In contrast to CDMA, the successive cancellation receiver is a multiuser

receiver: only one of the users (say, user 1) is decoded treating user 2 as

interference, but user 2 is decoded with the benefit of the signal of user 1

being already removed.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 163 NTS

Page 164: Slides IT2 SS2012

In the successive cancellation receiver case,

R1 = B log(1 + P1/σ²),   R2 = B log(1 + P2/(P1 + σ²))

or

R1 = B log(1 + P1/(P2 + σ²)),   R2 = B log(1 + P2/σ²)

In the CDMA receiver case,

R1 = B log(1 + P1/(P2 + σ²)),   R2 = B log(1 + P2/(P1 + σ²))

That is, one of the rates in the CDMA case is always lower than in the case of successive cancellation!
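This comparison can be made concrete with a few lines of code (a sketch with assumed powers, log base 2): the CDMA receiver always matches the weaker of the two rates of each successive-cancellation order.

```python
# Sketch: rate pairs for successive interference cancellation (both orders) vs. a CDMA receiver.
import numpy as np

def rate(B, signal, noise):
    return B * np.log2(1 + signal / noise)

P1, P2, sigma2, B = 10.0, 5.0, 1.0, 1.0
sic_A = (rate(B, P1, sigma2), rate(B, P2, P1 + sigma2))       # user 2 decoded first
sic_B = (rate(B, P1, P2 + sigma2), rate(B, P2, sigma2))       # user 1 decoded first
cdma  = (rate(B, P1, P2 + sigma2), rate(B, P2, P1 + sigma2))  # every user treats the other as noise
print(sic_A, sic_B, cdma)
```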

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 164 NTS

Page 165: Slides IT2 SS2012

Correspondingly, in the successive cancellation receiver case,

Csum = B log(1 + (P1 + P2)/σ²)

In the CDMA receiver case, the sum rate is

B log(1 + P1/(P2 + σ²)) + B log(1 + P2/(P1 + σ²))
  = B log( (1 + P1/(P2 + σ²))·(1 + P2/(P1 + σ²)) )
  = B log( 1 + (P1 + P2)/σ² − P1P2(P1 + P2 + σ²)/(σ²(P1 + σ²)(P2 + σ²)) )
  < Csum

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 165 NTS

Page 166: Slides IT2 SS2012

K -user multiple-access Gaussian channel

Y(i) = Σ_{k=1}^{K} Xk(i) + Z(i),   Z(i) ∼ N_C(0, σ²)

Similar to the two-user case, in the case of K users, all of them share the same

bandwidth, and there is a tradeoff between the rates Rk (k = 1, 2, ... , K ). If one

(or more) users want to communicate at higher rate(s), then the other user(s)

may need to lower their rate(s).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 166 NTS

Page 167: Slides IT2 SS2012

In the K-user case, we can define the capacity region C as the set of all (R1, R2, ..., RK) such that users 1, 2, ..., K can simultaneously reliably communicate at rates R1, R2, ..., RK, respectively.

This capacity region is described by the 2^K − 1 constraints:

Rk < B log(1 + Pk/σ²),   k = 1, ..., K
Rk + Ri < B log(1 + (Pk + Pi)/σ²),   k, i = 1, ..., K (k ≠ i)
Rk + Ri + Rl < B log(1 + (Pk + Pi + Pl)/σ²),   k, i, l = 1, ..., K (all distinct)
· · ·
Σ_{k=1}^{K} Rk < B log(1 + Σ_{k=1}^{K} Pk / σ²)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 167 NTS

Page 168: Slides IT2 SS2012

K -user multiple-access Gaussian channel

The K-user capacity region can be written in a short form as

Σ_{k∈S} Rk < B log(1 + Σ_{k∈S} Pk / σ²)   for all S ⊂ {1, ..., K}

The right-hand side

B log(1 + Σ_{k∈S} Pk / σ²)

is the maximum sum rate that can be achieved by a single transmitter with the total power of the users in S and with no other users in the system.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 168 NTS

Page 169: Slides IT2 SS2012

The sum capacity can be defined as

Csum = max_{(R1,...,RK)∈C} Σ_{k=1}^{K} Rk

It can be shown that

Csum = B log(1 + Σ_{k=1}^{K} Pk / σ²)

and that there are exactly K! corner points in the capacity region, each one corresponding to a different successive cancellation order among the users.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 169 NTS

Page 170: Slides IT2 SS2012

In the equal power case (P1 = P2 = · · · = PK = P),

Csum = B log(1 + KP/σ²)

Observe that the sum capacity is unbounded as the number of users grows. In contrast, in the conventional CDMA receiver (decoding each user treating all the other users as noise), the sum rate will be only

B·K·log(1 + P/((K − 1)P + σ²))

which approaches

B·K·P/((K − 1)P + σ²)·log e ≈ B log e

as K → ∞. The growing interference is a limiting factor here!
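A short numerical illustration of the two growth behaviours (a sketch with assumed unit powers, log base 2): the optimal sum capacity keeps growing like log K, while the CDMA sum rate saturates near B·log₂e ≈ 1.44·B.

```python
# Sketch: optimal sum capacity vs. conventional CDMA sum rate as the number of users K grows.
import numpy as np

B, P, sigma2 = 1.0, 1.0, 1.0
for K in (1, 2, 4, 16, 64, 256):
    c_sum  = B * np.log2(1 + K * P / sigma2)
    c_cdma = B * K * np.log2(1 + P / ((K - 1) * P + sigma2))
    print(K, round(c_sum, 2), round(c_cdma, 2))
print("CDMA limit:", round(np.log2(np.e), 2))      # B*log2(e) for B = 1
```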

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 170 NTS

Page 171: Slides IT2 SS2012

The symmetric capacity can be defined as

Csym = max_{(R,R,...,R)∈C} R

It can be shown that in the equal power case (P1 = P2 = · · · = PK = P),

Csym = (B/K)·log(1 + KP/σ²)

This rate for each user can be obtained by orthogonal multiplexing where each user is allocated a fraction 1/K of the total degrees of freedom (for example, of the total bandwidth B).

Note that Csym = Csum/K.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 171 NTS

Page 172: Slides IT2 SS2012

Broadcast channels

Two-user broadcast AWGN channel:

Yk(i) = hk·X(i) + Zk(i),   k = 1, 2;   Zk(i) ∼ N_C(0, σ²)

where hk is the fixed complex channel gain corresponding to the kth user.

The broadcast case is often referred to as downlink.

The transmit power constraint: the average power of the transmit signal is P.

As in the multiple-access (uplink) channel case, we can define the capacity region C as the region of rates (R1, R2) at which both users can simultaneously reliably communicate.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 172 NTS

Page 173: Slides IT2 SS2012

Broadcast channels

We have just two single-user bounds:

Rk < B log(1 + P|hk|²/σ²),   k = 1, 2

For any k, this upper bound on Rk can be attained by using all the transmit power to communicate to user k (with the rate of the remaining user being zero). Thus, we have two extreme points:

R1 = B log(1 + P|h1|²/σ²),   R2 = 0
R2 = B log(1 + P|h2|²/σ²),   R1 = 0

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 173 NTS

Page 174: Slides IT2 SS2012

Rate region in the symmetric case |h1| = |h2|

Further, we can share the degrees of freedom (time and bandwidth) between the users in an orthogonal manner to obtain any rate pair on the line joining these two extreme points.

Hence, for the symmetric case of |h1| = |h2|, the capacity region is a triangle.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 174 NTS

Page 175: Slides IT2 SS2012

Rate region in the symmetric case |h1| = |h2|

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 175 NTS

Page 176: Slides IT2 SS2012

In the symmetric case |h1| = |h2| = |h|, the sum rate can be shown to be bounded by the single-user capacity:

R1 + R2 < B log(1 + P|h|²/σ²)

The latter conclusion follows from the triangular form of the capacity region.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 176 NTS

Page 177: Slides IT2 SS2012

As has already been mentioned, the rate pairs in the capacity region can be achieved by sharing the degrees of freedom (bandwidth and time) between the two users. What are the alternative ways to achieve the boundary of the capacity region?

The structure of the channel suggests an alternative, natural approach:

I Let the channel of user 2 be stronger than that of user 1 (|h1| < |h2|). Thus, if user 1 can successfully decode its data from Y1, then user 2 (which has a higher SNR) should also be able to decode the data of user 1 from Y2. Then, user 2 can subtract the data of user 1 from its received signal Y2 to better decode its own data; i.e., it can perform successive interference cancellation.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 177 NTS

Page 178: Slides IT2 SS2012

Consider the following transmission strategy that superposes the signals of two

users, much like in a spread-spectrum CDMA system. The transmitted signal is

the sum of two signals:

X (i) = X1(i) + X2(i)

where Xk(i) is the signal intended for user k .

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 178 NTS

Page 179: Slides IT2 SS2012

Superposition coding

Weaker user 1 decodes its own signal by treating the signal for user 2 as noise. Stronger user 2 performs successive interference cancellation: it first decodes the data of user 1 by treating X2 as noise, subtracts the so-determined signal of user 1 from Y2, and then extracts its own data. As a result, for any possible power split P = P1 + P2, the following rate pair can be achieved

R1 = B log(1 + P1|h1|²/(P2|h1|² + σ²))
R2 = B log(1 + P2|h2|²/σ²)

This strategy is commonly referred to as superposition coding.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 179 NTS

Page 180: Slides IT2 SS2012

Orthogonal scheme

On the other hand, in orthogonal schemes, for any power split P = P1 + P2 and degree-of-freedom split α ∈ [0, 1], the following rates are jointly achieved

R1 = α·B log(1 + P1|h1|²/(α·σ²))
R2 = (1 − α)·B log(1 + P2|h2|²/((1 − α)·σ²))

Here, α can be interpreted, for example, as the fraction of bandwidth assigned to user 1 (both the bandwidth B and the noise power are then reduced by the factor α). Alternatively, α can be interpreted as the fraction of time assigned to user 1 (user 1 transmits only during a fraction α of the time, so that its power budget P1 is concentrated into that fraction).
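For |h1| < |h2|, sweeping the split shows that superposition coding dominates the orthogonal scheme. A minimal sketch (assumed gains and powers, log base 2, using the same split for power and degrees of freedom):

```python
# Sketch: broadcast rate pairs with superposition coding vs. an orthogonal (bandwidth-splitting)
# scheme, for |h1| < |h2|. Assumed parameters; the degree-of-freedom split equals the power split.
import numpy as np

B, P, sigma2 = 1.0, 10.0, 1.0
h1, h2 = 0.5, 1.0                                   # user 1 is the weaker user
for a in np.linspace(0.1, 0.9, 5):                  # a = share of resources given to user 1
    P1, P2 = a * P, (1 - a) * P
    r1_sc = B * np.log2(1 + P1 * h1**2 / (P2 * h1**2 + sigma2))       # superposition coding
    r2_sc = B * np.log2(1 + P2 * h2**2 / sigma2)
    r1_or = a * B * np.log2(1 + P1 * h1**2 / (a * sigma2))            # orthogonal scheme
    r2_or = (1 - a) * B * np.log2(1 + P2 * h2**2 / ((1 - a) * sigma2))
    print(round(r1_sc, 2), round(r2_sc, 2), "|", round(r1_or, 2), round(r2_or, 2))
```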

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 180 NTS

Page 181: Slides IT2 SS2012

Rate region in the symmetric case |h1| = |h2| = |h|

Assume that superposition coding is used and that the power is split such that P1 + P2 ≤ P. In this case, if user 1 can decode its data treating the data of user 2 as noise, then user 2 can also decode the data of user 1, subtract it from its received signal, and decode its own data. Hence, the following rate pairs are supported:

R1 ≤ B log(1 + P1|h1|²/(P2|h1|² + σ²)) = B log(1 + (P1 + P2)|h1|²/σ²) − B log(1 + P2|h1|²/σ²)
R2 ≤ B log(1 + P2|h2|²/σ²)

Thus, for |h1| = |h2| = |h| and the power constraint P1 + P2 ≤ P, the sum capacity is given by

R1 + R2 ≤ B log(1 + P|h|²/σ²)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 181 NTS

Page 182: Slides IT2 SS2012

Rate region in the general case |h1| ≤ |h2|

Solid line: optimal power split using superposition coding.

Dashed line: optimal degrees of freedom split using orthogonal coding.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 182 NTS

Page 183: Slides IT2 SS2012

In the K-user broadcast case, the boundary of the capacity region can be proved to be given by

Rk = log(1 + Pk|hk|²/(σ² + (Σ_{l=k+1}^{K} Pl)·|hk|²)),   k = 1, ..., K

for all possible power splits P = Σ_{k=1}^{K} Pk of the total power at the base station.

The optimal points are achieved by superposition coding and successive interference cancellation at the receivers. The cancellation order at every receiver should always be to decode the weaker users' data before decoding its own data.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 183 NTS

Page 184: Slides IT2 SS2012

Fading channels

Until now, all multiuser channels have been considered without random channel fading.

Let us now include fading in the signal model. The availability of channel state information becomes a critical issue in such cases.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 184 NTS

Page 185: Slides IT2 SS2012

Multiple-access fading channels

K-user multiple-access fading channel:

Y(i) = Σ_{k=1}^{K} hk(i)·Xk(i) + Z(i)

where {hk(i)} is the random fading process of user k.

We assume that

E{|hk(i)|²} = 1,   k = 1, ..., K

and that the fading processes of different users are i.i.d.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 185 NTS

Page 186: Slides IT2 SS2012

Slow fading

The time-scale of communication is short relative to the channel coherence time of all users. Hence, hk(i) = hk for all i and k.

Suppose all users transmit at the rate R. Conditioned on each realization of h1, ..., hK, we have the standard multiple-access AWGN channel with the received SNR of user k equal to |hk|²P/σ². If the symmetric capacity is less than R, then this results in outage. Using the expressions for the K-user capacity region, the outage probability can be written as

pout = Pr{ B log(1 + SNR·Σ_{k∈S} |hk|²) < |S|·R for some S ⊂ {1, ..., K} }

where |S| denotes the cardinality of S and SNR = P/σ².
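This probability can be estimated by Monte Carlo simulation over the 2^K − 1 subsets, as in the following sketch (assumed i.i.d. Rayleigh fading and example parameters, log base 2).

```python
# Sketch: Monte Carlo estimate of the slow-fading MAC outage probability
# p_out = Pr{ B*log2(1 + SNR*sum_{k in S} |h_k|^2) < |S|*R  for some subset S }.
import itertools
import numpy as np

def mac_outage(K, R, snr, B=1.0, trials=20_000, seed=3):
    rng = np.random.default_rng(seed)
    subsets = [s for r in range(1, K + 1) for s in itertools.combinations(range(K), r)]
    outages = 0
    for _ in range(trials):
        h2 = rng.exponential(size=K)              # |h_k|^2 for i.i.d. Rayleigh fading
        if any(B * np.log2(1 + snr * h2[list(S)].sum()) < len(S) * R for S in subsets):
            outages += 1
    return outages / trials

print(mac_outage(K=2, R=1.0, snr=10.0))
```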

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 186 NTS

Page 187: Slides IT2 SS2012

Fast fading

Each hk(i) is modelled as a time-varying ergodic process.

The sum capacity in the fast fading case:

Csum = E{ B log(1 + Σ_{k=1}^{K} |hk|²·P/σ²) }

How does this compare to the sum capacity of the uplink channel without fading?

Let us use Jensen's inequality, which basically says that

E{f(X)} ≤ f(E{X})

for any concave function f(·) and random variable X.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 187 NTS

Page 188: Slides IT2 SS2012

Using this inequality, we obtain that

Csum = E{ B log(1 + Σ_{k=1}^{K} |hk|²·P/σ²) } ≤ B log(1 + E{Σ_{k=1}^{K} |hk|²}·P/σ²) = B log(1 + KP/σ²)

where the property E{|hk(i)|²} = 1 (k = 1, ..., K) has been used in the last step.

The last expression can be identified as the sum capacity of the AWGN multiple-access channel. Hence, without channel state information at the transmitter, fading can only hurt.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 188 NTS

Page 189: Slides IT2 SS2012

However, if the number of users K becomes large, then, by the law of large numbers,

Σ_{k=1}^{K} |hk|² ≈ K

and the penalty due to fading vanishes. Basically, the effect of fading is averaged out over a large number of users.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 189 NTS

Page 190: Slides IT2 SS2012

Let us now assume that we have full (possibly also non-causal) channel state information at both the transmitter and receiver sides.

Block-fading model:

Y(i) = Σ_{k=1}^{K} hk(i)·Xk(i) + Z(i)

where hk(i) = hk,l remains constant over the lth channel coherence period of Tc (Tc ≫ 1) symbols and is i.i.d. across different coherence periods.

The channel over L such coherence periods can be viewed as L parallel “sub-channels” which fade independently. Therefore, we can again use the water-filling philosophy.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 190 NTS

Page 191: Slides IT2 SS2012

For a given realization of the channel gains hk,l (k = 1, ..., K; l = 1, ..., L), the sum capacity is given by

max_{Pk,l} (B/L)·Σ_{l=1}^{L} log(1 + Σ_{k=1}^{K} Pk,l·|hk,l|²/σ²)

subject to Pk,l ≥ 0 (k = 1, ..., K; l = 1, ..., L) and the average power constraint

(1/L)·Σ_{l=1}^{L} Pk,l = P,   k = 1, ..., K

The solution to this optimization problem as L → ∞ yields the appropriate power allocation policy.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 191 NTS

Page 192: Slides IT2 SS2012

This leads to a variable-rate scheme: in the lth “sub-channel”, the rates dictated by the above optimization problem are used.

Optimal strategy: The sum rate in the lth “sub-channel”

B log(1 + Σ_{k=1}^{K} Pk,l·|hk,l|²/σ²)

for a given total power Σ_{k=1}^{K} Pk,l allocated to this “sub-channel” is maximized by giving all this power to the user with the strongest channel gain. That is, at each time only the one user with the best channel is allowed to transmit. Under this strategy, the multiuser channel for each time l reduces to a point-to-point channel with the channel gain

max_{k=1,...,K} |hk,l|²
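Ignoring the water-filling of power over time (i.e., simply using a fixed power P in every block), the effect of this best-user scheduling can be illustrated with a short sketch (assumed i.i.d. Rayleigh blocks, log base 2): the per-block sum rate grows with the number of users K.

```python
# Sketch: per-block sum rate when, in each coherence block, all power goes to the user with the
# strongest channel gain (fixed power per block; the true optimum also water-fills over time).
import numpy as np

def best_user_sum_rate(K, L, P=1.0, sigma2=1.0, B=1.0, seed=4):
    rng = np.random.default_rng(seed)
    h2 = rng.exponential(size=(L, K))              # |h_{k,l}|^2 for i.i.d. Rayleigh fading
    best = h2.max(axis=1)                          # strongest user in each block l
    return B / L * np.sum(np.log2(1 + P * best / sigma2))

for K in (1, 2, 4, 8, 16):
    print(K, round(best_user_sum_rate(K, L=10_000), 3))   # grows with K: multiuser diversity
```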

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 192 NTS

Page 193: Slides IT2 SS2012

Broadcast fading channels

K -user downlink fading channel:

Yk(i) = hk(i)X (i) + Zk(i), k = 1, ... , K

where {hk(i)} is the random fading process of user k .

Similar to the uplink case, we assume that

E{|hk(i)|²} = 1,   k = 1, ..., K

and that the fading processes of different users are i.i.d.

The transmit power is constrained to be equal to P.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 193 NTS

Page 194: Slides IT2 SS2012

Let us first consider the case when the channel state information is available only

at the receiver.

We have the following single-user bounds:

Rk < B·E{ log(1 + P|h|²/σ²) },   k = 1, ..., K

where h is a random channel gain.

For any k , this upper bound on Rk can be attained by using all the transmit power

to communicate to user k (with the rate to the remaining users being zero). Thus,

as in the non-fading case, we have K extreme points of the capacity region.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 194 NTS

Page 195: Slides IT2 SS2012

Similar to the non-fading case, it can be shown that the sum rate is also bounded by the same quantity

Σ_{k=1}^{K} Rk < B·E{ log(1 + P|h|²/σ²) }

This bound can be achieved by transmitting only to one user or by time-sharing between any number of users.

It can be shown that the rate pairs in the capacity region can be achieved by both

orthogonal schemes and superposition coding.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 195 NTS

Page 196: Slides IT2 SS2012

Let us now consider the case when the channel state information is available both

at the transmitter and receiver.

Let us focus on the sum capacity. As in the uplink case, it can be shown that the

sum capacity is achieved by transmitting only to the best user at each time. Under

this strategy, the downlink channel reduces to a point-to-point channel with the

channel gain

max_{k=1,...,K} |hk|²

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 196 NTS

Page 197: Slides IT2 SS2012

Multiuser diversity

We have seen that, in the full channel state information case and from the sum capacity perspective, the optimal strategy both in the uplink and in the downlink reduces the multiuser case to the single-user (point-to-point) case with the fading magnitude max_k |hk(i)|. Compared to a system with a single user, the multiuser diversity gain comes from:

I the increase of the total transmit power in the uplink case;

I the improvement of the effective channel gain at time i from |hk(i)|² to max_{k=1,...,K} |hk(i)|².

The second effect appears entirely due to the ability to dynamically schedule resources among the users as a function of the channel state.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 197 NTS

Page 198: Slides IT2 SS2012

Remarks

I The multiuser diversity gain comes from the following effect: when many users fade independently, at any time there is a high probability that one of them has a strong channel. By allowing only that user to transmit (or, vice versa, transmitting only to that user), the shared channel resource is used in the most efficient manner and the total throughput is maximized.

I The larger the number of users, the higher the multiuser diversity gain.

I The amount of multiuser diversity gain depends critically on the tail of the distribution of |hk|²: the heavier the tail, the more likely there is a user with a strong channel, and the larger the multiuser diversity gain.
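The growth of the effective channel gain with the number of users can be checked numerically, as in the sketch below (assumed i.i.d. Rayleigh fading, so |hk|² is exponential and E{max_k |hk|²} equals the K-th harmonic number ≈ ln K + 0.577).

```python
# Sketch: the effective channel gain max_k |h_k|^2 grows (roughly like ln K) with the number of
# users K under i.i.d. Rayleigh fading -- this is the multiuser diversity gain.
import numpy as np

rng = np.random.default_rng(5)
for K in (1, 2, 4, 8, 16, 32, 64):
    h2 = rng.exponential(size=(50_000, K))                 # |h_k|^2 samples for K users
    print(K, round(h2.max(axis=1).mean(), 2), round(np.log(K) + 0.5772, 2))  # simulated vs ln K + gamma
```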

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 198 NTS

Page 199: Slides IT2 SS2012

System requirements to extract the multiuser diversity benefits

I the base station has to have access to the channel quality of each user:
  I in the downlink, each user has to track its own channel SNR and feed the channel quality back to the base station;
  I in the uplink, the base station has to track the channel quality (SNR) of each user.

I the base station has to schedule transmissions among the users as well as adapt the data rate as a function of the instantaneous channel quality.

Such a scheduling procedure is often called opportunistic scheduling.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 199 NTS

Page 200: Slides IT2 SS2012

Fairness and delay

I In reality, the fading statistics of different users may be non-symmetric: some users are closer to the base station and have a better average SNR; some users are stationary (non-moving) or have no scatterers around them.

I The multiuser diversity strategy is only concerned with maximizing long-term average throughputs. In practice, there are latency requirements, that is, the average throughput over the tolerable delay is the performance metric of interest.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 200 NTS

Page 201: Slides IT2 SS2012

Channel measurement and feedback

I All scheduling decisions are done as a function of user channel states. Hence,

the quality of channel estimation is a primary issue, and feedback from the

users to the base station is needed in the downlink case.

I Both the error in channel measurement and the delay/error in feeding the

channel state back are significant bottlenecks of practical applications of the

multiuser diversity strategy.

Slow or limited fading:

I We have observed that the use of the multiuser diversity strategy requires the fading to be rich and fast. It is not useful in line-of-sight scenarios or in cases with little scattering or slowly changing environments.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 201 NTS

Page 202: Slides IT2 SS2012

Proportional fair downlink scheduling

I The scheduler keeps track of the average throughput Tk(i) (k = 1, ..., K) of each user over some (e.g., exponentially weighted) time-window of length tW.

I In the ith time-slot, the base station receives the requested/supportable rates Rk(i) (k = 1, ..., K) from all users, and transmits to the user k* with the largest ratio γ = Rk(i)/Tk(i).

I The average throughputs are updated as:

Tk(i+1) = (1 − 1/tW)·Tk(i) + Rk(i)/tW   for k = k*
Tk(i+1) = (1 − 1/tW)·Tk(i)              for k ≠ k*

This algorithm is used in the downlink mode of the 3G system IS-856.
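A minimal sketch of this scheduler (assumed parameters; the supportable rates are simulated here rather than fed back by real users) is given below; with fading statistics of the same shape, the users end up being scheduled a comparable fraction of the slots.

```python
# Sketch of the proportional fair scheduler update described above (assumed parameters).
import numpy as np

def proportional_fair(rates, t_w=100.0):
    """rates: array of shape (num_slots, K) holding the supportable rates R_k(i)."""
    num_slots, K = rates.shape
    T = np.full(K, 1e-6)                      # average throughputs (tiny init avoids division by zero)
    share = np.zeros(K)                       # fraction of slots given to each user
    for i in range(num_slots):
        k_star = int(np.argmax(rates[i] / T)) # user with the largest R_k(i)/T_k(i)
        share[k_star] += 1
        T = (1 - 1 / t_w) * T                 # exponentially weighted window update
        T[k_star] += rates[i, k_star] / t_w
    return T, share / num_slots

rng = np.random.default_rng(6)
# assumed example: user 0 has twice the average supportable rate of user 1
R = np.column_stack([2 * rng.exponential(size=5000), rng.exponential(size=5000)])
print(proportional_fair(R))
```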

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 202 NTS

Page 203: Slides IT2 SS2012

Combination of multiuser diversity and superposition coding

I Divide the users into several classes (say, two classes, depending on whether they are near the base station or near the cell edge). Then, users in each class have statistically comparable channel strengths.

I Users whose current channel is instantaneously strongest in their own class are scheduled for simultaneous transmission using superposition coding. Users of the “stronger” classes (e.g., nearby users) receive less power, still enjoying very good rates while minimally affecting the performance of the “weak” classes of users.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 203 NTS

Page 204: Slides IT2 SS2012

ADVANCES IN CHANNEL CODING

We have already discussed linear block channel codes in Information Theory I. Now, we will discuss cyclic codes as well as convolutional codes.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 204 NTS

Page 205: Slides IT2 SS2012

Cyclic codes

An important subclass of linear block codes.

Consider an n-tuple

c = [c0, c1, ... , cn−1]

Cyclically shifting the components of c, we have

c(1) = [cn−1, c0, ... , cn−2]

Using i subsequent cyclic shifts, we have

c(i) = [cn−i , cn−i+1, ... , cn−1, c0, c1, ... , cn−i−1]

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 205 NTS

Page 206: Slides IT2 SS2012

Definition: cyclic codes

An (n, k) linear block code C is called a cyclic code if every cyclic shift of any

codeword in C is also a codeword in C .

Properties:

I Linearity: the sum of any two codewords is also a codeword;

I Cyclic property: Any cyclic shift of any codeword is also a codeword.

To develop the theory of cyclic codes, let us treat the components of the

codeword c as the coefficients of the following polynomial:

c(X) = c0 + c1·X + ··· + cn−1·X^{n−1}

where X is an indeterminate.

The fact that all ci are binary is taken into account by using binary (modulo-2) arithmetic for all polynomial coefficients when operating with polynomials.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 206 NTS

Page 207: Slides IT2 SS2012

Cyclic codes

There is a one-to-one correspondence between the vector c and the polynomial c(X). We will call c(X) the code polynomial of c.

Each power of X in the polynomial c(X) represents a one-bit shift in time. Hence, multiplication of c(X) by X may be viewed as a shift to the right.

Key question: how to make such a shift cyclic?

Let c(X) be multiplied by X^i, yielding

X^i·c(X) = X^i·(c0 + c1·X + ... + cn−i−1·X^{n−i−1} + cn−i·X^{n−i} + ... + cn−1·X^{n−1})
         = c0·X^i + c1·X^{i+1} + ... + cn−i−1·X^{n−1} + cn−i·X^n + ... + cn−1·X^{n+i−1}
         = cn−i·X^n + ... + cn−1·X^{n+i−1} + c0·X^i + c1·X^{i+1} + ... + cn−i−1·X^{n−1}

where, in the last line, we have just rearranged the terms.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 207 NTS

Page 208: Slides IT2 SS2012

Cyclic codes

Recognizing, for example, that cn−i + cn−i = 0 in modulo-2 arithmetic, we can manipulate the first i terms as follows:

X^i·c(X) = cn−i + ... + cn−1·X^{i−1} + c0·X^i + c1·X^{i+1} + ... + cn−i−1·X^{n−1}
           + cn−i·(X^n + 1) + ... + cn−1·X^{i−1}·(X^n + 1)

Defining

c^(i)(X) ≜ cn−i + ... + cn−1·X^{i−1} + c0·X^i + c1·X^{i+1} + ... + cn−i−1·X^{n−1}
q(X) ≜ cn−i + cn−i+1·X + ... + cn−1·X^{i−1}

we can reformulate the first equation on this page in the following compact form

X^i·c(X) = q(X)·(X^n + 1) + c^(i)(X)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 208 NTS

Page 209: Slides IT2 SS2012

Cyclic codes

The polynomial c^(i)(X) can be recognized as the code polynomial of the codeword c^(i) obtained by applying i cyclic shifts to the codeword c.

Moreover, from the latter equation, we readily see that c^(i)(X) is the remainder that results from dividing X^i·c(X) by (X^n + 1).

Hence, we may formally state the cyclic property in polynomial notation as follows: if c(X) is a code polynomial, then the polynomial

c^(i)(X) = X^i·c(X) mod (X^n + 1)

is also a code polynomial for any cyclic shift i, where mod (X^n + 1) denotes taking the remainder after division by (X^n + 1).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 209 NTS

Page 210: Slides IT2 SS2012

Cyclic codes

Note that n cyclic shifts of any codeword do not change it, which means that X^n = 1, and hence X^n + 1 = 0, in modulo-(X^n + 1) arithmetic!

Generator polynomial: a polynomial g(X) of minimal degree that completely specifies the code and is a factor of X^n + 1. The degree of g(X) is equal to the number of parity-check bits of the code, n − k.

It can be shown that any cyclic code is uniquely determined by its generator polynomial, in that each code polynomial in the code can be expressed in the form of a polynomial product as follows:

c(X) = a(X)·g(X)

where a(X) is a polynomial of degree at most k − 1.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 210 NTS

Page 211: Slides IT2 SS2012

Cyclic codes

Given the generator polynomial g(X), we want to encode the message [m0, ..., mk−1] in an (n, k) systematic form. The codeword structure is

[b0, b1, ..., bn−k−1, m0, m1, ..., mk−1]

Define the message-bit and parity-bit polynomials as

m(X) ≜ m0 + m1·X + ... + mk−1·X^{k−1}
b(X) ≜ b0 + b1·X + ... + bn−k−1·X^{n−k−1}

We want the code polynomial to be of the form

c(X) = b(X) + X^{n−k}·m(X)

This means that b0, ..., bn−k−1 occupy the first n − k positions of each codeword, whereas the message bits start from the (n − k + 1)st position.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 211 NTS

Page 212: Slides IT2 SS2012

Cyclic codes

Using the equation c(X) = a(X)·g(X) yields

a(X)·g(X) = b(X) + X^{n−k}·m(X)

Equivalently,

X^{n−k}·m(X)/g(X) = a(X) + b(X)/g(X)

which means that b(X) is the remainder left over after dividing X^{n−k}·m(X) by g(X).

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 212 NTS

Page 213: Slides IT2 SS2012

Example: A (7,4) cyclic code

We start with the polynomial X^7 − 1 and factorize it into three irreducible polynomials as

X^7 − 1 = (1 + X)·(1 + X^2 + X^3)·(1 + X + X^3)

where by an irreducible polynomial we mean a polynomial that cannot be factored into polynomials of lower degree with binary coefficients.

Let us take

g(X) = 1 + X + X^3

as the generator polynomial, whose degree is equal to the number of parity bits.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 213 NTS

Page 214: Slides IT2 SS2012

Example: A (7,4) cyclic code

We can also define a parity-check polynomial

h(X) = 1 + Σ_{i=1}^{k−1} hi·X^i + X^k

such that

g(X)·h(X) = X^n + 1

or, equivalently,

g(X)·h(X) mod (X^n + 1) = 0

For our example, the parity-check polynomial is

h(X) = 1 + X + X^2 + X^4

so that h(X)·g(X) = (1 + X + X^2 + X^4)·(1 + X + X^3) = X^7 + 1.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 214 NTS

Page 215: Slides IT2 SS2012

Example: A (7,4) cyclic code

How to encode, for example, the message sequence 1001?

The corresponding message polynomial is

m(X) = 1 + X^3

Multiplying m(X) by X^{n−k} = X^3, we have

X^{n−k}·m(X) = X^3 + X^6

Dividing X^{n−k}·m(X) by g(X), we have

(X^3 + X^6)/(1 + X + X^3) = X + X^3 + (X + X^2)/(1 + X + X^3)

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 215 NTS

Page 216: Slides IT2 SS2012

Example: A (7,4) cyclic code

That is,

a(X) = X + X^3,   b(X) = X + X^2

and the encoded message is

c(X) = b(X) + X^{n−k}·m(X) = X + X^2 + X^3·(1 + X^3) = X + X^2 + X^3 + X^6

or, alternatively,

c = [0111001]
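The same systematic encoding step can be carried out programmatically via polynomial division over GF(2); the sketch below (with hypothetical helper names) reproduces the (7,4) example above.

```python
# Sketch of systematic cyclic encoding over GF(2): b(X) is the remainder of X^(n-k)*m(X) / g(X),
# and the codeword is [b | m]. Polynomials are coefficient lists [c0, c1, ...] (lowest power first).
def gf2_remainder(dividend, divisor):
    r = list(dividend)
    dg = len(divisor) - 1                              # degree of g(X)
    for i in range(len(r) - 1, dg - 1, -1):            # eliminate the highest powers first
        if r[i]:
            for j, d in enumerate(divisor):
                r[i - dg + j] ^= d                     # XOR = modulo-2 subtraction
    return r[:dg]

def cyclic_encode(m, g, n):
    shifted = [0] * (n - len(m)) + list(m)             # X^(n-k) * m(X)
    b = gf2_remainder(shifted, g)                      # parity bits b(X)
    return b + list(m)                                 # systematic codeword [b0..b_{n-k-1}, m0..m_{k-1}]

g = [1, 1, 0, 1]                                       # g(X) = 1 + X + X^3
print(cyclic_encode([1, 0, 0, 1], g, n=7))             # -> [0, 1, 1, 1, 0, 0, 1], i.e. 0111001
```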

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 216 NTS

Page 217: Slides IT2 SS2012

Relationship to conventional linear block codes

For the considered (7, 4) code, we can construct the generator matrix from the generator polynomial by using

g(X) = 1 + X + X^3
X·g(X) = X + X^2 + X^4
X^2·g(X) = X^2 + X^3 + X^5
X^3·g(X) = X^3 + X^4 + X^6

as the rows of the 4 × 7 generator matrix

G = [ 1 1 0 1 0 0 0
      0 1 1 0 1 0 0
      0 0 1 1 0 1 0
      0 0 0 1 1 0 1 ]

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 217 NTS

Page 218: Slides IT2 SS2012

Relationship to conventional linear block codes

Clearly, the latter generator matrix is in non-systematic form. We can put it in systematic form by row operations, that is, by adding the first row to the third row and adding the sum of the first two rows to the fourth row. Then, we get

G = [ 1 1 0 1 0 0 0
      0 1 1 0 1 0 0
      1 1 1 0 0 1 0
      1 0 1 0 0 0 1 ]

Decoding of cyclic codes can be done in the same way as for any other linear block codes, e.g., using the syndrome.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 218 NTS

Page 219: Slides IT2 SS2012

Popular cyclic codes are the so-called cyclic redundancy check (CRC) codes, Bose–Chaudhuri–Hocquenghem (BCH) codes, and non-binary Reed–Solomon (RS) codes. They are part of many international communication standards, e.g., the digital subscriber line (DSL) standards.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 219 NTS

Page 220: Slides IT2 SS2012

Convolutional codes

One of the most powerful classes of linear codes.

Similar to linear block codes, the encoder of a convolutional code accepts k-bit message blocks and produces an encoded sequence of n-bit blocks. However, each encoded block depends not only on the corresponding k-bit message block, but also on the M previous message blocks.

Such an encoder is said to have a memory order of M.

The ratio

R = k/n

is called the code rate.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 220 NTS

Page 221: Slides IT2 SS2012

Convolutional codes

The message sequence m = [m0, m1, m2, ...] enters the encoder one bit at a time. The encoder output sequences are obtained as the convolution of the input sequence with the encoder generator sequences. For an encoder with memory order M, the length of these sequences is M + 1. For example, in the case of two generator (impulse response) sequences,

g^(0) = [g^(0)_0, ..., g^(0)_M],   g^(1) = [g^(1)_0, ..., g^(1)_M]

we can write the encoding equations

c^(0) = m ∗ g^(0),   c^(1) = m ∗ g^(1)

where ∗ denotes the discrete convolution and all operations are modulo-2.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 221 NTS

Page 222: Slides IT2 SS2012

Convolutional codes

The convolution operation implies that

c^(j)_l = Σ_{i=0}^{M} m_{l−i}·g^(j)_i,   j = 0, 1

where m_{l−i} = 0 for all l < i.

After encoding, the output sequences are multiplexed into a single sequence called the codeword

c = [c^(0)_0, c^(1)_0, c^(0)_1, c^(1)_1, ...]

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 222 NTS

Page 223: Slides IT2 SS2012

Convolutional codes

Defining a matrix

G = [ g^(0)_0 g^(1)_0   g^(0)_1 g^(1)_1   · · ·   g^(0)_M g^(1)_M
                        g^(0)_0 g^(1)_0   g^(0)_1 g^(1)_1   · · ·   g^(0)_M g^(1)_M
                                          ·  ·  ·                                   ]

where all blank areas are zeros (each row is the previous one shifted by one n-bit block), we can rewrite the encoding equations in matrix form as

c = m·G

The form of this equation is equivalent to that of linear block codes! Therefore, we call G the generator matrix of the code.

In the case of a semi-infinite message sequence, the matrix G is semi-infinite as well. However, if m has finite length, then G becomes a finite matrix as well.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 223 NTS

Page 224: Slides IT2 SS2012

Example: R = 1/2 code

With the generator sequences:

g(0) = [1011]

g(1) = [1111]

Let the message sequence be

m = [10111]

Encoding equations yield

c(0) = [10111] ∗ [1011] = [10000001]

c(1) = [10111] ∗ [1111] = [11011101]

and, hence, the 2(k + M)-bit codeword

c = [11 01 00 01 01 01 00 11]

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 224 NTS

Page 225: Slides IT2 SS2012

Example: R = 1/2 code

Alternatively, we can write the k × 2(k + M) generator matrix as

G = [ 11 01 11 11 00 00 00 00
      00 11 01 11 11 00 00 00
      00 00 11 01 11 11 00 00
      00 00 00 11 01 11 11 00
      00 00 00 00 11 01 11 11 ]

and obtain the same codeword as

c = [10111]·G = [11 01 00 01 01 01 00 11]
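The same example can be reproduced in a couple of lines by direct modulo-2 convolution and multiplexing, as in the sketch below.

```python
# Sketch of the encoding above: modulo-2 convolution with each generator sequence, then
# multiplexing the two output streams into a single codeword.
import numpy as np

def conv_encode(m, g0, g1):
    c0 = np.convolve(m, g0) % 2                  # c(0) = m * g(0)  (mod 2)
    c1 = np.convolve(m, g1) % 2                  # c(1) = m * g(1)  (mod 2)
    return np.ravel(np.column_stack([c0, c1]))   # c = [c0_0, c1_0, c0_1, c1_1, ...]

m, g0, g1 = [1, 0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 1]
print(conv_encode(m, g0, g1))                    # -> 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 1
```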

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 225 NTS

Page 226: Slides IT2 SS2012

Code tree and trellis

Let us discuss the concepts of code tree and trellis using a particular example of

the R = 1/2 convolutional code with M = 2 and the impulse responses

g(0) = [111], g(1) = [101]

Consider the input sequence m = [10011]. Similar to the example above, it can be

shown that the codeword becomes

c = [11 10 11 11 01 01 11]

To enforce the R = 1/2 property, let us truncate the codeword by dropping the

last 2M = 4 bits (the effect of truncation becomes negligible if longer messages

and codewords are used). Then, the codeword becomes [11 10 11 11 01]

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 226 NTS

Page 227: Slides IT2 SS2012

Convolutional Encoder

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 227 NTS

Page 228: Slides IT2 SS2012

The code tree is defined as follows: each branch of the tree represents an input

symbol (0 or 1). The corresponding output (coded) symbols are indicated on each

branch. A specific path can be traced for each message sequence. The

corresponding coded symbols on the branches following this path form the output

sequence.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 228 NTS

Page 229: Slides IT2 SS2012

Code tree

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 229 NTS

Page 230: Slides IT2 SS2012

State diagram

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 230 NTS

Page 231: Slides IT2 SS2012

Trellis diagram

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 231 NTS

Page 232: Slides IT2 SS2012

Trellis diagram

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 232 NTS

Page 233: Slides IT2 SS2012

Trellis diagram

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 233 NTS

Page 234: Slides IT2 SS2012

Trellis diagram

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 234 NTS

Page 235: Slides IT2 SS2012

Trellis diagram

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 235 NTS

Page 236: Slides IT2 SS2012

Complexity of Viterbi Decoder

Over L binary intervals, the total number of comparisons made by the Viterbi algorithm is 2^{K−1}·L (where K = M + 1 is the constraint length), rather than the 2^L comparisons required by the standard maximum-likelihood procedure (full tree search).
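For the R = 1/2, M = 2 example code above, 2^{K−1} = 4 states suffice. The following is a minimal hard-decision Viterbi decoder sketch (assumed helper names, Hamming branch metric) that corrects a single flipped bit in the example codeword.

```python
# Minimal hard-decision Viterbi decoder sketch for the R = 1/2, M = 2 example code
# with g(0) = [111], g(1) = [101]; one survivor path per state, 2^(K-1) = 4 states.
import numpy as np

G0, G1 = (1, 1, 1), (1, 0, 1)

def encode_step(bit, state):                       # state = (previous bit, the bit before that)
    regs = (bit,) + state
    out = (sum(r * g for r, g in zip(regs, G0)) % 2,
           sum(r * g for r, g in zip(regs, G1)) % 2)
    return out, (bit, state[0])                    # encoder output and next state

def viterbi_decode(received_pairs):
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    metric = {s: (0 if s == (0, 0) else np.inf) for s in states}   # start in the all-zero state
    paths = {s: [] for s in states}
    for r in received_pairs:
        new_metric = {s: np.inf for s in states}
        new_paths = {s: [] for s in states}
        for s in states:
            for bit in (0, 1):                     # the two branches leaving state s in the trellis
                out, ns = encode_step(bit, s)
                m = metric[s] + (out[0] != r[0]) + (out[1] != r[1])   # Hamming branch metric
                if m < new_metric[ns]:             # keep only the best (survivor) path into ns
                    new_metric[ns], new_paths[ns] = m, paths[s] + [bit]
        metric, paths = new_metric, new_paths
    best = min(states, key=lambda s: metric[s])
    return paths[best]

# received sequence = codeword [11 10 11 11 01 01 11] with its 4th bit flipped
rx = [(1, 1), (1, 1), (1, 1), (1, 1), (0, 1), (0, 1), (1, 1)]
print(viterbi_decode(rx))                          # -> [1, 0, 0, 1, 1, 0, 0]: message 10011 plus tail bits
```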

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 236 NTS

Page 237: Slides IT2 SS2012

Probability of deviating from the correct path

Let a(d) denote the number of paths at Hamming distance d that deviate from, and then return to, the all-zero path. The error probability Pe of deviating from the correct path is then upper bounded by

Pe < Σ_{d=dF}^{∞} a(d)·Pd

where Pd denotes the probability that d bits are received in error and dF denotes the minimum free distance.

The inequality sign appears because the path error events are not mutually exclusive (union bound).

Pe depends critically on the minimum free distance dF!

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 237 NTS

Page 238: Slides IT2 SS2012

CONCLUSION

I We have studied advanced information theory, including the capacity characterization of multi-antenna and multi-user channels (and the resulting concept of multiuser diversity), as well as advanced channel coding approaches such as cyclic and convolutional codes.

I To apply these concepts and approaches in practice, or to do research in these fields, a deeper study is required.

19. April 2012 | NTS TU Darmstadt | Marius Pesavento | 238 NTS